Open Epidemiology Initiative

Billions of dollars have been spent trying to discover pharmaceutical treatments for dementia and mental illness. So far, the effort has been a near-total failure. This suggests that we may benefit from looking for underlying causes and means of prevention.

Most people attribute depression and anxiety disorders primarily to life events or genetics. Dementia is generally assumed to have genetic origins and to be something we just have to accept as a fact of life.

However, there is epidemiological evidence that some factors in our environment and behavior have a massive influence on the development of mental illness and dementia.

Patient: It hurts when I do this.

Doctor: Then don't do that.

Currently, hundreds of millions of people are hurting because of things that they're doing. The problem is that we have no idea what those things are. Discovering them is the goal of the Open Epidemiology Initiative.

Temporal Evidence

Temporal Evidence That We Are Doing Something That Makes Dementia and Mental Illness Worse

Correlation Does Not Equal Causation

Of course, correlation is not the same as causation. The rise could be influenced by many other factors as well. The only way to be confident in a causal relationship is through interventional experimentation.

However, we have limited resources available for this type of experimentation, so we need to prioritize the relationships that are most likely to be worth further investigation. The presence of an observational correlation could be a prerequisite for devoting financial resources to more controlled studies. Conversely, the absence of a correlation between an outcome and a factor suggests we should not devote limited research dollars to further exploration.

Given that nearly a billion people are suffering daily from all of these diseases combined, it's extremely urgent that we collect and make publicly available data on the incidence of these diseases over time as well as data on all factors that could be exacerbating or improving them.

Absence of Correlation DOES Suggest Absence of Causation

Geographic Evidence

Specific Aims

The CDC should provide a simple website where one can enter any set of communicable or non-communicable diseases into a search box and see a single longitudinal chart overlaying the incidence and prevalence of those diseases over time.

Future Aims

  • Correlation Matrix Heat Map - This would reveal which of the entered diseases have the greatest correlation in rise and fall over time. Higher correlations suggest a greater likelihood of a shared underlying root cause of the increase or decrease of the disease prevalence.
  • Comorbidity Heat Map - This would reveal which diseases most often co-occur in the same individuals.
  • Factor Correlation Matrix - This would allow one to select a disease and identify the population-level environmental, dietary, and treatment factors most highly correlated with the rise and fall of that disease for a given geography. Environmental data could be obtained from the Environmental Protection Agency, dietary data from the USDA, and treatment data from the FDA.
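
As a rough illustration of the numbers behind such a correlation matrix, here is a minimal Python sketch that computes pairwise Pearson correlations between yearly disease rates. The function names, the disease labels, and the rates are all made up for illustration; they are not real data and not an existing CDC tool.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(series_by_disease):
    """Return a nested dict: matrix[a][b] is the correlation of a with b."""
    names = sorted(series_by_disease)
    return {
        a: {b: pearson(series_by_disease[a], series_by_disease[b]) for b in names}
        for a in names
    }


# Hypothetical yearly rates, invented purely for illustration.
toy_rates = {
    "disease_a": [1.0, 2.0, 3.0, 4.0],
    "disease_b": [2.0, 4.0, 6.0, 8.0],
    "disease_c": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(toy_rates)
```

Each cell of the matrix would become one colored square in the heat map. Two diseases that rise and fall together score near 1, and two that move in opposite directions score near -1.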

Existing Solutions

Currently the CDC website provides a vast amount of data on many diseases. However, the data are fragmented across disparate sources, which makes it very difficult to study relationships between diseases.

1. Hospital Discharge Data

Most studies examining the prevalence of disease are currently based on hospital discharge data. One can get a sense of the trends in various diseases over time by looking at the hospital discharge diagnoses collected from hundreds of hospitals by the United States Centers for Disease Control and Prevention (CDC). These data are available for free download.

Raw data files are available from 1998 through 2010. Each data file contains thousands of discharge records collected from hospitals using a statistically random sampling procedure. The records contain information about the age, sex, race, geographic location, and diagnoses for each discharge. The diagnoses are recorded as International Classification of Diseases, Ninth Revision (ICD-9) codes. Up to seven diagnostic codes can be recorded for each discharge, with the first listed being the primary reason for the hospital admission. Currently, making use of these data requires writing computer programs to query each year's data file for specific ICD codes.
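
To give a sense of what such a program involves, here is a minimal Python sketch that counts, per year, the discharge records listing a given ICD-9 code among their diagnoses. It assumes the raw files have already been parsed into dictionaries with hypothetical `year` and `diagnoses` keys, and that codes are stored without the decimal point (so 331.0, Alzheimer's disease, appears as `3310`). The real file layout is not described here, and the records below are invented for illustration.

```python
from collections import Counter


def count_discharges_by_year(records, icd9_prefix):
    """Count records per year where any diagnosis code starts with icd9_prefix."""
    counts = Counter()
    for record in records:
        # A record counts once, even if several of its codes match.
        if any(code.startswith(icd9_prefix) for code in record["diagnoses"]):
            counts[record["year"]] += 1
    return dict(counts)


# Invented records (not real discharge data); field names are assumptions.
toy_records = [
    {"year": 1998, "diagnoses": ["3310", "4019"]},
    {"year": 1998, "diagnoses": ["311"]},
    {"year": 1999, "diagnoses": ["4280", "3310"]},
    {"year": 1999, "diagnoses": ["3310"]},
]
alzheimers_by_year = count_discharges_by_year(toy_records, "3310")
```

Matching on any of the listed codes, rather than only the first, counts discharges where the disease was present but was not the primary reason for admission. Restricting the check to `record["diagnoses"][0]` would count primary diagnoses only.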

A rate of increase, as an estimate of prevalence, over time for each particular diagnosis can be obtained as follows:


  • â is the normalized number of hospital discharges for a disease in a given year;
  • a is the number of hospital discharge records for the disease in that year, computed from the raw files;
  • T is the total number of hospital discharge records from the sampled hospitals in that same year;
  • P is the total population of the US for that year.
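
As a sketch only, one reading consistent with these definitions is â = (a / T) × P: the disease's share of the sampled discharge records, scaled to the national population. This is an assumption, and the exact normalization may differ. The function name, argument names, and input values below are hypothetical.

```python
def normalized_discharges(disease_records, total_records, population):
    """Assumed normalization: (a / T) * P.

    disease_records -- a, sampled discharge records listing the disease
    total_records   -- T, all sampled discharge records in the same year
    population      -- P, total US population in that year
    """
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return disease_records / total_records * population


# Invented inputs for illustration only, not real figures.
toy_estimate = normalized_discharges(
    disease_records=250, total_records=1_000, population=1_000_000
)
```

Dividing by T first keeps the estimate comparable across years in which different numbers of records were sampled.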

Population estimates can be obtained from the CDC mortality database.

The drawbacks of using hospital discharge data are:

  1. Difficult to Work With - The hospital discharge data is extremely denormalized and requires a lot of work to make it analyzable.
  2. Based on Changing Diagnostic Criteria - There are subjective biases in the hospital discharge data. Hospital admission routines often change without any change in the prevalence of a disorder.

2. Institute for Health Metrics and Evaluation Global Health Data Exchange (GHDx)

The Institute for Health Metrics and Evaluation's Global Health Data Exchange (GHDx) is the world’s most comprehensive catalog of surveys, censuses, vital statistics, and other health-related data.

Here's an example of using their Global Burden of Disease (GBD) Compare tool to examine the disability-adjusted life years (DALYs) lost due to Alzheimer's and depression over time:

