Open Epidemiology Initiative

Billions have been spent trying to discover pharmaceutical treatments for dementia and mental illness. However, the effort has been a near-total failure to this point. This suggests that we may benefit from looking for an underlying cause and means of prevention.

Most people attribute depression and anxiety disorders primarily to life events or genetics. Dementia is generally assumed to have genetic origins and to be something we just have to accept as a fact of life.

However, there is epidemiological evidence that some factors in our environment and behavior have a massive influence on the development of mental illness and dementia.

Patient: It hurts when I do this.

Doctor: Then don’t do that.

Currently hundreds of millions of people are hurting because of things that they’re doing. The problem is we have no idea what they are. Discovering this is the goal of the Open Epidemiology Initiative.

Temporal Evidence

Temporal Evidence That We Are Doing Something That Makes Dementia and Mental Illness Worse

According to hospital discharge data, from 1990 to 2010 the incidence of autism, Alzheimer’s disease, celiac disease, sleep disorders, inflammatory bowel disease, and depression all roughly doubled or tripled.

We are a product of our genes and our environment, as are all diseases. The human genome didn’t start dramatically changing in 1990. So the increases must be attributed to one or more changes in our diets or environment, or a very powerful witch put a curse on the world.

The strongest correlation with the rise in diseases is the increase in the use of glyphosate weed killer on the majority of soy, wheat, and corn we consume. The above charts show that the increase in usage of this chemical closely mirrors the rise in the incidence of these diseases.

Correlation Does Not Equal Causation

Of course, correlation is not the same as causation. The rise could be influenced by many other factors as well. The only way to be confident in a causal relationship is through interventional experimentation.

However, we have limited resources available for this type of experimentation. So we need to prioritize which relationships are most likely to be worth further investigation. The presence of an observational correlation could be a prerequisite for devoting financial resources to more controlled studies. Conversely, the absence of a correlational relationship between an outcome and a factor suggests we should not devote limited research dollars to further exploration.

Given that nearly a billion people are suffering daily from all of these diseases combined, it’s extremely urgent that we collect and make publicly available data on the incidence of these diseases over time as well as data on all factors that could be exacerbating or improving them.

Absence of Correlation DOES Suggest Absence of Causation

Something caused the incidence of non-Hodgkin lymphoma (NHL), a cancer of the immune system, to quadruple from 1979 to 2011.

In 2015, the World Health Organization’s International Agency for Research on Cancer classified glyphosate as “probably carcinogenic to humans.” Thousands of people have sued Monsanto based on the belief that exposure to the herbicide caused their non-Hodgkin lymphoma. It’s impossible to know the precise cause of any given case of cancer. However, the fact that this type of cancer rose steadily through the 1970s, when glyphosate was not widely used on crops, suggests that there are more significant causes for the societal increase in non-Hodgkin lymphoma.

Geographic Evidence

Geographic Evidence That We Are Doing Something That Makes Dementia and Mental Illness Worse

There are small areas of the world known as “Blue Zones” where the incidence of Alzheimer’s and autoimmune disease is almost non-existent. (Learn more about how Alzheimer’s and autoimmune disease are strongly linked.) Five such areas were located using epidemiological data, statistics, birth certificates, and other research. In these Blue Zones, people reach age 100 at 10 times the rate seen in the United States.

The people in these regions are not significantly genetically different from the rest of the world, but significant differences in lifestyle (such as diet) have been identified.

Lifespan even varies significantly from state to state within the same country.

Regulations Are Invaluable Natural Experiments

Macro-level epidemiological data combines the incidence of various diseases over time with data on population-level exposure to different drugs or food additives. This is how it was initially discovered that smoking caused lung cancer. With macro-level data, it’s even harder to distinguish correlation from causation. However, different countries often enact different policies that can serve as very useful natural experiments.

For instance, 30 countries have banned the use of glyphosate. If the rates of Alzheimer’s, autism, and depression declined in these countries and did not decline in the countries still using glyphosate, this would provide very powerful evidence regarding its effects. Unfortunately, there is no global database that currently provides easy access to the incidence of these conditions in various countries over time and the levels of exposure to various chemicals.
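The comparison described above can be sketched as a simple difference-in-differences calculation. This is a named technique swapped in for illustration, not a method the initiative specifies. The function names and all numbers below are hypothetical placeholders, not real incidence data, and the result is only meaningful if the two groups of countries would otherwise have followed similar trends.

```python
from statistics import mean


def mean_change(before, after):
    """Average rate after the policy change minus the average rate before it."""
    return mean(after) - mean(before)


def difference_in_differences(banned_before, banned_after,
                              control_before, control_after):
    """Change in the banning countries minus change in the comparison countries.

    A negative value means rates fell (or rose less) where the ban applied.
    """
    return (mean_change(banned_before, banned_after)
            - mean_change(control_before, control_after))


# Placeholder yearly rates per 100,000, invented purely for illustration.
banned_before = [10.0, 11.0, 12.0]
banned_after = [11.0, 10.0, 9.0]
control_before = [10.0, 11.0, 12.0]
control_after = [13.0, 14.0, 15.0]

effect = difference_in_differences(
    banned_before, banned_after, control_before, control_after
)
print(effect)
```

A real analysis would also need to account for differences in diagnostic practice and reporting between countries, which this sketch ignores.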

We need to collect more data to take advantage of these geographic natural experiments. With enough data, we could discover hidden factors reducing or contributing to chronic illnesses. Then by providing real-time decision support to individuals, we could apply these lessons learned to reduce chronic disease burden throughout the world.

Specific Aims

The CDC should provide a simple website where one can enter any communicable or non-communicable diseases in a search box and see a longitudinal chart of overlaid incidence and prevalence over time for the entered diseases.

Future Aims

  • Correlation Matrix Heat Map – This would reveal which of the entered diseases have the greatest correlation in rise and fall over time. Higher correlations suggest a greater likelihood of a shared underlying root cause of the increase or decrease in disease prevalence.
  • Comorbidity Heat Map – This would reveal which diseases most often co-occur in the same individuals.
  • Factor Correlation Matrix – This would allow one to select a disease and identify the population-level environmental, dietary, and treatment factors most highly correlated with the rise and fall of the disease for a given geography. Environmental data could be obtained from the Environmental Protection Agency, dietary data from the USDA, and treatment data from the FDA.
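As a rough illustration of the correlation matrix idea, the sketch below computes pairwise Pearson correlations between yearly series. The series names and values are hypothetical placeholders, and Pearson correlation is one possible choice of measure rather than one the aims above prescribe. The same function would work for a factor correlation matrix if factor series were added alongside the disease series.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(series):
    """Nested dict of pairwise correlations, keyed by series name."""
    names = list(series)
    return {a: {b: pearson(series[a], series[b]) for b in names} for a in names}


# Placeholder yearly values, invented purely for illustration.
series = {
    "disease_a": [1.0, 2.0, 3.0, 4.0],
    "disease_b": [2.0, 4.0, 6.0, 8.0],
    "disease_c": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(series)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Series that simply trend in the same direction over time will correlate strongly regardless of cause, so a high value here is a prompt for further study, not a finding.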

Existing Solutions

Currently, the CDC website provides a vast amount of data on many diseases. However, the data is fragmented across many separate sources, which makes it very difficult to study relationships between diseases.

1. Hospital Discharge Data

Most studies examining the prevalence of disease are currently based on hospital discharge data. One can get a sense of the trends in various diseases over time by looking at the hospital discharge diagnoses collected from hundreds of hospitals by the United States Centers for Disease Control and Prevention (CDC). These data are available for free download.

Raw data files are available from 1998 through 2010. Each data file contains thousands of discharge records collected from hospitals using a statistically random sampling procedure. The records contain information about the age, sex, race, geographic location, and diagnoses for each discharge. The diagnoses are recorded by the International Classification of Diseases, Ninth Revision (ICD-9) codes. Up to seven diagnostic codes can be recorded for each discharge, with the first listed being the primary reason for the hospital admission. Currently, making use of this data requires writing computer programs to query the data file for specific ICD codes for each year.
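The kind of program described above might look roughly like the following sketch. The record layout (a dict with `year` and `diagnoses` keys) is a hypothetical stand-in, since the real raw files need their own parsing step that is not shown here. The code 331.0 is the ICD-9 code for Alzheimer's disease; the sample records are invented for illustration.

```python
from collections import Counter

MAX_DIAGNOSES = 7  # up to seven diagnostic codes per discharge record


def count_discharges(records, icd9_prefix):
    """Count, per year, records mentioning a code and all records.

    Returns (matching, totals): two Counters keyed by year.
    """
    matching = Counter()
    totals = Counter()
    for record in records:
        year = record["year"]
        totals[year] += 1
        codes = record["diagnoses"][:MAX_DIAGNOSES]
        if any(code.startswith(icd9_prefix) for code in codes):
            matching[year] += 1
    return matching, totals


# Hypothetical parsed records, invented purely for illustration.
records = [
    {"year": 1998, "diagnoses": ["331.0", "401.9"]},
    {"year": 1998, "diagnoses": ["486"]},
    {"year": 1999, "diagnoses": ["250.00", "331.0"]},
]
matching, totals = count_discharges(records, "331.0")
print(dict(matching), dict(totals))
```

Matching on a prefix lets one query cover a whole family of codes, for example "331" for every code that begins with those digits.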

A rate of increase over time, as an estimate of prevalence, can be obtained for each particular diagnosis from the following quantities:

  • â is the normalized number of hospital discharges of a disease in a year;
  • a is the total number of the hospital discharge records of the disease in the year computed from the raw files;
  • T represents the total number of hospital discharge records in the sampled hospitals in that same year;
  • P is the total population in the US for that year.

Population estimates can be obtained from the CDC mortality database.
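Part of this normalization can be sketched in code. The block below computes only the sampled share a / T for each year and expresses it relative to the first year; it deliberately leaves P out, because how P enters the normalization would be an assumption. The function names and counts are hypothetical placeholders.

```python
def discharge_shares(matching, totals):
    """Share of sampled discharges with the diagnosis (a / T), per year."""
    shares = {}
    for year in sorted(totals):
        if totals[year] <= 0:
            raise ValueError(f"no discharge records for {year}")
        shares[year] = matching.get(year, 0) / totals[year]
    return shares


def relative_to_first_year(shares):
    """Each year's share divided by the share in the earliest year."""
    years = sorted(shares)
    base = shares[years[0]]
    if base == 0:
        raise ValueError("first-year share is zero, ratio is undefined")
    return {year: shares[year] / base for year in years}


# Placeholder counts, invented purely for illustration.
matching = {1998: 50, 1999: 75}     # a: records with the diagnosis
totals = {1998: 1000, 1999: 1000}   # T: all sampled records

shares = discharge_shares(matching, totals)
growth = relative_to_first_year(shares)
print(shares, growth)
```

Dividing by T guards against year-to-year changes in how many records were sampled, which would otherwise look like changes in the disease itself.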

The drawbacks of using hospital discharge data are:

  1. Difficult to Work With – The hospital discharge data is extremely denormalized and requires a lot of work to make it analyzable.
  2. Based on Changing Diagnostic Criteria – The hospital discharge data carries subjective biases. Hospital admission routines commonly change without any change in the prevalence of a disorder.

2. Institute for Health Metrics and Evaluation Global Health Data Exchange (GHDx)

The Institute for Health Metrics and Evaluation’s Global Health Data Exchange (GHDx) is the world’s most comprehensive catalog of surveys, censuses, vital statistics, and other health-related data.

Here’s an example of using their Global Burden of Disease (GBD) Compare tool to examine the disability-adjusted life years (DALYs) lost due to Alzheimer’s and depression over time: