The goal of this repository is to create a standardized set of data on forecasts from teams making projections of cumulative and incident deaths and incident hospitalizations due to COVID-19 in the United States. This repository is the data source for the official CDC COVID-19 Forecasting page. This project to collect, standardize, visualize and synthesize forecast data has been led by the CDC-funded UMass-Amherst Influenza Forecasting Center of Excellence based at the Reich Lab, with contributions from many others.
This README provides an overview of the project. Additional specific links can be found in the list below:
- Interactive Visualization
- Ensemble model
- Processed forecast data
- Truth data
- Technical README with detailed submission instructions
We are grateful to the teams who have generated these forecasts. They have invested a huge amount of effort in a short period to operationalize these important real-time forecasts. The groups have graciously and courageously made their data publicly available under different terms and licenses. You will find the licenses (when provided) within the model-specific folders in the data-processed directory. Please consult these licenses before using these data to ensure that you follow the terms under which they were released.
All source code that is specific to this project, along with our d3-foresight visualization tool, is available under an open-source MIT license. We note that this license does NOT cover model code from the various teams (which may be available from them under other licenses) or model forecast data (available under specified licenses as described above).
Different groups are making forecasts at different times, and for different geographic scales. The specifications below were created by consulting with collaborators at CDC and looking at what models forecasting teams were already building.
What do we consider to be "gold standard" data? We will use the daily reports containing case and death data from the JHU CSSE group as the gold standard reference data for deaths in the US. We use the time-series version of the JHU data, which occasionally contains "revisions" of previous daily reports. Note that there are non-trivial differences (especially in daily incident death data) between the JHU data and another commonly used source, the New York Times. The team at UTexas-Austin is tracking this issue on a separate GitHub repository.
When will forecast data be updated? We will be storing new forecasts from each group as they are provided to us directly via pull requests. Teams are encouraged to submit data as often as they have it available, although we only support one upload per day. In general, "updates" to forecasts will not be permitted. Teams are responsible for checking that their forecasts are ready for public viewing upon submission. This can be done locally using our interactive visualization tool.
What locations will have forecasts? Currently, forecasts may be submitted for any US state or county, and for the US at the national level.
How will probabilistic forecasts be represented? Forecasts will be represented in a standard format using quantile-based representations of predictive distributions. We encourage all groups to make available the following 23 quantiles for each distribution (in R syntax): c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99). One goal of this effort is to create probabilistic ensemble forecasts, and having high-resolution component distributions will provide data to create better ensembles.
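For readers who do not work in R, the recommended quantile levels can be generated in Python as well. This is a minimal sketch that mirrors the R expression c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99); the function name is hypothetical and not part of this repository.

```python
def recommended_quantile_levels():
    """Return the 23 recommended quantile levels in increasing order."""
    # Build the 0.05, 0.10, ..., 0.95 grid by dividing integers so that
    # each level is the nearest float to its decimal value (no drift
    # from repeatedly adding 0.05).
    middle = [i / 20 for i in range(1, 20)]
    return [0.01, 0.025] + middle + [0.975, 0.99]


levels = recommended_quantile_levels()
print(len(levels))  # number of quantile levels
print(levels)
```

The list contains the median (0.5) and is symmetric around it, so central prediction intervals can be read off by pairing levels such as 0.025 and 0.975.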
What forecast targets will be stored? We will store forecasts for 1 through 20 week-ahead incident and cumulative deaths, 0 through 130 day-ahead incident hospitalizations, and 1 through 8 week-ahead incident reported cases. Please refer to the technical README for details on aligning targets with forecast dates.
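As a rough illustration of how many distinct targets these ranges imply, the sketch below enumerates them in Python. The label strings (for example "1 wk ahead inc death") and the function name are assumptions made for this example; the technical README remains the authoritative source for target naming.

```python
def forecast_targets():
    """Enumerate the stored forecast targets described above.

    The label format is an assumption for illustration only.
    """
    targets = []
    # 1 through 20 week-ahead incident and cumulative deaths
    for n in range(1, 21):
        targets.append(f"{n} wk ahead inc death")
        targets.append(f"{n} wk ahead cum death")
    # 0 through 130 day-ahead incident hospitalizations
    for n in range(0, 131):
        targets.append(f"{n} day ahead inc hosp")
    # 1 through 8 week-ahead incident reported cases
    for n in range(1, 9):
        targets.append(f"{n} wk ahead inc case")
    return targets


targets = forecast_targets()
print(len(targets))
```

Under these ranges there are 40 death targets, 131 hospitalization targets, and 8 case targets.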
Every Monday at 6pm ET, we will update our COVID Forecast Hub ensemble forecast and interactive visualization using the most recent forecast from each team, as long as it was submitted before 6pm ET on Monday and has a forecast_date of any day since the previous Tuesday. All models meeting these criteria will be considered for the ensemble. For inclusion in the ensemble, we additionally require that forecasts include a full set of 23 quantiles for each of the one through four week ahead forecasts of deaths and a full set of 7 quantiles for each of the one through four week ahead forecasts of cases (see the technical README for details), and that the 0.1 quantile (10th percentile) of the predictive distribution for a 1 week ahead forecast of cumulative deaths is not below the most recently observed data. Before the week of July 28, we also performed manual visual inspection checks to ensure that forecasts were in alignment with the ground truth data; this step is no longer a part of our weekly ensemble generation process. Details on which models were included each week in the ensemble are available in the ensemble metadata folder.
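The eligibility rules above can be sketched as a few small checks. This is an illustrative Python sketch, not the hub's actual validation code: all function and parameter names are hypothetical, timestamps are assumed to be naive datetimes already expressed in Eastern Time, and the set of required quantile levels is passed in by the caller rather than hard-coded.

```python
from datetime import date, datetime, time, timedelta


def in_submission_window(forecast_date, submitted_at, ensemble_monday):
    """Check the timing rule for a given ensemble Monday.

    forecast_date and ensemble_monday are dates; submitted_at is a naive
    datetime assumed to be in Eastern Time (an assumption of this sketch).
    """
    if ensemble_monday.weekday() != 0:
        raise ValueError("ensemble_monday must be a Monday")
    previous_tuesday = ensemble_monday - timedelta(days=6)
    deadline = datetime.combine(ensemble_monday, time(18, 0))  # 6pm
    return (previous_tuesday <= forecast_date <= ensemble_monday
            and submitted_at < deadline)


def has_required_quantiles(submitted_levels, required_levels, tol=1e-9):
    """Check that every required quantile level was submitted."""
    return all(
        any(abs(s - r) <= tol for s in submitted_levels)
        for r in required_levels
    )


def passes_cumulative_check(q10_one_week_cum_deaths, latest_observed):
    """The 0.1 quantile of the 1 week ahead cumulative death forecast
    must not be below the most recently observed cumulative deaths."""
    return q10_one_week_cum_deaths >= latest_observed


monday = date(2020, 8, 3)
print(in_submission_window(date(2020, 7, 28), datetime(2020, 8, 3, 17, 59), monday))
print(in_submission_window(date(2020, 7, 27), datetime(2020, 8, 3, 17, 59), monday))
```

In this example the first call satisfies both timing conditions, while the second fails because its forecast_date falls before the previous Tuesday.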
From April 13 to July 21, the ensemble was created by taking the arithmetic average of each prediction quantile for all eligible models for a given location. Starting on the week of July 28, we have instead used the median prediction across all eligible models at each quantile level.
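To make the difference between the two combination rules concrete, here is a minimal Python sketch that combines forecasts for a single location and target at each quantile level. The model names and values are invented for illustration, and the function is an assumption of this sketch rather than the hub's ensemble code.

```python
from statistics import mean, median


def ensemble_quantiles(model_forecasts, method="median"):
    """Combine per-model quantile forecasts at each quantile level.

    model_forecasts maps a model name to a dict of
    {quantile level: predicted value} for one location and target.
    method is "mean" (arithmetic average) or "median".
    """
    if not model_forecasts:
        raise ValueError("at least one eligible model is required")
    combine = {"mean": mean, "median": median}[method]
    # Only combine levels that every model provides.
    shared = set.intersection(*(set(f) for f in model_forecasts.values()))
    return {
        q: combine([f[q] for f in model_forecasts.values()])
        for q in sorted(shared)
    }


# Invented values for three hypothetical models.
forecasts = {
    "model_a": {0.1: 100, 0.5: 120, 0.9: 150},
    "model_b": {0.1: 90, 0.5: 110, 0.9: 140},
    "model_c": {0.1: 95, 0.5: 160, 0.9: 300},
}
mean_ens = ensemble_quantiles(forecasts, method="mean")
median_ens = ensemble_quantiles(forecasts, method="median")
print(mean_ens)
print(median_ens)
```

In this toy example, model_c's wide upper tail pulls the mean at the 0.9 level well above the other two models, while the median at that level stays at 150, which shows how the median is less sensitive to a single outlying component.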
Participating teams provide their forecasts in a quantile-based format. We have developed specifications that can be used to represent all of the forecasts in a simple, long-form data format. For details about these file format specifications, please see the technical README.
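As a rough picture of what "long-form" means here, the sketch below turns one predictive distribution into one row per quantile level and writes the rows as CSV. The column names, location code, target label, and values are all assumptions made for illustration; the technical README defines the actual required columns.

```python
import csv
import io


def to_long_rows(forecast_date, location, target, quantiles):
    """Return one row per quantile level (long form).

    Column names are hypothetical, chosen only for this example.
    """
    return [
        {
            "forecast_date": forecast_date,
            "location": location,
            "target": target,
            "type": "quantile",
            "quantile": level,
            "value": value,
        }
        for level, value in sorted(quantiles.items())
    ]


# Invented values, not real forecast data.
rows = to_long_rows(
    "2020-08-03", "US", "1 wk ahead cum death",
    {0.5: 160000, 0.025: 157000, 0.975: 164000},
)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Each additional target, location, or quantile level adds rows rather than columns, which keeps files from different teams structurally identical.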
The teams whose forecasts are currently standardized and in the repository are (with data reuse license):
- Auquan (CC-BY-4.0)
- BPagano (CC-BY-4.0)
- Caltech (CC-BY-4.0)
- Carnegie Mellon Delphi Group (CC-BY-4.0)
- CDDEP (CC-BY-4.0)
- CEID (CC-BY-4.0)
- Columbia University (Apache 2.0)
- Columbia University and University of North Carolina at Chapel Hill (Apache 2.0)
- CovidActNow (CC-BY-4.0)
- COVID-19 Simulator Consortium (CC-BY-4.0)
- Discrete Dynamical Systems (MIT)
- epiforecasts (MIT)
- GLEAM from Northeastern University (CC-BY-4.0)
- Georgia Institute of Technology (CC-BY-4.0)
- Georgia Institute of Technology - Center for Health and Humanitarian Systems (CC-BY-4.0)
- Google Cloud AI (other)
- IHME (CC-BY-NC-4.0)
- Institute of Business Forecasting (MIT)
- Iowa State University (CC-BY-4.0)
- Iowa State University and Peking University (CC-BY-4.0)
- IQVIA (CC-BY-4.0)
- Karlen Working Group (GPL-3.0)
- LANL (see license)
- LockNQuay (MIT)
- Imperial (CC-BY-NC-ND 4.0)
- John Burant (JCB) (CC-BY-4.0)
- Johns Hopkins Center for Systems Science and Engineering (CC-BY-4.0)
- Johns Hopkins ID Dynamics COVID-19 Working Group (MIT)
- Johns Hopkins University Applied Physics Laboratory (MIT)
- Johns Hopkins University Justin Lessler lab. Google Inc. (GPL-3.0)
- Massachusetts Institute of Technology - Covid Analytics (Apache 2.0)
- Massachusetts Institute of Technology - Covid Alliance (CC-BY-4.0)
- Massachusetts Institute of Technology - Critical Data (MIT)
- Microsoft Research Asia (CC-BY-4.0)
- Notre Dame - FRED (CC-BY-4.0)
- Notre Dame - mobility (CC-BY-4.0)
- Oliver Wyman (see license)
- PandemicCentral-USCounty (MIT)
- Predictive Science Inc (MIT)
- QJHong (CC-BY-4.0)
- Quantori (CC-BY-4.0)
- Robert Walraven (CC-BY-4.0)
- Rensselaer Polytechnic Institute and University of Washington (CC-BY-4.0)
- Snyder Wilson Consulting (CC-BY-4.0)
- Steve McConnell (CC-BY-4.0)
- STH (CC-BY-4.0)
- US Army Engineer Research and Development Center (ERDC) (see license)
- University of Arizona (CC-BY-NC-SA 4.0)
- University of California, Los Angeles (CC-BY-4.0)
- University of California Merced MESA Lab (CC-BY-4.0)
- University of California Santa Barbara (CC-BY-4.0)
- University of California San Diego (CC-BY-4.0)
- University of Chicago (CC-BY-NC-4.0)
- University of Chicago, Chattopadhyay Lab (MIT)
- University of Geneva / Swiss Data Science Center (see license)
- University of Massachusetts - Expert Model (MIT)
- University of Massachusetts - Mechanistic Bayesian model (MIT)
- University of Michigan (CC-BY-4.0)
- University of Southern California Data Science Lab (MIT)
- University of Texas-Austin (BSD-3)
- Texas Tech University (CC-BY-4.0)
- University of Virginia (CC-BY-4.0)
- Walmart Labs (CC-BY-4.0)
- Yu_Group-CLEP (CC-BY-4.0)
- YYG (MIT)
- COVIDhub ensemble forecast: this is a combination of the above models.
Participating teams must provide a metadata file.
Carefully curating these datasets into a standard format has taken a Herculean team effort. The following lists those who have helped out, in reverse alphabetical order:
- Nutcha Wattanachit (ensemble model, data processing)
- Serena Wang (data curation)
- Nicholas Reich (project lead, ensemble model, data processing)
- Evan Ray (ensemble model)
- Jarad Niemi (data processing and organization)
- Khoa Le (validation, automation)
- Ayush Khandelwal (architecture, data curation)
- Abdul Hannan Kanji (architecture, data curation)
- Katie House (visualization, validation, project management)
- Estee Cramer (data curation, ensemble model)
- Matt Cornell (validation, Zoltar integration)
- Andrea Brennen (metadata curation)
- Johannes Bracher (evaluation, data processing)