More #opendata highlights: X-rays and air quality

Monday, November 13, 2017

Because I've had a gap in blogging over the last few months, I thought I would ease back into things by highlighting public health-related data sets going back through Data is Plural, one edition at a time.

The October 4 edition of Data is Plural featured air quality data and chest X-rays from the EPA and the NIH, respectively:
Four decades of U.S. air quality. The Environmental Protection Agency collects air quality samples from thousands of monitoring stations across the country. The resulting datasets, which go back to the 1980s, are available as daily files, annual files, and via an API. The monitored pollutants include ozone, carbon monoxide, sulfur dioxide, nitrogen dioxide, particulate matter, volatile organic compounds, and more. You can also download daily Air Quality Index ratings and information about each monitoring station. Previously: Global air pollution datasets from Berkeley Earth (DIP 2017.03.22) and from the World Health Organization (DIP 2016.06.15). [h/t Swier Heeres]

Chest x-rays. Last week, the National Institutes of Health released a dataset containing more than 100,000 anonymized chest x-rays, from 30,000 patients, “including many with advanced lung disease.” For each image, the associated metadata includes the patient’s age, gender, and diagnosis labels. (The dataset’s authors used natural language processing to extract those labels from radiological reports; they estimate that fewer than 10% of the labels are incorrect.) Related: Andrew L. Beam’s list of medical datasets for machine learning. [h/t Chris Hamby]

This week in @CDCMMWR: Assessing Kenya's and Ghana's immunization information systems

Saturday, November 11, 2017

I'm trying to get back into blogging regularly by running a few manageable recurring features. Since I read CDC's MMWR every week and it often contains articles relevant to global health and/or data quality, I am going to try to feature articles of interest here.

This week's MMWR has an article on a recently revamped data quality assessment tool that is intended to measure immunization information systems in low- and middle-income countries. The WHO partnered with the CDC to develop updated assessment guidelines in 2014, as the original guidelines developed in 2001 were missing the mark. The article presents the results of using the updated assessment tool in Kenya in 2015 and in Ghana in 2016:
The availability, quality, and use of immunization data are widely considered to form the foundation of successful national immunization programs. Lower- and middle-income countries have used systematic methods for the assessment of administrative immunization data quality since 2001, when the World Health Organization (WHO) developed the Data Quality Audit methodology. WHO adapted this methodology for use by national programs as a self-assessment tool, the Data Quality Self-Assessment. This methodology was further refined by WHO and CDC in 2014 as an immunization information system assessment (IISA).
...
The experience gained from implementing assessments using updated IISA guidance in Kenya and Ghana provides an opportunity to inform other countries interested in best practices for assessing their data quality and creating actionable data quality improvement plans. Data quality improvement is important to provide the most accurate and actionable evidence base for future decision-making and investments in immunization programs. This review provides best practice experiences and recommendations for countries to use an IISA to assess data quality from national administrative structure down to the facility level. This methodology also meets the requirements for use by Gavi, the Vaccine Alliance, for monitoring national immunization data quality at a minimum interval of every 5 years in conjunction with funding decisions.
The issue also has articles on tobacco use and waterborne disease outbreaks in the U.S., including outbreaks linked to drinking water (which is scary, since most of us in the States take safe drinking water for granted).

More public data set highlights: Wildfires, vehicle safety, and water quality

Thursday, November 9, 2017

Because I've had a gap in blogging over the last few months, I thought I would ease back into things by highlighting public health-related data sets going back through Data is Plural, one edition at a time.

The October 11 edition of Data is Plural featured data sets on wildfires, vehicle safety, and the water quality of the San Francisco Bay:
Wildfires. “Monitoring Trends in Burn Severity (MTBS) is an interagency program whose goal is to consistently map the burn severity and extent of large fires across all lands of the United States”; the most recent release contains more than 20,000 fires from 1984 to 2015. You can explore the data online, or download it in bulk. For more recent data, see GeoMAC, which aims to map all current wildfires; NOAA’s Hazard Mapping System, which uses satellites to detect fire locations and smoke plumes; and NASA’s MODIS and VIIRS datasets, which provide satellite-based detections for the entire globe. Previously: National Fire Incident Reporting System, which also includes structure fires and vehicle fires (DIP 2016.07.20). [h/t Max Joseph]

Commercial vehicle safety. The Federal Motor Carrier Safety Administration helps to regulate the United States’ large trucks and passenger buses. The datasets available through its Safety Measurement System include a census of all regulated carriers, the results of safety inspections, and reported crashes. The crash files list the number of injuries and fatalities; the weather, light, and road conditions; the involved vehicle’s VIN and license plate number; and more. [h/t Dan Brady]

San Francisco Bay water. The U.S. Geological Survey has been measuring water quality in the San Francisco Bay for nearly 50 years. The agency recently published 210,826 of these measurements, collected from dozens of monitoring stations between April 1969 and December 2015. (It’s “one of the longest records of water-quality measurements in a North American estuary,” according to a recent academic article describing the data.) Each row specifies the measurement’s date, station, depth, temperature, and salinity; many rows include levels of chlorophyll, oxygen, nitrate, ammonium, and other matter.
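
For anyone who wants to poke at a file shaped like this, here is a minimal Python sketch of one way to summarize it, in this case averaging salinity per station. The column names (`date`, `station`, `depth`, `temperature`, `salinity`) and the three sample rows are my own assumptions for illustration; they are made-up values, not the actual USGS headers or measurements, so adjust them to match the real download.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows. The column names and values are assumptions
# for illustration only, not the actual USGS file layout or data.
SAMPLE = """date,station,depth,temperature,salinity
1969-04-10,18,1.0,14.2,28.1
1969-04-10,18,5.0,13.9,28.6
1969-04-10,21,1.0,14.8,26.0
"""


def mean_salinity_by_station(csv_text):
    """Return {station: mean salinity}, skipping rows with a blank salinity."""
    values = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["salinity"].strip():
            values[row["station"]].append(float(row["salinity"]))
    return {station: mean(v) for station, v in values.items()}


result = mean_salinity_by_station(SAMPLE)
for station, avg in sorted(result.items()):
    print(f"station {station}: mean salinity {avg:.2f}")
```

With the real file, you would read the text from disk instead of `SAMPLE`; the blank-value check matters because, per the description above, not every row includes every measurement.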

More public data set highlights: Puerto Rico's disaster recovery

Saturday, November 4, 2017

Because I've had a gap in blogging over the last few months, I thought I would ease back into things by highlighting public health-related data sets going back through Data is Plural, one edition at a time.

The October 18 edition of Data is Plural featured a data set with different metrics related to Puerto Rico's disaster recovery efforts:
Puerto Rico’s recovery. Since shortly after Hurricane Maria hit Puerto Rico, the territory’s government has been publishing a dashboard of recovery statistics. The website tracks a couple dozen metrics, including the percent of homes with electricity, number of people in shelters, and the number of open hospitals. For several of the main metrics, researcher Michael A. Johansson has been scraping daily figures from the dashboard and publishing them as a CSV file. Related: The Washington Post has been charting the recovery, and published a deep dive into the island’s ongoing power outages.

Public data set highlights: Deepwater Horizon and cardiovascular epidemiology

Friday, November 3, 2017

This week's Data is Plural newsletter features two health-related datasets: one with NOAA data on the effects of the Deepwater Horizon explosion and one on cardiovascular mortality from IHME at the University of Washington. Hooray epidemiology!
Deepwater Horizon’s effects. For years, the National Oceanic & Atmospheric Administration has been working to assess the damage done to natural resources by the April 2010 Deepwater Horizon explosion and oil spill. As part of that effort, they’ve collected and compiled several dozen related datasets, including toxicity studies, plankton samples, necropsies of stranded turtles, dolphin health assessments, and a “backyard boater” survey. [h/t Sebastian Kraus]

County-level cardiovascular deaths. Researchers at the University of Washington’s Institute for Health Metrics and Evaluation have estimated cardiovascular mortality rates for each U.S. county, for every year between 1980 and 2014. The findings, based on 32 million de-identified death records, population data from the Census, and other sources, are also broken down by particular disease (e.g., aortic aneurysm, ischemic stroke, etc.) and gender. Related: The researchers’ JAMA article describing their methodology and findings. Previously: The Global Burden of Disease dataset, published by the same institute (DIP 2016.07.27). [h/t Michael A. Rice, a teacher at Ingraham High School in Seattle]
Bonus: The newsletter also has a public data set on all the sexual assault allegations for recent high-profile cases, including Cosby, Weinstein, and Trump.