This article is based on my entry into the DengAI competition on the DrivenData platform. I managed to score within the top 0.2% (14/9069 as of 02 Jun 2020). Some of the ideas presented here are designed strictly for competitions like this one and might not be useful IRL.
Before we start, I have to warn you that some parts might be obvious to more advanced data engineers, and that it's a very long article. You can read it section by section or just pick the parts that interest you.
First, we need to discuss the competition itself. DengAI's goal was (and still is, because the DrivenData administration decided to make it an "ongoing" competition, so you can join and try it yourself) to predict the number of dengue cases in a particular week based on weather data and location. Each participant was given a training dataset and a test dataset (but no validation dataset). The score is calculated using MAE (Mean Absolute Error), and the training dataset covers 28 years of weekly values for 2 cities (1456 weeks). The test data is smaller and spans 5 or 3 years, depending on the city.
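Since MAE decides the ranking, it helps to see how little there is to it: the average of the absolute differences between predicted and actual case counts. Below is a minimal sketch in plain Python. The function name and the weekly numbers are made up for illustration and are not taken from the competition data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all weeks."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up weekly case counts, for illustration only.
actual = [4, 5, 4, 3, 6]
predicted = [3, 5, 6, 3, 4]

# Absolute errors are 1, 0, 2, 0, 2, so the mean is 5 / 5.
mae = mean_absolute_error(actual, predicted)
print(mae)
```

Because every week contributes its error linearly, one badly missed outbreak week costs exactly as much as many small misses that add up to the same total.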
For those who don't know, dengue fever is a mosquito-borne disease that occurs in tropical and sub-tropical parts of the world. Because it's carried by mosquitoes, its transmission is related to climate and weather variables.
If we look at the training dataset it has multiple features:
City and date indicators:
NOAA’s GHCN daily climate data weather station measurements:
PERSIANN satellite precipitation measurements (0.25x0.25 degree scale):
NOAA’s NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale):
Satellite vegetation — Normalized difference vegetation index (NDVI) — NOAA’s CDR Normalized Difference Vegetation Index (0.5x0.5 degree scale) measurements:
Additionally, we have information about the number of total_cases each week.
It is easy to spot that for each row in the dataset we have multiple features describing similar kinds of data. There are four categories:
temperature
precipitation
humidity
ndvi (the four NDVI features refer to different points in the cities, so they are not exactly the same data)
Because of that, we should be able to remove some of the redundant data from the input. Of course, we cannot just pick one temperature feature at random. If we look at the temperature data alone, the features differ in range (min, avg, max) and even in type (mean dew point or diurnal).
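One common way to check whether two features in the same category really carry the same information is to look at how strongly they correlate. The sketch below is only an illustration of that idea, not necessarily the method used later in this article: the column names are hypothetical, the data is synthetic, and the 0.95 threshold is an arbitrary assumption.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (spread_x * spread_y)


def redundant_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs


# Synthetic two-year weekly series; names are hypothetical, not the real columns.
random.seed(0)
avg_c = [25 + 3 * math.sin(2 * math.pi * week / 52) + random.gauss(0, 0.5)
         for week in range(104)]
columns = {
    "avg_temp_c": avg_c,
    "avg_temp_k": [t + 273.15 for t in avg_c],  # same signal, different unit
    "diurnal_range_c": [random.uniform(5, 10) for _ in range(104)],
}

pairs = redundant_pairs(columns)
print(pairs)
```

Here the Celsius and Kelvin averages are flagged as a redundant pair because one is just a shifted copy of the other, while the diurnal range is not, which matches the intuition that a different type of temperature measurement should not be dropped blindly.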