As the saying goes, a data scientist spends 90% of their time cleaning data and 10% complaining about it. The complaints range from data size, faulty distributions, null values, randomness and systematic errors in data capture to differences between train and test sets, and the list goes on and on.

One common bottleneck is sheer data size: either the data doesn’t fit into memory, or processing takes so long (on the order of several minutes) that the flow of pattern analysis breaks down. Data scientists are by nature curious people who want to identify and interpret patterns normally hidden from a cursory drag-and-drop glance. They need to wear multiple hats and make the data confess through repeated torture (read: iterations 😂).

During exploratory data analysis, even a minimal dataset such as the New York Taxi Fare dataset (https://www.kaggle.com/c/new-york-city-taxi-fare-prediction), with 6 columns covering ID, Fare, Time of Trip, Passengers and Location, can prompt questions such as:

1. How have the fares changed year over year?

2. Has the number of trips increased across the years?

3. Do people prefer traveling alone or with company?

4. Have short-distance rides increased as people have become lazier?

5. At what time of day and on which day of the week do people travel?

6. Have new hotspots emerged in the city recently, apart from the regular airport pickups and drops?

7. Are people taking more inter-city trips?

8. Has traffic increased, leading to higher fares or longer travel times for the same distances?

9. Are there clusters of pick-up and drop points, or areas that see high traffic?

10. Are there outliers in the data, e.g. 0 distance with a fare of $100+?

11. Does demand change during the holiday season, and do airport trips increase?

12. Is there any correlation between weather (e.g. rain or snow) and taxi demand?
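To make a couple of these questions concrete, here is a minimal sketch in plain Python of how question 1 (fares year over year) and question 10 (zero-distance, high-fare outliers) might be answered. The sample rows, the field names (`pickup_datetime`, `fare_amount`, `pickup`, `dropoff`), the timestamp format and the thresholds are all assumptions made for illustration, not the dataset's actual schema. Distance is not a column in the data, so it is derived here from coordinates with the haversine formula.

```python
import math
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows; field names and values are illustrative assumptions.
TRIPS = [
    {"pickup_datetime": "2013-05-01 08:15:00", "fare_amount": 10.0,
     "pickup": (40.7580, -73.9855), "dropoff": (40.7484, -73.9857)},
    {"pickup_datetime": "2013-11-20 18:40:00", "fare_amount": 20.0,
     "pickup": (40.7580, -73.9855), "dropoff": (40.7061, -74.0087)},
    {"pickup_datetime": "2014-02-03 09:05:00", "fare_amount": 30.0,
     "pickup": (40.7580, -73.9855), "dropoff": (40.6413, -73.7781)},
    # Same pickup and dropoff point but a very high fare: a likely outlier.
    {"pickup_datetime": "2014-07-14 22:30:00", "fare_amount": 120.0,
     "pickup": (40.7306, -73.9866), "dropoff": (40.7306, -73.9866)},
]


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def mean_fare_by_year(trips):
    """Question 1: average fare per calendar year of pickup."""
    totals = defaultdict(lambda: [0.0, 0])  # year -> [fare sum, trip count]
    for trip in trips:
        year = datetime.strptime(trip["pickup_datetime"], "%Y-%m-%d %H:%M:%S").year
        totals[year][0] += trip["fare_amount"]
        totals[year][1] += 1
    return {year: total / count for year, (total, count) in sorted(totals.items())}


def suspicious_trips(trips, min_fare=100.0, max_km=0.1):
    """Question 10: trips with a high fare but almost no distance covered."""
    return [
        trip for trip in trips
        if trip["fare_amount"] >= min_fare
        and haversine_km(trip["pickup"], trip["dropoff"]) <= max_km
    ]


if __name__ == "__main__":
    print(mean_fare_by_year(TRIPS))
    print(suspicious_trips(TRIPS))
```

On a dataset of realistic size, the same logic would more naturally be written as grouped aggregations on a DataFrame. The loop version here is only meant to show what each question asks of the data.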

Pandas on Steroids: Dask- End to End Data Science with python code