When working with tabular data in data science, we usually reach for the Pandas library in Python. It is widely used for data exploration, analysis, munging and manipulation, which are the primary steps for understanding the data and getting it ready for a model to fit. The main disadvantage of pandas is that it becomes time-consuming when there is a large amount of data (big data).

Datatable aims to overcome these limitations of pandas and speed up EDA (exploratory data analysis). It is built by H2O.ai, one of the leading AI/ML companies in the world. Datatable is quite similar to pandas and to R's data.table library, it is well documented, and it works with Python 3.6+.

Advantages of Datatable

  • Supports null values, date-time and categorical types.
  • Efficient algorithms for sorting/grouping/joining.
  • Minimal data copying: filtering/sorting/grouping/joining operators use “rowindex” views to avoid unnecessary data copying.
  • Faster data access than pandas.
  • Easy conversion to another data-processing framework.

In this article, I’ll walk through using the datatable library on a large dataset.


Hands-On Guide to Datatable Library For Faster EDA