It has been some time since our last Python libraries roundup, so we are taking the opportunity to start the month of November with a fresh list.

Last time we at KDnuggets did this, editor and author Dan Clark split the vast array of Python data science related libraries into several smaller collections, including data science libraries, machine learning libraries, and deep learning libraries. While splitting libraries into categories is inherently arbitrary, this made sense at the time of the previous publication.

This time, however, we have split the collected open source Python data science libraries in two. This first post covers “data science, data visualization & machine learning,” and can be thought of as “traditional” data science tools covering common tasks. The second post, to be published next week, will cover libraries for use in building neural networks, and those for performing natural language processing and computer vision tasks.

Again, this separation and classification is arbitrary, in some instances more than others, but we have done our best to group tools together by intended use case, hoping this is most useful for readers.

The categories included in this post, which we see as covering the common data science libraries (those likely to be used by practitioners in the data science space for generalized, non-neural network, non-research work), are:

  • Data - libraries for the management, manipulation, and other processing of data
  • Math - while many libraries perform mathematical tasks, this small collection does so exclusively
  • Machine learning - self-explanatory; excludes libraries primarily meant for building neural networks or for automating machine learning processes
  • Automated machine learning - libraries that primarily function to automate processes related to machine learning
  • Data visualization - libraries that primarily serve a function related to visualizing data, as opposed to modeling, preprocessing, etc.
  • Explanation & exploration - libraries primarily for exploring and explaining models or data

Our list is made up of libraries that our team decided by consensus were representative of common and well-used Python libraries. Also, to be included, a library must have a GitHub repository. The categories are in no particular order, and neither are the libraries included within each. We contemplated ordering them by stars or some other metric, but decided against it in order to avoid explicitly placing any perceived value or importance on the libraries within. Their listing here, then, is purely random. Library descriptions are taken directly from the GitHub repositories, in some form or another.

Thanks to Ahmed Anis for contributing to the collection of this data, and to the rest of the KDnuggets staff for their input, insights, and suggestions.

Note that the visualization below, by Gregory Piatetsky, represents each library by type and plots it by stars and contributors, with its symbol size reflecting the relative number of commits the library has on GitHub.

Plotted by number of stars and number of contributors; relative size by number of contributors

So, without further ado, here are the 38 top Python libraries for data science, data visualization & machine learning, as best determined by KDnuggets staff.

