If you think there’s a typo in the subtitle, think JLO instead (:

We all know Python’s popularity among data science (DS) practitioners has soared over the past few years, signalling both aspiring data scientists and organisations to favour Python over R in a snowballing dynamic.

One popular way of demonstrating the rise of Python is to plot the fraction of questions asked on Stack Overflow with the tag “pandas”, compared with “dplyr”:

But this graph also tells another story: all those new pandas users search Stack Overflow so much because pandas is really unintelligible. I’ll demonstrate that assertion later in this post with several examples of common operations that are straightforward in dplyr, yet in pandas would send most of us searching Stack Overflow.

I’m a big believer in using the right tool for the job (I’ve been writing Scala Spark for the last 6 months for our Java-based production environment). It has become common wisdom that, for most data scientists, data wrangling makes up most of the job.

It follows that dplyr users enjoy a productivity boost most of the time compared with pandas users. R has the edge in a wide range of other areas too: available IDEs, package management, data constructs and many others.

Those advantages are so numerous that covering them all in a single post would be impractical. Instead, I’ve started compiling all the advantages R has over Python in a dedicated GitHub repo. I plan on sharing them in small batches in a series of posts, starting with this one.

When enumerating the different reasons I try to avoid the following:

  1. Overly subjective comparisons, e.g. indentation vs curly braces for delimiting function bodies.
  2. Issues that one can get used to after a while, like Python indexing (though the fact that it starts from 0, or that object[0:2] returns only the first 2 elements, still throws me off occasionally).
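
For readers coming from R, the indexing behaviour mentioned in the second item can be sketched as follows. The list `letters` is a made-up example, not something from the repo.

```python
# Python sequences are 0-indexed, and a slice excludes its stop index.
letters = ["a", "b", "c", "d"]  # hypothetical example data

first = letters[0]        # the first element lives at index 0
first_two = letters[0:2]  # takes indices 0 and 1; index 2 is excluded

print(first)
print(first_two)
```

In R, by contrast, indexing is 1-based and ranges are inclusive, so `x[1:2]` returns the first two elements.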

What do I hope to accomplish by this, you might ask? I hope that:

  1. Organisations realise the value in working with R and choose to use it alongside or instead of Python more often.
  2. As a result, the Python developer community becomes aware of the ways it can improve its DS offering and acts on them. Python has already borrowed some great concepts from R (e.g. data frames, the factor data type, pydatatable, ggplot); it’s important that it does so even more, so we get to enjoy them when we have to work with Python.

Now, I’m not arguing that R is preferable to Python in every imaginable scenario. I just hope that becoming aware of the advantages R has to offer will encourage organisations to consider using it more often.


Don’t Be Fooled by the Hype Python’s Got