In this third installment of the series “Pandas vs Spark” we will have a closer look at the programming languages and the implications of choosing one.

Originally I wanted to write a single article offering a fair comparison of Pandas and Spark, but it kept growing until I decided to split it up. This is the third part of the small series.

What to Expect

This third part of the series will focus on the programming languages Scala and Python. Spark itself is written in Scala with bindings for Python while Pandas is available only for Python.

Why Programming Languages matter

Of course programming languages play an important role, although their relevance is often misunderstood. Having the right programming language on your CV may eventually be one of the deciding factors for getting a specific job or project. This is a good example of how the relevance of programming languages can be misjudged, especially in the context of Data Science.

Don’t get me wrong, becoming an expert in a given programming language takes far more than a couple of weeks of coding. You need to get used not only to the syntax, but also to the language-specific idioms. It’s really like learning a foreign natural language, which takes more than knowing the words and the grammar (which in itself is already a huge task).

On the other hand, in certain areas like Data Science, methodology matters at least as much as knowing a specific programming language. I would prefer to hire a machine learning expert with profound knowledge of R for an ML project using Python over a Python expert with no knowledge of Data Science, and I bet most of you would agree. So from an expert’s point of view, the programming language on your CV doesn’t matter so much (at least it shouldn’t; I know that reality is different), as long as you know what’s going on under the hood and understand the scientific method of approaching a problem.

But things look quite different from a project’s point of view: when setting up a larger project and starting to write actual code, you eventually need to decide which programming language you want to use. This decision has many consequences, which you should be aware of. I will discuss many of them in this article, with a strong focus on Scala and Python as the natural programming languages for Spark and Pandas.

Python vs Scala

When comparing Spark and Pandas, we should also include a comparison of the programming languages supported by each framework. While Pandas is “Python-only”, you can use Spark with Scala, Java, Python and R with some more bindings being developed by corresponding communities.

Since choosing a programming language has serious direct and indirect implications, I’d like to point out some fundamental differences between Python and Scala. Going into more detail would probably fill a separate article of its own. I mainly take up this comparison because the original article I was referring to at the beginning also suggested that people should start using Scala (instead of Python), while I again propose a more differentiated view.

Type System

Let’s first look at the type systems: Both languages provide some simple built-in types like integers, floats and strings. The fundamental types in Scala also come in specific sizes, such as Short for a 16-bit integer and Double for a 64-bit floating-point number.
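To make the Python side of this concrete, here is a minimal sketch of the built-in types. The variable names are purely illustrative. It assumes the standard CPython interpreter, where int has arbitrary precision (there is no separate 16-bit or 32-bit integer type) and float is backed by a 64-bit C double.

```python
import struct

# Python's built-in types do not carry a size in their name.
count = 42        # int
ratio = 0.5       # float
name = "spark"    # str

# int has arbitrary precision, so it does not overflow at 16, 32 or 64 bits.
big = 2 ** 100

# float corresponds to a C double; struct reports its size in bytes.
double_size_bytes = struct.calcsize("d")

print(type(count).__name__, type(ratio).__name__, type(name).__name__)
print(big > 2 ** 63)
print(double_size_bytes)
```

In Scala you would instead pick the size explicitly by choosing between types such as Short, Int, Long, Float and Double.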

Both languages also offer classes with inheritance, although many details are really different.

There are two main differences between the type systems in Scala and in Python:

  • While Scala is a statically typed language (i.e. every variable and parameter has a fixed type, and the compiler rejects code that tries to use a wrong type), Python is dynamically typed (i.e. a single variable or parameter technically can accept any data type, although the code may assume specific types and therefore fail later during execution).
  • Due to the dynamically typed nature of Python, whether an object is suitable for a certain operation is often determined only by the methods it implements. Using a correct base class or inheritance often is not crucial, only the available methods are. This paradigm is called “Duck Typing”.
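Both points can be illustrated with a short Python sketch. The classes Duck and Robot and the function greet are hypothetical names made up for this example; they do not come from Pandas or Spark.

```python
class Duck:
    def speak(self) -> str:
        return "Quack"


class Robot:
    # No shared base class with Duck; it only happens to offer speak().
    def speak(self) -> str:
        return "Beep"


def greet(thing) -> str:
    # Duck typing: any object with a speak() method is accepted.
    return thing.speak()


# Dynamic typing: the same name can be rebound to values of different types.
value = 1
value = "one"

results = [greet(Duck()), greet(Robot())]

# An unsuitable argument is only detected at run time, when speak() is looked up.
try:
    greet(42)
    failed = False
except AttributeError:
    failed = True

print(results, failed)
```

In Scala, the equivalent of greet(42) would typically be rejected by the compiler before the program ever runs, because the parameter would be declared with a type that Int does not satisfy.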

These differences have a huge impact, as we will see later.

