Step-by-Step Guide to Creating R and Python Libraries (in JupyterLab)

Using Docker and JupyterLab to create R and Python Libraries.

R and Python are the bread and butter of today’s machine learning languages. R provides powerful statistics and quick visualizations, while Python offers an intuitive syntax, abundant support, and is the interface of choice for today’s major AI frameworks.

In this article we’ll look at the steps involved in creating libraries in R and Python. This is a skill every machine learning practitioner should have in their toolbox. Libraries help us organize our code and share it with others, offering packaged functionality to the data community.

NOTE: In this article I use the terms “library” and “package” interchangeably. While some people differentiate these words I don’t find this distinction useful, and rarely see it done in practice. We can think of a library (or package) as a directory of scripts containing functions. Those functions are grouped together to help engineers and scientists solve challenges.

THE IMPORTANCE OF CREATING LIBRARIES

Building today’s software doesn’t happen without extensive use of libraries. Libraries dramatically cut down the time and effort required for a team to bring work to production. By leveraging the open source community, engineers and scientists can bring their unique contributions to a larger audience and improve the quality of their code. Companies of all sizes build their work on top of existing functionality, making product development more productive and focused.

But creating libraries isn’t just for production software. Libraries are critical to rapidly prototyping ideas, helping teams validate hypotheses and craft experimental software quickly. While popular libraries enjoy massive community support and a set of best practices, smaller projects can be converted into libraries overnight.

By learning to create lighter-weight libraries we develop an ongoing habit of maintaining code and sharing our work. Our own development is sped up dramatically, and we anchor our coding efforts around a tangible unit of work we can improve over time.

ARTICLE SCOPE

In this article we will focus on creating libraries in R and Python as well as hosting them on, and installing from, GitHub. This means we won’t look at popular hosting sites like CRAN for R and PyPI for Python. These are extra steps that are beyond the scope of this article.

Focusing only on GitHub helps encourage practitioners to develop and share libraries more frequently. CRAN and PyPI have a number of criteria that must be met (and they change frequently), which can slow down the process of releasing our work. Rest assured, it is just as easy for others to install our libraries from GitHub. Also, the steps for CRAN and PyPI can always be added later should you feel your library would benefit from a hosting site.

We will build both R and Python libraries using the same environment (JupyterLab), with the same high-level steps for both languages. This should help you build a working knowledge of the core steps required to package your code as a library.

Let’s get started.

SETUP

We will be creating a library called datapeek in both R and Python. The datapeek library is a simple package offering a few useful functions for handling raw data. These functions are:

encode_and_bind

remove_features

apply_function_to_column

get_closest_string

We will look at these functions later. For now we need to set up an R and Python environment to create datapeek, along with a few libraries to support packaging our code. We will be using JupyterLab inside a Docker container, along with a “Docker stack” that comes with the pieces we need.

Install and Run Docker

The Docker Stack we will use is called the jupyter/datascience-notebook. This image contains both R and Python environments, along with many of the packages typically used in machine learning.

Since these run inside Docker you must have Docker installed on your machine. So install Docker if you don’t already have it, and once installed, run the following in terminal to pull the datascience-notebook:

docker pull jupyter/datascience-notebook

This will pull the most recent image hosted on Docker Hub.

Immediately after running the above command you should see Docker begin pulling the image layers.

Once everything has been pulled we can confirm our new image exists by running the following:

docker images

…which should list jupyter/datascience-notebook among your local images.

Now that we have our Docker stack let’s setup JupyterLab.

JupyterLab

We will create our libraries inside a JupyterLab environment. JupyterLab is a web-based user interface for programming. With JupyterLab we have a lightweight IDE in the browser, making it convenient for building quick applications. JupyterLab provides everything we need to create libraries in R and Python, including:

  • A terminal environment for running shell commands and downloading/installing libraries;
  • An R and Python console for working interactively with these languages;
  • A simple text editor for creating files with various extensions;
  • Jupyter Notebooks for prototyping ML work.

The datascience-notebook we just pulled contains an installation of JupyterLab so we don’t need to install this separately. Before running our Docker image we need to mount a volume to ensure our work is saved outside the container.

First, create a folder called datapeek on your desktop (or anywhere you wish) and change into that directory. We need to run our Docker container with JupyterLab, so our full command should look as follows:

docker run -it -v `pwd`:/home/jovyan/work -p 8888:8888 jupyter/datascience-notebook start.sh jupyter lab

You can learn more about Docker commands here. Importantly, the above command exposes our environment on port 8888, meaning we can access our container through the browser.

After running the above command, the output will end with a URL containing an access token.

This tells us to copy and paste the provided URL into our browser. Open your browser, paste the link into the address bar and hit enter (your token will be different):

localhost:8888/?token=11e5027e9f7cacebac465d79c9548978b03aaf53131ce5fd

This will automatically open JupyterLab in your browser as a new tab.

We are now ready to start building libraries.

We begin this article with R, then look at Python.

CREATING LIBRARIES IN R

R is one of the “big 2” languages of machine learning. At the time of this writing it has well over 10,000 libraries. Going to Available CRAN Packages By Date of Publication and running…

document.getElementsByTagName('tr').length

…in the browser console gives me 13858. Minus the header and final row this gives 13856 packages. Needless to say R is not in need of variety. With strong community support and a concise (if not always intuitive) language, R sits comfortably at the top of statistical languages worth learning.

The most well-known treatise on creating R packages is Hadley Wickham’s book R Packages. Its contents are available for free online. For a deeper dive on the topic I recommend looking there.

We will use Hadley’s devtools package to abstract away the tedious tasks involved in creating packages. devtools is already installed in our Docker Stacks environment. We also require the roxygen2 package, which helps us document our functions. Since this doesn’t come pre-installed with our image let’s install that now.

Open terminal inside JupyterLab’s Launcher:

Inside the terminal type R to start an R session, then…

install.packages("roxygen2")
library("roxygen2")

With the necessary packages installed we’re ready to tackle each step.

STEP 1: Create Package Framework

We need to create a directory for our package. We can do this in one line of code, using the devtools create function. In the R console run:

devtools::create("datapeek")

This automatically creates the bare bone files and directories needed to define our R package. In JupyterLab you will see a set of new folders and files created on the left side.

If we inspect our package in JupyterLab we now see:

datapeek
├── R
├── datapeek.Rproj
├── DESCRIPTION
├── NAMESPACE

The R folder will eventually contain our R code. The datapeek.Rproj file is specific to the RStudio IDE so we can ignore it. The DESCRIPTION file holds our package’s metadata (a detailed discussion can be found here). Finally, NAMESPACE is a file that ensures our library plays nicely with others, and is more of a CRAN requirement.

Naming Conventions

We must follow these rules when naming an R package:

  • The name can contain only letters, numbers, and periods;
  • It must start with a letter;
  • It cannot end with a period.

You can read more about naming packages here. Our package name “datapeek” passes the above criteria. Let’s head over to CRAN and do a Command+F search for “datapeek” to ensure it’s not already taken:

Command + F search on CRAN to check for package name uniqueness.

…looks like we’re good.

STEP 2: Fill Out Description Details

The job of the DESCRIPTION file is to store important metadata about our package. These data include other packages required to run our library, our license, and our contact information. Technically, the definition of a package in R is any directory containing a DESCRIPTION file, so always ensure this is present.

Click on the DESCRIPTION file in JupyterLab’s directory listing. You will see the basic details created automatically when we ran devtools::create("datapeek"):

Let’s add our specific details so our package contains the necessary metadata. Simply edit this file inside JupyterLab. Here are the details I am adding:

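A filled-out DESCRIPTION ends up looking roughly like this (the values below are illustrative placeholders; substitute your own name, email and R version):

Package: datapeek
Title: Useful Functions for Handling Raw Data
Version: 0.1.0
Authors@R: person("Sean", "McClure", email = "[email protected]", role = c("aut", "cre"))
Description: A few useful functions for encoding, removing and transforming
    features in raw data, plus fuzzy string matching.
Depends: R (>= 3.4.0)
License: MIT
Encoding: UTF-8
LazyData: true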

Of course you should fill out these parts with your own details. You can read more about the definitions of each of these in Hadley’s chapter on metadata. As a brief overview, the Package, Title, and Version fields are self-explanatory; just be sure to keep Title to one line. Authors@R must adhere to the format you see above, since it contains executable R code. Note the role argument, which allows us to list the main contributors of our library. The usual ones are:

aut: author

cre: creator or maintainer

ctb: contributors

cph: copyright holder

There are many more options, with the full list found here.

You can add multiple authors by listing them as a vector:

Authors@R: as.person(c(
    "Sean McClure <[email protected]> [aut, cre]", 
    "Rick Deckard <[email protected]> [aut]",
    "Rachael Tyrell <[email protected]> [ctb]"
))

The description can be multiple lines, but is limited to one paragraph. We use Depends to specify the minimum version of R our package depends on; you should use an R version equal to or greater than the one you used to build your library. Most people today set their License to MIT, which permits anyone to “use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software” as long as your copyright is included. You can learn more about the MIT license here. Encoding ensures our library can be opened, read and saved using modern parsers, and LazyData refers to how data in our package are loaded. Since we set ours to true, our data won’t occupy memory until they are used.

STEP 3: Add Functions

3A: Add Functions to R Folder

Our library wouldn’t do much without functions. Let’s add the 4 functions mentioned in the beginning of this article. The following GIST shows our datapeek functions in R:

library(data.table)
library(mltools)

encode_and_bind <- function(frame, feature_to_encode) {
    res <- cbind(frame, one_hot(as.data.table(frame[[feature_to_encode]])))
    return(res)
}

remove_features <- function(frame, features) {
  rem_vec <- unlist(strsplit(features, ', '))
  res <- frame[,!(names(frame) %in% rem_vec)]
  return(res)
}

apply_function_to_column <- function(frame, list_of_columns, new_col, funct) {
    use_cols <- unlist(strsplit(list_of_columns, ', '))
    new_cols <- unlist(strsplit(new_col, ', '))
    frame[new_cols] <- apply(frame[use_cols], 2, function(x) {eval(parse(text=funct))})
    return(frame)
}

get_closest_string <- function(vector_of_strings, search_string) {
    all_dists <- adist(vector_of_strings, search_string)
    closest <- min(all_dists)
    res <- vector_of_strings[which(all_dists == closest)]
    return(res)
}

We have to add our functions to the R folder, since this is where R looks for any functions inside a library.

datapeek
├── R
├── datapeek.Rproj
├── DESCRIPTION
├── NAMESPACE

Since our library only contains 4 functions we will place all of them into a single file called utilities.R, with this file residing inside the R folder.

Go into the directory in JupyterLab and open the R folder. Click on Text File in the Launcher and paste in our 4 R functions. Right-click the file and rename it to utilities.R.

3B: Export our Functions

It isn’t enough to simply place R functions in our file. Each function must be exported to expose it to users of our library. This is accomplished by adding the @export tag above each function.

The export syntax comes from Roxygen, and ensures our function gets added to the NAMESPACE. Let’s add the @export tag to our first function:

#' @export
encode_and_bind <- function(frame, feature_to_encode) {
    res <- cbind(frame, one_hot(as.data.table(frame[[feature_to_encode]])))
    return(res)
}

Do this for the remaining functions as well.

3C: Document our Functions

It is important to document our functions. Documenting functions provides information for users, such that when they type ?datapeek they get details about our package. Documenting also supports working with vignettes, which are a long-form type of documentation. You can read more about documenting functions here.

There are 2 sub-steps we will take:

  • Add documentation annotations above each function;
  • Run devtools::document().

— Add the Document Annotations

Documentation is added above our function, directly above the #' @export line. Here’s the example with our first function:

#' One-hot encode categorical variables.
#'
#' This function one-hot encodes categorical variables and binds the entire frame.
#'
#' @param frame data frame
#' @param feature_to_encode feature to encode
#' @return one-hot-encoded data frame
#'
#' @export
encode_and_bind <- function(frame, feature_to_encode) {
    res <- cbind(frame, one_hot(as.data.table(frame[[feature_to_encode]])))
    return(res)
}

We space out the lines for readability, adding a title, description, and any parameters used by the function. Let’s do this for our remaining functions:

library(data.table)
library(mltools)

#' One-hot encode categorical variables.
#'
#' This function one-hot encodes categorical variables and binds the entire frame.
#'
#' @param frame data frame
#' @param feature_to_encode feature to encode
#' @return one-hot-encoded data frame
#'
#' @export
encode_and_bind <- function(frame, feature_to_encode) {
    res <- cbind(frame, one_hot(as.data.table(frame[[feature_to_encode]])))
    return(res)
}

#' Remove specified features from data frame.
#'
#' This function removes specified features from a data frame.
#'
#' @param frame data frame
#' @param features features to remove
#' @return original data frame less removed features
#'
#' @export
remove_features <- function(frame, features) {
  rem_vec <- unlist(strsplit(features, ', '))
  res <- frame[,!(names(frame) %in% rem_vec)]
  return(res)
}

#' Apply function to multiple columns.
#'
#' This function enables the application of a function to multiple columns.
#'
#' @param frame data frame
#' @param list_of_columns columns to apply the function to
#' @param new_col name of new column(s)
#' @param funct function to apply
#' @return original data frame with new column(s) attached
#'
#' @export
apply_function_to_column <- function(frame, list_of_columns, new_col, funct) {
    use_cols <- unlist(strsplit(list_of_columns, ', '))
    new_cols <- unlist(strsplit(new_col, ', '))
    frame[new_cols] <- apply(frame[use_cols], 2, function(x) {eval(parse(text=funct))})
    return(frame)
}

#' Discover closest matching string from set.
#'
#' This function discovers the closest matching string from a vector of strings.
#'
#' @param vector_of_strings vector of strings to search
#' @param search_string search string
#' @return closest matching string
#'
#' @export
get_closest_string <- function(vector_of_strings, search_string) {
    all_dists <- adist(vector_of_strings, search_string)
    closest <- min(all_dists)
    res <- vector_of_strings[which(all_dists == closest)]
    return(res)
}

— Run devtools::document()

With documentation added to our functions, we then run the following in the R console from inside the package directory (or pass the path, e.g. devtools::document("datapeek"), if your working directory is one level up):

devtools::document()

You may get the error:

Error: ‘roxygen2’ >= 5.0.0 must be installed for this functionality.

In this case open terminal in JupyterLab, start R, and install roxygen2. You should also install data.table and mltools, since our first function uses these:

install.packages('roxygen2')
install.packages('data.table')
install.packages('mltools')

Run devtools::document() again.

This will generate .Rd files inside a new man folder. You’ll notice an .Rd file is created for each function in our package.
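For example, man/encode_and_bind.Rd is generated from the annotations we wrote and will look roughly like this (the exact layout depends on your roxygen2 version):

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/utilities.R
\name{encode_and_bind}
\alias{encode_and_bind}
\title{One-hot encode categorical variables.}
\usage{
encode_and_bind(frame, feature_to_encode)
}
\arguments{
\item{frame}{data frame}

\item{feature_to_encode}{feature to encode}
}
\value{
one-hot-encoded data frame
}
\description{
This function one-hot encodes categorical variables and binds the entire frame.
}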

If you look at your DESCRIPTION file it will now show a new line at the bottom:
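It will read something like this (the version number reflects whichever roxygen2 you have installed):

RoxygenNote: 6.1.1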

This will also generate a NAMESPACE file:
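Because we tagged each function with @export, the generated NAMESPACE should look roughly like this:

# Generated by roxygen2: do not edit by hand

export(apply_function_to_column)
export(encode_and_bind)
export(get_closest_string)
export(remove_features)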

We can see our 4 functions have been exposed. Let’s now move onto ensuring dependencies are specified inside our library.

STEP 4: List External Dependencies

It is common for our functions to require functions found in other libraries. There are 2 things we must do to ensure external functionality is made available to our library’s functions:

  1. Use double colons inside our functions to specify which library we are relying on;
  2. Add imports to our DESCRIPTION file.

You’ll notice in the above GIST we simply listed our libraries at the top. While this works well in stand-alone R scripts it isn’t the way to use dependencies in an R package. When creating R packages we must use the “double-colon approach” to ensure the correct function is read. This is related to how “top-level code” (code that isn’t an object like a function) in an R package is only executed when the package is built, not when it’s loaded.

For example:

library(mltools)

do_something_cool_with_mltools <- function() {
    auc_roc(preds, actuals)
}

…won’t work because auc_roc will not be available (running library(datapeek) doesn’t re-execute library(mltools)). This will work:

do_something_cool_with_mltools <- function() {
    mltools::auc_roc(preds, actuals)
}

The only function in our datapeek package requiring additional packages is our first one:

#' One-hot encode categorical variables.
#'
#' This function one-hot encodes categorical variables and binds the entire frame.
#'
#' @param frame data frame
#' @param feature_to_encode feature to encode
#' @return one-hot-encoded data frame
#'
#' @export
encode_and_bind <- function(frame, feature_to_encode) {
    res <- cbind(frame, mltools::one_hot(data.table::as.data.table(frame[[feature_to_encode]])))
    return(res)
}

Using the double-colon approach to specify dependent packages in R.

Notice each time we call an external function we preface it with the external library and double colons.

We must also list external dependencies in our DESCRIPTION file, so they are handled correctly. Let’s add our imports to the DESCRIPTION file:
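For datapeek this means adding an Imports field listing the two packages used by encode_and_bind:

Imports:
    data.table,
    mltools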

Be sure to have the imported libraries comma-separated. Notice we didn’t specify any versions for our external dependencies. If we need to specify versions we can use parentheses after the package name:

Imports:
    data.table (>= 1.12.0)

Since our encode_and_bind function isn’t taking advantage of any bleeding-edge updates we will leave it without any version specified.

STEP 5: Add Data

Sometimes it makes sense to include data inside our library. Package data allow our users to practice with our library’s functions, and also help with testing, since machine learning packages almost always contain functions that ingest and transform data. The 4 options for adding external data to an R package are:

  1. Exported data in the data/ folder (binary .rda files; the most common approach);
  2. Internal data in R/sysdata.rda, for objects our functions need but users shouldn’t access directly;
  3. Raw data files in inst/extdata;
  4. Data stored outside the package (for example, hosted online) and fetched on demand.

You can learn more about these different approaches here. For this article we will stick with the most common approach, which is to add external data to the data folder.

Let’s add the Iris dataset to our library in order to provide users a quick way to test our functions. The data must be in the .rda format, created using R’s save() function, and the file must have the same name as the data object. We can ensure these criteria are satisfied by using devtools’ use_data function:

iris <- read.csv("http://bit.ly/2HuTS0Z")
devtools::use_data(iris)

Above, I read in the Iris dataset from its URL and pass the data frame to devtools::use_data().

In JupyterLab we see a new data folder has been created, along with our iris.rda dataset:

datapeek
├── data
    └── iris.rda
├── man
├── R
├── datapeek.Rproj
├── DESCRIPTION
├── NAMESPACE

We will use our added dataset to run tests in the following section.

STEP 6: Add Tests

Testing is an important part of software development. Testing helps ensure our code works as expected, and makes debugging our code a much faster and more effective process. Learn more about testing R packages here.

A common challenge in testing is knowing what we should test. Testing every function in a large library is cumbersome and not always needed, while not enough testing can make it harder to find and correct bugs when they arise.

I like the following quote from Martin Fowler regarding when to test:

“Whenever you are tempted to type something into a print statement or a debugger expression, write it as a test instead.” — Martin Fowler
If you prototype applications regularly you’ll find yourself writing to the console frequently to see if a piece of code returns what you expect. In data science, writing interactive code is even more common, since machine learning work is highly experimental. On one hand this provides ample opportunity to think about which tests to write. On the other hand, the non-deterministic nature of machine learning code means testing certain aspects of ML can be less than straightforward. As a general rule, look for obvious deterministic pieces of your code that should return the same output every time.

The interactive testing we do in data science is manual, but what we are looking for in our packages is automated testing. Automated testing means we run a suite of pre-defined tests to ensure our package works end-to-end.

While there are many kinds of tests in software, here we are talking about “unit tests.” Thinking in terms of unit tests forces us to break up our code into more modular components, which is good practice in software design.

There are 2 sub-steps we will take for testing our R library:

6A: Creating the tests/testthat folder;

6B: Writing tests.

— 6A: Creating the tests/testthat folder

Just as R expects our R scripts and data to be in specific folders it also expects the same for our tests. To create the tests folder, we run the following in JupyterLab’s R console:

devtools::use_testthat()

You may get the following error:

Error: ‘testthat’ >= 1.0.2 must be installed for this functionality.

If so, use the same approach above for installing roxygen2 in Jupyter’s terminal.

install.packages('testthat')

Running devtools::use_testthat() will produce the following output:

* Adding testthat to Suggests
* Creating `tests/testthat`.
* Creating `tests/testthat.R` from template.

There will now be a new tests folder in our main directory:

datapeek
├── data
├── man
├── R
├── tests
    └── testthat.R
├── datapeek.Rproj
├── DESCRIPTION
├── NAMESPACE

The above command also created a file called testthat.R inside the tests folder; this runs all your tests when R CMD check runs (we’ll look at that shortly). You’ll also notice testthat has been added under Suggests in our DESCRIPTION file.
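The generated tests/testthat.R itself is just a short driver, roughly:

library(testthat)
library(datapeek)

test_check("datapeek")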

— 6B: Writing Tests

testthat is the most popular unit testing package for R, used by at least 2,600 CRAN packages, not to mention libraries on GitHub. You can check out the latest news regarding testthat on the Tidyverse page here. Also check out its documentation.

There are 3 levels to testing we need to consider:

  • Assertions (expectations): the individual checks on a function’s output;
  • Tests: groups of assertions covering a single unit of functionality;
  • Files: groups of related tests.

Assertions

Assertions are the functions included in the testing library we choose. We use assertions to check whether our own functions return the expected output. Assertions come in many flavors, depending on what is being checked. In the following section I will cover the main assertions used in R testing, showing each one failing so you can see how it works.

Equality Assertions

# test for equality
a <- 10
expect_equal(a, 14)

> Error: `a` not equal to 14.

# test for identical 
expect_identical(42, 2)

> Error: 42 not identical to 2.

# test for equivalence 
expect_equivalent(10, 12)

> Error: 10 not equivalent to 12.

There are subtle differences between the examples above. For example, expect_equal checks for equality within a numerical tolerance, while expect_identical tests for exact identity. Here are examples:

expect_equal(10, 10 + 1e-7) # true
expect_identical(10, 10 + 1e-7) # false

As you write more tests you’ll understand when to use each one. Of course always refer to the documentation referenced above when in doubt.

Testing for String Matches

# test for string matching
expect_match("Machine Learning is Fun", "But also rewarding.")

> Error: "Machine Learning is Fun" does not match "But also rewarding.".

Testing for Length

# test for length 
vec <- 1:10
expect_length(vec, 12)

> Error: `vec` has length 10, not length 12.

Testing for Comparison

# test for less than
a <- 11
expect_lt(a, 10)

> Error: `a` is not strictly less than 10. Difference: 1

# test for greater than
a <- 11
expect_gt(a, 12)

> Error: `a` is not strictly more than 12. Difference: -1

Testing for Logic

# test for truth 
expect_true(5 == 2)

> Error: 5 == 2 isn't true.

# test for false 
expect_false(2 == 2)

> Error: 2 == 2 isn't false.

Testing for Outputs

# testing for outputs 
expect_output(str(mtcars), "31 obs")

> Error: `str\(mtcars\)` does not match "31 obs".

# test for message
f <- function(x) {
  if(x < 0) {
    message("*x* is already negative")
  }
}

expect_message(f(1))

> Error: `f(1)` did not produce any messages.

There are many more included in the testthat library. If you are new to testing, start writing a few simple ones to get used to the process. With time you’ll build an intuition around what to test and when.

Writing Tests

A test is a group of assertions. We write tests in testthat as follows:

test_that("this functionality does what it should", {
    # group of assertions here
})

We can see we have both a description (the test name) and the code (containing the assertions). The description completes the sentence, “test that ….”

Above, we are saying “test that this functionality does what it should.”

The assertions are the outputs we wish to test. For example:

test_that("trigonometric functions match identities", {
      expect_equal(sin(pi / 4), 1 / sqrt(2))
      expect_equal(cos(pi / 4), 1 / sqrt(10))
      expect_equal(tan(pi / 4), 1)
    })

> Error: Test failed: 'trigonometric functions match identities'

Creating Files

The last thing we do in testing is create files. As stated above, a “file” in testing is a group of tests covering a related set of functionality. Our test file must live inside the tests/testthat/ directory. Here is an example test file for the stringr package on GitHub:

Example Test File from the stringr package on GitHub.

The file is called test-case.R (starts with “test”) and lives inside the tests/testthat/ directory. The context at the top simply allows us to provide a simple description of the file’s contents. This appears in the console when we run our tests.

Let’s create our test file, which will contain tests and assertions related to our 4 functions. As usual, we use JupyterLab’s Text File in Launcher to create and rename a new file:

Creating a Test File in R

Now let’s add our tests:

For the first function I am going to make sure a data frame with the correct number of features is returned:

context('utility functions')

library(data.table)
library(mltools)

load('../../data/iris.rda')

test_that("data frame with correct number of features is returned", {
    res <- encode_and_bind(iris, 'Species')
    expect_equal(dim(res)[2], 8)
})

Notice how we called our encode_and_bind function, then simply checked the equality between the dimensions and the expected output. We run our automated tests at any point to ensure our test file runs and we get the expected output. Running devtools::test() in the console runs our tests:

We get a smiley face too!

Since our second function removes a specified feature I will use the same kind of test as above, checking the dimensions of the returned frame. Our third function applies a specified function to chosen columns, so I will write a test that checks the result of a given function. Finally, our fourth function returns the closest matching string, so I will simply check the returned string for the expected result.

Here is our full test file:

context('utility functions')

library(data.table)
library(mltools)

load('../../data/iris.rda')

test_that("data frame with correct number of features is returned", {
    res <- encode_and_bind(iris, 'Species')
    expect_equal(dim(res)[2], 8)
})

test_that("data frame with correct number of features is returned", {
    res <- remove_features(iris, 'Species')
    expect_equal(dim(res)[2], 4)
})

test_that("newly created columns return correct sum", {
    res <- apply_function_to_column(iris, "Sepal.Width, Sepal.Length", "new_col1, new_col2", "x*5") 
    expect_equal(sum(res$new_col1), 2293)
    expect_equal(sum(res$new_col2), 4382.5)
})

test_that("closest matching string is returned", {
    res <- get_closest_string(c("hey there", "we are here", "howdy doody"), "doody") 
    expect_true(identical(res, 'howdy doody'))
})

Testing our Package

As we did above, we run our tests using the following command:

devtools::test()

This will run all tests in any test files we placed inside the testthat directory. Let’s check the result:

We had 5 assertions across 4 unit tests, placed in one test file. Looks like we’re good. If any of our tests failed we would see this in the above printout, at which point we would look to correct the issue.

STEP 7: Create Documentation

This has traditionally been done using “Vignettes” in R. You can learn about creating R vignettes for your R package here. Personally, I find this a dated approach to documentation. I prefer to use things like Sphinx or Julep. Documentation should be easily shared, searchable and hosted.

Click on the question mark at julepcode.com to learn how to use Julep.

I created and hosted some simple documentation for our R datapeek library, which you can find here.

Of course we will also have the library on GitHub, which I cover below.

STEP 8: Share your R Library

As I mentioned in the introduction we should be creating libraries on a regular basis, so others can benefit from and extend our work. The best way to do this is through GitHub, which is the standard way to distribute and collaborate on open source software projects.

In case you’re new to GitHub here’s a quick tutorial to get you started so we can push our datapeek project to a remote repo.

Sign up/in to GitHub and create a new repository.

…which will provide us with the usual quick-setup screen showing the repository’s URL.

With our remote repo set up we can initialize our local repo on our machine and send our first commit.

Open Terminal in JupyterLab and change into the datapeek directory:

Initialize the local repo:

git init

Add the remote origin (your link will be different):

git remote add origin https://github.com/sean-mcclure/datapeek.git

Now run git add . to add all modified and new (untracked) files in the current directory and all subdirectories to the staging area:

git add .

Don’t forget the “dot” in the above command. Now we can commit our changes, which adds any new code to our local repo.

But, since we are working inside a Docker container the username and email associated with our local repo cannot be autodetected. We can set these by running the following in terminal:

git config --global user.email {emailaddress}
git config --global user.name {name}

Use the email address and username you use to sign into GitHub.

Now we can commit:

git commit -m 'initial commit'

With our new code committed we can do our push, which transfers the last commit(s) to our remote repo:

git push origin master

Some readers will notice we didn’t place a .gitignore file in our directory. It is usually fine to push all files inside smaller R libraries. For larger libraries, or libraries containing large datasets, you can use the site gitignore.io to see what common gitignore files look like. Here is a common .gitignore file for R:

### R ###
# History files
.Rhistory
.Rapp.history

# Session Data files
.RData

# User-specific files
.Ruserdata

# Example code in package build process
*-Ex.R

# Output files from R CMD build
/*.tar.gz

# Output files from R CMD check
/*.Rcheck/

# RStudio files
.Rproj.user/

# produced vignettes
vignettes/*.html
vignettes/*.pdf

# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
.httr-oauth

# knitr and R markdown default cache directories
/*_cache/
/cache/

# Temporary files created by R markdown
*.utf8.md
*.knit.md

### R.Bookdown Stack ###
# R package: bookdown caching files
/*_files/

Example .gitignore file for an R package

To recap, git add adds all modified and new (untracked) files in the current directory to the staging area, commit records any staged changes in our local repo, and push transfers the last commit(s) to our remote repo. While git add might seem superfluous, it exists because sometimes we only want to commit certain files; thus we can stage files selectively. Above, we staged all files by using the “dot” after git add.

You may also notice we didn’t include a README file. You should indeed include this, however for the sake of brevity I have left this step out.

Now, anyone can use our library. 👍 Let’s see how.

STEP 9: Install your R Library

As mentioned in the introduction I will not be discussing CRAN in this article. Sticking with GitHub makes it easier to share our code frequently, and we can always add the CRAN steps later.

To install a library from GitHub, users can simply run the following command on their local machine:

devtools::install_github("yourusername/mypackage")

As such, we can simply instruct others wishing to use datapeek to run the following command on their local machine:

devtools::install_github("sean-mcclure/datapeek")

This is something we would include in a README file and/or any other documentation we create. This will install our package like any other package we get from CRAN.

Users then load the library as usual and they’re good to go:

library(datapeek)

I recommend trying the above commands in a new R environment to confirm the installation and loading of your new library works as expected.
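For example, a minimal check using R’s built-in iris data might be:

library(datapeek)
res <- encode_and_bind(iris, "Species")
dim(res)   # should report 150 rows and 8 columns: the original 5 plus 3 one-hot Species columns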

CREATING LIBRARIES IN PYTHON

Creating Python libraries follows the same high-level steps we saw previously for R. We require a basic directory structure with proper naming conventions, functions with descriptions, imports, specified dependencies, added datasets, documentation, and the ability to share and allow others to install our library.

We will use JupyterLab to build our Python library, just as we did for R.

Library vs Package vs Module

In the beginning of this article I discussed the difference between a “library” and a “package”, and how I prefer to use these terms interchangeably. The same holds for Python libraries. “Module” is another term, and in Python it simply refers to any file containing Python code; a Python library (or package) is a directory containing one or more modules.
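To make the terminology concrete with the library we are about to build: the datapeek/ directory (containing an __init__.py) is the package, utilities.py inside it is a module, and users import from it like so:

from datapeek import utilities                        # datapeek is the package, utilities the module
from datapeek.utilities import get_closest_string     # import a single function from the module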

Before we start:

I stated in the introduction that we will host and install our libraries on and from GitHub. This encourages rapid creation and sharing of libraries without getting bogged down by publishing criteria on popular package hosting sites for R and Python.

The most popular hosting site for Python is the Python Package Index (PyPI). This is a place for finding, installing and publishing Python libraries. Whenever you run pip install <package_name> (or easy_install) you are fetching a package from PyPI.

While we won’t cover hosting our package on PyPI it’s still a good idea to see if our library name is unique. This will minimize confusion with other popular Python libraries and improve the odds our library name is distinctive, should we decide to someday host it on PyPI.

First, we should follow a few naming conventions for Python libraries.

Python Library Naming Conventions

  • The name should be all lowercase;
  • It should be unique on PyPI;
  • Use underscores rather than hyphens if a separator is needed (or avoid separators altogether).

Our library name is datapeek, so the first and third criteria are met; let’s check PyPI for uniqueness:

All good. 👍

We’re now ready to move through each step required to create a Python library.

STEP 1: Create Package Framework

JupyterLab should be up-and-running as per the instructions in the setup section of this article.

Use JupyterLab’s New Folder and Text File options to create the following directory structure and files:

datapeek
├── datapeek
    └── __init__.py
    └── utilities.py
├── setup.py


There will be files we do not want to commit to source control. These are files created by the Python build system. As such, let’s also add the following .gitignore file to our package framework:

# Compiled python modules.
*.pyc

# Setuptools distribution folder.
/dist/

# Python egg metadata, regenerated from source files by setuptools.
/*.egg-info

Add your gitignore file as a simple text file to the root directory (for now we name it gitignore, without the leading dot; we will rename it when pushing to GitHub):

datapeek
├── datapeek
    └── __init__.py
    └── utilities.py
├── setup.py
├── gitignore

STEP 2: Fill Out Description Details

Just as we did for R, we should add metadata about our new library. We do this using Setuptools. Setuptools is a Python library designed to facilitate packaging Python projects.

Open setup.py and add the following details for our library:

from setuptools import setup

setup(name='datapeek',
      version='0.1',
      description='A simple library for dealing with raw data.',
      url='https://github.com/sean-mcclure/datapeek_py',
      author='Sean McClure',
      author_email='[email protected]',
      license='MIT',
      packages=['datapeek'],
      zip_safe=False)

Of course you should change the authoring details to your own. We will add more details to this file later. The keywords are fairly self-explanatory. url is the URL of our project on GitHub, which we will add later; unless you’ve already created your Python repo, in which case add the URL now. We talked about licensing in the R section. zip_safe indicates whether the package can safely be run directly from a zip archive; setting it to False forces a normal directory installation, which is the safer default. You can learn more about what can be added to the setup.py file here.

STEP 3: Add Functions

Our library obviously requires functions to be useful. For larger libraries we would organize our modules so as to balance cohesion/coupling, but since our library is small we will simply keep all functions inside a single file.

We will add the same functions we did for R, this time written in Python:

import pandas as pd
from fuzzywuzzy import fuzz

def encode_and_bind(frame, feature_to_encode):
    dummies = pd.get_dummies(frame[[feature_to_encode]])
    res = pd.concat([frame, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return(res)

def remove_features(frame, list_of_columns):
    res = frame.drop(frame[list_of_columns],axis=1)
    return(res)

def apply_function_to_column(frame, list_of_columns, new_col, funct):
    frame[new_col] = frame[list_of_columns].apply(lambda x: eval(funct))
    return(frame)

def get_closest_string(list_of_strings, search_string):
    all_dists = []
    all_strings = []
    for string in list_of_strings:
        dist = fuzz.partial_ratio(string, search_string)
        all_dists.append(dist)
        all_strings.append(string)
    top = max(all_dists)
    ind=0
    for i, x in enumerate(all_dists):
        if x == top:
            ind = i
    return(all_strings[ind])

Add these functions to the utilities.py module, inside datapeek’s module directory.

STEP 4: List External Dependencies

Our library will often require other packages as dependencies. Our users’ Python environments will need to be aware of these when installing our library (so these other packages can also be installed). Setuptools provides the install_requires keyword to list any packages our library depends on.

Our datapeek library depends on the fuzzywuzzy package for fuzzy string matching, and the pandas package for high-performance manipulation of data structures. To specify our dependencies, add the following to your setup.py file:

install_requires=[
    'fuzzywuzzy',
    'pandas'
]

Your setup.py file should currently look as follows:

from setuptools import setup

setup(name='datapeek',
      version='0.1',
      description='A simple library for dealing with raw data.',
      url='https://github.com/sean-mcclure/datapeek_py',
      author='Sean McClure',
      author_email='[email protected]',
      license='MIT',
      packages=['datapeek'],
      install_requires=[
          'fuzzywuzzy',
          'pandas'
      ],
      zip_safe=False)

We can confirm all is in order by running the following in a JupyterLab terminal session:

python setup.py develop

After running the command you should see setuptools resolve the dependencies, with output that ends in:

Finished processing dependencies for datapeek==0.1

If one or more of our dependencies is not available on PyPI, but is available on GitHub (e.g. a bleeding-edge machine learning package, or another of our team’s libraries hosted only on GitHub), we can use dependency_links inside our setup call:

setup(
    ...
    dependency_links=['http://github.com/user/repo/tarball/master#egg=package-1.0'],
    ...
)

If we want to add additional metadata, such as development status, licensing, language version, etc., we can use classifiers like this:

setup(
    ...
    classifiers=[
        'Development Status :: 3 - Alpha',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python :: 2.7',
        'Topic :: Text Processing :: Linguistic',
      ],
    ...
)

To learn more about the different classifiers that can be added to our setup.py file see here.

STEP 5: Add Data

Just as we did above in R we can add data to our Python library. In Python these are called Non-Code Files and can include things like images, data, documentation, etc.

We add data to our library’s module directory, so that any code that requires those data can use a relative path from the consuming module’s __file__ variable.
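For example, a hypothetical helper placed in utilities.py (not part of datapeek as written, just an illustration of the pattern) could locate the bundled CSV like this:

import os
import pandas as pd

def load_iris_data():
    # Build the path to the packaged CSV relative to this module's location on disk
    data_path = os.path.join(os.path.dirname(__file__), 'data', 'iris.csv')
    return pd.read_csv(data_path)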

Let’s add the Iris dataset to our library in order to provide users a quick way to test our functions. First, use the New Folder button in JupyterLab to create a new folder called data inside the module directory:

datapeek
├── datapeek
    └── __init__.py
    └── utilities.py
    └── data
├── setup.py
├── gitignore

…then make a new Text File inside the data folder called iris.csv, and paste the data from here into the new file.

If you close and open the new csv file it will render inside JupyterLab as a proper table:

CSV file rendered in JupyterLab as formatted table.

We specify Non-Code Files using a MANIFEST.in file. Create another Text File called MANIFEST.in placing it inside your root folder:

datapeek
├── datapeek
    └── __init__.py
    └── utilities.py
    └── data
├── MANIFEST.in
├── setup.py
├── gitignore

…and add this line to the file:

include datapeek/data/iris.csv

We also need to include the following line in setup.py:

include_package_data=True

Our setup.py file should now look like this:

from setuptools import setup

setup(name='datapeek',
      version='0.1',
      description='A simple library for dealing with raw data.',
      url='https://github.com/sean-mcclure/datapeek_py',
      author='Sean McClure',
      author_email='[email protected]',
      license='MIT',
      packages=['datapeek'],
      install_requires=[
          'fuzzywuzzy',
          'pandas'
      ],
      include_package_data=True,
      zip_safe=False)

STEP 6: Add Tests

As with our R library we should add tests so others can extend our library and ensure their own functions do not conflict with existing code. Add a test folder to our library’s module directory:

datapeek
├── datapeek
    └── __init__.py
    └── utilities.py
    └── data
    └── tests
├── MANIFEST.in
├── setup.py
├── gitignore

Our test folder should have its own __init__.py file as well as the test file itself. Create those now using JupyterLab’s Text File option:

datapeek
├── datapeek
    └── __init__.py
    └── utilities.py
    └── data
    └── tests
        └── __init__.py
        └── datapeek_tests.py
├── MANIFEST.in
├── setup.py
├── gitignore

Our datapeek directory structure is now set to house test functions, which we will write now.

Writing Tests

Writing tests in Python is similar to doing so in R. Assertions are used to check the expected outputs produced by our library’s functions. We can use these “unit tests” to check a variety of expected outputs depending on what might be expected to fail. For example, we might want to ensure a data frame is returned, or perhaps the correct number of columns after some known transformation.

I will add a simple test for each of our 4 functions. Feel free to add your own tests. Think about what should be checked, and keep in mind Martin Fowler’s quote shown in the R section of this article.

We will use unittest, a popular unit testing framework in Python.

Add unit tests to the datapeek_tests.py file, ensuring the unittest and datapeek libraries are imported:

from unittest import TestCase
from datapeek import utilities
import pandas as pd

use_data = pd.read_csv('datapeek/data/iris.csv')

def test_encode_and_bind():
    s = utilities.encode_and_bind(use_data, 'sepal_length')
    assert isinstance(s, pd.DataFrame)
    
def test_remove_features():
    s = utilities.remove_features(use_data, ['sepal_length','sepal_width'])
    assert s.shape[1] == 3
    
def test_apply_function_to_column():
    s = utilities.apply_function_to_column(use_data, ['sepal_length'], 'times_4', 'x*4')
    assert s['sepal_length'].sum() * 4 == 3506.0

def test_get_closest_string(): 
    s = utilities.get_closest_string(['hey there','we we are','howdy doody'], 'doody')
    assert s == 'howdy doody'

To run these tests we can use Nose, which extends unittest to make testing easier. Install nose using a terminal session in JupyterLab:

$ pip install nose

We also need to add the following lines to setup.py:

setup(
    ...
    test_suite='nose.collector',
    tests_require=['nose'],
)

Our setup.py should now look like this:

from setuptools import setup

setup(name='datapeek',
      version='0.1',
      description='A simple library for dealing with raw data.',
      url='https://github.com/sean-mcclure/datapeek_py',
      author='Sean McClure',
      author_email='[email protected]',
      license='MIT',
      packages=['datapeek'],
      install_requires=[
          'fuzzywuzzy',
          'pandas'
      ],
      test_suite='nose.collector',
      tests_require=['nose'],
      include_package_data=True,
      zip_safe=False)

Run the following from the root directory to run our tests:

python setup.py test

Setuptools will take care of installing nose if required and running the test suite. After running the above, the test results are printed to the terminal:

All our tests have passed!

If any test should fail, the unittest framework will show which tests did not pass. At this point, check to ensure you are calling the function correctly and that the output is indeed what you expected. It can also be good practice to purposely write tests to fail first, then write your functions until they pass.

STEP 7: Create Documentation

As I mentioned in the R section, I use Julep to rapidly create shareable and searchable documentation. This avoids writing cryptic annotations and provides the ability to immediately host our documentation. Of course this doesn’t come with the IDE hooks that other documentation approaches offer, but for rapid communication it works well.

You can find the documentation I created for this library here.

STEP 8: Share Your Python Library

The standard approach for sharing Python libraries is through PyPI. Just as we didn’t cover CRAN with R, we will not cover hosting our library on PyPI. While the requirements are fewer than those associated with CRAN, there are still a number of steps that must be taken to successfully host on PyPI. The steps required to host on sites other than GitHub can always be added later.

GitHub

We covered the steps for adding a project to GitHub in the R section. The same steps apply here.

I mentioned above the need to rename our gitignore file to make it a hidden file. You can do that by running the following in terminal:

mv gitignore .gitignore

You’ll notice this file is no longer visible in our JupyterLab directory (it may take a moment to disappear). Since JupyterLab still lacks a front-end setting to toggle hidden files, simply run the following in a terminal at any time to see hidden files:

ls -a 

We can make it visible again should we need to view/edit the file in JupyterLab, by running:

mv .gitignore gitignore

Here is a quick recap on pushing our library to GitHub (change git URL to your own):

git init
git add .
git commit -m "initial commit"
git config --global user.email {emailaddress}
git config --global user.name {name}
git remote add origin https://github.com/sean-mcclure/datapeek_py.git
git push -u origin master

Now, anyone can use our python library. 👍 Let’s see how.

STEP 9: Install your Python Library

While we usually install Python libraries using the following command:

pip install <package_name>

… this requires hosting our library on PyPI, which as explained above is beyond the scope of this article. Instead we will learn how to install our Python libraries from GitHub, as we did for R. This approach still requires the pip install command but uses the GitHub URL instead of the package name.

Installing our Python Library from GitHub

With our library hosted on GitHub, we simply use pip install git+ followed by the URL provided on our GitHub repo (available by clicking the Clone or Download button on the GitHub website):

pip install git+https://github.com/sean-mcclure/datapeek_py

Now, we can import our library into our Python environment. For a single function:

from datapeek.utilities import encode_and_bind

…and for all functions:

from datapeek.utilities import *

Let’s do a quick check in a new Python environment to ensure our functions are available. Spinning up a new Docker container, I run the following:

Import what we need and fetch a dataset:

import pandas as pd
from datapeek.utilities import *

iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')

Check functions:

encode_and_bind(iris, 'species')

remove_features(iris, ['petal_length', 'petal_width'])

apply_function_to_column(iris, ['sepal_length'], 'times_4', 'x*4')

get_closest_string(['hey there','we we are','howdy doody'], 'doody')

Success!

SUMMARY

In this article we looked at how to create both R and Python libraries using JupyterLab running inside a Docker container. Docker allowed us to leverage Docker Stacks so that our environment was easily controlled and common packages were available. This also made it easy to use the same high-level, browser-based interface to create libraries in two different languages. All files were written to our local machine since we mounted a volume inside Docker.

Creating libraries is a critical skill for any machine learning practitioner, and something I encourage others to do regularly. Libraries help isolate our work inside useful abstractions, improve reproducibility, make our work shareable, and are the first step towards designing better software. Using a lightweight approach ensures we can prototype and share quickly, with the option to add more detailed practices and publishing criteria later as needed.

As always, please ask questions in the comments section should you run into issues. Happy coding.



From ‘R vs Python’ to ‘R and Python’

In this article, you'll learn to leverage the best of both ‘Python and R’ in a single project.

If you are into Data Science, the two programming languages that immediately come to mind are R and Python. However, instead of considering them as two options, more often than not, we end up comparing the two. R and Python are excellent tools in their own right but are very often conceived as rivals. If you type R vs Python in your Google search bar, you instantly get a plethora of resources that talk about the supremacy of one over the other.

One of the reasons for such an outlook is because people have divided the Data Science field into camps based on the choice of the programming language they use. There is an R camp and a Python camp and history is a testimony to the fact that camps cannot live in harmony. Members of both the camps fervently believe that their choice of language is superior to the other. So, in a way, divergence doesn’t lie with the tools but with the people using those tools.

Why not use Both?

There are people in the Data Science community who use both Python and R, but their percentage is small. On the other hand, many people are committed to only one programming language but wish they had access to some of the capabilities of their adversary. For instance, R users sometimes yearn for the object-oriented capabilities that are native to Python, and similarly, some Python users long for the wide range of statistical distributions available within R.

A survey conducted by RedMonk in the third quarter of 2018, based on the popularity of the languages on Stack Overflow as well as on GitHub, rated both R and Python quite high. Therefore, there is no inherent reason why we cannot work with both of them on the same project. Our ultimate goal should be to do better analytics and derive better insights, and the choice of a programming language should not be a hindrance to achieving that.

Overview of R and Python

Let’s have a look at the various aspects of these languages and what’s good and not so good about them.

Python

Since its release in 1991, Python has been extremely popular and is widely used in data processing. Some of the reasons for its wide popularity are:

  • Object-oriented language
  • General Purpose
  • Has a lot of extensions and incredible community support
  • Simple and easy to understand and learn
  • Packages like pandas, numpy, and scikit-learn make Python an excellent choice for machine learning activities.

However, Python doesn’t have specialized packages for statistical computing, unlike R.

R

R’s first release came in 1995 and since then it has gone on to become one of the most used tools for data science in the industry.

Performance-wise, R is not the fastest language and can sometimes be a memory glutton when dealing with large datasets.

Leveraging the best of Both Worlds

Could we utilize the statistical prowess of R along with the programming capabilities of Python? Well, if we can easily embed SQL code within an R or Python script, why not blend R and Python together?

There are basically two approaches by which we can use both Python and R side by side in a single project.

R within Python

PypeR provides a simple way to access R from Python through pipes. PypeR is also included in Python’s Package Index, which provides a more convenient way of installation. PypeR is especially useful when there is no need for frequent interactive data transfers between Python and R. By running R through a pipe, the Python program gains flexibility in sub-process control, memory control, and portability across popular operating system platforms, including Windows, GNU/Linux, and macOS.
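As a quick illustration, here is a minimal, hypothetical sketch of the workflow described above; it assumes PypeR is installed (pip install PypeR), that R is on your PATH, and the variable names are ours:

from pyper import R

r = R()                          # start an R subprocess connected through a pipe
r.assign('x', [1, 2, 3, 4])      # push a Python list into R as a vector
r('m <- mean(x)')                # run arbitrary R code in that subprocess
print(r.get('m'))                # pull the result back into Python -> 2.5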

pyRserve uses Rserve as an RPC connection gateway. Through such a connection, variables can be set in R from Python, and also R-functions can be called remotely. R objects are exposed as instances of Python-implemented classes, with R functions as bound methods to those objects in a number of cases.
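A minimal, hypothetical sketch of that workflow is shown below; it assumes pyRserve is installed (pip install pyRserve) and that an Rserve instance is already running (started from R with library(Rserve); Rserve()):

import pyRserve

conn = pyRserve.connect()        # connect to Rserve (localhost:6311 by default)
conn.r.x = [1.5, 2.5, 3.5]       # set a variable inside the remote R session
print(conn.eval('mean(x)'))      # evaluate an R expression remotely -> 2.5
conn.close()                     # close the connection when done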

rpy2 runs embedded R in a Python process. It creates a framework that can translate Python objects into R objects, pass them into R functions, and convert R output back into Python objects. Of the three, rpy2 is used most often, since it is actively developed.

One advantage of using R within Python is that we would be able to use R’s awesome packages like ggplot2, tidyr, and dplyr easily from Python. As an example, let’s see how easily we can use ggplot2 from Python.
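Here is a minimal sketch of that idea, adapted from the rpy2 graphics documentation linked below; it assumes R, the ggplot2 R package, and rpy2 are installed, and it uses R’s built-in mtcars dataset purely for illustration:

import rpy2.robjects.lib.ggplot2 as ggplot2
from rpy2.robjects.packages import importr, data

datasets = importr('datasets')                       # load R's built-in datasets package
mtcars = data(datasets).fetch('mtcars')['mtcars']    # fetch the mtcars data frame

# Build the plot exactly as we would in R: data + aesthetics + geometry
pp = (ggplot2.ggplot(mtcars)
      + ggplot2.aes_string(x='wt', y='mpg')
      + ggplot2.geom_point())
pp.plot()                                            # render in the active R graphics device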

https://rpy2.github.io/doc/latest/html/graphics.html#geometry

Resources

For a more in-depth review of rpy2, see the official documentation at https://rpy2.github.io/.

Python within R

We can run Python code from within R by using one of the alternatives below:

The rJython package implements an interface to Python via Jython. It is intended for other packages to be able to embed Python code along with R.

rPython is another package allowing R to call Python. It makes it possible to run Python code, make function calls, assign and retrieve variables, etc. from R.

SnakeCharmR is a modern, overhauled version of rPython. It is a fork of rPython that uses jsonlite and has a lot of improvements over rPython.

PythonInR makes accessing Python from within R very easy by providing functions to interact with Python directly from the R session.

The reticulate package provides a comprehensive set of tools for interoperability between Python and R. Out of all the above alternatives, this one is the most widely used, more so because it is being aggressively developed by Rstudio. Reticulate embeds a Python session within the R session, enabling seamless, high-performance interoperability. The package enables you to reticulate Python code into R, creating a new breed of a project that weaves together the two languages.

The reticulate package provides the following facilities:

  • Calling Python from R in a variety of ways, including sourcing Python scripts, importing Python modules, using Python interactively within an R session, and working in R Markdown.
  • Translation between R and Python objects (for example, between R and pandas data frames, or between R matrices and NumPy arrays).
  • Flexible binding to different versions of Python, including virtual environments and Conda environments.

Resources

Some great resources on using the reticulate package can be found in its official documentation at https://rstudio.github.io/reticulate/.

Conclusion

Both R and Python are quite robust languages, and either one of them is actually sufficient to carry out a data analysis task. However, both have their high and low points, and if we can utilize the strengths of both, we will end up doing a much better job. Either way, having knowledge of both will make us more flexible, thereby increasing our chances of being able to work in different environments.

References:

Interfacing R and Python — Andrew Collier

http://blog.yhat.com/tutorials/rpy2-combing-the-power-of-r-and-python.html


Python Programming for Data Science and Machine Learning

This article provides an overview of Python and its application to Data Science and Machine Learning and why it is important.

Originally published by Chris Kambala at dzone.com

Python is a general-purpose, high-level, object-oriented, and easy-to-learn programming language. It was created by Guido van Rossum, who is known as the godfather of Python.

Python is a popular programming language because of its simplicity, ease of use, open source licensing, and accessibility — the foundation of its renowned community, which provides great support and help in creating tons of packages, tutorials, and sample programs.

Python can be used to develop a wide variety of applications, ranging from web and desktop GUI programs to science and mathematics software, machine learning, and other big data computing systems.

Let’s explore the use of Python in Machine Learning, Data Science, and Data Engineering.

Machine Learning

Machine learning is a relatively new and evolving system development paradigm that has quickly become a mandatory requirement for companies and programmers to understand and use. See our previous article on Machine Learning for the background. Due to the complex, scientific computing nature of machine learning applications, Python is considered the most suitable programming language. This is because of its extensive and mature collection of mathematics and statistics libraries, extensibility, ease of use and wide adoption within the scientific community. As a result, Python has become the recommended programming language for machine learning systems development.

Data Science

Data science combines cutting edge computer and storage technologies with data representation and transformation algorithms and scientific methodology to develop solutions for a variety of complex data analysis problems encompassing raw and structured data in any format. A Data Scientist possesses knowledge of solutions to various classes of data-oriented problems and expertise in applying the necessary algorithms, statistics, and mathematic models, to create the required solutions. Python is recognized among the most effective and popular tools for solving data science related problems.

Data Engineering

Data Engineers build the foundations for Data Science and Machine Learning systems and solutions. Data Engineers are technology experts who start with the requirements identified by the data scientist. These requirements drive the development of data platforms that leverage complex data extraction, loading, and transformation to deliver structured datasets that allow the Data Scientist to focus on solving the business problem. Again, Python is an essential tool in the Data Engineer’s toolbox — one that is used every day to architect and operate the big data infrastructure that is leveraged by the data scientist.

Use Cases for Python, Data Science, and Machine Learning

Here are some example Data Science and Machine Learning applications that leverage Python.

  • Netflix uses data science to understand user viewing patterns and behavioral drivers. This, in turn, helps Netflix to understand user likes/dislikes and to predict and suggest relevant items to view.
  • Amazon, Walmart, and Target make heavy use of data science, data mining, and machine learning to understand users’ preferences and shopping behavior. This assists both in predicting demand to drive inventory management and in suggesting relevant products to online users or via email marketing.
  • Spotify uses data science and machine learning to make music recommendations to its users.
  • Spam filters use data science and machine learning algorithms to detect and prevent spam emails.

This article provided an overview of Python and its application to Data Science and Machine Learning and why it is important.

