When I was a university student, the statistics courses (Survival Analysis, Multivariate Analysis, etc.) were taught in R. Nevertheless, as I wished to learn Data Science, I chose Python because it seemed “spooky” to me.

Working only with Python, I stumbled upon the need to implement some statistical techniques such as Grubbs' test for outliers, Markov Chain Monte Carlo for simulations, or Bayesian Networks for synthetic data. This article is therefore intended as an introductory guide to incorporating R into your workflow as a Python Data Scientist. If you would like to integrate Python into your workflow as an R Data Scientist, the reticulate package is useful; check out [1].

rpy2

We chose the rpy2 framework (other options are pyRserve and pypeR) because it runs an embedded R. In other words, it allows communication between Python and R objects through rpy2.robjects; later we'll see a particular example when converting a pandas DataFrame to an R data frame. If you get stuck in any of the steps below, read the official documentation or the references.

We’ll cover the three steps needed to start working with R from within Python. Then we’ll work through a practical example and look at further functionality that the rpy2 package offers.

1. Install R packages.
2. Importing packages and functions from R.
3. Converting pandas DataFrame to R data frame and vice-versa.
4. Practical example (Running a Bayesian Network).

But first, we should install the rpy2 package.

# Jupyter Notebook option
!pip install rpy2
# Terminal option
pip install rpy2

1. Install R packages

In R, installing packages is performed by downloading them from CRAN mirrors and then installing them locally. In a similar way to Python modules, the packages can be installed and then loaded.

# Choosing a CRAN Mirror
import rpy2.robjects.packages as rpackages
utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1)

# Installing required packages
from rpy2.robjects.vectors import StrVector
packages = ('bnlearn',)  # add any other desired packages to this tuple
utils.install_packages(StrVector(packages))

By selecting ind=1 in chooseCRANmirror, we ensure automatic redirection to the server nearest to our location. Now, we’re going to cover step two.
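Reinstalling a package on every run is slow, so it can help to install only what is missing. The sketch below is one way to do that, assuming rpy2's `rpackages.isinstalled` is available in your version. The helper names `missing_packages` and `install_missing` are hypothetical and only serve this illustration. The rpy2 imports sit inside the function so the filtering logic can be tried without R.

```python
def missing_packages(wanted, is_installed):
    """Return the names in `wanted` for which `is_installed(name)` is false."""
    return [name for name in wanted if not is_installed(name)]


def install_missing(wanted):
    """Install only the R packages that are not present yet (needs R and rpy2)."""
    # Imported here so that the helper above can be used without R installed.
    import rpy2.robjects.packages as rpackages
    from rpy2.robjects.vectors import StrVector

    utils = rpackages.importr('utils')
    utils.chooseCRANmirror(ind=1)
    to_install = missing_packages(wanted, rpackages.isinstalled)
    if to_install:
        utils.install_packages(StrVector(to_install))
    return to_install


# Dry run without R: pretend that only 'utils' is already installed.
print(missing_packages(['bnlearn', 'utils'], lambda name: name == 'utils'))
```

With R available, calling `install_missing(('bnlearn',))` would replace the unconditional install shown earlier.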

2. Importing packages and functions

Here, we import the libraries and functions required to build a Bayesian Network in the practical example.

# Import packages
from rpy2.robjects.packages import importr
base, bnlearn = importr('base'), importr('bnlearn')

# Import Functions
bn_fit, rbn = bnlearn.bn_fit, bnlearn.rbn
hpc, rsmax2, tabu = bnlearn.hpc, bnlearn.rsmax2, bnlearn.tabu

To find out which functions can be imported, it is convenient to look at the ‘_rpy2r’ key in the dictionary of each package. For example, to see the functions available in bnlearn we run:

bnlearn.__dict__['_rpy2r']

Output:
...
...
'bn_boot': 'bn.boot',
  'bn_cv': 'bn.cv',
  'bn_cv_algorithm': 'bn.cv.algorithm',
  'bn_cv_structure': 'bn.cv.structure',
  'bn_fit': 'bn.fit',
  'bn_fit_backend': 'bn.fit.backend',
  'bn_fit_backend_continuous': 'bn.fit.backend.continuous',
...
...

For more info on how to import functions, check out [4] or [5].
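Judging from the output above, each Python-side name is the R name with its dots replaced by underscores, since a dot is not valid inside a Python identifier. The snippet below is a simplified stand-in for that renaming, not rpy2's actual implementation. The function name `python_name` is hypothetical.

```python
def python_name(r_name):
    """Simplified stand-in for the renaming seen above: dots become underscores."""
    return r_name.replace('.', '_')


# A few R names taken from the bnlearn output shown earlier.
r_names = ['bn.boot', 'bn.cv', 'bn.fit', 'bn.fit.backend']
mapping = {python_name(name): name for name in r_names}

for py_name, r_name in mapping.items():
    print(f"{py_name:<16} -> {r_name}")
```

This is why the function called bn.fit in R is reached as bnlearn.bn_fit from Python.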

3. Converting pandas DataFrame to R data frame and vice-versa

Personally, I think this functionality is what allows you to combine scalability (Python) with statistical tools (R). As a personal example, while I was using the multiprocessing Python library to implement parallel computation, I also wanted to try the auto.arima() function from the forecast R library for forecasting, alongside the functions of the statsmodels Python package. The robjects.conversion module is what lets you merge the best of the two programming languages.

# Allow conversion
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
pandas2ri.activate()

# Convert to R dataframe
r_dt = ro.conversion.py2rpy(dt) # dt is a pd.DataFrame object

# Convert back to pandas DataFrame        
pd_dt = ro.conversion.rpy2py(r_dt)

Once the pandas conversion is activated (pandas2ri.activate()), many conversions between R and pandas objects happen automatically. For an explicit conversion, we call the py2rpy or rpy2py functions.
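If you would rather not change the conversion rules globally, rpy2 also lets you scope them to a block of code. The sketch below assumes your rpy2 version provides `localconverter` in `rpy2.robjects.conversion` and that converter objects expose `py2rpy` and `rpy2py`; check the documentation of your installed version. The function name `roundtrip` is hypothetical, and the imports sit inside it so the snippet can be loaded on a machine without R.

```python
def roundtrip(df):
    """Convert a pandas DataFrame to an R data frame and back (needs R, rpy2, pandas)."""
    # Imported here so that defining the function does not require R.
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri
    from rpy2.robjects.conversion import localconverter

    # Default rules plus the pandas-specific ones, active only inside the block.
    converter = ro.default_converter + pandas2ri.converter
    with localconverter(converter):
        r_df = converter.py2rpy(df)      # pandas DataFrame -> R data frame
        pd_df = converter.rpy2py(r_df)   # R data frame -> pandas DataFrame
    return r_df, pd_df
```

Outside the `with` block the default rules apply again, which keeps the conversion from affecting unrelated code.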

4. Practical example with a Bayesian Network

Besides Monte Carlo methods, Bayesian Networks are an option for simulating data. However, as of this writing I could not find a Python library suited to this task. So, I opted for the bnlearn package, which lets you learn the graphical structure of Bayesian networks and perform inference on them.

In the example below, we use a hybrid algorithm (rsmax2) to learn the structure of the network because it allows any combination of constraint-based and score-based algorithms. Depending on the nature of the problem, you should choose the right heuristic; for the complete list of available algorithms see [7]. Once the network is learned, we fit its parameters with bn_fit and simulate n random samples from the Bayesian network with the rbn function. Finally, we wrap these calls in a try-except block to handle a particular type of error.

from rpy2.rinterface_lib.embedded import RRuntimeError

# imputados is a pandas DataFrame; convert it to an R data frame
r_imputados = ro.conversion.py2rpy(imputados)

try:
    # Learn the structure of the network
    structure = rsmax2(r_imputados, restrict='hpc', maximize='tabu')

    # Fit the parameters of the network by maximum likelihood
    fitted = bn_fit(structure, data=r_imputados, method="mle")

    # Generate n observations
    r_sim = rbn(fitted, n=10)

except RRuntimeError:
    print("Error while running R methods")

An RRuntimeError is raised when the R code being run fails. We catch it here so that the user is told the failure came from R rather than from the Python side (for the complete list of exceptions see [9]). As an illustration, I got an error saying that hybrid.pc.filter could not be found while running the rsmax2 function.

Further Functionalities

There is much more you can do with rpy2's low-level and high-level interfaces. For instance, you can call Python functions from R. Let’s see how to find the minimum of the four-dimensional Colville function with the conjugate gradient method.

from rpy2.robjects.vectors import FloatVector
from rpy2.robjects.packages import importr
import rpy2.rinterface as ri
stats = importr('stats')

# Colville f: R^4 ---> R
def Colville(x):
    x1, x2, x3, x4 = x[0], x[1], x[2], x[3]
    
    return   100*(x1**2-x2)**2 + (x1-1)**2+(x3-1)**2 + 90*(x3**2-x4)**2 + 10.1*((x2-1)**2 + (x4-1)**2) + 19.8*(x2-1)*(x4-1)

# Expose function to R
Colville = ri.rternalize(Colville)

# Initial point
init_point = FloatVector((3, 3, 3, 3))

# Optimization Function
res = stats.optim(init_point, Colville, method="CG")
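Before handing the function to R, it is worth a quick sanity check in plain Python. The Colville function has its global minimum at (1, 1, 1, 1), where its value is 0. The snippet below restates the same formula under the lowercase name `colville` (a name chosen for this illustration) and evaluates it at that point and at the starting point used above. It does not need R.

```python
def colville(x):
    """Four-dimensional Colville function, same formula as in the example above."""
    x1, x2, x3, x4 = x
    return (100 * (x1**2 - x2)**2 + (x1 - 1)**2 + (x3 - 1)**2
            + 90 * (x3**2 - x4)**2
            + 10.1 * ((x2 - 1)**2 + (x4 - 1)**2)
            + 19.8 * (x2 - 1) * (x4 - 1))


# Value at the known global minimum.
print(colville((1, 1, 1, 1)))  # → 0.0

# Value at the initial point passed to optim, far above the minimum.
print(colville((3, 3, 3, 3)))
```

If the optimizer converges, the point it returns should be close to (1, 1, 1, 1).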
