Guide to R and Python in a Single Jupyter Notebook

Why pick one when you can use both at the same time?

Originally published by Matthew Stewart, PhD Researcher at https://towardsdatascience.com

The war between R and Python users has been raging for several years. Most old-school statisticians were trained on R, while most university computer science and data science departments prefer Python; both languages have their pros and cons. The main differences I have noticed in practice are in the packages available for each language.

As of 2019, the R packages for cluster analysis and splines are superior to the Python packages of the same kind. In this article, I will show you, with coded examples, how to take R functions and datasets and import and utilize them within a Python-based Jupyter notebook.

The topics of this article are:

  • Importing (base) R functions
  • Importing R library functions
  • Populating vectors R understands
  • Populating dataframes R understands
  • Populating formulas R understands
  • Running models in R
  • Getting results back to Python
  • Getting model predictions in R
  • Plotting in R
  • Reading R’s documentation

Linear/Polynomial Regression

Firstly, we will look at performing basic linear and polynomial regression using imported R functions. We will examine a diabetes dataset containing C-peptide concentration and acidity variables. Do not worry about the contents of the model; this is a commonly used example in the field of generalized additive models, which we will look at later in the article.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

diab = pd.read_csv("data/diabetes.csv")

print("""
# Variables are:
#  subject:  subject ID number
#  age:      age diagnosed with diabetes
#  acidity:  a measure of acidity called base deficit
#  y:        natural log of serum C-peptide concentration
#
# Original source is Sockett et al. (1987)
# mentioned in Hastie and Tibshirani's book
# "Generalized Additive Models".
""")

display(diab.head())
display(diab.dtypes)
display(diab.describe())

We can then plot the data:

ax0 = diab.plot.scatter(x='age', y='y', c='Red', title="Diabetes data") # plotting directly from pandas!
ax0.set_xlabel("Age at Diagnosis")
ax0.set_ylabel("Log C-Peptide Concentration");

Linear regression with statsmodels. You may need to install the package in order to follow the code; you can do this with pip install statsmodels.

  • In Python, we usually work from a vector of target values and a design matrix we built ourselves (e.g. with PolynomialFeatures).
  • statsmodels' formula interface can instead build the target vector and design matrix for you.
#Using statsmodels
import statsmodels.formula.api as sm

model1 = sm.ols('y ~ age',data=diab)
fit1_lm = model1.fit()

Now we build a data frame to predict values on (sometimes this is just the test or validation set):

  • Very useful for making pretty plots of the model predictions — predict for TONS of values, not just whatever’s in the training set
x_pred = np.linspace(0,16,100)

predict_df = pd.DataFrame(data={"age":x_pred})
predict_df.head()

Use get_prediction(<data>).summary_frame() to get the model's prediction (and error bars!)

prediction_output = fit1_lm.get_prediction(predict_df).summary_frame()
prediction_output.head()

Plot the model and error bars

ax1 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares linear fit")
ax1.set_xlabel("Age at Diagnosis")
ax1.set_ylabel("Log C-Peptide Concentration")

ax1.plot(predict_df.age, prediction_output['mean'],color="green")
ax1.plot(predict_df.age, prediction_output['mean_ci_lower'], color="blue",linestyle="dashed")
ax1.plot(predict_df.age, prediction_output['mean_ci_upper'], color="blue",linestyle="dashed");

ax1.plot(predict_df.age, prediction_output['obs_ci_lower'], color="skyblue",linestyle="dashed")
ax1.plot(predict_df.age, prediction_output['obs_ci_upper'], color="skyblue",linestyle="dashed");

We can also fit a 3rd-degree polynomial model and plot the model and error bars in two ways:

  • Route 1: build a design dataframe with a column for each of age, age², and age³ (a sketch is shown below)
  • Route 2: just edit the formula, which is what the code that follows does
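A minimal sketch of Route 1, using hypothetical helper columns age2 and age3 added to a copy of the data:

poly_df = diab.copy()
poly_df['age2'] = poly_df['age']**2  # build the polynomial columns by hand
poly_df['age3'] = poly_df['age']**3
fit2_route1 = sm.ols(formula="y ~ age + age2 + age3", data=poly_df).fit()

Route 2 leaves the dataframe untouched and pushes the powers into the formula itself: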
fit2_lm = sm.ols(formula="y ~ age + np.power(age, 2) + np.power(age, 3)",data=diab).fit()

poly_predictions = fit2_lm.get_prediction(predict_df).summary_frame()
poly_predictions.head()

Plot the model and error bars as before:
ax2 = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data with least-squares cubic fit")
ax2.set_xlabel("Age at Diagnosis")
ax2.set_ylabel("Log C-Peptide Concentration")

ax2.plot(predict_df.age, poly_predictions['mean'],color="green")
ax2.plot(predict_df.age, poly_predictions['mean_ci_lower'], color="blue",linestyle="dashed")
ax2.plot(predict_df.age, poly_predictions['mean_ci_upper'], color="blue",linestyle="dashed");

ax2.plot(predict_df.age, poly_predictions['obs_ci_lower'], color="skyblue",linestyle="dashed")
ax2.plot(predict_df.age, poly_predictions['obs_ci_upper'], color="skyblue",linestyle="dashed");

This did not use any features of the R programming language. Now, we can repeat the analysis using functions from R.

Linear/Polynomial Regression, but make it R

After this section, we’ll know everything we need to in order to work with R models. The rest of the article is just applying these concepts to run particular models. This section, therefore, is your ‘cheat sheet’ for working in R.

What we need to know:

  • Importing (base) R functions
  • Importing R Library functions
  • Populating vectors R understands
  • Populating DataFrames R understands
  • Populating Formulas R understands
  • Running models in R
  • Getting results back to Python
  • Getting model predictions in R
  • Plotting in R
  • Reading R’s documentation

Importing R functions

To import R functions we need the rpy2 package. Depending on your environment, you may also need to specify the path to the R home directory; an example of how to specify this is given below.

# if you're on JupyterHub you may need to specify the path to R

#import os
#os.environ['R_HOME'] = "/usr/share/anaconda3/lib/R"

import rpy2.robjects as robjects

To import an R function, simply use robjects.r followed by the name of the function in square brackets, as a string. To prevent confusion, I like to use the prefix r_ for functions, libraries, and other objects imported from R.

r_lm = robjects.r["lm"]
r_predict = robjects.r["predict"]
#r_plot = robjects.r["plot"] # more on plotting later

#lm() and predict() are two of the most common functions we'll use

Importing R libraries

We can import individual functions, but we can also import entire libraries. To import an entire library, use the importr function from rpy2.robjects.packages.

from rpy2.robjects.packages import importr
#r_cluster = importr('cluster')
#r_cluster.pam;
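The commented-out lines above assume the cluster package is installed. As a minimal sketch using a package that ships with every R installation:

r_stats = importr('stats')  # 'stats' is part of base R, so this import should always succeed
r_rnorm = r_stats.rnorm     # the library's functions hang off the imported object
print(r_rnorm(5))           # five standard-normal draws, returned as an R vector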

Populating vectors R understands

To create a float vector that R functions can understand, we use the robjects.FloatVector function. The argument references the data array you wish to convert to an R object; in our case, the age and y variables from our diabetes dataset.

r_y = robjects.FloatVector(diab['y'])
r_age = robjects.FloatVector(diab['age'])

What happens if we pass the wrong type? How does r_age display? How does r_age print?
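A quick sketch for exploring these questions (the exact rendering depends on your rpy2 version):

display(r_age)  # shows rpy2's FloatVector representation
print(r_age)    # print() renders it the way R would, beginning "[1] ..."

# passing non-numeric data forces a conversion attempt, which will
# typically raise an error rather than silently coerce:
# robjects.FloatVector(['a', 'b'])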

Populating Dataframes R understands

We can specify individual vectors, and we can also specify entire dataframes. This is done using the robjects.DataFrame function. The argument is a dictionary mapping each column name to the vector (obtained from robjects.FloatVector) associated with that name.

diab_r = robjects.DataFrame({"y":r_y, "age":r_age})

How does diab_r display? How does diab_r print?
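Again, a one-line sketch answers both questions:

print(diab_r)  # print() renders the object as an R data.frame;
               # the bare object displays rpy2's DataFrame repr instead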

Populating formulas R understands

To specify a formula, for example, for regression, we can use the robjects.Formula function. This follows the R syntax dependent variable ~ independent variables . In our case, the output y is modeled as a function of the age variable.

simple_formula = robjects.Formula("y~age")
simple_formula.environment["y"] = r_y #populate the formula's .environment, so it knows what 'y' and 'age' refer to
simple_formula.environment["age"] = r_age

Notice that in the above formula we had to specify the FloatVectors associated with each of the variables in our formula. We have to do this because the formula does not automatically relate our variable names to the variables we previously created — they have not yet been associated with the robjects.Formula object.

Running Models in R

To specify a model, in this case a linear regression model using our previously imported r_lm function, we need to pass our formula variable as an argument (this will not work unless we pass an R formula object).

diab_lm = r_lm(formula=simple_formula) # the formula object is storing all the needed variables

Instead of specifying each of the individual float vectors related to the robjects.Formula object, we can reference the dataset in the formula itself (as long as this has been made into an R object itself).

simple_formula = robjects.Formula("y~age") # reset the formula
diab_lm = r_lm(formula=simple_formula, data=diab_r) #can also use a 'dumb' formula and pass a dataframe

Getting results back to Python

Using R functions and libraries is great, but we can also analyze our results and get them back to Python for further processing. To look at the output:

diab_lm #the result is already 'in' python, but it's a special object

We can also check the names in our output:

print(diab_lm.names) # view all names

To take the first element of our output:

diab_lm[0] #grab the first element

To take the coefficients:

diab_lm.rx2("coefficients") #use rx2 to get elements by name!

To put the coefficients in a Numpy array:

np.array(diab_lm.rx2("coefficients")) #r vectors can be converted to numpy (but rarely needed)

Getting Predictions

To get predictions using our R model, we can create a prediction dataframe and use the r_predict function, similar to how it is done using Python.

# make a df to predict on (might just be the validation or test dataframe)
predict_df = robjects.DataFrame({"age": robjects.FloatVector(np.linspace(0,16,100))})

# call R's predict() function, passing the model and the data
predictions = r_predict(diab_lm, predict_df)

We can use the rx2 method to extract the ‘age’ values:

x_vals = predict_df.rx2("age")

We can also plot our data using Python:

ax = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data")
ax.set_xlabel("Age at Diagnosis")
ax.set_ylabel("Log C-Peptide Concentration");ax.plot(x_vals,predictions); #plt still works with r vectors as input!

We can also plot using R, although this is slightly more involved.

Plotting in R

To plot in R, we need to turn on the %R magic function using the following command:

%load_ext rpy2.ipython
  • The above turns on the %R “magic”.
  • R’s plot() command responds differently based on what you hand to it; different models get different plots!
  • For any specific model, search for plot.modelname; for example, search plot.gam for details on plotting a GAM model.
  • The %R “magic” runs R code in ‘notebook’ mode, so figures display nicely
  • Ahead of the plot(<model>) code we pass in the variables R needs to know about (-i is for "input")
%R -i diab_lm plot(diab_lm);

Reading R’s documentation

The documentation for the lm() function is here, and a prettier version (same content) is here. When Googling, prefer rdocumentation.org when possible. The sections are:

  • Usage: gives the function signature, including all optional arguments
  • Arguments: What each function input controls
  • Details: additional info on what the function does and how arguments interact. Often the right place to start reading
  • Value: the structure of the object returned by the function
  • References: The relevant academic papers
  • See Also: other functions of interest
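As a quick sketch, you can even pull the Usage section into the notebook by capturing the output of R's base args() function:

# print lm()'s signature (the "Usage" section) without leaving Python
print('\n'.join(robjects.r('capture.output(print(args(lm)))')))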

Example

As an example to test our newly acquired knowledge, we will try the following:

  • Add confidence intervals calculated in R to the linear regression plot above. Use the interval= argument to r_predict() (documentation here). You will have to work with a matrix returned by R.
  • Fit a 5th-degree polynomial to the diabetes data in R. Search the web for an easier method than writing out a formula with all 5 polynomial terms.

Confidence intervals:

CI_matrix = np.array(r_predict(diab_lm, predict_df, interval="confidence"))

ax = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data")
ax.set_xlabel("Age at Diagnosis")
ax.set_ylabel("Log C-Peptide Concentration");

ax.plot(x_vals,CI_matrix[:,0], label="prediction")
ax.plot(x_vals,CI_matrix[:,1], label="95% CI", c='g')
ax.plot(x_vals,CI_matrix[:,2], label="95% CI", c='g')
plt.legend();

5th-degree polynomial:

poly5_formula = robjects.Formula("y~poly(age,5)") # R's poly() builds all 5 terms for us
diab5_lm = r_lm(formula=poly5_formula, data=diab_r) # again passing a 'dumb' formula and a dataframe

predictions = r_predict(diab5_lm, predict_df, interval="confidence")

ax = diab.plot.scatter(x='age',y='y',c='Red',title="Diabetes data")
ax.set_xlabel("Age at Diagnosis")
ax.set_ylabel("Log C-Peptide Concentration");

ax.plot(x_vals,predictions);


Lowess Smoothing

Now that we know how to use R objects and functions within Python, we can look at cases where we might want to do this. The first example is Lowess smoothing.

Lowess smoothing is implemented in both Python and R. We’ll use it as another example as we transition languages.

Python

In Python, we use statsmodels.nonparametric.smoothers_lowess to perform lowess smoothing.

from statsmodels.nonparametric.smoothers_lowess import lowess

ss1 = lowess(diab['y'],diab['age'],frac=0.15)
ss2 = lowess(diab['y'],diab['age'],frac=0.25)
ss3 = lowess(diab['y'],diab['age'],frac=0.7)
ss4 = lowess(diab['y'],diab['age'],frac=1)

ss1[:10,:] # we simply get back a smoothed y value for each x value in the data

Notice the clean code to plot different models. We’ll see even cleaner code in a minute.

for cur_model, cur_frac in zip([ss1,ss2,ss3,ss4],[0.15,0.25,0.7,1]):
   ax = diab.plot.scatter(x='age',y='y',c='Red',title="Lowess Fit, Fraction = {}".format(cur_frac))
   ax.set_xlabel("Age at Diagnosis")
   ax.set_ylabel("Log C-Peptide Concentration")
   ax.plot(cur_model[:,0],cur_model[:,1],color="blue")
   
   plt.show()

R

To implement Lowess smoothing in R we need to:

  • Import the loess function.
  • Send the data over to R.
  • Call the function and get results.
r_loess = robjects.r['loess.smooth'] #extract R function
r_y = robjects.FloatVector(diab['y'])
r_age = robjects.FloatVector(diab['age'])

ss1_r = r_loess(r_age, r_y, span=0.15, degree=1)
ss1_r # again, a smoothed y value for each x value in the data

Varying span

Next, some extremely clean code to fit and plot models with various parameter settings. (Though the zip() method seen earlier is great when e.g. the label and the parameter differ)

for cur_frac in [0.15,0.25,0.7,1]:
   
   cur_smooth = r_loess(r_age, r_y, span=cur_frac)

   ax = diab.plot.scatter(x='age',y='y',c='Red',title="Lowess Fit, Fraction = {}".format(cur_frac))
   ax.set_xlabel("Age at Diagnosis")
   ax.set_ylabel("Log C-Peptide Concentration")
   ax.plot(cur_smooth[0], cur_smooth[1], color="blue")
   
   plt.show()

The next example we will look at is smoothing splines; these models are not well supported in Python, so using R functions is preferred.

Smoothing Splines

From this point forward, we’re working with R functions; these models aren’t (well) supported in Python.

For clarity: this is the fancy spline model that minimizes

$$\sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda \int f''(x)^2 \, dx$$

across all possible functions f. The winner will always be a continuous, cubic polynomial with a knot at each data point.

Some things to think about are:

  • Any idea why the winner is cubic?
  • How interpretable is this model?
  • What are the tunable parameters?

To implement the smoothing spline, we only need two lines.

r_smooth_spline = robjects.r['smooth.spline'] #extract R function

# run smoothing function
spline1 = r_smooth_spline(r_age, r_y, spar=0)

Smoothing Spline Cross-Validation

R’s smooth.spline function has built-in cross-validation to find a good value for lambda. See the package docs.

spline_cv = r_smooth_spline(r_age, r_y, cv=True)
lambda_cv = spline_cv.rx2("lambda")[0]

ax19 = diab.plot.scatter(x='age',y='y',c='Red',title="smoothing spline with $\lambda=$"+str(np.round(lambda_cv,4))+", chosen by cross-validation")
ax19.set_xlabel("Age at Diagnosis")
ax19.set_ylabel("Log C-Peptide Concentration")
ax19.plot(spline_cv.rx2("x"),spline_cv.rx2("y"),color="darkgreen")

Natural & Basis Splines

Here, we take a step backward on model complexity, but a step forward in coding complexity. We’ll be working with R’s formula interface again, so we will need to populate Formulas and Dataframes.

Some more food for thought:

  • In what way are Natural and Basis splines less complex than the splines we were just working with?
  • What makes a spline ‘natural’?
  • What makes a spline ‘basis’?
  • What are the tuning parameters?
# We will now work with a new dataset, called GAGurine.
# The dataset description (from the R package MASS) is below:
#
# "Data were collected on the concentration of a chemical GAG
# in the urine of 314 children aged from zero to seventeen years.
# The aim of the study was to produce a chart to help a paediatrician
# to assess if a child's GAG concentration is 'normal'."
#
# The variables are:
#  Age: age of child in years
#  GAG: concentration of GAG (the units have been lost)

First, we import and plot the dataset:

GAGurine = pd.read_csv("data/GAGurine.csv")
display(GAGurine.head())

ax31 = GAGurine.plot.scatter(x='Age',y='GAG',c='black',title="GAG in urine of children")
ax31.set_xlabel("Age");
ax31.set_ylabel("GAG");

Standard stuff: import function, convert variables to R format, call function

from rpy2.robjects.packages import importr
r_splines = importr('splines')

# populate R variables
r_gag = robjects.FloatVector(GAGurine['GAG'].values)
r_age = robjects.FloatVector(GAGurine['Age'].values)
r_quarts = robjects.FloatVector(np.quantile(r_age,[.25,.5,.75])) #woah, numpy functions run on R objects

What happens when we call the ns or bs functions from r_splines?

ns_design = r_splines.ns(r_age, knots=r_quarts)
bs_design = r_splines.bs(r_age, knots=r_quarts)
print(ns_design)

ns and bs return design matrices, not model objects! That's because they're meant to work with lm's formula interface. To get a model object we populate a formula including ns(<var>,<knots>) and fit to data.

r_lm = robjects.r['lm']
r_predict = robjects.r['predict']

Populate the formula:

ns_formula = robjects.Formula("Gag ~ ns(Age, knots=r_quarts)")
ns_formula.environment['Gag'] = r_gag
ns_formula.environment['Age'] = r_age
ns_formula.environment['r_quarts'] = r_quarts
        

Fit the model:

ns_model = r_lm(ns_formula)

Predict like usual: build a dataframe to predict on and call predict() .

# predict
predict_frame = robjects.DataFrame({"Age": robjects.FloatVector(np.linspace(0,20,100))})

ns_out = r_predict(ns_model, predict_frame)

ax32 = GAGurine.plot.scatter(x='Age',y='GAG',c='grey',title="GAG in urine of children")
ax32.set_xlabel("Age")
ax32.set_ylabel("GAG")
ax32.plot(predict_frame.rx2("Age"),ns_out, color='red')
ax32.legend(["Natural spline, knots at quartiles"]);

Examples

Let’s look at two examples of implementing basis splines.

  1. Fit a basis spline model with the same knots, and add it to the plot above.
bs_formula = robjects.Formula("Gag ~ bs(Age, knots=r_quarts)")
bs_formula.environment['Gag'] = r_gag
bs_formula.environment['Age'] = r_age
bs_formula.environment['r_quarts'] = r_quarts

bs_model = r_lm(bs_formula)
bs_out = r_predict(bs_model, predict_frame)

ax32 = GAGurine.plot.scatter(x='Age',y='GAG',c='grey',title="GAG in urine of children")
ax32.set_xlabel("Age")
ax32.set_ylabel("GAG")
ax32.plot(predict_frame.rx2("Age"),ns_out, color='red')
ax32.plot(predict_frame.rx2("Age"),bs_out, color='blue')
ax32.legend(["Natural spline, knots at quartiles","B-spline, knots at quartiles"]);

2. Fit a basis spline with 8 knots placed at [2,4,6…14,16] and add it to the plot above.

overfit_formula = robjects.Formula("Gag ~ bs(Age, knots=r_quarts)")
overfit_formula.environment['Gag'] = r_gag
overfit_formula.environment['Age'] = r_age
overfit_formula.environment['r_quarts'] = robjects.FloatVector(np.array([2,4,6,8,10,12,14,16]))

overfit_model = r_lm(overfit_formula)
overfit_out = r_predict(overfit_model, predict_frame)

ax32 = GAGurine.plot.scatter(x='Age',y='GAG',c='grey',title="GAG in urine of children")
ax32.set_xlabel("Age")
ax32.set_ylabel("GAG")
ax32.plot(predict_frame.rx2("Age"),ns_out, color='red')
ax32.plot(predict_frame.rx2("Age"),bs_out, color='blue')
ax32.plot(predict_frame.rx2("Age"),overfit_out, color='green')
ax32.legend(["Natural spline, knots at quartiles", "B-spline, knots at quartiles", "B-spline, lots of knots"]);

GAMs

We come, at last, to our most advanced model. The coding here isn’t any more complex than what we’ve done before, though the behind-the-scenes machinery is impressive.

First, let’s get our multivariate data.

kyphosis = pd.read_csv("data/kyphosis.csv")

print("""
# kyphosis - whether a particular deformation was present post-operation
# age - patient's age in months
# number - the number of vertebrae involved in the operation
# start - the number of the topmost vertebrae operated on
""")

display(kyphosis.head())
display(kyphosis.describe(include='all'))
display(kyphosis.dtypes)

# If there are errors about missing R packages, run the code below:
#r_utils = importr('utils')
#r_utils.install_packages('codetools')
#r_utils.install_packages('gam')

To fit a GAM, we

  • Import the gam library
  • Populate a formula including s(<var>) on variables which we want to smooth
  • Call gam(formula, family=<string>) where family is a string naming a probability distribution, chosen based on how the response variable is thought to occur.

Rough family guidelines:

  • Response is binary or “N occurrences out of M tries”, e.g. number of lab rats (out of 10) developing disease: choose "binomial"
  • Response is a count with no logical upper bound, e.g. number of ice creams sold: choose "poisson"
  • Response is real, with normally-distributed noise, e.g. person’s height: choose "gaussian" (the default)
# There is a Python library in development for using GAMs
# (https://github.com/dswah/pyGAM), but it is not yet as comprehensive
# as the R gam library, which we will use here instead. R also has the
# mgcv library, which implements some more advanced/flexible fitting methods.

r_gam_lib = importr('gam')
r_gam = r_gam_lib.gam

r_kyph = robjects.FactorVector(kyphosis[["Kyphosis"]].values)
r_Age = robjects.FloatVector(kyphosis[["Age"]].values)
r_Number = robjects.FloatVector(kyphosis[["Number"]].values)
r_Start = robjects.FloatVector(kyphosis[["Start"]].values)

kyph1_fmla = robjects.Formula("Kyphosis ~ s(Age) + s(Number) + s(Start)")
kyph1_fmla.environment['Kyphosis']=r_kyph
kyph1_fmla.environment['Age']=r_Age
kyph1_fmla.environment['Number']=r_Number
kyph1_fmla.environment['Start']=r_Start

kyph1_gam = r_gam(kyph1_fmla, family="binomial")

The fitted gam model has a lot of interesting data within it:

print(kyph1_gam.names)
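Individual elements can be pulled out by name with rx2, just as with the lm results earlier; a minimal sketch (element names such as "coefficients" and "fitted.values" follow R's fitted gam object):

print(np.array(kyph1_gam.rx2("coefficients")))          # model coefficients
fitted_vals = np.array(kyph1_gam.rx2("fitted.values"))  # per-point fitted values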

Remember plotting? Calling R’s plot() on a gam model is the easiest way to view the fitted splines.

%R -i kyph1_gam plot(kyph1_gam, residuals=TRUE,se=TRUE, scale=20);

Prediction works as usual (build a data frame to predict on, if you don’t already have one, and call predict()). However, predict always reports the sum of the individual variable effects; if family is non-default, this can differ from the actual prediction for that point.

For instance, we’re doing a ‘logistic regression’, so the raw prediction is log-odds, but we can get probabilities by using predict(..., type="response").

kyph_new = robjects.DataFrame({'Age': robjects.IntVector((84,85,86)),
                              'Start': robjects.IntVector((5,3,1)),
                              'Number': robjects.IntVector((1,6,10))})

print("Raw response (so, log odds):")
display(r_predict(kyph1_gam, kyph_new))
print("Scaled response (so, probability of kyphosis):")
display(r_predict(kyph1_gam, kyph_new, type="response"))

Final Comments

Using R functions in Python is relatively easy once you are familiar with the procedure, and it can save a lot of headaches if you need to use R packages to perform your data analysis or are a Python user who has been given R code to work with.

I hope you enjoyed this article and found it informative and useful. All the code used in this notebook can be found on my GitHub page for those of you who wish to experiment with interfacing between R and Python functions and objects in the Jupyter environment.
