From 0 to Machine Learning with R

A tour of R coding, dataframes, applying a model.

I recently joined a project that uses R and, coming from Python, I struggled for a few days with the R documentation. It is not as beginner-friendly as Python’s, and putting everything together can be painful at first. This article gathers the elements and concepts needed to apply a machine learning model to a raw data file with R.

Let’s get started with R, pick a dataset and start working along the code snippets.

Getting started

Install the R language on your computer.

Install a free and complete IDE : RStudio.

Libraries

The basic R install doesn’t come with every library. Just like with pip in Python, new libraries are installed with this command, in the R console:

install.packages("thepackagename")

# Some common packages
install.packages("readr")

Once a package is installed, here are some operations for libraries :

# Load readr in your code
library(readr)

# Load a list of libraries
p.names <- c('xgboost', 'caret', 'dplyr', 'e1071')
lapply(p.names, library, character.only = TRUE)

# Manage libraries
installed.packages()
remove.packages("thepackagename")

Getting help and documentation

?functionName
help(functionName)
example(functionName)

Writing some code

Variables

Defining variables is straightforward: the “=” and “<-” operators can both be used for assignment. One unusual thing, if you come from Python, is that variable names may contain dots “.”, so you get variables like “my.data.vector”. This is very common in code snippets found online.

# Create new variables
my_var <- 54
my.data.vector = c(34, 54, 65)

# Clean a variable (the name stays defined, bound to NULL)
my_var <- NULL
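As a small illustration of the points above, the sketch below shows both assignment operators, a dotted variable name, and the difference between assigning NULL and removing a variable with rm(). The names var_exists_after_null and var_exists_after_rm are made up for this example.

```r
# Both assignment operators create a variable
my_var <- 54
my.data.vector = c(34, 54, 65)

# Dots are ordinary characters in a name, not attribute access as in Python
n_values <- length(my.data.vector)

# Assigning NULL keeps the name defined, bound to NULL
my_var <- NULL
var_exists_after_null <- exists("my_var", inherits = FALSE)

# rm() removes the binding from the environment altogether
rm(my_var)
var_exists_after_rm <- exists("my_var", inherits = FALSE)

print(c(var_exists_after_null, var_exists_after_rm))
```

If the goal is to free the name entirely, rm() is probably the closer match to Python’s del.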

Functions

Functions in R are similar to Python functions :

·        Assign the function like you would assign a variable.

·        Use the function keyword with parameters inside parenthesis.

·        Use return as exit points

The following small function, named prod_age, takes creation_date as its argument. With an if statement, we handle the missing (NA) case; otherwise we cast the value to a date.

prod_age <- function(creation_date) {
 if (is.na(creation_date)) {return(as.numeric(-1))}
 else {return(as.Date(creation_date))}
}
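To see the function in action, here is a minimal usage sketch. It repeats the definition so the snippet runs on its own; the input values and the names known and unknown are made up for illustration. Note that the if condition expects a single value, so the function is called on one element at a time.

```r
# Same function as above, repeated so this snippet is self-contained
prod_age <- function(creation_date) {
  if (is.na(creation_date)) {
    return(as.numeric(-1))
  } else {
    return(as.Date(creation_date))
  }
}

# A valid date string is converted to a Date object
known <- prod_age("2020-01-15")
print(class(known))

# A missing value falls into the first branch and yields -1
unknown <- prod_age(NA)
print(unknown)
```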

Working with dataframes

Load data, read files

The read_delim function, from the readr library, offers many options for reading most delimited file types.

In the example below, we specify the data type of each column. The file has 56 columns, and we want all of them to be read as characters, so we use the col_types argument with “c…c”, each character corresponding to a column.

# Load the library
library(readr)

# Create dataframe from CSV file
my_dataframe <- read_delim("C:/path/to/file.csv",
  delim = "|",
  escape_double = FALSE,
  col_types = paste(rep("c", 56), collapse = ""))
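If readr is not installed yet, the same idea can be tried with base R. The sketch below is not the readr call above but a rough base R counterpart: read.delim with sep for the delimiter and colClasses to force every column to character. The file content is made up, and the file is written to a temporary location so the snippet is self-contained.

```r
# Write a tiny pipe-delimited file with made-up data
tmp <- tempfile(fileext = ".csv")
writeLines(c("id|amount|date",
             "001|12.5|2020-01-15",
             "002|7|2020-02-01"), tmp)

# Base R counterpart: sep sets the delimiter,
# colClasses = "character" reads every column as text
df <- read.delim(tmp, sep = "|", colClasses = "character")
str(df)

# How the col_types string is built, here for 3 columns instead of 56
col_spec <- paste(rep("c", 3), collapse = "")
print(col_spec)
```

Reading everything as character keeps values such as "001" intact, which would lose their leading zeros if parsed as numbers.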

Subsetting a dataframe

Dataframes are not only encountered when importing a dataset; sometimes function results are dataframes too. The main tool for subsetting is the bracket operator.

·        To access a specific column, use the $ operator, which is very convenient.

y_train <- my.dataframe$label

To access specific rows, we use the [] operator. You might be familiar with this syntax : [rows, columns]

# Works with numeric indices (R indexing starts at 1)
y_train <- my.dataframe[1:100, 8]

# Works with negative indices to exclude
y_test <- my.dataframe[-(1:100), 8]

Here is another technique, still using the bracket syntax. The which and names functions are used to subset rows and columns.

filtered.dataframe <- my.dataframe[
   which(my.dataframe$col1 == 2),     # Filter rows on condition
   names(my.dataframe) %in% c("col1","col2","col3")] # Subset cols

The subset function : first argument is the dataframe, then the filter condition on rows, then the columns to select.

filtered.dataframe <- subset(
   my.dataframe, 
   col1 == 2, 
   select = c("col1","col2","col3"))
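The subsetting techniques above can be compared on a toy dataframe. This is a minimal sketch with made-up data; only the column names col1 to col3 come from the snippets above, while col4 and the result names are hypothetical.

```r
# Toy dataframe with made-up values
my.dataframe <- data.frame(
  col1 = c(1, 2, 2, 3),
  col2 = c("a", "b", "c", "d"),
  col3 = c(TRUE, FALSE, TRUE, FALSE),
  col4 = c(10, 20, 30, 40)
)

# Positive indices select rows, negative indices exclude them
first_two <- my.dataframe[1:2, ]
without_first_two <- my.dataframe[-(1:2), ]

# Bracket syntax with which() for rows and %in% for columns
by_brackets <- my.dataframe[
  which(my.dataframe$col1 == 2),
  names(my.dataframe) %in% c("col1", "col2", "col3")]

# subset() expresses the same filter more compactly
by_subset <- subset(my.dataframe, col1 == 2,
                    select = c("col1", "col2", "col3"))

print(by_brackets)
print(by_subset)
```

Both approaches keep the rows where col1 equals 2 and drop col4; subset() tends to read better, while brackets are handy when the column names are stored in variables.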

Modify column values

When a dataframe object is created, we access specific columns with the $ operator.

# Filtering rows based on a specific column value
my_dataframe <- subset(my_dataframe, COLNAME != 'str_value')

# Assign 0 where column values match condition
non_conformites$REGUL_DAYS[non_conformites$REGUL_DAYS_NUM < 0] <- 0

# Create new column from existing columns
table$AMOUNT <- table$Q_LITIG * table$PRICE

# Delete a column
my_dataframe$COLNAME <- NULL
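Here is how these four operations chain together on a small dataframe. The data and the name orders are made up for illustration; the column names loosely follow the snippets above.

```r
# Toy dataframe with made-up values
orders <- data.frame(
  STATUS = c("open", "closed", "open"),
  REGUL_DAYS = c(-3, 5, 12),
  Q_LITIG = c(2, 1, 4),
  PRICE = c(10, 25, 5)
)

# Keep only the rows whose STATUS is not "closed"
orders <- subset(orders, STATUS != "closed")

# Replace negative values with 0
orders$REGUL_DAYS[orders$REGUL_DAYS < 0] <- 0

# Create a new column from two existing ones
orders$AMOUNT <- orders$Q_LITIG * orders$PRICE

# Delete a column
orders$STATUS <- NULL

print(orders)
```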

Apply a function to a column

Once we have a dataframe and functions ready, we often need to apply functions on columns, to apply transformations.

Here we use the apply function. It applies an operation over the rows or columns of a matrix-like object, so it’s not limited to dataframes. A dataframe is first converted to a matrix, so every element ends up with the same type.

# Product age function
prod_age <- function(creation_date) {
 if (is.na(creation_date)) {return(as.numeric(-1))}
 else {return(as.Date(creation_date))}
}

# Apply function on column
mytable$PRODUCT_AGE <- apply(mytable[, c('DATE_CREA'), drop = FALSE], 1, function(x) prod_age(x))
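The type remark is easy to check on a toy example. In the sketch below (made-up data, hypothetical names mixed, row_types and doubled), apply sees every row as character because the dataframe mixes numbers and strings, whereas sapply on a single column leaves the column type alone. sapply is shown only as a possible alternative, not as the approach used above.

```r
# Toy dataframe mixing a numeric and a character column
mixed <- data.frame(id = c(1, 2), label = c("a", "b"))

# apply() converts the dataframe to a matrix first,
# so each row arrives as a character vector
row_types <- apply(mixed, 1, function(row) class(row))
print(row_types)

# sapply() walks over one column and keeps its numeric type
doubled <- sapply(mixed$id, function(x) x * 2)
print(doubled)
```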

Working with dates

When working with dates, the first step is to go from a date string to a date object. The as.Date function does exactly this, parsing the string with the specified format. At the end of this article, you will find every date format available for the format argument.

# Convert a column into date format
sales$date_f <- as.Date(sales$date, format = '%d/%m/%Y')

# Create column from time difference
mytable$REGUL_DAYS = as.numeric(difftime(
  strptime(mytable$closing, "%Y-%m-%d"),
  strptime(mytable$opening, "%Y-%m-%d"),
  units = "days"))
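A quick worked example with made-up dates shows both steps: parsing a day/month/year string, then computing a difference in days. The names opening, closing and regul_days are chosen for this sketch only.

```r
# Parse day/month/year strings into Date objects
sales <- data.frame(date = c("15/01/2020", "01/02/2020"))
sales$date_f <- as.Date(sales$date, format = "%d/%m/%Y")
print(sales)

# Difference between two dates, expressed as a plain number of days
opening <- as.Date("2020-01-15")
closing <- as.Date("2020-02-01")
regul_days <- as.numeric(difftime(closing, opening, units = "days"))
print(regul_days)
```

Wrapping difftime in as.numeric drops the time-difference class, which makes the result easier to use in later arithmetic.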

Export dataframe

Several built-in functions write dataframes to files. A very common format is CSV. The RDS format, however, can store any R object in serialized, gzip-compressed form.

# Write to CSV (forward slashes avoid backslash escapes in the path)
write.csv(non_conformites,
 'C:/Users/path/export.csv',
 row.names = FALSE)

# Write to RDS
saveRDS(feature_importance_values, file = "c:/path/folder/feature_importance.RDS")
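To check that an export worked, reading the file back is a simple test. The sketch below uses a made-up dataframe and temporary files, so nothing is written to a real project folder; the variable names are hypothetical.

```r
# Toy dataframe with made-up values
df <- data.frame(id = 1:3, score = c(0.5, 0.25, 1))

# Temporary paths keep the example self-contained
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Export in both formats
write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, file = rds_path)

# Read both files back
from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

# RDS restores the R object as it was saved
print(identical(df, from_rds))
```

CSV is the better choice when the file must be opened by other tools, while RDS keeps column types and works for objects that are not tables at all, such as a fitted model.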

Plotting

Just like Python, R comes with several libraries for plotting data. The plot function is similar to plt.plot in Python.

RStudio is very convenient for plotting: it has a dedicated plotting window, with the possibility to go back to previous plots.

Line charts

plot(
 ref_sales$Date, ref_sales$Sales,
 type = 'l',
 xlab = "Date", ylab = "Sales",
 main = paste('Sales evolution over time for : ', article_ref)
)

Various charts

R being the language of statisticians, it comes with various charts for plotting data distributions.

values <- c(1, 4, 8, 2, 4)
barplot(values)
hist(values)
pie(values)
boxplot(values)

Machine learning : XGBoost library

The xgboost package is a good starting point, as it is well documented. It makes it possible to gain quick insights into a dataset, such as feature importance, as we will see below.

For this part, we need these specific libraries:

xgboost : Let’s work with the famous XGBoost algorithm.

caret : Classification And REgression Training, includes lots of data processing functions

dplyr : A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

Train-Test split

Once the dataframe is prepared, we split it into train and test sets, using an index (inTrain) :

set.seed(1337)
inTrain <- createDataPartition(y = my.dataframe$label, p = 0.85, list = FALSE)

X_train = xgb.DMatrix(as.matrix(my.dataframe[inTrain, ] %>% select(-label)))
y_train = my.dataframe[inTrain, ]$label
X_test = xgb.DMatrix(as.matrix(my.dataframe[-inTrain, ] %>% select(-label)))
y_test = my.dataframe[-inTrain, ]$label
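The indexing logic behind the split can be tried without caret or xgboost. The sketch below is a simplified stand-in: it draws a plain random sample with base R’s sample(), whereas createDataPartition also tries to preserve the label distribution. The toy data and the names toy, in_train, train_set and test_set are made up.

```r
set.seed(1337)

# Toy dataframe: one feature and a two-class label
toy <- data.frame(x = rnorm(20), label = rep(c("a", "b"), each = 10))

# Draw 85% of the row indices at random for the training set
n_train <- round(0.85 * nrow(toy))
in_train <- sample(seq_len(nrow(toy)), size = n_train)

# Positive indices select the training rows,
# negative indices give everything else
train_set <- toy[in_train, ]
test_set <- toy[-in_train, ]

print(c(nrow(train_set), nrow(test_set)))
```

The same index vector is used twice, once as is and once negated, which guarantees that no row ends up in both sets.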

Parameter search for XGBoost

What the following function does :

- Take our train/test sets as input.

- Define a trainControl for cross validation .

- Define a grid for parameters.

- Set up an XGB model including the parameter search.

- Evaluate the model’s accuracy

- Return the set of best parameters

param_search <- function(xtrain, ytrain, xtest, ytest) {

  # Cross validation init
  xgb_trcontrol = trainControl(method = "cv", number = 5, allowParallel = TRUE,
    verboseIter = T, returnData = FALSE)

  # Param grid
  xgbGrid <- expand.grid(nrounds = 60, #nrounds = c(10,20,30,40),
    max_depth = 20, #max_depth = c(3, 5, 10, 15, 20, 30),
    colsample_bytree = 0.6, #colsample_bytree = seq(0.5, 0.9, length.out = 5),
    eta = 0.005, #eta = c(0.001, 0.0015, 0.005, 0.1),
    gamma = 0, min_child_weight = 1, subsample = 1
  )

  # Model and parameter search
  xgb_model = train(xtrain, ytrain, trControl = xgb_trcontrol,
    tuneGrid = xgbGrid, method = "xgbTree",
    verbose = 2,
    #objective = "multi:softprob",
    eval_metric = "mlogloss")
    #num_class = 3)

  # Evaluate the model
  xgb.pred = predict(xgb_model, xtest, reshape = T)
  xgb.pred = as.data.frame(xgb.pred, col.names = c("pred"))
  result = sum(xgb.pred$xgb.pred == ytest) / nrow(xgb.pred)
  print(paste("Final Accuracy =", sprintf("%1.2f%%", 100 * result)))

  return(xgb_model)
}
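The parameter grid is built with expand.grid, which returns one row per combination of the supplied values. The sketch below uses a reduced, made-up grid to show how quickly the search grows once several candidate values are enabled; full_grid and n_fits are names chosen for this example, and the fit count is a rough upper bound rather than what caret will necessarily run.

```r
# One row per combination of candidate values
full_grid <- expand.grid(
  nrounds = c(10, 20, 30, 40),
  max_depth = c(3, 5, 10),
  eta = c(0.001, 0.005, 0.1)
)

# 4 * 3 * 3 combinations
print(nrow(full_grid))
print(head(full_grid))

# With 5-fold cross validation, each combination may be fitted up to 5 times
n_fits <- nrow(full_grid) * 5
print(n_fits)
```

This is why the function above keeps a single value per parameter by default and leaves the wider ranges commented out.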

Once the parameter search is done, we can use its result directly to define our working model. Each element of the best parameter set is accessed with the $ operator:

best.model <- xgboost(
 data = as.matrix(my.dataframe[inTrain, ] %>% select(-IMPORTANCE)),
 label = as.matrix(as.numeric(my.dataframe[inTrain,]$IMPORTANCE)-1),
 nrounds = xgb_model$bestTune$nrounds,
 max_depth = xgb_model$bestTune$max_depth,
 eta = xgb_model$bestTune$eta,
 gamma = xgb_model$bestTune$gamma,
 colsample_bytree = xgb_model$bestTune$colsample_bytree,
 min_child_weight = xgb_model$bestTune$min_child_weight,
 subsample = xgb_model$bestTune$subsample,
 objective = "multi:softprob", num_class = 3)

Compute and plot feature importance

Here again, a lot of functions are available in the xgboost package. The documentation presents most of them.

xgb_feature_imp <- xgb.importance(
   colnames(donnees[inTrain, ] %>% select(-label)), 
   model = best.model
)

gg <- xgb.ggplot.importance(xgb_feature_imp, 40); gg

Below is an example of a feature importance plot, as displayed in RStudio. The clusters made by xgboost simply group features with similar scores; they have no other specific meaning.

Feature importance with ggplot

Further reading

I hope this was a straightforward introduction to R. I believe progress is made through manipulation and experimentation. Here are some resources to keep learning and fly on your own:

·        R-bloggers : News and tutorials about R, gathering plenty of blog posts.

·        Rdocumentation : I always get back to this one

·        An introduction to R : If you need refreshers on R coding

·        Getting started with xgboost (R API)

·        Datacamp courses and articles

·        Kaggle R Kernels

Thanks for your attention, any feedback appreciated, fly safe !

Originally published by Alexandre Bec at towardsdatascience.com
