We often hear that R is sluggish with big data. Here we are talking about terabytes or even petabytes, and one of R's biggest limitations is that the data must fit in RAM.

To work around this, we use out-of-memory processing, which handles the data in chunks rather than processing it all at once. We use the two packages shown below.

#install.packages("ff")
library(ff)

#install.packages("ffbase")
library(ffbase)
  1. The ff package chunks the data and stores it on the hard disk as encoded, raw flat files, while still giving fast access through familiar functions. Its data structure, the ff data frame, keeps only a mapping to the partitioned dataset in RAM. As an example of how the chunking works: a 2 GB file takes about 460 seconds to read, producing 1 ff data frame of about 515 KB and 28 ff data files of 50 MB each, roughly 1.37 GB on disk. A minimal sketch of chunk-wise processing follows this list.
  2. To perform basic operations such as merging, finding duplicates and missing values, and creating subsets, we use the ffbase package. We can also run clustering, regression, and classification directly on ff objects.
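
To make the chunking concrete, here is a minimal sketch using the ff package's chunk() helper. The vector x and its size are made up for illustration; in real use the data would be far larger than RAM and would normally come from read.table.ffdf rather than be generated in memory.

# Illustration only: a one-million element double vector backed by a file on disk
x <- ff(rnorm(1e6), vmode = "double")

# Process the vector chunk by chunk; chunk() yields index ranges, so only
# the current slice of x is pulled into RAM at any time
total <- 0
for (i in chunk(x)) {
  total <- total + sum(x[i])
}
total / length(x)   # mean computed without ever holding the whole vector in memory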

Let’s look at some R code for the operations described above.

# Loading from flat files

# Create a directory for the ff files and point fftempdir to it
dir.create("ffdf", showWarnings = FALSE)
options(fftempdir = "./ffdf")

# Read the flat files into ffdf (on-disk data frame) objects, timing each read
system.time(fli.ff <- read.table.ffdf(file = "flights.txt", sep = ",",
                                      VERBOSE = TRUE, header = TRUE, colClasses = NA))
system.time(airln.ff <- read.csv.ffdf(file = "airline.csv",
                                      VERBOSE = TRUE, header = TRUE, colClasses = NA))

# Merging the datasets (dispatches to ffbase's merge method for ffdf objects)
flights.data.ff <- merge(fli.ff, airln.ff, by = "Airline_id")
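
A quick sanity check on the merged object: dim() reports the number of rows and columns without loading the data, and indexing a few rows materialises only that small piece as a regular data frame.

# Check the merged ffdf
dim(flights.data.ff)     # rows and columns of the on-disk ffdf
flights.data.ff[1:5, ]   # only these five rows are brought into RAM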

Subsetting

# Subset

subset.ffdf(flights.data.ff, CANCELLED == 1,
            select = c(Flight_date, Airline_id, Ori_city, Ori_state,
                       Dest_city, Dest_state, Cancellation))
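
If the subset is needed again later, it can be stored under a name and persisted with ffbase's save.ffdf() and load.ffdf(). The object name cancelled.ff and the directory below are illustrative, not part of the original example.

# Store the subset under an illustrative name and persist it to disk
cancelled.ff <- subset.ffdf(flights.data.ff, CANCELLED == 1,
                            select = c(Flight_date, Airline_id, Ori_city, Ori_state,
                                       Dest_city, Dest_state, Cancellation))
save.ffdf(cancelled.ff, dir = "./cancelled_ffdf")
# Later, load.ffdf("./cancelled_ffdf") restores cancelled.ff without re-reading the raw files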

Descriptive statistics

# Descriptive statistics

mean(flights.data.ff$DISTANCE)
quantile(flights.data.ff$DISTANCE)
range(flights.data.ff$DISTANCE)
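
For statistics that these methods do not cover, the same chunk-wise pattern shown earlier applies directly to a column of the ffdf. As an illustration, the proportion of cancelled flights (using the CANCELLED column from the subset example) can be accumulated one chunk at a time.

# Proportion of cancelled flights, computed chunk by chunk
canc <- flights.data.ff$CANCELLED
n.cancelled <- 0
for (i in chunk(canc)) {
  n.cancelled <- n.cancelled + sum(canc[i] == 1, na.rm = TRUE)
}
n.cancelled / length(canc)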

Regression with biglm (dataset: the Chronic Kidney Disease dataset from the University of California, Irvine Machine Learning Repository, http://archive.ics.uci.edu/ml/index.html)

# Regression requires the biglm package
# install.packages("biglm")

library(ffbase)
library(biglm)

# ckd.ff is assumed to be the Chronic Kidney Disease data loaded as an ffdf
# (for example with read.csv.ffdf, as shown above for the flight data)
model1 <- bigglm.ffdf(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo,
                      data = ckd.ff, family = binomial(link = "logit"),
                      na.action = na.exclude)
model1
summary(model1)
# The model can then be refined according to the significance levels reported for model1
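
As a purely illustrative refinement, one could refit the model keeping only the predictors that turned out significant in model1; the reduced formula below is hypothetical and depends on what summary(model1) actually reports.

# Hypothetical refit with a reduced set of predictors
model2 <- bigglm.ffdf(class ~ age + bgr + hemo, data = ckd.ff,
                      family = binomial(link = "logit"), na.action = na.exclude)
summary(model2)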

#rstudio #data-science #parallel-computing #big-data #data-analysis

“Data is the new science, Big data holds the answer.”