Every now and then we hear that R is sluggish with big data. Here we are talking about terabytes or petabytes, and one of R's biggest limitations is that the data must fit within RAM.
To work around this we use out-of-memory processing, which handles the data in chunks on disk rather than loading it all at once. The two packages we use are shown below.
#install.packages("ff")
library(ff)
#install.packages("ffbase")
library(ffbase)
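Before the real data, a minimal sketch of what ff provides: the vector below lives in a memory-mapped file on disk, and only small sections are pulled into RAM when accessed (the name and size here are just for illustration).
# A disk-backed integer vector of 100 million elements; the backing
# file holds the data, so RAM usage stays small
x <- ff(vmode = "integer", length = 1e8)
x[1:5] <- 1:5    # reads and writes look like an ordinary R vector
filename(x)      # path to the backing file on disk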
Let’s look at some R code for the operations described above.
# Loading from flat files
dir.create("ffdf")  # directory for the ff backing files
options(fftempdir = "./ffdf")
system.time(fli.ff <- read.table.ffdf(file = "flights.txt", sep = ",",
                                      VERBOSE = TRUE, header = TRUE, colClasses = NA))
system.time(airln.ff <- read.csv.ffdf(file = "airline.csv",
                                      VERBOSE = TRUE, header = TRUE, colClasses = NA))
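Once loaded, an ffdf behaves much like a data frame, and basic checks run without pulling the rows into RAM (assuming the loads above succeeded):
# Quick sanity checks on the disk-backed tables
dim(fli.ff)
class(fli.ff)
names(airln.ff)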
# Merging the datasets
flights.data.ff <- merge.ffdf(fli.ff, airln.ff, by = "Airline_id")
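A quick way to confirm the join did what we expect is to compare row counts before and after (a sketch assuming the merge above succeeded):
# Row counts before and after the merge
nrow(fli.ff)
nrow(flights.data.ff)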
Subsetting
# Subset
subset.ffdf(flights.data.ff, CANCELLED == 1,
            select = c(Flight_date, Airline_id, Ori_city, Ori_state,
                       Dest_city, Dest_state, Cancellation))
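The result of subset.ffdf is itself an ffdf, so a large subset can be written straight back to disk. A short sketch, with hypothetical object and file names:
# Keep the cancelled flights and export them without loading all rows into RAM
cancelled.ff <- subset.ffdf(flights.data.ff, CANCELLED == 1,
                            select = c(Flight_date, Airline_id, Cancellation))
write.csv.ffdf(cancelled.ff, file = "cancelled_flights.csv")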
Descriptive statistics
# Descriptive statistics
mean(flights.data.ff$DISTANCE)
quantile(flights.data.ff$DISTANCE)
range(flights.data.ff$DISTANCE)
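These summaries come from ffbase, which computes them chunk by chunk. The same idea can be sketched by hand with ff's chunk() iterator, keeping only one block of rows in RAM at a time (an illustration of the mechanism, not how you would normally compute a mean):
# Chunked mean of DISTANCE, computed manually
s <- 0; n <- 0
for (i in chunk(flights.data.ff$DISTANCE)) {
  v <- flights.data.ff$DISTANCE[i]   # load one chunk into RAM
  s <- s + sum(v, na.rm = TRUE)
  n <- n + sum(!is.na(v))
}
s / n   # should agree with mean(flights.data.ff$DISTANCE) above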
Regression with biglm (dataset: the Chronic Kidney Disease dataset from the University of California Irvine Machine Learning Repository, http://archive.ics.uci.edu/ml/index.html)
# Regression requires the biglm package in addition to ffbase
library(ffbase)
library(biglm)
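# ckd.ff is assumed to have been loaded beforehand from the downloaded
# data, e.g. (the file name here is hypothetical):
# ckd.ff <- read.csv.ffdf(file = "ckd.csv", VERBOSE = TRUE,
#                         header = TRUE, colClasses = NA)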
model1 <- bigglm.ffdf(class ~ age + bp + bgr + bu + rbcc + wbcc + hemo,
                      data = ckd.ff, family = binomial(link = "logit"),
                      na.action = na.exclude)
model1
summary(model1)
# The model can be refined according to the significance levels obtained for model1; see the sketch below
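One way to do that refinement, sketched under the assumption that model1 fitted successfully: pull the coefficient table from the biglm summary (its last column holds the p-values), then refit with the weaker predictors dropped (the reduced formula below is only an example).
# Coefficients and p-values from the fitted model
coef(model1)
summary(model1)$mat
# Hypothetical refit keeping only the stronger predictors
model2 <- bigglm.ffdf(class ~ age + bgr + hemo, data = ckd.ff,
                      family = binomial(link = "logit"), na.action = na.exclude)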
#rstudio #data-science #parallel-computing #big-data #data-analysis