I'm relatively familiar with <code>data.table</code>, not so much with <code>dplyr</code>. I've read through some <a href="http://rpubs.com/hadley/dplyr-intro" target="_blank"><code>dplyr</code> vignettes</a> and examples that have popped up on SO, and so far my conclusions are that:
1. <code>data.table</code> and <code>dplyr</code> are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
2. <code>dplyr</code> has more accessible syntax
3. <code>dplyr</code> abstracts (or will) potential DB interactions
In my mind 2. doesn't bear much weight because I am fairly familiar with <code>data.table</code>, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question, asked from the perspective of someone already familiar with <code>data.table</code>. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).
What I want to know is:
One recent SO question got me thinking about this a bit more, because up until that point I didn't think <code>dplyr</code> would offer much beyond what I can already do in <code>data.table</code>. Here is the <code>dplyr</code> solution (data at end of Q):
```r
dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))
```
Which was much better than my hack attempt at a <code>data.table</code> solution. That said, good <code>data.table</code> solutions are also pretty good (thanks Jean-Robert, Arun; note here I favored a single statement over the strictly most optimal solution):
```r
setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
  by = list(id, job)
]
```
The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to <code>data.table</code> (i.e. it doesn't use some of the more esoteric tricks).
Ideally what I'd like to see is some good examples where the <code>data.table</code> way is substantially more concise or performs substantially better.
- <code>dplyr</code> does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5; also, @beginneR shows a potential work-around using <code>do</code> in the answer to @eddi's question).
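To make the contrast concrete, here is a minimal sketch with toy data (assuming a dplyr version with <code>do</code> and the <code>%>%</code> pipe, as in @beginneR's work-around; the data frame and column names are made up):

```r
library(data.table)
library(dplyr)

df <- data.frame(g = c("a", "a", "a", "b"), x = 1:4)

# data.table: j may return any number of rows per group directly
as.data.table(df)[, head(.SD, 2), by = g]

# dplyr: do() is the work-around for arbitrary-rows-per-group results
df %>% group_by(g) %>% do(head(., 2))
```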
- <code>data.table</code> supports rolling joins (thanks @dholstius) as well as overlap joins.
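For readers unfamiliar with rolling joins, a small sketch (the quote/trade data here is hypothetical):

```r
library(data.table)

# Hypothetical data: match each trade to the most recent quote at or before it
quotes <- data.table(time = c(1L, 5L, 10L), price = c(100, 101, 102), key = "time")
trades <- data.table(time = c(3L, 7L, 12L))

# roll = TRUE rolls the last matching quote forward (last observation carried forward)
quotes[trades, roll = TRUE]
#    time price
# 1:    3   100
# 2:    7   101
# 3:   12   102
```

Overlap joins work similarly via <code>foverlaps()</code> on interval columns.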
- <code>data.table</code> internally optimises expressions of the form <code>DT[col == value]</code> or <code>DT[col %in% values]</code> for speed through automatic indexing, which uses binary search while keeping the same base R syntax. See here for some more details and a tiny benchmark.
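To illustrate the auto-indexing point (a sketch only; timings depend on machine and data size):

```r
library(data.table)

set.seed(1)
DT <- data.table(id = sample(1e5L, 1e6L, TRUE), val = rnorm(1e6L))

system.time(DT[id == 42L])  # first call builds an index on 'id' as a side effect
system.time(DT[id == 42L])  # later calls reuse it: binary search, not a full vector scan
```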
- <code>dplyr</code> offers standard evaluation versions of functions (e.g. <code>summarize_each_</code>) that can simplify the programmatic use of <code>dplyr</code> (note programmatic use of <code>data.table</code> is definitely possible, it just requires some careful thought, substitution/quoting, etc., at least to my knowledge).
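On the <code>data.table</code> side, a hedged sketch of what that substitution/quoting looks like in practice (<code>mean_by</code> is a made-up helper, not a package function):

```r
library(data.table)

# Hypothetical helper: aggregate a column (given as a string) by group columns
mean_by <- function(dt, valcol, bycols) {
  j <- substitute(list(avg = mean(v)), list(v = as.name(valcol)))
  dt[, eval(j), by = bycols]  # `by` accepts a character vector directly
}

DT <- data.table(g = c("a", "a", "b"), x = c(1, 2, 4))
mean_by(DT, "x", "g")
#    g avg
# 1: a 1.5
# 2: b 4.0
```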
- When there are many groups, <code>data.table</code> becomes substantially faster.
- <code>data.table</code> scales better than <code>dplyr</code> as the number of groups increases (updated with recent enhancements in both packages and a recent version of R). Also, a benchmark on getting unique values has <code>data.table</code> 75% faster on larger versions of a group/apply/sort, while <code>dplyr</code> was 40% faster on the smaller ones (another SO question from comments, thanks danas).
- Matt Dowle, the main author of <code>data.table</code>, has benchmarked grouping operations on <code>data.table</code> and <code>pandas</code> on up to 2 billion rows (~100GB in RAM).
This is for the first example I showed in the question section.
```r
dat <- structure(list(
  id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  name = c("Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane",
           "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob"),
  year = c(1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L,
           1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 1992L),
  job = c("Manager", "Manager", "Manager", "Manager", "Manager", "Manager",
          "Boss", "Boss", "Manager", "Manager", "Manager", "Boss", "Boss",
          "Boss", "Boss", "Boss"),
  job2 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)),
  .Names = c("id", "name", "year", "job", "job2"),
  class = "data.frame", row.names = c(NA, -16L))
```