Subsetting data.table with a condition - r

How can I take a random subsample of a large data.table (data.table package)? Is there a more elegant way to do the following?
DT <- data.table(site = rep(letters[1:2], 1000), value = runif(2000)) # no cbind(), which would coerce value to character
DT[site=="a"][sample(1:nrow(DT[site=="a"]), 100)]
I guess there is a simple solution, but I can't find the right wording to search for.
UPDATE:
More generally, how can I access the row number in data.table's i argument without creating a temporary column for it?

One of the biggest benefits of using data.table is that you can set a key for your data.
Using the key and then .I (a built-in variable; see ?data.table for more info), you can use:
setkey(DT, site)
DT[DT["a", sample(.I, 100)]]
As for your second question "how can I access a row number in data.table's i argument"
# Just use the number directly:
DT[17]

Using which, you can find the row-numbers. Instead of sampling from 1:nrow(...) you can simply sample from all rows with the desired property. In your example, you can use the following:
DT[sample(which(site=="a"), 100)]
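As a rough sketch of the approach above, the following samples 100 rows from site "a" in two ways: with which(), as in the answer, and with .N inside i on an already filtered table. The seed and the object names s1 and s2 are only illustrative assumptions.

```r
library(data.table)

set.seed(42)  # illustrative seed, only so the sketch is reproducible
DT <- data.table(site = rep(letters[1:2], 1000), value = runif(2000))

# Variant 1: sample directly from the row numbers that satisfy the condition
s1 <- DT[sample(which(site == "a"), 100)]

# Variant 2: filter first, then sample row numbers of the filtered table;
# .N in i is the number of rows of the table being subset
s2 <- DT[site == "a"][sample(.N, 100)]
```

Both variants return a 100-row data.table containing only site "a"; the second makes a temporary copy of the filtered rows, which may matter for very large tables.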

Related

how to access rownames in a chained data.table of R

Sometimes we need row names to create a new column that is a function of other columns, computed separately for each row. In other words, the function operates across the row.
Consider this:
library(data.table)
library(geosphere)
dt <- data.table(lon=77+rnorm(100),lat=13 + rnorm(100),i.lon=77+rnorm(100),i.lat=13 + rnorm(100))
dt[,dist:=distGeo(p1=c(lon,lat),p2=c(i.lon,i.lat)),by=rownames(dt)] # correct
The second line of code works fine as the data.table name dt is available inside the square brackets (which in itself does not look quite elegant to me), but not always.
What if there is a chain of data.tables? Consider this extension of previous example:
dt[lon>77 & lat<12.5][,dist:=distGeo(p1=c(lon,lat),p2=c(i.lon,i.lat)),by=rownames(dt)] # incorrect
Clearly this is an incorrect use as rownames(dt) is a different length than the subsetted data.table passed inside to the next in chain.
I guess my larger question is: Is rownames() the only way to achieve summarisation on each row? If not then the specific question remains: how do we access the data.table inside the by= construct if it is a chained data.table?
Try cbind:
dt <- data.table(lon=77+rnorm(100),lat=13 + rnorm(100),i.lon=77+rnorm(100),i.lat=13 + rnorm(100))
dt[,dist:=distGeo(p1=cbind(lon,lat),p2=cbind(i.lon,i.lat))]
# correct : 100 lines
dt[lon>77 & lat<12.5][,dist:=distGeo(p1=cbind(lon,lat),p2=cbind(i.lon,i.lat))]
# also correct : 16 lines (the exact count depends on the random draw)
distGeo is vectorized, so := computes the distance for every row at once; no per-row grouping is needed.
cbind supplies the expected n*2 lon-lat matrix to the function.
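To illustrate the same idea without depending on geosphere, here is a minimal sketch that uses a hand-written, vectorized haversine function. The helper haversine_m, the column names lon2 and lat2, and the seed are assumptions made for this example; the helper is a spherical approximation, not a replacement for distGeo.

```r
library(data.table)

set.seed(1)  # illustrative seed
dt <- data.table(lon  = 77 + rnorm(100), lat  = 13 + rnorm(100),
                 lon2 = 77 + rnorm(100), lat2 = 13 + rnorm(100))

# Hypothetical helper: great-circle distance in metres on a sphere.
# It is vectorized, so it takes whole columns and returns one value per row.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Keep the subset in a variable so the new column is not lost
sub <- dt[lon > 77 & lat < 12.5]
sub[, dist := haversine_m(lon, lat, lon2, lat2)]  # no by= needed
```

Because the function works on whole columns, the length of the result always matches the number of rows of whatever table is being modified, whether or not it came from a chain.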

R: add a new column with values based on another column, indexing by category [duplicate]

I have no experience with data.table, so I don't know if there is a solution to my question (30 minutes on Google gave no answer at least), but here it goes.
With data.frame I often use the following command to check the number of observations of a unique value:
df$Obs=with(df, ave(v1, ID-Date, FUN=function(x) length(unique(x))))
Is there any corresponding method when working with data.table?
Yes, there is. Happily, you've asked about one of the newest features of data.table, added in v1.8.2 :
:= by group is now implemented (FR#1491) and sub-assigning to a new column
by reference now adds the column automatically (initialized with NA where
the sub-assign doesn't touch) (FR#1997). := by group can be combined with all
types of i, so := by group includes grouping by i as well as by by.
Since := by group is by reference, it should be significantly faster than any
method that (directly or indirectly) cbinds the grouped results to DT, since
no copy of the (large) DT is made at all. It's a short and natural syntax that
can be compounded with other queries.
DT[,newcol:=sum(colB),by=colA]
In your example, iiuc, it should be something like :
DT[, Obs:=.N, by=ID-Date]
instead of :
df$Obs=with(df, ave(v1, ID-Date, FUN=function(x) length(unique(x))))
Note that := by group scales well for large data sets (and smaller datasets with a lot of small groups).
See ?":=" and Search data.table tag for "reference"
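A small sketch of := by group on made-up data may help here. The columns ID and v1 and their values are assumptions for illustration. Note that .N counts rows per group, while the ave() call in the question counts distinct values, so both variants are shown.

```r
library(data.table)

# Toy data, invented for this sketch
DT <- data.table(ID = c("a", "a", "a", "b", "b"),
                 v1 = c(1, 1, 2, 3, 3))

# Number of rows in each group, added by reference
DT[, Obs := .N, by = ID]

# Number of distinct v1 values in each group, which is what
# length(unique(x)) inside ave() computes
DT[, ObsUnique := length(unique(v1)), by = ID]
```

Here Obs is 3, 3, 3, 2, 2 and ObsUnique is 2, 2, 2, 1, 1, so the two only agree when every value within a group is distinct.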

R: Drop columns from data.table, by reference, without having the name

This is almost a duplicate of this. I want to drop columns from a data table, but I want to do it efficiently. I have a list of names of columns that I want to keep. All the answers to the linked question imply doing something akin to
data.table.new <- data.table.old[, my.list]
which at some crucial point will give me a new object, while the old object is still in memory. However, my data.table.old is huge, and hence I prefer to do this via reference, as suggested here
set(data.table.old, j = 'a', value = NULL)
However, as I have a whitelist of columns, and not a blacklist, I would need to iterate through all the column names, check whether they are in my.list, and then apply set(). Is there any cleaner/other way of doing so?
I'm not sure whether you can do by-reference operations on a data.frame without converting it to a data.table.
The code below should work if you are willing to use data.table.
library(data.table)
setDT(data.frame.old)
dropcols <- names(data.frame.old)[!names(data.frame.old) %in% my.list]
data.frame.old[, c(dropcols) := NULL]
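The whitelist approach can also be sketched with setdiff(), which turns the list of columns to keep into the list of columns to drop. The table, the column names and the variable names keep and drop below are assumptions for illustration.

```r
library(data.table)

# Toy table, invented for this sketch
DT <- data.table(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
keep <- c("a", "c")  # the whitelist

# Everything that is not on the whitelist gets dropped
drop <- setdiff(names(DT), keep)

# Remove the columns by reference; guard against an empty drop list
if (length(drop) > 0) {
  DT[, (drop) := NULL]
}
```

Wrapping drop in parentheses on the left of := tells data.table to use the names stored in the variable rather than a column literally called drop.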

R: create data.table with periodic function

I would like to create a data.table in tidy form containing the columns articleID, period and demand (with articleID and period as key). The demand is subject to a random function with input data from another data.frame (params). It is created at runtime for differing numbers of periods.
It is easy to do this in "non-tidy" form:
#example data
params <- data.frame(shape=runif(10), rate=runif(10)*2)
rownames(params) <- letters[1:10]
periods <- 10
# create non-tidy data with one column for each period
df <- replicate(nrow(params),
rgamma(periods,shape=params[,"shape"], rate=params[,"rate"]))
rownames(df) <- rownames(params)
Is there a "tidy" way to do this creation? I would need to replicate the rgamma(), but I am not sure how to make it use the parameters of the corresponding article. I tried starting with a Cross Join from data.table:
dt <- CJ(articleID=rownames(params), per=1:periods, demand=0)
but I don't know how to pass the rgamma to dt[,demand] directly and correctly at creation, nor how to change the values afterwards without using some ugly for loop. I also considered using gather() from the tidyr package, but as far as I can see, I would need a for loop there as well.
It does not really matter to me whether I use data.frame or data.table for my current use case. Solutions for any (or both!) would be highly appreciated.
This'll do (note that it assumes that params is sorted by row names, if not you can convert it to a data.table and merge the two):
CJ(articleID=rownames(params), per=1:periods)[,
demand := rgamma(.N, shape=params[,"shape"], rate=params[,"rate"]), by = per]
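If params is not sorted by row name, the merge mentioned above could look roughly like this. It is a sketch under the assumption that the parameters are stored in a data.table with an explicit articleID column; the seed is illustrative.

```r
library(data.table)

set.seed(1)  # illustrative seed
params <- data.table(articleID = letters[1:10],
                     shape = runif(10),
                     rate  = runif(10) * 2)
periods <- 10

# One row per article and period
dt <- CJ(articleID = params$articleID, per = seq_len(periods))

# Attach each article's parameters to its rows
dt <- merge(dt, params, by = "articleID")

# rgamma is vectorized over shape and rate, so each row uses its own parameters
dt[, demand := rgamma(.N, shape = shape, rate = rate)]

# Drop the helper columns and set the requested key
dt[, c("shape", "rate") := NULL]
setkey(dt, articleID, per)
```

Since the parameters travel with the rows through the merge, the result does not depend on the order of params.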

How to define data.table keys for fastest aggregation using multiple keys

I am trying to better understand how to use keyed data.tables. After reading the documentation I think I understand how to speed up subsetting when using one key. For example:
DT = data.table(x=rep(c("ad","bd","cd"),each=3), y=c(1,3,6), v=1:9)
Option one:
DT[x == "ad"]
Option two:
setkey(DT,x)
DT["ad"]
In this case option one is much slower than option two, because data.table uses the key to search more efficiently (using a binary search instead of a vector scan, which I do not fully understand but will trust is faster).
In the case of aggregating on a subset of the data using a by statement, what is the fastest way to define the key? Should I key the column that I am using to subset the data, or the column that defines the groups? For example:
setkey(DT,x)
DT[!"bd",sum(v),by=y]
or
setkey(DT,y)
DT[!"bd",sum(v),by=y]
Is there a way to utilize a key for both x and y?
EDIT
Does setting the key to both x and y perform two vector searches? i.e:
setkey(DT,x,y)
EDIT2
Sorry, what I meant to ask was will the call DT[!"bd",sum(v),by=y] perform two binary scans when DT is keyed by both x and y?
I believe it is not possible to perform two binary scans when the data table DT is keyed by both x and y. Instead I would repeat the keying first on x and then on y as follows:
DT = data.table(x=rep(c("ad","bd","cd"),each=3), y=as.character(c(1,3,4)), v=1:9)
setkey(DT,x)
tmp = DT[!"bd"]
setkey(tmp,y)
tmp[!"1",sum(v),by=y]
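As a quick sanity check rather than a benchmark, the sketch below uses the question's data to confirm that a keyed not-join on x followed by grouping on y gives the same totals as a plain vector scan. The result names res_key and res_scan are assumptions, and nothing here measures which variant is faster.

```r
library(data.table)

DT <- data.table(x = rep(c("ad", "bd", "cd"), each = 3),
                 y = c(1, 3, 6), v = 1:9)

# Keyed variant: not-join on the key column x, then group by y
setkey(DT, x)
res_key <- DT[!"bd", .(total = sum(v)), by = y]

# Vector-scan variant: logical condition on x, then group by y
res_scan <- DT[x != "bd", .(total = sum(v)), by = y]
```

Both return totals of 8, 10 and 12 for y equal to 1, 3 and 6; timing the two on realistically sized data would be the way to see whether the key helps for a given workload.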
