Create new column in data.table by group

I have no experience with data.table, so I don't know if there is a solution to my question (30 minutes on Google gave no answer at least), but here it goes.
With data.frame I often use the following command to check the number of observations of a unique value:
df$Obs=with(df, ave(v1, ID-Date, FUN=function(x) length(unique(x))))
Is there any corresponding method when working with data.table?

Yes, there is. Happily, you've asked about one of the newest features of data.table, added in v1.8.2:
:= by group is now implemented (FR#1491) and sub-assigning to a new column
by reference now adds the column automatically (initialized with NA where
the sub-assign doesn't touch) (FR#1997). := by group can be combined with all
types of i, so := by group includes grouping by i as well as by by.
Since := by group is by reference, it should be significantly faster than any
method that (directly or indirectly) cbinds the grouped results to DT, since
no copy of the (large) DT is made at all. It's a short and natural syntax that
can be compounded with other queries.
DT[,newcol:=sum(colB),by=colA]
In your example, if I understand correctly, it should be something like:
DT[, Obs:=.N, by=ID-Date]
instead of:
df$Obs=with(df, ave(v1, ID-Date, FUN=function(x) length(unique(x))))
(Note that .N counts the rows in each group; if v1 can repeat within a group, use length(unique(v1)) in j instead, to match the ave() call exactly.)
Note that := by group scales well for large data sets (and for smaller datasets with a lot of small groups).
See ?":=" and search the data.table tag for "reference".
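To make the answer concrete, here is a minimal runnable sketch; the ID, Date and v1 values are made up to mirror the question:

```r
library(data.table)

# Made-up data mirroring the question's ID, Date and v1 columns
df <- data.table(ID   = c(1, 1, 1, 2, 2),
                 Date = c("a", "a", "b", "a", "a"),
                 v1   = c(10, 10, 20, 30, 40))

# Count of distinct v1 values per (ID, Date) group, assigned by reference;
# this matches the length(unique(x)) in the original ave() call
df[, Obs := length(unique(v1)), by = .(ID, Date)]
# df$Obs is now 1, 1, 1, 2, 2
```

If plain row counts per group are all you need, replace `length(unique(v1))` with `.N`.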

Related

data.table switches column names

I would like to ask whether the following behavior of data.table is a feature or a bug.
Given the data.table
dt = data.table(
group = c(rep('group1',5),rep('group2',5)),
x = as.numeric(c(1:5, 1:5)),
y = as.numeric(c(5:1, 5:1)),
z = as.numeric(c(1,2,3,2,1, 1,2,3,2,1))
)
and a vector of column names containing a duplicate,
cols = c('y','x','y','z') # contains a duplicate column name
data.table rightly prevents me from assigning values to the duplicate column names:
dt[,(cols) := lapply(.SD,identity), .SDcols=cols] # Error (OK)
This seems like appropriate behavior to me, because it can help avoid unintended consequences. However, if I do the same assignment by groups,
dt[,(cols) := lapply(.SD,identity), .SDcols=cols, by=group] # No error!
then data.table doesn't throw an error. The assignment goes through, and one can verify that columns y and z have been interchanged.
This occurred for me in a large application while demeaning variables by group, and it was difficult to trace the source of this behavior. The recommendation for the user, of course, is to avoid duplicate column names when assigning, and to avoid providing duplicate names to .SDcols. However, would it not be better for data.table to throw an error in this situation?
This is a bug, which was fixed in version 1.12.4 of data.table. Here is the bug report: https://github.com/Rdatatable/data.table/issues/4874.
Other users with this issue can simply update their package version, for example using install.packages('data.table'). To check the installed version, run packageVersion('data.table').
But it would be wise to avoid supplying duplicate column names to .SDcols.
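Until you are on a fixed version, a simple guard is to deduplicate the column vector before assigning. A minimal sketch, reusing the questioner's demeaning-by-group use case on made-up data:

```r
library(data.table)

dt <- data.table(group = rep(c("g1", "g2"), each = 5),
                 x = as.numeric(c(1:5, 1:5)),
                 y = as.numeric(c(5:1, 5:1)))

cols <- c("y", "x", "y")   # accidentally contains a duplicate

# Deduplicating up front sidesteps the bug on any data.table version
cols <- unique(cols)

# Demean each selected column within each group, assigning by reference
dt[, (cols) := lapply(.SD, function(v) v - mean(v)), .SDcols = cols, by = group]
```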

Subsetting data.table with a condition

How can I sample a subset of a large data.table (data.table package)? Is there a more elegant way to perform the following
DT<- data.table(cbind(site = rep(letters[1:2], 1000), value = runif(2000)))
DT[site=="a"][sample(1:nrow(DT[site=="a"]), 100)]
Guess there is a simple solution, but can't choose the right wording to search for.
UPDATE:
More generally, how can I access a row number in data.table's i argument without creating temporary column for row number?
One of the biggest benefits of using data.table is that you can set a key for your data.
Using the key and then .I (a built-in variable; see ?data.table for more info) you can use:
setkey(DT, site)
DT[DT["a", sample(.I, 100)]]
As for your second question, "how can I access a row number in data.table's i argument":
# Just use the number directly:
DT[17]
Using which, you can find the row-numbers. Instead of sampling from 1:nrow(...) you can simply sample from all rows with the desired property. In your example, you can use the following:
DT[sample(which(site=="a"), 100)]
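A self-contained sketch of the which()-based version (note that building the table without cbind(), unlike in the question, keeps value numeric):

```r
library(data.table)
set.seed(42)

DT <- data.table(site = rep(letters[1:2], 1000), value = runif(2000))

# Sample 100 rows among those with site == "a", with no temporary column:
# which(site == "a") yields the matching row numbers, and we sample from those
sub <- DT[sample(which(site == "a"), 100)]
```

The result has exactly 100 rows, all with site "a".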

Fastest way/algorithm to find count of unique rows of a sorted file

I currently use .N to find the number of unique rows in a file using by= ... .
For eg. to find the count of unique rows of col1 and col2 in a data table, dt, the query would be,
dt[, .N, by="col1,col2"]
For very large files this could take a very long time. If the table is sorted, is there a faster way to do this? Basically, you could keep a counter and increment it each time a new unique row is encountered, in a single pass over the rows. I can't use a for loop as that would take forever.
unique.data.table is very different from base R's unique in that, if a key is set, unique.data.table fetches unique values based only on the key columns of the data.table. To explain this with an example, try this:
dt <- data.table(x=c(1,1,1,2,2), y=c(5,6,6,7,8))
unique(dt) # no key set, similar to 'unique.data.frame' output
# set key now
setkey(dt, "x")
unique(dt) # unique based on just column x
If you just want the total number of unique rows, try the following:
setkeyv(dt, c("col1", "col2"))
nrow(unique(dt))
On your question:
dt[, .N, by="col1,col2"]
does not actually give you the number of unique rows, while either of these two does:
dt[, .N, by="col1,col2"][, .N] # data.table solution
nrow(dt[, .N, by="col1,col2"]) # data.frame syntax applied to data.table
My answer to your question:
A core feature of the data.table package is to work with a key. On p.2 from the short introduction to the data.table package it reads:
Furthermore, the rows are sorted by the key. Therefore, a data.table
can have at most one key, because it cannot be sorted in more than one
way.
Thus, unless you have a column defining the sort order that you can set as the key, the fact that your data are sorted will be of no advantage: you still need to set the key. For your purpose (large data files, and thus presumably many columns), you would want to include all of the columns in your dataset in the key:
setkeyv(dt, names(dt)) # use key(dt) to check whether this went as expected
unique(dt)[, .N] # or nrow(unique(dt))
PS: please provide us with a reproducible dataset, so we can assess what you consider fast or slow.
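For comparison, a small reproducible example counting unique (col1, col2) rows three ways; note that uniqueN() and the by argument to unique() come from data.table releases newer than the one discussed above:

```r
library(data.table)

dt <- data.table(col1 = c(1, 1, 1, 2, 2),
                 col2 = c(5, 6, 6, 7, 8))

n1 <- dt[, .N, by = "col1,col2"][, .N]          # grouped count, then count the groups
n2 <- nrow(unique(dt, by = c("col1", "col2")))  # unique rows by the two columns
n3 <- uniqueN(dt, by = c("col1", "col2"))       # newer one-step helper
# all three are 4: the unique rows are (1,5), (1,6), (2,7), (2,8)
```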

Conflicting/duplicate column names in J()?

I have two data.tables (dat and results) that share column names. On a side note, results holds summary statistics computed earlier on *sub*groups of dat; in other words, nrow(results) != nrow(dat) (but I don't think this is relevant to the question).
Now I want to incorporate these results back into dat (i.e. the original data.table) by adding a new column (i.e. NewColZ) to dat
This doesn't work as I expect:
dat[,list(colA,colB,NewColZ=results1[colX==colX & colY==colY,colZ])
,by=list(colX, colY)]
Why? because "colX" and "colY" are columns names in both data.tables (i.e. dat and results). What I want to say is, results1[take_from_self(colX)==take_from_parent(colX)]
Therefore the following works (observe I have only RENAMED the columns)
dat[,list(colA,colB,NewColZ=results1[cx==colX & cy==colY,colZ])
,by=list(colX, colY)]
Though I have a feeling that this can be done simply and easily with a join. But dat has many more columns than results.
What you are trying to do is a join on colX and colY. You can use := to assign by reference. Joining is most straightforward when you have unique combinations (which I am assuming you do).
keys <- c('colX', 'colY')
setkeyv(dat, keys)
setkeyv(results, keys)
dat[results, newcolZ := colZ]
# perhaps use `i.` if there is a colZ in dat
# dat[results, newcolZ := i.colZ]
I do concur with the comments that suggest reading the FAQ and introduction vignettes as well as going through the many examples in ?data.table.
Your issue was a scoping issue, but your primary issue was not being fully aware of the data.table idioms. The join approach is the idiomatic data.table approach.
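A self-contained sketch of the join-and-assign pattern above, with made-up dat/results tables (dat here deliberately has its own colZ, so the i. prefix is needed):

```r
library(data.table)

dat <- data.table(colX = c(1, 1, 2, 2),
                  colY = c("a", "b", "a", "b"),
                  colA = 1:4,
                  colZ = 0)                      # dat has its own colZ
results <- data.table(colX = c(1, 2),
                      colY = c("a", "a"),
                      colZ = c(10, 20))          # per-subgroup summaries

keys <- c("colX", "colY")
setkeyv(dat, keys)
setkeyv(results, keys)

# i.colZ takes colZ from results (the i table); rows of dat with no match
# in results get NA in the new column
dat[results, NewColZ := i.colZ]
```

Rows (1,"a") and (2,"a") pick up 10 and 20; the unmatched "b" rows get NA.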
