Fast grouping by list column subsets in data.table - r

I am working with a large (millions of rows) data table with a list column containing deeply nested lists, which do not have uniform structure, size or order of elements (list(x=1,y=2) and list(y=2,x=1) may both be present and should be treated as identical). I need to repeatedly perform arbitrary groupings that include some columns from the data table as well as a subset of the data in the list column. Not all rows have values that will match the subset.
The approach I've come up with feels overly complicated. Here are the key points:
Identifying values in a nested list structure. My approach is to use ul <- unlist(list_col), which "flattens" nested data structures and builds hierarchical names for direct access to each element, e.g., address.country.code.
Ensuring that permutations of the same unlisted data are considered equal from a grouping standpoint. My approach is to order the unlisted vectors by the names of their values via ul[order(names(ul))] and assign the result as a new character vector column by reference.
Performing grouping on subsets of the flattened values. I was not able to get by= to work in any way with a column whose values are lists or vectors. Therefore, I had to find a way to map unique character vectors to simple values. I did this with digest.
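For instance, unlist() turns the nesting into dotted names:
ul <- unlist(list(address = list(country = list(code = 'US')), x = 1))
names(ul)
# [1] "address.country.code" "x"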
Here are the two workhorse functions:
# Flatten list column in a data.table
flatten_list_col <- function(dt, col_name, flattened_col_name = 'props') {
  flatten_props <- function(d) {
    if (length(d) > 0) {
      ul <- unlist(d)
      nms <- names(ul)
      if (length(nms) > 0) {
        ul[order(nms)]
      } else {
        NA
      }
    } else {
      NA
    }
  }
  flattened <- lapply(dt[[col_name]], flatten_props)
  dt[, as.character(flattened_col_name) := list(flattened), with = F]
}
# Group by properties in a flattened list column (digest() comes from the digest package)
library(digest)
group_props <- function(prop_group, prop_col_name = 'props') {
  substitute({
    l <- lapply(eval(as.name(prop_col_name)),
                function(x) x[names(x) %in% prop_group])
    as.character(lapply(l, digest))
  }, list(prop_group = prop_group, prop_col_name = prop_col_name))
}
Here is a reproducible example:
library(data.table)
dt <- data.table(
  id = c(1, 1, 1, 2, 2, 2),
  count = c(1, 1, 2, 2, 3, 3),
  d = list(
    list(x = 1, y = 2),
    list(y = 2, x = 1),
    list(x = 1, y = 2, z = 3),
    list(y = 5, abc = list(a = 1, b = 2, c = 3)),
    NA,
    NULL
  )
)
flatten_list_col(dt, 'd')
dt[, list(total=sum(count)), by=list(id, eval(group_props(c('x', 'y'))))]
The output is:
> flatten_list_col(dt, 'd')
   id count      d   props
1:  1     1 <list>     1,2
2:  1     1 <list>     1,2
3:  1     2 <list>   1,2,3
4:  2     2 <list> 1,2,3,5
5:  2     3     NA      NA
6:  2     3             NA
> dt[, list(total=sum(count)), by=list(id, eval(group_props(c('x', 'y'))))]
   id                      group_props total
1:  1 325c6bbb2c33456d0301cf3909dd1572     4
2:  2 7aa1e567cd0d6920848d331d3e49fb7e     2
3:  2 ee7aa3b9ffe6bffdee83b6ecda90faac     6
This approach works but is pretty inefficient because of the need to flatten & order the lists and because of the need to calculate digests. I'm wondering about the following:
Can this be done without having to create a flattened column by instead retrieving values directly from the list column? This will probably require specifying selected properties as expressions as opposed to simple names.
Is there a way to get around the need for digest?

There are a number of issues here. The most important (and one you haven't hit yet because of the others) is that you are assigning by reference but trying to supply more values than there is room for by reference.
Take this very simple example
DT <- data.table(x=1, y = list(1:5))
DT[,new := unlist(y)]
Warning message:
In `[.data.table`(DT, , `:=`(new, unlist(y))) :
Supplied 5 items to be assigned to 1 items of column 'new' (4 unused)
You will lose all but the first nrow(DT) items in the newly created column. They won't correspond to the rows of the data.table.
Therefore you will have to create a new data.table that will be large enough for you to explode these list variables. This won't be possible by reference.
newby <- dt[, list(x, props = as.character(unlist(data))), by = list(newby = seq_len(nrow(dt)))][, newby := NULL]
newby
x props
1: 1 1
2: 1 2
3: 1 2
4: 1 1
5: 1 10
6: 2 1
7: 2 2
8: 2 3
9: 2 5
10: 2 1
11: 2 2
12: 2 3
13: 3 NA
14: 3 NA
Note that as.character is required to ensure that all values are the same type, and a type that won't lose data in the conversion. At the moment you have a logical NA value amongst lists of numeric / integer data.
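For example:
typeof(unlist(list(x = 1, y = 2)))  # "double"
typeof(NA)                          # "logical"
typeof(as.character(NA))            # "character" -- consistent across rows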
Another edit to force all components to be character (even the NA). props is now a list with 1 character vector for each row.
flatten_props <- function(data) {
  if (is.list(data)) {
    ul <- unlist(data)
    if (length(ul) > 1) {
      ul <- ul[order(names(ul))]
    }
    as.character(ul)
  } else {
    as.character(unlist(data))
  }
}
dt[, props := lapply(data, flatten_props)]
dt
   x   data   props
1: 1 <list>     1,2
2: 1 <list>  10,1,2
3: 2 <list>   1,2,3
4: 2 <list> 1,2,3,5
5: 3     NA      NA
6: 3
dt[, lapply(props, class)]
          V1        V2        V3        V4        V5        V6
1: character character character character character character
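As for question 2 (avoiding digest): one sketch is to build a plain string key from the selected name=value pairs instead of hashing. This assumes props still carries the names produced by the question's flatten_list_col (the as.character() edit above drops names, so you would keep them, e.g. via setNames(as.character(ul), names(ul))). Using the dt from the question:
prop_key <- function(p, keep) {
  p <- p[names(p) %in% keep]
  paste(names(p), p, sep = "=", collapse = ";")
}
dt[, list(total = sum(count)), by = list(id, grp = sapply(props, prop_key, keep = c('x', 'y')))]
Whether this beats digest will depend on the data, but it at least avoids the hashing dependency.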

Related

How to merge lists of vectors based on one vector belonging to another vector?

In R, I have two data frames that contain list columns
d1 <- data.table(group_id1 = 1:4)
d1$Cat_grouped <- list(letters[1:2], letters[3:2], letters[3:6], letters[11:12])
And
d_grouped <- data.table(group_id2 = 1:4)
d_grouped$Cat_grouped <- list(letters[1:5], letters[6:10], letters[1:2], letters[1])
I would like to merge these two data.tables based on the vectors in d1$Cat_grouped being contained in the vectors in d_grouped$Cat_grouped
To be more precise, there could be two matching criteria:
a) all elements of each vector of d1$Cat_grouped must be in the matched vector of d_grouped$Cat_grouped
Resulting in the following match:
result_a <- data.table(
  group_id1 = c(1, 2),
  group_id2 = c(1, 1)
)
b) at least one of the elements in each vector of d1$Cat_grouped must be in the matched vector of d_grouped$Cat_grouped
Resulting in the following match:
result_b <- data.table(
  group_id1 = c(1, 2, 3, 3),
  group_id2 = c(1, 1, 1, 2)
)
How can I implement a) or b) ? Preferably in a data.table way.
EDIT1: added the expected results of a) and b)
EDIT2: added more groups to d_grouped, so grouping variables overlap. This breaks some of the proposed solutions
So I think long form is better, though my answer feels a little roundabout. I bet someone who's a little sleeker with data.table can do this in fewer steps, but here's what I've got:
first, let's unpack the vectors in your example data:
d1_long <- d1[, list(cat=unlist(Cat_grouped)), group_id1]
d_grouped_long <- d_grouped[, list(cat=unlist(Cat_grouped)), group_id2]
Now, we can merge on the individual elements:
result_b <- merge(d1_long, d_grouped_long, by='cat')
Based on our example, it seems you don't actually need to know which elements were part of the match...
result_b[, cat := NULL]
Finally, my answer has duplicated group_id pairs because it gets a join for each pairwise match, not just the vector-level matches. So we can unique them away.
result_b <- unique(result_b)
Here's my result_b:
group_id.1 group_id.2
1: 1 1
2: 2 1
3: 3 1
4: 3 2
We can use b as an intermediate step towards a, since having all elements in common implies having at least one element in common.
Let's merge the original tables to see what the candidates are in terms of subvectors and vectors
result_a <- merge(result_b, d1, by = 'group_id1')
result_a <- merge(result_a, d_grouped, by = 'group_id2')
So now, if the length of Cat_grouped.x matches the number of TRUEs about Cat_grouped.x being %in% Cat_grouped.y, that's a bingo.
I tried a handful of clean ways, but the weirdness of having lists in the data table defeated the most obvious attempts. This seems to work though:
Let's add a row column to operate by
result_a[, row := 1:.N]
Now let's get the length and number of matches...
result_a[, x.length := length(Cat_grouped.x[[1]]), row]
result_a[, matches := sum(Cat_grouped.x[[1]] %in% Cat_grouped.y[[1]]), row]
And filter down to just rows where length and matches are the same
result_a <- result_a[x.length==matches]
This answer focuses on part a) of the question.
It follows Harland's approach but tries to make better use of the data.table idiom for performance reasons as the OP has mentioned that his production data may contain millions of observations.
Sample data
library(data.table)
d1 <- data.table(
group_id1 = 1:4,
Cat_grouped = list(letters[1:2], letters[3:2], letters[3:6], letters[11:12]))
d_grouped <- data.table(
group_id2 = 1:2,
Cat_grouped = list(letters[1:5], letters[6:10]))
Result a)
grp_cols <- c("group_id1", "group_id2")
unique(d1[, .(unlist(Cat_grouped), lengths(Cat_grouped)), by = group_id1][
d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
, .(V2, .N), by = grp_cols][V2 == N, ..grp_cols])
group_id1 group_id2
1: 1 1
2: 2 1
Explanation
While expanding the list elements of d1 and d_grouped into long format, the number of list elements is determined for d1 using the lengths() function. lengths() (note the difference to length()) gets the length of each element of a list and was introduced with R 3.2.0.
After the inner join (note the nomatch = 0L parameter), the number of rows in the result set is counted (using the special symbol .N) for each combination of grp_cols. Only those rows are kept where the count in the result set matches the original length of the list. Finally, the unique combinations of grp_cols are returned.
Result b)
Result b) can be derived from above solution by omitting the counting stuff:
unique(d1[, unlist(Cat_grouped), by = group_id1][
d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
, c("group_id1", "group_id2")])
group_id1 group_id2
1: 1 1
2: 2 1
3: 3 1
4: 3 2
Another way:
Cross-join to get all pairs of group ids:
Y = CJ(group_id1=d1$group_id1, group_id2=d_grouped$group_id2)
Then merge in the vectors:
Y = Y[d1, on='group_id1'][d_grouped, on='group_id2']
# group_id1 group_id2 Cat_grouped i.Cat_grouped
# 1: 1 1 a,b a,b,c,d,e
# 2: 2 1 c,b a,b,c,d,e
# 3: 3 1 c,d,e,f a,b,c,d,e
# 4: 4 1 k,l a,b,c,d,e
# 5: 1 2 a,b f,g,h,i,j
# 6: 2 2 c,b f,g,h,i,j
# 7: 3 2 c,d,e,f f,g,h,i,j
# 8: 4 2 k,l f,g,h,i,j
Now you can use mapply to filter however you like:
Y[mapply(function(u,v) all(u %in% v), Cat_grouped, i.Cat_grouped), 1:2]
# group_id1 group_id2
# 1: 1 1
# 2: 2 1
Y[mapply(function(u,v) length(intersect(u,v)) > 0, Cat_grouped, i.Cat_grouped), 1:2]
# group_id1 group_id2
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 3 2

Transform longitudinal table to wide format efficiently in data.table

I am working in R with a long table, stored as a data.table, containing values recorded whenever a variable changes, for variables of numeric and character type. When I want to perform some functions like correlations or regressions, I have to convert the table into wide format and homogenise the timestamp frequency.
I found a way to convert the long table to wide, but I think it is not really efficient, and I would like to know if there is a better, more data.table-native approach.
In the reproducible example below, I include the two options I found to perform the long-to-wide transformation, and in the comments I indicate what parts I believe are not optimal.
library(zoo)
library(data.table)
dt <- data.table(time = 1:6, variable = factor(letters[1:6]),
                 numeric = c(1:3, rep(NA, 3)),
                 character = c(rep(NA, 3), letters[1:3]), key = "time")
print(dt)
print(dt[,lapply(.SD,typeof)])
#option 1
casted<-dcast(dt,time~variable,value.var=c("numeric","character"))
# types are correct, but I got NA filled columns,
# is there an option like drop
# available for columns instead of rows?
print(casted)
print(casted[,lapply(.SD,typeof)])
# This drop looks ugly but I did not figure out a better way to perform it
casted[,names(casted)[unlist(casted[,lapply(lapply(.SD,is.na),all)])]:=NULL]
# I perform a LOCF, I do not know if I could benefit of
# data.table's roll option somehow and avoid
# the temporal memory copy of my dataset (this would be the second
# and minor issue)
casted<-na.locf(casted)
#option2
# taken from http://stackoverflow.com/questions/19253820/how-to-implement-coalesce-efficiently-in-r
coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  }, list(...))
}
casted2 <- dcast(dt[, coalesce2(numeric, character), by = c("time", "variable")],
                 time ~ variable, value.var = "V1")
# There are not NA columns but types are incorrect
# it takes more space in a real table (more observations, less variables)
print(casted2)
print(casted2[,lapply(.SD,typeof)])
# Again, I am pretty sure there is a prettier way to do this
numericvars <- names(casted2)[!unlist(casted2[, lapply(
  lapply(lapply(.SD, as.numeric), is.na), all)])]
casted2[, eval(numericvars) := lapply(.SD, as.numeric), .SDcols = numericvars]
# same as option 1, is there a data.table native way to do it?
casted2<-na.locf(casted2)
Any advice/improvement in the process is welcome.
I'd maybe do the char and num tables separately and then rbind:
k = "time"
typecols = c("numeric", "character")
res = rbindlist(fill = TRUE,
  lapply(typecols, function(tc) {
    cols = c(k, tc, "variable")
    dt[!is.na(get(tc)), ..cols][, dcast(.SD, ... ~ variable, value.var = tc)]
  })
)
setorderv(res, k)
res[, setdiff(names(res), k) := lapply(.SD, zoo::na.locf, na.rm = FALSE), .SDcols=!k]
which gives
time a b c d e f
1: 1 1 NA NA NA NA NA
2: 2 1 2 NA NA NA NA
3: 3 1 2 3 NA NA NA
4: 4 1 2 3 a NA NA
5: 5 1 2 3 a b NA
6: 6 1 2 3 a b c
Note that the OP's final result casted2 differs in that it has all columns as character.
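A side note on the LOCF step: recent data.table versions (1.12.4+, I believe) ship nafill()/setnafill(), which can replace zoo::na.locf for numeric columns and work by reference; character columns are generally not supported, so they would still need zoo. A sketch on the OP's casted:
num_cols <- names(casted)[sapply(casted, is.numeric)]
setnafill(casted, type = "locf", cols = num_cols)  # in-place LOCF, numeric cols only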

Supply arguments to data.table as (1) vector of strings AND (2) variable names

Imagine you want to apply a function row-wise on a data.table. The function's arguments correspond to fixed data.table columns as well as dynamically generated column names.
Is there a way to supply fixed and dynamic column names as argument to a function while using data.tables?
The problems are:
Passing both variable names and dynamically generated strings as arguments to a function over a data.table
The dynamic column name strings are stored in a vector with > 1 entries (get() won't work)
The dynamic column's values need to be supplied as a vector to the function
This illustrates it:
library('data.table')
# Sample dataframe
D <- data.table(id=1:3, fix=1:3, dyn1=1:3, dyn2=1:3) #fixed and dynamic column names
setkey(D, id)
# Sample function
foo <-function(fix, dynvector){ rep(fix,length(dynvector)) %*% dynvector}
# It does not matter what this function does.
# The result when passing column names not dynamically
D[, "new" := foo(fix,c(dyn1,dyn2)), by=id]
# id fix dyn1 dyn2 new
# 1: 1 1 1 1 2
# 2: 2 2 2 2 8
# 3: 3 3 3 3 18
I want to get rid of the c(dyn1,dyn2). I need to get the column names dyn1, dyn2 from another vector which holds them as string.
This is how far I got:
# Now we try it dynamically
cn <-paste("dyn",1:2,sep="") #vector holding column names "dyn1", "dyn2"
# Approaches that don't work
D[, "new" := foo(fix,c(cn)), by=id] #wrong as using a mere string
D[, "new" := foo(fix,c(cn)), by=id, with=F] #does not work
D[, "new" := foo(fix,c(get(cn))), by=id] #uses only the first element "dyn1"
D[, "new" := foo(fix,c(mget(cn, .GlobalEnv, inherits=T))), by=id] #does not work
D[, "new" := foo(fix,c(.SD)), by=id, .SDcols=cn] #does not work
I suppose mget() is the solution, but I know too little about scoping to figure it out.
Thanks! JBJ
Update: Solution
based on the answer by BondedDust
D[, "new" := foo(fix,sapply(cn, function(x) {get(x)})), by=id]
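An equivalent sketch using mget(), which looks the strings up as columns in the data.table's own scope and avoids the sapply wrapper (same result on this example):
D[, "new" := foo(fix, unlist(mget(cn))), by = id]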
I wasn't able to figure out what you were trying to do with the matrix-multiplication, but this shows how to create new variables with varying and fixed inputs to a function:
D <- data.table(id=1:3, fix=1:3, dyn1=1:3, dyn2=1:3)
setkey(D, id)
foo <-function(fix, dynvector){ fix* dynvector}
D[, paste("new",1:2,sep="_") := lapply( c(dyn1,dyn2), foo, fix=fix), by=id]
#----------
> D
id fix dyn1 dyn2 new_1 new_2
1: 1 1 1 1 1 1
2: 2 2 2 2 4 4
3: 3 3 3 3 9 9
So you need to use a vector of character values to get columns. This is a bit of an extension to this question: Why do I need to wrap `get` in a dummy function within a J `lapply` call?
> D <- data.table(id=1:3, fix=1:3, dyn1=1:3, dyn2=1:3)
> setkey(D, id)
> id1 <- parse(text=cn)
> foo <-function( fix, dynvector){ fix*dynvector}
> D[, paste("new",1:2,sep="_") := lapply( sapply( cn, function(x) {get(x)}) , foo, fix=fix) ]
Warning message:
In `[.data.table`(D, , `:=`(paste("new", 1:2, sep = "_"), lapply(sapply(cn, :
Supplied 2 columns to be assigned a list (length 6) of values (4 unused)
> D
id fix dyn1 dyn2 new_1 new_2
1: 1 1 1 1 1 2
2: 2 2 2 2 2 4
3: 3 3 3 3 3 6
You could probably use the methods in create an expression from a function for data.table to eval as well.

Number of Unique Obs by Variable in a Data Table

I have read in a large data file into R using the following command
data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))
The data set contains columns which should not belong, and contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).
Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.
My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 c = rep(1, times = 10))
### Create a comparable data.frame
df <- data.frame(dt)
### Expected result
unique(dt$a)
### Expected result
length(unique(dt$a))
However, I wish to calculate the number of obs for a large data file, so referencing each column by name is not desired. I am not a fan of eval(parse()).
### I want to determine the number of unique obs in
# each variable, for a large list of vars
lapply(names(df), function(x) {
  length(unique(df[, x]))
})
### Unexpected result
length(unique(dt[, 'a', with = F])) # Returns 1
It seems to me the problem is that
dt[, 'a', with = F]
returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
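The distinction is easy to see with the dt defined above:
length(dt[, 'a', with = FALSE])  # 1  -- a one-column data.table
length(dt[['a']])                # 10 -- the column vector itself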
Here's pseudo code for how I would remedy the solution, using the data.frame way:
for (x in names(data)) {
  unique.obs <- length(unique(data[, x]))
  if (unique.obs == 1) {
    data[, x] <- NULL
  }
}
Any insight as to how I may more efficiently ask for the number of unique observations by column in a data.table would be much appreciated. Alternatively, if you can recommend how to drop columns that contain only one unique observation, that would be even better.
Update: uniqueN
As of version 1.9.6, there is a built in (optimized) version of this solution, the uniqueN function. Now this is as simple as:
dt[ , lapply(.SD, uniqueN)]
If you want to find the number of unique values in each column, something like
dt[, lapply(.SD, function(x) length(unique(x)))]
## a b c
## 1: 10 10 1
To get your function to work you need to use with=FALSE within [.data.table, or simply use [[ instead (read fortune(312) as well...)
lapply(names(df), function(x) nrow(unique(dt[, x, with = FALSE])))
or
lapply(names(df), function(x) length(unique(dt[[x]])))
will work
In one step
dt[,names(dt) := lapply(.SD, function(x) if(length(unique(x)) ==1) {return(NULL)} else{return(x)})]
# or to avoid calling `.SD`
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]]))==1) := NULL]
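If uniqueN is available (data.table 1.9.6+), the same one-step drop can be sketched as:
cols <- names(dt)[sapply(dt, uniqueN) == 1]
if (length(cols)) dt[, (cols) := NULL]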
The approaches in the other answers are good. Another way to add to the mix, just for fun :
for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
or if there may be duplicate column names :
for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
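A quick illustration of that trick (with a hypothetical column name):
i <- "some_col"
DT[, (i) := NULL]   # removes the column whose name is stored in i
# DT[, i := NULL]   # would instead target a column literally named "i"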
Here is a solution to your core problem (I hope I got it right).
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 d1 = "",
                 c = rep(1, times = 10),
                 d2 = "")
dt
     a b d1 c d2
 1:  1 a     1
 2:  2 b     1
 3:  3 c     1
 4:  4 d     1
 5:  5 e     1
 6:  6 f     1
 7:  7 g     1
 8:  8 h     1
 9:  9 i     1
10: 10 j     1
First, I introduce two columns d1 and d2 that have no values whatsoever. Those you want to delete, right? If so, I just identify those columns and select all other columns in the dt.
only_space <- function(x) {
  length(unique(x)) == 1 && x[1] == ""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with=FALSE]
Somehow, I have the feeling that you could further simplify it...
Output:
a b c
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
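As for further simplification, one sketch of the same idea (using the bolCols computed above) is to select the surviving names directly:
dt[, names(dt)[!bolCols], with = FALSE]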
There is an easy way to do that using the dplyr library and its select function:
library(dplyr)
newdata <- select(old_data, first_variable, second_variable)
Note that you can choose as many variables as you like; then you will get the type of data that you want.

Subsetting data.table by not head(key(DT),m), using binary search not vector scan

If I specify n columns as a key of a data.table, I'm aware that I can join to fewer columns than are defined in that key as long as I join to the head of key(DT). For example, for n=2 :
X = data.table(A=rep(1:5, each=2), B=rep(1:2, each=5), key=c('A','B'))
X
A B
1: 1 1
2: 1 1
3: 2 1
4: 2 1
5: 3 1
6: 3 2
7: 4 2
8: 4 2
9: 5 2
10: 5 2
X[J(3)]
A B
1: 3 1
2: 3 2
There I only joined to the first column of the 2-column key of DT. I know I can join to both columns of the key like this :
X[J(3,1)]
A B
1: 3 1
But how do I subset using only the second column of the key (e.g. B==2), but still using binary search not vector scan? I'm aware that's a duplicate of:
Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan
so I'd like to generalise this question to n. My data set has about a million rows, and the solution provided in the duplicate question linked above doesn't seem to be optimal.
Here is a simple function that will extract the correct unique values and return a data table to use as a key.
X <- data.table(A = rep(1:5, each = 4), B = rep(1:4, each = 5),
                C = letters[1:20], key = c('A', 'B', 'C'))
make.key <- function(ddd, what) {
  # the names of the key columns
  zzz <- key(ddd)
  # the key columns for which you wish to keep all unique values
  whichUnique <- setdiff(zzz, names(what))
  ## unique values of each of those columns; .. means "look up one level"
  ud <- lapply(ddd[, ..whichUnique], unique)
  ## append the `what` columns and take a Cross Join of the
  ## key columns, in key order
  do.call(CJ, c(ud, what)[zzz])
}
X[make.key(X, what = list(C = c('a','b'))),nomatch=0]
## A B C
## 1: 1 1 a
## 2: 1 1 b
I'm not sure this will be any quicker than a couple of vector scans on a large data.table though.
Adding secondary keys is on the feature request list :
FR#1007 Build in secondary keys
In the meantime we are stuck with either vector scan, or the approach used in the answer to the n=2 case linked in the question (which #mnel generalises nicely in his answer).
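Update: secondary indices did land in data.table (1.9.6+) as setindex() together with on=, so the B-only subset can now use binary search without rekeying. A sketch:
setindex(X, B)       # build a secondary index on B; the primary key is untouched
X[.(2), on = 'B']    # subsets via the index, i.e. binary search rather than vector scan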
