Fast melted data.table operations

I am looking for patterns for manipulating data.table objects whose structure resembles that of dataframes created with melt from the reshape2 package. I am dealing with data tables with millions of rows. Performance is critical.
The generalized form of the question is whether there is a way to perform grouping based on a subset of values in a column and have the result of the grouping operation create one or more new columns.
A specific form of the question could be how to use data.table to accomplish the equivalent of what dcast does in the following:
input <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c('x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'y', 'other'),
  value    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
dcast(input,
      id ~ variable, sum,
      subset = .(variable %in% c('x', 'y')))
the output of which is
id x y
1 1 1 5
2 2 4 11
3 3 15 9

Quick untested answer: seems like you're looking for by-without-by, a.k.a. grouping-by-i :
setkey(input,variable)
input[c("x","y"),sum(value)]
This is like a fast HAVING in SQL. j gets evaluated for each row of i. In other words, the above gives the same result as, but is much faster than :
input[,sum(value),keyby=variable][c("x","y")]
The latter subsets and evals for all the groups (wastefully) before selecting only the groups of interest. The former (by-without-by) goes straight to the subset of groups only.
The group results will be returned in long format, as always. But reshaping to wide afterwards on the (relatively small) aggregated data should be relatively instant. That's the thinking anyway.
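A note for readers on newer versions of data.table (>= 1.9.4): the implicit by-without-by described above was changed, and grouping per row of i must now be requested explicitly with by = .EACHI. A minimal sketch of the same query under the newer syntax:

```r
library(data.table)

input <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c('x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'y', 'other'),
  value    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
setkey(input, variable)

# join to the groups of interest and evaluate j once per row of i
input[c("x", "y"), sum(value), by = .EACHI]
#    variable V1
# 1:        x 20
# 2:        y 25
```

Without by = .EACHI, the same call on a recent data.table returns a single sum over the joined subset instead of one sum per group.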
The first setkey(input,variable) might bite if input has a lot of columns not of interest. If so, it might be worth subsetting the columns needed :
DT = setkey(input[ , c("variable","value")], variable)
DT[c("x","y"),sum(value)]
In future when secondary keys are implemented that would be easier :
set2key(input,variable) # add a secondary key
input[c("x","y"),sum(value),key=2] # syntax speculative
To group by id as well :
setkey(input,variable)
input[c("x","y"),sum(value),by='variable,id']
and including id in the key might be worth setkey's cost depending on your data :
setkey(input,variable,id)
input[c("x","y"),sum(value),by='variable,id']
If you combine a by-without-by with an explicit by, as above, then the by-without-by operates just like a subset; i.e., j is only run once per row of i when by is missing (hence the name by-without-by). So you need to include variable, again, in the by as shown above.
Alternatively, the following should group by id over the union of "x" and "y" instead (but the above is what you asked for in the question, iiuc) :
input[c("x","y"),sum(value),by=id]
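Putting the pieces together, here is a sketch of the full route back to the wide table the question's dcast call produces: aggregate the groups of interest in long format, then cast the (relatively small) result to wide. This assumes a data.table version that ships its own dcast method (>= 1.9.6); on older versions, reshape2::dcast can cast the aggregated data the same way.

```r
library(data.table)

input <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c('x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'y', 'other'),
  value    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

# aggregate only the groups of interest, still in long format ...
agg <- input[variable %in% c("x", "y"), .(value = sum(value)), by = .(id, variable)]

# ... then reshape the small aggregate to wide
dcast(agg, id ~ variable, value.var = "value")
#    id  x  y
# 1:  1  1  5
# 2:  2  4 11
# 3:  3 15  9
```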

> setkey(input, "id")
> input[ , list(sum(value)), by=id]
id V1
1: 1 6
2: 2 15
3: 3 34
> input[ variable %in% c("x", "y"), list(sum(value)), by=id]
id V1
1: 1 6
2: 2 15
3: 3 24
The last one:
> input[ variable %in% c("x", "y"), list(sum(value)), by=list(id, variable)]
id variable V1
1: 1 x 1
2: 1 y 5
3: 2 x 4
4: 2 y 11
5: 3 x 15
6: 3 y 9

I'm not sure if this is the best way, but you can try:
input[, list(x = sum(value[variable == "x"]),
             y = sum(value[variable == "y"])), by = "id"]
# id x y
# 1: 1 1 5
# 2: 2 4 11
# 3: 3 15 9


Flag duplicate obs based on two ID variables

Updated Example (see RULES)
I have data.table with id1 and id2 columns (as below)
data.table(id1=c(1,1,2,3,3,3,4), id2=c(1,2,2,1,2,3,2))
id1 id2
  1   1
  1   2
  2   2
  3   1
  3   2
  3   3
  4   2
I would like to generate a flag to identify the duplicate association between id1 and id2.
RULE : if a particular id1 is already associated with an id2, then the row should be flagged. One unique id2 should be associated with one id1 only (see explanation below)
a) Looking for an efficient solution, and b) a solution that only uses base R and data.table functions
id1 id2 flag
  1   1
  1   2   Y   <== since id2=1 is associated with id1=1 in 1st row
  2   2
  3   1   Y   <== since id2=1 is associated with id1=1 in 1st row
  3   2   Y   <== since id2=2 is associated with id1=2 in 3rd row
  3   3
  4   2   Y   <== since id2=2 is associated with id1=2 in 3rd row
This is a tricky one. If I understand correctly, my translation of OP's rules is as follows:
For each id1 group, exactly one row is not flagged.
If the id1 group consists only of one row it is not flagged.
Within an id1 group, all rows where id2 has been used in previous groups are flagged.
If more than one row within an id1 group is still unflagged at this point, only the first of those rows is left unflagged; all other rows are flagged.
So, the approach is to
create a vector of available id2 values,
step through the id1 groups,
find the first row within each group whose id2 value has not already been consumed in previous groups,
flag all other rows,
and update the vector of available (not consumed) id2 values.
avail <- unique(DT$id2)
DT[, flag := {
  idx <- max(first(which(id2 %in% avail)), 1L)
  avail <- setdiff(avail, id2)
  replace(rep("Y", .N), idx, "")
}, by = id1][]
id1 id2 flag
1: 1 1
2: 1 2 Y
3: 2 2
4: 3 1 Y
5: 3 2 Y
6: 3 3
7: 4 2
Caveat
The above code reproduces the expected result for the use case provided by the OP. However, there might be other use cases and/or edge cases where the code might need to be tweaked to comply with OP's expectations. E.g., it is unclear what the expected result is for an id1 group whose id2 values have all been consumed in previous groups.
Edit:
The OP has edited the expected result so that row 7 is now flagged as well.
Here is a tweaked version of my code which reproduces the expected result after the edit:
avail <- unique(DT$id2)
DT[, flag := {
  idx <- first(which(id2 %in% avail))
  avail <- setdiff(avail, id2[idx])
  replace(rep("Y", .N), idx, "")
}, by = id1][]
id1 id2 flag
1: 1 1
2: 1 2 Y
3: 2 2
4: 3 1 Y
5: 3 2 Y
6: 3 3
7: 4 2 Y
Data
library(data.table)
DT = data.table(id1 = c(1, 1, 2, 3, 3, 3, 4),
id2 = c(1, 2, 2, 1, 2, 3, 2))
This is a really convoluted chain, but I think it produces the result (the result in your question does not follow your own logic):
library(data.table)
a = data.table(id1=c(1,1,2,3,3,3,4), id2=c(1,2,2,1,2,3,2))
a[, .SD[1, ], by = id2][
  , Noflag := "no"][
  a, on = .(id2, id1)][
  is.na(Noflag), flag := "y"][
  , Noflag := NULL][]
What's in there:
a[, .SD[1, ], by = id2] gets the first row of each subgroup by id2. These rows shouldn't be flagged, so
[, Noflag := "no"] flags them as "not flagged" (go figure. I said it was convoluted). We need to join this no-flagged table with the original one:
[a, on = .(id2, id1)] joins the last table with the original a on both id1 and id2. Now we need to flag the rows that aren't flagged as "shouldn't be flagged":
[is.na(Noflag), flag := "y"]. Last part is to remove the Noflag unnecessary column:
[, Noflag := NULL] and add a [] to display the new table to screen.
I agree with the comment by @akrun regarding igraph being not only more efficient, but also simpler in syntax.
# replicate your data
df <- data.table(id1 = c(1, 1, 2, 3, 3, 3, 4), id2 = c(1, 2, 2, 1, 2, 3, 2))
# create and append a new, empty column that will later be filled with
# the info whether the row is a duplicate or not
df[, duplicate := NA_real_]
# loop through the rows and check whether the current id2 value already
# appeared in any earlier row, then fill the new column accordingly;
# the comparison only starts at row 2, as there are no earlier rows for i = 1
# note: := modifies the data.table by reference; data.frame-style
# assignment like df[i, 3] <- 1 does not work the same way on a data.table
df[1, duplicate := 0]
for (i in 2:nrow(df)) {
  if (any(df$id2[1:(i - 1)] == df$id2[i])) { # id2 seen in an earlier row?
    df[i, duplicate := 1]
  } else {
    df[i, duplicate := 0] # no match in earlier rows: 0, false
  }
}
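For comparison, the check the loop performs ("has this id2 value appeared in any earlier row?") can be done in one vectorized step with duplicated(), avoiding the loop entirely. This is a sketch of that simpler rule, not of the OP's full group-wise rules:

```r
library(data.table)

df <- data.table(id1 = c(1, 1, 2, 3, 3, 3, 4),
                 id2 = c(1, 2, 2, 1, 2, 3, 2))

# duplicated() marks every occurrence of an id2 value after its first
df[, duplicate := as.integer(duplicated(id2))]
df
#    id1 id2 duplicate
# 1:   1   1         0
# 2:   1   2         0
# 3:   2   2         1
# 4:   3   1         1
# 5:   3   2         1
# 6:   3   3         0
# 7:   4   2         1
```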

data.table "out of range", how to add value to new row

While working with data.frame it is simple to insert new value by using row number;
df1 <- data.frame(c(1:3))
df1[4,1] <- 1
> df1
c.1.3.
1 1
2 2
3 3
4 1
It is not working with data.table;
df1 <- data.table(c(1:3))
df1[4,1] <- 1
Error in `[<-.data.table`(`*tmp*`, 4, 1, value = 1) : i[1] is 4 which is out of range [1,nrow=3].
How can I do it?
data.tables were designed to work much faster for some common operations like subset, join, group, sort etc., and as a result differ from data.frames in some ways.
Some operations, like the one you pointed out, will not work on data.tables. You need to use data.table-specific operations instead.
dt1 <- data.table(c(1:3))
# rbindlist returns a new data.table rather than modifying dt1 in place,
# so the result must be assigned back
dt1 <- rbindlist(list(dt1, list(1)), use.names = FALSE)
dt1
# V1
# 1: 1
# 2: 2
# 3: 3
# 4: 1
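Another option, sketched below: since indexing past the last row is an error by design, you can also grow the table with rbind and reassign the result (rbind on data.table objects dispatches to data.table's own method):

```r
library(data.table)

dt1 <- data.table(V1 = 1:3)

# rbind returns a new data.table; assign it back to keep the added row
dt1 <- rbind(dt1, data.table(V1 = 1L))
dt1
#    V1
# 1:  1
# 2:  2
# 3:  3
# 4:  1
```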

R summing up total for each class for each id

Say I have a dataset like this:
df <- data.frame(id = c(1, 1, 1, 2, 2),
                 classname = c("Welding", "Welding", "Auto", "HVAC", "Plumbing"),
                 hours = c(3, 2, 4, 1, 2))
I.e.,
id classname hours
1 1 Welding 3
2 1 Welding 2
3 1 Auto 4
4 2 HVAC 1
5 2 Plumbing 2
I'm trying to figure out how to summarize the data in a way that gives me, for each id, a list of the classes they took as well as how many hours of each class. I would want these to be in a list so I can keep it one row per id. So, I would want it to return:
id class.list class.hours
1 1 Welding, Auto 5,4
2 2 HVAC, Plumbing 1,2
I was able to figure out how to get it to return the class.list.
library(dplyr)
classes <- df %>%
group_by(id) %>%
summarise(class.list = list(unique(as.character(classname))))
This gives me:
id class.list
1 1 Welding, Auto
2 2 HVAC, Plumbing
But I'm not sure how I could get it to sum the number of hours for each of those classes (class.hours).
Thanks for your help!
In base R, this can be accomplished with two calls to aggregate. The inner call sums the hours and the outer call "concatenates" the hours and the class names. In the outer call of aggregate, cbind is used to include both the hours and the class names in the output, and also to provide the desired variable names.
# convert class name to character variable
df$classname <- as.character(df$classname)
# aggregate
aggregate(cbind("class.hours" = hours, "class.list" = classname) ~ id,
          data = aggregate(hours ~ id + classname, data = df, FUN = sum), toString)
id class.hours class.list
1 1 4, 5 Auto, Welding
2 2 1, 2 HVAC, Plumbing
In data.table, roughly the same output is produced with a chained statement.
setDT(df)[, .(hours=sum(hours)), by=.(id, classname)][, lapply(.SD, toString), by=id]
id classname hours
1: 1 Welding, Auto 5, 4
2: 2 HVAC, Plumbing 1, 2
The variable names could then be set using the data.table setnames function.
This is how you could do it using dplyr:
classes <- df %>%
  group_by(id, classname) %>%
  summarise(hours = sum(hours)) %>%
  summarise(class.list = list(unique(as.character(classname))),
            class.hours = list(hours))
The first summarise peels off the last grouping variable (classname). It is no longer necessary to use unique() at that point, but I kept it in there to match the part you already had.

How to join a data.table with multiple columns and multiple values

An example case is here:
DT = data.table(x=1:4, y=6:9, z=3:6)
setkey(DT, x, y)
Join columns have multiple values:
xc = c(1, 2, 4)
yc = c(6, 9)
DT[J(xc, yc), nomatch=0]
x y z
1: 1 6 3
This use of J() returns only single row. Actually, I want to join as %in% operator.
DT[x %in% xc & y %in% yc]
x y z
1: 1 6 3
2: 4 9 6
But using %in% operator makes the search a vector scan which is very slow compared to binary search. In order to have binary search, I build every possible combination of join values:
xc2 = rep(xc, length(yc))
yc2 = unlist(lapply(yc, rep, length(xc)))
DT[J(xc2, yc2), nomatch=0]
x y z
1: 1 6 3
2: 4 9 6
But building xc2, yc2 in this way makes code difficult to read. Is there a better way to have the speed of binary search and the simplicity of %in% operator in this case?
Answering to remove this question from DT tag open questions.
Code from Arun's comment DT[CJ(xc,yc), nomatch=0L] will do the job.
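For completeness, CJ() builds the cross join of its arguments, i.e. every (x, y) combination, which is exactly the set of keys the %in% version matches; the keyed table is then binary-searched for those combinations. A small sketch (integer vectors used here to match DT's integer key columns):

```r
library(data.table)

DT <- data.table(x = 1:4, y = 6:9, z = 3:6)
setkey(DT, x, y)

xc <- c(1L, 2L, 4L)
yc <- c(6L, 9L)

CJ(xc, yc)  # all 6 combinations, sorted and keyed, ready for a binary-search join
DT[CJ(xc, yc), nomatch = 0L]
#    x y z
# 1: 1 6 3
# 2: 4 9 6
```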

Number of Unique Obs by Variable in a Data Table

I have read in a large data file into R using the following command
data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))
The data set contains columns which should not belong, and contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).
Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.
My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 c = rep(1, times = 10))
### Create a comparable data.frame
df <- data.frame(dt)
### Expected result
unique(dt$a)
### Expected result
length(unique(dt$a))
However, I wish to calculate the number of obs for a large data file, so referencing each column by name is not desired. I am not a fan of eval(parse()).
### I want to determine the number of unique obs in
# each variable, for a large list of vars
lapply(names(df), function(x) {
  length(unique(df[, x]))
})
### Unexpected result
length(unique(dt[, 'a', with = F])) # Returns 1
It seems to me the problem is that
dt[, 'a', with = F]
returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
Here's pseudo code for how I would remedy the solution, using the data.frame way:
for (x in names(data)) {
unique.obs <- length(unique(data[, x]))
if (unique.obs == 1) {
data[, x] <- NULL
}
}
Any insight as to how I may more efficiently ask for the number of unique observations by column in a data.table would be much appreciated. Alternatively, if you can recommend how to drop observations if there is only one unique observation within a data.table would be even better.
Update: uniqueN
As of version 1.9.6, there is a built in (optimized) version of this solution, the uniqueN function. Now this is as simple as:
dt[ , lapply(.SD, uniqueN)]
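Building on uniqueN, the drop step can then be done in one pass: compute the names of the single-valued columns first, then remove them by reference with := (a sketch):

```r
library(data.table)

dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 c = rep(1, times = 10))

# names of columns holding exactly one distinct value
drop <- names(which(dt[, sapply(.SD, uniqueN)] == 1L))
if (length(drop)) dt[, (drop) := NULL]
names(dt)
# [1] "a" "b"
```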
If you want to find the number of unique values in each column, something like
dt[, lapply(.SD, function(x) length(unique(x)))]
## a b c
## 1: 10 10 1
To get your function to work you need to use with = FALSE within [.data.table and count rows rather than length (a one-column data.table has length 1, its number of columns), or simply use [[ instead (read fortune(312) as well...)
lapply(names(dt), function(x) nrow(unique(dt[, x, with = FALSE])))
or
lapply(names(dt), function(x) length(unique(dt[[x]])))
will work
In one step
dt[,names(dt) := lapply(.SD, function(x) if(length(unique(x)) ==1) {return(NULL)} else{return(x)})]
# or to avoid calling `.SD`
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]]))==1) := NULL]
The approaches in the other answers are good. Another way to add to the mix, just for fun :
for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
or if there may be duplicate column names :
for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
Here is a solution to your core problem (I hope I got it right).
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 d1 = "",
                 c = rep(1, times = 10),
                 d2 = "")
dt
a b d1 c d2
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
First, I introduce two columns d1 and d2 that have no values whatsoever. Those are the ones you want to delete, right? If so, I just identify those columns and select all the other columns in dt.
only_space <- function(x) {
  length(unique(x)) == 1 && x[1] == ""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with = FALSE]
Somehow, I have the feeling that you could further simplify it...
Output:
a b c
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
There is an easy way to do that using the "dplyr" library and its select function:
library(dplyr)
newdata <- select(old_data, first_variable, second_variable)
Note that you can choose as many variables as you like. Then you will get the type of data that you want.
