Row conditional column operations in data.table - r

I have a large data.table where, for each row, I need to make a computation based on part of the full data.table. As an example, consider the following data.table, and assume that for each row I want to compute the sum of the num variable over all rows where id2 matches the current row's id1 and the time variable is within distance 2 of the current row's time.
library(data.table)
set.seed(123)
dat <- data.table(cbind(id1=sample(1:5,10,replace=T),
id2=sample(1:5,10,replace=T),
num=sample(1:10,10,replace=T),
time=sample(1:10,10,replace=T)))
This could easily be done by looping over each row, like this:
dat[,val:= 0]
for (i in 1:nrow(dat)){
this.val <- dat[ (id2==id1[i]) & (time>=time[i]-2) & (time<=time[i]+2),sum(num)]
dat[i,val:=this.val]
}
dat
The resulting data.table looks like this:
> dat
id1 id2 num time val
1: 2 5 9 10 6
2: 4 3 7 10 0
3: 3 4 7 7 10
4: 5 3 10 8 9
5: 5 1 7 1 2
6: 1 5 8 5 6
7: 3 2 6 8 17
8: 5 1 6 3 10
9: 3 2 3 4 0
10: 3 5 2 3 0
What is the proper/fast way to do things like this using data.table?

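We can use a self-join here: build a lookup table holding 'id1' together with the bounds 'timeminus2' and 'timeplus2', join it to the original data with 'id2' matched to 'id1' plus the non-equi conditions on 'time', compute the sum of 'num' once per lookup row (by = .EACHI), and assign (:=) the result to the 'val' column of the original dataset. Rows without any match give NA, which is replaced by 0.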
tmp <- dat[.(id1 = id1, timeminus2 = time - 2, timeplus2 = time + 2),
.(val = sum(num)),
on = .(id2 = id1, time >= timeminus2, time <= timeplus2),
by = .EACHI
][is.na(val), val := 0][]
dat[, val := tmp$val][]
# id1 id2 num time val
# 1: 2 5 9 10 6
# 2: 4 3 7 10 0
# 3: 3 4 7 7 10
# 4: 5 3 10 8 9
# 5: 5 1 7 1 2
# 6: 1 5 8 5 6
# 7: 3 2 6 8 17
# 8: 5 1 6 3 10
# 9: 3 2 3 4 0
#10: 3 5 2 3 0
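As a sanity check, the join can be compared with a row-by-row computation on a small hand-made table. This is only a minimal sketch: the four rows are made-up values (not the question's data), and the names bounds, lo, hi, loop_val and join_val are chosen here for illustration.

```r
library(data.table)

# Made-up example data (not the question's data)
dat <- data.table(id1  = c(1L, 2L, 1L, 3L),
                  id2  = c(2L, 1L, 1L, 1L),
                  num  = c(10L, 20L, 30L, 40L),
                  time = c(1L, 2L, 5L, 3L))

# Reference result: one subset-and-sum per row
loop_val <- sapply(seq_len(nrow(dat)), function(i) {
  sum(dat$num[dat$id2 == dat$id1[i] & abs(dat$time - dat$time[i]) <= 2L])
})

# Lookup table: one row per row of dat, holding the key and the time window
bounds <- dat[, .(id1, lo = time - 2L, hi = time + 2L)]

# Non-equi self-join; by = .EACHI evaluates j once per row of bounds
tmp <- dat[bounds,
           .(val = sum(num)),
           on = .(id2 = id1, time >= lo, time <= hi),
           by = .EACHI]

# Rows of bounds without any match yield NA; treat them as a sum of 0
join_val <- tmp$val
join_val[is.na(join_val)] <- 0L

dat[, val := join_val]
```

If the join is set up correctly, loop_val and join_val agree element by element, which is a cheap test to run on a sample before applying the join to the full table.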

Related

Fill Missing Values

data=data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,3))
library(dplyr);library(tidyverse)
data$timeWANTattempt=data$timeHAVE
data <- data %>%
group_by(student) %>%
fill(timeWANTattempt)+3
I have 'timeHAVE' and I want to replace missing times with the previous time +3. I show my dplyr attempt but it does not work. I seek a data.table solution. Thank you.
You can try:
data %>%
group_by(student) %>%
mutate(n_na = cumsum(is.na(timeHAVE))) %>%
mutate(timeHAVE = ifelse(is.na(timeHAVE), timeHAVE[n_na == 0 & lead(n_na) == 1] + 3*n_na, timeHAVE))
student timeHAVE timeWANT n_na
<dbl> <dbl> <dbl> <int>
1 1 1 1 0
2 1 4 4 0
3 1 7 7 0
4 1 10 10 0
5 2 2 2 0
6 2 5 5 0
7 2 8 8 1
8 2 11 11 1
9 3 6 6 0
10 3 9 9 1
11 3 12 12 2
12 3 15 15 3
13 4 3 3 0
I included the little helper n_na, which is a running count of the NAs within each student. The second mutate then multiplies this count by three and adds it to the last non-NA element before the NAs.
Here's an approach using 'locf' filling
setDT(data)
data[ , by = student, timeWANT := {
# carry previous observations forward whenever missing
locf_fill = nafill(timeHAVE, 'locf')
# every next NA, the amount shifted goes up by another 3
na_shift = cumsum(idx <- is.na(timeHAVE))
# add the shift, but only where the original data was missing
locf_fill[idx] = locf_fill[idx] + 3*na_shift[idx]
# return the full vector
locf_fill
}]
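Be warned that this won't work if a given student can have more than one non-consecutive set of NA values in timeHAVE, because na_shift keeps counting across the whole group instead of restarting at each run of NAs.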
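One possible way around that limitation is to restart the count at every run of NAs. The following is a sketch under assumptions, not part of the answer above: the six-row table is made up, the column name filled is hypothetical, and it relies on rleid() and rowid() from data.table to number positions within each run.

```r
library(data.table)

# Made-up data: one student with two separate runs of NAs
data <- data.table(student  = c(1, 1, 1, 1, 1, 1),
                   timeHAVE = c(1, NA, 7, NA, NA, 20))

data[, filled := {
  idx <- is.na(timeHAVE)
  # label each run of consecutive NA / non-NA values
  run <- rleid(idx)
  # position inside the run: 1, 2, ... (restarts at every new run)
  k <- rowid(run)
  # carry the last observed value forward
  out <- nafill(timeHAVE, "locf")
  # add 3 per step, counted from the start of the current NA run
  out[idx] <- out[idx] + 3 * k[idx]
  out
}, by = student]
```

A leading NA within a student stays NA, since there is no previous value to carry forward.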
Another data.table option without grouping:
setDT(data)[, w := fifelse(is.na(timeHAVE) & student==shift(student),
nafill(timeHAVE, "locf") + 3L * rowid(rleid(timeHAVE)),
timeHAVE)]
output:
student timeHAVE timeWANT w
1: 1 1 1 1
2: 1 4 4 4
3: 1 7 7 7
4: 1 10 10 10
5: 2 2 2 2
6: 2 5 5 5
7: 2 NA 8 8
8: 2 11 11 11
9: 3 6 6 6
10: 3 NA 9 9
11: 3 NA 12 12
12: 3 NA 15 15
13: 4 NA NA NA
14: 4 3 3 3
data with student=4 having NA for the first timeHAVE:
data = data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,NA,3))

Create columns with different rules with data.table in r

I'm trying to better understand the data.table package in R. I want to do different types of calculation with some columns and assign the results to new columns with specific names. Here is an example:
set.seed(122)
df <- data.frame(rain = rep(5,10),temp=1:10, skip = sample(0:2,10,T),
windw_sz = sample(1:2,10,T),city =c(rep("a",5),rep("b",5)),ord=rep(sample(1:5,5),2))
df <- as.data.table(df)
vars <- c("rain","temp")
df[, paste0("mean.",vars) := lapply(mget(vars),mean), by="city" ]
This works just fine. But now I also want to calculate the sum of these variables, so I try:
df[, c(paste0("mean.",vars), paste("sum.",vars)) := list( lapply(mget(vars),mean),
lapply(mget(vars),sum)), by="city" ]
and I get an error.
How could I implement this last part?
Thanks a lot!
Instead of wrapping with list, we can use c. The lapply output is already a list, so wrapping two of them in list returns a list of lists, whereas c concatenates the two lists end to end (compare c(as.list(1:5), as.list(6:10)) with list(as.list(1:5), as.list(6:10))). Note also that paste("sum.", vars) inserts a space into the names, so paste0 is used below. Finally, instead of mget, make use of .SDcols.
library(data.table)
df[, paste0(rep(c("mean.", "sum."), each = 2), vars) :=
c(lapply(.SD, mean), lapply(.SD, sum)), by = .(city), .SDcols = vars]
df
# rain temp skip windw_sz city ord mean.rain mean.temp sum.rain sum.temp
# 1: 5 1 0 2 a 2 5 3 25 15
# 2: 5 2 1 1 a 5 5 3 25 15
# 3: 5 3 2 2 a 3 5 3 25 15
# 4: 5 4 2 1 a 4 5 3 25 15
# 5: 5 5 2 2 a 1 5 3 25 15
# 6: 5 6 0 1 b 2 5 8 25 40
# 7: 5 7 2 2 b 5 5 8 25 40
# 8: 5 8 1 2 b 3 5 8 25 40
# 9: 5 9 2 1 b 4 5 8 25 40
#10: 5 10 2 2 b 1 5 8 25 40
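The difference between list() and c() described in this answer can be seen without data.table at all. This small base R sketch uses made-up vectors; the names vals, nested and flat are chosen here for illustration.

```r
# Made-up input: two named vectors
vals <- list(rain = rep(5, 5), temp = 1:5)

means <- lapply(vals, mean)   # a list with 2 elements
sums  <- lapply(vals, sum)    # a list with 2 elements

nested <- list(means, sums)   # a list containing 2 lists
flat   <- c(means, sums)      # a single list with 4 elements

length(nested)
length(flat)
names(flat)
```

Since := expects one list element per new column, the flat form lines up with four column names while the nested form does not.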

R data table: Assign a value to column based on reference column

I would like to assign a value into a column from a larger table, using another column as a reference.
E.g. data:
require(data.table)
dt <- data.table(N=c(1:5),GPa1=c(sample(0:5,5)),GPa2=c(sample(5:15,5)),
GPb1=c(sample(0:20,5)),GPb2=c(sample(0:10,5)),id=c("b","a","b","b","a"))
N GPa1 GPa2 GPb1 GPb2 id
1: 1 4 10 7 0 b
2: 2 5 15 19 7 a
3: 3 1 5 20 5 b
4: 4 0 13 3 4 b
5: 5 3 7 8 1 a
The idea is to get new columns Val1 and Val2. Any GP column ending in 1 is eligible for Val1 and any ending in 2 is eligible for Val2. The value to be inserted into the column is determined by the id column, per row.
So you can see for Val1, you'd draw on the GPb1 column, then GPa1, GPb1, GPb1 again and finally GPa1.
The final result would be:
N GPa1 GPa2 GPb1 GPb2 id Val1 Val2
1: 1 4 10 7 0 b 7 0
2: 2 5 15 19 7 a 5 15
3: 3 1 5 20 5 b 20 5
4: 4 0 13 3 4 b 3 4
5: 5 3 7 8 1 a 3 7
I did achieve the answer, but in quite a few lines after melting etc., and I'm sure there must be an elegant way to do this in data.table. I was initially frustrated by the fact that paste0 doesn't select a column in data.table (it just returns the pasted string):
dt[1,paste0("GP",id,"1")]
but:
# The following gives a vector that is correct for Val1 (and works for 2)
diag(as.matrix(dt[,.SD,.SDcols=dt[,paste0("GP",id,"1")]]))
# I think the answer lies in `set`, but i've not had any luck.
for (i in 1:nrow(dt)) set(dt, i=dt[i,.SD,.SDcols=dt[,paste0("GP",id,"2")]], j=i, value=0)
The data is quite ugly this way so perhaps it's better to just use the melt method.
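# Update by reference, one subset per id value:
# rows with id "a" take the GPa columns, rows with id "b" take the GPb columns
dt[id == "a", c("Val1", "Val2") := .(GPa1, GPa2)]
dt[id == "b", c("Val1", "Val2") := .(GPb1, GPb2)]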
# N GPa1 GPa2 GPb1 GPb2 id Val1 Val2
#1: 1 2 13 5 8 b 5 8
#2: 2 3 8 7 2 a 3 8
#3: 3 5 11 19 1 b 19 1
#4: 4 4 5 6 9 b 6 9
#5: 5 1 15 1 10 a 1 15
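With more than two id values, writing one line per id gets tedious. A possible generalisation, sketched here under the assumption that the columns always follow the GP<id>1 / GP<id>2 naming pattern, is to loop over the distinct ids and pick the source columns with .SDcols. The values below are copied from the table printed in the question; the loop variable g and the vector src are illustrative names.

```r
library(data.table)

# Values copied from the table printed in the question
dt <- data.table(N    = 1:5,
                 GPa1 = c(4L, 5L, 1L, 0L, 3L),
                 GPa2 = c(10L, 15L, 5L, 13L, 7L),
                 GPb1 = c(7L, 19L, 20L, 3L, 8L),
                 GPb2 = c(0L, 7L, 5L, 4L, 1L),
                 id   = c("b", "a", "b", "b", "a"))

for (g in unique(dt$id)) {
  # source columns for this id, e.g. "GPb1" and "GPb2"
  src <- paste0("GP", g, 1:2)
  # update only the rows with this id, by reference
  dt[id == g, c("Val1", "Val2") := .SD, .SDcols = src]
}
```

The loop runs once per distinct id rather than once per row, so it should stay cheap as long as the number of ids is small.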

Removing certain rows and replacing values based on a condition

I have the following data:
set.seed(2)
d <- data.frame(iteration=c(1,1,2,2,2,3,4,5,6,6,6),
value=sample(11),
var3=sample(11))
iteration value var3
1 1 3 7
2 1 8 4
3 2 6 8
4 2 2 3
5 2 7 9
6 3 9 11
7 4 1 10
8 5 4 1
9 6 10 2
10 6 11 6
11 6 5 5
Now, I want the following:
1. If an iteration has more than one row, remove its last row AND replace the value of the preceding row with the value from the removed row.
So in the example above here is the output that I want:
d<-data.frame(iteration=c(1,2,2,3,4,5,6,6),
value=c(8,6,7,9,1,4,10,5))
iteration value var3
1 1 8 7
2 2 6 8
3 2 7 3
4 3 9 11
5 4 1 10
6 5 4 1
7 6 10 2
8 6 5 6
We can use data.table
library(data.table)
setDT(d)[, .(value = if(.N>1) c(value[seq_len(.N-2)], value[.N]) else value), iteration]
# iteration value
#1: 1 8
#2: 2 6
#3: 2 7
#4: 3 9
#5: 4 1
#6: 5 4
#7: 6 10
#8: 6 5
Update
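Based on the update in the OP's post, we can first create a new column 'value1' holding the lead of 'value' within each 'iteration', assign 'value1' to 'value' only for the rows indexed by 'i1' (the second-to-last row of each iteration with more than one row), and then drop the last row of those iterations.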
setDT(d)[, value1 := shift(value, type = "lead"), iteration]
i1 <- d[, if(.N >1) .I[.N-1], iteration]$V1
d[i1, value := value1]
d[d[, if(.N > 1) .I[-.N] else .I, iteration]$V1][, value1 := NULL][]
# iteration value var3
#1: 1 8 7
#2: 2 6 8
#3: 2 7 3
#4: 3 9 11
#5: 4 1 10
#6: 5 4 1
#7: 6 10 2
#8: 6 5 6
This base R solution using the split-apply-combine methodology returns the same values as @akrun's data.table version, although the logic is different.
do.call(rbind, lapply(split(d, d$iteration),
function(i)
if(nrow(i) >= 3) i[-(nrow(i)-1),] else tail(i, 1)))
iteration value
1 1 8
2.3 2 6
2.5 2 7
3 3 9
4 4 1
5 5 4
6.9 6 10
6.11 6 5
The idea is to split the data.frame into a list of data.frames along iteration, then for each data.frame check whether it has at least 3 rows: if yes, drop the second-to-last row; if no, return only the final row. do.call with rbind then combines these pieces into a single data.frame.
Note that this will not work in the presence of other variables, because the other columns of the final row are kept instead of those of the preceding row.
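A split-apply-combine variant that also keeps the other columns of the preceding row could look like the sketch below. It is not the answer's code: it copies the last value into the second-to-last row and then drops the last row. The data are the first six rows printed in the question, and res and g are illustrative names.

```r
# First six rows of the data printed in the question
d <- data.frame(iteration = c(1, 1, 2, 2, 2, 3),
                value     = c(3, 8, 6, 2, 7, 9),
                var3      = c(7, 4, 8, 3, 9, 11))

res <- do.call(rbind, lapply(split(d, d$iteration), function(g) {
  n <- nrow(g)
  if (n > 1) {
    # move the last value into the preceding row ...
    g$value[n - 1] <- g$value[n]
    # ... and drop the last row, keeping all other columns as they are
    g <- g[-n, , drop = FALSE]
  }
  g
}))
rownames(res) <- NULL
res
```

Groups with a single row pass through unchanged, which matches iterations 3 to 5 in the desired output.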

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove the rows whose id occurs fewer than 5 times. The variable "id" is the grouping variable, and a group should be deleted when it has fewer than 5 rows. In DT, I need to determine which groups have fewer than 5 members (groups "1" and "4") and then remove those rows.
    x y v id
 1: a 3 5  2
 2: a 6 6  2
 3: b 1 7  2
 4: b 3 8  2
 5: b 6 9  2
 6: b 1 1  3
 7: b 3 2  3
 8: b 6 3  3
 9: c 1 4  3
10: c 3 5  3
11: c 6 6  3
Here's an approach....
Get the length of the factors, and the factors to keep
nFactors<-tapply(DT$id,DT$id,length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data table answer
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
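Another commonly used data.table idiom for the same filter is to return .SD only for groups that are large enough. This is a sketch rather than part of the answer above; DT is rebuilt with explicit rep() calls so the block runs on its own, and kept is an illustrative name.

```r
library(data.table)

# Same data as in the question, with the recycling written out
DT <- data.table(x  = rep(c("a", "b", "c"), each = 6),
                 y  = rep(c(1, 3, 6), times = 6),
                 v  = rep(1:9, times = 2),
                 id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))

# For each id, return the group's rows only if it has at least 5 of them;
# groups that fail the test return NULL and drop out of the result
kept <- DT[, if (.N >= 5L) .SD, by = id]
kept
```

Note that the grouping column id comes first in the result, so the column order differs from the original table.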
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)
