Create columns with different rules with data.table in r - r

I'm trying to better understant the data.table package in r. I want to do different types of calculation with some columns and assign the result to new columns with specific names. Here is an example:
set.seed(122)
df <- data.frame(rain = rep(5,10),temp=1:10, skip = sample(0:2,10,T),
windw_sz = sample(1:2,10,T),city =c(rep("a",5),rep("b",5)),ord=rep(sample(1:5,5),2))
df <- as.data.table(df)
vars <- c("rain","temp")
df[, paste0("mean.",vars) := lapply(mget(vars),mean), by="city" ]
This works just fine. But now I also want to calculate the sum of these variables, so I try:
df[, c(paste0("mean.",vars), paste("sum.",vars)) := list( lapply(mget(vars),mean),
lapply(mget(vars),sum)), by="city" ]
and I get an error.
How could I implement this last part?
Thanks a lot!

Instead of list wrap, we can do a c as the lapply output is a list, and when do list as wrapper, it returns a list of list. However, with c, it concats two list end to end (i.e. c(as.list(1:5), as.list(6:10)) as opposed to list(as.list(1:5), as.list(6:10))) and instead of mget, make use of .SDcols
library(data.table)
df[, paste0(rep(c("mean.", "sum."), each = 2), vars) :=
c(lapply(.SD, mean), lapply(.SD, sum)), by = .(city), .SDcols = vars]
df
# rain temp skip windw_sz city ord mean.rain mean.temp sum.rain sum.temp
# 1: 5 1 0 2 a 2 5 3 25 15
# 2: 5 2 1 1 a 5 5 3 25 15
# 3: 5 3 2 2 a 3 5 3 25 15
# 4: 5 4 2 1 a 4 5 3 25 15
# 5: 5 5 2 2 a 1 5 3 25 15
# 6: 5 6 0 1 b 2 5 8 25 40
# 7: 5 7 2 2 b 5 5 8 25 40
# 8: 5 8 1 2 b 3 5 8 25 40
# 9: 5 9 2 1 b 4 5 8 25 40
#10: 5 10 2 2 b 1 5 8 25 40

Related

R data table: Assign a value to column based on reference column

I would like to assign a value into a column from a larger table, using another column as a reference.
E.g. data:
require(data.table)
dt <- data.table(N=c(1:5),GPa1=c(sample(0:5,5)),GPa2=c(sample(5:15,5)),
GPb1=c(sample(0:20,5)),GPb2=c(sample(0:10,5)),id=c("b","a","b","b","a"))
N GPa1 GPa2 GPb1 GPb2 id
1: 1 4 10 7 0 b
2: 2 5 15 19 7 a
3: 3 1 5 20 5 b
4: 4 0 13 3 4 b
5: 5 3 7 8 1 a
The idea is to get new columns Val1 and Val2. Any GP column ending in 1 is eligible for Val1 and any ending in 2 is eligible for Val2. The value to be insterted into the column is determined by the id column, per row.
So you can see for Val1, you'd draw on the GPb1 column, then GPa1, GPb1, GPb1 again and finally GPa1.
The final result would be;
N GPa1 GPa2 GPb1 GPb2 id Val1 Val2
1: 1 4 10 7 0 b 7 0
2: 2 5 15 19 7 a 5 15
3: 3 1 5 20 5 b 20 5
4: 4 0 13 3 4 b 3 4
5: 5 3 7 8 1 a 3 7
I did achieve the answer but in quite a few lines after melting it etc, but i'm sure there must be an elegant way to do this in data.table. I was initially frustrated by the fact paste0 doesn't work in data.table;
dt[1,paste0("GP",id,"1")]
but;
# The following gives a vector that is correct for Val1 (and works for 2)
diag(as.matrix(dt[,.SD,.SDcols=dt[,paste0("GP",id,"1")]]))
# I think the answer lies in `set`, but i've not had any luck.
for (i in 1:nrow(dt)) set(dt, i=dt[i,.SD,.SDcols=dt[,paste0("GP",id,"2")]], j=i, value=0)
The data is quite ugly this way so perhaps it's better to just use the melt method.
dt[id == "a", c("Val1", "Val2") := .(GPa1, GPa2)]
dt[id == "b", c("Val1", "Val2") := .(GPb1, GPb2)]
# N GPa1 GPa2 GPb1 GPb2 id Val1 Val2
#1: 1 2 13 5 8 b 5 8
#2: 2 3 8 7 2 a 3 8
#3: 3 5 11 19 1 b 19 1
#4: 4 4 5 6 9 b 6 9
#5: 5 1 15 1 10 a 1 15

Differences to earlier instance within group or external variable

I have data
dat1 <- data.table(id=1:9,
group=c(1,1,2,2,2,3,3,3,3),
t=c(14,17,20,21,26,89,90,95,99),
index=c(1,2,1,2,3,1,2,3,4)
)
and I would like to compute the difference on t to the previous value, according to index. For the first instance of each group, I would like to compute the difference to some external variable
dat2 <- data.table(group=c(1,2,3),
start=c(10,15,80)
)
such that the following result should be obtained:
> res
id group t index dif
1: 1 1 14 1 4
2: 2 1 17 2 3
3: 3 2 20 1 5
4: 4 2 21 2 1
5: 5 2 26 3 5
6: 6 3 89 1 9
7: 7 3 90 2 1
8: 8 3 95 3 5
9: 9 3 99 4 4
I have tried using
dat1[ , ifelse(index == min(index), dif := t - dat2$start, dif := t - t[-1]), by = group]
but I was unsure about referencing other elements of the same group and external elements in one step. Is this at all possible using data.table?
A possible solution:
dat1[, dif := ifelse(index == min(index),
t - dat2$start[match(.BY, dat2$group)],
t - shift(t))
, by = group][]
which gives:
id group t index dif
1: 1 1 14 1 4
2: 2 1 17 2 3
3: 3 2 20 1 5
4: 4 2 21 2 1
5: 5 2 26 3 5
6: 6 3 89 1 9
7: 7 3 90 2 1
8: 8 3 95 3 5
9: 9 3 99 4 4
Or a variant as proposed by #jogo in the comments which avoids the ifelse:
dat1[, dif := t - shift(t), by = group
][index == 1, dif := t - dat2[group==.BY, start], by = group][]
I would try to avoid ifelse and use data.tables efficient join-capabilities:
dat1[dat2, on = "group", # join on group
start := i.start][, # add start value
diff := diff(c(start[1L], t)), by = group][, # compute difference
start := NULL] # remove start value
The resulting table is:
# id group t index diff
#1: 1 1 14 1 4
#2: 2 1 17 2 3
#3: 3 2 20 1 5
#4: 4 2 21 2 1
#5: 5 2 26 3 5
#6: 6 3 89 1 9
#7: 7 3 90 2 1
#8: 8 3 95 3 5
#9: 9 3 99 4 4
You may use shift with a dynamic fill argument: Index 'dat2' with .BY to get 'start' values for each 'group':
dat1[ , dif := t - shift(t, fill = dat2[group == .BY, start]), by = group]
# id group t index dif
# 1: 1 1 14 1 4
# 2: 2 1 17 2 3
# 3: 3 2 20 1 5
# 4: 4 2 21 2 1
# 5: 5 2 26 3 5
# 6: 6 3 89 1 9
# 7: 7 3 90 2 1
# 8: 8 3 95 3 5
# 9: 9 3 99 4 4
Alternatively, you can do this in steps. Probably a matter of taste, but I find it more transparent than the ifelse way.
First the 'normal' shift. Then add an 'index' variable to 'dat2' and do an update join.
dat1[ , dif := t - shift(t), by = group]
dat2[ , index := 1]
dat1[dat2, on = .(group, index), dif := t - start]

Row conditional column operations in data.table

I have a large data.table where I for each row need to make computations based on part of the full data.table. As an example consider the following data.table, and assume I for each row want to compute the sum of the num variable for every rows where id2 matches id1 for the current row as well as the time variable is within distance 1 from the time of the current row.
set.seed(123)
dat <- data.table(cbind(id1=sample(1:5,10,replace=T),
id2=sample(1:5,10,replace=T),
num=sample(1:10,10,replace=T),
time=sample(1:10,10,replace=T)))
This could easily be done by looping over each row like this
dat[,val:= 0]
for (i in 1:nrow(dat)){
this.val <- dat[ (id2==id1[i]) & (time>=time[i]-2) & (time<=time[i]+2),sum(num)]
dat[i,val:=this.val]
}
dat
The resulting data.table looks like this:
> dat
id1 id2 num time val
1: 2 5 9 10 6
2: 4 3 7 10 0
3: 3 4 7 7 10
4: 5 3 10 8 9
5: 5 1 7 1 2
6: 1 5 8 5 6
7: 3 2 6 8 17
8: 5 1 6 3 10
9: 3 2 3 4 0
10: 3 5 2 3 0
What is the proper/fast way to do things like this using data.table?
We can use a self-join here by creating the 'timeminus2' and 'timeplus2' column, join on by 'id2' with 'id1' and the non-equi logical condition to get the sum of 'num' and assign (:=) the 'val' column to the original dataset
tmp <- dat[.(id1 = id1, timeminus2 = time - 2, timeplus2 = time + 2),
.(val = sum(num)),
on = .(id2 = id1, time >= timeminus2, time <= timeplus2),
by = .EACHI
][is.na(val), val := 0][]
dat[, val := tmp$val][]
# id1 id2 num time val
# 1: 2 5 9 10 6
# 2: 4 3 7 10 0
# 3: 3 4 7 7 10
# 4: 5 3 10 8 9
# 5: 5 1 7 1 2
# 6: 1 5 8 5 6
# 7: 3 2 6 8 17
# 8: 5 1 6 3 10
# 9: 3 2 3 4 0
#10: 3 5 2 3 0

combine two different dimension of dataframes to one dataframe

I have a problem to combine two different dimension dataframes which each dataframe has huge rows. Let's say, the sample of my dataframes are d and e, and new expected dataframe is de. I would like to make pair between all value in same row both in d and e, and construct those pairs in a new dataframe (de). Any idea/help for solving my problem is really appreciated. Thanks
> d <- data.frame(v1 = c(1,3,5), v2 = c(2,4,6))
> d
v1 v2
1 1 2
2 3 4
3 5 6
> e <- data.frame(v1 = c(11, 14), v2 = c(12,15), v3=c(13,16))
> e
v1 v2 v3
1 11 12 13
2 14 15 16
> de <- data.frame(x = c(1,1,1,2,2,2,3,3,3,4,4,4), y = c(11,12,13,11,12,13,14,15,16,14,15,16))
> de
x y
1 1 11
2 1 12
3 1 13
4 2 11
5 2 12
6 2 13
7 3 14
8 3 15
9 3 16
10 4 14
11 4 15
12 4 16
One solution is to "melt" d and e into long format, then merge, then get rid of the extra columns. If you have very large datasets, data tables are much faster (no difference for this tiny dataset).
library(reshape2) # for melt(...)
library(data.table)
# add id column
d <- cbind(id=1:nrow(d),d)
e <- cbind(id=1:nrow(e),e)
# melt to long format
d.melt <- data.table(melt(d,id.vars="id"), key="id")
e.melt <- data.table(melt(e,id.vars="id"), key="id")
# data table join, remove extra columns
result <- d.melt[e.melt, allow.cartesian=T]
result[,":="(id=NULL,variable=NULL,variable.1=NULL)]
setnames(result,c("x","y"))
setkey(result,x,y)
result
x y
1: 1 12
2: 1 13
3: 1 14
4: 2 12
5: 2 13
6: 2 14
7: 3 15
8: 3 16
9: 3 17
10: 4 15
11: 4 16
12: 4 17
If your data are numeric, like they are in this example, this is pretty straightforward in base R too. Conceptually this is the same as #jlhoward's answer: get your data into a long format, and merge:
merge(cbind(id = rownames(d), stack(d)),
cbind(id = rownames(e), stack(e)),
by = "id")[c("values.x", "values.y")]
# values.x values.y
# 1 1 11
# 2 1 12
# 3 1 13
# 4 2 11
# 5 2 12
# 6 2 13
# 7 3 14
# 8 3 15
# 9 3 16
# 10 4 14
# 11 4 15
# 12 4 16
Or, with the "reshape2" package:
merge(melt(as.matrix(d)),
melt(as.matrix(e)),
by = "Var1")[c("value.x", "value.y")]

How to calculate difference from initial value for each group in R?

I have data arranged like this in R:
indv time val
A 6 5
A 10 10
A 12 7
B 8 4
B 10 3
B 15 9
For each individual (indv) at each time, I want to calculate the change in value (val) from the initial time. So I would end up with something like this:
indv time val val_1 val_change
A 6 5 5 0
A 10 10 5 5
A 12 7 5 2
B 8 4 4 0
B 10 3 4 -1
B 15 9 4 5
Can anyone tell me how I might do this? I can use
ddply(df, .(indv), function(x)x[which.min(x$time), ])
to get a table like
indv time val
A 6 5
B 8 4
However, I cannot figure out how to make a column val_1 where the minimum values are matched up for each individual. However, if I can do that, I should be able to add column val_change using something like:
df['val_change'] = df['val_1'] - df['val']
EDIT: two excellent methods were posted below, however both rely on my time column being sorted so that small time values are on top of high time values. I'm not sure this will always be the case with my data. (I know I can sort first in Excel, but I'm trying to avoid that.) How could I deal with a case when the table appears like this:
indv time value
A 10 10
A 6 5
A 12 7
B 8 4
B 10 3
B 15 9
Here is a data.table solution that will be memory efficient as it is setting by reference within the data.table. Setting the key will sort by the key variables
library(data.table)
DT <- data.table(df)
# set key to sort by indv then time
setkey(DT, indv, time)
DT[, c('val1','change') := list(val[1], val - val[1]),by = indv]
# And to show it works....
DT
## indv time val val1 change
## 1: A 6 5 5 0
## 2: A 10 10 5 5
## 3: A 12 7 5 2
## 4: B 8 4 4 0
## 5: B 10 3 4 -1
## 6: B 15 9 4 5
Here's a plyr solution using ddply
ddply(df, .(indv), transform,
val_1 = val[1],
change = (val - val[1]))
indv time val val_1 change
1 A 6 5 5 0
2 A 10 10 5 5
3 A 12 7 5 2
4 B 8 4 4 0
5 B 10 3 4 -1
6 B 15 9 4 5
To get your second table try this:
ddply(df, .(indv), function(x) x[which.min(x$time), ])
indv time val
1 A 6 5
2 B 8 4
Edit 1
To deal with unsorted data, like the one you posted in your edit try the following
unsort <- read.table(text="indv time value
A 10 10
A 6 5
A 12 7
B 8 4
B 10 3
B 15 9", header=T)
do.call(rbind, lapply(split(unsort, unsort$indv),
function(x) x[order(x$time), ]))
indv time value
A.2 A 6 5
A.1 A 10 10
A.3 A 12 7
B.4 B 8 4
B.5 B 10 3
B.6 B 15 9
Now you can apply the procedure described above to this sorted dataframe
Edit 2
A shorter way to sort your dataframe is using sortBy function from doBy package
library(doBy)
orderBy(~ indv + time, unsort)
indv time value
2 A 6 5
1 A 10 10
3 A 12 7
4 B 8 4
5 B 10 3
6 B 15 9
Edit 3
You can even sort your df using ddply
ddply(unsort, .(indv, time), sort)
value time indv
1 5 6 A
2 10 10 A
3 7 12 A
4 4 8 B
5 3 10 B
6 9 15 B
You can do this with the base functions. using your data
df <- read.table(text = "indv time val
A 6 5
A 10 10
A 12 7
B 8 4
B 10 3
B 15 9", header = TRUE)
We first split() df on the indv variable
sdf <- split(df, df$indv)
Next we transform each component of sdf adding in the val_1 and val_change variables in a manner similar to how you suggest
sdf <- lapply(sdf, function(x) transform(x, val_1 = val[1],
val_change = val - val[1]))
Finally we arrange for the individual components to be bound row wise into a single data frame:
df <- do.call(rbind, sdf)
df
Which gives:
R> df
indv time val val_1 val_change
A.1 A 6 5 5 0
A.2 A 10 10 5 5
A.3 A 12 7 5 2
B.4 B 8 4 4 0
B.5 B 10 3 4 -1
B.6 B 15 9 4 5
Edit
To address the sorting issue the OP raises in the comments, modify the lapply() call to include a sorting step prior to the transform(). For example:
sdf <- lapply(sdf, function(x) {
x <- x[order(x$time), ]
transform(x, val_1 = val[1],
val_change = val - val[1])
})
In use we have
## scramble `df`
df <- df[sample(nrow(df)), ]
## split
sdf <- split(df, df$indv)
## apply sort and transform
sdf <- lapply(sdf, function(x) {
x <- x[order(x$time), ]
transform(x, val_1 = val[1],
val_change = val - val[1])
})
## combine
df <- do.call(rbind, sdf)
which again gives:
R> df
indv time val val_1 val_change
A.1 A 6 5 5 0
A.2 A 10 10 5 5
A.3 A 12 7 5 2
B.4 B 8 4 4 0
B.5 B 10 3 4 -1
B.6 B 15 9 4 5

Resources