R data.table: combining lapply with other j arguments

I want to combine the result of lapply over .SD in j with further output columns in j. How can I do that in the same data.table?
So far I'm creating two data.tables (example_summary1, example_summary2) and merging them, but there should be a better way?
Maybe I don't fully understand the concept of .SD/.SDcols.
example <- data.table(id=rep(1:5,3), numbers=rep(1:5,3), sample1=sample(20,15,replace=TRUE), sample2=sample(20,15,replace=TRUE))
id numbers sample1 sample2
1: 1 1 17 18
2: 2 2 8 1
3: 3 3 17 12
4: 4 4 15 2
5: 5 5 14 18
6: 1 1 11 14
7: 2 2 12 12
8: 3 3 11 7
9: 4 4 16 13
10: 5 5 17 1
11: 1 1 10 3
12: 2 2 14 15
13: 3 3 13 3
14: 4 4 17 6
15: 5 5 1 5
example_summary1<-example[,lapply(.SD,mean),by=id,.SDcols=c("sample1","sample2")]
> example_summary1
id sample1 sample2
1: 1 12.66667 11.666667
2: 2 11.33333 9.333333
3: 3 13.66667 7.333333
4: 4 16.00000 7.000000
5: 5 10.66667 8.000000
example_summary2<-example[,.(example.sum=sum(numbers)),id]
> example_summary2
id example.sum
1: 1 3
2: 2 6
3: 3 9
4: 4 12
5: 5 15

This is the best you can do if you are using .SDcols:
example_summary1 <- example[, c(lapply(.SD, mean), .(example.sum = sum(numbers))),
by = id, .SDcols = c("sample1", "sample2", "numbers")][, numbers := NULL][]
If you don't include numbers in .SDcols, it's not available in j.
Without .SDcols you can do this:
example_summary1 <- example[, c(lapply(.(sample1 = sample1, sample2 = sample2), mean),
.(example.sum = sum(numbers))),
by=id]
Or if you have a vector of column names:
cols <- c("sample1","sample2")
example_summary1 <- example[, c(lapply(mget(cols), mean),
.(example.sum = sum(numbers))),
by=id]
But I suspect you then lose some of data.table's internal optimizations.
Finally, a data.table join is so fast that I would use your approach.
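For completeness, the join mentioned above is a one-liner once the two summaries exist. A minimal sketch using the question's data (seeded only so the sampled columns are reproducible):

```r
library(data.table)
set.seed(1)  # only so the sampled columns are reproducible
example <- data.table(id = rep(1:5, 3), numbers = rep(1:5, 3),
                      sample1 = sample(20, 15, replace = TRUE),
                      sample2 = sample(20, 15, replace = TRUE))

example_summary1 <- example[, lapply(.SD, mean), by = id,
                            .SDcols = c("sample1", "sample2")]
example_summary2 <- example[, .(example.sum = sum(numbers)), by = id]

# join on id: one row per id with all summary columns side by side
result <- example_summary1[example_summary2, on = "id"]
```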


apply .GRP to multiple columns in data.table R to group each column separately [duplicate]

I've got a large data.table (200M rows x 300 columns), DT, with multiple (over 50) identifier columns.
The identifiers are all in different formats, and some of them are fairly complex and long; I would like to convert all of them (selected_cols) to simple numerical identifiers.
I can use .GRP for one column at a time, and it's super fast (well, relatively speaking, in context!)
DT[, new_col_1 := .GRP , by = .(col_1)] #this works for one column at a time
Is there a way to do this for multiple columns using the .GRP business?
I know how to do it if I define my own function using lapply, but I can't use .GRP inside a function. Might be wishful thinking. I can also do it with a for-loop, but I hate for-loops; they give me the creeps as they don't scale up.
Just hoping to avoid creating my own function or using for-loops, for speed reasons. It's a simple operation but takes a long time on a large data.table.
DT[, (paste0('new_', selected_cols)) := lapply(.SD, some_function_with_.GRP), .SDcols = selected_cols]
here's a data.table sample, if you need one:
require(data.table)
DT = data.table(col1 = c('A','B','B','D','B','A','A','B','R','T','E','E','H','T','Y','F','F','F'),
                col2 = c('DD','GG','RR','HH','SS','AA','CC','RR','EE','DD','HH','BB','CC','AA','QQ','EE','YY','MM'),
                col3 = c('FFF1','HHH1','CCC1','AAA1','FFF1','RRR1','GGG1','DDD1','FFF1','JJJ1','VVV1','CCC1','AAA1','XXX1','GGG1','HHH1','AAA1','RRR1'))
And this is the output I'm after:
> DT
col1 col2 col3 new_col1 new_col2 new_col3
1: A DD FFF1 1 1 1
2: B GG HHH1 2 2 2
3: B RR CCC1 2 3 3
4: D HH AAA1 3 4 4
5: B SS FFF1 2 5 1
6: A AA RRR1 1 6 5
7: A CC GGG1 1 7 6
8: B RR DDD1 2 3 7
9: R EE FFF1 4 8 1
10: T DD JJJ1 5 1 8
11: E HH VVV1 6 4 9
12: E BB CCC1 6 9 3
13: H CC AAA1 7 7 4
14: T AA XXX1 5 6 10
15: Y QQ GGG1 8 10 6
16: F EE HHH1 9 8 2
17: F YY AAA1 9 11 4
18: F MM RRR1 9 12 5
I'm looking for a native data.table solution.
One way would be using match and unique:
library(data.table)
cols <- paste0('col', 1:3)
DT[, paste0('new_', cols) := lapply(.SD, function(x)
match(x, unique(x))), .SDcols = cols]
DT
# col1 col2 col3 new_col1 new_col2 new_col3
# 1: A DD FFF1 1 1 1
# 2: B GG HHH1 2 2 2
# 3: B RR CCC1 2 3 3
# 4: D HH AAA1 3 4 4
# 5: B SS FFF1 2 5 1
# 6: A AA RRR1 1 6 5
# 7: A CC GGG1 1 7 6
# 8: B RR DDD1 2 3 7
# 9: R EE FFF1 4 8 1
#10: T DD JJJ1 5 1 8
#11: E HH VVV1 6 4 9
#12: E BB CCC1 6 9 3
#13: H CC AAA1 7 7 4
#14: T AA XXX1 5 6 10
#15: Y QQ GGG1 8 10 6
#16: F EE HHH1 9 8 2
#17: F YY AAA1 9 11 4
#18: F MM RRR1 9 12 5
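If you specifically want the .GRP behaviour, a plain loop over the column names is also viable despite the aversion to for-loops: each iteration assigns by reference, so nothing is copied. A sketch on a shortened version of the sample data (with by, groups are numbered in order of first appearance, matching match(x, unique(x))):

```r
library(data.table)
DT <- data.table(col1 = c('A', 'B', 'B', 'D'),
                 col2 = c('DD', 'GG', 'RR', 'HH'))
cols <- c('col1', 'col2')

# one := per column; each assignment is by reference, so the loop stays cheap
for (col in cols) {
  DT[, (paste0('new_', col)) := .GRP, by = c(col)]
}
```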

How to replace single occurrences with the previous status

I have a data table like below:
table=data.table(x=c(1:15),y=c(1,1,1,3,1,1,2,1,2,2,3,3,3,3,3),z=c(1:15)*3)
I have to clean this data table where there are single occurrences, like the 3 in between the 1s and the 1 in between the 2s. It doesn't have to be a 3: any number which occurs only once should be replaced by the previous number.
table=data.table(x=c(1:15),y=c(1,1,1,1,1,1,2,2,2,2,3,3,3,3,3),z=c(1:15)*3)
This is the expected data table.
Any help is appreciated.
Here's one way:
library(data.table)
#Count number of rows for each group
table[, N := .N, rleid(y)]
#Change `y` value which have only one row
table[, y := replace(y, N ==1, NA)]
#Replace NA with last non-NA value
table[, y := zoo::na.locf(y)][, N := NULL]
table
# x y z
# 1: 1 1 3
# 2: 2 1 6
# 3: 3 1 9
# 4: 4 1 12
# 5: 5 1 15
# 6: 6 1 18
# 7: 7 2 21
# 8: 8 2 24
# 9: 9 2 27
#10: 10 2 30
#11: 11 3 33
#12: 12 3 36
#13: 13 3 39
#14: 14 3 42
#15: 15 3 45
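As a variant, nafill() from data.table itself (available since 1.12.4) can stand in for zoo::na.locf, avoiding the extra dependency. A sketch on the question's data; note that a single fill pass carries the value of the run before each singleton, so the two adjacent singletons in rows 7 and 8 both inherit a 1 here:

```r
library(data.table)
table <- data.table(x = 1:15,
                    y = c(1, 1, 1, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3, 3, 3),
                    z = (1:15) * 3)

table[, N := .N, by = rleid(y)]              # length of each run of y
table[, y := nafill(replace(y, N == 1, NA),  # blank out singletons ...
                    type = "locf")]          # ... and fill from the previous run
table[, N := NULL]
```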
Here is a base R option:
inds <- which(diff(c(head(table$y, 1), table$y)) * diff(c(table$y, tail(table$y, 1))) < 0)
table$y <- replace(table$y, inds, table$y[inds - 1])
such that
> table
x y z
1: 1 1 3
2: 2 1 6
3: 3 1 9
4: 4 1 12
5: 5 1 15
6: 6 1 18
7: 7 2 21
8: 8 2 24
9: 9 2 27
10: 10 2 30
11: 11 3 33
12: 12 3 36
13: 13 3 39
14: 14 3 42
15: 15 3 45

data.table manipulation and merging

I have data
dat1 <- data.table(id=1:8,
group=c(1,1,2,2,2,3,3,3),
value=c(5,6,10,11,12,20,21,22))
dat2 <- data.table(group=c(1,2,3),
value=c(3,6,13))
and I would like to subtract dat2$value from each of the dat1$value, based on group.
Is this possible using data.table or does it require additional packages?
With data.table, you could do:
library(data.table)
dat1[dat2, on = "group"][, new.value := value - i.value, by = "group"][]
Which returns:
id group value i.value new.value
1: 1 1 5 3 2
2: 2 1 6 3 3
3: 3 2 10 6 4
4: 4 2 11 6 5
5: 5 2 12 6 6
6: 6 3 20 13 7
7: 7 3 21 13 8
8: 8 3 22 13 9
Alternatively, you can do this in one step as akrun mentions:
dat1[dat2, newvalue := value - i.value, on = .(group)]
id group value newvalue
1: 1 1 5 2
2: 2 1 6 3
3: 3 2 10 4
4: 4 2 11 5
5: 5 2 12 6
6: 6 3 20 7
7: 7 3 21 8
8: 8 3 22 9
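The i. prefix in both forms refers to columns of the table given in i (here dat2), which is how value and i.value are kept apart. A self-contained sketch of the one-step update join, which adds the column to dat1 by reference without building an intermediate table:

```r
library(data.table)
dat1 <- data.table(id = 1:8,
                   group = c(1, 1, 2, 2, 2, 3, 3, 3),
                   value = c(5, 6, 10, 11, 12, 20, 21, 22))
dat2 <- data.table(group = c(1, 2, 3), value = c(3, 6, 13))

# update join: value is dat1's column, i.value is dat2's
dat1[dat2, newvalue := value - i.value, on = .(group)]
```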

Persistent assignment in data.table with .SD

I'm struggling with .SD calls in data.table.
In particular, I'm trying to identify some logical characteristic within a grouping of data, and draw some identifying mark in another variable. Canonical application of .SD, right?
From FAQ 4.5, http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf, imagine the following table:
library(data.table) # 1.9.5
DT = data.table(a=rep(1:3,1:3),b=1:6,c=7:12)
DT[,{ mySD = copy(.SD)
mySD[1, b := 99L]
mySD },
by = a]
## a b c
## 1: 1 99 7
## 2: 2 99 8
## 3: 2 3 9
## 4: 3 99 10
## 5: 3 5 11
## 6: 3 6 12
I've assigned these values to b (using the := operator), so when I re-call DT I expect the same output. But, unexpectedly, I'm met with the original table:
DT
## a b c
## 1: 1 1 7
## 2: 2 2 8
## 3: 2 3 9
## 4: 3 4 10
## 5: 3 5 11
## 6: 3 6 12
Expected output was the original frame, with persistent modifications in 'b':
DT
## a b c
## 1: 1 99 7
## 2: 2 99 8
## 3: 2 3 9
## 4: 3 99 10
## 5: 3 5 11
## 6: 3 6 12
Sure, I can copy this table into another one, but that doesn't seem consistent with the ethos.
DT2 <- copy(DT[,{ mySD = copy(.SD)
mySD[1, b := 99L]
mySD },
by = a])
DT2
## a b c
## 1: 1 99 7
## 2: 2 99 8
## 3: 2 3 9
## 4: 3 99 10
## 5: 3 5 11
## 6: 3 6 12
It feels like I'm missing something fundamental here.
The mentioned FAQ is just showing a workaround for modifying (a temporary copy of) .SD; it won't update your original data in place. A possible solution for your problem would be something like:
DT[DT[, .I[1L], by = a]$V1, b := 99L]
DT
# a b c
# 1: 1 99 7
# 2: 2 99 8
# 3: 2 3 9
# 4: 3 99 10
# 5: 3 5 11
# 6: 3 6 12

Use data.table in R to add multiple columns to a data.table with := in only one function call

This is a direct expansion of this question.
I have a dataset and I want to find all pairwise combinations of variable v, depending on variables x and y:
library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,1,6), v=1:18)
x y v
1: a 1 1
2: a 1 2
3: a 6 3
4: a 1 4
5: a 1 5
6: a 6 6
7: b 1 7
8: b 1 8
9: b 6 9
10: b 1 10
11: b 1 11
12: b 6 12
13: c 1 13
14: c 1 14
15: c 6 15
16: c 1 16
17: c 1 17
18: c 6 18
DT[, list(new1 = t(combn(sort(v), m = 2))[,1],
new2 = t(combn(sort(v), m = 2))[,2]),
by = list(x, y)]
x y new1 new2
1: a 1 1 2
2: a 1 1 4
3: a 1 1 5
4: a 1 2 4
5: a 1 2 5
6: a 1 4 5
7: a 6 3 6
8: b 1 7 8
9: b 1 7 10
10: b 1 7 11
11: b 1 8 10
12: b 1 8 11
13: b 1 10 11
14: b 6 9 12
15: c 1 13 14
16: c 1 13 16
17: c 1 13 17
18: c 1 14 16
19: c 1 14 17
20: c 1 16 17
21: c 6 15 18
The code does what I want, but calling combn twice makes it slow for larger datasets. My dataset has more than 3 million rows and more than 1.3 million combinations of x and y.
Any suggestions on how to do this faster?
I would prefer something like:
DT[, list(c("new1", "new2") = t(combn(sort(v), m = 2))), by = list(x, y)]
This should work:
DT[, {
tmp <- combn(sort(v), m = 2 )
list(new1 = tmp[1,], new2 = tmp[2,] )
}
, by = list(x, y) ]
The following also works. The trick is to convert the matrix into a data.table.
DT[, data.table(t(combn(sort(v), m = 2))), by=list(x, y)]
If necessary, just rename the columns afterwards:
r2 <- DT[, data.table(t(combn(sort(v), m = 2))), by=list(x, y)]
setnames(r2, c("V1", "V2"), c("new1", "new2"))
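The rename can also be folded into j, since setnames() returns the renamed table; a sketch combining both steps:

```r
library(data.table)
DT <- data.table(x = rep(c("a", "b", "c"), each = 6), y = c(1, 1, 6), v = 1:18)

# build the pair matrix per group, convert to data.table, and name it in one j
res <- DT[, setnames(as.data.table(t(combn(sort(v), m = 2))),
                     c("new1", "new2")),
          by = .(x, y)]
```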
