Combine data.tables and sum the shared column - R

I have some large data sets and am trying out data.table to combine them while summing the shared column over matching rows. I know how to merge using [ to match rows in the LHS data.table, as shown below with a2 as the LHS table and a as the RHS table:
a2 <- data.table( b= c(letters[1:5],letters[11:15]), c = as.integer(rep(100,10)))
a <- data.table(b = letters[1:10], c = as.integer(1:10))
setkey(a2 ,"b")
setkey(a , "b")
a2
b c
1: a 100
2: b 100
3: c 100
4: d 100
5: e 100
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
a
b c
1: a 1
2: b 2
3: c 3
4: d 4
5: e 5
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
From the second answer to "Merge data frames whilst summing common columns in R" I saw how columns could be summed up over matching rows, as such:
setkey(a , "b")
setkey(a2, "b")
a2[a, `:=`(c = c + i.c)]
a2
b c
1: a 101
2: b 102
3: c 103
4: d 104
5: e 105
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
However, I am trying to retain the rows that don't match as well.
Alternatively I could use merge as shown below, but I would like to avoid making a new table with 4 columns before reducing it to 2 columns.
c <- merge(a, a2, by = "b", all=T)
c <- transform(c, value = rowSums(c[,2:3], na.rm=T))
c <- c[,c(1,4)]
c
b value
1: a 102
2: b 104
3: c 106
4: d 108
5: e 110
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
11: k 100
12: l 100
13: m 100
14: n 100
15: o 100
This last table is what I would like to achieve. Thanks in advance.

merge is unlikely to be very efficient for the end result you are after. Since both of your data.tables have the same structure, I would suggest rbinding them together and taking the sum by their key. In other words:
rbindlist(list(a, a2))[, sum(c), b]
I've used rbindlist because it is generally more efficient at rbinding data.tables (even though you have to first put your data.tables in a list).
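Note that the summed column in the result above gets the default name V1; if you want to keep the original column name, a small variation (a sketch using .(), data.table's alias for list()) is:
rbindlist(list(a, a2))[, .(c = sum(c)), by = b]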
Compare some timings on larger datasets:
library(data.table)
library(stringi)
set.seed(1)
n <- 1e7; n2 <- 1e6
x <- stri_rand_strings(n, 4)
a2 <- data.table(b = sample(x, n2), c = sample(100, n2, TRUE))
a <- data.table(b = sample(x, n2), c = sample(10, n2, TRUE))
system.time(rbindlist(list(a, a2))[, sum(c), b])
# user system elapsed
# 0.83 0.05 0.87
system.time(merge(a2, a, by = "b", all = TRUE)[, rowSums(.SD, na.rm = TRUE), b]) # Get some coffee
# user system elapsed
# 159.58 0.48 162.95
## Do we have all the rows we expect to have?
length(unique(c(a$b, a2$b)))
# [1] 1782166
nrow(rbindlist(list(a, a2))[, sum(c), b])
# [1] 1782166
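If you prefer to stay closer to your original update-join, a hedged alternative (a sketch, not benchmarked here) is to update the matches in place and then append the anti-joined rows:
res <- copy(a2)                                 # keep a2 intact
res[a, on = "b", c := c + i.c]                  # add matches in place
res <- rbindlist(list(res, a[!a2, on = "b"]))   # append rows of a with no match in a2
setkey(res, b)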

Related

How to find linear regression between groups using data.table?

I am trying to find the linear regression between all available groups of the following dataset.
library(data.table)
dt <- data.table(time = c(rep(rep(1:100, times = 1), 4), rep(1:30, times = 1)),
                 group = c(rep(c("a","b","c","d"), each = 100), rep("e", 30)),
                 value = rnorm(430))
dt[]
time group value
1: 1 a 0.1625954
2: 2 a -1.2288462
3: 3 a -0.1628570
4: 4 a 1.0597886
5: 5 a -1.1828334
---
426: 26 e -1.3762654
427: 27 e 0.3761436
428: 28 e -1.6982330
429: 29 e 0.1940263
430: 30 e -0.4631258
The output should be something like
group1 group2 regression
a b 1.2
a c 0.3
b c 0.5
d a 4.3
...
I am looking for a solution using the data.table library only.
Linear regression of all the combinations of groups should be found. That includes cases a~b and b~a as the regression for each of these cases will be different.
Since the size of some groups is different, the time variables should be used to find the common rows between any set of groups.
The solution will require finding all combinations of groups.
With the new data, we could split the data by 'group' into a list. Then use combn on the names of the list for pairwise combinations, extract the list elements (s1, s2), and check whether they share any common 'time' values (intersect). Based on the length of that intersection: if there are common elements, apply lm on the corresponding 'value' columns and create a data.table with the summarised coefficient along with the group names; otherwise return an NA row. Finally, rbind the list elements:
library(data.table)
lst1 <- split(dt, dt$group)
rbindlist(combn(names(lst1), 2, FUN = function(x) {
    s1 <- lst1[[x[1]]]
    s2 <- lst1[[x[2]]]
    i1 <- intersect(s1$time, s2$time)
    if(length(i1) > 0)
        na.omit(s1[s2, on = .(time)][,
            .(group1 = first(s1$group), group2 = first(s2$group),
              regression = lm(i.value ~ value)$coef[2])])
    else
        data.table(group1 = first(s1$group), group2 = first(s2$group),
                   regression = NA_real_)
}, simplify = FALSE))
-output
group1 group2 regression
1: a b 0.03033996
2: a c 0.06391242
3: a d -0.09138112
4: a e -0.27738183
5: b c 0.05663270
6: b d 0.05481604
7: b e 0.27789495
8: c d -0.13987978
9: c e 0.16388299
10: d e 0.12380720
If we want the full set of ordered combinations, use either expand.grid or CJ (from data.table):
dt2 <- CJ(group1 = names(lst1), group2 = names(lst1))[group1 != group2]
dt2[, rbindlist(Map(function(x, y) {
    s1 <- lst1[[x]]
    s2 <- lst1[[y]]
    i1 <- intersect(s1$time, s2$time)
    if(length(i1) > 0)
        na.omit(s1[s2, on = .(time)][,
            data.table(group1 = x, group2 = y,
                       regression = lm(i.value ~ value)$coef[2])])
    else
        data.table(group1 = x, group2 = y, regression = NA_real_)
}, group1, group2))]
-output
group1 group2 regression
1: a b 0.03033996
2: a c 0.06391242
3: a d -0.09138112
4: a e -0.27738183
5: b a 0.03247826
6: b c 0.05663270
7: b d 0.05481604
8: b e 0.27789495
9: c a 0.07488082
10: c b 0.06198333
11: c d -0.13987978
12: c e 0.16388299
13: d a -0.09295215
14: d b 0.05208743
15: d c -0.12144302
16: d e 0.12380720
17: e a -0.25136439
18: e b 0.34052322
19: e c 0.28677255
20: e d 0.21435666
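As a small optional tweak (not from the original answer): lm(...)$coef relies on partial matching of the coefficients element, so the coef() accessor can be used for the slope in either j above:
regression = coef(lm(i.value ~ value))[[2]]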

Randomization R ordering

I'm afraid I can't find an answer to my problem.
I am looking to create the following:
1) There are 4 sets of cards A, B, C, D and 16 cards in total.
2) Each card is numbered within a set (A from 1 to 4, B from 5 to 8, and so on).
3) We want to randomize the assignment such that each person is randomly assigned a set of cards, for example A.
4) In addition, the order of the cards within the set has to be randomized.
So what we want is the following:
Person 1: Set A, cards 1-2-3-4
Person 2: Set A, cards 4-2-3-1
Person 3: Set D, cards 16-15-12-13
and so on.
I would also like each number to be in a separate column.
Thanks for your help!
S.
If each person gets one set of cards:
> df = NULL
> a = rep(LETTERS[1:4], 4)
> df$card1 = sample(a, 16, replace = FALSE)
> df = as.data.frame(df)
> df = df[order(df$card1), , drop = FALSE]
> df
card1
1: A
2: A
3: A
4: A
5: B
6: B
7: B
8: B
9: C
10: C
11: C
12: C
13: D
14: D
15: D
16: D
> df$card2=rep((1:4),4)
> df
card1 card2
1: A 1
2: A 2
3: A 3
4: A 4
5: B 1
6: B 2
7: B 3
8: B 4
9: C 1
10: C 2
11: C 3
12: C 4
13: D 1
14: D 2
15: D 3
16: D 4
> df1=df[sample(nrow(df)),]
> df1
card1 card2
1: A 2
2: D 4
3: C 3
4: D 3
5: B 3
6: D 1
7: C 2
8: A 3
9: B 2
10: D 2
11: B 1
12: A 1
13: C 4
14: C 1
15: B 4
16: A 4
Here's one way of approaching this.
person <- c("Person1", "Person2", "Person3", "Person4")
cardset <- LETTERS[1:4]
set.seed(357) # this is for reproducibility
xy <- data.frame(
    person = sample(person),  # pick out persons in a random order
    set = sample(cardset))    # assign a random card set to each person
vx <- rep(xy$set, each = 4)   # for each set, create repeats
vy <- split(paste(vx, rep(1:4, times = 4), sep = ""), f = vx)  # append numbers to it
vz <- do.call(rbind, sapply(vy, FUN = sample, simplify = FALSE))  # shuffle each set and stitch together with do.call
cbind(xy, vz[xy$set, ])  # reorder the shuffled sets to match each person's set, then add to the original data
   person set  1  2  3  4
C Person1   C C2 C3 C4 C1
B Person4   B B2 B1 B4 B3
D Person3   D D1 D2 D4 D3
A Person2   A A4 A3 A2 A1
Here's another option:
# create data frame of decks and their numbered cards
cards <- data.frame(deck = rep(LETTERS[1:4], each = 4),
                    numbers = 1:16,
                    stringsAsFactors = FALSE)
# create list of people
people <- c("Person1", "Person2", "Person3")
# loop through each person and randomly select a deck
# based on deck selected, subset the cards that can be used
# randomize the numbered cards
# add the deck, order of cards, and person to a
# growing data frame of assignments
assignment <- NULL
for (i in unique(people)) {
    set <- sample(cards$deck, size = 1)
    setCards <- cards[cards$deck == set, ]
    orderCards <- sample(setCards$numbers)
    assignment <- rbind(assignment, data.frame(Person = i,
                                               Deck = set,
                                               Card1 = orderCards[1],
                                               Card2 = orderCards[2],
                                               Card3 = orderCards[3],
                                               Card4 = orderCards[4],
                                               stringsAsFactors = FALSE))
}
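For what it's worth, the loop can be collapsed into a few vectorized lines; this is just a sketch reusing the cards and people objects above:
decks <- sample(unique(cards$deck), length(people), replace = TRUE)  # one random deck per person
cardmat <- t(sapply(decks, function(d) sample(cards$numbers[cards$deck == d]),
                    USE.NAMES = FALSE))  # one shuffled row of that deck's numbers per person
data.frame(Person = people, Deck = decks, cardmat)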

How to compare if any of the elements in a row is the same

Is there a way to compare whether "any value of" a row is identical to "any value" of the row above -- regardless of the order? Below is a very random input data table.
DT <- data.table(A = c("a","a","b","d","e","f","h","i","j"),
                 B = c("a","b","c","c","f","g",NA,"j",NA),
                 C = c("a","b","c","b","g","h",NA,NA,NA))
> DT
A B C
1: a a a
2: a b b
3: b c c
4: d c b
5: e f g
6: f g h
7: h NA NA
8: i j NA
9: j NA NA
I would like to add a column D that compares a row with the row above, and compare whether any values of the two rows are identical (regardless of the order). So the desired output would be:
> DT
A B C D
1: a a a 0 #No row above to compare; could be either NA or 0
2: a b b 1 #row 2 has "a", which is in row 1; returns 1
3: b c c 1 #row 3 has "b", which is in row 2; returns 1
4: d c b 1 #row 4 has "b" and "c", which are in row 3; returns 1
5: e f g 0 #row 5 has nothing that is in row 4; returns 0
6: f g h 1 #row 6 has "f" and "g", which are in row 5; returns 1
7: h NA NA 1 #row 7 has "h", which is in row 6; returns 1
8: i j NA 0 #row 8 has nothing that is in row 7 (NA doesn't count)
9: j NA NA 1 #row 9 has "j", which is in row 8; returns 1 (NA doesn't count)
The main idea is that I would like to compare a row (or vector) with another row (vector), and define two rows as matching if any of their elements are identical (ideally without iterating over and comparing each element).
We can do this by taking the lead rows of the dataset and pasting them together as an alternation pattern, pasting each row of the original dataset into a single string, checking each pattern against the corresponding row string using grepl and Map, then unlisting and converting to integer:
DT[, D := {
    v1 <- do.call(paste, .SD)
    v2 <- do.call(paste, c(shift(.SD, type = "lead"), sep = "|"))
    v2N <- gsub("NA\\|*|\\|*NA", "", v2)
    v3 <- unlist(Map(grepl, v2N, v1), use.names = FALSE)
    as.integer(head(c(FALSE, v3), -1))
}]
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
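To make the intermediate steps concrete, here is a small sketch of the first comparison (run on a fresh DT, before the D column above has been added by reference):
v1 <- do.call(paste, DT)                                      # each row pasted: "a a a", "a b b", ...
v2 <- do.call(paste, c(shift(DT, type = "lead"), sep = "|"))  # next row pasted as a pattern
grepl(gsub("NA\\|*|\\|*NA", "", v2[1]), v1[1])                # TRUE: rows 1 and 2 share "a", so D[2] becomes 1 after the shift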
Or we can do a split and do the comparison using Map:
as.integer(c(FALSE, unlist(Map(function(x, y) {
    x1 <- na.omit(unlist(x))
    y1 <- na.omit(unlist(y))
    any(x1 %in% y1 | y1 %in% x1) },
    split(DT[-nrow(DT)], 1:(nrow(DT) - 1)),
    split(DT[-1], 2:nrow(DT))), use.names = FALSE)))
Here is another method. It's probably not advisable on large data.tables as it uses by=1:nrow(DT) which tends to be quite slow.
DT[, D := sign(DT[, c(.SD, shift(.SD))][,
    sum(!is.na(intersect(unlist(.SD[, .(A, B, C)]), unlist(.SD[, .(V4, V5, V6)])))),
    by = 1:nrow(DT)]$V1)]
Here, [, c(.SD, shift(.SD))] creates a copy of the data.table with the lagged variables included (cbinded). The second chain then intersects the unlisted original variables and the unlisted shifted variables in each row; non-NA elements of the intersection count as 1 and NAs as 0, and these are summed. This operation occurs for each row of the copied data.table. The sum is extracted with $V1 and turned into binary (0 and 1) using sign.
It returns
DT
A B C D
1: a a a 0
2: a b b 1
3: b c c 1
4: d c b 1
5: e f g 0
6: f g h 1
7: h NA NA 1
8: i j NA 0
9: j NA NA 1
Here's a loop-free approach using data.table's joins:
DT[, id := 1:.N]
dt <- melt(DT, id.vars = "id")
dt[, id2 := id-1]
dt <- dt[!is.na(value)]
idx <- dt[dt, on = .(id2 = id, value), nomatch=0][, unique(id)]
DT[, `:=`(D = as.integer(id %in% idx), id = NULL)]
It looks somewhat complicated, but it does perform pretty well, at just over a second for a 1-million-row data set with three columns.
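If you want to reproduce that timing, a rough benchmark sketch (the row count, letter pool, and NA share here are my assumptions, not from the original post) could look like:
library(data.table)
set.seed(1)
n <- 1e6
big <- data.table(A = sample(letters, n, TRUE),
                  B = sample(c(letters, NA), n, TRUE),
                  C = sample(c(letters, NA), n, TRUE))
system.time({
    big[, id := 1:.N]
    bm <- melt(big, id.vars = "id")[!is.na(value)][, id2 := id - 1]
    idx <- bm[bm, on = .(id2 = id, value), nomatch = 0][, unique(id)]
    big[, `:=`(D = as.integer(id %in% idx), id = NULL)]
})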
I would do a sapply along the indices (minus the last) of the table:
compare <- function(i) {
    row1 <- as.character(DT[i, ])
    row2 <- as.character(DT[i + 1, ])
    length(intersect(row1[!is.na(row1)], row2[!is.na(row2)])) > 0
}
result <- sapply(1:(nrow(DT) - 1), compare)
This returns a vector of logicals, so if you prefer integers, wrap the output of compare in as.numeric().
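For example, to attach it as a D column with a leading 0 for the first row (a sketch):
DT[, D := c(0, as.numeric(result))]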
Here is a base R solution using intersect:
res <- c(0, sapply(2:nrow(DT), function(i)
    length(intersect(na.omit(as.character(DT[i, ])),
                     na.omit(as.character(DT[i - 1, ])))) > 0))
cbind(DT, D=res)
# A B C D
# 1: a a a 0
# 2: a b b 1
# 3: b c c 1
# 4: d c b 1
# 5: e f g 0
# 6: f g h 1
# 7: h NA NA 1
# 8: i j NA 0
# 9: j NA NA 1
This solution compares the two rows with %in% (after unlist()):
DT[, result := as.integer(c(NA, sapply(2:DT[, .N], function(i)
    any(na.omit(unlist(DT[i])) %in% unlist(DT[i - 1])))))]
#> DT
# A B C result
#1: a a a NA
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
Using a combination of intersect and mapply you could do:
# list of unique non-NA elements in each row
tableList = apply(DT, 1, function(x) unique(na.omit(x)))
# a lagged list to be compared with the list above
# (note: 2:length(tableList)-1 parses as (2:length(tableList)) - 1, i.e. 1:(length(tableList)-1))
tableListLag = c(NA, tableList[2:length(tableList)-1])
# find common elements using the intersect function;
# length > 0 implies common elements, hence set the value to 1, else 0
DT$D = mapply(function(x, y) ifelse(length(intersect(x, y)) > 0, 1, 0),
              tableList, tableListLag, SIMPLIFY = TRUE)
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1

Column order of `.SD` in j argument differs when `get()` is used

I very often transform subsets of data using the .SDcols option in data.table. It makes sense that the .SD columns sent to j are in the same order as the original data.table.
EDITED to properly identify the issue
It's nice that .SD columns have the same order as that specified in the .SDcols argument. This does not happen when get is used in the j argument (inside an lapply call, at least). In this case, the .SD table columns maintain their original order.
Is there any way to override this behaviour?
An example without get works fine
# library(data.table)
dt = data.table(col1 = rep(LETTERS[1:3], 4),
                b = rnorm(12),
                a = 1:12,
                c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b')  # names in a different order than names(dt)
# generate columns of first differences by group
dt[, paste('d', d.vars, sep='.') :=
       lapply(.SD, function(L) L - shift(L, n = 1, type = 'lag')),
   keyby = col1, .SDcols = d.vars]
The results are as expected: the .SD columns are ordered the same way as the names in d.vars, even though my named vector (d.vars) is ordered differently than the columns in dt.
> dt
col1 b a c d.a d.b
1: A -0.28901751 1 A NA NA
2: A 0.65746901 4 D 3 0.94648651
3: A -0.10602462 7 G 3 -0.76349362
4: A -0.38406252 10 J 3 -0.27803790
5: B -1.06963450 2 B NA NA
6: B 0.35137273 5 E 3 1.42100723
7: B 0.43394046 8 H 3 0.08256772
8: B 0.82525042 11 K 3 0.39130996
9: C 0.50421710 3 C NA NA
10: C -1.09493665 6 F 3 -1.59915375
11: C -0.04858163 9 I 3 1.04635501
12: C 0.45867279 12 L 3 0.50725443
This is the expected output, because lapply in j processed column a first and b second, in spite of the column order in dt.
Example with get behaves differently
dt2 = data.table(col1 = rep(LETTERS[1:3], 4),
                 b = rnorm(12),
                 a = 1:12,
                 neg = -1,
                 c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b')  # names in a different order than names(dt)
# name of the variable to be called in j
negate <- 'neg'
dt2[, paste('d', d.vars, sep='.') :=
        lapply(.SD, function(L) (L - shift(L, n = 1, type = 'lag')) * get(negate)),
    keyby = col1, .SDcols = d.vars]
Now the naming of the newly created columns doesn't align with the name order in d.vars:
> dt2
col1 b a neg c d.a d.b
1: A -0.3539066 1 -1 A NA NA
2: A 0.2702374 4 -1 D -0.62414408 -3
3: A -0.7834941 7 -1 G 1.05373150 -3
4: A -1.2765652 10 -1 J 0.49307118 -3
5: B -0.2936422 2 -1 B NA NA
6: B -0.2451996 5 -1 E -0.04844252 -3
7: B -1.6577614 8 -1 H 1.41256181 -3
8: B 1.0668059 11 -1 K -2.72456737 -3
9: C -0.1160938 3 -1 C NA NA
10: C -0.7940771 6 -1 F 0.67798333 -3
11: C 0.2951743 9 -1 I -1.08925140 -3
12: C -0.4508854 12 -1 L 0.74605969 -3
In this second example the b column is processed by lapply first and therefore assigned to d.a.
If I refer to neg directly (i.e., I don't use get) then the results are as expected: lapply processes the .SD columns in the order given in d.vars.
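For reference, that direct-reference version looks like the sketch below; it relies on j seeing the neg column by name even though it is not in .SDcols:
dt2[, paste('d', d.vars, sep='.') :=
        lapply(.SD, function(L) (L - shift(L, n = 1, type = 'lag')) * neg),
    keyby = col1, .SDcols = d.vars]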
p.s. Thanks data.table team! I love this package!
Based on the description, we can use match to align 'd.vars' with the column names of 'dt' (giving 'd.vars1') and then use it to get the order right:
d.vars1 <- d.vars[match(names(dt), d.vars, nomatch = 0)]
dt[, paste0("d.", d.vars1) := lapply(.SD, function(L)
       L - shift(L, n = 1, type = 'lag')),
   keyby = col1, .SDcols = d.vars1]
dt
# col1 b a c d.b d.a
# 1: A -0.28901751 1 A NA NA
# 2: A 0.65746901 4 D 0.94648652 3
# 3: A -0.10602462 7 G -0.76349363 3
# 4: A -0.38406252 10 J -0.27803790 3
# 5: B -1.06963450 2 B NA NA
# 6: B 0.35137273 5 E 1.42100723 3
# 7: B 0.43394046 8 H 0.08256773 3
# 8: B 0.82525042 11 K 0.39130996 3
# 9: C 0.50421710 3 C NA NA
#10: C -1.09493665 6 F -1.59915375 3
#11: C -0.04858163 9 I 1.04635502 3
#12: C 0.45867279 12 L 0.50725442 3
Update
Based on the new dataset
d.vars1 <- d.vars[match(names(dt2), d.vars, nomatch = 0)]
dt2[, paste0('d.', d.vars1) := lapply(.SD, function(L)
        L - shift(L, n = 1, type = 'lag') * get(negate)),
    keyby = col1, .SDcols = d.vars1]
dt2
# col1 b a neg c d.b d.a
# 1: A -0.3539066 1 -1 A NA NA
# 2: A 0.2702374 4 -1 D -0.0836692 5
# 3: A -0.7834941 7 -1 G -0.5132567 11
# 4: A -1.2765652 10 -1 J -2.0600593 17
# 5: B -0.2936422 2 -1 B NA NA
# 6: B -0.2451996 5 -1 E -0.5388418 7
# 7: B -1.6577614 8 -1 H -1.9029610 13
# 8: B 1.0668059 11 -1 K -0.5909555 19
# 9: C -0.1160938 3 -1 C NA NA
#10: C -0.7940771 6 -1 F -0.9101709 9
#11: C 0.2951743 9 -1 I -0.4989028 15
#12: C -0.4508854 12 -1 L -0.1557111 21

Subset a data.table by matching columns of another data.table

I have been searching for a solution for subsetting a data table using matching values for certain columns in another data table.
Here is in example:
set.seed(2)
dt <- data.table(a = 1:10,
                 b = rnorm(10),
                 c = runif(10),
                 d = letters[1:10])
dt2 <- data.table(a = 5:20,
                  b = rnorm(16),
                  c = runif(16),
                  d = letters[5:20])
This is the result I need:
   a            b          c d
1: 5 -2.311069085 0.62512173 e
2: 6 0.878604581 0.26030004 f
3: 7 0.035806718 0.85907312 g
4: 8 1.012828692 0.43748800 h
5: 9 0.432265155 0.38814476 i
6: 10 2.090819205 0.46150111 j
where the rows returned from the second data table are those where a and d match, even though b and c may not. The real data are mutually exclusive, and I need to match on three columns.
We can use %in% to match the columns and subset accordingly.
dt2[a %in% dt$a & d %in% dt$d]
# a b c d
#1: 5 -2.31106908 0.6251217 e
#2: 6 0.87860458 0.2603000 f
#3: 7 0.03580672 0.8590731 g
#4: 8 1.01282869 0.4374880 h
#5: 9 0.43226515 0.3881448 i
#6: 10 2.09081921 0.4615011 j
Here is an option using a join, specifying the on argument:
na.omit(dt2[dt[, c("a", "d"), with = FALSE], on = c("a", "d")])
# a b c d
#1: 5 -2.31106908 0.6251217 e
#2: 6 0.87860458 0.2603000 f
#3: 7 0.03580672 0.8590731 g
#4: 8 1.01282869 0.4374880 h
#5: 9 0.43226515 0.3881448 i
#6: 10 2.09081921 0.4615011 j
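If materialising i's columns in the join result is a concern, a semi-join sketch with which = TRUE returns just the matching row numbers of dt2; matching on a third column is then just a matter of adding its name to on:
idx <- dt2[dt, on = c("a", "d"), which = TRUE, nomatch = 0]  # row numbers of dt2 that match
dt2[idx]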
