R data.table grepl column on another column in i

Can I subset for when a string in column A is in column B?
Example:
library(data.table)
x <- data.table(a=letters, y=paste0(letters,"x"))
x[grepl(a, y)]
x[like(y, a)]
Both return a one-row data.table containing only the first row, plus the following warning:
Warning message:
In grepl(pattern, vector) :
argument 'pattern' has length > 1 and only the first element will be used
I would expect this to return all rows.

The following code applies grepl to each row, taking that row's a and y as a pair. The pattern argument of grepl cannot be a vector of length greater than one, so a looping or lapply-based approach is needed.
x[mapply(grepl, a, y), ]
# a y
# 1: a ax
# 2: b bx
# 3: c cx
# 4: d dx
# 5: e ex
# 6: f fx
# 7: g gx
# 8: h hx
# 9: i ix
# 10: j jx
# 11: k kx
# 12: l lx
# 13: m mx
# 14: n nx
# 15: o ox
# 16: p px
# 17: q qx
# 18: r rx
# 19: s sx
# 20: t tx
# 21: u ux
# 22: v vx
# 23: w wx
# 24: x xx
# 25: y yx
# 26: z zx
# a y
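An equivalent formulation wraps grepl with base R's Vectorize, which builds the mapply pairing automatically. A sketch:
# Vectorize() calls mapply internally, so pattern and string are paired
# element-wise rather than the pattern being recycled:
x[Vectorize(grepl)(a, y)]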

One more possibility could be using dplyr. Something like:
library(dplyr)
x <- data.table(a=letters, y=paste0(letters,"x"))
x %>% rowwise() %>%
  filter(grepl(a, y)) %>% as.data.frame()
a y
1: a ax
2: b bx
3: c cx
4: d dx
5: e ex
6: f fx
7: g gx
8: h hx
9: i ix
........ (output truncated)
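If rowwise() proves slow on larger data, here is a sketch of the same filter using the mapply idea from the previous answer, without row-wise grouping:
# mapply pairs each a with its y, so no rowwise() grouping is needed:
x %>% filter(mapply(grepl, a, y))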

Related

How to calculate mean per group from list of data.tables?

Let's say we have a list of data.table's like this:
dt <- data.table(x=rnorm(10^6,100,10), letters=sample(LETTERS,10^6,T))
myList <- list(dt1=dt,dt2=dt,dt3=dt,dt4=dt,dt5=dt)
If I wanted a solution that would calculate the mean per group across all data.tables, I could do the following:
bigDT <- rbindlist(myList)
bigDT[,list('average'=mean(x)),by=letters]
With my data, however, each dt is quite large (millions of rows) and each list is substantial as well (500-1000 dt in each list). There are also considerably more than two options for the by choice.
Part of a function I plan on optimizing using genetic algorithms requires computing the above mean by group. I was wondering if there was a more efficient solution than rbind-ing the list before using data.table's ability to calculate per group? Otherwise, the maximization algorithm will make many function calls to this potentially bottlenecking calculation.
Any help would be appreciated!
microbenchmark(doThis())
Unit: milliseconds
expr min lq mean median uq max neval
doThis() 151.512 154.3395 174.8071 167.7151 170.2952 440.9359 100
One method is to compute the grouped means for each table within the list, then bind, then compute a weighted mean of them. Since you have different counts of each letter, you'll need to preserve the .N as well.
I'm going to change each element of the list so that we can verify the weighted-mean calculations. For reproducibility:
set.seed(1)
myList <- replicate(5, data.table(x=rnorm(10^6,100,10), letters=sample(LETTERS,10^6,T)),
                    simplify=FALSE)
myList[1:2]
# [[1]]
# x letters
# 1: 93.73546 P
# 2: 101.83643 I
# 3: 91.64371 F
# 4: 115.95281 V
# 5: 103.29508 D
# ---
# 999996: 109.24487 Q
# 999997: 99.86486 K
# 999998: 93.95941 J
# 999999: 116.28763 O
# 1000000: 106.93750 E
# [[2]]
# x letters
# 1: 97.53576 R
# 2: 105.27503 T
# 3: 107.53592 L
# 4: 102.21228 M
# 5: 98.71087 G
# ---
# 999996: 109.46843 C
# 999997: 99.14458 M
# 999998: 96.76845 Y
# 999999: 94.22413 E
# 1000000: 98.25855 K
To do this for just one table:
head(myList[[1]][,.(mu = mean(x), n = .N), keyby=letters])
# letters mu n
# 1: A 100.04987 39005
# 2: B 100.01288 38576
# 3: C 99.97402 38547
# 4: D 99.99909 38460
# 5: E 100.03689 38030
# 6: F 100.02697 38293
First, compute the averages per-list-element:
myAgg <- rbindlist(lapply(myList, function(d) d[,.(mu = mean(x), n = .N), keyby="letters"]))
Now compute the weighted mean, either manually or with Hmisc::wtd.mean. A weighted mean of the per-table group means reproduces the overall group mean exactly, since n*mu recovers each table's per-group sum:
cbind(
  # just to verify the below answer is the same as the brute-force method of rbind-then-average
  rbindlist(myList)[, .(mu = mean(x)), keyby=letters],
  # either of these is your answer
  myAgg[, .(mu = sum(n*mu)/sum(n)), keyby=letters],
  myAgg[, .(mu = Hmisc::wtd.mean(mu, weights=n)), keyby=letters]
)
# letters mu letters mu letters mu
# 1: A 100.02325 A 100.02325 A 100.02325
# 2: B 100.03473 B 100.03473 B 100.03473
# 3: C 100.00688 C 100.00688 C 100.00688
# 4: D 100.04041 D 100.04041 D 100.04041
# 5: E 100.00780 E 100.00780 E 100.00780
# 6: F 100.01202 F 100.01202 F 100.01202
# 7: G 100.01200 G 100.01200 G 100.01200
# 8: H 99.97232 H 99.97232 H 99.97232
# 9: I 100.00495 I 100.00495 I 100.00495
# 10: J 100.03019 J 100.03019 J 100.03019
# 11: K 99.96851 K 99.96851 K 99.96851
# 12: L 100.01850 L 100.01850 L 100.01850
# 13: M 100.00976 M 100.00976 M 100.00976
# 14: N 100.01299 N 100.01299 N 100.01299
# 15: O 100.02108 O 100.02108 O 100.02108
# 16: P 100.02052 P 100.02052 P 100.02052
# 17: Q 100.03814 Q 100.03814 Q 100.03814
# 18: R 99.99013 R 99.99013 R 99.99013
# 19: S 99.95219 S 99.95219 S 99.95219
# 20: T 99.97721 T 99.97721 T 99.97721
# 21: U 99.96310 U 99.96310 U 99.96310
# 22: V 99.94430 V 99.94430 V 99.94430
# 23: W 99.98877 W 99.98877 W 99.98877
# 24: X 100.07352 X 100.07352 X 100.07352
# 25: Y 99.96677 Y 99.96677 Y 99.96677
# 26: Z 99.99397 Z 99.99397 Z 99.99397
# letters mu letters mu letters mu
Quick benchmarking, for comparison:
library(microbenchmark)
microbenchmark(
  bruteforce = rbindlist(myList)[, .(mu = mean(x)), keyby=letters],
  # either of these is your answer
  baseR = {
    myAgg <- rbindlist(lapply(myList, function(d) d[, .(mu = mean(x), n = .N), keyby="letters"]))
    myAgg[, .(mu = sum(n*mu)/sum(n)), keyby=letters]
  },
  Hmisc = {
    myAgg <- rbindlist(lapply(myList, function(d) d[, .(mu = mean(x), n = .N), keyby="letters"]))
    myAgg[, .(mu = Hmisc::wtd.mean(mu, weights=n)), keyby=letters]
  },
  times=50
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# bruteforce 131.8770 139.4562 153.93202 151.95375 159.6329 315.6117 50
# baseR 89.7047 93.3623 109.20174 98.11670 115.0171 268.2517 50
# Hmisc 89.2784 91.5927 97.87455 93.73475 98.1655 119.2671 50
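Since the motivating use case is a genetic algorithm calling this computation repeatedly, it may be convenient to wrap the two-stage aggregation in a function. A minimal sketch (the function name groupMeans is mine; the column names x and letters come from the example):
# Stage 1: per-table group means and counts; stage 2: count-weighted mean.
groupMeans <- function(lst) {
  agg <- rbindlist(lapply(lst, function(d) d[, .(mu = mean(x), n = .N), keyby = "letters"]))
  agg[, .(mu = sum(n * mu) / sum(n)), keyby = "letters"]
}
groupMeans(myList)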

How to sort a data.table based on a set of inequality constraints?

I have a set of "x < y" inequality constraints and I would like to sort the rows of a data.table based on these.
For example,
library(data.table)
set.seed(0)
ineqs <- unique(data.table(
  X = sample(letters, 10, replace = T),
  Rel = "<",
  Y = sample(letters, 10, replace = T)
))
ineqs
X Rel Y
1: x < b
2: g < f
3: j < e
4: o < r
5: x < j
6: f < u
7: x < m
8: y < s
9: r < z
10: q < j
So, if I start with a table of sorted letters,
dt <- data.table(Foo = letters)
Foo
1: a
2: b
3: c
---
24: x
25: y
26: z
How can I adjust the row order to satisfy my constraints? Also, I am certain that my constraints are valid (i.e. none of the constraints contradict each other).
Treat each constraint as a directed edge and take a topological sort of the resulting graph:
library(igraph)
g = ineqs[, graph_from_edgelist(cbind(X,Y), directed=TRUE)]
o = names(topo_sort(g))
dt[, v := factor(Foo, levels = o, ordered=TRUE)]
dt[order(v)]
Foo v
1: x x
2: g g
3: o o
4: y y
5: q q
6: b b
7: m m
8: f f
9: r r
10: s s
11: j j
12: u u
13: z z
14: e e
15: a <NA>
16: c <NA>
17: d <NA>
18: h <NA>
19: i <NA>
20: k <NA>
21: l <NA>
22: n <NA>
23: p <NA>
24: t <NA>
25: v <NA>
26: w <NA>
Foo v
All of the terms that aren't in ineqs are sorted to the end.
If the graph of your relation has cycles, you should get a warning in topo_sort. This tells you your task is not well defined for some terms in ineqs.
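If you want to fail fast instead of relying on that warning, igraph's is_dag() tests acyclicity directly. A small sketch:
# A valid set of constraints forms a directed acyclic graph:
stopifnot(is_dag(g))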
Perhaps I misunderstood, but this is not a trivial sort, and there isn't necessarily a unique order that satisfies the constraints.
Let me give you an example. Consider the conditions
X Rel Y
1: x < b
2: g < f
Various orders are conceivable
x < g < f < b
g < x < b < f
g < x < f < b
g < f < x < b
x < g < b < f
x < b < g < f
all of which satisfy the conditions laid out in the first two lines.
I was interested in seeing how an exhaustive & crude implementation would do, where we pre-calculate all possible permutations and then eliminate those that do not fulfil the pairwise conditions.
To illustrate, we will use 4 letters only and the first two lines of the pairwise condition data.
Here are my results:
To start, we define the four letters and calculate all permutations using gtools::permutations.
char <- c("b", "f", "g", "x")
library(gtools)
perm <- as.data.frame(permutations(length(char), length(char), char))
There are 24 possible permutations.
We now read in the pairwise condition data
df <- read.table(text =
"X Rel Y
x < b
g < f", header = T)
# Convert factors to character vectors
df[] <- sapply(df, as.character)
We now loop through the permutations and the pairwise conditions, and flag those rows in the permutation data that fail any of the pairwise conditions.
rmv <- c()
for (i in 1:nrow(perm)) {
  # Loop through all possible permutations and eliminate those that
  # do not fulfil the pairwise conditions
  for (j in 1:nrow(df)) {
    # Loop through the pairwise conditions
    cond <- eval(parse(text = sprintf("`%s`", df[j, "Rel"])))(
      which(perm[i, ] == df[j, "X"]),
      which(perm[i, ] == df[j, "Y"]))
    if (cond == FALSE) {
      rmv <- c(rmv, i)
      break
    }
  }
}
The remaining permutations that satisfy the conditions are then
perm[-rmv, ]
# V1 V2 V3 V4
#16 g f x b
#17 g x b f
#18 g x f b
#20 x b g f
#23 x g b f
#24 x g f b
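For the all-"<" case used here, the double loop can be collapsed into a single vectorised test. A sketch, assuming every Rel is "<" as in the example:
# For each permutation, every X must appear at an earlier position than its Y;
# match() returns those positions directly.
ok <- apply(perm, 1, function(p) all(match(df$X, p) < match(df$Y, p)))
perm[ok, ]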

How to count occurrences of different sets

sample code:
library(data.table)
set.seed(42)
dt <- data.table(id = LETTERS[1:20],
                 setvalues = replicate(20,
                                       sample(letters[1:4], sample(c(2,3),1))))[order(id)]
dt
id setvalues
1: A d,a,b
2: B c,d,a
3: C c,b,d
4: D b,d,c
5: E a,b,c
6: F a,c,b
7: G c,b
8: H b,c,d
9: I b,c,a
10: J a,d,b
11: K b,d,a
12: L b,c,d
13: M d,b,a
14: N b,c
15: O c,d
16: P b,d
17: Q d,c,b
18: R a,d,b
19: S a,d,c
20: T b,a
How can I count the occurrences of each set (order doesn't matter)?
The desired results are something like:
setvalue counts
b,c,d    5
a,b,d    5
a,b,c    3
a,c,d    2
b,c      2
c,d      1
b,d      1
a,b      1
The 'setvalues' column is a list of vectors. We loop through the list with lapply, sort each vector, collapse it into a single string with toString, use that as the by expression, and get the 'counts' with .N:
dt[ , .(counts = .N), .(setvalue = unlist(lapply(setvalues, function(x) toString(sort(x)))))]
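For the sample data above, this yields (counts tallied from the 20 rows shown; note that toString separates with ", "):
#    setvalue counts
# 1:  a, b, d      5
# 2:  a, c, d      2
# 3:  b, c, d      5
# 4:  a, b, c      3
# 5:     b, c      2
# 6:     c, d      1
# 7:     b, d      1
# 8:     a, b      1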

combine data.tables and sum the shared column

I have some large data sets and am trying out data.table to combine them while summing the shared column over matching rows. I know how to merge using [, updating matching rows in the LHS data.table, as shown below with a2 as the LHS and a as the RHS.
a2 <- data.table( b= c(letters[1:5],letters[11:15]), c = as.integer(rep(100,10)))
a <- data.table(b = letters[1:10], c = as.integer(1:10))
setkey(a2 ,"b")
setkey(a , "b")
a2
b c
1: a 100
2: b 100
3: c 100
4: d 100
5: e 100
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
a
b c
1: a 1
2: b 2
3: c 3
4: d 4
5: e 5
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
From the second answer to Merge data frames whilst summing common columns in R, I saw how columns could be summed over matching rows, as such:
setkey(a , "b")
setkey(a2, "b")
a2[a, `:=`(c = c + i.c)]
a2
b c
1: a 101
2: b 102
3: c 103
4: d 104
5: e 105
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
However, I am trying to retain the rows that don't match as well.
Alternatively, I could use merge as shown below, but I would like to avoid making a new table with 4 columns before reducing it back to 2 columns.
c <- merge(a, a2, by = "b", all=T)
c <- transform(c, value = rowSums(c[,2:3], na.rm=T))
c <- c[,c(1,4)]
c
b value
1: a 102
2: b 104
3: c 106
4: d 108
5: e 110
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
11: k 100
12: l 100
13: m 100
14: n 100
15: o 100
This last table is what I would like to achieve. Thanks in advance.
merge is unlikely to be very efficient for the end result you are after. Since both of your data.tables have the same structure, I would suggest rbinding them together and taking the sum by their key. In other words:
rbindlist(list(a, a2))[, sum(c), b]
I've used rbindlist because it is generally more efficient at rbinding data.tables (even though you have to first put your data.tables in a list).
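If you'd rather have the summed column named value to match the desired output above, name it in j; keyby additionally sorts the result by b (a minor variation on the same idea):
rbindlist(list(a, a2))[, .(value = sum(c)), keyby = b]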
Compare some timings on larger datasets:
library(data.table)
library(stringi)
set.seed(1)
n <- 1e7; n2 <- 1e6
x <- stri_rand_strings(n, 4)
a2 <- data.table(b = sample(x, n2), c = sample(100, n2, TRUE))
a <- data.table(b = sample(x, n2), c = sample(10, n2, TRUE))
system.time(rbindlist(list(a, a2))[, sum(c), b])
# user system elapsed
# 0.83 0.05 0.87
system.time(merge(a2, a, by = "b", all = TRUE)[, rowSums(.SD, na.rm = TRUE), b]) # Get some coffee
# user system elapsed
# 159.58 0.48 162.95
## Do we have all the rows we expect to have?
length(unique(c(a$b, a2$b)))
# [1] 1782166
nrow(rbindlist(list(a, a2))[, sum(c), b])
# [1] 1782166

melt data table and split values

I have a column in a data.table which is a list of comma-separated values:
dt = data.table( a = c('a','b','c'), b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt'))
> dt
a b
1: a xx,yy,zz
2: b mm,nn
3: c qq,rr,ss,tt
I would like to transform it into a long format
a b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt
This question has been answered for a data frame here. I'm wondering if there is an elegant data table solution.
The following will work for your example:
dt[, c(b=strsplit(b, ",")), by=a]
a b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt
This method fails if the "by" variable is repeated as in
dt = data.table(a = c('a','b','c', 'a'),
                b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt', 'zz,gg,tt'))
One robust solution in this situation is to use paste to collapse all observations with the same grouping variable (a) and feed the result to the code above.
dt[, .(b=paste(b, collapse=",")), by=a][, c(b=strsplit(b, ",")), by=a]
This returns
a b
1: a xx
2: a yy
3: a zz
4: a zz
5: a gg
6: a tt
7: b mm
8: b nn
9: c qq
10: c rr
11: c ss
12: c tt
There is another method, but it involves another package: splitstackshape.
library(splitstackshape)
cSplit(dt, "b", sep = ",", direction = "long")
a b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt
This function uses data.table under the hood, and it works even if there are repeated values in the column "a".
We can split the column 'b' by the delimiter ',' (using strsplit), grouped by 'a', and set the name of the new column, 'V1', to 'b' with setnames:
setnames(dt[, strsplit(b, ','), by = a], "V1", "b")[]
# a b
#1: a xx
#2: a yy
#3: a zz
#4: b mm
#5: b nn
#6: c qq
#7: c rr
#8: c ss
#9: c tt
If there are repeating elements in 'a' as in the below example
dt <- data.table(a = c('a','b','c', 'a'),
                 b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt', 'zz,gg,tt'))
we can group by the sequence of rows, do the strsplit on 'b', concatenate with the 'a' column and assign (:=) the 'grp' to NULL
dt[, c(a=a, b=strsplit(b, ",")), .(grp = 1:nrow(dt))][, grp := NULL][]
# a b
# 1: a xx
# 2: a yy
# 3: a zz
# 4: b mm
# 5: b nn
# 6: c qq
# 7: c rr
# 8: c ss
# 9: c tt
#10: a zz
#11: a gg
#12: a tt
NOTE: Both methods are data.table methods.
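For completeness, here is one more data.table variant along the same lines, a sketch that handles repeated values in 'a' by grouping on the row number (the helper column name row is mine):
# Split each b within its own row-group so repeated 'a' values stay separate,
# then drop the helper column:
dt[, .(b = unlist(strsplit(b, ","))), by = .(row = seq_len(nrow(dt)), a)][, row := NULL][]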
