Unpivot list in data.table rows

I have a dataset that contains several columns, including 1 with list entries:
DT = data.table(
x = c(1:5),
y = seq(2, 10, 2),
z = list(list("a","b","a"), list("a","c"), list("b","c"), list("a","b","c"), list("b","c","b"))
)
Basically, I'm trying to unlist a, b, c from column z, and aggregate the data based on the x & y values.
Desired output:
z x sum(y)
1: a 1 4
2: b 1 2
3: a 2 4
4: c 2 4
5: b 3 6
6: c 3 6
7: a 4 8
8: b 4 8
9: c 4 8
10: b 5 20
11: c 5 10
My current method is rather round-about; I created 2 other columns with x and y values in lists of the same length as the list entry in z column, then unlisted all 3 columns simultaneously before aggregating - i.e. sum y values, grouped by z & x.
Code (before unlisting & aggregation):
DT[, listlen := sapply(z, function(x) length(x))]
for (a in c(1:nrow(DT))){
DT[a, x1:= list(list(rep(DT[a, x], DT[a, listlen])))]
DT[a, y1:= list(list(rep(DT[a, y], DT[a, listlen])))]}
DT_out = data.table(x = unlist(DT[,x1]), y = unlist(DT[,y1]), z = unlist(DT[,z]))
x y z listlen x1 y1
1: 1 2 <list> 3 1,1,1 2,2,2
2: 2 4 <list> 2 2,2 4,4
3: 3 6 <list> 2 3,3 6,6
4: 4 8 <list> 3 4,4,4 8,8,8
5: 5 10 <list> 3 5,5,5 10,10,10
Is there a method in the data.table or reshape packages that can help me melt the dataset, or do this more simply? I'm working with many more rows than this, and this step seems very inefficient.
Any other help regarding the aggregation step would be much appreciated too!

Unlist your z column first, then aggregate as normal via by=:
DT[, .(z=unlist(z)), by=.(x,y)][, .(sumy=sum(y)), by=.(x,z)]
# x z sumy
# 1: 1 a 4
# 2: 1 b 2
# 3: 2 a 4
# 4: 2 c 4
# 5: 3 b 6
# 6: 3 c 6
# 7: 4 a 8
# 8: 4 b 8
# 9: 4 c 8
#10: 5 b 20
#11: 5 c 10
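For comparison, here is a minimal sketch of the same reshape without the first grouping step. It assumes every element of z is a flat list of strings, so that unlist(z) has exactly sum(lengths(z)) elements; the names n, long and res are only illustrative.

```r
library(data.table)

DT <- data.table(
  x = 1:5,
  y = seq(2, 10, 2),
  z = list(list("a","b","a"), list("a","c"), list("b","c"),
           list("a","b","c"), list("b","c","b"))
)

# Number of entries in each list cell of z
n <- lengths(DT$z)

# Repeat x and y once per list entry, then flatten z into a plain vector
long <- DT[, .(x = rep(x, n), y = rep(y, n), z = unlist(z))]

# Aggregate as usual
res <- long[, .(sumy = sum(y)), by = .(x, z)]
res
```

Whether this is faster than grouping by x and y depends on the data, so it is worth timing both on a realistic sample.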

Related

Compare values in a grouped data frame with corresponding value in a vector

Let's say I got a data.frame like the following:
u <- as.numeric(rep(rep(1:5,3)))
w <- as.factor(c(rep("a",5), rep("b",5), rep("c",5)))
q <- data.frame(w,u)
q
w u
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
and the vector:
v <- c(2,3,1)
Now I want to find, within each group [i], the first row where the value in column "u" is bigger than the value [i] from vector "v".
The result should look like this:
1 a 3
2 b 4
3 c 2
I tried:
fun <- function (m) {
first(which(m[,2]>v))
}
ddply(q, .(w), summarise, fun(q))
and got as a result:
w fun(q)
1 a 3
2 b 3
3 c 3
So it seems that ddply is only taking the first value from the vector "v".
Does anyone know how to solve this?
We can join the vector by creating a data.frame with 'w' as the unique values from the 'w' column of 'q', then group by 'w' and get the first row index where u is greater than the corresponding value in the 'new' column:
library(dplyr)
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
summarise(n = which(u > new)[1])
# // or use findInterval
#summarise(n = findInterval(new[1], u)+1)
-output
# A tibble: 3 x 2
# w n
#* <fct> <int>
#1 a 3
#2 b 4
#3 c 2
or use Map after splitting the data by 'w' column
Map(function(x, y) which(x$u > y)[1], split(q,q$w), v)
#$a
#[1] 3
#$b
#[1] 4
#$c
#[1] 2
The OP mentioned that the comparison starts from the beginning of the data, but that is not the case here because of the group_by operation. If we create a row-number column, it resets in each group:
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
mutate(rn = row_number())
Joining, by = "w"
# A tibble: 15 x 4
# Groups: w [3]
w u new rn
<fct> <dbl> <dbl> <int>
1 a 1 2 1
2 a 2 2 2
3 a 3 2 3
4 a 4 2 4
5 a 5 2 5
6 b 1 3 1
7 b 2 3 2
8 b 3 3 3
9 b 4 3 4
10 b 5 3 5
11 c 1 1 1
12 c 2 1 2
13 c 3 1 3
14 c 4 1 4
15 c 5 1 5
Using data.table: for each 'w' (by = w), subset 'v' with the group index .GRP. Compare the value with 'u' (v[.GRP] < u). Get the index for the first TRUE (which.max):
library(data.table)
setDT(q)[ , which.max(v[.GRP] < u), by = w]
# w V1
# 1: a 3
# 2: b 4
# 3: c 2
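One caveat with which.max: if no row in a group satisfies the condition, it still returns 1, because the first FALSE is the maximum of an all-FALSE vector. A small sketch of a variant that returns NA instead, using a hypothetical v whose last value exceeds every u in group "c" (the name first_idx is only illustrative):

```r
library(data.table)

q <- data.table(w = rep(c("a", "b", "c"), each = 5),
                u = as.numeric(rep(1:5, 3)))
v <- c(2, 3, 9)  # hypothetical values; 9 is larger than every u in group "c"

# which() returns integer(0) when nothing matches, and integer(0)[1] is NA
first_idx <- q[, .(n = which(u > v[.GRP])[1L]), by = w]
first_idx
```

Like the answer above, this relies on the groups appearing in the same order as the entries of v.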

Sort a data.table programmatically using character vector of multiple column names

I need to sort a data.table on multiple columns provided as character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
DT[order(keycol)]
x y v
1: b 1 1
2: b 3 2
Somehow it displays just 2 rows and drops the other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
It works fine.
Can anyone help with sorting using column name vector?
You need ?setorderv and its cols argument:
A character vector of column names of x by which to order
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
DT
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.
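As for why the original attempt returned only two rows: order(keycol) sorts the two strings in keycol rather than the columns they name, which appears to be why rows 1 and 2 were selected. The sketch below shows this, plus a way to keep the original table untouched by sorting a copy. The names sorted and sorted_desc are only illustrative.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
keycol <- c("x", "y")

# This orders the character vector c("x", "y") itself, giving the indices 1 and 2
order(keycol)

# Sort a copy so DT itself is left unchanged
sorted <- setorderv(copy(DT), keycol)

# The order argument takes 1 (ascending) or -1 (descending) per column
sorted_desc <- setorderv(copy(DT), keycol, order = c(1L, -1L))
```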

How to join tables with OR condition in data.table

Is this possible in data.table to join tables with OR condition?
For example:
library(data.table)
X<-data.table(x=c('a','b','c','d','e','f'),y=c(1,1,2,2,3,3),z=c(10,11,12,13,14,15))
x y z
1: a 1 10
2: b 1 11
3: c 2 12
4: d 2 13
5: e 3 14
6: f 3 15
Y<-data.table(x=c('a','e','a'),z=c(12,20,14),t=c('a','b','c'))
x z t
1: a 12 a
2: e 20 b
3: a 14 c
# and i need something like this:
X[Y,on=c("x"|"z"),.(x,y,z,i.t)]
x y z t
1: a 1 10 a
2: a 1 10 c
3: b 1 11 NA
4: c 2 12 a
5: d 2 13 NA
6: e 3 14 b
7: e 3 14 c
8: f 3 15 NA
I haven't found information about joining with OR in documentation.
Have I missed something?
The OP requested that the result set should consist of 3 subsets:
rows matching on column x
rows matching on column z
remaining rows of data.table X
So, this is a kind of right outer join of table X with Y on either column x or z.
This can be translated into 2 separate inner joins on column x and z respectively, a union of both result sets, and a final outer join to add the remaining rows from table X.
Combined in one data.table statement this becomes
unique(rbindlist(list(
X[Y, on = "x", .(x, y, z, t), nomatch = 0],
X[Y, on = "z", .(x, y, z, t), nomatch = 0]
)))[X, on = .(x, y, z)]
# x y z t
#1: a 1 10 a
#2: a 1 10 c
#3: b 1 11 NA
#4: c 2 12 a
#5: d 2 13 NA
#6: e 3 14 b
#7: e 3 14 c
#8: f 3 15 NA
The inner joins are enforced by parameter nomatch = 0. The union operation is implemented using rbindlist(list(...)). EDIT: unique() is required to remove double matches in case where x and z are matching in the same row in X and in Y (thanks to filius_arator for pointing this out).
The final right outer join uses all rows of X including those which haven't been matched yet. Note that this join is on the three columns of X.
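The same steps can be written out one at a time, which makes the intermediate results easier to inspect. This is only a restatement of the statement above, with illustrative names (by_x, by_z, matched, result):

```r
library(data.table)

X <- data.table(x = c("a", "b", "c", "d", "e", "f"),
                y = c(1, 1, 2, 2, 3, 3),
                z = c(10, 11, 12, 13, 14, 15))
Y <- data.table(x = c("a", "e", "a"), z = c(12, 20, 14), t = c("a", "b", "c"))

# Inner join on x: x, y, z come from X, t comes from Y
by_x <- X[Y, on = "x", .(x, y, z, t), nomatch = 0]

# Inner join on z
by_z <- X[Y, on = "z", .(x, y, z, t), nomatch = 0]

# Union of both result sets, dropping rows matched by both joins
matched <- unique(rbindlist(list(by_x, by_z)))

# Bring back every row of X; unmatched rows get t = NA
result <- matched[X, on = .(x, y, z)]
result
```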
I am not sure if this is what you want or if it is very data.table-esque, but there are no other answers at the moment:
join1 <- merge(X, Y[,c('x', 't'), with=FALSE], all.x=TRUE)
merge(join1, Y[,c('z', 't'), with=FALSE], all.x=TRUE, by = 'z')[,
t := ifelse(!is.na(t.x), t.x, t.y)][,
t.x := NULL][,
t.y := NULL][]
Giving:
z x y t
1: 10 a 1 a
2: 11 b 1 NA
3: 12 c 2 a
4: 13 d 2 NA
5: 14 e 3 b
6: 15 f 3 NA
EDIT: with the updated example, here's an approach, but I'm sure there are better ways that the data.table gurus could show:
join1 <- merge(X, Y[,c('x', 't'), with=FALSE], all.x=TRUE)
merge(join1, Y[,c('z', 't'), with=FALSE], all.x=TRUE, by = 'z')[,
id := seq(.N)][,
.(t =list( na.omit(c(t.x, t.y)))), by = c('id', 'x', 'y', 'z')][,
.(x=x, y=y, z=z, t=unlist(t)), by = c('id')][]
## id x y z t
## 1: 1 a 1 10 a
## 2: 2 a 1 10 c
## 3: 3 b 1 11 NA
## 4: 4 c 2 12 a
## 5: 5 d 2 13 NA
## 6: 6 e 3 14 b
## 7: 6 e 3 14 c
## 8: 7 f 3 15 NA

Update a data.table based on another data table

I want to update the columns in an old data.table based on a new data.table only when the value is not NA.
DT_old = data.table(x=rep(c("a","b","c")), y=c(1,3,6), v=1:3, l=c(1,1,1))
DT_old
x y v l
1: a 1 1 1
2: b 3 2 1
3: c 6 3 1
DT_new = data.table(x=rep(c("b","c",'d')), y=c(9,6,10), v=c(2,NA,10), z=c(9,9,9))
DT_new
x y v z
1: b 9 2 9
2: c 6 NA 9
3: d 10 10 9
I want the output to be
x y v z
1: b 9 2 9
2: c 6 3 9
3: d 10 10 9
4: a 1 1 NA
Currently I am merging the two data.table and going through each column and replacing the NA in the new data.table
DT_merged <- merge(DT_new, DT_old, all=TRUE, by='x')
DT_merged
x y.x v.x z y.y v.y l
1: a NA NA NA 1 1 1
2: b 9 2 9 3 2 1
3: c 6 NA 9 6 3 1
4: d 10 10 9 NA NA NA
DT_merged[is.na(y.x), y.x := y.y]
DT_merged[is.na(v.x), v.x := v.y]
DT_merged = DT_merged[, list(x, y=y.x, v=v.x, z=z)]
Is there a better way to do the above?
Here's how I would approach this. First, I will expand DT_new according to the unique values combination of x columns of both tables using binary join
res <- setkey(DT_new, x)[unique(c(x, DT_old$x))]
res
# x y v z
# 1: b 9 2 9
# 2: c 6 NA 9
# 3: d 10 10 9
# 4: a NA NA NA
Then, I will update the two columns in res by reference using another binary join
setkey(res, x)[DT_old, `:=`(y = i.y, v = i.v)]
res
# x y v z
# 1: a 1 1 NA
# 2: b 3 2 9
# 3: c 6 3 9
# 4: d 10 10 9
Following the comments section, it seems that you are trying to join each column by its own condition. There is no simple way of doing such a thing in R or any language AFAIK. Thus, your own solution could be a good option by itself.
Though, here are some other alternatives, mainly taken from a similar question I myself asked not long ago
Using two ifelse statements
setkey(res, x)[DT_old, `:=`(y = ifelse(is.na(y), i.y, y),
v = ifelse(is.na(v), i.v, v))]
Two separate conditional joins
setkey(res, x) ; setkey(DT_old, x) ## old data set needs to be keyed too now
res[is.na(y), y := DT_old[.SD, y]]
res[is.na(v), v := DT_old[.SD, v]]
Both will give you what you need.
P.S.
If you don't want warnings, you need to define the corresponding column classes correctly, e.g. v column in DT_new should be defined as v= c(2L, NA_integer_, 10L)
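Another possible sketch, assuming a data.table version that provides fcoalesce (1.12.4 or later, as far as I know): merge the two tables and take the first non-NA value per column pair. fcoalesce requires both inputs to have the same type, so v is defined as double in both tables here, which differs from the question's 1:3. The names m and res2 are only illustrative.

```r
library(data.table)

# Assumption: v is numeric (double) in both tables so the types match
DT_old <- data.table(x = c("a", "b", "c"), y = c(1, 3, 6), v = c(1, 2, 3), l = c(1, 1, 1))
DT_new <- data.table(x = c("b", "c", "d"), y = c(9, 6, 10), v = c(2, NA, 10), z = c(9, 9, 9))

# Full outer merge; shared columns get .x (new) and .y (old) suffixes
m <- merge(DT_new, DT_old, by = "x", all = TRUE)

# Prefer the new value and fall back to the old one where the new one is NA
res2 <- m[, .(x, y = fcoalesce(y.x, y.y), v = fcoalesce(v.x, v.y), z)]
res2
```

The rows come back sorted by x rather than in the order shown in the question.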

get rows of unique values by group

I have a data.table and want to pick those lines of the data.table where some values of a variable x are unique relative to another variable y
It's possible to get the unique values of x, grouped by y in a separate dataset, like this
dt[,unique(x),by=y]
But I want to pick the rows in the original dataset where this is the case. I don't want a new data.table because I also need the other variables.
So, what do I have to add to my code to get the rows in dt for which the above is true?
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
y x z
1: a 1 1
2: a 2 2
3: a 2 3
4: b 3 4
5: b 2 5
6: b 1 6
What I want:
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
The idiomatic data.table way is:
require(data.table)
unique(dt, by = c("y", "x"))
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 3 4
# 4: b 2 5
# 5: b 1 6
data.table is a bit different in how to use duplicated. Here's the approach I've seen around here somewhere before:
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
setkey(dt, "y", "x")
key(dt)
# [1] "y" "x"
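# Since data.table 1.9.8, duplicated() checks all columns by default, so pass the key explicitly
!duplicated(dt, by = key(dt))
# [1] TRUE TRUE FALSE TRUE TRUE TRUE
dt[!duplicated(dt, by = key(dt))]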
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 1 6
# 4: b 2 5
# 5: b 3 4
The simpler data.table solution is to grab the first element of each group
dt[, head(.SD, 1), by=.(y, x)]
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
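A related sketch that selects the rows by index instead of building them per group: .I holds the row numbers in the original table, so taking the first .I in each (y, x) group gives the rows to keep. The names keep and first_rows are only illustrative.

```r
library(data.table)

dt <- data.table(y = rep(letters[1:2], each = 3), x = c(1, 2, 2, 3, 2, 1), z = 1:6)

# Row number of the first occurrence of each (y, x) combination
keep <- dt[, .I[1L], by = .(y, x)]$V1

# Subset the original table, keeping all columns in their original order
first_rows <- dt[keep]
first_rows
```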
Using dplyr:
library(dplyr)
col1 = c(1,1,3,3,5,6,7,8,9)
col2 = c("cust1", 'cust1', 'cust3', 'cust4', 'cust5', 'cust5', 'cust5', 'cust5', 'cust6')
df1 = data.frame(col1, col2)
df1
distinct(select(df1, col1, col2))
