I need to sort a data.table on multiple columns provided as character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
DT[order(keycol)]
x y v
1: b 1 1
2: b 3 2
Somehow It displays just 2 rows and removes other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
It works like fluid.
Can anyone help with sorting using column name vector?
You need ?setorderv and its cols argument:
A character vector of column names of x by which to order
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
DT
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.
Related
I'm trying to find a way to merge multiple column numeric column as a new list type column.
Data Table
dt <- data.table(
a=c(1,2,3),
b=c(4,5,6),
c=c(7,8,9)
)
Expected Result
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Attempt 1
I have tried doing append with a list with dt[,d:=list(c(a,b,c))] but it just append everything instead and get the incorrect result
a b c d
1: 1 4 7 1,2,3,4,5,6,...
2: 2 5 8 1,2,3,4,5,6,...
3: 3 6 9 1,2,3,4,5,6,...
Do a group by row and place the elements in the list
dt[, d := .(list(unlist(.SD, recursive = FALSE))), 1:nrow(dt)]
-output
dt
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Or another option is paste and strsplit
dt[, d := strsplit(do.call(paste, c(.SD, sep=",")), ",")]
Or may use transpose
dt[, d := lapply(data.table::transpose(unname(.SD)), unlist)]
dt
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
dt[, d := purrr::pmap(.SD, ~c(...))]
This is related to this question from Henrik
Assign multiple columns using := in data.table, by group
But what if I want to create a new data.table with given column names instead of assigning new columns to an existing one?
f <- function(x){list(head(x,2),tail(x,2))}
dt <- data.table(group=sample(c('a','b'),10,replace = TRUE),val=1:10)
> dt
group val
1: b 1
2: b 2
3: a 3
4: b 4
5: a 5
6: b 6
7: a 7
8: a 8
9: b 9
10: b 10
I want to get a new data.table with predefined column names by calling the function f:
dt[,c('head','tail')=f(val),by=group]
I wish to get this:
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
But it gives me an error. What I can do is create the table then change the column names, but that seems cumbersome:
> dt2 <- dt[,f(val),by=group]
> dt2
group V1 V2
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
> colnames(dt2)[-1] <- c('head','tail')
> dt2
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
Is it something I can do with one call?
From running your code as-is, this is the error I get:
dt[,c('head','tail')=f(val),by=group]
# Error: unexpected '=' in "dt2[,c('head','tail')="
The problem is using = instead of := for assignment.
On to your problem of wanting a new data.table:
dt2 <- dt[, setNames(f(val), c('head', 'tail')), by = group]
I have a dataset that contains several columns, including 1 with list entries:
DT = data.table(
x = c(1:5),
y = seq(2, 10, 2),
z = list(list("a","b","a"), list("a","c"), list("b","c"), list("a","b","c"), list("b","c","b"))
)
Basically, I'm trying to unlist a, b, c from column z, and aggregate the data based on the x & y values.
Desired output:
z x sum(y)
1: a 1 4
2: b 1 2
3: a 2 4
4: c 2 4
5: b 3 6
6: c 3 6
7: a 4 8
8: b 4 8
9: c 4 8
10: b 5 20
11: c 5 10
My current method is rather round-about; I created 2 other columns with x and y values in lists of the same length as the list entry in z column, then unlisted all 3 columns simultaneously before aggregating - i.e. sum y values, grouped by z & x.
Code (before unlisting & aggregation):
DT[, listlen := sapply(z, function(x) length(x))]
for (a in c(1:nrow(DT))){
DT[a, x1:= list(list(rep(DT[a, x], DT[a, listlen])))]
DT[a, y1:= list(list(rep(DT[a, y], DT[a, listlen])))]}
DT_out = data.table(x = unlist(DT[,x1]), y = unlist(DT[,y1]), z = unlist(DT[,z]))
x y z listlen x1 y1
1: 1 2 <list> 3 1,1,1 2,2,2
2: 2 4 <list> 2 2,2 4,4
3: 3 6 <list> 2 3,3 6,6
4: 4 8 <list> 3 4,4,4 8,8,8
5: 5 10 <list> 3 5,5,5 10,10,10
Is there a method through data.table or reshape packages that can help me melt the dataset / do this much simpler? As I'm working with a lot more rows than this and this step seems to be very inefficient.
Any other help regarding the aggregation step would be much appreciated too!
unlist your z column first and then just aggregate as per normal via by=:
DT[, .(z=unlist(z)), by=.(x,y)][, .(sumy=sum(y)), by=.(x,z)]
# x z sumy
# 1: 1 a 4
# 2: 1 b 2
# 3: 2 a 4
# 4: 2 c 4
# 5: 3 b 6
# 6: 3 c 6
# 7: 4 a 8
# 8: 4 b 8
# 9: 4 c 8
#10: 5 b 20
#11: 5 c 10
I want to update the columns in an old data.table based on a new data.table only when the value is not NA.
DT_old = data.table(x=rep(c("a","b","c")), y=c(1,3,6), v=1:3, l=c(1,1,1))
DT_old
x y v l
1: a 1 1 1
2: b 3 2 1
3: c 6 3 1
DT_new = data.table(x=rep(c("b","c",'d')), y=c(9,6,10), v=c(2,NA,10), z=c(9,9,9))
DT_new
x y v z
1: b 9 2 9
2: c 6 NA 9
3: d 10 10 9
I want the output to be
x y v z
1: b 9 2 9
2: c 6 3 9
3: d 10 10 9
4: a 1 1 NA
Currently I am merging the two data.table and going through each column and replacing the NA in the new data.table
DT_merged <- merge(DT_new, DT_old, all=TRUE, by='x')
DT_merged
x y.x v.x z y.y v.y l
1: a NA NA NA 1 1 1
2: b 9 2 9 3 2 1
3: c 6 NA 9 6 3 1
4: d 10 10 9 NA NA NA
DT_merged[is.na(y.x), y.x := y.y]
DT_merged[is.na(v.x), v.x := v.y]
DT_merged = DT_merged[, list(y=y.x, v=v.x, z=z)
Is there a better way to do the above?
Here's how I would approach this. First, I will expand DT_new according to the unique values combination of x columns of both tables using binary join
res <- setkey(DT_new, x)[unique(c(x, DT_old$x))]
res
# x y v z
# 1: b 9 2 9
# 2: c 6 NA 9
# 3: d 10 10 9
# 4: a NA NA NA
Then, I will updated the two columns in res by reference using another binary join
setkey(res, x)[DT_old, `:=`(y = i.y, v = i.v)]
res
# x y v z
# 1: a 1 1 NA
# 2: b 3 2 9
# 3: c 6 3 9
# 4: d 10 10 9
Following the comments section, it seems that you are trying to join each column by its own condition. There is no simple way of doing such thing in R or any language AFAIK. Thus, your own solution could be a good option by itself.
Though, here are some other alternatives, mainly taken from a similar question I myself asked not long ago
Using two ifelse statments
setkey(res, x)[DT_old, `:=`(y = ifelse(is.na(y), i.y, y),
v = ifelse(is.na(v), i.v, v))]
Two separate conditional joins
setkey(res, x) ; setkey(DT_old, x) ## old data set needs to be keyed too now
res[is.na(y), y := DT_old[.SD, y]]
res[is.na(v), v := DT_old[.SD, v]]
Both will give you what you need.
P.S.
If you don't want warnings, you need to define the corresponding column classes correctly, e.g. v column in DT_new should be defined as v= c(2L, NA_integer_, 10L)
I have a data.table and want to pick those lines of the data.table where some values of a variable x are unique relative to another variable y
It's possible to get the unique values of x, grouped by y in a separate dataset, like this
dt[,unique(x),by=y]
But I want to pick the rows in the original dataset where this is the case. I don't want a new data.table because I also need the other variables.
So, what do I have to add to my code to get the rows in dt for which the above is true?
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
y x z
1: a 1 1
2: a 2 2
3: a 2 3
4: b 3 4
5: b 2 5
6: b 1 6
What I want:
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
The idiomatic data.table way is:
require(data.table)
unique(dt, by = c("y", "x"))
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 3 4
# 4: b 2 5
# 5: b 1 6
data.table is a bit different in how to use duplicated. Here's the approach I've seen around here somewhere before:
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
setkey(dt, "y", "x")
key(dt)
# [1] "y" "x"
!duplicated(dt)
# [1] TRUE TRUE FALSE TRUE TRUE TRUE
dt[!duplicated(dt)]
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 1 6
# 4: b 2 5
# 5: b 3 4
The simpler data.table solution is to grab the first element of each group
> dt[, head(.SD, 1), by=.(y, x)]
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
Thanks to dplyR
library(dplyr)
col1 = c(1,1,3,3,5,6,7,8,9)
col2 = c("cust1", 'cust1', 'cust3', 'cust4', 'cust5', 'cust5', 'cust5', 'cust5', 'cust6')
df1 = data.frame(col1, col2)
df1
distinct(select(df1, col1, col2))