Expanding data.table by operating on a column - r

I want to perform an operation on a subset of rows in a data.table that results in a greater number of rows than I started out with. Is there an easy way to expand the original data.table to accommodate this? If not, how could I accomplish this?
Here's a sample of my original data.
DT <- data.table(my.id=c(1,2,3), unmodified=c("a","b","c"), vals=c("apple",NA,"cat"))
DT
my.id unmodified vals
1: 1 a apple
2: 2 b NA
3: 3 c cat
And this is my desired output.
DT
my.id unmodified vals
1: 1 a apple
2: 2 b boy
3: 2 b bat
4: 2 b bag
5: 3 c cat
The new rows can appear at the end as well, I don't care about the order. I tried DT[my.id == 2, vals := c("boy","bat","bag")], but it ignores the last 2 entries with a warning.
TIA!
EDIT: My original dataset has about 10 million rows, although the entry with a missing value occurs just once. I'd prefer not to create copies of the data.table, if possible.

You can use the summarize pattern of data.table, grouping by my.id and unmodified; the group values are recycled to match the length of the result within each group:
DT[, .(vals = if(my.id == 2) c("boy","bat","bag") else vals), .(my.id, unmodified)]
# my.id unmodified vals
#1: 1 a apple
#2: 2 b boy
#3: 2 b bat
#4: 2 b bag
#5: 3 c cat

Another option is to split the data into the rows where 'my.id' is 2 and the rows where it is not, then rbind them:
rbind(DT[my.id == 2][, .(my.id, unmodified, vals = c('boy', 'bat',
'bag'))], DT[my.id != 2])[order(my.id)]
# my.id unmodified vals
#1: 1 a apple
#2: 2 b boy
#3: 2 b bat
#4: 2 b bag
#5: 3 c cat

> DT <- data.table(my.id=c(1,2,3), unmodified=c("a","b","c"), vals=c("apple",NA,"cat"))
> DT
my.id unmodified vals
1: 1 a apple
2: 2 b NA
3: 3 c cat
> DT2 <- data.table(my.id=rep(2,3), unmodified=rep("b",3), vals=c("boy","bat","bag"))
> DT2
my.id unmodified vals
1: 2 b boy
2: 2 b bat
3: 2 b bag
> rbind(DT,DT2)
my.id unmodified vals
1: 1 a apple
2: 2 b NA
3: 3 c cat
4: 2 b boy
5: 2 b bat
6: 2 b bag
> rbind(DT,DT2)[order(my.id),]
my.id unmodified vals
1: 1 a apple
2: 2 b NA
3: 2 b boy
4: 2 b bat
5: 2 b bag
6: 3 c cat
> na.omit(rbind(DT,DT2)[order(my.id),])
my.id unmodified vals
1: 1 a apple
2: 2 b boy
3: 2 b bat
4: 2 b bag
5: 3 c cat
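If several ids needed replacement values, the same idea could be expressed as a join against a lookup table. This is only a sketch: the table `repl` and its column `new.vals` are hypothetical names, `fcoalesce()` assumes a reasonably recent data.table (1.12.4 or later), and `merge()` still builds a new table, so it does not avoid the copy the question hoped to avoid.

```r
library(data.table)

DT <- data.table(my.id = c(1, 2, 3), unmodified = c("a", "b", "c"),
                 vals = c("apple", NA, "cat"))

# Hypothetical lookup table: one row per replacement value.
repl <- data.table(my.id = 2, new.vals = c("boy", "bat", "bag"))

# Left join: an id with several replacements yields several rows.
out <- merge(DT, repl, by = "my.id", all.x = TRUE)

# Keep the original value where present, otherwise take the replacement,
# then drop the helper column.
out[, vals := fcoalesce(vals, new.vals)][, new.vals := NULL]
out
```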

Related

data.table rejects to create new column when used in a chain

I am new to the data.table package and am confused about its chaining behavior. Suppose we have the following code:
aa <- data.table(a=c(1,2,3), b = c("a","b","b"))
aa[order(a,b)][,c:=cumsum(a), by=.(b)]
> aa
a b
1: 1 a
2: 2 b
3: 3 b
Column c is not created.
But if we write the code separately, without chaining, c is generated:
aa <- data.table(a=c(1,2,3), b = c("a","b","b"))
aa[order(a,b)]
a b
1: 1 a
2: 2 b
3: 3 b
aa[,c:=cumsum(a), by=.(b)]
> aa
a b c
1: 1 a 1
2: 2 b 2
3: 3 b 5
So why does this happen? How can I write the chain code correctly using data.table?
Thank you a lot!
Below is new
I noticed that if we assign the result to a new data.table, it works:
aa <- data.table(a=c(1,2,3), b = c("a","b","b"))
bb <- aa[order(a,b)][,c:=cumsum(a), by=.(b)]
> bb
a b c
1: 1 a 1
2: 2 b 2
3: 3 b 5
Pay attention to the order. You can try
aa[,c:=cumsum(a), by=.(b)][order(a,b)]
The line aa[order(a,b)][,c:=cumsum(a), by=.(b)] can be viewed as tmp <- aa[order(a,b)] followed by tmp[,c:=cumsum(a), by=.(b)]. Column c is indeed created, but in the temporary copy rather than in the original aa; you can run aa[order(a,b)][,c:=cumsum(a), by=.(b)][] to check.
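A minimal sketch of the difference, using a deliberately unsorted variant of the example data (the variable names `tmp` and `has_c_after_chain` are only for illustration). If the goal is to sort first and then add the column to aa itself, one option is `setorder()`, which reorders the table by reference instead of returning a copy.

```r
library(data.table)

aa <- data.table(a = c(3, 1, 2), b = c("b", "a", "b"))

# aa[order(a, b)] returns a new, sorted data.table; := then adds the
# column to that temporary copy, so aa itself is left unchanged.
tmp <- aa[order(a, b)][, c := cumsum(a), by = .(b)]
has_c_after_chain <- "c" %in% names(aa)   # aa has no column c yet

# setorder() sorts aa by reference, so the following := modifies aa itself.
setorder(aa, a, b)
aa[, c := cumsum(a), by = .(b)]
aa
```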

I found a strange thing (bug?) about 'combn' function and 'data.table' package [all possible combinations by group]

I tried to find all possible combinations by group, using the combn function and the data.table package as described in the post "Generate All ID Pairs, by group with data.table in R".
This gives me the expected result.
dat1 <- data.table(ids=1:4, groups=c("B","A","B","A"))
dat1
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 A
dat1[, as.data.table(t(combn(ids, 2))), .(groups)]
groups V1 V2
1: B 1 3
2: A 2 4
But this gives me a strange result. It's very weird. I tried to understand this result for about 3 hours but I can't. Isn't it a bug?
dat2 <- data.table(ids=1:4, groups=c("B","A","B","C"))
dat2
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 C
dat2[, as.data.table(t(combn(ids, 2))), .( groups)]
groups V1 V2
1: B 1 3
2: A 1 2
3: C 1 2
4: C 1 3
5: C 1 4
6: C 2 3
7: C 2 4
8: C 3 4
I would really appreciate your help.
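A likely explanation, based on the documented behavior of combn rather than on anything specific to data.table: when x is a single positive integer n, combn(x, m) treats it as seq_len(n). Group A contains only ids = 2, so combn(2, 2) returns the pair 1 2, and group C contains only ids = 4, so combn(4, 2) returns all six pairs from 1:4. The sketch below assumes that groups with fewer than two ids should simply produce no pairs.

```r
library(data.table)

dat2 <- data.table(ids = 1:4, groups = c("B", "A", "B", "C"))

# A length-one positive integer n is expanded to seq_len(n):
n_pairs_scalar <- ncol(combn(4, 2))   # same as combn(1:4, 2)

# Only build pairs for groups with at least two ids; groups where the
# `if` condition is FALSE return NULL and are dropped from the result.
pairs <- dat2[, if (.N >= 2L) as.data.table(t(combn(ids, 2))), by = groups]
pairs
```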

rbindlist data.tables different dimensions

I run a function multiple times, and each call returns several tables, as in this example.
require(data.table)
myfunction<-function(x){
DT1<-data.table(a=c(1,2,3),b=c("a","b","c"))
DT2<-data.table(d=c(4,5,6), e=c("d","e","f"))
return(list(DT1=DT1, DT2=DT2))
}
result<-lapply(1:2, myfunction)
I want to bind the results. The desired output is the one shown below. My real example uses hundreds of tables.
l1<-rbindlist(list(result[[1]]$DT1, result[[2]]$DT1), idcol = TRUE)
l2<-rbindlist(list(result[[1]]$DT2, result[[2]]$DT2), idcol = TRUE)
DESIRED_OUTPUT<-list(l1, l2)
I tried the approach from this question, but it does not work:
rbindlist data.tables wtih different number of columns
======================================================================
Update
The option that @nicola proposed doesn't work when the number of elements in the list is different from 2, as it was in the first example (DT1 and DT2). As a solution, I create a variable "l" that holds the number of elements in the list returned by the function.
New example with solution.
require(data.table)
myfunction<-function(x){
DT1<-data.table(a=c(1,2,3),b=c("a","b","c"))
DT2<-data.table(d=c(4,5), e=c("d","e"))
DT3<-data.table(f=c(7,8,NA,9), g=c("g","h","i","j"))
return(list(DT1=DT1, DT2=DT2, DT3=DT3))
}
result<-lapply(1:5, myfunction)
l<-unique(sapply(result, length))
apply(matrix(unlist(result,recursive=FALSE),nrow=l),1,rbindlist,idcol=TRUE)
Here's an option:
do.call(function(...) Map(function(...) rbind(..., idcol = T), ...), result)
#$DT1
# .id a b
#1: 1 1 a
#2: 1 2 b
#3: 1 3 c
#4: 2 1 a
#5: 2 2 b
#6: 2 3 c
#
#$DT2
# .id d e
#1: 1 4 d
#2: 1 5 e
#3: 1 6 f
#4: 2 4 d
#5: 2 5 e
#6: 2 6 f
Here's another:
lapply(purrr::transpose(result), rbindlist, idcol = T)
An attempt that should match on names of the list components:
flat <- unlist(result, recursive=FALSE)  # one flat list of tables, keeping the DT1/DT2 names
Map(
function(LL,n) rbindlist(unname(LL[names(LL) %in% n]), idcol=TRUE),
list(flat),
unique(names(flat))
)
#[[1]]
# .id a b
#1: 1 1 a
#2: 1 2 b
#3: 1 3 c
#4: 2 1 a
#5: 2 2 b
#6: 2 3 c
#
#[[2]]
# .id d e
#1: 1 4 d
#2: 1 5 e
#3: 1 6 f
#4: 2 4 d
#5: 2 5 e
#6: 2 6 f
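The name-based idea can also be written as a loop over the component names, which does not depend on how many tables each call returns. This is a sketch that assumes every call returns the same named components; `nms` and `bound` are illustrative names.

```r
library(data.table)

myfunction <- function(x) {
  list(DT1 = data.table(a = c(1, 2, 3), b = c("a", "b", "c")),
       DT2 = data.table(d = c(4, 5), e = c("d", "e")),
       DT3 = data.table(f = c(7, 8, NA, 9), g = c("g", "h", "i", "j")))
}
result <- lapply(1:5, myfunction)

# Component names taken from the first call (assumed identical in all calls).
nms <- names(result[[1]])

# For each name, pull that table out of every call and stack the copies;
# .id records which call each row came from.
bound <- setNames(
  lapply(nms, function(n) rbindlist(lapply(result, `[[`, n), idcol = TRUE)),
  nms
)
```

Here `bound$DT1`, `bound$DT2` and `bound$DT3` each hold the stacked tables for one component.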

How to sort a data.table using a target vector

So, I have the following data.table
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
> DT
x y
1: b 1
2: b 2
3: b 3
4: a 1
5: a 2
6: a 3
7: c 1
8: c 2
9: c 3
And I have the following vector
k <- c("2","3","1")
I want to use k as a target vector to sort DT using y and get something like this.
> DT
x y
1: b 2
2: a 2
3: c 2
4: b 3
5: a 3
6: c 3
7: b 1
8: a 1
9: c 1
Any ideas? If I use DT[order(k)] I get a subset of the original data, and that isn't what I am looking for.
Throw a call to match() in there.
DT[order(match(y, as.numeric(k)))]
# x y
# 1: b 2
# 2: a 2
# 3: c 2
# 4: b 3
# 5: a 3
# 6: c 3
# 7: b 1
# 8: a 1
# 9: c 1
Actually DT[order(match(y, k))] would work as well, but it is probably safest to make the arguments to match() of the same class just in case.
Note: match() is known to be sub-optimal in some cases. If you have a large number of rows, you may want to switch to fastmatch::fmatch for faster matching.
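To see why this works, it can help to look at what match() returns on its own: for each value of y it gives that value's position in the target vector, and order() then sorts rows by that position. A small sketch (the names `pos` and `sorted` are only for illustration):

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 2, 3))
k <- c("2", "3", "1")

# Position of each y value in the target vector: 1 is third, 2 is first,
# 3 is second.
pos <- match(DT$y, as.numeric(k))

# Sorting by that position puts the rows in the order given by k;
# ties keep their original relative order.
sorted <- DT[order(pos)]
sorted
```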
You can do this:
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
setkey(DT,y)
DT[data.table(as.numeric(k))]
or (from the comment of Richard)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
DT[data.table(y = as.numeric(k)), on = "y"]

data preparation part II

There's another problem I encountered which is (I think) quite interesting:
dt <- data.table(K=c("A","A","A","B","B","B"),A=c(2,3,4,1,3,4),B=c(3,3,3,1,1,1))
dt
K A B
1: A 2 3
2: A 3 3
3: A 4 3
4: B 1 1
5: B 3 1
6: B 4 1
Now I want a somewhat "higher" level of the data. For each letter in K, there should be only one line, and "A_sum" should contain the number of A values that share the same B value. So there are three values for B=3 and three values for B=1.
Resulting data.table:
dt_new
K A_sum B
1: A 3 3
2: B 3 1
It's not clear how you want to treat K, but here's one option:
dt_new <- dt[, list(A_sum = length(A)), by = list(K, B)]
# K B A_sum
# 1: A 3 3
# 2: B 1 3
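Since A_sum is just a row count per group, the same result can be sketched with data.table's `.N` symbol instead of length(A); `setcolorder()` is used here only to match the column order in the desired output.

```r
library(data.table)

dt <- data.table(K = c("A", "A", "A", "B", "B", "B"),
                 A = c(2, 3, 4, 1, 3, 4),
                 B = c(3, 3, 3, 1, 1, 1))

# .N is the number of rows in each (K, B) group.
dt_new <- dt[, .(A_sum = .N), by = .(K, B)]

# Reorder columns by reference to K, A_sum, B.
setcolorder(dt_new, c("K", "A_sum", "B"))
dt_new
```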
