Empty factors in "by" data.table - r

I have a data.table that has factor column with empty levels. I need to get the row count and sums of other variables, all grouped by multiple factors, including the one with empty levels.
My question is similar to this one, but here I need to count for multiple factors.
For example, let data.table be:
library('data.table')
dtr <- data.table(v1 = sample(1:15),
                  v2 = factor(sample(letters[1:3], 15, replace = TRUE), levels = letters[1:5]),
                  v3 = sample(c("yes", "no"), 15, replace = TRUE))
I want to do the following:
dtr[,list(freq=.N,mm=sum(v1,na.rm=T)),by=list(v2,v3)]
#Output is:
v2 v3 freq mm
1: b yes 4 22
2: b no 1 13
3: c no 3 10
4: a no 4 49
5: c yes 1 10
6: a yes 2 16
I want the output to include the empty levels of v2 as well ("d" and "e"), as in table(dtr$v2, dtr$v3), so the final output should look like this (the order doesn't matter):
v2 v3 freq mm
1: b yes 4 22
2: b no 1 13
3: c no 3 10
4: a no 4 49
5: c yes 1 10
6: a yes 2 16
7: d yes 0 0
8: d no 0 0
9: e yes 0 0
10: e no 0 0
I tried to use the method from the link, but I'm not sure how to use the J() function when multiple columns are involved.
This works fine for grouping by one column only:
setkey(dtr,v2)
dtr[J(levels(v2)),list(freq=.N,mm=sum(v1,na.rm=T))]
However, dtr[J(levels(v2),v3),list(freq=.N,mm=sum(v1,na.rm=T))] doesn't include all combinations.

library(data.table)
set.seed(42)
dtr <- data.table(v1 = sample(1:15),
                  v2 = factor(sample(letters[1:3], 15, replace = TRUE), levels = letters[1:5]),
                  v3 = sample(c("yes", "no"), 15, replace = TRUE))
res <- dtr[,list(freq=.N,mm=sum(v1,na.rm=T)),by=list(v2,v3)]
You can use CJ (a cross join). Doing this after aggregation avoids setting the key for the big table and should be faster.
setkeyv(res, c("v2", "v3"))  # setkeyv() takes a character vector; setkey() takes bare column names
res[CJ(levels(dtr[,v2]),unique(dtr[,v3])),]
# v2 v3 freq mm
# 1: a no 1 9
# 2: a yes 2 11
# 3: b no 2 11
# 4: b yes 3 23
# 5: c no 4 40
# 6: c yes 3 26
# 7: d no NA NA
# 8: d yes NA NA
# 9: e no NA NA
# 10: e yes NA NA
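If you want the zeros from the desired output rather than the NAs produced by the join, you can overwrite them afterwards. A minimal sketch following the same CJ route; the zero-filling step at the end is my addition, not part of the answer above:

```r
library(data.table)
set.seed(42)
dtr <- data.table(v1 = sample(1:15),
                  v2 = factor(sample(letters[1:3], 15, replace = TRUE), levels = letters[1:5]),
                  v3 = sample(c("yes", "no"), 15, replace = TRUE))
res <- dtr[, .(freq = .N, mm = sum(v1, na.rm = TRUE)), by = .(v2, v3)]
setkeyv(res, c("v2", "v3"))
out <- res[CJ(v2 = levels(dtr$v2), v3 = unique(dtr$v3))]
# groups absent from the data come back as NA after the join; turn them into the requested zeros
out[is.na(freq), c("freq", "mm") := .(0L, 0L)]
```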

table() will also capture freq values that are zero. To get the "mm" column, you could do a basic join. For example,
library(data.table)
set.seed(42)
dtr <- data.table(v1 = sample(1:15),
                  v2 = factor(sample(letters[1:3], 15, replace = TRUE), levels = letters[1:5]),
                  v3 = sample(c("yes", "no"), 15, replace = TRUE))
res <- as.data.table(dtr[,table(v2,v3)])
setnames(res,'N','freq')
setkey(res,v2,v3)
setkey(dtr,v2,v3)
res <- dtr[,.(mm=sum(v1,na.rm=TRUE)),by=c('v2','v3')][res]
I'm not sure how table() benchmarks with cross join.
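One way to find out is a rough timing of both routes on larger data. This is only a sketch: the data size, and the factor conversion needed to align join types, are my choices rather than part of either answer:

```r
library(data.table)
set.seed(1)
n <- 1e5
big <- data.table(v1 = sample(1:15, n, replace = TRUE),
                  v2 = factor(sample(letters[1:3], n, replace = TRUE), levels = letters[1:5]),
                  v3 = sample(c("yes", "no"), n, replace = TRUE))

t_cj <- system.time({
  agg <- big[, .(freq = .N, mm = sum(v1)), by = .(v2, v3)]
  setkeyv(agg, c("v2", "v3"))
  out_cj <- agg[CJ(v2 = levels(big$v2), v3 = unique(big$v3))]
})

t_tab <- system.time({
  tab <- as.data.table(big[, table(v2, v3)])        # gives character v2/v3 columns
  tab[, v2 := factor(v2, levels = levels(big$v2))]  # align the type for the join
  setnames(tab, "N", "freq")
  out_tab <- big[, .(mm = sum(v1)), by = .(v2, v3)][tab, on = .(v2, v3)]
})

print(rbind(CJ_route = t_cj[3], table_route = t_tab[3]))  # elapsed seconds
```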

Related

replace row values based on another row value in a data.table

I have a trivial question, though I am struggling to find a simple answer. I have a data table that looks something like this:
dt <- data.table(id = rep(c("A", "B", "C"), each = 4),
                 time = rep(1:4, 3),
                 score = c(NA, 10, 15, 13, NA, 25, NA, NA, 18, 29, NA, 19))
dt
# id time score
# 1: A 1 NA
# 2: A 2 10
# 3: A 3 15
# 4: A 4 13
# 5: B 1 NA
# 6: B 2 25
# 7: B 3 NA
# 8: B 4 NA
# 9: C 1 18
# 10: C 2 29
# 11: C 3 NA
# 12: C 4 19
I would like to replace the missing values of my group "B" with the values of "A".
The final dataset should look something like this
final
# id time score
# 1: A 1 NA
# 2: A 2 10
# 3: A 3 15
# 4: A 4 13
# 5: B 1 NA
# 6: B 2 25
# 7: B 3 15
# 8: B 4 13
# 9: C 1 18
# 10: C 2 29
# 11: C 3 NA
# 12: C 4 19
In other words, conditional on the fact that B is NA, I would like to replace the score of "A". Do note that "C" remains NA.
I am struggling to find a clean way to do this using data.table. However, if it is simpler with other methods it would still be ok.
Thanks a lot for your help
Here is one option: get the index of the rows where 'score' is NA and 'id' is "B", then use that to replace the NAs with the corresponding 'score' values from 'A'.
library(data.table)
i1 <- setDT(dt)[id == 'B', which(is.na(score))]
dt[, score:= replace(score, id == 'B' & is.na(score), score[which(id == 'A')[i1]])]
Or a similar option in dplyr
library(dplyr)
dt %>%
  mutate(score = replace(score, id == "B" & is.na(score),
                         score[which(id == "A")[i1]]))
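For reference, the data.table route runs end-to-end like this on data reconstructed to match the printed table (the reconstruction is my assumption; the constructor shown in the question does not reproduce the printed output literally):

```r
library(data.table)
dt <- data.table(id = rep(c("A", "B", "C"), each = 4),
                 time = rep(1:4, 3),
                 score = c(NA, 10, 15, 13, NA, 25, NA, NA, 18, 29, NA, 19))
i1 <- dt[id == "B", which(is.na(score))]             # NA positions within B: 1, 3, 4
dt[, score := replace(score, id == "B" & is.na(score),
                      score[which(id == "A")[i1]])]  # A's scores at those same positions
```

B's time-1 value stays NA because A's time-1 score is itself NA, matching the desired output.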

rbindlist only elements that meet a condition

I have a large list. Some of the elements are strings and some of the elements are data.tables. I would like to create a big data.table, but only rbind the elements that are data.tables.
I know how to do it in a for loop, but I am looking for something more efficient as my data are big and I need something quick.
Thank you!
library(data.table)
DT1 = data.table(
ID = c("b","b","b","a","a","c"),
a = 1:6
)
DT2 = data.table(
ID = c("b","b","b","a","a","c"),
a = 11:16
)
list <- list(DT1, DT2, "string")  # note: this shadows base list(); a name like list_df is safer
I am looking for a result similar to the following, but since I have many entries I cannot write them all out like this:
rbind(DT1, DT2)
Filter the list for data.tables and rbind them:
library(data.table)
rbindlist(Filter(is.data.table, list_df))
# ID a
# 1: b 1
# 2: b 2
# 3: b 3
# 4: a 4
# 5: a 5
# 6: c 6
# 7: b 11
# 8: b 12
# 9: b 13
#10: a 14
#11: a 15
#12: c 16
data
list_df <- list(DT1,DT2,"string")
We can use keep from purrr with bind_rows
library(tidyverse)
keep(list, is.data.table) %>%
bind_rows
# ID a
# 1: b 1
# 2: b 2
# 3: b 3
# 4: a 4
# 5: a 5
# 6: c 6
# 7: b 11
# 8: b 12
# 9: b 13
#10: a 14
#11: a 15
#12: c 16
Or using rbindlist with keep
rbindlist(keep(list, is.data.table))
Using sapply() to generate a logical vector to subset your list
rbindlist(list[sapply(list, is.data.table)])
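If you also want to remember which list element each row came from, rbindlist() has an idcol argument that uses the list's names. A small sketch; the element names first/second/note are made up for illustration:

```r
library(data.table)
DT1 <- data.table(ID = c("b", "b", "b", "a", "a", "c"), a = 1:6)
DT2 <- data.table(ID = c("b", "b", "b", "a", "a", "c"), a = 11:16)
list_df <- list(first = DT1, second = DT2, note = "string")
# keep only the data.tables and tag each row with its source element
res <- rbindlist(Filter(is.data.table, list_df), idcol = "origin")
```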

Row-wise difference in two list using Data.Table in R

I want to use data.table to incrementally find new elements, i.e. for every row, check whether the values in the list have been seen before. If they have, we ignore them; if not, we select them.
I was able to wrap elements by group in a list, but I am unsure how I can find incremental differences.
Here's my attempt:
df = data.table::data.table(id = c('A','B','C','A','B','A','A','A','D','E','E','E'),
Value = c(1,2,3,4,3,5,2,3,7,2,3,9))
df_wrapped=df[,.(Values=(list(unique(Value)))), by=id]
expected_output = data.table::data.table(id = c("A","B","C","D","E"),
Value = list(c(1,4,5,2,3),c(2,3),c(3),c(7),c(2,3,9)),
Diff=list(c(1,4,5,2,3),c(NA),c(NA),c(7),c(9)),
Count = c(5,0,0,1,1))
Thoughts about expected output:
For the first row, all elements are unique. So, we will include them in Diff column.
In the second row, 2,3 have occurred in row 1. So, we will ignore them. Ditto for row 3.
Similarly, 7 and 9 are seen for the first time in row 4 and 5, so we will include them.
Here's visual representation:
expected_output
id Value Diff Count
A 1,4,5,2,3 1,4,5,2,3 5
B 2,3 NA 0
C 3 NA 0
D 7 7 1
E 2,3,9 9 1
I'd appreciate any thoughts. I am only looking for data.table based solutions because of performance issues in my original dataset.
I am not sure why you specifically need to put them in a list, but otherwise I wrote a small piece that could help you.
df = data.table::data.table(id = c('A','B','C','A','B','A','A','A','D','E','E','E'),
Value = c(1,2,3,4,3,5,2,3,7,2,3,9))
df <- df[order(id, Value)]
df[!duplicated(Value), diff := Value]
df[, count := uniqueN(diff, na.rm = TRUE), by = id]
The outcome would be:
> df
id Value diff count
1: A 1 1 5
2: A 2 2 5
3: A 3 3 5
4: A 4 4 5
5: A 5 5 5
6: B 2 NA 0
7: B 3 NA 0
8: C 3 NA 0
9: D 7 7 1
10: E 2 NA 1
11: E 3 NA 1
12: E 9 9 1
Hope this helps, or at least get you started.
Here is another possible approach:
library(data.table)
df = data.table(
id = c('A','B','C','A','B','A','A','A','D','E','E','E'),
Value = c(1,2,3,4,3,5,2,3,7,2,3,9))
valset <- c()
df[, {
d <- setdiff(Value, valset)
valset <- unique(c(valset, Value))
.(Values=.(Value), Diff=.(d), Count=length(d))
},
by=.(id)]
output:
id Values Diff Count
1: A 1,4,5,2,3 1,4,5,2,3 5
2: B 2,3 0
3: C 3 0
4: D 7 7 1
5: E 2,3,9 9 1
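The accumulating valset trick above appears to rely on data.table reusing one evaluation environment across groups. A vectorised alternative that avoids mutable state (my own sketch of the same idea) is to build the cumulative union of values first, then take per-group set differences:

```r
library(data.table)
df <- data.table(id = c('A','B','C','A','B','A','A','A','D','E','E','E'),
                 Value = c(1,2,3,4,3,5,2,3,7,2,3,9))
w <- df[, .(Values = .(unique(Value))), by = id]
seen <- Reduce(union, w$Values, accumulate = TRUE)  # values seen up to and including each row
prev <- c(list(numeric(0)), head(seen, -1))         # values seen strictly before each row
w[, Diff  := Map(setdiff, Values, prev)]
w[, Count := lengths(Diff)]
```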

Sort a data.table programmatically using character vector of multiple column names

I need to sort a data.table on multiple columns provided as character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
DT[order(keycol)]
x y v
1: b 1 1
2: b 3 2
Somehow it displays just two rows and drops the other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
It works as expected.
Can anyone help with sorting using column name vector?
You need ?setorderv and its cols argument:
A character vector of column names of x by which to order
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
DT
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.
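setorderv() also takes an order argument, one entry per column, which is handy when sort directions differ. A small sketch; the mixed ascending/descending example is mine, not from the question:

```r
library(data.table)
DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
keycol <- c("x", "y")
# 1 = ascending, -1 = descending, matched positionally to keycol
setorderv(DT, keycol, order = c(1, -1))
```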

merge columns that have the same name r

I am working in R with a dataset that is created from mongodb with the use of mongolite.
I am getting a list that looks like so:
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4
I would like to merge the dataset to look like this:
_id A B
1 a 1
2 k 4
1 b 2
2 l 3
1 e 5
2 c 3
1 NA NA
2 d 4
The NAs in the last columns are there because the columns are named from the first entry; if a later entry has more columns than that, the extra columns don't get names assigned to them (if I get help for this as well it would be awesome, but it's not the reason I am here).
Also the number of columns might differ for different subsets of the dataset.
I have tried melt(), but since it is a list and not a data frame it doesn't work as expected. I have tried stack(), but it didn't work because the columns have the same name and some of them don't even have a name.
I know this is a very weird situation and appreciate any help.
Thank you.
Using library(magrittr) (and data.table for fread):
data:
df <- fread("
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4 ",header=T)
setDF(df)
Code:
df2 <- df[, -1]
odds <- df2 %>% ncol %>% {(1:.) %% 2} %>% as.logical
even <- df2 %>% ncol %>% {!(1:.) %% 2}
cbind(df[, 1, drop = FALSE],
      A = unlist(df2[, odds]),
      B = unlist(df2[, even]),
      row.names = NULL)
result:
# _id A B
# 1 1 a 1
# 2 2 k 4
# 3 1 b 2
# 4 2 l 3
# 5 1 e 5
# 6 2 c 3
# 7 1 <NA> NA
# 8 2 d 4
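The same odds/even idea works without magrittr too. A self-contained sketch in base R, where the duplicated headers are replaced by stand-in names A1/B1 etc. (my own naming, since duplicated column names are awkward to construct directly):

```r
# build the example frame with stand-in names for the duplicated A/B headers
df <- data.frame(`_id` = 1:2,
                 A1 = c("a", "k"), B1 = c(1, 4),
                 A2 = c("b", "l"), B2 = c(2, 3),
                 A3 = c("e", "c"), B3 = c(5, 3),
                 A4 = c(NA, "d"), B4 = c(NA, 4),
                 check.names = FALSE)
df2 <- df[, -1]
odds <- seq_along(df2) %% 2 == 1  # TRUE for the A columns, FALSE for the B columns
res <- data.frame(`_id` = rep(df$`_id`, sum(odds)),  # one id block per A/B pair
                  A = unlist(df2[, odds], use.names = FALSE),
                  B = unlist(df2[, !odds], use.names = FALSE),
                  check.names = FALSE)
```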
We can use data.table. I assume A and B always follow each other; I created an example with two extra sets of NAs in the header. With grep we can find the columns fread has auto-named V8 etc., and thanks to R's recycling of vectors we can rename several headers in one go. If in your case these are named differently, change the pattern in the grep command. Then we reshape the data via melt().
library(data.table)
df <- fread("
_id A B A B A B NA NA NA NA
1 a 1 b 2 e 5 NA NA NA NA
2 k 4 l 3 c 3 d 4 e 5",
header = TRUE)
df
_id A B A B A B A B A B
1: 1 a 1 b 2 e 5 <NA> NA <NA> NA
2: 2 k 4 l 3 c 3 d 4 e 5
# assuming A B are always following each other. Can be done in 1 statement.
cols <- names(df)
cols[grep(pattern = "^V", x = cols)] <- c("A", "B")
names(df) <- cols
# melt the data (if df is a data.frame, replace df below with setDT(df))
df_melted <- melt(df, id.vars = 1,
measure.vars = patterns(c('A', 'B')),
value.name=c('A', 'B'))
df_melted
_id variable A B
1: 1 1 a 1
2: 2 1 k 4
3: 1 2 b 2
4: 2 2 l 3
5: 1 3 e 5
6: 2 3 c 3
7: 1 4 <NA> NA
8: 2 4 d 4
9: 1 5 <NA> NA
10: 2 5 e 5
Thank you for your help, both answers were great inspirations.
@Andre Elrico's solution worked better on the reproducible example, while @phiver's worked better on my overall problem.
By combining both I came up with the following.
library(data.table)
#The data were in a list of lists called list for this example
temp <- as.data.table(matrix(t(sapply(list, '[', seq(max(sapply(list, length))))),
                             nrow = m))
# m here is the number of lists in list
cols <- names(temp)
cols[grep(pattern = "^V", x = cols)] <- c("B", "A")
#They need to be the opposite way because the first column is going to be substituted with id, and this way they fall on the correct column after that
cols[1] <- "id"
names(temp) <- cols
l <- melt.data.table(temp, id.vars = 1,
measure.vars = patterns(c("A", "B")),
value.name = c("A", "B"))
That way I can use this also if I have more than 2 columns that I need to manipulate like that.
