variable usage in data.table [duplicate] - r

This must be very basic, but I can't figure out how to use an ordinary variable that has the same name as a data.table column. I could rename the variable to avoid the conflict, but I'm wondering whether there's a way to evaluate the variable before DT resolves the name as a column.
> DT = data.table(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c=13:18)
> DT
ID a b c
1: b 1 7 13
2: b 2 8 14
3: b 3 9 15
4: a 4 10 16
5: a 5 11 17
6: c 6 12 18
> DT[b == 7]
ID a b c
1: b 1 7 13
> b <- 7
> DT[b == b]
ID a b c
1: b 1 7 13
2: b 2 8 14
3: b 3 9 15
4: a 4 10 16
5: a 5 11 17
6: c 6 12 18

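Since there are two objects named b, a column inside DT and a variable outside it, the column takes precedence inside DT[...], which is why b == b is TRUE for every row. The outer value b <- 7 has to be fetched explicitly from the global environment. We can do that with get().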
DT[b == get("b", globalenv())]
# ID a b c
# 1: b 1 7 13
Update: You mention in the comments that the variables are inside a function environment. In that case, you can use parent.frame() instead of globalenv().
f <- function(b, dt) dt[b == get("b", parent.frame(3))]
f(7, DT)
# ID a b c
# 1: b 1 7 13
f(12, DT)
# ID a b c
# 1: c 6 12 18
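If counting frames with parent.frame() feels fragile, a simpler workaround is to copy the argument into a local variable whose name does not clash with any column. This is only a sketch: target is a made-up name, and it assumes dt has no column called target.

```r
library(data.table)

DT <- data.table(ID = c("b", "b", "b", "a", "a", "c"),
                 a = 1:6, b = 7:12, c = 13:18)

f <- function(b, dt) {
  # 'target' is a hypothetical local name; it must not be a column of dt
  target <- b
  # 'b' resolves to the column, 'target' to the local variable
  dt[b == target]
}

res7  <- f(7, DT)
res12 <- f(12, DT)
```

Here f(7, DT) should return the single row with b equal to 7, and f(12, DT) the row with ID "c", matching the output above.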

Related

Merge multiple numeric column as list typed column in data.table [R]

I'm trying to find a way to merge multiple numeric columns into a new list-type column.
Data Table
dt <- data.table(
a=c(1,2,3),
b=c(4,5,6),
c=c(7,8,9)
)
Expected Result
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Attempt 1
I tried building the list with dt[,d:=list(c(a,b,c))], but that concatenates the full columns instead of working row by row, which gives the incorrect result:
a b c d
1: 1 4 7 1,2,3,4,5,6,...
2: 2 5 8 1,2,3,4,5,6,...
3: 3 6 9 1,2,3,4,5,6,...
Do a group by row and place the elements in the list
dt[, d := .(list(unlist(.SD, recursive = FALSE))), 1:nrow(dt)]
Output:
dt
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Another option is paste and strsplit; note that strsplit returns character vectors, so the elements of d are strings rather than numbers:
dt[, d := strsplit(do.call(paste, c(.SD, sep=",")), ",")]
Or use transpose:
dt[, d := lapply(data.table::transpose(unname(.SD)), unlist)]
dt
a b c d
1: 1 4 7 1,4,7
2: 2 5 8 2,5,8
3: 3 6 9 3,6,9
Or use purrr::pmap:
dt[, d := purrr::pmap(.SD, ~c(...))]
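One thing to watch with the .SD-based variants: if they are run a second time, .SD also contains the new column d. A minimal sketch that restricts the input columns with .SDcols and keeps the values numeric is below. The helper name cols is made up for illustration, and the call relies on data.table::transpose accepting a plain list.

```r
library(data.table)

dt <- data.table(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# hypothetical helper: the columns to combine
cols <- c("a", "b", "c")

# transpose() turns the list of columns into a list of rows;
# wrapping it in .() assigns that list as one list column
dt[, d := .(transpose(as.list(.SD))), .SDcols = cols]
```

Each element of dt$d should then be a numeric vector holding one row's values, for example 1, 4, 7 for the first row.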

Creating a new data table for each row of an existing data table R while avoiding memory vector issue

Suppose I have two data tables:
library(data.table)
A=data.table(w=1:3,d=5:7)
B=data.table(K=2:4,m=9:11)
> A
w d
1: 1 5
2: 2 6
3: 3 7
> B
K m
1: 2 9
2: 3 10
3: 4 11
I want to do the following expansion, where I have a new B for each row of A:
C=A[,B[],by=names(A)]
w d K m
1: 1 5 2 9
2: 1 5 3 10
3: 1 5 4 11
4: 2 6 2 9
5: 2 6 3 10
6: 2 6 4 11
7: 3 7 2 9
8: 3 7 3 10
9: 3 7 4 11
However, when I do it with my real data, I get this error:
Error in `[.data.table`(A, , B[], by = names(A)) :
negative length vectors are not allowed
It turns out this is a memory error. However, I think there should be a way to do this without loops. Memory is not an issue on my server, which has up to 50 GB of RAM, and the resulting data table would certainly be smaller than that.
Does anyone know an efficient way to do this?
A hacky way to handle this might be to add an identical helper column to each table and then to allow cartesian joins:
library(data.table)
A = data.table(w = 1:3, d = 5:7)
B = data.table(K = 2:4, m = 9:11)
A[, j := 1]
B[, j := 1]
C = A[B, on = 'j', allow.cartesian = TRUE]
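A variant that avoids the helper column is to cross the row indices with CJ() and use them to subset both tables. This is only a sketch; idx, ia and ib are made-up names. It does not shrink the result, which still has nrow(A) * nrow(B) rows, so it will not help if that product is itself too large.

```r
library(data.table)

A <- data.table(w = 1:3, d = 5:7)
B <- data.table(K = 2:4, m = 9:11)

# every combination of a row index of A with a row index of B;
# CJ() sorts its result, so ia varies slowest
idx <- CJ(ia = seq_len(nrow(A)), ib = seq_len(nrow(B)))

# repeat the rows of A and B according to the index pairs
C <- data.table(A[idx$ia], B[idx$ib])
```

The columns come out in the order w, d, K, m with each row of A repeated once per row of B, as in the expected output in the question.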

Renaming multiple columns in R data.table

This is related to this question from Henrik
Assign multiple columns using := in data.table, by group
But what if I want to create a new data.table with given column names instead of assigning new columns to an existing one?
f <- function(x){list(head(x,2),tail(x,2))}
dt <- data.table(group=sample(c('a','b'),10,replace = TRUE),val=1:10)
> dt
group val
1: b 1
2: b 2
3: a 3
4: b 4
5: a 5
6: b 6
7: a 7
8: a 8
9: b 9
10: b 10
I want to get a new data.table with predefined column names by calling the function f:
dt[,c('head','tail')=f(val),by=group]
I wish to get this:
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
But it gives me an error. What I can do is create the table then change the column names, but that seems cumbersome:
> dt2 <- dt[,f(val),by=group]
> dt2
group V1 V2
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
> colnames(dt2)[-1] <- c('head','tail')
> dt2
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
Is this something I can do in one call?
From running your code as-is, this is the error I get:
dt[,c('head','tail')=f(val),by=group]
# Error: unexpected '=' in "dt[,c('head','tail')="
The problem is using = instead of := for assignment.
On to your problem of wanting a new data.table:
dt2 <- dt[, setNames(f(val), c('head', 'tail')), by = group]
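If changing f is not required, the names can also be given directly in j by building a named list. A small sketch with fixed data (the group vector below is hard-coded to match the printed table in the question, since sample() would give different groups on each run):

```r
library(data.table)

# same layout as the printed dt above, without the randomness of sample()
dt <- data.table(group = c("b", "b", "a", "b", "a", "b", "a", "a", "b", "b"),
                 val = 1:10)

# name the list elements in j; they become the column names of the result
dt2 <- dt[, .(head = head(val, 2), tail = tail(val, 2)), by = group]
```

Groups appear in order of first appearance, so the rows for b come before the rows for a.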

How to remove individuals with fewer than 5 observations from a data frame [duplicate]

To clarify the question I'll briefly describe the data.
Each row in the data.frame is an observation, and the columns represent variables pertinent to that observation including: what individual was observed, when it was observed, where it was observed, etc. I want to exclude/filter individuals for which there are fewer than 5 observations.
In other words, if there are fewer than 5 rows where individual = x, then I want to remove all rows that contain individual x and reassign the result to a new data.frame. I'm aware of some brute force techniques using something like names == unique(df$individualname) and then subsetting out those names individually and applying nrow to determine whether or not to exclude them...but there has to be a better way. Any help is appreciated, I'm still pretty new to R.
An example using group_by and filter from dplyr package:
library(dplyr)
df <- data.frame(id=c(rep("a", 2), rep("b", 5), rep("c", 8)),
foo=runif(15))
> df
id foo
1 a 0.8717067
2 a 0.9086262
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
df %>% group_by(id) %>% filter(n()>= 5) %>% ungroup()
Source: local data frame [13 x 2]
id foo
(fctr) (dbl)
1 b 0.9962453
2 b 0.8980123
3 b 0.1535324
4 b 0.2802848
5 b 0.9366375
6 c 0.8109557
7 c 0.6945285
8 c 0.1012925
9 c 0.6822955
10 c 0.3757085
11 c 0.7348635
12 c 0.3026395
13 c 0.9707223
or with base R:
> df[df$id %in% names(which(table(df$id)>=5)), ]
id foo
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
Still in base R, using with is a more elegant way to do the very same thing:
df[with(df, id %in% names(which(table(id)>=5))), ]
or:
subset(df, with(df, id %in% names(which(table(id)>=5))))
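Another base R option is ave(), which gives every row the size of its own group, so the filter becomes a plain logical index. A minimal sketch; the foo values here are placeholders rather than the random numbers shown above, and group_size and df_kept are made-up names.

```r
# same group layout as above: 2 rows for "a", 5 for "b", 8 for "c"
df <- data.frame(id = rep(c("a", "b", "c"), times = c(2, 5, 8)),
                 foo = seq_len(15))

# for each row, the number of rows sharing its id
group_size <- ave(seq_along(df$id), df$id, FUN = length)

# keep only rows whose group has at least 5 observations
df_kept <- df[group_size >= 5, ]
```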
Another way to do the same thing using the data.table package.
library(data.table)
set.seed(1)
dt <- data.table(id=sample(1:4,20,replace=TRUE),var=sample(1:100,20))
dt1<-dt[,count:=.N,by=id][(count>=5)]
dt2<-dt[,count:=.N,by=id][(count<5)]
dt1
id var count
1: 2 94 5
2: 2 22 5
3: 3 64 5
4: 4 13 6
5: 4 37 6
6: 4 2 6
7: 3 36 5
8: 3 81 5
9: 3 90 5
10: 2 17 5
11: 4 72 6
12: 2 57 5
13: 3 67 5
14: 4 9 6
15: 2 60 5
16: 4 34 6
dt2
id var count
1: 1 26 4
2: 1 31 4
3: 1 44 4
4: 1 54 4
It can also be done with data.table, using a logical condition with if after grouping by 'id':
library(data.table)
setDT(df)[, if(.N >=5) .SD, id]
# id foo
# 1: b 0.9962453
# 2: b 0.8980123
# 3: b 0.1535324
# 4: b 0.2802848
# 5: b 0.9366375
# 6: c 0.8109557
# 7: c 0.6945285
# 8: c 0.1012925
# 9: c 0.6822955
#10: c 0.3757085
#11: c 0.7348635
#12: c 0.3026395
#13: c 0.9707223
data
df <- structure(list(id = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "c"), foo = c(0.8717067, 0.9086262,
0.9962453, 0.8980123, 0.1535324, 0.2802848, 0.9366375, 0.8109557,
0.6945285, 0.1012925, 0.6822955, 0.3757085, 0.7348635, 0.3026395,
0.9707223)), .Names = c("id", "foo"), class = "data.frame",
row.names = c(NA, -15L))
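A related data.table idiom collects the row numbers to keep with .I and then subsets once, which leaves the column order of the table unchanged. This is a sketch on placeholder data; keep_rows and dt_kept are made-up names.

```r
library(data.table)

# same group layout as df: 2 rows for "a", 5 for "b", 8 for "c"
dt <- data.table(id = rep(c("a", "b", "c"), times = c(2, 5, 8)),
                 foo = seq_len(15))

# .I holds the row numbers of each group; groups with fewer than 5 rows
# contribute no row numbers
keep_rows <- dt[, .I[.N >= 5], by = id]$V1

dt_kept <- dt[keep_rows]
```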
You can also use table. Take the data.frame mtcars, for instance:
table(mtcars$cyl)
cyl takes three values: 4, 6 and 8. There are 7 cars with 6 cylinders, so if you want to exclude groups with fewer than 10 observations, the 6-cylinder cars are dropped like this:
mtcars[!mtcars$cyl %in% names(table(mtcars$cyl)[table(mtcars$cyl) < 10]), ]
This excludes the observations using only %in%, names and table.

Eliminate all duplicates according to a key + keep records from a table which are not in another table

Question 1:
I am relatively new to R and I have two distinct questions.
I need to eliminate duplicates according to a key (single or multiple), but I need to drop every row of a duplicated key, so unique won't do it. I also found the function duplicated, but it marks rows as TRUE only from the second occurrence onward, whereas I need to eliminate all of them.
> DT <- data.table(Key=c("a","a","a","b","c"),var=c(1:5))
> DT
Key var
1: a 1
2: a 2
3: a 3
4: b 4
5: c 5
> unique(DT)
Key var
1: a 1
2: b 4
3: c 5
> duplicated(DT)
[1] FALSE TRUE TRUE FALSE FALSE
what I want instead is
Key var
1: b 4
2: c 5
Question 2:
I have 2 data tables and I want to keep only the records from DTFrom for which the combination of values of the 2 (or more) keys is not in DTFilter (I found similar questions for SQL but not for R):
> DTFrom
key1 key2 var
1: q m 1
2: q n 2
3: q b 3
4: w n 4
5: e m 5
6: e n 6
7: e b 7
8: r n 8
9: r b 9
10: t m 10
11: t n 11
12: t b 12
13: t v 13
> DTFilter
key1 key2 var
1: q m 1
2: w n 4
3: e b 7
4: e n 6
5: r n 8
6: r b 9
7: t m 10
8: t v 13
and I want the result to be:
> DTOut
key1 key2 var
1: q n 2
2: q b 3
3: e m 5
4: t n 11
5: t b 12
Thanks in advance!
For the first question, you can use the fromLast argument in duplicated:
DT[ !(duplicated(Key) | duplicated(Key, fromLast = TRUE))]
# Key var
#1: b 4
#2: c 5
For the second question, you can do:
setkey(DTFrom, key1, key2)
DTFrom[!DTFilter]
# key1 key2 var
#1: e m 5
#2: q b 3
#3: q n 2
#4: t b 12
#5: t n 11
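The same anti-join can be written without setting a key by passing the join columns through on. A sketch that rebuilds the two tables from the question (DTFilter is taken as a subset of DTFrom's rows purely for brevity):

```r
library(data.table)

DTFrom <- data.table(
  key1 = c("q", "q", "q", "w", "e", "e", "e", "r", "r", "t", "t", "t", "t"),
  key2 = c("m", "n", "b", "n", "m", "n", "b", "n", "b", "m", "n", "b", "v"),
  var  = 1:13
)
# the eight rows listed as DTFilter in the question
DTFilter <- DTFrom[c(1, 4, 7, 6, 8, 9, 10, 13)]

# keep rows of DTFrom whose (key1, key2) pair has no match in DTFilter
DTOut <- DTFrom[!DTFilter, on = c("key1", "key2")]
```

Because no key is set, DTFrom is not reordered, so the rows come back in their original order as in the DTOut shown in the question.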
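For the first question, you can also use the table function for additional filtering. Note that this relies on the keys first appearing in DT in the same sorted order that table() uses for its names: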
DT[!duplicated(Key)][table(DT$Key) == 1,]
# Key var
# 1: b 4
# 2: c 5
For the second question, the dplyr package has an anti_join function specifically for this case:
require("dplyr")
anti_join(DTFrom, DTFilter, by = c("key1", "key2"))
# key1 key2 var
# 1 e m 5
# 2 q b 3
# 3 q n 2
# 4 t n 11
# 5 t b 12
My preferred method for the first question, assuming DT has a key set (for example with setkey(DT, Key)), is:
DT[.(DT[ , .N, by=key(DT)][N==1L, !"N"])]
Similarly we can do:
DT[.(DT[, .N, by=key(DT)][N==1L, ]), .SD]
@docendodiscimus's answer for Q2 is also mine.
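The counting idea for Q1 can also be sketched without a key and without depending on row order, by grouping on Key and returning a group only when it has exactly one row:

```r
library(data.table)

DT <- data.table(Key = c("a", "a", "a", "b", "c"), var = 1:5)

# groups with more than one row return NULL and are dropped
res <- DT[, if (.N == 1L) .SD, by = Key]
```

For several key columns, by can take a character vector of column names instead of a single name.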
