I was playing around with data.table and I came across a distinction that I'm not sure I quite understand. Given the following dataset:
library(data.table)
set.seed(400)
DT <- data.table(x = sample(LETTERS[1:5], 20, TRUE), key = "x"); DT
Can you please explain to me the difference between the following expressions?
1) DT[J("E"), .I]
2) DT[ , .I[x == "E"] ]
3) DT[x == "E", .I]
set.seed(400)
library(data.table)
DT <- data.table(x = sample(LETTERS[1:5], 20, TRUE), key = "x"); DT
1)
DT[ , .I[x == "E"] ] # [1] 18 19 20
returns an integer vector in which .I gives the row numbers of the "E" rows in the ORIGINAL dataset DT
2)
DT[J("E") , .I] # [1] 1 2 3
DT["E" , .I] # [1] 1 2 3
DT[x == "E", .I] # [1] 1 2 3
are all equivalent, each producing a vector in which .I gives the row numbers of the "E" rows in the NEW subsetted data
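As an aside, if what you want are the original row numbers of a keyed subset, [.data.table's which argument returns the row locations in DT rather than the subset itself; a minimal sketch on the same data:
DT[J("E"), which = TRUE] # [1] 18 19 20 -- matches DT[, .I[x == "E"]]
DT["E", which = TRUE]    # same result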
I'm new to data.table and don't understand how I can modify by reference while performing an operation on chosen columns using the .SD symbol. I have two examples.
Example 1
> DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
> DT
group1:1 group1:2 group2:1
1: 1 1 1
Let's say, for example, I simply want to keep only the columns that contain "group1:" in the name. I know it's pretty straightforward to just reassign the result of the operation to the same object, like so:
cols1 <- names(DT)[grep("group1:", names(DT))]
DT <- DT[, .SD, .SDcols = cols1]
From reading the data.table vignette on reference semantics, my understanding is that the above does not modify by reference, whereas a similar operation using := would do so. Is this accurate? If so, is there a better way to do this operation that does modify by reference? In trying to figure this out, I got stuck on how to combine the .SD symbol and the := operator. I tried
DT[, c(cols1) := .SD, .SDcols = cols1]
DT[, c(cols1) := lapply(.SD,function(x)x), .SDcols = cols1]
neither of which gave the result I wanted.
Example 2
Say I want to perform a different operation, dcast, that uses .SD as input. Example data table:
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> DT
x y z
1: 1 A 5
2: 2 A 6
3: 1 B 7
4: 2 B 8
Again, I know I can just reassign like so:
> DT <- dcast(DT, x ~ y, value.var = "z")
> DT
x A B
1: 1 5 7
2: 2 6 8
But I don't understand why the following does not work (or whether it would be preferable in some circumstances):
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> cols <- c("x", unique(DT$y))
> DT[, cols := dcast(.SD, x ~ y, value.var = "z")]
In your example,
cols1 <- names(DT)[grep("group1:", names(DT))]
DT[, c(cols1) := .SD, .SDcols = cols1] # not this
DT[, (cols1) := .SD, .SDcols = cols1] # this will work
Below is another example, which fills NA values in the numeric columns (selected via .SDcols) with 0, by reference.
The trick is to put the vector of column names, in parentheses, on the left-hand side of :=.
colnames = DT[, names(.SD), .SDcols = is.numeric] # vector of numeric column names
DT[, (colnames) := lapply(.SD, nafill, fill = 0), .SDcols = is.numeric]
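A minimal sketch for verifying reference semantics, using data.table::address() (the memory address stays the same under := but changes when you reassign with <-); the data are Example 1's:
library(data.table)
DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
cols1 <- grep("group1:", names(DT), value = TRUE)
address(DT)                                              # note the address
DT[, (cols1) := lapply(.SD, identity), .SDcols = cols1]
address(DT)                                              # unchanged: modified by reference
DT <- DT[, .SD, .SDcols = cols1]
address(DT)                                              # changed: a new object was bound to DT
As for Example 2, dcast builds a table with different dimensions, and := can only add or replace columns of the existing table, so a reshape cannot happen by reference; reassignment is the idiomatic route there.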
I'd like to subset rows where x1 and x2 == 9. My real set has over 200 columns where the column name starts with the same string. The dummy code below creates a smaller sample of the data. I'd like to do this ideally with the R data.table package if possible.
df <- data.frame('id'=c(1,2,3), 'x1'=c(9,9,4), 'x2'=c(9,9,4))
head(df)
# does not work, but thought perhaps I could have defined the columns via a paste and then subset where columns were equal to 9.
df[which(paste0("x", 1:2)==9), ]
Update: sorry if I wasn't clear. I am aware of simply adding a filter for x1 and x2. The issue is that the real data consists of over 200 columns: x1:x200. I am in search of a cleaner solution than what is proposed below.
If you want an efficient base R solution I would simply use rowSums, e.g.
cols <- paste0("x", 1:2)
df[rowSums(df[cols] == 9) == length(cols), ]
# id x1 x2
# 1 1 9 9
# 2 2 9 9
If you want a data.table solution, I would use a binary join, e.g.
library(data.table)
setDT(df)[as.list(rep(9, length(cols))), on = cols]
# id x1 x2
# 1: 1 9 9
# 2: 2 9 9
Data
df <- data.frame(id = 1:3, x1 = c(9, 9, 4), x2 = c(9, 9, 4))
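For the real data the same binary join scales unchanged; a sketch, assuming the columns really are named x1 through x200:
cols <- paste0("x", 1:200)
setDT(df)[as.list(rep(9, length(cols))), on = cols]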
Something like this, perhaps?
df[apply(df[, paste0("x", 1:200)] == 9, 1, all), ]
Using melt can save you from having to write out every column (for your >2-column case):
> aTbl = as.data.table(df)
> aTbl[, all9sP := F]
> aTbl[, .SD
][, !'all9sP'
][, melt(.SD, id.vars=c('id'))
][, NVars := uniqueN(variable)
][value == 9
][, .(N9s=.N), .(id, NVars)
][, all9sP := N9s == NVars
][, aTbl[.SD, all9sP := i.all9sP, on=.(id)]
][all9sP == T
][, all9sP := NULL
][, .SD
]
id x1 x2
1: 1 9 9
2: 2 9 9
Try:
df[df$x1 == 9 & df$x2 == 9,]
EDIT (misunderstood, now it should do the trick):
for (i in 2:201) { df <- df[df[, i] == 9, ] } # with id in column 1, x1:x200 occupy columns 2:201
You could also use grep with apply
# Select all columns whose names contain "x"
col.names <- grep("x", colnames(df), value = TRUE)
# Select rows where every selected column equals 9
sel <- apply(df[, col.names], 1, function(row) all(row == 9))
df[sel,]
And the output
id x1 x2
1 1 9 9
2 2 9 9
Solution using data.table
Create dataset
ncols <- 5
cnms <- paste0("x", 1:ncols)
X <- data.table(ID = 1:1e6)
X[, (cnms) := NA_integer_]
X[, (cnms) := lapply(X = 1:ncols, sample, size = .N, x = 1:10)]
Find rows where sum equals 9
X1 <- X[, s := rowSums(.SD), .SDcols = cnms][s == 9, ][, s := NULL][]
X1
Find rows where all columns are equal to 9
X[, s := NULL]
ind <- rowSums(X[, lapply(.SD, is.element, set = 9), .SDcols = cnms])
X2 <- X[ind == length(cnms)][]
X2
Edit
This is actually a lot faster:
X[, s := NULL]
ind <- rowSums(X[, .SD , .SDcols = cnms] == 9)
X2 <- X[ind == length(cnms)][]
X2
Edit2
See the answer from https://stackoverflow.com/users/3001626/david-arenburg; it is a lot faster.
In the tidyverse, try rowwise and use filter as usual
df %>%
rowwise() %>%
filter(x1 %in% 9 & x2 %in% 9 )
Source: local data frame [2 x 3]
Groups: <by row>
# A tibble: 2 x 3
id x1 x2
<dbl> <dbl> <dbl>
1 1 9 9
2 2 9 9
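For the 200-column case, a sketch using dplyr's if_all() (available in dplyr >= 1.0.4), which drops the rowwise() step and avoids naming each column; it assumes every column whose name starts with "x" should be tested:
library(dplyr)
df %>%
  filter(if_all(starts_with("x"), ~ .x == 9))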
I need to take column sums over a large range of select columns. For example:
library(data.table)
set.seed(123)
DT = data.table(grp = c("A", "B", "C"),
x1 = sample(1:10, 3),
x2 = sample(1:10, 3),
x3 = sample(1:10, 3),
x4 = sample(1:10, 3))
> DT
grp x1 x2 x3 x4
1: A 3 9 6 5
2: B 8 10 9 9
3: C 4 1 5 4
Say, I want to sum over x2 and x3. I would normally do this using:
> DT[, .(total = sum(x2, x3)), by=grp]
grp total
1: A 15
2: B 19
3: C 6
However, if the range of columns is very large, say 100, how can this be coded elegantly, without spelling out each column by name?
What I tried (and what didn't work):
my_cols <- paste0("x", 2:3)
DT[, .(total = sum(get(my_cols))), by=grp]
grp total
1: A 9
2: B 10
3: C 1
Appears to use only the first column (x2) and disregard the rest.
I didn't find an exact dupe (one that deals with summing by row and by group), so here are 5 different possibilities I could think of.
The main thing to remember here is that within each group you are working with a data.table; hence, some functions won't work without unlist.
## Create an example data
library(data.table)
set.seed(123)
DT <- data.table(grp = c("A", "B", "C"),
matrix(sample(1:10, 30 * 4, replace = TRUE), ncol = 4))
my_cols <- paste0("V", 2:3)
## 1- This won't work with `NA`s. It will work without `unlist`,
## but won't return correct results.
DT[, Reduce(`+`, unlist(.SD)), .SDcols = my_cols, by = grp]
## 2 - Convert to long format first and then aggregate
melt(DT, "grp", measure = my_cols)[, sum(value), by = grp]
## 3 - Using `base::sum` which can handle data.frames,
## see `?S4groupGeneric` (a data.table is also a data.frame)
DT[, base::sum(.SD), .SDcols = my_cols, by = grp]
## 4 - This will use data.tables enhanced `gsum` function,
## but it can't handle data.frames/data.tables
## Hence, requires unlist first. Will be interesting to measure the tradeoff
DT[, sum(unlist(.SD)), .SDcols = my_cols, by = grp]
## 5 - This is a modification to your original attempt that both handles multiple columns
## (`mget` instead of `get`) and adds `unlist`
## (no point trying with `base::sum` instead, because it will also require `unlist`)
DT[, sum(unlist(mget(my_cols))), by = grp]
All of these will return the same result
# grp V1
# 1: A 115
# 2: B 105
# 3: C 96
Some benchmarks
library(data.table)
library(microbenchmark)
library(stringi)
set.seed(123)
N <- 1e5
cols <- 50
DT <- data.table(grp = stri_rand_strings(N / 1e4, 2),
matrix(sample(1:10, N * cols, replace = TRUE),
ncol = cols))
my_cols <- paste0("V", 1:20)
mbench <- microbenchmark(
"Reduce/unlist: " = DT[, Reduce(`+`, unlist(.SD)), .SDcols = my_cols, by = grp],
"melt: " = melt(DT, "grp", measure = my_cols)[, sum(value), by = grp],
"base::sum: " = DT[, base::sum(.SD), .SDcols = my_cols, by = grp],
"gsum/unlist: " = DT[, sum(unlist(.SD)), .SDcols = my_cols, by = grp],
"gsum/mget/unlist: " = DT[, sum(unlist(mget(my_cols))), by = grp]
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# Reduce/unlist: 1968.93628 2185.45706 2332.66770 2301.10293 2440.43138 3161.15522 100 c
# melt: 33.91844 58.18254 66.70419 64.52190 74.29494 132.62978 100 a
# base::sum: 18.00297 22.44860 27.21083 25.14174 29.20080 77.62018 100 a
# gsum/unlist: 780.53878 852.16508 929.65818 894.73892 968.28680 1430.91928 100 b
# gsum/mget/unlist: 797.99854 876.09773 963.70562 928.27375 1003.04632 1578.76408 100 b
library(ggplot2)
autoplot(mbench)
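One more possibility, not included in the benchmarks above: rowSums is vectorised across rows, so summing its result per group also avoids unlist; a sketch:
DT[, .(total = sum(rowSums(.SD))), .SDcols = my_cols, by = grp]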
I am trying to perform a character operation (paste) on a column of one data.table using data from a second data.table.
Since I am also performing other unrelated merge operations before and after this particular code, the row order might change, so I currently set the order both before and after this manipulation.
DT1 <- data.table(ID = c("a", "b", "c"), N = c(4,1,3)) # N used
DT2 <- data.table(ID = c("b","a","c"), N = c(10,10, 15)) # N total
# without merge
DT1 <- DT1[order(ID)]
DT2 <- DT2[order(ID)]
DT1[, N := paste0(N, "/", DT2$N)]
DT1
# ID N
# 1: a 4/10
# 2: b 1/10
# 3: c 3/15
I know a merge of the two DTs (by definition) would take care of the matching, but this creates extra columns that I need to remove afterwards.
# using merge
DT1 <- merge(DT1, DT2, by = "ID")
DT1[, N := paste0(N.x, "/", N.y)]
DT1[, c("N.x", "N.y") := list(NULL, NULL)]
DT1
# ID N
# 1: a 4/10
# 2: b 1/10
# 3: c 3/15
Is there a more intelligent way of doing this using data.table?
We can use an update join after converting the 'N' column to character (see the data section below); i.N refers to the N column of DT2, the table passed as i.
DT1[DT2, N := paste0(N, "/", i.N), on = .(ID)]
DT1
# ID N
#1: a 4/10
#2: b 1/10
#3: c 3/15
data
DT1 <- data.table(ID = c("a", "b", "c"), N = c(4,1,3))
DT2 <- data.table(ID = c("b","a","c"), N = c(10,10, 15)) # N total
DT1[, N := as.character(N)]
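A quick sketch confirming that the update join removes the need for the pre-ordering in the question (matching is on ID, not on row position); DT2 is shuffled here purely for the demonstration:
DT1b <- data.table(ID = c("a", "b", "c"), N = as.character(c(4, 1, 3)))
DT2b <- DT2[sample(.N)] # shuffle DT2's rows
DT1b[DT2b, N := paste0(N, "/", i.N), on = .(ID)]
DT1b # same result, regardless of DT2's row order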
I want to select the columns in DT1 that match the pattern flux, then keep only the rows whose values appear in a predefined vector vec1.
Sample Data
library(data.table)
DT1 <- structure(list(flux_1 = c(1, 6, 2, 9, 5),
FileName = c("prac_1", "prac_2", "prac_3", "prac_4", "prac_5")),
.Names = c("flux_1", "FileName"),
class = c("data.table", "data.frame"),
row.names = c(NA, -5L))
DT1
flux_1 FileName
1: 1 prac_1
2: 6 prac_2
3: 2 prac_3
4: 9 prac_4
5: 5 prac_5
vec1 <- c(6, 2)
The following code works but I need to explicitly specify flux_1.
DT1[ flux_1 %in% vec1]
flux_1 FileName
1: 6 prac_2
2: 2 prac_3
I was thinking about something like this but it didn't work
DT1[, .SD, .SDcols = names(DT1) %like% "flux"] %>%
.[. %in% vec1]
Empty data.table (0 rows) of 1 col: flux_1
Any suggestion is appreciated! Thank you!
We can use get to return the values of the column found by grep
DT1[get(grep('flux', names(DT1), value = TRUE)) %in% vec1 ]
# flux_1 FileName
#1: 6 prac_2
#2: 2 prac_3
Or, if we use the .SDcols route, extract the .SD column as a vector, do the comparison, and subset the dataset
DT1[DT1[, .SD[[1]] %in% vec1, .SDcols = grep('flux', names(DT1))]]
Similar option can be used with %like%
DT1[DT1[, .SD[[1]] %in% vec1, .SDcols = names(DT1) %like% "flux"]]
Regarding the OP's approach
DT1[, .SD, .SDcols = names(DT1) %like% "flux"]
# flux_1
#1: 1
#2: 6
#3: 2
#4: 9
#5: 5
returns a data.table with a single column. Continuing the chain, we need to extract the 'flux_1' column
DT1[, .SD, .SDcols = names(DT1) %like% "flux"] %>%
.[[1]] %in% vec1 %>%
magrittr::extract(DT1, .)
# flux_1 FileName
#1: 6 prac_2
#2: 2 prac_3
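If the real data had several flux columns and the goal were rows where every one of them is in vec1, a sketch combining .SDcols with Reduce (swap `&` for `|` to keep rows where any column matches):
flux_cols <- grep("flux", names(DT1), value = TRUE)
DT1[DT1[, Reduce(`&`, lapply(.SD, `%in%`, vec1)), .SDcols = flux_cols]]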