I was playing around with data.table and I came across a distinction that I'm not sure I quite understand. Given the following dataset:
library(data.table)
set.seed(400)
DT <- data.table(x = sample(LETTERS[1:5], 20, TRUE), key = "x"); DT
Can you please explain to me the difference between the following expressions?
1) DT[J("E"), .I]
2) DT[ , .I[x == "E"] ]
3) DT[x == "E", .I]
set.seed(400)
library(data.table)
DT <- data.table(x = sample(LETTERS[1:5], 20, TRUE), key = "x"); DT
1)
DT[ , .I[x == "E"] ] # [1] 18 19 20
returns an integer vector in which .I gives the row numbers of the "E" rows in the ORIGINAL dataset DT
2)
DT[J("E") , .I] # [1] 1 2 3
DT["E" , .I] # [1] 1 2 3
DT[x == "E", .I] # [1] 1 2 3
are all equivalent, each producing a vector in which .I gives the row numbers of the "E" rows in the NEW subsetted data
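As an aside, if what you want are the original row numbers of a keyed subset, [.data.table's which argument returns the row locations in DT rather than the subset itself; a minimal sketch on the same data:
DT[J("E"), which = TRUE] # [1] 18 19 20 -- matches DT[, .I[x == "E"]]
DT["E", which = TRUE]    # same result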
I'm new to data.table and don't understand how I can modify by reference while performing an operation on chosen columns using the .SD symbol. I have two examples.
Example 1
> DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
> DT
group1:1 group1:2 group2:1
1: 1 1 1
Let's say, for example, I simply want to keep only the columns that contain "group1:" in the name. I know it's pretty straightforward to just reassign the result of the operation to the same object, like so:
cols1 <- names(DT)[grep("group1:", names(DT))]
DT <- DT[, .SD, .SDcols = cols1]
From reading the data.table vignette on reference semantics, my understanding is that the above does not modify by reference, whereas a similar operation using := would do so. Is this accurate? If so, is there a better way to do this operation that does modify by reference? In trying to figure this out, I got stuck on how to combine the .SD symbol and the := operator. I tried
DT[, c(cols1) := .SD, .SDcols = cols1]
DT[, c(cols1) := lapply(.SD,function(x)x), .SDcols = cols1]
neither of which gave the result I wanted.
Example 2
Say I want to perform a different operation, dcast, that uses .SD as input. Example data table:
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> DT
x y z
1: 1 A 5
2: 2 A 6
3: 1 B 7
4: 2 B 8
Again, I know I can just reassign like so:
> DT <- dcast(DT, x ~ y, value.var = "z")
> DT
x A B
1: 1 5 7
2: 2 6 8
But I don't understand why the following does not work (or whether it would be preferable in some circumstances):
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> cols <- c("x", unique(DT$y))
> DT[, cols := dcast(.SD, x ~ y, value.var = "z")]
In your example,
cols1 <- names(DT)[grep("group1:", names(DT))]
DT[, c(cols1) := .SD, .SDcols = cols1] # not this
DT[, (cols1) := .SD, .SDcols = cols1] # this will work
Below is another example, which fills NA values in the numeric columns (selected via .SDcols) with 0, by reference.
The trick is to put the vector of column names, in parentheses, on the left-hand side of :=.
colnames = DT[, names(.SD), .SDcols = is.numeric] # vector of numeric column names
DT[, (colnames) := lapply(.SD, nafill, fill = 0), .SDcols = is.numeric]
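A minimal sketch for verifying reference semantics, using data.table::address() (the memory address stays the same under := but changes when you reassign with <-); the data are Example 1's:
library(data.table)
DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
cols1 <- grep("group1:", names(DT), value = TRUE)
address(DT)                                              # note the address
DT[, (cols1) := lapply(.SD, identity), .SDcols = cols1]
address(DT)                                              # unchanged: modified by reference
DT <- DT[, .SD, .SDcols = cols1]
address(DT)                                              # changed: a new object was bound to DT
As for Example 2, dcast builds a table with different dimensions, and := can only add or replace columns of the existing table, so a reshape cannot happen by reference; reassignment is the idiomatic route there.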
I'd like to subset rows where x1 and x2 == 9. My real set has over 200 columns where the column name starts with the same string. The dummy code below creates a smaller sample of the data. I'd like to do this ideally with the R data.table package if possible.
df <- data.frame('id'=c(1,2,3), 'x1'=c(9,9,4), 'x2'=c(9,9,4))
head(df)
# does not work, but thought perhaps I could have defined the columns via a paste and then subset where columns were equal to 9.
df[which(paste0("x", 1:2)==9), ]
Update: sorry if I wasn't clear. I am aware of simply adding a filter for x1 and x2. The issue is that the real data consists of over 200 columns: x1:x200. I am in search of a cleaner solution than what is proposed below.
If you want an efficient base R solution I would simply use rowSums, e.g.
cols <- paste0("x", 1:2)
df[rowSums(df[cols] == 9) == length(cols), ]
# id x1 x2
# 1 1 9 9
# 2 2 9 9
If you want a data.table solution, I would use a binary join, e.g.
library(data.table)
setDT(df)[as.list(rep(9, length(cols))), on = cols]
# id x1 x2
# 1: 1 9 9
# 2: 2 9 9
Data
df <- data.frame(id = 1:3, x1 = c(9, 9, 4), x2 = c(9, 9, 4))
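For the real data the same binary join scales unchanged; a sketch, assuming the columns really are named x1 through x200:
cols <- paste0("x", 1:200)
setDT(df)[as.list(rep(9, length(cols))), on = cols]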
Something like this, perhaps?
df[apply(df[, paste0("x", 1:200)] == 9, 1, all), ]
Using melt can save you from having to write out every column (for your >2-column case):
> aTbl = as.data.table(df)
> aTbl[, all9sP := F]
> aTbl[, .SD
][, !'all9sP'
][, melt(.SD, id.vars=c('id'))
][, NVars := uniqueN(variable)
][value == 9
][, .(N9s=.N), .(id, NVars)
][, all9sP := N9s == NVars
][, aTbl[.SD, all9sP := i.all9sP, on=.(id)]
][all9sP == T
][, all9sP := NULL
][, .SD
]
id x1 x2
1: 1 9 9
2: 2 9 9
Try:
df[df$x1 == 9 & df$x2 == 9,]
EDIT (misunderstood, now it should do the trick):
for (i in 2:201) { df <- df[df[, i] == 9, ] } # with id in column 1, x1:x200 occupy columns 2:201
You could also use grep with apply
# Select all columns whose names contain "x"
col.names <- grep("x", colnames(df), value = TRUE)
# Select rows where every selected column equals 9
sel <- apply(df[, col.names], 1, function(row) all(row == 9))
df[sel,]
And the output
id x1 x2
1 1 9 9
2 2 9 9
Solution using data.table
Create dataset
ncols <- 5
cnms <- paste0("x", 1:ncols)
X <- data.table(ID = 1:1e6)
X[, (cnms) := NA_integer_]
X[, (cnms) := lapply(X = 1:ncols, sample, size = .N, x = 1:10)]
Find rows where sum equals 9
X1 <- X[, s := rowSums(.SD), .SDcols = cnms][s == 9, ][, s := NULL][]
X1
Find rows where all columns are equal to 9
X[, s := NULL]
ind <- rowSums(X[, lapply(.SD, is.element, set = 9), .SDcols = cnms])
X2 <- X[ind == length(cnms)][]
X2
Edit
This is actually a lot faster:
X[, s := NULL]
ind <- rowSums(X[, .SD , .SDcols = cnms] == 9)
X2 <- X[ind == length(cnms)][]
X2
Edit2
See the answer from https://stackoverflow.com/users/3001626/david-arenburg; it is a lot faster.
In the tidyverse, try rowwise and use filter as usual
df %>%
rowwise() %>%
filter(x1 %in% 9 & x2 %in% 9 )
Source: local data frame [2 x 3]
Groups: <by row>
# A tibble: 2 x 3
id x1 x2
<dbl> <dbl> <dbl>
1 1 9 9
2 2 9 9
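For the 200-column case, a sketch using dplyr's if_all() (available in dplyr >= 1.0.4), which drops the rowwise() step and avoids naming each column; it assumes every column whose name starts with "x" should be tested:
library(dplyr)
df %>%
  filter(if_all(starts_with("x"), ~ .x == 9))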
I need to take column sums over a large range of select columns. For example:
library(data.table)
set.seed(123)
DT = data.table(grp = c("A", "B", "C"),
x1 = sample(1:10, 3),
x2 = sample(1:10, 3),
x3 = sample(1:10, 3),
x4 = sample(1:10, 3))
> DT
grp x1 x2 x3 x4
1: A 3 9 6 5
2: B 8 10 9 9
3: C 4 1 5 4
Say, I want to sum over x2 and x3. I would normally do this using:
> DT[, .(total = sum(x2, x3)), by=grp]
grp total
1: A 15
2: B 19
3: C 6
However, if the range of columns is very large, say 100, how can this be coded elegantly, without spelling out each column by name?
What I tried (and what didn't work):
my_cols <- paste0("x", 2:3)
DT[, .(total = sum(get(my_cols))), by=grp]
grp total
1: A 9
2: B 10
3: C 1
Appears to use only the first column (x2) and disregard the rest.
I didn't find an exact dupe (one that deals with summing by row and by group), so here are 5 different possibilities I could think of.
The main thing to remember here is that within each group you are working with a data.table; hence, some functions won't work without unlist.
## Create an example data
library(data.table)
set.seed(123)
DT <- data.table(grp = c("A", "B", "C"),
matrix(sample(1:10, 30 * 4, replace = TRUE), ncol = 4))
my_cols <- paste0("V", 2:3)
## 1- This won't work with `NA`s. It will work without `unlist`,
## but won't return correct results.
DT[, Reduce(`+`, unlist(.SD)), .SDcols = my_cols, by = grp]
## 2 - Convert to long format first and then aggregate
melt(DT, "grp", measure = my_cols)[, sum(value), by = grp]
## 3 - Using `base::sum` which can handle data.frames,
## see `?S4groupGeneric` (a data.table is also a data.frame)
DT[, base::sum(.SD), .SDcols = my_cols, by = grp]
## 4 - This will use data.tables enhanced `gsum` function,
## but it can't handle data.frames/data.tables
## Hence, requires unlist first. Will be interesting to measure the tradeoff
DT[, sum(unlist(.SD)), .SDcols = my_cols, by = grp]
## 5 - This is a modification to your original attempt that both handles multiple columns
## (`mget` instead of `get`) and adds `unlist`
## (no point trying with `base::sum` instead, because it will also require `unlist`)
DT[, sum(unlist(mget(my_cols))), by = grp]
All of these will return the same result
# grp V1
# 1: A 115
# 2: B 105
# 3: C 96
Some benchmarks
library(data.table)
library(microbenchmark)
library(stringi)
set.seed(123)
N <- 1e5
cols <- 50
DT <- data.table(grp = stri_rand_strings(N / 1e4, 2),
matrix(sample(1:10, N * cols, replace = TRUE),
ncol = cols))
my_cols <- paste0("V", 1:20)
mbench <- microbenchmark(
"Reduce/unlist: " = DT[, Reduce(`+`, unlist(.SD)), .SDcols = my_cols, by = grp],
"melt: " = melt(DT, "grp", measure = my_cols)[, sum(value), by = grp],
"base::sum: " = DT[, base::sum(.SD), .SDcols = my_cols, by = grp],
"gsum/unlist: " = DT[, sum(unlist(.SD)), .SDcols = my_cols, by = grp],
"gsum/mget/unlist: " = DT[, sum(unlist(mget(my_cols))), by = grp]
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# Reduce/unlist: 1968.93628 2185.45706 2332.66770 2301.10293 2440.43138 3161.15522 100 c
# melt: 33.91844 58.18254 66.70419 64.52190 74.29494 132.62978 100 a
# base::sum: 18.00297 22.44860 27.21083 25.14174 29.20080 77.62018 100 a
# gsum/unlist: 780.53878 852.16508 929.65818 894.73892 968.28680 1430.91928 100 b
# gsum/mget/unlist: 797.99854 876.09773 963.70562 928.27375 1003.04632 1578.76408 100 b
library(ggplot2)
autoplot(mbench)
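One more possibility, not included in the benchmarks above: rowSums is vectorised across rows, so summing its result per group also avoids unlist; a sketch:
DT[, .(total = sum(rowSums(.SD))), .SDcols = my_cols, by = grp]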
I am trying to perform a character operation (paste) on a column of one data.table using data from a second data.table.
Since I am also performing other unrelated merge operations before and after this particular code, the row order might change, so I currently set the order both before and after this manipulation.
DT1 <- data.table(ID = c("a", "b", "c"), N = c(4,1,3)) # N used
DT2 <- data.table(ID = c("b","a","c"), N = c(10,10, 15)) # N total
# without merge
DT1 <- DT1[order(ID)]
DT2 <- DT2[order(ID)]
DT1[, N := paste0(N, "/", DT2$N)]
DT1
# ID N
# 1: a 4/10
# 2: b 1/10
# 3: c 3/15
I know a merge of the two DTs (by definition) would take care of the matching, but this creates extra columns that I need to remove afterwards.
# using merge
DT1 <- merge(DT1, DT2, by = "ID")
DT1[, N := paste0(N.x, "/", N.y)]
DT1[, c("N.x", "N.y") := list(NULL, NULL)]
DT1
# ID N
# 1: a 4/10
# 2: b 1/10
# 3: c 3/15
Is there a more intelligent way of doing this using data.table?
We can use an update join after converting the 'N' column to character (see the data section below); i.N refers to the N column of DT2, the table passed as i.
DT1[DT2, N := paste0(N, "/", i.N), on = .(ID)]
DT1
# ID N
#1: a 4/10
#2: b 1/10
#3: c 3/15
data
DT1 <- data.table(ID = c("a", "b", "c"), N = c(4,1,3))
DT2 <- data.table(ID = c("b","a","c"), N = c(10,10, 15)) # N total
DT1[, N := as.character(N)]
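A quick sketch confirming that the update join removes the need for the pre-ordering in the question (matching is on ID, not on row position); DT2 is shuffled here purely for the demonstration:
DT1b <- data.table(ID = c("a", "b", "c"), N = as.character(c(4, 1, 3)))
DT2b <- DT2[sample(.N)] # shuffle DT2's rows
DT1b[DT2b, N := paste0(N, "/", i.N), on = .(ID)]
DT1b # same result, regardless of DT2's row order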
I want to select the columns in DT1 that match the pattern flux, then keep only the rows whose values appear in a predefined vector vec1.
Sample Data
library(data.table)
DT1 <- structure(list(flux_1 = c(1, 6, 2, 9, 5),
FileName = c("prac_1", "prac_2", "prac_3", "prac_4", "prac_5")),
.Names = c("flux_1", "FileName"),
class = c("data.table", "data.frame"),
row.names = c(NA, -5L))
DT1
flux_1 FileName
1: 1 prac_1
2: 6 prac_2
3: 2 prac_3
4: 9 prac_4
5: 5 prac_5
vec1 <- c(6, 2)
The following code works but I need to explicitly specify flux_1.
DT1[ flux_1 %in% vec1]
flux_1 FileName
1: 6 prac_2
2: 2 prac_3
I was thinking about something like this but it didn't work
DT1[, .SD, .SDcols = names(DT1) %like% "flux"] %>%
.[. %in% vec1]
Empty data.table (0 rows) of 1 col: flux_1
Any suggestion is appreciated! Thank you!
We can use get to return the values of the column found by grep
DT1[get(grep('flux', names(DT1), value = TRUE)) %in% vec1 ]
# flux_1 FileName
#1: 6 prac_2
#2: 2 prac_3
Or, if we use the .SDcols route, extract the .SD column as a vector, do the comparison, and subset the dataset
DT1[DT1[, .SD[[1]] %in% vec1, .SDcols = grep('flux', names(DT1))]]
Similar option can be used with %like%
DT1[DT1[, .SD[[1]] %in% vec1, .SDcols = names(DT1) %like% "flux"]]
Regarding the OP's approach
DT1[, .SD, .SDcols = names(DT1) %like% "flux"]
# flux_1
#1: 1
#2: 6
#3: 2
#4: 9
#5: 5
returns a data.table with a single column. Continuing the chain, we need to extract the 'flux_1' column
DT1[, .SD, .SDcols = names(DT1) %like% "flux"] %>%
.[[1]] %in% vec1 %>%
magrittr::extract(DT1, .)
# flux_1 FileName
#1: 6 prac_2
#2: 2 prac_3
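If the real data had several flux columns and the goal were rows where every one of them is in vec1, a sketch combining .SDcols with Reduce (swap `&` for `|` to keep rows where any column matches):
flux_cols <- grep("flux", names(DT1), value = TRUE)
DT1[DT1[, Reduce(`&`, lapply(.SD, `%in%`, vec1)), .SDcols = flux_cols]]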