I'd like to subset rows where both x1 and x2 equal 9. My real data set has over 200 columns whose names all start with the same string. The dummy code below creates a smaller sample of the data. I'd like to do this ideally with the R data.table package if possible.
df <- data.frame('id'=c(1,2,3), 'x1'=c(9,9,4), 'x2'=c(9,9,4))
head(df)
# does not work, but I thought perhaps I could define the columns via paste0 and then subset where those columns equal 9.
df[which(paste0("x", 1:2)==9), ]
Update: sorry if I wasn't clear. I am aware that I can simply add a filter for x1 and x2. The issue is that the real data consists of over 200 columns, x1:x200. I am in search of a cleaner solution than the explicit column-by-column filters proposed below.
If you want an efficient base R solution I would simply use rowSums, e.g.
cols <- paste0("x", 1:2)
df[rowSums(df[cols] == 9) == length(cols), ]
# id x1 x2
# 1 1 9 9
# 2 2 9 9
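Since the real data runs x1:x200, the same pattern scales by simply building the longer name vector (a sketch, assuming the columns really are named x1 through x200):
cols <- paste0("x", 1:200)
df[rowSums(df[cols] == 9) == length(cols), ]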
If you want a data.table solution, I would use a binary join, e.g.
library(data.table)
setDT(df)[as.list(rep(9, length(cols))), on = cols]
# id x1 x2
# 1: 1 9 9
# 2: 2 9 9
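Equivalently, the lookup table in i can carry explicit column names, which some may find easier to read (a sketch using base R's setNames):
setDT(df)[setNames(as.list(rep(9, length(cols))), cols), on = cols]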
Data
df <- data.frame(id = 1:3, x1 = c(9, 9, 4), x2 = c(9, 9, 4))
Something like this, perhaps?
df[apply(df[, paste0("x", 1:200)] == 9, 1, all), ]
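On the three-column toy data the same idea reads as follows (adjusting the range, since the sample set only has x1 and x2):
df[apply(df[, paste0("x", 1:2)] == 9, 1, all), ]
#   id x1 x2
# 1  1  9  9
# 2  2  9  9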
A melt can allow you to avoid writing out every column (for your more-than-two-column case):
> aTbl = as.data.table(df)
> aTbl[, all9sP := F]
> aTbl[, .SD
][, !'all9sP'
][, melt(.SD, id.vars=c('id'))
][, NVars := uniqueN(variable)
][value == 9
][, .(N9s=.N), .(id, NVars)
][, all9sP := N9s == NVars
][, aTbl[.SD, all9sP := i.all9sP, on=.(id)]
][all9sP == T
][, all9sP := NULL
][, .SD
]
id x1 x2
1: 1 9 9
2: 2 9 9
Try:
df[df$x1 == 9 & df$x2 == 9,]
EDIT (misunderstood, now it should do the trick):
for (i in 2:ncol(df)) {df = df[df[, i] == 9, ]}
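A loop-free base R variant of the same filter (a sketch, assuming the id column comes first and all remaining columns are x-columns):
df[Reduce(`&`, lapply(df[-1], `==`, 9)), ]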
You could also use grep with apply
# Select all columns whose names start with "x"
col.names <- grep("^x", colnames(df), value = TRUE)
# Select rows where every x-column equals 9
sel <- apply(df[, col.names], 1, function(row) all(row == 9))
df[sel,]
And the output
id x1 x2
1 1 9 9
2 2 9 9
Solution using data.table
Create dataset
ncols <- 5
cnms <- paste0("x", 1:ncols)
X <- data.table(ID = 1:1e6)
X[, (cnms) := NA_integer_]
X[, (cnms) := lapply(1:ncols, function(i) sample(1:10, size = .N, replace = TRUE))]
Find rows where sum equals 9
X1 <- X[, s := rowSums(.SD), .SDcols = cnms][s == 9, ][, s:= NULL][]
X1
Find rows where all columns are equal to 9
X[, s := NULL]
ind <- rowSums(X[, lapply(.SD, is.element, set = 9), .SDcols = cnms])
X2 <- X[ind == length(cnms)][]
X2
Edit
This is actually a lot faster:
X[, s := NULL]
ind <- rowSums(X[, .SD , .SDcols = cnms] == 9)
X2 <- X[ind == length(cnms)][]
X2
Edit2
See the answer from https://stackoverflow.com/users/3001626/david-arenburg; it is a lot faster.
In the tidyverse, try rowwise and use filter as usual
df %>%
rowwise() %>%
filter(x1 %in% 9 & x2 %in% 9)
Source: local data frame [2 x 3]
Groups: <by row>
# A tibble: 2 x 3
id x1 x2
<dbl> <dbl> <dbl>
1 1 9 9
2 2 9 9
Related
I have a data.table with which I'd like to perform the same operation on certain columns. The names of these columns are given in a character vector. In this particular example, I'd like to multiply all of these columns by -1.
Some toy data and a vector specifying relevant columns:
library(data.table)
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")
Right now I'm doing it this way, looping over the character vector:
for (col in 1:length(cols)) {
dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
}
Is there a way to do this directly without the for loop?
This seems to work:
dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]
The result is
a b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3
There are a few tricks here:
Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
.SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).
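A quick way to see why the parentheses matter (a minimal sketch on throwaway data):
tmp <- data.table(a = 1:3)
cols <- "a"
tmp[, cols := -1L]   # no parentheses: creates a new column literally named "cols"
tmp[, (cols) := -1L] # parentheses: assigns to the column named in cols, i.e. "a"
tmp
#     a cols
# 1: -1   -1
# 2: -1   -1
# 3: -1   -1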
EDIT: Here's another way that is probably faster, as @Arun mentioned:
for (j in cols) set(dt, j = j, value = -dt[[j]])
I would like to add an answer for the case where you also want to change the names of the columns. This comes in quite handy if you want to calculate the logarithm of multiple columns, which is often the case in empirical work.
cols <- c("a", "b")
out_cols = paste("log", cols, sep = ".")
dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols]
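On the toy table this leaves the originals untouched and appends the log columns (a quick check, assuming a fresh dt):
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")
out_cols = paste("log", cols, sep = ".")
dt[, c(out_cols) := lapply(.SD, log), .SDcols = cols]
names(dt)
# [1] "a"     "b"     "d"     "log.a" "log.b"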
UPDATE: The following is a neat way to do it without a for loop:
dt[,(cols):= - dt[,..cols]]
It is a neat way to write for easy code readability, but performance-wise it stays behind Frank's solutions, according to the microbenchmark result below:
library(microbenchmark)
mbm = microbenchmark(
base_solution = for (col in 1:length(cols)) {
dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
},
franks_solution1 = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
franks_solution2 = for (j in cols) set(dt, j = j, value = -dt[[j]]),
hannes_solution = dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols],
orhans_solution = for (j in cols) dt[,(j):= -1 * dt[, ..j]],
orhans_solution2 = dt[,(cols):= - dt[,..cols]],
times=1000
)
mbm
Unit: microseconds
expr min lq mean median uq max neval
base_solution 3874.048 4184.4070 5205.8782 4452.5090 5127.586 69641.789 1000
franks_solution1 313.846 349.1285 448.4770 379.8970 447.384 5654.149 1000
franks_solution2 1500.306 1667.6910 2041.6134 1774.3580 1961.229 9723.070 1000
hannes_solution 326.154 405.5385 561.8263 495.1795 576.000 12432.400 1000
orhans_solution 3747.690 4008.8175 5029.8333 4299.4840 4933.739 35025.202 1000
orhans_solution2 752.000 831.5900 1061.6974 897.6405 1026.872 9913.018 1000
My Previous Answer:
The following also works
for (j in cols)
dt[,(j):= -1 * dt[, ..j]]
None of the above solutions seems to work with calculation by group. The following is the best I got:
for(col in cols)
{
DT[, (col) := scale(.SD[[col]], center = TRUE, scale = TRUE), g]
}
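Depending on your data.table version, a loop-free variant with by may work as well (an untested sketch; as.vector() strips the one-column matrix that scale() returns):
DT[, (cols) := lapply(.SD, function(x) as.vector(scale(x))), by = g, .SDcols = cols]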
dplyr functions work on data.tables, so here's a dplyr solution that also "avoids the for-loop" :)
dt %>% mutate(across(all_of(cols), ~ -1 * .))
I benchmarked it using orhan's code (adding rows and columns), and you'll see that dplyr::mutate with across mostly executes faster than most of the other solutions, though slower than the data.table solution using lapply.
library(data.table); library(dplyr)
dt <- data.table(a = 1:100000, b = 1:100000, d = 1:100000) %>%
mutate(a2 = a, a3 = a, a4 = a, a5 = a, a6 = a)
cols <- c("a", "b", "a2", "a3", "a4", "a5", "a6")
dt %>% mutate(across(all_of(cols), ~ -1 * .))
#> a b d a2 a3 a4 a5 a6
#> 1: -1 -1 1 -1 -1 -1 -1 -1
#> 2: -2 -2 2 -2 -2 -2 -2 -2
#> 3: -3 -3 3 -3 -3 -3 -3 -3
#> 4: -4 -4 4 -4 -4 -4 -4 -4
#> 5: -5 -5 5 -5 -5 -5 -5 -5
#> ---
#> 99996: -99996 -99996 99996 -99996 -99996 -99996 -99996 -99996
#> 99997: -99997 -99997 99997 -99997 -99997 -99997 -99997 -99997
#> 99998: -99998 -99998 99998 -99998 -99998 -99998 -99998 -99998
#> 99999: -99999 -99999 99999 -99999 -99999 -99999 -99999 -99999
#> 100000: -100000 -100000 100000 -100000 -100000 -100000 -100000 -100000
library(microbenchmark)
mbm = microbenchmark(
base_with_forloop = for (col in 1:length(cols)) {
dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
},
franks_soln1_w_lapply = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
franks_soln2_w_forloop = for (j in cols) set(dt, j = j, value = -dt[[j]]),
orhans_soln_w_forloop = for (j in cols) dt[,(j):= -1 * dt[, ..j]],
orhans_soln2 = dt[,(cols):= - dt[,..cols]],
dplyr_soln = (dt %>% mutate(across(all_of(cols), ~ -1 * .))),
times=1000
)
library(ggplot2)
ggplot(mbm) +
geom_violin(aes(x = expr, y = time)) +
coord_flip()
Created on 2020-10-16 by the reprex package (v0.3.0)
To add an example of creating new columns based on a string vector of column names, building on Jfly's answer:
dt <- data.table(a = rnorm(1:100), b = rnorm(1:100), c = rnorm(1:100), g = c(rep(1:10, 10)))
col0 <- c("a", "b", "c")
col1 <- paste0("max.", col0)
for(i in seq_along(col0)) {
dt[, (col1[i]) := max(get(col0[i])), g]
}
dt[,.N, c("g", col1)]
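The same columns can likely be created without the explicit loop, reusing the (cols) := lapply(.SD, ...) idiom from above (a sketch):
dt[, (col1) := lapply(.SD, max), by = g, .SDcols = col0]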
library(data.table)
(dt <- data.table(a = 1:3, b = 1:3, d = 1:3))
Hence:
a b d
1: 1 1 1
2: 2 2 2
3: 3 3 3
Whereas (dt * (-1)) yields (note this negates every column, including d):
a b d
1: -1 -1 -1
2: -2 -2 -2
3: -3 -3 -3
So I'm new to data.table and don't understand how I can modify by reference at the same time that I perform an operation on chosen columns using the .SD symbol. I have two examples.
Example 1
> DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
> DT
group1:1 group1:2 group2:1
1: 1 1 1
Let's say, for example, I simply want to choose only the columns that contain "group1:" in the name. I know it's pretty straightforward to just reassign the result of the operation to the same object, like so:
cols1 <- names(DT)[grep("group1:", names(DT))]
DT <- DT[, .SD, .SDcols = cols1]
From reading the data.table vignette on reference semantics, my understanding is that the above does not modify by reference, whereas a similar operation using := would do so. Is this accurate? If so, is there a better way to do this operation that does modify by reference? In trying to figure this out I got stuck on how to combine the .SD symbol and the := operator. I tried
DT[, c(cols1) := .SD, .SDcols = cols1]
DT[, c(cols1) := lapply(.SD,function(x)x), .SDcols = cols1]
neither of which gave the result I wanted.
Example 2
Say I want to perform a different operation, dcast, that uses .SD as input. Example data table:
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> DT
x y z
1: 1 A 5
2: 2 A 6
3: 1 B 7
4: 2 B 8
Again, I know I can just reassign like so:
> DT <- dcast(DT, x ~ y, value.var = "z")
> DT
x A B
1: 1 5 7
2: 2 6 8
But I don't understand why the following does not work (or whether it would be preferable in some circumstances):
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> cols <- c("x", unique(DT$y))
> DT[, cols := dcast(.SD, x ~ y, value.var = "z")]
In your example,
cols1 <- names(DT)[grep("group1:", names(DT))]
DT[, c(cols1) := .SD, .SDcols = cols1] # not this
DT[, (cols1) := .SD, .SDcols = cols1] # this will work
Below is another example, filling NA values in the numeric columns with 0, by reference, with the columns chosen via .SDcols.
The trick is to assign the vector of column names before :=.
colnames = DT[, names(.SD), .SDcols = is.numeric] # column name vector
DT[, (colnames) := lapply(.SD, nafill, fill = 0), .SDcols= is.numeric]
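A minimal sketch of that pattern on made-up data (note .SDcols = is.numeric needs a reasonably recent data.table):
DT <- data.table(id = c("a", "b", "c"), val = c(1, NA, 3))
colnames = DT[, names(.SD), .SDcols = is.numeric]
DT[, (colnames) := lapply(.SD, nafill, fill = 0), .SDcols = is.numeric]
DT
#    id val
# 1:  a   1
# 2:  b   0
# 3:  c   3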
I want to update values in table df1 with values from df2, only updating NA values or zeros.
I can do it with data.table or dplyr, but I can't automate it for all columns.
#data.table
df1 <- data.frame(x1=1:4, x2=c('a','b', NA, 'd'), x3=c(0,0,2,2), stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3, x2=c("zz", "qq"),x3=6:7, stringsAsFactors=FALSE)
require(data.table)
setDT(df1); setDT(df2)
df1[df2, on = .(x1), x2 := ifelse(is.na(x2) | x2 == 0 ,i.x2,x2)]
#dplyr
require(dplyr)
inner_join(df1,df2,by = c("x1" = "x1")) %>%
transmute(x1 = x1,
x2 =ifelse(is.na(x2.x) | x2.x == 0,x2.y,x2.x),
x3 =ifelse(is.na(x3.x) | x3.x == 0,x3.y,x3.x))
With dplyr at least I can get the expected output by manually adding the columns; the problem is that the real data frame has very many columns, so I want to iterate across columns to achieve the task.
What I've tried:
# dplyr + apply
inner_join(df1,df2,by = c("x1" = "x1")) %>%
cbind(.$x1,
apply(.[-1],2, function(cname) ifelse(is.na(cname) | cname == 'b',paste(cname, ".x", collapse = ""),paste(cname, ".y", collapse = "")))
)
# data.table with for
for (cname in names(df1)[!names(df1) %in% c("x1")]) {
df1[i = df2, on = .(x1), j = cname := {function (x) ifelse(is.na(x) | x == 'b',i.x,x)} (cname)
, with = FALSE]
}
# data.table + lapply
df1[i = df2, on = .(x1) ,names(df1)[!names(df1) %in% c("x1")] := lapply(df1[,names(df1)[!names(df1) %in% c("x1")],with=FALSE],
function(x) ifelse(is.na(x) | x == 0,df2.x,df1.x))]
Using base R, you can create a function to replace NA and 0 with corresponding values from another column
replace_na_0 <- function(x) {
ifelse(is.na(x[[1]]) | x[[1]] == 0,x[[2]],x[[1]])
}
Merge the data frames, then split the merged columns into groups by stripping their suffixes (.x, .y), and pass each group to the replace_na_0 function:
temp_df <- merge(df1, df2, by = "x1")
cbind(temp_df[1], sapply(split.default(temp_df[-1],
sub("\\..*", "", names(temp_df)[-1])), replace_na_0))
# x1 x2 x3
#1 2 b 6
#2 3 qq 2
For data.table, you can use:
for (x in setdiff(names(df1), "x1")) {
  df1[is.na(get(x)) | get(x)==0, (x) := df2[.SD, on=.(x1), get(x)]]
}
Here is a pure data.table approach...
The melting process takes care of all columns you wish to 'fill', putting them all in one single pair of columns (variable and value).
Then fill in all the 0/NA values using an update join (= fast!).
Finally, recast everything back to its original shape.
library(data.table)
#set to data.table
setDT(df1)
setDT(df2)
#melt to long
melt1 <- melt(df1, id.vars = "x1" )
melt2 <- melt(df2, id.vars = "x1" )
#join all values with value NA or 0
melt1[ is.na(value) | value == 0,
value := melt1[ is.na( value) | value == 0,][ melt2, value := i.value, on = .(x1, variable) ]$value][]
#cast to original wide format
dcast( melt1, x1 ~ variable )
output
# x1 x2 x3
# 1: 1 a 0
# 2: 2 b 6
# 3: 3 qq 2
# 4: 4 d 2
I want to select the columns in DT1 that match the pattern flux, then keep only the rows whose values match those in a predefined vector vec1.
Sample Data
library(data.table)
DT1 <- structure(list(flux_1 = c(1, 6, 2, 9, 5),
FileName = c("prac_1", "prac_2", "prac_3", "prac_4", "prac_5")),
.Names = c("flux_1", "FileName"),
class = c("data.table", "data.frame"),
row.names = c(NA, -5L))
DT1
flux_1 FileName
1: 1 prac_1
2: 6 prac_2
3: 2 prac_3
4: 9 prac_4
5: 5 prac_5
vec1 <- c(6, 2)
The following code works but I need to explicitly specify flux_1.
DT1[ flux_1 %in% vec1]
flux_1 FileName
1: 6 prac_2
2: 2 prac_3
I was thinking about something like this but it didn't work
DT1[, .SD, .SDcols = names(DT1) %like% "flux"] %>%
.[. %in% vec1]
Empty data.table (0 rows) of 1 col: flux_1
Any suggestion is appreciated! Thank you!
We can use get to return the values of the column found by grep:
DT1[get(grep('flux', names(DT1), value = TRUE)) %in% vec1 ]
# flux_1 FileName
#1: 6 prac_2
#2: 2 prac_3
Or, if we use the .SDcols route, extract the .SD column as a vector, do the comparison, and subset the dataset:
DT1[DT1[, .SD[[1]] %in% vec1, .SDcols = grep('flux', names(DT1))]]
A similar option can be used with %like%:
DT1[DT1[, .SD[[1]] %in% vec1, .SDcols = names(DT1) %like% "flux"]]
Regarding the OP's approach
DT1[, .SD, .SDcols = names(DT1) %like% "flux"]
# flux_1
#1: 1
#2: 6
#3: 2
#4: 9
#5: 5
returns a data.table with a single column. By chaining, we need to extract the 'flux_1' column
DT1[, .SD, .SDcols = names(DT1) %like% "flux"] %>%
.[[1]] %in% vec1 %>%
magrittr::extract(DT1, .)
# flux_1 FileName
#1: 6 prac_2
#2: 2 prac_3
I was playing around with data.table and I came across a distinction that I'm not sure I quite understand. Given the following dataset:
library(data.table)
set.seed(400)
DT <- data.table(x = sample(LETTERS[1:5], 20, TRUE), key = "x"); DT
Can you please explain to me the difference between the following expressions?
1) DT[J("E"), .I]
2) DT[ , .I[x == "E"] ]
3) DT[x == "E", .I]
set.seed(400)
library(data.table)
DT <- data.table(x = sample(LETTERS[1:5], 20, TRUE), key = "x"); DT
1)
DT[ , .I[x == "E"] ] # [1] 18 19 20
gives an integer vector in which .I holds the row numbers of the 'E' rows in the ORIGINAL dataset DT
2)
DT[J("E") , .I] # [1] 1 2 3
DT["E" , .I] # [1] 1 2 3
DT[x == "E", .I] # [1] 1 2 3
are all the same, each producing an integer vector in which .I holds the row numbers of the 'E' rows in the NEW, subsetted data
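One quick way to convince yourself of the distinction (a sketch): base R's which() always indexes the original table, so it should agree with DT[, .I[x == "E"]]:
which(DT$x == "E")
# [1] 18 19 20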