I am wondering what the most efficient or cleanest way is, in data.table, to select rows based on the occurrence of certain column values.
For example, in a 7-column data.table where each value is either 1 or 0, I want all rows with exactly 2 values of 1 and 5 values of 0 (1 representing "presence" and 0 "absence").
So far, here is what I am doing, assuming the following data.table (the real one is much bigger; this is just a sample of it):
name D2A1.var D2B3.var D3A1.var D4A3.var D5B3.var H2A3.var H4A4.var MA_ancestor.var
Chrom_1;10000034;G;A Chrom_1;10000034;G;A 1 1 1 1 1 1 1 1
Chrom_1;10000035;G;A Chrom_1;10000035;G;A 1 1 1 1 1 1 1 1
Chrom_1;10000042;C;A Chrom_1;10000042;C;A 1 1 1 1 1 1 1 1
Chrom_1;10000051;A;G Chrom_1;10000051;A;G 1 1 1 1 1 1 1 1
Chrom_1;10000070;G;A Chrom_1;10000070;G;A 1 1 1 1 1 1 1 1
Chrom_1;10000084;C;T Chrom_1;10000084;C;T 1 1 1 1 1 1 1 1
Chrom_6;9997224;AT;A Chrom_6;9997224;AT;A 0 0 0 0 0 1 0 1
Chrom_6;9998654;GTGTGTGTT;G Chrom_6;9998654;GTGTGTGTT;G 0 0 0 0 0 0 0 1
Chrom_6;9999553;TTTC;T Chrom_6;9999553;TTTC;T 0 0 0 0 0 0 0 1
and if I want all rows that are 1 everywhere, plus, say, all rows with a 1 only in D2A1.var and D3A1.var, I am doing the following:
ALL = DT[DT$MA_ancestor.var == 1 & DT$D2A1.var == 1 & DT$D2B3.var == 1 & DT$D3A1.var == 1 & DT$D4A3.var == 1 & DT$D5B3.var == 1 & DT$H2A3.var == 1 & DT$H4A4.var == 1, ]
TWO = DT[DT$MA_ancestor.var == 0 & DT$D2A1.var == 1 & DT$D2B3.var == 0 & DT$D3A1.var == 1 & DT$D4A3.var == 0 & DT$D5B3.var == 0 & DT$H2A3.var == 0 & DT$H4A4.var == 0, ]
DFlist=list(TWO, ALL)
DFlong = rbindlist(DFlist, use.names = TRUE, idcol = FALSE)
This returns the expected result and is fast enough. However, with multiple conditions it means a lot of typing and a lot of data.table creations. Is there a faster, cleaner and more compact way of achieving this?
We can make use of .SDcols by specifying the columns of interest: loop through the Subset of Data.table (.SD), create a list of logical vectors, and Reduce them to a single logical vector with &, which is then used to subset the rows.
ALL <- DT[DT[, Reduce(`&`, lapply(.SD, `==`, 1)), .SDcols = nm1]]
TWO <- DT[DT[, Reduce(`&`, lapply(.SD, `==`, 0)), .SDcols = nm1]]
where
nm1 <- names(DT)[-1] #or change the names accordingly
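As a side note, if the goal is literally "exactly k columns equal to 1, regardless of which ones" (as phrased at the top of the question), counting with rowSums avoids enumerating conditions altogether. A minimal sketch, reusing DT and nm1 from above:
# keep rows where exactly k of the flag columns are 1 (k = 2 here);
# ..nm1 selects the columns named in the character vector nm1
k <- 2
DT[rowSums(DT[, ..nm1]) == k]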
Another option using setkey:
setkeyv(DT, names(DT))
#create desired filtering conditions as lists
cond1 <- setNames(as.list(rep(1, ncol(DT))), names(DT))
cond2 <- list(MA_ancestor.var=0, D2A1.var=1, D2B3.var=0, D3A1.var=1, D4A3.var=0, D5B3.var=0, H2A3.var=0, H4A4.var=0)
#get list of conditions so that one does not have to type it one by one
scond <- grep("^cond", ls(), value=TRUE)
DT[rbindlist(mget(scond, envir=.GlobalEnv), use.names=TRUE)]
If you are worried about picking up spurious variables starting with cond, you can assign them to an environment using list2env and pass that environment to mget via envir.
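A minimal sketch of that list2env idea (the environment name conds is just an illustration):
# keep the condition lists in their own environment so that no unrelated
# object whose name happens to start with "cond" can sneak into the join table
conds <- list2env(list(cond1 = cond1, cond2 = cond2))
DT[rbindlist(mget(ls(conds), envir = conds), use.names = TRUE)]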
data:
DT <- fread("D2A1.var D2B3.var D3A1.var D4A3.var D5B3.var H2A3.var H4A4.var MA_ancestor.var
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 0 0 0 0 1 0 1
0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 1")
Is there a faster, cleaner and more compact way of achieving this?
Doing separate queries and rbinding as you do is probably simplest.
You can simplify each query by using replace and join syntax:
# make a list of columns initially set to value 0
vec0 = lapply(DT[, .SD, .SDcols=D2A1.var:MA_ancestor.var], function(x) 0)
# helper function for semi join
subit = function(x, d = DT) d[x, on=names(x), nomatch=0]
rbind(
subit(replace(vec0, names(vec0), 1)),
subit(replace(vec0, c("D2A1.var", "D3A1.var"), 1))
)
(This code is not tested since OP's data is not easily reproducible.)
You could probably simplify further like...
subitall = function(..., d = DT, v0 = vec0)
  rbindlist(lapply(list(...), function(x) subit( replace(v0, x, 1), d = d )))
subitall( names(vec0), c("D2A1.var", "D3A1.var") )
Regarding the function subit for subsetting / semi-join, you could modify it to meet your needs based on answers in Perform a semi-join with data.table
EDIT: Oh right, following @chinsoon's answer, you could also rbind first:
subit(rbindlist(list(
replace(vec0, names(vec0), 1),
replace(vec0, c("D2A1.var", "D3A1.var"), 1)
)))
This would mean joining only once, which is simpler.
I have a question about NLP in R. My data is very big, so I need to reduce it for further analysis in order to apply an SVM to it.
I have a Document-Term-Matrix like this:
Document WordY WordZ WordV WordU WordZZ
1 0 0 0 1 0
2 0 2 1 2 0
3 0 0 1 1 0
In this example I would like to drop the columns WordY and WordZZ from the dataframe, because these columns carry no specific meaning for it. Is it possible to remove all columns containing only zero values with a single command? My problem is that the dataframe is far too large to delete each such column individually; it has something like 4.0000.0000 columns.
Thank you in advance, guys.
Cheers,
Tom
Using colSums():
df[, colSums(abs(df)) > 0]
i.e. a column has only zeros if and only if the sum of the absolute values is zero.
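For reference, applied to the df defined in the next answer (the example Document-Term table), this would return:
df[, colSums(abs(df)) > 0]
#   Document WordZ WordV WordU
# 1        1     0     0     1
# 2        2     2     1     2
# 3        3     0     1     1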
You could also use sapply:
df <- read.table(text=
"Document WordY WordZ WordV WordU WordZZ
1 0 0 0 1 0
2 0 2 1 2 0
3 0 0 1 1 0",header=T)
df[,sapply(df,function(x) any(x!=0))]
Document WordZ WordV WordU
1 1 0 0 1
2 2 2 1 2
3 3 0 1 1
Performance comparison:
Unit: microseconds
expr min lq mean median uq max neval
df[, sapply(df, function(x) any(x != 0))] 156.401 190.9515 236.3650 225.5510 271.0005 371.201 100
df[, colSums(abs(df)) > 0] 345.601 398.6005 555.2809 451.8010 506.8005 6005.601 100
dplyr::select_if(df, ~any(. != 0)) 2282.301 2620.9015 2939.9239 2773.1510 3019.9005 6588.402 100
df[, `:=`(which(colSums(df) == 0), NULL)] 223.201 262.4015 337.5781 297.9015 352.2020 2528.900 100
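For reference, a rough sketch of how a comparison like the one above could be produced (this is not the exact code behind those numbers; dt is a hypothetical data.table copy of df, and copy() keeps the := variant from modifying it between runs):
library(microbenchmark)
library(data.table)
dt <- as.data.table(df)
microbenchmark(
  sapply   = df[, sapply(df, function(x) any(x != 0))],
  colSums  = df[, colSums(abs(df)) > 0],
  dplyr    = dplyr::select_if(df, ~ any(. != 0)),
  set_null = copy(dt)[, (which(colSums(dt) == 0)) := NULL],
  times = 100
)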
Here is how I would do it:
dplyr::select_if(YOUR_DATA, ~ any(. != 0))
Returns:
Document WordZ WordV WordU
1 1 0 0 1
2 2 2 1 2
3 3 0 1 1
Another tidyverse solution. select_if is superseded by the following usage of select and where.
library(tidyverse)
dat2 <- dat %>%
select(where(~any(. != 0)))
dat2
# Document WordZ WordV WordU
# 1 1 0 0 1
# 2 2 2 1 2
# 3 3 0 1 1
Data
dat <- read.table(text = "Document WordY WordZ WordV WordU WordZZ
1 0 0 0 1 0
2 0 2 1 2 0
3 0 0 1 1 0",
header = TRUE)
This question is a simpler version of this other SO question. Here is code inspired by its accepted answer.
df1[, which(colSums(df1) == 0) := NULL]
Data creation code
set.seed(2021)
df1 <- replicate(5, rbinom(10, 1, 0.5))
df1 <- as.data.table(df1)
df1[, 3] <- 0
I have a binary R data.frame where the 1st column is an ID. How do I add 1 to the non-zero values?
id A B C
001 0 1 0
002 0 0 1
Should become
id A B C
001 0 2 0
002 0 0 2
We can multiply by the desired number, which makes use of the fact that 0 multiplied by any value returns 0:
df1[-1] <- df1[-1] * 2
Another option is to create a logical matrix on the subset of columns, use it to subset the values, and assign the new number:
df1[-1][df1[-1] ==1] <- 2
Or add
df1[-1][df1[-1] ==1] <- df1[-1][df1[-1] ==1] + 1
Since you can treat the zeros as a mask, you can play some tricks like the one below:
df[-1] <- df[-1]*(df[-1]+1)
While I suspect the other answers will suffice, you mentioned data.table in the question title, so here's one specific to that package:
library(data.table)
DT <- fread(header = TRUE, text = "
id A B C
001 0 1 0
002 0 0 1", colClasses = list(character="id"))
sdcols <- c("A", "B", "C")
DT[, (sdcols) := .SD + (.SD != 0), .SDcols = sdcols]
DT
# id A B C
# 1: 001 0 2 0
# 2: 002 0 0 2
Since you have binary data, you can add the data to itself, so 0 + 0 remains 0 and 1 + 1 becomes 2 (ignoring the first id column, of course).
df[-1] <- df[-1] + df[-1]
df
# id A B C
#1 001 0 2 0
#2 002 0 0 2
I want to identify (not eliminate) duplicates in a data frame and add a 0/1 variable accordingly (whether a row is a duplicate or not), using the R dplyr package.
Example:
| A B C D
1 | 1 0 1 1
2 | 1 0 1 1
3 | 0 1 1 1
4 | 0 1 1 1
5 | 1 1 1 1
Clearly, rows 1 and 2 are duplicates, so I want to create a new variable (with mutate?), say E, that equals 1 in rows 1, 2, 3 and 4, since rows 3 and 4 are also identical.
Moreover, I want to add another variable, F, that equals 1 if there is a duplicate differing in only one column. That is, F in rows 1, 2 and 5 would equal 1 since they differ only in the B column.
I hope it is clear what I want to do and I hope that dplyr offers a smooth solution to this problem. This is of course possible in "base" R but I believe (hope) that there exists a smoother solution.
You can use dist() to compute the differences, and then a search in the resulting distance object can give the needed answers (E, F, etc.). Here is an example code, where X is the original data.frame:
W=as.matrix(dist(X, method="manhattan"))
X$E = as.integer(sapply(1:ncol(W), function(i,D){any(W[-i,i]==D)}, D=0))
X$F = as.integer(sapply(1:ncol(W), function(i,D){any(W[-i,i]==D)}, D=1))
Just change D= for the number of different columns needed.
It's all base R though. Using plyr::laply instead of sapply has the same effect; dplyr looks like overkill here.
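For concreteness, here is the same code run on the question's example table (a worked sketch; X is built by hand to match the table above):
X <- data.frame(A = c(1,1,0,0,1), B = c(0,0,1,1,1), C = c(1,1,1,1,1), D = c(1,1,1,1,1))
W <- as.matrix(dist(X, method = "manhattan"))
X$E <- as.integer(sapply(1:ncol(W), function(i, D) any(W[-i, i] == D), D = 0))
X$F <- as.integer(sapply(1:ncol(W), function(i, D) any(W[-i, i] == D), D = 1))
X
#   A B C D E F
# 1 1 0 1 1 1 1
# 2 1 0 1 1 1 1
# 3 0 1 1 1 1 1
# 4 0 1 1 1 1 1
# 5 1 1 1 1 0 1
Note that F is 1 for every row here, because rows 3 and 4 also differ from row 5 by a single column; the next answer makes the same point.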
Here is a data.table solution that is extendable to an arbitrary case (1..n columns the same) - not sure if someone can convert it to dplyr for you. I had to change your dataset a bit to show your desired F column - in your example all rows would get a 1, because rows 3 and 4 are one column different from row 5 as well.
library(data.table)
DT <- data.frame(A = c(1,1,0,0,1), B = c(0,0,1,1,1), C = c(1,1,1,1,1), D = c(1,1,1,1,1), E = c(1,1,0,0,0))
DT
A B C D E
1 1 0 1 1 1
2 1 0 1 1 1
3 0 1 1 1 0
4 0 1 1 1 0
5 1 1 1 1 0
setDT(DT)
DT_ncols <- length(DT)
base <- data.table(t(combn(1:nrow(DT), 2)))
setnames(base, c("V1","V2"),c("ind_x","ind_y"))
DT[, ind := .I]
DT_melt <- melt(DT, id.var = "ind", variable.name = "column")
base <- merge(base, DT_melt, by.x = "ind_x", by.y = "ind", allow.cartesian = TRUE)
base <- merge(base, DT_melt, by.x = c("ind_y", "column"), by.y = c("ind", "column"))
base <- base[, .(common_cols = sum(value.x == value.y)), by = .(ind_x, ind_y)]
This gives us a data.table that looks like this:
base
ind_x ind_y common_cols
1: 1 2 5
2: 1 3 2
3: 2 3 2
4: 1 4 2
5: 2 4 2
6: 3 4 5
7: 1 5 3
8: 2 5 3
9: 3 5 4
10: 4 5 4
This says that rows 1 and 2 have 5 common columns (duplicates). Rows 3 and 5 have 4 common columns, and 4 and 5 have 4 common columns. We can now use a fairly extendable format to flag any combination we want:
base <- melt(base, id.vars = "common_cols")
# Unique - common_cols == DT_ncols
DT[, F := ifelse(ind %in% unique(base[common_cols == DT_ncols, value]), 1, 0)]
# Same save 1 - common_cols == DT_ncols - 1
DT[, G := ifelse(ind %in% unique(base[common_cols == DT_ncols - 1, value]), 1, 0)]
# Same save 2 - common_cols == DT_ncols - 2
DT[, H := ifelse(ind %in% unique(base[common_cols == DT_ncols - 2, value]), 1, 0)]
This gives:
A B C D E ind F G H
1: 1 0 1 1 1 1 1 0 1
2: 1 0 1 1 1 2 1 0 1
3: 0 1 1 1 0 3 1 1 0
4: 0 1 1 1 0 4 1 1 0
5: 1 1 1 1 0 5 0 1 1
Instead of manually selecting, you can append all combinations like so:
# run after base <- melt(base, id.vars = "common_cols")
base <- unique(base[,.(ind = value, common_cols)])
base[, common_cols := factor(common_cols, 1:DT_ncols)]
merge(DT, dcast(base, ind ~ common_cols, fun.aggregate = length, drop = FALSE), by = "ind")
ind A B C D E 1 2 3 4 5
1: 1 1 0 1 1 1 0 1 1 0 1
2: 2 1 0 1 1 1 0 1 1 0 1
3: 3 0 1 1 1 0 0 1 0 1 1
4: 4 0 1 1 1 0 0 1 0 1 1
5: 5 1 1 1 1 0 0 0 1 1 0
Here is a dplyr solution:
test %>%
  mutate(flag = (A == lag(A) &
                   B == lag(B) &
                   C == lag(C) &
                   D == lag(D))) %>%
  mutate(twice = lead(flag) == TRUE) %>%
  mutate(E = ifelse(flag == TRUE | twice == TRUE, 1, 0)) %>%
  mutate(E = ifelse(is.na(E), 0, 1)) %>%
  mutate(FF = ifelse(((A + lag(A)) + (B + lag(B)) + (C + lag(C)) + (D + lag(D))) == 7, 1, 0)) %>%
  mutate(FF = ifelse(is.na(FF) | FF == 0, 0, 1)) %>%
  select(A, B, C, D, E, FF)
Result:
A B C D E FF
1 1 0 1 1 1 0
2 1 0 1 1 1 0
3 0 1 1 1 1 0
4 0 1 1 1 1 0
5 1 1 1 1 0 1
I would love some help understanding the syntax needed to do a certain calculation in R.
I have a dataframe like this:
a b c
1 1 0
2 1 1
3 1 0
4 2 0
5 2 0
6 3 1
7 3 0
8 3 0
9 4 0
and I want to create a new column "d" that has a value of 1 if (and only if) any of the values in column "c" equal 1 for each group of rows that have the same value in column "b." Otherwise (see rows 4,5 and 9) column "d" gives 0.
a b c d
1 1 0 1
2 1 1 1
3 1 0 1
4 2 0 0
5 2 0 0
6 3 1 1
7 3 0 1
8 3 0 1
9 4 0 0
Can this be done with a for loop? If so, any advice on how to write that would be greatly appreciated.
Using data.table
setDT(df)
df[, d := as.integer(any(c == 1L)), b]
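Applied to the question's data, a quick sketch (assuming df holds the table shown above):
library(data.table)
df <- data.frame(a = 1:9,
                 b = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                 c = c(0, 1, 0, 0, 0, 1, 0, 0, 0))
setDT(df)
df[, d := as.integer(any(c == 1L)), b]
df
#    a b c d
# 1: 1 1 0 1
# 2: 2 1 1 1
# 3: 3 1 0 1
# 4: 4 2 0 0
# 5: 5 2 0 0
# 6: 6 3 1 1
# 7: 7 3 0 1
# 8: 8 3 0 1
# 9: 9 4 0 0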
Since you asked for a loop:
# add the result column
dat <- data.frame(dat, d = rep(NA, nrow(dat)))
# iterate over the groups
for(i in unique(dat$b)){
  # check if there is a one in each group
  if(any(dat$c[dat$b == i] == 1))
    dat$d[dat$b == i] <- 1
  else
    dat$d[dat$b == i] <- 0
}
Of course, the data.table solution is more elegant ;)
To do this in base R (using the same grouping idea as the data.table method, with any), you can use ave:
df$d <- ave(cbind(df$c), df$b, FUN=function(i) any(i)==1)
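Spelled out a little more directly (a sketch of the same idea, assuming df holds the question's data), this can also be written without the cbind:
df$d <- ave(df$c, df$b, FUN = function(i) as.integer(any(i == 1)))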
I am attempting to do two things to a dataset I currently have:
ID IV1 DV1 DV2 DV3 DV4 DV5 DV6 DV7
1 97330 3 0 0 0 0 0 1 0
2 118619 0 0 0 0 0 1 1 0
3 101623 2 0 0 0 0 0 0 0
4 202626 0 0 0 0 0 0 0 0
5 182925 1 1 0 0 0 0 0 0
6 179278 1 0 0 0 0 0 0 0
1. Find the number of unique column combinations of the 7 binary independent variables (DV1 - DV7).
2. Find the sum of an independent count variable (IV1) for each unique group.
I have been able to determine the number of unique column combinations by using the following:
uniq <- unique(dat[,c('DV1','DV2','DV3','DV4','DV5','DV6','DV7')])
This indicates there are 101 unique combinations present in the dataset. What I haven't been able to figure out is how to determine how to sum the variable "IV1" by each unique group. I've been reading around on this site, and I'm fairly certain there is an easy dplyr answer to this, but it's eluded me so far.
NOTE: I'm essentially trying to find an R solution to perform a "conjunctive analysis" which is displayed in this paper. There is sample code for SPSS, SAS and STATA at the end of the paper.
library(dplyr)
group_by(dat, DV1, DV2, DV3, DV4, DV5, DV6, DV7) %>%
summarize(sumIV1 = sum(IV1))
The number of rows in the result is the number of unique combinations present in your data. The sumIV1 column, of course, has the group-wise sum of IV1.
Thanks to Frank in the comments, we can use strings with group_by_ to simplify:
group_by_(dat, .dots = paste0("DV", 1:7)) %>%
summarize(sumIV1 = sum(IV1))
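As a side note (not part of the original answer): in current dplyr releases group_by_() is deprecated, and the same grouping can be written with across() and all_of(), roughly:
library(dplyr)
group_by(dat, across(all_of(paste0("DV", 1:7)))) %>%
  summarize(sumIV1 = sum(IV1), .groups = "drop")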
Here's a reproducible example:
library(data.table)
DT <- data.table(X = c(1, 1, 1 , 1), Y = c(2, 2 , 3 , 4), Z = c(1,1,3,1))
where X, Y, ... are your columns.
Then use the Reduce function:
DT[, join_grp := Reduce(paste,list(X,Y,Z))]
This gives:
DT
X Y Z join_grp
1: 1 2 1 1 2 1
2: 1 2 1 1 2 1
3: 1 3 3 1 3 3
4: 1 4 1 1 4 1
And we can find:
unique(DT[, join_grp])
[1] "1 2 1" "1 3 3" "1 4 1"
For the sums:
DT[ , sum(X), by = join_grp]
Just put whatever column you want to sum in place of the X
Concise Solution
DT[, join_grp := Reduce(paste,list(X,Y,Z))][ , sum(X), by = join_grp]
or
DT[ , sum(X), by = list(Reduce(paste,list(X,Y,Z)))]
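To connect this back to the question's columns, a rough sketch (assuming dat holds the DV1-DV7 / IV1 data; dat_dt is just a hypothetical name for its data.table copy):
library(data.table)
dat_dt <- as.data.table(dat)
# paste the seven DV columns into one group key, then sum IV1 within each key
dat_dt[, join_grp := Reduce(paste, .SD), .SDcols = paste0("DV", 1:7)]
dat_dt[, .(sumIV1 = sum(IV1)), by = join_grp]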