Remove rows from data.table that meet condition

I have a data table
DT <- data.table(col1=c("a", "b", "c", "c", "a"), col2=c("b", "a", "c", "a", "b"), condition=c(TRUE, FALSE, FALSE, TRUE, FALSE))
col1 col2 condition
1: a b TRUE
2: b a FALSE
3: c c FALSE
4: c a TRUE
5: a b FALSE
and would like to remove rows on the following conditions:
each row for which condition==TRUE (rows 1 and 4)
each row that has the same values for col1 and col2 as a row for which the condition==TRUE (that is row 5, col1=a, col2=b)
finally each row that has the same values for col1 and col2 for which condition==TRUE, but with col1 and col2 switched (that is row 2, col1=b and col2=a)
So only row 3 should stay.
I'm doing this by making a new data table DTcond with all rows meeting the condition, looping over the values for col1 and col2, and collecting the indices from DT which will be removed.
DTcond <- DT[condition==TRUE,]
indices <- c()
for (i in 1:nrow(DTcond)) {
n1 <- DTcond[i, col1]
n2 <- DTcond[i, col2]
indices <- c(indices, DT[ ((col1 == n1 & col2 == n2) | (col1==n2 & col2 == n1)), which=T])
}
DT[!indices,]
col1 col2 condition
1: c c FALSE
This works but is terribly slow for large datasets, and I guess there must be other ways in data.table to do this without loops or apply. Any suggestions on how I could improve this? (I'm new to data.table.)

You can do an anti join:
mDT = DT[(condition), !"condition"][, rbind(.SD, rev(.SD), use.names = FALSE)]
DT[!mDT, on=names(mDT)]
# col1 col2 condition
# 1: c c FALSE
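The one-liner above can be unpacked into separate steps. The following is a minimal sketch of the same anti-join idea, not the answerer's exact code: the intermediate names hits, swapped and res are introduced here for illustration, and the swapped pairs are built by explicit renaming instead of rev(.SD).

```r
library(data.table)

DT <- data.table(
  col1 = c("a", "b", "c", "c", "a"),
  col2 = c("b", "a", "c", "a", "b"),
  condition = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Pairs flagged by the condition: (a, b) and (c, a)
hits <- DT[(condition), .(col1, col2)]

# The same pairs with the two columns swapped: (b, a) and (a, c)
swapped <- hits[, .(col1 = col2, col2 = col1)]

# Lookup table of every pair to drop, in either order
mDT <- unique(rbind(hits, swapped))

# Anti join: keep only the rows of DT that have no match in mDT
res <- DT[!mDT, on = c("col1", "col2")]
res
```

Because the matching is done by a join rather than a loop over the flagged rows, the work no longer grows with one full scan of DT per flagged pair.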

Related

Using grep in list in order to fill a new df in R

Hello I have a list :
list=c("OK_67J","GGT_je","Ojj_OK_778","JUu3","JJE")
and I would like to transform it into a data frame:
COL1 COL2
OK_67J A
GGT_je B
Ojj_OK_778 A
JUu3 B
JJE B
where COL2 is A if the value contains the pattern OK_ and B if not.
I tried :
COL2<-rep("Virus",length(list))
list[grep("OK_",tips)]<-"A"
df <- data.frame(COL1=list,COL2=COL2)
Use grepl :
ifelse(grepl('OK_', list), "A", "B")
#[1] "A" "B" "A" "B" "B"
You can also do it without ifelse :
c("B", "A")[grepl('OK_', list) + 1]
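The indexing version works because adding 1 to a logical vector coerces FALSE to 1 and TRUE to 2, which then picks "B" or "A" by position. A small sketch of the intermediate steps, assuming the input vector is renamed to tips (the names hit, idx and labels are illustrative only):

```r
tips <- c("OK_67J", "GGT_je", "Ojj_OK_778", "JUu3", "JJE")

hit <- grepl("OK_", tips, fixed = TRUE)  # logical: TRUE where "OK_" occurs
idx <- hit + 1L                          # FALSE -> 1, TRUE -> 2
labels <- c("B", "A")[idx]               # position 1 is "B", position 2 is "A"
labels
#[1] "A" "B" "A" "B" "B"
```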
It is better not to use list as a variable name, since list is a built-in function in R.
If you replace list[grep("OK_",tips)]<-"A" with COL2[grep("OK_",list)] <- "A" and initialise COL2 with "B", your solution will work.
list <- c("OK_67J", "GGT_je", "Ojj_OK_778", "JUu3", "JJE")
COL2 <- rep("B", length(list))
COL2[grep("OK_", list)] <- "A"
df <- data.frame(COL1 = list, COL2 = COL2)
df
# COL1 COL2
#1 OK_67J A
#2 GGT_je B
#3 Ojj_OK_778 A
#4 JUu3 B
#5 JJE B
First off, list is not a list but a character vector:
list=c("OK_67J","GGT_je","Ojj_OK_778","JUu3","JJE")
class(list)
[1] "character"
To transform it to a dataframe:
df <- data.frame(v1 = list)
To add the new column use grepl:
df$v2 <- ifelse(grepl("OK_", df$v1), "A", "B")
or use str_detect:
library(stringr)
df$v2 <- ifelse(str_detect(df$v1, "OK_"), "A", "B")
Result:
df
v1 v2
1 OK_67J A
2 GGT_je B
3 Ojj_OK_778 A
4 JUu3 B
5 JJE B

Create an ordered list in a dataframe [duplicate]

This question already has answers here:
Sorting each row of a data frame [duplicate]
(2 answers)
Row wise Sorting in R
(2 answers)
Row-wise sort then concatenate across specific columns of data frame
(2 answers)
Closed 5 years ago.
I have the following data frame:
col1 <- c("a", "b", "c")
col2 <- c("c", "a", "d")
col3 <- c("b", "c", "a")
df <- data.frame(col1,col2,col3)
I want to create a new column in this data frame that has, for each row, the ordered list of the columns col1, col2, col3. So, for the first row it would be a list like "a", "b", "c".
The way I'm handling it is to create a loop but since I have 50k rows, it's quite inefficient, so I'm looking for a better solution.
rown <- nrow(df)
i = 0
while(i<rown){
i = i +1
col1 <- df$col1[i]
col2 <- df$col2[i]
col3 <- df$col3[i]
col1 <- as.character(col1)
col2 <- as.character(col2)
col3 <- as.character(col3)
list1 <- c(col1, col2, col3)
list1 <- list1[order(sapply(list1, '[[', 1))]
a <- list1[1]
b <- list1[2]
c <- list1[3]
df$col.list[i] <- paste(a, b, c, sep = " ")
}
Any ideas on how to make this code more efficient?
EDIT: the other question is not relevant in my case since I need to paste the three columns after sorting each row, so it's the paste statement that is dynamic, I'm not trying to change the data frame by sorting.
Expected output:
col1 col2 col3 col.list
a c b a b c
b a c a b c
c d a a c d
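One possible way to avoid the explicit while loop is to sort and paste each row with apply. This is only a sketch under the assumption that the three columns hold single strings (or factors); cols and m are names introduced here, and apply still iterates over rows internally, so it is a tidier loop rather than a fully vectorised solution.

```r
df <- data.frame(
  col1 = c("a", "b", "c"),
  col2 = c("c", "a", "d"),
  col3 = c("b", "c", "a")
)

cols <- c("col1", "col2", "col3")

# as.matrix gives a character matrix, also when the columns are factors
m <- as.matrix(df[cols])

# Sort the values of each row, then paste them into one string
df$col.list <- unname(apply(m, 1, function(r) paste(sort(r), collapse = " ")))
df
```

The original data frame columns are left unsorted; only the new col.list column holds the sorted, pasted values.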

Use of match within i of data.table

The %in% operator is a wrapper for the match function returning "a vector of the same length as x". For instance:
> match(c("a", "b", "c"), c("a", "a"), nomatch = 0) > 0
## [1] TRUE FALSE FALSE
When used within i of data.table, however
(dt1 <- data.table(v1 = c("a", "b", "c"), v2 = "dt1"))
v1 v2
1: a dt1
2: b dt1
3: c dt1
(dt2 <- data.table(v1 = c("a", "a"), v2 = "dt2"))
v1 v2
1: a dt2
2: a dt2
dt1[v1 %in% dt2$v1]
v1 v2
1: a dt1
2: a dt1
duplicates are obtained. Should the expected behaviour of %in% within i of data.table not give the same result as
dt1[dt1$v1 %in% dt2$v1]
v1 v2
1: a dt1
i.e. without duplicates?
This was a bug in the automatic indexing of data.table versions before 1.9.5; it is fixed in 1.9.5 and later.
I can think of 3 possible workarounds:
Disable the auto indexing and use base R %in% as in
options(datatable.auto.index = FALSE)
dt1[v1 %in% dt2$v1]
## v1 v2
## 1: a dt1
Use the built-in %chin% operator, which is both more efficient and doesn't have this bug (it works only for character vector comparisons)
dt1[v1 %chin% dt2$v1]
## v1 v2
## 1: a dt1
Install the development version from Github (Close all your R sessions first and reopen just one)
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table)
dt1 <- data.table(v1 = c("a", "b", "c"), v2 = "dt1")
dt2 <- data.table(v1 = c("a", "a"), v2 = "dt2")
dt1[v1 %in% dt2$v1]
## v1 v2
## 1: a dt1
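The equivalence between %in% and match mentioned in the question can be checked in base R, independently of data.table. A small sketch (x, lookup, via_in and via_match are illustrative names):

```r
x <- c("a", "b", "c")
lookup <- c("a", "a")

via_in <- x %in% lookup
via_match <- match(x, lookup, nomatch = 0L) > 0L

identical(via_in, via_match)  # both give one logical value per element of x
length(via_in) == length(x)   # duplicates in lookup never lengthen the result
```

Since the result has exactly one element per element of x, using it as a row filter cannot return more rows than the table being filtered.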

Check frequency of data.table value in other data.table

library(data.table)
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C"))
I want to add a column popular to DT2 with value TRUE whenever DT2$group is contained in DT1$group at least twice. So, in the example above, DT2 should be
group popular
1: A TRUE
2: B TRUE
3: C FALSE
What would be an efficient way to get to this?
Updated example: DT2 may actually contain more groups than DT1, so here's an updated example:
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C", "D"))
And the desired output would be
group popular
1: A TRUE
2: B TRUE
3: C FALSE
4: D FALSE
I'd just do it this way:
## 1.9.4+
setkey(DT1, group)
DT1[J(DT2$group), list(popular = .N >= 2L), by = .EACHI]
# group popular
# 1: A TRUE
# 2: B TRUE
# 3: C FALSE
# 4: D FALSE ## on the updated example
data.table's join syntax is quite powerful, in that, while joining, you can also aggregate / select / update columns in j. Here we perform a join: for each row in DT2$group, we compute the j-expression .N >= 2L on the matching rows in DT1. Specifying by = .EACHI (please check 1.9.4 NEWS) evaluates the j-expression once for each row of i.
In 1.9.4, .() has been introduced as an alias in all i, j and by. So you could also do:
DT1[.(DT2$group), .(popular = .N >= 2L), by = .EACHI]
When you're joining by a single character column, you can drop the .() / J() syntax altogether (for convenience). So this can be also written as:
DT1[DT2$group, .(popular = .N >= 2L), by = .EACHI]
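In later data.table versions (1.9.6 and up) the same join can be written without setting a key, using the on argument. A sketch on the updated example, assuming a data.table version that supports on=; res is a name introduced here:

```r
library(data.table)

DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C", "D"))

# For each row of DT2, count the matching rows in DT1 and test the count
res <- DT1[DT2, on = "group", .(popular = .N >= 2L), by = .EACHI]
res
```

A group that does not occur in DT1 at all (here D) has no matching rows, so it comes out as not popular.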
This is how I would do it: first count the number of times each group appears in DT1, then simply join DT2 and DT1.
require(data.table)
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C"))
#solution:
DT1[,num_counts:=.N,by=group] #the number of entries in this group, just count the other column
setkey(DT1, group)
setkey(DT2, group)
DT2 = DT1[DT2,mult="last"][,list(group, popular = (num_counts >= 2))]
#> DT2
# group popular
#1: A TRUE
#2: B TRUE
#3: C FALSE
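A variation on the count-then-look-up idea that leaves DT1 unmodified and also covers groups missing from DT1 might look like this. It is a sketch on the updated example; counts and popular_groups are hypothetical helper names.

```r
library(data.table)

DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C", "D"))

# Number of rows per group in DT1
counts <- DT1[, .N, by = group]

# Groups that occur at least twice
popular_groups <- counts[N >= 2L, group]

# Add the flag to DT2 by reference; groups absent from DT1 get FALSE
DT2[, popular := group %chin% popular_groups]
DT2
```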

Access a single cell / subsetted column of a data.table

How can I access just a single cell in a data.table, the way I could for a data.frame:
mdf <- data.frame(a = c("A", "B", "C"), b = rnorm(3), c = 1:3)
mdf[ mdf$a == "B", "c" ]
[1] 2
Doing the analogous thing on a data.table returns a data.table that includes the key column(s):
mdt <- data.table( mdf, key = "a" )
mdt[ "B", c ]
a c
1: B 2
mdt[ "B", c ][ , c]
[1] 2
Did I miss a parameter, or does it have to be done as in the last line?
Either of these will avoid repeating the c but are not as efficient since they involve computing the first [] as well as the final answer:
> mdt[ "B", ][["c"]]
[1] 2
> mdt[ "B", ][, c]
[1] 2
Recent versions of data.table make this easier
mdt[ "B", c]
# [1] 2
Original answer was returning a data.table like:
mdt['B', 'c']
# c
# 1: 2
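The difference between getting a plain vector and getting a one-cell data.table can be seen side by side. This sketch assumes a recent data.table version and uses fixed values for column b instead of rnorm so that it is reproducible; v1, v2 and v3 are illustrative names.

```r
library(data.table)

mdt <- data.table(a = c("A", "B", "C"), b = c(0.5, 1.5, 2.5), c = 1:3, key = "a")

v1 <- mdt[a == "B", c]      # unquoted column name in j: plain vector
v2 <- mdt[a == "B"][["c"]]  # subset rows first, then extract the column
v3 <- mdt[a == "B", .(c)]   # wrapping j in .() keeps a data.table

v1
v3
```

Wrapping the column in .() (or passing its name as a string in j) is the way to ask for a data.table explicitly when that is what you want.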
