Fast dropping rows in data.table

R's data.table package offers fast subsetting of values based on keys.
So, for example:
library(data.table)
set.seed(1342)
df1 <- data.table(group = gl(10, 10, labels = letters[1:10]),
                  value = sample(1:100))
setkey(df1, group)
df1["a"]
will return all rows in df1 where group == "a".
What if I want all rows in df1 where group != "a"? Is there a concise syntax for that using data.table?

I think you answered your own question:
> nrow(df1[group != "a"])
[1] 90
> table(df1[group != "a", group])
a b c d e f g h i j
0 10 10 10 10 10 10 10 10 10
Seems pretty concise to me?
EDIT FROM MATTHEW: As per comments, this is a vector scan. There is a not-join idiom here and here, and feature request #1384 to make it easier.
EDIT: feature request #1384 was implemented in data.table 1.8.3:
df1[!'a']
# and to avoid the character-to-factor coercion warning in this example (where
# the key column happens to be a factor) :
df1[!J(factor('a'))]
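The same not-join idiom also accepts several values at once, and in later versions of data.table (1.9.6+) the on= argument lets you use it without setting a key first; a small sketch, with the same factor-coercion caveat as above:
df1[!J(factor(c('a', 'b')))]   # drop two groups in one not-join
df1[!'a', on = 'group']        # data.table >= 1.9.6: works without a key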

I would just get all keys that are not "a":
df1[!(group %in% "a")]
Does this achieve what you want?
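Worth knowing when choosing between these: as Matthew's edit above says, group != "a" and group %in% "a" are vector scans, while the not-join uses the key's binary search, which matters on large tables. A rough sketch to compare the two yourself (the table here is invented for illustration, and timings are machine-dependent):
library(data.table)
big <- data.table(group = sample(letters, 1e7, replace = TRUE),
                  value = runif(1e7))
setkey(big, group)
system.time(big[!(group %in% "a")])  # vector scan over all 10 million rows
system.time(big[!"a"])               # keyed not-join via binary search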

Related

Remove data.table rows whose vector elements contain nested NAs

I need to remove from a data.table any row in which column a contains any NA nested in a vector:
library(data.table)
a = list(as.numeric(c(NA, NA)), 2, as.numeric(c(3, NA)), c(4, 5))
b <- 11:14
dt <- data.table(a,b)
Thus, rows 1 and 3 should be removed.
I tried three solutions without success:
dt1 <- dt[!is.na(a)]
dt2 <- dt[!is.na(unlist(a))]
dt3 <- dt[dt[,!Reduce(`&`, lapply(a, is.na))]]
Any ideas? Thank you.
You can do the following:
dt[sapply(dt$a, \(l) !any(is.na(l)))]
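(For what it's worth, the failed attempts in the question are instructive: is.na() applied to a list column only flags elements that are themselves a single NA, and unlist() yields more values than dt has rows, so its result cannot be used to index them.)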
This alternative also works, but you will get warnings because all() coerces the numeric values to logical:
dt[sapply(dt$a, all)]
Better approach (thanks to r2evans, see comments)
dt[!sapply(a,anyNA)]
Output:
a b
1: 2 12
2: 4,5 14
A third option that you might prefer: move the functionality into a separate helper function that takes a list (nl) and returns a logical vector of length(nl), then apply that function as below. Here I explicitly call unlist() on the result of lapply() rather than letting sapply() simplify for me, but sapply() would work just as well.
f <- \(nl) unlist(lapply(nl,\(l) !any(is.na(l))))
dt[f(a)]
An alternative to *apply()
dt[, .SD[!anyNA(a, recursive = TRUE)], by = .I][, !"I"]
# a b
# <list> <int>
# 1: 2 12
# 2: 4,5 14
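If you want to be defensive about the return type, vapply() is a close variant of the sapply() answers above that pins the result to exactly one logical per row:
# vapply() guarantees one logical per list element and errors
# loudly if the callback ever returns anything else
dt[!vapply(a, anyNA, logical(1))]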

R: Is there a way to get unique, closest matches with the rows in the same data.table based on multiple columns?

In R, I want to get unique, closest matches for the rows in a data.table which are identified by unique ids based on values in two columns. Here, I provide a toy example and the code I'm using to achieve this.
dt <- data.table(id = letters,
                 value_1 = as.integer(runif(26, 1, 20)),
                 value_2 = as.integer(runif(26, 1, 10)))
pairs <- data.table()
while (nrow(dt) >= 2) {
  k <- dt[1]
  m <- dt[-1]
  t <- m[k, roll = "nearest", on = .(value_1, value_2)]
  pairs <- rbind(pairs, t)
  dt <- dt[!dt$id %in% pairs$id & !dt$id %in% pairs$i.id]
}
pairs <- pairs[, -c(2, 3)]
This gives me a data.table with the matched ids and the ones that do not get any matches.
id i.id
1 NA a
2 NA b
3 m c
4 v d
5 y e
6 i f
...
Is there a way to do this without the loop? I intend to implement this on a data.table with more than 20 million observations, and clearly using a loop is extremely inefficient. I was wondering if the roll-join command can be run on a copy of the main data.table by introducing an exception condition, so as not to match the same ids with each other. Maybe something like this:
m <- dt
t <- m[dt, roll = "nearest", on = .(value_1, value_2)]
Without the exception, this command merely generates matches of ids with themselves. Also, this does not ensure unique matches.
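(To sketch the "exception condition" idea: one could drop self-matches after the join, e.g. t[id != i.id], but as noted that still allows one id to pair with several others, so it does not by itself replace the greedy loop above.)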
Thanks!

Expressively select rows in a data.table which match rows from another data.table

Given two data tables (tbl_A and tbl_B), I would like to select all the rows in tbl_A which have matching rows in tbl_B, and I would like the code to be expressive. If the %in% operator were defined for data.tables, something like this would be ideal:
subset <- tbl_A[tbl_A %in% tbl_B]
I can think of many ways to accomplish what I want such as:
# double negation (set differences)
subset <- tbl_A[!tbl_A[!tbl_B,1,keyby=a]]
# nomatch with keyby and this annoying `[,V1:=NULL]` bit
subset <- tbl_B[,1,keyby=.(a=x)][,V1:=NULL][tbl_A,nomatch=0L]
# nomatch with !duplicated() and setnames()
subset <- tbl_B[!duplicated(tbl_B),.(x)][tbl_A,nomatch=0L]; setnames(subset,"x","a")
# nomatch with !unique() and setnames()
subset <- unique(tbl_B)[,.(x)][tbl_A,nomatch=0L]; setnames(subset,"x","a")
# use of a temporary variable (Thanks @Frank)
subset <- tbl_A[, found := FALSE][tbl_B, found := TRUE][(found)][,found:=NULL][]
but each expression is difficult to read and it's not obvious at first glance what the code is doing. Is there a more idiomatic / expressive way of accomplishing this task?
For purposes of example, here are some toy data.tables:
# toy tables
tbl_A <- data.table(a = letters[1:5],
                    b = 1:5,
                    c = rnorm(5))
tbl_B <- data.table(x = letters[3:7],
                    y = 13:17,
                    z = rnorm(5))
# both tables might have multiple rows with the same key fields.
tbl_A <- rbind(tbl_A,tbl_A)
tbl_B <- rbind(tbl_B,tbl_B)
setkey(tbl_A,a)
setkey(tbl_B,x)
and an expected result containing the rows in tbl_A which match at least one row in tbl_B:
a b c
1: c 3 -0.5403072
2: c 3 -0.5403072
3: d 4 -1.3353621
4: d 4 -1.3353621
5: e 5 1.1811730
6: e 5 1.1811730
Adding 2 more options
tbl_A[fintersect(tbl_A[,.(a)], tbl_B[,.(a=x)])]
and
tbl_A[unique(tbl_A[tbl_B, nomatch=0L, which=TRUE])]
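In the second option, which = TRUE makes the join return the matching row numbers of tbl_A instead of the joined table; unique() collapses the duplicates introduced by tbl_B's repeated keys, and the outer tbl_A[...] then subsets by those indices.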
I'm not sure how expressive it is (apologies if not) but this seems to work:
tbl_A[,.(a,b,c,any(a == tbl_B[,x])), by = a][V4==TRUE,.(a,b,c)]
I'm sure it can be improved - I only found out about any() yesterday and still testing it :)
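When the match is on a single column, a plain vector scan reads almost exactly like the %in% syntax wished for in the question, at the cost of not using the key:
subset <- tbl_A[a %in% tbl_B$x]  # expressive, but a vector scan rather than a keyed join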

R: subsetting dataframe using elements from a vector

I have a data frame which includes a vector of individual identifiers (6-letter strings) and vectors of numbers.
I would like to subset it using a vector of elements (again, 6-letter identifiers) taken from another data frame.
Here is what I did (in a simplified version; my data frame has over 200 columns and 64 rows):
n = c(2, 3, 5, 7, 8, 1)
i = c("abazzz", "bbaxxx", "ccbeee","dddfre", "sdtyuo", "loatvz" )
c = c(10, 2, 10, 2, 12, 34)
df1 = data.frame(n, i, c)
attach(df1)
This is the vector whose elements I want to use for subsetting:
v<- c("abazzz", "ccbeee", "lllaaa")
This is what I do to subset
df2 <- df1[, i==abazzz | ccbeee | lllaaa]
This does not work; the error I get is "object 'abazzz' not found" (I tried with and without quotes, and I tried using subset(); the same error appears).
Moreover, I would like to avoid the or operator, as the vector I need to use for subsetting has about 50 elements. So, in words, what I would like to do is subset the data frame to extract only those individuals whose identifiers (column i) appear in the vector taken from the other data frame.
Writing this makes me think this must be very easy to do, but I can't figure it out by myself, I tried looking up similar questions but could not find what I was looking for. I hope someone can help me, suggest other posts or manuals so I can learn. Thanks!
Here's another nice option, using data.table's binary search (for efficiency):
library(data.table)
setkey(setDT(df1), i)[J(v), nomatch = 0]
# n i c
# 1: 2 abazzz 10
# 2: 5 ccbeee 10
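Here J(v) builds a one-column data.table of lookup values, so the subset is performed by binary search on the key rather than by scanning the whole column; nomatch = 0 silently drops identifiers such as "lllaaa" that have no matching row.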
Or, if you don't want to reorder the data set and want to keep the syntax similar to base R, you could set a secondary key instead (contributed by @Arun):
set2key(setDT(df1), i)
df1[i %in% v]
Or dplyr (for simplicity)
library(dplyr)
df1 %>% filter(i %in% v)
# n i c
# 1: 2 abazzz 10
# 2: 5 ccbeee 10
As a side note: as mentioned in the comments, never use attach().
(1)
Instead of
attach(df1)
df2<-df1[, i==abazzz | ccbeee | lllaaa]
detach(df1)
try
df2 <- with(df1, df1[i=="abazzz" | i=="ccbeee" | i=="lllaaa", ])
(2)
with(df1, df1[i %in% v, ])
Both yield
# n i c
# 1 2 abazzz 10
# 3 5 ccbeee 10
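The same result without any scoping helpers is df1[df1$i %in% v, ].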

How can I subset the negation of a key value using R's data.table package?
