Remove rows with NA from data.table in R [duplicate]

This question already has answers here:
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 5 years ago.
I'd like to remove all rows of a data.table that contain Inf in any of its columns. So far, I've been using this approach:
DT <- data.table(col1 = c(1,2,3), col2 = c(4,Inf,5))
DT[,drop := apply(.SD, 1, function(x) any(is.infinite(x))), by = 1:nrow(DT)]
DT <- DT[(!drop)][,drop:=NULL]
which comes from this Stack Overflow question. However, this approach does not scale well to large amounts of data. Is there a better way to remove the rows containing Inf?

You can use rowSums to check if any element of a row is not finite.
DT[is.finite(rowSums(DT))]
Or you can use the fact that Inf * 0 is NaN (which complete.cases() counts as missing) and use complete.cases:
DT[complete.cases(DT*0)]
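To see why both one-liners work, here is a quick check with DT as created in the question's first line (before the drop column is added):
is.finite(rowSums(DT))   # rowSums(DT) is c(5, Inf, 8), so this gives TRUE FALSE TRUE
complete.cases(DT * 0)   # Inf * 0 is NaN, which complete.cases() treats as missing: TRUE FALSE TRUE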
Some benchmarking shows that rowSums is the fastest option for smaller datasets, while complete.cases is the fastest for larger datasets. A third variant, reducing is.finite() over the columns, is included in the benchmark for comparison:
require(microbenchmark)
microbenchmark(
  DT[is.finite(rowSums(DT))],
  DT[complete.cases(DT*0)],
  DT[DT[, Reduce('&', lapply(.SD, is.finite))]]
)
##
## nrow(DT) = 3000
## Unit: microseconds
## expr min lq mean median uq max neval cld
## DT[is.finite(rowSums(DT))] 786.797 839.235 864.0215 852.8465 884.756 1021.988 100 a
## DT[complete.cases(DT * 0)] 1265.658 1326.575 1363.3985 1350.0055 1386.377 1898.040 100 c
## DT[DT[, Reduce("&", lapply(.SD, is.finite))]] 1220.137 1275.030 1319.6226 1308.0555 1348.443 1624.023 100 b
##
## nrow(DT) = 300000
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## DT[is.finite(rowSums(DT))] 21.617935 22.687452 26.698070 25.75765 26.07942 87.56290 100 c
## DT[complete.cases(DT * 0)] 7.209252 7.567393 9.908503 10.17569 10.37473 71.31375 100 a
## DT[DT[, Reduce("&", lapply(.SD, is.finite))]] 11.786773 12.647652 14.128624 14.78512 15.05089 15.39542 100 b

Related

fastest way to add elements in list in R

zii=list()
zii[[1]]=c(1,2,3)
zii[[2]]=c(1,2,3)
zii[[3]]=c(1,2,3)
What is the best way to perform element-wise addition over the list, i.e.:
sum=c(1+1+1,2+2+2,3+3+3)=c(3,6,9)
I tried Reduce("+", zii) and it is slow. Any other suggestions?
I'm not sure whether this will be any faster, since data.frame() does a lot of validity checking:
> rowSums(data.frame(zii))
[1] 3 6 9
You could also try these if you get around to using microbenchmark. I'm guessing one of these will win, and my money is on the second one:
> rowSums(do.call(cbind, zii))
[1] 3 6 9
> colSums(do.call(rbind, zii))
[1] 3 6 9
Looks like I lost my bet:
require(microbenchmark)
microbenchmark(Reduce("+", zii),
               rowSums(data.frame(zii)),
               rowSums(do.call(cbind, zii)),
               colSums(do.call(rbind, zii)))
#------------------------------------------------------
Unit: microseconds
                         expr     min       lq      mean   median       uq      max neval cld
             Reduce("+", zii)  26.975  28.1870  31.02119  30.0560  30.9695   57.958   100  a
     rowSums(data.frame(zii)) 730.933 744.9015 776.36775 753.5785 787.2765 1129.953   100  c
 rowSums(do.call(cbind, zii))  65.770  67.3800  71.94039  68.7050  70.1335  176.627   100  b
 colSums(do.call(rbind, zii))  61.202  62.8830  66.21362  64.1060  65.9130  127.259   100  b
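Out of curiosity, here is a rough follow-up sketch (my own setup, not from the question) to check whether the ranking holds once the vectors are long rather than length three; the extra copy made by cbind/rbind costs more as the data grows, so the gap may shift:
zii_long <- replicate(10, rnorm(1e5), simplify = FALSE)  # 10 vectors of length 100,000
microbenchmark(Reduce("+", zii_long),
               rowSums(do.call(cbind, zii_long)),
               colSums(do.call(rbind, zii_long)),
               times = 100)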

Count the number of unique characters in a string

I have a dataframe where one of the columns is of type string.
I would like to count the number of unique/distinct characters in that string.
e.g.
"banana" -> 3
"he" -> 2
A reproducible example:
I have a data frame where one column is of type string. I need to filter out those rows where the string has only one distinct character.
col1 col2 col3
new york
qqqq
melbourne
aaaaaa
I would need to have a final data frame like
col1 col2 col3
new york
melbourne
So delete those rows completely.
This makes no assumption about "characters" being in letters and avoids making R data structures:
library(inline)
.char_unique_code <- "
  std::vector < std::string > s = as< std::vector < std::string > >(x);
  unsigned int input_size = s.size();
  std::vector < std::string > chrs(input_size);
  for (unsigned int i=0; i<input_size; i++) {
    std::string t = s[i];
    for (std::string::iterator chr=t.begin(); chr != t.end(); ++chr) {
      if (chrs[i].find(*chr) == std::string::npos) {
        chrs[i] += *chr;
      }
    }
  }
  return(wrap(chrs));
"
char_unique <- rcpp(sig=signature(x="std::vector < std::string >"),
                    body=.char_unique_code,
                    includes=c("#include <string>",
                               "#include <iostream>"))
nchar(char_unique("banana"))
## [1] 3
Why avoid making R lists?
library(stringr)
library(microbenchmark)
library(ggplot2)
str_char_ct_unique <- function(x) sum(!!str_count(x, letters))
char_ct_unique <- function(x) nchar(char_unique(x))
r_char_ct_unique <- function(x) length(unique(strsplit(x, "")[[1]]))
microbenchmark(stringr=str_char_ct_unique("banana"),
rcpp=char_ct_unique("banana"),
r=r_char_ct_unique("banana"),
times=1000) -> mb
## Unit: microseconds
## expr min lq mean median uq max neval cld
## stringr 125.978 129.1765 139.271061 130.9415 139.3870 334.563 1000 c
## rcpp 1.458 2.0160 3.002184 2.6345 3.1365 32.244 1000 a
## r 4.797 6.1070 8.292847 7.3380 8.0505 86.709 1000 b
Let's make a vectorized version of Cath's pure R solution (not bothering with the other one since it's way too constrained) and compare against a vector of small random strings:
library(random)
library(purrr)
char_ct_unique <- function(x) nchar(char_unique(x))
r_char_ct_unique <- function(x) map_int(map(x, function(x) unique(strsplit(x, "")[[1]])), length)
tst <- as.vector(randomStrings(n=100, len=20, unique=FALSE))
sum(char_ct_unique(tst) == r_char_ct_unique(tst))
## [1] 100
microbenchmark(rcpp=char_ct_unique(tst),
r=r_char_ct_unique(tst),
times=1000)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## rcpp 53.643 56.2375 66.69311 60.2740 68.178 250.992 1000 a
## r 683.420 759.4070 952.14407 822.8905 922.710 6513.508 1000 b
And, now for the 10,000 character random string:
dat <- readLines("https://gist.githubusercontent.com/hrbrmstr/f80b157b383134b37fb3/raw/534b4c79e7c51710c6db6961bc5dc5ec25c4242b/gistfile1.txt")
digest::digest(dat, "sha1", serialize=FALSE)
## [1] "6c6695dd2f314762c81e6e6891ec1c138a4f3a08"
nchar(dat)
## [1] 10000
char_ct_unique(dat) == r_char_ct_unique(dat)
## [1] TRUE
microbenchmark(rcpp=char_ct_unique(dat),
r=r_char_ct_unique(dat),
times=1000)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## rcpp 73.801 110.681 122.9091 118.330 139.373 308.602 1000 a
## r 377.556 430.703 533.9120 448.631 492.466 4275.568 1000 b
I forgot to do David's "fixed" version:
f_r_char_ct_unique <- function(x) map_int(map(x, function(x) unique(strsplit(x, "", fixed=TRUE)[[1]])), length)
and, let's make it more interesting:
dat <- c(dat, toupper(dat), tolower(dat))
microbenchmark(rcpp=char_ct_unique(dat),
r=r_char_ct_unique(dat),
fr=f_r_char_ct_unique(dat),
times=1000)
## Unit: microseconds
## expr min lq mean median uq max neval
## rcpp 218.299 284.143 331.319 332.281 358.1215 696.907 1000
## r 1266.976 1442.460 1720.320 1494.167 1634.7870 5896.685 1000
## fr 1260.027 1444.298 1769.664 1501.416 1652.8895 78457.729 1000
We can use str_count
library(stringr)
sum(!!str_count(str1, letters))
#[1] 3
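To unpack the !! idiom with the same "banana" input: str_count() returns one count per element of letters, and the double negation turns the non-zero counts into TRUE before summing.
str_count("banana", letters)
## [1] 3 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0
sum(!!str_count("banana", letters))
## [1] 3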
Update
Using the new dataset
i1 <- !sapply(df1$col1, function(x) any(str_count(x, letters)>1))
df1[i1,,drop=FALSE]
data
str1 <- "banana"
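For comparison, a plain base R sketch of the same filtering step, assuming (for illustration) a one-column data frame df1 holding the strings from the question:
df1 <- data.frame(col1 = c("new york", "qqqq", "melbourne", "aaaaaa"),
                  stringsAsFactors = FALSE)
# count distinct characters per string and keep rows with more than one
n_distinct <- vapply(strsplit(df1$col1, ""), function(ch) length(unique(ch)), integer(1))
df1[n_distinct > 1, , drop = FALSE]
##        col1
## 1  new york
## 3 melbourne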

(Efficiently) merge random keyed subset

I have two data.tables; I'd like to assign an element of one to the other at random from among those that match keys. The way I'm doing so right now is quite slow.
Let's get specific; here's some sample data:
dt1<-data.table(id=sample(letters[1:5],500,replace=T),var1=rnorm(500),key="id")
dt2<-data.table(id=c(rep("a",4),rep("b",8),rep("c",2),rep("d",5),rep("e",7)),
place=paste(sample(c("Park","Pool","Rec Center","Library"),
26,replace=T),
sample(26)),key="id")
I want to add two randomly chosen places to dt1 for each observation, but the places have to match on id.
Here's what I'm doing now:
get_place<-function(xx) sapply(xx,function(x) dt2[.(x),sample(place,1)])
dt1[,paste0("place",1:2):=list(get_place(id),get_place(id))]
This works, but it's quite slow: it took 66 seconds to run on my computer, basically an eon.
One issue is that I can't seem to take proper advantage of keying: something like dt2[.(dt1$id), mult="random"] would be perfect, but it doesn't appear to be possible.
Any suggestions?
A simple answer
dt2[.(dt1),as.list(c(
place=sample(place,size=2,replace=TRUE)
)),by=.EACHI,allow.cartesian=TRUE]
This approach is simple and illustrates data.table features like Cartesian joins and by=.EACHI, but is very slow because for each row of dt1 it (i) samples and (ii) coerces the result to a list.
A faster answer
nsamp <- 2
# for each id, record where its block of rows starts in dt2 (i0, zero-based) and its length (N)
dt3 <- dt2[.(unique(dt1$id)), list(i0 = .I[1] - 1L, .N), by = .EACHI]
# then draw nsamp columns of places by indexing dt2$place directly within each id's block
dt1[.(dt3), paste0("place", 1:nsamp) :=
      replicate(nsamp, dt2$place[i0 + sample(N, .N, replace = TRUE)], simplify = FALSE),
    by = .EACHI]
Using replicate with simplify=FALSE (as also in #bgoldst's answer) makes the most sense:
It returns a list of vectors which is the format data.table requires when making new columns.
replicate is the standard R function for repeated simulations.
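A tiny illustration of the first point (toy values, unrelated to the question's data): with simplify = FALSE, replicate() hands back a plain list with one vector per draw, which is exactly the shape := needs on the right-hand side when creating the paste0("place", 1:nsamp) columns.
replicate(2, sample(letters[1:4], 3, replace = TRUE), simplify = FALSE)
## [[1]]
## [1] "b" "a" "d"
##
## [[2]]
## [1] "c" "c" "a"
## (a 2-element list of length-3 vectors; the exact values are random)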
Benchmarks. We should look at varying several features and not modify dt1 as we go along:
# candidate functions
frank2 <- function(){
dt3 <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
dt1[.(dt3),
replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
,by=.EACHI]
}
david2 <- function(){
indx <- dt1[,.N, id]
sim <- dt2[.(indx),
replicate(2,sample(place,size=N,replace=TRUE),simplify=FALSE)
,by=.EACHI]
dt1[, sim[,-1,with=FALSE]]
}
bgoldst<-function(){
dt1[,
replicate(2,ave(id,id,FUN=function(x)
sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),simplify=F)
]
}
# simulation
size <- 1e6
nids <- 1e3
npls <- 2:15
dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]
# benchmarking
res <- microbenchmark(frank2(),david2(),bgoldst(),times=10)
print(res,order="cld",unit="relative")
which gives
Unit: relative
expr min lq mean median uq max neval cld
bgoldst() 8.246783 8.280276 7.090995 7.142832 6.579406 5.692655 10 b
frank2() 1.042862 1.107311 1.074722 1.152977 1.092632 0.931651 10 a
david2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
And if we switch around the parameters...
# new simulation
size <- 1e4
nids <- 10
npls <- 1e6:2e6
dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]
# new benchmarking
res <- microbenchmark(frank2(),david2(),times=10)
print(res,order="cld",unit="relative")
we see
Unit: relative
expr min lq mean median uq max neval cld
david2() 3.3008 3.2842 3.274905 3.286772 3.280362 3.10868 10 b
frank2() 1.0000 1.0000 1.000000 1.000000 1.000000 1.00000 10 a
As one might expect, which way is faster (collapsing dt1 in david2 or collapsing dt2 in frank2) depends on how much information is compressed by collapsing.
The perfect function for this purpose is ave(), since it allows running a function for each group of a vector, and automatically maps the return value back to the elements of the group:
set.seed(1);
dt1 <- data.table(id=sample(letters[1:5],500,replace=T), var1=rnorm(500), key='id' );
dt2 <- data.table(id=c(rep('a',4),rep('b',8),rep('c',2),rep('d',5),rep('e',7)), place=paste(sample(c('Park','Pool','Rec Center','Library'),26,replace=T), sample(26) ), key='id' );
dt1[,paste0('place',1:2):=replicate(2,ave(id,id,FUN=function(x) sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),simplify=FALSE)]
dt1;
## id var1 place1 place2
## 1: a -0.4252677 Rec Center 23 Park 12
## 2: a -0.3892372 Park 12 Library 22
## 3: a 2.6491669 Park 14 Rec Center 23
## 4: a -2.2891240 Rec Center 23 Park 14
## 5: a -0.7012317 Library 22 Park 12
## ---
## 496: e -1.0624084 Library 16 Library 16
## 497: e -0.9838209 Library 4 Library 26
## 498: e 1.1948510 Library 26 Pool 21
## 499: e -1.3353714 Pool 18 Library 26
## 500: e 1.8017255 Park 20 Pool 21
This should work with data.frames as well as data.tables.
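In case ave() is unfamiliar, here is a tiny standalone illustration of the group-wise mapping it performs (toy vectors, unrelated to the question's data):
ave(c(1, 2, 3, 10, 20), c("a", "a", "a", "b", "b"), FUN = mean)
## [1]  2  2  2 15 15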
Edit: Adding benchmarking
This solution seems fastest, at least after having made the correction suggested by Frank below.
frank<-function(){dt2[.(dt1),as.list(c(
place=sample(place,size=2,replace=TRUE))),
by=.EACHI,allow.cartesian=TRUE]}
david<-function(){
dt1[,paste0("place",1:2):=
lapply(1:2,function(x) get_place(id,.N)),by=id]}
bgoldst<-function(){dt1[,paste0("place",1:2):=
replicate(2,ave(id,id,FUN=function(x)
sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),
simplify=F)]}
microbenchmark(times=1000L,frank(),david(),bgoldst())
Unit: milliseconds
expr min lq mean median uq max neval cld
frank() 5.125843 5.353918 6.276879 5.496042 5.772051 15.57155 1000 b
david() 6.049172 6.305768 7.172360 6.455687 6.669202 93.06398 1000 c
bgoldst() 1.421330 1.521046 1.847821 1.570573 1.628424 89.60315 1000 a
When you run sapply over each row, you are basically not using any data.table capabilities. Alternatively, you can combine a binary join with the by parameter and sample only once per id. You could define get_place as follows:
get_place <- function(tempid, N) dt2[.(tempid), sample(place, N, replace = TRUE)]
Then simply do
dt1[, place1 := get_place(id, .N), by = id]
Or a general solution would be
indx <- 1:2
dt1[, paste0("place", indx) := lapply(indx, function(x) get_place(id, .N)), by = id]
Here's a benchmark on a somewhat bigger dt1:
size = 1e6
set.seed(123)
dt1 <- data.table(id=sample(letters[1:5],size,replace=TRUE),var1=rnorm(size),key="id")
Using the same functions as defined in #bgoldst's answer:
microbenchmark(times = 10L, frank(), david(), bgoldst())
# Unit: milliseconds
# expr min lq mean median uq max neval
# frank() 11627.68324 11771.4227 11887.1232 11804.6342 12012.4636 12238.1031 10
# david() 84.62109 122.1117 121.1003 123.5861 128.0042 132.3591 10
# bgoldst() 372.02267 400.8867 445.6231 421.3168 445.9076 709.5458 10
Here is another, faster variant on the same idea (as seen in #Frank's benchmark):
indx<- dt1[,.N, id]
sim <- dt2[.(indx),replicate(2,sample(place,size=N,replace=TRUE),simplify=FALSE),by=.EACHI]
dt1[,paste0("place",1:2):=`[.listof`(sim,-1)]
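If the `[.listof` call looks opaque: it subsets sim as a plain list, dropping the first (id) column, which is the same step written as sim[, -1, with = FALSE] inside david2() in Frank's benchmark above. An arguably more explicit spelling of the last line (same result assumed) is:
# drop sim's id column and assign the two sampled columns to dt1 by position
dt1[, paste0("place", 1:2) := sim[, -1, with = FALSE]]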

aggregate a matrix (or data.frame) by column name groups in R

I have a large matrix with about 3000 columns x 3000 rows. I'd like to aggregate (calculate the mean) grouped by column name for every row. Each column is named in a way similar to this (and in random order):
Tree Tree House House Tree Car Car House
I need the result (the mean of every row, aggregated by column name) to have the following columns:
Tree House Car
The tricky part (at least for me) is that I do not know all the column names, and they are all in random order!
You could try
res1 <- vapply(unique(colnames(m1)), function(x)
rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE),
numeric(nrow(m1)) )
Or
res2 <- sapply(unique(colnames(m1)), function(x)
rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE) )
identical(res1,res2)
#[1] TRUE
Another option might be to reshape into long form and then do the aggregation
library(data.table)
res3 <- dcast.data.table(setDT(melt(m1)), Var1~Var2, fun=mean)[, Var1 := NULL]
identical(res1, as.matrix(res3))
[1] TRUE
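For readers who have not melted a matrix before: melt() (here the reshape2 method for matrices, which the call above relies on) returns the long form with the row index in Var1, the column name in Var2 and the cell in value; dcast() then pivots Var2 back out, taking the mean over duplicated names. A tiny illustration, assuming reshape2 is attached:
library(reshape2)
m <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("Tree", "Tree")))
melt(m)
##   Var1 Var2 value
## 1    1 Tree     1
## 2    2 Tree     2
## 3    1 Tree     3
## 4    2 Tree     4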
Benchmarks
It seems like the first two methods are slightly faster for a 3000 x 3000 matrix:
set.seed(24)
m1 <- matrix(sample(0:40, 3000*3000, replace=TRUE),
ncol=3000, dimnames=list(NULL, sample(c('Tree', 'House', 'Car'),
3000,replace=TRUE)))
library(microbenchmark)
f1 <-function() {vapply(unique(colnames(m1)), function(x)
rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE),
numeric(nrow(m1)) )}
f2 <- function() {sapply(unique(colnames(m1)), function(x)
rowMeans(m1[,colnames(m1)== x,drop=FALSE], na.rm=TRUE) )}
f3 <- function() {dcast.data.table(setDT(melt(m1)), Var1~Var2, fun=mean)[,
Var1:= NULL]}
microbenchmark(f1(), f2(), f3(), unit="relative", times=10L)
# Unit: relative
# expr min lq mean median uq max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f2() 1.026208 1.027723 1.037593 1.034516 1.028847 1.079004 10
# f3() 4.529037 4.567816 4.834498 4.855776 4.930984 5.529531 10
data
set.seed(24)
m1 <- matrix(sample(0:40, 10*40, replace=TRUE), ncol=10,
dimnames=list(NULL, sample(c("Tree", "House", "Car"), 10, replace=TRUE)))
I came up with my own solution. I first just transpose the matrix (called test_mean) so the columns become rows, then:
# removing numbers from rownames
rownames(test_mean)<-gsub("[0-9.]","",rownames(test_mean))
#aggregate by rownames
test_mean<-aggregate(test_mean, by=list(rownames(test_mean)), FUN=mean)
matrixStats::rowMeans2, with some coercive help from data.table, for the win! Adding it to the benchmarking from #akrun, we get:
library(matrixStats)  # for rowMeans2()
f4 <- function() {
  ucn <- unique(colnames(m1))
  as.matrix(setnames(setDF(lapply(ucn, function(n) rowMeans2(m1, cols = colnames(m1) == n))),
                     ucn))
}
> all.equal(f4(),f1())
[1] TRUE
> microbenchmark(f1(), f2(), f3(), f4(), unit="relative", times=10L)
Unit: relative
expr min lq mean median uq max neval cld
f1() 1.837496 1.841282 1.823375 1.834471 1.818822 1.749826 10 b
f2() 1.760133 1.825352 1.817355 1.826257 1.838439 1.793824 10 b
f3() 15.451106 15.606912 15.847117 15.586192 16.626629 16.104648 10 c
f4() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a

Evaluate at which size data.table is faster than data.frame

Can someone please help me evaluate at what size of data frame data.table becomes faster for searches? In my use case the data frames will be 24,000 rows and 560,000 rows. Blocks of 40 rows are always singled out for further use.
Example:
DF is a data frame with 120 rows, 7 columns (x1 to x7); "string" occupies the first 40 rows of x1.
DF2 is 1000 times DF => 120,000 rows
For the size of DF, data.table is slower; for the size of DF2, it is faster.
Code:
> DT <- data.table(DF)
> setkey(DT, x1)
>
> DT2 <- data.table(DF2)
> setkey(DT2, x1)
>
> microbenchmark(DF[DF$x1=="string", ], unit="us")
Unit: microseconds
expr min lq median uq max neval
DF[DF$x1 == "string", ] 282.578 290.8895 297.0005 304.5785 2394.09 100
> microbenchmark(DT[.("string")], unit="us")
Unit: microseconds
expr min lq median uq max neval
DT[.("string")] 1473.512 1500.889 1536.09 1709.89 6727.113 100
>
>
> microbenchmark(DF2[DF2$x1=="string", ], unit="us")
Unit: microseconds
expr min lq median uq max neval
DF2[DF2$x1 == "string", ] 31090.4 34694.74 35537.58 36567.18 61230.41 100
> microbenchmark(DT2[.("string")], unit="us")
Unit: microseconds
expr min lq median uq max neval
DT2[.("string")] 1327.334 1350.801 1391.134 1457.378 8440.668 100
library(microbenchmark)
library(data.table)
timings <- sapply(1:10, function(n) {
  # 2^n distinct ids, 40 rows each; DT is the same data keyed on id
  DF <- data.frame(id=rep(as.character(seq_len(2^n)), each=40),
                   val=rnorm(40*2^n), stringsAsFactors=FALSE)
  DT <- data.table(DF, key="id")
  tofind <- unique(DF$id)[n-1]
  # compare data.frame subsetting with several data.table access patterns,
  # keeping the median timing (in ns) of each expression
  print(microbenchmark(DF[DF$id==tofind,],
                       DT[DT$id==tofind,],
                       DT[id==tofind],
                       `[.data.frame`(DT,DT$id==tofind,),
                       DT[tofind]), unit="ns")$median
})
matplot(1:10, log10(t(timings)), type="l", xlab="log2(n)", ylab="log10(median (ns))", lty=1)
legend("topleft", legend=c("DF[DF$id == tofind, ]",
"DT[DT$id == tofind, ]",
"DT[id == tofind]",
"`[.data.frame`(DT,DT$id==tofind,)",
"DT[tofind]"),
col=1:5, lty=1)
Jan. 2016: Update to data.table_1.9.7
data.table has seen a few updates since this was written (a bit more overhead was added to [.data.table as more arguments and robustness checks were built in, but auto-indexing was also introduced). Here's an updated version as of the January 13, 2016 build of 1.9.7 from GitHub:
The main innovation is that the third option now leverages auto-indexing. The main conclusion remains the same: if your table is of any nontrivial size (roughly larger than 500 observations), data.table's within-frame calling is faster.
(Notes about the updated plot: I made some minor changes (un-logging the y-axis, expressing times in microseconds, changing the x-axis labels, adding a title), plus one non-trivial change: I updated the microbenchmarks to stabilize the estimates, setting the times argument to as.integer(1e5/2^n).)
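A minimal sketch of the auto-indexing behaviour mentioned above (assuming a recent data.table where indices() is exported; the data here are made up for illustration):
library(data.table)
DT <- data.table(id = rep(as.character(1:1000), each = 40), val = rnorm(40000))
indices(DT)               # NULL: no secondary index yet
invisible(DT[id == "7"])  # the first DT[id == ...] subset auto-creates an index on id
indices(DT)               # now "id"; later DT[id == ...] subsets reuse it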
