Check each element of a vector against all rows of a data frame - R

I have a vector whose elements I want to check against each row of a data frame. It involves grep, since the elements to be checked are buried in other text.
With the help of this forum, I got this code:
mat <- data.frame(par = c('long A story', 'C story', 'blabla D'), val = 1:3)
vec <- c('Z', 'D', 'A')
mat$label <- NA
for (x in vec) {
  is.match <- lapply(mat$par, function(y) grep(x, y))
  mat$label[which(is.match > 0)] <- x
}
The problem is that it takes minutes to execute. Is there a way to vectorize this?

I've assumed you only want the first match in each case:
which.matches <- grep("[ZDA]", mat$par)
what.matches <- regmatches(mat$par, regexpr("[ZDA]", mat$par))
mat$label[which.matches] <- what.matches
mat
           par val label
1 long A story   1     A
2      C story   2  <NA>
3     blabla D   3     D
EDIT: Benchmarking
Unit: microseconds
           expr     min       lq  median       uq      max
1   answer(mat) 185.338 194.0925 199.073 209.1850  898.919
2 question(mat) 672.227 693.9610 708.601 725.6555 1457.046
EDIT 2:
As @mrdwab suggested, this can actually be used as a one-liner:
mat$label[grep("[ZDA]", mat$par)] <- regmatches(mat$par, regexpr("[ZDA]", mat$par))
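If vec changes, the character class can be built from it programmatically instead of hard-coding "[ZDA]". A minimal sketch, assuming each search term is a single character with no regex metacharacters:
pattern <- paste0("[", paste(vec, collapse = ""), "]")   # "[ZDA]" for the example vec
mat$label[grep(pattern, mat$par)] <- regmatches(mat$par, regexpr(pattern, mat$par))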

Related

Keeping vectors (from a list of vectors) whose elements do not have a proper subset within that same list (using Rcpp)

I have asked this question previously (see here) and received a satisfactory answer using the purrr package. However, this has proved to be a bottleneck in my program, so I would like to rewrite the section using Rcpp.
Proper subset: A proper subset S' of a set S is a subset that is strictly contained in S and so excludes S itself (note I am also excluding the empty set).
Suppose you have the following vectors in a list:
a = c(1,2)
b = c(1,3)
c = c(2,4)
d = c(1,2,3,4)
e = c(2,4,5)
f = c(1,2,3)
My aim is to keep only vectors which have no proper subset within the list, which in this example would be a, b and c.
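The containment test used throughout the answers below is all(x %in% y); for instance, it confirms that a is a proper subset of d but not of b:
all(a %in% d)   # TRUE:  every element of a appears in d
all(a %in% b)   # FALSE: 2 is not in b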
Previous Solution
library(purrr)
possibilities <- list(a,b,c,d,e,f)
keep(possibilities,
     map2_lgl(.x = possibilities,
              .y = seq_along(possibilities),
              ~ !any(map_lgl(possibilities[-.y], function(z) all(z %in% .x)))))
The idea here is to avoid the O(N^3) approach and use something of lower order instead. The other answer provided here will still be slow, since it is greater than O(N^2). Here is a solution below O(N^2), whose worst case is O(N^2) when all the elements are unique.
onlySet <- function(x){
  i <- 1
  repeat{
    # TRUE for each element of x[-1] that is NOT a superset of x[[1]]
    y <- sapply(x[-1], function(el) !all(is.element(x[[1]], el)))
    if(all(y)){
      if(i == length(x)) break  # every remaining vector has been checked
      else i <- i + 1
    }
    # drop the supersets of x[[1]] and rotate x[[1]] to the back
    x <- c(x[-1][y], x[1])
  }
  x
}
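A quick check on the example list (a, b and c survive, though onlySet returns them in a different order than the input):
onlySet(possibilities)
#[[1]]
#[1] 1 3
#
#[[2]]
#[1] 2 4
#
#[[3]]
#[1] 1 2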
Now to show the time difference, check out the following:
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
method1 <- function(a){
  mat <- outer(a, a, match_fun)
  a[colSums(mat) == 1]
}
poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(method1(poss), onlySet(poss))
Unit: milliseconds
expr min lq mean median uq max neval cld
method1(poss) 840.7919 880.12635 932.255030 889.36380 923.32555 1420.1077 100 b
onlySet(poss) 1.9845 2.07005 2.191647 2.15945 2.24245 3.3656 100 a
Have you tried optimising the solution in base R first? For example, the following reproduces your expected output and uses (faster) base R array routines:
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
mat <- outer(possibilities, possibilities, match_fun)
possibilities[colSums(mat) == 1]
#[[1]]
#[1] 1 2
#
#[[2]]
#[1] 1 3
#
#[[3]]
#[1] 2 4
Inspired by Onyambu's performant solution, here is another base R option using a recursive function:
f_recursive <- function(x, i = 1) {
  if (i > length(x)) return(x)
  # positions (within x[-i]) of elements that are supersets of x[[i]]
  idx <- which(sapply(x[-i], function(el) all(x[[i]] %in% el)))
  # map back to positions within x: entries at or after i are shifted by one
  idx <- idx + (idx >= i)
  if (length(idx) == 0) f_recursive(x, i + 1) else f_recursive(x[-idx], i + 1)
}
f_recursive(possibilities)
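which, like the outer()-based approach above, keeps only a, b and c:
#[[1]]
#[1] 1 2
#
#[[2]]
#[1] 1 3
#
#[[3]]
#[1] 2 4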
The performance is on par with Onyambu's solution.
poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(
method1(poss),
onlySet(poss),
f_recursive(poss))
#Unit: milliseconds
#              expr        min         lq       mean     median         uq         max neval
#     method1(poss) 682.558602 710.974831 750.325377 730.627996 765.040976 1200.562889   100
#     onlySet(poss)   1.700646   1.782713   1.870972   1.819820   1.918669    2.371646   100
# f_recursive(poss)   1.681120   1.737459   1.884685   1.806384   1.901582    3.217013   100

Subset list of vectors by position in a vectorized way

I have a list of vectors and I'm trying to select (for example) the 2nd and 4th element in each vector. I can do this using lapply:
list_of_vec <- list(c(1:10), c(10:1), c(1:10), c(10:1), c(1:10))
lapply(1:length(list_of_vec), function(i) list_of_vec[[i]][c(2,4)])
[[1]]
[1] 2 4
[[2]]
[1] 9 7
[[3]]
[1] 2 4
[[4]]
[1] 9 7
[[5]]
[1] 2 4
But is there a way to do this in a vectorized way, avoiding the apply functions? My problem is that my actual list_of_vec is fairly long, so lapply takes a while.
Solutions:
Option 1 @Athe's clever solution using do.call:
do.call(rbind, list_of_vec)[ ,c(2,4)]
Option 2 Using lapply more efficiently:
lapply(list_of_vec, `[`, c(2, 4))
Option 3 A vectorized solution:
# 0-based offset of each vector within unlist(list_of_vec)
starts <- cumsum(c(0, lengths(list_of_vec)[-length(list_of_vec)]))
matrix(unlist(list_of_vec)[c(starts + 2, starts + 4)], ncol = 2)
Option 4 the lapply solution you wanted to improve:
lapply(1:length(list_of_vec), function(i) list_of_vec[[i]][c(2,4)])
Data:
And a few datasets I will test them on:
# The original data
list_of_vec <- list(c(1:10), c(10:1), c(1:10), c(10:1), c(1:10))
# A long list with short elements
list_of_vec2 <- rep(list_of_vec, 1e5)
# A long list with long elements
list_of_vec3 <- lapply(list_of_vec, rep, 1e3)
list_of_vec3 <- rep(list_of_vec3, 1e4)
Benchmarking:
Original list:
Unit: microseconds
expr min lq mean median uq max neval cld
o1 2.276 2.8450 3.00417 2.845 3.129 10.809 100 a
o2 2.845 3.1300 3.59018 3.414 3.414 23.325 100 a
o3 3.698 4.1250 4.60558 4.267 4.552 20.480 100 a
o4 5.689 5.9735 17.52222 5.974 6.258 1144.606 100 a
Longer list, short elements:
Unit: milliseconds
expr min lq mean median uq max neval cld
o1 146.30778 146.88037 155.04077 149.89164 159.52194 184.92028 10 b
o2 185.40526 187.85717 192.83834 188.42749 190.32103 213.79226 10 c
o3 26.55091 27.27596 28.46781 27.48915 28.84041 32.19998 10 a
o4 407.66430 411.58054 426.87020 415.82161 437.19193 473.64265 10 d
Longer list, long elements:
Unit: milliseconds
expr min lq mean median uq max neval cld
o1 4855.59146 4978.31167 5012.0429 5025.97619 5072.9350 5095.7566 10 c
o2 17.88133 18.60524 103.2154 21.28613 195.0087 311.4122 10 a
o3 855.63128 872.15011 953.8423 892.96193 1069.7526 1106.1980 10 b
o4 37.92927 38.87704 135.6707 124.05127 214.6217 276.5814 10 a
Summary:
Looks like the vectorized solution wins out if the list is long and the elements are short, but lapply is the clear winner for a long list with longer elements. Some of the options output a list, others a matrix. So keep in mind what you want your output to be. Good luck!!!
If your list is composed of vectors of the same length, you could first transform it into a matrix and then get the columns you want.
matrix_of_vec <- do.call(rbind,list_of_vec)
matrix_of_vec[ ,c(2,4)]
Otherwise I'm afraid you'll have to stick to the apply family. Surprisingly, the most efficient way to do it is to use the parallel package to compute in parallel.
corenum <- parallel::detectCores() - 1
cl <- parallel::makeCluster(corenum)
parallel::clusterExport(cl, "list_of_vec")
parallel::parSapply(cl, list_of_vec, '[', c(2,4))
In this piece of code '[' is the name of the subsetting function and c(2,4) the argument you pass to it.
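One detail worth adding: once you're done, release the worker processes by stopping the cluster:
parallel::stopCluster(cl)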

Efficiently introduce new level on a factor vector

I have a long vector of class factor that contains NA values.
# simple example
x <- factor(c(NA,'A','B','C',NA), levels=c('A','B','C'))
For purposes of modeling, I wish to replace these NA values with a new factor level (e.g., 'Unknown') and set this level as the reference level.
Because the replacement level is not an existing level, simple replacement doesn't work:
# this won't work, since the replacement value is not an existing level of the factor
x[is.na(x)] <- '?'
x # returns: [1] <NA> A B C <NA> -- the NAs remain
# this doesn't work either:
replace(x, NA,'?')
I came up with a couple solutions, but both are kind of ugly and surprisingly slow.
f1 <- function(x, uRep='?'){
  # convert to character, replace NAs with Unknown, and convert back to factor
  stopifnot(is.factor(x))
  newLevels <- c(uRep, levels(x))
  x <- as.character(x)
  x[is.na(x)] <- uRep
  factor(x, levels = newLevels)
}
f2 <- function(x, uRep='?'){
  # add new level for Unknown, replace NAs with Unknown, and make Unknown first level
  stopifnot(is.factor(x))
  levels(x) <- c(levels(x), uRep)
  x[is.na(x)] <- uRep
  relevel(x, ref = uRep)
}
f3 <- function(x, uRep='?'){ # thanks to @HongOoi
  y <- addNA(x)
  levels(y)[length(levels(y))] <- uRep
  relevel(y, ref = uRep)
}
#test
f1(x) # works
f2(x) # works
f3(x) # works
Solution #2 is editing only the (relatively small) set of levels, plus one arithmetic op to relevel. I would have expected that to be faster than #1, which is casting to character and back to factor.
However, #2 is twice as slow on a benchmark vector of 10K elements with 10 levels and 10% NA.
x <- sample(factor(c(LETTERS[1:10],NA),levels=LETTERS[1:10]),10000,replace=TRUE)
library(microbenchmark)
microbenchmark(f1(x),f2(x),f3(x),times=500L)
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(x) 271.981 278.1825 322.4701 313.0360 360.7175 609.393 500
# f2(x) 651.728 703.2595 768.6756 747.9480 825.7800 1517.707 500
# f3(x) 808.246 883.2980 966.2374 927.5585 1061.1975 1779.424 500
Solution #3, my wrapper for the built-in addNA (mentioned in an answer below), was slower than either. addNA does some extra checks for NA values, sets the new level as the last one (requiring me to relevel), and names it NA (which then requires renaming by index before releveling, since the NA level is hard to access directly -- relevel(addNA(x), ref=NA_character_) doesn't work).
Is there a more efficient way to write this, or am I just hosed?
You can use fct_explicit_na followed by fct_relevel from the forcats package if you want a pre-fab solution. It's slower than your f1 function, but it still runs in a fraction of a second on a vector of length 100,000:
library(forcats)
x <- factor(c(NA,'A','B','C',NA), levels=c('A','B','C'))
x
[1] <NA> A    B    C    <NA>
Levels: A B C
x <- fct_relevel(fct_explicit_na(x, "Unknown"), "Unknown")
x
[1] Unknown A       B       C       Unknown
Levels: Unknown A B C
Timings on a vector of length 100,000:
x <- sample(factor(c(LETTERS[1:10],NA), levels=LETTERS[1:10]), 1e5, replace=TRUE)
microbenchmark(forcats = fct_relevel(fct_explicit_na(x, "Unknown"), "Unknown"),
f1 = f1(x),
unit="ms", times=100L)
Unit: milliseconds
expr min lq mean median uq max neval cld
forcats 7.624158 10.634761 15.303339 12.162105 15.513846 250.0516 100 b
f1 3.568801 4.226087 8.085532 5.321338 5.995522 235.2449 100 a
There is a built-in function, addNA, for this.
From ?factor:
addNA(x, ifany = FALSE)
addNA modifies a factor by turning NA into an extra level (so that NA values are counted in tables, for instance).
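For example, with the small x from the question, the NA values then show up as their own level (here counted by table); this is just a minimal illustration, not a full replacement for f1/f2/f3:
x <- factor(c(NA,'A','B','C',NA), levels=c('A','B','C'))
table(addNA(x))
#   A    B    C <NA>
#   1    1    1    2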

Return indices of rows whose elements (columns) all match a reference vector

Using the following code:
c <- NULL
for (a in 1:4){
  b <- seq(from = a, to = a + 5)
  c <- rbind(c, b)
}
c <- rbind(c, c); rm(a, b)
Results in this matrix:
> c
[,1] [,2] [,3] [,4] [,5] [,6]
b 1 2 3 4 5 6
b 2 3 4 5 6 7
b 3 4 5 6 7 8
b 4 5 6 7 8 9
b 1 2 3 4 5 6
b 2 3 4 5 6 7
b 3 4 5 6 7 8
b 4 5 6 7 8 9
How can I return row indices for rows matching a specific input?
For example, with a search term of,
z <- c(3,4,5,6,7,8)
I need the following returned,
[1] 3 7
This will be used in a fairly large data frame of test data, related to a time step column, to reduce the data by accumulating time steps for matching rows.
Question answered well by others. Due to my dataset size (9.5M rows), I came up with an efficient approach that took a couple steps.
1) Sort the big data frame 'dc' containing time steps to accumulate in column 1.
dc <- dc[order(dc[,2],dc[,3],dc[,4],dc[,5],dc[,6],dc[,7],dc[,8]),]
2) Create a new data frame with unique entries (excluding column 1).
dcU <- unique(dc[,2:8])
3) Write an Rcpp (C++) function that loops through the unique data frame, accumulating time from the original data frame while rows are equal and moving to the next step of the for loop when an unequal row is identified.
require(Rcpp)
getTsrc <-
'
NumericVector getT(NumericMatrix dc, NumericMatrix dcU)
{
  int k = 0;
  int n = dcU.nrow();
  NumericVector tU(n);
  for (int i = 0; i < n; i++)
  {
    // accumulate column 1 of dc (the time step) while the row matches the i-th unique row
    while ((dcU(i,0)==dc(k,1)) && (dcU(i,1)==dc(k,2)) && (dcU(i,2)==dc(k,3)) &&
           (dcU(i,3)==dc(k,4)) && (dcU(i,4)==dc(k,5)) && (dcU(i,5)==dc(k,6)) &&
           (dcU(i,6)==dc(k,7)))
    {
      tU[i] = tU[i] + dc(k,0);
      k++;
    }
  }
  return(tU);
}
'
cppFunction(getTsrc)
4) Convert function inputs to matrices.
dc1 <- as.matrix(dc)
dcU1 <- as.matrix(dcU)
5) Run the function and time it (returns time vector matching unique data frame)
pt <- proc.time()
t <- getT(dc1, dcU1)
print(proc.time() - pt)
user system elapsed
0.18 0.03 0.20
6) Self high-five and more coffee.
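If you also want the accumulated times attached to the unique rows from step 2, one option is to add them as a new column (the column name here is just an illustration; note too that calling the result t shadows base::t, so a different name may be preferable):
dcU$time_total <- t   # hypothetical column name; t is the vector returned by getT()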
You can use apply.
Here we use apply on c, across rows (the 1), applying the function function(x) all(x == z) to each row.
The which then pulls out the integer positions of the rows.
which(apply(c, 1, function(x) all(x == z)))
b b
3 7
EDIT: If your real data is having problems with this, and is only 9 columns (not too much typing), you could try a fully vectorized solution:
which(c[,1]==z[1] & c[,2]==z[2] & c[,3]==z[3] & c[,4]==z[4] & c[,5]==z[5] & c[,6]==z[6])
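If your real data has more columns than you want to type out, here is a sketch of the same fully vectorized idea without naming each column; it assumes, as above, that z has one entry per column of c:
which(rowSums(c == z[col(c)]) == ncol(c))
#b b
#3 7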
The answer by @jeremycg will definitely work, and is fast if you have many columns and few rows. However, you might be able to go a bit faster if you have lots of rows by avoiding using apply() on the row dimension.
Here's an alternative:
l <- unlist(apply(c, 2, list), recursive=F)
logic <- mapply(function(x,y)x==y, l, z)
which(.rowSums(logic, m=nrow(logic), n=ncol(logic)) == ncol(logic))
[1] 3 7
It works by first turning each column into a list. Then, it takes each column-list and searches it for the corresponding element in z. In the last step, you find out which rows had all columns with the corresponding match in z. Even though the last step is a row-wise operation, by using .rowSums (mind the . at the front there) we can specify the dimensions of the matrix, and get a speed-up.
Let's test the timings of the two approaches.
The functions
f1 <- function(){
which(apply(c, 1, function(x) all(x == z)))
}
f2 <- function(){
l <- unlist(apply(c, 2, list), recursive=F)
logic <- mapply(function(x,y)x==y, l, z)
which(.rowSums(logic, m=nrow(logic), n=ncol(logic)) == ncol(logic))
}
With 8 rows (dim in example):
> time <- microbenchmark(f1(), f2())
> time
Unit: microseconds
expr min lq mean median uq max neval cld
f1() 21.147 21.8375 22.86096 22.6845 23.326 30.443 100 a
f2() 42.310 43.1510 45.13735 43.7500 44.438 137.413 100 b
With 80 rows:
Unit: microseconds
expr min lq mean median uq max neval cld
f1() 101.046 103.859 108.7896 105.1695 108.3320 166.745 100 a
f2() 93.631 96.204 104.6711 98.1245 104.7205 236.980 100 a
With 800 rows:
> time <- microbenchmark(f1(), f2())
> time
Unit: microseconds
expr min lq mean median uq max neval cld
f1() 920.146 1011.394 1372.3512 1042.1230 1066.7610 31290.593 100 b
f2() 572.222 579.626 593.9211 584.5815 593.6455 1104.316 100 a
Note that my timing assessment only had 100 replicates each, and although these results are representative, there's a bit of variability in the number of rows required before the two methods are equal.
Regardless, I think my approach would probably be faster once you have 100+ rows.
Also, note that you can't simply transpose c to make f1() faster. First, the t() takes up time; second, because you're comparing to z, you would then just have to make a column-wise (after the transpose) comparison, so it's no different at that point.
Finally, I'm sure there's an even faster way to do this. My answer was just the first thing that came to mind, and didn't require any packages to install. This could be a lot faster if you wanted to use data.table. Also, if you had a lot of columns, you might even be able to parallelize this procedure (although, to be worthwhile the dataset would have to be immense).
If these timings aren't tolerable for your data, you might consider reporting back with the dimensions of your data set.
In your code c is not a data frame. Try transforming it into one:
c <- data.frame(c)

Convert list of lists of vectors to data frame in R

I have a list of lists of numeric vectors. I would like to
sum each vector
make each inner list of scalars into a vector
combine the vectors into one data frame where the vectors are rows
I hacked together the following, but it seems very clumsy and not the R way. I suspect that if I need rapply and four steps, I am doing something wrong (not in terms of results, but in terms of efficiency and understanding). Is there a right way to do this operation?
dat1 <- list(list(1:2, 3:4), list(5:6, 7:8))
dat2 <- rapply(dat1, sum, how="list")
dat3 <- lapply(dat2, unlist)
dat4 <- do.call(rbind, dat3)
dat5 <- data.frame(dat4)
So the desired output is
> dat5
X1 X2
1 3 7
2 11 15
> class(dat5)
[1] "data.frame"
Well, I'm not sure if it's any more efficient, but here's something similar:
as.data.frame(matrix(rapply(dat1,sum), nrow=length(dat1), byrow=TRUE))
X1 X2
1 3 7
2 11 15
This seems to get you there in one go, without much difference in efficiency from using matrix:
> as.data.frame(do.call(rbind, rapply(dat1, sum, how = "list")))
# X1 X2
# 1 3 7
# 2 11 15
> f1 <- function()
as.data.frame(matrix(rapply(dat1,sum), nrow=length(dat1), byrow=TRUE))
> f2 <- function()
as.data.frame(do.call(rbind, rapply(dat1, sum, how = "list")))
> library(microbenchmark)
> microbenchmark(f1(), f2())
# Unit: microseconds
# expr min lq median uq max neval
# f1() 91.047 92.7105 93.94 102.2855 199.305 100
# f2() 95.213 97.3150 98.43 101.8240 134.403 100
Almost a wash for this example.
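For completeness, a nested sapply spelling gives the same values, though the default column names come out as V1/V2 rather than X1/X2:
as.data.frame(t(sapply(dat1, sapply, sum)))
#  V1 V2
#1  3  7
#2 11 15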
