Efficiently introduce new level on a factor vector - r

I have a long vector of class factor that contains NA values.
# simple example
x <- factor(c(NA,'A','B','C',NA), levels=c('A','B','C'))
For purposes of modeling, I wish to replace these NA values with a new factor level (e.g., 'Unknown') and set this level as the reference level.
Because the replacement level is not an existing level, simple replacement doesn't work:
# this won't work, since the replacement value is not an existing level of the factor
x[is.na(x)] <- '?'
x # returns: [1] <NA> A B C <NA> -- the NAs remain
# this doesn't work either:
replace(x, NA,'?')
I came up with a couple solutions, but both are kind of ugly and surprisingly slow.
f1 <- function(x, uRep='?'){
# convert to character, replace NAs with Unknown, and convert back to factor
stopifnot(is.factor(x))
newLevels <- c(uRep,levels(x))
x <- as.character(x)
x[is.na(x)] <- uRep
factor(x, levels=newLevels)
}
f2 <- function(x, uRep='?'){
# add new level for Unknown, replace NAs with Unknown, and make Unknown first level
stopifnot(is.factor(x))
levels(x) <- c(levels(x),uRep)
x[is.na(x)] <- uRep
relevel(x, ref=uRep)
}
f3 <- function(x, uRep='?'){ # thanks to #HongOoi
y <- addNA(x)
levels(y)[length(levels(y))]<-uRep
relevel(y, ref=uRep)
}
#test
f1(x) # works
f2(x) # works
f3(x) # works
Solution #2 is editing only the (relatively small) set of levels, plus one arithmetic op to relevel. I would have expected that to be faster than #1, which is casting to character and back to factor.
However, #2 is twice as slow on a benchmark vector of 10K elements with 10 levels and 10% NA.
x <- sample(factor(c(LETTERS[1:10],NA),levels=LETTERS[1:10]),10000,replace=TRUE)
library(microbenchmark)
microbenchmark(f1(x),f2(x),f3(x),times=500L)
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(x) 271.981 278.1825 322.4701 313.0360 360.7175 609.393 500
# f2(x) 651.728 703.2595 768.6756 747.9480 825.7800 1517.707 500
# f3(x) 808.246 883.2980 966.2374 927.5585 1061.1975 1779.424 500
Solution #3, my wrapper for the built-in addNA (mentioned in answer below) was slower than either. addNA does some extra checks for NA values and sets the new level as the last one (requiring me to relevel) and named NA (which then requires renaming by index before releveling, since NA is hard to access -- relevel(addNA(x), ref=NA_character_)) doesn't work).
Is there a more efficient way to write this, or am I just hosed?

You can use fct_explicit_na followed by fct_relevel from the forcats package if you want a pre-fab solution. It's slower than your f1 function, but it still runs in a fraction of a second on a vector of length 100,000:
library(forcats)
x <- factor(c(NA,'A','B','C',NA), levels=c('A','B','C'))
[1] <NA> A B C <NA>
Levels: A B C
x = fct_relevel(fct_explicit_na(x, "Unknown"), "Unknown")
[1] Unknown A B C Unknown
Levels: Unknown A B C
Timings on a vector of length 100,000:
x <- sample(factor(c(LETTERS[1:10],NA), levels=LETTERS[1:10]), 1e5, replace=TRUE)
microbenchmark(forcats = fct_relevel(fct_explicit_na(x, "Unknown"), "Unknown"),
f1 = f1(x),
unit="ms", times=100L)
Unit: milliseconds
expr min lq mean median uq max neval cld
forcats 7.624158 10.634761 15.303339 12.162105 15.513846 250.0516 100 b
f1 3.568801 4.226087 8.085532 5.321338 5.995522 235.2449 100 a

There is a builtin function addNA for this.
From ?factor:
addNA(x, ifany = FALSE)
addNA modifies a factor by turning NA into an extra level (so that NA values are counted in tables, for instance).

Related

Keeping vectors (from list of vectors) whose elements do not have a proper subset within that same list (using RCPP)

I have asked this question previously (see here) and received a satisfactory answer using the purr package. However, this has proved to be a bottle neck in my program so I would like to rewrite the section using the RCPP package.
Proper subset: A proper subset S' of a set S is a subset that is strictly contained in S and so excludes S itself (note I am also excluding the empty set).
Suppose you have the following vectors in a list:
a = c(1,2)
b = c(1,3)
c = c(2,4)
d = c(1,2,3,4)
e = c(2,4,5)
f = c(1,2,3)
My aim is to keep only vectors which have no proper subset within the list, which in this example would be a, b and c.
Previous Solution
library(purr)
possibilities <- list(a,b,c,d,e,f)
keep(possibilities,
map2_lgl(.x = possibilities,
.y = seq_along(possibilities),
~ !any(map_lgl(possibilities[-.y], function(z) all(z %in% .x)))))
The notion here is to avoid the O(N^3) and use a less order instead. The other answer provided here will be slow still since it is greater than O(N^2). Here is a solution with less than O(N^2), where the worst case scenario is O(N^2) when all the elements are unique.
onlySet <- function(x){
i <- 1
repeat{
y <- sapply(x[-1], function(el)!all(is.element(x[[1]], el)))
if(all(y)){
if(i==length(x)) break
else i <- i+1
}
x <- c(x[-1][y], x[1])
}
x
}
Now to show the time difference, check out the following:
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
method1 <- function(a){
mat <- outer(a, a, match_fun)
a[colSums(mat) == 1]
}
poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(method1(poss), onlySet(poss))
Unit: milliseconds
expr min lq mean median uq max neval cld
method1(poss) 840.7919 880.12635 932.255030 889.36380 923.32555 1420.1077 100 b
onlySet(poss) 1.9845 2.07005 2.191647 2.15945 2.24245 3.3656 100 a
Have you tried optimising the solution in base R first? For example, the following reproduces your expected output and uses (faster) base R array routines:
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
mat <- outer(possibilities, possibilities, match_fun)
possibilities[colSums(mat) == 1]
#[[1]]
#[1] 1 2
#
#[[2]]
#[1] 1 3
#
#[[3]]
#[1] 2 4
Inspired by Onyambu's performant solution, here is another base R option using a recursive function
f_recursive <- function(x, i = 1) {
if (i > length(x)) return(x)
idx <- which(sapply(x[-i], function(el) all(x[[i]] %in% el))) + 1
if (length(idx) == 0) f_recursive(x, i + 1) else f_recursive(x[-idx], i + 1)
}
f(possibilities)
The performance is on par with Onyambu's solution.
poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(
method1(poss),
onlySet(poss),
f_recursive(poss))
#Unit: milliseconds
# expr min lq mean median uq
# method1(poss) 682.558602 710.974831 750.325377 730.627996 765.040976
# onlySet(poss) 1.700646 1.782713 1.870972 1.819820 1.918669
# f_recursive(poss) 1.681120 1.737459 1.884685 1.806384 1.901582
# max neval
# 1200.562889 100
# 2.371646 100
# 3.217013 100

Efficient Way to Vectorize a Function over each Row of a Data-Frame

I have a data frame where one column is a list of time-stamps. I need to annotate which time-stamps are valid or not, depending on whether or not they are close enough (i.e., within 1 second) to an element of another list of valid time-stamps. For this I have a helper function.
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974, 344)
actual_times <- c(200, 210, 215, 220.5, 260)
strain <- c("green", "green", "green", "green", "green", "green")
valid_or_not <- c(rep("NULL", 6))
df <- data.frame(strain, actual_times, valid_or_not)
My data-frame looks like this:
strain actual_times valid_or_not
1 green 200.0 NULL
2 green 210.0 NULL
3 green 215.0 NULL
4 green 220.5 NULL
5 green 260.0 NULL
My helper (that checks to see if an actual_time is within 1 second of a valid time) is as follows:
valid_or_not_fxn<- function(actual_time){
c = "not valid"
for (i in 1:length(valid_times))
if (abs(valid_times[i] - actual_time) <= 1) {
c <- "valid"
} else {
}
return(c)
}
What I've tried to do is loop through the entire data-frame using a for loop with this helper function.
However....it's really slow (on my real data-set) because it's a nested loop cross-comparing two lists that are 100s of elements long. I can't figure out to optimize this.
df$valid_or_not <- as.character(df$valid_or_not)
for (i in 1:nrow(df))
print(df[i, "valid_or_not"])
df[i, "valid_or_not"] <- valid_or_not_fxn(df[i, "actual_times"])
Thank you for any help!
No matter what you do, you essentially have to do at least length(valid_times) comparisons. Probably better off looping over valid_times and comparing each item of that vector to your actual_times column as a vectorised operation. That way you'd only have 5 loop iterations.
One way of doing this is then:
df$test <- Reduce(`|`, lapply(valid_times, function(x) abs(df$actual_times - x) <= 1))
# strain actual_times valid_or_not
#1 green 200.0 FALSE
#2 green 210.0 FALSE
#3 green 215.0 FALSE
#4 green 220.5 TRUE
#5 green 260.0 TRUE
100K rows in df and 1000 valid_times test finishes in <4 seconds:
df2 <- df[sample(1:5,1e5,replace=TRUE),]
valid_times2 <- valid_times[sample(1:5,1000,replace=TRUE)]
system.time(Reduce(`|`, lapply(valid_times2, function(x) abs(df2$actual_times - x) <= 1)))
# user system elapsed
# 3.13 0.40 3.54
The easist way to do it is avoiding data frame operations. So you can do this check and populate the valid_or_not vector before combining them into the dataframe as:
valid_or_not[sapply(actual_times, function(x) any(abs(x - valid_times) <= 1))] <- "valid"
Note that, by this line, the valid_or_not vector is indexed with an equal length vector of boolean values (whether the condition is satisfied, T or F). So only TRUE valued indices from the vector are updated. valid_or_not and actual_times vectors must be of same length where as valid_times vector can be of different length.
By the way "plying" a for loop does not enhance the performance significantly since it is just a "wrapper" for "for" loops. Only performance increase comes from avoiding intermediary objects due to neater and more concise style of code and avoiding redundant copying in some cases. The same case is true for the Vectorize function: It just wraps the for loop that goes through the function and in for example "outer" function, the FUN must be "vectorized" in that manner. In fact it does not give the performance of a truely vectorized operation. In my example the performance enhancement comes from the substitution of the for loop with the "any" function.
And because of some kind of a "bug", subsetting data frames has an important penalty. As Hadley Wickham explains in Performance topic of Advanced-R:
Extracting a single value from a data frame
The following microbenchmark shows five ways to access a single value
(the number in the bottom-right corner) from the built-in mtcars
dataset. The variation in performance is startling: the slowest method
takes 30x longer than the fastest. There’s no reason that there has to
be such a huge difference in performance. It’s simply that no one has
had the time to fix it.
microbenchmark(
"[32, 11]" = mtcars[32, 11],
"$carb[32]" = mtcars$carb[32],
"[[c(11, 32)]]" = mtcars[[c(11, 32)]],
"[[11]][32]" = mtcars[[11]][32],
".subset2" = .subset2(mtcars, 11)[32] )
## Unit: nanoseconds
## expr min lq mean median uq max neval
## [32, 11] 15,300 16,300 18354 17,000 17,800 76,400 100
## $carb[32] 8,860 9,930 12836 10,600 11,600 85,400 100
## [[c(11, 32)]] 7,200 8,110 9293 8,780 9,350 21,300 100
## [[11]][32] 6,330 7,580 8377 8,100 8,690 20,900 100
## .subset2 334 566 4461 669 800 368,000 100
The most efficient way to subset a data frame is to use the .subset2 method. Your poor performance can mostly be attributed to this fact.
And as last notes:
If the "else" in your conditional statment does not do anything (just like in your example: else {}) you do not have to include it. R has some lazy operations (does not evaluate a statement as long as it is not executed inside the code), but that does not mean it always skips non-executed code portions.
The "character" values in your example are in fact categoric: Only
one of few values can be chosen for each entry. So there is no need
to store them as "characters" and they can be converted into factors
(which are just integer values). This can also enhance
performance.
An addition for #thelatemail 's working solution:
In R, "or" (|) operator isn't lazy while "any" function is. A ply combining or's work till the end while "any" function stops at the first encounter of a TRUE value - which enhances the performance (I will write a blog post on this topic ASAP). And vectorized "any" is almost as fast as native C code while *ply can be slightly faster than for loops in R (That I will benchmark and show in another blog post soon).
Some benchmarks showing this:
Pure "any" and | comparison:
> microbenchmark(any(T,F,F,F,F,F), T|F|F|F|F|F)
Unit: nanoseconds
expr min lq mean median uq max neval cld
any(T, F, F, F, F, F) 274 307.0 545.86 366.5 429.5 16380 100 a
T | F | F | F | F | F 597 626.5 903.47 668.5 730.0 18966 100 a
Pure "Reduce" and vectorization comparison:
> vec0 <- rep(1, 1e6)
> microbenchmark(Reduce("+", vec0), sum(vec0), times = 10)
Unit: microseconds
expr min lq mean median uq
Reduce("+", vec0) 308415.064 310071.953 318503.6048 312940.6355 317648.354
sum(vec0) 930.625 936.775 944.2416 943.5425 949.257
max neval cld
369864.993 10 b
962.349 10 a
And a reduced "|" vs. vectorized "any" comparison (for an extreme case). "any" beats by more than 1e5 times:
> vec1 <- c(T, rep(F, 1e6))
> microbenchmark(Reduce("|", vec1), any(vec1), times = 10)
Unit: nanoseconds
expr min lq mean median uq
Reduce("|", vec1) 394040518 395792399 402703632.6 399191803 400990304
any(vec1) 154 267 1932.5 2588 2952
max neval cld
441805451 10 b
3420 10 a
When the single TRUE is at the very end (so "any" is not lazy anymore and has to check the whole vector), "any" still beats by more than 400 times:
> vec2 <- c(rep(F, 1e6), T)
> microbenchmark(Reduce("|", vec2), any(vec2), times = 10)
Unit: microseconds
expr min lq mean median uq
Reduce("|", vec2) 396625.318 401744.849 416732.5087 407447.375 424538.222
any(vec2) 736.975 787.047 857.5575 832.137 926.076
max neval cld
482116.632 10 b
1013.732 10 a

R missing date calculation [duplicate]

I have a huge vector which has a couple of NA values, and I'm trying to find the max value in that vector (the vector is all numbers), but I can't do this because of the NA values.
How can I remove the NA values so that I can compute the max?
Trying ?max, you'll see that it actually has a na.rm = argument, set by default to FALSE. (That's the common default for many other R functions, including sum(), mean(), etc.)
Setting na.rm=TRUE does just what you're asking for:
d <- c(1, 100, NA, 10)
max(d, na.rm=TRUE)
If you do want to remove all of the NAs, use this idiom instead:
d <- d[!is.na(d)]
A final note: Other functions (e.g. table(), lm(), and sort()) have NA-related arguments that use different names (and offer different options). So if NA's cause you problems in a function call, it's worth checking for a built-in solution among the function's arguments. I've found there's usually one already there.
The na.omit function is what a lot of the regression routines use internally:
vec <- 1:1000
vec[runif(200, 1, 1000)] <- NA
max(vec)
#[1] NA
max( na.omit(vec) )
#[1] 1000
Use discard from purrr (works with lists and vectors).
discard(v, is.na)
The benefit is that it is easy to use pipes; alternatively use the built-in subsetting function [:
v %>% discard(is.na)
v %>% `[`(!is.na(.))
Note that na.omit does not work on lists:
> x <- list(a=1, b=2, c=NA)
> na.omit(x)
$a
[1] 1
$b
[1] 2
$c
[1] NA
?max shows you that there is an extra parameter na.rm that you can set to TRUE.
Apart from that, if you really want to remove the NAs, just use something like:
myvec[!is.na(myvec)]
Just in case someone new to R wants a simplified answer to the original question
How can I remove NA values from a vector?
Here it is:
Assume you have a vector foo as follows:
foo = c(1:10, NA, 20:30)
running length(foo) gives 22.
nona_foo = foo[!is.na(foo)]
length(nona_foo) is 21, because the NA values have been removed.
Remember is.na(foo) returns a boolean matrix, so indexing foo with the opposite of this value will give you all the elements which are not NA.
You can call max(vector, na.rm = TRUE). More generally, you can use the na.omit() function.
I ran a quick benchmark comparing the two base approaches and it turns out that x[!is.na(x)] is faster than na.omit. User qwr suggested I try purrr::dicard also - this turned out to be massively slower (though I'll happily take comments on my implementation & test!)
microbenchmark::microbenchmark(
purrr::map(airquality,function(x) {x[!is.na(x)]}),
purrr::map(airquality,na.omit),
purrr::map(airquality, ~purrr::discard(.x, .p = is.na)),
times = 1e6)
Unit: microseconds
expr min lq mean median uq max neval cld
purrr::map(airquality, function(x) { x[!is.na(x)] }) 66.8 75.9 130.5643 86.2 131.80 541125.5 1e+06 a
purrr::map(airquality, na.omit) 95.7 107.4 185.5108 129.3 190.50 534795.5 1e+06 b
purrr::map(airquality, ~purrr::discard(.x, .p = is.na)) 3391.7 3648.6 5615.8965 4079.7 6486.45 1121975.4 1e+06 c
For reference, here's the original test of x[!is.na(x)] vs na.omit:
microbenchmark::microbenchmark(
purrr::map(airquality,function(x) {x[!is.na(x)]}),
purrr::map(airquality,na.omit),
times = 1000000)
Unit: microseconds
expr min lq mean median uq max neval cld
map(airquality, function(x) { x[!is.na(x)] }) 53.0 56.6 86.48231 58.1 64.8 414195.2 1e+06 a
map(airquality, na.omit) 85.3 90.4 134.49964 92.5 104.9 348352.8 1e+06 b
Another option using complete.cases like this:
d <- c(1, 100, NA, 10)
result <- complete.cases(d)
output <- d[result]
output
#> [1] 1 100 10
max(output)
#> [1] 100
Created on 2022-08-26 with reprex v2.0.2

Insertion sort in r with error "missing value where TRUE/FALSE needed"

I want to build insertion sort algorithm in r:
Given a vector x, let the initial unsorted vector u be equal to x,
and the initial sorted vector s be a vector of length 0.
Remove first element of u and insert it into s so that s is still sorted.
If u is not empty then go back to step 2.
Here is my code:
x <- round(runif(1000,1,100)
u <- x
s <- vector(mode="numeric", length=0
for(i in 1:length(u)){
number <-u[i]
u <- u[-i]
for(j in 1:length(s)){
if(length(s) == 0){
s <- c(s,number)
break
}else{
if(s[j]>=number){
s <- append(s,number,j-1)
break
}
}
if(s[length(s)]<number){
s <- c(s,number)
}
}
}
First of all, when length of u = 500 it throws me an error:
Error in if (s[j] >= number) { : missing value where TRUE/FALSE needed
next, it sorted incorrectly(for example it could be more ones than in original vector u, or for example less twos then in original vector u)
So I have two questions:
1)How can we fix that in THIS code?
2)Can you suggest another code, which is more efficient than my?
P.S Of course the code should be without sort and order command.
Many thanks
Your first question is an easy fix: when you extract a number out of your vector u, what you are really doing, is just drawing a random number from the sample without replacement. So just always take the first value.
# Change the current to this:
number <-u[1]
u <- u[-1]
For your second question: what a fun exercise! My go at it was to just implement the selection sort pseudo-code from Wikipedia, (which you could just do) but with blindfolds: I cannot look into "the bag" of x but only see my just-drawn item - this I found to be more like what you are asking for.
How do I go about solving this? Simple: I draw a value from x. Call it my control variable. Then for every draw I put the new value into a pile of small if less than (or equal) to control. Otherwise I put it into the large pile. When I have distributed all values, I do this algorithm again for each pile. I continue until my piles are all of size 1. The implementation of this below.
mysort <- function(x){
if(length(x) <= 1){
return(x) ## If nothing to sort, return itself
}
control <- x[1]
small <- c()
big <- c()
for(test in x[-1]){
if(test <= control){
small <- c(small,test) ## Less than control in small
}
if(test > control){
big <- c(big,test) ## Bigger than control in big
}
}
## Sort the new piles with the same procedure recursively.
small <- mysort(small)
big <- mysort(big)
## Return the improved order
c(small,control,big)
}
mysort(c(2,1,1,2,2,2,3,3,-3,-Inf,Inf,0,pi))
# [1] -Inf -3.000000 0.000000 1.000000 1.000000 2.000000 2.000000 2.000000
# [9] 2.000000 3.000000 3.000000 3.141593 Inf
We can compare speeds with the microbenchmark package, if we wrap your implementation (without the superfluous while-loop) in a function yoursort.
library(microbenchmark)
a <- rnorm(1e3)
microbenchmark(b <- mysort(a),times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# b <- mysort(a) 37.76747 39.2302 41.96171 40.99288 43.07412 47.85377 10
microbenchmark(c <- yoursort(a),times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# c <- yoursort(a) 786.4544 808.2312 861.8072 840.7868 879.4946 1059.913 10
microbenchmark(sort(a),times = 10)
# Unit: microseconds
# expr min lq mean median uq max neval
# sort(a) 192.763 194.384 242.7633 201.1335 263.497 390.386 10
Neither are any match to the already implemented sort function.
And of course, do they actually do the correct sorting?
any(b != sort(a)) ## Are there any elements that do not match?
# [1] FALSE
any(c != sort(a))
# [1] FALSE

check each element of vector against all rows of data frame

I have a vector, for which I want to check each element against each row of a data frame. It involves a grep function, since the elements to be checked are buried in other text.
With help of this forum, I got this code:
mat=data.frame(par=c('long A story','C story', 'blabla D'),val=1:3)
vec=c('Z','D','A')
mat$label <- NA
for (x in vec){
is.match <- lapply(mat$par,function(y) grep(x, y))
mat$label[which(is.match > 0)] <- x
}
The problem is that it takes minutes to execute. Is there a way to vectorize this?
I've assumed you only want the first match in each case:
which.matches <- grep("[ZDA]", mat$par)
what.matches <- regmatches(mat$par, regexpr("[ZDA]", mat$par))
mat$label[which.matches] <- what.matches
mat
par val label
1 long A story 1 A
2 C story 2 <NA>
3 blabla D 3 D
EDIT: Benchmarking
Unit: microseconds
expr min lq median uq max
1 answer(mat) 185.338 194.0925 199.073 209.1850 898.919
2 question(mat) 672.227 693.9610 708.601 725.6555 1457.046
EDIT 2:
As #mrdwab suggested, this can actually be used as a one-liner:
mat$label[grep("[ZDA]", mat$par)] <- regmatches(mat$par, regexpr("[ZDA]", mat$par))

Resources