R missing date calculation [duplicate]

R missing date calculation [duplicate] - r

I have a huge vector which has a couple of NA values, and I'm trying to find the max value in that vector (the vector is all numbers), but I can't do this because of the NA values.
How can I remove the NA values so that I can compute the max?

Trying ?max, you'll see that it actually has a na.rm = argument, set by default to FALSE. (That's the common default for many other R functions, including sum(), mean(), etc.)
Setting na.rm=TRUE does just what you're asking for:
d <- c(1, 100, NA, 10)
max(d, na.rm=TRUE)
If you do want to remove all of the NAs, use this idiom instead:
d <- d[!is.na(d)]
A final note: Other functions (e.g. table(), lm(), and sort()) have NA-related arguments that use different names (and offer different options). So if NA's cause you problems in a function call, it's worth checking for a built-in solution among the function's arguments. I've found there's usually one already there.

The na.omit function is what a lot of the regression routines use internally:
vec <- 1:1000
vec[runif(200, 1, 1000)] <- NA
max(vec)
#[1] NA
max( na.omit(vec) )
#[1] 1000

Use discard from purrr (works with lists and vectors).
discard(v, is.na)
The benefit is that it is easy to use pipes; alternatively use the built-in subsetting function [:
v %>% discard(is.na)
v %>% `[`(!is.na(.))
Note that na.omit does not work on lists:
> x <- list(a=1, b=2, c=NA)
> na.omit(x)
$a
[1] 1
$b
[1] 2
$c
[1] NA

?max shows you that there is an extra parameter na.rm that you can set to TRUE.
Apart from that, if you really want to remove the NAs, just use something like:
myvec[!is.na(myvec)]

Just in case someone new to R wants a simplified answer to the original question
How can I remove NA values from a vector?
Here it is:
Assume you have a vector foo as follows:
foo = c(1:10, NA, 20:30)
running length(foo) gives 22.
nona_foo = foo[!is.na(foo)]
length(nona_foo) is 21, because the NA values have been removed.
Remember is.na(foo) returns a boolean matrix, so indexing foo with the opposite of this value will give you all the elements which are not NA.

You can call max(vector, na.rm = TRUE). More generally, you can use the na.omit() function.

I ran a quick benchmark comparing the two base approaches and it turns out that x[!is.na(x)] is faster than na.omit. User qwr suggested I try purrr::dicard also - this turned out to be massively slower (though I'll happily take comments on my implementation & test!)
microbenchmark::microbenchmark(
purrr::map(airquality,function(x) {x[!is.na(x)]}),
purrr::map(airquality,na.omit),
purrr::map(airquality, ~purrr::discard(.x, .p = is.na)),
times = 1e6)
Unit: microseconds
expr min lq mean median uq max neval cld
purrr::map(airquality, function(x) { x[!is.na(x)] }) 66.8 75.9 130.5643 86.2 131.80 541125.5 1e+06 a
purrr::map(airquality, na.omit) 95.7 107.4 185.5108 129.3 190.50 534795.5 1e+06 b
purrr::map(airquality, ~purrr::discard(.x, .p = is.na)) 3391.7 3648.6 5615.8965 4079.7 6486.45 1121975.4 1e+06 c
For reference, here's the original test of x[!is.na(x)] vs na.omit:
microbenchmark::microbenchmark(
purrr::map(airquality,function(x) {x[!is.na(x)]}),
purrr::map(airquality,na.omit),
times = 1000000)
Unit: microseconds
expr min lq mean median uq max neval cld
map(airquality, function(x) { x[!is.na(x)] }) 53.0 56.6 86.48231 58.1 64.8 414195.2 1e+06 a
map(airquality, na.omit) 85.3 90.4 134.49964 92.5 104.9 348352.8 1e+06 b

Another option using complete.cases like this:
d <- c(1, 100, NA, 10)
result <- complete.cases(d)
output <- d[result]
output
#> [1] 1 100 10
max(output)
#> [1] 100
Created on 2022-08-26 with reprex v2.0.2

Related

Efficiently introduce new level on a factor vector

I have a long vector of class factor that contains NA values.
# simple example
x <- factor(c(NA,'A','B','C',NA), levels=c('A','B','C'))
For purposes of modeling, I wish to replace these NA values with a new factor level (e.g., 'Unknown') and set this level as the reference level.
Because the replacement level is not an existing level, simple replacement doesn't work:
# this won't work, since the replacement value is not an existing level of the factor
x[is.na(x)] <- '?'
x # returns: [1] <NA> A B C <NA> -- the NAs remain
# this doesn't work either:
replace(x, NA,'?')
I came up with a couple solutions, but both are kind of ugly and surprisingly slow.
f1 <- function(x, uRep='?'){
# convert to character, replace NAs with Unknown, and convert back to factor
stopifnot(is.factor(x))
newLevels <- c(uRep,levels(x))
x <- as.character(x)
x[is.na(x)] <- uRep
factor(x, levels=newLevels)
}
f2 <- function(x, uRep='?'){
# add new level for Unknown, replace NAs with Unknown, and make Unknown first level
stopifnot(is.factor(x))
levels(x) <- c(levels(x),uRep)
x[is.na(x)] <- uRep
relevel(x, ref=uRep)
}
f3 <- function(x, uRep='?'){ # thanks to #HongOoi
y <- addNA(x)
levels(y)[length(levels(y))]<-uRep
relevel(y, ref=uRep)
}
#test
f1(x) # works
f2(x) # works
f3(x) # works
Solution #2 is editing only the (relatively small) set of levels, plus one arithmetic op to relevel. I would have expected that to be faster than #1, which is casting to character and back to factor.
However, #2 is twice as slow on a benchmark vector of 10K elements with 10 levels and 10% NA.
x <- sample(factor(c(LETTERS[1:10],NA),levels=LETTERS[1:10]),10000,replace=TRUE)
library(microbenchmark)
microbenchmark(f1(x),f2(x),f3(x),times=500L)
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(x) 271.981 278.1825 322.4701 313.0360 360.7175 609.393 500
# f2(x) 651.728 703.2595 768.6756 747.9480 825.7800 1517.707 500
# f3(x) 808.246 883.2980 966.2374 927.5585 1061.1975 1779.424 500
Solution #3, my wrapper for the built-in addNA (mentioned in answer below) was slower than either. addNA does some extra checks for NA values and sets the new level as the last one (requiring me to relevel) and named NA (which then requires renaming by index before releveling, since NA is hard to access -- relevel(addNA(x), ref=NA_character_)) doesn't work).
Is there a more efficient way to write this, or am I just hosed?

You can use fct_explicit_na followed by fct_relevel from the forcats package if you want a pre-fab solution. It's slower than your f1 function, but it still runs in a fraction of a second on a vector of length 100,000:
library(forcats)
x <- factor(c(NA,'A','B','C',NA), levels=c('A','B','C'))
[1] <NA> A B C <NA>
Levels: A B C
x = fct_relevel(fct_explicit_na(x, "Unknown"), "Unknown")
[1] Unknown A B C Unknown
Levels: Unknown A B C
Timings on a vector of length 100,000:
x <- sample(factor(c(LETTERS[1:10],NA), levels=LETTERS[1:10]), 1e5, replace=TRUE)
microbenchmark(forcats = fct_relevel(fct_explicit_na(x, "Unknown"), "Unknown"),
f1 = f1(x),
unit="ms", times=100L)
Unit: milliseconds
expr min lq mean median uq max neval cld
forcats 7.624158 10.634761 15.303339 12.162105 15.513846 250.0516 100 b
f1 3.568801 4.226087 8.085532 5.321338 5.995522 235.2449 100 a

There is a builtin function addNA for this.
From ?factor:
addNA(x, ifany = FALSE)
addNA modifies a factor by turning NA into an extra level (so that NA values are counted in tables, for instance).

Efficient Way to Vectorize a Function over each Row of a Data-Frame

I have a data frame where one column is a list of time-stamps. I need to annotate which time-stamps are valid or not, depending on whether or not they are close enough (i.e., within 1 second) to an element of another list of valid time-stamps. For this I have a helper function.
valid_times <- c(219.934, 229.996, 239.975, 249.935, 259.974, 344)
actual_times <- c(200, 210, 215, 220.5, 260)
strain <- c("green", "green", "green", "green", "green", "green")
valid_or_not <- c(rep("NULL", 6))
df <- data.frame(strain, actual_times, valid_or_not)
My data-frame looks like this:
strain actual_times valid_or_not
1 green 200.0 NULL
2 green 210.0 NULL
3 green 215.0 NULL
4 green 220.5 NULL
5 green 260.0 NULL
My helper (that checks to see if an actual_time is within 1 second of a valid time) is as follows:
valid_or_not_fxn<- function(actual_time){
c = "not valid"
for (i in 1:length(valid_times))
if (abs(valid_times[i] - actual_time) <= 1) {
c <- "valid"
} else {
}
return(c)
}
What I've tried to do is loop through the entire data-frame using a for loop with this helper function.
However....it's really slow (on my real data-set) because it's a nested loop cross-comparing two lists that are 100s of elements long. I can't figure out to optimize this.
df$valid_or_not <- as.character(df$valid_or_not)
for (i in 1:nrow(df))
print(df[i, "valid_or_not"])
df[i, "valid_or_not"] <- valid_or_not_fxn(df[i, "actual_times"])
Thank you for any help!

No matter what you do, you essentially have to do at least length(valid_times) comparisons. Probably better off looping over valid_times and comparing each item of that vector to your actual_times column as a vectorised operation. That way you'd only have 5 loop iterations.
One way of doing this is then:
df$test <- Reduce(`|`, lapply(valid_times, function(x) abs(df$actual_times - x) <= 1))
# strain actual_times valid_or_not
#1 green 200.0 FALSE
#2 green 210.0 FALSE
#3 green 215.0 FALSE
#4 green 220.5 TRUE
#5 green 260.0 TRUE
100K rows in df and 1000 valid_times test finishes in <4 seconds:
df2 <- df[sample(1:5,1e5,replace=TRUE),]
valid_times2 <- valid_times[sample(1:5,1000,replace=TRUE)]
system.time(Reduce(`|`, lapply(valid_times2, function(x) abs(df2$actual_times - x) <= 1)))
# user system elapsed
# 3.13 0.40 3.54

The easist way to do it is avoiding data frame operations. So you can do this check and populate the valid_or_not vector before combining them into the dataframe as:
valid_or_not[sapply(actual_times, function(x) any(abs(x - valid_times) <= 1))] <- "valid"
Note that, by this line, the valid_or_not vector is indexed with an equal length vector of boolean values (whether the condition is satisfied, T or F). So only TRUE valued indices from the vector are updated. valid_or_not and actual_times vectors must be of same length where as valid_times vector can be of different length.
By the way "plying" a for loop does not enhance the performance significantly since it is just a "wrapper" for "for" loops. Only performance increase comes from avoiding intermediary objects due to neater and more concise style of code and avoiding redundant copying in some cases. The same case is true for the Vectorize function: It just wraps the for loop that goes through the function and in for example "outer" function, the FUN must be "vectorized" in that manner. In fact it does not give the performance of a truely vectorized operation. In my example the performance enhancement comes from the substitution of the for loop with the "any" function.
And because of some kind of a "bug", subsetting data frames has an important penalty. As Hadley Wickham explains in Performance topic of Advanced-R:
Extracting a single value from a data frame
The following microbenchmark shows five ways to access a single value
(the number in the bottom-right corner) from the built-in mtcars
dataset. The variation in performance is startling: the slowest method
takes 30x longer than the fastest. There’s no reason that there has to
be such a huge difference in performance. It’s simply that no one has
had the time to fix it.
microbenchmark(
"[32, 11]" = mtcars[32, 11],
"$carb[32]" = mtcars$carb[32],
"[[c(11, 32)]]" = mtcars[[c(11, 32)]],
"[[11]][32]" = mtcars[[11]][32],
".subset2" = .subset2(mtcars, 11)[32] )
## Unit: nanoseconds
## expr min lq mean median uq max neval
## [32, 11] 15,300 16,300 18354 17,000 17,800 76,400 100
## $carb[32] 8,860 9,930 12836 10,600 11,600 85,400 100
## [[c(11, 32)]] 7,200 8,110 9293 8,780 9,350 21,300 100
## [[11]][32] 6,330 7,580 8377 8,100 8,690 20,900 100
## .subset2 334 566 4461 669 800 368,000 100
The most efficient way to subset a data frame is to use the .subset2 method. Your poor performance can mostly be attributed to this fact.
And as last notes:
If the "else" in your conditional statment does not do anything (just like in your example: else {}) you do not have to include it. R has some lazy operations (does not evaluate a statement as long as it is not executed inside the code), but that does not mean it always skips non-executed code portions.
The "character" values in your example are in fact categoric: Only
one of few values can be chosen for each entry. So there is no need
to store them as "characters" and they can be converted into factors
(which are just integer values). This can also enhance
performance.
An addition for #thelatemail 's working solution:
In R, "or" (|) operator isn't lazy while "any" function is. A ply combining or's work till the end while "any" function stops at the first encounter of a TRUE value - which enhances the performance (I will write a blog post on this topic ASAP). And vectorized "any" is almost as fast as native C code while *ply can be slightly faster than for loops in R (That I will benchmark and show in another blog post soon).
Some benchmarks showing this:
Pure "any" and | comparison:
> microbenchmark(any(T,F,F,F,F,F), T|F|F|F|F|F)
Unit: nanoseconds
expr min lq mean median uq max neval cld
any(T, F, F, F, F, F) 274 307.0 545.86 366.5 429.5 16380 100 a
T | F | F | F | F | F 597 626.5 903.47 668.5 730.0 18966 100 a
Pure "Reduce" and vectorization comparison:
> vec0 <- rep(1, 1e6)
> microbenchmark(Reduce("+", vec0), sum(vec0), times = 10)
Unit: microseconds
expr min lq mean median uq
Reduce("+", vec0) 308415.064 310071.953 318503.6048 312940.6355 317648.354
sum(vec0) 930.625 936.775 944.2416 943.5425 949.257
max neval cld
369864.993 10 b
962.349 10 a
And a reduced "|" vs. vectorized "any" comparison (for an extreme case). "any" beats by more than 1e5 times:
> vec1 <- c(T, rep(F, 1e6))
> microbenchmark(Reduce("|", vec1), any(vec1), times = 10)
Unit: nanoseconds
expr min lq mean median uq
Reduce("|", vec1) 394040518 395792399 402703632.6 399191803 400990304
any(vec1) 154 267 1932.5 2588 2952
max neval cld
441805451 10 b
3420 10 a
When the single TRUE is at the very end (so "any" is not lazy anymore and has to check the whole vector), "any" still beats by more than 400 times:
> vec2 <- c(rep(F, 1e6), T)
> microbenchmark(Reduce("|", vec2), any(vec2), times = 10)
Unit: microseconds
expr min lq mean median uq
Reduce("|", vec2) 396625.318 401744.849 416732.5087 407447.375 424538.222
any(vec2) 736.975 787.047 857.5575 832.137 926.076
max neval cld
482116.632 10 b
1013.732 10 a

Compute digit-sums in specific columns of a data frame

I'm trying to sum the digits of integers in the last 2 columns of my data frame. I have found a function that does the summing, but I think I may have an issue with applying the function - not sure?
Dataframe
a = c("a", "b", "c")
b = c(1, 11, 2)
c = c(2, 4, 23)
data <- data.frame(a,b,c)
#Digitsum function
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(as.character(x)) - 1))) %% 10)
#Applying function
data[2:3] <- lapply(data[2:3], digitsum)
This is the error that I get:
*Warning messages:
1: In 0:(nchar(as.character(x)) - 1) :
numerical expression has 3 elements: only the first used
2: In 0:(nchar(as.character(x)) - 1) :
numerical expression has 3 elements: only the first used*

Your function digitsum at the moment works fine for a single scalar input, for example,
digitsum(32)
# [1] 5
But, it can not take a vector input, otherwise ":" will complain. You need to vectorize this function, using Vectorize:
vec_digitsum <- Vectorize(digitsum)
Then it works for a vector input:
b = c(1, 11, 2)
vec_digitsum(b)
# [1] 1 2 2
Now you can use lapply without trouble.

#Zheyuan Li 's answer solved your problem of using lapply. Though I'd like to add several points:
Vectorize is just a wrapper with mapply, which doesn't give you the performance of vectorization.
The function itself can be improved for much better readability:
see
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(as.character(x)) - 1))) %% 10)
vec_digitsum <- Vectorize(digitsum)
sumdigits <- function(x){
digits <- strsplit(as.character(x), "")[[1]]
sum(as.numeric(digits))
}
vec_sumdigits <- Vectorize(sumdigits)
microbenchmark::microbenchmark(digitsum(12324255231323),
sumdigits(12324255231323), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
digitsum(12324255231323) 12.223 12.712 14.50613 13.201 13.690 96.801 100 a
sumdigits(12324255231323) 13.689 14.667 15.32743 14.668 15.157 38.134 100 a
The performance of two versions are similar, but the 2nd one is much easier to understand.
Interestingly, the Vectorize wrapper add considerable overhead for single input:
microbenchmark::microbenchmark(vec_digitsum(12324255231323),
vec_sumdigits(12324255231323), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
vec_digitsum(12324255231323) 92.890 96.801 267.2665 100.223 108.045 16387.07 100 a
vec_sumdigits(12324255231323) 94.357 98.757 106.2705 101.445 107.556 286.00 100 a
Another advantage of this function is that if you have really big numbers in string format, it will still work (with small modification of removing the as.character). While the first version function will have problem with big numbers or may introduce errors.
Note: At first my benchmark was comparing the vectorized version of OP function and non-vectorized version of my function, that gave me the wrong impression of my function is much faster. Turned out that was caused by Vectorize overhead.

Is `if` faster than ifelse?

When I was re-reading Hadley's Advanced R recently, I noticed that he said in Chapter 6 that `if` can be used as a function like
`if`(i == 1, print("yes"), print("no"))
(If you have the physical book in hand, it's on Page 80)
We know that ifelse is slow (Does ifelse really calculate both of its vectors every time? Is it slow?) as it evaluates all arguments. Will `if` be a good alternative to that as if seems to only evaluate TRUE arguments (this is just my assumption)?
Update: Based on the answers from #Benjamin and #Roman and the comments from #Gregor and many others, ifelse seems to be a better solution for vectorized calculations. I'm taking #Benjamin's answer here as it provides a more comprehensive comparison and for the community wellness. However, both answers(and the comments) are worth reading.

This is more of an extended comment building on Roman's answer, but I need the code utilities to expound:
Roman is correct that if is faster than ifelse, but I am under the impression that the speed boost of if isn't particularly interesting since it isn't something that can easily be harnessed through vectorization. That is to say, if is only advantageous over ifelse when the cond/test argument is of length 1.
Consider the following function which is an admittedly weak attempt at vectorizing if without having the side effect of evaluating both the yes and no conditions as ifelse does.
ifelse2 <- function(test, yes, no){
result <- rep(NA, length(test))
for (i in seq_along(test)){
result[i] <- `if`(test[i], yes[i], no[i])
}
result
}
ifelse2a <- function(test, yes, no){
sapply(seq_along(test),
function(i) `if`(test[i], yes[i], no[i]))
}
ifelse3 <- function(test, yes, no){
result <- rep(NA, length(test))
logic <- test
result[logic] <- yes[logic]
result[!logic] <- no[!logic]
result
}
set.seed(pi)
x <- rnorm(1000)
library(microbenchmark)
microbenchmark(
standard = ifelse(x < 0, x^2, x),
modified = ifelse2(x < 0, x^2, x),
modified_apply = ifelse2a(x < 0, x^2, x),
third = ifelse3(x < 0, x^2, x),
fourth = c(x, x^2)[1L + ( x < 0 )],
fourth_modified = c(x, x^2)[seq_along(x) + length(x) * (x < 0)]
)
Unit: microseconds
expr min lq mean median uq max neval cld
standard 52.198 56.011 97.54633 58.357 68.7675 1707.291 100 ab
modified 91.787 93.254 131.34023 94.133 98.3850 3601.967 100 b
modified_apply 645.146 653.797 718.20309 661.568 676.0840 3703.138 100 c
third 20.528 22.873 76.29753 25.513 27.4190 3294.350 100 ab
fourth 15.249 16.129 19.10237 16.715 20.9675 43.695 100 a
fourth_modified 19.061 19.941 22.66834 20.528 22.4335 40.468 100 a
SOME EDITS: Thanks to Frank and Richard Scriven for noticing my shortcomings.
As you can see, the process of breaking up the vector to be suitable to pass to if is a time consuming process and ends up being slower than just running ifelse (which is probably why no one has bothered to implement my solution).
If you're really desperate for an increase in speed, you can use the ifelse3 approach above. Or better yet, Frank's less obvious* but brilliant solution.
by 'less obvious' I mean, it took me two seconds to realize what he did. And per nicola's comment below, please note that this works only when yes and no have length 1, otherwise you'll want to stick with ifelse3

if is a primitive (complied) function called through the .Primitive interface, while ifelse is R bytecode, so it seems that if will be faster. Running some quick benchmarks
> microbenchmark(`if`(TRUE, "a", "b"), ifelse(TRUE, "a", "b"))
Unit: nanoseconds
expr min lq mean median uq max neval cld
if (TRUE) "a" else "b" 46 54 372.59 60.0 68.0 30007 100 a
ifelse(TRUE, "a", "b") 1212 1327 1581.62 1442.5 1617.5 11743 100 b
> microbenchmark(`if`(FALSE, "a", "b"), ifelse(FALSE, "a", "b"))
Unit: nanoseconds
expr min lq mean median uq max neval cld
if (FALSE) "a" else "b" 47 55 91.64 61.5 73 2550 100 a
ifelse(FALSE, "a", "b") 1256 1346 1688.78 1460.0 1677 17260 100 b
It seems that if not taking into account the code that is in actual branches, if is at least 20x faster than ifelse. However, note that this doesn't account the complexity of expression being tested and possible optimizations on that.
Update: Please note that this quick benchmark represent a very simplified and somewhat biased use case of if vs ifelse (as pointed out in the comments). While it is correct, it underrepresents the ifelse use cases, for that Benjamin's answer seems to provided more fair comparison.

Yes. I develop a for 152589 records using ifelse() took 90 min and using if() improve to 25min
for(i in ...){
# "Case 1"
# asesorMinimo<-( dummyAsesor%>%filter(FechaAsignacion==min(FechaAsignacion)) )[1,]
# asesorRegla<-tail(dummyAsesor%>%filter( FechaAsignacion<=dumFinClase)%>%arrange(FechaAsignacion),1)
# #Asigna Asesor
# dummyRow<-dummyRow%>%mutate(asesorRetencion=ifelse(dim(asesorRegla)[1]==0,asesorMinimo$OperadorNombreApellido,asesorRegla$OperadorNombreApellido))
# "Case 2"
asesorRegla<-tail(dummyAsesor%>%filter( FechaAsignacion<=dumFinClase)%>%arrange(FechaAsignacion),1)
asesorMinimo<-( dummyAsesor%>%filter(FechaAsignacion==min(FechaAsignacion)) )[1,]
if(dim(asesorRegla)[1]==0){
dummyRow<-dummyRow%>%mutate(asesorRetencion=asesorMinimo[1,7])
}else{
dummyRow<-dummyRow%>%mutate(asesorRetencion=asesorRegla[1,7])
}
}

how to count NAs in mean calculation

Very simple question, I'm sure it's been answered and I'm just phrasing it incorrectly but I want to calculate the mean of a vector of numbers including NA values, here's an example:
dummy<-c(1,2,NA, 3)
with this I can use mean with na.rm=T and receive the mean of 2, but what I want to receive is the mean of 6/4, including the NA value as a place holder which would return 1.5.

How about just swapping NA values with 0 temporarily.
mean(ifelse(is.na(dummy),0,dummy))

Try using sum and length
> sum(dummy, na.rm=TRUE)/length(dummy)
[1] 1.5

Since there are a lot of ways to do this, here goes another solution:
mean(replace(dummy, is.na(dummy), 0)) ## 1.5
[1] 1.5
Just out of curiosity, the most efficient solution seems to be the sum/length by Jilber:
bigdummy <- rnorm(1000)
bigdummy[sample(1:length(bigdummy), 100)] <- NA
library(microbenchmark)
mean_length <- function(x) sum(x, na.rm=TRUE)/length(x)
mean_replace <- function(x) mean(replace(x, is.na(x), 0))
mean_ifelse <- function(x) mean(ifelse(is.na(x),0,x))
microbenchmark(mean_length(bigdummy),
mean_replace(bigdummy),
mean_ifelse(bigdummy),
times=1000L)
Unit: microseconds
expr min lq median uq max neval
mean_length(bigdummy) 4.033 4.400 5.499 5.866 109.976 1000
mean_replace(bigdummy) 25.661 27.128 28.594 29.327 198.690 1000
mean_ifelse(bigdummy) 142.602 144.802 145.902 152.500 3405.209 1000

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R missing date calculation [duplicate] - r

I have a huge vector which has a couple of NA values, and I'm trying to find the max value in that vector (the vector is all numbers), but I can't do this because of the NA values. How can I remove the NA values so that I can compute the max?

The na.omit function is what a lot of the regression routines use internally: vec <- 1:1000 vec[runif(200, 1, 1000)] <- NA max(vec) #[1] NA max( na.omit(vec) ) #[1] 1000

?max shows you that there is an extra parameter na.rm that you can set to TRUE. Apart from that, if you really want to remove the NAs, just use something like: myvec[!is.na(myvec)]

You can call max(vector, na.rm = TRUE). More generally, you can use the na.omit() function.

Another option using complete.cases like this: d <- c(1, 100, NA, 10) result <- complete.cases(d) output <- d[result] output #> [1] 1 100 10 max(output) #> [1] 100 Created on 2022-08-26 with reprex v2.0.2

Related

Efficiently introduce new level on a factor vector

Efficient Way to Vectorize a Function over each Row of a Data-Frame

Compute digit-sums in specific columns of a data frame

Is `if` faster than ifelse?

how to count NAs in mean calculation

Categories

Resources