Vectorized IF statement in R?

x <- seq(0.1,10,0.1)
y <- if (x < 5) 1 else 2
This gives a warning (or error since R version 4.2.0) that the condition has length > 1.
I would want the if to operate on every single case instead of operating on the whole vector.
What do I have to change?

x <- seq(0.1,10,0.1)
> x
[1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5
[16] 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
[31] 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5
[46] 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0
[61] 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5
[76] 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0
[91] 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10.0
> ifelse(x < 5, 1, 2)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

For completeness: In big vectors, you can use the indices to speed things up (we do that often in simulations, where functions typically run 1000 to 10000 times). But as long as it isn't necessary, just use ifelse. This reads a lot easier.
> set.seed(100)
> x <- runif(1000,1,10)
> system.time(replicate(10000,{
+ y <- ifelse(x < 5,1,2)
+ }))
user system elapsed
2.56 0.08 2.64
> system.time(replicate(10000,{
+ y <- rep(2,length(x))
+ y[x < 5]<- 1
+ }))
user system elapsed
0.48 0.00 0.48
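As a quick sanity check on the two timed approaches, the sketch below confirms that they produce the same values. The names by_ifelse and by_index are introduced here for illustration only.

```r
set.seed(100)
x <- runif(1000, 1, 10)

# Approach 1: vectorized ifelse
by_ifelse <- ifelse(x < 5, 1, 2)

# Approach 2: fill with the "else" value, then overwrite where the condition holds
by_index <- rep(2, length(x))
by_index[x < 5] <- 1

# both vectors should hold the same values
all.equal(by_ifelse, by_index)
```

One difference worth knowing: if x contains NA, ifelse returns NA at that position, while the index version leaves the fill value (2) there.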

y <- if (x < 5) 1 else 2 does not operate on the whole vector (the warning you receive tells you only the first element of the condition will be used). You want ifelse:
y <- ifelse(x < 5, 1, 2)
ifelse operates on the whole logical vector, element-by-element. if only accepts one logical value. See ?"if" and ?ifelse
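To make the difference concrete, here is a small sketch that puts a scalar if, the vectorized ifelse, and an explicit per-element if side by side. The variable names first and y_loop are only illustrative.

```r
x <- seq(0.1, 10, 0.1)

# `if` needs a single TRUE/FALSE, so it only fits a scalar test
first <- if (x[1] < 5) 1 else 2

# ifelse evaluates the test for every element of x
y <- ifelse(x < 5, 1, 2)

# the same result written as an explicit per-element `if`
y_loop <- vapply(x, function(v) if (v < 5) 1 else 2, numeric(1))

all.equal(y, y_loop)
```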

You could also just create a logical vector and add 1 to it
x <- seq(0.1, 10, 0.1) # Your data set
(x >= 5) + 1
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [92] 2 2 2 2 2 2 2 2 2
If you would like to compare performance, this is the fastest solution
set.seed(100)
x <- runif(1e6, 1, 10)
RL <- function(x) y <- ifelse(x < 5,1,2)
JM <- function(x) {y <- rep(2, length(x)); y[x < 5] <- 1; y} # return y, not the value of the assignment
DA <- function(x) y <- (x >= 5) + 1
library(microbenchmark)
microbenchmark(RL(x),
JM(x),
DA(x))
# Unit: milliseconds
# expr min lq mean median uq max neval
# RL(x) 331.83448 366.52940 378.89182 374.99741 381.08659 609.21218 100
# JM(x) 38.72894 42.18745 44.36493 43.25086 44.09626 82.76168 100
# DA(x) 10.01644 11.96482 14.21593 13.17825 14.12930 53.76923 100
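The logical-arithmetic trick works because TRUE and FALSE are coerced to 1 and 0 in arithmetic. As a sketch, the same 1/2 result can also be used as an index into a lookup vector when the two outputs are not numbers; the "low"/"high" labels here are made up for the example.

```r
x <- seq(0.1, 10, 0.1)

# TRUE/FALSE become 1/0 in arithmetic, so this yields 1 or 2
y <- (x >= 5) + 1

# the same 1/2 values can index a lookup vector to produce any two outputs
labels <- c("low", "high")[(x >= 5) + 1]

head(labels, 3)
tail(labels, 3)
```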

Following the post above, you can also use index assignment to modify only the elements of a vector that satisfy the condition. In my opinion, if the faster approach costs no more effort to write, one should always use it.
x = seq(0.1,10,0.1)
y <- rep(2,length(x))
y[x<5] <- x[x<5]*2
The code of the previous post is best to answer the question. But if I had to use the code above I would do:
x = seq(0.1,10,0.1)
y <- rep(2,length(x))
y[x<5] <- x[x<5]*0 +1

nzMean <- function(x) { mean(x[x!=-1],na.rm=TRUE)}
nzMin <- function(x) {min(x[x!=-1],na.rm=TRUE)}
nzMax <- function(x) { max(x[x!=-1],na.rm=TRUE)}
nzRange<-function(x) {nzMax(x)-nzMin(x)}
nzSD <- function(x) { sd(x[x!=-1],na.rm=TRUE)}
#following function works
nzN1<- function(x) {ifelse(x!=-1,(x-nzMin(x))/nzRange(x) ,x) }
#following is bad: x[x!=-1] has only 4 elements, so ifelse recycles it and the values no longer line up with the 5 elements of x
nzN2<- function(x) {ifelse(x!=-1,(x[x!=-1]-nzMin(x))/nzRange(x) ,x) }
#following is bad as it returns 5 elements of vector but not correct answer
nzN3<- function(x) {ifelse(x!=-1,(x[x!=-1]-nzMin(x))/nzRange(x) ,-1) }
y<-c(1,-1,-20,2,4)
a<-nzMean(y)
b<-nzMin(y)
c<-nzMax(y)
d<-nzRange(y)
# test the working function
z<-nzN1(y)
print(z)
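The recycling problem above can be avoided with index assignment, so that both sides of the assignment have the same length. This is only a sketch; the function name normalize_keep_sentinel and its sentinel argument are assumptions introduced here, not part of the post.

```r
# Rescale to [0, 1] every element that is not the sentinel value,
# leaving sentinel (and NA) entries untouched.
normalize_keep_sentinel <- function(x, sentinel = -1) {
  keep <- !is.na(x) & x != sentinel
  rng <- range(x[keep])
  out <- x
  # both sides are subset with the same index, so nothing is recycled
  out[keep] <- (x[keep] - rng[1]) / (rng[2] - rng[1])
  out
}

y <- c(1, -1, -20, 2, 4)
normalize_keep_sentinel(y)
```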

Related

How to perform a percentage increase on a column when surpassing a specified value

How is it possible to introduce a percentage increase to a column after surpassing a given value in that column?
Here is a dataframe:
a <- data.frame(id = c(1,1,1,1,1,1,1,1,1), num = c(1,1.3,1.6,1.7,1.9,2.1,2.4,2.5,3.5))
With a threshold of 2 for the num column and a percentage increase of 1%, the result would look like this:
id num adjusted
1 1.0 1.0
1 1.3 1.3
1 1.6 1.6
1 1.7 1.7
1 1.9 1.9
1 2.1 2.1
1 2.4 2.121
1 2.5 2.14221
1 3.5 2.1636321
Any help would be much appreciated
Does this work:
library(dplyr)
library(purrr)
a %>% filter(num > 2) %>% mutate(adjusted = accumulate(num, ~ .x * 1.01)) %>%
right_join(a) %>% mutate(adjusted = coalesce(adjusted, num)) %>% arrange(num)
Joining, by = c("id", "num")
id num adjusted
1 1 1.0 1.000000
2 1 1.3 1.300000
3 1 1.6 1.600000
4 1 1.7 1.700000
5 1 1.9 1.900000
6 1 2.1 2.100000
7 1 2.4 2.121000
8 1 2.5 2.142210
9 1 3.5 2.163632
thresh <- 2
pct_inc <- .01
# filter to numbers which will be changed
to_adj <- a$num[a$num > thresh]
# replace numbers above thresh with pct increase from first above thresh
a$adjusted <-
replace(a$num, a$num > thresh,
to_adj[1]*(1 + pct_inc)^(seq_along(to_adj) - 1))
a
# id num adjusted
# 1 1 1.0 1.000000
# 2 1 1.3 1.300000
# 3 1 1.6 1.600000
# 4 1 1.7 1.700000
# 5 1 1.9 1.900000
# 6 1 2.1 2.100000
# 7 1 2.4 2.121000
# 8 1 2.5 2.142210
# 9 1 3.5 2.163632
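The same compounding can be written as a running product in base R, which may make the step-by-step growth easier to see. This is a sketch under the same assumption as the answers above (rows are already sorted by num); above and compounded are names introduced here.

```r
a <- data.frame(id = rep(1, 9),
                num = c(1, 1.3, 1.6, 1.7, 1.9, 2.1, 2.4, 2.5, 3.5))
thresh  <- 2
pct_inc <- 0.01

above  <- a$num > thresh
to_adj <- a$num[above]

# running product: keep the first value above the threshold,
# then grow the previous result by pct_inc at each step
compounded <- Reduce(function(prev, cur) prev * (1 + pct_inc),
                     to_adj, accumulate = TRUE)

a$adjusted <- a$num
a$adjusted[above] <- compounded
a
```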

Filter a group of a data.frame based on multiple conditions

I am looking for an elegant way to filter the values of a specific group of big data.frame based on multiple conditions.
My data frame looks like this.
data=data.frame(group=c("A","B","C","A","B","C","A","B","C"),
time= c(rep(1,3),rep(2,3), rep(3,3)),
value=c(0.2,1,1,0.1,10,20,10,20,30))
group time value
1 A 1 0.2
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
For time point 1 only, I would like to keep just the values that are bigger than 0.1 but smaller than 1, and filter out the rest.
I want my data.frame to look like this.
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Any help is highly appreciated.
With dplyr you can do
library(dplyr)
data %>% filter(!(time == 1 & (value <= 0.1 | value >= 1)))
# group time value
# 1 A 1 0.2
# 2 A 2 0.1
# 3 B 2 10.0
# 4 C 2 20.0
# 5 A 3 10.0
# 6 B 3 20.0
# 7 C 3 30.0
Or if you have too much free time and you decided to avoid dplyr:
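# TRUE for rows at time 1 whose value lies strictly between 0.1 and 1
ind <- with(data, time == 1 & value > 0.1 & value < 1)
# the ifelse version below does the same; the wrapper is redundant because the comparison is already logical
ind <- ifelse((data$time == 1) & (data$value > 0.1 & data$value < 1), TRUE, FALSE)
# drop the time-1 rows that fail the condition
data <- data[!(data$time == 1 & !ind), ]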
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Another simple option would be to use subset twice and then append the results in a row wise manner.
rbind(
subset(data, time == 1 & value > 0.1 & value < 1),
subset(data, time != 1)
)
# group time value
# 1 A 1 0.2
# 4 A 2 0.1
# 5 B 2 10.0
# 6 C 2 20.0
# 7 A 3 10.0
# 8 B 3 20.0
# 9 C 3 30.0
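For comparison, the condition can also be expressed as a single logical index in base R, without subset or rbind. A minimal sketch, with keep as an illustrative name:

```r
data <- data.frame(group = rep(c("A", "B", "C"), 3),
                   time  = rep(1:3, each = 3),
                   value = c(0.2, 1, 1, 0.1, 10, 20, 10, 20, 30))

# keep a row if it is not at time 1,
# or if its value lies strictly between 0.1 and 1
keep <- data$time != 1 | (data$value > 0.1 & data$value < 1)
data[keep, ]
```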

Computing Colwise Means on a Given Interval

I have a data frame in R that can be approximated as:
df <- data.frame(x = rep(1:5, each = 4), y = rep(2:6, each = 4), z = rep(3:7, each = 4))
> df
x y z
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 2 3 4
6 2 3 4
7 2 3 4
8 2 3 4
9 3 4 5
10 3 4 5
11 3 4 5
12 3 4 5
13 4 5 6
14 4 5 6
15 4 5 6
16 4 5 6
17 5 6 7
18 5 6 7
19 5 6 7
20 5 6 7
I'd like to compute colwise means at intervals of 5, and then collapse these means into a new data frame. For example, I'd like to compute the colwise means of df[1:5,], df[6:10,], df[11:15,], and df[16:20,], and return a df that looks as follows:
[,1] [,2] [,3]
[1,] 1.2 2.2 3.2
[2,] 2.4 3.4 4.4
[3,] 3.6 4.6 5.6
[4,] 4.8 5.8 6.8
I'm currently using a for-loop as such (where temp.coeff would correspond to the "5" specified above):
my.means <- NULL
for (j in 1:baseFreq) {
temp.mean <- colMeans(temp.df[(temp.coeff*(j-1)+1):(temp.coeff*j),])
my.means <- rbind(my.means, temp.mean)
}
my.means <- t(my.means)
collapsed.df <- t(data.frame(colMeans(my.means)))
}
...but I feel like there's an apply statement that could do the job a lot more efficiently. In addition, while the above data frame only has 20 rows, the ones I'll be working on will have several thousand. Thoughts?
Many thanks in advance SO.
aggregate can do this if you aggregate against an appropriate running index. You do end up with another column in the result (which can be removed).
aggregate(. ~ rep(seq(nrow(df)/5), each=5), data=df, FUN=mean)
## rep(seq(nrow(df)/5), each = 5) x y z
## 1 1 1.2 2.2 3.2
## 2 2 2.4 3.4 4.4
## 3 3 3.6 4.6 5.6
## 4 4 4.8 5.8 6.8
I really think data.table works great for situations like this. It is fast and easy.
require("data.table")
dt <- data.table(df)
dt[,row.num:=.I]
dt[,lapply(.SD,mean),by=list(interval=cut(row.num,seq(0,nrow(dt),by=5)))]
# interval x y z
# 1: (0,5] 1.2 2.2 3.2
# 2: (5,10] 2.4 3.4 4.4
# 3: (10,15] 3.6 4.6 5.6
# 4: (15,20] 4.8 5.8 6.8
This is a possible solution with a combination of apply and sapply:
apply(df, 2, function(x) sapply(seq(1,nrow(df),5), function(y) mean(x[y:(y+4)])))
# x y z
#[1,] 1.2 2.2 3.2
#[2,] 2.4 3.4 4.4
#[3,] 3.6 4.6 5.6
#[4,] 4.8 5.8 6.8
Edit after comment by @jbaums: depending on the desired behavior, you might want to add na.rm=TRUE to the mean calculation:
apply(df, 2, function(x) sapply(seq(1,nrow(df),5), function(y) mean(x[y:(y+4)], na.rm = TRUE)))
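Another base R variant, sketched here, builds a block id for each row and splits on it. Unlike y:(y+4), this does not index past the end of the data when the row count is not a multiple of the block size. The names block_size, block and block_means are introduced for the example.

```r
df <- data.frame(x = rep(1:5, each = 4),
                 y = rep(2:6, each = 4),
                 z = rep(3:7, each = 4))

block_size <- 5
# block id for each row: 1,1,1,1,1,2,2,... (a short final block is allowed)
block <- ceiling(seq_len(nrow(df)) / block_size)

# split into blocks, take column means per block, and stack into a matrix
block_means <- t(sapply(split(df, block), colMeans))
block_means
```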
