I have recently looked for advice on how to suppress all but the first occurrences of a value within a group using dplyr (dplyr override all but the first occurrences of a value within a group).
The solution was a really clever one, and now I am struggling to find something equally efficient for the case where I need to suppress only the next n values.
For example, in the code below I create a new "tag" column:
library('dplyr')
data(iris)
set.seed(1)
iris$tag <- sample(c(0,1), 150, replace=TRUE, prob = c(0.7, 0.3))
giris <- iris %>% group_by(Species)
# Source: local data frame [150 x 6]
# Groups: Species [3]
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species tag
# (dbl) (dbl) (dbl) (dbl) (fctr) (dbl)
# 1 5.1 3.5 1.4 0.2 setosa 0
# 2 4.9 3.0 1.4 0.2 setosa 0
# 3 4.7 3.2 1.3 0.2 setosa 0
# 4 4.6 3.1 1.5 0.2 setosa 1
# 5 5.0 3.6 1.4 0.2 setosa 0
# 6 5.4 3.9 1.7 0.4 setosa 1
# 7 4.6 3.4 1.4 0.3 setosa 1
# 8 5.0 3.4 1.5 0.2 setosa 0
# 9 4.4 2.9 1.4 0.2 setosa 0
# 10 4.9 3.1 1.5 0.1 setosa 0
# .. ... ... ... ... ... ...
In the setosa group rows: 4, 6, 7, ... are tagged as "1"s. I am trying to suppress "1"s (i.e. convert them to "0"s) in the next two rows after any occurrence of a "1". In other words, rows #5 and #6 should be set to "0" but #7 should remain unaffected. In this case, row #7 happens to be a "1", so rows #8 and #9 should be set to "0"s and so on...
Any hint on how to do this in dplyr? This package is really powerful, but for some reason it is a mental challenge for me to master all the subtleties...
Some more examples:
in case of: 0 0 1 1, the output should be 0 0 1 0
in case of: 0 0 1 1 1 1 1, the output should be 0 0 1 0 0 1 0
I can't think of any better way to do this than a loop:
flip_followers = function(tag, nf = 2L){
w = which(tag==1L)
keep = rep(TRUE, length(w))
for (i in seq_along(w)) if (keep[i]) keep[match(w[i]+seq_len(nf), w)] = FALSE
tag[w[!keep]] = 0L
tag
}
giris %>% mutate(tag = flip_followers(tag))
Source: local data frame [150 x 6]
Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species tag
(dbl) (dbl) (dbl) (dbl) (fctr) (dbl)
1 5.1 3.5 1.4 0.2 setosa 0
2 4.9 3.0 1.4 0.2 setosa 0
3 4.7 3.2 1.3 0.2 setosa 0
4 4.6 3.1 1.5 0.2 setosa 1
5 5.0 3.6 1.4 0.2 setosa 0
6 5.4 3.9 1.7 0.4 setosa 0
7 4.6 3.4 1.4 0.3 setosa 1
8 5.0 3.4 1.5 0.2 setosa 0
9 4.4 2.9 1.4 0.2 setosa 0
10 4.9 3.1 1.5 0.1 setosa 0
.. ... ... ... ... ... ...
For a possible speedup, you could switch the loop to if (keep[i]) keep[i+seq_len(nf)][match(w[i]+seq_len(nf), w[i+seq_len(nf)])] = FALSE so that match only searches the next nf elements of w. I'm sure Rcpp would be faster still, if that's a serious concern.
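If the lookup with match turns out to be the bottleneck, one alternative is to drop match entirely and scan the vector once, remembering the position of the last "1" that was kept. This is only a sketch of that idea, not the loop above rewritten; the names flip_followers_scan and last_kept are made up for illustration, and it assumes tag holds only 0s and 1s (NA entries are simply left alone).

```r
# Hypothetical single-pass variant: keep a "1" only if it lies more than
# nf positions after the last "1" that was kept.
flip_followers_scan <- function(tag, nf = 2L) {
  last_kept <- -Inf  # no "1" has been kept yet
  for (i in seq_along(tag)) {
    if (isTRUE(tag[i] == 1)) {
      if (i - last_kept <= nf) {
        tag[i] <- 0      # still inside the suppression window
      } else {
        last_kept <- i   # window has passed, so keep this "1"
      }
    }
  }
  tag
}

flip_followers_scan(c(0, 0, 1, 1))
#> [1] 0 0 1 0
flip_followers_scan(c(0, 0, 1, 1, 1, 1, 1))
#> [1] 0 0 1 0 0 1 0
```

Because the body is a plain indexed loop, it should also translate fairly directly to Rcpp if speed matters.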
To me this is semantically clearer if you use an accumulating reduce to keep track of the refractory period.
suppress <- function(x, w) {
r <- Reduce(function(d,i) if(i&!d) w else max(0,d-1), x, init=0, acc=TRUE)[-1]
x * (r==w)
}
Example
suppress(c(0,0,1,1,1,1,1), 2)
#> [1] 0 0 1 0 0 1 0
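Since the question is about groups, here is a rough sketch of applying suppress within each group using base R's ave, so that the refractory period restarts at every group boundary. The tag and group vectors are invented toy data, not the iris example; in dplyr the equivalent would presumably be a grouped mutate.

```r
# suppress() repeated from above so that the block runs on its own
suppress <- function(x, w) {
  r <- Reduce(function(d, i) if (i & !d) w else max(0, d - 1),
              x, init = 0, accumulate = TRUE)[-1]
  x * (r == w)
}

# Toy data (assumed for illustration): two groups of five rows each
tag   <- c(0, 0, 0, 0, 1, 1, 1, 0, 0, 0)
group <- rep(c("a", "b"), each = 5)

# Grouped: the "1" that ends group a does not suppress anything in group b
by_group <- ave(tag, group, FUN = function(x) suppress(x, 2))
by_group
#> [1] 0 0 0 0 1 1 0 0 0 0

# Ungrouped, for comparison: the window leaks across the boundary
suppress(tag, 2)
#> [1] 0 0 0 0 1 0 0 0 0 0
```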
Kinda clumsy but it seems like you have to walk down the vector regardless
f <- function(x, repl = c(1,0,0)) {
sx <- seq(x)
for (ii in seq_along(x))
if (x[ii] == repl[1L]) ## thanks to @Frank for catching
x[ii:(ii + length(repl) - 1)] <- repl
x[sx]
}
(x <- c(0,0,1,1,1,1,1)); f(x)
# [1] 0 0 1 1 1 1 1
# [1] 0 0 1 0 0 1 0
(x <- c(0,0,1,0,1,0,1,1)); f(x)
# [1] 0 0 1 0 1 0 1 1
# [1] 0 0 1 0 0 0 1 0
And your example
set.seed(1)
head(n = 10,
cbind(tag <- sample(c(0,1), 150, replace=TRUE, prob = c(0.7, 0.3)),
tag2 = f(tag)))
# [1,] 0 0
# [2,] 0 0
# [3,] 0 0
# [4,] 1 1
# [5,] 0 0
# [6,] 1 0
# [7,] 1 1
# [8,] 0 0
# [9,] 0 0
# [10,] 0 0
And you can replace with whatever you want
(x <- c(0,0,1,1,1,1,1)); f(x, c(1,0,0,0))
# [1] 0 0 1 1 1 1 1
# [1] 0 0 1 0 0 0 1
(x <- c(0,0,1,1,1,1,1)); f(x, 1:3)
# [1] 0 0 1 1 1 1 1
# [1] 0 0 1 2 3 1 2
## courtesy of @Frank this would also work
(x <- c(0,0,1,1,0,0,1)); f(x, 0:2)
# [1] 0 0 1 1 0 0 1
# [1] 0 1 2 1 0 1 2
x <- seq(0.1,10,0.1)
y <- if (x < 5) 1 else 2
This gives a warning (or error since R version 4.2.0) that the condition has length > 1.
I would want the if to operate on every single case instead of operating on the whole vector.
What do I have to change?
x <- seq(0.1,10,0.1)
> x
[1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5
[16] 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
[31] 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5
[46] 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0
[61] 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5
[76] 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0
[91] 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10.0
> ifelse(x < 5, 1, 2)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
For completeness: In big vectors, you can use the indices to speed things up (we do that often in simulations, where functions typically run 1000 to 10000 times). But as long as it isn't necessary, just use ifelse. This reads a lot easier.
> set.seed(100)
> x <- runif(1000,1,10)
> system.time(replicate(10000,{
+ y <- ifelse(x < 5,1,2)
+ }))
user system elapsed
2.56 0.08 2.64
> system.time(replicate(10000,{
+ y <- rep(2,length(x))
+ y[x < 5]<- 1
+ }))
user system elapsed
0.48 0.00 0.48
y <- if (x < 5) 1 else 2 does not operate on the whole vector (the warning you receive tells you only the first element of the condition will be used). You want ifelse:
y <- ifelse(x < 5, 1, 2)
ifelse operates on the whole logical vector, element-by-element. if only accepts one logical value. See ?"if" and ?ifelse
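To make the difference concrete, here is a small sketch on a made-up four-element vector. It shows ifelse producing one result per element, and any() collapsing the condition to a single value for the case where a whole-vector decision with if is what you actually want. The variable names are only illustrative.

```r
x <- c(1, 4, 6, 9)

# ifelse() is vectorised: one result per element of the condition
size <- ifelse(x < 5, "small", "large")
size
#> [1] "small" "small" "large" "large"

# if () needs a single TRUE/FALSE, so collapse the logical vector first
# (any() or all()) when you want one decision for the whole vector
if (any(x < 5)) {
  verdict <- "at least one value is below 5"
} else {
  verdict <- "no value is below 5"
}
verdict
#> [1] "at least one value is below 5"
```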
You could also just create a logical vector and add 1 to it
x <- seq(0.1, 10, 0.1) # Your data set
(x >= 5) + 1
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [92] 2 2 2 2 2 2 2 2 2
If you would like to compare performance, this is the fastest solution
set.seed(100)
x <- runif(1e6, 1, 10)
RL <- function(x) y <- ifelse(x < 5,1,2)
JM <- function(x) {y <- rep(2, length(x)); y[x < 5] <- 1}
DA <- function(x) y <- (x >= 5) + 1
library(microbenchmark)
microbenchmark(RL(x),
JM(x),
DA(x))
# Unit: milliseconds
# expr min lq mean median uq max neval
# RL(x) 331.83448 366.52940 378.89182 374.99741 381.08659 609.21218 100
# JM(x) 38.72894 42.18745 44.36493 43.25086 44.09626 82.76168 100
# DA(x) 10.01644 11.96482 14.21593 13.17825 14.12930 53.76923 100
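Before choosing on speed alone, it may be worth confirming that the three approaches agree. The sketch below checks this on the vector from the question and then shows one case where they differ, namely missing values. The variable names are invented here, and the index version stores its result explicitly rather than relying on the value of the last assignment.

```r
x <- seq(0.1, 10, 0.1)

via_ifelse  <- ifelse(x < 5, 1, 2)
via_index   <- rep(2, length(x)); via_index[x < 5] <- 1
via_logical <- (x >= 5) + 1

# All three give the same answer when there are no missing values
all(via_ifelse == via_index) && all(via_ifelse == via_logical)
#> [1] TRUE

# With an NA, ifelse() and the logical arithmetic propagate the NA,
# while index assignment skips that element and leaves the default 2
x_na <- c(1, NA, 9)
na_ifelse  <- ifelse(x_na < 5, 1, 2)
na_index   <- rep(2, length(x_na)); na_index[x_na < 5] <- 1
na_logical <- (x_na >= 5) + 1
na_ifelse
#> [1]  1 NA  2
na_index
#> [1] 1 2 2
na_logical
#> [1]  1 NA  2
```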
Following the above post, you can also use and modify the elements of a vector that satisfy the criterion. In my opinion, if the faster approach costs no extra effort, one should always use it.
x = seq(0.1,10,0.1)
y <- rep(2,length(x))
y[x<5] <- x[x<5]*2
The code of the previous post is best to answer the question. But if I had to use the code above I would do:
x = seq(0.1,10,0.1)
y <- rep(2,length(x))
y[x<5] <- x[x<5]*0 +1
nzMean <- function(x) { mean(x[x!=-1],na.rm=TRUE)}
nzMin <- function(x) {min(x[x!=-1],na.rm=TRUE)}
nzMax <- function(x) { max(x[x!=-1],na.rm=TRUE)}
nzRange<-function(x) {nzMax(x)-nzMin(x)}
nzSD <- function(x) { sd(x[x!=-1],na.rm=TRUE)}
#following function works
nzN1<- function(x) {ifelse(x!=-1,(x-nzMin(x))/nzRange(x) ,x) }
#following is bad as it returns only 4 not 5 elements of vector
nzN2<- function(x) {ifelse(x!=-1,(x[x!=-1]-nzMin(x))/nzRange(x) ,x) }
#following is bad as it returns 5 elements of vector but not correct answer
nzN3<- function(x) {ifelse(x!=-1,(x[x!=-1]-nzMin(x))/nzRange(x) ,-1) }
y<-c(1,-1,-20,2,4)
a<-nzMean(y)
b<-nzMin(y)
c<-nzMax(y)
d<-nzRange(y)
# test the working function
z<-nzN1(y)
print(z)
What I want to do is modify all selected columns of an R data.table according to a condition on each row's value, i.e.
for all 4 columns selected in the cols variable, if the value is greater than or equal to 1.5, I would like to set it to 1, else 0.
I tried something like this: iris[(cols) > 1.5 , (cols) := 1, .SDcols = cols]
Thanks
One data.table approach:
iris <- as.data.table(iris)
cols <- names(iris)[1:4]
cols
# [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
iris[, (cols) := lapply(.SD, function(z) fifelse(z > 1.5, 1, z)), .SDcols = cols]
iris
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <num> <num> <num> <num> <fctr>
# 1: 1 1 1.4 0.2 setosa
# 2: 1 1 1.4 0.2 setosa
# 3: 1 1 1.3 0.2 setosa
# 4: 1 1 1.5 0.2 setosa
# 5: 1 1 1.4 0.2 setosa
# 6: 1 1 1.0 0.4 setosa
# 7: 1 1 1.4 0.3 setosa
# 8: 1 1 1.5 0.2 setosa
# 9: 1 1 1.4 0.2 setosa
# 10: 1 1 1.5 0.1 setosa
# ---
# 141: 1 1 1.0 1.0 virginica
# 142: 1 1 1.0 1.0 virginica
# 143: 1 1 1.0 1.0 virginica
# 144: 1 1 1.0 1.0 virginica
# 145: 1 1 1.0 1.0 virginica
# 146: 1 1 1.0 1.0 virginica
# 147: 1 1 1.0 1.0 virginica
# 148: 1 1 1.0 1.0 virginica
# 149: 1 1 1.0 1.0 virginica
# 150: 1 1 1.0 1.0 virginica
An alternative using set:
for (nm in cols) set(iris, which(iris[[nm]] > 1.5), nm, 1)
Another solution:
library(dplyr)
library(data.table)
iris[,1:4] %>% data.table() %>% mutate_all(~ ifelse(.x>=1.5,1,0))
If you just need to check numeric columns, across can be a good fit; it also works with more specific selections such as positions and names.
library(tidyverse)
iris |>
as_tibble() |>
mutate(across(.cols = where(is.numeric),.fns = ~ if_else(.x > 1.5,1,.x)))
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1 1 1.4 0.2 setosa
#> 2 1 1 1.4 0.2 setosa
#> 3 1 1 1.3 0.2 setosa
#> 4 1 1 1.5 0.2 setosa
#> 5 1 1 1.4 0.2 setosa
#> 6 1 1 1 0.4 setosa
#> 7 1 1 1.4 0.3 setosa
#> 8 1 1 1.5 0.2 setosa
#> 9 1 1 1.4 0.2 setosa
#> 10 1 1 1.5 0.1 setosa
#> # ... with 140 more rows
Created on 2021-10-18 by the reprex package (v2.0.1)
Base R option -
data <- iris
cols <- 1:4
data[cols] <- +(data[cols] > 1.5)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 1 1 0 0 setosa
#2 1 1 0 0 setosa
#3 1 1 0 0 setosa
#4 1 1 0 0 setosa
#5 1 1 0 0 setosa
#6 1 1 1 0 setosa
#...
#...
The + at the beginning is used to change the logical values (TRUE/FALSE) to integers (1/0).
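A tiny sketch of what the unary + does, first on a plain logical vector and then on the data frame. Note one assumption: the question asked for "greater than or equal to 1.5", so this variant uses >= rather than >, which changes the result for values of exactly 1.5 (row 4 of Petal.Length).

```r
flags <- c(TRUE, FALSE, TRUE)
+flags
#> [1] 1 0 1
class(+flags)
#> [1] "integer"

# Same idea on the first four columns of iris, using >= as in the question
data <- iris
cols <- 1:4
data[cols] <- +(data[cols] >= 1.5)
head(data$Petal.Length)
#> [1] 0 0 0 1 0 1
```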
We may do
library(dplyr)
iris %>%
mutate(across(where(is.numeric), ~ +(. > 1.5)))
How is it possible to introduce a percentage increase to a column after surpassing a given value in that column?
Here is a dataframe:
a <- data.frame(id = c(1,1,1,1,1,1,1,1,1), num = c(1,1.3,1.6,1.7,1.9,2.1,2.4,2.5,3.5))
For a threshold of 2 on the num column and a percentage increase of 1%, the result would look like this:
id num adjusted
1 1.0 1.0
1 1.3 1.3
1 1.6 1.6
1 1.7 1.7
1 1.9 1.9
1 2.1 2.1
1 2.4 2.121
1 2.5 2.14221
1 3.5 2.1636321
Any help would be much appreciated
Does this work:
library(dplyr)
library(purrr)
a %>% filter(num > 2) %>% mutate(adjusted = accumulate(num, ~ .x * 1.01)) %>%
right_join(a) %>% mutate(adjusted = coalesce(adjusted, num)) %>% arrange(num)
Joining, by = c("id", "num")
id num adjusted
1 1 1.0 1.000000
2 1 1.3 1.300000
3 1 1.6 1.600000
4 1 1.7 1.700000
5 1 1.9 1.900000
6 1 2.1 2.100000
7 1 2.4 2.121000
8 1 2.5 2.142210
9 1 3.5 2.163632
thresh <- 2
pct_inc <- .01
# filter to numbers which will be changed
to_adj <- a$num[a$num > thresh]
# replace numbers above thresh with pct increase from first above thresh
a$adjusted <-
replace(a$num, a$num > thresh,
to_adj[1]*(1 + pct_inc)^(seq_along(to_adj) - 1))
a
# id num adjusted
# 1 1 1.0 1.000000
# 2 1 1.3 1.300000
# 3 1 1.6 1.600000
# 4 1 1.7 1.700000
# 5 1 1.9 1.900000
# 6 1 2.1 2.100000
# 7 1 2.4 2.121000
# 8 1 2.5 2.142210
# 9 1 3.5 2.163632
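The replacement values above are plain compound growth: the k-th value past the threshold is the first one multiplied by (1 + pct_inc)^(k - 1). As a quick sanity check, the sketch below compares that closed form with a step-by-step accumulation using base R's Reduce. The names first_above, n_above, closed_form and stepwise are made up for illustration.

```r
first_above <- 2.1   # first num value above the threshold
pct_inc     <- 0.01
n_above     <- 4     # number of values above the threshold

# Closed form: first value times the growth factor raised to 0, 1, 2, ...
closed_form <- first_above * (1 + pct_inc)^(seq_len(n_above) - 1)

# Step by step: each value is the previous one increased by 1%
stepwise <- Reduce(function(prev, step) prev * (1 + pct_inc),
                   seq_len(n_above - 1), first_above, accumulate = TRUE)

closed_form
#> [1] 2.100000 2.121000 2.142210 2.163632
isTRUE(all.equal(closed_form, stepwise))
#> [1] TRUE
```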
It is quite easy to dummy code a collapsed column using the tidyverse. Here is a quick example of how I've done it in the past. First, I'll load the iris data and create a custom collapsed column of randomly sampled letters:
library(tidyverse)
# load practice data
data(iris)
iris <- as_tibble(iris)
# create column of collapsed values
lst <- list()
for(i in 1:150) {
value <- as.list(paste0(sample(letters[1:2], 1), ", ", sample(letters[3:4], 1)))
lst[i] <- value
}
# append custom columns to the iris dataset
iris$Samples <- unlist(lst)
iris$Subject <- c(1:150)
iris <- iris %>% select(Subject, everything())
# preview custom dataset
iris
# A tibble: 150 x 7
Subject Sepal.Length Sepal.Width Petal.Length Petal.Width Species Samples
<int> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 1 5.1 3.5 1.4 0.2 setosa a, d
2 2 4.9 3 1.4 0.2 setosa a, c
3 3 4.7 3.2 1.3 0.2 setosa a, c
4 4 4.6 3.1 1.5 0.2 setosa b, c
5 5 5 3.6 1.4 0.2 setosa a, c
6 6 5.4 3.9 1.7 0.4 setosa a, d
7 7 4.6 3.4 1.4 0.3 setosa b, c
8 8 5 3.4 1.5 0.2 setosa b, c
9 9 4.4 2.9 1.4 0.2 setosa b, d
10 10 4.9 3.1 1.5 0.1 setosa a, c
# ... with 140 more rows
So, let's say that each letter represented a unique value of interest and I wanted to wrangle this data into a series of dummy coded variables for each letter. Here is how I would do this using tidyverse functions:
iris %>%
separate_rows(Samples, sep = ', ') %>%
mutate(Values = 1) %>%
pivot_wider(names_from = "Samples", values_from = "Values") %>%
mutate_if(is.double, ~replace_na(., 0))
# A tibble: 150 x 10
Subject Sepal.Length Sepal.Width Petal.Length Petal.Width Species a d c b
<int> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 5.1 3.5 1.4 0.2 setosa 1 1 0 0
2 2 4.9 3 1.4 0.2 setosa 1 0 1 0
3 3 4.7 3.2 1.3 0.2 setosa 1 0 1 0
4 4 4.6 3.1 1.5 0.2 setosa 0 0 1 1
5 5 5 3.6 1.4 0.2 setosa 1 0 1 0
6 6 5.4 3.9 1.7 0.4 setosa 1 1 0 0
7 7 4.6 3.4 1.4 0.3 setosa 0 0 1 1
8 8 5 3.4 1.5 0.2 setosa 0 0 1 1
9 9 4.4 2.9 1.4 0.2 setosa 0 1 0 1
10 10 4.9 3.1 1.5 0.1 setosa 1 0 1 0
# ... with 140 more rows
This is fast and efficient for small datasets. But, I am quickly moving into datasets that have millions of rows. Enter data.table.
How would I accomplish the same process using data.table? Here is my attempt:
library(data.table)
# convert my tibble into a data.table
iris.dt <- as.data.table(iris)
# perform the separate_rows functionality on my data
result <- iris.dt[, list(Samples = unlist(strsplit(Samples, ", "))), by = Subject
][, Values := 1]
print(result)
Subject Samples Values
1: 1 a 1
2: 1 d 1
3: 2 a 1
4: 2 c 1
5: 3 a 1
---
296: 148 d 1
297: 149 a 1
298: 149 d 1
299: 150 b 1
300: 150 c 1
The problem is that I don't know how to (1) keep all other columns and (2) spread out this info in a way similar to dplyr::pivot_wider.
Any help would be much appreciated!
One way is to use tstrsplit and then melt + dcast. It seems somewhat inefficient, but I'm not sure of another way.
Example Data:
library(magrittr)
library(data.table)
set.seed(2020)
iris.dt <- as.data.table(iris)
iris.dt[, samples := paste0(sample(letters[1:2], .N, T), ', ', sample(letters[3:4], .N, T))]
Create dummy cols
new_cols <-
iris.dt[, tstrsplit(samples, ', ')][, I := .I] %>%
melt('I') %>%
dcast(I ~ value, fun.agg = length) %>%
.[, I := NULL]
iris.dt[, names(new_cols) := new_cols][]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species samples a b c d
# 1: 5.1 3.5 1.4 0.2 setosa b, c 0 1 1 0
# 2: 4.9 3.0 1.4 0.2 setosa a, d 1 0 0 1
# 3: 4.7 3.2 1.3 0.2 setosa b, c 0 1 1 0
# 4: 4.6 3.1 1.5 0.2 setosa a, d 1 0 0 1
# 5: 5.0 3.6 1.4 0.2 setosa a, c 1 0 1 0
# ---
# 146: 6.7 3.0 5.2 2.3 virginica b, d 0 1 0 1
# 147: 6.3 2.5 5.0 1.9 virginica a, d 1 0 0 1
# 148: 6.5 3.0 5.2 2.0 virginica b, c 0 1 1 0
# 149: 6.2 3.4 5.4 2.3 virginica a, c 1 0 1 0
# 150: 5.9 3.0 5.1 1.8 virginica a, d 1 0 0 1
Here is another option using matrix numeric index:
l <- strsplit(DT[["Samples"]], ",")
nl <- lengths(l)
ul <- unlist(l)
cols <- sort(unique(ul))
DT[, (cols) := {
m <- matrix(0L, nrow=.N, ncol=length(cols))
m[cbind(rep(1L:.N, nl), match(ul, cols))] <- 1L
as.data.table(m)
}]
output:
Subject Samples a b c d
1: 1 a,d 1 0 0 1
2: 2 a,c 1 0 1 0
3: 3 a,c 1 0 1 0
4: 4 b,c 0 1 1 0
5: 5 a,c 1 0 1 0
6: 6 a,d 1 0 0 1
7: 7 b,c 0 1 1 0
8: 8 b,c 0 1 1 0
9: 9 b,d 0 1 0 1
10: 10 a,c 1 0 1 0
data:
DT <- fread("Subject Samples
1 a,d
2 a,c
3 a,c
4 b,c
5 a,c
6 a,d
7 b,c
8 b,c
9 b,d
10 a,c", sep=" ")
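For readers who want to see the matrix-index trick in isolation, here is a minimal base R sketch of the same idea without data.table. It assumes the values are separated by ", " (comma plus space, as in the question's Samples column), and the names samples, parts, values and dummy are invented for this example.

```r
samples <- c("a, d", "a, c", "b, c")

parts  <- strsplit(samples, ", ", fixed = TRUE)  # one character vector per row
values <- sort(unique(unlist(parts)))            # the dummy columns to create

# Start from an all-zero matrix, one row per input and one column per value
dummy <- matrix(0L, nrow = length(samples), ncol = length(values),
                dimnames = list(NULL, values))

# A two-column index matrix addresses individual (row, column) cells
row_idx <- rep(seq_along(parts), lengths(parts))
col_idx <- match(unlist(parts), values)
dummy[cbind(row_idx, col_idx)] <- 1L

dummy
#>      a b c d
#> [1,] 1 0 0 1
#> [2,] 1 0 1 0
#> [3,] 0 1 1 0
```

All cells are filled in one vectorised assignment, with no loop over rows.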