I have a data frame "var" and I need to get a vector output that satisfies the following conditions.
Basically, what I am trying to do is the following: if the value of psqi_2_sleepstart1 is 15 or less, comp21score needs to be assigned the value 0; between 16 and 30, the value 1; between 31 and 60, the value 2; and over 60, the value 3. For example, if the data frame had values for psqi_2_sleepstart1 of 16, 40, 6 and 10, I want the output to be 1, 2, 0, 0. I was using the ifelse statement, but I got the error that argument "yes" is missing, with no default.
Here is my attempt:
for (i in 1: nrow(var)) {
ifelse (psqi_2_sleepstart1 <= 15)
comp21score [i] <- 0
ifelse (psqi_2_sleepstart1 > 15 & psqi_2_sleepstart1 <= 30)
comp21score [i] <- 1
ifelse (psqi_2_sleepstart1 > 30 & psqi_2_sleepstart1 <= 60)
comp21score [i] <- 2
ifelse (psqi_2_sleepstart1 > 60)
comp21score [i] <- 3
}
print (comp21score)
Does anyone have suggestions on what I might be able to use instead or how to avoid this error?
Thanks!
Just for kicks - here is a case_when dplyr example (as mentioned in the comments):
library(dplyr)
DF1 <- data.frame("score" = 0:20)
DF1 <- DF1 %>% mutate(value = case_when(
  score < 5 ~ 1,
  score >= 5 & score < 10 ~ 2,  # both bounds must hold, so use & rather than |
  score >= 10 ~ 3
))
> DF1
score value
1 0 1
2 1 1
3 2 1
.....
Use this layout instead:
x <- 2
if (x==0) {
y <- log
} else if (x == 1) {
y <- identity
} else if (x == 2) {
y <- function(x) x^2
}
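For the original question, the error arises because ifelse() takes three arguments (test, yes, no), and the attempt supplies only the test. A minimal vectorised sketch that avoids both the loop and nested ifelse calls is shown below. It assumes right-closed bins, (-Inf, 15], (15, 30], (30, 60] and (60, Inf), and the example data (including the extra value 75) is made up for illustration.

```r
# Hypothetical example data; the column name follows the question
var <- data.frame(psqi_2_sleepstart1 = c(16, 40, 6, 10, 75))

# cut() assigns each value to a bin; with labels = FALSE it returns the
# bin index (1 to 4), so subtracting 1 gives the scores 0 to 3
var$comp21score <- cut(var$psqi_2_sleepstart1,
                       breaks = c(-Inf, 15, 30, 60, Inf),
                       labels = FALSE) - 1

print(var$comp21score)
# [1] 1 2 0 0 3
```

Because cut() works on the whole column at once, no index i is needed.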
I have a question regarding a for-loop within R's dplyr. Imagine I have the following dataframe:
id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)
df$count_x <- cumsum(df$check)
df$count_y <- NA
df$count_y[1] <- ifelse(df$check[1] == 0, 0, 1)
co <- df$count_y[1]
I want to fill the variable count_y based on an adjusted cumulative function below:
for (idx in 2:nrow(df)){
if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
co <- 1
df[idx, 4] <- co
} else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
co <- co + 1
df[idx, 4] <- co
} else if (df[idx, 2] == 0){
df[idx, 4] <- co
}
}
The output of this for-loop is correct. However, in my current data set, I have many IDs, and using a for loop to iterate over the IDs will take too much time. I'm trying to use the functionality of dplyr to speed up the process.
id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)
df <- df %>% group_by(id) %>% mutate(count_x = cumsum(check),
count_y = NA) %>% ungroup()
df <- df %>% group_by(id) %>% mutate(count_y = replace(count_y, 1, ifelse(check[1] == 0, 0 , 1)))
count_n <- function(df){
co <- df$count_y[1]
for (idx in 2:nrow(df)){
if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
co <- 1
df[idx, 4] <- co
} else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
co <- co + 1
df[idx, 4] <- co
} else if (df[idx, 2] == 0){
df[idx, 4] <- co
}
}
}
I want to use mutate to call the function count_n to fill count_y as described above. I'm aware that I'm passing just one variable, whereas I have to pass a data frame, since the function relies on the columns 'check' (column 2) and 'count_y' (column 4). I have tried multiple options (mutate_at, all, etc.) but I couldn't make it work. What can I do differently?
df <- df %>% group_by(id) %>% mutate_at(vars(count_y), ~count_n(.))
I think this is the perfect case to use purrr::accumulate2().
purrr::accumulate() is often used to calculate conditional cumulative sums. It takes a function as the second argument. This function should have 2 arguments: the cumulative output co, and the currently evaluated value x.
purrr::accumulate2() allows us to use a second variable to iterate on, and here we use lag(check) as lx. The tricky part is that this second variable should be one item shorter, as it does not matter for the initial value.
Here is the code, matching your expected output.
library(tidyverse)
df = structure(list(id = c(8, 8, 8, 8, 8, 8, 8, 8, 8),
check = c(0, 1, 1, 0, 0, 1, 0, 0, 0),
count_x = c(0, 1, 2, 2, 2, 3, 3, 3, 3)),
row.names = c(NA, -9L), class = "data.frame")
df %>%
mutate(
count_y = accumulate2(check, lag(check)[-1], function(co, x, lx){
case_when(
x==0 ~ co,
x==1 & lx==0 ~ 1,
x==1 & lx==1 ~ co+1,
TRUE ~ 999 #error value in case of unexpected input
)
})
)
#> id check count_x count_y
#> 1 8 0 0 0
#> 2 8 1 1 1
#> 3 8 1 2 2
#> 4 8 0 2 2
#> 5 8 0 2 2
#> 6 8 1 3 1
#> 7 8 0 3 1
#> 8 8 0 3 1
#> 9 8 0 3 1
Created on 2021-05-05 by the reprex package (v2.0.0)
The first issue is that you weren't returning anything in your function. The second issue is that you don't need to use a mutate_at (or even a mutate as would be more appropriate for a single variable) when you're writing the function that modifies the entire tibble. The simplest way to get it working is adding a return statement and running it in line like so:
count_n <- function(df){
co <- df$count_y[1]
for (idx in 2:nrow(df)){
if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
co <- 1
df[idx, 4] <- co
} else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
co <- co + 1
df[idx, 4] <- co
} else if (df[idx, 2] == 0){
df[idx, 4] <- co
}
}
return(df)
}
df %>% group_by(id) %>% count_n(.)
However, I would use Dan's answer above because it's much cleaner and has the advantage of not running a for loop, which isn't very "R". :)
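If avoiding the loop is the main goal, the same count can also be sketched in base R without purrr. This is only one possible approach, assuming check contains only 0s and 1s; the helper name count_runs and the second id in the example data are made up for illustration.

```r
# Hypothetical helper: counts position within the current (or most recent)
# run of 1s, carrying the count forward over 0s
count_runs <- function(check) {
  prev <- c(0, head(check, -1))             # previous value, 0 before the first row
  run_id <- cumsum(check == 1 & prev == 0)  # a new run starts at each 0 -> 1 step
  ave(check, run_id, FUN = cumsum)          # cumulative count inside each run
}

# Example data: id 8 is from the question, id 9 is invented
df <- data.frame(id = rep(c(8, 9), times = c(9, 4)),
                 check = c(0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1))

# ave() applies the helper separately within each id
df$count_y <- ave(df$check, df$id, FUN = count_runs)
print(df$count_y)
# [1] 0 1 2 2 2 1 1 1 1 1 2 2 1
```

The grouping is handled by ave(), so the helper itself only ever sees the check values of one id.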
I have a question about replacing values within a vector.
The algorithm should find the value to replace when a certain condition is met: in this case, the value whose difference from the previous value is -20. So I would prefer to use the diff function.
Here is what I mean
x <- c(20,20,0,20,0,5)
> diff(x)
[1] 0 -20 20 -20 5
So in this case each 0 produces a difference of -20, and I want to change those 0s to 20.
I know the easiest solution is directly assigning x[3] <- 20 or x[5] <- 20.
However, the location of the 0s is always different, so I need an automated process that can do that. Thanks!
**EDIT
if we need to do this in a grouped data.frame
> df
x gr
1 20 1
2 20 1
3 0 1
4 20 1
5 0 1
6 5 1
7 33 2
8 0 2
9 20 2
10 0 2
11 20 2
12 0 2
How can we implement this ?
modify <- function(x){
value_search = c(0, 33)
value_replacement = c(20, 44)
for (k in 1:length(value_search)) {
index_position = which(x %in% value_search[k])
replacement = value_replacement[k]
for (i in index_position) {
x[i] = replacement
}
}
}
df%>%
group_by(gr)%>%
mutate(modif_x=modify(x))
Error in mutate_impl(.data, dots) :
Evaluation error: 'match' requires vector arguments.
You can do it using which to get the position, i.e.
x[which(diff(x) == -20)+1] <- 20
x
#[1] 20 20 20 20 20 5
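For the grouped case in the question's edit, the same idea can be applied per group. The sketch below uses base R's ave() rather than dplyr; the helper name replace_after_drop and its parameter names are made up for illustration, and the data frame is the one from the edit.

```r
# Data frame from the question's edit
df <- data.frame(
  x  = c(20, 20, 0, 20, 0, 5, 33, 0, 20, 0, 20, 0),
  gr = rep(1:2, each = 6)
)

# Hypothetical helper: replace every value that sits `drop` below its predecessor
replace_after_drop <- function(v, drop = -20, new_value = 20) {
  v[which(diff(v) == drop) + 1] <- new_value
  v
}

# ave() runs the helper within each group, so diff() never crosses a group boundary
df$modif_x <- ave(df$x, df$gr, FUN = replace_after_drop)
print(df$modif_x)
# [1] 20 20 20 20 20  5 33  0 20 20 20 20
```

Note that the 0 in row 8 is kept, because its difference from 33 is -33 rather than -20.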
If you want a generic way to replace values of a vector based on particular values, I would approach it this way.
x = c(20,20,0,20,0,5)
value_search = 0
value_replacement = 20
index_position = which(x %in% value_search)
for (i in index_position) {
x[i] = value_replacement
}
But this only works for a single value. If you want to look for multiple values, you can use a nested loop as below:
x = c(20,20,0,20,0,5,33)
value_search = c(0, 33)
value_replacement = c(20, 44)
for (k in 1:length(value_search)) {
index_position = which(x %in% value_search[k])
replacement = value_replacement[k]
for (i in index_position) {
x[i] = replacement
}
}
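The nested loop above can also be written as a single lookup with match(). This is a sketch of an alternative, not the answerer's method; it assumes value_search and value_replacement have the same length and are paired by position.

```r
x <- c(20, 20, 0, 20, 0, 5, 33)
value_search <- c(0, 33)
value_replacement <- c(20, 44)

# match() gives, for each element of x, its position in value_search (NA if absent)
idx <- match(x, value_search)
hit <- !is.na(idx)

# Replace only the matched elements with their paired replacement
x[hit] <- value_replacement[idx[hit]]
print(x)
# [1] 20 20 20 20 20  5 44
```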
In response to OP's edits:
There are any number of ways to do this:
x = c(20,20,0,20,0,5,33)
gr = c(1,1,1,1,2,2,2)
df = data.frame(x, gr)
func_replace <- function(source, value_search, value_replacement) {
  # loop over the search values (not over the elements of source)
  for (k in seq_along(value_search)) {
    # search within the function's own argument, not the global x
    index_position = which(source %in% value_search[k])
    replacement = value_replacement[k]
    for (i in index_position) {
      source[i] = replacement
    } # for i loop
  } # for k loop
  return(source)
} # func_replace
value_search = c(0, 33)
value_replacement = c(20, 44)
gr_value = 1
# replaced values for rows in the chosen group, NA elsewhere
df$replacement = with(df, ifelse(gr == gr_value, func_replace(x, value_search, value_replacement), NA))
I have a dataframe of time series data with daily observations of temperatures. I need to create a dummy variable that counts each day that has a temperature above a threshold of 5C. This would be easy in itself, but there is an additional condition: counting starts only after ten consecutive days above the threshold have occurred. Here's an example dataframe:
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
I think I got it done, but with too many loops for my liking. This is what I did:
df$dummyUnconditional <- 0
df$dummyHead <- 0
df$dummyTail <- 0
for(i in 1:nrow(df)){
if(df$temp[i] > 5){
df$dummyUnconditional[i] <- 1
}
}
for(i in 1:(nrow(df)-9)){
if(sum(df$dummyUnconditional[i:(i+9)]) == 10){
df$dummyHead[i] <- 1
}
}
for(i in 9:nrow(df)){
if(sum(df$dummyUnconditional[(i-9):i]) == 10){
df$dummyTail[i] <- 1
}
}
df$dummyConditional <- ifelse(df$dummyHead == 1 | df$dummyTail == 1, 1, 0)
Could anyone suggest simpler ways for doing this?
Here's a base R option using rle:
df$dummy <- with(rle(df$temp > 5), rep(as.integer(values & lengths >= 10), lengths))
Some explanation: The task is a classic use case for the run length encoding (rle) function, imo. We first check if the value of temp is greater than 5 (creating a logical vector) and apply rle on that vector resulting in:
> rle(df$temp > 5)
#Run Length Encoding
# lengths: int [1:7] 66 1 1 225 2 1 69
# values : logi [1:7] FALSE TRUE FALSE TRUE FALSE TRUE ...
Now we want to find those cases where values is TRUE (i.e. temp is greater than 5) and where at the same time lengths is greater than or equal to 10 (i.e. at least ten consecutive temp values are greater than 5). We do this by running:
values & lengths >= 10
And finally, since we want to return a vector of the same length as nrow(df), we use rep(..., lengths) and as.integer in order to return 1/0 instead of TRUE/FALSE.
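To see the steps on something small, here is a toy sketch of the same rle idea. The temp vector and the shorter minimum run length of 3 (instead of 10) are made up so that the result is easy to check by eye.

```r
# Toy data: a run of two values above 5, then a run of four
temp <- c(1, 6, 7, 2, 6, 7, 8, 9, 3)
min_run <- 3  # assumed shorter run length, for illustration only

r <- rle(temp > 5)
# r$lengths: 1 2 1 4 1
# r$values : FALSE TRUE FALSE TRUE FALSE

# Keep only TRUE runs that are long enough, then expand back to full length
dummy <- rep(as.integer(r$values & r$lengths >= min_run), r$lengths)
print(dummy)
# [1] 0 0 0 0 1 1 1 1 0
```

The run of two days above the threshold is dropped, while the run of four is flagged in full.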
I think you could use a combination of a simple ifelse and the rollapply function in the zoo package to achieve what you are looking for. The final step just involves padding the result to account for the first N-1 days, where there isn't enough information to fill the window.
library(zoo)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
df$above5 <- ifelse(df$temp > 5, 1, 0)
temp <- rollapply(df$above5, 10, sum)
df$conseq <- c(rep(0, 9),temp)
I would do this:
set.seed(42)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
thr <- 5
df$dum <- 0
#find first 10 consecutive values above threshold
#(stats::filter is named explicitly so that a loaded dplyr does not mask it)
test1 <- stats::filter(df$temp > thr, rep(1,10), sides = 1) == 10L
test1[1:9] <- FALSE
#index of the first day that completes a 10-day run
n <- which(test1)[1]
#count days above threshold after that
df$dum[(n+1):nrow(df)] <- cumsum(df$temp[(n+1):nrow(df)] > thr)
I'd like to apply a function to each row in R that "scores" each value of a row, x. It seems like I'd use the 'apply' function in R to do this, but I'm not sure how to do it. I'd like to input a dataframe with a column of integer values and have a vector output with the scores. The code I have now is as follows:
ScoreFn <- function(x){
score <- 0
if(x<1) {
score <- 0
} else if(x==1) {
score <- 5
} else if(x==2) {
score <- 10
} else if(x==3) {
score <- 20
} else if(x >= 4 && x <= 10) {
score <- 30
} else if(x >= 11 && x <= 20) {
score <- 40
} else if(x >= 21) {
score <- 50
}
return(score)
}
apply(df$x, 1, ScoreFn())
Also, I am getting this message. Not sure the best way to do this function.
1: In if (x < 1) { :
the condition has length > 1 and only the first element will be used
2: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
3: In if (x == 2) { :
the condition has length > 1 and only the first element will be used
4: In if (x == 3) { :
the condition has length > 1 and only the first element will be used
...
You can make a vectorised function, using cut, so you don't have to use apply at all:
scorefun <- function(x){
as.numeric(as.character(cut(x, breaks = c(0, 1, 2, 3, 4, 11, 21, Inf),
labels = c(0,5,10,20,30,40, 50), right = FALSE)))
}
df <- data.frame(x = 0:10)
scorefun(df$x)
[1] 0 5 10 20 30 30 30 30 30 30 30
This also has the bonus that cut saves you the heavy lifting of typing the if/elses, as well as being about 10x faster than the non-vectorised version.
It works by cutting the given vector (in this case df$x) into factors by slices, given by breaks. We then label them with your scores, and then get out the numbers again by using as.character and as.numeric.
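A related sketch uses findInterval() as an index into a score table, which avoids the round trip through factor labels. It is an alternative to the cut version above, not a replacement for it. The function name score_lookup is made up, and it assumes integer input as in the question (unlike the cut version with breaks starting at 0, it also maps negative values to 0, matching the x < 1 branch of ScoreFn).

```r
# Hypothetical lookup-based scorer, assuming integer input
score_lookup <- function(x) {
  lower_bounds <- c(1, 2, 3, 4, 11, 21)     # where each score band starts
  scores <- c(0, 5, 10, 20, 30, 40, 50)     # one more score than bounds
  # findInterval() counts how many lower bounds are <= x (0 to 6)
  scores[findInterval(x, lower_bounds) + 1]
}

score_lookup(c(-2, 0, 1, 2, 3, 4, 10, 11, 20, 21, 100))
# [1]  0  0  5 10 20 30 30 40 40 50 50
```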
If your input is just a column of a data.frame, you don't need to use apply. You can use sapply instead.
ScoreFn <- function(x){
score <- 0
if(x<1) {
score <- 0
} else if(x==1) {
score <- 5
} else if(x==2) {
score <- 10
} else if(x==3) {
score <- 20
} else if(x >= 4 && x <= 10) {
score <- 30
} else if(x >= 11 && x <= 20) {
score <- 40
} else if(x >= 21) {
score <- 50
}
return(score)
}
# Return a vector of scores
sapply(df$x, ScoreFn)
I'm trying to find the point at which participants reach 8 contiguous responses in a row that are greater than 3. For example:
x <- c(2,2,4,4,4,4,5,5,5,5,7)
I want to return the value 10.
I tried this code (thanks @DWin):
which( rle(x)$values>3 & rle(x)$lengths >= 8)
sum(rle(x)$lengths[ 1:(min(which(rle(x)$lengths >= 8))-1) ]) + 8
The problem with the above code is that it only works if the responses in the run are all identical and greater than 3, so for this x the code returns zero.
If:
x <- c(2,2,4,4,4,4,4,4,4,4,7)
the code works fine, but this isn't how my data looks.
Thanks in advance!
Why don't you create a new vector that contains the identical values that rle needs to work properly? You can use ifelse() for this and put everything into a function:
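FUN <- function(x, value, runlength) {
  x2 <- ifelse(x > value, 1, 0)
  runs <- rle(x2)
  # first run of values above `value` that is at least `runlength` long
  # (checking values == 1 stops a long run of low responses from matching)
  i <- which(runs$values == 1 & runs$lengths >= runlength)[1]
  ret <- sum(runs$lengths[seq_len(i - 1)]) + runlength
  return(ret)
}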
> FUN(x, value = 3, runlength = 8)
[1] 10
You could just convert your data so that the responses are coded only by the measure of interest (greater than 3 or not), and then your code will work as it is, with x replaced by x1.
x1 <- ifelse( x > 3, 4, 0 )
But if I was already doing this I might rewrite the code slightly more clearly this way.
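runl <- rle(x1)
# use >= so that a run of exactly 8 also counts
i <- which( runl$lengths >= 8 & runl$values > 3 )[1]
# seq_len() also handles the case where the qualifying run comes first
sum( runl$lengths[seq_len(i - 1)] ) + 8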
Here's a vectorized way of doing it with just cumsum and cummax. Let's take an example that has a short (less than length 8) sequence of elements greater than 3 as well as a long one, just to make sure it's doing the right thing.
> x <- c(2,2,4,5,6,7,2,2,4,9,8,7,6,5,4,5,6,9,2,2,9)
> x3 <- x > 3
> cumsum(x3) - cummax(cumsum(x3)*(!x3))
[1] 0 0 1 2 3 4 0 0 1 2 3 4 5 6 7 8 9 10 0 0 1
> which( cumsum(x3) - cummax(cumsum(x3)*(!x3)) == 8)[1]
[1] 16
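The cumsum/cummax expression can be wrapped in a small function and checked against the vector from the question. This is just a sketch; the function name first_run_end and its parameter names are made up for illustration.

```r
# Hypothetical wrapper: index at which the first streak of `runlength`
# consecutive values above `value` is completed (NA if there is none)
first_run_end <- function(x, value, runlength) {
  above <- x > value
  # length of the current streak of TRUEs at each position
  streak <- cumsum(above) - cummax(cumsum(above) * (!above))
  which(streak == runlength)[1]
}

first_run_end(c(2, 2, 4, 4, 4, 4, 5, 5, 5, 5, 7), value = 3, runlength = 8)
# [1] 10
```

Because the streak counter resets at every value that is not above the threshold, the responses in the run do not need to be identical.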