I have a data frame of time series data with daily temperature observations. I need to create a dummy variable that counts each day with a temperature above a threshold of 5 °C. That would be easy on its own, but there is an additional condition: counting starts only after ten consecutive days above the threshold have occurred. Here's an example data frame:
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
I think I got it done, but with too many loops for my liking. This is what I did:
df$dummyUnconditional <- 0
df$dummyHead <- 0
df$dummyTail <- 0
for(i in 1:nrow(df)){
if(df$temp[i] > 5){
df$dummyUnconditional[i] <- 1
}
}
for(i in 1:(nrow(df)-9)){
if(sum(df$dummyUnconditional[i:(i+9)]) == 10){
df$dummyHead[i] <- 1
}
}
for(i in 10:nrow(df)){
if(sum(df$dummyUnconditional[(i-9):i]) == 10){
df$dummyTail[i] <- 1
}
}
df$dummyConditional <- ifelse(df$dummyHead == 1 | df$dummyTail == 1, 1, 0)
Could anyone suggest simpler ways for doing this?
Here's a base R option using rle:
df$dummy <- with(rle(df$temp > 5), rep(as.integer(values & lengths >= 10), lengths))
Some explanation: The task is a classic use case for the run length encoding (rle) function, imo. We first check if the value of temp is greater than 5 (creating a logical vector) and apply rle on that vector resulting in:
> rle(df$temp > 5)
#Run Length Encoding
# lengths: int [1:7] 66 1 1 225 2 1 69
# values : logi [1:7] FALSE TRUE FALSE TRUE FALSE TRUE ...
Now we want to find those cases where values is TRUE (i.e. temp is greater than 5) and where at the same time lengths is at least 10 (i.e. at least ten consecutive temp values are greater than 5). We do this by running:
values & lengths >= 10
And finally, since we want to return a vector of the same length as nrow(df), we use rep(..., lengths), and as.integer in order to return 1/0 instead of TRUE/FALSE.
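To make the three steps concrete, here is a small sketch on a made-up vector. The names above and min_run are hypothetical, and the minimum run length is 4 instead of 10 just to keep the example short:

```r
# Toy logical vector: TRUE marks a day above the threshold
above <- c(FALSE, rep(TRUE, 3), FALSE, rep(TRUE, 4), FALSE)
min_run <- 4  # hypothetical minimum run length (10 in the question)

r <- rle(above)
r$lengths  # 1 3 1 4 1
r$values   # FALSE TRUE FALSE TRUE FALSE

# One flag per run: TRUE only for runs of TRUE that are long enough
keep <- r$values & r$lengths >= min_run

# Expand back to one 0/1 flag per day
dummy <- rep(as.integer(keep), r$lengths)
dummy
# [1] 0 0 0 0 0 1 1 1 1 0
```

The run of three TRUEs is dropped because it is shorter than min_run, while the run of four is kept in full.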
I think you could use a combination of a simple ifelse and the rollapply function in the zoo package to achieve what you are looking for. The final step just involves padding the result to account for the first N-1 days, where there isn't enough information to fill the window.
library(zoo)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
df$above5 <- ifelse(df$temp > 5, 1, 0)
temp <- rollapply(df$above5, 10, sum)
df$conseq <- c(rep(0, 9),temp)
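For what it's worth, the padding idea can be seen on a toy vector without zoo. This is only a rough sketch: it uses base R's stats::filter() as a stand-in for rollapply() (my substitution, not what the code above uses), and the window size k = 3 and the names roll and flag are made up for illustration:

```r
x <- c(1, 1, 0, 1, 1, 1, 0)  # 1 = day above the threshold
k <- 3                       # hypothetical window size (10 in the question)

# Rolling sum over the current day and the k - 1 days before it;
# the first k - 1 positions have no full window and come back as NA
roll <- as.numeric(stats::filter(x, rep(1, k), sides = 1))
roll
# [1] NA NA  2  2  2  3  2

# Pad the incomplete windows with 0, then flag days that close a full run
roll[is.na(roll)] <- 0
flag <- as.integer(roll == k)
flag
# [1] 0 0 0 0 0 1 0
```

Note that the rolling sum is a count between 0 and k, so it still has to be compared with k to get a 0/1 dummy.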
I would do this:
set.seed(42)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
thr <- 5
df$dum <- 0
#find first 10 consecutive values above threshold
test1 <- stats::filter(df$temp > thr, rep(1,10), sides = 1) == 10L
test1[1:9] <- FALSE
#index of the day that completes the first run of 10
n <- which(test1)[1]
#count days above threshold after that
df$dum[(n+1):nrow(df)] <- cumsum(df$temp[(n+1):nrow(df)] > thr)
I have a matrix with 52 columns, and 5,000 rows. I want to find the number of columns that contain a value less than or equal to a value (for example, how many columns out of 52 contain a number less than or equal to 10)
I was trying rowSum but I cannot remember / find a way to make this work.
Thanks!
A possible solution:
m <- matrix(1:9, 3, 3)
sum(colSums(m <= 5) != 0)
#> [1] 2
How about writing your own function?
Here's the code.
# counts the columns of df that contain at least one value <= val
count_cols = function(df, val)
{
checks = 0
for (i in seq_len(ncol(df)))
{
if(any(df[,i] <= val))
checks = checks + 1
}
return (checks)
}
A = matrix(runif(100), 10, 10)
count_cols(A, 0.5)
Say we have a matrix mat of dimensions 5000 x 52:
set.seed(1234)
mat <- matrix(trunc(runif(5000*52)*1e5) , 5000 , 52)
dim(mat)
#> [1] 5000 52
Then we can find how many columns out of 52 contain a number less than or equal to 10 using
sum(apply(mat , 2 , \(x) any(x <= 10)))
#> 24
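As a quick sanity check, the colSums() and apply()/any() approaches can be compared on a tiny hand-made matrix. This is just a sketch; m, limit and the result names are made up for the example:

```r
# 3 x 3 matrix, filled column by column
m <- matrix(c(3, 20, 15,    # column 1: one value <= 10
              12, 11, 30,   # column 2: no value <= 10
              7, 10, 40),   # column 3: two values <= 10
            nrow = 3)
limit <- 10

# Number of qualifying values in each column
hits_per_col <- colSums(m <= limit)
hits_per_col
# [1] 1 0 2

# Number of columns with at least one qualifying value, two ways
n_cols_a <- sum(hits_per_col > 0)
n_cols_b <- sum(apply(m, 2, function(x) any(x <= limit)))
c(n_cols_a, n_cols_b)
# [1] 2 2
```

Both count columns, not values: column 3 contributes 1 even though it holds two values at or below the limit.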
I want to check both bounds of a vector in an if condition. However, I get the warning "the condition has length > 1 and only the first element will be used" when I run the following code:
rho <- c(0.8,0,-0.5)
sigma.S <- 0.4
sigma.M <- 0.1
mu.S <- 0.06
T <- 1
N <- 365
dt <- T/N
m <- c(100,102,100,99,101)
z <- rnorm(N)
P <- matrix(0, N, 1)
P[1] <- m[1]
for (i in 2:N){
P[i] <- P[i-1]*( 1 + sigma.M*sqrt(dt)*z[i] )
}
tPts <- c(0,91,182,273,364)
yPts <- c(m[1]-P[1],m[2]-P[92],m[3]-P[183],m[4]-P[274],m[5]-P[365])
a <- tPts[1]
A <- yPts[1]
for(i in 2:5){
t <- seq(0,364,1)
b <- tPts[i]
B <- yPts[i]
if(a<=t & t<=b){
y <- ( B*(t-a) - A*(t-b) )/(b-a)
return(y)
}
a <- b
A <- B
}
Can anyone see what the problem is here? Thanks in advance!
We could change the if condition inside the loop by wrapping each comparison in all():
if(all(a<=t) && all(t<=b))
This assumes the condition has to hold for every element of 't'.
The comparisons a <= t and t <= b each return a logical vector of the same length as 't', which here is a sequence from 0 to 364. Even when one operand ('a' or 'b') has length 1, the comparison operator recycles it across the longer vector:
5 < (1:6)
#[1] FALSE FALSE FALSE FALSE FALSE TRUE
if/else, on the other hand, requires a condition of length 1.
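If the intent of the loop is piecewise linear interpolation between the (tPts, yPts) points (my reading of the formula, not something the question states), another option is to use the logical vector for indexing instead of putting it in if(). A minimal sketch, with random stand-in values for yPts rather than the ones derived from m and P in the question:

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
tPts <- c(0, 91, 182, 273, 364)
yPts <- rnorm(5)                 # stand-in values for illustration
t <- 0:364
y <- numeric(length(t))

for (i in 2:length(tPts)) {
  a <- tPts[i - 1]; b <- tPts[i]
  A <- yPts[i - 1]; B <- yPts[i]
  idx <- a <= t & t <= b         # logical vector, one element per value of t
  # Fill only the elements of y that fall inside [a, b]
  y[idx] <- (B * (t[idx] - a) - A * (t[idx] - b)) / (b - a)
}

# stats::approx() does the same linear interpolation in one call
y2 <- approx(tPts, yPts, xout = t)$y
all.equal(y, y2)
```

Here the elementwise & is what we want, because idx is used to select elements rather than to decide a single branch.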
I have a dataset with 2 columns: a boolean column and a value column. I would like to find the sum of the F values using a while loop. The code is shown below, but it gives an error:
sum <- 0
FM <- 0
idx <- 1
while ( idx <= nrow(dataset)){
if(subset(dataset,boolean=="F")){
sum <- sum + dataset [ idx,"value" ]
FM <- FM + 1
}
idx <- idx + 1
}
print(sum)
error message is : Error in idx : object 'idx' not found
Summing a logical vector counts how many TRUE values it contains. Since in this case you want to count the FALSE values, we can negate the vector first and then use sum.
sum(!df$boolean)
#[1] 2
However, I guess you want this in a while loop. You can iterate over every value in boolean column, check if it is FALSE and increment the count.
i <- 1
FM <- 0
while(i <= nrow(df)) {
if(!df$boolean[i])
FM <- FM + 1
i <- i + 1
}
FM
#[1] 2
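We can also do this without an if condition, since adding a logical value adds 1 for TRUE and 0 for FALSE. Remember to reset the counters first:
i <- 1
FM <- 0
while(i <= nrow(df)) {
FM <- FM + !df$boolean[i]
i <- i + 1
}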
data
df <- data.frame(boolean= c(TRUE,FALSE,TRUE,TRUE,FALSE),value=c(8,16,4,12,9))
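In case the goal is the sum of the value column for the FALSE rows rather than the number of such rows (the question can be read either way), the same loop can accumulate the values instead. A small sketch on the data above; the accumulator is called total here (my naming) so that it does not mask the base function sum():

```r
df <- data.frame(boolean = c(TRUE, FALSE, TRUE, TRUE, FALSE),
                 value = c(8, 16, 4, 12, 9))

total <- 0
i <- 1
while (i <= nrow(df)) {
  # Add the value only when the flag in this row is FALSE
  if (!df$boolean[i]) total <- total + df$value[i]
  i <- i + 1
}
total
# [1] 25

# Vectorised equivalent without a loop
sum(df$value[!df$boolean])
# [1] 25
```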
First, define some functions to bind a list row-wise and column-wise:
# a function to append vectors row wise
rbindlist <- function(list) {
n <- length(list)
res <- NULL
for (i in seq(n)) res <- rbind(res, list[[i]])
return(res)
}
cbindlist <- function(list) {
n <- length(list)
res <- NULL
for (i in seq(n)) res <- cbind(res, list[[i]])
return(res)
}
# generate sample data
sample.dat <- list()
set.seed(123)
for(i in 1:365){
vec1 <- sample(c(0,1), replace=TRUE, size=5)
sample.dat[[i]] <- vec1
}
dat <- rbindlist(sample.dat)
dat has five columns. Each column is a location and has 365 days of the year (365 rows) with values 1 or 0.
I have another dataframe (see below) which has certain days of the year for each column (location) in dat.
# generate second sample data
set.seed(123)
sample.dat1 <- list()
for(i in 1:5){
vec1 <- sort(sample(c(258:365), replace=TRUE, size=4), decreasing = F)
sample.dat1[[i]] <- vec1
}
dat1 <- cbindlist(sample.dat1)
I need to use dat1 to subset days in dat to do a calculation. An example below:
1) For location 1 (first column in both dat1 and dat):
In column 1 of dat, select the days from 289 till 302 (using dat1), find the longest consecutive occurrence of 1.
Repeat it and this time select the days from 303 (302 + 1) till 343 from dat, find the longest consecutive occurrence of 1.
Repeat it for 343 till 353: select the days from 344 (343 + 1) till 353, find the longest consecutive occurrence of 1.
2) Do this for all the columns
If I want to do sum of 1s, I can do this:
library(tibble)
library(purrr)
dat <- as_tibble(dat) # as.tibble() is deprecated in favour of as_tibble()
dat1 <- as_tibble(dat1)
pmap(list(dat,dat1), ~ {
range1 <- ..2[1]
range2 <- ..2[2]
range3 <- ..2[3]
range4 <- ..2[4]
sum.range1 <- sum(..1[range1:range2]) # this will generate sum between range 1 and range 2
sum.range2 <- sum(..1[range2:range3]) # this will generate sum between range 2 and range 3
sum.range3 <- sum(..1[range3:range4]) # this will generate sum between range 3 and range 4
c(sum.range1=sum.range1,sum.range2=sum.range2,sum.range3=sum.range3)
})
For the longest consecutive occurrence of 1 within each range, I thought of using the rle function. Example below:
pmap(list(dat,dat1), ~ {
range1 <- ..2[1]
range2 <- ..2[2]
range3 <- ..2[3]
range4 <- ..2[4]
spell.range1 <- rle(..1[range1:range2]) # run-length encode the range (runs of both 0 and 1)
spell.1.range1 <- tapply(spell.range1$lengths, spell.range1$values, max)[2] # this should select the maximum consecutive run of 1
spell.range2 <- rle(..1[range2:range3]) # run-length encode the range (runs of both 0 and 1)
spell.1.range2 <- tapply(spell.range2$lengths, spell.range2$values, max)[2] # this should select the maximum consecutive run of 1
spell.range3 <- rle(..1[range3:range4]) # run-length encode the range (runs of both 0 and 1)
spell.1.range3 <- tapply(spell.range3$lengths, spell.range3$values, max)[2] # this should select the maximum consecutive run of 1
c(spell.1.range1 = spell.1.range1, spell.1.range2 = spell.1.range2, spell.1.range3 = spell.1.range3)
})
I get an error, which I think is because I am not using the rle function properly here. I would really like to keep the code as above, since my other code follows the same pattern and the format of the outputs suits my needs, so I would appreciate it if someone could suggest how to fix it.
OP's code does work for me. So, without a specific error message it is impossible to understand why the code is not working for the OP.
However, the sample datasets created by the OP are matrices (before they were coerced to tibble) and I felt challenged to find a way to solve the task in base R without using purrr:
To find the length of the longest run of consecutive occurrences of a particular value val in a vector x, we can use the following function:
max_rle <- function(x, val) {
y <- rle(x)
len <- y$lengths[y$values == val]
if (length(len) > 0) max(len) else NA
}
Examples:
max_rle(c(0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1), 1)
[1] 4
max_rle(c(0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1), 0)
[1] 2
# find consecutive occurrences in column batches
lapply(seq_len(ncol(dat1)), function(col_num) {
start <- head(dat1[, col_num], -1L)
end <- tail(dat1[, col_num], -1L) - 1
sapply(seq_along(start), function(range_num) {
max_rle(dat[start[range_num]:end[range_num], col_num], 1)
})
})
[[1]]
[1] 8 4 5
[[2]]
[1] 4 5 2
[[3]]
[1] NA 3 4
[[4]]
[1] 5 5 4
[[5]]
[1] 3 2 3
The outer lapply() loops over the columns of dat and dat1. The inner sapply() loops over the row ranges stored in dat1 and subsets dat accordingly.
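Note that the code above ends each range one day before the next marker. If the ranges should instead follow the question's wording (289 to 302, then 303 to 343, then 344 to 353), the start and end vectors can be built slightly differently. A sketch on a single made-up column; breaks holds the day markers from the question's example, while x and its runs are invented for illustration:

```r
max_rle <- function(x, val) {
  y <- rle(x)
  len <- y$lengths[y$values == val]
  if (length(len) > 0) max(len) else NA
}

breaks <- c(289, 302, 343, 353)                  # day markers for one location
start <- c(breaks[1], head(breaks[-1], -1) + 1)  # 289 303 344
end   <- breaks[-1]                              # 302 343 353

# Made-up 0/1 series for one location
x <- rep(0, 365)
x[290:294] <- 1   # run of 5 inside the first range
x[300:305] <- 1   # run of 6 that straddles the 302/303 boundary
x[310:311] <- 1   # run of 2 inside the second range

# Longest run of 1 in each range; NA when a range contains no 1
res <- mapply(function(s, e) max_rle(x[s:e], 1), start, end)
res
# [1]  5  3 NA
```

The run that straddles the boundary is split, contributing 3 days to each of the first two ranges.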
I have a data.frame column with hourly values and want to count all negative values, if they are in a sequence of at least six consecutive negatives.
Example:
df <- data.frame(Values=c(-2, 2, 3, 4,-1,-1,-2,-3,
-1,-1,-1, 5, 4, 2,-4,-2,
-3,-4,-1, 3, 4, 4,-3,-1,
-2,-2,-3,-4))
The expected result would be 13, since the middle sequence of consecutive negatives contains only five negatives and is thus not counted.
You could use the base function rle() along with sign(). The sign() function converts negatives and positives to -1 and 1, respectively. This makes a nice vector to pass to rle() to get the run-lengths. Then we can subset the run-lengths with the desired condition and take the sum.
with(rle(sign(df$Values)), sum(lengths[lengths >= 6 & values < 0]))
# [1] 13
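Breaking the one-liner into steps may make it easier to follow. This sketch uses the same data as the question; the intermediate names values, r and keep are just for illustration:

```r
values <- c(-2, 2, 3, 4, -1, -1, -2, -3,
            -1, -1, -1, 5, 4, 2, -4, -2,
            -3, -4, -1, 3, 4, 4, -3, -1,
            -2, -2, -3, -4)

# sign() maps negatives to -1 and positives to 1, so rle() sees runs of sign
r <- rle(sign(values))
r$lengths
# [1] 1 3 7 3 5 3 6
r$values
# [1] -1  1 -1  1 -1  1 -1

# Keep only negative runs with at least six elements, then add up their lengths
keep <- r$values < 0 & r$lengths >= 6
sum(r$lengths[keep])
# [1] 13
```

Only the negative runs of length 7 and 6 pass the filter; the runs of length 1 and 5 are dropped.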
Try:
library(cgwtools)
res <- seqle(which(df<0))
sum(res$lengths[res$lengths>=6])
[1] 13
You can always define your own function and call it:
NegativeValues <- function(x) {
count <- 0
innercount <- 0
for (i in c(x, 0)) {
if (i < 0) {
innercount <- innercount + 1
}
else {
if (innercount >= 6)
count <- count + innercount
innercount <- 0
}
}
return(count)
}
NegativeValues(df$Values)