I have a data.frame column with hourly values and want to count all negative values, but only those that occur in a run of at least six consecutive negatives.
Example:
df <- data.frame(Values=c(-2, 2, 3, 4,-1,-1,-2,-3,
-1,-1,-1, 5, 4, 2,-4,-2,
-3,-4,-1, 3, 4, 4,-3,-1,
-2,-2,-3,-4))
The expected result would be 13 (7 + 6): the single leading -2 and the middle run of five consecutive negatives are too short to be counted.
You could use the base function rle() along with sign(). The sign() function converts negatives and positives to -1 and 1, respectively. This makes a nice vector to pass to rle() to get the run-lengths. Then we can subset the run-lengths with the desired condition and take the sum.
with(rle(sign(df$Values)), sum(lengths[lengths >= 6 & values < 0]))
# [1] 13
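For reference, here is the intermediate run-length encoding for the example data: the negative runs have lengths 1, 7, 5 and 6, and only the runs of 7 and 6 meet the condition.
rle(sign(df$Values))
# Run Length Encoding
#   lengths: int [1:7] 1 3 7 3 5 3 6
#   values : num [1:7] -1 1 -1 1 -1 1 -1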
Try seqle() from the cgwtools package. It works like rle(), but on runs of consecutive integers, so applying it to the positions of the negative values gives the length of each block of negatives:
library(cgwtools)
res <- seqle(which(df < 0))
sum(res$lengths[res$lengths >= 6])
# [1] 13
You can always define your own function and call it:
NegativeValues <- function(x) {
  count <- 0
  innercount <- 0
  # a trailing 0 is appended so the final run of negatives is also flushed
  for (i in c(x, 0)) {
    if (i < 0) {
      innercount <- innercount + 1
    } else {
      if (innercount >= 6)
        count <- count + innercount
      innercount <- 0
    }
  }
  return(count)
}
NegativeValues(df$Values)
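# [1] 13
For the example data this returns 13 as well (the runs of 7 and 6 negatives), matching the rle() approach above.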
I have a matrix with 52 columns and 5,000 rows. I want to find the number of columns that contain at least one value less than or equal to a given value (for example, how many of the 52 columns contain a number less than or equal to 10).
I was trying rowSums() but I cannot remember / find a way to make this work.
Thanks!
A possible solution:
m <- matrix(1:9, 3, 3)
sum(colSums(m <= 5) != 0)
#> [1] 2
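The intermediate colSums() counts how many qualifying values each column contains, so comparing with zero flags the columns that have at least one:
colSums(m <= 5)
#> [1] 3 2 0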
How about writing your own function?
Here's the code.
count_cols <- function(df, val) {
  checks <- 0
  for (i in 1:ncol(df)) {
    # count this column if any of its values is <= val
    if (any(df[, i] <= val))
      checks <- checks + 1
  }
  return(checks)
}
A <- matrix(runif(100), 10, 10)
count_cols(A, 0.5)
Say the matrix mat has dimensions 5000 x 52:
set.seed(1234)
mat <- matrix(trunc(runif(5000*52)*1e5) , 5000 , 52)
dim(mat)
#> [1] 5000 52
Then we can find how many of the 52 columns contain a number less than or equal to 10 using:
sum(apply(mat , 2 , \(x) any(x <= 10)))
#> [1] 24
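Equivalently, without apply(), the colSums() idiom from the first answer should give the same count:
sum(colSums(mat <= 10) > 0)
#> [1] 24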
I'm trying to set upper and lower boundaries around each element of a vector by adding and subtracting a set value from it. I then want a loop that tells me, for each value in the vector, how many other points in the vector fall within that boundary.
Essentially this is a pseudo-density calculation based on how many values fall within the established range.
I have a vector v containing some values. I add/subtract three from it to get the upper and lower bounds, but I can't create a loop that counts how many other values from the vector fall within them.
v <- c(1, 3, 4, 5, 8, 9, 10, 54)
for (i in v){
vec2 <- (vec +3 > vec[i] & vec -3 < vec[i])
}
vec2
I get NA's from this code.
I've also tried indexing the vec +/- 3 and it also didn't work.
vec2 <- (vec[i] +3 > vec[i] & vec - 3 < vec[i))
What I want: for every value i in the vector, count how many points fall within i - 3 and i + 3.
E.g. the first value is 1, so the upper limit would be 4 and the lower limit -2; counting how many of the vector's values fall within this range gives 3 for the first index (if it includes itself).
vec2 = (3, 4, 3, ...)
Are you looking for something like this? Your code doesn't work because the loop iterates over the values of the vector rather than over its indices (and it refers to vec while the vector is named v).
vec <- c(1, 3, 4, 5, 8, 9, 10, 54) # Input vector
countvalswithin <- vector() # Empty vector that will store counts of values within bounds

# Loop over the values stored in the input vector
for (i in 1:length(vec)) {
  currval <- vec[i] # Take the current value
  lbound <- currval - 3 # Lower bound w.r.t. this value
  ubound <- currval + 3 # Upper bound w.r.t. this value
  # Vector of all values from the source vector except the current value;
  # used for comparison against the current value to find values within bounds
  othervals <- subset(vec, vec != currval)
  currcount <- 1 # Counter of values within bounds; start at 1 to include self (set to 0 to exclude self)
  # Loop over all other values to find those within bounds of the current value
  for (j in 1:length(othervals)) {
    # If the compared value is within bounds of the current value, the counter updates by 1
    if (othervals[j] > lbound & othervals[j] <= ubound) {
      currcount <- currcount + 1
    }
  }
  countvalswithin[i] <- currcount # Store the count for the current value
}

df <- data.frame(vec, countvalswithin) # Input vector and respective counts as a dataframe
df
# vec countvalswithin
# 1 1 3
# 2 3 4
# 3 4 3
# 4 5 4
# 5 8 3
# 6 9 3
# 7 10 3
# 8 54 1
Edit: added comments to the code explaining what it does.
In a for loop we can go over every element of v, build the range (element - 3, element + 3), count how many elements of v fall within that range, and store the result in a new vector vec2.
vec2 <- numeric(length = length(v))
for (i in seq_along(v)) {
vec2[i] <- sum((v >= v[i] - 3) & (v <= v[i] + 3))
}
vec2
#[1] 3 4 4 4 4 3 3 1
However, you can avoid the for loop by using mapply
mapply(function(x, y) sum(v >= y & v <= x), v + 3, v - 3)
#[1] 3 4 4 4 4 3 3 1
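Another vectorized option (not from the original answer, just an equivalent formulation) compares all pairs at once with outer(); since the condition is symmetric, the column sums give the same counts:
colSums(abs(outer(v, v, "-")) <= 3)
#[1] 3 4 4 4 4 3 3 1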
First, define some functions to bind a list of vectors row-wise and column-wise.
# a function to append vectors row wise
rbindlist <- function(list) {
  n <- length(list)
  res <- NULL
  for (i in seq(n)) res <- rbind(res, list[[i]])
  return(res)
}

# a function to append vectors column wise
cbindlist <- function(list) {
  n <- length(list)
  res <- NULL
  for (i in seq(n)) res <- cbind(res, list[[i]])
  return(res)
}
# generate sample data
sample.dat <- list()
set.seed(123)
for (i in 1:365) {
  vec1 <- sample(c(0, 1), replace = TRUE, size = 5)
  sample.dat[[i]] <- vec1
}
dat <- rbindlist(sample.dat)
dat has five columns. Each column is a location and has 365 days of the year (365 rows) with values 1 or 0.
I have another dataframe (see below) which has certain days of the year for each column (location) in dat.
# generate second sample data
set.seed(123)
sample.dat1 <- list()
for (i in 1:5) {
  vec1 <- sort(sample(c(258:365), replace = TRUE, size = 4), decreasing = F)
  sample.dat1[[i]] <- vec1
}
dat1 <- cbindlist(sample.dat1)
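As a quick check on the construction above, dat1 should be a 4 x 5 matrix: four day indices per location, one column per location.
dim(dat1)
# [1] 4 5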
I need to use dat1 to subset days in dat to do a calculation. An example below:
1) For location 1 (first column in both dat1 and dat):
In column 1 of dat, select the days from 289 till 302 (using dat1), find the longest consecutive occurrence of 1.
Repeat it and this time select the days from 303 (302 + 1) till 343 from dat, find the longest consecutive occurrence of 1.
Repeat it for 343 till 353: select the days from 344 (343 + 1) till 353, find the longest consecutive occurrence of 1.
2) Do this for all the columns
If I want the sum of 1s, I can do this:
library(tibble)
library(purrr)

dat <- as_tibble(dat)
dat1 <- as_tibble(dat1)

pmap(list(dat, dat1), ~ {
  range1 <- ..2[1]
  range2 <- ..2[2]
  range3 <- ..2[3]
  range4 <- ..2[4]
  sum.range1 <- sum(..1[range1:range2]) # sum between range 1 and range 2
  sum.range2 <- sum(..1[range2:range3]) # sum between range 2 and range 3
  sum.range3 <- sum(..1[range3:range4]) # sum between range 3 and range 4
  c(sum.range1 = sum.range1, sum.range2 = sum.range2, sum.range3 = sum.range3)
})
For the longest consecutive occurrence of 1 between each range, I thought of using the rle function. Example below:
pmap(list(dat, dat1), ~ {
  range1 <- ..2[1]
  range2 <- ..2[2]
  range3 <- ..2[3]
  range4 <- ..2[4]
  spell.range1 <- rle(..1[range1:range2]) # run-length encode; gives runs of ANY value (0 or 1)
  spell.1.range1 <- tapply(spell.range1$lengths, spell.range1$values, max)[2] # should select the maximum consecutive run of 1
  spell.range2 <- rle(..1[range2:range3])
  spell.1.range2 <- tapply(spell.range2$lengths, spell.range2$values, max)[2]
  spell.range3 <- rle(..1[range3:range4])
  spell.1.range3 <- tapply(spell.range3$lengths, spell.range3$values, max)[2]
  c(spell.1.range1 = spell.1.range1, spell.1.range2 = spell.1.range2, spell.1.range3 = spell.1.range3)
})
I get an error, which I think is because I am not using the rle function properly here. I would really like to keep the code in this form, since my other code follows the same pattern and the format of the output suits my needs, so I would appreciate suggestions on how to fix it.
The OP's code does work for me, so without a specific error message it is impossible to tell why it fails for the OP. However, the sample datasets created by the OP are matrices (before they were coerced to tibbles), and I felt challenged to find a way to solve the task in base R without using purrr:
To find the length of the longest run of consecutive occurrences of a particular value val in a vector x, we can use the following function:
max_rle <- function(x, val) {
  y <- rle(x)
  len <- y$lengths[y$values == val]
  if (length(len) > 0) max(len) else NA
}
Examples:
max_rle(c(0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1), 1)
[1] 4
max_rle(c(0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1), 0)
[1] 2
# find consecutive occurrences in column batches
lapply(seq_len(ncol(dat1)), function(col_num) {
  start <- head(dat1[, col_num], -1L)
  end <- tail(dat1[, col_num], -1L) - 1
  sapply(seq_along(start), function(range_num) {
    max_rle(dat[start[range_num]:end[range_num], col_num], 1)
  })
})
[[1]]
[1] 8 4 5
[[2]]
[1] 4 5 2
[[3]]
[1] NA 3 4
[[4]]
[1] 5 5 4
[[5]]
[1] 3 2 3
The outer lapply() loops over the columns of dat1 (and, correspondingly, of dat), while the inner sapply() loops over the row ranges stored in dat1 and subsets dat accordingly.
I have a dataframe of time series data with daily temperature observations. I need to create a dummy variable that flags each day with a temperature above a threshold of 5 °C. This would be easy in itself, but there is an additional condition: counting starts only after ten consecutive days above the threshold have occurred. Here's an example dataframe:
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
I think I got it done, but with too many loops for my liking. This is what I did:
df$dummyUnconditional <- 0
df$dummyHead <- 0
df$dummyTail <- 0

for (i in 1:nrow(df)) {
  if (df$temp[i] > 5) {
    df$dummyUnconditional[i] <- 1
  }
}

for (i in 1:(nrow(df) - 9)) {
  if (sum(df$dummyUnconditional[i:(i + 9)]) == 10) {
    df$dummyHead[i] <- 1
  }
}

for (i in 10:nrow(df)) {
  if (sum(df$dummyUnconditional[(i - 9):i]) == 10) {
    df$dummyTail[i] <- 1
  }
}

df$dummyConditional <- ifelse(df$dummyHead == 1 | df$dummyTail == 1, 1, 0)
Could anyone suggest simpler ways for doing this?
Here's a base R option using rle:
df$dummy <- with(rle(df$temp > 5), rep(as.integer(values & lengths >= 10), lengths))
Some explanation: The task is a classic use case for the run length encoding (rle) function, imo. We first check if the value of temp is greater than 5 (creating a logical vector) and apply rle on that vector resulting in:
> rle(df$temp > 5)
#Run Length Encoding
# lengths: int [1:7] 66 1 1 225 2 1 69
# values : logi [1:7] FALSE TRUE FALSE TRUE FALSE TRUE ...
Now we want to find those cases where values is TRUE (i.e. temp is greater than 5) and, at the same time, lengths is at least 10 (i.e. at least ten consecutive temp values are greater than 5). We do this by running:
values & lengths >= 10
And finally, since we want to return a vector of the same length as nrow(df), we use rep(..., lengths), with as.integer() so that 1/0 is returned instead of TRUE/FALSE.
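With the particular run-length encoding shown above (lengths 66, 1, 1, 225, 2, 1, 69), the condition picks out only the long run of 225 TRUEs, so the dummy ends up being 1 exactly on those days:
with(rle(df$temp > 5), values & lengths >= 10)
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE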
I think you could use a combination of a simple ifelse() and the rollapply() function in the zoo package to achieve what you are looking for. The final step just involves padding the result to account for the first N-1 days, where there isn't enough information to fill the window.
library(zoo)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
df$above5 <- ifelse(df$temp > 5, 1, 0)
temp <- rollapply(df$above5, 10, sum)
df$conseq <- c(rep(0, 9),temp)
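If I remember zoo's API correctly (worth double-checking against your version), rollapply()'s align and fill arguments can do the padding in one step, equivalent to the manual c(rep(0, 9), temp):
df$conseq <- rollapply(df$above5, 10, sum, align = "right", fill = 0)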
I would do this:
set.seed(42)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
thr <- 5
df$dum <- 0
# find the end of the first run of 10 consecutive values above the threshold
test1 <- stats::filter(df$temp > thr, rep(1, 10), sides = 1) == 10L
test1[1:9] <- FALSE
n <- which(cumsum(test1) == 1L)[1]
# count days above the threshold after that
df$dum[(n + 1):nrow(df)] <- cumsum(df$temp[(n + 1):nrow(df)] > thr)
bucketIndex <- function(v, N) {
  o <- rep(0, length(v))
  curSum <- 0
  index <- 1
  for (i in seq(length(v))) {
    o[i] <- index
    curSum <- curSum + v[i]
    if (curSum > N) {
      curSum <- 0
      index <- index + 1
    }
  }
  o
}
> bucketIndex(c(1, 1, 2, 1, 5, 1), 3)
[1] 1 1 1 2 2 3
I'm wondering if this function is fundamentally un-vectorizable. If it is, is there some package for dealing with this class of functions, or is the only alternative (if I want speed) to write it as a C extension?
Here's a try (does not yet arrive at bucketIndex!):
Your
curSum <- curSum + v[i]
if(curSum > N){
curSum <- 0
index <- index + 1
}
is almost an integer division (%/%) of cumsum(v).
But not quite: your index only counts up by 1 even if v[i] exceeds N several times over, and you start at 1. We can almost take care of that by converting to a factor and back to integer.
However, I'm wondering (from the name of the function) whether this behaviour is really intended:
> bucketIndex(c(1, 1, 2, 1, 2, 1, 1, 2, 1, 5, 1), 3)
[1] 1 1 1 2 2 2 3 3 3 4 5
> bucketIndex(c(1, 1, 1, 2, 2, 1, 1, 2, 1, 5, 1), 3)
[1] 1 1 1 1 2 2 2 3 3 3 4
I.e. just exchanging two consecutive entries in v can lead to a different maximum index in the result.
The other point is that you count up only after the element that causes the sum to exceed N, which means the result would need an additional 1 at the beginning and its last element dropped.
You reset curSum to 0 regardless of how much it shoots over N. So for all elements with cumsum(v) > N, you'd need to subtract this value, then look for the next cumsum(v) > N, and so on. This reduces the number of loop iterations with respect to your for loop, but whether it gives you a substantial improvement depends on the entries of v and on N (or on the max(index) : length(v) ratio). If that is 50% as in your example, I don't think you can get a substantial gain. Unless there is at least an order of magnitude between them, I'd go for inline::cfunction.
I'm going to go out on a limb here and say the answer is "no." Essentially, you're changing what it is you sum over based on the results of the current sum. This means future calculations depend on the result of an intermediate calculation, which vectorized operations can't do.
I don't think this is completely vectorizable, but @cbeleites gets at one way to reduce the number of iterations in the loop by dealing with a whole chunk (bucket) at a time. Each iteration looks for where the cumulative sum exceeds N, assigns the index to that range, reduces the cumulative sum by the value at which it exceeded N, and repeats until the vector is exhausted. The rest is bookkeeping (initialization and incrementing of values).
bucketIndex2 <- function(v, N) {
  index <- 1
  cs <- cumsum(v)
  bk.old <- 0
  o <- rep(0, length(v))
  repeat {
    bk <- suppressWarnings(min(which(cs > N)))
    o[(bk.old + 1):min(bk, length(v))] <- index
    if (bk >= length(v)) break
    cs <- cs - cs[bk]
    index <- index + 1
    bk.old <- bk
  }
  o
}
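As a quick sanity check against the example from the question (traced by hand, so worth re-running):
bucketIndex2(c(1, 1, 2, 1, 5, 1), 3)
# [1] 1 1 1 2 2 3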
This matches your function for a variety of random inputs:
for (i in 1:200) {
  v <- sample(sample(20, 1), sample(50, 1) + 20, replace = TRUE)
  N <- sample(10, 1)
  bi <- bucketIndex(v, N)
  bi2 <- bucketIndex2(v, N)
  if (any(bi != bi2)) {
    print("MISMATCH:")
    dump("v", "")
    dump("N", "")
  }
}