I have a dataframe of time series data with daily observations of temperatures. I need to create a dummy variable that counts each day that has a temperature above a threshold of 5C. This would be easy in itself, but there is an additional condition: counting starts only after ten consecutive days above the threshold occur. Here's an example dataframe:
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
I think I got it done, but with too many loops for my liking. This is what I did:
df$dummyUnconditional <- 0
df$dummyHead <- 0
df$dummyTail <- 0
for(i in 1:nrow(df)){
if(df$temp[i] > 5){
df$dummyUnconditional[i] <- 1
}
}
for(i in 1:(nrow(df)-9)){
if(sum(df$dummyUnconditional[i:(i+9)]) == 10){
df$dummyHead[i] <- 1
}
}
for(i in 10:nrow(df)){
if(sum(df$dummyUnconditional[(i-9):i]) == 10){
df$dummyTail[i] <- 1
}
}
df$dummyConditional <- ifelse(df$dummyHead == 1 | df$dummyTail == 1, 1, 0)
Could anyone suggest simpler ways for doing this?
Here's a base R option using rle:
df$dummy <- with(rle(df$temp > 5), rep(as.integer(values & lengths >= 10), lengths))
Some explanation: The task is a classic use case for the run length encoding (rle) function, imo. We first check if the value of temp is greater than 5 (creating a logical vector) and apply rle on that vector resulting in:
> rle(df$temp > 5)
#Run Length Encoding
# lengths: int [1:7] 66 1 1 225 2 1 69
# values : logi [1:7] FALSE TRUE FALSE TRUE FALSE TRUE ...
Now we want to find those cases where values is TRUE (i.e. temp is greater than 5) and where at the same time lengths is at least 10 (i.e. at least ten consecutive temp values are greater than 5). We do this by running:
values & lengths >= 10
And finally, since we want to return a vector of the same length as nrow(df), we use rep(..., lengths), and as.integer in order to return 1/0 instead of TRUE/FALSE.
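To see the steps on something small enough to check by hand, here is a minimal sketch on a made-up eight-day series. The toy values and the shorter minimum run of 3 (the name min_run is mine, standing in for the 10 above) are assumptions for illustration only:

```r
# Hypothetical toy series; threshold 5, minimum run length 3 instead of 10
temp <- c(2, 6, 7, 4, 6, 8, 9, 3)
min_run <- 3

runs <- rle(temp > 5)
# runs$lengths: 1 2 1 3 1
# runs$values : FALSE TRUE FALSE TRUE FALSE

# A run qualifies only if it is a TRUE run AND long enough
keep <- runs$values & runs$lengths >= min_run

# Expand the per-run flag back to one value per day
dummy <- rep(as.integer(keep), runs$lengths)
dummy
# [1] 0 0 0 0 1 1 1 0
```

The two-day run on days 2 and 3 is above the threshold but too short, so it stays 0.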
I think you could use a combination of a simple ifelse and the rollapply function in the zoo package to achieve what you are looking for. The final step just involves padding the result to account for the first N-1 days, where there isn't enough information to fill the window.
library(zoo)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
df$above5 <- ifelse(df$temp > 5, 1, 0)
temp <- rollapply(df$above5, 10, sum)
df$conseq <- c(rep(0, 9),temp)
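Note that conseq holds a rolling sum rather than a 0/1 flag. A rough sketch of how one might turn it into a flag, using base R's stats::filter in place of zoo (my substitution, not part of the answer above), on a hypothetical toy series with a window of 3 instead of 10:

```r
temp <- c(2, 6, 7, 4, 6, 8, 9, 3)   # hypothetical toy series
win  <- 3                            # stand-in for the 10-day window

above <- as.integer(temp > 5)

# Trailing (one-sided) rolling sum; the first win - 1 entries come back as NA
roll <- as.numeric(stats::filter(above, rep(1, win), sides = 1))
roll[is.na(roll)] <- 0               # same idea as padding with rep(0, 9)

# 1 only on days that complete a full window above the threshold
window_full <- as.integer(roll == win)
window_full
# [1] 0 0 0 0 0 0 1 0
```

This flags the day on which a full window is first completed, not every day inside the run, so it is not a drop-in replacement for the rle result.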
I would do this:
set.seed(42)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
thr <- 5
df$dum <- 0
#find first 10 consecutive values above threshold
test1 <- filter(df$temp > thr, rep(1,10), sides = 1) == 10L
test1[1:9] <- FALSE
n <- which(test1)[1]
#count days above threshold after that
df$dum[(n+1):nrow(df)] <- cumsum(df$temp[(n+1):nrow(df)] > thr)
I am looking for an efficient way to get the first k elements that are the same between two vectors in R.
For example:
orderedIntersect(c(1,2,3,4), c(1,2,5,4))
# [1] 1 2
orderedIntersect(c(1,2,3), c(1,2,3,4))
# [1] 1 2 3
This is the same as the intersect behavior, but any values after the first mismatch should be dropped.
I also want this to work for strings.
So far, the solution that I have is this:
orderedIntersect <- function(a,b) {
a <- as.vector(a)
NAs <- is.na(match(a, as.vector(b)))
last <- ifelse(any(NAs), min(which(NAs)) - 1, length(a))
a[seq_len(last)]
}
I am troubled by the fact that I have to iterate over n input elements 6 times: match, is.na, any, which, min, and the subset [].
Clearly, it would be faster to write an external C function (with a for loop and a break), but I am wondering if there is any clever R trick I can use here.
You can compare the values of your vectors and drop elements when the first FALSE is reached:
orderedIntersect <- function(a,b) {
# check the lengths are equal and if not, "cut" the vectors so they are (to avoid warnings)
l_a <- length(a) ; l_b <- length(b)
if(l_a != l_b) {m_l <- min(l_a, l_b) ; a <- a[1:m_l] ; b <- b[1:m_l]}
# compare the elements : they are equal if both are not NA and have the same value or if both are NA
comp <- (!is.na(a) & !is.na(b) & a==b) | (is.na(a) & is.na(b))
# return the right vector : nothing if the first elements do not match, everything if all elements match or just the part that match
if(!comp[1]) return(c()) else if (all(comp)) return(a) else return(a[1:(which(!comp)[1]-1)])
}
orderedIntersect(c(1,2,3,4), c(1,2,5,4))
#[1] 1 2
orderedIntersect(c(1,2,3), c(1,2,3,4))
#[1] 1 2 3
orderedIntersect(c(1,2,3), c(2,3,4))
#NULL
The simple C solution (for integers) isn't really any longer than the R version, but it would be a little more work to extend to all the other classes.
library(inline)
orderedIntersect <- cfunction(
signature(x='integer', y='integer'),
body='
int i, l = length(x) > length(y) ? length(y) : length(x),
*xx = INTEGER(x), *yy = INTEGER(y);
SEXP res;
for (i = 0; i < l; i++) if (xx[i] != yy[i]) break;
PROTECT(res = allocVector(INTSXP, i));
for (l = 0; l < i; l++) INTEGER(res)[l] = xx[l];
UNPROTECT(1);
return res;'
)
## Tests
a <- c(1L,2L,3L,4L)
b <- c(1L,2L,5L,4L)
c <- c(1L,2L,8L,9L,9L,9L,9L,3L)
d <- c(9L,0L,0L,8L)
orderedIntersect(a,b)
# [1] 1 2
orderedIntersect(a,c)
# [1] 1 2
orderedIntersect(a,d)
# integer(0)
orderedIntersect(a, integer())
# integer(0)
This might work:
#test data
a <- c(1,2,3,4)
b <- c(1,2,5,4)
c <- c(1,2,8,9,9,9,9,3)
d <- c(9,0,0,8)
empty <- c()
string1 <- c("abc", "def", "ad","k")
string2 <- c("abc", "def", "c", "lds")
#function
orderedIntersect <- function(a, b) {
l <- min(length(a), length(b))
if (l == 0) return(numeric(0))
a1 <- a[1:l]
comp <- a1 != b[1:l]
if (all(!comp)) return(a1)
a1[ 0:(min(which(comp)) - 1) ]
}
#testing
orderedIntersect(a,b)
# [1] 1 2
orderedIntersect(a,c)
# [1] 1 2
orderedIntersect(a,d)
# numeric(0)
orderedIntersect(a, empty)
# numeric(0)
orderedIntersect(string1,string2)
# [1] "abc" "def"
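For comparison, a compact position-wise variant, as a sketch only. The name common_prefix is made up here, and treating any NA as a mismatch is my assumption, not something the question specifies:

```r
# Hypothetical helper: longest common leading run of a and b, compared by position
common_prefix <- function(a, b) {
  n <- min(length(a), length(b))
  a <- a[seq_len(n)]
  b <- b[seq_len(n)]
  # Assumption: an NA on either side counts as a mismatch
  same <- !is.na(a) & !is.na(b) & a == b
  # Position of the first mismatch, or n + 1 if there is none
  first_diff <- match(FALSE, same, nomatch = n + 1L)
  a[seq_len(first_diff - 1L)]
}

common_prefix(c(1, 2, 3, 4), c(1, 2, 5, 4))
# [1] 1 2
common_prefix(c("abc", "def", "ad", "k"), c("abc", "def", "c", "lds"))
# [1] "abc" "def"
common_prefix(c(1, 2, 3), c(2, 3, 4))
# numeric(0)
```

Using seq_len() instead of 1:n keeps the zero-length cases safe, since 1:0 would count down rather than produce an empty index.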
The title does not really do this question justice, but I could not think of any other way to phrase the question. I can best explain the problem with an example.
Let's say we have two vectors of numbers (each of which is always going to be ascending and unique):
vector1 <- c(1,3,10,11,24,26,30,31)
vector2 <- c(5,9,15,19,21,23,28,35)
What I am trying to do is create a function that will take these two vectors and match them in the following way:
1) Start with the first element of vector1 (in this case, 1)
2) Go to vector2 and match the element from #1 with the first element in vector 2 that is bigger than it (in this case, 5)
3) Go back to vector1 and skip all elements less than the value in #2 we found (in this case, we skip 3, and grab 10)
4) Go back to vector2 and skip all elements less than the value in #3 we found (in this case, we skip 9 and grab 15)
5) repeat until we are done with all elements.
The resulting two vectors we should have are:
result1 = c(1,10,24,30)
result2 = c(5,15,28,35)
My current solution goes something like this, but I believe it might be highly inefficient:
# establishes where we start from the vector2 numbers
# just in case we have vector1 <- c(5,8,10)
# and vector2 <- c(1,2,3,4,6,7). We would want to skip the 1,2,3,4 values
i <- 1
while(vector2[i]<vector1[1]){
i <- i+1
}
# starts the result1 vector with the first value from the vector1
result1 <- vector1[1]
# starts the result2 vector empty and will add as we loop through
result2 <- c()
# super complicated and probably hugely inefficient loop within a loop within a loop
# i really want to avoid doing this, but I cannot think of any other way to accomplish this
for(j in 1:length(vector1)){
while(vector1[j] > vector2[i] && (i+1) <= length(vector2)){
result1 <- c(result1,vector1[j])
result2 <- c(result2,vector2[i])
while(vector1[j] > vector2[i+1] && (i+2) <= length(vector2)){
i <- i+1
}
i <- i+1
}
}
## have to add on the last vector2 value cause while loop skips it
## if it doesn't exist (there are no more vector2 values bigger) we put in an NA
if(result1[length(result1)] < vector2[i]){
result2 <- c(result2,vector2[i])
} else {
### we ran out of vector2 values that are bigger
result2 <- c(result2,NA)
}
This is really difficult to explain. Just call it magic :)
vector1 <- c(1,3,10,11,24,26,30,31)
vector2 <- c(5,9,15,19,21,23,28,35)
## another case
# vector2 <- c(0,9,15,19,21,23,28,35)
## handling the case where vector2 min value(s) are < vector1 min value
if (any(idx <- which(min(vector1) >= vector2)))
vector2 <- vector2[-idx]
## interleave the two vectors
tmp <- c(vector1,vector2)[order(c(order(vector1), order(vector2)))]
## if we sort the vectors, which pairwise elements are from the same vector
r <- rle(sort(tmp) %in% vector1)$lengths
## I want to "remove" all the pairwise elements which are from the same vector
## so I again interleave two vectors:
## the first will be all TRUEs because I want the first instance of each *new* vector
## the second will be all FALSEs identifying the elements I want to throw out because
## there is a sequence of elements from the same vector
l <- rep(1, length(r))
ord <- c(l, r - 1)[order(c(order(r), order(l)))]
## create some dummy TRUE/FALSE to identify the ones I want
res <- sort(tmp)[unlist(Map(rep, c(TRUE, FALSE), ord))]
setNames(split(res, res %in% vector2), c('result1', 'result2'))
# $result1
# [1] 1 10 24 30
#
# $result2
# [1] 5 15 28 35
Obviously, this will only work if both vectors are ascending and unique, which you said they are.
EDIT:
works with duplicates:
vector1 <- c(1,3,10,11,24,26,30,31)
vector2 <- c(5,9,15,19,21,23,28,35)
vector2 <- c(0,9,15,19,21,23,28,35)
vector2 <- c(1,3,3,5,7,9,28,35)
f <- function(vector1, vector2) {
if (any(idx <- which(min(vector1) >= vector2)))
vector2 <- vector2[-idx]
vector1 <- paste0(vector1, '.0')
vector2 <- paste0(vector2, '.00')
n <- function(x) as.numeric(x)
tmp <- c(vector1, vector2)[order(n(c(vector1, vector2)))]
m <- tmp[1]
idx <- c(TRUE, sapply(1:(length(tmp) - 1), function(x) {
if (n(tmp[x + 1]) > n(m)) {
if (gsub('^.*\\.','', tmp[x + 1]) == gsub('^.*\\.','', m))
FALSE
else {
m <<- tmp[x + 1]
TRUE
}
} else FALSE
}))
setNames(split(n(tmp[idx]), grepl('\\.00$', tmp[idx])), c('result1','result2'))
}
f(vector1, vector2)
# $result1
# [1] 1 10 30
#
# $result2
# [1] 3 28 35
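If readability matters more than cleverness, a plain two-pointer loop is another option. This is only a sketch: the name alternate_match is made up, and it assumes both inputs are ascending, that "bigger" means strictly greater, and that a missing final partner should be NA as in the question:

```r
# Hypothetical helper: walk both ascending vectors once, alternating between them
alternate_match <- function(v1, v2) {
  result1 <- numeric(0)
  result2 <- numeric(0)
  i <- 1L; j <- 1L
  n1 <- length(v1); n2 <- length(v2)
  last <- -Inf                       # last value taken from v2
  while (i <= n1) {
    # skip v1 values that are not greater than the last v2 value taken
    while (i <= n1 && v1[i] <= last) i <- i + 1L
    if (i > n1) break
    result1 <- c(result1, v1[i])
    # skip v2 values that are not greater than the v1 value just taken
    while (j <= n2 && v2[j] <= v1[i]) j <- j + 1L
    if (j > n2) {                    # ran out of bigger v2 values
      result2 <- c(result2, NA)
      break
    }
    result2 <- c(result2, v2[j])
    last <- v2[j]
    i <- i + 1L
  }
  list(result1 = result1, result2 = result2)
}

alternate_match(c(1, 3, 10, 11, 24, 26, 30, 31),
                c(5, 9, 15, 19, 21, 23, 28, 35))
# $result1
# [1]  1 10 24 30
#
# $result2
# [1]  5 15 28 35
```

Each index only ever moves forward, so the work is linear in the combined length of the two vectors, apart from the cost of growing the result vectors with c().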
Is there an R idiom for performing a different (integer) range check for each element of a vector?
My function is passed a two-element (integer) vector of the form v = c(m, n) and must make the following range checks:
1 <= m <= M
1 <= n <= N
For my current task, I've implemented them by manually accessing each element, and running the associated range check against it.
# Check if this is a valid position on an M x N chess board.
validate = function (square) {
row = square[1]
col = square[2]
(row %in% 1:M) && (col %in% 1:N)
}
I wonder whether there's a more compact way of doing the range checks, especially if we were to generalize it to K-element vectors.
Since you're presumably setting up different criteria for each v[j], I'd recommend creating a list out of your range criteria. Like:
Rgames> set.seed(10)
Rgames> foo<-sample(1:5,5,rep=TRUE)
Rgames> foo
[1] 3 2 3 4 1
Rgames> bar<-list(one=1:5, two=3:5,three=1:3,four=c(2,4),five=c(1,4) )
Rgames> checkit<-NA
Rgames> for(j in 1:5) checkit[j]<-foo[j]%in%bar[[j]]
Rgames> checkit
[1] TRUE FALSE TRUE TRUE TRUE
If I understand your goal correctly, the inequality operators are vectorized in R, so you can make use of this fact.
limits <- c(M=3, N=4, 5)
v <- c(m=2, n=5, 8)
result <- 1 <= v & v <= limits
#     m     n
#  TRUE FALSE FALSE
And if you want a single value that's FALSE if any of the limits are exceeded, then you can wrap the inequality expression with all.
all(1 <= v & v <= limits)
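Wrapped up as a function for K-element vectors, this could look like the sketch below. The name in_bounds and its arguments are hypothetical. One difference from the question's row %in% 1:M is that a plain comparison also accepts non-integers such as 2.5, so the sketch adds a whole-number check to stay closer to the original behaviour:

```r
# Hypothetical helper: TRUE if every v[k] is a whole number in [lower, upper[k]]
in_bounds <- function(v, upper, lower = 1) {
  stopifnot(length(v) == length(upper))
  all(v %% 1 == 0 & v >= lower & v <= upper)
}

in_bounds(c(2, 3), upper = c(8, 8))        # valid square on an 8 x 8 board
# [1] TRUE
in_bounds(c(2, 9), upper = c(8, 8))        # column out of range
# [1] FALSE
in_bounds(c(1, 4, 7), upper = c(3, 4, 9))  # K = 3
# [1] TRUE
```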
Maybe something like this:
`%between%` <- function(x,rng){
all(x <= max(rng,na.rm = TRUE)) && all(x >= min(rng,na.rm = TRUE))
}
> 3 %between% c(1,10)
[1] TRUE
> 3:5 %between% c(1,10)
[1] TRUE
> 9:12 %between% c(1,10)
[1] FALSE
With tweaks depending on how you want to handle NAs, and other edge cases.
Problem
Find the sum of all numbers below 1000 that are divisible by 3 or 5.
One solution I created:
x <- c(1:999)
values <- x[x %% 3 == 0 | x %% 5 == 0]
sum(values)
Second solution I can't get to work and need help with. I've pasted it below.
I'm trying to use a loop (here, I use while() and after this I'll try for()). I am still struggling with keeping references to indexes (locations in a vector) separate from values/observations within vectors. Loops seem to make it more challenging for me to distinguish the two.
Why does this not produce the answer to Euler #1?
x <- 0
i <- 1
while (i < 100) {
if (i %% 3 == 0 | i %% 5 == 0) {
x[i] <- c(x, i)
}
i <- i + 1
}
sum(x)
And in words, line by line this is what I understand is happening:
x gets value 0
i gets value 1
while object i's value (not the index #) is < 1000
if i is divisible by 3 or 5
add that number i to the vector x
add 1 to i (in order to keep the loop going to the defined limit of 1e3)
sum all items in vector x
I am guessing x[i] <- c(x, i) is not the right way to add an element to vector x. How do I fix this and what else is not accurate?
First, your loop runs until i < 100, not i < 1000.
Second, replace x[i] <- c(x, i) with x <- c(x, i) to add an element to the vector.
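Putting both fixes together, the loop would look roughly like this:

```r
x <- 0
i <- 1
while (i < 1000) {                 # the limit is 1000, not 100
  if (i %% 3 == 0 | i %% 5 == 0) {
    x <- c(x, i)                   # append the value i to the end of x
  }
  i <- i + 1
}
sum(x)
# [1] 233168
```

The original x[i] <- c(x, i) tries to store a whole vector in the single position i, so R keeps only its first element (with a warning) and the value i itself is never stored.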
Here is a shortcut that performs this sum, which is probably more in the spirit of the problem:
3*(333*334/2) + 5*(199*200/2) - 15*(66*67/2)
## [1] 233168
Here's why this works:
In the set of integers [1,999] there are:
333 values that are divisible by 3. Their sum is 3*sum(1:333) or 3*(333*334/2).
199 values that are divisible by 5. Their sum is 5*sum(1:199) or 5*(199*200/2).
Adding these up gives a number that is too high by the sum of their intersection, which is the set of values divisible by 15. There are 66 such values, and their sum is 15*sum(1:66) or 15*(66*67/2).
As a function of N, this can be written:
f <- function(N) {
threes <- floor(N/3)
fives <- floor(N/5)
fifteens <- floor(N/15)
3*(threes*(threes+1)/2) + 5*(fives*(fives+1)/2) - 15*(fifteens*(fifteens+1)/2)
}
Giving:
f(999)
## [1] 233168
f(99)
## [1] 2318
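As a sanity check, the closed form can be compared against a brute-force sum over a range of N. The helper name brute is made up for this sketch, and f is repeated so the snippet runs on its own:

```r
f <- function(N) {
  threes   <- floor(N/3)
  fives    <- floor(N/5)
  fifteens <- floor(N/15)
  3*(threes*(threes+1)/2) + 5*(fives*(fives+1)/2) - 15*(fifteens*(fifteens+1)/2)
}

# Hypothetical brute-force reference: filter 1..N and sum
brute <- function(N) {
  x <- seq_len(N)
  sum(x[x %% 3 == 0 | x %% 5 == 0])
}

# Compare the two for every N from 0 to 200, plus the actual target
all(vapply(0:200, function(N) f(N) == brute(N), logical(1)))
# [1] TRUE
f(999) == brute(999)
# [1] TRUE
```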
And another way:
x <- 1:999
sum(which(x%%5==0 | x%%3==0))
# [1] 233168
A very efficient approach is the following:
div_sum <- function(x, n) {
# calculates twice the sum of all integers from 1 to n
# that are divisible by x
max_num <- n %/% x
(x * (max_num + 1) * max_num)
}
n <- 999
a <- 3
b <- 5
(div_sum(a, n) + div_sum(b, n) - div_sum(a * b, n)) / 2
In contrast, a very short code is the following:
x=1:999
sum(x[!x%%3|!x%%5])
Here is an alternative that I think gives the same answer (using 99 instead of 999 as the upper bound):
iters <- 100
x <- rep(0, iters-1)
i <- 1
while (i < iters) {
if (i %% 3 == 0 | i %% 5 == 0) {
x[i] <- i
}
i <- i + 1
}
sum(x)
# [1] 2318
Here is the for-loop mentioned in the original post:
iters <- 99
x <- rep(0, iters)
for (i in 1:iters) {
if (i %% 3 == 0 | i %% 5 == 0) {
x[i] <- i
}
}
sum(x)
# [1] 2318