I'm trying to build a function in R which calculates the percentage change between rows based on any arbitrary lag n, that is, between any given row and the one preceding it, or between any given row and the row n positions before it.
perc_change <- function(x, n) {
  y <- c()
  z <- c()
  for (i in 1:length(x)) {
    z[i] <- (x[i] / (x[i - n]) - 1) * 100
  }
  y <- c(rep(NA, n), z[(n + 1):length(z)])
  y
}
When n is one the function works properly:
x <- c(2,3.5,4,6)
perc_change(x,1)
[1] NA 75.00000 14.28571 50.00000
But when I change n to 2 or any other value, I receive this error:
Error in z[i] <- (x[i]/(x[i - n]) - 1) * 100 :
replacement has length zero
I just can't find where the logic of my function is wrong, so I'd appreciate any comment or suggestion.
In the loop, when n is greater than 1, i starting at 1 can produce a zero or negative index (e.g. when n = 2 and i = 1, i - n = -1). A zero index returns an empty vector, which is what triggers the "replacement has length zero" error.
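A quick illustration of what those out-of-range subscripts return (using the x from above):
x <- c(2, 3.5, 4, 6)
x[1 - 2]      # a negative index drops element 1: 3.5 4.0 6.0
x[2 - 2]      # a zero index returns an empty vector: numeric(0)
z <- c()
z[2] <- x[0]  # Error in z[2] <- x[0] : replacement has length zero
To avoid that, an if/else condition can be added: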
perc_change <- function(x, n) {
  y <- c()
  z <- c()
  for (i in 1:length(x)) {
    if (i > n) {
      z[i] <- (x[i] / (x[i - n]) - 1) * 100
    } else {
      z[i] <- NA
    }
  }
  y <- c(rep(NA, n), z[(n + 1):length(z)])
  y
}
perc_change(x,1)
#[1] NA 75.00000 14.28571 50.00000
perc_change(x, 2)
#[1] NA NA 100.00000 71.42857
perc_change(x, 3)
#[1] NA NA NA 200
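Side note: since the loop in this version already writes NA into the first n positions, the final line that prepends rep(NA, n) is redundant; a minimal equivalent sketch (assuming 0 < n < length(x)):
perc_change <- function(x, n) {
  z <- rep(NA_real_, length(x))  # the first n entries stay NA
  for (i in (n + 1):length(x)) {
    z[i] <- (x[i] / x[i - n] - 1) * 100
  }
  z
}
perc_change(x, 2)
#[1]        NA        NA 100.00000  71.42857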
The following function lags the input vector and then computes the percent change with vectorized operations, so no for loop is needed. The lag helper is a copy & paste of the last code lines of dplyr::lag.
perc_change <- function(x, n = 1) {
  lag <- function(x, n = 1) {
    if (n == 0)
      return(x)
    xlen <- length(x)
    n <- pmin(n, xlen)
    out <- c(rep(NA, n), x[seq_len(xlen - n)])
    attributes(out) <- attributes(x)
    out
  }
  y <- lag(x, n)
  (x / y - 1) * 100
}
x <- c(2, 3.5, 4, 6)
perc_change(x,1)
#[1] NA 75.00000 14.28571 50.00000
perc_change(x, 2)
#[1] NA NA 100.00000 71.42857
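For what it's worth, if dplyr itself is available you can skip the copied helper entirely; a one-line sketch (assumes dplyr is installed):
perc_change2 <- function(x, n = 1) (x / dplyr::lag(x, n) - 1) * 100
perc_change2(x, 2)
#[1]        NA        NA 100.00000  71.42857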
I want to calculate how many values are taken until the cumulative sum reaches a certain value.
This is my vector: myvec = seq(0,1,0.1)
I started with coding the cumulative sum function:
cumsum_for <- function(x) {
  y = 1
  for (i in 2:length(x)) {  # pardon the case where x is of length 1 or 0
    x[i] = x[i-1] + x[i]
    y = y + 1
  }
  return(y)
}
Now, with the limit
cumsum_for <- function(x, limit) {
  y = 1
  for (i in 2:length(x)) {  # pardon the case where x is of length 1 or 0
    x[i] = x[i-1] + x[i]
    if (x >= limit) break
    y = y + 1
  }
  return(y)
}
which unfortunately warns and returns the wrong result:
myvec = seq(0,1,0.1)
cumsum_for(myvec, 0.9)
[1] 10
Warning messages:
1: In if (x >= limit) break :
the condition has length > 1 and only the first element will be used
[...]
What about this? You can use cumsum to compute the cumulative sum and then count how many values stay at or below a threshold n:
f <- function(x, n) sum(cumsum(x) <= n)
f(myvec, 4)
#[1] 9
f(myvec, 1.1)
#[1] 5
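Applied to the limit from the question, this reproduces the expected count:
f(myvec, 0.9)
#[1] 4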
You can put a while loop in a function. This stops further calculation of the cumsum if the limit is reached.
cslim <- function(v, l) {
  s <- 0
  i <- 0L
  while (s < l) {
    i <- i + 1
    s <- sum(v[1:i])
  }
  i - 1
}
cslim(myvec, .9)
# [1] 4
Especially useful for longer vectors, e.g.
v <- seq(0, 3e7, 0.1)
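One caveat: sum(v[1:i]) recomputes the sum from scratch on every pass. A sketch of a variant (my own naming) that keeps a running total instead, with the same early-stopping idea:
cslim2 <- function(v, l) {
  s <- 0
  i <- 0L
  while (s < l && i < length(v)) {
    i <- i + 1L
    s <- s + v[i]  # running total instead of sum(v[1:i])
  }
  i - 1L
}
cslim2(myvec, .9)
# [1] 4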
I have a raster stack with 364 layers containing a daily rate of change in NDVI values.
I want to scale the values in every cell: if positive, from 0 to 1, and if negative, from -1 to 0. So far I have only found a solution that scales values in single layers (see here: Replace specific value in each band of raster brick in R), not along the cells of multilayer objects. Additionally, I have a decent number of cells that are NA for the entire time series, and I'm not quite sure how to deal with that either.
I took the code from the previously mentioned post and tried to get it working for my problem:
norm <- function(x) {-1 + (x - min) * ((1 - (-1)) / (max - min))}
for (j in 1:ncell(tif)) {
  if (is.na(sum(tif[j]))) {
    NULL
  } else {
    cat(paste("Currently processing layer:", j, "/", ncell(tif), "\n"))
    min <- cellStats(tif[j], 'min')
    max <- cellStats(tif[j], 'max')
    # initialize cluster
    # number of cores to use for the clusterR function (max recommended: ncores - 1)
    beginCluster(31)
    # normalize
    tif[j] <- clusterR(tif[j], calc, args = list(fun = norm), export = c('min', 'max'))
    # end cluster
    endCluster()
  }
}
I'm not quite certain if this produces the desired output. Any help is very much appreciated!
Some example data
library(raster)
r <- raster(ncol=10, nrow=10)
s <- stack(lapply(1:5, function(i) setValues(r, runif(100, -1, 1))))
# adding NAs
s[[2]][sample(100, 25, TRUE)] <- NA
For scaling (or any other operation) by cell (as requested) you can use calc together with a function that works on a vector. For example:
ff <- function(i) {
  p <- which(i >= 0)
  n <- which(i <= 0)
  # positive values
  if (length(p) > 0) {
    i[p] <- i[p] - min(i[p], na.rm=TRUE)
    i[p] <- i[p] / max(i[p])
  }
  # negative values
  if (length(n) > 0) {
    i[n] <- i[n] - max(i[n], na.rm=TRUE)
    i[n] <- i[n] / abs(min(i[n]))
  }
  i
}
Test it
ff(c(-.3, -.1, .1, .4, .8))
#[1] -1.0000000 0.0000000 0.0000000 0.4285714 1.0000000
ff(c(-.3, -.1, .1, .4, .8, NA))
#[1] -1.0000000 0.0000000 0.0000000 0.4285714 1.0000000 NA
ff(c(-2,-1))
#[1] -1 0
ff(c(NA, NA))
#[1] NA NA
And use it
z <- calc(s, ff)
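To spot-check the result, you can compare one cell's time series against ff applied directly (a sketch; cell 1 is chosen arbitrarily):
as.vector(s[1])      # raw values of cell 1 across all layers
ff(as.vector(s[1]))  # scaled by hand
as.vector(z[1])      # should match the line above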
See below to scale by layer, based on the min and max of all cell values (I first thought that this was what was asked for). Note that the functions below scale values from -1 to 1, but do not map the lowest positive value and the highest negative value to zero.
minv <- abs(cellStats(s,'min'))
maxv <- cellStats(s,'max')
f1 <- function(i, mn, mx) {
  j <- i < 0
  j[is.na(j)] <- TRUE
  i[j] <- i[j] / abs(mn)
  i[!j] <- i[!j] / mx
  i
}
ss <- list()
for (i in 1:nlayers(s)) {
  ss[[i]] <- calc(s[[i]], fun = function(x) f1(x, minv[i], maxv[i]))
}
ss1 <- stack(ss)
Or without a loop
f2 <- function(x, mn, mx) {
  x <- t(x)
  i <- which(x > 0)
  i[is.na(i)] <- FALSE
  mxx <- x / mx
  x <- x / mn
  x[i] <- mxx[i]
  t(x)
}
ss2 <- calc(s, fun=function(x) f2(x, minv, maxv))
For reference, to simply scale between 0 and 1
mnv <- cellStats(s,'min')
mxv <- cellStats(s,'max')
x <- (s - mnv) / (mxv - mnv)
To get values between -1 and 1 you can then do
y <- 2 * x - 1
But that way previously negative values can become positive and vice versa.
See ?raster::scale for other types of scaling.
I know that match(x,y) returns the first match of all elements of x in y.
Assuming that x may contain the same value multiple times, I am looking for a concise way to match the nth occurrence in x with the nth occurrence in y.
For example:
x <- c(3,4,4,3,2,4)
y <- c(1,2,3,4,1,2,3,4)
my.match(x, y)
## 3,4,8,7,2,NA
Using a for loop to match, store, and overwrite each match with NA:
idx <- c()
for (i in x) {
  k <- match(i, y)
  idx <- c(idx, k)
  y[k] <- NA
}
idx
#[1] 3 4 8 7 2 NA
The following function is much faster when the vectors are large, because instead of looping over the elements one at a time it repeatedly applies the vectorized match to the still-unmatched elements:
my.match <- function(x, y) {
  fidx <- rep(FALSE, length(x))
  fidy <- rep(FALSE, length(y))
  ret <- rep(NA, length(x))
  repeat {
    nidx <- which(!fidx)
    nidy <- which(!fidy)
    idx <- match(x[nidx], y[nidy])
    idy <- match(y[nidy], x[nidx])
    ret[nidx] <- nidy[idx]
    fidx[nidx[unique(idy)]] <- TRUE
    fidy[nidy[unique(idx)]] <- TRUE
    if (sum(!is.na(idx)) == 0 | sum(!is.na(idy)) == 0) {
      break
    }
  }
  return(ret)
}
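Reproducing the example from the question:
x <- c(3, 4, 4, 3, 2, 4)
y <- c(1, 2, 3, 4, 1, 2, 3, 4)
my.match(x, y)
#[1]  3  4  8  7  2 NA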
Benchmarking with the other proposed method yields:
my.match1 <- function(x, y) {
  idx <- c()
  for (i in x) {
    k <- match(i, y)
    idx <- c(idx, k)
    y[k] <- NA
  }
  return(idx)
}
x <- sample.int(100,10000,replace=T)
y <- sample.int(100,10000,replace=T)
system.time(my.match1(x,y))
## user system elapsed
## 1.016 0.003 1.020
system.time(my.match(x,y))
## user system elapsed
## 0.049 0.000 0.049
I have data that looks like this:
char_column date_column1 date_column2 integer_column
415 18JT9R6EKV 2014-08-28 2014-09-06 1
26 18JT9R6EKV 2014-12-08 2014-12-11 2
374 18JT9R6EKV 2015-03-03 2015-03-09 1
139 1PEGXAVCN5 2014-05-06 2014-05-10 3
969 1PEGXAVCN5 2014-06-11 2014-06-15 2
649 1PEGXAVCN5 2014-08-12 2014-08-16 3
I want to perform a loop that checks every row against the preceding row and, given certain conditions, assigns them the same number (so I can group them later). The point is that if the date segments are close enough, I would collapse them into one segment.
My attempt is the following:
i <- 1
z <- 1
v <- 1
for (i in 2:nrow(df)) {
  z[i] <- ifelse(df[i, 'char_column'] == df[i-1, 'char_column'],
                 ifelse((df[i, 'date_column1'] - df[i-1, 'date_column2']) <= 5,
                        ifelse(df[i, 'integer_column'] == df[i-1, 'integer_column'],
                               v, v <- v + 1),
                        v <- v + 1),
                 v <- v + 1)
}
df$grouping <- z
Then I would just group using min(date_column1) and max(date_column2).
This method works fine for, say, 100,000 rows (22.86 seconds),
but for a million rows it takes 33.18 minutes! I have over 60M rows to process,
so is there a way to make the process more efficient?
PS: to generate a similar table you can use the following code:
x <- NULL
for (i in 1:200) {x[i] <- paste(sample(c(LETTERS, 1:9), 10), collapse = '')}
y <- sample((as.Date('2014-01-01')):as.Date('2015-05-01'), 1000, replace = T)
y2 <- y + sample(1:10)
df <- data.frame(char_column = sample(x, 1000, rep = T),
                 date_column1 = as.Date(y, origin = '1970-01-01'),
                 date_column2 = as.Date(y2, origin = '1970-01-01'),
                 integer_column = sample(1:3, 1000, replace = T),
                 row.names = NULL)
df <- df[order(df$char_column, df$date_column1), ]
Since data.table::rleid does not work here, I post another (hopefully) fast solution.
1. Get rid of nested ifelse
ifelse is often slow, especially for scalar evaluation; use if instead.
Nested ifelse should be avoided whenever possible: observe that ifelse(A, ifelse(B, x, y), y) can be suitably replaced by if (A & B) x else y.
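A minimal illustration of the scalar case:
A <- TRUE; B <- FALSE
ifelse(A, ifelse(B, "x", "y"), "y")  # runs the vectorized machinery for a single value
if (A & B) "x" else "y"              # same result, much cheaper per call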
f1 <- function(df) {
  z <- rep(NA, nrow(df))
  z[1] <- 1
  char_col <- df[, 'char_column']
  date_col1 <- df[, 'date_column1']
  date_col2 <- df[, 'date_column2']
  int_col <- df[, 'integer_column']
  for (i in 2:nrow(df)) {
    if ((char_col[i] == char_col[i-1]) &
        ((date_col1[i] - date_col2[i-1]) <= 5) &
        (int_col[i] == int_col[i-1])) {
      z[i] <- z[i-1]
    } else {
      z[i] <- z[i-1] + 1
    }
  }
  z
}
f1 is about 40% faster than the original solution for 10,000 rows.
system.time(f1(df))
# user system elapsed
# 2.72 0.00 2.79
2. Vectorize
Upon closer inspection the conditions inside if can be vectorized
library(data.table)
f2 <- function(df) {
  z <- rep(NA, nrow(df))
  z[1] <- 1
  char_col <- df[, 'char_column']
  date_col1 <- df[, 'date_column1']
  date_col2 <- df[, 'date_column2']
  int_col <- df[, 'integer_column']
  cond <- (char_col == shift(char_col)) &
          (date_col1 - shift(date_col2) <= 5) &
          (int_col == shift(int_col))
  for (i in 2:nrow(df)) {
    if (cond[i]) {
      z[i] <- z[i-1]
    } else {
      z[i] <- z[i-1] + 1
    }
  }
  z
}
# for 10000 rows
system.time(f2(df))
# user system elapsed
# 0.01 0.00 0.02
3. Vectorize, Vectorize
While f2 is already quite fast, further vectorization is possible. Observe how z is calculated: cond is a logical vector, and z[i] = z[i-1] + 1 exactly when cond[i] is FALSE (otherwise z[i] = z[i-1]). This is none other than cumsum(!cond), once the leading NA that shift produces in cond[1] is replaced by FALSE.
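A small illustration of that identity with a hypothetical cond:
cond <- c(NA, TRUE, FALSE, TRUE, FALSE)
cumsum(!c(FALSE, cond[-1L]))
# [1] 1 1 2 2 3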
f3 <- function(df) {
  setDT(df)
  df[, cond := (char_column == shift(char_column)) &
               (date_column1 - shift(date_column2) <= 5) &
               (integer_column == shift(integer_column))]
  df[, group := cumsum(!c(FALSE, cond[-1L]))]
}
For 1M rows
system.time(f3(df))
# user system elapsed
# 0.05 0.05 0.09
system.time(f2(df))
# user system elapsed
# 1.83 0.05 1.87
Problem
Find the sum of all numbers below 1000 that are divisible by 3 or 5.
One solution I created:
x <- c(1:999)
values <- x[x %% 3 == 0 | x %% 5 == 0]
sum(values)
The second solution I can't get to work and need help with; I've pasted it below.
I'm trying to use a loop (here while(); after this I'll try for()). I am still struggling to keep references to indexes (locations in a vector) separate from the values/observations within vectors, and loops seem to make the distinction more challenging for me.
Why does this not produce the answer to Euler #1?
x <- 0
i <- 1
while (i < 100) {
  if (i %% 3 == 0 | i %% 5 == 0) {
    x[i] <- c(x, i)
  }
  i <- i + 1
}
sum(x)
And in words, line by line this is what I understand is happening:
x gets value 0
i gets value 1
while object i's value (not the index #) is < 1000
if is divisible by 3 or 5
add that number i to the vector x
add 1 to i (in order to keep the loop going to the defined limit of 1e3)
sum all items in vector x
I am guessing x[i] <- c(x, i) is not the right way to add an element to vector x. How do I fix this and what else is not accurate?
First, your loop runs until i < 100, not i < 1000.
Second, replace x[i] <- c(x, i) with x <- c(x, i) to add an element to the vector.
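Putting both fixes together, a sketch of the corrected loop:
x <- 0  # the leading 0 does not affect the sum
i <- 1
while (i < 1000) {
  if (i %% 3 == 0 | i %% 5 == 0) {
    x <- c(x, i)
  }
  i <- i + 1
}
sum(x)
# [1] 233168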
Here is a shortcut that performs this sum, which is probably more in the spirit of the problem:
3*(333*334/2) + 5*(199*200/2) - 15*(66*67/2)
## [1] 233168
Here's why this works:
In the set of integers [1,999] there are:
333 values that are divisible by 3. Their sum is 3*sum(1:333) or 3*(333*334/2).
199 values that are divisible by 5. Their sum is 5*sum(1:199) or 5*(199*200/2).
Adding these up gives a number that is too high by their intersection: the values divisible by 15. There are 66 such values, and their sum is 15*sum(1:66) or 15*(66*67/2).
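The counts themselves are easy to verify:
length(seq(3, 999, by = 3))    # [1] 333
length(seq(5, 999, by = 5))    # [1] 199
length(seq(15, 999, by = 15))  # [1] 66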
As a function of N, this can be written:
f <- function(N) {
  threes <- floor(N/3)
  fives <- floor(N/5)
  fifteens <- floor(N/15)
  3*(threes*(threes+1)/2) + 5*(fives*(fives+1)/2) - 15*(fifteens*(fifteens+1)/2)
}
Giving:
f(999)
## [1] 233168
f(99)
## [1] 2318
And another way (this works because x is 1:999, so the indices that which returns are equal to the values themselves):
x <- 1:999
sum(which(x%%5==0 | x%%3==0))
# [1] 233168
A very efficient approach is the following:
div_sum <- function(x, n) {
  # calculates twice the sum of all integers from 1 to n
  # that are divisible by x
  max_num <- n %/% x
  x * (max_num + 1) * max_num
}
n <- 999
a <- 3
b <- 5
(div_sum(a, n) + div_sum(b, n) - div_sum(a * b, n)) / 2
In contrast, a very short version is the following:
x <- 1:999
sum(x[!x %% 3 | !x %% 5])  # !x %% 3 is TRUE exactly when x %% 3 is 0
Here is an alternative that I think gives the same answer (using 99 instead of 999 as the upper bound):
iters <- 100
x <- rep(0, iters - 1)
i <- 1
while (i < iters) {
  if (i %% 3 == 0 | i %% 5 == 0) {
    x[i] <- i
  }
  i <- i + 1
}
sum(x)
# [1] 2318
Here is the for-loop mentioned in the original post:
iters <- 99
x <- rep(0, iters)
for (i in 1:iters) {  # the for loop sets i itself, so no manual increment is needed
  if (i %% 3 == 0 | i %% 5 == 0) {
    x[i] <- i
  }
}
sum(x)
# [1] 2318