Calculate maximum length of consecutive days above a certain threshold in a raster stack - r

I would like to calculate the maximum length of consecutive days above a threshold t given a raster stack s as shown below:
library(raster)
set.seed(112)
x1 <- raster(nrows=10, ncols=10)
x2=x3=x4=x5=x6=x1
x1[]= runif(ncell(x1))
x2[]= runif(ncell(x1))
x3[]= runif(ncell(x1))
x4[]= runif(ncell(x1))
x5[]= runif(ncell(x1))
x6[]= runif(ncell(x1))
s=stack(x1,x2,x3,x4,x5,x6)*56
Here is my current function.
fun <- function(x, t){
  y <- rle((x > t) * 1)
  z <- y$lengths[y$values == 1]
  return(max(z, 0))
}
I have also set a parameter q for export, as advised on the cluster {raster} help page:
q <- 0
I expect a raster layer as output, but instead the error below pops up:
[1] "cannot use this function"
attr(,"class")
[1] "snow-try-error" "try-error"
Error in clusterR(s, calc, args = list(fun = fun), export = "q") : cluster error
What could be the problem?

First, if you use random values in your example data, please also set the random seed so it's reproducible.
library(raster)
set.seed(42)
x1 <- raster(nrows=10, ncols=10)
s <- do.call(stack,lapply(1:6,function(x) setValues(x1,runif(ncell(x1)))*56))
As to your question, the only thing you need is a simple function that can be passed into calc to obtain the desired results:
cd <- function(x, t){
  y <- rle((x > t) * 1)
  z <- y$lengths[y$values == 1]
  return(max(z, 0))
}
This function uses rle, or run-length encoding, to compute the lengths of consecutive runs in a vector. In this case I'm looking for the longest run of 1s, which come from multiplying TRUE values (value is above threshold t) by 1. In the end you want to return the maximum run length for the value 1, with 0 as a fallback in case there's no occurrence at all (side note: a run length of 1 indicates a single, non-consecutive occurrence).
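As a quick illustration of what rle returns (toy vector made up for this example):
y <- rle(c(1, 1, 0, 1, 1, 1, 0))
y$values                           # 1 0 1 0
y$lengths                          # 2 1 3 1
y$lengths[y$values == 1]           # run lengths of the 1s: 2 3
max(y$lengths[y$values == 1], 0)   # longest run: 3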
Finally, cd can be passed into calc, in this case using a threshold of 40:
plot(calc(s,function(x) cd(x,40)))
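Coming back to the clusterR error in the question: calc expects a single-argument function, so the threshold has to be fixed (or supplied via args) before handing the function over. A minimal parallel sketch, assuming a hypothetical wrapper cdt with t baked in and cd defined in the global environment (hence the export):
cdt <- function(x) cd(x, 40)  # wrapper fixing the threshold at 40
beginCluster(2)
r <- clusterR(s, calc, args = list(fun = cdt), export = "cd")
endCluster()
plot(r)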

Related

Creating a function that determines the impact of an outlier

My big-picture goal is to demonstrate the difference outliers can have on a dataset's average. I'm trying to create a function that uses the size of an outlier "k" as an input and outputs the average. Basically, the function needs to take any value "k" (which is the outlier) and return the average of vector x if the first value of x were replaced with k. For example, say the dataset is the heights of a population of students. The first value is supposed to be 71.3 cm but the kid accidentally put 713 cm. In this case, I want my function to tell me what would be the average of my vector if there was an outlier of value 713 (k = 713). So far I have the following, where x is the name of the dataset of heights.
average_err <- function(k) {
  x[1] <- k
  mean(x[1])
}
Then I calculate the average as if there were an outlier of 713:
average_err(713)
However, my output is always identical to my input. Will someone please help me?
I would suggest:
average_err <- function(x, k) {
  mean(c(x, k))
}
In the above, instead of replacing one of the x-values with an outlier, you're adding an outlier to the existing x-vector. As @SteveM suggested, you should also have the function take x as an argument:
x <- rnorm(25)
average_err(x, 100)
# [1] 3.627824
You could also build it to print the mean of the original x, the mean of x with k, and the difference:
average_err <- function(x, k) {
  m1 <- mean(x)
  m2 <- mean(c(x, k))
  d <- m2 - m1
  out <- data.frame(mean = c(m1, m2, d))
  rownames(out) <- c("x", "x,k", "difference")
  out
}
average_err(x,100)
# mean
# x -0.2270631
# x,k 3.6278239
# difference 3.8548870
I'm not sure I understand correctly, but I would rather replace mean(x[1]) with mean(x) in your case. If you write mean(x[1]), you take the average of one value only: the one you have replaced with the outlier k.
average_err <- function(k) {
  x[1] <- k
  mean(x)
}
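Note that this version still relies on x existing in the calling environment. A small sketch (with made-up heights) that combines both suggestions, replacing the first value and passing x explicitly:
heights <- c(71.3, 65.2, 68.0, 70.1, 66.5)  # hypothetical data
average_err <- function(x, k) {
  x[1] <- k   # replace the first value with the outlier
  mean(x)
}
average_err(heights, 713)   # 196.56, the average as if the first height were 713
mean(heights)               # 68.22, the original average for comparison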

Estimating an OLS model in R with million observations and thousands of variables

I am trying to estimate a big OLS regression with ~1 million observations and ~50,000 variables using biglm.
I am planning to run each estimation using chunks of approximately 100 observations each. I tested this strategy with a small sample and it worked fine.
However, with the real data I am getting an "Error: protect(): protection stack overflow" when trying to define the formula for the biglm function.
I've already tried:
- starting R with --max-ppsize=50000
- setting options(expressions = 50000)
but the error persists.
I am working on Windows and using RStudio.
# create the sample data frame (In my true case, I simply select 100 lines from the original data that contains ~1,000,000 lines)
DF <- data.frame(matrix(nrow=100,ncol=50000))
DF[,] <- rnorm(100*50000)
colnames(DF) <- c("y", paste0("x", seq(1:49999)))
# get names of covariates
my_xvars <- colnames(DF)[2:( ncol(DF) )]
# define the formula to be used in biglm
# HERE IS WHERE I GET THE ERROR :
my_f <- as.formula(paste("y~", paste(my_xvars, collapse = " + ")))
EDIT 1:
The ultimate goal of my exercise is to estimate the average effect of all 50,000 variables. Therefore, simplifying the model selecting fewer variables is not the solution I am looking for now.
The first bottleneck (I can't guarantee there won't be others) is in the construction of the formula. R can't construct a formula that long from text (the details are too ugly to explore right now). Below I show a hacked version of the biglm code that can take the model matrix X and response variable y directly, rather than using a formula to build them. However, the next bottleneck is that the internal function biglm:::bigqr.init(), which gets called inside biglm, tries to allocate a numeric vector of size choose(nc,2) = nc*(nc-1)/2 (where nc is the number of columns). When I try with 50,000 columns I get
Error: cannot allocate vector of size 9.3 Gb
(2.3 GB are required when nc is 25,000). The code below runs on my laptop when nc <- 10000.
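Both numbers in those error messages check out, since a double takes 8 bytes:
choose(50000, 2) * 8 / 2^30   # ~9.3, matching the 9.3 Gb error above
choose(25000, 2) * 8 / 2^30   # ~2.3
choose(10000, 2) * 8 / 2^30   # ~0.37, feasible on a laptop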
I have a few caveats about this approach:
- you won't be able to handle a problem with 50,000 columns unless you have at least 10 GB of memory, because of the issue described above;
- biglm:::update.biglm would have to be modified in a parallel way (this shouldn't be too hard);
- I have no idea if the p >> n issue (which applies at the level of fitting the initial chunk) will bite you. When running my example below (with 10 rows, 10,000 columns), all but 10 of the parameters are NA. I don't know if these NA values will contaminate the results so that successive updating fails. If so, I don't know if there's a way to work around the problem, or whether it's fundamental (so that you would need nr > nc for at least the initial fit). (It would be straightforward to do some small experiments to see if there is a problem, but I've already spent too long on this ...);
- don't forget that with this approach you have to explicitly add an intercept column to the model matrix (e.g., X <- cbind(1, X)) if you want one.
Example (first save the code at the bottom as my_biglm.R):
nr <- 10
nc <- 10000
DF <- data.frame(matrix(rnorm(nr*nc),nrow=nr))
respvars <- paste0("x", seq(nc-1))
names(DF) <- c("y", respvars)
# illustrate formula problem: fails somewhere in 15000 < nc < 20000
try(reformulate(respvars,response="y"))
source("my_biglm.R")
rr <- my_biglm(y=DF[,1],X=as.matrix(DF[,-1]))
my_biglm <- function(formula, data, weights = NULL, sandwich = FALSE,
                     y = NULL, X = NULL, off = 0) {
  if (!is.null(weights)) {
    if (!inherits(weights, "formula"))
      stop("`weights' must be a formula")
    w <- model.frame(weights, data)[[1]]
  } else w <- NULL
  if (is.null(X)) {
    tt <- terms(formula)
    mf <- model.frame(tt, data)
    if (is.null(off <- model.offset(mf)))
      off <- 0
    mm <- model.matrix(tt, mf)
    y <- model.response(mf) - off
  } else {
    ## model matrix specified directly
    if (is.null(y)) stop("both y and X must be specified")
    mm <- X
    tt <- NULL
  }
  qr <- biglm:::bigqr.init(NCOL(mm))
  qr <- biglm:::update.bigqr(qr, mm, y, w)
  rval <- list(call = sys.call(), qr = qr,
               assign = attr(mm, "assign"), terms = tt,
               n = NROW(mm), names = colnames(mm),
               weights = weights)
  if (sandwich) {
    p <- ncol(mm)
    n <- nrow(mm)
    ## namespace-qualified: these internals are not exported by biglm
    xyqr <- biglm:::bigqr.init(p * (p + 1))
    xx <- matrix(nrow = n, ncol = p * (p + 1))
    xx[, 1:p] <- mm * y
    for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
    xyqr <- biglm:::update.bigqr(xyqr, xx, rep(0, n), w * w)
    rval$sandwich <- list(xy = xyqr)
  }
  rval$df.resid <- rval$n - length(qr$D)
  class(rval) <- "biglm"
  rval
}
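To check the p >> n behavior mentioned in the caveats, you can inspect the fitted coefficients. This assumes the standard biglm coef method dispatches cleanly on the hacked object, which only mimics biglm's structure:
cc <- coef(rr)    # assumption: coef.biglm works on the mimic object
sum(!is.na(cc))   # about nr (= 10) of the 9999 coefficients are non-NA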

r - How to translate this time series calculation into a raster calculation?

I'm trying to reproduce this vector (time series) calculation code:
gamma.parameters<- fitdistr(may_baseline_3months[may_baseline_3months>0],"gamma")
into a raster calculation code.
What this code does is fit a gamma distribution, by maximum likelihood estimation, to the vector (time series) may_baseline_3months.
And what I want to do is to calculate the same thing but with a raster stack.
I tried doing this with the calc() function:
f1 <- function(x)
{
  library(MASS)
  return(fitdistr(x, "gamma"))
}
gamma.parameters <- calc(x = may_baseline_3months, fun = f1)
Error in .calcTest(x[1:5], fun, na.rm, forcefun, forceapply) :
cannot use this function
but it didn't work.
Note: my raster stack has only 4 layers.
EDIT
You can download example data here: spi
The fitdistr call is part of the procedure for my main goal: I'm trying to calculate the Standardized Precipitation Index. I have already done it with a 30-year time series of monthly precipitation.
Here is the code for the time series, up to the line where I'm stuck:
data<-read.csv("guatemala_spi.csv",header = T,sep=";")
dates<-data[,1]
rain_1month<-data[,2]
rain_3months<-0
#Setting the first 2 elements to NA because I'm going to accumulate the rainfall over 3 months
for (i in c(1:2)) {
  rain_3months[i] <- NA
}
#Accumulating the rainfall for the rest of the data
number_of_months <- length(rain_1month)
for (j in c(3:number_of_months))
{
  rain_3months[j] <- 0.0
  for (i in c(0:2))
  {
    rain_3months[j] = rain_3months[j] + rain_1month[j-i]
  }
}
#Extracting a time-series for the month of interest (May)
may_rain_3months <- rain_3months[substr(dates,5,6)=="05"]
dates_may <- dates[substr(dates,5,6)=="05"]
number_of_years<-length(dates_may)
#Fitting the gamma distribution by maximum likelihood estimation
start_year<-1971
end_year<-2010
start_index<-which(substr(dates_may,1,4)==start_year)
end_index<-which(substr(dates_may,1,4)==end_year)
may_baseline_3months<-may_rain_3months[start_index:end_index]
library(MASS)
gamma.parameters<-fitdistr(may_baseline_3months[may_baseline_3months>0],"gamma")
That last line is the one that I'm having problems to calculate for a raster stack.
Here's what I have so far in raster form:
Example multi-layer raster here (Monthly precipitation 2001 to 2004, 48 layers in total)
#Initiating a dates vector
dates<-c("200101","200102","200103","200104","200105","200106","200107","200108","200109","200110","200111","200112",
"200201","200202","200203","200204","200205","200206","200207","200208","200209","200210","200211","200212",
"200301","200302","200303","200304","200305","200306","200307","200308","200309","200310","200311","200312",
"200401","200402","200403","200404","200405","200406","200407","200408","200409","200410","200411","200412")
#Initiating a NA raster
rain_3months_1layer<-raster(nrow=1600, ncol=1673,extent(-118.4539, -34.80395, -50, 30),res=c(0.05,0.05))
values(rain_3months_1layer)<-NA
#Creating a raster stack NA of 48 layers
rain_3months<-stack(mget(rep( "rain_3months_1layer" , 48 )))
#Reading the data
rain_1month <- stack("chirps_rain_1month.tif")
#Accumulating the rainfall
number_of_months<-nlayers(rain_1month)
for (j in c(3:number_of_months))
{
  rain_3months[[j]] <- 0.0
  for (i in c(0:2))
  {
    rain_3months[[j]] = rain_3months[[j]] + rain_1month[[j-i]]
  }
}
#Extracting the raster for the month of interest (May)
may_rain_3months<-stack(rain_3months[[which(substr(dates,5,6)=="05", arr.ind = T)]])
dates_may<-dates[substr(dates,5,6)=="05"]
number_of_years<-length(dates_may)
#Fitting the gamma distribution by maximum likelihood estimation
start_year<-2001
end_year<-2004
start_index<-which(substr(dates_may,1,4)==start_year)
end_index<-which(substr(dates_may,1,4)==end_year)
may_baseline_3months<-stack(may_rain_3months[[start_index:end_index]])
library(MASS)
f1 <- function(x)
{
  library(MASS)
  return(fitdistr(x, "gamma"))
}
gamma.parameters <- calc(x = may_baseline_3months, fun = f1)
I can't get calc() to apply fitdistr() to the raster stack.
You need to make a function that calc can use. Your function f1 returns an object of class fitdistr. The calc function does not know what to do with that:
library(MASS)
set.seed(0)
x <- runif(10)
f1 <- function(x) {
  return(fitdistr(x, "gamma"))
}
a <- f1(x)
class(a)
# [1] "fitdistr"
a
# shape rate
# 4.401575 6.931571
# (1.898550) (3.167113)
You need a function that returns numbers. Like f2:
f2 <- function(x) {
  fitdistr(x, "gamma")$estimate
}
b <- f2(x)
class(b)
#[1] "numeric"
b
# shape rate
#4.401575 6.931571
Test f2 with calc:
library(raster)
r <- raster(nrow=10, ncol=10)  # template raster; assumed here, since the original snippet used r before defining it
s <- stack(lapply(1:12, function(i) setValues(r, runif(ncell(r)))))
r <- calc(s, f2)
I assume that this answers your question. I cannot be sure, because your question is way too complex. The first thing you need to do with problems like this is to create a simple example, like I have done above.
Next question
Error in stats::optim(x = c(7, 7, 7, 7), par = list(shape = Inf, rate = Inf), : non-finite value supplied by optim.
That is a different issue: you are providing fitdistr with values it cannot deal with. You can add a try clause to skip over those. You could also identify which cells this happens in, and what the values are, to see if there is something else you should do.
f3 <- function(x) {
  x <- try(fitdistr(x, "gamma")$estimate, silent = TRUE)
  if (class(x) == 'try-error') { c(-9999, -9999) } else { x }
}
x[1] <- NA
f2(x)
#Error in fitdistr(x, "gamma") : 'x' contains missing or infinite values
f3(x)
#[1] -9999 -9999
Note that you need to make sure that the number of values returned by f3 is always the same; in this case, two values. Here I use -9999 so that you can identify the affected cells. You can also use NA.
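Applied over the stack, a sketch of the full workflow (reusing s and f3 from above) would be to run calc and then turn the sentinel value back into NA:
out <- calc(s, f3)        # two-layer result: the shape and rate estimates
out[out == -9999] <- NA   # convert the sentinel value back to NA
names(out) <- c("shape", "rate")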

Derivative of a set of points

So I know you can find the derivative of something like x^3 - 6*x^2 by doing D(expression(x^3 - 6*x^2), 'x'), but what if I need to find the maximum of the first derivative of a list of values such as:
value <- c(610,618,627,632,628,634,634,628,634,642,637,643,653,666,684,717,787,923,1197,1716,2638,4077,5461,7007,8561,9994,11278,12382,13382,14252)
These values are the y coordinates, and the x coordinate starts at 1 and increments by 1, i.e. the first point is (1, 610), the second is (2, 618), etc. Thanks.
Consider using the package numDeriv from CRAN. It has a function grad that computes the derivative of a function at a point. Example:
f = function(x) x^3 - 6*x^2
library(numDeriv)
grad(f, 1) #derivative of f at x=1
To solve your problem with a list of values, use a for loop:
xval <- c(YOUR VALUES HERE)
xval.derivatives <- c() #empty vector to hold the derivatives
for(i in 1:length(xval)) xval.derivatives[i] <- grad(f,xval[i])
The gradient function from the pracma package calculates the derivative from a vector of values.
library(pracma)
value <- c(610,618,627,632,628,634,634,628,634,642,637,643,653,666,684,717,787,923,1197,1716,2638,4077,5461,7007,8561,9994,11278,12382,13382,14252)
value_prime <- pracma::gradient(value, h1 = 1)
plot(value_prime)
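Since the question asks for the maximum of the first derivative, you can read it off value_prime directly:
which.max(value_prime)   # x coordinate (index) of the steepest increase
max(value_prime)         # the maximum slope itself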
Alternatively, fit a spline.
spl <- smooth.spline(1:length(value), y=value)
pred <- predict(spl)
pred.prime <- predict(spl, deriv=1)
plot(pred.prime, type = 'b')
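The same maximum can be read off the spline-based derivative, since predict returns x and y components:
pred.prime$x[which.max(pred.prime$y)]   # location of the steepest increase
max(pred.prime$y)                       # corresponding maximum slope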
If you are interested in higher derivatives, check the pspline package.

extreme value function in R

I'm studying an extreme value problem from http://cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf
It follows the remark "An interesting extreme value distribution is given by Pevsner (2003, p.103)".
I tried to generate a sample of size 1000 from the standard normal distribution and repeated this 1000 times. Then I subtract an from these maxima and divide by bn, where
fn <- exp(-n)*exp(-exp(-n))
an <- sqrt(2*log(n)) - 0.5*(log(log(n))+log(4*pi))*(2*log(n))^(-1/2)
bn <- (2*log(n))^(-1/2)
> my.stat <-NULL
> for (i in 1:1000) {
+ n <- rnorm(1000)
+ fn <- exp(-n)*exp(-exp(-n))
+ an <- sqrt(2*log(n)) - 0.5*(log(log(n))+log(4*pi))*(2*log(n))^(-1/2)
+ bn <- (2*log(n))^(-1/2)
+ my.stat <- c(my.stat, sum(sum(fn-an)/bn)))
> par(mfrow=c(2,2))
> hist(my.stat,freq=FALSE,main="histogram of 1000 M",xlab="M")
Error: object 'fn' not found
I'm more concerned with how to deal with the
my.stat <- c(my.stat, sum(sum(fn-an)/bn)))
part in the loop, since I store all the sample values in the vector my.stat and try to compute (fn-an)/bn during each iteration.
I guess my question is: is there a better way to add (fn-an)/bn to my vector each time in my for loop? Essentially, I want my.stat to store all the (fn-an)/bn values. Thanks.
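One way to collect the normalized maxima is sketched below. The assumptions here: n in an and bn is the sample size (1000), each replicate contributes the statistic (max - an)/bn, and the fn line (the limiting Gumbel density) is not needed for the statistic itself:
set.seed(1)
n <- 1000
an <- sqrt(2*log(n)) - 0.5*(log(log(n)) + log(4*pi))*(2*log(n))^(-1/2)
bn <- (2*log(n))^(-1/2)
# one normalized maximum per replicate, collected without growing a vector in a loop
my.stat <- replicate(1000, (max(rnorm(n)) - an)/bn)
hist(my.stat, freq = FALSE, main = "histogram of 1000 M", xlab = "M")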
