Looping through a large RasterBrick in raster R

How can I perform a for-loop over a large daily RasterBrick to get annual stacks and calculate the maximum value (the annual maximum) for each year (each stack of 365 files)?
Basically, I have the same question as this one. So, taking that question as a sample, how can I write a for-loop that calculates the maximum value for each of the 46 stacks (each stack with 8 layers)?
I tried using only stackApply, but it gives all black/zero values when I run it for the whole period; however, it gives the maximum values if I run it for individual years (tested separately for 10 years; I have more than 100 years of data).
library(raster)
# example data
sca <- brick(nrow=108,ncol=132,nl=365)
values(sca) <- runif(ncell(sca)*nlayers(sca))
# indices grouping sets of 8
i <- rep(1:ceiling(365/8), each=8)
# the last period is not a complete set of 8 days
i <- i[1:nlayers(sca)]
# This does not work for me, gives output as zero.
x <- stackApply(sca, i, max)
for (i in 1:nlayers(sca)) {
  x <- sca[[i]]
  xx <- stackApply(sca, i, max)
  plot(xx)
  # etc.
}

You could loop like this:
library(raster)
sca <- brick(nrow=108,ncol=132,nl=365)
values(sca) <- runif(ncell(sca)*nlayers(sca))
i <- rep(1:ceiling(365/8), each=8)
i <- i[1:nlayers(sca)]
for (j in unique(i)) {
  x <- sca[[which(j == i)]]
  xx <- max(x, na.rm=TRUE)
  # or, to write each period's maximum to its own file:
  # xx <- calc(x, fun=max, na.rm=TRUE, filename=paste0(j, '.tif'))
}
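If you want the maxima of all periods combined into a single multi-layer object instead of (or in addition to) writing files, one option (a sketch, not part of the original answer) is to collect each result in a list and stack it afterwards:
# collect the maximum of each 8-day period, then combine into one stack
out <- vector("list", length(unique(i)))
for (j in unique(i)) {
  out[[j]] <- max(sca[[which(j == i)]], na.rm=TRUE)
}
maxima <- stack(out)
names(maxima) <- paste0("period_", unique(i))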

Related

R: Calculate standard deviation for specific time interval

I have a dataset with daily bond returns for some unique RIC codes (in total approx. 200,000 observations).
Now I want to calculate the standard deviation of those returns over the combined period t-30 to t-6 and t+6 to t+30. This means that for every observation i,t, I need the 24 returns before t in the window t-30 to t-6 and the 24 returns in the window t+6 to t+30, and then calculate the standard deviation based on those 48 observations.
Here is a small snippet of my dataset:
#My data:
date <- c("2022-05-11", "2022-05-12","2022-05-13","2022-05-16","2022-05-17","2022-05-11", "2022-05-12","2022-05-13","2022-05-16","2022-05-17")
ric <- c("AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "AT0000A1D541=", "SE247827293=", "SE247827293=", "SE247827293=", "SE247827293=", "SE247827293=")
return <- c(0.001009681, 0.003925873, 0.000354606, -0.000472641, -0.002935700, 0.003750854, 0.012317347, -0.001314047, 0.001014453, -0.007234452)
df <- data.frame(ric, date, return)
I have tried to use the slider package to generate two lists with the returns of the specific time frame. However, I feel that there is some more efficient way to solve this problem. I hope to find some help here.
This is what I tried before:
library(slider)
x <- slide(df$return, ~.x, .before = 30, .after = -6)   # returns in t-30 .. t-6
y <- slide(df$return, ~.x, .before = -6, .after = 30)   # returns in t+6 .. t+30
z <- mapply(c, x, y, SIMPLIFY = FALSE)
for (i in 1:length(z)) {
  df$sd[i] <- sd(z[[i]])
}
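For what it's worth, here is a minimal sketch (not from the original post) of how the same windows could be computed per RIC with slider and dplyr; it assumes the rows within each ric are consecutive trading days, so that row offsets stand in for the t-30/t-6 and t+6/t+30 offsets:
library(dplyr)
library(slider)
df_sd <- df %>%
  group_by(ric) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    before = slide(return, ~.x, .before = 30, .after = -6),   # t-30 .. t-6
    after  = slide(return, ~.x, .before = -6, .after = 30),   # t+6 .. t+30
    sd_win = mapply(function(b, a) sd(c(b, a)), before, after)
  ) %>%
  select(-before, -after) %>%
  ungroup()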

Subsetting a rasterbrick to give mean of three minimum months in each year

I'm interested in creating two variables from a time series of spatial raster data in R, read from a netCDF file. Opening the data, subsetting by Z, and creating a mean value is straightforward:
# Working example below using a ~16mb nc file of sea ice from HadISST
library(R.utils)
library(raster)
library(tidyverse)
download.file("https://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_ice.nc.gz","HadISST_ice.nc.gz")
gunzip("HadISST_ice.nc.gz", ext="gz", FUN=gzfile)
hadISST <- brick('HadISST_ice.nc')
# subset to a decade time period and create decadal mean from monthly data
hadISST_a <- hadISST %>%
  subset(., which(getZ(.) >= as.Date("1900-01-01") & getZ(.) <= as.Date("1909-12-31"))) %>%
  mean(., na.rm = TRUE)
But I'm interested in extracting 1) annual mean values, and 2) the annual mean of the three minimum monthly values, for the subsetted time period. My current workflow uses nc_open() and ncvar_get() to open the data, raster::extract() to get the values, and then tidyverse group_by() and slice_min() to get the annual coolest months, but it's a slow and CPU-intensive approach. Is there a more effective way of doing this without converting from raster to data.frame?
Questions:
Using the above code how can I extract annual means rather than a mean of ALL months over the decadal period?
Is there a way of using slice_min(order_by = sst, n = 3) or similar with brick objects to get the minimum three values per year prior to annual averaging?
Example data
if (!file.exists("HadISST_ice.nc")) {
  download.file("https://www.metoffice.gov.uk/hadobs/hadisst/data/HadISST_ice.nc.gz", "HadISST_ice.nc.gz")
  R.utils::gunzip("HadISST_ice.nc.gz")
}
library(terra)
hadISST <- rast('HadISST_ice.nc')
Annual mean
y <- format(time(hadISST), "%Y")
m <- tapp(hadISST, y, mean)
Mean of the lowest three monthly values by year (this takes much longer because a user-defined R function is used). I now see that there is a bug in the CRAN version; you can instead use version 1.5-47, which you can install like this: install.packages('terra', repos='https://rspatial.r-universe.dev').
f <- function(i) mean(sort(i)[1:3])
m3 <- tapp(hadISST, y, f)
To make this faster (if you have multiple cores):
m3 <- tapp(hadISST, y, f, cores=4)
There are likely much more intelligent ways to do this, but here is a thought on your annual means (here hadISST is called hadI):
v <- getValues(hadI)
v_t <- t(v)   # this takes a while
v_mean <- vector(mode='numeric')
for (k in 1:nrow(v_t)) {
  v_mean[k] <- mean(v_t[k, ], na.rm = TRUE)
}
length(v_mean)
# [1] 1828
v_mean[1:11]
# [1] 0.1651351 0.1593368 0.1600364 0.1890360 0.1931470 0.1995657 0.1982052
# [8] 0.1917534 0.1917840 0.1911638 0.1911657
I'm a little unclear on proposition 2, as it seems it would be the average of 3 0's...
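As a side note (not part of the original answers), the loop above computes one global mean per layer (time step), which the raster package can do directly:
# equivalent to the loop above: one mean over all cells for each layer
v_mean <- cellStats(hadI, stat = 'mean', na.rm = TRUE)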

Linear regression on each raster pixel to predict a future month (in R)

I have successfully run this code. I took it from:
Can't Calculate pixel-wise regression in R on raster stack with fun
library(raster)
# Example data
r <- raster(nrow=15, ncol=10)
set.seed(0)
# Now I make 6 rasters (1 raster per month), then assign each pixel's value randomly
s <- stack(lapply(1:6, function(i) setValues(r, rnorm(ncell(r), i, 3))))
names(s) <- paste0('Month', c(1,2,3,4,5,6))
# Extract the values of every pixel
x <- values(s)
# Model with linreg
m <- lm(Month6 ~ ., data=data.frame(x))
# Prediction raster
p <- predict(s, m)
If you run that code, p will be a raster. But I am still confused: how can I make a raster for a future month? For example, I want a 'Month8' raster based on the 6 previous rasters.
What I mean is that each pixel has its own linear-regression equation (with X = Month1, ..., Month6). If I input X = Month8, I will get 150 cells of Y for the 8th month, one in each pixel of the raster.
What I have done
# Lets try make a data frame for clear insight for my data
x <- values(s)
DF <- data.frame(x)
# Use the month number as X; each pixel's values are the target y
library(data.table)
DF_T <- transpose(DF)
Month <- seq(1,nrow(DF_T))
DF_T <- cbind(Month, DF_T)
# Fit a linear model for the first pixel
V1_lr <- lm(V1 ~ Month, data=DF_T)
# prediction for the 8th month in that pixel
V1_p <- predict(V1_lr, data.frame(Month=8))
V1_p
This is just one pixel; I want the entire raster for 'Month8'.
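One way to get that (a sketch, assuming you want a per-pixel trend of value against month number rather than the Month6 ~ . model above) is to move the regression into raster::calc() and evaluate the fitted line at month 8:
# per-pixel linear trend against month number, evaluated at month 8
predict_month8 <- function(v) {
  if (all(is.na(v))) return(NA_real_)
  month <- seq_along(v)
  fit <- lm(v ~ month)
  unname(predict(fit, data.frame(month = 8)))
}
month8 <- calc(s, predict_month8)
plot(month8)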

How to translate this time series calculation into a raster calculation?

I'm trying to reproduce this vector (time series) calculation code:
gamma.parameters<- fitdistr(may_baseline_3months[may_baseline_3months>0],"gamma")
into a raster calculation code.
What this code originally does is fit a gamma distribution by maximum likelihood estimation to the vector (time series) may_baseline_3months.
And what I want to do is to calculate the same thing but with a raster stack.
I tried doing this with calc() function:
f1 <- function(x) {
  library(MASS)
  return(fitdistr(x, "gamma"))
}
gamma.parameters <- calc(x = may_baseline_3months, fun = f1)
# Error in .calcTest(x[1:5], fun, na.rm, forcefun, forceapply) :
#   cannot use this function
but it didn't work.
Note: my raster stack has only 4 layers.
EDIT
You can download example data here: spi
The fitdistr call is part of the procedure for my main goal: I'm trying to calculate the Standardized Precipitation Index (SPI). I have already done it with a 30-year time series of monthly precipitation.
Here is the code for the time series, up to the line where I'm stuck:
data<-read.csv("guatemala_spi.csv",header = T,sep=";")
dates<-data[,1]
rain_1month<-data[,2]
rain_3months<-0
#Setting the first 2 elements to NA because I'm going to accumulate the rainfall over 3 months
for (i in c(1:2)) {
  rain_3months[i] <- NA
}
#Accumulating the rainfall for the rest of the data
number_of_months <- length(rain_1month)
for (j in c(3:number_of_months)) {
  rain_3months[j] <- 0.0
  for (i in c(0:2)) {
    rain_3months[j] <- rain_3months[j] + rain_1month[j-i]
  }
}
#Extracting a time series for the month of interest (May)
may_rain_3months <- rain_3months[substr(dates,5,6)=="05"]
dates_may <- dates[substr(dates,5,6)=="05"]
number_of_years<-length(dates_may)
#Fitting the gamma distribution by maximum likelihood estimation
start_year<-1971
end_year<-2010
start_index<-which(substr(dates_may,1,4)==start_year)
end_index<-which(substr(dates_may,1,4)==end_year)
may_baseline_3months<-may_rain_3months[start_index:end_index]
library(MASS)
gamma.parameters<-fitdistr(may_baseline_3months[may_baseline_3months>0],"gamma")
That last line is the one I'm having problems calculating for a raster stack.
Here's what I have so far in raster form:
Example multi-layer raster here (Monthly precipitation 2001 to 2004, 48 layers in total)
#Initiating a dates vector
dates<-c("200101","200102","200103","200104","200105","200106","200107","200108","200109","200110","200111","200112",
"200201","200202","200203","200204","200205","200206","200207","200208","200209","200210","200211","200212",
"200301","200302","200303","200304","200305","200306","200307","200308","200309","200310","200311","200312",
"200401","200402","200403","200404","200405","200406","200407","200408","200409","200410","200411","200412")
#Initiating a NA raster
rain_3months_1layer<-raster(nrow=1600, ncol=1673,extent(-118.4539, -34.80395, -50, 30),res=c(0.05,0.05))
values(rain_3months_1layer)<-NA
#Creating a raster stack NA of 48 layers
rain_3months<-stack(mget(rep( "rain_3months_1layer" , 48 )))
#Reading the data
rain_1month <- stack("chirps_rain_1month.tif")
#Accumulating the rainfall
number_of_months <- nlayers(rain_1month)
for (j in c(3:number_of_months)) {
  rain_3months[[j]] <- 0.0
  for (i in c(0:2)) {
    rain_3months[[j]] <- rain_3months[[j]] + rain_1month[[j-i]]
  }
}
#Extracting the raster for the month of interest (May)
may_rain_3months<-stack(rain_3months[[which(substr(dates,5,6)=="05", arr.ind = T)]])
dates_may<-dates[substr(dates,5,6)=="05"]
number_of_years<-length(dates_may)
#Fitting the gamma distribution by maximum likelihood estimation
start_year<-2001
end_year<-2004
start_index<-which(substr(dates_may,1,4)==start_year)
end_index<-which(substr(dates_may,1,4)==end_year)
may_baseline_3months<-stack(may_rain_3months[[start_index:end_index]])
library(MASS)
f1 <- function(x) {
  library(MASS)
  return(fitdistr(x, "gamma"))
}
gamma.parameters <- calc(x = may_baseline_3months, fun = f1)
I can't get calc() to apply fitdistr() to the raster stack.
You need to make a function that calc can use. Your function f1 returns an object of class fitdistr. The calc function does not know what to do with that:
library(MASS)
set.seed(0)
x <- runif(10)
f1 <- function(x) {
  return(fitdistr(x, "gamma"))
}
a <- f1(x)
class(a)
# [1] "fitdistr"
a
# shape rate
# 4.401575 6.931571
# (1.898550) (3.167113)
You need a function that returns numbers. Like f2:
f2 <- function(x) {
  fitdistr(x, "gamma")$estimate
}
b <- f2(x)
class(b)
#[1] "numeric"
b
# shape rate
#4.401575 6.931571
Test f2 with calc:
library(raster)
r <- raster(nrow=15, ncol=10)   # r was not defined in this snippet; using the same dimensions as in the question
s <- stack(lapply(1:12, function(i) setValues(r, runif(ncell(r)))))
r <- calc(s, f2)
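For reference (not part of the original answer), the result then has one layer per value returned by f2, i.e. the shape and rate estimates:
nlayers(r)
# [1] 2
names(r) <- c("shape", "rate")
plot(r)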
I assume that this answers your questions. I cannot be sure because your question is way too complex. The first thing you need to do with a problem like this is to create a simple example, like I have done above.
Next question
Error in stats::optim(x = c(7, 7, 7, 7), par = list(shape = Inf, rate = Inf), :
  non-finite value supplied by optim.
That is a different issue: you are providing fitdistr with values it cannot deal with. You can add a try clause to skip over those. You could identify which cells this happens in, and what the values are, to see if there is something else you should do.
f3 <- function(x) {
  x <- try(fitdistr(x, "gamma")$estimate, silent=TRUE)
  if (class(x) == 'try-error') { c(-9999, -9999) } else { x }
}
x[1] <- NA
f2(x)
#Error in fitdistr(x, "gamma") : 'x' contains missing or infinite values
f3(x)
#[1] -9999 -9999
Note that you need to make sure that the number of values returned by f3 is always the same; in this case, two values. Here I use -9999 so that you can identify the cells; you could also use NA.
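As a quick usage note (not in the original answer), f3 drops into calc() just like f2 did:
r3 <- calc(s, f3)   # two layers (shape, rate); cells where fitdistr fails get -9999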

Select the most dissimilar individual using cluster analysis [duplicate]

I want to cluster my data into, say, 5 clusters, and then select the 50 individuals with the most dissimilar relationship from all the data. That means if cluster one contains 100 individuals, cluster two 200, cluster three 400, cluster four 200, and cluster five 100, I have to select 5 from the first cluster + 10 from the second + 20 from the third + 10 from the fourth + 5 from the fifth.
Data example:
mydata<-matrix(nrow=100,ncol=10,rnorm(1000, mean = 0, sd = 1))
What I did till now is cluster the data and rank the individuals within each cluster, then export it to Excel and go from there …
That has become a problem since my data has become really big.
I would appreciate any help or suggestions on how to do the above in R.
I'm not sure if it is exactly what you are searching for, but maybe it helps:
mydata <- matrix(nrow=100, ncol=10, rnorm(1000, mean = 0, sd = 1))
rownames(mydata) <- paste0("id", 1:100) # some id for identification
# cluster objects and calculate dissimilarity matrix
cl <- cutree(hclust(
  sim <- dist(mydata, diag = TRUE, upper = TRUE)), 5)
# combine results, take sum to aggregate dissimilarity
res <- data.frame(id = rownames(mydata),
                  cluster = cl, dis_sim = rowSums(as.matrix(sim)))
# order, lowest overall dissimilarity will be first
res <- res[order(res$dis_sim), ]
# split object
reslist <- split(res, f = res$cluster)
## takes the three items with the highest overall dissimilarity per cluster
lapply(reslist, tail, n = 3)
## returns ids with the highest overall dissimilarity, top 20%
lapply(reslist, function(x, p) tail(x, round(nrow(x) * p)), p = 0.2)
Regarding your comment, find the code below.
Please note that the code can be improved in terms of beauty and efficiency.
Further, I used a second answer because otherwise it would be too messy.
# calculation of centroids based on:
# https://stat.ethz.ch/pipermail/r-help/2006-May/105328.html
cl <- hclust(dist(mydata, diag = TRUE, upper = TRUE))
cent <- tapply(mydata,
               list(rep(cutree(cl, 5), ncol(mydata)), col(mydata)), mean)
dimnames(cent) <- list(NULL, dimnames(mydata)[[2]])
# add cluster number to the data and split by cluster
newdf <- data.frame(data=mydata, cluster=cutree(cl, k=5))
newdfl <- split(newdf, f=newdf$cluster)
# add centroids and drop cluster info
totaldf <- lapply(1:5,
                  function(i, li, cen) rbind(cen[i, ], li[[i]][ , -11]),
                  li=newdfl, cen=cent)
# calculate distances to the centroids and sort them
dist_to_cent <- lapply(totaldf, function(x)
  sort(as.matrix(dist(x, diag=TRUE, upper=TRUE))[1, ]))
dist_to_cent
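As a follow-up sketch (not part of the original answer), the sorted distances can then be used to pick a fixed share of the points that are farthest from their cluster centroid, mirroring the 50-out-of-1000 selection described in the question:
# take the 5% of each cluster that is farthest from its centroid
most_dissimilar <- lapply(dist_to_cent, function(d) {
  d <- d[-1]                        # drop the centroid itself (distance 0)
  tail(d, round(length(d) * 0.05))  # largest distances = most dissimilar
})
most_dissimilar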
For the calculation of centroids from hclust, see the R-Mailinglist.
