Creating multiple new resampled series based on observed data - r

I've got 7 years of temperature data split into 4 seasonal variables (Spring, Summer, Autumn, Winter), each of which looks like this (Spring example):
Day Month Year maxtp Season.Year Season
1 3 2008 13.6 2008 SP
2 3 2008 11.3 2008 SP
3 3 2008 5.4 2008 SP
I want to create multiple new temperature series based on these observed data, one at a time, in the following way (using a similar approach to this): Block sampling according to index in panel data
Using this code
newseries1 <- sample(Spring, size=91, replace = T, prob = NULL)
But this replicated the series 91 times, and isn't what I want.
I want to select an entire Spring block from any random season.year (2008-2014), then select a summer block from any year EXCEPT the year that was chosen previously, so any year other than 2008. The resampled year is then replaced so it can be resampled again the next time, just not consecutively.
I want to take a season.year from the spring variable, follow it with a different season.year for the summer variable, then another for autumn, and another for winter, and keep doing this until the resampled series is the same length as the observed one (7 years in this case).
So in summary I want to:
Select a 'block' respecting the annual sequence (Spring from a random season.year) and begin a new series with it, then replace it so it can be sampled again.
Follow Spring with summer from a non-consecutive year, and replace it.
Keep going until the resampled series is the same length as the observed
Repeat this process until there are 100 resampled series

For newseries1 try instead
ndays <- length(Spring[, 1])
#select rows of Spring randomly (are you sure you want replace = T?)
newseries1 <- Spring[sample(1:ndays, size = ndays, replace = T, prob = NULL),]
Then for selecting the year data for each season successively:
y.lst <- 2008:2014
nssn <- 7*100*4 #desired number of annual cycles times four seasons
y <- rep(NA, nssn) #initialise: vector of selected years
#first spring
y[1] <- sample(y.lst, 1)
#subsequent seasons
for(s in 2:nssn){
#selects a year from a sublist of years which excludes that of the previous season
y[s] <- sample(y.lst[y.lst != y[s - 1]], 1)
}
Then compile the data frame (assume original data is in data frame data):
#first Spring
Ssn <- data[with(data, Year == y[1] & Season == "SP"),]
ndays <- length(Spring[, 1])
newseries1 <- Ssn[sample(1:ndays, size = ndays, replace = T, prob = NULL),]
#initialise data frame
data2 <- Ssn
#subsequent seasons
for(s in 2:nssn){
Ssn <- data[with(data, Year == y[s] & Season == "..."),]
ndays <- length(Spring[, 1])
newseries1 <- Ssn[sample(1:ndays, size = ndays, replace = T, prob = NULL),]
data2 <- rbind(data2, Ssn)
}
You will need to create a vector of season labels to choose from and replace "..." with the label for season s. Use the %% remainder (modulo) operator to select the appropriate label in each case (e.g. s %% 4 == 2 implies "SU").
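The steps above could be combined roughly as follows. This is only a sketch under several assumptions: the labels "AU" and "WI" are guesses (only "SP" and "SU" appear in the text), blocks are matched on Season.Year rather than Year so that a winter spanning two calendar years stays in one block, and the toy data and the helper resample_series() are hypothetical stand-ins, not part of the original data.

```r
set.seed(42)
labels <- c("SP", "SU", "AU", "WI")   # "AU" and "WI" are assumed labels
years  <- 2008:2014

# Hypothetical observed data: 3 days per season to keep the example small
data <- expand.grid(Day = 1:3, Season = labels, Season.Year = years,
                    stringsAsFactors = FALSE)
data$maxtp <- round(rnorm(nrow(data), mean = 12, sd = 5), 1)

resample_series <- function(data, years, labels, n.years = length(years)) {
  nssn <- n.years * length(labels)
  # draw a season.year for each season, never repeating the previous one
  y <- integer(nssn)
  y[1] <- sample(years, 1)
  for (s in 2:nssn) y[s] <- sample(years[years != y[s - 1]], 1)
  # season label for position s: SP, SU, AU, WI, SP, ...
  ssn <- labels[(seq_len(nssn) - 1) %% length(labels) + 1]
  # collect whole seasonal blocks and bind them once at the end
  blocks <- lapply(seq_len(nssn), function(s)
    data[data$Season.Year == y[s] & data$Season == ssn[s], ])
  do.call(rbind, blocks)
}

# 100 resampled series, each as long as the observed record
series <- replicate(100, resample_series(data, years, labels), simplify = FALSE)
```

Collecting the blocks in a list and calling rbind once avoids growing a data frame inside the loop, which tends to get slow as the number of seasons increases.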

Related

How do I group data in a list by date and average the associated data values in R?

I want to group the data below by date (on a daily basis) and then get the mean of each group.
The dataset created below is a 3-dimensional array where i = time (in days), j = latitude and k = longitude. It is 4 years long (1461 days) and has the attribute 'Dates' to denote each of the days. I want to average the data in 'Data' so that I end up with one mean value for 1st of January, 2nd of January, etc.
#First create the example dataset
tmintest=array(1:100, c(420,189,1461))
#create the list
Variable <- list(varName="rr")
Data = tmintest
xyCoords <- list(x = seq(-40.37,64.37,length.out=420), y = seq(25.37,72.37,length.out=189))
Dates <- list(start = seq(as.Date("2012-01-01"), as.Date("2015-12-31"), by="days"), end=seq(as.Date("2012-01-01"), as.Date("2015-12-31"), by="days"))
All <- list(Variable = Variable,Data=aperm(Data), xyCoords=xyCoords,Dates=Dates)
#Make sure the dates are characters (as in the original dataset I'm working with)
All$Dates$start=as.character(All$Dates$start)
All$Dates$end=as.character(All$Dates$end)
I have looked at using aggregate:
aggregate(All$Data,by=list(All$Dates), FUN = "mean")
but I got the error:
Error in aggregate.data.frame(as.data.frame(x), ...) :
arguments must have same length
I tried to use group_by:
group_by(All$Dates)
but was returned this error:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "list"
What functions can I use to group the data by day and mean the newly created groups in a list in R?
EDIT:
I need the resulting output to be of the size 365 x 189 x 420, where 1:365 are days of the year and 189 x 420 are the latitude/longitude.
So, I want to use all the 1st of January entries in the All$Dates attribute to index/group the associated (All$Data) grids of size 189 x 420 (there will be four of them as it is four years of data) and then get the mean of these four grids/arrays. So, in this example, four January firsts will be averaged to produce a grid of size 189 x 420. This will be carried out for every day of the year, to produce the final 365 x 189 x 420 dataset. Does that clarify what I am trying to do?
This isn't fast, but it does produce the desired output, I think.
library(lubridate)
date <- glue::glue("{month(ymd(All$Dates$start))}-{mday(ymd(All$Dates$start))}")
undate <- unique(date)
out <- array(dim=c(length(undate), 189, 420))
for(i in 1:length(undate)){
w <- which(date == undate[i])
out[i,,] <- apply(All$Data[w,,, drop=FALSE], c(2,3), mean)
}
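The same grouping can be sketched in base R only, using format() to build a month-day key and colMeans() to average over the time dimension. The small array dat and the vector dates below are hypothetical stand-ins for All$Data and All$Dates$start. Note that because 2012 is a leap year, grouping the full example on month and day gives 366 groups rather than 365, with 29 February averaged over a single year.

```r
# Hypothetical stand-in: 4 days x 2 latitudes x 3 longitudes
dates <- c("2012-01-01", "2012-01-02", "2013-01-01", "2013-01-02")
dat   <- array(1:24, dim = c(4, 2, 3))

key  <- format(as.Date(dates), "%m-%d")   # month-day label for each time step
ukey <- unique(key)

out <- array(NA_real_, dim = c(length(ukey), dim(dat)[2], dim(dat)[3]))
for (i in seq_along(ukey)) {
  # colMeans() on a 3-d array averages over the first (time) dimension
  out[i, , ] <- colMeans(dat[key == ukey[i], , , drop = FALSE])
}
dimnames(out) <- list(ukey, NULL, NULL)
```

colMeans() is usually noticeably faster than apply(..., mean) on large arrays, which may matter for a 1461 x 189 x 420 grid.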

R: Subset/extract rows of a data frame in steps of 12

I have a data frame with data for each month of a 26 years period (1993 - 2019), which makes 312 rows in total.
Unfortunately, I had to lag the data, so each year goes now from July t to June t+1. So I can't just extract the year from the date.
Now, I want to extract the 12-month data for each year into a separate data frame. My first idea is to insert the year in the first column and use the lapply function to filter afterward.
For this, I created the following loop:
n <- 1
m <- 1993
for (a in 1:26) {
for (i in n:(n+11)) {
t.monthly.ret.lag[i,1] <- m
}
n <- n+1
m <- m+1
}
Unfortunately, R isn't naming the year in steps of 12. Instead, it is counting directly in steps of 1.
Does anyone know how to solve this or maybe know a better way of doing it?
y.first <- 1993
y.last <- 2018 # the last lagged year runs from July 2018 to June 2019, giving 26 years (312 rows)
month.col <- rep(c(7:12, 1:6), y.last-y.first+1)
year.col <- rep(c(y.first:y.last), each=length(month.name))
df <- data.frame(year=year.col, month=month.col)
This yields a data frame with month and year tagged for each row, which then lets you use dplyr::group_by() and so on.
You could just create a 312 element long vector giving the year (and one giving the month) using rep() and seq(). Then you can attach them as additional columns to your data.frame or just use them as reference for month and year.
month = rep(seq(1:12),27)
year = c(matrix(rep(seq(1:27),12),ncol=27,byrow=T)+1992)
month = month[7:(length(month)-6)]
year = year[7:(length(year)-6)]
The month vector counts from 1 to 12, beginning at 7 (July); the year vector repeats each calendar year 12 times (the first and last only 6 times).
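To get one data frame per lagged year, a label built with rep(..., each = 12) can be passed to split(). This is a minimal sketch: ret is a hypothetical stand-in for t.monthly.ret.lag, and it assumes the 312 rows are already sorted from July 1993 to June 2019. (In the loop from the question, the immediate cause of the problem appears to be n <- n+1, which advances the window by one row instead of twelve.)

```r
# Hypothetical stand-in for t.monthly.ret.lag: 312 monthly rows
ret <- data.frame(value = seq_len(312))

# label each block of 12 consecutive rows with the year in which it starts
ret$lag.year <- rep(1993:2018, each = 12)

# list of 26 data frames, one per July-to-June year
by.year <- split(ret, ret$lag.year)
```

Individual years can then be accessed by name, for example by.year[["1993"]].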

How to perform calculations on moving subsets of n elements of data frame without loop

I'm trying to calculate Effective Drought Index using R. One of many steps needed to do so is calculate a stored water quantity (EP):
EP365=P1/1+(P1+P2)/2+(P1+P2+P3)/3+(P1+P2+P3+P4)/4+ … +(P1+…+P365)/365
Where P1 is the precipitation on the most recent day, P2 is the precipitation two days ago and P365 is the precipitation 365 days ago. EP must be calculated for each 365-day period, starting with days 1 to 365, then 2 to 366, etc.
So I have a data frame with two columns, date and precip, and more than 20000 rows. A simple (and slow) solution is to calculate EP for every subset of 365 elements, from row 365 to nrow(df):
period_length <- 365
df$EP <- NA
for (i in (period_length:nrow(df))) {
first <- (i - period_length) + 1
SUB <- rev(df[first:i,]$prcp)
EP <- sum(cumsum(SUB)/seq_along(SUB))
df$EP[i] <- EP
}
Of course it works; however, the question is how to calculate EP without using a loop.
Use rollapplyr with the indicated function. Replace fill=NA with partial=TRUE if you want it to work with fewer than 365 days during the first 364 points or omit both if you want to drop the first 364 points.
library(zoo)
x <- 1:1000 # sample data
# rev() puts the most recent day first, matching P1 in the question's loop
ep <- rollapplyr(x, 365, function(x) sum(cumsum(rev(x)) / seq_along(x)), fill = NA)
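Since EP is linear in the precipitation values, it can also be written as a fixed weighted sum: P_j appears in every term from k = j to n, so its weight is w_j = 1/j + 1/(j+1) + ... + 1/n. Under that rearrangement, a one-sided convolution with stats::filter() from base R gives the same numbers without a rolling function call. The sketch below uses hypothetical random data and a short window (n = 10 instead of 365) and checks the result against the loop from the question.

```r
set.seed(1)
prcp <- runif(50)   # hypothetical daily precipitation
n    <- 10          # window length (365 in the question)

# w[j] = sum over k = j..n of 1/k; w[1] is the weight of the most recent day
w <- rev(cumsum(rev(1 / seq_len(n))))

# sides = 1: y[i] = w[1]*x[i] + w[2]*x[i-1] + ... + w[n]*x[i-n+1]
ep_filter <- as.numeric(stats::filter(prcp, w, method = "convolution", sides = 1))

# reference: the explicit loop from the question
ep_loop <- rep(NA_real_, length(prcp))
for (i in n:length(prcp)) {
  sub <- rev(prcp[(i - n + 1):i])
  ep_loop[i] <- sum(cumsum(sub) / seq_along(sub))
}
```

The first n - 1 entries are NA in both versions, corresponding to fill = NA in the rollapplyr call.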

Mean Returns in Time Series - Restarting after NA values - rstudio

Has anyone encountered calculating historical mean log returns in time series datasets?
The dataset is ordered by individual security first and by time for each respective security. I am trying to form a historical mean log return, i.e. the mean log return for the security from its first appearance in the dataset to date, for each point in time for each security.
Luckily, the return time series contains NAs between returns for differing securities. My idea is to calculate a historical mean that restarts after each NA that appears.
A simple cumsum() probably will not do it, as the NAs will have to be dropped.
I thought about using rollmean(), if I only knew an efficient way to specify the 'width' parameter to the length of the vector of consecutive preceding non-NAs.
The current approach I am taking, based on Count how many consecutive values are true, takes significantly too much time, given the size of the data set I am working with.
For any x of the form x : [r(1) r(2) ... r(N)], where r(2) is the log return in period 2:
df <- data.frame(x, zcount = NA)
df[1,2] = 0 #df$x[1]=NA by construction of the data set
for(i in 2:nrow(df))
df$zcount[i] <- ifelse(!is.na(df$x[i]), df$zcount[i-1]+1, 0)
Any idea how to speed this up would be highly appreciated!
You will need to reshape the data.frame to apply the cumsum function
over each security. Here's how:
First, I'll generate some data on 100 securities over 100 months which I think corresponds to your description of the data set
securities <- 100
months <- 100
time <- seq.Date(as.Date("2010/1/1"), by = "months", length.out = months)
ID <- rep(paste0("sec", 1:securities), each = months) # one block of consecutive months per security
returns <- rnorm(securities * months, mean = 0.08, sd = 2)
df <- data.frame(time, ID, returns)
head(df)
time ID returns
1 2010-01-01 sec1 -3.0114466
2 2010-02-01 sec1 -1.7566112
3 2010-03-01 sec1 1.6615731
4 2010-04-01 sec1 0.9692533
5 2010-05-01 sec1 1.3075774
6 2010-06-01 sec1 0.6323768
Now, you must reshape your data so that each security column contains its
returns, and each row represents the date.
library(tidyr)
df_wide <- spread(df, ID, returns)
Once this is done, you can use the apply function to sum every column which now represents each security. Or use the cumsum function. Notice the data object df_wide[-1], which drops the time column. This is necessary to avoid the sum or cumsum functions throwing an error.
matrix_sum <- apply(df_wide[-1], 2, FUN = sum)
matrix_cumsum <- apply(df_wide[-1], 2, FUN = cumsum)
Now, add the time column back as a data.frame if you like:
df_final <- data.frame(time = df_wide[,1], matrix_cumsum)
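If the data stay in the long layout described in the question (one return vector with NAs separating securities), the restart-after-NA logic can also be sketched without reshaping, using cumsum(is.na(x)) as a run identifier and ave() for the within-run cumulative sums. The vector x below is hypothetical; zcount reproduces the counter from the question's loop, and hist.mean is the running mean since the last NA.

```r
# Hypothetical log returns; each NA marks the start of a new security
x <- c(NA, 0.1, 0.3, NA, 0.2, 0.4, 0.6)

grp <- cumsum(is.na(x))          # run id, increases by one at every NA
x0  <- ifelse(is.na(x), 0, x)    # treat NA as 0 so cumsum is not poisoned

# number of consecutive non-NA values so far within each run
zcount <- ave(as.numeric(!is.na(x)), grp, FUN = cumsum)

# running mean since the last NA; NA positions stay NA
hist.mean <- ave(x0, grp, FUN = cumsum) / zcount
hist.mean[is.na(x)] <- NA
```

Both ave() calls are vectorised over the whole column, so this should scale considerably better than the row-by-row loop.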

R repeat rows by vector and date

I have a data frame with 275 different stations, 43 years of seasonal data (October to the following April; May to September is not needed) and 6 variables. Here is a small example of the data frame with only one variable, called value:
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
What I need is to fill the gaps between observation days with daily dates (e.g. 1:8) and set the value of each new row to the observed value divided by the number of days in the gap, so it would look like:
data1 <- data.frame(station=rep(1,40), year=rep(1969,40), month=c(rep(10,31),rep(11,9)),day=c(1:31,1:9),value=rep(c(1/7,2/8,3/8,4/8,5/8,6/8),c(7,8,8,8,8,1)))
I wrote some poor code and searched around the site, but unfortunately it didn't work out. Any help or better ideas would be appreciated.
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
for (i in 1:length(station.date)){
days <- as.numeric(station.date[i+1]-station.date[i]) #not working
data <- within(data, days <- c(days,1))
}
rows <- rep(1:nrow(data), times=data[ ,data$days])
rows <- ifelse(rows > 10, 0, rows) #get rid of month May to Sept
data <- data[rows, ]
data <- within(data, value1 <- value/days)
data <- within(data, dd <- ?) #don't know to change the repeated days to real days
I wrote some code that reproduces your example, but you will probably have to modify it to handle the whole data set. I wasn't sure what to do with the last observation, so I made a special case for it. If it should be divided by a different number, just replace the 8 in values <- c(values, tail(data$value, 1) / 8) with that number. Moreover, if you have all 275 stations in one data.frame, I think the best approach would be to split it by station, transform each piece separately and then rbind the results.
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
d <- as.numeric(diff(station.date))
range <- sum(d) + 1
# create dates
dates <- seq(station.date[1], by = "day", length = range)
# create values
values <- unlist(sapply(1:length(d), function(i){
rep(data$value[i] / d[i] , d[i])
}))
# adding last observation
values <- c(values, tail(data$value, 1) / 8)
# create new data frame
data2 <- data.frame(station = rep(1, range),
year = as.numeric(format(dates, "%Y")),
month = as.numeric(format(dates, "%m")),
day = as.numeric(format(dates, "%d")),
value = values)
It could probably be optimised in some way, however I hope it helps too. Note how I extract year, month and day from dates.
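The split, transform and recombine idea for several stations might look like the sketch below. The helper expand_station() and its last.div argument are hypothetical names wrapping the same logic as the code above, and the two-station example data are made up. One caveat: a station covering more than one season would also have its May to September gap filled, so each season may need to be split off as well.

```r
# Expand one station's irregular observations to daily rows.
# last.div (divisor for the final observation) is an assumption, as above.
expand_station <- function(d, last.div = 8) {
  obs.date <- as.Date(with(d, paste(year, month, day, sep = "-")))
  gap      <- as.numeric(diff(obs.date))   # days until the next observation
  dates    <- seq(obs.date[1], by = "day", length.out = sum(gap) + 1)
  values   <- c(rep(head(d$value, -1) / gap, times = gap),
                tail(d$value, 1) / last.div)
  data.frame(station = d$station[1],
             year    = as.numeric(format(dates, "%Y")),
             month   = as.numeric(format(dates, "%m")),
             day     = as.numeric(format(dates, "%d")),
             value   = values)
}

# Hypothetical data: two stations with identical observation days
data <- data.frame(station = rep(1:2, each = 6), year = 1969,
                   month = rep(c(10, 10, 10, 10, 11, 11), 2),
                   day   = rep(c(1, 8, 16, 24, 1, 9), 2),
                   value = rep(1:6, 2))

daily <- do.call(rbind, lapply(split(data, data$station), expand_station))
```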
