How do I group data in a list by date and average the associated data values in R? - r

I want to group the data below by date (on a daily basis) and then get the mean of each group.
The dataset created below is a 3-dimensional array where i = time (in days), j = latitude and k = longitude. It is 4 years long (1461 days) and has the attribute 'Dates' to denote each of the days. I want to average the data in 'Data' so that I end up with one mean value for the 1st of January, the 2nd of January, etc.
#First create the example dataset
tmintest=array(1:100, c(420,189,1461))
#create the list
Variable <- list(varName="rr")
Data = tmintest
xyCoords <- list(x = seq(-40.37,64.37,length.out=420), y = seq(25.37,72.37,length.out=189))
Dates <- list(start = seq(as.Date("2012-01-01"), as.Date("2015-12-31"), by="days"), end=seq(as.Date("2012-01-01"), as.Date("2015-12-31"), by="days"))
All <- list(Variable = Variable,Data=aperm(Data), xyCoords=xyCoords,Dates=Dates)
#Make sure the dates are characters (as in the original dataset I'm working with)
All$Dates$start=as.character(All$Dates$start)
All$Dates$end=as.character(All$Dates$end)
I have looked at using aggregate:
aggregate(All$Data,by=list(All$Dates), FUN = "mean")
but I got the error:
Error in aggregate.data.frame(as.data.frame(x), ...) :
arguments must have same length
I tried to use group_by:
group_by(All$Dates)
but was returned this error:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "list"
What functions can I use to group the data by day and average the newly created groups in a list in R?
EDIT:
I need the resulting output to be of the size 365 x 189 x 420, where 1:365 are days of the year and 189 x 420 are the latitude/longitude.
So, I want to use all the 1st of January's in the All$Dates attribute to index/group the associated (All$Data) grids of size 189 x 420 (there will be four of them as it is four years of data) and then get the mean of these four grids/arrays. So, in this example, four January firsts, will be averaged to produce a grid of size 189 x 420. This will be carried out for every day of the year, to produce the final 365 x 189 x 420 dataset. Does that clarify what I am trying to do?

This isn't fast, but it does produce the desired output, I think.
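library(lubridate)
# Month-day key for every date, e.g. "1-1" for 1 January
date <- glue::glue("{month(ymd(All$Dates$start))}-{mday(ymd(All$Dates$start))}")
# 366 unique keys here, because 2012 contains 29 February
undate <- unique(date)
# One 189 x 420 slice per unique month-day
out <- array(dim=c(length(undate), 189, 420))
for(i in seq_along(undate)){
  # Positions of every year's occurrence of this month-day
  w <- which(date == undate[i])
  out[i,,] <- apply(All$Data[w,,, drop=FALSE], c(2,3), mean)
}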
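For larger grids, the per-day loop can be replaced by a grouped row sum. The following is a minimal sketch rather than the answerer's method: it uses a small random stand-in array with the same time x latitude x longitude layout as All$Data (the names dat, md, n_lat and n_lon are made up for illustration) and base R's rowsum() to add up all slices that share a month-day key. Because 2012 is a leap year, the result has 366 rows rather than 365.

```r
# Small stand-in for All$Data: time x latitude x longitude (sizes are assumptions)
n_lat <- 3
n_lon <- 4
dates <- seq(as.Date("2012-01-01"), as.Date("2015-12-31"), by = "days")
set.seed(1)
dat <- array(rnorm(length(dates) * n_lat * n_lon),
             c(length(dates), n_lat, n_lon))

# Month-day key such as "01-01"; the year is deliberately dropped
md <- format(dates, "%m-%d")

# Flatten each daily grid into one row: time x (lat * lon)
mat <- matrix(dat, nrow = dim(dat)[1])

# Sum rows that share a key, then divide by how often each key occurs
sums   <- rowsum(mat, md)
counts <- as.vector(table(md)[rownames(sums)])
means  <- sums / counts

# Restore the grid shape: month-day x latitude x longitude
out <- array(means, c(nrow(sums), n_lat, n_lon),
             dimnames = list(rownames(sums), NULL, NULL))
```

With the real data, dat would be All$Data and dates would be as.Date(All$Dates$start). Dropping the "02-29" row would give the 365 x 189 x 420 shape mentioned in the question.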

Related

How to perform calculations on moving subsets of n elements of data frame without loop

I'm trying to calculate the Effective Drought Index using R. One of the many steps needed to do so is to calculate a stored water quantity (EP):
EP365=P1/1+(P1+P2)/2+(P1+P2+P3)/3+(P1+P2+P3+P4)/4+ … +(P1+…+P365)/365
Where P1 is the precipitation on the most recent day, P2 is the precipitation two days ago and P365 is the precipitation 365 days ago. EP must be calculated for each 365-day period, starting with days 1 to 365, then 2 to 366, etc.
So I have a dataframe with two columns, date and prcp, and more than 20000 rows. A simple (and slow) solution is to calculate EP for every subset of 365 elements, from row 365 to nrow(df):
period_length <- 365
df$EP <- NA
for (i in (period_length:nrow(df))) {
first <- (i - period_length) + 1
SUB <- rev(df[first:i,]$prcp)
EP <- sum(cumsum(SUB)/seq_along(SUB))
df$EP[i] <- EP
}
Of course it works; however, the question is how to calculate EP without using a loop.
Use rollapplyr with the indicated function. Replace fill=NA with partial=TRUE if you want it to work with fewer than 365 days during the first 364 points or omit both if you want to drop the first 364 points.
library(zoo)
x <- 1:1000 # sample data
# rev() puts the most recent day first, so it plays the role of P1 as in the question's loop
ep <- rollapplyr(x, 365, function(x) sum(cumsum(rev(x)) / seq_along(x)), fill = NA)
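Because EP is linear in the daily values, it can also be written as one weighted sum, which avoids calling a function on every window. Collecting the coefficient of each P_m in the formula from the question gives EP = w_1*P1 + w_2*P2 + ... + w_365*P365, where w_m = 1/m + 1/(m+1) + ... + 1/365. The sketch below is one possible way to use this, not part of the original answer; the helper names ep_weights and ep_filter are hypothetical. It relies on stats::filter(), which with sides = 1 multiplies the first coefficient by the current value and later coefficients by progressively older values.

```r
# Weight of the precipitation m days back: w_m = sum over n = m..k of 1/n
ep_weights <- function(k) rev(cumsum(rev(1 / seq_len(k))))

# One-sided moving weighted sum; the first k - 1 results are NA
ep_filter <- function(x, k) {
  as.numeric(stats::filter(x, ep_weights(k), sides = 1))
}

# Check against the explicit loop from the question on a short window
set.seed(42)
x <- runif(50)
k <- 7
ep_loop <- rep(NA_real_, length(x))
for (i in k:length(x)) {
  sub <- rev(x[(i - k + 1):i])           # most recent day first
  ep_loop[i] <- sum(cumsum(sub) / seq_along(sub))
}
ep_fast <- ep_filter(x, k)
```

For the 365-day case this would be called as ep_filter(df$prcp, 365). The stats:: prefix is there because dplyr, if loaded, masks filter().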

Mean Returns in Time Series - Restarting after NA values - rstudio

Has anyone encountered calculating historical mean log returns in time series datasets?
The dataset is ordered by individual security first and by time for each respective security. I am trying to form a historical mean log return, i.e. the mean log return for the security from its first appearance in the dataset to date, for each point in time for each security.
Luckily, the return time series contains NAs between returns for differing securities. My idea is to calculate a historical mean that restarts after each NA that appears.
A simple cumsum() probably will not do it, as the NAs will have to be dropped.
I thought about using rollmean(), if I only knew an efficient way to specify the 'width' parameter to the length of the vector of consecutive preceding non-NAs.
The current approach I am taking, based on Count how many consecutive values are true, takes significantly too much time, given the size of the data set I am working with.
For any x of the form x : [r(1) r(2) ... r(N)], where r(2) is the log return in period 2:
df <- data.frame(x, zcount = NA)
df[1,2] = 0 #df$x[1]=NA by construction of the data set
for(i in 2:nrow(df))
df$zcount[i] <- ifelse(!is.na(df$x[i]), df$zcount[i-1]+1, 0)
Any idea how to speed this up would be highly appreciated!
You will need to reshape the data.frame to apply the cumsum function
over each security. Here's how:
First, I'll generate some data on 100 securities over 100 months which I think corresponds to your description of the data set
securities <- 100
months <- 100
time <- seq.Date(as.Date("2010/1/1"), by = "months", length.out = months)
ID <- rep(paste0("sec", 1:securities), each = months)
returns <- rnorm(securities * months, mean = 0.08, sd = 2)
df <- data.frame(time, ID, returns)
head(df)
time ID returns
1 2010-01-01 sec1 -3.0114466
2 2010-02-01 sec1 -1.7566112
3 2010-03-01 sec1 1.6615731
4 2010-04-01 sec1 0.9692533
5 2010-05-01 sec1 1.3075774
6 2010-06-01 sec1 0.6323768
Now, you must reshape your data so that each security column contains its
returns, and each row represents the date.
library(tidyr)
df_wide <- spread(df, ID, returns)
Once this is done, you can use the apply function to sum every column which now represents each security. Or use the cumsum function. Notice the data object df_wide[-1], which drops the time column. This is necessary to avoid the sum or cumsum functions throwing an error.
matrix_sum <- apply(df_wide[-1], 2, FUN = sum)
matrix_cumsum <- apply(df_wide[-1], 2, FUN = cumsum)
Now, add the time column back as a data.frame if you like:
df_final <- data.frame(time = df_wide[,1], matrix_cumsum)
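The code above yields cumulative sums per security. If the goal is the historical mean that restarts after each NA, as described in the question, one possible vectorised sketch in base R is shown below. It works directly on the long return vector without reshaping; the sample vector x and the names run_id and hist_mean are made up for illustration. Each NA opens a new run, and within a run the cumulative sum is divided by the number of returns seen so far.

```r
# Sample log returns; each NA marks the start of a new security (illustrative data)
x <- c(NA, 0.1, 0.3, -0.2, NA, 0.5, 0.1)

# Every NA increments the run counter, so each security gets its own id
run_id <- cumsum(is.na(x))

# Within each run: cumulative sum of the non-NA returns divided by their count
hist_mean <- ave(x, run_id, FUN = function(v) {
  ok <- !is.na(v)
  out <- rep(NA_real_, length(v))
  out[ok] <- cumsum(v[ok]) / seq_len(sum(ok))
  out
})
```

Here hist_mean is NA at the separator positions, and the mean starts again from the first return of each security.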

Create date index and add to data frame in R

Currently transitioning from Python to R. In Python, you can create a date range with pandas and add it to a data frame like so;
data = pd.read_csv('Data')
dates = pd.date_range('2006-01-01 00:00', periods=2920, freq='3H')
df = pd.DataFrame({'data' : data}, index = dates)
How can I do this in R?
Further, if I want to compare 2 datasets with different lengths but same time span, you can resample the dataset with lower frequency so it can be the same length as the higher frequency by placing 'NaNs' in the holes like so:
df2 = pd.read_csv('data2') #3 hour resolution = 2920 points of data
data2 = df2.resample('30Min').asfreq() #30 Min resolution = 17520 points
I guess I'm basically looking for a Pandas package equivalent for R. How can I code these in R?
The following is a way of converting your time-series data from one time interval (3 hours) to another (30 minutes):
Get the data:
# tz = "UTC" is an assumption made here to avoid daylight-saving surprises; use your own time zone if needed
starter_df <- data.frame(dates=seq(from=as.POSIXct("2006-01-01 00:00", tz="UTC"),
length.out = 2920,
by="3 hours"),
data = rnorm(2920))
Get the full sequence in 30 minute intervals and replace the NA's with the values from the starter_df data.frame:
full_data <- data.frame(dates=seq(from=min(starter_df$dates),
to=max(starter_df$dates), by="30 min"),
data=rep(NA,NROW(seq(from=min(starter_df$dates),
to=max(starter_df$dates), by="30 min"))))
full_data[full_data$dates %in% starter_df$dates,] <- starter_df[starter_df$dates %in% full_data$dates,]
I hope it helps.
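The same upsampling can be sketched with a left join, which may be easier to read than assigning into matching rows. This is an alternative under stated assumptions, not the answerer's code: it uses a short 8-point series, and the name grid is made up for illustration. merge() with all.x = TRUE keeps every 30-minute time stamp and leaves NA wherever the 3-hourly series has no value.

```r
# Short 3-hourly series standing in for the real data (8 points, UTC assumed)
set.seed(1)
starter_df <- data.frame(
  dates = seq(as.POSIXct("2006-01-01 00:00", tz = "UTC"),
              length.out = 8, by = "3 hours"),
  data  = rnorm(8)
)

# Full 30-minute grid covering the same time span
grid <- data.frame(
  dates = seq(min(starter_df$dates), max(starter_df$dates), by = "30 min")
)

# Left join: every grid time stamp is kept, unmatched ones get NA in `data`
full_data <- merge(grid, starter_df, by = "dates", all.x = TRUE)
```

This plays roughly the role of resample('30Min').asfreq() in the pandas snippet from the question, with NA in place of NaN.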

How do I subset every day except the last five days of zoo data?

I am trying to extract all dates except for the last five days from a zoo dataset into a single object.
This question is somewhat related to How do I subset the last week for every month of a zoo object in R?
You can reproduce the dataset with this code:
library(zoo)
set.seed(123)
price <- rnorm(365)
dates <- seq(as.Date("2013-01-01"), by = "day", length.out = 365)
zoodata <- zoo(price, dates)
For my output, I'm hoping to get a combined dataset of everything except the last five days of each month. For example, if there are 20 days in the first month's data and 19 days in the second month's, I only want to subset the first 15 and 14 days of data respectively.
I tried using the head() function and the first() function to extract the first three weeks, but since the number of days varies by month (and with leap years), it's not ideal.
Thank you.
Here are a few approaches:
1) as.Date Let tt be the dates. Then we compute a Date vector the same length as tt which has the corresponding last date of the month. We then pick out those dates which are at least 5 days away from that:
tt <- time(zoodata)
last.date.of.month <- as.Date(as.yearmon(tt), frac = 1)
zoodata[ last.date.of.month - tt >= 5 ]
2) tapply/head For each month tapply head(x, -5) to the data and then concatenate the reduced months back together:
do.call("c", tapply(zoodata, as.yearmon(time(zoodata)), head, -5))
3) ave Define revseq which given a vector or zoo object returns sequence numbers in reverse order so that the last element corresponds to 1. Then use ave to create a vector ix the same length as zoodata which assigns such reverse sequence numbers to the days of each month. Thus the ix value for the last day of the month will be 1, for the second last day 2, etc. Finally subset zoodata to those elements corresponding to sequence numbers greater than 5:
revseq <- function(x) rev(seq_along(x))
ix <- ave(seq_along(zoodata), as.yearmon(time(zoodata)), FUN = revseq)
z <- zoodata[ ix > 5 ]
ADDED Solutions (1) and (2).
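The reverse-sequence idea from approach (3) does not depend on zoo. As a rough illustration in base R only, the sketch below applies the same logic to a plain data frame; the names ym, rev_rank, keep and trimmed are made up for this example. Each day gets its position counted from the end of its month, and only positions greater than 5 are kept.

```r
# Same sample data as in the question, held as plain vectors
set.seed(123)
price <- rnorm(365)
dates <- seq(as.Date("2013-01-01"), by = "day", length.out = 365)

# Year-month key, e.g. "2013-01"
ym <- format(dates, "%Y-%m")

# Position of each day counted from the end of its month (last day = 1)
rev_rank <- ave(seq_along(dates), ym, FUN = function(i) rev(seq_along(i)))

# Keep everything except the last five days of each month
keep <- rev_rank > 5
trimmed <- data.frame(date = dates[keep], price = price[keep])
```

For a full calendar year this removes 5 days from each of the 12 months, so trimmed has 305 rows.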
Exactly the same way as in the answer to your other question:
Split dataset by month, remove last 5 days, just add a "-":
library(xts)
xts.data <- as.xts(zoodata)
lapply(split(xts.data, "months"), last, "-5 days")
And the same way, if you want it on one single object:
do.call(rbind, lapply(split(xts.data, "months"), last, "-5 days"))

R repeat rows by vector and date

I have a data frame with 275 different stations, 43 years of seasonal data (October to the following April; May to September are not needed) and 6 variables. Here is a small example of the data frame with only one variable, called value:
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
What I need is to fill the gaps between observation days with daily dates (e.g. 1:8), with each daily value equal to the observed value divided by the number of days in the gap. It would look like:
data1 <- data.frame(station=rep(1,40), year=rep(1969,40), month=c(rep(10,31),rep(11,9)),day=c(1:31,1:9),value=rep(c(1/7,2/8,3/8,4/8,5/8,6/8),c(7,8,8,8,8,1)))
I wrote some poor code and searched around the site, but unfortunately couldn't get it to work. Any help or better ideas would be appreciated.
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
for (i in 1:length(station.date)){
days <- as.numeric(station.date[i+1]-station.date[i]) #not working
data <- within(data, days <- c(days,1))
}
rows <- rep(1:nrow(data), times=data[ ,data$days])
rows <- ifelse(rows > 10, 0, rows) #get rid of month May to Sept
data <- data[rows, ]
data <- within(data, value1 <- value/days)
data <- within(data, dd <- ?) #don't know to change the repeated days to real days
I wrote some code that reproduces your example, but you will probably have to modify it to handle the whole data set. I wasn't sure what to do with the last observation, so I made a special case for it. If it should be divided by a different number, you just need to replace 8 inside values <- c(values, tail(data$value, 1) / 8)
with that number. Moreover, if you have all 275 stations in one data.frame, I think the best approach would be to split it by station, transform each piece separately and then rbind the results.
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
d <- as.numeric(diff(station.date))
range <- sum(d) + 1
# create dates
dates <- seq(station.date[1], by = "day", length.out = range)
# create values
values <- unlist(sapply(1:length(d), function(i){
rep(data$value[i] / d[i] , d[i])
}))
# adding last observation
values <- c(values, tail(data$value, 1) / 8)
# create new data frame
data2 <- data.frame(station = rep(1, range),
year = as.numeric(format(dates, "%Y")),
month = as.numeric(format(dates, "%m")),
day = as.numeric(format(dates, "%d")),
value = values)
It could probably be optimised in some way, but I hope it helps. Note how I extract year, month and day from dates.
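The split-transform-combine step for several stations could look roughly like the sketch below. It wraps the logic above in a helper; the function name expand_station and its last_divisor argument are hypothetical, and the two-station data frame is made up for illustration. It assumes each piece holds one continuous season, so with 43 years of data the split would need to be by station and season rather than by station alone.

```r
# Expand one station's observations to daily rows (assumes a single season)
expand_station <- function(df, last_divisor = 8) {
  obs_dates <- as.Date(with(df, paste(year, month, day, sep = "-")))
  gaps <- as.numeric(diff(obs_dates))      # days until the next observation
  days <- seq(obs_dates[1], by = "day", length.out = sum(gaps) + 1)
  # Spread each value evenly over its gap; the last one is a special case
  values <- c(rep(head(df$value, -1) / gaps, times = gaps),
              tail(df$value, 1) / last_divisor)
  data.frame(station = df$station[1],
             year    = as.numeric(format(days, "%Y")),
             month   = as.numeric(format(days, "%m")),
             day     = as.numeric(format(days, "%d")),
             value   = values)
}

# Two made-up stations with the same observation pattern as the example
data <- data.frame(station = rep(1:2, each = 6),
                   year    = 1969,
                   month   = rep(c(10, 10, 10, 10, 11, 11), 2),
                   day     = rep(c(1, 8, 16, 24, 1, 9), 2),
                   value   = rep(1:6, 2))

# Split by station, expand each piece, then stack the pieces with rbind
expanded <- do.call(rbind, lapply(split(data, data$station), expand_station))
```

Stations are stacked row-wise here, which keeps one observation per row and the same columns as the input.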
