Filter rows in a list of dataframes based on date - R

I am currently working with a list of dataframes.
Actually, I have about a hundred csv files representing forecasts of some kind, where the date on which the forecast was made is in the first line, and the lines thereafter contain the predicted values. The data might look like this:
2010/04/15 10:12:51 #Date of the forecast
2010/05/02 2372 #Date for which the forecast was made and the value assigned
2010/05/09 2298
2009/04/15 10:09:13 #another forecast
....
2010/05/02 2298 #also predicts for 2010/05/02
As you might guess, the forecasts predict values quite some time ahead (e.g. 5 years), which means predictions for the date 2010/05/02 were made not only on 2010/04/15 but also on 2009/04/15 and so on (actually, forecasts are made weekly).
I would like to compare how the predicted value for a specified date (for example 2010/05/02) has changed over time.
Right now, I read in each .csv file I have as a dataframe and save the resulting dataframes in a list.
(Sadly, the date on which the prediction was made got lost. I hoped to be able to name the list elements with the respective date but have not yet figured out how to do this; still, I am pretty sure I'll find something somewhere, so that is not the main problem here.)
That's where the question title comes in: I would like to know how to filter a list of dataframes by row value.
So, I'd like to be able to use a function, e.g. function(2010/05/02), and get as a result the rows of each element of the list (each dataframe in the list) where Date is 2010/05/02.
In this case I'd like to get:
2010/05/02 2372
2010/05/02 2298
I know how to do this using a for loop, but it takes far too long.
I am happy for any suggestions.
(This example might show why it is important to know when the prediction was made, which I currently do not. I was thinking about adding a new column to each dataframe containing the date on which the prediction was made.)
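For what it's worth, a minimal sketch of that idea (the folder name and column layout are assumptions; it presumes the forecast date sits at the start of the first line of each file):

files <- list.files("forecasts", pattern = "\\.csv$", full.names = TRUE)  # hypothetical folder
readForecast <- function(f) {
  # as.POSIXct/strptime ignores any trailing text after the matched format
  made <- as.POSIXct(readLines(f, n = 1), format = "%Y/%m/%d %H:%M:%S")
  df <- read.table(f, skip = 1, col.names = c("Date", "value"))
  df$Date <- as.Date(df$Date, format = "%Y/%m/%d")
  df$made <- as.Date(made)  # tag every row with the date the forecast was made
  df
}
l <- lapply(files, readForecast)
names(l) <- sapply(l, function(d) as.character(d$made[1]))  # name elements by forecast date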
Threads visited until now include:
get column from list of dataframes R
convert a row of a data frame to a simple vector in R
How to get the name of a data.frame within a list? (which more or less addresses the naming problem)
As you can see, no thread was particularly helpful.
As requested, a small reproducible example:
dateList <- as.Date(seq(0,100,5),origin="2010-01-01")
forecasts <- seq(2000,3000,50)
df1 <- data.frame(dateList,forecasts)
df2 <- data.frame(dateList-50,forecasts)
l <- list(df1,df2)
We have dates from 2010-01-01 onwards in 5-day steps. I would, for example, like to know the predicted values for 2010-01-01 in both dataframes.
The first dataframe looks like this:
dateList forecasts
1 2010-01-01 2000
2 2010-01-06 2050
3 2010-01-11 2100
while the second looks like this:
10 2009-12-27 2450
11 2010-01-01 2500
12 2010-01-06 2550
I was hoping to find out for example the predicted values for 2010-01-01.
So, for example:
function(2010-01-01):
2000
2500

Couldn't wait for your example so I made a small one. Let me know if this is in the general direction of what you're after.
xy <- list(df1 = data.frame(dates = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03")), value = runif(3)),
           df2 = data.frame(dates = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03")), value = runif(3)),
           df3 = data.frame(dates = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03")), value = runif(3)))
getValueOnDate <- function(x, list.all) {
  lapply(list.all, FUN = function(m) m[m$dates %in% x, ])
}
out <- getValueOnDate(as.Date("2016-01-02"), list.all = xy)
do.call("rbind", out)
dates value
df1 2016-01-02 0.7665590
df2 2016-01-02 0.9907976
df3 2016-01-02 0.4909025
You can obviously modify the function to return just the values.
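For instance, a small variation that returns only the values (a sketch along the lines of the function above):

getValueOnDate2 <- function(x, list.all) {
  sapply(list.all, FUN = function(m) m$value[m$dates %in% x])
}
getValueOnDate2(as.Date("2016-01-02"), list.all = xy)
#       df1       df2       df3
# 0.7665590 0.9907976 0.4909025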

You could alternatively use the following approach, given that your list is called ls and the date column is named date in all data.frames:
my.ls <- lapply(ls, subset, date == "2010/05/02")
df <- do.call("rbind", my.ls)

Related

Define different timeseries for different columns

I have a dataframe where some of the columns start later than the others. Please find a reproducible example below:
set.seed(354)
df <- data.frame(Product_Id = rep(1:100, each = 50),
                 Date = seq(from = as.Date("2014/1/1"),
                            to = as.Date("2018/2/1"),
                            by = "month"),
                 Sales = rnorm(100, mean = 50, sd = 20))
df <- df[-c(251:256, 301:312, 2551:2562, 2651:2662, 2751:2762), ]
library(zoo)
z <- read.zoo(df, index = "Date", split = "Product_Id", FUN = as.yearmon)
tt <- as.ts(z)
Now, for columns 6, 7, 52, 54 and 56 of this dataframe, I want to define them as time series starting from a different date than the rest of the dataframe. Suppose the data begins in Jan 2000: column 6 will begin in July 2000, column 7 in Jan 2001, and so on. How should I proceed?
Later, I want to perform a forecast on this dataset. Any inputs on this? Should I consider each column as a separate dataframe and do the forecasting, or can I convert each column to a different time series object that starts from the first non-NA value?
Now, for columns 6, 7, 52, 54 and 56 of this dataframe, I want to define them as time series starting from a different date than the rest of the dataframe. Suppose the data begins in Jan 2000: column 6 will begin in July 2000, column 7 in Jan 2001, and so on. How should I proceed?
There is, AFAIK, no way to do this in R within a single time series matrix. And if each column started at a different date, then (since each column has the same number of entries) each column would also need to end at a different date. Is this really what you need? A collection of time series that all happen to be the same length (so they fit into a matrix), but that start and end with offsets? I struggle to see where something like this would be useful, outside a kind of forecasting competition.
If you really need this, then I would recommend you put your time series into a list structure. Then each one can start and end at any date, and they can have the same or different lengths. Take inspiration from Mcomp::M3.
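A minimal sketch of such a list (the series and start dates here are made up):

# each element is its own ts, with its own start date and length
series.list <- list(
  ts(rnorm(24), start = c(2000, 1), frequency = 12),  # Jan 2000
  ts(rnorm(18), start = c(2000, 7), frequency = 12),  # Jul 2000
  ts(rnorm(12), start = c(2001, 1), frequency = 12)   # Jan 2001
)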
Later, I want to perform a forecast on this dataset. Any inputs on this? Should I consider each column as a separate dataframe and do the forecasting, or can I convert each column to a different time series object that starts from the first non-NA value?
Since your tt is already a time series object, the simplest way would be simply to iterate over its columns, storing each forecast in the corresponding column of fcst (note that ets() comes from the forecast package):
library(forecast)
fcst <- matrix(nrow = 10, ncol = ncol(tt))
for (ii in 1:ncol(tt)) fcst[, ii] <- forecast(ets(tt[, ii]), 10)$mean
Note that most modeling functions in forecast will throw a warning and do something reasonable on encountering NA values. Here, e.g.:
1: In ets(tt[, ii]) :
Missing values encountered. Using longest contiguous portion of time series
Of course, you could do something yourself inside the loop, e.g., search for the last NA and start the time series for modeling right after that (but make sure you fail gracefully if the last entry is NA).
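A hedged sketch of that idea (the helper name is mine, not from the forecast package):

library(forecast)
fitAfterLastNA <- function(x, h = 10) {
  na.idx <- which(is.na(x))
  if (length(na.idx) == 0) return(forecast(ets(x), h))
  if (max(na.idx) == length(x)) stop("last entry is NA; nothing to model")  # fail gracefully
  x.trim <- window(x, start = time(x)[max(na.idx) + 1])  # keep only the part after the last NA
  forecast(ets(x.trim), h)
}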

Create efficient week over week calculation with subsetting

In my working dataset, I'm trying to calculate week-over-week changes in wholesale and revenue. The code seems to work, but my estimates show it'll take about 75 hours to run what is a seemingly simple calculation. Below is the generic reproducible version, which takes about 2 minutes to run on this smaller dataset:
########################################################################################################################
# MAKE A GENERIC REPRODUCIBLE STACK OVERFLOW QUESTION
########################################################################################################################
# Create empty data frame of 26,000 observations similar to my data, but populated with noise
exampleData <- data.frame(product = rep(LETTERS, 1000),
                          wholesale = rnorm(1000*26),
                          revenue = rnorm(1000*26))
# create a week_ending column which increases by one week with every set of 26 "products"
for (i in 1:nrow(exampleData)) {
  exampleData$week_ending[i] <- as.Date("2016-09-04") + 7*floor((i-1)/26)
}
exampleData$week_ending <- as.Date(exampleData$week_ending, origin = "1970-01-01")
# create empty columns to fill
exampleData$wholesale_wow <- NA
exampleData$revenue_wow <- NA
# loop through the wholesale and revenue numbers and append the week-over-week changes
for (i in 1:nrow(exampleData)) {
  # set a condition where the loop only appends the week-over-week values if it's not the first week
  if (exampleData$week_ending[i] != "2016-09-04") {
    # set temporary values for the current and past week's wholesale value
    currentWholesale <- exampleData$wholesale[i]
    lastWeekWholesale <- exampleData$wholesale[which(exampleData$product == exampleData$product[i] &
                                                       exampleData$week_ending == exampleData$week_ending[i] - 7)]
    exampleData$wholesale_wow[i] <- currentWholesale/lastWeekWholesale - 1
    # set temporary values for the current and past week's revenue
    currentRevenue <- exampleData$revenue[i]
    lastWeekRevenue <- exampleData$revenue[which(exampleData$product == exampleData$product[i] &
                                                   exampleData$week_ending == exampleData$week_ending[i] - 7)]
    exampleData$revenue_wow[i] <- currentRevenue/lastWeekRevenue - 1
  }
}
Any help understanding why this takes so long or how to cut down the time would be much appreciated!
The first for loop can be simplified as follows:
exampleData$week_ending2 <- as.Date("2016-09-04") + 7 * floor((seq_len(nrow(exampleData)) - 1) / 26)
setequal(exampleData$week_ending, exampleData$week_ending2)
[1] TRUE
Replacing the second for loop:
library(data.table)
dt1 <- as.data.table(exampleData)
dt1[, wholesale_wow := wholesale / shift(wholesale) - 1, by = product]
dt1[, revenue_wow := revenue / shift(revenue) - 1, by = product]
setequal(exampleData, dt1)
[1] TRUE
This takes about 4 milliseconds to run on my laptop.
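If you want to check the timing on your own machine, a quick sketch (results will vary):

system.time({
  dt1[, wholesale_wow := wholesale / shift(wholesale) - 1, by = product]
  dt1[, revenue_wow := revenue / shift(revenue) - 1, by = product]
})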
Here is a vectorized solution using the tidyr package.
set.seed(123)
# Create empty data frame of 26,000 observations similar to my data, but populated with noise
exampleData <- data.frame(product = rep(LETTERS, 1000),
                          wholesale = rnorm(1000*26),
                          revenue = rnorm(1000*26))
# create a week_ending column which increases by one week with every set of 26 "products"
# vectorize the creation of the data
i <- 1:nrow(exampleData)
exampleData$week_ending <- as.Date("2016-09-04") + 7*floor((i-1)/26)
exampleData$week_ending <- as.Date(exampleData$week_ending, origin = "1970-01-01")
# create empty columns to fill
exampleData$wholesale_wow <- NA
exampleData$revenue_wow <- NA
# find the indices of the rows of interest (i.e. removing the first week)
i <- i[exampleData$week_ending != "2016-09-04"]
library(tidyr)
# create temp variables and convert into wide format:
# the rows are products and the columns are the ending weeks
Wholesale <- exampleData[, c(1, 2, 4)]
Wholesale <- spread(Wholesale, week_ending, wholesale)
Revenue <- exampleData[, c(1, 3, 4)]
Revenue <- spread(Revenue, week_ending, revenue)
# number of columns
numCol <- ncol(Wholesale)
# remove the first two columns for the current week's wholesale,
# remove the first and last column for last week's wholesale,
# then perform the calculation on every element of the dataframe (divide this week by last week)
Wholesale_wow <- Wholesale[, -c(1, 2)]/Wholesale[, -c(1, numCol)] - 1
# convert back to long format
Wholesale_wow <- gather(Wholesale_wow)
# repeat for revenue
Revenue_wow <- Revenue[, -c(1, 2)]/Revenue[, -c(1, numCol)] - 1
# convert back to long format
Revenue_wow <- gather(Revenue_wow)
# assemble the calculated values back into the original dataframe
exampleData$wholesale_wow[i] <- Wholesale_wow$value
exampleData$revenue_wow[i] <- Revenue_wow$value
The strategy was to convert the original data into a wide format, where the rows are the product ids and the columns are the weeks, divide the offset data frames by each other, convert back into long format, and add the newly calculated values to the exampleData data frame. This works; it is not very clean, but it is much faster than the loop. The dplyr package is another tool for this type of work.
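For comparison, a rough dplyr sketch of the same calculation (my own translation, not from the answer above), using lag() within each product group:

library(dplyr)
exampleData <- exampleData %>%
  group_by(product) %>%
  arrange(week_ending, .by_group = TRUE) %>%  # make sure weeks are in order within each product
  mutate(wholesale_wow = wholesale / lag(wholesale) - 1,
         revenue_wow   = revenue / lag(revenue) - 1) %>%
  ungroup()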
To compare the results of this code with your test case, use:
print(identical(goldendata, exampleData))
where goldendata is your known good result. Be sure to use the same random numbers via the set.seed() function.

Mean Returns in Time Series - Restarting after NA values - rstudio

Has anyone dealt with calculating historical mean log returns in time series datasets?
The dataset is ordered by individual security first and by time for each respective security. I am trying to form a historical mean log return, i.e. the mean log return for the security from its first appearance in the dataset to date, for each point in time for each security.
Luckily, the return time series contains NAs between the returns of different securities. My idea is to calculate a historical mean that restarts after each NA that appears.
A simple cumsum() probably will not do it, as the NAs would have to be dropped.
I thought about using rollmean(), if only I knew an efficient way to set the 'width' parameter to the length of the run of consecutive preceding non-NAs.
The current approach I am taking, based on Count how many consecutive values are true, takes significantly too much time, given the size of the data set I am working with.
For any x of the form x : [r(1) r(2) ... r(N)], where r(2) is the log return in period 2:
df <- data.frame(x, zcount = NA)
df[1, 2] <- 0  # df$x[1] = NA by construction of the data set
for (i in 2:nrow(df))
  df$zcount[i] <- ifelse(!is.na(df$x[i]), df$zcount[i-1] + 1, 0)
Any idea how to speed this up would be highly appreciated!
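For reference, that run counter can be computed without a loop; here is a quick sketch using rle() (my own suggestion, not part of the answer below):

x <- c(NA, 0.5, -0.2, 0.1, NA, 0.3, 0.7)  # toy returns with NA separators
r <- rle(!is.na(x))                       # lengths of the NA / non-NA runs
zcount <- sequence(r$lengths)             # counts 1, 2, ... within each run
zcount[is.na(x)] <- 0                     # NA positions restart the count at 0
zcount
# [1] 0 1 2 3 0 1 2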
You will need to reshape the data.frame to apply the cumsum function over each security. Here's how:
First, I'll generate some data on 100 securities over 100 months, which I think corresponds to your description of the data set:
securities <- 100
months <- 100
time <- seq.Date(as.Date("2010/1/1"), by = "months", length.out = months)
ID <- rep(paste0("sec", 1:securities), each = months)
returns <- rnorm(securities * months, mean = 0.08, sd = 2)
df <- data.frame(time, ID, returns)
head(df)
time ID returns
1 2010-01-01 sec1 -3.0114466
2 2010-02-01 sec1 -1.7566112
3 2010-03-01 sec1 1.6615731
4 2010-04-01 sec1 0.9692533
5 2010-05-01 sec1 1.3075774
6 2010-06-01 sec1 0.6323768
Now, you must reshape your data so that each security column contains its returns, and each row represents a date.
library(tidyr)
df_wide <- spread(df, ID, returns)
Once this is done, you can use the apply function to sum every column, each of which now represents one security; or use the cumsum function for running totals. Notice the data object df_wide[-1], which drops the time column. This is necessary to avoid the sum or cumsum functions throwing an error.
matrix_sum <- apply(df_wide[-1], 2, FUN = sum)
matrix_cumsum <- apply(df_wide[-1], 2, FUN = cumsum)
Now, add the time column back as a data.frame if you like:
df_final <- data.frame(time = df_wide[,1], matrix_cumsum)
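Since the question asks for historical means rather than sums, one extra step you would presumably add (my addition, not part of the original answer) is to divide each cumulative sum by the running count of observations:

# cumulative mean per security: cumsum divided by the number of returns so far
matrix_cummean <- apply(df_wide[-1], 2, function(x) cumsum(x) / seq_along(x))
df_means <- data.frame(time = df_wide[, 1], matrix_cummean)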

How to match dates in 2 data frames in R, then sum specific range of values up to that date?

I have two data frames: rainfall data collected daily and nitrate concentrations of water samples collected irregularly, approximately once a month. I would like to create a vector of values for each nitrate concentration that is the sum of the previous 5 days' rainfall. Basically, I need to match the nitrate date with the rain date, sum the previous 5 days' rainfall, then print the sum with the nitrate data.
I think I need to either write a function, use a for loop, or use tapply to do this, but I don't know how. I'm not an expert at any of those, though I've used them in simple cases. I've searched for similar posts, but none address this exactly. This one deals with summing by factor groups. This one deals with summing each possible pair of rows. This one deals with summing by aggregate.
Here are 2 example data frames:
# rainfall df
mm <- c(0,0,0,0,5, 0,0,2,0,0, 10,0,0,0,0)
date <- c(1:15)
rain <- data.frame(cbind(mm, date))
# b/c sums of rainfall depend on correct chronological order, make sure the data are in order by date.
rain <- rain[do.call(order, list(rain$date)), ]
# nitrate df
nconc <- c(15, 12, 14, 20, 8.5) # nitrate concentration
ndate <- c(6,8,11,13,14)
nitrate <- data.frame(cbind(nconc, ndate))
I would like to have a way of finding the matching rainfall date for each nitrate measurement, such as:
match(nitrate$ndate[i], rain$date)
(Note: Will match work with as.Date dates?) And then sum the preceding 5 days' rainfall (not including the measurement date), such as:
sum(rain$mm[(j-5):(j-1)])
And print the sum in a new column in nitrate:
print(nitrate$mm_sum[i])
To make sure it's clear what result I'm looking for, here's how to do the calculation 'by hand'. The first nitrate concentration was collected on day 6, so the sum of rainfall on days 1-5 is 5mm.
Many thanks in advance.
You were more or less there! Note that the previous five days, excluding the measurement date itself, are rows (day-5) through (day-1):
nitrate$prev_five_rainfall <- NA
for (i in 1:length(nitrate$ndate)) {
  day <- nitrate$ndate[i]
  nitrate$prev_five_rainfall[i] <- sum(rain$mm[(day-5):(day-1)])
}
Step-by-step explanation:
Initialize an empty result column:
nitrate$prev_five_rainfall <- NA
For each line in the nitrate df (i = 1, 2, 3, 4, 5):
for (i in 1:length(nitrate$ndate)) {
Grab the day we want the final result for:
  day <- nitrate$ndate[i]
Take the rainfall sum and put it in the results column:
  nitrate$prev_five_rainfall[i] <- sum(rain$mm[(day-5):(day-1)])
Close the for loop :)
}
Disclaimer: This answer is basic in that:
It will break if nitrate's ndate < 6
It will be incorrect if some dates are missing in the rain dataframe
It will be slow on larger data
As you get more experience with R, you might use data manipulation packages like dplyr or data.table for these types of manipulations.
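For instance, a hedged base-R sketch without the explicit loop, matching by date value rather than row position (so missing dates in rain simply contribute nothing to the sum):

nitrate$prev_five_rainfall <- sapply(nitrate$ndate, function(day)
  sum(rain$mm[rain$date >= day - 5 & rain$date <= day - 1]))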
#nelsonauner's answer does all the heavy lifting. But one thing to note: in my actual data, the dates are not numerical like in the example above; they are dates listed as MM/DD/YYYY, converted with the appropriate as.Date(nitrate$date, "%m/%d/%Y").
I found that the for loop above gave me all zeros for nitrate$prev_five_rainfall, and I suspected it was a problem with the dates.
So I changed my dates in both data sets to numerical values, namely the number of days between a common start date and the recorded date, so that the for loop would look for a matching number of days in each data frame rather than a date. First, make a column of the start date using rep_len() and format it:
nitrate$startdate <- rep_len("01/01/1980", nrow(nitrate))
nitrate$startdate <- as.Date(nitrate$startdate, "%m/%d/%Y")
Then, calculate the difference using difftime():
nitrate$diffdays <- as.numeric(difftime(nitrate$date, nitrate$startdate, units="days"))
Do the same for the rain data frame. Finally, the for loop looks like this:
nitrate$prev_five_rainfall <- NA
for (i in 1:length(nitrate$diffdays)) {
  day <- nitrate$diffdays[i]
  nitrate$prev_five_rainfall[i] <- sum(rain$mm[(day-5):(day-1)]) # 5 days
}

Data aggregation loop in R

I am facing a problem aggregating my data to daily data.
I have a data frame from which NAs have been removed (a link to a picture of the data is given below). Data has been collected 3 times a day, but sometimes, due to NAs, there are just 1 or 2 entries per day; some days the data is missing completely.
I am now interested in calculating the daily mean of "dist": this means summing up the "dist" values of one day and dividing by the number of entries for that day (3 if no data is missing that day). I would like to do this via a loop.
How can I do this with a loop? The problem is that sometimes I have 3 entries per day and sometimes just 2 or even 1. I would like to tell R that, for every day, it should sum up "dist" and divide it by the number of entries available for that day.
I just have no idea how to formulate a for loop for this purpose. I would really appreciate any advice on this problem. Thanks for your efforts and kind regards,
Jan
Data frame: http://www.pic-upload.de/view-11435581/Data_loop.jpg.html
Edit: I used aggregate and tapply as suggested; however, the mean value of the data was not really calculated:
Group.1 x
1 2006-10-06 12:00:00 636.5395
2 2006-10-06 20:00:00 859.0109
3 2006-10-07 04:00:00 301.8548
4 2006-10-07 12:00:00 649.3357
5 2006-10-07 20:00:00 944.8272
6 2006-10-08 04:00:00 136.7393
7 2006-10-08 12:00:00 360.9560
8 2006-10-08 20:00:00 NaN
The code used was:
dates <- Dis_sub$date
distance <- Dis_sub$dist
aggregate(distance, list(dates), mean, na.rm = TRUE)
tapply(distance, dates, mean, na.rm = TRUE)
Don't use a loop. Use R. Some example data:
dates <- rep(seq(as.Date("2001-01-05"),
                 as.Date("2001-01-20"),
                 by = "day"),
             each = 3)
values <- rep(1:16, each = 3)
values[c(4,5,6,10,14,15,30)] <- NA
and any of:
aggregate(values, list(dates), mean, na.rm = TRUE)
tapply(values, dates, mean, na.rm = TRUE)
gives you what you want. See also ?aggregate and ?tapply.
If you want a dataframe back, you can look at the plyr package:
Data <- data.frame(dates, values)
require(plyr)
ddply(Data, "dates", summarise, avg = mean(values, na.rm = TRUE))
Keep in mind that ddply does not fully support the date format (yet).
Look at the data.table package, especially if your data is huge. Here is some code that calculates the mean of dist by day:
library(data.table)
dt1 <- data.table(Data)
dt1[, list(avg_dist = mean(dist, na.rm = TRUE)), by = 'date']
It looks like your main problem is that your date field has times attached. The first thing you need to do is create a column that has just the date using something like
Dis_sub$date_only <- as.Date(Dis_sub$date)
Then using Joris Meys' solution (which is the right way to do it) should work.
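Putting those two steps together, a minimal sketch (assuming, as in the question, that Dis_sub has columns date and dist):

Dis_sub$date_only <- as.Date(Dis_sub$date)
aggregate(dist ~ date_only, data = Dis_sub, FUN = mean, na.rm = TRUE)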
However, if for some reason you really want to use a loop, you could try something like
newFrame <- data.frame()
for (d in unique(Dis_sub$date_only)) {
  meanDist <- mean(Dis_sub$dist[Dis_sub$date_only == d], na.rm = TRUE)
  newFrame <- rbind(newFrame, c(d, meanDist))
}
But keep in mind that this will be slow and memory-inefficient.
