create an index for aggregating daily data to match periodic data - r

I have daily measurements prec.d and periodic measurements prec.p. The periodic measurements (3-12 days apart) are roughly the sum of the daily measurements between the start and end dates, and I need to compare prec in the two data frames. I have so far manually created an index week that represents the time span of each periodic measurement, but it would be great to make week in a reproducible fashion.
data.frame prec.d
day week prec
6/20/2013 1 0
6/21/2013 1 0
6/22/2013 1 0
6/23/2013 1 0
6/24/2013 1 41.402
6/25/2013 1 2.794
6/26/2013 1 6.096
6/27/2013 2 0.508
6/28/2013 2 0
6/29/2013 2 0
6/30/2013 2 2.54
7/1/2013 2 18.034
7/2/2013 2 4.064
And data.frame prec.p
start end week prec1 prec2 prec3
6/20/2013 6/26/2013 1 50.28 31.78042615 42.76461716
6/27/2013 7/2/2013 2 25.1 15.70964247 20.49507586
I would like to create the week field automatically, which spans from start to end in prec.p. The I can aggregate by week to make prec in both data frames match.

Introduce YYYYWW field in both weekly and minute data, WW stands for week number, that will give you a common index. For example
x <- as.Date(runif(100)*100)
yyyyww <- strftime(x, format="%Y%U")
yyyyww
Or take a look at package quantmod, if I remember correctly, it has functions for time frame conversion.

Related

Is there a way I can use r code in order to calculate the average price for specific days? (AVERAGEIF function)

Firstly: I have seen other posts about AVERAGEIF translations from excel into R but I didn't see one that worked on my specific case and I couldn't get around to making one work.
I have a dataset which encompasses the daily pricings of a bunch of listings.
It looks like this
listing_id date price
1 1000 1/2/2015 $100
2 1200 2/4/2016 $150
Sample of the dataset (and desired outcome) # https://send.firefox.com/download/228f31e39d18738d/#rlMmm6UeGxgbkzsSD5OsQw
The dataset I would like to have has only the date and the average prices of all listings on that date. The goal is to get a (different) dataframe which would look something like this so I can work with it:
Date Average Price
1 4/5/2015 204.5438
2 4/6/2015 182.6439
3 4/7/2015 176.553
4 4/8/2015 182.0448
5 4/9/2015 183.3617
6 4/10/2015 205.0997
7 4/11/2015 197.0118
8 4/12/2015 172.2943
I created this in Excel using the Average.if function (and copy pasting by value) from the sample provided above.
I tried to format the data in Excel first where I could use the AVERAGE.IF function saying take the average if it is this specific date. The problem with this is that the dataset consists of 30million rows and excel only allows for 1 million so it didn't work.
What I have done so far: I created a data frame in R (where i want the average prices to go into) using
Avg = data.frame("Date" =1:2, "Average Price"=1:2)
Avg[nrow(Avg) + 2036,] = list("v1","v2")
Avg$Date = seq(from = as.Date("2015-04-05"), to = as.Date("2020-11-01"), by = 'day')
I tried to create an averageif-like function by this article and another but could not get it to work.
I hope this is enough information to go on otherwise I would be more than happy to provide more.
If your question is how to replicate the AVERAGEIF function, you can use logical indexing :
R code :
> df
Dates Prices
1 1 100
2 2 120
3 3 150
4 1 320
5 2 250
6 3 210
7 1 102
8 2 180
9 3 150
idx <- df$Dates == 1 # Positions where condition is true
mean(df$Prices[idx]) # Prints same output as Excel

calculating date specific correlation in r (leading to a potential time series)

I have a dataset that looks somewhat like this (the actual dataset is ~150000 lines with additional columns of fluff information such as company name, etc.):
Date return1 return2 rank
01/31/2008 0.05434 0.23413 3
01/31/2008 0.03423 0.43423 4
01/31/2008 0.65277 0.23423 1
01/31/2008 0.02342 0.47234 4
02/31/2008 0.01463 0.01231 4
02/31/2008 0.13456 0.52552 2
02/31/2008 0.34534 0.36663 1
02/31/2008 0.00324 0.56463 3
...
12/31/2015 0.21234 0.02333 2
12/31/2015 0.07245 0.87234 1
12/31/2015 0.47282 0.12998 1
12/31/2015 0.99022 0.03445 2
Basically I need to caculate the date-specific correlation between return1 and rank (so the corr. on 01/31/2008, 02/31/2008, and so on). I know I can split the data using the split() function but I am unsure as to how to get the date-specific correlation. The real data has about 260 entries per date and around 68 dates, so manually subsetting the original table and performing calculations is time consuming but more importantly more susceptible to error.
My ultimate goal is to create a time series of the correlations on different dates.
Thank you in advance!
I had this same problem earlier, except I wasn't calculating correlation. What I would do is
a %>% group_by(Date) %>% summarise(Correlation = cor(return1, rank))
And this will provide, for each date, a correlation value between return1 and rank. Don't forget that you can specify what kind of correlation you would like (e.g. Spearman).

Creating a Dummy Variable for Observations within a date range

I want to create a new dummy variable that prints 1 if my observation is within a certain set of date ranges, and a 0 if its not. My dataset is a list of political contributions over a 10 year range and I want to make a dummy variable to mark if the donation came during a certain range of dates. I have 10 date ranges I'm looking at.
Does anyone know if the right way to do this is to create a loop? I've been looking at this question, which seems similar, but I think mine would be a bit more complicated: Creating a weekend dummy variable
By way of example, what I have a variable listing dates that contributions were recorded and I want to create dummy to show whether this contribution came during a budget crisis. So, if there were a budget crisis from 2010-2-01 until 2010-03-25 and another from 2009-06-05 until 2009-07-30, the variable would ideally look like this:
Contribution Date.......Budget Crisis
2009-06-01...........................0
2009-06-06...........................1
2009-07-30...........................1
2009-07-31...........................0
2010-01-31...........................0
2010-03-05...........................1
2010-03-26...........................0
Thanks yet again for your help!
This looks like a good opportunity to use the %in% syntax of the match(...) function.
dat <- data.frame(ContributionDate = as.Date(c("2009-06-01", "2009-06-06", "2009-07-30", "2009-07-31", "2010-01-31", "2010-03-05", "2010-03-26")), CrisisYes = NA)
crisisDates <- c(seq(as.Date("2010-02-01"), as.Date("2010-03-25"), by = "1 day"),
seq(as.Date("2009-06-05"), as.Date("2009-07-30"), by = "1 day")
)
dat$CrisisYes <- as.numeric(dat$ContributionDate %in% crisisDates)
dat
ContributionDate CrisisYes
1 2009-06-01 0
2 2009-06-06 1
3 2009-07-30 1
4 2009-07-31 0
5 2010-01-31 0
6 2010-03-05 1
7 2010-03-26 0

Aggregating, restructuring hourly time series data in R

I have a year's worth of hourly data in a data frame in R:
> str(df.MHwind_load) # compactly displays structure of data frame
'data.frame': 8760 obs. of 6 variables:
$ Date : Factor w/ 365 levels "2010-04-01","2010-04-02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Time..HRs. : int 1 2 3 4 5 6 7 8 9 10 ...
$ Hour.of.Year : int 1 2 3 4 5 6 7 8 9 10 ...
$ Wind.MW : int 375 492 483 476 486 512 421 396 456 453 ...
$ MSEDCL.Demand: int 13293 13140 12806 12891 13113 13802 14186 14104 14117 14462 ...
$ Net.Load : int 12918 12648 12323 12415 12627 13290 13765 13708 13661 14009 ...
While preserving the hourly structure, I would like to know how to extract
a particular month/group of months
the first day/first week etc of each month
all mondays, all tuesdays etc of the year
I have tried using "cut" without result and after looking online think that "lubridate" might be able to do so but haven't found suitable examples. I'd greatly appreciate help on this issue.
Edit: a sample of data in the data frame is below:
Date Hour.of.Year Wind.MW datetime
1 2010-04-01 1 375 2010-04-01 00:00:00
2 2010-04-01 2 492 2010-04-01 01:00:00
3 2010-04-01 3 483 2010-04-01 02:00:00
4 2010-04-01 4 476 2010-04-01 03:00:00
5 2010-04-01 5 486 2010-04-01 04:00:00
6 2010-04-01 6 512 2010-04-01 05:00:00
7 2010-04-01 7 421 2010-04-01 06:00:00
8 2010-04-01 8 396 2010-04-01 07:00:00
9 2010-04-01 9 456 2010-04-01 08:00:00
10 2010-04-01 10 453 2010-04-01 09:00:00
.. .. ... .......... ........
8758 2011-03-31 8758 302 2011-03-31 21:00:00
8759 2011-03-31 8759 378 2011-03-31 22:00:00
8760 2011-03-31 8760 356 2011-03-31 23:00:00
EDIT: Additional time-based operations I would like to perform on the same dataset
1. Perform hour-by-hour averaging for all data points i.e average of all values in the first hour of each day in the year. The output will be an "hourly profile" of the entire year (24 time points)
2. Perform the same for each week and each month i.e obtain 52 and 12 hourly profiles respectively
3. Do seasonal averages, for example for June to September
Convert the date to the format which lubridate understands and then use the functions month, mday, wday respectively.
Suppose you have a data.frame with the time stored in column Date, then the answer for your questions would be:
###dummy data.frame
df <- data.frame(Date=c("2012-01-01","2012-02-15","2012-03-01","2012-04-01"),a=1:4)
##1. Select rows for particular month
subset(df,month(Date)==1)
##2a. Select the first day of each month
subset(df,mday(Date)==1)
##2b. Select the first week of each month
##get the week numbers which have the first day of the month
wkd <- subset(week(df$Date),mday(df$Date)==1)
##select the weeks with particular numbers
subset(df,week(Date) %in% wkd)
##3. Select all mondays
subset(df,wday(Date)==1)
First switch to a Date representation: as.Date(df.MHwind_load$Date)
Then call weekdays on the date vector to get a new factor labelled with day of week
Then call months on the date vector to get a new factor labelled with name of month
Optionally create a years variable (see below).
Now subset the data frame using the relevant combination of these.
Step 2. gets an answer to your task 3. Steps 3. and 4. get you to task 1. Task 2 might require a line or two of R. Or just select rows corresponding to, say, all the Mondays in a month and call unique, or its alter-ego duplicated on the results.
To get you going...
newdf <- df.MHwind_load ## build an augmented data set
newdf$d <- as.Date(newdf$Date)
newdf$month <- months(newdf$d)
newdf$day <- weekdays(newdf$d)
## for some reason R has no years function. Here's one
years <- function(x){ format(as.Date(x), format = "%Y") }
newdf$year <- years(newdf$d)
# get observations from January to March of every year
subset(newdf, month %*% in c('January', 'February', 'March'))
# get all Monday observations
subset(newdf, day == 'Monday')
# get all Mondays in 1999
subset(newdf, day == 'Monday' & year == '1999')
# slightly fancier: _first_ Monday of each month
# get the first weeks
first.week.of.month <- !duplicated(cbind(newdf$month, newdf$day))
# now pull out the mondays
subset(newdf, first.monday.of.month & day=='Monday')
Since you're not asking about the time (hourly) part of your data, it is best to then store your data as a Date object. Otherwise, you might be interested in chron, which also has some convenience functions like you'll see below.
With respect to Conjugate Prior's answer, you should store your date data as a Date object. Since your data already follows the default format ('yyyy-mm-dd') you can just call as.Date on it. Otherwise, you would have to specify your string format. I would also use as.character on your factor to make sure you don't get errors inline. I know I've ran into problems with factors-into-Dates for that reason (possibly corrected in current version).
df.MHwind_load <- transform(df.MHwind_load, Date = as.Date(as.character(Date)))
Now you would do well to create wrapper functions that extract the information you desire. You could use transform like I did above to simply add those columns that represent months, days, years, etc, and then subset on them logically. Alternatively, you might do something like this:
getMonth <- function(x, mo) { # This function assumes w/in single year vector
isMonth <- month(x) %in% mo # Boolean of matching months
return(x[which(isMonth)] # Return vector of matching months
} # end function
Or, in short form
getMonth <- function(x, mo) x[month(x) %in% mo]
This is just a tradeoff between storing that information (transform frame) or having it processed when desired (use accessor methods).
A more complicated process is your need for, say, the first day of a month. This is not entirely difficult, though. Below is a function that will return all of those values, but it is rather simple to just subset a sorted vector of values for a given month and take their first one.
getFirstDay <- function(x, mo) {
isMonth <- months(x) %in% mo
x <- sort(x[isMonth]) # Look at only those in the desired month.
# Sort them by date. We only want the first day.
nFirsts <- rle(as.numeric(x))$len[1] # Returns length of 1st days
return(x[seq(nFirsts)])
} # end function
The easier alternative would be
getFirstDayOnly <- function(x, mo) {sort(x[months(x) %in% mo])[1]}
I haven't prototyped these, as you didn't provide any data samples, but this is the sort of approach that can help you get the information you desire. It is up to you to figure out how to put these into your work flow. For instance, say you want to get the first day for each month of a given year (assuming we're only looking at one year; you can create wrappers or pre-process your vector to a single year beforehand).
# Return a vector of first days for each month
df <- transform(df, date = as.Date(as.character(date)))
sapply(unique(months(df$date)), # Iterate through months in Dates
function(month) {getFirstDayOnly(df$date, month)})
The above could also be designed as a separate convenience function that uses the other accessor function. In this way, you create a series of direct but concise methods for getting pieces of the information you want. Then you simply pull them together to create very simple and easy to interpret functions that you can use in your scripts to get you precise what you desire in the most efficient manner.
You should be able to use the above examples to figure out how to prototype other wrappers for accessing the date information you require. If you need help on those, feel free to ask in a comment.

Assigning week numbers in a time series to obtain weekly average price

Let's say I have a time series with daily data (business days), and I would like to organize the data by business weeks. (Monday-Friday) in a similar fashion as the one in this webpage from the EIA on futures prices of crude oil:
http://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=RCLC1&f=D
As you can see the prices are nicely organized by weeks in this webpage.
Is there any function in R that could organize the data in a similar fashion?
You can obtain the data in .xls format at:
http://www.eia.gov/dnav/pet/hist_xls/RCLC1d.xls
What I would like to do is to assign a week number to each daily observation something like this: (Look at the weeks column)
Date Price weeks day
1983-04-04 29.44 1 Monday
1983-04-05 29.71 1 Tuesday
1983-04-06 29.92 1 Wednesday
1983-04-07 30.17 1 Thursday
1983-04-08 30.38 1 Friday
1983-04-11 30.26 2 Monday
...
...
So far I have used the week function of the lubridate package but is not working well. It seems like once a year hits the 53rd week the function fails to initiate properly the week of the following year.
I have been trying to stay away from rep, seq /5 or /7 kind of solutions since there may be some observations that I may need to filter from the data later on, so I would like to have a solution that doesn't depend on the particular vector of my data but rather I would prefer the solution to be more general, that is to depend on the date class, i.e POSIcxt, xts or zoo class
Any hints would be greatly appreciated.
Wouldn't this work?:
as.POSIXlt()$yday %/% 7
I realize that it does have part of what you wanted to avoid but it does draw its starting point from a recognized class. For your data noting that I read it in with colClasses=c("Date", "numeric","numeric","character") :
> 1 + as.POSIXlt(dat$Date)$yday %/% 7
[1] 14 14 14 14 14 15
If you want to replicate those interval labels, try adding multiples of 7 to any Monday and Friday:
paste(as.Date(strptime("1983 Apr- 4",format="%Y %b- %d"))+(39)*7,
" to ",
as.Date(strptime("1983 Apr- 8",format="%Y %b- %d"))+(39)*7,
sep="")
#[1] "1984-01-02 to 1984-01-06" # The first new year change
paste(as.Date(strptime("1983 Apr- 4",format="%Y %b- %d"))+(39+52)*7,
" to ",
as.Date(strptime("1983 Apr- 8",format="%Y %b- %d"))+(39+52)*7,
sep="")
#[1] "1984-12-31 to 1985-01-04" # The second new year change
Here's a function that will accept an integer vector:
from8Apr83dts <- function(numwks) {
paste(as.Date(strptime("1983 Apr- 4",format="%Y %b- %d"))+(numwks)*7,
" to ",
as.Date(strptime("1983 Apr- 8",format="%Y %b- %d"))+(numwks)*7,
sep="")
}
# Usage
from8Apr83dts(39:40)
#[1] "1984-01-02 to 1984-01-06" "1984-01-09 to 1984-01-13"

Resources