Aggregating data on monthly basis

I have the following data in R:
date category1 category2 category3 category4
1 2012-04-01 7496.00 77288.37 224099.15 700050.04
2 2012-04-02 24541.00 59103.94 138408.65 625006.84
3 2012-04-03 1249.00 15951.50 574170.30 249390.53
4 2012-04-04 5205.00 10866.00 0.00 358703.88
5 2012-04-05 10398.00 0.00 119745.17 270585.46
I use the following script to aggregate the data on a monthly basis:
data <- as.xts(data$category1,order.by=as.Date(data$date))
monthly <- apply.monthly(data,sum)
monthly
Question: Instead of repeating this step for each category and then joining the monthly results, how can I apply as.xts(...) to all columns at once? I tried
as.xts(c("data$category1","data$category1"),order.by=as.Date(data$date))
which did not work.
Also: Is there a better way to aggregate on a monthly basis?

Use xts instead of as.xts.
apply.monthly(xts(df[ -1], order.by = as.Date(df$date)), mean)
However, this seems to work only for mean, not for sum. You can always use sapply to iterate over the columns:
sapply(colnames(data[, -1]), function(x)
  apply.monthly(as.xts(data[, x], order.by = as.Date(data$date)), sum))
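A possible reason sum behaves differently: sum() collapses each month's block of rows and columns into a single number, while a column-wise function such as colSums() returns one total per column. The sketch below assumes the xts package is installed; the data frame df and its values are made up for illustration.

```r
library(xts)  # also loads zoo

# Toy daily data shaped like the question's (values are made up)
df <- data.frame(
  date      = as.Date("2012-04-29") + 0:3,  # two days in April, two in May
  category1 = c(1, 2, 3, 4),
  category2 = c(10, 20, 30, 40)
)

# One xts object holding all value columns
x <- xts(df[-1], order.by = df$date)

# colSums() is applied to each month's rows, giving one total per column
monthly <- apply.monthly(x, colSums)
monthly
```

Each row of monthly is indexed by the last observed date of its month.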

You can use the daily2monthly function in the hydroTSM package. It accepts more than just xts objects as arguments and handles multiple columns. FUN can be sum or mean.
monthly <- daily2monthly(data, FUN=sum, na.rm=TRUE)
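If you would rather avoid extra packages, base R can do the same aggregation. This is only a sketch: the data frame df is a cut-down, partly made-up version of the question's data (the third row is moved to May so that two months appear).

```r
# Cut-down illustration data; the May row is an assumption for the example
df <- data.frame(
  date      = as.Date(c("2012-04-01", "2012-04-02", "2012-05-01")),
  category1 = c(7496, 24541, 1249),
  category2 = c(77288.37, 59103.94, 15951.50)
)

# Build a year-month key and sum every value column within each month
month <- format(df$date, "%Y-%m")
monthly <- aggregate(df[-1], by = list(month = month), FUN = sum)
monthly
```

The result is a plain data frame with one row per month and one column per category.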

Related

Time series analysis applicability?

I have a sample data frame like this (date column format is mm-dd-YYYY):
date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6
I want to convert this data frame into a time series using ts(), but the problem is that the data frame has multiple values for the same date. Can a time series be built in this case?
Can I convert the data frame into a time series and build a model (ARIMA) that forecasts the count value on a daily basis?
Or should I forecast count based on grp? In that case I would have to select only the grp and count columns and skip the date column, so a daily forecast of count would not be possible?
Suppose I want to aggregate the count value per day. I tried the aggregate function, but it seems I would have to specify each date value, and I have a very large data set. Is there another option available in R?
Can somebody please suggest a better approach? My assumption is that time series forecasting works only for bivariate data. Is this assumption right?

It seems like there are two aspects of your problem:
i want to convert this data frame into time series using ts(), but the
problem is- current data frame having multiple values for the same
date. can we apply time series in this case?
If you are happy making use of the xts package you could attempt:
dta2$date <- as.Date(dta2$date, "%m-%d-%Y")  # the question states mm-dd-YYYY
dtaXTS <- xts::as.xts(dta2[, 2:3], order.by = dta2$date)
which would result in:
> head(dtaXTS)
           count grp
2009-01-09    54   1
2009-01-09   100   2
2009-01-09   546   3
2009-01-10    67   4
2009-01-11    80   5
2009-01-11    45   6
of the following classes:
> class(dtaXTS)
[1] "xts" "zoo"
You could then use the time series object as a univariate series by referring to a single column, or as a multivariate series. An example using the PerformanceAnalytics package:
PerformanceAnalytics::chart.TimeSeries(dtaXTS)
Side points
Concerning your second question:
can somebody plz suggest me what is the better approach to follow, my
assumption is time series forcast is works only for bivariate data? is
this assumption also right?
IMHO, this is rather broad. I would suggest that you use the xts object created above and elaborate on the model you want to use and why. If it is a conceptual question about the nature of time series analysis, you may prefer to post a follow-up question on Cross Validated.
Data sourced via: dta2 <- read.delim(pipe("pbpaste"), sep = "") using the provided example.

Since daily forecasts are wanted we need to aggregate to daily. Using DF from the Note at the end, read the first two columns of data into a zoo series z using read.zoo and argument aggregate=sum. We could optionally convert that to a "ts" series (tser <- as.ts(z)) although this is unnecessary for many forecasting functions. In particular, checking out the source code of auto.arima we see that it runs x <- as.ts(x) on its input before further processing. Finally run auto.arima, forecast or other forecasting function.
library(forecast)
library(zoo)
z <- read.zoo(DF[1:2], format = "%m-%d-%Y", aggregate = sum)
auto.arima(z)
forecast(z)
Note: DF is given reproducibly here:
Lines <- "date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6"
DF <- read.table(text = Lines, header = TRUE)
Updated: Revised after re-reading question.
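On the aggregate part of the question: there is no need to list the dates by hand. A minimal base R sketch, reusing the sample data from the question (the name daily is just for illustration):

```r
DF <- read.table(text = "date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6", header = TRUE)

# Parse the mm-dd-YYYY strings, then sum count within each distinct date
DF$date <- as.Date(DF$date, format = "%m-%d-%Y")
daily <- aggregate(count ~ date, data = DF, FUN = sum)
daily  # counts per day: 700, 67, 125
```

aggregate finds the distinct dates itself, so the same call should work on a large data set.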

Using dplyr::mutate between two dataframes to create column based on date range

Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).
set.seed(1)
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])
In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).
For example, in this case the first entry of this new column would be
sum(daily$value[1:5])
and the second would be
sum(daily$value[2:8]) and so on.
I tried using the following code
intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))
This is not working, and I think it might have something to do with not referencing the columns correctly, but I'm not sure where to go from here.
I'd really like to use dplyr to solve this and not a loop, because 11 million rows will take long enough even with dplyr. I tried using more of lubridate, but dplyr doesn't seem to support the Period class.
Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.

eps <- .Machine$double.eps
library(dplyr)
intervals %>%
rowwise() %>%
mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps )]))
# start end nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481
If you find the dplyr solution a bit slow (basically due to rowwise), you might want to use data.table for pure speed:
library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, by.x = c("date", "date1"), intervals)[, sum(value), by=c("start", "end")]
# start end V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481
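One thing to check: the totals printed above appear to include the end date (the first row, 144.8444, matches sum(daily$value[1:6]) rather than [1:5]). Subtracting .Machine$double.eps from a date of this magnitude does not change its numeric value, so the end date is not excluded. If the end really must be exclusive, a vectorised base R sketch using a running total could look like this. It assumes daily is sorted, has no gaps, and contains every start and end date; cs, i and j are names introduced here for illustration.

```r
# Same example data as in the question, built without lubridate
set.seed(1)
date.range <- as.Date("2008-03-01") + 0:30
daily <- data.frame(date = date.range, value = runif(31, min = 0, max = 45))
intervals <- data.frame(start = daily$date[1:5],
                        end   = daily$date[c(6, 9, 15, 24, 31)])

# Running total with a leading 0: cs[k] is the sum of the first k - 1 values
cs <- c(0, cumsum(daily$value))

# Row positions of each start and end date in daily
i <- match(intervals$start, daily$date)
j <- match(intervals$end, daily$date)

# Sum of values from start (inclusive) to end (exclusive)
intervals$nhdd <- cs[j] - cs[i]
intervals
```

This avoids per-row work entirely, which may matter with 11 million intervals.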

Creating a time series from a dataset including missing values

I need to create a time series from a data frame. The problem is that the observations are not well ordered. The data frame looks like this:
Cases Date
15 1/2009
30 3/2010
45 12/2013
I have 60 observations like that. As you can see, the data were collected irregularly, starting from 1/2008 and ending 12/2013 (cases are missing for many of the months between these years). My assumption is that there were no cases in those months. How can I convert this dataset into a time series? I will then try to predict the possible number of cases in the future.

Try installing the plyr library,
install.packages("plyr")
and then to sum duplicated Date2 rows:
library(plyr)
mergedData <- ddply(dat, .(Date2), .fun = function(x) {
data.frame(Cases = sum(x$Cases))
})
> head(mergedData)
Date2 Cases
1 2008-01-01 16352
2 2008-11-01 10
3 2009-01-01 23
4 2009-02-01 138
5 2009-04-01 18
6 2009-06-01 3534

You can create a complete sequence of dates and merge it with your data. This produces a complete time series with the missing months as NA.
If df is your data frame with Date as a column of class Date, then create the full sequence ts and merge as below.
ts <- data.frame(Date = seq(as.Date("2008-01-01"), as.Date("2013-12-31"), by = "1 month"))
dfwithmissing <- merge(ts, df, by = "Date", all = TRUE)
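Since the question assumes zero cases in the missing months, the NAs can then be replaced by 0 and the result turned into a monthly ts object. A self-contained sketch using the three sample rows from the question (the names full, merged and cases_ts are made up here; if a month occurs more than once, sum the duplicates first):

```r
df <- data.frame(Cases = c(15, 30, 45),
                 Date  = c("1/2009", "3/2010", "12/2013"))

# Turn "m/YYYY" into a proper Date on the first day of the month
df$Date <- as.Date(paste0("1/", df$Date), format = "%d/%m/%Y")

# Every month from 1/2008 to 12/2013
full <- data.frame(Date = seq(as.Date("2008-01-01"), as.Date("2013-12-01"),
                              by = "month"))

# Left join onto the full calendar; months without data become NA, then 0
merged <- merge(full, df, by = "Date", all.x = TRUE)
merged$Cases[is.na(merged$Cases)] <- 0

# Monthly time series starting January 2008
cases_ts <- ts(merged$Cases, start = c(2008, 1), frequency = 12)
cases_ts
```

cases_ts can then be passed to forecasting functions that accept ts objects.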

Calculate Value at Risk in a data frame

My data set has returns for thousands of hedge funds over 140 months, and I was trying to calculate Value at Risk (VaR) using the VaR function in the PerformanceAnalytics package. However, I have run into many questions when using this function. I have created a sample data frame to show my problem.
df=data.frame(matrix(rnorm(24),nrow=8))
df$X1<-c('2007-01','2007-02','2007-03','2007-04','2007-05','2007-06','2007-07','2007-08')
df[2,2]<-NA
df[2,3]<-NA
df[1,3]<-NA
df
I got a data frame:
X1 X2 X3
1 2007-01 -1.4420195 NA
2 2007-02 NA NA
3 2007-03 -0.4503824 -0.78506597
4 2007-04 1.4083746 0.02095307
5 2007-05 0.9636549 0.19584430
6 2007-06 1.1935281 -0.14175623
7 2007-07 -0.3986336 1.58128683
8 2007-08 0.8211377 -1.13347168
I then run
apply(df,2,FUN=VaR, na.rm=TRUE)
and received a warning message:
The data cannot be converted into a time series. If you are trying to pass in names from a data object with one column, you should use the form 'data[rows, columns, drop = FALSE]'. Rownames should have standard date formats, such as '1985-03-15'.
I have tried to convert my data frame into combination of time series using zoo() but it didn't help. Can someone help to figure out what should I do now?

@user2893255, you should convert your data frame into an xts object before using the apply function. Note that as.Date needs a day component, so append one to the year-month strings:
df.xts <- as.xts(df[, 2:3], order.by = as.Date(paste0(df$X1, "-01")))
and then
apply(df.xts,2,FUN=VaR, na.rm=TRUE)
gives you the result without warnings or error messages.

Try dropping the Date column:
apply(df[,-1L], 2, FUN=VaR, na.rm=TRUE)
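Whichever route you take, it is worth checking how the year-month strings are parsed if you build a date index from them. A small base R sketch (ym and dates are illustrative names):

```r
ym <- c("2007-01", "2007-02", "2007-03")

# A format without a day of month is expected to yield NA here
as.Date(ym, "%Y-%m")

# Appending a day gives valid dates (first day of each month)
dates <- as.Date(paste0(ym, "-01"))
dates
```

The resulting dates can be used as row names or as an order.by index.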

Simple pivot table type transformation in R statistics

I've been trying to learn R for a while but haven't got my knowledge up to even a decent level yet. I'll get there in the end, but I'm in a pinch at the moment and was wondering if you could help me do a quick "transformation" type piece.
I have a csv data file with 18 million rows with the following data fields: Person ID, Date and Value. It's basically from a simulation model and is simulating the contributions a person makes into their savings accounts, e.g.:
1,28/02/2013,19.49
2,13/03/2013,16.68
3,15/03/2013,20.34
2,10/01/2014,28.43
3,12/06/2014,38.13
1,29/08/2014,68.46
1,20/12/2013,20.51
So, as you can see, there can be multiple IDs in the data but each date and contribution amount for a person is unique.
I would like to transform this so I have a contribution history by year for each person. So for example the above would become:
ID,2013,2014
1,40.00,68.46
2,16.68,28.43
3,20.34,38.13
I have a rough idea how I could approach the problem: create another column of data with just the years and then summarise by ID and year to add up all contributions that fit into each ID/year bucket. I just have no clue how to even begin translating that into an R script.
Any pointers/guidance would be most appreciated.
Many Thanks and Kind Regards.

Here are a few possibilities:
zoo package. read.zoo in the zoo package can produce a multivariate time series with one column per series, i.e. one column per ID. We define yr to get the year from the index column and then split on the ID using the split= argument as we read it in. We use aggregate=sum to aggregate over the remaining columns -- here just one. We use text = Lines to keep the code below self contained, but with a real file we would replace that with "myfile", say. The final line transposes the result. We could drop that line if it were OK to have persons in columns instead of rows.
Lines <- "1,28/02/2013,19.49
2,13/03/2013,16.68
3,15/03/2013,20.34
2,10/01/2014,28.43
3,12/06/2014,38.13
1,29/08/2014,68.46
1,20/12/2013,20.51
"
library(zoo)
# given a Date string, x, output the year
yr <- function(x) floor(as.numeric(as.yearmon(x, "%d/%m/%Y")))
# read in data, reshape & aggregate
z <- read.zoo(text = Lines, sep = ",", index = 2, FUN = yr,
aggregate = sum, split = 1)
# transpose (optional)
tz <- data.frame(ID = colnames(z), t(z), check.names = FALSE)
With the posted data we get the following result:
> tz
ID 2013 2014
1 1 40.00 68.46
2 2 16.68 28.43
3 3 20.34 38.13
See ?read.zoo and also the zoo-read vignette.
reshape2 package. Here is a second solution using the reshape2 package:
library(reshape2)
# read in and fix up column names and Year
DF <- read.table(text = Lines, sep = ",") ##
colnames(DF) <- c("ID", "Year", "Value") ##
DF$Year <- sub(".*/", "", DF$Year) ##
dcast(DF, ID ~ Year, fun.aggregate = sum, value.var = "Value")
The result is:
ID 2013 2014
1 1 40.00 68.46
2 2 16.68 28.43
3 3 20.34 38.13
reshape function. Here is a solution that does not use any addon packages. First read in the data using the three lines marked ## in the last solution. This will produce DF. Then aggregate the data, reshape it from long to wide form and finally fix up the column names:
Ag <- aggregate(Value ~., DF, sum)
res <- reshape(Ag, direction = "wide", idvar = "ID", timevar = "Year")
colnames(res) <- sub("Value.", "", colnames(res))
which produces this:
> res
ID 2013 2014
1 1 40.00 68.46
2 2 16.68 28.43
3 3 20.34 38.13
tapply function. This solution does not use addon packages either. Using Ag from the last solution try this:
tapply(Ag$Value, Ag[1:2], sum)
UPDATES: minor improvements and 3 additional solutions.

The approach you describe is a sound one. Converting the date string to a date and back to a string can be done using strptime and strftime (possibly as.POSIXct). Once you have the year column, you can use a number of tools available in R, e.g. data.table, by, or ddply. I like the syntax of the last one:
library(plyr)
ddply(df, .(ID, year), summarise, total_per_year = sum(value))
This assumes that your base data is in df, and that the columns in your data are called year, ID and value. Do note that for large datasets ddply can become quite slow. If you really need raw performance, you definitely want to start working with data.table.
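For completeness, here is a rough base R sketch of the year-column step followed by the summary, using the sample rows from the question. The column names ID, date and value are assumptions, since the csv has no header.

```r
df <- read.csv(text = "1,28/02/2013,19.49
2,13/03/2013,16.68
3,15/03/2013,20.34
2,10/01/2014,28.43
3,12/06/2014,38.13
1,29/08/2014,68.46
1,20/12/2013,20.51", header = FALSE, col.names = c("ID", "date", "value"))

# Parse dd/mm/YYYY and keep only the year as a string
df$year <- format(as.Date(df$date, format = "%d/%m/%Y"), "%Y")

# Long form: one row per ID/year with the summed contributions
totals <- aggregate(value ~ ID + year, data = df, FUN = sum)

# Wide form: IDs in rows, years in columns
wide <- xtabs(value ~ ID + year, data = totals)
wide
```

With a real file, replace the text = argument with the file name.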