Time series analysis applicability in R?

I have a sample data frame like this (date column format is mm-dd-YYYY):
date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6
I want to convert this data frame into a time series using ts(), but the problem is that the current data frame has multiple values for the same date. Can a time series be applied in this case?
Can I convert the data frame into a time series and build an ARIMA model that can forecast the count value on a daily basis?
Or should I forecast the count value based on grp? In that case I would have to select only the grp and count columns of the data frame, skipping the date column, so a daily forecast of the count value would not be possible.
Suppose I want to aggregate the count value per day. I tried the aggregate function, but there I have to specify the date values, and I have a very large data set. Is there another option available in R?
Can somebody please suggest a better approach to follow? My assumption is that time series forecasting works only for bivariate data. Is this assumption right?

It seems like there are two aspects of your problem:
I want to convert this data frame into a time series using ts(), but the problem is that the current data frame has multiple values for the same date. Can a time series be applied in this case?
If you are happy to make use of the xts package you could attempt:
dta2$date <- as.Date(dta2$date, "%m-%d-%Y")
dtaXTS <- xts::as.xts(dta2[, 2:3], dta2$date)
which would result in:
> head(dtaXTS)
           count grp
2009-01-09    54   1
2009-01-09   100   2
2009-01-09   546   3
2009-01-10    67   4
2009-01-11    80   5
2009-01-11    45   6
of the following classes:
> class(dtaXTS)
[1] "xts" "zoo"
You could then use your time series object as a univariate time series, referring to the selected variable, or as a multivariate time series, for example using the PerformanceAnalytics package:
PerformanceAnalytics::chart.TimeSeries(dtaXTS)
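Since the original concern was several observations on the same date, one option (a sketch, not part of the original answer) is to collapse them by day with xts::apply.daily before modelling:
# Sum the multiple count observations for each day; grp is just a row id here
dailyCount <- xts::apply.daily(dtaXTS[, "count"], sum)
head(dailyCount)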
Side points
Concerning your second question:
Can somebody please suggest a better approach to follow? My assumption is that time series forecasting works only for bivariate data. Is this assumption right?
IMHO, this is rather broad. I would suggest that you use the created xts object and elaborate on the model you want to utilise and why; if it's a conceptual question about the nature of time series analysis, you may prefer to post your follow-up question on Cross Validated.
Data sourced via: dta2 <- read.delim(pipe("pbpaste"), sep = "") using the provided example.

Since daily forecasts are wanted, we need to aggregate to daily data. Using DF from the Note at the end, read the first two columns into a zoo series z using read.zoo with the argument aggregate = sum. We could optionally convert that to a "ts" series (tser <- as.ts(z)), although this is unnecessary for many forecasting functions. In particular, checking the source code of auto.arima we see that it runs x <- as.ts(x) on its input before further processing. Finally, run auto.arima, forecast, or another forecasting function.
library(forecast)
library(zoo)
z <- read.zoo(DF[1:2], format = "%m-%d-%Y", aggregate = sum)
auto.arima(z)
forecast(z)
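If you want forecasts for a specific horizon, say the next 7 days, fit the model first and then forecast from the fitted object (a sketch; the 7-day horizon is an arbitrary choice, not from the question):
fit <- auto.arima(z)
fc <- forecast(fit, h = 7)  # 7-day-ahead forecasts
plot(fc)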
Note: DF is given reproducibly here:
Lines <- "date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6"
DF <- read.table(text = Lines, header = TRUE)
Updated: Revised after re-reading question.

Related

creating columns of monthly averages in R

I have a dataframe in R where each row corresponds to a household. One column describes a date in 2010 when that household planted crops. The remainder of the dataset contains over 1000 columns describing the temperature on every day between 2007-2010 for those households.
This is the basic form:
Date 2007-01-01 2007-01-02 2007-01-03
1 2010-05-01 70 72 61
2 2010-02-10 63 59 73
3 2010-03-06 60 59 81
I need to create columns for each household that describe the monthly mean temperatures of the two months following their planting date in each of the three years prior to 2010.
For instance: if a household planted on 2010-05-01, I would need the following columns:
mean temp of 2007-05-01 through 2007-06-01
mean temp of 2007-06-02 through 2007-07-01
mean temp of 2008-05-01 through 2008-06-01
...
mean temp of 2009-06-02 through 2009-07-01
I skipped two columns, but you get the idea. Specific code would be most helpful, but in general, I am just looking for a way to pull data from specific columns based upon a date that is described by another column.
@bricevk, you could use the apply function. It allows you to apply a function over data either column-wise or row-wise.
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/apply
Say your data is in an object df. The call below applies the mean function over the columns of df, giving you the column-wise mean; the 2 indicates columns. This would be the daily average, assuming each column is a day.
Averages <- apply(df,2,mean)
If I didn't answer this the way you would like, perhaps I have not really understood your dataset. Could you try to explain it more clearly?
I suggest you use the tidyverse. However, in order to be compatible with this universe, you first have to make your data standard, i.e. tidy. In your example, things would be easier if you transformed your data so that observations are ordered by rows and columns are variables. If I understood your data correctly, you have households planting crops (the row names are planting dates?), and then controls with temperature. I'd do something like:
-----------------------------------------------------------------------------
| Household ID | planting date | Date of control | Temperature controlled |
-----------------------------------------------------------------------------
First, store your planting date as something other than a row name, for example:
library(dplyr)
df <- tibble::rownames_to_column(data, "PlantingDate")
You also need a household ID variable, which you haven't specified to us.
Then you can get the tidy data with tidyr, using
library(tidyr)
df <- gather(df,"DateOfControl","Temperature",-c(PlantingDate,ID))
Once you have that, you'll be able to use the lubridate package, something like
library(lubridate)
df %>%
  group_by(ID, PlantingDate, year(DateOfControl), month(DateOfControl)) %>%
  summarise(MeanT = mean(Temperature))
could work
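Pulled together, and converting the gathered control dates to Date class before extracting year and month, the whole pipeline might look like this (a sketch; the ID column is a hypothetical addition, since no household identifier was specified, and the planting date is assumed to sit in the row names as above):
library(dplyr)
library(tidyr)
library(lubridate)
df <- tibble::rownames_to_column(data, "PlantingDate")
df$ID <- seq_len(nrow(df))  # hypothetical household id
df %>%
  gather("DateOfControl", "Temperature", -c(PlantingDate, ID)) %>%
  mutate(DateOfControl = as.Date(DateOfControl)) %>%
  group_by(ID, PlantingDate, Year = year(DateOfControl), Month = month(DateOfControl)) %>%
  summarise(MeanT = mean(Temperature))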

Simple time series analysis with R: aggregating and subsetting

I want to convert monthly data into quarterly averages. These are my 2 datasets:
gas <- UKgas
dd <- UKDriverDeaths
I was able to accomplish (I think) for the dd data as so:
dd.zoo <- zoo(dd)
ddq <- aggregate(dd.zoo, as.yearqtr, mean)
However I cannot figure out how to do this with the gas data...any help?
Follow-up
When I try to subset the data based on date (1969-1984) the resulting data does not include 1969 Q1 and instead includes 1985 Q1...any suggestions on how to fix this? I was just trying to subset as gas[1969:1984].
Originally I did not plan to post an answer, as it looks like you did not check your UKgas dataset to see that it is already a quarterly time series.
But the follow-up question is worth answering. A "ts" object comes with many handy generic functions. We can use window to easily subset a time series. To extract the section between the first quarter of 1969 and the final quarter of 1984, we can use
window(UKgas, start = c(1969,1), end = c(1984,4))
The result will still be a quarterly time series.
On the other hand, if we use "[" for subsetting, we lose the object class:
class(UKgas[1:12])
#[1] "numeric"

Convert characters to dates in a panel data set

I am downloading and using the following panel data,
# load / install package
library(rsdmx)
library(dplyr)
# Total
Assets.PIT <- readSDMX("http://widukind-api.cepremap.org/api/v1/sdmx/IMF/data/IFS/..Q.BFPA-BP6-USD")
Assets.PIT <- as.data.frame(Assets.PIT)
names(Assets.PIT)[10]<-"A.PI.T"
names(Assets.PIT)[6]<-"Code"
AP<-Assets.PIT[c("WIDUKIND_NAME","Code","TIME_PERIOD","A.PI.T")]
AP<-rename(AP, Country=WIDUKIND_NAME, Year=TIME_PERIOD)
My goal is to convert the column vector Year in the data frame AP into a vector of class Date. In other words, I want R to understand the time series part of my panel data. For your information, I have quarterly data, with unbalanced date ranges across cross-sections (in my case countries).
head(AP$Year)
[1] "2008-Q2" "2008-Q3" "2008-Q4" "2009-Q1" "2009-Q2" "2009-Q3"
Or,
AP$Year<-as.factor(AP$Year)
head(AP$Year)
[1] 2008-Q2 2008-Q3 2008-Q4 2009-Q1 2009-Q2 2009-Q3
264 Levels: 1950-Q1 1950-Q2 1950-Q3 1950-Q4 1951-Q1 1951-Q2 1951-Q3 1951-Q4 1952-Q1 1952-Q2 1952-Q3 ... 2015-Q4
Is there any easy solution to convert these character dates into time-series dates?
library(zoo)
as.Date(as.yearqtr(AP$Year, format = "%Y-Q%q"))
This should do it.
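If you would rather keep a genuine quarterly index than convert to first-of-quarter dates, you can also store the yearqtr values themselves (a side note, not part of the original answer; it assumes Year is still the character vector shown above):
library(zoo)
AP$YearQtr <- as.yearqtr(AP$Year, format = "%Y-Q%q")
head(AP$YearQtr)
# [1] "2008 Q2" "2008 Q3" "2008 Q4" "2009 Q1" "2009 Q2" "2009 Q3"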

calculate Value at Risk in a data frame

My data set has returns for thousands of hedge funds over 140 months, and I was trying to calculate Value at Risk (VaR) using the VaR command in the PerformanceAnalytics package. However, I have run into several questions when using this function. I have created a sample data frame to show my problem.
df=data.frame(matrix(rnorm(24),nrow=8))
df$X1<-c('2007-01','2007-02','2007-03','2007-04','2007-05','2007-06','2007-07','2007-08')
df[2,2]<-NA
df[2,3]<-NA
df[1,3]<-NA
df
I got a data frame:
X1 X2 X3
1 2007-01 -1.4420195 NA
2 2007-02 NA NA
3 2007-03 -0.4503824 -0.78506597
4 2007-04 1.4083746 0.02095307
5 2007-05 0.9636549 0.19584430
6 2007-06 1.1935281 -0.14175623
7 2007-07 -0.3986336 1.58128683
8 2007-08 0.8211377 -1.13347168
I then run
apply(df,2,FUN=VaR, na.rm=TRUE)
and received a warning message:
The data cannot be converted into a time series. If you are trying to pass in names from a data object with one column, you should use the form 'data[rows, columns, drop = FALSE]'. Rownames should have standard date formats, such as '1985-03-15'.
I have tried to convert my data frame into a time series using zoo(), but it didn't help. Can someone help me figure out what I should do now?
@user2893255, you should convert your data frame into an xts object before using the apply function (note that as.Date needs a day, so one is appended to the year-month strings here):
library(xts)
df.xts <- as.xts(df[, 2:3], order.by = as.Date(paste0(df$X1, "-01")))
and then
apply(df.xts,2,FUN=VaR, na.rm=TRUE)
gives you the result without warnings or error messages.
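Since as.Date cannot parse a year-month string such as "2007-01" on its own, an alternative (a sketch, assuming the same df as above) is a zoo yearmon index, which xts supports directly:
library(xts)
library(zoo)
df.xts <- xts(df[, 2:3], order.by = as.yearmon(df$X1, "%Y-%m"))
apply(df.xts, 2, FUN = VaR, na.rm = TRUE)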
Try dropping the Date column:
apply(df[,-1L], 2, FUN=VaR, na.rm=TRUE)

Data aggregation loop in R

I am facing a problem concerning aggregating my data to daily data.
I have a data frame from which NAs have been removed (a link to a picture of the data is given below). Data has been collected 3 times a day, but sometimes, due to NAs, there are just 1 or 2 entries per day; on some days data is missing completely.
I am now interested in calculating the daily mean of "dist": this means summing up the "dist" values for one day and dividing by the number of entries for that day (so 3 if no data is missing that day). I would like to do this via a loop.
How can I do this with a loop? The problem is that sometimes I have 3 entries per day and sometimes just 2 or even 1. I would like to tell R that, for every day, it should sum up "dist" and divide it by the number of entries available for that day.
I just have no idea how to formulate a for loop for this purpose. I would really appreciate if you could give me any advice on that problem. Thanks for your efforts and kind regards,
Jan
Data frame: http://www.pic-upload.de/view-11435581/Data_loop.jpg.html
Edit: I used aggregate and tapply as suggested; however, the mean value per day was not really calculated:
Group.1 x
1 2006-10-06 12:00:00 636.5395
2 2006-10-06 20:00:00 859.0109
3 2006-10-07 04:00:00 301.8548
4 2006-10-07 12:00:00 649.3357
5 2006-10-07 20:00:00 944.8272
6 2006-10-08 04:00:00 136.7393
7 2006-10-08 12:00:00 360.9560
8 2006-10-08 20:00:00 NaN
The code used was:
dates<-Dis_sub$date
distance<-Dis_sub$dist
aggregate(distance,list(dates),mean,na.rm=TRUE)
tapply(distance,dates,mean,na.rm=TRUE)
Don't use a loop. Use R. Some example data:
dates <- rep(seq(as.Date("2001-01-05"),
                 as.Date("2001-01-20"),
                 by = "day"),
             each = 3)
values <- rep(1:16, each = 3)
values[c(4, 5, 6, 10, 14, 15, 30)] <- NA
and any of:
aggregate(values,list(dates),mean,na.rm=TRUE)
tapply(values,dates,mean,na.rm=TRUE)
gives you what you want. See also ?aggregate and ?tapply.
If you want a data frame back, you can look at the plyr package:
Data <- data.frame(dates, values)
require(plyr)
ddply(Data, "dates", summarise, avg = mean(values, na.rm = TRUE))
Keep in mind that ddply does not fully support the date format (yet).
Look at the data.table package, especially if your data is huge. Here is some code that calculates the mean of dist by day.
library(data.table)
dt <- data.table(Data)
dt[, list(avg_dist = mean(dist, na.rm = TRUE)), by = "date"]
It looks like your main problem is that your date field has times attached. The first thing you need to do is create a column that has just the date using something like
Dis_sub$date_only <- as.Date(Dis_sub$date)
Then using Joris Meys' solution (which is the right way to do it) should work.
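Putting those two steps together (a sketch, reusing the column names from the question):
Dis_sub$date_only <- as.Date(Dis_sub$date)
aggregate(Dis_sub$dist, list(Dis_sub$date_only), mean, na.rm = TRUE)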
However, if for some reason you really want to use a loop, you could try something like
newFrame <- data.frame()
for (d in unique(Dis_sub$date_only)) {
  meanDist <- mean(Dis_sub$dist[Dis_sub$date_only == d], na.rm = TRUE)
  newFrame <- rbind(newFrame, data.frame(date = d, meanDist = meanDist))
}
But keep in mind that this will be slow and memory-inefficient.
