Creating time series in R - r

I have a CSV file containing data as follows-
date, group, integer_value
The dates run from 01-January-2013 to 31-October-2015 for the 20 groups contained in the data.
I want to create a time series for each of the 20 groups. But the dates are not continuous and have sporadic gaps, so while
group4series <- ts(group4, frequency = 365.25, start = c(2013,1,1))
works from a programming point of view, it is not correct because of the gaps in the data.
How can I use the 'date' column of the data to create the time series, instead of the usual 'frequency' parameter of the 'ts()' function?
Thanks!

You could use zoo::zoo instead of ts.
Since you don't provide sample data, let's generate daily data, and remove some days to introduce "gaps".
set.seed(2018)
dates <- seq(as.Date("2015/12/01"), as.Date("2016/07/01"), by = "1 day")
dates <- dates[sample(length(dates), 100)]
We construct a sample data.frame
df <- data.frame(
dates = dates,
val = cumsum(runif(length(dates))))
To turn df into a zoo time series, you can do the following:
library(zoo)
ts <- with(df, zoo(val, dates))
Let's plot the time series:
plot.zoo(ts)
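Applied to your own data, a rough sketch (assuming your CSV has the columns date, group and integer_value; the file name and group label below are placeholders to replace) could look like:
library(zoo)
dat <- read.csv("your_file.csv", stringsAsFactors = FALSE)  # placeholder file name
dat$date <- as.Date(dat$date, format = "%d-%B-%Y")          # adjust to your date format
group4 <- subset(dat, group == "group4")                    # placeholder group label
group4series <- zoo(group4$integer_value, order.by = group4$date)
plot.zoo(group4series)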

Related

Data in chronological order with condition (on each variable) and looping in a data frame based on condition (per variable)

I would like to organize the rainfall time series data chronologically for each rain gauge (identified by the code column).
The rain gauge codes are all in the same column, one per row as in the example, and there are columns with the month, year and rainfall information.
After organizing the data, I need to perform statistical tests, and a loop would make this easier given the large number of rain gauges. Is there a way to loop over each rain gauge code in the rain_gauge column, treating it as the variable that the tests are repeated for?
rain_gauge = c(rep(1442032, 40), rep(1442035, 30), rep(1442036, 30),rep(1442039, 45),rep(1442049, 40),rep(1442032, 40),rep(1442045, 35))
year = runif(260, 1978,2020)
month = runif(260,1,12)
rainfall = runif(260, 50,202)
df = data.frame(rain_gauge, year, month, rainfall)#data frame to be organized in chronological order by "code" category
head(df)
#Examples of tests to apply to the series of each rain gauge in the rain_gauge column.
library(modifiedmk)
mmkh(as.vector(subset(df,rain_gauge=="1442032"))$rainfall)
pvalue_mk = mmkh(as.vector(subset(df,rain_gauge=="1442032"))$rainfall)[[2]] # result to be saved in a results data frame
library(tseries)
adf.test(subset(df,rain_gauge=="1442032")$rainfall)
pvalue_df = adf.test(subset(df,rain_gauge=="1442032")$rainfall)$p.value # result to be saved in a results data frame
Many thanks!
Consider by, the object-oriented wrapper to tapply, which allows you to slice a data frame by factor(s) and run a process on each subset, returning either a simplified object (i.e., vector, matrix) or a list of any output:
library(modifiedmk)
library(tseries)
get_pvalues <- function(sub) {
  # order each gauge's series chronologically before testing
  sub <- sub[order(sub$year, sub$month), ]
  mmkh_obj <- mmkh(sub$rainfall)
  adf_obj <- adf.test(sub$rainfall)
  # NAMED VECTOR (second mmkh element is the new p-value, as in the question; adf.test returns a list with a p.value element)
  c(pvalue_mmkh = mmkh_obj[[2]], pvalue_adf = adf_obj$p.value)
}
# LIST OF NAMED VECTORS, ONE PER RAIN GAUGE
pvalues_by <- by(df, df$rain_gauge, get_pvalues)
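If you prefer one table over a by list, a small follow-up (my addition, not part of the original answer) binds the per-gauge vectors into a named matrix:
# NAMED MATRIX: one row per rain gauge, columns pvalue_mmkh and pvalue_adf
pvalues_matrix <- do.call(rbind, pvalues_by)
pvalues_matrix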

different formats of dates using as.POSIXct() or similar

I am writing a function to plot several different data frames, each containing time series.
The data frames have several columns. The first column of each is always "time", and the following ones are the parameters.
Each data frame is different from the others.
Once the data set is imported, the function creates a calendar vector "Time":
Time <- as.POSIXct(TS[,1])
In a for loop, I use the function xts() to create the time series of one single parameter (one by one), using the column "time" to order the time series.
xts(TS[,i],order.by = Time)
Then I plot.
As a result, the script looks like this:
TS <- read.table("ts.txt",header = T, dec = ".")
Time <- as.POSIXct(TS[,1])
for (i in 2:length(TS[1,]))
{
p <- plot(xts(TS[,i], order.by = Time))
print(p)
}
I have problems with as.POSIXct() when the format of the "time" vector in a data frame is not yyyy-mm-dd. Here are a few examples:
In some data frames, "time" contains only "yyyy"; pasting an "mm-dd" onto the "yyyy" would not make sense for these data (the columns, in this case, are the months).
In other situations, I also have negative years because they are BC.
Are there other functions I can use to create calendar dates suitable for xts() from formats like mine?
Here are three examples of data sets I have the problem with:
#1
year <- c(seq(1900, 2000, by = 10))
Jan <- c(rnorm(length(year), mean = 1, 5))
Feb <- c(rnorm(length(year), mean = 6, 9))
TS <- as.data.frame(cbind(year,Jan,Feb))
str(TS)
#2
year <- c(seq(-500, 2000, by = 100))
Jan <- c(rnorm(length(year), mean = 1, 5))
Feb <- c(rnorm(length(year), mean = 6, 9))
TS <- as.data.frame(cbind(year,Jan,Feb))
str(TS)
#3
time <- c("-100/01/01", "-100/06/01", "0/01/01", "1400/01/01", "2000/01/01")
people <- abs(c(rnorm(length(time), mean = 6, 9)))
TS <- as.data.frame(cbind(time,people))
str(TS)
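For what it's worth, one possible workaround (a sketch of my own, not taken from the question): zoo accepts a plain numeric index, so the year-only and BC cases (#1 and #2) can be handled without POSIXct at all:
library(zoo)
# using the TS from example #1 or #2: index the series by the numeric year itself
z_jan <- zoo(TS$Jan, order.by = TS$year)
plot(z_jan)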

R year-quarter for loop

I am looking to loop over my R data frame, which is at year-quarter frequency, and run a rolling regression for every year-quarter. I then use the coefficients from this model to fit values one quarter ahead. Is there a quarterly date format in R I can use for this?
I had a similar issue in Stata (see my earlier Stata year-quarter for-loop question), but am revisiting it in R. Does R have a notion of year-quarters that can easily be used in a loop? For example, one possibly roundabout way is:
months.list <- c("03","06","09","12")
years.list <- c(1992:2007)
## Loop over the month and years
for(yidx in years.list)
{
for(midx in months.list)
{
}
}
I see the zoo package has some functions, but I am not sure which one I can use for my case. Something along the following lines would be ideal:
for (yqidx in 1992Q1:2007Q4){
z <- lm(y ~ x, data = mydata <= yqidx )
}
When I do the look-ahead, I need to handle it so that the predicted value is computed for the next quarter, yqidx + 1, so that 2000Q4 moves to 2001Q1.
If all you need help with is how to generate quarters:
require(data.table)
require(zoo)
months.list <- c("03","06","09","12")
years.list <- c(1992:2007)
#The next line of code generates all the month-year combinations.
df<-expand.grid(year=years.list,month=months.list)
#Then, we paste together the year and month with a day so that we get dates like "2007-03-01". Pass that to as.Date, and pass the result to as.yearqtr.
df$Date=as.yearqtr(as.Date(paste0(df$year,"-",df$month,"-01")))
df<-df[order(df$Date),]
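As an aside (my addition, not part of the original answer): yearqtr values support simple arithmetic, so the "yqidx + 1" step from the question, moving 2000Q4 to 2001Q1, can be written by adding 1/4:
as.yearqtr("2000 Q4") + 1/4  # 2001 Q1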
Then you can use loops if you'd like. I'd personally consider using data.table like so:
require(data.table)
require(zoo)
DT<-data.table(expand.grid(year=years.list,month=months.list))
DT<-DT[order(year,month)]
DT[,Date:=as.yearqtr(as.Date(paste0(year,"-",month,"-01")))]
#Generate fake x values.
DT[,X:=rnorm(64)]
#Generate time index.
DT[,t:=1:64]
#Fake time index.
DT[,Y:=X+rnorm(64)+t]
#Get rid of the year and month columns -unneeded.
DT[,c("year","month"):=NULL]
#Create a second data.table to hold all your models.
Models<-data.table(Date=DT$Date,Index=1:64)
#Generate your (rolling) models. I am assuming you want to use all past observations in each model.
Models[,Model:=list(list(lm(data=DT[1:Index],Y~X+t))),by=Index]
#You can access an individual model thusly:
Models[5,Model]
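To fit one quarter ahead, as the question asks, a small follow-up sketch (my own assumption about the intended use, relying on the objects above):
# predict quarter 6 from the model estimated on quarters 1 through 5
predict(Models[Index == 5, Model][[1]], newdata = DT[6])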

adding multiple data sets and plotting the average in R

I am using R to analyze multiple large data sets. I am trying to add a few together and average them to make a plot. They need to be added together by corresponding dates, but the data sets are not all the same length and did not start or end at the same time. How would I go about adding them together while accounting for the differences in dates? My first thought is to use an if statement saying if date = date, but I'm not sure of the correct process for calling all the files in the folder for comparison.
I have a script that plots one data set at a time and am simply trying to amend it to accomplish this new analysis:
library(openair)
filedir <-"C:/Users/dfmcg/Documents/Thesisfiles/NE"
myfiles <-c(list.files(path = filedir))
paste(filedir,myfiles,sep = '/')
npsfiles<-c(paste(filedir,myfiles,sep = '/'))
print(npsfiles)
for (i in npsfiles[1:3]){
x <- substr(i,54,61)
y<-paste(paste('C:/Users/dfmcg/Documents/Thesisfiles/NEavg',x,sep='/'), 'png', sep='')
png(filename = y)
timeozone<-import(i,date="DATE",date.format = "%m/%d/%Y %H",header=TRUE,na.strings="-999")
ozoneavg <- timeAverage(timeozone, pollutant = c("O3"), avg.time = "month")
timePlot(ozoneavg,pollutant=c("O3"), main = x)
dev.off()
}
Here is some of the data:
ABBR,DATE,O3,SWS,VWS,SWD,VWD,SDWD,TMP,RH,RNF,SOL
SHEN-BM,05/01/1983 00,-999,-999,-999,,-999,-999,-999,-999,-999,-999
SHEN-BM,05/01/1983 01,-999,-999,-999,,-999,-999,-999,-999,-999,-999
SHEN-BM,05/01/1983 02,-999,-999,-999,,-999,-999,-999,-999,-999,-999
Your question is not very clear. Since it is not clear exactly how you would like to add the data frames together and what to average, here is a generic attempt to answer your question.
To read multiple files and merge them into one large data frame:
#read 3 files
basefilename<-"oa_test"
npsfiles<-lapply(1:3, function(i) {read.csv(paste0(basefilename,i,".csv"))})
#merge files into one dataframe
df<-do.call(rbind, npsfiles)
#fix date column
df$DATE<-as.POSIXct(df$DATE, format="%m/%d/%Y %H")
You could use the import function from the openair package here instead of read.csv.
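For example, a sketch reusing the import() call from your original script (assuming npspaths holds the full file paths you built with paste; note that import may already parse the date column for you, in which case the as.POSIXct step above is unnecessary):
library(openair)
npspaths <- paste(filedir, myfiles, sep = '/')
npsdata <- lapply(npspaths, function(f) {
  import(f, date = "DATE", date.format = "%m/%d/%Y %H", header = TRUE, na.strings = "-999")
})
df <- do.call(rbind, npsdata)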
Now, once you have all the data in one data frame, the dplyr package makes it easy to group the data by the various variables and perform descriptive statistics on the groups:
library(dplyr)
#group by DATE and average
ozoneavedate<-summarize(group_by(df, DATE), mean(O3))
#group by ABBR and sum
ozonesumabbr<-summarize(group_by(df, ABBR), sum(O3))
#group by ABBR and DATE and average
ozoneavedateabbr<-summarize(group_by(df, ABBR, DATE), mean(O3))
Hope this helps.
In the future, providing some sample data and what you hope to achieve goes a long way toward soliciting help.

R: aggregate quarter-hourly data to hourly data - different behaviour with same date fields

I am trying to understand why R behaves differently with the "aggregate" function. I wanted to average 15-minute data to hourly data. For this, I passed the 15-minute data, together with a pre-designed "hour" array (the same date repeated 4 times per hour, derived from the original POSIXct array), to the aggregate function.
After some time, I realized that the function behaved oddly (well, probably the data was odd, but why?) when I handed over the date array with
strftime(data.15min$posix, format="%Y-%m-%d %H")
However, if I handed over the data with
cut(data.15min$posix, "1 hour")
the data was averaged correctly.
Below, a minimal example is embedded, including a sample of the data.
I would be happy to understand what I did wrong.
Thanks in advance!
d <- 3
bla <- read.table("test_daten.dat",header=TRUE,sep=",")
data.15min <- NULL
data.15min$posix <- as.POSIXct(bla$dates,tz="UTC")
data.15min$o3 <- bla$o3
hourtimes <- unique(as.POSIXct(paste(strftime(data.15min$posix, format="%Y-%m-%d %H"),":00:00",sep=""),tz="Universal"))
agg.mean <- function (xx, yy, rm.na = T)
# xx: parameter that determines the aggregation: list(xx), e.g. hour etc.
# yy: parameter that will be aggregated
{
aa <- yy
out.mean <- aggregate(aa, list(xx), FUN = mean, na.rm=rm.na)
out.mean <- out.mean[,2]
}
#############
data.o3.hour.mean <- round(agg.mean(strftime(data.15min$posix, format="%m/%d/%y %H"), data.15min$o3), d); data.o3.hour.mean[1:100]
win.graph(10,5)
par(mar=c(5,15,4,2), new =T)
plot(data.15min$posix,data.15min$o3,col=3,type="l",ylim=c(10,60)) # original data
par(mar=c(5,15,4,2), new =T)
plot(data.date.hour_mean,data.o3.hour.mean,col=5,type="l",ylim=c(10,60)) # Wrong
##############
data.o3.hour.mean <- round(agg.mean(cut(data.15min$posix, "1 hour"), data.15min$o3), d); data.o3.hour.mean[1:100]
win.graph(10,5)
par(mar=c(5,15,4,2), new =T)
plot(data.15min$posix,data.15min$o3,col=3,type="l",ylim=c(10,60)) # original data
par(mar=c(5,15,4,2), new =T)
plot(data.date.hour_mean,data.o3.hour.mean,col=5,type="l",ylim=c(10,60)) # Correct
Too long for a comment.
The reason your results look different is that aggregate(...) sorts the results by your grouping variable(s). In the first case,
strftime(data.15min$posix, format="%m/%d/%y %H")
is a character vector with poorly formatted dates (they do not sort properly). So the first row corresponds to the "date" "01/01/96 00".
In your second case,
cut(data.15min$posix, "1 hour")
generates actual POSIXct dates, which sort properly. So the first row corresponds to the date: 1995-11-04 13:00:00.
If you had used
strftime(data.15min$posix, format="%Y-%m-%d %H")
in your first case, you would have gotten the same result as with cut(...).
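As a minimal illustration of the sorting difference (my own example, not from the question or answer):
x <- as.POSIXct(c("1995-11-04 13:00", "1996-01-01 00:00"), tz = "UTC")
sort(strftime(x, format = "%m/%d/%y %H"))  # "01/01/96 00" sorts first (alphabetical, not chronological)
sort(strftime(x, format = "%Y-%m-%d %H"))  # sorts in chronological order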
