Time Series application - Guidance Needed - r

I am relatively new to R and am currently trying to fit a time series model to a data set to predict product volume for the next six months. My data set has two columns, Date (a timestamp) and the volume of product in inventory on that particular day, for example:
Date Volume
24-06-2013 16986
25-06-2013 11438
26-06-2013 3378
27-06-2013 27392
28-06-2013 24666
01-07-2013 52368
02-07-2013 4468
03-07-2013 34744
04-07-2013 19806
05-07-2013 69230
08-07-2013 4618
09-07-2013 7140
10-07-2013 5792
11-07-2013 60130
12-07-2013 10444
15-07-2013 36198
16-07-2013 11268
I need to predict six months of product volume required in inventory after the end date in my data set (the last row is "14-06-2019", "3131076"). I have approximately six years of data, with a start date of 24-06-2013 and an end date of 14-06-2019.
I tried using auto.arima() (from the forecast package) on my data set and got many errors. I started researching ways to make my data suitable for time series analysis and came to know about the imputeTS and zoo packages.
I guess the date has high relevance for choosing the frequency value in the model, so I did this: I created a new column with the weekday of each date and tabulated how often each weekday occurs, which turns out not to be the same:
data1 <- mutate(data, day = weekdays(as.Date(Date)))
> View(data1)
> table(data1$day)

   Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday
      213       214       208       207       206       211       212
There are no missing Volume values against the dates that are present, but we can see from the table above that the weekday counts are not equal, so some dates are missing. How should I proceed with that?
I have hit a kind of dead end; I tried going through various posts here on the imputeTS and zoo packages but didn't get much success.
Can someone please guide me on how to proceed further, and pardon me, admins and users, if you think this is spamming, but it is really important for me at the moment. I tried to go through various time series tutorials outside, but almost all of them use the AirPassengers data set, which I think has no flaws.
Regards
RD
library(imputeTS)
library(dplyr)
library(forecast)
setwd("C:/Users/sittu/Downloads")
data <- read.csv("ts.csv")
str(data)
$ Date : Factor w/ 1471 levels "01-01-2014","01-01-2015",..: 1132 1181 1221 1272 1324 22 71 115 163 213 ...
$ Volume: Factor w/ 1468 levels "0","1002551",..: 379 116 840 706 643 1095 1006 864 501 1254 ...
data$Volume <- as.numeric(data$Volume)
data$Date <- as.Date(data$Date, format = "%d/%m/%Y")
str(data)
'data.frame': 1471 obs. of 2 variables:
$ Date : Date, format: NA NA NA ... ## 1st Error now showing NA instead of dates
$ Volume: num 379 116 840 706 643 ...

Let's try to generate that data set. First, let's reproduce a data set with missing data:
library(dplyr)

dates <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), 1)
volume <- floor(runif(365, min = 2500, max = 50000))
dummy_df <- data.frame(date = dates, Volume = volume)
df <- dummy_df %>% sample_frac(0.8)  # keep a random 80% of the rows
Here we generated a data frame with a date and a volume for each day of 2018, with 20% of the rows dropped at random (sample_frac(0.8)). This should correctly mimic your data set, which is missing data for some days.
What we want from there is to find the days with no volume data:
Df_full_dates <- as.data.frame(dates) %>%
  left_join(df, by = c('dates' = 'date'))
Now you want to replace the NA values (which correspond to days with no data) with a volume. I took 0 here, but if it is genuinely missing data, you might want to put the month average or another specific value; I do not know what suits your data best from your sample:
Df_full_dates[is.na(Df_full_dates)] <- 0
From there, you have a data set with a value for each day, and you should be able to find a model to predict the volume in future months.
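For instance, a minimal sketch with the forecast package, assuming weekly seasonality (frequency = 7) as a starting point; the right frequency for your data is something you would need to check:
library(forecast)
vol_ts <- ts(Df_full_dates$Volume, frequency = 7)  # daily series, weekly cycle assumed
fit <- auto.arima(vol_ts)
fc <- forecast(fit, h = 26 * 7)  # roughly six months of daily forecasts
plot(fc)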
Tell me if you have any questions.

Related

RStudio: Separate YYYY-MM-DD into Individual Columns

I am fairly new to R and I am pulling my hair out trying to do what is probably something super simple.
I downloaded the crime data for Los Angeles from 2010 - 2019. There are 2,114,010 rows of data. Right now, it is called 'df' in my Global Environment area.
I want to manipulate one specific column titled "Occurred" - which is a date reference to when the crime occurred.
Right now, it is set up as YYYY-MM-DD (i.e., 2010-02-20).
I am trying to separate all three into individual columns. I have Googled, and Googled, and Googled and tried and tried and tried things from this forum and StackExchange and just cannot get it to work.
I have tried lubridate and followed instructions in other answers, but it simply won't create new columns (one each for Year, Month, Day).
Here is a bit of the reprex from the dataset ... I did not include all of the different variables, because they aren't the issue.
As mentioned, I am trying to separate 'occurred' into individual Year, Month, and Day columns.
> head(df, 10)[c('dr_no','occurred','time','area_name')]
dr_no occurred time area_name
1 1307355 2010-02-20 1350 Newton
2 11401303 2010-09-12 45 Pacific
3 70309629 2010-08-09 1515 Newton
4 90631215 2010-01-05 150 Hollywood
5 100100501 2010-01-02 2100 Central
6 100100506 2010-01-04 1650 Central
7 100100508 2010-01-07 2005 Central
8 100100509 2010-01-08 2100 Central
9 100100510 2010-01-09 230 Central
10 100100511 2010-01-06 2100 Central
We can do this with tidyverse and lubridate:
library(dplyr)
library(lubridate)

df <- df %>%
  mutate(occurred = as.Date(occurred),
         year = year(occurred),
         month = month(occurred),
         day = day(occurred))
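Alternatively, since occurred arrives as YYYY-MM-DD text, tidyr::separate can split it on the hyphens before any date conversion; a hedged sketch (remove = FALSE keeps the original column alongside the new ones):
library(tidyr)
df <- df %>%
  separate(occurred, into = c("year", "month", "day"),
           sep = "-", remove = FALSE)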

brownian.bridge slow calculation and Error in area.grid[1, 1] : incorrect number of dimensions

I am trying to calculate some BBMM.contours for caribou during a movement period in northern Canada.
I am still in the exploratory phase of using this function and have worked through some tutorials, which worked fine, but now that I am trying my sample data the brownian.bridge function seems to be taking an eternity.
I understand that this is a function that can take a long time to calculate, but I have tried subsetting my data to include fewer and fewer locations, simply to see if the end product is what I want before committing to running the data set with thousands of locations. Currently I only have 34 locations in the subset, and I have waited overnight for it to run without any completion.
When I used some practice panther location data with 1000 locations it took under a minute to run, so I am thinking there is something wrong with my code or my data.
Any help working through this would be greatly appreciated.
# Load packages and data
library(BBMM)
data <- X2017loc

# Used to sort data in code below for all caribou
data$DT <- as.POSIXct(data$TimeStamp, format = '%Y-%m-%d %H:%M:%S')

# Sort data
data <- data[order(data$SAMPLED_ANIMAL_ID, data$DT), ]

# Time difference necessary in BBMM code
### Joel is not sure about this part... timelag is maybe time until GPS upload???
timediff <- diff(data$DT)
data <- data[-1, ]
data$timelag <- as.numeric(abs(timediff))

# Set timelag
data <- data[-1, ]  # Remove first record with wrong timelag
data$SAMPLED_ANIMAL_ID <- factor(data$SAMPLED_ANIMAL_ID)
data <- data[!is.na(data$timelag), ]
data$LONGITUDE <- as.numeric(data$LONGITUDE)
data$LATITUDE <- as.numeric(data$LATITUDE)

BBMM <- brownian.bridge(x = data$LONGITUDE, y = data$LATITUDE,
                        time.lag = data$timelag, location.error = 6,
                        cell.size = 30)
bbmm.summary(BBMM)
Additional information: timelag is in seconds, and the collars have a 6 m location error.
I am not certain what cell.size refers to or how I should determine this number.
SAMPLED_ANIMAL_ID LONGITUDE LATITUDE TimeStamp timelag
218 -143.3138219 68.2468358 2017-05-01 02:00 18000
218 -143.1637592 68.2687447 2017-05-01 07:00 18000
218 -143.0699697 68.3082906 2017-05-01 12:00 18000
218 -142.8352869 68.3182258 2017-05-01 17:00 18000
218 -142.7707111 68.2892111 2017-05-01 22:00 18000
218 -142.5362769 68.3394269 2017-05-02 03:00 18000
218 -142.4734997 68.3459528 2017-05-02 08:00 18000
218 -142.3682272 68.3801822 2017-05-02 13:00 18000
218 -142.2198042 68.4023253 2017-05-02 18:00 18000
218 -142.0235464 68.3968672 2017-05-02 23:00 18000
I would suggest using cell.size = 100 instead of area.grid, since for area.grid you would have to define a single rectangular grid covering all animals (which could increase compute time).
OK, I have answered my original question: I was missing the following code to reproject the lat/long coordinates to UTM.
library(sp)
library(rgdal)  # supplies the spTransform method for sp objects

data <- SpatialPoints(data[, c("LONGITUDE", "LATITUDE")],
                      proj4string = CRS("+proj=longlat +ellps=WGS84"))
data <- spTransform(data, CRS("+proj=utm +zone=7 +ellps=WGS84"))  # zone 7 matches longitudes near -143
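For completeness, a hedged sketch of feeding the reprojected coordinates back into brownian.bridge; the SpatialPoints call above overwrites data and drops the timelag column, so this version keeps the lags aside first and uses the suggested cell.size = 100 (the names pts, tl, and xy are illustrative):
tl <- data$timelag  # keep the time lags before converting the data frame
pts <- SpatialPoints(data[, c("LONGITUDE", "LATITUDE")],
                     proj4string = CRS("+proj=longlat +ellps=WGS84"))
pts <- spTransform(pts, CRS("+proj=utm +zone=7 +ellps=WGS84"))
xy <- coordinates(pts)  # easting/northing in metres
BBMM <- brownian.bridge(x = xy[, 1], y = xy[, 2], time.lag = tl,
                        location.error = 6, cell.size = 100)
bbmm.summary(BBMM)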

Build Graph (Plot) in R: median prices through time intervals

I have a data frame with prices and the ending date of some auctions. I want to check when sales with minimal and maximal prices (and also the median) appear, depending on the time of day.
More precisely, I have the data frame mtest:
> str(mtest)
'data.frame': 9144 obs. of 2 variables:
$ Price : num 178 188 228 305 202 ...
$ EndDateTime: POSIXct, format: "2015-05-25 05:00:59" "2015-05-23 00:06:01" ...
I want to build a graph (plot) with 30-minute time intervals (00:00-00:30, 00:31-01:00, etc.) on the X axis, and the median (or maximal/minimal) price on the Y axis.
Another idea is to draw a simple histogram for each time interval, like hist(mtest$Price, breaks=10, col="red")
How can I do this in the best way?
Try this:
# Build 30-minute breaks spanning the data, then bin each sale into an interval
cutt <- seq(from = min(mtest$EndDateTime), to = max(mtest$EndDateTime), by = 30 * 60)
if (max(mtest$EndDateTime) > max(cutt)) cutt[length(cutt) + 1] <- max(cutt) + 30 * 60
mtest$tint <- cut(mtest$EndDateTime, cutt)

# Min, median and max price per interval
stats <- do.call(rbind, tapply(mtest[, "Price"], mtest[, "tint"],
                               function(p) c(min = min(p), median = median(p), max = max(p))))

bp <- boxplot(mtest[, "Price"] ~ mtest[, "tint"], xaxt = "n",
              col = 1:length(levels(mtest$tint)))
axis(1, at = 1:length(levels(mtest$tint)),
     labels = format.Date(levels(mtest$tint), "%Y-%m-%d %H:%M"),
     las = 2, cex.axis = .5)
stats
Or with plot:
lint <- length(levels(mtest$tint))  # number of time intervals
plot(NA, ylim = range(stats), xlim = c(1, lint), type = "n", xaxt = "n", xlab = "", ylab = "")
sapply(1:3, function(z) points(stats[, z] ~ c(1:lint), col = z))
axis(1, at = 1:lint, labels = format.Date(levels(mtest$tint), "%Y-%m-%d %H:%M"),
     las = 2, cex.axis = .5)
You will have something like this: one min/median/max series (or one boxplot) per 30-minute interval.
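Note that the code above bins by absolute datetime, so each interval belongs to one specific day. If the goal is time-of-day bins pooled across all days (00:00-00:30 regardless of date), a hedged sketch:
# Minutes since midnight, ignoring the date part
mins <- as.numeric(format(mtest$EndDateTime, "%H")) * 60 +
        as.numeric(format(mtest$EndDateTime, "%M"))
mtest$tod <- cut(mins, breaks = seq(0, 24 * 60, by = 30), include.lowest = TRUE)
med <- tapply(mtest$Price, mtest$tod, median)  # median price per 30-minute slot
plot(med, type = "b", xaxt = "n", xlab = "Time of day (30-minute bins)", ylab = "Median price")
axis(1, at = seq_along(med), labels = names(med), las = 2, cex.axis = .5)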

Reducing data in data frame to plot data in R

I'm very new to programming so I apologise in advance for my lack of R know-how. I'm a PhD student interested in pupillometry, and I have just recorded the pupil response of participants performing a listening task in two conditions (Easy and Hard). The pupil response interest period for each trial is around 20 seconds, and I would like to be able to plot this data for each participant in R. The eye tracker sampling rate is 1000 Hz and each participant completed 92 trials, so the data that I currently have for each participant includes close to 2 million rows. I have tried to plot this using ggplot2 but, as expected, the graph is very cluttered.
I've been trying to work out a way of reducing the data so that I can plot it in R. Ideally, I would like to take the mean pupil size value for every 1000 samples (i.e. 1 second of recording), averaged across all 92 trials for each participant. With this information, I would then create a new data frame for plotting the average slope from 1-20 seconds for the two listening conditions (Easy and Hard).
Here is the current structure of my data frame;
> str(ppt53data)
'data.frame': 1915391 obs. of 6 variables:
$ RECORDING_SESSION_LABEL: Factor w/ 1 level "ppt53": 1 1 1 1 1 1 1 1 1 1 ...
$ listening_condition : Factor w/ 2 levels "Easy","Hard": 2 2 2 2 2 2 2 2 2 2 ...
$ RIGHT_PUPIL_SIZE : Factor w/ 3690 levels ".","0.00","1000.00",..: 3266 3264 3263 3262 3262 3260 3257 3254 3252 3252 ...
$ TIMESTAMP : num 262587 262588 262589 262590 262591 ...
$ TRIAL_START_TIME : int 262587 262587 262587 262587 262587 262587 262587 262587 262587 262587 ...
$ TrialTime : num 0 1 2 3 4 5 6 7 8 9 ...
- attr(*, "na.action")=Class 'omit' Named int [1:278344] 873 874 875 876 877 878 879 880 881 882 ...
.. ..- attr(*, "names")= chr [1:278344] "873" "874" "875" "876" ...
The 'TrialTime' variable specifies the sample (i.e. millisecond) in each trial. Can anyone advise me about which step I should take next? I figure it would make sense to arrange my data into separate data frames which would allow me to calculate the mean values that I want (across trials and for every 1000 samples). However, I'm not sure what is the most efficient/best way of doing this.
I'm sorry that I can't be any more specific. Any rough guidance would be much appreciated.
I think for such a large block of data with many aggregation levels you will need to use data.table. I may have mis-structured your data, but hopefully this will give you the idea:
require(data.table)
require(ggplot2)

# 100 patients * 20,000 observations (1-20,000 ms)
ppt53data <- data.frame(
  RECORDING_SESSION_LABEL = paste0("pat-", rep(1:100, each = 20000)),       # patients
  listening_condition = sample(c("Easy", "Hard"), 2000000, replace = TRUE), # Easy/Hard
  RIGHT_PUPIL_SIZE = rnorm(2000000, 3000, 500),                             # pupil size
  TrialTime = rep(1:20000, 100)                                             # ms from start
)

# Group into 1000 ms blocks
ppt53data$group <- cut(ppt53data$TrialTime, c(0, seq(1000, 20000, 1000), Inf))
unique(ppt53data$group)

# Convert frame to table
dt.ppt53data <- data.table(ppt53data)

# Index
setkey(dt.ppt53data, RECORDING_SESSION_LABEL, group)

# Create data.frame of aggregated plot data
plot.data <- data.frame(dt.ppt53data[, list(RIGHT_PUPIL_SIZE = mean(RIGHT_PUPIL_SIZE)),
                                     by = list(group)])

# Plot with ggplot2 (stat = "identity" is an argument of geom_bar, not of aes)
ggplot(plot.data) +
  geom_bar(aes(group, RIGHT_PUPIL_SIZE, fill = group), stat = "identity") +
  theme(axis.text.x = element_text(angle = -90)) +
  coord_cartesian(ylim = c(2995, 3005))
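If per-participant curves are wanted rather than one grand mean, the same aggregation can key on both columns; a small sketch in the same data.table idiom:
# Mean pupil size per participant per 1000 ms block
plot.by.pat <- dt.ppt53data[, list(RIGHT_PUPIL_SIZE = mean(RIGHT_PUPIL_SIZE)),
                            by = list(RECORDING_SESSION_LABEL, group)]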
Some rough guidance:
library(plyr)
ppt53data.summarized <- ddply(ppt53data, .(TrialTime), summarize, mean = mean(RIGHT_PUPIL_SIZE))
This tells it to calculate the mean size of the right pupil for each unique TrialTime. Perhaps seeing how this works would help you figure out how to describe what you need?
Assuming that within each TrialTime there are more than 1000 observations, you can randomly select:
set.seed(42)
ppt53data.summarized <- ddply(ppt53data, .(TrialTime), summarize, mean = mean(sample(RIGHT_PUPIL_SIZE,1000)))
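For reference, a hedged dplyr sketch of the 1-second binning described in the question, split by listening condition; the factor-to-numeric conversion matches the str() output above, where RIGHT_PUPIL_SIZE is a factor:
library(dplyr)
ppt53data %>%
  mutate(RIGHT_PUPIL_SIZE = as.numeric(as.character(RIGHT_PUPIL_SIZE)),  # factor -> numeric
         second = TrialTime %/% 1000) %>%                                # 1000 ms bins
  group_by(listening_condition, second) %>%
  summarise(mean_pupil = mean(RIGHT_PUPIL_SIZE, na.rm = TRUE))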

Aggregating, restructuring hourly time series data in R

I have a year's worth of hourly data in a data frame in R:
> str(df.MHwind_load) # compactly displays structure of data frame
'data.frame': 8760 obs. of 6 variables:
$ Date : Factor w/ 365 levels "2010-04-01","2010-04-02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Time..HRs. : int 1 2 3 4 5 6 7 8 9 10 ...
$ Hour.of.Year : int 1 2 3 4 5 6 7 8 9 10 ...
$ Wind.MW : int 375 492 483 476 486 512 421 396 456 453 ...
$ MSEDCL.Demand: int 13293 13140 12806 12891 13113 13802 14186 14104 14117 14462 ...
$ Net.Load : int 12918 12648 12323 12415 12627 13290 13765 13708 13661 14009 ...
While preserving the hourly structure, I would like to know how to extract
a particular month/group of months
the first day/first week etc. of each month
all Mondays, all Tuesdays etc. of the year
I have tried using cut() without result, and after looking online I think lubridate might be able to do this, but I haven't found suitable examples. I'd greatly appreciate help on this issue.
Edit: a sample of data in the data frame is below:
Date Hour.of.Year Wind.MW datetime
1 2010-04-01 1 375 2010-04-01 00:00:00
2 2010-04-01 2 492 2010-04-01 01:00:00
3 2010-04-01 3 483 2010-04-01 02:00:00
4 2010-04-01 4 476 2010-04-01 03:00:00
5 2010-04-01 5 486 2010-04-01 04:00:00
6 2010-04-01 6 512 2010-04-01 05:00:00
7 2010-04-01 7 421 2010-04-01 06:00:00
8 2010-04-01 8 396 2010-04-01 07:00:00
9 2010-04-01 9 456 2010-04-01 08:00:00
10 2010-04-01 10 453 2010-04-01 09:00:00
.. .. ... .......... ........
8758 2011-03-31 8758 302 2011-03-31 21:00:00
8759 2011-03-31 8759 378 2011-03-31 22:00:00
8760 2011-03-31 8760 356 2011-03-31 23:00:00
EDIT: Additional time-based operations I would like to perform on the same data set:
1. Perform hour-by-hour averaging for all data points, i.e. the average of all values in the first hour of each day of the year. The output will be an "hourly profile" of the entire year (24 time points).
2. Perform the same for each week and each month, i.e. obtain 52 and 12 hourly profiles respectively.
3. Do seasonal averages, for example for June to September.
Convert the date to a format which lubridate understands and then use the functions month(), mday() and wday() respectively.
Suppose you have a data.frame with the time stored in column Date, then the answer for your questions would be:
library(lubridate)

### Dummy data.frame
df <- data.frame(Date = c("2012-01-01", "2012-02-15", "2012-03-01", "2012-04-01"), a = 1:4)

## 1. Select rows for a particular month
subset(df, month(Date) == 1)

## 2a. Select the first day of each month
subset(df, mday(Date) == 1)

## 2b. Select the first week of each month
## Get the week numbers which contain the first day of a month
wkd <- subset(week(df$Date), mday(df$Date) == 1)
## Select the weeks with those numbers
subset(df, week(Date) %in% wkd)

## 3. Select all Mondays (by default wday() numbers Sunday as 1, so Monday is 2)
subset(df, wday(Date) == 2)
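For the hourly-profile tasks in the EDIT, a hedged sketch along the same lines, assuming the datetime column shown in the data sample is POSIXct:
library(lubridate)
## 24-point hourly profile of the entire year
aggregate(Wind.MW ~ hour(datetime), data = df.MHwind_load, FUN = mean)
## 12 monthly profiles of 24 points each
aggregate(Wind.MW ~ hour(datetime) + month(datetime), data = df.MHwind_load, FUN = mean)
## Seasonal average, June to September
mean(subset(df.MHwind_load, month(datetime) %in% 6:9)$Wind.MW)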
1. First switch to a Date representation: as.Date(df.MHwind_load$Date)
2. Then call weekdays on the date vector to get a new factor labelled with day of week
3. Then call months on the date vector to get a new factor labelled with name of month
4. Optionally create a years variable (see below).
5. Now subset the data frame using the relevant combination of these.
Step 2 gets an answer to your task 3. Steps 3 and 4 get you to task 1. Task 2 might require a line or two of R. Or just select rows corresponding to, say, all the Mondays in a month and call unique, or its alter ego duplicated, on the results.
To get you going...
newdf <- df.MHwind_load  ## build an augmented data set
newdf$d <- as.Date(newdf$Date)
newdf$month <- months(newdf$d)
newdf$day <- weekdays(newdf$d)

## For some reason R has no years function. Here's one:
years <- function(x) { format(as.Date(x), format = "%Y") }
newdf$year <- years(newdf$d)

# Get observations from January to March of every year
subset(newdf, month %in% c('January', 'February', 'March'))

# Get all Monday observations
subset(newdf, day == 'Monday')

# Get all Mondays in 1999
subset(newdf, day == 'Monday' & year == '1999')

# Slightly fancier: _first_ Monday of each month
# Get the first occurrence of each month/weekday pair
first.monday.of.month <- !duplicated(cbind(newdf$month, newdf$day))
# Now pull out the Mondays
subset(newdf, first.monday.of.month & day == 'Monday')
Since you're not asking about the time (hourly) part of your data, it is best to store your data as a Date object. Otherwise, you might be interested in chron, which also has some convenience functions like the ones you'll see below.
With respect to Conjugate Prior's answer, you should store your date data as a Date object. Since your data already follows the default format ('yyyy-mm-dd') you can just call as.Date on it; otherwise, you would have to specify the string format. I would also use as.character on your factor to make sure you don't get errors inline. I know I've run into problems with factors-into-Dates for that reason (possibly corrected in the current version).
df.MHwind_load <- transform(df.MHwind_load, Date = as.Date(as.character(Date)))
Now you would do well to create wrapper functions that extract the information you desire. You could use transform like I did above to simply add those columns that represent months, days, years, etc, and then subset on them logically. Alternatively, you might do something like this:
getMonth <- function(x, mo) {    # This function assumes a within-single-year vector
  isMonth <- month(x) %in% mo    # Boolean of matching months (month() is from lubridate)
  return(x[which(isMonth)])      # Return vector of matching months
}                                # end function
Or, in short form:
getMonth <- function(x, mo) x[month(x) %in% mo]
This is just a tradeoff between storing that information up front (transform the frame) or computing it when desired (use accessor methods).
A more complicated process is your need for, say, the first day of a month. This is not entirely difficult, though. Below is a function that will return all of those values, but it is rather simple to just subset a sorted vector of values for a given month and take their first one.
getFirstDay <- function(x, mo) {
  isMonth <- months(x) %in% mo
  x <- sort(x[isMonth])                 # Look at only those in the desired month,
                                        # sorted by date; we only want the first day
  nFirsts <- rle(as.numeric(x))$len[1]  # Number of observations on the 1st day
  return(x[seq(nFirsts)])
}                                       # end function
The easier alternative would be
getFirstDayOnly <- function(x, mo) {sort(x[months(x) %in% mo])[1]}
I haven't prototyped these, as you didn't provide any data samples, but this is the sort of approach that can help you get the information you desire. It is up to you to figure out how to put these into your work flow. For instance, say you want to get the first day for each month of a given year (assuming we're only looking at one year; you can create wrappers or pre-process your vector to a single year beforehand).
# Return a vector of first days for each month
df <- transform(df, date = as.Date(as.character(date)))
sapply(unique(months(df$date)),  # Iterate through the months present in the dates
       function(month) { getFirstDayOnly(df$date, month) })
The above could also be designed as a separate convenience function that uses the other accessor function. In this way, you create a series of direct but concise methods for getting the pieces of information you want. Then you simply pull them together into simple, easy-to-interpret functions that you can use in your scripts to get precisely what you desire in the most efficient manner.
You should be able to use the above examples to figure out how to prototype other wrappers for accessing the date information you require. If you need help on those, feel free to ask in a comment.
