Need to create a loop for ARIMA (p,d,q) in R

I need to find a way to run my ARIMA code in a loop, across multiple files or within a data frame.
This is for a thesis project I am working on. However, given the sample size, running the code one by one will be too tedious and time-consuming. Is there a way I can get the code below to work in a loop format if I had all my observations in a data frame?
Or alternatively, how can I get it to read multiple files and run the same code automatically?
library(readxl)
X104485 <- read_excel("Wits Business School/Thesis/Trial 1/Pilot - Data Files/104485.xlsx")
library(forecast)
myts <- ts(X104485, start=c(2015, 1), end=c(2019, 5), frequency=12)
fit <- arima(myts, order=c(1, 1, 36))
fcast <- forecast(fit,31)
write.csv(fcast, file = "X104485.csv")
The code above works - I just need it to be efficient in running several iterations at a time.
Please help.
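Here is a minimal sketch of such a loop, assuming the observations sit in a hypothetical data frame `all_series` with one series per column. It uses base R's `arima`/`predict` with a small order so the sketch runs quickly; the original `forecast::forecast(fit, 31)` call would slot into the same place:

```r
# Sketch: run the same ARIMA fit over every column of a data frame.
# `all_series` is a hypothetical stand-in for your data; each column is one series.
set.seed(1)
all_series <- data.frame(s1 = cumsum(rnorm(53)), s2 = cumsum(rnorm(53)))

forecasts <- lapply(names(all_series), function(nm) {
  myts <- ts(all_series[[nm]], start = c(2015, 1), frequency = 12)
  fit  <- arima(myts, order = c(1, 1, 0))  # smaller order than (1, 1, 36), to keep the sketch fast
  fc   <- predict(fit, n.ahead = 31)       # forecast::forecast(fit, 31) would go here instead
  # one output file per series (tempdir() here; use your own folder in practice)
  write.csv(data.frame(point = fc$pred), file.path(tempdir(), paste0(nm, ".csv")))
  fc
})
names(forecasts) <- names(all_series)

# For many Excel files instead, loop over the file names the same way:
# files <- list.files("Pilot - Data Files", pattern = "\\.xlsx$", full.names = TRUE)
# results <- lapply(files, function(f) { x <- readxl::read_excel(f); ... })
```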

Related

Regression Analysis

I am currently working on a project and I need some help. I want to predict the length of flight delays using a statistical model. The data set does not contain the length of the delays, but it can be calculated from the actual and scheduled departure times: actual departure time minus scheduled departure time gives the flight delay, which is the dependent variable. I am struggling to get the explanatory (independent) variables into a useful form for regression analysis; the main problem is the time format of the first two columns when you read the table in from the csv file. I have attached the data file to the question because I wasn't too sure how to attach my file; I'm new to this coding thing hehe. Any help will be appreciated. xx
https://drive.google.com/file/d/11BXmJCB5UGEIRmVkM-yxPb_dHeD2CgXa/view?usp=sharing
EDIT:
Firstly, thank you for all the help.
Okay, I'm going to try to ask more precise questions on this topic:
So after importing the file using:
1)
Delays <- read.table("FlightDelaysSM.csv",header =T,sep=",")
2) The main issue I am having is getting the columns schedtime and deptime into a format where I can do arithmetic calculations
3)I tried the below
Delays[,1] - Delays[,2]
where the obvious issue arises: for example, 800 (8am) - 756 (7:56am) = 44, not 4 minutes
4) Using the help from @kerry Jackson (thank you, you're amazing x) I tried
DepartureTime <- strptime(formatC(Delays$deptime, width = 4, format = "d", flag = "0"), "%H%M")
ScheduleTime <- strptime(formatC(Delays$schedtime, width = 4, format = "d", flag = "0"), "%H%M")
DelayTime = DepartureTime - ScheduleTime
The values given are in seconds; I want the difference to be in minutes. How would I go about doing this?
5) I then did the following:
DelayData <- data.frame(ScheduleTime, DepartureTime, DelayTime, Delays[, 4:7])
What I obtain after making the DelayData:
As you can see from the image, the DelayTime column has the seconds units, which I don't want as stated in 4), and the date appears in the ScheduleTime and DepartureTime columns. Could I possibly get some suggestions on how to correct this?
Create a new column called flight_delay:
install.packages('tidyverse')
library(tidyverse)
your_data <- your_data %>%
  mutate(flight_delay = deptime - schedtime)
Now, create a linear regression model predicting flight_delay by every other variable:
mod <- lm(flight_delay ~ ., data=your_data)
To optimize your model, use the step function:
mod <- step(mod)
Analyze results:
summary(mod)
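One caveat with subtracting the raw HHMM integers, which the asker noted in 3) above: 800 - 756 gives 44, not 4. A small sketch of the strptime/difftime route that yields minutes directly, using hypothetical `schedtime`/`deptime` vectors in the data's HHMM format:

```r
# Sketch: parse HHMM integers as times, then take the difference in minutes.
# `schedtime` / `deptime` are hypothetical vectors in the data's HHMM format.
schedtime <- c(800, 1435)
deptime   <- c(756, 1501)

# zero-pad to 4 digits ("0756"), then parse as hours + minutes
to_time <- function(x) strptime(formatC(x, width = 4, flag = "0"), format = "%H%M")

# units = "mins" gives the difference in minutes rather than seconds
delay_min <- as.numeric(difftime(to_time(deptime), to_time(schedtime), units = "mins"))
delay_min  # -4 and 26: a 4-minute early departure and a 26-minute delay
```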

how to get tsclean working on data frame with multiple time series

I'm in the process of creating a forecast based on the hts package, but before getting that far I need to clean the data for outliers and missing values.
For this I thought of using the tsclean function in the forecast package. My data is stored in a data frame with multiple columns (time series) that I wish to clean. I can get the function to work with a single time series, but since I have quite a lot of them, I'm looking for a smart way to do this.
When running the code:
SFA5 <- ts(SFA4, frequency=12, start=c(2012,1), end=c(2017,10))
ggt <- tsclean(SFA5[1:70, 1:94], replace.missing = TRUE)
I get this error message:
Error in na.interp(x, lambda = lambda) : The time series is not univariate.
The data is here:
https://www.dropbox.com/s/dow2jpuv5unmtgd/Data1850.xlsx?dl=0
My question is: what am I doing wrong, or is the only solution a loop sequence?
The error message suggests that the function takes univariate time series as its first argument only. So you need to apply tsclean to each column, as you might have guessed.
library(forecast)
# sapply over a matrix (which a multivariate ts is) visits every element,
# so convert to a data frame first to apply tsclean column by column
ggt <- sapply(as.data.frame(SFA5[1:70, 1:94]), tsclean, replace.missing = TRUE)

R - Forecasting function not working

Hi and thanks in advance.
So I'm trying very hard to get a working forecasting function going in R, but I'm having no luck.
Here's what I need to accomplish:
1) Extract 2 sets of data from a single txt file (one being a quantity value, the other being a time). The number of records must be able to vary (e.g. 4 quantity values at 4 different times, or 5, or 6, etc.).
For this I have tried to use a data frame, but the later forecasting function won't accept it for some reason.
2) Put the data into a forecasting function in order to generate new 'forecasted' data (btw, the method of forecasting does not matter at all [I just can't use a mean forecast, as it is too simple] - something like a naive or an rwf forecast will work just fine).
3) I want to save the newly produced 'forecasted' data to another txt file for storage.
Here is my code so far:
I used these lines to create and save my sample data (I currently have 18 records):
library(forecast)
library(ggplot2)
library(reshape2)
Quantity <- c(5,3,8,4,0,5,2,7,4,2,6,8,4,7,8,9,4,6)
Time <- c("2010/01/01", "2010/07/02", "2010/08/03", "2011/02/04", "2011/11/05", "2011/12/06", "2012/06/07", "2012/08/30", "2013/04/16", "2013/03/18", "2014/02/22", "2014/01/27", "2015/12/15", "2015/09/08", "2016/05/04", "2017/11/07", "2017/09/22", "2017/04/04")
Frame <- data.frame(Time,Quantity)
write.table(Frame,file="....path..../Frame.txt",quote=F)
I then used this line to put that data into (hopefully) a data frame, or any container that could hold both of the above vectors' data:
Frame <- read.table("....path..../Frame.txt")
To be sure, I attempted to plot the data (though I don't strictly need to), just to see if the program had properly read my data (to no avail):
There were 4 plot attempts (unfortunately none succeeded properly):
1-
plot.ts(Frame)
2-
Frame <- read.table("....path..../Frame.txt")
3-
Frame <- window(start = 2000, end = 2019)
autoplot(Frame) + autolayer(meanf(Frame,h=11),PI=FALSE, series="Means") + autolayer(naive(Frame, h=11),PI=FALSE,series="Naive") + ggtitle("Quantity vs Time") + xlab("Time") + ylab("Quantity") + guides(colour=guide_legend(title="Forecast"))
4-
plot(Frame, xlab="Time",ylab="Quantity",main="Stock Quantity vs Time",type='l')
I have yet to reach the part where I need to send the new 'forecasted' data to a new txt file, so I don't have any attempt code for that.
Any help is appreciated. Thank you.
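A minimal base-R sketch of the whole pipeline, under one assumption: ts() needs observations at regular intervals, so the irregular dates are first aggregated to yearly totals. forecast::naive would give the same point forecasts as the repeat-last-value line below:

```r
# 1) Build the sample data
Quantity <- c(5,3,8,4,0,5,2,7,4,2,6,8,4,7,8,9,4,6)
Time <- as.Date(c("2010/01/01","2010/07/02","2010/08/03","2011/02/04","2011/11/05",
                  "2011/12/06","2012/06/07","2012/08/30","2013/04/16","2013/03/18",
                  "2014/02/22","2014/01/27","2015/12/15","2015/09/08","2016/05/04",
                  "2017/11/07","2017/09/22","2017/04/04"))
Frame <- data.frame(Time, Quantity)

# 2) ts() needs regular intervals, so aggregate the irregular dates to yearly totals
yearly <- aggregate(Quantity ~ format(Time, "%Y"), data = Frame, FUN = sum)
myts   <- ts(yearly$Quantity, start = 2010, frequency = 1)
# plot.ts(myts) now works, since myts is a proper univariate time series

# Naive forecast: repeat the last observed value h steps ahead
h  <- 3
fc <- rep(tail(yearly$Quantity, 1), h)

# 3) Save the forecasted data to a txt file (tempdir() here; use your own path)
out <- data.frame(Year = 2018:(2018 + h - 1), Forecast = fc)
write.table(out, file = file.path(tempdir(), "Forecast.txt"),
            quote = FALSE, row.names = FALSE)
```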

Is there a way to parallelize summary functions running over loop?

For an input data frame
input <- data.frame(col1 = seq(1, 10000), col2 = seq(1, 10000),
                    col3 = seq(1, 10000), col4 = seq(1, 10000))
I have to run the following summaries stored in another Data frame
summary <- data.frame(Summary_name = c('Col1_col2', 'Col3_Col4', 'Col2_Col3'),
                      ColIndex = c("1,2", "3,4", "2,3"))
#summary
#  Summary_name ColIndex
#     Col1_col2      1,2
#     Col3_Col4      3,4
#     Col2_Col3      2,3
I have the following function to run the aggregates
library(stringr)  # str_split comes from stringr
loopSum <- function(input, summary){
  for(i in seq(1, nrow(summary))){
    summary$aggregate[i] <- sum(input[, as.numeric(unlist(str_split(summary$ColIndex[i], ',')))])
  }
  return(summary)
}
My requirement is to run the sums used in loopSum in parallel, i.e. I would like to run all the summaries in one shot and thus reduce the total time taken for the function to create them. Is there a way to do this?
My actual scenario requires me to create summary statistics over hundreds of columns for each Summary_name in the summary data.frame, so I am looking for the most optimized way to do this. Any help is much appreciated.
Does this improve the running time?
library(tidyr)
input1 <- colSums(input)
summary1 <- separate(summary, "ColIndex", into=c("X1", "X2"), sep=",", convert = TRUE)
summary$aggregate <- input1[summary1$X1] + input1[summary1$X2]
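A base-R variant of the same idea that also handles summaries spanning any number of columns: compute colSums once, then add up the requested totals per summary row. strsplit is the base-R stand-in for str_split, and for very large jobs parallel::mclapply could replace the sapply:

```r
input <- data.frame(col1 = 1:10000, col2 = 1:10000,
                    col3 = 1:10000, col4 = 1:10000)
summary_df <- data.frame(Summary_name = c("Col1_col2", "Col3_Col4", "Col2_Col3"),
                         ColIndex = c("1,2", "3,4", "2,3"),
                         stringsAsFactors = FALSE)

totals <- colSums(input)                   # one pass over the data
idx <- strsplit(summary_df$ColIndex, ",")  # list of column indices per summary row
summary_df$aggregate <- sapply(idx, function(i) sum(totals[as.numeric(i)]))
```

Because the expensive pass over the data happens only once in colSums, each summary row then costs just a handful of additions, however many rows there are.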

Need to calculate Rate of Change of two data sets over time individually and Net rate of Change

What I have done so far:
df <- read.csv("filename.csv", header=TRUE)
df$header1
df$header2
Now I want to calculate the rate of change: header1 is dates, header2 is prices.
Rate of change by date for all values relative to the preceding date.
I want to generate two separate rate-of-change columns by performing the same operation on another file.
Once the rate of change is available for both data sets, I will look to find the net rate of change for each date.
Where I am now:
df<-read.csv("audusd.csv", head = TRUE)
df$Date
df$Close
rate<- 100*diff(df$Close/df[-nrow(df),]$Close
This executes and then I get this:
> rate<- 100*diff(df$Close/df[-nrow(df),]$Close
+
+
In the console.
Thanks for all the help till now, please help further. :)
Also, I am a bit confused about whether I can get the results into a file, or do I have to run the code each time?
This would certainly help my work in a big way, if I can understand and start using it.
You can also use the diff(...) function, which allows you to avoid looping through rows:
rate <- 100*diff(df$price)/df[-nrow(df),]$price
diff(...) calculates the difference row(n) - row(n-1). df[-nrow(df),] returns all rows of df except the last.
Here's a complete example with working code.
# all this just to get sample data.
# daily close for AAPL from 2013.01.01 - today
library(tseries)
library(zoo)
ts <- get.hist.quote(instrument="AAPL",
                     start="2013-01-01", end="2014-01-01",
                     quote="AdjClose", provider="yahoo", origin="1970-01-01",
                     compression="d", retclass="zoo")
df <- data.frame(ts)
df <- data.frame(date=as.Date(rownames(df)),price=df$AdjClose)
df <- df[!is.na(df$price),]
# calculate daily rate of change...
rate <- 100*diff(df$price)/df[-nrow(df),]$price
plot(df[-nrow(df),]$date,rate,type="l",xlab="2013",ylab="Pct. Change",main="APPL")
Given what you said in the comments about how you would do it in excel here is the R equivalent.
dataset <- (1:10)^3  # data; in your case, one of your columns
ratedata <- NULL     # initialize so R can put data in this object
for(i in 1:(length(dataset)-1)){
  ratedata[i] <- (dataset[i+1] - dataset[i]) / dataset[i] * 100  # a simple for loop doing what you specified
}
ratedata  # return the rate data so you can look at it
FYI:This only works if your time is at regular intervals.
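On the follow-up about saving results: write the computed rates to a csv once, and later read.csv() that file back instead of rerunning the code. A small sketch with hypothetical Date/Close columns standing in for the audusd.csv data; the net rate of change across two files could then be, on one reading, the difference of the two rate columns after merging by date:

```r
# Hypothetical stand-in for df <- read.csv("audusd.csv")
df <- data.frame(Date  = as.Date("2020-01-01") + 0:4,
                 Close = c(100, 102, 101, 105, 110))

# Percent change relative to the preceding date (same formula as above)
rate <- 100 * diff(df$Close) / df[-nrow(df), ]$Close

# Save alongside the dates they belong to; read.csv() recovers it later
# (tempdir() here; use your own folder in practice)
out <- data.frame(Date = df$Date[-1], RateOfChange = rate)
write.csv(out, file = file.path(tempdir(), "audusd_rates.csv"), row.names = FALSE)
```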
