Problems with FORECASTING in R - r

I have a problem with forecasting in R.
First of all, this is an example of the original dataset (CW_data_noNA):
Loading date Year Built Vessel Type Cargo Size Week
2019-08-22 2011 Medium 30000 34
2019-09-01 2004 Aframax 80000 35
2019-08-30 2005 Panamax 60000 35
2019-09-01 2000 VLCC 270000 35
2019-08-29 2001 VLCC 270000 35
2019-09-03 2003 Suezmax 130000 36
2019-08-26 2002 Medium 30000 34
I have to create a weekly time series (showing the total number of fixed ships and the cargo capacity) and then use the naïve and simple moving average methods to produce a one-week-ahead forecast.
Weekly_base <- CW_data_noNA %>% group_by(Week) %>% summarize(Number_of_fix = n(),cargo_capacity = sum(`Cargo Size`))
Weekly_ts <- ts(Weekly_base, start = c(2019, 32), frequency = 52)
demand_training <- window(Weekly_ts, start = c(2019,32), end=c(2019,41))
demand_test <- window(Weekly_ts, start = c(2019,42))
naive(demand_training, h=1)
The problem with the code above is that it forecasts the week itself rather than the variables (number of fixtures and cargo capacity). This is what the result looks like:
Point Forecast Lo 80 ....
2019.788 42 -23879066 ....
Can someone help me? Thank you.

In the line where you generate your Weekly_ts, you're currently supplying the whole data frame, i.e.
Weekly_ts <- ts(Weekly_base, start = c(2019, 32), frequency = 52)
I guess the help of naive (?naive) is a bit ambiguous(?), as it states that y should be
a numeric vector or time series of class ts
and you definitely supplied an object of class ts. However, in this case you supplied multiple series when it is expecting just the one. Simply select the one you want and it should forecast the correct series
relevant_variable <- Weekly_base %>%
select(cargo_capacity)#change cargo_capacity to Number_of_fix to change variable
Weekly_ts <- ts(relevant_variable, start = c(2019, 32), frequency = 52)
Or, more directly:
Weekly_ts <- ts(Weekly_base$cargo_capacity, start = c(2019, 32), frequency = 52)
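To make the two requested methods concrete, here is a minimal base R sketch, assuming toy data modelled on the sample rows in the question (the column name cargo_size, the starting week, and the window k are illustrative choices, not part of the original code). A naïve one-week-ahead forecast is simply the last observed value, and a simple moving average forecast is the mean of the last k observations.

```r
# Toy data modelled on the question's sample rows (illustrative values)
cw <- data.frame(
  Week       = c(34, 35, 35, 35, 35, 36, 34),
  cargo_size = c(30000, 80000, 60000, 270000, 270000, 130000, 30000)
)

# Weekly totals: summed cargo capacity and number of fixtures per week
weekly <- aggregate(cargo_size ~ Week, data = cw, FUN = sum)
weekly$n_fix <- as.vector(table(cw$Week)[as.character(weekly$Week)])

# One univariate series per variable (here: cargo capacity only)
cargo_ts <- ts(weekly$cargo_size, start = c(2019, 34), frequency = 52)

# Naive forecast: next week equals the last observed value
naive_fc <- as.numeric(tail(cargo_ts, 1))

# Simple moving average forecast: mean of the last k observations
k <- 2  # assumed window length
sma_fc <- mean(tail(as.numeric(cargo_ts), k))

naive_fc
sma_fc
```

With the forecast package loaded, naive(cargo_ts, h = 1) should give the same point forecast as naive_fc, plus prediction intervals.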

Related

Manipulating data for Regression Model using dplyr in R

I have data like this.
library(lubridate)
set.seed(2021)
gen_date <- seq(ymd_h("2021-01-01-00"), ymd_h("2021-09-30-23"), by = "hours")
hourx <- hour(gen_date)
datex <- date(gen_date)
sales <- round(runif(length(datex), 10, 50), 0)*100
mydata <- data.frame(datex, hourx, sales)
head(mydata)
# datex hourx sales
#1 2021-01-01 0 2800
#2 2021-01-01 1 4100
#3 2021-01-01 2 3800
#4 2021-01-01 3 2500
#5 2021-01-01 4 3500
#6 2021-01-01 5 3800
tail(mydata)
# datex hourx sales
#6547 2021-09-30 18 3900
#6548 2021-09-30 19 3600
#6549 2021-09-30 20 3000
#6550 2021-09-30 21 4700
#6551 2021-09-30 22 4700
#6552 2021-09-30 23 3600
I have a task to build a linear regression model, but with tricky data. Assume we have data from January to March and need to use it to forecast April. Here are the steps:
We use the January and February data as independent variables (X) and the March data as the dependent variable (Y) to build the regression model. Because February has the fewest days (28), we cut the January and March data to 28 days too.
data_jan <- mydata[1:672,]
data_feb <- mydata[745:1416,]
data_mar <- mydata[1417:2088,]
Fit the regression model using the lm function:
mydata_reg <- data.frame(x1 = data_jan$sales,
x2 = data_feb$sales,
y = data_mar$sales)
model_reg <- lm(y~., data = mydata_reg)
After fitting the model, we use the February and March data as the new independent variables (X):
mydata_reg_for <- data.frame(x1 = data_feb$sales,
x2 = data_mar$sales)
pred_data_apr <- predict(model_reg, newdata = mydata_reg_for)
Check the length of the month. Because April has 30 days and we only have 28 days of forecast data, we still need 2 more days to complete the forecast. February only has 28 days, so we use the first two dates from March, "2021-03-01" and "2021-03-02". March has 31 days, so we don't need to borrow anything; we just add "2021-03-29" and "2021-03-30".
data_feb_add <- mydata[1417:1464,]
data_mar_add <- mydata[2089:2136,]
mydata_reg_add <- data.frame(x1 = data_feb_add$sales,
x2 = data_mar_add$sales)
After that, we predict with the same model_reg as before and combine all the April forecasts:
pred_data_apr_add <- predict(model_reg, newdata = mydata_reg_add)
data_apr <- c(as.numeric(pred_data_apr), as.numeric(pred_data_apr_add))
My question is: how do we make this process run automatically every month using the dplyr package, given that every month has a different number of days? I used the February data here because it has the fewest days; the same condition applies to other months. Many thanks.
If you want to control the number of days after each month (or in each month) you could filter by the date not the row numbers.
I'm sure it can be tidied up more than this, but you would just need to change forecast_date <- as.Date("2021-04-01") to whichever month you want to forecast.
library(lubridate)
## set the forecast month. This should be straightforward to automate with a list or an increment
forecast_date <- as.Date("2021-04-01") # April
## get the forecast month length. This would be used for the data_feb_add and data_mar_add step.
forecast_month_length <- unname(days_in_month(forecast_date)) # 30 days
## get dates for the previous 3 months
month_1_date <- forecast_date %m-% months(3)
month_2_date <- forecast_date %m-% months(2)
month_3_date <- forecast_date %m-% months(1)
## find the shortest month in that time range.
shortest_month <- min(c(days_in_month(month_1_date),
                        days_in_month(month_2_date),
                        days_in_month(month_3_date))) # 28 days
## select the first 28 days (the shortest month) for each of the months used for the variables
## seq() keeps the Date class, so %in% compares dates with dates
data_month_1 <- mydata[mydata$datex %in% seq(month_1_date, by = "day", length.out = shortest_month),]
data_month_2 <- mydata[mydata$datex %in% seq(month_2_date, by = "day", length.out = shortest_month),]
data_month_3 <- mydata[mydata$datex %in% seq(month_3_date, by = "day", length.out = shortest_month),]
## select the number of days needed for each month for the forecast data (30 days for April)
month_2_forecast_length <- mydata[mydata$datex %in% seq(month_2_date, by = "day", length.out = forecast_month_length),]
month_3_forecast_length <- mydata[mydata$datex %in% seq(month_3_date, by = "day", length.out = forecast_month_length),]
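The date-based selection above can be wrapped into a single function so that only the forecast month changes. The following is a rough base R sketch under stated assumptions: the helper names first_of_month, days_in, take_days, and forecast_month are hypothetical, the data has the question's datex and sales columns with 24 rows per day, and the forecast date is always the first day of a month. It does not handle the case where a predictor window would reach into the forecast month itself (for example, forecasting March from a 28-day February).

```r
# Hypothetical helper: first day of the month `offset` months away from d
first_of_month <- function(d, offset) {
  seq(d, by = paste(offset, "month"), length.out = 2)[2]
}

# Hypothetical helper: number of days in the month that starts at d
days_in <- function(d) as.integer(format(first_of_month(d, 1) - 1, "%d"))

# Hypothetical helper: all rows for the n_days days beginning at `start`
take_days <- function(data, start, n_days) {
  data[data$datex >= start & data$datex < start + n_days, ]
}

forecast_month <- function(data, forecast_date) {
  m1 <- first_of_month(forecast_date, -3)
  m2 <- first_of_month(forecast_date, -2)
  m3 <- first_of_month(forecast_date, -1)
  n_short <- min(days_in(m1), days_in(m2), days_in(m3))  # shortest month
  n_fc    <- days_in(forecast_date)                      # days to forecast

  # Training data: equal-length windows from the three previous months
  train <- data.frame(
    x1 = take_days(data, m1, n_short)$sales,
    x2 = take_days(data, m2, n_short)$sales,
    y  = take_days(data, m3, n_short)$sales
  )
  model <- lm(y ~ x1 + x2, data = train)

  # Prediction data: windows as long as the forecast month
  newdata <- data.frame(
    x1 = take_days(data, m2, n_fc)$sales,
    x2 = take_days(data, m3, n_fc)$sales
  )
  as.numeric(predict(model, newdata = newdata))
}

# Example data in the same shape as the question's mydata
set.seed(2021)
stamps <- seq(as.POSIXct("2021-01-01 00:00", tz = "UTC"),
              as.POSIXct("2021-09-30 23:00", tz = "UTC"), by = "hour")
mydata <- data.frame(datex = as.Date(stamps),
                     sales = round(runif(length(stamps), 10, 50)) * 100)

pred_apr <- forecast_month(mydata, as.Date("2021-04-01"))
length(pred_apr)  # one value per hour of April
```

Looping over several months is then a matter of calling forecast_month with a vector of first-of-month dates, for instance via lapply.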
You can simply split the data by month with group_split:
mydata %>%
  group_split(month(datex))
This splits mydata into a list of data frames, one per month (nine for this dataset, which runs from January to September).

Converting a data frame into TS object in R

I have a dataframe that looks like this:
DAY X1996 X1997
1 1-Jul 98 86
2 2-Jul 97 90
3 3-Jul 97 93
....
I want to end up with a TS object so that I can do HoltWinters smoothing on it. I think I want it to look like this (though I'm not sure because I haven't done HoltWinters before):
Day Year Temp
1-Jul 1996 98
2-Jul 1996 97
3-Jul 1996 97
...
1-Jul 1997 86
2-Jul 1997 90
3-Jul 1997 93
This is what I'm trying to do:
df <- read.delim("temps.txt")
myts <- as.ts(df)
But this doesn't look close to what I'll need for a HoltWinters model. I've looked all over Stack Overflow and the docs for ts and zoo, and I'm stuck on how to create this ts object. A push in the right direction will be much appreciated.
ts objects are normally used with monthly, quarterly or annual data, not daily data; however, if we remove Feb 29th then we can create a ts object whose times are the year plus a fraction 0/365, 1/365, ..., 364/365 which will be regularly spaced if there are no missing dates. The key point is that if the seasonality is based on a year then we must have the same number of points in each year to represent it as a ts object.
First convert to a zoo object z0 having an ordinary Date, remove Feb 29th giving z, create the time index described above in a zoo object zz and then convert that to ts.
library(data.table)
library(lubridate)
library(zoo)
m <- melt(as.data.table(df), id.vars = 1)
z0 <- with(m, zoo(value, as.Date(paste(variable, DAY), "X%Y %d-%b")))
z <- z0[! (month(time(z0)) == 2 & day(time(z0)) == 29)]
tt <- time(z)
zz <- zoo(coredata(z), year(tt) + (yday(tt) - ((month(tt) > 2) & leap_year(tt)) - 1)/365)
as.ts(zz)
Remove Dec 31 in leap years
Above we removed Feb 29th in leap years but an alternate approach would be to remove Dec 31st in leap years giving slightly simpler code which avoids the need to use leap_year as we can simply remove any day for which yday is 366. z0 is from above.
zz0 <- z0[yday(time(z0)) <= 365]
tt <- time(zz0)
zz <- zoo(coredata(zz0), year(tt) + (yday(tt) - 1) / 365)
as.ts(zz)
Aggregating to Monthly
Another approach would be to reduce the data to monthly data. Then it is relatively straightforward since ts has facilities to represent monthly data. Below we used the last point in each month but we could use the mean value or another scalar summary if desired.
ag <- aggregate(z0, as.yearmon, tail, 1) # use last point in each month
as.ts(ag)
Note
df in the question made into a reproducible form is the following (however, we would need to fill it out with more data to avoid generating a ts object with many NAs).
df <- structure(list(DAY = structure(1:3, .Label = c("1-Jul", "2-Jul",
"3-Jul"), class = "factor"), X1996 = c(98L, 97L, 97L), X1997 = c(86L,
90L, 93L)), class = "data.frame", row.names = c("1", "2", "3"
))
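Since the goal in the question is HoltWinters smoothing, here is a small base R sketch of the final step. It is only an illustration under assumptions: the data is synthetic, every year contributes the same number of days (n_days), the ts frequency is set to that count so that one "season" is one year's block of days, and the smoothing parameters are fixed at arbitrary values to keep the example deterministic. Leaving alpha, beta, and gamma unspecified lets HoltWinters estimate them instead.

```r
set.seed(1)
n_days <- 30  # assumed number of recorded days per year

# Synthetic wide data in the same layout as the question: DAY, X1996, X1997, ...
wide <- data.frame(DAY = seq_len(n_days))
for (yr in 1996:1999) {
  wide[[paste0("X", yr)]] <-
    80 + 10 * sin(2 * pi * seq_len(n_days) / n_days) + rnorm(n_days)
}

# Stack the year columns end to end: all of 1996, then 1997, and so on
temps <- unlist(wide[-1], use.names = FALSE)

# One cycle per year, with n_days observations in each cycle
temp_ts <- ts(temps, start = c(1996, 1), frequency = n_days)

# Additive Holt-Winters with fixed (illustrative) smoothing parameters
fit <- HoltWinters(temp_ts, alpha = 0.3, beta = 0.1, gamma = 0.3)

# Forecast the next year's block of days
pred <- predict(fit, n.ahead = n_days)
```

The fitted object can be inspected with plot(fit), and fit$SSE gives the sum of squared one-step-ahead errors.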

R Weekly Time Series Object

I have the following vector, which contains data for each day of December.
vector1 <- c(1056772, 674172, 695744, 775040, 832036,735124,820668,1790756,1329648,1195276,1267644,986716,926468,828892,826284,749504,650924,822256,3434204,2502916,1262928,1025980,1828580,923372,658824,956916,915776,1081736,869836,898736,829368)
Now I want to create a time series object on a weekly basis and used the following code snippet:
weeklyts = ts(vector1,start=c(2016,12,01), frequency=7)
However, the starting and end points are not correct. I always get the following time series:
> weeklyts
Time Series:
Start = c(2017, 5)
End = c(2021, 7)
Frequency = 7
[1] 1056772 674172 695744 775040 832036 735124 820668 1790756 1329648 1195276 1267644 986716 926468 828892 826284 749504
[17] 650924 822256 3434204 2502916 1262928 1025980 1828580 923372 658824 956916 915776 1081736 869836 898736 829368
Does anybody know what I am doing wrong?
To get a time series that starts and ends as you would expect, you need to think about the time series. You have 31 days from December 2016.
The start option of ts takes 2 numbers, not 3. So use something like c(2016, 1) if you start with month 1 in 2016. See the following example.
ts(1:12, start = c(2016, 1), frequency = 12)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2016 1 2 3 4 5 6 7 8 9 10 11 12
Now ts and daily data is an annoyance. ts cannot handle leap years. That is why you see people using a frequency of 365.25 to get an annual time series. To get a good December 2016 series we can do the following:
ts(vector1, start = c(2016, 336), frequency = 366)
Time Series:
Start = c(2016, 336)
End = c(2016, 366)
Frequency = 366
[1] 1056772 674172 695744 775040 832036 735124 820668 1790756 1329648 1195276 1267644 986716 926468 828892 826284 749504
[17] 650924 822256 3434204 2502916 1262928 1025980 1828580 923372 658824 956916 915776 1081736 869836 898736 829368
Note the following things that are going on:
frequency is 366 because 2016 is a leap year
start is c(2016, 336), because "2016-12-01" is day 336 of the year
Personally I use the xts package (and zoo) to handle daily data and use the functions in xts to aggregate to weekly time series. These can then be used with packages that expect ts time series, like forecast.
edit: added small xts example
my_df <- data.frame(dates = seq.Date(as.Date("2016-12-01"), as.Date("2017-01-31"), by = "day"),
var1 = rep(1:31, 2))
library(xts)
my_xts <- xts(my_df[, -1], order.by = my_df$dates)
# rollup to weekly. Dates shown are the last day in the weekperiod.
my_xts_weekly <- period.apply(my_xts, endpoints(my_xts, on = "weeks"), colSums)
head(my_xts_weekly)
[,1]
2016-12-04 10
2016-12-11 56
2016-12-18 105
2016-12-25 154
2017-01-01 172
2017-01-08 35
Depending on your needs you can transform this back into data.frames etc etc. Read the help for period.apply as you can specify your own functions in the rolling mechanism. And read the xts (and zoo) vignettes.
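For comparison, the same weekly roll-up can be sketched in base R without xts, assuming it is acceptable to label each week by its Monday rather than by its last day. The values 1:31 stand in for the December data so the sums are easy to check by hand.

```r
# Daily dates for December 2016 with placeholder values 1..31
dates  <- seq(as.Date("2016-12-01"), as.Date("2016-12-31"), by = "day")
values <- 1:31

# cut() on Dates with breaks = "week" groups by calendar week,
# with weeks starting on Monday by default
week_start <- cut(dates, breaks = "week")

# Sum the daily values within each week
weekly_totals <- tapply(values, week_start, sum)
weekly_totals
```

The first week is partial (1 December 2016 was a Thursday), which is why its total covers only four days.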

R filtering/selecting data by POSIXct time and a condition

I have measured temperature at a high time resolution of 10 minutes on different urban tree species whose reactions are to be compared, so I am particularly interested in periods of heat. The task I fail to do on my dataset is to select complete days based on a maximum value. E.g., days with at least one measurement above 30 °C should be subset from my data frame in their entirety.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether each individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether they should pick a day or not when producing a new data frame. If the value is above 30 °C at any time of the day, I want to include that whole date, from 00:00 to 23:59, in the new data frame for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
Time = tseq,
Temp = sample(20:35,1000, replace = TRUE),
Variable1 = sample(1:200,1000, replace = TRUE),
Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example is yielding a Dataframe analog to the structure of my Data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a New Dataframe where all the days are taken where at least one entry is indicated as "heat"?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how could I tell it to take the whole day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions on a working solution to my problem. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable that specifies the day (dropping hours, minutes, etc.). Then iterate over the unique dates and keep only those subsets whose heat30 column contains "heat" at least once:
library(dplyr)
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
newdf <- lapply(unique(Measurings$Time2), function(x){
  rr <- Measurings %>% filter(Time2 == x) # select the rows for date x
  # keep the day only if its heat30 vector contains "heat" at least once
  if(any(rr$heat30 == "heat")) rr else NULL
}) %>% bind_rows()
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
# break the time stamp into Day and Hour
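# (format() guarantees a "date time" string even at midnight; as_data_frame() is deprecated)
time_df <- as.data.frame(str_split(format(Measurings$Time, "%Y-%m-%d %H:%M:%S"), " ", simplify = TRUE))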
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 hourly observations between 20 and 35 for temperature, spanning about 42 days. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is always good practice to set a seed to ensure reproducibility.
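Because the random example rarely produces a day without heat, a small deterministic sketch may make the behaviour easier to see. It assumes three days of hourly data with a single reading above 30 °C on the second day, and uses base R only; the names Day, hot_days, and heat_days_df are illustrative.

```r
# Three days of hourly data at a constant 25 degrees C
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
Measurings <- data.frame(
  Time = seq(from = start, length.out = 72, by = "hours"),
  Temp = 25
)
Measurings$Temp[30] <- 32  # one hot reading, at 05:00 on the second day

# Calendar day of each measurement (in the time zone of Time)
Measurings$Day <- format(Measurings$Time, "%Y-%m-%d")

# Days that contain at least one reading above 30 degrees C
hot_days <- unique(Measurings$Day[Measurings$Temp > 30])

# Keep every row belonging to one of those days
heat_days_df <- Measurings[Measurings$Day %in% hot_days, ]

# A dplyr equivalent would be:
# Measurings %>% group_by(Day) %>% filter(any(Temp > 30)) %>% ungroup()
```

Here heat_days_df holds all 24 rows of 2018-05-19, including the 23 readings that are themselves below the threshold.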

hydrological year time series

Currently I am working on a river discharge data analysis. I have the daily discharge record from 1935 to now. I want to extract the annual maximum discharge for each hydrological year (from 01/11 to 31/10 of the following year). However, I found that the hydroTSM package can only deal with the natural year. I tried to use the "zoo" package, but I found it difficult to compute, as each year has a different number of days. Does anyone have an idea? Thanks.
the data looks like:
01-11-1935 663
02-11-1935 596
03-11-1935 450
04-11-1935 381
05-11-1935 354
06-11-1935 312
my code:
mydata<-read.table("discharge")
colnames(mydata) <- c("date","discharge")
library(zoo)
z<-zooreg(mydata[,2],start=as.Date("1935-11-1"))
mydata$date <- as.POSIXct(mydata$date)
q.month<-daily2monthly(z,FUN=max,na.rm = TRUE,date.fmt = "%Y-%m-%d",out.fmt="numeric")
q.month.plain=coredata(q.month)
z.month<-zooreg(q.month.plain,start=1,frequency=12)
With dates stored in a vector of class Date, you can just use cut() and tapply(), like this:
## Example data
df <- data.frame(date = seq(as.Date("1935-01-01"), length = 100, by = "week"),
flow = (runif(n = 100, min = 0, max = 1000)))
## Use vector of November 1st dates to cut data into hydro-years
breaks <- seq(as.Date("1934-11-01"), length=4, by="year")
df$hydroYear <- cut(df$date, breaks, labels=1935:1937)
## Find the maximum flow in each hydro-year
with(df, tapply(flow, hydroYear, max))
# 1935 1936 1937
# 984.7327 951.0440 727.4210
## Note: whenever using `cut()`, I take care to double-check that
## I've got the cuts exactly right
cut(as.Date(c("1935-10-31", "1935-11-01")), breaks, labels=1935:1937)
# [1] 1935 1936
# Levels: 1935 1936 1937
Here is a one-liner to do that.
First convert the dates to "yearmon" class. This class represents a year month as the sum of a year as the integer part and a month as the fractional part (Jan = 0, Feb = 1/12, etc.). Add 2/12 to shift November to January and then truncate to give just the years. Aggregate over those. Although the test data we used starts at the beginning of the hydro year this solution works even if the data does not start on the beginning of the hydro year.
# test data
library(zoo)
z <- zooreg(1:1000, as.Date("2000-11-01")) # test input
aggregate(z, as.integer(as.yearmon(time(z)) + 2/12), max)
This gives:
2001 2002 2003
365 730 1000
Try the xts package, which works together with zoo:
require(zoo)
require(xts)
dates = seq(Sys.Date(), by = 'day', length = 365 * 3)
y = cumsum(rnorm(365 * 3))
serie = zoo(y, dates)
# if you need to specify `start` and `end`
# serie = window(serie, start = "2015-06-01")
# xts function
apply.yearly(serie, FUN = max)
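Note that grouping by calendar year does not by itself give hydrological years. A minimal base R sketch of the hydrological grouping, assuming the year is labelled by the calendar year in which it ends (so 1 November 1935 to 31 October 1936 is hydro-year 1936), and using a simple increasing series as stand-in discharge data:

```r
# Three full hydrological years of daily stand-in data
dates     <- seq(as.Date("1935-11-01"), as.Date("1938-10-31"), by = "day")
discharge <- seq_along(dates)

# Calendar year and month of each observation
yr <- as.integer(format(dates, "%Y"))
mo <- as.integer(format(dates, "%m"))

# November and December belong to the following hydrological year
hydro_year <- yr + (mo >= 11)

# Annual maximum per hydrological year
annual_max <- tapply(discharge, hydro_year, max)
annual_max
```

Because the grouping is computed from the dates themselves, leap years and records that start mid-year need no special handling.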

Resources