Error plotting a zoo-class series in R

I am new to the Stack Overflow community and this is the first question I ask. Please let me know if I did something wrong.
Here is my situation. I am dealing with Australian electricity prices, and my time series looks like this. It is high-frequency data sampled every 30 minutes.
> head(t)
VIC NSW QLD SNOWY SA
1999-01-01 00:00:00 26.84 24.29 26.52 26.20 29.87
1999-01-01 00:30:00 30.52 27.64 19.34 29.74 36.01
1999-01-01 01:00:00 28.74 26.64 17.47 28.34 35.70
1999-01-01 01:30:00 27.94 25.81 17.08 27.43 31.67
1999-01-01 02:00:00 20.90 19.94 15.84 20.86 22.42
1999-01-01 02:30:00 20.26 19.48 15.68 20.28 21.38
> tail(t)
VIC NSW QLD SNOWY SA
2006-12-31 21:00:00 14.59 15.10 13.72 15.35 29.60
2006-12-31 21:30:00 14.77 15.42 14.12 15.61 28.79
2006-12-31 22:00:00 14.12 15.01 13.54 15.06 20.59
2006-12-31 22:30:00 15.15 16.19 15.10 16.21 17.44
2006-12-31 23:00:00 15.17 16.14 15.48 16.18 17.84
2006-12-31 23:30:00 16.96 17.14 16.37 17.63 20.20
> class(t)
[1] "xts" "zoo"
I was trying to come up with mean prices for each time point across a day. So I did this:
> half.hourly.means <- aggregate(t$VIC, list(format(index(t), "%H:%M")), FUN = mean)
> head(half.hourly.means)
00:00 26.99938
00:30 24.67273
01:00 21.78190
01:30 26.46662
02:00 21.27931
02:30 18.57727
> tail(half.hourly.means)
21:00 27.86881
21:30 26.65468
22:00 23.51793
22:30 25.68527
23:00 23.26385
23:30 30.01726
> class(half.hourly.means)
[1] "zoo"
This outcome is the average price at each time point across the sample period. Everything worked fine so far. But when I tried to plot it, an error occurred.
> plot(half.hourly.means)
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
2: In min(x) : no non-missing arguments to min; returning Inf
3: In max(x) : no non-missing arguments to max; returning -Inf
When I tried this
> plot(as.numeric(half.hourly.means), type = "l")
(I failed to post an image for lack of reputation, sorry)
It yields the correct plot, but with meaningless x-axis values. So my question is: what can be done to produce the above graph with x-axis values of "00:00", "00:30", "01:00", ...?
Thanks for your patience!
Best regards,
Wei
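One way to get those labels, sketched below with made-up means standing in for half.hourly.means (the values and axis labels are invented for illustration): plot the values against a plain index, suppress the default x axis, and draw it manually with axis().

```r
# Hypothetical half-hourly means standing in for half.hourly.means
vals <- c(26.99, 24.67, 21.78, 26.47, 21.28, 18.58)
labs <- c("00:00", "00:30", "01:00", "01:30", "02:00", "02:30")

# Plot against a plain index, suppress the x axis, then label it by hand
plot(seq_along(vals), vals, type = "l",
     xaxt = "n", xlab = "Time of day", ylab = "Mean price")
axis(1, at = seq_along(vals), labels = labs)
```

With the real zoo object, `vals <- as.numeric(half.hourly.means)` and `labs <- index(half.hourly.means)` should slot in the same way; the original error arises because the "HH:MM" character index cannot be coerced to numeric plot coordinates.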

Related

Dividing table without using split in R

I have a datatable for a time period of 21 days with data measured every 10 seconds which looks like
TimeStamp ActivePower CurrentL1 GeneratorRPM RotorRPM WindSpeed
2017-03-05 00:00:10 2183.650 1201.0 1673.90 NA 10.60
2017-03-05 00:00:20 2216.200 1224.0 1679.70 NA 11.00
2017-03-05 00:00:30 2176.500 1203.5 NA 16.05 11.90
---
2017-03-25 23:59:40 2024.20 1150.0 1687.00 16.15 10.35
2017-03-25 23:59:50 1959.05 1106.0 1661.15 15.90 8.65
2017-03-26 00:00:00 1820.55 1038.0 1665.70 15.80 9.20
I want to divide it into 30-minute blocks. My colleague said I shouldn't use the split function, since the data can also have timestamps with no data, and that I should manually build a 30-minute interval duration instead.
I have done this so far:
library(data.table)
library(dplyr)
library(tidyr)
datei <- file.choose()
data_csv <- fread(datei)
datatable1 <- as.data.table(data_csv)
datatable1 <- datatable1[turbine=="UTHA02",]
datatable1[, TimeStamp:=as.POSIXct(get("_time"), tz="UTC")]
setkey(datatable1, TimeStamp)
startdate <- datatable1[1,TimeStamp]
enddate <- datatable1[nrow(datatable1), TimeStamp]
durationForInterval <- 30*60 #in seconds
curr <- startdate
datatable1[TimeStamp >= curr & TimeStamp < curr + durationForInterval]
So I manually made a 30-minute interval duration and got the first interval:
time ActivePower CurrentL1 GeneratorRPM RotorRPM WindSpeed
1: 2017-03-05 00:00:10 2183.65 1201.0 1673.90 NA 10.60
2: 2017-03-05 00:00:20 2216.20 1224.0 1679.70 NA 11.00
3: 2017-03-05 00:00:30 2176.50 1203.5 NA 16.05 11.90
4: 2017-03-05 00:00:40 2267.95 1256.5 1685.85 NA 10.60
5: 2017-03-05 00:00:50 2533.15 1408.0 1693.30 16.20 12.40
---
176: 2017-03-05 00:29:20 2750.35 1531.0 1694.40 16.20 11.45
177: 2017-03-05 00:29:30 2930.40 1630.5 1668.25 NA 12.65
178: 2017-03-05 00:29:40 2459.55 1367.0 1680.25 15.90 12.15
179: 2017-03-05 00:29:50 2713.80 1508.5 1681.15 16.20 12.25
180: 2017-03-05 00:30:00 2395.20 1333.0 1667.75 16.00 11.75
But I could only do it for the first interval and I don't know how to do it for the rest. Is there something that I am missing, or am I overthinking it? Any help is appreciated!
This will create a column interval with a unique value for every 30 minutes.
datatable1[, interval := as.integer(TimeStamp) %/% (30L * 60L)]
You could split on that column or use it for grouping operations.
split(datatable1, datatable1$interval) # or split(datatable1, by = "interval")
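A self-contained sketch of that idea, with toy 10-minute readings standing in for datatable1 (the column names and values here are made up):

```r
library(data.table)
set.seed(1)

# Toy data: one reading every 10 minutes, spanning two hours
dt <- data.table(
  TimeStamp   = seq(as.POSIXct("2017-03-05 00:00:10", tz = "UTC"),
                    by = "10 min", length.out = 12),
  ActivePower = rnorm(12, mean = 2200, sd = 100)
)

# Integer division of the epoch seconds gives one id per 30-minute block
dt[, interval := as.integer(TimeStamp) %/% (30L * 60L)]

# Either split into a list of blocks ...
blocks <- split(dt, by = "interval")

# ... or summarise per block directly
dt[, .(meanPower = mean(ActivePower), n = .N), by = interval]
```

Because every timestamp maps to a block id, gaps in the data simply produce missing block ids rather than misaligned intervals.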

Evaluating Prophet model in R, using cross-validation

I am trying to cross-validate a Prophet model in R.
The problem: this package does not work well with monthly data.
I managed to build the model and even used a custom monthly seasonality, as recommended by the authors of this tool.
But I cannot cross-validate monthly data. I tried to follow the recommendations in the GitHub issue, but I am missing something.
Currently my code looks like this
model1_cv <- cross_validation(model1, initial = 156, period = 365/12, as.difftime(horizon = 365/12, units = "days"))
Update:
Based on the answer to this question, I visualized the CV results. There are some problems here. I used both full data and partial data.
Also, the metrics do not look that good.
I just tested a bit with training data from the package, and from what I understood the package is not really well suited for monthly forecasts. This part: [...] as.difftime(365/12, units = "days") [...] seems to be there just to provide the size of a month in days (30-something). Meaning you can use this instead of just 365/12 for "period" and/or "horizon". One thing I noticed is that both arguments are described as integers in the documentation, but when you look into the function they are passed through as.difftime(), so they are actually doubles.
library(dplyr)
library(prophet)
library(data.table)
#training data
df <- data.table::fread("ds y
1992-01-01 146376
1992-02-01 147079
1992-03-01 159336
1992-04-01 163669
1992-05-01 170068
1992-06-01 168663
1992-07-01 169890
1992-08-01 170364
1992-09-01 164617
1992-10-01 173655
1992-11-01 171547
1992-12-01 208838
1993-01-01 153221
1993-02-01 150087
1993-03-01 170439
1993-04-01 176456
1993-05-01 182231
1993-06-01 181535
1993-07-01 183682
1993-08-01 183318
1993-09-01 177406
1993-10-01 182737
1993-11-01 187443
1993-12-01 224540
1994-01-01 161349
1994-02-01 162841
1994-03-01 192319
1994-04-01 189569
1994-05-01 194927
1994-06-01 197946
1994-07-01 193355
1994-08-01 202388
1994-09-01 193954
1994-10-01 197956
1994-11-01 202520
1994-12-01 241111
1995-01-01 175344
1995-02-01 172138
1995-03-01 201279
1995-04-01 196039
1995-05-01 210478
1995-06-01 211844
1995-07-01 203411
1995-08-01 214248
1995-09-01 202122
1995-10-01 204044
1995-11-01 212190
1995-12-01 247491
1996-01-01 185019
1996-02-01 192380
1996-03-01 212110
1996-04-01 211718
1996-05-01 226936
1996-06-01 217511
1996-07-01 218111")
df <- df %>%
dplyr::mutate(ds = as.Date(ds))
model <- prophet::prophet(df)
(tscv.myfit <- prophet::cross_validation(model, horizon = 365/12, units = "days", period = 365/12, initial = 365/12 * 12 * 3))
y ds yhat yhat_lower yhat_upper cutoff
1: 175344 1995-01-01 170988.8 170145.9 171828.0 1994-12-31 02:00:00
2: 172138 1995-02-01 178117.4 176975.2 179070.2 1995-01-30 12:00:00
3: 201279 1995-03-01 211462.8 210277.4 212670.8 1995-01-30 12:00:00
4: 196039 1995-04-01 200113.9 198079.5 201977.8 1995-03-01 22:00:00
5: 210478 1995-05-01 202100.5 200390.8 203797.9 1995-04-01 08:00:00
6: 211844 1995-06-01 208330.5 206229.9 210497.4 1995-05-01 18:00:00
7: 203411 1995-07-01 202563.8 200786.5 204313.0 1995-06-01 04:00:00
8: 214248 1995-08-01 214639.6 212748.3 216461.3 1995-07-01 14:00:00
9: 202122 1995-09-01 204954.0 203048.9 206768.4 1995-08-31 12:00:00
10: 204044 1995-10-01 205097.5 203209.7 206882.3 1995-09-30 22:00:00
11: 212190 1995-11-01 213586.7 211728.1 215617.6 1995-10-31 08:00:00
12: 247491 1995-12-01 251518.8 249708.2 253589.2 1995-11-30 18:00:00
13: 185019 1996-01-01 182403.7 180520.1 184494.7 1995-12-31 04:00:00
14: 192380 1996-02-01 184722.9 182772.7 186686.9 1996-01-30 14:00:00
15: 212110 1996-03-01 205020.1 202823.2 206996.9 1996-01-30 14:00:00
16: 211718 1996-04-01 214514.0 211891.9 217175.3 1996-03-31 14:00:00
17: 226936 1996-05-01 218845.2 216133.8 221420.4 1996-03-31 14:00:00
18: 217511 1996-06-01 218672.2 216007.8 221459.9 1996-05-31 14:00:00
19: 218111 1996-07-01 221156.1 218540.7 224184.1 1996-05-31 14:00:00
The cutoffs are not as regular as one would expect. I guess this is due to using average days per month somehow, though I could not figure out the logic. You can replace 365/12 with as.difftime(365/12, units = "days") and you will get the same result.
But if you use (365+365+365+366)/48 instead, to account for 29 February, you get a slightly different average number of days per month, and this leads to a different output:
(tscv.myfit_2 <- prophet::cross_validation(model, horizon = (365+365+365+366)/48, units = "days", period = (365+365+365+366)/48, initial = (365+365+365+366)/48 * 12 * 3))
y ds yhat yhat_lower yhat_upper cutoff
1: 172138 1995-02-01 178117.4 177075.3 179203.9 1995-01-29 13:30:00
2: 201279 1995-03-01 211462.8 210340.5 212607.3 1995-01-29 13:30:00
3: 196039 1995-04-01 200113.9 198022.6 202068.1 1995-03-31 13:30:00
4: 210478 1995-05-01 204100.2 202009.8 206098.7 1995-03-31 13:30:00
5: 211844 1995-06-01 208330.5 206114.5 210515.8 1995-05-31 13:30:00
6: 203411 1995-07-01 202606.0 200319.1 204663.4 1995-05-31 13:30:00
7: 214248 1995-08-01 214639.6 212684.4 216495.7 1995-07-31 22:30:00
8: 202122 1995-09-01 204954.0 203127.7 206951.0 1995-08-31 09:00:00
9: 204044 1995-10-01 205097.5 203285.3 207036.5 1995-09-30 19:30:00
10: 212190 1995-11-01 213586.7 211516.8 215516.2 1995-10-31 06:00:00
11: 247491 1995-12-01 251518.8 249658.3 253590.1 1995-11-30 16:30:00
12: 185019 1996-01-01 182403.7 180359.7 184399.2 1995-12-31 03:00:00
13: 192380 1996-02-01 184722.9 182652.4 186899.8 1996-01-30 13:30:00
14: 212110 1996-03-01 205020.1 203040.3 207171.9 1996-01-30 13:30:00
15: 211718 1996-04-01 214514.0 211942.6 217252.6 1996-03-31 13:30:00
16: 226936 1996-05-01 218845.2 216203.1 221506.5 1996-03-31 13:30:00
17: 217511 1996-06-01 218672.2 215823.9 221292.4 1996-05-31 13:30:00
18: 218111 1996-07-01 221156.1 218236.7 223862.0 1996-05-31 13:30:00
From this behaviour I would say the workaround is not ideal, especially depending on how exact you want the cross-validation to be in terms of rolling months. If you need the cutoff points to be exact, you could write your own function that always predicts one month from the starting point, collect these results, and build the final comparison. I would trust this approach more than the workaround.
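A sketch of that exact-cutoff idea, with lm() standing in for the prophet fit/predict pair purely to keep the example self-contained and fast (the data and the stand-in model are invented; on real data you would fit prophet::prophet on each training slice instead):

```r
set.seed(1)
# Toy monthly data in prophet's ds/y layout
df <- data.frame(ds = seq(as.Date("1992-01-01"), by = "month", length.out = 55),
                 y  = 150000 + 1000 * seq_len(55) + rnorm(55, sd = 5000))

# Exact monthly cutoffs: train on everything up to the cutoff,
# then predict exactly one calendar month ahead
cutoffs <- df$ds[37:54]
cv <- do.call(rbind, lapply(cutoffs, function(co) {
  train  <- df[df$ds <= co, ]
  target <- seq(co, by = "month", length.out = 2)[2]  # first day of next month
  test   <- df[df$ds == target, ]
  fit    <- lm(y ~ as.numeric(ds), data = train)      # stand-in for prophet
  data.frame(cutoff = co, ds = test$ds, y = test$y,
             yhat = unname(predict(fit, newdata = test)))
}))
head(cv)
```

Every cutoff then falls exactly on a month boundary, unlike the fractional 365/12-day steps above.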

Why is the time series data being plotted backwards in R?

I am stuck on why this is happening and have tried searching everywhere for the answer. When I try to plot a time series object in R, the resulting plot comes out in reverse.
I have the following code:
library(sqldf)
stock_prices <- read.csv('~/stockPrediction/input/REN.csv')
colnames(stock_prices) <- tolower(colnames(stock_prices))
colnames(stock_prices)[7] <- 'adjusted_close'
stock_prices <- sqldf('SELECT date, adjusted_close FROM stock_prices')
head(stock_prices)
date adjusted_close
1 2014-10-20 3.65
2 2014-10-17 3.75
3 2014-10-16 4.38
4 2014-10-15 3.86
5 2014-10-14 3.73
6 2014-10-13 4.09
tail(stock_prices)
date adjusted_close
1767 2007-10-15 8.99
1768 2007-10-12 9.01
1769 2007-10-11 9.02
1770 2007-10-10 9.06
1771 2007-10-09 9.06
1772 2007-10-08 9.08
But when I try the following code:
stock_prices_ts <- ts(stock_prices$adjusted_close, start=c(2007, 1), end=c(2014, 10), frequency=12)
plot(stock_prices_ts, col='blue', lwd=2, type='l')
The image that results is:
And even if I reverse the time series object with this code:
plot(rev(stock_prices_ts), col='blue', lwd=2, type='l')
I get this
which has arbitrary numbers.
Any idea why this is happening? Any help is much appreciated.
This happens because your object loses its time-series structure once you apply the rev function.
For example :
set.seed(1)
gnp <- ts(cumsum(1 + round(rnorm(100), 2)),
start = c(1954, 7), frequency = 12)
gnp ## gnp has a real time-series structure
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1954 0.37 1.55 1.71 4.31 5.64 5.82
1955 7.31 9.05 10.63 11.32 13.83 15.22 15.60 14.39 16.51 17.47 18.45 20.39
1956 22.21 23.80 25.72 27.50 28.57 27.58 29.20 30.14 30.98 30.51 31.03 32.45
1957
rev(gnp) ## the reversed object is just a plain vector
[1] 110.91 110.38 110.60 110.17 110.45 108.89 106.30 104.60 102.44 ....
In general it is a little bit painful to manipulate the ts class. One idea is to use an xts object, which generally preserves its structure once you apply common operations to it.
Even though the generic method rev is not implemented for an xts object, it is easy to coerce the resulting zoo time series to an xts one using as.xts.
par(mfrow=c(2,2))
plot(gnp,col='red',main='gnp')
plot(rev(gnp),type='l',col='red',main='rev(gnp)')
library(xts)
xts_gnp <- as.xts(gnp)
plot(xts_gnp)
## note here that I apply as.xts again after the rev operation,
## otherwise I lose the xts structure
rev_xts_gnp = as.xts(rev(as.xts(gnp)))
plot(rev_xts_gnp)

Intraday high/low clustering

I am attempting to perform a study on the clustering of high/low points based on time. I managed to achieve the above by using to.daily on intraday data and merging the two using:
intraday.merge <- merge(intraday,daily)
intraday.merge <- na.locf(intraday.merge)
intraday.merge <- intraday.merge["T08:30:00/T16:30:00"] # remove record at 00:00:00
Next, I tried to obtain the records where the high == daily.high/low == daily.low using:
intradayhi <- test[test$High == test$Daily.High]
intradaylo <- test[test$Low == test$Daily.Low]
Resulting data resembles the following:
Open High Low Close Volume Daily.Open Daily.High Daily.Low Daily.Close Daily.Volume
2012-06-19 08:45:00 258.9 259.1 258.5 258.7 1424 258.9 259.1 257.7 258.7 31523
2012-06-20 13:30:00 260.8 260.9 260.6 260.6 1616 260.4 260.9 259.2 260.8 35358
2012-06-21 08:40:00 260.7 260.8 260.4 260.5 493 260.7 260.8 257.4 258.3 31360
2012-06-22 12:10:00 255.9 256.2 255.9 256.1 626 254.5 256.2 253.9 255.3 50515
2012-06-22 12:15:00 256.1 256.2 255.9 255.9 779 254.5 256.2 253.9 255.3 50515
2012-06-25 11:55:00 254.5 254.7 254.4 254.6 1589 253.8 254.7 251.5 253.9 65621
2012-06-26 08:45:00 253.4 254.2 253.2 253.7 5849 253.8 254.2 252.4 253.1 70635
2012-06-27 11:25:00 255.6 256.0 255.5 255.9 973 251.8 256.0 251.8 255.2 53335
2012-06-28 09:00:00 257.0 257.3 256.9 257.1 601 255.3 257.3 255.0 255.1 23978
2012-06-29 13:45:00 253.0 253.4 253.0 253.4 451 247.3 253.4 246.9 253.4 52539
The subset contains duplicated results; how do I keep only the first record of each day? I would then be able to plot the count of records for each period of the day.
Also, are there alternate methods to get the results I want? Thanks in advance.
Edit:
Sample output should look like this; the count could be either the first result for the day or aggregated (more than one occurrence in that day):
Time Count
08:40:00 60
08:45:00 54
08:50:00 60
...
14:00:00 20
14:05:00 12
14:10:00 30
You can get the first observation of each day via:
y <- apply.daily(x, first)
Then you can simply aggregate the count based on hours and minutes:
z <- aggregate(1:NROW(y), by=list(Time=format(index(y),"%H:%M")), sum)
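If you prefer plain subsetting, an equivalent sketch on toy data (the timestamps and values are invented) drops everything after the first bar of each calendar day and then tabulates the time-of-day buckets:

```r
library(xts)
set.seed(1)

# Toy intraday bars across three days (invented values)
idx <- c(as.POSIXct("2012-06-19 08:45:00") + c(0, 300, 600),
         as.POSIXct("2012-06-20 13:30:00") + c(0, 300),
         as.POSIXct("2012-06-21 08:40:00"))
x <- xts(runif(length(idx), 250, 260), order.by = idx)

# Keep only the first record of each calendar day ...
firsts <- x[!duplicated(as.Date(index(x))), ]

# ... then count how often each time-of-day bucket appears
table(format(index(firsts), "%H:%M"))
```

On the real data you would apply this after the high == Daily.High / low == Daily.Low subset, so repeated hits within one day collapse to the first occurrence.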

ets() error in R

I have a time series, and when applying ets I get an error, and I don't know where it is generated. My time series is not that big, nor does it contain very large values.
Do you spot anything wrong?
> ts
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2005 22.84 21.58 21.93 25.25 28.69 30.63 34.47 28.72 36.49 34.77 40.65
2006 36.73 31.55 38.07 34.77 36.91 39.16 36.07 35.37 39.34 35.62 27.58 33.37
2007 37.11 33.32 34.09 34.64 21.05 41.60 36.52 37.63 42.66 38.17 39.26 39.95
2008 35.33 38.63 38.04 36.90 33.56 35.14 33.82 36.18 29.30 25.65 20.71 21.63
2009 17.12 19.02 22.48 24.42 16.94 19.75 24.56 22.55 16.68 17.86 20.83 18.41
2010 14.74 16.49 19.75 22.88 24.11 27.02 27.46 26.47 26.81 26.59 23.56 18.88
> fit = ets (ts)
Error in `-.default`(y, e$e) : non-numeric argument to binary operator
In addition: Warning message:
In ets(ts) :
Very large numbers which may cause numerical problems. Try scaling the data first
>
Thanks.
Update with a traceback:
Traceback:
7: NextMethod(.Generic)
6: Ops.ts(y, e$e)
5: etsmodel(y, errortype[i], trendtype[j], seasontype[k], damped[l],
alpha, beta, gamma, phi, lower = lower, upper = upper, opt.crit = opt.crit,
nmse = nmse, bounds = bounds)
4: ets(ts)
3: eval(expr, envir, enclos)
2: eval(i, envir)
1: sys.source(file = "1.R", envir = .rAenv)
It appears that my ts variable was in the wrong format.
mode(ts) # should be "numeric", not "character"
After casting it with as.numeric, it worked.
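The symptom is easy to reproduce on a tiny invented series: ts() happily wraps character data, and mode() reveals the problem.

```r
# A "numeric-looking" series accidentally stored as character
bad <- ts(c("22.84", "21.58", "21.93", "25.25"),
          start = c(2005, 1), frequency = 12)
mode(bad)   # "character" -- ets() would choke on this

# Coerce and re-wrap to get a proper numeric ts
good <- ts(as.numeric(bad), start = c(2005, 1), frequency = 12)
mode(good)  # "numeric"
```

Note that as.numeric strips the ts attributes, so the series has to be re-wrapped in ts() with its original start and frequency.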
