Errors using CausalImpact package with Zoo objects - r

I'm trying to model the impact of storms on sales patterns using the CausalImpact package. When I create a zoo object and pass it to the model I receive an error. I've read through the documentation and can't figure out what I'm doing differently from the examples in the documentation.
I'm working with the following data.frame:
> head(my.data)
date sales units
1 2014-10-17 71319.85 21436.64
2 2014-10-18 88598.26 26755.79
3 2014-10-19 95768.29 29823.86
4 2014-10-20 62303.04 19417.71
5 2014-10-21 56477.57 17562.21
6 2014-10-22 54890.39 16946.43
Then I'm converting it to a zoo object:
my.data<- zoo( my.data[ ,c('sales','units')], my.data[,'date'] )
> str(my.data)
‘zoo’ series from 2014-10-17 to 2017-04-13
Data: num [1:907, 1:2] 71320 88598 95768 62303 56478 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "sales" "units"
Index: Date[1:907], format: "2014-10-17" "2014-10-18" "2014-10-19" ...
Then I set the pre and post periods and run the model:
pre.period <- as.Date(c('2015-10-17','2017-03-09'))
post.period <- as.Date(c('2017-03-10','2017-04-13'))
library(CausalImpact)
impact<- CausalImpact(data = my.data, pre.period = pre.period, post.period = post.period, alpha = .01)
But I'm receiving this error:
> impact<- CausalImpact(data = my.data, pre.period = pre.period, post.period = post.period, alpha = .05)
Error in bsts(formula, data = data, state.specification = ss, expected.model.size = kStaticRegressionExpectedModelSize, :
Caught exception with the following error message:
BregVsSampler did not start with a legal configuration.
Selector vector: 11
beta: 0 0
I've used this package successfully with univariate time series data, but can't identify why this isn't working.
Thank you for your help!

I ran into exactly the same issue after applying recent package updates (including CausalImpact). Everything was working fine previously.
While I don't have the exact cause/solution, I have discovered something that may help you.
In my data, I tried simply replacing the dates in the zoo object with a test sequence. So in your case it would be something like:
time.pts <- seq.Date(as.Date("2014-10-17"), by = 1, length.out = 907)
my.data<- zoo( my.data[ ,c('sales','units')], time.pts )
After doing this, the "BregVsSampler" exception did not occur. So I figured the issue must be related to the dates, and then put my original date series back into the zoo object. I then noticed that I had a gap between pre.period and post.period, i.e. see the gap between 3/9 and 3/20 below:
pre.period <- as.Date(c('2015-10-17','2017-03-09'))
post.period <- as.Date(c('2017-03-20','2017-04-13'))
When I adjusted the pre/post periods to remove the gap in dates, the problem again went away.
While you don't seem to have such a gap in the code you show above, you may want to look at your date series for any inconsistencies and/or try a different date range. Obviously there is a bug somewhere that needs to get fixed, but perhaps the above info will help you work around the issue in the interim.
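One way to act on this is to check the index for missing days and to confirm that the post-period starts on the day right after the pre-period ends. The sketch below uses only base R; the helper name check_periods and the toy dates are hypothetical, and it is a diagnostic, not a fix for whatever the underlying bug is.

```r
# Hypothetical helper: report gaps in a daily Date index and between the two periods.
check_periods <- function(dates, pre.period, post.period) {
  dates <- sort(as.Date(dates))
  steps <- as.numeric(diff(dates))  # days between consecutive observations
  list(
    missing_days = sum(steps - 1),  # days absent from the index
    period_gap   = as.numeric(post.period[1] - pre.period[2]) - 1,  # days skipped between periods
    inside_index = all(c(pre.period, post.period) %in% dates)       # boundaries exist in the data
  )
}

# Toy daily index and the two post-period variants discussed above.
dates      <- seq(as.Date("2017-03-01"), as.Date("2017-04-13"), by = "day")
pre.period <- as.Date(c("2017-03-01", "2017-03-09"))
gapped     <- as.Date(c("2017-03-20", "2017-04-13"))
contiguous <- as.Date(c("2017-03-10", "2017-04-13"))

res_gap <- check_periods(dates, pre.period, gapped)
res_ok  <- check_periods(dates, pre.period, contiguous)
```

A non-zero period_gap or missing_days points at the kind of date inconsistency described above.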

Related

Converting a date to numeric in R

I have data where I have the dates in YYYY-MM-DD format in one column and another column is num.
packages:
library(forecast)
library(ggplot2)
library(readr)
Running str(my_data) produces the following:
spec_tbl_df [261 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:261], format: "2017-01-01" "2017-01-08" ...
$ popularity: num [1:261] 100 81 79 75 80 80 71 85 79 81 ...
- attr(*, "spec")=
.. cols(
.. date = col_date(format = ""),
.. popularity = col_double()
.. )
- attr(*, "problems")=<externalptr>
I would like to do some time series analysis on this. When running the first line of code for this,
decomp <- stl(log(my_data), s.window="periodic")
I keep running into the following error:
Error in Math.data.frame(my_data) :
non-numeric-alike variable(s) in data frame: date
Originally my dates were in MM/DD/YYYY format, so I feel like I'm... barely closer. I'm learning R again, but it's been a while since I took a formal course in it. I did a cursory search here, but could not find anything that I could identify as helpful (I'm just an amateur.)
You currently have a data.frame (or tibble variant thereof). That is not yet time aware. You can do things like
library(ggplot2)
ggplot(data=df) + aes(x=date, y=popularity) + geom_line()
to get a basic line plot properly indexed by date.
You will have to look more closely at package forecast and the examples of functions you want to use to predict or model. Packages like xts can help you, i.e.
library(xts)
x <- xts(df$popularity, order.by=df$date)
plot(x) # plot xts object
Besides plotting, you get time- and date-aware lags, leads, and subsetting. The rest depends more on what you want to do ... which you have not told us much about.
Lastly, if you wanted to convert your dates to numbers (days since Jan 1, 1970), a quick as.numeric(df$date) will do; but using time-aware operations is often better (though it has the learning curve you see now...)
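For the stl() call in the question specifically, one possible route is to build a ts object from the numeric column alone, so that log() never sees the Date column. This is a minimal sketch on synthetic data; the frequency of 52 (weekly observations with a yearly cycle) and the start value are assumptions, not something stated in the question.

```r
# Synthetic stand-in for my_data: 261 weekly observations, strictly positive.
set.seed(1)
my_data <- data.frame(
  date = seq(as.Date("2017-01-01"), by = "week", length.out = 261),
  popularity = 80 + 10 * sin(2 * pi * (1:261) / 52) + rnorm(261)
)

# Build a ts from the numeric column only; frequency = 52 is an assumption
# (roughly 52 weekly observations per year).
pop_ts <- ts(my_data$popularity, frequency = 52, start = c(2017, 1))

# log() now receives a numeric series instead of a data frame with a Date column.
decomp <- stl(log(pop_ts), s.window = "periodic")
```

plot(decomp) then shows the seasonal, trend, and remainder components.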

Circular-linear regression with covariates in R

I have data showing when an animal came to a survey station. example csv file here The first few lines of data look like this:
Site_ID DateTime HourOfDay MinTemp LunarPhase Habitat
F1 6/12/2013 14:01:00 14 -1 0 river
F1 6/12/2013 14:23:00 14 -1 0 river
F2 6/13/2013 1:21:00 1 3 1 upland
F2 6/14/2013 1:33:00 1 4 2 upland
F3 6/14/2013 1:48:00 1 4 2 river
F3 6/15/2013 11:08:00 11 0 0 river
I would like to perform a circular-linear regression in R to determine peak activity times. The dependent variable could be DateTime or HourOfDay, whichever is easier. I would like to incorporate the covariates Site_ID (random effect), plus MinTemp, LunarPhase, and Habitat into a mixed-effects model.
I have tried using the lm.circular function of the circular package, and have the following code:
data<-read.csv("StackOverflowExampleData.csv")
data$DateTime<-as.POSIXct(as.character(data$DateTime), format = "%m/%d/%Y %H:%M:%S")
data$LunarPhase<-as.factor(data$LunarPhase)
str(data)
library(circular)
y<-data$DateTime
y<-circular(y, units ="hours",template = "clock24",rotation = "clock")
x<-data[,c(1,4,5,6)]
lm.circular(y=y, x=x, init=c(1,1,1,1), type='c-l', verbose=TRUE)
I keep getting the error:
Error in Ops.POSIXt(x, 12) : '/' not defined for "POSIXt" objects
Apparently this is a known bug, but I was confused by this thread about it and could not determine an appropriate work-around. Suggestions?
Also, my ultimate goal with this data was to run a circular-linear version of a glm, and then test several models against one another using AIC or some other information theoretics method. The model I'm seeking would be a circular-linear version of something like this:
glmer(HourOfDay~MinTemp+LunarPhase+Habitat+(1|Site_ID),family=binomial,data=data)
Perhaps this is an inappropriate application of the circular package. If so, I'm open to other suggestions of models and/or graphics that would investigate peak activity using the data and covariates.
Note: I did search for related discussions and found this somewhat relevant thread, but it was never answered, did not request a solution in R, and was of a different scope.
The specific problem is caused by conversion.circular. There, a POSIXlt object is divided by 12. This is an operation that has a non-defined outcome:
> as.POSIXlt('2005-07-16') / 2
Error in Ops.POSIXt(as.POSIXlt("2005-07-16"), 2) :
'/' not defined for "POSIXt" objects
So, it seems that you cannot use data of this class as input for the circular package. I could not find any mention of POSIXlt data in the examples. Maybe you need to specify the timestamps simply as a number, not as a POSIXlt object.
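A sketch of that idea: reduce each timestamp to its time of day in decimal hours, and hand that plain number to circular(). The toy timestamps and the UTC time zone are assumptions, and the circular object is only built if the package is installed.

```r
# Toy timestamps in the question's format; tz = "UTC" is an assumption that
# sidesteps daylight-saving shifts.
stamps <- c("6/12/2013 14:01:00", "6/13/2013 1:21:00", "6/15/2013 11:08:00")
dt <- as.POSIXct(stamps, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Time of day as decimal hours in [0, 24).
hours <- as.numeric(format(dt, "%H")) +
  as.numeric(format(dt, "%M")) / 60 +
  as.numeric(format(dt, "%S")) / 3600

# Only build the circular object if the circular package is available.
if (requireNamespace("circular", quietly = TRUE)) {
  y <- circular::circular(hours, units = "hours", template = "clock24")
}
```

This addresses only the POSIXt division error; whether lm.circular suits the mixed-effects model in the question is a separate matter.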

Machine Learning using R linear regression

I am using R for machine learning. My project scenario is described below.
I use MongoDB for storage. It has one collection, and a new document is added to that collection every 5 minutes. A sample document looks like this:
{
"_id" : ObjectId("521c980624c8600645ad23c8"),
"TimeStamp" : 1377605638752,
"cpuUsed" : -356962527,
"memory" : 2057344858,
"hostId" : "200.2.2.2"
}
Now my problem is that, using the above documents, I want to predict the cpuUsed and memory values for the next 5 minutes, 10 minutes, or 24 hours. For that I wrote the R code below:
library('RMongo')
mg1 <- mongoDbConnect('dbname')
query <- dbGetQuery(mg1,'test',"{'hostId' : '200.2.2.2'}")
data1 <- query[]
cpu <- query$cpuUtilization
memory <- query$memory
new <- data.frame(data=1377678051) # set timestamp for calculating results
predict(lm(cpu ~ data1$memory + data1$Date ), new, interval="confidence")
But when I execute the above code, it shows me the following output:
fit lwr upr
1 427815904 -37534223 893166030
2 -110791661 -368195697 146612374
3 137889445 -135982781 411761671
4 -165891990 -445886859 114102880
.
.
.
n
From this output I can't tell which cpuUsed value was used for the prediction.
If anyone knows, please help me.
Thank you.
The newdata parameter of predict needs to contain the variables used in the fit:
new <- data.frame(memory = 1377678051, Date=as.Date("2013-08-28"))
Only then is it actually used for prediction; otherwise you get the fitted values.
You can then cbind the predicted values with new.
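A minimal sketch of the whole round trip, on synthetic data standing in for the MongoDB query result (the column names memory, Date, and cpu are assumptions). Note that the formula uses bare column names with data =, rather than data1$memory, so that predict() can match the columns of newdata by name.

```r
set.seed(42)
# Synthetic stand-in for the query result; column names are assumptions.
data1 <- data.frame(
  Date   = as.Date("2013-08-01") + 0:29,
  memory = runif(30, 1e9, 2e9)
)
data1$cpu <- 5e-8 * data1$memory +
  0.5 * as.numeric(data1$Date - data1$Date[1]) + rnorm(30)

# Bare column names plus data = let predict() look the variables up in newdata.
fit <- lm(cpu ~ memory + Date, data = data1)

new  <- data.frame(memory = 1377678051, Date = as.Date("2013-08-28"))
pred <- predict(fit, newdata = new, interval = "confidence")

cbind(new, pred)  # one row: the inputs plus fit, lwr, upr
```

With newdata matched correctly, the result has one row per row of new instead of one row per training observation.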

unused arguments error using apply() in R

I get an error message when I attempt to use apply() conditional on a column of dates to return a set of coefficients.
I have a dataset (herein modified for simplicity, but reproducible):
ADataset <- data.table(Epoch = c("2007-11-15", "2007-11-16", "2007-11-17",
"2007-11-18", "2007-11-19", "2007-11-20", "2007-11-21"),
Distance = c("92336.22", "92336.23", "92336.22", "92336.20",
"92336.19", "92336.21", "92336.18"))
ADataset
Epoch Distance
1: 2007-11-15 92336.22
2: 2007-11-16 92336.23
3: 2007-11-17 92336.22
4: 2007-11-18 92336.20
5: 2007-11-19 92336.19
6: 2007-11-20 92336.21
7: 2007-11-21 92336.18
The analysis begins with establishing start and end dates:
############## Establish dates for analysis
#4.Set date for center of duration
StartDate <- "2007-11-18"
as.numeric(as.Date(StartDate)); StartDate
EndDate <- as.Date(tail(Adataset$Epoch,1)); EndDate
Then I establish time durations for analysis:
#5.Quantify duration of time window
STDuration <- 1
LTDuration <- 3
Then I write functions to regress over both durations and return the slopes:
# Write STS and LTS functions, each with following steps
#6.Define time window- from StartDate less ShortTermDuration to StartDate plus ShortTermDuration
#7.Define Short Term & Long Term datasets
#8. Run regression over dataset
my_STS_Function <- function (StartDate) {
STAhead <- as.Date(StartDate) + STDuration; STAhead
STBehind <- as.Date(StartDate) - STDuration; STBehind
STDataset <- subset(Adataset, as.Date(Epoch) >= STBehind & as.Date(Epoch)<STAhead)
STResults <- rlm( Distance ~ Epoch, data=STDataset); STResults
STSummary <- summary( STResults ); STSummary
# Return coefficient (Slope of regression)
STNum <- STResults$coefficients[2];STNum
}
my_LTS_Function <- function (StartDate) {
LTAhead <- as.Date(StartDate) + LTDuration; LTAhead
LTBehind <- as.Date(StartDate) - LTDuration; LTBehind
LTDataset <- subset(Adataset, as.Date(Epoch) >= LTBehind & as.Date(Epoch)<LTAhead)
LTResults <- rlm( Distance ~ Epoch, data=LTDataset); LTResults
LTSummary <- summary( LTResults ); LTSummary
# Return coefficient (Slope of regression)
LTNum <- LTResults$coefficients[2];LTNum
}
Then I test the function to make sure it works for a single date:
myTestResult <- my_STS_Function("2007-11-18")
It works, so I move on to apply the function over the range of dates in the dataset:
mySTSResult <- apply(Adataset, 1, my_STS_Function, seq(StartDate : EndDate))
...in which my desired result is a list or array or vector of mySTSResult (slopes) (and, subsequently, a separate list/array/vector of myLTSResults so then I can create a STSlope:LTSlope ratio over the duration), something like (mySTSResults fabricated)...
> Adataset
Epoch Distance mySTSResults
1: 2007-11-15 92336.22 3
2: 2007-11-16 92336.23 4
3: 2007-11-17 92336.22 5
4: 2007-11-18 92336.20 6
5: 2007-11-19 92336.19 7
6: 2007-11-20 92336.21 8
7: 2007-11-21 92336.18 9
Only I get this error:
Error in FUN(newX[, i], ...) : unused argument(s) (1:1185)
What is this telling me and how do I correct it? I've done some looking and cannot find the correction.
Hopefully I've explained this sufficiently. Please let me know if you need further details.
Ok, it seems the problem is in the additional arguments to my_STS_Function as stated in your apply function call (as you have defined it with only one parameter). The date range is being passed as an additional parameter to that function, and R is complaining that it is unused (a vector of 1185 elements it seems). Are you rather trying to pull a subset of the rows restricted by date range first, then wishing to apply the my_STS_Function? I'd have to think a bit on an exact solution to that.
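The mechanism can be reproduced in a few lines of base R: anything passed to apply() after FUN is forwarded to FUN as extra arguments. The function and matrix below are toy examples, not the code from the question.

```r
one_arg <- function(row) sum(row)
m <- matrix(1:6, nrow = 2)

# Works: apply() calls one_arg(row) once per row.
ok <- apply(m, 1, one_arg)

# Fails: the extra 1:3 is forwarded as a second argument that one_arg does not accept.
msg <- tryCatch(apply(m, 1, one_arg, 1:3),
                error = function(e) conditionMessage(e))
```

Here msg holds the same kind of "unused argument" message as in the question.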
Sorry - I did my working out in the comments there. A possible solution is this:
subSet <- Adataset[as.Date(Adataset$Epoch) %in% seq(as.Date(StartDate), as.Date(EndDate), by = "day"), ]
Adapted from the answer in this question:
R select rows in matrix from another vector (match, %in)
Adding this as a new answer as the previous one was getting confused. A previous commenter was correct, there are bugs in your code, but they aren't a sticking point.
My updated approach was to use seq.Date to generate the date sequence (only works if you have a data point for each day between the start and end - though you could use na.exclude as above):
dates = seq.Date(as.Date(StartDate),as.Date(EndDate),"days")
You then use this as the input to apply, with some munging of types to get things working correctly (I've done this with a lambda function):
mySTSResult <- apply(as.matrix(dates), 1, function(x) {class(x) <- "Date"; my_STS_Function(x)})
Then hopefully you should have a vector of the results, and you should be able to do something similar for LTS, and then manipulate that into another column in your original data frame/matrix.
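An alternative that avoids the matrix conversion is to iterate over positions with vapply() and index into the date vector, so each element keeps its Date class. This is a sketch under assumptions: the data are a numeric version of the question's table, lm() stands in for rlm() to avoid extra packages, and window_slope is a hypothetical name.

```r
Adataset <- data.frame(
  Epoch    = as.Date("2007-11-15") + 0:6,
  Distance = c(92336.22, 92336.23, 92336.22, 92336.20, 92336.19, 92336.21, 92336.18)
)
STDuration <- 1

# Slope of Distance over the window [center - duration, center + duration).
# lm() stands in for rlm() here to keep the sketch free of extra packages.
window_slope <- function(center, duration = STDuration) {
  keep <- Adataset$Epoch >= center - duration & Adataset$Epoch < center + duration
  win  <- Adataset[keep, ]
  win$days <- as.numeric(win$Epoch - center)  # days relative to the window centre
  unname(coef(lm(Distance ~ days, data = win))[2])
}

dates <- seq(as.Date("2007-11-18"), as.Date("2007-11-21"), by = "day")

# Index into dates so each element is still a Date when it reaches window_slope().
mySTSResult <- vapply(seq_along(dates),
                      function(i) window_slope(dates[i]),
                      numeric(1))
```

The result is a plain numeric vector with one slope per date, which can be attached to the matching rows as a new column.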

plotting time series in R

I am working with data, 1st two columns are dates, 3rd column is symbol, and 4th and 5th columns are prices.
So, I created a subset of the data as follows:
test.sub<-subset(test,V3=="GOOG",select=c(V1,V4))
and then I try to plot a time series chart using the following
as.ts(test.sub)
plot(test.sub)
well, it gives me a scatter plot - not what I was looking for.
so, I tried plot(test.sub[1],test.sub[2])
and now I get the following error:
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
To make sure the no. of rows were same, I ran nrow(test.sub[1]) and nrow(test.sub[2]) and they both return equal rows, so as a newcomer to R, I am not sure what the fix is.
I also ran plot.ts(test.sub) and that works, but it doesn't show me the dates in the x-axis, which it was doing with plot(test.sub) and which is what I would like to see.
test.sub[1]
V1
1107 2011-Aug-24
1206 2011-Aug-25
1307 2011-Aug-26
1408 2011-Aug-29
1510 2011-Aug-30
1613 2011-Aug-31
1718 2011-Sep-01
1823 2011-Sep-02
1929 2011-Sep-06
2035 2011-Sep-07
2143 2011-Sep-08
2251 2011-Sep-09
2359 2011-Sep-13
2470 2011-Sep-14
2581 2011-Sep-15
2692 2011-Sep-16
2785 2011-Sep-19
2869 2011-Sep-20
2965 2011-Sep-21
3062 2011-Sep-22
3160 2011-Sep-23
3258 2011-Sep-26
3356 2011-Sep-27
3455 2011-Sep-28
3555 2011-Sep-29
3655 2011-Sep-30
3755 2011-Oct-03
3856 2011-Oct-04
3957 2011-Oct-05
4059 2011-Oct-06
4164 2011-Oct-07
4269 2011-Oct-10
4374 2011-Oct-11
4479 2011-Oct-12
4584 2011-Oct-13
4689 2011-Oct-14
str(test.sub)
'data.frame': 35 obs. of 2 variables:
$ V1:Class 'Date' num [1:35] NA NA NA NA NA NA NA NA NA NA ...
$ V4: num 0.475 0.452 0.423 0.418 0.403 ...
head(test.sub)
V1 V4
1212 <NA> 0.474697
1313 <NA> 0.451907
1414 <NA> 0.423184
1516 <NA> 0.417709
1620 <NA> 0.402966
1725 <NA> 0.414264
Now that this is working, I'd like to add a 3rd variable to plot a 3d chart - any suggestions how I can do that. thx!
So I think there are a few things going on here that are worth talking through:
first, some example data:
test <- data.frame(End = Sys.Date()+1:5,
Start = Sys.Date()+0:4,
tck = rep("GOOG",5),
EndP= 1:5,
StartP= 0:4)
test.sub = subset(test, tck=="GOOG",select = c(End, EndP))
First, note that test and test.sub are both data frames, so calls like test.sub[1] don't really "mean" anything to R.** It's more R-ish to write test.sub[,1] by virtue of consistency with other R structures. If you compare the results of str(test.sub[1]) and str(test.sub[,1]) you'll see that R treats them slightly differently.
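The difference is easy to see on a small data frame. The fixed dates below are arbitrary stand-ins for the Sys.Date() based example.

```r
test.sub <- data.frame(End = as.Date("2011-10-19") + 0:4, EndP = 1:5)

one_col <- test.sub[1]    # still a data frame, with one column
vec     <- test.sub[, 1]  # the column itself, a Date vector
also    <- test.sub[[1]]  # the same vector via list-style extraction

str(one_col)
str(vec)
```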
You said you typed:
as.ts(test.sub)
plot(test.sub)
I'd guess you have extensive experience with some sort of OO-language; and while R does have some OO flavor to it, it doesn't apply here. Rather than transforming test.sub to something of class ts, this just does the transformation and throws it away, then moves on to plot the data frame you started with. It's an easy fix though:
test.sub.ts <- as.ts(test.sub)
plot(test.sub.ts)
But, this probably isn't what you were looking for either. Rather, R creates a time series that has two variables called "End" (which is the date now coerced to an integer) and "EndP". Funny business like this is part of the reason time series packages like zoo and xts have caught on so I'll detail them instead a little further down.
(Unfortunately, to the best of my understanding, R doesn't keep date stamps with its default ts class, choosing instead to keep start and end dates as well as a frequency. For more general time series work, this is rarely flexible enough)
You could perhaps get what you wanted by typing
plot(test.sub[,1], test.sub[,2])
instead of
plot(test.sub[1], test.sub[2])
since the latter runs into trouble given that you are passing two sub-data frames instead of two vectors (even though it looks like you would be).***
Anyways, with xts (and similarly for zoo):
library(xts) # You may need to install this
xtemp <- xts(test.sub[,2], test.sub[,1]) # Create the xts object
plot(xtemp)
# Dispatches a xts plot method which does all sorts of nice time series things
Hope some of this helps and sorry for the inline code that's not identified as such: still getting used to stack overflow.
Michael
**In reality, they access the lists that are used to structure a data frame internally, but that's more a code nuance than something worth relying on.
***The nitty-gritty is that when you pass plot(test.sub[1], test.sub[2]) to R, it dispatches the method plot.data.frame which takes a single data frame and tries to interpret the second data frame as an additional plot parameter which gets misinterpreted somewhere way down the line, giving your error.
The reason that you get the Error about different x and y lengths is immediately apparent if you do a traceback immediately upon raising the error:
> plot(test.sub[1],test.sub[2])
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
> traceback()
6: stop("'x' and 'y' lengths differ")
5: xy.coords(x, y, xlabel, ylabel, log)
4: plot.default(x1, ...)
3: plot(x1, ...)
2: plot.data.frame(test.sub[1], test.sub[2])
1: plot(test.sub[1], test.sub[2])
The problems in your call are manifold. First, as mentioned by @mweylandt, test.sub[1] is a data frame with a single component, not a vector comprised of the contents of the first component of test.sub.
From the traceback, we see that the plot.data.frame method was called. R is quite happy to plot a data frame as long as it has at least two columns. R took you at your word and passed test.sub[1] (as a data.frame) on to plot() - test.sub[2] never gets a look in. test.sub[1] is eventually passed on to xy.coords() which correctly informs you that you have lots of rows for x but 0 rows for y because test.sub[1] only contains a single component.
It would have worked if you'd done plot(test.sub[,1], test.sub[,2], type = "l") or used the formula interface to name the variables plot(V4 ~ V1, data = test.sub, type = "l") as I show in my other Answer.
Surely it is easier to use the formula interface:
> test <- data.frame(End = Sys.Date()+1:5,
+ Start = Sys.Date()+0:4,
+ tck = rep("GOOG",5),
+ EndP= 1:5,
+ StartP= 0:4)
>
> test.sub = subset(test, tck=="GOOG",select = c(End, EndP))
> head(test.sub)
End EndP
1 2011-10-19 1
2 2011-10-20 2
3 2011-10-21 3
4 2011-10-22 4
5 2011-10-23 5
> plot(EndP ~ End, data = test.sub, type = "l")
I work extensively with time series type data and rarely, if ever, have any need for the "ts" class of objects. Packages zoo and xts are very useful, but if all you want to do is plot the data, i) get the date/time information correctly formatted/set-up as a "Date" or "POSIXt" class object, and then ii) just plot it using standard graphics and type = "l" (or type = "b" or type = "o" if you want to see the observation times).
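For step i), the str() output in the question shows V1 as all NA, which suggests the "2011-Aug-24" strings were never parsed with a matching format. A minimal sketch on toy rows in the question's layout; the %b specifier assumes English month abbreviations in the current locale.

```r
# Toy rows in the question's layout; %b assumes English month abbreviations
# in the current locale.
test.sub <- data.frame(
  V1 = c("2011-Aug-24", "2011-Aug-25", "2011-Aug-26", "2011-Aug-29"),
  V4 = c(0.474697, 0.451907, 0.423184, 0.417709),
  stringsAsFactors = FALSE
)

# Parse year, abbreviated month name, and day into a Date column.
test.sub$V1 <- as.Date(test.sub$V1, format = "%Y-%b-%d")
```

With V1 parsed, plot(V4 ~ V1, data = test.sub, type = "l") draws the line with dates on the x-axis.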
