I am having trouble reconciling how forecasts are calculated by the R functions forecast::croston and tsintermittent::crost. I understand the concept of Croston's method, as in the example posted here (www.robjhyndman.com/papers/MASE.xls), but the two R packages produce very different results.
I used the values from the Excel example (by R. Hyndman) in the following code:
library(tsintermittent)
library(forecast)
x <- c(0,1,0,11,0,0,0,0,2,0,6,3,0,0,0,0,0,7,0,0,0,0)  # from Hyndman's Excel example
x_crost <- crost(x, h=5, w=0.1, init=c(1,1))  # from the tsintermittent package
x_croston <- croston(x, h=5, alpha=0.1)       # from the forecast package
x_croston$fitted
y <- data.frame(x, x_crost$frc.in, x_croston$fitted)
y
plot(x_croston)
lines(x_croston$fitted, col="blue")
lines(x_crost$frc.in, col="red")
x_crost$initial
x_crost$frc.out   # out-of-sample forecast
x_croston$mean    # out-of-sample forecast
The forecast from the Excel example is 1.36, crost gives 1.58, and croston gives 1.15. Why are they not the same? Note also that the in-sample (fitted) values are very different.
For crost in the tsintermittent package you need a second flag to stop it from optimising the initial values: init.opt=FALSE. The command should therefore be:
crost(x, w=0.1, init=c(2,2), init.opt=FALSE)
Setting init=c(2,2) on its own merely provides starting values for the optimiser to work from.
Note also that the time series in Rob Hyndman's example has two additional values at the beginning (see column B), so x should be:
x <- c(0,2,0,1,0,11,0,0,0,0,2,0,6,3,0,0,0,0,0,7,0,0,0,0)
Running these two commands produces the same values as in the Excel example.
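Putting that together, a minimal sketch of the corrected comparison (the target value of about 1.36 is the one from the spreadsheet):
library(tsintermittent)
# Full series from the Excel example, including the two leading values
x <- c(0,2,0,1,0,11,0,0,0,0,2,0,6,3,0,0,0,0,0,7,0,0,0,0)
# Fix the initial demand size and interval, and turn off initial-value optimisation
fit <- crost(x, h=5, w=0.1, init=c(2,2), init.opt=FALSE)
fit$frc.out   # out-of-sample forecast; should reproduce the ~1.36 from the spreadsheet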
I am using the npudens function from the np package for R.
I am trying to estimate a kernel density function of a multivariate dataset, with the density evaluated at each of the 632 points, in order to run a conditional efficiency analysis.
I have 4 continuous variables and one dummy variable, and my sample size is 632 observations.
I use the function below in R:
kerz <- npudens(bws=bw_cx[i,], cykertype="epanechnikov", cxkertype="epanechnikov",
                oxkertype="liracine", tdat=tdata, edat=dat)
In earlier versions, this worked fine, as I was able to retrieve the necessary density estimates with kerz$dens.
In newer versions, and in RStudio Cloud, I get an error:
Error in if (any(a <= 0)) warning(paste("variable", which(a <= 0), " appears to be constant", : missing value where TRUE/FALSE needed
I suppose some if-statement inside npudens doesn't evaluate to TRUE or FALSE. I have tried to debug by changing the command to the following:
kerz2 <- npudens(bws=bw_cx[i,], ckertype="epanechnikov", okertype="liracine",
                 tdat=tdata, edat=dat)
Unfortunately, I get the same error.
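A minimal check along these lines (assuming tdata and dat are data frames) might reveal whether an NA or a constant column is what trips the failed if():
# Any missing values in the training or evaluation data?
colSums(is.na(tdata))
colSums(is.na(dat))
# Any constant (zero-variance) continuous columns?
sapply(Filter(is.numeric, tdata), sd, na.rm=TRUE)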
Any help/advice on how to fix this would be greatly appreciated.
I have a time series of rainfall values in a csv file. I plotted the histogram of the data, which is skewed to the left. I wanted to transform the values so that they follow a normal distribution. I used the Yeo-Johnson transform available in R. The transformed values are here.
My question is:
In the above transformation I used a test value of 0.5 for lambda, which works fine. Is there a way to determine the optimal value of lambda from the time series itself? I'd appreciate any suggestions.
So far, here's the code:
library(car)
dat <- scan("Zamboanga.csv")
hist(dat)
trans <- yjPower(dat, 0.5, jacobian.adjusted=TRUE)
hist(trans)
Here is the csv file.
First find the optimal lambda using the boxCox function from the car package, which estimates λ by maximum likelihood.
You can plot it like this:
boxCox(your_model, family="yjPower", plotit = TRUE)
As Ben Bolker said in a comment, the model here could be something like
your_model <- lm(dat ~ 1)
Then use the optimized lambda in your existing code.
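Putting this together, a minimal sketch (assuming, as with MASS::boxcox, that boxCox invisibly returns the grid of lambda values and their log-likelihoods):
library(car)
dat <- scan("Zamboanga.csv")
# Intercept-only model so boxCox can profile the likelihood over lambda
your_model <- lm(dat ~ 1)
# Profile log-likelihood over a grid of lambdas, Yeo-Johnson family
bc <- boxCox(your_model, family="yjPower", plotit=TRUE)
# Lambda that maximises the profile log-likelihood
lambda <- bc$x[which.max(bc$y)]
# Apply the transformation with the estimated lambda
trans <- yjPower(dat, lambda, jacobian.adjusted=TRUE)
hist(trans)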
I would really appreciate some input on this!
I have data for 5 time series (an example of one step in the series is in the plot below), where each step in the series is a vertical profile of species sightings in the ocean. The profiles were taken 6 h apart, and within each profile the measurements are spaced vertically by 0.1 m.
What I want to do is calculate the multivariate cross-correlation between all series in order to find out at which lag the profiles are most correlated and stable over time.
Profile example:
I find the R documentation on this not so great, so what I have done so far is use the ccm function from the MTS package to create cross-correlation matrices. However, interpreting the figures is rather difficult given the sparse documentation, and I would appreciate some help with that.
Data example:
http://pastebin.com/embed_iframe.php?i=8gdAeGP4
Save it in a file cross_correlation_stack.csv, or change the name as you wish.
library(dplyr)
library(MTS)
library(data.table)
d1 <- file.path('cross_correlation_stack.csv')
d2 <- read.csv(d1)
# Using package MTS: cross-correlation matrices
mod1 <- ccm(d2, lag=1000, level=TRUE)
# Using base R
acf(d2, lag.max=1000)
# MQ plot (multivariate Ljung-Box test), also from the MTS package
mq(d2, lag=1000)
The ccm command produces three figures, and the acf command a fourth (figures not shown here).
My question now is whether somebody can tell me if I am going in the right direction, or whether there are better-suited packages and commands.
Since the default figures don't get any titles: what am I looking at, specifically in the ccm figures?
The acf command was proposed somewhere, but can I use it here? Its documentation says it "... calculates autocovariance or autocorrelation ...", which I assume is not what I want. But then again it's the only command that seems to work on multivariate data. I am confused.
The plot with the significance values shows that after a lag of 150 (15 meters) the p-values increase. How would you interpret that with regard to my data, where species sightings are recorded at 0.1 m intervals and many lags up to 100-150 are significant? Would that mean that peaks in sightings are stable over the 5 time steps on a scale of 150 lags, i.e. 15 meters?
Either way, it would be nice if somebody who has worked with this before could explain what I am looking at! Any input is highly appreciated!
You can use the base R function ccf(), which estimates the cross-correlation function between any two variables x and y. However, it only works on vectors, so you'll have to loop over the columns of d2 (the data frame read in above). Something like:
# One CCF per pair of columns
cc <- vector("list", choose(dim(d2)[2], 2))
par(mfrow=c(ceiling(choose(dim(d2)[2], 2)/2), 2))
cnt <- 1
for (i in 1:(dim(d2)[2]-1)) {
  for (j in (i+1):dim(d2)[2]) {
    cc[[cnt]] <- ccf(d2[,i], d2[,j],
                     main=paste0("Cross-correlation of ", colnames(d2)[i],
                                 " with ", colnames(d2)[j]))
    cnt <- cnt + 1
  }
}
This will plot each of the estimated CCFs and store the estimates in the list cc. It is important to remember that the lag-k value returned by ccf(x,y) estimates the correlation between x[t+k] and y[t].
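If that sign convention is confusing, here is a quick self-contained check on simulated data (not from the thread): x is built to lag y by two steps, so the cross-correlation should peak near k = +2.
set.seed(42)
y <- as.numeric(arima.sim(n=300, model=list(ar=0.7)))
x <- c(0, 0, head(y, -2))      # x[t] = y[t-2], hence x[t+2] = y[t]
res <- ccf(x, y, plot=FALSE)
res$lag[which.max(res$acf)]    # should be close to +2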
All of that said, the CCF is really only appropriate for data that are more or less normally distributed, and your data are clearly overdispersed with all of those zeroes. Therefore, lacking an adequate transformation, you should look into other measures of "association", such as the mutual information as estimated from entropy. I suggest checking out the R packages entropy and infotheo.
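For example, with infotheo a minimal sketch could look like this (assuming d2 is the data frame from the question; the discretization settings are a modelling choice you would need to justify):
library(infotheo)
# Discretize two profiles (equal-frequency binning by default)
d_disc <- discretize(d2[, 1:2])
# Mutual information between the two discretized profiles
mutinformation(d_disc[, 1], d_disc[, 2])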
I'm attempting to grab one plot from a multiple-plot output. For example:
library(mboost)
mod <- gamboost(Ozone ~ ., data=airquality[complete.cases(airquality),])
plot(mod)
The above creates a plot of each variable's "partial effect". The same could be said of the residual plots created when plotting a linear model (lm). I've attempted to save the output in a list, akin to how ggplots can be saved, and have spent a few hours searching for how to extract just one plot, but have failed. Any advice?
As for the context of the question, I'm trying to put the plots into a shiny app and have a variable number of plots show up as output.
Session info is as follows:
R version 2.15.2 (2012-10-26)
Platform: i386-redhat-linux-gnu (32-bit)
Many functions that produce multiple plots also have an argument to select a subset of the plots. In the case of plot.lm it is the which argument, so plot(fit, which=1) will produce only the first plot.
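For example, with a linear model:
fit <- lm(Ozone ~ Wind + Temp, data=airquality)
plot(fit, which=1)   # only the residuals-vs-fitted panel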
You can check the mboost documentation to see if there is a similar argument for that plotting function.
Essentially, #GregSnow gave the proper solution; I will elaborate on it a bit.
In mboost you can use
plot(mod, which = "Day")
to plot the effect of Day only. Because regular expression matching is used, you can do much more with the which argument. In a model with linear and smooth effects you can, for example, extract all smooth effects for plotting:
airquality$Month <- as.factor(airquality$Month)
mod <- gamboost(Ozone ~ bbs(Solar.R) + bbs(Wind) + bbs(Temp) + bols(Month) + bbs(Day),
                data=airquality[complete.cases(airquality),])
## now plot bbs effects, i.e., smooth effects:
par(mfrow = c(2,2))
plot(mod, which = "bbs")
## or the linear effect only
par(mfrow = c(1,1))
plot(mod, which = "bols")
You can use any portion of the name (see, e.g., names(coef(mod))) to define the effect to be plotted. You can also use integer values to define which effects to plot:
plot(mod, which = 1:2)
Note that this can also be used to extract certain coefficients. E.g.,
coef(mod, which = 1)
coef(mod, which = "Solar")
coef(mod, which = "bbs(Solar.R)")
are all the same. For more on how to specify which, both in coef and plot, please see our tutorial paper: Hofner et al. (2014), Model-based Boosting in R - A Hands-on Tutorial Using the R Package mboost. Computational Statistics, 29:3-35. DOI 10.1007/s00180-012-0382-5.
We acknowledge that this currently isn't documented in mboost, but it is on our todo list (see GitHub issue 14).
(I'm not familiar with GAMboost.)
Looking at the documentation for ?plot.GAMBoost, I see there is an argument called select. I gather you would set this argument to the variable you are interested in, and then you would get just the single plot you want. This is analogous to the which argument in plot.lm that #GregSnow notes.
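A minimal, untested sketch of what that might look like (gb_fit is a hypothetical fitted GAMBoost object; see ?plot.GAMBoost for the exact semantics):
# Hypothetical: restrict the plot to the first smooth component
plot(gb_fit, select=1)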
I'm doing some survival analysis in R, and looking to tidy up/simplify my code.
At the moment I'm doing several steps in my data analysis:
make a Surv object (time variable with indication as to whether each observation was censored);
fit this Surv object according to a categorical predictor, for plotting/estimation of median survival time processes; and
calculate a log-rank test to ask whether there is evidence of "significant" differences in survival between the groups.
As an example, here is a mock-up using the lung dataset from the survival package in R. The following code is similar enough to what I want to do, but much simplified in terms of the predictor set (which is why I want to simplify the code: so I don't make inconsistent calls across models).
library(survival)
# Step 1: Make a survival object with time-to-event and censoring indicator.
# Following works with defaults as status = 2 = dead in this dataset.
# Create survival object
lung.Surv <- with(lung, Surv(time=time, event=status))
# Step 2: Fit survival curves to object based on patient sex, plot this.
lung.survfit <- survfit(lung.Surv ~ lung$sex)
print(lung.survfit)
plot(lung.survfit)
# Step 3: Calculate log-rank test for difference in survival objects
lung.survdiff <- survdiff(lung.Surv ~ lung$sex)
print(lung.survdiff)
Now this is all fine and dandy, and I could live with it, but I would like to do better.
So my question concerns step 3. What I would like to do is use the information in the formula from the lung.survfit object to feed into the calculation of the differences in survival curves, i.e. into the call to survdiff. And this is where my domitable [sic] programming skills hit a wall. Below is my current attempt; I'd appreciate any help you can give! Once I get this sorted out I should be able to wrap the solution up in a function.
lung.survdiff <- survdiff(parse(text=(lung.survfit$call$formula)))
## Which returns following:
# Error in survdiff(parse(text = (lung.survfit$call$formula))) :
# The 'formula' argument is not a formula
As I commented above, I actually worked out the answer to this shortly after writing the question.
So step 3 above could be replaced by:
lung.survdiff <- survdiff(formula(lung.survfit$call$formula))
But as Ben Barnes points out in the comment to the question, the formula from the survfit object can be more directly extracted with
lung.survdiff <- survdiff(formula(lung.survfit))
Which is exactly what I wanted and hoped would be available -- thanks Ben!
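For completeness, a minimal sketch of such a function (surv_compare is a hypothetical helper, not from this thread); passing one formula to both survfit and survdiff guarantees the calls stay consistent:
library(survival)
surv_compare <- function(formula, data) {
  fit <- survfit(formula, data=data)    # steps 1 and 2: fit and plot the curves
  print(fit)
  plot(fit)
  diff <- survdiff(formula, data=data)  # step 3: log-rank test with the same formula
  list(fit=fit, logrank=diff)
}
res <- surv_compare(Surv(time, status) ~ sex, data=lung)
print(res$logrank)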