This may or may not be a simple question. Any help is appreciated.
I have come across a paper on GARCH and long memory. It contains a figure, Fig. 1.1, that I have not learned how to reproduce in R. The author says that the ACF has a corresponding hyperbolic function, which matters for deciding whether the data have long memory. I want to apply this technique to my squared returns. The sample data is supplied in this link.
My code is:
data=read.csv("sample.csv",header=T)
lret=100*diff(log(data$CLOSE))
acf(lret^2)
How do we find the hyperbolic function of ACFs and how do we plot it in ACF graph?
ACF with hyperbolic line
Mikosch and Starica stress that the ACF does not follow a hyperbolic function; that figure is devoted to showing how a misuse of statistical tools can lead to wrong conclusions - the data is shown in the other windows of figure 1.1 to be uncorrelated! Anyways, that is a discussion for Cross Validated Stack Exchange.
You can make non-linear regression fits with nls. I have used the ACF of an AR(2)-process with parameters 0.8 and 0.1 as an example (fit will of course be incorrect here but it demonstrates a few of the problems you may experience when working with autocorrelation functions).
set.seed(1e2)
## AR(2) simulation
arsim <- arima.sim(list(ar = c(0.8,0.1)),n = 1000)
## Autocorrelation function of absolute values:
myacf <- acf(abs(arsim),ci = 0)
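## Fit acf = b*x^(-c)
nls_fit <- nls(y ~ b*x^(-c),
               data = data.frame(x = myacf$lag[-1], y = myacf$acf[-1]), # drop lag 0
               start = list(b = 1, c = 1))
## Overlay the fitted curve from lag 1 onwards (x^(-c) is undefined at 0)
pars <- coef(nls_fit)
curve(pars["b"] * x^(-pars["c"]), from = 1, to = max(myacf$lag),
      add = TRUE, col = "red")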
Note how I remove the data at lag 0, since 0^(-c) is undefined. This agrees with what the authors usually do: they ignore lag 0, where the autocorrelation is always 1 and carries no information. (Why plot.acf shows it by default I do not know.)
Mikosch usually suggests removing the iid confidence bands, which are shown by default, when the data are clearly not iid. You can do this with the plot.acf option ci = 0.
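If nls has trouble converging from start = list(b=1,c=1) on your own squared returns, one option (my suggestion, not something taken from the paper) is to get starting values from a log-log regression: under a hyperbolic decay, log(acf) = log(b) - c*log(lag) is linear in log(lag). A minimal sketch, again on simulated AR(2) data as a stand-in; the names d, ll_fit and start_vals are made up for illustration:

```r
set.seed(100)
## Simulated stand-in data; replace abs(arsim) with lret^2 for real returns
arsim <- arima.sim(list(ar = c(0.8, 0.1)), n = 1000)
myacf <- acf(abs(arsim), plot = FALSE)

## Drop lag 0 and keep only positive ACF values, since log() needs them
d <- data.frame(x = myacf$lag[-1], y = myacf$acf[-1])
d <- d[d$y > 0, ]

## Linear fit on the log-log scale: intercept = log(b), slope = -c
ll_fit <- lm(log(y) ~ log(x), data = d)
start_vals <- list(b = unname(exp(coef(ll_fit)[1])),
                   c = unname(-coef(ll_fit)[2]))
start_vals
```

The resulting list can be passed as the start argument of nls. The log-log fit weights small ACF values more heavily than nls does, so treat it as a starting point rather than as the final estimate.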
First of all, I thank you all beforehand for reading this.
I am trying to fit a Standardized T-Student Distribution (i.e. a T-Student with standard deviation = 1) on a series of data; that is: I want to estimate the degrees of freedom via Maximum Likelihood Estimation.
An example of what I need to achieve can be found in the following (simple) Excel file I made:
https://www.dropbox.com/s/6wv6egzurxh4zap/Excel%20Implementation%20Example.xlsx?dl=0
Inside the Excel file, I have an image that contains the formula corresponding to the calculation of the loglikelihood function for the Standardized T Student Distribution. The formula was extracted from a Finance book (Elements of Financial Risk Management - by Peter Christoffersen).
So far, I have tried this with R:
copula.data <- read.csv(file.choose(),header = TRUE)
z1 <- copula.data[,1]
library(fitdistrplus)
ft1 = fitdist(z1, "t", method = "mle", start = list(df = 10))
df1=ft1$estimate[1]
df1
logLik(ft1)
df1 yields the number: 13.11855278779897
logLik(ft1) yields the number: -3600.2918050056487
However, the Excel file yields degrees of freedom of: 8.2962365022727, and a log-likelihood of: -3588.8879 (which is the right answer).
Note: the .csv file that my code reads is the following:
https://www.dropbox.com/s/nnh2jgq4fl6cm12/Data%20for%20T%20Copula.csv?dl=0
Any ideas? Thank you people!
The formula from your spreadsheet (with n, x substituted for the df parameter and the data)
=GAMMALN((n+1)/2)-GAMMALN(n/2)-LN(PI())/2-LN(n-2)/2-1/2*(1+n)*LN(1+x^2/(n-2))
or, exponentiating,
Gamma((n+1)/2) / (sqrt((n-2) pi) Gamma(n/2)) (1+x^2/(n-2))^-((n+1)/2)
?dt gives
f(x) = Gamma((n+1)/2) / (sqrt(n pi) Gamma(n/2)) (1 + x^2/n)^-((n+1)/2)
So the difference lies in those n-2 values in two places in the formula. I don't have enough context to see why the author is defining the t distribution in that different way; there may be some good reason ...
Looking at the negative log-likelihood curve directly, it certainly seems as though the fitdistrplus answer is agreeing with the direct calculation. (It would be very surprising if there were a bug in the dt() function, R's distribution functions are very broadly used and thoroughly tested.)
LL <- function(p,data=z1) {
-sum(dt(data,df=p,log=TRUE))
}
pvec <- seq(6,20,by=0.05)
Lvec <- sapply(pvec,LL)
par(las=1,bty="l")
plot(pvec,Lvec,type="l",
xlab="df parameter",ylab="negative log-likelihood")
## superimpose fitdistr results ...
abline(v=coef(ft1),lty=2)
abline(h=-logLik(ft1),lty=2)
Unless there's something else you're not telling us about the problem definition, it seems to me that R is getting the right answer. (The mean and sd of the data you gave were not exactly equal to 0 and 1 respectively, but they were close; centering and scaling gave an even larger value for the parameter.)
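For what it is worth, the n-2 terms in the spreadsheet formula are what you get when a t variable is rescaled to unit variance: if T ~ t(n) with n > 2, then Z = T*sqrt((n-2)/n) has variance 1 and density sqrt(n/(n-2)) * dt(z*sqrt(n/(n-2)), n). That matches the "standardized" t described in the question, so the two results may simply be maximum likelihood fits of two different densities. Below is a minimal sketch of fitting the standardized version directly. The function names dstd_t and nll_std_t are made up for illustration, and the data are simulated stand-ins rather than the z1 from the question:

```r
## Standardized Student t density (unit variance), valid for df > 2.
## If T ~ t(df), then Z = T * sqrt((df - 2) / df) has variance 1.
dstd_t <- function(x, df, log = FALSE) {
  s <- sqrt(df / (df - 2))              # maps Z back to an ordinary t
  ld <- dt(x * s, df = df, log = TRUE) + log(s)
  if (log) ld else exp(ld)
}

## Negative log-likelihood in the single parameter df
nll_std_t <- function(df, data) -sum(dstd_t(data, df, log = TRUE))

## Simulated stand-in data (hypothetical); use z1 in place of z for real data
set.seed(1)
true_df <- 8
z <- rt(5000, df = true_df) * sqrt((true_df - 2) / true_df)

## One-dimensional minimization over df > 2
fit <- optimize(nll_std_t, interval = c(2.01, 100), data = z)
fit$minimum      # estimated degrees of freedom
-fit$objective   # maximized log-likelihood
```

I have not run this on the linked data, so I cannot say whether it reproduces the spreadsheet numbers exactly.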
I have an issue with the importance / varImpPlot functions of Random Forest, and I hope someone can help.
I tried two versions of the code, but I am confused by the (different) results:
1.)
rffit = randomForest(price~.,data=train,mtry=x,ntree=500)
rfvalpred = predict(rffit,newdata=test)
varImpPlot(rffit)
importance(rffit)
This shows the plot and the “importance” data, but only “IncNodePurity”. The values in the plot also differ from the printed data; I tried the scale argument, but it did not help.
2.)
rf.analyzed_data = randomForest(price~.,data=train,mtry=x,ntree=500,importance=TRUE)
yhat.rf = predict(rf.analyzed_data,newdata=test)
varImpPlot(rf.analyzed_data)
importance(rf.analyzed_data)
In this case no plot is produced at all, and the importance data shows both “%IncMSE” and “IncNodePurity”, but the “IncNodePurity” values differ from those of the first version.
Questions:
1.) Any idea why data is different for “IncNodePurity”?
2.) Any idea why no “%IncMSE” is shown in the first version?
3.) Why no plot is shown in the second version?
Many thanks!!
Ed
1) IncNodePurity is derived from the loss function, and you get that measure for free just by training the model. On the downside, it is a less stable estimate, as results may vary from one model run to the next. It is also more biased, as it favors variables with many levels. I guess the differences you found are due to randomness.
2) The permutation variable importance (VI), %IncMSE, takes a little extra time to compute and is therefore optional. Roughly, the values of each variable need to be shuffled in turn, and every OOB sample needs to be predicted again for every tree and every variable. As the randomForest package is designed, VI must be computed during training, so importance must be set to TRUE. Otherwise varImpPlot cannot plot it, because it has not been computed.
3) Not sure. In this code example I see both plots at least.
library(randomForest)
#data
X = data.frame(replicate(6,rnorm(1000)))
y = with(X, X1^2 + sin(X2*pi) + X3*X4)
train = data.frame(y=y,X=X)
#training
rf1=randomForest(y~.,data=train,importance=FALSE)
rf2=randomForest(y~.,data=train,importance=TRUE)
#plotting importance
varImpPlot(rf1) #plot only with IncNodePurity
varImpPlot(rf2) #two panels, also with %IncMSE
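To make the shuffling idea behind the permutation measure concrete, here is a minimal sketch in base R. It is an illustration of the general technique only, not the randomForest implementation: it uses lm and a held-out set instead of trees and OOB samples, and it reports a plain percentage increase in MSE. The data and all names (dat, perm_inc, mse) are made up for the example:

```r
set.seed(1)
n <- 500
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * X$x1 + 0.5 * X$x2 + rnorm(n)       # x3 is pure noise
dat <- cbind(y = y, X)
train <- dat[1:400, ]
test  <- dat[401:500, ]

fit <- lm(y ~ ., data = train)
mse <- function(d) mean((d$y - predict(fit, newdata = d))^2)
base_mse <- mse(test)

## Shuffle one predictor at a time and record how much the error grows
perm_inc <- sapply(names(X), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])
  100 * (mse(shuffled) - base_mse) / base_mse
})
perm_inc
```

A variable the model relies on (x1 here) should show a large increase, while a noise variable (x3) should stay near zero. Each variable needs its own round of predictions, which is why this measure costs extra compared with IncNodePurity.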
So I've got a data set that I want to parameterise, but it does not follow a Gaussian distribution, so I can't parameterise it in terms of its mean and standard deviation. I want to fit a distribution function with a set of parameters and extract the values of the parameters (e.g. a and b) that give the best fit. I want to do this in the same way as
lm(y~f(x;a,b))
except that I don't have a y, I have a distribution of different x values.
Here's an example. If I assume that the data follows a Gumbel, double exponential, distribution
f(x;u,b) = (1/b) exp(-(z + exp(-z))) [where z = (x-u)/b]:
library(QRM)     # provides rGumbel
library(ggplot2) # provides qplot
rg <- rGumbel(1000) #default parameters are 0 and 1 for u and b
#then plot its distribution
qplot(rg)
#should give a nice skewed distribution
If I assume that I don't know the distribution parameters and I want to perform a best fit of the probability density function to the observed frequency data, how do I go about showing that the best fit is (in this test case), u = 0 and b = 1?
I don't want code that simply maps the function onto the plot graphically, although that would be a nice aside. I want a method that I can repeatedly use to extract variables from the function to compare to others. GGPlot / qplot was used as it quickly shows the distribution for anyone wanting to test the code. I prefer to use it but I can use other packages if they are easier.
Note: This seems to me like a really obvious thing to have been asked before but I can't find one that relates to histogram data (which again seems strange) so if there's another tutorial I'd really like to see it.
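One way to approach this is maximum likelihood on the raw x values rather than a fit to the histogram counts: write down the negative log-likelihood of the density above and minimize it. A minimal sketch in base R follows, with the assumptions stated up front. The Gumbel draws are generated through the inverse CDF so that no extra package is needed, the scale b is optimized on the log scale to keep it positive, and the names gumbel_nll, u0, b0 and est are made up for illustration:

```r
set.seed(42)
## Gumbel(u = 0, b = 1) draws via the inverse CDF: x = u - b * log(-log(U))
x <- -log(-log(runif(1000)))

## Negative log-likelihood of f(x;u,b) = (1/b) exp(-(z + exp(-z)))
gumbel_nll <- function(par, x) {
  u <- par[1]
  b <- exp(par[2])                 # log scale keeps b > 0
  z <- (x - u) / b
  sum(log(b) + z + exp(-z))
}

## Moment-based starting values: sd = b * pi / sqrt(6), mean = u + 0.5772 * b
b0 <- sd(x) * sqrt(6) / pi
u0 <- mean(x) - 0.5772 * b0

fit <- optim(c(u0, log(b0)), gumbel_nll, x = x)
est <- c(u = fit$par[1], b = exp(fit$par[2]))
est
```

With 1000 draws the estimates should land close to u = 0 and b = 1. The same pattern works for any density you can write down, so the estimated parameters can be compared across data sets.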
I have conducted an NMDS analysis and have plotted the output too. However, I am unsure how to actually report the results from R. Which parts from the following output are of most importance? The graph that is produced also shows two clear groups, how are you supposed to describe these results?
MDS.out
Call:
metaMDS(comm = dgge2, distance = "bray")
global Multidimensional Scaling using monoMDS
Data: dgge2
Distance: bray
Dimensions: 2
Stress: 0
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on ‘dgge2’
The most important pieces of information are that stress=0 which means the fit is complete and there is still no convergence. This happens if you have six or fewer observations for two dimensions, or you have degenerate data. You should not use NMDS in these cases. Current versions of vegan will issue a warning with near zero stress. Perhaps you had an outdated version.
I think the best interpretation is just a plot of the principal components. You can use plot and text provided by the vegan package. Here I am creating a ggplot2 version (to get the legend gracefully):
library(vegan)
library(ggplot2)
data(dune)
ord = metaMDS(comm = dune)
ord_spec <- scores(ord, "spec")
ord_spec <- cbind.data.frame(ord_spec,label=rownames(ord_spec))
ord_sites <- scores(ord, "sites")
ord_sites <- cbind.data.frame(ord_sites,label=rownames(ord_sites))
ggplot(data=ord_spec,aes(x=NMDS1,y=NMDS2)) +
geom_text(aes(label=label,col='species')) +
geom_text(data=ord_sites,aes(label=label,col='sites'))
I'm attempting to grab one plot from a multiple plot output. For example
library(mboost);
mod=gamboost(Ozone~.,data=airquality[complete.cases(airquality),]);
plot(mod)
The above creates a plot for each variable's "partial effect". The same could be said for the residual plots created when plotting a linear model (lm). I've attempted to save the output in a list akin to how ggplots can be saved and have spent a few hours searching how to extract just one plot but have failed. Any advice?
As for the context of the question, I'm trying to put the plots into a shiny app and have a variable number of plots show up as output.
Session info is as follows:
R version 2.15.2 (2012-10-26)
Platform: i386-redhat-linux-gnu (32-bit)
Many functions that produce multiple plots also have an argument to select a subset of the plots. In the case of plot.lm it is the which argument. So saying plot(fit, which=1) will only produce one plot.
You can check the mboost documentation to see if there is a similar argument for that plotting function.
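As a small illustration of the which argument for plot.lm, here is a sketch using the airquality data from the question. The model formula is just an example I picked:

```r
## Linear model on the built-in airquality data (rows with NA are dropped by lm)
fit <- lm(Ozone ~ Wind + Temp, data = airquality)

## Draw only the first diagnostic plot (residuals vs fitted)
plot(fit, which = 1)

## Or pick several and arrange them side by side
op <- par(mfrow = c(1, 2))
plot(fit, which = c(1, 2))   # residuals vs fitted, normal Q-Q
par(op)                      # restore the previous layout
```

In a shiny app each renderPlot call could then request a single value of which, so the number of outputs can vary.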
Essentially, @greg-snow gave a proper solution. I will elaborate on this a bit.
In mboost you can use
plot(mod, which = "Day")
to plot the effect of Day only. As we use regular expressions you can even do much more using the argument which. In a model with linear and smooth effects you can for example extract all smooth effects for plotting:
airquality$Month <- as.factor(airquality$Month)
mod <- gamboost(Ozone ~ bbs(Solar.R) + bbs(Wind) + bbs(Temp) + bols(Month) + bbs(Day),
                data = airquality[complete.cases(airquality),])
## now plot bbs effects, i.e., smooth effects:
par(mfrow = c(2,2))
plot(mod, which = "bbs")
## or the linear effect only
par(mfrow = c(1,1))
plot(mod, which = "bols")
You can use any portion of the name (see e.g. names(coef(mod))) to define the effect to be plotted. You can also use integer values to define which effect to plot:
plot(mod, which = 1:2)
Note that this can also be used to extract certain coefficients. E.g.
coef(mod, which = 1)
coef(mod, which = "Solar")
coef(mod, which = "bbs(Solar.R)")
are all the same. For more on how to specify which, both in coef and plot please see our tutorial paper (Hofner et al. (2014), Model-based Boosting in R - A Hands-on Tutorial Using the R Package mboost. Computational Statistics, 29:3-35. DOI 10.1007/s00180-012-0382-5).
We acknowledge that this currently isn't documented in mboost but it is on our todo list (see github issue 14).
(I'm not familiar with GAMboost.)
Looking at the documentation for ?plot.GAMBoost, I see there is an argument called select. I gather you would set this argument to the variable you are interested in, and then you would get just the single plot you want. This is analogous to the which argument in plot.lm that @GregSnow notes.