I am looking for the MATLAB code for "CaliPro: A Calibration Protocol That Utilizes Parameter Density Estimation to Explore Parameter Space and Calibrate Complex Biological Models".
In particular, I am looking for the script "plot_outcomes.m", which plots the experimental data values and each of the model simulations for both predator and prey.
I am running an analysis in which I need to plot a nonlinear relation between two variables. I read about spline regression, where one challenge is to find the number and the position of the knots. So I was happy to read in this book that generalized additive models (GAM) fit "spline models with automated selection of knots". Thus, I started to read how to do GAM analysis in R, and I was surprised to see that the gam() function has a knots argument.
Now I am confused. Does the gam() function in R run a GAM which automatically finds the best knots? If so, why should we provide the knots argument? Also, the documentation says "If they are not supplied then the knots of the spline are placed evenly throughout the covariate values to which the term refers. For example, if fitting 101 data with an 11 knot spline of x then there would be a knot at every 10th (ordered) x value". This does not sound like a very elaborate algorithm for knot selection.
I could not find another source validating the statement that a GAM is a spline model with automated knot selection. So is the gam() function the same as pspline() where degree is 3 (cubic), with the difference that gam() sets some default for the df argument?
The term GAM covers a broad church of models and approaches to solve the smoothness selection problem.
mgcv uses penalized regression spline bases, with a wiggliness penalty to choose the complexity of the fitted smooth(s). As such, it doesn't choose the number of knots as part of the smoothness selection.
You, as the user, choose how large a basis to use for each smooth function (by setting the argument k in the s(), te(), etc. functions used in the model formula). The value of k sets the upper limit on the wiggliness of each smooth function. The penalty measures the wiggliness of the function (it is typically the squared second derivative of the smooth, integrated over the range of the covariate). The model then estimates the coefficients of the basis functions representing each smooth by maximizing the penalized log-likelihood, with the smoothness parameter(s) themselves chosen by a criterion such as GCV or REML. The penalized log-likelihood is the log-likelihood minus a wiggliness penalty for each smooth, weighted by that smooth's smoothness parameter.
In other words, you set the upper limit of expected complexity (wiggliness) for each smooth and, when the model is fitted, the penalties shrink the coefficients behind each smooth so that excess wiggliness is removed from the fit.
In this way, the smoothness parameters control how much shrinkage happens and hence how complex (wiggly) each fitted smooth is.
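To make the role of the smoothness parameter concrete, here is a minimal sketch with mgcv. The data are simulated and the fixed sp values (1e-6 and 1e6) are arbitrary choices for illustration, not recommendations. All three fits use the same basis size k = 20; only the amount of penalization differs.

```r
library(mgcv)  # recommended package, shipped with standard R installations

set.seed(1)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)   # made-up test data
dat <- data.frame(x = x, y = y)

# Same basis (cubic regression spline, k = 20) in every fit.
# fit_auto lets REML pick the smoothness parameter;
# the other two fix it by hand at an extreme value (illustrative only).
fit_auto   <- gam(y ~ s(x, bs = "cr", k = 20), data = dat, method = "REML")
fit_wiggly <- gam(y ~ s(x, bs = "cr", k = 20), data = dat, sp = 1e-6)
fit_stiff  <- gam(y ~ s(x, bs = "cr", k = 20), data = dat, sp = 1e6)

# Total effective degrees of freedom (intercept included) of each fit
edf <- c(auto   = sum(fit_auto$edf),
         wiggly = sum(fit_wiggly$edf),
         stiff  = sum(fit_stiff$edf))
print(round(edf, 2))
```

A tiny smoothness parameter leaves the fit close to the full basis dimension, while a huge one shrinks the smooth towards a straight line; the automatically selected fit should land in between.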
This approach avoids the problems of choosing where to put the knots.
This doesn't mean the bases used to represent the smooths don't have knots. In the cubic regression spline basis you mention, the value you give to k sets the dimensionality of the basis, which implies a certain number of knots. These knots are placed at the boundaries of the covariate involved in the smooth and then spread evenly through the ordered covariate values (as the documentation you quote describes), unless the user supplies a different set of knot locations. However, once the number of knots and their locations are set, thus forming the basis, they are fixed; the wiggliness of the smooth is controlled by the wiggliness penalty, not by varying the number of knots.
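As a sketch of what the knots argument does, the following fits the same cubic regression spline twice, once with default knots and once with user-supplied ones. The data are simulated, and my_knots (evenly spaced over the range of x) is a hypothetical choice made purely for illustration. The knot locations are read from the xp component of the fitted smooth.

```r
library(mgcv)

set.seed(2)
dat <- data.frame(x = runif(101))
dat$y <- sin(2 * pi * dat$x) + rnorm(101, sd = 0.3)   # made-up test data

k <- 11  # basis dimension; for the "cr" basis this is also the number of knots

# Default knot placement
fit_default   <- gam(y ~ s(x, bs = "cr", k = k), data = dat, method = "REML")
default_knots <- as.numeric(fit_default$smooth[[1]]$xp)

# User-supplied knots: evenly spaced over the observed range (hypothetical choice)
my_knots   <- seq(min(dat$x), max(dat$x), length.out = k)
fit_custom <- gam(y ~ s(x, bs = "cr", k = k), data = dat,
                  knots = list(x = my_knots), method = "REML")
custom_knots <- as.numeric(fit_custom$smooth[[1]]$xp)

# Compare the default knots with every 10th ordered x value
print(cbind(default = default_knots,
            every_10th = sort(dat$x)[seq(1, 101, by = 10)],
            custom = custom_knots))
```

In both fits the knots stay where they were placed; what the fitting procedure estimates is the smoothness parameter, so supplying knots is a way to change the basis, not a replacement for the penalty.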
You also have to be very careful with R, as there are two packages providing a gam() function. The original gam package provides an R version of the software and approach described in the original GAM book by Hastie and Tibshirani. This package doesn't fit GAMs using penalized regression splines as I describe above.
R ships with the mgcv package, which fits GAMs using penalized regression splines as I outline above. You control the size (dimensionality) of the basis for each smooth using the argument k. There is no argument df.
Like I said, GAMs are a broad church and there are many ways to fit them. It is important to know what software you are using and what methods that software employs to estimate the GAM. Once you have that information in hand, you can home in on specific material for that particular approach to estimating GAMs. In this case, you should look at Simon Wood's book Generalized Additive Models: An Introduction with R, as it describes the mgcv package and is written by the author of that package.
I am working on quantile forecasting with time-series data. The model I am using is ARIMA(1,1,2)-ARCH(2) and I am trying to get quantile regression estimates of my data.
So far, I have found the "quantreg" package for performing quantile regression, but I have no idea how to specify an ARIMA-ARCH model as the model formula in the function rq.
The rq function seems to work for regressions with dependent and independent variables, but not for time series.
Is there some other package in R that accepts time-series models and performs quantile regression? Any advice is welcome. Thanks.
I just put an answer on the Data Science forum.
It basically says that most of the ready-made packages use so-called exact tests, which rest on an assumption about the distribution (independent, identically distributed Gaussian errors, or something broader).
You also have a family of resampling methods, in which you simulate a sample with a distribution similar to that of your observed sample, fit your ARIMA(1,1,2)-ARCH(2) model, and repeat the process a great number of times. Then you analyze this large number of forecasts and measure (as opposed to compute) your confidence intervals.
The resampling methods differ in the way they generate the simulated samples. The most used are:
The Jackknife: in which you "forget" one point, that is, you simulate n samples of size n-1 (if n is the size of the observed sample).
The Bootstrap: in which you simulate a sample by taking n values of the original sample with replacement: some will be taken once, some twice or more, some never,...
It is a (not easy) theorem that the expectation of the confidence intervals, like that of most of the usual statistical estimators, is the same on the simulated samples as on the original sample. The difference is that you can measure them with a great number of simulations.
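The following is a deliberately simplified sketch of the resampling idea, using base R only. It is a residual bootstrap of the one-step-ahead forecast: the ARIMA(1,1,2) model is fitted once and its residuals are resampled with replacement. It does not refit the model on each resample and it ignores the ARCH(2) part, so it illustrates "measuring" forecast quantiles rather than implementing the full procedure described above. The series y is simulated.

```r
set.seed(42)

# Simulated ARIMA(1,1,2) series standing in for the real data (assumption)
y <- arima.sim(list(order = c(1, 1, 2), ar = 0.5, ma = c(0.3, 0.2)), n = 300)

# Fit the mean model once
fit   <- arima(y, order = c(1, 1, 2), method = "ML")
point <- as.numeric(predict(fit, n.ahead = 1)$pred)   # one-step point forecast
res   <- as.numeric(residuals(fit))

# Residual bootstrap: add resampled residuals to the point forecast
B <- 5000
boot_forecasts <- point + sample(res, size = B, replace = TRUE)

# "Measured" forecast quantiles
q <- quantile(boot_forecasts, probs = c(0.05, 0.5, 0.95))
print(q)
```

A fuller version would also simulate a new series from each resample and refit the model, so that parameter uncertainty is reflected in the quantiles.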
Hello and welcome to Stack Overflow. Please take some time to read the help page, especially the sections named "What topics can I ask about here?" and "What types of questions should I avoid asking?". More importantly, please read the Stack Overflow question checklist. You might also want to learn about Minimal, Complete, and Verifiable Examples.
I can try to address your question, although this is hard since you don't provide any code or data. Also, I guess that by "put ARIMA-ARCH models" you actually mean that you want to make an integrated series stationary using an ARIMA(1,1,2) filter followed by an ARCH(2) filter.
For an overview of the R time-series capabilities you can refer to the CRAN task list.
You can easily apply these filters in R with an appropriate function.
For instance, you could use the Arima() function from the forecast package, then compute the residuals with residuals() from the stats package. Next, you can use this filtered series as input for the garch() function from the tseries package. Other approaches are of course possible. Finally, you can apply quantile regression to this filtered series. For instance, you can check out the dynrq() function from the quantreg package, which allows time-series objects in the data argument.
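The steps above can be sketched roughly as follows. Several things here are assumptions: the series is simulated, stats::arima() stands in for forecast::Arima() so that the first step runs without extra packages, and plain quantreg::rq() on a lagged copy of the series stands in for dynrq(). The tseries and quantreg steps only run if those packages are installed.

```r
set.seed(7)

# Simulated integrated series standing in for the real data (assumption)
y <- arima.sim(list(order = c(1, 1, 2), ar = 0.5, ma = c(0.3, 0.2)), n = 500)

# Step 1: ARIMA(1,1,2) filter; keep the residuals
fit_arima <- arima(y, order = c(1, 1, 2), method = "ML")
res <- as.numeric(residuals(fit_arima))

# Step 2: ARCH(2) filter on the residuals.
# In tseries::garch(), order = c(GARCH order, ARCH order), so ARCH(2) is c(0, 2).
z <- res
if (requireNamespace("tseries", quietly = TRUE)) {
  fit_arch <- tseries::garch(res, order = c(0, 2),
                             control = tseries::garch.control(trace = FALSE))
  z <- as.numeric(residuals(fit_arch))   # standardized residuals
  z <- z[!is.na(z)]                      # the first values are NA
}

# Step 3: quantile regression of the filtered series on its own first lag
filtered <- data.frame(z = z[-1], z_lag1 = z[-length(z)])
if (requireNamespace("quantreg", quietly = TRUE)) {
  fit_rq <- quantreg::rq(z ~ z_lag1, tau = c(0.05, 0.5, 0.95), data = filtered)
  print(coef(fit_rq))
}
```

Regressing on the first lag is only one possible specification; the choice of regressors for the quantile regression depends on what the forecast is meant to condition on.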
I'm interested in estimating probability density functions in quite high dimension (around 15 or even more). I've heard about Dirichlet processes. Do you know an R package which implements this kind of method?
Thank You
I think you can use the "DPpackage" (http://cran.r-project.org/web/packages/DPpackage/index.html) that provides some Bayesian nonparametric density estimation functions. You can perform the Dirichlet mixture density estimation with 'DPdensity' function.
I am a novice to R and statistics. Could something like this be done in R?
I want to determine the density estimates of two samples (two vectors).
I have done this using R and obtained two density curves for the two samples using kernel density estimation.
Is there any way to quantitatively compare how similar or dissimilar the density estimates of the two samples are?
I am trying to find out which data sample has a distribution similar to a particular distribution.
I am using the R language. Can somebody please help?
You can use the Kolmogorov-Smirnov test (ks.test) to compare two distributions. The Cramer-von Mises test is another one. There is a PDF, "Fitting Distributions with R", which also lists other available tests (although the nortest package that it uses only tests for normality).
Apprentice Queue is right about using the Kolmogorov-Smirnov test, but I wanted to add a warning: don't use it on its own. You should visually compare the distributions as well, either with two kernel density plots or histograms, or with a qqplot. Human brains are very good at playing spot-the-difference.
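A minimal sketch of both suggestions, using base R only. The two samples are simulated (the second one is shifted by 1), so the numbers are purely illustrative; the plots are only drawn in an interactive session.

```r
set.seed(123)
a <- rnorm(200)             # first sample (simulated)
b <- rnorm(200, mean = 1)   # second sample, shifted by 1 (simulated)

# Two-sample Kolmogorov-Smirnov test
ks <- ks.test(a, b)
print(ks$statistic)   # largest vertical gap between the two empirical CDFs
print(ks$p.value)

# Visual checks: overlaid kernel densities and a quantile-quantile plot
if (interactive()) {
  plot(density(a), main = "Kernel density estimates")
  lines(density(b), lty = 2)
  qqplot(a, b, main = "Q-Q plot of a against b")
  abline(0, 1)        # points near this line suggest similar distributions
}
```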
You can try calculating the Earth mover's distance between the two samples.
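For two one-dimensional samples of equal size, the Earth mover's distance reduces to the average absolute difference between the sorted values, which needs no extra package. The helper name emd_1d below is made up for this sketch, and it deliberately handles only the equal-size case.

```r
# Earth mover's (1-Wasserstein) distance between two 1-D samples of equal size.
# emd_1d is a hypothetical helper name, not a function from any package.
emd_1d <- function(x, y) {
  stopifnot(length(x) == length(y))
  mean(abs(sort(x) - sort(y)))
}

set.seed(1)
x <- rnorm(500)

d_same  <- emd_1d(x, x)       # identical samples
d_shift <- emd_1d(x, x + 1)   # every value moved by 1
print(c(same = d_same, shifted = d_shift))
```

Unlike a test statistic, the result is in the units of the data: shifting every value by 1 gives a distance of 1.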
What functions do you use in R to fit a curve to your data and test how well that curve fits? What results are considered good?
Just the first part of that question can fill entire books. Just some quick choices:
lm() for standard linear models
glm() for generalised linear models (eg for logistic regression)
rlm() from package MASS for robust linear models
lmrob() from package robustbase for robust linear models
loess() for non-linear / non-parametric models
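As a quick illustration of some of the functions listed above, here is a sketch on simulated data. The variable names and the data-generating choices (a linear trend with a few outliers, a binary outcome, a sine curve) are made up for the example; lmrob() is left out only because robustbase is not part of a default R installation.

```r
set.seed(10)
n <- 200
x <- runif(n, 0, 10)

dat <- data.frame(
  x       = x,
  y       = 2 + 3 * x + rnorm(n, sd = 0.5),        # linear trend
  success = rbinom(n, 1, plogis(-2 + 0.5 * x)),    # binary outcome
  wavy    = sin(x) + rnorm(n, sd = 0.2)            # clearly non-linear
)
dat$y[1:5] <- dat$y[1:5] + 30                      # a few gross outliers

fit_lm    <- lm(y ~ x, data = dat)                           # ordinary least squares
fit_rlm   <- MASS::rlm(y ~ x, data = dat)                    # robust to the outliers
fit_glm   <- glm(success ~ x, family = binomial, data = dat) # logistic regression
fit_loess <- loess(wavy ~ x, data = dat)                     # local regression

print(rbind(lm = coef(fit_lm), rlm = coef(fit_rlm)))
print(coef(fit_glm))
```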
Then there are domain-specific models, e.g. for time series, micro-econometrics, mixed effects and much more. Several of the Task Views, e.g. Econometrics, discuss this in more detail. As for goodness of fit, that is also something one can easily spend an entire book discussing.
The workhorses of canonical curve fitting in R are lm(), glm() and nls(). To me, goodness-of-fit is a subproblem in the larger problem of model selection. In fact, using goodness-of-fit incorrectly (e.g., via stepwise regression) can give rise to seriously misspecified models (see Harrell's book "Regression Modeling Strategies"). Rather than discussing the issue from scratch, I recommend Harrell's book for lm and glm. Venables and Ripley's bible is terse, but still worth reading. "Extending the Linear Model with R" by Faraway is comprehensive and readable. nls is not covered in these sources, but "Nonlinear Regression with R" by Ritz & Streibig fills the gap and is very hands-on.
The nls() function (http://sekhon.berkeley.edu/stats/html/nls.html) is pretty standard for nonlinear least-squares curve fitting. The sum of the squared residuals (often loosely called chi squared) is the metric that is minimized in that case, but it is not normalized, so you can't readily use it to determine how good the fit is. The main thing you should ensure is that your residuals are normally distributed. Unfortunately I'm not sure of an automated way to do that.
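A small sketch of an nls() fit on simulated exponential-decay data. The model form, the true parameter values and the starting values are all made up for the example. It also shows the residual sum of squares that nls() minimizes, and one possible automated normality check on the residuals, the Shapiro-Wilk test from base R.

```r
set.seed(3)
x <- seq(0, 10, length.out = 100)
y <- 5 * exp(-0.3 * x) + rnorm(100, sd = 0.2)   # simulated decay curve
dat <- data.frame(x = x, y = y)

# Nonlinear least squares; starting values are rough guesses
fit_nls <- nls(y ~ a * exp(-b * x), data = dat, start = list(a = 4, b = 0.2))
print(coef(fit_nls))

res <- as.numeric(residuals(fit_nls))
rss <- sum(res^2)                        # the quantity nls() minimizes
sigma_hat <- summary(fit_nls)$sigma      # residual standard error (scale-aware)

# One automated check of residual normality (a small p-value suggests non-normality)
sw <- shapiro.test(res)
print(c(rss = rss, sigma = sigma_hat, shapiro_p = sw$p.value))
```

The residual standard error is on the scale of the response, which makes it easier to interpret than the raw sum of squares.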
The Quick R site has a reasonably good summary of basic functions used for fitting models and testing the fits, along with sample R code.
"The main thing you should ensure is that your residuals are normally distributed. Unfortunately I'm not sure of an automated way to do that."
qqnorm() could probably be modified to find the correlation between the sample quantiles and the theoretical quantiles. Essentially, this would just be a numerical interpretation of the normal quantile plot. Perhaps providing several values of the correlation coefficient for different ranges of quantiles could be useful. For example, if the correlation coefficient is close to 1 for the middle 97% of the data and much lower at the tails, this tells us the distribution of residuals is approximately normal, with some funniness going on in the tails.
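This idea can be sketched without modifying qqnorm() itself, since qqnorm(plot.it = FALSE) already returns the theoretical and sample quantiles. The helper qq_cor and its lower/upper arguments are made up for this sketch, and the two "residual" vectors are simulated stand-ins.

```r
# Correlation between theoretical and sample quantiles of a normal Q-Q plot,
# optionally restricted to a central range of the distribution.
# qq_cor, lower and upper are hypothetical names introduced for this sketch.
qq_cor <- function(res, lower = 0, upper = 1) {
  qq <- qqnorm(res, plot.it = FALSE)
  keep <- qq$x >= qnorm(lower) & qq$x <= qnorm(upper)
  cor(qq$x[keep], qq$y[keep])
}

set.seed(11)
normal_res <- rnorm(500)        # stand-in for well-behaved residuals
heavy_res  <- rt(500, df = 2)   # stand-in for heavy-tailed residuals

cor_normal <- qq_cor(normal_res)
cor_heavy  <- qq_cor(heavy_res)
cor_heavy_middle <- qq_cor(heavy_res, lower = 0.015, upper = 0.985)  # middle 97%

print(c(normal = cor_normal, heavy = cor_heavy, heavy_middle = cor_heavy_middle))
```

Comparing the full-range value with the middle-97% value for the same residuals gives the kind of "where does it go wrong" reading described above.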
It is best to keep things simple and see whether linear methods work well enough. You can judge your goodness of fit in general by looking at the R squared and the F statistic together, never separately. Adding variables to your model that have no bearing on your dependent variable can increase R squared, so you must also consider the F statistic.
You should also compare your model to other nested, or simpler, models. Do this using a log-likelihood ratio test, so long as the dependent variable is the same.
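Both points can be checked on simulated data with base R. In this sketch the variable junk is unrelated to y by construction; all names and values are made up for the example.

```r
set.seed(5)
n <- 100
x1   <- rnorm(n)
junk <- rnorm(n)                   # unrelated to y by construction
y    <- 1 + 2 * x1 + rnorm(n)

fit_small <- lm(y ~ x1)
fit_big   <- lm(y ~ x1 + junk)     # nested: adds one irrelevant variable

# R squared never decreases when a variable is added; adjusted R squared can
r2  <- c(small = summary(fit_small)$r.squared,
         big   = summary(fit_big)$r.squared)
adj <- c(small = summary(fit_small)$adj.r.squared,
         big   = summary(fit_big)$adj.r.squared)
print(rbind(r.squared = r2, adj.r.squared = adj))

# Overall F statistic of each model
print(summary(fit_small)$fstatistic)
print(summary(fit_big)$fstatistic)

# Log-likelihood ratio test of the small model against the big one
lr_stat <- as.numeric(2 * (logLik(fit_big) - logLik(fit_small)))
df_diff <- attr(logLik(fit_big), "df") - attr(logLik(fit_small), "df")
p_lrt   <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
print(c(lr_stat = lr_stat, df = df_diff, p_value = p_lrt))

# For nested linear models, anova() gives the equivalent F test
print(anova(fit_small, fit_big))
```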
The Jarque–Bera test is good for testing the normality of the residual distribution.
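The test statistic is simple enough to compute by hand from the sample skewness and kurtosis, as in the sketch below; the function name jarque_bera is made up here, and packages such as tseries provide a ready-made jarque.bera.test(). The "residuals" are simulated from an exponential distribution, so they are clearly non-normal.

```r
# Jarque-Bera statistic: n/6 * (skewness^2 + (kurtosis - 3)^2 / 4),
# compared against a chi-squared distribution with 2 degrees of freedom.
# jarque_bera is a hypothetical helper written for this sketch.
jarque_bera <- function(res) {
  n  <- length(res)
  d  <- res - mean(res)
  s2 <- mean(d^2)
  skew <- mean(d^3) / s2^1.5
  kurt <- mean(d^4) / s2^2
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  c(statistic = stat, p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}

set.seed(9)
skewed_res <- rexp(1000)          # strongly skewed stand-in for residuals
jb <- jarque_bera(skewed_res)
print(jb)
```

A small p-value is evidence against normality; the chi-squared approximation is asymptotic, so the test is best suited to reasonably large samples.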