Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
I'm using a GAMM and want to extract confidence intervals for the variance-covariance components of the model, but I'm having problems doing so.
Code:
fishdata <- read.csv("http://dl.dropbox.com/s/4w0utkqdhqribl4/fishdata.csv",
                     header = TRUE)
attach(fishdata)
require(mgcv)
gammdl <- gamm(inlandfao ~ s(marinefao),
               correlation = corAR1(form = ~ 1 | year),
               family = poisson())
summary(gammdl$gam)
intervals(gammdl$lme)
but the final line of code returns
Error in intervals.lme(gammdl$lme) :
cannot get confidence intervals on var-cov components: Non-positive definite approximate variance-covariance
I don't see why this error message appears.
I'm trying to replicate what Simon Wood does on page 316 of Generalized Additive Models: An Introduction with R using my data.
This is really an issue of statistical computation rather than R usage. Essentially, the error means that whilst the computations may have converged to a fitted model, there are problems with that model. What it boils down to is that the Hessian computed at the fitted values of the model is not positive definite, and hence the maximum likelihood estimate of the covariance matrix is not available. This often occurs where the log-likelihood function becomes flat, so that no further progress can be made in the optimisation, and it usually arises because the model is too complex for the given data.
You could try fitting the model with and without the AR(1) and compare them using a generalised likelihood ratio test (via anova()). I would guess that the AR(1) nested within year is redundant - i.e. the parameter $\phi$ is effectively 0 - and it is this that is causing the error.
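A minimal sketch of that comparison, using the data and model from the question (object names are illustrative; note that for a Poisson gamm the fit is via PQL, so treat the likelihood ratio test as approximate):
gamm_ar1 <- gamm(inlandfao ~ s(marinefao),
                 correlation = corAR1(form = ~ 1 | year),
                 family = poisson(), data = fishdata)
gamm_iid <- gamm(inlandfao ~ s(marinefao),
                 family = poisson(), data = fishdata)
# compare the underlying lme fits; a negligible difference suggests
# the AR(1) parameter phi is effectively 0
anova(gamm_iid$lme, gamm_ar1$lme)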
Actually, now it also occurs to me that you are trying to model the variable inlandfao as a smooth function of itself s(inlandfao). There is something wrong here - you'd expect a perfect fit without needing a smooth function. You need to fix that; you can't model the response as a function of itself.
Closed. This question needs debugging details. It is not currently accepting answers.
I am using a generalized additive model built using the bam() function in the mgcv R package, to predict probabilities for a binary response.
I seem to get slightly different predictions for the same input data, depending on the make-up of the newdata table provided, and don't understand why.
The model was built using a formula like this:
model <- bam(response ~ categorical_predictor1 + s(continuous_predictor, bs = 'tp'),
             data = data,
             family = "binomial",
             select = TRUE,
             discrete = TRUE,
             nthreads = 16)
I have several more categorical and continuous predictors, but to save space I only mention two in the above formula.
I then predict like this:
predictions <- predict(model,
                       newdata = newdata,
                       type = "response")
I want to make predictions for about 2.5 million rows, but during my testing I predicted for a subset of 250,000.
Each time I use the model to predict for that subset (i.e. newdata = subset) I get the same outputs, so this is reproducible. However, if I predict for that same subset within the full table of 2.5 million rows (i.e. newdata = full_data), then I get slightly different predictions for that subset of 250,000 than when I predict them separately.
I always thought that each row is predicted in turn, based on the predictors provided, so I can't understand why the predictions change with the context of the newdata. This does not happen if I predict using a standard glm, or a random forest, so I assume it's something specific to GAMs or the mgcv package.
Sorry, I haven't been able to provide a reproducible example - my datasets are large, and I'm not sure if the same thing would happen with a small example dataset.
From the predict.bam help:
"When discrete=TRUE the prediction data in newdata is discretized in the same way as is done when using discrete fitting methods with bam. However the discretization grids are not currently identical to those used during fitting. Instead, discretization is done afresh for the prediction data. This means that if you are predicting for a relatively small set of prediction data, or on a regular grid, then the results may in fact be identical to those obtained without discretization. The disadvantage to this approach is that if you make predictions with a large data frame, and then split it into smaller data frames to make the predictions again, the results may differ slightly, because of slightly different discretization errors."
You likely can't switch to gam() or fit with discrete = FALSE because you need the speed, so you'll have to accept some small differences in exchange. From the help, it sounds like you might be able to minimize this by choosing the subsets carefully, but you won't be able to eliminate it completely.
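If you want to quantify the discrepancy, something along these lines should do it (a sketch; model, subset and full_data are the objects described in the question, and it assumes the 250,000 rows are the first rows of the full table):
p_alone <- predict(model, newdata = subset, type = "response")
p_full  <- predict(model, newdata = full_data, type = "response")
idx <- seq_len(nrow(subset))       # adjust if the subset sits elsewhere
max(abs(p_alone - p_full[idx]))    # size of the discretization effect
# recent mgcv versions also let you turn discretization off at predict time,
# at the cost of speed (check ?predict.bam for your version):
# p_exact <- predict(model, newdata = subset, type = "response", discrete = FALSE)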
Closed. This question needs details or clarity. It is not currently accepting answers.
I get different standard error values depending on how I specify a linear regression model (for a categorical variable) in R.
For example,
library(alr4)
library(tidyverse)
FactorModel <- lm(I(log(acrePrice)) ~ I(as.factor(year)), data = MinnLand)
summary(FactorModel)
However, if I specify the regression model with "+ 0", the standard errors (and consequently other values) are different.
FactorModel1 <- lm(I(log(acrePrice)) ~ I(as.factor(year)) + 0, data = MinnLand)
summary(FactorModel1)
How do I know which one is the correct standard error? I kind of understand how to interpret the estimates. For instance, in the first model, if I want to know the estimate for 2003, it's the intercept plus the coefficient for 2003. In the second model, however, the coefficient for 2003 is already the actual value.
Is there a similar interpretation for standard error?
The two specifications actually describe the same fit: a factor with all its levels spans the intercept, so the fitted values, residuals, and residual standard error are identical in both models (anova(FactorModel1, FactorModel) will confirm there is no difference between the fits). What differs is the meaning of the coefficients, and therefore their standard errors. With an intercept, the intercept estimates the mean log price for the baseline year and the other coefficients estimate differences from that baseline; with + 0, each coefficient estimates the mean for its year directly. The standard error of a difference is not the standard error of a mean, so the two summaries report different numbers, and other summary statistics (R-squared, F) also change because without an intercept they are computed around zero rather than around the mean. Neither set of standard errors is wrong; use the parameterization whose quantities you actually want to report.
Use predict() to get predictions, but it will be easier if you first put the transformed variables in the data frame and use those, rather than trying to transform them inside the formula.
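To see concretely that the two parameterizations describe the same fit, you can compare fitted values and map one set of coefficients onto the other (a sketch using the two models from the question):
# identical fitted values and residuals -> identical residual standard error
all.equal(fitted(FactorModel), fitted(FactorModel1))
# no-intercept coefficients are the year means; the intercept model stores
# the baseline year plus differences from it
coef(FactorModel1)[1]                            # mean for the baseline year
coef(FactorModel)[1]                             # intercept = same baseline mean
coef(FactorModel1)[-1] - coef(FactorModel1)[1]   # differences from baseline
coef(FactorModel)[-1]                            # same differences, other model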
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
I am new to GAMMs and GAMs, so this question may be a bit basic; I'd appreciate your help on this very much.
I am using the following code:
M.gamm <- gamm(bsigsi ~ s(summodpa, sed, k = 1, fx = TRUE, bs = "tp") +
                 s(sumlightpa, sed, k = 1, fx = TRUE, bs = "tp"),
               random = list(school = ~1), method = "ML",
               na.action = na.omit, data = Pilot_fitbit2)
The code runs, but gives me this feedback:
Warning messages:
1: In smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
  basis dimension, k, increased to minimum possible
2: In smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
  basis dimension, k, increased to minimum possible
Questions:
My major question, however, is how I can get an AIC or BIC from this.
I've tried BIC(M.gamm$gam) and BIC(M.gamm$lme), since a gamm fit consists of both parts (lme and gam). For the latter (the lme part) I do get a value, but for the former I don't. Does anyone know why, and how I can get one?
The issue is that I would like to compare this value to the BIC of a gam model, and I am not sure which one (BIC(M.gamm$lme) or BIC(M.gamm$gam)) would be the correct one to use. I can derive an AIC and BIC for the gam and lme models themselves.
If I can get the AIC or BIC for the gamm model, how can I know whether I can trust the results? What do I need to be careful about so that I interpret the results correctly? Currently I am using ML in all models and use the same package (mgcv) to estimate the lme, gam, and gamm models, to establish comparability.
Any help/ advice or ideas on this would be greatly appreciated!!
Best wishes,
Noemi
These warnings come as a result of requesting a smooth basis of a single function for each of your two smooths; this doesn't make sense, as both such bases would only contain the equivalent of constant functions, which are unidentifiable given that you have another constant term (the intercept) in the model. Once mgcv applies identifiability constraints, the two smooths would be dropped from the model entirely.
Hence the warnings: mgcv didn't do what you asked for. Instead it set k to the smallest value possible. Set k to something larger; you might as well leave it at the default and not specify it in s() at all if you want a low-rank smooth. Also, unless you really want an unpenalized spline fit, don't use fx = TRUE.
I'm not really familiar with any theory for BIC applied to GAM(M)s that corrects for smoothness selection. The AIC for gam() models estimated using REML smoothness selection does have some theory behind it, including a recent paper by Simon Wood and colleagues.
The mgcv FAQ has the following two things to say:
How can I compare gamm models? In the identity link normal errors case, AIC and hypothesis-testing based methods are fine. Otherwise it is best to work out a strategy based on the summary.gam output. Alternatively, simple random effects can be fitted with gam, which makes comparison straightforward. Package gamm4 is an alternative, which allows AIC-type model selection for generalized models.
When using gamm or gamm4, the reported AIC is different for the gam object and the lme or lmer object. Why is this? There are several reasons for this. The most important is that the models being used are actually different in the two representations. When treating the GAM as a mixed model, you are implicitly assuming that if you gathered a replicate dataset, the smooths in your model would look completely different to the smooths from the original model, except for having the same degree of smoothness. Technically you would expect the smooths to be drawn afresh from their distribution under the random effects model. When viewing the gam from the usual penalized regression perspective, you would expect smooths to look broadly similar under replication of the data. i.e. you are really using Bayesian model for the smooths, rather than a random effects model (it's just that the frequentist random effects and Bayesian computations happen to coincide for computing the estimates). As a result of the different assumptions about the data generating process, AIC model comparisons can give rather different answers depending on the model adopted. Which you use should depend on which model you really think is appropriate. In addition the computations of the AICs are different. The mixed model AIC uses the marginal likelihood and the corresponding number of model parameters. The gam model uses the penalized likelihood and the effective degrees of freedom.
So I'd probably stick to AIC and not use BIC. I'd think about which interpretation of the GAM(M) I was most interested in. I'd also likely fit random effects as simple as these using gam(): an equivalent model would include + s(school, bs = 're') in the main formula and drop the random argument, e.g.
gam(bsigsi ~ s(summodpa, sed) + s(sumlightpa, sed) +
s(school, bs = 're'), data = Pilot_fitbit2,
method = 'REML')
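Once the random effect is written this way, both candidate models are ordinary gam() fits and can be compared directly, e.g. with AIC() (a sketch; the second, simpler model is purely illustrative):
m1 <- gam(bsigsi ~ s(summodpa, sed) + s(sumlightpa, sed) +
            s(school, bs = 're'),
          data = Pilot_fitbit2, method = 'REML')
m2 <- gam(bsigsi ~ s(summodpa) + s(sumlightpa) +
            s(school, bs = 're'),
          data = Pilot_fitbit2, method = 'REML')
AIC(m1, m2)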
Do be careful with 2D isotropic smooths; sed, summodpa, and sumlightpa need to be in the same units and to have the same degree of wiggliness within each smooth. If they aren't in the same units, or have different wigglinesses, use te() instead of s() for the 2D terms.
Also be careful with variables that appear in two or more smooths like this; mgcv will do its best to make the model identifiable, but you can easily run into computational problems even so. A better modelling approach would be to estimate the marginal effects of sed and the other terms plus their second-order interactions, by decomposing the effects in the two 2D smooths as follows:
gam(bsigsi ~ s(sed) + s(summodpa) + s(sumlightpa) +
ti(summodpa, sed) + ti(sumlightpa, sed) +
s(school, bs = 're'), data = Pilot_fitbit2,
method = 'REML')
where the ti() smooths are tensor product interaction bases from which the main effects of the two marginal variables have been removed. Hence you can treat them as pure smooth interaction terms. In this way, the main effect of sed is contained in a single smooth term.
Closed. This question needs details or clarity. It is not currently accepting answers.
So I've read about fitting a Student t-distribution in R using MLE, but it always seems to be the case that the location and scale parameters are of the utmost interest. I just want to fit a Student t (as described by Wikipedia) to data that is usually considered to be distributed like a standard normal, so I can assume the mean is 0 and the scale is 1. How can I do this in R?
If you "assume" your location and scale parameters, you are not "fitting" a distribution to the data, you are simply assuming that the data follows a certain distribution.
"Fitting" a distribution to some data means finding "appropriate" parameters of this distribution so that it "accurately" models your data. Maximum likelihood estimation is a method to find point-estimates of the parameters based on some data.
The easiest way to fit a classic distribution such as student-t is to use the function fitdistr from the MASS package, which uses MLE.
Assuming you have some data:
library("MASS")
# generating some data following a normal dist
x <- rnorm(100)
# fitting a t dist, although this makes little sense here
# since you know x comes from a normal dist...
fitdistr(x, densfun="t", df=length(x)-1)
Note that the Student t density is parameterised by location m, scale s, and the degrees of freedom df. Here df is not estimated but fixed in the call, in this case at length(x) - 1.
The output of fitdistr contains the fitted values of m and s. If you store the output in an object, you can programmatically access all sorts of information about the fit.
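For example (a sketch continuing the code above; estimate, sd and loglik are documented components of a fitdistr result):
fit <- fitdistr(x, densfun = "t", df = length(x) - 1)
fit$estimate   # fitted location m and scale s
fit$sd         # their standard errors
fit$loglik     # maximised log-likelihood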
The question now is whether fitting a t dist is what you really want to do. If the data is normal, why would you want to fit a t dist?
You are looking for the function t.test.
x <- rnorm(100)
t.test(x)
I put a sample here
EDIT
I think I misunderstood your question slightly. Use t.test for hypothesis testing about the location of your population density (here a standard normal).
As for fitting the parameters of a t distribution, you should not do this unless your data comes from a t distribution. If you know your data comes from a standard normal distribution, you already know the location and scale parameters, so what's the sense?
Closed. This question is off-topic. It is not currently accepting answers.
I am working with R to create some linear models (using lm()) on data that I have collected. I am not very good at statistics and find it difficult to understand the summary of the linear model that R generates.
I mean the residual values: Min, 1Q, Median, 3Q, Max.
My question is: what do these values mean, and how can I tell from them whether my model is good?
These are some of the residual values that I have:
Min: -4725611, 1Q: -2161468, Median: -1352080, 3Q: 3007561, Max: 6035077
One fundamental assumption of linear regression (and of the associated hypothesis tests in particular) is that the residuals are normally distributed with expected value zero. A slight violation of this assumption is not problematic, as the method is fairly robust. However, the distribution should be at least roughly symmetric.
The best way to judge whether the assumption of normality is fulfilled is to plot the residuals. There are many diagnostic plots available; e.g., you can do the following:
fit <- lm(y~x)
plot(fit)
This will give you, among others, a plot of residuals vs. fitted values and a QQ-plot of standardized residuals. The quantiles printed by summary(fit) are useful for a quick check of whether the residuals are symmetric. There, the min and max values are not that important, but the median should be close to zero and the first and third quartiles should have similar absolute values. Of course, this check only makes sense if you have a sufficient number of observations.
If the residuals are not normally distributed, there are several ways to deal with that, e.g.,
transformations (e.g. a log transformation, sketched after this list),
generalized linear models,
or a non-linear model could be more appropriate.
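For instance, a minimal sketch of the first option, assuming a strictly positive response (y and x are placeholders for your variables):
fit_log <- lm(log(y) ~ x)   # refit on the log scale
plot(fit_log)               # re-check the residual diagnostics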
There are many good books on linear regression, and even some good web tutorials. I suggest reading at least one of them carefully.