Closed. This question is off-topic. It is not currently accepting answers.
I am working with R to create some linear models (using lm()) on data that I have collected. I am not that good at statistics and am finding it difficult to understand the summary of the linear model that R generates.
I mean the residual values: Min, 1Q, Median, 3Q, Max.
My question is: what do these values mean, and how can I tell from them whether my model is good?
These are some of the residual values that I have:
Min: -4725611, 1Q: -2161468, Median: -1352080, 3Q: 3007561, Max: 6035077
One fundamental assumption of linear regression (and of the associated hypothesis tests in particular) is that the residuals are normally distributed with expected value zero. A slight violation of this assumption is not problematic, as the methods are fairly robust. However, the distribution should at least be symmetric.
The best way to judge whether the assumption of normality is fulfilled is to plot the residuals. There are many different diagnostic plots available; for example, you can do the following:
fit <- lm(y ~ x)
plot(fit)  # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
This will give you, among others, a plot of residuals vs. fitted values and a Q-Q plot of standardized residuals. The quantiles reported by summary(fit) are useful for a quick check of whether the residuals are symmetric. There, the min and max values are not that important, but the median should be close to zero, and the first and third quartiles should have similar absolute values. Of course, this check only makes sense if you have a sufficient number of observations.
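For a quick numeric check along the same lines, here is a minimal sketch (assuming a fitted model fit as in the snippet above):
# quartiles of the residuals: the median should be near zero, and
# |1Q| and |3Q| should be of similar magnitude if the residuals are roughly symmetric
quantile(residuals(fit), probs = c(0, 0.25, 0.5, 0.75, 1))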
If the residuals are not normally distributed, there are several possibilities for dealing with that (a rough sketch of the first two follows the list), e.g.,
transformations,
generalized linear models,
or a non-linear model could be more appropriate.
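As a rough illustration of the first two options, assuming a response y and a predictor x (both names are placeholders, not from the question, and y is assumed positive):
# option 1: transform the response, e.g. with a log transformation
fit_log <- lm(log(y) ~ x)
# option 2: a generalized linear model; the family chosen here is only an example
fit_glm <- glm(y ~ x, family = Gamma(link = "log"))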
There are many good books on linear regression and even some good web tutorials. I suggest reading at least one of them carefully.
Closed. This question does not appear to be about programming within the scope defined in the help center. It is not currently accepting answers.
I tried several ways of selecting predictors for a logistic regression in R. I used lasso logistic regression to get rid of irrelevant features, cutting their number from 60 to 24; then I used those 24 variables in my stepAIC logistic regression, after which I cut one more variable with a p-value of approximately 0.1. What other feature selection methods can, or even should, I use? I tried to look for an ANOVA correlation coefficient analysis, but I didn't find any examples for R. And I think I cannot use a correlation heatmap in this situation, since my output is categorical? I have seen some instances recommending lasso and stepAIC, and others criticising them, but I didn't find any definitive, comprehensive alternative, which left me confused.
Given the methodological nature of your question, you might also get a more detailed answer at Cross Validated: https://stats.stackexchange.com/
From the information provided, 23-24 independent variables seems like quite a lot to me. If you do not have a large sample, remember that overfitting might be an issue (i.e. a low cases-to-variables ratio). Indications of overfitting are, for instance, large parameter estimates and standard errors, or failure of convergence. You have obviously already used stepwise variable selection via stepAIC, which would also have been my first try had I chosen to let the model do the variable selection.
If you spot any issues with the standard errors or parameter estimates, further options down the road might be to collapse categories of independent variables, or to check whether there is any evidence of multicollinearity, which could also result in deleting highly correlated variables and narrowing down the number of remaining features.
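One common way to check for multicollinearity is via variance inflation factors; a minimal sketch, assuming your fitted logistic model is called fit and the car package is installed:
library(car)
vif(fit)  # values well above roughly 5-10 are often read as a warning sign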
Apart from a strictly mathematical approach, you might also want to identify features that are likely to be related to your outcome of interest according to your underlying subject-matter hypotheses and previous experience; that is, look at the model from your point of view as an expert in your field of interest.
If sample size is not an issue and the point is reducing the number of features, you may consider running a principal component analysis (PCA) to find out about highly correlated features and do the regression on the principal components instead, which are uncorrelated linear combinations of your "old" features. PCA can be run with the base R functions prcomp() or princomp(); the factoextra package is helpful for visualising the results: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
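A minimal sketch of that route, assuming a data frame X holding only the candidate predictors and a binary outcome y (both names are illustrative):
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)                            # proportion of variance explained per component
scores <- as.data.frame(pca$x[, 1:5])   # keep, say, the first five components
fit_pca <- glm(y ~ ., data = data.frame(y = y, scores), family = binomial)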
Closed. This question needs details or clarity. It is not currently accepting answers.
I get different standard error values depending on the way I specify a linear regression model (for a categorical variable) in R.
For example,
library(alr4)
library(tidyverse)
FactorModel <- lm(I(log(acrePrice)) ~ I(as.factor(year)), data = MinnLand)
summary(FactorModel)
However, if I specify the regression model with "+0", the standard errors (and consequently other values) are different.
FactorModel1 <- lm(I(log(acrePrice)) ~ I(as.factor(year)) + 0, data = MinnLand)
summary(FactorModel1)
How do I know which one is the correct standard error? I roughly understand how to interpret the estimates. For instance, in the first model, if I want to know the estimate for 2003, it is the intercept plus the coefficient for 2003. However, in the second model, the coefficient is that value directly.
Is there a similar interpretation for standard error?
The first model has an intercept and the second does not, but with a single factor these are just two parameterisations of the same fit: the dummy variables span the same space with or without an intercept, so the fitted values, residuals, and residual standard error are identical. What differs is the meaning of the coefficients, and therefore their standard errors. In the first model, each year coefficient is the difference from the baseline year, so its standard error is the standard error of that difference; in the second model, each coefficient is that year's mean itself, so its standard error is the standard error of a mean. Both sets of standard errors are correct; they simply refer to different quantities. (The three significance stars beside the intercept in the first model only tell you that the baseline year's mean differs from zero.) You can confirm that the two fits are equivalent with anova:
anova(FactorModel1, FactorModel)
Use predict to get predictions, but it will be easier to use if you first put the transformed variables in the data frame and then use those, instead of transforming them inside the formula.
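To make the relationship between the two sets of coefficients concrete, here is a small sketch (it assumes R's usual factor-level coefficient naming and that 2002 is the baseline year in MinnLand):
library(alr4)
m1 <- lm(log(acrePrice) ~ as.factor(year), data = MinnLand)      # with intercept
m2 <- lm(log(acrePrice) ~ as.factor(year) + 0, data = MinnLand)  # without intercept
coef(m2)["as.factor(year)2003"]                            # mean log price in 2003
coef(m1)["(Intercept)"] + coef(m1)["as.factor(year)2003"]  # the same value from m1
# the SE of the m2 coefficient is the SE of that year's mean; the SE of the
# m1 coefficient is the SE of the difference from the baseline year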
Closed. This question needs details or clarity. It is not currently accepting answers.
So I've read about fitting a Student-t distribution in R using MLE, but it always appears to be the case that the location and scale parameters are of the utmost interest. I just want to fit a Student-t (as described by Wikipedia) to data that is usually considered to be distributed like a standard normal, so I can assume the mean is 0 and the scale is 1. How can I do this in R?
If you "assume" your location and scale parameters, you are not "fitting" a distribution to the data, you are simply assuming that the data follows a certain distribution.
"Fitting" a distribution to some data means finding "appropriate" parameters of this distribution so that it "accurately" models your data. Maximum likelihood estimation is a method to find point-estimates of the parameters based on some data.
The easiest way to fit a classic distribution such as the Student-t is to use the function fitdistr from the MASS package, which uses MLE.
Assuming you have some data:
library("MASS")
# generating some data following a normal dist
x <- rnorm(100)
# fitting a t dist, although this makes little sense here
# since you know x comes from a normal dist...
fitdistr(x, densfun="t", df=length(x)-1)
Note that the Student-t density is parameterised by a location m, a scale s, and the degrees of freedom df. Here df is not estimated; it is fixed in the call, based on the data (length(x) - 1 in the example above).
The output of fitdistr contains the fitted values for m and s. If you store the output in an object, you can programmatically access all sorts of information about the fit.
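For example, a minimal sketch of what the returned object contains (continuing the code above):
fit_t <- fitdistr(x, densfun = "t", df = length(x) - 1)
fit_t$estimate   # fitted location m and scale s
fit_t$sd         # their approximate standard errors
fit_t$vcov       # estimated variance-covariance matrix of the estimates
fit_t$loglik     # maximised log-likelihood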
The question now is whether fitting a t dist is what you really want to do. If the data is normal, why would you want to fit a t dist?
You are looking for the function t.test.
x <- rnorm(100)
t.test(x)
I put a sample here
EDIT
I think I misunderstood your question slightly. Use t.test for hypothesis testing about the location of your population density (here a standard normal).
As for fitting the parameters of a t distribution, you should not do this unless your data come from a t distribution. If you know your data come from a standard normal distribution, you already know the location and scale parameters, so what would be the point?
Closed. This question does not appear to be about programming within the scope defined in the help center. It is not currently accepting answers.
I'm using a GAMM and want to extract confidence intervals for the model (via the variance-covariance matrix), but am having problems doing so.
Code:
fishdata <- read.csv("http://dl.dropbox.com/s/4w0utkqdhqribl4/fishdata.csv",
                     header = TRUE)
attach(fishdata)
require(mgcv)
gammdl <- gamm(inlandfao ~ s(marinefao), correlation = corAR1(form = ~ 1 | year),
               family = poisson())
summary(gammdl$gam)
intervals(gammdl$lme)
but the final line of code returns:
Error in intervals.lme(gammdl$lme) :
cannot get confidence intervals on var-cov components: Non-positive definite approximate variance-covariance
I don't see why this error message appears.
I'm trying to replicate what Simon Wood does on page 316 of Generalized Additive Models: An Introduction with R using my data.
This is really an issue of statistical computation rather than R usage. Essentially, the error means that whilst the computations might have converged to a fitted model, there are problems with that model. What it boils down to is that the Hessian computed at the fitted values of the model is not positive definite, and hence the maximum likelihood estimate of the covariance matrix is not available. This often occurs when the log-likelihood function becomes flat and no further progress can be made in the optimisation, which usually arises because the model is too complex for the given data.
You could try fitting the model with and without the AR(1) and compare them using a generalised likelihood ratio test (via anova()). I would guess that the AR(1) nested within year is redundant - i.e. the parameter $\phi$ is effectively 0 - and it is this that is causing the error.
Actually, now it also occurs to me that you are trying to model the variable inlandfao as a smooth function of itself s(inlandfao). There is something wrong here - you'd expect a perfect fit without needing a smooth function. You need to fix that; you can't model the response as a function of itself.
Closed. This question is off-topic. It is not currently accepting answers.
Given a simplified example time series of a population by year:
Year <- c(2001, 2002, 2003, 2004, 2005, 2006)
Pop <- c(1, 4, 7, 9, 20, 21)
DF <- data.frame(Year, Pop)
What is the best method to test for the significance of change between years, and to determine which years are significantly different from each other?
As @joran mentioned, this is really a statistics question rather than a programming question. You could try asking on http://stats.stackexchange.com to obtain more statistical expertise.
In brief, however, two approaches come to mind immediately:
If you fit a regression line to population vs. year and the slope is statistically significant, that would indicate an overall trend in population over the years. Use lm() in R, like this: lmPop <- lm(Pop ~ Year, data = DF).
You could divide the time period into blocks (e.g. the first three years and the last three years) and assume that the population figures for the years in each block are all estimates of the mean population during that block of years. That would give you a mean and a standard deviation of the population for each block of years, which would let you do a t-test, like this: t.test(Pop[1:3], Pop[4:6]).
Both of these approaches suffer from some potential difficulties, and the validity of each would depend on the nature of the data that you're examining. For the sample data, however, the first approach suggests that there appears to be a trend over time at the 95% confidence level (p = 0.00214 for the slope coefficient), while the second approach suggests that the null hypothesis of no difference in means cannot be rejected at the 95% confidence level (p = 0.06332).
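For reference, a minimal sketch of how those two p-values can be extracted (using the objects defined in the question):
lmPop <- lm(Pop ~ Year, data = DF)
summary(lmPop)$coefficients["Year", "Pr(>|t|)"]  # p-value for the slope
t.test(Pop[1:3], Pop[4:6])$p.value               # p-value for the two-sample t-test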
They're all significantly different from each other. 1 is significantly different from 4, 4 is significantly different from 7 and so on.
Wait, that's not what you meant? Well, that's all the information you've given us. As a statistician, I can't work with anything more.
So now you tell us something else: "Are any of the values significantly different from a straight line, where the variation in the Pop values consists of independent, normally distributed errors with mean 0 and the same variance?" or something like that.
Simply put, just a bunch of numbers cannot be the subject of a statistical analysis. Working with a statistician, you need to agree on a model for the data, and then statistical methods can answer questions about significance and uncertainty.
I think that's often the thing non-statisticians don't get. They go "here's my numbers, is this significant?" - which usually means typing them into SPSS and getting a p-value out.
[have flagged this Q for transfer to stats.stackexchange.com where it belongs]