R Student-t (location 0 scale 1) [closed]

So I've read about fitting a Student-t distribution in R using MLE, but the examples always treat the location and scale parameters as the quantities of interest. I just want to fit a Student-t (as described on Wikipedia) to data that is usually assumed to follow a standard normal, so I can take the location to be 0 and the scale to be 1 and estimate only the degrees of freedom. How can I do this in R?

If you "assume" your location and scale parameters, you are not "fitting" a distribution to the data, you are simply assuming that the data follows a certain distribution.
"Fitting" a distribution to some data means finding "appropriate" parameters of this distribution so that it "accurately" models your data. Maximum likelihood estimation is a method to find point-estimates of the parameters based on some data.
The easiest way to fit a classic distribution such as student-t is to use the function fitdistr from the MASS package, which uses MLE.
Assuming you have some data:
library("MASS")
# generating some data following a normal dist
x <- rnorm(100)
# fitting a t dist, although this makes little sense here
# since you know x comes from a normal dist...
fitdistr(x, densfun="t", df=length(x)-1)
Note that the Student-t density used by fitdistr is parameterised by a location m, a scale s and the degrees of freedom df. In the call above df is not estimated; it is fixed at length(x) - 1, i.e. set from the data.
The output of fitdistr contains the fitted values for m and s. If you store the output in an object, you can programmatically access all sorts of information about the fit.
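For example (a minimal sketch; fit is just a name chosen here):
fit <- fitdistr(x, densfun = "t", df = length(x) - 1)
fit$estimate  # fitted values of m and s
fit$sd        # their estimated standard errors
logLik(fit)   # log-likelihood of the fitted model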
The question now is whether fitting a t dist is what you really want to do. If the data is normal, why would you want to fit a t dist?
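That said, if you really do want to hold the location at 0 and the scale at 1 and estimate only the degrees of freedom (as the question asks), you can maximise the likelihood directly in base R (a minimal sketch, reusing the x simulated above; the search interval is an arbitrary choice):
# negative log-likelihood of a standard t with unknown degrees of freedom
negloglik <- function(df) -sum(dt(x, df = df, log = TRUE))
fit_df <- optimize(negloglik, interval = c(0.5, 200))
fit_df$minimum  # maximum likelihood estimate of the degrees of freedom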

You are looking for the function t.test.
x <- rnorm(100)
t.test(x)
I put a sample here
EDIT
I think I misunderstood your question slightly. Use t.test for hypothesis testing about the location of your population distribution (here assumed to be a standard normal).
As for fitting the parameters of a t distribution, you should not do this unless your data come from a t distribution. If you know your data come from a standard normal distribution, you already know the location and scale parameters, so what would be the point?

Related

Reproducibility of predictions from GAM (mgcv package) [closed]

I am using a generalized additive model built using the bam() function in the mgcv R package, to predict probabilities for a binary response.
I seem to get slightly different predictions for the same input data, depending on the make-up of the newdata table provided, and don't understand why.
The model was built using a formula like this:
model <- bam(response ~ categorical_predictor1 + s(continuous_predictor, bs = 'tp'),
             data = data,
             family = "binomial",
             select = TRUE,
             discrete = TRUE,
             nthreads = 16)
I have several more categorical and continuous predictors, but to save space I only mention two in the above formula.
I then predict like this:
predictions <- predict(model,
                       newdata = newdata,
                       type = "response")
I want to make predictions for about 2.5 million rows, but during my testing I predicted for a subset of 250,000.
Each time I use the model to predict for that subset (i.e. newdata=subset) I get the same outputs, so this is reproducible. However, if I use the model to predict that same subset within the full table of 2.5 million rows (i.e. newdata=full_data), then I get slightly different predictions for that subset of 250,000 than when I predict them separately.
I always thought that each row is predicted in turn, based on the predictors provided, so I can't understand why the predictions change with the context of the newdata. This does not happen if I predict using a standard glm or a random forest, so I assume it's something specific to GAMs or the mgcv package.
Sorry, I haven't been able to provide a reproducible example - my datasets are large, and I'm not sure if the same thing would happen with a small example dataset.
From the predict.bam help:
"When discrete=TRUE the prediction data in newdata is discretized in the same way as is done when using discrete fitting methods with bam. However the discretization grids are not currently identical to those used during fitting. Instead, discretization is done afresh for the prediction data. This means that if you are predicting for a relatively small set of prediction data, or on a regular grid, then the results may in fact be identical to those obtained without discretization. The disadvantage to this approach is that if you make predictions with a large data frame, and then split it into smaller data frames to make the predictions again, the results may differ slightly, because of slightly different discretization errors."
You likely can't switch to gam() or fit with discrete=FALSE because you need the speed, so you will have to accept some small discretization differences in exchange. From the help, it sounds like you might be able to minimise them by choosing the prediction subsets carefully, but you won't be able to eliminate them completely.
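If you want to confirm that discretization is the cause, predict.bam also accepts a discrete argument, so you can turn discretization off at prediction time (a sketch only; subset_data, full_data and subset_rows are placeholder names):
# exact (non-discretized) prediction is slower but should not depend on the newdata context
p_subset <- predict(model, newdata = subset_data, type = "response", discrete = FALSE)
p_full   <- predict(model, newdata = full_data,   type = "response", discrete = FALSE)
all.equal(unname(p_subset), unname(p_full[subset_rows]))  # shared rows should now agree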

Interpreting standard error for similar (but different) specified linear regression models: R [closed]

I get different standard error values depending on how I specify a linear regression model (with a categorical predictor) in R.
For example,
library(alr4)
library(tidyverse)
FactorModel <- lm(I(log(acrePrice)) ~ I(as.factor(year)), data = MinnLand)
summary(FactorModel)
However, if I specify the regression model with "+0", the standard errors (and consequently other values) are different.
FactorModel1 <- lm(I(log(acrePrice)) ~ I(as.factor(year)) + 0, data = MinnLand)
summary(FactorModel1)
How do I know which one is the correct standard error? I more or less understand how to interpret the estimates: in the first model, if I want the estimate for 2003, it is the intercept plus the coefficient for 2003, whereas the second model reports that value directly.
Is there a similar interpretation for the standard errors?
These are not really two different models; they are two parameterisations of the same model. A factor without an intercept spans exactly the same column space as the factor plus an intercept, so the fitted values, the residuals and the residual standard error are identical. What changes is the meaning of the coefficients: with an intercept, each year coefficient is the difference from the baseline year, so its standard error is the standard error of a difference in means; without the intercept, each coefficient is that year's mean log price itself, so its standard error is the standard error of a mean. Both sets of standard errors are correct; they are simply standard errors of different quantities. You can confirm that the two fits are equivalent with anova, which reports a change of zero residual degrees of freedom and (up to rounding) zero residual sum of squares:
anova(FactorModel1, FactorModel)
If you want predictions, use predict(); it is easier if you first add the transformed variable (the log of acrePrice) to the data frame and use that in the formula, rather than transforming inside the formula.
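A quick way to see that the two parameterisations give the same fit (a sketch, assuming MinnLand is loaded as in the question; factor(year) is equivalent to the I(as.factor(year)) used above):
fit1 <- lm(log(acrePrice) ~ factor(year), data = MinnLand)
fit2 <- lm(log(acrePrice) ~ factor(year) + 0, data = MinnLand)
all.equal(fitted(fit1), fitted(fit2))  # TRUE: same fitted values, hence same residuals
c(sigma(fit1), sigma(fit2))            # identical residual standard errors
coef(summary(fit1))                    # SEs of differences from the baseline year
coef(summary(fit2))                    # SEs of the individual year means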

R calculate percentile using Stata definition [closed]

I did two analyses based on percentile calculations, one in R and one in Stata, and the results do not match because R and Stata compute percentiles differently. Is there a way to use Stata's percentile definition in R?
R's quantile() offers at least nine definitions of quantiles (type = 1, ..., 9), and the p-th percentile is just the quantile at probability p/100. This link suggests that the corresponding quantile type would be type=4. I was unable to find a percentile or quantile function documented in the base Stata manual, but I would welcome correction if that is in error.
Nick Cox is right: the quantile (the value in the data domain) at probability 0.25 is the 25th percentile. The question appeared unclear on both sides of the R-Stata divide because the original attempts in R used the ecdf function in an unspecified manner. Fortunately, the poster was satisfied once pointed toward R's quantile function.
After looking at the Version 13 Stata Manual section on centile, I'm not sure it matches up with any of the R quantile methods although it would appear to match the type=4 method for percentiles away from the "extremes":
"By default, centile estimates Cq for the variables in varlist and for the values of q given in centile(numlist). It makes no assumptions about the distribution of X, and, if necessary, uses linear interpolation between neighboring sample values. Extreme centiles (for example, the 99th centile in samples smaller than 100) are fixed at the minimum or maximum sample value."
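If you want to check the correspondence empirically, you can compare all nine of R's quantile definitions against Stata's centile output for the same small sample (a sketch; the data are arbitrary):
x <- c(2, 5, 7, 11, 13, 17, 19, 23)
# the p-th percentile is quantile(x, p/100); compare the nine definitions
sapply(1:9, function(tp) quantile(x, probs = 0.25, type = tp))
# type = 4 (linear interpolation of the empirical cdf) is the suggested match
quantile(x, probs = 0.25, type = 4)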

Obtaining GAMM intervals [closed]

I'm using a GAMM and want to extract confidence intervals for the model from the variance-covariance matrix, but I am having problems doing so.
Coding:
fishdata <- read.csv("http://dl.dropbox.com/s/4w0utkqdhqribl4/fishdata.csv",
                     header = TRUE)
attach(fishdata)
require(mgcv)
gammdl <- gamm(inlandfao ~ s(marinefao),
               correlation = corAR1(form = ~ 1 | year),
               family = poisson())
summary(gammdl$gam)
intervals(gammdl$lme)
but the final line returns
Error in intervals.lme(gammdl$lme) :
cannot get confidence intervals on var-cov components: Non-positive definite approximate variance-covariance
I don't see why this error message appears.
I'm trying to replicate what Simon Wood does on page 316 of Generalized Additive Models: An Introduction with R using my data.
This is really an issue of statistical computation, not R usage. Essentially, the error means that although the computations may have converged to a fitted model, there are problems with that model. What this boils down to is that the Hessian computed at the fitted values is not positive definite, and hence the maximum likelihood estimate of the covariance matrix is not available. This often happens when the log-likelihood becomes flat so that the optimisation can make no further progress, and it usually arises because the model is too complex for the given data.
You could try fitting the model with and without the AR(1) and compare them using a generalised likelihood ratio test (via anova()). I would guess that the AR(1) nested within year is redundant - i.e. the parameter $\phi$ is effectively 0 - and it is this that is causing the error.
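In code, that comparison might look like the following (a sketch only; with a Poisson GAMM the underlying fit uses PQL, so treat the likelihood-based comparison as approximate):
m_ar1 <- gamm(inlandfao ~ s(marinefao), data = fishdata,
              correlation = corAR1(form = ~ 1 | year), family = poisson())
m_iid <- gamm(inlandfao ~ s(marinefao), data = fishdata, family = poisson())
anova(m_iid$lme, m_ar1$lme)  # compares the mixed-model representations of the two fits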
Actually, now it also occurs to me that you are trying to model the variable inlandfao as a smooth function of itself s(inlandfao). There is something wrong here - you'd expect a perfect fit without needing a smooth function. You need to fix that; you can't model the response as a function of itself.

Interpreting residual value statement in lm() summary [closed]

I am working in R to fit some linear models (using lm()) to data that I have collected. I am not that good at statistics and am finding it difficult to understand the summary of the linear model that R generates.
I mean the residuals line: Min, 1Q, Median, 3Q, Max.
My question is: what do these values mean, and how can I tell from them whether my model is good?
These are the residual values that I have:
Min: -4725611, 1Q: -2161468, Median: -1352080, 3Q: 3007561, Max: 6035077
One fundamental assumption of linear regression (and of the associated hypothesis tests in particular) is that the residuals are normally distributed with expected value zero. A slight violation of this assumption is not problematic, as the statistics are fairly robust. However, the distribution should be at least symmetric.
The best way to judge whether the assumption of normality is fulfilled is to plot the residuals. There are many diagnostic plots available; for example, you can do the following:
fit <- lm(y ~ x)
plot(fit)
This gives you a plot of residuals vs. fitted values and a QQ-plot of standardized residuals. The quantiles given by summary(fit) are useful for a quick check of whether the residuals are symmetric. There, the min and max values are not that important, but the median should be close to zero and the first and third quartiles should have similar absolute values. Of course, this check only makes sense if you have a sufficient number of observations.
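For a quick numeric version of that check (a sketch, using the fit object from above):
q <- quantile(residuals(fit), probs = c(0, 0.25, 0.5, 0.75, 1))
q                              # essentially the Min, 1Q, Median, 3Q, Max that summary(fit) reports
q["50%"]                       # the median should be close to zero
abs(q["25%"]) / abs(q["75%"])  # should be roughly 1 if the residuals are symmetric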
If the residuals are not normally distributed, there are several ways to deal with that (a brief sketch follows this list), e.g.,
transformations,
generalized linear models,
or a non-linear model could be more appropriate.
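For instance (a sketch only; y and x are the placeholder variables from lm(y ~ x) above):
# a transformation of the response (assumes y > 0)
fit_log <- lm(log(y) ~ x)
# a generalized linear model, e.g. a Gamma GLM with a log link for positive, right-skewed y
fit_glm <- glm(y ~ x, family = Gamma(link = "log"))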
There are many good books on linear regression, and even some good web tutorials. I suggest reading at least one of them carefully.
