Trouble with 'fitdistrplus' package, t-distribution

I am trying to fit t-distributions to my data but am unable to do so. My first try was
fitdistr(myData, "t")
There are 41 warnings, all saying that NaNs are produced. I don't know how, logarithms seem to be involved. So I adjusted my data somewhat so that all data is >0, but I still have the same problem (9 fewer warnings though...). Same problem with sstdFit(), produces NaNs.
So instead I try with fitdist which I've seen on stackoverflow and CrossValidated:
fitdist(myData, "t")
I then get
Error in mledist(data, distname, start, fix.arg, ...) :
'start' must be defined as a named list for this distribution
What does this mean? I tried looking into the documentation but that told me nothing. I just want to possibly fit a t-distribution, this is so frustrating :P
Thanks!

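start is the initial guess for the parameters of your distribution. fitdist has no built-in starting values for the t distribution, so you must supply them yourself as a named list. There are logs involved because it is using maximum likelihood and hence log-likelihoods.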
library(fitdistrplus)
dat <- rt(100, df=10)
fit <- fitdist(dat, "t", start=list(df=2))

I think it's worth adding that in most cases, using the fitdistrplus package to fit a t-distribution to real data will lead to a very bad fit, which is actually quite misleading. This is because the default t-distribution functions in R are used, and they don't support shifting or scaling. That is, if your data has a mean other than 0, or is scaled in some way, then the fitdist function will simply lead to a bad fit.
In real life, if data fits a t-distribution, it is usually shifted (i.e. has a mean other than 0) and / or scaled. Let's generate some data like that:
data = 1.5*rt(10000,df=5) + 0.5
Given this data has been sampled from the t-distribution with 5 degrees of freedom, you'd think that trying to fit a t-distribution to this should work quite nicely. But actually, here is the result. It estimates a df of 2, and provides a bad fit as shown in the qq plot.
> fit_bad <- fitdist(data,"t",start=list(df=3))
> fit_bad
Fitting of the distribution ' t ' by maximum likelihood
Parameters:
estimate Std. Error
df 2.050967 0.04301357
> qqcomp(list(fit_bad)) # generates plot to show fit
When you fit to a t-distribution you want to not only estimate the degrees of freedom, but also a mean and scaling parameter.
The metRology package provides a version of the t-distribution called t.scaled that has a mean and sd parameter in addition to the df parameter. Now let's fit it again:
> library("metRology")
> fit_good <- fitdist(data,"t.scaled",
start=list(df=3,mean=mean(data),sd=sd(data)))
> fit_good
Fitting of the distribution ' t.scaled ' by maximum likelihood
Parameters:
estimate Std. Error
df 4.9732159 0.24849246
mean 0.4945922 0.01716461
sd 1.4860637 0.01828821
> qqcomp(list(fit_good)) # generates plot to show fit
Much better :-) The parameters are very close to how we generated the data in the first place! And the QQ plot shows a much nicer fit.
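If you would rather not depend on metRology, the same location-scale fit can be sketched in base R. This is only an illustration under my own assumptions: the function name nll and the log-scale parametrisation are not from the packages above. Optimising log(sd) and log(df) keeps both strictly positive, which is a plausible reason for the NaN warnings in the question (the optimiser wandering into non-positive scale or df values).

```r
set.seed(1)
x <- 1.5 * rt(10000, df = 5) + 0.5   # same kind of shifted and scaled data as above

# Negative log-likelihood of a location-scale t distribution.
# par = (mean, log(sd), log(df)); the log scale keeps sd and df positive.
nll <- function(par, x) {
  m  <- par[1]
  s  <- exp(par[2])
  df <- exp(par[3])
  -sum(dt((x - m) / s, df = df, log = TRUE) - log(s))
}

opt <- optim(c(mean(x), log(sd(x)), log(3)), nll, x = x,
             control = list(maxit = 2000))

# Back-transform to the natural scale
est <- c(mean = opt$par[1], sd = exp(opt$par[2]), df = exp(opt$par[3]))
est
```

The estimates should land near the values used to generate the data (mean 0.5, scale 1.5, df 5), in line with the t.scaled fit.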


R absolute value of residuals with log transformation

I have a linear model in R of the form
lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
I want to interpret the residuals but get them back on the scale of num_encounters. I have seen residuals.lm(x, type="working") and residuals.lm(x, type="response") but I'm not sure about the values returned by them. Do I for instance still need to use exp() to get the residual values back on the num_encounters scale? Or are they already on that scale? I want to plot these absolute values back, both in a histogram and in a raster map afterwards.
EDIT:
Basically my confusion is that the following code results in 3 different histograms, while I was expecting the first 2 to be identical.
df$predicted <- exp(predict(x, newdata=df))
histogram(df$num_encounters-df$predicted)
histogram(exp(residuals(x, type="response")))
histogram(residuals(x, type="response"))
"I want to interpret the residuals but get them back on the scale of num_encounters."
You can easily calculate them:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
In addition to what @Roland suggests, which is correct and works, my confusion came down to basic high-school logarithm algebra.
Indeed, the absolute response residuals (on the scale of the original dependent variable) can be calculated as @Roland says with
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
If you want to calculate them from the model residuals, you need to take the logarithm subtraction rule into account.
log(a)-log(b)=log(a/b)
The residual is calculated from the original model. So in my case, the model predicts log(num_encounters). So the residual is log(observed)-log(predicted).
What I was trying to do was
exp(resid) = exp(log(obs)-log(pred)) = exp(log(obs/pred)) = obs/pred
which is clearly not the number I was looking for. To get the absolute response residual from the model response residual, this is what I needed.
obs-obs/exp(resid)
So in R code, this is what you could also do:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
abs_resid <- df$num_encounters - df$num_encounters/exp(residuals(mod, type="response"))
This gives the same numbers as the method described by @Roland, which is of course much easier. But at least I got my brain lined up again.
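To check the algebra, here is a small self-contained sketch on simulated data. The data frame and its coefficients are made up purely for illustration; only the model formula comes from the question.

```r
set.seed(1)
# Hypothetical data with the same column names as in the question
df <- data.frame(distance = runif(50, 1, 100),
                 sampling_effort = runif(50, 1, 10))
df$num_encounters <- exp(1 + 0.5 * log(df$distance) +
                         0.1 * df$sampling_effort + rnorm(50, sd = 0.3))

mod <- lm(log(num_encounters) ~ log(distance) * sampling_effort, data = df)

# Direct route: observed minus back-transformed prediction
res_direct <- df$num_encounters - exp(predict(mod))

# Route via the log-scale residuals: obs - obs / exp(resid)
res_from_resid <- df$num_encounters - df$num_encounters / exp(residuals(mod))

all.equal(res_direct, res_from_resid)
```

Both routes agree up to floating-point error, because obs / exp(resid) is just the back-transformed prediction.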

Finding Mean Squared Error?

I have produced a linear data set and have used lm() to fit a model to that dataset. I am now trying to find the MSE using mse()
I know the formula for MSE but I'm trying to use this function. What would be the proper way to do so? I have looked at the documentation, but I'm either dumb or it's just worded for people who actually know what they're doing.
library(hydroGOF)
x.linear <- seq(0, 200, by=1) # x data
error.linear <- rnorm(n=length(x.linear), mean=0, sd=1) # Error (0, 1)
y.linear <- x.linear + error.linear # y data
training.data <- data.frame(x.linear, y.linear)
training.model <- lm(training.data)
training.mse <- mse(training.model, training.data)
plot(training.data)
mse() needs two data frames. I'm not sure how to get a data frame out of lm(). Am I even on the right track to finding a proper MSE for my data?
Try this:
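# observed response minus fitted values; subtracting from the whole data frame would not give a single number
mean((model.response(model.frame(training.model)) - predict(training.model))^2)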
#[1] 0.4467098
You can also use the Metrics package, which gives a very clean way to get the mean squared error:
install.packages("Metrics")
library(Metrics)
mse(actual, predicted)
The first argument is the vector of actual values, taken from training.data. The second argument is the vector of predicted values, which you obtain with predict():
pd <- predict(training.model, training.data)
mse(training.data$y.linear, pd)  # assumes y.linear is the response the model was fit to
It seems you have not made predictions yet, so first predict from your model and then calculate the MSE.
You can use the residual component from lm model output to find mse in this manner :
mse = mean(training.model$residuals^2)
Note: if you come from another program (like SAS), be aware that it computes the MSE as the residual sum of squares divided by the residual degrees of freedom, not by the number of observations. I recommend doing the same if you want an unbiased estimate of the error variance.
mse = sum(training.model$residuals^2)/training.model$df.residual
I found this while trying to figure out why mean(my_model$residuals^2) was different in R than the MSE in SAS.
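A short sketch comparing the two denominators, using only base R. The explicit formula y.linear ~ x.linear is my assumption about what the question intended: lm(training.data) without a formula treats the first column as the response.

```r
set.seed(1)
x.linear <- seq(0, 200, by = 1)
y.linear <- x.linear + rnorm(length(x.linear), mean = 0, sd = 1)
training.data <- data.frame(x.linear, y.linear)

# Explicit formula so that y.linear is unambiguously the response
training.model <- lm(y.linear ~ x.linear, data = training.data)

res <- residuals(training.model)
mse_n  <- mean(res^2)                                # divides by n
mse_df <- sum(res^2) / df.residual(training.model)   # divides by n - p (SAS-style)

c(mse_n = mse_n, mse_df = mse_df)

# The df-based version is the squared residual standard error reported by summary()
all.equal(mse_df, summary(training.model)$sigma^2)
```

With 201 observations and 2 coefficients the two values differ by a factor of 201/199, so the difference only matters in small samples or when comparing output across programs.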

Get degrees of freedom for a Standardized T Distribution with MLE

First of all, I thank you all beforehand for reading this.
I am trying to fit a Standardized T-Student Distribution (i.e. a T-Student with standard deviation = 1) on a series of data; that is: I want to estimate the degrees of freedom via Maximum Likelihood Estimation.
An example of what I need to achieve can be found in the following (simple) Excel file I made:
https://www.dropbox.com/s/6wv6egzurxh4zap/Excel%20Implementation%20Example.xlsx?dl=0
Inside the Excel file, I have an image that contains the formula corresponding to the calculation of the loglikelihood function for the Standardized T Student Distribution. The formula was extracted from a Finance book (Elements of Financial Risk Management - by Peter Christoffersen).
So far, I have tried this with R:
copula.data <- read.csv(file.choose(),header = TRUE)
z1 <- copula.data[,1]
library(fitdistrplus)
ft1 = fitdist(z1, "t", method = "mle", start = 10)
df1=ft1$estimate[1]
df1
logLik(ft1)
df1 yields the number: 13.11855278779897
logLik(ft1) yields the number: -3600.2918050056487
However, the Excel file yields degrees of freedom of: 8.2962365022727, and a log-likelihood of: -3588.8879 (which is the right answer).
Note: the .csv file that my code reads is the following:
https://www.dropbox.com/s/nnh2jgq4fl6cm12/Data%20for%20T%20Copula.csv?dl=0
Any ideas? Thank you people!
The formula from your spreadsheet (with n, x substituted for the df parameter and the data)
=GAMMALN((n+1)/2)-GAMMALN(n/2)-LN(PI())/2-LN(n-2)/2-1/2*(1+n)*LN(1+x^2/(n-2))
or, exponentiating,
Gamma((n+1)/2) / (sqrt((n-2) pi) Gamma(n/2)) (1+x^2/(n-2))^-((n+1)/2)
?dt gives
f(x) = Gamma((n+1)/2) / (sqrt(n pi) Gamma(n/2)) (1 + x^2/n)^-((n+1)/2)
So the difference lies in those n-2 values in two places in the formula. I don't have enough context to see why the author is defining the t distribution in that different way; there may be some good reason ...
Looking at the negative log-likelihood curve directly, it certainly seems as though the fitdistrplus answer is agreeing with the direct calculation. (It would be very surprising if there were a bug in the dt() function, R's distribution functions are very broadly used and thoroughly tested.)
LL <- function(p,data=z1) {
-sum(dt(data,df=p,log=TRUE))
}
pvec <- seq(6,20,by=0.05)
Lvec <- sapply(pvec,LL)
par(las=1,bty="l")
plot(pvec,Lvec,type="l",
xlab="df parameter",ylab="negative log-likelihood")
## superimpose fitdist results ...
abline(v=coef(ft1),lty=2)
abline(h=-logLik(ft1),lty=2)
Unless there's something else you're not telling us about the problem definition, it seems to me that R is getting the right answer. (The mean and sd of the data you gave were not exactly equal to 0 and 1 respectively, but they were close; centering and scaling gave an even larger value for the parameter.)
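One possible explanation for the n-2 terms: a t variable with n degrees of freedom has variance n/(n-2), so rescaling it to unit variance (a "standardized" t) gives exactly the density with n-2 in place of n. The sketch below is my own illustration, not the book's code. The names dstd_t and spreadsheet_loglik are hypothetical, and the data are simulated because the original CSV is not reproduced here.

```r
# Density of a t distribution rescaled to unit variance (requires df > 2)
dstd_t <- function(x, df, log = FALSE) {
  s <- sqrt(df / (df - 2))                       # sd of an ordinary t with df degrees of freedom
  logd <- dt(x * s, df = df, log = TRUE) + log(s)
  if (log) logd else exp(logd)
}

# Direct transcription of the spreadsheet formula, for comparison
spreadsheet_loglik <- function(x, n) {
  lgamma((n + 1) / 2) - lgamma(n / 2) - log(pi) / 2 - log(n - 2) / 2 -
    (1 + n) / 2 * log(1 + x^2 / (n - 2))
}

# The two expressions agree
all.equal(dstd_t(0.7, df = 8, log = TRUE), spreadsheet_loglik(0.7, n = 8))

# Maximum likelihood for df on simulated unit-variance t data
set.seed(1)
z <- rt(2000, df = 8) / sqrt(8 / 6)
negLL <- function(df) -sum(dstd_t(z, df, log = TRUE))
opt <- optimize(negLL, interval = c(2.1, 100))
opt$minimum                                      # estimated df
```

If that is what the book intends, replacing dt() with dstd_t() in the LL function above should reproduce the spreadsheet's likelihood, and fitdist(z1, "t") would be fitting a different (non-standardized) model.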

Bootstrap failed using mixed model in lme4 package

I want to use the bootMer() function of the lme4 package with a linear mixed model, and boot.ci() to get 95% CIs by parametric bootstrapping, and have been getting warnings of the type "In bootMer(object, bootFun, nsim = nsim, ...) : some bootstrap runs failed (30/100)".
My code is:
> lmer(LLA ~ 1 +(1|PopID/FamID), data=fp1) -> LLA
> LLA.boot <- bootMer(LLA, qst, nsim=999, use.u=F, type="parametric")
Warning message:
In bootMer(LLA, qst, nsim = 999, use.u = F, type = "parametric") :
some bootstrap runs failed (3/999)
> boot.ci(LLA.boot, type=c("norm", "basic", "perc"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 996 bootstrap replicates
CALL :
boot.ci(boot.out = LLA.boot, type = c("norm", "basic", "perc"))
Intervals :
Level Normal Basic Percentile
95% (-0.2424, 1.0637 ) (-0.1861, 0.8139 ) ( 0.0000, 1.0000 )
Calculations and Intervals on Original Scale
My questions are: why does the bootstrap fail for a few runs? And why do the 95% confidence intervals estimated by boot.ci include negative values, even though there are no negative values among the values generated by the bootstrap?
The result of plot(LLA.boot):
It's not surprising, for a slightly difficult or unstable model, that a few parametric bootstrap runs might fail to converge for numerical reasons. You should be able to retrieve the specific error messages via attr(LLA.boot,"boot.fail.msgs") (this really should be documented, but isn't ...) In general I wouldn't worry about it too much if the failure fraction is very small (which it is in this case); if it were large (say >5-10%) I would revisit my data and model and try to see if there was something else wrong that was manifesting itself in this way.
As for the confidence intervals: the "norm" method uses a bias-corrected Normal approximation, and the "basic" method reflects the bootstrap quantiles around the original estimate, so it's not surprising that these intervals should go beyond the range of the computed values. Since your function is
Qst <- function(x){
  uu <- unlist(VarCorr(x))
  uu[2]/(uu[3]+uu[2])
}
its possible range is from 0 to 1, and your percentile bootstrap CI shows this range is attained. If your model were perfectly uninformative, the distribution of Qst would be uniform (mean=0.5, sd=sqrt(1/12)=0.288) and the Normal approximation to the CI would be
> 0.5+c(-1,1)*1.96*sqrt(1/12)
[1] -0.06580326 1.06580326
The upper end is about in the same place as your Normal CI, but your lower limit is even smaller, suggesting that there may even be some bimodality in the sampling distribution of your estimate (this is confirmed by the distribution plot you posted). In any case, I suspect that the bottom line is that your confidence intervals (however computed) are so wide that they're telling you that your data provide almost no practical information about the value of Qst ... In particular, it looks like the majority of your bootstrap replicates are finding singular fits, in which one or the other of the variances are estimated as zero. I'm guessing your data set is just not large enough to estimate these variances very precisely.
For more information on how the basic and bias-corrected Normal intervals are computed, see boot:::basic.ci and boot:::norm.ci or chapter 5 of Davison and Hinkley as cited in ?boot.ci.
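To see how intervals can leave [0, 1] even though every replicate lies inside it, here is a simplified base-R sketch of the three interval types. The replicates t_star and the estimate t0 are invented for illustration, and boot.ci uses its own quantile interpolation, so treat this as an approximation of what it does rather than its exact code.

```r
set.seed(42)
# Hypothetical bootstrap replicates confined to [0, 1], with many piled up at the boundaries
t_star <- c(rep(0, 300), rep(1, 200), runif(500))
t0 <- 0.4                                   # hypothetical estimate from the original data

alpha <- 0.05
q <- quantile(t_star, c(alpha / 2, 1 - alpha / 2), names = FALSE)

perc_ci  <- q                               # percentile interval: always inside the range of t_star
basic_ci <- 2 * t0 - rev(q)                 # basic interval: quantiles reflected around t0
bias     <- mean(t_star) - t0
norm_ci  <- t0 - bias + c(-1, 1) * qnorm(1 - alpha / 2) * sd(t_star)  # bias-corrected Normal

rbind(normal = norm_ci, basic = basic_ci, percentile = perc_ci)
```

The same reflection is visible in your output: with a percentile interval of (0, 1), a basic interval of (-0.1861, 0.8139) corresponds to an original estimate of about 0.41.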

How to get bootstrapped p-values and bootstrapped t-values and how does the function boot() work?

I would like to get the bootstrapped t-value and the bootstrapped p-value of a lm.
I have the following code (basically copied from a paper) which works.
# First of all you need the following packages
install.packages("car")
install.packages("MASS")
install.packages("boot")
library("car")
library("MASS")
library("boot")
boot.function <- function(data, indices){
data <- data[indices,]
mod <- lm(prestige ~ income + education, data=data) # the linear model
# the first element of the following vector contains the t-value
# and the second element is the p-value
c(summary(mod)[["coefficients"]][2,3], summary(mod)[["coefficients"]][2,4])
}
Now, I compute the bootstrapping model, which gives me the following:
duncan.boot <- boot(Duncan, boot.function, 1999)
duncan.boot
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Duncan, statistic = boot.function, R = 1999)
Bootstrap Statistics :
original bias std. error
t1* 5.003310e+00 0.288746545 1.71684664
t2* 1.053184e-05 0.002701685 0.01642399
I have two questions:
1. My understanding is that the bootstrapped value is the original plus the bias, which means that both bootstrapped values (the bootstrapped t-value as well as the bootstrapped p-value) are greater than the original values. This in turn is not possible, because if the t-value rises (which means more significance) the p-values MUST be lower, right? Therefore I think that I have not yet really understood the output of the boot function (here: duncan.boot). How do I compute the bootstrapped values?
2. I do not understand how boot() works. If you look at duncan.boot <- boot(Duncan, boot.function, 1999) you see that I have not passed any arguments for the function "boot.function". I suppose that R sets data <- Duncan. But since I have not passed anything for the argument "indices", I do not understand how the following line in the function "boot.function" works: data <- data[indices,]
I hope the questions make sense!??
The boot function expects a statistic function with two arguments: the first is the data (here a data.frame) and the second is an "indices" vector used to select rows. For each of the R replicates, boot() samples the row indices of the original data frame with replacement, so a given index vector usually contains some rows two or three times and omits others. It passes each such vector to the indices argument of boot.function and then collects the results of the R function calls. It also calls the function once with the indices in their original order to get the value for the original data.
Regarding what is reported by the print method for boot objects, take a look at this (done after examining the returned object with str()):
> duncan.boot$t0
[1] 5.003310e+00 1.053184e-05
> apply(duncan.boot$t, 2, mean)
[1] 5.342895220 0.002607943
> apply(duncan.boot$t, 2, mean) - duncan.boot$t0
[1] 0.339585441 0.002597411
It becomes more obvious that the T0 value is from the original data while the bias is the difference between the mean of the boot()-ed values and the T0 values. I don't think it makes a lot of sense to be asking why p-values based on parametric considerations are increasing in association with an increase in estimated t-statistics. You are really in two disparate regions of statistical thought when you do that. I would have interpreted the increase in p-values as an effect of the sampling process, which does not take into account the Normal distribution assumptions. It is simply saying something about the sampling distribution of the p-value (which is really just another sample statistic).
(Comment: The sourcebook used at the time of R development was Davison and Hinkley's "Bootstrap Methods and their Application". I'm not claiming any support for my answer above, but I thought to put it in as a reference after Hagen Brenner asked about sampling with two indices in the comments below. There are many unexpected aspects of bootstrapping that arise after one goes beyond the simple parametric estimation and I would first turn to that reference if I were tackling more complex sampling situations.)
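To make the mechanics concrete, here is a rough hand-rolled version of what the ordinary nonparametric bootstrap does with such a function. It is a sketch, not boot's actual implementation: it uses the built-in mtcars data and a hypothetical stat_fun instead of Duncan and boot.function, so that it runs without extra packages.

```r
# Statistic in the form boot() expects: (data, indices) -> numeric vector
stat_fun <- function(data, indices) {
  d <- data[indices, ]                           # resampled rows
  coef(summary(lm(mpg ~ wt, data = d)))["wt", c("t value", "Pr(>|t|)")]
}

set.seed(123)
n <- nrow(mtcars)
R <- 200

# Value on the original data: indices are simply 1..n
t0 <- stat_fun(mtcars, seq_len(n))

# R replicates: each call gets row indices sampled with replacement
t_star <- t(replicate(R, stat_fun(mtcars, sample(n, n, replace = TRUE))))

bias <- colMeans(t_star) - t0                    # what print() reports as "bias"
se   <- apply(t_star, 2, sd)                     # what print() reports as "std. error"
cbind(original = t0, bias = bias, std.error = se)
```

So the "bootstrapped value" is not a single number: t_star holds the whole bootstrap distribution of each statistic, and the bias column is only the difference between its mean and the original value.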
