Fitting a lognormal or poisson distribution - r

I have a vector of 1096 numbers, the daily average concentration of NOx measured in 3 years in a measurement station.
You can observe the type of distribution in the image:
I used these commands to do the histogram:
NOxV<-scan("NOx_Vt15-17.txt")
hist.NOxVt<-hist(NOxV, plot = FALSE, breaks = 24)
plot(hist.NOxVt, xlab = "[NOx]", ylab = "Frequenze assolute", main = "Istogramma freq. ass. NOx 15-17 Viterbo")
points(hist.NOxVt$mids, hist.NOxVt$counts, col= "red")
My professor suggested that I fit the histogram with a Poisson distribution - paying attention to the transition: discrete -> continuous (I don't know what that means)- or with a "Lognormal" distribution.
I tried to do the Poisson fit, with some command lines that she gave us at lesson, but R gave me an error after having executed the last code line of the following:
my_poisson = function(params, x){
exp(-params)*params^x/factorial(x)
}
y<-hist.NOxVt$counts/1096;
x<-hist.NOxVt$mids;
z <- nls( y ~ exp(-a)*a^x/factorial(x), start=list(a=1) )
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
In addition: There were 50 or more warnings (use warnings() to see the first 50)"
After this problem I couldn't solve (even searching similar problems on the internet) I decided to fit the distribution with a Lognormal, but I have no idea how to do it, because the professor did not explain it to us, and I still don't have enough R experience to figure it out on my own.
I would appreciate any suggestion or examples of how to do a lognormal fit and/or Poisson fit.

There's a built-in function fitdistr in the MASS package that comes with R:
Generating a data example to look at (eyeballing parameters to get something similar to your picture):
set.seed(101)
z <- rlnorm(1096,meanlog=4.5,sdlog=0.8)
Fit (on statistical grounds I wouldn't recommend a Poisson fit - it might be possible to adapt a discrete distribution such as Poisson (or better, negative binomial) to fit such continuous data, but log-Normal or Gamma distributions are more natural choices.
library(MASS)
f1 <- fitdistr(z,"lognormal")
f2 <- fitdistr(z,"Gamma")
The f1 and f2 objects, when printed, give the estimated coefficients (meanlog and sdlog for log-Normal, shape and rate for Gamma) and standard errors of the coefficients.
Draw a picture (on the density scale, not the count scale): red is log-Normal, blue is Gamma (in this case log-Normal fits better because that's how I generated the "data" in the first place). [The with(as.list(coef(...)) stuff is some R fanciness to allow the use of the names of the coefficients (meanlog, sdlog etc.) in the subsequent R code.]
hist(z,col="gray",breaks=50,freq=FALSE)
with(as.list(coef(f1)),
curve(dlnorm(x,meanlog,sdlog),
add=TRUE,col="red",lwd=2))
with(as.list(coef(f2)),
curve(dgamma(x,shape=shape,rate=rate),
add=TRUE,col="blue",lwd=2))

Related

Gamma distribution in a GLMM

I am trying to create a GLMM in R. I want to find out how the emergence time of bats depends on different factors. Here I take the time difference between the departure of the respective bat and the sunset of the day as dependent variable (metric). As fixed factors I would like to include different weather data (metric) as well as the reproductive state (categorical) of the bats. Additionally, there is the transponder number (individual identification code) as a random factor to exclude inter-individual differences between the bats.
I first worked in R with a linear mixed model (package lme4), but the QQ plot of the residuals deviates very strongly from the normal distribution. Also a histogram of the data rather indicates a gamma distribution. As a result, I implemented a GLMM with a gamma distribution. Here is an example with one weather parameter:
model <- glmer(formula = difference_in_min ~ repro + precipitation + (1+repro|transponder number), data = trip, control=ctrl, family=gamma(link = log))
However, since there was no change in the QQ plot this way, I looked at the residual diagnostics of the DHARMa package. But the distribution assumption still doesn't seem to be correct, because the data in the QQ plot deviates very much here, too.
Residual diagnostics from DHARMa
But if the data also do not correspond to a gamma distribution, what alternative is there? Or maybe the problem lies somewhere else entirely.
Does anyone have an idea where the error might lie?
But if the data also do not correspond to a gamma distribution, what alternative is there?
One alternative is called the lognormal distribution (https://en.wikipedia.org/wiki/Log-normal_distribution)
Gaussian (or normal) distributions are typically used for data that are normally distributed around zero, which sounds like you do not have. But the lognormal distribution does not have the same requirements. Following your previous code, you would fit it like this:
model <- glmer(formula = log(difference_in_min) ~ repro + precipitation + (1+repro|transponder number), data = trip, control=ctrl, family=gaussian(link = identity))
or instead of glmer you can just call lmer directly where you don't need to specify the distribution (which it may tell you to do in a warning message anyway:
model <- lmer(formula = log(difference_in_min) ~ repro + precipitation + (1+repro|transponder number), data = trip, control=ctrl)

Using ROC curve to find optimum cutoff for my weighted binary logistic regression (glm) in R

I have build a binary logistic regression for churn prediction in Rstudio. Due to the unbalanced data used for this model, I also included weights. Then I tried to find the optimum cutoff by try and error, however To complete my research I have to incorporate ROC curves to find the optimum cutoff. Below I provided the script I used to build the model (fit2). The weight is stored in 'W'. This states that the costs of wrongly identifying a churner is 14 times as large as the costs of wrongly identifying a non-churner.
#CH1 logistic regression
library(caret)
W = 14
lvl = levels(trainingset$CH1)
print(lvl)
#if positive we give it the defined weight, otherwise set it to 1
fit_wts = ifelse(trainingset$CH1==lvl[2],W,1)
fit2 = glm(CH1 ~ RET + ORD + LVB + REVA + OPEN + REV2KF + CAL + PSIZEF + COM_P_C + PEN + SHOP, data = trainingset, weight=fit_wts, family=binomial(link='logit'))
# we test it on the test set
predlog1 = ifelse(predict(fit2,testset,type="response")>0.5,lvl[2],lvl[1])
predlog1 = factor(predlog1,levels=lvl)
predlog1
confusionMatrix(pred,testset$CH1,positive=lvl[2])
For this research I have also build ROC curves for decision trees using the pROC package. However, of course the same script does not work the same for a logistic regression. I have created a ROC curve for the logistic regression using the script below.
prob=predict(fit2, testset, type=c("response"))
testset$prob=prob
library(pROC)
g <- roc(CH1 ~ prob, data = testset, )
g
plot(g)
Which resulted in the ROC curve below.
How do I get the optimum cut off from this ROC curve?
Getting the "optimal" cutoff is totally independent of the type of model, so you can get it like you would for any other type of model with pROC. With the coords function:
coords(g, "best", transpose = FALSE)
Or directly on a plot:
plot(g, print.thres=TRUE)
Now the above simply maximizes the sum of sensitivity and specificity. This is often too simplistic and you probably need a clear definition of "optimal" that is adapted to your use case. That's mostly beyond the scope of this question, but as a starting point you should a look at Best Thresholds section of the documentation of the coords function for some basic options.

How does one extract hat values and Cook's Distance from an `nlsLM` model object in R?

I'm using the nlsLM function to fit a nonlinear regression. How does one extract the hat values and Cook's Distance from an nlsLM model object?
With objects created using the nls or nlreg functions, I know how to extract the hat values and the Cook's Distance of the observations, but I can't figure out how to get them using nslLM.
Can anyone help me out on this? Thanks!
So, it's not Cook's Distance or based on hat values, but you can use the function nlsJack in the nlstools package to jackknife your nls model, which means it removes every point, one by one, and bootstraps the resulting model to see, roughly speaking, how much the model coefficients change with or without a given observation in there.
Reproducible example:
xs = rep(1:10, times = 10)
ys = 3 + 2*exp(-0.5*xs)
for (i in 1:100) {
xs[i] = rnorm(1, xs[i], 2)
}
df1 = data.frame(xs, ys)
nls1 = nls(ys ~ a + b*exp(d*xs), data=df1, start=c(a=3, b=2, d=-0.5))
require(nlstools)
plot(nlsJack(nls1))
The plot shows the percentage change in each model coefficient as each individual observation is removed, and it marks influential points above a certain threshold as "influential" in the resulting plot. The documentation for nlsJack describes how this threshold is determined:
An observation is empirically defined as influential for one parameter if the difference between the estimate of this parameter with and without the observation exceeds twice the standard error of the estimate divided by sqrt(n). This empirical method assumes a small curvature of the nonlinear model.
My impression so far is that this a fairly liberal criterion--it tends to mark a lot of points as influential.
nlstools is a pretty useful package overall for diagnosing nls model fits though.

{Methcomp} ā€“ Deming / orthogonal regression ā€“ goodness of fit + confidence intervals

A question following this post. I have the following data:
x1, disease symptom
y1, another disease symptom
I fitted the x1/y1 data with a Deming regression with vr (or sdr) option set to 1. In other words, the regression is a Total Least Squares regression, i.e. orthogonal regression. See previous post for the graph.
x1=c(24.0,23.9,23.6,21.6,21.0,20.8,22.4,22.6,
21.6,21.2,19.0,19.4,21.1,21.5,21.5,20.1,20.1,
20.1,17.2,18.6,21.5,18.2,23.2,20.4,19.2,22.4,
18.8,17.9,19.1,17.9,19.6,18.1,17.6,17.4,17.5,
17.5,25.2,24.4,25.6,24.3,24.6,24.3,29.4,29.4,
29.1,28.5,27.2,27.9,31.5,31.5,31.5,27.8,31.2,
27.4,28.8,27.9,27.6,26.9,28.0,28.0,33.0,32.0,
34.2,34.0,32.6,30.8)
y1=c(100.0,95.5,93.5,100.0,98.5,99.5,34.8,
45.8,47.5,17.4,42.6,63.0,6.9,12.1,30.5,
10.5,14.3,41.1, 2.2,20.0,9.8,3.5,0.5,3.5,5.7,
3.1,19.2,6.4, 1.2, 4.5, 5.7, 3.1,19.2, 6.4,
1.2,4.5,81.5,70.5,91.5,75.0,59.5,73.3,66.5,
47.0,60.5,47.5,33.0,62.5,87.0,86.0,77.0,
86.0,83.0,78.5,83.0,83.5,73.0,69.5,82.5,78.5,
84.0,93.5,83.5,96.5,96.0,97.5)
x11()
plot(x1,y1,xlim=c(0,35),ylim=c(0,100))
library(MethComp)
dem_reg <- Deming(x1, y1)
abline(dem_reg[1:2], col = "green")
I would like to know how much x1 helps to predict y1:
normally, Iā€™d go for a R-squared, but it does not seem to be relevant; although another mathematician told me he thinks a R-squared may be appropriate. And this page suggests to calculate a Pearson product-moment correlation coefficient, which is R I believe?
partially related, there is possibly a tolerance interval. I could calculated it with R ({tolerance} package or code shown in the post), but it is not exactly what I am searching for.
Does someone know how to calculate a goodness of fit for Deming regression, using R? I looked at MetchComp pdf but could not find it (perhaps missed it though).
EDIT: following Gaurav's answers about confidence interval: R code
Firstly: confidence intervals for parameters
library(mcr)
MCR_reg=mcreg(x1,y1,method.reg="Deming",error.ratio=1,method.ci="analytical")
getCoefficients(MCR_reg)
Secondly: confidence intervals for predicted values
# plot of data
x11()
plot(x1,y1,xlim=c(0,35),ylim=c(0,100))
# Deming regression using functions from {mcr}
library(mcr) MCR_reg=mcreg(x1,y1,method.reg="Deming",error.ratio=1,method.ci="analytical")
MCR_intercept=getCoefficients(MCR_reg)[1,1]
MCR_slope=getCoefficients(MCR_reg)[2,1]
# CI for predicted values
x_to_predict=seq(0,35)
predicted_values=MCResultAnalytical.calcResponse(MCR_reg,x_to_predict,alpha=0.05)
CI_low=predicted_values[,4]
CI_up=predicted_values[,5]
# plot regression line and CI for predicted values
abline(MCR_intercept,MCR_slope, col="red")
lines(x_to_predict,CI_low,col="royalblue",lty="dashed")
lines(x_to_predict,CI_up,col="royalblue",lty="dashed")
# comments
text(7.5,60, "Deming regression", col="red")
text(7.5,40, "Confidence Interval for", col="royalblue")
text(7.5,35, "Predicted values - 95%", col="royalblue")
EDIT 2
Topic moved to Cross Validated:
https://stats.stackexchange.com/questions/167907/deming-orthogonal-regression-measuring-goodness-of-fit
There are many proposed methods to calculate goodness of fit and tolerance intervals for Deming Regression but none of them widely accepted. The conventional methods we use for OLS regression may not make sense. This is an area of active research. I don't think there many R-packages which will help you compute that since not many mathematicians agree on any particular method. Most methods for calculating intervals are based on Resampling techniques.
However you can check out the 'mcr' package for intervals...
https://cran.r-project.org/web/packages/mcr/

how can I predict probability of an event using the weibull distribution

I have a data set of connection forces based on axial force in N (http://pastebin.com/Huwg4vxv)
Some previous analyses has been undertaken (by another party) and has fitted a Weibull distribution to it, and then predicted that the chances of recording a force of 60N or higher is around 1.2%.
I have to say that eyeballing the data, that doesn't seem likely to me, but I know nothing about this particular distribution.
So far I am able to fit the curve:
force<-read.csv(file="forcestats.csv",header = T)
library(MASS)
fitdistr(force$F, 'weibull')
hist(force$F)
I am trying to understand
is a weibull distro really the best fit for this data ?
how I can make that same prediction using R (how to calculate the probability of values above 60N);
is it possible to calculate the 95% confidence interval for that value (i.e., 1.2% +/- x%)
Thanks for reading
Pete
To address your first item,
is a weibull distro really the best fit for this data ?
conceptually, this is more of a question about statistical inference rather than programming, so you most likely want to tackle that on CrossValidated rather than SO. However, you can certainly inquire about the means of investigating this programmatically, such as comparing the estimated density of the observed data to the theoretical density function or to the density function of random samples from a weibull distribution with your parameter estimates:
library(MASS)
##
Weibull <- read.csv(
"F:/Studio/MiscData/force_in_newtons.txt",
header=TRUE)
##
params <- fitdistr(Weibull$F, 'weibull')
##
Shape <- params[[1]][1]
Scale <- params[[1]][2]
##
set.seed(123)
plot(
density(
rweibull(
500,shape=Shape,scale=Scale)),
col="red",
lwd=2,lty=3,
main="")
##
lines(
density(
Weibull$F),
col="blue",
lty=3,lwd=2)
##
legend(
"topright",
legend=c(
"rweibull(n=500,...)",
"observed data"),
lty=c(3,3),
col=c("red","blue"),
lwd=c(3,3),
bty="n")
Of course, there are many other ways of assessing the fit of your model, this is just a quick sanity check.
As for your second question, you can use the pweibull function with lower.tail=FALSE to get probabilities from the theoretical survival function (S(x) = 1 - F(x)):
## Pr(X >= 60)
> pweibull(
60,shape=Shape,scale=Scale,
lower.tail=FALSE)
[1] 0.01268268
As for your final item, I believe that calculating confidence intervals on probabilities (as well as certain other statistical quantities) for an estimated distribution requires using the Delta method; I could be recalling incorrectly though, so you may want to double check on this. If this is the case and you aren't familiar with the Delta method, then unfortunately you will probably have to do a fair amount of reading on the subject because the calculation involved is generally non-trivial - here's another link; the Wikipedia article doesn't give a very in-depth treatment of the subject. Or, you could inquire about this on Cross Validated as well.

Resources