Quantile regression and p-values - r

I am applying guantile regression for my data-set (using R). It is easy to produce the nice scatterplot-image with different quantile regression lines
(taus <- c(0.05,0.25,0.75,0.95)).
Problem occurs when I want to produce p-values (in order to see statistical significance of each regression line) for each one of these quantiles. For median quantile (tau=0.5) this is not problematic, but when it comes to for example tau=0.25, I get following error message:
>QRmodel<-rq(y~x,tau=0.25,model=T)
>summary(QRmodel,se="nid")
Error in summary.rq(QRmodel, se = "nid") : tau - h < 0: error in summary.rq
What could be the reason for this?
Also: Is it recommendable to mention p-values and coefficients regarding the results of quantile regression model or could it be enough to show just the plot-picture and discuss the results based on that picture?
Best regards, frustrated person

A good way to learn what's going on in these sorts of debugging situations is to find the relevant portion of code that is throwing the error. If you type 'summary.rq' at the console, you'll see the code for the function summary.rq. Scanning through it you'll find the section where it calculates se's using the "nid" method, starting with this code:
else if (se == "nid") {
h <- bandwidth.rq(tau, n, hs = hs)
if (tau + h > 1)
stop("tau + h > 1: error in summary.rq")
if (tau - h < 0)
stop("tau - h < 0: error in summary.rq")
bhi <- rq.fit.fnb(x, y, tau = tau + h)$coef
blo <- rq.fit.fnb(x, y, tau = tau - h)$coef
So what's happening here is that in order to calculate the se's, the function first need to calculate a bandwidth, h, and the quantreg model is refit for tau +/- h. For tau's near 0 or 1, there's a possibility that adding or subtracting the bandwidth 'h' will lead to a tau below 0 or greater than 1, which isn't good, so the function stops.
You have a couple of options:
1.) Try a different se method (bootstrapping?)
2.) Modify the summary.rq code yourself to force it to use either max(tau,0) or min(tau,1) in the instances where the bandwidth pushes tau out of bounds. (There could be serious theoretical reasons why this is a bad idea; not advised unless you know what you're doing.)
3.) You could try to read up on the theory behind the calculation of these se's so you'd have a better idea of when they might work well or not. This might shed some light on why you're running into errors with values of tau near 0 or 1.

Try summary(QRmodel,se="boot")
Have a look at the help for summary.rq as well!

Related

Beta regression model in R

Please again accept my apologies for my little knowledge in R. I'm, trying to get better! I'm a biologist and my statistical knowledge is sadly low
I have the following data set:
Perc_Reacting,Pulses,IndMutant,Proportion
93,1,1,0.93
81,2,1,0.81
73,3,1,0.73
64,4,1,0.64
73,5,1,0.73
68,6,1,0.68
64,7,1,0.64
65,8,1,0.65
50,9,1,0.5
68,10,1,0.68
57,11,1,0.57
50,12,1,0.5
62,13,1,0.62
44,14,1,0.44
54,15,1,0.54
56,16,1,0.56
50,17,1,0.5
42,18,1,0.42
42,19,1,0.42
29,20,1,0.29
96,1,0,0.96
100,2,0,1
92,3,0,0.92
96,4,0,0.96
92,5,0,0.92
92,6,0,0.92
84,7,0,0.84
96,8,0,0.96
91,9,0,0.91
82,10,0,0.82
86,11,0,0.86
82,12,0,0.82
91,13,0,0.91
85,14,0,0.85
83,15,0,0.83
70,16,0,0.7
74,17,0,0.74
64,18,0,0.64
68,19,0,0.68
78,20,0,0.78
The first and last rows are the same, one expressed in % an the other in a 1-0 proportion
I need to run a Beta regression model, but when I try to create the model an error jumps:
model.beta<-betareg(C_elegans$Proportion~C_elegans$Pulses)
Error in betareg(C_elegans$Proportion ~ C_elegans$Pulses) :
invalid dependent variable, all observations must be in (0, 1)
Could you help me to create a beta regression model for this data and how to make relevant plots to show it fits good?
Also I need to propose a linear regression model for this data, can anyone let me know how you think it could be done better?
Here are the results of fitting the last three columns to a flat surface plane equation "Proportion = a + (b * Pulses) + (c * IndMutant)" with parameters a = 1.0468289473684214E+00, b = -1.8650375939849695E-02, and c = -2.5850000000000006E-01 yielding R-squared = 0.876 and RMSE = 0.064.
(here "absolute error" means "not relative error")

Extracting coefficients from mcmc object

Extracting stuff from objects has always been one of the most confusing aspects of R to me. I've fitted a bayesian linear regression model using rjags and have the following mcmc object:
summary(m_csim)
Iterations = 1:150000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 150000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
BR2 0.995805 0.0007474 1.930e-06 3.527e-06
BR2adj 0.995680 0.0007697 1.987e-06 3.633e-06
b[1] -5.890842 0.1654755 4.273e-04 1.289e-02
b[2] 1.941420 0.0390239 1.008e-04 1.991e-03
b[3] 1.056599 0.0555885 1.435e-04 5.599e-03
sig2 0.004678 0.0008333 2.152e-06 3.933e-06
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
BR2 0.994108 0.995365 0.995888 0.996339 0.99702
BR2adj 0.993932 0.995227 0.995765 0.996229 0.99693
b[1] -6.210425 -6.000299 -5.894810 -5.784082 -5.55138
b[2] 1.867453 1.914485 1.940372 1.967466 2.02041
b[3] 0.942107 1.020846 1.057720 1.094442 1.16385
sig2 0.003321 0.004082 0.004585 0.005168 0.00657
In order to extract the coefficients' means I did b = colMeans(mod_csim)[3:5]. I want to calculate credible intervals so I need to extract the 0.025 and 0.975 quantiles too. How do I do that programmatically ?
You can probably extract the quantiles directly. As others have pointed out, you can call str(m_csim), and you can limit the output of the str call with str(m_csim, max.level=1), and keep adding one to the max.level= argument until you see something that looks like quantiles.
What I like to do is to convert the MCMC output to a data.frame so it's easier to work with. I use jagsUI rather than rjags, but I often do something like
mcmc_df <- as.data.frame(as.matrix(MY_MCMC_OBJECT$samples))
Note: it might be a little different with rjags, but I'm sure you can find it with a little digging.
The benefit: I can then access a single vector for each mcmc_df$PARAMETER, and create a matrix of quantiles with
mcmc_quants <- apply(mcmc_df, 2, quantile, p=c(0.025, 0.25, 0.5, 0.75, 0.975))
or whatever quantiles you want.
You probably are looking for
model_summary_object <- summary(m_csim)
model_summary_object$quantiles[,c('2.5%','97.5%')]
I hope I'm not overstepping my knowledge bounds, but I want to answer from the point of view of 'in general' rather than specifically for rjags. m_csim is an object and many methods can possibly be used on it. You've used the summary method to see something. As people have commented, probably there is a coef method. However as someone else has commented (while I was replying !), using str() to see what an object contains is the best way to see what information is in an object and how to address it. I'd be very surprised if using str() doesn't show how to find not only the coefficients but also enough information on the confidence intervals to allow you to find the desired CI.

Why is the likelihood/AIC of my poisson regression infinite?

I am trying to evaluate themodel fit of several regressions in R, and I have run into a problem I have had multiple times now: the log-likelihood of my Poisson regression is infinite.
I'm using a non-integer dependent variable (Note: I know what I'm doing in this regard), and I'm wondering if maybe that's the problem. However, I don't get an infinite log-likelihood when running the regression with glm.nb.
Code to reproduce the issue is below.
Edit: the problem appears to go away when I coerce the DV to integer. Any idea how to get log likelihood from Poissons with non-integer DVs?
# Input Data
so_data <- data.frame(dv = c(21.0552722691125, 24.3061351414885, 7.84658638053276,
25.0294679770848, 15.8064731063311, 10.8171744654056, 31.3008088413026,
2.26643928259238, 18.4261153345417, 5.62915828161753, 17.0691184593063,
1.11959635820499, 30.0154935602592, 23.0000809735738, 28.4389825676123,
27.7678405415711, 23.7108405071757, 23.5070651053276, 14.2534787168392,
15.2058525068363, 19.7449094187771, 2.52384709295823, 29.7081691356397,
32.4723790240354, 19.2147002673637, 61.7911384519901, 10.5687170234821,
23.9047421013736, 18.4889651451222, 13.0360878554798, 15.1752866581849,
11.5205948111817, 31.3539840929108, 31.7255952728076, 25.3034625215724,
5.00013988265465, 30.2037887018226, 1.86123112349445, 3.06932041603219,
22.6739418581257, 6.33738321053804, 24.2933951601142, 14.8634827414491,
31.8302947881089, 34.8361908525564, 1.29606416941288, 13.206844629927,
28.843579313401, 25.8024295609021, 14.4414831628722, 18.2109680632694,
14.7092063453463, 10.0738043919183, 28.4124482962025, 27.1004208775326,
1.31350378236957, 14.3009307888745, 1.32555197766214, 2.70896028922312,
3.88043749517381, 3.79492216916016, 19.4507965653633, 32.1689088941444,
2.61278585713499, 41.6955885902228, 2.13466761675063, 30.4207256294235,
24.8231524369244, 20.7605955978196, 17.2182798298094, 2.11563574288652,
12.290778250655, 0.957467139696772, 16.1775287334746))
# Run Model
p_mod <- glm(dv ~ 1, data = so_data, family = poisson(link = 'log'))
# Be Confused
logLik(p_mod)
Elaborating on #ekstroem's comment: the Poisson distribution is only supported over the non-negative integers (0, 1, ...). So, technically speaking, the probability of any non-integer value is zero -- although R does allow for a little bit of fuzz, to allow for round-off/floating-point representation issues:
> dpois(1,lambda=1)
[1] 0.3678794
> dpois(1.1,lambda=1)
[1] 0
Warning message:
In dpois(1.1, lambda = 1) : non-integer x = 1.100000
> dpois(1+1e-7,lambda=1) ## fuzz
[1] 0.3678794
It is theoretically possible to compute something like a Poisson log-likelihood for non-integer values:
my_dpois <- function(x,lambda,log=FALSE) {
LL <- -lambda+x*log(lambda)-lfactorial(x)
if (log) LL else exp(LL)
}
but I would be very careful - some quick tests with integrate suggest it integrates to 1 (after I fixed the bug in it), but I haven't checked more carefully that this is really a well-posed probability distribution. (On the other hand, some reasonable-seeming posts on CrossValidated suggest that it's not insane ...)
You say "I know what I'm doing in this regard"; can you give some more of the context? Some alternative possibilities (although this is steering into CrossValidated territory) -- the best answer depends on where your data really come from (i.e., why you have "count-like" data that are non-integer but you think should be treated as Poisson).
a quasi-Poisson model (family=quasipoisson). (R will still not give you log-likelihood or AIC values in this case, because technically they don't exist -- you're supposed to do inference on the basis of the Wald statistics of the parameters; see e.g. here for more info.)
a Gamma model (probably with a log link)
if the data started out as count data that you've scaled by some measure of effort or exposure), use an appropriate offset model ...
a generalized least-squares model (nlme::gls) with an appropriate heteroscedasticity specification
Poisson log-likelihood involves calculating log(factorial(x)) (https://www.statlect.com/fundamentals-of-statistics/Poisson-distribution-maximum-likelihood). For values larger than 30 it has to be done using Stirling's approximation formula in order to avoid exceeding the limit of computer arithmetic. Sample code in Python:
# define a likelihood function. https://www.statlect.com/fundamentals-of- statistics/Poisson-distribution-maximum-likelihood
def loglikelihood_f(lmba, x):
#Using Stirling formula to avoid calculation of factorial.
#logfactorial(n) = n*ln(n) - n
n = x.size
logfactorial = x*np.log(x+0.001) - x #np.log(factorial(x))
logfactorial[logfactorial == -inf] = 0
result =\
- np.sum(logfactorial) \
- n * lmba \
+ np.log(lmba) * np.sum(x)
return result

Optimizing a simple linear curve (constant and coefficient estimated from a regression)

I am trying to calculate the turning point of a a few functions where I have estimated the coefficient and constant from a regression. I'm using the optimize function for this as my curves are all linear.
My function looks like:
F<- function(x){
beta* x + alpha
}
mind: beta and alpha are both vectors here. When running the optimisation with optimize, I'm getting the following error:
Error in optimize(F, interval = c(10, 20), lower = (10), :
invalid function value in 'optimize'
Is this because optimize is running the optimisation mathematically, so the beta and alphas need to be single parameters? If anyone knows a better way of doing this please do contribute!
Thank you in advance :)
If the functions are linear, then they will be at a minimum at the lower end of the range where beta>=0, and at the upper end of the range if beta<=0 - no need to use optimize().
It's not entirely clear what you're expecting the code to do - if you want it to return an x for each set of parameters, look at optim() instead and have F return the sum, or run optimize on each set of parameters in turn using an apply() function or loop.
One other thing is that your syntax is a bit wonky - I imagine that you mean:
> F<- function(x){
+ beta* x + alpha
+ }
> alpha <- 1
> beta <- 2
> optimize(F,c(10,20))
$minimum
[1] 10.00006
$objective
[1] 21.00011

R: Robust fitting of data points to a Gaussian function

I need to do some robust data-fitting operation.
I have bunch of (x,y) data, that I want to fit to a Gaussian (aka normal) function.
The point is, I want to remove the ouliers. As one can see on the sample plot below, there is another distribution of data thats pollutting my data on the right, and I don't want to take it into account to do the fitting (i.e. to find \sigma, \mu and the overall scale parameter).
R seems to be the right tool for the job, I found some packages (robust, robustbase, MASS for example) that are related to robust fitting.
However, they assume the user already has a strong knowledge of R, which is not my case, and the documentation is only provided as a sort of reference manual, no tutorial or equivalent. My statistical background is rather low, I attempted to read reference material on fitting with R, but it didn't really help (and I'm not even sure thats the right way to go).
But I have the feeling that this is actually a quite simple operation.
I have checked this related question (and the linked ones), however they take as input a single vector of values, and I have a vector of pairs, so I don't see how to transpose.
Any help on how to do this would be appreciated.
Fitting a Gaussian curve to the data, the principle is to minimise the sum of squares difference between the fitted curve and the data, so we define f our objective function and run optim on it:
fitG =
function(x,y,mu,sig,scale){
f = function(p){
d = p[3]*dnorm(x,mean=p[1],sd=p[2])
sum((d-y)^2)
}
optim(c(mu,sig,scale),f)
}
Now, extend this to two Gaussians:
fit2G <- function(x,y,mu1,sig1,scale1,mu2,sig2,scale2,...){
f = function(p){
d = p[3]*dnorm(x,mean=p[1],sd=p[2]) + p[6]*dnorm(x,mean=p[4],sd=p[5])
sum((d-y)^2)
}
optim(c(mu1,sig1,scale1,mu2,sig2,scale2),f,...)
}
Fit with initial params from the first fit, and an eyeballed guess of the second peak. Need to increase the max iterations:
> fit2P = fit2G(data$V3,data$V6,6,.6,.02,8.3,0.10,.002,control=list(maxit=10000))
Warning messages:
1: In dnorm(x, mean = p[1], sd = p[2]) : NaNs produced
2: In dnorm(x, mean = p[4], sd = p[5]) : NaNs produced
3: In dnorm(x, mean = p[4], sd = p[5]) : NaNs produced
> fit2P
$par
[1] 6.035610393 0.653149616 0.023744876 8.317215066 0.107767881 0.002055287
What does this all look like?
> plot(data$V3,data$V6)
> p = fit2P$par
> lines(data$V3,p[3]*dnorm(data$V3,p[1],p[2]))
> lines(data$V3,p[6]*dnorm(data$V3,p[4],p[5]),col=2)
However I would be wary about statistical inference about your function parameters...
The warning messages produced are probably due to the sd parameter going negative. You can fix this and also get a quicker convergence by using L-BFGS-B and setting a lower bound:
> fit2P = fit2G(data$V3,data$V6,6,.6,.02,8.3,0.10,.002,control=list(maxit=10000),method="L-BFGS-B",lower=c(0,0,0,0,0,0))
> fit2P
$par
[1] 6.03564202 0.65302676 0.02374196 8.31424025 0.11117534 0.00208724
As pointed out, sensitivity to initial values is always a problem with curve fitting things like this.
Fitting a Gaussian:
# your data
set.seed(0)
data <- c(rnorm(100,0,1), 10, 11)
# find & remove outliers
outliers <- boxplot(data)$out
data <- setdiff(data, outliers)
# fitting a Gaussian
mu <- mean(data)
sigma <- sd(data)
# testing the fit, check the p-value
reference.data <- rnorm(length(data), mu, sigma)
ks.test(reference.data, data)

Resources