Beta regression model in R - r

Please again accept my apologies for my little knowledge in R. I'm, trying to get better! I'm a biologist and my statistical knowledge is sadly low
I have the following data set:
Perc_Reacting,Pulses,IndMutant,Proportion
93,1,1,0.93
81,2,1,0.81
73,3,1,0.73
64,4,1,0.64
73,5,1,0.73
68,6,1,0.68
64,7,1,0.64
65,8,1,0.65
50,9,1,0.5
68,10,1,0.68
57,11,1,0.57
50,12,1,0.5
62,13,1,0.62
44,14,1,0.44
54,15,1,0.54
56,16,1,0.56
50,17,1,0.5
42,18,1,0.42
42,19,1,0.42
29,20,1,0.29
96,1,0,0.96
100,2,0,1
92,3,0,0.92
96,4,0,0.96
92,5,0,0.92
92,6,0,0.92
84,7,0,0.84
96,8,0,0.96
91,9,0,0.91
82,10,0,0.82
86,11,0,0.86
82,12,0,0.82
91,13,0,0.91
85,14,0,0.85
83,15,0,0.83
70,16,0,0.7
74,17,0,0.74
64,18,0,0.64
68,19,0,0.68
78,20,0,0.78
The first and last rows are the same, one expressed in % an the other in a 1-0 proportion
I need to run a Beta regression model, but when I try to create the model an error jumps:
model.beta<-betareg(C_elegans$Proportion~C_elegans$Pulses)
Error in betareg(C_elegans$Proportion ~ C_elegans$Pulses) :
invalid dependent variable, all observations must be in (0, 1)
Could you help me to create a beta regression model for this data and how to make relevant plots to show it fits good?
Also I need to propose a linear regression model for this data, can anyone let me know how you think it could be done better?

Here are the results of fitting the last three columns to a flat surface plane equation "Proportion = a + (b * Pulses) + (c * IndMutant)" with parameters a = 1.0468289473684214E+00, b = -1.8650375939849695E-02, and c = -2.5850000000000006E-01 yielding R-squared = 0.876 and RMSE = 0.064.
(here "absolute error" means "not relative error")

Related

R Quantreg: Singularity with categorical survey data

For my Bachelor's thesis I am trying to apply a linear median regression model on constant sum data from a survey (see formula from A.Blass (2008)). It is an attempt to recreate the probability elicitation approach proposed by A. Blass et al (2008) - Using Elicited Choice Probabilities to Estimate Random Utility Models: Preferences for Electricity Reliability
My dependent variable is the log-odds transformation of the constant sum allocations. Calculated using the following formula:
PE_raw <- PE_raw %>% group_by(sys_RespNum, Task) %>% mutate(LogProb = c(log(Response[1]/Response[1]),
log(Response[2]/Response[1]),
log(Response[3]/Response[1])))
My independent variables are delivery costs, minimum order quantity and delivery window, each categorical variables with levels 0, 1, 2 and 3. Here, level 0 represent the none-option.
Data snapshot
I tried running the following quantile regression (using R's quantreg package):
LAD.factor <- rq(LogProb ~ factor(`Delivery costs`) + factor(`Minimum order quantity`) + factor(`Delivery window`) + factor(NoneOpt), data=PE_raw, tau=0.5)
However, I ran into the following error indicating singularity:
Error in rq.fit.br(x, y, tau = tau, ...) : Singular design matrix
I ran a linear regression and applied R's alias function for further investigation. This informed me of three cases of perfect multicollinearity:
minimum order quantity 3 = delivery costs 1 + delivery costs 2 + delivery costs 3 - minimum order quantity 1 - minimum order quantity 2
delivery window 3 = delivery costs 1 + delivery costs 2 + delivery costs 3 - delivery window 1 - delivery window 2
NoneOpt = intercept - delivery costs 1 - delivery costs 2 - delivery costs 3
In hindsight these cases all make sense. When R dichotomizedthe categorical variables you get these results by construction as, delivery costs 1 + delivery costs 2 + delivery costs 3 = 1 and minimum order quantity 1 + minimum order quantity 2 + minimum order quantity 3 = 1. Rewriting gives the first formula.
It looks like a classic dummy trap. In an attempt to workaround this issue I tried to manually dichotomize the data and used the following formula:
LM.factor <- rq(LogProb ~ Delivery.costs_1 + Delivery.costs_2 + Minimum.order.quantity_1 + Minimum.order.quantity_2 + Delivery.window_1 + Delivery.window_2 + factor(NoneOpt), data=PE_dichomitzed, tau=0.5)
Instead of an error message I now got the following:
Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique
When using the summary function:
> summary(LM.factor)
Error in base::backsolve(r, x, k = k, upper.tri = upper.tri, transpose = transpose, :
singular matrix in 'backsolve'. First zero in diagonal [2]
In addition: Warning message:
In summary.rq(LM.factor) : 153 non-positive fis
Is anyone familiar with this issue? I am looking for alternative solutions. Perhaps I am making mistakes using the rq() function, or the data might be misrepresented.
I am grateful for any input, thank you in advance.
Reproducible example
library(quantreg)
#### Raw dataset (PE_raw_SO) ####
# quantile regression (produces singularity error)
LAD.factor <- rq(
LogProb ~ factor(`Delivery costs`) +
factor(`Minimum order quantity`) + factor(`Delivery window`) +
factor(NoneOpt),
data = PE_raw_SO,
tau = 0.5
)
# linear regression to check for singularity
LM.factor <- lm(
LogProb ~ factor(`Delivery costs`) +
factor(`Minimum order quantity`) + factor(`Delivery window`) +
factor(NoneOpt),
data = PE_raw_SO
)
alias(LM.factor)
# impose assumptions on standard errors
summary(LM.factor, se = "iid")
summary(LM.factor, se = "boot")
#### Manually created dummy variables to get rid of
#### collinearity (PE_dichotomized_SO) ####
LAD.di.factor <- rq(
LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
Minimum.order.quantity_1 + Minimum.order.quantity_2 +
Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
data = PE_dichotomized_SO,
tau = 0.5
)
summary(LAD.di.factor) #backsolve error
# impose assumptions (unusual results)
summary(LAD.di.factor, se = "iid")
summary(LAD.di.factor, se = "boot")
# linear regression to check for singularity
LM.di.factor <- lm(
LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
Minimum.order.quantity_1 + Minimum.order.quantity_2 +
Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
data = PE_dichotomized_SO
)
alias(LM.di.factor)
summary(LM.di.factor) #regular results, all significant
Link to sample data + code: GitHub
The Solution may be nonunique behaviour is not unusual when doing quantile regressions with dummy explanatory variables.
See, e.g., the quantreg FAQ:
The estimation of regression quantiles is a linear programming
problem. And the optimal solution may not be unique.
A more intuitive explanation for what is happening is given by Roger Koenker (the author of quantreg) on r-help back in 2006:
When computing the median from a sample with an even number of
distinct values there is inherently some ambiguity about its value:
any value between the middle order statistics is "a" median.
Similarly, in regression settings the optimization problem solved by
the "br" version of the simplex algorithm, modified to do general
quantile regression identifies cases where there may be non
uniqueness of this type. When there are "continuous" covariates this
is quite rare, when covariates are discrete then it is relatively
common, atleast when tau is chosen from the rationals. For univariate
quantiles R provides several methods of resolving this sort of
ambiguity by interpolation, "br" doesn't try to do this, instead
returning the first vertex solution that it comes to.
Your second warning -- "153 non-positive fis" -- is a warning related to how the local densities are calculated by rq. Occasionally, it could be possible that local densities of the quantile regression function end up being negative (which is obviously impossible). If this happens, rq automatically sets them to zero. Again, quoting from the FAQ:
This is generally harmless, leading to a somewhat conservative
(larger) estimate of the standard errors, however if the reported
number of non-positive fis is large relative to the sample size then
it is an indication of misspecification of the model.

Why is the likelihood/AIC of my poisson regression infinite?

I am trying to evaluate themodel fit of several regressions in R, and I have run into a problem I have had multiple times now: the log-likelihood of my Poisson regression is infinite.
I'm using a non-integer dependent variable (Note: I know what I'm doing in this regard), and I'm wondering if maybe that's the problem. However, I don't get an infinite log-likelihood when running the regression with glm.nb.
Code to reproduce the issue is below.
Edit: the problem appears to go away when I coerce the DV to integer. Any idea how to get log likelihood from Poissons with non-integer DVs?
# Input Data
so_data <- data.frame(dv = c(21.0552722691125, 24.3061351414885, 7.84658638053276,
25.0294679770848, 15.8064731063311, 10.8171744654056, 31.3008088413026,
2.26643928259238, 18.4261153345417, 5.62915828161753, 17.0691184593063,
1.11959635820499, 30.0154935602592, 23.0000809735738, 28.4389825676123,
27.7678405415711, 23.7108405071757, 23.5070651053276, 14.2534787168392,
15.2058525068363, 19.7449094187771, 2.52384709295823, 29.7081691356397,
32.4723790240354, 19.2147002673637, 61.7911384519901, 10.5687170234821,
23.9047421013736, 18.4889651451222, 13.0360878554798, 15.1752866581849,
11.5205948111817, 31.3539840929108, 31.7255952728076, 25.3034625215724,
5.00013988265465, 30.2037887018226, 1.86123112349445, 3.06932041603219,
22.6739418581257, 6.33738321053804, 24.2933951601142, 14.8634827414491,
31.8302947881089, 34.8361908525564, 1.29606416941288, 13.206844629927,
28.843579313401, 25.8024295609021, 14.4414831628722, 18.2109680632694,
14.7092063453463, 10.0738043919183, 28.4124482962025, 27.1004208775326,
1.31350378236957, 14.3009307888745, 1.32555197766214, 2.70896028922312,
3.88043749517381, 3.79492216916016, 19.4507965653633, 32.1689088941444,
2.61278585713499, 41.6955885902228, 2.13466761675063, 30.4207256294235,
24.8231524369244, 20.7605955978196, 17.2182798298094, 2.11563574288652,
12.290778250655, 0.957467139696772, 16.1775287334746))
# Run Model
p_mod <- glm(dv ~ 1, data = so_data, family = poisson(link = 'log'))
# Be Confused
logLik(p_mod)
Elaborating on #ekstroem's comment: the Poisson distribution is only supported over the non-negative integers (0, 1, ...). So, technically speaking, the probability of any non-integer value is zero -- although R does allow for a little bit of fuzz, to allow for round-off/floating-point representation issues:
> dpois(1,lambda=1)
[1] 0.3678794
> dpois(1.1,lambda=1)
[1] 0
Warning message:
In dpois(1.1, lambda = 1) : non-integer x = 1.100000
> dpois(1+1e-7,lambda=1) ## fuzz
[1] 0.3678794
It is theoretically possible to compute something like a Poisson log-likelihood for non-integer values:
my_dpois <- function(x,lambda,log=FALSE) {
LL <- -lambda+x*log(lambda)-lfactorial(x)
if (log) LL else exp(LL)
}
but I would be very careful - some quick tests with integrate suggest it integrates to 1 (after I fixed the bug in it), but I haven't checked more carefully that this is really a well-posed probability distribution. (On the other hand, some reasonable-seeming posts on CrossValidated suggest that it's not insane ...)
You say "I know what I'm doing in this regard"; can you give some more of the context? Some alternative possibilities (although this is steering into CrossValidated territory) -- the best answer depends on where your data really come from (i.e., why you have "count-like" data that are non-integer but you think should be treated as Poisson).
a quasi-Poisson model (family=quasipoisson). (R will still not give you log-likelihood or AIC values in this case, because technically they don't exist -- you're supposed to do inference on the basis of the Wald statistics of the parameters; see e.g. here for more info.)
a Gamma model (probably with a log link)
if the data started out as count data that you've scaled by some measure of effort or exposure), use an appropriate offset model ...
a generalized least-squares model (nlme::gls) with an appropriate heteroscedasticity specification
Poisson log-likelihood involves calculating log(factorial(x)) (https://www.statlect.com/fundamentals-of-statistics/Poisson-distribution-maximum-likelihood). For values larger than 30 it has to be done using Stirling's approximation formula in order to avoid exceeding the limit of computer arithmetic. Sample code in Python:
# define a likelihood function. https://www.statlect.com/fundamentals-of- statistics/Poisson-distribution-maximum-likelihood
def loglikelihood_f(lmba, x):
#Using Stirling formula to avoid calculation of factorial.
#logfactorial(n) = n*ln(n) - n
n = x.size
logfactorial = x*np.log(x+0.001) - x #np.log(factorial(x))
logfactorial[logfactorial == -inf] = 0
result =\
- np.sum(logfactorial) \
- n * lmba \
+ np.log(lmba) * np.sum(x)
return result

Solution of varying coefficients ODE

I have a set of observed raw data and use 2nd order ODE to fit the data
y''+b1(t)y'+b0(t)y = 0
The b1 and b0 are time-dependent and I use principal differential analysis(PDA) (R-package: fda, function: pda.fd)to get the estimate of b1(t) and b0(t) .
To check the validity of the estimates of b1(t) and b0(t), I use collocation method (R-package bvpSolve, function:bvpcol) to get the numerical solution of the ODE and compare the solution with the smoothing curve fitting of the raw data.
My question is that my numerical solution from bvpcol can caputure the shape of the fitting curve but not for the value of the function. There are different in term of some constant multiples.
(Since I am not allowed to post images,please see the link for figure)
See the figure of my output. The gray dot is my raw data, the red line is Fourier expansion of the raw data, the green line is numerical solution of bvpcol function and the blue line the green-line/1.62. We can see the green line can capture the shape but with values that are constant times of fourier expansion.
I fit several other data and have similar situation but different constant. I am wondering it is the problem of numerical solution of ODE or some other reasons and how to solve this problem to get a good accordance between numerical solution(green) and true Fourier expansion?
Any help and idea is appreciated!
Here is a raw data and code:
RData is here
library(fda)
library(bvpSolve)
# load the data
load('y.RData')
tvec = 1:length(y)
tvec = (tvec-min(tvec))/(max(tvec)-min(tvec))
# create basis
fbasis = create.fourier.basis(c(0,1),nbasis=nbasis)
bbasis = create.bspline.basis(c(0,1),norder=8,nbasis=47)
bfdPar = fdPar(bbasis)
yfd = smooth.basis(tvec,y,fbasis)$fd
yfdlist = list(yfd)
bwtlist = rep(list(bfdPar),2)
# PDA fit
bwt = pda.fd(yfdlist,bwtlist)$bwtlist
# output of estimated coefficients
beta0.fd<-bwt[[1]]$fd
beta1.fd<-bwt[[2]]$fd
# define the vary-coef function in terms of t
fbeta0<-function(t)eval.fd(t,beta0.fd)
fbeta1<-function(t)eval.fd(t,beta1.fd)
# define 2nd order ODE
fun2 <- function(t,y,pars) {
with(as.list(c(y,pars)),{
beta0 = pars[[1]];
beta1 = pars[[2]];
dy1 = y[2]
dy2 = -beta1(t)*y[2]-beta0(t)*y[1]
return(list(c(dy1,dy2)))
})
}
# BVP
yinit<-c(p1[1],NA)
yend<-c(p1[length(p1)],NA)
t<-seq(tvec[1],tvec[length(tvec)],0.005)
col<-bvpcol(yini=yinit,yend=yend,x=t,func=fun2,parms=c(fbeta0,fbeta1),atol=1e-5,islin=T)
# plot output
plot(col[,1],col[,2],col='green',type='l')
points(tvec,p1,col='darkgray')
lines(yfd,col='red',lwd=2)
lines(col[,1],col[,2],col='green',type='l')
lines(col[,1],col[,2]/1.62,col='blue',type='l',lwd=2,lty=4)
legend('topleft',col=c('green','darkgray','red','blue'),
legend=c('ODE solution','raw data','basis curve fitting','ODE solution/1.62'),lty=1)

R Nonlinear Least Squares (nls) Model Fitting

I'm trying to fit the information from the G function of my data to the following mathematical mode: y = A / ((1 + (B^2)*(x^2))^((C+1)/2)) . The shape of this graph can be seen here:
http://www.wolframalpha.com/input/?i=y+%3D+1%2F+%28%281+%2B+%282%5E2%29*%28x%5E2%29%29%5E%28%282%2B1%29%2F2%29%29
Here's a basic example of what I've been doing:
data(simdat)
library(spatstat)
simdat.Gest <- Gest(simdat) #Gest is a function within spatstat (explained below)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
GvsR_dataframe <- data.frame(R = Rvalues, G = rev(Gvalues))
themodel <- nls(rev(Gvalues) ~ (1 / (1 + (B^2)*(R^2))^((C+1)/2)), data = GvsR_dataframe, start = list(B=0.1, C=0.1), trace = FALSE)
"Gest" is a function found within the 'spatstat' library. It is the G function, or the nearest-neighbour function, which displays the distance between particles on the independent axis, versus the probability of finding a nearest neighbour particle on the dependent axis. Thus, it begins at y=0 and hits a saturation point at y=1.
If you plot simdat.Gest, you'll notice that the curve is 's' shaped, meaning that it starts at y = 0 and ends up at y = 1. For this reason, I reveresed the vector Gvalues, which are the dependent variables. Thus, the information is in the correct orientation to be fitted the above model.
You may also notice that I've automatically set A = 1. This is because G(r) always saturates at 1, so I didn't bother keeping it in the formula.
My problem is that I keep getting errors. For the above example, I get this error:
Error in nls(rev(Gvalues) ~ (1/(1 + (B^2) * (R^2))^((C + 1)/2)), data = GvsR_dataframe, :
singular gradient
I've also been getting this error:
Error in nls(Gvalues1 ~ (1/(1 + (B^2) * (x^2))^((C + 1)/2)), data = G_r_dataframe, :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562
I haven't a clue as to where the first error is coming from. The second, however, I believe was occurring because I did not pick suitable starting values for B and C.
I was hoping that someone could help me figure out where the first error was coming from. Also, what is the most effective way to pick starting values to avoid the second error?
Thanks!
As noted your problem is most likely the starting values. There are two strategies you could use:
Use brute force to find starting values. See package nls2 for a function to do this.
Try to get a sensible guess for starting values.
Depending on your values it could be possible to linearize the model.
G = (1 / (1 + (B^2)*(R^2))^((C+1)/2))
ln(G)=-(C+1)/2*ln(B^2*R^2+1)
If B^2*R^2 is large, this becomes approx. ln(G) = -(C+1)*(ln(B)+ln(R)), which is linear.
If B^2*R^2 is close to 1, it is approx. ln(G) = -(C+1)/2*ln(2), which is constant.
(Please check for errors, it was late last night due to the soccer game.)
Edit after additional information has been provided:
The data looks like it follows a cumulative distribution function. If it quacks like a duck, it most likely is a duck. And in fact ?Gest states that a CDF is estimated.
library(spatstat)
data(simdat)
simdat.Gest <- Gest(simdat)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
plot(Gvalues~Rvalues)
#let's try the normal CDF
fit <- nls(Gvalues~pnorm(Rvalues,mean,sd),start=list(mean=0.4,sd=0.2))
summary(fit)
lines(Rvalues,predict(fit))
#Looks not bad. There might be a better model, but not the one provided in the question.

Quantile regression and p-values

I am applying guantile regression for my data-set (using R). It is easy to produce the nice scatterplot-image with different quantile regression lines
(taus <- c(0.05,0.25,0.75,0.95)).
Problem occurs when I want to produce p-values (in order to see statistical significance of each regression line) for each one of these quantiles. For median quantile (tau=0.5) this is not problematic, but when it comes to for example tau=0.25, I get following error message:
>QRmodel<-rq(y~x,tau=0.25,model=T)
>summary(QRmodel,se="nid")
Error in summary.rq(QRmodel, se = "nid") : tau - h < 0: error in summary.rq
What could be the reason for this?
Also: Is it recommendable to mention p-values and coefficients regarding the results of quantile regression model or could it be enough to show just the plot-picture and discuss the results based on that picture?
Best regards, frustrated person
A good way to learn what's going on in these sorts of debugging situations is to find the relevant portion of code that is throwing the error. If you type 'summary.rq' at the console, you'll see the code for the function summary.rq. Scanning through it you'll find the section where it calculates se's using the "nid" method, starting with this code:
else if (se == "nid") {
h <- bandwidth.rq(tau, n, hs = hs)
if (tau + h > 1)
stop("tau + h > 1: error in summary.rq")
if (tau - h < 0)
stop("tau - h < 0: error in summary.rq")
bhi <- rq.fit.fnb(x, y, tau = tau + h)$coef
blo <- rq.fit.fnb(x, y, tau = tau - h)$coef
So what's happening here is that in order to calculate the se's, the function first need to calculate a bandwidth, h, and the quantreg model is refit for tau +/- h. For tau's near 0 or 1, there's a possibility that adding or subtracting the bandwidth 'h' will lead to a tau below 0 or greater than 1, which isn't good, so the function stops.
You have a couple of options:
1.) Try a different se method (bootstrapping?)
2.) Modify the summary.rq code yourself to force it to use either max(tau,0) or min(tau,1) in the instances where the bandwidth pushes tau out of bounds. (There could be serious theoretical reasons why this is a bad idea; not advised unless you know what you're doing.)
3.) You could try to read up on the theory behind the calculation of these se's so you'd have a better idea of when they might work well or not. This might shed some light on why you're running into errors with values of tau near 0 or 1.
Try summary(QRmodel,se="boot")
Have a look at the help for summary.rq as well!

Resources