Zeros in Count data, how to deal with? - r

I have a data set with count data. I do a Poisson regression with glm.
Now I want to compute the null deviance by hand. For that I need the log-likelihood of the full model, but I get NaN. I think that is because some values of the response variable are 0: log(0) is -Inf, and 0 * log(0) produces NaN in R. However, glm does compute the null deviance, so there must be a trick for dealing with the 0 entries in y. Should I replace them with very small values like 0.00001, or what would be a possible solution to get a result for lf that is not NaN?
data(discoveries)
disc <- data.frame(count=as.numeric(discoveries),
year=seq(0,(length(discoveries)-1),1))
yearSqr <- disc$year^2
hush <- glm(count ~ year + yearSqr , family = "poisson", disc)
# modelFrame
test <- hush$model
# response variable
test$count
# formula for the log-likelihood of the full model: lf = sum(y * log(y) - y - log(factorial(y)))
# result is NaN
lf <- sum(test$count * log(test$count) - test$count - log(factorial(test$count)))

Your formula does not use any information about the estimated parameters, so it cannot give the log-likelihood of the fitted model. For that, use the fitted means:
sum(test$count * log(fitted(hush)) - fitted(hush) - log(factorial(test$count)))
# [1] -200.9226
logLik(hush)
# 'log Lik.' -200.9226 (df=3)
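For the deviance itself, the saturated-model log-likelihood from the question does enter the calculation; the usual convention is to treat y * log(y) as 0 when y = 0, which avoids the NaN. A minimal sketch under that assumption (the names ll_sat, ll_null, null_dev and resid_dev are illustrative, not from the question):

```r
data(discoveries)
disc <- data.frame(count = as.numeric(discoveries),
                   year  = seq_along(discoveries) - 1)
disc$yearSqr <- disc$year^2
hush <- glm(count ~ year + yearSqr, family = poisson, data = disc)

y <- disc$count

# Saturated model: fitted mean equals y; use the convention 0 * log(0) = 0
ylogy  <- ifelse(y == 0, 0, y * log(y))
ll_sat <- sum(ylogy - y - lgamma(y + 1))   # lgamma(y + 1) is log(factorial(y))

# Null model: a single common mean, which for the Poisson is mean(y)
ll_null <- sum(dpois(y, lambda = mean(y), log = TRUE))

# Deviance is twice the gap to the saturated log-likelihood
null_dev  <- 2 * (ll_sat - ll_null)
resid_dev <- 2 * (ll_sat - as.numeric(logLik(hush)))

c(null_dev, hush$null.deviance)   # the two values should agree
c(resid_dev, hush$deviance)       # likewise
```

Replacing the zeros with a small positive number is therefore not needed; only the y * log(y) term has to be guarded.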

Related

R Regression with different null hypothesis

I have a series of regressions where I would like to test different null hypotheses within the same regression.
This means that I would like to test whether one independent variable is equal to 1 and the other equal to 0.
netew3 <- summary(lm(ewvw[,3]-factors$RF ~ factors$Mkt.RF + factors$SMB + factors$HML + factors$MOM, na.action = na.exclude))
I would like to test whether the coefficient of the first variable (factors$Mkt.RF) is equal to 1 and the others (SMB, HML, and MOM) are equal to zero.
Thank you in advance for your help.
Best
PL
summary() of an lm-object gives you p-values for all coefficients under the null hypotheses that each coefficient equals 0. However, it also gives you all necessary information to conduct your own test with a different null hypothesis, e.g. that coefficients are 1.
This is one of many places where the t-test of regression coefficients is explained in detail. Essentially, you get the t-value by calculating (estimate - reference) / SE, where SE is the standard error and reference is the assumed value of the coefficient under the null hypothesis (usually 0). So all you have to do is change the latter value from 0 to 1 and you have your t-value.
I automated this in a function below. h0.value is your assumed value under the null hypothesis. You can check if it works properly with your data/model by running it with h0.value = 0 and compare the result to what you get from summary(). If it works, use it with h0.value = 1.
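estim_test <- function(lm.mod, h0.value = 0) {
  # Coefficient table from summary(): Estimate, Std. Error, t value, Pr(>|t|)
  coefm <- as.data.frame(summary(lm.mod)$coefficients)
  # t statistic against the hypothesised value instead of 0
  coefm$`t value` <- (coefm$Estimate - h0.value)/coefm$`Std. Error`
  # two-sided p-value on the residual degrees of freedom
  coefm$`Pr(>|t|)` <- 2*pt(-abs(coefm$`t value`), df=lm.mod$df.residual)
  coefm
}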
# Testing the function
data("swiss")
mod1 <- lm(Fertility ~ Agriculture + Education + Catholic, data=swiss)
summary(mod1)
estim_test(mod1, h0.value=0)
estim_test(mod1, h0.value=1)
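Because the subtraction in the function is vectorised, h0.value can also be a vector with one reference value per coefficient, which is closer to what the question asks (one coefficient tested against 1, the rest against 0). A sketch on the swiss data, where choosing Agriculture as the coefficient tested against 1 is purely illustrative:

```r
estim_test <- function(lm.mod, h0.value = 0) {
  coefm <- as.data.frame(summary(lm.mod)$coefficients)
  coefm$`t value` <- (coefm$Estimate - h0.value) / coefm$`Std. Error`
  coefm$`Pr(>|t|)` <- 2 * pt(-abs(coefm$`t value`), df = lm.mod$df.residual)
  coefm
}

data("swiss")
mod1 <- lm(Fertility ~ Agriculture + Education + Catholic, data = swiss)

# One reference value per coefficient, in the order of coef(mod1):
# (Intercept), Agriculture, Education, Catholic
h0  <- c(0, 1, 0, 0)
res <- estim_test(mod1, h0.value = h0)
res
```

Each row is still a separate t-test on a single coefficient; this does not test all the restrictions jointly.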

Creating R Squared function for CPLM package

For my graduate research I'm using the CPLM package (specifically the cpglmm function) to account for zero-inflated data (Tweedie compound Poisson distribution) in a data set looking at the effects of logging on breeding bird densities. This isn't a widely used package like lme4, nlme, etc. Therefore, the model validation methods that can be used on these more commonly used packages cannot be used on cpglmm.
I'm currently at the stage of describing the fit of my models and am trying to calculate R-squared values, both marginal and conditional. Unfortunately I cannot use the r2glmm package or MuMIn to calculate R-squared values because they do not support cpglmm. Therefore, I've had to calculate those values manually through an example found here (example found in Appendix 6 under cpglmm parasite models, pg. 33). Here's the script from that example:
# Fit null model without fixed effects (but including all random effects)
parmodCPr <- cpglmm(Parasite ~ 1 + (1 | Population) + (1 | Container), data = DataAll)
# Fit alternative model including fixed and all random effects
parmodCPf <- cpglmm(Parasite ~ Sex + Treatment + Habitat + (1 | Population) +
(1 | Container), data = DataAll)
# Calculation of the variance in fitted values
VarF <- var(as.vector(model.matrix(parmodCPf) %*% fixef(parmodCPf)))
# getting the observation-level variance Null model
phiN <- parmodCPr@phi # the dispersion parameter
pN <- parmodCPr@p # the index parameter
mu <- exp(fixef(parmodCPr) + 0.5 * (VarCorr(parmodCPr)$Population[1] + VarCorr(parmodCPr)$Container[1]))
VarOdN <- phiN * mu^(pN - 2) # the delta method
# Full model
phiF <- parmodCPf@phi # the dispersion parameter
pF <- parmodCPf@p # the index parameter
VarOdF <- phiF * mu^(pF - 2) # the delta method
# R2[GLMM(m)] - marginal R2[GLMM]; using the delta method observation-level variance
R2glmmM <- VarF/(VarF + sum(as.numeric(VarCorr(parmodCPf))) + VarOdF)
# R2[GLMM(c)] - conditional R2[GLMM] for full model
R2glmmC <- (VarF + sum(as.numeric(VarCorr(parmodCPf))))/(VarF + sum(as.numeric(VarCorr(parmodCPf))) +
VarOdF)
What I would like to do is write an R function from this code that takes my models as input and outputs both the marginal and conditional R-squared values (R2glmmM and R2glmmC). I'd greatly appreciate any help with this problem. Hopefully I have supplied enough information.
Thanks.
Believe I figured it out. Here's an example I wrote up:
R2glmm <- function(model){
  # Calculation of the variance in fitted values
  VarALT <- var(as.vector(model.matrix(model) %*% fixef(model)))
  # getting the observation-level variance Null model
  phiNULL <- NULLmodel$phi # the dispersion parameter
  pNULL <- NULLmodel$p # the index parameter
  mu <- exp(fixef(NULLmodel) + 0.5 * (VarCorr(NULLmodel)$YEAR[1]))
  VarOdNULL <- phiNULL * mu^(pNULL - 2) # the delta method
  # Alternate model
  phiALT <- model$phi # the dispersion parameter
  pALT <- model$p # the index parameter
  VarOdALT <- phiALT * mu^(pALT - 2) # the delta method
  # R2[GLMM(m)] - marginal R2[GLMM]; using the delta method observation-level variance
  R2glmmM <- VarALT/(VarALT + sum(as.numeric(VarCorr(model))) + VarOdALT)
  # R2[GLMM(c)] - conditional R2[GLMM] for full model
  R2glmmC <- (VarALT + sum(as.numeric(VarCorr(model))))/(VarALT + sum(as.numeric(VarCorr(model))) + VarOdALT)
  return(c(R2glmmM, R2glmmC))
}
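Variables containing ALT refer to the alternate model, and "model" represents any cpglmm model you need to run through the function. Note that the function reads NULLmodel (the null model, fitted beforehand) from the workspace rather than taking it as an argument, and that YEAR is the random-effect grouping in my data.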
Hope this helps someone out. Been working on this problem and other related ones for ages now.
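The final two ratios do not depend on cplm at all, so they can be checked in isolation. A small sketch with a hypothetical helper, r2_components, and made-up variance components chosen only to make the arithmetic easy to follow:

```r
# Hypothetical helper: marginal and conditional R2 from variance components.
#   var_fixed  : variance of the fixed-effect linear predictor (VarALT above)
#   var_random : vector of random-effect variances
#   var_obs    : observation-level variance (VarOdALT above)
r2_components <- function(var_fixed, var_random, var_obs) {
  total <- var_fixed + sum(var_random) + var_obs
  c(marginal    = var_fixed / total,
    conditional = (var_fixed + sum(var_random)) / total)
}

# Made-up numbers, not results from any fitted model
r2 <- r2_components(var_fixed = 0.8, var_random = c(0.3, 0.1), var_obs = 0.8)
r2
```

With these inputs the total variance is 2.0, so the marginal value is 0.4 and the conditional value is 0.6; the conditional value can never be smaller than the marginal one.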

Linear fit without slope in r

I want to fit a linear model with no slope and extract information from it. My objective is to find the best y-intercept for a horizontal line through a data set, and to use the fitted line to judge whether y shows a particular behavior (x is a date). I have been using the range to evaluate that behavior, but I am looking for a unitless index.
Removing y-intercept:
X <- 1:10
Y <- 2:11
lm1 <- lm(Y~X + 0, data = data.frame(X=X,Y=Y)) # y-intercept remove opt 1
lm1 <- lm(Y~X - 1, data = data.frame(X=X,Y=Y)) # y-intercept remove opt 2
lm1 <- lm(Y~0 + X, data = data.frame(X=X,Y=Y)) # y-intercept remove opt 3
lm1$coefficients
X
1.142857
summary(lm1)$r.squared
[1] 0.9957567
All the lm calls shown above give the same fit, with the R-squared printed above. But if I evaluate:
lm2 <- lm(Y~1, data = data.frame(X=X,Y=Y))
lm2$coefficients
(Intercept)
6.5
summary(lm2)$r.squared
[1] 0
Is there a way to calculate R-squared outside of the lm function, or to calculate an index that shows how well y is represented by a horizontal line?
Let lmObject be your linear model returned by lm (called with y = TRUE to return y).
If your model has intercept, then R-squared is computed as
with(lmObject, 1 - c(crossprod(residuals) / crossprod(y - mean(y))) )
If your model does not have an intercept, then R-squared is computed as
with(lmObject, 1 - c(crossprod(residuals) / crossprod(y)) )
Note, if your model is only an intercept (so it is certainly from the 1st case above), you have
residuals = y - mean(y)
thus R-squared is always 1 - 1 = 0.
In regression analysis, it is generally recommended to include an intercept in the model to get unbiased estimates. A model with only an intercept is the null model; any other model is compared with this null model in the analysis of variance.
A note. The value / quantity you want has nothing to do with regression. You can simply compute it as
c(crossprod(Y - mean(Y)) / crossprod(Y)) ## `Y` is your data
#[1] 0.1633663
Alternatively, use
(length(Y) - 1) * var(Y) / c(crossprod(Y))
#[1] 0.1633663
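The three formulas above can be checked against the question's data in a few lines. A sketch, where the object names fit_slope, fit_flat, r2_manual, r2_flat and index are illustrative:

```r
X <- 1:10
Y <- 2:11

# No-intercept model: R-squared is measured against sum(y^2)
fit_slope <- lm(Y ~ X + 0, y = TRUE)
r2_manual <- with(fit_slope, 1 - c(crossprod(residuals) / crossprod(y)))

# Intercept-only model: residuals are y - mean(y), so R-squared is 0
fit_flat <- lm(Y ~ 1, y = TRUE)
r2_flat  <- with(fit_flat, 1 - c(crossprod(residuals) / crossprod(y - mean(y))))

# Unitless index: share of sum(Y^2) that a horizontal line at mean(Y) leaves unexplained
index <- c(crossprod(Y - mean(Y)) / crossprod(Y))

c(r2_manual = r2_manual, r2_summary = summary(fit_slope)$r.squared)
c(r2_flat = r2_flat, index = index)
```

The index is 0 when Y is constant and grows towards 1 as Y varies more around its mean relative to its overall magnitude.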

Manual Perceptron example in R - are the results acceptable?

I am trying to get a perceptron algorithm for classification working but I think something is missing. This is the decision boundary achieved with logistic regression:
The red dots got into college, after performing better on tests 1 and 2.
This is the data, and this is the code for the logistic regression in R:
dat = read.csv("perceptron.txt", header=F)
colnames(dat) = c("test1","test2","y")
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=dat)
fit = glm(y ~ test1 + test2, family = "binomial", data = dat)
coefs = coef(fit)
(x = c(min(dat[,1])-2, max(dat[,1])+2))
(y = c((-1/coefs[3]) * (coefs[2] * x + coefs[1])))
lines(x, y)
The code for the "manual" implementation of the perceptron is as follows:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1,nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning data.frame into matrix to avoid mmult problems.
# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting a random vector of coefficients.
theta = theta/sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){ # We will recalculate 1,000 times
  for (j in 1:nrow(data)){ # Each time we go through each example.
    if(h[j] * data[j, 4] < 0){ # If the hypothesis disagrees with the sign of y,
      theta = theta + (sign(data[j,4]) * data[j, 1:3]) # We + or - the example from theta.
    }
    else
      theta = theta # Else we let it be.
  }
  h = theta %*% t(data[,1:3]) # Calculating h() after iteration.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
With this, I get the following coefficients:
bias test1 test2
9.131054 19.095881 20.736352
and an accuracy of 88%, consistent with the 89% obtained from the glm() logistic regression with mean(sign(predict(fit))==data[,4]). Logically, there is no way to classify all of the points linearly, as is obvious from the plot above. In fact, iterating only 10 times and plotting the accuracy, ~90% is reached after just 1 iteration:
Since this is in line with the training classification performance of logistic regression, it is likely that the code is not conceptually wrong.
QUESTION: Is it OK to get coefficients so different from those of the logistic regression:
(Intercept) test1 test2
1.718449 4.012903 3.743903
This is really more of a CrossValidated question than a StackOverflow question, but I'll go ahead and answer.
Yes, it's normal and expected to get very different coefficients because you can't directly compare the magnitude of the coefficients between these 2 techniques.
With the logit (logistic) model you're fitting a binomial likelihood with a logit link (the sigmoid is its inverse), so the coefficients are on the log-odds scale and are only meaningful in this context. You've also got an intercept term in the logit.
None of this is true for the perceptron model, which only uses the sign of the linear combination. The interpretation of the coefficients is thus totally different.
Now, that's not saying anything about which model is better. There aren't comparable performance metrics in your question that would allow us to determine that. To determine that you should do cross-validation or at least use a holdout sample.
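One way to see why the magnitudes are not comparable: a perceptron classifies by the sign of the linear combination, so any positive rescaling of theta gives the same labels, and only the direction of the vector carries information. A sketch using the two coefficient vectors quoted in the question; the small matrix X is made up for illustration:

```r
# Coefficients as reported in the question
theta_perc <- c(bias = 9.131054, test1 = 19.095881, test2 = 20.736352)
theta_glm  <- c(bias = 1.718449, test1 = 4.012903,  test2 = 3.743903)

# Compare directions rather than magnitudes
unit   <- function(v) v / sqrt(sum(v^2))
cosine <- sum(unit(theta_perc) * unit(theta_glm))
cosine   # close to 1 means the two boundaries point almost the same way

# Hypothetical scaled examples: bias column plus two features
X <- cbind(1, c(-1.2, 0.3, 0.8, -0.1), c(0.5, -0.9, 1.1, -0.4))

# Rescaling theta by a positive constant leaves every predicted sign unchanged
same_labels <- all(sign(X %*% theta_perc) == sign(X %*% (theta_perc / 100)))
same_labels
```

So the perceptron's coefficients are only determined up to a positive factor, whereas rescaling the logistic coefficients would change the fitted probabilities.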

Fitting non-linear Langmuir Isotherm in R

I want to fit isotherm models to the following data in R. The simplest isotherm model is the Langmuir model given here (the model is given at the bottom of the page). My MWE, given below, throws an error. I wonder if there is any R package for isotherm models.
X <- c(10, 30, 50, 70, 100, 125)
Y <- c(155, 250, 270, 330, 320, 323)
Data <- data.frame(X, Y)
LangIMfm2 <- nls(formula = Y ~ Q*b*X/(1+b*X), data = Data, start = list(Q = 1, b = 0.5), algorith = "port")
Error in nls(formula = Y ~ Q * b * X/(1 + b * X), data = Data, start = list(Q = 1, :
Convergence failure: singular convergence (7)
Edited
Some nonlinear models can be transformed to linear models. My understanding is that there may be a one-to-one relationship between the estimates of a nonlinear model and those of its linearized form, but that their corresponding standard errors are not related to each other. Is this assertion true? Are there any pitfalls in fitting nonlinear models by transforming to linearity?
I am not aware of such packages, and personally I don't think you need one, as the problem can be solved using base R.
nls is sensitive to the starting parameters, so you should begin with a good starting guess. You can easily evaluate Q because it corresponds to the asymptotic limit of the isotherm at x-->Inf, so it is reasonable to begin with Q=323 (which is the last value of Y in your sample data set).
Next, you could do plot(Data) and add a line with an isotherm that corresponds to your starting parameters Q and b and tweak b to come up with a reasonable guess.
The plot below shows your data set (points) and a probe isotherm with Q = 323 and b = 0.5, generated by with(Data,lines(X,323*0.5*X/(1+0.5*X),col='red')) (red line). It seemed a reasonable starting guess to me, and I gave it a try with nls:
LangIMfm2 <- nls(formula = Y ~ Q*b*X/(1+b*X), data = Data, start = list(Q = 300, b = 1), algorith = "port")
# Nonlinear regression model
# model: Y ~ Q * b * X/(1 + b * X)
# data: Data
# Q b
# 366.2778 0.0721
# residual sum-of-squares: 920.6
#
# Algorithm "port", convergence message: relative convergence (4)
and plotted predicted line to make sure that nls found the right solution:
lines(Data$X,predict(LangIMfm2),col='green')
Having said that, I would suggest using a more effective strategy, based on linearizing the model by rewriting the isotherm equation in reciprocal coordinates:
z <- 1/Data
plot(Y~X,z)
abline(lm(Y~X,z))
M <- lm(Y~X,z)
Q <- 1/coef(M)[1]
# 363.2488
b <- coef(M)[1]/coef(M)[2]
# 0.0741759
As you could see, both approaches produce essentially the same result, but the linear model is more robust and doesn't require starting parameters (and, as far as I remember, it is the standard way of the isotherm analysis in the experimental physical chemistry).
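The reciprocal-coordinates step follows from inverting the Langmuir equation. A short derivation, using the same Q and b as in the nls formula:

```latex
Y = \frac{Q\,b\,X}{1 + b\,X}
\quad\Longrightarrow\quad
\frac{1}{Y} = \frac{1 + b\,X}{Q\,b\,X}
            = \frac{1}{Q} + \frac{1}{Q\,b}\cdot\frac{1}{X}
```

Regressing 1/Y on 1/X therefore gives an intercept of 1/Q and a slope of 1/(Q b), so Q is the reciprocal of the intercept and b is the intercept divided by the slope, which is what the code above computes.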
You can use the SSmicmen self-starter function (see Ritz and Streibig, 2008, Nonlinear Regression with R), which ships with base R in the stats package and calculates initial parameters from the fit of the linearized form of the Michaelis-Menten (MM) equation. Fortunately, the MM equation has a form that can be adapted for the Langmuir equation, S = Smax*x/(KL + x). I've found the nlshelper and tidyverse packages useful for modeling and exporting the results of the nls command into tables and plots, particularly when modeling sample groups; nlme is only needed for nlsList further down. Here's my code for modeling a single set of sorption data:
library(tidyverse)
library(broom) # provides tidy(); not attached by library(tidyverse)
library(nlme) # provides nlsList()
library(nlshelper)
lang.fit <- nls(Y ~ SSmicmen(X,Smax,InvKL), data=Data)
fit.summary <- tidy(lang.fit)
fit.coefs <- coef(lang.fit)
For simplicity, the second parameter, InvKL, is the reciprocal of the Langmuir affinity constant (b = 1/InvKL in the notation of the question). Applying this code, I get the same parameter estimates as @Marat gives above.
The simple code below allows for wrangling the data in order to create a ggplot object, containing the original points and fitted line (i.e., geom_point would represent the original X and Y data, geom_line would represent the original X plus YHat).
FitY <- tibble(predict(lang.fit))
YHat <- FitY[,1]
Data2 <- cbind(Data, YHat)
If you want to model multiple groups of data (say, based on a "Sample_name" column), then the lang.fit variable would be calculated as below, this time using the nlsList command:
lang.fit <- nlsList(Y ~ SSmicmen(X,Smax,InvKL) | Sample_name, data=Data)
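The single-group fit needs nothing beyond base R, since SSmicmen and nls both live in the stats package. A minimal sketch on the question's data that converts the self-starter's parameters back to the Q and b of the original formula (the names est, Q and b are just for illustration):

```r
X <- c(10, 30, 50, 70, 100, 125)
Y <- c(155, 250, 270, 330, 320, 323)
Data <- data.frame(X, Y)

# SSmicmen(input, Vm, K) evaluates Vm * input / (K + input);
# being a self-starting model, it needs no start = list(...)
lang.fit <- nls(Y ~ SSmicmen(X, Smax, InvKL), data = Data)

est <- coef(lang.fit)
Q <- unname(est["Smax"])        # saturation capacity
b <- unname(1 / est["InvKL"])   # Langmuir constant is the reciprocal of K
c(Q = Q, b = b)
```

The values should match the Q and b reported for the direct nls fits elsewhere in this thread.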
The problem is the starting values. We show two approaches to this as well as an alternative that converges even using the starting values in the question.
1) plinear The right hand side is linear in Q*b so it would be better to absorb b into Q and then we have a parameter that enters linearly so it is easier to solve. Also with the plinear algorithm no starting values are needed for the linear parameter so only the starting value for b need be specified. With plinear the right hand side of the nls formula should be specified as the vector that multiplies the linear parameter. The result of running nls giving fm0 below will be coefficients named b and .lin where Q = .lin / b.
We already have our answer from fm0 but if we want a clean run in terms of b and Q rather than b and .lin we can run the original formula in the question using the starting values implied by the coefficients returned by fm0 as shown.
fm0 <- nls(Y ~ X/(1+b*X), Data, start = list(b = 0.5), alg = "plinear")
st <- with(as.list(coef(fm0)), list(b = b, Q = .lin/b))
fm <- nls(Y ~ Q*b*X/(1+b*X), Data, start = st)
fm
giving
Nonlinear regression model
model: Y ~ Q * b * X/(1 + b * X)
data: Data
b Q
0.0721 366.2778
residual sum-of-squares: 920.6
Number of iterations to convergence: 0
Achieved convergence tolerance: 9.611e-07
We can display the result. The points are the data and the red line is the fitted curve.
plot(Data)
lines(fitted(fm) ~ X, Data, col = "red")
(continued after plot)
2) mean Alternately, using a starting value of mean(Data$Y) for Q seems to work well.
nls(Y ~ Q*b*X/(1+b*X), Data, start = list(b = 0.5, Q = mean(Data$Y)))
giving:
Nonlinear regression model
model: Y ~ Q * b * X/(1 + b * X)
data: Data
b Q
0.0721 366.2779
residual sum-of-squares: 920.6
Number of iterations to convergence: 6
Achieved convergence tolerance: 5.818e-06
The question already had a reasonable starting value for b which we used but if one were needed one could set Y to Q*b so that they cancel and X to mean(Data$X) and solve for b to give b = 1 - 1/mean(Data$X) as a possible starting value. Although not shown using this starting value for b with mean(Data$Y) as the starting value for Q also resulted in convergence.
3) optim If we use optim the algorithm converges even with the initial values used in the question. We form the residual sum of squares and minimize that:
rss <- function(p) {
Q <- p[1]
b <- p[2]
with(Data, sum((Y - b*Q*X/(1+b*X))^2))
}
optim(c(1, 0.5), rss)
giving:
$par
[1] 366.27028219 0.07213613
$value
[1] 920.62
$counts
function gradient
249 NA
$convergence
[1] 0
$message
NULL
