I am fitting a Poisson GLM and want to predict y values given specific levels of the explanatory variables. My code is:
poisson.fit<-glm(y ~ age + gender, family= "poisson", data = data)
I want poisson.fit$y for a hypothetical observation of age = 50 and gender = "male". How do I produce this statistic?
Use the predict function.
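predict(poisson.fit, newdata = data.frame(age = 50, gender = "male"), type = "response")
The default is type = "link", which for a Poisson model returns the prediction on the log scale; type = "response" returns the expected count, which is what you are asking for. A third option is type = "terms". See ?predict.glm for complete options and documentation.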
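The difference between the link and response scales can be checked with a minimal sketch on simulated data. The data frame sim, its coefficients and the seed are made up for illustration and are not the asker's data; only glm() and predict() are used, as documented in ?predict.glm.

```r
set.seed(42)
n <- 200

# Hypothetical data: counts that depend on age and gender
sim <- data.frame(
  age    = round(runif(n, 20, 70)),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
sim$y <- rpois(n, lambda = exp(0.2 + 0.02 * sim$age + 0.3 * (sim$gender == "male")))

poisson.fit <- glm(y ~ age + gender, family = poisson, data = sim)
new_obs <- data.frame(age = 50, gender = "male")

# Linear predictor (log of the expected count)
eta <- predict(poisson.fit, newdata = new_obs, type = "link")
# Expected count on the original scale of y
mu <- predict(poisson.fit, newdata = new_obs, type = "response")

# With the log link, exponentiating the link-scale value gives the response
all.equal(exp(eta), mu)
```

So for a Poisson GLM with the default log link, type = "response" is simply exp() of the type = "link" prediction.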
With a categorical variable t and some covariate x, I run the random-effects multinomial logit:
model <- mblogit(t ~ x, data = sample, random = ~1 + x|group)
How can I then extract predicted group intercepts and group slopes from model?
You can find a toy dataset here with the necessary variables. The categorical variable t must be a factor, so the relevant command is:
model <- mblogit(as.factor(t) ~ x, data = toySample, random = ~1 + x|group)
##Here I used the 'discoveries' dataset from the faraway package.
library('faraway')
data("discoveries")
#made a variable 'year' from 1860 to 1959
year = 1860:1959
#here I fit a poisson regression model with discoveries as dependent and year as independent.
fit_pois = glm(discoveries ~ year, data = discoveries, family = poisson)
##The question is: what is the probability that there are 4 discoveries in the year 1960 (assuming the model is right and can predict the future)? I tried to do this with
pred_pr = predict.glm(fit_pois, data.frame(discoveries = 4, year = 1960, type = 'response'))
##However, when I predict on the data, it returns numbers that are not probabilities. Please help!
Let's start by replicating your model:
library('faraway')
data("discoveries")
year = 1860:1959
fit_pois <- glm(discoveries ~ year, data = discoveries, family = poisson)
Now if we use predict, our fit_pois model will tell us the predicted rate of discoveries for any given year(s). It will completely ignore any discoveries in the data frame passed to the newdata parameter of predict, because our model predicts discoveries based solely on the year variable.
Note also that in your example you have included type = "response" as a variable in the newdata data frame, rather than passing it as a parameter to predict. So the prediction line should look like this:
pred_pr = predict.glm(fit_pois, newdata = data.frame(year = 1960), type = 'response')
And the result we get is:
pred_pr
#> 1
#> 2.336768
Let's think about what this means. Since we are doing a Poisson regression, this number represents the expected value of discoveries in the year 1960. This means that we can estimate the probability of there being exactly 4 discoveries if we examine the Poisson distribution with an expected value (also known as lambda) of 2.336768. Let's use dpois to see the probabilities of getting 0 to 6 discoveries if lambda is 2.336768:
plot(0:6, dpois(0:6, pred_pr), type = "h")
So the probability of there being exactly 4 discoveries in 1960 is:
dpois(4, pred_pr)
#> [1] 0.1200621
i.e. almost exactly 12%
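As a cross-check, the same number follows from the Poisson probability mass function, P(Y = k) = exp(-lambda) * lambda^k / k!. This is only a sketch: lambda is copied by hand from the printed prediction above rather than taken from the fitted model, and the "4 or more" line is an extra illustration that was not part of the question.

```r
lambda <- 2.336768  # predicted rate for 1960, copied from the printed output
k <- 4

# Poisson pmf written out by hand
p_manual <- exp(-lambda) * lambda^k / factorial(k)

# The same probability from the built-in density function
p_dpois <- dpois(k, lambda)
all.equal(p_manual, p_dpois)

# Probability of 4 or more discoveries: upper tail P(Y > 3)
p_at_least_4 <- ppois(k - 1, lambda, lower.tail = FALSE)
```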
I want to use a logistic regression to actually perform regression and not classification.
My response variable is numeric between 0 and 1 and not categorical. This response variable is not related to any kind of binomial process. In particular, there is no "success", no "number of trials", etc. It is simply a real variable taking values between 0 and 1 depending on circumstances.
Here is a minimal example to illustrate what I want to achieve
dummy_data <- data.frame(a=1:10,
b=factor(letters[1:10]),
resp = runif(10))
fit <- glm(formula = resp ~ a + b,
family = "binomial",
data = dummy_data)
This code gives a warning then fails because I am trying to fit the "wrong kind" of data:
In eval(family$initialize) : non-integer #successes in a binomial glm!
Yet I think there must be a way since the help of family says:
For the binomial and quasibinomial families the response can be
specified in one of three ways: [...] (2) As a numerical vector with
values between 0 and 1, interpreted as the proportion of successful
cases (with the total number of cases given by the weights).
Somehow the same code works using "quasibinomial" as the family which makes me think there may be a way to make it work with a binomial glm.
I understand the likelihood is derived with the assumption that $y_i$ is in $\{0, 1\}$ but, looking at the maths, it seems like the log-likelihood still makes sense with $y_i$ in $[0, 1]$. Am I wrong?
This is because you are using the binomial family with the wrong kind of response. With family = binomial and no weights, the outcome is expected to be either 0 or 1, not a probability value.
This code works fine, because the response is either 0 or 1.
dummy_data <- data.frame(a=1:10,
b=factor(letters[1:10]),
resp = sample(c(0,1),10,replace=TRUE,prob=c(.5,.5)) )
fit <- glm(formula = resp ~ a + b,
family = binomial(),
data = dummy_data)
If you want to model the probability directly, you should include an additional column with the total number of cases. The probability you model is then interpreted as the success rate given the number of cases in the weights column.
dummy_data <- data.frame(a=1:10,
b=factor(letters[1:10]),
resp = runif(10),w=round(runif(10,1,11)))
fit <- glm(formula = resp ~ a + b,
family = binomial(),
data = dummy_data, weights = w)
You will still get the warning message here, because runif() produces proportions for which resp * w is not a whole number of successes. You can ignore it, given these conditions:
resp is the proportion of 1's in n trials.
For each value in resp, the corresponding value in w is the number of trials.
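When both conditions really hold, the proportion-plus-weights form is equivalent to giving glm() the success and failure counts directly. The sketch below uses simulated data; the data frame d, the column successes and the coefficients are invented for illustration. It compares the two specifications.

```r
set.seed(1)
n_groups <- 20

# Hypothetical grouped data: w trials per row, successes drawn from a binomial
d <- data.frame(a = 1:n_groups, w = sample(5:30, n_groups, replace = TRUE))
d$successes <- rbinom(n_groups, size = d$w, prob = plogis(-1 + 0.1 * d$a))
d$resp <- d$successes / d$w   # genuine proportion of successes in w trials

# Form 1: proportion as response, number of trials as weights
fit_prop <- glm(resp ~ a, family = binomial(), data = d, weights = w)

# Form 2: two-column matrix of successes and failures
fit_counts <- glm(cbind(successes, w - successes) ~ a, family = binomial(), data = d)

# Both forms describe the same likelihood, so the coefficients agree
all.equal(coef(fit_prop), coef(fit_counts))
```

Because resp * w is a whole number in every row here, the first fit should not raise the non-integer warning.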
From the discussion at "Warning: non-integer #successes in a binomial glm! (survey packages)", I think we can solve it with another family function, quasibinomial() (see ?quasibinomial).
dummy_data <- data.frame(a=1:10,
b=factor(letters[1:10]),
resp = runif(10),w=round(runif(10,1,11)))
fit2 <- glm(formula = resp ~ a + b,
family = quasibinomial(),
data = dummy_data, weights = w)
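One way to see what switching families changes is to fit the same fractional response with both. In this sketch (simulated data, hypothetical names d, fit_bin and fit_quasi, and a single numeric predictor so the model is not saturated), the point estimates should coincide and only the dispersion, and therefore the standard errors, differ.

```r
set.seed(2)

# Hypothetical continuous response strictly between 0 and 1
d <- data.frame(a = 1:50)
d$resp <- plogis(-1 + 0.05 * d$a + rnorm(50, sd = 0.5))

# binomial() warns about non-integer successes; silence it for the comparison
fit_bin <- suppressWarnings(glm(resp ~ a, family = binomial(), data = d))

# quasibinomial() fits the same mean model without the warning
fit_quasi <- glm(resp ~ a, family = quasibinomial(), data = d)

# Same estimating equations, so the coefficients are identical
all.equal(coef(fit_bin), coef(fit_quasi))

# binomial fixes the dispersion at 1; quasibinomial estimates it from the data
summary(fit_bin)$dispersion
summary(fit_quasi)$dispersion
```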
#Logistic Regression
glm.fit <- glm(recent_cannabis_use~.,data = drug_use_train, family = binomial)
summary(glm.fit)
predict(glm.fit, with(drug_use_train, data.frame(Gender = "Male")), type = "response")
Trying to find the predicted probability for recent_cannabis_use for a male.
You should use predict(glm.fit, newdata = data.frame(Gender = "Male"), type = "response"). Using with in this case is not warranted, since you are not accessing any of the variables in drug_use_train.
Note that this assumes your formula is, upon expansion, recent_cannabis_use ~ Gender. If you have other variables and you want to explore only the effect of Gender, you will need to set (pre-calculate or make up) all other variables to some fixed value (remember how coefficients are interpreted: the change in y for a one-unit change in x, provided everything else stays the same). See for example this post.
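A minimal sketch of fixing the other variables, on simulated data since drug_use_train is not available here. The data frame train, the Age covariate and the outcome use are all hypothetical stand-ins, and holding Age at its mean is just one possible choice of fixed value.

```r
set.seed(3)
n <- 300

# Hypothetical training data with one extra covariate besides Gender
train <- data.frame(
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  Age    = round(runif(n, 18, 65))
)
train$use <- rbinom(n, 1, plogis(0.5 + 0.8 * (train$Gender == "Male") - 0.04 * train$Age))

fit <- glm(use ~ Gender + Age, data = train, family = binomial)

# newdata must supply every predictor in the formula: Gender is set to "Male",
# Age is held at its sample mean
new_male <- data.frame(Gender = "Male", Age = mean(train$Age))
p_male <- predict(fit, newdata = new_male, type = "response")
p_male
```

The result is the predicted probability for a male of average age, not for males in general; a different fixed Age gives a different probability.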
I am creating a varying-coefficient GAMM using 'mgcv' in R with a continuous 'by' variable, set through the by argument of s(). However, I am having difficulty locating the parameter estimate of the effect of the 'by' variable. In this example we determine the spatially-dependent effect of temperature t on sole eggs (i.e. how the linear effect of temperature on sole eggs changes across space):
require(mgcv)
require(gamair)
data(sole)
b = gam(eggs ~ s(la,lo) + s(la,lo, by = t), data = sole)
We can then plot the predicted effects of s(la,lo, by = t) against the predictor t:
pred <- predict(b, type = "terms", se.fit = TRUE)
by.variable.prediction <- pred[[1]][,2]
plot(x= sole$t, y = by.variable.prediction)
However, I can't find a listing/function with the parameter estimates of the 'by' variable t for each sampling location. summary(), coef(), and predict() do not give you the parameter estimates.
Any help would be appreciated!
The coefficient for the variable t at a given location is the value of the term s(la,lo, by = t) when t is equal to 1, conditional on the latitude and longitude. So one way to get the coefficient/parameter estimate for t at each latitude and longitude is to construct your own data frame with a range of latitude/longitude combinations with t=1 and run predict.gam on that (rather than running predict.gam on the data used to fit the model, as you have done), keeping only the column for the by term. So:
preddf <- expand.grid(list(la=seq(min(sole$la), max(sole$la), length.out=100),
lo=seq(min(sole$lo), max(sole$lo), length.out=100),
t=1))
# type = "terms" returns each model term separately; the s(la,lo):t column at t = 1 is the local coefficient of t
# (type = "response" would also add the intercept and the s(la,lo) main effect)
preddf$parameter <- predict(b, preddf, type = "terms")[, "s(la,lo):t"]
And then if you want to visualize this coefficient over space, you could graph it with ggplot2.
library(ggplot2)
ggplot(preddf) +
geom_tile(aes(x=lo, y=la, fill=parameter))
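To check that evaluating the by term at t = 1 recovers a varying coefficient, here is a small one-dimensional sketch on simulated data where the true coefficient function is known. Everything in it is hypothetical: the data frame sim, the function beta_true and the single covariate x (standing in for location) are made up for illustration, and it is not the sole model.

```r
library(mgcv)
set.seed(1)
n <- 400

# Hypothetical data: the effect of t on y varies smoothly with x
x <- runif(n)
t <- rnorm(n)
beta_true <- function(x) 1 + sin(2 * pi * x)
sim <- data.frame(x = x, t = t, y = 0.5 + beta_true(x) * t + rnorm(n, sd = 0.3))

m <- gam(y ~ s(x) + s(x, by = t), data = sim)

# Evaluate the model terms on a grid with the by variable set to 1
grid <- data.frame(x = seq(0.05, 0.95, length.out = 50), t = 1)
terms_mat <- predict(m, newdata = grid, type = "terms")

# The by-term column at t = 1 is the estimated coefficient of t at each x
beta_hat <- terms_mat[, "s(x):t"]

# Compare the estimate with the known coefficient function
cor(beta_hat, beta_true(grid$x))
```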