Two different multiple Poisson GLM regressions, mean points and confidence intervals - r

I'd like to create a plot in ggplot2 that combines two fitted multiple Poisson GLM models, mean points, and 95% confidence intervals (CIs). But my mean point representation doesn't work.
#Artificial data set
Consumption <- c(501, 502, 503, 504, 26, 27, 55, 56, 68, 69, 72, 93)
Gender <- gl(n = 2, k = 6, length = 2*6, labels = c("Male", "Female"), ordered = FALSE)
Income <- c(5010, 5020, 5030, 5040, 260, 270, 550, 560, 680, 690, 720, 930)
df3 <- data.frame(Consumption, Gender, Income)
df3
# GLM Regression
fm1 <- glm(Consumption~Gender+Income, data=df3, family=poisson)
summary(fm1)
# ANOVA
anova(fm1,test="Chi")
#Genders are different, so I fitted one model for males and another for females
#Male model
df4<-df3[df3$Gender=="Male",]
fm2 <- glm(Consumption~Income, data=df4, family=poisson)
summary(fm2)
#Female model
df5<-df3[df3$Gender=="Female",]
fm3 <- glm(Consumption~Income, data=df5, family=poisson)
summary(fm3)
#Create predictions and confidence intervals
Predictions <- c(predict(fm2, type = "link", se.fit = TRUE),
                 predict(fm3, type = "link", se.fit = TRUE))
df3_combined <- cbind(df3, Predictions)
df3_combined$UCL<-df3_combined$fit + 1.96*df3_combined$se.fit
df3_combined$LCL<-df3_combined$fit - 1.96*df3_combined$se.fit
df3_combined<-df3_combined[,-(6:9)]
df3_combined<-as.data.frame(df3_combined)
#Create mean values to plot
library(dplyr)
df <- df3_combined %>%
  group_by(Income, Gender) %>%
  summarize(Consumption = mean(Consumption, na.rm = TRUE))
df<-as.data.frame(df)
#Plot
library(tidyverse)
library(ggplot2)
df3_combined %>%
  gather(type, value, Consumption) %>%
  ggplot(mapping = aes(x = Income, y = Consumption, color = Gender)) +
  geom_point(data = df, mapping = aes(x = Income, y = Consumption, color = Gender)) +
  geom_line(mapping = aes(x = Income, y = exp(fit))) +
  geom_smooth(mapping = aes(ymin = exp(LCL), ymax = exp(UCL)), stat = "identity")
I don't see the mean values created in the df object in my output plot, and I don't know why.
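A likely culprit (a hedged sketch, not a verified answer, reusing fm2, fm3, df4, df5, and the means in df from above): gather(type, value, Consumption) removes the Consumption column, so layers that inherit aes(y = Consumption) can no longer find it; and c(predict(fm2, ...), predict(fm3, ...)) concatenates two prediction lists whose length-6 fit vectors cbind() then recycles across all 12 rows, misaligning the female fits. Building the prediction columns per gender and dropping gather() avoids both problems; add_pred is a hypothetical helper, not part of the original code.
library(dplyr)
library(ggplot2)
# Hypothetical helper: attach link-scale fit and 95% CI columns to one gender's data
add_pred <- function(model, d) {
  p <- predict(model, type = "link", se.fit = TRUE)
  d$fit <- p$fit
  d$LCL <- p$fit - 1.96 * p$se.fit
  d$UCL <- p$fit + 1.96 * p$se.fit
  d
}
df3_combined <- rbind(add_pred(fm2, df4), add_pred(fm3, df5))
ggplot(df3_combined, aes(x = Income, y = Consumption, color = Gender)) +
  geom_point(data = df, aes(x = Income, y = Consumption, color = Gender)) +
  geom_line(aes(y = exp(fit))) +
  geom_ribbon(aes(ymin = exp(LCL), ymax = exp(UCL), fill = Gender),
              alpha = 0.2, colour = NA)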

Related

Complex coefficient plot in R

I am trying to create a coefficient plot for regression results from two survey experiments. Both experiments asked the same questions (the variables are identical) in two different countries. I am then running 2x2 models (two dependent variables by two countries), with only the treatment (an independent variable with 3 levels) and the control variables, which remain constant across models.
Here is sample code with robust standard errors:
library(estimatr)
set.seed(124)
dv1 <- sample(0:1, 25, replace = TRUE)
dv2 <- sample(0:5, 25, replace = TRUE)
treatment_lvl <- sample(0:3, 25, replace = TRUE)
treatment_lvl <- factor(treatment_lvl, labels = c("baseline", "treatment1", "treatment2", "treatment3"))
control1 <- sample(2:10, 25, replace = TRUE)
control2 <- sample(0:3, 25, replace = TRUE)
df_country1 <- data.frame(dv1, dv2, treatment_lvl, control1, control2)
set.seed(200)
dv1 <- sample(0:1, 25, replace = TRUE)
dv2 <- sample(0:5, 25, replace = TRUE)
treatment_lvl <- sample(0:3, 25, replace = TRUE)
treatment_lvl <- factor(treatment_lvl, labels = c("baseline", "treatment1", "treatment2", "treatment3"))
control1 <- sample(2:10, 25, replace = TRUE)
control2 <- sample(0:3, 25, replace = TRUE)
df_country2 <- data.frame(dv1, dv2, treatment_lvl, control1, control2)
model1 <- lm(dv1 ~ treatment_lvl + control1 + control2, data=df_country1)
model2 <- lm(dv1 ~ treatment_lvl + control1 + control2, data=df_country2)
model3 <- lm(dv2 ~ treatment_lvl + control1 + control2, data=df_country1)
model4 <- lm(dv2 ~ treatment_lvl + control1 + control2, data=df_country2)
model1_robust <- commarobust(model1, se_type = "stata")
model2_robust <- commarobust(model2, se_type = "stata")
model3_robust <- commarobust(model3, se_type = "stata")
model4_robust <- commarobust(model4, se_type = "stata")
I now want to visualize the regressions (model1-4) in the following way:
y-axis: Dependent variable 1, dependent variable 2
Three plots/facets next to each other: One per treatment level (1-3)
For each dependent variable and treatment level, one coefficient for country 1 and one for country 2, stacked above each other
Control variables should not be shown
Something similar to the picture below, but with two instead of three countries. The y-axis would then show the dependent variable of models 1+2 at the first tick and the DV of models 3+4 at the second tick; the other ticks would not be necessary. Instead of "Average treatment effect" it would read "Treatment 1", with the same next to it for "Treatment 2" and "Treatment 3".
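One way to get there (a sketch, assuming tidy() works on commarobust() output as it does for other estimatr model objects; the tidy_mod helper and the country/DV labels are hypothetical):
library(estimatr)
library(dplyr)
library(ggplot2)
# Hypothetical helper: tidy one model, keep only treatment terms, label it
tidy_mod <- function(mod, country, dv) {
  tidy(mod) %>%
    filter(grepl("^treatment_lvl", term)) %>%  # drop intercept and controls
    mutate(country = country, dv = dv)
}
coefs <- bind_rows(
  tidy_mod(model1_robust, "Country 1", "DV 1"),
  tidy_mod(model2_robust, "Country 2", "DV 1"),
  tidy_mod(model3_robust, "Country 1", "DV 2"),
  tidy_mod(model4_robust, "Country 2", "DV 2")
)
# One facet per treatment level; DVs on the y axis, countries dodged/stacked
ggplot(coefs, aes(x = estimate, y = dv, colour = country)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
                  position = position_dodge(width = 0.5)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  facet_wrap(~ term, nrow = 1)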

Identifying data points above fixed effects regression using data.table

I want to identify the data points above a regression line. I have a panel data set to which I have fitted a fixed effects model:
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
                   year = c(1,2,3,4,1,2,1,2,3),
                   progenyMean = c(90,78,92,69,86,73,82,85,91),
                   damMean = c(89,89,72,98,95,92,94,87,89))
ID, year, progenyMean, damMean
1, 1, 70, 69
1, 2, 68, 69
1, 3, 72, 72
1, 4, 69, 68
2, 1, 76, 75
2, 2, 73, 80
3, 1, 72, 74
3, 2, 75, 67
3, 3, 71, 69
# Fixed Effects Model in plm
library(plm)
fixed <- plm(progenyMean ~ damMean, data, model = "within", index = c("ID", "year"))
I have plotted progenyMean vs damMean with the fixed effects regression line:
I want to identify the ID's above this regression line.
I have computed the predicted values of the fixed effects model using the following code (based on code from this post):
fitted <- as.numeric(fixed$model[[1]] - fixed$residuals)
> fitted
[1] 71.24338 79.03766 74.86613 71.34263 70.83020 71.56797 72.17324 74.54755 71.16720 73.37487
[11] 70.58863 69.27203 71.05852 59.72911 63.43947 68.69871 67.25271 75.68397 76.30475 81.12128
Is it possible to identify the ID's above the fixed effects regression line using the predicted values above and data.table in R?
Use the residuals() function: a positive residual means the point is above the line, a negative one below.
library(plm)
library(tidyverse)
library(ggplot2)
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
                   year = c(1,2,3,4,1,2,1,2,3),
                   progenyMean = c(90,78,92,69,86,73,82,85,91),
                   damMean = c(89,89,72,98,95,92,94,87,89))
fixed <- plm(progenyMean ~ damMean, data, model= "within", index = c("ID","year"))
residuals(fixed)
data %>% ggplot(aes(damMean, progenyMean)) +
  geom_point(data = data %>% filter(residuals(fixed) > 0), col = "red") +
  geom_point(data = data %>% filter(residuals(fixed) < 0), col = "blue")
data %>%
  mutate(test = ifelse(residuals(fixed) > 0, "up", "down") %>% factor()) %>%
  group_by(test) %>%
  summarise(n = n())
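Since the question asked for data.table, the same flag can be added there directly (a sketch; like the answer above, it assumes the rows of data are ordered by the ID/year index so residuals(fixed) aligns with them):
library(data.table)
setDT(data)
# Positive residual = observation above the fitted line
data[, above := as.vector(residuals(fixed)) > 0]
# IDs with at least one observation above the line
data[above == TRUE, unique(ID)]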

R predict() asking for variable excluded in lm() regression model

I intend to fit a regression on two "x" variables, excluding others present in the data frame.
As an example:
df <- data.frame(name = c("Paul", "Charles", "Edward", "Iam"),
                 age = c(18, 20, 25, 30),
                 income = c(1000, 2000, 2500, 3000),
                 workhours = c(35, 40, 45, 40))
regression <- lm(income ~ . -name, data = df)
I face a problem when I try to use the predict function. It demands information about the "name" variable:
predict(object = regression,
        data.frame(age = 22, workhours = 36))
It gives the following error message:
Error in eval(predvars, data, env) : object 'name' not found
I've solved this problem by excluding the "name" variable from the lm() function:
regression2 <- lm(income ~ . , data = df[, -1])
predict(object = regression2,
        data.frame(age = 22, workhours = 36))
Since I have many variables I intend to exclude from the regression, is there a way to solve this inside the predict() function?
We may use update: refitting with . ~ . replaces the formula with its expanded form (formula(regression) returns income ~ age + workhours), so the refitted model no longer references name.
> regression <- update(regression, . ~ .)
> predict(object = regression,
+ data.frame(age = 22, workhours = 36))
1
1714.859
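An alternative sketch (not from the original answer): build the formula from the columns you want to keep, so neither the fit nor predict() ever sees the excluded variables. The keep name is hypothetical.
# Construct income ~ age + workhours from the kept column names
keep <- setdiff(names(df), c("income", "name"))
regression3 <- lm(reformulate(keep, response = "income"), data = df)
predict(regression3, data.frame(age = 22, workhours = 36))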

Grouping factors in a pooled 2-sample t-test

I have a two-column table of 7 men's and 11 women's weights (saved as weights_gender.csv) and aim to perform a pooled t-test. I have read the CSV file in with weight = read.csv("weights_gender.csv"), but whenever I try to run t.test(weight$men~weight$women, var.equal=TRUE), it keeps printing this message:
grouping factor must have exactly 2 levels.
What is the issue?
Try ...
t.test(x = weight$men, y = weight$women, var.equal = TRUE)
The way you specified the command, t.test thought you wanted men's weight grouped by women, which of course is not what you want.
Results...
Two Sample t-test
data: weight$men and weight$women
t = 5.9957, df = 16, p-value = 1.867e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
15.26250 31.95828
sample estimates:
mean of x mean of y
77.42857 53.81818
Data
weight <- data.frame(
  men = c(88, 90, 78, 75, 70, 72, 69, NA, NA, NA, NA),
  women = c(45, 57, 54, 62, 60, 59, 44, 43, 67, 50, 51)
)
Your question is a bit "theoretical", so I'll make it more concrete.
Here I make two data frames with men's and women's weights and label them.
library(tibble)
df_m <- tibble(weight = 170 + 30*rnorm(7), sex = "Male")
df_f <- tibble(weight = 130 + 30*rnorm(11), sex = "Female")
Next we combine the data and set sex to be a factor variable
df_all <- rbind(df_m, df_f)
df_all$sex <- as.factor(df_all$sex)  # rbind keeps sex as character; convert to factor
Finally we apply the t-test.
t.test(weight ~ sex, data = df_all, var.equal = TRUE)
My result was
Two Sample t-test
data: weight by sex
t = -5.2104, df = 16, p-value = 8.583e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-89.84278 -37.87810
sample estimates:
mean in group Female mean in group Male
120.2316 184.0921
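To use the same formula interface on the wide weight data frame from the first answer, reshape it to long format first (a sketch; values_drop_na removes the NA padding in the men column):
library(tidyr)
# Wide (men/women columns) to long (sex/wt columns)
weight_long <- pivot_longer(weight, c(men, women),
                            names_to = "sex", values_to = "wt",
                            values_drop_na = TRUE)
t.test(wt ~ sex, data = weight_long, var.equal = TRUE)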

How to extend logistic regression plot?

I have created a logistic model in R; the issue is that my max x value is 0.85, so the plot stops at this value.
Is there a way I can extend the plot to x=100, with the y values calculated from my logistic model?
library(caret)
library(mlbench)
library(ggplot2)
library(tidyr)
library(caTools)
my_data2 <- read.csv('C:/Users/Magician/Desktop/R files/Fnaticfirstround.csv', header=TRUE, stringsAsFactors = FALSE)
my_data2
#converting Map names to the calculated win probability
my_data2[my_data2$Map == "Dust2", "Map"] <- 0.307692
my_data2[my_data2$Map == "Inferno", "Map"] <- 0.47619
my_data2[my_data2$Map == "Mirage", "Map"] <- 0.708333
my_data2[my_data2$Map == "Nuke", "Map"] <- 0.444444
my_data2[my_data2$Map == "Overpass", "Map"] <- 0.333333
my_data2[my_data2$Map == "Train", "Map"] <- 0.692308
my_data2[my_data2$Map == "Vertigo", "Map"] <- 0
my_data2[my_data2$Map == "Cache", "Map"] <- 0.857143
#converting W and L to 1 and 0
my_data2$WinorLoss <- ifelse(my_data2$WinorLoss == "W", 1,0)
my_data2$WinorLoss <- factor(my_data2$WinorLoss, levels = c(0,1))
#converting Map to numeric characters
my_data2$Map <- as.numeric(my_data2$Map)
#Logistic regression model
glm.fit <- glm(WinorLoss ~ Map, family=binomial, data=my_data2)
summary(glm.fit)
#make predictions on the training data
glm.probs <- predict(glm.fit, type="response")
glm.pred <- ifelse(glm.probs>0.5, 1, 0)
attach(my_data2)
table(glm.pred,WinorLoss)
mean(glm.pred==WinorLoss)
#splitting the data for trying and testing
Split <- sample.split(my_data2, SplitRatio = 0.7)
traindata <- subset(my_data2, Split == "TRUE")
testdata <- subset(my_data2, Split == "FALSE")
glm.fit <- glm(WinorLoss ~ Map,
               data = traindata,
               family = "binomial")
glm.probs <- predict(glm.fit,
                     newdata = testdata,
                     type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "1", "0")
table(glm.pred, testdata$WinorLoss)
mean(glm.pred == testdata$WinorLoss)
summary(glm.fit)
#changing the x axis to 0-100%, min map win prob - max map win prob
newdat <- data.frame(Map = seq(min(traindata$Map), max(traindata$Map), len=100))
newdat$WinorLoss = predict(glm.fit, newdata=newdat, type="response")
p <- ggplot(newdat, aes(x = Map, y = WinorLoss)) +
  geom_point() +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE) +
  xlim(0, 1) +
  ylim(0, 1)
I have tried extending the x limit to 100, but that only extended the axis; it did not calculate the corresponding y values and plot them.
I cannot reproduce your data, so I will show how to do it using the "challenger disaster" example (see this LINK), with confidence interval ribbons.
You should create artificial points, predict on them with the fitted model, and include them when plotting.
Next time, try to use reprex or provide a minimal reproducible example.
Preparing data and model fitting:
library(dplyr)
fails <- c(2, 0, 0, 1, 0, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0)
temp <- c(53, 66, 68, 70, 75, 78, 57, 67, 69, 70, 75, 79, 58, 67, 70, 72, 76, 80, 63, 67, 70, 73, 76)
challenger <- tibble::tibble(fails, temp)
orings <- 6
challenger <- challenger %>%
  dplyr::mutate(resp = fails/orings)
model_fit <- glm(resp ~ temp,
                 data = challenger,
                 weights = rep(orings, nrow(challenger)),
                 family = binomial(link = "logit"))
##### ------- this is what you need: -------------------------------------------
# setting limits for x axis
x_limits <- challenger %>%
  dplyr::summarise(min = 0, max = max(temp) + 10)
# creating artificial obs for curve smoothing -- several points between the limits
x <- seq(x_limits[[1]], x_limits[[2]], by=0.5)
# artificial points prediction
# see: https://stackoverflow.com/questions/26694931/how-to-plot-logit-and-probit-in-ggplot2
temp.data <- data.frame(temp = x)  # column name must match the model variable name
# Predict the fitted values given the model and hypothetical data
predicted.data <- as.data.frame(
  predict(model_fit,
          newdata = temp.data,
          type = "link", se = TRUE)
)
# Combine the hypothetical data and predicted values
new.data <- cbind(temp.data, predicted.data)
##### --------------------------------------------------------------------------
# Compute confidence intervals
std <- qnorm(0.95 / 2 + 0.5)  # qnorm(0.975) ~ 1.96, the two-sided 95% z value
new.data$ymin <- model_fit$family$linkinv(new.data$fit - std * new.data$se)
new.data$ymax <- model_fit$family$linkinv(new.data$fit + std * new.data$se)
new.data$fit <- model_fit$family$linkinv(new.data$fit) # Rescale to 0-1
Plotting:
library(ggplot2)
plotly_palette <- c('#1F77B4', '#FF7F0E', '#2CA02C', '#D62728')
p <- ggplot(challenger, aes(x = temp, y = resp)) +
  geom_point(colour = plotly_palette[1]) +
  geom_ribbon(data = new.data,
              aes(y = fit, ymin = ymin, ymax = ymax),
              alpha = 0.5,
              fill = '#FFF0F5') +
  geom_line(data = new.data, aes(y = fit), colour = plotly_palette[2]) +
  labs(x = "Temperature", y = "Estimated Fail Probability") +
  ggtitle("Predicted Probabilities for fail/orings with 95% Confidence Interval") +
  theme_bw() +
  theme(panel.border = element_blank(), plot.title = element_text(hjust = 0.5))
p
# if you want something fancier:
# library(plotly)
# ggplotly(p)
Result:
Interesting Fact About the Challenger Data:
NASA engineers used linear regression to estimate the likelihood of O-ring failure. If they had used a technique more appropriate for their data, such as logistic regression, they would have noticed that the probability of failure at low temperatures (such as the ~36F at launch time) was extremely high. The plot shows that at ~36F (a temperature we extrapolate from the observed ones) the estimated failure probability is ~0.75. If we consider the confidence interval ... well, the accident was pretty much a certainty.
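Mapping the recipe back to the original question (hypothetical, since the CSV is not available; glm.fit and the 0-1 range of Map come from the question's code): predict on the link scale over the full range, then back-transform with plogis(), the inverse logit.
library(ggplot2)
# Hypothetical: the same recipe applied to the original Map model
newdat <- data.frame(Map = seq(0, 1, by = 0.01))
pred <- predict(glm.fit, newdata = newdat, type = "link", se.fit = TRUE)
newdat$fit <- plogis(pred$fit)
newdat$ymin <- plogis(pred$fit - 1.96 * pred$se.fit)
newdat$ymax <- plogis(pred$fit + 1.96 * pred$se.fit)
ggplot(newdat, aes(Map, fit)) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.3) +
  geom_line() +
  labs(y = "Predicted win probability")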
