Identifying data points above fixed effects regression using data.table - r

I want to identify the data points above a regression line. I have a panel data set to which I have fit a fixed effects model:
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
year = c(1,2,3,4,1,2,1,2,3),
progenyMean = c(90,78,92,69,86,73,82,85,91),
damMean = c(89,89,72,98,95,92,94,87,89))
ID, year, progenyMean, damMean
1, 1, 70, 69
1, 2, 68, 69
1, 3, 72, 72
1, 4, 69, 68
2, 1, 76, 75
2, 2, 73, 80
3, 1, 72, 74
3, 2, 75, 67
3, 3, 71, 69
# Fixed Effects Model in plm
fixed <- plm(progenyMean ~ damMean, data, model= "within", index = c("ID","year"))
I have plotted progenyMean vs damMean with the fixed effects regression line:
I want to identify the IDs above this regression line.
I have computed the predicted values of the fixed effects model using the following code (based on code from this post):
fitted <- as.numeric(fixed$model[[1]] - fixed$residuals)
> fitted
[1] 71.24338 79.03766 74.86613 71.34263 70.83020 71.56797 72.17324 74.54755 71.16720 73.37487
[11] 70.58863 69.27203 71.05852 59.72911 63.43947 68.69871 67.25271 75.68397 76.30475 81.12128
Is it possible to identify the IDs above the fixed effects regression line using the predicted values above and data.table in R?

Use the residuals() function. A positive residual means the point lies above the line; a negative residual means it lies below.
library(plm)
library(tidyverse)
library(ggplot2)
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
year = c(1,2,3,4,1,2,1,2,3),
progenyMean = c(90,78,92,69,86,73,82,85,91),
damMean = c(89,89,72,98,95,92,94,87,89))
fixed <- plm(progenyMean ~ damMean, data, model= "within", index = c("ID","year"))
residuals(fixed)
data %>% ggplot(aes(damMean, progenyMean)) +
geom_point(data=data %>% filter(residuals(fixed)>0), col="red")+
geom_point(data=data %>% filter(residuals(fixed)<0), col="blue")
data %>% mutate(
test = ifelse(residuals(fixed)>0, "up", "down") %>% factor()
) %>% group_by(test) %>% summarise(
n = n()
)
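The same residual-sign idea can be sketched without plm or the tidyverse. This is only an illustration, not the answer's method: it swaps in a least-squares dummy variable (LSDV) fit with lm(), which gives the same damMean slope and residuals as the within estimator. The names lsdv, resid, position and ids_above are made up for the example. Note that each residual is measured against the fitted line of that row's own ID.

```r
# Sample panel from the question
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
                   year = c(1,2,3,4,1,2,1,2,3),
                   progenyMean = c(90,78,92,69,86,73,82,85,91),
                   damMean = c(89,89,72,98,95,92,94,87,89))

# LSDV fit: one intercept per ID plus a common slope on damMean
# (assumption: used here as a base-R stand-in for plm's "within" model)
lsdv <- lm(progenyMean ~ damMean + factor(ID), data = data)

# Positive residual = observation lies above its ID-specific fitted line
data$resid <- residuals(lsdv)
data$position <- ifelse(data$resid > 0, "above", "below")

# IDs with at least one observation above the line
ids_above <- unique(data$ID[data$resid > 0])

# Number of points above/below the line for each ID
table(data$ID, data$position)
```

With data.table, the same filter would be written inside the brackets, e.g. keeping rows where resid > 0 and then taking unique(ID).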

Related

How to show the coefficient values and variable importance for logistic regression in R using caret package train() and varImp()

We're performing an exploratory logistic regression and trying to determine the importance of the variables in predicting the outcome. We are using the train() and varImp() functions from the caret package. Ultimately, we would like to create a table/dataframe output that has 3 columns: Variable Name, Importance, and Coefficient. An output like this:
Desired format of output.
Here's some sample code to illustrate:
library(caret)
# Create a sample dataframe
my_DV <- c(0, 1, 0, 1, 1)
IV1 <- c(10, 40, 15, 35, 38)
IV2 <- c(1, 0, 1, 0, 1)
IV3 <- c(5, 4, 3, 2, 1)
IV4 <- c(5, 7, 3, 8, 9)
IV5 <- c(1, 2, 1, 2, 1)
df <- data.frame(my_DV, IV1, IV2, IV3, IV4, IV5)
df$my_DV <- as.factor(df$my_DV)
df$IV1 <- as.numeric(df$IV1)
df$IV2 <- as.factor(df$IV2)
df$IV3 <- as.numeric(df$IV3)
df$IV4 <- as.numeric(df$IV4)
df$IV5 <- as.factor(df$IV5)
# train model/perform logistic regression
model_one <- train(form = my_DV ~ ., data = df, trControl = trainControl(method = "cv", number = 5),
method = "glm", family = "binomial", na.action=na.omit)
summary(model_one)
# get the variable importance
imp <- varImp(model_one)
imp
I would like to take the importance values in imp and merge them with the coefficients from model_one but I'm fairly new to R and I can't figure out how to do it.
Any suggestions are greatly appreciated!
Here is one of many ways to get the desired output:
Assign the summary of the model to an object, extract the coefficients from it with coef(), and bind them together with the variable names and the corresponding importance values into a data frame. Then sort the rows by importance using order().
sum_mod <- summary(model_one)
dat <- data.frame(VariableName = rownames(imp$importance),
Importance = unname(imp$importance),
Coefficient = coef(sum_mod)[rownames(imp$importance),][,1],
row.names = NULL)
dat <- dat[order(dat$Importance, decreasing = TRUE),]
The result:
VariableName Importance Coefficient
1 IV1 100.00000 1.0999732
4 IV4 74.48458 3.6665775
2 IV21 34.43803 -7.8831404
3 IV3 0.00000 -0.9166444
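For readers without caret installed, the shape of that table can be sketched in base R. Everything here is an assumption for illustration: the data is mtcars (not the question's data), and the importance measure is taken to be the absolute z statistic rescaled to 0-100, which is my understanding of what varImp() reports for a glm but should be checked against the caret documentation.

```r
# Illustrative logistic regression on a dataset that ships with R
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Coefficient matrix: Estimate, Std. Error, z value, Pr(>|z|)
coefs <- coef(summary(fit))
coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]

# Assumed importance measure: |z|, rescaled so the range is 0 to 100
abs_z <- abs(coefs[, "z value"])
importance <- 100 * (abs_z - min(abs_z)) / (max(abs_z) - min(abs_z))

# One row per model term, sorted by importance
dat <- data.frame(VariableName = rownames(coefs),
                  Importance   = unname(importance),
                  Coefficient  = unname(coefs[, "Estimate"]))
dat <- dat[order(dat$Importance, decreasing = TRUE), ]
dat
```

Indexing the coefficient matrix by row name, as the answer does, is what keeps each importance value aligned with the right coefficient.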

Creating data table of points above/below abline in ggplot2

Is it possible to identify the data points above a geom_abline in ggplot, and to create a new data table separating these data points using data.table?
I have a panel dataset with 150 unique IDs, and have fit a fixed effects model using plm(). Here is a sample of the dataset:
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
year = c(1,2,3,4,1,2,1,2,3),
progenyMean = c(90,78,92,69,86,73,82,85,91),
damMean = c(89,89,72,98,95,92,94,87,89))
ID, year, progenyMean, damMean
1, 1, 70, 69
1, 2, 68, 69
1, 3, 72, 72
1, 4, 69, 68
2, 1, 76, 75
2, 2, 73, 80
3, 1, 72, 74
3, 2, 75, 67
3, 3, 71, 69
# Fixed Effects Model in plm
fixed <- plm(progenyMean ~ damMean, data, model= "within", index = c("ID","year"))
I have plotted the response progenyMean vs damMean using the following code:
plotFunction <- function(aggData, year){
ggplot(aggData, aes(x=damMeanCentered, y=progenyMean3Y)) +
geom_point() +
geom_abline(slope=fixed$coefficients, intercept=71.09, colour='dodgerblue1', size=1)
# The intercept 71.09 was calculated using the mean of fixef(fixed)
}
plotFunction(data, '(2005 - 2012)')
Is it possible to identify the points above/below the geom_abline in ggplot and create a new data table separating these data points using data.table?
It is not clear where the intercept came from, but the trick is to add a column of predicted values to your dataset using the regression model (in your case fixed), and then keep the rows whose actual values are higher than the prediction.
library(dplyr)
data %>%
mutate(predict = predict(fixed, newdata = data)) %>%
filter(progenyMean > predict)
First make the predictions (data must be a data.table for this syntax, hence setDT())
library(data.table)
setDT(data)
data[, newpredict := predict(fixed, newdata = data)]
It's not clear what you want the new data.table to look like, but you'd get the values above the predictions by doing
data[progenyMean > newpredict]
For the values below, just change the > to <.
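If the line really is a plain abline with a known slope and intercept, the comparison can also be made directly against intercept + slope * x. The sketch below uses base R only; the slope and intercept values are hypothetical placeholders (not estimates from the model), and line_value, side, above and below are names invented for the example.

```r
# Sample panel from the question
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
                   year = c(1,2,3,4,1,2,1,2,3),
                   progenyMean = c(90,78,92,69,86,73,82,85,91),
                   damMean = c(89,89,72,98,95,92,94,87,89))

# Hypothetical line parameters, for illustration only
slope <- -0.5
intercept <- 125

# Height of the line at each observation's x value
data$line_value <- intercept + slope * data$damMean

# Label each point and split into two tables
data$side <- ifelse(data$progenyMean > data$line_value, "above", "below")
parts <- split(data, data$side)
above <- parts[["above"]]
below <- parts[["below"]]
```

In data.table the same filter would be a single bracket expression on progenyMean > intercept + slope * damMean, after converting with setDT().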

How to extend logistic regression plot?

I have created a logistic model in R. The issue is that my maximum x value is 0.85, so the plot stops at this value.
Is there a way I can extend the plot to x=100, with the y values calculated using my logistic model?
library(caret)
library(mlbench)
library(ggplot2)
library(tidyr)
library(caTools)
my_data2 <- read.csv('C:/Users/Magician/Desktop/R files/Fnaticfirstround.csv', header=TRUE, stringsAsFactors = FALSE)
my_data2
#converting Map names to the calculated win probability
my_data2[my_data2$Map == "Dust2", "Map"] <- 0.307692
my_data2[my_data2$Map == "Inferno", "Map"] <- 0.47619
my_data2[my_data2$Map == "Mirage", "Map"] <- 0.708333
my_data2[my_data2$Map == "Nuke", "Map"] <- 0.444444
my_data2[my_data2$Map == "Overpass", "Map"] <- 0.333333
my_data2[my_data2$Map == "Train", "Map"] <- 0.692308
my_data2[my_data2$Map == "Vertigo", "Map"] <- 0
my_data2[my_data2$Map == "Cache", "Map"] <- 0.857143
#converting W and L to 1 and 0
my_data2$WinorLoss <- ifelse(my_data2$WinorLoss == "W", 1,0)
my_data2$WinorLoss <- factor(my_data2$WinorLoss, levels = c(0,1))
#converting Map to numeric characters
my_data2$Map <- as.numeric(my_data2$Map)
#Logistic regression model
glm.fit <- glm(WinorLoss ~ Map, family=binomial, data=my_data2)
summary(glm.fit)
#make predictions on the training data
glm.probs <- predict(glm.fit, type="response")
glm.pred <- ifelse(glm.probs>0.5, 1, 0)
attach(my_data2)
table(glm.pred,WinorLoss)
mean(glm.pred==WinorLoss)
#splitting the data for training and testing
#(sample.split expects the outcome vector, not the whole data frame)
Split <- sample.split(my_data2$WinorLoss, SplitRatio = 0.7)
traindata <- subset(my_data2, Split == TRUE)
testdata <- subset(my_data2, Split == FALSE)
glm.fit <- glm(WinorLoss ~ Map,
data=traindata,
family="binomial")
glm.probs <- predict(glm.fit,
newdata=testdata,
type="response")
glm.pred <- ifelse(glm.probs > 0.5, "1", "0")
table(glm.pred, testdata$WinorLoss)
mean(glm.pred == testdata$WinorLoss)
summary(glm.fit)
#changing the x axis to 0-100%, min map win prob - max map win prob
newdat <- data.frame(Map = seq(min(traindata$Map), max(traindata$Map), len=100))
newdat$WinorLoss = predict(glm.fit, newdata=newdat, type="response")
p <- ggplot(newdat, aes(x=Map,y=WinorLoss))+
geom_point() +
geom_smooth(method = "glm",
method.args = list(family="binomial"),
se = FALSE) +
xlim(0,1) +
ylim(0,1)
I have tried extending the x value to 100, but that only extended the axis; it did not calculate the corresponding y values, so they are not plotted.
I cannot reproduce your data, so I will show how to do it using the "challenger disaster" example (see this LINK), with confidence interval ribbons.
You should create artificial x values covering the range you want to plot and predict on them before plotting.
Next time, try to use reprex or provide a minimal reproducible example.
Preparing data and model fitting:
library(dplyr)
fails <- c(2, 0, 0, 1, 0, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0)
temp <- c(53, 66, 68, 70, 75, 78, 57, 67, 69, 70, 75, 79, 58, 67, 70, 72, 76, 80, 63, 67, 70, 73, 76)
challenger <- tibble::tibble(fails, temp)
orings = 6
challenger <- challenger %>%
dplyr::mutate(resp = fails/orings)
model_fit <- glm(resp ~ temp,
data = challenger,
weights = rep(6, nrow(challenger)),
family=binomial(link="logit"))
##### ------- this is what you need: -------------------------------------------
# setting limits for x axis
x_limits <- challenger %>%
dplyr::summarise(min = 0, max = max(temp)+10)
# creating artificial obs for curve smoothing -- several points between the limits
x <- seq(x_limits[[1]], x_limits[[2]], by=0.5)
# artificial points prediction
# see: https://stackoverflow.com/questions/26694931/how-to-plot-logit-and-probit-in-ggplot2
temp.data = data.frame(temp = x) #column name must be equal to the variable name
# Predict the fitted values given the model and hypothetical data
predicted.data <- as.data.frame(
predict(model_fit,
newdata = temp.data,
type="link", se=TRUE)
)
# Combine the hypothetical data and predicted values
new.data <- cbind(temp.data, predicted.data)
##### --------------------------------------------------------------------------
# Compute confidence intervals
std <- qnorm(0.95 / 2 + 0.5)
new.data$ymin <- model_fit$family$linkinv(new.data$fit - std * new.data$se.fit)
new.data$ymax <- model_fit$family$linkinv(new.data$fit + std * new.data$se.fit)
new.data$fit <- model_fit$family$linkinv(new.data$fit) # Rescale to 0-1
Plotting:
library(ggplot2)
plotly_palette <- c('#1F77B4', '#FF7F0E', '#2CA02C', '#D62728')
p <- ggplot(challenger, aes(x=temp, y=resp))+
geom_point(colour = plotly_palette[1])+
geom_ribbon(data=new.data,
aes(y=fit, ymin=ymin, ymax=ymax),
alpha = 0.5,
fill = '#FFF0F5')+
geom_line(data=new.data, aes(y=fit), colour = plotly_palette[2]) +
labs(x="Temperature", y="Estimated Fail Probability")+
ggtitle("Predicted Probabilities for fail/orings with 95% Confidence Interval")+
theme_bw()+
theme(panel.border = element_blank(), plot.title = element_text(hjust=0.5))
p
# if you want something fancier:
# library(plotly)
# ggplotly(p)
Result:
Interesting Fact About the Challenger Data:
NASA Engineers used linear regression to estimate the likelihood of O-ring failure. If they had used a more appropriate technique for their data, such as logistic regression, they would have noticed that the probability of failure at lower temperatures (such as ~ 36F at launch time) was extremely high. The plot shows us that for ~36F (a temperature which we extrapolate from the observed ones), we have a probability of ~0.75. If we consider the confidence interval ... well, the accident was pretty much a certainty.
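Applied to the question's situation (a predictor observed only up to about 0.85), the grid-then-predict step can be sketched in base R. The data below is simulated as a stand-in because the original CSV is not available; x, y, fit and grid are names chosen for the example, and the true coefficients (-2, 4) are arbitrary.

```r
set.seed(1)

# Simulated stand-in: predictor observed only on [0, 0.85]
x <- runif(200, 0, 0.85)
y <- rbinom(200, 1, plogis(-2 + 4 * x))
fit <- glm(y ~ x, family = binomial)

# Grid over the full range wanted on the plot, not just the observed range
grid <- data.frame(x = seq(0, 1, length.out = 101))

# Predicted probabilities from the fitted model at every grid point
grid$prob <- predict(fit, newdata = grid, type = "response")

# Last rows show the extrapolated part beyond the observed maximum
tail(grid)
```

The grid data frame can then be passed to geom_line() as its own data argument, as the answer does with new.data. Values beyond the observed range are extrapolations and should be read with that in mind.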

R - how to get coefficients for each column ~ timeline from a "spread" matrix?

I want to collect the linear regression coefficients for each column ~ ind.
Here is my data:
temp <- data.frame(
ind = c(1:10),
`9891` = runif(10, 15, 75),
`7891` = runif(10, 15, 75),
`5891` = runif(10, 15, 75)
)
I tried:
result = data.frame()
cols <- colnames(temp)[-1]
for (code in cols) {
fit <- lm(temp[, code] ~ temp$ind)
coef <- coef(fit)['ind']
result$ind <- code
result$coef <- coef
}
But this doesn't work.
Can anyone fix my method, or provide a better solution?
Also, I was wondering if lapply() and summarise_at() can do the work.
Thank you!
Here is a summarise_at option
library(dplyr)
library(tidyr)
temp %>%
summarise_at(vars(-contains("ind")), list(coef = ~list(lm(. ~ ind)$coef))) %>%
unnest(cols = everything())
# X9891_coef X7891_coef X5891_coef
#1 25.927946 52.5668120 35.152330
#2 2.459137 0.3158741 1.013678
The first row gives the intercepts and the second row the slope coefficients.
Or to extract only the slope coefficient and store the result in a long data.frame
temp %>%
summarise_at(vars(-contains("ind")), list(coef = ~list(lm(. ~ ind)$coef[2]))) %>%
unnest() %>%
stack() %>%
setNames(c("slope", "column"))
# slope column
# 1 2.4591375 X9891_coef
# 2 0.3158741 X7891_coef
# 3 1.0136783 X5891_coef
PS. It's always good practice to include a fixed random seed when working with random data to ensure reproducibility of results.
Sample data
set.seed(2018)
temp <- data.frame(
ind = c(1:10),
`9891` = runif(10, 15, 75),
`7891` = runif(10, 15, 75),
`5891` = runif(10, 15, 75)
)
You can use sapply
sapply(temp[-1], function(x) coef(lm(x ~ temp$ind))[2])
#X9891.temp$ind X7891.temp$ind X5891.temp$ind
# -0.01252979 -2.94773367 2.57816244
To get the final data frame, you could do
data.frame(ind = names(temp)[-1],
coef = sapply(temp[-1], function(x) coef(lm(x ~ temp$ind))[2]), row.names = NULL)
# ind coef
#1 X9891 -0.01252979
#2 X7891 -2.94773367
#3 X5891 2.57816244
where each row holds the slope for one column.
data
set.seed(1234)
temp <- data.frame(
ind = c(1:10),
`9891` = runif(10, 15, 75),
`7891` = runif(10, 15, 75),
`5891` = runif(10, 15, 75)
)
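Since the question also asks for a fix to the original loop, here is one possible base-R sketch. Two things break the original: assigning with result$ind <- code into a zero-row data frame raises an error, and with temp$ind in the formula the slope is named temp$ind, so coef(fit)['ind'] returns NA. The version below builds the formula with reformulate() and collects one row per column; rows and result are just names used for this example.

```r
set.seed(1234)
temp <- data.frame(
  ind = c(1:10),
  `9891` = runif(10, 15, 75),
  `7891` = runif(10, 15, 75),
  `5891` = runif(10, 15, 75)
)

# data.frame() renames the columns to X9891, X7891, X5891
cols <- colnames(temp)[-1]

# Collect one small data frame per column, then bind them at the end
rows <- vector("list", length(cols))
for (i in seq_along(cols)) {
  code <- cols[i]
  # Builds e.g. X9891 ~ ind, so the slope coefficient is named "ind"
  fit <- lm(reformulate("ind", response = code), data = temp)
  rows[[i]] <- data.frame(ind = code, coef = coef(fit)[["ind"]])
}
result <- do.call(rbind, rows)
result
```

Growing a list and binding once at the end avoids repeatedly copying the data frame inside the loop.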

Two different multiple GLM poisson model regression, mean points and confidence interval

I'd like to create a plot in ggplot2 that combines two separately fitted multiple Poisson GLMs, the mean points, and 95% confidence intervals. But my mean points are not displayed.
#Artificial data set
Consumption <- c(501, 502, 503, 504, 26, 27, 55, 56, 68, 69, 72, 93)
Gender <- gl(n = 2, k = 6, length = 2*6, labels = c("Male", "Female"), ordered = FALSE)
Income <- c(5010, 5020, 5030, 5040, 260, 270, 550, 560, 680, 690, 720, 930)
df3 <- data.frame(Consumption, Gender, Income)
df3
# GLM Regression
fm1 <- glm(Consumption~Gender+Income, data=df3, family=poisson)
summary(fm1)
# ANOVA
anova(fm1,test="Chi")
#Genders differ, so I fitted one model for Male and another for Female
#Male model
df4<-df3[df3$Gender=="Male",]
fm2 <- glm(Consumption~Income, data=df4, family=poisson)
summary(fm2)
#Female model
df5<-df3[df3$Gender=="Female",]
fm3 <- glm(Consumption~Income, data=df5, family=poisson)
summary(fm3)
#Create predictions and confidence intervals
Predictions <- c(predict(fm2, type="link", se.fit = TRUE),
predict(fm3, type="link", se.fit = TRUE))
df3_combined <- cbind(df3, Predictions)
df3_combined$UCL<-df3_combined$fit + 1.96*df3_combined$se.fit
df3_combined$LCL<-df3_combined$fit - 1.96*df3_combined$se.fit
df3_combined<-df3_combined[,-(6:9)]
df3_combined<-as.data.frame(df3_combined)
#Create mean values to plot
library(dplyr)
df<-df3_combined %>%
group_by(Income, Gender) %>%
summarize(Consumption = mean(Consumption, na.rm = TRUE))
df<-as.data.frame(df)
#Plot
library(tidyverse)
library(ggplot2)
df3_combined %>%
gather(type, value, Consumption) %>%
ggplot(mapping=aes(x=Income, y=Consumption, color = Gender)) +
geom_point(df,mapping=aes(x=Income, y=Consumption, color = Gender)) +
geom_line(mapping=aes(x=Income, y=exp(fit))) +
geom_smooth(mapping=aes(ymin = exp(LCL), ymax = exp(UCL)), stat="identity")
#
I don't see the mean values created in df object in my output plot and I don't know why.
