Predicting relative risk with predict.coxph, simPH and the formula

There is a great post about the interpretation of the predict.coxph() output. However, I keep getting different results when I compare the output of predict.coxph(), simPH, and the formula for the relative risk. Since my hypothesis includes a quadratic effect, I include a second-degree polynomial in my example.
I use the example from this post.
library(survival)  # for the lung data, coxph() and Surv()
library(ggplot2)   # for the comparison plot further down
library(simPH)     # for coxsimPoly() and simGG()
data("lung")
Predicting relative risk with predict()
# Defining the quadratic predictor
lung$meal.cal_q <- lung$meal.cal^2
# conduct a cox regression with the predictor meal.cal, its quadratic version and some covariates.
cox_mod <- coxph(Surv(time, status) ~
ph.karno + pat.karno + meal.cal + meal.cal_q,
data = lung)
# a vector of fitted values to predict for
meal.cal_new <- seq(min(lung$meal.cal, na.rm= TRUE), max(lung$meal.cal,
na.rm= TRUE), by= 1)
# a vector of fitted values to predict for, the quadratic effect
meal.cal_q_new <- meal.cal_new^2
# the length of the vector with the values to predict for
n <- length(meal.cal_new)
# a dataframe with all the values to predict for
lung_new <- data.frame(ph.karno= rep(mean(lung$ph.karno, na.rm= TRUE), n),
pat.karno= rep(mean(lung$pat.karno, na.rm= TRUE), n),
meal.cal= meal.cal_new,
meal.cal_q = meal.cal_q_new)
# predict the relative risk
lung_new$rel_risk <- predict(cox_mod, lung_new, type= "risk")
Predicting the relative risk with the formula (see the post mentioned above)
# Defining the quadratic predictor
lung$meal.cal_q <- lung$meal.cal^2
# run a cox regression with the predictor meal.cal, its quadratic version and some covariates.
cox_mod <- coxph(Surv(time, status) ~
ph.karno + pat.karno + meal.cal + meal.cal_q,
data = lung)
# a vector of fitted values to predict for
meal.cal_new <- seq(min(lung$meal.cal, na.rm= TRUE), max(lung$meal.cal,
na.rm= TRUE), by= 1)
# a vector of fitted values to predict for, the quadratic effect
meal.cal_q_new <- meal.cal_new^2
# length of the vector to predict for
n <- length(meal.cal_new)
# A dataframe with the values to make the prediction for
lung_new2 <- data.frame(
ph.karno= rep(mean(lung$ph.karno, na.rm= TRUE), n),
pat.karno= rep(mean(lung$pat.karno, na.rm= TRUE), n),
meal.cal= meal.cal_new,
meal.cal_q = meal.cal_q_new)
# A dataframe with the values to compare the prediction with
lung_new_mean <- data.frame(
ph.karno= rep(mean(lung$ph.karno, na.rm= TRUE), n),
pat.karno= rep(mean(lung$pat.karno, na.rm= TRUE), n),
meal.cal= rep(mean(lung$meal.cal, na.rm= TRUE), n),
meal.cal_q = rep(mean(lung$meal.cal_q, na.rm= TRUE), n))
# extract the coefficients
coefCPH <- coef(cox_mod)
# make the prediction for the values of interest
cox_risk <-
exp(coefCPH["ph.karno"] * lung_new2[ , "ph.karno"] +
coefCPH["pat.karno"] * lung_new2[ , "pat.karno"] +
coefCPH["meal.cal"] * lung_new2[ , "meal.cal"] +
coefCPH["meal.cal_q"] * lung_new2[ , "meal.cal_q"])
# make the predictions for the values to compare with
cox_risk_mean <-
exp(coefCPH["ph.karno"] * lung_new_mean[ , "ph.karno"] +
coefCPH["pat.karno"] * lung_new_mean[ , "pat.karno"] +
coefCPH["meal.cal"] * lung_new_mean[ , "meal.cal"] +
coefCPH["meal.cal_q"] * lung_new_mean[ , "meal.cal_q"])
# calculate the relative risk
lung_new2$rel_risk <- unlist(cox_risk)/ unlist(cox_risk_mean)
Now the plot with the predicted relative risk using predict() and using the formula:
ggplot(lung_new, aes(meal.cal, rel_risk)) +
geom_smooth() +
geom_smooth(data= lung_new2, col= "red")
The plot shows that the predictions are different. I do not understand why this is the case, although the mentioned post shows that the predict function and the formula should give the same result.
Because of this confusion I tried to solve the problem with the simPH package. Here is what I did:
# Defining the quadratic predictor
lung$meal.cal_q <- lung$meal.cal^2
# run a cox regression with the predictor, its quadratic version and some covariates.
cox_mod <- coxph(Surv(time, status) ~
ph.karno + pat.karno + meal.cal + meal.cal_q,
data = lung)
# a vector of fitted values to predict for
meal.cal_new <- seq(min(lung$meal.cal, na.rm= TRUE),
max(lung$meal.cal, na.rm= TRUE), by= 1)
# length of the vector to predict for
n <- length(meal.cal_new)
# A vector with the values to compare the prediction with
meal.cal_new_mean <- rep(mean(lung$meal.cal, na.rm= TRUE), n)
# running 100 simulations per predictor value with coxsimPoly
Sim <- coxsimPoly(obj= cox_mod, b = "meal.cal", pow = 2,
qi = "Relative Hazard",
Xj = meal.cal_new,
Xl = meal.cal_new_mean,
ci = .95,
nsim = 100,
extremesDrop = FALSE)
# plot the result
simGG(Sim)
This gives an empty plot with the following warnings:
Warning messages:
1: In min(obj$sims[, x]) : no non-missing arguments to min; returning Inf
2: In max(obj$sims[, x]) : no non-missing arguments to max; returning -Inf
And the Sim$sims object does indeed appear to be empty.
My questions are:
Why do the results from predict() and the use of the formula differ?
Why does the simPH package not calculate the relative risk?
Which method should I choose? My hypothesis is a quadratic effect in a Cox regression, and I need a plot of the relative risk for this predictor (compared with the predictor at its mean value), just like in the example.

Quick answer to the simPH issue: the polynomial terms need to be specified in the coxph call using the I() function, e.g.:
cox_mod <- coxph(Surv(time, status) ~
ph.karno + pat.karno + meal.cal + I(meal.cal^2),
data = lung)
(The error handling in your use case is pretty poor.)
Using this modification (and 1000 simulations) with your code above should return a non-empty simGG plot of the simulated relative hazard (the plot itself is not shown here).
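For reference, the full call with the corrected model is the one from the question with only nsim raised (a sketch reusing the vectors defined above):
Sim <- coxsimPoly(obj = cox_mod, b = "meal.cal", pow = 2,
                  qi = "Relative Hazard",
                  Xj = meal.cal_new,
                  Xl = meal.cal_new_mean,
                  ci = .95,
                  nsim = 1000,
                  extremesDrop = FALSE)
simGG(Sim)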
Differences between simPH and predict
My guess about the differences is that simPH doesn't create confidence intervals around transformed point estimates the way predict does. It draws simulations from the multivariate normal distribution specified by the fitted model, then shows the central 50% and 95% of this simulated distribution; the central line is just the median of the simulations. It is explicitly a different logic from predict. For very non-monotonic quantities of interest, like this one, predict point estimates can give highly substantively misleading results compared to simPH. There is little evidence for such a form based on 4 observations.
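As for the gap between predict() and the hand-rolled formula (question 1), one thing worth checking is what predict() centers on: coxph() stores the covariate means of the complete cases it actually used in cox_mod$means, and predict(..., type = "risk") is relative to those, whereas the manual comparison uses column-wise means of the full lung data (na.rm = TRUE), which need not coincide. A quick check, sketched with the question's meal.cal_q parameterization:
# Covariate means coxph() centered the linear predictor on (complete cases only)
cox_mod$means
# Means used to build lung_new_mean (all non-missing values, column by column)
colMeans(lung_new_mean)
# Hand-rolled relative risk against cox_mod$means, to compare with
# predict(cox_mod, lung_new, type = "risk")
lp_new  <- as.matrix(lung_new2[, names(coefCPH)]) %*% coefCPH
lp_mean <- sum(coefCPH * cox_mod$means)
rel_risk_check <- exp(lp_new - lp_mean)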

Related

How do I extract standard errors or variation from predicted ordinal logistic regression analyses?

I am undertaking an ordinal logistic regression using the R package MASS.
For example:
library(MASS)
house.plr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
summary(house.plr, digits = 3)
I am using the S3 method predict() to obtain the predicted values:
test_dat <- data.frame(Infl = factor(rep("Low",4)),
Cont = factor(rep("Low",4)),
Type = unique(housing$Type))
predict(house.plr, test_dat, type = "p")
Low Medium High
1 0.3784493 0.2876752 0.3338755
2 0.5190445 0.2605077 0.2204478
3 0.4675584 0.2745383 0.2579033
4 0.6444840 0.2114256 0.1440905
The result is a table of predicted means for each level of Sat given the variables defined in the test_dat.
How might I extract the variation around each of these means in the form of a standard error or standard deviation?
First, your predicted values are the predicted probability of each outcome for each observation. It is not the predicted mean on the response scale.
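A quick way to see this: the predicted probabilities in each row cover the three response categories and therefore sum to one.
# Each row of the predicted-probability matrix sums to 1
rowSums(predict(house.plr, test_dat, type = "probs"))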
Second, you can use the marginaleffects package to get the standard errors for the predicted probabilities and then calculate the confidence intervals yourself. Alternatively, you can implement a non-parametric bootstrap. I implement both below. Note that I shifted the order of the columns in the test data to match the training data.
# Packages
library(MASS)
library(marginaleffects)
library(dplyr)
# Create a test set
N <- 4
test_dat <- data.frame(
Infl = factor(rep("Low", N)),
Type = unique(housing$Type),
Cont = factor(rep("Low", N))
)
# Fit ordered logistic regression model
house.plr <- polr(Sat ~ Infl + Type + Cont,
weights = Freq,
data = housing,
Hess = TRUE)
# Demonstrate that predict() doesn't provide any measure of variability
# for the predicted class probabilities, as shown in question
predict(house.plr, test_dat, type = "probs")
# Use the marginaleffects package to get delta method standard errors for
# each predicted probability
probs <- marginaleffects::predictions(house.plr,
newdata = test_dat,
type = "probs")
# Compute CIs from the standard error using normal approximation
probs$predicted - 1.96*probs$std.error
probs$predicted + 1.96*probs$std.error
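If it helps, the point estimates and the normal-approximation bounds can be collected in one data frame (a sketch using the columns referenced above; newer marginaleffects versions may name the estimate column differently):
ci_df <- data.frame(group     = probs$group,
                    estimate  = probs$predicted,
                    conf.low  = probs$predicted - 1.96 * probs$std.error,
                    conf.high = probs$predicted + 1.96 * probs$std.error)
head(ci_df)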
# Alternatively, use non-parametric bootstrapped confidence intervals.
# note that this does not adjust the weights to a constant sum for
# each bootstrap, although it is easy to implement. You're free to
# determine how to handle the weights, including resampling based
# on the weights.
# Generate bootstrapped data.frames
set.seed(123)
sims <- 5 # kept small here for illustration; use many more (e.g. 1000) in practice
samples <- vector(mode = "list", length = sims)
samples <- lapply(samples, function(x){ slice_sample(housing, n = nrow(housing), replace = TRUE)})
# Fit model on each bootstrapped data.frame
models <- lapply(samples, function(x){polr(Sat ~ Infl + Type + Cont,
weights = Freq,
data = x,
Hess = TRUE)})
# Get test predictions into a data.frame
probs_boot <- lapply(models, function(x) {
marginaleffects::predictions(x,
newdata = test_dat,
type = "probs")
})
probs_boot_df <- bind_rows(probs_boot)
# Compute CIs
probs_boot_df %>%
group_by(group, Type.x, Infl, Type.y, Cont) %>%
summarise(ci_low = quantile(predicted, probs = 0.025),
ci_high = quantile(predicted, probs = 0.975))

Is there a difference between gamma hurdle (two-part) models and zero-inflated gamma models?

I have semicontinuous data (many exact zeros and continuous positive outcomes) that I am trying to model. I have largely learned about modeling data with substantial zero mass from Zuur and Ieno's Beginner's Guide to Zero-Inflated Models in R, which makes a distinction between zero-inflated gamma models and what they call "zero-altered" gamma models, which they describe as hurdle models that combine a binomial component for the zeros and a gamma component for the positive continuous outcome. I have been exploring the use of the ziGamma option in the glmmTMB package and comparing the resulting coefficients to a hurdle model that I built following the instructions in Zuur's book (pages 128-129), and they do not coincide. I'm having trouble understanding why not, as I know that the gamma distribution cannot take on the value of zero, so I suppose every zero-inflated gamma model is technically a hurdle model. Can anyone illuminate this for me? See more comments about the models below the code.
library(tidyverse)
library(boot)
library(glmmTMB)
library(parameters)
### DATA
id <- rep(1:75000)
age <- sample(18:88, 75000, replace = TRUE)
gender <- sample(0:1, 75000, replace = TRUE)
cost <- c(rep(0, 30000), rgamma(n = 37500, shape = 5000, rate = 1),
sample(1:1000000, 7500, replace = TRUE))
disease <- sample(0:1, 75000, replace = TRUE)
time <- sample(30:3287, 75000, replace = TRUE)
df <- data.frame(cbind(id, disease, age, gender, cost, time))
# create binary variable for non-zero costs
df <- df %>% mutate(cost_binary = ifelse(cost > 0, 1, 0))
### HURDLE MODEL (MY VERSION)
# gamma component
hurdle_gamma <- glm(cost ~ disease + gender + age + offset(log(time)),
data = subset(df, cost > 0),
family = Gamma(link = "log"))
model_parameters(hurdle_gamma, exponentiate = T)
# binomial component
hurdle_binomial <- glm(cost_binary ~ disease + gender + age + time,
data = df, family = "binomial")
model_parameters(hurdle_binomial, exponentiate = T)
# predicted probability of use
df$prob_use <- predict(hurdle_binomial, type = "response")
# predicted mean cost for people with any cost
df_bin <- subset(df, cost_binary == 1)
df_bin$cost_gamma <- predict(hurdle_gamma, type = "response")
# combine data frames
df2 <- left_join(df, select(df_bin, c(id, cost_gamma)), by = "id")
# replace NA with 0
df2$cost_gamma <- ifelse(is.na(df2$cost_gamma), 0, df2$cost_gamma)
# calculate predicted cost for everyone
df2 <- df2 %>% mutate(cost_pred = prob_use * cost_gamma)
# mean predicted cost
mean(df2$cost_pred)
### glmmTMB with ziGamma
zigamma_model <- glmmTMB(cost ~ disease + gender + age + offset(log(time)),
family = ziGamma(link = "log"),
ziformula = ~ disease + gender + age + time,
data = df)
model_parameters(zigamma_model, exponentiate = T)
df <- df %>% predict(zigamma_model, newdata = df, type = "response") # doesn't work:
# "no applicable method for 'predict' applied to an object of class 'data.frame'"
The coefficients from the gamma component of my hurdle model and the fixed-effects (conditional) component of the zigamma model are the same, but the SEs are different, which in my actual data has substantial implications for the significance of my predictor of interest. The coefficients of the zero-inflation model are different, and I also noticed that the z values in its binomial component are the negatives of those in my binomial model. I assume this has to do with my binomial model modeling the probability of presence (1 is a success) and glmmTMB presumably modeling the probability of absence (0 is a success)?
In sum, can anyone point out what I am doing wrong with the glmmTMB ziGamma model?
The glmmTMB package can do this:
glmmTMB(formula, family=ziGamma(link="log"), ziformula=~1, data= ...)
ought to do it. Maybe something in VGAM as well?
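As for the predict() error quoted in the question: the pipe hands predict() the data frame as its first argument (and newdata is misspelled), so something along these lines should work instead (a sketch; type = "response" combines the zero-inflation and Gamma parts into a predicted mean cost):
df$cost_pred_zig <- predict(zigamma_model, newdata = df, type = "response")
head(df$cost_pred_zig)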
To answer the questions about coefficients and standard errors:
the change in sign of the binomial coefficients is exactly what you suspected (the difference between estimating the probability of 0 [glmmTMB] vs the probability of not-zero [your/Zuur's code])
The standard errors on the binomial part of the model are close but not identical: using broom.mixed::tidy,
library(broom); library(broom.mixed)  # tidy() for the glm and glmmTMB fits
round(1 - abs(tidy(zigamma_model, component = "zi")$statistic)/
  abs(tidy(hurdle_binomial)$statistic), 3)
## [1] 0.057 0.001 0.000 0.000 0.295
6% for the intercept, up to 30% for the effect of age ...
the nearly twofold difference in the standard errors of the conditional (cost>0) component is definitely puzzling me; it holds up if we simply implement the Gamma/log-link in glmmTMB vs glm. It's hard to know how to check which is right/what the gold standard should be for this case. I might distrust Wald p-values in this case and try to get p-values with the likelihood ratio test instead (via drop1).
In this case the model is badly misspecified (i.e. the cost is uniformly distributed, nothing like Gamma); I wonder if that could be making things harder/worse?
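To make the sign flip and the likelihood-ratio suggestion concrete (a sketch reusing the objects from the question; drop1 refits the model and can be slow):
# The zero-inflation part models Pr(cost == 0) while the hurdle binomial models
# Pr(cost > 0), so the coefficients should be roughly sign-flipped versions of
# each other
cbind(zi_glmmTMB = glmmTMB::fixef(zigamma_model)$zi,
      minus_glm  = -coef(hurdle_binomial))
# Likelihood-ratio p-values instead of Wald, as suggested above
drop1(zigamma_model, test = "Chisq")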

Cox model, coxph(), control treatment with no event, seed germination

I am performing survival analysis and I'm not sure if I am doing it correctly. My dataset comes from a seed germination experiment. The main variable of interest is "treat" (categorical with 3 levels). In my script I am trying to figure out whether there is a difference between treatments, which one is best, and to what extent, by comparing the PH coefficients as percentages. Could anyone help me with some of the problems that I'm dealing with?
1) Do I need to declare my variables as.factor() to use them? Or integer is interpreted equally?
2) If the proportional hazards (PH) assumption is violated, what should I do with my data to proceed with building a Cox model? I've researched intensely but haven't been able to understand the programming needed to add a covariate*time interaction or stratification to my model.
3) How do I include frailty terms in the Cox model and detect a random effect (e.g. the plate in which seeds were germinated, a categorical variable with 4 levels representing the repetition)?
4) I also wasn't able to interpret the print(summary(cox.fra)).*
*see below
See below my two whole scripts with comments.
SCRIPT 1
rd01 <- read.table("sa_kb01.txt", header = T) # raw dataset, seed survival
rd01
str(rd01)
rd01$begin <- as.factor(rd01$begin) # integers to factors
rd01$spp <- as.factor(rd01$spp)
rd01$cit <- as.factor(rd01$cit)
rd01$treat <- as.factor(rd01$treat)
rd01$plate <- as.factor(rd01$plate)
str(rd01)
summary(rd01)
names(rd01) # headers
### Survival analysis
# install.packages("survival")
library(survival)
library(survminer)
?survfit
?survfit.formula
?survfit.coxph
?ggsurvplot
## Fit Kaplan-Meier survivor function
km.fit <- survfit(Surv(day, status) ~ treat, data= rd01, type="kaplan-meier")
km.fit
print(summary(km.fit))
plot(km.fit, conf.int= T, fun = "event", mark.time = c(140), pch = c("S", "W", "A"), col = c("darkred","darkblue","darkgreen"), lty = c("solid","dotted","longdash"),lwd = 1.5, xlab = "time [days]", ylab = "germination probability [%]")
print(summary(km.fit))
## Comparison of Survivor Functions
# Log-rank tests
?survdiff
# Log-rank or Mantel-Haenszel test in "rho = 0" OR
# Peto & Peto modification of the Gehan-Wilcoxon test in "rho = 1"
# ... Assess all groups for heterogeneity
lrmh.123 <- survdiff(Surv(day,status) ~ treat, data= rd01, rho= 0)
print(lrmh.123) # If p<0.05 there is a difference among the groups!
# ... Comparing groups pairwise
lrmh.120 <- survdiff(Surv(day,status) ~ treat, data= rd01, subset= {treat!=3}, rho= 0)
lrmh.103 <- survdiff(Surv(day,status) ~ treat, data= rd01, subset= {treat!=2}, rho= 0)
lrmh.023 <- survdiff(Surv(day,status) ~ treat, data= rd01, subset= {treat!=1}, rho= 0)
print(lrmh.120)
print(lrmh.103)
print(lrmh.023) # If p<0.05 there is a difference between the two groups compared!
## Checking Proportional Hazard (PH) assumption
# Define function mlogmlog() to calculate -log(-log(S(t)))
mlogmlog <- function(y){-log(-log(y))}
# Use estimated Kaplan-Meier survivor functions
km.fit
# ... to plot -log(-log(S(t))) versus log(t)
plot(km.fit, fun= mlogmlog, log="x", mark.time= c(140), pch = c("S", "W", "A"), col = c("darkred","darkblue","darkgreen"), lty = c("solid","dotted","longdash"), lwd = 1.5, xlab="time [days]", ylab= "-log(-log(S(t)))") # If lines do not cross, PH assumption is plausible!
# Interpretation: http://www.sthda.com/english/wiki/cox-model-assumptions#testing-proportional-hazards-assumption
## Checking for multicollinearity
# install.packages("HH")
library(HH)
# Fit a generalized linear model predicting days from treatment
?glm
mc.glm <- glm(day ~ treat, data=rd01)
print(mc.glm) # doesn't need interpretation, only used to create object to VIF function
# Check for multicollinearity among covariates through the variance inflation factor (VIF)
?vif
mc.vif <- vif(mc.glm)
print(mc.vif) # VIF can determine what proportion of the variation in each covariate
# is explained by the other covariates:
# VIF > 10, serious multicollinearity; VIF = 5, evidence of multicollinearity;
# VIF < 1, no evidence of multicollinearity
## Adding covariates to the Cox model
# Create a Cox model
cox.mod <- coxph(Surv(day, status) ~ treat, data= rd01)
print(summary(cox.mod))
# Interpretation: http://www.sthda.com/english/wiki/cox-proportional-hazards-model
# Double check for PH assumption now with Cox model built
dc.ph <- cox.zph(cox.mod)
dc.ph
ggcoxzph(dc.ph) # if global and individual p-values > 0.05, PH assumption is plausible!
## Including random effects
?frailty
# Adding plate variable as frailty term
cox.fra <- coxph(Surv(day, status) ~ treat + frailty(plate), data= rd01)
print(summary(cox.fra)) # if global and individual p-values < 0.05,
# keep the frailty term while adding covariates one at a time to the Cox model!
SCRIPT 2 - the same, but different dataset, control treat1 with no event!
rd01 <- read.table("sa_hal01.txt", header = T) # raw dataset, seed survival
# ... the rest of SCRIPT 2 is identical to SCRIPT 1 above (Kaplan-Meier fit,
# log-rank tests, PH checks, VIF, Cox model, cox.zph and frailty term),
# just run on this dataset.
There seems to be a statistically significant difference, and treat3 differs from the other groups in both scripts. In script 1 PH is violated and I don't know what to do. Apart from that, the Cox model in script 1 seems to work fine and the interpretation of the hazard ratios is OK, but in script 2 I have no idea how to interpret or solve that (there was no event in control treat1).
1) Do I need to declare my variables as.factor() to use them? Or integer is interpreted equally?
I think in your case as.factor is correct. You can use integers if you have continuous numeric variables - for example, if you had the time the seeds were stored before the experiment, you could use as.numeric for that time variable.
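To make the difference concrete (an illustrative sketch with the objects from SCRIPT 1): as a factor, treat gets one coefficient per non-reference level; as an integer, it gets a single linear-trend coefficient.
coef(coxph(Surv(day, status) ~ treat, data = rd01))              # factor: treat2 and treat3 vs treat1
coef(coxph(Surv(day, status) ~ as.integer(treat), data = rd01))  # integer: one slope per unit of treat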
2) If PH is violated, what should I do with my data to proceed to a cox model building? I've intensely researched but haven't been able to understand the programming to add covariate x time interaction or stratification to my model.
Cox regression, aka the Cox proportional hazards model, is based on the assumption of proportional hazards. If that assumption is violated, you won't get reliable results. You could try some data transformations to see if they help. Or, if the assumption is violated only in some subexperiment/group, you could just leave that group out.
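For questions 2 and 3, purely as illustrative sketches (variable names from the scripts above, not tested on these data): stratification removes the hazard ratio for the stratified variable, so it only makes sense for a nuisance covariate, while a tt() term lets an effect change with time; a frailty/random effect can also be fitted with the coxme package as an alternative to frailty().
# Stratify on plate (no hazard ratio is estimated for the stratified variable)
cox.strat <- coxph(Surv(day, status) ~ treat + strata(plate), data = rd01)
# Let the effect of treat vary with log(time) via a time-transform term
# (one commonly shown pattern for a factor covariate)
cox.tt <- coxph(Surv(day, status) ~ treat + tt(treat), data = rd01,
                tt = function(x, t, ...) model.matrix(~ x)[, -1] * log(t))
# Random (frailty) effect for plate via coxme
# install.packages("coxme")
library(coxme)
cox.me <- coxme(Surv(day, status) ~ treat + (1 | plate), data = rd01)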

Back-transform coefficients from glmer with scaled independent variables for prediction

I've fitted a mixed model using the lme4 package. I transformed my independent variables with the scale() function prior to fitting the model. I now want to display my results on a graph using predict(), so I need the predicted data to be back on the original scale. How do I do this?
Simplified example:
library(lme4)
database <- mtcars
# Scale data
database$wt <- scale(mtcars$wt)
database$am <- scale(mtcars$am)
# Make model
model.1 <- glmer(vs ~ scale(wt) + scale(am) + (1|carb), database, family = binomial, na.action = "na.fail")
# make new data frame with all values set to their mean
xweight <- as.data.frame(lapply(lapply(database[, -1], mean), rep, 100))
# make new values for wt
xweight$wt <- (wt = seq(min(database$wt), max(database$wt), length = 100))
# predict from new values
a <- predict(model.1, newdata = xweight, type="response", re.form=NA)
# returns scaled prediction
I've tried using this example to back-transform the predictions:
# save scale and center values
scaleList <- list(scale = attr(database$wt, "scaled:scale"),
center = attr(database$wt, "scaled:center"))
# back-transform predictions
a.unscaled <- a * scaleList$scale + scaleList$center
# Make model with unscaled data to compare
un.model.1 <- glmer(vs ~ wt + am + (1|carb), mtcars, family = binomial, na.action = "na.fail")
# make new data frame with all values set to their mean
un.xweight <- as.data.frame(lapply(lapply(mtcars[, -1], mean), rep, 100))
# make new values for wt
un.xweight$wt <- (wt = seq(min(mtcars$wt), max(mtcars$wt), length = 100))
# predict from new values
b <- predict(un.model.1, newdata = xweight, type="response", re.form=NA)
all.equal(a.unscaled,b)
# [1] "Mean relative difference: 0.7223061"
This doesn't work - there shouldn't be any difference. What have I done wrong?
I've also looked at a number of similar questions but not managed to apply any to my case (How to unscale the coefficients from an lmer()-model fitted with a scaled response, unscale and uncenter glmer parameters, Scale back linear regression coefficients in R from scaled and centered data, https://stats.stackexchange.com/questions/302448/back-transform-mixed-effects-models-regression-coefficients-for-fixed-effects-f).
The problem with your approach is that it only "unscales" based on the wt variable, whereas you scaled all of the variables in your regression model. One approach that works is to adjust all of the variables in your new (prediction) data frame using the centering/scaling values that were used on the original data frame:
## scale variable x using center/scale attributes
## of variable y
scfun <- function(x,y) {
scale(x,
center=attr(y,"scaled:center"),
scale=attr(y,"scaled:scale"))
}
## scale prediction frame
xweight_sc <- transform(xweight,
wt = scfun(wt, database$wt),
am = scfun(am, database$am))
## predict
p_unsc <- predict(model.1,
newdata=xweight_sc,
type="response", re.form=NA)
Comparing this p_unsc to your predictions from the unscaled model (b in your code), i.e. all.equal(b,p_unsc), gives TRUE.
Another reasonable approach would be to
unscale/uncenter all of your parameters using the "unscaling" approaches presented in one of the linked question (such as this one), generating a coefficient vector beta_unsc
construct the appropriate model matrix from your prediction frame:
X <- model.matrix(formula(model, fixed.only = TRUE)[-2],  # drop the response
  data = pred_frame)
compute the linear predictor and back-transform:
pred <- plogis(X %*% beta_unsc)
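Put together, that route might look like this (a sketch: it assumes the centering/scaling attributes stored on database$wt and database$am are the ones to undo, and uses un.xweight from the question as the prediction frame):
b   <- lme4::fixef(model.1)                   # (Intercept), scale(wt), scale(am)
cen <- c(attr(database$wt, "scaled:center"), attr(database$am, "scaled:center"))
scl <- c(attr(database$wt, "scaled:scale"),  attr(database$am, "scaled:scale"))
beta_unsc <- c(b[1] - sum(b[-1] * cen / scl), # intercept on the original scale
               b[-1] / scl)                   # slopes on the original scale
X    <- model.matrix(~ wt + am, data = un.xweight)
pred <- plogis(X %*% beta_unsc)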

Manual Perceptron example in R - are the results acceptable?

I am trying to get a perceptron algorithm for classification working, but I think something is missing. This is the decision boundary achieved with logistic regression (plot not shown): the red dots got into college, after performing better on tests 1 and 2.
This is the data, and this is the code for the logistic regression in R:
dat = read.csv("perceptron.txt", header=F)
colnames(dat) = c("test1","test2","y")
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=dat)
fit = glm(y ~ test1 + test2, family = "binomial", data = dat)
coefs = coef(fit)
(x = c(min(dat[,1])-2, max(dat[,1])+2))
(y = c((-1/coefs[3]) * (coefs[2] * x + coefs[1])))
lines(x, y)
The code for the "manual" implementation of the perceptron is as follows:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1,nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning data.frame into matrix to avoid mmult problems.
# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting a random vector of coefficients.
theta = theta/sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){ # We will recalculate 1,000 times
for (j in 1:nrow(data)){ # Each time we go through each example.
if(h[j] * data[j, 4] < 0){ # If the hypothesis disagrees with the sign of y,
theta = theta + (sign(data[j,4]) * data[j, 1:3]) # We + or - the example from theta.
}
else
theta = theta # Else we let it be.
}
h = theta %*% t(data[,1:3]) # Calculating h() after iteration.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
With this, I get the following coefficients:
bias test1 test2
9.131054 19.095881 20.736352
and an accuracy of 88%, consistent with that calculated with the glm() logistic regression function: mean(sign(predict(fit)) == data[,4]) gives 89%. Logically, there is no way of linearly classifying all of the points, as is obvious from the plot above. In fact, iterating only 10 times and plotting the accuracy, ~90% is reached after just one iteration (plot not shown).
Since this is in line with the training classification performance of logistic regression, the code is probably not conceptually wrong.
QUESTIONS: Is it OK to get coefficients so different from the logistic regression:
(Intercept) test1 test2
1.718449 4.012903 3.743903
This is really more of a CrossValidated question than a StackOverflow question, but I'll go ahead and answer.
Yes, it's normal and expected to get very different coefficients, because you can't directly compare the magnitudes of the coefficients between these two techniques.
With the logit (logistic) model you're using a binomial distribution and a logit link, fit via the corresponding sigmoid-based likelihood; the coefficients are only meaningful in that context. You've also got an intercept term in the logit model.
None of this is true for the perceptron model, so the interpretation of the coefficients is totally different.
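One way to see that the perceptron's coefficient magnitudes in particular are arbitrary (a sketch reusing the objects above): multiplying theta by any positive constant leaves every classification unchanged, so only the direction of the coefficient vector is identified.
h_scaled <- (10 * theta) %*% t(data[, 1:3])
all(sign(h_scaled) == sign(h))  # TRUE: same decisions, coefficients 10x larger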
Now, that's not saying anything about which model is better. There aren't comparable performance metrics in your question that would allow us to determine that. To determine that you should do cross-validation or at least use a holdout sample.
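A minimal holdout sketch for the logistic side (not in the original post; the perceptron would be refit on the training rows and scored on the test rows in the same way):
set.seed(1)
idx   <- sample(nrow(dat), size = round(0.7 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]
fit_tr <- glm(y ~ test1 + test2, family = "binomial", data = train)
# Holdout accuracy of the logistic model
mean((predict(fit_tr, newdata = test, type = "response") > 0.5) == test$y)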
