R: Formula with multiple Conditions and Categorized Surface Plot

I want to make 3D plots for linear regression models in R: I wish to display the surface of the regression plane of a linear model.
I have 2 continuous variables (say AGE, HEIGHT) and 2 factors (SEX, ALLERGIC). I want to display the predicted values of the LM w.r.t. the 2 continuous variables conditioned on the specified levels of each factor, e.g.
ILLNESS = AGE|{SEX==MALE + ALLERGIC==YES} + HEIGHT|{SEX==MALE + ALLERGIC==YES} +
AGE|{SEX==MALE + ALLERGIC==YES}*HEIGHT|{SEX==MALE + ALLERGIC==YES}
This is the outcome I have in mind:
First Question: Is there a function that makes this easy?
Second Question: If not, how can I write formulas that condition on more than one factor level?

First, let's make some sample input data to have something to test with.
set.seed(15)
dd <- data.frame(
sex = sample(c("M","F"), 200, replace=TRUE),
allergic = sample(c("YES","NO"), 200, replace=TRUE),
age = runif(200, 18,65),
height = rnorm(200, 6, 2)
)
expit <- function(x) exp(x)/(exp(x)+1)
dd <- transform(dd,
illness=expit(-1+(sex=="M")*.8-0.025*age*ifelse(sex=="M",-1,1)+.16*height*ifelse(allergic=="YES",-1,1)+rnorm(200))>.5
)
Now we define the set of values we want to predict over
gg<-expand.grid(sex=c("M","F"), allergic=c("YES","NO"))
vv<-expand.grid(age=18:65, height=3:9)
and then we fit a model, and use the predict function to calculate the response for each point on the surface we wish to plot.
mm <- glm(illness~sex+allergic+age+height, dd, family=binomial)
pd<-do.call(rbind, Map(function(sex, allergic) {
nd <- cbind(vv, sex=sex, allergic=allergic)
cbind(nd, pred=predict(mm, nd, type="response"))
}, sex=gg$sex, allergic=gg$allergic))
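The per-group Map() loop can also be collapsed into a single prediction grid. The following is a minimal sketch under stated assumptions: the data set dd_demo, the model mm_demo and the grid nd are hypothetical stand-ins, and the response is drawn with rbinom() purely for illustration, so the fitted values are not those of the answer's model.

```r
# Stand-in data with the same columns as the answer's dd (response is illustrative only)
set.seed(15)
n <- 200
dd_demo <- data.frame(
  sex      = sample(c("M", "F"), n, replace = TRUE),
  allergic = sample(c("YES", "NO"), n, replace = TRUE),
  age      = runif(n, 18, 65),
  height   = rnorm(n, 6, 2)
)
dd_demo$illness <- rbinom(n, 1, 0.5)

mm_demo <- glm(illness ~ sex + allergic + age + height,
               data = dd_demo, family = binomial)

# One grid holding every combination of the continuous values and factor levels
nd <- expand.grid(age = 18:65, height = 3:9,
                  sex = c("M", "F"), allergic = c("YES", "NO"))

# Predicted probabilities for every grid row in one call
nd$pred <- predict(mm_demo, newdata = nd, type = "response")
head(nd)
```

The resulting nd has the same columns as pd, so it could be passed to the same plotting call.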
Finally, we can use lattice to plot the data
library(lattice)
wireframe(pred~age+height|sex+allergic, pd, drape=TRUE)
which gives us one wireframe surface per combination of sex and allergic.
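For the second question (conditioning on more than one factor level), one option is to fit the model on a subset of the rows rather than encoding the condition in the formula. This is only a sketch, assuming a hypothetical data set dd_lm with a made-up continuous response; it uses the subset= argument of lm() and draws the plane with base persp().

```r
# Hypothetical data; the response is invented for illustration only
set.seed(15)
n <- 200
dd_lm <- data.frame(
  sex      = sample(c("M", "F"), n, replace = TRUE),
  allergic = sample(c("YES", "NO"), n, replace = TRUE),
  age      = runif(n, 18, 65),
  height   = rnorm(n, 6, 2)
)
dd_lm$illness <- 0.02 * dd_lm$age + 0.1 * dd_lm$height + rnorm(n)

# Condition on two factor levels at once via subset=
fit_my <- lm(illness ~ age * height, data = dd_lm,
             subset = sex == "M" & allergic == "YES")

# Predict over a grid of the two continuous variables
ages    <- 18:65
heights <- 3:9
plane   <- expand.grid(age = ages, height = heights)
plane$pred <- predict(fit_my, newdata = plane)

# expand.grid varies age fastest, so rows of z index age and columns index height
z <- matrix(plane$pred, nrow = length(ages))
persp(ages, heights, z, xlab = "age", ylab = "height", zlab = "predicted",
      theta = 30, phi = 20)
```

Fitting on a subset estimates separate coefficients for that group only; a single model with interaction terms would instead share the residual variance across groups.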

Related

Simulating logistic regression from saved estimates in R

I have a bit of an issue. I am trying to develop some code that will allow me to do the following: 1) run a logistic regression analysis, 2) extract the estimates from the logistic regression analysis, and 3) use those estimates to create another logistic regression formula that I can use in a subsequent simulation of the original model. As I am relatively new to R, I understand I can extract these coefficients one by one through indexing, but it is difficult to "scale" this to models with different numbers of coefficients. I am wondering if there is a better way to extract the coefficients and set up the formula. Then I would have to develop the actual variables, but the development of these variables would have to be flexible enough for any number of variables and distributions. This appears to be easily done in Mplus (example 12.7 in the Mplus manual), but I haven't figured this out in R. Here is the code for as far as I have gotten:
#generating the data
set.seed(1)
gender <- sample(c(0,1), size = 100, replace = TRUE)
age <- round(runif(100, 18, 80))
xb <- -9 + 3.5*gender + 0.2*age
p <- 1/(1 + exp(-xb))
y <- rbinom(n = 100, size = 1, prob = p)
#grabbing the coefficients from the logistic regression model
matrix_coef <- summary(glm(y ~ gender + age, family = "binomial"))$coefficients
the_estimates <- matrix_coef[,1]
the_estimates
the_estimates[1]
the_estimates[2]
the_estimates[3]
I just cannot seem to figure out how to have R create the formula with the variables (x's) and the coefficients from the original model in a flexible manner that accommodates any number of variables and different distributions. This is not a class assignment, but a necessary piece of the research that I am producing. Any help will be greatly appreciated, and please, treat this as a teaching moment. I really want to learn this.
I'm not 100% sure what your question is here.
If you want to simulate new data from the same model with the same predictor variables, you can use the simulate() method:
dd <- data.frame(y, gender, age)
## best practice when modeling in R: take the variables from a data frame
model <- glm(y ~ gender + age, data = dd, family = "binomial")
simulate(model)
You can create multiple replicates by specifying the nsim= argument (or you can simulate anew every time through a for() loop).
If you want to simulate new data from a different set of predictor variables, you have to do a little bit more work (the simulate() methods for some model types in R have a newdata= argument, but alas not the one for GLMs):
## simulate new model matrix (including intercept)
simdat <- cbind(1,
gender = rbinom(100, prob = 0.5, size = 1),
age = sample(18:80, size = 100, replace = TRUE))
## extract inverse-link function
invlink <- family(model)$linkinv
## sample new values
resp <- rbinom(n = 100, size = 1, prob = invlink(simdat %*% coef(model)))
If you want to do this later from coefficients that have been stored, substitute the retrieved coefficient vector for coef(model) in the code above.
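As a sketch of how this could scale to any number of stored coefficients, the model matrix can be built with model.matrix() and matched to the coefficient vector by name. The names stored_coefs and new_x are hypothetical, and the coefficient values are simply the data-generating values from the question, standing in for estimates retrieved from storage.

```r
set.seed(1)

# Stored coefficients (here: the question's data-generating values, as a stand-in)
stored_coefs <- c("(Intercept)" = -9, gender = 3.5, age = 0.2)

# New predictor values to simulate from
new_x <- data.frame(
  gender = rbinom(100, size = 1, prob = 0.5),
  age    = sample(18:80, size = 100, replace = TRUE)
)

# model.matrix() adds the intercept column and names the columns like coef() does
X <- model.matrix(~ gender + age, data = new_x)

# Guard against a mismatch between stored coefficients and the design matrix
stopifnot(identical(colnames(X), names(stored_coefs)))

# Inverse logit of the linear predictor, then Bernoulli draws
prob <- plogis(drop(X %*% stored_coefs))
resp <- rbinom(nrow(X), size = 1, prob = prob)
table(resp)
```

Matching by column name rather than by position makes it harder to silently multiply a coefficient by the wrong predictor.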
If you want to flexibly construct formulas, reformulate() is your friend — but I don't see how it fits in here.
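For reference, a short sketch of what reformulate() does: it builds a formula from a character vector of term labels, so the right-hand side can be derived from whatever coefficient names are stored. The vector coef_names is a hypothetical example.

```r
# Names as they would appear in names(coef(model))
coef_names <- c("(Intercept)", "gender", "age")

# Drop the intercept label; the remaining names are the term labels
predictors <- setdiff(coef_names, "(Intercept)")

# Build the formula y ~ gender + age
f <- reformulate(predictors, response = "y")
print(f)
```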
If you want to (say) re-fit the model 1000 times to new responses simulated from the original model fit (same coefficients, same predictors: i.e. a parametric bootstrap), you can do something like this.
nsim <- 1000
res <- matrix(NA, ncol = length(coef(model)), nrow = nsim)
for (i in 1:nsim) {
## simulate returns a list (in this case, of length 1);
## extract the response vector
newresp <- simulate(model)[[1]]
newfit <- update(model, newresp ~ .)
res[i,] <- coef(newfit)
}
You don't have to store coefficients - you can extract/compute whatever model summaries you like (change the number of columns of res appropriately).
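One possible use of the stored coefficients is a percentile interval per coefficient. The sketch below is self-contained and rests on assumptions: the data set dd_boot and its milder coefficient values are hypothetical (chosen to keep the refits well behaved), and all replicates are drawn in one simulate() call with nsim= instead of inside a loop.

```r
set.seed(1)
n <- 100

# Hypothetical data with a binary response
dd_boot <- data.frame(
  gender = rbinom(n, size = 1, prob = 0.5),
  age    = round(runif(n, 18, 80))
)
dd_boot$y <- rbinom(n, size = 1,
                    prob = plogis(-2 + 1 * dd_boot$gender + 0.04 * dd_boot$age))

model_boot <- glm(y ~ gender + age, data = dd_boot, family = binomial)

# Draw all replicate responses at once: a data frame with one column per replicate
nsim <- 200
sims <- simulate(model_boot, nsim = nsim)

# Refit the model to each simulated response and collect the coefficients
res_boot <- t(sapply(sims, function(ysim) {
  coef(glm(ysim ~ gender + age,
           data = transform(dd_boot, ysim = ysim), family = binomial))
}))

# Percentile intervals: one column per coefficient
ci <- apply(res_boot, 2, quantile, probs = c(0.025, 0.975))
ci
```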
Let’s say your data matrix including age and gender, or whatever predictors, is X. Then you can use X on the right-hand side of your glm formula, get xb_hat <- cbind(1, X) %*% the_estimates (the column of 1s matches the intercept; any other data matrix can replace X as long as it has the same columns) and plug xb_hat into the inverse of whatever link function you want.

What ML algorithm should I use to determine fish age-at-maturity?

Fish age-at-maturity is where there is a change in the slope of the growth rate. Here is an example of a simulated individual fish and its two growth rates.
I want to create an algorithm that will determine age-at-maturity from age and length data similar to the picture I attached. I have very little idea on what kind of algorithm would be useful and how to apply it to my sample data set:
> head(data)
age length
1 0 0.01479779
2 1 0.05439856
3 2 0.18308919
4 3 0.24380771
5 4 0.37759992
6 5 0.44871502
It was suggested to me to try and use the Cox Proportional Hazards model. To do that I would consider age-at-maturity as a time to event (maturity is the event and age is the time when maturity is reached). I tried fitting that model but got this error:
> cox.model <- coxph(Surv(age ~ length), data = data)
Error in Surv(age ~ length) : Time variable is not numeric
I tried making both variables numeric using as.numeric() but that did not help.
Please let me know if I am using this model wrong or if I should be using a different model.
As far as I know, time-to-event data should include an event indicator, i.e. a binary variable. If maturity is the event, it should be included in the dataset as such a binary variable, and you should run this:
cox.model <- coxph(Surv(age, maturity) ~ length, data = data)
Please check manual for more details
Survival package
Cox model
BTW, the figure was probably created with something like segmented regression and ggplot; I think you may want to use that technique. Here is an example.
I agree with @C.C.: 1) a survival model is not applicable to the provided dataset, and 2) a simple piecewise linear regression method would be more appropriate.
Please see below the proposed R code for it, together with a sample output graph:
library(segmented)
# create dummy data set, extended from provided one, with noise
df <- data.frame(
age = seq(from = 0, to = 20, by = 1),
length = c(
seq(from = 0, to = 0.45, length.out = 5) + rnorm(5, mean = 1e-3, sd = 1e-2),
seq(from = 0.48, to = 0.6, length.out = 16) + rnorm(16, mean = 1e-3, sd = 1e-2)
)
)
# fit normal linear regression and segmented regression
lm1 <- lm(length ~ age, data = df)
seg_lm <- segmented(lm1, ~ age)
# determine age break point
age_break_point <- seg_lm$psi[1, "Est."]
# plot raw data points, segmented fit and age break point
plot(seg_lm, res = TRUE,
main = paste0('Growth rate change @ ', round(age_break_point, 1), ' years old'),
xlab = 'Age', ylab = 'Length')
abline(v = age_break_point, col='red')
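To make the idea behind a breakpoint fit concrete, here is a base-R sketch that searches a grid of candidate breakpoints and keeps the one with the smallest residual sum of squares. It is not the algorithm used by the segmented package, only an illustration of the piecewise-linear model, and the simulated data (true break at age 4, slopes 0.1 and 0.01) are assumptions made up for the example.

```r
set.seed(42)

# Simulated growth: steep slope up to age 4, shallow slope afterwards
age <- 0:20
true_break <- 4
len <- ifelse(age <= true_break,
              0.1 * age,
              0.4 + 0.01 * (age - true_break)) + rnorm(length(age), sd = 0.01)

# For each candidate breakpoint psi, fit length ~ age + hinge term and record the RSS
candidates <- seq(1, 19, by = 0.1)
rss <- sapply(candidates, function(psi) {
  fit <- lm(len ~ age + pmax(age - psi, 0))
  sum(resid(fit)^2)
})

# The candidate with the smallest RSS is the estimated age-at-maturity
best_break <- candidates[which.min(rss)]
best_break
```

The hinge term pmax(age - psi, 0) is zero before the breakpoint and linear after it, so its coefficient is the change in slope at psi.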

Plotting quadratic curves with poisson glm with interactions in categorical/numeric variables

I want to know if it's possible to plot quadratic curves with Poisson glm with interactions in categorical/numeric variables. In my case:
##Data set artificial
set.seed(20)
d <- data.frame(
behv = c(rpois(100,10),rpois(100,100)),
mating=sort(rep(c("T1","T2"), 200)),
condition = scale(rnorm(200,5))
)
#Condition quadratic
d$condition2<-(d$condition)^2
#Poisson GLM fitted
md<-glm(behv ~ mating + condition + condition2, data=d, family=poisson)
summary(md)
In a situation where mating, condition and condition2 are significant in the model, I make:
#Create x's values
x<-d$condition##
x2<-(d$condition)^2
# T1 estimation
y1<-exp(md$coefficients[1]+md$coefficients[3]*x+md$coefficients[4]*x2)
#
# T2 estimation
y2<-exp(md$coefficients[1]+md$coefficients[2]+md$coefficients[3]*x+md$coefficients[4]*x2)
#
#
#Separate data set
d_T1<-d[d[,2]!="T2",]
d_T2<-d[d[,2]!="T1",]
#Plot
plot(d_T1$condition,d_T1$behv,main="", xlab="condition", ylab="behv",
xlim=c(-4,3), ylim=c(0,200), col= "black")
points(d_T2$condition,d_T2$behv, col="gray")
lines(x,y1,col="black")
lines(x,y2,col="grey")
#
This doesn't work, and I don't get the curves I want. I'd like one curve for T1 and another for T2 of the mating variable. Is there any solution for this?
In the code below, we use the poly function to generate a quadratic model without needing to create an extra column in the data frame. In addition, we create a prediction data frame to generate model predictions across the range of condition values and for each level of mating. The predict function with type="response" generates predictions on the scale of the outcome, rather than on the linear predictor scale, which is the default. Also, we change 200 to 100 in creating the data for mating in order to avoid having the exact same outcome data for each level of mating.
library(ggplot2)
# Fake data
set.seed(20)
d <- data.frame(
behv = c(rpois(100,10),rpois(100,100)),
mating=sort(rep(c("T1","T2"), 100)), # Changed from 200 to 100
condition = scale(rnorm(200,5))
)
# Model with quadratic condition
md <- glm(behv ~ mating + poly(condition, 2, raw=TRUE), data=d, family=poisson)
#summary(md)
# Get predictions at range of condition values
pred.data = data.frame(condition = rep(seq(min(d$condition), max(d$condition), length=50), 2),
mating = rep(c("T1","T2"), each=50))
pred.data$behv = predict(md, newdata=pred.data, type="response")
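The difference between the two prediction scales can be checked directly: for a Poisson model with the default log link, exponentiating the link-scale prediction gives the response-scale prediction. A small self-contained sketch, where d_demo, md_demo and nd_check are hypothetical names for a rebuilt copy of the example:

```r
set.seed(20)
d_demo <- data.frame(
  behv      = c(rpois(100, 10), rpois(100, 100)),
  mating    = rep(c("T1", "T2"), each = 100),
  condition = as.numeric(scale(rnorm(200, 5)))
)
md_demo <- glm(behv ~ mating + poly(condition, 2, raw = TRUE),
               data = d_demo, family = poisson)

# A few points at which to compare the two scales
nd_check <- data.frame(condition = c(-1, 0, 1), mating = "T1")

eta <- predict(md_demo, newdata = nd_check)                     # linear predictor (log scale)
mu  <- predict(md_demo, newdata = nd_check, type = "response")  # expected counts

all.equal(exp(eta), mu)
```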
Now plot with ggplot2 and with base R:
ggplot(d, aes(condition, behv, colour=mating)) +
geom_point() +
geom_line(data=pred.data)
plot(NULL, xlim=range(d$condition), ylim=range(d$behv),
xlab="Condition", ylab="behv")
with(subset(d, mating=="T1"), points(condition, behv, col="red"))
with(subset(d, mating=="T2"), points(condition, behv, col="blue"))
with(subset(pred.data, mating=="T1"), lines(condition, behv, col="red"))
with(subset(pred.data, mating=="T2"), lines(condition, behv, col="blue"))
legend(-3, 70, title="Mating", legend=c("T1","T2"), pch=1, col=c("red", "blue"))
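The model above is additive, so on the log scale both mating levels share the same curve shape and differ only by a vertical shift. If the curves should be allowed to differ in shape, an interaction between mating and the quadratic term is one option. This is a sketch on a hypothetical rebuilt data set (d_int), not part of the original answer:

```r
set.seed(20)
d_int <- data.frame(
  behv      = c(rpois(100, 10), rpois(100, 100)),
  mating    = rep(c("T1", "T2"), each = 100),
  condition = as.numeric(scale(rnorm(200, 5)))
)

# Additive model versus model with mating-specific quadratic curves
md_add <- glm(behv ~ mating + poly(condition, 2, raw = TRUE),
              data = d_int, family = poisson)
md_int <- glm(behv ~ mating * poly(condition, 2, raw = TRUE),
              data = d_int, family = poisson)

# Prediction grid: 50 condition values for each mating level
grid_int <- expand.grid(
  condition = seq(min(d_int$condition), max(d_int$condition), length.out = 50),
  mating    = c("T1", "T2")
)
grid_int$behv <- predict(md_int, newdata = grid_int, type = "response")

# Likelihood-ratio test of whether the interaction terms improve the fit
anova(md_add, md_int, test = "Chisq")
```

The grid_int data frame has the same columns as pred.data, so either plotting approach could be reused with it.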

coefplot in R with parts of independent variables

I would like to get a coefplot with only some of the independent variables. My regression equation is a fixed effects regression as follows:
aa1 <-glm(Eighty_Twenty ~ Market_Share_H+Market_Share_L+Purchase_Frequency_H+Purchase_Frequency_L+factor(product_group))
coefplot(aa1)
However, I do NOT want to plot coefficients of factor(product_group) variables since there are product groups. Instead, I would like a coefplot with only the coefficients of the other variables. How can I do this?
From the help pages (see ?coefplot.default) you can select which predictors or coefficients you want in your plot.
# some example data
df <- data.frame(Eighty_Twenty = rbinom(100,1,0.5),
Market_Share_H = runif(100),
Market_Share_L = runif(100),
Purchase_Frequency_H = rpois(100, 40),
Purchase_Frequency_L = rpois(100, 40),
product_group = sample(letters[1:3], 100, TRUE))
# model
aa1 <- glm(Eighty_Twenty ~ Market_Share_H+Market_Share_L +
Purchase_Frequency_H + Purchase_Frequency_L +
factor(product_group), df, family="binomial")
library(coefplot)
# coefficient plot with the intercept
coefplot(aa1, coefficients=c("(Intercept)","Market_Share_H","Market_Share_L",
"Purchase_Frequency_H","Purchase_Frequency_L"))
# coefficient plot specifying predictors (no intercept)
coefplot(aa1, predictors=c("Market_Share_H","Market_Share_L" ,
"Purchase_Frequency_H","Purchase_Frequency_L"))

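When there are many coefficients, typing the names to keep becomes tedious. A possible shortcut, sketched here in base R on hypothetical example data (df_demo, aa1_demo), is to filter names(coef()) with grep() and pass the result to the coefficients= argument shown in the coefplot answer above.

```r
set.seed(1)
df_demo <- data.frame(
  Eighty_Twenty        = rbinom(100, 1, 0.5),
  Market_Share_H       = runif(100),
  Market_Share_L       = runif(100),
  Purchase_Frequency_H = rpois(100, 40),
  Purchase_Frequency_L = rpois(100, 40),
  product_group        = sample(letters[1:3], 100, TRUE)
)
aa1_demo <- glm(Eighty_Twenty ~ Market_Share_H + Market_Share_L +
                  Purchase_Frequency_H + Purchase_Frequency_L +
                  factor(product_group),
                data = df_demo, family = "binomial")

# Keep every coefficient whose name does not start with "factor("
keep <- grep("^factor\\(", names(coef(aa1_demo)), value = TRUE, invert = TRUE)
keep

# With the coefplot package loaded, the selection could then be used as:
# coefplot(aa1_demo, coefficients = keep)
```
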
project a linear regression hyper plane to a 2d plot (abline-like)

I have this code
factors<-read.csv("India_Factors.csv",header=TRUE)
marketfactor<-factors[,4]
sizefactor<-factors[,5]
valuefactor<-factors[,6]
dati<-get.hist.quote("SI", quote = "AdjClose", compression = "m")
returns<-diff(dati)
regression<-lm(returns ~ marketfactor + sizefactor + valuefactor,na.action=na.omit)
that does multilinear regression.
I want to plot the returns against a factor on a 2D plane (this is trivial, of course) with the projection of the linear regression hyperplane for that specific factor superimposed. To be clearer, the result should look like this: wolfram demonstrations (see the snapshots).
Any help will be greatly appreciated.
Thank you for your time and have a nice week end.
Giorgio.
The points in my comment notwithstanding, here is the canonical way to generate output from a fitted model in R for combinations of predictors. It really isn't clear what the plots you want are showing, but the ones that make sense to me are partial plots, where one variable is varied over its range whilst holding the others at some common value. Here I use the sample mean when holding a variable constant.
First some dummy data, with only two covariates, but this extends to any number
set.seed(1)
dat <- data.frame(y = rnorm(100))
dat <- transform(dat,
x1 = 0.2 + (0.4 * y) + rnorm(100),
x2 = 2.4 + (2.3 * y) + rnorm(100))
Fit the regression model
mod <- lm(y ~ x1 + x2, data = dat)
Next some data values to predict at using the model. You could do all variables in a single prediction and then subset the resulting object to plot only the relevant rows. Alternatively, more clearly (though more verbose), you can deal with each variable separately. Below I create two data frames, one per covariate in the model. In a data frame I generate 100 values over the range of the covariate being varied, and repeat the mean value of the other covariate(s).
pdatx1 <- with(dat, data.frame(x1 = seq(min(x1), max(x1), length = 100),
x2 = rep(mean(x2), 100)))
pdatx2 <- with(dat, data.frame(x1 = rep(mean(x1), 100),
x2 = seq(min(x2), max(x2), length = 100)))
In the linear regression with straight lines, you really don't need 100 values --- the two end points of the range of the covariate will do. However for models where the fitted function is not linear you need to predict at more locations.
Next, use the model to predict at these data points
pdatx1 <- transform(pdatx1, yhat = predict(mod, pdatx1))
pdatx2 <- transform(pdatx2, yhat = predict(mod, pdatx2))
Now we are ready to draw the partial plots. First compute a range for the y axis - again it is mostly redundant here but if you are adding confidence intervals you will need to include their values below,
ylim <- range(pdatx1$yhat, pdatx2$yhat, dat$y)
To plot (here putting two figures on the same plot device) we can use the following code
layout(matrix(1:2, ncol = 2))
plot(y ~ x1, data = dat, ylim = ylim)
lines(yhat ~ x1, data = pdatx1, col = "red", lwd = 2)
plot(y ~ x2, data = dat, ylim = ylim)
lines(yhat ~ x2, data = pdatx2, col = "red", lwd = 2)
layout(1)
Which produces
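The answer mentions confidence intervals when computing the y-axis range. A minimal sketch of how they might be added for one covariate, using predict() with interval = "confidence" on a rebuilt copy of the dummy data (the names dat_ci, mod_ci and p1 are hypothetical):

```r
set.seed(1)
dat_ci <- data.frame(y = rnorm(100))
dat_ci <- transform(dat_ci,
                    x1 = 0.2 + (0.4 * y) + rnorm(100),
                    x2 = 2.4 + (2.3 * y) + rnorm(100))
mod_ci <- lm(y ~ x1 + x2, data = dat_ci)

# Vary x1 over its range, hold x2 at its sample mean
p1 <- with(dat_ci, data.frame(x1 = seq(min(x1), max(x1), length.out = 100),
                              x2 = mean(x2)))

# predict() returns a matrix with columns fit, lwr and upr
ci <- predict(mod_ci, newdata = p1, interval = "confidence")
p1 <- cbind(p1, ci)

# The y-axis range must now cover the interval bounds as well as the data
ylim <- range(p1$lwr, p1$upr, dat_ci$y)

plot(y ~ x1, data = dat_ci, ylim = ylim)
lines(fit ~ x1, data = p1, col = "red", lwd = 2)
lines(lwr ~ x1, data = p1, col = "red", lty = 2)
lines(upr ~ x1, data = p1, col = "red", lty = 2)
```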
