Extracting the Model Object in R from str()

I have a logit model object fit using glm2. The predictors are continuous and time-varying, so I am using basis splines. When I call predict(FHlogit, foo..,) on the model object it provides a prediction. All is well.
Now, what I would like to do is extract the part of FHlogit and the basis matrix that provides the prediction. I do not want to extract information about the model from str(FHlogit); I am trying to extract the part that says Beta * Predictor = 2, so that I can manipulate the basis matrix for each predictor.

I don't think using basis splines will affect this. If it does, please provide a reproducible example.
Here's a simple case:
df1 <- data.frame(y=c(0,1,0,1),
                  x1=seq(4),
                  x2=c(1,3,2,6))
library(glm2)
g1 <- glm2(y ~ x1 + x2, data=df1)
### default for type is "link"
> stats::predict.glm(g1, type="link")
          1           2           3           4
 0.23809524  0.66666667 -0.04761905  1.14285714
Now, being unsure how these numbers were arrived at, we can look at the source for the above with predict.glm. We can see that type="link" is the simplest case, returning
pred <- object$linear.predictors # object is g1 in this case
These values are the predictions resulting from the original data multiplied by the coefficients, which we can verify with, e.g.
all.equal(unname(predict.glm(g1, type="link")[1]),
          unname(coef(g1)[1] + coef(g1)[2]*df1[1, 2] + coef(g1)[3]*df1[1, 3]))
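If the goal is the whole Beta * Predictor piece rather than a single row, the design (basis) matrix can be multiplied by the coefficient vector directly. A minimal sketch (glm2 is documented to return an object with the same structure as glm, so model.matrix applies):
X <- model.matrix(g1)   # rows = observations, columns = model/basis terms
drop(X %*% coef(g1))    # reproduces predict(g1, type="link")
With spline predictors, the columns of X are the evaluated basis functions, which is exactly the matrix to manipulate per predictor.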


Predict function returns Fitted value even though I put newdata

I'm writing a model validation testing function of my own.
In doing so, I let
a = entire set of predictor variables in the model-building set
b = the response variable in the model-building set
c = entire set of predictor variables in the validation set
d = the response variable in the validation set
e = the numbers of the columns I am interested in
This is based on the book Applied Linear Regression Models by Kutner, so I used library(ALSM).
In my case, the model-building set is SurgicalUnit and the validation set is SurgicalUnitAdditional.
Both datasets consist of 10 columns: the 1st through 8th columns are the entire set of independent variables, the 9th is the response variable, and the 10th is the log(response variable).
So,
a=SurgicalUnit[,1:8]; b=SurgicalUnit[,10];
c=SurgicalUnitAdditional[,1:8]; d=SurgicalUnitAdditional[,10]; e=c(1,2,3,8)
since I want to fit with the logged response variable, and I want to regress on variables x1, x2, x3 and x8.
(Please note that the reason why I used the "entire" set of independent variables with specific column numbers, instead of putting the set of independent variables of interest directly, is that I need to obtain Mallows' Cp in my function at once.)
So my regression is asdf=lm(b~as.matrix(a[e])). The problem is, I want to predict the validation set with models built on the model-building set. So I let preds=data.frame(c[e]) and finally call predict(asdf, newdata=preds), but this is equal to predict(asdf), which means it returns the fitted values of asdf.
Why doesn't predict work? Help will be appreciated.
Below is my function
mod.valid=function(a,b,c,d,e){
  asdf=lm(b~as.matrix(a[e]))          # the model of interest
  qwer=lm(b~as.matrix(a[1:max(e)]))   # full model, needed for Cp
  mat=round(coef(summary(asdf))[,c(-3,-4)],4); mat2=matrix(0,5,2)
  mat=rbind(mat,mat2); mat            # matrix for coefficients and other statistics (model-building)
  n=nrow(anova(asdf)); m=nrow(anova(qwer))
  nn=length(b)                        # sample size
  p=asdf$rank                         # number of parameters p
  cp=anova(asdf)$Sum[n] / (anova(qwer)$Mean[m]) - (nn-2*p); cp=round(cp,4)
  mat[p+1,1]=p; mat[p+1,2]=cp         # adding p and Cp
  rp=summary(asdf)$r.squared; rap=summary(asdf)$adj.r.squared; rp=round(rp,4); rap=round(rap,4)
  mat[p+2,1]=rp; mat[p+2,2]=rap       # adding Rp2 and Rap2
  sse=anova(asdf)$Sum[n]; pre=MPV::PRESS(asdf); sse=round(sse,4); pre=round(pre,4)
  mat[p+3,1]=sse; mat[p+3,2]=pre      # adding SSE and PRESS
  preds=data.frame(c[e]); predd=predict(asdf,newdata=preds)   # I got the problem here!
  mspr=sum((d-predd)^2) / length(d); mse=anova(asdf)$Mean[n]; mspr=round(mspr,4); mse=round(mse,4)
  mat[p+4,1]=mse; mat[p+4,2]=mspr     # adding MSE and MSPR
  aic=nn*log(anova(asdf)$Sum[n]) - nn*log(nn) + 2*p; aic=round(aic,4)
  bic=nn*log(anova(asdf)$Sum[n]) - nn*log(nn) + log(nn)*p; bic=round(bic,4)
  mat[p+5,1]=aic; mat[p+5,2]=bic      # adding AIC and BIC
  rownames(mat)[p+1]="p&Cp"; rownames(mat)[p+2]="Rp.sq&Rap.sq"
  rownames(mat)[p+3]="SSE&PRESS"; rownames(mat)[p+4]="MSE&MSPR"; rownames(mat)[p+5]="AIC&BIC"
  asdf2=lm(d~as.matrix(c[e]))
  qwer2=lm(d~as.matrix(c[1:max(e)]))
  matt=round(coef(summary(asdf2))[,c(-3,-4)],4); matt2=matrix(0,5,2)
  matt=rbind(matt,matt2); matt        # matrix for coefficients and other statistics (validation)
  n2=nrow(anova(asdf2)); m2=nrow(anova(qwer2))
  nn2=length(d)                       # sample size
  p2=asdf2$rank                       # number of parameters p
  cp2=anova(asdf2)$Sum[n2] / (anova(qwer2)$Mean[m2]) - (nn2-2*p2); cp2=round(cp2,4)
  matt[p2+1,1]=p2; matt[p2+1,2]=cp2   # adding p and Cp
  rp2=summary(asdf2)$r.squared; rap2=summary(asdf2)$adj.r.squared; rp2=round(rp2,4); rap2=round(rap2,4)
  matt[p2+2,1]=rp2; matt[p2+2,2]=rap2 # adding Rp2 and Rap2
  sse2=anova(asdf2)$Sum[n2]; pre2=MPV::PRESS(asdf2); sse2=round(sse2,4); pre2=round(pre2,4)
  matt[p2+3,1]=sse2; matt[p2+3,2]=pre2 # adding SSE and PRESS
  mse2=anova(asdf2)$Mean[n2]; mse2=round(mse2,4)
  matt[p2+4,1]=mse2; matt[p2+4,2]=NA  # adding MSE; MSPR is not defined for the validation fit
  aic2=nn2*log(anova(asdf2)$Sum[n2]) - nn2*log(nn2) + 2*p2; aic2=round(aic2,4)
  bic2=nn2*log(anova(asdf2)$Sum[n2]) - nn2*log(nn2) + log(nn2)*p2; bic2=round(bic2,4)
  matt[p2+5,1]=aic2; matt[p2+5,2]=bic2 # adding AIC and BIC
  mat=cbind(mat,matt); colnames(mat)=c("Estimate","Std.Error","Val.Estimate","Val.Std.Error")
  print(mat)
}
This function provides useful statistics for model validation.
It returns a matrix with the coefficients, p, Mallows' Cp, R.squared, R.adj.squared, SSE, PRESS, MSE, MSPR, AIC and BIC.
Everything works fine for the given data, except for MSPR, since the predict function doesn't work: it only returns the fitted values.
Can you try something like this? You have to make sure that both the training and test data have the same column names.
x <- rnorm(100)
y <- x + rnorm(100)
df <- data.frame(x = x, y=y)
# model fitting
fit <- lm(y ~ x, data=df)
predict(fit)
# creating new data
newx <- rnorm(50)
newdf <- data.frame(x = newx)
# making predictions
predict(fit, newdata = newdf)
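For the function above, the failure comes from fitting with b~as.matrix(a[e]): the predictor's name inside the model is literally as.matrix(a[e]), so nothing in newdata can ever match it, and predict falls back to the fitted values. A sketch of a fix, assuming a, b, c and e as defined in the question:
# Hypothetical rework of the fitting step: keep predictors in a named data frame
train <- data.frame(a[e])                # model-building predictors
train$b <- b                             # attach the response
asdf <- lm(b ~ ., data=train)
preds <- data.frame(c[e])                # validation predictors, same column names
predd <- predict(asdf, newdata=preds)    # now genuinely uses the new data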

Generating a spatial prediction (raster) from a GLMM with a random intercept and a quadratic term

As indicated in the title, I am trying to generate a predictive raster depicting the relative probability of use. To create a reproducible example, I have used the MaungaWhau data from the maxlike package. MaungaWhau is a list that contains two raster layers as well as a set of 1000 xy locations. So, with these data and packages...
library(maxlike)
library(lme4)
data(MaungaWhau)
we can make two raster layers, a raster stack, as well as a SpatialPoints object.
elev <- raster(MaungaWhau$elev, xmn=0, xmx=61, ymn=0, ymx=87)
precip <- raster(MaungaWhau$precip, xmn=0, xmx=61, ymn=0, ymx=87)
rs <- stack(elev, precip)
PointDat <- SpatialPoints(MaungaWhau$xy)
I then make a new data frame that contains IndID, the unique ID for each individual (AAA - DDD); Used, the binary response variable indicating whether the point was used or not (1 or 0, respectively); as well as the Elev and Precip values for each point from the SpatialPoints object.
df <- data.frame(IndID = sample(c("AAA", "BBB", "CCC", "DDD"), 1000, replace = TRUE),
                 Used = sample(c(0, 1), 1000, replace = TRUE),
                 Elev = extract(elev, PointDat),
                 Precip = extract(precip, PointDat))
> head(df)
IndID Used Elev Precip
1 DDD 0 0.3798393 0.6405494
2 DDD 1 0.8830846 1.1174869
3 AAA 0 1.9282864 0.9641432
4 DDD 0 1.5024634 0.4695881
5 BBB 1 1.3089075 -0.1341483
6 BBB 1 0.5733952 0.6246699
I then build a resource selection function (RSF) model and specify IndID as a random effect. Notice also that I included a quadratic term for Elev.
#Make model
Mod <- glmer(Used ~ Elev + I(Elev^2) + Precip + (1|IndID), family=binomial, data = df, verbose = 1)
summary(Mod)
I am not interested in interpretation, given the made-up used and available points. My first question is more of a confirmation. The raster package vignette states that "The names in the Raster object should exactly match those expected by the model." In the instance of the quadratic term fit with I(Elev^2), am I correct that predict will be 'looking' for Elev? This seems to be the case, as there is no error associated with Elev in the predict code below.
Secondly, how do I deal with the random intercept term (1|IndID)? I am interested in marginal predictions and do not need to account for individuals.
With the following code
#Change names of layers in raster stack to match model
names(rs) <- c("Elev", "Precip")
Pred <- predict(rs, Mod)
I get the error:
Error in eval(expr, envir, enclos) : object 'IndID' not found
Is it possible to generate a marginal prediction for the 'typical' individual without passing the IndID covariate to the predict function? In other words, I want to ignore the IndID term and the associated individual adjustments to the intercept when making the prediction surface.
The predict function for lme4 (merMod) objects makes conditional predictions by default.
To make marginal/unconditional predictions, you need to make use of the re.form argument. Your code would look like:
Pred <- predict(rs, Mod, re.form = NA)
If you also wanted predictions done on the scale of the response variable, you can use the type argument. See the help page for more details on the available arguments, ?predict.merMod.
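Putting the two arguments together (a minimal sketch; raster's predict passes extra arguments through to the model's predict method):
# Marginal predictions on the probability (response) scale
Pred <- predict(rs, Mod, re.form=NA, type="response")
plot(Pred)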
Rather than relying on the predict function, you can generate predictions manually, first making an object holding the fixed-effect coefficients
betas <- fixef(Mod)
and then generate predictions by multiplying each raster by the respective beta coefficient.
Pred <- betas[1] + (elev * betas[2]) + (elev^2 * betas[3]) + (precip * betas[4])
plot(Pred)
It is then easy to add or remove the intercept and manually specify a link function (e.g. logit).
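For the logit link used here, that back-transformation is the inverse logit, which works with ordinary raster arithmetic. A quick sketch:
# Inverse logit: convert the linear predictor to a probability surface
PredProb <- 1 / (1 + exp(-Pred))
plot(PredProb)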

Local prediction modelling approach in R

I am trying to develop a local model (PLSR) which predicts a query sample using a model built on its 10 most similar samples, with the code below (not the full model yet, just a part of it). I got stuck when trying to predict the query sample (second-to-last line). The model is actually predicting something ("prd"), but not the query sample!
Here is my code:
require("pls")
set.seed(10000) # generate some sample data
mat <- replicate(100, rnorm(100))
y <- mat[, 1, drop=FALSE]  # keep the response as a one-column matrix
x <- mat[,2:100]
eD <- dist(x, method="euclidean") # create a distance matrix
eDm <- as.matrix(eD)
Looping over all 100 samples and extracting their 10 most similar samples for subsequent model building and prediction of query sample:
for (i in 1:nrow(eDm)) {
  kni <- head(order(eDm[,i]),11)[-1] # the 10 most similar samples to sample i
  pls1 <- plsr(y[kni,] ~ x[kni,], ncomp=5, validation="CV") # run plsr on selected samples
  prd <- predict(pls1, ncomp=5, newdata=x[[i]]) # predict query sample ==> I suspect there is something wrong with this expression: newdata=x[[i]]
}
I can't figure out how to address the query sample properly - many thanks in advance for any help!
Best regards,
Chega
You are going to run into all sorts of pain building models with formulae like that. Also, x[[i]] isn't doing what you think it is: you usually need to supply a data frame to these modelling functions, although in this case a matrix seems fine too.
I get all your code working OK if I use:
prd <- predict(pls1, ncomp=5, newdata=x[i, ,drop = FALSE])
giving
> predict(pls1, ncomp=5, newdata=x[i,,drop = FALSE])
, , 5 comps
y[kni, ]
[1,] 0.6409897
What you were seeing with your code are the fitted values for the training data.
> fitted(pls1)[, , 5, drop = FALSE]
, , 5 comps
y[kni, ]
1 0.1443274
2 0.2706769
3 1.1407780
4 -0.2345429
5 -1.0468221
6 2.1353091
7 0.8267103
8 3.3242296
9 -0.5016016
10 0.6781804
This is the convention in R when you either don't supply newdata or the object you supply makes no sense or doesn't contain the covariates required to generate predictions.
I would have fitted the model as follows:
pls1 <- plsr(y ~ x, ncomp=5, validation="CV", subset = kni)
where I use the subset argument for its intended purpose: to select the rows of the input data to fit the model with. You get nicer output from the models (the labels use y instead of y[kni, ], etc.), plus this general convention will serve you well in other modelling tools, where R will expect newdata to be a data frame with names exactly the same as those mentioned in the model formula. In your case, with your code, that would mean creating a data frame with names like x[kni, ], which is not easy to do, for good reason!
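Putting that together, the whole loop could be rewritten as follows (a sketch on the same simulated data; predict.mvr also accepts a plain matrix for newdata when its columns match the predictors):
prd <- numeric(nrow(eDm))
for (i in 1:nrow(eDm)) {
  kni <- head(order(eDm[,i]), 11)[-1]   # the 10 most similar samples
  pls1 <- plsr(y ~ x, ncomp=5, validation="CV", subset=kni)
  prd[i] <- predict(pls1, ncomp=5, newdata=x[i, , drop=FALSE])
}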

plot multiple fit and predictions for logistic regression

I am running a logistic regression multiple times over more than 1000 samples taken from a dataset. My question is: what is the best way to show my results? How can I plot my outputs for both the fit and the prediction curve?
This is an example of what I am doing, using the baseball dataset from the corrgram package. For example, I want to fit and predict the model 5 times. Each time I take one sample out (for the prediction) and use another for the fit.
library(corrgram)
data(baseball)
# Exclude rows with NA values
dataset=baseball[complete.cases(baseball),]
# Create vector replacing the League (A or N) by 1 or 0
PA=rep(0,dim(dataset)[1])
PA[which(dataset[,2]=="A")]=1
# Model whether the player is in league A as a function of Hits, Runs, Errors and Salary
fit_glm_list=list()
prd_glm_list=list()
for (k in 1:5){
  sp=sample(seq(1:length(PA)),30,replace=FALSE)
  fit_glm<-glm(PA[sp[1:15]]~baseball$Hits[sp[1:15]]+baseball$Runs[sp[1:15]]+baseball$Errors[sp[1:15]]+baseball$Salary[sp[1:15]])
  prd_glm<-predict(fit_glm,baseball[sp[16:30],c(6,8,20,21)])
  fit_glm_list[[k]]=fit_glm;prd_glm_list[[k]]=fit_glm
}
There are a number of issues here.
PA is a subset of baseball$League but the model is constructed on columns from the whole baseball data frame, i.e. they do not match.
PA is treated as a continuous response when using the default family (gaussian); it should be changed to a factor with the binomial family.
prd_glm_list[[k]]=fit_glm should probably be prd_glm_list[[k]]=prd_glm
You must save the true class labels for the predictions otherwise you have nothing to compare to.
My take on your code looks like this.
library(corrgram)
data(baseball)
dataset <- baseball[complete.cases(baseball),]
fits <- preds <- truths <- vector("list", 5)
for (k in 1:5){
  sp <- sample(nrow(dataset), 30, replace=FALSE)
  fits[[k]] <- glm(League ~ Hits + Runs + Errors + Salary,
                   family="binomial", data=dataset[sp[1:15],])
  preds[[k]] <- predict(fits[[k]], dataset[sp[16:30],], type="response")
  truths[[k]] <- dataset$League[sp[16:30]]
}
plot(unlist(truths), unlist(preds))
The model performs poorly but at least the code runs without problems. The y-axis in the plot shows the estimated probabilities that the examples belong to league N, i.e. ideally the left box should be close to 0 and the right close to 1.
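To put a number on that comparison, the probabilities can be thresholded and tabulated against the saved labels. A sketch, using an arbitrary 0.5 cutoff:
# Hard class labels at a 0.5 cutoff; the model estimates P(League == "N")
pred_class <- ifelse(unlist(preds) > 0.5, "N", "A")
table(truth=unlist(truths), prediction=pred_class)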

Create function to automatically create plots from summary(fit <- lm(y ~ x1 + x2 + ... + xn))

I am running the same regression, with small alterations of the x variables, several times. Having determined the fit and significance of each variable for this linear regression model, my aim is to view all the major plots. Instead of having to create each plot one by one, I want a function to loop through my variables (x1...xn) from the following fit.
fit <- lm(y ~ x1 + x2 + ... + xn)
The plots I want to create for all x are:
1) x versus y, for all x in the function above
2) x versus predicted y
3) x versus residuals
4) x versus time, where time is not a variable used in the regression but is provided in the data frame the data comes from.
I know how to access the coefficients from fit; however, I am not able to use the coefficient names from the summary and reuse them in a function for creating the plots, as the names are characters.
I hope my question has been clearly described and hasn't been asked already.
Thanks!
Create some mock data
dat <- data.frame(x1=rnorm(100), x2=rnorm(100,4,5), x3=rnorm(100,8,27),
                  x4=rnorm(100,-6,0.1), t=(1:100)+runif(100,-2,2))
dat <- transform(dat, y=x1+4*x2+3.6*x3+4.7*x4+rnorm(100,3,50))
Make the fit
fit <- lm(y~x1+x2+x3+x4, data=dat)
Compute the predicted values
dat$yhat <- predict(fit)
Compute the residuals
dat$resid <- residuals(fit)
Get a vector of the variable names
vars <- names(coef(fit))[-1]
A plot can be made using this character representation of a name if you use it to build a string version of a formula and translate that with as.formula. The four plots are below, wrapped in a loop over all the vars. Additionally, this is surrounded by setting ask to TRUE so that you get a chance to see each plot. Alternatively, you can arrange multiple plots on the screen, or write them all to files to review later.
opar <- par(ask=TRUE)
for (v in vars) {
  plot(as.formula(paste("y~",v)), data=dat)
  plot(as.formula(paste("yhat~",v)), data=dat)
  plot(as.formula(paste("resid~",v)), data=dat)
  plot(as.formula(paste("t~",v)), data=dat)
}
par(opar)
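Since the question asked for a function, the same loop can be wrapped up for reuse. A sketch (the name plot_fit_vars is mine), assuming the data frame already carries the y, yhat, resid and t columns built above:
plot_fit_vars <- function(fit, data) {
  vars <- names(coef(fit))[-1]   # predictor names, dropping the intercept
  opar <- par(ask=TRUE)
  on.exit(par(opar))             # restore plotting state on exit
  for (v in vars) {
    plot(as.formula(paste("y ~", v)), data=data)
    plot(as.formula(paste("yhat ~", v)), data=data)
    plot(as.formula(paste("resid ~", v)), data=data)
    plot(as.formula(paste("t ~", v)), data=data)
  }
}
plot_fit_vars(fit, dat)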
The coefficients are stored in the fit objects as you say, but you can access them generically in a function by referring to them this way:
x <- 1:10
y <- x*3 + rnorm(1)
plot(x,y)
fit <- lm(y~x)
fit$coefficient[1] # intercept
fit$coefficient[2] # slope
str(fit) # a lot of info, but you can see how the fit is stored
My guess is that when you say you know how to access the coefficients, you are getting them from summary(fit), which is a bit harder to access than taking them directly from the fit. By using fit$coeff[1] etc. you don't have to have the name of the variable in your function.
Three options to directly answer what I think was the question: How to access the coefficients using character arguments:
x <- 1:10
y <- x*3 + rnorm(1)
fit <- lm(y~x)
# 1
fit$coefficient["x"]
# 2
coefname <- "x"
fit$coefficient[coefname]
# 3
coef(fit)[coefname]
If the question was how to plot the various functions, then you should supply a sufficiently complex construction (in R) to allow demonstration of the methods with a well-specified set of objects.
