Troubles predicting fixed effects from a hierarchical GAM in mgcv

I have been fitting different hierarchical GAMs (hereafter: HGAM) using mgcv in R. I can extract and plot their predictions for their random effects without problems. Conversely, extracting and plotting their predictions for their fixed effects only works for some models, and I don't know why.
Here is a practical example, which refers to the color spectra of flowers from two species (Taxon) sampled at various localities (also discussed here):
rm(list=ls()) # wipe R's memory clean
library(pacman) # load packages, installing them from CRAN if needed
p_load(RCurl) # allows accessing data from URL
ss <- read.delim(text=getURL("https://raw.githubusercontent.com/marcoplebani85/datasets/master/flower_color_spectra.txt"))
head(ss)
ss$density <- ifelse(ss$density<0, 0, ss$density) # set spurious negative reflectance values to zero
ss$clr <- ifelse(ss$Taxon=="SpeciesB", "red", "black")
ss <- with(ss, ss[order(Locality, wl), ])
These are the mean color spectra at the population level for the two species (rolling means were used):
Each color refers to a different species. Each line refers to a different locality.
The following model is a HGAM of type G according to Pedersen et al.'s classification (2019) and it does not give any issues:
gam_G1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(Locality, bs="re"), # "re" is short for "random effect"
data = ss, method = 'REML',
family="quasipoisson"
)
# gam.check(gam_G1)
# k.check(gam_G1)
# MuMIn::AICc(gam_G1)
# gratia::draw(gam_G1)
# plot(gam_G1, pages=1)
# use gam_G1 to predict wl by Locality
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl", "Taxon", "Locality", "clr")])
# predict:
pred <- predict(object= gam_G1, newdata=nn, type="response", se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
# use gam_G1 to predict wl by Taxon
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl",
"Taxon",
"Locality",
"clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127
# predict:
pred <- predict(object = gam_G1,
type="response",
newdata=nn,
se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
R warns me that factor levels 0 not in original fit, but it executes the task without issues:
Left panel: gam_G1 predictions at the Locality level. Right panel: gam_G1 predictions for the fixed effects.
Troublesome models
The following model is a HGAM of type "GI" sensu Pedersen et al. (2019). It produces more accurate predictions at the Locality level, but I can only get NA as predictions at the level of fixed effects:
# GI: models with a global smoother for all observations,
# plus group-level smoothers, the wiggliness of which is estimated individually
start_time <- Sys.time()
gam_GI1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(wl, by = Locality, bs="tp", m=1)
# "tp" is short for "thin plate [regression spline]"
+ s(Locality, bs="re"),
family="quasipoisson",
data = ss, method = 'REML'
)
end_time <- Sys.time()
end_time - start_time # it took ~2.2 minutes on my computer
# gam.check(gam_GI1)
# k.check(gam_GI1)
# MuMIn::AICc(gam_GI1)
Attempt at drawing predictions for the fixed effects (Taxon and wl) according to gam_GI1:
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl",
"Taxon",
"Locality",
"clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127
# predict:
pred <- predict(object = gam_GI1,
type="response",
# exclude="c(Locality)",
# # this should turn random effect off
# # (doesn't work for me)
newdata=nn,
se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
head(nn)
# wl Taxon Locality clr fit se
# 1 298.34 SpeciesB 0 red NA NA
# 2 305.82 SpeciesB 0 red NA NA
# 3 313.27 SpeciesB 0 red NA NA
# 4 320.72 SpeciesB 0 red NA NA
# 5 328.15 SpeciesB 0 red NA NA
# 6 335.57 SpeciesB 0 red NA NA
Left panel: gam_GI1 predictions at the Locality level. Right panel (blank): gam_GI1 predictions for the fixed effects.
The following model, which includes a global smoother for all observations, plus group-level smoothers, all with the same "wiggliness", doesn't provide fixed-effect predictions either:
gam_GS1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(wl, by = Locality, bs="fs", m=1),
# "fs" is short for "factor-smoother [interaction]"
family="quasipoisson",
data = ss, method = 'REML'
)
Why don't gam_GI1 and gam_GS1 produce predictions for their fixed effects, and how can I obtain them?
The models can take a few minutes to run. To save time, their output can be downloaded from here as an RData file. My R scripts (which include the code for plotting the figures) are available here.

I think you are conflating several things here. The by trick to turn off random effects only works for bs = "re" smooths. Locality is a factor (otherwise your random effect isn't a random intercept), and setting it to 0 creates a new level (although it could be creating an NA, as 0 isn't among the original levels).
If what you want to do is turn off anything to do with Locality, you should use exclude; however, you have the invocation wrong. It isn't working because you are passing a character vector with a single element, "c(Locality)". This fails because c(Locality) doesn't relate to anything in your model. What you need to provide here is a vector of smooth names as printed by summary(). For example, to exclude the smooth s(Locality, bs = "re"), which {mgcv} knows as s(Locality), you would use exclude = "s(Locality)".
In your case, it is tedious to type out all the "s(wl):LocalityLevelX" labels for each smooth. As you have only two taxa, it would be easier to use the complementary argument terms, where you list the smooth labels that you want to include in the prediction. So you could do terms = c("s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC") or whatever summary() displays for these smooths.
You also need to include the Taxon term in terms, which I think needs to be:
terms = c("TaxonSpeciesB", "TaxonSpeciesC",
"s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC")
If you install and load my {gratia} package, you can use smooths(gam_GI1) to list all the smooth labels as far as {mgcv} knows them.
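As a minimal sketch of the exclude approach, here is a toy model fitted to simulated data. The data frame dat, the grouping factor g and the model m are hypothetical stand-ins, not the flower dataset. The labels are read from the fitted object, and the prediction no longer depends on which level of g is supplied in newdata:

```r
library(mgcv)

set.seed(1)
n <- 400
# Hypothetical data: one smooth covariate x and a grouping factor g
dat <- data.frame(
  x = runif(n),
  g = factor(sample(letters[1:8], n, replace = TRUE))
)
group_shift <- rnorm(nlevels(dat$g), sd = 0.5)
dat$y <- sin(2 * pi * dat$x) + group_shift[as.integer(dat$g)] + rnorm(n, sd = 0.3)

m <- gam(y ~ s(x) + s(g, bs = "re"), data = dat, method = "REML")

# Smooth labels exactly as mgcv stores them; these are what exclude expects
smooth_labels <- vapply(m$smooth, function(s) s$label, character(1))

# newdata must still contain g, using a level that exists in the fit
grid <- seq(0, 1, length.out = 50)
newd_a <- data.frame(x = grid, g = factor("a", levels = levels(dat$g)))
newd_b <- data.frame(x = grid, g = factor("b", levels = levels(dat$g)))

# With the random effect excluded, the chosen level of g is irrelevant
p_a <- predict(m, newdata = newd_a, exclude = "s(g)")
p_b <- predict(m, newdata = newd_b, exclude = "s(g)")
```

Note that the excluded variable is kept in newdata with a valid level rather than being set to 0.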
The by trick works like this:
gam(y ~ x + s(z) + s(id, bs = "re", by = dummy))
where dummy is set to a numeric value 1 when fitting and to 0 when you are predicting. As this is a numeric by variable you are multiplying the smooth by dummy and hence why setting it to 0 excludes the term. The reason why your code isn't working is because you really want separate smooths for wl for each Locality; Locality is an actual variable of interest in your data/model, not a dummy variable we create to achieve the aim of excluding a term from the model.
Hopefully now you can see why exclude and terms are much better solutions than this dummy trick.
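For comparison, the dummy trick described above can be sketched on simulated data as follows. The names dat, g, dummy and m_by are hypothetical. The random effect is multiplied by dummy, so it is switched on (1) during fitting and off (0) during prediction:

```r
library(mgcv)

set.seed(2)
n <- 400
# Hypothetical data with a grouping factor g
dat <- data.frame(
  x = runif(n),
  g = factor(sample(letters[1:8], n, replace = TRUE))
)
group_shift <- rnorm(nlevels(dat$g), sd = 0.5)
dat$y <- sin(2 * pi * dat$x) + group_shift[as.integer(dat$g)] + rnorm(n, sd = 0.3)

# Numeric by variable: 1 while fitting, so the random effect is estimated as usual
dat$dummy <- 1
m_by <- gam(y ~ s(x) + s(g, bs = "re", by = dummy), data = dat, method = "REML")

# Setting dummy to 0 multiplies the random effect by zero at prediction time
newd <- data.frame(x = seq(0, 1, length.out = 50),
                   g = factor("a", levels = levels(dat$g)),
                   dummy = 0)
p_off <- predict(m_by, newdata = newd)

# The same grid for another group gives the same prediction when dummy is 0
newd_b <- newd
newd_b$g <- factor("b", levels = levels(dat$g))
p_off_b <- predict(m_by, newdata = newd_b)
```

This only works because the term is a bs = "re" smooth built with the extra dummy variable; it is not a substitute for exclude or terms on other kinds of smooths.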
FYI, in bs = "tp", the "tp" doesn't mean tensor product smooth. It means thin plate regression spline (TPRS). You only get tensor product smooths through te(), t2(), or ti() terms.

Related

Understanding the iml (interpretable machine learning) output for a classification task

Consider this synthetic dataset for classification,
library(tidyverse)
library(iml)
library(randomForest)
# Generate data
set.seed(5)
x = matrix(rnorm(2000), nrow=500)
z = x %*% matrix(c(1,1,1,1), nrow=4)
y = round(1 / (1 + exp(-z)), 0) %>% as.integer()
x = cbind(x, rnorm(500))
y_factor = as.factor(y)
data = data.frame(x, y_factor)
# Train model
rf = randomForest(y_factor ~ X1+X2+X3+X4+X5, data=data, ntree = 50)
# Compute feature importance using iml package
x_df = data[,-6]
predictor_rf <- Predictor$new(rf, data=x_df, y=y_factor)
imp_rf <- FeatureImp$new(predictor_rf, loss = "ce")
plot(imp_rf)
Here, x is a matrix with 5 independent variables: 4 of them are related to the response, and the fifth is just noise. I then train a random forest and compute the variable importance using feature permutation from the iml package, obtaining the output in the figure below. The package manual says that:
The importance is measured as
the factor by which the model’s prediction error increases when the feature is shuffled.
So here, variable X4 obtained a feature importance value of 0.2, which means that the prediction error "increased" by a factor of 0.2. However, since 0.2 is a factor smaller than 1, this means that the prediction error actually decreased when that variable was permuted, which makes no sense to me: on the one hand, it would imply that randomly shuffled numbers give better results than the actual variable; on the other hand, the current model with the original variables achieves an accuracy of 100%. The same interpretation applies to the rest of the variables, except for X5, which was noise and obtained an importance of 0.
So... what am I missing here? What is that 0.2 value?

Plotting precision@k and recall@k in ROCR (R)

I'm evaluating a binary classifier in R with the ROCR package. My classifier outputs a score between 0 and 1 for target 0/1 labels.
I'd like to plot precision and recall @ k but can't find a way to do it. Calling performance() without specifying the x-axis measure plots the precision value by score cutoff:
library(ROCR)
#df <- a two-dimensional dataframe with prediction scores and actual labels of my classifier
pred <- prediction(df$score, df$label)
pr_curve <- performance(pred, measure="prec")
For precision (or recall) at k, I'd need to plot the precision against the rank of each prediction, ordered by descending score:
pred <- prediction(df$score, df$label)
pr_curve <- performance(pred, measure="prec", x.measure="rank") #but there seems to be no "rank" in ROCR!
Is there a way to do this in ROCR? I'm open to use alternative libraries if this isn't the case.
Load libraries and define train and test set:
library(mlbench)
library(e1071)
library(ROCR)
data(BreastCancer)
df = BreastCancer
idx = sample(1:nrow(df),150)
trn = df[idx,]
test = df[-idx,]
Fit naive Bayes:
fit = naiveBayes(Class ~ .,data=trn)
In the manual for performance, it is written,
Precision/recall graphs: measure="prec", x.measure="rec".
Plot precision-recall:
pred = prediction(predict(fit,test,type="raw")[,2],test$Class)
#plot to see it is working correctly:
plot(performance(pred,measure="prec",x.measure="rec"))
Now for your case to do it at K, we can also do the precision recall from scratch:
#combine prob, predicted labels, and actual labels
res = data.frame(prob=predict(fit,test,type="raw")[,2],
predicted_label=predict(fit,test),
label = test$Class)
res = res[order(res$prob,decreasing=TRUE),]
res$rank = 1:nrow(res)
# calculate recall, which is the number of actual classes we get back
res$recall = cumsum(res$label=="malignant")/sum(res$label=="malignant")
# precision, number of malignant cases we predicted correctly
res$precision = cumsum(res$label=="malignant")/res$rank
# check the two plots
par(mfrow=c(1,2))
plot(performance(pred,measure="prec",x.measure="rec"))
plot(res$recall,res$precision,type="l")
Now you have it correct, getting or plotting precision at K is simply:
par(mfrow=c(1,2))
with(res,
plot(rank,precision,main="self-calculated",type="l"))
plot(pred@n.pos.pred[[1]],
pred@tp[[1]]/(pred@fp[[1]]+pred@tp[[1]]),
type="l",main="from ROCR")
I am not aware of a way to do this with the .plot.performance function, but you can use the slots stored in the prediction object. pred@tp holds the true positives and pred@fp the false positives, so tp / (tp + fp) gives the precision, and pred@n.pos.pred is essentially the rank.
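If only a single cutoff is needed, the from-scratch calculation above can be wrapped in a small base R helper. This is a sketch: the function name precision_recall_at_k and the toy scores are made up for illustration and are not part of ROCR.

```r
# Precision and recall among the k highest-scoring predictions.
# score: numeric classifier scores; is_positive: logical, TRUE for the target class
precision_recall_at_k <- function(score, is_positive, k) {
  ord <- order(score, decreasing = TRUE)   # rank predictions by descending score
  hits <- cumsum(is_positive[ord])         # true positives within the top 1..n ranks
  c(precision = hits[k] / k,
    recall = hits[k] / sum(is_positive))
}

# Toy example with five predictions, three of which are truly positive
score <- c(0.9, 0.8, 0.7, 0.6, 0.5)
is_positive <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
res_at_3 <- precision_recall_at_k(score, is_positive, k = 3)
```

Here the top three predictions contain two of the three positives, so both precision and recall at k = 3 are 2/3. Ties in the scores are broken by input order, which may matter for real data.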

Simulate data for mixed-effects model with predefined parameter

I'm trying to simulate data for a model expressed with the following formula:
lme4::lmer(y ~ a + b + (1|subject), data) but with a set of given parameters:
a <- rnorm() measured at subject level (e.g nSubjects = 50)
y is measured at the observation level (e.g. nObs = 7 for each subject)
b <- rnorm() measured at observation level and correlated at a given r with a
variance ratio of the random effects in lmer(y ~ 1 + (1 | subject), data) is fixed at for example 50/50 or 10/90 (and so on)
some random noise is present (so that a full model does not explain all the variance)
effect size of the fixed effects can be set at a predefined level (e.g. dCohen=0.5)
I played with various packages like: powerlmm, simstudy or simr but still fail to find a working solution that will accommodate the amount of parameters I'd like to define beforehand.
Also for my learning purposes I'd prefer a base R method than a package solution.
The closest example I found is a blog post by Ben Ogorek "Hierarchical linear models and lmer" which looks great but I can't figure out how to control for parameters listed above.
Any help would be appreciated.
Also, if there is a package that I don't know of that can do these types of simulations, please let me know.
Some questions about the model definition:
How do we specify a correlation between two random vectors that are different lengths? I'm not sure: I'll sample 350 values (nObs*nSubject) and throw away most of the values for the subject-level effect.
Not sure about "variance ratio" here. By definition, the theta parameters (standard deviations of the random effects) are scaled by the residual standard deviation (sigma), e.g. if sigma=2, theta=2, then the residual std dev is 2 and the among-subject std dev is 4
Define parameter/experimental design values:
nSubjects <- 50
nObs <- 7
## means of a,b are 0 without loss of generality
sdvec <- c(a=1,b=1)
rho <- 0.5 ## correlation
betavec <- c(intercept=0,a=1,b=2)
beta_sc <- c(betavec[1], betavec[-1]*sdvec) ## keep intercept; scale slopes by sd (beta must have 3 elements)
theta <- 0.4 ## = 20/50
sigma <- 1
Set up data frame:
library(lme4)
set.seed(101)
## generate a, b variables
mm <- MASS::mvrnorm(nSubjects*nObs,
mu=c(0,0),
Sigma=matrix(c(1,rho,rho,1),2,2)*outer(sdvec,sdvec))
subj <- factor(rep(seq(nSubjects),each=nObs)) ## or ?gl
## sample every nObs'th value of a
avec <- mm[seq(1,nObs*nSubjects,by=nObs),"a"]
avec <- rep(avec,each=nObs) ## replicate
bvec <- mm[,"b"]
dd <- data.frame(a=avec,b=bvec,Subject=subj)
Simulate:
dd$y <- simulate(~a+b+(1|Subject),
newdata=dd,
newparams=list(beta=beta_sc,theta=theta,sigma=1),
family=gaussian)[[1]]
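Since the question asks for a base R method, here is a rough sketch of the same kind of simulation without lme4 or MASS. All names (icc, total_sd, sim, and so on) are hypothetical, and the "variance ratio" is interpreted here as the share of the random (non-fixed-effect) variance that lies between subjects, which is an assumption about what was meant:

```r
set.seed(101)
n_subjects <- 50
n_obs <- 7
rho <- 0.5                               # target correlation between a and b
beta <- c(intercept = 0, a = 1, b = 2)   # fixed-effect coefficients
icc <- 0.5                               # assumed between-subject share of random variance (50/50)
total_sd <- 1                            # sd of random intercept plus residual combined

sd_subject <- total_sd * sqrt(icc)       # among-subject standard deviation
sd_resid <- total_sd * sqrt(1 - icc)     # residual standard deviation

subject <- rep(seq_len(n_subjects), each = n_obs)

# a is drawn once per subject and repeated for each of its observations
a <- rnorm(n_subjects)[subject]
# b varies per observation and has correlation rho with a (both have unit variance)
b <- rho * a + sqrt(1 - rho^2) * rnorm(n_subjects * n_obs)

u <- rnorm(n_subjects, sd = sd_subject)  # random intercepts
y <- beta[["intercept"]] + beta[["a"]] * a + beta[["b"]] * b +
  u[subject] + rnorm(n_subjects * n_obs, sd = sd_resid)

sim <- data.frame(subject = factor(subject), a = a, b = b, y = y)
```

Fitting lme4::lmer(y ~ a + b + (1 | subject), data = sim) to the result should recover values close to beta, sd_subject and sd_resid, up to sampling error. Turning a standardized effect size such as Cohen's d into coefficients would need an additional convention that is not covered here.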

Linear predictions always the same regardless of features in model

I'm using the Ames, Iowa housing prices data set.
I have a train set and test set. The test set is missing the dependent variable SalePrice. (No column for SalePrice exists).
I have done a linear model and now am trying to predict the Sale Price values on the test set. But when doing so, I always get these same predicted values for SalePrice, regardless of the model used.
Then when trying to calculate RMSE, I get NA.
Here is my model:
lm2 <- lm(SalePrice ~
GarageCars +
Neighborhood +
I(OverallQual^2) + OverallQual +
OverallQual*GrLivArea +
log2(LotArea) +
log2(GrLivArea) +
KitchenQual +
I(TotalBsmtSF^2) +
TotalBsmtSF
, data=train)
# Add an empty column to the test set,
# to be later filled in by predictions
# (Is this even necessary?):
test[, "SalePrice"] <- NA
# My predictions:
predictions <- predict(lm2, newdata = test)
head(predictions)
1 2 3 4 5 6
121093.5 170270.7 170029.5 187012.1 239359.2 172962.1
I always get these same values regardless of the model used. I suspect I'm just not understanding predict(). I suspect I am only getting the predicted values based on my train set rather than on my test set.
I know that the variable names need to match exactly those used in the model, but what other aspect of predict am I not understanding? Do I need to perform the same predictor variable transformations in the test set? Must I create variables to hold them?
Then I calculate the RMSE:
# Formula function for calculating RMSE:
rmse <- function(actual, pred) sqrt(mean((actual-pred)^2))
# Calculate rmse on test set:
rmse(test$SalePrice, predictions)
[1] NA
Could you please tell me what I'm doing wrong? Let me know if you need to see the data.

linear predictor - ordered probit (ordinal, clm)

I have a question regarding the ordinal package in R, specifically the predict.clm() function. I would like to calculate the linear predictor of an ordered probit estimation. With the polr function of the MASS package, the linear predictor can be accessed via object$lp. It gives me one value for each row, which matches my understanding of the linear predictor, namely X_i'beta. If, however, I use predict.clm(object, newdata, "linear.predictor") on an ordered probit estimation fitted with clm(), I get a list with the elements eta1 and eta2,
with one column each, if the newdata contains the dependent variable
where each element contains as many columns as levels in the dependent variable, if the newdata doesn't contain the dependent variable
Unfortunately I don't have a clue what that means. Also in the documentations and papers of the author I don't find any information about it. Would one of you be so nice to enlighten me? This would be great.
Cheers,
AK
UPDATE (after comment):
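The basic clm model is defined like this (see the clm tutorial for details): P(Y_i <= j) = F(theta_j - x_i'beta), where theta_j are the thresholds, beta the regression coefficients, and F the inverse link function (the standard normal CDF for an ordered probit).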
Generating data:
library(ordinal)
set.seed(1)
test.data = data.frame(y=gl(4,5),
x=matrix(c(sample(1:4,20,T)+rnorm(20), rnorm(20)), ncol=2))
head(test.data) # two independent variables
test.data$y # four levels in y
Constructing models:
library(MASS) # provides polr
fm.polr <- polr(y ~ x.1 + x.2, data = test.data) # using polr
fm.clm <- clm(y ~ x.1 + x.2, data = test.data) # using clm
Now we can access thetas and betas (see formula above):
# Thetas
fm.polr$zeta # using polr
fm.clm$alpha # using clm
# Betas
fm.polr$coefficients # using polr
fm.clm$beta # using clm
Obtaining linear predictors (only parts without theta on the right side of the formula):
fm.polr$lp # using polr
apply(test.data[,2:3], 1, function(x) sum(fm.clm$beta*x)) # using clm
New data generation:
# Contains only independent variables
new.data <- data.frame(x=matrix(c(rnorm(10)+sample(1:4,10,T), rnorm(10)), ncol=2))
new.data[1,] <- c(0,0) # intentionally for demonstration purpose
new.data
There are four types of predictions available for a clm model. We are interested in type="linear.predictor", which returns a list with two matrices: eta1 and eta2. They contain linear predictors for each observation in new.data:
lp.clm <- predict(fm.clm, new.data, type="linear.predictor")
lp.clm
Note 1: eta1 and eta2 contain the same values; eta2 is just eta1 shifted by one position in the j index. As a result, they leave the right and left ends of the linear predictor scale open, respectively.
all.equal(lp.clm$eta1[,1:3], lp.clm$eta2[,2:4], check.attributes=FALSE)
# [1] TRUE
Note 2: The prediction for the first row of new.data is equal to the thetas (because we set this row to zeros).
all.equal(lp.clm$eta1[1,1:3], fm.clm$alpha, check.attributes=FALSE)
# [1] TRUE
Note 3: We can manually construct such predictions. For instance, prediction for second line in new.data:
second.line <- fm.clm$alpha - sum(fm.clm$beta*new.data[2,])
all.equal(lp.clm$eta1[2,1:3], second.line, check.attributes=FALSE)
# [1] TRUE
Note 4: If new.data contains response variable, then predict returns only linear predictor for specified level of y. Again we can check it manually:
new.data$y <- gl(4,3,length=10)
lp.clm.y <- predict(fm.clm, new.data, type="linear.predictor")
lp.clm.y
lp.manual <- sapply(1:10, function(i) lp.clm$eta1[i,new.data$y[i]])
all.equal(lp.clm.y$eta1, lp.manual)
# [1] TRUE
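To make the roles of eta1 and eta2 from Note 1 more concrete, here is a small base R sketch with made-up thresholds and coefficients (theta, beta and X are hypothetical, not taken from the fitted models above). It assumes a logit link, which is the clm default; for an ordered probit, pnorm would replace plogis. eta1 holds the upper limit and eta2 the lower limit of each response category on the linear predictor scale, so the category probabilities are the differences of the link CDF evaluated at the two:

```r
theta <- c(-1, 0.5, 2)               # hypothetical thresholds for 4 response levels
beta <- c(0.8, -0.3)                 # hypothetical regression coefficients
X <- rbind(c(0, 0), c(1.2, -0.5))    # two new observations (first row all zeros)

xb <- drop(X %*% beta)               # x_i'beta for each observation

# Upper limits theta_j - x_i'beta, open on the right (last column is Inf)
eta1 <- outer(-xb, c(theta, Inf), "+")
# Lower limits theta_(j-1) - x_i'beta, open on the left (first column is -Inf)
eta2 <- outer(-xb, c(-Inf, theta), "+")

# P(Y_i = j) = F(eta1_ij) - F(eta2_ij), with F the logistic CDF here
prob <- plogis(eta1) - plogis(eta2)
```

Each row of prob sums to 1, and the first row of eta1 reproduces the thresholds, mirroring Notes 1 and 2.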
