I have a question regarding the ordinal package in R, specifically regarding the predict.clm() function. I would like to calculate the linear predictor of an ordered probit estimation. With the polr function of the MASS package, the linear predictor can be accessed via object$lp. It gives one value per observation, which matches my understanding of the linear predictor, namely x_i'beta. If I instead use predict.clm(object, newdata, "linear.predictor") on an ordered probit estimation with clm(), I get a list with the elements eta1 and eta2,
- with one column each, if newdata contains the dependent variable;
- where each element contains as many columns as there are levels in the dependent variable, if newdata doesn't contain the dependent variable.
Unfortunately I don't have a clue what that means, and I can't find any information about it in the documentation or in the author's papers. Would one of you be so kind as to enlighten me? That would be great.
Cheers,
AK
UPDATE (after comment):
The basic clm model is defined like this (see the clm tutorial for details):
link(P(Y_i <= j)) = theta_j - x_i'beta,   j = 1, ..., J-1
Generating data:
library(ordinal)
library(MASS) # for polr()
set.seed(1)
test.data = data.frame(y=gl(4,5),
x=matrix(c(sample(1:4,20,T)+rnorm(20), rnorm(20)), ncol=2))
head(test.data) # two independent variables
test.data$y # four levels in y
Constructing models:
fm.polr <- polr(y ~ x.1 + x.2, data=test.data) # using polr
fm.clm  <-  clm(y ~ x.1 + x.2, data=test.data) # using clm
Now we can access thetas and betas (see formula above):
# Thetas
fm.polr$zeta # using polr
fm.clm$alpha # using clm
# Betas
fm.polr$coefficients # using polr
fm.clm$beta # using clm
Obtaining the linear predictors (only the x_i'beta part, without theta):
fm.polr$lp # using polr
apply(test.data[,2:3], 1, function(x) sum(fm.clm$beta*x)) # using clm
New data generation:
# Contains only independent variables
new.data <- data.frame(x=matrix(c(rnorm(10)+sample(1:4,10,T), rnorm(10)), ncol=2))
new.data[1,] <- c(0,0) # intentionally for demonstration purpose
new.data
There are four types of predictions available for a clm model. We are interested in type="linear.predictor", which returns a list with two matrices, eta1 and eta2. They contain the linear predictors for each observation in new.data:
lp.clm <- predict(fm.clm, new.data, type="linear.predictor")
lp.clm
Note 1: eta1 and eta2 contain essentially the same values; eta2 is just eta1 shifted by one in the j index (eta1 holds theta_j - x_i'beta, eta2 holds theta_{j-1} - x_i'beta). Thus they leave the left and the right end of the linear predictor scale open, respectively.
all.equal(lp.clm$eta1[,1:3], lp.clm$eta2[,2:4], check.attributes=FALSE)
# [1] TRUE
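This pairing is what the fitted probabilities are built from: if I read the predict.clm help page right, the probability of category j is F(eta1_j) - F(eta2_j), where F is the inverse link (plogis for this example's default logit link; pnorm for a probit link). A quick check (my own sketch, not from the original post):
pr.clm <- predict(fm.clm, new.data, type="prob")
all.equal(pr.clm$fit, plogis(lp.clm$eta1) - plogis(lp.clm$eta2),
          check.attributes=FALSE)
# should return TRUE if the reading above is correct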
Note 2: The prediction for the first row of new.data equals the thetas, since we set that row to zeros.
all.equal(lp.clm$eta1[1,1:3], fm.clm$alpha, check.attributes=FALSE)
# [1] TRUE
Note 3: We can construct such predictions manually. For instance, the prediction for the second row of new.data:
second.line <- fm.clm$alpha - sum(fm.clm$beta*new.data[2,])
all.equal(lp.clm$eta1[2,1:3], second.line, check.attributes=FALSE)
# [1] TRUE
Note 4: If new.data contains the response variable, then predict returns only the linear predictor for the specified level of y. Again, we can check it manually:
new.data$y <- gl(4,3,length=10)
lp.clm.y <- predict(fm.clm, new.data, type="linear.predictor")
lp.clm.y
lp.manual <- sapply(1:10, function(i) lp.clm$eta1[i,new.data$y[i]])
all.equal(lp.clm.y$eta1, lp.manual)
# [1] TRUE
Related
I have been fitting different hierarchical GAMs (hereafter: HGAMs) using mgcv in R. I can extract and plot their predictions for the random effects without problems. However, extracting and plotting predictions for the fixed effects only works for some models, and I don't know why.
Here is a practical example, which refers to the color spectra of flowers from two species (Taxon) sampled at various localities (also discussed here):
rm(list=ls()) # wipe R's memory clean
library(pacman) # load packages, installing them from CRAN if needed
p_load(RCurl) # allows accessing data from URL
ss <- read.delim(text=getURL("https://raw.githubusercontent.com/marcoplebani85/datasets/master/flower_color_spectra.txt"))
head(ss)
ss$density <- ifelse(ss$density<0, 0, ss$density) # set spurious negative reflectance values to zero
ss$clr <- ifelse(ss$Taxon=="SpeciesB", "red", "black")
ss <- with(ss, ss[order(Locality, wl), ])
These are the mean color spectra at the population level for the two species (rolling means were used):
[Figure: each color refers to a different species; each line refers to a different locality.]
The following model is an HGAM of type G according to Pedersen et al.'s (2019) classification, and it does not give any issues:
library(mgcv) # provides bam()
gam_G1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(Locality, bs="re"), # "re" is short for "random effect"
data = ss, method = 'REML',
family="quasipoisson"
)
# gam.check(gam_G1)
# k.check(gam_G1)
# MuMIn::AICc(gam_G1)
# gratia::draw(gam_G1)
# plot(gam_G1, pages=1)
# use gam_G1 to predict wl by Locality
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl", "Taxon", "Locality", "clr")])
# predict:
pred <- predict(object= gam_G1, newdata=nn, type="response", se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
# use gam_G1 to predict wl by Taxon
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl",
"Taxon",
"Locality",
"clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127
# predict:
pred <- predict(object = gam_G1,
type="response",
newdata=nn,
se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
R warns me that "factor levels 0 not in original fit", but it executes the task without issues:
[Figure. Left panel: gam_G1 predictions at the Locality level. Right panel: gam_G1 predictions for the fixed effects.]
Troublesome models
The following model is an HGAM of type GI sensu Pedersen et al. (2019). It produces more accurate predictions at the Locality level, but I can only get NA as predictions at the level of the fixed effects:
# GI: models with a global smoother for all observations,
# plus group-level smoothers, the wiggliness of which is estimated individually
start_time <- Sys.time()
gam_GI1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(wl, by = Locality, bs="tp", m=1)
# "tp" is short for "thin plate [regression spline]"
+ s(Locality, bs="re"),
family="quasipoisson",
data = ss, method = 'REML'
)
end_time <- Sys.time()
end_time - start_time # it took ~2.2 minutes on my computer
# gam.check(gam_GI1)
# k.check(gam_GI1)
# MuMIn::AICc(gam_GI1)
Attempt at drawing predictions for the fixed effects (Taxon and wl) according to gam_GI1:
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl",
"Taxon",
"Locality",
"clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127
# predict:
pred <- predict(object = gam_GI1,
type="response",
# exclude="c(Locality)",
# # this should turn random effect off
# # (doesn't work for me)
newdata=nn,
se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
head(nn)
# wl Taxon Locality clr fit se
# 1 298.34 SpeciesB 0 red NA NA
# 2 305.82 SpeciesB 0 red NA NA
# 3 313.27 SpeciesB 0 red NA NA
# 4 320.72 SpeciesB 0 red NA NA
# 5 328.15 SpeciesB 0 red NA NA
# 6 335.57 SpeciesB 0 red NA NA
[Figure. Left panel: gam_GI1 predictions at the Locality level. Right panel (blank): gam_GI1 predictions for the fixed effects.]
The following model, which includes a global smoother for all observations, plus group-level smoothers, all with the same "wiggliness", doesn't provide fixed-effect predictions either:
gam_GS1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(wl, by = Locality, bs="fs", m=1),
# "fs" is short for "factor-smoother [interaction]"
family="quasipoisson",
data = ss, method = 'REML'
)
Why don't gam_GI1 and gam_GS1 produce predictions for their fixed effects, and how can I obtain them?
The models can take a few minutes to run. To save time, their output can be downloaded from here as an RData file. My R scripts (which include the code for plotting the figures) are available here.
I think you are conflating several things here. The by trick to turn off random effects only works for bs = "re" smooths. Locality is a factor (otherwise your random effect wouldn't be a random intercept), and setting it to 0 is creating a new level (or possibly an NA, as 0 isn't among the original levels).
If what you want to do is turn off anything to do with Locality, you should use exclude; however, you have the invocation wrong. The reason it's not working is that you are creating a character vector with the single element "c(Locality)". This fails for obvious reasons once you realize that c(Locality) doesn't relate to anything in your model. What you need to provide here is a vector of smooth names as printed by summary(). For example, to exclude the smooth s(Locality, bs = "re"), {mgcv} knows it as s(Locality), so you would use exclude = "s(Locality)".
In your case, it is tedious to type out all the "s(wl):LocalityLevelX" labels for each smooth. As you have only two taxa, it would be easier to use the complementary argument terms, where you list the smooth labels that you want to include in the prediction. So you could do terms = c("s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC"), or whatever summary() displays for these smooths.
You also need to include the Taxon term in terms, which I think needs to be:
terms = c("TaxonSpeciesB", "TaxonSpeciesC",
          "s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC")
If you install and load my {gratia} package, you can use smooths(gam_GI1) to list all the smooth labels as far as {mgcv} knows them.
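Putting this together, a minimal sketch of what I mean (untested against your data; it assumes the smooth labels reported by summary()/gratia::smooths() all contain "Locality" for the Locality-related terms):
nn <- unique(ss[, c("wl", "Taxon", "Locality", "clr")]) # keep real Locality levels
loc_terms <- grep("Locality", gratia::smooths(gam_GI1), value = TRUE)
pred <- predict(gam_GI1, newdata = nn, type = "response",
                se.fit = TRUE, exclude = loc_terms) # zero out everything involving Locality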
The by trick works like this:
gam(y ~ x + s(z) + s(id, bs = "re", by = dummy))
where dummy is set to the numeric value 1 when fitting and to 0 when you are predicting. As this is a numeric by variable, you are multiplying the smooth by dummy, which is why setting it to 0 excludes the term. The reason your code isn't working is that you really want separate smooths of wl for each Locality; Locality is an actual variable of interest in your data/model, not a dummy variable created just to exclude a term from the model.
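To make that concrete, here is a minimal self-contained sketch of the dummy trick (y, x, z and id are made-up placeholders, not variables from your model):
library(mgcv)
set.seed(1)
d <- data.frame(y = rnorm(100), x = runif(100), z = runif(100),
                id = factor(rep(1:10, each = 10)))
d$dummy <- 1 # numeric 1 while fitting
m <- gam(y ~ x + s(z) + s(id, bs = "re", by = dummy), data = d, method = "REML")
newd <- transform(d, dummy = 0) # multiplies the random-effect smooth by 0
head(predict(m, newdata = newd)) # predictions without the random effect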
Hopefully now you can see why exclude and terms are much better solutions than this dummy trick.
FYI, in bs = "tp", the "tp" doesn't mean tensor product smooth. It means thin plate regression spline (TPRS). You only get tensor product smooths through te(), t2(), or ti() terms.
I wish to create a model matrix of the independent variables/specific levels of categorical variables selected by LASSO, so that I can plug said model matrix into a glm() call to run a logistic regression.
I have included an example of what I'm trying to do. Any help would be greatly appreciated
library(glmnet)
data("iris")
iris$Petal.Width <- factor(iris$Petal.Width)
iris$Sepal.Length2 <- ifelse(iris$Sepal.Length>=5.8,1,0)
f <- as.formula(Sepal.Length2~Sepal.Width+Petal.Length+Petal.Width+Species)
X <- model.matrix(f,iris)[,-1]
Y <- iris$Sepal.Length2
cvfit <- cv.glmnet(X,Y,alpha=1,family="binomial")
fit <- glmnet(X,Y,alpha=1,family = "binomial")
b <- coef(cvfit,s="lambda.1se")
print(b)
## This is the part I am unsure of: I want to create a model matrix of the non-zero coefficients contained within 'b'
## e.g.
lasso_x <- model.matrix(b,iris)
logistic_model <- glm.fit(lasso_x,Y,family = "binomial")
Edit:
I also tried the following:
model.matrix(~X)[which(b!=0)-1]
but it just gives me a single column of 1s whose length equals the number of variables selected by LASSO (minus the intercept).
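For what it's worth, one way to approach this (a sketch, not a definitive answer): instead of rebuilding a model matrix from b, match the names of the non-zero coefficients against the columns of the existing X:
# b is a sparse one-column matrix; its rownames match colnames(X) plus "(Intercept)"
bmat <- as.matrix(b)
keep <- setdiff(rownames(bmat)[bmat[, 1] != 0], "(Intercept)")
lasso_x <- X[, keep, drop = FALSE] # model matrix restricted to the LASSO picks
logistic_model <- glm(Y ~ ., data = data.frame(lasso_x, Y = Y), family = "binomial")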
According to the help page of multinom (package nnet), "The response should be a factor or a matrix with K columns, which will be interpreted as counts for each of K classes." I tried to use this function in the second case and obtained an error.
Here is a sample code of what I do:
library(nnet)
response <- matrix(round(runif(200,0,1)*100),ncol=20) # 10x20 matrix of counts
predictor <- runif(10,0,1)
fit1 <- multinom(response ~ predictor)
weights1 <- predict(fit1, newdata = 0.5, "probs")
Here is what I obtain:
'newdata' had 1 row but variables found have 10 rows
How can I solve this problem?
Bonus question: I also noticed that we can use multinom with a factor predictor, e.g. predictor <- factor(c(1,2,2,3,1,2,3,3,1,2)). I cannot understand how this is mathematically possible, given that a multinomial logit regression should work only with continuous or dichotomous predictors.
The easiest way to obtain predictions for new data is to supply the new data as a data.frame.
Using the sample code
> predict(fit1, newdata = data.frame(predictor = 0.5), type = "probs")
[1] 0.07231972 0.05604055 0.05932186 0.07318140 0.03980245 0.06785690 0.03951593 0.02663618
[9] 0.04490844 0.04683919 0.02298260 0.04801870 0.05559221 0.04209283 0.03799946 0.06406533
[17] 0.04509723 0.02197840 0.06686314 0.06888748
Let me state my confusion with the help of an example,
#making datasets
x1<-iris[,1]
x2<-iris[,2]
x3<-iris[,3]
x4<-iris[,4]
dat<-data.frame(x1,x2,x3)
dat2<-dat[1:120,]
dat3<-dat[121:150,]
#Using a linear model to fit x4 using x1, x2 and x3 where training set is first 120 obs.
model<-lm(x4[1:120]~x1[1:120]+x2[1:120]+x3[1:120])
#Using the coefficients' values from summary(model), prediction is done for the next 30 obs.
-.17947-.18538*x1[121:150]+.18243*x2[121:150]+.49998*x3[121:150]
#Same prediction is done using the function "predict"
predict(model,dat3)
My confusion is: the two sets of predictions for the last 30 values differ, maybe only slightly, but they do differ. Why is that? Shouldn't they be exactly the same?
The difference is really small, and I think it is just due to the precision of the coefficients you are using (e.g. the real value of the intercept is -0.17947075338464965610..., not simply -0.17947).
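You can inspect the unrounded values directly, for example:
print(coef(model), digits = 20) # shows far more digits than summary()'s display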
In fact, if you take the full coefficient values and apply the formula, the result equals that of predict:
intercept <- model$coefficients[1]
x1Coeff <- model$coefficients[2]
x2Coeff <- model$coefficients[3]
x3Coeff <- model$coefficients[4]
intercept + x1Coeff*x1[121:150] + x2Coeff*x2[121:150] + x3Coeff*x3[121:150]
You can clean your code a bit. To create your training and test datasets you can use the following code:
# create training and test datasets
train.df <- iris[1:120, 1:4]
test.df <- iris[-(1:120), 1:4]
# fit a linear model to predict Petal.Width using all predictors
fit <- lm(Petal.Width ~ ., data = train.df)
summary(fit)
# predict Petal.Width in test test using the linear model
predictions <- predict(fit, test.df)
# create a function mse() to calculate the Mean Squared Error
mse <- function(predictions, obs) {
sum((obs - predictions) ^ 2) / length(predictions)
}
# measure the quality of fit
mse(predictions, test.df$Petal.Width)
The reason your predictions differ is that predict() uses all decimal places, whereas in your "manual" calculation you used only five. The summary() function doesn't display the complete values of your coefficients but rounds them to about five decimal places to make the output more readable.
What command should I use in R to compute a confusion matrix after having used rpart() and predict() to generate a prediction model?
# Grow tree
library(rpart)
fit <- rpart(activity ~ ., method="class", data=train.data)
printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits
# Prune the tree (in my case is exactly the same as the initial model)
pfit <- prune(fit, cp=0.10) # from cptable
pfit <- prune(fit,cp=fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])
# Predict using the test dataset
pred1 <- predict(fit, test.data, type="class")
# Show re-substitution error
table(train.data$activity, predict(fit, type="class"))
# Accuracy rate
sum(test.data$activity==pred1)/length(pred1)
I would like to summarise in a clear way True Positives, False Negatives, False Positives and True Negatives. It would be great also to have in the same matrix Sensitivity, Specificity, Positive Predictive Value and Negative Predictive Value.
[Image of the standard confusion-matrix layout. Source: http://en.wikipedia.org/wiki/Sensitivity_and_specificity]
Use the predict() method, with your fit and the original data frame, like so:
pred = predict(train.fit, newdata, type = "vector")
newdata$pred = as.vector(pred)
newdata$prediction = activities[newdata$pred]
tab = table(newdata$prediction, newdata$activity)
print(tab)
In the example above, the rpart model predicts an activity (a factor variable). pred is numeric, with values corresponding to the levels of the factor; activities = sort(unique(data$activity)) reproduces the default factor mapping.
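If you don't mind an extra package, the caret package's confusionMatrix() reports sensitivity, specificity, positive and negative predictive value alongside the table. A sketch, assuming caret is installed and pred1/test.data are as in the question:
library(caret)
confusionMatrix(pred1, test.data$activity) # predictions first, reference second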