Dirichlet Regression using Caret package - r

I am attempting to predict tree species composition using Sentinel 2A imagery and forest plot data. I have calculated the proportion of basal area (the cross-sectional area that trees of a given species occupy within a plot divided by the total cross-sectional area that all trees occupy within the plot) from the forest plot data and want to predict proportion of basal area by species across the landscape as a raster using the intensity values from Sentinel 2A. Dirichlet regression seems appropriate here, because I have more than 2 categories of proportional data bounded on (0,1) where the summed proportions for each observational unit must equal 1. I therefore want to model the joint composition of proportional basal area as a function of spectral reflectance intensity using Dirichlet regression, with k-fold cross-validation using 10 folds and 5 repeats. This seems like a modeling exercise perfectly suited to caret::train() with a custom method.
I followed the documentation on building a custom routine for DirichletReg::DirichReg(), but I am a bit stumped on how to feed a multivariate response variable to caret::train(). The DirichletReg::DirichReg() function requires that the response variable be formatted with the DirichletReg::DR_data() function prior to modeling. However, when I feed a DR_data object to caret::train(), I get this error:
Error: wrong model type for classification
I think this error is getting thrown because I have passed a multivariate response variable to the y= argument of caret::train() and the routine thinks that can only happen with classification algorithms. Does anyone have experience doing multivariate regression? Is there another way to change my custom routine so that caret::train() will accept multivariate response variables?
Below is my reproducible example:
##Loading Necessary Packages##
library(caret)
library(DirichletReg)
##Creating Fake Data##
set.seed(88) #For reproducibility
#Response variables#
PSME_BA <- rnorm(25, 50, 15)
TSHE_BA <- rnorm(25, 40, 12)
ALRU2_BA <- rnorm(25, 20, 0.5)
Total_BA <- PSME_BA + TSHE_BA + ALRU2_BA
#Predictor variables#
B1 <- runif(25, 0, 2000)
B2 <- runif(25, 0, 1800)
B3 <- runif(25, 0, 3000)
#Dataset for modeling#
df <- data.frame(PSME = PSME_BA/Total_BA, TSHE = TSHE_BA/Total_BA, ALRU2 = ALRU2_BA/Total_BA,
                 B1 = B1, B2 = B2, B3 = B3)
##Creating a Dirichlet regression modeling routine to feed to caret::train()##
#Method list#
Dreg <- list(type = "Regression",
             library = "DirichletReg",
             loop = NULL)
#Parameters element#
prm <- data.frame(parameter = "model",
                  class = "character",
                  label = "Parametrization") #caret requires parameter, class, and label columns
Dreg$parameters <- prm
#Grid element#
DregGrid <- function(x, y, len = NULL, search = "grid"){
  #Only two parametrizations exist, so grid and random search share the same
  #candidates. stringsAsFactors = FALSE is needed here; otherwise the model
  #argument arrives as a factor and fitting errors expecting 'chr'.
  expand.grid(model = c("common", "alternative"), stringsAsFactors = FALSE)
}
Dreg$grid <- DregGrid
#Fit element#
#Using caret's full fit signature so wts/lev/last/classProbs are captured
#explicitly and do not leak through ... into DirichReg()
DregFit <- function(x, y, wts, param, lev, last, weights, classProbs, ...){
  dat <- if(is.data.frame(x)) x else as.data.frame(x)
  dat$.outcome <- y
  theDots <- list(...)
  #DirichReg()'s argument for the "common"/"alternative" parametrization is
  #'model', so the tuning parameter goes there
  modelArgs <- c(list(formula = as.formula(".outcome ~ ."), data = dat,
                      model = as.character(param$model)), theDots)
  out <- do.call(DirichletReg::DirichReg, modelArgs)
  out$call <- NULL
  out
}
Dreg$fit <- DregFit
#Predict element#
DregPred <- function(modelFit, newdata, preProc = NULL, submodels = NULL){
  if(!is.data.frame(newdata)) newdata <- as.data.frame(newdata)
  predict(modelFit, newdata) #S3 dispatch finds DirichletReg's method for DirichletRegModel objects
}
Dreg$predict <- DregPred
#prob element#
DregProb <- function(){
  return(NULL) #no class probabilities for a regression model
}
Dreg$prob <- DregProb
##Modeling the data using Dirichlet regression with repeated k-folds cross validation##
trCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
Y <- DR_data(df[, 1:3]) #Converting the response variables to DR_data format
mod <- train(x = df[, -c(1:3)], y = Y, method = Dreg, trControl = trCtrl) #Throws the error above
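For what it's worth, the underlying model fits outside of caret, which suggests the failure is specifically in how train() screens the response. A minimal sanity check, mirroring the DR_data-as-column pattern from the DirichletReg examples:
#Sanity check: DirichReg() accepts the multivariate response directly#
df2 <- df
df2$Y <- DR_data(df[, 1:3])
m0 <- DirichReg(Y ~ B1 + B2 + B3, data = df2, model = "common")
summary(m0)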

Related

How does the kernel SVM in e1071 predict?

I am trying to understand the way the e1071 package obtains its SVM predictions in a two-class classification framework. Consider the following toy example.
library(mvtnorm)
library(e1071)
n <- 50
### Gaussians
eps <- 0.05
data1 <- as.data.frame(rmvnorm(n, mean = c(0,0), sigma=diag(rep(eps,2))))
data2 <- as.data.frame(rmvnorm(n, mean = c(1,1), sigma=diag(rep(eps,2))))
### Train Model
data_df <- as.data.frame(rbind(data1, data2))
data <- as.matrix(data_df)
data_df$y <- as.factor(c(rep(-1,n), rep(1,n)))
svm <- svm(y ~ ., data = data_df, kernel = "radial", gamma=1, type = "C-classification", scale = FALSE)
Having trained the SVM, I would like to write a function that uses the coefficients and the intercept to predict on a new data point.
Recall that the kernel trick guarantees that we can write the prediction on a new point as the weighted sum of the kernel evaluated at the support vectors and the new point itself (plus some intercept).
In other words: how do I combine the following three terms
supportv <- svm$SV
coefs <- svm$coefs
intercept <- svm$rho
to get the prediction associated with the corresponding SVM?
If this is not possible, or too complicated, I would also switch to a different package.
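For reference, here is a minimal sketch of how those three pieces combine for the RBF kernel above. Per ?svm, coefs holds the Lagrange multipliers times the training labels and rho is the negative intercept, so the decision value is sum_i coefs_i * K(sv_i, x) - rho, with the sign mapped to a class according to e1071's level ordering:
### Manual decision value for the radial kernel, K(u,v) = exp(-gamma*||u-v||^2)
rbf_decision <- function(model, x, gamma = 1) {
  k <- apply(model$SV, 1, function(sv) exp(-gamma * sum((sv - x)^2)))
  drop(crossprod(model$coefs, k)) - model$rho
}
### Compare with the package on a new point
x_new <- data.frame(V1 = 0.5, V2 = 0.5)
rbf_decision(svm, as.numeric(x_new))
attr(predict(svm, x_new, decision.values = TRUE), "decision.values")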

Two methods of recovering fitted values from a Bayesian Structural Time Series model yield different results

Two conceptually plausible methods of retrieving in-sample predictions (or "conditional expectations") of y[t] given y[t-1] from a bsts model yield different results, and I don't understand why.
One method uses the prediction errors returned by bsts (defined as e=y[t] - E(y[t]|y[t-1]); source: https://rdrr.io/cran/bsts/man/one.step.prediction.errors.html):
library(bsts)
get_yhats1 <- function(fit){
  # One-step prediction errors are defined as e = y[t] - yhat
  # (see the help page linked above), so recover yhat as y - e
  bsts.pred.errors <- bsts.prediction.errors(fit, burn = SuggestBurn(0.1, fit))$in.sample
  predictions <- t(apply(bsts.pred.errors, 1, function(e){fit$original.series - e}))
  return(predictions)
}
Another sums the contributions of all model components at time t.
get_yhats2 <- function(fit){
  burn <- SuggestBurn(0.1, fit)
  X <- fit$state.contributions
  niter <- dim(X)[1]
  ncomp <- dim(X)[2]
  nobs <- dim(X)[3]
  # initialize the prediction matrix with zeros
  predictions <- matrix(data = 0, nrow = niter - burn, ncol = nobs)
  for (comp in seq_len(ncomp)) {
    # pull out this component's state contributions as a
    # (niter - burn) x nobs matrix and accumulate across components
    predictions <- predictions + X[-seq_len(burn), comp, ]
  }
  return(predictions)
}
Fit a model:
## Air passengers data
data("AirPassengers")
# 11 years, monthly data (timestep=monthly) --> 132 observations
Y <- stats::window(AirPassengers, start=c(1949,1), end=c(1959,12))
y <- log(Y)
ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons=12, season.duration=1)
bsts.model <- bsts(y, state.specification=ss, niter=500, family='gaussian')
Compute and compare predictions using each of the functions:
p1 <- get_yhats1(bsts.model)
p2 <- get_yhats2(bsts.model)
# Compare predictions for t=1:5, first MCMC iteration:
p1[1,1:5]; p2[1,1:5]
I'm the author of bsts.
The 'prediction errors' in bsts come from the filtering distribution. That is, they come from p(state | past data). The state contributions come from the smoothing distribution, i.e. p(state | all data). The filtering distribution looks backward in time, while the smoothing distribution looks both forward and backward. One typically needs the filtering distribution while using a fitted model, and the smoothing distribution while fitting the model in the first place.
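A quick way to see this in the example above, using the p1/p2 matrices already computed: the filtered estimates should deviate most from the smoothed ones early in the series, where little past data is available.
# Posterior-mean fits from each method for the first few time points
round(colMeans(p1)[1:5], 3)
round(colMeans(p2)[1:5], 3)
# Average absolute gap between the two across the whole series
mean(abs(colMeans(p1) - colMeans(p2)))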

Plotting precision@k and recall@k in ROCR (R)

I'm evaluating a binary classifier in R with the ROCR package. My classifier outputs a score between 0 and 1 for target 0/1 labels.
I'd like to plot precision and recall @ k but can't find a way to do it. Calling performance() without specifying the x-axis measure plots the precision value by score cutoff:
library(ROCR)
#df <- a two-dimensional dataframe with prediction scores and actual labels of my classifier
pred <- prediction(df$score, df$label)
pr_curve <- performance(pred, measure="prec")
For precision (or recall) at k, I'd need to plot the precision against the rank of each prediction, ordered by descending score:
pred <- prediction(df$score, df$label)
pr_curve <- performance(pred, measure="prec", x.measure="rank") #but there seems to be no "rank" in ROCR!
Is there a way to do this in ROCR? I'm open to use alternative libraries if this isn't the case.
Load libraries and define train and test set:
library(mlbench)
library(e1071)
library(ROCR)
data(BreastCancer)
df = BreastCancer
idx = sample(1:nrow(df),150)
trn = df[idx,]
test = df[-idx,]
Fit naive Bayes:
fit = naiveBayes(Class ~ .,data=trn)
In the manual for performance, it is written,
Precision/recall graphs: measure="prec", x.measure="rec".
Plot precision-recall:
pred = prediction(predict(fit,test,type="raw")[,2],test$Class)
#plot to see it is working correctly:
plot(performance(pred,measure="prec",x.measure="rec"))
Now, for your case of computing them at k, we can also derive precision and recall from scratch:
#combine probabilities, predicted labels, and actual labels
res = data.frame(prob = predict(fit, test, type = "raw")[, 2],
                 predicted_label = predict(fit, test),
                 label = test$Class)
res = res[order(res$prob, decreasing = TRUE), ]
res$rank = 1:nrow(res)
# recall: cumulative fraction of all malignant cases recovered by rank k
res$recall = cumsum(res$label == "malignant")/sum(res$label == "malignant")
# precision: fraction of the top-k predictions that are actually malignant
res$precision = cumsum(res$label == "malignant")/res$rank
# check the two plots
par(mfrow=c(1,2))
plot(performance(pred,measure="prec",x.measure="rec"))
plot(res$recall,res$precision,type="l")
Having verified that this matches, getting or plotting precision at k is simply:
par(mfrow=c(1,2))
with(res,
plot(rank,precision,main="self-calculated",type="l"))
plot(pred@n.pos.pred[[1]],
     pred@tp[[1]]/(pred@fp[[1]] + pred@tp[[1]]),
     type = "l", main = "from ROCR")
I am not aware of a way to do this through the plot method for performance objects, but you can use the slots stored in the prediction object: pred@tp holds the true positives and pred@fp the false positives, so tp / (tp + fp) gives precision, and pred@n.pos.pred essentially gives the rank.
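By the same logic (a sketch using the same slots), recall@k is the cumulative true positives divided by the total number of actual positives, pred@n.pos:
plot(pred@n.pos.pred[[1]],
     pred@tp[[1]]/pred@n.pos[[1]],
     type = "l", main = "recall@k from ROCR slots")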

Estimate prediction accuracy of Cox PH

I would like to develop a Cox proportional hazards model in R, use it to make predictions, and evaluate the accuracy of the model. For the evaluation I would like to use the Brier score.
# import various packages, needed at some point of the script
library("survival")
library("survminer")
library("prodlim")
library("randomForestSRC")
library("pec")
library("rpart")
library("mlr")
library("Hmisc")
library("ipred")
# load lung cancer data
data("lung")
head(lung)
# recode status variable
lung$status <- lung$status-1
# Delete rows with missing values
lung <- na.omit(lung)
# split data into training and testing
## 80% of the sample size
smp_size <- floor(0.8 * nrow(lung))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(lung)), size = smp_size)
# training and testing data
train.lung <- lung[train_ind, ]
test.lung <- lung[-train_ind, ]
# time and failure event
s <- Surv(train.lung$time, train.lung$status)
# create model
cox.ph2 <- coxph(s~age+meal.cal+wt.loss, data=train.lung)
# predict
pred <- predict(cox.ph2, newdata = train.lung)
# evaluate
sbrier(s, pred)
As the outcome of the prediction I would expect times (as in "when does this individual experience failure?"). Instead I get values like this:
[1] 0.017576359 -0.135928959 -0.347553969 0.112509137 -0.229301199 -0.131861582 0.044589175 0.002634008
[9] 0.345966978 0.209488560 0.002418358
What does that mean?
Furthermore, sbrier does not work; apparently it cannot handle the prediction pred (no surprise there).
How do I solve this? How do I make a prediction with cox.ph2? How can I evaluate the model afterwards?
The predict() function won't return a time value; you have to specify one of type = c("lp", "risk", "expected", "terms", "survival") in the predict() call.
If you want the hazard ratios:
predict(cox.ph2, newdata = test.lung, type = "risk")
Note that you want to predict the values on the test set, not the training set.
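To actually get a Brier score, one option (a sketch, assuming the training Surv object s is still in scope and that sbrier() accepts a vector of survival probabilities at a single btime, per its documentation) is to pull per-patient survival curves from survfit() and score them at a fixed time:
# Survival probability for each test patient at a fixed time t0
fit.surv <- survfit(cox.ph2, newdata = test.lung)
t0 <- 300 # evaluation time in days (arbitrary choice)
p.surv <- as.vector(summary(fit.surv, times = t0)$surv)
# Brier score at t0: compares p.surv against the observed outcomes
sbrier(Surv(test.lung$time, test.lung$status), p.surv, btime = t0)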
I have read that you can use AFT models in your case:
https://stats.stackexchange.com/questions/79362/how-to-get-predictions-in-terms-of-survival-time-from-a-cox-ph-model
You also can read this post :
Calculate the Survival prediction using Cox Proportional Hazard model in R
Hope it helps.

Multivariate regression with glm: logical subscript too long

I am teaching myself multivariate regression and I am trying to simulate a multivariate random variable and construct a generalized linear model to fit it.
Here is my code:
#Clear previous
rm(list = ls())
cmp <- 2 #Number of components in sample
n <- 10 #Number of simulated data points
B <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE) #Coefficient matrix
#Simulate model
X <- matrix(rep(0, 2*n), nrow = 2, byrow = TRUE) #Initialize predictor matrix
Y <- matrix(rep(0, 2*n), nrow = 2, byrow = TRUE) #Initialize response matrix
for (j in 1:cmp){
  X[j, ] <- rnorm(n) #predictor data
  e <- rnorm(n) #error term
  Y[j, ] <- B[j, 1] + B[j, 2]*X[j, ] + e
}
#Linear regression
fit <- glm(Y ~ X, family = gaussian())
fit
This produces the following error in the function glm:
Error in x[good, , drop = FALSE] : (subscript) logical subscript too long
I am quite unsure what the problem is.
Multivariate GLM
glm() does not work with multiple dependent variables. You can model a single response column, as in the code below, but you cannot do both at once; it is only the independent data that can be multivariate.
Use Y[1, ] instead of Y:
fit <- glm(Y[1, ] ~ t(X), family = gaussian())
In addition, this line uses the transpose t(X) instead of X because glm() interprets rows as observations.
MANOVA / MANCOVA / linear discriminant analysis
In your case, you seem to be using Gaussian-distributed errors. For this particular case there is a method that handles multiple dependent variables: MANOVA (if the independent variable is a factor) or MANCOVA (if the independent variable is continuous). You can model it in R as fit = manova(t(Y) ~ t(X)), as sketched below.
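A short sketch applying that to the simulation above (manova() wants observations in rows, hence the transposes):
fit.m <- manova(t(Y) ~ t(X))
summary(fit.m) #multivariate tests (Pillai's trace by default)
coef(fit.m) #per-response coefficient estimates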
