Make a loop function - gamlss

I was wondering how I could create a loop to calculate the predicted values, then calculate the MSE for several models, and finally sort the models by their MSE. I need to see which model has the lowest MSE. Thanks in advance for the help.
library(dplyr)  # for %>% and summarise()

y_pred1 <- predict(m1, newdata = test, type = "response")
MSE1 <- test %>% summarise(MSE = mean((y_pred1 - y)^2))
y_pred2 <- predict(m2, newdata = test, type = "response")
MSE2 <- test %>% summarise(MSE = mean((y_pred2 - y)^2))
y_pred3 <- predict(m3, newdata = test, type = "response")
MSE3 <- test %>% summarise(MSE = mean((y_pred3 - y)^2))

Related

Correlation between p-value and linear regression slope in general linear model

I input a counts table (read counts mapping to 600 sequences in a reference fasta) into a package that runs a linear regression using a general linear model, lm() in R, and outputs a trend-line slope and p-value for each of the 600 sequences. I made a volcano plot of my results (b.value = regression slope) and found an unexpected correlation between the regression slope and the p-value. Can anyone explain why this might be the case? I am going through the code for the package, but I want to know if this could actually be an issue with the input data rather than the model.
The code for running the linear regression on the counts table "dat$count", with the response variable given by "condition", is:
do.lm <- function(dat) {
  # assumes dplyr, tidyr and tibble are attached
  count <- dat$count
  col_data <- dat$col_data

  # reshape the counts into long (tidy) form and attach the sample metadata
  tidy.count <- count %>%
    cbind(name = rownames(count)) %>%
    gather(sample, count, -name) %>%
    left_join(col_data %>% rownames_to_column("sample"), by = "sample")

  # TIP: values in a column must be atomic, can't have a vector
  res <- tidy.count %>%
    group_by(name) %>%
    summarise(count = list(count),
              condition = list(condition)) %>%
    group_by(name) %>%
    mutate(lm = list(summary(lm(unlist(condition) ~ unlist(count)))),
           baseMean = mean(unlist(count))) %>%
    # coefficients[2, 4] is the p-value of the slope term;
    # [2, 3] is its t statistic (the slope estimate itself would be [2, 1])
    mutate(p.value = tryCatch({lm[[1]]$coefficients[2, 4]}, error = function(e) NA),
           b.value = tryCatch({lm[[1]]$coefficients[2, 3]}, error = function(e) NA)) %>%
    select(name, baseMean, b.value, p.value)

  res$padj <- p.adjust(res$p.value, method = "fdr")
  res
}
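For reference (not part of the original post), a small sketch showing which entries of the coefficient matrix returned by summary(lm(...)) the indices above pick out; row 2 is the slope term and the columns are Estimate, Std. Error, t value and Pr(>|t|):
fit <- summary(lm(mpg ~ wt, data = mtcars))  # mtcars used purely for illustration
fit$coefficients          # matrix with one row per term: intercept row, slope row
fit$coefficients[2, 1]    # slope estimate
fit$coefficients[2, 3]    # t statistic of the slope
fit$coefficients[2, 4]    # p-value of the slope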

R random forest aggregate vs individual prediction

Please consider this minimal reproducible example of a random forest regression estimate
library(randomForest)

# fix missing data
airquality <- na.roughfix(airquality)

set.seed(123)
# fit the random forest model
rf_fit <- randomForest(formula = Ozone ~ ., data = airquality)

# define new observation
new <- data.frame(Solar.R = 250, Wind = 8, Temp = 70, Month = 5, Day = 5)

set.seed(123)
# use predict.all on the new observation
rf_predict <- predict(rf_fit, newdata = new, predict.all = TRUE)
rf_predict$aggregate

library(tidyverse)
predict_mean <- rf_predict$individual %>%
  as_tibble() %>%
  rowwise() %>%
  transmute(avg = mean(V1:V500))
predict_mean
I was expecting rf_predict$aggregate and predict_mean to give the same value.
Where and why am I wrong about this assumption?
My final objective is to get a confidence interval of the predicted value.
I believe your code needs to include a c_across() call for the calculation to be performed correctly:
The ?c_across documentation tells us:
c_across() is designed to work with rowwise() to make it easy to
perform row-wise aggregations.
predict_mean <- rf_predict$individual %>%
  as_tibble() %>%
  rowwise() %>%
  transmute(avg = mean(c_across(V1:V500)))

> predict_mean
[1] 30.5
An answer to a previous question points out that mean() can't handle a data.frame, and in your code the data being provided to mean() is a row-wise data frame with class rowwise_df. c_across() allows the data in the rows to be presented to mean() as vectors (I think).
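On the final objective of an interval around the prediction, one rough sketch (an assumption, not part of the original answer) is to look at the spread of the per-tree predictions; note this is a spread of tree predictions rather than a formal confidence interval:
tree_preds <- as.numeric(rf_predict$individual)  # the 500 per-tree predictions for the new observation
quantile(tree_preds, probs = c(0.025, 0.975))    # empirical 2.5% and 97.5% quantiles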

tidymodels roc auc results in multiple classification are affected by first level of factor

Using the iris dataset, a knn classifier was tuned with iterative search and roc_auc as the metric, for the purpose of multiclass classification.
One AUC result per potential model was calculated as expected; nevertheless, this value is not stable but is affected by:
the order of levels ("setosa", "virginica", "versicolor") in the Species column of the initial dataset
the order of the probability columns in the roc_auc(truth = Species, .pred_setosa, .pred_virginica, .pred_versicolor) call
Does this indicate that the AUC may be calculated by treating the first level of the Species column as the positive event (which is expected in binary classification, whereas in multiclass classification a single AUC based on, e.g., one-vs-all comparisons would be appropriate)?
If so, is there a way to select a potential model based on, e.g., the average of all the AUC values produced by the one-vs-all comparisons?
Could it also be implemented in the metric_set during the iterative search?
Thank you in advance for your support!
library(tidyverse)
library(tidymodels)
tidymodels_prefer()

df <- iris %>%
  mutate(Species = factor(Species, levels = c("virginica", "versicolor", "setosa")))

set.seed(2023)
splits <- initial_split(df, strata = Species, prop = 4/5)
df_train <- training(splits)
df_test <- testing(splits)

df_rec <- recipe(Species ~ ., data = df_train)

knn_model <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

df_wflow <- workflow() %>%
  add_model(knn_model) %>%
  add_recipe(df_rec)

set.seed(2023)
knn_cv <-
  df_wflow %>%
  tune_bayes(
    metrics = metric_set(roc_auc),
    resamples = vfold_cv(df_train, strata = "Species", v = 2),
    control = control_bayes(verbose = TRUE, save_pred = TRUE)
  )

cv_train_metrics <- knn_cv %>%
  collect_predictions() %>%
  group_by(.config, id) %>%
  roc_auc(truth = Species, .pred_setosa, .pred_virginica, .pred_versicolor)
roc_auc() expects that the columns that have the probability estimates are in the same order as the factor levels. We'll make the documentation better for that.
By default, we use the method of Hand and Till to compute the area under a single multiclass ROC curve.
So this is not doing multiple ROC curves by default. You can change the estimator argument to use different averaging methods, though I would not suggest it for this metric.
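So, with the levels defined above as c("virginica", "versicolor", "setosa"), a sketch of the corrected call passes the probability columns in that same order:
cv_train_metrics <- knn_cv %>%
  collect_predictions() %>%
  group_by(.config, id) %>%
  roc_auc(truth = Species, .pred_virginica, .pred_versicolor, .pred_setosa)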

Looping cv.glmnet and getting the 'best' coefficients

As noted in the help of cv.glmnet, "the results of cv.glmnet are random, since the folds are selected at random. Users can reduce this randomness by running cv.glmnet many times, and averaging the error curves.".
If I make a loop that runs cv.glmnet n times, how can I extract the 'best' coefficients? I usually take the coefficients using this command:
coe<- coef(cvfit, s = "lambda.min")
If I use the mean of all the "lambda.min" values, then I don't know how to choose the right cvfit out of the many I generated. Should I use the mean of cvfit$cvm (the MSE), or something else?
Thanks
When you do coef(cvfit, s = "lambda.min"), you are taking the lambda that gives the minimum mean cross-validated error (s = "lambda.1se" would instead give the largest lambda within one standard error of that minimum); see this discussion. So you can average the MSEs across different cv runs:
library(glmnet)
library(mlbench)

data(BostonHousing)
X <- as.matrix(BostonHousing[, -c(4, 14)])
Y <- BostonHousing[, 14]

nfolds <- 5
nreps <- 10

res <- lapply(1:nreps, function(i) {
  fit <- cv.glmnet(x = X, y = Y, nfolds = nfolds)
  data.frame(MSE_mean = fit$cvm, lambda = fit$lambda, se = fit$cvsd)
})
res <- do.call(rbind, res)
We can summarize the results. The standard error column is approximated by simply averaging across runs; if you want to be precise, you might have to look into the formula for a pooled standard deviation:
library(dplyr)
summarized_res <- res %>%
  group_by(lambda) %>%
  summarise(MSE = mean(MSE_mean), se = mean(se)) %>%
  arrange(desc(lambda))

idx <- which.min(summarized_res$MSE)
lambda.min <- summarized_res$lambda[idx]
lambda.min
[1] 0.019303

index_1se <- with(summarized_res, which(MSE < MSE[idx] + se[idx])[1])
lambda_1se <- summarized_res$lambda[index_1se]
lambda_1se
[1] 0.3145908
We can plot this:
library(ggplot2)
ggplot(res, aes(x = log(lambda), y = MSE_mean)) +
  stat_summary(fun = mean, size = 2, geom = "point") +
  geom_vline(xintercept = log(c(lambda.min, lambda_1se)))  # x axis is log(lambda)
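To come back to the original question about the coefficients, one sketch (an assumption about the intended workflow, not part of the original answer) is to refit glmnet on the full data and extract the coefficients at the averaged lambda:
final_fit <- glmnet(x = X, y = Y)
coe <- coef(final_fit, s = lambda.min)  # or s = lambda_1se for the more regularized choice
coe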

Create a list column with just one item in it (no group by)

Here is a workflow that trains an XGB model using tidyr list columns, rsample folds and purrr map:
library(tidyverse)  # for diamonds, dplyr verbs and purrr::map
library(rsample)
library(xgboost)
library(Metrics)

# keep just numeric features for this example
pdata_split <- initial_split(diamonds %>% select(-cut, -color, -clarity), 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)

train_cv <- vfold_cv(training_data, 5) %>%
  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)),
         validate = map(splits, ~testing(.x)))

# xgb across each fold
mod.xgb <- train_cv %>%
  # convert regression data to a dmatrix for xgb. Just simple price ~ carat for here and now
  mutate(train_dmatrix = map(train, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price)),
         validate_dmatrix = map(validate, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price))) %>%
  mutate(regression = map(train_dmatrix, ~xgboost(.x, objective = "reg:squarederror", nrounds = 100))) %>% # fit the model
  mutate(predictions = map2(.x = regression, .y = validate_dmatrix, ~predict(.x, .y))) %>% # predictions
  mutate(validation_actuals = map(validate, ~.x$price)) %>% # get the price actuals for computing evaluation metrics
  mutate(mae = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::mae(actual = .x, predicted = .y))) %>% # mae
  mutate(rmse = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::rmse(actual = .x, predicted = .y))) # rmse
My actual script and data use crossing() and other models with their own hyperparameters in order to pick the best model. So the real block the above is based on lets me compare several models, since it actually contains several models.
I like this workflow because, using dplyr verbs and the pipe operator, I can make changes as needed while progressing through each step and then apply them to each fold using the map functions.
Now that I have passed the cross-validation phase and am at the test phase, I'd like to emulate that 'flow', except I do not have folds, so there is no need for the map_* functions.
However, I still need to make transformations such as the one above adding an xgb.DMatrix, since I am using xgboost.
For example, below is what I created to test my chosen xgb model:
library(tidyverse)  # for diamonds and dplyr verbs
library(rsample)
library(xgboost)
library(Metrics)

# keep just numeric features for this example
pdata_split <- initial_split(diamonds %>% select(-cut, -color, -clarity), 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)

# create xgb.DMatrix objects
training_data_xgb_matrix <- xgb.DMatrix(training_data %>% select(-price) %>% as.matrix(), label = training_data$price)
test_data_xgb_matrix <- xgb.DMatrix(testing_data %>% select(-price) %>% as.matrix(), label = testing_data$price)

# create a regression
model_xgb <- xgboost(training_data_xgb_matrix, nrounds = 100, objective = "reg:squarederror")

# predict on test data
xgb_predictions <- predict(model_xgb, test_data_xgb_matrix)

# evaluate using rmse
test_rmse <- rmse(actual = testing_data$price, predicted = xgb_predictions)
test_rmse
# 1370.185
So, that is doing it step by step. My question is: can I somehow do this in a way similar to the approach used above during cross validation, particularly by just adding a new column to an existing df / list column?
What is the 'tidy' way of evaluating a model on test data? Is it possible to start with training_data, append the test data in a new column, and start a workflow that reaches the same result, with rmse in its own column following a call to mutate()?
training_data %>%
(add test data in a new column) %>%
mutate(convert training data to a xgb.DMatrix) %>%
mutate(convert test data to a xgb.DMatrix) %>%
mutate(fit a regression model based on the training data xgb.DMatrix) %>%
mutate(predict with the regression model on test data xgb.DMatrix) %>%
mutate(calculate rmse)
Is this possible?
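One possible sketch of that flow (an assumption about how it could look, not an accepted answer) is to wrap the single train/test pair in a one-row tibble so the same mutate()-based steps apply; the map() calls remain because each list column holds a single element:
holdout <- tibble(train = list(training_data), test = list(testing_data)) %>%
  mutate(train_dmatrix = map(train, ~xgb.DMatrix(.x %>% select(-price) %>% as.matrix(), label = .x$price)),
         test_dmatrix = map(test, ~xgb.DMatrix(.x %>% select(-price) %>% as.matrix(), label = .x$price))) %>%
  mutate(regression = map(train_dmatrix, ~xgboost(.x, objective = "reg:squarederror", nrounds = 100))) %>%
  mutate(predictions = map2(.x = regression, .y = test_dmatrix, ~predict(.x, .y))) %>%
  mutate(rmse = map2_dbl(.x = test, .y = predictions, ~Metrics::rmse(actual = .x$price, predicted = .y)))
holdout$rmse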
