Meta and Metafor R Package

I am currently conducting a meta-analysis in R using the package "metafor". Doing my research I came across a different package for meta-analyses in R, namely "meta". I like the forest plot created by the latter package better (design-wise), but unfortunately some of the data is not the same as in the plot I created with metafor.
Specifically, the data is different only for I^2 and the pooled estimate.
meta_1 <- rma(yi=yi, vi=vi, measure="SMD", method="ML", slab=Citation, data=dat)
forest(meta_1)
meta_2 <- metagen(yi,vi^.5,data = dat,studlab = paste(Citation), comb.fixed = FALSE,
comb.random = TRUE, hakn = TRUE, method.tau = "ML", sm = "SMD")
forest(meta_2)
Does anyone know why those differences emerge?

So I was able to get the prediction interval to match across functions, but not the I^2 values (even though the difference is only about 2%). There might be some statistical correction one package applies that the other does not, or it may have to do with the random-effects/fixed-effect modelling approach.
Anyway, I hope this code helps point you in the right direction. To get the CIs to match, you also have to use the parameter method.tau.ci in metagen().
library(meta)
library(metafor)
study <- 1:8
yi <- c(-0.48965031, 0.64970214, 0.11201680, 0.07945655, -0.70874645, -0.54922759, 0.66768916, -0.45523574)
vi <- c(0.10299697, 0.14036855, 0.05137812, 0.03255550, 0.34913525, 0.34971466, 0.07539957, 0.08428983)
dat <- data.frame(study, yi, vi)
meta_1 <- rma(yi=dat$yi, vi=dat$vi, measure="SMD", method="REML", slab=paste(study), data=dat)
forest(meta_1)
meta_2 <- meta::metagen(TE =dat$yi,seTE = dat$vi^.5, method.tau = 'REML',
method.tau.ci = 'BJ', comb.random = TRUE, comb.fixed = TRUE,
sm = 'SMD')
forest(meta_2)
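For a quick side-by-side check, it can help to print the heterogeneity statistics stored on both result objects. This is a minimal sketch; the component names are the ones metafor and meta document for their result objects, but they can shift between package versions, so verify them against yours:
# metafor: heterogeneity and pooled estimate from the rma object
meta_1$I2    # I^2 in percent
meta_1$tau2  # estimated between-study variance
coef(meta_1) # pooled estimate
# meta: the corresponding pieces from the metagen object
meta_2$I2        # I^2 (check whether your version stores a proportion or a percent)
meta_2$tau^2     # estimated between-study variance (tau is stored, so square it)
meta_2$TE.random # pooled random-effects estimate
If a gap in I^2 remains even after tau^2 and the pooled estimates match, it is plausibly because the two packages define I^2 differently (metafor derives it from the estimated tau^2 and a "typical" sampling variance, while meta derives it from the Q statistic); the respective help pages describe the formulas used.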

Related

Need help diagnosing cause of "Covariate matrix is singular" when estimating effect in structural topic model (stm)

First things first. I've saved my workspace and you can load it with the following command:
load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))
I have a number of abstract texts and I'm attempting to estimate a structural topic model to measure topic prevalence over time. The data contains a document id, abstract text, and year of publication.
I want to generate trends in expected topic proportion over time like the authors of the STM Vignette do here:
I'm able to create my topic model without issue, but when I attempt to run the estimateEffect() function from the stm package in R, I always get the "Covariate matrix is singular" warning mentioned in the title, and the resulting trend plots do not look right.
In the documentation, the authors note that
The function will automatically check whether the covariate matrix is singular which generally results from linearly dependent columns. Some common causes include a factor variable with an unobserved level, a spline with degrees of freedom that are too high, or a spline with a continuous variable where a gap in the support of the variable results in several empty basis functions.
I've tried a variety of different models, from a 2-topic solution all the way up to a 52-topic solution, always with the same result. If I remove the spline function from the "year" variable in my model and assume a linear fit, then estimateEffect() works just fine. So it must be an issue with the spline-transformed data; I just don't know what exactly.
Again, here's a link to my workspace:
load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))
And here is the code I'm using to get there:
library(udpipe)
library(dplyr) # data wrangling
library(readr) # import data
library(ggplot2) # viz
library(stm) # STM
library(tidytext) # Tf-idf
library(tm) # DTM stuff
library(quanteda) # For using ngrams in STM
rm(list = ls())
abstracts <- read_delim("Data/5528_demand_ta.txt",
delim = "\t", escape_double = FALSE,
col_names = TRUE, trim_ws = TRUE)
abstracts <- rename(abstracts, doc_id = cpid)
abstracts$doc_id <- as.character(abstracts$doc_id)
# Download english dictionary
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
# Interpret abstracts assuming English
x <- udpipe_annotate(ud_model, x = abstracts$abstract, doc_id = abstracts$doc_id)
x <- as.data.frame(x)
# Regroup terms
data <- paste.data.frame(x, term = "lemma", group = c("doc_id"))
data <- left_join(data, abstracts) %>%
rename(term = lemma) %>%
select(doc_id, term , year)
# Prepare text
processed <- textProcessor(documents = data$term,
metadata = data,
lowercase = TRUE,
removestopwords = TRUE,
removenumbers = TRUE,
removepunctuation = TRUE,
stem = FALSE)
out <- prepDocuments(processed$documents,
processed$vocab,
processed$meta,
lower.thresh = 20, # term must appear in at least n docs to be kept
upper.thresh = 1000) # I've been using about 1/3 of documents as an upper threshold
# Build model allowing tSNE to pick k (should result in 52 topics)
stm_mod <- stm(documents = out$documents,
vocab = out$vocab,
K = 0,
init.type = "Spectral",
prevalence = ~ s(year),
data = out$meta,
max.em.its = 500, # Max number of runs to attempt
seed = 831)
###################################################################################
########### If you loaded the workspace from my link, then you are here ###########
###################################################################################
# Estimate effect of year
prep <- estimateEffect(formula = 1:52 ~ s(year),
stmobj = stm_mod,
metadata = out$meta)
# Plot expected topic proportion
summary(prep, topics=1)
plot.estimateEffect(prep,
"year",
method = "continuous",
model = stm_mod,
topics = 5,
printlegend = TRUE,
xaxt = "n",
xlab = "Years")
A singular matrix simply means that you have linearly dependent rows or columns. The first thing you could do is check the determinant of the matrix: a singular matrix has a zero determinant, which means it cannot be inverted.
The next thing would be to identify the linearly dependent rows (or columns); you can do so using Smisc::findDepMat(X, rows = TRUE, tol = 1e-10) for rows and Smisc::findDepMat(X, rows = FALSE, tol = 1e-10) for columns. You MAY be able to alter the levels of tol in findDepMat() and etol in stm() to arrive at a solution, probably an unstable solution, but a solution.
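For concreteness, here is a rough sketch of those checks applied to this problem. Note that it approximates the spline basis that s(year) would create with splines::bs() and an assumed df of 10, so treat the exact matrix as an illustration rather than what estimateEffect() builds internally:
library(splines) # bs() used to approximate the s(year) spline basis
library(Smisc)   # findDepMat()
# Approximate covariate matrix for ~ s(year); df = 10 is an assumption
X <- cbind(1, bs(out$meta$year, df = 10))
# A singular covariate matrix shows up as a (near-)zero determinant of X'X
det(crossprod(X))
# Flag linearly dependent columns, e.g. empty basis functions caused by
# gaps in the observed years
findDepMat(X, rows = FALSE, tol = 1e-10)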

Is there a way to limit the number of features in IML package Shapley values?

I have an automl model created with the H2O package. Currently, H2O only calculates Shapley values on tree-based models. I've used the IML package to calculate the values on the AML model. However, because I have a large number of features, the plot is too jumbled to read. I'm looking for a way to select/show only the top X number of features. I can't find anything in the IML CRAN PDF nor in other documentation I've found by Googling.
#initiate h2o
h2o.init()
h2o.no_progress()
#create automl model (data cleaning and train/test split not shown)
set.seed(1911)
num_models <- 10
aml <- h2o.automl(y = label, x = features,
training_frame = train.hex,
nfolds = 5,
balance_classes = TRUE,
leaderboard_frame = test.hex,
sort_metric = 'AUCPR',
max_models = num_models,
verbosity = 'info',
exclude_algos = "DeepLearning", #exclude for reproducibility
seed = 27)
# 1. create a data frame with just the features
features_eval <- as.data.frame(test) %>% dplyr::select(-target)
# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(test$target))
# 3. Create custom predict function that returns the predicted values as a
# vector (probability of purchasing in our example)
pred <- function(model, newdata) {
results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
return(results[[3L]])
}
# example of prediction output
pred(aml, features_eval) %>% head()
#create predictor needed
predictor.aml <- Predictor$new(
model = aml,
data = features_eval,
y = response,
predict.fun = pred,
class = "classification"
)
high <- predict(aml, test.hex) %>% .[,3] %>% as.vector() %>% which.max()
high_prob_ob <- features_eval[high, ]
shapley <- Shapley$new(predictor.aml, x.interest = high_prob_ob, sample.size = 200)
plot(shapley, sort = TRUE)
Any suggestions/help appreciated.
Thank you,
Brian
I can offer a hacky solution that utilizes the fact that iml uses ggplot2 to plot.
N <- 10 # number of features to show
# Capture the ggplot2 object
p <- plot(shapley, sort = TRUE)
# Modify it so it shows only top N features
print(p + scale_x_discrete(limits=rev(p$data$feature.value[order(-p$data$phi)][1:N])))
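Alternatively, if you would rather not reach into the ggplot object, you can build the plot from the results data frame yourself. This is a sketch under the assumption that shapley$results contains feature.value and phi columns (which it does in current iml versions, but verify on your object):
library(ggplot2)
library(dplyr)
N <- 10 # number of features to show
# Keep the N features with the largest absolute Shapley value
top_feats <- shapley$results %>%
  arrange(desc(abs(phi))) %>%
  slice_head(n = N)
ggplot(top_feats, aes(x = reorder(feature.value, phi), y = phi)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Shapley value (phi)")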

MLR - calculating feature importance for bagged, boosted trees (XGBoost)

Good morning,
I have a question about calculating feature importance for bagged and boosted regression tree models with the MLR package in R. I am using XGBoost to make predictions, and I'm using bagging to estimate prediction uncertainty. My data set is relatively large: approximately 10k features and observations. The predictions work perfectly (see code below), but I can't seem to calculate feature importance (the last line in the code below). The importance function crashes with no errors and freezes the R session. I saw some related Python code, where people seem to calculate the importance for each of the bagged models here and here. I haven't been able to get that to work properly in R either. Specifically, I'm not sure how to access the individual models within the object produced by MLR (the mb object in the code below). In Python, this seems to be trivial. In R, I can't seem to extract mb$learner.model, which seems logically closest to what I need. So I'm wondering if anyone has experience with this issue?
Please see the code below
learn1 <- makeRegrTask(data = train.all , target= "resp", weights = weights1)
lrn.xgb <- makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals <- list( objective="reg:squarederror", eval_metric="error", nrounds=300, gamma=0, booster="gbtree", max.depth=6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 50, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag <- setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, learn1)
fimp1 <- getFeatureImportance(mb)
If you set bw.feats = 1, it might be feasible to average the feature importance values.
Basically, you just have to apply over all the single models that are stored in the HomogeneousEnsembleModel. Some extra care is necessary because the order of the features gets mixed up by the feature sampling, even though we set it to 100%.
library(mlr)
data = data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
data$y = with(data, x1 + 2 * x2 + 0.1 * x3 + rnorm(100))
task = makeRegrTask(data = data, target = "y")
lrn.xgb = makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals = list( objective="reg:squarederror", eval_metric="error", nrounds=50, gamma=0, booster="gbtree", max.depth=6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 10, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag = setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, task)
fimps = lapply(mb$learner.model$next.model, function(x) getFeatureImportance(x)$res)
fimp = fimps[[1]]
# we have to take extra care because the results are not ordered
for (i in 2:length(fimps)) {
fimp = merge(fimp, fimps[[i]], by = "variable")
}
rowMeans(fimp[,-1]) # only makes sense with bw.feats = 1
# [1] 0.2787052 0.4853880 0.2359068

rpart giving same results for cross-validation and no CV

Like the title says, I'm trying to run a decision tree both with and without cross-validation using the rpart package in R. I'm doing this using the xval parameter, as described in the vignette (https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
Unfortunately, I'm getting the same tree with and without CV. I've compared the calculation time for each, and the CV model takes about 10 times as long, so it's apparently doing something; I just can't figure out what.
I've also redone the model a number of times with different complexity parameters, but it hasn't made any difference.
Here's sample code that shows my problem: the printcp() outputs show the same results, and the predictions from both on the training and a hold-out set are the same.
library(rpart)
library(caret)
abalone <- read.csv(file = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',header = FALSE)
names(abalone) <- c("sex", "length", "diameter", "height", "whole_weight", "shucked_weight", "viscera_weight", "shell_weight", "rings")
train_set <- createDataPartition(abalone$sex, times = 1, p = 0.8, list = FALSE)
abalone_train <- abalone[train_set, ]  # base indexing; slice() would need dplyr, which isn't loaded here
abalone_test <- abalone[-train_set, ]
abalone_fit_noCV <- rpart(sex ~ .,
data = abalone_train,
method = "class",
parms = list(split = 'information'),
control = rpart.control(xval = 0,
cp = 0.005))
abalone_fit_CV <- rpart(sex ~ .,
data = abalone_train,
method = "class",
parms = list(split = 'information'),
control = rpart.control(xval = 10,
cp = 0.005))
printcp(abalone_fit_noCV)
printcp(abalone_fit_CV)
CV_pred <- predict(abalone_fit_CV, type = "class")
noCV_pred <- predict(abalone_fit_noCV, type = "class")
confusionMatrix(CV_pred, noCV_pred)
CV_pred <- predict(abalone_fit_CV, abalone_test, type = "class")
noCV_pred <- predict(abalone_fit_noCV, abalone_test, type = "class")
confusionMatrix(CV_pred, noCV_pred)
In true beginner fashion, I figured this out shortly after posting.
For anybody else coming upon this issue, it is basically answered on Cross Validated:
The final tree that is returned is still the initial tree. You must use the prune function using the cross-validation plot to choose the best subtree.
This is clear if you read the full Pruning the tree section of the vignette, rather than just the cross-validation section.
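For anyone who wants the concrete step, here is a minimal sketch of that pruning, using the usual "minimum cross-validated error" rule (picking cp via the 1-SE rule instead is a common variant):
# Pick the cp value with the lowest cross-validated error from the
# table that xval = 10 filled in
cp_table <- abalone_fit_CV$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
# Prune the full tree back to the corresponding subtree
abalone_fit_pruned <- prune(abalone_fit_CV, cp = best_cp)
printcp(abalone_fit_pruned)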

How to ensemble forecasts in R using weights

I have a couple forecasts and was trying to figure out how to merge the two according to some criterion that is widely used.
In part one, I split the data and compared the forecasts against the actual values using Forecast_comb.
library(forecast)
library(ForecastCombinations)
y1 = rnorm(100)
train = y1[1:90]
test = y1[91:100]
fit1 = auto.arima(train)
fit2 = ets(train)
forc1 = forecast(fit1, h = 10)$mean # note: the horizon argument is 'h', not 'n'
forc2 = forecast(fit2, h = 10)$mean
forc_all = cbind(forc1,forc2)
forc_all
?Forecast_comb
fitted <- Forecast_comb(obs = test ,fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted
In part two, I rebuilt the whole model on the entire data and forecast out by ten values. How can one merge the two values together according to some criterion?
fit3 = auto.arima(y1)
fit4 = ets(y1)
forc3 = forecast(fit3, h = 10)$mean # 10-step horizon so the rows line up with obs = y1[91:100] below
forc4 = forecast(fit4, h = 10)$mean
forc_all = cbind(forc3,forc4)
forc_all
fitted <- Forecast_comb(obs = y1[91:100] ,fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted
Thanks for the help
The reason that I am using ForecastCombinations is that it includes procedures for popular combination strategies. I thought that perhaps that function could be modified to perform the desired ensembling.
Based on a lot of Kaggle competitions where people share/discuss their scripts, I'd say that by far the most common and most effective way is simply to manually weight and add your predictions.
pacman::p_load(forecast)
pacman::p_load(ForecastCombinations)
y1 = rnorm(100)
train = y1[1:90]
test = y1[91:100]
fit1 = auto.arima(train)
fit2 = ets(train)
forc1 = forecast(fit1, h = 10)$mean # note: the horizon argument is 'h', not 'n'
forc2 = forecast(fit2, h = 10)$mean
forc_all = cbind(forc1,forc2)
forc_all
?Forecast_comb
fitted_1 <- Forecast_comb(obs = test ,fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted_1
fit3 = auto.arima(y1)
fit4 = ets(y1)
forc3 = forecast(fit3, h = 10)$mean # 10-step horizon so the rows line up with obs = y1[91:100] below
forc4 = forecast(fit4, h = 10)$mean
forc_all = cbind(forc3,forc4)
forc_all
fitted_2 <- Forecast_comb(obs = y1[91:100] ,fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted_2
# By far the most common way to combine/weight is simply:
fitted <- fitted_2*.5+fitted_1*.5
fitted
One might ask if you should use equal weights or how to know what to make the weights. This is usually determined by
(a) naive, equal weighting if that's all you have time for and it seems to work fine
(b) iterating with a holdout or cross-validation sample(s), being careful not to overfit
Some people try fancier approaches. It's easy to mess that up; however, if you get it right, it can lead to a better forecast.
The model-based and other fancier approaches involve things like adding another stage to the modeling process, in which your predictions on a holdout sample form the X matrix and the outcome variable is the actual y for that sample.
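As a rough sketch of that stacked second stage, reusing the objects from the snippet above (the plain lm() weighting here is my own illustrative choice, not something prescribed in this thread):
# Stage 2: regress the holdout actuals on the base-model forecasts made
# for that same holdout window (forc1, forc2 from above)
holdout_actual <- y1[91:100]
stack_fit <- lm(holdout_actual ~ forc1 + forc2)
# Reuse the learned intercept and weights on the full-sample forecasts
# (forc3, forc4 from above)
w <- coef(stack_fit)
forc_stacked <- w[1] + w[2] * forc3 + w[3] * forc4
forc_stacked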
Also, check out Erin LeDell's approach in h2oEnsemble.
