How to fix incompatible matrix dimensions using mvJointModelBayes()?

I'm trying to fit a joint model of longitudinal and time-to-event data using the JMbayes package, to predict risk of cardiac arrest as more symptom data becomes available over time. To start, I am fitting a univariate model, but I aim to incorporate a number of longitudinal outcomes once I have the code running, which is why I'm using the mvJointModelBayes() function.
However, when I try to run the function I come across the error below.
Error in { :
task 1 failed - "addition: incompatible matrix dimensions: 500x1 and 3000x1"
I have used the same code as provided in the mvJMBayes vignette using the pbc2 data, adapted to my dataset, but keep encountering the error. I can't find any obvious way in which my data frames differ from the pbc2 dataset that would cause the error.
library(tidyverse)
library(JMbayes)
library(lattice)
library(MASS)
# SIMULATE DATA AND SHAPE FOR MODELLING -------------------
id <- 1:500
gender <- sample(c('Male','Female'), 500, replace = TRUE, prob = c(0.51, 0.49))
status <- sample(c(0,1), 500, replace = TRUE, prob = c(0.9, 0.1))
survival_days <- rnorm(500)
angina1 <- sample(c(0,1), 500, replace = TRUE, prob = c(0.9, 0.1))
angina2 <- sample(c(0,1), 500, replace = TRUE, prob = c(0.8, 0.2))
angina3 <- sample(c(0,1), 500, replace = TRUE, prob = c(0.7, 0.3))
# Wide format for survival modelling. Single row per patient.
data_wide <- as.data.frame(
cbind(id, gender, status, survival_days, angina1, angina2, angina3))
# Recode factor levels
data_wide$survival_days<- as.numeric(data_wide$survival_days)
data_wide$status <- as.numeric(data_wide$status)
# Long format for longitudinal modelling.
data_long <- data_wide %>% gather(angina1:angina3, key = "timepoint", value = "angina")
data_long$timepoint <- str_replace(data_long$timepoint,"angina","")
data_long <- data_long %>% mutate(angina = as.factor(angina), timepoint = as.numeric(timepoint))
# MODELLING ---------------------------------------------------
set.seed(123)
mixed_model_fit <- mvglmer(list(angina ~ timepoint + (timepoint | id)), data = data_long, families = list(binomial))
cox_fit <- coxph(Surv(survival_days, status) ~ gender, data = data_wide, model = TRUE)
JMFit <- mvJointModelBayes(mixed_model_fit, cox_fit, timeVar = "timepoint")
The last part of the code above returns:
Error in { :
task 1 failed - "addition: incompatible matrix dimensions: 500x1 and 3000x1"
Is anybody able to shed light on how to fix this error?

I found out that it works if id is a numeric variable instead of a factor and if id occurs in the same order in both data sets. Running the following code before model fitting solves the issue:
data_long <- data_long %>%
mutate(id = as.numeric(id)) %>%
arrange(id)
data_wide <- data_wide %>%
mutate(id = as.numeric(id)) %>%
arrange(id)
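After aligning id this way, re-running the same fitting calls from the question should go through without the dimension error. A quick re-run sketch (the same code as shown above, nothing new added):
# Refit after converting id to numeric and sorting both data frames
set.seed(123)
mixed_model_fit <- mvglmer(list(angina ~ timepoint + (timepoint | id)),
                           data = data_long, families = list(binomial))
cox_fit <- coxph(Surv(survival_days, status) ~ gender, data = data_wide, model = TRUE)
JMFit <- mvJointModelBayes(mixed_model_fit, cox_fit, timeVar = "timepoint")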

Related

How do I add difference in proportions among the levels of a categorical variable in R using tbl_svysummary?

I would like to reproduce the following table (desired table shown in the original post). However, I can't figure out how to add the p-value next to the statistics. The p-value here compares the difference in proportions between the levels of the two groups. I'm using the hdv2003 dataset from the questionr package in RStudio. I tried add_difference(), but it doesn't do what I expected. Here is my R code of what I've done so far:
library(questionr)
library(survey)     # for svydesign()
library(gtsummary)  # for tbl_svysummary()
library(dplyr)      # for %>%
data(hdv2003)
d <- hdv2003
d$sport2[d$sport == "Oui"] <- TRUE
d$grpage <- cut(d$age, c(16, 25, 45, 65, 99), right = FALSE, include.lowest = TRUE)
d$etud <- d$nivetud
levels(d$etud) <- c(
"Primaire", "Primaire", "Primaire",
"Secondaire", "Secondaire", "Technique/Professionnel",
"Technique/Professionnel", "Supérieur"
)
d$etud <- forcats::fct_explicit_na(d$etud, "manquant")
d$sexe <- relevel(d$sexe, "Femme")
dw <- svydesign(ids = ~1, data = d, weights = ~poids)
dw %>%
tbl_svysummary(by = sexe,
include = c(sport,sexe , grpage, etud, relig, heures.tv ))
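One avenue that might work (an untested sketch; it assumes gtsummary's add_p() method for tbl_svysummary objects picks a suitable survey-weighted test for each variable) is to append add_p() to the same pipeline:
# Sketch: add a p-value column comparing the `sexe` groups
dw %>%
  tbl_svysummary(by = sexe,
                 include = c(sport, sexe, grpage, etud, relig, heures.tv)) %>%
  add_p()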

How to add a covariate to a between-within design MANCOVA with R?

I want to run a between-within design MANCOVA with R, with two dependent variables (Planned and Unplanned), two between-subject variables (Genre [Male, Female] and Urb [Yes, No]), one within-subject variable (Period [Before, During]), and one covariate (BMI).
Here is what I've done (see here for similar calculation: https://stats.stackexchange.com/questions/183441/correct-way-to-perform-a-one-way-within-subjects-manova-in-r):
library(car)  # for Anova()
# Create dummy data
data <- data.frame(Quest_before_planned = sample(1:100, 10),
Quest_during_planned = sample(1:100, 10),
Quest_before_unplanned = sample(1:100, 10),
Quest_during_unplanned = sample(1:100, 10),
Genre = sample(rep(c("Male", "Female"), each = 5)),
Urb = sample(rep(c("Yes", "No"), each = 5)),
BMI = sample(1:100, 10))
# Define the within-subjects factor
period <- as.factor(rep(c('before','during'), each = 2))
idata <- data.frame(period)
# Create the data structure for the linear model
data.model <- with(data, cbind(Quest_before_planned, Quest_during_planned,
Quest_before_unplanned, Quest_during_unplanned))
# Build the multivariate-linear model
mod.mlm <- lm(data.model ~ Genre * Urb, data = data)
# Run the MANOVA
mav.blpaq <- Anova(mod.mlm, idata = idata, idesign = ~ period, type = 2)
print(mav.blpaq)
Thus, the between-within design MANOVA here works well. However, I failed to add a covariate (i.e., BMI) to this model. Do you know how I can achieve this?
N.B.: I also tried using the (great) mancova() function, which includes a covariate parameter; but with this function, I do not know how to specify that Period is a within-subject variable...
blpaq_macov <- mancova(data_tidy,
deps = c("Quest_planned", "Quest_unplanned"),
factors = c("Genre", "Period", "Urb"),
covs = "BMI",
multivar = "pillai")
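For the lm()/Anova() route above, one possibility (an untested sketch; it assumes BMI enters only the between-subjects side of the design) is to add the covariate to the multivariate model formula and run the same Anova() call:
# Sketch: BMI as a between-subjects covariate in the multivariate linear model
mod.mlm.cov <- lm(data.model ~ BMI + Genre * Urb, data = data)
mav.blpaq.cov <- Anova(mod.mlm.cov, idata = idata, idesign = ~ period, type = 2)
print(mav.blpaq.cov)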

How to evaluate LightGBM in R using cohen's kappa?

I use XGBoost in R on a regular basis and want to start using LightGBM on the same data. My goal is to use Cohen's kappa as the evaluation metric. However, I am not able to get LightGBM working properly - it seems that no learning occurs. As a very simple example, I'll use the titanic dataset.
library(data.table)
library(dplyr)
library(caret)
titanic <- fread("https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv")
titanic_complete <- titanic %>%
select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>%
mutate_if(is.character, as.factor) %>%
mutate(survived = as.factor(survived)) %>%
na.omit()
train_class <- titanic_complete %>%
select(survived) %>%
pull()
train_numeric <- titanic_complete %>%
select_if(is.numeric) %>%
data.matrix()
ctrl <- trainControl(method = "none", search = "grid")
tune_grid_xgbTree <- expand.grid(
nrounds = 700,
eta = 0.1,
max_depth = 3,
gamma = 0,
colsample_bytree = 0,
min_child_weight = 1,
subsample = 1)
set.seed(512)
fit_xgb <- train(
x = train_numeric,
y = train_class,
tuneGrid = tune_grid_xgbTree,
trControl = ctrl,
method = "xgbTree",
metric = "Kappa",
verbose = TRUE)
confusionMatrix(predict(fit_xgb, train_numeric), train_class)
This gives me a Kappa of 0.57 evaluated on the training set (which is only to show my problem; otherwise I would use cross-validation).
For LightGBM, I write Kappa as a custom evaluation function:
library(lightgbm)
lgb.kappa <- function(preds, y) {
label <- getinfo(y, "label")
k <- unlist(e1071::classAgreement(table(label, preds)))["kappa"]
return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
X_train <- titanic_complete %>% select(-survived) %>% data.matrix()
y_train <- titanic_complete %>% select(survived) %>% data.matrix()
y_train <- y_train - 1
dtrain <- lgb.Dataset(data = X_train, label = y_train)
Here, I use the same parameter set as in XGBoost, but I tried different combinations without success.
fit_lgbm <- lgb.train(data = dtrain,
objective = "binary",
learning_rate = 0.1,
nrounds = 700,
colsample_bytree = 0,
eval = lgb.kappa,
min_child_weight = 1,
max_depth = 3)
No learning occurs and the algorithm outputs "No further splits with positive gain, best gain: -inf" and Kappa = 0.
If someone has successfully implemented LightGBM (maybe with a custom evaluation metric), I would be very happy for a hint on how to resolve this.
Regarding "No learning occurs and the algorithm outputs 'No further splits with positive gain, best gain: -inf'":
This is because LightGBM's default parameter values are configured for larger datasets. The training dataset in your example above only has 714 rows. To deal with this, I recommend setting LightGBM's parameters to values that permit smaller leaf nodes, and limiting the number of leaves instead of the depth.
list(
"min_data_in_leaf" = 3
, "max_depth" = -1
, "num_leaves" = 8
)
Regarding "and Kappa = 0":
I believe your implementation of Cohen's kappa has a mistake. The input to e1071::classAgreement() is expected to be a table of counts (a confusion matrix), but preds is a vector of predicted probabilities. I think the implementation below is correct, based on the description of this metric on Wikipedia.
lgb.kappa <- function(preds, dtrain) {
label <- getinfo(dtrain, "label")
threshold <- 0.5
thresholded_preds <- as.integer(preds > threshold)
k <- unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]
return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
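As a quick standalone check of the thresholding idea (a toy sketch independent of LightGBM; the numbers are made up for illustration):
# Toy example: threshold predicted probabilities, then compute kappa from the count table
label <- c(0, 0, 1, 1, 1, 0)
preds <- c(0.1, 0.3, 0.8, 0.7, 0.4, 0.6)
thresholded_preds <- as.integer(preds > 0.5)
unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]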
Finally, I think 700 iterations is probably too many for a dataset of roughly 700 observations. You can see the value of metrics evaluated against the training data at each iteration by passing the training data as a validation set.
Taken together, I think the code below accomplishes what the original question asked for.
library(data.table)
library(dplyr)
library(caret)
library(lightgbm)
titanic <- fread("https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv")
titanic_complete <- titanic %>%
select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>%
mutate_if(is.character, as.factor) %>%
mutate(survived = as.factor(survived)) %>%
na.omit()
train_class <- titanic_complete %>%
select(survived) %>%
pull()
train_numeric <- titanic_complete %>%
select_if(is.numeric) %>%
data.matrix()
lgb.kappa <- function(preds, dtrain) {
label <- getinfo(dtrain, "label")
threshold <- 0.5
thresholded_preds <- as.integer(preds > threshold)
k <- unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]
return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
X_train <- titanic_complete %>% select(-survived) %>% data.matrix()
y_train <- titanic_complete %>% select(survived) %>% data.matrix()
y_train <- y_train - 1
# train, printing out eval metrics at every iteration
fit_lgbm <- lgb.train(
data = lgb.Dataset(
data = X_train,
label = y_train
),
params = list(
"min_data_in_leaf" = 3
, "max_depth" = -1
, "num_leaves" = 8
),
objective = "binary",
learning_rate = 0.1,
nrounds = 10L,
verbose = 1L,
valids = list(
"train" = lgb.Dataset(
data = X_train,
label = y_train
)
),
eval = lgb.kappa
)
# evaluate a custom function after training
fit_lgbm$eval_train(
feval = lgb.kappa
)

r studio caret train factor has no data

I'm working with a dataset that I'm training with the caret package. My class variable has 7 levels, which I label using the dataset documentation. It turns out that one of the levels has no data whatsoever in the dataset, and I'm getting the following error:
Error in train.default(x, y, weights = w, ...) :
One or more factor levels in the outcome has no data: 'vwnfp'
The easy way would be to just get rid of that level, and that should work. But I'm wondering whether the caret package has any parameter that can handle this type of situation. I did try adding na.action = 'na.omit'. I also wonder whether the preProcess argument can handle this, but I have never used preProcess before and my attempts were unsuccessful. Here is my code to train the data...
fit.control <- trainControl(method = 'cv', number = 10)
grid <- expand.grid(cp = seq(0, 0.05, 0.005))
trained.tree <- train(Type_of_glass ~ ., data = data.train, method = 'rpart',
trControl = fit.control, metric = 'Accuracy', maximize = TRUE,
tuneGrid = grid, na.action = 'na.omit')
The dataset is in the following url... http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data
This is the code I'm utilizing to manipulate the dataset...
# Loading dataset and transform
data <- read.csv(file = 'data.csv', header = FALSE)
colnames(data) <- c('Id', 'Ri', 'Na', 'Ma', 'Al',
'Si', 'K', 'Ca', 'Ba', 'Fe',
'Type_of_glass')
str(data)
data <- subset(data, select = -Id)
data$Type_of_glass <- factor(data$Type_of_glass,
levels = c(1, 2, 3, 4, 5, 6, 7),
labels = c('bwfp', 'bwnfp', 'vwfp', 'vwnfp',
'c', 't', 'h'))
str(data)
# Spliting training and test dataset
set.seed(2)
sample.train <- sample(1:nrow(data), nrow(data) * .8)
sample.test <- setdiff(1:nrow(data), sample.train)
data.train <- data[sample.train, ]
data.test <- subset(data[sample.test, ], select = -Type_of_glass)
I don't want to manually get rid of the level because in production, after training, the unseen dataset is passed through the model as is. How can I handle this situation in the dataset?
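For reference, the "easy way" mentioned above would look roughly like this (a sketch; it only drops the empty level from the training outcome, so scoring unseen feature data afterwards is unaffected):
# Sketch: drop factor levels with zero observations from the training outcome, then train as before
data.train$Type_of_glass <- droplevels(data.train$Type_of_glass)
trained.tree <- train(Type_of_glass ~ ., data = data.train, method = 'rpart',
                      trControl = fit.control, metric = 'Accuracy', maximize = TRUE,
                      tuneGrid = grid, na.action = 'na.omit')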

Passing data to forecast.lm using dplyr and do

I am having trouble passing data to forecast.lm inside a dplyr do(). I want to make several models based on a factor - hour - and then forecast with these models using new data.
Building on previous excellent examples, here is my example data:
require(dplyr)
require(forecast)
# Training set
df.h <- data.frame(
hour = factor(rep(1:24, each = 100)),
price = runif(2400, min = -10, max = 125),
wind = runif(2400, min = 0, max = 2500),
temp = runif(2400, min = - 10, max = 25)
)
# Forecasting set
df.f <- data.frame(
hour = factor(rep(1:24, each = 10)),
wind = runif(240, min = 0, max = 2500),
temp = runif(240, min = - 10, max = 25)
)
# Bind training & forecasting
df <- rbind(df.h, data.frame(df.f, price=NA))
# Do a training model and then forecast using the new data
res <- group_by(df, hour) %>% do({
hist <- .[!is.na(.$price), ]
fore <- .[is.na(.$price), c('hour', 'wind', 'temp')]
fit <- Arima(hist$price, xreg = hist[,3:4], order = c(1,1,0))
data.frame(fore[], price=forecast.Arima(fit, xreg = fore[ ,2:3])$mean)
})
res
This works excellently with a time series model, but with lm I have trouble passing the data into the forecasting part.
My corresponding lm example looks like this:
res <- group_by(df, hour) %>% do({
hist <- .[!is.na(.$price), ]
fore <- .[is.na(.$price), c('hour', 'wind', 'temp')]
fit <- lm(hist$price ~ wind + temp, data = hist)
data.frame(fore[], price = forecast.lm(fit, newdata = fore[, 2:3])$mean)
})
The problem is that I can't get data into the newdata argument. If you add hist$ in the fit call, you can't reference the forecast data, and for some reason if you add data = fore it can't find it - but it can in the time series example.
The problem is that forecast.lm expects that fit has a data component. If you use glm or tslm, that is true. But lm objects don't generally have a data component. So you need to manually add fit$data <- hist for forecast.lm to work properly.
res <- group_by(df, hour) %>% do({
hist <- .[!is.na(.$price), ]
fore <- .[is.na(.$price), c('hour', 'wind', 'temp')]
fit <- lm(price ~ wind + temp, data = hist)
fit$data <- hist # have to add data manually
data.frame(fore[], price = forecast.lm(fit, newdata = fore[, 2:3])$mean)
})
This is actually a known issue.
