R predict() asking for variable excluded in lm() regression model

I want to fit a regression on two predictor ("x") variables while excluding the other columns present in a data frame.
For example:
df <- data.frame(name = c("Paul", "Charles", "Edward", "Iam"),
age = c(18, 20, 25, 30),
income = c( 1000, 2000, 2500, 3000),
workhours = c(35, 40, 45, 40))
regression <- lm(income ~ . -name, data = df)
I face a problem when I try to use the predict function. It demands information about the "name" variable:
predict(object = regression,
data.frame(age = 22, workhours = 36))
It gives the following error message:
Error in eval(predvars, data, env) : object 'name' not found
I've solved this problem by dropping the "name" column from the data passed to lm():
regression2 <- lm(income ~ . , data = df[, -1])
predict(object = regression2,
data.frame(age = 22, workhours = 36))
Since I have many variables I intend to exclude from the regression, is there a way to solve this inside the predict() function?

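We may use update() to refit the model. With . ~ . it expands and simplifies the stored formula (here to income ~ age + workhours), so predict() no longer asks for name: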
> regression <- update(regression, . ~ .)
> predict(object = regression,
+ data.frame(age = 22, workhours = 36))
1
1714.859
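
A complementary approach, sketched here under stated assumptions, is to build the formula from the column names so that excluded columns never enter the model terms. The exclude vector is a hypothetical name introduced for illustration; reformulate() is the base R helper from the stats package. This is one possible way to handle many excluded variables, not the only one.

```r
df <- data.frame(name = c("Paul", "Charles", "Edward", "Iam"),
                 age = c(18, 20, 25, 30),
                 income = c(1000, 2000, 2500, 3000),
                 workhours = c(35, 40, 45, 40))

# Hypothetical: columns that should not be used as predictors
exclude <- c("name")

# Every remaining column except the response becomes a predictor
predictors <- setdiff(names(df), c("income", exclude))

# Builds income ~ age + workhours without any "." or "-" terms
f <- reformulate(predictors, response = "income")

regression3 <- lm(f, data = df)

# predict() now only needs the columns named in the formula
pred <- predict(regression3, data.frame(age = 22, workhours = 36))
pred
```

Because the formula lists the predictors explicitly, the new data only has to supply age and workhours, and the prediction matches the 1714.859 shown above.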

Related

Getting error "invalid type (list) for variable" when running multiple models in a for loop: how to specify outcome/predictors?

For a study I am working on I need to create bootstrapped datasets and inverse probability weights for each dataset and then run a series of models for each of these datasets/weights. I am attempting to do this with a nested for-loop where the first part of the loop creates the weights and the nested loop runs a series of models, each with different outcome variables and/or predictors. I am running about 80 models for each bootstrapped dataset, hence the need for a more automated way to do this. Below is an example of what I am doing with some mock data:
library(MASS) # provides polr()
# Creation of mock data
data <- data.frame("Severity" = as.factor(c(rep("None", 25), rep("Mild", 25), rep("Moderate", 25), rep("Severe", 25))), "Severity2" = as.factor(c(rep("None", 40), rep("Mild", 20), rep("Moderate", 20), rep("Severe", 20))), "Weight" = rnorm(100, mean = 160, sd = 30), "Age" = rnorm(100, mean = 40, sd = 7), "Gender" = as.factor(rbinom(100, size = 1, prob = 0.5)), "Tested" = as.factor(rbinom(100, size = 1, prob = 0.4)))
# Set untested rows to NA (ifelse() on a factor would return integer codes, not labels)
data$Severity[data$Tested == 0] <- NA
data$Severity2[data$Tested == 0] <- NA
data$Severity <- ordered(data$Severity, levels = c("None", "Mild", "Moderate", "Severe"))
data$Severity2 <- ordered(data$Severity2, levels = c("None", "Mild", "Moderate", "Severe"))
# Creating bootstrapped datasets
nboot <- 2
set.seed(10)
boot.samples <- lapply(1:nboot, function(i) {
data[base::sample(1:nrow(data), replace = TRUE),]
})
# Create empty list to store results later
coefs <- list()
# Setting up the outcomes/predictors of each of the models I will run
mod1 <- list(outcome = "Severity", preds = c("Weight", "Age"))
mod2 <- list(outcome = "Severity2", preds = c("Weight", "Age", "Gender"))
models <- list(mod1, mod2)
# Running the for-loop
for(i in 1:length(boot.samples)) {
#Setting up weight creation
null <- glm(formula = Tested ~ 1, family = "binomial", data = boot.samples[[i]])
full <- glm(formula = Tested ~ Age, family = "binomial", data = boot.samples[[i]])
step <- step(null, k = 2, direction = "forward", scope=list(lower = null, upper = full), trace = 0)
pd.combined <- stats::predict(step, type = "response")
numer.combined <- glm(Tested ~ 1, family = "binomial",
data = boot.samples[[i]])
pn.combined <- stats::predict(numer.combined, type = "response")
# Creating stabilized weights
boot.samples[[i]]$ipw <- ifelse(boot.samples[[i]]$Tested==0, ((1-pn.combined)/(1-pd.combined)), (pn.combined)/(pd.combined))
# Now running each model and storing the coefficients
for(j in 1:length(models)) {
outcome <- models[[j]][[1]] # Set the outcome name
predictors <- models[[j]][[2]] # Set the predictor names
model_results <- polr(boot.samples[[i]][,outcome] ~ boot.samples[[i]][, predictors], weights = boot.samples[[i]]$ipw, method = c("logistic"), Hess = TRUE) #Run the model
coefs[[j]] <- model_results$coefficients # Store regression model coefficients in list
}
}
The portion for creating the IPW weights works just fine, but I keep getting an error for the modeling portion that reads:
"Error in model.frame.default(formula = boot.samples[[i]][, outcome] ~ :
invalid type (list) for variable 'boot.samples[[i]][, predictors]'"
Based on the question asked and answered here: Error in model.frame.default ..... : invalid type (list) for variable, I know that the issue is with how I'm calling the outcomes and predictors in the model. I've tried many different ways to handle this, to no avail. I need to specify the outcome and predictors this way because in my actual models the outcomes and predictors change with each model. Any ideas on how to deal with this would be greatly appreciated!
I've also tried setting outcome <- boot.samples[[i]][,outcome] outside of the model and then just calling outcome in the model, but that gives me the same error.
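
One way to avoid passing a data frame as a single model term is to build a formula from the stored names and hand the data set to polr() through its data argument. The sketch below assumes simplified mock data; the ipw column holds placeholder weights rather than real inverse probability weights, and fit_one() is a hypothetical helper name. It also stores results per bootstrap sample and per model, so later samples do not overwrite earlier ones.

```r
library(MASS)  # polr()

set.seed(10)
n  <- 200
lv <- c("None", "Mild", "Moderate", "Severe")

# Simplified mock data (assumption: outcomes are unrelated to the predictors)
dat <- data.frame(
  Severity  = factor(sample(lv, n, replace = TRUE), levels = lv, ordered = TRUE),
  Severity2 = factor(sample(lv, n, replace = TRUE), levels = lv, ordered = TRUE),
  Weight    = rnorm(n, mean = 160, sd = 30),
  Age       = rnorm(n, mean = 40, sd = 7),
  Gender    = factor(rbinom(n, size = 1, prob = 0.5)),
  ipw       = runif(n, 0.8, 1.2)  # placeholder weights, not real IPW
)

# Named list elements make the model specifications self-describing
models <- list(
  list(outcome = "Severity",  preds = c("Weight", "Age")),
  list(outcome = "Severity2", preds = c("Weight", "Age", "Gender"))
)

# Hypothetical helper: build outcome ~ pred1 + pred2 + ... and fit it on d
fit_one <- function(spec, d) {
  f <- reformulate(spec$preds, response = spec$outcome)
  # Fractional weights may trigger a "non-integer #successes" warning
  polr(f, data = d, weights = ipw, method = "logistic", Hess = TRUE)
}

boot.samples <- lapply(1:2, function(i) dat[sample(nrow(dat), replace = TRUE), ])

# coefs[[i]][[j]] holds the coefficients of model j on bootstrap sample i
coefs <- lapply(boot.samples, function(d) {
  lapply(models, function(spec) coef(fit_one(spec, d)))
})
```

Because the formula refers to columns by name and the weights are looked up in the same data frame, the model frame contains ordinary vectors instead of a list-valued term.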

How do I add the difference in proportions among the levels of a categorical variable in R using tbl_svysummary?

I would like to reproduce the following table (image: Desired table). However, I can't figure out how to add the p-value next to the statistics. The p-value here compares the difference in proportions between the two groups for each level. I'm using this dataset from the questionr library in RStudio. I tried add_difference(), but it doesn't do what I expected. Here is the R code for what I've done so far:
library(questionr)
library(survey) # svydesign()
library(gtsummary) # tbl_svysummary()
data(hdv2003)
d <- hdv2003
d$sport2[d$sport == "Oui"] <- TRUE
d$grpage <- cut(d$age, c(16, 25, 45, 65, 99), right = FALSE, include.lowest = TRUE)
d$etud <- d$nivetud
levels(d$etud) <- c(
"Primaire", "Primaire", "Primaire",
"Secondaire", "Secondaire", "Technique/Professionnel",
"Technique/Professionnel", "Supérieur"
)
d$etud <- forcats::fct_explicit_na(d$etud, "manquant")
d$sexe <- relevel(d$sexe, "Femme")
dw <- svydesign(ids = ~1, data = d, weights = ~poids)
dw %>%
tbl_svysummary(by = sexe,
include = c(sport,sexe , grpage, etud, relig, heures.tv ))

Generalized estimating equations working by themselves but not within functions (R)

I am trying to write a function to run GEE using the geepack package. It works fine "on its own" but not within a function, please see example below:
library(geepack)
library(pstools)
df <- data.frame(study_id = c(1:20),
leptin = runif(20),
insulin = runif(20),
age = runif(20, min = 20, max = 45),
sex = sample(c(0,1), size = 20, replace = TRUE))
#Works
geepack::geeglm(leptin ~ insulin + age + sex, id = study_id, data = df)
#Doesn't work
model_function_covariates_gee <- function(x,y) {
M1 <- paste0(x, "~", y, "+ age + sex")
M1_fit <- geepack::geeglm(M1, id = study_id, data = df)
s <- summary(M1_fit)
return(s)
}
model_function_covariates_gee("leptin", "insulin")
Error message:
Error in mcall$formula[3] <- switch(match(length(sformula), c(0, 2, 3)), :
incompatible types (from language to character) in subassignment type fix
Does anyone know why this is? I've fiddled around with it but can't get it to change. Thanks in advance.
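
The error message mentions a conversion "from language to character", which suggests that geeglm() received a character string where it expected a formula object: paste0() returns a string, while the call that works passes a real formula. A minimal sketch of that idea follows. The helper name gee_formula() and the data argument are assumptions added for illustration, and the geeglm() call is guarded so the formula-building part runs even when geepack is not installed.

```r
set.seed(1)
df <- data.frame(study_id = 1:20,
                 leptin = runif(20),
                 insulin = runif(20),
                 age = runif(20, min = 20, max = 45),
                 sex = sample(c(0, 1), size = 20, replace = TRUE))

# Hypothetical helper: turn the pasted string into a formula object
gee_formula <- function(x, y, covariates = c("age", "sex")) {
  as.formula(paste0(x, " ~ ", y, " + ", paste(covariates, collapse = " + ")))
}

model_function_covariates_gee <- function(x, y, data) {
  f <- gee_formula(x, y)
  fit <- geepack::geeglm(f, id = study_id, data = data)
  summary(fit)
}

f <- gee_formula("leptin", "insulin")
class(f)  # "formula" rather than "character"

# Only attempt the fit when geepack is available
if (requireNamespace("geepack", quietly = TRUE)) {
  print(model_function_covariates_gee("leptin", "insulin", df))
}
```

Passing the data frame as an argument, rather than relying on a global df, also makes the function easier to reuse with other data sets.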

Error in R: Non-conformable arrays, how to fix?

I am trying to create an effect plot for a Cox proportional hazards model:
library(survival) # coxph(), Surv(), strata()
library(splines) # ns()
fitC7 <- coxph(Surv(TimeDeath, event == 1) ~
strata(sex) * mutation + age
+ ns(BM1, 3),
data = data)
I created a new dataset as follows:
ND1a <- with(data, expand.grid(age = seq(30, 75, length.out = 40), mutation = factor(c("Yes", "No")), sex = factor(c("male", "female")), BM1 = 1.583926))
Then, I tried to use the predict function:
predict(fitC7, newdata = ND1a, type = "lp", se.fit = T)
However, I keep getting the error:
Error in newx - xmeans[match(newstrat, row.names(xmeans)), ] : non-conformable arrays
and I do not know how to correct this.
It does work when I put in a model without sex as a stratifier, e.g.,
fitC9 <- coxph(Surv(TimeDeath, event ==1) ~
sex * mutation + age +
ns(BM1, 3), data = data)
I hope someone can help me, I could not figure it out with previous question and answer threads.

Am I using xgboost() correctly (in R)?

I'm a beginner with machine learning (and also R). I've figured out how to run some basic linear regression, elastic net, and random forest models in R and have gotten some decent results for a regression project (with a continuous dependent variable) that I'm working on.
I've been trying to learn how to use the gradient boosting algorithm and, in particular, the xgboost() command. My results are way worse here, though, and I'm not sure why.
I was hoping someone could take a look at my code and see if there are any glaring errors.
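# Packages used below: dplyr for select(), mutate(), summarize() and %>%; ggplot2 for the plot
library(dplyr)
library(ggplot2)
# Create training data with and without the dependent variable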
train <- data[1:split, ]
train.treat <- select(train, -c(y))
# Create test data with and without the dependent variable
test <- data[(split+1):nrow(data), ]
test.treat <- select(test, -c(y))
# Load the package xgboost
library(xgboost)
# Run xgb.cv
cv <- xgb.cv(data = as.matrix(train.treat),
label = train$y,
nrounds = 100,
nfold = 10,
objective = "reg:linear",
eta = 0.1,
max_depth = 6,
early_stopping_rounds = 10,
verbose = 0 # silent
)
# Get the evaluation log
elog <- cv$evaluation_log
# Determine and print how many trees minimize training and test error
elog %>%
summarize(ntrees.train = which.min(train_rmse_mean), # find the index of min(train_rmse_mean)
ntrees.test = which.min(test_rmse_mean)) # find the index of min(test_rmse_mean)
# The number of trees to use, as determined by xgb.cv
ntrees <- 25
# Run xgboost
model_xgb <- xgboost(data = as.matrix(train.treat), # training data as matrix
label = train$y, # column of outcomes
nrounds = ntrees, # number of trees to build
objective = "reg:linear", # objective
eta = 0.001,
depth = 10,
verbose = 0 # silent
)
# Make predictions
test$pred <- predict(model_xgb, as.matrix(test.treat))
# Plot predictions vs actual values of y
ggplot(test, aes(x = pred, y = y)) +
geom_point() +
geom_abline()
# Calculate RMSE
test %>%
mutate(residuals = y - pred) %>%
summarize(rmse = sqrt(mean(residuals^2)))
How does this look?
Also, one thing I don't get about xgboost() is why I have to take out the dependent variable from the dataset in the "data" option and then add it back in the "label" option. Why do we do this?
My dataset has 809 observations and 108 independent variables. Here is an arbitrary subset:
structure(list(year = c(2019, 2019, 2019, 2019), ht = c(74, 76,
74, 73), wt = c(223, 234, 215, 215), age = c(36, 29, 32, 24),
gp_l1 = c(16, 16, 11, 14), gp_l2 = c(7, 0, 16, 0), gp_l3 = c(16,
15, 16, 0), gs_l1 = c(16, 16, 11, 13), gs_l2 = c(7, 0, 16,
0), gs_l3 = c(16, 15, 16, 0), cmp_l1 = c(372, 430, 226, 310
), cmp_l2 = c(154, 0, 297, 0), cmp_l3 = c(401, 346, 364,
0), att_l1 = c(597, 639, 365, 486), y = c(8, 71.5, 26, 22
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
My RMSE from this xgboost() model is 31.7, whereas my random forest and glmnet models give RMSEs around 13. The prediction metric I'm comparing to has an RMSE of 15.5. I don't get why my xgboost() model does so much worse than my random forest and glmnet models.
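
One visible difference in the code above is that the xgb.cv() call uses eta = 0.1 and max_depth = 6, while the final xgboost() call uses eta = 0.001 and depth = 10 with ntrees fixed at 25. A learning rate that small combined with so few rounds may leave the predictions close to their starting value. The sketch below is an idealized calculation, not an xgboost run: it assumes a squared-error objective and that every tree fits the current residual exactly, so each round removes a fraction eta of the remaining gap. The function name gap_closed() is hypothetical.

```r
# Fraction of the gap between the initial prediction and the target that is
# closed after nrounds boosting rounds, if each tree fit the residual exactly
gap_closed <- function(eta, nrounds) {
  1 - (1 - eta)^nrounds
}

gap_closed(eta = 0.001, nrounds = 25)  # roughly 0.025
gap_closed(eta = 0.1,   nrounds = 25)  # roughly 0.93
```

Under these assumptions, 25 rounds at eta = 0.001 close only about 2.5% of the gap, compared with about 93% at eta = 0.1, so it may be worth refitting with the same eta and depth settings that were cross-validated and taking the number of rounds from the evaluation log.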

Resources