Extracting and exchanging terms within an R formula object - r

I would like to create a graphing function in R which takes a formula as an argument, e.g.:
my.plot(sqrt(Sepal.Width) ~ Petal.Width + log(Petal.Length) + Species + Petal.Width * Petal.Length, .data = iris)
And then
Perform a model fit with the first predictor term exchanged for another vector created within the function.
Use the outcome term and the first predictor term for an overlying plot.
Allow interaction and crossing terms, and use of the . symbol denoting all other variables in the data frame.
Handle the case where only 1 predictor term is provided - e.g. Sepal.Width ~ Petal.Width.
R pseudocode for a highly simplified example:
library("formula.tools")
my.plot <- function(.formula, .data) {
outcome.term <- lhs(.formula)
first.predictor.term <- rhs(.formula)[1]
new.formula <- outcome.term ~ 1:nrow(.data) + rhs(.formula)[-1]
my.fit <- lm(new.formula, data = .data)
my.predict <- predict(my.fit)
plot(first.predictor.term, outcome.term, data = .data)
lines(first.predictor.term, my.predict, data = .data)
}

You could accomplish the same using Base R:
my.plot <- function(.formula, .data) {
outcome.term <- deparse(.formula[[2]])
first.predictor.term <- .formula[[3]]
len <- length(first.predictor.term) > 1
if (len) first.predictor.term <- .formula[[3]][[2]]
if (len) .formula[[3]][[2]] <- substitute(new_variable)
else .formula[[3]] <- substitute(new_variable)
.data['new_variable'] <- 1:nrow(.data)
my.fit <- lm(.formula, data = .data)
my.predict <- predict(my.fit)
f <- reformulate(deparse(first.predictor.term), outcome.term)
plot(f, data = .data, ty = "p")
}

Related

How can I loop a list of models to get slope estimate

I have a list of models as specified by the following code:
varlist <- list("PRS_Kunkle", "PRS_Kunkle_e07",
"PRS_Kunkle_e06","PRS_Kunkle_e05", "PRS_Kunkle_e04",
"PRS_Kunkle_e03", "PRS_Kunkle_e02", "PRS_Kunkle_e01",
"PRS_Kunkle_e00", "PRS_Jansen", "PRS_deroja_KANSL")
PRS_age_pacc3 <- lapply(varlist, function(x) {
lmer(substitute(z_pacc3_ds ~ i*AgeAtVisit + i*I(AgeAtVisit^2) +
APOE_score + gender + EdYears_Coded_Max20 +
VisNo + famhist + X1 + X2 + X3 + X4 + X5 +
(1 |family/DBID),
list(i=as.name(x))), data = WRAP_all, REML = FALSE)
})
I want to obtain the slope of PRS at different age points in each of the models. How can I write code to achieve this goal? Without loop, the raw code should be:
test_stat1 <- simple_slopes(PRS_age_pacc3[[1]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat2 <- simple_slopes(PRS_age_pacc3[[2]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat3 <- simple_slopes(PRS_age_pacc3[[3]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat4 <- simple_slopes(PRS_age_pacc3[[4]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat5 <- simple_slopes(PRS_age_pacc3[[5]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat6 <- simple_slopes(PRS_age_pacc3[[6]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat7 <- simple_slopes(PRS_age_pacc3[[7]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat8 <- simple_slopes(PRS_age_pacc3[[8]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat9 <- simple_slopes(PRS_age_pacc3[[9]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat10 <- simple_slopes(PRS_age_pacc3[[10]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat11 <- simple_slopes(PRS_age_pacc3[[11]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
library(lme4)
library(reghelper)
set.seed(101)
## add an additional factor variable so we can use it for an interaction
sleepstudy$foo <- factor(sample(LETTERS[1:3], size = nrow(sleepstudy),
replace = TRUE))
m1 <- lmer(Reaction ~ Days*foo + I(Days^2)*foo + (1|Subject), data = sleepstudy)
s1 <- simple_slopes(m1, levels=list(Days = c(5, 10, 15)))
Looking at these results, s1 is a data frame with 6 rows (number of levels of foo × number of Days values specified) and 5 columns (Days, foo, estimate, std error, t value).
The simplest way to do this:
res <- list()
for (i in seq_along(varlist)) {
res[[i]] <- simple_slopes(model_list[[i]], ...) ## add appropriate args here
}
res <- do.call("rbind", res) ## collapse elements to a single data frame
## add an identifier column
res_final <- data.frame(model = rep(varlist, each = nrow(res[[1]])), res)
If you want to be fancier, you could replace the for loop with an appropriate lapply. If you want to be even fancier than that:
library(tidyverse)
(model_list
%>% setNames(varlist)
## map_dfr runs the function on each element, collapses results to
## a single data frame. `.id="model"` adds the names of the list elements
## (set in the previous step) as a `model` column
%>% purrr::map_dfr(simple_slopes, ... <extra args here>, .id = "model")
)
By the way, I would be very careful with simple_slopes when you have a quadratic term in the model as well. The slopes calculated will (presumably) apply only in the case where any other continuous variables in the model are zero. You might want to center your variables as in Schielzeth 2010 Methods in Ecology and Evolution ("Simple means to improve ...")

Create numerous statistical models based on selection criteria in a separate dataframe

I want to perform a certain number of statistical models based on selection criteria specified in a dataframe. So using a basic example, say I had 2 responses variables and 2 explanatory variables:
#######################Data Input############################
Responses <- as.data.frame(matrix(sample(0:10, 1*100, replace=TRUE), ncol=2))
colnames(Responses) <- c("A","B")
Explanatories <- as.data.frame(matrix(sample(20:30, 1*100, replace=TRUE), ncol=2))
colnames(Explanatories) <- c("x","y")
I then define which statistical models that I would like to run, which can include different combinations of Response / Explanatory variables and different statistical functions:
###################Model selection#########################
Function <- c("LIN","LOG","EXP") ##Linear, Logarithmic (base 10) and exponential - see the formula for these below
Respo <- c("A","B","B")
Explan <- c("x","x","y")
Model_selection <- data.frame(Function,Respo,Explan)
How do I then perform a list of models based on these selection criteria? Here is an example of the models I would like to create based on the inputs from the Model_selection data frame.
####################Model creation#########################
Models <- list(
lm(Responses$A ~ Explanatories$x),
lm(Responses$B ~ log10(Explanatories$x)),
lm(Responses$B ~ exp(Explanatories$y))
)
I would guess that some kind of loop function would be required and after looking around perhaps paste too? Thanks in advance for any help with this
This isn't the prettiest solution, but it seems to work for your example:
Models <- list()
idx <- 1L
for (row in 1:nrow(Model_selection)){
if (Model_selection$Function[row]=='LOG'){
expl <- paste0('LOG', Model_selection$Explan[row])
Explanatories[[expl]] <- log10(Explanatories[[Model_selection$Explan[row]]])
Models[[idx]] <- lm(Responses[[Model_selection$Respo[row]]] ~ Explanatories[[expl]])
}
if (Model_selection$Function[row]=='EXP'){
expl <- paste0('EXP', Model_selection$Explan[row])
Explanatories[[expl]] <- exp(Explanatories[[Model_selection$Explan[row]]])
Models[[idx]] <- lm(Responses[[Model_selection$Respo[row]]] ~ Explanatories[[expl]])
}
if (Model_selection$Function[row]=='LIN'){
expl <- paste0('LIN', Model_selection$Explan[row])
Explanatories[[expl]] <- Explanatories[[Model_selection$Explan[row]]]
Models[[idx]] <- lm(Responses[[Model_selection$Respo[row]]] ~ Explanatories[[expl]])
}
names(Models)[idx] <- paste(Model_selection$Respo[row], '~', expl)
idx <- idx+1L
}
Models
This is a perfect use-case for the tidyverse
library(tidyverse)
## cbind both data sets into one
my_data <- cbind(Responses, Explanatories)
## use 'mutate' to change function names to the existing function names
## mutate_all to transform implicit factors to characters
## NB this step could be ommitted if Function would already use the proper names
model_params <- Model_selection %>%
mutate(Function = case_when(Function == "LIN" ~ "identity",
Function == "LOG" ~ "log10",
Function == "EXP" ~ "exp")) %>%
mutate_all(as.character)
## create a function which estimates the model given the parameters
## NB: function params must be named exactly like columns
## in the model_selection df
make_model <- function(Function, Respo, Explan) {
my_formula <- formula(paste0(Respo, "~", Function, "(", Explan, ")"))
my_mod <- lm(my_formula, data = my_data)
## syntactic sugar: such that we see the value of the formula in the print
my_mod$call$formula <- my_formula
my_mod
}
## use purrr::pmap to loop over the model params
## creates a list with all the models
pmap(model_params, make_model)

Formatting table from apa_list(list)

I am printing a table to PDF with much success. Standard hierarchical regression
with three steps. However, my questions is twofold: 1) How do I add the asterisk to mark sig p values on the covariates and 2) how do I remove rows like AIC etc.
At this point, I am just opening the pdf in word to edit the table but thought someone might have a solution.
H_regression <- apa_print(list(Step1 = model1,
Step2 = model2,
Step3 = model3),
boot_samples = 0)
I'll use the example from the papaja documentation as an example.
mod1 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
mod2 <- update(mod1, formula = . ~ . + Petal.Length)
mod3 <- update(mod2, formula = . ~ . + Petal.Width)
moi <- list(Baseline = mod1, Length = mod2, Both = mod3)
h_reg <- apa_print(moi, boot_samples = 0)
h_reg_table <- h_reg$table
2) how do I remove rows like AIC
The table returned by apa_print() is a data.frame with some additional information. Hence, you can index and subset it as you would any other table. You can select rows by name (see below) or by row number.
# Remove rows
rows_to_remove <- c("$\\mathrm{AIC}$", "$\\mathrm{BIC}$")
h_reg_table <- h_reg_table[!rownames(h_reg_table) %in% rows_to_remove, ]
1) How do I add the asterisk to mark sig p values on the covariates
There is currently no way to highlight significant predictors (I'm not a fan of this practice). But here's some code that will let you add highlighting after the fact. The following function takes the formatted table, the list of the compared models and a character symbol to highlight significant predictors as input.
# Define custom function
highlight_sig_predictors <- function(x, models, symbol) {
n_coefs <- sapply(models, function(y) length(coef(y)))
for(i in seq_along(models)) {
sig_stars <- rep(FALSE, max(n_coefs))
sig_stars[1:n_coefs[i]] <- apply(confint(models[[i]]), 1, function(y) all(y > 0) || all(y < 0))
x[1:max(n_coefs), i] <- paste0(x[1:max(n_coefs), i], ifelse(sig_stars, symbol, paste0("\\phantom{", symbol, "}")))
}
x
}
Now this function can be used to customize the table returned by apa_print().
# Add significance symbols to predictors
h_reg_table <- highlight_sig_predictors(h_reg_table, moi, symbol = "*")
# Print table
apa_table(h_reg_table, escape = FALSE, align = c("lrrr"))

Regression in R using a function

I am trying to smooth out my data for each variable in the data frame. Lets say it looks like this:
data <- data.frame(v1 = c(0.5,1.1,2.9,3.4,4.1,5.7,6.3,7.4,6.9,8.5,9.1),
v2 = c(0.1,0.8,0.5,1.1,1.9,2.4,0.8,3.4,2.9,3.1,4.2),
v3 = c(1.3,2.1,0.8,4.1,5.9,8.1,4.3,9.1,9.2,8.4,7.4))
data$x <- 1:nrow(data)
I then specify my x and y variables as:
x <- data$x
y <- data$v1
I can fit the predicted line I want (and I am happy with the process):
f <- function (x,a,b,d) {(a*x^2) + (b*x) + d}
order_two <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co2 <- coef(order_two)
data$order_two_predicted_v1 <- (co2[1] * (data$x)^2) + (co2[2] * data$x) + co2[3]
I therefore end up with an appropriately titled new variable (the predicted values for v1). I now want to do this for each of the other 100 variables in my data frame (v2 and v3 in this example).
I tried using a function to do this but can't get it to work as intended. Here is my attempt:
myfunction <- function(xaxis,yaxis){
# Specfiy my "y" and "x"
x <- data$xaxis
y <- data$yaxis
f <- function (x,a,b,d) {(a*x^2) + (b*x) + d}
order_two <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co2 <- coef(order_two)
data$order_two_predicted_yaxis <- (co2[1] * (data$x)^2) + (co2[2] * data$x) + co2[3]
}
myfunction(x,v1)
myfunction(x,v2)
myfunction(x,v3)
Not only does the function not work as intended, I would like to avoid calling the function 100 times for each variable and instead somehow loop through it.
This is really simple to do in SAS using macros but I am struggling to get this to work in R.
You can model your data directly with the lm() function:
data <- data.frame(v1 = c(0.5,1.1,2.9,3.4,4.1,5.7,6.3,7.4,6.9,8.5,9.1),
v2 = c(0.1,0.8,0.5,1.1,1.9,2.4,0.8,3.4,2.9,3.1,4.2),
v3 = c(1.3,2.1,0.8,4.1,5.9,8.1,4.3,9.1,9.2,8.4,7.4))
x <- 1:nrow(data)
# initialize a list to store the models
models = vector("list", length = (ncol(data)))
# create a loop running over the columns of data
for (i in 1:(ncol(data))){
models[[i]] = lm(data[,i] ~ poly(x,2, raw = TRUE))}
You can also use lapply instead of the for-loop, as stated in the comments.
Use predict() to get the values of the models:
smoothed_v1 = predict(model[[1]], newdata=data.frame(x = x))
Edit:
Regarding your comment - you can store the new values in data with:
for (i in (length(models):1)){
data <- cbind(predict(models[[i]], newdata=data.frame(x = x)), data)
# set the name for the new column
names(data)[1] = paste("pred_v",i, sep ="")}

Function to regress chosen variable against all others

In my dataset I have 6 variables(x1,x2,x3,x4,x5,x6), i wish to create a function that allows me to input one variable and it will do the formula with the rest of the variables in the data set.
For instance,
fitRegression <- function(data, dependentVariable) {
fit = lm(formula = x1 ~., data = data1)
return(fit)
}
fitRegression(x2)
However, this function only returns me with results of x1. My desire result will be inputting whatever variables and will automatically do the formula with the rest of the variables.
For Example:
fitRegression(x2)
should subtract x2 from the variable list therefore we only compare x2 with x1,x3,x4,x5,x6.
and if:
fitRegression(x3)
should subtract x3 from the comparable list, therefore we compare x3 with x1,x2,x4,x5,x6.
Is there any ways to express this into my function, or even a better function.
You can do it like this:
# sample data
sampleData <- data.frame(matrix(rnorm(500),100,5))
colnames(sampleData) <- c("A","B","C","D","E")
# function
fitRegression <- function(mydata, dependentVariable) {
# select your independent and dependent variables
dependentVariableIndex<-which(colnames(mydata)==dependentVariable)
independentVariableIndices<-which(colnames(mydata)!=dependentVariable)
fit = lm(formula = as.formula(paste(colnames(mydata)[dependentVariableIndex], "~", paste(colnames(mydata)[independentVariableIndices], collapse = "+"), sep = "" )), data = mydata)
return(fit)
}
# ground truth
lm(formula = A~B+C+D+E, data = sampleData)
# reconcile results
fitRegression(sampleData, "A")
You want to select the Y variable in your argument. The main difficulty is to pass this argument without any quotes in your function (it is apparently the expected result in your code). Therefore you can use this method, using the combination deparse(substitute(...)):
fitRegression <- function(data, dependentVariable) {
formula <- as.formula(paste0(deparse(substitute(dependentVariable)), "~."))
return(lm(formula, data) )
}
fitRegression(mtcars, disp)
That will return the model.
The below function uses "purrr" and "caret" it produces a list of models.
df <-mtcars
library(purrr);library(caret)
#create training set
vect <- createDataPartition(1:nrow(df), p=0.8, list = FALSE)
#build model list
ModList <- 1:length(df) %>%
map(function(col) train(y= df[vect,col], x= df[vect,-col], method="lm"))

Resources