R sometimes fails to evaluate expressions parsed from strings


I have a massive data frame in which I need to create "lagged" variables and compare them with earlier time points. Because this process needs to be flexible, I've written my own functions to create these lagged variables (not included here).
As I use GLMs, I want to use the stepAIC function, and before I start writing out dozens of "lag01 + lag02 ..." terms, I wanted another function (modelfiller) that builds these strings from my parameters; I then use str2lang to turn them into expressions.
This mostly works, but there is one issue I cannot get my head around.
As you can see in the reprex, full.model can be created when I only use y~x+lag01+lag02. If I use modelfiller("y", 2, "x", "lag") at locations 1 and 3, it also works. But the moment I put modelfiller("y", 2, "x", "lag") at location 2 in the code (within the glm passed to stepAIC), I get the following error message:
Error: Problem with `mutate()` input `GLM_AIC`.
x object '.x' not found
i Input `GLM_AIC` is `purrr::map(...)`.
i The error occurred in group 1: group = "a".
I have also tried as.formula with & without eval, but it caused the same issue.
library(magrittr) # provides the %>% pipe used below
group <- c(rep("a", 10), rep("b", 10), rep("c", 10))
order <- c(seq(1:10), seq(1:10), seq(1:10))
x <- c(runif(30))
y <- c(runif(30))
df <- data.frame(group, order, x, y)
df <- df %>%
dplyr::group_by(group) %>%
dplyr::arrange(group, order) %>%
dplyr::mutate(lag01 = dplyr::lag(x, n=1),
lag02 = dplyr::lag(x, n=2)) %>%
tidyr::drop_na()
modelfiller = function(depPar, maxlag, indepPar, str) {
varnames = list()
for (i in seq(1:maxlag)) {
varnames[i] = paste0(str, stringr::str_pad(i, width = 2, pad = "0"))
}
varnames = paste0(varnames, collapse="+")
varnames = paste(indepPar, varnames, sep = "+")
return(paste(depPar, varnames, sep = "~"))
}
full.model <- df %>%
tidyr::nest(- group) %>%
dplyr::mutate(
# Perform GLM calculation on each group and then a step-wise model selection based on AIC
GLM = purrr::map(
data, ~ lm(data = .x,
# Location 1 - Working
str2lang(modelfiller("y", 2, "x", "lag"))
#y~x+lag01+lag02
)),
GLM_AIC = purrr::map(
data, ~ MASS::stepAIC(glm(data = .x,
# Location 2 - NOT Working
str2lang(modelfiller("y", 2, "x", "lag"))
#y~x+lag01+lag02
)
,direction = "both", trace = FALSE, k = 2,
scope = list(
lower = lm(data = .x,
y ~ 1),
upper = glm(data = .x,
# Location 3 - Working
str2lang(modelfiller("y", 2, "x", "lag"))
#y~x+lag01+lag02
)
)))
)

The issue is that glm stores the name of the variable used to reference the data (here .x) in its call. stepAIC later retrieves this name and evaluates it to access the data, but it does so from an environment in which .x is not visible. To demonstrate, I'm going to simplify your code to
mdl <- str2lang(modelfiller("y", 2, "x", "lag")) # This is your y~x+lag01+lag02
dfn <- df %>% tidyr::nest( data = c(-group) ) # First step of your %>% chain
glms <- purrr::map( dfn$data, ~glm(data = .x, mdl) ) # Construct the models
# Examine glms to observe that
# Call: glm(formula = mdl, data = .x) <--- glm() remembers that the data is in .x
# but stepAIC is not properly aware of where .x
# is defined and behaves effectively as
MASS::stepAIC( glms[[1]] ) # Error: object '.x' not found
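The same mechanism can be reproduced without purrr or MASS. The following is a minimal base-R sketch, not taken from the code above: fit_in_function is a hypothetical helper whose argument is deliberately named .x, and update() stands in for the refitting that stepAIC is assumed to perform internally.

```r
# Hypothetical helper: the data argument is named .x, as in a purrr lambda
fit_in_function <- function(.x) glm(y ~ x, data = .x)

set.seed(1)
d <- data.frame(x = runif(20), y = runif(20))
fit <- fit_in_function(d)

# The stored call refers to the data by name only, not by value
stored_data <- fit$call$data

# Re-evaluating the call from the top level fails, because no object
# called .x exists there
refit <- try(update(fit, . ~ 1), silent = TRUE)
```

Here stored_data is the bare symbol .x, and refit is a try-error rather than a model.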
Option 1
One workaround is to manually construct the expression that contains the data and then evaluate it:
glm2 <- function(.df, ...) {
eval(rlang::expr(glm(!!rlang::enexpr(.df),!!!list(...)))) }
glms2 <- purrr::map( dfn$data, ~glm2(data = .x, mdl) ) # Same as above, but with glm2
MASS::stepAIC( glms2[[1]] ) # Now works
Changing glm to glm2 in your problematic spot makes your code work too. The downside is that the Call: then stores the entire data frame, which can be problematic if the data frame is very large.
Option 2
Another alternative is to replace the purrr call with a for loop, which keeps the calling frames in the shape stepAIC assumes and thus guides it to where the data is defined:
# This fails with Error: object '.x' not found
purrr::map( dfn$data, ~MASS::stepAIC(glm(data=.x, mdl), direction="both") )
# This works
for( mydata in dfn$data )
MASS::stepAIC(glm(data=mydata, mdl), direction="both")
The advantage here is not needing to store the entire data frame inside the call. The disadvantage is that you effectively lose access to what purrr does to streamline the code.
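Separately from the stepAIC problem, the formula that modelfiller assembles by pasting strings can also be built with base R's reformulate(), which returns a formula object directly. This is only a sketch: lag_formula is a hypothetical replacement with the same four arguments as modelfiller, and sprintf("%02d") is assumed to be an acceptable substitute for stringr::str_pad.

```r
# Hypothetical variant of modelfiller() that returns a formula, not a string
lag_formula <- function(depPar, maxlag, indepPar, str) {
  lag_terms <- sprintf("%s%02d", str, seq_len(maxlag))  # e.g. "lag01", "lag02"
  reformulate(c(indepPar, lag_terms), response = depPar)
}

f <- lag_formula("y", 2, "x", "lag")
```

The result f is the formula y ~ x + lag01 + lag02, so no str2lang or as.formula step is needed afterwards.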

Related

How to generalize a function to take more arguments? R

Seems no one is able to tackle this, so I re-ask:
I would like to use an external function from sjstats package within my own function.
The external function takes 4 arguments as like: https://strengejacke.github.io/sjstats/reference/weighted_sd.html
weighted_mannwhitney(data, x, grp, weights, ...)
A simple call to that external function can be done like this.
weighted_mannwhitney(c12hour ~ c161sex + weight, efc)
I have built a simple function below, which currently only works when provided with data as input - adding arguments produces errors.
I would like to be able to pass on some arguments (data, x, grp, weights) to the internally called external functional, so I can change those arguments for different datasets.
The Question
My attempts at generalizing the function are not successful, can anyone show me how I can do this? I show how I would like to call the generalized function below.
library(sjstats) # has weighted Mann-Whitney function which I would like to automate within own function
library(tidyverse)
data(efc)
efc$weight <- abs(rnorm(nrow(efc), 1, .3))
# Own working function
own_funk <- function(data, ...) {
# Weighted MannWhitney Test p-value
res <- data %>%
summarise(
across(
c(c12hour,e17age),
~sjstats:::weighted_mannwhitney.formula(.x ~ c161sex + weight, data)$p.value[[1]]
)
) %>%
#round(., 3) %>%
tibble::rownames_to_column() %>%
pivot_longer(-rowname) %>%
pivot_wider(names_from=rowname, values_from=value) %>%
rename("Outcome" = 1, "P-value" = 2) %>%
as.data.frame()
return(res)
}
Call to own working function
own_funk(efc)
How I would like to call the function
# NB: The x can take on many variables
own_funk(data = efc, x = c(c12hour,e17age), grp = c161sex, weights = weight)

Somewhat similar to @snaut's solution, I'd use map to loop over your x columns/formulas. The main difference is that we probably don't want to hard-code the variables inside the function, but rather pass them in as arguments.
library(sjstats) # has weighted Mann-Whitney function which I would like to automate within own function
library(tidyverse)
library(survey)
# Own working function
own_funk <- function(data, x, grp, weights) {
# Weighted MannWhitney Test p-value
formula <- paste0(x, " ~ ", grp, " + ", weights)
res <- map2(.x = formula,
.y = x,
.f = ~ data |>
summarize(!!sym(.y) := sjstats:::weighted_mannwhitney.formula(as.formula(.x), data = data)$p.value[[1]]) %>%
#round(., 3) %>%
tibble::rownames_to_column() %>%
pivot_longer(-rowname) %>%
pivot_wider(names_from=rowname, values_from=value) %>%
rename("Outcome" = 1, "P-value" = 2) %>%
as.data.frame()) |>
bind_rows()
return(res)
}
own_funk(data = efc, x = c("c12hour", "e17age"), grp = "c161sex", weights = "weight")
Which gives:
Outcome P-value
1 c12hour 0.006806572
2 e17age 0.187765467
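The core of this approach, pasting one formula string per outcome and converting it with as.formula, can be tried without sjstats. The sketch below uses lm and the built-in mtcars data as stand-ins; fit_by_outcome and its argument names are hypothetical.

```r
# Hypothetical helper: one model per outcome, formulas built from strings
fit_by_outcome <- function(data, outcomes, predictors) {
  rhs <- paste(predictors, collapse = " + ")
  formulas <- paste0(outcomes, " ~ ", rhs)
  fits <- lapply(formulas, function(f) lm(as.formula(f), data = data))
  names(fits) <- outcomes
  fits
}

fits <- fit_by_outcome(mtcars, outcomes = c("mpg", "qsec"),
                       predictors = c("am", "wt"))
```

Because the variable names arrive as character strings, the caller can change them per dataset without touching the function body.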

sjstats:::weighted_mannwhitney.formula is clearly meant for interactive use and not for programming. sjstats:::weighted_mannwhitney.default seems better suited to programming, but it still does something with unquoting variable names that I did not quite figure out.
What I do most of the time when I have to program with functions designed for interactive use is to create a temporary column with a fixed name, and then rename each column I want to iterate over to that name.
This is what I have done in my solution. I also don't use summarise and across; instead I use map_dfr to iterate over the column names and output rows of a tibble.
I did not quite understand what you want to achieve with pivot_longer and pivot_wider, but I'm sure you can format the output of my solution to your needs.
library(sjstats) # has weighted Mann-Whitney function which I would like to automate within own function
library(tidyverse)
data(efc)
efc$weight <- abs(rnorm(nrow(efc), 1, .3))
# Own working function
own_funk <- function(mydata, ...) {
# Weighted MannWhitney Test p-value
map_dfr(
c("c12hour", "e17age"),
function(column){
tmp_data <- mydata %>%
rename(tmp_col = all_of(column))
tibble(
variable = column,
`P-value`=sjstats:::weighted_mannwhitney.formula(tmp_col ~ c161sex + weight, data=tmp_data)$p.value[[1]]
)
}
)
}
own_funk(efc)
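The temporary-column trick can be illustrated with base R alone. In this sketch, wilcox.test and mtcars are stand-ins for the weighted test and the efc data, and p_value_for is a hypothetical helper.

```r
# Hypothetical helper: rename the column of interest to a fixed name so a
# formula written once (tmp_col ~ am) can be reused for every column
p_value_for <- function(data, column) {
  tmp <- data
  names(tmp)[names(tmp) == column] <- "tmp_col"
  wilcox.test(tmp_col ~ am, data = tmp, exact = FALSE)$p.value
}

p_values <- sapply(c("mpg", "hp"), p_value_for, data = mtcars)
```

p_values is a named numeric vector with one p-value per column; each entry equals what a direct call such as wilcox.test(mpg ~ am, data = mtcars, exact = FALSE) would return.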

R: short and effective code for a sequence of lm models from DOE factors

Background:
I have several responses as output from a DOE and want to model each one. My first attempt, which of course worked, was simply to write out the code for each model in full.
formula<- y~a+b+c+d+a:b+a:c+a:d+b:c+b:d+c:d+I(a^2)+I(b^2)+I(c^2)+I(d^2)
Response_Data <- dataframe%>%
dplyr::select(2:5,6)
a <- Response_Data$X1
b <- Response_Data$X2
c <- Response_Data$X3
d <- Response_Data$X4
y <- Response_Data[[5]]
Response_1_Model <- lm(formula,Response_Data)
Response_Data <- dataframe%>%
dplyr::select(2:5,7)
a <- Response_Data$X1
b <- Response_Data$X2
c <- Response_Data$X3
d <- Response_Data$X4
y <- Response_Data[[5]]
Response_2_Model <- lm(formula,Response_Data)
Response_Data <- dataframe%>%
dplyr::select(2:5,8)
a <- Response_Data$X1
b <- Response_Data$X2
c <- Response_Data$X3
d <- Response_Data$X4
y <- Response_Data[[5]]
Response_3_Model <- lm(formula,Response_Data)
... and so on for every response.
Then I wanted all the coefficients in one data frame and did this:
Coefficients<-data.frame(Response_1_Model%>%coef(),Response_2_Model%>%coef(),Response_3_Model%>%coef(),Response_4_Model%>%coef(),Response_5_Model%>%coef(),Response_6_Model%>%coef(),Response_7_Model%>%coef(),Response_8_Model%>%coef(),Response_9_Model%>%coef(),Response_10_Model%>%coef(),Response_11_Model%>%coef(),Response_12_Model%>%coef(),Response_13_Model%>%coef(),Response_14_Model%>%coef(),Response_15_Model%>%coef(),Response_16_Model%>%coef(),Response_17_Model%>%coef(),Response_18_Model%>%coef(),Response_19_Model%>%coef(),Response_20_Model%>%coef(),Response_21_Model%>%coef(),Response_22_Model%>%coef(),Response_23_Model%>%coef(),Response_24_Model%>%coef(),Response_25_Model%>%coef(),Response_26_Model%>%coef(),Response_27_Model%>%coef())%>%mutate(across(is.numeric, round, digits=3))
colnames(Coefficients)<-c(names(RT_Datentabelle[6:32]))
I know, this is horrible, but I was not able to do this with paste0().
I tried it as follows:
Response_Models<-paste0("Response_",1:27,"_Model%>%coef()",collapse = ",")
the output was:
[1] "Response_1_Model%>%coef(),Response_2_Model%>%coef(),Response_3_Model%>%coef(),Response_4_Model%>%coef(),Response_5_Model%>%coef(),Response_6_Model%>%coef(),Response_7_Model%>%coef(),Response_8_Model%>%coef(),Response_9_Model%>%coef(),Response_10_Model%>%coef(),Response_11_Model%>%coef(),Response_12_Model%>%coef(),Response_13_Model%>%coef(),Response_14_Model%>%coef(),Response_15_Model%>%coef(),Response_16_Model%>%coef(),Response_17_Model%>%coef(),Response_18_Model%>%coef(),Response_19_Model%>%coef(),Response_20_Model%>%coef(),Response_21_Model%>%coef(),Response_22_Model%>%coef(),Response_23_Model%>%coef(),Response_24_Model%>%coef(),Response_25_Model%>%coef(),Response_26_Model%>%coef(),Response_27_Model%>%coef()"
So there are at least three problems in my code:
- How can I loop over the select(), the lm model, and the naming?
- I couldn't find a way to update the factors (a, b, c, d) after selecting new columns other than writing them out again.
- How can I paste "real code", and not just text, to get the coefficients in my example?
EDIT:
Thanks to the comment from Miss.Alpha I tried a nest() and map() approach and it worked quite well starting from a tidy data frame.
df_tidy<-df%>%
pivot_longer(cols = Y1:Y27,names_to = "Y", values_to = "value")
df_nest_lm<-df_tidy%>%
nest(data= -Y)%>%
mutate(fit= map(data,~lm(.x$value~.x$X1+.x$X2+.x$X3+.x$X4+.x$X1:.x$X2+.x$X1:.x$X3+.x$X1:.x$X4+.x$X2:.x$X3+.x$X2:.x$X4+.x$X3:.x$X4+I(.x$X1^2)+I(.x$X2^2)+I(.x$X3^2)+I(.x$X4^2),data= .x)),
tidied = map(fit, tidy)
) %>%
unnest(tidied)
Of course I want to get rid of the ugly formula, but I cannot paste it like this:
a <- '.x$X1'
b <- '.x$X2'
c <- '.x$X3'
d <- '.x$X4'
y <- '.x$value'
f<- as.formula(paste(y, paste(a,b,c,d,
paste0(a,":",c(b,c,d),collapse = "+"),
paste0(b,":",c(c,d),collapse = "+"),
paste0(c,":",d),
paste0("I(",c(a,b,c,d),"^2)",sep="",collapse = "+"),
sep = "+"),
sep = "~"))
df_nest_lm<-df_tidy%>%
nest(data= -Y)%>%
mutate(fit= map(data,~lm(formula=f,data= .x)),
tidied = map(fit, tidy)
) %>%
unnest(tidied)
I always get the error:
Error: Problem with mutate() column fit. i fit = map(data, ~lm(formula = f, data = .x)). x object '.x' not found
Can anybody explain why it's not working, although the formula is the same as above?

This would be one way:
# Define formula
formula<- y~a+b+c+d+a:b+a:c+a:d+b:c+b:d+c:d+I(a^2)+I(b^2)+I(c^2)+I(d^2)
# Define a function that takes the x:th variable
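my_reg <- function(x, formula, dataframe){
  Response_Data <- dataframe %>%
    dplyr::select(2:5, dplyr::all_of(x))
  # Rename the columns to the names used in the formula. lm() looks up the
  # formula's variables in the data first and then in the environment where
  # the formula was created, not among this function's local variables.
  names(Response_Data) <- c("a", "b", "c", "d", "y")
  lm(formula, Response_Data)
}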
# Make one model for each x:th variable
map(6:8, my_reg, formula, dataframe)
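As for the "object '.x' not found" error in the EDIT: a likely explanation (my reading, not something stated in the question) is that as.formula() is called at the top level, so the resulting formula looks for .x in the global environment, whereas the formula typed inside the map lambda is created where .x exists. Using plain column names sidesteps this. The sketch below assumes a small made-up data frame doe with two factors and two responses, and uses base R's reformulate() and lapply() in place of nest() and map().

```r
set.seed(42)
# Made-up stand-in for the DOE data: two factors, two responses
doe <- data.frame(X1 = runif(30), X2 = runif(30),
                  Y1 = runif(30), Y2 = runif(30))

# Right-hand side written once, using column names instead of .x$...
rhs <- c("X1", "X2", "X1:X2", "I(X1^2)", "I(X2^2)")
responses <- c("Y1", "Y2")

# One model per response; only the left-hand side changes
fits <- lapply(responses, function(resp) {
  lm(reformulate(rhs, response = resp), data = doe)
})
names(fits) <- responses

# Coefficients of all models side by side, one column per response
coefs <- round(sapply(fits, coef), 3)
```

coefs is a matrix with one row per coefficient and one column per response, which replaces the long data.frame(...) call built from 27 separately named models.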

How to optimize memory usage in dplyr + purrr

I have an issue where, after replicating data for a training and testing set, a large amount of memory is shown as allocated to my user in RStudio but is not being used in my R session. I've created a small example to reproduce my situation :)
This code runs a bunch of models based on the different formulas, algorithms, and parameter sets that I give it. It is normally a function, but I've turned it into a simple script for the reprex.
library(dplyr)
library(purrr)
library(modelr)
library(tidyr)
library(pryr)
# set my inputs
data <- mtcars
formulas <- c(test1 = mpg ~ cyl + wt + hp,
test2 = mpg ~ cyl + wt)
params = list()
methods <- "lm"
n <- 20 # num of cv splits
mult <- 10 # number of times I want to replicate some of the data
frac <- .25 # how much I want to cut down other data (fractional)
### the next few chunks get the unique combos of the inputs.
if (length(params) != 0) {
cross_params <- params %>%
map(cross) %>%
map_df(enframe, name = "param_set", .id = "method") %>%
list
} else cross_params <- NULL
methods_df <- tibble(method = methods) %>%
list %>%
append(cross_params) %>%
reduce(left_join, by = "method") %>%
split(1:nrow(.))
# wrangle formulas into a split dataframe
formulas_df <- tibble(formula = formulas,
name = names(formulas)) %>%
split(.$name)
# split out the data into n random train-test combos
cv_data <- data %>%
crossv_kfold(n) %>% # rsample?
mutate_at(vars(train:test), ~map(.x, as_tibble))
# sample out if needed
cv_data_samp <- cv_data %>%
mutate(train = modify(train,
~ .x %>%
split(.$gear == 4) %>%
# take a sample of the non-vo data
modify_at("FALSE", sample_frac, frac) %>%
# multiply out the vo-on data
modify_at("TRUE", function(.df) {
map_df(seq_along(1:mult), ~ .df)
}) %>%
bind_rows))
# get all unique combos of formula and method
model_combos <- list(cv = list(cv_data_samp),
form = formulas_df,
meth = methods_df) %>%
cross %>%
map_df(~ bind_cols(nest(.x$cv), .x$form, .x$meth)) %>%
unnest(data, .preserve = matches("formula|param|value")) %>%
{if ("value" %in% names(.)) . else mutate(., value = list(NULL))}
# run the models
model_combos %>%
# put all arguments into a single params column
mutate(params = pmap(list(formula = formula, data = train), list)) %>%
mutate(params = map2(params, value, ~ append(.x, .y))) %>%
mutate(params = modify(params, discard, is.null)) %>%
# run the models
mutate(model = invoke_map(method, params))
mem_change(rm(data, cv_data, cv_data_samp))
mem_used()
Now after I do this, my mem_used comes out to 77.3mb, but I'm seeing roughly double that (160Mb) allocated to my R user. This really explodes when my data is 3 Gb, which is my real-life case. I end up using 100Gb and tying up a whole server :(.
What is going on and how can I optimize?
Any help appreciated!!!

I figured this out! The issue was that I was converting my series of modelr resample objects to tibbles and that was exploding the memory even though I was subsequently sampling them down. The solution? Write methods for dealing with resample objects so that I never have to convert resample objects to tibble. These looked like:
# this function just samples the indexes instead of the data
sample_frac.resample <- function(data, frac) {
data$idx <- sample(data$idx, frac * length(data$idx))
data
}
# this function replicates the indexes. I should probably call it something else.
augment.resample <- function(data, n) {
data$idx <- unlist(map(seq_along(1:n), ~ data$idx))
data
}
# This function does simple splitting (logical only) of resample objects
split.resample <- function(data, .p) {
pos <- list(data = data$data, idx = which(.p))
neg <- list(data = data$data, idx = which(!.p))
class(pos) <- "resample"
class(neg) <- "resample"
list("TRUE" = pos,
"FALSE" = neg)
}
# This function is the equivalent of `bind_rows` for resample objects.
# Since bind_rows does not call `UseMethod`, I had to call it something else
bind <- function(data) {
out <- list(data = data[[1]]$data, idx = unlist(map(data, pluck, "idx")))
class(out) <- "resample"
out
}
Then I just converted to a tibble in the same purrr closure in which my model is run for that CV. Problem solved! My memory usage is VERY low now.
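The saving comes from manipulating row indexes instead of rows. The base-R sketch below is only an illustration of that idea with mtcars; it does not use modelr, and the variable names are made up.

```r
data <- mtcars

# Rows to replicate, stored as integer positions only
idx <- which(data$gear == 4)
replicated_idx <- rep(idx, times = 10)

# Materialise the rows only at the point where a model needs them
materialised <- data[replicated_idx, ]

idx_bytes  <- as.numeric(object.size(replicated_idx))
data_bytes <- as.numeric(object.size(materialised))
```

The index vector holds one integer per replicated row, while the materialised data frame holds every column for every replicated row, so idx_bytes is much smaller than data_bytes.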

map + pmap, cannot find variables

I am trying to collate results from a simulation study using dplyr and purrr. My results are saved as a list of data frames with the results from several different classification algorithms, and I'm trying to use purrr and dplyr to summarize these results.
I'm trying to calculate
- number of objects assigned to each cluster
- number of objects in the cluster that actually belong to the cluster
- number of true positives, false positives, false negatives, and true negatives using 3 different algorithms (KEEP1 - KEEP3)
- for 2 of the algorithms, I have access to a probability of being in the cluster, so I can compare this to alternate choices of alpha - and so I can calculate true positives etc. using a different choice of alpha.
I found this: https://github.com/tidyverse/dplyr/issues/3101, which I used successfully on a single element of the list to get exactly what I wanted:
f <- function(.x, .y) {
sum(.x & .y)
}
actions <- list(
.vars = lst(
c('correct'),
c('KEEP1', 'KEEP2', 'KEEP3'),
c('pval1', 'pval2')
),
.funs = lst(
funs(Nk = length, N_correct = sum),
funs(
TP1 = f(., .y = correct),
FN1 = f(!(.), .y = correct),
TN1 = f(!(.), .y = !(correct)),
FP1 = f(., .y = !(correct))
),
funs(
TP2 = f((. < alpha0) , .y = correct),
FN2 = f(!(. < alpha0), .y = correct),
TN2 = f(!(. < alpha0), .y = !(correct)),
FP2 = f((. < alpha0), .y = !(correct))
)
)
)
reproducible_data <- replicate(2,
data_frame(
k = factor(rep(1:10, each = 20)), # group/category
correct = sample(x = c(TRUE, FALSE), 10 * 20, replace = TRUE, prob = c(.8, .2)),
pval1 = rbeta(10 * 20, 1, 10),
pval2 = rbeta(10 * 20, 1, 10),
KEEP1 = pval1 < 0.05,
KEEP2 = pval2 < 0.05,
KEEP3 = runif(10 * 20) > .2,
alpha0 = 0.05,
alpha = 0.05 / 20 # divided by no. of objects in each group (k)
),
simplify = FALSE)
# works
df1 <- reproducible_data[[1]]
pmap(actions, ~df1 %>% group_by(k) %>% summarize_at(.x, .y)) %>%
reduce(inner_join,by = 'k')
Now, I want to use map to do this to the entire list. However, I can no longer access the variable "correct" (it hasn't gotten far enough to not see alpha or alpha0, but presumably the same issue will occur). I'm still learning dplyr/purrr, but my experimenting hasn't proved useful.
# does not work
out_summary <- map(
reproducible_data,
pmap(actions, ~ as_tibble(.) %>% group_by("k") %>% summarize_at(.x, .y)) %>%
reduce(inner_join,by = 'k')
)
# this doesn't either
out_summary <- map(
reproducible_data,
pmap(actions, ~ as_tibble(.) %>% group_by("k") %>% summarize_at(.x, .y, alpha = alpha, alpha0 = alpha0, correct = correct)) %>%
reduce(inner_join,by = 'k')
)
Within map, I don't see the variable 'k' in group_by(k) unless it is quoted, as in group_by('k'), but I did not need to quote it when I just used pmap. I've tried various ways to pass the correct variables to these functions, but I'm still learning dplyr and purrr, and haven't succeeded yet.
One more note: the actual data is stored as a regular data frame, so I need as_tibble() in the pmap function. I was running into some different errors when I removed it in this example, so I opted to add it back so I would get the same issues. Thanks!

Try this
map(
reproducible_data,
function(df1) {
pmap(actions, ~ df1 %>%
as_tibble() %>%
group_by(k) %>%
summarize_at(.x, .y)) %>%
reduce(inner_join, by = "k")
}
)
I think your arguments get mixed up when map and pmap are used at the same time, so I used the function syntax for map and named its argument df1 to keep them apart. The rest of it looks OK (although I switched to pmap_df to return a data frame: the structure of the list was ugly without it, and pmap_df was the easiest way to make it pretty). Let me know if it's not the expected output. 👍
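The naming idea can be seen in a much smaller example. This is a sketch with made-up data, assuming purrr is installed: the outer function names its argument v explicitly, so the inner lambda's .x can only refer to the inner list.

```r
library(purrr)

outer_list  <- list(a = 1:3, b = 4:6)
multipliers <- list(10, 100)

# Outer argument is named (v); inner lambda uses .x for the multipliers
res <- map(outer_list, function(v) {
  map_dbl(multipliers, ~ sum(v) * .x)
})
```

res$a is c(60, 600) and res$b is c(150, 1500); with two nested ~ lambdas, both levels would use .x and the outer element would no longer be reachable by that name.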
On group_by("k") vs. group_by(k): writing group_by("k") actually creates a new variable filled with the string "k" and then groups by that. That will get your code to run, but it won't do what you want. Sometimes this kind of problem is really caused by an error a line or two earlier (or, with dplyr, a pipe or two earlier). In this case, map wasn't passing df1 where you needed it.
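A quick way to see the difference, sketched with a made-up tibble and assuming dplyr is installed:

```r
library(dplyr)

df <- tibble(k = rep(1:2, each = 3), v = 1:6)

# Grouping by a string: one constant group, so a single summary row
by_string <- df %>% group_by("k") %>% summarise(total = sum(v))

# Grouping by the column: one summary row per value of k
by_column <- df %>% group_by(k) %>% summarise(total = sum(v))
```

by_string has a single row with total 21, while by_column has two rows with totals 6 and 15.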

R: Using a pasted formula in Sapply

I am trying to identify correlated explanatory variables and eliminate them. I'm using sapply to apply the regression to the variables I am interested in, and manually deleting the ones with VIFs > 10. However, when I try to automate this so that it runs quickly for many VIFs, I cannot get my regression script to run with a pasted formula object containing the names I want to keep. Below:
regressiondata <- data.frame(matrix(ncol=9,nrow=100,runif(900,1,100)))
colnames(regressiondata) <- c("indep1","indep2","indep3","indep4","var1","var2","var3","var4","var5")
vifs1_model <- sapply(regressiondata[,indep_variables],function(x) vif(lm(x~var1+var2+var3+var4+var5,
data = regressiondata,
na.action=na.exclude)))
vifs1 <- rowMeans(vifs1_model)
formula_variables <- paste(names(vifs1),collapse="+")
final_model <- t(round(sapply(regressiondata[,indep_variables],
function(x) lm(x ~ formula_variables,data=regressiondata,na.action=na.exclude)$coef),2))
when I run "final_model" I get this error:
Error in t(round(sapply(regressiondata[, indep_variables], function(x) lm(x ~ :
error in evaluating the argument 'x' in selecting a method for function 't': Error in model.frame.default(formula = x ~ formula_variables, data = regressiondata, :
variable lengths differ (found for 'formula_variables')

I think you have a couple of issues:
You are using sapply over a dataframe when it looks like you simply want to sapply over the vector of independent variable names
Your last nested call to lm seems to mix expressions and strings
Here is my walk through. Your code refers to some missing objects so I have added in some lines I think you left out
library(car) # for vif()
regressiondata <- data.frame(matrix(ncol=9,nrow=100,runif(900,1,100)))
colnames(regressiondata) <- c("indep1",
"indep2",
"indep3",
"indep4",
"var1",
"var2",
"var3",
"var4",
"var5")
indep_variables <- names(regressiondata)[1:4] # object did not exist
I broke out the anonymous functions for clarity:
f1 <- function(x) {
vif(lm(x~var1+var2+var3+var4+var5,
data = regressiondata,
na.action=na.exclude))
}
Now your regressions
vifs1_model <- sapply(regressiondata[,indep_variables], f1)
vifs1 <- rowMeans(vifs1_model)
formula_variables <- paste(names(vifs1),collapse="+")
I named this function that pulls the coefficients and handed lm a character vector (string) with the whole formula:
getCoefs <- function(x) {
lm(paste(x, "~", formula_variables), data=regressiondata,
na.action=na.exclude)$coef
}
Now, just sapply over the vector of names, then transpose and round:
final_model <- sapply(indep_variables, getCoefs)
final_model <- t(round(final_model ,2))
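The original error can be reproduced in isolation. In this sketch with mtcars (the variable names are illustrative only), a pasted right-hand side used directly inside a formula is treated as a single character variable, whereas pasting the whole formula as one string and converting it works.

```r
rhs <- paste(c("wt", "hp"), collapse = "+")

# rhs is a length-1 character vector here, not a set of terms,
# so model.frame reports that the variable lengths differ
bad <- try(lm(mpg ~ rhs, data = mtcars), silent = TRUE)

# Build the whole formula as a string, then convert it
good <- lm(as.formula(paste("mpg ~", rhs)), data = mtcars)
```

bad is a try-error, while good is a fitted model with coefficients for wt and hp.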

Here is a dplyr way of doing things. The bulk of the work is done by the sub_regression function, which conducts a regression, filters independent variables by vif, and then redoes the regression
library(dplyr)
library(tidyr)
library(magrittr)
library(car)
sub_regression = function(sub_data_frame)
lm(independent_value ~ var1+var2+var3+var4+var5,
data = sub_data_frame ,
na.action="na.exclude") %>%
vif %>%
Filter(function(x) x <= 10, .) %>%
names %>%
paste(collapse = " + ") %>%
paste("independent_value ~ ", .) %>%
as.formula %>%
lm(. , sub_data_frame, na.action="na.exclude") %>%
coefficients %>%
round(3) %>%
as.list %>%
data.frame(check.names = FALSE)
matrix(ncol=9,nrow=100,runif(900,1,100)) %>%
data.frame %>%
setNames(c("indep1","indep2","indep3","indep4","var1","var2","var3","var4","var5")) %>%
gather(independent_variable, independent_value,
indep1, indep2, indep3, indep4) %>%
group_by(independent_variable) %>%
do(sub_regression(.))
