How to do a single logistic regression with multiple data frames? - r

The goal is to make a single ordered logistic regression using multiple data frames with the same structure.
In the example below I joined two data frames to be able to do a single ordered logistic regression. However, I would like a solution that automates the process, since there are 417 data frames.
library(dplyr)
library(MASS)
#dataframe 1
party <- factor(rep(c("Rep","Dem"), c(407, 428)),
levels=c("Rep","Dem"))
rpi <- c(30, 46, 148, 84, 99) # cell counts
dpi <- c(80, 81, 171, 41, 55) # cell counts
ideology <- c("Very Liberal","Slightly Liberal","Moderate","Slightly Conservative","Very Conservative")
pol.ideology <- factor(c(rep(ideology, rpi),
rep(ideology, dpi)), levels = ideology)
data1 <- data.frame(party,pol.ideology)
#dataframe 2
party <- factor(rep(c("Rep","Dem"), c(410, 430)),
levels=c("Rep","Dem"))
rpi2 <- c(26, 50, 140, 95, 99) # cell counts
dpi2 <- c(75, 86, 141, 61, 67) # cell counts
ideology2 <- c("Very Liberal","Slightly Liberal","Moderate","Slightly Conservative","Very Conservative")
pol.ideology <- factor(c(rep(ideology2, rpi2),
rep(ideology2, dpi2)), levels = ideology2)
data2 <- data.frame(party,pol.ideology)
nrow(data1)
nrow(data2)
## Joining dataframes "manually"
dat <- bind_rows(data1,data2)
table(dat)
nrow(dat)
# fit proportional odds model
pom <- polr(pol.ideology ~ party, data=dat)
summary(pom)
So I tried a loop in an attempt to run a single regression on data1 and data2. I was not successful, as the regression considers only data2.
## LOOP
data = ls(pattern="data")
for(i in 1:length(ls(pattern="data"))){
pom <- polr(pol.ideology ~ party, data=i)
}
summary(pom)
Could someone point me to a solution? I only recently started learning about loops in R. Sorry if the question is silly.

The best option would be to create a list of data frames (assuming all of them have the same structure and exist in your environment as individual objects named data1, data2, data3, ...).
library(tidyverse)
df_list <- map(paste0("data", 1:2), ~eval(sym(.x))) # 1:2 for the example above; 1:417 for the full set
Once the list is created, you can easily do whatever you want with it.
Fit a regression for each individual data frame:
result_list <- map(df_list, ~polr(pol.ideology ~ party, data=.x))
Or, if you want a single regression on the combined data frames, first bind them all together:
big_df <- df_list %>% bind_rows()
Then run your regression on the combined data frame, big_df:
big_df_result <- polr(pol.ideology ~ party, data= big_df)

The i in your loop is not the data. It is an index into the character vector of names you created outside the loop, so you need get() to fetch the object with that name. Also be careful about naming that vector data: ls(pattern="data") then matches it as well, which adds one more item to loop over, so it is best to rename it. The other problem is that you never save the result of each model, only the last one. Use a list to store them all, then lapply to summarise them.
library(MASS) # For ordered logistic regression (polr)
## LOOP
my_list = ls(pattern="data")
pom <- list() # initialise your pom
for(i in 1:length(my_list)){ # or seq_along(my_list)
pom[[i]] <- polr(pol.ideology ~ party, data=get(my_list[i]))
}
lapply(pom, summary)
You could also do this without a loop using lapply.
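A minimal sketch of the lapply variant, plus the single combined fit the question actually asks for. The helper make_df and the names df_names, pom_list, dat_all and pom_all are illustrative assumptions, not part of the original posts; the cell counts are the ones from the question. Hess = TRUE is passed so that summary() does not need to refit the model.

```r
library(MASS)  # polr()

ideology <- c("Very Liberal", "Slightly Liberal", "Moderate",
              "Slightly Conservative", "Very Conservative")

# Hypothetical helper: rebuild one data frame from its cell counts
make_df <- function(rep_counts, dem_counts) {
  data.frame(
    party = factor(rep(c("Rep", "Dem"), c(sum(rep_counts), sum(dem_counts))),
                   levels = c("Rep", "Dem")),
    pol.ideology = factor(c(rep(ideology, rep_counts), rep(ideology, dem_counts)),
                          levels = ideology)
  )
}
data1 <- make_df(c(30, 46, 148, 84, 99), c(80, 81, 171, 41, 55))
data2 <- make_df(c(26, 50, 140, 95, 99), c(75, 86, 141, 61, 67))

# Anchored pattern: matches data1, data2, ... but not objects like "data" or "dat"
df_names <- ls(pattern = "^data[0-9]+$")
df_list <- mget(df_names)  # named list of data frames

# One model per data frame, no explicit loop
pom_list <- lapply(df_list, function(d) {
  polr(pol.ideology ~ party, data = d, Hess = TRUE)
})

# One model on all rows stacked together
dat_all <- do.call(rbind, df_list)
pom_all <- polr(pol.ideology ~ party, data = dat_all, Hess = TRUE)
summary(pom_all)
```

Because mget() returns a named list, pom_list keeps the data frame names, so pom_list$data2 or lapply(pom_list, summary) both work as expected.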

Related

How to capture the most important variables in Bootstrapped models in R?

I have several models, Lasso being one of them, and I would like to compare which predictors each one selects as important on the same data set. The data set consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience (the original names are extremely long). I would like to report the top features and then rename those variables with shorter, more concise names.
My attempt is to extract the top variables from each model iteration, put them into a list, then take the mean of the top variables over X loops. However, the top 10 most-used predictors still vary from run to run, so I cannot rename the variables manually because each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis and because CV creates new models on every bootstrap.
For the sake of a simple example I use mtcars and look for the top 3 most common predictors, since this data set has only 10 predictors.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
# CV and Splitting
ind <- sample(nrow(df), nrow(df), replace = TRUE)
ind <- unique(ind)
train <- df[ind, ]
xtrain <- model.matrix(mpg~., train)[,-1]
ytrain <- df[ind, 1]
test <- df[-ind, ]
xtest <- model.matrix(mpg~., test)[,-1]
ytest <- df[-ind, 1]
# Create Model per Loop
model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
# Store Coefficients per loop
coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
# Store all nonzero Coefficients
topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
If you rerun the code chunk above, you will notice that the top 3 variables change. Renaming them is difficult when they are not constant from run to run. Any suggestions on how I could approach this?
You can use the function set.seed() to make sample() return the same draws each time. For example, put this once before the loop:
set.seed(123)
When I add this to the above code and run it twice, the following is returned both times:
wt carb hp
98 89 86
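A small base-R sketch of why this works, assuming the same resampling step as in the question. The helper draw_indices and the names first_run and second_run are hypothetical and only serve the illustration.

```r
# Hypothetical helper: one bootstrap draw of row indices, as in the loop above
draw_indices <- function(n) unique(sample(n, n, replace = TRUE))

set.seed(123)  # fix the random number stream
first_run <- replicate(3, draw_indices(nrow(mtcars)), simplify = FALSE)

set.seed(123)  # resetting the seed replays the same stream
second_run <- replicate(3, draw_indices(nrow(mtcars)), simplify = FALSE)

identical(first_run, second_run)  # TRUE
```

Set the seed once before the loop rather than inside it; resetting it on every iteration would draw the identical sample each time, which defeats the purpose of bootstrapping.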

Using a FOR-loop to calculate AUC of multiple dataframes

For a certain problem I want to run a random forest classifier over multiple datasets and compare the AUCs of those datasets. I want to use a 'lazy' approach: instead of running the classification by hand n times, I want a for-loop to do it for me.
So, a for-loop that loops over multiple datasets, performs the random forest classification, calculates the AUC and stores it in an empty matrix/data frame. The result should be a table/matrix with a column for each dataset and a row showing the AUC of each dataset.
I prepped some code using the iris dataset to get started, but I don't have any experience using for-loops on this kind of problem. Hopefully somebody can help me out or point me in the right direction.
Example:
require(pROC)
require(randomForest)
#use the Iris dataset as example
data(iris)
#make a simple 2-class outcome over the Iris dataset
iris <- iris[-which(iris$Species=="setosa"),]
iris$Species<-as.factor(as.character(iris$Species))
#create list of dataframes we want to use
df1 <- iris
df2 <- iris
df_list <- list(df1, df2)
#create empty matrix to store results in
results_matrix <- matrix(ncol=2, nrow=1)
#create a for loop to calculate and store AUC of each dataframe
for(df in df_list){
rf_model <- randomForest::randomForest(Species ~., data = df)
rf_model_roc <- roc(iris$Species,rf_model$votes[,2])
df_auc <- auc(rf_model_roc)
#store df_auc of each df in results_matrix
}
You can create a vector to store the values from AUC.
For example,
#create list of dataframes we want to use
df1 <- iris
df2 <- iris
df_list <- list(df1, df2)
results_vec <- numeric(length(df_list))
#create a for loop to calculate and store AUC of each dataframe
for(i in seq_along(df_list)) {
data <- df_list[[i]]
rf_model <- randomForest::randomForest(Species ~., data = data)
rf_model_roc <- roc(data$Species,rf_model$votes[,2])
results_vec[i] <- as.numeric(auc(rf_model_roc))
}
results_vec
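If you want the one-row table described in the question (one column per dataset), a named list and vapply get you there directly. The sketch below is deliberately base-R only so it runs without extra packages: glm() stands in for randomForest(), and the hypothetical helper rank_auc (the rank-sum formula for AUC) stands in for pROC::auc(). The resulting AUCs are in-sample, unlike the out-of-bag votes used above, and the column subsets are only there to make the two datasets differ.

```r
# Two-class version of iris, as in the question
iris2 <- droplevels(iris[iris$Species != "setosa", ])

# Named list, so results carry the dataset names (subsets chosen for illustration)
df_list <- list(
  df1 = iris2[, c("Sepal.Length", "Sepal.Width", "Species")],
  df2 = iris2[, c("Petal.Length", "Petal.Width", "Species")]
)

# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) statistic
rank_auc <- function(labels, scores) {
  pos <- labels == levels(labels)[2]
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  (sum(rank(scores)[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# One AUC per dataset; vapply keeps the list names
auc_vec <- vapply(df_list, function(d) {
  fit <- glm(Species ~ ., data = d, family = binomial)  # stand-in for randomForest()
  rank_auc(d$Species, fitted(fit))
}, numeric(1))

# One row, one column per dataset
results_matrix <- matrix(auc_vec, nrow = 1,
                         dimnames = list("AUC", names(auc_vec)))
results_matrix
```

Swapping the model back in only changes the body of the anonymous function; the storage pattern stays the same.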

Cluster robust standard errors after multiple imputation using mice R package

I would like to compute cluster robust standard errors using a mids class object. This arises from multiple imputation of missing values in a column of my original data. A minimal example is below.
library(mice)
y <- c(1,0,0,1,1,1,1,0)
x <- c(26, 34, 55, 15, 31 ,47, 97, 12)
z <- c(2, NA, 0, NA, 3 ,7,7, 5)
mydata <- as.data.frame(cbind(y,x,z))
tempData <- mice(mydata,m=5,maxit=5,meth='pmm',seed=500)
class(tempData)
# [1] "mids"
modelFit <- with(tempData,lm(y ~ x + z))
summary(modelFit)
At this point I would like to get the cluster robust standard errors. Unfortunately miceadds::lm.cluster does not allow "mids" class objects.
The function lm.cluster in miceadds is intended for regular data frames. An example for an application to multiply imputed data is given in the documentation.
Given below is a version adapted to your question. I used the first variable as a cluster indicator because your example didn't have one.
library(mice)
library(miceadds)
id <- c(1,0,0,1,1,1,1,0)
y <- c(26,34,55,15,31,47,97,12)
x <- c(2,NA,0,NA,3,7,7,5)
dat <- data.frame(id,y,x)
imp <- mice(dat, m=5, maxit=5, method='pmm', seed=500)
implist <- lapply(1:5, function(i) complete(imp,i))
mod <- lapply( implist, function(i){
lm.cluster( i, formula=y~x, cluster=i$id )
})
# extract parameters and covariance matrices
betas <- lapply(mod, coef)
vars <- lapply(mod, vcov)
# pool
summary(pool_mi( qhat=betas, u=vars ))

Save iterations of for loop in R

I'm working on a project where I need to collect the intercept, slope, and R squared of several linear regressions. Since I need at least 200 samples of different sample sizes, I set up the code below, but it only saves the last iteration of the loop. Any suggestions on how I can record each iteration so that I have all of the coefficients and R squared values that I require?
for (i in 1:5) {
x <- as.data.frame(mydf[sample(1:1000,25,replace=FALSE),])
mylm <- lm(spd66305~spd66561, data=x)
coefs <- rbind(lman(mylm))
total.coefs <- rbind(coefs)
}
total.coefs
The function used in the loop is below if that is needed.
lman <- function(mylm){
r2 <- summary(mylm)$r.squared
r <- sqrt(r2)
intercept <- coef(mylm)[1]
slope <- coef(mylm)[2]
tbl <- c(intercept,slope,r2,r)
}
Thanks for the help.
Before starting your loop, initialise an empty data frame with total.coefs <- data.frame(). Then, inside the loop, update it with total.coefs <- rbind(total.coefs, coefs). Finally, replace the last line of lman with:
tbl <- data.frame(intercept=intercept, slope=slope, r2=r2, r=r)
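Put together, the three changes might look like the sketch below. The question does not show mydf, so a simulated stand-in with the same column names is assumed here; the coefficients used to simulate it are arbitrary.

```r
set.seed(1)
# Assumed stand-in for mydf: 1000 rows with the two columns used in the question
mydf <- data.frame(spd66561 = rnorm(1000))
mydf$spd66305 <- 2 + 0.5 * mydf$spd66561 + rnorm(1000)

# lman now returns a one-row data frame instead of an unnamed vector
lman <- function(mylm) {
  r2 <- summary(mylm)$r.squared
  data.frame(intercept = coef(mylm)[[1]],
             slope     = coef(mylm)[[2]],
             r2        = r2,
             r         = sqrt(r2))
}

total.coefs <- data.frame()  # initialise before the loop
for (i in 1:5) {
  x <- mydf[sample(1:1000, 25, replace = FALSE), ]
  mylm <- lm(spd66305 ~ spd66561, data = x)
  total.coefs <- rbind(total.coefs, lman(mylm))  # append this iteration's row
}
total.coefs
```

Each pass through the loop adds one row, so total.coefs ends up with one row per sample and the four named columns.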
Here's how I'd do it, using the mtcars data as an example. Note: it's not advisable to call rbind inside the loop when you're building up a data structure, because the object is copied and regrown on every iteration. Calling rbind once after the loop is simpler and faster. I prefer to do this type of operation with a list.
Here the lapply call returns a list of named vectors, and do.call(rbind, ...) binds all of its elements together as rows in a single call. Another thing to note is that I take the samples before entering the loop. This makes debugging easier and can be more efficient overall.
reps <- replicate(3, sample(nrow(mtcars), 5), simplify = FALSE)
do.call(rbind, lapply(reps, function(x) {
mod <- lm(mpg ~ hp, mtcars[x,])
c(coef(mod), R = summary(mod)$r.squared)
}))
# (Intercept) hp R
# [1,] 33.29360 -0.08467169 0.5246208
# [2,] 29.97636 -0.06043852 0.4770310
# [3,] 28.33462 -0.05113847 0.8514720
The following transposed vapply loop produces the same result, and is often faster when you know the type and length of the result to expect:
t(vapply(reps, function(x) {
mod <- lm(mpg ~ hp, mtcars[x,])
c(coef(mod), R = summary(mod)$r.squared)
}, numeric(3)))
Another way to record each loop is to make the work reproducible and keep your datasets around in case you have extreme values, missing values, new questions about the datasets, or other surprises that need investigating.
This is a similar case using the iris dataset.
# create sample data
data(iris)
iris <- iris[ ,c('Sepal.Length','Petal.Length')]
# your function with data.frame fix on last line
lman <- function(mylm){
r2 <- summary(mylm)$r.squared
r <- sqrt(r2)
intercept <- coef(mylm)[1]
slope <- coef(mylm)[2]
data.frame(intercept,slope,r2,r)
}
# set seed to make reproducible
set.seed(3)
# create all datasets
alldatasets <- lapply(1:200,function(x,df){
df[sample(1:nrow(df),size = 50,replace = FALSE), ]
},df = iris)
# create all models based on alldatasets
allmodels <- lapply(alldatasets,lm,formula = Sepal.Length ~ Petal.Length)
# run custom function on all models
lmanresult <- lapply(allmodels,lman)
# format results
result <- do.call('rbind',lmanresult)
row.names(result) <- NULL
# inspect the 129th sample, model, and result
alldatasets[[129]]
summary(allmodels[[129]])
result[129, ]

Error in creating a function to perform ttests on multiple continuous variables

So I'm trying to create a function that takes a character vector of continuous variable names, a categorical variable and a data frame, and outputs a table that includes, for each continuous variable: mean of group 1, mean of group 2, test statistic, confidence interval, p-value.
What is currently here gives me the error: Error in model.frame.default(formula = var ~ class, data = data) : variable lengths differ (found for 'class')
I would love any feedback on how to fix this error and make this function do what I want. I want to make this function far more substantial and flexible, but I can't even get the basic version (handling multiple variables) to work.
THANKS!
#Continuous must be an object of the form:
#vars<-c("cont1", "cont2", "cont3", etc)
#CREATE DATA
cat1<-sample(c(1,2), 100, replace=T)
cont1<-rnorm(100, 25, 8)
cont2<-rnorm(100, 0, 1)
cont3<-rnorm(100, 6, 14.23)
cont4<-rnorm(100, 25, 8)*runif(5, 0.1, 1)
one<-data.frame(cat1, cont1, cont2, cont3, cont4)
#FUNCTION
two.group.comp<-function(continvars,class,data){
attach(data)
descriptives<-function(var){
test<-t.test(var~class, data)
means<-data.frame(test[5])
mean1<-means[1,1]
mean2<-means[2,1]
teststatbig<-data.frame(test[1])
teststat<-teststatbig[1,1]
conf<-data.frame(test[4])
lconf<-conf[1,1]
uconf<-conf[2,1]
pvalues<-data.frame(test[3])
pvalue<-pvalues[1,1]
variablename<-deparse(substitute(var))
entry<-data.frame(variablename,mean1,mean2,lconf,uconf,teststat,pvalue)
}
var<-data.frame(continvars)
table<<-sapply(var,descriptives)
detach(data)
}
#VARIABLES
continvars<-c("cont1", "cont2", "cont3")
#CALL TO FUNCTION
two.group.comp(continvars=continvars, class=cat1, data=one)
Does this do what you want?
two.group.comp <- function(continvars,class,data){
get.stats <- function(x,cat){
f <- unique(cat)
x1 <- x[cat==f[1]]
x2 <- x[cat==f[2]]
tt <- t.test(x1,x2)
smry <- c(tt$estimate,tt$statistic,p=tt$p.value)
names(smry) <- c("mean.1","mean.2","t","p")
return(smry)
}
result <- do.call(rbind,lapply(data[continvars],get.stats,cat=class)) # data[continvars] stays a data frame even for a single column
return(result)
}
# create sample dataset
set.seed(1)
cat1 <-sample(c(1,2), 100, replace=T)
cont1<-rnorm(100, 25, 8)
cont2<-rnorm(100, 0, 1)
cont3<-rnorm(100, 6, 14.23)
cont4<-rnorm(100, 25, 8)*runif(5, 0.1, 1)
one <-data.frame(cat1, cont1, cont2, cont3, cont4)
continvars<-c("cont1", "cont2", "cont3")
# call the function...
two.group.comp(continvars,cat1,one)
# mean.1 mean.2 t p
# cont1 24.4223859 25.33275704 -0.6024497 0.54827955
# cont2 0.0330148 0.01168979 0.1013519 0.91947827
# cont3 10.5784201 4.00651493 2.4183031 0.01747468
Working from the inside out:
get.stats(...) takes a single column of data, splits it into x1 and x2 according to cat (the two groups are taken in the order in which they first appear in cat), runs the t-test, and returns the summary statistics as a named vector.
lapply(...) passes the continvars columns of data to get.stats(...) one at a time.
do.call(rbind,...) binds together the set of vectors returned from lapply(...), row-wise, to generate the final result table.
This also works if you pass column numbers instead of column names.
A piece of advice: the way you have it set up, you pass the column names of the continuous variables, but you pass the grouping factor as a vector. It would be cleaner if you pass the column name of the grouping factor.
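A sketch of that cleaner interface, under the assumption that the grouping column has exactly two groups. The function name two_group_comp and the columns lconf and uconf are illustrative; the confidence interval is included because the question asked for it.

```r
# Variant that takes the grouping column by name (illustrative, not the code above)
two_group_comp <- function(continvars, class, data) {
  grp <- factor(data[[class]])
  stopifnot(nlevels(grp) == 2)  # a two-sample t-test needs exactly two groups

  rows <- lapply(continvars, function(v) {
    tt <- t.test(data[[v]] ~ grp)  # Welch t-test, groups in factor-level order
    c(mean.1 = unname(tt$estimate[1]),
      mean.2 = unname(tt$estimate[2]),
      t      = unname(tt$statistic),
      lconf  = tt$conf.int[1],
      uconf  = tt$conf.int[2],
      p      = tt$p.value)
  })

  result <- do.call(rbind, rows)
  rownames(result) <- continvars
  result
}

# Usage on simulated data
set.seed(1)
one <- data.frame(cat1  = sample(c(1, 2), 100, replace = TRUE),
                  cont1 = rnorm(100, 25, 8),
                  cont2 = rnorm(100, 0, 1))
res <- two_group_comp(c("cont1", "cont2"), "cat1", one)
res
```

Because the groups come from factor levels rather than unique(), mean.1 always refers to the first level (here 1) regardless of row order.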
