Regression model over factor levels using dplyr : getting repeated errors - r

I'm trying to fit a separate logistic regression model for each level of a factor in my data frame, but I get identical results for every level instead of one set of parameters per model. The same thing happens when I run this code on the diamonds dataset:
diamonds$E <-
if_else(diamonds$color=='E',1,0) #Make 'E' binary
fitted_models <- diamonds %>%
group_by(clarity) %>% #Group by clarity
do(model=glm(E~price,#regress price on E
data=diamonds,
family=binomial(link='logit')))
fitted_models %>%
tidy(model)%>%
View #use broom package to look
I'm stuck as to why I'm having this particular issue.

The issue is in your glm call. Replace data=diamonds with data=. (a single dot):
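library(ggplot2) # provides the diamonds data
library(dplyr)
library(broom)
fitted_models <- diamonds %>%
  group_by(clarity) %>% # group by clarity
  do(model = glm(E ~ price, # regress E on price within each group
                 data = ., # . is the data frame for the current group
                 family = binomial(link = 'logit')))
fitted_models %>%
  tidy(model)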
Whenever you use do(), you need to refer to the current group's data frame as . (a dot). As your code currently reads, glm is given the original, ungrouped diamonds data frame rather than the group that the pipe passes to do(), so every group fits the same model. Likewise, without a data argument you cannot just refer to the column E; you need to use .$E. An alternative solution would be glm(.$E~.$price, family=binomial(link='logit')).
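A minimal sketch of the difference, using the built-in mtcars data and a linear model purely for illustration (the dataset, the formula and the names wrong and right are assumptions, not part of the question). With data = mtcars every group gets the same slope, while data = . gives one slope per group:

```r
library(dplyr)

# data = mtcars ignores the grouping: each group refits the full data set
wrong <- mtcars %>%
  group_by(cyl) %>%
  do(slope = coef(lm(mpg ~ wt, data = mtcars))[["wt"]])

# data = . refers to the rows of the current group only
right <- mtcars %>%
  group_by(cyl) %>%
  do(slope = coef(lm(mpg ~ wt, data = .))[["wt"]])

unlist(wrong$slope)  # the same value three times
unlist(right$slope)  # a different value for each cyl group
```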

Related

Issue computing AUC with pROC package

I'm trying to use a function that calls on the pROC package in R to calculate the area under the curve for a number of different outcomes.
# Function used to compute area under the curve
proc_auc <- function(outcome_var, predictor_var) {
pROC::auc(outcome_var, predictor_var)}
To do this, I am intending to refer to outcome names in a vector (much like below).
# Create a vector of outcome names
outcome <- c('outcome_1', 'outcome_2')
However, I am having problems defining variables to input into this function. When I do this, I generate the error: "Error in roc.default(response, predictor, auc = TRUE, ...): 'response' must have two levels". However, I can't work out why, as I reckon I only have two levels...
I would be so happy if anyone could help me!
Here is a reproducible code from the iris dataset in R.
library(pROC)
library(datasets)
library(dplyr)
# Use iris dataset to generate binary variables needed for function
df <- iris %>% dplyr::mutate(outcome_1 = as.numeric(ntile(Sepal.Length, 4)==4),
outcome_2 = as.numeric(ntile(Petal.Length, 4)==4))%>%
dplyr::rename(predictor_1 = Petal.Width)
# Inspect binary outcome variables
df %>% group_by(outcome_1) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))
df %>% group_by(outcome_2) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))
# Function used to compute area under the curve
proc_auc <- function(outcome_var, predictor_var) {
pROC::auc(outcome_var, predictor_var)}
# Create a vector of outcome names
outcome <- c('outcome_1', 'outcome_2')
# Define variables to go into function
outcome_var <- df %>% dplyr::select(outcome[[1]])
predictor_var <- df %>% dplyr::select(predictor_1)
# Use function - first line works but not last line!
proc_auc(df$outcome_1, df$predictor_1)
proc_auc(outcome_var, predictor_var)
outcome_var and predictor_var are dataframes with one column which means they cannot be used directly as an argument in the auc function.
Just specify the column names and it will work.
proc_auc(outcome_var$outcome_1, predictor_var$predictor_1)
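The difference can be checked directly. This sketch rebuilds the outcome columns of df from the question and compares select() with [[ ]], which extracts a column as a plain vector even when its name is stored in a string (the names as_df and as_vec are illustrative):

```r
library(dplyr)

df <- iris %>%
  mutate(outcome_1 = as.numeric(ntile(Sepal.Length, 4) == 4),
         outcome_2 = as.numeric(ntile(Petal.Length, 4) == 4)) %>%
  rename(predictor_1 = Petal.Width)

outcome <- c("outcome_1", "outcome_2")

as_df  <- df %>% select(all_of(outcome[[1]]))  # still a data frame with one column
as_vec <- df[[outcome[[1]]]]                   # a numeric vector

is.data.frame(as_df)  # TRUE
is.numeric(as_vec)    # TRUE
```

A vector such as as_vec can then be passed to proc_auc() together with df$predictor_1.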
You'll have to familiarize yourself with dplyr's non-standard evaluation, which makes it pretty hard to program with. In particular, you need to realize that passing a variable name is an indirection, and that there is a special syntax for it.
If you want to stay with pipes and non-standard evaluation, you can use pROC's roc_ function. It follows an older naming convention in which the underscore-suffixed version of a function takes column names as character strings rather than as unquoted names.
proc_auc2 <- function(data, outcome_var, predictor_var) {
pROC::auc(pROC::roc_(data, outcome_var, predictor_var))
}
You can then pass the column names as strings to this new function:
proc_auc2(df, outcome[[1]], "predictor_1")
# or equivalently:
df %>% proc_auc2(outcome[[1]], "predictor_1")
That being said, for most use cases you probably want to follow @druskacik's answer and use standard R evaluation.

Taking the nesting variables and mapping them to predicted values?

I'm fitting linear models for several different time periods (the example below uses countries).
I have two questions.
Question 1: Is there a way I can pass the mods column to ggpredict without starting with df1$mods? Something like df1 %>% select(mods) %>% map(., ggpredict). I actually tried that, but it doesn't work. It's not a huge deal, but I am curious.
Question 2: Is there a way I can automate taking the names of the grouping variable's categories and mapping them to the predicted values?
Thank you!
library(tidyverse)
library(ggeffects)
#make a fake data frame
df1<-data.frame(country=rep(c("A", "B"), 100), var1=rnorm(200), var2=rnorm(200))
df1
df1 %>%
group_by(country) %>%
#in reality I have 10 to 12 variables and 1 grouping variable with 12 categories
nest(var1, var2) %>%
#in reality I'm also doing glms, but I don't think it matters
mutate(mods=map(data, function(x) lm(var1~var2, data=x))) ->out
out$mods %>%
map_df(., ggpredict, terms=c('var2 [0,0.5]')) %>%
#Is there a way to automate this line; by taking the values of the grouping variables
#somehow.
mutate(country=rep(c('A', 'B'), each=2))
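One possible way to approach Question 2, sketched under assumptions: name the list of models after the grouping column and let map_dfr() write those names into an id column via .id. To keep the sketch free of extra packages it uses predict() on a hypothetical new_x grid as a stand-in for ggpredict(); whether ggpredict() output binds the same way is not verified here.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
df1 <- data.frame(country = rep(c("A", "B"), 100),
                  var1 = rnorm(200), var2 = rnorm(200))

# one model per country, stored in a list column
out <- df1 %>%
  group_by(country) %>%
  nest() %>%
  mutate(mods = map(data, ~ lm(var1 ~ var2, data = .x)))

# hypothetical prediction grid, mirroring terms = 'var2 [0,0.5]'
new_x <- data.frame(var2 = c(0, 0.5))

preds <- out$mods %>%
  set_names(out$country) %>%   # list names become the values of the id column
  map_dfr(~ data.frame(var2 = new_x$var2,
                       predicted = predict(.x, newdata = new_x)),
          .id = "country")
preds
```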

Extracting beta coefficient from group_map()

I am working on a data frame where I am trying to regress one column on another (scores on a female dummy), grouped by a third column (country), and to extract the coefficient on the female dummy.
I have tried using dplyr, first grouping my data frame by country with group_by(), then applying a regression with group_map(). First, the coefficients shown in the result are the same for every group. Second, I cannot seem to extract only the second coefficient; when I try, I get an error saying the function cannot be applied to a list:
f1 %>% group_by(background) %>%
group_map(~ coef(lm(pv1math ~ female, data = f1))) %>%
group_map(~ coef[2])
I essentially want a series of the second coefficient.
I keep getting this error for group_split:
error in UseMethod("group_split") :
no applicable method for 'group_split' applied to an object of class "list"
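A sketch of one way this could be written, assuming the repeated coefficients come from data = f1 (which refits the full data frame in every group) and that the group_split error comes from piping the list returned by the first group_map() into a second one. Since f1 is not available here, mtcars and the model mpg ~ wt stand in for it:

```r
library(dplyr)

grouped <- mtcars %>% group_by(cyl)

# .x is the data frame for the current group; [[2]] picks the second coefficient
betas <- grouped %>%
  group_map(~ coef(lm(mpg ~ wt, data = .x))[[2]]) %>%
  unlist()

# attach the group labels so each coefficient can be identified
result <- group_keys(grouped) %>%
  mutate(beta = betas)
result
```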

Comparing multiple variables in more than two groups with t.test

I tried to run t-tests comparing the values in each of time1, time2, time3, ... with threshold.
Here is my data frame:
time.df1<-data.frame("condition" =c("A","B","C","A","C","B"),
"time1" = c(1,3,2,6,2,3) ,
"time2" = c(1,1,2,8,2,9) ,
"time3" = c(-2,12,4,1,0,6),
"time4" = c(-8,3,2,1,9,6),
"threshold" = c(-2,3,8,1,9,-3))
I tried to compare each pair of columns with:
time.df1%>%
select_if(is.numeric) %>%
purrr::map_df(~ broom::tidy(t.test(. ~ threshold)))
However, I got this error message
Error in eval(predvars, data, env) : object 'threshold' not found
So, I tried another way (maybe it is wrong)
time.df2<-time.df1%>%gather(TF,value,time1:time4)
time.df2%>% group_by(condition) %>% do(tidy(t.test(value~TF, data=.)))
Sadly, I got this error, even when I limited condition to only two levels (A, B):
Error in t.test.formula(value ~ TF, data = .) : grouping factor must have exactly 2 levels
I want to loop a t-test over each time column against the threshold column, per condition, and then use broom::tidy to get the results in tidy format. My approaches clearly aren't working, so any advice on improving my code is much appreciated.
An alternative route is to define a function with the required options for t.test() up front, reshape the data so that each 'time*' column is paired with 'threshold', nest each pair into a list column, and then use map() with the relevant functions from 'broom' to simplify the output.
library(tidyverse)
library(broom)
ttestfn <- function(data, ...){
# amend this function to include required options for t.test
res = t.test(data$resp, data$threshold)
return(res)
}
df2 <-
time.df1 %>%
gather(time, "resp", - threshold, -condition) %>%
group_by(time) %>%
nest() %>%
mutate(ttests = map(data, ttestfn),
glances = map(ttests, glance))
# df2 has data frames, t-test objects and glance summaries
# as separate list columns
Now it's easy to query this object to extract what you want:
df2 %>%
  select(time, glances) %>% # .drop is deprecated in tidyr >= 1.0, so drop the other list columns explicitly
  unnest(glances)
However, it's unclear to me what you want to do with 'condition', so I'm wondering if it is more straightforward to reframe the question in terms of a GLM (as camille suggested in the comments: ANOVA is part of the GLM family).
Reshape the data, define 'threshold' as the reference level of the 'time' factor and the default 'treatment' contrasts used by R will compare each time to 'threshold':
time.df2 <-
time.df1 %>%
gather(key = "time", value = "resp", -condition) %>%
mutate(time = fct_relevel(time, "threshold")) # define 'threshold' as baseline
fit.aov <- aov(resp ~ condition * time, data = time.df2)
summary(fit.aov)
summary.lm(fit.aov) # coefficients and p-values
Of course this assumes that all subjects are independent (i.e. there are no repeated measures). If not, then you'll need to move on to more complicated procedures. Anyway, moving to appropriate GLMs for the study design should help minimise the pitfalls of doing multiple t-tests on the same data set.
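We could remove threshold from the selected columns and then reintroduce it by stacking it with each time column in a data.frame, which then goes into the formula interface of t.test:
library(tidyverse)
time.df1 %>%
  select_if(is.numeric) %>%
  select(-threshold) %>%
  map_df(~ data.frame(value = c(.x, time.df1$threshold), # stack one time column on top of threshold
                      group = rep(c("time", "threshold"), each = nrow(time.df1))) %>%
           t.test(value ~ group, data = .) %>% # two-level grouping factor, as the formula method requires
           broom::tidy(),
         .id = "time")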

how to apply lm() to datasets split by factors

In a housing dataset there are three variables: bsqft (the building size of the house), county (a factor with 9 levels) and price. I would like to fit a separate regression line of price on bsqft for each county. Instead of calling lm() repeatedly, I would prefer to use an apply-style function in R, but I have no idea how to set it up. Could anyone help me with that? Thanks a lot.
You can use dplyr and broom to do regressions by group and summarise the information back into a dataframe
library(dplyr)
library(broom)
your_dataset %>%
group_by(county) %>%
do(tidy(lm(price ~ bsqft, data=.)))
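Since the question asks about the apply family, here is a base R sketch of the same idea with split() and lapply(). The housing data frame below is simulated for illustration only (three counties instead of nine, made-up prices); replace it with the real dataset.

```r
set.seed(1)

# simulated stand-in for the real housing data
housing <- data.frame(
  county = factor(rep(c("a", "b", "c"), each = 20)),
  bsqft  = runif(60, 800, 3000)
)
housing$price <- 50000 + 120 * housing$bsqft + rnorm(60, sd = 10000)

# split() gives one data frame per county; lapply() fits lm() on each
fits <- lapply(split(housing, housing$county),
               function(d) lm(price ~ bsqft, data = d))

# one row per county, columns for the intercept and the bsqft slope
coefs <- t(sapply(fits, coef))
coefs
```

Keeping the fitted lm objects in a list (fits) means summaries or predictions for any single county remain available, for example summary(fits[["a"]]).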

Resources