Using dplyr rowwise to create multiple linear models - r

Considering this post:
https://www.tidyverse.org/blog/2020/06/dplyr-1-0-0/
I was trying to create multiple models for a data set using multiple formulas. The example in the post is:
library(dplyr, warn.conflicts = FALSE)
models <- tibble::tribble(
~model_name, ~ formula,
"length-width", Sepal.Length ~ Petal.Width + Petal.Length,
"interaction", Sepal.Length ~ Petal.Width * Petal.Length
)
iris %>%
nest_by(Species) %>%
left_join(models, by = character()) %>%
rowwise(Species, model_name) %>%
mutate(model = list(lm(formula, data = data))) %>%
summarise(broom::glance(model))
You can see that the rowwise() function is used to get the answer, but when I don't use it, I still get the correct answer:
iris %>%
nest_by(Species) %>%
left_join(models, by = character()) %>%
mutate(model = list(lm(formula, data = data))) %>%
summarise(broom::tidy(model))
I only lost the "model_name" column. But given that the rowwise documentation says the function is for computing row by row, I don't understand why the models are still computed per row without it. Why does this happen?
Thanks in advance.

considering
https://cran.r-project.org/web/packages/dplyr/vignettes/rowwise.html
You can optionally supply “identifier” variables in your call to rowwise(). These variables are preserved when you call summarise(), so they behave somewhat similarly to the grouping variables passed to group_by():
I didn't understand how identifiers work. As far as I can tell, the identifiers (Species, model_name) don't affect how a value is computed, only which columns the resulting tibble keeps.
So if you have a rowwise tibble created by nest_by(), you don't need rowwise() to compute by row. In my example, rowwise() only gives you an extra column of information; the linear model is still the same. It is just a more elegant way to present the result and doesn't change how it is computed.
Thanks to tmfmnk
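To see this concretely, here is a minimal sketch, assuming dplyr >= 1.0.0. The single formula and the names nested, fits, with_id and without_id are only illustrative. It checks that nest_by() already returns a rowwise tibble with Species as its identifier, and that calling rowwise() again only changes which columns summarise() keeps:

```r
library(dplyr, warn.conflicts = FALSE)

# nest_by() returns a rowwise tibble; the nesting variable is its identifier
nested <- nest_by(iris, Species)
is_rowwise <- inherits(nested, "rowwise_df")
ids <- group_vars(nested)

# One model per row; no explicit rowwise() call is needed
fits <- nested %>%
  mutate(model = list(lm(Sepal.Length ~ Petal.Width, data = data)))

# The identifier (Species) is carried through summarise()
with_id <- fits %>%
  summarise(r2 = summary(model)$r.squared, .groups = "drop")

# Re-declaring rowwise() without identifiers drops Species from the result,
# but the per-row computation is unchanged
without_id <- fits %>%
  rowwise() %>%
  summarise(r2 = summary(model)$r.squared, .groups = "drop")
```

Both results hold the same R-squared values; only the Species column differs.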

Related

Comparing multiple variables in more than two groups with t.test

I tried to do a t-test comparing the values in time1/2/3... against threshold.
Here is my data frame:
time.df1<-data.frame("condition" =c("A","B","C","A","C","B"),
"time1" = c(1,3,2,6,2,3) ,
"time2" = c(1,1,2,8,2,9) ,
"time3" = c(-2,12,4,1,0,6),
"time4" = c(-8,3,2,1,9,6),
"threshold" = c(-2,3,8,1,9,-3))
and I tried to compare each pair of columns with:
time.df1%>%
select_if(is.numeric) %>%
purrr::map_df(~ broom::tidy(t.test(. ~ threshold)))
However, I got this error message
Error in eval(predvars, data, env) : object 'threshold' not found
So, I tried another way (maybe it is wrong)
time.df2<-time.df1%>%gather(TF,value,time1:time4)
time.df2%>% group_by(condition) %>% do(tidy(t.test(value~TF, data=.)))
Sadly, I got this error, even when I limited the condition to only two levels (A, B):
Error in t.test.formula(value ~ TF, data = .) : grouping factor must have exactly 2 levels
I want to loop a t-test over each time column against the threshold column, per condition, and then use broom::tidy to get the results in tidy format. My approaches apparently aren't working; any advice on improving my code is much appreciated.
An alternative route would be to define a function with the required options for t.test() up front, then create a data frame for each pair of variables (i.e. each combination of 'time*' and 'threshold'), nest them into list columns, and use map() combined with the relevant functions from 'broom' to simplify the outputs.
library(tidyverse)
library(broom)
ttestfn <- function(data, ...){
# amend this function to include required options for t.test
res = t.test(data$resp, data$threshold)
return(res)
}
df2 <-
time.df1 %>%
gather(time, "resp", - threshold, -condition) %>%
group_by(time) %>%
nest() %>%
mutate(ttests = map(data, ttestfn),
glances = map(ttests, glance))
# df2 has data frames, t-test objects and glance summaries
# as separate list columns
Now it's easy to query this object to extract what you want
df2 %>%
unnest(glances, .drop=TRUE)
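With current tidyr, gather() is superseded by pivot_longer() and the .drop argument of unnest() is deprecated, so the same idea could be sketched as below. This assumes dplyr >= 1.0.0, tidyr >= 1.0.0 and broom; the default Welch t-test options are kept, and the name results is only illustrative:

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(broom)

time.df1 <- data.frame(
  condition = c("A", "B", "C", "A", "C", "B"),
  time1 = c(1, 3, 2, 6, 2, 3),
  time2 = c(1, 1, 2, 8, 2, 9),
  time3 = c(-2, 12, 4, 1, 0, 6),
  time4 = c(-8, 3, 2, 1, 9, 6),
  threshold = c(-2, 3, 8, 1, 9, -3)
)

results <- time.df1 %>%
  # long format: one row per (condition, time) pair, threshold kept alongside
  pivot_longer(starts_with("time"), names_to = "time", values_to = "resp") %>%
  # one row per time column, with its rows nested in the list column `data`
  nest_by(time) %>%
  # tidy() returns a one-row data frame, which summarise() spreads into columns
  summarise(tidy(t.test(data$resp, data$threshold)), .groups = "drop")
```

The result has one row per time column, with the test statistic, p-value and confidence interval as ordinary columns.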
However, it's unclear to me what you want to do with 'condition', so I'm wondering if it is more straightforward to reframe the question in terms of a GLM (as camille suggested in the comments: ANOVA is part of the GLM family).
Reshape the data, define 'threshold' as the reference level of the 'time' factor and the default 'treatment' contrasts used by R will compare each time to 'threshold':
time.df2 <-
time.df1 %>%
gather(key = "time", value = "resp", -condition) %>%
mutate(time = fct_relevel(time, "threshold")) # define 'threshold' as baseline
fit.aov <- aov(resp ~ condition * time, data = time.df2)
summary(fit.aov)
summary.lm(fit.aov) # coefficients and p-values
Of course this assumes that all subjects are independent (i.e. there are no repeated measures). If not, then you'll need to move on to more complicated procedures. Anyway, moving to appropriate GLMs for the study design should help minimise the pitfalls of doing multiple t-tests on the same data set.
We could remove threshold from the selected columns and then reintroduce it by creating a data.frame that is passed to the formula method of t.test:
library(tidyverse)
time.df1 %>%
  select_if(is.numeric) %>%
  select(-threshold) %>%
  map_df(~ data.frame(time = .x, time.df1['threshold']) %>%
    broom::tidy(t.test(. ~ threshold)))

Using dplyr to compare models

I was working through the examples in the dplyr documentation of the do() function and all was well until I came across this snippet to summarize model comparisons:
# compare %>% summarise(p.value = aov$`Pr(>F)`)
The error was "Error: expecting a single value". So I found a way forward by accessing the list of aov elements directly. This question is about subsetting operators, and whether there is a better way to do this. Here is my full attempt and solution.
models <- group_by(mtcars, cyl) %>%
  do(mod_lin = lm(mpg ~ disp, data = .),
     mod_quad = lm(mpg ~ poly(disp, 2), data = .))
compare <- models %>% do(aov = anova(.$mod_lin, .$mod_quad))
compare %>% summarise(p.value = aov$'Pr(>F)')
# Error: expecting a single value
# Looking into the structure of compare
# select comparison 1
compare$aov[[1]]
# select comparison 1 and all of element 6 (the p-values)
compare$aov[[1]][6]
# just the p-values
compare$aov[[1]][2, 6]
compare %>% summarise(pvalue = aov[2, 6]) # this gets the p-values by group
So I suppose I'm wondering how, with an object of classes ('rowwise_df', 'tbl_df' and 'data.frame'), summarise can intuit the [[ ]] operator, and also whether there might be a better way to do this.
You could try
compare %>% do(.$aov['Pr(>F)']) %>% na.omit()
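On the subsetting question: compare is a rowwise data frame, so summarise() evaluates its expressions one row at a time, and inside that evaluation aov refers to a single anova table rather than to the whole list column. That is why aov[2, 6] works, while aov$'Pr(>F)' returns two values per row (NA for the first model and the p-value for the second). A minimal sketch of the same comparison without do(), assuming dplyr >= 1.0.0 (the name p_values is only illustrative):

```r
library(dplyr, warn.conflicts = FALSE)

# One row per cyl value, with that group's rows in the list column `data`
models <- mtcars %>%
  nest_by(cyl) %>%
  mutate(
    mod_lin  = list(lm(mpg ~ disp, data = data)),
    mod_quad = list(lm(mpg ~ poly(disp, 2), data = data))
  )

# In a rowwise data frame, mod_lin and mod_quad refer to the single model
# stored in the current row
compare <- models %>%
  mutate(aov = list(anova(mod_lin, mod_quad)))

# Keep only the second entry of the p-value column (the first is NA)
p_values <- compare %>%
  summarise(p.value = aov[["Pr(>F)"]][2], .groups = "drop")
```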

dplyr summarize with a function of a dataframe

I'm having some trouble carrying out a routine using the dplyr package. In short, I have a function which takes a dataframe as an input, and returns a single (numeric) value; I'd like to be able to apply this function to several subsets of a dataframe. It feels like I should be able to use group_by() to specify the subsets of the dataframe, then pipe along to the summarize() function, but I'm not sure how to pass the (subsetted) dataframe along to the function I'd like to apply.
As a simplified example, let's say I'm using the iris dataset, and I've got a fairly simple function which I'd like to apply to several subsets of the data:
data(iris)
lm.func = function(.data){
lm.fit = lm(Petal.Width ~ Petal.Length, data = .data)
out = summary(lm.fit)$coefficients[2,1]
return(out)
}
Now, I'd like to be able to apply this function to subsets of iris based on some other variable, like Species. I'm able to manually filter the data, then pipe along to my function, for example:
iris %>% filter(Species == "setosa") %>% lm.func(.)
But I'd like to be able to apply lm.func to each subset of the data, based on Species. My first thought was to try something like the following:
iris %>% group_by(Species) %>% summarize(coef.val = lm.func(.))
Even though I know this doesn't work, my idea is to try to pass each subset of iris to the lm.func function.
To clarify, I'd like to end up with a dataframe with two columns -- a first with each level of the grouping variable, and a second with the output of lm.func when the data are restricted to a subset specified by the grouping variable.
Is it possible to use summarize() in this way?
You can try with do
iris %>%
group_by(Species) %>%
do(data.frame(coef.val=lm.func(.)))
# Species coef.val
#1 setosa 0.2012451
#2 versicolor 0.3310536
#3 virginica 0.1602970
There is an easy way to do this without creating a function:
library(broom)
models <-iris %>%
group_by(Species) %>%
do(
mod = lm(Petal.Width ~ Petal.Length, data =.)
)
models %>% do(tidy(.$mod))
term estimate std.error statistic p.value
1 (Intercept) -0.04822033 0.12164115 -0.3964146 6.935561e-01
2 Petal.Length 0.20124509 0.08263253 2.4354220 1.863892e-02
3 (Intercept) -0.08428835 0.16070140 -0.5245029 6.023428e-01
4 Petal.Length 0.33105360 0.03750041 8.8279995 1.271916e-11
5 (Intercept) 1.13603130 0.37936622 2.9945505 4.336312e-03
6 Petal.Length 0.16029696 0.06800119 2.3572668 2.253577e-02
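In recent dplyr versions, where do() is superseded, another option that keeps lm.func as it is would be group_modify(), which passes each group's rows to the function as a data frame. A minimal sketch, assuming a dplyr version that provides group_modify() (the name coefs is only illustrative):

```r
library(dplyr, warn.conflicts = FALSE)

lm.func <- function(.data) {
  lm.fit <- lm(Petal.Width ~ Petal.Length, data = .data)
  summary(lm.fit)$coefficients[2, 1]  # slope for Petal.Length
}

coefs <- iris %>%
  group_by(Species) %>%
  # .x is the data frame of rows for the current group (without Species)
  group_modify(~ data.frame(coef.val = lm.func(.x))) %>%
  ungroup()
```

This gives the two-column result asked for in the question: one row per Species and the slope returned by lm.func.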

Combining dplyr::do() with dplyr::mutate?

I would like to achieve the following: for each subgroup of a dataset, I would like to carry out a regression, and the residuals of that regression should be saved as a new variable in the original dataframe. For instance,
group_by(mtcars, gear) %>% mutate(res = residuals(lm(mpg~carb, .)))
indicates what I think should work, but does not (anyone care to explain why it does not work?). One way to get the residuals is to do the following:
group_by(mtcars, gear) %>% do(res = residuals(lm(mpg~carb, .)))
which gives me a dataframe in which dbl-objects are saved, i.e. those contain the residuals for each group. However, it seems they do not contain the original rownames that would help me to merge them back to the original data.
So, my question is: how can I achieve what I want to do in a dplyr-kind of way?
Obviously, it can be achieved in other ways. To give you an example, the following works just fine:
dat <- mtcars
dat$res <- NA
for(i in unique(mtcars$gear)){
dat[dat$gear==i, "res"] <- residuals(lm(mpg ~ disp, data=dat[dat$gear==i,]))
}
However, my understanding is that dplyr is made for this purpose, so there should be a dplyr-style way?
Any hints / tips / comments are appreciated.
Remark: this question is very similar to lm() called within mutate() except that in that question, only one parameter per group is retained, which makes a merge-approach easy. I have an entire vector with no rownames, so that I would have to rely on the ordering of the vector to do that, and that seems troublesome to me.
library(lazyeval)
eq <- "y ~ x"
dat <- mtcars
dat %>%
group_by(gear) %>%
mutate(res=residuals(lm(interp(eq, y = mpg, x = disp))))
or without lazyeval
dat %>%
group_by(gear) %>%
mutate(res=residuals(lm(deparse(substitute(mpg~disp)))))
#This gives you the residuals. You can then combine this with original data.
mtcars %>%
group_by(cyl) %>%
do(model = lm(mpg ~ wt, data=.)) %>%
do((function(reg_mod) {
data.frame(reg_res = residuals(reg_mod$model))
})(.))
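As to why the first attempt fails: in group_by(mtcars, gear) %>% mutate(res = residuals(lm(mpg~carb, .))), the pipe fills . with the whole grouped data frame, so lm() is fit on all 32 rows and the residuals do not match the size of each group. Leaving out the data argument lets lm() find mpg and carb in the current group, and mutate() keeps the original row order, so no merge is needed. A minimal sketch (with_res is an illustrative name):

```r
library(dplyr, warn.conflicts = FALSE)

with_res <- mtcars %>%
  group_by(gear) %>%
  # the formula is evaluated per group, so lm() sees only that group's rows;
  # unname() drops the observation names that residuals() attaches
  mutate(res = unname(residuals(lm(mpg ~ carb)))) %>%
  ungroup()
```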

Using dplyr, how to pipe or chain to plot()?

I am new to the dplyr package and am trying to use it for my visualization assignment. I am able to pipe my data to ggplot() but unable to do that with plot(). I came across this post, but the answers, including the one in the comments, didn't work for me.
Code 1:
emission <- mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))
emission %>%
plot(year, total,.)
I get the following error:
Error in plot(year, total, emission) : object 'year' not found
Code 2:
mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))%>%
plot(year, total, .)
This didn't work either and returned the same error.
Interestingly, the solution from the post I mentioned works for the same dataset but doesn't work out for my own data. However, I am able to create the plot using emission$year and emission$total.
Am I missing anything?
plot.default doesn't take a data argument, so your best bet is to pipe to with:
mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))%>%
with(plot(year, total))
In case anyone missed @aosmith's comment on the question, plot.formula does have a data argument, but of course the formula is the first argument so we need to use the . to put the data in the right place. So another option is
... %>%
plot(total ~ year, data = .)
Of course, ggplot takes data as the first argument, so to use ggplot do:
... %>%
ggplot(aes(x = year, y = total)) + geom_point()
lattice::xyplot is like plot.formula: there is a data argument, but it's not first, so:
... %>%
xyplot(total ~ year, data = .)
Just look at the documentation and make sure you use a . if data isn't the first argument. If there's no data argument at all, using with is a good work-around.
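Since mynei is not available here, the two base-graphics options can be tried on a built-in data set. This is only a sketch: mtcars, the summed hp column and the name totals stand in for the emissions data, and pdf(NULL) is used just to avoid writing a plot file:

```r
library(dplyr, warn.conflicts = FALSE)

totals <- mtcars %>%
  group_by(cyl) %>%
  summarise(total = sum(hp))

pdf(NULL)  # null graphics device, so no file is written

# plot.default has no data argument: evaluate the call inside the data
totals %>% with(plot(cyl, total))

# plot.formula has a data argument that is not first: place it with `.`
totals %>% plot(total ~ cyl, data = .)

dev.off()
```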
As an alternative, you can use the %$% operator from magrittr to be able to access the columns of a dataframe directly. For example:
iris %$%
plot(Sepal.Length~Sepal.Width)
This is often useful when you need to feed the result of a dplyr chain to a base R function (such as table, lm, plot, etc.). It can also be used to extract a column from a dataframe as a vector, e.g.:
iris %>% filter(Species=='virginica') %$% Sepal.Length
This is the same as:
iris %>% filter(Species=='virginica') %>% pull(Sepal.Length)
