I want to map over the columns of a data frame and run a t-test of each column against a fixed column. The desired output is a data frame in which each row holds the t-test results for one column; I can switch to map_dfr once the mapping itself works.
I dug into tidy eval, but I'm not sure whether this is a tidy eval error. Any help much appreciated!
(mtcars as toy dataset)
library(ggpubr)    # provides compare_means()
library(tidyverse) # provides %>% and map()
# Test single cases - good
compare_means(mpg ~ cyl, data = mtcars)
compare_means(disp ~ cyl, data = mtcars)
compare_means(hp ~ cyl, data = mtcars)
# Trial map - fail
mtcars %>%
map(~compare_means(.x ~ cyl, data = mtcars))
Error: Can't subset columns that don't exist.
x Column `.x` doesn't exist.
Following tidyeval guidance: https://tidyeval.tidyverse.org/dplyr.html
Tried to see if quoting / unquoting was the issue, but no dice
# Abstract variables
test_data <- function(group_var) {
quote_var <- enquo(group_var)
data %>% compare_means(quote_var ~ cyl, data = mtcars)
}
That's an NSE error, but not tidyeval. You're mapping over the vectors inside mtcars. You're not mapping over the column names of mtcars.
With inject() from the latest rlang version you can do some NSE programming with non-tidyeval functions:
names(mtcars) %>% map(~ rlang::inject(compare_means(!!rlang::sym(.x) ~ cyl, data = mtcars)))
Three things are going on:
We map over the names of the data frame.
We transform the name to a symbol, i.e. an R variable.
We inject that symbol into the formula using inject() and !!.
I have not tested the code.
Actually, it may just be about formula evaluation specifically:
library(ggpubr)
library(tidyverse)
# Test data with 2 Species only
iris.subset <- iris %>%
filter(Species != 'virginica')
# Test single case
iris.subset %>%
compare_means(Sepal.Width ~ Species, data = .)
# Test direct map - doesn't work
iris.subset[1:4] %>%
map(~compare_means(. ~ Species, data = iris.subset))
Is it about formula evaluation? Test as.formula()
as.formula(paste0(names(iris.subset[1]), " ~ Species"))
# Pipe into test
names(iris.subset[1:4]) %>%
map_df(~compare_means(formula = as.formula(paste0(., " ~ Species")), data = iris.subset))
Success!!
Couldn't get an example to work with mtcars but will re-post if I do
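For what it's worth, the same build-the-formula-from-a-column-name idea can be sketched in base R with mtcars. This is only an illustration under assumptions that are not in the posts above: it uses t.test() instead of compare_means(), it groups by am (which has two levels) rather than cyl, and ttest_by_am and results are made-up names.

```r
# Columns to test against the fixed grouping column `am` (assumed grouping).
vars <- setdiff(names(mtcars), "am")

# Hypothetical helper: build `<var> ~ am` from the column name,
# run Welch's t-test, and return a one-row data frame.
ttest_by_am <- function(v) {
  fit <- t.test(reformulate("am", response = v), data = mtcars)
  data.frame(
    variable  = v,
    statistic = unname(fit$statistic),
    p.value   = fit$p.value
  )
}

# One row of results per tested column.
results <- do.call(rbind, lapply(vars, ttest_by_am))
results
```

reformulate() plays the same role as as.formula(paste0(...)) here: the formula is built from a string, so nothing has to be unquoted.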
My dataset has a dummy variable which divides the data set into two groups. I would like to display the descriptive statistics for both groups next to each other using stargazer. Is this possible?
For example, if there is the mtcars data set and the variable $am divides the dataset into two groups, how can I display the one group on the left side and the other group on the other side?
Thank you!
I was able to display the two statistics below each other (I had to make two separate datasets for each group), but never next to each other.
library(stargazer)
treated <- mtcars[mtcars$am == 1,]
control <- mtcars[mtcars$am == 0,]
stargazer(treated, control, keep=c("mpg", "cyl", "disp", "hp"),
header=FALSE, title="Descriptive statistics", digits=1, type="text")
[Image: descriptive statistics displayed below each other]
Someone should point out if I'm mistaken, but I don't believe that stargazer will allow for the kind of nested tables you are looking for. However, there are other packages like modelsummary, gtsummary, and flextable that can produce tables similar to stargazer. I have included examples below using select mtcars variables summarized by am. Personally, I prefer gtsummary due to its flexibility.
library(tidyverse)
data(mtcars)
### modelsummary
# not great since it treats `cyl` as a continuous variable
# https://vincentarelbundock.github.io/modelsummary/articles/datasummary.html
library(modelsummary)
datasummary_balance(~am, data = mtcars, dinm = FALSE)
### gtsummary
# based on example 3 from here
# https://www.danieldsjoberg.com/gtsummary/reference/add_stat_label.html
library(gtsummary)
mtcars %>%
select(am, mpg, cyl, disp, hp) %>%
tbl_summary(
by = am,
missing = "no",
type = list(mpg ~ 'continuous2',
cyl ~ 'categorical',
disp ~ 'continuous2',
hp ~ 'continuous2'),
statistic = all_continuous2() ~ c("{mean} ({sd})", "{median}")
) %>%
add_stat_label(label = c(mpg, disp, hp) ~ c("Mean (SD)", "Median")) %>%
modify_footnote(everything() ~ NA)
### flextable
# this function only works on continuous vars, so I removed `cyl`
# https://davidgohel.github.io/flextable/reference/continuous_summary.html
library(flextable)
mtcars %>%
select(am, mpg, disp, hp) %>%
continuous_summary(
by = "am",
hide_grouplabel = FALSE,
digits = 3
)
You can use the modelsummary package and its datasummary function, which offers a formula-based language to describe the specific table you need. (Disclaimer: I am the maintainer.)
In addition to the super flexible datasummary function, there are many other functions to summarize data in easier ways. See in particular the datasummary_balance() function here:
https://vincentarelbundock.github.io/modelsummary/articles/datasummary.html
library(modelsummary)
dat <- mtcars[, c("mpg", "cyl", "disp", "hp", "am")]
datasummary(
All(dat) ~ Factor(am) * (N + Mean + SD + Min + Max),
data = dat)
I'm trying to figure out why these two code chunks give me different p-values for Welch's t-test. I just tried to write a tidy version of the base R code that creates a table with both statistics, but the tidy version gives a very small p-value and I'm confused as to why.
t.test(mpg ~ vs, data = mtcars) # p-value = 0.0001098
t.test(mpg ~ am, data = mtcars) # p-value = 0.001374
library(tidyverse)
library(broom) # provides tidy()
options(scipen = 999)
mtcars %>%
dplyr::select(mpg, vs, am) %>%
pivot_longer(names_to = 'names', values_to = 'values', 2:3) %>%
nest(data = -names) %>%
mutate(
test = map(data, ~ t.test(.x$mpg, .x$values)), # S3 list-col
tidied = map(test, tidy)
) %>%
unnest(tidied) # vs = 0.000000000000000010038009 and am = 0.000000000000000009611758
If you run simply:
t.test(mtcars$mpg, mtcars$vs)
You'll get the same values as in your nested data example.
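So the issue is not the nesting - it's that you're performing a different kind of t-test. The formula version treats vs or am as a grouping variable with two levels (0, 1) and compares the mean mpg between those groups. The two-vector version does not group anything: it compares the mean of mpg with the mean of the 0/1 column itself.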
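A small base R sketch of the difference, using vs only. The object names p_formula, p_split and p_columns are made up for illustration.

```r
# Formula interface: `vs` is a grouping variable, so this compares
# mean mpg between the vs == 0 and vs == 1 groups.
p_formula <- t.test(mpg ~ vs, data = mtcars)$p.value

# Two-vector interface, equivalent to the formula call: pass the mpg
# values of each group as separate vectors.
p_split <- t.test(mtcars$mpg[mtcars$vs == 0],
                  mtcars$mpg[mtcars$vs == 1])$p.value

# Two-vector interface as used in the nested pipeline: compares the mean
# of mpg with the mean of the 0/1 column itself, a different question.
p_columns <- t.test(mtcars$mpg, mtcars$vs)$p.value

c(formula = p_formula, split = p_split, columns = p_columns)
```

The first two p-values agree with each other; the third is the much smaller one from the nested version. To get the formula result inside the pipeline, the map step would have to split mpg by the 0/1 values (or call t.test(mpg ~ values, data = .x)) instead of passing both columns as samples.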
I wanted to know how to pass the output of pipe operation directly into lm().
For example, I can pass this following yay vector into lm() directly.
set.seed(40)
yay = c(rnorm(15), exp(rnorm(15)), runif(20, min = -3, max = 0))
lm(yay~1)
#> Call:
#> lm(formula = yay ~ 1)
#> Coefficients:
#> (Intercept)
#> -0.09522
But when I tried something like this, it threw an error.
library(tidyverse)
library(palmerpenguins)
data("penguins")
filter_penguins <- penguins %>% filter(species == "Adelie")
filter_penguins %>%
filter(island == "Torgersen") %>%
select(bill_length_mm) %>%
pull() %>%
lm(. ~ 1)
#> Error in formula.default(object, env = baseenv()) : invalid formula
I have also tried saving the pull() output into an object and feeding that into lm() afterwards, and that works. But why does the dot placeholder not work this way?
Thank you very much.
The issue is that lm() inside a pipeline treats the piped-in data as its formula argument, so the data ends up in the wrong place. Try:
filter_penguins %>%
filter(island == "Torgersen") %>%
select(bill_length_mm) %>%
lm(data = ., pull(.) ~ 1)
Edit: I realise that I misread the question, and thought OP wanted to pass in a variable name as part of a formula, rather than pass in the dataset itself. I'll leave this post in for the former reason, anyway.
It won't work because the first argument to lm will be whatever is piped in, which is not a proper formula (as the error suggests).
Using your example, and pretending that the piped value is "var", then
"var" %>%
lm(. ~ 1)
would be evaluated as
lm(formula = "var", . ~ 1)
so the . ~ 1 portion isn't part of the formula argument. You could construct the formula with paste0 or similar, though. For example, this would work:
"mpg" %>%
paste0(" ~ .") %>%
lm(data = mtcars)
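If the goal really is to pipe a bare vector into lm(), one option is to wrap the call in an anonymous function so the piped value gets a name the formula can refer to. A minimal sketch, assuming R 4.1 or later (for the native pipe and the \(y) shorthand) and reusing the yay vector from the question; fit_piped and fit_direct are made-up names.

```r
set.seed(40)
yay <- c(rnorm(15), exp(rnorm(15)), runif(20, min = -3, max = 0))

# Wrap lm() in an anonymous function so the piped vector is bound to a
# name (`y`) used inside the formula, rather than being passed as the
# `formula` argument itself.
fit_piped <- yay |> (\(y) lm(y ~ 1))()

# Same model fitted without a pipe, for comparison.
fit_direct <- lm(yay ~ 1)

coef(fit_piped)
coef(fit_direct)
```

Both fits give the same intercept, which for an intercept-only model is just the mean of the vector.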
I am using tidyverse,broom, and purrr to fit a model to some data, by group. I am then trying to use this model to predict on some new data, again by group. broom's 'augment' function nicely adds not only the predictions, but also other values like the std error, etc. However, I am unable to make the 'augment' function use the new data instead of the old data. As a result, my two sets of predictions are exactly the same. The question is - how can I make 'augment' use the new data instead of the old data (which was used to fit the model) ?
Here's a reproducible example:
library(tidyverse)
library(broom)
library(purrr)
# nest the iris dataset by Species and fit a linear model
iris.nest <- nest(iris, data = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)) %>%
mutate(model = map(data, function(df) lm(Sepal.Width ~ Sepal.Length, data=df)))
# create a new dataset where the Sepal.Length is 5x as big
newdata <- iris %>%
mutate(Sepal.Length = Sepal.Length*5) %>%
nest(data = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)) %>%
rename("newdata"="data")
# join these two nested datasets together
iris.nest.new <- left_join(iris.nest, newdata)
# now form two new columns of predictions -- one using the "old" data that the model was
# initially fit on, and the second using the new data where the Sepal.Length has been increased
iris.nest.new <- iris.nest.new %>%
mutate(preds = map(model, broom::augment),
preds.new = map2(model, newdata, broom::augment)) # THIS LINE DOESN'T WORK ****
# unnest the predictions on the "old" data
preds <-select(iris.nest.new, preds) %>%
unnest(cols = c(preds))
# rename the columns prior to merging
names(preds)[3:9] <- paste0("old", names(preds)[3:9])
# now unnest the predictions on the "new" data
preds.new <-select(iris.nest.new, preds.new) %>%
unnest(cols = c(preds.new))
#... and also rename columns prior to merging
names(preds.new)[3:9] <- paste0("new", names(preds.new)[3:9])
# merge the two sets of predictions and compare
compare <- bind_cols(preds, preds.new)
# compare
select(compare, old.fitted, new.fitted) %>% View(.) # EXACTLY THE SAME!!!!
When calling broom::augment, note that the newdata= parameter is the third parameter. When you use purrr::map2, the values you iterate over are passed as the first two parameters by default, so the new data ends up in the second parameter (data=) rather than in newdata=. It doesn't matter what you've named the lists that you are passing in. You need to explicitly place the new data in the newdata= parameter.
iris.nest.new <- iris.nest.new %>%
mutate(preds = map(model, broom::augment),
preds.new = map2(model, newdata, ~broom::augment(.x, newdata=.y)))
The difference can be seen running these two commands.
broom::augment(iris.nest.new$model[[1]], iris.nest.new$newdata[[1]])
broom::augment(iris.nest.new$model[[1]], newdata=iris.nest.new$newdata[[1]])
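The same positional-versus-named point can be sketched in base R. This is an analogue rather than the broom solution itself: it uses predict() in place of broom::augment() and Map() in place of map2(), and the names groups, models, new_groups, preds_old and preds_new are made up for illustration.

```r
# One data frame and one fitted model per species.
groups <- split(iris, iris$Species)
models <- lapply(groups, function(df) lm(Sepal.Width ~ Sepal.Length, data = df))

# New data per species, with Sepal.Length five times as large.
new_groups <- lapply(groups, function(df) {
  transform(df, Sepal.Length = Sepal.Length * 5)
})

# Fitted values on the data the models were trained on.
preds_old <- lapply(models, fitted)

# Predictions on the new data: the second element is passed by name
# (`newdata = nd`), not left to positional matching.
preds_new <- Map(function(m, nd) predict(m, newdata = nd), models, new_groups)

head(cbind(old = preds_old$setosa, new = preds_new$setosa))
```

With the new data passed by name, the two sets of predictions differ as expected.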
I am trying to plot a survival curve using the survival package. The MWE code is as follows:
df %>%
filter(fac <= "Limit") %>%
survfit(Surv(tte, !is.na(event)) ~ fac, data = .) %>%
ggsurvplot(fit = .)
I get the error Error in eval(fit$call$data) : object '.' not found
When I try to break this down further by:
survfit <- df %>%
filter(fac <= "Limit") %>%
survfit(Surv(tte, !is.na(event)) ~ fac, data = .)
ggsurvplot(fit = survfit)
I get an identical error. Is anyone able to figure out how to pipe from my dataframe all the way through a survival curve? The reason I would like to do this is to streamline the filtering of my dataframe in order to produce a multitude of different survival curves without having to create many subsetted dataframes.
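Apparently, ggsurvplot expects an object of class "survfit" as its first argument but also needs the data set as an argument. Without it, ggsurvplot tries to recover the data from the call stored in the fit (fit$call$data), which here is just `.` and no longer exists outside the pipe; hence the error.
The example below is based on the first example of function survfit.formula {survival}.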
library(dplyr)
library(survival)
library(survminer)
aml %>%
survfit(Surv(time, status) ~ x, data = .) %>%
ggsurvplot(data = aml)
In the question's case this would become
df %>%
filter(fac <= "Limit") %>%
survfit(Surv(tte, !is.na(event)) ~ fac, data = .) %>%
ggsurvplot(data = filter(df, fac <= "Limit"))