I would like to subset a specific variable (not the entire dataset) in gtsummary.
In the following example, how could I subset gear to remove '5', so that only the proportions of cars with gear of '3' and '4' are shown? I would still want to include all cars in mpg, however.
library(gtsummary)
library(dplyr)
mtcars %>%
select(cyl, mpg, gear) %>%
tbl_summary(
by = cyl ### how do i say for gear, filter gear != 5 ???
)
You'll need to build two separate tables with tbl_summary() then stack them. Example below!
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.5.0'
tbl_full_data <-
mtcars %>%
select(cyl, mpg) %>%
tbl_summary(by = cyl) %>%
# removing Ns from header, since they won't be correct for gear
modify_header(all_stat_cols() ~ "**{level}**")
tbl_gear_subset <-
mtcars %>%
select(cyl, gear) %>%
dplyr::filter(gear != 5) %>%
tbl_summary(by = cyl)
# stack tables together
list(tbl_full_data, tbl_gear_subset) %>%
tbl_stack() %>%
as_kable() # convert to kable so it'll print on SO
#> i Column headers among stacked tables differ. Headers from the first table are
#> used. Use `quiet = TRUE` to supress this message.
Characteristic  4                  6                  8
mpg             26.0 (22.8, 30.4)  19.7 (18.6, 21.0)  15.2 (14.4, 16.2)
gear
    3           1 (11%)            2 (33%)            12 (100%)
    4           8 (89%)            4 (67%)            0 (0%)
Created on 2021-10-25 by the reprex package (v2.0.1)
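As a quick cross-check of the gear rows above, the same counts and column percentages can be reproduced in base R. This is only a sketch for verification, not part of the gtsummary workflow; the names gear_subset, counts and pct are made up for illustration.

```r
# Drop cars with 5 gears, as in tbl_gear_subset
gear_subset <- subset(mtcars, gear != 5)

# Cross-tabulate gear (rows) by cyl (columns)
counts <- table(gear = gear_subset$gear, cyl = gear_subset$cyl)

# Percentage of each gear value within each cyl group (columns sum to 100)
pct <- round(100 * prop.table(counts, margin = 2))

print(counts)
print(pct)
```

Because the filter is applied before tabulating, the denominators are the cars with 3 or 4 gears in each cyl group, not all cars in that group.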
Thanks for looking at this!
I want a function to build tables showing stats (such as the mean) for specific variables, segregated into groups.
Below is a start of a function that works up to a point! I use an example using the built in data for mtcars.
MeansbyGroup<-function(var){
M1<-mtcars %>% group_by(cyl)
n1=deparse(substitute(var))
r1<-transpose(M1 %>% summarise(disp=mean(var)))[2,]
}
# EXAMPLE using mtcars
df=MeansbyGroup(mtcars$disp)
df[nrow(df) + 1,] =MeansbyGroup(mtcars$drat)
df
# The above will output
V1 V2 V3
2 230.721875 230.721875 230.721875
2.1 3.596563 3.596563 3.596563
#which is not even the right means!
#below are the correct values...but I can't automate a table like I want
M1<-mtcars %>% group_by(cyl)
transpose(M1 %>% summarise(disp=mean(disp)))[2,]
transpose(M1 %>% summarise(disp=mean(drat)))[2,]
## Here is my desired output of means disaggregated into columns by the group "cyl"
## if the function worked right with the above example
V1 V2 V3
disp 105.1364 183.3143 353.1
drat 4.070909 3.585714 3.229286
As you will see, in the function I have "n1=deparse(substitute(var))" to capture the variable name which I would like to have in the first column, instead of 2 and 2.1 as shown in the example output.
I've tried a few techniques, but when I try to add n1 to the vector, it destroys the values of the means!
Also, I'd like to make the function more generalizable. For this example, I'd prefer the function call to look like MeansbyGroup(var,group,dataframe), which in the above example would be called by MeansbyGroup(disp,cyl,mtcars).
Thanks!
Here's how I would code your table outside of a function:
library(dplyr)
library(tibble)
mtcars %>%
group_by(cyl) %>%
summarize(across(c(disp, drat), mean)) %>%
column_to_rownames("cyl") %>%
t
# 4 6 8
# disp 105.136364 183.314286 353.100000
# drat 4.070909 3.585714 3.229286
Using across if you might have multiple variables is quite nice. Putting this inside a function, we will need to use deparse(substitute()) because column_to_rownames requires a string argument for the column. But for the others we can use the friendly {{:
foo = function(data, group, vars) {
grp_name = deparse(substitute(group))
data %>%
group_by({{group}}) %>%
summarize(across({{vars}}, mean)) %>%
column_to_rownames(grp_name) %>%
t
}
foo(data = mtcars, group = cyl, vars = c(disp, drat))
# 4 6 8
# disp 105.136364 183.314286 353.100000
# drat 4.070909 3.585714 3.229286
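If you would rather avoid non-standard evaluation altogether, a base R variant that takes the column names as strings might look like the sketch below. The helper name means_by_group and its argument names are hypothetical; it assumes every column in vars is numeric.

```r
# Hypothetical helper: group and vars are passed as character strings
means_by_group <- function(data, group, vars) {
  # One data frame of the requested columns per level of the grouping column
  pieces <- split(data[vars], data[[group]])
  # Column means per group, bound side by side:
  # rows = variables, columns = group levels
  do.call(cbind, lapply(pieces, colMeans))
}

res <- means_by_group(mtcars, group = "cyl", vars = c("disp", "drat"))
print(res)
```

The result is a numeric matrix with the variable names as row names, which is the layout asked for in the question.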
I'm working with a table for which I need to count the number of rows satisfying some criterion and I ended up with basically multiple repetitions of the same pipe differing only in the variable name.
Say I want to know how many cars are better than Valiant in mtcars on each of the variables there. An example of the code with two variables is below:
library(tidyverse)
reference <- mtcars %>%
slice(6)
mpg <- mtcars %>%
filter(mpg > reference$mpg) %>%
count() %>%
pull()
cyl <- mtcars %>%
filter(cyl > reference$cyl) %>%
count() %>%
pull()
tibble(mpg, cyl)
Except, suppose I need to do it for like 100 variables so there must be a more optimal way to just repeat the process.
What would be an optimal way to rewrite the code above (maybe using map(), or anything else that works nicely with pipes) so that the result is a tibble with the counts for all the variables in mtcars?
I feel the solution should be very easy but I'm stuck.
Thank you!
Or:
library(tidyverse)
map_dfc(mtcars, ~sum(.x[6] < .x))
map2_dfc(mtcars, reference, ~sum(.y < .x))
You could use summarise + across to count observations greater than a certain value in each column.
library(dplyr)
mtcars %>%
summarise(across(everything(), ~ sum(. > .[6])))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18 14 15 22 30 11 1 0 13 17 25
base solution:
# (1)
colSums(mtcars > mtcars[rep(6, nrow(mtcars)), ])
# (2)
colSums(sweep(as.matrix(mtcars), 2, mtcars[6, ], ">"))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 18 14 15 22 30 11 1 0 13 17 25
You can do it in a loop for example. Like this:
library(tidyverse)
reference <- mtcars %>%
slice(6)
# Empty list to save outcome
list_outcome <- list()
# Get the columnnames to loop over
loop_var <- colnames(reference)
for(i in loop_var){
nr <- mtcars %>%
filter(mtcars[, i] > reference[, i]) %>%
count() %>%
pull()
# Save every iteration in the loop as the ith element of the list
list_outcome[[i]] <- data.frame(Variable = i, Value = nr)
}
# combine all the data frames in the list to one final data frame
df_result <- do.call(rbind, list_outcome)
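The loop above can also be collapsed into a single vapply() call, since a data frame is a list of columns. This is a minimal base R sketch; ref_row, counts and result are illustrative names, and it assumes all columns are numeric, as they are in mtcars.

```r
# Row 6 of mtcars is the Valiant
ref_row <- 6

# For each column, count the values strictly greater than the reference row's value
counts <- vapply(mtcars, function(col) sum(col > col[ref_row]), integer(1))

# One-row data frame with one column per variable
result <- as.data.frame(as.list(counts))
print(result)
```

vapply() keeps the column names and checks that each count is a single integer, so a non-comparable column fails loudly instead of silently producing a list.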
I am creating the following model:
models <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df))
Based on the results you get from that, I am trying to extract the coefficients by using a series of map functions.
The results should look like this:
4 6 8
-5.647025 -2.780106 -2.192438
I am pulling my hair out trying to figure this out. Any help is appreciated.
You can use map_dbl with the coef function to pick out the "wt" coefficients:
coefs <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df)) %>%
map_dbl(~coef(.)[["wt"]])
It looks like
coefs <- (mtcars
%>% split(.$cyl)
%>% map(lm, formula = mpg~wt)
%>% map_dbl(~coef(.)[["wt"]])
)
should do what you want? If you want to get more information, ending with map_dfr(broom::tidy) instead of the map_dbl will be helpful (you can use the .id= argument too, although this is less useful when the list doesn't have named arguments).
This is very similar to #henryn's answer, except for the map syntax: using the named formula argument means that the data get substituted as the next argument implicitly, so you don't have to use an anonymous function, function(df) lm(mpg ~ wt, data = df) or (with R >= 4.1.0) \(df) lm(mpg ~ wt, data = df). I think the usual way of doing this, ~ lm(mpg ~ wt, data = .), might get messed up by the tilde in the formula, but I'm not sure ...
Does this work:
mtcars %>% split(.$cyl) %>% map(function(x) {
c = lm(mpg ~ wt, data = x)
c$coefficients[2]
}) %>% unlist
4.wt 6.wt 8.wt
-5.647025 -2.780106 -2.192438
1) This could be done in straight dplyr:
mtcars %>%
group_by(cyl) %>%
summarize(wt = coef(lm(mpg ~ wt))[[2]], .groups = "drop")
giving:
# A tibble: 3 x 2
cyl wt
<dbl> <dbl>
1 4 -5.65
2 6 -2.78
3 8 -2.19
2) This variation also works:
mtcars %>%
group_by(cyl) %>%
summarize(wt = cov(mpg, wt) / var(wt), .groups = "drop")
3) Also consider this -- omit the [2] to get both coefficients.
library(nlme)
coef(lmList(mpg ~ wt | cyl, mtcars))[2]
giving:
wt
4 -5.647025
6 -2.780106
8 -2.192438
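For comparison, the same slopes can be obtained without any extra packages by combining split() with vapply(). This is a sketch using only base R; by_cyl and slopes are illustrative names.

```r
# One data frame per cylinder count
by_cyl <- split(mtcars, mtcars$cyl)

# Fit mpg ~ wt within each group and keep only the slope on wt
slopes <- vapply(
  by_cyl,
  function(df) coef(lm(mpg ~ wt, data = df))[["wt"]],
  numeric(1)
)
print(slopes)
```

The result is a named numeric vector whose names are the cyl levels, matching the layout requested in the question.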
I want to create a data frame with rows that repeat.
Here is my original dataset:
> mtcars_columns_a
variables_interest data_set data_set_and_variables_interest mean
1 mpg mtcars mtcars$mpg 20.09062
2 disp mtcars mtcars$disp 230.72188
3 hp mtcars mtcars$hp 146.68750
Here is my desired dataset:
> mtcars_columns_b
variables_interest data_set data_set_and_variables_interest mean
1 mpg mtcars mtcars$mpg 20.09062
2 mpg mtcars mtcars$mpg 20.09062
3 disp mtcars mtcars$disp 230.72188
4 disp mtcars mtcars$disp 230.72188
5 hp mtcars mtcars$hp 146.68750
6 hp mtcars mtcars$hp 146.68750
I know how to do this the long way manually, but this is time consuming and rigid. Is there a quicker way to do this that is more automated and flexible?
Here is the code I used to create the dataset:
# mtcars data
## displays data
mtcars
## 3 row data set
### lists columns of interest
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: lists variables of interest
mtcars_columns_a <-
data.frame(
c(
"mpg",
"disp",
"hp"
)
)
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: adds colnames
names(mtcars_columns_a)[names(mtcars_columns_a) == 'c..mpg....disp....hp..'] <- 'variables_interest'
### adds data set info
mtcars_columns_a$data_set <-
c("mtcars")
### creates data_set_and_variables_interest column
mtcars_columns_a$data_set_and_variables_interest <-
paste(mtcars_columns_a$data_set,mtcars_columns_a$variables_interest,sep = "$")
### creates mean column
mtcars_columns_a$mean <-
c(
mean(mtcars$mpg),
mean(mtcars$disp),
mean(mtcars$hp)
)
## 6 row data set, the long way
### lists columns of interest
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: lists variables of interest
mtcars_columns_b <-
data.frame(
c(
"mpg",
"mpg",
"disp",
"disp",
"hp",
"hp"
)
)
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: adds colnames
names(mtcars_columns_b)[names(mtcars_columns_b) == 'c..mpg....mpg....disp....disp....hp....hp..'] <- 'variables_interest'
### adds data set info
mtcars_columns_b$data_set <-
c("mtcars")
### creates data_set_and_variables_interest column
mtcars_columns_b$data_set_and_variables_interest <-
paste(mtcars_columns_b$data_set,mtcars_columns_b$variables_interest,sep = "$")
### creates mean column
mtcars_columns_b$mean <-
c(
mean(mtcars$mpg),
mean(mtcars$mpg),
mean(mtcars$disp),
mean(mtcars$disp),
mean(mtcars$hp),
mean(mtcars$hp)
)
You can try rep like below
mtcars_columns_a[rep(seq(nrow(mtcars_columns_a)), each = 2),]
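To make the whole thing less manual, the 3-row data frame can itself be built from a vector of column names before the rows are repeated. The sketch below is one possible way to do that in base R; vars and n_copies are names introduced here for illustration.

```r
# Variables of interest: the only manual input
vars <- c("mpg", "disp", "hp")

# Build the 3-row data frame directly from the vector of names
mtcars_columns_a <- data.frame(
  variables_interest = vars,
  data_set = "mtcars",
  data_set_and_variables_interest = paste("mtcars", vars, sep = "$"),
  mean = unname(colMeans(mtcars[vars]))
)

# Repeat every row n_copies times, keeping the copies next to each other
n_copies <- 2
row_index <- rep(seq_len(nrow(mtcars_columns_a)), each = n_copies)
mtcars_columns_b <- mtcars_columns_a[row_index, ]

# Reset the row names (otherwise they come out as 1, 1.1, 2, 2.1, ...)
rownames(mtcars_columns_b) <- NULL
print(mtcars_columns_b)
```

Changing vars or n_copies is then enough to regenerate both data frames.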
Another option is uncount
library(dplyr)
library(tidyr)
mtcars_columns_a %>%
uncount(2)
Based on your expected output is this the sort of thing you were after?
The selection of required variables is made with the select function and the mean calculated using the summarise function following group_by variables.
The duplication of data and adding of additional variables (not really sure if these are necessary) is carried out using mutate.
You can edit variable names using the dplyr::rename function.
library(dplyr)
library(tidyr)
df <-
mtcars %>%
select(mpg, disp, hp) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(mean = mean(value))
df1 <-
bind_rows(df, df) %>%
arrange(name) %>%
mutate(dataset = "mtcars",
variable = paste(dataset, name, sep = "$"))
df1
#> # A tibble: 6 x 4
#> name mean dataset variable
#> <chr> <dbl> <chr> <chr>
#> 1 disp 231. mtcars mtcars$disp
#> 2 disp 231. mtcars mtcars$disp
#> 3 hp 147. mtcars mtcars$hp
#> 4 hp 147. mtcars mtcars$hp
#> 5 mpg 20.1 mtcars mtcars$mpg
#> 6 mpg 20.1 mtcars mtcars$mpg
Created on 2021-04-06 by the reprex package (v1.0.0)
The order of records in a data.frame object is usually not meaningful, so you could just do:
rbind(mtcars_columns_a, mtcars_columns_a)
If you need it to be in the order you showed, this is also simple:
mtcars_columns_b <- rbind(mtcars_columns_a, mtcars_columns_a)
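# sort by each variable's position in the original data frame
mtcars_columns_b[order(match(mtcars_columns_b$variables_interest, mtcars_columns_a$variables_interest)), ]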
It's my first try with lists. I try to clean up my code and put some code in functions.
One idea is to subset a big dataframe in multiple subsets with a function. So I can call the subset with a function when needed.
With the mtcars dataframe I would like to explain what I am trying to do:
add an id to mtcars
create a function with one argument (mtcars) that outputs a list of subset dataframes (mtcars1, mtcars2, mtcars3) -> learned here: How to assign from a function which returns more than one value? answer by Federico Giorgi
I manage to create the list, but I don't know how to get the 3 subset dataframes (mtcars1, mtcars2, mtcars3) out of it as objects in the global environment. How can I get these 3 dataframe objects from the list returned by my function? Thanks!
My Code:
library(dplyr)
# add id to mtcars
mtcars <- mtcars %>%
mutate(id = row_number())
# create function to subset in 3 dataframes
my_func_cars <- function(input){
# first subset
mtcars1 <- mtcars %>%
select(id, mpg, cyl, disp)
# second subset
mtcars2 <- mtcars %>%
select(id, hp, drat, wt, qsec)
# third subset
mtcars3 <- mtcars %>%
select(id, vs, am, gear, carb)
output <- list(mtcars1, mtcars2, mtcars3)
return(output)
}
output<-my_func_cars(mtcars)
for (i in output) {
print(i)
}
It may be better to output a named list
library(dplyr)
library(stringr)
my_func_cars <- function(input){
nm1 <- deparse(substitute(input))
# first subset
obj1 <- input %>%
select(id, mpg, cyl, disp)
# second subset
obj2 <- input %>%
select(id, hp, drat, wt, qsec)
# third subset
obj3 <- input %>%
select(id, vs, am, gear, carb)
dplyr::lst(!! str_c(nm1, 1) := obj1,
!! str_c(nm1, 2) := obj2,
!! str_c(nm1, 3) := obj3)
}
and then we use list2env to create objects in the global env
mtcars <- mtcars %>%
mutate(id = row_number())
list2env(my_func_cars(mtcars), .GlobalEnv)
-check the objects
head(mtcars1, 2)
# id mpg cyl disp
#1 1 21 6 160
#2 2 21 6 160
head(mtcars2, 2)
# id hp drat wt qsec
#1 1 110 3.9 2.620 16.46
#2 2 110 3.9 2.875 17.02
head(mtcars3, 2)
# id vs am gear carb
#1 1 0 1 4 4
#2 2 0 1 4 4
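If you don't strictly need separate objects in the global environment, you can also keep the subsets in the named list and pull them out by name when needed. A minimal base R sketch follows; the function name split_columns and the element names engine, performance and layout are made up for illustration.

```r
# Hypothetical helper returning a named list of column subsets
split_columns <- function(input) {
  list(
    engine      = input[c("id", "mpg", "cyl", "disp")],
    performance = input[c("id", "hp", "drat", "wt", "qsec")],
    layout      = input[c("id", "vs", "am", "gear", "carb")]
  )
}

# Add an id column without modifying mtcars itself
cars_with_id <- cbind(id = seq_len(nrow(mtcars)), mtcars)
subsets <- split_columns(cars_with_id)

# Access the pieces by name with $ or [[ ]]
print(names(subsets))
print(head(subsets$engine, 2))
print(head(subsets[["layout"]], 2))
```

Keeping everything in one list avoids cluttering the workspace and makes it easy to loop over the subsets with lapply().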