I want to create a data frame with rows that repeat.
Here is my original dataset:
> mtcars_columns_a
variables_interest data_set data_set_and_variables_interest mean
1 mpg mtcars mtcars$mpg 20.09062
2 disp mtcars mtcars$disp 230.72188
3 hp mtcars mtcars$hp 146.68750
Here is my desired dataset:
> mtcars_columns_b
variables_interest data_set data_set_and_variables_interest mean
1 mpg mtcars mtcars$mpg 20.09062
2 mpg mtcars mtcars$mpg 20.09062
3 disp mtcars mtcars$disp 230.72188
4 disp mtcars mtcars$disp 230.72188
5 hp mtcars mtcars$hp 146.68750
6 hp mtcars mtcars$hp 146.68750
I know how to do this the long way manually, but this is time consuming and rigid. Is there a quicker way to do this that is more automated and flexible?
Here is the code I used to create the dataset:
# mtcars data
## displays data
mtcars
## 3 row data set
### lists columns of interest
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: lists variables of interest
mtcars_columns_a <-
data.frame(
c(
"mpg",
"disp",
"hp"
)
)
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: adds colnames
names(mtcars_columns_a)[names(mtcars_columns_a) == 'c..mpg....disp....hp..'] <- 'variables_interest'
### adds data set info
mtcars_columns_a$data_set <-
c("mtcars")
### creates data_set_and_variables_interest column
mtcars_columns_a$data_set_and_variables_interest <-
paste(mtcars_columns_a$data_set,mtcars_columns_a$variables_interest,sep = "$")
### creates mean column
mtcars_columns_a$mean <-
c(
mean(mtcars$mpg),
mean(mtcars$disp),
mean(mtcars$hp)
)
## 6 row data set., the long way
### lists columns of interest
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: lists variables of interest
mtcars_columns_b <-
data.frame(
c(
"mpg",
"mpg",
"disp",
"disp",
"hp",
"hp"
)
)
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: adds colnames
names(mtcars_columns_b)[names(mtcars_columns_b) == 'c..mpg....mpg....disp....disp....hp....hp..'] <- 'variables_interest'
### adds data set info
mtcars_columns_b$data_set <-
c("mtcars")
### creates data_set_and_variables_interest column
mtcars_columns_b$data_set_and_variables_interest <-
paste(mtcars_columns_b$data_set,mtcars_columns_b$variables_interest,sep = "$")
### creates mean column
mtcars_columns_b$mean <-
c(
mean(mtcars$mpg),
mean(mtcars$mpg),
mean(mtcars$disp),
mean(mtcars$disp),
mean(mtcars$hp),
mean(mtcars$hp)
)
You can try rep like below
mtcars_columns_a[rep(seq(nrow(mtcars_columns_a)), each = 2),]
Another option is uncount
library(dplyr)
library(tidyr)
mtcars_columns_a %>%
uncount(2)
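Both approaches generalise to any number of copies. As a minimal base R sketch (the two-column `df` and the `n_copies` variable are stand-ins invented for this illustration, not part of the question's code), the repeat count can be kept in one place and the duplicated row names reset afterwards:

```r
# Small stand-in for the question's three-row table (illustrative only)
df <- data.frame(
  variables_interest = c("mpg", "disp", "hp"),
  mean = c(mean(mtcars$mpg), mean(mtcars$disp), mean(mtcars$hp)),
  stringsAsFactors = FALSE
)

n_copies <- 2  # hypothetical parameter: how many times each row should appear

# Repeat each row index n_copies times, keeping copies of a row adjacent
repeated <- df[rep(seq_len(nrow(df)), each = n_copies), ]

# Indexing leaves row names such as "1", "1.1", "2", ...; reset them to 1..n
rownames(repeated) <- NULL

repeated$variables_interest
# "mpg" "mpg" "disp" "disp" "hp" "hp"
```

Using `each =` keeps the copies next to each other, as in the desired output; `times =` would instead repeat the whole table block by block.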
Based on your expected output, is this the sort of thing you were after?
The required variables are selected with select(), reshaped to long format with pivot_longer(), and the means are calculated with summarise() after group_by().
The rows are duplicated with bind_rows(), and the additional variables (I'm not sure these are necessary) are added with mutate().
You can edit variable names using the dplyr::rename function.
library(dplyr)
library(tidyr)
df <-
mtcars %>%
select(mpg, disp, hp) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(mean = mean(value))
df1 <-
bind_rows(df, df) %>%
arrange(name) %>%
mutate(dataset = "mtcars",
variable = paste(dataset, name, sep = "$"))
df1
#> # A tibble: 6 x 4
#> name mean dataset variable
#> <chr> <dbl> <chr> <chr>
#> 1 disp 231. mtcars mtcars$disp
#> 2 disp 231. mtcars mtcars$disp
#> 3 hp 147. mtcars mtcars$hp
#> 4 hp 147. mtcars mtcars$hp
#> 5 mpg 20.1 mtcars mtcars$mpg
#> 6 mpg 20.1 mtcars mtcars$mpg
Created on 2021-04-06 by the reprex package (v1.0.0)
The order of records in a data.frame object is usually not meaningful, so you could just do:
rbind(mtcars_columns_a, mtcars_columns_a)
If you need it to be in the order you showed, this is also simple:
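mtcars_columns_b <- rbind(mtcars_columns_a, mtcars_columns_a)
# order rows by each variable's position in the original table (mpg, disp, hp)
mtcars_columns_b[order(match(mtcars_columns_b$variables_interest, mtcars_columns_a$variables_interest)), ]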
Thanks for looking at this!
I want a function to build tables showing stats (such as the mean) for specific variables segregated into groups.
Below is a start of a function that works up to a point! I use an example using the built in data for mtcars.
MeansbyGroup<-function(var){
M1<-mtcars %>% group_by(cyl)
n1=deparse(substitute(var))
r1<-transpose(M1 %>% summarise(disp=mean(var)))[2,]
}
# EXAMPLE using mtcars
df=MeansbyGroup(mtcars$disp)
df[nrow(df) + 1,] =MeansbyGroup(mtcars$drat)
df
# The above will output
V1 V2 V3
2 230.721875 230.721875 230.721875
2.1 3.596563 3.596563 3.596563
#which is not even the right means!
#below are the correct values...but I can't automate a table like I want
M1<-mtcars %>% group_by(cyl)
transpose(M1 %>% summarise(disp=mean(disp)))[2,]
transpose(M1 %>% summarise(disp=mean(drat)))[2,]
## Here is my desired output of means disaggregated into columns by the group "cyl"
## if the function worked right with the above example
V1 V2 V3
disp 105.1364 183.3143 353.1
drat 4.070909 3.585714 3.229286
As you will see, in the function I have "n1=deparse(substitute(var))" to capture the variable name which I would like to have in the first column, instead of 2 and 2.1 as shown in the example output.
I've tried a few techniques, but when I try to add n1 to the vector, it destroys the values of the means!
Also, I'd like to make the function more generalizable. For this example, I'd prefer the function call to look like MeansbyGroup(var,group,dataframe), which in the above example would be called by MeansbyGroup(disp,cyl,mtcars).
Thanks!
Here's how I would code your table outside of a function:
library(dplyr)
library(tibble)
mtcars %>%
group_by(cyl) %>%
summarize(across(c(disp, drat), mean)) %>%
column_to_rownames("cyl") %>%
t
# 4 6 8
# disp 105.136364 183.314286 353.100000
# drat 4.070909 3.585714 3.229286
across is convenient when you might have multiple variables. To put this inside a function, we need deparse(substitute()) for the grouping column, because column_to_rownames requires a string argument. For the other arguments we can use the friendly {{:
foo = function(data, group, vars) {
grp_name = deparse(substitute(group))
data %>%
group_by({{group}}) %>%
summarize(across({{vars}}, mean)) %>%
column_to_rownames(grp_name) %>%
t
}
foo(data = mtcars, group = cyl, vars = c(disp, drat))
# 4 6 8
# disp 105.136364 183.314286 353.100000
# drat 4.070909 3.585714 3.229286
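For comparison, the same table can be sketched in base R without tidy evaluation. This is only an illustration under stated assumptions: the helper name `means_by_group` is made up here, and unlike `foo` above it takes the variable and group names as character strings rather than unquoted names.

```r
# Hypothetical helper: group means for several variables, one row per variable.
# `vars` and `group` are character strings naming columns of `data`.
means_by_group <- function(data, vars, group) {
  # tapply() gives the mean of one variable within each group level;
  # sapply() collects one column per variable, and t() puts variables in rows
  t(sapply(vars, function(v) tapply(data[[v]], data[[group]], mean)))
}

res <- means_by_group(mtcars, c("disp", "drat"), "cyl")
res
```

The row names of `res` are the variable names and the column names are the levels of the grouping column, which is the layout asked for in the question.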
I'm working with a table for which I need to count the number of rows satisfying some criterion and I ended up with basically multiple repetitions of the same pipe differing only in the variable name.
Say I want to know how many cars are better than Valiant in mtcars on each of the variables there. An example of the code with two variables is below:
library(tidyverse)
reference <- mtcars %>%
slice(6)
mpg <- mtcars %>%
filter(mpg > reference$mpg) %>%
count() %>%
pull()
cyl <- mtcars %>%
filter(cyl > reference$cyl) %>%
count() %>%
pull()
tibble(mpg, cyl)
However, suppose I need to do this for something like 100 variables; there must be a better way than repeating the same pipe for each one.
How could I rewrite the code above more concisely (maybe using map(), or anything else that works nicely with pipes) so that the result is a tibble with the counts for all the variables in mtcars?
I feel the solution should be very easy but I'm stuck.
Thank you!
Or:
library(tidyverse)
map_dfc(mtcars, ~sum(.x[6] < .x))
map2_dfc(mtcars, reference, ~sum(.y < .x))
You could use summarise + across to count observations greater than a certain value in each column.
library(dplyr)
mtcars %>%
summarise(across(everything(), ~ sum(. > .[6])))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18 14 15 22 30 11 1 0 13 17 25
base solution:
# (1)
colSums(mtcars > mtcars[rep(6, nrow(mtcars)), ])
# (2)
colSums(sweep(as.matrix(mtcars), 2, mtcars[6, ], ">"))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 18 14 15 22 30 11 1 0 13 17 25
You can do it in a loop for example. Like this:
library(tidyverse)
reference <- mtcars %>%
slice(6)
# Empty list to save outcome
list_outcome <- list()
# Get the columnnames to loop over
loop_var <- colnames(reference)
for(i in loop_var){
nr <- mtcars %>%
filter(mtcars[, i] > reference[, i]) %>%
count() %>%
pull()
# Save every iteration in the loop as the ith element of the list
list_outcome[[i]] <- data.frame(Variable = i, Value = nr)
}
# combine all the data frames in the list to one final data frame
df_result <- do.call(rbind, list_outcome)
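The loop above can also be condensed with `vapply`, which returns a named vector directly. This is a sketch in base R rather than a replacement for the answer's code; the name `counts` is chosen here for illustration.

```r
# Row 6 of mtcars is the Valiant, used as the reference car
reference <- mtcars[6, ]

# For each column name, count the rows whose value exceeds the reference value.
# integer(1) tells vapply that every result is a single integer.
counts <- vapply(
  names(mtcars),
  function(v) sum(mtcars[[v]] > reference[[v]]),
  integer(1)
)

counts

# One-row data frame, similar in shape to the tibble asked for in the question
as.data.frame(as.list(counts))
```

Because `vapply` checks the type and length of every result, a column that cannot be compared this way fails loudly instead of silently producing a list.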
I would like to subset a specific variable (not the entire dataset) in gtsummary.
In the following example, how could I subset gear to remove '5', so that only the proportions of cars with a gear of '3' and '4' are shown? I would still want to include all cars in the mpg summary, however.
library(gtsummary) # tbl_summary() comes from gtsummary, not gt
library(dplyr)
mtcars %>%
select(cyl, mpg, gear) %>%
tbl_summary(
by = cyl ### how do i say for gear, filter gear != 5 ???
)
You'll need to build two separate tables with tbl_summary() then stack them. Example below!
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.5.0'
tbl_full_data <-
mtcars %>%
select(cyl, mpg) %>%
tbl_summary(by = cyl) %>%
# removing Ns from header, since they won't be correct for gear
modify_header(all_stat_cols() ~ "**{level}**")
tbl_gear_subset <-
mtcars %>%
select(cyl, gear) %>%
dplyr::filter(gear != 5) %>%
tbl_summary(by = cyl)
# stack tables together
list(tbl_full_data, tbl_gear_subset) %>%
tbl_stack() %>%
as_kable() # convert to kable so it'll print on SO
#> i Column headers among stacked tables differ. Headers from the first table are
#> used. Use `quiet = TRUE` to supress this message.
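| Characteristic | 4                 | 6                 | 8                 |
|:---------------|:-----------------:|:-----------------:|:-----------------:|
| mpg            | 26.0 (22.8, 30.4) | 19.7 (18.6, 21.0) | 15.2 (14.4, 16.2) |
| gear           |                   |                   |                   |
| 3              | 1 (11%)           | 2 (33%)           | 12 (100%)         |
| 4              | 8 (89%)           | 4 (67%)           | 0 (0%)            |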
Created on 2021-10-25 by the reprex package (v2.0.1)
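As a sanity check on the gear rows of the stacked table, the same counts and column percentages can be reproduced in base R. This is only a cross-check sketch, not part of the gtsummary workflow; the object names `sub`, `counts` and `pct` are invented for this example.

```r
# Drop the cars with 5 gears, as in tbl_gear_subset above
sub <- subset(mtcars, gear != 5)

# Cross-tabulate gear (rows) by cyl (columns)
counts <- table(gear = sub$gear, cyl = sub$cyl)
counts

# Percentages within each cyl column, rounded to whole numbers
pct <- round(100 * prop.table(counts, margin = 2))
pct
```

The percentages are computed within each cyl group after filtering, which is why the header Ns from the full-data table would be misleading for the gear rows.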
It's my first try with lists. I try to clean up my code and put some code in functions.
One idea is to subset a big dataframe in multiple subsets with a function. So I can call the subset with a function when needed.
With the mtcars dataframe I would like to explain what I am trying to do:
add an id to mtcars
create a function with one argument (mtcars) that outputs a list of subset dataframes (mtcars1, mtcars2, mtcars3) -> learned here: How to assign from a function which returns more than one value? answer by Federico Giorgi
I have managed to create the list, but I don't know how to make the 3 subset dataframe objects (mtcars1, mtcars2, mtcars3) appear in the global environment. How can I get these 3 dataframe objects out of the list returned by my function? Thanks!
My Code:
library(dplyr)
# add id to mtcars
mtcars <- mtcars %>%
mutate(id = row_number())
# create function to subset in 3 dataframes
my_func_cars <- function(input){
# first subset
mtcars1 <- mtcars %>%
select(id, mpg, cyl, disp)
# second subset
mtcars2 <- mtcars %>%
select(id, hp, drat, wt, qsec)
# third subset
mtcars3 <- mtcars %>%
select(id, vs, am, gear, carb)
output <- list(mtcars1, mtcars2, mtcars3)
return(output)
}
output<-my_func_cars(mtcars)
for (i in output) {
print(i)
}
It may be better to output a named list
library(dplyr)
library(stringr)
my_func_cars <- function(input){
nm1 <- deparse(substitute(input))
# first subset
obj1 <- input %>%
select(id, mpg, cyl, disp)
# second subset
obj2 <- input %>%
select(id, hp, drat, wt, qsec)
# third subset
obj3 <- input %>%
select(id, vs, am, gear, carb)
dplyr::lst(!! str_c(nm1, 1) := obj1,
!! str_c(nm1, 2) := obj2,
!! str_c(nm1, 3) := obj3)
}
Then we use list2env to create the objects in the global environment:
mtcars <- mtcars %>%
mutate(id = row_number())
list2env(my_func_cars(mtcars), .GlobalEnv)
Check the objects:
head(mtcars1, 2)
# id mpg cyl disp
#1 1 21 6 160
#2 2 21 6 160
head(mtcars2, 2)
# id hp drat wt qsec
#1 1 110 3.9 2.620 16.46
#2 2 110 3.9 2.875 17.02
head(mtcars3, 2)
# id vs am gear carb
#1 1 0 1 4 4
#2 2 0 1 4 4
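If the subsets do not have to exist as separate global objects, they can also be used straight from the named list. The following is a base R sketch under that assumption; the helper `split_cars` and the element names `part1` to `part3` are hypothetical and chosen only for this illustration.

```r
# Hypothetical helper: returns the three column subsets as a named list
split_cars <- function(input) {
  list(
    part1 = input[, c("id", "mpg", "cyl", "disp")],
    part2 = input[, c("id", "hp", "drat", "wt", "qsec")],
    part3 = input[, c("id", "vs", "am", "gear", "carb")]
  )
}

# Add the id column to a copy, leaving the built-in mtcars untouched
cars_with_id <- transform(mtcars, id = seq_len(nrow(mtcars)))

parts <- split_cars(cars_with_id)

names(parts)          # the names used to look up each subset
head(parts$part1, 2)  # access one subset by name
lapply(parts, dim)    # apply the same operation to every subset
```

Keeping the data frames in a list avoids filling the global environment and makes it easy to loop over all subsets with `lapply`.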
Problem
I would like to know how to pass a list of variable names to a purrr::map2 function for the purpose of iterating over a separate data frame.
The input_table$key variable below contains mpg and disp from the mtcars dataset. I think the names of the variables are being passed as character strings rather than variable names. The question is how I can change that so that my function recognises that they are variable names(?).
In this example I am trying to sum all of the values in the mtcars variables mpg and disp that fall below a set of numeric thresholds. Those variables from mtcars and the relevant thresholds are contained in input_table (below).
Ideal result
percentile key value sum_y
<fct> <chr> <dbl> <dbl>
1 0.5 mpg 19.2 266.5
2 0.9 mpg 30.1 515.8
3 0.99 mpg 33.4 609.0
4 1 mpg 33.9 642.9
5 ... ... ... ...
Attempt
library(dplyr)
library(purrr)
library(tidyr)
# Arrange a generic example
# Replicating my data structure
input_table <- mtcars %>%
as_tibble() %>%
select(mpg, disp) %>%
map_df(quantile, probs = c(0.5, 0.90, 0.99, 1)) %>%
mutate(
percentile = factor(c(0.5, 0.90, 0.99, 1))
) %>%
select(
percentile, mpg, disp
) %>%
gather(key, value, -percentile)
# Defining the function
test_func <- function(label_desc, threshold) {
mtcars %>%
select({{label_desc}}) %>%
filter({{label_desc}} <= {{threshold}}) %>%
summarise(
sum_y = sum(as.numeric({{label_desc}}), na.rm = T)
)
}
# Demo'ing that it works for a single variable and threshold value
test_func(label_desc = mpg, threshold = 19.2)
# This is where I am having trouble
# Trying to iterate over multiple (mpg, disp) variables
map2(input_table$key, input_table$value, ~test_func(label_desc = .x, threshold = .y))
The issue is that curly-curly ({{ }}) is meant for unquoted variable names, as in your first call. In the map2 call you are passing the names as character strings, which the curly-curly operator does not handle. A simple fix is to use the _at variants of the dplyr verbs, which accept column names as strings (they are superseded as of dplyr 1.0.0 but still work).
test_func <- function(label_desc, threshold) {
mtcars %>%
filter_at(label_desc, any_vars(. <= threshold)) %>%
summarise_at(label_desc, sum)
}
purrr::map2(input_table$key, input_table$value, test_func)
#[[1]]
# mpg
#1 266.5
#[[2]]
# mpg
#1 515.8
#[[3]]
# mpg
#1 609
#[[4]]
# mpg
#1 642.9
#[[5]]
# disp
#1 1956.7
#.....
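An alternative to the `_at` verbs is the `.data` pronoun, which looks up a column from a string inside ordinary dplyr verbs. The sketch below assumes dplyr 1.0.0 or later; the small `thresholds` table is a made-up stand-in for `input_table`, using two mpg values from the question's ideal result.

```r
library(dplyr)

# Same idea as test_func, but label_desc is a character string
test_func <- function(label_desc, threshold) {
  mtcars %>%
    filter(.data[[label_desc]] <= threshold) %>%
    summarise(sum_y = sum(.data[[label_desc]], na.rm = TRUE))
}

# Stand-in for input_table (illustrative values only)
thresholds <- data.frame(
  key = c("mpg", "mpg"),
  value = c(19.2, 33.9),
  stringsAsFactors = FALSE
)

# Apply the function to each key/value pair and attach the result as a column
thresholds$sum_y <- mapply(
  function(k, v) test_func(k, v)$sum_y,
  thresholds$key, thresholds$value,
  USE.NAMES = FALSE
)

thresholds
```

Storing the result as a `sum_y` column next to `key` and `value` gives a table in the shape of the ideal result, rather than a list of one-row data frames.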