Combine two dfs with different column names and then melt - r

I want to combine two data frames and melt them into a common set of columns. The data are:
treatment<-c('control','noise')
weight<-c(0.01872556,0.01575400)
sd<-c(0.008540041,0.007460524)
df1<-data.frame(treatment,weight,sd)
treatment2<-c('control','noise')
area<-c(0.79809444,0.68014667)
sd2<-c(0.337949414,0.294295847)
df2<-data.frame(treatment2,area,sd2)
And I wanted to combine them and create a data frame which should look like this:
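| treatment | var    | sum        | sd          |
|-----------|--------|------------|-------------|
| control   | area   | 0.79809444 | 0.337949414 |
| noise     | area   | 0.68014667 | 0.294295847 |
| control   | weight | 0.01872556 | 0.008540041 |
| noise     | weight | 0.01575400 | 0.007460524 |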
I tried this various ways, searched for solutions, and ended up exporting each data frame to a CSV file, combining them in Excel, and re-importing the result into R for analysis.
Is there a simpler solution?

You could use
library(tidyr)
library(dplyr)
df2 %>%
rename(sd = sd2, treatment = treatment2) %>%
pivot_longer(area, names_to = "var", values_to = "sum") %>%
bind_rows(pivot_longer(df1, weight, names_to = "var", values_to = "sum")) %>%
select(treatment, var, sum, sd)
to get
# A tibble: 4 x 4
treatment var sum sd
<chr> <chr> <dbl> <dbl>
1 control area 0.798 0.338
2 noise area 0.680 0.294
3 control weight 0.0187 0.00854
4 noise weight 0.0158 0.00746

You could do this using functions from {purrr} and {dplyr}:
map(list(df2, df1), ~ mutate(., var = colnames(.)[2])) %>%
map(~ set_names(., nm = c("treatment", "sum", "sd", "var"))) %>%
bind_rows() %>%
relocate("var", .before = "sum")
Output:
treatment var sum sd
1 control area 0.79809444 0.337949414
2 noise area 0.68014667 0.294295847
3 control weight 0.01872556 0.008540041
4 noise weight 0.01575400 0.007460524

Here is a dplyr solution. The strategy is to first process the two data frames into the desired format and then stack them:
df1 <- df1 %>%
dplyr::mutate(var = "weight") %>%
dplyr::rename(sum = weight)
df2 <- df2 %>%
dplyr::mutate(var = "area") %>%
dplyr::rename(treatment = treatment2,
sd = sd2,
sum = area)
dplyr::bind_rows(df1, df2)
# output
treatment sum sd var
1 control 0.01872556 0.008540041 weight
2 noise 0.01575400 0.007460524 weight
3 control 0.79809444 0.337949414 area
4 noise 0.68014667 0.294295847 area
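For comparison, the same rename-then-stack strategy can be sketched in base R without loading any packages. This is only a sketch: the helper name `to_long` is made up here for illustration and does not come from any of the answers above, and it assumes each data frame has exactly three columns in the order treatment, value, sd.

```r
# Rebuild the question's data so the example is self-contained.
df1 <- data.frame(treatment = c("control", "noise"),
                  weight = c(0.01872556, 0.01575400),
                  sd = c(0.008540041, 0.007460524))
df2 <- data.frame(treatment2 = c("control", "noise"),
                  area = c(0.79809444, 0.68014667),
                  sd2 = c(0.337949414, 0.294295847))

# Hypothetical helper: assumes columns are (treatment, value, sd) in that order.
to_long <- function(d) {
  data.frame(treatment = d[[1]],
             var = names(d)[2],  # the value column's name becomes the label
             sum = d[[2]],
             sd = d[[3]])
}

# Stack area first, then weight, to match the requested layout.
combined <- rbind(to_long(df2), to_long(df1))
combined
```

Because the helper relies on column position rather than column names, it would silently mislabel data whose columns are in a different order, so the dplyr versions with explicit renames are safer for anything beyond a quick one-off.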

Related

dplyr summarise across output as rows?

I would like to generate overview tables for the same statistics (e.g., n, mean, sd) across multiple variables.
I started by combining dplyr's summarise and across functions. See the following example:
df <- data.frame(
var1 = 1:10,
var2 = 11:20
)
VarSum <- df %>% summarise(across(c(var1, var2), list(n = length, mean = mean, sd = sd)))
The output is of course given as one row (1x6) with three columns for each variable in this example. What I would like to achieve is to get the output row-wise for each variable (2x3). Is that even possible with my approach? I would appreciate any suggestions.
You can pivot first:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(everything()) %>%
summarise(across(value, list(n = length, mean = mean, sd = sd)), .by = name)
name value_n value_mean value_sd
<chr> <int> <dbl> <dbl>
1 var1 10 5.5 3.03
2 var2 10 15.5 3.03
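If you would rather keep the original summarise(across()) call, a possible alternative is to reshape afterwards. The sketch below uses the `.value` sentinel of pivot_longer() to split names such as var1_mean into a row label and a column name. It assumes the variable names themselves contain no underscore; otherwise names_sep would have to be swapped for a suitable names_pattern.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  var1 = 1:10,
  var2 = 11:20
)

VarSum_long <- df %>%
  # one wide row: var1_n, var1_mean, var1_sd, var2_n, ...
  summarise(across(c(var1, var2), list(n = length, mean = mean, sd = sd))) %>%
  # text before "_" becomes the row label, text after it the column name
  pivot_longer(everything(),
               names_to = c("name", ".value"),
               names_sep = "_")

VarSum_long
```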

Standard deviation of average events per ID in R

Background
I've got this dataset d:
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
stringsAsFactors=FALSE)
It's got 2 people (IDs) in it, and they each have some events.
The problem
I'm trying to get an average number (count) of events per person, along with a standard deviation for that average, all in one result (it can be a dataframe or not, doesn't matter).
In other words I'm looking for something like this:
| Mean | SD |
|------|------|
| 4.00 | 2.83 |
What I've tried
I'm not far off, I don't think -- it's just that I've got 2 separate pieces of code doing these calculations. Here's the mean:
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(ratio = mean(event))
# A tibble: 1 x 1
ratio
<dbl>
1 4
And here's the SD:
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(sd = sd(event))
# A tibble: 1 x 1
sd
<dbl>
1 2.83
But when I try to pipe them together like so...
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(ratio = mean(event)) %>%
summarise(sd = sd(event))
... I get an error:
Error in `h()`:
! Problem with `summarise()` column `sd`.
i `sd = sd(event)`.
x object 'event' not found
Any insight?
You have to put the last two calls to summarise() in the same call. The only remaining columns after summarise() will be those you named and the grouping columns, so after your second summarise, the event column no longer exists.
library(dplyr)
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
stringsAsFactors=FALSE)
d %>%
group_by(ID) %>%
# the next summarise will be within ID
summarise(event = length(event)) %>%
# this summarise is overall
summarise(sd = sd(event),
ratio = mean(event))
#> # A tibble: 1 × 2
#> sd ratio
#> <dbl> <dbl>
#> 1 2.83 4
The code is a bit confusing because you are renaming the event variable, and doing the first summarise() within groups and the second without grouping. This code would be a little easier to read and get the same result:
d %>%
count(ID) %>%
summarise(sd = sd(n),
ratio = mean(n))
Created on 2022-05-25 by the reprex package (v2.0.1)
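If dplyr is not a requirement, the same two numbers can be sketched in base R by counting events per ID with table() first. The names `n_events` and `res` are only illustrative.

```r
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
                event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
                stringsAsFactors = FALSE)

# Number of events per person, as a plain vector (one entry per ID).
n_events <- as.vector(table(d$ID))

# Mean and SD of those per-person counts in one result.
res <- data.frame(Mean = mean(n_events), SD = sd(n_events))
res
```

As with count(), the key point is that both statistics are computed from the same vector of per-ID counts, so neither step can drop a column the other still needs.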

Creating a versatile descriptives table using dplyr

I'm trying to create a simple code that I can reuse over and over (with minimal adjustments) to be able to print a table of summary statistics.
A reproducible example creates a table with M and SD for the variable V1 broken down by group:
data <- as.data.frame(cbind(1:100, sample(1:2), rnorm(100), rnorm(100)))
names(data) <- c("ID", "Group", "V1", "V2")
library(dplyr)
descriptives <- data %>% group_by(Group) %>%
summarize(
Mean = mean(V1)
, SD = sd(V1)
)
descriptives
I'd like to modify this function so that it will compute M and SD for all variables in my dataset.
I'd like to be able to replace the call to V1 with something like vars which is just a list of all the variables in my dataset; in this example, V1 and V2. But usually I have like 100 variables.
The reason I'd like it to work this way is so that I can do something very easy like:
vars <- names(data[3:4])
and very quickly select the columns for which I want summary statistics.
A few things for my wishlist:
M and SD for a given variable should be next to each other, and I'd like to add a column above each pair with the variable name.
I'd like the end product to look something like
I'd like to use dplyr, but I'm open to other options.
I'd also like to learn how I could switch the rows and columns of the table so that the variables are on separate rows and each group has a column (or two columns, one for M and one for SD). Like this:
Close, but no cigar:
The newish summarise(across()) kind of helps:
dplyr::group_by(data, Group) %>%
dplyr::summarise(dplyr::across(.cols = c(V1, V2), .fns = c(mean, sd)))
But I don't know how to scale it without making multiple tables and using rbind() to stack them.
I really like the format of table1() (vignette), but from what I can tell I can only stratify the column M/SDs by another variable. I really wish I could just add additional grouping variables on.
There is a limitation in the ordering of the summarised columns, but with select() we can reorder them based on the substring of each column name that precedes the underscore:
library(dplyr)
library(stringr)
data %>%
group_by(Group) %>%
summarise_at(vars(vars), list(Mean = mean, SD = sd)) %>%
select(Group, order(str_remove(names(.)[-1], "_.*")) + 1)
# A tibble: 2 x 5
# Group V1_Mean V1_SD V2_Mean V2_SD
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.165 0.915 0.146 1.16
#2 2 0.308 1.31 -0.00711 0.854
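To see what the order(...) + 1 expression in the select() call is doing, here is a small base R sketch on a hand-written vector of column names. The starting order (all means first, then all SDs) is an assumption about what the summarise step returns; sub() stands in for str_remove().

```r
# Assumed column order coming out of the summarise step.
nms <- c("Group", "V1_Mean", "V2_Mean", "V1_SD", "V2_SD")

# Drop "Group", then strip everything from the underscore onward.
stems <- sub("_.*", "", nms[-1])   # "V1" "V2" "V1" "V2"

# order() sorts by variable name and keeps ties in their original order;
# + 1 shifts the positions back to account for the dropped "Group" column.
idx <- order(stems) + 1            # 2 4 3 5

reordered <- nms[c(1, idx)]
reordered
```

The result keeps Group first and places each variable's Mean and SD side by side.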
I had a similar question here, and got some really useful and simple answers using tidyverse. In the end a really robust approach emerged, which I wrapped in a function and use regularly.
library(tidyverse)
baseline_table <- function(data, variables, grouping_var) {
data %>%
group_by(!!sym(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -grouping_var,
names_to = "variable"
) %>% pivot_wider(
names_from = grouping_var
)
}
It takes three arguments, data, variables and grouping_var, all of which are rather self-explanatory.
Here is a test using mtcars with a 2 level and 3 level grouping var.
baseline_table(
data = mtcars,
variables = c("mpg", "hp"),
grouping_var = "am"
)
# A tibble: 2 x 3
variable `0` `1`
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
baseline_table(
data = mtcars,
variables = c("mpg", "hp"),
grouping_var = "cyl"
)
# A tibble: 2 x 4
variable `4` `6` `8`
<chr> <chr> <chr> <chr>
1 mpg 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
2 hp 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
It works out of the box and is applicable to any data. Below I used iris:
baseline_table(
data = iris,
variables = c("Sepal.Length", "Sepal.Width"),
grouping_var = "Species"
)
# A tibble: 2 x 4
variable setosa versicolor virginica
<chr> <chr> <chr> <chr>
1 Sepal.Length 5.01(±0.35) 5.94(±0.52) 6.59(±0.64)
2 Sepal.Width 3.43(±0.38) 2.77(±0.31) 2.97(±0.32)
Of course, some grouping variables are not directly suited for this, cyl being one of them, although it still serves as a good example. You can recode your grouping variables accordingly:
baseline_table(
data = mtcars %>% mutate(cyl = paste(cyl, "Cylinders", sep = " ")),
variables = c("mpg", "hp"),
grouping_var = "cyl"
)
# A tibble: 2 x 4
variable `4 Cylinders` `6 Cylinders` `8 Cylinders`
<chr> <chr> <chr> <chr>
1 mpg 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
2 hp 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
You can also modify the function to include a descriptive header row for the values:
baseline_table <- function(data, variables, grouping_var) {
# Generate the table;
tmpTable <- data %>%
group_by(!!sym(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -grouping_var,
names_to = "variable"
) %>% pivot_wider(
names_from = grouping_var
)
# Generate Descriptives dynamically
tmpDesc <- tmpTable[1,] %>% mutate(
across(.fns = ~ paste("Mean (±SD)"))
) %>% mutate(
variable = ""
)
bind_rows(
tmpDesc,
tmpTable
)
}
Granted, this extension is a bit awkward - but it is nonetheless still robust. The output is,
# A tibble: 3 x 4
variable `4 Cylinders` `6 Cylinders` `8 Cylinders`
<chr> <chr> <chr> <chr>
1 "" Mean (±SD) Mean (±SD) Mean (±SD)
2 "mpg" 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
3 "hp" 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
Update: I've rewritten the function for added flexibility, as noted in the comments.
library(tidyverse)
baseline_table <- function(data, variables, grouping_var) {
data %>%
group_by(!!!syms(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% unite(
"grouping",
all_of(grouping_var)
) %>% pivot_longer(
cols = -"grouping",
names_to = "variables"
) %>% pivot_wider(
names_from = "grouping"
)
}
It works in the same way, and outputs the same, unless there is more than one grouping_var,
baseline_table(
mtcars,
variables = c("hp", "mpg"),
grouping_var = c("am", "cyl")
)
# A tibble: 2 x 7
variables `0_4` `0_6` `0_8` `1_4` `1_6` `1_8`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 hp 84.67(±19.66) 115.25(±9.18) 194.17(±33.36) 81.88(±22.66) 131.67(±37.53) 299.5(±50.2)
2 mpg 22.9(±1.45) 19.12(±1.63) 15.05(±2.77) 28.08(±4.48) 20.57(±0.75) 15.4(±0.57)
In the updated function I used unite with the default separator. You can modify this to suit your needs so that the column names read, for example, 4 Cylinder (Automatic), 6 Cylinder (Automatic), etc.
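One way to get such labels, sketched here outside the function, is to build the label column with paste0() instead of relying on unite()'s separator. The column names `transmission` and `grouping` are only illustrative, and the recoding uses the mtcars documentation's coding of am (0 = automatic, 1 = manual).

```r
library(dplyr)

labels <- mtcars %>%
  distinct(am, cyl) %>%
  arrange(am, cyl) %>%
  mutate(
    # recode the 0/1 indicator into a readable transmission label
    transmission = if_else(am == 0, "Automatic", "Manual"),
    # combine both grouping variables into a single descriptive label
    grouping = paste0(cyl, " Cylinder (", transmission, ")")
  )

labels$grouping
```

A column built this way could then serve as the single grouping column that pivot_wider() takes its names from.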
Slight variation of your original code, you could use across() more simply/flexibly if you specify you don't want the ID (or the already-grouped Group) column, but rather everything else:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = list(Mean = mean, SD = sd), .names = "{.col}_{.fn}"))
# A tibble: 2 x 5
Group V1_Mean V1_SD V2_Mean V2_SD
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.0167 0.979 0.145 1.02
2 2 0.119 1.11 -0.277 1.05
EDIT:
If you want to create your (first) goal exactly, you can use the gt package to make an html table with column spanners:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = list(Mean = mean, SD = sd), .names = "{.col}_{.fn}")) %>%
gt::gt() %>%
gt::tab_spanner_delim("_") %>%
gt::fmt_number(-Group, decimals = 2)
As to your other question, you could alternately do something like this to get the combined & transposed variation:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = ~paste0(
sprintf("%.2f", mean(.x)),
sprintf(" (%.2f)", sd(.x))))) %>%
t() %>%
as.data.frame()
V1 V2
Group 1 2
V1 -0.02 (0.98) 0.12 (1.11)
V2 0.15 (1.02) -0.28 (1.05)
Outside dplyr, you could use the tables package, which lets you create summary statistics from a table formula:
library(tables)
vars <- c("V1","V2")
vars <- paste(vars, collapse="+")
table <- as.formula(paste("(group = factor(Group)) ~ (", vars ,")*(mean+sd)"))
table
# (group = factor(Group)) ~ (V1 + V2) * (mean + sd)
tables::tabular(table, data = data)
# V1 V2
# group mean sd mean sd
# 1 -0.15759 0.9771 0.1405 1.0697
# 2 0.05084 0.9039 -0.1470 0.9949
One way to make a nice summary table is to use the gtsummary package (full disclosure: I am a co-author of this package). Below I format the data a little in data2 and drop the ID variable. Then it is a two-line call to gtsummary to summarize your data. The by argument is what stratifies the table, and the statistic argument tells it to show the mean and SD; by default gtsummary shows the median and Q1-Q3. The table can be rendered in all markdown output formats (Word, PDF, HTML).
library(dplyr)
library(gtsummary)
data <- as.data.frame(cbind(1:100, sample(1:2), rnorm(100), rnorm(100)))
names(data) <- c("ID", "Group", "V1", "V2")
data2 <- data %>%
mutate(Group = ifelse(Group == 1, "Group Var1","Group Var2")) %>%
select(-ID)
tbl_summary(data2, by = Group,
statistic = all_continuous()~ "{mean} ({sd})")
If you want more than one stratum but do not want to use tbl_strata, you can combine two variables into one column and use that in the by argument. You can unite() as many variables as you want (although that may not be recommended):
trial %>%
tidyr::unite(col = "trt_grade", trt, grade, sep = ", ") %>%
select(age, marker,stage,trt_grade) %>%
tbl_summary(by = c(trt_grade))
A data.table option
dcast(
setDT(data)[,
c(
.(Meas = c("M", "Sd")),
lapply(.SD, function(x) c(mean(x), sd(x)))
),
Group,
.SDcols = patterns("V\\d")
], Group ~ Meas,
value.var = c("V1", "V2")
)
gives
Group V1_M V1_Sd V2_M V2_Sd
1: 1 -0.2392583 1.097343 -0.08048455 0.7851212
2: 2 0.1059716 1.011769 -0.23356373 0.9927975
You can also use base R:
# using do.call to make the result a data.frame
do.call(
data.frame
# here you aggregate for all the functions you need
,(aggregate(. ~ Group, data = data[,-1], FUN = function(x) c(mn = mean(x), sd = sd(x))))
)
This leads to something like this:
Group V1.mn V1.sd V2.mn V2.sd
1 1 0.1239868 1.008214 0.07215481 1.026059
2 2 -0.2324611 1.048230 0.11348897 1.071467
If you want a fancier table, kableExtra can help. Note that kableExtra should also make the %>% pipe available; if not, from R 4.1 onward you can use the native |> pipe instead:
library(kableExtra)
# data manipulation as above, note the [,-1] to remove the Group column
do.call(
data.frame
,(aggregate(. ~ Group, data = data[,-1], FUN = function(x) c(mn = mean(x), sd = sd(x)))))[,-1] %>%
# here you define as a kable, and give the names you want to columns
kbl(col.names = rep(c('mean','sd'),2) ) %>%
# some formatting
kable_paper() %>%
# adding the first header
add_header_above(c( "Group 1" = 2, "Group 2" = 2)) %>%
# another header if you need it
add_header_above(c( "Big group" = 4))
And you can find much more to make great tables.
Alternatively, you can combine the mean and SD into a single cell:
do.call(data.frame,
aggregate(. ~ Group, data = data[,-1], FUN = function(x) paste0(round(mean(x),2),' (', round(sd(x),2),')'))
) %>%
kbl() %>%
kable_paper()
That leads to:

Using clusrank by group

Simple question: I want to perform a one-sample rank test on clustered data. After searching for a while, I found clusWilcox.test from the clusrank package. A toy example for illustration:
df = data.frame(x_1 = rnorm(200),
x_2 = rnorm(200),
group = c(rep('A',100),rep('B',100)),
clus = c(rep('a_1',50),rep('a_2',50),rep('b_1',50),rep('b_2',50)))
It worked like a charm when used directly:
clusWilcox.test(x_1,paired = TRUE,cluster = "clus",data = df)
But went wrong when I tried to perform the test by group:
temp_test <-
df %>%
group_by(group) %>%
summarise_each(funs(clusWilcox.test(.,paired = TRUE,cluster = "clus")$p.value), vars = c('x_1','x_2'))
Error in complete.cases(x, cluster, group, stratum) :
not all arguments have the same length
It seemed like a data problem, so I passed df to the function's data argument. That ran, but it tested all the data instead of testing by group:
temp_test <-
df %>%
group_by(group) %>%
summarise_each(funs(clusWilcox.test(.,paired = TRUE,cluster = "clus",data = df)$p.value), vars = c('x_1','x_2'))
> temp_test
# A tibble: 2 x 3
group vars1 vars2
<fct> <dbl> <dbl>
1 A 0.168 0.136
2 B 0.168 0.136
This doesn't happen when I perform the one-sample t.test instead:
temp_test <-
df %>%
group_by(group) %>%
summarise_each(funs(t.test(.)$p.value), vars = c('x_1','x_2'))
My guess is that clusWilcox.test somehow cannot pick up the grouped data from dplyr. Does anyone know how to fix this?
According to ?clusWilcox.test, the cluster parameter should be a numeric vector. In your df, it is a factor.
Therefore, running the test separately for group A with your factor cluster variable
clusWilcox.test(x_1, paired = TRUE, cluster = clus, data = df[df$group == "A", ])
results in:
Clustered Wilcoxon signed rank test using Rosner-Glynn-Lee method
data: x_1; cluster: clus; (from [)x_1; cluster: clus; (from temp)x_1; cluster: clus; (from temp$group == "A")x_1; cluster: clus; (from )
number of observations: 100; number of clusters: 4
Z = NA, p-value = NA
alternative hypothesis: true shift in location is not equal to 0
If you create a new cluster variable that is numeric, it runs the tests correctly:
df %>%
mutate(clus = group_indices(., group, clus)) %>%
group_by(group) %>%
summarise(pvalue = clusWilcox.test(x_1, paired = TRUE, cluster = clus)$p.value)
group pvalue
<fct> <dbl>
1 A 0.175
2 B 0.801
If you want to calculate it for different columns:
df %>%
mutate(clus = group_indices(., group, clus)) %>%
group_by(group) %>%
summarise_at(vars(x_1, x_2), ~ clusWilcox.test(., paired = TRUE, cluster = clus)$p.value)
group x_1 x_2
<fct> <dbl> <dbl>
1 A 0.264 0.712
2 B 0.794 0.289
To indicate that it contains the p-value:
df %>%
mutate(clus = group_indices(., group, clus)) %>%
group_by(group) %>%
summarise_at(vars(x_1, x_2), list(pvalue = ~ clusWilcox.test(., paired = TRUE, cluster = clus)$p.value))
group x_1_pvalue x_2_pvalue
<fct> <dbl> <dbl>
1 A 0.264 0.712
2 B 0.794 0.289
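As far as I know, calling group_indices() with column arguments has been deprecated in newer dplyr releases. A sketch of the same numeric-cluster step using cur_group_id() is shown below; it only builds the numeric ids and does not call clusWilcox.test itself. The column name `clus_id` and the seed are choices made for this illustration.

```r
library(dplyr)

set.seed(1)  # only to make the toy data reproducible
df <- data.frame(x_1 = rnorm(200),
                 x_2 = rnorm(200),
                 group = rep(c("A", "B"), each = 100),
                 clus = rep(c("a_1", "a_2", "b_1", "b_2"), each = 50))

df_num <- df %>%
  group_by(group, clus) %>%
  # one integer id per (group, clus) combination
  mutate(clus_id = cur_group_id()) %>%
  ungroup()

distinct(df_num, group, clus, clus_id)
```

The resulting clus_id column could then be passed as the cluster argument after regrouping by group, in the same way the answers above use the recoded clus.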

Collapse data frame, by group, using lists of variables for weighted average AND sum

I want to collapse the following data frame, using both summation and weighted averages, according to groups.
I have the following data frame
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse = data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).
I create two lists for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).
to_be_weighted =df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[,2:3]
to_be_weighted_2=colnames(to_be_weighted)
to_be_summed_2=colnames(to_be_summed)
And my goal is to simultaneously collapse my data using either sum or weighted average, according to the type of variable (i.e., if it's in percentage terms, I use the weighted average).
Here is my best attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_summed_2,to_be_weighted_2), .funs=c(sum, mean))
But, as you can see, it is not a weighted average.
I have tried many different ways of using the weighted.mean function, but have had no luck. Here is an example of one such attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_weighted_2,to_be_summed_2), .funs=c(weighted.mean(to_be_weighted_2, weighting), sum))
And the corresponding error:
Error in weighted.mean.default(to_be_weighted_2, weighting) :
'x' and 'w' must have the same length
Here's a way to do it by reshaping into long data, adding a dummy variable called type for whether it's a percentage (optional, but handy), applying a function in summarise based on whether it's a percentage, then spreading back to wide shape. If you can change column names, you could come up with a more elegant way of doing the type column, but that's really more for convenience.
The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but otherwise == operates over every value in the vector and gives multiple logical values, when you really just need 1.
library(tidyverse)
set.seed(1234)
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse <- data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
df_to_collapse %>%
gather(key = var, value = value, -group_id, -weighting) %>%
mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
group_by(group_id, var) %>%
summarise(sum_or_avg = ifelse(type[1] == "percent", weighted.mean(value, weighting), sum(value))) %>%
ungroup() %>%
spread(key = var, value = sum_or_avg)
#> # A tibble: 3 x 5
#> group_id var_1 var_2 var_percent_1 var_percent_2
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 26 31 0.269 0.483
#> 2 2 32 21 0.854 0.261
#> 3 3 29 49 0.461 0.262
Created on 2018-05-04 by the reprex package (v0.2.0).
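With a dplyr version that provides across() (1.0.0 or later), a possible alternative that avoids reshaping is to call across() twice inside one summarise(), once per list of column names. This is a sketch on a small hand-made data frame (`df_small` is an assumption for illustration, not the question's data) so that the result can be checked by hand.

```r
library(dplyr)

# Small illustrative data with one level column and one percentage column.
df_small <- data.frame(
  group_id = c(1, 1, 2, 2),
  var_1 = c(1, 2, 3, 4),
  var_percent_1 = c(0.2, 0.8, 0.5, 0.1),
  weighting = c(1, 3, 2, 2)
)

to_be_summed_2 <- "var_1"
to_be_weighted_2 <- "var_percent_1"

collapsed <- df_small %>%
  group_by(group_id) %>%
  summarise(
    # level variables: plain sum within each group
    across(all_of(to_be_summed_2), sum),
    # percentage variables: average weighted by the group's weighting values
    across(all_of(to_be_weighted_2), ~ weighted.mean(.x, weighting))
  )

collapsed
```

For group 1 the weighted average is (0.2 * 1 + 0.8 * 3) / 4 = 0.65, and for group 2 it is (0.5 * 2 + 0.1 * 2) / 4 = 0.3. Note that weighting must not be among the summed columns, or it would already be collapsed to a single value by the time the weighted mean is computed.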
