To display the results of a regression I ran, I've got a tibble with estimates and corresponding confidence intervals:
library(tidyverse)
library(magrittr)
mydata <- structure(list(term = structure(c(1L, 3L, 4L), .Label = c("Intercept",
"Follow-up time (years)", "Age (years)", "Sex (male)", "Never smoker (reference)",
"Current smoker", "Former smoker", "Obesity (=30 kg/m²)", "BMI (kg/m²)",
"Diabetes", "Glucose (mmol/L)", "Glucose lowering medication use",
"Hypertension", "Systolic blood pressure (mmHg)", "Diastolic blood pressure (mmHg)",
"Antihypertensive medication use", "Hypercholesterolemia", "LDL cholesterol (mmol/L)",
"Lipid lowering medication use", "Chronic kidney disease (mL/min/1.73m²)",
"=90 (reference)", "60-89", "=60"), class = c("ordered", "factor"
)), estimate = c(518.38, 0.98, 1.07), conf_low = c(178.74, 0.93,
0.96), conf_high = c(1503.36, 1.03, 1.19), label = c("518.38 (178.74-1503.36)",
" 0.98 ( 0.93- 1.03)", " 1.07 ( 0.96- 1.19)")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
mydata
# A tibble: 3 x 4
term estimate conf_low conf_high
<ord> <dbl> <dbl> <dbl>
1 Intercept 518. 179. 1503.
2 Age (years) 0.98 0.93 1.03
3 Sex (male) 1.07 0.96 1.19
To make a label that includes the estimate and 95% CI, I've used paste0, and to make sure that every number has two decimals I've used format. However, when I combine these, extra whitespace appears:
mydata <-
mydata %>%
mutate(
label=
paste0(format(round(estimate, digits=2), nsmall=2),
" (",
format(round(conf_low, digits=2), nsmall=2),
"-",
format(round(conf_high, digits=2), nsmall=2),
")"))  # paste0() takes no sep, and the misspelled collaps= was only pasted in as an empty string
mydata
# A tibble: 3 x 5
term estimate conf_low conf_high label
<ord> <dbl> <dbl> <dbl> <chr>
1 Intercept 518. 179. 1503. "518.38 (178.74-1503.36)"
2 Age (years) 0.98 0.93 1.03 " 0.98 ( 0.93- 1.03)"
3 Sex (male) 1.07 0.96 1.19 " 1.07 ( 0.96- 1.19)"
Why does this happen? Can I prevent this or otherwise remove the whitespaces so that the format becomes "estimate (conf_low-conf_high)"?
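By default, format() pads every element of a vector to a common width, which is where the leading spaces come from. Add trim=TRUE to each format() call to suppress the padding: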
mydata %>%
mutate(
label=
paste0(format(round(estimate, digits=2), nsmall=2, trim=TRUE),
" (",
format(round(conf_low, digits=2), nsmall=2, trim=TRUE),
"-",
format(round(conf_high, digits=2), nsmall=2, trim=TRUE),
")"))  # sep= and collaps= dropped: paste0() only pasted them in as empty strings
# A tibble: 3 × 5
term estimate conf_low conf_high label
<ord> <dbl> <dbl> <dbl> <chr>
1 Intercept 518. 179. 1503. "518.38 (178.74-1503.36)"
2 Age (years) 0.98 0.93 1.03 "0.98 (0.93-1.03)"
3 Sex (male) 1.07 0.96 1.19 "1.07 (0.96-1.19)"
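The effect of trim can be seen on a bare vector, outside of mutate(). This is a minimal base R sketch; the vector x simply reuses the estimate values from the question.

```r
x <- c(518.38, 0.98, 1.07)

# Default: every element is padded to the width of the widest one
padded <- format(x, nsmall = 2)

# trim = TRUE suppresses the common-width padding
trimmed <- format(x, nsmall = 2, trim = TRUE)

print(padded)
print(trimmed)
```

Because the padding is applied per format() call, each of the three calls in the label needs its own trim = TRUE.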
1) Use sprintf
mydata %>%
mutate(label = sprintf("%.2f (%.2f-%.2f)", estimate, conf_low, conf_high))
giving:
# A tibble: 3 x 5
term estimate conf_low conf_high label
<ord> <dbl> <dbl> <dbl> <chr>
1 Intercept 518. 179. 1503. 518.38 (178.74-1503.36)
2 Age (years) 0.98 0.93 1.03 0.98 (0.93-1.03)
3 Sex (male) 1.07 0.96 1.19 1.07 (0.96-1.19)
2) Or use this variation, which pads each number to a fixed minimum width and so reproduces the aligned output from the question:
mydata %>%
mutate(label = sprintf("%6.2f (%6.2f-%7.2f)", estimate, conf_low, conf_high))
giving:
# A tibble: 3 x 5
term estimate conf_low conf_high label
<ord> <dbl> <dbl> <dbl> <chr>
1 Intercept 518. 179. 1503. "518.38 (178.74-1503.36)"
2 Age (years) 0.98 0.93 1.03 " 0.98 ( 0.93- 1.03)"
3 Sex (male) 1.07 0.96 1.19 " 1.07 ( 0.96- 1.19)"
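The two sprintf variants differ only in the field width. In %6.2f the 6 is a minimum width, so shorter numbers are left-padded with spaces, whereas %.2f sets no minimum width. A small sketch with two of the estimates:

```r
x <- c(518.38, 0.98)

# No minimum width: each number takes only the characters it needs
no_width <- sprintf("%.2f", x)

# Minimum width of 6: shorter results are left-padded with spaces
with_width <- sprintf("%6.2f", x)

print(no_width)
print(with_width)
```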
I have a dataframe that looks something like this:
dist id daytime season
3 1.11 Name1 day summer
4 2.22 Name2 night spring
5 3.33 Name1 day winter
6 4.44 Name3 night fall
I want a summary of dist by some specific columns in my dataframe.
So far I used a custom function:
path.summary <- function(x){df %>%
group_by(x) %>%
summarize(min = min(dist),
q1 = quantile(dist, 0.25),
median = median(dist),
mean = mean(dist),
q3 = quantile(dist, 0.75),
max = max(dist))}
And applied it to any specific column I wanted at the moment:
summary_ID <- path.summary(id)
I tried it a few weeks ago and would get something like this:
id min q1 median mean q3 max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Name1 0 17.8 310. 788. 1023. 5832.
2 Name2 0 31.7 284. 570. 744. 9578.
3 Name3 0 17.0 325. 721. 1185. 5293.
4 Name4 0 11.9 197. 530. 865. 3476.
5 Name5 0 24.5 94.9 617. 966. 9567.
When I try it now I get an error:
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `x` is not found.
What changed and how do I get around the issue?
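group_by(x) looks for a column literally named x. To pass an unquoted column name through a function argument, wrap it in {{ }} (and pass the data frame in as an argument too):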
path_summary <- function(dat, x){
dat %>%
group_by({{x}}) %>%
summarize(min = min(dist),
q1 = quantile(dist, 0.25),
median = median(dist),
mean = mean(dist),
q3 = quantile(dist, 0.75),
max = max(dist))
}
Testing:
> path_summary(df, id)
# A tibble: 3 × 7
id min q1 median mean q3 max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Name1 1.11 1.66 2.22 2.22 2.78 3.33
2 Name2 2.22 2.22 2.22 2.22 2.22 2.22
3 Name3 4.44 4.44 4.44 4.44 4.44 4.44
Data:
df <- structure(list(dist = c(1.11, 2.22, 3.33, 4.44), id = c("Name1",
"Name2", "Name1", "Name3"), daytime = c("day", "night", "day",
"night"), season = c("summer", "spring", "winter", "fall")),
class = "data.frame", row.names = c("3",
"4", "5", "6"))
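If the column name arrives as a character string rather than unquoted, {{ }} does not apply. One option is across(all_of()) inside group_by(). The sketch below is an assumption-laden variant, not part of the answer above: path_summary_chr is a hypothetical name, and only a few of the summary statistics are kept for brevity.

```r
library(dplyr)

# Same example data as in the answer above
df <- data.frame(
  dist = c(1.11, 2.22, 3.33, 4.44),
  id = c("Name1", "Name2", "Name1", "Name3"),
  season = c("summer", "spring", "winter", "fall")
)

# Hypothetical variant: `col` is a character string such as "id"
path_summary_chr <- function(dat, col) {
  dat %>%
    group_by(across(all_of(col))) %>%
    summarize(min = min(dist),
              median = median(dist),
              max = max(dist),
              .groups = "drop")
}

res <- path_summary_chr(df, "id")
print(res)
```

This form is convenient when looping over a vector of column names, for example with lapply(c("id", "season"), function(cl) path_summary_chr(df, cl)).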
I want to convert a Cox regression table to a forest plot, as shown below. Unfortunately I've lost my original data (the coxph object), so I have to use the data from the table. The data below are just examples:
Desired output: (image of a forest plot, not reproduced here)
Reprex for the two tables:
GRP1<-tibble::tribble(
~Variable, ~Level, ~Number, ~`HR.(univariable)`, ~`HR.(multivariable)`,
"Sex", "Female", "2204 (100.0)", NA, NA,
NA, "Male", "2318 (100.0)", "1.13 (0.91-1.40, p=0.265)", "1.13 (0.91-1.40, p=0.276)",
"Score", "1", "2401 (100.0)", NA, NA,
NA, "1-2", "1637 (100.0)", "1.49 (1.19-1.87, p=0.001)", "1.15 (0.90-1.47, p=0.250)",
NA, "3-4", "412 (100.0)", "1.71 (1.14-2.56, p=0.010)", "1.09 (0.71-1.67, p=0.710)",
NA, ">=5", "42 (100.0)", "1.67 (0.53-5.21, p=0.381)", "0.96 (0.30-3.05, p=0.943)",
"Treatment", "A", "1572 (100.0)", NA, NA,
NA, "B", "2951 (100.0)", "1.74 (1.26-2.40, p=0.001)", "1.53 (1.09-2.13, p=0.013)"
)
GRP2<-tibble::tribble(
~Variable, ~Level, ~Number, ~`HR.(univariable)`, ~`HR.(multivariable)`,
"Sex", "Female", "2204 (100.0)", NA, NA,
NA, "Male", "2318 (100.0)", "1.70 (1.36-2.13, p<0.001)", "1.62 (1.28-2.04, p<0.001)",
"Score", "1", "2401 (100.0)", NA, NA,
NA, "1-2", "1637 (100.0)", "2.76 (1.21-6.29, p=0.016)", "2.69 (1.18-6.13, p=0.019)",
NA, "3-4", "412 (100.0)", "5.11 (2.26-11.58, p<0.001)", "4.46 (1.95-10.23, p<0.001)",
NA, ">=5", "42 (100.0)", "5.05 (2.19-11.64, p<0.001)", "4.08 (1.73-9.59, p=0.001)",
"Treatment", "A", "1572 (100.0)", NA, NA,
NA, "B", "2951 (100.0)", "1.48 (1.16-1.88, p=0.001)", "1.23 (0.95-1.59, p=0.114)"
)
Is it doable?
Best regards, H
The difficult thing about this task is not making the plot; it is converting your data from a bunch of text strings into a single long-format data frame that can be used for plotting. This involves using regular expressions to capture the appropriate number for each column, pivoting the result, then repeating that process for the second data frame before binding the two frames together. This is unavoidably ugly and complicated, but that is one of the reasons why having data stored in the correct format is so important.
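To make the regular-expression step concrete, here is a small base R sketch that parses two cells copied from the question's tables. It uses regexec() with capture groups rather than the gsub() calls in the answer's code, so treat it as an illustration of the idea and not as the answer's method. The names cells, pattern, parts and parsed are made up for this example.

```r
# Two cells of the HR column, as they appear in the question's tables
cells <- c("1.13 (0.91-1.40, p=0.265)", "5.11 (2.26-11.58, p<0.001)")

# Capture groups: HR, lower bound, upper bound, p-value text (keeping any "<")
pattern <- "^([0-9.]+) \\(([0-9.]+)-([0-9.]+), p=?(<?[0-9.]+)\\)$"
parts <- regmatches(cells, regexec(pattern, cells))

# Each element of `parts` is c(full match, HR, lower, upper, p)
parsed <- data.frame(
  HR    = as.numeric(sapply(parts, `[`, 2)),
  lower = as.numeric(sapply(parts, `[`, 3)),
  upper = as.numeric(sapply(parts, `[`, 4)),
  p     = sapply(parts, `[`, 5)
)

print(parsed)
```

The p-value is kept as text because entries such as "<0.001" are not numbers.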
Anyway, the following code performs the necessary operations:
library(dplyr)
wrangler <- function(data){
grp <- as.character(match.call()$data)
data %>%
tidyr::fill(Variable) %>%
mutate(Variable = paste(Variable, Level),
Number = as.numeric(gsub("^(\\d+).*$", "\\1", Number)),
univariable_HR = as.numeric(gsub("^((\\d+|\\.)+).*$", "\\1", `HR.(univariable)`)),
univariable_lower = as.numeric(gsub("^.+? \\((.+?)-.*$", "\\1", `HR.(univariable)`)),
univariable_upper = as.numeric(gsub("^.+?-(.+?),.*$", "\\1", `HR.(univariable)`)),
univariable_p = gsub("^.+?p=*(.+?)\\).*$", "\\1", `HR.(univariable)`),
multivariable_HR = as.numeric(gsub("^((\\d+|\\.)+).*$", "\\1", `HR.(multivariable)`)),
multivariable_lower = as.numeric(gsub("^.+? \\((.+?)-.*$", "\\1", `HR.(multivariable)`)),
multivariable_upper = as.numeric(gsub("^.+?-(.+?),.*$", "\\1", `HR.(multivariable)`)),
multivariable_p = gsub("^.+?p=*(.+?)\\).*$", "\\1", `HR.(multivariable)`),
group = grp) %>%
filter(!is.na(univariable_HR)) %>%
select(-Level, -`HR.(multivariable)`, - `HR.(univariable)`) %>%
tidyr::pivot_longer(cols = -(c(1:2, 11)), names_sep = "_", names_to = c("type", ".value"))
}
df <- rbind(wrangler(GRP1), wrangler(GRP2))
This now gives us the data in the correct format for plotting. Each row will become a single pointrange in our plot, so it needs a hazard ratio, a lower confidence bound, an upper confidence bound, a variable label, the type (multivariable versus univariable), and the group it originally came from (GRP1 or GRP2):
df
#> # A tibble: 20 x 8
#> Variable Number group type HR lower upper p
#> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 Sex Male 2318 GRP1 univariable 1.13 0.91 1.4 0.265
#> 2 Sex Male 2318 GRP1 multivariable 1.13 0.91 1.4 0.276
#> 3 Score 1-2 1637 GRP1 univariable 1.49 1.19 1.87 0.001
#> 4 Score 1-2 1637 GRP1 multivariable 1.15 0.9 1.47 0.250
#> 5 Score 3-4 412 GRP1 univariable 1.71 1.14 2.56 0.010
#> 6 Score 3-4 412 GRP1 multivariable 1.09 0.71 1.67 0.710
#> 7 Score >=5 42 GRP1 univariable 1.67 0.53 5.21 0.381
#> 8 Score >=5 42 GRP1 multivariable 0.96 0.3 3.05 0.943
#> 9 Treatment B 2951 GRP1 univariable 1.74 1.26 2.4 0.001
#> 10 Treatment B 2951 GRP1 multivariable 1.53 1.09 2.13 0.013
#> 11 Sex Male 2318 GRP2 univariable 1.7 1.36 2.13 <0.001
#> 12 Sex Male 2318 GRP2 multivariable 1.62 1.28 2.04 <0.001
#> 13 Score 1-2 1637 GRP2 univariable 2.76 1.21 6.29 0.016
#> 14 Score 1-2 1637 GRP2 multivariable 2.69 1.18 6.13 0.019
#> 15 Score 3-4 412 GRP2 univariable 5.11 2.26 11.6 <0.001
#> 16 Score 3-4 412 GRP2 multivariable 4.46 1.95 10.2 <0.001
#> 17 Score >=5 42 GRP2 univariable 5.05 2.19 11.6 <0.001
#> 18 Score >=5 42 GRP2 multivariable 4.08 1.73 9.59 0.001
#> 19 Treatment B 2951 GRP2 univariable 1.48 1.16 1.88 0.001
#> 20 Treatment B 2951 GRP2 multivariable 1.23 0.95 1.59 0.114
Now that we have the data in this format, the plot itself is straightforward:
library(ggplot2)
ggplot(df, aes(HR, Variable)) +
geom_pointrange(aes(xmin = lower, xmax = upper, colour = type),
position = position_dodge(width = 0.5)) +
facet_grid(group~., switch = "y") +
geom_vline(xintercept = 1, linetype = 2) +  # HR = 1 is the no-effect reference line
theme_bw() +
theme(strip.placement = "outside",
strip.text= element_text(angle = 180),
strip.background = element_blank(),
panel.spacing = unit(0, "mm"))
Created on 2021-11-01 by the reprex package (v2.0.0)
I am trying to perform cor.test in R in a dataframe:
For a toy dataset of X and Y, I used the following:
library(dplyr)
library(broom)
X = c(0.88,1.3,5.6,3.1)
Y = c(0,1,1,1)
ft<-cor.test(X,Y)
tidy(ft) %>%
select(estimate, p.value, conf.low, conf.high) %>%
bind_rows(.id = 'grp')
which gives me the following result:
grp estimate p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl>
1 1 0.571 0.429 -0.864 0.989
Now, a short version of my dataframe is like:
df<-structure(list(X_sample1 = c(0.11, 0.98, 0.88), X_sample2 = c(0.13,
0, 1.3), X_sample3 = c(1.5, 3.5, 5.6), X_sample4 = c(3.2, 2.4,
3.1), Y_sample1 = c(0L, 1L, 0L), Y_sample2 = c(0L, 0L, 1L), Y_sample3 = c(1L,
1L, 1L), Y_sample4 = c(1L, 1L, 1L)), class = "data.frame", row.names = c("Product1",
"Product2", "Product3"))
I want to perform cor.test in each row of the df between X and Y groups. Thus, in the above example df, the groups are:
X = c(0.11,0.13,1.5,3.2)
Y = c(0,0,1,1)
---------------
X = c(0.98,0,3.5,2.4)
Y = c(1,0,1,1)
---------------
X = c(0.88,1.3,5.6,3.1)
Y = c(0,1,1,1)
I want an output like:
grp estimate p.value conf.low conf.high
Product1 0.88 0.12 -0.525 0.997
Product2 0.743 0.257 -0.762 0.994
Product3 0.571 0.429 -0.864 0.989
Thanks for your help!
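One option, using dplyr, tibble and broom:
library(dplyr)
library(tibble)  # rownames_to_column()
library(broom)   # tidy()
df %>%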
rownames_to_column(var = "grp") %>%
rowwise() %>%
transmute(grp,
tidy(cor.test(c_across(starts_with("X")), c_across(starts_with("Y"))))) %>%
select(grp, estimate, p.value, conf.low, conf.high)
grp estimate p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl>
1 Product1 0.880 0.120 -0.525 0.997
2 Product2 0.743 0.257 -0.762 0.994
3 Product3 0.571 0.429 -0.864 0.989
You can use dplyr, tidyr, tibble and broom:
library(dplyr)
library(tidyr)   # pivot_longer()
library(tibble)  # rownames_to_column()
library(broom)
df %>%
rownames_to_column() %>%
pivot_longer(-rowname, names_to = c(".value", "sample"),
names_sep = "_sample") %>%
nest_by(rowname) %>%
summarize(cors1 = tidy(cor.test(data$X, data$Y)))
# A tibble: 3 x 2
# Groups: rowname [3]
rowname cors1$estimate $statistic $p.value $parameter $conf.low $conf.high
<chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 Produc~ 0.880 2.62 0.120 2 -0.525 0.997
2 Produc~ 0.743 1.57 0.257 2 -0.762 0.994
3 Produc~ 0.571 0.984 0.429 2 -0.864 0.989
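For comparison, the same row-by-row test can be sketched in base R without dplyr or broom. This is an alternative to the answers above, not a restatement of them. It assumes the X and Y columns can be told apart by the "X_" and "Y_" name prefixes, as in the question's data.

```r
# Data copied from the question
df <- data.frame(
  X_sample1 = c(0.11, 0.98, 0.88), X_sample2 = c(0.13, 0, 1.3),
  X_sample3 = c(1.5, 3.5, 5.6),    X_sample4 = c(3.2, 2.4, 3.1),
  Y_sample1 = c(0L, 1L, 0L),       Y_sample2 = c(0L, 0L, 1L),
  Y_sample3 = c(1L, 1L, 1L),       Y_sample4 = c(1L, 1L, 1L),
  row.names = c("Product1", "Product2", "Product3")
)

x_cols <- grep("^X_", names(df))
y_cols <- grep("^Y_", names(df))

# Run cor.test() once per row and collect the pieces of interest
res <- do.call(rbind, lapply(rownames(df), function(r) {
  ct <- cor.test(unlist(df[r, x_cols]), unlist(df[r, y_cols]))
  data.frame(grp = r,
             estimate = unname(ct$estimate),
             p.value = ct$p.value,
             conf.low = ct$conf.int[1],
             conf.high = ct$conf.int[2])
}))

print(res)
```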
I've used group_by with the summarise command in dplyr to generate some summaries for my data. I would like to get the same summaries for the overall data set and combine it as one tibble.
Is there a straightforward way of doing this? My solution below feels like it has 4X the amount of code required to do this efficiently!
Thanks in advance.
# reprex
library(tidyverse)
tidy_data <- tibble::tribble(
~drug, ~gender, ~condition, ~value,
"control", "f", "work", 0.06,
"treatment", "m", "work", 0.42,
"treatment", "f", "work", 0.22,
"control", "m", "work", 0.38,
"treatment", "m", "work", 0.57,
"treatment", "f", "work", 0.24,
"control", "f", "work", 0.61,
"control", "f", "play", 0.27,
"treatment", "m", "play", 0.3,
"treatment", "f", "play", 0.09,
"control", "m", "play", 0.84,
"control", "m", "play", 0.65,
"treatment", "m", "play", 0.98,
"treatment", "f", "play", 0.38
)
tidy_summaries <- tidy_data %>%
# Group by the required variables
group_by(drug, gender, condition) %>%
summarise(mean = mean(value),
median = median(value),
min = min(value),
max = max(value)) %>%
# Bind rows will bind this output to the following one
bind_rows(
# Now for the overall version
tidy_data %>%
# Generate the overall summary values
mutate(mean = mean(value),
median = median(value),
min = min(value),
max = max(value)) %>%
# We need to know what the structure of the 'grouped_by' tibble first
# as the overall output format needs to match that
select(drug, gender, condition, mean:max) %>% # Keep columns of interest
# The same information will be appended to all rows, so we just need to retain one
filter(row_number() == 1) %>%
# Change the values in drug, gender, condition to "overall"
mutate_at(vars(drug:condition),
list(~ifelse(is.character(.), "overall", .)))
)
This is the output I want, but it wasn't as simple as I might have hoped.
tidy_summaries
#> # A tibble: 9 x 7
#> # Groups: drug, gender [5]
#> drug gender condition mean median min max
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 control f play 0.27 0.27 0.27 0.27
#> 2 control f work 0.335 0.335 0.06 0.61
#> 3 control m play 0.745 0.745 0.65 0.84
#> 4 control m work 0.38 0.38 0.38 0.38
#> 5 treatment f play 0.235 0.235 0.09 0.38
#> 6 treatment f work 0.23 0.23 0.22 0.24
#> 7 treatment m play 0.64 0.64 0.3 0.98
#> 8 treatment m work 0.495 0.495 0.42 0.570
#> 9 overall overall overall 0.429 0.38 0.06 0.98
Try
tidy_data %>%
group_by(drug, gender, condition) %>%
summarise(mean = mean(value), median = median(value), min = min(value), max = max(value)) %>%
bind_rows(.,
tidy_data %>%
summarise(drug = "Overall", gender = "Overall", condition = "Overall", mean = mean(value), median = median(value), min = min(value), max = max(value))
)
This gives:
# A tibble: 9 x 7
# Groups: drug, gender [5]
drug gender condition mean median min max
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 control f play 0.27 0.27 0.27 0.27
2 control f work 0.335 0.335 0.06 0.61
3 control m play 0.745 0.745 0.65 0.84
4 control m work 0.38 0.38 0.38 0.38
5 treatment f play 0.235 0.235 0.09 0.38
6 treatment f work 0.23 0.23 0.22 0.24
7 treatment m play 0.64 0.64 0.3 0.98
8 treatment m work 0.495 0.495 0.42 0.570
9 Overall Overall Overall 0.429 0.38 0.06 0.98
The code summarizes it via groupings first, and then creates the final summary row from the original data and binds it at the very bottom.
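To avoid writing the summary expressions twice, they can be wrapped in a small helper and applied to both the grouped and the ungrouped data. This is only a sketch: summarise_value is a hypothetical helper name, toy is a cut-down data set with values taken from the question, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Cut-down data with two of the grouping columns from the question
toy <- data.frame(
  drug = c("control", "control", "treatment", "treatment"),
  gender = c("f", "m", "f", "m"),
  value = c(0.06, 0.38, 0.22, 0.42)
)

# Hypothetical helper: the summary expressions live in one place
summarise_value <- function(data) {
  summarise(data,
            mean = mean(value), median = median(value),
            min = min(value), max = max(value),
            .groups = "drop")
}

by_group <- toy %>% group_by(drug, gender) %>% summarise_value()
overall <- toy %>% summarise_value() %>%
  mutate(drug = "Overall", gender = "Overall")

# bind_rows() matches columns by name, so column order does not matter
res <- bind_rows(by_group, overall)
print(res)
```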
Interesting question. My take is basically the same answer as @sumshyftw's, but it uses mutate_if and summarise_at.
Code
library(hablar)
funs <- list(mean = ~mean(.),
median = ~median(.),
min = ~min(.),
max = ~max(.))
tidy_data %>%
group_by(drug, gender, condition) %>%
summarise_at(vars(value), funs) %>%
ungroup() %>%
bind_rows(., tidy_data %>% summarise_at(vars(value), funs)) %>%
mutate_if(is.character, ~if_na(., "Overall"))
Result
drug gender condition mean median min max
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 control f play 0.27 0.27 0.27 0.27
2 control f work 0.335 0.335 0.06 0.61
3 control m play 0.745 0.745 0.65 0.84
4 control m work 0.38 0.38 0.38 0.38
5 treatment f play 0.235 0.235 0.09 0.38
6 treatment f work 0.23 0.23 0.22 0.24
7 treatment m play 0.64 0.64 0.3 0.98
8 treatment m work 0.495 0.495 0.42 0.570
9 Overall Overall Overall 0.429 0.38 0.06 0.98
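Since dplyr 1.0.0, summarise_at() and mutate_if() are superseded by across(). The same idea can be sketched without hablar, using dplyr::coalesce() in place of if_na(). The data set toy below is a made-up miniature with a single grouping column, just to keep the example short.

```r
library(dplyr)

# Made-up miniature data set with one grouping column
toy <- data.frame(
  drug = c("control", "control", "treatment"),
  value = c(0.1, 0.3, 0.5)
)

funs <- list(mean = mean, median = median, min = min, max = max)

res <- toy %>%
  group_by(drug) %>%
  # .names = "{.fn}" names the outputs mean, median, min, max
  summarise(across(value, funs, .names = "{.fn}"), .groups = "drop") %>%
  # The overall row has no drug column, so bind_rows() fills it with NA
  bind_rows(toy %>% summarise(across(value, funs, .names = "{.fn}"))) %>%
  # Replace those NAs in character columns with the label "Overall"
  mutate(across(where(is.character), ~ coalesce(.x, "Overall")))

print(res)
```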
So this is a relatively simple problem, I have a dataset as below
df <- structure(list(term = c("(Intercept)", "overall_quality", "overall_costs",
"wwpf"), estimate = c(0.388607224137536, 0.456477162621961, 0.485612564501229,
NA), std.error = c(0.499812263278414, 0.0987819420575201, 0.108042289289401,
NA), statistic = c(0.777506381273137, 4.62105879995918, 4.49465267438447,
NA), p.value = c(0.440597919486169, 0.0000279867005591494, 0.0000426773877613654,
NA), average = c(NA, 8.09615384615385, 7.86538461538461, 7.90384615384615
), Elasticity = c(NA, 3.69570933584318, 3.81952959386543, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
I am trying to use the following:
df %>% mutate(Elasticity= average*estimate/average[nrow(df)])
Expected output: https://ibb.co/42ptLXx
Basically, I want to divide by the value in the last row. Since I am going to use this inside a function, the method needs to be dynamic rather than a hard-coded value.
Please help !
We can use n(), which returns the number of rows in the current group, as the index of the last row when subsetting that column:
library(dplyr)
df %>%
mutate(Elasticity= average*estimate/average[n()])
If we need a function (using rlang_0.4.0), we can make use of {{ }} for evaluation:
f1 <- function(dat, col1, col2) {
dat %>%
mutate(Elasticity = {{col1}} * {{col2}}/{{col1}}[n()])
}
f1(df, average, estimate)
# A tibble: 4 x 7
# term estimate std.error statistic p.value average Elasticity
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 (Intercept) 0.389 0.500 0.778 0.441 NA NA
#2 overall_quality 0.456 0.0988 4.62 0.0000280 8.10 0.468
#3 overall_costs 0.486 0.108 4.49 0.0000427 7.87 0.483
#4 wwpf NA NA NA NA 7.90 NA
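As a variation on indexing with n(), dplyr also provides last(), which returns the final element of a vector directly. The sketch below computes the ratio both ways so they can be compared. The data frame uses rounded values from the question, and the column names by_n and by_last are made up for the comparison.

```r
library(dplyr)

# Values rounded from the question's data
df <- data.frame(
  estimate = c(0.389, 0.456, 0.486, NA),
  average = c(NA, 8.10, 7.87, 7.90)
)

res <- df %>%
  mutate(by_n = average * estimate / average[n()],      # index with the row count
         by_last = average * estimate / last(average))  # take the last element

print(res)
```

Both forms stay dynamic inside a function, since neither hard-codes the number of rows.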