Formatting gtsummary tables with checkbox questions - r

I have been enjoying the gtsummary library quite a bit, but I can't find a clean way to display checkbox-style questions (select all that apply) with gtsummary::tbl_summary. Here is an example:
example_df = tibble::tibble(
CHOICE1 = sample(c(1, NA), size = 10, replace = TRUE),
CHOICE2 = sample(c(1, NA), size = 10, replace = TRUE),
CHOICE3 = sample(c(1, NA), size = 10, replace = TRUE)
)
for(i in 1:3){
expss::val_lab(example_df[[i]]) = setNames(1, letters[i])
expss::var_lab(example_df[[i]]) = 'Question 1'
}
example_df %>%
gtsummary::tbl_summary(
type = list(
CHOICE1 ~ "categorical",
CHOICE2 ~ "categorical",
CHOICE3 ~ "categorical"
)
)
Ideally, we would have just one header that says 'Question 1', with each of the columns summarized below it. Any suggestions on how to do this properly, or how to jury-rig it?
Thank you!

Great question. Below is an admittedly not great solution to your question, but it does get the job done. If you file a GitHub issue on the gtsummary page requesting better support for this type of data, we can work together on a more concise solution. Happy programming!
library(gtsummary)
library(tidyverse)
example_df = tibble::tibble(
CHOICE1 = sample(c(1, NA), size = 10, replace = TRUE),
CHOICE2 = sample(c(1, NA), size = 10, replace = TRUE),
CHOICE3 = sample(c(1, NA), size = 10, replace = TRUE)
)
for(i in 1:3){
expss::val_lab(example_df[[i]]) = setNames(1, letters[i])
expss::var_lab(example_df[[i]]) = 'Question 1'
}
example_df %>%
mutate(across(everything(), ~replace_na(., 0L))) %>%
gtsummary::tbl_summary(
type = list(
CHOICE1 ~ "categorical",
CHOICE2 ~ "categorical",
CHOICE3 ~ "categorical"
)
) %>%
remove_row_type(variables = c(CHOICE2, CHOICE3), type = "header") %>%
modify_table_body(
~.x %>%
filter(label != "0")
) %>%
as_kable() # converting to kable to display on SO
| Characteristic | N = 10 |
|:---------------|:-------:|
| Question 1     |         |
| a              | 4 (40%) |
| b              | 3 (30%) |
| c              | 4 (40%) |
Created on 2022-01-12 by the reprex package (v2.0.1)
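To sanity-check the numbers in a table like the one above, the counts can also be computed directly. This is only a sketch with dplyr and tidyr, not a gtsummary feature; the names `check_df` and `check_counts` and the seed are made up for illustration, and it assumes a ticked box is stored as 1 and an unticked box as NA, as in the example data.

```r
library(dplyr)
library(tidyr)

set.seed(2022)  # arbitrary seed, only so the sketch is reproducible

# Same shape as the example data: 1 = box ticked, NA = box not ticked
check_df <- tibble(
  CHOICE1 = sample(c(1, NA), size = 10, replace = TRUE),
  CHOICE2 = sample(c(1, NA), size = 10, replace = TRUE),
  CHOICE3 = sample(c(1, NA), size = 10, replace = TRUE)
)

check_counts <- check_df %>%
  # count the ticked boxes in every column
  summarise(across(everything(), ~ sum(!is.na(.x)))) %>%
  # one row per checkbox
  pivot_longer(everything(), names_to = "choice", values_to = "n") %>%
  # percentage of all respondents who ticked the box
  mutate(pct = 100 * n / nrow(check_df))

check_counts
```

Each row of `check_counts` should correspond to one of the `n (%)` cells under 'Question 1'.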

Related

I want a function in R to check whether values of one column are greater than the 75th quantile then writes yes or no in the next column

I have tried the following function, but it gives "No" for every row, even when I change the quantile value.
NOTE: I have 3 independent datasets to which I want to apply the function.
outlier<-function(x1,x2){
q1<-quantile(x1 , .75, na.rm = TRUE)
if(x1>q1){x2<-"Yes"
}else{
x2<-"No"
}
}
I have tried x2<-ifelse(x1>q1,"Yes","No")
inside the function but it still doesn't work.
You can use an ifelse statement and create a new column using mutate.
library(dplyr)
set.seed(1)
df <- tibble(x1 = sample(c(1:10), size = 10, replace = T))
df %>%
mutate(x2 = ifelse(quantile(x1, 0.75, na.rm = T) < x1, "Yes", "No"))
If you want a function:
library(dplyr)
set.seed(1)
df <- tibble(x1 = sample(c(1:10), size = 10, replace = T),
x2 = sample(c(1:10), size = 10, replace = T),
x3 = sample(c(1:10), size = 10, replace = T),
x4 = sample(c(1:10), size = 10, replace = T))
outlier<-function(dataframe, quant = 0.75, col = c("x1", "x2")){
dataframe %>%
# use the quant argument instead of a hard-coded 0.75
mutate(across(all_of(col), ~ifelse(.x>quantile(.x, quant, na.rm = TRUE), 'Yes', 'No'),
.names = '{.col}_yes'))
}
outlier(dataframe = df,quant = 0.25)
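As for why the function in the question likely misbehaves: `if` tests a single TRUE/FALSE value rather than a whole vector (recent R versions raise an error when the condition has length greater than one), and assigning to `x2` inside the function does not change anything outside it. A minimal base R sketch that returns the flags instead; the name `flag_outlier` and its `prob` argument are made up for illustration.

```r
# Return "Yes"/"No" for each element of x, depending on whether it
# exceeds the chosen quantile of x
flag_outlier <- function(x, prob = 0.75) {
  cutoff <- quantile(x, prob, na.rm = TRUE)
  ifelse(x > cutoff, "Yes", "No")  # vectorised, one flag per element
}

x1 <- c(1, 2, 3, 4, 100)
flags <- flag_outlier(x1)
flags
```

The result is assigned by the caller, for example `df$x2 <- flag_outlier(df$x1)`, so the same function can be applied to each of the three datasets.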

Selecting elements from a list with non compatible length

Given the following structure of the list:
x <- list(list(Main = list(one = list(tlv = 1, beta = 2), two = "three", three = 4,list_a = list(list(value_1 = "a1", value_2 = "b", c = "c")))),
list(Main = list(one = list(tlv = 2, beta = 6), two = "seven", three = 8,list_a = list(list(value_1 = "aa2", value_2 = "bb", c = "cc")))),
list(Main = list(one = list(tlv = 3),list_a = list(list(value_1 = c("aaa3", "aaaa4"), value_2 = c("bbb", "bbbb"), c = c("ccc", "ccc"))))))
I'm trying to create a dataframe with a structure like this:
tlv | value_1
1 | a1
2 | aa2
3 | aaa3
3 | aaaa4
So far I have the following:
library(tidyverse)
tibble::tibble(
tlv = map(x, list(1,1,"tlv"), .default = NA) %>% unlist(),
value = map(x, list(1,"list_a", 1, "value"), .default = NA) %>% unlist())
Which leads to the following error:
Error: Tibble columns must have compatible sizes.
* Size 3: Existing data.
* Size 4: Column `value`.
i Only values of size one are recycled.
This makes sense given the structure of the list (3 values for one of the variables and 4 values for the other). But I don't see a way to link the values to the parent element of the list, so that every 'value' also gets the corresponding 'tlv' value. Any guidance on how to solve this problem?
Found a solution, this does the trick:
x %>%
map_df(~tibble(
tlv = .$Main$one$tlv,
value = .$Main$list_a[[1]]$value_1))
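The same idea spelled out with `pluck()` and `bind_rows()`, as a sketch: one small tibble is built per list element, and inside each one the single `tlv` is recycled against however many `value_1` entries that element has. The names `rows` and `result` are made up for illustration.

```r
library(purrr)
library(tibble)
library(dplyr)

x <- list(
  list(Main = list(one = list(tlv = 1, beta = 2), two = "three", three = 4,
                   list_a = list(list(value_1 = "a1", value_2 = "b", c = "c")))),
  list(Main = list(one = list(tlv = 2, beta = 6), two = "seven", three = 8,
                   list_a = list(list(value_1 = "aa2", value_2 = "bb", c = "cc")))),
  list(Main = list(one = list(tlv = 3),
                   list_a = list(list(value_1 = c("aaa3", "aaaa4"),
                                      value_2 = c("bbb", "bbbb"),
                                      c = c("ccc", "ccc")))))
)

# One tibble per top-level element; the length-1 tlv is recycled
# to match the length of value_1 within that element
rows <- map(x, function(el) {
  tibble(
    tlv     = pluck(el, "Main", "one", "tlv"),
    value_1 = pluck(el, "Main", "list_a", 1, "value_1")
  )
})

result <- bind_rows(rows)
result
```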
An alternative:
library(tidyverse)
value_1 <-
map_depth(x, 4, pluck, "value_1", .ragged = TRUE) %>%
map(unlist, use.names = FALSE)
tlv <-
map_depth(x, 3, pluck, "tlv") %>%
map_dbl(unlist, use.names = FALSE)
df <-
tibble(tlv = tlv, value_1 = value_1) %>%
unnest_auto(col = value_1)

Grid Table Format in R

I am using the expss package to create tables in R.
I have 5 statements, and each statement has 5 brands.
The 5 statements are stored in 5 consecutive variables: a1, a2, a3, a4, a5.
Can I get a table in grid format like the following?
[image: desired grid table]
There are two solutions: the first is verbose and does not scale; the second scales but is less simple. Both are based on the idea of moving the labels from the variables to the banner location.
library(expss)
# create sample of data
set.seed(123)
N = 150
df = data.frame(
st1 = sample(paste0("brand", 1:5), N, replace = TRUE),
st2 = sample(paste0("brand", 1:5), N, replace = TRUE),
st3 = sample(paste0("brand", 1:5), N, replace = TRUE),
st4 = sample(paste0("brand", 1:5), N, replace = TRUE),
st5 = sample(paste0("brand", 1:5), N, replace = TRUE)
) %>% apply_labels(
st1 = 'Statement 1',
st2 = 'Statement 2',
st3 = 'Statement 3',
st4 = 'Statement 4',
st5 = 'Statement 5'
)
# verbose solution with tab_*; it does not scale to a large number of variables
# the variable labels have to be moved from the rows to the columns
df %>%
tab_total_row_position("above") %>%
tab_cells("|" = drop_var_labs(st1)) %>%
tab_stat_cpct(label = var_lab(st1)) %>%
tab_cells("|" = drop_var_labs(st2)) %>%
tab_stat_cpct(label = var_lab(st2)) %>%
tab_cells("|" = drop_var_labs(st3)) %>%
tab_stat_cpct(label = var_lab(st3)) %>%
tab_cells("|" = drop_var_labs(st4)) %>%
tab_stat_cpct(label = var_lab(st4)) %>%
tab_cells("|" = drop_var_labs(st5)) %>%
tab_stat_cpct(label = var_lab(st5)) %>%
tab_pivot(stat_position = "inside_columns") %>%
tab_transpose()
# solution which works for an arbitrary number of variables
df %>%
calculate(
lapply(st1 %to% st5, function(item)
# the variable labels have to be moved from the rows to the columns
cro(list(drop_var_labs(item)), list(var_lab(item)), total_row_position = "above")
)
) %>%
Reduce("%merge%", .) %>%
tab_transpose()

Recoding values in a selection of columns of a dataframe using dplyr

I have a large dataset like the one in the next example. Columns with As in their headers have codes from 1 to 4, and columns with Bs from 1 to 3.
library(dplyr)
d <- data.frame(
ID = 1:10,
A = sample(x = 1:4, size = 10, replace = T),
AA = sample(x = 1:4, size = 10, replace = T),
B = sample(x = 1:3, size = 10, replace = T),
BB = sample(x = 1:3, size = 10, replace = T)
)
Is there a neat way to use pipes in dplyr to recode the values from columns with As in the headers and columns with Bs to the following strings?
As - from 1, 2, 3, 4 to Green, Yellow, Orange, Red respectively
Bs - from 1, 2, 3 to Green, Yellow, Red respectively
This is a simplified and friendlier version of the real dataset.
With mutate_at from dplyr, it is possible to recode the numeric codes as strings. The columns to be recoded must first be coerced from numeric to character; otherwise, there will be an error message.
library(dplyr)
d <- data.frame(
ID = 1:10,
A = sample(x = 1:4, size = 10, replace = T),
AA = sample(x = 1:4, size = 10, replace = T),
B = sample(x = 1:3, size = 10, replace = T),
BB = sample(x = 1:3, size = 10, replace = T))
d_recoded <- d %>% mutate_at(vars(-contains("ID")), funs(as.character)) %>%
mutate_at(vars(contains("A"), -contains("ID")), funs(case_when(. == 1 ~ "Green", . == 2 ~ "Yellow", . == 3 ~ "Orange", . == 4 ~ "Red"))) %>%
mutate_at(vars(contains("B"), -contains("ID")), funs(case_when(. == 1 ~ "Green", . == 2 ~ "Yellow", . == 3 ~ "Red")))
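Since `funs()` is deprecated in current dplyr, here is a sketch of the same recoding with `across()`. It assumes the codes are always valid positions (1 to 4 for the As, 1 to 3 for the Bs) and that the relevant column names start with "A" or "B"; the lookup vectors `a_labels` and `b_labels` and the seed are made up for illustration.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
d <- data.frame(
  ID = 1:10,
  A  = sample(x = 1:4, size = 10, replace = TRUE),
  AA = sample(x = 1:4, size = 10, replace = TRUE),
  B  = sample(x = 1:3, size = 10, replace = TRUE),
  BB = sample(x = 1:3, size = 10, replace = TRUE)
)

# Lookup vectors: the numeric code is the position of its label
a_labels <- c("Green", "Yellow", "Orange", "Red")
b_labels <- c("Green", "Yellow", "Red")

d_recoded <- d %>%
  mutate(
    across(starts_with("A"), ~ a_labels[.x]),  # columns A and AA
    across(starts_with("B"), ~ b_labels[.x])   # columns B and BB
  )

d_recoded
```

Indexing a label vector by the code avoids the separate coercion to character, because the result is already a character column.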

Normalising data with dplyr mutate() brings inconsistencies

I'm trying to reproduce the framework from this blog post http://www.luishusier.com/2017/09/28/balance/ with the following code, but it looks like I get inconsistent results:
library(tidyverse)
library(magrittr)
ids <- c("1617", "1516", "1415", "1314", "1213", "1112", "1011", "0910", "0809", "0708", "0607", "0506")
data <- ids %>%
map(function(i) {read_csv(paste0("http://www.football-data.co.uk/mmz4281/", i ,"/F1.csv")) %>%
select(Date:AST) %>%
mutate(season = i)})
data <- bind_rows(data)
data <- data[complete.cases(data[ , 1:3]), ]
tmp1 <- data %>%
select(season, HomeTeam, FTHG:FTR,HS:AST) %>%
rename(BP = FTHG,
BC = FTAG,
TP = HS,
TC = AS,
TCP = HST,
TCC = AST,
team = HomeTeam)%>%
mutate(Pts = ifelse(FTR == "H", 3, ifelse(FTR == "A", 0, 1)),
Terrain = "Domicile")
tmp2 <- data %>%
select(season, AwayTeam, FTHG:FTR, HS:AST) %>%
rename(BP = FTAG,
BC = FTHG,
TP = AS,
TC = HS,
TCP = AST,
TCC = HST,
team = AwayTeam)%>%
mutate(Pts = ifelse(FTR == "A", 3 ,ifelse(FTR == "H", 0 , 1)),
Terrain = "Extérieur")
tmp3 <- bind_rows(tmp1, tmp2)
l1_0517 <- tmp3 %>%
group_by(season, team)%>%
summarise(j = n(),
pts = sum(Pts),
diff_but = (sum(BP) - sum(BC)),
diff_t_ca = (sum(TCP, na.rm = T) - sum(TCC, na.rm = T)),
diff_t = (sum(TP, na.rm = T) - sum(TC, na.rm = T)),
but_p = sum(BP),
but_c = sum(BC),
tir_ca_p = sum(TCP, na.rm = T),
tir_ca_c = sum(TCC, na.rm = T),
tir_p = sum(TP, na.rm = T),
tir_c = sum(TC, na.rm = T)) %>%
arrange((season), desc(pts), desc(diff_but))
Then I apply the framework mentioned above:
l1_0517 <- l1_0517 %>%
mutate(
# First, see how many goals the team scores relative to the average
norm_attack = but_p %>% divide_by(mean(but_p)) %>%
# Then, transform it into an unconstrained scale
log(),
# First, see how many goals the team concedes relative to the average
norm_defense = but_c %>% divide_by(mean(but_c)) %>%
# Invert it, so a higher defense is better
raise_to_power(-1) %>%
# Then, transform it into an unconstrained scale
log(),
# Now that we have normalized attack and defense ratings, we can compute
# measures of quality and attacking balance
quality = norm_attack + norm_defense,
balance = norm_attack - norm_defense
) %>%
arrange(desc(norm_attack))
When I look at the column norm_attack, I expect to find the same value for equivalent but_p values, which is not the case here:
head(l1_0517, 10)
For instance, where but_p has the value 83 (rows 5 and 7), I get norm_attack values of 0.5612738 and 0.5128357 respectively.
Is this normal? I would expect mean(l1_0517$but_p) to be fixed, and therefore the same value of l1_0517$but_p to always give the same log-normalised result.
UPDATE
I have tried to work on a simpler example but I can't reproduce this issue:
df <- tibble(a = as.integer(runif(200, 15, 100)))
df <- df %>%
mutate(norm_a = a %>% divide_by(mean(a)) %>%
log())
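I found the solution after looking at the type of l1_0517.
It is a grouped_df, hence the different results: by default, summarise() after group_by(season, team) drops only the last grouping level, so the result is still grouped by season and mean(but_p) is computed within each season.
The correct code is: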
l1_0517 <- tmp3 %>%
group_by(season, team)%>%
summarise(j = n(),
pts = sum(Pts),
diff_but = (sum(BP) - sum(BC)),
diff_t_ca = (sum(TCP, na.rm = T) - sum(TCC, na.rm = T)),
diff_t = (sum(TP, na.rm = T) - sum(TC, na.rm = T)),
but_p = sum(BP),
but_c = sum(BC),
tir_ca_p = sum(TCP, na.rm = T),
tir_ca_c = sum(TCC, na.rm = T),
tir_p = sum(TP, na.rm = T),
tir_c = sum(TC, na.rm = T)) %>%
ungroup() %>%
arrange((season), desc(pts), desc(diff_but))
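A toy illustration of the effect, using made-up numbers rather than the football data: in a grouped tibble, `mean(but_p)` is computed per group, so two teams with the same `but_p` can end up with different `norm_attack` values. The names `toy`, `grouped`, and `ungrouped` are hypothetical.

```r
library(dplyr)

# Two seasons; teams "a" and "c" both scored 83 goals
toy <- tibble(
  season = c("s1", "s1", "s2", "s2"),
  team   = c("a", "b", "c", "d"),
  but_p  = c(83, 50, 83, 70)
)

# Still grouped by season: the mean is taken within each season
grouped <- toy %>%
  group_by(season) %>%
  mutate(norm_attack = log(but_p / mean(but_p)))

# Ungrouped: one overall mean for all rows
ungrouped <- toy %>%
  mutate(norm_attack = log(but_p / mean(but_p)))

grouped$norm_attack[c(1, 3)]    # differ, although but_p is 83 in both rows
ungrouped$norm_attack[c(1, 3)]  # equal
```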
