Re-arrange data so a single cell is header - r

I am looking to re-arrange my data. Currently it looks like data 1 and I would like for it to look like data2. Essentially, I would like to move 'total' so that it is its own column, and I'd like to move its n along with it. I am using R. Thank you.
data1 <- data.frame (
question = c("recommend", "recommend", "overall", "overall"),
response = c("top box score", "total", "top box score", "total"),
n = c(673, 784, 654, 784))
data2 <- data.frame (
question = c("recommend", "overall"),
response = c("top box score", "top box score"),
n = c(673, 654),
total = c(784, 784))

You can use data.table as follows:
library(data.table)
data2 <- setDT(data1)[response != "total"][data1, total := i.n, on = "question"]

One way would be to filter data for "total" rows, get them in wide format and join to the original data without "total" rows.
library(dplyr)
library(tidyr)
data1 %>%
filter(response != 'total') %>%
left_join(data1 %>%
filter(response == 'total') %>%
pivot_wider(names_from = response, values_from = n), by = 'question')
# question response n total
#1 recommend top box score 673 784
#2 overall top box score 654 784

Related

Trying to calculate percentages after using dplyr::count

I am trying to calculate the percentages for cigarettes smoking status by sex (for example, the % of males/females who are Non-smokers, Occasional smokers, Prefer not to say, Regular smokers etc). The default seems to calculate the percentage from the Row Total and not the Column Total. Any help would be greatly appreciated.
Dataframe
structure(list(sex = c("Female", "Male", "Female", "Female"),
cigarettes_smoking_status = c("Non-smoker", "Non-smoker",
"Non-smoker", "Non-smoker")), row.names = c(NA, 4L), class = "data.frame")
Code
smoking_status_by_sex <- smoking_data %>%
group_by(sex) %>%
dplyr::count(cigarettes_smoking_status) %>%
pivot_wider(names_from = sex, values_from = n) %>% #increase number of columns & reduce rows
adorn_totals(c("row", "col") )
smoking_status_by_sex_per <- smoking_status_by_sex %>%
mutate(female_pct = round((100*.[[2]]/Total),digits =2),
male_pct = round((100*.[[3]]/Total),digits =2),
prefer_not_to_say_pct = round((100*.[[4]]/Total), digits=2),
unknown_pct = round((100*.[[5]]/Total),digits =2),
total_pct = round((100*.[[6]]/Total), digits=2))
This is the table I am trying to replicate below
[What I am trying to replicate][1]
[1]: https://i.stack.imgur.com/hhDA4.png
I have tried using count, colSum, adorn_totals etc and then tried to use pivot_wider. Any help would be greatly appreciated.
Its easier to group_by sex and smoking status and then compute the relative frequencies. An example is given below.
library(tidyverse)
df<-starwars
df %>%
group_by(eye_color,skin_color) %>% ##grouping by eyecolor and skin color!
summarise(count1=n()) %>%
mutate(grouppercentage=(count1/sum(count1))*100)

gtsummary tbl_merge with multiple columns of variable length

I am doing a meta-analysis and would like to use gtsummary for Table 1 (Description of the Included Studies). I would like to have each column be a detail of the study (e.g. Authors, Intervention, Number, etc). Within this MA, there are some studies that have more than 2 interventions, so the rows won't be equal among studies (i.e. first column has 1 row per study, second column variable rows per study, etc).
Here is a dataset for the problem that matches my own dataset.
library(tidyverse)
#Create dataset
MA <-
tibble(
Study = c("Study 1", "Study 2"),
Intervention1 = c("Placebo", "Control"),
Intervention2 = c("Walking", "Running"),
Intervention3 = c("Running", NA),
Number_Int1 = c(21, 19),
Number_Int2 = c(19, 20),
Number_Int3 = c(20, NA)
)
Created on 2022-06-27 by the reprex package (v2.0.1)
I've tried to use tbl_summary and tbl_merge to generate a summary table, but to no avail.
Here is what I would like the table to look like:
Any help would be appreciated.
Ben
I've managed to find a solution using the gt package. Here is the code:
MA %>% pivot_longer(
cols = !Study,
names_to = c(".value", ".value"),
names_pattern = "(.)(.)",
values_drop_na = TRUE
) %>%
rename(Intervention = In) %>%
rename(Number = Nu) %>%
gt(groupname_col = "Study") %>%
tab_stubhead(label = "Study") %>%
tab_options(row_group.as_column = TRUE)
This gives the following output table:
If anyone has any solutions using the gtsummary package, that'd be great.
Thanks,
Ben

{gtExtras} column showing in wrong order in {gt} table when grouped

I am making a gt table showing the progress of individuals towards a goal. In the table, there is a row showing a horizontal bar graph of progress towards that goal (if goal is 50 and score is 40, the bar is at 80%).
However, when I change the order of the gt rows by using the groupname_col argument, the order of the other cells changes, but not the order of the gtExtras gt_plt_bar_pct column, so it's showing the wrong bars for the name and score in that row, instead, that column seems to always be represented in the order of rows in the input data.
I understand that I can fix this by using arrange on the df before the gt begins, but this doesn't seem like a good solution since I'm going to want to change the order of the rows to view by different groups. Is this a flaw with gtExtras? is there a better fix?
thanks!
reprex:
library(tibble)
library(gt)
library(gtExtras)
library(dplyr)
# make dataframe of individuals and their goals
df <- tribble(
~name, ~group, ~score, ~goal,
"Bob", "C", 20, 40,
"Chris", "A", 50, 40,
"Dale", "B", 30, 50,
"Jay", "A", 0, 40,
"Ben", "B", 10, 20
) %>%
# calculate percent towards goal, and cap at 100%
mutate(percent_to_goal = score/goal *100,
percent_to_goal = case_when(percent_to_goal >= 100 ~ 100,
TRUE ~ percent_to_goal))
df %>%
# this fixes the issue, but doesn't seem like a permanent solution
#arrange(group, name) %>%
# make gt table
gt(rowname_col = "name", groupname_col = "group") %>%
# order groups
row_group_order(groups = c("A","B","C")) %>%
# add bar chart column
gt_plt_bar_pct(column = percent_to_goal) %>%
# highlight blue if person reaches their goal
tab_style(
style = list(
cell_fill(color = "lightcyan"),
cell_text(weight = "bold")),
locations = cells_body(
columns = c(goal,score, percent_to_goal),
rows = score >= goal
)
)
Here is the output from the above code: notice that the length of the bar charts do not always reflect the values of the rows they are appearing in. Instead, they reflect the order of the original dataset.
EDIT: remove row_group_order. If I run the above code again, but comment out the line meant to rearrange the appearance of groups, the grouping shows up in a different order (order of appearance of groups in the original dataset), and the name and first two columns sort into these groups accordingly, but the bar chart column still does not, and remains in the original order of the dataset. Image below:
Per gtExtras v 0.2.4 this bug has been fixed. Thanks for raising and the great reprex!
library(tibble)
library(gt)
library(gtExtras)
library(dplyr)
# make dataframe of individuals and their goals
df <- tribble(
~name, ~group, ~score, ~goal,
"Bob", "C", 20, 40,
"Chris", "A", 50, 40,
"Dale", "B", 30, 50,
"Jay", "A", 0, 40,
"Ben", "B", 10, 20
) %>%
# calculate percent towards goal, and cap at 100%
mutate(percent_to_goal = score/goal *100,
percent_to_goal = case_when(percent_to_goal >= 100 ~ 100,
TRUE ~ percent_to_goal))
df %>%
# make gt table
gt(rowname_col = "name", groupname_col = "group") %>%
# order groups
row_group_order(groups = c("A","B","C")) %>%
# add bar chart column
gt_plt_bar_pct(column = percent_to_goal) %>%
# highlight blue if person reaches their goal
tab_style(
style = list(
cell_fill(color = "lightcyan"),
cell_text(weight = "bold")),
locations = cells_body(
columns = c(goal,score, percent_to_goal),
rows = score >= goal
)
)

fct_reorder/ggplot2 not ordering as desired

I am working on the most recent TidyTuesday data and had an issue in my plot. New Jersey is shown above Nashville despite Nashville overall has more values. I am unsure how to fix this.
I think it has something to do with the one tweet by user etmckinley being sorted in Nashville first since it alphabetically comes before username sqlsekou. Perhaps there is a way to reverse the sorting and have it work correctly?
If not, how else can I order the data correctly to have Nashville above New Jersey?
library(tidyverse)
tweets <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-15/tweets.csv')
top_states <- tweets %>%
filter(
location != "iPhone: 34.704040,-86.722909",
location != "Kevin.Elder#GCSU.edu"
) %>%
drop_na(location) %>%
count(location, sort = TRUE) %>%
slice_max(n, n = 7) %>%
pull(location)
tweets %>%
filter(
location != "iPhone: 34.704040,-86.722909",
location != "Kevin.Elder#GCSU.edu"
) %>%
drop_na(location) %>%
count(location, username, sort = TRUE) %>%
filter(location %in% top_states) %>%
mutate(location = fct_reorder(location, n)) %>%
mutate(username = fct_reorder(username, -n)) %>%
ggplot(aes(n, location, fill = username)) +
geom_col() +
scale_fill_brewer(palette = "Set3") +
labs(
x = "Quantity of tweets",
y = "Location",
title = "Tweets by location over 3 month period",
subtitle = "Filled by username"
)
By default, fct_reorder reorders by the median value. Your Nashville bar has 2 components, one big, one small, and the median is half way inbetween. Your NJ bar has only one component, so the median is the full value. Override the default in fct_reorder by setting .fun = sum. See ?fct_reorder for more details.

Formatting an ftable in R

I have the following 3 way table I created in R.
with(dataset, ftable(xtabs(count ~ dos + sex + edu)))
The output looks like
edu high low medium unknown
dos sex
five-to-ten-years female 247776 44916 127133 23793
male 225403 37858 147821 20383
five-years-or-less female 304851 58018 182152 33649
male 253977 55720 193621 28972
more-than-ten-years female 709303 452605 539403 165675
male 629162 309193 689299 121336
native-born female 1988476 1456792 2094297 502153
male 1411509 1197395 2790522 395953
unknown female 57974 75480 73204 593141
male 40176 57786 93108 605542
I want to rename the variables and format the table so that I can include it in a report. I know that I can use dnn to rename the variables, but are there any other recommendations to rename the variables? And to format the table (similar to using kable)?
You could convert the output to a text matrix using the following function, after which you can style with kable however you choose:
ftab_to_matrix <- function(ft)
{
row_vars <- attr(ft, "row.vars")
for(i in seq_along(row_vars)){
row_vars[[i]] <- c(names(row_vars[i]), row_vars[[i]])}
rowvar_widths <- sapply(row_vars, function(x) max(nchar(x))) + 1
col_vars <- attr(ft, "col.vars")
rowvar_widths <- c(1, cumsum(c(rowvar_widths, max(nchar(names(col_vars))))))
ft_text <- capture.output(print(ft))
row_cols <- sapply(seq_along(rowvar_widths)[-1], function(x)
substr(ft_text, rowvar_widths[x - 1], rowvar_widths[x]))
ft_text <- substr(ft_text, rowvar_widths[length(rowvar_widths)] + 2, 100)
ft_breaks <- c(1, cumsum(lapply(strsplit(ft_text[length(ft_text)], "\\d "),
function(x) nchar(x) + 2)[[1]]))
col_cols <- sapply(seq_along(ft_breaks)[-1], function(x)
substr(ft_text, ft_breaks[x - 1], ft_breaks[x]))
trimws(cbind(row_cols, col_cols))
}
So, for example, using my example data from your last question, you could do something like:
my_tab <- with(`3waydata`, ftable(xtabs(count ~ duration + sex + education)))
as_image(kable_styling(kable(ftab_to_df(my_tab))), file = "kable.png")
Might have been easier had you given the full picture when you asked your first question... You could use gt to make fancy tables for reports. This is an edited version more fully demonstrating some capabilities.
library(dplyr)
library(gt)
way3data <- data %>%
group_by(duration, education, sex) %>%
summarise(count = sum(number)) %>%
ungroup
# Reorder with select and Titlecase with stringr
longer <- tidyr::pivot_wider(way3data,
values_from = count,
names_from = "education") %>%
select(duration, sex, high, medium, low, unknown) %>%
rename_with(stringr::str_to_title)
# Demonstrating some of the features of gt
# obviously could have done some of this
# to the original dataframe
myresults <- longer %>%
group_by(Duration) %>%
gt(rowname_col = "Sex") %>%
row_group_order(
groups = c("native-born",
"more-than-ten-years",
"five-to-ten-years",
"five-years-or-less",
"unknown")
) %>%
tab_spanner(label = "Education",
columns = matches("High|Low|Medium|Unknown")) %>%
tab_stubhead(label = "Duration or something") %>%
tab_style(
style = cell_text(style = "oblique", weight = "bold"),
locations = cells_row_groups()) %>%
tab_style(
style = cell_text(align = "right", style = "italic", weight = "bold"),
locations = cells_column_labels(
columns = vars(High, Low, Medium, Unknown)
)) %>%
tab_style(
style = cell_text(align = "right", weight = "bold"),
locations = cells_stub()) %>%
tab_header(
title = "Fancy table of counts with Duration, Education and Gender") %>%
tab_source_note(md("More information is available at https://stackoverflow.com/questions/62284264."))
# myresults
# Can save in other formats including .rtf
myresults %>%
gtsave(
"tab_1.png", expand = 10
)
You can read about all the formatting choices here
Data compliments of Allan
set.seed(69)
data <- data.frame(education = sample(c("high","low","medium","unknown"), 600, T),
sex = rep(c("Male", "Female"), 300),
duration = sample(c("unknown", "native-born",
"five-years-or-less", "five-to-ten-years",
"more-than-ten-years"), 600, T),
number = rpois(600, 10))

Resources