How do I format tibbles? - r

I'd like to make a table that looks like this
I have tibbles with each of the data points, but they're not combined.
library('dplyr')
library('ISLR')
data(Hitters)
Hitters <- na.omit(Hitters)
Q <- Hitters %>% group_by(League) %>%
dplyr::summarize(count = n(), avg_wage = sum(Salary)/n())
A <- Hitters %>% group_by(Division) %>%
dplyr::summarize(count = n(), avg_wage = sum(Salary)/n())
Z <- Hitters %>% group_by(NewLeague) %>%
dplyr::summarize(count = n(), avg_wage = sum(Salary)/n())
My goal is to stack the tibbles above each other in one output with shared "count" and "avg_wage" columns. I tried bind_rows() and ftable(), without success.

The problem is that you can't combine rows with different column names so it ends up giving you a confusing dataframe. We can instead use gather() to create two new columns and get the proper table.
library(tidyverse)
library(ISLR)
data(Hitters)
Hitters <- na.omit(Hitters)
Q <- Hitters %>% group_by(League) %>%
dplyr::summarize(count = n(), avg_wage = sum(Salary)/n())
A <- Hitters %>% group_by(Division) %>%
dplyr::summarize(count = n(), avg_wage = sum(Salary)/n())
Z <- Hitters %>% group_by(NewLeague) %>%
dplyr::summarize(count = n(), avg_wage = sum(Salary)/n())
list(Q,A,Z) %>%
map_df(bind_rows) %>%
gather("league_type", "league_id", c(1, 4, 5)) %>%
filter(!is.na(league_id))
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> # A tibble: 6 x 4
#> count avg_wage league_type league_id
#> <int> <dbl> <chr> <chr>
#> 1 139 542. League A
#> 2 124 529. League N
#> 3 129 624. Division E
#> 4 134 451. Division W
#> 5 141 537. NewLeague A
#> 6 122 535. NewLeague N
Created on 2019-01-21 by the reprex package (v0.2.1)
You can use spread() to get it back to wide format, although I would advise against that. The long version will probably be easier to work with.

Related

Search elements of a single character string in a dataframe column to subset it

I have two dataframes:
set.seed(1)
df1 <- data.frame(k1 = "AFD(1);Acf(2);Vgr7(2);"
,k2 = "ABC(7);BHG(46);TFG(675);")
df2 <- data.frame(site =c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);",
"XX(275);", "ABC(7);", "ABC(9);")
,p1 = rnorm(6, mean = 5, sd = 2)
,p2 = rnorm(6, mean = 6.5, sd = 2))
The first dataframe is in fact a list of often very long strings, made of 'elements". Each "element" is made of a few letters/numbers, followed by a number in brackets, followed by a semicolon. In this example I only put 3 "elements" into each string, but in my real dataframe there are tens to hundreds of them.
> df1
k1 k2
1 AFD(1);Acf(2);Vgr7(2); ABC(7);BHG(46);TFG(675);
The second dataframe shares some of the "elements" with df1. Its first column, called site, contains some (not all) "elements" from the first dataframe, sometimes the "element" forms the whole string, and sometimes is a part of a longer string:
> df2
site p1 p2
1 AFD(1);AFD(2); 4.043700 3.745881
2 Acf(2); 5.835883 5.670011
3 TFG(677); 7.717359 5.711420
4 XX(275); 4.794425 6.381373
5 ABC(7); 5.775343 8.700051
6 ABC(9); 4.892390 8.026351
I would like to filter the whole df2 using df2$site and each k column from df1 (there are many K columns, not all of them contain k in the names).
The easiest way to explain this is to show how the desired output would look like.
> outcome
k site p1 p2
1 k1 AFD(1);AFD(2): 4.043700 3.745881
2 k1 Acf(2); 5.835883 5.670011
3 k2 ABC(7); 5.775343 8.700051
The first column of the outcome dataframe corresponds to the column names in df1. The second column corresponds to the site column of df2 and contains only sites from df1 columns that were found in df2$sites. Other columns are from df2.
I appreciate that this question is made of two separate "problems", one grepping-related and one related to looping through df1 columns. I decided to show the task in its entirety in case there exists a solution that addresses both in one go.
FAILED SOLUTION 1
I can create a string to grep, but for each column separately:
# this replaces the semicolons with "|", but does not escape the brackets.
k1_pattern <- df1 %>%
select(k1) %>%
deframe() %>%
str_replace_all(";","|")
And then I am not sure how to use it. This (below) didn't work, maybe because I didn't escape brackets, but I am struggling with doing it:
k1_result <- df2 %>%
filter(grepl(pattern = k1_pattern, site))
But even if it did work, it would only deal with a single column from df1, and I have many, and would like to perform this operation on all df1 columns at the same time.
FAILED SOLUTION 2
I can create a list of sites to search in df2 from columns in df1:
k1_sites<- df1 %>%
select(k1) %>%
deframe() %>%
strsplit(., "[;]") %>%
unlist()
but the delimiter is lost here, and %in% cannot be used, as the match will sometimes be partial.
library(dplyr)
df2 %>%
mutate(site_list = strsplit(site, ";")) %>%
rowwise() %>%
filter(length(intersect(site_list,
unlist(strsplit(x = paste0(c(t(df1)), collapse=""),
split = ";")))) != 0) %>%
select(-site_list)
#> # A tibble: 3 x 3
#> # Rowwise:
#> site p1 p2
#> <chr> <dbl> <dbl>
#> 1 AFD(1);AFD(2); 3.75 7.47
#> 2 Acf(2); 5.37 7.98
#> 3 ABC(7); 5.66 9.52
Updated answer:
library(dplyr)
library(tidyr)
df1 %>%
rownames_to_column("id") %>%
pivot_longer(-id, names_to = "k", values_to = "site") %>%
separate_rows(site, sep = ";") %>%
filter(site != "") %>%
select(-id) -> df1_k
df2 %>%
tibble::rownames_to_column("id") %>%
separate_rows(site, sep = ";") %>%
full_join(., df1_k, by = c("site")) %>%
group_by(id) %>%
fill(k, .direction = "downup") %>%
filter(!is.na(id) & !is.na(k)) %>%
summarise(k = first(k),
site = paste0(site, collapse = ";"),
p1 = first(p1),
p2 = first(p2), .groups = "drop") %>%
select(-id)
#> # A tibble: 3 x 4
#> k site p1 p2
#> <chr> <chr> <dbl> <dbl>
#> 1 k1 AFD(1);AFD(2); 3.75 7.47
#> 2 k1 Acf(2); 5.37 7.98
#> 3 k2 ABC(7); 5.66 9.52
Here's a way going to a long format for exact matching (so no regex):
library(dplyr)
library(tidyr)
df1_long = df1 |> stack() |>
separate_rows(values, sep = ";") |>
filter(values != "")
df2 |>
mutate(id = row_number()) |>
separate_rows(site, sep = ";") |>
filter(site != "") |>
left_join(df1_long, by = c("site" = "values")) %>%
group_by(id) |>
filter(any(!is.na(ind))) %>%
summarize(
site = paste(site, collapse = ";"),
across(-site, \(x) first(na.omit(x)))
)
# # A tibble: 3 × 5
# id site p1 p2 ind
# <int> <chr> <dbl> <dbl> <fct>
# 1 1 AFD(1);AFD(2) 3.75 7.47 k1
# 2 2 Acf(2) 5.37 7.98 k1
# 3 5 ABC(7) 5.66 9.52 k2

Standard deviation of average events per ID in R

Background
I've got this dataset d:
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
stringsAsFactors=FALSE)
It's got 2 people (IDs) in it, and they each have some events.
The problem
I'm trying to get an average number (count) of events per person, along with a standard deviation for that average, all in one result (it can be a dataframe or not, doesn't matter).
In other words I'm looking for something like this:
| Mean | SD |
|------|------|
| 4.00 | 2.83 |
What I've tried
I'm not far off, I don't think -- it's just that I've got 2 separate pieces of code doing these calculations. Here's the mean:
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(ratio = mean(event))
# A tibble: 1 x 1
ratio
<dbl>
1 4
And here's the SD:
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(sd = sd(event))
# A tibble: 1 x 1
sd
<dbl>
1 2.83
But I when I try to pipe them together like so...
d %>%
group_by(ID) %>%
summarise(event = length(event)) %>%
summarise(ratio = mean(event)) %>%
summarise(sd = sd(event))
... I get an error:
Error in `h()`:
! Problem with `summarise()` column `sd`.
i `sd = sd(event)`.
x object 'event' not found
Any insight?
You have to put the last two calls to summarise() in the same call. The only remaining columns after summarise() will be those you named and the grouping columns, so after your second summarise, the event column no longer exists.
library(dplyr)
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
stringsAsFactors=FALSE)
d %>%
group_by(ID) %>%
# the next summarise will be within ID
summarise(event = length(event)) %>%
# this summarise is overall
summarise(sd = sd(event),
ratio = mean(event))
#> # A tibble: 1 × 2
#> sd ratio
#> <dbl> <dbl>
#> 1 2.83 4
The code is a bit confusing because you are renaming the event variable, and doing the first summarise() within groups and the second without grouping. This code would be a little easier to read and get the same result:
d %>%
count(ID) %>%
summarise(sd = sd(n),
ratio = mean(n))
Created on 2022-05-25 by the reprex package (v2.0.1)

Ontime percentage calculations

I need to calculate the overall ontime percentage of each airline with this sample dataset.
library(tidyverse)
library(dplyr)
df_chi <- tribble(
~airline, ~ontime, ~qty,~dest,
'delta',TRUE,527,'CHI',
'delta',FALSE,92,'CHI',
'american',TRUE,4229,'CHI',
'american',FALSE,825,'CHI'
)
df_nyc <- tribble(
~airline, ~ontime, ~qty,~dest,
'delta',TRUE,1817,'NYC',
'delta',FALSE,567,'NYC',
'american',TRUE,1651,'NYC',
'american',FALSE,625,'NYC'
)
I have a solution although it is verbose and I want to avoid the numbered index ie [2,2]. Is there a more elegant way using more of the tidyverse?
df_all <- bind_rows(df_chi,df_nyc)
delta_ot <- df_all %>%
filter(airline == "delta") %>%
group_by(ontime) %>%
summarize(total = sum(qty))
delta_ot <- delta_ot[2,2] / sum(delta_ot$total)
american_ot <- df_all %>%
filter(airline == "american") %>%
group_by(ontime) %>%
summarize(total = sum(qty))
american_ot <- american_ot[2,2] / sum(american_ot$total)
As on the ontime column is logical column, use that to subset instead of [2, 2]. Also, instead of doing the filter, do this once by adding the 'airline' as a grouping column
library(dplyr)
bind_rows(df_chi, df_nyc) %>%
group_by(airline, ontime) %>%
summarise(total = sum(qty), .groups = 'drop_last') %>%
summarise(total = total[ontime]/sum(total))
-output
# A tibble: 2 × 2
airline total
<chr> <dbl>
1 american 0.802
2 delta 0.781
Subsetting by logical returns the corresponding value where there are TRUE elements
> c(1, 3, 5)[c(FALSE, TRUE, FALSE)]
[1] 3

R - create a dual entry pivot table

I'm an R newbie so my apologizes if this is a simple question.
I use a lot excel to create "dual entries" tables. It's likely the name 'dual table' is not the most accurate but I wouldn't know how to describe it otherwise.
I basically start from big tables and then create a new one where I average the data grouping by two columns and then I display it as a matrix.
I will share with you a perfectly functional R example I coded myself.
My question is: is there an easier / better way to do it?
This is my working code:
require(dplyr)
df <- mtcars
output_var <- 'disp'
rows_var <- 'cyl'
col_var <- 'am'
output_name <- paste0("Avg. ",output_var)
one_way_table <- df %>%
group_by(eval(parse(text=rows_var)), eval(parse(text=col_var)) ) %>%
summarise(output=mean( eval(parse(text=output_var)) ))
one_way_table <- data.frame(one_way_table, check.rows = F, check.names = F, stringsAsFactors = F)
colnames(one_way_table) <- c(rows_var, col_var, output_name)
unique_row_items <- unique(one_way_table[,rows_var])
unique_col_items <- unique(one_way_table[,col_var])
x_rows <- rep(unique_row_items, length(unique_col_items))
y_cols <- rep(unique_col_items, length(unique_row_items))
new_df <- data.frame(x = x_rows, y = y_cols, check.rows = F, check.names = F, stringsAsFactors = F)
colnames(new_df) <- c(rows_var, col_var)
new_df <- base::merge(new_df, one_way_table, by = c(rows_var, col_var), all.x=T)
m <- matrix(new_df[, output_name], ncol= length(unique(new_df[,col_var])) )
df_matrix <- data.frame(m, check.rows = F, check.names = F, stringsAsFactors = F)
Perhaps there's a way more efficient way to do it.
Notice how, since this will be coded inside a function, I had to use variable names do define what columns I want to use for the analysis.
Thanks
A possible solution for your issue can come from tidyverse. Here an example reshaping your data and aggregating with mean:
library(tidyverse)
#Data
df <- mtcars
#Code
df %>% pivot_longer(cols = -c(cyl,am)) %>% filter(name=='disp') %>%
group_by(cyl,am) %>% summarise(Mean=mean(value)) %>%
pivot_wider(names_from = am,values_from=Mean)
Output:
# A tibble: 3 x 3
# Groups: cyl [3]
cyl `0` `1`
<dbl> <dbl> <dbl>
1 4 136. 93.6
2 6 205. 155
3 8 358. 326
Which is close to df_matrix the final output of your code.
If we need to pivot, this can be done in a more simple way. We select the columns of interest and use pivot_wider with values_fn specifying as mean to be applied on the columns selected on values_from
library(dplyr)
library(tidyr)
mtcars %>%
select(cyl, am, disp) %>%
pivot_wider(names_from = am, values_from = disp, values_fn = mean)
# A tibble: 3 x 3
# cyl `1` `0`
# <dbl> <dbl> <dbl>
#1 6 155 205.
#2 4 93.6 136.
#3 8 326 358.

Implicit NA's making a table in the Tidyverse

I am running into an error message when trying to create a table using tidyverse. The error message reads
"Factor Com.Race contains implicit NA, consider using
forcats::fct_explicit_na".
I am noob when it comes to the tidyverse. So I haven't been able to try much.
Major_A <- rep("Major A", times=150)
set.seed(1984)
gender <- sample(c("Female","Male"), prob=c(.95,.05),size=150, replace=T)
race.asian <- sample(c("Y","N"),prob= c(.01,.99),size=150, replace=T)
race.black <- sample(c("Y","N"),prob= c(.1,.9),size=150, replace=T)
race.AmInd <- sample(c("Y","N"),prob= c(.01,.99),size=150, replace=T)
race.hawa <- sample(c("Y","N"),prob= c(.01,.99),size=150, replace=T)
race.hisp <- sample(c("Y","N"),prob= c(.02,.98),size=150, replace=T)
race.white <- sample(c("Y","N"),prob=c(.8,.2),size=150,replace=T)
race.NotR <- sample(c("Y","N"),prob=c(.01,.98),size=150,replace=T)
degree <- sample(c("BA","MAT"),prob=c(.9,.1),size=150,replace=T)
enroll <- data.frame(Major_A,gender,race.asian,race.black,race.AmInd,race.hawa,race.hisp,race.white, race.NotR, degree)
multi.race_fun <- function(dat,startr,endr){
dat$multi <- rowSums(dat[,startr:endr]=="Y")
return(dat)
}
enroll.multiR <- multi.race_fun(enroll,3,9)
# load comrace function
com_race.fun <- function(dat){
dat$Com.Race <- ifelse(dat$race.hisp=="Y","Hispanic",
ifelse(dat$race.black=="Y" & dat$multi==1, "African Am",
ifelse(dat$race.AmInd=="Y" & dat$multi==1,"Native Am",
ifelse(dat$race.asian=="Y" & dat$multi==1,"Asian",
ifelse(dat$race.hawa=="Y" & dat$multi==1, "Hawaiian",
ifelse(dat$race.white=="Y" & dat$multi==1,"Caucasian",
ifelse(dat$multi>=2,"Two or More Races","Not Reported")))))))
return(dat)
}
# run comrace function
enroll.comR <- com_race.fun(enroll.multiR)
enroll.comR$gender <- factor(enroll.comR$gender, levels= c("Female", "Male"))
enroll.comR$Com.Race <- factor(enroll.comR$Com.Race, levels=c("African Am","Asian","Caucasian","Hawaiian","Hispancic","Two or More Races", "Not Reported"))
library(tidyverse)
gen_race.tbl<- enroll.comR%>%
group_by(Com.Race, gender, .drop = FALSE) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(perc = (count/sum(count)*100)) %>%
gather(key, value, -gender, -Com.Race) %>%
unite(Com.Race, Com.Race, key) %>%
spread(Com.Race, value)
I would like the code to produce a table with counts and percents for all level of the gender and Com.Race variables.
I would suggest using gather() from dplyr to restructure your wide-format data right at the start, then you can summarize the counts/percentages for each level of the gender and ethnicity variables. Using reshape2::dcast() at the end will give your desired output, but spread() can also be used.
# toy data set
df <- data.frame(gender=sample(c('M','F'),100,T,prob=c(0.9,0.1)),
ethn.a=sample(c('Y','N'),100,T,prob=c(0.8,0.2)),
ethn.b=sample(c('Y','N'),100,T,prob=c(0.7,0.3)),
ethn.c=sample(c('Y','N'),100,T,prob=c(0.25,0.75)),
ethn.d=sample(c('Y','N'),100,T,prob=c(0.95,0.05)))
# gather wide data, group by gender/ethnicity, summarise, reshape to wide format
df %>% gather(k,v,-gender) %>% group_by(gender,k,v) %>%
summarise(n=n()) %>% mutate(perc=round((n/sum(n))*100,2)) %>%
mutate(cell=paste0(n,' (',sprintf("%.1f",perc),'%)')) %>%
select(-n,-perc) %>%
filter(v=='Y') %>% reshape2::dcast(v~k+gender, value.var = 'cell')
v ethn.a_F ethn.a_M ethn.b_F ethn.b_M ethn.c_F ethn.c_M ethn.d_F ethn.d_M
1 Y 11 (84.6%) 69 (79.3%) 10 (76.9%) 66 (75.9%) 3 (23.1%) 28 (32.2%) 12 (92.3%) 87 (100.0%)
# using spread()
df %>% gather(k,v,-gender) %>% group_by(gender,k,v) %>%
summarise(n=n()) %>% mutate(perc=round((n/sum(n))*100,2)) %>%
mutate(cell=paste0(n,' (',sprintf("%.1f",perc),'%)')) %>%
select(-n,-perc) %>%
filter(v=='Y') %>%
spread(k,cell,fill=0)
# A tibble: 2 x 6
# Groups: gender [2]
gender v ethn.a ethn.b ethn.c ethn.d
<fct> <chr> <chr> <chr> <chr> <chr>
1 F Y 11 (84.6%) 10 (76.9%) 3 (23.1%) 12 (92.3%)
2 M Y 69 (79.3%) 66 (75.9%) 28 (32.2%) 87 (100.0%)

Resources