Complex summary table using gtsummary in R

I have selected a few columns from a data set and want to build a summary table with gtsummary. I have run into some issues and am not sure how to make it work.
Part of the data, in reproducible form:
structure(list(country = c("SGP", "JPN", "THA", "CHN", "JPN",
"CHN", "CHN", "JPN", "JPN", "JPN"), Final_Medal = c(NA, NA, NA,
NA, NA, "GOLD", NA, NA, NA, NA), Success = c(0, 0, 0, 0, 0, 1,
0, 0, 0, 0)), row.names = c(NA, 10L), class = "data.frame")
It looks like this:
country Final_Medal Success
SGP NA 0
JPN NA 0
THA NA 0
Final_Medal contains NA, GOLD, SILVER and BRONZE.
Success contains 0 and 1.
All I want is to group by country and count the number of medals and successes for each country.
Desired output:
Country GOLD Silver Bronze Success Total_Entry
SGP 5 2 10 17 50
JPN 4 3 5 12 60
CHN 5 2 6 13 60
Success should count only the 1s, while Total_Entry should count every row regardless of whether Success is 0 or 1.
My code looks like this, but it doesn't work and I am not sure what needs to be changed.
library(gtsummary)
example%>%tbl_summary(
by = country,
missing = "no" # don't list missing data separately
) %>%
bold_labels()

You can do the aggregation in dplyr and use gt/gtsummary only for display purposes (here df is the question's data frame).
library(dplyr)
library(gt)
df %>%
group_by(country) %>%
summarise(Gold = sum(Final_Medal == 'GOLD', na.rm = TRUE),
Silver = sum(Final_Medal == 'SILVER', na.rm = TRUE),
Bronze = sum(Final_Medal == 'BRONZE', na.rm = TRUE),
Success = sum(Success),
Total_Entry = n()) %>%
gt()
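As a cross-check without extra packages, the same per-country counts can be sketched in base R on the ten sample rows from the question. The object name `example` comes from the question; `medal_levels`, `medals` and `counts` are names introduced here for illustration only.

```r
example <- structure(list(country = c("SGP", "JPN", "THA", "CHN", "JPN",
"CHN", "CHN", "JPN", "JPN", "JPN"), Final_Medal = c(NA, NA, NA,
NA, NA, "GOLD", NA, NA, NA, NA), Success = c(0, 0, 0, 0, 0, 1,
0, 0, 0, 0)), row.names = c(NA, 10L), class = "data.frame")

# Fix the medal levels so every medal gets a column even when its count is zero
medal_levels <- c("GOLD", "SILVER", "BRONZE")
medals <- factor(example$Final_Medal, levels = medal_levels)

# Cross-tabulate country by medal; rows with an NA medal add nothing to the counts
counts <- as.data.frame.matrix(table(example$country, medals))

# Success sums the 1s; Total_Entry counts every row of each country
counts$Success <- as.vector(tapply(example$Success, example$country, sum)[rownames(counts)])
counts$Total_Entry <- as.vector(table(example$country)[rownames(counts)])
counts
```

The countries end up as row names, sorted alphabetically; the result can be passed to gt() in the same way as the dplyr summary.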

Related

How to create a pivot table from multiple data.frames in R to write to excel?

I have multiple data.frames with an equal number of columns. I want to combine these into a single pivot table that I can write to Excel.
Example data.frames:
> net_imports[,1:5]
1979 1980 1981 1982 1983
beginning_stocks NA -53 -83 -110 -60.000
production NA -390 -585 -510 -434.996
consumption NA 370 380 390 410.000
ending_stocks 53 83 110 60 46.000
predicted NA 10 -178 -170 -38.996
> area_harvested_output[,1:5]
1979 1980 1981 1982 1983
area_harvested_lag 51.22632 51.2263243 41.6213885 57.6296148 54.4279695
area_harvested_trend 0.00000 0.1007849 0.2015699 0.3023548 0.4031397
import_price_cpi NA 20.4610740 18.7566970 16.8987151 15.2273790
predicted NA 71.7881832 60.5796553 74.8306847 70.0584883
error NA 58.2118168 119.4203447 95.1693153 99.9415117
pred_err NA 130.0000000 180.0000000 170.0000000 170.0000000
I want the resulting table in excel to look something like this
Basically, I just want to keep the data.frame names, such as "net_imports" and "area_harvested_output", as group labels for their rows.
I would reshape both data.frames to long format with pivot_longer, so that year becomes a single column instead of (in your example) five, combine them with rbind or bind_rows, and export the combined long table to Excel, where I would then build the interactive pivot table.
Your example data:
net_imports <- structure(list(parameter = c("beginning_stocks", "production",
"consumption", "ending_stocks", "predicted"), X1979 = c(NA, NA,
NA, 53L, NA), X1980 = c(-53L, -390L, 370L, 83L, 10L), X1981 = c(-83L,
-585L, 380L, 110L, -178L), X1982 = c(-110L, -510L, 390L, 60L,
-170L), X1983 = c(-60, -434.996, 410, 46, -38.996)), class = "data.frame", row.names = c(NA,
5L))
area_harvested_output <- structure(list(parameter = c("area_harvested_lag", "area_harvested_trend", "import_price_cpi", "predicted", "error", "pred_err"), X1979 = c(51.22632,
0, NA, NA, NA, NA), X1980 = c(51.2263243, 0.1007849, 20.461074,
71.7881832, 58.2118168, 130), X1981 = c(41.6213885, 0.2015699,
18.756697, 60.5796553, 119.4203447, 180), X1982 = c(57.6296148,
0.3023548, 16.8987151, 74.8306847, 95.1693153, 170), X1983 = c(54.4279695,
0.4031397, 15.227379, 70.0584883, 99.9415117, 170)), class = "data.frame", row.names = c(NA,
6L))
The code:
library(dplyr)
library(tidyr)
library(rio) ## convenience package for imports/exports
long_table <-
net_imports %>%
pivot_longer(cols = -parameter,
names_to = 'year') %>%
bind_rows(
area_harvested_output %>%
pivot_longer(cols = -parameter,
names_to = 'year')
)
long_table %>% export('long_table.xlsx')
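Note that the long table above does not record which data.frame each row came from. A minimal base R sketch of one way to keep that label follows; the helper `to_long` and the column name `table` are hypothetical names introduced here, and the data are a two-year, two-row excerpt of the example.

```r
net_imports <- data.frame(
  parameter = c("beginning_stocks", "production"),
  X1980 = c(-53, -390),
  X1981 = c(-83, -585),
  stringsAsFactors = FALSE
)
area_harvested_output <- data.frame(
  parameter = c("area_harvested_lag", "area_harvested_trend"),
  X1980 = c(51.2263243, 0.1007849),
  X1981 = c(41.6213885, 0.2015699),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stack the year columns and tag every row with its source table
to_long <- function(df, table_name) {
  year_cols <- setdiff(names(df), "parameter")
  data.frame(
    table = table_name,
    parameter = rep(df$parameter, times = length(year_cols)),
    year = rep(as.integer(sub("^X", "", year_cols)), each = nrow(df)),
    value = unlist(df[year_cols], use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

long_table <- rbind(
  to_long(net_imports, "net_imports"),
  to_long(area_harvested_output, "area_harvested_output")
)
long_table
```

Exporting this `long_table` as above would give Excel a `table` field to group by in the pivot table.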

Rename various rows

I have a starting dataset, created with the following code:
dataset<-data.frame(Attorney=c("John Doe", "Client #1","274", "296",
"297", "Client #2", "633", "Jane Doe",
"Client #1", "309", "323"),
Date=c(NA, NA, "2019/4/4", "2019/4/4", "2019/4/12",
NA, " 2019/2/3", NA, NA, "2019/12/1", "2019/12/4"),
Code=c(NA, NA, "7NP/7NP", "1UE/1UE", "2C1/2C1",NA,
"7NP/7NP", NA, NA, "7NP/7NP", "7FU/7FU"),
Billed_Amount=c(NA, NA, 1200.00, 4000.00, 2775.00,
NA, 1200.00, NA, NA, 1200.00, 385),
Amount= c(NA, NA, "1200", "4000", "2775", NA, "1200",
NA, NA, "1200", "385"),
Current =c(NA, NA, 0, 0, 0, NA, 0, NA, NA, 0, 0),
X.120=c(NA, NA, "1200", "4000", "2775", NA, "1200",
NA, NA, "1200", "385"))
My goal is to end up with a dataset like the one created by this code:
dataset<-data.frame(Attorney=c("John Doe", "John Doe", "John Doe",
"John Doe", "Jane Doe", "Jane Doe"),
Date=c("2019/4/4", "2019/4/4", "2019/4/12", " 2019/2/3",
"2019/12/1","2019/12/4" ),
Code=c("7NP/7NP", "1UE/1UE","2C1/2C1", "7NP/7NP",
"7NP/7NP", "7FU/7FU"),
Billed_Amount=c(1200.00, 4000.00,2775.00, 1200.00,
1200.00, 385),
Amount= c(1200, 4000, 2775, 1200,1200, 385),
Current= c(0, 0, 0, 0, 0, 0),
X.120=c(1200, 4000, 2775,1200, 1200, 385))
I want to rename the rows underneath each attorney with the attorney's name, without worrying about preserving the client's name. My original dataset has a number of attorneys, each with a varying number of clients, and those clients have varying numbers of codes, dates, and amounts associated with them.
I tried to use an if/else statement but got an error message.
I appreciate any help you can give me. Thanks!
Edit: I have edited my question to include hypothetical attorney names.
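One option is to create a grouping variable based on the presence of the 'Attorney' substring in the 'Attorney' column, then mutate the 'Attorney' column to the first element of 'Attorney' within each 'grp' group, and filter out the rows containing NA. The pattern matches labels such as 'Attorney #1', as in the output shown further down; adjust it if your attorney rows are labelled differently.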
library(dplyr)
library(stringr)
dataset %>%
group_by(grp = cumsum(str_detect(Attorney, "^Attorney"))) %>%
mutate(Attorney = first(Attorney)) %>%
filter_at(vars(Date:X.120), all_vars(!is.na(.))) %>%
ungroup %>%
select(-grp)
We can also use na.omit here:
dataset %>%
group_by(grp = cumsum(str_detect(Attorney, "^Attorney"))) %>%
mutate(Attorney = first(Attorney)) %>%
ungroup %>%
select(-grp) %>%
na.omit
# A tibble: 6 x 7
# Attorney Date Code Billed_Amount Amount Current X.120
# <fct> <fct> <fct> <dbl> <fct> <dbl> <fct>
#1 Attorney #1 "2019/4/4" 7NP/7NP 1200 1200 0 1200
#2 Attorney #1 "2019/4/4" 1UE/1UE 4000 4000 0 4000
#3 Attorney #1 "2019/4/12" 2C1/2C1 2775 2775 0 2775
#4 Attorney #1 " 2019/2/3" 7NP/7NP 1200 1200 0 1200
#5 Attorney #2 "2019/12/1" 7NP/7NP 1200 1200 0 1200
#6 Attorney #2 "2019/12/4" 7FU/7FU 385 385 0 385
Another option is to fill the 'Attorney' column after replacing the elements that do not contain the 'Attorney' substring with NA, so that each NA is filled with the previous non-NA element, and then apply na.omit:
library(tidyr)
dataset %>%
mutate(Attorney = replace(Attorney, !str_detect(Attorney, "Attorney"), NA)) %>%
fill(Attorney) %>%
na.omit
Base R solution (using @akrun's logic):
data.frame(do.call("rbind",
lapply(split(dataset, cumsum(!(grepl("\\d+", dataset$Attorney)))),
function(x){
non_att_cols <- names(x)[names(x) != "Attorney"]
y <- data.frame(na.omit(x[,non_att_cols]))
y$Attorney <- x$Attorney[1]
return(y[,c("Attorney", non_att_cols)])
}
)
),
row.names = NULL
)
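For the edited data, where attorney rows carry plain names such as "John Doe", the fill-down idea can also be sketched in base R as follows. This is only an illustration under two stated assumptions: an attorney row is any row that is neither a "Client ..." row nor a purely numeric row, and the first row is an attorney. The names `is_attorney` and `result` are introduced here, and only three of the columns are reproduced.

```r
dataset <- data.frame(
  Attorney = c("John Doe", "Client #1", "274", "296", "297", "Client #2", "633",
               "Jane Doe", "Client #1", "309", "323"),
  Date = c(NA, NA, "2019/4/4", "2019/4/4", "2019/4/12", NA, " 2019/2/3",
           NA, NA, "2019/12/1", "2019/12/4"),
  Billed_Amount = c(NA, NA, 1200, 4000, 2775, NA, 1200, NA, NA, 1200, 385),
  stringsAsFactors = FALSE
)

# Assumption: attorney rows are neither "Client ..." rows nor purely numeric rows
is_attorney <- !grepl("^Client", dataset$Attorney) & !grepl("^[0-9]+$", dataset$Attorney)

# Carry each attorney name down to the rows below it (assumes row 1 is an attorney)
dataset$Attorney <- dataset$Attorney[is_attorney][cumsum(is_attorney)]

# Drop the attorney and client header rows, which have no data in the other columns
result <- dataset[!is.na(dataset$Date), ]
rownames(result) <- NULL
result
```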

Iterate through columns' suffixes in a for loop. R

I am trying to modify my dataset with a for loop. I want to modify certain cells of some columns depending on the value of its "paired" column. My dataset could be:
data1989 <- data.frame("date" = c("1987-01-01", "1987-01-03", "1987-01-19"),
"NDVI_1" = c(NA, 0.589, 0.120),
"NDVI_3" = c(NA, 0.447, NA),
"NDVI_4" = c(NA, NA, NA),
"pixelQA_1" = c(NA, 66.897,90.599),
"pixelQA_3" = c(NA, 66.097,NA),
"pixelQA_4" = c(NA, NA, NA),
stringsAsFactors = FALSE)
> data1989
date NDVI_1 NDVI_3 NDVI_4 pixelQA_1 pixelQA_3 pixelQA_4
1 1987-01-01 NA NA NA NA NA NA
2 1987-01-03 0.589 0.447 NA 66.897 66.097 NA
3 1987-01-19 0.120 NA NA 90.599 NA NA
Columns are "paired" by their suffix, so NDVI_1 is paired with pixelQA_1, and so on. I want to modify the values in the NDVI columns depending on the values in the paired pixelQA column, as follows:
If pixelQA is NA, then NDVI should also be NA.
If pixelQA is 66±0.5 or 130±0.5, then NDVI keeps its value.
If pixelQA is anything other than 66±0.5 or 130±0.5, then NDVI is set to NA (this is bad-quality data that needs to be ignored).
Applying these very simple rules my data should look like:
data1989clean <- data.frame("date" = c("1987-01-01", "1987-01-03", "1987-01-19"),
"NDVI_1" = c(NA, NA, NA),
"NDVI_3" = c(NA, 0.447, NA),
"NDVI_4" = c(NA, NA, NA),
"pixelQA_1" = c(NA, 66.897,90.599),
"pixelQA_3" = c(NA, 66.097,NA),
"pixelQA_4" = c(NA, NA, NA),
stringsAsFactors = FALSE)
> data1989clean
date NDVI_1 NDVI_3 NDVI_4 pixelQA_1 pixelQA_3 pixelQA_4
1 1987-01-01 NA NA NA NA NA NA
2 1987-01-03 NA 0.447 NA 66.897 66.097 NA
3 1987-01-19 NA NA NA 90.599 NA NA
To reach my goal I am trying the following for loop:
for(i in 1:4){
data1989$NDVI_[i] <- ifelse(data1989$pixelQA_[i] < 66.5 & data1989$pixelQA_[i] > 65.5 |
data1989$pixelQA_[i] < 130.5 & data1989$pixelQA_[i] > 129.5,
data1989$NDVI_[i], NA)
}
But so far it is not working, as the dataset output looks exactly the same as the original one. Any suggestion will be welcomed.
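As suggested by @George Savva, you can achieve this by pivoting longer, correcting the data, and pivoting back wider. Using the tidyverse, that gives:
library(tidyverse)
newdd1 <-
# start from the original wide data
data1989 %>%
# reshape to long: one row per date and suffix ("set"), with NDVI and pixelQA as columns
pivot_longer(cols = -date,
names_to = c(".value", "set"),
names_sep = "_") %>%
# keep NDVI only where pixelQA lies within 66 ± 0.5 or 130 ± 0.5; otherwise set it to NA
mutate(NDVI = case_when(is.na(pixelQA) ~ NA_real_,
between(pixelQA, 65.5, 66.5) ~ NDVI,
between(pixelQA, 129.5, 130.5) ~ NDVI,
TRUE ~ NA_real_)) %>%
# reshape back to wide: NDVI_1, NDVI_3, ..., pixelQA_1, pixelQA_3, ...
pivot_wider(names_from = set,
values_from = c(NDVI, pixelQA))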
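If a loop is preferred, the original attempt can be repaired in base R. In `data1989$NDVI_[i]`, the `i` is not pasted onto the column name: R looks up a column called `NDVI_` and then takes its i-th element. A minimal sketch that builds the column names with `paste0()` and `[[` instead is shown below; `suffixes`, `qa` and `good` are names introduced here for illustration.

```r
data1989 <- data.frame(date = c("1987-01-01", "1987-01-03", "1987-01-19"),
                       NDVI_1 = c(NA, 0.589, 0.120),
                       NDVI_3 = c(NA, 0.447, NA),
                       NDVI_4 = c(NA, NA, NA),
                       pixelQA_1 = c(NA, 66.897, 90.599),
                       pixelQA_3 = c(NA, 66.097, NA),
                       pixelQA_4 = c(NA, NA, NA),
                       stringsAsFactors = FALSE)

# Derive the suffixes from the NDVI column names rather than hard-coding 1:4
suffixes <- sub("^NDVI_", "", grep("^NDVI_", names(data1989), value = TRUE))

for (s in suffixes) {
  qa <- data1989[[paste0("pixelQA_", s)]]
  # TRUE only where pixelQA is present and within 66 ± 0.5 or 130 ± 0.5
  good <- !is.na(qa) & (abs(qa - 66) <= 0.5 | abs(qa - 130) <= 0.5)
  # Blank out every NDVI value whose paired pixelQA is missing or out of range
  data1989[[paste0("NDVI_", s)]][!good] <- NA
}
data1989
```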

Sum all cells to the right of a column in each row using Dplyr

So I've seen many pages on the generalized version of this issue but here specifically I would like to sum all values in a row after a specific column.
Let's say we have this df:
id city identity q1 q2 q3
0110 detroit ella 2 4 3
0111 boston fitz 0 0 0
0112 philly gerald 3 1 0
0113 new_york doowop 8 11 2
0114 ontario wazaaa NA 11 NA
Now, the dfs I work with don't always have 3 "q" variables; the number varies. Hence, I would like to compute a row sum for every row, but only over the columns that come after the column identity.
Rows with NA are to be ignored.
Eventually I would like to take the rows which sum to 0 to be removed and end with a df that looks like this:
id city identity q1 q2 q3
0110 detroit ella 2 4 3
0112 philly gerald 3 1 0
0113 new_york doowop 8 11 2
Doing this in dplyr is the preference but not required.
EDIT:
I have added below the data for which this solution does not work; apologies for the confusion.
df <- structure(list(Program = c("3002", "111", "2455", "2929", "NA",
"NA", NA), Project_ID = c("299", "11", "271", "780", "207", "222",
NA), Advance_Identifier = c(14, 24, 12, 15, NA, 11, NA), Sequence = c(6,
4, 4, 5, 2, 3, 79), Item = c("payment", "hero", "prepayment_2",
"UPS", "period", "prepayment", "yeet"), q1 = c("500", "12", "-1",
"0", NA, "0", "0"), q2 = c("500", "12", "-1", "0", NA, "0", "1"
), q3 = c("500", "12", "2", "0", NA, "0", "2"), q4 = c("500",
"13", "0", "0", NA, "0", "3")), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
Base R version with zero extra dependencies:
[Edit: I always forget rowSums exists]
> df1$new = rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)
> df1
id city identity q1 q2 q3 new
1 110 detroit ella 2 4 3 9
2 111 boston fitz 0 0 0 0
3 112 philly gerald 3 1 0 4
4 113 new_york doowop 8 11 2 21
If you need to convert characters to numbers, use apply with as.numeric (with margin 1, the function receives one row at a time):
df$new = apply(df[,(1+which(names(df)=="Item")):ncol(df),drop=FALSE], 1, function(row){sum(as.numeric(row))})
BUT look out if they are really factors because this will fail, which is why converting things that look like numbers to numbers before you do anything else is a Good Thing.
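A minimal sketch of the "convert first, then sum" approach on an excerpt of the updated data follows. It assumes the q columns are character (not factor), treats NA as ignorable via `na.rm = TRUE`, and drops rows whose sum is exactly 0; `q_cols`, `total` and `result` are names introduced here.

```r
df <- data.frame(
  Item = c("payment", "hero", "prepayment_2", "UPS", "period"),
  q1 = c("500", "12", "-1", "0", NA),
  q2 = c("500", "12", "-1", "0", NA),
  q3 = c("500", "12", "2", "0", NA),
  stringsAsFactors = FALSE
)

# Every column to the right of "Item" is treated as a value column
q_cols <- names(df)[(match("Item", names(df)) + 1):ncol(df)]

# Convert the character columns to numeric once, before any arithmetic
df[q_cols] <- lapply(df[q_cols], as.numeric)

# Row sums that ignore NA; a row of only NA therefore sums to 0
df$total <- unname(rowSums(df[q_cols], na.rm = TRUE))

# Keep only the rows whose sum is not zero
result <- df[df$total != 0, ]
result
```

Note that a row such as prepayment_2 (-1, -1, 2) also sums to 0 and is removed; filter on a different condition if such rows should be kept.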
Benchmark
In case you are worried about speed here's a benchmark test of my function against the currently accepted solution:
akrun = function(df1){df1 %>%
mutate(new = rowSums(select(., ((match('identity', names(.)) +
1):ncol(.))), na.rm = TRUE))}
baz = function(df1){rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)}
sample data
df = data.frame(id=sample(100,100), city=sample(LETTERS,100,TRUE), identity=sample(letters,100,TRUE), q1=runif(100), q2=runif(100),q3=runif(100))
Test. Note that I remove the new column from the source data frame before each run; otherwise the code would keep adding it. (akrun does not modify df in place, but in the benchmark it can run after baz has already assigned the new column to df.)
> microbenchmark({df$new=NULL;df2 = akrun(df)},{df$new=NULL;df$new=baz(df)})
Unit: microseconds
expr min lq mean
{ df$new = NULL df2 = akrun(df) } 1300.682 1328.941 1396.63477
{ df$new = NULL df$new = baz(df) } 63.102 72.721 87.78668
median uq max neval
1376.9425 1398.5880 2075.894 100
84.3655 86.7005 685.594 100
The tidyverse version takes about 16 times as long as the base R version.
We can use
out <- df1 %>%
mutate(new = rowSums(select(., ((match('identity', names(.)) +
1):ncol(.))), na.rm = TRUE))
out
# id city identity q1 q2 q3 new
#1 110 detroit ella 2 4 3 9
#2 111 boston fitz 0 0 0 0
#3 112 philly gerald 3 1 0 4
#4 113 new_york doowop 8 11 2 21
and then filter out the rows that have 0 in 'new'
out %>%
filter(new >0)
In the OP's updated dataset, the columns are of type character. We can automatically convert them to appropriate types with
df %>%
#type.convert %>% # base R
# or with `readr::type_convert`
type_convert %>%
...
NOTE: The OP asked for a tidyverse option in both the title and the description. The question is not about efficiency.
Also, rowSums is a base R function; here we showed how to use it in a tidyverse chain. I could just as well have written the answer in base R with the same function.
If we remove the select, it becomes plain base R, i.e.
df1$new <- rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)
Benchmarks
df = data.frame(id=sample(100,100), city=sample(LETTERS,100,TRUE),
identity=sample(letters,100,TRUE), q1=runif(100), q2=runif(100),q3=runif(100))
akrun = function(df1){
rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)
}
baz = function(df1){rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)}
microbenchmark({df$new=NULL;df2 = akrun(df)},{df$new=NULL;df$new=baz(df)})
#Unit: microseconds
# expr min lq mean median uq max neval
# { df$new = NULL df2 = akrun(df) } 69.926 73.244 112.2078 75.4335 78.7625 3539.921 100
# { df$new = NULL df$new = baz(df) } 73.670 77.945 118.3875 80.5045 83.5100 3767.812 100
data
df1 <- structure(list(id = 110:113, city = c("detroit", "boston", "philly",
"new_york"), identity = c("ella", "fitz", "gerald", "doowop"),
q1 = c(2L, 0L, 3L, 8L), q2 = c(4L, 0L, 1L, 11L), q3 = c(3L,
0L, 0L, 2L)), class = "data.frame", row.names = c(NA, -4L
))
Similar to akrun's answer, you can try
df %>%
# convert the character q columns to numeric (across() replaces the deprecated funs())
mutate(across(starts_with("q"), as.numeric)) %>%
mutate(sum_new = rowSums(select(., starts_with("q")), na.rm = TRUE)) %>%
filter(sum_new>0)
Here I use reduce from purrr to sum the rows; it's the fastest way.
library(tidyverse)
data %>% filter_at(vars(starts_with('q')),~!is.na(.)) %>%
mutate( Sum = reduce(select(., starts_with("q")), `+`)) %>%
filter(Sum > 0)

Need to separate out variable names from a column in r

So I have a pretty messy dataset whose source I am not allowed to change. I would like to take the column "Draw_CashFlow" and turn only certain values into their own columns. Additionally, I need to gather the period columns into a single column (Period), going from wide to tidy, if you will.
In the dataset below, the column Draw_CashFlow begins with the variable in question, followed by a list of IDs, and then repeats for the next variable. Some variables may have NA entries.
structure(list(Draw_CashFlow = c("Principal", "R01",
"R02", "R03", "Workout Recovery Principal",
"Prepaid Principal", "R01", "R02", "R03",
"Interest", "R01", "R02"), `PERIOD 1` = c(NA,
834659.51, 85800.18, 27540.31, NA, NA, 366627.74, 0, 0, NA, 317521.73,
29175.1), `PERIOD 2` = c(NA, 834659.51, 85800.18, 27540.31, NA,
NA, 306125.98, 0, 0, NA, 302810.49, 28067.8), `PERIOD 3` = c(NA,
834659.51, 85800.18, 27540.31, NA, NA, 269970.12, 0, 0, NA, 298529.92,
27901.36), `PERIOD 4` = c(NA, 834659.51, 85800.18, 27540.31,
NA, NA, 307049.06, 0, 0, NA, 293821.89, 27724.4)), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
The list of variables needed is finite (Principal, Workout Recovery Principal, Prepaid Principal, and Interest), so I tried to write a loop that checks whether each one exists and then gathers, but that did not work.
After the variables are separated from Draw_CashFlow, I hope the result looks something like this (first four rows; ignore the variable abbreviations).
ID Period Principal Wrk_Reco_Principal Prepaid_Principal Interest
R01 1 834659.51 NA 366627.74 317521.73
R02 1 85800.18 NA 0.00 29175.10
R03 1 27540.31 NA 0.00 NA
R01 2 834659.51 NA 306125.98 302810.49
Notes: Wrk_Reco_Principal is NA because there are no IDs under this variable in Draw_CashFlow. Keep in mind this needs to handle any number of IDs, but the variable names in the Draw_CashFlow column will always be the same.
Here's an approach which assumes the Draw_CashFlow values that start with an R are ID numbers. You might need a different method (e.g. !Draw_CashFlow %in% LIST_OF_VARIABLES) if that doesn't hold up.
library(tidyverse) # provides dplyr, tidyr, stringr and readr, all used below
df %>%
# create separate columns for ID and Variable
mutate(ID = if_else(Draw_CashFlow %>% str_starts("R"),
Draw_CashFlow, NA_character_),
Variable = if_else(!Draw_CashFlow %>% str_starts("R"),
Draw_CashFlow, NA_character_)) %>%
fill(Variable) %>% # Fill down Variable in NA rows from above
select(-Draw_CashFlow) %>%
gather(Period, value, -c(ID, Variable)) %>% # Gather into long form
drop_na() %>%
spread(Variable, value, fill = 0) %>% # Spread based on Variable
mutate(Period = parse_number(Period))
# A tibble: 12 x 5
ID Period Interest `Prepaid Principal` Principal
<chr> <dbl> <dbl> <dbl> <dbl>
1 R01 1 317522. 366628. 834660.
2 R01 2 302810. 306126. 834660.
3 R01 3 298530. 269970. 834660.
4 R01 4 293822. 307049. 834660.
5 R02 1 29175. 0 85800.
6 R02 2 28068. 0 85800.
7 R02 3 27901. 0 85800.
8 R02 4 27724. 0 85800.
9 R03 1 0 0 27540.
10 R03 2 0 0 27540.
11 R03 3 0 0 27540.
12 R03 4 0 0 27540.
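The alternative mentioned above, identifying variable rows by membership in a known list rather than by the "R" prefix of the IDs, could be sketched in base R roughly as follows. Only the labelling step is shown, on a shortened excerpt with a single period column; `variables`, `is_variable` and `ids` are names introduced here, and the sketch assumes the first row is always a variable name.

```r
df <- data.frame(
  Draw_CashFlow = c("Principal", "R01", "R02", "Workout Recovery Principal",
                    "Interest", "R01"),
  `PERIOD 1` = c(NA, 834659.51, 85800.18, NA, NA, 317521.73),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# The finite list of variable names given in the question
variables <- c("Principal", "Workout Recovery Principal",
               "Prepaid Principal", "Interest")

# Rows whose Draw_CashFlow value is a variable name start a new block
is_variable <- df$Draw_CashFlow %in% variables

# Carry each variable name down over the ID rows that follow it
df$Variable <- df$Draw_CashFlow[is_variable][cumsum(is_variable)]

# Keep only the ID rows and rename the column accordingly
ids <- df[!is_variable, ]
names(ids)[names(ids) == "Draw_CashFlow"] <- "ID"
rownames(ids) <- NULL
ids
```

From here the gather and spread steps of the answer apply unchanged.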
