table1() and t1kable() ignoring named level of a factor

I am trying to use RMarkdown to automatically generate summary tables that are part of reports knitted to pdf. I am using table1() to create the initial table, t1kable() to convert it for pdf knitting, and kable_styling() to center it on the page. Here is my reproducible example:
library(tidyverse)
library(ggplot2)
library(table1)
library(kableExtra)
# Create df for reproducible example
df <- tibble(grad_term = c(rep("Fall '15 to Summer '20", 3), "Spring '12 to Summer '15",
                           rep("Fall '15 to Summer '20", 5), "Spring '12 to Summer '15"),
             enrollment_age = c("23-29", "30-39", rep("23-29", 8)),
             gender = c(rep("Male", 6), rep("Female", 2), rep("Male", 2)),
             racial_group = c("2+ groups", rep("White", 7), NA_character_, "White"),
             citizenship_status = rep("Since birth", 10)) %>%
  mutate(across(.cols = everything(), ~ factor(.x)))
# Create necessary components of the table1() object
labels <- list(
  variables = list(gender = "Gender",
                   racial_group = "Race/Ethnicity",
                   citizenship_status = "Citizenship Status",
                   enrollment_age = "Age at Enrollment",
                   grad_term = "Graduation Term"))
strata <- c(list(Total = df))
# Creating my own "render categorical variable" function is necessary to round to 0 decimals
my.render.cat <- function(x) {
  c("", sapply(stats.default(x), function(y) with(y,
    sprintf("%d (%.0f %%)", FREQ, PCT))))
}
# Put it all together; pass through t1kable() for knitting to pdf and kable_styling() for LaTeX formatting
table1(strata,
       labels,
       render.categorical = my.render.cat) %>%
  t1kable() %>%
  kable_styling(latex_options = "hold_position", position = "center")
Here is the result:
bad table :(
As you can see, two undesired outcomes are happening:
1. The value "1" is being displayed in the "Citizenship Status" row rather than "Since birth". You can check in the console that levels(df$citizenship_status) returns [1] "Since birth".
This dataset represents one of ~50 like it; a separate report_driver.R file uses map() to, for each of these 50 groups, filter the large data frame down to just those individuals, call droplevels() so that only the factor levels represented inside that group remain, and then call rmarkdown::render(input = "report.Rmd"). The code you see here is what is inside report.Rmd.
Citizenship status, in this particular group's data frame, is the only place I encounter this "1" issue. It is also the only instance of a single level being present in the filtered data frame (in other words, the other ~49 groups all have multiple levels present, such as "Since birth" and "Naturalized" for citizenship_status, as well as for all the other demographic variables; this one group, with only 10 people, just happens to be all "Since birth").
I have no idea why having only one level of a factor present causes this erroneous labeling, but the two seem to be perfectly correlated across my data.
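One workaround I am considering for (1), as an untested sketch: instead of letting droplevels() collapse the factor to a single level, re-declare the full level set for this variable after filtering. (This assumes the full level set is known ahead of time; "Naturalized" here is just the other level mentioned above, and the empty level would then show a 0 (0 %) row, which may or may not be acceptable.)
# Hypothetical workaround: restore the full level set so citizenship_status
# never ends up as a single-level factor
df <- df %>%
  mutate(citizenship_status = factor(citizenship_status,
                                     levels = c("Since birth", "Naturalized")))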
2. Nearly all rows round to 0 decimal places, except for the Missing row (which corresponds to the NAs). This happens in every one of the other ~50 datasets and automatically generated tables, but only for the "Missing"/NA rows.
Thus, I have deduced that table1() (and possibly kable() in general) treats NA values differently, but I am not well-versed enough to "open up the hood" and make any modifications myself.
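For (2), one thing I plan to try, as a hedged sketch based on my reading of the documentation: table1() appears to also accept a render.missing argument, analogous to render.categorical, so the Missing row could be overridden the same way.
# Sketch: render the Missing row with the same 0-decimal format as the categorical rows
# (assumes stats.default() reports the NA count under $Yes, with FREQ/PCT as in my.render.cat)
my.render.miss <- function(x) {
  with(stats.default(is.na(x))$Yes,
       c(Missing = sprintf("%d (%.0f %%)", FREQ, PCT)))
}
table1(strata, labels,
       render.categorical = my.render.cat,
       render.missing = my.render.miss)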
Please let me know if you can help with either the (1) factor level issue or (2) the rounding of NA rows. Have a great day and thanks so much!

Related

Locating bold text in excel using R

I am trying to make a couple of spreadsheets in excel accessible. I need to replace bold text and some contents of the cells depending on their specific grouping. For example, if I have this table:
I would like to have the equivalent "accessible" table:
I am not worried about writing in the excel file, my goal is to read the table from the spreadsheet and create a data frame that looks like the table above with the variable names in the first column.
My idea was to identify where there is a bold text in the first column so I could prepend that text to the names that are not in bold as bold represents the subgroups.
I understand this might not be the best solution to the problem, I hope someone can find a proper solution.
Thank you all very much.
-- EDITED -- It turns out you can tell which cells carry which style; how many styles there are, and how consistently they are used in the workbook, will determine which approach fits your needs. I will leave the other answer using the Total-column approach below. The first approach relies on the bold text being used consistently; the second relies on the subcategory values in the Total column always summing to the category total. Both end up using a similar approach; only the initial strategy for identifying the category text differs.
---I don't believe that openxlsx can determine which cells have a bolded style-- only that a bolded style exists in the workbook.--- I couldn't have been more wrong!
---Bold text search answer --
library(openxlsx)
library(tidyverse)
#Load the existing workbook (loadWorkbook() returns the Workbook object directly, so createWorkbook() is not needed)
wb <- loadWorkbook("Path\\Your_File_Name.xlsx")
#Examine structure of the workbook
str(wb)
#Tells number of styles in workbook
wb$styleObjects %>% length()
#In this example there is just 1 style, so index the 1st style object and test whether its fontDecoration is bold (the answer here is TRUE)
wb$styleObjects[[1]]$style$fontDecoration == "BOLD"
Since this is the desired style, pull out the rows that have this fontDecoration. Note that if BOLD were combined with another style type in a different cell (e.g., if Motorcycle were in red font), flagging/collecting the rows with bolded font could get more complex (hence approach 2 may be the safer option).
#This indicates that rows 1 and 5 have bolded text (i.e., Car and Motorcycle)
thesebold <- wb$styleObjects[[1]]$rows
df <- read.xlsx("Path\\filename", colNames = FALSE)
#This identifies the number of repetitions each category should have: the difference between row positions 1 and 5, then the remaining length of the dataset. See the second approach if there are more than two categories
thesereps <- c(diff(thesebold), dim(df)[1] - diff(thesebold))
#Named variables for ease
df %>%
  set_names("Category", "Total") %>%
  bind_cols(newcat = rep(df[thesebold, 1], thesereps)) %>%
  mutate(Category = case_when(Category == newcat ~ Category,
                              Category != newcat ~ paste0(newcat, ":", Category))) %>%
  select(-newcat)
--Second approach --
So, this answer isn't using the bold-text approach, but assuming the structure of the dataset is as displayed in your example, the below should work. The data are structured so that there are categories (Car, Motorcycle) and subcategories (Tesla, Honda, Toyota, etc.), with the Total column holding the total for each category and then the subcategory subtotals that contribute to that total. Using this column, you can define category boundaries (i.e., subtracting the subcategory values from the category Total reaches 0 right before switching to the next category). For the demo, I added two more categories of varying lengths but with the same restriction (the sum of the subcategories' totals must equal the category total). I made a note where things may need to be adjusted for your purposes, since I am creating the data frame from scratch instead of reading it in using openxlsx.
library(tidyverse)
#Make expanded data set for demo - adding extra categories
thesenames = c("Car", "Tesla", "Honda", "Toyota", "Motorcycle", "Honda", "Yamaha", "Suzuki", "Fruit", "Apple", "Orange", "Grape", "Strawberry", "Lemon", "Lime", "Shape", "Circle", "Square", "Octagon")
thesetotals = c(12, 3, 2, 7, 20, 13, 5, 2, 32, 8, 4, 4, 8, 4, 4, 24, 2, 4, 18)
df <- bind_cols(thesenames, thesetotals) %>%
  set_names("Type", "Total")
#Empty tibble to save running total result to
y <- tibble(NULL)
#Initialize the current.total as 0
current.total = 0
for(i in thesetotals){
  if (current.total == 0){
    current.total = current.total + i
  } else {
    current.total = current.total - i
  }
  tmp <- current.total
  y <- rbind(y, tmp)
}
y <- as_tibble(y) %>%
  set_names("RT")
#Gets the number of subcategories between each of the main categories
thislong <- c(diff(which(y$RT ==0)))
thislong <- c((length(y$RT) - sum(thislong)),thislong)
#This part assumes the structure of the df I created above, which may need to be modified for your dataset
#This pulls the first column, first row, which here is "Car"
firstrow <- df[1,1] %>% pull()
#Gets vector of each category; determines category by looking at the lag RT value
thesetypes <- bind_cols(df, y) %>%
  mutate(Category = case_when(firstrow == Type ~ Type,
                              RT > lag(RT) ~ Type,
                              TRUE ~ "0")) %>%
  filter(Category != "0") %>%
  pull(Category)
#Adds the new category to the existing df, repeating the specified number of times
df$Category <- rep(thesetypes, thislong)
#Modifies the subcategory text by prefixing the category membership, then drops Category
df <- df %>%
  mutate(Type = case_when(Type != Category ~ paste0(Category, ":", Type),
                          TRUE ~ Type)) %>%
  select(-Category)
df

R- How do I use a lookup table containing threshold values that vary for different variables (columns) to replace values below those thresholds?

I am trying to streamline the process of auditing chemistry laboratory data. When we encounter data where an analyte is not detected, I need to change the recorded result to a value equal to 1/2 of the level of detection (LOD) for the analytical method. I have LODs contained in another data frame to be used as a lookup table.
I have multiple columns representing data from different analytical tests, each with its own unique LOD. Here's an example of the type of data I am working with:
library(tidyverse)
dat <- tibble("Lab_ID" = as.character(seq(1, 10, 1)),
              "Tributary" = c('sawmill', 'paint', 'herring', 'water',
                              'paint', 'sawmill', 'bolt', 'water',
                              'herring', 'sawmill'),
              "date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
              "TP" = c(1.5, 15.7, -2.3, 7.6, 0.1, 45.6, 12.2, -0.1, 22.2, 0.6),
              "TN" = c(100.3, 56.2, -10.5, 0.4, -0.3, 11.0, 45.8, 256.0, 12.2, 144.0),
              "DOC" = c(56.0, 120.3, -10.5, 0.2, 14.6, 489.3, 0.3, 14.4, 54.6, 88.8))
dat
detect_level <- tibble("Parameter" = c('TP', 'TN', 'DOC'),
                       'LOD' = c(0.6, 11, 0.3)) %>%
  mutate(halfLOD = LOD/2)
detect_level
I have pored over multiple other questions with a similar theme:
Change values in multiple columns of a dataframe using a lookup table
R - Match values from multiple columns in a data.frame to a lookup table.
Replace values in multiple columns using different thresholds
and gotten to a point where I have pivoted the data and split it into a list of data frames, one per analyte:
dat %>%
  pivot_longer(cols = c('TP', 'TN', 'DOC')) %>%
  arrange(name) %>%
  split(.$name)
I have tried to apply a function using map(); however, I cannot figure out how to integrate the values from the lookup table (detect_level) into my code. If someone could help me continue this pipe, or otherwise finish the process to achieve a final product dat2 that looks like this, I would appreciate it:
dat2 <- tibble("Lab_ID" = as.character(seq(1, 10, 1)),
               "Tributary" = c('sawmill', 'paint', 'herring', 'water',
                               'paint', 'sawmill', 'bolt', 'water',
                               'herring', 'sawmill'),
               "date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
               "TP" = c(1.5, 15.7, 0.3, 7.6, 0.3, 45.6, 12.2, 0.3, 22.2, 0.6),
               "TN" = c(100.3, 56.2, 5.5, 5.5, 5.5, 11.0, 45.8, 256.0, 12.2, 144.0),
               "DOC" = c(56.0, 120.3, 0.15, 0.15, 14.6, 489.3, 0.3, 14.4, 54.6, 88.8))
dat2
Another possibility comes from the closest similar question I have found:
Lookup multiple column from a single table
Here's a snippet of code that I have adapted from that question; however, if you run it you will see that an NA is returned wherever a value does not appear in detect_level. Additionally, it does not appear to have worked for $TN or $DOC, even in cases where the $LOD value from detect_level was present.
dat %>%
  mutate(across(all_of(unique(detect_level$Parameter)),
                ~ {i1 <- detect_level$Parameter == cur_column()
                   detect_level$LOD[i1][match(., detect_level$LOD)]}))
I am not comfortable at all with the purrr language here and have only adapted this code from the question linked, so I would appreciate if this is the direction an answerer chooses, that they might comment code to explain briefly what is happening "under the hood".
Thank you in advance!
Perhaps this helps
library(dplyr)
# For every column named in detect_level$Parameter, look up that column's LOD
# and raise any value below it to the LOD itself
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                ~ pmax(., detect_level$LOD[match(cur_column(), detect_level$Parameter)])))
For the updated case, replace values below the column's LOD with that column's halfLOD instead:
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                # cur_column() returns the name of the column being mutated ("TP", "TN", "DOC"),
                # which match() uses to pick the right LOD/halfLOD row from the lookup table
                ~ replace(., . < detect_level$LOD[match(cur_column(), detect_level$Parameter)],
                          detect_level$halfLOD[match(cur_column(), detect_level$Parameter)])))
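If you prefer to stay closer to the pivot_longer() idea from the question, a join-based sketch along these lines should give the same dat2 (only checked against the example data above; the column names are the ones from the post):
dat %>%
  pivot_longer(cols = c('TP', 'TN', 'DOC'), names_to = "Parameter") %>%
  left_join(detect_level, by = "Parameter") %>%              # attach LOD/halfLOD to each row
  mutate(value = if_else(value < LOD, halfLOD, value)) %>%   # substitute half the LOD below detection
  select(-LOD, -halfLOD) %>%
  pivot_wider(names_from = Parameter, values_from = value)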

Using stringdist_left_join to join by multiple columns, but not all of them fuzzy

I have a 1.3 million-row dataset of publications and, for each record, I want to retrieve a paper_id from a second dataset with 8.6 million rows. The idea is to use multiple columns from both tables to find matches for dataset1 in dataset2 as shown in this functional, yet simplified, script:
library(fuzzyjoin); library(tidyverse)
dataset1 %>%
  stringdist_left_join(dataset2 %>% select(Title, Year, Publication_id, Paper_id),
                       by = list(x = c("Title", "Year", "Publication_id"),
                                 y = c("Title", "Year", "Publication_id")),
                       max_dist = 3, ignore_case = TRUE, distance_col = NULL)
I have two problems here. The first is that only "Title" has variations (typos, abbreviations, special characters, etc.) that would require fuzzy matching, but the code accepts variation in all three of the fields used. This inflates the number of possible matches with incorrect ones, as similar titles appear across different years and publications.
A solution I could think that would solve this first problem would be:
library(fuzzyjoin); library(tidyverse)
dataset1 %>%
  stringdist_left_join(dataset2 %>%
                         select(Title2 = Title, Year2 = Year, Pub_id2 = Publication_id, Paper_id),
                       by = list(x = c("Title", "Year", "Publication_id"),
                                 y = c("Title2", "Year2", "Pub_id2")),
                       max_dist = 3, ignore_case = TRUE, distance_col = NULL) %>%
  filter(Year == Year2, Publication_id == Pub_id2)
This would solve the first problem, but there is the second one: the script runs ok on sample data, but when I try to use it for the whole dataset, it gives the error "vector memory exhausted (limit reached?)".
So, my questions are:
Would it be possible to determine which columns should be identical and which should be fuzzy, which might make the script lighter?
Is there a possibility to subset both datasets according to the identical variables, and then run the fuzzy match on the title within the subsets, merging everything back? (I'm not sure if this would work, as I would have 180 thousand subsets: 30 thousand publications from six years).
Any help is appreciated.
Best
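One direction that might address the first question (a hedged sketch, not tested at this scale; it assumes the stringdist package for the title comparison): fuzzyjoin::fuzzy_left_join() accepts one match function per pair of by columns, so Year and Publication_id can be matched exactly while only Title is fuzzy.
library(fuzzyjoin); library(stringdist); library(dplyr)
dataset1 %>%
  fuzzy_left_join(dataset2 %>% select(Title, Year, Publication_id, Paper_id),
                  by = c("Year" = "Year",
                         "Publication_id" = "Publication_id",
                         "Title" = "Title"),
                  # exact equality for Year and Publication_id; edit distance <= 3 for Title only
                  match_fun = list(`==`, `==`,
                                   function(x, y) stringdist(x, y) <= 3))
For the memory problem, a related option is to do an ordinary inner_join() on Year and Publication_id first (cheap and exact), and only then filter the candidate pairs on stringdist() between the two titles, which avoids building the full fuzzy cross product.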

R: Mutate Returning NA Values When Additional Variable in Select Statement

I'm using dplyr and rollmean (from zoo) to calculate a 13-week moving average and growth rates. The following works:
NEW_DATA <- DATA %>%
  select(CAT, Inventory_Amount, Sales, Shipments, DATE) %>%
  group_by(CAT, DATE) %>%
  summarise(
    INVENTORY = sum(Inventory_Amount),
    SO = sum(Sales),
    SI = sum(Shipments)
  ) %>%
  arrange(CAT, DATE) %>%
  mutate(SO_13WK_AVG = rollmean(x = SO, 13, align = "right", fill = NA),
         GROWTH = round(((SO - lag(SO, 52)) / lag(SO, 52)) * 100, 2))
This code adds two new columns, "SO_13WK_AVG" (the 13-week sales average) and "GROWTH" (the YoY growth rate for sales).
When I try to select an additional variable from the original dataframe to include in the new summarized dataframe, the values for the new variables being created all turn to NAs. The following code generates NAs for SO_13WK_AVG and GROWTH (all I've done is select the "WK" variable):
NEW_DATA <- DATA %>%
  select(CAT, Inventory_Amount, Sales, Shipments, DATE, WK) %>%
  group_by(CAT, DATE, WK) %>%
  summarise(
    INVENTORY = sum(Inventory_Amount),
    SO = sum(Sales),
    SI = sum(Shipments)
  ) %>%
  arrange(CAT, DATE) %>%
  mutate(SO_13WK_AVG = rollmean(x = SO, 13, align = "right", fill = NA),
         GROWTH = round(((SO - lag(SO, 52)) / lag(SO, 52)) * 100, 2))
I searched stackoverflow and found one thread that seems related:
Group/Mutate only returns NA and not an average
This thread suggests using na.rm = TRUE to remove NA values from calculations. However, as far as I can tell, I don't have any missing values. Any help / commentary is appreciated.
I just resolved a very similar issue. I can't quite tell whether this will fix yours without spending more time thinking about it, but I was grouping by the two variables which accounted for all of the variation across my data set (location and week), so the rolling mean either could not be calculated or could only produce the fill values. Not grouping by "week" solved the issue. Since "WK" is almost certainly 100% dependent on "DATE", I expect you have the same issue. Remember, summarise() drops only the last grouping variable from the grouping. Try grouping by WK before your summarise, and then regrouping without week or date.
(BTW, I'm sure you've figured something out, since this was almost two years ago, but I imagine others will encounter this as well, after all, that's why I came to this question.)
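In code, the suggestion above might look roughly like this (a sketch only; it assumes the column names from the question, with the sales column spelled Sales as in the select(), keeps WK in the summarise grouping, and then regroups by CAT alone before the rolling calculations):
library(dplyr)
library(zoo)
NEW_DATA <- DATA %>%
  group_by(CAT, DATE, WK) %>%
  summarise(INVENTORY = sum(Inventory_Amount),
            SO = sum(Sales),
            SI = sum(Shipments),
            .groups = "drop") %>%   # drop all grouping after summarising
  arrange(CAT, DATE) %>%
  group_by(CAT) %>%                 # regroup by CAT only, not DATE or WK
  mutate(SO_13WK_AVG = rollmean(x = SO, 13, align = "right", fill = NA),
         GROWTH = round(((SO - lag(SO, 52)) / lag(SO, 52)) * 100, 2)) %>%
  ungroup()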

Plotting only 1 hourly datapoint (1 per day) alongside hourly points (24 per day) in R Studio

I am a bit stuck with some code. Of course I would appreciate a piece of code which sorts out my dilemma, but I am also grateful for hints on how to sort it out.
Here goes:
First of all, I installed the packages (ggplot2, lubridate, and openxlsx)
The relevant part:
I extract a file from an Italian gas TSO's website:
Storico_G1 <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G+1", startRow = 1, colNames = TRUE)
Then I created a data frame with the variables I want to keep:
Storico_G1_df <- data.frame(Storico_G1$pubblicazione, Storico_G1$IMMESSO, Storico_G1$`SBILANCIAMENTO.ATTESO.DEL.SISTEMA.(SAS)`)
Then change the time format:
Storico_G1_df$pubblicazione <- ymd_h(Storico_G1_df$Storico_G1.pubblicazione)
Now the struggle begins. In this example I would like to chart the two time series with two different Y axes, because the ranges are very different. This is not really a problem as such, because with the melt function and ggplot I can achieve that. However, since there are NAs in one column, I don't know how to work around them. Since, in the incomplete (SAS) column, I mainly care about the data point at 16:00, I would ideally have hourly points on one chart and only one data point per day on the second chart (at said 16:00). I attached an unrelated example pic of the chart style I mean; however, in the attached chart I have equally many data points on both charts, and hence it works fine.
Grateful for any hints.
Take care
library(lubridate)
library(ggplot2)
library(openxlsx)
library(dplyr)
#Use na.strings; it looks like NAs can take several values in the dataset
storico.xl <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",
                        sheet = "Storico_G+1", startRow = 1,
                        colNames = TRUE,
                        na.strings = c("NA", "N.D.", "N.D"))
#Select and rename the crazy column names
storico.g1 <- data.frame(storico.xl) %>%
  select(pubblicazione, IMMESSO, SBILANCIAMENTO.ATTESO.DEL.SISTEMA..SAS.)
names(storico.g1) <- c("date_hour","immesso","sads")
# the date column is in a format that ymd_h() can parse
storico.g1 <- storico.g1 %>% mutate(date_hour = ymd_h(date_hour))
#Not sure exactly what you want to plot, but here is each point by hour
ggplot(storico.g1, aes(x= date_hour, y = immesso)) + geom_line()
#For each day you can group; convert date_hour to a plain date
#You can check there are 24 points per day
#feed the new columns into the ggplot
storico.g1 %>%
  group_by(date = as.Date(date_hour)) %>%
  summarise(count = n(),
            daily.immesso = sum(immesso)) %>%
  ggplot(aes(x = date, y = daily.immesso)) + geom_line()
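For the one-point-per-day series the question asks about, a hedged sketch in the same style (assuming the sads column was read in as numeric) is to keep only the 16:00 observation of each day and plot it on its own chart:
storico.g1 %>%
  filter(hour(date_hour) == 16, !is.na(sads)) %>%   # one observation per day, at 16:00
  ggplot(aes(x = as.Date(date_hour), y = sads)) +
  geom_point() + geom_line()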
