Locating bold text in Excel using R

I am trying to make a couple of spreadsheets in Excel accessible. I need to replace bold text and some contents of the cells depending on their specific grouping. For example, if I have this table:
I would like to have the equivalent "accessible" table:
I am not worried about writing to the Excel file; my goal is to read the table from the spreadsheet and create a data frame that looks like the table above, with the variable names in the first column.
My idea was to identify where there is bold text in the first column so I could prepend that text to the names that are not in bold, since the bold entries mark the groupings.
I understand this might not be the best approach to the problem; I hope someone can suggest a proper solution.
Thank you all very much.

--EDITED-- It turns out you can tell which cells have which style; how many styles there are, and how consistently they are used in the workbook, will determine which approach fits your needs. I will leave the other answer, which uses the Total column approach, below. The first approach relies on the bold text being used consistently, and the second approach relies on the subcategory totals always summing to the category total. Both end up using similar steps; only the initial strategy for identifying the category text differs.
I originally wrote that I didn't believe openxlsx could determine which cells have a bolded style, only that a bolded style exists in the workbook. I couldn't have been more wrong!
--First approach: bold text search--
library(openxlsx)
library(tidyverse)
#Load the existing workbook (createWorkbook() is not needed, since loadWorkbook() returns a workbook object)
wb <- loadWorkbook("Path\\Your_File_Name.xlsx")
#Examine structure of the workbook
str(wb)
#Tells number of styles in workbook
wb$styleObjects %>% length()
#In this example there is just one style, so index the 1st style below and check whether its fontDecoration is bold (the answer here is TRUE)
wb$styleObjects[[1]]$style$fontDecoration == "BOLD"
Since this is the style desired, pull out the rows that have this fontDecoration. Note that if bold is combined with another style in some cells (e.g., if Motorcycle were in red font), then flagging/collecting the rows with bolded font may get more complex (hence approach 2 may be the safer option).
#This indicates that rows 1 and 5 have bolded text (i.e., Car and Motorcycle)
thesebold <- wb$styleObjects[[1]]$rows
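If the workbook contains more than one style object, a hedged sketch of collecting the bold rows across all of them (this assumes fontDecoration is stored as shown above; keep() and map() come with the tidyverse loaded earlier):
#Keep only the style objects whose fontDecoration includes BOLD, then pool their row numbers
boldstyles <- keep(wb$styleObjects, ~ "BOLD" %in% .x$style$fontDecoration)
boldrows <- sort(unique(unlist(map(boldstyles, "rows"))))
boldrows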
df <- read.xlsx("Path\\filename", colNames = FALSE)
This identifies the number of repetitions each category should have: the difference between row positions 1 and 5, then the remaining length of the dataset. See the second approach, or the generalised sketch below, if there are more than two categories.
thesereps <- c(diff(thesebold), dim(df)[1] - diff(thesebold))
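If there are more than two bold categories, a hedged generalisation of the line above (it gives the same result for two): append one position past the last row so that diff() produces a repetition count for every category.
thesereps <- diff(c(thesebold, dim(df)[1] + 1))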
#Named variables for ease
df %>%
  set_names("Category", "Total") %>%
  bind_cols(newcat = rep(df[thesebold, 1], thesereps)) %>%
  mutate(Category = case_when(Category == newcat ~ Category,
                              Category != newcat ~ paste0(newcat, ":", Category))) %>%
  select(-newcat)
--Second approach: Total column--
So, this answer doesn't use the bold text approach, but assuming the dataset is structured as in your example, the below should work. The data are structured so that you have categories (Car, Motorcycle) and then subcategories (Tesla, Honda, Toyota, etc.), with the Total column holding the total for each category and then the subcategory subtotals that contribute to that total. Using this column, you can define the category boundaries (i.e., subtracting each subtotal from the running total reaches 0 just before switching to the next category). For the demo I added two more categories of varying lengths but with the same restriction (the sum of the subcategories' totals must equal the category total). I made a note where things may need to be adjusted for your purposes, since I am creating the data frame from scratch instead of reading it in with openxlsx.
library(tidyverse)
#Make expanded data set for demo - adding extra categories
thesenames = c("Car", "Tesla", "Honda", "Toyota",
               "Motorcycle", "Honda", "Yamaha", "Suzuki",
               "Fruit", "Apple", "Orange", "Grape", "Strawberry", "Lemon", "Lime",
               "Shape", "Circle", "Square", "Octagon")
thesetotals = c(12, 3, 2, 7, 20, 13, 5, 2, 32, 8, 4, 4, 8, 4, 4, 24, 2, 4, 18)
df <- bind_cols(thesenames, thesetotals) %>%
  set_names("Type", "Total")
#Empty tibble to save running total result to
y <- tibble(NULL)
#Initialize the current.total as 0
current.total = 0
for (i in thesetotals) {
  if (current.total == 0) {
    current.total = current.total + i
  } else {
    current.total = current.total - i
  }
  tmp <- current.total
  y <- rbind(y, tmp)
}
y <- as_tibble(y) %>%
  set_names("RT")
#Gets the number of subcategories between each of main categories
thislong <- diff(which(y$RT == 0))
thislong <- c(length(y$RT) - sum(thislong), thislong)
#This part assumes the structure of the df I created above which may need modified in your dataset
#This pulls from first column, first row, which here is "Car"
firstrow <- df[1,1] %>% pull()
#Gets vector of each category; determines category by looking at the lag RT value
thesetypes <- bind_cols(df, y) %>%
  mutate(Category = case_when(firstrow == Type ~ Type,
                              RT > lag(RT) ~ Type,
                              TRUE ~ "0")) %>%
  filter(Category != "0") %>%
  pull(Category)
#Adds new category to existing df, repeating the specified number of times
df$Category <- rep(thesetypes,thislong)
#Modifies the subcategory text by prefixing the category membership, then drops Category
df <- df %>%
  mutate(Type = case_when(Type != Category ~ paste0(Category, ":", Type),
                          TRUE ~ Type)) %>%
  select(-Category)
df
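With the demo data above, the first rows returned here should look something like this:
# Type             Total
# Car                 12
# Car:Tesla            3
# Car:Honda            2
# Car:Toyota           7
# Motorcycle          20
# Motorcycle:Honda    13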

Related

How to use window functions?

I'm struggling to get window functions working in R to rank groups by the number of rows.
Here's my sample code:
library(tidyverse)

data <- read_csv("https://dq-content.s3.amazonaws.com/498/book_reviews.csv")
data %>%
  group_by(state) %>%
  mutate(num_books = n(),
         state_rank = dense_rank(num_books)) %>%
  arrange(num_books)
The expected output is that the original data will have a new column that tells me the rank for each row (book, state, price and review) depending on whether that row is for a state with the most book reviews (would have state_rank of 1); second most books (rank 2), etc.
Manually I can get the output like this:
manual_ranks <- data %>%
  count(state) %>%
  mutate(state_rank = rank(state))
desired_output <- data %>%
  left_join(manual_ranks)
In other words, I want the last column of this table:
data %>%
  count(state) %>%
  mutate(state_rank = rank(state))
added to each row of the original table (without having to create this table and then using left_join by state; that's the point of window functions).
Anyway, with the original code, you'll see that all state_rank values are just 1, when I would expect states with the most book reviews to be ranked 1, states with the second most reviews to be ranked 2, etc.
My goal is to then be able to filter by, say, state_rank > 4. That is, I want to keep all the rows in the original data for top 4 states with the most book reviews.
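A minimal sketch of one way to get the expected ranking (a hedged suggestion, not from the thread): compute num_books inside the group, but rank after ungroup(), because inside group_by(state) dense_rank() only ever sees that one state's num_books value and so returns 1 for every group.
data %>%
  group_by(state) %>%
  mutate(num_books = n()) %>%                           # reviews per state
  ungroup() %>%                                         # rank across all states, not within each group
  mutate(state_rank = dense_rank(desc(num_books))) %>%  # most reviews -> rank 1
  filter(state_rank <= 4)                               # keep rows for the top 4 states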

How can I select specific regions from a shapefile?

I have the following shapefile:
library(rgdal)

heitaly <- readOGR("ProvCM01012017/ProvCM01012017_WGS84.shp")
FinalData <- merge(heitaly, HT, by.x = "COD_PROV", by.y = "Domain")
But I'm not interested in all of Italy, only in some provinces. How can I get them?
There are many ways to select a category from a shapefile; it depends on what you want it for, for example whether you want to colour a specific region in a plot or to select a row from the shapefile's attribute table.
To plot:
plot(shape, col = shape$column_name == "element") # general example
plot(heitaly, col = heitaly$COD_PROV == "name of province") # your shapefile
To get the attribute table:
library(dplyr)

df <- shape %>% data.frame()
This will give you the complete attribute table.
row <- shape %>% data.frame() %>% slice(1)
This will give you the first row with all columns. If you change the number 1 to another number, for example 3, it will give you the information for row number 3.
I hope this has been useful.
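If instead you want to subset the shapefile itself to a few provinces, here is a hedged sketch (the COD_PROV values below are placeholders for the codes of the provinces you actually want):
wanted <- c("015", "027")                                  # hypothetical province codes
some_provinces <- heitaly[heitaly$COD_PROV %in% wanted, ]  # keep only matching rows/geometries
plot(some_provinces)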

R- How do I use a lookup table containing threshold values that vary for different variables (columns) to replace values below those thresholds?

I am trying to streamline the process of auditing chemistry laboratory data. When we encounter data where an analyte is not detected I need to change the recorded result to a value equal to 1/2 of the level of detection (LOD) for the analytical method. I have LOD's contained within another dataframe to be used as a lookup table.
I have multiple columns representing data from different analytical tests, each with its own unique LOD. Here's an example of the type of data I am working with:
library(tidyverse)

dat <- tibble("Lab_ID" = as.character(seq(1, 10, 1)),
              "Tributary" = c('sawmill', 'paint', 'herring', 'water',
                              'paint', 'sawmill', 'bolt', 'water',
                              'herring', 'sawmill'),
              "date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
              "TP" = c(1.5, 15.7, -2.3, 7.6, 0.1, 45.6, 12.2, -0.1, 22.2, 0.6),
              "TN" = c(100.3, 56.2, -10.5, 0.4, -0.3, 11.0, 45.8, 256.0, 12.2, 144.0),
              "DOC" = c(56.0, 120.3, -10.5, 0.2, 14.6, 489.3, 0.3, 14.4, 54.6, 88.8))
dat
detect_level <- tibble("Parameter" = c('TP', 'TN', 'DOC'),
                       'LOD' = c(0.6, 11, 0.3)) %>%
  mutate(halfLOD = LOD / 2)
detect_level
I have poured over multiple other questions with a similar theme:
Change values in multiple columns of a dataframe using a lookup table
R - Match values from multiple columns in a data.frame to a lookup table.
Replace values in multiple columns using different thresholds
and gotten to a point where I have pivoted the data and split it into a list of data frames, one for each analyte:
dat %>%
  pivot_longer(cols = c('TP', 'TN', 'DOC')) %>%
  arrange(name) %>%
  split(.$name)
I have tried to apply a function using map(), however I cannot figure out how to integrate the values from the lookup table (detect_level) into my code. If someone could help me continue this pipe, or finish the process to achieve a final product dat2 that should look like this I would appreciate it:
dat2 <- tibble("Lab_ID" = as.character(seq(1, 10, 1)),
               "Tributary" = c('sawmill', 'paint', 'herring', 'water',
                               'paint', 'sawmill', 'bolt', 'water',
                               'herring', 'sawmill'),
               "date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
               "TP" = c(1.5, 15.7, 0.3, 7.6, 0.3, 45.6, 12.2, 0.3, 22.2, 0.6),
               "TN" = c(100.3, 56.2, 5.5, 5.5, 5.5, 11.0, 45.8, 256.0, 12.2, 144.0),
               "DOC" = c(56.0, 120.3, 0.15, 0.15, 14.6, 489.3, 0.3, 14.4, 54.6, 88.8))
dat2
Another possibility would be from the closest similar question I have found is:
Lookup multiple column from a single table
Here's a snippet of code that I have adapted from that question; however, if you run it you will see that where values exist that are not found in detect_level, an NA is returned. Additionally, it does not appear to have worked for $TN or $DOC, even in cases where the $LOD value from detect_level was present.
dat %>%
  mutate(across(all_of(unique(detect_level$Parameter)),
                ~ {i1 <- detect_level$Parameter == cur_column()
                   detect_level$LOD[i1][match(., detect_level$LOD)]}))
I am not comfortable at all with the purrr language here and have only adapted this code from the question linked, so I would appreciate if this is the direction an answerer chooses, that they might comment code to explain briefly what is happening "under the hood".
Thank you in advance!
Perhaps this helps
library(dplyr)

dat %>%
  mutate(across(all_of(detect_level$Parameter),
                ~ pmax(., detect_level$LOD[match(cur_column(), detect_level$Parameter)])))
For the updated case
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                ~ replace(., . < detect_level$LOD[match(cur_column(), detect_level$Parameter)],
                          detect_level$halfLOD[match(cur_column(), detect_level$Parameter)])))
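For completeness, here is a sketch of the longer-format route the question started down (it should give the same result, assuming the dat and detect_level objects above): join the lookup table, replace values below the LOD with half the LOD, then re-widen.
dat %>%
  pivot_longer(cols = all_of(detect_level$Parameter), names_to = "Parameter") %>%
  left_join(detect_level, by = "Parameter") %>%
  mutate(value = if_else(value < LOD, halfLOD, value)) %>%  # below LOD -> half the LOD
  select(-LOD, -halfLOD) %>%
  pivot_wider(names_from = Parameter, values_from = value)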

table1() and t1kable() Ignoring named level of a factor

I am trying to use RMarkdown to automatically generate summary tables that are part of reports that get knitted to pdf. I am using table1() to create the initial table, and t1kable() to convert the table for pdf-knitting, and kable_styling() to center the table on the page. Here is my reproducible example:
library(tidyverse)
library(ggplot2)
library(table1)
library(kableExtra)

# Create df for reproducible example
df <- tibble(grad_term = c(rep("Fall '15 to Summer '20", 3), "Spring '12 to Summer '15",
                           rep("Fall '15 to Summer '20", 5), "Spring '12 to Summer '15"),
             enrollment_age = c("23-29", "30-39", rep("23-29", 8)),
             gender = c(rep("Male", 6), rep("Female", 2), rep("Male", 2)),
             racial_group = c("2+ groups", rep("White", 7), NA_character_, "White"),
             citizenship_status = c(rep("Since birth", 10))) %>%
  mutate(across(.cols = everything(), ~ factor(.x)))
# Create necessary components of the table1() object
labels <- list(
  variables = list(gender = "Gender",
                   racial_group = "Race/Ethnicity",
                   citizenship_status = "Citizenship Status",
                   enrollment_age = "Age at Enrollment",
                   grad_term = "Graduation Term"))
strata <- c(list(Total = df))
# Creating my own "render categorical variable" function is necessary to round to 0 decimals
my.render.cat <- function(x) {
  c("", sapply(stats.default(x), function(y) with(y,
    sprintf("%d (%.0f %%)", FREQ, PCT))))
}
# Put it all together and pass through t1kable() for knitting to pdf and kable_styling() for LaTeX formatting
table1(strata,
       labels,
       render.categorical = my.render.cat) %>%
  t1kable() %>%
  kable_styling(., latex_options = "hold_position", position = "center")
Here is the result:
[screenshot of the rendered table, showing the issues described below]
As you can see, 2 undesired outcomes are happening:
1. The value "1" is being displayed in the "Citizenship status" row, rather than "Since birth". You can check in the console that levels(df$citizenship_status) returns [1] "Since birth".
This dataset represents one of ~50 like it; a separate report_driver.R file uses map() to, for each of these 50 groups, filter the large data frame to only those individuals, call droplevels() so that only the factor levels represented inside that group remain, and execute rmarkdown::render(input = "report.Rmd"). The code you see here is what is inside report.Rmd.
Citizenship status, in this particular group's data frame, is the only place I encounter this "1" issue. It is also the only instance of a single level being present in the filtered data frame (in other words, the other ~49 groups all have multiple levels, such as "Since birth" and "Naturalized", present for citizenship_status, as well as for all the other demographic variables; this one group, with only 10 people, just happens to be all "Since birth").
I have no idea why having only one level present of these factor variables causes this erroneous labeling, but the two seem to be 1:1 correlated across my data.
2. Nearly all levels are rounded to 0 decimal places, except for the Missing row (which corresponds to the NAs). This occurs in all of the other ~50 datasets and automatically generated tables as well, but only for the "Missing"/NA values.
Thus, I have deduced that table1() and possibly kable() in general treat NA values differently, but I am not well-versed enough to "open up the hood" and make any modifications myself.
Please let me know if you can help with either the (1) factor level issue or (2) the rounding of NA rows. Have a great day and thanks so much!

Categorize large factor into small factors based on frequency with remaining entries as 'Others'

I have a large factor (df$name) with more than 1000 levels. What I need is the top 10-15 levels by frequency, with the remaining levels clubbed together as 'Others'.
I tried using the following command but wasn't successful:
df$name <- levels(df$name)[which(table(df$name)<1000000)] <- "Others"
PS: I'm using a frequency count since I don't want to restrict myself with a specific count of factors here. I'm happy if I get anywhere from 5-20 top factors (by frequency) and the rest of them combined together as 'Others' for easy visualization.
First of all, I would count name frequency using table() and top_n() to select the top 15 (or 10) names in your data set (I stored them in the top_15_names object). After that, I create a name_category column to show the groups of names using mutate(). Here is how I would do it.
library(dplyr)

df$name <- as.factor(df$name)
top_15 <- data.frame(table(df$name)) %>%
  arrange(desc(Freq)) %>%
  top_n(15)
top_15_names <- top_15$Var1
dat <- df %>%
  mutate(name_category = case_when(
    name %in% top_15_names ~ "Top15",
    TRUE ~ "Others"
  ))
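If the goal is to keep the individual top names rather than the single "Top15" label, a small tweak to the case_when() should do it (a hedged sketch using the same objects as above):
dat <- df %>%
  mutate(name_category = case_when(
    name %in% top_15_names ~ as.character(name),   # keep the top names as-is
    TRUE ~ "Others"
  ))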
I hope you find this helpful.
Here's a column in a data frame with 2000 levels:
df <- data.frame(names = sample(1:2000, 1E6, replace = T))
df$names <- as.factor(df$names)
And here a new variable is added which keeps the top 15 and puts the rest in "Other."
df$names_lump = forcats::fct_lump(df$names, n = 15)
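A quick check on the result (hedged: ties aside, the lumped factor should have 16 levels, the top 15 plus "Other"):
nlevels(df$names_lump)                             # expect 16
sort(table(df$names_lump), decreasing = TRUE)[1:5] # most frequent levels first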
