How to reshape complicated elections data using tidyverse? - r

I need to reshape a complicated table from rows of stacked election data to cleanly formatted columns containing all the information. I'm having trouble automating this.
Here's a simple version of the input data. Note that there are just 2 elections in this example; in the real data there are many, so the code needs to generalize:
input <-
structure(list(a = c("2020 ge", "winner", NA, "2016 ge", "winner"
), b = c(NA, "orange (cat)", NA, NA, "peach (kitten)"), c = c(NA,
"runner up", NA, NA, "runner up"), d = c(NA, "peach (kitten)", NA,
NA, "orange (cat)"), e = c(NA, "margin", NA, NA, "margin"), f = c(NA,
100, NA, NA, 150)), row.names = c(NA, 5L), class = "data.frame")
And this is the output I would like:
output <-
structure(list(`2019_winner_name` = "orange", `2020_winner_party` = "cat",
`2020_runner_up_name` = "peach", `2020_runner_up_party` = "kitten",
`2020_margin` = 100, `2016_winner_name` = "peach", `2016_winner_party` = "kitten",
`2016_runner_up_name` = "orange", `2016_runner_up_party` = "cat",
`2016_margin` = 150), row.names = 1L, class = "data.frame")
Here is what I've tried so far, which works for one year:
# test data
test <-
input %>%
slice(1:2) %>%
fill(c(b, c, d, e, f), .direction = c("up"))
# select first row
row_one <-
test %>%
select(a) %>%
slice(1)
# select year
year <-
str_extract(row_one$a, "^([0-9]*)")
# select second row as name
row_two <-
test %>%
select(a) %>%
slice(2) %>%
as.character()
# bring back to test data
test <-
test %>%
mutate(a = row_two) %>%
slice(1) %>%
add_row() %>%
fill(c(b, d, f)) %>%
mutate(a = ifelse(is.na(a), b, a),
c = ifelse(is.na(c), d, c),
e = ifelse(is.na(e), f, e)) %>%
select(a, c, e) %>%
row_to_names(1) %>%
rename_all(funs(paste0(year, "_", .)))
# extract party variable
test <-
test %>%
mutate_at(vars(contains("winner"), contains("runner")),
funs(party = str_extract(., "(?<=\\().+?(?=\\))"))) %>%
mutate_at(vars(ends_with("winner"), ends_with("up")),
funs(name = str_extract(., "([^()]*)")))
What would be an easier and more concise way to do this, given the unusual data format? How could I automate this so that I can run it over multiple election years?
Thank you.

First off, I agree with #deschen in that this is very messy data. Rather than trying to tidy/reshape the data as provided I would recommend exploring whether source data can be parsed in a better (tidier) way.
Having said that, it is possible to reshape & tidy data into your expected output. Mind you, this is a fairly messy procedure and I have no idea how well this generalises on bigger data.
library(tidyverse)
# Define a convenience function that turns a vector with an even number of elements
# into a named vector where every odd element is the name of the following even element
to_named_vec <- function(x) {
if (length(x) == 1) return(magrittr::set_names(x, "margin"))
nm <- x[c(TRUE, FALSE)]
vec <-x[c(FALSE, TRUE)]
return(magrittr::set_names(vec, nm))
}
# First convert the input into a nested `list`
lst <- input %>%
t() %>%
as.character() %>%
discard(is.na) %>%
split(., cumsum(str_detect(., "\\d{4}"))) %>%
map(~ .x %>%
str_remove(" ge") %>%
stringi::stri_replace_all_regex("(\\w+)\\s\\((\\w+)\\)", "name_$1_party_$2") %>%
str_split("_") %>%
unlist()) %>%
magrittr::set_names(map_chr(., head, 1)) %>%
map(~ .x[-1] %>%
split(cumsum(str_detect(.x[-1], "(winner|runner up|margin)"))) %>%
magrittr::set_names(map_chr(., head, 1)) %>%
map(~ .x %>% tail(-1) %>% to_named_vec() %>% bind_rows()))
# The last step involves `unlist`ing the nested `list`, tidying the names and
# converting the named vector into a `tibble` with `bind_rows`.
lst %>%
unlist() %>%
set_names(., str_replace_all(names(.), "\\.", "_")) %>%
set_names(., str_replace(names(.), "_margin", "")) %>%
bind_rows()
## A tibble: 1 x 10
#`2020_winner_na~ `2020_winner_pa~ `2020_runner up~ `2020_runner up~ `2020_margin` `2016_winner_na~
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 orange cat peach kitten 100 peach
## ... with 4 more variables: `2016_winner_party` <chr>, `2016_runner up_name` <chr>, `2016_runner
## up_party` <chr>, `2016_margin` <chr>
It's best to step through the code line-by-line to understand what every steps does; roughly,
we transpose input,
convert the resulting matrix into a character vector, discard NAs, and
split the vector on the occurrence of "\d{4}" (i.e. the year of the GE).
We then operate on every list element separately, by
removing the string " ge",
replacing occurrences of the form "orange (cat)" with "name_orange_party_cat",
splitting entries on "_".
The rest is a matter of giving the nested list elements proper names that from the vector of list elements themselves.
The final step involves unlisting the nested list and tidying the names of the named vector to reflect those from your expected output.

Related

R code to merge 2 data frames by whether values in the first "by" variable contain string values in the second "by" variable

I have 2 data frames: one with a list of medications, the other with a different but highly overlapping list of medications along with corresponding medication ID codes. I want to merge these two data frames to apply the medication codes to the first data frame's medication list. I have a lot of partial string matches, and I want to detect strings in a case-insensitive manner.
library(tidyverse)
library(stringr)
label <- c("0.4% Lidocaine Hydrochloride", "10% Dextrose", "Act Raloxifene")
df1 <- as.DataFrame(label)
label2 <- c("LIDOCAINE", "RALOXIFENE", "JANUMET", "ESOMEPRAZOLE", "METFORMIN")
code <- c(0003, 0005, 0006, 0001, 0011)
df2 <- data.frame(label2, code)%>%
rename(label=label2)
I try to use str_detect from stringr package
merge_df <- merge(df1, df2,
by.x=c("label" = ifelse(str_detect(df1$label, regex(df2$label, ignore_case = T)),
df1$label, NA)),
by.y=c("label" = ifelse(str_detect(df1$label, regex(df2$label, ignore_case = T)),
df2$label, NA)),
ignore.case=T,all.x=T,all.y=T,
suffixes = c("_list", "_dict"),
nomatch=0)
And I get the error:
Error in str_detect():
! Can't recycle string (size 3) to match pattern (size 5).
An approach using left_join.
First add a variable l_lower in both sets containing all tolower strings, separated by strsplit to enable match of all entries.
After joining and arranging the y-labels remove duplicated entries and the helper column.
library(dplyr)
library(tidyr)
left_join(df1 %>%
rowwise() %>%
mutate(l_label = strsplit(tolower(label), " ")) %>%
unnest(l_label),
df2 %>%
rowwise() %>%
mutate(l_label = unlist(strsplit(tolower(label), " "))), "l_label") %>%
arrange(label.y) %>%
group_by(label.x) %>%
filter(!duplicated(label.x)) %>%
select(-l_label) %>%
ungroup()
# A tibble: 3 × 3
label.x label.y code
<chr> <chr> <dbl>
1 0.4% Lidocaine Hydrochloride LIDOCAINE 3
2 Act Raloxifene RALOXIFENE 5
3 10% Dextrose NA NA
Data
df1 <- structure(list(label = c("0.4% Lidocaine Hydrochloride", "10% Dextrose",
"Act Raloxifene")), class = "data.frame", row.names = c(NA, -3L
))
df2 <- structure(list(label = c("LIDOCAINE", "RALOXIFENE", "JANUMET",
"ESOMEPRAZOLE", "METFORMIN"), code = c(3, 5, 6, 1, 11)),
class = "data.frame", row.names = c(NA,
-5L))

Several Model Estimates Table - Rearranging columns and rows (reprex included)

I'm working on creating a table of regression estimates from several models. Here is the data:
structure(list(term = c("age_ceo_state__rf", "", "mktrf", "",
NA, NA), intercept = c("0.390***", "(19.455)", "0.673***", "(23.409)",
NA, NA), term_2 = c("age_ceo_state__rf", "", "age_firm_state__rf",
"", "mktrf", ""), intercept_2 = c("0.209***", "(9.449)", "0.405***",
"(15.511)", "0.417***", "(13.255)"), term_3 = c("age_ceo_state__rf",
"", "age_firm_state__rf", "", "mktrf", ""), intercept_3 = c("0.209***",
"(9.449)", "0.405***", "(15.511)", "0.417***", "(13.255)")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
Here is how it looks right now:
And target table:
And yes, the term 2 and term 3 coefficients are the same even if it's a different model. I'm looking for a programmatic way to select the most complete set of terms, move them to term 1 column (notice the order of the terms changed), and set the missing cells to blank. This is a common layout and a lot of regression reporting packages use that layout; I just can't wrap my head around the elegant and flexible way to move the terms around. Apologies for tagging modelsummary an amazing package in R for regression tables even though this problem does not directly involve it but the author may have an insight in how to handle this problem.
This one is quite clumsy. But I think you are looking for something like this?
library(dplyr)
library(tidyr)
df %>%
mutate(id =as.integer(gl(n(),2,n()))) %>%
pivot_longer(starts_with("term")) %>%
group_by(id) %>%
add_count(value) %>%
mutate(x = value[n=max(n)]) %>%
ungroup() %>%
mutate(id1 =as.integer(gl(n(),max(id),n()))) %>%
group_by(id, id1) %>%
dplyr::slice(1) %>%
mutate(name = paste(name, id, sep="_")) %>%
ungroup() %>%
group_by(name) %>%
mutate(term = ifelse(row_number() == 2, NA_character_, x), .before=1) %>%
ungroup() %>%
select(-c(id, id1, value, n, name, x))
term intercept intercept_2 intercept_3
<chr> <chr> <chr> <chr>
1 age_ceo_state__rf 0.390*** 0.209*** 0.209***
2 NA (19.455) (9.449) (9.449)
3 age_firm_state__rf 0.673*** 0.405*** 0.405***
4 NA (23.409) (15.511) (15.511)
5 mktrf NA 0.417*** 0.417***
6 NA NA (13.255) (13.255)

Rowwise find most frequent term in dataframe column and count occurrences

I try to find the most frequent category within every row of a dataframe. A category can consist of multiple words split by a /.
library(tidyverse)
library(DescTools)
# example data
id <- c(1, 2, 3, 4)
categories <- c("apple,shoes/socks,trousers/jeans,chocolate",
"apple,NA,apple,chocolate",
"shoes/socks,NA,NA,NA",
"apple,apple,chocolate,chocolate")
df <- data.frame(id, categories)
# the solution I would like to achieve
solution <- df %>%
mutate(winner = c("apple", "apple", "shoes/socks", "apple"),
winner_count = c(1, 2, 1, 2))
Based on these answers I have tried the following:
Write a function that finds the most common word in a string of text using R
trial <- df %>%
rowwise() %>%
mutate(winner = names(which.max(table(categories %>% str_split(",")))),
winner_count = which.max(table(categories %>% str_split(",")))[[1]])
Also tried to follow this approach, however it also does not give me the required results
How to find the most repeated word in a vector with R
trial2 <- df %>%
mutate(winner = DescTools::Mode(str_split(categories, ","), na.rm = T))
I am mainly struggling because my most frequent category is not just one word but something like "shoes/socks" and the fact that I also have NAs. I don't want the NAs to be the "winner".
I don't care too much about the ties right now. I already have a follow up process in place where I handle the cases that have winner_count = 2.
split the categories on comma in separate rows, count their occurrence for each id, drop the NA values and select the top occurring row for each id
library(dplyr)
library(tidyr)
df %>%
separate_rows(categories, sep = ',') %>%
count(id, categories, name = 'winner_count') %>%
filter(categories != 'NA') %>%
group_by(id) %>%
slice_max(winner_count, n = 1, with_ties = FALSE) %>%
ungroup %>%
rename(winner = categories) %>%
left_join(df, by = 'id') -> result
result
# id winner winner_count categories
# <dbl> <chr> <int> <chr>
#1 1 apple 1 apple,shoes/socks,trousers/jeans,chocolate
#2 2 apple 2 apple,NA,apple,chocolate
#3 3 shoes/socks 1 shoes/socks,NA,NA,NA
#4 4 apple 2 apple,apple,chocolate,chocolate

Using pivot_longer to separate columns into long format

I have a df that is of non-finite length that looks like the table below.
The example here only has 2 traits: "lipids" and "density". Other rows may have 50 traits or more. But will always have the same pattern of trait, unit, method. When importing into R using read_excel it changes non unique names to xxx...[col.number]. I want to use pivot_longer to cast the data into a long format from wide. I'm having difficulty manipulating the function and would appreciate some help. The final column names I would like would be geno_name, observation_id, trait, value, unit, method
Sample Data
Desired Output (without the drop_na statement to show example)
x <- structure(list(geno_name = "MB mixed", observation_id = 10, lipids = NA,
unit...3 = NA, method...4 = NA, density = 1.125, unit...6 = "g cm^-3",
method...7 = "3D scanning"), class = "data.frame", row.names = c(NA,-1L))
So far I have:
x %>% pivot_longer(
cols = 3:ncol(x),
names_to = c("trait","unit","method"),
#need help with these other arguments
values_drop_na = T)
The data column names to be used in 'long' format doesn't all have the same pattern in column names. Therefore, the steps included are
rename columns that doesn't have the ... or _ in their column names by adding those with paste/str_c
reshape to long format with pivot_longer - taking into account the pattern in names with either names_sep or names_pattern, specify the names_to as a vector of c(".value", "trait") in the same order we want the column values and the suffix value to be stored as separate columns
Once we reshaped, create a grouping column based on the values in the 'trait' (some of them are numbers - create a logical vector and get the cumulative sum) along with the other grouping 'geno_name', 'observation_id' (which doesn't create a unique column though))
Now summarise the other columns by slicing the first row after ordering based on NA elements i.e. if there are no NA, the first value will be non-NA or else it will be NA
library(dplyr)
library(stringr)
library(tidyr)
x %>%
rename_at(vars(names(.)[!str_detect(names(.), "[_.]+")]),
~ str_c("value...", .)) %>%
pivot_longer(cols = 3:ncol(.),
names_to = c(".value", "trait"), names_sep = "\\.+") %>%
group_by(geno_name, observation_id,
grp = cumsum(str_detect(trait, "\\D+"))) %>%
summarise(across(everything(), ~ .[order(is.na(.))][1]),
.groups = 'drop') %>%
select(-grp)
-output
# A tibble: 2 x 6
# geno_name observation_id trait value unit method
# <chr> <dbl> <chr> <dbl> <chr> <chr>
#1 MB mixed 10 lipids NA <NA> <NA>
#2 MB mixed 10 density 1.12 g cm^-3 3D scanning
data
x <- structure(list(geno_name = "MB mixed", observation_id = 10, lipids = NA,
unit...3 = NA, method...4 = NA, density = 1.125, unit...6 = "g cm^-3",
method...7 = "3D scanning"), class = "data.frame", row.names = c(NA,
-1L))

Select unique values

I need to change this function that doesn't match for unique values. For example, if I want MAPK4, the function matches MAPK41 and AMAPK4 etc. The function must select only the unique values.
Function:
library(dplyr)
df2 <- df %>%
rowwise() %>%
mutate(mutated = paste(mutated_genes[unlist(
lapply(mutated_genes, function(x) grepl(x,genes, ignore.case = T)))], collapse=","),
circuit_name = gsub("", "", circuit_name)) %>%
select(-genes) %>%
data.frame()
data:
df <-structure(list(circuit_name = c("hsa04010__117", "hsa04014__118" ), genes = c("MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP3*,DUSP3*,DUSP3*,DUSP3*,PPM1A,AKT3,AKT3,AKT3,ZAK,MAP3K12,MAP3K13,TRAF2,CASP3,IL1R1,IL1R1,TNFRSF1A,IL1A,IL1A,TNF,RAC1,RAC1,RAC1,RAC1,MAP2K7,MAPK8,MAPK8,MAPK8,MECOM,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,MAP4K3,MAPK8IP2,MAP4K1", "MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*")), class = "data.frame", row.names = c(NA, -2L))
mutated_genes <- c("MAP4K4", "MAP3K12","TRAF2", "CACNG3")
output:
circuit_name mutated
1 hsa04010__117 MAP4K4,TRAF2
2 hsa04014__118 MAP4K4
A base R approach would be by splitting the genes on "," and return those string which match mutated_genes.
df$mutated <- sapply(strsplit(df$genes, ","), function(x)
toString(grep(paste0(mutated_genes, collapse = "|"), x, value = TRUE)))
df[c(1, 3)]
# circuit_name mutated
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Please note that based on the mutated_genes vector, your expected output is missing MAP3K12 for hsa04010__117.
Here is a tidyverse possibility
df %>%
separate_rows(genes) %>%
filter(genes %in% mutated_genes) %>%
group_by(circuit_name) %>%
summarise(mutated = toString(genes))
## A tibble: 2 x 2
# circuit_name mutated
# <chr> <chr>
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Explanation: We separate comma-separated entries into different rows, then select only those rows where genes %in% mutated_genes and summarise results per circuit_name by concatenating genes entries.
PS. Personally I'd recommend keeping the data in a tidy long format (i.e. don't concatenate entries with toString); that way you have one row per gene, which will make any post-processing of the data much more straightforward.
We can use str_extract
library(stringr)
df$mutated <- sapply(str_extract_all(df$genes, paste(mutated_genes,
collapse="|")), toString)

Resources