Several Model Estimates Table - Rearranging columns and rows (reprex included)

I'm working on creating a table of regression estimates from several models. Here is the data:
structure(list(term = c("age_ceo_state__rf", "", "mktrf", "",
NA, NA), intercept = c("0.390***", "(19.455)", "0.673***", "(23.409)",
NA, NA), term_2 = c("age_ceo_state__rf", "", "age_firm_state__rf",
"", "mktrf", ""), intercept_2 = c("0.209***", "(9.449)", "0.405***",
"(15.511)", "0.417***", "(13.255)"), term_3 = c("age_ceo_state__rf",
"", "age_firm_state__rf", "", "mktrf", ""), intercept_3 = c("0.209***",
"(9.449)", "0.405***", "(15.511)", "0.417***", "(13.255)")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
Here is how it looks right now:
And target table:
And yes, the term_2 and term_3 coefficients are identical even though they come from different models. I'm looking for a programmatic way to take the most complete set of terms, move it into the first term column (note that the order of the terms changes), and leave the missing cells blank. This is a common layout used by many regression reporting packages; I just can't work out an elegant, flexible way to move the terms around. Apologies for tagging modelsummary, an excellent R package for regression tables: this problem does not directly involve it, but its author may have some insight into how to handle it.

This one is quite clumsy. But I think you are looking for something like this?
library(dplyr)
library(tidyr)
df %>%
mutate(id =as.integer(gl(n(),2,n()))) %>%
pivot_longer(starts_with("term")) %>%
group_by(id) %>%
add_count(value) %>%
mutate(x = value[n=max(n)]) %>%
ungroup() %>%
mutate(id1 =as.integer(gl(n(),max(id),n()))) %>%
group_by(id, id1) %>%
dplyr::slice(1) %>%
mutate(name = paste(name, id, sep="_")) %>%
ungroup() %>%
group_by(name) %>%
mutate(term = ifelse(row_number() == 2, NA_character_, x), .before=1) %>%
ungroup() %>%
select(-c(id, id1, value, n, name, x))
term intercept intercept_2 intercept_3
<chr> <chr> <chr> <chr>
1 age_ceo_state__rf 0.390*** 0.209*** 0.209***
2 NA (19.455) (9.449) (9.449)
3 age_firm_state__rf 0.673*** 0.405*** 0.405***
4 NA (23.409) (15.511) (15.511)
5 mktrf NA 0.417*** 0.417***
6 NA NA (13.255) (13.255)
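For comparison, here is a minimal base R sketch of a different approach: stack each (term, estimate) column pair into long form, then match on the term name when rebuilding the table. It assumes every term occupies exactly two rows (estimate, then statistic with a blank term) and that the row order should follow the model with the most terms; the helper name stack_model and the "estimate"/"statistic" labels are made up for illustration.

```r
# Sample data from the question, rebuilt as a plain data frame
df <- data.frame(
  term = c("age_ceo_state__rf", "", "mktrf", "", NA, NA),
  intercept = c("0.390***", "(19.455)", "0.673***", "(23.409)", NA, NA),
  term_2 = c("age_ceo_state__rf", "", "age_firm_state__rf", "", "mktrf", ""),
  intercept_2 = c("0.209***", "(9.449)", "0.405***", "(15.511)", "0.417***", "(13.255)"),
  term_3 = c("age_ceo_state__rf", "", "age_firm_state__rf", "", "mktrf", ""),
  intercept_3 = c("0.209***", "(9.449)", "0.405***", "(15.511)", "0.417***", "(13.255)"),
  stringsAsFactors = FALSE
)

term_cols  <- grep("^term", names(df), value = TRUE)
value_cols <- grep("^intercept", names(df), value = TRUE)

# Hypothetical helper: one model's column pair -> long rows (term, part, model, value)
stack_model <- function(term_col, value_col) {
  keep  <- !is.na(df[[value_col]])
  term  <- df[[term_col]][keep]
  value <- df[[value_col]][keep]
  # assumption: rows come in pairs, so copy each term name onto its statistic row
  term <- term[rep(seq(1, length(term), by = 2), each = 2)]
  data.frame(term = term,
             part = rep(c("estimate", "statistic"), length.out = length(term)),
             model = value_col, value = value, stringsAsFactors = FALSE)
}
long <- do.call(rbind, Map(stack_model, term_cols, value_cols))

# Take the term order from the model with the most distinct terms
n_terms    <- tapply(long$term, long$model, function(x) length(unique(x)))
full_terms <- unique(long$term[long$model == names(which.max(n_terms))])

# Skeleton with two rows per term, then fill each model column by matching on (term, part)
wide <- expand.grid(part = c("estimate", "statistic"), term = full_terms,
                    stringsAsFactors = FALSE)[c("term", "part")]
for (m in value_cols) {
  one <- long[long$model == m, ]
  idx <- match(paste(wide$term, wide$part), paste(one$term, one$part))
  wide[[m]] <- ifelse(is.na(idx), "", one$value[idx])
}
wide$term[wide$part == "statistic"] <- ""
wide$part <- NULL
wide
```

Because the matching is by term name, the first model's mktrf estimate stays on the mktrf row, and its age_firm_state__rf cells come out blank.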


Filter based on different conditions at different positions in a string in R

The middle part of the string is the ID, and I want only one occurrence of each ID. If there is more than one observation with the same six middle letters, I need to keep the one that says "07" rather than "08", or "A" rather than "B". I want to completely exclude if the number is "02". Other than that, if there is only one occurrence of the ID, I want to keep it. So if I had:
col1
ID-1-AMBCFG-07A-01
ID-1-CGUMBD-08A-01
ID-1-XDUMNG-07B-01
ID-1-XDUMNG-08B-01
ID-1-LOFBUM-02A-01
ID-1-ABYEMJ-08A-01
ID-1-ABYEMJ-08B-01
Then I would want:
col1
ID-1-AMBCFG-07A-01
ID-1-CGUMBD-08A-01
ID-1-XDUMNG-07B-01
ID-1-ABYEMJ-08A-01
I am thinking maybe I can use group_by to specify the 6 letter ID, and then some kind of if_else statement? But I can't figure out how to specify the positions of the characters in the string. Any help is greatly appreciated!
Using extract and some dplyr wrangling:
library(tidyr)
library(dplyr)
df1 %>%
extract(col1, "ID-\\d-(.*)-(\\d*)(A|B)-01",
into = c("ID", "number", "letter"),
remove = FALSE, convert = TRUE) %>%
filter(number != 2) %>%
# sort so the preferred row (lowest number, then letter) comes first within each ID
arrange(ID, number, letter) %>%
group_by(ID) %>%
slice(1) %>%
ungroup() %>%
select(col1)
# col1
#1 ID-1-ABYEMJ-08A-01
#2 ID-1-AMBCFG-07A-01
#3 ID-1-CGUMBD-08A-01
#4 ID-1-XDUMNG-07B-01
An option with str_detect
library(stringr)
library(dplyr)
df1 %>%
group_by(ID = str_extract(col1, "ID-\\d+-\\w+")) %>%
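# drop rows whose number field is 02; row_number() == 1 assumes the preferred row comes first within each ID
filter(str_detect(col1, "-02[AB]-", negate = TRUE), row_number() == 1) %>%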
ungroup %>%
select(-ID)
Output:
# A tibble: 4 × 1
col1
<chr>
1 ID-1-AMBCFG-07A-01
2 ID-1-CGUMBD-08A-01
3 ID-1-XDUMNG-07B-01
4 ID-1-ABYEMJ-08A-01
data
df1 <- structure(list(col1 = c("ID-1-AMBCFG-07A-01", "ID-1-CGUMBD-08A-01",
"ID-1-XDUMNG-07B-01", "ID-1-XDUMNG-08B-01", "ID-1-LOFBUM-02A-01",
"ID-1-ABYEMJ-08A-01", "ID-1-ABYEMJ-08B-01")), class = "data.frame",
row.names = c(NA,
-7L))
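On the original question of how to address fixed positions in the string: a base R sketch, assuming every value has the five dash-separated fields shown in the sample and that the fourth field is always two digits followed by one letter. It uses strsplit and substr only; the variable names are illustrative.

```r
df1 <- data.frame(col1 = c("ID-1-AMBCFG-07A-01", "ID-1-CGUMBD-08A-01",
                           "ID-1-XDUMNG-07B-01", "ID-1-XDUMNG-08B-01",
                           "ID-1-LOFBUM-02A-01", "ID-1-ABYEMJ-08A-01",
                           "ID-1-ABYEMJ-08B-01"),
                  stringsAsFactors = FALSE)

# Split on "-" and pick fields by position (assumes 5 fields per string)
parts  <- do.call(rbind, strsplit(df1$col1, "-", fixed = TRUE))
id     <- parts[, 3]
number <- as.integer(substr(parts[, 4], 1, 2))  # characters 1-2 of field 4
letter <- substr(parts[, 4], 3, 3)              # character 3 of field 4

# Order best-first within each id, drop the "02" rows, keep the first row per id
ord  <- order(id, number, letter)
ord  <- ord[number[ord] != 2]
best <- ord[!duplicated(id[ord])]

# Restore the original row order
result <- df1[sort(best), , drop = FALSE]
result
```

Sorting the kept indices at the end preserves the input order, which is why the rows appear in the same order as in the question's expected output.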

R transform dataframe by parsing columns

Context
I have created a small sample dataframe to explain my problem. The original one is larger, as it has many more columns. But it is formatted in the same way.
df = data.frame(Case1.1.jpeg.text="the",
Case1.1.jpeg.text.1="big",
Case1.1.jpeg.text.2="DOG",
Case1.1.jpeg.text.3="10197",
Case1.2.png.text="framework",
Case1.3.jpg.text="BE",
Case1.3.jpg.text.1="THE",
Case1.3.jpg.text.2="Change",
Case1.3.jpg.text.3="YOUWANTTO",
Case1.3.jpg.text.4="SEE",
Case1.3.jpg.text.5="in",
Case1.3.jpg.text.6="theWORLD",
Case1.4.png.text="09.80.56.60.77")
The dataframe consists of output from a text detection ML model based on a certain number of input images.
The output format makes each word for each image a separate column, thereby creating a very wide dataset.
Desired Output
I am looking to create a cleaner version of it, with one column containing the image name (e.g. Case1.2.png) and the second with the concatenation of all possible words that the model finds in that particular image (the number of words varies from image to image).
result = data.frame(Case=c('Case1.1.jpeg','Case1.2.png','Case1.3.jpg','Case1.4.png'),
Text=c('thebigDOG10197','framework','BETHEChangeYOUWANTTOSEEintheWORLD','09.80.56.60.77'))
I have tried many approaches based on similar questions found on Stackoverflow, but none seem to give me the exact output I'm looking for.
Any help on this would be greatly appreciated.
library(tidyr)
library(dplyr)
df %>%
pivot_longer(cols = everything(),
names_pattern = "(.*)\\.(text.*)",
names_to = c("Case", NA)) %>%
group_by(Case) %>%
summarize(value = paste(value, collapse = ""), .groups = "drop")
Alternatively, this can be accomplished using just the pivot functions from tidyr:
library(tidyr)
library(stringr)
df %>%
pivot_longer(cols = everything(),
names_pattern = "(.*)\\.(text).*",
names_to = c("Case", "cols")) %>%
pivot_wider(id_cols = Case,
values_from = value,
names_from = cols,
values_fn = str_flatten)
Output
Case value
<chr> <chr>
1 Case1.1.jpeg thebigDOG10197
2 Case1.2.png framework
3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png 09.80.56.60.77
A possible solution:
library(tidyverse)
df %>%
pivot_longer(everything()) %>%
mutate(name = str_remove(name, "\\.text\\.*\\d*")) %>%
group_by(name) %>%
summarise(text = str_c(value, collapse = ""))
#> # A tibble: 4 x 2
#> name text
#> <chr> <chr>
#> 1 Case1.1.jpeg thebigDOG10197
#> 2 Case1.2.png framework
#> 3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
#> 4 Case1.4.png 09.80.56.60.77
An option in base R is to stack the data into a two-column data.frame with stack and then paste by group with aggregate:
aggregate(cbind(Text = values) ~ Case, transform(stack(df),
Case = trimws(ind, whitespace = "\\.text.*")), FUN = paste, collapse = "")
Case Text
1 Case1.1.jpeg thebigDOG10197
2 Case1.2.png framework
3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png 09.80.56.60.77
You can use pivot_longer(everything()), manipulate the "Case" column, group, and paste together:
library(tidyverse)
pivot_longer(df, everything(), names_to = "Case") %>%
# escape the dot so only a literal ".text" suffix is removed
mutate(Case = str_remove(Case, "\\.text.*")) %>%
group_by(Case) %>% summarize(Text = paste(value, collapse = ""))
Output:
Case Text
<chr> <chr>
1 Case1.1.jpeg thebigDOG10197
2 Case1.2.png framework
3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png 09.80.56.60.77
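All of the answers above depend on the same idea: strip the ".text" or ".text.N" suffix from each column name to recover the image name, then paste the words per image. A dependency-free sketch of just that idea, assuming the data frame has a single row and that the columns are already in word order:

```r
df <- data.frame(Case1.1.jpeg.text = "the", Case1.1.jpeg.text.1 = "big",
                 Case1.1.jpeg.text.2 = "DOG", Case1.1.jpeg.text.3 = "10197",
                 Case1.2.png.text = "framework",
                 Case1.3.jpg.text = "BE", Case1.3.jpg.text.1 = "THE",
                 Case1.3.jpg.text.2 = "Change", Case1.3.jpg.text.3 = "YOUWANTTO",
                 Case1.3.jpg.text.4 = "SEE", Case1.3.jpg.text.5 = "in",
                 Case1.3.jpg.text.6 = "theWORLD",
                 Case1.4.png.text = "09.80.56.60.77",
                 stringsAsFactors = FALSE)

# Image name = column name minus the trailing ".text" or ".text.<n>"
case  <- sub("\\.text(\\.[0-9]+)?$", "", names(df))
words <- unlist(df[1, ], use.names = FALSE)

# Paste the words of each image together, keeping images in first-seen order
text <- tapply(words, factor(case, levels = unique(case)), paste, collapse = "")
out  <- data.frame(Case = names(text), Text = as.vector(text),
                   stringsAsFactors = FALSE)
out
```

Anchoring the pattern with `$` means a dot-text sequence in the middle of an image name would be left alone, which the looser ".text.*" patterns do not guarantee.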

How to reshape complicated elections data using tidyverse?

I need to reshape a complicated table from rows of stacked election data to cleanly formatted columns containing all the information. I'm having trouble automating this.
Here's a simple version of the input data. Note that there are just 2 elections in this example; in the real data there are many, so the code needs to generalize:
input <-
structure(list(a = c("2020 ge", "winner", NA, "2016 ge", "winner"
), b = c(NA, "orange (cat)", NA, NA, "peach (kitten)"), c = c(NA,
"runner up", NA, NA, "runner up"), d = c(NA, "peach (kitten)", NA,
NA, "orange (cat)"), e = c(NA, "margin", NA, NA, "margin"), f = c(NA,
100, NA, NA, 150)), row.names = c(NA, 5L), class = "data.frame")
And this is the output I would like:
output <-
structure(list(`2020_winner_name` = "orange", `2020_winner_party` = "cat",
`2020_runner_up_name` = "peach", `2020_runner_up_party` = "kitten",
`2020_margin` = 100, `2016_winner_name` = "peach", `2016_winner_party` = "kitten",
`2016_runner_up_name` = "orange", `2016_runner_up_party` = "cat",
`2016_margin` = 150), row.names = 1L, class = "data.frame")
Here is what I've tried so far, which works for one year:
# test data
test <-
input %>%
slice(1:2) %>%
fill(c(b, c, d, e, f), .direction = c("up"))
# select first row
row_one <-
test %>%
select(a) %>%
slice(1)
# select year
year <-
str_extract(row_one$a, "^([0-9]*)")
# select second row as name
row_two <-
test %>%
select(a) %>%
slice(2) %>%
as.character()
# bring back to test data
test <-
test %>%
mutate(a = row_two) %>%
slice(1) %>%
add_row() %>%
fill(c(b, d, f)) %>%
mutate(a = ifelse(is.na(a), b, a),
c = ifelse(is.na(c), d, c),
e = ifelse(is.na(e), f, e)) %>%
select(a, c, e) %>%
row_to_names(1) %>%
rename_all(funs(paste0(year, "_", .)))
# extract party variable
test <-
test %>%
mutate_at(vars(contains("winner"), contains("runner")),
funs(party = str_extract(., "(?<=\\().+?(?=\\))"))) %>%
mutate_at(vars(ends_with("winner"), ends_with("up")),
funs(name = str_extract(., "([^()]*)")))
What would be an easier and more concise way to do this, given the unusual data format? How could I automate this so that I can run it over multiple election years?
Thank you.
First off, I agree with @deschen that this is very messy data. Rather than trying to tidy/reshape the data as provided, I would recommend exploring whether the source data can be parsed in a better (tidier) way.
Having said that, it is possible to reshape & tidy data into your expected output. Mind you, this is a fairly messy procedure and I have no idea how well this generalises on bigger data.
library(tidyverse)
# Define a convenience function that turns a vector with an even number of elements
# into a named vector where every odd element is the name of the following even element
to_named_vec <- function(x) {
if (length(x) == 1) return(magrittr::set_names(x, "margin"))
nm <- x[c(TRUE, FALSE)]
vec <-x[c(FALSE, TRUE)]
return(magrittr::set_names(vec, nm))
}
# First convert the input into a nested `list`
lst <- input %>%
t() %>%
as.character() %>%
discard(is.na) %>%
split(., cumsum(str_detect(., "\\d{4}"))) %>%
map(~ .x %>%
str_remove(" ge") %>%
stringi::stri_replace_all_regex("(\\w+)\\s\\((\\w+)\\)", "name_$1_party_$2") %>%
str_split("_") %>%
unlist()) %>%
magrittr::set_names(map_chr(., head, 1)) %>%
map(~ .x[-1] %>%
split(cumsum(str_detect(.x[-1], "(winner|runner up|margin)"))) %>%
magrittr::set_names(map_chr(., head, 1)) %>%
map(~ .x %>% tail(-1) %>% to_named_vec() %>% bind_rows()))
# The last step involves `unlist`ing the nested `list`, tidying the names and
# converting the named vector into a `tibble` with `bind_rows`.
lst %>%
unlist() %>%
set_names(., str_replace_all(names(.), "\\.", "_")) %>%
set_names(., str_replace(names(.), "_margin", "")) %>%
bind_rows()
## A tibble: 1 x 10
#`2020_winner_na~ `2020_winner_pa~ `2020_runner up~ `2020_runner up~ `2020_margin` `2016_winner_na~
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 orange cat peach kitten 100 peach
## ... with 4 more variables: `2016_winner_party` <chr>, `2016_runner up_name` <chr>, `2016_runner
## up_party` <chr>, `2016_margin` <chr>
It's best to step through the code line by line to understand what each step does. Roughly:
1. We transpose input, convert the resulting matrix into a character vector, discard NAs, and split the vector on each occurrence of "\d{4}" (i.e. the year of the GE).
2. We then operate on every list element separately: we remove the string " ge", replace entries of the form "orange (cat)" with "name_orange_party_cat", and split entries on "_".
3. The rest is a matter of giving the nested list elements proper names, taken from the list elements themselves (the first entry of each).
4. The final step unlists the nested list and tidies the names of the named vector to match those in your expected output.
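The same idea (flatten the cells row by row, start a new block at each year, then read label/value pairs) can be sketched in base R. This is not the code above rewritten, just a simplified illustration; it assumes every block starts with a cell beginning with a four-digit year and that the remaining cells alternate label, value. The helper parse_block is hypothetical.

```r
input <- data.frame(
  a = c("2020 ge", "winner", NA, "2016 ge", "winner"),
  b = c(NA, "orange (cat)", NA, NA, "peach (kitten)"),
  c = c(NA, "runner up", NA, NA, "runner up"),
  d = c(NA, "peach (kitten)", NA, NA, "orange (cat)"),
  e = c(NA, "margin", NA, NA, "margin"),
  f = c(NA, 100, NA, NA, 150),
  stringsAsFactors = FALSE
)

# Flatten row by row into one character vector, dropping NAs
cells <- unlist(lapply(seq_len(nrow(input)), function(i) {
  vals <- unlist(lapply(input[i, ], as.character), use.names = FALSE)
  vals[!is.na(vals)]
}))

# A new block starts at every cell that begins with a four-digit year
blocks <- split(cells, cumsum(grepl("^[0-9]{4}", cells)))

# Hypothetical helper: turn one block into a named list of columns
parse_block <- function(block) {
  year   <- sub("^([0-9]{4}).*", "\\1", block[1])
  labels <- gsub(" ", "_", block[-1][c(TRUE, FALSE)])  # odd cells: labels
  values <- block[-1][c(FALSE, TRUE)]                  # even cells: values
  out <- list()
  for (k in seq_along(labels)) {
    key <- paste(year, labels[k], sep = "_")
    if (grepl("\\(", values[k])) {
      # "orange (cat)" -> name "orange", party "cat"
      out[[paste0(key, "_name")]]  <- sub("\\s*\\(.*$", "", values[k])
      out[[paste0(key, "_party")]] <- sub("^.*\\((.*)\\)$", "\\1", values[k])
    } else {
      out[[key]] <- as.numeric(values[k])
    }
  }
  out
}

flat <- do.call(c, unname(lapply(blocks, parse_block)))
wide <- as.data.frame(flat, optional = TRUE)  # optional = TRUE keeps names like 2020_margin
names(wide)
```

Unlike the tidyverse version, this keeps the margins numeric and writes runner_up with an underscore, as in the expected output.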

Creating a df of unique combinations of columns in R where order doesn't matter

I want to create a df with all of the unique combinations of three columns where the order of the values doesn't matter. In my example, I want to list every combination of ideologies that a group of three people could have.
In my example, "No opinion", "Moderate", "Conservative" is the same as "Conservative", "No opinion", "Moderate", which is the same as "Moderate", "No opinion", "Conservative", etc. All of these combinations should be represented by one row.
I've seen similar threads about using distinct for home and away sports teams, but I don't think this is working for this problem.
library(tidyverse)
political_spectrum_values =
factor(c("Far left",
"Liberal",
"Moderate",
"Conservative",
"Far right",
"No opinion"),
ordered = TRUE)
political_groups_of_3 <-
crossing(first_person = political_spectrum_values,
second_person = political_spectrum_values,
third_person = political_spectrum_values)
I've considered making some kind of combined variable by piping into this line, but I'm not sure how to take it from here
unite(col = "group_composition", c(first_person, second_person, third_person), sep = "_")
EDIT: After working with this problem longer I've reshaped the data in a way that might make this easier
crossing(first_person = political_spectrum_values,
second_person = political_spectrum_values,
third_person = political_spectrum_values) %>%
mutate(group_n = row_number()) %>%
pivot_longer(cols = c(first_person, second_person, third_person),
values_to = "ideology",
names_to = "group") %>%
select(-group)
Here's a trick you can use. Instead of starting with the names of the political leanings, start with the numbers 5^(0:5). Notice that the sum of any length-3 combination will be unique, since 3 times 5^x is less than 5^(x+1). So if you run expand.grid (equivalent to crossing) on three such vectors and take the row sums, then the positions of the unique sums will be the same as the positions of the unique combinations of names in your crossing result.
So you could just do this one-liner:
political_groups_of_3[!duplicated(rowSums(expand.grid(5^(0:5), 5^(0:5), 5^(0:5)))), ]
which gives:
#> # A tibble: 56 x 3
#> first_person second_person third_person
#> <ord> <ord> <ord>
#> 1 Conservative Conservative Conservative
#> 2 Conservative Conservative Far left
#> 3 Conservative Conservative Far right
#> 4 Conservative Conservative Liberal
#> 5 Conservative Conservative Moderate
#> 6 Conservative Conservative No opinion
#> 7 Conservative Far left Far left
#> 8 Conservative Far left Far right
#> 9 Conservative Far left Liberal
#> 10 Conservative Far left Moderate
#> # ... with 46 more rows
Whether this is "more elegant" or just an opaque hack is a matter of taste of course...
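The uniqueness claim can be checked directly. The sketch below enumerates every unordered triple (with repetition) of the six weights and confirms that no two triples share a sum; the index names i, j, k are only illustrative. The argument works because each weight is used at most 3 times and 3 is smaller than the base 5, so the sum is effectively a base-5 number whose digits count how often each weight appears. For larger groups the base would have to exceed the group size.

```r
weights <- 5^(0:5)

# All index triples with i <= j <= k, i.e. unordered triples with repetition
idx <- expand.grid(i = 1:6, j = 1:6, k = 1:6)
idx <- idx[idx$i <= idx$j & idx$j <= idx$k, ]

sums <- weights[idx$i] + weights[idx$j] + weights[idx$k]

nrow(idx)             # number of unordered triples, equal to choose(6 + 3 - 1, 3)
anyDuplicated(sums)   # 0 means every triple has a distinct sum
```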
A base R method is to create all combinations of political_spectrum_values taken 3 at a time using expand.grid, sort each row, and select the unique rows.
df <- expand.grid(first_person = political_spectrum_values,
second_person = political_spectrum_values,
third_person = political_spectrum_values)
df[] <- t(apply(df, 1, sort))
unique(df)
If needed as a single string
unique(apply(df, 1, function(x) paste0(sort(x), collapse = "_")))
Here is a two-step solution using gtools::combinations and paste.
library(gtools)
#Get all combinations with repeats for the political_spectrum_values in groups of 3
combs<-combinations(nlevels(political_spectrum_values),
3,
as.character(political_spectrum_values),
repeats.allowed = TRUE)
#Collapse each row in a single entry and convert it into a data.frame
combs<-data.frame(group_composition = apply(combs,
1,
function(x) paste(x, collapse = "_")))
Here's an answer that combines the reshaping from my edit above with unite. I'll leave this open a little longer in case anyone has a more elegant solution:
crossing(first_person = political_spectrum_values,
second_person = political_spectrum_values,
third_person = political_spectrum_values) %>%
mutate(group_n = row_number()) %>%
pivot_longer(cols = c(first_person, second_person, third_person),
values_to = "ideology",
names_to = "group") %>%
select(-group) %>%
group_by(group_n) %>%
arrange(ideology) %>%
mutate(person = row_number()) %>%
pivot_wider(id_cols = group_n, values_from = ideology, names_from = person) %>%
unite(col = "group_composition", c(`1`, `2`, `3`), sep = "_") %>%
ungroup() %>%
distinct(group_composition)

How do I merge rows while also moving columns to the merged row?

Hello everyone and thanks for reading my question.
I have the following in R:
**type, status, count**
human, living, 36
human, living, 57
human, dead, 43
mouse, living, 4
mouse, dead, 8
What I want to do is merge the rows based on 'type' (so 'type' would be exclusive) and then move the contents of 'status' and 'count' to the merged row and add some symbols as shown below:
**type, status, count**
human, living = "36, 57", dead = "43"
mouse, living = "4", dead = "8"
I did manage to merge the rows in R (sort of) but I cannot figure out how to move the status and count to the merged row and lay them out as shown.
I don't have to use R but I thought R was the most suitable way of doing this but I could use anything to get the job done. Any help would be greatly appreciated.
Many thanks.
Edit: This is the final solution which worked great (thanks loads to gersht):
rm(list=ls());
library(tidyr)
library(dplyr)
df <- read.table("D:/test.csv", header = TRUE, sep=",")
df <- df %>%
group_by(type, status) %>%
mutate(count = paste(count, collapse = ", ")) %>%
ungroup() %>%
distinct() %>%
spread(status, count) %>%
mutate(dead = paste("dead = ", dead),
living = paste("living = ", living))
write.table(df, col.names = FALSE)
This will return a dataframe with correct values, more or less. You can change column order and column names as needed:
library(tidyr)
library(dplyr)
df %>%
group_by(type, status) %>%
mutate(count = paste(count, collapse = ", ")) %>%
ungroup() %>%
distinct() %>%
spread(status, count) %>%
mutate(dead = paste("dead = ", dead),
living = paste("living = ", living))
#### OUTPUT ####
# A tibble: 2 x 3
type dead living
<chr> <chr> <chr>
1 human dead = 43 living = 36, 57
2 mouse dead = 8 living = 4
I've simply grouped by type and status so I can then collapse the values of count into a single string using mutate(). I use ungroup() as good practice, but it isn't strictly necessary.
This creates some duplicates, which I remove with distinct(). I then use the spread() function to move living and dead to their own columns, and then I use mutate again to add strings "living = " and "dead = " to their respective columns.
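The same two steps (collapse the counts per type and status, then lay the statuses out side by side) can also be sketched in base R with aggregate, producing the quoted one-line-per-type layout from the question. This assumes the statuses should appear in order of first appearance in the data; the column name label is made up for illustration.

```r
df <- data.frame(type = c("human", "human", "human", "mouse", "mouse"),
                 status = c("living", "living", "dead", "living", "dead"),
                 count = c(36, 57, 43, 4, 8),
                 stringsAsFactors = FALSE)

# Step 1: one row per (type, status), with the counts pasted together
collapsed <- aggregate(count ~ type + status, data = df,
                       FUN = function(x) paste(x, collapse = ", "))

# Step 2: build 'status = "counts"' labels, then order by type and status
collapsed$label  <- sprintf('%s = "%s"', collapsed$status, collapsed$count)
collapsed$status <- factor(collapsed$status, levels = unique(df$status))
collapsed <- collapsed[order(collapsed$type, collapsed$status), ]

# Step 3: one row per type, labels joined in that order
merged <- aggregate(label ~ type, data = collapsed, FUN = paste, collapse = ", ")
lines  <- paste(merged$type, merged$label, sep = ", ")
writeLines(lines)
```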
data
structure(list(type = c("human", "human", "human", "mouse", "mouse"
), status = c("living", "living", "dead", "living", "dead"),
count = c(36, 57, 43, 4, 8)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
