How to handle NA with str_c in this case - r

I got help with this questions a while ago:
How to replace multiple values in a string depending on a list of keys
Now I need to take into account that some keys are not to be "translated". So in this case I want the key1-4 should be translated into code1-4. I want it to be able to handle keys that arent in the key_code translation. If I add a key that is missing from the keycode, say keyx, somewhere where there is already another valid key value, I can just filter the NA that appears when joining with the key_codes. But if I have an id which has only the keyx value, that whole row dissapears and I want to keep it (it can show up as NA for example). Any ideas on how to solve that?
library(dplyr)
library(tidyr)
library(stringr)
values = tibble(id = 1:4, values = c("key1;keyx", "key3;key4;key1", "key2;key1", "keyx"))
key_code = tibble(key = c("key1", "key2", "key3", "key4"), code = c("code1", "code2", "code3", "code4"))
values %>%
separate_rows(values) %>%
left_join(key_code, by = c("values" = "key")) %>%
group_by(id) %>%
filter(!is.na(code)) %>%
summarise(code = str_c(code, collapse=";"))

We could use an if/else condition to check if all the elements in the 'code' are NA, then return NA or else to paste the non-NA elements
library(dplyr)
library(tidyr)
library(stringr)
values %>%
separate_rows(values) %>%
left_join(key_code, by = c("values" = "key")) %>%
group_by(id) %>%
summarise(code = if(all(is.na(code))) NA_character_ else
str_c(str_replace_na(code, ""), collapse=";"), .groups = 'drop')
-output
# A tibble: 4 x 2
# id code
# <int> <chr>
#1 1 code1;
#2 2 code3;code4;code1
#3 3 code2;code1
#4 4 <NA>

Related

Filter based on different conditions at different positions in a string in R

The middle part of the string is the ID, and I want only one occurrence of each ID. If there is more than one observation with the same six middle letters, I need to keep the one that says "07" rather than "08", or "A" rather than "B". I want to completely exclude if the number is "02". Other than that, if there is only one occurrence of the ID, I want to keep it. So if I had:
col1
ID-1-AMBCFG-07A-01
ID-1-CGUMBD-08A-01
ID-1-XDUMNG-07B-01
ID-1-XDUMNG-08B-01
ID-1-LOFBUM-02A-01
ID-1-ABYEMJ-08A-01
ID-1-ABYEMJ-08B-01
Then I would want:
col1
ID-1-AMBCFG-07A-01
ID-1-CGUMBD-08A-01
ID-1-XDUMNG-07B-01
ID-1-ABYEMJ-08A-01
I am thinking maybe I can use group_by to specify the 6 letter ID, and then some kind of if_else statement? But I can't figure out how to specify the positions of the characters in the string. Any help is greatly appreciated!
Using extract and some dplyr wrangling:
library(tidyr)
library(dplyr)
df %>%
extract(col1, "ID-\\d-(.*)-(\\d*)(A|B)-01",
into = c("ID", "number", "letter"),
remove = FALSE, convert = TRUE) %>%
group_by(ID) %>%
filter(number != 2) %>%
slice_min(n = 1, order(number, letter)) %>%
ungroup() %>%
select(col1)
# col1
#1 ID-1-ABYEMJ-08A-01
#2 ID-1-AMBCFG-07A-01
#3 ID-1-CGUMBD-08A-01
#4 ID-1-XDUMNG-07B-01
An option with str_detect
library(stringr)
library(dplyr)
df1 %>%
group_by(ID = str_extract(col1, "ID-\\d+-\\w+")) %>%
filter(str_detect(col1, "02", negate = TRUE), row_number() == 1) %>%
ungroup %>%
select(-ID)
-output
# A tibble: 4 × 1
col1
<chr>
1 ID-1-AMBCFG-07A-01
2 ID-1-CGUMBD-08A-01
3 ID-1-XDUMNG-07B-01
4 ID-1-ABYEMJ-08A-01
data
df1 <- structure(list(col1 = c("ID-1-AMBCFG-07A-01", "ID-1-CGUMBD-08A-01",
"ID-1-XDUMNG-07B-01", "ID-1-XDUMNG-08B-01", "ID-1-LOFBUM-02A-01",
"ID-1-ABYEMJ-08A-01", "ID-1-ABYEMJ-08B-01")), class = "data.frame",
row.names = c(NA,
-7L))

R transform dataframe by parsing columns

Context
I have created a small sample dataframe to explain my problem. The original one is larger, as it has many more columns. But it is formatted in the same way.
df = data.frame(Case1.1.jpeg.text="the",
Case1.1.jpeg.text.1="big",
Case1.1.jpeg.text.2="DOG",
Case1.1.jpeg.text.3="10197",
Case1.2.png.text="framework",
Case1.3.jpg.text="BE",
Case1.3.jpg.text.1="THE",
Case1.3.jpg.text.2="Change",
Case1.3.jpg.text.3="YOUWANTTO",
Case1.3.jpg.text.4="SEE",
Case1.3.jpg.text.5="in",
Case1.3.jpg.text.6="theWORLD",
Case1.4.png.text="09.80.56.60.77")
The dataframe consists of output from a text detection ML model based on a certain number of input images.
The output format makes each word for each image a separate column, thereby creating a very wide dataset.
Desired Output
I am looking to create a cleaner version of it, with one column containing the image name (e.g. Case1.2.png) and the second with the concatenation of all possible words that the model finds in that particular image (the number of words varies from image to image).
result = data.frame(Case=c('Case1.1.jpeg','Case1.2.png','Case1.3.jpg','Case1.4.png'),
Text=c('thebigDOG10197','framework','BETHEChangeYOUWANTTOSEEintheWORLD','09.80.56.60.77'))
I have tried many approaches based on similar questions found on Stackoverflow, but none seem to give me the exact output I'm looking for.
Any help on this would be greatly appreciated.
library(tidyr)
library(dplyr)
df %>%
pivot_longer(cols = everything(),
names_pattern = "(.*)\\.(text.*)",
names_to = c("Case", NA)) %>%
group_by(Case) %>%
summarize(value = paste(value, collapse = ""), .groups = "drop")
Alternatively, this can be accomplished using just the pivot functions from tidyr:
library(tidyr)
library(stringr)
df %>%
pivot_longer(cols = everything(),
names_pattern = "(.*)\\.(text).*",
names_to = c("Case", "cols")) %>%
pivot_wider(id_cols = Case,
values_from = value,
names_from = cols,
values_fn = str_flatten)
Output
Case value
<chr> <chr>
1 Case1.1.jpeg thebigDOG10197
2 Case1.2.png framework
3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png 09.80.56.60.77
A possible solution:
library(tidyverse)
df %>%
pivot_longer(everything()) %>%
mutate(name = str_remove(name, "\\.text\\.*\\d*")) %>%
group_by(name) %>%
summarise(text = str_c(value, collapse = ""))
#> # A tibble: 4 x 2
#> name text
#> <chr> <chr>
#> 1 Case1.1.jpeg thebigDOG10197
#> 2 Case1.2.png framework
#> 3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
#> 4 Case1.4.png 09.80.56.60.77
An option in base R is stack the data into a two column data.frame with stack and then do a group by paste with aggregate
aggregate(cbind(Text = values) ~ Case, transform(stack(df),
Case = trimws(ind, whitespace = "\\.text.*")), FUN = paste, collapse = "")
Case Text
1 Case1.1.jpeg thebigDOG10197
2 Case1.2.png framework
3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png 09.80.56.60.77
You can use pivot_longer(everything()), manipulate the "Case" column, group, and paste together:
pivot_longer(df,everything(),names_to="Case") %>%
mutate(Case = str_remove_all(Case, ".text.*")) %>%
group_by(Case) %>% summarize(Text=paste(value, collapse=""))
Output:
Case Text
<chr> <chr>
1 Case1.1.jpeg thebigDOG10197
2 Case1.2.png framework
3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png 09.80.56.60.77

Apply functions on list columns that are not NULL in a tibble in R

I have a tibble similar to
x <- tibble(
id = 1:2,
var = list(NULL, tibble(x = c("yes", "no")))
)
I want to apply functions to the list-elements of var that are not NULL, e.g.
x %>%
mutate(var2 = map_chr(var, ~pull(.x) %>%
str_c(., collapse = ";"))
)
This, however, throws an error, as pull() cannot be applied to the class "NULL" object in the first row. (It works if I filter before, but I cannot drop cases where var is NULL as other variables contain values for such cases).
I failed to restrict the operation to rows where var is not NULL.
The following does not work, for example:
x %>%
mutate(
var2 = case_when(
!map_lgl(var, is.null) ~ map_chr(var, ~pull(.x) %>%
str_c(., collapse = ";"))))
I'd be grateful for a solution.
Here is an option where you unnest your column and then within group collapse your variable:
library(dplyr)
library(tidyr)
library(stringr)
x %>%
unnest(var, keep_empty = T) %>%
group_by(id) %>%
summarize(var2 = str_c(x, collapse = ";"), .groups = "drop")
Output
id var2
<int> <chr>
1 1 NA
2 2 yes;no

Rowwise find most frequent term in dataframe column and count occurrences

I try to find the most frequent category within every row of a dataframe. A category can consist of multiple words split by a /.
library(tidyverse)
library(DescTools)
# example data
id <- c(1, 2, 3, 4)
categories <- c("apple,shoes/socks,trousers/jeans,chocolate",
"apple,NA,apple,chocolate",
"shoes/socks,NA,NA,NA",
"apple,apple,chocolate,chocolate")
df <- data.frame(id, categories)
# the solution I would like to achieve
solution <- df %>%
mutate(winner = c("apple", "apple", "shoes/socks", "apple"),
winner_count = c(1, 2, 1, 2))
Based on these answers I have tried the following:
Write a function that finds the most common word in a string of text using R
trial <- df %>%
rowwise() %>%
mutate(winner = names(which.max(table(categories %>% str_split(",")))),
winner_count = which.max(table(categories %>% str_split(",")))[[1]])
Also tried to follow this approach, however it also does not give me the required results
How to find the most repeated word in a vector with R
trial2 <- df %>%
mutate(winner = DescTools::Mode(str_split(categories, ","), na.rm = T))
I am mainly struggling because my most frequent category is not just one word but something like "shoes/socks" and the fact that I also have NAs. I don't want the NAs to be the "winner".
I don't care too much about the ties right now. I already have a follow up process in place where I handle the cases that have winner_count = 2.
split the categories on comma in separate rows, count their occurrence for each id, drop the NA values and select the top occurring row for each id
library(dplyr)
library(tidyr)
df %>%
separate_rows(categories, sep = ',') %>%
count(id, categories, name = 'winner_count') %>%
filter(categories != 'NA') %>%
group_by(id) %>%
slice_max(winner_count, n = 1, with_ties = FALSE) %>%
ungroup %>%
rename(winner = categories) %>%
left_join(df, by = 'id') -> result
result
# id winner winner_count categories
# <dbl> <chr> <int> <chr>
#1 1 apple 1 apple,shoes/socks,trousers/jeans,chocolate
#2 2 apple 2 apple,NA,apple,chocolate
#3 3 shoes/socks 1 shoes/socks,NA,NA,NA
#4 4 apple 2 apple,apple,chocolate,chocolate

Understanding tidyr::gather key/value arguments

Simple question,
I've provided two different data frames below with code/output, why does one work and the other doesn't? Having trouble understanding the Key/Value inputs (when they need to be explicitly defined/and what it means to just have them as strings in the input).
library(tidyverse)
dat <- data.frame(one = c("x", "x", "x"), two = c("x", "", "x"),
three = c("", "", ""), type = c("chocolate", "vanilla", "strawberry"))
dat %>%
na_if("") %>%
gather("Key", "Val", -type,na.rm=TRUE) %>%
rowid_to_column %>%
spread(Key, Val,fill = "") %>%
select(-1) # works well
dat %>%
na_if("") %>%
gather("Key", "Val", -type,na.rm=TRUE)
Error: Strings must match column names. Unknown columns: Val
Extra Credit: if someone could explain the effect of rowit_to_column & spread(), that'd be helpful.
Perhaps I'm missing something, but I can't reproduce your error.
dat %>%
na_if("") %>% # Replace "" with NA
gather("Key", "Val", -type, na.rm = TRUE) %>% # wide -> long
rowid_to_column() %>% # Sequentially number rows
spread(Key, Val, fill = "") %>% # long -> wide
select(-1) # works well # remove row number
# type one two
#1 chocolate x
#2 vanilla x
#3 strawberry x
#4 chocolate x
#5 strawberry x
dat %>%
na_if("") %>% # Replace "" with NA
gather("Key", "Val", -type, na.rm = TRUE) # wide -> long
# type Key Val
#1 chocolate one x
#2 vanilla one x
#3 strawberry one x
#4 chocolate two x
#6 strawberry two x
Explanation:
na_if("") replaces "" entries with NA.
gather("Key", "Val", -type, na.rm = TRUE) turns a wide table into a long "key-value" table, by storing entries in all columns except type in two columns Key (i.e. the column name) and Val (i.e. the entry). na.rm = TRUE removes rows with NA values.
rowid_to_column sequentially numbers the rows.
spread(Key, Val, fill = "") turns a long "key-value" table into a wide table, with as many columns as there are unique keys in Key. Entries are taken from column Val, if an entry is missing it's filled with "".
select(-1) removes the first column.

Resources