I have a data frame, and for various reasons I need to keep one of the columns as a factor and, while maintaining the order of the levels, replace periods in the levels with spaces. Here's an example:
library(tidyverse)
library(stringr)
sandwich <- c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
sandwich_df <- tibble(sandwich_str = sandwich) %>%
mutate(sandwich_factor = factor(sandwich)) %>%
mutate(sandwich2 = factor(sandwich_factor,
levels = str_replace_all(levels(sandwich_factor), "\\.", " "))) %>%
mutate(sandwich3 = str_replace_all(sandwich_str, "\\.", " "))
print(sandwich_df)
# A tibble: 5 x 4
  sandwich_str  sandwich_factor sandwich2 sandwich3
  <chr>         <fctr>          <fctr>    <chr>
1 bread         bread           bread     bread
2 mustard.sauce mustard.sauce   <NA>      mustard sauce
3 tuna.fish     tuna.fish       <NA>      tuna fish
4 lettuce       lettuce         lettuce   lettuce
5 bread         bread           bread     bread
So in this data frame:
sandwich_str is a character column
sandwich_factor is a factor column
in sandwich2 I tried replacing all of the periods in the levels of sandwich_factor. For whatever reason, this returns NA whenever a value contains a period.
in sandwich3 I take the simpler approach of just replacing all of the periods in the strings with spaces. This works much better.
So I'm wondering what isn't working in my attempt at sandwich2. I'd like it to look more like sandwich3. Any advice?
Does this suit?
library(tidyverse)
library(stringr)
# Data --------------------------------------------------------------------
sandwich <-
c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
df <-
data_frame(sandwich_str = sandwich)
# Convert periods to spaces -----------------------------------------------
df$sandwich_str <-
df$sandwich_str %>%
as.character() %>%
str_replace_all("\\.", " ") %>% # replace every period, not only the first one
as.factor()
# Print output ------------------------------------------------------------
df %>%
print()
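Credit to @aosmith for posting this answer as a comment. I'll post it here as an answer so I can accept and close this.
The problem was that I passed the new names to factor() through the levels argument rather than labels. levels declares which values of the input are valid, so any value containing a period matched none of the new levels and became NA; labels renames the existing levels while keeping their order. So the correct way for me to have written this would be:
library(tidyverse)
library(stringr)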
sandwich <- c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
sandwich_df <- tibble(sandwich_str = sandwich) %>%
mutate(sandwich_factor = factor(sandwich)) %>%
mutate(sandwich2 = factor(sandwich_factor,
labels = str_replace_all(levels(sandwich_factor), "\\.", " "))) %>%
mutate(sandwich3 = str_replace_all(sandwich_str, "\\.", " "))
print(sandwich_df)
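To see the difference between the two arguments in isolation, here is a minimal base R sketch. The object names wrong, right and also_right are only for illustration, and gsub with fixed = TRUE is swapped in for str_replace_all so that no packages are needed.

```r
sandwich <- c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
f <- factor(sandwich)

# New level names with periods replaced by spaces (same order as levels(f))
new_names <- gsub(".", " ", levels(f), fixed = TRUE)

# levels = ... says which input values are valid; "mustard.sauce" and
# "tuna.fish" are not among new_names, so they become NA
wrong <- factor(f, levels = new_names)

# labels = ... renames the existing levels and keeps their order
right <- factor(f, labels = new_names)

# Assigning to levels() in place has the same renaming effect
also_right <- f
levels(also_right) <- new_names

print(wrong)
print(right)
```

If you prefer to stay in a pipeline, forcats::fct_relabel() should do the same renaming, but the base version above avoids the extra dependency.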
Related
Given a dataframe of types and values like so:
topic   keyword
cheese  cheddar
meat    beef
meat    chicken
cheese  swiss
bread   focaccia
bread   sourdough
cheese  gouda
My aim is to make a set of dynamic regexes based on the topic, but I don't know how to build the variable names from the topics. I can do this individually like so:
fn_get_topic_regex <- function(targettopic,df)
{
filter_df <- df |>
filter(topic == targettopic)
regex <- paste(filter_df$keyword, collapse = "|")
}
and do things like:
cheese_regex <- fn_get_topic_regex("cheese",df)
But what I'd like to be able to do is build all these regexes automatically without having to define each one.
The intended output would be something like:
cheese_regex: "cheddar|swiss|gouda"
bread_regex: "focaccia|sourdough"
meat_regex: "beef|chicken"
Where the start of the variable name is the distinct topic.
What's the best way to do that without defining each regex individually by hand?
You can use dplyr's group_by() and summarise()
df %>%
group_by(topic) %>%
summarise(regex = paste(keyword, collapse = "|"))
# A tibble: 3 × 2
topic regex
<chr> <chr>
1 bread focaccia|sourdough
2 cheese cheddar|swiss|gouda
3 meat beef|chicken
Or you can apply your function to every unique value in df$topic:
map_chr(unique(df$topic) %>% setNames(paste0(., "_regex")),
fn_get_topic_regex, df = df)
cheese_regex meat_regex bread_regex
"cheddar|swiss|gouda" "beef|chicken" "focaccia|sourdough"
Just remember to add return(regex) to the end of your function, or not to assign the last line to a variable at all. I would even put everything in a single pipe chain:
fn_get_topic_regex <- function(targettopic,df)
{
df |>
filter(topic == targettopic) |>
pull(keyword) |>
paste(collapse = "|")
}
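A small sketch of why the assignment on the last line matters: as far as I know, a function whose last expression is an assignment still returns the value, only invisibly, so it works when assigned but prints nothing when called at the console. The two function names below are hypothetical and exist only for the comparison.

```r
# Last expression is an assignment: the value is returned invisibly
regex_assign <- function(x) {
  out <- paste(x, collapse = "|")
}

# Last expression is the value itself: returned visibly
regex_plain <- function(x) {
  paste(x, collapse = "|")
}

words <- c("cheddar", "swiss", "gouda")

# Both return the same string ...
same_value <- identical(regex_assign(words), regex_plain(words))

# ... but only the second one would auto-print at the console
vis_assign <- withVisible(regex_assign(words))$visible
vis_plain  <- withVisible(regex_plain(words))$visible
```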
Here is a base R solution with your intended output in a named list.
df <- structure(list(topic = c("cheese", "meat", "meat", "cheese", "bread", "bread", "cheese"),
keyword = c("cheddar", "beef", "chicken", "swiss", "focaccia", "sourdough", "gouda")),
class = "data.frame", row.names = c(NA, -7L))
#split into a list per topic
topics <- split(df, df$topic)
#collapse the keyword column
topics <- lapply(topics, function(t) {
paste(t$keyword, collapse = "|")
})
#rename
names(topics)<- paste0(names(topics), "_regex")
topics
$bread_regex
[1] "focaccia|sourdough"
$cheese_regex
[1] "cheddar|swiss|gouda"
$meat_regex
[1] "beef|chicken"
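Keeping the regexes in one named object, rather than creating a separate variable per topic, also makes them easy to look up by name. A compact sketch of the same idea with split and vapply, followed by a lookup; the test strings passed to grepl are made up for illustration.

```r
df <- data.frame(
  topic   = c("cheese", "meat", "meat", "cheese", "bread", "bread", "cheese"),
  keyword = c("cheddar", "beef", "chicken", "swiss", "focaccia", "sourdough", "gouda"),
  stringsAsFactors = FALSE
)

# One collapsed regex per topic, as a named character vector
regexes <- vapply(split(df$keyword, df$topic), paste, character(1), collapse = "|")
names(regexes) <- paste0(names(regexes), "_regex")

# Look a regex up by name when it is needed
is_cheese <- grepl(regexes[["cheese_regex"]], c("aged gouda", "rye loaf"))
```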
We could do something like this:
after grouping, use summarise together with paste and collapse to build the regexes.
Then, when a regex is needed, refer to it by indexing, as in the example below:
library(dplyr)
library(stringr) #str_detect
my_regex <- df %>%
group_by(topic) %>%
summarise(regex = paste(keyword, collapse = "|"))
df %>%
mutate(new_col = ifelse(str_detect(keyword, my_regex$regex[1]), "it is bread", "it is not bread"))
# A tibble: 3 × 2
topic regex
<chr> <chr>
1 bread focaccia|sourdough
2 cheese cheddar|swiss|gouda
3 meat beef|chicken
> df %>%
+ mutate(new_col = ifelse(str_detect(keyword, my_regex$regex[1]), "it is bread", "it is not bread"))
topic keyword new_col
1 cheese cheddar it is not bread
2 meat beef it is not bread
3 meat chicken it is not bread
4 cheese swiss it is not bread
5 bread focaccia it is bread
6 bread sourdough it is bread
7 cheese gouda it is not bread
I have the following problem: I have a tibble with multiple character columns.
I tried to provide an MRE below:
library(tidyverse)
df <- tibble(food = c("pizza, bread, apple","joghurt, cereal, banana"),
food2 = c("bread, sausage, strawberry", "joghurt, oat, bacon"),
food3 = c("ice cream, bread, milkshake", "melon, cake, joghurt")
)
df %>%
# rowwise() %>%
mutate(allcolumns = map2(
str_split(food, ", "),
str_split(food2, ", "),
# str_split(food3, ", "),
intersect
) %>% unlist()
) -> df_new
My goal is to get the words that are common to all columns. Words are separated by ", " in the columns. In the MRE I am able to find the intersection of two columns, but I couldn't extend it to all three. I experimented with Reduce but was not able to get it to work.
As an EDIT: I would also like to append it as a new row to the existing tibble
We can use map to loop over the columns and str_split each one, transpose the result so that each element holds one row's word vectors, and then reduce with intersect to get the words common to all columns in that row
library(dplyr)
library(purrr)
library(stringr)
df %>%
purrr::map(str_split, ", ") %>%
transpose %>%
purrr::map_chr(reduce, intersect) %>%
mutate(df, Intersect = .)
-output
# A tibble: 2 x 4
food food2 food3 Intersect
<chr> <chr> <chr> <chr>
1 pizza, bread, apple bread, sausage, strawberry ice cream, bread, milkshake bread
2 joghurt, cereal, banana joghurt, oat, bacon melon, cake, joghurt joghurt
or may also use pmap
df %>%
mutate(Intersect = pmap(across(everything(), str_split, ", "),
~ list(...) %>%
reduce(intersect)))
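Since the question mentions Reduce, here is a rough base R sketch of the same row-wise idea. It assumes a plain data.frame instead of a tibble, and the helper names split_cols and common are hypothetical.

```r
df <- data.frame(
  food  = c("pizza, bread, apple", "joghurt, cereal, banana"),
  food2 = c("bread, sausage, strawberry", "joghurt, oat, bacon"),
  food3 = c("ice cream, bread, milkshake", "melon, cake, joghurt"),
  stringsAsFactors = FALSE
)

# Split every column into a list of word vectors (one vector per row)
split_cols <- lapply(df, strsplit, split = ", ", fixed = TRUE)

# For each row, intersect the word vectors of all columns
common <- vapply(seq_len(nrow(df)), function(i) {
  words_in_row <- lapply(split_cols, `[[`, i)
  paste(Reduce(intersect, words_in_row), collapse = ", ")
}, character(1))

df$Intersect <- common
```

Pasting with collapse keeps the result a character column even when a row has several common words or none at all.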
I have an R data frame (actually an excel sheet which I have read into R) in the format below:
ID  Text
1   This is a red
    car. Its electric
    and has 4 wheels.
2   This is a van with
    six wheels.
I want to reshape it into the following format
ID  Text
1   This is a red car. Its electric and has 4 wheels.
2   This is a van with six wheels.
Essentially between the two ID numbers my text has been broken into multiple lines. I want to combine it to look like the output above.
Grouping by a numeric ID with group_by did not work, because it drops the lines that have no ID.
Any thoughts on how I can achieve this type of output?
Thanks!
Here is one option with the tidyverse. Convert the blanks ("") in 'ID' to NA with na_if, use fill from tidyr to replace each NA with the previous non-NA value, group by 'ID', and then paste the 'Text' elements together into a single string by collapsing them
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(ID = na_if(ID, "")) %>%
fill(ID) %>%
group_by(ID) %>%
summarise(Text = str_c(Text, collapse=' '))
# A tibble: 2 x 2
# ID Text
# <chr> <chr>
#1 1 This is a red car. Its electric and has 4 wheels.
#2 2 This is a van with six wheels.
Or create a logical index converted to numeric to fill the 'ID' and use that as grouping variable to summarise the 'Text' column
df1 %>%
group_by(ID = ID[ID != ""][cumsum(ID != "")]) %>%
summarise(Text = str_c(Text, collapse=" "))
# A tibble: 2 x 2
# ID Text
# <chr> <chr>
#1 1 This is a red car. Its electric and has 4 wheels.
#2 2 This is a van with six wheels.
data
df1 <- structure(list(ID = c("1", "", "", "2", ""), Text = c("This is a red",
"car. Its electric", "and has 4 wheels.", "This is a van with",
"six wheels.")), row.names = c(NA, -5L), class = "data.frame")
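The cumsum index in the second option can be hard to read at first, so here it is unpacked step by step in base R. This is only a sketch on the same df1; the intermediate names grp, filled_id and combined are made up for illustration.

```r
df1 <- data.frame(
  ID   = c("1", "", "", "2", ""),
  Text = c("This is a red", "car. Its electric", "and has 4 wheels.",
           "This is a van with", "six wheels."),
  stringsAsFactors = FALSE
)

# TRUE where a new ID starts; the running sum numbers the blocks: 1 1 1 2 2
grp <- cumsum(df1$ID != "")

# Use the block number to look up the non-blank ID for every row
filled_id <- df1$ID[df1$ID != ""][grp]

# Paste the text of each block into one string
combined <- vapply(split(df1$Text, filled_id), paste, character(1), collapse = " ")
result <- data.frame(ID = names(combined), Text = unname(combined),
                     stringsAsFactors = FALSE)
```

Note that this assumes the first row always carries an ID; a leading blank ID would give a block number of 0 and break the lookup.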
I'm trying to find an effective way to extract words from a text column in a dataset. The approach I'm using is
library(dplyr)
library(stringr)
Text = c("A little bird told me about the dog", "A pig in a poke", "As busy as a bee")
data = as.data.frame(Text)
keywords <- paste0(c("bird", "dog", "pig","wolf","cat", "bee", "turtle"), collapse = "|")
data %>% mutate(Word = str_extract(Text, keywords))
This is just an example; I actually have more than 2000 possible words to extract from each row. I don't know of another approach yet. Will such a large regex make things slow, or does the size of the regex not matter? I don't expect more than one of these words to appear in a row, but is there a way to create multiple columns automatically if more than one does?
We can use str_extract_all to return a list, convert the list elements to a named list or tibble and use unnest_wider
library(purrr)
library(stringr)
library(tidyr)
library(dplyr)
data %>%
mutate(Words = str_extract_all(Text, keywords),
Words = map(Words, ~ as.list(unique(.x)) %>%
set_names(str_c('col', seq_along(.))))) %>%
unnest_wider(Words)
# A tibble: 3 x 3
# Text col1 col2
# <fct> <chr> <chr>
#1 A little bird told me about the dog bird dog
#2 A pig in a poke pig <NA>
#3 As busy as a bee bee <NA>
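Try intersect with keywords as a character vector rather than a single collapsed string, pasting the matches together so that each row yields exactly one value
keywords <- c("bird", "dog", "pig", "wolf", "cat", "bee", "turtle")
data <- data.frame(Text = Text, Word = sapply(Text, function(v) paste(intersect(unlist(strsplit(v, split = " ")), keywords), collapse = ", "), USE.NAMES = FALSE))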
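If you want one column per matched word without any packages, something along these lines might work. It is a sketch only: the word boundaries (\\b) are my addition so that "cat" does not match inside a longer word, and the names pattern, matches and padded are hypothetical.

```r
Text <- c("A little bird told me about the dog", "A pig in a poke", "As busy as a bee")
keywords <- c("bird", "dog", "pig", "wolf", "cat", "bee", "turtle")

# One alternation wrapped in word boundaries so only whole words match
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")

# All matches per row, as a list of character vectors
matches <- regmatches(Text, gregexpr(pattern, Text, perl = TRUE))

# Pad every row with NA up to the longest match count, one column per match
n <- max(lengths(matches))
padded <- t(vapply(matches,
                   function(m) c(m, rep(NA_character_, n - length(m))),
                   character(n)))
colnames(padded) <- paste0("col", seq_len(n))

result <- data.frame(Text = Text, padded, stringsAsFactors = FALSE)
```

This assumes at least one row has a match (so n is at least 1); whether a 2000-word alternation is fast enough is something you would need to time on your own data.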
# Sample Data Frame
df <- data.frame(Column_A = c("1011 Red Cat",
                              "Mouse 2011 is in the House 3001",
                              "Yellow on Blue Dog walked around Park"))
I've a column of manually inputted data which I'm trying to clean.
  Column_A
1|1011 Red Cat                         |
2|Mouse 2011 is in the House 3001      |
3|Yellow on Blue Dog walked around Park|
I want to separate each characteristic into its own column, but still maintain Column_A to pull out other characteristics later.
  Colour         |Code      |Column_A
1|Red            |1011      |Cat
2|NA             |2011 3001 |Mouse is in the House
3|Yellow on Blue |NA        |Dog walked around Park
To date, I've been re-ordering them with gsub and capturing groups, then using Tidyr::extract to separate them.
library(dplyr)
library(tidyr)
library(stringr)
df1 <- df %>%
# Reorders the Colours
mutate(Column_A = gsub("(.*?)?(Yellow|Blue|Red)(.*)?", "\\2 \\1\\3",
Column_A, perl = TRUE)) %>%
# Removes Whitespaces
mutate(Column_A =str_squish(Column_A)) %>%
# Extracts the Colours
extract(Column_A, c("Colour", "Column_A"), "(Red|Yellow|Blue)?(.*)") %>%
# Repeats the Preceding Steps for Codes
mutate(Column_A = gsub("(.*?)?(\\b\\d{1,}\\b)(.*)?", "\\2 \\1\\3",
Column_A, perl = TRUE)) %>%
mutate(Column_A =str_squish(Column_A)) %>%
extract(Column_A, c("Code", "Column_A"), "(\\b\\d{1,}\\b)?(.*)") %>%
mutate(Column_A = str_squish(Column_A))
Which Results in this:
Colour  Code  Column_A
|Red    |1011 |Cat
|Yellow |NA   |on Blue Dog walked around Park
|NA     |2011 |Mouse is in the House 3001
This works fine for the first row, but not for the other rows, where the colours or codes are separated by spaces and other words; I have been extracting and uniting those afterwards. What's a more elegant way of doing this?
Here's a solution with a combination of stringr and gsub, using a list of colours supplied in R:
library(dplyr)
library(stringr)
# list of colours from R colors()
cols <- as.character(colors())
apply(df,
1,
function(x)
tibble(
# Extract CSV of colours
Color = cols[cols %in% str_split(tolower(x), " ", simplify = T)] %>%
paste0(collapse = ","),
# Extract CSV of sequential lists of digits
Code = str_extract_all(x, regex("\\d+"), simplify = T) %>%
paste0(collapse = ","),
# Remove colours and digits from Column_A
Column_A = gsub(paste0("(\\d+|",
paste0(cols, collapse = "|"),
")"), "", x, ignore.case = T) %>% trimws())) %>%
bind_rows()
# A tibble: 3 x 3
Color Code Column_A
<chr> <chr> <chr>
1 red 1011 Cat
2 "" 2011,3001 Mouse is in the House
3 blue,yellow "" on Dog walked around Park
Using tidyverse we can do
library(tidyverse)
colors <- paste0(c("Red", "Yellow", "Blue"), collapse = "|")
df %>%
mutate(Color = str_extract(Column_A,
paste0("(", colors, ").*(", colors, ")|(", colors, ")")),
Code = str_extract_all(Column_A, "\\d+"),
Column_A = pmap_chr(list(Color, Code, Column_A), function(x, y, z)
trimws(gsub(paste0("\\b", c(x, y), "\\b", collapse = "|"), "", z))),
Code = map_chr(Code, paste, collapse = " "))
# Column_A Color Code
#1 Cat Red 1011
#2 Mouse is in the House <NA> 2011 3001
#3 Dog walked around Park Yellow on Blue
We first extract text between two colors using str_extract. You can include all the possible colors which can occur in the data in colors. We use paste0 to construct the regex. For this example it would be
paste0("(", colors, ").*(", colors, ")|(", colors, ")")
#[1] "(Red|Yellow|Blue).*(Red|Yellow|Blue)|(Red|Yellow|Blue)"
meaning extract text between and including colors or extract only colors.
For Code part as we can have multiple Code values, we use str_extract_all and get all the numbers from the column. This part is initially stored in a list.
For Column_A values we remove everything which was selected in Code and Color adding word boundaries using gsub and the remaining part is saved.
As we had stored Code in a list previously, we convert each element to one string by collapsing it. This returns empty strings for rows with no match. You can convert them back to NA by adding Code = replace(Code, Code == "", NA) to the chain if needed.
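For reference, the same three steps can be sketched in base R, which may make the regex easier to test on its own. This is an assumption-laden sketch rather than a drop-in replacement: it works on a plain character vector, only knows the three colours from the example, and strips the whole colour match in one go instead of using pmap_chr.

```r
Column_A <- c("1011 Red Cat",
              "Mouse 2011 is in the House 3001",
              "Yellow on Blue Dog walked around Park")

colours <- paste(c("Red", "Yellow", "Blue"), collapse = "|")
# Either "colour ... colour" (including what lies between) or a single colour
colour_pattern <- paste0("(", colours, ").*(", colours, ")|(", colours, ")")

# Step 1: first colour match per row, NA where there is none
m <- regexpr(colour_pattern, Column_A, perl = TRUE)
Colour <- rep(NA_character_, length(Column_A))
Colour[m != -1] <- regmatches(Column_A, m)

# Step 2: all runs of digits per row, collapsed to one string, "" turned into NA
codes <- regmatches(Column_A, gregexpr("\\d+", Column_A, perl = TRUE))
Code <- vapply(codes, paste, character(1), collapse = " ")
Code[Code == ""] <- NA

# Step 3: remove colours and codes, then tidy up the leftover whitespace
rest <- gsub(colour_pattern, "", Column_A, perl = TRUE)
rest <- gsub("\\b\\d+\\b", "", rest, perl = TRUE)
rest <- trimws(gsub("\\s+", " ", rest, perl = TRUE))

result <- data.frame(Colour = Colour, Code = Code, Column_A = rest,
                     stringsAsFactors = FALSE)
```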