I have a requirement to identify specific words and combinations of specific words within a free text description column. My dataset contains two columns - a reference number and description. The data relates to repairs. I need to be able to determine which room the repair took place in for each reference number. This could include “kitchen”, “bathroom”, “dining room” amongst others.
The dataset looks like this
| reference | description             |
|-----------|-------------------------|
| 123456    | repair light in kitchen |
The output I require is something like this:
| reference | Room    |
|-----------|---------|
| 123456    | kitchen |
Any help very much appreciated.
This will pull the first match from room_vector in each description.
room_vector = c("kitchen", "bathroom", "dining room")
library(stringr)
your_data$room = str_extract(your_data$description, paste(room_vector, collapse = "|"))
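As a quick self-contained check, building your_data from the sample row above:
library(stringr)
your_data <- data.frame(reference   = "123456",
                        description = "repair light in kitchen")
room_vector <- c("kitchen", "bathroom", "dining room")
your_data$room <- str_extract(your_data$description, paste(room_vector, collapse = "|"))
your_data$room
# [1] "kitchen"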
This version takes into account the combination with the word repair:
library(dplyr)
library(stringr)
my_vector <- c("kitchen", "bathroom", "dining room")
pattern <- paste(my_vector, collapse = "|")
df %>%
  mutate(Room = case_when(
    str_detect(description, "repair") &
      str_detect(description, pattern) ~ str_extract(description, pattern)
  ))
If you apply the code to this dataframe:
reference description
1 123456 live in light in kitchen
you will get:
reference description Room
1 123456 live in light in kitchen <NA>
A simpler version that does not take the combination with the word "repair" into account (similar to Gregor Thomas' solution):
library(dplyr)
library(stringr)
my_vector <- c("kitchen", "bathroom", "dining room")
pattern <- paste(my_vector, collapse = "|")
df %>%
  mutate(Room = case_when(
    str_detect(description, "repair") |
      str_detect(description, pattern) ~ str_extract(description, pattern)
  ))
reference description Room
1 123456 repair light in kitchen kitchen
Using Base R:
rooms <- c("kitchen", "bathroom", "dining room")
pat <- sprintf('.*repair.*(%s).*|.*', paste0(rooms, collapse = '|'))
transform(df, room = sub(pat, '\\1', reference))
If the string contains "repair" followed by one of the rooms, the first alternative captures the room name; otherwise the trailing .* alternative matches the whole string with an empty capture, so room is left empty:
reference room
1 repair bathroom bathroom
2 live bathroom
3 repair lights in kitchen kitchen
4 food in kitchen
5 tv in dining room
6 table repair dining room dining room
Data:
df <- structure(list(reference = c("repair bathroom", "live bathroom",
"repair lights in kitchen", "food in kitchen", "tv in dining room",
"table repair dining room ")), class = "data.frame", row.names = c(NA,
-6L))
I need to mutate a new column "Group" based on keywords.
I tried using %in% but did not get the data I expected.
I want to create an extra column named 'Group' in my df data frame.
In this column, I want to label every row using some keywords
(from a keyword vector, or maybe another keyword data frame).
For example:
library(tibble)
df <- tibble(Title = c("Iran: How we are uncovering the protests and crackdowns",
"Deepak Nirula: The man who brought burgers and pizzas to India",
"Phil Foden: Manchester City midfielder signs new deal with club until 2027",
"The Danish tradition we all need now",
"Slovakia LGBT attack"),
Text = c("Iranian authorities have been disrupting the internet service in order to limit the flow of information and control the narrative, but Iranians are still sending BBC Persian videos of protests happening across the country via messaging apps. Videos are also being posted frequently on social media.
Before a video can be used in any reports, journalists need to establish where and when it was filmed.They can pinpoint the location by looking for landmarks and signs in the footage and checking them against satellite images, street-level photos and previous footage. Weather reports, the position of the sun and the angles of shadows it creates can be used to confirm the timing.",
"For anyone who grew up in capital Delhi during the 1970s and 1980s, Nirula's - run by the family of Deepak Nirula who died last week - is more than a restaurant. It's an emotion.
The restaurant transformed the eating-out culture in the city and introduced an entire generation to fast food, American style, before McDonald's and KFC came into the country. For many it was synonymous with its hot chocolate fudge.",
"Stockport-born Foden, who has scored two goals in 18 caps for England, has won 11 trophies with City, including four Premier League titles, four EFL Cups and the FA Cup.He has also won the Premier League Young Player of the Season and PFA Young Player of the Year awards in each of the last two seasons.
City boss Pep Guardiola handed him his debut as a 17-year-old and Foden credited the Spaniard for his impressive development over the last five years.",
"Norwegian playwright and poet Henrik Ibsen popularised the term /friluftsliv/ in the 1850s to describe the value of spending time in remote locations for spiritual and physical wellbeing. It literally translates to /open-air living/, and today, Scandinavians value connecting to nature in different ways – something we all need right now as we emerge from an era of lockdowns and inactivity.",
"The men were shot dead in the capital Bratislava on Wednesday, in a suspected hate crime.Organisers estimated that 20,000 people took part in the vigil, mourning the men's deaths and demanding action on LGBT rights.Slovak President Zuzana Caputova, who has raised the rainbow flag over her office, spoke at the event.")
)
keyword1 <- c("authorities", "Iranian", "Iraq", "control", "Riots")
keyword2 <- c("McDonald's","KFC", "McCafé", "fast food")
keyword3 <- c("caps", "trophies", "season", "seasons")
keyword4 <- c("travel", "landscape", "living", "spiritual")
keyword5 <- c("LGBT", "lesbian", "les", "rainbow", "Gay", "Bisexual","Transgender")
I need to mutate a new column "Group" based on those keywords:
if it matches keyword1, label it "Politics";
if it matches keyword2, label it "Food";
if it matches keyword3, label it "Sport";
if it matches keyword4, label it "Travel";
if it matches keyword5, label it "LGBT".
Can the matching also ignore case?
Below is expected output
Title         Text         Group
Iran: How..   Iranian...   Politics
Deepak Nir..  For any...   Food
Phil Foden..  Stockpo...   Sport
The Danish..  Norwegi...   Travel
Slovakia L..  The men...   LGBT
Thanks to everyone who spends time on this.
You could try this:
df %>%
  rowwise %>%
  mutate(
    ## add column with words found in title or text (splitting by non-word character):
    words = list(strsplit(split = '\\W', paste(Title, Text)) %>% unlist),
    group = {
      categories <- list(keyword1, keyword2, keyword3, keyword4, keyword5)
      ## i indexes those items (= keyword vectors) of list 'categories'
      ## which share at least one word with column Title or Text (so that length > 0)
      i <- categories %>%
        lapply(\(category) length(intersect(unlist(words), category))) %>%
        as.logical
      ## pick group name via index; join with ',' if more than one category applies
      c('Politics', 'Food', 'Sport', 'Travel', 'LGBT')[i] %>% paste(collapse = ',')
    }
  )
output:
## # A tibble: 5 x 4
## # Rowwise:
## Title Text words group
## <chr> <chr> <lis> <chr>
## 1 Iran: How we are uncovering the protests and crackdowns "Ira~ <chr> Poli~
## 2 Deepak Nirula: The man who brought burgers and pizzas to In~ "For~ <chr> Food
## 3 Phil Foden: Manchester City midfielder signs new deal with ~ "Sto~ <chr> Sport
## 4 The Danish tradition we all need now "Nor~ <chr> Trav~
## 5 Slovakia LGBT attack "The~ <chr> LGBT
Check this out. The basic idea is to define all keyword* vectors as case-insensitive alternation patterns (hence the (?i) in the patterns and the | used for collapsing) with word boundaries (the \\b before and after the alternatives ensures that "caps" is matched but not, for example, "capsize"), and to use nested ifelse statements to assign the Group labels:
library(tidyverse)
df %>%
  mutate(
    All = str_c(Title, Text),
    Group = ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword1, collapse = "|"), ")\\b")), "Politics",
            ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword2, collapse = "|"), ")\\b")), "Food",
            ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword3, collapse = "|"), ")\\b")), "Sport",
            ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword4, collapse = "|"), ")\\b")), "Travel", "LGBT"))))
  ) %>%
  select(Group)
# A tibble: 5 × 1
Group
<chr>
1 Politics
2 Food
3 Sport
4 Travel
5 LGBT
I have a data frame of tweets for a sentiment analysis I am working on. I want to remove references to some proper names (for example, "Jeff Smith"). Is there a way to remove all or partial references to a name in the same command? Right now I am doing it the long way:
library(stringr)
str_detect(text, c('(Jeff Smith) | (Jeff) | (Smith)' ))
But that obviously gets cumbersome as I add more names. Ideally there'd be some way to feed just "Jeff Smith" and then be able to match all or some of it. Does anybody have any ideas?
Some sample code if you would like to play with it:
tweets = data.frame(text = c('Smith said he’s not counting on Monday being a makeup day.',
"Williams says that Steve Austin will miss the rest of the week",
"Weird times: Jeff Smith just got thrown out attempting to steal home",
"Rest day for Austin today",
"Jeff says he expects to bat leadoff", "Jeff", "No reference to either name"))
name = c("Jeff Smith", "Steve Austin")
Based on the data shown, all rows except the last should return TRUE:
library(dplyr)
library(stringr)
pat <- str_c(gsub(" ", "\\b|\\b", str_c("\\b", name, "\\b"), fixed = TRUE),
             collapse = "|")
tweets %>%
  mutate(ind = str_detect(text, pat))
-output
# text ind
#1 Smith said he’s not counting on Monday being a makeup day. TRUE
#2 Williams says that Steve Austin will miss the rest of the week TRUE
#3 Weird times: Jeff Smith just got thrown out attempting to steal home TRUE
#4 Rest day for Austin today TRUE
#5 Jeff says he expects to bat leadoff TRUE
#6 Jeff TRUE
#7 No reference to either name FALSE
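For reference (my own note, not part of the answer above): the gsub()/str_c() step wraps each name in word boundaries and splits on the space, so pat expands to:
pat
# [1] "\\bJeff\\b|\\bSmith\\b|\\bSteve\\b|\\bAustin\\b"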
Not a beauty, but it works.
#example data
namelist <- c('Jeff Smith', 'Kevin Arnold')
namelist_spreaded <- strsplit(namelist, split = ' ')
f <- function(x) {
paste0('(',
paste(x, collapse = ' '),
') | (',
paste(x, collapse = ') | ('),
')')
}
lapply(namelist_spreaded, f)
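One way to apply the generated patterns (a sketch I've added; it keeps the question's original style of spaces around |, so a name at the very start or end of a string may be missed):
library(stringr)
pats <- unlist(lapply(namelist_spreaded, f))
# TRUE if the text mentions any full or partial name from namelist
sapply(c("met Kevin Arnold today", "saw Smith at the game", "nothing relevant"),
       function(txt) any(str_detect(txt, pats)))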
I have a large dataframe of published articles for which I would like to extract all articles relating to a few authors specified in a separate list. The authors in the dataframe are grouped together in one column separated by a ; . Not all authors need to match, I would like to extract any article which has one author matched to the list. An example is below.
Title<-c("A", "B", "C")
AU<-c("Mark; John; Paul", "Simone; Lily; Poppy", "Sarah; Luke")
df<-cbind(Title, AU)
authors<-as.character(c("Mark", "John", "Luke"))
df[sapply(strsplit((as.character(df$AU)), "; "), function(x) any(authors %in% x)),]
I would expect it to return:
Title AU
A Mark; John
C Sarah; Luke
However, with my large dataframe this command does not return all AU; it only returns rows which have a single AU, not multiple ones.
Here is a dput of 5 rows from my larger dataframe:
structure(list(AU = c("FOOKES PG;DEARMAN WR;FRANKLIN JA", "SIMS DG;DOWNHAM MAPS;MCQUILLIN J;GARDNER PS",
"TURNER BR", "BUTLER J;MARSH H;GOODARZI F", "OVERTON M"), TI = c("SOME ENGINEERING ASPECTS OF ROCK WEATHERING WITH FIELD EXAMPLES FROM DARTMOOR AND ELSEWHERE",
"RESPIRATORY SYNCYTIAL VIRUS INFECTION IN NORTH-EAST ENGLAND",
"TECTONIC AND CLIMATIC CONTROLS ON CONTINENTAL DEPOSITIONAL FACIES IN THE KAROO BASIN OF NORTHERN NATAL, SOUTH AFRICA",
"WORLD COALS: GENESIS OF THE WORLD'S MAJOR COALFIELDS IN RELATION TO PLATE TECTONICS",
"WEATHER AND AGRICULTURAL CHANGE IN ENGLAND, 1660-1739"), SO = c("QUARTERLY JOURNAL OF ENGINEERING GEOLOGY",
"BRITISH MEDICAL JOURNAL", "SEDIMENTARY GEOLOGY", "FUEL", "AGRICULTURAL HISTORY"
), JI = c("Q. J. ENG. GEOL.", "BRIT. MED. J.", "SEDIMENT. GEOL.",
"FUEL", "AGRICULTURAL HISTORY")
An option with str_extract_all:
library(dplyr)
library(stringr)
df %>%
mutate(Names = str_extract_all(Names, str_c(authors, collapse="|"))) %>%
filter(lengths(Names) > 0)
# Title Names
#1 A Mark, John
#2 C Luke
data
df <- data.frame(Title, Names = AU)
In base R you can access it like so:
df[sapply(strsplit(as.character(df$Names), "; "), function(x) any(authors %in% x)), ]
Title Names
1 A Mark; John; Paul
3 C Sarah; Luke
This can be accomplished by subsetting on those Names that match the pattern specified in the first argument to the function grepl:
df[grepl(paste0(authors, collapse = "|"), df[,2]),]
Title Names
[1,] "A" "Mark; John; Paul"
[2,] "C" "Sarah; Luke"
I tried to find an answer for this in other posts but nothing seemed to be working.
I have a data set where people answered the city they were in using a free response format. Therefore for each city, people identified in many different ways. For example, those living in Atlanta might have written "Atlanta", "atlanta", "Atlanta, GA" and so on.
There are 12 cities represented in this data set. I'm trying to clean this variable so each city is written consistently. Is there a way to do this efficiently for each city?
I've tried mutate_if and str_replace_all but can't seem to figure it out (see my code below)
all_data_city <- mutate_if(all_data_city, is.character,
str_replace_all, pattern = "Atlanta, GA",
replacement = "Atlanta")
all_data_city %>%
str_replace_all(c("Atlanta, GA" & "HCA Atlanta" & "HCC Atlanta" &
"Suwanee" & "Suwanee, GA" & "suwanee"), = "Atlanta")
If we need to pass a vector of elements to be replaced, paste them together with | as pattern and replace with 'Atlanta'
library(dplyr)
library(stringr)
pat <- str_c(c("Atlanta, GA" , "HCA Atlanta" , "HCC Atlanta" ,
"Suwanee" , "Suwanee, GA" , "suwanee"), collapse = "|")
all_data_city %>%
str_replace_all(pat, "Atlanta")
Using a reproducible example with iris
iris %>%
transmute(Species = str_replace_all(Species,
str_c(c("set", "versi"), collapse="|"), "hello")) %>%
pull(Species) %>%
unique
#[1] "helloosa" "hellocolor" "virginica"
Questions on data cleaning are difficult to answer, as answers strongly depend on the data.
Proposed solutions may work for a (small) sample dataset but may fail for a (large) production dataset.
In this case, I see two possible approaches:
Collecting all possible ways of writing a city's name and replacing these different variants with the desired city name. This can be achieved by str_replace() or by joining (a join sketch is shown right after this list). This is safe but tedious.
Looking for a matching character string within the city name and replacing it if found.
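A minimal sketch of the joining variant of the first approach, assuming all_data_city has a character column city; city_lookup and its contents are made up for illustration:
library(dplyr)
# hypothetical lookup table: every observed spelling mapped to the desired name
city_lookup <- data.frame(
  city     = c("Atlanta, GA", "HCA Atlanta", "HCC Atlanta", "atlanta"),
  city_new = "Atlanta"
)
all_data_city %>%
  left_join(city_lookup, by = "city") %>%
  mutate(city_new = coalesce(city_new, city))  # keep the original where no variant is listed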
Below is a blueprint for the second approach, which can be extended for other use cases. For demonstration, a data.frame with one column city is created:
library(dplyr)
library(stringr)
data.frame(city = c("Atlanta, GA", "HCA Atlanta", "HCC Atlanta",
                    "Suwanee", "Suwanee, GA", "suwanee", "Atlantic City")) %>%
  mutate(city_new = case_when(
    str_detect(city, regex("Atlanta|Suwanee", ignore_case = TRUE)) ~ "Atlanta",
    TRUE ~ as.character(city)
  ))
city city_new
1 Atlanta, GA Atlanta
2 HCA Atlanta Atlanta
3 HCC Atlanta Atlanta
4 Suwanee Atlanta
5 Suwanee, GA Atlanta
6 suwanee Atlanta
7 Atlantic City Atlantic City
I have the Simpsons data from kaggle.com, which includes the title of each episode. I want to check how many times the character names have been used in the titles. I can find exact words in the titles, but my code misses words such as "Homers" when I look for "Homer". Is there a way to do this?
Data example and my code:
text <- 'title
Homer\'s Night Out
Krusty Gets Busted
Bart Gets an "F"
Two Cars in Every Garage and Three Eyes on Every Fish
Dead Putting Society
Bart the Daredevil
Bart Gets Hit by a Car
Homer vs. Lisa and the 8th Commandment
Oh Brother, Where Art Thou?
Old Money
Lisa\'s Substitute
Blood Feud
Mr. Lisa Goes to Washington
Bart the Murderer
Like Father, Like Clown
Saturdays of Thunder
Burns Verkaufen der Kraftwerk
Radio Bart
Bart the Lover
Separate Vocations
Colonel Homer'
simpsons <- read.csv(text = text, stringsAsFactors = FALSE)
library(stringr)
titlewords <- paste(simpsons$title, collapse = " " )
words <- c('Homer')
titlewords <- gsub("[[:punct:]]", "", titlewords)
HomerCount <- str_count(titlewords, paste(words, collapse=" "))
HomerCount
As an alternative to the excellent suggestions in the comments, you can also use the tidytext package:
library(tidytext)
library(dplyr)
text <- 'title
Homer\'s Night Out
Krusty Gets Busted
Bart Gets an "F"
Two Cars in Every Garage and Three Eyes on Every Fish
Dead Putting Society
Bart the Daredevil
Bart Gets Hit by a Car
Homer vs. Lisa and the 8th Commandment
Oh Brother, Where Art Thou?
Old Money
Lisa\'s Substitute
Blood Feud
Mr. Lisa Goes to Washington
Bart the Murderer
Like Father, Like Clown
Saturdays of Thunder
Burns Verkaufen der Kraftwerk
Radio Bart
Bart the Lover
Separate Vocations
Colonel Homer'
simpsons <- read.csv(text = text, stringsAsFactors = FALSE)
# Number of homers
simpsons %>%
unnest_tokens(word, title) %>%
summarize(count = sum(grepl("homer", word)))
# Lines location of homers
simpsons %>%
unnest_tokens(word, title) %>%
mutate(lines = rownames(.)) %>%
filter(grepl("homer", word))
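If all you need is the total count, a case-insensitive count over the combined titles also picks up variants such as "Homers" and "Homer's"; this is my own addition in the spirit of the question's str_count() attempt, not part of the tidytext answer:
library(stringr)
str_count(paste(simpsons$title, collapse = " "),
          regex("homer", ignore_case = TRUE))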