Remove everything in a string after the first " - " (multiple " - ") - r

I am struggling to keep only the part before the first " - ".
If I try this regex on regex101.com I get the expected output, but when I try it in R I get a different output.
authors <- sub("\\s-\\s.*", "", authors)
Input:
[1] "T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020 - annualreviews.org"
[2] "L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011 - Elsevier"
[3] "CD Thomas - Diversity and Distributions, 2010 - Wiley Online Library"
Expected output:
[1] "T Dietz, RL Shwom, CT Whitley"
[2] "L Berrang-Ford, JD Ford, J Paterson"
[3] "CD Thomas"
Actual output:
[1] "T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020"
[2] "L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011"
[3] "CD Thomas - Diversity and Distributions, 2010"
Thanks in advance!

It seems your input contains some Unicode whitespace characters.
In that case, the following will work:
sub("(*UTF)(*UCP)\\s-\\s.*", "", authors, perl=TRUE)
The (*UTF)(*UCP) verbs (probably just (*UCP) is enough) enable \s to match any Unicode whitespace.
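A hypothetical reproduction of the problem (the \u00a0 non-breaking spaces below stand in for whatever Unicode whitespace the scraped page actually returned):

```r
# Non-breaking spaces (U+00A0) around the dash, as web pages often produce
authors <- "T Dietz, RL Shwom, CT Whitley\u00a0-\u00a0Annual Review of Sociology, 2020"

# PCRE's \s is ASCII-only by default; (*UCP) extends it to Unicode whitespace
sub("(*UTF)(*UCP)\\s-\\s.*", "", authors, perl = TRUE)
# [1] "T Dietz, RL Shwom, CT Whitley"
```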

You can also use this regex, replacing each match with an empty string (in Notepad++, for example):
-(.*?)$

You can also just split the string on your delimiter (-) and take the first element:
sapply(strsplit(authors, " -", fixed = TRUE), `[[`, 1)
[1] "T Dietz, RL Shwom, CT Whitley" "L Berrang-Ford, JD Ford, J Paterson"
[3] "CD Thomas"
You can also use a regex to remove everything after and including your delimiter. The match starts at the first occurrence of the delimiter, and because .* is greedy it consumes everything from there to the end of the string:
stringr::str_remove(authors, " -.*")
[1] "T Dietz, RL Shwom, CT Whitley" "L Berrang-Ford, JD Ford, J Paterson"
[3] "CD Thomas"
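The reason a single substitution removes everything from the first delimiter onward is that the regex engine matches at the leftmost possible position; the greedy .* then swallows any later delimiters. A minimal illustration:

```r
x <- "a - b - c"
# The match starts at the FIRST " - " and .* runs to the end of the string
sub("\\s-\\s.*", "", x)
# [1] "a"
```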

Too long for a comment at the moment, may delete later. When I run this code alone, I get your expected output:
authors <- c("T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020 - annualreviews.org",
             "L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011 - Elsevier",
             "CD Thomas - Diversity and Distributions, 2010 - Wiley Online Library")
sub("\\s-\\s.*", "", authors)
#[1] "T Dietz, RL Shwom, CT Whitley" "L Berrang-Ford, JD Ford, J Paterson" "CD Thomas"
This might have something to do with the fact that you reassign the result to authors every time you try a substitution, which overwrites authors. You may have done that while developing the regex and forgotten to reset authors to the original vector.


Splitting string with '<U+FF0E>' in R

Hello, I am trying to split a data frame column test$Name that is in this format:
[1]"Fung Yat Building<U+FF0E>13/F<U+FF0E>Flat A"
[2] "Victoria Centre<U+FF0E>Block 3<U+FF0E>20/F<U+FF0E>Flat B"
[3] "Lei King Wan<U+FF0E>Sites B<U+FF0E>Block 6 Yat Hong Mansion<U+FF0E>3/F<U+FF0E>Flat H"
[4] "Island Place<U+FF0E>Block 3 (Three Island Place)<U+FF0E>9/F<U+FF0E>Flat G"
[5] "7A Comfort Terrace<U+FF0E>5/F<U+FF0E>Flat B"
[6] "Broadview Court<U+FF0E>Block 4<U+FF0E>38/F<U+FF0E>Flat E"
[7] "Chi Fu Fa Yuen<U+FF0E>Fu Ho Yuen (Block H-5)<U+FF0E>16/F<U+FF0E>Flat G"
[8] "City Garden<U+FF0E>Phase 2<U+FF0E>Block 10<U+FF0E>9/F<U+FF0E>Flat B"
[9] "Euston Court<U+FF0E>Tower 1<U+FF0E>12/F<U+FF0E>Flat H"
[10] "Garley Building<U+FF0E>10/F<U+FF0E>Flat C"
The structure of each entry is BuildingName<U+FF0E>FloorNumber<U+FF0E>Unit. I would like to extract the building name like the following example.
Name
Fung Yat Building
Victoria Centre
Lei King Wan
...
I have tested that <U+FF0E> is actually '.' by doing this.
grepl('.',"Fung Yat Building<U+FF0E>13/F<U+FF0E>Flat A")
[1] TRUE
Hence, I have tried the following but none of them worked...
test %>% separate(Name, c('Name'), sep = '.') %>% head
gsub(".", " ", test$Name[1], fixed=TRUE)
sub("^\\s*<U\\+\\w+>\\s*", " ", test$Name[1])
Any suggestions please? Thanks!
The easiest way is to use < as a split pattern.
library(stringr)
word("Fung Yat Building<U+FF0E>13/F<U+FF0E>Flat A", 1, sep = "\\<")
# word("Fung Yat Building<U+FF0E>13/F<U+FF0E>Flat A", 1, sep = "\\<U\\+FF0E\\>") ## building is '1', FloorNumber is '2', Unit is '3'
Output:
[1] "Fung Yat Building"
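A base-R alternative (a sketch: the pattern below covers both the literal text "<U+FF0E>" and the actual fullwidth full stop character U+FF0E, since it is not entirely clear which one the scraped column contains):

```r
x <- "Fung Yat Building<U+FF0E>13/F<U+FF0E>Flat A"

# Split on either the literal "<U+FF0E>" marker or the real U+FF0E character
parts <- strsplit(x, "<U\\+FF0E>|\uFF0E")[[1]]
parts[1]
# [1] "Fung Yat Building"
```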

Removing dates and all junks from texts using R

I am cleaning a huge dataset made up of tens of thousands of texts using R. I know regular expressions will do the job conveniently, but I am poor at using them. I have combed Stack Overflow but could not find a solution. This is my dummy data:
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982",
              "04/02/2016 Health is a priority: WAI000553",
              "09/ 08/2016 Economy is bad: 2031CE8D",
              ": : 21 / 05 / 13: Vehicle license is needed: DPH2790 ")
I want to remove all the dates, punctuations and IDs and want my result to be this:
[1] "Education is good"
[2] "Health is a priority"
[3] "Economy is bad"
[4] "Vehicle license is needed"
Any help in R will be appreciated.
I think specificity is in order here:
First, let's remove the date-like strings. I'll assume either mm/dd/yyyy or dd/mm/yyyy, where the first two can be 1-2 digits, and the third is always 4 digits. If this is variable, the regex can be changed to be a little more permissive:
foo_data2 <- gsub("\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}", "", foo_data)
foo_data2
# [1] " Education is good: WO0001982" " Health is a priority: WO0002021" " Economy is bad: WO001999" " Vehicle license is needed: WO001050"
From here, the abbreviations seem rather easy to remove, as the other answers have demonstrated. You have not specified if the abbreviation is hard-coded to be anything after a colon, numbers prepended with "WO", or just some one-word combination of letters and numbers. Those could be:
gsub(":.*", "", foo_data2)
# [1] " Education is good" " Health is a priority" " Economy is bad" " Vehicle license is needed"
gsub("\\bWO\\S*", "", foo_data2)
# [1] " Education is good: " " Health is a priority: " " Economy is bad: " " Vehicle license is needed: "
gsub("\\b[A-Za-z]+\\d+\\b", "", foo_data2)
# [1] " Education is good: " " Health is a priority: " " Economy is bad: " " Vehicle license is needed: "
The : removal should be straightforward, and using trimws(.) will remove the leading/trailing spaces.
This can obviously be combined into a single regex (using the logical | with pattern grouping) or a single R call (nested gsub) without complication; I kept them broken apart for discussion.
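For instance, one possible combined call (an assumption on my part, since the answer leaves the combination to the reader; it uses the WO-style data shown in the data block further down):

```r
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982",
              "04/02/2016 Health is a priority: WO0002021",
              "09/ 08/2016 Economy is bad: WO001999",
              "09/08/ 2016 Vehicle license is needed: WO001050")

# Alternation: the date pattern OR everything from the colon onwards,
# then trimws() to clean up the leftover leading/trailing spaces
trimws(gsub("\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}|:.*", "", foo_data))
# [1] "Education is good"         "Health is a priority"
# [3] "Economy is bad"            "Vehicle license is needed"
```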
I think https://stackoverflow.com/a/22944075/3358272 is a good reference for regex in general, note that while that page shows many regex things with single-backslashes, R requires all of those use double-backslashes (e.g., \d in regex needs to be \\d in R). The exception to this is if you use R-4's new raw-strings, where these two are identical:
"\\b[A-Za-z]+\\d+\\b"
r"(\b[A-Za-z]+\d+\b)"
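A quick check that the two spellings denote the same pattern string (raw strings require R >= 4.0.0):

```r
# Both literals parse to the same characters: \b[A-Za-z]+\d+\b
identical("\\b[A-Za-z]+\\d+\\b", r"(\b[A-Za-z]+\d+\b)")
# [1] TRUE
```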
Using stringr try this:
library(stringr)
library(magrittr)
str_remove_all(foo_data, "\\/|\\d+|\\: WO") %>%
  str_squish()
#> [1] "Education is good" "Health is a priority"
#> [3] "Economy is bad" "Vehicle license is needed"
Created on 2021-04-22 by the reprex package (v2.0.0)
data
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", "04/02/2016 Health is a priority: WO0002021",
              "09/ 08/2016 Economy is bad: WO001999", "09/08/ 2016 Vehicle license is needed: WO001050")
gsub(".*\\d{4}[[:space:]]+(.*):.*", "\\1", foo_data)
#> [1] "Education is good" "Health is a priority"
#> [3] "Economy is bad" "Vehicle license is needed"

How to extract conversational utterances from single string

I have a conversation between several speakers recorded as a single string:
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
I also have a vector of the speakers' names:
speakers <- c("Peter", "Mary", "al hamshi")
Using this vector as a component of my regex pattern I'm doing relatively well with this extraction:
library(stringr)
str_extract_all(convers,
  paste("(?<=: )[\\w\\s]+(?= ", paste0(".*\\b(", paste(speakers, collapse = "|"), ")\\b.*"), ")", sep = ""))
[[1]]
[1] "hiya" "hi how wz your weekend" "ahh still got a headache An you party a lot"
[4] "nuh you know my kid s sick n stuff" "yeah i know thats erm al" "hey guys how s it goin"
[7] "Great" "where ve you been last week"
However, the first part of the third speaker's name (al) is contained in one of the extracted utterances (yeah i know thats erm al), and the last utterance by speaker al hamshi (ah you know camping with my girl friend) is missing from the output. How can the regex be improved so that all utterances are matched and extracted correctly?
What if you take another approach?
Remove all speaker names from the text and split the string on '\\s*:\\s*':
strsplit(gsub(paste(speakers, collapse = "|"), '', convers), '\\s*:\\s*')[[1]]
# [1] "" "hiya"
# [3] "hi how wz your weekend" "ahh still got a headache An you party a lot"
# [5] "nuh you know my kid s sick n stuff" "yeah i know thats erm"
# [7] "hey guys how s it goin" "Great"
# [9] "where ve you been last week" "ah you know camping with my girl friend"
You can clean up the output a bit to remove the first empty value from it.
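Concretely, the empty leading element can be dropped with nzchar() (a small addition, not part of the original answer, using the data from the question):

```r
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
speakers <- c("Peter", "Mary", "al hamshi")

# Remove the speaker names, split on colons, then drop empty strings
res <- strsplit(gsub(paste(speakers, collapse = "|"), "", convers), "\\s*:\\s*")[[1]]
res <- res[nzchar(res)]
res[1]
# [1] "hiya"
```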
A correct splitting approach would look like
p2 <- paste0("\\s*\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")
strsplit(sub("^\\W+", "", gsub(p2, "", convers, perl=TRUE)), "\\s*:\\s*")[[1]]
# => [1] "hiya"
# => [2] "hi how wz your weekend"
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"
# => [5] "yeah i know thats erm"
# => [6] "hey guys how s it goin"
# => [7] "Great"
# => [8] "where ve you been last week"
# => [9] "ah you know camping with my girl friend"
The regex to remove speakers from the string will look like
\s*\b(?:Peter|Mary|al hamshi)(?=:)
It will match
\s* - 0+ whitespaces
\b - a word boundary
(?:Peter|Mary|al hamshi) - one of the speaker names
(?=:) - that must be followed with a : char.
Then, the non-word chars at the start are removed with the sub("^\\W+", "", ...) call, and then the whole string is split with \s*:\s* regex that matches a : enclosed with 0+ whitespaces.
Alternatively, you can use
(?<=(?:Peter|Mary|al hamshi):\s).*?(?=\s*(?:Peter|Mary|al hamshi):|\z)
Details:
(?<=(?:Peter|Mary|al hamshi):\s) - a location immediately preceded with any speaker name and a whitespace
.*? - any 0+ chars (other than line break chars, use (?s) at the pattern start to make it match any chars) as few as possible
(?=\s*(?:Peter|Mary|al hamshi):|\z) - a location immediately followed with 0+ whitespaces, then any speaker name and a : or end of string.
In R, you can use
library(stringr)
speakers <- c("Peter", "Mary", "al hamshi")
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
p = paste0("(?<=(?:",paste(speakers, collapse="|"),"):\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)")
str_extract_all(convers, p)
# => [[1]]
# => [1] "hiya"
# => [2] "hi how wz your weekend"
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"
# => [5] "yeah i know thats erm"
# => [6] "hey guys how s it goin"
# => [7] "Great"
# => [8] "where ve you been last week"
# => [9] "ah you know camping with my girl friend"

Inverted regex : grep with errors allowed

I am trying to create a data table from some pdf files, which results in data that sometimes contains unplanned spaces, e.g.
MWE <- c("Gross Domestic Product 2.3",
         "blabla 1.5",
         "blabla2 6.5",
         "G ross Domestic Product 4.5",
         "Another L ine 9.6",
         "Gross Domestic Product 6.9",
         "G r oss D omes tic Pr o du ct 7.6")
I would like to find all the occurrences of Gross Domestic Product, whether they contain stray spaces or not. But a simple grep("Gross Domestic Product", MWE) takes the spaces into account:
grep("Gross Domestic Product",MWE)
[1] 1 6
I can do that upstream, for instance by erasing every space, e.g.
MWE_2 <- gsub("\\s","",MWE)
grep("GrossDomesticProduct",MWE_2)
[1] 1 4 6 7
I was wondering whether it is possible to achieve the same result within the grep call itself, which could prove useful in some situations (e.g. not creating a new table).
You can build the pattern programmatically and use grep, as shown below. The idea is to create a regex that allows an optional run of whitespace between the characters of the search term.
MWE <- c("Gross Domestic Product 2.3",
         "blabla 1.5",
         "blabla2 6.5",
         "G ross Domestic Product 4.5",
         "Another L ine 9.6",
         "Gross Domestic Product 6.9",
         "G r oss D omes tic Pr o du ct 7.6")
gdp_str <- "Gross Domestic Product"
gdp_str <- sub("\\s*", "\\\\s*", gsub('(.{1})', '\\1\\\\s*', gdp_str))
grep(gdp_str, MWE)
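For reference, running the construction on the sample vector flags the spaced variants as well (the output is not shown in the original answer; this is a self-contained check):

```r
MWE <- c("Gross Domestic Product 2.3",
         "blabla 1.5",
         "blabla2 6.5",
         "G ross Domestic Product 4.5",
         "Another L ine 9.6",
         "Gross Domestic Product 6.9",
         "G r oss D omes tic Pr o du ct 7.6")

# Interleave \s* after every character of the search term (and before it)
gdp_str <- sub("\\s*", "\\\\s*", gsub("(.{1})", "\\1\\\\s*", "Gross Domestic Product"))
grep(gdp_str, MWE)
# [1] 1 4 6 7
```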

Structure character data into data frame

I used the rvest package in R to scrape some web data, but I am having a lot of trouble getting it into a usable format.
My data currently looks like this:
test
[1] "v. Philadelphia"
[2] "TD GardenRegular Season"
[3] "PTS: 23. Jayson TatumREB: 10. M. MorrisAST: 7. Kyrie Irving"
[4] "PTS: 23. Joel EmbiidREB: 15. Ben SimmonsAST: 8. Ben Simmons"
[5] "100.7 - 83.4"
[6] "# Toronto"
[7] "Air Canada Centre Regular Season"
[8] "PTS: 21. Kyrie IrvingREB: 10. Al HorfordAST: 9. Al Horford"
[9] "PTS: 31. K. LeonardREB: 10. K. LeonardAST: 7. F. VanVleet"
[10] "115.6 - 103.3"
Can someone help me perform the correct operations to get it to look like this (as a data frame) and provide the code? I would really appreciate it:
Opponent        Venue
Philadelphia    TD Garden
Toronto         Air Canada Centre
I do not need any of the other information.
Let me know if there are any issues :)
# put your data in here
input <- c("v. Philadelphia", "TD GardenRegular Season",
           "", "", "",
           "# Toronto", "Air Canada Centre Regular Season",
           "", "", "")
index <- 1:length(input)
# raw table format
out_raw <- data.frame(Opponent = input[index %% 5 == 1],
                      Venue = input[index %% 5 == 2])
# using stringi package
install.packages("stringi")
library(stringi)
# copy and clean up
out_clean <- out_raw
out_clean$Opponent <- stri_extract_last_regex(out_raw$Opponent, "(?<=\\s).*$")
out_clean$Venue <- trimws(gsub("Regular Season", "", out_raw$Venue))
out_clean
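The same cleanup can also be done in base R without stringi (a sketch; it assumes the same placeholder input vector as in the answer above):

```r
input <- c("v. Philadelphia", "TD GardenRegular Season",
           "", "", "",
           "# Toronto", "Air Canada Centre Regular Season",
           "", "", "")
index <- seq_along(input)

out <- data.frame(
  # drop the leading "v."/"#" marker and the whitespace that follows it
  Opponent = sub("^\\S+\\s+", "", input[index %% 5 == 1]),
  # drop the glued-on "Regular Season" suffix, then trim leftover spaces
  Venue    = trimws(sub("Regular Season", "", input[index %% 5 == 2])),
  stringsAsFactors = FALSE
)
out
#       Opponent             Venue
# 1 Philadelphia         TD Garden
# 2      Toronto Air Canada Centre
```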
