Remove text after a specific word except certain characters in R

First post: let me know if I'm posting in the wrong place.
I'm looking to remove text from a lot of data in R.
Each line (string?) looks like this:
example_sentence <- "John Doe and Jane Doe (C)"
I would like to keep only the first name in every sentence plus the parentheses (including what's in them).
Each set of parentheses contains one or two letters (upper or lower case).
What I've tried:
example_sentence %>% str_remove("and.*")
This obviously removes the parentheses too. I'm just getting to know regular expressions. Looking for something like:
[^(*)]
I can't get it to work. Any thoughts?
EDIT:
Here's some more input as requested. Maybe it will help others! (och = and in Swedish)
[1] "Anders Ahlgren och Anders Åkesson (C)"
[2] "Karin Nilsson (C)"
[3] "Edward Riedl (M)"
[4] "Per-Ingvar Johnsson och Anders Åkesson (C)"
[5] "Per-Ingvar Johnsson och Annika Qarlsson (C)"
[6] "Annika Qarlsson och Ulrika Carlsson i Skövde (C)"
Expected output:
[1] "Anders Ahlgren (C)"
[2] "Karin Nilsson (C)"
[3] "Edward Riedl (M)"
[4] "Per-Ingvar Johnsson (C)"
[5] "Per-Ingvar Johnsson (C)"
[6] "Annika Qarlsson (C)"

The [^(*)] pattern matches any single character other than (, * and ), and str_remove removes only the first such character from the string (str_remove_all would remove them all).
If you plan to remove the word "and" and any chars other than ( and ) after it, you may use
example_sentence %>% str_remove("\\band\\b[^()]*")
Or, using base R:
sub("\\band\\b[^()]*", "", example_sentence)
The pattern matches:
\band\b - the whole word "and" (\b is a word boundary)
[^()]* - any char other than ( and ), zero or more times.
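As a quick check against the Swedish examples from the edit (a sketch; the real data uses "och" rather than "and", so the alternation below is an assumption):
library(stringr)
x <- c("Anders Ahlgren och Anders Åkesson (C)", "Karin Nilsson (C)")
# drop "and"/"och" together with everything up to the opening parenthesis
str_remove(x, "\\b(?:and|och)\\b[^()]*")
# [1] "Anders Ahlgren (C)" "Karin Nilsson (C)"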

Try this:
example_sentence <- "John Doe and Jane Doe (C)"
spliting <- function(x) {
  # split the sentence on spaces
  y <- strsplit(x, split = " ")
  z <- y[[1]]
  # keep the first word and the last token (the parenthesis)
  z <- z[c(1, length(z))]
  return(z)
}
spliting(example_sentence)
[1] "John" "(C)"

You might be able to do this with capture groups. As Ronak says, a few more example input/outputs would be helpful as I'm not sure we know 100% all the possible forms you have in your data.
Here is a start in any case:
gsub('and.*(\\([^)]*\\)).*', '\\1', example_sentence)
# [1] "John Doe (C)"

Related

How to remove only words that end with period with Regex?

I am trying to remove suffixes from a list of last names using regex:
names <- c("John max Jr.", "manuel cortez", "samuel III", "Jameson")
lapply(names, function(x) str_extract(x, ".*[^\\s.*\\.$]"))
Output:
[[1]]
[1] "John max Jr"
[[2]]
[1] "manuel cortez"
[[3]]
[1] "samuel III"
[[4]]
[1] "Jameson"
What I am currently doing does not work.... I was trying to remove all words that end with a period.
If you could please help me solve this and explain, it would be greatly appreciated. I also need to remove roman numerals but hopefully I can figure that out after learning to remove words ending in period.
Desired Output:
John max
manuel cortez
samuel
Jameson
Updated to remove Roman Numerals:
lapply(names, function(x) str_extract(x, ".*[^(\\s.*\\.$)|(\\sI{2}+)]"))
If we just want to remove something, maybe str_remove() is better:
library(stringr)
lapply(names, function(x) str_remove(x, "\\w+\\.$")) |>
  trimws()
[1] "John max"      "manuel cortez" "samuel III"    "Jameson"

Extract proper nouns from text in R?

Is there any better way of extracting proper nouns (e.g. "London", "John Smith", "Gulf of Carpentaria") from free text?
That is, a function like
proper_nouns <- function(text_input) {
# ...
}
such that it would extract a list of proper nouns from the text input(s).
Examples
Here is a set of 7 text inputs (some easy, some harder):
text_inputs <- c("a rainy London day",
                 "do you know John Smith?",
                 "sail the Adriatic",
                 # tougher examples
                 "Hey Tom, where's Fred?",             # more than one proper noun in the sentence
                 "Hi Lisa, I'm Joan.",                 # more than one proper noun, separated by a capitalized word
                 "sail the Gulf of Carpentaria",       # proper noun containing an uncapitalized word
                 "The great Joost van der Westhuizen." # proper noun containing two uncapitalized words
)
And here's what such a function, set of rules, or AI should return:
proper_nouns(text_inputs)
[[1]]
[1] "London"
[[2]]
[1] "John Smith"
[[3]]
[1] "Adriatic"
[[4]]
[1] "Tom" "Fred"
[[5]]
[1] "Lisa" "Joan"
[[6]]
[1] "Gulf of Carpentaria"
[[7]]
[1] "Joost van der Westhuizen"
Problems: simple regex are imperfect
Consider some simple regex rules, which have obvious imperfections:
Rule: take capitalized words, unless they're the first word in the sentence (which would ordinarily be capitalized). Problem: will miss proper nouns at start of sentence.
Rule: assume successive capitalized words are parts of the same proper noun (multi-part proper nouns like "John Smith"). Problem: "Gulf of Carpentaria" would be missed since it has an uncapitalized word in between.
Similar problem with people's names containing uncapitalized words, e.g. "Joost van der Westhuizen".
Question
The best approach I currently have is to simply use the regular expressions above and make do with a low success rate. Is there a better or more accurate way to extract the proper nouns from text in R? If I could get 80-90% accuracy on real text, that would be great.
You can start by taking a look at the spacyr library.
library(spacyr)
result <- spacy_parse(text_inputs, tag = TRUE, pos = TRUE)
proper_nouns <- subset(result, pos == 'PROPN')
split(proper_nouns$token, proper_nouns$doc_id)
#$text1
#[1] "London"
#$text2
#[1] "John" "Smith"
#$text3
#[1] "Adriatic"
#$text4
#[1] "Hey" "Tom"
#$text5
#[1] "Lisa" "Joan"
#$text6
#[1] "Gulf" "Carpentaria"
This treats every word separately, hence "John" and "Smith" are not combined. You may need to add some rules on top of this and do some post-processing if that is what you require.
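For example, one possible post-processing step (a sketch, assuming the token_id column that spacy_parse returns) is to collapse runs of consecutive PROPN tokens into a single name:
library(dplyr)
proper_nouns %>%
  group_by(doc_id) %>%
  # tokens with consecutive ids belong to the same run
  mutate(run = cumsum(c(TRUE, diff(token_id) != 1))) %>%
  group_by(doc_id, run) %>%
  summarise(name = paste(token, collapse = " "), .groups = "drop")
This would turn text2 into "John Smith", but it still would not join "Gulf of Carpentaria", since "of" is not tagged PROPN; connecting particles (of, van, der, ...) would need an extra rule.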

How to delete parts of a textual vector using gsub and regular expressions

I have a list in which each element contains a vector of textual data.
In essence, I would like the code to delete the text that follows the second "." in each element of the vector.
I believe the gsub function is a good way to go about this if used in connection with regular expressions. I have tried to formulate the pattern to be detected using a regular expression (see below).
Data:
v<-c("M. le président. La parole est à M. Emile Vernaudon.",
"M.Gabriel Xaaperei. Monsieur le ministre",
"M. Raymond Fornir, rapporteur. La commission")
Code:
Subbed <- gsub("[^((?<=^M. *))]", "X", v)
The code returns the following:
[1] "M. XX XXXXXXXXX. XX XXXXXX XXX. M. XXXXX XXXXXXXXX."
[2] "M. XXXXXXX XXXXXXXXX. MXXXXXXX XX XXXXXXXXX XXX"
[3] "M. XXXXXXX XXXXXX XXXXXXXXXX. XX XXXXXXXXXX"
Not only does the code take all the "M."s into account, but there is also an "M" left in the second row even though it is not followed by a ".".
My hunch is that regular expressions work differently in gsub: the "M." in my code might be read by R as "M|.". Also, the ^ after the lookaround doesn't seem to work as an anchor but simply as an additional punctuation character.
The desired outcome is as follows:
[1] "M. le président."
[2] "M. Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
Any help much appreciated.
1) sub Match the beginning of string (^) and then capture M. . Next match spaces if any and then capture everything up to the next dot. Finally match everything else. Replace that with the first capture (\1), a space and the second capture (\2).
Note that we use sub rather than gsub since there is just one overall match per component. Also, it puts a space after the M. even if it did not already have one.
sub("^(M\\.) *([^.]+\\.).*", "\\1 \\2", v)
giving:
[1] "M. le président." "M. Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
2) read.table This solution does not use any regular expressions. We read in v using dot separated fields and then assemble them back together using sprintf.
with(read.table(text = v, sep = ".", fill = TRUE, strip.white = TRUE),
     sprintf("%s. %s.", V1, V2))
giving:
[1] "M. le président." "M. Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
3) paste/trimws/sub This uses several functions and only one regex which is relatively simple. We take everything from the 3rd character onwards, replace the first dot and everything after it with a dot, trim whitespace in case any is left and paste M. onto the beginning.
paste("M.", trimws(sub("\\..*", ".", substring(v, 3))))
giving:
[1] "M. le président." "M. Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
Another option:
gsub("^([^.]*.[^.]*).*", "\\1.", v)
[1] "M. le président." "M.Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
You placed your regular expression within square brackets, which R interprets as a character class, and then indeed treats everything in that class as "OR". You also preceded that with ^, which inside a class makes R treat it as "NOT", so it basically looks for anything but the characters in your search term.
Furthermore, you didn't escape your periods. Here's the regex as it should be:
gsub("^(M\\..*?\\.).*","\\1",v)
[1] "M. le président." "M.Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
This looks for M. (the period is escaped), followed by anything (unescaped .) for an undetermined number of times (*) which is followed by a second (escaped) period (the ? is to make sure it's ungreedy, so it doesn't look for the last period, only the next one).
It then returns everything up to there (\\1) and discards the rest.
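To see what the ungreedy ? buys you (a small illustration, not from the original answer): with a greedy .* the capture runs to the last period in the string, so the first element comes back whole:
sub("^(M\\..*\\.).*", "\\1", v[1])
# [1] "M. le président. La parole est à M. Emile Vernaudon."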

Find the names contained in each sentence cycling through a large vector of names

This question is an extension of this one: Find the names contained in each sentence (not the other way around)
I'll write the relevant part here. From this:
> sentences
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther"
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments."
[5] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
We obtained this result:
library(stringr)
lst <- str_extract_all(sentences, paste(toMatch, collapse="|"))
lst[lengths(lst)==0] <- NA
lst
#[[1]]
#[1] "Martin Luther"
#[[2]]
#[1] "Melanchthon" "Martin Luther"
#[[3]]
#[1] "Paul"
#[[4]]
#[1] NA
#[[5]]
#[1] "Melanchthon"
But for a large toMatch vector, concatenating its values with the OR operator might not be very efficient. So my question is: how can the same result be obtained using a function or a loop? Maybe that way a regular expression like \< or \b can be used around the toMatch values so that the system only looks for whole words instead of substrings.
I've tried this but don't know how to save the matches in lst to get the same result as above.
for (i in 1:length(sentences)) {
  for (j in 1:length(toMatch)) {
    lst <- str_extract_all(sentences[i], toMatch[j])
  }
}
Are you expecting something like this?
library(stringr)
sentences <- c(
"Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin",
" Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther",
" He studied the Scripture, especially of Paul, and Evangelical doctrine",
" He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",
" Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium")
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
lst <- list()
for (i in 1:length(sentences)) {
  # one NA placeholder per name in toMatch
  lst[[i]] <- NA * seq(length(toMatch))
  for (j in 1:length(toMatch)) {
    tmp <- str_extract_all(sentences[i], toMatch[j])
    if (length(tmp[[1]]) > 0) {
      lst[[i]][j] <- tmp[[1]]
    }
  }
}
# drop the NA placeholders and keep only the matches
lst <- lapply(lst, function(x) x[!is.na(x)])
lst
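Alternatively, the single-pattern approach from the linked question can be kept and made whole-word-only with \b, which addresses the substring concern without a double loop (a sketch):
library(stringr)
pat <- paste0("\\b(?:", paste(toMatch, collapse = "|"), ")\\b")
lst <- str_extract_all(sentences, pat)
lst[lengths(lst) == 0] <- NA
lst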

removing everything after first 'backslash' in a string

I have a vector like below
vec <- c("abc\edw\www", "nmn\ggg", "rer\qqq\fdf"......)
I want to remove everything from the first backslash onwards, like below:
newvec <- c("abc","nmn","rer")
Thank you.
My original vector is as below (only the head)
[1] "peoria ave\nste \npeoria" [2] "wood dr\nphoenix"
"central ave\nphoenix"
[4] "southern ave\nphoenix" [5] "happy valley rd\nste
\nglendaleaz " "the americana at brand\n americana way\nglendale"
Here the problem is that my original csv file does not contain backslashes, but when I read it in, backslashes appear. The original csv file is as below:
[1] "peoria ave [2] "wood dr
nste nphoenix"
npeoria"
As you can see, the pieces are actually separated by "ENTER" (line breaks), but when I read the file in R using read.csv() they show up as backslash sequences.
Another solution:
sub("\\\\.*", "", vec)
vec <- c("abc\\edw\\www", "nmn\\ggg", "rer\\qqq\\fdf")
sub("([^\\\\])\\\\.*","\\1", vec)
[1] "abc" "nmn" "rer"
strsplit(vec, "\\\\") should do the job.
To select the first element, use [[1]][1]; for the second, [[1]][2].
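To keep just the first piece of every element at once (a sketch):
vec <- c("abc\\edw\\www", "nmn\\ggg", "rer\\qqq\\fdf")
sapply(strsplit(vec, "\\\\"), `[`, 1)
# [1] "abc" "nmn" "rer"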
