How do I extract names with initials in R using sub?

I have several paragraphs from which I am trying to extract initials together with the surname that follows them.
For example, I might have a paragraph with lots of text that has the name "A. J. Balfour" in it, or "J. Balfour".
This is what I am writing right now and it doesn't work. I would love your feedback!
z = "This is a bunch of text. I would like to extract A J Balfour"
sub("^(([A]\\S+\\s){1}\\S+).*", "\\1", z, perl = TRUE)
I am thinking the best option is using sub, but I am having issues getting my regular expression to work. I am having trouble finding good info on writing a regular expression that will extract characters.
Thank you.

The stringr library has the str_extract functions with an easier syntax than just using sub.
library(stringr)
str_extract(z, "[A]\\S{0,1}\\s(\\S\\S{0,1}\\s){0,1}.*")
#[1] "A J Balfour"
Edit:
Here is another attempt, but since you are asking for a more general solution, it is very difficult to get an exact match.
z<-c( "This is a bunch of text. I would like to extract A J Balfour",
"J Balfour",
'This is a bunch of text. G. Balfour'
)
str_extract_all(z, "([A-Z]+[\\. ]{1,2}){1,2}.*")
# ( - start of grouping
# [A-Z] - Any capital letter
# + - at least 1 times
# [\\. ] - a period or a space
# {1,2} - one or two times
# ){1,2} - 1 or 2 times for the grouping
# .* - any character zero or more times
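In fact, this attempt fails on the first test string: the match starts at "I would", because "I " already satisfies the grouping. Narrowing the class to [A-J] would cut down false matches in general, though not this one, since "I" falls inside that range.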
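One way to tighten the match is sketched below. It assumes a name is one or two single-letter initials (each with an optional period) followed by a capitalized surname; the variable names are illustrative, and surnames with internal capitals or hyphens would need a wider character class.

```r
# Assumed name shape: 1-2 initials (optional period), then a capitalized surname
name_re <- "\\b(?:[A-Z]\\.? ){1,2}[A-Z][a-z]+"

z <- c("This is a bunch of text. I would like to extract A J Balfour",
       "J Balfour",
       "This is a bunch of text. G. Balfour")

# gregexpr finds every match per string; regmatches pulls the matched text out
found <- regmatches(z, gregexpr(name_re, z, perl = TRUE))
unlist(found)
# [1] "A J Balfour" "J Balfour"   "G. Balfour"
```

Requiring a capitalized word after the initials is what keeps "I would" from matching.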
Good luck.

Thank you! I ended up using str_extract_all to look like this:
z = "This is a bunch of text. I would like to extract A. J. Balfour and maybe some other words or another A. F. Balfour or even G. G. Balfour or maybe even A. G. Balfour"
str_extract_all(z, "[A-Z]\\. [A-Z]\\. Balfour", simplify = TRUE)
Thanks for all the thoughts!

Consider using regmatches in base R.
z = "This is a bunch of text. I would like to extract A J Balfour"
regmatches(z,regexpr("[A]\\s{1}\\S+.*", z))
#[1] "A J Balfour"

Related

Is my R Regular Expression matching correctly?

I've struggled with regular expressions in general and recently wrote one that I think is working correctly, but I'm not sure. My question to anyone who takes the time to review my code below - is it theoretically doing what I want it to do?
Purpose: I'm looking through every column in my data set to identify rows that include strings that begin with 'pharmacy - ' followed by any one of 13 drug types and ends with parentheses with a number inside. Here are some examples:
pharmacy - oxycodone/acetaminophen (3)
pharmacy - fentanyl (2.83)
pharmacy - hydromorphone (6.8)
The code I wrote is below. I believe it is working but would appreciate if any regex experts out there could take a look and confirm that it is doing what I think it's supposed to be doing:
viz$med_2 <- apply(viz, 1, function(x)as.integer(any(grep("^pharmacy+[ -]+(codeine|oxycodone|fentanyl|hydrocodone|hydromophone|mathadone|morphine sulfate|oxycodone|oxycontin|roxicodone|tramadol|hydrocodone/acetaminophen|oxycodone/acetaminophen)+[ -]+[(]+[0-9]+", x))))
No expert, but your expression looks great, I would maybe just slightly modify that to:
^pharmacy\s*-\s*(codeine|oxycodone|fentanyl|hydrocodone|hydromophone|mathadone|morphine sulfate|oxycodone|oxycontin|roxicodone|tramadol|hydrocodone\/acetaminophen|oxycodone\/acetaminophen)\s*\(\s*[0-9]+(\.[0-9]+)?\s*\)$
In this demo, the expression is explained, if you might be interested.
Make sure about required escaping for R.
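As a rough illustration of that escaping, the same pattern could be written as an R string like this. The drug list is a shortened subset with assumed spellings, and the fourth test string is made up to show a non-match.

```r
# Drug names here are an illustrative subset with assumed spellings
drugs <- c("codeine", "fentanyl", "hydromorphone", "oxycodone",
           "oxycodone/acetaminophen")

# Each regex backslash is doubled inside an R string literal
med_re <- paste0("^pharmacy\\s*-\\s*(", paste(drugs, collapse = "|"),
                 ")\\s*\\(\\s*[0-9]+(\\.[0-9]+)?\\s*\\)$")

x <- c("pharmacy - oxycodone/acetaminophen (3)",
       "pharmacy - fentanyl (2.83)",
       "pharmacy - hydromorphone (6.8)",
       "pharmacy - aspirin (1)")

grepl(med_re, x, perl = TRUE)
# [1]  TRUE  TRUE  TRUE FALSE
```

The forward slash needs no escaping in R, since R patterns are plain strings rather than slash-delimited literals.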
RegEx Circuit
jex.im can be used to visualize regular expressions.
You need to escape special characters (with a double backslash \\ in R) or the regex will throw an error.
In regex, + means match the preceding character one or more times. So pharmacy+ matches pharmac followed by one or more y, which is probably not what you intended.
I'd recommend using \\s instead of a simple whitespace. \\s matches any whitespace character [ \t\r\n\f] and is therefore more versatile.
Here's how I would do it.
viz <- data.frame(
med_2 = c(
"pharmacy - oxycodone/acetaminophen (3)",
"pharmacy - fentanyl (2.83)",
"pharmacy - hydromorphone (6.8)"
)
)
# list of the different drug names
drugs_ls <- c(
"codeine",
"oxycodone",
"fentanyl",
"hydrocodone",
"hydromophone",
"mathadone",
"morphine sulfate",
"oxycontin",
"roxicodone",
"tramadol",
"acetaminophen"
)
# concatenate and separate drug names with a pipe
drugs_re <- paste0(drugs_ls, collapse = "|")
# generate the regex
med_re <- paste0("^(?i)pharmacy[\\s-]+(?:", drugs_re, ")(?:\\/acetaminophen)?[\\s-]+\\(\\d")
viz$med_2 <- apply(viz, 1, function(x)as.integer(any(grep(med_re, x, perl = TRUE))))
viz
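# med_2
#1 1
#2 1
#3 0
# Row 3 is 0 because drugs_ls spells the drug "hydromophone" while the data has "hydromorphone"; correcting the spelling in the list makes it match.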
The whole regex looks like this:
^(?i)pharmacy[\\s-]+(?:codeine|oxycodone|fentanyl|hydrocodone|hydromophone|mathadone|morphine sulfate|oxycontin|roxicodone|tramadol|acetaminophen)(?:\\/acetaminophen)?[\\s-]+\\(\\d
(?i) makes the regex case insensitive.
(?:) creates a non-capturing group.
? makes the preceding character or group optional (zero or one occurrence).
\\d is a shorthand for [0-9].

obtaining count of phrases contained between parentheses and containing specific character

There must be a simple answer to this, but I'm new to regex and couldn't find one.
I have a dataframe (df) with text strings arranged in a column vector of length n (df$text). Each of the texts in this column is interspersed with parenthetical phrases. I can identify these phrases using:
regmatches(df$text, gregexpr("(?<=\\().*?(?=\\))", df$text, perl=T))[[1]]
The code above returns all text between parentheses. However, I'm only interested in parenthetical phrases that contain 'v.' in the format 'x v. y', where x and y are any number of characters (including spaces) between the parentheses; for example, '(State of Arkansas v. John Doe)'. Matching phrases (court cases) are always of this format: open parentheses, word beginning with capital letter, possible spaces and other words, v., another word beginning with a capital letter, and possibly more spaces and words, close parentheses.
I'd then like to create a new column containing counts of x v. y phrases in each row.
Bonus if there's a way to do this separately for the same phrases denoted by italics rather than enclosed in parentheses: State of Arkansas v. John Doe (but perhaps this should be posed as a separate question).
Thanks for helping a newbie!
I believe I have figured out what you want, but it is hard to tell without example data. I have made an example data frame to work with. If it is not what you are going for, please give an example.
df <- data.frame(text = c("(Roe v. Wade) is not about boats",
"(Dred Scott v. Sandford) and (Plessy v. Ferguson) have not stood the test of time",
"I am trying to confuse you (this is not a court case)",
"this one is also confusing (But with Capital Letters)",
"this is confusing (With Capitols and v. d)"),
stringsAsFactors = FALSE)
The regular expression I think you want is:
cases <- regmatches(df$text, gregexpr("(?<=\\()([[:upper:]].*? v\\. [[:upper:]].*?)(?=\\))",
df$text, perl=T))
You can then get the number of cases and add it to your data frame with:
df$numCases <- vapply(cases, length, numeric(1))
As for italics, I would really need an example of your data. Usually that kind of formatting isn't stored when you read a string into R, so the italics effectively don't exist anymore.
Change your regex like below,
regmatches(df$text, gregexpr("(?<=\\()[^()]*\\sv\\.\\s[^()]*(?=\\))", df$text, perl=T))[[1]]
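Putting the two answers together, a count column might be built as in the sketch below. It assumes case names always start both sides of "v." with a capital letter, as the question describes; case_re is just an illustrative name.

```r
df <- data.frame(
  text = c("(Roe v. Wade) is not about boats",
           "(Dred Scott v. Sandford) and (Plessy v. Ferguson) have not stood the test of time",
           "this is confusing (With Capitols and v. d)"),
  stringsAsFactors = FALSE
)

# [^()]* cannot cross a parenthesis, so a match stays inside one pair
case_re <- "(?<=\\()[[:upper:]][^()]*\\sv\\.\\s[[:upper:]][^()]*(?=\\))"

cases <- regmatches(df$text, gregexpr(case_re, df$text, perl = TRUE))
# lengths() counts the matches per row; rows without a match give 0
df$numCases <- lengths(cases)
df$numCases
# [1] 1 2 0
```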

remove multiple patterns from text vector r

I want to remove multiple patterns from multiple character vectors. Currently I am going:
a.vector <- gsub("#\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]]", "", a.vector)
etc etc.
This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.
Neither the mapply nor the mgsub are working. I made these vectors
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")
Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.
a.vector looks like this:
[4951] "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "#stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
I want:
[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing"
I know this answer is late on the scene, but it stems from my dislike of having to manually list the removal patterns inside the grep functions (see other solutions here). My idea is to set the patterns beforehand, retain them as a character vector, then paste them together (i.e. when "needed") using the regex separator "|":
library(stringr)
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))
Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove.
Try combining your subpatterns using |. For example
> s <- "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("#\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation "
But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.
Consider creating your remove vector as you suggested, then applying it in a loop
> s1 <- s
> remove<-c("#\\w+","http\\w+","[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation "
This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants.
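Since gsub is already vectorized over its input, wrapping the loop in a function may be all that is needed. A minimal sketch, where remove_patterns is a hypothetical helper name and the sample strings are shortened from the question:

```r
# remove_patterns is a hypothetical helper name, not from any package
remove_patterns <- function(x, patterns) {
  # gsub is vectorized over x, so each pass cleans the whole vector
  for (p in patterns) x <- gsub(p, "", x)
  x
}

a.vector <- c("#karakamen: Suicide amongst successful men is becoming rampant.",
              "#stiphan: you are phenomenal.. httptxjwufmfg")
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")

# trimws drops the leading/trailing spaces left behind by the removals
cleaned <- trimws(remove_patterns(a.vector, remove))
cleaned
# [1] "Suicide amongst successful men is becoming rampant"
# [2] "you are phenomenal"
```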
In case the multiple patterns that you are looking for are fixed and don't change from case-to-case, you can consider creating a concatenated regex that combines all of the patterns into one uber regex pattern.
For the example you provided, you can try:
removePat <- "(#\\w+)|(http\\w+)|([[:punct:]])"
a.vector <- gsub(removePat, "", a.vector)
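Note that the patterns as given strip every hashtag, whereas the desired output in the question keeps #mental and #Writing. If only the leading handle should go, one possible variant (an assumption about the intent, not something the question states) anchors the first pattern and spares "#" from the punctuation class:

```r
a.vector <- c(
  "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental",
  "#stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
)

# ^#\\w+ only strips the leading handle; '#' is excluded from the punctuation class
keep_tags_re <- "^#\\w+|http\\w+|[^[:alnum:][:space:]#]"

result <- trimws(gsub(keep_tags_re, "", a.vector, perl = TRUE))
result
# [1] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
# [2] "you are phenomenal #mental #Writing"
```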
I had a vector with the statement "my final score" and I wanted to keep only the word final and remove the rest. This is what worked for me, based on Marian's suggestion:
str_remove_all("my final score", "my |score")
note: "my final score" is just an example. I was dealing with a vector.

R extract text until, and not including x

I have a bunch of strings of mixed length, all with a year embedded. I am trying to extract just the text part, that is, everything until the number starts, and I am having problems with lookahead assertions, assuming that is the proper way to do such extractions.
Here is what I have (returns no match):
>grep("\\b.(?=\\d{4})","foo_1234_bar",perl=T,value=T)
In the example I am looking to extract just foo but there may be several, and of mixed lengths, separated by _ before the year portion.
Look-aheads may be overkill here. Use the underscore and the 4 digits as the structure, combined with a non-greedy quantifier to prevent the 'dot' from gobbling up everything:
/(.+?)_\d{4}/
-first matching group ($1) holds 'foo'
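In R there are no slash delimiters and the backslash has to be doubled, so the same idea could be sketched with sub and a back-reference. The second test string is made up to show several underscore-separated parts before the year.

```r
x <- c("foo_1234_bar", "foo_bar_2020_baz")

# Lazy .+? stops at the first underscore that is followed by four digits;
# the trailing .* lets sub() replace the whole string with group 1
res <- sub("^(.+?)_\\d{4}.*$", "\\1", x, perl = TRUE)
res
# [1] "foo"     "foo_bar"
```

Strings that contain no underscore followed by four digits are returned unchanged, so it may be worth checking for that case separately.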
This will grab everything up until the first digit:
x <- c("asdfas_1987asdf", "asd_das_12")
regmatches(x, regexpr("^[^[:digit:]]*", x))
#[1] "asdfas_" "asd_das_"
Another approach (I often find that strsplit is faster than regex searching, but not always, and this does still use a slight bit of regex):
x <- c("asdfas_1987asdf", "asd_das_12") #shamelessly stealing Dason's example
sapply(strsplit(x, "[0-9]+"), "[[", 1)

Extracting specified word from a vector using R

I have a text e.g
text<- "i am happy today :):)"
I want to extract :) from text vector and report its frequency
Here's one idea, which would be easy to generalize:
text<- c("i was happy yesterday :):)",
"i am happy today :)",
"will i be happy tomorrow?")
(nchar(text) - nchar(gsub(":)", "", text))) / 2
# [1] 2 1 0
I assume you only want the count, or do you also want to remove :) from the string?
For the count you can do:
length(gregexpr(":)",text)[[1]])
which gives 2. A more generalized solution for a vector of strings is:
sapply(gregexpr(":)",text),length)
Edit:
Josh O'Brien pointed out that this also returns 1 if there is no :), since gregexpr returns -1 in that case. To fix this you can use:
sapply(gregexpr(":)",text),function(x)sum(x>0))
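Because ":)" contains a regex metacharacter, treating it as a literal string with fixed = TRUE avoids any escaping questions. A small sketch along the same lines, where count_fixed is a hypothetical helper name:

```r
text <- c("i was happy yesterday :):)",
          "i am happy today :)",
          "will i be happy tomorrow?")

# count_fixed is a hypothetical helper; fixed = TRUE matches ":)" literally
count_fixed <- function(x, pattern) {
  m <- gregexpr(pattern, x, fixed = TRUE)
  # gregexpr returns -1 when nothing matches, so count positive positions only
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

counts <- count_fixed(text, ":)")
counts
# [1] 2 1 0
```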
Which does become slightly less pretty.
This does the trick but might not be the most direct way:
mytext<- "i am happy today :):)"
# The following line inserts semicolons to split on
myTextSub<-gsub(":)", ";:);", mytext)
# Then split and unlist
myTextSplit <- unlist(strsplit(myTextSub, ";"))
# Then see how many times the smiley turns up
length(grep(":)", myTextSplit))
EDIT
To handle vectors of text with length > 1, don't unlist:
mytext<- rep("i am happy today :):)",2)
myTextSub<-gsub(":\\)", ";:\\);", mytext)
myTextSplit <- strsplit(myTextSub, ";")
sapply(myTextSplit,function(x){
length(grep(":)", x))
})
But I like the other answers better.

Resources