How do I make my string split case insensitive? - r

Below is a piece of code that splits a large piece of text called "lines" into multiple strings. It splits whenever it detects ending punctuation (such as . or ?), but it excludes all periods that immediately follow an abbreviation such as Mr.
lines<-unlist(strsplit(lines, paste("(?<=(?<!", abbr,")[\\.\\?\\!])[\\s”’]"), perl = T))
All of the abbreviations are stored in a vector called "abbr" and they are all capitalized (Mr., Mrs. as opposed to mr., mrs.). The problem I have with my code is that I want it to be case insensitive and detect abbreviations in the text that aren't capitalized and I want to accomplish this without simply adding lower case versions of each abbreviation to the abbr vector.

strsplit itself has no option for case insensitivity, but you can build an equivalent (if somewhat regex-inefficient) pattern with
abbr <- "SomeText"
abbr1 <- strsplit(abbr, "")
abbr1
# [[1]]
# [1] "S" "o" "m" "e" "T" "e" "x" "t"
abbr2 <- paste(sprintf("[%s%s]", toupper(abbr1[[1]]), tolower(abbr1[[1]])), collapse = "")
abbr2
# [1] "[Ss][Oo][Mm][Ee][Tt][Ee][Xx][Tt]"
and use abbr2 in place of abbr in your code above.
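As an alternative sketch (not part of the answer above): because the original call already uses perl = TRUE, the PCRE inline flag (?i) can make the whole pattern case insensitive. The abbr vector and the text below are made-up sample data, and the abbreviations are assumed to be stored without their trailing period.

```r
# Hypothetical sample data; abbreviations are stored without the trailing period.
abbr <- c("Mr", "Mrs", "Dr")
text <- "I met mr. Smith today. He knows Dr. Jones! Do you?"

# Join the abbreviations into one alternation and prefix the PCRE flag (?i)
# so that "Mr" also matches "mr" or "MR".
pattern <- paste0("(?i)(?<=(?<!", paste(abbr, collapse = "|"), ")[.?!])\\s")

# Split on whitespace that follows sentence-ending punctuation,
# unless that punctuation directly follows an abbreviation.
sentences <- unlist(strsplit(text, pattern, perl = TRUE))
print(sentences)
```

Note that this simple version would also protect any word that merely ends in one of the abbreviations, so it may need tightening for real text.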

Related

Combine two grep functions using AND operator

I want to combine the following commands using AND operator:
grep("^ab", strings, value = TRUE)
grep("ab$", strings, value = TRUE)
Here is an example for OR operator
http://r.789695.n4.nabble.com/grep-for-multiple-pattern-td4685244.html#a4685247
Would you please advise?
The search for an AND operator in regex (whether in R or elsewhere) can be a long and sad search. The boolean AND means that both of two statements have to be true. How would you apply that to regex? Consider the regex pattern "ab", in grep("ab", strings). Even this simple pattern has several requirements, ALL of which have to be true. It has to have an "a", AND it has to have a "b", AND the "b" has to follow the "a" directly.
strings <- c("abraham, not ahab", "no it was ahab",
"abraham was the one they left on ceti alpha V",
"You're talking about Sherlock Holmes", "He tasks me", "ab")
grep("ab", strings, value = TRUE)
# [1] "abraham, not ahab"
# [2] "no it was ahab"
# [3] "abraham was the one they left on ceti alpha V"
# [4] "You're talking about Sherlock Holmes"
# [5] "ab"
If what you'd like is to match strings that BOTH start with "ab" AND end with "ab", then @r2evans's pattern will work for you: grep("^ab.*ab$", strings, value = TRUE) will show them to you. This means the string starts with "ab", has zero or more other characters, and then ends with "ab".
grep("^ab.*ab$", strings, value = TRUE)
# [1] "abraham, not ahab"
# NOTICE THAT THIS DOESN'T MATCH "ab", despite "ab" being at the beginning
# AND the end
If what you'd like is to match all the strings that start with an "a" immediately followed by a "b", AND ALSO all those that end with an "a" immediately followed by a "b", then you actually want grep("(^ab)|(ab$)", strings, value = TRUE)
grep("^ab|ab$", strings, value = TRUE)
# [1] "abraham, not ahab"
# [2] "no it was ahab"
# [3] "abraham was the one they left on ceti alpha V"
# [4] "ab"
So what about that solitary "ab" case? What regex pattern would match that and only that?
grep("^ab$", strings, value = TRUE)
# [1] "ab"
In this case, we wanted all of the matches to BOTH start AND end with "ab", but it had to be the same "ab". Of course, we could combine this with the other "AND" version, and get all of the matches where ab was at the start and ab was at the end:
grep("^ab$|^ab.*ab$", strings, value = TRUE)
# [1] "abraham, not ahab" "ab"
..and one more thing:
We can use @r2evans's comment to demonstrate a sort of De Morgan's law with regex. Notice that the combined pattern "^ab$|^ab.*ab$" (which uses the | metacharacter) selects the same strings that you get by subsetting the strings object with the logical vector produced by combining both individual regex matches with a boolean AND:
strings[grepl("^ab", strings) & grepl("ab$", strings)]
# [1] "abraham, not ahab" "ab"
Here grepl returns a logical vector, and we use it twice. The first call is TRUE for every element of strings that matches "^ab", and the second is TRUE for every element that matches "ab$". Combining those logical vectors with the & operator selects the same strings as the "^ab$|^ab.*ab$" pattern.
You may use
grep("^ab(.*ab)?$", strings, value = TRUE)
The pattern matches a string that starts with ab, optionally continues with a substring ending in ab, and then reaches the end of the string:
^ - start of string
ab - an ab substring
(.*ab)? - 1 or 0 repetitions (due to the ? quantifier) of
    .* - any 0+ chars, as many as possible
    ab - an ab substring
$ - end of string.
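A quick way to check the answers above against each other is to run them on the same sample vector. This is only a sketch: it reuses the strings vector from the earlier answer and adds base R's startsWith() and endsWith() as a regex-free comparison.

```r
strings <- c("abraham, not ahab", "no it was ahab",
             "abraham was the one they left on ceti alpha V",
             "You're talking about Sherlock Holmes", "He tasks me", "ab")

# Single pattern: starts with "ab" and, optionally, has more text ending in "ab"
by_pattern <- grep("^ab(.*ab)?$", strings, value = TRUE)

# Two separate tests combined with a logical AND
by_logic <- strings[grepl("^ab", strings) & grepl("ab$", strings)]

# Same idea without regular expressions
by_fixed <- strings[startsWith(strings, "ab") & endsWith(strings, "ab")]

print(by_pattern)
print(identical(by_pattern, by_logic) && identical(by_logic, by_fixed))
```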

R: which() function can't find specified string in character vector (vector DOES contain string)

I wrote a simple function to scrape the names of all MLB pitchers from Baseball Reference.com. I created a vector with the scraped names, removed the "Â" from the raw scraped names, and coerced the result into a character vector.
library(rvest)
url <- "http://www.baseball-reference.com/leagues/MLB/2016-standard-pitching.shtml"
mlbpitcherdata <- read_html(url)
mlbpitchers <- mlbpitcherdata %>% html_nodes("td:nth-child(2) a") %>% html_text()
mlbpitchers <- as.character(sapply(mlbpitchers, function(x) gsub("Â","",x))) # Remove "Â" from all raw pitcher names
I then tried to look for the indexes of specific names in the character vector which I knew were inside the vector, and the which() function returned integer(0).
# Search for pitcher name in list of pitchers = Returns integer(0)!
which(mlbpitchers=="Chad Bettis")
integer(0)
# But, mlbpitchers CLEARLY has Chad Bettis inside of it.
mlbpitchers[26]
[1] "Chad Bettis"
I'm so confused as to why the which() function isn't identifying the name, and I would really appreciate anyone's help. I know it's probably something really stupid and easy, but I can't figure it out! Thank you!
(Note: Upon removing the "Â" character, I was asked to choose the encoding for saving. I chose the system default: ISO 8859-1. I'm not sure if this could play a role in the problem.)
It's an encoding problem. In particular, if you look at
R> substr(mlbpitchers[26], 1, 4) == "Chad"
[1] TRUE
R> substr(mlbpitchers[26], 5, 5) == " "
[1] FALSE
As Joran suggests, using
R> rawToChar(charToRaw(mlbpitchers[26]),multiple = TRUE)
[1] "C" "h" "a" "d" "\xc2" "\xa0" "B" "e" "t" "t"
[11] "i" "s"
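also highlights the problem. These characters (thanks Nicola) are HTML non-breaking spaces, so the name does not contain an ordinary space and the comparison with "Chad Bettis" fails. To replace them with regular spaces use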
gsub("\xc2\xa0"," ",mlbpitchers)
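The problem can be reproduced without scraping anything. The sketch below builds a small made-up vector in which one name contains a non-breaking space (written here as the Unicode escape \u00a0, which is an assumption about how you might prefer to spell the character), then cleans it.

```r
# Hypothetical stand-in for the scraped vector; "Other Name" is a placeholder
pitchers <- c("Chad\u00a0Bettis", "Other Name")

# Looks identical when printed, but the comparison fails
idx_before <- which(pitchers == "Chad Bettis")

# The fifth character is code point 160 (non-breaking space), not 32 (space)
fifth_char <- utf8ToInt(pitchers[1])[5]

# Replace non-breaking spaces with ordinary spaces, then search again
cleaned <- gsub("\u00a0", " ", pitchers, fixed = TRUE)
idx_after <- which(cleaned == "Chad Bettis")

print(idx_before)
print(fifth_char)
print(idx_after)
```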

Extract text in parentheses in R

Two related questions. I have vectors of text data such as
"a(b)jk(p)" "ipq" "e(ijkl)"
and want to easily separate it into a vector containing the text OUTSIDE the parentheses:
"ajk" "ipq" "e"
and a vector containing the text INSIDE the parentheses:
"bp" "" "ijkl"
Is there any easy way to do this? An added difficulty is that these can get quite large and have a large (unlimited) number of parentheses. Thus, I can't simply grab text "pre/post" the parentheses and need a smarter solution.
Text outside the parentheses
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("\\([^()]*\\)", "", x)
[1] "ajk" "ipq" "e"
Text inside the parentheses
> x <- c("a(b)jk(p)" ,"ipq" , "e(ijkl)")
> gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", x, perl=T)
[1] "bp" "" "ijkl"
The (?<=\\()[^()]*(?=\\)) part matches all the characters inside the brackets, and the (*SKIP)(*F) that follows forces that match to fail and tells the engine to skip past it. The engine then tries the alternative after the | symbol against the remaining characters, so the dot . matches every character that was not skipped, including the brackets themselves. Replacing all the matched characters with an empty string leaves only the text inside the brackets.
> gsub("\\(([^()]*)\\)|.", "\\1", x, perl=T)
[1] "bp" "" "ijkl"
This regex captures the characters inside the brackets in group 1 and, through the |. alternative, matches every other character one at a time. Replacing each match with the contents of group 1 (which is empty whenever the dot matched) gives the desired output.
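Another base R option, offered here only as a sketch and not taken from the answers above, is to extract the bracketed pieces with gregexpr() and regmatches() and then paste them together. This keeps the individual pieces available as a list if you need them.

```r
x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

# Text outside the parentheses: delete every "(...)" group
outside <- gsub("\\([^()]*\\)", "", x)

# Text inside the parentheses: one character vector of pieces per element
pieces <- regmatches(x, gregexpr("(?<=\\()[^()]*(?=\\))", x, perl = TRUE))

# Collapse the pieces; elements without parentheses become ""
inside <- vapply(pieces, paste, character(1), collapse = "")

print(outside)
print(inside)
```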
The rm_round function in the qdapRegex package I maintain was born to do this:
First we'll get and load the package via pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdapRegex)
## Then we can use it to remove and extract the parts you want:
x <-c("a(b)jk(p)", "ipq", "e(ijkl)")
rm_round(x)
## [1] "ajk" "ipq" "e"
rm_round(x, extract=TRUE)
## [[1]]
## [1] "b" "p"
##
## [[2]]
## [1] NA
##
## [[3]]
## [1] "ijkl"
To condense b and p use:
sapply(rm_round(x, extract=TRUE), paste, collapse="")
## [1] "bp" "NA" "ijkl"

Fix the order of strings that have both letter and number components

I have string data like below.
a <- c("53H", "H26","14M","M47")
##"53H" "H26" "14M" "M47"
I want to fix the numbers and letters in a certain order such that
the numbers go first and the letters go second, or the other way around.
How can I do it?
##"53H" "26H" "14M" "47M"
or
##"H53" "H26" "M14" "M47"
You can extract the numbers and letters separately with gsub, then use paste0
to put them in any order you like.
a <- c("53H", "H26","14M","M47")
( nums <- gsub("[^0-9]", "", a) ) ## extract numbers
# [1] "53" "26" "14" "47"
( lets <- gsub("[^A-Z]", "", a) ) ## extract letters
# [1] "H" "H" "M" "M"
Numbers first answer:
paste0(nums, lets)
# [1] "53H" "26H" "14M" "47M"
Letters first answer:
paste0(lets, nums)
# [1] "H53" "H26" "M14" "M47"
You can capture the relevant parts in groups using () and then backreference them using gsub:
a <- c("53H", "H26","14M","M47")
gsub("^([0-9]+)([A-Z]+)$", "\\2\\1", a)
# [1] "H53" "H26" "M14" "M47"
This is like saying: "Find a group of numbers at the start of my string and capture them in a group (^([0-9]+)). Then find the group of letters that runs to the end of my string and capture them in a second group (([A-Z]+)$). That's my search pattern. Next, replace it such that the second group (referred to by \\2) is returned first and the first group (referred to by \\1) is returned second." Strings that already start with letters do not match the pattern and are left unchanged.
Building on Ananda Mahto's answer, you can put the numbers first and the letters second using the following code:
gsub("^([A-Z]+)([0-9]+)$", "\\2\\1", a)
because here you capture the strings that start with letters (^([A-Z]+)), then capture the group of numbers (([0-9]+)$), and swap the two groups.
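Putting the two backreference patterns side by side, as a small sketch on the sample vector from the question, shows that each one only rewrites the strings that are in the "wrong" order and leaves the rest alone:

```r
a <- c("53H", "H26", "14M", "M47")

# Number-then-letter strings are swapped; letter-first strings do not match
letters_first <- sub("^([0-9]+)([A-Z]+)$", "\\2\\1", a)

# Letter-then-number strings are swapped; number-first strings do not match
numbers_first <- sub("^([A-Z]+)([0-9]+)$", "\\2\\1", a)

print(letters_first)
print(numbers_first)
```

sub() is enough here because each anchored pattern can match a string at most once.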

How to split a string in r by a delimiter and discard the last two items?

I have a string separated by _ and I want to get rid of the last two elements. For example, from A_B_C_D I want to return A_B, and from A_B_C_D_E I want A_B_C. I have tried str_split_fixed from stringr:
library(stringr)
my_string <- "A_B_C_D"
x <- str_split_fixed(my_string,"_",3)
but it returns "A" "B" "C_D" instead of "A_B" "C" "D", otherwise I could have done head(x,-2) to get A_B
Is there a better way than
paste(head(unlist(strsplit(my_string,"_")),-2),collapse="_")
How about using a regex:
sub('(_[A-Z]){2}$', '', 'A_B_C_D')
Here the number 2 is the number of trailing elements you want to drop. Note that [A-Z] assumes each element is a single uppercase letter.
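If the elements can be longer than one letter, the same idea can be generalized. The helper below is a hypothetical sketch (drop_last is not an existing function), and it assumes the separator is a single character with no special meaning inside a regex character class, such as "_".

```r
# Hypothetical helper: remove the last n separator-delimited elements
drop_last <- function(s, n = 2, sep = "_") {
  # Builds a pattern such as "(_[^_]*){2}$": n groups of
  # "separator followed by non-separator characters" at the end of the string
  pattern <- sprintf("(%s[^%s]*){%d}$", sep, sep, n)
  sub(pattern, "", s)
}

res1 <- drop_last("A_B_C_D")
res2 <- drop_last("A_B_C_D_E")
res3 <- drop_last("foo_bar_baz_qux", n = 1)
print(c(res1, res2, res3))
```

Strings with fewer than n separators do not match the pattern and are returned unchanged.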
