Regex in R - Extracting two letters between spaces - r

I am trying to extract the two letters between two spaces -
AAPL US Equity
1836 JP Equity
APPLE SOMETHING NOT
C US Equity
Result -
US
JP
US
What I tried was gsub("\\s[A-Z]{2}\\s", "\\1", vec) but that gives me -
AAPLEquity
1836Equity
APPLE SOMETHING NOT
CEquity
which seems the exact opposite of what I want.

We can use sub
out <- rep("", length(vec))
i1 <- grepl("\\b[A-Z]{2}\\b", vec)
out[i1] <- sub(".*\\s+([A-Z]{2})\\s+.*", "\\1", vec[i1])
out
#[1] "US" "JP" "" "US"
Or use str_extract from stringr to extract two uppercase characters that come right after a whitespace character (specified by the regex lookbehind) and are followed by a word boundary (\\b)
str_extract(vec, "(?<=\\s)([A-Z]{2})\\b")
#[1] "US" "JP" NA "US"
NOTE: Not copied syntax from others' answer
data
vec <- c("AAPL US Equity", "1836 JP Equity", "APPLE SOMETHING NOT", "C US Equity")

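The gsub command replaces every part of the text matched by the regular expression with the replacement. \s[A-Z]{2}\s matches a whitespace character, 2 uppercase ASCII letters and another whitespace character. Since the pattern has no capturing group, \\1 is empty, so these matches are simply removed from the character vector.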
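To make that concrete, here is a minimal sketch using the vec from the question. It compares the original call with a variant that adds a capturing group (the variant is only an illustration, not a solution): with a group, \\1 keeps the two letters, but everything around the match is still left in place, which is why gsub on its own does not extract anything.

```r
vec <- c("AAPL US Equity", "1836 JP Equity", "APPLE SOMETHING NOT", "C US Equity")

# No capturing group: \\1 is empty, so the whole match is deleted
no_group <- gsub("\\s[A-Z]{2}\\s", "\\1", vec)

# With a capturing group: only the two surrounding whitespace characters are dropped
with_group <- gsub("\\s([A-Z]{2})\\s", "\\1", vec)

print(no_group)
print(with_group)
# [1] "AAPLUSEquity"        "1836JPEquity"        "APPLE SOMETHING NOT" "CUSEquity"
```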
You may use
x <- c('AAPL US Equity','1836 JP Equity','APPLE SOMETHING NOT','C US Equity')
sub(".*\\s+([A-Z]{2})\\s.*|.*", "\\1", x)
# => [1] "US" "JP" "" "US"
Here, the .*\\s+([A-Z]{2})\\s.* alternative matches those inputs that have a two-letter "word" between whitespaces and puts the word into Group 1 (\1), while the .* alternative matches all other inputs with Group 1 left empty, so sub returns an empty string for them.
Or, you may use
library(stringr)
str_extract(x, "(?<=\\s)[A-Z]{2}(?=\\s)")
# => [1] "US" "JP" NA "US"
Here, (?<=\\s)[A-Z]{2}(?=\\s) matches the first two-letter uppercase word that sits between whitespace characters, and str_extract extracts it.
If the words can be at the start/end of the string, use
str_extract(x, "(?<!\\S)[A-Z]{2}(?!\\S)")
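The difference between the two lookaround variants can be checked with a small base R sketch. The helper extract_first and the vector y are hypothetical names introduced here for illustration; the helper is only a rough stand-in for str_extract built on regexpr and regmatches with perl = TRUE.

```r
# Hypothetical helper: first match per element, NA when there is none
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# Made-up inputs with the two-letter word at the start or at the end
y <- c("US Equity AAPL", "Equity AAPL JP", "APPLE SOMETHING NOT")

# Requires whitespace on both sides, so words at the string edges are missed
strict <- extract_first(y, "(?<=\\s)[A-Z]{2}(?=\\s)")

# Only forbids a non-whitespace neighbour, so string edges are accepted
edges_ok <- extract_first(y, "(?<!\\S)[A-Z]{2}(?!\\S)")

print(strict)    # NA NA NA
print(edges_ok)  # "US" "JP" NA
```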

Related

R: using \\b and \\B in regex

I read about regex and came across word boundaries. I found a question that is about the difference between \b and \B. Using the code from this question does not give the expected output. Here:
grep("\\bcat\\b", "The cat scattered his food all over the room.", value= TRUE)
# I expect "cat" but it returns the whole string.
grep("\\B-\\B", "Please enter the nine-digit id as it appears on your color - coded pass-key.", value= TRUE)
# I expect "-" but it returns the whole string.
I use the code as described in the question but with two backslashes as suggested here. Using one backslash does not work either. What am I doing wrong?
You can use regexpr and regmatches to get the match. grep only tells you which elements contain a match (with value = TRUE it returns those elements whole). You can also use sub.
x <- "The cat scattered his food all over the room."
regmatches(x, regexpr("\\bcat\\b", x))
#[1] "cat"
sub(".*(\\bcat\\b).*", "\\1", x)
#[1] "cat"
x <- "Please enter the nine-digit id as it appears on your color - coded pass-key."
regmatches(x, regexpr("\\B-\\B", x))
#[1] "-"
sub(".*(\\B-\\B).*", "\\1", x)
#[1] "-"
For more than 1 match use gregexpr:
x <- "1abc2"
regmatches(x, gregexpr("[0-9]", x))
#[[1]]
#[1] "1" "2"
grep returns the whole string because it just looks to see if the match is present in the string. If you want to extract cat, you need to use other functions such as str_extract from package stringr:
str_extract("The cat scattered his food all over the room.", "\\bcat\\b")
[1] "cat"
The difference between \b and \B is that \b marks word boundaries whereas \B is its negation. That is, \\bcat\\b matches only if cat stands as a whole word (bounded by non-word characters such as whitespace or punctuation, or by the start or end of the string), whereas \\Bcat\\B matches only if cat is inside a word. For example:
str_extract_all("The forgot his education and scattered his food all over the room.", "\\Bcat\\B")
[[1]]
[1] "cat" "cat"
These two matches are from education and scattered.
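The same distinction can be checked in base R without stringr. The sketch below uses a made-up vector words and grepl with perl = TRUE; it is only meant to show which inputs each pattern accepts.

```r
# Made-up test inputs
words <- c("cat", "a cat.", "(cat)", "scatter", "cats")

# \\b: "cat" must be a whole word; punctuation and string ends count as boundaries
whole_word <- grepl("\\bcat\\b", words, perl = TRUE)

# \\B: "cat" must have word characters on both sides
inside_word <- grepl("\\Bcat\\B", words, perl = TRUE)

print(data.frame(words, whole_word, inside_word))
```

Only "scatter" satisfies \\Bcat\\B here; "cats" fails both patterns because cat starts at a boundary but does not end at one.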

Regex with Exclusion in R

I have a regex .*(?<=code=)(.*?)(;|$).* which allows me to pull a specific pattern ("code") out of a list. But whenever the intended pattern ("code") is not present in a particular row, the entire row shows up in the result.
Dataset:
rev=63;code=ATL;qty=1;zip=45987
rev=10.60|34;qty=1|2;zip=12686|12694;code=NY
rev=12;qty=7;zip=71565
rev=1.6|4;qty=4|2;zip=4548|464;code=KT
rev=8;qty=1;zip=74268
rev=3|24|8;qty=1|6|3;code=TPA;zip=33684|36842|30254
Current Output (With substitution \1):
ATL
NY
rev=12;qty=7;zip=71565
KT
rev=8;qty=1;zip=74268
TPA
Intended Output:
ATL
NY
KT
TPA
You may extract the data using stringr::str_extract:
x <- c("rev=63;code=ATL;qty=1;zip=45987","rev=10.60|34;qty=1|2;zip=12686|12694;code=NY","rev=12;qty=7;zip=71565","rev=1.6|4;qty=4|2;zip=4548|464;code=KT","rev=8;qty=1;zip=74268","rev=3|24|8;qty=1|6|3;code=TPA;zip=33684|36842|30254")
library(stringr)
str_extract(x, "(?<=\\bcode=)[^;]+")
# => [1] "ATL" "NY" NA "KT" NA "TPA"
If you do not want NA and want empty items use
matches <- str_extract(x, "(?<=\\bcode=)[^;]+")
matches[is.na(matches)] <- ""
matches
## => [1] "ATL" "NY" "" "KT" "" "TPA"
The pattern matches:
(?<=\bcode=) - a positive lookbehind matching a location immediately preceded with a whole word code and a =
[^;]+ - a negated character class that matches and consumes (i.e. adds to the output and moves the regex index) 1 or more chars other than ;.
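If only the rows that actually contain a code are wanted (the "Intended Output" in the question), a base R sketch with regexpr and regmatches may be enough, since regmatches drops the elements without a match. This assumes the same x as above and needs perl = TRUE for the lookbehind.

```r
x <- c("rev=63;code=ATL;qty=1;zip=45987",
       "rev=10.60|34;qty=1|2;zip=12686|12694;code=NY",
       "rev=12;qty=7;zip=71565",
       "rev=1.6|4;qty=4|2;zip=4548|464;code=KT",
       "rev=8;qty=1;zip=74268",
       "rev=3|24|8;qty=1|6|3;code=TPA;zip=33684|36842|30254")

# Position of the first match per element (-1 where there is none)
m <- regexpr("(?<=\\bcode=)[^;]+", x, perl = TRUE)

# regmatches keeps only the elements that matched
codes <- regmatches(x, m)
print(codes)
# [1] "ATL" "NY"  "KT"  "TPA"
```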

(In R) How to split words by title case in a string like "WeLiveInCA" into "We Live In CA" while preserving abbreviations?

(In R) How to split words by title case in a string like "WeLiveInCA" into "We Live In CA" without splitting abbreviations?
I know how to split the string at every uppercase letter, but doing that would split initialisms/abbreviations, like CA or USSR or even U.S.A. and I need to preserve those.
So I'm thinking of some type of logic like: if a word in a string isn't an initialism, then split the word with a space wherever a lowercase character is followed by an uppercase character.
My snippet of code below splits words with spaces at capital letters, but it undesirably breaks initialisms: CA becomes C A.
s <- "WeLiveInCA"
trimws(gsub('([[:upper:]])', ' \\1', s))
# "We Live In C A"
or another example...
s <- c("IDon'tEatKittensFYI", "YouKnowYourABCs")
trimws(gsub('([[:upper:]])', ' \\1', s))
# "I Don't Eat Kittens F Y I" "You Know Your A B Cs"
The results I'd want would be:
"We Live In CA"
#
"I Don't Eat Kittens FYI" "You Know Your ABCs"
But this needs to be widely applicable (not just for my example)
Try with base R gregexpr/regmatches.
s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")
regmatches(s, gregexpr('[[:upper:]]+[^[:upper:]]*', s))
#[[1]]
#[1] "We" "Live" "In" "CA"
#
#[[2]]
#[1] "IDon't" "Eat" "Kittens" "FYI"
#
#[[3]]
#[1] "You" "Know" "Your" "ABCs"
Explanation.
[[:upper:]]+ matches one or more upper case letters;
[^[:upper:]]* matches zero or more occurrences of anything but upper case letters.
In sequence these two regular expressions match words starting with upper case letter(s) followed by something else.
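To get back single strings like the ones asked for, the pieces can be pasted together again. This is a minimal sketch on top of the call above; the names pieces and joined are introduced here for illustration.

```r
s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")

# One character vector of "words" per input string
pieces <- regmatches(s, gregexpr("[[:upper:]]+[^[:upper:]]*", s))

# Collapse each vector of pieces into one space-separated string
joined <- vapply(pieces, paste, character(1), collapse = " ")
print(joined)
# [1] "We Live In CA"          "IDon't Eat Kittens FYI" "You Know Your ABCs"
```

Note that "IDon't" stays in one piece: "ID" is a run of uppercase letters, so the pattern cannot tell it apart from an initialism such as "CA".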

match substring from another list of all possible substrings

I have a long vector of strings containing a market name and other stuff
S = c('123_GOLD_534', '531_SILVER_dfds', '93_COPPER_29dad', '452_GOLD_deww')
and another vector contains all the possible markets
V = c('GOLD','SILVER')
How can I extract the market name bit from S? Basically I want to loop over V and S, replace S[j] with V[i] if grepl(V[i], S[j]).
So the result should look like
c('GOLD','SILVER',NA,'GOLD')
You may use str_extract from stringr:
> library(stringr)
> str_extract(S, paste(V, collapse="|"))
[1] "GOLD" "SILVER" NA "GOLD"
The paste(V, collapse="|") will create a regex like GOLD|SILVER and will thus extract GOLD or SILVER. If the regex does not match, it will just return NA.
Note that if you need to match GOLD or SILVER only when enclosed with _ symbols, replace paste(V, collapse="|") with paste0("(?<=_)(?:", paste(V, collapse="|"), ")(?=_)"):
> str_extract(S, paste0("(?<=_)(?:", paste(V, collapse="|"), ")(?=_)"))
[1] "GOLD" "SILVER" NA "GOLD"
It will create a regex like (?<=_)(?:GOLD|SILVER)(?=_) and will only match GOLD or SILVER if there is a _ in front ((?<=_), a positive lookbehind) and if there is a _ after the value (due to the (?=_) positive lookahead). Lookaheads do not add matched text to the match (they are non-consuming).
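The effect of the lookarounds only shows up on input where a market name appears without the surrounding underscores. The sketch below adds such a made-up element ("GOLDEN_GATE_1") to S and uses a hypothetical base R helper, extract_first, as a rough stand-in for str_extract.

```r
# Hypothetical helper: first match per element, NA when there is none
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# S from the question plus one made-up element for illustration
S <- c("123_GOLD_534", "531_SILVER_dfds", "93_COPPER_29dad", "452_GOLD_deww",
       "GOLDEN_GATE_1")
V <- c("GOLD", "SILVER")

plain <- paste(V, collapse = "|")                   # GOLD|SILVER
enclosed <- paste0("(?<=_)(?:", plain, ")(?=_)")    # (?<=_)(?:GOLD|SILVER)(?=_)

res_plain <- extract_first(S, plain)
res_enclosed <- extract_first(S, enclosed)

print(res_plain)     # "GOLD" "SILVER" NA "GOLD" "GOLD"
print(res_enclosed)  # "GOLD" "SILVER" NA "GOLD" NA
```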

Extract "words" from a string

I have a table with 153 rows by 9 columns. My interest is the character string in the first column, I want to extract the fourth word and create a new list from this fourth word, this list will be 153 rows, 1 column.
An example of the first two rows of column 1 of this database table:
[1] Resistance_Test DevID (Ohms) 428
[2] Diode_Test SUBLo (V) 353
"Words" are separated by spaces, so the fourth word of the first row is "428" and the fourth word of the second row is "353". How can I create a new list containing the fourth word of all 153 rows?
Use gsub() with a regular expression
x <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
ptn <- "(.*? ){3}"
gsub(ptn, "", x)
[1] "428" "353"
This works because the regular expression (.*? ){3} finds exactly three ({3}) sets of characters followed by a space (.*? ), and gsub then replaces this with an empty string.
See ?gsub and ?regexp for more information.
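One thing to keep in mind with this approach: gsub replaces every match, so on a longer line the three-word group can match more than once. The sketch below uses a made-up string y and perl = TRUE (an assumption made here to get well-defined lazy matching) to compare gsub with sub, which replaces only the first match.

```r
# Made-up input with seven space-separated "words"
y <- "a b c d e f g"

# sub removes only the first group of three words
first_only <- sub("(.*? ){3}", "", y, perl = TRUE)

# gsub removes "a b c " and then "d e f " as well
all_groups <- gsub("(.*? ){3}", "", y, perl = TRUE)

print(first_only)  # "d e f g"
print(all_groups)  # "g"
```

For the four-word rows in the question both calls give the same result, so this only matters when a row has seven or more words.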
If your data has structure that you don't mention in your question, then possibly the regular expression becomes even easier.
For example, if you are always interested in the last word of each line:
ptn <- "(.*? )"
gsub(ptn, "", x)
Or perhaps you know for sure you can only search for digits and discard everything else:
ptn <- "\\D"
gsub(ptn, "", x)
You could use word() from the stringr package:
> x <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
> library(stringr)
> word(string = x, start = 4, end = 4)
[1] "428" "353"
Specifying the position of both the start and end words to be the same, you will always get the fourth word.
I hope this helps.
We can use sub. We match one or more non-whitespace characters (\\S+) followed by one or more whitespace characters (\\s+), repeated 3 times ({3}), followed by a word that is captured in a group ((\\w+)), followed by zero or more characters (.*). We replace the match with the second backreference.
sub("(\\S+\\s+){3}(\\w+).*", "\\2", str1)
#[1] "428" "353"
This selects by the nth word, so
sub("(\\S+\\s+){3}(\\w+).*", "\\2", str2)
#[1] "428" "353" "428"
Another option is stri_extract
library(stringi)
stri_extract_last_regex(str1, "\\w+")
#[1] "428" "353"
data
str1 <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
str2 <- c(str1, "Resistance_Test DevID (Ohms) 428 something else")
If you are not familiar with regular expressions, the function strsplit can help you :
data <- c('Resistance_Test DevID (Ohms) 428', 'Diode_Test SUBLo (V) 353')
unlist(lapply(strsplit(data, ' '), function(x) x[4]))
[1] "428" "353"
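A slightly more defensive variant of the same idea is sketched below. The function name fourth_word and the extra test rows are made up for illustration; splitting on "\\s+" instead of a single space avoids empty pieces when words are separated by several spaces, and rows with fewer than four words yield NA.

```r
# Hypothetical helper: n-th whitespace-separated word, NA if the row is too short
fourth_word <- function(x, n = 4) {
  parts <- strsplit(x, "\\s+")
  vapply(parts, function(p) p[n], character(1))
}

rows <- c("Resistance_Test DevID (Ohms) 428",
          "Diode_Test  SUBLo (V) 353",   # two spaces after the first word
          "too short")

result <- fourth_word(rows)
print(result)
# [1] "428" "353" NA
```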
