match substring from another list of all possible substrings

I have a long vector of strings containing a market name and other stuff
S = c('123_GOLD_534', '531_SILVER_dfds', '93_COPPER_29dad', '452_GOLD_deww')
and another vector contains all the possible markets
V = c('GOLD','SILVER')
How can I extract the market name bit from S? Basically I want to loop over V and S, replace S[j] with V[i] if grepl(V[i], S[j]).
So the result should look like
c('GOLD','SILVER',NA,'GOLD')

You may use str_extract from stringr:
> library(stringr)
> str_extract(S, paste(V, collapse="|"))
[1] "GOLD" "SILVER" NA "GOLD"
The paste(V, collapse="|") will create a regex like GOLD|SILVER and will thus extract GOLD or SILVER. If the regex does not match, it will just return NA.
Note that if you need to match GOLD or SILVER only when enclosed with _ symbols, replace paste(V, collapse="|") with paste0("(?<=_)(?:", paste(V, collapse="|"), ")(?=_)"):
> str_extract(S, paste0("(?<=_)(?:", paste(V, collapse="|"), ")(?=_)"))
[1] "GOLD" "SILVER" NA "GOLD"
It will create a regex like (?<=_)(?:GOLD|SILVER)(?=_) and will only match GOLD or SILVER if there is a _ immediately before it (the (?<=_) positive lookbehind) and a _ immediately after it (the (?=_) positive lookahead). Lookarounds are non-consuming: the underscores they check are not added to the match.
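For readers who would rather avoid the stringr dependency, the same extraction can be sketched in base R. This is only an illustrative sketch, not part of the original answer: the names pattern, m and result are made up here, and it assumes the entries of V contain no regex metacharacters.

```r
S <- c("123_GOLD_534", "531_SILVER_dfds", "93_COPPER_29dad", "452_GOLD_deww")
V <- c("GOLD", "SILVER")

# Build (?<=_)(?:GOLD|SILVER)(?=_); lookarounds need perl = TRUE in base R.
pattern <- paste0("(?<=_)(?:", paste(V, collapse = "|"), ")(?=_)")

# regexpr() returns -1 for elements without a match.
m <- regexpr(pattern, S, perl = TRUE)

# Start from all-NA and fill in only the elements that matched,
# because regmatches() silently drops the non-matching elements.
result <- rep(NA_character_, length(S))
result[m != -1] <- regmatches(S, m)
result
# [1] "GOLD"   "SILVER" NA       "GOLD"
```

Pre-filling with NA keeps the result aligned with S, which is what str_extract does automatically.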

Related

R how to match and extract character letters of different length in a string

So I have a column of contract names df$name like below
FB210618C00280000
ADM210618C00280000
M210618P00280000
I would like to extract the FB, ADM and M. That is, I want the leading letters, which vary in length, up to the first digit, and I don't want to extract the C or P.
The below code will give me the C or P
stri_extract_all_regex(df$name, "[a-z]+")

We can use stri_extract_first from stringi
library(stringi)
stri_extract_first(df$name, regex = "[A-Z]+")
#[1] "FB" "ADM" "M"
Or we can use base R with sub
sub("\\d+.*", "", df$name)
#[1] "FB" "ADM" "M"
Or use trimws from base R
trimws(df$name, whitespace = "\\d+.*")
data
df <- data.frame(name = c("FB210618C00280000", "ADM210618C00280000",
"M210618P00280000"))

You can use
library(stringr)
str_extract(df$name, "^[A-Za-z]+")
# Or
str_extract(df$name, "^\\p{L}+")
The stringr::str_extract function extracts the first occurrence of a pattern, and the ^[A-Za-z]+ / ^\p{L}+ regex matches one or more letters at the start of the string. Note that \p{L} matches any Unicode letter, while [A-Za-z] matches ASCII letters only.
See the regex demo.
The same pattern can be used with stringi::stri_extract_first():
library(stringi)
stri_extract_first(df$name, regex="^[A-Za-z]+")
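The anchored-extract and delete-from-first-digit approaches agree on the sample data but differ when a string has no leading letters. The base R sketch below illustrates this under stated assumptions: the fourth entry of name is a made-up value added for illustration, and the names m, ticker and stripped are hypothetical.

```r
# Contract names from the question, plus one made-up entry
# without a leading ticker (an assumption for illustration).
name <- c("FB210618C00280000", "ADM210618C00280000",
          "M210618P00280000", "210618C00280000")

# Anchored match: one or more ASCII letters at the start of the string.
m <- regexpr("^[A-Za-z]+", name)
ticker <- rep(NA_character_, length(name))
ticker[m != -1] <- regmatches(name, m)
ticker
# [1] "FB"  "ADM" "M"   NA

# Deleting everything from the first digit onward gives "" instead of NA.
stripped <- sub("\\d+.*", "", name)
stripped
# [1] "FB"  "ADM" "M"   ""
```

Which behaviour is preferable depends on whether a missing ticker should be flagged as NA or treated as an empty string.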

How to extract words containing combinations of certain characters in R

In this sample text:
turns <- tolower(c("Does him good to stir him up now and again .",
"When , when I see him he w's on the settees .",
"Yes it 's been eery for a long time .",
"blissful timing , indeed it was "))
I'd like to extract all words that contain both the letters y and e, in any position or order, namely yes and eery, using str_extract from stringr.
This regex, which requires y to occur immediately before e, unsurprisingly matches only yes but not eery:
unlist(str_extract_all(turns, "\\b([a-z]+)?ye([a-z]+)?\\b"))
[1] "yes"
Putting y and e into a character class doesn't give the desired result either, because every word containing either y or e is matched:
unlist(str_extract_all(turns, "\\b([a-z]+)?[ye]([a-z]+)?\\b"))
[1] "does" "when" "when" "see" "he" "the" "settees" "yes" "been" "eery" "time" "indeed"
So what is the right solution?

You may use both base R and stringr approaches:
stringr::str_extract_all(turns, "\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE))
Or, without turning the strings to lower case, you may use a case insensitive matching with (?i):
stringr::str_extract_all(turns, "(?i)\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE, ignore.case=TRUE))
See the regex demo and the R demo. Also, if you want to make it a tiny bit more efficient, you may use the principle of contrast in the lookahead patterns: match any letters but y in the first lookahead and any letters but e in the second, using character class subtraction:
stringr::str_extract_all(turns, "(?i)\\b(?=[\\p{L}--[y]]*y)(?=[\\p{L}--[e]]*e)\\p{L}+\\b")
Details
(?i) - case insensitive modifier
\b - word boundary
(?=\p{L}*y) - after 0 or more Unicode letters, there must be y ([\p{L}--[y]]* matches any 0 or more letters but y up to the first y)
(?=\p{L}*e) - after 0 or more Unicode letters, there must be e ([\p{L}--[e]]* matches any 0 or more letters but e up to the first e)
\p{L}+ - 1 or more Unicode letters
\b - word boundary
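The one-lookahead-per-letter idea described above extends to any set of required letters. The following base R sketch builds the pattern programmatically; the names required, lookaheads, pattern and words are illustrative and not from the original answer, and it assumes the required entries are plain letters.

```r
turns <- tolower(c("Does him good to stir him up now and again .",
                   "When , when I see him he w's on the settees .",
                   "Yes it 's been eery for a long time .",
                   "blissful timing , indeed it was "))

# Letters that every extracted word must contain (hypothetical input).
required <- c("y", "e")

# One positive lookahead per letter: (?=\p{L}*y)(?=\p{L}*e)
lookaheads <- paste0("(?=\\p{L}*", required, ")", collapse = "")
pattern <- paste0("\\b", lookaheads, "\\p{L}+\\b")

# Lookaheads and \p{L} require the PCRE engine, hence perl = TRUE.
words <- unlist(regmatches(turns, gregexpr(pattern, turns, perl = TRUE)))
words
# [1] "yes"  "eery"
```

Adding a third letter to required would simply append a third lookahead, so the word would have to contain all three.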

In case there is no urgent need to use stringr::str_extract you can get words containing the letters y and e in base with strsplit and grepl like:
tt <- unlist(strsplit(turns, " "))
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes" "eery"
If letters can also appear inside tokens that are not plain words (for example 012y345e), extract the purely alphabetic words first:
turns <- c("yes no ay ae 012y345e year.")
tt <- regmatches(turns, gregexpr("\\b[[:alpha:]]+\\b", turns))[[1]]
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes" "year"

Regex in R - Extracting two letters between spaces

I am trying to extract the two letters between two spaces -
AAPL US Equity
1836 JP Equity
APPLE SOMETHING NOT
C US Equity
Result -
US
JP
US
What I tried was gsub("\\s[A-Z]{2}\\s", "\\1", vec) but that gives me -
AAPLEquity
1836Equity
APPLE SOMETHING NOT
CEquity
which seems the exact opposite of what I want.

We can use sub
out <- rep("", length(vec))
i1 <- grepl("\\b[A-Z]{2}\\b", vec)
out[i1] <- sub(".*\\s+([A-Z]{2})\\s+.*", "\\1", vec[i1])
out
#[1] "US" "JP" "" "US"
Or use str_extract to extract two upper-case letters that are preceded by whitespace (checked with a regex lookbehind) and followed by a word boundary (\\b):
str_extract(vec, "(?<=\\s)([A-Z]{2})\\b")
#[1] "US" "JP" NA "US"
data
vec <- c("AAPL US Equity", "1836 JP Equity", "APPLE SOMETHING NOT", "C US Equity")

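In your gsub call the pattern has no capturing group, so the \\1 backreference in the replacement is empty and every match is simply removed. \s[A-Z]{2}\s matches a whitespace character, two uppercase ASCII letters and another whitespace character, which is why exactly that part disappears from each string.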
You may use
x <- c('AAPL US Equity','1836 JP Equity','APPLE SOMETHING NOT','C US Equity')
sub(".*\\s+([A-Z]{2})\\s.*|.*", "\\1", x)
# => [1] "US" "JP" "" "US"
Here, the .*\\s+([A-Z]{2})\\s.* alternative matches those inputs that have a two-letter "word" between whitespaces and puts the word into Group 1 (\1), while the .* alternative matches all other inputs; Group 1 is then empty, so sub returns an empty string for them.
Or, you may use
library(stringr)
str_extract(x, "(?<=\\s)[A-Z]{2}(?=\\s)")
# => [1] "US" "JP" NA "US"
Here, (?<=\\s)[A-Z]{2}(?=\\s) matches, and str_extract extracts, the first two-letter uppercase word that has whitespace on both sides.
If the word can be at the start or end of the string, use
str_extract(x, "(?<!\\S)[A-Z]{2}(?!\\S)")
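The difference between the two lookaround variants only shows up when the two-letter code sits at the start or end of the string. Here is a base R sketch that compares them; the input vector and the helper extract_first are made up for illustration and are not part of the original answer.

```r
# Made-up inputs: code in the middle, at the start, at the end, and absent.
x <- c("AAPL US Equity", "US Equity", "Equity JP", "APPLE SOMETHING NOT")

# Hypothetical helper: first match per element, NA where nothing matches.
extract_first <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)
  out
}

# Requires a whitespace character on both sides.
strict <- extract_first("(?<=\\s)[A-Z]{2}(?=\\s)", x)
strict
# [1] "US" NA   NA   NA

# Only forbids a non-whitespace neighbour, so string edges are accepted.
lenient <- extract_first("(?<!\\S)[A-Z]{2}(?!\\S)", x)
lenient
# [1] "US" "US" "JP" NA
```

The negative lookarounds succeed at the string boundaries because there is no character there at all, whereas the positive ones need an actual whitespace character to be present.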

Extract "words" from a string

I have a table with 153 rows and 9 columns. I am interested in the character string in the first column: I want to extract the fourth word of each string and build a new list from it, with 153 rows and 1 column.
An example of the first two rows of column 1 of this database table:
[1] Resistance_Test DevID (Ohms) 428
[2] Diode_Test SUBLo (V) 353
"Words" are separated by spaces, so the fourth word of the first row is "428" and the fourth word of the second row is "353". How can I create a new list containing the fourth word of all 153 rows?

Use gsub() with a regular expression
x <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
ptn <- "(.*? ){3}"
gsub(ptn, "", x)
[1] "428" "353"
This works because the regular expression (.*? ){3} matches exactly three ({3}) runs of characters, each followed by a space (.*? ), and gsub replaces that match with an empty string.
See ?gsub and ?regexp for more information.
If your data has structure that you don't mention in your question, then possibly the regular expression becomes even easier.
For example, if you are always interested in the last word of each line:
ptn <- "(.*? )"
gsub(ptn, "", x)
Or perhaps you know for sure you can only search for digits and discard everything else:
ptn <- "\\D"
gsub(ptn, "", x)
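One caveat about the (.*? ){3} pattern is that it only yields the fourth word when that word is also the last one. The sketch below contrasts it with an anchored sub that captures the fourth word explicitly. It is an illustration under assumptions: the second input string is borrowed from another answer's test data, perl = TRUE is added here for predictable lazy-quantifier behaviour, and the names removed and fourth are hypothetical.

```r
x <- c("Resistance_Test DevID (Ohms) 428",
       "Resistance_Test DevID (Ohms) 428 something else")

# Removing the first three "word + space" groups leaves everything after them.
removed <- gsub("(.*? ){3}", "", x, perl = TRUE)
removed
# [1] "428"                "428 something else"

# Anchoring at the start and capturing the fourth token keeps only that word.
fourth <- sub("^(\\S+\\s+){3}(\\S+).*", "\\2", x, perl = TRUE)
fourth
# [1] "428" "428"
```

With rows of seven or more words, gsub would also go on to remove further groups of three, so the anchored form is the safer choice when row lengths vary.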

You could use word() from the stringr package:
> x <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
> library(stringr)
> word(string = x, start = 4, end = 4)
[1] "428" "353"
Specifying the position of both the start and end words to be the same, you will always get the fourth word.
I hope this helps.

We can use sub. We match the pattern one or more non-white space (\\S+) followed by one or more white space (\\s+) that gets repeated 3 times ({3}) followed by word that is captured in a group ((\\w+)) followed by one or more characters. We replace it by the second backreference.
sub("(\\S+\\s+){3}(\\w+).*", "\\2", str1)
#[1] "428" "353"
This selects by the nth word, so
sub("(\\S+\\s+){3}(\\w+).*", "\\2", str2)
#[1] "428" "353" "428"
Another option, when the wanted word is also the last one in the string, is stri_extract_last_regex from stringi
library(stringi)
stri_extract_last_regex(str1, "\\w+")
#[1] "428" "353"
data
str1 <- c("Resistance_Test DevID (Ohms) 428", "Diode_Test SUBLo (V) 353")
str2 <- c(str1, "Resistance_Test DevID (Ohms) 428 something else")

If you are not familiar with regular expressions, the function strsplit can help you:
data <- c('Resistance_Test DevID (Ohms) 428', 'Diode_Test SUBLo (V) 353')
unlist(lapply(strsplit(data, ' '), function(x) x[4]))
[1] "428" "353"
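The strsplit idea can be wrapped in a small helper that returns NA for rows with fewer than n words. This is a sketch only: nth_word is a hypothetical name, the third input string is made up, and it assumes the strings have no leading whitespace (which would produce an empty first token).

```r
# Hypothetical helper: n-th whitespace-separated word of each string.
nth_word <- function(text, n) {
  parts <- strsplit(text, "\\s+")
  # Indexing past the end of a character vector gives NA_character_,
  # and vapply() guarantees one character value per input string.
  vapply(parts, function(w) w[n], character(1))
}

x <- c("Resistance_Test DevID (Ohms) 428",
       "Diode_Test SUBLo (V) 353",
       "too short")
nth_word(x, 4)
# [1] "428" "353" NA
```

Using vapply instead of unlist(lapply(...)) makes the result length predictable even if some rows are malformed.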

Using R to compare two words and find letters unique to second word (across c. 6000 cases)

I have a dataframe comprising two columns of words. For each row I'd like to identify any letters that occur in only the word in the second column e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.

The function you’re searching for is setdiff:
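# Split a single string into its individual characters
chars_for = function (str)
    strsplit(str, '')[[1]]
# word1 and word2 are single strings, e.g. word1 = 'carpet'; word2 = 'carpelt'
result = setdiff(chars_for(word2), chars_for(word1))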
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
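Because apply can return either a list or a matrix depending on how many letters each row yields, a column-wise variant may be easier to work with. The sketch below uses mapply and collapses each result into one string; the data frame words, its column names and the new column unique_to_second are assumptions made for illustration.

```r
# Hypothetical data frame holding the example pairs from the question.
words <- data.frame(first  = c("carpet", "bag", "dog"),
                    second = c("carpelt", "flag", "dig"),
                    stringsAsFactors = FALSE)

chars_for <- function(str) strsplit(str, "")[[1]]

# For each row: letters of the second word that do not occur in the first,
# pasted back together so the result is a plain character vector.
words$unique_to_second <- mapply(
  function(a, b) paste(setdiff(chars_for(b), chars_for(a)), collapse = ""),
  words$first, words$second,
  USE.NAMES = FALSE
)
words$unique_to_second
# [1] "l"  "fl" "i"
```

Dropping the paste step and adding SIMPLIFY = FALSE would instead keep each row's letters as a separate character vector in a list.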

Use regex :) Paste your word with brackets [] and then use replace function for regex. This regex finds any letter from those in brackets and replaces it with empty string (you can say that it "removes" these letters).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"
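The character-class removal and setdiff are not quite interchangeable: removal keeps repeated letters, while setdiff returns each letter once. A short base R sketch of the difference follows; the word pair and the names kept_repeats and distinct are made up for illustration, and the regex variant assumes the first word contains only plain letters (characters such as ], ^ or - would need escaping inside the brackets).

```r
x <- "bag"
y <- "balloon"

# Regex removal: delete every occurrence of b, a or g from y.
kept_repeats <- gsub(paste0("[", x, "]"), "", y)
kept_repeats
# [1] "lloon"

# Set difference: distinct letters of y that are absent from x.
distinct <- paste(setdiff(strsplit(y, "")[[1]], strsplit(x, "")[[1]]),
                  collapse = "")
distinct
# [1] "lon"
```

Which of the two is wanted depends on whether "letters unique to the second word" should count repeats.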

x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
> sx <- strsplit(x, "")
> sy <- strsplit(y, "")
> lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to find those characters that are in y but not in x.
