I'm trying to extract the "Number" of "Humans" in the string below, for example:
string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")
The position of the text in the string will constantly change, so I need R to search the string and find "Species|Human|Number|" and return 1.
Apologies if this is a duplicate of another thread, but I've looked here (extract a substring in R according to a pattern) and here (R extract part of string). But I'm not having any luck.
Any ideas?
Use a capturing approach - capture 1 or more digits (\d+) after the known substring (just escape the | symbols):
> string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")
> pattern = "Species\\|Human\\|Number\\|(\\d+)"
> unlist(regmatches(string,regexec(pattern,string)))[2]
[1] "1"
A variation is to use a PCRE regex with regmatches/regexpr:
> pattern="(?<=Species\\|Human\\|Number\\|)\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"
Here, the left side context is put inside a non-consuming pattern, a positive lookbehind, (?<=...).
The same functionality can be achieved with the \K operator, which drops the text matched so far from the returned match:
> pattern="Species\\|Human\\|Number\\|\\K\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"
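If some of your strings may not contain the entry at all, the capturing approach can be wrapped in a small helper. This is only a sketch: extract_number and its species argument are names I made up, and it assumes the species name contains no regex metacharacters.

```r
# Hypothetical helper (not part of base R): return the number that follows
# "Species|<species>|Number|" in each string, or NA when it is absent.
extract_number <- function(x, species = "Human") {
  # Assumption: `species` has no regex metacharacters that need escaping.
  pattern <- paste0("Species\\|", species, "\\|Number\\|(\\d+)")
  groups <- regmatches(x, regexec(pattern, x))
  # Each element is c(full match, group 1), or character(0) if nothing matched.
  vapply(groups,
         function(g) if (length(g) >= 2) as.integer(g[2]) else NA_integer_,
         integer(1))
}

x <- c("Species|Cat|Number|3, Species|Human|Number|12",
       "Species|Human|Position|Left")
extract_number(x)         # 12 NA
extract_number(x, "Cat")  # 3 NA
```

Returning NA for a missing entry keeps the result the same length as the input, which is handy when the strings sit in a data frame column.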
Simplest way I can think of:
as.integer(gsub("^.*Species\\|Human\\|Number\\|(\\d+).*$", "\\1", string))
It will introduce NAs (with a coercion warning) where there is no mention of Species|Human|Number, because gsub then returns the string unchanged. Also, there will be artefacts if any of the strings is itself a number (but I assume that this won't be an issue).
I need to capture the numbers out of a string that come after a certain parameter name.
I have it working for most, but there is one parameter that is sometimes at the end of the string, but not always. When using the regular expression, it seems to matter.
I've tried different things, but nothing seems to work in both cases.
# Regular expression to capture the digit after the phrase "AppliedWhenID="
p <- ".*&AppliedWhenID=(.\\d*)"
# Tried this, but when at end, it just grabs a blank
#p <- ".*&AppliedWhenID=(.\\d*)&.*|.*&AppliedWhenID=(.\\d*)$"
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
# What should be returned is "2"
gsub(p, "\\1", testAtEnd) # works
gsub(p, "\\1", testNotAtEnd) # doesn't work, it captures 2 + &AgDateTypeID=1
Note that sub and gsub replace the found text(s), thus, in order to extract a part of the input string with a capturing group + a backreference, you need to actually match (and consume) the whole string.
Hence, you need to match the string to the end by adding .* at the end of the pattern:
p <- ".*&AppliedWhenID=(\\d+).*"
sub(p, "\\1", testNotAtEnd)
# => [1] "2"
sub(p, "\\1", testAtEnd)
# => [1] "2"
Note that gsub replaces all occurrences, while you only need a single one, so it makes sense to use sub instead of gsub.
Regex details
.* - any zero or more chars as many as possible
&AppliedWhenID= - a &AppliedWhenID= string
(\d+) - Group 1 (\1): one or more digits
.* - any zero or more chars as many as possible.
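One thing to keep in mind with this approach: when the pattern does not match at all, sub returns the input unchanged rather than an empty or missing value. A minimal sketch of a guard against that (extract_id is a made-up helper name, and the third test string is an invented example):

```r
# Hypothetical helper: return the digits after "&AppliedWhenID=",
# or NA when the parameter is missing from the string.
extract_id <- function(x) {
  p <- ".*&AppliedWhenID=(\\d+).*"
  # sub() leaves non-matching strings untouched, so test for a match first.
  ifelse(grepl(p, x), sub(p, "\\1", x), NA_character_)
}

tests <- c("ReportType=233&AppliedWhenID=2",
           "ReportType=233&AppliedWhenID=2&AgDateTypeID=1",
           "ReportType=233&AgDateTypeID=1")
extract_id(tests)  # "2" "2" NA
```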
You could try using a positive lookbehind, (?<=...), with str_extract() from the stringr library.
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
p <- "(?<=AppliedWhenID=)\\d+"
# What should be returned is "2"
library(stringr)
str_extract(testAtEnd, p)
str_extract(testNotAtEnd, p)
Or in base R
p <- ".*((?<=AppliedWhenID=)\\d+).*"
gsub(p, "\\1", testAtEnd, perl=TRUE)
gsub(p, "\\1", testNotAtEnd, perl=TRUE)
I want to extract strings using rm_between function from the library(qdapRegex)
I need to extract the string between the second "|" and the word "_HUMAN".
I can't figure out how to select the second "|" and not the first.
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
prots <- rm_between(example, '|', 'HUMAN', extract=TRUE)
Thank you!!
Another alternative is to use regmatches and regexpr with perl=TRUE, to make use of \K:
^(?:[^|]*\|){2}\K[^|_]+(?=_HUMAN)
For example
regmatches(example, regexpr("^(?:[^|]*\\|){2}\\K[^|_]+(?=_HUMAN)", example, perl=TRUE))
Output
[1] "EIFCL" "EIF3C"
In your rm_between(example, '|', 'HUMAN', extract=TRUE) command, the | is used to match the leftmost | and HUMAN is used to match the first HUMAN after it.
Note the default value for the fixed argument is TRUE, so | and HUMAN are treated as literal chars.
You need to make the pattern a regex pattern by setting fixed=FALSE. However, ^(?:[^|]*\|){2} as the left argument will not work, because qdapRegex builds an ICU regex with lookarounds (since you use extract=TRUE, which sets include.markers to FALSE), namely (?<=^(?:[^|]*\|){2}).*?(?=HUMAN), and an ICU lookbehind needs a bounded length.
As a workaround, you could use a constrained-width lookbehind, by replacing * with a limiting quantifier with a reasonably large max parameter. Say, if you do not expect more than 1000 chars between each pipe, you may use {0,1000}:
rm_between(example, '^(?:[^|]{0,1000}\\|){2}', '_HUMAN', extract=TRUE, fixed=FALSE)
# => [[1]]
# [1] "EIFCL"
#
# [[2]]
# [1] "EIF3C"
However, you really should think of using simpler approaches, like those described in other answers. Here is another variation with sub:
sub("^(?:[^|]*\\|){2}(.*?)_HUMAN.*", "\\1", example)
# => [1] "EIFCL" "EIF3C"
Details
^ - start of string
(?:[^|]*\\|){2} - two occurrences of any 0 or more non-pipe chars followed with a pipe char (so, matching up to and including the second |)
(.*?) - Group 1: any 0 or more chars, as few as possible
_HUMAN.* - _HUMAN and the rest of the string.
\1 keeps only Group 1 value in the result.
A stringr variation:
stringr::str_match(example, "^(?:[^|]*\\|){2}(.*?)_HUMAN")[,2]
# => [1] "EIFCL" "EIF3C"
With str_match, the captures can be accessed easily, we do it with [,2] to get Group 1 value.
This is not exactly what you asked for, but you can achieve the result with base R:
sub("^.*\\|([^\\|]+)_HUMAN.*$", "\\1", example)
This solution uses a regular expression.
"^.*\\|([^\\|]+)_HUMAN.*$" matches the entire character string.
\\1 in the replacement refers to whatever was matched inside the first pair of parentheses.
Using regular gsub:
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
gsub(".*?\\|.*?\\|(.*?)_HUMAN", "\\1", example)
#> [1] "EIFCL" "EIF3C"
The whole match is replaced by the part captured by (.*?), because the replacement is the back-reference \\1.
If you absolutely prefer qdapRegex you can try:
rm_between(example, '.{0,100}\\|.{0,100}\\|', '_HUMAN', fixed = FALSE, extract = TRUE)
The reason why we have to use .{0,100} instead of .*? is that the underlying stringi needs a maximum length for the look-behind pattern (i.e. the left argument in rm_between).
Just saying that you could easily just use sapply()/strsplit():
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")
unlist(sapply(strsplit(example, "|", fixed = TRUE),
              function(item) strsplit(item[3], "_HUMAN", fixed = TRUE)))
# [1] "EIFCL" "EIF3C"
It first splits each string on |, then splits the third piece of each result on _HUMAN.
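The two splits can also be folded into one by splitting on a character class. A minimal sketch, assuming the identifier itself never contains an underscore or a pipe (pieces and ids are just illustrative names):

```r
example <- c("sp|B5ME19|EIFCL_HUMAN", "sp|Q99613|EIF3C_HUMAN")

# Split on either "|" or "_" (a regex character class), then keep the
# third piece of each string.
pieces <- strsplit(example, "[|_]")
ids <- vapply(pieces, function(p) p[3], character(1))
ids
# [1] "EIFCL" "EIF3C"
```

vapply is used instead of sapply here so that the result is guaranteed to be a character vector with one element per input string.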
I'm trying to extract twitter handles from tweets using R's stringr package. For example, suppose I want to get all words in a vector that begin with "A". I can do this like so
library(stringr)
# Get all words that begin with "A"
str_extract_all(c("hAi", "hi Ahello Ame"), "(?<=\\b)A[^\\s]+")
[[1]]
character(0)
[[2]]
[1] "Ahello" "Ame"
Great. Now let's try the same thing using "#" instead of "A"
str_extract_all(c("h#i", "hi #hello #me"), "(?<=\\b)\\#[^\\s]+")
[[1]]
[1] "#i"
[[2]]
character(0)
Why does this example give the opposite of the result I was expecting, and how can I fix it?
It looks like you probably mean
str_extract_all(c("h#i", "hi #hello #me", "#twitter"), "(?<=^|\\s)#[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "#hello" "#me"
# [[3]]
# [1] "#twitter"
The \b in a regular expression is a word boundary: it occurs "Between two characters in the string, where one is a word character and the other is not a word character." Since a space and "#" are both non-word characters, there is no boundary before the "#".
With this revision you match either the start of the string or values that come after spaces.
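To see where the boundary does and does not occur, here is a small check with base R's grepl. It is just an illustration with made-up test strings; perl=TRUE is used so that \b and \B follow PCRE semantics.

```r
x <- c("h#i", " #hello", "#me")

# \b# needs a word character right before the "#"
has_boundary <- grepl("\\b#", x, perl = TRUE)
has_boundary
# [1]  TRUE FALSE FALSE

# \B# is the complement: "#" at the start of the string or after a
# non-word character such as a space
no_boundary <- grepl("\\B#", x, perl = TRUE)
no_boundary
# [1] FALSE  TRUE  TRUE
```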
A couple of things about your regex:
(?<=\b) is the same as \b because a word boundary is already a zero width assertion
\# is the same as #, as # is not a special regex metacharacter and you do not have to escape it
[^\s]+ is the same as \S+, almost all shorthand character classes have their negated counterparts in regex.
So, your regex, \b#\S+, matches #i in h#i because there is a word boundary between h (a letter, a word char) and # (a non-word char, not a letter, digit or underscore).
\b is an ambiguous pattern whose meaning depends on the regex context. In your case, you might want to use \B, a non-word boundary, that is, \B#\S+, and it will match # that are either preceded with a non-word char or at the start of the string.
x <- c("h#i", "hi #hello #me")
regmatches(x, gregexpr("\\B#\\S+", x))
## => [[1]]
## character(0)
##
## [[2]]
## [1] "#hello" "#me"
If you want to get rid of this \b/\B ambiguity, use unambiguous word boundaries using lookarounds with stringr methods or base R regex functions with perl=TRUE argument:
regmatches(x, gregexpr("(?<!\\w)#\\S+", x, perl=TRUE))
regmatches(x, gregexpr("(?<!\\S)#\\S+", x, perl=TRUE))
where:
(?<!\w) - an unambiguous starting word boundary - is a negative lookbehind that makes sure there is a non-word char immediately to the left of the current location or start of string
(?<!\S) - a whitespace starting word boundary - is a negative lookbehind that makes sure there is a whitespace char immediately to the left of the current location or start of string.
Note that the corresponding right hand boundaries are (?!\w) and (?!\S).
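For instance, combining the left-hand and right-hand whitespace boundaries gives a pattern that only accepts tags made of word characters and delimited by whitespace (or the string edges) on both sides. This is a sketch with invented test strings, not a full hashtag grammar:

```r
x <- c("hi #hello #me", "a #bad! tag", "h#i")

# (?<!\S) : no non-whitespace char to the left
# (?!\S)  : no non-whitespace char to the right
tags <- regmatches(x, gregexpr("(?<!\\S)#\\w+(?!\\S)", x, perl = TRUE))
tags
# [[1]]
# [1] "#hello" "#me"
#
# [[2]]
# character(0)
#
# [[3]]
# character(0)
```

The second string yields nothing because #bad is followed by !, which is a non-whitespace character, so the right-hand boundary fails.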
The answer above should suffice. This will remove the # symbol in case you are trying to get the users' names only.
str_extract_all(c("#tweeter tweet", "h#is", "tweet #tweeter2"), "(?<=\\B\\#)[^\\s]+")
[[1]]
[1] "tweeter"
[[2]]
character(0)
[[3]]
[1] "tweeter2"
While I am no expert with regex, the issue seems to be that the # symbol is not a word character, so \\b (a word boundary) does not match between a space and #: there is no transition between a word and a non-word character at that position.
Here are two great regex resources in case you hadn't seen them:
stat545
Stringr's Regex page, also available as a vignette:
vignette("regular-expressions", package = "stringr")
I am new to regex. As per the title, I would like to capture abc-def-ghi in the string utm_campaign=abc-def-ghi. The string is usually embedded in a URL. Using the pattern (utm_campaign=[a-zA-Z0-9_-]+) I can match the entire string, but I really just want the second part of the string, which is abc-def-ghi. Is there an efficient way to do this in regex? The preferred language for this question is R.
Another option: gsub
> string <- "utm_campaign=abc-def-ghi"
> gsub(".*utm_campaign=([A-Za-z0-9_-]+).*", "\\1", string)
[1] "abc-def-ghi"
You can use a positive lookbehind so that only the value is matched:
(?<=utm_campaign=)[\w-]+
(?<=utm_campaign=) Positive lookbehind ensuring what precedes matches utm_campaign= literally
[\w-]+ Match any word character (a-zA-Z0-9_) or hyphen character one or more times
In R:
x <- "utm_campaign=abc-def-ghi"
m <- regexpr("(?<=utm_campaign=)[\\w-]+", x, perl=TRUE)
regmatches(x, m)
Result: abc-def-ghi
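Since the string is usually embedded in a URL, here is a sketch of the same lookbehind applied to a vector of URLs, returning NA where the parameter is missing. The URLs are invented examples and campaign is just an illustrative variable name.

```r
urls <- c("https://example.com/?utm_source=x&utm_campaign=abc-def-ghi&utm_medium=email",
          "https://example.com/?utm_source=x")
pattern <- "(?<=utm_campaign=)[\\w-]+"

# regmatches() silently drops non-matching elements, so fill a vector of
# NAs at the positions where the pattern was found.
found <- grepl(pattern, urls, perl = TRUE)
campaign <- rep(NA_character_, length(urls))
campaign[found] <- regmatches(urls, regexpr(pattern, urls, perl = TRUE))
campaign
# [1] "abc-def-ghi" NA
```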
I have the following vector of strings
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of the string that is placed after the equal sign meaning I would like to get a vector like this
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is placed in the string and where the part I want to retrieve ends.
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
Following Sotos's comment, to get the results as a vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})
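Note that sapply over a character vector uses the input strings as names of the result. If you would rather have a plain unnamed vector, one possible variation (a sketch; after_eq is just an illustrative name) is to split the whole vector once and pick the second piece with vapply:

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# strsplit() is vectorised, so split everything in one call and take the
# second piece of each element; fixed = TRUE treats "=" as a literal.
after_eq <- vapply(strsplit(str1, "=", fixed = TRUE),
                   function(p) p[2], character(1))
after_eq
# [1] "BRYANPHI01"  "WILLIROB027" "THOMPWIL01"
```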
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, which provides a more powerful regex engine than base R's default one (in particular, it supports look-behind, which base R only offers with perl=TRUE).
library(stringi)
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")
# simplify = TRUE returns a character matrix instead of a list
stri_extract_all_regex(str1, pattern = "(?<==).+$", simplify = TRUE)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, '\w' matches any word character (letter, digit or underscore), so you could use "(?<==)\\w+$" (the \ must be escaped so you end up with \\w).
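If you would rather not add a dependency, the same look-behind pattern also works in base R once perl=TRUE is set. A minimal sketch (ids is just an illustrative name):

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# perl = TRUE switches to PCRE, which supports the (?<=...) look-behind
ids <- regmatches(str1, regexpr("(?<==)\\w+$", str1, perl = TRUE))
ids
# [1] "BRYANPHI01"  "WILLIROB027" "THOMPWIL01"
```

Keep in mind that regmatches drops elements without a match, so the result can be shorter than the input if some strings have no equal sign.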