Parsing String - Extract Numeric Characters At End - r

Parsing string fields in R data frames is a bit of a mystery to me, I'm afraid, so I would be grateful for help.
I have a string field which always ends in an indeterminate number of numeric characters. I'd like to write a bit of code to just extract the numeric part at the end of each.
An example of the data format is:
df_test <- data.frame(my_string = c("XXX-0387", "XXXX-1-999999", "XXX 12345432", "XXX-2345", "XXX1234"))
What I'd like is to put the numeric part at the end into a new field but to keep any leading zeros - so presumably the new field would have to be chr rather than int. So my output would look like:
c("0387", "999999", "12345432", "2345", "1234")
Is there an easy way to do this please?
Thank you.

One way is to use sub to capture the run of digits at the end of the string.
sub('.*?(\\d+)$', '\\1', df_test$my_string)
#[1] "0387" "999999" "12345432" "2345" "1234"
Using stringr :
stringr::str_extract(df_test$my_string, '\\d+$')

You can use regexpr with \\d+$ to find the digits at the end and extract them with regmatches.
regmatches(df_test$my_string, regexpr("\\d+$", df_test$my_string))
#[1] "0387" "999999" "12345432" "2345" "1234"

We can use stri_extract_last from stringi
library(stringi)
stri_extract_last(df_test$my_string, regex = "\\d+")
#[1] "0387" "999999" "12345432" "2345" "1234"
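
To tie the answers above back to the question, here is a minimal base R sketch that stores the trailing digits in a new character column. The column name my_number and the extra "no digits" row are assumptions added for illustration. regmatches silently drops elements without a match, so the sketch fills those rows with NA instead.

```r
# Example data from the question, plus one hypothetical row with no trailing digits
df_test <- data.frame(
  my_string = c("XXX-0387", "XXXX-1-999999", "XXX 12345432",
                "XXX-2345", "XXX1234", "no digits"),
  stringsAsFactors = FALSE
)

# Locate the run of digits anchored at the end of each string (-1 means no match)
m <- regexpr("[0-9]+$", df_test$my_string)
hit <- as.integer(m) != -1

# Keep the result as character so that leading zeros survive
df_test$my_number <- NA_character_
df_test$my_number[hit] <- regmatches(df_test$my_string, m)

df_test$my_number
# [1] "0387"     "999999"   "12345432" "2345"     "1234"     NA
```

Assigning through a logical index keeps the new column the same length as the data frame even when some rows have no digits at the end.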

Related

Is it possible to use R's base::strsplit() without consuming pattern

I have a string that consists entirely of simple repeating patterns of a [:digit:]+[A-Z] for instance 12A432B4B.
I want to use base::strsplit() to get:
[1] "12A" "432B" "4B"
I thought I could use lookahead to split by a LETTER and keep this pattern with unlist(strsplit("12A432B4B", "(?<=.)(?=[A-Z])", perl = TRUE)) but as can be seen I get the split wrongly:
[1] "12" "A432" "B4" "B"
I can't get my mind around a pattern that works with this strsplit strategy. Explanations would be really appreciated.
Bonus:
I also failed to use a back-reference in gsub (e.g. this pattern does not work: gsub("([[:digit:]]+[A-Z])+", "\\1", "12A432B4B")). Also, can you retrieve more than the \\1 to \\9 groups, say if [:digit:]+[A-Z] repeats more than 9 times?
We can use regex lookaround to split between an upper case letter and a digit
strsplit(str1, "(?<=[A-Z])(?=[0-9])", perl = TRUE)[[1]]
#[1] "12A" "432B" "4B"
data
str1 <- "12A432B4B"
The pattern mentioned in the post can be used as is in str_extract_all from stringr:
stringr::str_extract_all(string, '[[:digit:]]+[A-Z]')[[1]]
#[1] "12A" "432B" "4B"
Or in base R :
regmatches(string, gregexpr('[[:digit:]]+[A-Z]', string))[[1]]
where string is :
string <- '12A432B4B'
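
On the bonus question: as far as I understand it, a quantified group such as ([[:digit:]]+[A-Z])+ retains only its last repetition, so \\1 cannot return every piece. Below is a sketch of two workarounds; the 12-token test string is made up for illustration. Extraction with gregexpr needs no back-references at all, so the \\1 to \\9 limit never comes into play.

```r
# Hypothetical input with 12 repetitions: "1A2B3C...12L"
x <- paste0(1:12, LETTERS[1:12], collapse = "")

# Extract every digit-run-plus-letter token; no capture groups required
tokens <- regmatches(x, gregexpr("[0-9]+[A-Z]", x))[[1]]
length(tokens)
# [1] 12

# With gsub, apply the group to one token at a time (no trailing +),
# append a separator, then split on it
marked <- gsub("([0-9]+[A-Z])", "\\1 ", "12A432B4B")
pieces <- strsplit(trimws(marked), " ", fixed = TRUE)[[1]]
pieces
# [1] "12A"  "432B" "4B"
```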

Regular expressions in R to get the date

What would be a better way to extract only the date? It comes from a tag in a webpage.
I hope someone can help me.
The pattern appears like this in many pages: "publishedAtDate":"2020-02-07"
I would like the following outcome:
2020-02-07
I am using this code:
art_publishdate<-regexpr("publishedAtDate\":\"[0-9]{4}-[0-9]{2}-[0-9]{2}\"", thepage)
but the result includes many backslashes.
[1] "publishedAtDate\":\"2020-02-07\""
Thank you
You could pick out just the digits and convert them with as.Date.
as.Date(gsub("\\D", "", '"publishedAtDate":"2020-02-07"'), format = "%Y%m%d")
# [1] "2020-02-07"
Two ways to capture the output.
Using gsub we remove everything up to the colon (:), as well as the double quotes.
string <- '"publishedAtDate":"2020-02-07"'
gsub('.*:|"', '', string)
#[1] "2020-02-07"
Or using sub we can extract date pattern.
sub('.*?(\\d+-\\d+-\\d+).*', '\\1', string)
#[1] "2020-02-07"
Another solution using str_extract from the stringr package:
str_extract(string, "[0-9]{4}-[0-9]{2}-[0-9]{2}")
[1] "2020-02-07"
Alternatively, the date can be extracted thus:
str_extract(string, "[0-9-]+")
[1] "2020-02-07"
Another alternative is using positive look-behind (which encodes the instruction "Match if you see on the left...") as well as a negated character class [^"], which excludes the quote mark but no other character:
str_extract(string, '(?<=:")[^"]*')
[1] "2020-02-07"
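
A base R sketch of the same idea, for readers without stringr. The thepage value below is a made-up stand-in for the downloaded page. Note that regexpr only returns match positions, so regmatches is needed to get the text, and that the backslashes in the question's output are just how print escapes double quotes; they are not part of the string.

```r
# Hypothetical page fragment; the real page would come from readLines() or similar
thepage <- '{"title":"x","publishedAtDate":"2020-02-07","updated":"2021-01-01"}'

# \\K discards the key from the match, leaving only the date (needs perl = TRUE)
pattern <- '"publishedAtDate":"\\K[0-9]{4}-[0-9]{2}-[0-9]{2}'
date_chr <- regmatches(thepage, regexpr(pattern, thepage, perl = TRUE))
date_chr
# [1] "2020-02-07"

# Convert to Date if date arithmetic is needed later
date_val <- as.Date(date_chr)

# print() escapes embedded quotes with backslashes; cat() shows the raw text
cat('"publishedAtDate":"2020-02-07"\n')
```

Anchoring on the key name rather than on the date shape alone avoids picking up other dates that may appear elsewhere in the page.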

Return number from string

I'm trying to extract the "Number" of "Humans" in the string below, for example:
string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")
The position of the text in the string will constantly change, so I need R to search the string and find "Species|Human|Number|" and return 1.
Apologies if this is a duplicate of another thread, but I've looked here (extract a substring in R according to a pattern) and here (R extract part of string). But I'm not having any luck.
Any ideas?
Use a capturing approach - capture 1 or more digits (\d+) after the known substring (just escape the | symbols):
> string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")
> pattern = "Species\\|Human\\|Number\\|(\\d+)"
> unlist(regmatches(string,regexec(pattern,string)))[2]
[1] "1"
A variation is to use a PCRE regex with regmatches/regexpr
> pattern="(?<=Species\\|Human\\|Number\\|)\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"
Here, the left side context is put inside a non-consuming pattern, a positive lookbehind, (?<=...).
The same functionality can be achieved with \K operator:
> pattern="Species\\|Human\\|Number\\|\\K\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"
Simplest way I can think of:
as.integer(gsub("^.+Species\\|Human\\|Number\\|(\\d+).+$", "\\1", string))
It will produce NA (with a coercion warning) where there is no mention of Species|Human|Number, because gsub then returns the input unchanged. Also, there will be artefacts if any of the strings is itself a number (but I assume that this won't be an issue).
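
If the lookup has to run over a vector where some elements lack the Human count, a capturing approach with regexec can return NA explicitly instead of relying on coercion warnings. This is only a sketch; the strings vector and the name human_n are invented for the example.

```r
# Hypothetical inputs: the second element has no Species|Human|Number entry
strings <- c(
  "Species|Cat|Number|1, Species|Human|Number|3, Species|Human|Position|Left",
  "Species|Cat|Number|2"
)

pattern <- "Species\\|Human\\|Number\\|([0-9]+)"

# regexec returns the full match plus capture groups; no match gives character(0)
groups <- regmatches(strings, regexec(pattern, strings))

human_n <- vapply(
  groups,
  function(g) if (length(g) >= 2) as.integer(g[2]) else NA_integer_,
  integer(1)
)
human_n
# [1]  3 NA
```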

Retrieving a specific part of a string in R

I have the following vector of strings
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of each string after the equals sign, meaning I would like to get a vector like this
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equals sign is in the string and where the part I want to retrieve ends.
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
With Sotos' comment, to get the results as a vector:
sapply(str1, function(x) {
  strsplit(x, "=")[[1]][2]
})
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, which provides an ICU-based regex engine with look-behind support (base R also supports look-behind, but only with perl = TRUE).
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
stringi::stri_extract_all_regex(str1, pattern = "(?<==).+$", simplify = TRUE)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, '\w' matches any word character (letters, digits and underscore), so you could use "(?<==)\\w+$" (the \ must be escaped in an R string, so you end up with \\w).
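
For comparison, the same look-behind also works in base R when perl = TRUE is set, and the strsplit approach can be vectorised without carrying names along. A minimal sketch using the str1 data from the question (the names ids and ids_split are only for illustration):

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# Look-behind for "=", then word characters up to the end of the string
ids <- regmatches(str1, regexpr("(?<==)\\w+$", str1, perl = TRUE))
ids
# [1] "BRYANPHI01"  "WILLIROB027" "THOMPWIL01"

# Split on "=" and take the second piece of each element
ids_split <- vapply(strsplit(str1, "=", fixed = TRUE), `[`, character(1), 2)
identical(ids, ids_split)
# [1] TRUE
```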

Replace text that appears at the end of a string

Consider "artikelnr". I want to replace "nr" by "nummer", but when I consider "inrichting", I do NOT want to replace "nr". So I just want to replace "nr" by "nummer" if it's at the end of a word.
regex is your friend, here:
sub('nr$', 'nummer', 'artikelnr')
# [1] "artikelnummer"
The $ indicates "end of string", so nr will only be replaced with nummer when it appears at the end of the string.
sub can operate on an entire vector, e.g. for a character vector x, do:
sub('nr$', 'nummer', x)
If you don't mind using the stringr package, str_replace is also handy :
library(stringr)
str_replace("artikelnr", "nr$", "nummer")
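
The $ anchor only helps when each element is a single word. If the text can contain several words, a word boundary (\\b) is one option. A sketch, assuming the hypothetical sentence below and perl = TRUE for the boundary syntax:

```r
# Single-word elements: the end-of-string anchor is enough
x <- c("artikelnr", "inrichting")
single <- sub("nr$", "nummer", x)
single
# [1] "artikelnummer" "inrichting"

# Hypothetical multi-word input: replace "nr" only where a word ends
txt <- "artikelnr en inrichting"
multi <- gsub("nr\\b", "nummer", txt, perl = TRUE)
multi
# [1] "artikelnummer en inrichting"
```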
