How to do str_extract with base R? - r

I am balancing several versions of R and want to change my R libraries loaded depending on which R and which operating system I'm using. As such, I want to stick with base R functions.
I was reading this page to see what the base R equivalent to stringr::str_extract was:
http://stat545.com/block022_regular-expression.html
It suggested I could replicate this functionality with grep. However, I haven't been able to get grep to do more than return the whole string if there is a match. Is this possible with grep alone, or do I need to combine it with another function? In my case I'm trying to distinguish between CentOS versions 6 and 7.
grep(pattern = "release ([0-9]+)", x = readLines("/etc/system-release"), value = TRUE)

1) strcapture If you want to extract a string of digits and dots from "release 1.2.3" using base then
x <- "release 1.2.3"
strcapture("([0-9.]+)", x, data.frame(version = character(0)))
## version
## 1 1.2.3
2) regexec/regmatches There is also regmatches and regexec but that has already been covered in another answer.
3) sub Also it is often possible to use sub:
sub(".* ([0-9.]+).*", "\\1", x)
## [1] "1.2.3"
3a) If you know the match is at the beginning or end then delete everything after or before it:
sub(".* ", "", x)
## [1] "1.2.3"
4) gsub Sometimes we know that the field to be extracted has certain characters and they do not appear elsewhere. In that case simply delete every occurrence of every character that cannot be in the string:
gsub("[^0-9.]", "", x)
## [1] "1.2.3"
5) read.table One can often decompose the input into fields and then pick off the desired one by number or via grep. strsplit, read.table or scan can be used:
read.table(text = x, as.is = TRUE)[[2]]
## [1] "1.2.3"
5a) grep/scan
grep("^[0-9.]+$", scan(textConnection(x), what = "", quiet = TRUE), value = TRUE)
## [1] "1.2.3"
5b) grep/strsplit
grep("^[0-9.]+$", strsplit(x, " ")[[1]], value = TRUE)
## [1] "1.2.3"
6) substring If we know the character position of the field we can use substring like this:
substring(x, 9)
## [1] "1.2.3"
6a) substring/regexpr or we may be able to use regexpr to locate the character position for us:
substring(x, regexpr("\\d", x))
## [1] "1.2.3"
7) read.dcf Sometimes it is possible to convert the input to dcf form in which case it can be read with read.dcf. Such data is of the form name: value
read.dcf(textConnection(sub(" ", ": ", x)))
## release
## [1,] "1.2.3"

You could do
txt <- c("foo release 123", "bar release", "foo release 123 bar release 123")
pattern <- "release ([0-9]+)"
stringr::str_extract(txt, pattern)
# [1] "release 123" NA "release 123"
sapply(regmatches(txt, regexec(pattern, txt)), "[", 1)
# [1] "release 123" NA "release 123"

txt <- c("foo release 123", "bar release", "foo release 123 bar release 123")
pattern <- "release ([0-9]+)"
Extract first match
sapply(
X = txt,
FUN = function(x){
tmp = regexpr(pattern, x)
m = attr(tmp, "match.length")
st = unlist(tmp)
if (st == -1){NA}else{substr(x, start = st, stop = st + m - 1)}
},
USE.NAMES = FALSE)
#[1] "release 123" NA "release 123"
Extract all matches
sapply(
X = txt,
FUN = function(x){
tmp = gregexpr(pattern, x)
m = attr(tmp[[1]], "match.length")
st = unlist(tmp)
if (st[1] == -1){
NA
}else{
sapply(seq_along(st), function(i) substr(x, st[i], st[i] + m[i] - 1))
}
},
USE.NAMES = FALSE)
#[[1]]
#[1] "release 123"
#[[2]]
#[1] NA
#[[3]]
#[1] "release 123" "release 123"

Related

Substitute captured non-ascii letter with upper case

Is it possible to replicate, using only regex and only base R (only using the g*sub() functions), the following...
sub("(i)", "\\U\\1", "string", perl = TRUE)
# [1] "strIng"
For non-ascii letters?
# Hoped for output
sub("(í)", "?", "stríng", perl = TRUE)
# [1] "strÍng"
PS. R regex flavours are TRE and PCRE.
PS2. I'm using R 4.2.1 with Sys.getlocale() giving:
[1] "LC_COLLATE=Icelandic_Iceland.utf8;LC_CTYPE=Icelandic_Iceland.utf8;LC_MONETARY=Icelandic_Iceland.utf8;LC_NUMERIC=C;LC_TIME=Icelandic_Iceland.utf8"
You can use
x="stríng"
gr <- gregexpr("í", x)
mat <- regmatches(x, gr)
regmatches(x, gr) <- lapply(mat, toupper)
# > x
# > [1] "strÍng"
See the R demo online.
For a slightly more involved/explicit solution that only uses base R:
sub_nascii <- function(pattern, string) {
matches <- gregexpr(pattern, string)[[1]]
for (i in matches) {
substr(string, i, i) <- toupper(substr(string, i, i))
}
string
}
sub_nascii(pattern = "í", "stríng")
This works in my locale where sub on it's own doesn't.

gsub / sub to extract between certain characters

How can I extract the numbers / ID from the following string in R?
link <- "D:/temp/sample_data/0000098618-13-000011.htm"
I want to just extract 0000098618-13-000011
That is discard the .htm and the D:/temp/sample_data/.
I have tried grep and gsub without much luck.
1) basename Use basename followed by sub:
sub("\\..*", "", basename(link))
## [1] "0000098618-13-000011"
2) file_path_sans_ext
library(tools)
file_path_sans_ext(link)
## [1] "0000098618-13-000011"
3) sub
sub(".*/(.*)\\..*", "\\1", link)
## [1] "0000098618-13-000011"
4) gsub
gsub(".*/|\\.[^.]*$", "", link)
## [1] "0000098618-13-000011"
5) strsplit
sapply(strsplit(link, "[/.]"), function(x) tail(x, 2)[1])
## [1] "0000098618-13-000011"
6) read.table. If link is a vector this will only work if all elements have the same number of /-separated components. Also this assumes that the only dot is the one separting the extension.
DF <- read.table(text = link, sep = "/", comment = ".", as.is = TRUE)
DF[[ncol(DF)]]
## [1] "0000098618-13-000011"
Using stringr:
library(stringr)
str_extract(link , "[0-9-]+")
# "0000098618-13-000011"

R Match And Sub On Space Between Specific Characters

I need a little help with a regular expression using gsub. Take this object:
x <- "4929A 939 8229"
I want to remove the space in between "A" and "9", but I am not sure how to match on only the space between them and not on the second space. I essentially need something like this:
x <- gsub("A 9", "", x)
But I am not sure how to write the regular expression to not match on the "A" and "9" and only the space between them.
Thanks in advance!
You may use the following regex in sub:
> x <- "4929A 939 8229"
> sub("\\s+", "", x)
[1] "4929A939 8229"
The \\s+ will match 1 or more whitespace symbols.
The replacement part is an empty string.
See the online R demo
gsub matches/uses all regex found whereas sub only matches/uses the first one. So
sub(" ", "", "4929A 939 8229") # returns "4929A939 8229"
Will do the job
Removing second/nth occurence
You can do that e.g. by using strsplit as follows:
x <- c("4929A 939 8229", "4929A 9398229")
collapse_nth <- function(x_split, split, nth, replacement){
left <- paste(x_split[seq_len(nth)], collapse = split)
right <- paste(x_split[-seq_len(nth)], collapse = split)
paste(left, right, sep = replacement)
}
remove_nth <- function(x, nth, split, replacement = ""){
x_split <- strsplit(x, split, fixed = TRUE)
x_len <- vapply(x_split, length, integer(1))
out <- x
out[x_len>nth] <- vapply(x_split[x_len>nth], collapse_nth, character(1), split, nth, replacement)
out
}
Which gives you:
# > remove_nth(x, 2, " ")
# [1] "4929A 9398229" "4929A 9398229"
and
# > remove_nth(x, 2, " ", "---")
# [1] "4929A 939---8229" "4929A 9398229"

R: Delete first and last part of string based on pattern

This string is a ticker for a bond: OAT 3 25/32 7/17/17. I want to extract the coupon rate which is 3 25/32 and is read as 3 + 25/32 or 3.78125. Now I've been trying to delete the date and the name OAT with gsub, however I've encountered some problems.
This is the code to delete the date:
tkr.bond <- 'OAT 3 25/32 7/17/17'
tkr.ptrn <- '[0-9][[:punct:]][0-9][[:punct:]][0-9]'
gsub(tkr.ptrn, "", tkr.bond)
However it gets me the same string. When I use [0-9][[:punct:]][0-9] in the pattern I manage to delete part of the date, however it also deletes the fraction part of the coupon rate for the bond.
The tricky thing is to find a solution that doesn't involve the pattern of the coupon because the tickers have this form: Name Coupon Date, so, using a specific pattern for the coupon may limit the scope of the solution. For example, if the ticker is this way OAT 0 7/17/17, the coupon is zero.
Just replace first and last word with an empty string.
> tkr.bond <- 'OAT 3 25/32 7/17/17'
> gsub("^\\S+\\s*|\\s*\\S+$", "", tkr.bond)
[1] "3 25/32"
OR
Use gsubfn function in-order to use a function in the replacement part.
> gsubfn("^\\S+\\s+(\\d+)\\s+(\\d+)/(\\d+).*", ~ as.numeric(x) + as.numeric(y)/as.numeric(z), tkr.bond)
[1] "3.78125"
Update:
> tkr.bond1 <- c(tkr.bond, 'OAT 0 7/17/17')
> m <- gsub("^\\S+\\s*|\\s*\\S+$", "", tkr.bond1)
> gsubfn(".+", ~ eval(parse(text=x)), gsub("\\s+", "+", m))
[1] "3.78125" "0"
Try
eval(parse(text=sub('[A-Z]+ ([0-9]+ )([0-9/]+) .*', '\\1 + \\2', tkr.bond)))
#[1] 3.78125
Or you may need
sub('^[A-Z]+ ([^A-Z]+) [^ ]+$', '\\1', tkr.bond)
#[1] "3 25/32"
Update
tkr.bond1 <- c(tkr.bond, 'OAT 0 7/17/17')
v1 <- sub('^[A-Z]+ ([^A-Z]+) [^ ]+$', '\\1', tkr.bond1)
unname(sapply(sub(' ', '+', v1), function(x) eval(parse(text=x))))
#[1] 3.78125 0.00000
Or
vapply(strsplit(tkr.bond1, ' '), function(x)
eval(parse(text= paste(x[-c(1, length(x))], collapse="+"))), 0)
#[1] 3.78125 0.00000
Or without the eval(parse
vapply(strsplit(gsub('^[^ ]+ | [^ ]+$', '', tkr.bond1), '[ /]'), function(x) {
x1 <- as.numeric(x)
sum(x1[1], x1[2]/x1[3], na.rm=TRUE)}, 0)
#[1] 3.78125 0.00000
Similar to akrun's answer, using sub with a replacement. How it works: you put your "desired" pattern inside parentheses and leave the rest out (while still putting regex characters to match what's there and that you don't wish to keep). Then when you say replacement = "\\1" you indicate that the whole string must be substituted by only what's inside the parentheses.
sub(pattern = ".*\\s(\\d\\s\\d+\\/\\d+)\\s.*", replacement = "\\1", x = tkr.bond, perl = TRUE)
# [1] "3 25/32"
Then you can change it to numerical:
temp <- sub(pattern = ".*\\s(\\d\\s\\d+\\/\\d+)\\s.*", replacement = "\\1", x = tkr.bond, perl = TRUE)
eval(parse(text=sub(" ","+",x = temp)))
# [1] 3.78125
You can also use strsplit here. Then evaluate components excluding the first and the last. Like this
> tickers <- c('OAT 3 25/32 7/17/17', 'OAT 0 7/17/17')
>
> unlist(lapply(lapply(strsplit(tickers, " "),
+ function(x) {x[-length(x)][-1]}),
+ function(y) {sum(
+ sapply(y, function (z) {eval(parse(text = z))}) )} ) )
[1] 3.78125 0.00000

How to look for a certain part in a string and only keep that part

What is the cleanest way of finding for example the string ": [1-9]*" and only keeping that part?
You can work with regexec to get the starting points, but isn't there a cleaner way just to get immediately the value?
For example:
test <- c("surface area: 458", "bedrooms: 1", "whatever")
regexec(": [1-9]*", test)
How do I get immediately just
c(": 458",": 1", NA )
You can use base R which handles this just fine.
> x <- c('surface area: 458', 'bedrooms: 1', 'whatever')
> r <- regmatches(x, gregexpr(':.*', x))
> unlist({r[sapply(r, length)==0] <- NA; r})
# [1] ": 458" ": 1" NA
Although, I find it much simpler to just do...
> x <- c('surface area: 458', 'bedrooms: 1', 'whatever')
> sapply(strsplit(x, '\\b(?=:)', perl=T), '[', 2)
# [1] ": 458" ": 1" NA
library(stringr)
str_extract(test, ":.*")
#[1] ": 458" ": 1" NA
Or for a faster approach stringi
library(stringi)
stri_extract_first_regex(test, ":.*")
#[1] ": 458" ": 1" NA
If you need the keep the values of the one that doesn't have the match
gsub(".*(:.*)", "\\1", test)
#[1] ": 458" ": 1" "whatever"
Try any of these. The first two use the base of R only. The last one assumes that we want to return a numeric vector.
1) sub
s <- sub(".*:", ":", test)
ifelse(test == s, NA, s)
## [1] ": 458" ": 1" NA
If there can be more than one : in a string then replace the pattern with "^[^:]*:" .
2) strsplit
sapply(strsplit(test, ":"), function(x) c(paste0(":", x), NA)[2])
## [1] ": 458" ": 1" NA
Do not use this one if there can be more than one : in a string.
3) strapplyc
library(gsubfn)
s <- strapplyc(test, "(:.*)|$", simplify = TRUE)
ifelse(s == "", NA, s)
## [1] ": 458" ": 1" NA
We can omit the ifelse line if "" is ok instead of NA.
4) strapply If the idea is really that there are some digits on the line and we want to return the numbers or NA then try this:
library(gsubfn)
strapply(test, "\\d+|$", as.numeric, simplify = TRUE)
## [1] 458 1 NA

Resources