String Pattern in R

I have a list of strings as follows: "/home/ricardo/MultiClass/data//F10/1036.txt"
> library(stringr)
> strsplit(cls[1], split= "/")
This gives me:
#> [[1]] [1] "" "home" "ricardo" "MultiClass" "data"
#> "" "F10" "1036.txt"
How can I keep only the 7th position?
#> "F10"

If you want to extract one or more characters after // up to the next / or the end of the string, use:
> library(stringr)
> s <- "/home/ricardo/MultiClass/data//F10/1036.txt"
> str_extract(s, "(?<=//)[^/]+")
[1] "F10"
The (?<=//)[^/]+ regex pattern finds a position that is preceded by two slashes (see (?<=//)) and then matches one or more characters other than / (see [^/]+).
A base R solution with sub looks like this:
> sub("^.*/([^/]*)/[^/]*$", "\\1", s)
[1] "F10"
Details:
^ - start of string
.* - any 0+ chars as many as possible
/ - a slash (last but one in the string as the previous pattern is greedy)
([^/]*) - capturing group #1 matching any 0+ chars other than /
/ - last slash
[^/]* - any 0+ chars other than /
$ - end of string.
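Since the question asks for the 7th element of the split, indexing the strsplit result directly is another option. The following is a minimal sketch, assuming every path has the same depth; the paths vector and its second entry are made up for illustration.

```r
# Hypothetical input: the path from the question plus one invented path
paths <- c("/home/ricardo/MultiClass/data//F10/1036.txt",
           "/home/ricardo/MultiClass/data//F11/2048.txt")

# Split on a literal "/" and take the 7th piece of each path
parts <- strsplit(paths, "/", fixed = TRUE)
by_index <- vapply(parts, `[`, character(1), 7)

# Regex-free alternative: the name of the directory that holds the file
by_dir <- basename(dirname(paths))

by_index
by_dir
```

The index-based version depends on the number of leading directories, while basename(dirname(...)) only depends on the file sitting directly inside the wanted directory.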

Using the word function from stringr:
library(stringr)
s <- '/home/ricardo/MultiClass/data//F10/1036.txt'
word(sub('.*//', '', s), 1, sep = '/')
#[1] "F10"

This can also be done in base R.
I have defined a function gret that extracts the first match of a pattern from a string:
gret <- function(pattern, text, ignore.case = TRUE) {
  regmatches(text, regexpr(pattern, text, perl = TRUE, ignore.case = ignore.case))
}
Then:
gsub("data|/*", "", gret("(?=data/).*(?<=/)", "/home/ricardo/MultiClass/data//F10/1036.txt"))
#> [1] "F10"

Related

Use gsub to extract the first integer number

I'd like to use gsub to remove characters from a filename.
In the example below the desired output is 23
digs = "filepath/23-00.xlsx"
I can remove everything before 23 as follows:
gsub("^\\D+", "",digs)
[1] "23-00.xlsx"
or everything after:
gsub("\\-\\d+\\.xlsx$","", digs)
[1] "filepath/23"
How do I do both at the same time?
We could use | (OR): match everything (.*) up to the last /, or match the - and everything after it (-.*), and replace either match with an empty string ("").
gsub(".*/|-.*", "", digs)
[1] "23"
Or just use parse_number from readr (note that it returns a number, not a string):
readr::parse_number(digs)
[1] 23
You can also use sub with a capture group:
sub("^\\D+(\\d+).*", "\\1", digs)
# => [1] "23"
Details:
^ - start of string
\D+ - one or more non-digit chars
(\d+) - Group 1 (\1 refers to this group value): one or more digits
.* - any zero or more chars as many as possible.
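The same result can be obtained without a replacement by extracting the first run of digits. This is a small sketch using base R's regexpr and regmatches; the second filename is invented for illustration.

```r
digs <- c("filepath/23-00.xlsx", "other/dir/7-15.xlsx")  # second entry is hypothetical

# regexpr() locates only the first match, i.e. the first integer in each string
first_int <- regmatches(digs, regexpr("[0-9]+", digs))

first_int              # character vector
as.integer(first_int)  # integer vector, if a number is needed
```

Like the sub pattern above, this picks up digits in the directory part if there are any, so it assumes the first digits in the string belong to the filename.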

R Sub function: pull everything after second number

Trying to figure out how to pull everything after the second number using the sub function in R. I understand the basics with the lazy and greedy matching, but how do I take it one step further and pull everything after the second number?
str <- 'john02imga-04'
#lazy: pulls everything after first number
sub(".*?[0-9]", "", str)
#output: "2imga-04"
#greedy: pulls everything after last number
sub(".*[0-9]", "", str)
#output: ""
#desired output: "imga-04"
You can use
sub("\\D*[0-9]+", "", str)
## Or,
## sub("\\D*\\d+", "", str)
## => [1] "imga-04"
sub will find and replace the first occurrence of:
\D* (=[^0-9]*) - any zero or more non-digit chars
[0-9]+ (=\d+) - one or more digits.
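To see the three patterns side by side, here is a quick sketch. The variable is named s rather than str, an arbitrary choice made here to avoid masking base R's str function.

```r
s <- "john02imga-04"

lazy   <- sub(".*?[0-9]", "", s)    # stops after the first digit
greedy <- sub(".*[0-9]", "", s)     # runs through the last digit
first  <- sub("\\D*[0-9]+", "", s)  # removes up to the end of the first digit run

lazy
greedy
first
```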
Alternative ways
Match one or more letters, -, one or more digits at the end of the string:
> regmatches(str, regexpr("[[:alpha:]]+-\\d+$", str))
[1] "imga-04"
> library(stringr)
> str_extract(str, "\\p{L}+-\\d+$")
[1] "imga-04"
You can use a capture group for the second part and use that in the replacement
^\D+\d+(\D+\d+)
^ Start of string
\D+\d+ Match 1+ non digits, then 1+ digits
(\D+\d+) Capture group 1, match 1+ non digits and match 1+ digits
str <- 'john02imga-04'
sub("^\\D+\\d+(\\D+\\d+)", "\\1", str)
Output
[1] "imga-04"
If you also want to remove everything after the second number, append .* to the pattern:
^\D+\d+(\D+\d+).*
Alternatively, to get only the match, use perl = TRUE (PCRE) and \K, which discards the text matched so far from the reported match:
str <- 'john02imga-04'
regmatches(str, regexpr("^\\D+\\d+\\K\\D+\\d+", str, perl = TRUE))
Output
[1] "imga-04"
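One practical difference between the two approaches shows up on input that does not match. A short sketch, where the second element of x is a made-up non-matching case:

```r
x <- c("john02imga-04", "nodigits")  # second element is hypothetical

# sub() leaves strings that do not match untouched
replaced <- sub("^\\D+\\d+(\\D+\\d+)", "\\1", x)

# regmatches() silently drops elements without a match
extracted <- regmatches(x, regexpr("^\\D+\\d+\\K\\D+\\d+", x, perl = TRUE))

replaced
extracted
length(extracted)
```

If the result has to stay aligned with the input vector, the sub version is the safer choice.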

extracting part of path using sub

I'm attempting to extract a filename from a path in R. In a string like
someurl.com/vp/125514_45147_55144.jpg?_nc25244
I want to extract 125514_45147_55144
I'm using the following expression:
sub(".*vp/(.*?)/.*", "\\1", input)
which works but it also strips the underscores:
1255144514755144
I cannot figure out how to retain the underscores
Remove the first dot and everything after it from the basename (x holds the input string):
sub("\\..*", "", basename(x))
## [1] "125514_45147_55144"
If the filename itself may contain dots, use this slightly more complex pattern, which removes only the last dot and what follows it:
sub("(.*)\\..*", "\\1", basename(x))
## [1] "125514_45147_55144"
I suggest fixing it as
sub(".*/vp/([^/?]*?)\\.[^/?.]*(?:\\?.*)?$", "\\1", input)
Details
.* - any 0+ chars as many as possible
/vp/ - a literal substring
([^/?]*?) - Group 1 (its captured value is referenced by \1 from the replacement pattern): any 0+ chars other than / and ?, as few as possible
\\. - a dot
[^/?.]* - 0+ chars other than ., ? and /
(?:\\?.*)? - an optional substring matching ? and then any 0+ chars as many as possible
$ - end of string.
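The pattern can be checked against a few variations of the input. In this sketch only the first URL comes from the question; the others are invented. Passing perl = TRUE is an assumption made here so that PCRE handles the lazy quantifier and the non-capturing group.

```r
# Only the first URL is from the question; the rest are hypothetical test cases
input <- c("someurl.com/vp/125514_45147_55144.jpg?_nc25244",
           "someurl.com/vp/x_1.jpg",
           "someurl.com/vp/a.b_1.png?q=1")

pattern <- ".*/vp/([^/?]*?)\\.[^/?.]*(?:\\?.*)?$"

# perl = TRUE is assumed here; the original call does not set it
stems <- sub(pattern, "\\1", input, perl = TRUE)
stems
```

The third case shows that a dot inside the filename is kept, because the extension part [^/?.]* cannot contain a dot.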
With regmatches/regexec the pattern becomes much clearer:
x <- "someurl.com/vp/125514_45147_55144.jpg?_nc25244"
regmatches(x,regexec("/vp/([^/?]*)\\.",x))[[1]][2]
## => [1] "125514_45147_55144"
stringr alternative
library( stringr )
str_match( "someurl.com/vp/125514_45147_55144.jpg?_nc25244", "^.*/(.*?)\\..*$" )[[2]]
#[1] "125514_45147_55144"
Inspired by the answer of @G.Grothendieck, a regex-free solution using dirname, basename and chartr:
x = 'someurl.com/vp/125514_45147_55144.jpg?_nc25244'
dirname(chartr(x = basename(x), ".", "/"))
# [1] "125514_45147_55144"
This assumes there is no dot in the filename.
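Another low-regex sketch, offered as one possible variation rather than as part of the answers above: strip the query string first, then let tools::file_path_sans_ext (from the tools package shipped with R) remove the extension. The intermediate variable names are arbitrary.

```r
x <- "someurl.com/vp/125514_45147_55144.jpg?_nc25244"

no_query <- sub("\\?.*$", "", x)             # drop the query string, if any
file     <- basename(no_query)               # filename with its extension
stem     <- tools::file_path_sans_ext(file)  # filename without the extension
stem
```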

Extract both occurrences of pattern Regex

I have an input vector as follows:
input <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)", "fdsfs thistoo (1,1,1,1)")
And I would like to use a regex to extract the following:
> output
[1] "iwantthis iwantthisaswell" "thistoo"
I have managed to extract every word that is before an opening bracket.
I tried this to get only the first word:
> gsub(".*?[[:space:]](.*?)[[:space:]]\\(.*", "\\1", input)
[1] "iwantthis" "thistoo"
But I cannot get it to work for multiple occurrences:
> gsub(".*?[[:space:]](.*?)[[:space:]]\\(.*?[[:space:]](.*?)[[:space:]]\\(.*", "\\1 \\2", input)
[1] "iwantthis iwantthisaswell" "fdsfs thistoo (1,1,1,1)"
The closest I have managed is the following:
library(stringr)
> str_extract_all(input, "(\\S*)\\s\\(")
[[1]]
[1] "iwantthis (" "iwantthisaswell ("
[[2]]
[1] "thistoo ("
I am sure I am missing something in my regex (not that good at it) but what?
You may use
> sapply(str_extract_all(input, "\\S+(?=\\s*\\()"), paste, collapse=" ")
[1] "iwantthis iwantthisaswell" "thistoo"
The \\S+(?=\\s*\\() pattern extracts every run of one or more non-whitespace characters that is followed by zero or more whitespace characters and a ( character. sapply with paste then joins the matches found for each element with a space (collapse=" ").
Pattern details
\S+ - 1 or more non-whitespace chars
(?=\s*\() - a positive lookahead ((?=...)) that requires the presence of 0+ whitespace chars (\s*) and then a ( char (\() immediately to the right of the current position.
Here is an option using base R
unlist(regmatches(input, gregexpr("\\w+(?= \\()", input, perl = TRUE)))
#[1] "iwantthis" "iwantthisaswell" "thistoo"
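Note that unlist flattens the result into three elements, so the link to the two input strings is lost. A sketch of a base R variant that keeps one output string per input, reusing the lookahead pattern from the stringr answer:

```r
input <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)",
           "fdsfs thistoo (1,1,1,1)")

# One character vector of matches per input element
matches <- regmatches(input, gregexpr("\\S+(?=\\s*\\()", input, perl = TRUE))

# Collapse each element's matches into a single string
output <- vapply(matches, paste, character(1), collapse = " ")
output
```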
This works in R:
gsub('\\w.+? ([^\\s]+) \\(.+?\\)','\\1', input, perl=TRUE)
Result:
[1] "iwantthis iwantthisaswell" "thistoo"
UPDATED to work for the general case. E.g. now finds "i_wantthisaswell2" by searching on non-spaces between the other matches.
Using other suggested general case inputs:
general_cases <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)",
"fdsfs thistoo (1,1,1,1) ",
"GaGa iwant_this (1,1,1,1)",
"lal2!##$%^&*()_+a i_wantthisaswell2 (2,3,4,5)")
gsub('\\w.+? ([^\\s]+) \\(.+?\\)','\\1', general_cases, perl=TRUE)
results:
[1] "iwantthis iwantthisaswell" "thistoo "
[3] "iwant_this" "i_wantthisaswell2"

Can't figure out why regex group is not working in str_match

I have the following code with a regex
CHARACTER <- "^([A-Z0-9 .])+(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
str_match("WILL (V.O.)",CHARACTER)[1,2]
I thought this should match the value of "WILL " but it is returning blank.
I tried the RegEx in a different language and the group is coming back blank in that instance also.
What do I have to add to this regex to pull back just the value "WILL"?
You formed a repeated capturing group by placing the + outside the group, so the group keeps only the last character it matched. Move the + inside the group:
CHARACTER <- "^([A-Z0-9 .]+)(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
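The effect of the quantifier's position can be seen in a small base R sketch. The patterns are simplified for illustration (only the (V.O.) suffix is kept), and regexec with perl = TRUE stands in for str_match.

```r
x <- "WILL (V.O.)"

# Simplified patterns for illustration: only the (V.O.) suffix is kept
repeated   <- "^([A-Z0-9 .])+(?:\\(V\\.O\\.\\))?$"  # + outside the group
quantified <- "^([A-Z0-9 .]+)(?:\\(V\\.O\\.\\))?$"  # + inside the group

# regexec() returns the full match followed by each capture group
g_repeated   <- regmatches(x, regexec(repeated, x, perl = TRUE))[[1]][2]
g_quantified <- regmatches(x, regexec(quantified, x, perl = TRUE))[[1]][2]

g_repeated
g_quantified
```

With the + outside, the group is overwritten on every repetition and ends up holding only the final space, which is why the result looked blank.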
Note that you can also trim the trailing space from "WILL " by making the group lazy and adding \s* after it:
CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
> library(stringr)
> CHARACTER <- "^([A-Z0-9\\s.]+?)\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"
> str_match("WILL (V.O.)",CHARACTER)[1,2]
[1] "WILL"
Alternatively, you may just extract WILL with a lookahead:
> str_extract("WILL (V.O.)", "^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)")
[1] "WILL"
Or the same with base R, where x holds the input string:
> x <- "WILL (V.O.)"
> regmatches(x, regexpr("^.*?(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$)", x, perl=TRUE))
[1] "WILL"
Here,
^ - matches the start of a string
.*? - any 0+ chars other than line break chars as few as possible
(?=\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$) - makes sure that, immediately to the right of the current location, there is
\\s* - 0+ whitespaces
(?:\\(V\\.O\\.\\))? - an optional (V.O.) substring
(?:\\(O\\.S\\.\\))? - an optional (O.S.) substring
(?:\\(CONT'D\\))? - an optional (CONT'D) substring
$ - end of string.
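A different technique from the lookahead above, sketched here as an assumption about what may be convenient for a whole vector of cues: instead of extracting the name, remove the optional suffix with sub. The cues other than the first are hypothetical.

```r
# Hypothetical character cues; only the first one appears in the question
cues <- c("WILL (V.O.)", "SEAN (CONT'D)", "DR. SMITH")

suffix <- "\\s*(?:\\(V\\.O\\.\\))?(?:\\(O\\.S\\.\\))?(?:\\(CONT'D\\))?$"

# Remove optional whitespace plus any trailing marker, keeping the name
character_names <- sub(suffix, "", cues, perl = TRUE)
character_names
```

Unlike the anchored str_match pattern, this does not validate that the name consists only of uppercase letters, digits, spaces and dots.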
