Pattern match with R - r

I am trying to match a pattern using rgep() function as below -
grep("XYZ31__Sheqwqet1__CSV.csv", "^(XYZ)+[0-9]{2}[a-zA-Z_]+(csv)+$")
However unfortunately above expression results in no match. Any pointer towards the right direction will be very helpful.
Thanks for your time

Before the csv there is also a . and some digits. In addition, the order of arguments is pattern, followed by the input x. (if we pass arguments via name, the order wouldn't matter though)
grep( "^(XYZ)+[0-9]{2}[[:alnum:]_.]+(csv)$", "XYZ31__Sheqwqet1__CSV.csv")
#[1] 1
Pattern match is
^- start of the string
(XYZ)+ - one or more occurence of those letters
[0-9]{2} - two digits
[[:alnum:]_.]+ - one or more alpha numeric characters including the additional two
(csv)$- csv at the end of the string

Related

Subsetting a value based on partial pattern

I'm trying to subset out using regular expressions, the url: happy_to-learn.com.
As I'm really new to regex, could someone help with my code as to why it does not work?
x <- c("happy_to-learn.com", "His_is-omitted.net")
str_subset(x, "^[a-zA-Z](\\_|\\-)*\\.com$")
I understand that ^[a-zA-Z](\\_|\\-)* this portion here refers to, "Start when you hit a range of alphabets from a to z or A to Z, and it contains either _ or -, if yes, then subset out this portion with 0 or more matches.
However, is it possible continue from this code by adding the back part of the value i wish to subset? i.e. \\.com$ refers to all values that end with .com.
Is there something like "^[a-zA-Z](\\_|\\-)*...\\.com$" in regex?
We need to specify one or more with + as the _ or - are not just after the first letter.
str_subset(x, "^[a-zA-Z]+(\\_|\\-).*\\.com$")
#[1] "happy_to-learn.com"
Also, the .* refers to zero or more characters as . can be any character until the . and 'com' at the end ($) of the string
Why use an external package? grep can do it too.
grep("^[[:alpha:]_-]+.*\\.com$", x, value = TRUE)
#[1] "happy_to-learn.com"
Explanation.
"^" marks the beginning of the string.
"[:alpha:] matches any alphabetic character, upper or lower case in a portable way.
"^[[:alpha:]_-]+" between [], there are alternative characters to match repeated one or more times. Alphabetic or the underscore _ or the minus sign -.
"^[[:alpha:]_-]+.*" The above followed by any character repeated zero or more times.
"^[[:alpha:]_-]+.*\\.com$" ending with the string ".com" where the dot is not a metacharacter and therefore must be escaped.

Extracting numeric character of length (1|2) from character list

I am scraping PDFs for data and am trying to search for a numeric character (1:9) that is either of length 1 or 2. Unfortunately the value I am after changes position across the PDFs so I cannot simply call the index of the value and assign it to a variable.
I have tried many regex functions and can get numbers out of the list, but cannot seem to implement the argument to only pull numbers of the specific length.
# Data comes in as a long string
Test<-("82026-424 82026-424 1 CSX10 Store Room 75.74 75.74")
# Seperate data into individual pieces with str_split
Split_Test<-str_split(Test[1],"\\s+")
# We can easily unlist it with the following code (Not sure if needed)
Test_Unlisted<-unlist(Split_Test)
> Test_Unlisted
[1] "82026-424" "82026-424" "1" "CSX10" "Store" "Room"
[8] "75.74" "75.74"
My desired outcome would be to get the "1" out of the character list, and then if the value was "20" also be able to recognize that.
The best logic I can think of in code exists below, but this does not work.:
Test_Final<-str_match(Test_Unlisted, "\\d|\\d\\d")
Using this code I can grab anything of length=1, but it is not guaranteed to be a character:
Test_Final<-which(sapply(Test_Unlisted, nchar)==1)
Thanks for all the help!
You need to use
Test<-("82026-424 82026-424 1 CSX10 Store Room 75.74 75.74, 20")
regmatches(Test, gregexpr("\\b(?<!\\d\\.)\\d{1,2}\\b(?!\\.\\d)", Test, perl=TRUE))
See the regex demo and the regex demo.
Details
\b - a word boundary
(?<!\d\.) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a digit and a dot
\d{1,2} - 1 or 2 digits
\b - a word boundary
(?!\.\d) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a dot and a digit.
Note that due to the lookarounds used in the pattern, the regex should be passed to the PCRE regex engine, hence the perl=TRUE argument is required.
With stringr that is ICU regex engine powered, you may use
library(stringr)
str_extract_all(Test, "\\b(?<!\\d\\.)\\d{1,2}\\b(?!\\.\\d)")

How to split a string by dashes outside of square brackets

I would like to split strings like the following:
x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"
by dash (-) on the condition that those dashes must not be contained inside a pair of []. The expected result would be
c("abc", "1230", "xyz", "[def-ghu-jkl---]", "[adsasa7asda12]", "s",
"[klas-bst-asdas foo]")
Notes:
There is no nesting of square brackets inside each other.
The square brackets can contain any characters / numbers / symbols except square brackets.
The other parts of the string are also variable so that we can only assume that we split by - whenever it's not inside [].
There's a similar question for python (How to split a string by commas positioned outside of parenthesis?) but I haven't yet been able to accurately adjust that to my scenario.
You could use look ahead to verify that there is no ] following sooner than a [:
-(?![^[]*\])
So in R:
strsplit(x, "-(?![^[]*\\])", perl=TRUE)
Explanation:
-: match the hyphen
(?! ): negative look ahead: if that part is found after the previously matched hyphen, it invalidates the match of the hyphen.
[^[]: match any character that is not a [
*: match any number of the previous
\]: match a literal ]. If this matches, it means we found a ] before finding a [. As all this happens in a negative look ahead, a match here means the hyphen is not a match. Note that a ] is a special character in regular expressions, so it must be escaped with a backslash (although it does work without escape, as the engine knows there is no matching [ preceding it -- but I prefer to be clear about it being a literal). And as backslashes have a special meaning in string literals (they also denote an escape), that backslash itself must be escaped again in this string, so it appears as \\].
Instead of splitting, extract the parts:
library(stringr)
str_extract_all(x, "(\\[[^\\[]*\\]|[^-])+")
I am not familiar with r language, but I believe it can do regex based search and replace. Instead of struggling with one single regex split function, I would go in 3 steps:
replace - in all [....] parts by a invisible char, like \x99
split by -
for each element in the above split result(array/list), replace \x99 back to -
For the first step, you can find the parts by \[[^]]

How to extract characters from a string based on the text surrounding them in R

Edited to highlight the language I'm using I'm using the R language and I have many large lists of character strings and they have a similar format. I am interested in the characters directly in front of a series of characters that is consistently in the string, but not in a consistent place within the string. For instance:
a <- "aabbccddeeff"
b <- "aabbddff"
c <- "aabbffgghhii"
d <- "bbffgghhii"
I am interested in extracting the two characters directly preceding the "ff" in each character string. I can't find any reasonable solution apart from breaking each character string down using grepl() and then processing them each independently, which seems like an inefficient way to do it.
You can match those two characters and capture them with sub and the right regular expression.
Strings = c("aabbccddeeff",
"aabbddff",
"aabbffgghhii",
"bbffgghhii")
sub(".*(\\w\\w)ff.*", "\\1", Strings)
[1] "ee" "dd" "bb" "bb"
Explanation, This replaces the entire string with the two characters before the "ff". If there are multiple "ff" in the string, this expression takes the two characters before the last "ff".
How this works: The three arguments to sub are:
1. a pattern to search for
2. What it will be replaced with
3. The strings to apply it to.
Most of the work is in the pattern part - .*(\\w\\w)ff.*. The ff part of the pattern must be obvious. We are targeting things near the specific string ff. What comes right before it is (\\w\\w). \w refers to a "word character". That means any letter a-z or A-Z, any digit 0-9 or the one other character _. We want two characters so we have \\w\\w. By enclosing \\w\\w in parentheses, it turns this pattern of two characters into a "capture group", a string that will be saved into a variable for later use. Since this is the first (and only) capture group in this expression, those two characters will be stored in a variable called \1. Now we want only those two characters so in order to blow away everything before and after we put .* at the front and back. . matches any character and * means do this zero or more times, so .* means zero or more copies of any character. Now we have broken the string into four parts: "ff", the two characters before "ff", everything before that and everything after the ff. This covers the entire string. sub will _replace the part that was matched (everything) with whatever it says in the substitution pattern, in this case "\1". That is just how you write a string that evaluates to \1, the name of the variable where we stored the two characters that we want. We write it that way because backslash "escapes" whatever is after it. We actually want the character \ so we write \ to indicate \ and \1 evaluates to \1. So everything in the string is replaced by the targeted two characters. We apply this to every string in the list of strings Strings.

Regex to extract part of the string

I have the following string
> ma1.andl_4_1000x20x20_k1=1,k2=2,k3=1.csv.
I need to extract the section k1=1,k2=2,k3=1. I used substr() in R to extract.
substr(str, 23, nchar(str) - 4)
However I'm looking for a regular expression to extract the values.
If you need to extract the substr of k1=1,k2=2,k3=1 as Jota points out, and if it is so specific a string then his solution is what you want.
For a generalized solution that'll capture kx=y,ka=b,kj=k you'll need to Capture a Repeated Group, your group to me being kx=y, where x is any number, y is any number and ,. I left out the dot . for simplicity.
REGEX
((?:k\d{1,}=\d{1,}(?:,|\.)?)+)
BREAKDOWN
( - opening capturing bracket
(?: - opening non-capturing bracket, this will be repeated to capture the entire pattern
k\d{1,}=\d{1,} - the guts, allowing kx=y
(?:,|\.) - match the comma and last dot to allow to match the entire pattern of kx=y(?:,|.)
)+ - close non-capturing bracket, repeating this pattern to capture the whole group
) - close capturing bracket
...and you're done. The regex will work, but I don't use R at all so can't test.
Read the link, the whole site is quite informative for regex

Resources