How to delete string or digits after certain pattern? - r

If there is a vector x that is,
x <- c('/name12/?ad_2','/name13/?ad_3','/name14/?ad_4')
Is there a way to delete the following numbers after 'ad_'?
so the converted x appears as
'/name12/?ad_' '/name13/?ad_' '/name14/?ad_'
I was trying to use gsub function but it didn't work because of the digits followed by 'name'.

You may use a regex with sub (since you perform a single search and replace, you do not need gsub) and use a pattern depending on what you need to include or exclude in the result.
You might use "(\\?ad_)[0-9]+$" to remove ?ad_ + digits and replace with "\\1" to restore the ?ad_ value, or just match the _ and then digits (and replace with _).
See demo code:
> x <- c('/name12/?ad_2','/name13/?ad_3','/name14/?ad_4')
> sub("(\\?ad_)[0-9]+$", "\\1", x)
[1] "/name12/?ad_" "/name13/?ad_" "/name14/?ad_"
> sub("_[0-9]+$", "_", x)
[1] "/name12/?ad_" "/name13/?ad_" "/name14/?ad_"
See the regex demo
Pattern details:
_ - matches an underscore
[0-9]+ - 1 or more (due to the + quantifier matching one or more occurrences, as many as possible)
$ - the end of string.

Since the prefix is the same length for all of them:
x <- c('/name12/?ad_2','/name13/?ad_3','/name14/?ad_4')
substr(x,1,12)
[1] "/name12/?ad_" "/name13/?ad_" "/name14/?ad_"
Otherwise I would grep it.

Related

R Use Regular Expression to capture number when sometimes the capture is at the end of the string or not

I need to capture the numbers out of a string that come after a certain parameter name.
I have it working for most, but there is one parameter that is sometimes at the end of the string, but not always. When using the regular expression, it seems to matter.
I've tried different things, but nothing seems to work in both cases.
# Regular expression to capture the digit after the phrase "AppliedWhenID="
p <- ".*&AppliedWhenID=(.\\d*)"
# Tried this, but when at end, it just grabs a blank
#p <- ".*&AppliedWhenID=(.\\d*)&.*|.*&AppliedWhenID=(.\\d*)$"
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
# What should be returned is "2"
gsub(p, "\\1", testAtEnd) # works
gsub(p, "\\1", testNotAtEnd) # doesn't work, it captures 2 + &AgDateTypeID=1
Note that sub and gsub replace the found text(s), thus, in order to extract a part of the input string with a capturing group + a backreference, you need to actually match (and consume) the whole string.
Hence, you need to match the string to the end by adding .* at the end of the pattern:
p <- ".*&AppliedWhenID=(\\d+).*"
sub(p, "\\1", testNotAtEnd)
# => [1] "2"
sub(p, "\\1", testAtEnd)
# => [1] "2"
See the regex demo and the R online demo.
Note that gsub matches multiple occurrences, you need a single one, so it makes sense to replace gsub with sub.
Regex details
.* - any zero or more chars as many as possible
&AppliedWhenID= - a &AppliedWhenID= string
(\d+) - Group 1 (\1): one or more digits
.* - any zero or more chars as many as possible.
You could try using the string look behind conditional "(?<=)" and str_extract() from the stringr library.
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
p <- "(?<=AppliedWhenID=)\\d+"
# What should be returned is "2"
library(stringr)
str_extract(testAtEnd, p)
str_extract(testNotAtEnd, p)
Or in base R
p <- ".*((?<=AppliedWhenID=)\\d+).*"
gsub(p, "\\1", testAtEnd, perl=TRUE)
gsub(p, "\\1", testNotAtEnd, perl=TRUE)

Replace multiple consecutive hyphens in R

I have a string which looks like this:
something-------another--thing
I want to replace the multiple dashes with a single one.
So the expected output would be:
something-another-thing
We can try using sub here:
x <- "something-------another--thing"
gsub("-{2,}", "-", x)
[1] "something-another-thing"
More generally, if we want to replace any sequence of two or more of the same character with just the single character, then use this version:
x <- "something-------another--thing"
gsub("(.)\\1+", "\\1", x)
The second pattern could use an explanation:
(.) match AND capture any single letter
\\1+ then match the same letter, at least one or possibly more times
Then, we replace with just the single captured letter.
you can do it with gsub and using regex.
> text='something-------another--thing'
> gsub('-{2,}','-',text)
[1] "something-another-thing"
t2 <- "something-------another--thing"
library(stringr)
str_replace_all(t2, pattern = "-+", replacement = "-")
which gives:
[1] "something-another-thing"
If you're searching for the right regex to search for a string, you can test it out here https://regexr.com/
In the above, you're just searching for a pattern that is a hyphen, so pattern = "-", but we add the plus so that the search is 'greedy' and can include many hyphens, so we get pattern = "-+"

R Regex capture group?

I have a lot of strings like this:
2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0
I want to extract the substring that lays right after the last "/" and ends with "_":
556662
I have found out how to extract: /01/01/07/556662
by using the following regex: (\/)(.*?)(?=\_)
Please advise how can I capture the right group.
You may use
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/\\K[^_]+", x, perl=TRUE))
## [1] "556662"
See the regex and R demo.
Here, the regex matches and outputs the first substring that matches
.*/ - any 0+ chars as many as possible up to the last /
\K - omits this part from the match
[^_]+ - puts 1 or more chars other than _ into the match value.
Or, a sub solution:
sub(".*/([^_]+).*", "\\1", x)
See the regex demo.
Here, it is similar to the previous one, but the 1 or more chars other than _ are captured into Group 1 (\1 in the replacement pattern) and the trailing .* make sure the whole input is matched (and consumed, ready to be replaced).
Alternative non-base R solutions
If you can afford or prefer to work with stringi, you may use
library(stringi)
stri_match_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", ".*/([^_]+)")[,2]
## [1] "556662"
This will match a string up to the last / and will capture into Group 1 (that you access in Column 2 using [,2]) 1 or more chars other than _.
Or
stri_extract_last_regex("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0", "(?<=/)[^_/]+")
## => [1] "556662"
This will extract the last match of a string that consists of 1 or more chars other than _ and / after a /.
You could use a capturing group:
/([^_/]+)_[^/\s]*
Explanation
/ Match literally
([^_/]+) Capture in a group matching not an underscore or forward slash
_[^/\s]* Match _ and then 0+ times not a forward slash or a whitespace character
Regex demo | R demo
One option to get the capturing group might be to get the second column using str_match:
library(stringr)
str = c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0")
str_match(str, "/([^_/]+)_[^/\\s]*")[,2]
# [1] "556662"
I changed the Regex rules according to the code of Wiktor Stribiżew.
x <- "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0"
regmatches(x, regexpr(".*/([0-9]+)", x, perl=TRUE))
sub(".*/([0-9]+).*", "\\1", x)
Output
[1] "2019/01/01/07/556662"
[1] "556662"
R demo

Using gsub or sub function to only get part of a string?

Col
WBU-ARGU*06:03:04
WBU-ARDU*08:01:01
WBU-ARFU*11:03:05
WBU-ARFU*03:456
I have a column which has 75 rows of variables such as the col above. I am not quite sure how to use gsub or sub in order to get up until the integers after the first colon.
Expected output:
Col
WBU-ARGU*06:03
WBU-ARDU*08:01
WBU-ARFU*11:03
WBU-ARFU*03:456
I tried this but it doesn't seem to work:
gsub("*..:","", df$col)
Following may help you here too.
sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
Output will be as follows.
> sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456b"
Where Input for data frame is as follows.
dat <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456b")
df <- data.frame(dat)
Explanation: Following is only for explanation purposes.
sub(" ##using sub for global subtitution function of R here.
([^:]*) ##By mentioning () we are keeping the matched values from vector's element into 1st place of memory(which we could use later), which is till next colon comes it will match everything.
: ##Mentioning letter colon(:) here.
([^:]*) ##By mentioning () making 2nd place in memory for matched values in vector's values which is till next colon comes it will match everything.
.*" ##Mentioning .* to match everything else now after 2nd colon comes in value.
,"\\1:\\2" ##Now mentioning the values of memory holds with whom we want to substitute the element values \\1 means 1st memory place \\2 is second memory place's value.
,df$dat) ##Mentioning df$dat dataframe's dat value.
You may use
df$col <- sub("(\\d:\\d+):\\d+$", "\\1", df$col)
See the regex demo
Details
(\\d:\\d+) - Capturing group 1 (its value will be accessible via \1 in the replacement pattern): a digit, a colon and 1+ digits.
: - a colon
\\d+ - 1+ digits
$ - end of string.
R Demo:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("(\\d:\\d+):\\d+$", "\\1", col)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternative approach:
df$col <- sub("^(.*?:\\d+).*", "\\1", df$col)
See the regex demo
Here,
^ - start of string
(.*?:\\d+) - Group 1: any 0+ chars, as few as possible (due to the lazy *? quantifier), then : and 1+ digits
.* - the rest of the string.
However, it should be used with the PCRE regex engine, pass perl=TRUE:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("^(.*?:\\d+).*", "\\1", col, perl=TRUE)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
See the R online demo.
sub("(\\d+:\\d+):\\d+$", "\\1", df$Col)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternatively match what you want (instead of subbing out what you don't want) with stringi:
stringi::stri_extract_first(df$Col, regex = "[A-Z-\\*]+\\d+:\\d+")
Slightly more concise stringr:
stringr::str_extract(df$Col, "[A-Z-\\*]+\\d+:\\d+")
# or
stringr::str_extract(df$Col, "[\\w-*]+\\d+:\\d+")

r: regex for containing pattern with negation

Suppose I have the following two strings and want to use grep to see which match:
business_metric_one
business_metric_one_dk
business_metric_one_none
business_metric_two
business_metric_two_dk
business_metric_two_none
And so on for various other metrics. I want to only match the first one of each group (business_metric_one and business_metric_two and so on). They are not in an ordered list so I can't index and have to use grep. At first I thought to do:
.*metric.*[^_dk|^_none]$
But this doesn't seem to work. Any ideas?
You need to use a PCRE pattern to filter the character vector:
x <- c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
grep("metric(?!.*_(?:dk|none))", x, value=TRUE, perl=TRUE)
## => [1] "business_metric_one" "business_metric_two"
See the R demo
The metric(?!.*(?:_dk|_none)) pattern matches
metric - a metric substring
(?!.*_(?:dk|none)) - that is not followed with any 0+ chars other than line break chars followed with _ and then either dk or none.
See the regex demo.
NOTE: if you need to match only such values that contain metric and do not end with _dk or _none, use a variation, metric.*$(?<!_dk|_none) where the (?<!_dk|_none) negative lookbehind fails the match if the string ends with either _dk or _none.
You can also do something like this:
grep("^([[:alpha:]]+_){2}[[:alpha:]]+$", string, value = TRUE)
# [1] "business_metric_one" "business_metric_two"
or use grepl to match dk and none, then negate the logical when you're indexing the original string:
string[!grepl("(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
more concisely:
string[!grepl("business_metric_[[:alpha:]]+_(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
Data:
string = c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")

Resources