Regex expression starting from a certain character

Regex expression starting from a certain character - r

Example: "example._AL(5)._._4500_GRE/Jan_2018"
I am trying to extract text from the above string containing parentheses. I wanna extract everything starting from AL.
Output should look like: "AL(5)._._4500_GRE/Jan_2018"

There is some question on what we can assume is known but here are a few variations which make various assumptions.
1) word( This removes everything prior to the first word followed by a parenthesis.
"^" matches the start of string
".*?" is the shortest match of anything provided we still match rest of regex
"\\w+" matches a word
"\\(" matches a left paren
(...) forms a capture group which the replacement string can refer to as "\\1"
Code
x <- "example.AL(5)._._4500_GRE/Jan_2018"
sub("^.*?(\\w+\\()", "\\1", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
1a) or matching a word followed by ( followed by anything and extracting that:
library(gsubfn)
strapplyc(x, "\\w+\\(.*", simplify = TRUE)
## [1] "AL(5)._._4500_GRE/Jan_2018"
2) AL( or if we know that the word is AL then:
sub("^.*?(AL\\(.*)", "\\1", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
3) remove up to 1st dot or if we know that the part to be removed is the part before and including the first dot:
sub("^.*?\\.", "", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
4) dot separated fields If the format of the input is dot-separated fields we can parse them all out at once like this:
read.table(text = x, sep = ".", as.is = TRUE)
## V1 V2 V3 V4
## 1 example AL(5) _ _4500_GRE/Jan_2018

Related

How to change values before text in string using R

I have multiple strings that are similar to the following pattern:
dat<-("00000000AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0")
I need to change all 0 values to "." before the first character value within a string. My desired output in this example would be:
"........AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0".
I tried using gsub to accomplish this task:
gsub("\\G([^_\\d]*)\\d", ".\\1", dat, perl=T)
Unfortunately it changed all of the 0s to "." instead of the 0s preceding the first "A".
Can someone please help me with this issue?

If you wish to simply replace each leading 0 with a ., you can use
gsub("\\G0", ".", dat, perl=TRUE)
Here, \G0 matches a 0 char at the start of string, and then every time after a successful match. See this regex demo.
If you need to replace each 0 in a string before the first letter you can use
gsub("\\G[^\\p{L}0]*\\K0", ".", dat, perl=TRUE)
Here, \G matches start of string or end of the preceding successful match, [^\p{L}0]* matches zero or more chars other than a letter and 0, then \K omits the matched text, and then 0 matches the 0 char and it is replaced with a .. See this regex demo.
See the R demo online:
dat <- c("00000000AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0","102030405000AZD")
gsub("\\G0", ".", dat, perl=TRUE)
## [1] "........AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0"
## [2] "102030405000AZD"
gsub("\\G[^\\p{L}0]*\\K0", ".", dat, perl=TRUE)
## [1] "........AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0"
## [2] "1.2.3.4.5...AZD"

This is really hard.
So I tried to do it with a custom function:
library(stringr)
dat<-("00000000AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0")
Zero_Replacer <- function(x) {
x <- str_split(x, '[A-Za-z]', 2)
x[[1]][1] <- str_replace_all(x[[1]][1], "0", ".")
paste0(x[[1]][1], x[[1]][2])
}
Zero_Replacer(dat)
Output:
[1] "........AAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0"

Replace matched patterns in a string based on condition

I have a text string containing digits, letters and spaces. Some of its substrings are month abbreviations. I want to perform a condition-based pattern replacement, namely to enclose a month abbreviation in whitespaces if and only if a given condition is fulfilled. As an example, let the condition be as follows: "preceeded by a digit and succeeded by a letter".
I tried stringr package but I fail to combine the functions str_replace_all() and str_locate_all():
# Input:
txt = "START1SEP2 1DECX JANEND"
# Desired output:
# "START1SEP2 1 DEC X JANEND"
# (A) What I could do without checking the condition:
library(stringr)
patt_month = paste("(", paste(toupper(month.abb), collapse = "|"), ")", sep='')
str_replace_all(string = txt, pattern = patt_month, replacement = " \\1 ")
# "START1 SEP 2 1 DEC X JAN END"
# (B) But I actually only need replacements inside the condition-based bounds:
str_locate_all(string = txt, pattern = paste("[0-9]", patt_month, "[A-Z]", sep=''))[[1]]
# start end
# [1,] 12 16
# To combine (A) and (B), I'm currently using an ugly for() loop not shown here and want to get rid of it

You are looking for lookarounds:
(?<=\d)DEC(?=[A-Z])
See a demo on regex101.com.
Lookarounds make sure a certain position is matched without consuming any characters. They are available in front of sth. (called lookbehind) or to make sure anything that follows is of a certain type (called lookahead). You have positive and negative ones on both sides, thus you have four types (pos./neg. lookbehind/-ahead).
A short memo:
(?=...) is a pos. lookahead
(?!...) is a neg. lookahead
(?<=...) is a pos. lookbehind
(?<!...) is a neg. lookbehind

A Base R version
patt_month <- capture.output(cat(toupper(month.abb),"|"))#concatenate all month.abb with OR
pat <- paste0("(\\s\\d)(", patt_month, ")([A-Z]\\s)")#make it a three group thing
gsub(pattern = pat, replacement = "\\1 \\2 \\3", txt, perl =TRUE)#same result as above
Also works for txt2 <- "START1SEP2 1JANY JANEND" out of the box.
[1] "START1SEP2 1 JAN Y JANEND"

Replace multiple consecutive hyphens in R

I have a string which looks like this:
something-------another--thing
I want to replace the multiple dashes with a single one.
So the expected output would be:
something-another-thing

We can try using sub here:
x <- "something-------another--thing"
gsub("-{2,}", "-", x)
[1] "something-another-thing"
More generally, if we want to replace any sequence of two or more of the same character with just the single character, then use this version:
x <- "something-------another--thing"
gsub("(.)\\1+", "\\1", x)
The second pattern could use an explanation:
(.) match AND capture any single letter
\\1+ then match the same letter, at least one or possibly more times
Then, we replace with just the single captured letter.

you can do it with gsub and using regex.
> text='something-------another--thing'
> gsub('-{2,}','-',text)
[1] "something-another-thing"

t2 <- "something-------another--thing"
library(stringr)
str_replace_all(t2, pattern = "-+", replacement = "-")
which gives:
[1] "something-another-thing"
If you're searching for the right regex to search for a string, you can test it out here https://regexr.com/
In the above, you're just searching for a pattern that is a hyphen, so pattern = "-", but we add the plus so that the search is 'greedy' and can include many hyphens, so we get pattern = "-+"

Extract date from given string in r

string<-c("Posted 69 months ago (7/4/2011)")
library(gsubfn)
strapplyc(string, "(.*)", simplify = TRUE)
I apply above function but nothing happens.
In this I want to extract only date part i.e 7/4/2011.

The first one shows how to fix the code in the question to give the desired answer. The next 2 solutions are the same except they use different regular expressions. The fourth solution shows how to do it with gsub. The fifth breaks the gsub into two sub calls and the sixth uses read.table.
1) Escape parens The problem is that ( and ) have special meaning in regular expressions so you must escape them if you want to match them literally. By using "[(]" as we do below (or writing them as "\\(" ) they are matched literally. The inner parentheses define the capture group as we don't want that group to include the literal parentheses themselves:
strapplyc(string, "[(](.*)[)]", simplify = TRUE)
## [1] "7/4/2011"
2) Match content Another way to do it is to match the data itself rather than the surrounding parentheses. Here "\\d+" matches one or more digits:
strapplyc(string, "\\d+/\\d+/\\d+", simplify = TRUE)
## [1] "7/4/2011"
You could specify the number of digits if you want to be even more specific but it seems unnecessary here if the data looks similar to that in the question.
3) Match 8 or more digits and slashes Given that there are no other sequences of 8 or more characters consisting only of slashes and digits in the rest of the string we could just pick out that:
strapplyc(string, "[0-9/]{8,}", simplify = TRUE)
## [1] "7/4/2011"
4) Remove text before and after Another way of doing it is to remove everything up to the ( and after the ) like this:
gsub(".*[(]|[)].*", "", string)
## [1] "7/4/2011"
5) sub This is the same as (4) except it breaks the gsub into two sub invocations, one removing everything up to ( and the other removing ) onwards. The regular expressions are therefore slightly simpler.
sub(".*\\(", "", sub("\\).*", "", string))
6) read.table This solution uses no regular expressions at all. It defines sep and comment.char in read.table so that the second column of the result of read.table is the required date or dates.
read.table(text = string, sep = "(", comment.char = ")", as.is = TRUE)$V2
## [1] "7/4/2011"
Note: Note that you don't need the c in defining string
string <- c("Posted 69 months ago (7/4/2011)")
string2 <- "Posted 69 months ago (7/4/2011)"
identical(string, string2)
## [1] TRUE

We can do this with gsub by matching one or more characters that are not a ( ([^(]+) from the start (^) of the string or | the ) at the end ($) of the string and replace it with ""
gsub("[^[^(]+\\(|\\)$", "", string)
#[1] "7/4/2011"
Or using capture groups
sub("^[^(]+\\(([^)]+).*", "\\1", string)
#[1] "7/4/2011"
Or with str_extract, we match one or more characters that are not a ) ([^)]+) that follows the ( ((?<=[(]))
library(stringr)
str_extract(string, "(?<=[(])[^)]+")
#[1] "7/4/2011"

string manipulation to remove the name of files

I have a list of strings
/temp/123/afedcgid/abc.csv
/temp/123/4388dkfa/abc1.csv
/temp/123/4388dkfa/ab1.csv
I want to remove name of the file from the strings
The results desired are
/temp/123/afedcgid
/temp/123/4388dkfa
/temp/123/4388dkfa
How can i do it. Thanks.

You could try the below,
sub("/[^/]*$", "", x)
It removes all the chars from the last / symbol.
OR
> x <- "/temp/123/afedcgid/abc.csv"
> sub("(.*)/.*", "\\1", x)
[1] "/temp/123/afedcgid"
captures all the chars from the start upto the last / symbol (excluding /). Then the following chars are matched by .*. Replacing the matched chars with chars inside group 1 will give you the desired output.
Example:
> x <- "/temp/123/afedcgid/abc.csv"
> sub("/[^/]*$", "", x)
[1] "/temp/123/afedcgid"
OR
regmatches(x, gregexpr(".+(?=/)", x, perl=TRUE))

Use this regex to catch character you want to replace
\/\w+\.\w+$
try this demo
Demo
files <- c("/temp/123/afedcgid/abc.csv" ,
"/temp/123/4388dkfa/abc1.csv" , "/temp/123/4388dkfa/ab1.csv")
sub("\\/\\w+\\.\\w+$" , "" , files)
as you may know you need to \\ for escaping sequences in R

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Regex expression starting from a certain character - r

Example: "example._AL(5)._._4500_GRE/Jan_2018" I am trying to extract text from the above string containing parentheses. I wanna extract everything starting from AL. Output should look like: "AL(5)._._4500_GRE/Jan_2018"

Related

How to change values before text in string using R

Replace matched patterns in a string based on condition

Replace multiple consecutive hyphens in R

Extract date from given string in r

string manipulation to remove the name of files

Categories

Resources