extract number in string using regex - r

I have a data.frame like this :
SO <- data.frame(coiffure_IDF$SIREN, coiffure_IDF$L6_NORMALISEE )
coiffure_IDF.SIREN coiffure_IDF.L6_NORMALISEE
1 54805015 75008 PARIS
2 300086907 94210 ST MAUR DES FOSSES
3 300090453 94220 CHARENTON LE PONT
4 300209608 75007 PARIS
5 300570553 95880 ENGHIEN LES BAINS
6 301123626 75019 PARIS
7 301362349 92300 LEVALLOIS PERRET
I want to have this :
coiffure_IDF.SIREN codpos_norm ville
1 54805015 75008 PARIS
2 300086907 94210 ST MAUR DES FOSSES
3 300090453 94220 CHARENTON LE PONT
4 300209608 75007 PARIS
5 300570553 95880 ENGHIEN LES BAINS
6 301123626 75019 PARIS
7 301362349 92300 LEVALLOIS PERRET
so I used regex :
SO2<- SO %>% extract(col="coiffure_IDF.L6_NORMALISEE", into=c("codpos_norm", "ville"), regex="(\\d+)\\s+(\\S+)")
so I have the right column is "codpos_norm" but in "ville" in line 2 I just have "ST" in stead of "ST MAUR DES FOSSES". In line 3 just "CHARENTON", etc
so I tried to add some \\s+ and \\S+ in the regex but R told me that they are to many groups and that it has to have only 2 groups.
What could I do ?

You need to match the rest of the string in Group 2, the \S construct only matches non-whitespace chars. Use .+ to match any 1+ chars up to the string end:
extract(col="coiffure_IDF.L6_NORMALISEE", into=c("codpos_norm", "ville"), regex="(\\d+)\\s+(.+)")
You may use .* to match empty strings (if there is no text after 1+ whitespaces).

Related

split a string on every odd positioned spaces

I have strings of varying lengths in this format:
"/S498QSB 0 'Score=0' 1 'Score=1' 2 'Score=2' 3 'Score=3' 7 'Not administered'"
the first item is a column name and the other items tell us how this column is encoded
I want the following output:
/S498QSB
0 'Score=0'
1 'Score=1'
2 'Score=2'
3 'Score=3'
7 'Not administered'"
str_split should do it, but it's not working for me:
str_split("/S498QSB 0 'Score=0' 1 'Score=1' 2 'Score=2' 3 'Score=3' 7 'Not administered'",
"([ ].*?[ ].*?)[ ]")
You can use
str_split(x, "\\s+(?=\\d+\\s+')")
See the regex demo.
Details:
\s+ - one or more whitespaces
(?=\d+\s+') - a positive lookahead that requires the following sequence of patterns immediately to the right of the current location:
\d+ - one or more digits
\s+ - one or more whitespaces
' - a single quotation mark.

Regex to match only semicolons not in parenthesis [duplicate]

This question already has answers here:
Regex - Split String on Comma, Skip Anything Between Balanced Parentheses
(2 answers)
Closed 1 year ago.
I have the following string:
Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews
I want to replace the semicolons that are not in parenthesis to commas. There can be any number of brackets and any number of semicolons within the brackets and the result should look like this:
Almonds , Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt), Cashews
This is my current code:
x<- Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews
gsub(";(?![^(]*\\))",",",x,perl=TRUE)
[1] "Almonds , Roasted Peanuts (Peanuts, Canola Oil (Antioxidants (319; 320)); Salt), Cashews "
The problem I am facing is if there's a nested () inside a bigger bracket, the regex I have will replace the semicolon to comma.
Can I please get some help on regex that will solve the problem? Thank you in advance.
The pattern ;(?![^(]*\)) means matching a semicolon, and assert that what is to the right is not a ) without a ( in between.
That assertion will be true for a nested opening parenthesis, and will still match the ;
You could use a recursive pattern to match nested parenthesis to match what you don't want to change, and then use a SKIP FAIL approach.
Then you can match the semicolons and replace them with a comma.
[^;]*(\((?>[^()]+|(?1))*\))(*SKIP)(*F)|;
In parts, the pattern matches
[^;]* Match 0+ times any char except ;
( Capture group 1
\( Match the opening (
(?> Atomic group
[^()]+ Match 1+ times any char except ( and )
| Or
(?1) Recurse the whole first sub pattern (group 1)
)* Close the atomic group and optionally repeat
\) Match the closing )
) Close group 1
(*SKIP)(*F) Skip what is matched
| Or
; Match a semicolon
See a regex demo and an R demo.
x <- c("Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews",
"Peanuts (32.5%); Macadamia Nuts (14%; PPPG(AHA)); Hazelnuts (9%); nuts(98%)")
gsub("[^;]*(\\((?>[^()]+|(?1))*\\))(*SKIP)(*F)|;",",",x,perl=TRUE)
Output
[1] "Almonds , Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt), Cashews"
[2] "Peanuts (32.5%), Macadamia Nuts (14%; PPPG(AHA)), Hazelnuts (9%), nuts(98%)"

Regex Question: separate string at the last comma in string

My string pattern is as follows:
1233 fox street, omaha NE ,69131-7233
Jeffrey Jones, 666 Church Street, Omaha NE ,69131-72339
Betty Davis, LLC, 334 Aloha Blvd., Fort Collins CO ,84444-00333
,1233 Decker street, omaha NE ,69131-7233
I need to separate the above string into four variables: name, address, city_state, zipcode.
Since the pattern has three to four commas, I am starting at the right to separate the field into multiple fields.
rubular.com says the pattern ("(,\\d.........)$"))) or the pattern ",\d.........$" will match the zipcode at the end of the string.
regex101.com, finds neither of the above patterns comes up with a match.
When I try to separate with:
#need to load pkg:tidyr for the `separate`
function
library(tidyr)
separate(street_add, c("street_add2", "zip", sep= ("(,\d.........)$")))
or with:
separate(street_add, c("street_add2", "zip", sep= (",\d.........$")))
In both scenarios, R splits at the first comma in the string.
How do I split the string into segments?
Thank you.
Use
sep=",(?=[^,]*$)"
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
, ','
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
[^,]* any character except: ',' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
) end of look-ahead

Extract date from a text document in R

I am again here with an interesting problem.
I have a document like shown below:
"""UDAYA FILLING STATION ps\na MATTUPATTY ROAD oe\noe 4 MUNNAR Be:\nSeat 4 04865230318 Rat\nBree 4 ORIGINAL bepas e\n\noe: Han Die MC DE ER DC I se ek OO UO a Be ten\" % aot\n: ag 29-MAY-2019 14:02:23 [i\n— INVOICE NO: 292 hee fos\nae VEHICLE NO: NOT ENTERED Bea\nss NOZZLE NO : 1 ome\n- PRODUCT: PETROL ae\ne RATE : 75.01 INR/Ltr yee\n“| VOLUME: 1.33 Ltr ae\n~ 9 =6AMOUNT: 100.00 INR mae wae\nage, Ee pel Di EE I EE oe NE BE DO DC DE a De ee De ae Cate\notome S.1T. No : 27430268741C =. ver\nnes M.S.T. No: 27430268741V ae\n\nThank You! Visit Again\n""""
From the above document, I need to extract date highlighted in bold and Italics.
I tried with strpdate function but did not get the desired results.
Any help will be greatly appreciated.
Thanks in advance.
Assuming you only want to capture a single date, you may use sub here:
text <- "UDAYA FILLING STATION ps\na MATTUPATTY ROAD oe\noe 4 MUNNAR Be:\nSeat 4 04865230318 Rat\nBree 4 ORIGINAL bepas e\n\noe: Han Die MC DE ER DC I se ek OO UO a Be ten\" % aot\n: ag 29-MAY-2019 14:02:23 [i\n— INVOICE NO: 292 hee fos\nae VEHICLE NO: NOT ENTERED Bea\nss NOZZLE NO : 1 ome\n- PRODUCT: PETROL ae\ne RATE : 75.01 INR/Ltr yee\n“| VOLUME: 1.33 Ltr ae\n~ 9 =6AMOUNT: 100.00 INR mae wae\nage, Ee pel Di EE I EE oe NE BE DO DC DE a De ee De ae Cate\notome S.1T. No : 27430268741C =. ver\nnes M.S.T. No: 27430268741V ae\n\nThank You! Visit Again\n"
date <- sub("^.*\\b(\\d{2}-[A-Z]+-\\d{4})\\b.*", "\\1", text)
date
[1] "29-MAY-2019"
If you had the need to match multiple such dates in your text, then you may use regmatches along with regexec:
text <- "Hello World 29-MAY-2019 Goodbye World 01-JAN-2018"
regmatches(text,regexec("\\b(\\d{2}-[A-Z]+-\\d{4})\\b", text))[[1]]
[1] "29-MAY-2019" "29-MAY-2019"

removing some part of string from url

I want to remove jsessionid from given string url and backslash #start
/product.screen?productId=BS-AG-G09&JSESSIONID=SD1SL6FF6ADFF6510
so that output would be like
product.screen?productId=BS-AG-G09
More data like :
1 /product.screen?productId=WC-SH-A02&JSESSIONID=SD0SL6FF7ADFF4953
2 /oldlink?itemId=EST-6&JSESSIONID=SD0SL6FF7ADFF4953
3 /product.screen?productId=BS-AG-G09&JSESSIONID=SD0SL6FF7ADFF4953
4 /product.screen?productId=FS-SG-G03&JSESSIONID=SD0SL6FF7ADFF4953
5 /cart.do?action=remove&itemId=EST-11&productId=WC-SH-A01&JSESSIONID=SD0SL6FF7ADFF4953
6 /oldlink?itemId=EST-14&JSESSIONID=SD0SL6FF7ADFF4953
7 /cart.do?action=view&itemId=EST-6&productId=MB-AG-T01&JSESSIONID=SD1SL6FF6ADFF6510
8 /product.screen?productId=BS-AG-G09&JSESSIONID=SD1SL6FF6ADFF6510
9 /product.screen?productId=WC-SH-A02&JSESSIONID=SD1SL6FF6ADFF6510
10 /cart.do?action=view&itemId=EST-6&productId=WC-SH-A02&JSESSIONID=SD1SL6FF6ADFF6510
11 /product.screen?productId=WC-SH-A02&JSESSIONID=SD1SL6FF6ADFF6510
You may use:
library(stringi)
lf1 = "/product.screen?productId=BS-AG-G09&JSESSIONID=SD0SL6FF7ADFF4953"
stri_replace_all_regex(
"/product.screen?productId=BS-AG-G09&JSESSIONID=SD0SL6FF7ADFF4953",
"&JSESSIONID=.*","")
Where the string: &JSESSIONID=.* (up to the end .*) gets replaced with nothing ("").
or simply: gsub("&JSESSIONID=.*","",lf1)

Resources