Pattern matching and replacing based on regex - r

I am trying to parse through a string and modify values that match a particular pattern. I am basically trying to convert the following R code into Python.
sample_formatted <- stringr::str_replace(sample,
'(\\b[a-zA-Z]+):([a-zA-Z]\\b)', '\\1\\2')
I am completely new to Python regex and am struggling to figure out where to start.
Any help will be greatly appreciated!

You can do this with re.sub in Python:
import re
regex = r"(\b[a-zA-Z]+):([a-zA-Z]\b)"
test_str = ("abc:x\n"
"DEf:y\n"
"ABC:z")
subst = "\\1\\2"
result = re.sub(regex, subst, test_str)
if result:
    print(result)
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
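One difference worth checking when porting from R: stringr::str_replace replaces only the first match in each string, while re.sub replaces every match unless a count is given. A minimal sketch, assuming a made-up sample string (not from the original post), of how count=1 mirrors str_replace and the default mirrors str_replace_all:

```python
import re

pattern = r"(\b[a-zA-Z]+):([a-zA-Z]\b)"
sample = "abc:x and DEf:y"  # hypothetical input for illustration

# count=1 stops after the first match, like stringr::str_replace
first_only = re.sub(pattern, r"\1\2", sample, count=1)

# the default replaces every match, like stringr::str_replace_all
all_matches = re.sub(pattern, r"\1\2", sample)

print(first_only)
print(all_matches)
```

If the R input is a character vector, the rough Python counterpart would be to apply the same call to each element of a list.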

Related

I need to perform a stemming operation in Python without NLTK, using pipeline methods

I have a list of words and a list of stem rules.
I need to stem the words whose suffixes are in the stem rules list. I got a hint from a friend that I can use pipeline methods.
For example, if I have:
stem=['less','ship','ing','les','ly','es','s']
text=['friends','friendly','keeping','friendship']
I should get: 'friend','friend','keep','friend'
You can find and remove the suffixes using regular expressions (the re module):
import re
text = ['friends', 'friendly', 'keeping', 'friendship']
stems = [
    # Remove any of the listed suffixes; the trailing $ anchors the match
    # to the end of the word so that only suffixes are stripped.
    re.sub(r'(?:less|ship|ing|les|ly|es|s)$', '', word)
    for word in text
]
print(stems)
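The suffix alternation does not have to be typed by hand. A minimal sketch, assuming the stem list from the question, that builds the pattern from the list and anchors it to the end of the word; the names stem_rules, suffix_re and stem_word are chosen here for illustration:

```python
import re

stem_rules = ['less', 'ship', 'ing', 'les', 'ly', 'es', 's']
words = ['friends', 'friendly', 'keeping', 'friendship']

# re.escape guards against rules that contain regex metacharacters;
# the trailing $ means only a suffix at the end of the word is removed.
suffix_re = re.compile("(?:" + "|".join(re.escape(rule) for rule in stem_rules) + ")$")

def stem_word(word):
    # Strip at most one matching suffix from the end of the word.
    return suffix_re.sub("", word, count=1)

stemmed = [stem_word(word) for word in words]
print(stemmed)
```

Because the search runs left to right and the match must reach the end of the word, the longest applicable suffix is the one that gets removed.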

How to extract a substring from a main string, starting from a valid UUID, using Lua

I have a main string as below
"/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
From the main string I need to extract a substring starting from the UUID part
"/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
I tried
string.match("/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/", "/[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}/(.)/(.)/$")
But no luck.
If you want to obtain
"/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
from
"/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
or, say, 7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0, output and 9999.317528060546245771146821638997525068657 as separate captures (which is what your pattern attempt suggests), keep the parentheses in the following solution. Otherwise leave them out.
You can use a pattern like this:
local text = "/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
print(text:match("/([%x%-]+)/([^/]+)/([^/]+)"))
"/([^/]+)/" captures at least one non-slash character between two slashes.
On your attempt:
You cannot give counts like {4} in a string pattern.
You have to escape - with % as it is a magic character.
(.) would only capture a single character.
Please read the Lua manual to find out what you did wrong and how to use string patterns properly.
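For comparison only, and not a Lua solution: regex engines that support counted repetition do accept the {8}-style quantifiers from the attempt above. A minimal Python sketch using the re module, with variable names chosen here for illustration:

```python
import re

path = "/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"

# Counted repetition such as {8} is valid in Python's re, unlike in Lua patterns.
uuid_re = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
match = re.search(r"/(" + uuid_re + r")/([^/]+)/([^/]+)/$", path)

if match:
    tail = match.group(0)    # substring starting at the slash before the UUID
    parts = match.groups()   # (uuid, second field, third field)
    print(tail)
    print(parts)
```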
You can also try this code:
s="/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
print(s:match("/.-/.-(/.+)$"))
It skips the first two "fields" by using a non-greedy match.

Regular Expression using R Programming Language

String<- "46,XX,t(1;19)(p32;q13.3),t(6;9)(p22;q34),del(32)t(12;16)(p12;q21)[cp20]"
The values I want to extract are t(1;19)(p32;q13.3), t(6;9)(p22;q34), t(12;16)(p12;q21)
The regex I'm using
ABC<-str_extract(String, regex("t.{1,16}"))
Output I get: t(1;19)(p32;q13.3
I know my code is incomplete, but I'm unable to figure out a way to extract this information.
Thank you in advance
Assuming your String is:
String<- "46,XX,t(1;19)(p32;q13.3),t(6;9)(p22;q34),del(32)t(12;16)(p12;q21)[cp20]"
We can use str_extract_all as:
stringr::str_extract_all(String, "t\\(.*?\\)\\(.*?\\)")[[1]]
#[1] "t(1;19)(p32;q13.3)" "t(6;9)(p22;q34)" "t(12;16)(p12;q21)"
This returns "t" followed by everything inside one pair of round brackets, immediately followed by everything inside a second pair. The lazy .*? stops at the first closing bracket that lets the rest of the pattern match.
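The same extraction can be sketched in Python, where re.findall plays the role of str_extract_all. This is an illustration under the assumption that the input looks like the sample string; the second pattern, which uses [^)]* instead of the lazy .*?, is an alternative added here and not part of the original answer:

```python
import re

karyotype = "46,XX,t(1;19)(p32;q13.3),t(6;9)(p22;q34),del(32)t(12;16)(p12;q21)[cp20]"

# Same pattern as the R answer: lazy matching inside each pair of brackets.
lazy = re.findall(r"t\(.*?\)\(.*?\)", karyotype)

# Stricter variant: each bracketed part may not contain a closing bracket.
strict = re.findall(r"t\([^)]*\)\([^)]*\)", karyotype)

print(lazy)
print(strict)
```

On this input both patterns give the same three translocations; the stricter form cannot run past a closing bracket if a single-bracket t(...) ever appears.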

Match everything up until first instance of a colon

Trying to code up a Regex in R to match everything before the first occurrence of a colon.
Let's say I have:
time = "12:05:41"
I'm trying to extract just the 12. My strategy was to do something like this:
grep(".+?(?=:)", time, value = TRUE)
But I'm getting the error that it's an invalid Regex. Thoughts?
Your regex itself is fine. You are getting the error because lookaheads such as (?=:) require perl=TRUE in base R functions like grep. Even so, I don't think grep is the right tool here.
I would recommend using:
stringr::str_extract( time, "\\d+?(?=:)")
grep works differently from how it is being used here: it is good for filtering a vector down to the elements that match a pattern, but it cannot pluck a value out of a string.
If you want to use base R, you can also use sub:
sub("^(\\d+?)(?=:)(.*)$","\\1",time, perl=TRUE)
You can also split the string using strsplit and take the first piece, like below:
strsplit(time, ":")[[1]][1]
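For readers working in Python, the three approaches above (lookahead, matching up to the delimiter, and splitting) have close counterparts. A minimal sketch, with variable names chosen here for illustration:

```python
import re

time = "12:05:41"

# Lazy match followed by a lookahead for the colon, as in the question's regex.
with_lookahead = re.search(r".+?(?=:)", time).group(0)

# Everything from the start that is not a colon.
with_negated_class = re.match(r"[^:]+", time).group(0)

# No regex at all: split once on the first colon and keep the first piece.
with_split = time.split(":", 1)[0]

print(with_lookahead, with_negated_class, with_split)
```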

time pattern in list.files function (R)

I'm trying to get a list of subdirectories from a path. These subdirectories have a time pattern month\day\hour, e.g. 03\21\11.
I naively used the following:
list.files("path",pattern="[0-9]\[0-9]\[0-9]", recursive = TRUE, include.dirs = TRUE)
But it doesn't work.
How to code for the digitdigit\digitdigit\digitdigit pattern here?
Thank you
This regex works for 10\11\18:
(\d\d\\\d\d\\\d\d)
I think you may need lazy matching in the regex, unless there are always two digits, in which case the other responses look valid.
If you could provide a vector of file name strings, that would be super helpful.
Capturing backslashes is confusing. I've found this thread helpful: R - gsub replacing backslashes
My guess is something like this: '[0-9]+?\\\\[0-9]+?\\\\[0-9]+'
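The layers of escaping are easier to see in a language with raw strings. In the regex itself, one literal backslash is written as \\; an R string literal doubles each of those again, which is why the guess above contains four. A minimal Python sketch of the digitdigit\digitdigit\digitdigit pattern, using made-up paths that are not from the original post:

```python
import re

# Hypothetical subdirectory paths, written as raw strings so each backslash is literal.
paths = [r"logs\03\21\11", r"logs\03\21", r"logs\2019\archive"]

# In a raw string, \\ is the regex for one literal backslash and \d{2} is two digits.
time_dir = re.compile(r"\d{2}\\\d{2}\\\d{2}$")

matches = [p for p in paths if time_dir.search(p)]
print(matches)
```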
