Match everything up until first instance of a colon - r

Trying to code up a Regex in R to match everything before the first occurrence of a colon.
Let's say I have:
time = "12:05:41"
I'm trying to extract just the 12. My strategy was to do something like this:
grep(".+?(?=:)", time, value = TRUE)
But I'm getting the error that it's an invalid Regex. Thoughts?

Your regex seems fine in my opinion, I don't think you should use grep, also you are missing perl=TRUE that is why you are getting the error.
I would recommend using :
stringr::str_extract( time, "\\d+?(?=:)")
grep is little different than it is being used here, its good for matching separate values and filtering out those which has similar pattern, but you can't pluck out values within a string using grep.
If you want to use Base R you can also go for sub:
sub("^(\\d+?)(?=:)(.*)$","\\1",time, perl=TRUE)
Also, you may split the string using strsplit and filter out the first string like below:
strsplit(time, ":")[[1]][1]

Related

How to extract a substring from main string starting from valid uuid using lua

I have a main string as below
"/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
From the main string i need to extract a substring starting from the uuid part
"/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
I tried
string.match("/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/", "/[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}/(.)/(.)/$"
But noluck.
if you want to obtain
"/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
from
"/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
or let's say 7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0, output and 9999.317528060546245771146821638997525068657 as this is what your pattern attempt suggests. Otherwise leave out the parenthesis in the following solution.
You can use a pattern like this:
local text = "/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
print(text:match("/([%x%-]+)/([^/]+)/([^/]+)"))
"/([^/]+)/" captures at least one non-slash-character between two slashs.
On your attempt:
You cannot give counts like {4} in a string pattern.
You have to escape - with % as it is a magic character.
(.) would only capture a single character.
Please read the Lua manual to find out what you did wrong and how to use string patterns properly.
Try also the code
s="/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
print(s:match("/.-/.-(/.+)$"))
It skips the first two "fields" by using a non-greedy match.

Removing part of strings within a column

I have a column within a data frame with a series of identifiers in, a letter and 8 numbers, i.e. B15006788.
Is there a way to remove all instances of B15.... to make them empty cells (there’s thousands of variations of numbers within each category) but keep B16.... etc?
I know if there was just one thing I wanted to remove, like the B15, I could do;
sub(“B15”, ””, df$col)
But I’m not sure on the how to remove a set number of characters/numbers (or even all subsequent characters after B15).
Thanks in advance :)
Welcome to SO! This is a case of regex. You can use base R as I show here or look into the stringR package for handy tools that are easier to understand. You can also look for regex rules to help define what you want to look for. For what you ask you can use the following code example to help:
testStrings <- c("KEEPB15", "KEEPB15A", "KEEPB15ABCDE")
gsub("B15.{2}", "", testStrings)
gsub is the base R function to replace a pattern with something else in one or a series of inputs. To test our regex I created the testStrings vector for different examples.
Breaking down the regex code, "B15" is the pattern you're specifically looking for. The "." means any character and the "{2}" is saying what range of any character we want to grab after "B15". You can change it as you need. If you want to remove everything after "B15". replace the pattern with "B15.". the "" means everything till the end.
edit: If you want to specify that "B15" must be at the start of the string, you can add "^" to the start of the pattern as so: "^B15.{2}"
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf has a info on different regex's you can make to be more particular.

Repeating a regex pattern for date parsing

I have the following string
"31032017"
and I want to use regular expressions in R to get
"31.03.2017"
What is the best function to do it?
And a general question, how can I repeat the matched part, like as in sed in bash? There, we use \1 to repeat the first matched part.
You need to put the single parts in round brackets like this:
sub("([0-9]{2})([0-9]{2})([0-9]{4})", "\\1.\\2.\\3", "31032017")
You can then use \\1 to access the part matched by the first group, \\2 for the second and so on.
Note that if your string is a date, there are better ways to parse / reformat it than directly using regex.
date_vector = c("31032017","28052017","04052022")
as.character(format(as.Date(date_vector, format = "%d%m%Y"), format = "%d.%m.%Y"))
#[1] "31.03.2017" "28.05.2017" "04.05.2022"
If you want to work/do math with dates, omit as.character.

How to remove characters before matching pattern and after matching pattern in R in one line?

I have this vector Target <- c( "tes_1123_SS1G_340T01", "tes_23_SS2G_340T021". I want to remove anything before SS and anything after T0 (including T0).
Result I want in one line of code:
SS1G_340 SS2G_340
Code I have tried:
gsub("^.*?SS|\\T0", "", Target)
We can use str_extract
library(stringr)
str_extract(Target, "SS[^T]*")
#[1] "SS1G_340" "SS2G_340"
Try this:
gsub(".*(SS.*)T0.*","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Why it works:
With regex, we can choose to keep a pattern and remove everything outside of that pattern with a two-step process. Step 1 is to put the pattern we'd like to keep in parentheses. Step 2 is to reference the number of the parentheses-bound pattern we'd like to keep, as sometimes we might have multiple parentheses-bound elements. See the example below for example:
gsub(".*(SS.*)+(T0.*)","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Note that I've put the T0.* in parentheses this time, but we still get the correct answer because I've told gsub to return the first of the two parentheses-bound patterns. But now see what happens if I use \\2 instead:
gsub(".*(SS.*)+(T0.*)","\\2",Target)
[1] "T01" "T021"
The .* are wild cards by the way. If you'd like to learn more about using regex in R, here's a reference that can get you started.

How to write a regex OR statement within strapply in R

I have been using strapplyc in R to select different portions of a string that match one particular set of criteria. These have worked successfully until I found a portion of the string where the required portion could be defined one of two ways.
Here is an example of the string which is liberally sprinkled with \t:
\t\t\tsome words here\t\t\tDefect: some more words here Action: more words
I can write the strapply statement to capture the text between Defect: and the start of Action:
strapplyc(record[i], "Defect:(.*?)Action")
This works and selects the chosen text between Defect: and Action. In some cases there is no action section to the string and I've used the following code to capture these cases.
strapplyc(record[i], "Defect:(.*?)$")
What I have been trying to do is capture the text that either ends with Action, or with the end of the string (using $).
This is the bit that keeps failing. It returns nothing for either option. Here is my failing code:
strapplyc(record[i], "Defect:(.*?)Action|$")
Any idea where I'm going wrong, or a better solution would be much appreciated.
If you are up for a more efficient solution, you could drop the .*? matching and unroll your pattern like:
Defect:((?:[^A]+|A(?!ction))*)
This matches Defect: followed by any amount of characters that are not an A or are an A and not followed by ction. This avoids the expanding that is needed for the lazy dot matching. It will work for both ways, as it does stop matching when it hits Action or the end of your string.
As suggested by Wiktor, you can also use
Defect:([^A]*(?:A(?!ction)[^A]*)*)
Which is a little bit faster when there are many As in the string.
You might want to consider to use A(?!ction:) or A(?!ction\s*:), to avoid false early matches.
The alternation operator | is the regex operator with the lowest precedence. That means the regex Defect:(.*?)Action|$ is actually a combination of Defect:(.*?)Action and $ - since an empty string is a valid match for $, your regex returns the empty string.
To solve that, you should combine the regexes Defect:(.*?)Action and Defect:(.*?)$ with an OR:
Defect:(.*?)Action|Defect:(.*?)$
Or you can enclose Action|$ in a group as Sebastian Proske said in the comments:
Defect:(.*?)(?:Action|$)

Resources