Telegraf processors.regex: create a new tag based on two patterns

I am using Telegraf's [[processors.regex]] to create a replacement tag.
I need to add a new tag "custom_instance" derived from the existing tag "instance":
if the value contains "#", keep everything up to the first occurrence of "#";
if the value does not contain "#", simply copy the value unchanged.
Actual output
> Process,host=A700459-W10,custom_instance=chrome,instance=chrome#28,objectname=Process
> Process,host=A700459-W10,instance=CmRcService,objectname=Process
No custom_instance tag is created for CmRcService.
Expected output
> Process,host=A700459-W10,custom_instance=chrome,instance=chrome#28,objectname=Process
> Process,host=A700459-W10,custom_instance=CmRcService,instance=CmRcService,objectname=Process
I have two patterns. The pattern "^(.*?)#.*$" works and creates the new tag, while "^[^#]+$" does not, and I am not sure why.
[[processors.regex]]
namepass = ["Process"]
# Tag and field conversions defined in a separate sub-tables
[[processors.regex.tags]]
## Tag to change
key = "instance"
## Regular expression to match on a tag value
pattern = "^(.*?)#.*$"
## Matches of the pattern will be replaced with this string. Use ${1}
## notation to use the text of the first submatch.
replacement = "${1}"
result_key = "custom_instance"
[[processors.regex.tags]]
## Tag to change
key = "instance"
## Regular expression to match on a tag value
pattern = "^[^#]+$"
## Matches of the pattern will be replaced with this string. Use ${1}
## notation to use the text of the first submatch.
replacement = "${1}"
result_key = "custom_instance"
Is there another way to create the new tag from the existing value?
If the value contains "#", match up to its first occurrence;
otherwise simply copy the value.

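Figured it out.
The second pattern has no capture group, so "${1}" expands to an empty string, which is presumably why no tag was written. The second pattern needs replacement = "${0}", which refers to the entire match: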
[[processors.regex]]
namepass = ["Process"]
# Tag and field conversions defined in a separate sub-tables
[[processors.regex.tags]]
## Tag to change
key = "instance"
## Regular expression to match on a tag value
pattern = "^(.*?)#.*$"
## Matches of the pattern will be replaced with this string. Use ${1}
## notation to use the text of the first submatch.
replacement = "${1}"
result_key = "custom_instance"
[[processors.regex.tags]]
## Tag to change
key = "instance"
## Regular expression to match on a tag value
pattern = "^[^#]+$"
## Matches of the pattern will be replaced with this string. This pattern has
## no capture group, so use ${0} to refer to the text of the entire match.
replacement = "${0}"
result_key = "custom_instance"
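
As a cross-check of the regex logic only, here is a minimal Python sketch. Telegraf uses Go's regexp engine rather than Python's `re`, so this illustrates the matching, not Telegraf itself. It tries a single hypothetical pattern, `^([^#]*).*$` (an assumption, not tested in Telegraf), whose one capture group covers both cases.

```python
import re

# Illustration only: Telegraf uses Go's regexp package, not Python's `re`.
# Hypothetical single pattern: capture everything before the first "#", if any.
SINGLE_PATTERN = re.compile(r"^([^#]*).*$")


def custom_instance(instance: str) -> str:
    """Return the part of `instance` before the first '#', or all of it."""
    match = SINGLE_PATTERN.match(instance)
    return match.group(1) if match else instance


for value in ("chrome#28", "CmRcService"):
    print(value, "->", custom_instance(value))
```

If the pattern behaves the same way in Telegraf, one [[processors.regex.tags]] block with this pattern and replacement = "${1}" might replace the two blocks above; that is untested here.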

Related

Add string if the pattern matched using Regex

I have a data table and I want to prepend a string (FirstWord!) to the values of one column wherever the pattern (letter(s) digits:letter(s) digits) matches, as shown below:
ColName
New test defiend
G54:Y23 (matched)
test:New
The expected results would be
New test defiend
FirstWord!G54:Y23
test:New
dt[, ColName := ColName %>% str_replace('(?<=\d)\:(?=[[:upper:]])',
paste0("'FirstWord!'",.))]
I don't know how to add "FirstWord!" when the pattern is found in ColName.
transform(df, ColName = sub("([A-Z][0-9]+:[A-Z]+[0-9]+)", 'FirstWord!\\1', ColName))
ColName
1 New test defiend
2 FirstWord!G54:Y23
3 test:New
You can use sub with a backreference:
sub("([A-Z]+\\d+:[A-Z]+)", "FirstWord!\\1", txt)
[1] "ColName" "New test defiend" "FirstWord!G54:Y23 (matched)" "test:New"
Here, we wrap the pattern (upper-case letter(s), digit(s), a colon, upper-case letter(s)) in a capturing group so that we can refer to it with the backreference \\1 in sub's replacement argument, where we also add the desired "FirstWord!" string. The trailing digits do not need to be matched, because text outside the match is left untouched.
If the pattern should not be case-sensitive, just add (?i) to the front of the pattern.
Data:
txt <- c("ColName","New test defiend","G54:Y23 (matched)","test:New")
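
For comparison, the same backreference idea can be sketched in Python. The function name `add_prefix` and the sample rows are illustrative assumptions; the pattern mirrors the "letter(s) digits:letter(s) digits" description from the question.

```python
import re

# Upper-case letter(s) + digit(s), a colon, upper-case letter(s) + digit(s).
CELL_RANGE = re.compile(r"([A-Z]+\d+:[A-Z]+\d+)")


def add_prefix(value: str) -> str:
    """Prepend 'FirstWord!' to the first match; leave other values unchanged."""
    # \1 is a backreference to the captured text, as in R's "\\1".
    return CELL_RANGE.sub(r"FirstWord!\1", value, count=1)


rows = ["New test defiend", "G54:Y23", "test:New"]
print([add_prefix(row) for row in rows])
```

Values that do not match the pattern are returned as they are, so no separate "if matched" branch is needed.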

Extract string with digits and special characters in r

I have a list of filenames in the format "filename PID00-00-00" or just "PID00-00-00".
I want to extract part of the filename to create an ID column.
I am currently using this code for the string extraction
names(df) <- stringr::str_extract(names(df), "(?<=PID)\\d+")
binded1 <- rbindlist(df, idcol = "ID") %>%
  as.data.frame()
This gives the ID as the first set of digits after PID. e.g. filename PID1234-00-01 becomes ID 1234.
I want to also extract the first hyphen and following digits. So from filename PID1234-00-01 I want 1234-00.
What should my regex be?
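Try this (it assumes a four-digit ID followed by a two-digit group; use \\d+-\\d+ instead if the lengths vary):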
stringr::str_extract(names(df),"(?<=PID)\\d{4}-\\d{2}")
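
A length-agnostic variant of the same lookbehind idea, sketched in Python for illustration. The helper name `extract_id` is hypothetical; the sample filenames come from the question.

```python
import re

# Digits right after "PID", then the first hyphen and the digits that follow it.
PID_PATTERN = re.compile(r"(?<=PID)\d+-\d+")


def extract_id(filename):
    """Return e.g. '1234-00' from 'filename PID1234-00-01', or None if absent."""
    match = PID_PATTERN.search(filename)
    return match.group(0) if match else None


print(extract_id("filename PID1234-00-01"))
print(extract_id("PID00-00-00"))
```

The lookbehind keeps "PID" out of the result, and `\d+-\d+` stops before the second hyphen because a hyphen is not a digit.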

replace element of string 3 - 10 positions after a pattern

Ideally, in base R I need some kind of string manipulation that will let me detect a pattern and change the string 3 positions after the pattern.
example <- "when string says SOMETHING = #c792ea"
desired output:
when string says SOMETHING = #001628
I have tried gsub but I am not sure how I can get it to replace the characters after a pattern.
If it is based on the position of the character, then we can use substring assignment:
substring(example, 30) <- "#001628"
example
#[1] "when string says SOMETHING = #001628"
Or, if we need to find the position of the word that starts with #:
library(stringr)
posvec <- c(str_locate(example, "#\\w+"))
substring(example, posvec[1], posvec[2]) <- "#001628"
# // or with
# str_sub(example, posvec[1], posvec[2]) <- "#001628"
Another option is to use sub to replace the = together with any following whitespace (\\s*) and the rest of the string:
sub("=\\s*.*", "= #001628", example)
#[1] "when string says SOMETHING = #001628"
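
A narrower variant, sketched in Python as an illustration: instead of replacing everything after the "=", it replaces only a "#"-prefixed hex value that follows it. The function name and the assumption that the value is a hex colour are mine, not the asker's.

```python
import re

example = "when string says SOMETHING = #c792ea"


def replace_hex_after_equals(text, new_color):
    """Swap the '#xxxxxx' that follows '=' while keeping the '=' and spacing."""
    # Group 1 holds "=" plus any whitespace; the function puts it back unchanged.
    return re.sub(
        r"(=\s*)#[0-9A-Fa-f]+",
        lambda m: m.group(1) + new_color,
        text,
        count=1,
    )


print(replace_hex_after_equals(example, "#001628"))
```

Anchoring on the "=" rather than on a fixed character position keeps the replacement working if the text before it changes length.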

RegEx for a conditional pattern in a string

I need to extract substrings from some strings, for example:
My data is a vector: c("Shigella dysenteriae","PREDICTED: Ceratitis")
a = "Shigella dysenteriae"
b = "PREDICTED: Ceratitis"
If the string starts with "PREDICTED:", I want to extract the word that follows it (here "Ceratitis"); if the string does not start with "PREDICTED:", I want to extract the first word (here "Shigella").
In this example, the result would be:
result_of_a = "Shigella"
result_of_b = "Ceratitis"
This looks like a typical case for a conditional regular expression. I tried, but it always failed.
I use R, which supports Perl-compatible regular expressions, so I tried the two functions regexpr and regmatches to extract the substrings that I want.
The code is:
pattern = "(?<=PREDICTED:)?(?(1)(\\s+\\w+\\b)|(\\w+\\b))"
a = c("Shigella dysenteriae")
m_a = regexpr(pattern,a,perl = TRUE)
result_a = regmatches(a,m_a)
b = c("PREDICTED: Ceratitis")
m_b = regexpr(pattern,b,perl = TRUE)
result_b = regmatches(b,m_b)
Finally, the result is:
# result_a = "Shigella"
# result_b = "PREDICTED"
This is not the result I expect: result_a is right, but result_b is wrong.
Why? It seems that the condition did not work.
PS:
I read about conditional regular expressions at https://www.regular-expressions.info/conditional.html, tried to imitate the pattern from that page, and also used the RegexBuddy software to find the reason.
EDIT:
To use the function below on a vector, one can do:
myvec <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")
lapply(myvec,extractor)
[[1]]
[1] "Shigella"
[[2]]
[1] "Ceratitis"
Or:
unlist(lapply(myvec,extractor))
[1] "Shigella" "Ceratitis"
This assumes that the strings are always in the format shown above:
extractor <- function(string) {
  if (grepl("^PREDICTED", string)) {
    strsplit(string, ": ")[[1]][2]
  } else {
    strsplit(string, " ")[[1]][1]
  }
}
extractor(b)
#[1] "Ceratitis"
extractor(a)
#[1] "Shigella"
I think it does not work because (?(1)...) checks whether capture group 1 has been set, but no first capture group has been set at that point, and the positive lookbehind (?<=PREDICTED:)? does not contain one either.
The first and second capture groups only appear in the parts that follow. The condition checks group 1, finds it unset, and therefore matches the second alternative (group 2).
If you made it the only capture group, as in (?<=(PREDICTED: )?), and omitted the other two, the condition would be true, but you would get an error because the lookbehind assertion is not fixed length.
Instead of using a conditional pattern, you could make PREDICTED: optional and use a single capturing group for the word:
^(?:PREDICTED: )?(\w+)
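
The optional-prefix pattern above can be tried out as follows; this is a Python sketch for illustration (the helper name `first_word` is hypothetical), using the two strings from the question.

```python
import re

# Optional non-capturing "PREDICTED: " prefix, then the first word in group 1.
FIRST_WORD = re.compile(r"^(?:PREDICTED: )?(\w+)")


def first_word(text):
    """Return the first word, skipping a leading 'PREDICTED: ' if present."""
    match = FIRST_WORD.match(text)
    return match.group(1) if match else None


for value in ("Shigella dysenteriae", "PREDICTED: Ceratitis"):
    print(first_word(value))
```

Because the prefix is optional and non-capturing, group 1 always holds the wanted word, so no conditional is needed.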
If I understand correctly, the OP wants to extract
- the first word after "PREDICTED:" if the string starts with "PREDICTED:", or
- the first word of the string if the string does not start with "PREDICTED:".
So, if there is no specific requirement to use only one regex, this is what I would do:
1. Remove the leading "PREDICTED:", if any.
2. Extract the first word from the intermediate result.
For working with regex, I prefer to use Hadley Wickham's stringr package:
inp <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")
library(magrittr) # piping used to improve readability
inp %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")
[1] "Shigella" "Ceratitis"
To be on the safe side, I would remove any leading spaces beforehand:
inp %>%
stringr::str_trim() %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")

How to do a replace with backreferences, when the number of occurrences is unknown?

In order to make a few corrections to a .tex file generated by Bookdown, I need to replace occurrences of }{ with , when it is used in a citation, i.e.
s <- "Text.\\autocites{REF1}{REF2}{REF3}. More text \\autocites{REF4}{REF5} and \\begin{tabular}{ll}"
Should become
"Text.\\autocites{REF1,REF2,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
Because I need to keep the references I tried to look into backreferences, but I cannot seem to get it right, because the number of groups to match is unknown beforehand. Also, I cannot do stringr::str_replace_all(s, "\\}\\{", ","), because }{ occurs in other places in the document as well.
My best approach so far is to use a look-behind to only do the replacement when the occurrence comes after \\autocites, but then I cannot get the backreferences and grouping right:
stringr::str_replace_all(s, "(?<=\\\\autocites\\{)([:alnum:]+)(\\}\\{)", "\\1,")
[1] "Text.\\autocites{REF1,REF2}{REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
stringr::str_replace_all(s, "(?<=\\\\autocites\\{)([:alnum:]+)((\\}\\{)([:alnum:]+))*", "\\1,\\4")
[1] "Text.\\autocites{REF1,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
I might be missing some completely obvious approach, so I hope someone can help.
pat matches autocites followed by the shortest string that ends in } and is followed by either the end of the string or a character other than {.
The code below then uses gsubfn to replace each occurrence of }{ within that match with a comma. It uses formula notation to express the replacement function: the body of the function is on the RHS of the ~, and because the body contains ..1 the arguments are taken to be ... . It does not use zero-width lookahead or lookbehind.
library(gsubfn)
pat <- "(autocites.*?\\}($|[^{]))"
gsubfn(pat, ~ gsub("}{", ",", ..1, fixed = TRUE), s)
giving:
[1] "Text.\\autocites{REF1,REF2,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
Variation
One minor simplification of the regular expression shown above is to remove the outer parentheses from pat and instead specify backref = 0 in gsubfn. That tells it to pass the entire match to the function. We could use ..1 to specify the argument as above, but since we know that only one argument is passed we can refer to it as x in the body of the function. Any variable name would do, because any free variable is assumed to be an argument. The output is the same as above.
pat2 <- "autocites.*?\\}($|[^{])"
gsubfn(pat2, ~ gsub("}{", ",", x, fixed = TRUE), s, backref = 0)
Cool problem - I got to learn a new trick with str_replace_all. You can pass a function as the replacement, and it is applied to each substring that the pattern picks out.
library(stringr)  # also provides the %>% pipe
replace_brakets <- function(str) {
  str_replace_all(str, "\\}\\{", ",")
}
s %>% str_replace_all("(?<=\\\\autocites\\{)([:alnum:]+\\}\\{)+", replace_brakets)
# [1] "Text.\\autocites{REF1,REF2,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
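
The "replacement as a function" idea used in both answers carries over to other regex libraries. Below is a minimal Python sketch of it, offered as an illustration rather than a translation of either answer. The name `merge_autocites` is hypothetical, and the pattern follows the "shortest string ending in } not followed by {" description.

```python
import re

s = r"Text.\autocites{REF1}{REF2}{REF3}. More text \autocites{REF4}{REF5} and \begin{tabular}{ll}"

# "autocites", then the shortest run ending in "}" that is followed by
# the end of the string or by a character other than "{".
AUTOCITES = re.compile(r"autocites.*?\}(?:$|[^{])")


def merge_autocites(text):
    """Replace '}{' with ',' only inside \\autocites{...}{...} runs."""
    # The callable receives each match and returns its replacement text.
    return AUTOCITES.sub(lambda m: m.group(0).replace("}{", ","), text)


print(merge_autocites(s))
```

Note that `.` does not match a newline here, so an \autocites run broken across lines would need the `re.DOTALL` flag; that case is not covered by the question's example.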

Resources