Str_extract_all where replacement is keeping value between two character strings

First time posting to SO, so apologies in advance. I am also quite new to R, so this may not be possible.
I am trying to format a string (extracted from JSON) that contains a value between two curly braces, like so:
{#link value1}
I am trying to replace {#link value1} with [[value1]] so that it works as a link in my markdown file.
I cannot just replace the opening brace and then the closing brace, because there is also {#b value2}, which should be formatted as **value2**.
I have cobbled together a str_replace call that works if only one link replacement is needed in a string, but I run into an issue when there are two, like so:
str <- c("This is the first {#link value1} and this is the second {#link value2}")
The actual strings are much more varied than this.
My plan was to build a function that takes the type of pattern needed (bold or link) as input, and then pastes the opening and closing strings around the extracted value to form the replacement.
However, that has left me with either
This is the first [[ value1 ]][[ value2 ]] and this is the second [[ value1 ]][[ value2 ]]
or
This is the first [[ value1 ]] and this is the second [[ value1 ]]
Is there a more glamorous way of achieving this without searching from where the last } was replaced?
I looked at the stringr documentation for str_replace, which has an example using a function at the bottom, but I can't decode it well enough to adapt it to my case.
What I'm using to extract the value, including the curly braces:
str_extract_all(str,"(\\{#link ).+?(\\})")
[[1]]
[1] "{#link value1}" "{#link value2}"
What I'm using to extract the value, excluding the curly braces and tag:
str_extract_all(str,"(?<=\\{#link ).+?(?=\\})")
[[1]]
[1] "value1" "value2"

You could use str_replace_all() to perform multiple replacements by passing a named vector (c(pattern1 = replacement1)) to it. References of the form \\1, \\2, etc. will be replaced with the contents of the respective matched group created by ().
str <- c("This is the first {#link value1} and this is the second {#b value2}")
str_replace_all(str, c("\\{#link\\s+(.+?)\\}" = "[[\\1]]",
"\\{#b\\s+(.+?)\\}" = "**\\1**"))
# [1] "This is the first [[value1]] and this is the second **value2**"
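The question also asks about the function-based replacement shown in the str_replace documentation. Below is a minimal sketch of that approach, assuming a stringr version that accepts a function as the replacement argument. The helper name to_markdown and the variable txt are hypothetical, and unknown tags are simply left unchanged.

```r
library(stringr)

# Hypothetical helper: receives the matched "{#tag value}" text (one or more
# matches) and returns the markdown replacement for each.
to_markdown <- function(match) {
  parts <- str_match(match, "\\{#(\\w+)\\s+(.+?)\\}")
  tag   <- parts[, 2]   # e.g. "link" or "b"
  value <- parts[, 3]   # the text between the tag and the closing brace
  ifelse(tag == "link", paste0("[[", value, "]]"),
         ifelse(tag == "b", paste0("**", value, "**"), match))
}

txt <- "This is the first {#link value1} and this is the second {#b value2}"
result <- str_replace_all(txt, "\\{#\\w+\\s+.+?\\}", to_markdown)
result
# [1] "This is the first [[value1]] and this is the second **value2**"
```

The named-vector form is shorter when every tag maps to a fixed wrapper. A function is mainly useful when the replacement depends on the tag in ways a backreference cannot express.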

Related

How to extract a certain part of a string in R using regular expressions

How to convert the following string in R :
this_isastring_12(=32)
so that only the following is kept
isastring_12
Eg
f('this_isastring_12(=32)') returns 'isastring_12'
This should work on other strings with a similar structure, but different characters
Another example with a different string of similar structure
f('something_here_3(=1)') returns 'here_3'

We can use sub to extract everything from the first underscore up to the opening round bracket in the text.
sub(".*?_(.*)\\(.*", "\\1", x)
#[1] "isastring_12" "here_3" "string_4"
where x is
x <- c("this_isastring_12(=32)", "something_here_3(=1)", "another_string_4(=1)")
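To match the f() interface from the question, the same pattern can be wrapped in a function. This is a small sketch. The perl = TRUE argument is an addition here, chosen only to make the lazy quantifier use PCRE semantics explicitly.

```r
# f() as named in the question: keep the text between the first underscore
# and the last opening round bracket.
f <- function(s) sub(".*?_(.*)\\(.*", "\\1", s, perl = TRUE)

f("this_isastring_12(=32)")
# [1] "isastring_12"
f("something_here_3(=1)")
# [1] "here_3"
```

Note that sub() returns the input unchanged when the pattern does not match, so strings without an underscore followed by a bracket pass through as-is.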

You could use the package unglue.
Borrowing Ronak's data :
x <- c("this_isastring_12(=32)", "something_here_3(=1)", "another_string_4(=1)")
library(unglue)
unglue_vec(x, "{=.*?}_{res}({=.*?})")
#> [1] "isastring_12" "here_3" "string_4"
{=.*?} matches anything until what's next is matched, but doesn't extract anything because there's no lhs to the equality
{res}, where the name res could be replaced by anything, matches anything, and extracts it
outside of curly braces, no need to escape characters
unglue_vec() returns an atomic vector of the matches

Assign names to list elements without tilted quotes

I want to assign names to list elements. To do so I execute the following code:
file_names <- gsub("\\..*", "", doc_csv_names)
print(file_names)
"201409" "201412" "201504" "201507" "201510" "201511" "201604" "201707"
names(docs_data) <- file_names
In this case the name of the list element appears with backticks:
docs_data$`201409`
However, in this case the name of the list element appears in the following way:
names(docs_data) <- paste("name", 1:8, sep = "")
docs_data$name1
How can I convert the gsub() result so that I get the latter naming pattern, without the backticks?
gsub() and paste() seem to produce objects of the same class(). What is the difference?

Both gsub and paste return character objects. They differ only in what they do, which you seem to know based on how you use them (gsub replaces instances of your pattern with a desired output in a string of characters, while paste just... pastes).
As for why you get the backticks, that has nothing to do with gsub and everything to do with the fact that you are naming list elements with numbers. Indeed, try
names(docs_data) <- paste(1:8)
and you'll see the same behavior when you access an element by name. R requires (and prints) backticks around any name that is not syntactically valid, and a name made only of digits is not valid: otherwise R could not tell whether 1 means the number 1 or a variable called 1, and that would be chaos. Quoting the name removes the ambiguity. For example, note that
> 1 <- 3
Error in 1 <- 3 : invalid (do_set) left-hand side to assignment
> "1" <- 3 #no problem!
So R is basically handling that for you! This is not a problem when a name starts with a letter. Finally, an easy fix: just add a letter prefix in front of the numbers in your naming pattern, and you'll be able to access the elements without the backticks. For example:
file_names <- paste("file_",gsub("\\..*", "", doc_csv_names),sep="")
should do the trick (or just change "file_" to whatever you want, as long as it's not empty, because then you just have numbers left and the same problem)!
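A short sketch of the options discussed above, using a hypothetical two-element docs_data list as stand-in data. It shows that numeric-looking names remain usable as-is with [[ ]] or backticks, and that base R's make.names() is another way to obtain syntactically valid names.

```r
# Hypothetical stand-in for the real docs_data list
docs_data <- list(10, 20)
names(docs_data) <- c("201409", "201412")

docs_data[["201409"]]   # works without backticks
docs_data$`201409`      # same element, backticks needed with $

# make.names() prefixes an "X" to names that start with a digit
valid_names <- make.names(names(docs_data))
valid_names
# [1] "X201409" "X201412"

names(docs_data) <- valid_names
docs_data$X201409       # no backticks needed now
```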

Regex returns digit string with leading "_"

I am using an R script in the PowerBI Query Editor to find a six-digit numeric string in a description column and add it as a new column to the table. It works EXCEPT where the number string is preceded by a "_" (underscore character).
# 'dataset' holds the input data for this script ##
library(stringr)
# assign regex to variable #
pattern <- "(?:^|\\D)(\\d{6})(?!\\d)"
# define function to use pattern ##
isNewSiteNum = function(x) substr(str_extract(x,pattern),1,6)
# output statement - within adds new column to dataset ##
output <- within(dataset,{NewSiteNum=isNewSiteNum(dataset$LineItemComment)})
The number string can be at the start, the end, or in the middle of the description text. When the number string is preceded by an underscore (_123456 for example), the code returns _12345 instead of 123456. I am not sure how to tell it to skip the underscore but still grab the six digits (without breaking the cases with no leading underscore, which currently work).
regex101.com shows the full match as '_123456' and group 1 as '123456', but my result column has '_12345'. For the case with a leading space the full match is ' 123456', yet my result column is correct. I seem to be missing something, since the full match has 7 characters and the desired group 1 has 6.

The problem was with str_extract, which returns the full match (including the leading non-digit character) rather than the capture group. By using str_match and selecting the group column, I get what I am looking for.
# 'dataset' holds input data
library(stringr)
pattern<-"(?:^|\\D)(\\d{6})(?!\\d)"
SiteNum = function(x) str_match(x, pattern)[,2]
output<-within(dataset,{R_SiteNum2=SiteNum(dataset$ReqComments)})
This does not pick up non-numeric initial characters.
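The difference between the two calls can be seen on a few made-up comments (the comments vector below is hypothetical test data). As a further option, not from the original answer, a lookbehind keeps the leading character out of the match, so str_extract can be used directly.

```r
library(stringr)

pattern <- "(?:^|\\D)(\\d{6})(?!\\d)"
# Hypothetical sample comments
comments <- c("site _123456 ok", "123456 at start",
              "ends with 654321", "too long 1234567")

# Full match: includes the non-digit that precedes the six digits
full <- str_extract(comments, pattern)

# Capture group only: column 2 of the match matrix
grp <- str_match(comments, pattern)[, 2]

# Alternative: lookbehind and lookahead consume nothing around the digits
look <- str_extract(comments, "(?<!\\d)\\d{6}(?!\\d)")

full
# [1] "_123456" "123456"  " 654321" NA
grp
# [1] "123456" "123456" "654321" NA
```

Taking substr(full, 1, 6) on the first element gives "_12345", which is the symptom described in the question.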

Extract substring using regular expression in R

I am new to regular expressions and have read the regex documentation at http://www.gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf. I know similar questions have been posted previously, but I still had a difficult time figuring out my case.
I have a vector of filename strings, and I am trying to extract a substring from each and save it as the new filename. The filenames follow the pattern below:
\w_\w_(substring to extract)_\d_\d_Month_Date_Year_Hour_Min_Sec_(AM or PM)
For example, ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM, ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM, the substring will be MS-15-0452-268 and SP56-01_A
I used
map(strsplit(filenames, '_'),3)
but that failed, because the new filenames can contain _ too.
I turned to regular expressions for more advanced matching and came up with this:
gsub("^[^\n]+_\\d_\\d_\\d_\\d_(AM | PM)$", "", filenames)
but I still did not get what I needed.

You may use
filenames <- c('ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM', 'ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM')
gsub('^(?:[^_]+_){2}(.+?)_\\d+.*', '\\1', filenames)
Which yields
[1] "MS-15-0452-268" "SP56-01_A"
The pattern here is
^ # start of the string
(?:[^_]+_){2} # not _, twice
(.+?) # anything lazily afterwards
_\\d+ # until there's _\d+
.* # consume the rest of the string
The whole match is replaced by the first captured group, which is the part of the filename in question.

Call me a hack. But if that is guaranteed to be the format of all my strings, then I would just use strsplit to hack the name apart, then only keep what I wanted:
string <- 'ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM'
string_bits <- strsplit(string, '_')[[1]]
file_name<- string_bits[3]
file_name
[1] "MS-15-0452-268"
And if you had a vector of many file names, you could remove the explicit [[1]] and use sapply() to get the third element of every one:
sapply(strsplit(filenames, '_'), "[[", 3)
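Because the identifier itself can contain underscores (as the question notes), taking only the third piece would truncate SP56-01_A to SP56-01. A hedged variation on the split approach is to drop a fixed number of fields from each end instead. The sketch below assumes every filename has exactly 2 leading fields and 9 trailing fields (the two numbers, month, date, year, hour, minute, second, AM/PM). The function name extract_id and its parameters are hypothetical.

```r
filenames <- c("ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM",
               "ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM")

# Hypothetical helper: keep whatever sits between the fixed prefix and suffix
extract_id <- function(filename, n_prefix = 2, n_suffix = 9) {
  bits <- strsplit(filename, "_", fixed = TRUE)[[1]]
  stopifnot(length(bits) > n_prefix + n_suffix)  # guard against short names
  keep <- bits[(n_prefix + 1):(length(bits) - n_suffix)]
  paste(keep, collapse = "_")
}

ids <- vapply(filenames, extract_id, character(1), USE.NAMES = FALSE)
ids
# [1] "MS-15-0452-268" "SP56-01_A"
```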

R encoding ASCII backtick

The names in my list are printed with backticks, as shown below. Earlier lists did not have these backticks.
$`1KG_1_14106394`
[1] "PRDM2"
$`1KG_20_16729654`
[1] "OTOR"
I found out that this is an 'ASCII grave accent' and read the R page on encoding types. However, what should I do about it? I am not clear whether this will affect some functions (such as matching on list names) or whether it is OK to leave it as is.
Encoding help page: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
Thanks!

My understanding (and I could be wrong) is that the backticks are just a means of quoting a list name which could not otherwise be used as-is. One example of using backticks to refer to a list name is the case of a name containing spaces:
lst <- list(1, 2, 3)
names(lst) <- c("one", "after one", "two")
If you wanted to refer to the list element containing the number two, you could do this using:
lst[["after one"]]
But if you want to use the dollar sign notation you will need to use backticks:
lst$`after one`
Update:
I just poked around on SO and found a post which discusses a question similar to yours. Backticks in variable names are necessary whenever a variable name would be forbidden otherwise. Spaces are one example, but so is using a reserved keyword as a variable name.
if <- 3 # forbidden because if is a keyword
`if` <- 3 # allowed, because we use backticks
In your case:
Your list has an element whose name begins with a number. The rules for variable names in R are pretty lax, but names cannot begin with a number, hence:
1KG_1_14106394 <- 3 # fails, variable name starts with a number
KG_1_14106394 <- 3 # allowed, starts with a letter
`1KG_1_14106394` <- 3 # also allowed, since escaped in backticks
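On the question of whether the backticks affect matching on list names: they are only quoting syntax used when printing or typing the name, and are not part of the name itself. A small sketch, rebuilding the two elements shown in the question:

```r
lst <- list("PRDM2", "OTOR")
names(lst) <- c("1KG_1_14106394", "1KG_20_16729654")

names(lst)                               # the stored names contain no backticks
has_tick <- grepl("`", names(lst), fixed = TRUE)
has_tick
# [1] FALSE FALSE

"1KG_1_14106394" %in% names(lst)         # matching on names works as usual
# [1] TRUE
lst[["1KG_20_16729654"]]                 # [[ ]] access needs no backticks
# [1] "OTOR"
```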
