Remove last special character in r - r

I have a regex that splits a string into chunks of at most 15 characters at word boundaries, which I then join with a newline (\n). I need to remove the last newline if only one character follows it at the end of the string (the example string below ends with a lone bracket).
m <- gregexpr(pattern = paste("(.{1,",15,"}\\b|.{",1,"})",sep = ""), text = txt, perl = TRUE)
split_txt = trimws(unlist(regmatches(x = txt, m = m)))
paste(split_txt, collapse = "\n")
This results in:
Client News V2\n(design, AI\n)
The required output is:
Client News V2\n(design, AI)
Is there an easy way to do that? Many thanks in advance.

You can use sub with the regex \n(.)$, which matches a newline followed by exactly one character at the end of the string and captures that character. Replacing the match with the captured character removes the last newline only when it is followed by a single trailing character:
s = "Client News V2\n(design, AI\n)"
sub("\n(.)$", "\\1", s)
# [1] "Client News V2\n(design, AI)"
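To check that the replacement only fires when a single character follows the final newline, here is a minimal base R sketch. The helper name drop_lone_break and the second test string are made up for illustration.

```r
# Hypothetical helper: remove the last newline only when exactly one
# character follows it at the end of the string.
drop_lone_break <- function(x) sub("\n(.)$", "\\1", x)

one_char_tail <- "Client News V2\n(design, AI\n)"    # last line is just ")"
two_char_tail <- "Client News V2\n(design, AI\nab"   # last line has two characters

fixed     <- drop_lone_break(one_char_tail)
unchanged <- drop_lone_break(two_char_tail)

print(fixed)      # the lone ")" is pulled back onto the previous line
print(unchanged)  # left as-is because two characters follow the newline
```

Because sub replaces only the first match and the pattern is anchored with $, earlier newlines in the string are never touched.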

Related

How to turn the Web of Science advanced query into regular expression in R?

To do advanced search in Web of Science, we could use query like:
TI = ("ecology" AND ("climate change" OR "biodiversity"))
This means we want to extract papers whose titles contain "ecology" and ("climate change" or "biodiversity"). The corresponding R expression would be (here TI is a character vector of titles):
library(stringr)
str_detect(TI,"ecology") & str_detect(TI,"climate change|biodiversity")
Is there any way to get the regular expression from the WoS query?
1) Firstly we need to define the question more precisely. We assume that a WoS query is a character string containing AND, OR, NOT, parentheses and fixed character strings in lower case or mixed case possibly surrounded by double quotes (this excludes upper case AND or OR appearing within double quotes unless part of a longer string). We assume that we wish to generate a character string holding an R statement containing str_detect instances such as that shown in the question but not necessarily identical to the example shown as long as it satisfies the above.
For AND, OR and NOT we just replace them with the operators &, | and & !. We then replace each instance of a word character followed by spaces followed by word character with the same except the spaces are replaced with an underscore. We then replace any string of word characters that is not quoted with that string surrounded by quotes and finally we revert the underscores to spaces.
If s is the resulting string then eval(parse(text = s)[[1]]) could be used to evaluate it against target.
wos2stmt does not use any packages but the generated statement depends on stringr due to the use of str_detect for consistency with the question.
wos2stmt <- function(TI, target = "target") {
TI |>
gsub(pattern = "\\bNOT\\b", replacement = "& !") |>
gsub(pattern = "\\bAND\\b", replacement = "&") |>
gsub(pattern = "\\bOR\\b", replacement = "|") |>
gsub(pattern = "(\\w) +(\\w)", replacement = "\\1_\\2") |>
gsub(pattern = '(?<!")\\b(\\w+)\\b(?!")', replacement = '"\\1"', perl = TRUE) |>
gsub(pattern = "_", replacement = " ") |>
gsub(pattern = '("[^"]+")', replacement = sprintf("str_detect(%s, \\1)", target)) |>
gsub(pattern = '"(\\w)', replacement = r"{"\\\\b\1}") |>
gsub(pattern = '(\\w)"', replacement = r"{\1\\\\b"}")
}
# test
TI <- '"ecology" AND ("climate change" OR "biodiversity")'
stmt <- wos2stmt(TI)
cat(stmt, "\n")
giving:
## str_detect(target, "\\becology\\b") & (str_detect(target, "\\bclimate change\\b") | str_detect(target, "\\bbiodiversity\\b"))
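The evaluation step mentioned above, eval(parse(text = s)[[1]]), can be sketched as follows. This is only an illustration under stated assumptions: the statement text is written out by hand in the style wos2stmt generates, the titles in target are made up, and str_detect is replaced by a base R stand-in so the sketch runs without stringr.

```r
# Base R stand-in for stringr::str_detect so the sketch has no package
# dependency (an assumption; with stringr loaded, drop this line).
str_detect <- function(string, pattern) grepl(pattern, string, perl = TRUE)

# Statement text in the style wos2stmt generates, written out by hand
# as a raw string (raw strings require R >= 4.0).
stmt <- r"[str_detect(target, "\\becology\\b") & (str_detect(target, "\\bclimate change\\b") | str_detect(target, "\\bbiodiversity\\b"))]"

# Made-up titles to search
target <- c("ecology and climate change",
            "ecology of wetlands",
            "biodiversity and ecology",
            "climate change policy")

# Parse the text into an expression and evaluate it where `target` is visible
hits <- eval(parse(text = stmt)[[1]])
print(target[hits])
```

Only the first and third titles satisfy the query, since both contain "ecology" together with one of the other two terms.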
2) The body of the question refers to generating R statements with str_detect, but its title refers to generating regular expressions. For the latter, we accept a WoS query and output a single regular expression for use with str_detect, like this. I haven't tested this much, so you will need to do that to explore its limitations.
Note that, unlike (1), this addresses only the original question, which we take as not including NOT or automatic quoting (they are not mentioned in the question as requirements).
wos2rx <- function(TI) {
TI |>
gsub(pattern = ' *\\bOR\\b *', replacement = '|') |>
gsub(pattern = ' *\\bAND\\b *', replacement = '') |>
gsub(pattern = ' *"([^"]+)" *', replacement = '(?=.*\\1)')
}
# test
library(stringr)
TI <- '("ecology" AND ("climate change" OR "biodiversity"))'
rx <- wos2rx(TI)
str_detect("biodiversity ecology", rx)
## [1] TRUE
str_detect("climate change biodiversity", rx)
## [1] FALSE
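To make the mechanism visible, the sketch below writes out by hand a regex of the shape wos2rx produces for the test query and applies it with base R's grepl instead of str_detect. Using grepl with perl = TRUE is a substitution made here so the example needs no packages.

```r
# Regex of the shape wos2rx produces for the test query, written out by hand.
# Each (?=.*word) is a lookahead: it checks that `word` occurs somewhere ahead
# without consuming any text, so consecutive lookaheads act as AND,
# while | between lookaheads acts as OR.
rx <- "((?=.*ecology)((?=.*climate change)|(?=.*biodiversity)))"

titles <- c("biodiversity ecology", "climate change biodiversity")

# perl = TRUE is needed because base R's default regex engine
# does not support lookaheads
res <- grepl(rx, titles, perl = TRUE)
print(res)
```

The second title fails because it lacks "ecology", even though it contains both of the OR terms.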

replace element of string 3 - 10 positions after a pattern

Ideally, in base R I need some kind of string manipulation that will let me detect a pattern and change the string 3 positions after the pattern.
example <- "when string says SOMETHING = #c792ea"
desired output:
when string says SOMETHING = #001628
I have tried gsub but I am not sure how I can get it to replace the characters after a pattern.
If it is based on the position of the character, then we can use substring assignment
substring(example, 30) <- "#001628"
example
#[1] "when string says SOMETHING = #001628"
Or if we need to find the position of the word that starts with #
library(stringr)
posvec <- c(str_locate(example, "#\\w+"))
substring(example, posvec[1], posvec[2]) <- "#001628"
# or with
# str_sub(example, posvec[1], posvec[2]) <- "#001628"
Another option is sub, which replaces everything from the = (followed by zero or more spaces, \\s*) to the end of the string
sub("=\\s*.*", "= #001628", example)
#[1] "when string says SOMETHING = #001628"
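If text may follow the colour code, or the replacement should depend on the preceding marker, two more base R variants can be sketched. Treating SOMETHING as the literal marker to look for is an assumption taken from the example string.

```r
example <- "when string says SOMETHING = #c792ea"

# Option A: capture the marker and the "=" as group 1, match the colour code
# that follows, then put group 1 back and append the new value.
out_a <- sub("(SOMETHING\\s*=\\s*)#\\w+", "\\1#001628", example)

# Option B: locate the first "#word" with regexpr and assign over that match.
out_b <- example
regmatches(out_b, regexpr("#\\w+", out_b)) <- "#001628"

print(out_a)
print(out_b)
```

Option A leaves anything after the colour code intact, whereas the =\\s*.* pattern discards the rest of the string.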

Remove ,"" and replacing '-' to '.'

I am working with single cell data.
I am trying to match the cell barcodes I extracted against another dataset, but the barcode formats are different.
Barcode I extracted: ,"SAMPLE_AAAGCAAAGATACACA-1_1" (weirdly, it saved with a comma at the front)
Barcode I want: SAMPLE_AAAGCAAAGATACACA.1_1
Which function should I use to remove the ," and the trailing " and to replace the - with a .?
Is this what you want?
Data:
x <- ',"SAMPLE_AAAGCAAAGATACACA-1_1"'
Solution:
cat(gsub(',', '', gsub('(?<=[A-Z])-(?=\\d)', '\\.', x, perl = TRUE)))
"SAMPLE_AAAGCAAAGATACACA.1_1"
Here we use 'nested' gsub to first change the hyphen into the period and then to delete the comma.
If you need it without double quote marks:
cat(gsub(',"|"$', '', gsub('(?<=[A-Z])-(?=\\d)', '\\.', x, perl = TRUE)))
SAMPLE_AAAGCAAAGATACACA.1_1
The following are some alternatives.
1) chartr/trimws Assume the test data v below. We replace each dash (minus sign) with a dot using chartr, and then strip all commas and double quotes from both ends using trimws. If you have a very old version of R you will need to upgrade, since the whitespace= argument was only added in R 3.6.0. No packages are used.
Note that the double quotes shown in the output are not part of the strings but are just how R displays character vectors.
# test input
v <- c(',"SAMPLE_AAAGCAAAGATACACA-1_1"', ',"SAMPLE_AAAGCAAAGATACACA-1_1"')
trimws(chartr("-", ".", v), whitespace = '[,"]')
## [1] "SAMPLE_AAAGCAAAGATACACA.1_1" "SAMPLE_AAAGCAAAGATACACA.1_1"
2) gsubfn gsubfn in the package of the same name can map all minus characters to dot and commas and double quotes to empty strings in a single command. The second argument defines the mapping.
This substitutes all double quotes, commas and minus signs. If there are embedded double quotes and commas (i.e. not on the ends) that are not to be substituted, then use (1), which only trims commas and double quotes off the ends.
library(gsubfn)
gsubfn('.', list('"' = '', ',' = '', '-' = '.'), v)
## [1] "SAMPLE_AAAGCAAAGATACACA.1_1" "SAMPLE_AAAGCAAAGATACACA.1_1"
3) read.table/chartr This also uses only base R. Read in the input using read.table separating fields on comma and keeping only the second field. This will also remove the double quotes. Then use chartr to replace minus signs with dot.
This assumes that the only double quotes are the ones surrounding the field and all minus signs are to be replaced by dot. Embedded commas will be handled properly.
chartr("-", ".", read.table(text = v, sep = ",")[[2]])
## [1] "SAMPLE_AAAGCAAAGATACACA.1_1" "SAMPLE_AAAGCAAAGATACACA.1_1"
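Since the goal is to match the cleaned barcodes against another dataset, the cleaning and the lookup can be sketched together in base R. The helper name clean_barcode, the second extracted barcode and the reference vector are all made up for illustration.

```r
# Hypothetical helper: strip a leading ," and a trailing ", then turn - into .
clean_barcode <- function(x) chartr("-", ".", gsub('^,?"|"$', "", x))

extracted <- c(',"SAMPLE_AAAGCAAAGATACACA-1_1"',
               ',"SAMPLE_TTTGCAAAGATACACA-1_2"')

# Made-up reference barcodes standing in for the other dataset
reference <- c("SAMPLE_AAAGCAAAGATACACA.1_1",
               "SAMPLE_CCCGCAAAGATACACA.1_1")

cleaned <- clean_barcode(extracted)
matched <- cleaned %in% reference   # TRUE where a cleaned barcode is in the reference

print(cleaned)
print(matched)
```

Anchoring the pattern with ^ and $ keeps any quotes or commas inside the barcode untouched, in the same spirit as alternative (1).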

Get rid of extra sep in the paste function in R

I am trying to get rid of the extra sep in the paste function in R.
It looks easy but I cannot find a non-hacky way to fix it. Assume l1-l3 are lists
l1 = list(a=1)
l2 = list(b=2)
l3 = list(c=3)
l4 = list(l1,l2=l2,l3=l3)
note that the first element of l4 is not named. Now I want to add a constant to the names like below:
names(l4 ) = paste('Name',names(l4),sep = '.')
Here is the output:
names(l4)
[1] "Name." "Name.l2" "Name.l3"
How can I get rid of the . in the first element of the output (Name.)?
We can use trimws (from R 3.6.0 onwards, the whitespace argument accepts a custom character class)
trimws(paste('Name',names(l4),sep = '.'), whitespace = "\\.")
#[1] "Name" "Name.l2" "Name.l3"
Or with sub to match the . (. is a metacharacter that matches any character, so we escape it with \\ to get the literal meaning) at the end ($) of the string and replace it with blank ("")
sub("\\.$", "", paste('Name',names(l4),sep = '.'))
If some names already end with a . that must be kept, we can instead use ifelse and paste only where the name is non-empty (the output below assumes the second name is "l2.")
ifelse(nzchar(names(l4)), paste("Name", names(l4), sep="."), "Name")
#[1] "Name" "Name.l2." "Name.l3"
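The difference between the three approaches only shows up when a name itself ends in a dot. A minimal sketch, assuming a made-up names vector whose second element is "l2.":

```r
nms <- c("", "l2.", "l3")   # made-up names; the second one ends in "."

pasted <- paste("Name", nms, sep = ".")

# Both of these strip any trailing dot, including the one that was
# part of the original name "l2."
by_trimws <- trimws(pasted, whitespace = "\\.")   # needs R >= 3.6.0
by_sub    <- sub("\\.$", "", pasted)

# This one only avoids adding the separator when the name is empty,
# so a dot that belongs to the name is kept
by_ifelse <- ifelse(nzchar(nms), pasted, "Name")

print(by_trimws)
print(by_sub)
print(by_ifelse)
```

Which variant is appropriate depends on whether a trailing dot in an existing name is meaningful.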

Strings in R - Insert a desired/arbitrary digit between a string and digit

I have a character vector that looks like this:
questions <- c("question1", "question10", "question11", "question12",
"question13", "question14", "question15", "question16", "question17",
"question18", "question2", "question3", "question4", "question5", "question6",
"question7", "question8", "question9")
I want to insert a 0 between "question" and a single digit so that the character vector looks like:
questions <- c("question01", "question10", "question11", "question12",
"question13", "question14", "question15", "question16", "question17",
"question18", "question02", "question03", "question04", "question05",
"question06", "question07", "question08", "question09")
Notice that strings where "question" is followed by two digits, e.g. "question10" or "question18", are unaffected.
I am new to pattern matching. I have tried the following code:
gsub(pattern = "(\\D*)(\\d{1})", replacement = "0\\1", x = mydf6$Question, perl = TRUE)
However, it is not giving the desired result.
Any help would be appreciated.
Try
gsub("(?<=[a-z])(\\d)$", "0\\1", mydf6$Question, perl = TRUE)
This subs in a zero, but only if the string ends with a single digit, preceded by a lowercase letter.
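A small self-contained check of this pattern, together with an alternative that extracts the number and lets sprintf do the padding, might look like the sketch below. The shortened questions vector is illustrative, and the sprintf variant assumes every element has the fixed prefix "question" followed only by digits.

```r
questions <- c("question1", "question10", "question2", "question18", "question9")

# Regex approach: a lone final digit preceded by a lowercase letter gets a 0
padded_rx <- gsub("(?<=[a-z])(\\d)$", "0\\1", questions, perl = TRUE)

# Alternative sketch: drop the leading non-digits, convert to integer,
# and let sprintf pad the number to two digits
num <- as.integer(sub("\\D+", "", questions))
padded_fmt <- sprintf("question%02d", num)

print(padded_rx)
print(padded_fmt)
```

The sprintf variant changes only the width in the format string if three-digit padding is ever needed, whereas the regex would have to be extended.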

Resources