Get rid of extra sep in the paste function in R - r

I am trying to get rid of the extra sep in the paste function in R.
It looks easy but I cannot find a non-hacky way to fix it. Assume l1-l3 are lists
l1 = list(a=1)
l2 = list(b=2)
l3 = list(c=3)
l4 = list(l1,l2=l2,l3=l3)
Note that the first element of l4 is not named. Now I want to add a constant prefix to the names, like below:
names(l4 ) = paste('Name',names(l4),sep = '.')
Here is the output:
names(l4)
[1] "Name." "Name.l2" "Name.l3"
How can I get rid of the . in the first name (Name.)?

We can use trimws (from R 3.6.0, its whitespace argument accepts a custom regular expression for the characters to strip)
trimws(paste('Name',names(l4),sep = '.'), whitespace = "\\.")
#[1] "Name" "Name.l2" "Name.l3"
Or with sub, matching the . at the end ($) of the string and replacing it with a blank (""). Since . is a metacharacter that matches any character, we escape it (\\.) to get the literal meaning.
sub("\\.$", "", paste('Name',names(l4),sep = '.'))
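If a name can itself end in ., stripping the trailing . would also damage that name. In that case we can build the names conditionally, adding the separator only where the name is non-empty (the output below appears to be for a variant of l4 whose second element is named l2., which keeps its trailing dot)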
ifelse(nzchar(names(l4)), paste("Name", names(l4), sep="."), "Name")
#[1] "Name" "Name.l2." "Name.l3"
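For reuse, the conditional approach can be wrapped in a small helper. This is only a sketch: prefix_names is a hypothetical name, not a base R function, and it assumes that unnamed elements should simply receive the bare prefix.

```r
# Hypothetical helper: prefix existing names; unnamed elements get the bare prefix
prefix_names <- function(x, prefix = "Name", sep = ".") {
  nms <- names(x)
  if (is.null(nms)) nms <- rep("", length(x))  # list with no names at all
  nms[is.na(nms)] <- ""
  names(x) <- ifelse(nzchar(nms), paste(prefix, nms, sep = sep), prefix)
  x
}

l4 <- list(list(a = 1), l2 = list(b = 2), l3 = list(c = 3))
names(prefix_names(l4))
#[1] "Name" "Name.l2" "Name.l3"
```

Because the separator is only pasted where a name exists, nothing has to be trimmed afterwards.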

Related

How to turn the Web of Science advanced query into regular expression in R?

To do advanced search in Web of Science, we could use query like:
TI = ("ecology" AND ("climate change" OR "biodiversity"))
This means we want to extract papers whose titles contain "ecology" and ("climate change" or "biodiversity"). The corresponding R expression would be (here TI is a character vector of titles):
library(stringr)
str_detect(TI,"ecology") & str_detect(TI,"climate change|biodiversity")
Is there any way to get the regular expression from the WoS query?
1) Firstly we need to define the question more precisely. We assume that a WoS query is a character string containing AND, OR, NOT, parentheses and fixed character strings in lower case or mixed case possibly surrounded by double quotes (this excludes upper case AND or OR appearing within double quotes unless part of a longer string). We assume that we wish to generate a character string holding an R statement containing str_detect instances such as that shown in the question but not necessarily identical to the example shown as long as it satisfies the above.
For AND, OR and NOT we just replace them with the operators &, | and & !. We then replace each instance of a word character followed by spaces followed by a word character with the same, except that the spaces are replaced with an underscore. We then surround any string of word characters that is not quoted with quotes, and revert the underscores to spaces. Finally each quoted string is wrapped in a str_detect call and \\b word boundaries are added at both ends of it.
If s is the resulting string then eval(parse(text = s)[[1]]) could be used to evaluate it against target.
wos2stmt does not use any packages but the generated statement depends on stringr due to the use of str_detect for consistency with the question.
wos2stmt <- function(TI, target = "target") {
  TI |>
    gsub(pattern = "\\bNOT\\b", replacement = "& !") |>
    gsub(pattern = "\\bAND\\b", replacement = "&") |>
    gsub(pattern = "\\bOR\\b", replacement = "|") |>
    gsub(pattern = "(\\w) +(\\w)", replacement = "\\1_\\2") |>
    gsub(pattern = '(?<!")\\b(\\w+)\\b(?!")', replacement = '"\\1"', perl = TRUE) |>
    gsub(pattern = "_", replacement = " ") |>
    gsub(pattern = '("[^"]+")', replacement = sprintf("str_detect(%s, \\1)", target)) |>
    gsub(pattern = '"(\\w)', replacement = r"{"\\\\b\1}") |>
    gsub(pattern = '(\\w)"', replacement = r"{\1\\\\b"}")
}
# test
TI <- '"ecology" AND ("climate change" OR "biodiversity")'
stmt <- wos2stmt(TI)
giving:
cat(stmt, "\n")
## str_detect(target, "\\becology\\b") & (str_detect(target, "\\bclimate change\\b") | str_detect(target, "\\bbiodiversity\\b"))
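To see what evaluating such a generated statement looks like, here is a minimal sketch. It assumes a statement of the same shape, written out by hand rather than generated, and a made-up vector called target. Base grepl stands in for str_detect so that the example runs without stringr, and the word boundaries are left out for readability.

```r
# Made-up titles to search (not from the original question)
target <- c("ecology and biodiversity", "ecology of lakes", "climate change policy")

# Statement text of the same shape as the generated one, with grepl() in place of str_detect()
s <- 'grepl("ecology", target) & (grepl("climate change", target) | grepl("biodiversity", target))'

# Parse the text into an expression and evaluate it where `target` is visible
eval(parse(text = s)[[1]])
#[1]  TRUE FALSE FALSE
```

Since eval runs whatever the string contains, this is only advisable for queries you wrote yourself.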
2) The question seems to refer to generating R statements with str_detect but the subject refers to generating regular expressions. In the latter case we accept a WoS query and output a regular expression for use with str_detect like this. I haven't tested this out much so you will need to do that to explore its limitations.
Note that, unlike (1), this addresses only the original question, which we take as not including NOT or automatic quoting (they are not mentioned in the question as requirements).
wos2rx <- function(TI) {
  TI |>
    gsub(pat = ' *\\bOR\\b *', repl = '|') |>
    gsub(pat = ' *\\bAND\\b *', repl = '') |>
    gsub(pat = ' *"([^"]+)" *', repl = '(?=.*\\1)')
}
# test
library(stringr)
TI <- '("ecology" AND ("climate change" OR "biodiversity"))'
rx <- wos2rx(TI)
str_detect("biodiversity ecology", rx)
## [1] TRUE
str_detect("climate change biodiversity", rx)
## [1] FALSE
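The lookahead idea behind wos2rx can also be checked without stringr. The sketch below writes out by hand the pattern that the function should produce for the test query and matches it with base grepl and perl = TRUE. That choice is an assumption on my part; the answer itself only uses str_detect. The titles are made up.

```r
# Pattern written out by hand: each quoted term becomes a lookahead (?=.*term),
# AND becomes plain concatenation and OR becomes |
rx <- "((?=.*ecology)((?=.*climate change)|(?=.*biodiversity)))"

titles <- c("biodiversity ecology",
            "climate change biodiversity",
            "ecology under climate change")

# Lookaheads are not supported by the default regex engine in base R, hence perl = TRUE
grepl(rx, titles, perl = TRUE)
#[1]  TRUE FALSE  TRUE
```

The terms are inserted into the pattern unescaped, so a term containing regex metacharacters such as . or ( would be interpreted as a pattern rather than literally.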

Move "*" to new column in R

Hello I have a column in a data.frame, it has many rows, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"))
I want to make a new column "Species_new" where the "*" is moved to the end of the character string, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"),
"Species_new" = c("Briza minor*", "Briza minor*", "Wattle"))
Is there a way to do this using gsub? The manual example would take far too long as I have approximately 50,000 rows.
Thanks in advance
One option is to capture the * as a group and in the replacement reverse the backreferences
df$Species_new <- sub("^([*])(.*)$", "\\2\\1", df$Species)
df$Species_new
#[1] "Briza minor*" "Briza minor*" "Wattle"
NOTE: * is a metacharacter meaning 0 or more, so we can either escape (\\*) or place it in brackets ([]) to evaluate the raw character i.e. literal evaluation
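Both ways of making * literal can be compared side by side. This is a small sketch on the example data from the question; the regex-free variant with startsWith is my addition, not part of the answer above.

```r
species <- c("*Briza minor", "*Briza minor", "Wattle")

# Escaped form (\\*) and bracketed form ([*]) give the same result
a <- sub("^\\*(.*)$", "\\1*", species)
b <- sub("^([*])(.*)$", "\\2\\1", species)
identical(a, b)
#[1] TRUE

# Regex-free variant: move the star only where the string starts with one
has_star <- startsWith(species, "*")
d <- ifelse(has_star, paste0(substring(species, 2), "*"), species)
d
#[1] "Briza minor*" "Briza minor*" "Wattle"
```

All three are vectorised, so 50,000 rows should not be a problem.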
Thanks so much for the quick response. I also found a workaround (note that it changes the row order):
df$Species_new = sub("[*]", "", df$Species)
differences = setdiff(df$Species, df$Species_new)
# %in% rather than ==, since differences can hold more than one value
tochange = subset(df, Species %in% differences)
toleave = subset(df, !(Species %in% differences))
tochange$Species_new = paste(tochange$Species_new, "*", sep = "")
df = rbind(tochange, toleave)

Replace multiple strings comprising of a different number of characters with one gsubfn()

The question "Replace multiple strings in one gsub() or chartr() statement in R?" explains how to replace several single-character strings in one statement with gsubfn(). E.g.:
library(gsubfn)
x <- "doremi g-k"
gsubfn(".", list("-" = "_", " " = ""), x)
# "doremig_k"
I would however like to replace the string 'doremi' in the example with ''. This does not work:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", "doremi" = ""), x)
# "doremi g_k"
I guess this is because the string 'doremi' contains multiple characters while I am using the metacharacter . in gsubfn, which matches one character at a time. I have no idea what to replace it with. I must confess I sometimes find metacharacters a bit difficult to understand. So, is there a way for me to replace '-' and 'doremi' at once?
You might be able to just use base R sub here:
x <- "doremi g-k"
result <- sub("doremi\\s+([^-]+)-([^-]+)", "\\1_\\2", x)
result
[1] "g_k"
Does this work for you?
gsubfn::gsubfn(pattern = "doremi|-", list("-" = "_", "doremi" = ""), x)
[1] " g_k"
The key is the pattern "doremi|-", which tells gsubfn to search for either "doremi" or "-". "|" is the or operator.
Just a more generic version of @RLave's solution:
toreplace <- list("-" = "_", "doremi" = "")
gsubfn(paste(names(toreplace),collapse="|"), toreplace, x)
[1] " g_k"
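If gsubfn is not available, the same lookup-table idea can be approximated in base R. In this sketch replace_all is a hypothetical helper name. It applies the replacements one after another as literal strings, so unlike gsubfn a later replacement can act on the output of an earlier one.

```r
# Hypothetical helper: apply each literal replacement in turn
replace_all <- function(x, table) {
  for (from in names(table)) {
    # fixed = TRUE treats `from` as a plain string, not as a regular expression
    x <- gsub(from, table[[from]], x, fixed = TRUE)
  }
  x
}

toreplace <- list("-" = "_", "doremi" = "")
replace_all("doremi g-k", toreplace)
#[1] " g_k"
```

The literal matching also sidesteps a caveat of building the pattern with paste(names(toreplace), collapse = "|"): there the names are read as regular expressions, so a name containing a metacharacter such as . would need escaping.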

Using gsub and paste to remove a portion of text from certain variables

I am using gsub and paste to rename the columns of a data frame. In particular, for all the columns ending in .h, where h is a number from 0 to 23, I use the following script (for example for variable p.1)
h <- 1
gsub(paste(".", h, sep = ""), "", "p.1")
# "p" # correct!
The script should not work for all the variables that do not end with .h. For example
h <- 1
gsub(paste(".", h, sep = ""), "", "prob10")
# "pro0" # not correct!
However, this code yields "pro0" instead of "prob10". Similarly,
h <- 0
gsub(paste(".", h, sep = ""), "", "prob10")
# "prob" # not correct!
gives me the wrong answer. I don't understand why gsub does not work (in first place) and why the last two examples give different results. Thank you.
Well, gsub works, just not as you would expect, because "." is a metacharacter. To quote:
The simplest example of a metacharacter is the full stop, '.'. The full stop character matches any single character of any sort (apart from a newline). For example, the regular expression ".at" means: any letter, followed by the letter 'a', followed by the letter 't'. In "The cat sat on the mat." it matches "cat", "sat" and "mat".
(see https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html)
So with h = 1 the pattern .1 matches "b1" in "prob10", leaving "pro0", and with h = 0 the pattern .0 matches "10", leaving "prob".
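The working code for you would use fixed = TRUE, so that "." is treated as a literal dot, and would store the result of each step. Going from 23 down to 0 means that, for example, .10 is removed before .1 is tried:
x <- "p.1"
for (h in 23:0) {
  x <- gsub(paste(".", h, sep = ""), "", x, fixed = TRUE)
}
x
# "p"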
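An alternative that avoids the loop is a single anchored regular expression. This is a sketch under the assumption stated in the question, namely that the suffix is a dot followed by a number from 0 to 23 at the very end of the name. strip_hour is a hypothetical helper name.

```r
# Hypothetical helper: drop a trailing ".h" where h is a number from 0 to 23
strip_hour <- function(x) {
  # \\. is a literal dot; $ anchors the match at the end of the string
  sub("\\.(2[0-3]|1[0-9]|[0-9])$", "", x)
}

strip_hour(c("p.1", "p.23", "prob10", "x.24"))
#[1] "p" "p" "prob10" "x.24"
```

Because of the $ anchor, a ".1" in the middle of a name is left alone, which the fixed = TRUE approach does not guarantee.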

Remove last special character in r

I have a regex that splits text into lines of at most 15 characters, joined by a newline (\n). I need to remove the last newline if the final line consists of only one character (in the example below the string ends with a bracket).
m <- gregexpr(pattern = paste("(.{1,",15,"}\\b|.{",1,"})",sep = ""), text = txt, perl = TRUE)
split_txt = trimws(unlist(regmatches(x = txt, m = m)))
paste(split_txt, collapse = "\n")
This resulted into:
Client News V2\n(design, AI\n)
This should be required output:
Client News V2\n(design, AI)
Is there an easy way how to do that? Many thanks in advance.
You can use sub with regex \n(.)$, which captures the last character following new line character, and replace the last two with it (i.e. remove the last new line character if it is followed by only another character at the end):
s = "Client News V2\n(design, AI\n)"
sub("\n(.)$", "\\1", s)
# [1] "Client News V2\n(design, AI)"
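Putting the splitting code from the question and this fix together gives something like the following sketch. wrap_text is a hypothetical name, and the input string is an assumption: the question does not show txt, so it is inferred from the output given there.

```r
# Hypothetical wrapper around the code from the question plus the sub() fix
wrap_text <- function(txt, width = 15) {
  pattern <- paste0("(.{1,", width, "}\\b|.{1})")
  m <- gregexpr(pattern, txt, perl = TRUE)
  parts <- trimws(unlist(regmatches(txt, m)))
  out <- paste(parts, collapse = "\n")
  # Re-attach a lone trailing character to the previous line
  sub("\n(.)$", "\\1", out)
}

wrap_text("Client News V2 (design, AI)")
#[1] "Client News V2\n(design, AI)"
```

Note that re-attaching the character can make the last line one character longer than width.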
