I'm trying to replace part of a matched string, as in the following example:
str1 <- "abc sdak+ 123+"
I would like to replace every + that comes after three digits, but not a + that comes after letters. I tried the following, but it replaces the whole match when I only want to replace the + with a -:
gsub("[0-9]{3}\\+", "-", str1)
The desired outcome should be:
"abc sdak+ 123-"
We could capture the three digits as a group ((...)) followed by the +, and replace the match with the backreference (\\1) of the captured group plus a -. To make sure that no digits precede the three digits, start the pattern with either a word boundary (\\b) or a space (\\s).
gsub("\\b(\\d{3})\\+", "\\1-", str1)
-output
[1] "abc sdak+ 123-"
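To see what the word boundary contributes, here is a minimal sketch on a made-up input that also contains a four-digit run. The string s is hypothetical (it is not from the question), and perl = TRUE is an assumption added here so that \\b and \\d are handled by PCRE.

```r
# Hypothetical input: "1234+" has more than three digits before the +
s <- "abc sdak+ 123+ 1234+"

# With \\b the three digits must start at a word boundary
with_boundary <- gsub("\\b(\\d{3})\\+", "\\1-", s, perl = TRUE)

# Without \\b the last three digits of "1234" also qualify
without_boundary <- gsub("(\\d{3})\\+", "\\1-", s, perl = TRUE)

print(with_boundary)     # "abc sdak+ 123- 1234+"
print(without_boundary)  # "abc sdak+ 123- 1234-"
```

Whether "1234+" should be replaced depends on the data, so the boundary is a design choice rather than a requirement.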
You can also use a look-behind, i.e. ask whether the + symbol is preceded by a space and three digits; if so, replace it.
str1 <- "abc sdak+ 123+"
gsub("(?<= [0-9]{3})\\+", "-", str1, perl = TRUE)
[1] "abc sdak+ 123-"
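One thing to keep in mind is that the look-behind above contains a literal space, so three digits at the very start of a string are not matched. The sketch below compares it with a variant that uses \\b instead of the space; the input s2 and the variant pattern are illustrative assumptions, not part of the original answer.

```r
# Hypothetical input: the first "123+" has no space in front of it
s2 <- "123+ abc+ 456+"

# Look-behind requiring a space before the three digits
space_version <- gsub("(?<= [0-9]{3})\\+", "-", s2, perl = TRUE)

# Variant using a word boundary, which also holds at the start of the string
boundary_version <- gsub("(?<=\\b[0-9]{3})\\+", "-", s2, perl = TRUE)

print(space_version)     # "123+ abc+ 456-"
print(boundary_version)  # "123- abc+ 456-"
```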
Related
I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first occurred underscore, so that it will be as
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do this as well: anchor the pattern to the start of the string with the ^ symbol.
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^\\_","",x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This will remove the first occurrence of _ wherever it is; it is not limited to the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
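If only a leading underscore should ever be removed, anchoring the sub pattern with ^ avoids the side effect described in the note. A small sketch, with variable names chosen here for illustration:

```r
with_lead    <- "_6302_I-PAL_SPSY_000237_001"
without_lead <- "6302_I-PAL_SPSY_000237_001"

# Anchored: only an underscore at the very start is removed
anchored_a <- sub("^_", "", with_lead)
anchored_b <- sub("^_", "", without_lead)

# Unanchored: the first underscore anywhere is removed
unanchored_b <- sub("_", "", without_lead)

print(anchored_a)    # "6302_I-PAL_SPSY_000237_001"
print(anchored_b)    # "6302_I-PAL_SPSY_000237_001"
print(unanchored_b)  # "6302I-PAL_SPSY_000237_001"
```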
This can be done with base R's trimws() too
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
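Note that trimws strips every leading underscore, not just the first one, which may or may not be what is wanted. A quick sketch on made-up inputs (the vector y is hypothetical; the whitespace argument needs R 3.6.0 or later):

```r
# Hypothetical inputs: one, two, and zero leading underscores
y <- c("_6302_I-PAL", "__6302_I-PAL", "6302_I-PAL")

# trimws removes the whole run of leading underscores
trimmed <- trimws(y, which = "left", whitespace = "_")

# An anchored sub removes at most one
single <- sub("^_", "", y)

print(trimmed)  # "6302_I-PAL" "6302_I-PAL" "6302_I-PAL"
print(single)   # "6302_I-PAL" "_6302_I-PAL" "6302_I-PAL"
```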
In case we have multiple words with leading underscores, we may have to include a word boundary (\\b) in our regex, and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"
I want to remove blanks, periods and hyphens from a string and enclose each element of the resulting string by inverted commas. Furthermore, I'd like to ensure that all letters are uppercase.
I know how to remove a list of special characters but I cannot add enclosing inverted commas due to my lack of experience with regular expressions or other string manipulation functions (e.g., stringr functions).
How can I convert a string such as
test1 <- "A.1, b-1, C" # start string
test2 <- gsub("[ .-]", "", test1) # remove spaces, periods and hyphens
to generate the string 'A1','B1','C'?
We can convert the string to upper case, remove the . and - with gsub, and then use strsplit to split on a comma followed by zero or more spaces
strsplit(gsub("[.-]", "", toupper(test1)), ",\\s*")[[1]]
#[1] "A1" "B1" "C"
If we need a single string, after removing the ., and -, capture the word (\\w+) and replace it by wrapping the ' around the backreference (\\1) of the captured group
gsub('(\\w+)', "'\\1'", gsub("[.-]+", "", toupper(test1)))
#[1] "'A1', 'B1', 'C'"
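The two steps can also be combined: split first, then quote and collapse with paste0. This is only a sketch of one possible variant; the names parts and quoted are introduced here for illustration.

```r
test1 <- "A.1, b-1, C"

# Upper-case, drop periods and hyphens, then split on comma plus optional spaces
parts <- strsplit(gsub("[.-]", "", toupper(test1)), ",\\s*")[[1]]

# Wrap each element in single quotes and join with commas
quoted <- paste0("'", parts, "'", collapse = ",")

print(parts)   # "A1" "B1" "C"
print(quoted)  # "'A1','B1','C'"
```

Keeping the split vector around can be handy if the individual elements are needed later.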
Use
test1 <- "A.1, b-1, C"
test2 <- gsub("[^,\\w]+", "", test1, perl=TRUE)
test2 <- paste0("'", gsub(",", "','", toupper(test2)), "'")
Remove all non-word characters other than commas with gsub("[^,\\w]+", "", test1, perl=TRUE) and then replace all commas with commas inside quotes and wrap with quotes using paste0("'", gsub(",", "','", toupper(test2)), "'").
I have a data frame. One of the columns is in string format. Various letters and numbers, but always ending in a string of numbers. Sadly this string isn't always the same length.
I'd like to know how to write a bit of code to extract just the numbers at the end. So for example:
x <- c("AB ABC 19012301927 / XX - 4625",
"BC - AB / 827 / 9765",
"XXXX-9276"
)
And I'd like to get from this: (4625, 9765, 9276)
Is there any easy way to do this please?
Thank you.
A
We can use sub to capture one or more digits (\\d+) at the end ($) of the string, preceded by any characters (.*) and a non-digit ([^0-9]). In the replacement, specify the backreference (\\1) of the captured group
sub(".*[^0-9](\\d+)$", "\\1", x)
#[1] "4625" "9765" "9276"
Or with word from stringr
library(stringr)
word(x, -1, sep="[- ]")
#[1] "4625" "9765" "9276"
Or with stri_extract_last
library(stringi)
stri_extract_last_regex(x, "\\d+")
#[1] "4625" "9765" "9276"
Replace everything up to the last non-digit with a zero length string.
sub(".*\\D", "", x)
giving:
[1] "4625" "9765" "9276"
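Since the question asks for numbers, it may help to extract the trailing digits directly and convert them. A minimal base R sketch using regexpr and regmatches, assuming every element ends in digits as stated in the question (regmatches silently drops elements with no match):

```r
x <- c("AB ABC 19012301927 / XX - 4625",
       "BC - AB / 827 / 9765",
       "XXXX-9276")

# Locate the run of digits that reaches the end of each string
m <- regexpr("[0-9]+$", x)

# Pull out the matched text and convert it to integer
trailing <- as.integer(regmatches(x, m))

print(trailing)  # 4625 9765 9276
```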
I have a range of strings as follows:
vec<-c("Peronospora boniNhenrici","Cystoseira abiesNmarina","Niplommatina rubra",
"Padina sanctaeNcrucis","Nachygrapsus NaurusNliguricus","Melphidippa borealis")
I would like to replace the internal capital "N" in the second word of each element with "-", so that it would look like:
("Peronospora boni-henrici","Cystoseira abies-marina","Niplommatina rubra",
"Padina sanctae-crucis","Nachygrapsus Naurus-liguricus","Melphidippa borealis")
Any suggestions? I've got the locations using the following:
stri_locate_all(vec,regex = "[N]")
but I'm not sure how to replace the "N" if it's internal. When I try to replace the capital letter "N" using gsub, it replaces all occurrences of N, rather than only the internal "N".
We can look for any N's surrounded by \w, which in regex matches any alphanumeric characters or underscores. If that's too broad you could replace \w with [a-zA-Z] to only match letters:
stringr::str_replace_all(vec, "(\\w)N(\\w)", "\\1-\\2")
We can use a lookbehind and a lookahead to replace an "N" in the middle of a word with a "-" without consuming the neighbouring letters
gsub("(?<=\\w)N(?=\\w)", "-", vec, perl = TRUE)
#[1] "Peronospora boni-henrici" "Cystoseira abies-marina" "Niplommatina rubra"
#[4] "Padina sanctae-crucis" "Nachygrapsus Naurus-liguricus" "Melphidippa borealis"
We can use gsub with capture groups
gsub("([a-z])N([a-z])", "\\1-\\2", vec)
#[1] "Peronospora boni-henrici" "Cystoseira abies-marina" "Niplommatina rubra"
#[4] "Padina sanctae-crucis"
#[5] "Nachygrapsus Naurus-liguricus" "Melphidippa borealis"
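One difference between capture groups and lookarounds shows up when two internal N's are separated by a single letter: the capture-group pattern consumes that letter, so the second N is skipped. The sketch below uses a made-up vector v to illustrate this; it is not part of the question's data.

```r
# Hypothetical inputs: "aNbNc" has two internal N's sharing the letter "b"
v <- c("aNbNc", "Nachygrapsus NaurusNliguricus")

# Capture groups consume the letters on both sides of N
captured <- gsub("([a-z])N([a-z])", "\\1-\\2", v)

# Lookarounds only check the neighbours, so matches can sit back to back
lookaround <- gsub("(?<=[a-z])N(?=[a-z])", "-", v, perl = TRUE)

print(captured)    # "a-bNc" "Nachygrapsus Naurus-liguricus"
print(lookaround)  # "a-b-c" "Nachygrapsus Naurus-liguricus"
```

For the species names in the question both approaches give the same result.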
Let's say I have a string in R:
str <- "abc abc cde cde"
and I use regmatches and gregexpr to find how many "b"s there are in my string
regmatches(str, gregexpr("b",str))
but I want an output of everything that contains the letter b.
So an output like: "abc", "abc".
Thank you!
tmp <- "abc abc cde cde"
Split the string up into separate elements, grep for "b", return elements:
grep("b", unlist(strsplit(tmp, split = " ")), value = TRUE)
Look for non-space before and after, something like:
regmatches(str, gregexpr("\\S*b\\S*", str))
# [[1]]
# [1] "abc" "abc"
Special regex characters are documented in ?regex. For this case, \\s matches "any space-like character", and \\S is its negation, so any non-space-like character. You could be more specific, such as \\w ('word' character, same as [[:alnum:]_]). The * means zero-or-more, and + means one-or-more (requiring at least one character).
I suppose you mean you want to find words that contain b. One regex that does this is
\w*b\w*
\w* matches 0 or more word characters, which is a-z, A-Z, 0-9, and the underscore character.
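The choice between \S and \w matters once punctuation is involved. Here is a small sketch in R on a made-up string (s is hypothetical, and perl = TRUE is an assumption added here); note that the backslashes are doubled inside R string literals.

```r
# Hypothetical input with a comma and a hyphen
s <- "abc, a-b cde"

# \S* keeps any non-space characters attached to the match
non_space <- regmatches(s, gregexpr("\\S*b\\S*", s, perl = TRUE))[[1]]

# \w* stops at punctuation such as "," and "-"
word_only <- regmatches(s, gregexpr("\\w*b\\w*", s, perl = TRUE))[[1]]

print(non_space)  # "abc," "a-b"
print(word_only)  # "abc" "b"
```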
Here is a base R option using strsplit and grepl:
str <- "abc abc cde cde"
words <- strsplit(str, "\\s+")[[1]]
idx <- sapply(words, function(x) { grepl("b", x)})
matches <- words[idx]
matches
[1] "abc" "abc"
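Because grepl is already vectorised over its input, the sapply step can be dropped. A shorter sketch of the same idea, using txt as a hypothetical variable name to avoid masking the base function str, and fixed = TRUE as an added assumption since "b" is a literal:

```r
txt <- "abc abc cde cde"

# Split on runs of whitespace
words <- strsplit(txt, "\\s+")[[1]]

# grepl returns one TRUE/FALSE per word, which indexes the vector directly
matches <- words[grepl("b", words, fixed = TRUE)]

print(matches)  # "abc" "abc"
```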