A regex to remove the pattern "[0-9]g" - r

I have the following sample dataset:
XYZ 185g
ABC 60G
Gha 20g
How do I remove the strings "185g", "60G", "20g" without accidentally removing the letters g and G in the main words?
I tried the code below, but it replaces the letters in the main words as well.
a <- str_replace_all(a$words,"[0-9]"," ")
a <- str_replace_all(a$words,"[gG]"," ")

You need to combine them into something like
a$words <- str_replace_all(a$words,"\\s*\\d+[gG]$", "")
The \s*\d+[gG]$ regex matches
\s* - zero or more whitespaces
\d+ - one or more digits
[gG] - g or G
$ - end of string.
If you can have these strings inside a string, not just at the end, you may use
a$words <- str_replace_all(a$words,"\\s*\\d+[gG]\\b", "")
where $ is replaced with a \b, a word boundary.
To ignore case,
a$words <- str_replace_all(a$words, regex("\\s*\\d+g\\b", ignore_case=TRUE), "")
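As a quick check of how the two anchors differ, here is a small sketch in base R. The input vector is made up for illustration (it is not the asker's data), and gsub with perl = TRUE stands in for str_replace_all so that nothing beyond base R is needed.

```r
# Hypothetical inputs: weight at the end, weight mid-string, and a "kg" unit.
x <- c("XYZ 185g", "ABC 60G tin", "Gha 20kg")

# Anchored with $: only a trailing "<digits>g" (or G) is removed.
end_only <- gsub("\\s*\\d+[gG]$", "", x, perl = TRUE)

# Anchored with \b: the token may sit anywhere, but must end at a word boundary.
anywhere <- gsub("\\s*\\d+[gG]\\b", "", x, perl = TRUE)

print(end_only)
print(anywhere)
```

Neither pattern touches "20kg", because the digits must be followed directly by g or G.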

You can try
> gsub("\\s\\d+g$", "", c("XYZ 185g", "ABC 60G", "Gha 20g"), ignore.case = TRUE)
[1] "XYZ" "ABC" "Gha"

You can also use the following solution:
vec <- c("XYZ 185g", "ABC 60G", "Gha 20g")
gsub("[A-Za-z]+(*SKIP)(*FAIL)|[ 0-9Gg]+", "", vec, perl = TRUE)
[1] "XYZ" "ABC" "Gha"

Related

How to change values before text in string using R

I have multiple strings that are similar to the following pattern:
dat<-("00000000AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0")
I need to change all 0 values to "." before the first character value within a string. My desired output in this example would be:
"........AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0".
I tried using gsub to accomplish this task:
gsub("\\G([^_\\d]*)\\d", ".\\1", dat, perl=T)
Unfortunately it changed all of the 0s to "." instead of the 0s preceding the first "A".
Can someone please help me with this issue?
If you wish to simply replace each leading 0 with a ., you can use
gsub("\\G0", ".", dat, perl=TRUE)
Here, \G0 matches a 0 char at the start of the string, and then each 0 that immediately follows a successful match.
If you need to replace each 0 in a string before the first letter you can use
gsub("\\G[^\\p{L}0]*\\K0", ".", dat, perl=TRUE)
Here, \G matches the start of the string or the end of the preceding successful match, [^\p{L}0]* matches zero or more chars other than a letter and 0, \K discards the text matched so far from the match, and 0 then matches the 0 char, which is replaced with a dot (.).
See the R demo online:
dat <- c("00000000AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0","102030405000AZD")
gsub("\\G0", ".", dat, perl=TRUE)
## [1] "........AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0"
## [2] "102030405000AZD"
gsub("\\G[^\\p{L}0]*\\K0", ".", dat, perl=TRUE)
## [1] "........AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0"
## [2] "1.2.3.4.5...AZD"
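To see what the \G anchor contributes, the sketch below runs the second pattern with and without it on a short string. The input "10A0" is an assumption chosen for illustration, not part of the question.

```r
# Hypothetical input: one zero before the first letter, one after it.
x <- "10A0"

# With \G, matches must chain from the start of the string, so scanning
# stops at the first letter and the trailing zero is left alone.
with_anchor <- gsub("\\G[^\\p{L}0]*\\K0", ".", x, perl = TRUE)

# Without \G, the same pattern can match again later in the string,
# so the zero after the letter is replaced as well.
without_anchor <- gsub("[^\\p{L}0]*\\K0", ".", x, perl = TRUE)

print(with_anchor)
print(without_anchor)
```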
This one is tricky, so I tried a custom function instead:
library(stringr)
dat <- "00000000AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0"
Zero_Replacer <- function(x) {
  # Everything before the first letter (possibly empty)
  prefix <- str_extract(x, "^[^A-Za-z]*")
  # The remainder, starting at the first letter
  rest <- str_sub(x, str_length(prefix) + 1)
  # Replace zeros only in the prefix, then glue the parts back together
  paste0(str_replace_all(prefix, "0", "."), rest)
}
Zero_Replacer(dat)
Output:
[1] "........AAAAAAAAAA0AAAAAAAAAA0AAAAAAAAAAAAAAAAAAAAAAAAD0"

How can I remove all letters from a string but the last if it is "X"?

A dataset with ISBNs includes some messed up ones with letters - since the only valid letter in an ISBN is an X in the last position, I would like to remove all other letters using gsub - any recommendations?
Following a short example with desired outcomes:
str1 <- "1234X"
Desired outcome: "1234X"
str2 <- "12X34"
Desired outcome: "1234"
str3 <- "XXXXX"
Desired outcome: ""
str4 <- "1234B"
Desired outcome: "1234"
Any recommendation?
Another option is to delete all non-digits while keeping an X that ends the string and directly follows a digit:
gsub("((?<=\\d)X$)|\\D", "\\1", str1, perl = TRUE)
[1] "1234X" "1234" "" "1234"
We could simply use gsub: match an 'X' at the end ($) of the string and SKIP it, while matching one or more upper case letters ([A-Z]+) everywhere else and replacing them with blank ("").
gsub("X$(*SKIP)(*F)|[A-Z]+", "", str1, perl = TRUE)
#[1] "1234X" "1234" "" "1234"
data
str1 <- c("1234X", "12X34", "XXXXX", "1234B")
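The same result can also be reached without PCRE verbs or lookbehind by splitting the job into two steps. This is a minimal sketch, not one of the posted answers; clean_isbn is a hypothetical helper name, and it assumes (like the lookbehind pattern above) that a trailing X only counts when it directly follows a digit.

```r
# Hypothetical helper: keep digits, plus a final X that follows a digit.
clean_isbn <- function(x) {
  keep_x <- grepl("[0-9]X$", x)       # does the string end in <digit>X?
  digits <- gsub("[^0-9]", "", x)     # drop everything that is not a digit
  paste0(digits, ifelse(keep_x, "X", ""))
}

str1 <- c("1234X", "12X34", "XXXXX", "1234B")
print(clean_isbn(str1))
```

Two plain steps are easier to adapt than a single pattern if the rules change later, for example if a lowercase x should also be accepted.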

A Regex to remove digits except for words starting with #

I have some strings that can contain letters, numbers and '#' symbol.
I would like to remove digits except for the words that start with '#'
Here is an example:
"table9 dolv5e #10n #dec10 #nov8e 23 hello"
And the expected output is:
"table dolve #10n #dec10 #nov8e hello"
How can I do this with regex, stringr or gsub?
How about capturing the wanted parts and replacing the unwanted ones with nothing (they are not captured)?
gsub("(#\\S+)|\\d+","\\1",x)
See demo at regex101 or R demo at tio.run (I have no experience with R)
My answer assumes that there is always whitespace between tokens, as in #foo bar #baz2. If you have something like #foo1,bar2:#baz3 4, use \w (word character) instead of \S (non-whitespace).
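The difference between the two character classes shows up on the punctuated string mentioned above. A small sketch, using that string as input:

```r
x <- "#foo1,bar2:#baz3 4"

# \S+ swallows everything up to the next whitespace, so "bar2" is protected too.
with_S <- gsub("(#\\S+)|\\d+", "\\1", x)

# \w+ stops at punctuation, so only "#foo1" and "#baz3" are protected.
with_w <- gsub("(#\\w+)|\\d+", "\\1", x)

print(with_S)
print(with_w)
```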
You could split the string on spaces, remove digits from tokens if they don't start with '#' and paste back:
x <- "table9 dolv5e #10n #dec10 #nov8e 23 hello"
y <- unlist(strsplit(x, ' '))
paste(ifelse(startsWith(y, '#'), y, gsub('\\d+', '', y)), collapse = ' ')
# output
[1] "table dolve #10n #dec10 #nov8e hello"
You can use gsub to remove digits, for example:
gsub("[0-9]","","table9")
"table"
And we can split your string using strsplit:
STRING = "table9 dolv5e #10n #dec10 #nov8e 23 hello"
strsplit(STRING," ")
[[1]]
[1] "table9" "dolv5e" "#10n" "#dec10" "#nov8e" "23" "hello"
We just need to go through the elements of STRING and apply gsub only to those that do not contain "#":
STRING = unlist(strsplit(STRING," "))
no_hex = !grepl("#",STRING)
STRING[no_hex] = gsub("[0-9]","",STRING[no_hex])
paste(STRING,collapse=" ")
[1] "table dolve #10n #dec10 #nov8e hello"
Base R solution:
unlisted_strings <- unlist(strsplit(X, "\\s+"))
Y <- paste0(na.omit(ifelse(grepl("[#]", unlisted_strings),
unlisted_strings,
gsub("\\d+", "", unlisted_strings))), collapse = " ")
Y
Data:
X <- as.character("table9 dolv5e #10n #dec10 #nov8e 23 hello")
INPUT = "table9 dolv5e #10n #dec10 #nov8e 23 hello";
OUTPUT = INPUT.match(/[^#\d]+(#\w+|[A-Za-z]+\w*)/gi).join('');
You can drop the i flag, because the character class already covers both cases.
Use this pattern: [^#\d]+(#\w+|[A-Za-z]+\w*)
[^#\d]+ = one or more characters that are neither # nor a digit
#\w+ = # followed by digits or letters
[A-Za-z]+\w* = a letter followed by letters and/or digits
You can change this last part to \D+\S* to match any non-digit character first, not just a letter, followed by anything other than whitespace.
I did not write it as \w+\w* because \w also matches digits.
I tried the code in JavaScript: it drops the standalone 23, but digits inside a word (table9, dolv5e) are kept.
If you want match not only followed by letter you can use code

Get characters after and before a pattern match in R

Let's say I have a string in R:
str <- "abc abc cde cde"
and I use regmatches and gregexpr to find how many "b"s there are in my string
regmatches(str, gregexpr("b",str))
but I want an output of everything that contains the letter b.
So an output like: "abc", "abc".
Thank you!
tmp <- "abc abc cde cde"
Split the string up into separate elements, grep for "b", return elements:
grep("b", unlist(strsplit(tmp, split = " ")), value = TRUE)
Look for non-space before and after, something like:
regmatches(str, gregexpr("\\S*b\\S*", str))
# [[1]]
# [1] "abc" "abc"
Special regex characters are documented in ?regex. For this case, \\s matches "any space-like character", and \\S is its negation, so any non-space-like character. You could be more specific, such as \\w ('word' character, same as [[:alnum:]_]). The * means zero-or-more, and + means one-or-more (forcing something).
I suppose you mean you want to find words that contain b. One regex that does this is
\w*b\w*
\w* matches 0 or more word characters, which is a-z, A-Z, 0-9, and the underscore character.
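The two patterns suggested above, \S*b\S* and \w*b\w*, behave differently once punctuation is involved. A short sketch on a made-up string (the input is an assumption, not the asker's data):

```r
# Hypothetical input with a comma and a hyphen attached to the words.
x <- "abc, a-b cde"

# \S* extends the match to any non-space neighbours, punctuation included.
non_space <- regmatches(x, gregexpr("\\S*b\\S*", x))[[1]]

# \w* extends it only over letters, digits and underscores.
word_chars <- regmatches(x, gregexpr("\\w*b\\w*", x))[[1]]

print(non_space)
print(word_chars)
```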
Here is a base R option using strsplit and grepl:
str <- "abc abc cde cde"
words <- strsplit(str, "\\s+")[[1]]
idx <- sapply(words, function(x) { grepl("b", x)})
matches <- words[idx]
matches
[1] "abc" "abc"

How to change the word separator character in a vector?

I have a character vector consisting of the following style:
mylist <- c('John Myer Stewert','Steve',' Michael Boris',' Daniel and Frieds','Michael-Myer')
I'm trying to create a character vector like this:
mylist <- c('John+Myer+Stewert','Steve',' Michael+Boris',' Daniel+and+Frieds','Michael+Myer')
I have tried:
test <- cat(paste(shQuote(mylist , type="cmd"), collapse="+"))
That seems wrong. How can I change the word separator in mylist as shown above?
You could use chartr(). Just re-use the + sign for both space and - characters.
chartr(" -", "++", trimws(mylist))
# [1] "John+Myer+Stewert" "Steve" "Michael+Boris"
# [4] "Daniel+and+Frieds" "Michael+Myer"
Note that I also trimmed the leading whitespace since there is really no need to keep it.
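chartr() translates character by character: the i-th character of the first argument is replaced by the i-th character of the second. A tiny sketch with made-up input to make the mapping visible:

```r
# Hypothetical input; "a" -> "x", "b" -> "y", "c" -> "z", all else untouched.
mapped <- chartr("abc", "xyz", "aabbcc-d")
print(mapped)

# Mapping two different characters to the same target, as in the answer above.
joined <- chartr(" -", "++", "Michael-Myer and Co")
print(joined)
```

Because no regular expression is involved, this only works for one-to-one character substitutions; replacing runs of characters or multi-character separators still needs gsub.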
We can use gsub by matching the space (" ") as pattern and replace it with "+".
gsub(" ", "+", trimws(mylist))
#[1] "John+Myer+Stewert" "Steve" "Michael+Boris"
#[4] "Daniel+and+Frieds" "Michael-Myer"
I assumed that the leading spaces were a typo. If they are not, we can either use regex lookarounds
gsub("(?<=[a-z])[ -](?=[[:alpha:]])", "+", mylist, perl = TRUE)
#[1] "John+Myer+Stewert" "Steve" " Michael+Boris"
#[4] " Daniel+and+Frieds" "Michael+Myer"
Or some PCRE regex
gsub("(^ | $)(*SKIP)(*F)|[ -]", "+", mylist, perl = TRUE)
#[1] "John+Myer+Stewert" "Steve" " Michael+Boris"
#[4] " Daniel+and+Frieds" "Michael+Myer"
You can use the package stringr.
library(stringr)
str_replace_all(trimws(mylist), "[ -]", "+")
#[1] "John+Myer+Stewert" "Steve" "Michael+Boris"
#[4] "Daniel+and+Frieds" "Michael+Myer"
Between [] we specify what we want to replace with +. In this case, that is a single white space and -. I used trimws from Akrun's answer to get rid of the extra white space in the beginning of some elements in your string.
This is yet another alternative.
library(stringi)
stri_replace_all_regex(trimws(mylist), "[ -]", "+")

Resources