A Regex to remove digits except for words starting with # - r

I have some strings that can contain letters, numbers and '#' symbol.
I would like to remove digits except for the words that start with '#'
Here is an example:
"table9 dolv5e #10n #dec10 #nov8e 23 hello"
And the expected output is:
"table dolve #10n #dec10 #nov8e hello"
How can I do this with regex, stringr or gsub?

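How about capturing what you want to keep and replacing what you don't want with nothing? The #-words are captured into group 1 and restored by the replacement; bare digits are matched but not captured, so they are dropped.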
gsub("(#\\S+)|\\d+","\\1",x)
See demo at regex101 or R demo at tio.run (I have no experience with R)
My answer assumes that there is always whitespace between the words, as in #foo bar #baz2. If you have something like #foo1,bar2:#baz3 4, use \w (word character) instead of \S (non-whitespace).
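A minimal sketch of the capture-and-replace idea on the example string. The names x, kept, cleaned and y are illustrative, and the whitespace clean-up step is an added assumption rather than part of the answer above: removing a standalone number such as 23 leaves both of its neighbouring spaces behind, so a second pass collapses them.

```r
x <- "table9 dolv5e #10n #dec10 #nov8e 23 hello"

# Group 1 captures whole '#'-words. Bare digit runs match the second
# alternative, leave group 1 empty, and are therefore replaced by nothing.
kept <- gsub("(#\\S+)|\\d+", "\\1", x)

# Assumption: whitespace left behind by removed numbers should be collapsed.
cleaned <- trimws(gsub("\\s+", " ", kept))
print(cleaned)

# Variant with \w for input that has no whitespace between the '#'-words.
y <- "#foo1,bar2:#baz3 4"
print(trimws(gsub("(#\\w+)|\\d+", "\\1", y)))
```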

You could split the string on spaces, remove digits from tokens if they don't start with '#' and paste back:
x <- "table9 dolv5e #10n #dec10 #nov8e 23 hello"
y <- unlist(strsplit(x, ' '))
paste(ifelse(startsWith(y, '#'), y, gsub('\\d+', '', y)), collapse = ' ')
# output
[1] "table dolve #10n #dec10 #nov8e hello"

You can use gsub to remove digits, for example:
gsub("[0-9]","","table9")
"table"
And we can split your string using strsplit:
STRING = "table9 dolv5e #10n #dec10 #nov8e 23 hello"
strsplit(STRING," ")
[[1]]
[1] "table9" "dolv5e" "#10n" "#dec10" "#nov8e" "23" "hello"
We then just need to apply gsub to the elements of STRING that do not contain "#":
STRING = unlist(strsplit(STRING," "))
no_hex = !grepl("#",STRING)
STRING[no_hex] = gsub("[0-9]","",STRING[no_hex])
paste(STRING,collapse=" ")
[1] "table dolve #10n #dec10 #nov8e hello"

Base R solution:
unlisted_strings <- unlist(strsplit(X, "\\s+"))
Y <- paste0(na.omit(ifelse(grepl("[#]", unlisted_strings),
unlisted_strings,
gsub("\\d+", "", unlisted_strings))), collapse = " ")
Y
Data:
X <- as.character("table9 dolv5e #10n #dec10 #nov8e 23 hello")
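The split-based answers above all handle a single string. As a rough sketch, the same idea can be wrapped in a function that accepts a character vector. The function name remove_digits_outside_tags is hypothetical, and dropping tokens that become empty (such as "23") is an assumption made here so that no double spaces remain.

```r
# Hypothetical helper: strips digits from every whitespace-separated token
# that does not start with '#', for each element of a character vector.
remove_digits_outside_tags <- function(strings) {
  vapply(strsplit(strings, "\\s+"), function(tokens) {
    is_tag <- startsWith(tokens, "#")
    tokens[!is_tag] <- gsub("\\d+", "", tokens[!is_tag])
    # Assumption: tokens that became empty (e.g. "23") are dropped entirely.
    paste(tokens[nzchar(tokens)], collapse = " ")
  }, character(1))
}

result <- remove_digits_outside_tags(
  c("table9 dolv5e #10n #dec10 #nov8e 23 hello", "a1 #b2 33")
)
print(result)
```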

INPUT = "table9 dolv5e #10n #dec10 #nov8e 23 hello";
OUTPUT = INPUT.match(/[^#\d]+(#\w+|[A-Za-z]+\w*)/gi).join('');
You can drop the i flag, because the character class [A-Za-z] already covers both cases.
Use this pattern: [^#\d]+(#\w+|[A-Za-z]+\w*)
[^#\d]+ = one or more characters that are neither # nor a digit
#\w+ = # followed by one or more word characters (letters, digits or underscore)
[A-Za-z]+\w* = one or more letters followed by any number of word characters
You can replace this last alternative with \D+\S* to match any non-digit characters followed by non-whitespace characters, not only text that starts with a letter and continues with letters and/or digits.
I did not write it as \w+\w* because \w already includes digits (\w is the same as [\w\d]).
I tried the code in JavaScript and it works.
If you want match not only followed by letter you can use code

Related

Regex to add comma between any character

I'm relatively new to regex, so bear with me if the question is trivial. I'd like to place a comma between every letter of a string using regex, e.g.:
x <- "ABCD"
I want to get
"A,B,C,D"
It would be nice if I could do that using gsub, sub or related on a vector of strings of arbitrary number of characters.
I tried
> sub("(\\w)", "\\1,", x)
[1] "A,BCD"
> gsub("(\\w)", "\\1,", x)
[1] "A,B,C,D,"
> gsub("(\\w)(\\w{1})$", "\\1,\\2", x)
[1] "ABC,D"
Try:
x <- 'ABCD'
gsub('\\B', ',', x, perl = T)
Prints:
[1] "A,B,C,D"
I might have misread the query; the OP is looking to add commas between letters only. Therefore try:
gsub('(\\p{L})(?=\\p{L})', '\\1,', x, perl = T)
(\p{L}) - Match any kind of letter from any language in a 1st group;
(?=\p{L}) - Positive lookahead to match as per above.
We can use the backreference to this capture group in the replacement.
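To make the difference between the two patterns concrete, here is a small sketch on an input that also contains a space and digits. The second test string "AB 12" is an illustrative assumption, not taken from the question.

```r
x <- c("ABCD", "AB 12")

# \B matches wherever there is no word boundary, e.g. between two word
# characters, so the digits get separated as well.
with_non_boundary <- gsub("\\B", ",", x, perl = TRUE)
print(with_non_boundary)

# The letter-only pattern inserts a comma only between two adjacent letters.
letters_only <- gsub("(\\p{L})(?=\\p{L})", "\\1,", x, perl = TRUE)
print(letters_only)
```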
You can use
> gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
[1] "A,B,C,D"
The (.)(?=.) regex matches any char, capturing it into Group 1 (with (.)), that must be followed by any single char ((?=.) is a positive lookahead that requires a char immediately to the right of the current location).
Variations of the solution:
> gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
## Or with stringr:
## stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
[1] "A,B,C,D"
Here, (?!$) is a negative lookahead that fails the match if the end of the string is immediately to the right of the current location.
See the R demo online:
x <- "ABCD"
gsub("(.)(?=.)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
gsub("(.)(?!$)", "\\1,", x, perl=TRUE)
# => [1] "A,B,C,D"
stringr::str_replace_all(x, "(.)(?!$)", "\\1,")
# => [1] "A,B,C,D"
A non-regex friendly answer:
paste(strsplit(x, "")[[1]], collapse = ",")
#[1] "A,B,C,D"
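The one-liner above handles a single string because of the [[1]]. Since the question asks for a vector of strings, one possible vectorised sketch is shown below; the input vector is made up for illustration.

```r
x <- c("ABCD", "xy", "Q")

# strsplit() returns one character vector per input string; paste() with
# collapse = "," then joins the characters of each string separately.
joined <- vapply(strsplit(x, ""), paste, character(1), collapse = ",")
print(joined)
```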
Another option is to use positive look behind and look ahead to assert there is a preceding and a following character:
library(stringr)
str_replace_all(x, "(?<=.)(?=.)", ",")
[1] "A,B,C,D"

A regex to remove the pattern "[0-9]g"

I have the following sample dataset:
XYZ 185g
ABC 60G
Gha 20g
How do I remove the strings "185g", "60G", "20g" without accidentally removing the letters g and G in the main words?
I tried the code below, but it replaces the letters in the main words as well.
a <- str_replace_all(a$words,"[0-9]"," ")
a <- str_replace_all(a$words,"[gG]"," ")
You need to combine them into something like
a$words <- str_replace_all(a$words,"\\s*\\d+[gG]$", "")
The \s*\d+[gG]$ regex matches
\s* - zero or more whitespaces
\d+ - one or more digits
[gG] - g or G
$ - end of string.
If you can have these strings inside a string, not just at the end, you may use
a$words <- str_replace_all(a$words,"\\s*\\d+[gG]\\b", "")
where $ is replaced with a \b, a word boundary.
To ignore case,
a$words <- str_replace_all(a$words, regex("\\s*\\d+g\\b", ignore_case=TRUE), "")
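As a quick check of the word-boundary variant without depending on stringr, here is a base R sketch. The vector name words and the extra element "Mango 20g pack" are assumptions added to show a weight in the middle of a string.

```r
words <- c("XYZ 185g", "ABC 60G", "Gha 20g", "Mango 20g pack")

# \s*\d+g\b: optional whitespace, digits, then g (or G via ignore.case)
# that ends a word. The g in "Mango" is untouched because no digits precede it.
stripped <- gsub("\\s*\\d+g\\b", "", words, ignore.case = TRUE, perl = TRUE)
print(stripped)
```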
You can try
> gsub("\\s\\d+g$", "", c("XYZ 185g", "ABC 60G", "Gha 20g"), ignore.case = TRUE)
[1] "XYZ" "ABC" "Gha"
You can also use the following solution:
vec <- c("XYZ 185g", "ABC 60G", "Gha 20g")
gsub("[A-Za-z]+(*SKIP)(*FAIL)|[ 0-9Gg]+", "", vec, perl = TRUE)
[1] "XYZ" "ABC" "Gha"

Locate position of first number in string [R]

How can I create a function in R that locates the word position of the first number in a string?
For example:
string1 <- "Hello I'd like to extract where the first 1010 is in this string"
#desired_output for string1
9
string2 <- "80111 is in this string"
#desired_output for string2
1
string3 <- "extract where the first 97865 is in this string"
#desired_output for string3
5
I would just use grep and strsplit here for a base R option:
sapply(input, function(x) grep("\\d+", strsplit(x, " ")[[1]]))
Hello I'd like to extract where the first 1010 is in this string
9
80111 is in this string
1
extract where the first 97865 is in this string
5
Data:
input <- c("Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string",
"extract where the first 97865 is in this string")
Here is a way to return your desired output:
library(stringr)
min(which(!is.na(suppressWarnings(as.numeric(str_split(string, " ", simplify = TRUE))))))
This is how it works:
str_split(string, " ", simplify = TRUE) # converts your string to a vector/matrix, splitting at space
as.numeric(...) # tries to convert each element to a number, returning NA when it fails
suppressWarnings(...) # suppresses the warnings generated by as.numeric
!is.na(...) # returns true for the values that are not NA (i.e. the numbers)
which(...) # returns the position for each TRUE values
min(...) # returns the first position
The output:
min(which(!is.na(suppressWarnings(as.numeric(str_split(string1, " ", simplify = TRUE))))))
[1] 9
min(which(!is.na(suppressWarnings(as.numeric(str_split(string2, " ", simplify = TRUE))))))
[1] 1
min(which(!is.na(suppressWarnings(as.numeric(str_split(string3, " ", simplify = TRUE))))))
[1] 5
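Two edge cases are worth keeping in mind with the approaches so far: grep() returns every matching position when a string contains several numbers, and min(which(...)) on a string without any number returns Inf with a warning. A possible sketch that always returns one integer per string follows; the function name first_number_position and the choice of NA for "no number found" are assumptions.

```r
# Hypothetical helper: word position of the first token containing a digit,
# or NA when the string has none. Works on a character vector.
first_number_position <- function(strings) {
  vapply(strsplit(strings, " ", fixed = TRUE), function(words) {
    hits <- grep("[0-9]", words)
    if (length(hits) > 0) hits[1] else NA_integer_
  }, integer(1))
}

strings <- c("Hello I'd like to extract where the first 1010 is in this string",
             "80111 is in this string",
             "extract where the first 97865 is in this string")
print(first_number_position(strings))
print(first_number_position(c("a 1 b 2", "no digits here")))
```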
Here is a fully tidyverse approach:
library(purrr)
library(stringr)
map_dbl(str_split(strings, " "), str_which, "\\d+")
#> [1] 9 1 5
map_dbl(str_split(strings[1], " "), str_which, "\\d+")
#> [1] 9
Note that it works both with one and multiple strings.
Where strings is:
strings <- c("Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string",
"extract where the first 97865 is in this string")
Here is another approach. We can trim off the remaining characters after the first digit of the first number. Then, just find the position of the last word. \\b matches word boundaries while \\S+ matches one or more non-whitespace characters.
first_numeric_word <- function(x) {
x <- substr(x, 1L, regexpr("\\b\\d+\\b", x))
lengths(gregexpr("\\b\\S+\\b", x))
}
Output
> first_numeric_word(x)
[1] 9 1 5
Data
x <- c(
"Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string",
"extract where the first 97865 is in this string"
)
Here is a base solution using rapply() w/ grep() to recurse through the results of strsplit() and works with a vector of strings.
Note: swap " " and fixed = TRUE with "\\s+" and fixed = FALSE (the default) if you want to split the strings on any whitespace instead of a literal space.
rapply(strsplit(strings, " ", fixed = TRUE), function(x) grep("[0-9]+", x))
[1] 9 1 5
Data:
strings = c("Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string", "extract where the first 97865 is in this string")
Try the following:
library(stringr)
position_first_number <- function(string) {
min(which(str_detect(str_split(string, "\\s+", simplify = TRUE), "[0-9]+")))
}
With your example strings:
> string1 <- "Hello I'd like to extract where the first 1010 is in this string"
> position_first_number(string1)
[1] 9
> string2 <- "80111 is in this string"
> position_first_number(string2)
[1] 1
> string3 <- "extract where the first 97865 is in this string"
> position_first_number(string3)
[1] 5

How to change the word separator character in a vector?

I have a character vector consisting of the following style:
mylist <- c('John Myer Stewert','Steve',' Michael Boris',' Daniel and Frieds','Michael-Myer')
I'm trying to create a character vector like this:
mylist <- c('John+Myer+Stewert','Steve',' Michael+Boris',' Daniel+and+Frieds','Michael+Myer')
I have tried:
test <- cat(paste(shQuote(mylist , type="cmd"), collapse="+"))
That seems wrong. How can I change the word separator in mylist as shown above?
You could use chartr(). Just re-use the + sign for both space and - characters.
chartr(" -", "++", trimws(mylist))
# [1] "John+Myer+Stewert" "Steve" "Michael+Boris"
# [4] "Daniel+and+Frieds" "Michael+Myer"
Note that I also trimmed the leading whitespace since there is really no need to keep it.
We can use gsub by matching the space (" ") as pattern and replace it with "+".
gsub(" ", "+", trimws(mylist))
#[1] "John+Myer+Stewert" "Steve" "Michael+Boris"
#[4] "Daniel+and+Frieds" "Michael-Myer"
I assumed that the leading spaces were a typo. If they are not, we can either use regex lookarounds
gsub("(?<=[a-z])[ -](?=[[:alpha:]])", "+", mylist, perl = TRUE)
#[1] "John+Myer+Stewert" "Steve" " Michael+Boris"
#[4] " Daniel+and+Frieds" "Michael+Myer"
Or some PCRE regex
gsub("(^ | $)(*SKIP)(*F)|[ -]", "+", mylist, perl = TRUE)
#[1] "John+Myer+Stewert" "Steve" " Michael+Boris"
#[4] " Daniel+and+Frieds" "Michael+Myer"
You can use the package stringr.
library(stringr)
str_replace_all(trimws(mylist), "[ -]", "+")
#[1] "John+Myer+Stewert" "Steve" "Michael+Boris"
#[4] "Daniel+and+Frieds" "Michael+Myer"
Between [] we specify what we want to replace with +. In this case, that is a single white space and -. I used trimws from Akrun's answer to get rid of the extra white space in the beginning of some elements in your string.
This is yet another alternative.
library(stringi)
stri_replace_all_regex(trimws(mylist), "[ -]", "+")
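For reference, a short base R sketch confirming that the chartr() approach and a "[ -]" character class give the same result on the example data once the leading spaces are trimmed. The names via_chartr and via_gsub are illustrative.

```r
mylist <- c('John Myer Stewert', 'Steve', ' Michael Boris',
            ' Daniel and Frieds', 'Michael-Myer')

# chartr() maps each character of " -" to the matching character of "++".
via_chartr <- chartr(" -", "++", trimws(mylist))

# The character class [ -] matches a single space or a hyphen.
via_gsub <- gsub("[ -]", "+", trimws(mylist))

print(via_chartr)
print(identical(via_chartr, via_gsub))
```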

Removing punctuation between two words

I have a data frame (df) and I would like to remove punctuation.
However, there is an issue with a dot between two words and at the end of a word, like this:
test.
test1.test2
I use this to remove the punctuation:
library(tm)
removePunctuation(df)
and the result I get is this:
test
test1test2
but I would like to get this result:
test
test1 test2
How can I get a space between the two words when removing the punctuation?
You can use chartr for single character substitution:
chartr(".", " ", c("test1.test2"))
# [1] "test1 test2"
@akrun suggested trimws to remove the space at the end of your test string:
str <- c("test.", "test1.test2")
trimws(chartr(".", " ", str))
# [1] "test" "test1 test2"
We can use gsub to replace the . with a white space and remove the trailing/leading spaces (if any) with trimws.
trimws(gsub('[.]', ' ', str1))
#[1] "test" "test1 test2"
NOTE: In regex, . by itself means any character. So we should either keep it inside square brackets ([.]), escape it (\\.), or use the option fixed=TRUE
trimws(gsub('.', ' ', str1, fixed=TRUE))
data
str1 <- c("test.", "test1.test2")
You can also use strsplit:
a <- "test."
b <- "test1.test2"
do.call(paste, as.list(strsplit(a, "\\.")[[1]]))
[1] "test"
do.call(paste, as.list(strsplit(b, "\\.")[[1]]))
[1] "test1 test2"
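The answers above deal with the dot only. If other punctuation should be handled the same way, one possible two-step sketch is shown below: punctuation between two word characters becomes a space, and any remaining punctuation is removed. The third test string and the names x, spaced and cleaned are assumptions for illustration.

```r
x <- c("test.", "test1.test2", "a,b! c.d")

# Step 1: punctuation with a word character on both sides becomes a space.
spaced <- gsub("(?<=\\w)[[:punct:]]+(?=\\w)", " ", x, perl = TRUE)

# Step 2: all remaining punctuation (e.g. a trailing dot) is dropped.
cleaned <- gsub("[[:punct:]]+", "", spaced)
print(cleaned)
```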
