Removing repeated, nonpunctuation character from a string - r

I have a string in R with several, non-punctuation repeated characters (a pound sign). I am trying to remove the repeatedness of the pound sign "#" but keep only one to separate the words in the string. The number of pound signs between words is random and is not always the same.
For example:
String="##Hello####World#Happy#######New###Ye#r!"
transform into
String_New="#Hello#World#Happy#New#Ye#r!"
Does the gsub command handle non-punctuation signs?

We need the + quantifier, i.e. one or more of the preceding character, to match each run of #, and a single # as the replacement:
gsub("#+", "#", String)
#[1] "#Hello#World#Happy#New#Ye#r!"

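Here is a quick way to do what you want for this particular string (it relies on the specific run lengths present, so it is not a general solution):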
a <- "##Hello####World#Happy#######New###Year"
b <- gsub('#######', '#', a)
b <- gsub('###', '#', b)
b <- gsub('##', '#', b)
And yes, gsub can handle non-punctuation characters as well if you need it to.
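A small sketch of why the quantifier-based pattern is the more general choice: the chained fixed-length replacements depend on which run lengths occur, whereas "#+" covers any run length. The input below, with a run of five pound signs, is a made-up string and not taken from the question.

```r
# Hypothetical input with a run of five "#" characters
a <- "Hello#####World"

# Chained fixed-length replacements, as in the answer above
b <- gsub("#######", "#", a)
b <- gsub("###", "#", b)
b <- gsub("##", "#", b)
b
# [1] "Hello##World"

# One-or-more quantifier: every run collapses to a single "#"
gsub("#+", "#", a)
# [1] "Hello#World"
```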

Related

R conditional logic to replace a character in a string, based on the preceding and following characters in the string

I have a vector of strings in which I have replaced the spaces with underscores. I'm going to reconvert them to spaces, however, there are some syntax errors in the original data which means that some of the spaces shouldn't actually be spaces. I have some simple conditional logic to describe the circumstances when an underscore should be replaced with a space, when it should be replaced with a dash (-), or when it should be removed altogether.
The strings are chemical compound names. In cases where the underscore follows or precedes a number, the underscore should be replaced with a dash ("-"). In cases where the underscore precedes and follows a letter, it should be replaced with a space (" "). And where an underscore precedes or follows a dash, the underscore should be removed without replacement. More than one of these scenarios may apply in different places in a given string. An additional issue is that where a numerical digit directly follows or precedes a letter, there should be a dash between them.
Here is a minimal dataset that demonstrates all of these scenarios and the desired result. Note that the actual dataset has over 35 thousand entries (only 670 unique ones though).
names
[1] "1,8_cineole" "geranyl_acetate" "AR_curcumene" "trans_trans_a-farnesene" "trans_muurola_4,5_diene"
[6] "p_cymene" "a_-_pinene" "cadina_3,5_diene" "germacrene_D" "trans_cadina1,4diene"
converted_names
[1] "1,8-cineole" "geranyl acetate" "AR curcumene" "trans trans a-farnesene" "trans muurola-4,5-diene"
[6] "p cymene" "a-pinene" "cadina-3,5-diene" "germacrene D" "trans cadina-1,4-diene"
I was thinking about approaching this through nested loops that iterate through the names list and then split the string for each name and iterate through the individual characters of the name, but I'm getting a bit lost in applying the conditional logic required to substitute individual characters in the string.
convert_compound_names<-function(x){
underscore_locations<-lapply(strsplit(x,""),function(x) which(x=="_"))
digit_locations<-lapply(strsplit(x,""),function(x) grep("\\d",x))
for(i in c(1:length(x))){
split_name<-unlist(strsplit(x[i],""))
for (j in c(1:length(split_name))){
#some conditional logic to replace underscores here
}
x[i]<-paste0(split_name[1:length(split_name)],collapse="")
}
return(x)
}
I also wondered if the conditional logic could be incorporated into a gsub function and the looping may not be necessary..?
For the record, I'm a chemist, not a programmer or data-scientist, so any advice, suggestions, or moral support would be appreciated.
Thanks for reading.
I worked out the conditional logic needed to address the substitution of underscores within the loops that I proposed above:
convert_compound_names<-function(x){
for(i in c(1:length(x))){
split_name<-unlist(strsplit(x[i],""))
for (j in c(1:length(split_name))){
#some conditional logic to replace underscores here
if(split_name[j]=="_"){
if(grepl("\\d",split_name[j-1])|(grepl("\\d",split_name[j+1]))){split_name[j]<-"-"}
else if(grepl("-",split_name[j-1])|(grepl("-",split_name[j+1]))){split_name[j]<-""}
else if(grepl("[a-zA-Z]",split_name[j-1])&&(grepl("[a-zA-Z]",split_name[j+1]))){split_name[j]<-" "}
}
}
x[i]<-paste0(split_name[1:length(split_name)],collapse="")
}
return(x)
}
However I'm sure there's a more straightforward way of doing this to be found.
chem_names <- c("1,8_cineole", "geranyl_acetate", "AR_curcumene", "trans_trans_a-farnesene",
"trans_muurola_4,5_diene", "p_cymene", "a_-_pinene", "cadina_3,5_diene",
"germacrene_D", "trans_cadina1,4diene")
This sounds like a regex problem, which I'm still new at, but I think the code below will do what you want.
Here, I first replace every underscore that is followed by a digit with "-", using a lookahead of the form (?=\\d) so that the digit is checked but not replaced. Then I do the same for an underscore that follows a digit, using a lookbehind (?<=\\d), and finally turn any remaining underscores into spaces.
library(dplyr); library(stringr)
data.frame(chem_names) %>%
mutate(chem_names2 = chem_names %>%
str_replace_all("\\_(?=\\d)", "-") %>% # replace _# with -
str_replace_all("(?<=\\d)\\_", "-") %>% # replace #_ with -
str_replace_all("\\_", " ")) # replace _ with space
Result
chem_names chem_names2
1 1,8_cineole 1,8-cineole
2 geranyl_acetate geranyl acetate
3 AR_curcumene AR curcumene
4 trans_trans_a-farnesene trans trans a-farnesene
5 trans_muurola_4,5_diene trans muurola-4,5-diene
6 p_cymene p cymene
7 a_-_pinene a - pinene
8 cadina_3,5_diene cadina-3,5-diene
9 germacrene_D germacrene D
10 trans_cadina1,4diene trans cadina1,4diene
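In the result above, rows 7 and 10 still differ from the desired output ("a-pinene" and "trans cadina-1,4-diene"). A minimal base R sketch that also covers those two rules might look like the following. The function name convert_names is hypothetical, and the order of the steps is an assumption that happens to work for this sample data.

```r
chem_names <- c("1,8_cineole", "geranyl_acetate", "AR_curcumene", "trans_trans_a-farnesene",
                "trans_muurola_4,5_diene", "p_cymene", "a_-_pinene", "cadina_3,5_diene",
                "germacrene_D", "trans_cadina1,4diene")

# Hypothetical helper: applies the rules from the question one after another
convert_names <- function(x) {
  x <- gsub("_*-_*", "-", x)                  # drop underscores next to a dash
  x <- gsub("_(\\d)", "-\\1", x)              # underscore before a digit -> dash
  x <- gsub("(\\d)_", "\\1-", x)              # underscore after a digit  -> dash
  x <- gsub("([A-Za-z])(\\d)", "\\1-\\2", x)  # letter directly followed by digit
  x <- gsub("(\\d)([A-Za-z])", "\\1-\\2", x)  # digit directly followed by letter
  gsub("_", " ", x)                           # remaining underscores -> space
}

convert_names(chem_names)
```

Capture groups are used instead of lookarounds so that the default regex engine suffices and no perl = TRUE is needed.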
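I think the regex to accomplish this is in fact relatively simple. We use ifelse to check for a condition; the condition is that str_detect detects a digit \\d. If it does, then _ is replaced by -. If it does not, _ is replaced by whitespace. Note that this replaces every underscore in a name that contains a digit, so "trans_muurola_4,5_diene" becomes "trans-muurola-4,5-diene":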
library(dplyr)
library(stringr)
data.frame(chem_names) %>%
mutate(chem_names = ifelse(str_detect(chem_names, "\\d"),
gsub("_", "-", chem_names),
gsub("_", " ", chem_names)))
chem_names
1 1,8-cineole
2 geranyl acetate
3 AR curcumene
4 trans trans a-farnesene
5 trans-muurola-4,5-diene
6 p cymene
7 a - pinene
8 cadina-3,5-diene
9 germacrene D
10 trans-cadina1,4diene
Data:
chem_names <- c("1,8_cineole", "geranyl_acetate", "AR_curcumene", "trans_trans_a-farnesene",
"trans_muurola_4,5_diene", "p_cymene", "a_-_pinene", "cadina_3,5_diene",
"germacrene_D", "trans_cadina1,4diene")

Detect substring within a string while not considering part of the substring

I'm trying to check whether string B is contained by string A and this is what I tried:
library(stringr)
string_a <- "something else free/1a2b a bird yes"
string_b <- "free/xxxx a bird"
str_detect(string_a, string_b)
I would expect a match (TRUE), because I want to ignore the part of string_b that comes after the "/" and before the next whitespace, which is why I put "/xxxx" there.
In other words, the "/xxxx" should stand for any letters or digits in that position. Is there a notation for ignoring part of a string when matching like this?
Yes, in regex you can use .* to match zero or more characters.
library(stringr)
string_a <- "something else free/1a2b a bird yes"
string_b <- "free/xxxx a bird"
string_c <- "free/.*a bird"
str_detect(string_a, string_c)
#[1] TRUE
If you cannot change string_b at source, you may use str_replace_all or gsub to replace xxxx with '.*'.
str_detect(string_a, str_replace_all(string_b, 'x+', '.*'))
#[1] TRUE
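Because .* can span any amount of text, including spaces, a stricter placeholder may be preferable when the ignored part should be a single token. The sketch below uses \\S+ (one or more non-whitespace characters) in base R. The string_far input is made up for illustration.

```r
string_a   <- "something else free/1a2b a bird yes"
string_far <- "free/1a2b and much later a bird"  # hypothetical counter-example

loose  <- "free/.*a bird"     # placeholder may span several words
strict <- "free/\\S+ a bird"  # placeholder is a single non-space token

grepl(loose,  string_a,   perl = TRUE)  # TRUE
grepl(loose,  string_far, perl = TRUE)  # TRUE, although the text in between is long
grepl(strict, string_a,   perl = TRUE)  # TRUE
grepl(strict, string_far, perl = TRUE)  # FALSE
```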

Count with how many spaces a string starts

I want to know with how many spaces a string starts. Here are some examples:
string.1 <- "    starts with 4 spaces"
string.2 <- "  starts with only 2 spaces"
My attempt was the following but this leads to 1 in both cases and I understand why this is the case.
stringr::str_count(string.1, "^ ")
stringr::str_count(string.2, "^ ")
I'd prefer if there was a solution completely like this but with another regex.
The ^ pattern matches a single space at the start of the string, that is why both test cases return 1.
To match consecutive spaces at the start of the string, you may use
stringr::str_count(string.1, "\\G ")
Or, to count any whitespaces,
stringr::str_count(string.1, "\\G\\s")
The \G anchor matches at the start of the string and then at the end of each previous successful match, so "\\G " matches the first space and every consecutive space after it.
Another approach: count the length of ^\s+ matches (1 or more whitespace chars at the start of the string):
strings <- c("    starts with 4 spaces", "  starts with only 2 spaces")
matches <- regmatches(strings, regexpr("^\\s+", strings))
sapply(matches, nchar)
# => 4 2
One approach might be to take the nchar of the input string, with all content from the first non whitespace character until the end stripped.
string.1 <- "    starts with 4 spaces"
nchar(sub("\\S.*$", "", string.1))
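A further base R sketch, offered as one possible variation: regexpr stores the length of each match in its "match.length" attribute, and the pattern "^ *" also matches zero spaces, so strings without leading spaces yield 0 instead of being dropped (regmatches omits elements that do not match). The third test string is an assumption added for illustration.

```r
strings <- c("    starts with 4 spaces",
             "  starts with only 2 spaces",
             "no leading spaces")  # hypothetical extra case

# Length of the (possibly empty) run of spaces at the start of each string
leading <- attr(regexpr("^ *", strings), "match.length")
leading
# [1] 4 2 0
```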

r: regex for containing pattern with negation

Suppose I have the following strings and want to use grep to see which match:
business_metric_one
business_metric_one_dk
business_metric_one_none
business_metric_two
business_metric_two_dk
business_metric_two_none
And so on for various other metrics. I want to only match the first one of each group (business_metric_one and business_metric_two and so on). They are not in an ordered list so I can't index and have to use grep. At first I thought to do:
.*metric.*[^_dk|^_none]$
But this doesn't seem to work. Any ideas?
You need to use a PCRE pattern to filter the character vector:
x <- c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")
grep("metric(?!.*_(?:dk|none))", x, value=TRUE, perl=TRUE)
## => [1] "business_metric_one" "business_metric_two"
The metric(?!.*_(?:dk|none)) pattern matches
metric - a metric substring
(?!.*_(?:dk|none)) - a negative lookahead that fails the match if, anywhere after metric on the same line, there is a _ followed by either dk or none.
NOTE: if you need to match only such values that contain metric and do not end with _dk or _none, use a variation, metric.*$(?<!_dk|_none) where the (?<!_dk|_none) negative lookbehind fails the match if the string ends with either _dk or _none.
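The difference between "does not contain _dk or _none after metric" and "does not end with _dk or _none" can be seen in a short sketch. The third element below is a made-up name, and the second filter uses two plain grepl calls as an assumed alternative to the lookbehind form.

```r
# "business_metric_dk_total" is a hypothetical name, not from the question
x2 <- c("business_metric_one", "business_metric_one_dk", "business_metric_dk_total")

# Lookahead version: rejects _dk / _none anywhere after "metric"
grep("metric(?!.*_(?:dk|none))", x2, value = TRUE, perl = TRUE)
# [1] "business_metric_one"

# End-anchored version: only rejects names that END in _dk or _none
x2[grepl("metric", x2, fixed = TRUE) & !grepl("_(dk|none)$", x2)]
# [1] "business_metric_one"      "business_metric_dk_total"
```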
You can also do something like this:
grep("^([[:alpha:]]+_){2}[[:alpha:]]+$", string, value = TRUE)
# [1] "business_metric_one" "business_metric_two"
or use grepl to match dk and none, then negate the logical when you're indexing the original string:
string[!grepl("(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
or, more specifically:
string[!grepl("business_metric_[[:alpha:]]+_(dk|none)", string)]
# [1] "business_metric_one" "business_metric_two"
Data:
string = c("business_metric_one","business_metric_one_dk","business_metric_one_none","business_metric_two","business_metric_two_dk","business_metric_two_none")

Replace all characters between the 3rd occurrence of “-” and the ":" in each element of a vector

Here is what I am trying to do:
Given a string, I want to remove everything from the third occurrence of '-' up to the ':' character, assuming there is a third occurrence, which there may not be.
This is my expected result :
Initial string
yy-aa-bbb-cccc1:HYT => yy-aa-bbb:HYT
yy-aa-vvv-vv:ZTR => yy-aa-vvv:ZTR
yy-aa-ddd:YTLM => yy-aa-ddd:YTLM
Any help?
gsub('(.*-.*-.*)\\-.*(\\:.*)','\\1\\2',string)
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa-ddd:YTLM"
We match two instances of one or more characters that are not a - followed by a - (([^-]+-){2}), followed by another run of characters that are neither - nor : ([^-:]+), and capture all of this as the first group. Any remaining characters up to the colon ([^:]*) are matched but not captured, and the last capture group ((:.*)) takes the colon and everything after it. The match is replaced with the backreferences to the first and third capture groups.
sub("(([^-]+-){2}[^-:]+)[^:]*(:.*)", "\\1\\3", str1)
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa-ddd:YTLM"
data
str1 <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM")
Match the first three fields and everything afterwards up to the colon, and replace that with the first three fields and the colon. Strings without a fourth field do not match and are left unchanged. Note that \w matches any word character and the \ needs to be doubled inside "..." :
sub("(\\w+-\\w+-\\w+)-.+:", "\\1:", xx)
## [1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa-ddd:YTLM"
Note: The input xx in reproducible form is:
xx <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM")
Just throwing a stringi solution in there.
library(stringi)
sub('_.*:' ,':', stri_replace_last_fixed(x, '-', '_'))
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa:YTLM"
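As the last output shows, replacing the last "-" shortens the element that has no third "-". A hedged base R sketch that anchors at the start of the string and leaves such elements untouched could look like this. The helper name strip_after_third_dash and the last two test strings are assumptions for illustration.

```r
# Hypothetical helper: remove everything from the third "-" up to the ":"
strip_after_third_dash <- function(x) {
  # group 1: three dash-separated fields; then "-", anything but ":", then ":"
  sub("^((?:[^-:]*-){2}[^-:]*)-[^:]*:", "\\1:", x, perl = TRUE)
}

x <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM",
       "a-b-c-d-e:Z",   # more than three dashes (made-up)
       "a-b-c-d")       # no colon at all (made-up)

strip_after_third_dash(x)
# [1] "yy-aa-bbb:HYT"  "yy-aa-vvv:ZTR"  "yy-aa-ddd:YTLM" "a-b-c:Z"        "a-b-c-d"
```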
