Replace specific characters within strings - r

I would like to remove specific characters from strings within a vector, similar to the Find and Replace feature in Excel.
Here are the data I start with:
group <- data.frame(c("12357e", "12575e", "197e18", "e18947")
I start with just the first column; I want to produce the second column by removing the e's:
group group.no.e
12357e 12357
12575e 12575
197e18 19718
e18947 18947

With a regular expression and the function gsub():
group <- c("12357e", "12575e", "197e18", "e18947")
group
[1] "12357e" "12575e" "197e18" "e18947"
gsub("e", "", group)
[1] "12357" "12575" "19718" "18947"
What gsub() does here is replace each occurrence of "e" with the empty string "".
See ?regex or ?gsub for more help.
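As a small sketch of how sub() and gsub() differ, and of the fixed = TRUE option: the vector x below is a made-up example, not data from the question.

```r
# Hypothetical input with more than one "e" per element
x <- c("e12e", "3e4e5")

# sub() replaces only the first match in each element
first_only <- sub("e", "", x)

# gsub() replaces every match in each element
all_gone <- gsub("e", "", x)

# fixed = TRUE treats the pattern as a literal string rather than a regex
all_gone_fixed <- gsub("e", "", x, fixed = TRUE)

print(first_only)
print(all_gone)
```

For a plain letter such as "e" the regex and fixed versions give the same result; the distinction matters only once the pattern contains regex metacharacters.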

Regular expressions are your friends:
R> ## also adds missing ')' and sets column name
R> group <- data.frame(group=c("12357e", "12575e", "197e18", "e18947"))
R> group
group
1 12357e
2 12575e
3 197e18
4 e18947
Now use gsub() with the simplest possible replacement, the empty string:
R> group$groupNoE <- gsub("e", "", group$group)
R> group
group groupNoE
1 12357e 12357
2 12575e 12575
3 197e18 19718
4 e18947 18947
R>

Summarizing 2 ways to replace strings:
group<-data.frame(group=c("12357e", "12575e", "197e18", "e18947"))
1) Use gsub
group$group.no.e <- gsub("e", "", group$group)
2) Use the stringr package
library(stringr)
group$group.no.e <- str_replace_all(group$group, "e", "")
Both will produce the desired output:
group group.no.e
1 12357e 12357
2 12575e 12575
3 197e18 19718
4 e18947 18947

You do not need to create a data frame from a vector of strings if you want to replace some characters in it. Regular expressions are a good choice for this, as already mentioned by @Andrie and @Dirk Eddelbuettel.
Note that if you want to replace special characters such as dots, which are regex metacharacters, you must match them literally, for example with a character class as shown below:
ctr_names <- c("Czech.Republic","New.Zealand","Great.Britain")
gsub("[.]", " ", ctr_names)
this will produce
[1] "Czech Republic" "New Zealand" "Great Britain"
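To illustrate why the dot needs special treatment, here is a short sketch comparing an unescaped dot with three ways of matching it literally. The variable names are chosen only for this illustration.

```r
ctr_names <- c("Czech.Republic", "New.Zealand", "Great.Britain")

# An unescaped dot matches any character, so every character becomes a space
every_char <- gsub(".", " ", ctr_names)

# Three equivalent ways to match a literal dot
with_class  <- gsub("[.]", " ", ctr_names)               # character class
with_escape <- gsub("\\.", " ", ctr_names)               # escaped dot
with_fixed  <- gsub(".", " ", ctr_names, fixed = TRUE)   # no regex at all

print(with_class)
```

When the pattern is a plain string, fixed = TRUE is often the least error-prone choice because nothing needs escaping.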

Use the stringi package:
require(stringi)
group<-data.frame(c("12357e", "12575e", "197e18", "e18947"))
stri_replace_all(group[,1], "", fixed="e")
[1] "12357" "12575" "19718" "18947"

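> library(stringr)
> group <- c('12357e', '12575e', '12575e', ' 197e18', 'e18947')
> pattern <- "e"
> replacement <- ""
> # str_replace() changes only the first match in each element;
> # use str_replace_all() to replace every match
> group <- str_replace(group, pattern, replacement)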
> group
[1] "12357" "12575" "12575" " 19718" "18947"

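chartr is not suitable here: it translates characters one for one, so old and new must contain the same number of characters, and chartr("e", "", group$group) fails with the error 'old' is longer than 'new'. It helps only when each character is swapped for another one, for example:
group$group.E <- chartr("e", "E", group$group)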

Related

How to get the number between two characters in R

I have a vector a.
I want to extract the numbers between PUBMED and \nREFERENCE, which means the number is 32634600
I don't know how to code it using str_extract().
a = "234 4dfd 123PUBMED 32634600\nREFERENCE"
# expected output is 32634600
Using a lookbehind and stringr:
library(stringr)
str_extract_all(a, "(?<=PUBMED )[0-9]+")
[[1]]
[1] "32634600"
We can use sub() here with a capture group:
a <- "234 4dfd 123PUBMED 32634600\nREFERENCE"
num <- sub(".*PUBMED\\s*(\\d+)\\s*\\bREFERENCE\\b.*", "\\1", a)
num
[1] "32634600"
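The same lookbehind also works in base R without stringr. This is a minimal sketch using regexpr() and regmatches(); the names pubmed_id and pubmed_num are introduced here for illustration.

```r
a <- "234 4dfd 123PUBMED 32634600\nREFERENCE"

# perl = TRUE is needed for the lookbehind (?<=...)
m <- regexpr("(?<=PUBMED )[0-9]+", a, perl = TRUE)

# regmatches() returns the matched text as a character vector
pubmed_id <- regmatches(a, m)

# Convert to a number only if a numeric value is actually needed
pubmed_num <- as.numeric(pubmed_id)

print(pubmed_id)
```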

Export multiple matching pattern

I am trying to extract AOB1 or AOB2 or AOB3 from the string below.
df <- data.frame(
id = c(1,2,3),
string = c("acv-32-AOB1", "osa-122-AOB2","cds-543-rr-AOB3")
)
> df
id string
1 1 acv-32-AOB1
2 2 osa-122-AOB2
3 3 cds-543-rr-AOB3
Any ideas?
Thanks!
We can use trimws from base R
trimws(df$string, whitespace =".*-")
[1] "AOB1" "AOB2" "AOB3"
Or use sub from base R
sub(".*-", "", df$string)
[1] "AOB1" "AOB2" "AOB3"
Or if we need to extract 'AOB' followed by digits
library(stringr)
str_extract(df$string, "AOB\\d+")
[1] "AOB1" "AOB2" "AOB3"
You can use regular expressions for this:
.* Match anything
(AOB[1-3]) then match AOB followed by a 1, 2 or 3
\\1 replace the entire string with the matched AOB1-3 slot
gsub(".*(AOB[1-3])", "\\1", df$string)
Here is one more, dear friends:
m <- gregexpr("[a-zA-Z]{3}\\d{1}", df$string)
unlist(regmatches(df$string, m))
[1] "AOB1" "AOB2" "AOB3"
I made some modifications so that the desired pattern can appear anywhere in the string; you can use the following solution to extract it:
df$res <- gsub("(.*)?([A-Z]{3}\\d)(.*)?", "\\2", df$string, perl = TRUE)
df
id string res
1 1 acv-32-AOB1 AOB1
2 2 osa-122-AOB2 AOB2
3 3 cds-543-rr-AOB3 AOB3
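The approaches above behave differently when a string contains no match. The sketch below uses a made-up vector whose last element has no AOB code: sub() returns such a string unchanged, while a regexpr()-based extraction can mark it as NA.

```r
# Hypothetical input: the last element contains no AOB code
strings <- c("acv-32-AOB1", "cds-543-rr-AOB3", "xyz-999")

# sub() leaves non-matching strings unchanged
by_sub <- sub(".*(AOB[0-9]+).*", "\\1", strings)

# regexpr() reports -1 where nothing matches; keep NA at those positions
m <- regexpr("AOB[0-9]+", strings)
by_match <- rep(NA_character_, length(strings))
by_match[m > 0] <- regmatches(strings, m)

print(by_sub)
print(by_match)
```

Which behaviour is preferable depends on whether an unmatched string should be passed through or flagged as missing.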

Count number of occurrences when string contains substring

I have a string like
'abbb'
I need to find out how many times the substring 'bb' occurs in it.
grep('bb','abbb')
returns 1, but the answer I need is 2 (a-bb and ab-bb). How can I count occurrences this way, including overlapping ones?
You can make the pattern non-consuming with a lookahead, '(?=bb)', so that overlapping matches are found:
x <- 'abbb'
length(gregexpr('(?=bb)', x, perl=TRUE)[[1]])
[1] 2
Here is an ugly approach using substr and sapply:
input <- "abbb"
search <- "bb"
res <- sum(sapply(1:(nchar(input)-nchar(search)+1),function(i){
substr(input,i,i+(nchar(search)-1))==search
}))
We can use stri_count
library(stringi)
stri_count_regex(input, '(?=bb)')
#[1] 2
stri_count_regex(x, '(?=bb)')
#[1] 0 1 0
data
input <- "abbb"
x <- c('aa','bb','ba')
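One caveat with the gregexpr() approach is that it returns -1 when nothing matches, so taking length() of the result reports 1 instead of 0. A minimal sketch of a helper that avoids this follows; the function name count_overlapping is hypothetical, and it assumes the pattern is a valid regex fragment.

```r
# Count overlapping occurrences of `pattern` (a regex) in a single string
count_overlapping <- function(string, pattern) {
  # Wrap the pattern in a lookahead so matches consume no characters
  m <- gregexpr(paste0("(?=", pattern, ")"), string, perl = TRUE)[[1]]
  # gregexpr() returns -1 when nothing matches, so count positive positions only
  sum(m > 0)
}

n_abbb <- count_overlapping("abbb", "bb")
n_aa   <- count_overlapping("aa", "bb")

print(n_abbb)
print(n_aa)
```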

Get rid of repetitive characters from a column name in R

Here is a portion of my large dataframe
> a
SS29.SS29 PP1.PP1 SS4.SS4 CC43.CC43 FF57.FF57 NN23.NN23 MM25.MM25 KK9.KK9 MM55.MM55 AA75.AA75 SS88.SS88
1 669.9544 1.068153 35.86534 24.47688 1.058007 72.20306 1.854856 10.15414 0.08715572 0.02006310 0.1817582
2 651.2092 1.164428 37.59895 27.41381 1.095322 73.48029 1.927993 10.09958 0.09096972 0.02261701 0.1855258
How can I get rid of the doubled column names separated by a dot? For example, for the first column I'd like to have SS29 instead of the repetitive SS29.SS29, for the second column PP1, and so on. Is there an automated way of doing it?
The simplest way would be to use sub to remove the dot and everything after it.
names(a) <- sub('\\.[^.]*', '', names(a))
You could use sub
names(a) <- sub("[.](.*)", "", names(a))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or a substring
substring(names(a), 1, regexpr("[.]", names(a))-1)
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or strsplit
names(a) <- unlist(strsplit(names(a), "[.](.*)"))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
You can assign new column names with
colnames(a) <- new_column_names
To compute new_column_names, you can use regular expressions, e.g. the gsub function, as ssdecontrol suggested.
new_column_names <- gsub(...)
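As one possible sketch of computing the new names, a backreference can restrict the change to names whose two halves really are identical. The vector old_names is a made-up example (its last entry is deliberately not a repeated pair), and this is an illustration rather than the call the answer above had in mind.

```r
# Hypothetical column names; the last one is not a repeated pair
old_names <- c("SS29.SS29", "PP1.PP1", "AA75.BB75")

# ^(.*)\.\1$ matches only when the text after the dot repeats the text before it
new_column_names <- sub("^(.*)\\.\\1$", "\\1", old_names, perl = TRUE)

print(new_column_names)
```

Names that do not fit the repeated pattern are left untouched, which guards against silently truncating a column that merely contains a dot.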

Remove part of a string

How do I remove part of a string? For example in ATGAS_1121 I want to remove everything before _.
Use regular expressions. In this case, you can use gsub:
gsub("^.*?_","_","ATGAS_1121")
[1] "_1121"
This regular expression matches the beginning of the string (^), any character (.) repeated zero or more times (*), and an underscore (_). The ? makes the match "lazy" so that it only matches as far as the first underscore. That match is replaced with just an underscore. See ?regex for more details and references.
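The effect of the lazy quantifier only shows when a string has more than one underscore. Here is a short sketch with a made-up string s; perl = TRUE is used here as an assumption to make the quantifier semantics explicit.

```r
# Hypothetical string with two underscores
s <- "ATGAS_1121_x"

# Lazy: stops at the first underscore
lazy <- sub("^.*?_", "_", s, perl = TRUE)

# Greedy: runs to the last underscore
greedy <- sub("^.*_", "_", s, perl = TRUE)

# Replace with "" instead of "_" to drop the underscore as well
without_underscore <- sub("^.*?_", "", s, perl = TRUE)

print(lazy)
print(greedy)
```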
You can use a built-in for this, strsplit:
> s = "TGAS_1121"
> s1 = unlist(strsplit(s, split='_', fixed=TRUE))[2]
> s1
[1] "1121"
strsplit returns a list with one element per input string; each element holds the pieces of that string split on the split parameter. That's probably not what you want, so wrap the call in unlist, then index the resulting vector so that only the second of the two elements is returned.
Finally, the fixed parameter should be set to TRUE to indicate that the split parameter is not a regular expression, but a literal matching character.
If you're a Tidyverse kind of person, here's the stringr solution:
R> library(stringr)
R> strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
R> strings %>% str_replace(".*_", "_")
[1] "_1121" "_1432" "_1121"
# Or:
R> strings %>% str_replace("^[A-Z]*", "")
[1] "_1121" "_1432" "_1121"
Here's the strsplit solution if s is a vector:
> s <- c("TGAS_1121", "MGAS_1432")
> s1 <- sapply(strsplit(s, split='_', fixed=TRUE), function(x) (x[2]))
> s1
[1] "1121" "1432"
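A variant of the vector solution, sketched with vapply() so that the result type is checked. The third input element is a made-up string without an underscore, included to show what happens when there is nothing to split on.

```r
# Hypothetical input; the last element has no underscore
s <- c("TGAS_1121", "MGAS_1432", "NOUNDERSCORE")

# strsplit() returns a list with one character vector per input string
parts <- strsplit(s, split = "_", fixed = TRUE)

# Take the second piece of each; vapply() guarantees a character vector back
s1 <- vapply(parts, function(x) x[2], character(1))

print(s1)
```

Elements without an underscore have no second piece, so they come back as NA rather than raising an error.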
Probably the most intuitive solution is the stringr function str_remove, which is even easier than str_replace because it takes only the pattern and no replacement argument.
The only tricky part in your example is that you want to keep the underscore, but it's possible: use a lookahead (?=pattern) so that the match stops just before the specified pattern.
See example:
strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
stringr::str_remove(strings, ".+?(?=_)")
[1] "_1121" "_1432" "_1121"
Here is the strsplit solution for a data frame, using the dplyr package:
library(dplyr)
col1 = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
col2 = c("T", "M", "A")
df = data.frame(col1, col2)
df
col1 col2
1 TGAS_1121 T
2 MGAS_1432 M
3 ATGAS_1121 A
df<-mutate(df,col1=as.character(col1))
df2<-mutate(df,col1=sapply(strsplit(df$col1, split='_', fixed=TRUE),function(x) (x[2])))
df2
col1 col2
1 1121 T
2 1432 M
3 1121 A

Resources