Count number of occurrences when string contains substring

I have a string like
'abbb'
I need to find out how many times the substring 'bb' occurs in it.
grep('bb','abbb')
returns 1, but the answer I want is 2 (a-bb and ab-bb), because overlapping matches should be counted. How can I count the occurrences this way?

You can make the pattern non-consuming with the lookahead '(?=bb)', so that overlapping matches are found, as in:
x <- 'abbb'
length(gregexpr('(?=bb)', x, perl=TRUE)[[1]])
[1] 2
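
One caveat with taking length() of a gregexpr result: when there is no match at all, gregexpr returns -1, so the length is still 1. Here is a minimal sketch of a wrapper that counts only real match positions. The name count_overlapping is made up for illustration, and the sketch assumes the substring contains no regex metacharacters.

```r
# Hypothetical helper: count overlapping occurrences of `sub` in each element of `x`.
# Assumes `sub` has no regex metacharacters (it is pasted into the pattern as-is).
count_overlapping <- function(x, sub) {
  m <- gregexpr(paste0("(?=", sub, ")"), x, perl = TRUE)
  # gregexpr reports "no match" as -1, so count only positive positions
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

count_overlapping(c("abbb", "aa", "bb"), "bb")
# [1] 2 0 1
```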

Here is an ugly approach using substr and sapply:
input <- "abbb"
search <- "bb"
res <- sum(sapply(1:(nchar(input)-nchar(search)+1),function(i){
substr(input,i,i+(nchar(search)-1))==search
}))
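
The same sliding-window idea can probably be written without the anonymous function, since substring is vectorized over its start and stop positions. This is only a sketch, and count_windows is a hypothetical name, not something from the answer above.

```r
# Hypothetical helper: compare every window of width nchar(search) against `search`.
count_windows <- function(input, search) {
  n_starts <- nchar(input) - nchar(search) + 1
  if (n_starts < 1) return(0L)          # input shorter than search: no windows
  starts <- seq_len(n_starts)
  sum(substring(input, starts, starts + nchar(search) - 1) == search)
}

count_windows("abbb", "bb")
# [1] 2
count_windows("a", "bb")
# [1] 0
```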

We can use stri_count
library(stringi)
stri_count_regex(input, '(?=bb)')
#[1] 2
stri_count_regex(x, '(?=bb)')
#[1] 0 1 0
data
input <- "abbb"
x <- c('aa','bb','ba')

Related

making for loop for character vector in R

char_vector <- c("Africa", "identical", "ending" ,"aa" ,"bb", "rain" ,"Friday" ,"transport") # character vector
Suppose I have the above character vector.
I would like to create a for loop that prints only the elements of the vector that have more than 5 characters and start with a vowel,
and also deletes from the vector those elements that do not start with a vowel.
I created this for loop, but it also prints empty results (character(0)):
for (i in char_vector){
if (str_length(i) > 5){
i <- str_subset(i, "^[AEIOUaeiou]")
print(i)
}
}
The result for the above is
[1] "Africa"
[1] "identical"
[1] "ending"
character(0)
character(0)
My desired result would only be the first 3 characters
I'm really new to R and facing huge difficulty with creating a for loop for this problem. Any help would be greatly appreciated!

Use grepl with the pattern ^[AEIOUaeiou]\w{5,}$:
char_vector <- c("Africa", "identical", "ending" ,"aa" ,"bb", "rain" ,"Friday" ,"transport")
char_vector <- char_vector[grepl("^[AEIOUaeiou]\\w{5,}$", char_vector)]
char_vector
[1] "Africa" "identical" "ending"
The regex pattern used here matches words that satisfy all of the following:
^ anchors the match at the start of the word
[AEIOUaeiou] the first character is a vowel
\w{5,} followed by 5 or more word characters (so the total length is greater than 5)
$ anchors the match at the end of the word
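
To see how the pieces of the pattern interact, here is a small sketch on a made-up vector (the words are my own examples, not from the question). A 5-letter word that starts with a vowel is rejected because only 4 characters follow the vowel.

```r
# Made-up test words: 6 letters + vowel, 5 letters + vowel, 6 letters + vowel, no vowel
words <- c("Africa", "apple", "Oxford", "rain")

by_regex  <- grepl("^[AEIOUaeiou]\\w{5,}$", words)
# For plain words this should agree with checking the two conditions separately
by_length <- nchar(words) > 5 & grepl("^[AEIOUaeiou]", words)

by_regex
# [1]  TRUE FALSE  TRUE FALSE
```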

You don't need a for loop, because these functions are vectorized in R.
A simple solution using grep and substr (refer to Tim Biegeleisen's answer for details):
substr(grep('^[aeiou].{5}', char_vector, ignore.case = TRUE, value = TRUE), 1, 3)
# [1] "Afr" "ide" "end"

With stringr functions, you'd rather use str_detect instead of str_subset, and you can take advantage of the fact that those functions are vectorized:
library(stringr)
char_vector[str_length(char_vector) > 5 & str_detect(char_vector, "^[AEIOUaeiou]")]
#[1] "Africa" "identical" "ending"
or, if you want to keep your for loop and collect the results in a single vector:
vec <- c()
for (i in char_vector){
if (str_length(i) > 5 && str_detect(i, "^[AEIOUaeiou]")){
vec <- c(vec, i)
}
}
vec
# [1] "Africa" "identical" "ending"
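
As far as I can tell, the character(0) lines in the question's output come from str_subset: it returns the matching elements, so for a single non-matching string it returns an empty vector, which print() then shows. str_detect returns TRUE or FALSE instead, which is what an if condition needs. A short sketch of the difference:

```r
library(stringr)

# "Friday" is longer than 5 characters but does not start with a vowel
subset_result <- str_subset("Friday", "^[AEIOUaeiou]")
detect_result <- str_detect("Friday", "^[AEIOUaeiou]")

subset_result
# character(0)
detect_result
# [1] FALSE
```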

The first 3 characters?
library(stringr)
for (i in char_vector){
if (str_length(i) > 5 && str_detect(i, "^[AEIOUaeiou]")) {
word <- str_sub(i, 1, 3)
print(word)
}
}
output is:
[1] "Afr"
[1] "ide"
[1] "end"

Using only base R functions. No need for a loop. I wrapped the steps in a function so you can reuse it with other character vectors. You could make this code shorter (see @utubun's answer), but I feel the process is easier to understand with a "one line, one step" approach.
char_vector <- c("Africa", "identical", "ending" ,"aa" ,"bb", "rain" ,"Friday" ,"transport")
yourfun <- function(char_vector){
char_vector <- char_vector[nchar(char_vector) > 5] # keep only the strings that are more than 5 characters long
char_vector <- char_vector[grep(pattern = "^[AEIOUaeiou]", char_vector)] # keep only the strings that start with a vowel
return(char_vector) # return the matching strings
# to get the first three characters of each string instead, replace the return() above with:
# out <- substring(char_vector, 1, 3) # select only the first 3 characters of each string
# return(out)
}
yourfun(char_vector = char_vector)
#> [1] "Africa" "identical" "ending"
Created on 2022-05-09 by the reprex package (v2.0.1)

deleting multiple substrings of string

I'm using R and I have a vector of strings with 1 and 2.
Examples of strings could be the following:
"11111111******111"
"11111111111***2222222"
"1111*****22222**111*****1111"
where "*" denote a gap.
I'm interested in deleting runs of gaps whose length is at most a certain number n.
Example with sequences above:
I decided that n=3, so...
1. "11111111******111"
2. "111111111112222222"
3. "1111*****22222111*****1111"
In the second and third strings the "function" deleted a run of 3 gaps and a run of 2 gaps respectively, because I wanted to delete all runs of gaps of length 3 or less.

Maybe we can do:
n <-3
pat <- sprintf("(?<=[0-9])\\*{1,%d}(?=[0-9])", n)
gsub(pat, "", v1, perl = TRUE)
#[1] "11111111******111" "111111111112222222"
#[3] "1111*****22222111*****1111"
data
v1 <- c("11111111******111", "11111111111***2222222", "1111*****22222**111*****1111")
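
Since the pattern is built with sprintf, the threshold can be changed without editing the regex. A minimal sketch wrapping this in a function (remove_short_gaps is a hypothetical name, and it assumes gaps are always written as "*" between digits, as in the question):

```r
# Hypothetical helper: delete runs of 1..n asterisks that sit between two digits.
remove_short_gaps <- function(x, n) {
  pat <- sprintf("(?<=[0-9])\\*{1,%d}(?=[0-9])", n)
  gsub(pat, "", x, perl = TRUE)
}

s <- "1111*****22222**111*****1111"
remove_short_gaps(s, 3)
# [1] "1111*****22222111*****1111"
remove_short_gaps(s, 5)
# [1] "1111222221111111"
```

The lookbehind and lookahead matter here: a run longer than n is never partially deleted, because the match has to start right after a digit and end right before one.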

Similar to @akrun's answer, but capturing the digits around the gap instead of using lookarounds:
x<- list("11111111******111",
"11111111111***2222222",
"1111*****22222**111*****1111")
lapply(x, function(x) gsub("(\\d)\\*{1,3}(\\d)", "\\1\\2", x, perl = TRUE))

gsub('(?<=\\d)(\\*{1,3})(?=\\d)','',v1,perl=T)
[1] "11111111******111" "111111111112222222" "1111*****22222111*****1111"

Locate different patterns in a sequence

If I want to find two different patterns in a single sequence, how should I do it?
For example:
seq="ATGCAAAGGT"
the patterns are
pattern=c("ATGC","AAGG")
How can I find these two patterns simultaneously in the sequence?
I also want to find the locations of these patterns; for example, here the pattern locations are 1,4 and 6,9.
Can anyone help me with this?

Let's say your sequence file is just a vector of sequences:
seq.file <- c('ATGCAAAGGT','ATGCTAAGGT','NOTINTHISONE')
You can search for both motifs, and then return a true / false vector that identifies if both are present using the following one-liner:
grepl('ATGC', seq.file) & grepl('AAGG', seq.file)
[1] TRUE TRUE FALSE
Let's say the vector of sequences is a column within data frame d, which also contains a column of ID values:
id <- c('s1','s2','s3')
d <- data.frame(id,seq.file)
colnames(d) <- c('id','sequence')
You can append a column to this data frame, d, that identifies whether a given sequence matches with this one-liner:
d$match <- grepl('ATGC',d$sequence) & grepl('AAGG', d$sequence)
> print(d)
id sequence match
1 s1 ATGCAAAGGT TRUE
2 s2 ATGCTAAGGT TRUE
3 s3 NOTINTHISONE FALSE
The following for loop prints the start and end positions of each pattern within each sequence (pattern is the vector from the question):
library(stringr)
pattern <- c("ATGC","AAGG")
for(i in seq_along(d$sequence)){
out <- str_locate_all(d$sequence[i], pattern)
first <- c(out[[1]])
first.o <- paste(first[1],first[2],sep=',')
second <- c(out[[2]])
second.o <- paste(second[1],second[2], sep=',')
print(c(first.o, second.o))
}
[1] "1,4" "6,9"
[1] "1,4" "6,9"
[1] "NA,NA" "NA,NA"

You can try using the stringr library to do something like this:
seq = "ATGCAAAGGT"
library(stringr)
str_extract_all(seq, 'ATGC|AAGG')
[[1]]
[1] "ATGC" "AAGG"
Without knowing more specifically what output you are looking for, this is the best I can provide right now.

How about this using stringr to find start and end positions:
library(stringr)
seq <- "ATGCAAAGGT"
pattern <- c("ATGC","AAGG")
str_locate_all(seq, pattern)
#[[1]]
# start end
#[1,] 1 4
#
#[[2]]
# start end
#[1,] 6 9
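
If you would rather avoid the stringr dependency, something along these lines should give the same start and end positions in base R. It is only a sketch: it reports the first occurrence of each pattern, treats the patterns as fixed strings, and the names seq_str, patterns and locations are my own.

```r
seq_str  <- "ATGCAAAGGT"
patterns <- c("ATGC", "AAGG")

# regexpr gives the start of the first match, or -1 when the pattern is absent
starts <- vapply(patterns,
                 function(p) as.integer(regexpr(p, seq_str, fixed = TRUE)),
                 integer(1))
starts[starts < 0] <- NA   # mark patterns that were not found

locations <- data.frame(pattern = patterns,
                        start   = starts,
                        end     = starts + nchar(patterns) - 1L,
                        row.names = NULL)
locations
#   pattern start end
# 1    ATGC     1   4
# 2    AAGG     6   9
```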

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
In the above code we need to find the number of r's (both R and r) in rquote that occur before the first u.

You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6

To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])

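You may try this:
> library(stringr)
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u, use the PCRE verbs (*SKIP)(*F), which need base R's perl = TRUE because stringr's ICU engine does not support them:
> lengths(regmatches(rquote, gregexpr('u.*(*SKIP)(*F)|[Rr]', rquote, perl = TRUE)))
[1] 5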

EDIT: Just saw before the first u. In that case, we can get the position of the first 'u' from either which or match.
Then use grepl in the 'chars' up to the position (ind) to find the logical index of 'R' with ignore.case=TRUE and use sum using the strsplit output from the OP's code.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use sub and gsub on the original string ('rquote'). The inner sub removes everything from the first u to the end of the string (u.*$), and the outer gsub matches all characters except R and r ([^Rr]) and replaces them with ''. We can then use nchar to count the characters that remain.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or if we want to count the 'r' in the entire string, gregexpr to get the position of matching characters from the original string ('rquote') and get the length
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6
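
The sub/gsub approach generalizes fairly naturally to other characters. A minimal sketch, where count_before is a hypothetical helper and both arguments are assumed to be plain letters with no regex metacharacters:

```r
# Hypothetical helper: count characters from `chars` that occur before the
# first occurrence of `stop_at`. If `stop_at` is absent, the whole string is used.
count_before <- function(x, chars, stop_at) {
  head_part <- sub(paste0(stop_at, ".*$"), "", x)        # drop first stop_at and the rest
  nchar(gsub(paste0("[^", chars, "]"), "", head_part))   # keep only the wanted characters
}

rquote <- "R's internals are irrefutably intriguing"
count_before(rquote, "Rr", "u")
# [1] 5
count_before("river", "Rr", "u")
# [1] 2
```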

Replace specific characters within strings

I would like to remove specific characters from strings within a vector, similar to the Find and Replace feature in Excel.
Here are the data I start with:
group <- data.frame(c("12357e", "12575e", "197e18", "e18947")
I start with just the first column; I want to produce the second column by removing the e's:
group group.no.e
12357e 12357
12575e 12575
197e18 19718
e18947 18947

With a regular expression and the function gsub():
group <- c("12357e", "12575e", "197e18", "e18947")
group
[1] "12357e" "12575e" "197e18" "e18947"
gsub("e", "", group)
[1] "12357" "12575" "19718" "18947"
What gsub does here is to replace each occurrence of "e" with an empty string "".
See ?regex or ?gsub for more help.
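
It may help to contrast gsub with sub, which replaces only the first occurrence. A small sketch on a made-up string with several e's:

```r
x <- "e18e47e"   # made-up example with three e's

first_only <- sub("e", "", x)    # removes just the first match
all_of_them <- gsub("e", "", x)  # removes every match

first_only
# [1] "18e47e"
all_of_them
# [1] "1847"
```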

Regular expressions are your friends:
R> ## also adds missing ')' and sets column name
R> group<-data.frame(group=c("12357e", "12575e", "197e18", "e18947"))
R> group
group
1 12357e
2 12575e
3 197e18
4 e18947
Now use gsub() with the simplest possible replacement pattern: empty string:
R> group$groupNoE <- gsub("e", "", group$group)
R> group
group groupNoE
1 12357e 12357
2 12575e 12575
3 197e18 19718
4 e18947 18947
R>

Summarizing 2 ways to replace strings:
group<-data.frame(group=c("12357e", "12575e", "197e18", "e18947"))
1) Use gsub
group$group.no.e <- gsub("e", "", group$group)
2) Use the stringr package
library(stringr)
group$group.no.e <- str_replace_all(group$group, "e", "")
Both will produce the desired output:
group group.no.e
1 12357e 12357
2 12575e 12575
3 197e18 19718
4 e18947 18947

You do not need to create a data frame from a vector of strings if you only want to replace some characters in it. Regular expressions are a good choice for this, as already mentioned by @Andrie and @Dirk Eddelbuettel.
Note that if you want to replace characters that are special in regular expressions, such as dots, you have to escape them or put them in a character class, as shown in the example below:
ctr_names <- c("Czech.Republic","New.Zealand","Great.Britain")
gsub("[.]", " ", ctr_names)
this will produce
[1] "Czech Republic" "New Zealand" "Great Britain"
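
To illustrate why the dot needs this treatment, here is a sketch of what happens with an unescaped dot, next to two alternatives that should give the same result as the character class:

```r
ctr <- "New.Zealand"

unescaped <- gsub(".", " ", ctr)                 # "." matches ANY character
escaped   <- gsub("\\.", " ", ctr)               # escaped dot matches a literal dot
literal   <- gsub(".", " ", ctr, fixed = TRUE)   # fixed = TRUE turns off regex syntax

unescaped   # every character replaced: a string of 11 spaces
escaped
# [1] "New Zealand"
literal
# [1] "New Zealand"
```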

Use the stringi package:
require(stringi)
group<-data.frame(c("12357e", "12575e", "197e18", "e18947"))
stri_replace_all(group[,1], "", fixed="e")
[1] "12357" "12575" "19718" "18947"

> library(stringr)
> group <- c('12357e', '12575e', '12575e', ' 197e18', 'e18947')
> pattern <- "e"
> replacement <- ""
> group <- str_replace(group, pattern, replacement)
> group
[1] "12357" "12575" "12575" " 19718" "18947"

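chartr is not suitable here: it translates characters one-for-one and requires the replacement string to be at least as long as the original, so it cannot delete characters, and the following call raises an error:
group$group.no.e <- chartr("e", "", group$group)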
