Count the number of hyphens in a name using R

I have a data set with a certain amount of names. How can I count the number of names with at least one hyphen using R?

We can use str_count to get the number of hyphens in each name, turn that into a logical vector (count > 0), and take its sum
library(stringr)
sum(str_count(df$Name, "-") > 0)

In base R, we can use grepl
sum(grepl('-', df$Name))
Or with grep
length(grep('-', df$Name))
Using a reproducible example,
df <- data.frame(Name = c('name1-name2', 'name1name2',
'name1-name2-name3', 'name2name3'))
sum(grepl('-', df$Name))
#[1] 2
length(grep('-', df$Name))
#[1] 2
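The title asks for the number of hyphens while the question body asks for the number of names containing one. As a minimal base R sketch, both can be derived from the same per-name count. The variable names hyphens_per_name, names_with_hyphen and total_hyphens are introduced here for illustration only.

```r
df <- data.frame(Name = c("name1-name2", "name1name2",
                          "name1-name2-name3", "name2name3"),
                 stringsAsFactors = FALSE)

# fixed = TRUE treats "-" as a literal character rather than a regex
matches <- gregexpr("-", df$Name, fixed = TRUE)

# regmatches() returns character(0) for names without a match,
# so lengths() gives 0 for them
hyphens_per_name <- lengths(regmatches(df$Name, matches))
hyphens_per_name
#[1] 1 0 2 0

# names with at least one hyphen
names_with_hyphen <- sum(hyphens_per_name > 0)
names_with_hyphen
#[1] 2

# hyphens across all names
total_hyphens <- sum(hyphens_per_name)
total_hyphens
#[1] 3
```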

Related

subset the string matches in the middle of the column from dataframe in R

I need to extract the uniprot/swiss-prot: ID from a column of a data frame in R. The column also contains other IDs.
Below is an example:
biogrid:107054|entrez gene/locuslink:BAK1|uniprot/swiss-prot:Q16611|refseq:NP_001179
I need the below output:
Q16611
You can use -
x <- 'biogrid:107054|entrez gene/locuslink:BAK1|uniprot/swiss-prot:Q16611|refseq:NP_001179'
sub('.*swiss-prot:(\\w+)\\|.*', '\\1', x)
#[1] "Q16611"
This will extract the word between swiss-prot: and the next | in the text.
To apply this to a dataframe column you can do -
df$result <- sub('.*swiss-prot:(\\w+)\\|.*', '\\1', df$col)
Using str_extract
library(stringr)
str_extract(x, "(?<=prot:)\\w+")
#[1] "Q16611"
data
x <- 'biogrid:107054|entrez gene/locuslink:BAK1|uniprot/swiss-prot:Q16611|refseq:NP_001179'
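One thing to watch with the sub approach is that it returns the input unchanged when the pattern does not match. The sketch below, in base R, returns NA for such rows instead. The data frame and its second row (which has no swiss-prot entry) are made up for illustration.

```r
# hypothetical data: the second row has no swiss-prot ID
df <- data.frame(
  col = c("biogrid:107054|entrez gene/locuslink:BAK1|uniprot/swiss-prot:Q16611|refseq:NP_001179",
          "biogrid:000000|entrez gene/locuslink:XYZ"),
  stringsAsFactors = FALSE
)

# lookbehind needs perl = TRUE; regexpr() returns -1 where nothing matches
m <- regexpr("(?<=swiss-prot:)\\w+", df$col, perl = TRUE)

# regmatches() only returns the matched elements, so fill them in by index
df$result <- NA_character_
df$result[m > 0] <- regmatches(df$col, m)
df$result
#[1] "Q16611" NA
```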

Substitute strings by their first match in a dictionary

I have a vector long_strings defined as
long_strings <- c("*/1/1/1/1", "*/1/2/1/1", "*/2/1",
"*/2/2/1", "*/3/1/1/1")
and I have a dictionary short_strings containing the initial patterns (with differing lengths) of those strings, for example
short_strings <- c("*/1/1", "*/3", "*/2", "*/1/2")
How can I "simplify" the contents of long_strings to match their corresponding value on short_strings?
The results should look like
"*/1/1", "*/1/2", "*/2", "*/2", "*/3"
I can find the occurrences of a single element of short_strings using grep("\\*/2", long_strings), but I want to avoid looping over short_strings.
An option with sapply
as.character(with(stack(sapply(setNames(paste0("\\", short_strings), short_strings),
grep, x = long_strings)), ind[order(values)]))
#[1] "*/1/1" "*/1/2" "*/2" "*/2" "*/3"
Or using str_extract
library(stringr)
str_extract(long_strings, str_c(str_c("\\", short_strings), collapse="|"))
#[1] "*/1/1" "*/1/2" "*/2" "*/2" "*/3"
We can programmatically create a capture group and use it in sub to extract it
sub(paste0(".*(",paste0("\\", short_strings, collapse = "|"), ").*"), "\\1",long_strings)
#[1] "*/1/1" "*/1/2" "*/2" "*/2" "*/3"
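Since the dictionary entries are plain prefixes, a regex-free variant is also possible. The following is a sketch only: it assumes "first match" means the first entry of short_strings (in dictionary order) that is a prefix of the string, and first_prefix is a helper name invented here. It loops over long_strings rather than over short_strings.

```r
long_strings <- c("*/1/1/1/1", "*/1/2/1/1", "*/2/1",
                  "*/2/2/1", "*/3/1/1/1")
short_strings <- c("*/1/1", "*/3", "*/2", "*/1/2")

# hypothetical helper: first dictionary entry that is a prefix of s,
# or NA when no entry matches (startsWith() does no regex interpretation,
# so the leading "*" needs no escaping)
first_prefix <- function(s, dict) {
  dict[match(TRUE, startsWith(s, dict))]
}

res <- vapply(long_strings, first_prefix, character(1),
              dict = short_strings, USE.NAMES = FALSE)
res
#[1] "*/1/1" "*/1/2" "*/2"   "*/2"   "*/3"
```

If one dictionary entry were a prefix of another (for example "*/1" and "*/1/2"), the order of short_strings would decide which one wins, so the dictionary may need sorting first.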

Count number of occurrences when string contains substring

I have string like
'abbb'
I need to understand how many times I can find substring 'bb'.
grep('bb','abbb')
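returns 1, because grep only reports which elements match. However, the answer I need is 2 ('bb' starts at positions 2 and 3). How can I count the number of occurrences the way I need?
You can make the pattern non-consuming with the lookahead '(?=bb)', so that overlapping matches are found, as in:
length(gregexpr('(?=bb)', 'abbb', perl=TRUE)[[1]])
#[1] 2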
Here is an ugly approach using substr and sapply:
input <- "abbb"
search <- "bb"
res <- sum(sapply(1:(nchar(input)-nchar(search)+1),function(i){
substr(input,i,i+(nchar(search)-1))==search
}))
res
#[1] 2
We can use stri_count_regex from stringi
library(stringi)
stri_count_regex(input, '(?=bb)')
#[1] 2
stri_count_regex(x, '(?=bb)')
#[1] 0 1 0
data
input <- "abbb"
x <- c('aa','bb','ba')
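A caveat with the length(gregexpr(...)[[1]]) idiom: when there is no match, gregexpr returns -1, so the length is 1 rather than 0. A small base R sketch that handles this and works on a whole vector could look like the following. count_overlapping is a name made up here, and the pattern is assumed to be a valid regex fragment.

```r
# hypothetical helper: count (possibly overlapping) matches per element
count_overlapping <- function(x, pattern) {
  # wrap the pattern in a lookahead so matches do not consume characters
  m <- gregexpr(paste0("(?=", pattern, ")"), x, perl = TRUE)
  # positions are -1 when nothing matched, so count only positive ones
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

counts <- count_overlapping(c("abbb", "aa", "bb", "ba"), "bb")
counts
#[1] 2 0 1 0
```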

split paired samples based on substring

I have two groups of paired samples that could be separated by the first two letters. I would like to make two groups based on the pairing using something like [tn][abc].
Example of paired samples:
nb-008 ta-008
na015 ta-015
data:
> colnames(data)
"nb-008" "nb-014" "na015" "na-018" "ta-008" "tc-014" "ta-015" "ta-018"
patient <- factor(sapply(str_split(colnames(data), '[tn][abc]'), function(x) x[[1]]))
We can create a grouping variable with sub. We match two characters (..) from the beginning of the string (^), followed by an optional - (-?), followed by zero or more characters (.*) that we capture as a group (inside the parentheses), and replace the whole match with the backreference (\\1). The result can be used to split the column names.
split(colnames(data), sub('^..-?(.*)', '\\1', colnames(data)))
#$`008`
#[1] "nb-008" "ta-008"
#$`014`
#[1] "nb-014" "tc-014"
#$`015`
#[1] "na015" "ta-015"
#$`018`
#[1] "na-018" "ta-018"
data
v1 <- c("nb-008", "nb-014", "na015", "na-018",
"ta-008", "tc-014", "ta-015", "ta-018" )
set.seed(24)
data <- setNames(as.data.frame(matrix(sample(0:8, 8*5,
replace=TRUE), ncol=8)), v1)
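If the goal is two groups (the n samples and the t samples) lined up by pair, the same sub call can be combined with the first letter of each name. This is a sketch under the assumption that every pair ID occurs once per group. The names sample_type, pair_id, paired, n_sample and t_sample are introduced here for illustration.

```r
v1 <- c("nb-008", "nb-014", "na015", "na-018",
        "ta-008", "tc-014", "ta-015", "ta-018")

sample_type <- substr(v1, 1, 1)          # "n" or "t"
pair_id <- sub("^..-?(.*)", "\\1", v1)   # "008", "014", ...

ids <- sort(unique(pair_id))
is_n <- sample_type == "n"
is_t <- sample_type == "t"

# one row per pair, with the matching name from each group
paired <- data.frame(
  pair_id  = ids,
  n_sample = v1[is_n][match(ids, pair_id[is_n])],
  t_sample = v1[is_t][match(ids, pair_id[is_t])],
  stringsAsFactors = FALSE
)
paired
```

The n_sample and t_sample columns can then be used to index the columns of data, for example data[paired$n_sample] and data[paired$t_sample].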

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
In the above code we need to find the number of r's (both R and r) in rquote before the first u.
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
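To get the count of all r's before the first u. The (*SKIP)(*F) verbs are PCRE syntax and the perl() wrapper comes from older stringr releases, so the version below uses base R with perl = TRUE instead
> length(regmatches(rquote, gregexpr('u.*(*SKIP)(*F)|[Rr]', rquote, perl = TRUE))[[1]])
[1] 5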
EDIT: Just saw 'before the first u'. In that case, we can get the position of the first 'u' with either which or match.
Then, using the strsplit output ('chars') from the OP's code, apply grepl with ignore.case=TRUE to the elements of 'chars' up to that position (ind) to get a logical vector marking 'R'/'r', and sum it.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use two substitutions on the original string ('rquote'). The inner sub removes everything from the first u to the end of the string (u.*$), and the outer gsub replaces every character other than R or r ([^Rr]) with ''. nchar then gives the count of the remaining characters.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or, if we want to count the 'r' in the entire string, use gregexpr to get the positions of the matching characters in the original string ('rquote') and take the length
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6
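The approaches above can be wrapped into one small base R function. This is a sketch only: count_before and its arguments are names invented here, and it assumes that when the stop character is absent the whole string should be counted.

```r
# hypothetical helper: count `target` (case-insensitive) before the
# first occurrence of `stop_at` in each element of x
count_before <- function(x, target = "r", stop_at = "u") {
  # position of the first stop character, -1 if absent
  pos <- as.integer(regexpr(stop_at, x, fixed = TRUE))
  # keep only the part before it; keep the whole string if it is absent
  head_part <- ifelse(pos > 0, substr(x, 1, pos - 1), x)
  m <- gregexpr(target, head_part, ignore.case = TRUE)
  # regmatches() gives character(0) when nothing matches, so the count is 0
  lengths(regmatches(head_part, m))
}

rquote <- "R's internals are irrefutably intriguing"
count_before(rquote)
#[1] 5
count_before(rquote, stop_at = "?")  # no "?" in the string: counts all r/R
#[1] 6
```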
