Multiple string/ pattern matching in R

Multiple string/ pattern matching in R - r

I want to look for individual character strings [Sequences], in another data.frame [Database] that has 500+ rows.
For example:
Sequences<-c("AzzY","BbDe")
Database<-c("TTUAzzY","aaa","DBbDe","CAzzY")
Ideally, the code would iterate through each row of [Database] to find if there is a match with one of the [Sequences]. From the example above, there would be 2 matches for "AzzY" and 1 match for "BbDe". I would like this count to be added in a new column of [Sequences].
Many thanks.

require("dplyr")
Sequences=c("AzzY","BbDe")
Database=c("TTUAzzY","aaa","DBbDe","CAzzY")
df=as.data.frame(sapply(Sequences, function(x) grepl(x,Database)))
stats=df %>% summarise_each(funs(sum))
cbind(Sequences,as.numeric(stats))

Sequences=c("AzzY","BbDe")
Database=c("TTUAzzY","aaa","DBbDe","CAzzY")
sapply(Sequences, function(x) length(grep(x, Database)))
# AzzY BbDe
# 2 1

Another idea would be to "flatten" the Database vector and search in that.
Here's that idea in practice with the efficient "stringi" package:
library(stringi)
stri_count_fixed(stri_flatten(Database), Sequences)
# [1] 2 1
You can pretty-up the output and add names if you want with setNames:
setNames(stri_count_fixed(stri_flatten(Database), Sequences), Sequences)
# AzzY BbDe
# 2 1

Related

Remove part of string after 3-digit number

I would like to substitute the strings in the list by cutting each string after the first 3-digit number.
a <- c("MTH314PHY410","LB471LB472","PHY472CHM141")
I would like for it to look something like
a <- c("MTH314","LB471","PHY472")
I have tried something like
b <- gsub("[100-999].*","",a)
but it returns c("MTH","LB","PHY") without the first number

A possible solution, based on stringr::str_remove:
library(stringr)
a <- c("MTH314PHY410","LB471LB472","PHY472CHM141")
str_remove(a, "(?<=\\d{3}).*")
#> [1] "MTH314" "LB471" "PHY472"

c("MTH314PHY410","LB471LB472","PHY472CHM141") %>%
stringr::str_extract('.+?\\d{3}')
[1] "MTH314" "LB471" "PHY472"

Count the number of hyphens in a name using R

I have a data set with a certain amount of names. How can I count the number of names with at least one hyphen using R?

We can use str_count to get the number of hyphens and then count by creating a logical vector and get the sum
library(stringr)
sum(str_count(v1, "-") > 0)

In base R, we can use grepl
sum(grepl('-', df$Name))
Or with grep
length(grep('-', df$Name))
Using a reproducble example,
df <- data.frame(Name = c('name1-name2', 'name1name2',
'name1-name2-name3', 'name2name3'))
sum(grepl('-', df$Name))
#[1] 2
length(grep('-', df$Name))
#[1] 2

List Rows from column selection

Hello I would like to select rows in form of list from a dataframe. Here is my dataframe:
df2 <- data.frame("user_id" = 1:2, "username" = c(215,154), "password" = c("John4","Dora4"))
now with this dataframe I can only select 1 column to view rows as a list, which I did with this code
df2[["user_id"]]
output is
[1] 1 2
but now when I try this with more columns I am told its out of bounds, what is the problem here
df2[["user_id", "username"]]
How can I resolve and get the results of rows as a list

If I understood your question correctly, you need to familiarize yourself with subsetting in R. These are ways to select multiple columns in R:
df2[,c('user_id', 'username')]
or
df2[,1:2]
If you want to return all columns as a list, you can use something like this:
lapply(1:ncol(df2), function(x) df2[,x])

The format is df2['rows','columns'], so you should use:
df2[,c("user_id", "username")]
To get them 'in form of list', do:
as.list(df2[,c("user_id", "username")])
The double bracket [[ notion is used to select a single unnamed element (in this case a single unnamed column since data frames are essentially lists of column data).
See this answer for more on double vs single bracket notion: https://stackoverflow.com/a/1169495/8444966

This should give you a row of list (There's got to be an answer somewhere here).
row_list<- as.list(as.data.frame(t(df2[c("user_id", "username")])))
#$V1
#[1] 1 215
#$V2
#[1] 2 154
If you want to keep names of the rows.
df2_subset <- df2[c("user_id", "username")]
setNames(split(df2_subset, seq(nrow(df2_subset))), rownames(df2_subset))
#$`1`
# user_id username
#1 1 215
#$`2`
# user_id username
#2 2 154

Delete duplicate elements in String in R

I've got some problems deleting duplicate elements in a string.
My data look similar to this:
idvisit path
1 1,16,23,59
2 2,14,14,19
3 5,19,23,19
4 10,10
5 23,23,27,29,23
I have a column containing an unique ID and a column containing a path for web page navigation.
The right column contains some cases, where pages just were reloaded and the page were tracked twice or even more.
The pages are separated with commas and are saved as factors.
My problem is, that I don't want to have multiple pages in a row, so the data should look like this.
idvisit path
1 1,16,23,59
2 2,14,19
3 5,19,23,19
4 10
5 23,27,29,23
The multiple pages next to each other should be removed. I know how to delete a specific multiple number using regexpressions, but I have about 20.000 different pages and can't do this for all of them.
Does anyone have a solution or a hint, for my problem?
Thanks
Sebastian

We can use tidyverse. Use the separate_rows to split the 'path' variable by the delimiter (,) to convert to a long format, then grouped by 'idvisit', we paste the run-length-encoding values
library(tidyverse)
separate_rows(df1, path) %>%
group_by(idvisit) %>%
summarise(path = paste(rle(path)$values, collapse=","))
# A tibble: 5 × 2
# idvisit path
# <int> <chr>
#1 1 1,16,23,59
#2 2 2,14,19
#3 3 5,19,23,19
#4 4 10
#5 5 23,27,29,23
Or a base R option is
df1$path <- sapply(strsplit(df1$path, ","), function(x) paste(rle(x)$values, collapse=","))
NOTE: If the 'path' column is factor class, convert to character before passing as argument to strsplit i.e. strsplit(as.character(df1$path), ",")

Using stringr package, with function: str_replace_all, I think it gets what you want using the following regular expression: ([0-9]+),\\1and then replace it with \\1 (we need to scape the \ special character):
library(stringr)
> str_replace_all("5,19,23,19", "([0-9]+),\\1", "\\1")
[1] "5,19,23,19"
> str_replace_all("10,10", "([0-9]+),\\1", "\\1")
[1] "10"
> str_replace_all("2,14,14,19", "([0-9]+),\\1", "\\1")
[1] "2,14,19"
You can use it in a array form: x <- c("5,19,23,19", "10,10", "2,14,14,19") then:
str_replace_all(x, "([0-9]+),\\1", "\\1")
[1] "5,19,23,19" "10" "2,14,19"
or using sapply:
result <- sapply(x, function(x) str_replace_all(x, "([0-9]+),\\1", "\\1"))
Then:
> result
5,19,23,19 10,10 2,14,14,19
"5,19,23,19" "10" "2,14,19"
Notes:
The first line is the attribute information:
> str(result)
Named chr [1:3] "5,19,23,19" "10" "2,14,19"
- attr(*, "names")= chr [1:3] "5,19,23,19" "10,10" "2,14,14,19"
If you don't want to see them (it does not affect the result), just do:
attributes(result) <- NULL
Then,
> result
[1] "5,19,23,19" "10" "2,14,19"
Explanation about the regular expression used: ([0-9]+),\\1
([0-9]+): Starts with a group 1 delimited by () and finds any digit (at least one)
,: Then comes a punctuation sign: , (we can include spaces here, but the original example only uses this character as delimiter)
\\1: Then comes an identical string to the group 1, i.e.: the repeated number. If that doesn't happen, then the pattern doesn't match.
Then if the pattern matches, it replaces it, with the value of the variable \\1, i.e. the first time the number appears in the pattern matched.
How to handle more than one duplicated number, for example 2,14,14,14,19?:
Just use this regular expression instead: ([0-9]+)(,\\1)+, then it matches when at least there is one repetition of the delimiter (right) and the number. You can try other possibilities using this regex101.com (in MHO it more user friendly than other online regular expression checkers).
I hope this would work for you, it is a flexible solution, you just need to adapt it with the pattern you need.

Locate different patterns in a sequence

If I want to find two different patterns in a single sequence how am I supposed to do
eg:
seq="ATGCAAAGGT"
the patterns are
pattern=c("ATGC","AAGG")
How am I supposed to find these two patterns simultaneously in the sequence?
I also want to find the location of these patterns like for example the patterns locations are 1,4 and 5,8.
Can anyone help me with this ?

Lets say your sequence file is just a vector of sequences:
seq.file <- c('ATGCAAAGGT','ATGCTAAGGT','NOTINTHISONE')
You can search for both motifs, and then return a true / false vector that identifies if both are present using the following one-liner:
grepl('ATGC', seq.file) & grepl('AAGG', seq.file)
[1] TRUE TRUE FALSE
Lets say the vector of sequences is a column within data frame d, which also contains a column of ID values:
id <- c('s1','s2','s3')
d <- data.frame(id,seq.file)
colnames(d) <- c('id','sequence')
You can append a column to this data frame, d, that identifies whether a given sequence matches with this one-liner:
d$match <- grepl('ATGC',d$sequence) & grepl('AAGG', d$sequence)
> print(d)
id sequence match
1 s1 ATGCAAAGGT TRUE
2 s2 ATGCTAAGGT TRUE
3 s3 NOTINTHISONE FALSE
The following for-loop can return a list of the positions of each of the patterns within the sequence:
require(stringr)
for(i in 1: length(d$sequence)){
out <- str_locate_all(d$sequence[i], pattern)
first <- c(out[[1]])
first.o <- paste(first[1],first[2],sep=',')
second <- c(out[[2]])
second.o <- paste(second[1],second[2], sep=',')
print(c(first.o, second.o))
}
[1] "1,4" "6,9"
[1] "1,4" "6,9"
[1] "NA,NA" "NA,NA"

You can try using the stringr library to do something like this:
seq = "ATGCAAAGGT"
library(stringr)
str_extract_all(seq, 'ATGC|AAGG')
[[1]]
[1] "ATGC" "AAGG"
Without knowing more specifically what output you are looking for, this is the best I can provide right now.

How about this using stringr to find start and end positions:
library(stringr)
seq <- "ATGCAAAGGT"
pattern <- c("ATGC","AAGG")
str_locate_all(seq, pattern)
#[[1]]
# start end
#[1,] 1 4
#
#[[2]]
# start end
#[1,] 6 9

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Multiple string/ pattern matching in R - r

require("dplyr") Sequences=c("AzzY","BbDe") Database=c("TTUAzzY","aaa","DBbDe","CAzzY") df=as.data.frame(sapply(Sequences, function(x) grepl(x,Database))) stats=df %>% summarise_each(funs(sum)) cbind(Sequences,as.numeric(stats))

Sequences=c("AzzY","BbDe") Database=c("TTUAzzY","aaa","DBbDe","CAzzY") sapply(Sequences, function(x) length(grep(x, Database))) # AzzY BbDe # 2 1

Related

Remove part of string after 3-digit number

Count the number of hyphens in a name using R

List Rows from column selection

Delete duplicate elements in String in R

Locate different patterns in a sequence

Categories

Resources