R: cluster strings with same begin - r

I've got the following vector
words <- c("verkoop", "verkoopartikel", "artikelnummer", "bank", "bankinfo", "bankrekeningnummer", "artikelnaam")
How can I cluster the words that begin with the same letters?
So here, this would be:
verkoop, verkoopartikel
artikelnummer, artikelnaam
bank, bankinfo, bankrekeningnummer

Here's a potential solution which first extracts the unique starting letters and then clusters the words in the vector using pattern matching:
words <- c("verkoop", "verkoopartikel", "artikelnummer", "bank", "bankinfo", "bankrekeningnummer", "artikelnaam")
l <- unique(substring(words,1,1))
l <- paste0("^", l) # the ^ indicates that the string should start with this letter
lapply(l, function(x,y) y[grep(x,y)], y=words)
# [[1]]
# [1] "verkoop" "verkoopartikel"
# [[2]]
# [1] "artikelnummer" "artikelnaam"
# [[3]]
# [1] "bank" "bankinfo" "bankrekeningnummer"

For two words to belong to the same cluster, how many initial letters should they share? The following example works with n_init = 4 letters.
library(dplyr)
n_init <- 4
data.frame(words) %>%
mutate(cluster = as.numeric(as.factor(substring(words, 1, n_init))))

Related

Subset R list by partial string match of sublist element against character vector, using base R

My actual case is a list of combined header strings and corresponding data as sub-lists; I wish to subset the list to return a list of sub-lists , i.e the same structure, that only contain the sub-lists whose header strings contain strings that match the strings in a character vector.
Test Data:
lets <- letters
x <- c(1,4,8,11,13,14,18,22,24)
ls <- list()
for (i in 1:9) {
ls[[i]] <- list(hdr = paste(lets[x[i]:(x[i]+2)], collapse=""),
data = seq(1,rnd[i]))
}
filt <- c("bc", "lm", "rs", "xy")
To produce a result list, as returned by:
logical_match <- c(T, F, F, T, F, F, T, F, T)
ls_result <- ls[logical_match]
So the function I seek is: ls_result <- fn(ls, filt)
I've looked at: subset list by dataframe; partial match with %in%; nested sublist by condition; subset list by logical condition; and, my favorite, extract sublist elements to array - this uses some neat purr and dplyr solutions, but unfortunately these aren't viable, as I'm looking for a base R solution to make deployment more straightforward (I'd welcome extended R solutions, for interest, of course).
I'm guessing some variation of logical_match <- lapply(ls, fn, '$hdr', filt) is where I'm heading; I started with pmatch(), and wondered how to incorporate grep, but I'm struggling to see how to generate the logical_match vector.
Can someone set me on the right track, please?
EDIT:
when agrepl() is applied to the real data, this becomes trickier; the header string, hdr, may be typically 255 characters long, whilst a string element of the filter vector , filt is of the order of 16 characters. The default agrepl() max.distance argument of 0.1 needs adjusted to somewhere between 0.94 and 0.96 for the example below, which is pretty tight. Even if I use the lower end of this range, and apply it to the ~360 list elements, the function returns a load of total non-matches.
> hdr <- "#CCHANNELSDI12-PTx|*|CCHANNELNO2|*|CDASA1570|*|CDASANAMEShenachieBU_1570|*|CTAGSODATSID|*|CTAGKEYWISKI_LIVE,ShenachieBU_1570,SDI12-PTx,Highres|*|LAYOUT(timestamp,value)|*|RINVAL-777|*|RSTATEW6|*|RTIMELVLhigh-resolution|*|TZEtc/GMT|*|ZDATE20210110130805|*|"
> filt <- c("ShenachieBU_1570", "Pitlochry_4056")
> agrepl(hdr, filt, max.distance = 0.94)
[1] TRUE FALSE
You could do:
Filter(function(x)any(agrepl(x$hdr,filt)), ls)
You could reduce the code to:
Filter(function(x)grepl(paste0(filt, collapse = "|"), x$hdr), ls)
We can also do
library(purrr)
library(stringr)
keep(ls, ~ str_detect(.x$hdr, str_c(filt, collapse = "|")))
-output
#[[1]]
#[[1]]$hdr
#[1] "abc"
#[[1]]$data
#[1] 1
#[[2]]
#[[2]]$hdr
#[1] "klm"
#[[2]]$data
#[1] 1 2 3 4
#[[3]]
#[[3]]$hdr
#[1] "rst"
#[[3]]$data
#[1] 1 2 3 4 5 6 7
#[[4]]
#[[4]]$hdr
#[1] "xyz"
#[[4]]$data
#[1] 1 2 3 4 5 6 7 8 9

How to find patterns between sets of strings in R?

I'm trying to find patterns in a set of strings as the following example:
"2100780D001378FF01E1000000040000--------01A456000000------------"
"3100782D001378FF03E1008100040000--------01A445800000------------"
If I use the standard get_pattern from the bpa library, since it looks individually to every string I will get
"9999999A999999AA99A9999999999999--------99A999999999------------"
But my idea would be to find something like:
"X10078XD001378FF0XE100XX00040000--------01A4XXX00000------------"
The main objective is to find the set of strings with the most similar "pattern"
My first idea was to calculating the hamming distance between them and then analyzing the groups resulting from this distance but it gets tedious. Is there any "automatic" approach?
Any idea of how I can accomplish this mission?
for your sample data, the code below is working.. no idea how it scales to production...
library( data.table )
#sample data
data <- data.table( name = c("2100780D001378FF01E1000000040000--------01A456000000------------",
"3100782D001378FF03E1008100040000--------01A445800000------------"))
# name
# 1: 2100780D001378FF01E1000000040000--------01A456000000------------
# 2: 3100782D001378FF03E1008100040000--------01A445800000------------
#use data.table::tstrsplit() to split the string to individual characters
l <- lapply( data.table::tstrsplit( data$name, ""), function(x) {
#if the same character appears in all strings on the same position,return the character, else return 'X'
if ( length( unique( x ) ) == 1 ) as.character(x[1]) else "X"
})
#paste it all together
paste0(l, collapse = "")
# [1] "X10078XD001378FF0XE100XX00040000--------01A4XXX00000------------"
small explanation
data.table::tstrsplit( data$name, "") returns the following list
[[1]]
[1] "2" "3"
[[2]]
[1] "1" "1"
[[3]]
[1] "0" "0"
etc...
Using lapply(), you can loop over this list, determining the length of the vector with unique elements. Ith this length == 1, then the same character exists in all strings on this position, so return the character.
If the length > 1, then multiple characters apprear on this possition in different strings, and return "X".
Update
if you are after the hamming distances, use the stringdist-package
library(stringdist)
m <- stringdist::stringdistmatrix(a = data$name, b = data$name, ,method="hamming" )
# [,1] [,2]
# [1,] 0 8
# [2,] 8 0
#to get to the minimum value for each row, exclude the diagonal first (by making it NA)
# and the find the position with the minimum value
diag(m) <- NA
apply( m, 1, which.min )
# [1] 2 1
Here is a base R solution, where a custom function findPat is defined and Reduce is applied to find common pattern among a set of strings, i.e.,
findPat <- function(s1,s2){
r1 <- utf8ToInt(s1)
r2 <- utf8ToInt(s2)
r1[bitwXor(r1,r2)!=0]<- utf8ToInt("X")
pat <- intToUtf8(r1)
}
pat <- Reduce(findPat,list(s1,s2,s3))
such that
> pat
[1] "X10078XDX0X378FF0XE100XX00040000--------01AXXXXX0000------------"
DATA
s1 <- "2100780D001378FF01E1000000040000--------01A456000000------------"
s2 <- "3100782D001378FF03E1008100040000--------01A445800000------------"
s3 <- "4100781D109378FF03E1008100040000--------01A784580000------------"

How to find substring from string in R?

If my string is a DNA sequence,
x<-"TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"
I want to extract substring from ATG to TAA, TGA or TAG. I am able to extract from one point to another by using stringi package with regex.
My code is
stri_extract_all(x, regex = "ATG.*?TAA")
Help me by solving my query.
I believe that you meant str_extract_all from the stringr package. That function does not have an argument called regex; you need pattern. Once you get by that, you can just use or | to allow any of the sequence endings.
library(stringr)
str_extract_all(x, pattern="ATG.*?(TAA|TGA|TAG)")
[[1]]
[1] "ATGCAACGAGGGGCATAA" "ATGCCCAAAATCTGA" "ATGACCGGGTAG"
Here is a possibility using Biostrings:
library("Biostrings")
x <- "TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"
# Get all combinations of substrings starting with "ATG" and ending with "TAA"
library(tidyverse)
df <- expand.grid(start(matchPattern("ATG", x)), end(matchPattern("TAA", x))) %>%
filter(Var1 < Var2);
ir <- IRanges(df[, 1], df[, 2]);
extractAt(BString(x), IRanges(df[, 1], df[, 2]));
#A BStringSet instance of length 3
# width seq
#[1] 18 ATGCAACGAGGGGCATAA
#[2] 44 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAA
#[3] 20 ATGCCCAAAATCTGATATAA
Since you're working with DNA sequence data, I recommend familiarising yourself with Biostrings from Bioconductor. There exist many Bioconductor packages beyond Biostrings that will make your life a lot easier (down the track), when you're working with DNA/RNA sequence data.
Update
To account for multiple stop codons, simply wrap end(matchPattern(...)) within an sapply loop.
df <- expand.grid(
start(matchPattern("ATG", x)),
unlist(sapply(c("TAA", "TGA", "TAG"), function(ss) end(matchPattern(ss, x))))) %>%
filter(Var1 < Var2);
ir <- IRanges(df[, 1], df[, 2]);
extractAt(BString(x), IRanges(df[, 1], df[, 2]));
# [1] 18 ATGCAACGAGGGGCATAA
# [2] 44 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAA
# [3] 20 ATGCCCAAAATCTGATATAA
# [4] 39 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGA
# [5] 15 ATGCCCAAAATCTGA
# ... ... ...
# [7] 23 ATGCCCAAAATCTGATATAATGA
# [8] 4 ATGA
# [9] 55 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG
#[10] 31 ATGCCCAAAATCTGATATAATGACCGGGTAG
#[11] 12 ATGACCGGGTAG

Filtering a list of lists based on list combinations - R

I have a list of lists:
library(partitions)
list_parts(3)
>[1] (1,2,3)
>[[2]]
>[1] (1,3)(2)
>[[3]]
>[1] (1,2)(3)
>[[4]]
>[1] (2,3)(1)
>[[5]]
>[1] (1)(2)(3)
I need to filter out certain lists based on combinations as they are not feasible. For example list[4] is not possible because (2,3) cannot be a list without (1). How can I filter based on a combination rule set eg remove combinations where 2 and 3 are in a list without 1?
We can borrow the method from this answer to Find vector in list of vectors and wrap it in our own function.
library(partitions)
p = listParts(3)
detect = function(p, pattern) {
Position(function(x) identical(x, pattern), p, nomatch = 0) > 0
}
test = sapply(p, detect, pattern = 2:3)
p[!test]
# [[1]]
# [1] (1,2,3)
#
# [[2]]
# [1] (1,3)(2)
#
# [[3]]
# [1] (1,2)(3)
#
# [[4]]
# [1] (1)(2)(3)

Remove a sequence in a character in R

I have the following character in a data.frame:
b <- "http://datos.labcd.mx/dataset/5b18cc1e-d2f2-46b0-bf2c-e699ae2af713/resource/e265a46f-7a9f-4a30-ae0d-d5937fff17c1/download/201003.csv"
I just want to extract the number 201003.
How should I do that?
b <- "http://datos.labcd.mx/dataset/5b18cc1e-d2f2-46b0-bf2c-e699ae2af713/resource/e265a46f-7a9f-4a30-ae0d-d5937fff17c1/download/201003.csv"
Try this on 'b':
file_name <- basename(b)
file_name
# [1] "201003.csv"
number <- strsplit(file_name, "\\.")[[1]]
number
# [1] "201003" "csv"
number = as.numeric(number[1])
number
# [1] 201003
Hope this helped.

Resources