I'm trying to find patterns in a set of strings, as in the following example:
"2100780D001378FF01E1000000040000--------01A456000000------------"
"3100782D001378FF03E1008100040000--------01A445800000------------"
If I use the standard get_pattern from the bpa library, I get the following, since it looks at every string individually:
"9999999A999999AA99A9999999999999--------99A999999999------------"
But my idea would be to find something like:
"X10078XD001378FF0XE100XX00040000--------01A4XXX00000------------"
The main objective is to find the set of strings with the most similar "pattern"
My first idea was to calculate the Hamming distance between them and then analyze the groups resulting from this distance, but it gets tedious. Is there any "automatic" approach?
Any idea of how I can accomplish this mission?
For your sample data, the code below works; no idea how it scales to production...
library( data.table )
#sample data
data <- data.table( name = c("2100780D001378FF01E1000000040000--------01A456000000------------",
                             "3100782D001378FF03E1008100040000--------01A445800000------------"))
# name
# 1: 2100780D001378FF01E1000000040000--------01A456000000------------
# 2: 3100782D001378FF03E1008100040000--------01A445800000------------
#use data.table::tstrsplit() to split the string to individual characters
l <- lapply( data.table::tstrsplit( data$name, ""), function(x) {
  # if the same character appears in all strings at the same position, return the character, else return 'X'
  if ( length( unique( x ) ) == 1 ) as.character(x[1]) else "X"
})
#paste it all together
paste0(l, collapse = "")
# [1] "X10078XD001378FF0XE100XX00040000--------01A4XXX00000------------"
small explanation
data.table::tstrsplit( data$name, "") returns the following list
[[1]]
[1] "2" "3"
[[2]]
[1] "1" "1"
[[3]]
[1] "0" "0"
etc...
Using lapply(), you can loop over this list, determining the length of the vector of unique elements. If this length == 1, then the same character exists in all strings at this position, so return that character.
If the length > 1, then multiple characters appear at this position in different strings, so return "X".
Update
If you are after the Hamming distances, use the stringdist package:
library(stringdist)
m <- stringdist::stringdistmatrix(a = data$name, b = data$name, method = "hamming")
# [,1] [,2]
# [1,] 0 8
# [2,] 8 0
#to get the minimum value for each row, exclude the diagonal first (by making it NA)
# and then find the position with the minimum value
diag(m) <- NA
apply( m, 1, which.min )
# [1] 2 1
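If the goal is to group strings by similarity rather than only find each string's nearest neighbour, one option (a sketch, not part of the original answer; the cut height of 10 mismatches is an arbitrary assumption) is to feed the Hamming distance matrix into base R's hierarchical clustering:
# recompute the matrix (the NA diagonal above is only needed for the which.min trick)
m2 <- stringdist::stringdistmatrix(a = data$name, b = data$name, method = "hamming")
hc <- hclust(as.dist(m2))
groups <- cutree(hc, h = 10)  # strings within ~10 mismatches end up in the same group
split(data$name, groups)      # with only the two sample strings this is a single group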
Here is a base R solution, where a custom function findPat is defined and Reduce is applied to find the common pattern among a set of strings, i.e.,
findPat <- function(s1, s2){
  r1 <- utf8ToInt(s1)
  r2 <- utf8ToInt(s2)
  r1[bitwXor(r1, r2) != 0] <- utf8ToInt("X")
  intToUtf8(r1)
}
pat <- Reduce(findPat,list(s1,s2,s3))
such that
> pat
[1] "X10078XDX0X378FF0XE100XX00040000--------01AXXXXX0000------------"
DATA
s1 <- "2100780D001378FF01E1000000040000--------01A456000000------------"
s2 <- "3100782D001378FF03E1008100040000--------01A445800000------------"
s3 <- "4100781D109378FF03E1008100040000--------01A784580000------------"
I am trying to create a function in R that searches strings for a specific pattern at a specific position; if the letter is present at the established position, I want to count it.
example of dataset:
library(dplyr)
mutations <- tibble(
"position" = c(9,10),
"AA" = c("G","G"))
strings <- c("EVQLVESGGGLAKPG",
"VQLVESGGGLAKPGGS",
"EVQLVESGGALAKPGGSLRLSCAAS")
So, in this case, I want to look for the position 9 and 10, if there is a letter "G" I want to count it.
Expected dataframe or tibble output:
| string | mut_counts |
|--------|------------|
| 1      | 2          |
| 2      | 1          |
| 3      | 1          |
In this example, all strings have a "G" at position 9, so they would all get at least 1, and only one of the three sequences has a "G" at position 10, so that sequence will have 2.
I am trying to use str_locate_all() from the stringr package to locate the positions and then compare them with my dataframe to count, but I am failing to get what I want.
library(stringr)
.class_mutations <- function(sequences, mutations){
  .count_pattern <- function(x){
    df <- sum(as.integer(locating_all_patterns[[x]][,"start"] == mutations$position[mut]))
  }
  for(mut in nrow(mutations)){
    locating_all_patterns <- str_locate_all(pattern = mutations$AA[mut], sequences)
    counting_patterns <- lapply(locating_all_patterns, .count_pattern)
  }
  return(counting_patterns)
}
.class_mutations(strings, mutations)
I am getting this error: Error in locating_all_patterns[[x]] : no such index at level 1. Besides, if you have a better/faster way to do this, I would also appreciate it. I have to take into account that this is going to be applied to thousands of strings, so I should avoid slow functions.
Thank you
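For reference, here is a sketch that keeps the str_locate_all() idea from the question but fixes its two issues: the loop needs seq_len(nrow(mutations)) rather than nrow(mutations), and lapply() passes list elements (matrices), not indices, so the helper should work on the matrix it receives. The function name count_mutations is my own:
library(stringr)
count_mutations <- function(sequences, mutations) {
  counts <- integer(length(sequences))
  for (mut in seq_len(nrow(mutations))) {
    # one start/end matrix per sequence for this amino acid
    locs <- str_locate_all(sequences, pattern = mutations$AA[mut])
    counts <- counts + vapply(locs, function(m)
      sum(m[, "start"] == mutations$position[mut]), integer(1))
  }
  counts
}
count_mutations(strings, mutations)
# [1] 2 1 1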
base R
rowSums(outer(strings, seq_len(nrow(mutations)),
function(st, i) {
substr(st, mutations$position[i], mutations$position[i]) == mutations$AA[i]
}))
# [1] 2 1 1
Walk-through:
outer effectively just produces two vectors, an expansion of the Cartesian product of the two arguments. If we insert a browser() as the first line of the inner anonymous function, we'd see
data.frame(st, i)
# st i
# 1 EVQLVESGGGLAKPG 1
# 2 VQLVESGGGLAKPGGS 1
# 3 EVQLVESGGALAKPGGSLRLSCAAS 1
# 4 EVQLVESGGGLAKPG 2
# 5 VQLVESGGGLAKPGGS 2
# 6 EVQLVESGGALAKPGGSLRLSCAAS 2
(Shown as a frame only for a columnar presentation. Both st and i are simple vectors.)
From here, knowing that substr is vectorized across all arguments, a single call to substr will find the ith character in each of the strings.
The result of the substr is a vector of letters. Continuing the same browser() session from above,
substr(st, mutations$position[i], mutations$position[i])
# [1] "G" "G" "G" "G" "L" "A"
mutations$AA[i]
# [1] "G" "G" "G" "G" "G" "G"
substr(st, mutations$position[i], mutations$position[i]) == mutations$AA[i]
# [1] TRUE TRUE TRUE TRUE FALSE FALSE
The mutations$AA[i] shows us what we're looking for. A nice thing of the vectorized method here is that mutations$AA[i] will always be the same length and in the expected order of letters retrieved by substr(.).
The outer itself returns a matrix, with length(X) rows and length(Y) columns (X and Y are the first and second args to outer, respectively).
outer(strings, seq_len(nrow(mutations)),
function(st, i) {
substr(st, mutations$position[i], mutations$position[i]) == mutations$AA[i]
})
# [,1] [,2]
# [1,] TRUE TRUE
# [2,] TRUE FALSE
# [3,] TRUE FALSE
The number of correct mutations found in each string is just a sum of each row. (Ergo rowSums.)
If you're concerned about a large number of mutations and strings, you can replace the outer and iterate over each row of mutations instead:
rowSums(sapply(seq_len(nrow(mutations)), function(i) substr(strings, mutations$position[i], mutations$position[i]) == mutations$AA[i]))
# [1] 2 1 1
This calls substr once for each mutations row, so if the outer-explosion is too much, this might reduce the memory footprint.
For a base R option, we can make use of the string functions. This approach compares the length of each substring before and after replacing the target character:
nchar(substr(strings, 9, 10)) -
nchar(gsub("G", "", substr(strings, 9, 10), fixed=TRUE))
[1] 2 1 1
Data:
strings <- c("EVQLVESGGGLAKPG",
"VQLVESGGGLAKPGGS",
"EVQLVESGGALAKPGGSLRLSCAAS")
I am looking for a built-in function in MATLAB which works in a similar way to ave in R.
Here is an example with R:
set.seed(0)
x <- sample(c("A", "B"), 10, replace = TRUE)
xid <- ave(seq_along(x), x, FUN = seq_along)
which gives
> x
[1] "B" "A" "B" "A" "A" "B" "A" "A" "A" "B"
> xid
[1] 1 1 2 2 3 3 4 5 6 4
In other words, I have no idea which function in MATLAB allows me to group by x and assign sequence IDs within groups, so that I can get an array like xid. I know splitgroup might be close to the goal, but it doesn't give me the desired output since it yields summarized results.
The question asks to replace each entry in x by the number of times it has occurred so far.
I don't know of a built-in function that does this. Here are some approaches. Let
x = ['B' 'A' 'B' 'A' 'A' 'B' 'A' 'A' 'A' 'B']; % example data. Row vector
Short code, but memory-inefficient (computes an intermediate N×N matrix, where N is the length of x):
xid = sum(triu(x==x.'));
A little more efficient (computes an intermediate U×N matrix, where U is the number of unique elements of x ):
t = x==unique(x).';
xid = nonzeros(t.*cumsum(t,2)).';
Boring efficient code with a loop:
xid = NaN(size(x)); % preallocate
for u = unique(x)
t = x==u;
xid(t) = 1:sum(t);
end
My actual case is a list whose elements are sub-lists, each combining a header string and the corresponding data; I wish to subset the list to return a list of sub-lists (i.e. the same structure) that contains only the sub-lists whose header strings contain matches for the strings in a character vector.
Test Data:
lets <- letters
x <- c(1,4,8,11,13,14,18,22,24)
rnd <- 1:9  # assumption: `rnd` was not defined in the original post; 1:9 reproduces the output shown in the answers below
ls <- list()
for (i in 1:9) {
  ls[[i]] <- list(hdr = paste(lets[x[i]:(x[i]+2)], collapse=""),
                  data = seq(1, rnd[i]))
}
filt <- c("bc", "lm", "rs", "xy")
To produce a result list, as returned by:
logical_match <- c(T, F, F, T, F, F, T, F, T)
ls_result <- ls[logical_match]
So the function I seek is: ls_result <- fn(ls, filt)
I've looked at: subset list by dataframe; partial match with %in%; nested sublist by condition; subset list by logical condition; and, my favorite, extract sublist elements to array - this uses some neat purrr and dplyr solutions, but unfortunately these aren't viable, as I'm looking for a base R solution to make deployment more straightforward (I'd welcome extended R solutions, for interest, of course).
I'm guessing some variation of logical_match <- lapply(ls, fn, '$hdr', filt) is where I'm heading; I started with pmatch(), and wondered how to incorporate grep, but I'm struggling to see how to generate the logical_match vector.
Can someone set me on the right track, please?
EDIT:
When agrepl() is applied to the real data, this becomes trickier; the header string, hdr, may typically be 255 characters long, whilst a string element of the filter vector, filt, is of the order of 16 characters. The default agrepl() max.distance argument of 0.1 needs to be adjusted to somewhere between 0.94 and 0.96 for the example below, which is pretty tight. Even if I use the lower end of this range and apply it to the ~360 list elements, the function returns a load of total non-matches.
> hdr <- "#CCHANNELSDI12-PTx|*|CCHANNELNO2|*|CDASA1570|*|CDASANAMEShenachieBU_1570|*|CTAGSODATSID|*|CTAGKEYWISKI_LIVE,ShenachieBU_1570,SDI12-PTx,Highres|*|LAYOUT(timestamp,value)|*|RINVAL-777|*|RSTATEW6|*|RTIMELVLhigh-resolution|*|TZEtc/GMT|*|ZDATE20210110130805|*|"
> filt <- c("ShenachieBU_1570", "Pitlochry_4056")
> agrepl(hdr, filt, max.distance = 0.94)
[1] TRUE FALSE
You could do:
Filter(function(x) any(agrepl(x$hdr, filt)), ls)
You could reduce the code to:
Filter(function(x) grepl(paste0(filt, collapse = "|"), x$hdr), ls)
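If you specifically want the logical_match vector from the question before subsetting, here is a sketch using fixed-string grepl() over each header (exact substring matching rather than the approximate matching of agrepl):
# build the logical_match vector described in the question, then subset
logical_match <- vapply(ls, function(x) {
  any(vapply(filt, grepl, logical(1), x = x$hdr, fixed = TRUE))
}, logical(1))
logical_match
# [1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
ls_result <- ls[logical_match]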
We can also do
library(purrr)
library(stringr)
keep(ls, ~ str_detect(.x$hdr, str_c(filt, collapse = "|")))
Output:
#[[1]]
#[[1]]$hdr
#[1] "abc"
#[[1]]$data
#[1] 1
#[[2]]
#[[2]]$hdr
#[1] "klm"
#[[2]]$data
#[1] 1 2 3 4
#[[3]]
#[[3]]$hdr
#[1] "rst"
#[[3]]$data
#[1] 1 2 3 4 5 6 7
#[[4]]
#[[4]]$hdr
#[1] "xyz"
#[[4]]$data
#[1] 1 2 3 4 5 6 7 8 9
I have a character vector like this:
data <- c("A:B:C", "A:B", "E:F:G", "H:I:J", "B:C:D")
I want to convert this to a character vector of:
c("A:B:C:D", "E:F:G", "H:I:J")
The idea is that each element of the vector is a string of sub-elements (e.g. A, B, C) that have been pasted together (with sep=":"). Each element is compared with all other elements to look for common sub-elements, and elements with common sub-elements are combined.
I don't care about the order of the elements (or the order of the sub-elements), FWIW.
Thanks for any help offered!
--
Answers so far...
I liked d.b's suggestion - not least because it stayed in base R. However, with a larger, more complicated set, it didn't work perfectly until everything was run again. With an even more complicated dataset, re-running everything more than twice might be needed (see the convergence sketch after that answer below).
I had more difficulty with thelatemail's suggestion. I had to upgrade R to use lengths, and I then had to figure out how to get to the end point because the answer was incomplete. In any case, this was how I got to the end (I suspect there is a better way). This worked with a larger set without a hitch.
library(igraph)
spl <- strsplit(data,":")
combspl <- data.frame(
grp = rep(seq_along(spl),lengths(spl)),
val = unlist(spl)
)
cl <- clusters(graph.data.frame(combspl))$membership[-(1:length(spl))]
dat <- data.frame(cl) # after getting nowhere working with the list as formatted
dat[,2] <- row.names(dat)
a <- character(0)
for (i in 1:max(cl)) {
  a[i] <- paste(paste0(dat[(dat[,1] == i),][,2]), collapse=":")
}
a
#[1] "A:B:C:D" "E:F:G" "H:I:J"
I'm going to leave this for now as is.
A possible application for the igraph library, if you think of your values as an edgelist of paired groups:
library(igraph)
spl <- strsplit(data,":")
combspl <- data.frame(
grp = rep(seq_along(spl),lengths(spl)),
val = unlist(spl)
)
cl <- clusters(graph.data.frame(combspl))$membership[-(1:length(spl))]
#A B C E F G H I J D
#1 1 1 2 2 2 3 3 3 1
split(names(cl),cl)
#$`1`
#[1] "A" "B" "C" "D"
#
#$`2`
#[1] "E" "F" "G"
#
#$`3`
#[1] "H" "I" "J"
Or as collapsed text:
sapply(split(names(cl),cl), paste, collapse=";")
# 1 2 3
#"A;B;C;D" "E;F;G" "H;I;J"
a <- character(0)
for (i in seq_along(data)) {
  a[i] <- paste(unique(unlist(strsplit(data[sapply(seq_along(data), function(j)
    any(unlist(strsplit(data[i], ":")) %in% unlist(strsplit(data[j], ":"))))], ":"))),
    collapse = ":")
}
unique(a)
#[1] "A:B:C:D" "E:F:G" "H:I:J"
I have a character vector that looks like this
vector <- c('a','b','c','d','e')
I have an object in a for-loop that takes input as:
out[a,] <- c(a,b,c,d,e)
Where a-e are variables with values (for instance, a = 0.7). I would like to feed the out object some transformed version of the vector object. I've tried
paste(noquote(vector),collapse=',')
However, this just returns
"a,b,c,d,e"
Which is still not useful.
Reverse the order of the function calls:
noquote(paste(vector, collapse = ','))
This will print [1] a,b,c,d,e. If you don't like the [1] use
cat(paste(vector, collapse = ','))
which prints
a,b,c,d,e
You can use mget to put objects into a named list:
# data
a <- 1; b <- 2; c <- 3; d <- 4; e <- 5
mget(letters[1:5])
$a
[1] 1
$b
[1] 2
$c
[1] 3
$d
[1] 4
$e
[1] 5
or wrap mget in unlist to get a named vector:
unlist(mget(letters[1:5]))
a b c d e
1 2 3 4 5
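Tying this back to the loop assignment in the question, here is a usage sketch (the shape of out, the row index row_i, and the values of b through e are assumptions; only a = 0.7 comes from the question):
vector <- c('a', 'b', 'c', 'd', 'e')
a <- 0.7; b <- 0.2; c <- 0.1; d <- 0.4; e <- 0.9  # b-e are made-up example values
out <- matrix(NA_real_, nrow = 1, ncol = length(vector))  # assumed shape of `out`
row_i <- 1  # hypothetical loop index
out[row_i, ] <- unlist(mget(vector))
out
#      [,1] [,2] [,3] [,4] [,5]
# [1,]  0.7  0.2  0.1  0.4  0.9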
This is a very basic question, and a tiny mistake nearly drove me crazy every time. I simplified it and created a function in R.
Here you go buddy!
numbers <- list(2,5,8,9,14,20)  # list containing even and odd numbers
en <- list()  # initiating even numbers' list
on <- list()  # initiating odd numbers' list

# Function creation
separate <- function(x){
  for (i in x) {
    if ((i %% 2) == 0) {
      en <- paste(append(en, i, length(en) + 1), collapse = ",")
    } else {
      on <- paste(append(on, i, length(on) + 1), collapse = ",")
    }
  }
  message("Even numbers are : ", en)
  message("Odd numbers are : ", on)
}
#Passing the function with argument
separate(numbers)
Result!
Even numbers are : 2,8,14,20
Odd numbers are : 5,9
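For comparison, a vectorised sketch that produces the same two strings without a loop (not how the answer above does it):
v <- unlist(numbers)  # flatten the list to a numeric vector
evens <- paste(v[v %% 2 == 0], collapse = ",")
odds  <- paste(v[v %% 2 != 0], collapse = ",")
message("Even numbers are : ", evens)
# Even numbers are : 2,8,14,20
message("Odd numbers are : ", odds)
# Odd numbers are : 5,9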