find frequency of substring in a set of strings - r

I have as input a gene list in which each gene has a header like >SomeText.
For each gene I would like to find the frequency of the string GTG (number of occurrences divided by the length of the gene). The string should only be counted if it starts at position 1, 4, 7, 10, etc. (every third position).
>ENST00000619537.4 cds:known chromosome:GRCh38:21:6560714:6564489:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGATGTGACCATCCAGCACCCCTGGTTCAAGCGCACCCTGGGGCCCTTCTACCCCAGC
CGGCTGTTCGACCAGTTTTTCGGCGAGGGCCTTTTTGAGTATGACCTGCTGCCCTTCCTG
TCGTCCACCATCAGCCCCTACTACCGCCAGTCCCTCTTCCGCACCGTGCTGGACTCCGGC
ATCTCTGAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTC
TCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCAC
AACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTG
CCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACC
TTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCC
GTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624019.3 cds:known chromosome:GRCh38:21:6561284:6563978:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGACGCCCCCCCCCCCCACCCAACCACAGGCCTCCTCTCTGAGCCACGGGTTCGATCC
GACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTCTCCCCGGAGGACCTCACC
GTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCACAACGAGCGCCAGGACGAC
CACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTGCCGTCCAACGTGGACCAG
TCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACCTTCTGTGGCCCCAAGATC
CAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCCGTGTCGCGGGAGGAGAAG
CCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624932.1 cds:known chromosome:GRCh38:21:6561954:6564203:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGCCTGTCTGTCCAGGAGACAGTCACAGGCCCCCGAAAGCTCTGCCCCACTTGGTGTGT
GGGAGAAGAGGCCGGCAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTG
AAGCACTTCTCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCAC
GGAAAGCACAACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGC
TACCGCCTGCCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGC
ATGCTGACCTTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGA
GCCATCCCCGTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
Output:
Gene Frequency
Gene1: 3
Gene2 6.3
....
I was thinking of something like this, but I don't know how to define the position requirement:
freq <- sapply(gregexpr("GTG",x),function(x)if(x[[1]]!=-1) length(x) else 0)

Here is an idea in R using stringi.
We use stri_locate_all_fixed() to find the start and end position of each GTG occurrence. Then we create a column condition to test whether the start position is in 1, 4, 7, 10, 13, 16, 19, 22, ...
library(stringi)
library(dplyr)
data.frame(stri_locate_all_fixed(gene1, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(gene1), 3))
Which gives:
# start end condition
#1 4 6 TRUE
If you want to generalize this to a list of genes, you could do:
lst <- list(gene1, gene2, gene3)
res <- lapply(lst, function(x) {
data.frame(stri_locate_all_fixed(x, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(x), 3))
})
Which would give:
#[[1]]
# start end condition
#1 4 6 TRUE
#
#[[2]]
# start end condition
#1 NA NA FALSE
#
#[[3]]
# start end condition
#1 3 5 FALSE
#2 9 11 FALSE
#3 21 23 FALSE
#4 70 72 TRUE
#5 75 77 FALSE
Following @Sobrique's comment, if "divided by length" means the number of occurrences that meet the condition divided by the total number of characters in each gene, you could do:
lapply(1:length(res), function(x) sum(res[[x]][["condition"]]) / nchar(lst[[x]]))
Which would give:
#[[1]]
#[1] 0.004830918
#
#[[2]]
#[1] 0
#
#[[3]]
#[1] 0.003021148
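For comparison, the same frame test can be sketched in Python. This is only an illustration of the idea, not part of the R answer; the function name in_frame_frequency and its parameters are made up for the example.

```python
def in_frame_frequency(seq: str, motif: str = "GTG", frame: int = 3) -> float:
    """Count motif occurrences that start at 1-based positions 1, 4, 7, ...
    and divide that count by the sequence length."""
    if not seq:
        return 0.0
    # 0-based offsets 0, 3, 6, ... are the 1-based positions 1, 4, 7, ...
    hits = sum(seq.startswith(motif, i) for i in range(0, len(seq), frame))
    return hits / len(seq)


# "ATGGTGCCGTGA" has GTG at 1-based positions 4 (in frame) and 9 (out of frame),
# so only one hit is counted over 12 bases.
freq = in_frame_frequency("ATGGTGCCGTGA")
```

Only the occurrence at position 4 contributes, which mirrors the condition column in the R output.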

Here's a Perl solution that does as you ask.
But I don't understand how your example output is derived: the first and last sequences have only one occurrence of GTG in the positions you require, and the second sequence has none at all. That means the outputs are 1 / 207, 0 / 74, and 1 / 331 respectively. None of those are anything like the 3 and 6.3 that you say you're expecting.
This program expects the path to the input file as a parameter on the command line.
use strict;
use warnings 'all';
print "Gene Frequency\n";
my $name;
local $/ = '>';
while ( <> ) {
chomp;
next unless /\S/;
my ($name, $seq) = split /\n/, $_, 2;
$seq =~ tr/A-Z//cd;
my $n = 0;
while ( $seq =~ /(?=GTG)/g ) {
++$n if $-[0] % 3 == 0;
}
printf "%-7s%.6f\n", $name, $n / length($seq);
}
output
Gene Frequency
Gene1 0.004831
Gene2 0.000000
Gene3 0.003021

Here is an alternate solution that does not use a pattern match. Not that it will matter much.
use strict;
use warnings;
my $gene;
while ( my $line = <> ) {
if ( $line =~ /^>(.+)/ ) {
$gene = $1;
next;
}
chomp $line;
printf "%s: %s\n",
$gene,
( grep { $_ eq 'GTG' } split /(...)/, $line ) / length $line;
}
Output:
Gene1: 0.00483091787439614
Gene2: 0
Gene3: 0.00302114803625378
It is essentially similar to Sobrique's answer, but assumes that the gene lines contain the right characters. It splits up the gene string into a list of three-character pieces and takes the ones that are literally GTG.
The splitting works by abusing the fact that split uses a pattern as the delimiter, and that it will also capture the delimiter if a capture group is used. Here's an example.
my @foo = split /(...)/, '1234567890';
p @foo; # from Data::Printer
__END__
[
[0] "",
[1] 123,
[2] "",
[3] 456,
[4] "",
[5] 789,
[6] 0
]
The empty elements get filtered out by grep. It might not be the most efficient way, but it gets the job done.
You can run it by calling perl foo.pl horribly-large-gene-sequence.file.
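The capture-group behaviour of split is not specific to Perl. As a rough Python illustration (the helper name triplets is invented for this sketch), re.split also returns the captured delimiters between the empty pieces:

```python
import re

# Splitting on a captured three-character "delimiter" keeps the delimiters
# in the result, interleaved with the (empty) text between them.
pieces = re.split(r"(...)", "1234567890")
print(pieces)  # ['', '123', '', '456', '', '789', '0']


def triplets(seq: str) -> list:
    """Return the consecutive three-character chunks of seq, dropping the
    empty strings and any incomplete trailing chunk."""
    return [p for p in re.split(r"(...)", seq) if len(p) == 3]


gtg_count = triplets("ATGGTGCCGTGA").count("GTG")  # only the in-frame GTG
```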

Well, you have an R solution. I've hacked something together in perl because you tagged it:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'GTG';
local $/ = "\n>";
while ( <> ) {
my ($gene) = m/(Gene\d+)/;
my @hits = grep { /^$target$/ } m/ ( [GTCA]{3} ) /xg;
print "$gene: ".( scalar @hits), "\n";
}
This doesn't give the same results as your input though:
Gene1: 1
Gene2: 0
Gene3: 1
I'm decomposing your string into 3 element lists, and looking for ones that specifically match. (And I haven't divided by length, as I'm not entirely clear if that's the actual string length in letters, or some other metric).
Including length matching - we need to capture both name and string:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n>";
while (<>) {
my ($gene, $gene_str) = m/(Gene\d+)\n([GTCA]+)/m;
my @hits = grep { /^GTG$/ } $gene_str =~ m/ ( [GTCA]{3} ) /xg;
print "$gene: " . @hits . "/". length ( $gene_str ), " = ", @hits / length($gene_str), "\n";
}
We use <> which is the 'magic' filehandle, and tells perl to read from either STDIN or a file specified on command line. Much like sed or grep does.
With your input:
Gene1: 1/207 = 0.00483091787439614
Gene2: 0/74 = 0
Gene3: 1/331 = 0.00302114803625378
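If the sequences are wrapped over several lines, as in a typical FASTA file, the record has to be joined before it is cut into triplets. Here is a hedged Python sketch of that step. The names read_fasta and codon_frequency are hypothetical, and it assumes the first word after > is the record name.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records as {name: sequence}, joining wrapped lines."""
    chunks = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the first whitespace-separated word is the name.
            name = (line[1:].split() or [""])[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line.upper())
    return {n: "".join(parts) for n, parts in chunks.items()}


def codon_frequency(seq: str, codon: str = "GTG") -> float:
    """Number of in-frame triplets equal to codon, divided by len(seq)."""
    if not seq:
        return 0.0
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return codons.count(codon) / len(seq)


records = read_fasta([">Gene1 some text", "ATGGTG", "CCGTGA", ">Gene2", "GGTGGG"])
for gene, seq in records.items():
    print(f"{gene}: {codon_frequency(seq)}")
```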

Here is a function I created based on your requirement. I am pretty sure there are alternate ways better than this but this solves the problem.
library(stringr) # str_count() comes from stringr, not stringi
input_gene_list<- list(gene1= "GTGGGGGTTTGTGGGGGTG", gene2= "GTGGGGGTTTGTGGGGGTG", gene3= "GTGGGGGTTTGTGGGGGTG")
gene_counter<- function(gene){
x<- gene
y<- gsub(pattern = "GTG",replacement = "GTG ", x = x, perl=TRUE)
if(str_count(y,pattern = "GTG")) {
gene_count<- unlist(gregexpr(pattern = " ", y))
counter<- 0
for(i in 1:length(gene_count)){
# the i-th inserted space sits at start + 2 + i, so remove the shift of i before testing the frame
if(((gene_count[i] - i) %% 3) == 0) counter=counter+1
}
return(counter/nchar(x))
}
}
output_list<- lapply(input_gene_list, function(x) gene_counter(x))
result<- t(as.data.frame(output_list))
result
[,1]
gene1 0.05263158
gene2 0.05263158
gene3 0.05263158
Also share your thoughts on it! Thanks!

Related

Unix : Split line with delimiter

I've a file like this
a b c,d
e f g
x y r,s,t
and I would like to split this into columns using "," as delimiter. The other columns should be copied.
Expected result :
a b c
a b d
e f g
x y r
x y s
x y t
Thank you
Using awk. Expects field separators to be space or tab:
$ awk '{
split($3,a,",") # split the third field on commas, hash to a
for(i in a) { # for all entries in a
sub(/[^ \t]+$/,a[i],$0) # replace last field with entries in a...
print # ... preserving separators, space or tab
}
}' file
a b c
a b d
e f g
x y r
x y s
x y t
Because it uses sub(), this will produce wrong results if $3 contains an &. Also, as mentioned in the comments, for(i in a) may output the records in a seemingly random order. If that is a problem, use:
$ awk '{
n=split($3,a,",") # store element count to n
for(i=1;i<=n;i++) { # iterate in order
sub(/[^ \t]+$/,a[i],$0)
print
}
}' file
For tab separated files:
$ awk '
BEGIN { OFS="\t" } # output field separator
{
n=split($3,a,",")
for(i=1;i<=n;i++) {
$3=a[i] # & friendly version
print
}
}' file
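For readers who prefer a script over awk, the same expansion could be sketched in Python. This is an illustrative equivalent, not a translation of the answer: the function name expand_rows is made up, it splits the last field rather than $3, and it normalizes the separators to single spaces.

```python
def expand_rows(lines, sep=","):
    """Yield one output row per sep-separated value in the last field,
    repeating the leading fields on every row."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        *head, last = fields
        # str.split keeps the original order, so rows come out in order.
        for value in last.split(sep):
            yield " ".join(head + [value])


for row in expand_rows(["a b c,d", "e f g", "x y r,s,t"]):
    print(row)
```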

How to limit character repetition in a word to 2?

I want to remove characters that repeat more than twice in a word. For example
"hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day"
to
"hhaappyy mmoorning friendss, good good day"
I have tried something like this, but it is not reducing to exactly 2 repetitions.
gsub('([[:alpha:]])\\1{2}', '\\1',
'hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day')
#[1] "hhappyyy mmoorning friendsssss, good god day"
Thank you.
You need to use {2,} quantifier and use two \1 in the replacement:
s<-'hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day'
gsub('([[:alpha:]])\\1{2,}', '\\1\\1', s)
# => [1] "hhaappyy mmoorning friendss, good good day"
See the R demo.
The ([[:alpha:]])\\1{2,} pattern matches and captures a letter into Group 1 and then 2 or more repetitions of the same char are matched. Two \1 in the replacement pattern replace the whole match with 2 occurrences of the char. It is valid to use two \1 placeholders because every match is at least 3 identical chars.
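The same backreference idea carries over to other regex engines. A small Python sketch, assuming ASCII letters stand in for [[:alpha:]] (the function name limit_repeats is invented here):

```python
import re


def limit_repeats(text: str, max_run: int = 2) -> str:
    """Shorten every run of more than max_run identical letters to max_run."""
    # ([A-Za-z]) captures one letter; \1{n,} then requires at least n more
    # copies of it, so each match is a run of at least max_run + 1 letters.
    pattern = re.compile(r"([A-Za-z])\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)


s = "hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day"
print(limit_repeats(s))  # hhaappyy mmoorning friendss, good good day
```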
The same approach as Wiktor Stribiżew's, but in JavaScript, and applied to every character (digits and punctuation as well), in case you need that:
var sInput = "hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day";
var sOutput = sInput.replace(/(.)\1{2,}/g, "$1$1");
console.log(sOutput);
fwiw, here is another solution:
f = function(x){
x = strsplit(x, '')[[1]]
x = rle(x)
x = rep(x$values, pmin(2, x$lengths))
paste(x, collapse='')
}
example:
x = "hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day"
f(x)
[1] "hhaappyy mmoorning friendss, good good day"
however, gsub is a little easier...
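The run-length idea above is not tied to R. As a sketch only, Python's itertools.groupby produces the same runs that rle() does (the function name limit_runs is hypothetical):

```python
from itertools import groupby


def limit_runs(text: str, max_run: int = 2) -> str:
    """Cap every run of identical characters at max_run, like rle() + pmin()."""
    return "".join(
        ch * min(max_run, len(list(run)))  # run yields the repeated characters
        for ch, run in groupby(text)
    )


x = "hhaaappppyyyyyyy mmoooooorning friendsssssssssssssss, good goood day"
print(limit_runs(x))  # hhaappyy mmoorning friendss, good good day
```

Like the rle() version, this caps runs of any character, not only letters.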
package test.com;
public class limitCharCount {
public static void main(String[] args) {
// TODO Auto-generated method stub
String str = "gggkjjkjkjjjjjsssslklkkkkkk";
char ch[] = str.toCharArray();
String Test = "";
//int count = 2;
for (int i = 0; i < ch.length; i++) { // visit every character, including the last one
if (i == 0 ||i == 1)
Test = Test + ch[i];
else if (!(ch[i]==ch[i-1] && ch[i]==ch[i-2]) )
{
Test = Test + ch[i];
}
}
System.out.println(Test);
}
}
output ::ggkjjkjkjjsslklkk

Latex expression replacement in R

I'm working with a LaTeX document in R and I need to change {#1 \over #2} to \frac{#1}{#2}.
With simple expressions like:
{1\over 2}
{x^2+y^2\over \lambda}
I can do it with stringr::str_replace() or gsub base functions and the regex \\{([\\^a-z0-9\\\\\\s\\+\\-\\*/\(\)]+)\\s*\\\\over\\s*([\\^a-z0-9\\\\\\s\\+\\-\\*/\(\)]+)\\} (I guess there has to be a better approach to do this. I tried with \\{(.+)\\s*\\\\over\\s*(.*)\\} but it captured more than I wanted.)
But when I work with expressions like:
{e^{2c} \over x-1}
{2yz\over 1+x^{2} }
or a more complicated expression:
\\item $Dom\\left(Q\\right)\\ne {\\rm R}^{2} $ y uno de los puntos no pertenecientes al dominio es $\\left({1\\over 2} ,{1\\over 2} \\right).$
the above regex failed.
Is there a regex that can catch all the alternatives? Thanks
Given some sample strings:
> strings
[1] "{1\\over 2}" "{x^2+y^2\\over \\lambda}"
This monster:
> unlist(
lapply(
strsplit(
sub("\\}$","",
sub("^\\{","",strings)),"\\\\over"),
function(x){paste0("\\frac{",x[1],"}{",x[2],"}")}))
produces:
[1] "\\frac{1}{ 2}" "\\frac{x^2+y^2}{ \\lambda}"
This will break if there's more than one \over in the source string, and probably in many other cases too. It also doesn't work if there are spaces before the first { or after the closing }.
On your other examples you get this:
in out
[1,] "{1\\over 2}" "\\frac{1}{ 2}"
[2,] "{x^2+y^2\\over \\lambda}" "\\frac{x^2+y^2}{ \\lambda}"
[3,] "{e^{2c} \\over x-1}" "\\frac{e^{2c} }{ x-1}"
[4,] "{2yz\\over 1+x^{2} }" "\\frac{2yz}{ 1+x^{2} }"
I rather enjoyed this question.
At some point you have to parse the document. parse_tex from TeXCheckR was written with LaTeX, not plain TeX, in mind, but it seems to do okay here. For instances of \over that span multiple lines the script would need to be changed, though I think the principle would be the same.
The challenging case was continued fractions.
library(data.table) # for shift
library(TeXCheckR) # for parse_tex
locate_over <- function(doc_parsed) {
lead <- function(x, n) data.table::shift(x, n = n, type = "lead", fill = "")
char <- .subset2(doc_parsed, "char")
which(char == "\\" &
lead(char == "o", 1L) &
lead(char == "v", 2L) &
lead(char == "e", 3L) &
lead(char == "r", 4L))
}
over2frac <- function(lines, verbose = FALSE) {
out <- lines
for (i in seq_along(lines)) {
if (grepl("\\over", lines[i], fixed = TRUE)) {
i_parsed <- parse_tex(lines[i])
# Find lhs
for (j in locate_over(i_parsed)) {
lhs_start <- max(which(.subset2(i_parsed, "char") %chin% c("$", "{") &
.subset2(i_parsed, "column") < j &
.subset2(i_parsed, "tex_group") == .subset2(i_parsed[j], "tex_group")))
rhs_end <- min(which(.subset2(i_parsed, "char") %chin% c("$", "}") &
.subset2(i_parsed, "column") > j + 4L &
.subset2(i_parsed, "tex_group") == .subset2(i_parsed[j], "tex_group")))
i_parsed[lhs_start, char := "{\\frac{"]
i_parsed[rhs_end, char := "}}"]
}
res <- paste0(i_parsed[["char"]], collapse = "")
res <- gsub("\\over", "}{", res, fixed = TRUE)
out[i] <- res
}
}
out
}
Test TeX document:
$5 \over 2$
This is another fraction: ${1 \over 2}$.
And another:
$$A = a \over b$$
What about:
$${{D \over C} \over H}$$
Finally:
$${e^{2c} \over x-1}$$
${2yz\over 1+x^{2} }$
$$\phi = 1 + {1 \over {1 + {1 \over {1 + {1 \over {1 + \ddots}}}}}}$$
\item $Dom\left(Q\right)\ne {\rm R}^{2} $ y uno de los puntos no pertenecientes al dominio es $\left({1\over 2} ,{1\over 2}\right).$
\bye
Resulting LaTeX document, with the necessary LaTeX-specific stuff, plus mandatory math mode for inline fractions:
writeLines(over2frac(readLines("tex1.tex"), verbose = FALSE), "latex1.tex")
\documentclass{article}
\begin{document}
${\frac{5 }{ 2}}$
This is another fraction: ${\frac{1 }{ 2}}$.
And another:
${\frac{A = a }{ b}}$
What about:
$${\frac{{\frac{D }{ C}} }{ H}}$$
Finally:
$${\frac{e^{2c} }{ x-1}}$$
${\frac{2yz}{ 1+x^{2} }}$
$$\phi = 1 + {\frac{1 }{ {1 + {\frac{1 }{ {1 + {\frac{1 }{ {1 + \ddots}}}}}}}}}$$
\item $Dom\left(Q\right)\ne {\rm R}^{2} $ y uno de los puntos no pertenecientes al dominio es $\left({\frac{1}{ 2}} ,{\frac{1}{ 2}} \right).$
\end{document}
This gets you most of the way for your examples:
library(stringr)
s <- "Expression 1 is {1\\over 2}.
Expression 2 is {x^2+y^2\\over \\lambda}, yes it is.
Expression 3 is {e^{2c} \\over x-1}.
The last expression: {2yz\\over 1+x^{2} }, all done now."
s2 <- str_replace_all(s,
"\\{(.*?)\\s{0,}\\\\over\\s{0,}(.*?)\\}",
"\\frac\\{\\1\\}\\{\\2\\}")
s2
[1] "Expression 1 is frac{1}{2}.\n\nExpression 2 is frac{x^2+y^2}{\\lambda}, yes it is.\n\nExpression 3 is frac{e^{2c}}{x-1}.\n\nThe last expression: frac{2yz}{1+x^{2} }, all done now."
The only issue is that a space remains in the last expression, which may not be a problem since it existed in the original:
frac{2yz}{1+x^{2} }
x=c("{e^{2c} \\over x-1}","{2yz\\over 1+x^{2} },,dty{k^4e{-rpi/3}\\over\\sqrt{2pik}}")
gsub("\\{(.*?)\\\\over(.*?)\\}","\\\\frac{\\1}{\\2}",x)
[1] "\\frac{e^{2c} }{ x-1}"
[2] "\\frac{2yz}{ 1+x^{2} },,dty\\frac{k^4e{-rpi/3}}{\\sqrt{2pik}}"
Explanation:
\{(.*?)\\over(.*?)\}
\{ matches the character { literally
1st capturing group (.*?)
.*? matches any character (except for line terminators)
*? quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\\ matches the character \ literally
over matches the characters over literally (case sensitive)
2nd capturing group (.*?)
.*? matches any character (except for line terminators)
*? quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\} matches the character } literally
This approach can handle:
multiple {...} containing \over on the same line
{...} not containing \over
other text before, after and between occurrences of {...}
lines not having any {...} with \over
For example, note in the example below the {jjj} on the second input line before the first occurrence of a {...} with \over works as expected.
It makes use of gsubfn which can handle balanced parentheses. First create a proto object p similar to the one in my answer here. p initializes a counter k to 0 and increments it for each { and decrements it for each }. It replaces any { for which k=1 with ! and also any } for which k=0 with !.
Then replace !...\over...! with \frac{...}{...} and replace any remaining !...! with {...}.
We have assumed that ! does not appear in the input but if it does choose a different character.
library(gsubfn)
library(magrittr)
# test input
s <- c("abc {1\\over 2} def {x^2+y^2\\over \\lambda} ghi { 12 } XYZ",
"X {jjj} A {e^{2c} \\over x-1} jkl {2yz\\over 1+x^{2} } Z")
# processing
p <- proto(
pre = function(.) .$k <- 0,
fun = function(., x) {
if (x == "{") .$k <- .$k + 1 else if (x == "}") .$k <- .$k - 1
if (x == "{" && .$k == 1) "!" else if (x == "}" && .$k == 0) "!" else x
})
s %>%
gsubfn("[{}]", p, .) %>%
gsub("!([^!]*)\\\\over ([^!]*)!", "\\\\frac{\\1}{\\2}", .) %>%
gsub("!([^!]*)!", "{\\1}", .)
giving this result:
[1] "abc \\frac{1}{2} def \\frac{x^2+y^2}{\\lambda} ghi { 12 } XYZ"
[2] "X {jjj} A \\frac{e^{2c} }{x-1} jkl \\frac{2yz}{1+x^{2} } Z"
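The brace-counting idea can also be written as a small recursive scan instead of a marker character. The following Python sketch is an assumption-laden illustration, not the gsubfn method itself. The names over_to_frac and _convert are invented, only \over inside a {...} group is rewritten, and whitespace around numerator and denominator is trimmed.

```python
OVER = "\\over"


def _convert(text, i, top_level):
    """Convert text from index i up to the matching '}' (or the end of the
    text at top level). Returns (converted_text, next_index)."""
    parts = []      # converted pieces at this brace depth
    over_at = None  # index into parts where this group's \over was seen
    while i < len(text):
        ch = text[i]
        if ch == "{":
            inner, i = _convert(text, i + 1, False)  # nested group
            parts.append(inner)
        elif ch == "}" and not top_level:
            i += 1
            break
        elif (not top_level and over_at is None and text.startswith(OVER, i)
              and not text[i + len(OVER):i + len(OVER) + 1].isalpha()):
            over_at = len(parts)  # skip \over itself, but not \overline etc.
            i += len(OVER)
        else:
            parts.append(ch)
            i += 1
    if over_at is None:
        body = "".join(parts)
        return (body if top_level else "{" + body + "}"), i
    num = "".join(parts[:over_at]).strip()
    den = "".join(parts[over_at:]).strip()
    return "\\frac{" + num + "}{" + den + "}", i


def over_to_frac(text: str) -> str:
    """Rewrite every {a \\over b} group in text as \\frac{a}{b}."""
    return _convert(text, 0, True)[0]


print(over_to_frac(r"X {jjj} A {e^{2c} \over x-1} jkl {2yz\over 1+x^{2} } Z"))
```

Because each group is converted before its parent, nested fractions such as {{D \over C} \over H} are handled from the inside out. A bare $a \over b$ with no surrounding braces is left untouched by this sketch.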

Counting data that is valid to two conditions in columns 1 and 2

I am trying to run the following loop. The two while statements work, but the @ c awk line seems to be causing me some problems.
printf "" >! loop.txt
@ x = -125
while ($x <= -114)
@ y = 32
while ($y <= 42)
@ c =`awk '{ for ($1 = $x*; $2 = $y*){count[$1]++}}' text.txt`
printf "$x $y $c\n" >> loop.txt
@ y++
end
@ x++
end
With the awk line, I am trying to reference a file with lots of differing values in columns 1 and 2 of the text.txt file.
I want to be able to firstly reference all of the values in column 1 that start with $x (as they all have several decimal places), then reference from that sub-list all of the values in column 2 that begin with $y. After this second sub-list has been formed, I would like to count all of the entries valid to those conditions.
However, I keep getting syntax errors with the line, and I'm not sure that I'm using the correct function!
EDIT:
The executable file is a .csh type (C shell, I think)
A sample input format...
-125.025 32.058 2.25
-125.758 32.489 2.67
-125.349 32.921 3.49
-125.786 32.753 4.69
-125.086 33.008 2.78
And the expected output...
-125 32 4
-125 33 1
So this is all you want?
$ awk '{cnt[int($1)][int($2)]++} END{for (x in cnt) for (y in cnt[x]) print x, y, cnt[x][y]}' file
-125 32 4
-125 33 1
If you want to specify a range of x and y values, just add that range check before incrementing the array entry:
awk '
{ x=int($1); y=int($2) }
x>=-125 && x<=-114 && y>=32 && y<=42 { cnt[x][y]++ }
END { for (x in cnt) for (y in cnt[x]) print x, y, cnt[x][y] }
' file
I split it into multiple lines to improve readability and added variables to avoid calling int() multiple times for each field.
Note that the above reads your input file just once, whereas the script you posted in your question would read the whole input file 132 times (12 values of x times 11 values of y). You can imagine the performance improvement from that alone, never mind starting and stopping awk 132 times.
The above uses GNU awk for true 2D arrays, but they can easily be simulated in other awks.
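The single-pass counting idea is easy to reproduce elsewhere. As a hedged Python sketch (the function name count_cells and its range parameters are assumptions for this example), a Counter keyed by the truncated pair plays the role of the 2D array:

```python
from collections import Counter


def count_cells(lines, x_range=(-125, -114), y_range=(32, 42)):
    """Count rows per (int(x), int(y)) cell in one pass over the input."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        # int() truncates toward zero, like awk's int(): -125.758 -> -125
        x, y = int(float(fields[0])), int(float(fields[1]))
        if x_range[0] <= x <= x_range[1] and y_range[0] <= y <= y_range[1]:
            counts[(x, y)] += 1
    return counts


sample = [
    "-125.025 32.058 2.25",
    "-125.758 32.489 2.67",
    "-125.349 32.921 3.49",
    "-125.786 32.753 4.69",
    "-125.086 33.008 2.78",
]
for (x, y), n in sorted(count_cells(sample).items()):
    print(x, y, n)
```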

Finding duplicate values in r

So, In a string containing multiple 1's,
Now, it is possible that, the number
'1'
appears at several positions, let's say, at multiple positions. What I want is
(3)
This is not a complete answer, but some ideas (partly based on comments):
z <- "1101101101"
zz <- as.numeric(strsplit(z,"")[[1]])
Compute autocorrelation function and draw plot: in this case I'm getting the periodicity=3 pretty crudely as the first point at which there is an increase followed by a decrease ...
a1 <- acf(zz)
first.peak <- which(diff(sign(diff(a1$acf[,,1])))==-2)[1]
Now we know the periodicity is 3; create runs of 3 with embed() and analyze their similarities:
ee <- embed(zz,first.peak)
pp <- apply(ee,1,paste,collapse="")
mm <- outer(pp,pp,"==")
aa <- apply(mm[!duplicated(mm),],1,which)
sapply(aa,length) ## 3 3 2 ## number of repeats
sapply(aa,function(x) unique(diff(x))) ## 3 3 3
The following code does exactly what you ask for. Try it with str_groups('1101101101'). It returns a list of 3-vectors. Note that the first triple is (1, 3, 4) because the character at the 10th position is also a 1.
Final version, optimized and without errors
str_groups <- function (s) {
digits <- as.numeric(strsplit(s, '')[[1]])
index1 <- which(digits == 1)
len <- length(digits)
back <- length(index1)
if (back == 0) return(list())
maxpitch <- (len - 1) %/% 2
patterns <- matrix(0, len, maxpitch)
result <- list()
for (pitch in 1:maxpitch) {
divisors <- which(pitch %% 1:(pitch %/% 2) == 0)
while (index1[back] > len - 2 * pitch) {
back <- back - 1
if (back == 0) return(result)
}
for (startpos in index1[1:back]) {
if (patterns[startpos, pitch] != 0) next
pos <- seq(startpos, len, pitch)
if (digits[pos[2]] != 1 || digits[pos[3]] != 1) next
repeats <- length(pos)
if (repeats > 3) for (i in 4:repeats) {
if (digits[pos[i]] != 1) {
repeats <- i - 1
break
}
}
continue <- F
for (subpitch in divisors) {
sublen <- patterns[startpos, subpitch]
if (sublen > pitch / subpitch * (repeats - 1)) {
continue <- T
break
}
}
if (continue) next
for (i in 1:repeats) patterns[pos[i], pitch] <- repeats - i + 1
result <- append(result, list(c(startpos, pitch, repeats)))
}
}
return(result)
}
Note: this algorithm has roughly quadratic runtime complexity, so if you make your strings twice as long, it will take four times as much time to find all patterns on average.
Pseudocode version
To aid understanding of the code. For particulars of R functions such as which, consult the R online documentation, for example by running ?which on the R command line.
PROCEDURE str_groups WITH INPUT $s (a string of the form /(0|1)*/):
digits := array containing the digits in $s
index1 := positions of the digits in $s that are equal to 1
len := pointer to last item in $digits
back := pointer to last item in $index1
IF there are no items in $index1, EXIT WITH empty list
maxpitch := the greatest possible interval between 1-digits, given $len
patterns := array with $len rows and $maxpitch columns, initially all zero
result := array of triplets, initially empty
FOR EACH possible $pitch FROM 1 TO $maxpitch:
divisors := array of divisors of $pitch (including 1, excluding $pitch)
UPDATE $back TO the last position at which a pattern could start;
IF no such position remains, EXIT WITH result
FOR EACH possible $startpos IN $index1 up to $back:
IF $startpos is marked as part of a pattern, SKIP TO NEXT $startpos
pos := possible positions of pattern members given $startpos, $pitch
IF either the 2nd or 3rd $pos is not 1, SKIP TO NEXT $startpos
repeats := the number of positions in $pos
IF there are more than 3 positions in $pos THEN
count how long the pattern continues
UPDATE $repeats TO the length of the pattern
END IF (more than 3 positions)
FOR EACH possible $subpitch IN $divisors:
check $patterns for pattern with interval $subpitch at $startpos
IF such a pattern is found AND it envelopes the current pattern,
SKIP TO NEXT $startpos
(using helper variable $continue to cross two loop levels)
END IF (pattern found)
END FOR (subpitch)
FOR EACH consecutive position IN the pattern:
UPDATE $patterns at row of position and column of $pitch TO ...
... the remaining length of the pattern at that position
END FOR (position)
APPEND the triplet ($startpos, $pitch, $repeats) TO $result
END FOR (startpos)
END FOR (pitch)
EXIT WITH $result
END PROCEDURE (str_groups)
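To make the (startpos, pitch, repeats) triples concrete, here is a deliberately simplified Python sketch. It is not a port of str_groups: it only reports maximal runs of equally spaced 1s and leaves out the sub-pitch filtering described above, so its output can differ from the R function on other inputs. The name find_progressions is invented for the example.

```python
def find_progressions(s: str, min_repeats: int = 3):
    """Return (start, pitch, repeats) for every maximal run of at least
    min_repeats '1' digits spaced pitch apart (positions are 1-based)."""
    ones = {i + 1 for i, ch in enumerate(s) if ch == "1"}
    max_pitch = (len(s) - 1) // 2  # three members must fit in the string
    result = []
    for pitch in range(1, max_pitch + 1):
        for start in sorted(ones):
            if start - pitch in ones:
                continue  # this position continues an earlier run
            repeats = 1
            while start + repeats * pitch in ones:
                repeats += 1
            if repeats >= min_repeats:
                result.append((start, pitch, repeats))
    return result


print(find_progressions("1101101101"))  # [(1, 3, 4), (2, 3, 3)]
```

The first triple, (1, 3, 4), corresponds to the 1s at positions 1, 4, 7 and 10.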
Perhaps the following route will help:
Convert the string to a vector of integers
v <- as.integer(strsplit(s, "")[[1]])
Repeatedly convert this vector to matrices of varying number of rows...
m <- matrix(v, nrow=...)
...and use rle to find relevant patterns in the rows of the matrix m:
rle(m[1, ]); rle(m[2, ]); ...

Resources