Simultaneous substitutions with s/// in Perl 6 - substitution

Is there a way to do simultaneous substitutions with s///? For instance, if I have a string with a number of 1s, 2s, 3s, etc., and I want to substitute 1 with "tom", 2 with "mary", 3 with "jane", and so on?
my $a = "13231313231313231";
say $a ~~ s:g/1/tom/;
say $a ~~ s:g/2/mary/;
say $a ~~ s:g/3/jane/;
Is there a good way to do all three steps at once?

For replacements like your example, you can use trans. Provide a list of what to search for and a list of replacements:
my $a = "13231313231313231";
$a .= trans(['1','2','3'] => ['tom', 'mary', 'jane']);
say $a;
tomjanemaryjanetomjanetomjanemaryjanetomjanetomjanemaryjanetom
For simple strings, you can simplify with word quoting:
$a .= trans(<1 2 3> => <tom mary jane>);
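The point of trans here is that all replacements happen in a single pass, so text produced by one replacement is never rescanned by another. For comparison, here is a minimal Python sketch of the same idea; the helper name substitute_all is hypothetical, and it assumes the keys are plain strings rather than patterns:

```python
import re

def substitute_all(text, mapping):
    """Replace every key of `mapping` found in `text` in one pass."""
    # Try longer keys first so that a key like "12" would win over "1".
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Each match is replaced exactly once; replacement text is not rescanned.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

names = {"1": "tom", "2": "mary", "3": "jane"}
print(substitute_all("12341234", names))  # tommaryjane4tommaryjane4
```

A single pass also makes swaps safe: substitute_all("ab", {"a": "b", "b": "a"}) gives "ba", whereas two sequential substitutions would collapse both characters into one.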

The simplest way is probably to make a Map of your substitutions and then reference it.
my $a = "123123";
my $map = Map.new(1 => "tom", 2 => "mary", 3 => "jane");
$a ~~ s:g/\d/$map{$/}/;
say $a
"tommaryjanetommaryjane"
If you only want to map certain values, you can of course narrow your match:
my $a = "12341234";
my $map = Map.new(1 => "tom", 2 => "mary", 3 => "jane");
$a ~~ s:g/1 || 2 || 3/$map{$/}/;
say $a
"tommaryjane4tommaryjane4"

Related

Broadcasting algebraic operations between dicts

I have two dicts and I want to subtract the matching values from the two dicts to generate a third dict.
A = Dict("w" => 2, "x" => 3)
B = Dict("x" => 5, "w" => 7)
# Ideally I could go B .- A and get a dict like
C = Dict("w" => 5, "x" => 2)
# but I get ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved
One ugly solution is to overload the subtraction operator but I am not keen to overload for a builtin type like dict for fear of breaking other code.
import Base.-
function -(dictA::Dict, dictB::Dict)
keys_of_A = keys(dictA)
subtractions = get.(Ref(dictB), keys_of_A, 0) .- get.(Ref(dictA), keys_of_A, 0)
return Dict(keys_of_A .=> subtractions)
end
Is there a cleaner way to do algebraic operations on matching values from different dicts?
merge provides the result you want.
A = Dict("w" => 2, "x" => 3)
B = Dict("x" => 5, "w" => 7)
C = merge(-, B, A)
Dict{String,Int64} with 2 entries:
"w" => 5
"x" => 2
Note that merge performs a union of the two collections and combines common keys by performing the given operation. So, for example:
W = Dict("w" => 4)
merge(-, B, W)
Dict{String,Int64} with 2 entries:
"w" => 3
"x" => 5

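The union-plus-combine behaviour described in the merge answer above can be sketched in Python for comparison. The helper name merge_with is hypothetical, and this is only an approximation of what Julia's merge does with a combining function:

```python
import operator

def merge_with(combine, left, right):
    """Union of two dicts; values of shared keys become combine(left, right)."""
    result = dict(left)
    for key, value in right.items():
        if key in result:
            result[key] = combine(result[key], value)
        else:
            # A key present only in `right` is copied unchanged (not negated).
            result[key] = value
    return result

A = {"w": 2, "x": 3}
B = {"x": 5, "w": 7}
W = {"w": 4}
print(merge_with(operator.sub, B, A))  # {'x': 2, 'w': 5}
print(merge_with(operator.sub, B, W))  # {'x': 5, 'w': 3}
```

Note the consequence for subtraction: a key that exists only in the second dict keeps its original sign, which may or may not be what you want.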
What's the equivalent of R's ifelse in Perl PDL

I am new to PDL. R's ifelse() function can do conditional element selection. For example,
x <- c(1,2,3,4)
ifelse(x%%2, x, x*2)
# [1] 1 4 3 8
Does anyone know how to do this in PDL? I know you can do it like below, but is there a better way?
pdl(map { $_ % 2 ? $_ : $_*2 } @{$x->unpdl} )
#! /usr/bin/perl
use warnings;
use strict;
use PDL;
my $x = 'PDL'->new([1, 2, 3, 4]);
my $where = ! ($x % 2); # [0 1 0 1]
my $y = $x * ($where + 1);
print $y; # [1 4 3 8]
or, more briefly:
my $y = $x * ( 2 - $x % 2 );
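To see why the arithmetic form works: x % 2 is 1 for odd values and 0 for even ones, so 2 - x % 2 is a multiplier of 1 for odds and 2 for evens. A small Python sketch (plain lists, no array library; the ifelse helper is hypothetical and does not recycle shorter arguments the way R does) checks that both forms agree:

```python
def ifelse(test, yes, no):
    """Element-wise choice: take from `yes` where `test` is truthy, else from `no`."""
    return [y if t else n for t, y, n in zip(test, yes, no)]

x = [1, 2, 3, 4]
mask = [v % 2 for v in x]              # 1 for odd, 0 for even
doubled = [v * 2 for v in x]

print(ifelse(mask, x, doubled))        # [1, 4, 3, 8]
print([v * (2 - v % 2) for v in x])    # [1, 4, 3, 8]
```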
Answering the question myself. It can be something like this,
use PDL;
sub ifelse {
my ( $test, $yes, $no ) = @_;
$test = pdl($test);
my ( $ok, $nok ) = which_both($test);
my $rslt = zeros( $test->dim(0) );
unless ( $ok->isempty ) {
$yes = pdl($yes);
$rslt->slice($ok) .= $yes->index( $ok % $yes->dim(0) );
}
unless ( $nok->isempty ) {
$no = pdl($no);
$rslt->slice($nok) .= $no->index( $nok % $no->dim(0) );
}
return $rslt;
}
my $x = pdl( 1, 2, 3, 4 );
say ifelse( $x % 2, $x, $x * 2 ); # [1 4 3 8]
say ifelse( $x % 2, 5, sequence( 3 ) ); # [5 1 5 0]
say ifelse( 42, $x, $x * 2 ); # [1]
In PDL, the general solution to that sort of thing likely involves slicing and similar. Looking at the latest release notes of PDL (2.077), which has a new where_both, I remembered this question (disclosure: I'm the current maintainer). While your specific problem only involves a change to the values of even numbers, I'll also show the case of adding 2 to odds:
my ($odd, $even) = where_both($x, $x % 2);
$odd += 2, $even *= 2; # the "," form is just a flourish, it could be 2 lines
It's efficient, in proper PDL style, because the scan of $x only happens once (you won't be surprised to learn it also uses which_both under the hood), and the mutations only look at the slices of the bits that are relevant. This is very similar to your code, but it got captured into a small, widely-reusable function. (I wrote it to turn the TriD EuclidAxes stuff from using Perl for-loops to actually using ndarrays, if you're interested)
Better than $x ? $y : $z? Not to my mind, but it's a matter of style and taste
sub ifelse {
my ($x,$y,$z) = @_;
$x ? $y : $z ;
if($x){$y}else{$z} ;
[$y,$z]->[!$x] ;
[$z,$y]->[!!$x] ;
($x && $y) || $z ; # valid only if $y is always true
(!$x && $z) || $y ; # valid only if $z is always true
}

How to functionally convert a nested hash to a list of records?

Let's say I have a nested hash describing money quantities:
my %money = (coins => {'50c' => 4}, notes => {'10' => 1, '20' => 5});
My desired format is a record list:
my @money = [
(:type('coins'), :subtype('50c'), value => 4),
(:type('notes'), :subtype('10'), value => 1),
(:type('notes'), :subtype('20'), value => 5),
];
The most obvious answer is loops:
my @money;
for %money.kv -> $type, %sub-records {
for %sub-records.kv -> $subtype, $value {
@money.push: (:$type, :$subtype, :$value);
}
}
But I'm allergic to separating a variable from the code that populates it. Next, I tried to create the variable with functional transformations on the input hash:
%money.kv.map: -> $k1, %hsh2 { :type($k1) X, %hsh2.kv.map(->$k2, $v2 {:subtype($k2), :$v2, :value($v2)}) }
But I didn't get the nesting right. I want a list of flat lists. Plus, the above is a mess to read.
The compromise is the gather/take construct which lets me construct a list by iteration without any temporary/uninitialized junk in the main scope:
my @money = gather for %money.kv -> $type, %sub-records {
for %sub-records.kv -> $subtype, $value {
take (:$type, :$subtype, :$value);
}
};
But I'm curious, what is the right way to get this right with just list transformations like map, X or Z, and flat? ("key1", "key2", and "value" are fine field names, since an algorithm shouldn't be domain specific.)
Edit: I should mention that in Perl 6, gather/take is the most readable solution (best for code that's not write-only). I'm still curious about the pure functional solution.
my @money = %money.map:
-> ( :key($type), :value(%records) ) {
slip
:$type xx *
Z
( 'subtype' X=> %records.keys )
Z
( 'value' X=> %records.values )
}
You could do .kv.map: -> $type, %records {…}
-> ( :key($type), :value(%records) ) {…} destructures a Pair object
:$type creates a type => $type Pair
:$type xx * repeats :$type infinitely (Z stops when any of its inputs stops)
('subtype' X=> %records.keys) creates a list of Pairs
(Note that .keys and .values are in the same order if you don't modify the Hash between the calls)
Z zips two lists
slip causes the elements of the sequence to slip into the outer sequence
(flat would flatten too much)
If you wanted them to be sorted
my @money = %money.sort.map: # 'coins' sorts before 'notes'
-> ( :key($type), :value(%records) ) {
# sort by the numeric part of the key
my @sorted = %records.sort( +*.key.match(/^\d+/) );
slip
:$type xx *
Z
( 'subtype' X=> @sorted».key )
Z
( 'value' X=> @sorted».value )
}
You could do .sort».kv.map: -> ($type, %records) {…}
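For readers who think more easily in another language, the shape of the transformation (one record per inner key, tagged with its outer key) can be sketched in Python as a single nested comprehension. This is only a comparison, not a Perl 6 solution; the function name to_records is hypothetical:

```python
money = {"coins": {"50c": 4}, "notes": {"10": 1, "20": 5}}

def to_records(nested):
    """Flatten a two-level dict into a list of {type, subtype, value} records."""
    return [
        {"type": kind, "subtype": subtype, "value": value}
        for kind, inner in nested.items()      # outer level: coins, notes
        for subtype, value in inner.items()    # inner level: denominations
    ]

for record in to_records(money):
    print(record)
```

The two for clauses play the role of the nested loops, and because they sit in one comprehension the result is already flat, so no slip or flat step is needed.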

find frequency of substring in a set of strings

I have as input a gene list where each genes has a header like >SomeText.
For each gene I would like to find the frequency of the string GTG (number of occurrences divided by the length of the gene). The string should only be counted if it starts at position 1, 4, 7, 10, etc. (every third position).
>ENST00000619537.4 cds:known chromosome:GRCh38:21:6560714:6564489:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGATGTGACCATCCAGCACCCCTGGTTCAAGCGCACCCTGGGGCCCTTCTACCCCAGC
CGGCTGTTCGACCAGTTTTTCGGCGAGGGCCTTTTTGAGTATGACCTGCTGCCCTTCCTG
TCGTCCACCATCAGCCCCTACTACCGCCAGTCCCTCTTCCGCACCGTGCTGGACTCCGGC
ATCTCTGAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTC
TCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCAC
AACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTG
CCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACC
TTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCC
GTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624019.3 cds:known chromosome:GRCh38:21:6561284:6563978:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGACGCCCCCCCCCCCCACCCAACCACAGGCCTCCTCTCTGAGCCACGGGTTCGATCC
GACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTCTCCCCGGAGGACCTCACC
GTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCACAACGAGCGCCAGGACGAC
CACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTGCCGTCCAACGTGGACCAG
TCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACCTTCTGTGGCCCCAAGATC
CAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCCGTGTCGCGGGAGGAGAAG
CCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624932.1 cds:known chromosome:GRCh38:21:6561954:6564203:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGCCTGTCTGTCCAGGAGACAGTCACAGGCCCCCGAAAGCTCTGCCCCACTTGGTGTGT
GGGAGAAGAGGCCGGCAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTG
AAGCACTTCTCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCAC
GGAAAGCACAACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGC
TACCGCCTGCCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGC
ATGCTGACCTTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGA
GCCATCCCCGTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
Output:
Gene Frequency
Gene1: 3
Gene2 6.3
....
I was thinking of something like this, but I don't know how to define the position requirement:
freq <- sapply(gregexpr("GTG",x),function(x)if(x[[1]]!=-1) length(x) else 0)
Here is an idea in R using stringi.
We use stri_locate_all_fixed() to find the start and end position of each GTG occurrence. Then we create a column condition to test whether the start position is in 1, 4, 7, 10, 13, 16, 19, 22, ...
library(stringi)
library(dplyr)
data.frame(stri_locate_all_fixed(gene1, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(gene1), 3))
Which gives:
# start end condition
#1 4 6 TRUE
If you want to generalize this to a list of genes, you could do:
lst <- list(gene1, gene2, gene3)
res <- lapply(lst, function(x) {
data.frame(stri_locate_all_fixed(x, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(x), 3))
})
Which would give:
#[[1]]
# start end condition
#1 4 6 TRUE
#
#[[2]]
# start end condition
#1 NA NA FALSE
#
#[[3]]
# start end condition
#1 3 5 FALSE
#2 9 11 FALSE
#3 21 23 FALSE
#4 70 72 TRUE
#5 75 77 FALSE
Following @Sobrique's comment, if "divided by length" means the number of occurrences meeting the condition divided by the total number of characters in each gene, you could do:
lapply(1:length(res), function(x) sum(res[[x]][["condition"]]) / nchar(lst[[x]]))
Which would give:
#[[1]]
#[1] 0.004830918
#
#[[2]]
#[1] 0
#
#[[3]]
#[1] 0.003021148
Here's a Perl solution that does as you ask.
But I don't understand how your example output is derived: the first and last sequences have only one occurrence of GTG in the positions you require, and the second sequence has none at all. That means the outputs are 1 / 207, 0 / 74, and 1 / 331 respectively. None of those are anything like the 3 and 6.3 that you say you're expecting.
This program expects the path to the input file as a parameter on the command line.
use strict;
use warnings 'all';
print "Gene Frequency\n";
my $name;
local $/ = '>';
while ( <> ) {
chomp;
next unless /\S/;
my ($name, $seq) = split /\n/, $_, 2;
$seq =~ tr/A-Z//cd;
my $n = 0;
while ( $seq =~ /(?=GTG)/g ) {
++$n if $-[0] % 3 == 0;
}
printf "%-7s%.6f\n", $name, $n / length($seq);
}
output
Gene Frequency
Gene1 0.004831
Gene2 0.000000
Gene3 0.003021
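The same counting rule (only matches that start at positions 1, 4, 7, ... count, divided by sequence length) can be sketched in Python for comparison. The function names and the tiny FASTA string below are made up for illustration; the sketch assumes each record starts with a ">" header line and that sequence lines should be joined before scanning:

```python
def parse_fasta(text):
    """Return {name: sequence}, joining the sequence lines of each record."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]   # first word of the header
            records[name] = ""
        elif name is not None and line:
            records[name] += line
    return records

def in_frame_frequency(seq, codon="GTG"):
    """Count `codon` at 0-based offsets 0, 3, 6, ... and divide by len(seq)."""
    if not seq:
        return 0.0
    hits = sum(
        1
        for i in range(0, len(seq) - len(codon) + 1, 3)
        if seq.startswith(codon, i)
    )
    return hits / len(seq)

example = ">Gene1 toy record\nATGGTG\nAAAGTG\n>Gene2\nAGTGAA\n"
for gene, seq in parse_fasta(example).items():
    print(gene, in_frame_frequency(seq))
```

Stepping the index by three means out-of-frame matches (such as the GTG inside AGTGAA) are never examined at all.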
Here is an alternate solution that does not use a pattern match. Not that it will matter much.
use strict;
use warnings;
my $gene;
while ( my $line = <> ) {
if ( $line =~ /^>(.+)/ ) {
$gene = $1;
next;
}
chomp $line;
printf "%s: %s\n",
$gene,
( grep { $_ eq 'GTG' } split /(...)/, $line ) / length $line;
}
Output:
Gene1: 0.00483091787439614
Gene2: 0
Gene3: 0.00302114803625378
It is essentially similar to Sobrique's answer, but assumes that the gene lines contain the right characters. It splits up the gene string into a list of three-character pieces and takes the ones that are literally GTG.
The splitting works by abusing the fact that split uses a pattern as the delimiter, and that it will also capture the delimiter if a capture group is used. Here's an example.
my @foo = split /(...)/, '1234567890';
p @foo; # from Data::Printer
__END__
[
[0] "",
[1] 123,
[2] "",
[3] 456,
[4] "",
[5] 789,
[6] 0
]
The empty elements get filtered out by grep. It might not be the most efficient way, but it gets the job done.
You can run it by calling perl foo.pl horribly-large-gene-sequence.file.
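The capture-group behaviour of split is not specific to Perl. As a cross-check, Python's re.split also returns the captured delimiters, so the same trick can be sketched there (the function name count_in_frame is hypothetical, and the sequence is assumed to be a single line, since . does not match a newline):

```python
import re

# The delimiter is "any three characters"; because it is captured, the
# delimiters themselves are returned, separated by empty strings.
pieces = re.split(r"(...)", "1234567890")
print(pieces)  # ['', '123', '', '456', '', '789', '0']

def count_in_frame(seq, codon="GTG"):
    """Count three-character chunks of `seq` that equal `codon`."""
    return sum(1 for piece in re.split(r"(...)", seq) if piece == codon)

print(count_in_frame("ATGGTGAAAGTG"))  # 2
```

The empty strings and any short trailing remainder simply fail the equality test, which is the role grep plays in the Perl version.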
Well, you have an R solution. I've hacked something together in perl because you tagged it:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'GTG';
local $/ = "\n>";
while ( <> ) {
my ($gene) = m/(Gene\d+)/;
my @hits = grep { /^$target$/ } m/ ( [GTCA]{3} ) /xg;
print "$gene: ".( scalar @hits), "\n";
}
This doesn't give the same results as your input though:
Gene1: 1
Gene2: 0
Gene3: 1
I'm decomposing your string into 3 element lists, and looking for ones that specifically match. (And I haven't divided by length, as I'm not entirely clear if that's the actual string length in letters, or some other metric).
Including length matching - we need to capture both name and string:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n>";
while (<>) {
my ($gene, $gene_str) = m/(Gene\d+)\n([GTCA]+)/m;
my @hits = grep { /^GTG$/ } $gene_str =~ m/ ( [GTCA]{3} ) /xg;
print "$gene: " . @hits . "/". length ( $gene_str ), " = ", @hits / length($gene_str), "\n";
}
We use <> which is the 'magic' filehandle, and tells perl to read from either STDIN or a file specified on command line. Much like sed or grep does.
With your input:
Gene1: 1/207 = 0.00483091787439614
Gene2: 0/74 = 0
Gene3: 1/331 = 0.00302114803625378
Here is a function I created based on your requirement. I am pretty sure there are alternate ways better than this but this solves the problem.
require(stringi)
input_gene_list<- list(gene1= "GTGGGGGTTTGTGGGGGTG", gene2= "GTGGGGGTTTGTGGGGGTG", gene3= "GTGGGGGTTTGTGGGGGTG")
gene_counter<- function(gene){
x<- gene
y<- gsub(pattern = "GTG",replacement = "GTG ", x = x, perl=TRUE)
if(stri_count_fixed(y, pattern = "GTG")) {
gene_count<- unlist(gregexpr(pattern = " ", y))
counter<- 0
for(i in 1:length(gene_count)){
if((gene_count[i] %% 3) == 1) counter=counter+1
}
return(counter/nchar(x))
}
}
output_list<- lapply(input_gene_list, function(x) gene_counter(x))
result<- t(as.data.frame(output_list))
result
[,1]
gene1 0.1052632
gene2 0.1052632
gene3 0.1052632
Also share your thoughts on it! Thanks!

Julia way of searching tokens in integer arrays

Let's say I have buffer=Int[1,2,3,2,3] and token=[2,3].
Is there a preferred way of searching for the occurrences of token in buffer, to find [2,4] as the answer?
Or, perhaps, is there any split equivalent function for the integer arrays in julia?
(I know how I can perform this operation using 2 nested loops. However, I am especially interested if there is a more Julian way of doing this.)
Because Julia (at the time of writing) doesn't have conditionals in list comprehensions, I would personally use filter(). Thus if arr = Int64[1,2,3,4,5,2,3,6,2,3,3,2,2]:
filter(x -> arr[x] == 2 && arr[x + 1] == 3, 1 : length(arr) - 1)
=> [2,6,9]
To make it a little more reusable:
pat = [2,3]
filter(x -> arr[x : x + length(pat) - 1] == pat, 1 : length(arr) - length(pat) + 1)
=> [2,6,9]
Julia does have built-ins like find([fun], A), but there's no way that I'm aware of to use them to return indexes of an ordered sublist.
Of course it's arguably more legible to just
ndxs = Int64[]
for i = 1:length(arr)-1
if arr[i] == 2 && arr[i+1] == 3
push!(ndxs, i)
end
end
=> [2,6,9]
For practice I also experimented a bit, and the following patterns worked in Julia 0.4.0. With A = Int[1,2,3,2,3] and pat = Int[2,3], the first one is
x = Int[ A[i:i+1] == pat ? i : 0 for i=1:length(A)-1 ]
x[ x .> 0 ] # => [2,4]
the second one is
x = Int[]
[ A[i:i+1] == pat ? push!(x,i) : 0 for i=1:length(A)-1 ]
@show x # => [2,4]
and the third one is
find( [ A[i:i+1] == pat for i=1:length(A)-1 ] ) # => [2,4]
(where find() returns the index array of true elements). But personally, I feel these patterns are more Python-like than Julian...
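Since the last answer compares these patterns to Python, here is what the sliding-window search looks like there, as a minimal sketch. The function name find_token is hypothetical, and it returns 1-based positions purely so the results line up with the Julia output above:

```python
def find_token(buffer, token):
    """Return 1-based start positions where `token` occurs in `buffer`."""
    n = len(token)
    if n == 0:
        return []  # an empty token would otherwise match everywhere
    return [
        i + 1                                  # convert to 1-based, as in Julia
        for i in range(len(buffer) - n + 1)    # every window of length n
        if buffer[i:i + n] == token
    ]

print(find_token([1, 2, 3, 2, 3], [2, 3]))                           # [2, 4]
print(find_token([1, 2, 3, 4, 5, 2, 3, 6, 2, 3, 3, 2, 2], [2, 3]))   # [2, 6, 9]
```

Like the filter() version, this compares a fresh slice at every position, which is fine for short tokens but does quadratic work in the worst case.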
