How to calculate the tree that results from combining individual leaf paths? - r

Let's say I have an input file where each line contains the path from the root (A) to a leaf
echo "A\tB\tC\nA\tB\tD\nA\tE" > lines.txt
A B C
A B D
A E
How can I easily generate the resulting tree?: (A(B(C,D),E))
I'd like to use GNU tools (awk, sed, etc.) because they tend to work better with large files, but an R script would also work. The R input would be:
# lines <- lapply(readLines("lines.txt"), strsplit, " +")
lines <- list(list(c("A", "B", "C")), list(c("A", "B", "D")), list(c("A","E")))

In Perl:
#!/usr/bin/env perl
use strict;
my $t = {};
while (<>) {
    my @a = split;                 # path segments of the current line
    my $t1 = $t;
    while (my $a = shift @a) {     # walk down, creating nodes as needed
        $t1->{$a} = {} if not exists $t1->{$a};
        $t1 = $t1->{$a};
    }
}
print &p($t)."\n";
sub p {
    my ($t) = @_;
    return
        unless keys %$t;           # a leaf contributes nothing
    return '('
        . join(',', map { $_ . p($t->{$_}) } sort keys %$t)
        . ')';
}
This script returns:
% cat <<EOF | perl l.pl
A B C
A B D
A E
EOF
(A(B(C,D),E))
Note that because of the recursion in p, this script is not at all suited for large datasets. That can easily be resolved by replacing the recursion with a double for loop, like the first while above.
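For comparison, the same idea (one nested hash per node, then a recursive join) can be sketched in Python. This is only an illustrative sketch, not part of the answer above: build_tree and render are hypothetical names, and it assumes that path segments are separated by whitespace.

```python
def build_tree(paths):
    """Merge root-to-leaf paths into a nested dict (one dict per node)."""
    tree = {}
    for path in paths:
        node = tree
        for label in path:
            # Reuse the child if it exists, otherwise create it.
            node = node.setdefault(label, {})
    return tree


def render(tree):
    """Render the nested dict in the (A(B(C,D),E)) notation used above."""
    if not tree:
        return ""  # a leaf contributes nothing after its own label
    parts = (label + render(tree[label]) for label in sorted(tree))
    return "(" + ",".join(parts) + ")"


paths = [line.split() for line in ["A B C", "A B D", "A E"]]
print(render(build_tree(paths)))  # (A(B(C,D),E))
```

The recursion depth here equals the depth of the tree, not the number of input lines.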

Why do it the easy way, if you can use Bourne Shell script instead? Note, this is not even Bash, this is plain old Bourne shell, without arrays...
#!/bin/sh
#
# A B C
# A B D
# A E
#
# "" vs "A B C" -> 0->3, ident 0 -> -0+3 -> "(A(B(C"
# "A B C" vs "A B D" -> 3->3, ident 2 -> -1+1 -> ",D"
# "A B D" vs "A E" -> 3->2, ident 1 -> -2+1 -> "),E"
# "A E" vs. endc -> 2->0, ident 0 -> -2+0 -> "))"
#
# Result: (A(B(C,D),E))
#
# Input stream is a path per line, path segments separated with spaces.
process_line () {
local line2="$@"
n2=$#
set -- $line1
n1=$#
s=
if [ $n2 = 0 ]; then # last line (empty)
for s1 in $line1; do
s="$s)"
done
else
sep=
remainder=false
for s2 in $line2; do
if ! $remainder; then
if [ "$1" != $s2 ]; then
remainder=true
if [ $# = 0 ]; then # only children
sep='('
else # sibling to an existing element
sep=,
shift
for s1 in "$@"; do
s="$s)"
done
fi
fi
fi
if $remainder; then # Process remainder as mismatch
s="$s$sep$s2"
sep='('
fi
shift # remove the first element of line1
done
fi
result="$result$s"
}
result=
line1=
(
cat - \
| sed -e 's/[[:space:]]\+/ /g' \
| sed -e '/^$/d' \
| sort -u
echo '' # last line marker
) | while read line2; do
process_line $line2
line1="$line2"
test -n "$line2" \
|| echo $result
done
This produces the correct answer for two different files (l.sh is the shell version, l.pl the version in Perl):
% for i in l l1; do cat $i; ./l.sh < $i; ./l.pl < $i; echo; done
A
A B
A B C D
A B E F
A G H
A G H I
(A(B(C(D),E(F)),G(H(I))))
(A(B(C(D),E(F)),G(H(I))))
A B C
A B D
A E
(A(B(C,D),E))
(A(B(C,D),E))
Hoohah!
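The trick in the shell script (sort the paths, then compare each one with its predecessor and emit only the difference) can also be sketched in Python. This is a sketch under the assumptions that the paths are sorted and de-duplicated, as sort -u does above, and that no label contains whitespace; stream_tree is a hypothetical name, not something from the script.

```python
def stream_tree(sorted_paths):
    """Build the bracket notation by diffing each path against the previous one.

    Assumes the paths are sorted and unique, so a prefix always
    precedes the longer paths that extend it.
    """
    out = []
    prev = []
    for path in sorted_paths:
        # Length of the common prefix of the previous and current path.
        common = 0
        while (common < len(prev) and common < len(path)
               and prev[common] == path[common]):
            common += 1
        new = path[common:]
        if common == len(prev):
            # The current path only extends the previous one: open children.
            out.append("".join("(" + s for s in new))
        else:
            # Close the levels below the branching point, then add a sibling.
            out.append(")" * (len(prev) - common - 1))
            out.append("," + new[0] + "".join("(" + s for s in new[1:]))
        prev = path
    out.append(")" * len(prev))  # end of input: close everything still open
    return "".join(out)


lines = ["A B C", "A B D", "A E"]
print(stream_tree(sorted(line.split() for line in lines)))  # (A(B(C,D),E))
```

Only the previous path is kept in memory, which is what makes this variant attractive for large, pre-sorted inputs.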

Okay, so I think I got it:
# input
lines <- c(list(c("A", "B", "C")), list(c("A", "B", "D")), list(c("A","E")))
# generate children
generate_children <- function(lines){
  children <- list()
  for (line in lines) {
    # seq_len() gives an empty loop for a single-node path, unlike 1:0
    for (index in seq_len(length(line) - 1)){
      parent <- line[index]
      next_child <- line[index + 1]
      if (is.null(children[[parent]])){
        children[[parent]] <- next_child
      } else {
        # base R has no %notin% operator, so negate %in% instead
        if (!(next_child %in% children[[parent]])){
          children[[parent]] <- c(children[[parent]], next_child)
        }
      }
    }
  }
  children
}
expand_children <- function(current_parent, children){
if (current_parent %in% names(children)){
expanded_children <- sapply(children[[current_parent]], function(current_child){
expand_children(current_child, children)
}, USE.NAMES = FALSE)
output <- setNames(list(expanded_children), current_parent)
} else {
output <- current_parent
}
output
}
children <- generate_children(lines)
root <- names(children)[1]
tree <- expand_children(root, children)
dput(tree)
# structure(list(A = structure(list(B = c("C", "D"), "E"), .Names = c("B",""))), .Names = "A")
Is there a simpler answer?

Related

if statement to select a channel in input block with nextflow

I am currently writing my first Nextflow pipeline and I need to run different processes depending on a parameter.
Specifically, I would like a single process to select which channel its input comes from.
I tried it like this:
process foo{
input:
if(params.bar && params.bar2)
{
file reads from channel1.flatten()
}
else
{
file reads from channel_2.flatten()
}
output:
publishDir "$params.output_dir"
file "output_file" into channel_3
"""
my command line
"""
I get this error and I don't understand why:
No such variable: reads
Is there a way to do something like this?
Thanks!
It's a bit of a weird error, but basically you just need to make sure your input declaration follows/matches the required syntax:
input:
<input qualifier> <input name> [from <source channel>] [attributes]
One solution might be to use the ternary operator to replace your if/else branch, for example:
ch1 = Channel.of( 'hello', 'world' )
ch2 = Channel.of( 1, 3, 5, 7, 9 )
params.foo = false
params.bar = false
process test {
echo true
input:
val myval from ( params.foo && params.bar ? ch1 : ch2 )
"""
echo -n "${myval}"
"""
}
Results:
$ nextflow run script.nf
N E X T F L O W ~ version 21.04.3
Launching `script.nf` [shrivelled_stone] - revision: 7b3f3a51df
executor > local (5)
[3b/fafa5e] process > test (2) [100%] 5 of 5 ✔
1
5
9
7
3
$ nextflow run script.nf --foo --bar
N E X T F L O W ~ version 21.04.3
Launching `script.nf` [irreverent_mahavira] - revision: 7b3f3a51df
executor > local (2)
[d2/09d418] process > test (1) [100%] 2 of 2 ✔
world
hello
Note that the new DSL 2 decouples the channel inputs from the process declaration, which might help to keep things readable, especially if the condition or action statements are more complex. For example:
nextflow.enable.dsl=2
params.foo = false
params.bar = false
process test {
echo true
input:
val myval
"""
echo -n "${myval}"
"""
}
workflow {
ch1 = Channel.of( 'hello', 'world' )
ch2 = Channel.of( 1, 3, 5, 7, 9 )
if( params.foo && params.bar ) {
test( ch1 )
} else {
test( ch2 )
}
}
Results:
$ nextflow run script.nf
N E X T F L O W ~ version 21.04.3
Launching `script.nf` [nauseous_pare] - revision: e1c4770ff1
executor > local (5)
[36/49d8da] process > test (4) [100%] 5 of 5 ✔
9
1
3
5
7
$ nextflow run script.nf --foo --bar
N E X T F L O W ~ version 21.04.3
Launching `script.nf` [goofy_euler] - revision: e1c4770ff1
executor > local (2)
[56/e635e8] process > test (2) [100%] 2 of 2 ✔
world
hello

Unix : Split line with delimiter

I have a file like this:
a b c,d
e f g
x y r,s,t
and I would like to split the third column into separate rows using "," as the delimiter. The other columns should be copied.
Expected result :
a b c
a b d
e f g
x y r
x y s
x y t
Thank you
Using awk. Expects field separators to be space or tab:
$ awk '{
split($3,a,",") # split the third field on commas, hash to a
for(i in a) { # for all entries in a
sub(/[^ \t]+$/,a[i],$0) # replace last field with entries in a...
print # ... preserving separators, space or tab
}
}' file
a b c
a b d
e f g
x y r
x y s
x y t
Because it uses sub(), it will produce wrong results if $3 contains an &. Also, as mentioned in the comments, for(i in a) may output the records in seemingly random order. If that is a problem, use:
$ awk '{
n=split($3,a,",") # store element count to n
for(i=1;i<=n;i++) { # iterate in order
sub(/[^ \t]+$/,a[i],$0)
print
}
}' file
For tab separated files:
$ awk '
BEGIN { OFS="\t" } # output field separator
{
n=split($3,a,",")
for(i=1;i<=n;i++) {
$3=a[i] # & friendly version
print
}
}' file
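If awk is not a requirement, the same expansion can be sketched in Python. This is a minimal sketch assuming whitespace-separated input with the comma list in the third column; expand_third_field is a hypothetical name, and unlike the first awk version it normalizes the separators to single spaces.

```python
def expand_third_field(lines, sep=","):
    """Yield one output row per value in the comma-separated third column."""
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # assumption: rows without a third column are skipped
        for value in fields[2].split(sep):
            # Copy the first two columns and substitute one value at a time.
            yield " ".join(fields[:2] + [value])


rows = ["a b c,d", "e f g", "x y r,s,t"]
for row in expand_third_field(rows):
    print(row)
```

Because the values are substituted as plain strings, an & in the third column needs no special treatment here.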

Perl's Hash of Hashes equivalent implementation for dict of dicts in Tcl

I have a very large file that contains data like below:
*1 RES L1 N1 0.32
*22 RES L2 N2 0.64
*100 CAP A1 B1 0.3
*200 CAP A2 B1 0.11
*11 IND K1 K2 0.002
*44 IND X1 Y1 0.00134
... and so on
For such files (let us assume the above data is in a file called "example.txt"), I can easily create a Hash of Hashes in Perl and pass these nested hashes to other parts of my Perl program:
#!/usr/bin/perl
use strict;
use warnings;
my %hoh;   # the nested hash must be declared under "use strict"
open(FILE,"<", "example.txt") or die "Cannot open file:$!";
if (-f "example.txt") {
while(<FILE>) {
chomp;
if(/^\s*(\S+)\s+(RES|CAP|IND)\s+(\S+)\s+(\S+)\s+(\S+)\s*$/) {
$hoh{$1}{$2}{$3}{$4} = $5;
}
}
close FILE;
}
What is a similar way to create a Tcl Hash of Hashes (or rather Dictionary of Dictionaries)?
I tried a small piece of code setting the dict like below (not printing the full code here, to keep focus on the problem):
...
set dod [dict create]
if [regexp {^\s*(\S+)\s+(RES|CAP|IND)\s+(\S+)\s+(\S+)\s+(\S+)\s*$} $line all id type x y elemValue] {
dict set dod $id $type $x $y $elemValue
}
But that does not seem to work. I tested it like below:
foreach id [dict keys $dod] {
if [dict exists $dod "RES"] {
puts "RES KEY EXISTS"
} else {
puts "RES KEY NOT FOUND"
}
}
Thanks.
Your immediate problem is a stray slash in the beginning of the regular expression.
To answer the question: a multi-key dictionary is a "hash of hashes". Every key adds a new level of dictionaries.
dict set foo aa bb cc 1
sets the member {cc 1} in a dictionary which is the value of the member {bb ...} in the dictionary which is the value of the member {aa ...} in foo.
If you don't want a multi-level dictionary and still need to use several key values, you need to do:
dict set foo [list aa bb cc] 1
Also, I don't know how much is simplified away in your example, but the code to add an item could be better stated as:
if {[lindex $line 1] in {RES CAP IND}} {
dict set dod {*}$line
}
But if you want to check existence by e.g. "RES", you need to set it as the top-level key, which you don't in your example (the items in the first column become top-level keys). Initializing as above, the value of dod is
*1 {RES {L1 {N1 0.32}}} *22 {RES {L2 {N2 0.64}}} *100 {CAP {A1 {B1 0.3}}} *200 {CAP {A2 {B1 0.11}}} *11 {IND {K1 {K2 0.002}}} *44 {IND {X1 {Y1 0.00134}}}
so you do get a dictionary, but dict exists $dod RES is still necessarily false. By using
if {[lindex $line 1] in {RES CAP IND}} {
dict set dod {*}[lrange $line 1 end]
}
(i.e. all the items in the line after the first as keys, except the last which becomes the value) you get the dictionary
RES {L1 {N1 0.32} L2 {N2 0.64}} CAP {A1 {B1 0.3} A2 {B1 0.11}} IND {K1 {K2 0.002} X1 {Y1 0.00134}}
in which you can test for the existence of "RES".
Going back to the dict-of-dicts
*1 {RES {L1 {N1 0.32}}} *22 {RES {L2 {N2 0.64}}} *100 {CAP {A1 {B1 0.3}}} *200 {CAP {A2 {B1 0.11}}} *11 {IND {K1 {K2 0.002}}} *44 {IND {X1 {Y1 0.00134}}}
you can check for "RES" by examining each of the sub-dictionaries until you find one that has that key:
set found 0
dict for {key subdict} $dod {
if {[dict exists $subdict RES]} {
set found 1
break
}
}
Documentation:
dict
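To make the difference between the two key layouts concrete, here is the same data handled with nested dictionaries in Python. It is only a sketch for comparison: nested_set is a hypothetical helper that imitates what dict set does with several keys, and the sample lines are taken from the question.

```python
def nested_set(d, keys, value):
    """Imitate `dict set d k1 k2 ... value`: create intermediate dicts as needed."""
    for key in keys[:-1]:
        d = d.setdefault(key, {})
    d[keys[-1]] = value


lines = [
    "*1 RES L1 N1 0.32",
    "*22 RES L2 N2 0.64",
    "*100 CAP A1 B1 0.3",
    "*11 IND K1 K2 0.002",
]

by_id = {}    # first column is the top-level key
by_type = {}  # RES/CAP/IND is the top-level key
for line in lines:
    fields = line.split()
    if len(fields) == 5 and fields[1] in {"RES", "CAP", "IND"}:
        nested_set(by_id, fields[:-1], fields[-1])
        nested_set(by_type, fields[1:-1], fields[-1])

print("RES" in by_id)                                # False: top-level keys are *1, *22, ...
print("RES" in by_type)                              # True
print(any("RES" in sub for sub in by_id.values()))   # True: found one level down
```

The last line corresponds to the dict for loop above: with the first column as the top-level key, the type has to be looked up inside each sub-dictionary.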
Not exactly the same, but somewhat similar:
set data "*1 RES L1 N1 0.32
*22 RES L2 N2 0.64
*100 CAP A1 B1 0.3
*200 CAP A2 B1 0.11
*11 IND K1 K2 0.002
*44 IND X1 Y1 0.00134
"
set pattern {\s*(\S+)\s+(RES|CAP|IND)\s+(\S+)\s+(\S+)\s+(\S+)?\s*$}
set result [regexp -all -line -inline -- $pattern $data]
if {[llength $result] == 0} {
puts "Not found"
exit 1
}
array set my_data {}
foreach {all ind_0 ind_1 ind_2 ind_3 ind_4} $result {
set my_data($ind_0)($ind_1)($ind_2)($ind_3) $ind_4
}
parray my_data
Sample output:
my_data(*1)(RES)(L1)(N1) = 0.32
my_data(*100)(CAP)(A1)(B1) = 0.3
my_data(*11)(IND)(K1)(K2) = 0.002
my_data(*200)(CAP)(A2)(B1) = 0.11
my_data(*22)(RES)(L2)(N2) = 0.64
my_data(*44)(IND)(X1)(Y1) = 0.00134

find frequency of substring in a set of strings

I have as input a gene list where each gene has a header like >SomeText.
For each gene I would like to find the frequency of the string GTG (number of occurrences divided by the length of the gene). The string should only be counted if it starts at position 1, 4, 7, 10, etc. (every third position).
>ENST00000619537.4 cds:known chromosome:GRCh38:21:6560714:6564489:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGATGTGACCATCCAGCACCCCTGGTTCAAGCGCACCCTGGGGCCCTTCTACCCCAGC
CGGCTGTTCGACCAGTTTTTCGGCGAGGGCCTTTTTGAGTATGACCTGCTGCCCTTCCTG
TCGTCCACCATCAGCCCCTACTACCGCCAGTCCCTCTTCCGCACCGTGCTGGACTCCGGC
ATCTCTGAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTC
TCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCAC
AACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTG
CCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACC
TTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCC
GTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624019.3 cds:known chromosome:GRCh38:21:6561284:6563978:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGACGCCCCCCCCCCCCACCCAACCACAGGCCTCCTCTCTGAGCCACGGGTTCGATCC
GACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTCTCCCCGGAGGACCTCACC
GTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCACAACGAGCGCCAGGACGAC
CACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTGCCGTCCAACGTGGACCAG
TCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACCTTCTGTGGCCCCAAGATC
CAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCCGTGTCGCGGGAGGAGAAG
CCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624932.1 cds:known chromosome:GRCh38:21:6561954:6564203:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGCCTGTCTGTCCAGGAGACAGTCACAGGCCCCCGAAAGCTCTGCCCCACTTGGTGTGT
GGGAGAAGAGGCCGGCAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTG
AAGCACTTCTCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCAC
GGAAAGCACAACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGC
TACCGCCTGCCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGC
ATGCTGACCTTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGA
GCCATCCCCGTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
Output:
Gene Frequency
Gene1: 3
Gene2 6.3
....
I was thinking of something like this, but I don't know how to define the position requirement:
freq <- sapply(gregexpr("GTG",x),function(x)if(x[[1]]!=-1) length(x) else 0)
Here is an idea in R using stringi.
We use stri_locate_all_fixed() to find the start and end position of each GTG occurrence. Then we create a column condition to test whether the start position is one of 1, 4, 7, 10, 13, 16, 19, 22, ...
library(stringi)
library(dplyr)
data.frame(stri_locate_all_fixed(gene1, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(gene1), 3))
Which gives:
# start end condition
#1 4 6 TRUE
If you want to generalize this to a list of genes, you could do:
lst <- list(gene1, gene2, gene3)
res <- lapply(lst, function(x) {
data.frame(stri_locate_all_fixed(x, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(x), 3))
})
Which would give:
#[[1]]
# start end condition
#1 4 6 TRUE
#
#[[2]]
# start end condition
#1 NA NA FALSE
#
#[[3]]
# start end condition
#1 3 5 FALSE
#2 9 11 FALSE
#3 21 23 FALSE
#4 70 72 TRUE
#5 75 77 FALSE
Following @Sobrique's comment, if "divided by length" means the number of occurrences meeting the condition divided by the total number of characters in each gene, you could do:
lapply(1:length(res), function(x) sum(res[[x]][["condition"]]) / nchar(lst[[x]]))
Which would give:
#[[1]]
#[1] 0.004830918
#
#[[2]]
#[1] 0
#
#[[3]]
#[1] 0.003021148
Here's a Perl solution that does as you ask.
But I don't understand how your example output is derived: the first and last sequences have only one occurrence of GTG in the positions you require, and the second sequence has none at all. That means the outputs are 1 / 207, 0 / 74, and 1 / 331 respectively. None of those are anything like the 3 and 6.3 that you say you're expecting.
This program expects the path to the input file as a parameter on the command line.
use strict;
use warnings 'all';
print "Gene Frequency\n";
my $name;
local $/ = '>';
while ( <> ) {
chomp;
next unless /\S/;
my ($name, $seq) = split /\n/, $_, 2;
$seq =~ tr/A-Z//cd;
my $n = 0;
while ( $seq =~ /(?=GTG)/g ) {
++$n if $-[0] % 3 == 0;
}
printf "%-7s%.6f\n", $name, $n / length($seq);
}
output
Gene Frequency
Gene1 0.004831
Gene2 0.000000
Gene3 0.003021
Here is an alternate solution that does not use a pattern match. Not that it will matter much.
use strict;
use warnings;
my $gene;
while ( my $line = <> ) {
if ( $line =~ /^>(.+)/ ) {
$gene = $1;
next;
}
chomp $line;
printf "%s: %s\n",
$gene,
( grep { $_ eq 'GTG' } split /(...)/, $line ) / length $line;
}
Output:
Gene1: 0.00483091787439614
Gene2: 0
Gene3: 0.00302114803625378
It is essentially similar to Sobrique's answer, but assumes that the gene lines contain the right characters. It splits up the gene string into a list of three-character pieces and takes the ones that are literally GTG.
The splitting works by abusing the fact that split uses a pattern as the delimiter, and that it will also capture the delimiter if a capture group is used. Here's an example.
my @foo = split /(...)/, '1234567890';
p @foo; # from Data::Printer
__END__
[
[0] "",
[1] 123,
[2] "",
[3] 456,
[4] "",
[5] 789,
[6] 0
]
The empty elements get filtered out by grep. It might not be the most efficient way, but it gets the job done.
You can run it by calling perl foo.pl horribly-large-gene-sequence.file.
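The "cut the sequence into three-character pieces and count the ones equal to GTG" idea can also be sketched in Python. This is a minimal sketch, not a port of any answer here: codon_frequency is a hypothetical name, the first word of each header is assumed to be the gene id, and wrapped sequence lines are joined before counting so that line breaks cannot shift the reading frame.

```python
def codon_frequency(fasta_lines, codon="GTG"):
    """Return {gene id: in-frame codon count / sequence length} for FASTA input."""
    seqs = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]  # assumption: first word of the header is the id
            seqs[name] = []
        elif line and name is not None:
            seqs[name].append(line.upper())

    result = {}
    for name, parts in seqs.items():
        seq = "".join(parts)
        # Only look at positions 1, 4, 7, ... (0, 3, 6, ... when counting from zero).
        hits = sum(1 for i in range(0, len(seq) - 2, 3) if seq[i:i + 3] == codon)
        result[name] = hits / len(seq) if seq else 0.0
    return result


demo = [">g1 some description", "ATGGTG", "AAAGTGA", ">g2", "AGTGAA"]
for gene, freq in codon_frequency(demo).items():
    print(gene, freq)
```

In the demo, g1 has two in-frame GTG codons in 13 bases, while the GTG in g2 starts at position 2 and is therefore not counted.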
Well, you have an R solution. I've hacked something together in perl because you tagged it:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'GTG';
local $/ = "\n>";
while ( <> ) {
my ($gene) = m/(Gene\d+)/;
my @hits = grep { /^$target$/ } m/ ( [GTCA]{3} ) /xg;
print "$gene: ".( scalar @hits), "\n";
}
This doesn't give the same results as your example output, though:
Gene1: 1
Gene2: 0
Gene3: 1
I'm decomposing your string into 3 element lists, and looking for ones that specifically match. (And I haven't divided by length, as I'm not entirely clear if that's the actual string length in letters, or some other metric).
Including length matching - we need to capture both name and string:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n>";
while (<>) {
my ($gene, $gene_str) = m/(Gene\d+)\n([GTCA]+)/m;
my @hits = grep { /^GTG$/ } $gene_str =~ m/ ( [GTCA]{3} ) /xg;
print "$gene: " . @hits . "/". length ( $gene_str ), " = ", @hits / length($gene_str), "\n";
}
We use <> which is the 'magic' filehandle, and tells perl to read from either STDIN or a file specified on command line. Much like sed or grep does.
With your input:
Gene1: 1/207 = 0.00483091787439614
Gene2: 0/74 = 0
Gene3: 1/331 = 0.00302114803625378
Here is a function I created based on your requirement. I am pretty sure there are alternate ways better than this but this solves the problem.
require(stringr)  # str_count() comes from stringr, not stringi
input_gene_list<- list(gene1= "GTGGGGGTTTGTGGGGGTG", gene2= "GTGGGGGTTTGTGGGGGTG", gene3= "GTGGGGGTTTGTGGGGGTG")
gene_counter<- function(gene){
x<- gene
y<- gsub(pattern = "GTG",replacement = "GTG ", x = x, perl=TRUE)
if(str_count(y,pattern = "GTG")) {
gene_count<- unlist(gregexpr(pattern = " ", y))
counter<- 0
for(i in 1:length(gene_count)){
if((gene_count[i] %% 3) == 1) counter=counter+1
}
return(counter/nchar(x))
}
}
output_list<- lapply(input_gene_list, function(x) gene_counter(x))
result<- t(as.data.frame(output_list))
result
[,1]
gene1 0.1052632
gene2 0.1052632
gene3 0.1052632
Also share your thoughts on it! Thanks!

Print an extra space if the value is positive

I'm trying to get a visually clear output of my program:
a = -1234
b = 1234
@printf "a = %s%1.2e\n" "" a
@printf "b = %s%1.2e\n" " " b
which gives:
a = -1.23e+03
b =  1.23e+03
(The point is to add an extra space for positive numbers.)
Now I want to automate it. I tried to write a function:
function negspace(x::Number)
if x < 0
return "", x
else
return " ", x
end
end
and print with
a = -1234
b = 1234
@printf "a = %s%1.2e" negspace( a )
@printf "b = %s%1.2e" negspace( b )
Even simpler is to use the printf format flag to do this directly by putting a space after the %:
julia> @sprintf("a = % 1.2e", -1234)
"a = -1.23e+03"
julia> @sprintf("b = % 1.2e", 1234)
"b =  1.23e+03"
Found a way:
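#!/usr/bin/env julia
using Printf  # @sprintf lives in the Printf standard library on Julia 0.7 and later
function signspace(x::Number)
    if x >= 0  # pad zero as well, since it has no minus sign either
        return @sprintf(" %1.2e", x)
    else
        return @sprintf("%1.2e", x)
    end
end
a = -1234
b = 1234
println("a = ", signspace(a))
println("b = ", signspace(b))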
but I'm not sure it is optimal.
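For what it's worth, the space flag is not specific to Julia's printf macros. Python's format specification has the same sign option, shown here only as a cross-language comparison; padded is a hypothetical helper name.

```python
def padded(x):
    """Format x in scientific notation, with a space where a minus sign would go."""
    # The " " sign option reserves one column for the sign of non-negative numbers.
    return format(x, " .2e")


for name, value in (("a", -1234), ("b", 1234)):
    print(f"{name} = {padded(value)}")
# a = -1.23e+03
# b =  1.23e+03
```

As with the printf flag, the mantissas line up without any explicit branching on the sign.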
