I want to pass the following data from Perl to R, rescale them to [0, 1] in R with the rescaler function, and then send the result back to Perl.
$m1 = 4;
$m2 = 5.3;
$m3 = 2;
$m4 = 1;
$m5 = 1.3;
$m6 = 2;
I did:
my $R = Statistics::R->new() ;
$R->startR ;
$R->set('data', $m1 . ',' . $m2 . ',' . $m3 . ',' . $m4 . ',' . $m5 . ',' . $m6);
$R -> run(q`
library(reshape);
scaled_data <- rescaler(data, type="range");
`);
my $scaled_data = $R -> get('scaled_data');
print $scaled_data,"\n",$data,"\n";
$R->stopR();
but I get the following error.
Problem while running this R command:
library(reshape);
scaled_data <- rescaler(data);
Error:
x - mean(x, na.rm = TRUE) :
non-numeric argument to binary operator
Calls: rescaler -> rescaler.default
In addition: Warning message:
In mean.default(x, na.rm = TRUE) :
argument is not numeric or logical: returning NA
Execution halted
1) How can I pass the data correctly?
2) I suspect this approach will make the code slow. Do I really need to send the data to R for rescaling?
@Len Jaffe and @MrFlick
I tried :
my $R = Statistics::R->new() ;
$R->startR;
$R->set('data', [ $m1 , $m2 , $m3 , $m4 , $m5 , $m6 ] );
$R -> run(q`library(reshape);scaled_data <- rescaler(data)`);
my $scaled_data = $R -> get('scaled_data');
print $scaled_data,"\n";
$R->stopR();
I got :
ARRAY(0xdde3d0)
Are you sure that you don't want:
$R->set('data', [ $m1 , $m2 , $m3 , $m4 , $m5 , $m6 ] );
That's how the set commands are documented in the Perldoc for Statistics::R
Something like this within Perl should work
use List::Util qw( min max );
my @m = (4,5.3,2,1,1.3,2);
my $min = min @m;
my $max = max @m;
my @scaled = map {($_-$min)/($max-$min)} @m;
print join(" - ", @m), "\n";
print join(" - ", @scaled), "\n";
and that outputs
4 - 5.3 - 2 - 1 - 1.3 - 2
0.69767441860 - 1 - 0.23255813953 - 0 - 0.069767441860 - 0.23255813953
I believe the main problem with your use of the Statistics::R package is the set command. You passed a single comma-joined string, but R needs a numeric vector, so the set call should probably have looked something like
$R->set('data', \@m);
# or maybe $R->set('data', [$m1,$m2,$m3,$m4,$m5,$m6]);
but I do not have that package installed, so I didn't test it.
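For readers who want to check the arithmetic independently of Perl or R, here is a minimal Python sketch of the same min-max rescaling, (x - min) / (max - min). The function name `rescale_to_unit_range` is made up for this illustration, and returning zeros when all values are equal is an assumption of the sketch, not something the answers above specify.

```python
def rescale_to_unit_range(values):
    """Linearly map values onto [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption of this sketch: a constant input maps to all zeros
        # instead of raising a division-by-zero error.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


scaled = rescale_to_unit_range([4, 5.3, 2, 1, 1.3, 2])
print(scaled)
```

The largest input maps to 1 and the smallest to 0, which matches the Perl output shown above.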
I am trying to be clever and write a "stopcat" function that is in essence shorthand for stop( sprintf( "a string thing %d", val ) )
Here is a small snippet that doesn't work
library( tidyverse ) # For !!
stopcat = function( str, ... ) {
msg = sprintf( str, ... )
eval( expr( stop( !!msg ) ), parent.frame(1) )
}
pig = function() {
cat( "piggy\n" )
stopcat( "no way! %d", 6 )
24 * 44
}
pig()
It prints out
Error in eval(expr(stop(!!msg)), parent.frame(1)) : no way! 6
but I want
Error in pig : no way! 6
Any thoughts?
I found related post
How to get the name of the calling function inside the called routine?
but the details there do not seem to apply to stop (or I can't understand what was said, perhaps)
I don't think you can fool stop by manipulating the evaluation context like that. If you really need to identify that function, you might try grabbing the function name and adding that to the error message and turning off the default expression label. For example
stopcat <- function( str, ... ) {
msg <- sprintf( str, ... )
fun <- deparse(sys.call(1)[[1]])
msg <- paste0("(in ", fun, "): ", msg)
stop( msg, call.=FALSE)
}
pig <- function() {
cat( "piggy\n" )
stopcat( "no way! %d", 6 )
24 * 44
}
pig()
# piggy
# Error: (in pig): no way! 6
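The same idea (look up the calling function's name and put it into the message yourself) can be sketched in Python for comparison. This is only an analogy, not R code: it uses the standard `inspect` module and looks exactly one frame up the stack, whereas `sys.call(1)` in the R version refers, as far as I understand it, to the outermost call, which happens to be `pig()` here.

```python
import inspect


def stopcat(fmt, *args):
    # Name of the function that called stopcat (one frame up the stack).
    caller = inspect.stack()[1].function
    raise RuntimeError(f"(in {caller}): {fmt % args}")


def pig():
    print("piggy")
    stopcat("no way! %d", 6)
    return 24 * 44  # never reached


try:
    pig()
except RuntimeError as err:
    message = str(err)
    print(message)  # (in pig): no way! 6
```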
I am new to PDL. R's ifelse() function can do conditional element selection. For example,
x <- c(1,2,3,4)
ifelse(x%%2, x, x*2)
# [1] 1 4 3 8
Does anyone know how to do this in PDL? I know you can do it like below, but is there a better way?
pdl(map { $_ % 2 ? $_ : $_*2 } @{$x->unpdl} )
#! /usr/bin/perl
use warnings;
use strict;
use PDL;
my $x = 'PDL'->new([1, 2, 3, 4]);
my $where = ! ($x % 2); # [0 1 0 1]
my $y = $x * ($where + 1);
print $y; # [1 4 3 8]
or, more concisely,
my $y = $x * ( 2 - $x % 2 );
Answering the question myself. It can be something like this,
use PDL;
use feature 'say'; # needed for the say calls below
sub ifelse {
my ( $test, $yes, $no ) = @_;
$test = pdl($test);
my ( $ok, $nok ) = which_both($test);
my $rslt = zeros( $test->dim(0) );
unless ( $ok->isempty ) {
$yes = pdl($yes);
$rslt->slice($ok) .= $yes->index( $ok % $yes->dim(0) );
}
unless ( $nok->isempty ) {
$no = pdl($no);
$rslt->slice($nok) .= $no->index( $nok % $no->dim(0) );
}
return $rslt;
}
my $x = pdl( 1, 2, 3, 4 );
say ifelse( $x % 2, $x, $x * 2 ); # [1 4 3 8]
say ifelse( $x % 2, 5, sequence( 3 ) ); # [5 1 5 0]
say ifelse( 42, $x, $x * 2 ); # [1]
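The recycling behaviour of this ifelse (shorter `yes` / `no` inputs are indexed modulo their length) may be easier to see without the PDL slicing. Below is a plain-list Python sketch of the same semantics; it is an illustration only and says nothing about how PDL implements `which_both` or `index`.

```python
def ifelse(test, yes, no):
    """Element-wise choice between yes and no, recycling shorter inputs."""
    def as_list(value):
        # Scalars are treated as one-element lists so they recycle too.
        return list(value) if isinstance(value, (list, tuple)) else [value]

    test, yes, no = as_list(test), as_list(yes), as_list(no)
    return [
        yes[i % len(yes)] if t else no[i % len(no)]
        for i, t in enumerate(test)
    ]


x = [1, 2, 3, 4]
odd_mask = [v % 2 for v in x]
print(ifelse(odd_mask, x, [v * 2 for v in x]))  # [1, 4, 3, 8]
print(ifelse(odd_mask, 5, [0, 1, 2]))           # [5, 1, 5, 0]
print(ifelse(42, x, [v * 2 for v in x]))        # [1]
```

As in the PDL version, the length of the result follows the length of the test, which is why a scalar test yields a single element.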
In PDL, the general solution to that sort of thing likely involves slicing and similar. Looking at the latest release notes of PDL (2.077), which has a new where_both, I remembered this question (disclosure: I'm current maintainer). While your specific problem only involves a change to the values of even numbers, I'll also show the case of adding 2 to odds:
my ($odd, $even) = where_both($x, $x % 2);
$odd += 2, $even *= 2; # the "," form is just a flourish, it could be 2 lines
It's efficient, in proper PDL style, because the scan of $x only happens once (you won't be surprised to learn it also uses which_both under the hood), and the mutations only look at the slices of the bits that are relevant. This is very similar to your code, but it got captured into a small, widely-reusable function. (I wrote it to turn the TriD EuclidAxes stuff from using Perl for-loops to actually using ndarrays, if you're interested)
Better than $x ? $y : $z? Not to my mind, but it's a matter of style and taste
sub ifelse {
my ($x,$y,$z) = @_;
$x ? $y : $z ;
if($x){$y}else{$z} ;
[$y,$z]->[!$x] ;
[$z,$y]->[!!$x] ;
($x && $y) || $z ; # valid only if $y is always true
(!$x && $z) || $y ; # valid only if $z is always true
}
I have as input a gene list where each gene has a header like >SomeText.
For each gene I would like to find the frequency of the string GTG (number of occurrences divided by the length of the gene). The string should only be counted if it starts at position 1, 4, 7, 10, etc. (every third position).
>ENST00000619537.4 cds:known chromosome:GRCh38:21:6560714:6564489:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGATGTGACCATCCAGCACCCCTGGTTCAAGCGCACCCTGGGGCCCTTCTACCCCAGC
CGGCTGTTCGACCAGTTTTTCGGCGAGGGCCTTTTTGAGTATGACCTGCTGCCCTTCCTG
TCGTCCACCATCAGCCCCTACTACCGCCAGTCCCTCTTCCGCACCGTGCTGGACTCCGGC
ATCTCTGAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTC
TCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCAC
AACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTG
CCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACC
TTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCC
GTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624019.3 cds:known chromosome:GRCh38:21:6561284:6563978:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGACGCCCCCCCCCCCCACCCAACCACAGGCCTCCTCTCTGAGCCACGGGTTCGATCC
GACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTCTCCCCGGAGGACCTCACC
GTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCACAACGAGCGCCAGGACGAC
CACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTGCCGTCCAACGTGGACCAG
TCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACCTTCTGTGGCCCCAAGATC
CAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCCGTGTCGCGGGAGGAGAAG
CCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624932.1 cds:known chromosome:GRCh38:21:6561954:6564203:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGCCTGTCTGTCCAGGAGACAGTCACAGGCCCCCGAAAGCTCTGCCCCACTTGGTGTGT
GGGAGAAGAGGCCGGCAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTG
AAGCACTTCTCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCAC
GGAAAGCACAACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGC
TACCGCCTGCCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGC
ATGCTGACCTTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGA
GCCATCCCCGTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
Output:
Gene Frequency
Gene1: 3
Gene2 6.3
....
I was thinking of something like this, but I don't know how to express the position requirement:
freq <- sapply(gregexpr("GTG",x),function(x)if(x[[1]]!=-1) length(x) else 0)
Here is an idea in R using stringi.
We use stri_locate_all_fixed() to find the start and end position of each GTG occurrence. Then we create a column condition to test whether the start position is in 1, 4, 7, 10, 13, 16, 19, 22, ...
library(stringi)
library(dplyr)
data.frame(stri_locate_all_fixed(gene1, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(gene1), 3))
Which gives:
# start end condition
#1 4 6 TRUE
If you want to generalize this to a list of genes, you could do:
lst <- list(gene1, gene2, gene3)
res <- lapply(lst, function(x) {
data.frame(stri_locate_all_fixed(x, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(x), 3))
})
Which would give:
#[[1]]
# start end condition
#1 4 6 TRUE
#
#[[2]]
# start end condition
#1 NA NA FALSE
#
#[[3]]
# start end condition
#1 3 5 FALSE
#2 9 11 FALSE
#3 21 23 FALSE
#4 70 72 TRUE
#5 75 77 FALSE
Following @Sobrique's comment, if "divided by length" means the number of occurrences satisfying the condition divided by the total number of characters in each gene, you could do:
lapply(1:length(res), function(x) sum(res[[x]][["condition"]]) / nchar(lst[[x]]))
Which would give:
#[[1]]
#[1] 0.004830918
#
#[[2]]
#[1] 0
#
#[[3]]
#[1] 0.003021148
Here's a Perl solution that does as you ask
But I don't understand how your example output is derived: the first and last sequences have only one occurrence of GTG in the positions you require, and the second sequence has none at all. That means the outputs are 1 / 207, 0 / 74, and 1 / 331 respectively. None of those are anything like 3 and 6.3 that you say you're expecting
This program expects the path to the input file as a parameter on the command line
use strict;
use warnings 'all';
print "Gene Frequency\n";
my $name;
local $/ = '>';
while ( <> ) {
chomp;
next unless /\S/;
my ($name, $seq) = split /\n/, $_, 2;
$seq =~ tr/A-Z//cd;
my $n = 0;
while ( $seq =~ /(?=GTG)/g ) {
++$n if $-[0] % 3 == 0;
}
printf "%-7s%.6f\n", $name, $n / length($seq);
}
output
Gene Frequency
Gene1 0.004831
Gene2 0.000000
Gene3 0.003021
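To make the counting rule explicit (only matches starting at 0-based offsets 0, 3, 6, ... count, and the total is divided by the sequence length), here is a small Python sketch. The helper names `read_fasta` and `in_frame_frequency` and the toy records are invented for the illustration; they are not the question's data.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def in_frame_frequency(seq, codon="GTG"):
    """Count codon matches at offsets 0, 3, 6, ... and divide by len(seq)."""
    if not seq:
        return 0.0
    hits = sum(
        1
        for i in range(0, len(seq) - len(codon) + 1, 3)
        if seq[i:i + len(codon)] == codon
    )
    return hits / len(seq)


# Hypothetical toy records, wrapped over several lines like a real FASTA file.
demo = [">toy1", "ATGGTG", "AAAGTG", ">toy2", "AGTGAA"]
results = {name: in_frame_frequency(seq) for name, seq in read_fasta(demo)}
print(results)
```

In `toy2` the GTG starts at offset 1, so it is out of frame and does not count.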
Here is an alternate solution that does not use a pattern match. Not that it will matter much.
use strict;
use warnings;
my $gene;
while ( my $line = <> ) {
if ( $line =~ /^>(.+)/ ) {
$gene = $1;
next;
}
chomp $line;
printf "%s: %s\n",
$gene,
( grep { $_ eq 'GTG' } split /(...)/, $line ) / length $line;
}
Output:
Gene1: 0.00483091787439614
Gene2: 0
Gene3: 0.00302114803625378
It is essentially similar to Sobrique's answer, but assumes that the gene lines contain the right characters. It splits up the gene string into a list of three-character pieces and takes the ones that are literally GTG.
The splitting works by abusing the fact that split uses a pattern as the delimiter, and that it will also capture the delimiter if a capture group is used. Here's an example.
my @foo = split /(...)/, '1234567890';
p @foo; # from Data::Printer
__END__
[
[0] "",
[1] 123,
[2] "",
[3] 456,
[4] "",
[5] 789,
[6] 0
]
The empty elements get filtered out by grep. It might not be the most efficient way, but it gets the job done.
You can run it by calling perl foo.pl horribly-large-gene-sequence.file.
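The split-with-a-capture-group trick is not specific to Perl. Python's `re.split` also returns the captured delimiters, so the same behaviour can be reproduced as a quick cross-check (shown here purely as an illustration of the technique):

```python
import re

# Every three-character "delimiter" is captured, so it shows up in the result,
# separated by the empty strings that lie between consecutive delimiters.
pieces = re.split(r"(...)", "1234567890")
print(pieces)  # ['', '123', '', '456', '', '789', '0']

# Dropping the empty strings leaves the in-frame triplets (plus any remainder).
codons = [p for p in re.split(r"(...)", "ATGGTGAAAGTG") if p]
gtg_count = codons.count("GTG")
print(codons, gtg_count)  # ['ATG', 'GTG', 'AAA', 'GTG'] 2
```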
Well, you have an R solution. I've hacked something together in perl because you tagged it:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'GTG';
local $/ = "\n>";
while ( <> ) {
my ($gene) = m/(Gene\d+)/;
my @hits = grep { /^$target$/ } m/ ( [GTCA]{3} ) /xg;
print "$gene: ".( scalar @hits), "\n";
}
This doesn't give the same results as your input though:
Gene1: 1
Gene2: 0
Gene3: 1
I'm decomposing your string into 3 element lists, and looking for ones that specifically match. (And I haven't divided by length, as I'm not entirely clear if that's the actual string length in letters, or some other metric).
Including length matching - we need to capture both name and string:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n>";
while (<>) {
my ($gene, $gene_str) = m/(Gene\d+)\n([GTCA]+)/m;
my @hits = grep { /^GTG$/ } $gene_str =~ m/ ( [GTCA]{3} ) /xg;
print "$gene: " . @hits . "/". length ( $gene_str ), " = ", @hits / length($gene_str), "\n";
}
We use <> which is the 'magic' filehandle, and tells perl to read from either STDIN or a file specified on command line. Much like sed or grep does.
With your input:
Gene1: 1/207 = 0.00483091787439614
Gene2: 0/74 = 0
Gene3: 1/331 = 0.00302114803625378
Here is a function I created based on your requirement. I am pretty sure there are alternate ways better than this but this solves the problem.
require(stringi)
input_gene_list<- list(gene1= "GTGGGGGTTTGTGGGGGTG", gene2= "GTGGGGGTTTGTGGGGGTG", gene3= "GTGGGGGTTTGTGGGGGTG")
gene_counter<- function(gene){
x<- gene
y<- gsub(pattern = "GTG",replacement = "GTG ", x = x, perl=TRUE)
if(stri_count_fixed(y, pattern = "GTG") > 0) {
gene_count<- unlist(gregexpr(pattern = " ", y))
counter<- 0
for(i in 1:length(gene_count)){
if((gene_count[i] %% 3) == 1) counter=counter+1
}
return(counter/nchar(x))
}
}
output_list<- lapply(input_gene_list, function(x) gene_counter(x))
result<- t(as.data.frame(output_list))
result
[,1]
gene1 0.1052632
gene2 0.1052632
gene3 0.1052632
Also share your thoughts on it! Thanks!
So let's say I'm trying to find the sum of all multiples of 5 below a certain maximum number. I'm doing this recursively, because that seemed easiest. This is my code:
isFactor(X):-
Y is X mod 5,
Y = 0.
sumAll(Number, Result):-
sumAll(Number, 0, Result).
sumAll(Number, RunningTotal, Result):-
(isFactor(Number) ->
NextTotal is RunningTotal + Number;
NextTotal is RunningTotal),
NextNumber is Number - 1,
(NextNumber > 0 ->
sumAll(NextNumber, NextTotal, NextResult);
NextResult is RunningTotal),
number(NextResult) -> % this test is so that the interpreter
write(NextResult), nl; % doesn't print out a bunch of extra stuff
true. % (the internal IDs of each binding of
% NextResult maybe?) after the answer.
Now, this works (that is, it prints the correct sum), but I am slightly miffed that I can't figure out how to arrange the code so that doing
| ?- sumAll(10, X).
binds X to 10, rather than printing '10' and asserting 'yes' at the end.
My instinct is to somehow rebind Result to NextResult if NextNumber > 0 (line 13) is true, but I suspect that's just years of Python programming trying to assert themselves.
Is there a way of 'returning' the result of a goal all the way up the nested recursions here? Or am I just thinking about this all wrong?
That's awfully complicated for something simple. To sum all elements of a list that are divisible by N, all you need is this tail recursive implementation:
sum_all( Xs , N , Sum ) :-
sum_all( Xs , N , 0 , Sum )
.
sum_all( [] , _ , S , S ) .
sum_all( [X|Xs] , N , T , S ) :-
X mod N =:= 0 ,
! ,
T1 is T+X ,
sum_all(Xs,N,T1,S)
.
sum_all( [_|Xs] , N , T , S ) :-
sum_all(Xs,N,T,S)
.
The non-tail recursive implementation is a bit simpler but will blow its stack on a long list:
sum_all( [] , _ , 0 ) .
sum_all( [X|Xs] , N , S ) :-
sum_all(Xs,N,T) ,
( X mod N =:= 0 -> S is T+X ; S is T )
.
You could even do something like this to decompose the extraction of "interesting" values from the summing of the list:
sum_all(Xs,N,Sum) :-
findall( X , ( member(X,Xs), X mod N =:= 0 ) , L ) ,
sum(L,Sum)
.
sum(L,S) :- sum(L,0,S).
sum( [] , S ,S ) .
sum( [X|Xs] , T ,S ) :- T1 is T+X , sum(Xs,T1,S) .
Once you have that, then you can simply say:
sum_modulo_N_values( Xs , N ) :-
sum_all(Xs,N,Sum) ,
writenl( sum = Sum )
.
Invoke it something like this
sum_modulo_N_values( [1,2,5,6,7,10,11,15,31,30] , 5 ) .
And you'll get the expected sum = 60 written to the console.
Your code seems more complex than needed, maybe such complexity hides an important fact:
in sumAll(Number, RunningTotal, Result):- Result is a singleton. So there is little chance of getting the computed value back.
I would try to get rid of number(NextResult) -> etc. (by the way, you usually need parentheses to get the expected nesting when using if/then/else, that is (C -> T ; F) ), and 'assign' to Result instead.
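Since the question mentions a Python background, the accumulator pattern used in the answers can be sketched in Python: the running total is passed down, and the final value is handed back through every level instead of being printed at the bottom. In Prolog the equivalent is to pass Result itself to the recursive call and unify it with the total in the base case. The function name and the default divisor are assumptions of this sketch.

```python
def sum_multiples(number, divisor=5, running_total=0):
    """Sum the multiples of divisor in 1..number using an accumulator."""
    if number <= 0:
        # Base case: the accumulator *is* the result, so return it.
        return running_total
    if number % divisor == 0:
        running_total += number
    # The recursive call's result is returned unchanged all the way up.
    return sum_multiples(number - 1, divisor, running_total)


print(sum_multiples(10))  # 15 (that is, 10 + 5)
```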
I have an input like this:
120-160
200-220
400-500
.
.
.
I would like to break each range (row) into two sub-ranges:
120-140 141-160
200-210 211-220
400-450 451-500
.
.
.
and then print each column (range) into different files.
file 1:
120-140
200-210
400-450
.
.
.
file 2:
141-160
211-220
451-500
.
.
.
I have not managed to get anywhere and have no clue how to do it.
Thank you very much
You can use bash:
while IFS=- read min max; do
mid=$(( (min+max)/2 ))
printf "%d-%d\n" $min $mid >> file.1
printf "%d-%d\n" $((mid+1)) $max >> file.2
done < input.file
The awk would be
awk -F- -v OFS=- '{
mid = int(($1+$2)/2)
print $1, mid > "file.1"
print mid+1, $2 > "file.2"
}' input.file
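Both one-liners rely on the same arithmetic: take the integer midpoint, end the first half there, and start the second half one past it. A small Python sketch of just that step, using the three rows from the question (the function name `split_range` is made up for the illustration, and writing the two columns to files is left out):

```python
def split_range(text):
    """Split 'low-high' at the integer midpoint into two adjacent ranges."""
    low, high = (int(part) for part in text.strip().split("-"))
    mid = (low + high) // 2
    return f"{low}-{mid}", f"{mid + 1}-{high}"


rows = ["120-160", "200-220", "400-500"]
halves = [split_range(row) for row in rows]
first_column = [a for a, _ in halves]
second_column = [b for _, b in halves]
print(first_column)   # ['120-140', '200-210', '400-450']
print(second_column)  # ['141-160', '211-220', '451-500']
```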
Perl solution. Save as break-range.pl, run as perl break-range.pl input.
#!/usr/bin/perl
use warnings;
use strict;
open my $F1, '>', 'file1' or die $!;
open my $F2, '>', 'file2' or die $!;
while (<>) {
chomp;
if (my ($low, $high) = /([0-9]+)-([0-9]+)/) {
my $middle = int(($low + $high) / 2);
print $F1 "$low-$middle\n";
print $F2 $middle + 1, "-$high\n";
}
}
close $F1;
close $F2;
For a non-fixed number of output files, you can use something like the following; run it as break-range.pl number-of-files input. Note that it probably does not work if the number of files is greater than the size of an interval.
#!/usr/bin/perl
use warnings;
use strict;
my $number = shift;
my @FH;
for my $i (0 .. $number - 1) {
open $FH[$i], '>', "file$i" or die $!;
}
while (<>) {
chomp;
if (my ($low, $high) = /([0-9]+)-([0-9]+)/) {
my $step = ($high - $low) / $number;
for my $i (0 .. $number - 1) {
print {$FH[$i]} int($low + $i * $step) + ($i > 0), '-',
int($low + ($i + 1) * $step), "\n";
}
}
}
close $_ for @FH;
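The boundary arithmetic in the inner loop is easier to follow in isolation. The Python sketch below mirrors it line for line: piece i starts at int(low + i * step), plus one for every piece after the first, and ends at int(low + (i + 1) * step). The function name `split_range_n` is hypothetical, and file handling is omitted.

```python
def split_range_n(low, high, n):
    """Split [low, high] into n consecutive pieces, mirroring the Perl loop."""
    step = (high - low) / n
    pieces = []
    for i in range(n):
        start = int(low + i * step) + (1 if i > 0 else 0)
        end = int(low + (i + 1) * step)
        pieces.append((start, end))
    return pieces


print(split_range_n(120, 160, 2))  # [(120, 140), (141, 160)]
print(split_range_n(120, 160, 4))  # [(120, 130), (131, 140), (141, 150), (151, 160)]
```

When n is larger than the width of the interval, some pieces come out with a start greater than their end, which is the limitation the answer warns about.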
Here is an R solution: Assuming you have imported the data into a vector input,
input <- c("120-160", "200-220", "400-500")
ranges <- strsplit(input, "-")
ranges <- lapply(ranges, as.numeric)
ranges <- lapply(ranges, function(x) c(x[1], floor(mean(x)), x[2])) # floor() keeps the midpoint whole so "%d" accepts it
output1 <- sapply(ranges, function(x) sprintf("%d-%d", x[1], x[2]))
output2 <- sapply(ranges, function(x) sprintf("%d-%d", x[2]+1, x[3]))
You can then use writeLines to write the two files.