Why does XQuery add an extra space? - xquery

XQuery adds a space and I don't understand why. I have the following simple query:
declare option saxon:output "method=text";
for $i in 1 to 10
return concat(".", $i, " ", 100, "
", ".")
I ran it with Saxon (SaxonEE9-5-1-8J and SaxonHE9-5-1-8J):
java net.sf.saxon.Query -q:query.xq -o:result.txt
The result is the following:
.1 100
. .2 100
. .3 100
. .4 100
. .5 100
. .6 100
. .7 100
. .8 100
. .9 100
. .10 100
.
My question comes from the presence of an extra space between the dots. The first line is OK, but the following lines (2 to 10) have that space and I don't understand why. What we see as spaces between the digits is in fact a tab inserted by the character reference &#9;.
Could you enlighten me about that behavior?
PS: I have added saxon as a tag even though the question is not specific to Saxon.

I think your query returns a sequence of string values, which are then concatenated with a single space by default (see http://www.w3.org/TR/xslt-xquery-serialization/#sequence-normalization, where it says "For each subsequence of adjacent strings in S2, copy a single string to the new sequence equal to the values of the strings in the subsequence concatenated in order, each separated by a single space"). If you don't want that, you can use
string-join(for $i in 1 to 10
return concat(".", $i, "&#9;", 100, "&#10;", "."), '')

The space between the dots is the separator introduced between the items in the sequence you are constructing. Saxon's text serializer inserts that space character when writing the output so that you can tell the sequence items apart.
Considering your code:
declare option saxon:output "method=text";
for $i in 1 to 10
return
concat(".", $i, " ", 100, "
", ".")
The result of for $i in 1 to 10 return is a sequence of 10 xs:string items. From your output you can determine that the space is interspersed between each evaluation of concat(".", $i, "&#9;", 100, "&#10;", ".").
If you want to check that you can rewrite your query as:
for $i in 1 to 10
return
<x>{concat(".", $i, " ", 100, "
", ".")}</x>
And you will see your 10 distinct items, with no spaces between them.
If you are trying to create a single text string then, since you are already controlling the line breaks yourself, you could also join all 10 xs:string items together, which eliminates the spaces you are seeing between the sequence items. For example:
declare option saxon:output "method=text";
string-join(
for $i in 1 to 10
return
(".", string($i), " ", "100", "
", ".")
, "")

Related

Remove space in print statement in python

While using the below print command:
print(k,':',dict[k])
I get the output shown below, but I want to remove the space between the key and the colon. How do I do that?
Current Output:
Sam : 40
Required Output:
Sam: 40
You could try printing a single string built by concatenation (converting the value with str() so it can be joined to the strings):
print(k + ': ' + str(dict[k]))
The Python print() function has a sep parameter that defaults to a space, so each of the comma-separated values you pass in is separated by a space when printed.
I think what you are looking for is
print(name, ": ", "40", sep = '')
>>> Sam: 40
Simply specifying the "sep" parameter solves your issue.
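On Python 3.6+ an f-string sidesteps the separator entirely; a minimal sketch using the question's own variables (note that shadowing the built-in name dict is best avoided):
print(f"{k}: {dict[k]}")
>>> Sam: 40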

calculate the percentage of not null recs in a file in unix

How do I figure out the percentage of non-null records in my file in UNIX?
My file looks like the sample below. I want to know the number of records and the percentage of non-null records. I've tried a whole lot of grep and cut commands, but nothing seems to work. Can anyone help me here, please?
"name","country","age","place"
"sam","US","30","CA"
"","","",""
"joe","UK","34","BRIS"
,,,,
"jake","US","66","Ohio"
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
use 5.012;  # say, keys @arr
use Text::CSV_XS qw{ csv };

my ($count_all, @count_nonempty);
csv(in      => shift,
    out     => \ 'skip',
    headers => 'skip',
    on_in   => sub {
        my (undef, $columns) = @_;
        ++$count_all;
        length $columns->[$_] and $count_nonempty[$_]++
            for 0 .. $#$columns;
    },
);
for my $column (keys @count_nonempty) {
    say "Column ", 1 + $column, ": ",
        100 * $count_nonempty[$column] / $count_all, '%';
}
It uses Text::CSV_XS to read the CSV file. It skips the header line and, for each subsequent line, calls the callback specified in on_in, which increments the count of all lines and also the per-column count of non-empty fields whenever a field's length is non-zero.
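For comparison, here is a rough Python sketch of the same per-column counting (the file name file.csv is a placeholder; Python's standard csv module stands in for Text::CSV_XS):
import csv

count_all = 0
count_nonempty = {}                          # column index -> non-empty count
with open("file.csv", newline="") as fh:
    reader = csv.reader(fh)
    next(reader)                             # skip the header line
    for row in reader:
        count_all += 1
        for i, field in enumerate(row):
            if field:                        # field is non-empty
                count_nonempty[i] = count_nonempty.get(i, 0) + 1
for i in sorted(count_nonempty):
    print(f"Column {i + 1}: {100 * count_nonempty[i] / count_all}%")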
Along with choroba, I would normally recommend using a CSV parser on CSV data.
But in this case, all we want to look for is that a record contains any character that is not a comma or quote: if a record contains only commas and/or quotes, it is a "null" record.
awk '
/[^",]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
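Run on the sample file above, this should print:
4 / 6 = 0.67
(The header line counts as non-null too, since it contains characters other than quotes and commas.)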
To handle leading/trailing whitespace:
awk '
{sub(/^[[:blank:]]+/,""); sub(/[[:blank:]]+$/,"")}
/[^",]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
If fields containing only whitespace, such as
" ","",,," "
should also count as a null record, we can simply ignore all whitespace:
awk '
/[^",[:blank:]]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file

unix split FASTA using a loop, awk and split

I have a long list of data organised as below (INPUT).
I want to split the data up so that I get an output as below (desired OUTPUT).
The code below first identifies all the lines containing ">gi" and saves their line numbers in an array called B.
Then, in a new file, it should replace those lines from array B with a shortened version of the text following ">gi".
I figured the easiest way would be to split at "|"; however, this does not work (no separation happens with my code if I replace " " with "|").
My code below does split nicely after the " " if I replace the "|" with " " in the INPUT, but I get into trouble when I want the text between the [ ] brackets, which is not always there and is not always only two words...:
B=$( grep -n ">gi" 1VAO_1DII_5fxe_all_hits_combined.txt | cut -d : -f 1)
awk <1VAO_1DII_5fxe_all_hits_combined.txt >seqIDs_1VAO_1DII_5fxe_all_hits_combined.txt -v lines="$B" '
BEGIN {split(lines, a, " "); for (i in a) change[a[i]]=1}
NR in change {$0 = ">" $4}
1
'
Let me know if more explanation is needed!
INPUT:
>gi|9955361|pdb|1E0Y|A:1-560 Chain A, Structure Of The D170sT457E DOUBLE MUTANT OF VANILLYL- Alcohol Oxidase
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVA
>gi|557721169|dbj|GAD99964.1|:1-560 hypothetical protein NECHADRAFT_63237 [Byssochlamys spectabilis No. 5]
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVAPRNV
desired OUTPUT:
>1E0Y
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVAPRNV
>GAD99964.1 Byssochlamys spectabilis No. 5
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVA
This can be done in one step with awk (gnu awk):
awk -F'|' '/^>gi/{a=1;match($NF,/\[([^]]*)]/, b);print ">"$4" "b[1];next}a{print}!$0{a=0}' input > output
In a more readable way:
/^>gi/ { # when the line starts with ">gi"
a=1; # set flag "a" to 1
# extract the part between brackets (if any) in the last field
match($NF,"\\[([^]]*)]", b);
print ">"$4" "b[1]; # display the line
next # jump to the next record
}
a { print } # while "a" is set (inside an allowed block), print the line
!$0 { a=0 } # when the line is empty, set "a" to 0 to stop the display
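If awk is not a hard requirement, here is a rough Python sketch of the same logic (the input file is assumed to be passed as the first command-line argument):
import re
import sys

printing = False
with open(sys.argv[1]) as fh:
    for line in fh:
        line = line.rstrip("\n")
        if line.startswith(">gi"):
            fields = line.split("|")
            # part between [ ] in the last field, if any
            m = re.search(r"\[([^]]*)\]", fields[-1])
            print(">" + fields[3] + (" " + m.group(1) if m else ""))
            printing = True
        elif not line:
            printing = False    # a blank line ends the block
        elif printing:
            print(line)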

Append two columns adding a specific integer at each value in Unix

I have two files like this:
# step distance
0 4.48595407961296e+01
2500 4.50383737781376e+01
5000 4.53506757198727e+01
7500 4.51682465277482e+01
10000 4.53410353656445e+01
# step distance
0 4.58854106214881e+01
2500 4.58639266431320e+01
5000 4.60620560167519e+01
7500 4.58990075106227e+01
10000 4.59371359946124e+01
So I want to join the two files together, while maintaining the spacing.
In particular, the second file's step column needs to continue counting from where the first file ended.
output:
# step distance
0 4.48595407961296e+01
2500 4.50383737781376e+01
5000 4.53506757198727e+01
7500 4.51682465277482e+01
10000 4.53410353656445e+01
12500 4.58854106214881e+01
15000 4.58639266431320e+01
17500 4.60620560167519e+01
20000 4.58990075106227e+01
22500 4.59371359946124e+01
With calc it was easy to do; the problem is that the spacing needs to stay intact for the files to work, and in that case calc makes a complete mess of it.
# start awk and set the *Step* between files to 2500
awk -v 'Step=2500' '
# 1st line of the 1st file (NR counts every line, across all files): init and print the header
NR == 1 {LastFile = FILENAME; OFS = "\t"; print}
# when the file changes (new FILENAME compared to the previous line read),
# set a new start index (turning relative steps into absolute ones) and remember the new filename
FILENAME != LastFile { StartIndex = LastIndex + Step; LastFile = FILENAME}
# after the first line, and for every line starting with a digit (optionally preceded by blanks),
# calculate the absolute step, replace the relative one, and print the new content
NR > 1 && /^[[:blank:]]*[0-9]/ { $1 += StartIndex; LastIndex = $1;print }
' YourFiles*
The result depends on the order of the files.
The output separator is set by the OFS value (a tab here).
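Run as shown on the two sample files (in order), this should continue the step column at 12500, 15000, and so on, matching the desired numbers, with the rebuilt lines tab-separated per OFS.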
Perl to the rescue!
#!/usr/bin/perl
use warnings;
use strict;

open my $F1, '<', 'file1' or die $!;
my ($before, $after, $diff);
my $max = 0;
while (<$F1>) {
    print;
    my ($space1, $num, $space2) = /^(\s*) ([0-9]+) (\s*)/x or next;
    ($before, $after) = ($space1, $space2);
    $diff = $num - $max;
    $max = $num;
}
$before = length "$before$max";  # We'll need it to format the computed numbers.

open my $F2, '<', 'file2' or die $!;
<$F2>;  # Skip the header.
while (<$F2>) {
    my ($step, $distance) = split;
    $step += $max + $diff;
    printf "% ${before}d%s%s\n", $step, $after, $distance;
}
The program remembers the last number in $max. It also keeps the length of the leading whitespace plus $max in $before to format all future numbers to take up the same space (using printf).
You didn't show how the distance column is aligned, i.e.
20000 4.58990075106227e+01
22500 11.59371359946124e+01 # dot aligned?
22500 11.34572478912301e+01 # left aligned?
The program would align it the latter way. If you want the former, use a similar trick as for the step column.
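If Perl is not at hand, a minimal Python sketch of the same renumbering (the file names file1 and file2 are placeholders, and the fixed step of 2500 is assumed from the data):
step = 2500
offset = 0
with open("file1") as f1:
    for line in f1:
        print(line, end="")                  # pass file1 through unchanged
        parts = line.split()
        if parts and parts[0].isdigit():
            offset = int(parts[0]) + step    # the next file continues from here
with open("file2") as f2:
    next(f2)                                 # skip the header
    for line in f2:
        num, distance = line.split()
        print(int(num) + offset, distance)
Note that this keeps a single space between the columns rather than reproducing the original alignment.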

print if all value are higher

I have a file like:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
B 37.40,37.40,38.40,38.80,58.40,58.80,45.00,44.8
.
.
.
I want to print only those lines where all the values in column 2 are greater than 50.
output:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
I tried:
cat file | tr ',' '\t' | awk '{for (i=2; i<=NF; i++){if($i<50) continue; else print $i}}'
I hope you meant that r tag you added to your question.
tab <- read.table("file")
splt <- strsplit(as.character(tab[[2]]), ",")
rows <- unlist(lapply(splt, function(a) all(as.numeric(a) > 50)))
tab[rows,]
This will read your file as a space-separated table, split the second column into individual values (resulting in a list of character vectors), then compute a logical value for each such row depending on whether or not all values are > 50. These results are combined to a logical vector which is then used to subset your data.
The field separator can be any regular expression, so if you include commas in FS your approach works:
awk '{ for(i=2; i<=NF; i++) if($i<=50) next } 1' FS='[ \t,]+' infile
Output:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
Explanation
The for loop runs through the comma-separated values in the second column, and if any of them is lower than or equal to 50, next is executed, i.e., we skip to the next line. If the first block is passed, the 1 is encountered, which evaluates to true and executes the default block: { print $0 }.
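The same filter as a small Python sketch (the file name is a placeholder):
with open("file") as fh:
    for line in fh:
        label, values = line.split(None, 1)
        # keep the line only if every comma-separated value exceeds 50
        if all(float(v) > 50 for v in values.split(",")):
            print(line, end="")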
