Xquery to concatenate - xquery

For the data below:
let $x := "Yahooooo !!!! Select one number - "
let $y :=
<A>
<a>1</a>
<a>2</a>
<a>3</a>
<a>4</a>
<a>5</a>
<a>6</a>
<a>7</a>
</A>
I want to get this output:
`Yahooooo !!!! Select one number - [1 or 2 or 3 or 4 or 5 or 6 or 7]`

In XQuery 3.0, you can use || as a string concatenation operator:
return $x || "[" || fn:string-join($y/a, " or ") || "]"
In XQuery 1.0, use fn:concat(), which accepts two or more arguments, so the calls do not need to be nested:
return fn:concat($x, "[", fn:string-join($y/a, " or "), "]")


regex, R and [:punct:] in grep --> return items from a list not containing any[:punct:] [duplicate]

I have this list of strings:
stringg <- c("csv.asef", "ac ed", "asdf$", "asdf", "dasf]", "sadf {sadf")
If I want to get all strings containing special characters like so:
grep("[:punct:]+", stringg, value = TRUE)
the result is:
[1] "csv.asef" "ac ed"
What I should get is:
[1] "csv.asef" "asdf$" "dasf]" "sadf {sadf"
If I use:
grep("[!\\"#$%&’()*+,-./:;<=>?@[]^_`{|}~.]+", stringg, value = TRUE)
the result is an error.
I want these special characters: € ! " # $ % & ’ ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~. which [:punct:] doesn't have
I know that if I want the strings not containing any of those characters then I would use:
[^ € ! " # $ % & ’ ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~.]
But how do I do it with [:punct:]:
[^:punct:]?
[^:punct:]{0}?
And how could I combine ^[:punct:] | ^€ ?
Many thanks
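According to ?regex, a POSIX character class such as [:punct:] is only recognized inside a bracket expression, so it has to be written as [[:punct:]]. The pattern "[:punct:]+" is read as an ordinary set containing the characters :, p, u, n, c and t, which is why only "csv.asef" and "ac ed" (both contain a c) were returned. ?regex also notes:
Most metacharacters lose their special meaning inside a character class. To include a literal ], place it first in the list.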
grep("[[:punct:]]+", stringg, value = TRUE)
#[1] "csv.asef" "asdf$" "dasf]" "sadf {sadf"
If we want the opposite, use invert = TRUE
grep("[[:punct:]€]", stringg, value = TRUE, invert = TRUE)
#[1] "ac ed" "asdf"
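For comparison, here is a minimal Python sketch of the same filtering. Python's re module has no [[:punct:]] class, so this sketch assumes string.punctuation (ASCII punctuation only) plus € as a stand-in. The variable names are illustrative and not part of the original answer.

```python
import re
import string

strings = ["csv.asef", "ac ed", "asdf$", "asdf", "dasf]", "sadf {sadf"]

# Build one character class from the ASCII punctuation plus the euro sign.
# re.escape keeps characters such as ], ^, - and \ literal inside [...].
punct_class = re.compile("[" + re.escape(string.punctuation + "€") + "]")

# Equivalent of grep(..., value = TRUE)
with_punct = [s for s in strings if punct_class.search(s)]

# Equivalent of grep(..., value = TRUE, invert = TRUE)
without_punct = [s for s in strings if not punct_class.search(s)]

print(with_punct)
print(without_punct)
```

Extending the set with further characters only requires appending them to the string passed to re.escape.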

Why does XQuery add an extra space?

XQuery adds a space and I don't understand why. I have the following simple query :
declare option saxon:output "method=text";
for $i in 1 to 10
return concat(".", $i, "&#x9;", 100, "&#xA;", ".")
I ran it with Saxon (SaxonEE9-5-1-8J and SaxonHE9-5-1-8J):
java net.sf.saxon.Query -q:query.xq -o:result.txt
The result is the following:
.1 100
. .2 100
. .3 100
. .4 100
. .5 100
. .6 100
. .7 100
. .8 100
. .9 100
. .10 100
.
My question comes from the presence of an extra space between dots. The first line is OK but the following lines (2 to 10) have that space and I don't understand why. What we see as spaces between digits is in fact a tab inserted by the character reference.
Could you enlighten me about that behavior ?
PS: I have added saxon as a tag for the question even if the question is not specific to Saxon.
I think your query returns a sequence of string values which are then by default concatenated with a space (see http://www.w3.org/TR/xslt-xquery-serialization/#sequence-normalization where it says "For each subsequence of adjacent strings in S2, copy a single string to the new sequence equal to the values of the strings in the subsequence concatenated in order, each separated by a single space"). If you don't want that then you can use
string-join(for $i in 1 to 10
return concat(".", $i, "&#x9;", 100, "&#xA;", "."), '')
The space between the dots is a separator introduced between the items of the sequence that you are constructing. It would seem that Saxon's text serializer inserts that space character between adjacent items when it writes the output.
Considering your code:
declare option saxon:output "method=text";
for $i in 1 to 10
return
concat(".", $i, "&#x9;", 100, "&#xA;", ".")
The result of for $i in 1 to 10 return is a sequence of 10 xs:string items. From your output you can determine that the space is interspersed between each evaluation of concat(".", $i, "&#x9;", 100, "&#xA;", ".").
If you want to check that you can rewrite your query as:
for $i in 1 to 10
return
<x>{concat(".", $i, "&#x9;", 100, "&#xA;", ".")}</x>
And you will see your 10 distinct items with no spaces between.
If you are trying to create a single text string, as you are already controlling the line-breaks, then you could also join all of the 10 xs:string items together yourself, which would have the effect of eliminating the spaces you are seeing between the sequence items. For example:
declare option saxon:output "method=text";
string-join(
for $i in 1 to 10
return
(".", string($i), "&#x9;", "100", "&#xA;", ".")
, "")
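The effect of sequence normalization can be imitated outside XQuery. The following Python sketch is only an analogy and not Saxon's implementation: it assumes each item is the string produced by one concat() call (dot, number, tab, 100, newline, dot) and compares joining the items with a single space against joining them with the empty string.

```python
# Each item mimics one result of concat(".", $i, TAB, 100, NEWLINE, ".")
items = [f".{i}\t100\n." for i in range(1, 11)]

# Sequence normalization: adjacent strings are joined by a single space,
# which is where the ". ." at the start of lines 2 to 10 comes from.
default_output = " ".join(items)

# string-join(..., ""): the query joins the items itself, so the serializer
# receives one string and has nothing left to separate.
joined_output = "".join(items)

print(default_output)
print(joined_output)
```

In the second output the trailing dot of one item and the leading dot of the next sit directly next to each other, with no space between them.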

Highlight keywords in classic ASP

I have this sentence, "The man went outside".
I also have 4 search criteria that I would like to get highlighted (ignore the brackets), [went|"an WeNT o"|a|t], by wrapping them in [span id="something"][/span].
I have tried out a lot of stuff but I can't figure out how to do this in classic ASP. If I insert a span somewhere in the text, the next search will either match inside the inserted SPAN markup too, which is bad, or fail to find the text because it has been broken up by HTML code. I also tried inserting on all positions in the original text, and even some magic regular expression which I do not understand, but I can't get this working :-/
The search string is divided with | and can contain anything from 1 to 20 things to search for.
Can anyone help me work out how to do this?
I found and tweaked some code and it works perfectly for me:
Function highlightStr (haystack, needles)
    ' Taken (and tweaked) from these two sites:
    ' http://forums.aspfree.com/asp-development-5/asp-highlight-keywords-295641.html
    ' http://www.eggheadcafe.com/forumarchives/scriptingVisualBasicscript/Jul2005/post23377133.asp
    '
    ' INPUT: haystack = search in this string
    ' INPUT: needles = searches divided by |... example: this|"is a"|search
    ' OUTPUT: HTML formatted highlighted string
    '
    Dim re, specials, i
    highlightStr = haystack
    If Len(haystack) = 0 Then Exit Function
    ' Delete the exact-search chars (if any)
    needles = Replace(needles, """", "")
    ' Collapse repeated separators (if any)
    Do While InStr(needles, "||") > 0
        needles = Replace(needles, "||", "|")
    Loop
    ' Delete the first and the last separator "|" (if any)
    If Left(needles, 1) = "|" Then needles = Mid(needles, 2)
    If Right(needles, 1) = "|" Then needles = Left(needles, Len(needles) - 1)
    If Len(needles) = 0 Then Exit Function
    ' Escape all special regular expression chars except the "|" separator
    ' (the backslash goes first so later escapes are not escaped again)
    specials = Array("\", "(", ")", ".", "[", "]", "{", "}", "*", "+", "?", "^", "$")
    For i = 0 To UBound(specials)
        needles = Replace(needles, specials(i), "\" & specials(i))
    Next
    ' One global, case-insensitive pass, so inserted markup is never searched
    Set re = New RegExp
    re.Pattern = "(" & needles & ")"
    re.IgnoreCase = True
    re.Global = True
    highlightStr = re.Replace(haystack, "<span style='background-color:khaki;'>$&</span>")
End Function
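The same single-pass idea can be sketched in Python, where re.escape takes care of every regular expression metacharacter. This is an illustrative port rather than the answerer's code: the function name highlight and the choice to try longer terms first are assumptions made for this sketch.

```python
import re

SPAN_OPEN = "<span style='background-color:khaki;'>"
SPAN_CLOSE = "</span>"


def highlight(haystack, needles):
    """Wrap every match of the |-separated search terms in a span."""
    # Split on "|", drop the exact-search quotes and any empty terms
    terms = [t.replace('"', "") for t in needles.split("|")]
    terms = [t for t in terms if t]
    if not haystack or not terms:
        return haystack
    # Longer terms first, so a phrase wins over a shorter term it contains
    terms.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    # A single substitution pass: the inserted markup is never searched again
    return pattern.sub(lambda m: SPAN_OPEN + m.group(0) + SPAN_CLOSE, haystack)


if __name__ == "__main__":
    print(highlight("The man went outside", 'went|"an WeNT o"|a|t'))
```

Because all terms are combined into one alternation and applied once, a short term such as "a" cannot match inside the span tags added for an earlier term.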

How to check if the variable value in AWK script is null or empty?

I am using AWK script to process some logs.
At one place I need to check if the variable value is null or empty to make some decision.
Any Idea how to achieve the same?
awk '
{
{
split($i, keyVal, "#")
key=keyVal[1];
val=keyVal[2];
if(val ~ /^ *$/)
val="Y";
}
}
' File
I have tried with
1) if(val == "")
2) if(val ~ /^ *$/)
not working in both cases.
The comparison with "" should have worked, so that's a bit odd.
As one more alternative, you could use the length() function: if it returns zero, your variable is null/empty. E.g.,
if (length(val) == 0)
Also, perhaps the built-in variable NF (number of fields) could come in handy? Since we don't have access to your input data it's hard to say, but it is another possibility.
You can directly use the variable without comparison, an empty/null/zero value is considered false, everything else is true.
See here :
# setting default tag if not provided
if (! tag) {
tag="default-tag"
}
So this script will have the variable tag set to default-tag unless the user calls it like this:
$ awk -v tag=custom-tag -f script.awk targetFile
This is true as of :
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
It works just fine for me
$ awk 'BEGIN{if(val==""){print "null or empty"}}'
null or empty
You can't differentiate between a variable being empty and being null: when you access an "unset" variable, awk just initializes it with a default value (here it is "", the empty string). You can use some sort of workaround, for example setting a val_accessed variable to 0 and then to 1 when you access it. A simpler (somewhat "hackish") approach is to set val to "uninitialized" (or to some other value which can't appear when running your program).
PS: your script looks strange to me, what are the nested brackets for?
I accidentally discovered this lesser-used gawk-specific function that could help differentiate:
****** gawk-only ******
BEGIN {
$0 = "abc"
print NF, $0
test_function()
test_function($(NF + 1))
test_function("")
test_function($0)
}
function test_function(_) { print typeof(_) }
1 abc
untyped
unassigned
string
string
So it seems, for non-numeric-like data :
absolutely no input to function at all : untyped
non-existent or empty field, including $0 : unassigned
any non-numeric-appearing string, including "" : string
Here's the chaotic part - numeric data :
strangely enough, for absolutely identical input, only differing between using $0 vs. $1 in function call, you frequently get a different value for typeof()
even a combination of both leading and trailing spaces doesn't prevent gawk from identifying it as strnum
[123]:NF:1
$0 = number:123 $1 = strnum:123 +$1 = number:123
[ 456.33]:NF:1
$0 = string: 456.33 $1 = strnum:456.33 +$1 = number:456.33000
[ 19683 ]:NF:1
$0 = string: 19683 $1 = strnum:19683 +$1 = number:19683
[-20.08554]:NF:1
$0 = number:-20.08554 $1 = strnum:-20.08554 +$1 = number:-20.08554
+/- inf/nan (same for all 4):
[-nan]:NF:1
$0 = string:-nan $1 = strnum:-nan +$1 = number:-nan
this one is a string because it was made from sprintf() :
[0x10FFFF]:NF:1
$0 = string:0x10FFFF $1 = string:0x10FFFF +$1 = number:0
using -n / --non-decimal-data flag, all stays same except
[0x10FFFF]:NF:1
$0 = string:0x10FFFF $1 = strnum:0x10FFFF +$1 = number:1114111
Long story short, if you want your gawk function to be able to differentiate between
empty-string input (""), versus
actually no input at all
e.g. when the original intention is to directly apply changes to $0
then typeof(x) == "untyped" seems to be the most reliable indicator.
It gets worse when comparing null-string padding against a non-empty string of all zeros:
function __(_) { return (!_) ":" (!+_) }
function ___(_) { return (_ == "") }
function ____(_) { return (!_) ":" (!""_) }
$0--->[ "000" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
___($0)-->{ $0=="" }-->[ 0 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 1:1000 ]
$0--->[ "" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 1:1 ]
___($0)-->{ $0=="" }-->[ 1 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 1:1 ]
$0--->[ " -0.0 -0" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
___($0)-->{ $0=="" }-->[ 0 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 0:1 -0.0 -0 ]
$0--->[ " 0x5" ] | __(""$0)-->{ !(""$0) : !+(""$0) }-->[ 0:1 ]
___($0)-->{ $0=="" }-->[ 0 ] | ____($0)-->{ ! $0 : (!""$0) }-->[ 0:1 0x5 ]

How to gather characters usage statistics in text file using Unix commands?

I have got a text file created using OCR software, about one megabyte in size.
Some uncommon characters appear all over the document and most of them are OCR errors.
I would like to find all characters used in the document to easily spot errors (like the uniq command, but for characters rather than lines).
I am on Ubuntu.
What Unix command should I use to display all characters used in a text file?
This should do what you're looking for:
cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c
The premise is that the sed puts each character in the file onto a line by itself, then the usual sort | uniq -c sequence strips out all but one of each unique character that occurs, and provides counts of how many times each occurred.
Also, you could append | sort -n to the end of the whole sequence to sort the output by how many times each character occurred. Example:
$ echo hello | sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n
1
1 e
1 h
1 o
2 l
This will do it:
#!/usr/bin/perl -n
#
# charcounts - show how many times each code point is used
# Tom Christiansen <tchrist@perl.com>
use open ":utf8";
++$seen{ ord() } for split //;
END {
for my $cp (sort {$seen{$b} <=> $seen{$a}} keys %seen) {
printf "%04X %d\n", $cp, $seen{$cp};
}
}
Run on itself, that program produces:
$ charcounts /tmp/charcounts | head
0020 46
0065 20
0073 18
006E 15
000A 14
006F 12
0072 11
0074 10
0063 9
0070 9
If you want the literal character and/or name of the character, too, that’s easy to add.
If you want something more sophisticated, this program figures out characters by Unicode property. It may be enough for your purposes, and if not, you should be able to adapt it.
#!/usr/bin/perl
#
# unicats - show character distribution by Unicode character property
# Tom Christiansen <tchrist@perl.com>
use strict;
use warnings qw<FATAL all>;
use open ":utf8";
my %cats;
our %Prop_Table;
build_prop_table();
if (@ARGV == 0 && -t STDIN) {
warn <<"END_WARNING";
$0: reading UTF-8 character data directly from your tty
\tSo please type stuff...
\t and then hit your tty's EOF sequence when done.
END_WARNING
}
while (<>) {
for (split(//)) {
$cats{Total}++;
if (/\p{ASCII}/) { $cats{ASCII}++ }
else { $cats{Unicode}++ }
my $gcat = get_general_category($_);
$cats{$gcat}++;
my $subcat = get_general_subcategory($_);
$cats{$subcat}++;
}
}
my $width = length $cats{Total};
my $mask = "%*d %s\n";
for my $cat(qw< Total ASCII Unicode >) {
printf $mask, $width => $cats{$cat} || 0, $cat;
}
print "\n";
my @catnames = qw[
L Lu Ll Lt Lm Lo
N Nd Nl No
S Sm Sc Sk So
P Pc Pd Ps Pe Pi Pf Po
M Mn Mc Me
Z Zs Zl Zp
C Cc Cf Cs Co Cn
];
#for my $cat (sort keys %cats) {
for my $cat (@catnames) {
next if length($cat) > 2;
next unless $cats{$cat};
my $prop = length($cat) == 1
? ( " " . q<\p> . $cat )
: ( q<\p> . "{$cat}" . "\t" )
;
my $desc = sprintf("%-6s %s", $prop, $Prop_Table{$cat});
printf $mask, $width => $cats{$cat}, $desc;
}
exit;
sub get_general_category {
local $_ = shift(); # lexical "my $_" was removed in Perl 5.24
return "L" if /\pL/;
return "S" if /\pS/;
return "P" if /\pP/;
return "N" if /\pN/;
return "C" if /\pC/;
return "M" if /\pM/;
return "Z" if /\pZ/;
die "not reached one: $_";
}
sub get_general_subcategory {
local $_ = shift(); # lexical "my $_" was removed in Perl 5.24
return "Lu" if /\p{Lu}/;
return "Ll" if /\p{Ll}/;
return "Lt" if /\p{Lt}/;
return "Lm" if /\p{Lm}/;
return "Lo" if /\p{Lo}/;
return "Mn" if /\p{Mn}/;
return "Mc" if /\p{Mc}/;
return "Me" if /\p{Me}/;
return "Nd" if /\p{Nd}/;
return "Nl" if /\p{Nl}/;
return "No" if /\p{No}/;
return "Pc" if /\p{Pc}/;
return "Pd" if /\p{Pd}/;
return "Ps" if /\p{Ps}/;
return "Pe" if /\p{Pe}/;
return "Pi" if /\p{Pi}/;
return "Pf" if /\p{Pf}/;
return "Po" if /\p{Po}/;
return "Sm" if /\p{Sm}/;
return "Sc" if /\p{Sc}/;
return "Sk" if /\p{Sk}/;
return "So" if /\p{So}/;
return "Zs" if /\p{Zs}/;
return "Zl" if /\p{Zl}/;
return "Zp" if /\p{Zp}/;
return "Cc" if /\p{Cc}/;
return "Cf" if /\p{Cf}/;
return "Cs" if /\p{Cs}/;
return "Co" if /\p{Co}/;
return "Cn" if /\p{Cn}/;
die "not reached two: <$_> " . sprintf("U+%vX", $_);
}
sub build_prop_table {
for my $line (<<"End_of_Property_List" =~ m{ \S .* \S }gx) {
L Letter
Lu Uppercase_Letter
Ll Lowercase_Letter
Lt Titlecase_Letter
Lm Modifier_Letter
Lo Other_Letter
M Mark (combining characters, including diacritics)
Mn Nonspacing_Mark
Mc Spacing_Mark
Me Enclosing_Mark
N Number
Nd Decimal_Number (also Digit)
Nl Letter_Number
No Other_Number
P Punctuation
Pc Connector_Punctuation
Pd Dash_Punctuation
Ps Open_Punctuation
Pe Close_Punctuation
Pi Initial_Punctuation (may behave like Ps or Pe depending on usage)
Pf Final_Punctuation (may behave like Ps or Pe depending on usage)
Po Other_Punctuation
S Symbol
Sm Math_Symbol
Sc Currency_Symbol
Sk Modifier_Symbol
So Other_Symbol
Z Separator
Zs Space_Separator
Zl Line_Separator
Zp Paragraph_Separator
C Other (means not L/N/P/S/Z)
Cc Control (also Cntrl)
Cf Format
Cs Surrogate (not usable)
Co Private_Use
Cn Unassigned
End_of_Property_List
my($short_prop, $long_prop) = $line =~ m{
\b
( \p{Lu} \p{Ll} ? )
\s +
( \p{Lu} [\p{L&}_] + )
\b
}x;
$Prop_Table{$short_prop} = $long_prop;
}
}
For example:
$ unicats book.txt
2357232 Total
2357199 ASCII
33 Unicode
1604949 \pL Letter
74455 \p{Lu} Uppercase_Letter
1530485 \p{Ll} Lowercase_Letter
9 \p{Lo} Other_Letter
10676 \pN Number
10676 \p{Nd} Decimal_Number
19679 \pS Symbol
10705 \p{Sm} Math_Symbol
8365 \p{Sc} Currency_Symbol
603 \p{Sk} Modifier_Symbol
6 \p{So} Other_Symbol
111899 \pP Punctuation
2996 \p{Pc} Connector_Punctuation
6145 \p{Pd} Dash_Punctuation
11392 \p{Ps} Open_Punctuation
11371 \p{Pe} Close_Punctuation
79995 \p{Po} Other_Punctuation
548529 \pZ Separator
548529 \p{Zs} Space_Separator
61500 \pC Other
61500 \p{Cc} Control
As far as using *nix commands, the answer above is good, but it doesn't get usage stats.
However, if you actually want stats (like the rarest used, median, most used, etc) on the file, this Python should do it.
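def get_char_counts(fname):
    """Return a dict mapping each character in the file to its count."""
    usage = {}
    # Read as UTF-8 so non-ASCII characters are counted per character, not per byte
    with open(fname, encoding="utf-8") as f:
        for c in f.read():
            usage[c] = usage.get(c, 0) + 1
    return usage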
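The function above returns raw counts only. As a rough sketch of how the statistics mentioned (rarest, median, most used) could be derived, the following uses collections.Counter and statistics.median from the standard library. The helper name char_stats and the shape of its return value are assumptions for illustration.

```python
from collections import Counter
from statistics import median


def char_stats(text):
    """Summarise character usage in a string (hypothetical helper)."""
    counts = Counter(text)
    if not counts:
        raise ValueError("text is empty")
    ranked = counts.most_common()  # (character, count) pairs, most used first
    return {
        "most_used": ranked[0],
        "rarest": ranked[-1],
        "median_count": median(counts.values()),
        "distinct": len(counts),
    }


if __name__ == "__main__":
    print(char_stats("hello world"))
```

For spotting OCR errors, the tail of counts.most_common() is usually the interesting part, since stray characters tend to occur only a handful of times.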
