Count the number of times a nested FLWOR loop runs in XQuery

If you had a nested FLWOR expression like this, with two 'for' clauses:
for $x at $xPos in (1, 2, 3)
for $y at $yPos in fn:doc("/foo")//b
return ()
How would you count exactly how many times the loop body ran? Assume 'fn:doc("/foo")//b' returns a sequence of arbitrary length, so here is an example run:
$xPos $yPos
1 1
1 2
1 3
2 1
2 2
3 1 <-- for loop ends here (total iterations = 6)
And another example run could be:
$xPos $yPos
1 1
1 2
2 1
2 2
2 3
3 1
3 2
3 3
3 4 <-- for loop ends here (total iterations = 9)
Hopefully you get my point. How do I keep and update a counter variable within the nested for loop, counting how many times the loop body has run, without having it reset on every iteration?
Clarification EDIT:
This question is based purely on curiosity about whether this is possible in XQuery. I know you can simply insert a let clause like the one below and keep track of $xPos, which is a simple matter:
for $x at $xPos in (1, 2, 3)
let $_ :=
    for $y at $yPos in fn:doc("/foo")//b
    return ()
return ()

In MarkLogic you can use xdmp:set to break out of the strict FLWOR paradigm.
let $count := 0
for $x at $xPos in (1, 2, 3)
for $y at $yPos in ("a", "b", "c")
let $set := xdmp:set($count, $count + 1)
return concat($count, ": ", $x, " ", $y)
Produces:
1: 1 a
2: 1 b
3: 1 c
4: 2 a
5: 2 b
6: 2 c
7: 3 a
8: 3 b
9: 3 c
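In imperative terms, the flattened counter is trivial; a minimal sketch in Python (for illustration only), assuming the same three-by-three iteration:

# One counter advancing once per innermost iteration, never resetting.
count = 0
for x in (1, 2, 3):
    for y in ("a", "b", "c"):
        count += 1
        print(f"{count}: {x} {y}")  # prints 1: 1 a ... 9: 3 c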

XQuery 3.0 introduced a count clause (https://www.w3.org/TR/xquery-31/#id-count):
for $x in (1 to 3)
for $y in //b
count $c
return ``[count `{$c}`: y: `{$y}`]``
https://xqueryfiddle.liberty-development.net/6qVRKvF for the input
<doc>
<b>a</b>
<b>b</b>
<b>c</b>
</doc>
returns the result
count 1: y: a
count 2: y: b
count 3: y: c
count 4: y: a
count 5: y: b
count 6: y: c
count 7: y: a
count 8: y: b
count 9: y: c
However, if I interpret https://docs.marklogic.com/guide/xquery/langoverview#id_11626 correctly, MarkLogic doesn't support the count clause.

How about just counting afterwards:
for $x at $xPos in (1, 2, 3)
let $c :=
    for $y at $yPos in fn:doc("/foo")//b
    return 1
return count($c)
This yields one count per value of $x; wrap the whole expression in fn:sum() if you want the grand total. HTH!

Related

How to mask specific row elements of a matrix in Julia?

I have a matrix A whose first column, A[:,1], holds Bus_id values; the Bus_ids are 1, 3, 4, and 6. For processing, I mapped the Bus_ids to consecutive row indices, as shown in the A_new matrix.
julia> A=[1 1 3;3 1 1; 4 1 7;6 1 1]
4×3 Array{Int64,2}:
1 1 3
3 1 1
4 1 7
6 1 1
julia> A_new
1 1 3
2 1 1
3 1 7
4 1 1
Now I have another matrix B, which contains some rows of matrix A. I wish to convert matrix B's Bus_ids in the same way, giving B_new. I don't know how to explain this problem better.
julia> B = [3 1 1; 6 1 1]
2×3 Array{Int64,2}:
3 1 1
6 1 1
julia> B_new
2 1 1
4 1 1
I have tried masking, but it works only for one element. Please help me find a way.
It is possible that you are using Bus_id as an index. If you want to renumber the bus IDs without losing track of transactions indexed by the original IDs, this fits naturally into a Dict that translates one Bus_id to another.
One problem immediately arises: what should happen if some entries in B have no translation from A, but already happen to equal one of A's new keys? Potential cross-linked database chaos! The new IDs need to be unique if at all possible, so I suggest making them negative.
If you use matrix A as your translation key (and assuming all entries in A[:,1] are unique; if not, the logic might need to drop duplicates first), the Dict usage looks like this:
A = [1 1 3; 3 1 1; 4 1 7; 6 1 1]
B = [3 1 1; 6 1 1]
function consecutive_row_indexing(mat)
    # map each original Bus_id in column 1 to a (negative) consecutive index
    dict = Dict{Int, Int}()
    for (i, n) in enumerate(mat[:, 1])
        dict[n] = -i
    end
    dict
end
function renumberbus_ids!(mat, dict)
    # rewrite column 1 in place wherever a translation exists
    for i in 1:size(mat)[1]
        if haskey(dict, mat[i, 1])
            mat[i, 1] = dict[mat[i, 1]]
        end
    end
    mat
end
d = consecutive_row_indexing(A)
println(renumberbus_ids!(A, d))
println(renumberbus_ids!(B, d))
Output:
[-1 1 3; -2 1 1; -3 1 7; -4 1 1]
[-2 1 1; -4 1 1]
If you still really want your B matrix with positive integers in its index column, just replace -i with i in the line dict[n] = -i of the code above.

Using grep in a nested for loop

I am trying to automate one of my simulations. I have two sets of data: the subject IDs of patients (187 rows long) and the sample IDs (3057 rows long). I would like to classify the sample IDs based on the subject.
For example: if the subject ID is ABCD, the samples taken from that subject will be ABCD-0001, ABCD-0002, and so on.
Now I am trying to use grep to check, for every subject ID, whether it is a substring of each sample ID. If it is, the row number returned by grep (the row in the sample IDs) should index into a new vector, and the value stored there should be the row number of the matching subject ID.
As in
SubID SampID
ABCD ABCD-0001
EFGH ABCD-0002
IJKL IJKL-0001
IJKL-0002
EFGH-0001
EFGH-0002
EFGH-0003
Desired Output
Numeric ID
1
1
3
3
2
2
2
I am using this code
j = 1:nrow(SubID)
i = 1:nrow(SampID)
for (val in j) {
  for (val in i) {
    if (length(k <- grep(SubID[j, 1], SampID[i, 1])) > 0) {
      l = as.numeric(unlist(k))
      Ind[l] = j
    }
  }
}
There are ways to solve this without using a for loop.
Data:
a = data.frame(subID = c("ab","cd","de"))
b = data.frame(SampID = c("ab-1","ab-2","de-1","de-2","cd-1","cd-2","cd-3"))
> a
subID
1 ab
2 cd
3 de
> b
SampID
1 ab-1
2 ab-2
3 de-1
4 de-2
5 cd-1
6 cd-2
7 cd-3
To obtain the corresponding index, first take the substring of the first two characters (two in my example; in yours it should run from 1 to 4, if all subject IDs have 4 letters):
f = substr(b$SampID,1,2)
b$num = sapply(f,function(x){which(x==a)})
Which gives:
> b
SampID num
1 ab-1 1
2 ab-2 1
3 de-1 3
4 de-2 3
5 cd-1 2
6 cd-2 2
7 cd-3 2
Edit: different ID lengths
If the subject IDs in a have different lengths, you can do it with only one for loop. Try this:
a = data.frame(subID = c("ab","cd","def"))
b = data.frame(SampID = c("ab-1","ab-2","def-1","def-2","cd-1","cd-2","cd-3"))
b$num = 0
for (k in 1:length(a$subID)) {
  b$num[grepl(pattern = a$subID[k], x = b$SampID)] = k
}
In this case, we loop through every element of a and use grepl to find the SampID entries containing that pattern, assigning the loop index to the rows that match.
New Results:
> b
SampID num
1 ab-1 1
2 ab-2 1
3 def-1 3
4 def-2 3
5 cd-1 2
6 cd-2 2
7 cd-3 2
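The same idea sketched in Python (a minimal sketch, reusing the example data above): for each sample ID, find the subject ID it starts with and record that subject's 1-based index.

sub_ids = ["ab", "cd", "def"]
samp_ids = ["ab-1", "ab-2", "def-1", "def-2", "cd-1", "cd-2", "cd-3"]

# For each sample ID, record the 1-based index of its subject ID prefix.
num = [next(k + 1 for k, sub in enumerate(sub_ids) if s.startswith(sub + "-"))
       for s in samp_ids]
print(num)  # [1, 1, 3, 3, 2, 2, 2]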

Cumulative sum conditional over multiple columns in r dataframe containing the same values

Say my data.frame is as outlined below:
df <- as.data.frame(cbind("Home" = c("a","c","e","b","e","b"),
                          "Away" = c("b","d","f","c","a","f")))
df$Index<-rep(1,nrow(df))
Home Away Index
1 a b 1
2 c d 1
3 e f 1
4 b c 1
5 e a 1
6 b f 1
What I want to do is calculate a cumulative count, using the Index column, for each character a to f, regardless of whether it appears in the Home or Away column. A column called Cumulative_Sum_Home takes the character in the Home column ("b" in the case of row 6) and counts how many times it has appeared in either the Home or Away column in all rows up to and including that row. Here "b" has appeared 3 times cumulatively in the first 6 rows, so Cumulative_Sum_Home takes the value 3. The same logic applies to the Cumulative_Sum_Away column: in row 5, the character "a" appears in the Away column and has cumulatively appeared 2 times in either column up to that row, so Cumulative_Sum_Away takes the value 2.
Home Away Index Cumulative_Sum_Home Cumulative_Sum_Away
1 a b 1 1 1
2 c d 1 1 1
3 e f 1 1 1
4 b c 1 2 2
5 e a 1 2 2
6 b f 1 3 2
I have to confess to being totally stumped as to how to solve this problem. I've tried looking at the data.table approaches, but I've never used that package before so I can't immediately see how to solve it. Any tips would be greatly received.
There is scope to make this leaner, but if that doesn't matter much for you, this should be okay.
NewColumns = list()
for ( i in sort(unique(c(levels(df[,"Home"]),levels(df[,"Away"]))))) {
  NewColumnAddition = i == df$Home | i == df$Away
  NewColumnAddition[NewColumnAddition] = cumsum(NewColumnAddition[NewColumnAddition])
  NewColumns[[i]] = NewColumnAddition
}
df$Cumulative_Sum_Home = sapply(
  seq(nrow(df)),
  function(i) {
    NewColumns[[as.character(df[i, "Home"])]][i]
  }
)
df$Cumulative_Sum_Away = sapply(
  seq(nrow(df)),
  function(i) {
    NewColumns[[as.character(df[i, "Away"])]][i]
  }
)
> df
Home Away Index Cumulative_Sum_Home Cumulative_Sum_Away
1 a b 1 1 1
2 c d 1 1 1
3 e f 1 1 1
4 b c 1 2 2
5 e a 1 2 2
6 b f 1 3 2
Here's a data.table alternative -
setDT(df)
for (i in sort(unique(c(levels(df[, Home]), levels(df[, Away]))))) {
  df[, TotalSum := cumsum(i == Home | i == Away)]
  df[Home == i, Cumulative_Sum_Home := TotalSum]
  df[Away == i, Cumulative_Sum_Away := TotalSum]
}
df[,TotalSum := NULL]
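The underlying logic is just a running tally per character. A minimal Python sketch, assuming the example data above:

from collections import defaultdict

home = ["a", "c", "e", "b", "e", "b"]
away = ["b", "d", "f", "c", "a", "f"]

seen = defaultdict(int)  # running count of appearances in either column
for h, a in zip(home, away):
    seen[h] += 1
    seen[a] += 1
    # seen[h] and seen[a] are the two cumulative-sum columns for this row
    print(h, a, seen[h], seen[a])  # last row prints: b f 3 2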

Summarize the self-join index while avoiding cartesian product in R data.table

With a 2-column data.table, I'd like to summarize the pairwise relationships in column 1 by summing the number of shared elements in column 2. In other words, how many shared Y elements does each pairwise combination of X-values have?
For example, I can do this in a 2-step process, first doing a cartesian cross join, then summarizing it like so:
d = data.table(X=c(1,1,1,2,2,2,2,3,3,3,4,4), Y=c(1,2,3,1,2,3,4,1,5,6,4,5))
setkey(d, Y)
d2 = d[d, allow.cartesian=TRUE]
d2[, .N, by=c("X", "i.X")]
# X i.X N
#1: 1 1 3
#2: 2 1 3
#3: 3 1 1
#4: 1 2 3
#5: 2 2 4
#6: 3 2 1
#7: 1 3 1
#8: 2 3 1
#9: 3 3 3
#10: 4 2 1
#11: 2 4 1
#12: 4 4 2
#13: 4 3 1
#14: 3 4 1
The second row of this result indicates that X=1 shares 3 Y-values with X=2, while X=3 shares only 1 Y-value with X=4.
Is there any way to do this while bypassing the cartesian join step, which leads to large inefficient tables? I want to do something like this on a table with millions of rows, and the cartesian join runs into the 2^31 vector size limit (in addition to becoming slow).
I'm imagining something like this:
d[d, list(X, length(Y)), by=c("X", "i.X")]
But this gives the error i.X not found
I can do this in SQL with the code below, but I just can't figure out how to translate it into data.table syntax:
CREATE TABLE test (X integer, Y integer);
INSERT INTO test VALUES(1, 1);
INSERT INTO test VALUES(1, 2);
INSERT INTO test VALUES(1, 3);
INSERT INTO test VALUES(2, 1);
INSERT INTO test VALUES(2, 2);
INSERT INTO test VALUES(2, 3);
INSERT INTO test VALUES(2, 4);
INSERT INTO test VALUES(3, 1);
INSERT INTO test VALUES(3, 5);
INSERT INTO test VALUES(3, 6);
INSERT INTO test VALUES(4, 4);
INSERT INTO test VALUES(4, 5);
SELECT A.X, B.X, COUNT(A.Y) as N FROM test as A JOIN test as B WHERE A.Y==B.Y GROUP BY A.X, B.X;
The point is that the column I want to summarize is the same as the column I am joining on. This question is similar to these, but not exactly:
R Data.Table Join on Conditionals
How to self join a data.table on a condition
The key difference being that I want to summarize the index column, which seems impossible to do with by=.EACHI.
If you can split your Y's into groups that don't have a large intersection of X's, you could do the computation by those groups first, resulting in a smaller intermediate table:
d[, grp := Y <= 3] # this particular split works best for OP data
d[, .SD[.SD, allow = T][, .N, by = .(X, i.X)], by = grp][,
.(N = sum(N)), by = .(X, i.X)]
The intermediate table above has only 16 rows, as opposed to 26. Unfortunately I can't think of an easy way to create such grouping automatically.
How about this one using foverlaps()? The more consecutive values of Y you have for each X, the fewer rows this will produce compared to a cartesian join.
d = data.table(X=c(1,1,1,2,2,2,2,3,3,3,4,4), Y=c(1,2,3,1,2,3,4,1,5,6,4,5))
setorder(d, X)
d[, id := cumsum(c(0L, diff(Y)) != 1L), by=X]
dd = d[, .(start=Y[1L], end=Y[.N]), by=.(X,id)][, id := NULL][]
ans <- foverlaps(dd, setkey(dd, start, end))
ans[, count := pmin(abs(i.end-start+1L), abs(end-i.start+1L),
abs(i.end-i.start+1L), abs(end-start+1L))]
ans[, .(count = sum(count)), by=.(X, i.X)][order(i.X, X)]
# X i.X count
# 1: 1 1 3
# 2: 2 1 3
# 3: 3 1 1
# 4: 1 2 3
# 5: 2 2 4
# 6: 3 2 1
# 7: 4 2 1
# 8: 1 3 1
# 9: 2 3 1
# 10: 3 3 3
# 11: 4 3 1
# 12: 2 4 1
# 13: 3 4 1
# 14: 4 4 2
Note: make sure X and Y are integers for faster results. This is because joins on integer types are faster than on double types (foverlaps performs binary joins internally).
You can make this more memory efficient by using which=TRUE in foverlaps() and using the indices to generate count in the next step.
You already have a solution written in SQL, so I suggest the R package sqldf.
Here's the code:
library(sqldf)
result <- sqldf("SELECT A.X, B.X, COUNT(A.Y) as N FROM test as A JOIN test as B WHERE A.Y==B.Y GROUP BY A.X, B.X")
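Conceptually, all the answers above compute the same thing. A minimal Python sketch of that computation, grouping on Y first so pairs are only enumerated within each Y group rather than across the whole table:

from collections import defaultdict
from itertools import product

# (X, Y) pairs from the example table d.
rows = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3),
        (2, 4), (3, 1), (3, 5), (3, 6), (4, 4), (4, 5)]

by_y = defaultdict(list)  # Y value -> list of X values sharing it
for x, y in rows:
    by_y[y].append(x)

counts = defaultdict(int)  # (X, X') -> number of shared Y values
for xs in by_y.values():
    for a, b in product(xs, xs):
        counts[(a, b)] += 1

print(counts[(1, 2)])  # 3: X=1 and X=2 share Y values 1, 2 and 3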

KMP failure function calculation

My professor computed the KMP failure function as follows:
index 1 2 3 4 5 6 7 8 9
string a a b a a b a b b
ff 0 1 2 1 2 3 4 5 1
From other texts I checked online, I found that it might be wrong. I went back to confirm with him and he told me he is absolutely right. Can someone please explain, step by step, why it is right or wrong? Thanks.
As I understand the algorithm, the failure function for your example should be the following:
1 2 3 4 5 6 7 8 9
a a b a a b a b b
0 1 0 1 2 3 4 0 0
f = failure function (by definition, the length of the longest proper prefix of the string that is also a suffix of it)
Here is how I built it, step by step:
f(a) = 0 (always = 0 for one letter)
f(aa) = 1 (one letter 'a' is both a prefix and suffix)
f(aab) = 0 (no prefix equals a suffix: a != b, aa != ab)
f(aaba) = 1 ('a' is the same in the beginning and the end, but if you take 2 letters, they won't be equal: aa != ba)
f(aabaa) = 2 ( you can take 'aa' but no more: aab != baa)
f(aabaab) = 3 ( you can take 'aab')
f(aabaaba) = 4 ( you can take 'aaba')
f(aabaabab) = 0 ('a' != 'b', 'aa' != 'ab', and so on; it can't be 5 either, since 'aabaa' != 'aabab')
f(aabaababb) = 0 (the same situation)
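These values can also be computed mechanically; a minimal Python sketch of the standard failure-function computation (0-indexed):

def failure_function(s):
    # f[i] = length of the longest proper prefix of s[:i+1]
    # that is also a suffix of s[:i+1]
    f = [0] * len(s)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = f[k - 1]  # fall back to the next shorter border
        if s[i] == s[k]:
            k += 1
        f[i] = k
    return f

print(failure_function("aabaababb"))  # [0, 1, 0, 1, 2, 3, 4, 0, 0]

This reproduces the table above, shifted to 0-indexing.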
Since #user1041889 was confused (and got me confused too), I'll lay out the differences between the Z-function and the failure function.
Failure function, π[i]:
Is the mapping of an index to the length of the longest proper prefix of the string which is also a suffix
But that's arguably Chinese so I'll dumb it down in order to actually understand what I'm saying:
How big is the longest substring at the beginning of the string of interest that is equal to the substring ending at index i?
Or equivalently:
What is the length of the biggest substring ending at index i which matches the start of the string of interest?
So in your example:
index 1 2 3 4 5 6 7 8 9
string a a b a a b a b b
ff 0 1 0 1 2 3 4 0 0
We observe that π[6] = 3, so what's the substring that ends at index 6 with length 3? aab!
Interesting how we've seen that before!
Let's check that it is indeed the biggest one: baab != aaba. Yup!
Notice how this implies that the failure function only ever grows by at most one from one index to the next.
That isn't the case for the Z-algorithm.
