Issue while inverting MarkLogic map - XQuery

While inverting a MarkLogic map, the keys and values get swapped, but duplicate values get de-duplicated in the process. How can I avoid that?

I'm afraid you can't. A MarkLogic map:map is a hash table, so keys are unique. When inverting, it will merge the keys of duplicate values:
(: the unary '-' operator on a map:map inverts it: values become keys :)
-map:new((
  map:entry("a", (1, 2)),
  map:entry("b", (2, 3))
))
Depending on what you want to achieve, you might just want to iterate over the map:map instead.
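To illustrate that alternative (sketched in Python rather than XQuery, purely to show the idea), you can invert by hand and collect clashing keys into a list per value instead of letting them merge:

original = {"a": [1, 2], "b": [2, 3]}
inverted = {}
for key, values in original.items():
    for value in values:
        # append instead of overwriting, so keys sharing a value are all kept
        inverted.setdefault(value, []).append(key)
print(inverted)  # {1: ['a'], 2: ['a', 'b'], 3: ['b']}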
HTH!

Related

Is there a way of specifying the precision and scale of a numeric variable when writing a table using ROracle?

I'm trying to write a table to an Oracle database using the ROracle package. This works fine; however, all of the numeric values show their full floating-point decimal representation in the database. For instance, 7581.24 shows up as 7581.2399999999998.
Is there a way of specifying the number of digits to be stored after the decimal point when writing the table?
I found a workaround using Allan's solution here, but it would be better not to have to change the variable after writing it to the database.
Currently I write the table with code like this:
dbWriteTable(db_connection, "TABLE_NAME", table, overwrite = TRUE)
Thanks in advance.
It's not elegant, but it is arguably good practice to make the types and precisions explicit. I did it with something like:
# drop and recreate the table with explicit column types and scale
if (dbExistsTable(con, "TABLE_NAME")) dbRemoveTable(con, "TABLE_NAME")
create_table <- "create table TABLE_NAME(
  ID VARCHAR2(100),
  VALUE NUMBER(6,2)
)"
dbGetQuery(con, create_table)
# bulk-insert the data frame through bind variables
ins_str <- "insert into TABLE_NAME values(:1, :2)"
dbGetQuery(con, ins_str, df)
dbCommit(con)
Essentially, it creates the table, specifying the type and precision for each column, and then fills in the values from the data frame (df) in R. You just have to be careful that everything matches up in terms of the columns. If you declare an Oracle column with scale 2 (e.g. VALUE NUMBER(6,2)) and then push a value from R with more decimals, it will be rounded to the declared scale (2 in this example), not truncated. So df$value = 3.1415 in R would become VALUE 3.14 in the Oracle table.

Best index to cover a mix of exact match and less/greater than query in sqlite

I have a table which I need to filter on (using sqlite), it has 3 fields in the query:
WHERE x <= 'something' AND y = 'something' AND z = 'SOMETHING ELSE'
ORDER BY x DESC
I was wondering what's the best index to cover this query.
I have tried a few, for example:
CREATE INDEX idx_x_y_z ON user_messages(x, y, z);
CREATE INDEX idx_y_z ON user_messages(y, z);
but the best I can get is:
SEARCH TABLE table USING INDEX idx_y_z
USE TEMP B-TREE FOR ORDER BY
Is that optimal, or can I avoid the USE TEMP B-TREE FOR ORDER BY?
Reading https://explainextended.com/2009/04/01/choosing-index/ it seems to be the case, but since the query is slightly different (we order by a field we also filter on), I was wondering if it is maybe not exactly the same.
Also, I am struggling to find good resources on this; a lot of material addresses the most common scenarios, and it's a bit harder to find more in-depth resources. Do you have any suggestions?
Thanks!
UPDATE:
Turns out there was another issue; I had oversimplified the schema in the original question.
One of the fields had a type of BOOLEAN, and I was matching it using the IS FALSE operator, which would return the right number of rows, while = 0 would not, for some reason.
When querying with = it would not use a TEMP B-TREE, while it would when using IS FALSE.
To address this I just created an index excluding the BOOLEAN field, and the TEMP B-TREE was no longer used, only SEARCH TABLE.
From the query planner documentation:
Then the index might be used if the initial columns of the index (columns a, b, and so forth) appear in WHERE clause terms. The initial columns of the index must be used with the = or IN or IS operators. The right-most column that is used can employ inequalities.
So since your WHERE has two exact comparisons and one less-than-or-equal, the inequality column should come last in the index for best effect:
CREATE INDEX idx_y_z_x ON user_messages(y, z, x);
Using that index with a query with your WHERE terms:
sqlite> EXPLAIN QUERY PLAN SELECT * FROM user_messages
...> WHERE x <= 'something' AND y = 'something' AND z = 'something'
...> ORDER BY x DESC;
QUERY PLAN
`--SEARCH TABLE user_messages USING INDEX idx_y_z_x (y=? AND z=? AND x<?)
As you can see, it fully uses the index, with no temporary table needed for sorting the results.
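If you want to reproduce the check yourself, here is a minimal sketch using Python's built-in sqlite3 module (the table shape is assumed from the question):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_messages (x TEXT, y TEXT, z TEXT)")
con.execute("CREATE INDEX idx_y_z_x ON user_messages(y, z, x)")
# EXPLAIN QUERY PLAN reports how sqlite intends to execute the query
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM user_messages "
    "WHERE x <= ? AND y = ? AND z = ? ORDER BY x DESC",
    ("something", "something", "something else"),
).fetchall()
for row in plan:
    print(row)  # expect USING INDEX idx_y_z_x and no USE TEMP B-TREE line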
More essential reading about how sqlite uses indexes can be found in the query planner documentation quoted above (https://www.sqlite.org/queryplanner.html).

Gremlin select multiple vertices gives an output without the properties with null values

In order to get all data from two vertices a and b, I used the following:
g.V('xxx').out('hasA').as('X').out('hasB').as('Y').select('X','Y')
I get values of X where the value of Y isn't null. I wanted to get all X whether the value of Y is null or not.
Any ideas as to how I can tweak the above query?
I'm not sure that this matters to you any more, but to directly answer your question: you need to deal with the chance that there are no "hasB" edges. You might do that with coalesce() in the following fashion:
g.V('xxx').out('hasA').as('X').
coalesce(out('hasB'),constant('n/a')).as('Y').
select('X','Y')

How do you sort a dictionary by its values and then sum these values up to a certain point?

I was wondering what the best method would be to sort a dictionary of type Dict{String, Int} based on the value. I loop over a FASTQ file containing multiple sequence records; each record has a string identifier which serves as key, and another string whose length I take as the value for that key.
For example:
testdict["ee0a"]=length("aatcg")
testdict["002e4"]=length("aatcgtga")
testdict["12-f9"]=length(aatcgtgacgtga")
In this case the key-value pairs would be "ee0a" => 5, "002e4" => 8, and "12-f9" => 13.
What I want to do is sort these pairs from the highest value to the lowest, after which I sum the values into a separate variable until that variable passes a certain threshold. I then need to save the keys I used so I can use them later on.
Is it possible to use the sort() function or a SortedDict to achieve this? I would imagine that if the sorting succeeded, I could use a while loop to add my keys to a list and my values to a separate variable until it is greater than my threshold, and then use the list of keys to create a new dictionary with my selected key-value pairs.
However, what would be the fastest way to do this? The FASTQ files I read in can contain multiple GBs worth of data, so I'd love to create a sorted dictionary while reading in the file and select the records I want before doing anything else with the data.
If your file holds multiple GBs worth of data, I would avoid storing it all in a Dict in the first place. I think it is better to process the file sequentially and store the keys that meet your condition in a PriorityQueue from the DataStructures.jl package. Of course you can apply the same procedure if you read the data from a dictionary in memory (the source simply changes from a disk file to the dictionary).
Here is a sketch of what you could consider (a full solution would depend on how you read your data, which you did not specify; below, source is assumed to yield key-value pairs).
Assume that you want to store elements until they exceed a threshold kept in the THRESH constant.
using DataStructures

pq = PriorityQueue{String, Int}()  # a min-heap: peek/dequeue give the smallest value
s = 0
for (key, value) in source
    # this check avoids adding a key-value pair for which we are sure
    # that it is not interesting
    if s <= THRESH || value > peek(pq)[2]
        enqueue!(pq, key, value)
        s += value
        # if we added something to the queue we have to check
        # whether we should drop the smallest elements from it
        while s - peek(pq)[2] > THRESH
            s -= dequeue_pair!(pq)[2]
        end
    end
end
After this process pq will hold only the key-value pairs you are interested in. The key benefit of this approach is that you never need to store the whole data set in RAM: at any point in time you only hold the key-value pairs that would be selected at this stage of processing.
Observe that this process does not give you an easily predictable result, because several keys might have the same value; if that value falls on the cutoff border, you do not know which of them would be retained (however, you did not specify what you want to do in this special case; if you specify the requirement, the algorithm can be updated accordingly).
If you have enough memory to hold at least one or two full Dicts of the required size, you can use an inverted Dict with the length as key and an array of the old keys as value, so that duplicate lengths mapping to the same key do not lose data.
I think that the code below is then what your question was leading toward:
d1 = Dict("a" => 1, "b" => 2, "c" => 3, "d" => 2, "e" => 1, "f" => 5)
d2 = Dict{Int, Vector{String}}()
for (k, v) in d1
    # group all original keys that share the same value, instead of overwriting
    push!(get!(d2, v, String[]), k)
end
println(d1)
println(d2)
for k in sort(collect(keys(d2)))
    print("$k, $(d2[k]); ")
    # here you can delete keys under a threshold to speed further processing
end
If you don't have enough memory to hold an entire Dict, you may benefit from first putting the data into a SQL database like SQLite and then doing queries instead of modifying a Dict in memory. In that case, one column of the table will be the data, and you would add a column for the data length to the SQLite table. Or you can use a PriorityQueue as in the answer above.
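As a rough sketch of that database route (shown here with Python's built-in sqlite3 and made-up table and column names), you would store the length alongside each record, let the database sort, and accumulate until the threshold is passed:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (id TEXT, seq TEXT, len INTEGER)")
# the three example records from the question
for rec_id, seq in [("ee0a", "aatcg"), ("002e4", "aatcgtga"), ("12-f9", "aatcgtgacgtga")]:
    con.execute("INSERT INTO records VALUES (?, ?, ?)", (rec_id, seq, len(seq)))

THRESH = 15  # made-up threshold for the example
total, selected = 0, []
# walk the records from longest to shortest, keeping ids until the sum passes THRESH
for rec_id, length in con.execute("SELECT id, len FROM records ORDER BY len DESC"):
    selected.append(rec_id)
    total += length
    if total > THRESH:
        break
print(selected, total)  # ['12-f9', '002e4'] 21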

How do I divide a SQL variable into 2?

I have a field in SQL named address which is 80 characters long.
I want to put this field into 2 fields, addr1 and addr2, of 40 characters each.
How do I do it?
This is for T-SQL, but it can't be much different for PL/SQL:
declare @yourVar varchar(80)
select substring(@yourVar, 1, 40), substring(@yourVar, 41, 40)
For PL/SQL it's substr(), so: select substr(addr, 1, 40) as addr1, substr(addr, 41) as addr2 from ...
I think your schema would be better off if you altered that table to have two columns instead of one. I'd prefer that solution to parsing the current value.
Brute-force chopping the 80-character value at position 40 runs the risk of breaking in the middle of a word. You might want to do the following instead:
Replace all runs of whitespace with a single blank.
Find the last blank at or before position 40.
Place everything before that blank in the first result field.
Place everything after that blank in the second result field.
The exact details of the operations above will depend on what tools are available to you (e.g. SQL only, or reading from one DB and writing to another using a separate program, etc.)
There is the possibility that the 80-character value may be filled in such a way that breaking between "words" will require one of the result values to be more than 40 characters long to avoid truncation.
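The steps above are straightforward to express in a separate program; here is a minimal Python sketch (one assumed approach, not the only way to do it):

def split_address(address, width=40):
    # collapse runs of whitespace into single blanks
    normalized = " ".join(address.split())
    if len(normalized) <= width:
        return normalized, ""
    # find the last blank at or before the cut position
    cut = normalized.rfind(" ", 0, width + 1)
    if cut == -1:
        # no blank to break on: fall back to a hard cut mid-word
        return normalized[:width], normalized[width:]
    # everything before the blank goes to addr1, the rest to addr2
    # (addr2 may still exceed `width`, as noted above)
    return normalized[:cut], normalized[cut + 1:]

addr1, addr2 = split_address("12  Example   Road, Floor 3, Some Very Long District Name, Cityville")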
