I am comparing SQL Server 2008 R2 and MarkLogic 8 with a simple Person entity.
The dataset is 1 million records/documents for both. Note: both databases are on the same machine (localhost).
The following SQL Server query is ready in a flash:
set statistics time on
select top 10 FirstName + ' ' + LastName, count(FirstName + ' ' + LastName)
from [Person]
group by FirstName + ' ' + LastName
order by count(FirstName + ' ' + LastName) desc
set statistics time off
Result is:
Richard Petter 421
Mark Petter 404
Erik Petter 400
Arjan Petter 239
Erik Wind 237
Jordi Petter 235
Richard Hilbrink 234
Mark Dominee 234
Richard De Boer 233
Erik Bakker 233
SQL Server Execution Times:
CPU time = 717 ms, elapsed time = 198 ms.
The XQuery on MarkLogic 8, however, is much slower:
(
  let $m := map:map()
  let $build :=
    for $person in collection('/collections/Persons')/Person
    let $pname := $person/concat(FirstName/text(), ' ', LastName/text())
    return map:put($m, $pname, sum((map:get($m, $pname), 1)))
  for $pname in map:keys($m)
  order by map:get($m, $pname) descending
  return concat($pname, ' => ', map:get($m, $pname))
)[1 to 10]
,
xdmp:query-meters()/qm:elapsed-time
Result is:
Richard Petter => 421
Mark Petter => 404
Erik Petter => 400
Arjan Petter => 239
Erik Wind => 237
Jordi Petter => 235
Mark Dominee => 234
Richard Hilbrink => 234
Erik Bakker => 233
Richard De Boer => 233
elapsed-time:PT42.797S
198 ms vs. 42 s is, in my opinion, too much of a difference.
The XQuery uses a map to do the group by, according to this guide: https://blakeley.com/blogofile/archives/560/
I have 2 questions:
Is the XQuery tunable in any way for better performance?
Is XQuery 3.0 with group by already usable on MarkLogic 8?
Thanks for the help!
As @wst said, the challenge with your current implementation is that it loads up all the documents in order to pull out the first and last names, adds them up one by one, and then reports on the top ten. Instead of doing that, you'll want to make use of indexes.
Let's assume that you have string range indexes set up on FirstName and LastName. In that case, you could run this:
xquery version "1.0-ml";
for $co in
cts:element-value-co-occurrences(
xs:QName("FirstName"),
xs:QName("LastName"),
("frequency-order", "limit=10"))
return
$co/cts:value[1] || ' ' || $co/cts:value[2] || ' => ' || cts:frequency($co)
This uses indexes to find First and Last names in the same document. cts:frequency indicates how often that co-occurrence happens. This is all index driven, so it will be very fast.
First, yes, there are many ways to tune queries in MarkLogic. One obvious way is using range indexes; however, I would highly recommend reading their documentation on the subject first:
https://docs.marklogic.com/guide/performance
For a higher-level look at the database architecture, there is a whitepaper called Inside MarkLogic Server that explains the design extensively:
https://developer.marklogic.com/inside-marklogic
Regarding group by, maybe someone from MarkLogic would like to comment officially, but as I understand it their position is that it's not possible to build a universally high-performing group by, so they choose not to implement it. This puts the responsibility on the developer to understand best practices for writing fast queries.
In your example specifically, it's highly unlikely that Mike Blakeley's map-based group by pattern is the issue. There are several different ways to profile queries in ML, and any of those should lead you to any hot spots. My guess would be that IO overhead to get the Person data is the problem. One common way to solve this is to configure range indexes for FirstName and LastName, and use cts:value-tuples to query them simultaneously from range indexes, which will avoid going to disk for every document not in the cache.
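For illustration, here is a minimal sketch of that cts:value-tuples approach, assuming element range indexes on FirstName and LastName (the element names are taken from the question; treat this as a sketch rather than a drop-in solution):
xquery version "1.0-ml";
(: Assumes element range indexes on FirstName and LastName.
   Pulls the ten most frequent FirstName/LastName pairs straight from the range indexes. :)
for $tuple in cts:value-tuples(
  (cts:element-reference(xs:QName("FirstName")),
   cts:element-reference(xs:QName("LastName"))),
  ("frequency-order", "limit=10"))
let $values := json:array-values($tuple)
return $values[1] || ' ' || $values[2] || ' => ' || cts:frequency($tuple)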
Both answers are relevant, but based on what you are asking for, David C's answer is closest to what you appear to want. However, that answer assumes you need to find the combination of first name and last name.
If your documents had an identifying, unique field (think primary key), then:
- put a range index on that field
- use cts:element-values with the options (fragment-frequency, frequency-order, eager, concurrent, limit=10)
- the count comes from cts:frequency
- then use the resulting 10 IDs to grab the name information (see the sketch below)
If the unique ID field can be an integer, then the speeds are many times faster for the initial search as well.
And... it's also possible to use my steps 1-2 to get the list of IDs essentially instantly and use them as a limiter on David C's answer, by passing a range query on the IDs in the 'query' option. This saves you from having to build the name yourself, as in my option, and may speed up David C's approach.
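As a rough sketch of the mechanics in steps 1-2 plus the document lookup, assuming a range index on a hypothetical unique PersonId element (the element name is made up for illustration):
xquery version "1.0-ml";
(: PersonId is a placeholder name, not from the original post; assumes a range index on it.
   Gets the 10 most frequent PersonId values straight from the range index. :)
for $id in cts:element-values(
  xs:QName("PersonId"), (),
  ("fragment-frequency", "frequency-order", "eager", "concurrent", "limit=10"))
(: Fetch one matching document to grab the name information. :)
let $person := cts:search(fn:collection("/collections/Persons"),
                 cts:element-range-query(xs:QName("PersonId"), "=", $id))[1]/Person
return concat($person/FirstName, ' ', $person/LastName, ' => ', cts:frequency($id))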
Moral of the story: without first tuning your database for performance (indexes) and using specific query features (range queries, co-occurrence, etc.), the results have no chance of making any sense.
Moral of the story, part 2: there are lots of approaches, all relevant and viable, with subtle differences that depend on your specific data.
The documentation is listed below:
https://docs.marklogic.com/cts:element-values
I have a bitmask (really a 'flagmask') of integer values (1, 2, 4, 8, 16 etc.) which apply to a field and I need to store this in a (text) log file. What I effectively store is something like "x=296" which indicates that for field "x", flags 256, 32 and 8 were set.
When searching the logs, how can I easily search this text string ("x=nnn") and determine from the value of "nnn" whether a specific flag was set? For instance, how could I look at the number and know that flag 8 was set?
I know this is a somewhat trivial question if we're doing 'true' bitmask processing, but I've not seen it asked this way before - the log searching will just be doing string matching, so it just sees a value of "296" and there is no way to convert it to its constituent flags - we're just using basic string searching with maybe some easy SQL in there.
Your best bet is to do full string regex matching.
For bit 8 set, do this:
SELECT log_line
FROM log_table
WHERE log_line
RLIKE 'x=(128|129|130|131|132|133|134|135|136|137|138|139|140|141|142|143|144|145|146|147|148|149|150|151|152|153|154|155|156|157|158|159|160|161|162|163|164|165|166|167|168|169|170|171|172|173|174|175|176|177|178|179|180|181|182|183|184|185|186|187|188|189|190|191|192|193|194|195|196|197|198|199|200|201|202|203|204|205|206|207|208|209|210|211|212|213|214|215|216|217|218|219|220|221|222|223|224|225|226|227|228|229|230|231|232|233|234|235|236|237|238|239|240|241|242|243|244|245|246|247|248|249|250|251|252|253|254|255)[^0-9]'
Or simplified to:
RLIKE 'x=(12[89]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])[^0-9]'
The string x= must exist in the log, followed by the decimal number and a non-digit character after the number.
For bit 2 set, do this:
SELECT log_line FROM log_table WHERE log_line RLIKE 'x=(2|3|6|7|10|11|14|15|18|19|22|23|26|27|30|31|34|35|38|39|42|43|46|47|50|51|54|55|58|59|62|63|66|67|70|71|74|75|78|79|82|83|86|87|90|91|94|95|98|99|102|103|106|107|110|111|114|115|118|119|122|123|126|127|130|131|134|135|138|139|142|143|146|147|150|151|154|155|158|159|162|163|166|167|170|171|174|175|178|179|182|183|186|187|190|191|194|195|198|199|202|203|206|207|210|211|214|215|218|219|222|223|226|227|230|231|234|235|238|239|242|243|246|247|250|251|254|255)[^0-9]'
I tested the bit-2 regex on some live data:
SELECT count(log_line)
...
count(log_line)
128
1 row in set (0.002 sec)
Much depends on how "easy" the "easy SQL" must be.
If you can use string splitting to separate "X" from "296", then using AND with the value 296 is easy.
If you cannot, then, as you observed, 296 yields no traces of its 8th bit state. You'd need to check for all the values that have the 8th bit set, and of course that means exactly half of the available values. You can compress the regexp somewhat:
248 249 250 251 252 253 254 255 => 24[89]|25[0-5]
264 265 266 267 268 269 270 271 => 26[4-9]|27[01]
280 281 282 283 284 285 286 287 => 28[0-7]
296 297 298 299 300 301 302 303 => 29[6-9]|30[0-3]
...24[89]|25[0-5]|26[4-9]|27[01]|28[0-7]|29[6-9]|30[0-3]...
but I think the evaluation tree won't change very much; this kind of optimization is already present in most regex engines. Performance is going to be awful.
If at all possible, I would try to rework the logs (maybe with a filter) to rewrite them as "X=000100101000b" or at least "X=0128h". The latter, hexadecimal form would allow you to search for bit 8 by looking for "X=...[89A-F]h" only.
I have had to make changes like these several times. It involves preparing a pipe filter to create the new logs (it is more complicated if the logging process writes straight to a file; at worst, you might be forced to never run searches on the current log file, or to rotate the logs when such a search is needed), then slowly, during low-load periods, the old logs are retrieved, decompressed, reparsed, recompressed, and stored back. It's long and boring, but doable.
For database-stored logs, it's much easier and you can even avoid any alterations to the process that does the logs - you add a flag to the rows to be converted, or persist the timestamps of the catch-up conversion:
SELECT ... FROM log WHERE creation > lowWaterMark ORDER BY creation DESC;
SELECT ... FROM log WHERE creation < highWaterMark ORDER BY creation ASC;
The retrieved rows are updated, and the appropriate watermarks are updated with the last creation value that has been retrieved. The lowWaterMark chases the logging process. This way, the only range you can't search is from the current instant down to lowWaterMark, which will lag behind (ultimately by just a few seconds, except under heavy load).
One way to determine whether a specific flag is set in the value "nnn" is to use bitwise operators to check if the flag is present in the value. For example, to check whether flag 8 is set in the value 296, you can use the bitwise AND operator:
nnn = 296
if (nnn & 8) == 8:
    print("flag 8 is set")
Another method I can think of would be to check whether the digit "8" appears in the log string "x=nnn" using regular expressions, although this is only a string match on the digit and does not reliably tell you whether the flag is actually set:
import re

log_string = "x=296"
if re.search("8", log_string):
    print("flag 8 is set")
You could also use a function that takes the log string and flag as input and returns a boolean indicating whether the flag is set in the log string.
def check_flag(log_string, flag):
    nnn = int(log_string.split('=')[1])
    return (nnn & flag) == flag

if check_flag("x=296", 8):
    print("flag 8 is set")
Use the bitwise operator & to check whether a specific flag is set in the value:
function hasFlag(value, flag) {
  return (value & flag) === flag;
}
const value = 296;
console.log(hasFlag(value, 256)); // true
console.log(hasFlag(value, 32)); // true
console.log(hasFlag(value, 8)); // true
console.log(hasFlag(value, 4)); // false
I'm a novice and am working on learning the basics. I have no experience in coding.
I'm trying to make a world for myself and my friends in which quest rewards are the sole way to gain levels, in full increments, exactly like milestone leveling in Dungeons & Dragons.
Is there a way to level up a character, or have an automated ".levelup" command be used on a character, triggering when that player completes a (custom) quest? Additionally, is this something that can be done in Keira3? Or will I need to use other tools?
I've tried granting quest reward consumables that use the spells 47292 and 24312 (https://wotlkdb.com/?spell=47292 and https://wotlkdb.com/?spell=24312) but those appear to just be the visual level-up effects.
There are multiple ways to achieve this. The most convenient way that I can think of is to compile the core with the Eluna module: https://github.com/azerothcore/mod-eluna
It allows for scripting with the easily accessible Lua language. For example, you can use the following code:
local questId = 12345
local questNpc = 23456
local maxLevel = 80
local CREATURE_EVENT_ON_QUEST_REWARD = 34 -- (event, player, creature, quest, opt) - Can return true

local function MyQuestCompleted(event, player, creature, quest, opt)
    if player then -- check if the player exists
        -- check that the right quest was completed and the player isn't max level
        if player:GetLevel() < maxLevel and quest:GetId() == questId then
            player:SetLevel( player:GetLevel() + 1 )
        end
    end
end

RegisterCreatureEvent( questNpc, CREATURE_EVENT_ON_QUEST_REWARD, MyQuestCompleted )
See https://www.azerothcore.org/pages/eluna/index.html for the documentation.
I am using the following query to obtain the current component serial number (tr_sim_sn) installed on the host device (tr_host_sn) from the most recent record in a transaction history table (PUB.tr_hist):
SELECT tr_sim_sn FROM PUB.tr_hist
WHERE tr_trnsactn_nbr = (SELECT max(tr_trnsactn_nbr)
FROM PUB.tr_hist
WHERE tr_domain = 'vattal_us'
AND tr_lot = '99524136'
AND tr_part = '6684112-001')
The actual table has ~190 million records. The excerpt below contains only a few sample records, and only fields relevant to the search to illustrate the query above:
tr_sim_sn |tr_host_sn* |tr_host_pn |tr_domain |tr_trnsactn_nbr |tr_qty_loc
_______________|____________|_______________|___________|________________|___________
... |
356136072015140|99524135 |6684112-000 |vattal_us |178415271 |-1.0000000000
356136072015458|99524136 |6684112-001 |vattal_us |178424418 |-1.0000000000
356136072015458|99524136 |6684112-001 |vattal_us |178628048 |1.0000000000
356136072015050|99524136 |6684112-001 |vattal_us |178628051 |-1.0000000000
356136072015836|99524137 |6684112-005 |vattal_us |178645337 |-1.0000000000
...
* = key field
The excerpt illustrates multiple occurrences of tr_trnsactn_nbr for a single value of tr_host_sn. The largest value for tr_trnsactn_nbr corresponds to the current tr_sim_sn installed within tr_host_sn.
This query works, but it is very slow, taking ~8 minutes.
I would appreciate suggestions to improve or refactor this query to improve its speed.
Check with your admins to determine when they last updated the SQL statistics. If the answer is "we don't know" or "never," then you might want to ask them to run the following 4GL program, which will create a SQL script to accomplish that:
/* genUpdateSQL.p
*
* mpro dbName -p util/genUpdateSQL.p -param "tmp/updSQLstats.sql"
*
* sqlexp -user userName -password passWord -db dnName -S servicePort -infile tmp/updSQLstats.sql -outfile tmp/updSQLtats.log
*
*/
output to value( ( if session:parameter <> "" then session:parameter else "updSQLstats.sql" )).
for each _file no-lock where _hidden = no:
put unformatted
"UPDATE TABLE STATISTICS AND INDEX STATISTICS AND ALL COLUMN STATISTICS FOR PUB."
'"' _file._file-name '"' ";"
skip
.
put unformatted "commit work;" skip.
end.
output close.
return.
This will generate a script that updates statistics for all tables and all indexes. You could edit the output to only update the tables and indexes that are part of this query if you want.
Also, if the admins are nervous they could, of course, try this on a test db or a restored backup before implementing in a production environment.
I am posting this as a response to my request for an improved query.
As it turns out, the following query includes two distinct changes that greatly improved its speed. One is to include the tr_domain search criterion in both the main and nested portions of the query. The second is to narrow the search by increasing the number of search criteria, which in the following are all included in the nested section:
SELECT tr_sim_sn
FROM PUB.tr_hist
WHERE tr_domain = 'vattal_us'
AND tr_trnsactn_nbr IN (
SELECT MAX(tr_trnsactn_nbr)
FROM PUB.tr_hist
WHERE tr_domain = 'vattal_us'
AND tr_part = '6684112-001'
AND tr_lot = '99524136'
AND tr_type = 'ISS-WO'
AND tr_qty_loc < 0)
This syntax results in ~0.5s response time. (credit to my colleague, Daniel V.)
To be fair, this query uses criteria beyond the parameters stated in the original post, which made it difficult or impossible for others to attempt a reasonable answer. This omission was not on purpose, of course; rather, it was due to being fairly new to the fundamentals of good query design. This query is in part the result of learning that when too few or non-indexed fields are used as search criteria on a large table, it is sometimes helpful to narrow the search by increasing the number of search criteria. The original had 3; this one has 5.
I'm making a flight tracking map that will need to pull live data from a SQLite db. I'm currently just using the SQLite executable to navigate the db and understand how to interact with it. Each aircraft is identified by a unique hex_ident. I want to get a list of all aircraft that have sent out a signal in the last minute, as a way of identifying which aircraft are actually active right now. I tried
select distinct hex_ident, parsed_time
from squitters
where parsed_time >= Datetime('now','-1 minute')
I expected a list of only 4 or 5 hex_idents, but I'm getting a list of every entry (today's entries only), and some are outside the one-minute bound. I'm new to SQL, so I don't really know how to do this yet. Here's what each entry looks like; the table is called squitters.
{
"message_type":"MSG",
"transmission_type":8,
"session_id":"111",
"aircraft_id":"11111",
"hex_ident":"A1B4FE",
"flight_id":"111111",
"generated_date":"2021/02/12",
"generated_time":"14:50:42.403",
"logged_date":"2021/02/12",
"logged_time":"14:50:42.385",
"callsign":"",
"altitude":"",
"ground_speed":"",
"track":"",
"lat":"",
"lon":"",
"vertical_rate":"",
"squawk":"",
"alert":"",
"emergency":"",
"spi":"",
"is_on_ground":0,
"parsed_time":"2021-02-12T19:50:42.413746"
}
Any ideas?
You must remove the 'T' from the value of parsed_time, or apply datetime() to it as well, to make the comparison work:
where datetime(parsed_time) >= datetime('now', '-1 minute')
Note that the datetime() function does not take microseconds into account, so if you need 100% accuracy, you must append them with concatenation:
where replace(parsed_time, 'T', ' ') >=
datetime('now', '-1 minute') || substr(parsed_time, instr(parsed_time, '.'))
I am new to MarkLogic. I need to get the total count of books from the following XML. Can anyone make a suggestion?
<bk:bookstore xmlns:bk="http://www.bookstore.org">
<bk:book category='Computer'>
<bk:author>Gambardella, Matthew</bk:author>
<bk:title>XML Developer's Guide</bk:title>
<bk:price>44.95</bk:price>
<bk:publish_year>1995</bk:publish_year>
<bk:description>An in-depth look at creating applications with XML.
</bk:description>
</bk:book>
<bk:book category='Fantasy'>
<bk:author>Ralls, Kim</bk:author>
<bk:title>Midnight Rain</bk:title>
<bk:price>5.95</bk:price>
<bk:publish_year>2000</bk:publish_year>
<bk:description>A former architect battles corporate zombies, an evil
sorceress, and her own childhood to become queen of the world.
</bk:description>
</bk:book>
<bk:book category='Comic'>
<bk:author>Robert M. Overstreet</bk:author>
<bk:title>The Overstreet Indian Arrowheads Identification </bk:title>
<bk:price>2000</bk:price>
<bk:publish_year>1991</bk:publish_year>
<bk:description>A leading expert and dedicated collector, Robert M.
Overstreet has been writing The Official Overstreet Identification and
Price
Guide to Indian Arrowheads for more than 21 years</bk:description>
</bk:book>
<bk:book category='Comic'>
<bk:author>Randall Fuller</bk:author>
<bk:title>The Book That Changed America</bk:title>
<bk:price>1000</bk:price>
<bk:publish_year>2017</bk:publish_year>
<bk:description>The New York Times Book Review Throughout its history
America has been torn in two by debates over ideals and beliefs.
</bk:description>
</bk:book>
</bk:bookstore>
Can anyone suggest a solution, as I am new to this?
I'd suggest using cts:count-aggregate in combination with cts:element-reference. This requires you to have an element range index on book.
cts:count-aggregate(cts:element-reference(fn:QName("http://www.bookstore.org", "book")))
If performance isn't too critical and your document count isn't too large, you could also count with fn:count.
declare namespace bk="http://www.bookstore.org";
fn:count(//bk:book)
Try this:
declare namespace bk="http://www.bookstore.org";
let $book_xml :=
<bk:bookstore xmlns:bk="http://www.bookstore.org">
  <bk:book>
  ........
  ........
  </bk:book>
</bk:bookstore>
return fn:count($book_xml//bk:book)
Hope that helps!