MarkLogic: unique word count - XQuery

I have the following XML structure:
<Root>
<text>
Marklogic is a good big data tool. Right now I am exploring Marklogic.
</text>
</Root>
Now I want to count the occurrences of each unique word (e.g. Marklogic: 2 times, big: 1 time, data: 1 time, etc.). I achieved this using fn:count(), but fn:count() is too slow on a large database.
Is there any other, more optimized way to achieve this (something related to indexes)?

Per http://docs.marklogic.com/guide/search-dev/lexicon#chapter, you could enable the word lexicon and use cts:words.
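For illustration, a minimal sketch of that approach (assuming the word lexicon is enabled on the database; the "item-frequency" option requests per-occurrence counts, whereas the default counts fragments):

xquery version "1.0-ml";
(: Sketch: enumerate the distinct words in the word lexicon and report
   how often each one occurs, most frequent first. :)
for $word in cts:words((), "item-frequency")
order by cts:frequency($word) descending
return concat($word, ": ", cts:frequency($word))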

Related

How to insert a large number of nodes into Neo4j

I need to insert about 1 million nodes into Neo4j. Each node must be unique, so every time I insert a node it has to be checked that the same node does not already exist. The relationships must be unique as well.
I'm using Python and Cypher:
uq = 'CREATE CONSTRAINT ON (a:ipNode8) ASSERT a.ip IS UNIQUE'
...
queryProbe = 'MERGE (a:ipNode8 {ip:"' + prev + '"})'
...
queryUpdateRelationship= 'MATCH (a:ipNode8 {ip:"' + prev + '"}),(b:ipNode8 {ip:"' + next + '"}) MERGE (a)-[:precede]->(b)'
The problem is that after putting 40-50K nodes into Neo4j, the insertion speed slows down drastically and I cannot insert anything else.
Your question is quite open-ended. In addition to @InverseFalcon's recommendations, here are some other things you can investigate to speed things up.
Read the Performance Tuning documentation, and follow the recommendations. In particular, you might be running into memory-related issues, so the Memory Tuning section may be very helpful.
Your Cypher queries can probably be sped up. For instance, if it makes sense, you can try something like the following. The data parameter is expected to be a list of objects of the form {a: 123, b: 234}. You can make the list as long as appropriate (e.g., 20K entries) to avoid running out of memory on the server while it processes the list within a single transaction. (This query assumes that you also want to create b if it does not exist.)
UNWIND {data} AS d
MERGE (a:ipNode8 {ip: d.a})
MERGE (b:ipNode8 {ip: d.b})
MERGE (a)-[:precede]->(b)
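A hedged sketch of driving that query from Python with the official neo4j driver (connection details are placeholders; newer Cypher versions write the parameter as $data rather than {data}), passing each batch as a query parameter instead of concatenating values into the string:

from neo4j import GraphDatabase  # official Neo4j Python driver

# Placeholder connection details for illustration.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

INSERT_QUERY = """
UNWIND $data AS d
MERGE (a:ipNode8 {ip: d.a})
MERGE (b:ipNode8 {ip: d.b})
MERGE (a)-[:precede]->(b)
"""

def insert_pairs(pairs, batch_size=20000):
    # Send fixed-size batches so each transaction stays small enough
    # not to exhaust server memory.
    with driver.session() as session:
        for i in range(0, len(pairs), batch_size):
            session.run(INSERT_QUERY, data=pairs[i:i + batch_size])

insert_pairs([{"a": "10.0.0.1", "b": "10.0.0.2"},
              {"a": "10.0.0.2", "b": "10.0.0.3"}])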
There are also periodic execution APOC procedures that you might be able to use.
For mass inserts like this, it's best to use LOAD CSV with periodic commit or the import tool.
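For example, assuming the pairs are exported to a CSV file with prev and next columns (the file name and batch size below are illustrative):

// Periodic commit flushes every 10K rows so transaction state stays small.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///pairs.csv' AS row
MERGE (a:ipNode8 {ip: row.prev})
MERGE (b:ipNode8 {ip: row.next})
MERGE (a)-[:precede]->(b)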
I believe it's also best practice to use a parameterized query instead of appending values into a string.
Also, you created a unique property constraint on :ipNode8, but not :ipNode, which is the first one you MERGE. Seems like you'll need a unique constraint for that one too.

Efficient XQuery query to determine the documents where an element does NOT exist

Let's say I have ~50 million records in a collection like this:
<record>
<some_data>
<some_data_id>112423425345235</some_data_id>
</some_data>
</record>
So I have maybe a million records (bad data) that look like this:
<record>
<some_data>
</some_data>
</record>
That is, the some_data element is empty.
So if I have an element-range-index setup on some_data_id, what's an efficient XQuery query that will give me all the empty ones to delete?
I think what I'm looking for is a query that is not a FLWOR expression that checks each record for the existence of child elements, as I believe that is inefficient (i.e., it pulls the data back and then filters it).
Whereas if I did it with a cts:search query, it would be more efficient, since the data is filtered by the indexes before being pulled back.
Please write a query that can do this efficiently, and confirm whether or not my assumptions about FLWOR expressions are correct.
I don't think you need a range index to do this efficiently. Using the "universal" element indexes via cts:query constructors should be fine:
cts:element-query(xs:QName('record'),
  cts:element-query(xs:QName('some_data'),
    cts:not-query(cts:element-query(xs:QName('some_data_id'), cts:and-query(())))
  )
)
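For illustration, a sketch of using that query to find and delete the matching documents (hedged: with ~1M matches you would normally batch the deletes, e.g. with a tool like CoRB, rather than running one huge transaction):

xquery version "1.0-ml";
(: Sketch: run the search unfiltered so matching is done purely from
   the indexes, then delete each matching document. :)
for $doc in cts:search(fn:doc(),
  cts:element-query(xs:QName('record'),
    cts:element-query(xs:QName('some_data'),
      cts:not-query(cts:element-query(xs:QName('some_data_id'), cts:and-query(())))
    )
  ),
  "unfiltered")
return xdmp:document-delete(xdmp:node-uri($doc))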

Delete nodes in XML based on a list of values for a single attribute - XQuery in MS SQL Server 2012

I am a total noob with XQuery, but before I start digging deep into it, I'd like to ask some expert advice about whether I am looking in the correct direction.
I have XML in a table that looks something like this:
'<JOURNALEXT>
<JOURNAL journalno="1" journalpos="1" ledgercode="TD1">
</JOURNAL>
<JOURNAL journalno="1" journalpos="1" ledgercode="TD2">
</JOURNAL>
<JOURNAL journalno="1" journalpos="1" ledgercode="TD3">
</JOURNAL>
<!-- ...almost 50 such nodes -->
</JOURNALEXT>'
Now the ledgercode attribute's valid values are stored in another table. I have to filter the XML down to the nodes whose ledgercode value is not among the values in that table.
For example, my ledger_code table has two entries, TD1 & TD2,
so I should get the resultant XML as
<JOURNALEXT>
<JOURNAL journalno="1" journalpos="1" ledgercode="TD3">
</JOURNAL>
<!-- ...almost 50 such nodes -->
</JOURNALEXT>
I can delete nodes based on one attribute value by using:
declare @var_1 varchar(max) = 'TD1'
BEGIN TRANSACTION
update [staging_data_load].[TBL_STG_RAWXML_STORE] WITH (rowlock)
set XMLDATA.modify('delete /JOURNALEXT/JOURNAL[@ledgercode != sql:variable("@var_1")]')
where job_id = @job_Id
But my case is more complex: I need to get multiple ledgercodes from the table and delete the matching JOURNAL nodes, so that only the desired nodes remain, as in the example above; everything else gets deleted.
I am using MS SQL Server 2012 as the database and trying to write an XQuery for this.
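One possible approach, sketched below: flatten the valid codes into a single delimited string and test membership inside the XQuery predicate with contains(). The table and column names (dbo.ledger_code, ledger_code) are assumptions, and the predicate deletes the nodes whose ledgercode IS in the list, matching the worked example above; wrap it in not(...) if the opposite behavior is intended.

declare @codes varchar(max);

-- Build a ',TD1,TD2,' style list (FOR XML PATH string aggregation works on SQL Server 2012).
select @codes = ',' + stuff((select ',' + ledger_code
                             from dbo.ledger_code
                             for xml path('')), 1, 1, '') + ',';

-- Delete every JOURNAL node whose ledgercode appears in the list.
update [staging_data_load].[TBL_STG_RAWXML_STORE] with (rowlock)
set XMLDATA.modify('delete /JOURNALEXT/JOURNAL[contains(sql:variable("@codes"),
                                               concat(",", @ledgercode, ","))]')
where job_id = @job_Id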

BizTalk - how to map these two nodes to a repeating node?

I have an incoming schema that looks like this:
<Root>
<ClaimDates005H>
<Begin>20120301</Begin>
<End>20120302</End>
</ClaimDates005H>
</Root>
(there's more to it, this is just the area I'm concerned with)
I want to map it to a schema with a repeating section, so it winds up like this:
<Root>
<DTM_StatementFromorToDate>
<DTM01_DateTimeQualifier>Begin</DTM01_DateTimeQualifier>
<DTM02_ClaimDate>20120301</DTM02_ClaimDate>
</DTM_StatementFromorToDate>
<DTM_StatementFromorToDate>
<DTM01_DateTimeQualifier>End</DTM01_DateTimeQualifier>
<DTM02_ClaimDate>20120302</DTM02_ClaimDate>
</DTM_StatementFromorToDate>
</Root>
(That's part of an X12 835, BTW...)
Of course, in the destination schema there's only a single occurrence of DTM_StatementFromorToDate, which can repeat. I get that I can run both Begin and End into a Looping functoid to create two instances of DTM_StatementFromorToDate, one with Begin and one with End, but then how do I correctly populate DTM01_DateTimeQualifier?
Figured it out: the Table Looping functoid took care of it.

HBase schema design - how to make sorting easy?

I have 1M words in my dictionary. Whenever a user issues a query on my website, I check whether the query contains any words from my dictionary and increment the counter corresponding to each of them individually. For example, if a user types in "Obama is a president" and "Obama" and "president" are in my dictionary, then I should increment the counters for "Obama" and "president" by 1.
And from time to time, I want to see the top 100 words (the most-queried words). If I use HBase to store the counters, what schema should I use? I have not come up with an efficient one yet.
If I use the dictionary word as the row key and "counter" as the column key, then updating (incrementing) the counter is very efficient, but it's very hard to sort and return the top 100.
Can anyone give good advice? Thanks.
You can use the natural schema (row key as word and column as count) and use IHBase to get a secondary index on the count column. See https://issues.apache.org/jira/browse/HBASE-2037 for the initial implementation; the current code lives at http://github.com/ykulbak/ihbase.
From Adobe's presentation at HBaseCon 2012 (slide 28 in particular), I suggest using two tables and this sort of data structure for the row key:
name table:
President => 1000
Test => 900

count table:
429461296:President => dummyvalue
429461396:Test => dummyvalue
The second table's row keys are derived using Long.MAX_VALUE - count at that point in time.
As you get new words, just add the "count:word" as a row key to the count table. That way, you always have the top words returned first when you scan the table.
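A sketch of that update step in Java (hedged: the HBase client calls are standard, but the "d"/"c" family and qualifier names and the delete-then-put refresh of the count table are assumptions for illustration):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WordCounts {
    public static void bumpWord(Table nameTable, Table countTable, String word)
            throws IOException {
        // Atomically increment the word's counter in the "name" table.
        long newCount = nameTable.incrementColumnValue(
                Bytes.toBytes(word), Bytes.toBytes("d"), Bytes.toBytes("c"), 1L);

        // Drop the count-table row that encoded the previous count...
        long oldKey = Long.MAX_VALUE - (newCount - 1);
        countTable.delete(new Delete(Bytes.toBytes(oldKey + ":" + word)));

        // ...and write one whose key sorts highest-count-first; the decimal
        // form of Long.MAX_VALUE - count stays fixed-width for realistic counts.
        long newKey = Long.MAX_VALUE - newCount;
        Put put = new Put(Bytes.toBytes(newKey + ":" + word));
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("c"),
                Bytes.toBytes("dummyvalue"));
        countTable.put(put);
    }
}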
Sorting 1M longs can be done in memory, so that is not the hard part.
Store the words x, y, z issued at time t as row key t with columns word:x=1, word:y=1, word:z=1 in a table. Then use a MapReduce job to sum up the counts per word and take the top 100.
This also enables further analysis.
