Imagine if I have an xml document stored in Marklogic in the following format:
<document>
  <id>DocumentID</id>
  <questions>
    <question_item>
      <question>question1</question>
      <answer>answer1</answer>
    </question_item>
    <question_item>
      <important>high</important>
      <question>question2</question>
      <answer>answer2</answer>
    </question_item>
  </questions>
</document>
Basically, each document has a number of questions, and only some of them have an <important> element. I want to return all of the "important" questions in a flat format, with metadata taken from the document each one is pulled from (e.g., the id).
The following XQuery seems to work, and is reasonably fast:
for $x in cts:search(
    /document,
    cts:element-query(xs:QName("important"), cts:and-query(())),
    "unfiltered", 0.0)
return
  for $y in $x/questions/question_item
  return
    if ($y/important) then
      fn:concat($x/id, '|',
                $y/question, '|',
                $y/answer, '|',
                $y/important)
    else ()
However, I usually find that for loops are not the fastest way to work in XQuery, and this does seem like a relatively cumbersome approach. Is there a better way to return just the "important" nodes initially, but then still have access to the main document elements?
Personally, I find conditional logic more cumbersome than for loops, but I think you can remove one of each for a simpler query. Instead of looping over the first sequence of documents, you can simply assign them to a variable, which will allow you to reference them. Then in your loop, use a predicate to constrain question_item to those with important elements, eliminating the need for the conditional:
let $documents := cts:search(
    /document,
    cts:element-query(xs:QName("important"), cts:and-query(())),
    "unfiltered", 0.0)
for $y in $documents/questions/question_item[important]
return fn:concat($y/ancestor::document/id, '|',
                 $y/question, '|',
                 $y/answer, '|',
                 $y/important)
As in the code sample, the optimal approach is to match documents first based on the indexes and then to extract values from the matched documents. A FLWOR expression with non-redundant XPaths is an efficient way to extract values from a document.
One possible improvement would be to take a more fine-grained approach in modelling the documents: that is, to put each question item in a separate document. That way, the search will retrieve only the question items that are important.
That change would become important if the documents are large. For maximum performance, you could then put range indexes on the question, answer, and important elements and get one tuple for each question item directly from the indexes.
If the specific list of question items is usually retrieved and updated together, however, that would argue against splitting out each question as a separate document.
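If you did split out each question item, a lexicon-based version might look roughly like the sketch below. It assumes one question_item per document and string element range indexes on question, answer, and important (none of which exist in the question as posted), so treat it as an illustration rather than a drop-in query:
(: Sketch only: assumes one question_item per document and element range
   indexes on question, answer, and important. The document id would need
   its own range index, or to be copied into each item, to appear in the
   tuple; it is omitted here for brevity. :)
for $tuple in cts:value-tuples(
    (cts:element-reference(xs:QName("question")),
     cts:element-reference(xs:QName("answer")),
     cts:element-reference(xs:QName("important"))),
    (),
    cts:element-query(xs:QName("important"), cts:and-query(())))
return fn:string-join(json:array-values($tuple) ! fn:string(.), '|')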
Hoping that helps,
I'm trying to declare an array in R, something logically equivalent to the following Java code:
Object[][] array = new Object[6][32]
After I declare this array, I plan to loop over the indices and assign values to them.
I am not familiar with what you are planning to do in R, but loops are generally not recommended. I would say this is especially true when you don't know the length of the output.
You might want to look for a "vectorized" solution first, and if that isn't possible, something in the apply family might also be helpful.
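If it helps, here is a rough sketch of both ideas: a 6-by-32 container that mirrors the Java declaration, and a vectorized alternative. The fill values are made up purely for illustration.
# A 6 x 32 container roughly equivalent to Object[6][32]:
# a matrix backed by a list can hold arbitrary R objects per cell.
m <- matrix(vector("list", 6 * 32), nrow = 6, ncol = 32)
m[[2, 5]] <- "some value"   # assign to one cell by index

# If each cell can be computed from its indices, a vectorized
# alternative avoids the explicit assignment loop entirely:
m2 <- outer(1:6, 1:32, FUN = function(i, j) i * j)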
Disclaimer: I am certain there is more nuance to this discussion based on what I have read, so I don't want to claim to be an expert on this subject.
Suppose that you are dealing with a potentially infinite amount of data. Suppose further that you do not have this data stored in memory, but can generate individual terms at will. Finally, suppose that you want to do some experiment on this data that will involve checking a large but unknown number of terms in a way that necessitates keeping a great many of them in memory. Toy problems with Recamán's sequence, like "find the minimum number of terms needed in that sequence for the first 25 even numbers to have appeared", are what I have in mind as typical examples.
The obvious solution to this sort of problem would be to write some code like:
list<-c(first term)
while([not found enough terms yet])
{
nextTerm<-Whatever
if(this term worked){list<-c(list,nextTerm)}
}
However, building a big vector like this by adding one new term at a time is your memory's worst nightmare. The alternative that I often see suggested is to pre-allocate a big vector in memory by making the first line of your code something like list<-numeric(10^6), but those solutions suppose that we have some rough idea of how many terms we need to check, which isn't always the case. So what can we do when we are dealing with an ever-growing list of unknown required length?
This is a very popular subject in R; check this answer: https://stackoverflow.com/a/45195098/5442527
Summing up:
Do not use c() to append; assigning a value by index with [ is much faster. It might seem surprising, but you can also grow a pre-allocated vector when it fills up. Create an iterator variable before the while loop and increase the index inside the if statement.
In Python, by contrast, you normally do not have to care about this when using append. Even starting with an empty list is not a problem, because the list's reserved memory grows geometrically (roughly x2, x2, x1.5, x1.2, ...) once you pass certain threshold numbers of elements; see the discussion of over-allocating in CPython's list implementation.
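A minimal sketch of that pattern, with a made-up generator and stopping condition standing in for the real sequence:
# Pre-allocate, fill by index, and double the buffer when it runs out.
buf  <- numeric(1024)        # initial guess at the size; grows as needed
iter <- 0                    # number of terms stored so far

while (iter < 5000) {        # placeholder stopping condition
  nextTerm <- runif(1)       # placeholder for "generate the next term"
  if (nextTerm > 0.5) {      # placeholder for "this term worked"
    iter <- iter + 1
    if (iter > length(buf)) {          # out of room: double the buffer
      length(buf) <- 2 * length(buf)   # grows occasionally, not on every append
    }
    buf[iter] <- nextTerm
  }
}
result <- buf[seq_len(iter)] # trim the unused tail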
I have some data which looks something like this:
<wrapper>
<inner a="1"/>
<inner a="2" b="3"/>
</wrapper>
The attribute b may or may not be present on each inner element. My aim is to find all documents containing at least one inner element that doesn't have attribute b.
This similar question proposes the answer:
cts:not-query(cts:element-attribute-value-query(xs:QName('inner'), xs:QName('b'), '*', ("wildcarded")))
but that doesn't work, because some inner elements on the same document may have attribute b, and not-queries work on the entire fragment, so a mixed case like the example above would not be returned. Wrapping it in an element-query doesn't help, and cts:and-not-query seems to behave the same way.
I have also tried attacking the problem using co-occurrence/values functions to read the values of the relevant a attributes, but that also seems to be impossible. It might have been possible with proximity settings on co-occurrence calls, except there is no element text, so the attributes are indexed with the same word positions.
Are there any alternatives to the blunt XPath?
//inner[@a and not(@b)]
You can always make the XPath more complicated if simplicity isn't your goal.
How about this one? It more accurately answers the exact question of "return all documents that contain 'inner' elements that do not have an attribute @b":
doc()[exists(//inner[not(@b)])]
I do not know how well this is optimized -- some xpath expressions optimize down to the equivalent cts: query and some do not.
There is another 'trick' involving combining cts results represented as maps. Take the results of two searches, use the options that return the results as a map, and then you can use the operations on this page, https://developer.marklogic.com/blog/im-a-map, to do extremely efficient set operations (union, intersection, difference, etc.). When properly constructed, this technique can be as fast as 'native' cts searches -- the cts searches use the same general technique internally for resolving results.
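A rough sketch of the mechanism (it assumes the URI lexicon is enabled; the "map" option to cts:uris and the map operators are the pieces to double-check against your MarkLogic version). Note that, like cts:not-query, a difference of document URIs resolves at the fragment level, so this illustrates the set operations rather than solving the mixed-attribute case exactly:
(: Sketch only: requires the URI lexicon; the queries are illustrative. :)
let $with-inner := cts:uris((), "map",
  cts:element-query(xs:QName("inner"), cts:true-query()))
let $with-b := cts:uris((), "map",
  cts:element-attribute-value-query(xs:QName("inner"), xs:QName("b"), "*", "wildcarded"))
return (
  map:count($with-inner * $with-b),   (: intersection: docs matching both :)
  map:keys($with-inner - $with-b)     (: difference: docs where no inner has @b :)
)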
Make the XPath a path range index: //inner[@a and not(@b)], or if there's no element text, //inner[@a and not(@b)]/@a, then do
cts:path-range-query('//inner[@a and not(@b)]/@a', '>', '')
This also happens to allow us to efficiently answer the question of which @a values have a missing @b, using cts:values.
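For example (a sketch, assuming the path above has been configured as a string path range index):
cts:values(cts:path-reference('//inner[@a and not(@b)]/@a'))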
cts:not-in-query has the necessary behaviour to make this work where cts:and-not-query doesn’t. E.g.
cts:not-in-query(
  cts:element-query(xs:QName('inner'), cts:true-query()),
  cts:element-attribute-value-query(xs:QName('inner'), xs:QName('b'), '*', 'wildcarded')
)
Finds all ‘inner’ elements at positions that do not match the positions of ‘inner’ elements with attribute b.
Element position index must be enabled. Wildcard index must be enabled.
http://docs.marklogic.com/cts:not-in-query
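For completeness, here is a sketch of the query plugged into a search over the sample data (the /wrapper path comes from the question, not the original answer):
cts:search(/wrapper,
  cts:not-in-query(
    cts:element-query(xs:QName('inner'), cts:true-query()),
    cts:element-attribute-value-query(xs:QName('inner'), xs:QName('b'), '*', 'wildcarded')
  ))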
So this is possibly more about functional programming than Kotlin. I am at that stage where a little knowledge is dangerous, and since I wrote the app in Kotlin it seems fair to ask a Kotlin question, as it's Kotlin's structures that I am interested in.
I have a sequence of items, they are in batches of three, so the stream may look like
1,a,+,2,b,*,3,c,&.......
What I want to do is split this into three lists. Currently I am doing this by partitioning into two lists, one that contains the numbers and one that contains everything else, then taking the second half of the result (the letters and symbols) and partitioning again into letters and symbols; thus I end up with three lists.
This strikes me as somewhat inefficient, maybe a functional approach isn't the best approach here.
Is there an efficient way of doing this, or are my only choices this approach or a for loop?
Thanks
You can use the groupBy method to group the elements of your sequence by element type:
val elementsByType = sequence.groupBy { getElementType(it) }
where getElementType is a function returning the type of the element: whether it is a letter, a number, or a symbol. This function may return either a number, such as 1, 2, 3, or a value of some enum with 3 different entries.
groupBy returns a map from element type to list of elements of that type.
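For example, here is a sketch of one way getElementType could be written; the enum and the classification rules are assumptions for illustration, not part of the answer above:
// Illustrative only: assumes the elements are single characters.
enum class ElementType { NUMBER, LETTER, SYMBOL }

fun getElementType(element: Char): ElementType = when {
    element.isDigit()  -> ElementType.NUMBER
    element.isLetter() -> ElementType.LETTER
    else               -> ElementType.SYMBOL
}

fun main() {
    val sequence = sequenceOf('1', 'a', '+', '2', 'b', '*', '3', 'c', '&')
    val elementsByType = sequence.groupBy { getElementType(it) }
    // elementsByType[ElementType.NUMBER] == ['1', '2', '3'], and so on
    println(elementsByType)
}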
I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".
My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)
I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.
I have a few related questions:
Is the tm package appropriate for this sort of task?
Is there a faster alternative to agrep? (Said function uses the Levenshtein edit distance, which is anecdotally slow.)
Are there other suitable tools in R, apart from agrep and tm?
Should I even be doing this in R, or should this sort of thing be done directly in the database? (It's an Access database, so I'd rather avoid touching it if possible.)
If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own function that uses my own weighting scheme (also, as it is, you can't use soundex() for big data sets with RecordLinkage).
If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare that I write as separate tables in a MySQL database.
So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe that's the first three letters in each name, or the first three letters of the name plus the first three letters of the acronym). Then I calculate match scores using the above functions.
You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.
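To make that concrete, here is a minimal sketch of the scoring step, assuming the RecordLinkage package is installed; the example names and the 0.7/0.3 weights are made up:
library(RecordLinkage)  # provides jarowinkler(), levenshteinSim(), soundex()

# Toy candidate pairs after cleaning (lower case, punctuation stripped)
a <- c("some company ltd", "acme holdings ltd")
b <- c("some company limited", "acme hldgs ltd")

# Hand-rolled score combining two string comparators; weights need tuning
score <- 0.7 * jarowinkler(a, b) + 0.3 * levenshteinSim(a, b)

# Sort candidate pairs by score so likely matches float to the top
data.frame(a, b, score)[order(-score), ]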
Maybe Google Refine could help. It looks like a better fit if you have lots of exceptions and you don't know them all yet.
What you're doing is called record linkage (https://en.wikipedia.org/wiki/Record_linkage), and it's been a huge field of research over many decades already. Luckily for you, there's a whole bunch of tools out there that are ready-made for this sort of thing. Basically, you can point them at your database, set up some cleaning and comparators (like Levenshtein or Jaro-Winkler or ...), and they'll go off and do the job for you.
These tools generally have features in place to solve the performance issues, so that even though Levenshtein is slow they can run fast because most record pairs never get compared at all.
The Wikipedia link above has links to a number of record linkage tools you can use. I've personally written one called Duke in Java, which I've used successfully for exactly this. If you want something big and expensive you can buy a Master Data Management tool.
In your case, probably something like an edit-distance calculation would work, but if you need to find near-duplicates in larger text-based documents, you can try
http://www.softcorporation.com/products/neardup/