Speed of different ways of doing a full-text search in XQuery (BaseX)

I am using BaseX 9.3.1 to access a dictionary of about 45,000 entries with roughly a million examples, with a full-text index as well as indexes on all text nodes and attributes. So far, so good. Accessing the nodes for grammar or meaning is quite fast. Accessing the examples, however, is very slow, but only in some queries, not in all. I tried to find the reason for the following but simply did not succeed with the BaseX documentation or various XQuery pages; maybe I just can't see the forest for the trees. Here's the issue:
ft:search("myDatabase", "mySearchString")
is very fast, responding in less than a second, whereas
ft:contains(db:open("myDatabase")/path/to/target/node, "mySearchString")
as well as
db:open("myDatabase")/path/to/target/node contains text "mySearchString"
and
db:open("myDatabase")/path/to/target/node[contains(./text(), "mySearchString")]
are so slow (>30 s) and heavy that they sometimes even break the server.
I tried to find out the differences by changing options for stemming or the collation, but that made no noticeable difference in speed. According to the docs, ft:contains() is an extended version of standard XQuery's contains text, while ft:search() appears to be a BaseX-specific full-text function (see the BaseX docs).
ft:search() works quite well with some additional code to get at the desired content, but that is really more of a workaround. I need to understand why directly accessing the path is so slow even though there is an index for it as well. Comparing against text() directly, instead of any variant of contains or matches, is faster too, but not an option for the application, since the search must match parts of the text content, not the exact text.
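For illustration, the workaround looks roughly like this: let the index-backed ft:search() find the matching text nodes first, then navigate from each hit to its surrounding entry (entry is a placeholder element name, not my actual schema):

(: fast: ft:search() uses the full-text index; we then climb to the entry :)
for $hit in ft:search("myDatabase", "mySearchString")
return $hit/ancestor::entry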
Any suggestion/idea/help is appreciated!

Differences between the Google Translate API URLs

I am building an open-source Chrome extension based on Google Translate.
I have read the other questions about the Google Translate API (like this one and this one), but I still don't have my answer.
I found several URLs for Google translate like these:
https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=fr&dt=t&q=father&ie=UTF-8&oe=UTF-8
https://clients5.google.com/translate_a/t?client=dict-chrome-ex&sl=en&tl=fr&dt=t&q=father
It seems all the URLs are different combinations of three parts:
a base URL:
https://translate.googleapis.com/translate_a/
https://translate.google.com/translate_a/
https://clients5.google.com/translate_a/
the first path segment after translate_a/: either single or t
the client parameter, which can be gtx, t, or dict-chrome-ex (or apparently any ID)
So far, the only differences I have seen are in the returned JSON.
This URL https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=fr&dt=t&q=father&ie=UTF-8&oe=UTF-8 returns this JSON:
[[["père","father",null,null,1]
]
,null,"en"]
While this URL https://clients5.google.com/translate_a/t?client=dict-chrome-ex&sl=en&tl=fr&dt=t&q=father returns this JSON:
{"sentences":[{"trans":"père","orig":"father","backend":1},{"src_translit":"ˈfäT͟Hər"}],"dict":[{"pos":"noun","terms":["père"],"entry":[{"word":"père","reverse_translation":["father","dad","parent","papa"],"score":0.70910621,"previous_word":"le","gender":1}],"base_form":"father","pos_enum":1},{"pos":"verb","terms":["engendrer","concevoir"],"entry":[{"word":"engendrer","reverse_translation":["generate","engender","give rise to","beget","breed","father"],"synset_id":[52561],"score":0.00017133754},{"word":"concevoir","reverse_translation":["design","conceive","devise","plan","form","father"],"synset_id":[52561],"score":4.8327973e-05}],"base_form":"father","pos_enum":2}],"src":"en","alternative_translations":[{"src_phrase":"father","alternative":[{"word_postproc":"père","score":1000,"has_preceding_space":true,"attach_to_next_token":false}],"srcunicodeoffsets":[{"begin":0,"end":6}],"raw_src_segment":"father","start_pos":0,"end_pos":0}],"confidence":1,"ld_result":{"srclangs":["en"],"srclangs_confidences":[1],"extended_srclangs":["en"]},"query_inflections":[{"written_form":"father","features":{"number":2}},{"written_form":"fathers","features":{"number":1}}],"target_inflections":[{"written_form":"père","features":{"gender":1,"number":2}},{"written_form":"pères","features":{"gender":1,"number":1}},{"written_form":"père","features":{"number":2}},{"written_form":"pères","features":{"number":1}}]}
So my question is: what are the differences (other than those noted above) between the various combinations?
In which cases should I use one rather than the other (apart from the returned JSON)? Is one of them deprecated, or does one of them support more requests?
For the meaning of the query parameters: https://stackoverflow.com/a/29537590/3154274
Regarding your actual question: I'm not sure there are any meaningful differences other than what you have stated, and it would be difficult to determine if and when any of them will be deprecated.
Given that these APIs are undocumented and don't appear to be intended for this kind of usage, I don't think any of them should be considered for a real application.
However, to solve your underlying problem of finding a free human-language translation API, I would recommend the Azure Translator Text API, which includes translation of 2 million characters per month in its free tier.
For your specific use case, where I assume there will be a high number of duplicate translations, caching the results should significantly reduce your usage.
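A minimal sketch of that caching idea in Python (the file name, key scheme, and the translate() placeholder are all illustrative; wire translate() to whichever service you pick):

import json
import os

CACHE_FILE = "translation_cache.json"
_cache = {}
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, encoding="utf-8") as f:
        _cache = json.load(f)

def translate(text, source, target):
    # Placeholder: call your chosen translation API here.
    raise NotImplementedError

def cached_translate(text, source="en", target="fr"):
    key = source + "|" + target + "|" + text
    if key not in _cache:  # only hit the API on a cache miss
        _cache[key] = translate(text, source, target)
        with open(CACHE_FILE, "w", encoding="utf-8") as f:
            json.dump(_cache, f, ensure_ascii=False)
    return _cache[key]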

Performance of large collection in Meteor 1.0.X

There has been a LOT of development in the Meteor world, and it's getting hard to find answers that work for current versions amid the plethora of answers for old, outdated versions.
I have an app that has a LOT of data in a particular collection. By a lot I mean somewhere between 10k and 100k documents, and potentially far more. Essentially it's log data, and I need to display the results in a table with no pagination (like a tail). In researching ways to optimize large collections, I keep running into advice like this that seems to be aimed at older versions of Meteor.
So, as I see it my options are:
Use the fast-render plugin to display the page prior to the subscription (at least this is my understanding of how it works).
Use some sort of progressive publish function that loads a limited set of the most relevant data first and then progressively loads the rest by expanding the window/limit (not sure whether this would put a heavier load on the server, though). There seems to have been a "progressive publish" plugin, but it no longer appears to be under active development.
Optimize the lookups via indexing (but how do you specify that when creating the collection?).
Profiling and optimizing the template further (not sure how).
Some other method I haven't thought of yet...
Some combination of all-the-above.
What is the proper approach by which to publish and render lots of data in this way?
I'm going to assume that "optimize" means reduced query time.
Always start with the biggest bang for your buck.
Unless you're publishing the entire collection, or querying on _id, you will want to create an index using _ensureIndex. You can find more info on the MongoDB website or by searching other questions: http://docs.mongodb.org/manual/reference/method/db.collection.ensureIndex/
Second, limit the fields to just the info you need, e.g. {fields: {a: 1, b: 1}}. http://docs.meteor.com/#/full/fieldspecifiers
Third, don't sort.
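A rough sketch of the first two tips in Meteor (the collection, publication, and field names are made up):

// Shared declaration; the rest is server-only.
Logs = new Mongo.Collection('logs');

if (Meteor.isServer) {
  Meteor.startup(function () {
    // First tip: index the field your publication queries on.
    Logs._ensureIndex({ createdAt: -1 });
  });

  Meteor.publish('recentLogs', function (limit) {
    // Second tip: only publish the fields the table actually renders.
    return Logs.find({}, {
      fields: { message: 1, createdAt: 1 },
      limit: limit  // the client can expand this window progressively
    });
  });
}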
If this still isn't good enough, ask another question with your schema, query details, and the desired UI so we can better understand the reactivity and why you can't use some form of pagination.

Sweave/ODFWeave and tracking code chunks

I am getting started with the reproducible research tools in R, and I'm pretty excited about the prospects. Sweave/knitr/Markdown, all that stuff is great. I use RStudio, and they have done a great job of integrating those tools; I hear that StatET does a nice job of putting all that together as well.
I don't write academic papers in LaTeX, and all the people I work with use Word, so I am very interested in an effective workflow for producing documents with odfWeave.
My usual process is:
Develop the code chunks in my IDE (RStudio, in my case)
Go back, insert these into an ODT document, and fill in the surrounding text.
Run odfWeave.
My problem is that I lose track of code chunks when putting them into the ODF document. Keeping the ODF document in sync as I create the code is annoying, so I'd rather wait and insert the code chunks by name.
So finally, here are my questions:
What are people's suggestions for tracking code chunks or on how to optimize this workflow?
Can anyone recommend tools or tips for keeping track of the code chunks you write?
Being a software geek and a data nerd, I naturally imagine a piece of software doing this for me. Like I'd have a database of code chunks, and when writing the ODF document I'd be able to click on a chunk to insert it into my ODF file.
Has anyone created this sort of thing?
When you check the number of questions tagged odfweave on SO, you will notice that it is rarely used compared to Sweave and the knitr offshoots. I do not fully understand why it did not take off, possibly because table generation is such a nuisance (at least that is what I remember from my attempts).
Since many customers insist on Word documents, we currently use two alternatives:
Create HTML, e.g. with RStudio/knitr/rmd, and open it in Word. This is not really a good workflow; to get a reasonable document you need a lot of manual post-processing, but it works more or less.
You can also go via RDCOM. I don't remember the current state of the art here, because we gave up on it entirely once its licensing conditions stopped being transparent to us.
Use pandoc. This approach produces documents that do not need manual post-processing in MS Word, but the range of features for creating a nice layout (cross-linked images, figure numbering) is limited; it may also be that we are simply not yet good enough at using pandoc to its full potential.
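For reference, a minimal sketch of the knitr-plus-pandoc route in R (the file names are placeholders, and pandoc must be on the PATH):

# Knit R Markdown to plain Markdown, then convert to .docx with pandoc.
library(knitr)
knit("report.Rmd", output = "report.md")
system("pandoc report.md -o report.docx")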

Good whitelist for search terms

I'm implementing a simple search on a website, and right now I'm working on sanitizing the input. My plan is to make a whitelist of allowed characters. I'm using PHP, and so far I've got this regex:
preg_replace('/[^a-z0-9 -]/i', '', $s);
So, I'm removing anything that's not alphanumeric or a space or a hyphen.
Is there a generally accepted whitelist for this sort of thing, or does it just depend on the application? I'm going to be searching on book titles, author names and book blurbs.
What about 2010 (A Space Odyssey)? What about Giscard d'Estaing's autobiography? This is really impossible to answer generally; it will depend on your application and data structures.
You want to look into the full-text search functions of your database of choice, or even a specialized search engine like Sphinx.
Clarify what engine you will use first to actually perform your search, and the rules on what you need to strip out will become much clearer.
Google has some pretty advanced rules for searches, but their basic rule is this:
Generally, punctuation is ignored, including @#$%^&*()=+[]\ and other special characters.
However, Google makes exceptions for common search terms, like C++, C#, or $100.
If you want a search as sophisticated as Google's, you can make rules against the above punctuation and add some exceptions. For a simple search, however, just ignore the characters that Google generally ignores.
There's no generic regular expression that solves this problem. Your code strips out a lot of things you might want to keep, like commas, exclamation points, (semi-)colons, and non-English letters. If you have a full list of the titles in your database, you should be able to write a script that constructs the set of all characters found in those titles. If your regular expression strips out any of those characters, you risk having problems (although passing this test doesn't mean you won't run into problems).
Depending on how the rest of your search is implemented, you may be able to strip out valid characters and still return relevant results. In that case, you would want your expression to allow non-English characters (since you don't want to split a word), but you might be able to remove all punctuation marks that aren't inside a quote-delimited phrase. For example, searching for red haired should return all of the results for red-haired plus a few extra.
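Putting these suggestions together, here is a rough PHP sketch of the question's preg_replace with a Unicode-aware whitelist and a small exception list (the exception terms and the \x01 placeholder scheme are purely illustrative):

<?php
// Protect common exception terms, strip everything outside the whitelist,
// then restore the protected terms.
function sanitize_query(string $s): string {
    $exceptions = ['C++', 'C#'];  // illustrative; extend per your data
    foreach ($exceptions as $i => $term) {
        $s = str_ireplace($term, "\x01$i\x01", $s);
    }
    // Keep Unicode letters/digits, spaces, $, ., hyphens, and the placeholder byte.
    $s = preg_replace('/[^\p{L}\p{N} $.\x01-]/u', '', $s);
    foreach ($exceptions as $i => $term) {
        $s = str_replace("\x01$i\x01", $term, $s);
    }
    return $s;
}

echo sanitize_query('C++ & "2001: A Space Odyssey" for $100!');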

Interpreting Search Results

I am tasked with writing a program that, given a search term and the HTML source of a page of search results from some unknown search engine (it can really be anything: a blog, a shop, Google, eBay, ...), builds a data structure of the results containing "what's in the results": a title for each result, the "details" link, the position within the results, etc. It is not known whether the results page contains any of that data at all, or whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
What I am looking for is not BeautifulSoup or a regexp, but rather some clever ideas or algorithms for interpreting the HTML source. What do I do to find out which part of the page constitutes a single result item? How do I filter out the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I am trying to do are also greatly appreciated.
Thanks, Simon
I doubt that there exists a silver-bullet algorithm that will just work on any arbitrary search output without any training.
However, this task can be solved, and actually is solved in many applications, but with a different approach. First you have to define the general structure of a single search result item based on what you actually plan to do with it (it could be name, date, link, description snippet, etc.), and then write a number of HTML parsers that extract the necessary fields from the search output of particular websites.
I know it is not a super sexy solution, but it is probably the only one that works, and it is not rocket science. Writing parsers is actually extremely simple; you can make a dozen per day. If you look into the HTML source of search results, you will notice that the output is typically very structured and marked with specific div sections or class attributes, so it is very easy to find in the document. You don't even have to use any complicated HTML parsing library for that; something grep-like will be enough.
For example, on this particular page your question starts with <div class="post-text"> and ends with </div>. Everything in between is the post text, with some HTML formatting that you may want to remove along with extra spaces and "\n". And this <div class="post-text"> appears on the page only once.
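A quick sketch of that grep-like idea in Python (the URL is a placeholder; the class name post-text is the one cited above, and the non-greedy match is deliberately naive about nested divs):

import re
import urllib.request

# Fetch a result page and cut out the first post-text block.
html = urllib.request.urlopen("https://example.com/some-page").read().decode("utf-8")
match = re.search(r'<div class="post-text">(.*?)</div>', html, re.DOTALL)
if match:
    text = re.sub(r"<[^>]+>", " ", match.group(1))  # drop inline tags
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace and \n
    print(text)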
Once you go to a large scale with your retrieval application, you will find out that there is not that big a variety of search engines across different sites, and you will be able to reuse parsers you have already created for sites using similar search engines.
The only thing you have to remember is built-in self-testing. Sites tend to upgrade and change their design from time to time. If your application is going to live for a while, you will need to include some logic in your parsers that checks the validity of their results and notifies you whenever the search output has changed and is no longer compatible with your parser. Then you will have to modify the particular parser or write a new one.
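That self-test can be as small as this sketch (parse, sample_page, and the field names are illustrative):

def check_parser(parse, sample_page, expected_min=1):
    # Run the parser on a stored sample page and alert when the layout drifts.
    results = parse(sample_page)
    if len(results) < expected_min or any(not r.get("title") for r in results):
        raise RuntimeError("search output changed; parser needs updating")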
Hope this helps.
