When I put 'chromosome' in Google, it shows the meaning, phonetic notation, and usage of 'chromosome', among other things. I think it is quite useful and I often use this function to look up words. But not all words you put into the engine give you the dictionary result, so I am wondering if there is a way to force Google to do it.
Prefix define to your search query to get definitions.
For example,
define chromosome
I have a few XML documents in MarkLogic which have this structure:
<abc:doc>
  <abc:doc-meta>
    <abc:meetings>
      <abc:meeting>
      </abc:meeting>
      <abc:meeting>
      </abc:meeting>
    </abc:meetings>
  </abc:doc-meta>
</abc:doc>
We can have more than one <abc:meeting> element under the <abc:meetings> element.
I am trying to write a cts:search query to get only documents that have more than one <abc:meeting> element in the document.
Please advise
This is tricky. Ideally, you'd want to drive searches from indexes for best performance. Unfortunately, MarkLogic doesn't keep track of element counts in its universal index, and aggregating counts from a range index can be cumbersome.
The simplest overall solution would be to add a count attribute on abc:meetings, and then add a range index on that. It does mean you'd have to change your data, and you'd have to keep that attribute in sync with each change.
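For example, a minimal sketch of that approach, assuming a count attribute has been added to abc:meetings and an attribute range index of type xs:int has been configured on it (the namespace URI below is hypothetical, use your own):

declare namespace abc = "http://example.com/abc"; (: hypothetical URI :)

(: driven entirely from the attribute range index on @count :)
cts:search(
  collection(),
  cts:element-attribute-range-query(
    xs:QName('abc:meetings'),
    xs:QName('count'),
    '>', 1
  )
)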
You could also just search on the presence of abc:meeting with cts:element-query(), and append an XPath predicate to count the number of elements afterwards. Something like:
cts:search(
  collection(),
  cts:element-query(xs:QName('abc:meeting'), cts:true-query())
)[count(.//abc:meeting) > 1]
If not many documents contain meetings, this might work fairly well for you, but it still requires pulling up all documents containing meetings, hence could be expensive.
I played with the thought of leveraging cts:near-query(), but that is driven by word positions, so it depends on the actual number of tokens inside a meeting. If that were always an exact number of tokens (unlikely, I'd guess), you could use the minimal-distance option on a double cts:element-query() wrapped in a cts:near-query(). It might help optimize the previous option a little, though.
The most performant option I can think of right now involves adding a user-defined aggregate function (UDF). It unfortunately means compiling C++ code. I happen to have written such a UDF in the past, which you should be able to use as-is after compilation and installation. For details, see:
https://github.com/grtjn/doc-count-udf
and
http://docs.marklogic.com/guide/app-dev/aggregateUDFs
HTH!
It boils down to how many "a few" is. If it's thousands or fewer, then what grtjn presents above for a cts:search plus an XPath expression will work fine. If it's more, I'd add the count attribute to abc:meetings and then use a pre-commit trigger (e.g. on the collection of these documents) to ensure that the count attribute value is kept in sync. You'd need a range index to be able to query for "documents that have a count of meetings of 2 or greater".
Of course, if all you need to query on is whether there's more than one meeting, then just add a "multiple" attribute to abc:meetings with a value of "true". Then you don't need a range index - you can do a cts:element-attribute-value-query on abc:meetings and multiple="true".
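A minimal sketch of that last idea, assuming the "multiple" attribute is in place (the namespace URI below is hypothetical, use your own):

declare namespace abc = "http://example.com/abc"; (: hypothetical URI :)

(: find documents flagged as having more than one meeting :)
cts:search(
  collection(),
  cts:element-attribute-value-query(
    xs:QName('abc:meetings'),
    xs:QName('multiple'),
    'true'
  )
)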
It's a Drupal site with Solr for search. Mainly, I am not satisfied with the current search results for Chinese. The tokenizer has broken the words into what are supposed to be small pieces. Most of them are reasonable, but it still makes mistakes by not treating something as a valid token, either breaking it into pieces when it shouldn't or not breaking it when it should.
Assume I am writing Chinese now: 'big data analysis' is one word which shouldn't be broken, so my search on it should find it. I also want people to find 'AI and big data analysis training' as the first hit when they search for that exact phrase.
So I want a way to intervene in, or compensate for, the current tokens to make the search smarter.
Maybe there is a file in Solr that allows me to manually write these tokens down and relate them to certain phrases? Then every time it indexes, Solr could use it as a reference.
There are different steps to achieve what you want:
1) I don't see an extremely big problem with your "over-tokenization":
"big data analysis is one word which shouldn't be broken. So my search on it should find it." -> your search will find it even if it is tokenized. I understand this was an example and the actual words are Chinese, but I suspect a different issue there.
2) You can use the edismax [1] query parser with phrase boosts at various levels to boost subsequent tokens or phrases (pf, pf2, pf3... ps, ps2, ps3...)
[1] https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html , https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html#TheExtendedDisMaxQueryParser-ThepsParameter
There is a list of proper names of stars here: https://www.wikidata.org/wiki/Q1433418
How can I query this in the Wikidata Query Service so that all individual names of stars are listed, along with other data in the list, such as constellation?
In other words, how do I get at the members of the list? "Instance of" doesn't seem to work.
There is some confusion here coming from the fact that this List of proper names of stars (Q1433418) is an item centralizing links to the Wikipedia pages playing this role in the different Wikipedia editions, but it isn't really playing any meaningful role in Wikidata: there are no items that are an instance of (P31) List of proper names of stars (Q1433418) in Wikidata.
You would have more luck looking for items that are an instance of (P31) star (Q523), or an instance of an item that is a subclass of (P279) star, a pattern that you will find in many of the SPARQL query examples: ?star wdt:P31/wdt:P279* wd:Q523 .
That could give this query (json version).
And if you're into JS, you can parse the JSON result with this function I wrote: wdk.simplifySparqlResults
I would not take official names of stars from there. Wikipedia is one of the most useful resources for getting first-hand, somewhat organised information on any topic. It is irreplaceable for this, and it would be a great mess not to have it. However, the information is very sensitive to misuse caused by vandalism or clumsy editors.
To get the (only) official proper names of stars, the IAU is making an effort that started this year. I would use this as the reference. It is also stored in a text file which is easy to retrieve programmatically, and it is being updated as the Committee accepts more star names. It is here:
http://www.pas.rochester.edu/~emamajek/WGSN/IAU-CSN.txt
In fact, as you can see, the file structure is presented in a format ready to be used by software applications. It has been made to meet needs like yours.
When you need to create a URL that takes a finite set of parameters, where all of said parameters are semantically at the same "level", what is the current consensus around the use of delimiters within URLs? Here's an example:
/myresource/thing1,thing2,thing3
/myresource/thing2,thing1
/myresource/thing1;thing2;thing3
/myresource/thing1;thing3
That is to say, the parameter here could be a single, a pair or a triple. They can be specified in any order because they are not a logical tree, and thing2 is not a subordinate resource of thing1, so doing something like this seems "wrong":
/myresources/thing1/thing2/thing3
This bothers me because it implies a tree-like relationship between the elements of the triple, and that is not the case (despite many HTTP frameworks seemingly pushing this, wrongly in my view). In addition, using a query string doesn't feel right as this is not a search operation, it is a known triple in a very finite space - there's nothing to query or search, so to speak.
I suppose the other option would be to make it a POST request and supply a body that details the parts of the triple being supplied. This doesn't give me warm fuzzies though, for some reason.
How have others handled this? Delimiters seem clean to me, and communicate the intended semantics of the resource, but I know there are folks who would take a different view, and I was looking to understand the experiences of others who've had similar use cases.
Since any value can be missing and values can appear in any order, how would you know which value is for which parameter (if that matters)?
I would use a query string for GET, or put the values in the payload for POST.
Use query parameters
/path/to/the/resource?key1=value1&key2=value2&key3=value3
or matrix parameters
/path/to/the/resource;key1=value1;key2=value2;key3=value3
Without a proper example, I'm not sure exactly about your needs.
However, a little-known fact is that any HTTP parameter can have multiple values. It is the way to go when you have a set of objects (see the Google Maps Static API for an example).
/path/to/the/resource?things=thing1&things=thing2&things=thing3
Then you can use the same API for single, pairs, triples (and more).
I am working with the MarkLogic tool.
I have a database of around 27,000 documents.
What I want to do is retrieve the keywords which have the maximum frequency in the documents returned as the result of any search query.
I am currently using XQuery functions to count the frequency of each word in the set of all documents retrieved as the query result. However, this is quite inefficient.
I was thinking that it would help me if I could get the list of words on which MarkLogic has performed indexing.
So is there a way to retrieve the list of indexed words from the universal index of MarkLogic?
Normally you would use something like this in MarkLogic (it requires an element range index on the element in question):
(
  for $v in cts:element-values(xs:QName("myelem"))
  let $f := cts:frequency($v)
  order by $f descending
  return $v
)[1 to 10]
This kind of functionality is built into the search:search library, which is very convenient.
But unfortunately you cannot use that on values from cts:words and the like (note that cts:words requires the word lexicon to be enabled on the database). There is a little trick that could get you close, though. Instead of using cts:frequency, you could use xdmp:estimate on a cts:search to get a fragment count:
(
  for $v in cts:words()
  let $f := xdmp:estimate(cts:search(collection(), cts:word-query($v)))
  order by $f descending
  return $v
)[1 to 10]
The performance is lower, but it is still much faster than bluntly running through all documents.
HTH!
What if your search contains multiple terms? How will you calculate the order?
What if some of your terms are very common in your corpus of documents, and others are very rare? Should the count of "the" contribute more to the score than "protease", or should they contribute the same?
If the words occur in the title vs elsewhere in the document, should that matter?
What if one document is relatively short, and another is quite long. How do you account for that?
These are some of the basic questions that come up when trying to determine relevancy. Most search engines use a combination of term frequency (how often do the terms occur in your documents), and document frequency (how many documents contain the terms). They can also use the location of the terms in your documents to determine a score, and they can also account for document length in determining a score.
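As a rough illustration (this is the classic tf-idf weighting, not necessarily the exact formula MarkLogic uses), a term $t$ in a document $d$ would score

$\mathrm{score}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}$

where $\mathrm{tf}(t, d)$ is how often $t$ occurs in $d$, $\mathrm{df}(t)$ is the number of documents containing $t$, and $N$ is the total number of documents, so a ubiquitous word like "the" contributes far less than a rare word like "protease".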
MarkLogic uses a combination of term frequency and document frequency to determine relevance by default. These factors (and others) are used to determine a relevance score for your search criteria, and this score is the default sort order for results returned by search:search from the Search API or by the low-level cts:search and its supporting operators.
You can look at the details of the options for cts:search to learn about some of the different scoring options. See 'score-logtfidf' and others here:
http://community.marklogic.com/pubs/5.0/apidocs/SearchBuiltins.html#cts:search
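As a quick sketch (the word query and the term "protease" are just placeholders), you can pass one of those scoring options to cts:search and read each hit's score with cts:score:

(: sketch: log(tf)*idf scoring, returning URI and score for the top hits :)
for $hit in cts:search(
  collection(),
  cts:word-query("protease"),
  "score-logtfidf"
)[1 to 10]
return element result {
  attribute score { cts:score($hit) },
  xdmp:node-uri($hit)
}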
I would also look at the search developers guide:
http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf
Many of the concepts are under consideration by the XQuery working group as enhancements for a future version of XQuery. They aren't part of the language today. MarkLogic has been at the forefront of search for a number of years, so you'll find there are many features in the product, and a lot of discussion related to this area in the archives.
"Is there a way to retrieve the list of indexed words from the universal index of marklogic?" No. The universal index is a hash index, so it contains hashes not words.
As noted by others you can create value-based lexicons that can list their contents. Some of these also include frequency information. However, I have another suggestion: cts:distinctive-terms() will identify the most distinctive terms from a sequence of nodes, which could be the current page of search results. You can control whether the output terms are just words, or include more complex terms such as element-word or phrase. See the docs for more details.
http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/apidoc/SearchBuiltins.xml&category=SearchBuiltins&function=cts:distinctive-terms
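For instance, a minimal sketch (the search term is just a placeholder) that feeds the current page of results into cts:distinctive-terms:

(: sketch: derive distinctive terms from the first page of search results :)
let $page := cts:search(collection(), cts:word-query("protease"))[1 to 10]
return cts:distinctive-terms($page)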
I have used cts:distinctive-terms(). In my case it gives mostly wildcarded terms, which are not of much use. Further, it is suited to finding distinctive terms in a single document; when I try to run it on many documents it is quite slow.
What I want to implement is a dynamic facet which is populated with the keywords of the documents that come up in the search result. I have implemented it, but it is inefficient as it counts the frequency of all the words in the documents. I want it to be a suggestion or recommendation feature: if you have searched for this particular term or phrase, then you may be interested in these suggested terms or phrases. So I want an efficient method to find the terms which are common in the result set of documents of a search.
I tried cts:words() as suggested. It gives words similar to the search query word and the number of documents in which each is contained. What it does not take into account is the set of search result documents. It just shows the number of documents in the whole database which contain similar words, irrespective of whether these documents are present in the search result or not.