Encoding wildcarding, stemming, etc. in simple search

We have a simple search interface which calls the search:search($query-text) function. Is there a syntax to include control for wildcarding, stemming and case sensitivity within the single text string that the function accepts? I haven't been able to find anything in the MarkLogic docs.

See the $options parameter, and in particular the <term> and <term-option> elements, at https://docs.marklogic.com/search:search . There is a guide at http://developer.marklogic.com/learn/2009-07-search-api-walkthrough
with more detail at http://developer.marklogic.com/learn/2009-07-search-api-walkthrough#ndbba3437f320a4a4

I don't know of any existing syntax for those options, aside from the built-in behavior of turning on wildcards when a term contains '*' or '?' and turning on case-sensitivity when the term contains capital letters.
You could develop a syntax. Implementing it might involve a custom parser along the lines of https://github.com/mblakele/xqysp then feeding the resulting cts:query into search:resolve.

Piggybacking on Eric Bloch's answer... you can always dynamically construct your options node based on input in the user interface.
For example, I often do this in order to separate the facet selection portion of the query from the text search portion and put the facet selection query in the additional-query element in the options node.

Related

Wordpress PODS magic tags extend with parameters

I am wondering if there's any function that I can create in order to modify the magic tags behaviour.
Ideally, I would like to use a tag like this {#post_content|120} which would go through my custom function and check if there's a | character, then execute the original magic tag, while trimming text down to 120 characters.
But I don't know where to hook in order to filter this content.
I know that I can pass a function name with the magic tag, but this isn't really helpful, as I need to pass the character-limit parameter, which PODS doesn't support.
Also, I can't create a function for every character limit: I need different limits in a lot of places, and I would end up with tons of functions and no dynamic solution.
Can I somehow trigger a magic tag with a parameter? Any other thoughts about doing this another way?
Thank you!
I don't think that's possible. {#your_field, your_function} is how it works (the function takes the field value as input). You could use different function names like trim_120 and trim_100 and do what you need in there. I guess this is for creating excerpts of different lengths, although there are other ways to do that, e.g. using the the_content filter.

How to remove punctuation from a database in marklogic?

I want to remove punctuation from a database of XML documents in MarkLogic. This is a preprocessing step for machine learning. I'm new to MarkLogic and I don't know how to do that. Is there an XQuery query that could remove punctuation?
To do a mass replacement of all text in the database, and take out punctuation, you could start with something that looks like this code (modified for your needs):
for $doc in cts:search(fn:collection(), ())
for $text in $doc//text()
return xdmp:node-replace($text, text{fn:replace($text, "[\.,;]", "")})
To be honest, that task is much less expensive to do on the source text files themselves - or in MarkLogic by treating the XML as string during the replacement process. Updating nodes one element at a time will be expensive.
Outside of MarkLogic:
use sed, AWK, or a similar tool BEFORE INGESTION
Inside of MarkLogic (as a trigger, perhaps):
use xdmp:quote to change the XML to a string, then replace within that string using fn:replace, and then make XML again with xdmp:unquote
let $new-doc := xdmp:unquote(fn:replace(xdmp:quote($doc), "[\.,;]", ""))
Then either store it by replacing the root node with xdmp:node-replace, or store this version as a property. This all depends on whether the original (punctuated) version matters to you. Or perhaps you just want to keep the original and serve this cleansed version back to someone.
In all cases above, you have to make sure that your replacement does not murder your XML. Also, be aware of the options for the functions above (like how CDATA is handled).
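For the "before ingestion" route, here is a minimal sketch of a replacement that cannot touch the markup. It uses Python's standard xml.etree.ElementTree instead of sed or AWK (my substitution, not something suggested above) and applies the same [.,;] character class only to text nodes, so tags and attribute values are left alone. The function name and the sample document are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Same character class as the fn:replace example: period, comma, semicolon.
PUNCTUATION = re.compile(r"[.,;]")


def strip_punctuation(xml_text: str) -> str:
    """Remove punctuation from text nodes only; tags and attributes are untouched."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # .text is the text before the first child, .tail the text after the end tag.
        if element.text:
            element.text = PUNCTUATION.sub("", element.text)
        if element.tail:
            element.tail = PUNCTUATION.sub("", element.tail)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample document; note the period inside the attribute value.
sample = '<doc id="a.1"><title>Moby-Dick; or, The Whale.</title></doc>'
cleaned = strip_punctuation(sample)
print(cleaned)
```

Be aware that ElementTree drops comments and processing instructions when parsing by default, so this is only suitable if you do not need those preserved.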
Lastly, "This is for machine learning purposes". You do not elaborate. I think many of us here have a feeling that this solution (cleansing punctuation before insert) rubs against the very grain of MarkLogic - in which you store as-is and then have awesome index, tokenizing, stemming, collation, search support to find and return your data as you need. If you were to elaborate on your use case a bit, you may inspire others to give more MarkLogic-Specific suggestions.
It will work if you use 'punctuation-insensitive' and if required 'diacritic-insensitive' in cts:element-word-query()
I'm not sure if this is what you're asking, but it's technically possible to update every document in the database to remove punctuation; however, it's very expensive and I wouldn't recommend it.
Using built-in search functions, you can probably achieve the same goal without updating your documents by querying with punctuation insensitivity. For example, if you want to select documents with a title matching a string regardless of punctuation:
cts:search(//mydoc,
cts:element-word-query(xs:QName('title'), 'Moby-Dick', 'punctuation-insensitive'))
Or in an existing XQuery:
for $d in $documents
where cts:contains($d,
cts:element-word-query(xs:QName('title'), 'Moby-Dick', 'punctuation-insensitive'))
return $d/summary

jsonapi.org correct way to use pagination using the page query string

In the jsonapi documentation for pagination, it says the following:
For example, a page-based strategy might use query parameters such as
page[number] and page[size]
How would I represent this in the query string? With http://localhost:4200/people?page[number]=1&page[size]=25, I don't think using a map-like structure is a valid query string. Only the page parameter is reserved according to the documentation.
I don't think using a map-like structure is a valid query string.
You're right technically, and that's why the spec has the note that says:
Note: The example query parameters above use unencoded [ and ] characters simply for readability. In practice, these characters must be percent-encoded, per the requirements in RFC 3986.
So, page[size] is really page%5Bsize%5D which is a valid query parameter name.
Only the page parameter is reserved according to the documentation.
When the spec text says that only page is reserved, it actually means that any page[......] style query parameter is reserved. (I can tell you that for sure as one of the spec's editors.) But it should say so more explicitly, so I'll open an issue for it.
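To see the percent-encoding in practice, here is a small sketch using Python's standard urllib.parse (my choice of tool; any URL library should behave similarly). The host and path are the ones from the question.

```python
from urllib.parse import parse_qs, urlencode

# Parameter names written the readable way, with literal square brackets.
params = {"page[number]": 1, "page[size]": 25}

# urlencode percent-encodes the brackets in the names: [ -> %5B, ] -> %5D.
query = urlencode(params)
url = "http://localhost:4200/people?" + query
print(url)

# A server-side parser decodes the names back to the bracketed form.
decoded = parse_qs(query)
print(decoded)
```

So client code can keep using page[number] and page[size] as names, as long as the library that builds the URL does the encoding.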

How to construct complex Google Web Search query?

Searching through the Web by using the Google search engine is a de facto standard for Internet users.
Google provides a basic and an advanced form for preparing a query string for its search engine. If you would rather not use the web form, you can simply send an HTTP GET request to the specific URL with a query string built from the search conditions.
For instance I can search for results with word "hello" by doing an HTTP request at:
http://www.google.com/search?q=hello
I can add another word, e.g. "world", as follows:
http://www.google.com/search?q=hello+world
You know, the search can be more "complicated" by specifying nice parameters like:
or condition(s)
exact phrase(s)
search on specific domain(s)
avoid a specific word(s)
search with a specific language
limit search by geographical area
search for document type
etc.
How can I modify the query string to account for the above search parameters?
I carefully examined the answers by Pratik Chowdhury and Robbie Vercammen. They provide links to Web documents that list the textual filters that can be used within the Google search form. Although this is interesting, it doesn't answer the question. Hence, I studied the problem at length and found the following solution.
Suppose that you need to make a one-off HTTP call (e.g. from a PHP class run via cron once a month) to Google Search in order to retrieve the search results for a particular query string, e.g. all the pages with some words (i.e. "hello" and "world") in your website (i.e. mywebsite.com). Then you can make an HTTP GET call to the following address:
http://www.google.com/search?q=hello+world+site:mywebsite.com
The q parameter can contain the whole search query; however, Google also defines a foolproof list of dedicated parameters.
Notice that the AND operator can be represented by the as_q parameter instead.
To get pages containing either "hello" or "world" (i.e. an OR), change the q parameter to:
q=hello+OR+world
while a more compact representation uses the as_oq parameter:
as_oq=hello+world
If one looks for the exact phrase "hello world", the q parameter is:
q="hello+world"
while, again, another compact representation uses the as_epq parameter:
as_epq=hello+world
If one looks for all the results that do not contain the words "hello" and "world", the q parameter is:
q=-hello+-world
while, again, another compact representation uses the as_eq parameter:
as_eq=hello+world
Of course, as_q, as_oq, as_epq, as_eq, etc. can be combined in a single search query as usual (i.e. by using the & character). Thus, for instance, I can search for both words "hello" and "world" plus either "programming" or "code" as follows:
q=hello+world&as_oq=programming+code
One can search within a specific domain (again, mydomain.com) as follows:
as_sitesearch=mydomain.com
However, if you want to exclude a specific domain (e.g., because it is a spam source), you must fall back on the standard notation. E.g.:
q=hello+-site:mydomain.com
returns all the pages with the word "hello" that are not on the site mydomain.com.
To search for a specific file type, e.g. a pdf, you can use as_filetype:
as_filetype=pdf
More complex search parameters can be used, as described in the Google support docs.
For instance, to get also results with a synonym of a word, simply use the ~ operator in front of the word, e.g.
q=~hello
Moreover, if you want to use wildcards, e.g. to get all the exact phrases that start with "hello" and end with "world", you should use the * operator:
q="hello+*+world"
which probably will return something like: "hello to the world" and "hello sweet world".
One can also search for specific words inside the page title or in the page url by using the following keywords (read here for more details):
intitle
allintitle
inurl
allinurl
For instance, the following returns all the pages where both words "hello" and "world" are in the URL:
q=allinurl:hello+world
To set the language of the Google GUI page (not that of the results), pass the language code (e.g. en for English, fr for French, it for Italian, etc.) in the hl parameter. In other words, if one searches with the English version of Google, the query string becomes:
http://www.google.com/search?hl=en&q=hello+world+site:mywebsite.com
To select a specific language, e.g. Italian, use the lr query parameter:
lr=lang_it
One can also select pages published in a specific geographical region by using the cr parameter. E.g., to find all the pages published in Italy:
cr=countryIT
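The parameters above can be assembled programmatically. Below is a minimal sketch with Python's standard urllib.parse; the helper name build_search_url is made up for illustration, and the parameter names are only the ones listed in this answer. It does not check whether Google still honours each of them.

```python
from urllib.parse import urlencode


def build_search_url(**params: str) -> str:
    """Join search parameters into a query string, skipping empty values."""
    filtered = {name: value for name, value in params.items() if value}
    # urlencode turns spaces into '+' and percent-encodes reserved characters.
    return "http://www.google.com/search?" + urlencode(filtered)


# All of "hello world", plus either "programming" or "code",
# PDF files on mydomain.com only, English GUI, Italian results.
url = build_search_url(
    q="hello world",
    as_oq="programming code",
    as_sitesearch="mydomain.com",
    as_filetype="pdf",
    hl="en",
    lr="lang_it",
)
print(url)

# Excluding a domain has to go through q; the ':' comes out as %3A,
# which is the percent-encoded form of the same character.
excluded = build_search_url(q="hello -site:mydomain.com")
print(excluded)
```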
To create complex and / or queries, you can use () and OR.
For example if we want to search for
("tschakk buff" AND "boom bang") OR ("zata tong" AND "zong klirr")
The query would look like this:
https://www.google.com/search?q=("tschakk%20buff"%20"boom%20bang")%20OR%20("zata%20tong"%20"zong%20klirr")
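If you would rather not type the %20 escapes by hand, a sketch like the following builds the same kind of URL with Python's standard urllib.parse.quote. Note that, unlike the hand-written URL above, it also percent-encodes the quotes and parentheses; a server decodes both forms to the same query text.

```python
from urllib.parse import quote, unquote

# The grouped query from the example, written as plain text.
query = '("tschakk buff" "boom bang") OR ("zata tong" "zong klirr")'

# safe="" makes quote() encode every reserved character, spaces included (%20).
url = "https://www.google.com/search?q=" + quote(query, safe="")
print(url)

# Decoding the q value gives back the original query text.
round_trip = unquote(url.split("q=", 1)[1])
print(round_trip)
```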
The title of this book sounds dangerous, but it will answer all your questions as long as you don't misuse it.
The book is "Dangerous Google – Searching for Secrets" by Michał Piotrowski, from hakin9 magazine.
Good luck.
If you are trying to assemble your own url by manually typing the url before using it, this site should prove helpful: http://www.googleguide.com/advanced_operators.html
Advangle is a nice free service where you can construct web-search queries visually and get a query string (or URL to Google and Bing) as the result.

What's the correct format for TCDL linkAttributes?

I can see the technology-independent Tridion Content Delivery Language (TCDL) link has the following parameters, which are pretty well described on SDL Live Content.
type
origin
destination
templateURI
linkAttributes
textOnFail
addAnchor
VariantId
How do we add multiple attribute-value pairs for the linkAttributes? Specifically, what do we use to escape the double quotes as well as separate pairs (e.g. if we need class="someclass" and onclick="someevent").
The separate pairs are just space delimited, like a normal series of attributes. Try XML-encoding the value of linkAttributes, however. So " becomes &quot;, etc.
If you are using some JavaScript, you may need to take care of the JavaScript quotes too, as in \".
Edit: after I figured out your real question, the answer is a lot simpler:
You should wrap the values inside your linkAttributes in single quotes. Spaces inside linkAttributes are typically handled fine; but if not, escape them with %20.
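As a rough sketch of that format, the snippet below builds a linkAttributes value from name/value pairs: space-delimited, with each value in single quotes, and the whole string XML-escaped before it goes into the double-quoted attribute. It is plain string handling in Python, not a Tridion API; the helper name build_link_attributes is made up, and values that themselves contain single quotes are not handled.

```python
from xml.sax.saxutils import quoteattr


def build_link_attributes(pairs: dict) -> str:
    """Space-delimited name='value' pairs, each value wrapped in single quotes."""
    return " ".join(f"{name}='{value}'" for name, value in pairs.items())


# The two attributes from the question.
link_attributes = build_link_attributes(
    {"class": "someclass", "onclick": "someevent"}
)

# quoteattr adds the surrounding double quotes and escapes &, < and >.
tag_fragment = f"linkAttributes={quoteattr(link_attributes)}"
print(tag_fragment)
```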
If you need something more or want something that isn't handled by the standard tcdl:ComponentLink, remember that you can always create your own TCDL tag and use a TagHandler or TagRenderer (look them up in the docs for examples or search for Jaime's article on TagRenderer) to do precisely what you want.
My original answer was to a question you didn't ask: what is the format for TCDL tags (in general). But the explanation might still be useful to some, so remains below.
I'd suggest having a look at what format the default building blocks (e.g. the Link Resolver TBB in the Default Finish Actions) output and using that as a guideline.
This is what I could quickly get from the transport package of a published page:
<tcdl:Link type="Page" origin="tcm:5-199-64" destination="tcm:5-206-64"
templateURI="tcm:0-0-0" linkAttributes="" textOnFail="true"
addAnchor="" variantId="">Home</tcdl:Link>
<tcdl:ComponentPresentation type="Embedded" componentURI="tcm:5-69"
templateURI="tcm:5-133-32">
<span>
...
One of the things that I know from experience: your entire TCDL tag will have to be on a single line (I wrapped the lines above for readability only). Or at least that is the case if it is used to invoke a REL TagRenderer. Clearly the tcdl:ComponentPresentation tag above will span multiple lines, so that "single line rule" doesn't apply everywhere.
And that is probably the best advice: given that TCDL tags are processed at multiple points in the Tridion Publishing, Deployment and Delivery pipeline, I'd stick to the format that the default TBBs output. And from my sample that seems to be: put everything on a single line and wrap the values in (double) quotes.

Resources