How to construct a complex Google Web Search query? - http

Searching through the Web by using the Google search engine is a de facto standard for Internet users.
Google provides a basic and an advanced form for preparing a query for its search engine. If you would rather not use the web form, you can simply send an HTTP GET request to the search URL with a query string built from the search conditions.
For instance, I can search for results containing the word "hello" by sending an HTTP request to:
http://www.google.com/search?q=hello
I can add another word, e.g. "world", as follows:
http://www.google.com/search?q=hello+world
The search can be made more "complicated" by specifying additional conditions such as:
OR condition(s)
exact phrase(s)
search on specific domain(s)
exclude specific word(s)
search in a specific language
limit search by geographical area
search for document type
etc.
How can I modify the query string to account for the above search parameters?

I carefully examined the answers by Pratik Chowdhury and Robbie Vercammen. They provide links to Web documents that list the textual filters that can be typed into the Google search form. Although this is interesting, it does not answer the question. Hence, I studied the problem in depth and found the following solution.
Suppose that you need to make a one-off HTTP call (e.g. from a PHP class run via cron once a month) to Google Search in order to retrieve the results for a particular query, e.g. all the pages containing some words (i.e. "hello" and "world") on your website (i.e. mywebsite.com). You can then send an HTTP GET request to the following address:
http://www.google.com/search?q=hello+world+site:mywebsite.com
The q parameter can contain the whole search query; however, Google also defines a more foolproof set of dedicated parameters.
Note that an AND of several words can be expressed with the as_q parameter instead.
To get pages containing either "hello" or "world" (i.e. an OR), change the q parameter to:
q=hello+OR+world
while a more compact representation uses the as_oq parameter:
as_oq=hello+world
If one looks for the exact phrase "hello world", the q parameter is:
q="hello+world"
while, again, another compact representation uses the as_epq parameter:
as_epq=hello+world
If one looks for all the results that do not contain the words "hello" and "world", the q parameter is:
q=-hello+-world
while, again, another compact representation uses the as_eq parameter:
as_eq=hello+world
Of course, as_q, as_oq, as_epq, as_eq, etc. can be combined in a single search query as usual (i.e. by using the & character). For instance, to search for pages containing both "hello" and "world" plus at least one of "programming" and "code":
q=hello+world&as_oq=programming+code
One can restrict the search to a specific domain (e.g. mydomain.com) as follows:
as_sitesearch=mydomain.com
However, if you want to exclude a specific domain (e.g. because it is a spam source), you must fall back on the standard operator notation inside q. E.g.:
q=hello+-site:mydomain.com
returns all the pages containing the word "hello" that are not on the site mydomain.com.
To search for a specific file type, e.g. PDF, you can use as_filetype:
as_filetype=pdf
More complex search operators can be used, as described in the Google support docs.
For instance, to get also results with a synonym of a word, simply use the ~ operator in front of the word, e.g.
q=~hello
Moreover, if you want to use wildcards, e.g. to get all the exact phrases that start with "hello" and end with "world", you should use the * operator:
q="hello+*+world"
which probably will return something like: "hello to the world" and "hello sweet world".
One can also search for specific words inside the page title or in the page url by using the following keywords (read here for more details):
intitle
allintitle
inurl
allinurl
For instance, the following returns all the pages whose URL contains both words "hello" and "world":
q=allinurl:hello+world
To set the language of the Google interface (not the language of the results), pass the language code (e.g. en for English, fr for French, it for Italian, etc.) in the hl parameter. In other words, to search with the English version of Google, the URL becomes:
http://www.google.com/search?hl=en&q=hello+world+site:mywebsite.com
To restrict the results to a specific language, e.g. Italian, use the lr query parameter:
lr=lang_it
One can also select pages published in a specific geographical region by using the cr parameter. E.g., to find all the pages published in Italy:
cr=countryIT
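The parameters described above can be combined programmatically. The following is a minimal Python sketch, assuming the parameter names listed in this answer (as_q, as_oq, as_sitesearch, as_filetype, hl, lr) are still accepted by Google; the helper name build_search_url is hypothetical and only illustrates how the query string is assembled and percent-encoded.

```python
from urllib.parse import urlencode

# Endpoint used throughout this answer.
BASE_URL = "http://www.google.com/search"


def build_search_url(**params):
    """Return a search URL; values are percent-encoded and spaces become '+'."""
    # Drop empty values so that unused filters do not appear in the URL.
    cleaned = {name: value for name, value in params.items() if value}
    return f"{BASE_URL}?{urlencode(cleaned)}"


url = build_search_url(
    as_q="hello world",          # all of these words (AND)
    as_oq="programming code",    # at least one of these words (OR)
    as_sitesearch="mydomain.com",
    as_filetype="pdf",
    hl="en",                     # interface language
    lr="lang_it",                # language of the results
)
print(url)
```

Letting urlencode do the escaping avoids hand-typing "+" and "&", and it also handles characters such as quotes if a raw q value is passed instead.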

To create complex AND/OR queries, you can use parentheses and OR.
For example, if we want to search for:
("tschakk buff" AND "boom bang") OR ("zata tong" AND "zong klirr")
The query would look like this:
https://www.google.com/search?q=("tschakk%20buff"%20"boom%20bang")%20OR%20("zata%20tong"%20"zong%20klirr")
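A query like the one above can also be assembled and encoded in code rather than by hand. This is a small Python sketch under the assumption that a fully percent-encoded q value is acceptable; the helper and_group is hypothetical. Unlike the hand-written URL, it also encodes the parentheses and double quotes, which decodes to the same query text.

```python
from urllib.parse import quote, unquote


def and_group(*phrases):
    """Wrap exact phrases in quotes and group them, e.g. ("a b" "c d")."""
    quoted = " ".join(f'"{phrase}"' for phrase in phrases)
    return f"({quoted})"


# Two AND groups joined by OR, as in the example above.
query = " OR ".join([
    and_group("tschakk buff", "boom bang"),
    and_group("zata tong", "zong klirr"),
])

# safe="" makes quote() escape every reserved character, including spaces.
url = "https://www.google.com/search?q=" + quote(query, safe="")
print(url)

# Decoding the q value gives back the original query text.
print(unquote(url.split("?q=", 1)[1]))
```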

Although this book's title sounds dangerous, it will answer all your questions as long as you don't misuse it.
The book is "Dangerous Google – Searching for Secrets" by Michał Piotrowski, published by hackin9 magazine.
Wish ya luck

If you are assembling the URL by hand before using it, this site should prove helpful: http://www.googleguide.com/advanced_operators.html

Advangle is a nice free service where you can construct web-search queries visually and get a query string (or URL to Google and Bing) as the result.

Related

Purpose of tilde delimited values in URL fragment instead of GET params

I came across an unusual URL structure on a site. It looked like this:
https://www.agilealliance.org/glossary/xp/#q=~(infinite~false~filters~(postType~(~'post~'aa_book~'aa_event_session~'aa_experience_report)~tags~(~'xp))~searchTerm~'~sort~false~sortDirection~'asc~page~1)
It seems that the category, pagination and sort options of a widget on the page write and read these values. Does this format for storing data in the URL have a name, or is it an esoteric format someone made up?
What's the purpose of doing this over using regular GET params, or at least using a more conventional format after the fragment?
If you inspect the URL carefully, you'll see that the parameters you describe are placed after the fragment (#), meaning they're not sent to the server but used by the client instead.
In this case, the client (JavaScript) builds them into something like an ElasticSearch query that is then POSTed to the server in order to update the listing you see on your screen.
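The split between what goes to the server and what stays in the browser can be checked with a short Python sketch. It only parses the URL from the question with the standard library; the variable request_target is an illustrative name for what would appear in the HTTP request line.

```python
from urllib.parse import urlsplit

url = (
    "https://www.agilealliance.org/glossary/xp/"
    "#q=~(infinite~false~filters~(postType~(~'post~'aa_book"
    "~'aa_event_session~'aa_experience_report)~tags~(~'xp))"
    "~searchTerm~'~sort~false~sortDirection~'asc~page~1)"
)

parts = urlsplit(url)

# Only the path (plus the query, if there is one) is part of the HTTP request.
request_target = parts.path + (f"?{parts.query}" if parts.query else "")

print(request_target)   # what the server sees
print(parts.fragment)   # what only client-side code can read
```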

What is the optimal way to encode a tag to use in algolia?

I have an article page that lists the tags related to that article. When the user clicks on the tag it brings them to the Algolia search results page. This is a Wordpress website. Some of my tags happen to contain ampersands in them like "Spades & Shovels", for example.
What I've noticed is that when I urlencode this term it does not display the search term properly in the search box when I send it via a query string.
I've tried this and thought this was the secret sauce, but it doesn't always work.
$tag_name = json_encode(urlencode(html_entity_decode($tag->name)));
What would be the best way to encode a tag name so that when I pass it via a query string to the search results page it handles it properly?
I've done more testing and I am noticing some odd things on my search results page. If I come to the search page with the searchbox field empty, but pass a post_id (I'm dynamically loading the tags associated with a post in the filter section), I can see "Spades & Shovels" listed there and it has a count of "65" next to it.
If I type "Spades & Shovels" into the searchbox, I see that number quickly drop down to 10 in the sidebar.
When I pass the tag in the query string, no matter how I encode it, it doesn't seem to work. I see the words Spades & Shovels in the search box, but I don't get any results. It's very strange, but I'm hoping it is something simple to fix. I need to be able to pass & in a query string to my search results page, but I have not found the proper formula for sending an ampersand in the URL.
It does seem like the value I am passing through in the query string is not an exact match to the actual tag name.
I have tried all of these possibilities and different combinations:
$clean_tag = json_encode(urlencode(html_entity_decode($tag->name)));
$clean_tag = htmlspecialchars_decode($clean_tag);
$clean_tag = html_entity_decode($clean_tag);
$clean_tag = urlencode($clean_tag);
$clean_tag = htmlentities($clean_tag);
$clean_tag = html_entity_decode($clean_tag);
$clean_tag = json_encode($clean_tag);
None of these seem to do the trick. Any thoughts?

jsonapi.org correct way to use pagination using the page query string

The jsonapi documentation on pagination says the following:
"For example, a page-based strategy might use query parameters such as page[number] and page[size]"
How would I represent this in the query string? For example, http://localhost:4200/people?page[number]=1&page[size]=25. I don't think using a map-like structure is a valid query string. Only the page parameter is reserved according to the documentation.
"I don't think using a map-like structure is a valid query string."
You're right technically, and that's why the spec has the note that says:
Note: The example query parameters above use unencoded [ and ] characters simply for readability. In practice, these characters must be percent-encoded, per the requirements in RFC 3986.
So, page[size] is really page%5Bsize%5D which is a valid query parameter name.
"Only the page parameter is reserved according to the documentation."
When the spec text says that only page is reserved, it actually means that any page[......] style query parameter is reserved. (I can tell you that for sure as one of the spec's editors.) But it should say so more explicitly, so I'll open an issue for it.
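As an illustration of the percent-encoding mentioned above, here is a minimal Python sketch using only the standard library. The host and path come from the question; nothing here is specific to any JSON:API client library.

```python
from urllib.parse import parse_qs, urlencode

params = {"page[number]": 1, "page[size]": 25}

# urlencode percent-encodes the square brackets in the parameter names.
query = urlencode(params)
url = f"http://localhost:4200/people?{query}"
print(url)

# A server-side parser decodes the names back to their bracketed form.
decoded = parse_qs(query)
print(decoded)
```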

Encoding wildcarding, stemming, etc in simple search

We have a simple search interface which calls the search:search($query-text) function. Is there a syntax to include control for wildcarding, stemming and case sensitivity within the single text string that the function accepts? I haven't been able to find anything in the MarkLogic docs.
See the $options parameter and the <term> and <term-option> constraint at https://docs.marklogic.com/search:search . There is a guide at http://developer.marklogic.com/learn/2009-07-search-api-walkthrough
and some details http://developer.marklogic.com/learn/2009-07-search-api-walkthrough#ndbba3437f320a4a4
I don't know of any existing syntax for those options, aside from the built-in behavior of turning on wildcards when a term contains '*' or '?' and turning on case-sensitivity when the term contains capital letters.
You could develop a syntax. Implementing it might involve a custom parser along the lines of https://github.com/mblakele/xqysp then feeding the resulting cts:query into search:resolve.
Piggybacking on Eric Bloch's answer... you can always dynamically construct your options node based on input in the user interface.
For example, I often do this in order to separate the facet selection portion of the query from the text search portion and put the facet selection query in the additional-query element in the options node.

Google Analytics Goals: Prevent tracking of URL parameters of subfolders

On my site I am tracking the URL /shop/ as goal by head match. As there are some URL parameters I cannot use exact match here.
Additionally, I am tracking a goal by exact match which is a URL to subfolder: /shop/process/paid.php
The problem is that GA tracks this subfolder with the head match as well, and thus saves the URL parameters that come along with paid.php, e.g. paid.php?email=customer#home.com
How can I prevent GA from tracking the URL parameters?
What would the setup look like?
Thanks!
That should work with a custom filter:
admin->profile->filters->custom filter->search and replace.
Search for
/shop/process/paid.php\?.*
(that is your URL followed by arbitrary query parameters. The "\" is an escape character, since "?" is also a control character in regular expressions; the dot means any character, and "*" means any number of repetitions of the preceding character, which here is any character) and replace it with the desired URL (/shop/process/paid.php).
There is probably a more elegant solution but like most people I'm not good at this regex stuff. This should work however.
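The effect of this search-and-replace can be tried out locally. The sketch below uses Python's re module rather than GA itself, so it only approximates how the filter would behave; the function name strip_paid_query is hypothetical, and the dot in "paid.php" is additionally escaped so that it matches a literal dot.

```python
import re

# Same idea as the filter pattern above: the path, a literal "?", then anything.
PAID_WITH_QUERY = re.compile(r"/shop/process/paid\.php\?.*")


def strip_paid_query(request_uri):
    """Replace paid.php plus its query string with the bare path."""
    return PAID_WITH_QUERY.sub("/shop/process/paid.php", request_uri)


print(strip_paid_query("/shop/process/paid.php?email=customer&order=1"))
print(strip_paid_query("/shop/?ref=newsletter"))  # other URLs are left untouched
```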
Alternatives:
If those query parameters are not needed anywhere in the tracking data, you can exclude them completely in the profile settings.
You can create a profile for the subdirectory (include filter -> request URI contains "/shop") and set only this profile to remove query parameters.
