Is it possible to show and store the results obtained from a Google Search query? The search is performed with the Apache Commons HttpClient, from my webapp. My questions are:
Is it permissible?
Is it ethical?
Is it possible?
I have heard that Google changes its HTML tags and blocks scrapers. Is that correct? Is there any other way to do it?
It is certainly possible to scrape Google for results - many do. You can use proxies to mask your IP and avoid being blocked.
Make sure you set your user-agent header to that of a real browser.
Google doesn't like this and provides a search API, but it is pretty useless.
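For what it's worth, the request itself is simple; here is a rough sketch of the idea in Node/TypeScript (the question uses Commons HttpClient, but the principle is the same): send the query with a browser-like User-Agent and keep the returned HTML for parsing and storage. The UA string below is just a placeholder.

```typescript
// Minimal sketch (Node 18+, global fetch): fetch a Google results page with a
// browser-like User-Agent. The UA string and query handling are placeholders.
async function fetchResults(query: string): Promise<string> {
  const url = `https://www.google.com/search?q=${encodeURIComponent(query)}`;
  const res = await fetch(url, {
    headers: {
      // Pretend to be a regular desktop browser; unknown clients are far more
      // likely to be served a CAPTCHA or blocked outright.
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
      "Accept-Language": "en-US,en;q=0.9",
    },
  });
  if (!res.ok) throw new Error(`Blocked or failed: HTTP ${res.status}`);
  return res.text(); // raw HTML, ready to parse and store
}
```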
Essentially, I'm concerned that a single user can be counted twice. Is there a best practice for this? I've tried googling, but I'm not sure whether I'm just not asking the right question with the right words. The platform is Sitecore.
Using the same property to track AMP and non-AMP pages will result in multiple users. See here for Google's recommendation.
Though it looks like you can use the Google AMP Client ID API to work around this.
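If it helps, the opt-in on the analytics.js side looks roughly like the sketch below (the AMP pages themselves also need their own opt-in, via the amp-google-client-id-api meta tag); UA-XXXXX-Y is a placeholder property ID, and the exact setup may differ if you use gtag.js.

```typescript
// Sketch of the AMP Client ID API opt-in for analytics.js (placeholder IDs).
declare function ga(...args: unknown[]): void;

ga("create", "UA-XXXXX-Y", "auto", {
  // Reuse the client ID issued to the AMP runtime, so AMP and non-AMP hits
  // from the same visitor are counted as one user.
  useAmpClientId: true,
});
```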
I made a crawler with Node.js. I want to crawl some sites on an hourly basis.
I tried to find out what user-agent I should use, but I only found results about Googlebot and Bingbot. I don't know if I am allowed to use those user-agents.
Could you tell me which user-agent I should use?
Since you made your own crawler, you can come up with your own name. There are no rules about what the user-agent may be, but many crawlers use a name/version format, like:
myAwesomeCrawler/1.0
You could also include a url so website owners can find more information about your bot if they see it in your logs:
myAwesomeCrawler/1.0 (http://example.org)
But ultimately it's up to you.
This all assumes, of course, that you're not doing anything illegal and aren't violating the terms of service of the websites you're crawling.
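Since your crawler is in Node.js, setting the header is a one-liner; a minimal sketch (name, version, and URL are placeholders):

```typescript
// Identify the crawler with its own User-Agent string on every request.
const CRAWLER_UA = "myAwesomeCrawler/1.0 (http://example.org)";

async function crawl(url: string): Promise<string> {
  const res = await fetch(url, {
    headers: { "User-Agent": CRAWLER_UA },
  });
  if (!res.ok) throw new Error(`Fetch failed: HTTP ${res.status}`);
  return res.text();
}
```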
It depends on what you want to achieve. If you want to imitate a legitimate browser, simply take the user-agent of a common browser like Chrome or Firefox. If you want to tell the site that you're a crawler, use something you define yourself (e.g. xyzCrawler).
Using Bing Web Search API, I need to filter results only for my domain, example query:
https://api.cognitive.microsoft.com/bing/v7.0/search?q=site:mysite.com+myquery
But the results include not only pages from mysite.com but also pages from sites like Wikipedia and others.
How can I get results only for my domain?
Bing Custom Search does not work for me because I have more than 10k transactions.
Your website is probably not known to or indexed by Bing. You are using the Bing Search API, not a custom indexing service or a search across a sitemap, so your site has to be in the public Bing index.
In other words, the actual Bing website needs to be able to find your site.
Since it can't, you get the default behavior of returning the most relevant results Bing can find.
This behavior is valid for urls such as the following:
https://api.cognitive.microsoft.com/bing/v7.0/search?q=microsoft+site:notAnIndexedWebsite.com
Formatting-wise, there are multiple options, as seen here.
None of them is a problem in this case.
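To make that concrete, here is a rough Node/TypeScript sketch of the same v7 call, with a client-side host check as a safety net against exactly this fallback behavior. Ocp-Apim-Subscription-Key is the standard Cognitive Services key header; mysite.com and the response handling are placeholders.

```typescript
// Call the Bing Web Search v7 endpoint with a site: restriction, then keep
// only results whose host actually belongs to the domain.
interface BingWebPage { name: string; url: string; snippet: string; }

async function searchMySite(query: string, key: string): Promise<BingWebPage[]> {
  const endpoint = "https://api.cognitive.microsoft.com/bing/v7.0/search";
  const q = encodeURIComponent(`site:mysite.com ${query}`);
  const res = await fetch(`${endpoint}?q=${q}`, {
    headers: { "Ocp-Apim-Subscription-Key": key },
  });
  const body = await res.json();
  const pages: BingWebPage[] = body.webPages?.value ?? [];
  // Bing may fall back to other sites when it has nothing indexed for yours,
  // so filter by host before trusting the results.
  return pages.filter((p) => {
    const host = new URL(p.url).hostname;
    return host === "mysite.com" || host.endsWith(".mysite.com");
  });
}
```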
You can try Bing Custom Search (at https://www.customsearch.ai/) if it helps, especially as it is now generally available. It also provides an option to get your pages crawled into the Bing index through Webmaster tools, if they are not crawled already. This should ensure that you get results only from your website.
What people are doing is basically taking the UA-XXXXXX code that you normally get with Analytics and generating calls against it. This is skewing my analytics stats. On top of that, in Google Webmaster Tools, it's also causing this:
It looks like these pages, with my code (or at least with the generated code) on them, are somehow making Google Webmaster Tools think I have lots of 404s. This can't possibly be good for my rankings.
Anyone know if there is anything you can do to stop this?
Try making an asynchronous call from your server side using cURL. That way you will never expose your GA code.
I have not implemented it myself, but in theory it should work.
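If you go down that route, the documented way to send hits from the server is the Measurement Protocol; cURL works, but so does any HTTP client. A minimal, untested sketch with placeholder values:

```typescript
// Send a pageview hit server-side via the GA Measurement Protocol, so the
// property ID never appears in client-side code. All values are placeholders.
async function sendPageview(clientId: string, pagePath: string): Promise<void> {
  const params = new URLSearchParams({
    v: "1",              // protocol version
    tid: "UA-XXXXX-Y",   // your (now hidden) property ID
    cid: clientId,       // anonymous client ID you generate and persist
    t: "pageview",       // hit type
    dp: pagePath,        // document path being viewed
  });
  await fetch("https://www.google-analytics.com/collect", {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: params.toString(),
  });
}
```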
Since you can filter by custom dimensions, you can set a "token" in a custom dimension on every page and filter out any traffic in your view settings that does not include the token.
Obviously this will not help against people who use the code from your website (unless you also implement shahmanthan9s suggestion, which is a lot of work but will give you cleaner data), but it will work against drive-by shooters who randomly select UA IDs to send data to (which is the situation you refer to in your comment).
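For reference, the page-side half of that approach is just one extra call before the pageview; the dimension index and token value below are placeholders, and the matching "include only" filter is configured in the view settings:

```typescript
// Stamp every legitimate pageview with a secret token in a custom dimension.
// Spam hits never load this page code, so they arrive without the token.
declare function ga(...args: unknown[]): void;

ga("create", "UA-XXXXX-Y", "auto");
ga("set", "dimension1", "my-secret-site-token"); // placeholder index and value
ga("send", "pageview");
```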
I'm doing custom-rolled view tracking on my website, and I just realized that I totally forgot about search bots hitting the pages. How do I filter that traffic out of my view tracking?
Look at the user-agents. It might seem logical to blacklist, that is, filter out all strings that contain "Googlebot" or other known search-engine bots, but there are so many of them that it could well be easiest to just whitelist: only log visitors using a known browser.
Another approach would be to use some JavaScript to do the actual logging (like Google Analytics does). Bots won't load the JS and so won't count toward your statistics. You can also do a lot more detailed logging this way because you can see exactly (down to the pixel - if you want) which links were clicked.
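A rough sketch of that user-agent check (the patterns are illustrative, not exhaustive):

```typescript
// Decide whether a request should count as a human pageview, based purely on
// the User-Agent header. Whitelist browsers, but reject obvious bot keywords
// first, since most bot UAs also contain "Mozilla".
const BOT_PATTERN = /bot|crawler|spider|slurp|bingpreview/i;
const BROWSER_PATTERN = /mozilla|chrome|safari|firefox|edg|opera/i;

function shouldCountView(userAgent: string | undefined): boolean {
  if (!userAgent) return false;                  // no UA at all: not a normal browser
  if (BOT_PATTERN.test(userAgent)) return false; // known bot keywords
  return BROWSER_PATTERN.test(userAgent);        // looks like a real browser
}
```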
You can check the user-agent; there is a nice list here.
Or you could cross-check with the hits on robots.txt, since all the spiders should read that first and users usually don't.