What User-Agent should I use when crawling sites with my own program?

I made a crawler with Node.js and I want to crawl some sites on an hourly basis.
I tried to find out what user-agent I should use, but I only got results about Googlebot and Bingbot, and I don't know whether I'm allowed to use those user-agents.
Could you tell me which user-agent I should use?

Since you made your own crawler, you can come up with your own name. There are no rules about what the User-Agent may be, but many crawlers use a name/version format, like:
myAwesomeCrawler/1.0
You could also include a URL so website owners can find more information about your bot if they see it in their logs:
myAwesomeCrawler/1.0 (http://example.org)
But ultimately it's up to you.
This all assumes, of course, that what you're doing is legal and doesn't violate the terms of service of the websites you're crawling.
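For example, here is a minimal sketch of such a crawler request in Node.js/TypeScript (assuming Node 18+ with a global fetch; the crawler name and info URL are placeholders you would choose yourself):

const USER_AGENT = "myAwesomeCrawler/1.0 (http://example.org)"; // name/version plus an info URL

async function fetchPage(url: string): Promise<string> {
  // Send the identifying User-Agent header with every request.
  const response = await fetch(url, { headers: { "User-Agent": USER_AGENT } });
  return response.text();
}

fetchPage("https://example.com/").then((html) => console.log(html.length));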

It depends on what you want to achieve. If you want to imitate a legitimate browser, simply take the user-agent of a common browser like Chrome or Firefox. If you want to tell the site that you're a crawler, use something you define yourself (e.g. xyzCrawler).

Related

Facebook shares not using og:url when clicked in Facebook?

One of the purposes of og:url -- I thought -- was that it was a way you could make sure session variables, or any other personal information that might find its way into a URL, would not be passed along by sharing in places like Facebook. According to the best practices on Facebook's developer pages: "URL
A URL with no session id or extraneous parameters. All shares on Facebook will use this as the identifying URL for this article."
(under good examples: developers.facebook.com/docs/sharing/best-practices)
This does NOT appear to be working, and I am puzzled as to how I misunderstood and/or what I have wrong in my code. Here's an example:
https://vault.sierraclub.org/fb/test.html?name=adrian
When I drop things into the debugger, it seems to be working fine...
https://developers.facebook.com/tools/debug/sharing/?q=https%3A%2F%2Fvault.sierraclub.org%2Ffb%2Ftest.html%3Fname%3Dadrian
og:url reads as expected (without name=adrian).
But if I share this on Facebook and then click the link, the URL goes to the one with name=adrian in it, not the og:url.
Am I doing something incorrectly here, or have I misunderstood? If the latter, how does one keep things like session variables out of shares?
Thanks for any insight.
UPDATE
Facebook replied to a bug report on this, and I learned that I had indeed been reading the documentation incorrectly:
developers.facebook.com/bugs/178234669405574/
The question then remains -- is there any other method of keeping session variables/authentication tokens out of shares?

How to stop Google Analytics Hacks?

What people are doing is basically taking the UA-XXXXXX code that you normally get with Analytics and generating calls against it. This is skewing my analytics stats. On top of that, in Google Webmaster Tools, it's also causing this:
It looks like these pages, with my code on (or at least with the generated code on), are making Google Webmaster Tools think I have lots of 404s. This can't possibly be good for my rankings.
Anyone know if there is anything you can do to stop this?
Try making an async call from your server end using cURL. That way you will never expose your GA code.
I have not implemented it, but in theory it should work.
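A sketch of that idea in Node.js/TypeScript rather than cURL, assuming the Universal Analytics Measurement Protocol endpoint (https://www.google-analytics.com/collect); the helper name and the example argument values are illustrative:

async function sendPageview(trackingId: string, clientId: string, path: string): Promise<void> {
  const params = new URLSearchParams({
    v: "1",          // protocol version
    tid: trackingId, // your UA-XXXXXX id, never shipped to the browser
    cid: clientId,   // anonymous client id you generate and store server-side
    t: "pageview",
    dp: path,
  });
  await fetch("https://www.google-analytics.com/collect", { method: "POST", body: params });
}

// e.g. from your request handler:
// await sendPageview("UA-XXXXXX-1", "35009a79-1a05-49d7-b876-2b884d0f825b", "/pricing");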
Since you can filter by custom dimensions, you can set a "token" in a custom dimension on every page and filter out any traffic in your view settings that does not include the token.
Obviously this will not help against people who use the code from your website (unless you also implement shahmanthan9's suggestion, which is a lot of work but will give you cleaner data), but it will work against drive-by shooters who randomly select UAIDs to send data to (which is the situation you refer to in your comment).
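A hedged sketch of that token idea with analytics.js; "dimension1" and the token value are placeholders you would pick yourself, and you would add a matching include filter on that dimension in your view settings:

declare const ga: (...args: unknown[]) => void; // analytics.js global, assumed already loaded

ga("create", "UA-XXXXXX-1", "auto");
ga("set", "dimension1", "s3cr3t-view-token"); // only hits carrying this value pass the view filter
ga("send", "pageview");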

HTML5 Page Request

I have been searching around for a way to simply request webpages with HTML5. Specifically, I want to do an HTTP(S) request to a different site. It sounds like this is not possible at first due to obvious security reasons, but I have been told this is possible (maybe with WebSockets?).
I don't want to use iframes, unless there is a way to make it so the external site being requested does not know that it is being requested through an iframe.
I've been having a really difficult time finding this information. All I want to do is load a webpage, and display it.
Does anyone have some code they could share, or some suggestions?
Thanks.
Sounds like you are trying to circumvent the Same Origin Policy. Don't do that :)
If it is just the request you want (and nothing else), there are a number of ways to do it, namely with new Image in JavaScript or an iframe. The site will know it is in an iframe by checking top.location.href != window.location.href.
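Roughly what that looks like in the browser (a hedged sketch; other-site.example is a placeholder):

// Fire-and-forget GET: the request is sent, but the response is not readable cross-origin.
const beacon = new Image();
beacon.src = "https://other-site.example/ping";

// How a framed page can detect it is inside an iframe (comparing window references
// avoids the cross-origin error that reading top.location.href can throw):
if (window.top !== window.self) {
  // this page is being displayed inside a frame on another page
}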
Funneling all your requests through a proxy might be a solution: the client addresses only the proxy, and the actual URL to retrieve would be a query string parameter.
Best regards, Carsten
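A minimal sketch of that proxy idea in Node.js/TypeScript (assuming Express and Node 18+ with a global fetch; the /proxy route and the url parameter name are just placeholders):

import express from "express";

const app = express();

// The client requests /proxy?url=<encoded target>; the server fetches it and
// returns the body, so the browser only ever talks to the same origin.
app.get("/proxy", async (req, res) => {
  const target = req.query.url as string | undefined;
  if (!target) {
    res.status(400).send("Missing url parameter");
    return;
  }
  try {
    const upstream = await fetch(target); // server-side request, no Same Origin Policy
    const body = await upstream.text();
    res
      .status(upstream.status)
      .type(upstream.headers.get("content-type") ?? "text/html")
      .send(body);
  } catch (err) {
    res.status(502).send("Upstream request failed");
  }
});

app.listen(3000);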

Possible to show and store top 3 results of Google Search?

Is it possible to show and store the results obtained from a Google Search query? The search is performed with the commons-http client, through my webapp. My questions are:
Is it permissible?
Is it ethical?
Is it possible?
I have heard that Google changes tags and blocks scrapers. Is that correct? Is there any other way to do it?
It is certainly possible to scrape Google for results - many do. You can use proxies to mask your IP and avoid being blocked.
Make sure you set your user-agent to a browser.
Google doesn't like this and provides a search API, but it is pretty useless.
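If scraping is off the table, here is a hedged sketch of calling Google's Custom Search JSON API instead (it assumes you have created an API key and a search engine id; YOUR_API_KEY and YOUR_ENGINE_ID are placeholders):

interface SearchItem {
  title: string;
  link: string;
  snippet: string;
}

async function topThreeResults(query: string): Promise<SearchItem[]> {
  const url =
    "https://www.googleapis.com/customsearch/v1" +
    `?key=YOUR_API_KEY&cx=YOUR_ENGINE_ID&num=3&q=${encodeURIComponent(query)}`;
  const response = await fetch(url);
  const data = await response.json();
  // Keep only the fields you want to show and store.
  return (data.items ?? []).map((item: any) => ({
    title: item.title,
    link: item.link,
    snippet: item.snippet,
  }));
}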

Using nginx rewrite to hide or clean URLs?

Greetings.
I've been trying to grok regexes and rewrites, but I'm a newbie at it. I have a simple site, one dir only. All I want is to rewrite domain.com/file.php?var=value to domain.com, making sure the user will only see domain.com in the address bar throughout site navigation (even if I start making the site more complex).
Simply speaking, it's freezing the URL. If the site grows I'd rather take some "clean URL" approach, but I'm still at basic PHP and I sense I'd have to RTFM on HTTP/1.1.
You can use an nginx rewrite rule like this:
rewrite ^([^.]*)$ /index.php;
to send all requests that do not contain a "." to your index.php script. In your PHP script, explode the request URI to get the passed parameters, and then include the PHP page you want:
# URL: /people/jack_wilson
$url_vars = explode('/', $_SERVER['REQUEST_URI']); # array('', 'people', 'jack_wilson')
$page = $url_vars[1];
if ($page == 'people') {
    $username = $url_vars[2];
    require('people.php'); # handle things from here
}
This type of "clean" URL stuff is typically accomplished with a wrapper frame for the entire site. The browser shows the URL of that frame only, and the contents can be whatever.
But I actually don't recommend it, because the user may want to bookmark the "dirty" URL.
This type of URL obfuscation diminishes usability for advanced users.
Not to be a jerk, but I typically only see this type of URL obfuscation on "artsy" sites that care more about what their address bar looks like than the usability of their site. Making your users happy via enhanced usability is a better long-term approach, IMHO.
I guess freezing the URL is driven by marketing desires, so here's the downside to that, from a marketing standpoint as well: your users won't be able to send a link to a page they liked to their friends via IM, email or Facebook, so it actually decreases the appeal of your site even for the most clueless users.
I don't (knowledgeably) think it's safe to have PHP variables showing in the address bar (although they'd show in the status bar of some browsers anyway...). Ideally the user wouldn't know what the site is using behind the scenes; I'm configuring error pages for starters. And yes, for aesthetics as well. If I can rewrite to domain.com/foo/bar I'd be happy, assuming nginx would handle the "translation back to ugly URLs" if it got such a request, which I think it does with location directives. But having domain.com/file.php?var1=value&var2=value kinda annoys me; it makes me feel I'm exposing the site too much.
Frames are not recommended (especially because of search engines) and I'm trying to stick to XHTML 1.1 Strict (so far so good). If the content is well designed, easy to navigate, accessible regardless of browser choice, intuitive, etc... I guess I could live with cute URLs :)
I'd gladly grok through any RTFM material regarding good web design techniques, PHP, HTTP/1.1 and whatever made you go "Yeah! That's what I've been looking for!" at 4am.
Thanks :)
[If I understand this site right, this ought to be a reply to the original post, not to an answer... sorry]
You could also look into a PHP framework such as CodeIgniter that will handle all the PHP side of this for you.
http://codeigniter.com/
