Is it ok to scrape data from Google results? [closed] - web-scraping

I'd like to fetch results from Google using curl to detect potential duplicate content.
Is there a high risk of being banned by Google?

Google disallows automated access in their TOS, so if you accept their terms and scrape anyway, you are breaking them.
That said, I know of no lawsuit from Google against a scraper.
Even Microsoft scraped Google; they powered their search engine Bing with it. They were caught red-handed in 2011 :)
There are two options to scrape Google results:
1) Use their API
UPDATE 2020: Google has deprecated previous APIs (again) and has new prices and new limits. Now (https://developers.google.com/custom-search/v1/overview) you can query up to 10k results per day at 1,500 USD per month; more than that is not permitted, and the results are not what is displayed in normal searches.
You can issue around 40 requests per hour. You are limited to what they give you; it's not really useful if you want to track ranking positions or what a real user would see. That's something you are not allowed to gather.
If you want a higher number of API requests you need to pay: 60 requests per hour cost 2,000 USD per year, and more queries require a custom deal.
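For illustration, a minimal query against the Custom Search JSON API might look like this (assuming Node 18+ or a browser; the API key and search engine ID are placeholders you must supply):
    // Minimal sketch: query the Custom Search JSON API (v1).
    // YOUR_API_KEY and YOUR_CX are placeholders for your own credentials.
    const key = 'YOUR_API_KEY';
    const cx = 'YOUR_CX'; // ID of your custom search engine
    const q = encodeURIComponent('some search term');
    fetch(`https://www.googleapis.com/customsearch/v1?key=${key}&cx=${cx}&q=${q}`)
      .then(res => res.json())
      .then(data => (data.items || []).forEach(item => console.log(item.link)));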
2) Scrape the normal result pages
Here comes the tricky part. It is possible to scrape the normal result pages.
Google does not allow it.
If you scrape at a rate higher than 8 (updated from 15) keyword requests per hour you risk detection; higher than 10/h (updated from 20) will get you blocked, in my experience.
By using multiple IPs you can up the rate, so with 100 IP addresses you can scrape up to 1,000 requests per hour (24k a day). (updated)
There is an open source search engine scraper written in PHP at http://scraping.compunect.com
It scrapes Google reliably, parses the results properly, and manages IP addresses, delays, etc.
So if you can use PHP it's a nice kickstart; otherwise the code will still be useful to learn how it is done.
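As a rough sketch of the approach (plain JavaScript here rather than the PHP project above; the user agent string and delay values are illustrative, and proxy rotation is omitted):
    // Hedged sketch: fetch result pages at a very low rate (Node 18+).
    // The user agent and delays are illustrative; proxy rotation is omitted.
    const keywords = ['first keyword', 'second keyword'];
    const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
    (async () => {
      for (const kw of keywords) {
        const res = await fetch('https://www.google.com/search?q=' + encodeURIComponent(kw), {
          headers: { 'User-Agent': 'Mozilla/5.0 (compatible; example)' },
        });
        const html = await res.text();
        // ...parse html for result links/positions here...
        // Stay well under ~8 keyword requests per hour per IP:
        await sleep((450 + Math.random() * 150) * 1000);
      }
    })();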
3) Alternatively use a scraping service (updated)
Recently a customer of mine had a huge search engine scraping requirement, but it was not 'ongoing'; it was more like one huge refresh per month.
In this case I could not find a self-made solution that's 'economical'.
I used the service at http://scraping.services instead.
They also provide open source code, and so far it's running well (several thousand result pages per hour during the refreshes).
The downside is that such a service means that your solution is "bound" to one professional supplier; the upside is that it was a lot cheaper than the other options I evaluated (and faster in our case).
One option to reduce the dependency on one company is to use two approaches at the same time: use the scraping service as the primary source of data, and fall back to a proxy-based solution as described in 2) when required.

Google will eventually block your IP when you exceed a certain amount of requests.

Google thrives on scraping the websites of the world, so if it were "so illegal" then even Google wouldn't survive. Of course, other answers mention ways of mitigating IP blocks by Google. One more way to explore for avoiding captchas could be scraping at random times (didn't try it). Moreover, I have a feeling that if we provide novelty or some significant processing of the data, then it sounds fine, at least to me. If we are simply copying a website, or hampering its business or brand in some way, then it is bad and should be avoided. On top of it all, if you are a startup then no one will fight you, as there is no benefit; but if your entire premise rests on scraping even when you are funded, then you should think of more sophisticated ways, alternative APIs, eventually. Also, Google keeps releasing (or deprecating) fields for its API, so what you want to scrape now may be on the roadmap of new Google API releases.

Related

Any free mapping service to display and filter 250000+ datapoints?

I have participated in a Hackathon in my city, and the traffic department made public a dataset with more than 250 thousand traffic accident datapoints, each one containing Latitude, Longitude, type of accident, vehicles involved, etc.
I made a test to display the data using Google Maps API and Google Fusion Tables, but the usage limits were quickly reached with the first two years of a total of 13 years of records.
The data for two years can be displayed and filtered here.
So my question is:
Which free online services could I use in order to interactively display and filter 250 thousand such datapoints as map layers?
It is important that the service be free, because we are volunteering our time for non-profit public good. Currently our City Hall is implementing an API, but it is not ready yet, and it would be useful to present them some popularly well-accepted use cases to apply some political pressure for further API development with THEIR server (especially remotely querying a database instead of crawling a bunch of .csv files as it is now...)
An alternative would be to put everything on GitHub and load the whole dataset client-side to be manipulated with D3.js, for example, but that seems very inefficient both for the client/user and for the server.
Thanks for reading, and feel free to re-tag if needed.
You need Google Maps API for Business to achieve what you want, but it costs a lot of money.
However, in some cases you can get this Business Licence if you work for a non-profit organization. I can't find the exact rules for being eligible for this free licence; I tried googling for them but couldn't find anything. I only found this link; take a look and see if it answers your problem.
You should be able to do that with Google Fusion Tables. The limit is 100,000 points per table, but you can overlay 5 layers onto a single map so in effect you can reach 500,000 points. I implemented the website below and have run it with over 200,000 points.
http://www.skyscan.co.uk/mapsearch.html
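A rough sketch of that layer stacking with the Maps JavaScript API's FusionTablesLayer (the table IDs, the location column name, and the map center are placeholders, not values from the site above):
    // Assumes the Maps JavaScript API is already loaded on the page.
    const map = new google.maps.Map(document.getElementById('map'), {
      center: new google.maps.LatLng(-23.55, -46.63),
      zoom: 11,
    });
    // One layer per 100k-row table, up to the 5-layer limit.
    ['TABLE_ID_1', 'TABLE_ID_2', 'TABLE_ID_3'].forEach(tableId => {
      new google.maps.FusionTablesLayer({
        query: { select: 'Location', from: tableId },
        map: map,
      });
    });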

Using events to measure page section usage

I'm currently researching a solution to monitor the performance of specific sections of a page. For example, you have a simple page with 2 images with links to other pages. You are driving lots of traffic to this page and you are experimenting with different contents on that page.
Six months later, you want to see which section of the page performed better, and with which specific images.
Let's imagine you require a report that should tell you the following: on average, the first spot performs better, but last week the image was bad and that's why you had fewer conversions from that spot.
I'd like to use such a system on a high-traffic homepage of an eCommerce website, in order to better monitor the usage of the selling spots.
I was thinking of using Google Analytics events with a positioning scheme (splitting the page into columns and rows, giving each cell an identifier such as a1 for column a, row 1) and keeping a local data warehouse of creatives (images, promotions, etc.), but apparently, after 10,000,000 hits per month, Analytics recommends the premium version, which is quite pricey (12k USD per month, one year upfront payment).
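To illustrate the scheme, something like this is what I have in mind (analytics.js syntax; the 'homepage-spots' category and the data-spot attribute are just my working names):
    // Assumes the standard analytics.js snippet is already on the page and
    // each selling spot is marked up with a data-spot="a1"-style identifier.
    document.querySelectorAll('[data-spot]').forEach(spot => {
      spot.addEventListener('click', () => {
        ga('send', 'event', 'homepage-spots', 'click', spot.dataset.spot);
      });
    });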
I was thinking about Piwik as an alternative, but there is no event tracking there - or am I missing something?
Looking forward to hearing your input on this matter.
You're better off with a provider like Optimizely for this use case. Still gonna be expensive, but it'll more quickly get you the information you need to make decisions.
We normally use multivariate tests or A/B tests to measure the success of user interfaces. Google Analytics has this feature, and it is free.
These links may be useful:
https://www.youtube.com/watch?v=yDWTMOC_Dp4
https://support.google.com/analytics/answer/1745147?hl=en

Scrape all Google search results for a specific name

I think this question has been answered here before, but I could not find the desired topic. I am a newbie at web scraping. I have to develop a script that will fetch all the Google search results for a specific name. It will then grab the related data for that name, and if more than one match is found, the data will be grouped by name.
All I know is that Google has some kind of restriction on scraping, and they provide a Custom Search API. I still have not used that API, but I am hoping to get all the result links corresponding to a query from it. However, I could not figure out what the ideal process would be for scraping the information from those links. Any tutorial link or suggestion is very much appreciated.
You should have provided a bit more about what you have been doing; it does not sound like you even tried to solve it yourself.
Anyway, if you are still on it:
You can scrape Google in two ways: one is allowed, one is not.
a) Use their API; you can get around 2k results a day.
You can raise that to around 3k a day for 2,000 USD/year, and more by getting in contact with them directly.
You will not be able to get accurate ranking positions with this method; if you only need a low number of requests and are mainly interested in finding websites for a keyword, it's the right choice.
Starting point would be here: https://code.google.com/apis/console/
b) You can scrape the real search results
That's the only way to get true ranking positions, for SEO purposes or to track website positions. It also allows you to get a large number of results, if done right.
You can Google for code, the most advanced free (PHP) code I know is at http://scraping.compunect.com
However, there are other projects and code snippets.
You can start off at 300-500 requests per day, and this can be multiplied by using multiple IPs. Look at the linked article if you want to go that route; it explains things in more detail and is quite accurate.
That said, if you choose route b) you break Google's terms, so either do not accept them or make sure you are not detected. If Google detects you, your script will be banned by IP/captcha. Not getting detected should be a priority.

When is Google Analytics not good enough? [closed]

I'm trying to determine why an enterprise wouldn't want to use Google Analytics.
Here are the main reasons I've seen mentioned:
Inability to track clients that have Javascript disabled.
Lack of ownership of the statistics - Google owns the data.
Most of the web clients with Javascript disabled will probably be bots/spiders. This data is interesting, but probably not very useful.
As for the ownership issue, this is a bit paranoid IMO.
What am I missing here? When is Google Analytics not good enough?
Here are my findings from additional research:
Google Analytics is limited to 5 million page views per month - source
If a web site generates more than 5 million pageviews per month, it will need to be linked to an active AdWords account to avoid interruption of service.
Lack of / slow technical support
All Google support is handled through email and response times can take a week or more. Commercial analytics products often have much faster & personalized support.
Inability to track files (PDFs, images, etc.)
GA relies on JavaScript, and files cannot execute JavaScript. The workaround to this problem is to tag the link (see the sketch below), but this won't track requests that go directly to the file.
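A hedged sketch of that link tagging (analytics.js syntax; the category and action names are made up):
    // Record PDF clicks as GA events, since the file itself
    // cannot execute the tracking JavaScript.
    document.querySelectorAll('a[href$=".pdf"]').forEach(link => {
      link.addEventListener('click', () => {
        ga('send', 'event', 'download', 'pdf', link.href);
      });
    });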
Limited ability to customize
This is a selling point that I see pushed by commercial analytics tools (WebTrends). However, it's never explained which customizations are denied by GA but allowed by WebTrends.
The Google Analytics EULA does not allow you to track individual users by identifying them. So if you wanted to add a custom variable for username to track how many times each user logs in, then you would be in a gray zone if not outright violating the EULA.
I use Google Analytics on about 10 sites right now and it's a great tool. In addition to all the analytics stats, you can tie it in with AdSense and it becomes a marketing/revenue tool and not just "wow look at all these cool user stats". If there was a way to track by user ID in certain circumstances (e.g. if user's agreed to it, or if they work for the company that owns the site) then I would have no issues.
Besides, it's free and all you have to do is add JavaScript to the files, so give it a try and see what you think after a few months.
One reason that was, surprisingly, not posted:
timing / speed of reaction
It takes at least 4 hours (up to 24) for GA to update your data.
This is ok for me personally in most of the cases, but when reacting fast is crucial (news sites, one-off events, etc.) you may want to employ some other solution (Mint comes to mind, but it's not the only one out there of course).
Thought I'd add my two pence worth to this thread, as this is a topic close to my heart and one I've debated with colleagues for years. We've used WebTrends in house for as long as I can remember, back to version 4 of the log analyzer (how different things were back then!). Since Google Analytics came along, we've come under increasing pressure from certain parts of our business to switch, as "it does everything we need from an analytics tool".
Well, in many senses it does, especially these days. But I championed the integration of our CRM and web analytics tools back in 2006, and as our business isn't e-commerce (the 'conversion' happens offline, sometimes months after the visitor acquisition) we need to integrate in this way to get a true picture of campaign effectiveness and a notion of ROI.
All of this means, we need access to the raw data, need to be able to join visitor records on sessionID etc, without this access we'd be screwed. I'd love it if we could roll without it, but the current requirements mean we can't, so this alone is a HUGE reason why Google analytics is not good enough.
Over and out
For tracking desktop software or creating a white-label solution there are better options.
For white-label and integration-based analytics, I use Mixpanel. For desktop software, I use DeskMetrics.
Google Analytics does not work well with mobile phones. While the iPhone and the Palm may be supported, many of the existing handsets do not support the javascript that Google uses.
If you're based in the UK, then theoretically you could be breaking the Data Protection Act by using Analytics.
If information about your users (like which web pages they're looking at) goes "outside the European Economic Area" and onto Google's servers in the US, then you're breaking the DPA.
Pretty obscure, but you did ask :)
Piwik avoids the problem because you host it on your own servers.
"Lack of ownership of the statistics - Google owns the data. ... As for the ownership issue, this is a bit paranoid IMO."
One problem with it is that we can't even access the raw data. We had a use case this week where we wanted a visitor map for an executive presentation. We needed to get more flexible with how the visitor map is displayed (wanted to view the map in Google Earth plug-in). In GA, you can't. You take what they give you. You can see a map of how many visits came from each city, but you can't export a data file of cities and number of visits, to run the data through other tools. So, paranoia aside, there are significant limitations on what you can accomplish with GA.
However, this is not a problem if you use Urchin, the self-hosted version of GA: you can export the data and do what you want with it. (And the exported data is richer than the web server logs, as it includes some analysis already.)
Since Piwik is open source, and pluggable, I imagine you could enhance the visitor map plug-in any way you wanted to. And export whatever data you want.
Whether this limitation affects you depends on your needs, obviously.
Update: I've now looked at the GA Data Export API, and it turns out that things you cannot do through the UI (as you can with Urchin), you can do with this API. It does look like you can export the visit data I was talking about, via a feed (although there are daily traffic caps on those requests). So sprinkle salt heavily on what I wrote above.
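For reference, a data-feed request against that API looked roughly like this (the profile ID is a placeholder and authentication is omitted):
    // Build a data-feed URL for the old feed-based Data Export API.
    const feed = 'https://www.google.com/analytics/feeds/data'
      + '?ids=ga:12345'            // your profile ID
      + '&dimensions=ga:city'
      + '&metrics=ga:visits'
      + '&start-date=2009-01-01'
      + '&end-date=2009-12-31'
      + '&max-results=10000';
    // fetch(feed, { headers: { Authorization: '...' } }) would return the
    // city/visit rows as a feed you can process with other tools.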
A couple more points that I've come across:
GA doesn't let you dig below full-day statistics; I would often like the ability to investigate whether a traffic dip the previous day was caused by the design update I did at 1pm or the soccer match on TV at 8pm.
GA doesn't offer a workaround for traffic spikes caused by DDoS attacks, Slashdotting, etc. When I'm looking at a GA visitor graph of 2009, all I can see is the 2-million-pageview spike on October 16th, pushing the entire rest of the year down flat against the horizontal axis of the graph. To get a meaningful graph, GA should offer the ability to trim or exclude outlying data points, or to limit/bracket the graph window itself.
GA doesn't have an event monitoring client (think Reinvigorate's Snoop tool)
While GA is very user-friendly, I've found it's not as granular as some of the other stats programs (or maybe I'm not looking in the right places). Before the marketing monkeys I work with began pushing GA, we were very satisfied with AWStats. The sheer scope of the data helped us on several occasions hone sites to better suit their audience. While GA is very shiny and laid out well, I personally still prefer the raw numbers like I used to get through AWStats.
Slow data processing speed - can be as low as 15-30 minutes for page views, but may be up to 48 hours for eCommerce
EULA is limiting in some cases
You won't own or have any control of the data. Google's engineers might use it (anonymously) for testing
Anything more complex requires customization - downloads and such are no issue, but there are limits
Cross-domain tracking via the linker is faulty at best
Visit-based - proper tools report at the visitor level; GA mostly works on visit-based reporting
Limited number of custom vars used at one time (5)
No tech support, if you're realistic
Usually when there is a downtime notice, it's already gone
API limitations (4 dimensions and 10 metrics at one time, and not all can be used together on top of that)
I have many more, but at the end of the day it is a good tool for its price.
From a non-technical point of view, I think the most important reason is that some enterprises have a high-level data security policy: all data must be controlled and managed in-house.
If you use Google Analytics, the data is stored on Google's servers. For certain enterprises, like insurance or financial companies, that policy must be followed.
I would NOT go with server logs. In fact I have them disabled on my server. Why, you ask?
For the simple reason that every time you hit my server, that logging program makes an entry in the physical log file on my HDD. So if my server gets 100,000 hits in a day, that's 100,000 HDD write operations.
You think that's cool? Well it's not. It's slowing your server down, especially if the log file is huge.
Why would someone even consider doing that to their server? Especially when we're working so hard to minify JavaScript and CSS and make image files 2 KB smaller!
Please do yourself a favor don't log directly on your server.
At least Google Analytics logs it on Google's server so my server's healthier.
I wouldn't use it for any of my sites, because you're forcing the user to accept your proprietary JavaScript code in their browser, which is bad. Also, giving your data to Google is a really bad idea.
See Piwik for something you can run yourself as free software, eliminating both of these problems.

Is anybody happily using Google Analytics with big websites? (million+ pages, million+ monthly visitors) [closed]

I was a happy customer of Google Analytics starting from the Urchin times. But something strange happened a few months ago and GA started showing a fake URL called "(other)" that is credited between 5% and 45% of all site traffic. I've tried filtering out some URL parameters to reduce the number of pages. Currently GA shows only 150,000 pages on my site, which is well below the half million limit that some people are talking about. Still, the page "(other)" is showing as the most popular page on my site.
Is anybody else struggling with this issue? I am wondering whether this could be a scalability issue. My site has been growing over the years, and currently doing 1.25 million unique monthly visitors and over 10 million pageviews. The site itself has around half a million pages. If you are successfully using GA with a bigger website than mine, please share your story. Are you using the Sampling feature of their tracking script?
Thanks!
For a huge website like that I would not use a free analytics product. I would use something like WebTrends or some other paid analytics. We cannot blame GA for this; after all, it's a free service ;-)
GA has page view limits too. (5 Million page views)
Just curious. How long did you take to add the analytics code to your pages? ;-)
In Advanced Web Metrics with Google Analytics, Brian Clifton writes that above a certain number of page views, Google Analytics is no longer able to list all the separate page views and starts aggregating the low-volume ones under the "(other)" entry.
By default, Google Analytics collects pageview data for every visitor. For very high traffic sites, the amount of data can be overwhelming, leading to large parts of the "long tail" of information to be missing from your reports, simply because they are too far down in the report tables. You can diminish this issue by creating separate profiles of visitor segments - for example, /blog, /forum, /support, etc. However, another option is to sample your visitors.
I get about 3.5 million hits a month on one of my sites using GA. I don't see (other) listed anywhere. Specifically what report are you viewing? Is (other) the title or URL of the page?
You can get a loooonnnngggg way on Google Analytics. I had a site doing about 25mm uniques/mo. and it was working for us just fine. The "other" bucket fills up when you hit a certain limit of pageviews/etc. The way around this is to create different filters on the data.
For a huge website (millions of page views per day), you should try out SnowPlow:
https://github.com/snowplow/snowplow
This will give you granular data down to the individual page URLs (unlike Google Analytics at that volume) and, because it is based on Hadoop/Hive/Infobright, it will happily scale up to billions of page views.
It's more to do with a daily limit on the number of unique values reported for a dimension. If your site uses query-string parameters, all those unique values and parameter variations are seen as separate pages and push the report over the limit of 50,000 unique values per day. To eliminate the problem, you should add all the big query-string culprits to the ignore list, making sure not to add any search query-string names if search is on.
In the Profile Settings, add them to the Exclude URL Query Parameters text field, delimited by commas. Once I did this, the (other) entry went away from the reports. It takes effect from the point the parameters are added; previous days will still show (other).
