Does anyone know how Google Analytics processes data? - google-analytics

Anyone have any idea, or know of any articles, that discuss how Google Analytics stores and processes the data that comes in from the urchin calls? Curious about the architecture.
thanks!

Their own docs on "How Data Is Calculated" give you a pretty good idea of what data they collect and how they calculate their metrics:
http://code.google.com/apis/analytics/docs/concepts/gaConceptsOverview.html#howDataIsCalculated
As you mentioned, these calculations are distributed across many machines using Google's homegrown architecture, which includes Map/Reduce:
http://en.wikipedia.org/wiki/MapReduce
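To make the Map/Reduce idea concrete, here is a toy single-machine sketch of how hit-level beacon records might be rolled up into per-page counts. This only illustrates the programming model; the record fields are made up, and the real pipeline runs distributed across many machines:

```python
from collections import defaultdict

# Hypothetical hit records, one per tracking-beacon request.
hits = [
    {"page": "/home", "visitor": "a"},
    {"page": "/home", "visitor": "b"},
    {"page": "/pricing", "visitor": "a"},
]

# Map phase: emit a (key, value) pair per hit.
def map_hit(hit):
    yield (hit["page"], 1)

# Shuffle phase: group emitted values by key.
grouped = defaultdict(list)
for hit in hits:
    for key, value in map_hit(hit):
        grouped[key].append(value)

# Reduce phase: aggregate each key's values into a metric.
pageviews = {page: sum(values) for page, values in grouped.items()}
print(pageviews)  # {'/home': 2, '/pricing': 1}
```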

I think Analytics is totally closed. However, if you haven't read about Facebook's Scribe, it is probably worth checking out as another extreme case of scalable, distributed logging and analysis.

I don't know specifically about Analytics, but in general Google uses (ehm... invented?) Map/Reduce.
There are several open-source databases which support Map/Reduce-style calls, e.g. CouchDB, which is a document-oriented database.
These kinds of applications use geolocation to determine the user's location on the basis of the IP address (a toy lookup sketch follows this answer). Additional information is found via JavaScript's window.navigator object (userAgent, platform, language, ...) and the screen object (dimensions, color depth).
edit:
There is evidence that Google uses its BigTable database engine (which goes hand in hand with MapReduce) for Reader, Maps & YouTube.
On dbms2.com they even say that Analytics uses MapReduce (which could be categorized as "insider knowledge").
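To illustrate the IP-based geolocation step mentioned above, here is a minimal sketch using a toy in-memory prefix table. The network ranges shown are arbitrary examples; production systems use full GeoIP databases (e.g. MaxMind) with millions of ranges resolving down to city level:

```python
import ipaddress

# Toy IP-to-country table; a real GeoIP database is vastly larger.
GEO_TABLE = [
    (ipaddress.ip_network("93.184.216.0/24"), "US"),
    (ipaddress.ip_network("212.58.244.0/24"), "GB"),
]

def lookup_country(ip):
    addr = ipaddress.ip_address(ip)
    for network, country in GEO_TABLE:
        if addr in network:
            return country
    return "unknown"

print(lookup_country("93.184.216.34"))  # US
```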

Related

How to build a predictive dialer?

I need to build a reliable predictive dialer based on Asterisk. Currently the system we use includes Wombat and Asterisk, and we do not find this solution usable, as Wombat provides a poor API and it's impossible to use without regular manual operations.
The system we want:
Can be used solely via API or direct database queries (adding lists to campaigns, updating lists, starting campaigns, stopping campaigns etc.) so that it can be completely integrated into an existing product
Is free, or paid for annually, independent of the usage rate
Is considered stable
Should be able to handle tens of thousands of calls per day, if it matters
Use vicidial.org, or hire a freelancer to build a new core with the API you need.
You can also check OSDial for this; it is also developed using Asterisk.
We have been working with a preview of the next version of Wombat through the Early Access program. Wombat has a complete configuration and reporting JSON API, and you can deploy it "headless" in order to scale up to thousands of parallel lines. If you ask Loway, they can likely get you access to the Early Access program.
BTW, Vicidial is great for agent-based outbound, but it imposes quite a large penalty on the number of agents per server - you cannot reasonably use it for telecasting at the scale we are looking at, as it would require too many servers. Wombat is leaner and can drive over one thousand channels per server. YMMV.
This question would be better placed on a hire-a-freelancer site like oDesk... if you need custom programming done, those are the sorts of places to go to get manpower.
Your specifications are well within what is possible with Asterisk. I'd strongly recommend looking at Vicidial and OSDial as others have suggested; out of the box, they are pretty good.
The hard part of any auto-dialer is not the dialer, oddly enough. It's the prediction algorithms, the answering-machine detection algorithms, and the agent UI. Those are what make or break an auto-dialer application for a company.
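To give a flavor of the prediction side, here is a deliberately naive pacing sketch. The formula and parameter names are hypothetical simplifications; a real dialer must also cap the abandoned-call rate to stay within regulations:

```python
def calls_to_place(free_agents, busy_agents, answer_rate,
                   wrapup_ratio=0.0, overdial_factor=1.0):
    """Toy pacing formula: how many numbers to dial right now."""
    # Some busy agents are expected to free up while new calls ring.
    expected_free = free_agents + busy_agents * wrapup_ratio
    if answer_rate <= 0:
        return 0
    # Dial enough lines that, at the historical answer rate, roughly
    # one answered call lands per expected-free agent.
    return int(expected_free / answer_rate * overdial_factor)

print(calls_to_place(free_agents=3, busy_agents=10,
                     answer_rate=0.25, wrapup_ratio=0.1))  # 16
```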

Sending data from Ganglia to Graphite

I am currently collecting monitoring metrics with Ganglia, and I would like to graph that data with Graphite. I know such an integration is possible, and I found an article describing how it should be done, but I am not quite sure exactly how the integration works, especially if I want to send the data straight into Graphite without parsing gmetad's output. Any help on how to integrate Ganglia with Graphite would be great.
thanks
There are two approaches to integrating Ganglia with Graphite:
1. Use a third-party process to fetch metrics from gmetad/gmond, tweak the metric data format, and finally send the metric data to the carbon server (a minimal sketch of this follows below).
2. Use gmetad's "graphite integration" feature, where you just configure the carbon server address, port, and protocol (with an optional Graphite path syntax), and gmetad does everything else. More details can be found in your /etc/ganglia/gmetad.conf.
I would recommend #2 since it's pretty simple; you just need to upgrade your Ganglia packages to version 3.3+.
With either solution you can store metric data in both RRD and Whisper. If you don't want this approach, ganglia-web also supports replacing rrdtool graphs with Graphite; see "Using Graphite as the graphing engine".
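For approach #1, the "send to carbon" step is tiny, because carbon's plaintext protocol is just one "path value timestamp" line per datapoint, by default on TCP port 2003. A minimal sketch (the host name and metric path are placeholders, and the value would really come from polling gmond/gmetad):

```python
import socket
import time

CARBON_HOST, CARBON_PORT = "graphite.example.com", 2003  # placeholders

def send_metric(path, value, timestamp=None):
    # Carbon's plaintext protocol: "metric.path value timestamp\n"
    timestamp = int(timestamp or time.time())
    line = "%s %s %d\n" % (path, value, timestamp)
    with socket.create_connection((CARBON_HOST, CARBON_PORT)) as sock:
        sock.sendall(line.encode("ascii"))

# e.g. a load value polled from gmond and re-shaped for Graphite:
send_metric("datacenter1.web01.load_one", 0.42)
```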
Have you checked the ganglia-web wiki? There is a section called "Graphite Integration" and another called "Using Graphite as the graphing engine" which explain well how to do what you want.
I've worked a lot with Ganglia, and from what I've researched Graphite works similarly. I was never able to master Whisper, but I've found RRDs (round-robin databases) to be pretty reliable. Not sure what you're interested in monitoring, but I would definitely check out JMXtrans. You can get the code from Google Code. It provides multiple methods for extracting metric data from whatever JVM you're monitoring, and lets you define which metrics you'd like to pipe to Ganglia/Graphite, among other options.

Recommendations using R with SimpleDB or BigQuery or using PHP with SimpleDB

I am currently working on a system that generates product recommendations like those on Amazon: "People who bought this also bought this..."
Current Scenario:
Extract the client's Google Analytics data and insert it into a database.
On the client's website, when a product page loads, an API call is made to get recommendations for the product being viewed.
When the API receives a product ID as a request, it looks in the database, retrieves (using association rules) the recommended product IDs, and sends them as the response (a toy sketch of this step follows below).
The list of these product IDs is processed to get the product details (image, price, ...) at the client end and displayed on the website.
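As a toy sketch of the association step (not the actual implementation described above), "people who bought X also bought Y" can be approximated by counting how often product pairs co-occur in transactions; the order data here is made up:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical order history: one set of product IDs per transaction.
orders = [
    {"p1", "p2", "p3"},
    {"p1", "p2"},
    {"p2", "p3"},
]

# Count how often each ordered pair of products is bought together.
co_counts = defaultdict(int)
for order in orders:
    for a, b in combinations(sorted(order), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(product_id, top_n=2):
    scored = [(other, n) for (p, other), n in co_counts.items()
              if p == product_id]
    scored.sort(key=lambda pair: -pair[1])
    return [other for other, _ in scored[:top_n]]

print(recommend("p1"))  # ['p2', 'p3']
```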
Currently I am using PHP and MySQL with the gapi package, and a REST API with storage on Amazon EC2.
My Question is:
Now, if I have to choose amongst the following, which would be the best choice to implement the above-mentioned concept?
PHP with SimpleDB or BigQuery.
R with BigQuery.
RHIPE (R and Hadoop) with SimpleDB.
Apache Mahout.
Please help!
This isn't so easy to answer, because the constraints are fairly specialized.
The following considerations can be made, though:
BigQuery is not yet public. Thus, with a small user base, even if you are in the preview population, it will be harder to get advice on improvement.
Each of your options pairs a modeling system with a storage system. Apache Mahout is not a storage mechanism, so it won't necessarily work on its own. I used to believe that its machine learning implementations were a pastiche of a few Google Summer of Code projects, but I've updated that view at the suggestion of a commenter. It still looks like it has rather uneven and spotty coverage of different algorithms, and it's not particularly clear how the components are supported or maintained. I encourage an evangelist for Mahout to address this.
As a result, this eliminates the 1st, 2nd, and 4th options.
What I don't quite get is the need for a real-time server to use Hadoop and RHIPE. That work should be done in your batch processing when developing the recommendation models, not in real time. I suppose you could use RHIPE as a simple one-stop front end for firing off queries.
I'd recommend using RApache instead of RHIPE, because you can get your packages and models pre-loaded. I see no advantage to using Hadoop in the front end, but it would be a very natural back end system for the model fitting.
(Update 1) Other interface options include RServe (http://www.rforge.net/Rserve/) and possibly RStudio in server mode. There are R/PHP interfaces (see comments below), but I suspect it would be better to access R through HTTP or TCP/IP.
(Update 2) Addressing the whole process, the basic idea I see is that you could query the data from PHP and pass it to R or, if you wish to query from within R, look at the link in the comments (to the OmegaHat tools) or post a new question about R & SimpleDB - I'm sure someone else on SO would be able to give better insight on this particular connection. RApache would let you instantiate many R processes already prepared with packages loaded and data in RAM; thus you would only need to pass whatever data needs to be used for prediction. If your new data is a small vector then RApache should be fine, and it seems this is correct for the data being processed in real time.
If you want a real-time API for recommendations based on data in a database, Apache Mahout does this directly. You want to use ReloadFromJDBCDataModel, put a GenericItemBasedRecommender on top, and use the servlet-based wrapper in the examples module. It's probably a day or two of work to get familiar with the code and customize it to your needs, but it's pretty simple.
When you get past about 100M data points, you would need to look at distributing the computation with Hadoop. That's a fair bit more complex. Mahout has a distributed recommender too, which you can customize.

How to get distance from the Google Maps API from server-side ASP.NET 2.0

I am working on a project which requires server-side access to the Google Maps API. I want to calculate distance (actual travel distance, not a straight line). The Google Maps API supports JavaScript, not ASP.NET. Please give suggestions...!
You specified Google Maps in your question - but have you looked at Virtual Earth? Specifically, this "routing with the Virtual Earth Web Service" example sounds exactly like what you want:
server-side access (just Add Service Reference inside visual studio)
actual distance (not straight line) since it is using a route
The concerns raised by others about T&Cs for 'internal/intranet use' apply to VE as well as Google I think - you'll have to read up about whether your application needs licensing or not.
p.s. if you did just want to calculate straight-line distance, I have instructions using SQL Server 2008, which also link to some straight C# code that does it too.
The Google API allows you to geocode via a server-side call:
http://code.google.com/apis/maps/documentation/services.html#Geocoding_Direct
This would allow you to get the longitude and latitude of the locations. You can then cache these and use them to calculate distance using the techniques CMS suggests.
You will need to be careful of the Google T&Cs though, as you are only allowed to store the geocoding data for use on a Google map which is publicly available.
You would probably also run into limitations on the number of requests you could make from a single IP.
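As a hedged sketch of such a server-side geocoding call (shown in Python rather than ASP.NET for brevity; the URL is the current Geocoding web service endpoint rather than the one linked above, and YOUR_KEY is a placeholder):

```python
import json
import urllib.parse
import urllib.request

def geocode(address, api_key="YOUR_KEY"):  # placeholder key
    params = urllib.parse.urlencode({"address": address, "key": api_key})
    url = "https://maps.googleapis.com/maps/api/geocode/json?" + params
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    loc = data["results"][0]["geometry"]["location"]
    return loc["lat"], loc["lng"]  # cache these, per the T&Cs caveat above

# lat, lng = geocode("1600 Amphitheatre Parkway, Mountain View, CA")
```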
However, I think what you mean by non-straight-line distance is distance taking into account roads, one-way streets, etc.
If this is the case, I think a commercial service is your only option. Although theoretically you could do it all via screen scraping, I'm almost certain that this would break Google's T&Cs.
The simplest solution would probably be just to embed a Google map on a page of your application and let the user calculate the distance. You could pre-fill the to and from fields if required.
Again, if this is for an internal app (i.e. not publicly available), my understanding of the Google T&Cs is that they would forbid this.
Use something like Firebug or Fiddler to look at the requests that are being sent to Google from the JavaScript. You should then be able to build the same request using that information and an HttpWebRequest in .NET, and retrieve the same information.
HTH
You can calculate the distance between two geographical coordinates (latitude, longitude) using the great-circle distance algorithm.
Here you can find some other formulas for distance calculation.
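For reference, a minimal great-circle (haversine) implementation; note that it returns straight-line distance, not road distance:

```python
from math import asin, cos, radians, sin, sqrt

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine formula; 6371 km is the mean Earth radius."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# London to Paris, roughly 344 km as the crow flies:
print(round(great_circle_km(51.5074, -0.1278, 48.8566, 2.3522)))
```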
Well, you've pretty much identified the key issue: the Gmaps API is a browser-resident JavaScript API, and there's not much getting around that. Most of the API is executed in the browser, so there's not much network traffic to spy on.
As tsaunders mentions, there is a geocoding API call that is RESTfully accessible, but it only does geocoding/reverse geocoding. If you have lat/lngs already you can use the calculations rms suggested, but as tsaunders points out, those give "as the crow flies" distance.
If indeed you are looking for road-taken distance, the API does do routing, but you are back in the browser to get the start/end points from the user.
Perhaps you can be a little more specific about what you are trying to do and why you feel this requires you to access the API from your server. My application, for instance, has features that gather information from the user and send requests back to my server to work on; some of that data is processed by the Gmaps API first.
If I were to use an API platform, I certainly would not use Google, as the free one does not include advanced geocoding, meaning the accuracy is poor. There is also no SLA, support, or rights of service.
The directions are poor, the coverage for Ireland and its geocoding is almost childlike, and the privacy stinks. No professional business would use a Google mapping solution.
They copy everyone else's ideas, say they are their own, and get loads of press (they only added tube stations in 2006 and cycle lanes in 2010; ViaMichelin added these in 2006, and traffic in 2009!).
Any agency or developers looking for an API should stick to Bing or ViaMichelin for better customisation and a user experience which is killer!

Implementing a New Message Notification Feature in a Server Farm Scenario

I'm working on a forum-based website that also supports on-site messaging (i.e. users can send private messages to other users). What I'm trying to do is notify a member when they have new messages, for example by displaying the inbox link in bold along with the number of messages, e.g. Inbox(3).
I'm a little confused about how this can be implemented for a website running on a server farm. Querying the database with every request seems like overkill to me, so that is out of the question; probably a shared cache should be used instead. I tend to think this is a common feature on many sites, including many large ones running on server farms, and I wonder how they implement it. Any ideas are appreciated.
SO caches the questions, but every postback re-queries your reputation. You can see this by writing a couple of good answers quickly, then refreshing the front page.
The questions will only change every minute or so, but you can watch your rep go up each time.
Waleed, I recommend you read the articles on High Scalability. They have specific case studies on the architectures of various mega-scale web applications. (See the sidebar on the right side of the main page.)
The general consensus these days is that RDBMS usage in this type of application is a bottleneck. It is also probably safe to say that most highly scalable web applications sacrifice consistency to achieve availability.
This series should be informative of the various views on the topic; "A Word on Scalability" is highly cited.
In all this, keep in mind that these folks are dealing with Flickr, Amazon, and Twitter scale issues and architectures. The solutions are somewhat radical departures from the (previously accepted) norms, and unless your forum application is the next Big Thing, you may wish to first test the conventional approach to determine whether it can handle the load (a sketch of that approach follows).
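If you do try the conventional approach first, the usual pattern is a cache-aside unread counter shared by every web server in the farm and invalidated whenever a new message arrives, so the database is only queried on a cache miss. A minimal single-process sketch (the SharedCache class is a stand-in for something like memcached, and query_db is a hypothetical data-access hook):

```python
import time

class SharedCache:
    """Stand-in for a farm-wide cache such as memcached or Redis."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        value, expires = self._store.get(key, (None, 0.0))
        return value if time.time() < expires else None
    def set(self, key, value, ttl=60):
        self._store[key] = (value, time.time() + ttl)
    def delete(self, key):
        self._store.pop(key, None)

cache = SharedCache()

def unread_count(user_id, query_db):
    key = "unread:%d" % user_id
    count = cache.get(key)
    if count is None:                  # miss: query the database once...
        count = query_db(user_id)
        cache.set(key, count, ttl=60)  # ...then serve from cache
    return count

def on_new_message(recipient_id):
    # Write path: invalidate so the next read recomputes the count.
    cache.delete("unread:%d" % recipient_id)

print(unread_count(42, lambda uid: 3))  # renders as Inbox(3)
```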
