I'm currently in the middle of the process of linking my Google analytics data to Big Query and the following note caught my attention when selecting the view to pick.
If this is the first time that you have linked this view, then data
will be backfilled for the smaller of 13 months or 10 billion hits.
Its a little unclear to me whether this 13 months of data will have costs in importing once I linked to BigQuery.
The process itself doesn't have a price component for back filling, but the storage it will occupy in BigQuery adds to your storage costs.
If you don't want old data, make sure you archive/remove/delete it.
Related
I am looking for options to ingest Google Analytics data(historical data as well) into Redshift. Any suggestions regarding tools, API's are welcomed. I searched online and found out Stitch as one of the ETL tools, help me know better about this option and other options if you have.
Google Analytics has an API (Core Reporting API). This is good for getting the occasional KPIs, but due to API limits it's not great for exporting great amounts of historical data.
For big data dumps it's better to use the Link to BigQuery ("Link" because I want to avoid the word "integration" which implies a larger level of control than you actually have).
Setting up the link to BigQuery is fairly easy - you create a project in the Google Cloud Console, enable billing (BigQuery comes with a fee, it's not part of the GA360 contract), add your email address as BigQuery Owner in the "IAM&Admin" section, go to your GA account and enter the BigQuery Project ID in the GA Admin section, "Property Settings/Product Linking/All Products/BigQuery Link". The process is described here: https://support.google.com/analytics/answer/3416092
You can select between standard updates and streaming updated - the latter comes with an extra fee, but gives you near realtime data. The former updates data in BigQuery three times a day every eight hours.
The exported data is not raw data, this is already sessionized (i.e. while you will get one row per hit things like the traffic attribution for that hit will be session based).
You will pay three different kinds of fees - one for the export to BigQuery, one for storage, and one for the actual querying. Pricing is documented here: https://cloud.google.com/bigquery/pricing.
Pricing depends on region, among other things. The region where the data is stored might also important be important when it comes to legal matters - e.g. if you have to comply with the GDPR your data should be stored in the EU. Make sure you get the region right, because moving data between regions is cumbersome (you need to export the tables to Google Cloud storage and re-import them in the proper region) and kind of expensive.
You cannot just delete data and do a new export - on your first export BigQuery will backfill the data for the last 13 months, however it will do this only once per view. So if you need historical data better get this right, because if you delete data in BQ you won't get it back.
I don't actually know much about Redshift, but as per your comment you want to display data in Tableau, and Tableau directly connects to BigQuery.
We use custom SQL queries to get the data into Tableau (Google Analytics data is stored in daily tables, and custom SQL seems the easiest way to query data over many tables). BigQuery has a user-based cache that lasts 24 hours as long as the query does not change, so you won't pay for the query every time the report is opened. It still is a good idea to keep an eye on the cost - cost is not based on the result size, but on the amount of data that has to be searched to produce the wanted result, so if you query over a long timeframe and maybe do a few joins a single query can run into the dozens of euros (multiplied by the number of users who use the query).
scitylana.com has a service that can deliver Google Analytics Free data to S3.
You can get 3 years or more.
The extraction is done through the API. The schema is hit level and has 100+ dimensions/metrics.
Depending on the amount of data in your view, I think this could be done with GA360 too.
Another option is to use Stitch's own specfication singer.io and related open source packages:
https://github.com/singer-io/tap-google-analytics
https://github.com/transferwise/pipelinewise-target-redshift
The way you'd use them is piping data from into the other:
tap-google-analytics -c ga.json | target-redshift -c redshift.json
I like Skyvia tool: https://skyvia.com/data-integration/integrate-google-analytics-redshift. It doesn't require coding. With Skyvia, I can create a copy of Google Analytics report data in Amazon Redshift and keep it up-to-date with little to no configuration efforts. I don't even need to prepare the schema — Skyvia can automatically create a table for report data. You can load 10000 records per month for free — this is enough for me.
What is the time taken by google analytics to start exporting historical data into the google cloud after the linking is complete between Big Query and Google Analytics.
According to Google's documentation:
Once the linkage is complete, data should
start flowing to your BigQuery project within 24 hours. 1 file will be
exported each day that contains the previous day’s data, and 3 files
will be exported each day that contain the current day's data. We will
provide a historical export of the smaller of 10 billion hits or 13
months of data within 4 weeks after the integration is complete.
We are having some issues pulling yesterday's Google Analytics data from BigQuery. Can anyone explain at what point a previous day's GA data is finalized?
There is some explanation here of the intraday tables, but it's not very clear:
https://support.google.com/analytics/answer/3437719?hl=en
To get previous day data do you need to need to use the intraday tables at all? Do you have access to the fully processed dataset at 8am local time? Or is it 8 hours after the current day UTC+14:00 (etc)?
I had a similar question and asked their support, this is the reply:
"According to this Google Analytics documentation , it states that '1 file will be exported each day that contains the previous day’s data, and 3 files will be exported each day that contain the current day's data'. In such, the minimum time that the data from Google Analytics to be exported to BigQuery was 8 hours. Although Google Analytics can be linked to BigQuery, the availability of data depends on how it was served by Google Analytics 360."
But based on experience, it's really a minimum time. Sometimes there are delays of 4-5 hours.
My team has been pressing Google's support for providing SLA's for BigQuery dump, so they updated the documentation:
This feature is not governed by a service-level agreement (SLA).
In practice we are experiencing regular delays anywhere between 2 to 12 hours.
I've been using a SSIS Integration component to download data from Google Analytics in order to keep an historical view of some websites and track the evolution of them. Basically the metrics we track are Visits (now Sessions) and Visitros (now Users), and the dimensions are Year and Month. However, today I noticed that the data I downloaded for july had a variation on the Users metric. I heard that google analytics uses an estimation method to "calculate" some (if not all) of their metrics, could it be that after that they "adjust" the data with more acurate information? If so, is this mentioned in the documentation? (a link would be highly appreciated) Since the users are complaining that we are not delivering the real GA Data. I tried looked on the Google analytics documentation page with no luck.
Thanks for your time.
PS: Sorry for my english, it isn´t my native language
If you are using the standard version of Google Analytics (you'll know if you are paying $150k for premium), data is sampled depending on volume. Have a read of this article can-you-trust-your-google-analytics-data
I have seen very slightly differing results being returned if you repeatedly call the api with the same historical parameters repeatedly. In my case the figures only differed by 1-2 over a daily set of several thousand, but nevertheless it differed.
If you want to guarantee your results, consider upgrading to premium
Sampling could be an issue if what you are requesting is over 50,000 rows for the time period you are requesting. To avoid it you can download more often, such as daily.
But I think your issue is that there is a processing time for Google Analytics - if you are downloading at 3 am on the 1st it is probable that the processing for the previous day has not finished.
Google Analytics Premium SLA is for 4 hour data freshness, so even that would have trouble. Pragmatically you should allow 24 hours before you download data for the previous day, 48 hours for e-commerce data.
Thirdly make sure it is not Unique Visitors you are requesting, as this is dependent on the time period you are requesting.
I'm using "Reporting google Analitics API" and I can’t find information about what the last “end date” with data in Analytics is.
For example, let's suppose you want to retrive the last month’s data.
When do you have to perform the query?
The first day of the current month?
...or the second one?
...or maybe the third one?
And only another question: are the returned data for days in pacific time?
Google Analytics API is supposed to have access to the same data you have in the interface.
Google says that data can take up to 24h to process. The time it takes to really update the data depends on the type and size of the account. Small accounts are updated multiple times a day and can have data available in just a few hours. Once you reach 1M hits a month you are moved to a different mode where the data on your account is updated only once a day. Google Analytics Premium customers have updates more often even for large ammounts of traffic.
There's no way to tell through the API what is exactly the time of the last hit processed. You can query the data for today by the hour and see for yourself though.
Usually you don't care and just want to make sure that the data you're querying has been fully processed for that day.
So if you query data for yesterday there's a chance it has not being completely updated, for example if it's midnight the data for yesterday is just a couple minutes ago and probably haven't been completely processed yet. The safest bet in this case is to query data for 2 days ago.
So if today is 2012-06-15 and you want to get 1 month of data a safe approach is to query data with start-date=2012-05-13 and end-date=2012-06-13. This will most of the time give you data for days that have been fully processed, but it's not 100% safe as well. Google Analytics have had outages in the past where data took longer than that to process, these are not usual though. When you get the data out it's really hard to tell just for the API if the data for those days have been fully processed or not, using the 2 days ago isea you just make it more likely that it is.
The days are aggregate following your timezone settings configured on the Google Analytics profile.