Google Analytics: Difference between Data Import & Daily upload

I can't find answers regarding how the Google Analytics cost data upload works. I have a few questions:
1. Why two different upload methods? Why not just have one dimension-widening upload option?
2. Let's say I upload costs using the daily upload method and afterwards upload costs using Data Import. Will the Data Import override the daily data, will the data be merged, or will it be deleted? What happens in this case? Will the next daily upload override the Data Import data?
3. If I delete uploaded data, will costs in the GA reports be reset to zero?
4. Does the lifetime data storage limit per property also apply to daily uploads, or only to Data Import?
Thank you for your help!

1. Dimension Widening is for the generalized case, whereas Cost Data is for a specific upload case (i.e. there are specific reports in GA for cost data). However, your question is still valid as to why there are 2 separate uploads; it doesn't look necessary, but that's the way it is for now.
2. Daily upload and Data Import work independently of each other; one is not going to affect the other. You should be using daily upload for cost data since it has a set schema and reports, and Data Import for other data sets where you define the schema (see the sketch after this list).
3. For Cost Data, if you delete the data then yes, your reports will not show any related cost data anymore (or you can just unlink the profile from the data set to accomplish the same result). For Dimension Widening, even if you delete the data set or unlink it, any data that was joined before the deletion or unlinking will stay in the reports; you can't remove it after the fact. This is because of differences between the two in how the data gets joined.
4. As stated in #2, these are different mechanisms, so they don't share storage. Daily upload has limits that apply to each upload (e.g. 20 appends/day, max 5 MB per append). The doc you linked to states what the limits are for each; just treat them separately unless stated otherwise.
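For what it's worth, a Data Import upload to a custom data source through the Management API looks roughly like the sketch below. This is only a sketch: it assumes the Python google-api-python-client with Application Default Credentials, and all the IDs and the file name are placeholders. The daily-upload endpoint is a separate Management API resource with its own parameters, so treat this as illustrating the Data Import side only.

```python
# Rough sketch of a cost upload to a Data Import custom data source via the
# Management API (v3), assuming google-api-python-client; all IDs are placeholders.
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

analytics = build("analytics", "v3")  # relies on Application Default Credentials

media = MediaFileUpload("cost_data.csv", mimetype="application/octet-stream")
analytics.management().uploads().uploadData(
    accountId="12345678",            # placeholder account ID
    webPropertyId="UA-12345678-1",   # placeholder web property ID
    customDataSourceId="abc123XYZ",  # placeholder custom data source (data set) ID
    media_body=media,
).execute()
```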

Related

Best way to handle intraday GA data in BigQuery

I have configured a Google Analytics raw data export to BigQuery.
Could anyone from the community suggest efficient ways to query the intraday data? We have noticed a problem with the intraday sync (which runs with a ~15 minute delay): the streamed data grows enormously across the sync intervals.
For example:
Every day, the (T-1) batch table (ga_sessions_yyyymmdd) syncs about 15-20 GB with 3.5M-5M records.
On the other hand, the intraday data (with the 15 min delay) streams more than ~150 GB per day with ~30M records.
https://issuetracker.google.com/issues/117064598
It's not cost-effective to persist and query this data.
And is this a product bug or expected behavior, given that such exponentially growing data is not cost-effective to use?
Querying BigQuery costs $5 per TB, and streaming inserts cost ~$50 per TB.
In my view, it is not a bug; it is a consequence of how data is structured in Google Analytics.
Each row is a session, and inside each session you have a number of hits. As we can't afford to wait until a session is completely finished, every time a new hit (or group of hits) occurs the whole session needs to be exported again to BQ. Updating the row is not an option in a streaming system (at least in BigQuery).
I have already created some streaming pipelines in Google Dataflow with session windows (not sure if that is what Google uses internally), and I faced the same dilemma: wait and export the aggregate only once, or export continuously and accept the exponential growth.
Some advice I can give you about querying the ga_realtime_sessions table:
Only query the columns you really need (no SELECT *);
Use the view that is exported alongside the daily ga_realtime_sessions_yyyymmdd table; it doesn't reduce the size of the query, but it will prevent you from using duplicated data (see the sketch after this list).
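To make that advice concrete, here is a minimal query sketch assuming the Python google-cloud-bigquery client. The project, dataset, table date, and column list are placeholders for illustration, not your actual export.

```python
# Minimal sketch: query the exported view rather than the internal
# ga_realtime_sessions_* table, and list only the columns you need.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT
      fullVisitorId,
      visitStartTime,
      totals.pageviews AS pageviews
    FROM `my-project.my_ga_dataset.ga_realtime_sessions_view_20190101`
"""
# Note: depending on the export, this view may be defined in legacy SQL; if the
# query fails, re-run it with bigquery.QueryJobConfig(use_legacy_sql=True) and
# the legacy table syntax [my-project:my_ga_dataset.ga_realtime_sessions_view_20190101].
for row in client.query(sql).result():
    print(row.fullVisitorId, row.visitStartTime, row.pageviews)
```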

Google Analytics realtime data in BigQuery

We have enabled continuous export of Google Analytics data to BigQuery, which means we get ga_realtime_sessions_YYYYMMDD tables with data dumps throughout the day.
These tables are – usually! – left in place, so we accumulate a stack of the realtime tables for the previous n dates (n does not seem to be configurable).
However, every once in a while, one of the tables disappears, so there will be gaps in the sequence of dates and we might not have a table for e.g. yesterday.
Is this behaviour documented somewhere?
It would be nice to know what guarantees we have, as we might rely on e.g. realtime data from yesterday while we wait for the “finished” ga_sessions_YYYYMMDD table to show up. The support document linked above does not mention this.
As stated in this help article, the internal ga_realtime_sessions_YYYYMMDD tables should not be queried directly; the ga_realtime_sessions_view_YYYYMMDD view should be used instead, in order to obtain fresh data and avoid unexpected results.
If you want to use data from a previous day while you wait for today's internal ga_realtime_sessions_YYYYMMDD tables to be created, you can copy the result of querying the ga_realtime_sessions_view_YYYYMMDD view into a separate table at the end of each day for this purpose.
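A rough sketch of that end-of-day copy, assuming the Python google-cloud-bigquery client; the project, dataset, and destination table names are placeholders.

```python
# Materialize the day's realtime view into a plain table so the data survives
# after the realtime tables roll over; all names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT *
    FROM `my-project.my_ga_dataset.ga_realtime_sessions_view_20190101`
"""
job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(
        "my-project.my_ga_dataset.ga_realtime_backup_20190101"
    ),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(sql, job_config=job_config).result()  # blocks until the copy finishes
```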

Exporting all Marketo Leads in a CSV?

I am trying to export all of my leads from Marketo (we have over 20M) into a CSV file, but there is a 10k row limit per CSV export.
Is there any other way I can export a CSV file with more than 10k rows? I tried searching for data loader tools on Marketo Launchpoint but couldn't find one that would work.
Have you considered using the API? It may not be practical unless you have a developer on your team (I'm a programmer).
Marketo Lead API
If your leads are in Salesforce and Marketo/Salesforce are in parity, then instead of exporting all your leads, do a sync from Salesforce to the new MA tool (if you are switching). It's a cleaner, easier sync.
For important campaigns etc, you can create smart lists and export those.
There is no 10k row limit for exporting Leads from a list. However, there is a practical limit, especially if you choose to export all columns (instead of only the visible columns). I would generally advise exporting a maximum of 200,000-300,000 leads per list, so you'd need to create multiple Lists.
As Michael mentioned, the API is also a good option. I would still advise creating multiple Lists so you can run multiple processes in parallel, which will speed things up. You will need to look at your daily API quota: the default is either 10,000 or 50,000 calls. 10,000 API calls allow you to download 3 million Leads (at a batch size of 300).
I am trying out Data Loader for Marketo on Marketo Launchpoint to export my lead and activity data to my local database. Although it cannot transfer Marketo data to a CSV file directly, you can download leads to your local database and then export them to get a CSV file. For reference, we have 100K leads and 1 billion activity records.
You might have to run it multiple times for 20M leads, but the tool is quite easy and convenient to use, so it may be worth a try.
There are 4 steps to get bulk leads from Marketo (sketched in code below):
1. Creating an export job
2. Enqueuing the export lead job
3. Polling the job status
4. Retrieving your data
http://developers.marketo.com/rest-api/bulk-extract/bulk-lead-extract/
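A condensed sketch of those four steps in Python with the requests library, against the Bulk Lead Extract endpoints documented at the link above. The Munchkin base URL, access token, field list, and date filter are all placeholders; a real job runner would also handle Failed/Cancelled statuses and the 31-day filter window.

```python
import time
import requests

BASE = "https://<munchkin-id>.mktorest.com"   # placeholder instance URL
TOKEN = {"access_token": "<access-token>"}    # placeholder OAuth token

# 1. Create the export job with the fields you want and a date filter.
create = requests.post(
    f"{BASE}/bulk/v1/leads/export/create.json",
    params=TOKEN,
    json={
        "fields": ["firstName", "lastName", "email"],   # example field list
        "format": "CSV",
        "filter": {"createdAt": {"startAt": "2023-01-01T00:00:00Z",
                                 "endAt": "2023-01-31T00:00:00Z"}},
    },
).json()
export_id = create["result"][0]["exportId"]

# 2. Enqueue the export lead job.
requests.post(f"{BASE}/bulk/v1/leads/export/{export_id}/enqueue.json", params=TOKEN)

# 3. Poll the job status until it completes.
while True:
    status = requests.get(
        f"{BASE}/bulk/v1/leads/export/{export_id}/status.json", params=TOKEN
    ).json()["result"][0]["status"]
    if status == "Completed":
        break
    time.sleep(60)

# 4. Retrieve the data as a CSV file.
csv_bytes = requests.get(
    f"{BASE}/bulk/v1/leads/export/{export_id}/file.json", params=TOKEN
).content
with open("leads.csv", "wb") as f:
    f.write(csv_bytes)
```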

How well does scriptDB work as a substitute for storing data directly in Google Spreadsheets?

I want to retrieve data from the Google Analytics API, create custom calculations, and then push the aggregations to a Google Spreadsheet in order to reuse them in a Google Visualization API app. My concern is that I'll hit the spreadsheet cell quota very quickly with the raw data needed for the calculations.
I know scriptDB quota is 100MB but before I invest time and resources in learning how it works I'd like to get an idea whether it's feasible for storing raw analytics data (provided it's not too granular and it's just designed to answer specific questions) and how much of it I could realistically store in scriptDB (relative to spreadsheets) before I hit the quota.
Thanks
For bulk data access (e.g. reading a table for Visualization), a spreadsheet will have a speed advantage over ScriptDb (see "What is faster: ScriptDb or SpreadsheetApp?"). If you wish to support more sophisticated queries, though, to "answer specific questions" as you mention, then ScriptDb will give you an edge, as query times vary with the number of results but should be unaffected by the query criteria themselves.
With data in a spreadsheet, you will be able to obtain a DataTable for Visualization with a single Range.getDataTable() operation. With ScriptDb, you will need to write a script to build your DataTable.
Regarding size constraints, it's not possible to really compare the two without knowing the size of your individual data elements. You're already aware of the general constraints:
Spreadsheet: 40K cells, but you may hit an (unspecified) size limit before that, depending on data element sizes.
ScriptDb: 50MB, 100MB, or 200MB depending on account type. The number of objects that can be stored is affected by the complexity (depth) of the objects and the size of the property names, and of course by the size of the data contained in the objects.
Ultimately, the question of which is best for your application is a matter of opinion, and of which factors matter most for the application. If the analytics data is tabular, then a spreadsheet offers advantages for implementation largely because of Range.getDataTable(), and is faster for bulk access. I'd recommend starting there, and considering a move to ScriptDb if and when you actually hit spreadsheet size or query performance limitations.

How to handle large amounts of data for a web statistics module

I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store an entry in a statistics table each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow pre-parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data-parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of computing them "online" when the reporting page loads, you calculate them offline, either via a nightly batch process or incrementally as each log record is written.
A simple enhancement would be to store counts per user/session instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor on the order of the number of hits per session. Of course, it would increase processing costs when inserting log entries.
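As a toy illustration of the nightly batch variant, here is a sketch in Python using sqlite3 so it stays self-contained; the table and column names are assumptions for the example, not your actual schema.

```python
# Roll raw per-hit rows up into one summary row per (day, zone), so the report
# pages read the small summary table instead of scanning the raw log.
import sqlite3

conn = sqlite3.connect("stats.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS hits (
        zone      TEXT,      -- e.g. 'website', 'category', 'minisite'
        user_id   TEXT,
        hit_time  TEXT       -- ISO timestamp
    );
    CREATE TABLE IF NOT EXISTS daily_zone_counts (
        day           TEXT,
        zone          TEXT,
        unique_users  INTEGER,
        hits          INTEGER,
        PRIMARY KEY (day, zone)
    );
""")

# Nightly job: aggregate yesterday's raw hits into the summary table.
conn.execute("""
    INSERT OR REPLACE INTO daily_zone_counts (day, zone, unique_users, hits)
    SELECT date(hit_time), zone, COUNT(DISTINCT user_id), COUNT(*)
    FROM hits
    WHERE date(hit_time) = date('now', '-1 day')
    GROUP BY date(hit_time), zone
""")
conn.commit()
```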
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on a range into which a value falls into. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data that is outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index has to consider every row, so it will grow with your data; a partition is one per day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
