I have a Google Analytics event label with high cardinality that I'd like to implement - it is a string that can take on any combination of a finite-but-large number of names in a comma-separated list.
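For context, the hit I have in mind looks roughly like this (a TypeScript/analytics.js sketch; the category, action, and name list are placeholders):

```
// Hypothetical example: category/action names and the name list are placeholders.
declare function ga(...args: unknown[]): void; // analytics.js global

// Any subset of a large-but-finite set of names, joined into one string.
const selectedNames: string[] = ["alice", "bob", "carol"];

ga("send", "event", "widgets", "select", selectedNames.join(","));
```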
I'm worried mainly about losing data - I found this Analytics Help support page:
https://support.google.com/analytics/answer/1009671?hl=en
...which states:
Reports containing high-cardinality dimensions may be affected by Analytics system limits, resulting in the creation of a rolled-up (other) entry in the report to contain the data that exceeds these limits.
...and am wondering if that would also affect reports without the label included, i.e., reports just looking at unique category/action pairings - would GA still roll up otherwise-identical rows into "(other)" entries if the (undisplayed) labels are different?
I'm also wondering whether there would be any performance hit for similar report types (not looking at labels, just category/action pairings).
Maybe this is just bad practice out of the gate? :)
Google Analytics stores, in the daily processed tables, up to a maximum of 50,000 rows (in Google Analytics 360 the limit increases to 1,000,000 rows, making the data-aggregation problem less frequent). In other words, only that many combinations of unique dimension values are stored in each daily processed table. If a given table has a larger number of dimension-value combinations, Analytics stores the top N combinations and creates a row of type (other) for the remaining ones.
https://www.analyticstraps.com/valori-raggruppati-in-other-nei-report/
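Conceptually, the rollup described above works like this sketch (an illustration of the top-N plus (other) behavior only, not Google's actual implementation):

```
// Illustration only: how a daily processed table ends up with an "(other)" row
// once the number of unique dimension-value combinations exceeds the row limit.
const ROW_LIMIT = 50_000; // 1,000,000 for Analytics 360

function rollUp(rows: Map<string, number>): Map<string, number> {
  if (rows.size <= ROW_LIMIT) return rows;

  // Keep the top combinations by hit count...
  const sorted = [...rows.entries()].sort((a, b) => b[1] - a[1]);
  const kept = new Map(sorted.slice(0, ROW_LIMIT - 1));

  // ...and collapse everything else into a single "(other)" row.
  const otherTotal = sorted
    .slice(ROW_LIMIT - 1)
    .reduce((sum, [, count]) => sum + count, 0);
  kept.set("(other)", otherTotal);
  return kept;
}
```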
Anyway, I tried a custom report with and without the label (same time period): with the label I got (other), while without that dimension I got the actual values.
So the problem you fear does not exist (unless the event action is also high cardinality) :)
Related
I am trying to break down Pageviews by URL with Age as a secondary dimension, but the result set shrinks and shows only a set of old URLs, not the new ones.
Here is a report for the last 3 years without a secondary dimension. It shows 1009 URLs.
When I add the secondary dimension "Age", the results come down to 9 for the same period.
I think that age is only applied in certain scenarios where an ad campaign has targeted certain demographics.
What you are likely seeing is the subset of your data that only contains that dimension.
When selecting a dimension, there isn't an "everything else" label unless it is explicitly set, so you won't see the rest of your data - but you can assume that "all data minus the age data equals the rest of the data". This is the same scenario as when you set a custom dimension - if you only record the dimension when there is a value (e.g. a promocode on an ecommerce transaction), you will only ever see traffic that has a value applied.
In this instance there would need to have been a "no age set" value on the dimension to get the rest of the traffic - which is how demographics work (again, slightly unsure). This is similar to how "(direct)/(none)" is used as a default for source/medium when no source or medium can be discovered.
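The custom-dimension pattern mentioned above looks roughly like this (a TypeScript sketch against analytics.js; "dimension1" and the promo-code handling are hypothetical):

```
// Sketch: the dimension is only sent when there is a value, so any report
// built on it only ever shows that subset of traffic.
declare function ga(...args: unknown[]): void;

function trackPurchase(promoCode?: string): void {
  if (promoCode) {
    ga("set", "dimension1", promoCode); // hits without a promo code never carry this
  }
  ga("send", "event", "ecommerce", "purchase");
}
```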
Thresholds are applied to prevent anyone viewing a report from inferring the demographics or interests of individual users. When a report contains Age, Gender, or Interest Category (as a primary or secondary dimension, or as part of an applied segment), a threshold may be applied and some data may be withheld from the report.
Documentation: https://support.google.com/analytics/answer/2799357?hl=en
My article with a test: https://www.analyticstraps.com/i-report-con-i-dati-demografici-non-tornano/
I've been tasked with updating our ecomm tracking but have been told it was not previously implemented with Enhanced Ecommerce because:
... it has limits around number of products. As we have 100,000's of 'Products' due to ... it's not a good fit.
Nonetheless, I am unable to find any conclusive evidence, via official or unofficial sources, of such a limitation.
I'd like to upgrade to Enhanced Ecommerce for obvious reasons, so does anyone have an idea of the limitations around unique product (by id/sku) maximums, or anything else?
There's no limit on collecting unique SKUs or other dimensions, but you might have problems during reporting. Limits apply during the processing of high-cardinality dimensions, and many of their values may get aggregated as (other) in your reports.
Each report dimension (e.g., Page, Browser, Screen Resolution, etc.) has a number of values that can be assigned to it. The total number of unique values for a dimension is known as its cardinality. For instance, the Mobile (or ga:isMobile) dimension has two potential values (Yes or No), so the cardinality for that dimension is two. Other dimensions can have any number of values assigned. For example, the Page dimension has a different value for every URL that appears on your site.
Dimensions with a large number of possible values are known as high-cardinality dimensions. Reports containing high-cardinality dimensions may be affected by Analytics system limits, resulting in the creation of a rolled-up (other) entry in the report to contain the data that exceeds these limits.
For further details and actual limits you can check this support article.
Even if these limitations get applied at the SKU level, you can still benefit from Product Category level reports and from the general shopping/checkout behavior reports.
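For reference, collecting a unique SKU per product with the ec.js plugin looks like the sketch below (the product and transaction values are made up). Collection itself imposes no cardinality limit; it is only the processed reporting tables that roll values into (other).

```
// Sketch of an Enhanced Ecommerce purchase hit (analytics.js + ec.js plugin).
// SKU, name, and transaction values are made-up examples.
declare function ga(...args: unknown[]): void;

ga("require", "ec");

ga("ec:addProduct", {
  id: "SKU-0012345",        // unique SKU - no collection limit on cardinality
  name: "Example Product",
  category: "Example Category",
  price: "29.90",
  quantity: 1,
});

ga("ec:setAction", "purchase", {
  id: "T-10001",            // transaction id
  revenue: "29.90",
});

ga("send", "pageview"); // the ecommerce data is sent with the next hit
```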
I would love to understand what I'm looking at - why are the numbers different in this report when I add a segment?
This is the report without any segmentation:
This is the same report with the Mobile Traffic segment:
There are two methods that Google uses to calculate the number of users.
Calculation 1: Pre-calculated data
This calculation relies only on the number of sessions in the given date range and the time of each session. (This is determined by technology managed on the device, like a web browser, and is often referred to as the client-side time.) Because the result of this calculation can be added to the pre-aggregated data tables, Analytics can reference the table to quickly retrieve and serve this data in a report, including when you change the date range.
Calculation 2: Data calculated on the fly
Calculation 2 is based on the way you assign, collect, and store persistent data about your traffic. There are many solutions you can implement to customize this, but the most common way this data is going to be assigned and stored is through cookies managed via a web browser.
Adding a segment will force GA to calculate the data on the fly and that's why you are seeing a difference in the numbers.
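As a simplified illustration of why an on-the-fly calculation can differ from one assembled from pre-aggregated daily tables (this is only the deduplication idea, not Google's exact algorithm):

```
// Simplified illustration (not Google's actual algorithm): summing per-day
// unique-user counts double-counts users who return on a later day, while an
// on-the-fly calculation can deduplicate across the whole date range.
interface Session { clientId: string; day: string; }

const sessions: Session[] = [
  { clientId: "A", day: "2017-03-01" },
  { clientId: "A", day: "2017-03-02" }, // same user, second day
  { clientId: "B", day: "2017-03-02" },
];

// "Pre-calculated" style: distinct users per day, stored ahead of time.
const perDay = new Map<string, Set<string>>();
for (const s of sessions) {
  if (!perDay.has(s.day)) perDay.set(s.day, new Set());
  perDay.get(s.day)!.add(s.clientId);
}
const summedDailyUsers = [...perDay.values()].reduce((n, set) => n + set.size, 0); // 3

// "On the fly" style: distinct users across the whole range.
const rangeUsers = new Set(sessions.map((s) => s.clientId)).size; // 2

console.log({ summedDailyUsers, rangeUsers });
```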
Are you using GA free or 360? And is the time range the same in both reports?
You can also have a look into the Google article https://support.google.com/analytics/answer/2992042?hl=en
You are a victim of sampling:
https://support.google.com/analytics/answer/2637192?hl=en
Sampling applies when:
you customize the reports
the number of sessions for the report time range exceeds 500K (GA) or 100M (GA 360)
The consequence is that:
the report will be based on a subset of the data (the % depends on the total number of sessions)
therefore your report data won't be as accurate as usual
What you can do to reduce sampling:
increase the sample size in the UI (this will only decrease sampling to a certain extent; in most cases it won't completely remove it)
reduce time range
create filtered views so your reports contain the data you need and you don't have to customize them
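If you pull the same report through the Reporting API v4, the response also tells you whether and how heavily it was sampled. A small sketch, assuming `report` is one entry of the `reports[]` array returned by `batchGet` (the field names come from the v4 ReportData schema):

```
// Sketch: checking how heavily a Reporting API v4 response was sampled.
interface ReportData {
  samplesReadCounts?: string[];   // sessions actually read, one entry per date range
  samplingSpaceSizes?: string[];  // sessions that were eligible
}
interface Report { data: ReportData; }

function samplingRate(report: Report): number | null {
  const read = report.data.samplesReadCounts;
  const space = report.data.samplingSpaceSizes;
  if (!read || !space) return null; // fields absent => the report is unsampled
  return Number(read[0]) / Number(space[0]);
}
```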
I'm having a tough time with Google Analytics, trying to understand why the value of metrics changes when segments are applied.
There is a standard audience overview report, which is based on 100% of sessions (no sampling) and the view is not filtered. The period is March of 2017.
Standard "All visitors" segment looks like this:
Then, there is another built-in segment called "Bounced Sessions". When I apply this segment, the "All visitors" values change:
The number of users increases, but the count of pageviews decreases.
Any ideas how to explain this?.. Thank you in advance!
OK, there can be multiple reasons. Let me first explain how these numbers are calculated, then we'll move on to your query.
There are two types of data gathering and manipulation in Google Analytics.
Pre-calculated data -- pre-aggregated tables
This is pre-calculated data that Google uses to speed up the UI. Google does not specify when the aggregation is done; it can happen at any point in time. These are known as pre-aggregated tables.
Data calculated on the fly
Anything you do that requires extra computation or manipulation falls under this category, such as applying segments or creating custom reports.
Coming to your problem: when you apply a segment, every metric it affects is calculated again, which may result in numbers greater than those you see in the normal view.
The standard Audience Overview report is pre-aggregated at some point during the day. When you apply a segment, the results are calculated with fresher data. Since the latter is more recent, it can give you higher numbers for some metrics; you can also see a decrease - it all depends on your data and user behavior.
Resolution: if you are a premium (360) user, use BigQuery. You can rely on BigQuery for every metric, as the data there is fresh and queried on the fly.
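If you do have 360 with the BigQuery export enabled, a query like the sketch below counts users and pageviews straight from the raw session tables (the project/dataset names are placeholders; the ga_sessions_* tables and fields are from the standard export schema):

```
// Sketch: users and pageviews for March 2017 from the GA 360 BigQuery export.
import { BigQuery } from "@google-cloud/bigquery";

async function marchTotals(): Promise<void> {
  const bigquery = new BigQuery();
  const query = `
    SELECT
      COUNT(DISTINCT fullVisitorId) AS users,
      SUM(totals.pageviews)         AS pageviews
    FROM \`my-project.my_dataset.ga_sessions_*\`  -- placeholder project/dataset
    WHERE _TABLE_SUFFIX BETWEEN '20170301' AND '20170331'
  `;
  const [rows] = await bigquery.query({ query });
  console.log(rows[0]); // { users: ..., pageviews: ... }
}

marchTotals().catch(console.error);
```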
I'm trying to pull the (unique) visitor count for a certain directory using three different methods:
* with a profile
* using a dynamic advanced segment
* using a custom report filter
On a smaller site the three methods give the same result. But on the large site (> 5M visits/month) I get a big discrepancy between the profile on the one hand and the advanced segment and filter on the other. This might be because of sampling - but the difference is smaller when it comes to pageviews. Is the estimation of visitors worse, and the discrepancy bigger, when using sampled data? Also, when extracting data from the API (using filters or profiles) I still get DIFFERENT data even though GA doesn't indicate that the data is sampled - i.e., I should be looking at unsampled data.
Another strange thing is that the pageviews are higher in the profile than in the filter, while the visitor count is higher for the filter than for the profile. I also applied a filter to the profile to force it to use sampled data - and then I again get results quite similar to the filter and segment data.
            profile    filter     segment    filter#profile
unique      25550      37778      36433      37971
pageviews   202761     184130     n/a        202761
What I am trying to achieve is to find a way to get somewhat accurate data on unique visitors when I've run out of profiles to use.
More data with discrepancies can be found in this google docs: https://docs.google.com/spreadsheet/ccc?key=0Aqzq0UJQNY0XdG1DRFpaeWJveWhhdXZRemRlZ3pFb0E
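For reference, this is roughly how I check whether the API response is flagged as sampled (Core Reporting API v3; the view ID, date range, and filter are placeholders):

```
// Sketch: the v3 Core Reporting API flags sampling via `containsSampledData`.
interface GaV3Response {
  containsSampledData: boolean;
  totalsForAllResults: Record<string, string>;
}

async function fetchVisitors(accessToken: string): Promise<GaV3Response> {
  const params = new URLSearchParams({
    ids: "ga:12345678",                         // placeholder view (profile) ID
    "start-date": "2012-05-01",
    "end-date": "2012-05-31",
    metrics: "ga:visitors,ga:pageviews",
    filters: "ga:pagePath=~^/some-directory/",  // placeholder directory filter
  });
  const res = await fetch(
    `https://www.googleapis.com/analytics/v3/data/ga?${params}`,
    { headers: { Authorization: `Bearer ${accessToken}` } }
  );
  return (await res.json()) as GaV3Response;
}
```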
Google Analytics (free version) tracks only 10 million page interactions [0] (pageviews and events; any tracker method that starts with "track" is an interaction) per month [1], so presumably the data for your larger site is already heavily sampled (I guess each of your 5 million visitors has more than two interactions) [2]. Ad hoc reports use at most 1 million data points, so you have a sample of a sample. Naturally, aggregated values suffer more from smaller sample sizes.
And I'm pretty sure the data limits apply to API access too (Google says that there is "no assurance that the excess hits will be processed"), so for the large site the API returns sampled (or incomplete) data as well - so you cannot really be looking at unsampled data.
As for the differences, I'd say that different ad hoc reports use different samples, so you end up with different results. With GA you shouldn't rely too much on absolute numbers anyway, and look more at general trends.
[1] Analytics Premium tracks 50 million interactions per month (and comes with support from Google) but costs 150,000 USD per year.
[2] Google suggests using "_setSampleRate()" on large sites to make sure you have properly sampled data for each day of the month, instead of random hit-or-miss collection after you exceed the data limits.
[0] Data limits: http://support.google.com/analytics/bin/answer.py?hl=en&answer=1070983
setSampleRate:
https://developers.google.com/analytics/devguides/collection/gajs/methods/gaJSApiBasicConfiguration#_gat.GA_Tracker_._setSampleRate
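For completeness, with the classic ga.js async snippet the sample rate is set before the first pageview, roughly like this sketch (the property ID is a placeholder):

```
// Sketch: client-side sampling with the ga.js async queue.
// "UA-XXXXX-Y" is a placeholder property ID; the rate is a percentage string.
declare var _gaq: unknown[][];

_gaq = _gaq || [];
_gaq.push(["_setAccount", "UA-XXXXX-Y"]);
_gaq.push(["_setSampleRate", "50"]); // sample roughly 50% of visitors
_gaq.push(["_trackPageview"]);
```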
Yes, the sampled data is less accurate, especially with visitor counts.
I've also seen them miss 500k pageviews over two days, only to see them appear in their reporting a few days later. It also doesn't surprise me to see different results from different interfaces. The quality of Google Analytics has diminished, even as they have tried to become more real-time. It appears that their codebase is inconsistent across APIs, and their algorithms are all over the map.
I usually stick with the same metrics and reporting methods, so that my results remain comparable to one another. I also run GA in tandem with Gaug.es, as a validation and sanity check. With that extra data, I choose the reporting method in GA that I am most confident with and I rely on that exclusively.