Discrepancy in the "active users" metric between the Firebase Analytics dashboard and the BigQuery export

According to the Firebase Analytics docs (https://support.google.com/firebase/answer/6317517#active-users), the number of active users is the number of unique users who initiated sessions on a given day. Also according to the docs, every time a session starts an event named session_start is sent. I am trying to compute that metric from BigQuery's export, but my query gives me different results (15636 on BigQuery, 14908 on Firebase Analytics).
I have also tried converting to different timezones to see whether that might be the issue, but no matter which timezone I try I never get the same (or even similar) results.
Which query should I run to get the same active users number I see on the Firebase Analytics dashboard?
My query is:
SELECT
  EXACT_COUNT_DISTINCT(user_dim.app_info.app_instance_id) AS active_users
FROM
  TABLE_DATE_RANGE([XXXXX.app_events_],
                   TIMESTAMP('2016-11-26'), TIMESTAMP('2016-11-29'))
WHERE
  -- timestamp_micros is an INTEGER (microseconds since the epoch),
  -- so convert it to a TIMESTAMP before extracting the date
  DATE(USEC_TO_TIMESTAMP(event_dim.timestamp_micros)) = '2016-11-27'
  AND event_dim.name = 'session_start'
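For comparison, a rough Standard SQL equivalent of the same count; this is only a sketch, assuming the old app_events_ export schema (where event_dim is a repeated record), with XXXXX remaining a placeholder:
-- Standard SQL sketch: distinct app instances with a session_start
-- on 2016-11-27, scanning the same range of daily tables
SELECT
  COUNT(DISTINCT t.user_dim.app_info.app_instance_id) AS active_users
FROM
  `XXXXX.app_events_*` AS t,
  UNNEST(t.event_dim) AS event
WHERE
  t._TABLE_SUFFIX BETWEEN '20161126' AND '20161129'
  AND DATE(TIMESTAMP_MICROS(event.timestamp_micros)) = '2016-11-27'
  AND event.name = 'session_start'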
Thanks
Update
After #djabi's answer I changed my query to use user_engagement rather than session_start, and it works much better now. There are still some minor differences, though (they range from under ten to under 50 out of roughly 16K, depending on the date).
I have tried once again using different timezones by playing around with DATE(date_add(event_dim.timestamp_micros, 1, 'hour')), but I never got exactly the number I see on the Firebase Analytics dashboard.
The new numbers are close enough to be statistically acceptable, but I am wondering whether anyone has a suggestion to improve the query and get exact results?
The current query is:
SELECT
  COUNT(*) AS active_users
FROM (
  SELECT
    COALESCE(user_dim.user_id, user_dim.app_info.app_instance_id) AS user_id
  FROM
    TABLE_DATE_RANGE([XXXXX.app_events_],
                     TIMESTAMP('2016-11-24'), TIMESTAMP('2016-11-29'))
  WHERE
    -- convert the integer microsecond timestamp before extracting the date
    DATE(USEC_TO_TIMESTAMP(event_dim.timestamp_micros)) = '2016-11-25'
    AND event_dim.name = 'user_engagement'
  GROUP BY
    user_id )
Note: At the moment we are not sending user_id, so the COALESCE will always return the app_instance_id, in case anyone was going to suggest that could be the problem

You need to wait a full 3 days for data from offline devices to be uploaded. Your query correctly filters the events based on the event timestamp, and you pull data from 3 days of tables, but that is only a day and a half from today, which is not enough for all the data to be uploaded. Try including 3 days up to yesterday.
Also try using the user_engagement event instead of session_start. I believe the active user count is based on user_engagement, not on session_start events.
Also, the FB reports take a while to process, so you might want to check the FB reports again the next day.
FB reports are done in the time zone set on the account, while events are timestamped in UTC, so a day in the FB reports is not the same as a UTC calendar day. You want to control for that discrepancy as well to get matching numbers.
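To illustrate, here is a minimal Legacy SQL sketch that applies both suggestions, counting user_engagement while shifting the UTC timestamps into the account's reporting time zone; the -8 hour offset is an assumed time zone and the dataset name is a placeholder, both of which you would adjust:
SELECT
  EXACT_COUNT_DISTINCT(user_dim.app_info.app_instance_id) AS active_users
FROM
  TABLE_DATE_RANGE([XXXXX.app_events_],
                   TIMESTAMP('2016-11-24'), TIMESTAMP('2016-11-29'))
WHERE
  -- shift the UTC microsecond timestamps by the account's offset
  -- (-8 hours is an assumption; use your own account's time zone)
  DATE(DATE_ADD(USEC_TO_TIMESTAMP(event_dim.timestamp_micros), -8, 'HOUR')) = '2016-11-25'
  AND event_dim.name = 'user_engagement'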

Sessions are by default registered after 10 seconds of user activity in the respective app, a threshold you can change. Try lowering the minimum session duration to the smallest value possible, and you may arrive at a number closer to what you are expecting.

For Android stats I used:
user_dim.device_info.resettable_device_id
instead of
user_dim.app_info.app_instance_id
and it produced better results.
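Plugged into the earlier query, that variant would look like this (a sketch with the same placeholder dataset name; note that resettable_device_id may be empty for users who have limited ad tracking):
SELECT
  -- count distinct resettable device IDs rather than app instance IDs
  EXACT_COUNT_DISTINCT(user_dim.device_info.resettable_device_id) AS active_users
FROM
  TABLE_DATE_RANGE([XXXXX.app_events_],
                   TIMESTAMP('2016-11-24'), TIMESTAMP('2016-11-29'))
WHERE
  DATE(USEC_TO_TIMESTAMP(event_dim.timestamp_micros)) = '2016-11-25'
  AND event_dim.name = 'user_engagement'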

Related

Calculate time spent by users on app using Google BigQuery Firebase analytics data

I am trying to calculate the total time spent by users on my app. We have integrated Firebase Analytics data into BigQuery. Can I sum the values of engagement_time_msec in the SELECT clause of my query? This is what I am trying:
SELECT
  SUM(x.value.int_value) AS total_engagement_msec
FROM
  -- `[dataset]` is a placeholder for the exported events table
  `[dataset]`,
  UNNEST(event_params) AS x
WHERE
  x.key = 'engagement_time_msec'
I am getting very big values after executing this query. I am not sure whether it is OK to use SUM on engagement_time_msec to calculate the total time users spend in the app.
Any help would be highly appreciated.
It really depends on what dataset you have. Ideally, you would want login and logout timestamps if you have them; take the time difference between those values, grouping by user, device, load sequence, etc., anything that defines a single event.
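Note that engagement_time_msec is reported in milliseconds, which alone explains very large sums. A minimal Standard SQL sketch (the table name is a placeholder) that totals engagement per user and converts the result to hours:
SELECT
  user_pseudo_id,
  -- engagement_time_msec is milliseconds: /1000 -> seconds, /3600 -> hours
  SUM(p.value.int_value) / 1000 / 3600 AS engagement_hours
FROM
  `project.dataset.events_*` AS t,
  UNNEST(t.event_params) AS p
WHERE
  p.key = 'engagement_time_msec'
GROUP BY
  user_pseudo_id
ORDER BY
  engagement_hours DESC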

Google API shows duplicate rows for transaction IDs

I've got a strange problem.
I'm trying to pull data out of the GA API.
metrics: ga:users
dimensions: ga:date,ga:source,ga:medium,ga:transactionId
After reviewing the data I can see that I have duplicate transaction IDs.
Usually 5 to 7 duplicates per month: the same transaction ID appears on two different dates.
In Google Analytics itself there are no duplicates.
The duplicates are in the exported data, and the Query Explorer shows them as well.
Does anybody know why?
Thanks,
Krzysztof
First of all, are you sure you use unique transaction IDs for each transaction? I've seen cases where the ERP makes certain transaction or order IDs available again after an order was cancelled.
If you look at the transaction ID in GA (click on the ID itself to drill down into it) and change to Quantity, or look at the product revenue for the graph line, do they occur on two different dates?
This behaviour is often seen if you forget to prevent the transaction pixel from firing again on things like a page refresh. Another example is if customers receive an email with "Click here to view your order/transaction" and it fires again on the receipt page.

How to migrate old statistics to Google Analytics?

In our project we have stored all user event data in our database for over a year, but it's not indexed.
Now we are going to use Google Analytics to store our analytics and analyze the reports using the Google Analytics dashboard.
But before we start using Google Analytics, I would like to migrate all the old statistics (about 2 million events) to Google Analytics.
For this I should use the Measurement Protocol, and its limits allow me to transfer 2 million hits with no problem.
But I haven't figured out how to set the time of an event. The Measurement Protocol has Queue Time, but Google says:
Values greater than four hours may lead to hits not being processed.
How is it possible to transfer 2 million events to Google Analytics with their original event times?
Thanks
You are correct that you can use the Measurement Protocol to send event data directly to Google Analytics, and I don't see any problem in sending 2 million events. However, it's not possible to set an event time more than four hours in the past.
Queue Time is used to set the time that the event occurred. As stated, it can't be more than four hours ago, and I have found that even at four hours it's a bit fuzzy whether the data comes out correct. This feature is probably most useful on mobile devices, which may go offline for a short time: you can store the data, then send it all once the device is online again.
So the dates will be the date you sent the event to Google Analytics; you can't backdate the data more than four hours. I am therefore not sure how much use the data will be to you once it is all inserted.
There is no way to do this, but you can make it easier on yourself.
Unfortunately, there is no way to add, remove, or otherwise edit Google Analytics hit data retrospectively, except to delete all of it. You also cannot copy it, move it between accounts, or download it all.
You are not the first to have to come to terms with this.
In this situation, we recommend to our clients that they run their new and old systems in parallel for a testing period (usually 6 months or a year), before switching off one of them.
Yes, it's difficult to let go of old data, but sometimes it has to be done.

How can I query Google Analytics with conditions on TWO different dates?

I wish to extract (via the Analytics Core Reporting API) all the transactions made TODAY by users that had a specific ga:eventCategory a few weeks ago.
I'm looking to see the date of a transaction and all the dates of events that are related to that transaction.
If GA were SQL, I would join on the GA user and pull both the transaction dates and the related event dates as dimensions...
Thanks.
Noam.
As I indicated in my comment, you can segment the data to include only those users who have the specific event. Segmentation works fine with the Core Reporting API.
Your segment definition would look like this:
users::condition::ga:eventCategory==[myEventCategory]
(where obviously the thing in [brackets] is a placeholder that needs to be substituted with the event category name). The "users::" prefix means you are segmenting by user scope (as opposed to sessions), so this will include all sessions in the selected timeframe for users who had the event in at least one of their sessions (even if the event occurred outside the selected timeframe).
Select transactionId as a dimension plus some metric (revenue) and today's date, and you are done. Or you would be done if this were actually going to work, but there are at least two caveats:
Google Analytics does not work in real time, so it's unlikely that TODAY's transactions are fully available (Google says it takes up to 24 hours until the data is processed; it may actually happen faster, but you cannot rely on it).
If a user has deleted his or her cookies, she won't be recognized as a returning user, and GA will be unable to segment her out. The longer the interval between the event and the transaction, the less likely it is that the GA cookie is still present.
So even with a technically correct query it might be that you won't get the data you need.

Last “end date” with data in Analytics

I'm using the Google Analytics Reporting API, and I can't find information about what the last "end date" with data in Analytics is.
For example, let's suppose you want to retrieve the last month's data.
When do you have to perform the query?
The first day of the current month?
...or the second one?
...or maybe the third one?
And just one more question: are the returned data for days in Pacific time?
Google Analytics API is supposed to have access to the same data you have in the interface.
Google says that data can take up to 24 hours to process. The time it takes to really update the data depends on the type and size of the account. Small accounts are updated multiple times a day and can have data available in just a few hours. Once you reach 1M hits a month you are moved to a different mode where the data on your account is updated only once a day. Google Analytics Premium customers get more frequent updates even for large amounts of traffic.
There's no way to tell through the API what exactly the time of the last processed hit is. You can query the data for today by the hour and see for yourself, though.
Usually you don't care and just want to make sure that the data you're querying has been fully processed for that day.
So if you query data for yesterday there's a chance it has not been completely updated; for example, if it's just past midnight, yesterday ended only a couple of minutes ago and its data probably hasn't been completely processed yet. The safest bet in this case is to query data for two days ago.
So if today is 2012-06-15 and you want to get one month of data, a safe approach is to query with start-date=2012-05-13 and end-date=2012-06-13. This will most of the time give you data for days that have been fully processed, but it's not 100% safe either. Google Analytics has had outages in the past where data took longer than that to process; these are not usual, though. When you get the data out, it's really hard to tell just from the API whether the data for those days has been fully processed or not; by querying two days back you just make it more likely that it has.
The days are aggregated following the timezone settings configured on the Google Analytics profile.
