Firebase BigQuery server offset time - firebase

Background:
I'm having the Firebase analytics data exported to BigQuery. And I'm using cron jobs to crunch data in BigQuery for getting insight.
Problem:
To be able to only crunch delta data i.e. the data that has arrived since last time I ran my cron job I need a way to figure out the time when the data arrived at server, since the event_timestamp is generated at client and can be cached at client before sent.
Insights:
I have laborated with event_server_timestamp_offset (offset) which I thought I could use together with event_timestamp. But I was expecting the offset to only be positive but it can also be negative. And when I look at the MAX and MIN for the offset in the entire exported Firebase analytics dataset and re-calculate it to years instead of microseconds I can get more than 18 years offset.
Query:
SELECT
MAX(event_server_timestamp_offset)/(1000000*60*60*24) max_days,
MIN(event_server_timestamp_offset)/(1000000*60*60*24) min_days
FROM
`analytics_<project_id>.events_*`
Result: max_days=6784.485790436655,
min_days=-106.95833052104166
Question:
How can I figure out the server arrival time for my Firebase exported BigQuery data so I can run cron jobs crunching only delta data?
Can I use event_server_timestamp_offset together with event_timestamp? If so, how?
Best regards,
Daniel

Surprisingly enough, this question not having a clear answer for almost 2 years, I am leaving here the answers I got from the Firebase support team. The format is - question asked followed by the answer of the support staff.
Q1. event_date - The date on which the event was logged (YYYYMMDD format in the registered timezone of your app). Does it mean that the event occurred on that date, or that it was actually collected on that date?
A1. Per documentation, event_date refers to the date on which the event is logged/occurred. Note that event_date is based on the Analytics timezone setting of your Firebase Project.
Q2. event_timestamp - The time (in microseconds, UTC) at which the event was logged on the client. Is it safe to assume that this is the exact timestamp the event occurred on client side (in the app timezone of course)?
A2. Yes, this is based on the device timezone setting. However, event_timestamp may be skewed if the device time is incorrect.
Q3. event_server_timestamp_offset - Timestamp offset between collection time and upload time in micros. This is the main field that causes all the misunderstandings - in our BigQuery table for the year 2020 this field takes values in a range between 5 days and -2 days. I mean how can the colleciton time be 2 days ahead?
A3. The event_server_timestamp_offset field in the export schema is the time difference between when the event took place and the app uploaded it to our server. In other words, this is the estimated difference between the client's local time and the actual time, according to our servers. The values of this field are usually positive, but can be negative as well if the device time setting is incorrect.
Q4. One last question is very important - can we ignore the
event_server_timestamp_offset field and just rely on event_timestamp -
as the exact date and time the event occurred on the clientside (not
collected, not uplaoded, etc). If not- please explain how we can get
the exact datetime of the event occuring on the clientside. But if yes
please let me know why do we need the event_server_timestamp_offset field?
A4. Yes, you may actually ignore it and use event_timestamp alone. However, as mentioned earlier, event_timestamp could be off if the device time setting incorrect, but it shouldn't really affect the bigger picture of your analytics data as cases like this are usually one-off.

We use the event_date as the indicator and load the data once a day.

Related

Writing to firestore in firebase cloud-function shifts my dates

Working on flutter/dart project. Calling cloud function from client side and passing 2 dates which are converted toISO8601String.
Printing dates to console while executing cloud-function to double check. They are always valid. After creating document in firestore, the dates are shifted by 1 hour.
I guess because of my current timezone offset which is UTC+1. In summer I had same issue where firestore was shifting my dates by 2 hours because of summer time, resulting in offset UTC+2.
My cloud-functions are deployed at region: europe-west3, which according to firebase docs is Frankfurt. Same time zone as mine (Central European Standard Time). But when I create:
const now = new Date();
Result is one hour less than my local time. Why is that when we are in the same time zone with same offset?
Reading documents from firestore on client side results in inaccurate dates, which is pretty bad for me since I need it to be accurate.
I was searching everywhere but didn't find anything that really helped me. I was trying to construct the date object with many ways but the result was always the same, can somebody please explain me why and help me out? Thanks.
The Java Object Date does not actually contain timezone information, it's just a number of milliseconds since the "epoch" (01.01.1970 00:00:00 UTC) and thus an absolute point in time. It's converted into a valid time every time it's output to a user.
So it seems that the reason for the time shift is that your PC or the browser you are viewing your firestore documents from is set to the wrong timezone.
For further information, take a look at this question or the java documentation.

User Deletion API: userDeletionRequests:upsert :: deletionRequestTime field

Can someone give little clarification how to interpret following parameter:
deletionRequestTime: datetime`:
This marks the point in time up to which all user data for the specified end user and Google Analytics property or Firebase project should be deleted.`
If I set it to 1st Jan2018 (GTM), does it delete all user data:
from that date till today (which is how I interpret).. meaning all 2018 data will be gone?
or, (from epoch time) till that date ... meaning all 2016/2017 etc. data is gone and all that remains is 2018 data?
When trying the API > refreshed User Explorer report in GA interface > I notice all-time data seems is gone (giving me impression that this filed is not respected?). But let me wait 72hrs since API request to draw any conclusion..
Thanks for any clarification.
Cheers!
First off i dont think your miss-interperting it I dont think the documentation is clear.
The following is from userDeletionRequest
deletionRequestTime datetime
This marks the point in time up to which all user data for the specified end user and Google Analytics property or Firebase project should be deleted.
Now to me that means that its a point in time that the data should be deleted. as in one day? one minute a time stamp? would this then mean you will need to loop though every hour minute in a day to delete everything.
My current answer is this is confusing. I am going to contact the team for clarification they are in West coast USA we wont get an answer back for several hours. I will updated this when i know more.
Clarification from Google
As per documentation, deletionRequestTime represents a timestamp up to
which all use data will be deleted. In other words, all data from the
beginning of time until the point returned in deletionRequestTime will
be deleted.
I don't believe you can set the deletionRequestTime field. It is set to the time you make the call.
I believe that is why you are seeing this behavior.

Discrepancies on "active users metric" between Firebase Analytics dashboard and BigQuery export

According to Firebase Analytics docs (https://support.google.com/firebase/answer/6317517#active-users), the active number of users is the number of unique users who initiated sessions on a given day. Also according to the docs, every time a session is started an event with session_start name is sent. I am trying to get that metric using BigQuery's export, but my query is giving me different results (15636 on BigQuery, 14908 on FB analytics)
I have also tried converting to different timezones to see if that might be the issue, but no matter which timezone I try I never get the same (or similar) results
Which query should I run to get the same results I get on Firebase Analytics dashboard for active users?
My query is
SELECT EXACT_COUNT_DISTINCT(user_dim.app_info.app_instance_id)
FROM table_date_range([XXXXX.app_events_], timestamp('2016-11-26'), timestamp('2016-11-29'))
WHERE DATE(event_dim.timestamp_micros) = '2016-11-27'
AND event_dim.name ='session_start'
Thanks
Update
After #djabi's answer I changed my query to use user_engagement rather than session_start and it works much better now. Still some minor differences though (they range from under ten to under 50 out of 16K, depending on the date).
I have tried once again using different timezones by playing around with DATE(date_add(event_dim.timestamp_micros,1,'hour')) but I never got the exact number I get on Firebase Analytics dashboard.
The new numbers are good enough to be considered statistically acceptable, but wondering if anyone has a suggestion to improve the query and get exact results?
The current query is:
SELECT
COUNT(*) AS active_users
FROM (
SELECT
COALESCE(user_dim.user_id, user_dim.app_info.app_instance_id) AS user_id
FROM
TABLE_DATE_RANGE([XXXXX.app_events_], TIMESTAMP('2016-11-24'), TIMESTAMP('2016-11-29'))
WHERE
DATE(event_dim.timestamp_micros) = '2016-11-25'
AND event_dim.name ='user_engagement'
GROUP BY
user_id )
Note: At the moment we are not sending user_id, so the COALESCE will always return the app_instance_id, in case anyone was going to suggest that could be the problem
You need to wait for full 3 days for data from offline devices to be uploaded. Your query correctly filter the events based on the event timestamp and you pull data from 3 days but that is only day and half from today and that is enough for all data to be uploaded. Try including 3 days from yesterday.
Also try using user_engagement event instead of session_start. I believe active user count is based on user_engagement and not on session_start events.
Also FB reports take a bit to process so you wight want and check the FB reports the next day.
FB reports are done on the time zone on the account and events are timestamped in UTC so the day in FB reports is different from UTC calendar day. You want to control for that discrepancy as well to get matching numbers.
Sessions are by-default measured after user activity of 10 seconds in the respective app which you can change. Try changing the sessions start time count to the least number possible and then you may arrive at a number closer to what you are expecting.
For Android stats I used:
user_dim.device_info.resettable_device_id
instead of
user_dim.app_info.app_instance_id
and it produced better results.

Is timezone info redundant provided that UTC timestamp is available?

I have a simple mobile app that schedules future events between people at a specified location. These events may be physical or virtual, so the time specified for the event may or may not be in the same timezone as the 'location' of the event. For example, a physical meeting may be scheduled for two people in London at local time 10am on a specified date. Alternatively, a Skype call may be scheduled for two people in different timezones at 4pm (in one person's timezone) on a specified date though the 'location' of the event is simply 'office' which means two different places in different timezones.
I wonder the following design is going to work for this application:
On the client, it asks user to input the local date and time and specify the timezone local to the event.
On the server, it converts the local date and time with the provided timezone into UTC timestamp, and store this timestamp only.
When a client retrieves these details, it receives the UTC timestamp only and converts it into local time in the same timezone as the client's current timezone. The client's current timezone is determined by the current system timezone setting, which I think is automatically adjusted based on the client's location (of course, assuming the client is connected to a mobile network).
My main motivations for this design are:
UTC is an absolute and universal time standard, and you can convert to/from it from/to any timezone.
Users only care about the local date and time in the timezone they are currently in.
Is this a feasible design? If not, what specific scenarios would break the application or severely affect user experience? Critiques welcome.
For a single event, knowing the UTC instant on which it occurs is usually enough, so long as you have the right UTC instant (see later).
For repeated events, you need to know the time zone in which it repeats... not just the UTC offset, but the actual time zone. For example, if I schedule a weekly meeting at 5pm in Europe/London with colleagues in America/Los_Angeles, then for most of the year it will occur at 9am for them... but for a couple of weeks in the year it will occur at 8am and for a couple of weeks in the year it will occur at 10am, due to differences in when DST is observed.
Even for a single event, you might want to consider what happens if time zone rules change. Suppose I schedule a meeting for 4pm on March 20th 2018, in the Europe/London time zone. Currently that will occur with a UTC offset of 0... but suppose between now and the meeting, the time zone rules change to bring British Summer Time in one hour earlier. If I've written it in my diary as 4pm, I probably don't want the software to think that it's actually at 5pm because that's the UTC instant we originally predicted.
We don't know your exact application requirements, but the above situations at least provide an argument for potentially storing the local time and time zone instead of the UTC instant... but you'll also need to work out what to do if the local time ends up being skipped or being ambiguous due to DST changes. (When the clocks fall back, some local times occur twice. When the clocks skip forward, some local times are skipped. A time that was unambiguous may become invalid or ambiguous if the rules change between the original planning time and the actual event. You should probably account for this in your design.)
To keep it simple, my answers are:
Timezone info is redundant if you want to define a single moment. A
UTC/Unix timestamp completely defines a moment.
Your design seems feasible but on point 2: i would convert to the UTC/Unix timestamp on the client-side and already give this timestamp
in its final form to the server. Reason: the client-side already has the info necessary to convert (see this time-keeping
client-server-db
architecture
example - it works based exactly on the principles you describe).
One possible problem (as described by Jon Skeet in his answer) are recurring events, but this should be reflected in the way you model
time. The difference between recurring events and fixed events is
that the latter completely define a moment (like a UTC/Unix
timestamp) while the first are only a 'function' which can be applied
to the current time to get the next trigger time of the recurring
event. But this might entirely be a different problem than what
you ask - in any case, somehow distinguishing between recurring
events (if you need them) and fixed events in your model is a good
idea.
One decision to make is: PULL or PUSH? Or both? Do you want the server to be able to send emails for example, when an event comes to
pass? Or do you want client-side alerts only when your client-side
app is running? The answers to these questions will help you come
towards a design suitable for you.

Last “end date” with data in Analytics

I'm using "Reporting google Analitics API" and I can’t find information about what the last “end date” with data in Analytics is.
For example, let's suppose you want to retrive the last month’s data.
When do you have to perform the query?
The first day of the current month?
...or the second one?
...or maybe the third one?
And only another question: are the returned data for days in pacific time?
Google Analytics API is supposed to have access to the same data you have in the interface.
Google says that data can take up to 24h to process. The time it takes to really update the data depends on the type and size of the account. Small accounts are updated multiple times a day and can have data available in just a few hours. Once you reach 1M hits a month you are moved to a different mode where the data on your account is updated only once a day. Google Analytics Premium customers have updates more often even for large ammounts of traffic.
There's no way to tell through the API what is exactly the time of the last hit processed. You can query the data for today by the hour and see for yourself though.
Usually you don't care and just want to make sure that the data you're querying has been fully processed for that day.
So if you query data for yesterday there's a chance it has not being completely updated, for example if it's midnight the data for yesterday is just a couple minutes ago and probably haven't been completely processed yet. The safest bet in this case is to query data for 2 days ago.
So if today is 2012-06-15 and you want to get 1 month of data a safe approach is to query data with start-date=2012-05-13 and end-date=2012-06-13. This will most of the time give you data for days that have been fully processed, but it's not 100% safe as well. Google Analytics have had outages in the past where data took longer than that to process, these are not usual though. When you get the data out it's really hard to tell just for the API if the data for those days have been fully processed or not, using the 2 days ago isea you just make it more likely that it is.
The days are aggregate following your timezone settings configured on the Google Analytics profile.

Resources