Can I reliably query the Firebase intraday tables in BigQuery and get 100% of the event data?

I have two Firebase projects (one iOS and one Android) feeding into BigQuery. I need to combine, flatten, and aggregate some specific data from both projects into one combined table so that I can report off of it without querying all bazillion rows across all the daily tables.
To populate this aggregate table, I currently have two Python scripts querying the iOS and Android intraday tables every 5 minutes. Each script gets the max timestamp from the aggregate table, then queries the intraday table for any records with a greater timestamp (I track the max timestamp separately for iOS and Android because they frequently differ).
I am querying the intraday table with this (abbreviated) wildcard syntax:
SELECT yadda, yadda, timestamp_micros, 'ios' AS platform
FROM `myproject.iOSapp.app_events_intraday*`
WHERE timestamp_micros > (SELECT MAX(timestamp_micros)
                          FROM myAggregateTable
                          WHERE platform = 'ios')
Is there any danger that when the intraday table flips over to the new day, I will miss any records when my script runs at 23:57 and then again at 00:02?

I thought I would post the results of testing this for a few months. Here are the basic mechanics as I see them:
1. A new DAY1 intraday table is created at midnight GMT (xyz.app_events_intraday_20180101).
2. A new DAY2 intraday table is created 24 hours later (xyz.app_events_intraday_20180102), but the DAY1 intraday table sticks around for a few hours.
3. Eventually, the DAY1 table is "renamed" to xyz.app_events_20180101 and you are left with a single (current) intraday table.
My tests have shown that additional data is added to the app_events_* tables even after step 3 has taken place, so it is NOT safe to assume that the data is stable/static once the name has changed. I have seen new data appear up to 2 or 3 days later.
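Given those mechanics, one way to cope with the late-arriving rows is to stop relying on a strict max-timestamp cutoff and instead re-pull a short trailing window each run, scanning both the daily and the intraday tables (the app_events_* wildcard matches both name patterns), and then replace that window in the aggregate table. Below is a minimal sketch using the google-cloud-bigquery Python client; the project/dataset names, the 3-day window, and the abbreviated column list are placeholders in the spirit of the question, not a tested recipe.

from google.cloud import bigquery

client = bigquery.Client()

# Pull the last few days from both the finalized daily tables and every
# intraday table; late rows added to a renamed daily table are still picked up.
sql = """
SELECT timestamp_micros, 'ios' AS platform
FROM `myproject.iOSapp.app_events_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
   OR STARTS_WITH(_TABLE_SUFFIX, 'intraday_')
"""
rows = client.query(sql).result()  # then rebuild this date window in the aggregate table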

Related

DynamoDB - Extract date and Query

I have the following table in my DynamoDB.
I want to get/extract all the data using the following conditions or filters:
This month's data: the set of records from the 1st of this month to today. (I think I can achieve this using the BEGINS_WITH filter, but I am not sure whether that is the correct approach.)
This quarter's data: the set of records belonging to this quarter, i.e. from 1st April 2021 to 30th June 2021.
This year's data: the set of records belonging to this entire year.
Question: How can I filter/query the data using the date column from the above table to get these three types (month, quarter, year) of data?
Other Details
Table Size : 25 GB
Item Count : 4,081,678
It looks like you have time-based access patterns (e.g. fetch by month, quarter, year, etc).
Because your sort key starts with a date, you can implement your access patterns using the between condition on your sort key. For example (in pseudo code):
Fetch User 1 data for this month
query where user_id = 1 and date between 2021-06-01 and 2021-06-30
Fetch User 1 data for this quarter
query where user_id = 1 and date between 2021-04-01 and 2021-06-30
Fetch User 1 data for this year
query where user_id = 1 and date between 2021-01-01 and 2021-12-31
If you need to fetch across all users, you could use the same approach using the scan operation. While scan is commonly considered wasteful/inefficient, it's a fine approach if you run this type of query infrequently.
However, if this is a common access pattern, you might want to consider re-organizing your data to make this operation more efficient.
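To make the pseudo code above concrete, here is a hedged boto3 sketch of the sort-key between approach; the table name (user-table) and attribute names (user_id, date) are taken from the pseudo code rather than your real schema, and pagination via LastEvaluatedKey is left out for brevity.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-table")

def fetch_range(user_id, start_date, end_date):
    # One partition (user_id), sort key constrained to the date window.
    # If the sort key carries extra data after the date (e.g. "2021-06-30#..."),
    # widen the upper bound accordingly.
    response = table.query(
        KeyConditionExpression=Key("user_id").eq(user_id)
        & Key("date").between(start_date, end_date)
    )
    return response["Items"]

month_items = fetch_range(1, "2021-06-01", "2021-06-30")     # this month
quarter_items = fetch_range(1, "2021-04-01", "2021-06-30")   # this quarter
year_items = fetch_range(1, "2021-01-01", "2021-12-31")      # this year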
As mentioned in the answer above by @Seth Geoghegan, the table design above is not ideal; you should think carefully about your Partition Key and Sort Key up front. Still, for people like me who already have this kind of scenario, here are the steps I followed to mitigate the issue:
Enabled DynamoDB Streams.
Re-triggered the data so it would pass through the DDB Streams (I added one additional column, updated_dttm, to all of my records using a script).
Processed the stream records: in my case I broke the date column above down into three more columns, event_date, category, and sub_category, and wrote them back to the original record using a Lambda (a rough sketch of that Lambda follows after these steps).
After that, I was able to query my data using the event_date column; I can also create an index on event_date to make my queries/searches more effective.
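For illustration, a rough Python/boto3 sketch of that stream-processing Lambda is below; the table name, key schema, and the "#"-delimited layout of the date column are all assumptions, since the original table isn't shown in full, so treat this as the shape of the solution rather than the exact code.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            continue
        image = record["dynamodb"]["NewImage"]
        if "event_date" in image:
            # Already enriched; skip so our own update does not re-trigger forever.
            continue
        user_id = image["user_id"]["S"]
        date_attr = image["date"]["S"]          # e.g. "2021-06-15#orders#online" (assumed layout)
        event_date, category, sub_category = date_attr.split("#", 2)
        table.update_item(
            Key={"user_id": user_id, "date": date_attr},
            UpdateExpression="SET event_date = :d, category = :c, sub_category = :s",
            ExpressionAttributeValues={":d": event_date, ":c": category, ":s": sub_category},
        )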
Points to Consider
Cost for updating records so that they can go to DDB Streams
Cost of reprocessing the records
Cost for updating records back to DDB

To query Last 7 days data in DynamoDB

I have my DynamoDB table as follows:
HashKey (Date), RangeKey (timestamp)
The DB stores the data of each day (hash key) and timestamp (range key).
Now I want to query the data of the last 7 days.
Can I do this in one query, or do I need to call DynamoDB 7 times, once for each day? The order of the data does not matter, so can someone suggest an efficient query to do that?
I think you have a few options here.
BatchGetItem - The BatchGetItem operation returns the attributes of one or more items from one or more tables. You identify requested items by primary key. You could specify all 7 primary keys and fire off a single request.
7 calls to DynamoDB. Not ideal, but it'd get the job done.
Introduce a global secondary index that projects your data into the shape your application needs. For example, you could introduce an attribute that represents an entire week by using a truncated timestamp:
2021-02-08 (represents the week of 02/08/21T00:00:00 - 02/14/21T23:59:59)
2021-02-15 (represents the week of 02/15/21T00:00:00 - 02/21/21T23:59:59)
I call this a "truncated timestamp" because I am effectively ignoring the HH:MM:SS portion of the timestamp. When you create a new item in DDB, you could introduce a truncated timestamp that represents the week it was inserted. Therefore, all items inserted in the same week will show up in the same item collection in your GSI.
Depending on the volume of data you're dealing with, you might also consider separate tables to segregate ranges of data. AWS has an article describing this pattern.
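As a sketch of the truncated-timestamp idea (table, index, and attribute names here are illustrative, not from the question): stamp each item with the Monday of the week it was written, back it with a GSI partitioned on that attribute, and a whole week becomes at most one or two queries.

import datetime
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("events")

def week_of(ts: datetime.datetime) -> str:
    # Truncate to the Monday of the week, e.g. 2021-02-10 -> "2021-02-08".
    monday = ts.date() - datetime.timedelta(days=ts.weekday())
    return monday.isoformat()

def put_event(ts: datetime.datetime, payload: dict):
    table.put_item(Item={
        "Date": ts.date().isoformat(),      # existing hash key (per day)
        "timestamp": ts.isoformat(),        # existing range key
        "week": week_of(ts),                # new GSI partition key
        **payload,
    })

# Last 7 days = the current week bucket plus (possibly) the previous one,
# queried on the assumed GSI "week-index" instead of seven per-day partitions.
items = table.query(
    IndexName="week-index",
    KeyConditionExpression=Key("week").eq("2021-02-08"),
)["Items"]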

BigQuery Firebase export behaves differently when using the wildcard character and _TABLE_SUFFIX compared to without it

My Requirement:
To append unnested data in a separate table and use it for visualization and analytics
Implementing it:
As I am not sure at exactly what time events_intraday_YYYYMMDD syncs into events_YYYYMMDD (for reference, check here), my approach is:
0 - Created an events_normalized table once at the start (this is done once, not daily) by using
CREATE TABLE analytics_data_export.events_normalized AS
SELECT .....
FROM
  `analytics_xxxxxx.events_*`
to collect all the data from the events_YYYYMMDD tables.
1 - Creating/replacing a daily temp table with
CREATE OR REPLACE TABLE analytics_data_export.daily_data_temp AS
SELECT...
FROM
  `analytics_xxxxxx.events_*`
WHERE
  _TABLE_SUFFIX BETWEEN
    FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 4 DAY)) AND
    FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
Since I have seen multiple days' data syncing together, to be on the safe side I am using the last 1-4 days of data.
2 - Deleting from events_normalized the inner join of the two tables (daily_data_temp, events_normalized) to remove any duplicates it might have. For example, if events_normalized has data up to the 18th but daily_data_temp has data from the 16th-19th, then the overlapping rows (16th-18th) are removed from events_normalized.
3 - Reinserting daily_data_temp into events_normalized. (A sketch of steps 2-3 follows below.)
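For reference, a minimal sketch of steps 2-3 with the google-cloud-bigquery Python client is below; it assumes both tables share the same flattened schema and that the standard Firebase export column event_date survived the unnesting.

from google.cloud import bigquery

client = bigquery.Client()

# Step 2: drop the days that are about to be refreshed from the aggregate table.
client.query("""
DELETE FROM `analytics_data_export.events_normalized`
WHERE event_date IN (
  SELECT DISTINCT event_date
  FROM `analytics_data_export.daily_data_temp`)
""").result()

# Step 3: re-insert the refreshed 1-4 day window.
client.query("""
INSERT INTO `analytics_data_export.events_normalized`
SELECT * FROM `analytics_data_export.daily_data_temp`
""").result()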
Questions:
1 - Is there a more optimized way of implementing this requirement?
2 - In step 0, while creating the events_normalized table, if I use:
WHERE
_TABLE_SUFFIX <=
FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 0 DAY))
I get different results compared to when I use
CREATE TABLE analytics_data_export.events_normalized AS
SELECT .....
FROM
  `analytics_xxxxxx.events_*`
The difference is that the latter includes the current date's data as well, whereas in events_YYYYMMDD I can only see data up to yesterday. I don't understand this behavior.
For example, if the current day is 20th July, in events_YYYYMMDD I can only see tables up to events_20200719.
To optimize, you can follow the steps below:
Create a hash out of event_time_stamp and other unique fields, and use it to filter the data.
Instead of deleting duplicate rows from the larger initial table, delete them from the small temp table and then insert that table.
The difference you are seeing is because the filter analytics_xxxxxx.events_* matches both the per-day events tables and the intraday events tables, which are named like events_intraday_20200721.
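If events_normalized is only meant to hold finalized daily exports, you can make that intent explicit by keeping only suffixes that look like dates: daily tables have a purely numeric _TABLE_SUFFIX (YYYYMMDD), while intraday tables show up with a suffix like intraday_20200721. A hedged check, reusing the placeholder dataset name from the question:

from google.cloud import bigquery

client = bigquery.Client()

# Count rows and the newest day covered when only daily export tables are scanned.
sql = r"""
SELECT MAX(event_date) AS latest_day, COUNT(*) AS row_count
FROM `analytics_xxxxxx.events_*`
WHERE REGEXP_CONTAINS(_TABLE_SUFFIX, r'^\d{8}$')
"""
print(list(client.query(sql).result()))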

How To: Update table with new data but exclude the last second and include it in the next update

I am using Azure Data Explorer and I have a table into which my data is ingested ("A"). Now I want to run an analysis on that data and save it in another table ("B").
I am using an update function in which the analysis is done ("Update_function") and have created a table "B" in which the analysis results should be stored.
These are the policy settings:
.alter table B policy update
#'[{"IsEnabled": true, "Source": "A", "Query": "Update_function()", "IsTransactional": false, "PropagateIngestionProperties": false}]'
.alter table B policy ingestionbatching #'{ "MaximumBatchingTimeSpan": "00:10:00", "MaximumNumberOfItems": 100000, "MaximumRawDataSizeMB": 1024}'
So the function runs every 10 minutes, does the analysis on the data that has been ingested into A since the last update of B, and stores the results in B.
Everything works fine so far.
The problem is:
My data arrives at a frequency of around 5 records per second.
In the analysis I use summarize xxx by bin(timestamp, 1s), so I have one result for every second. Because the update sometimes happens in the middle of a second, I get duplicate results for that second in table B.
For example:
1. Date 02:00:00.967
2. Date 02:00:01.234
3. Date 02:00:01.456
4. Date 02:00:01.754
5. Date 02:00:01.897
6. Date 02:00:02.190
If the update function runs before 5., then I will have in B a row with Date 02:00:01 summarizing 2., 3. and 4., and the next run of the update function will add a row summarizing 5.
I want the update function to only use the data that has been newly ingested into table A AND whose timestamp is not in the last second.
So in the example it would take 1. but leave 2., 3. and 4. for the next function run,
so that in the next update this data can be included and the second can be analysed as one.
It's not possible to just fuse the data later in B, as it is important that the analysis runs on all the data of one second.
I hope someone can help me out and that there is a way to do this.
Greetings,
Katharina
There is no way to do this in Azure Data Explorer today, other than to build your own orchestration which periodically queries the table (with some introduced latency, to make sure it's not querying the last second), and uses set-or-append commands to store the aggregated data in the target table. There's a new feature we are working on, called materialized views, which will support this scenario, but it's not publicly available yet. I suggest checking back in a couple of months.
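For what it's worth, the self-built orchestration described above could look roughly like the sketch below, using the azure-kusto-data Python SDK: each run appends to B the aggregation of every complete second that is newer than what B already holds, staying a few seconds behind now() so the most recent second is never split. The cluster URL, database name, the watermark logic, and the summarize expression are assumptions layered on top of the question's schema, not a tested recipe.

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westeurope.kusto.windows.net")
client = KustoClient(kcsb)

# Append only whole seconds that are newer than B's latest stored second and
# older than a small safety cutoff behind "now".
command = """
.set-or-append B <|
    let watermark = coalesce(toscalar(B | summarize max(timestamp)), datetime(2000-01-01));
    let cutoff = bin(now() - 5s, 1s);
    A
    | where bin(timestamp, 1s) > watermark and bin(timestamp, 1s) < cutoff
    | summarize xxx = count() by bin(timestamp, 1s)
"""
client.execute_mgmt("mydatabase", command)

Scheduling this every few minutes (for example from an Azure Function or a cron job) takes the place of the update policy on table B.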

Difference in statistics from Google Analytics Report and BigQuery Data in Hive table

I have a Google Analytics premium account set up to monitor the user activity of a website and mobile application.
Raw data from GA is being stored in BigQuery tables.
However, I noticed that the statistics that I see in a GA report are quite different from the statistics that I see when querying the BigQuery tables.
I understand that GA reports show aggregated and possibly sampled data, and that the raw data in the BigQuery tables is session/hit-level data.
But I am still not sure if I understand the reason why the statistics could be different.
Would really appreciate it if someone clarified this for me.
Thanks in advance.
UPDATE 1:
I exported the raw data from BigQuery into my Hadoop cluster. The data is stored in a Hive table. I flattened all the nested and repeated fields before exporting.
Here is the hive query that I ran on the raw data in the Hive table:
SELECT
date as VisitDate,
count(distinct fullvisitorid) as CountVisitors,
SUM(totals_visits) as SumVisits,
SUM(totals_pageviews) AS PVs
FROM
bigquerydata
WHERE
fullvisitorid IS NOT NULL
GROUP BY
date
ORDER BY
VisitDate DESC
A) Taking February 9th as the VisitDate, I get the following results from this query:
i) CountVisitors= 1,074,323
ii) SumVisits= 48,990,198
iii) PVs= 1,122,841,424
Vs
B) Taking the same VisitDate and obtaining the same statistics from the GA report:
i) Users count = 1,549,757
ii) Number of pageviews = 11,604,449 (Huge difference when compared to A(iii))
In the Hive query above, am I using the wrong fields or processing the fields in the wrong way? I am just trying to figure out why I have this difference in numbers.
UPDATE 2 (following @Felipe Hoffa's suggestion):
This is how I am flattening the tables in my Python code before exporting the result to GCS and then to Hadoop cluster:
queryString = 'SELECT * FROM flatten(flatten(flatten(flatten(flatten(flatten([' + TABLE_NAME + '],hits),hits.product),hits.promotion),hits.customVariables), hits.customDimensions), hits.customMetrics)'
I understand what you are saying about flattening causing repeated pageviews, with each repetition getting into the final, wrong sum.
I tried the same query (from Update 1) on the BigQuery table instead of my Hive table, and the numbers matched those on the Google Analytics dashboard.
However, assuming that the Hive table is all I have and that it has those repeated fields due to flattening: is there still any way I can fix my Hive query to match the stats from the Google Analytics dashboard?
Logically speaking, if the repeated fields came about due to flattening, can't I reverse that in my Hive table? If you think I can, do you have any suggestion on how to proceed?
Thank you so much in advance!
Can you run the same query in BigQuery, instead of on the data exported to Hive?
My guess: "The data is stored in a hive table. I flattened all the nested and repeated fields before exporting." When flattening - are you repeating pageviews several times, with each repetition getting into the final wrong addition?
Note how data can get duplicated when flattening rows:
SELECT col, x FROM (
SELECT "wrong" col, SUM(totals.pageviews) x
FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
), (
SELECT "correct" col, SUM(totals.pageviews) x
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
col x
wrong 2262
correct 249
Update given "update 2" to the question:
Since BigQuery is working correctly, and this is a Hive problem, you should add that tag to get relevant answers.
Nevertheless, this is how I would correctly de-duplicate previously duplicated rows with BigQuery:
SELECT SUM(pv)
FROM (
SELECT visitId, MAX(totals.pageviews) pv
FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
GROUP EACH BY 1
)
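To the follow-up about fixing it on the Hive side: the same de-duplication idea should carry over, i.e. collapse the flattened rows back to one row per visit (taking MAX of the totals_* columns, which are merely repeated once per hit) before aggregating per day. A hedged sketch follows, run through PyHive only to keep the example in Python; it assumes the flattened table kept a visit identifier (visitid here) next to fullvisitorid, and otherwise reuses the column names from the query above.

from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, database="default")
cursor = conn.cursor()

# Inner query: one row per (date, visitor, visit) with the repeated totals
# collapsed via MAX; outer query: the original per-day aggregation.
cursor.execute("""
SELECT
  VisitDate,
  COUNT(DISTINCT fullvisitorid) AS CountVisitors,
  SUM(visits)                   AS SumVisits,
  SUM(pv)                       AS PVs
FROM (
  SELECT
    date                  AS VisitDate,
    fullvisitorid,
    visitid,
    MAX(totals_visits)    AS visits,
    MAX(totals_pageviews) AS pv
  FROM bigquerydata
  WHERE fullvisitorid IS NOT NULL
  GROUP BY date, fullvisitorid, visitid
) per_visit
GROUP BY VisitDate
ORDER BY VisitDate DESC
""")
for row in cursor.fetchall():
    print(row)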
