I want to use the intraday tables; per the docs they are overwritten approximately 3 times a day.
What I want to ask is whether each overwrite contains only the new data since the previous import, or all of the data for the day up to that point.
Example: the intraday table for today is created at 8 AM UTC.
Assume id is unique.
It has data for ids 1, 2, 3.
When it is overwritten, let's say at 16:00 UTC, and new ids 4 and 5 have come in by then,
would it have data 1, 2, 3, 4, 5 or just 3, 4, 5?
BigQuery docs for columns
Would the fullVisitorId, hitnumber and time combination be unique across all rows?
Yes, the combination of fullVisitorId, hitnumber and time will be unique across all rows.
What are intraday (or realtime) tables?
Intraday tables represent Google Analytics data for the current day. They are appended to three times a day with data lagging about two hours and are replaced with a new table when the daily sessions table is ingested to BigQuery. Alternatively, realtime tables are appended to approximately every 15 minutes. Both tables allow reporting on the current day’s analytics data.
How can we use intraday tables?
Because intraday tables are only overwritten when a new daily table is ingested into BigQuery, they continue to be appended to while still holding the data for yesterday's sessions. We can work around yesterday's missing daily data by incorporating logic into our data processing and reporting workflow: that logic runs the processing queries against the intraday tables whenever the daily table is not yet available.
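As an illustration only, a minimal sketch of that fallback using the Python BigQuery client (the dataset name and date are hypothetical):

```python
from google.api_core.exceptions import NotFound
from google.cloud import bigquery

# Hypothetical project/dataset; substitute your own GA export dataset.
client = bigquery.Client()
DATASET = "my-project.my_ga_dataset"

def sessions_table_for(date_str: str) -> str:
    """Return the daily table if it has been ingested, otherwise fall back to intraday."""
    daily = f"{DATASET}.ga_sessions_{date_str}"
    try:
        client.get_table(daily)  # raises NotFound if the daily table is not there yet
        return daily
    except NotFound:
        return f"{DATASET}.ga_sessions_intraday_{date_str}"

table = sessions_table_for("20240114")  # e.g. yesterday's date
rows = client.query(f"SELECT COUNT(*) AS sessions FROM `{table}`").result()
print(list(rows))
```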
Related
I have a mobile application that is connected to a BigQuery warehouse through the Firebase export. To keep my dashboards up to date, I run incremental (dbt) jobs several times a day to extract data from the tables BigQuery creates that contain the imported Firebase data (see this article).
For real-time data streaming, a table with the suffix "_intraday" is created. Once that day is over, the data is moved over to the table which only contains full days and the intraday table is deleted.
It looks like when this happens (moving from intraday to full day), the event_timestamp (UNIX) is slightly changed, by a few milliseconds, for each entry. The problem: I defined the combination of user_id and event_timestamp as the unique key. Because of this, the first job that deals with the moved table identifies every row as a new, unique row, exactly doubling my resulting data.
Has anyone seen this issue and knows whether it's expected? Do you know of any solution other than implementing an event ID on the client, giving each event a unique identifier (through custom event params) and using that instead of user_id + timestamp?
Thank you.
We are trying to log some events from our application. There are two types of events:
Client-side events: logged by an Android application using the Firebase SDK, which are saved to BigQuery by Firebase
Server-side events: we log them using BigQuery's Go client
Now, Firebase stores the current day's events in the events_intraday_$date table and then flushes that table into the partitioned events_$date table.
So I also logged the current day's events into the events_intraday_$date table.
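For illustration, that write looks roughly like the sketch below; it uses the Python client rather than the Go client we actually use, and the dataset and field names are made up:

```python
from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()
today = datetime.now(timezone.utc).strftime("%Y%m%d")
table_id = f"my-project.analytics_123456.events_intraday_{today}"  # hypothetical dataset

rows = [{
    "event_name": "server_side_purchase",
    "event_timestamp": int(datetime.now(timezone.utc).timestamp() * 1_000_000),
    "user_id": "abc123",
}]

errors = client.insert_rows_json(table_id, rows)  # streaming insert into the intraday table
if errors:
    print("insert failed:", errors)
```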
The events got logged successfully into the table, but they were deleted the next day when the events_intraday_$date table was flushed into the events_$date table.
I'm not able to understand how that is happening.
Looks like this is intended behaviour:
Within each dataset, a table is imported for each day of export. Daily tables have the format "ga_sessions_YYYYMMDD".
Intraday data is imported approximately three times a day. Intraday tables have the format "ga_sessions_intraday_YYYYMMDD". During the same day, each import of intraday data overwrites the previous import in the same table.
When the daily import is complete, the intraday table from the previous day is deleted. For the current day, until the first intraday import, there is no intraday table. If an intraday-table write fails, then the previous day's intraday table is preserved.
Data for the current day is not final until the daily import is complete. You may notice differences between intraday and daily data based on active user sessions that cross the time boundary of last intraday import.
I am wondering if anyone knows a good way to store time series data of different time resolutions in DynamoDB.
For example, I have devices that send data to DynamoDB every 30 seconds. The individual readings are stored in a Table with the unique device ID as the Hash Key and a timestamp as the Range Key.
I want to aggregate this data over various time steps (30 min, 1 hr, 1 day, etc.) using a Lambda and store the aggregates in DynamoDB as well. I then want to be able to grab data at any resolution for any particular range of time: 48 thirty-minute aggregates for the last 24 hours, for instance, or each daily aggregate for this month last year.
I am unsure whether each resolution should have its own table (data_30min, data_1hr, etc.) or whether a better approach would be something like making a composite Hash Key by combining the resolution with the device ID and storing all aggregate data in a single table.
For instance, if the device ID is abc123, all 30-minute data could be stored with the Hash Key abc123_30m, the 1-hour data could be stored with the Hash Key abc123_1h, and each would still use a timestamp as the Range Key.
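For illustration, the single-table variant might look roughly like this (boto3; the table and attribute names are made up):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("device_aggregates")  # hypothetical single table for all resolutions

# Write one 30-minute aggregate for device abc123.
table.put_item(Item={
    "pk": "abc123_30m",   # device ID + resolution as the Hash Key
    "ts": 1700001800,     # bucket start time (epoch seconds) as the Range Key
    "avg_value": 42,
    "sample_count": 60,
})

# Read the last 24 hours of 30-minute aggregates for that device.
now = 1700088200
resp = table.query(
    KeyConditionExpression=Key("pk").eq("abc123_30m") & Key("ts").between(now - 86400, now)
)
print(resp["Items"])
```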
What are some pros and cons of each of these approaches, and is there a solution I am not thinking of that would be useful in this situation?
Thanks in advance.
I'm not sure if you've seen this page from the tech docs regarding Best Practices for storing time series data in DynamoDB. It talks about splitting your data into time periods such that you only have one "hot" table where you're writing and many "cold" tables that you only read from.
Regarding the primary/sort key selection, you should probably use a coarse timestamp value as the partition key and the actual timestamp as the sort key. Alternatively, if your periods are coarse enough, or each device only produces a relatively small amount of data, then your idea of using the device ID as the hash key could work as well.
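Sketched out with made-up names, that key design could look something like this (a day bucket as the partition key, the exact reading timestamp as the sort key):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
readings = dynamodb.Table("readings_2024_01")  # hypothetical "hot" table for the current period

# Partition key = coarse time bucket (the day), sort key = exact reading timestamp.
readings.put_item(Item={
    "day": "2024-01-15",
    "ts": 1705312200000,  # reading timestamp in ms
    "device_id": "abc123",
    "value": 42,
})

# Everything recorded on that day within a given window.
resp = readings.query(
    KeyConditionExpression=Key("day").eq("2024-01-15") & Key("ts").between(1705310000000, 1705320000000)
)
```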
Generating pre-aggregates and storing them in DynamoDB would certainly work, though you should definitely consider having separate tables for the different granularities you want to support. Beware of mutating data: as long as all your data arrives in order and you don't need to recompute old data, storing pre-aggregated time series is fine, but if data can mutate, or if you have to account for out-of-order/late-arriving data, then things get complicated.
You may also consider a relational database for the "hot" data (i.e. the last 7 days, or whatever period makes sense) and then run a batch process to pre-aggregate and move the data into cold, read-only DynamoDB tables, with DAX etc.
I am creating a query over three tables (1: a master table that contains n numbers, 2: Daily Production, 3: Daily Sales).
But the query returns only the data that is available in all of the tables.
Please help me sort out this issue.
Scenario:
Two tables (table1, table2) with this format:
id:short; timestamp:short; price:single
The data in both tables are the same except for the timestamp.
The timestamp is given as unix time in ms.
Question:
What is the minimum time difference between the timestamps of the same record in table1 and table2?
In theory, the minimum time difference would be 0 or even 1 ms. In practice, wouldn't the time difference be a function of what it takes to put the data into the table, along with other external performance factors?
One question I would have about the tables: why is the same record being stored in two places?
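If you wanted to measure the difference rather than reason about it, a minimal sketch (plain Python over made-up rows, matching records on id) would be something like:

```python
# Hypothetical rows from table1 and table2: (id, timestamp_ms, price)
table1 = [(1, 1700000000000, 9.99), (2, 1700000000100, 4.50)]
table2 = [(1, 1700000000007, 9.99), (2, 1700000000112, 4.50)]

ts1 = {row_id: ts for row_id, ts, _ in table1}
ts2 = {row_id: ts for row_id, ts, _ in table2}

# Minimum absolute difference between the two timestamps of the same record.
min_diff = min(abs(ts1[i] - ts2[i]) for i in ts1.keys() & ts2.keys())
print(f"minimum time difference: {min_diff} ms")
```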