How to cluster raw events tables from Firebase Analytics in BQ on the event_name field? - firebase

I would like to cluster the raw events tables that Firebase exports to BQ, but without reprocessing them or creating additional tables (to keep costs at a minimum).
The main idea is to find a way to have the tables clustered when they are created from the intraday table.
I tried creating empty tables with a pre-defined schema (the same as the previous events tables), but partitioned by the _partition_time column (NULL partition) and clustered by the event_name column.
After Firebase inserts all the data from the intraday table, event_name still shows in the table's details tab as the clustering field, but querying the table does not cost any less.
Is there another solution, or a way to make this work?
Thanks in advance.
/edit:
Our table's details tab looks like this:
[image: detail tab of table]
After running this query:
SELECT * FROM `ooooooo.ooooooo_ooooo.events_20181222`
WHERE event_name = 'screen_view'
the result is:
[image: how query processed whole table]
So there is no cost reduction.
But if I manually create the same table clustered by event_name with:
CREATE TABLE `aaaa.aaaa.events_20181222`
PARTITION BY DATE(event_timestamp)
CLUSTER BY event_name
AS
SELECT * FROM `ooooooo.ooooooo_ooooo.events_20181222`
then the same query as in the first image, applied to the newly created table, processes only 5 MB, so clustering really works.

Related

Create table is not having the expected number of rows

We are dealing with machine log data stored in Presto. A view created in Teradata QueryGrid queries Presto and makes the result set available in Databricks. For another internal use case we are trying to create a table in Teradata, but are having difficulty doing so.
When we try to create a table in Teradata for a particular date which has 2610117,459,037,913088 records, only about 14K records get inserted into the target table. The query is below. xyz.view is the view created in TD QueryGrid, which ultimately fetches the data from Presto.
CREATE TABLE abc.test_table AS
( SELECT * FROM xyz.view WHERE event_date = '2020-01-29' )
WITH DATA PRIMARY INDEX (MATERIAL_ID, SERIAL_ID);
But when we create the table with a sample (say SAMPLE 10000000), we get exactly that number of records in the created table:
CREATE TABLE abc.test_table AS
( SELECT * FROM xyz.view WHERE event_date = '2020-01-29' sample 10000000)
WITH DATA PRIMARY INDEX (MATERIAL_ID, SERIAL_ID);
But creating it with a sample of 1 billion records again gets us only about 208 million records in the target table.
Can anyone please help explain why this is happening, and whether it is possible to create the table with all 2610117,459,037,913088 records?
We are using TD 16.

Discrepancies in Bytes Processed from a historical table vs ga_sessions_ historical tables

If I extract full data from all existing ga_sessions_ or firebase tables, the Bytes Processed are 4.5GB.
If I save the previous query into a Destination Table and then I extract full data from this table, the Bytes Processed are 217GB.
Both tables have the same table size. Why this discrepancy?
UPDATE:
My standardSQL query:
SELECT _TABLE_SUFFIX AS Date,
user_dim.app_info.app_instance_id,
user_dim.app_info.app_version,
user_dim.geo_info.city,
user_properties.key,
event.name
FROM `project.dataset.app_events_*`,
UNNEST(user_dim.user_properties) AS user_properties,
UNNEST(event_dim) AS event
returns 4.5GB. If I save this result as a table (called historical_data) and run this query:
SELECT *
FROM `project.dataset.historical_data`
then it returns 217GB.
I think this is plausible because of the double cross join: for each cross-joined row you now have a redundant copy of the fields below
_TABLE_SUFFIX AS Date,
user_dim.app_info.app_instance_id,
user_dim.app_info.app_version,
user_dim.geo_info.city
so even though the original table was 4.5GB in size, the result came to 217GB.
This makes sense to me, and it is something that happens with big data: a result can explode to an enormous size if you are not careful.
And, by the way, compare the number of rows in the original table with the number in the output table.
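To make the row explosion concrete, here is a minimal Python sketch that emulates the two UNNEST cross joins on a single nested record. The record and its values are hypothetical and only loosely shaped like an app_events row; this illustrates the row count, not how BigQuery bills bytes.

```python
from itertools import product

# Hypothetical nested record, loosely shaped like one app_events row.
record = {
    "app_instance_id": "abc123",
    "app_version": "1.0.2",
    "city": "Prague",
    "user_properties": [{"key": "plan"}, {"key": "locale"}, {"key": "cohort"}],
    "event_dim": [
        {"name": "screen_view"},
        {"name": "login"},
        {"name": "purchase"},
        {"name": "logout"},
    ],
}


def flatten(rec):
    """Emulate FROM t, UNNEST(user_properties), UNNEST(event_dim)."""
    return [
        {
            # Scalar fields are copied into every output row.
            "app_instance_id": rec["app_instance_id"],
            "app_version": rec["app_version"],
            "city": rec["city"],
            "key": prop["key"],
            "name": event["name"],
        }
        for prop, event in product(rec["user_properties"], rec["event_dim"])
    ]


rows = flatten(record)
print(len(rows))  # 3 properties x 4 events = 12 rows from 1 source row
```

Each source row with n user properties and m events becomes n x m output rows, and every one of them carries its own copy of the scalar fields.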

Teradata - joining query for single partition

I need to run a query that joins 5 large tables on user_id and filters on proc_date.
I plan to partition on proc_date and to range-partition on user_id (5 range partitions) to improve query performance. I will also keep a primary index on proc_date and user_id.
But how can I run the query for just one user_id partition at a time? I want to restrict the query so that it joins only the first partition (on user_id) of every table.
The reason is that once the query completes for the first partition, I can send the output to the next process. While that process is running, I can run the query for the 2nd partition.
Could anyone please suggest a way to achieve this?

Returning Count of 0 for Record summaries with 0 sub-records

I am using Crystal Reports XI. I am working with a SQL database that was created before I got here, and I can't make changes to the tables or link structure. There are 4 tables in the database that I need for this report.
Table 1 - Companies || Fields: CompanyIDPK, CompanyName, YearActiveIDFK
Table 2 - ActiveYears || Fields: YearActiveIDPK, YearNameIDFK
Table 3 - YearNames || Fields: YearNameIDPK, YearName
Table 4 - CompanyOrders || Fields: OrderIDPK, CompanyIDFK, YearNameIDFK, OrderNumber, OrderCost
I want to create a report that is grouped by Year and by Company. I want each company to show the number of orders within each year, including showing 0 if there were no orders that year.
I can get the report to show all the companies that were in a given year, but as soon as I try to start showing a count, it only shows companies that had at least one order.
Thanks for any help!!!
My guess is that this is happening because Crystal only adds tables to your SQL query after you've added them to the designer. This happens even if you have linked your tables in the database expert.
I'm assuming you have the default join type of INNER JOIN. What's probably happening is as soon as you add your Count Summary on one of the fields in CompanyOrders, Crystal is adding it to your SQL Query.
The reason this causes a problem is because an inner join only returns records if the linked fields are in both tables. If companies haven't placed an order in the last year, they won't have any records in the CompanyOrders table. This means your SQL Query won't return any records for those companies, because those companies need to be in both tables for records to be returned.
The solution for this is to change the join type from INNER JOIN to LEFT OUTER JOIN. This can be accomplished by going into the Database Expert (Menu > Database > Database Expert), clicking the Links tab, double clicking the line that goes from your Companies to your CompanyOrders table, and selecting the Left Outer Join Radio button.
Now all of the Companies will show up, but since some don't have records in the CompanyOrders table, the count for the orders will be 0.
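The difference between the two join types can be checked outside Crystal with a small sketch. It uses an in-memory SQLite database with trimmed-down versions of the Companies and CompanyOrders tables; the company names and order rows are made up for illustration, and Crystal's generated SQL will look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Companies (CompanyIDPK INTEGER PRIMARY KEY, CompanyName TEXT);
    CREATE TABLE CompanyOrders (OrderIDPK INTEGER PRIMARY KEY, CompanyIDFK INTEGER);
    -- Hypothetical data: Acme has two orders, Globex has none.
    INSERT INTO Companies VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO CompanyOrders VALUES (10, 1), (11, 1);
""")

QUERY = """
    SELECT c.CompanyName, COUNT(o.OrderIDPK)
    FROM Companies c
    {join} CompanyOrders o ON o.CompanyIDFK = c.CompanyIDPK
    GROUP BY c.CompanyName
    ORDER BY c.CompanyName
"""

inner_rows = conn.execute(QUERY.format(join="INNER JOIN")).fetchall()
left_rows = conn.execute(QUERY.format(join="LEFT OUTER JOIN")).fetchall()

print(inner_rows)  # [('Acme', 2)]
print(left_rows)   # [('Acme', 2), ('Globex', 0)]
```

Counting a column from the orders table (rather than COUNT(*)) is what yields 0 for companies without orders, because the outer join fills that column with NULL and COUNT skips NULLs.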
Let me know if this was your problem.
ZMcK
You didn't say that you couldn't create database objects, so if it is possible I would create a view or stored procedure in the SQL Server database to return the data you require in the format you want and take Crystal Reports out of the equation in terms of linking tables.

Difference in statistics from Google Analytics Report and BigQuery Data in Hive table

I have a Google Analytics premium account set up to monitor the user activity of a website and mobile application.
Raw data from GA is being stored in BigQuery tables.
However, I noticed that the statistics that I see in a GA report are quite different from the statistics that I see when querying the BigQuery tables.
I understand that GA reports show aggregated data and possibly, sampled data. And that the raw data in Bigquery tables is session/hit-level data.
But I am still not sure if I understand the reason why the statistics could be different.
Would really appreciate it if someone clarified this for me.
Thanks in advance.
UPDATE 1:
I exported the raw data from Bigquery into my Hadoop cluster. The data is stored in a hive table. I flattened all the nested and repeated fields before exporting.
Here is the hive query that I ran on the raw data in the Hive table:
SELECT
date as VisitDate,
count(distinct fullvisitorid) as CountVisitors,
SUM(totals_visits) as SumVisits,
SUM(totals_pageviews) AS PVs
FROM
bigquerydata
WHERE
fullvisitorid IS NOT NULL
GROUP BY
date
ORDER BY
VisitDate DESC
A) Taking February 9th as the VisitDate, I get the following results from this query:
i) CountVisitors= 1,074,323
ii) SumVisits= 48,990,198
iii) PVs= 1,122,841,424
Vs
B) Taking the same VisitDate and obtaining the same statistics from the GA report:
i) Users count = 1,549,757
ii) Number of pageviews = 11,604,449 (Huge difference when compared to A(iii))
In the hive query above, am I using any wrong fields or processing the fields in a wrong way? Just trying to figure out why I have this difference in numbers.
UPDATE 2 (following @Felipe Hoffa's suggestion):
This is how I am flattening the tables in my Python code before exporting the result to GCS and then to Hadoop cluster:
queryString = 'SELECT * FROM flatten(flatten(flatten(flatten(flatten(flatten([' + TABLE_NAME + '],hits),hits.product),hits.promotion),hits.customVariables), hits.customDimensions), hits.customMetrics)'
I understand what you are saying about flattening causing repeated pageviews and each repetition getting into the final wrong addition.
I tried the same query (from Update1) on Bigquery table instead of my Hive table. The numbers matched with those on the Google Analytics Dashboard.
However, assuming that the Hive table is all I have, and that it contains those repeated values because of the flattening, is there still any way I can fix my Hive query so that it matches the stats from the Google Analytics dashboard?
Logically speaking, if the repeated values appeared because of flattening, can't I reverse that in my Hive table? If you think I can, do you have any suggestions on how to proceed?
Thank you so much in advance!
Can you run the same query in BigQuery, instead of on the data exported to Hive?
My guess: "The data is stored in a hive table. I flattened all the nested and repeated fields before exporting." When flattening - are you repeating pageviews several times, with each repetition getting into the final wrong addition?
Note how data can get duplicated when flattening rows:
SELECT col, x FROM (
SELECT "wrong" col, SUM(totals.pageviews) x
FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
), (
SELECT "correct" col, SUM(totals.pageviews) x
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
col      x
wrong    2262
correct  249
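The same effect can be reproduced on toy data. The sketch below uses an in-memory SQLite database with two hypothetical tables, sessions and hits, standing in for the nested ga_sessions structure; joining them plays the role of FLATTEN, and the numbers are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins: one row per session, one row per hit.
    CREATE TABLE sessions (visitId INTEGER PRIMARY KEY, pageviews INTEGER);
    CREATE TABLE hits (visitId INTEGER, hitNumber INTEGER);
    INSERT INTO sessions VALUES (1, 3), (2, 2);
    INSERT INTO hits VALUES (1, 1), (1, 2), (1, 3), (2, 1), (2, 2);
""")

# Session-level total: each session's pageviews counted once.
correct = conn.execute("SELECT SUM(pageviews) FROM sessions").fetchone()[0]

# Flattened total: the session-level value repeats once per hit.
wrong = conn.execute("""
    SELECT SUM(s.pageviews)
    FROM sessions s
    JOIN hits h ON h.visitId = s.visitId
""").fetchone()[0]

print(correct, wrong)  # 5 13
```

Session 1 contributes 3 pageviews three times and session 2 contributes 2 pageviews twice, which is how a session-level total gets inflated after flattening.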
Update given "update 2" to the question:
Since BigQuery is working correctly, and this is a Hive problem, you should add that tag to get relevant answers.
Nevertheless, this is how I would correctly de-duplicate previously duplicated rows with BigQuery:
SELECT SUM(pv)
FROM (
SELECT visitId, MAX(totals.pageviews) pv
FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
GROUP EACH BY 1
)
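The MAX-per-session pattern is plain SQL, so it should carry over to an already flattened table in other engines as well. Below is a minimal sketch on an in-memory SQLite table; the table name flattened, the column totals_pageviews, and the data are assumptions for illustration. Note that in the GA export visitId is only unique per visitor, so on real data grouping by fullVisitorId together with visitId is likely the safer key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical flattened export: one row per hit, session totals repeated.
    CREATE TABLE flattened (
        visitId INTEGER,
        hitNumber INTEGER,
        totals_pageviews INTEGER
    );
    INSERT INTO flattened VALUES
        (1, 1, 3), (1, 2, 3), (1, 3, 3),
        (2, 1, 2), (2, 2, 2);
""")

# Summing the repeated column directly over-counts.
naive = conn.execute(
    "SELECT SUM(totals_pageviews) FROM flattened"
).fetchone()[0]

# Collapse to one value per session first, then sum.
deduped = conn.execute("""
    SELECT SUM(pv)
    FROM (
        SELECT visitId, MAX(totals_pageviews) AS pv
        FROM flattened
        GROUP BY visitId
    )
""").fetchone()[0]

print(naive, deduped)  # 13 5
```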
