Beginner GA export BigQuery questions - google-analytics

I am new to BigQuery. I have a couple of rookie questions:
Is there any way to do a SELECT TOP x * style query, laid out kind of like the preview pane in the table details? It can be a lot easier to understand the data and its structure when you can see it visually.
You can create a unique visit ID by concatenating VisitId and FullVisitorId. Why doesn't a count of these unique IDs match the count of sessions? How do unique visits differ from sessions, definitionally?
Thanks!

COUNT(DISTINCT field[, N]) is a statistical approximation; for counts less than N it is exact.
To get an exact count for large values, use COUNT(*) over a GROUP EACH BY subquery; this may, however, be a much slower query.
See COUNT documentation at https://cloud.google.com/bigquery/query-reference.
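For example, a sketch of both variants in legacy SQL (the project, dataset and table date are placeholders for your own GA export table):

-- Approximate: exact only while the number of distinct values stays below N (here 100000)
SELECT COUNT(DISTINCT CONCAT(fullVisitorId, STRING(visitId)), 100000) AS approx_sessions
FROM [my-project:my_dataset.ga_sessions_20170101]

-- Exact: COUNT(*) over a GROUP EACH BY on the concatenated session key
SELECT COUNT(*) AS exact_sessions
FROM (
  SELECT CONCAT(fullVisitorId, STRING(visitId)) AS session_id
  FROM [my-project:my_dataset.ga_sessions_20170101]
  GROUP EACH BY session_id
)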

Related

Firebase / NoSQL - How to aggregate data for statistics

I'm creating my first-ever project with Firebase, and I have come to the point where I need some statistics based on user input. I know Firebase (or NoSQL databases in general) is not ideal for statistics, but it works for me in all other cases, so I would like to give it a try.
What I have:
I'm working on an application where people can invite a friend to work for their company, so I have a collection of "referrals" where the ID of each referral is basically the UserID of the user to whom the referral belongs, and then there is a subcollection named "items" where the data are stored.
What my data looks like:
Each item has these fields:
applicant
appliedDate
position (includes the positionId and the department the position comes from)
status
What I want is to let the user generate statistics based on:
date range
status
department
What I was thinking about:
It's probably not the best idea to let Firebase iterate over all referrals every time a user makes a request, as that may get really expensive on Firebase. What I was thinking of is using Cloud Functions to update the statistics whenever something changes, e.g. when a new applicant applies I increase a counter by one, and do the same for a counter for that specific department. However, I feel like this works for total numbers or for predefined queries such as "LAST MONTH", but since I won't know which dates the user will select, it starts to get tricky.
Any idea how can I design something like this?
Thanks a lot!
What you're considering is the idiomatic approach to calculating aggregates in Firestore, and in most NoSQL databases. If you follow this pattern, Firestore is quite well suited to storing statistics.
It's ad-hoc statistics, like your unknown date range, that are trickier. Usually this comes down to storing the right values so that you no longer need to read an unknown number of documents to calculate a value.
For example, if you store counters for the statistics per month, week, day and hour, you can satisfy a wide range of date ranges with a limited number of read operations. You may need to read multiple documents, but the number of documents to read depends on the range, and not on the total number of documents in the database.
Of course, for the most flexible ad-hoc querying, you may still want to consider another solution, such as BigQuery, which was made precisely for this use-case.
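For illustration, once the referral items are exported to such a system, the ad-hoc breakdown described above becomes a single aggregation query. A minimal sketch, assuming a hypothetical BigQuery table referrals.items with appliedDate, status and department columns:

-- Count applications per department and status over an arbitrary, user-selected date range
SELECT department, status, COUNT(*) AS applications
FROM referrals.items
WHERE appliedDate BETWEEN DATE '2021-01-01' AND DATE '2021-03-31'
GROUP BY department, status
ORDER BY department, status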

How would I order my collection on timestamp and score

I have a collection with documents that have a createdAt timestamp and a score number. I sort all the documents on score for our leaderboard. But now I want to also have the daily best.
matchResults.orderBy("score").where("createdAt", ">", yesterday).startAt(someValue).limit(10);
But I found that there are limitations when filtering and ordering on different fields:
https://firebase.google.com/docs/firestore/query-data/order-limit-data#limitations
So how could I get the results of today in chunks of 10, sorted on score?
You can use multiple orderBy(...) clauses to order on multiple fields, but this won't exactly meet your needs since you must first order by timestamp and only second by score.
A brute force option would be to fetch all the scores for the given day and truncate the list locally. But that of course won't work well if there are thousands of scores to load.
One simple answer would be to use a datestamp instead of timestamp:
matchResults.where("dayCreated", "=", "YYYY-MM-DD").orderBy("score").startAt(...).limit(10)
A second simple answer would be to run a Cloud Function on write events and maintain a daily top scores table separate from your scores data. If the results are frequently viewed, this would ultimately prove more economical and scalable as you would only need to record a small subset (say the top 100) by day, and can simply query that table ordering by score.
Scoreboards are extremely difficult at scale, so don't underestimate the complexity of handling every edge case. Start small and practical, focus on aggregating your results during writes and keep reads small and simple. Limit scope by listing only a top percentage for your "top scores" records and skip complex pagination schemas where possible.

Tableau running total without using Table calculation

Does anyone know the formula for a calculated field to compute a running total for a group of dimensions, sorted by payment date? E.g. I want a running total of "Sales", grouped by ProductName, Location, Date and PPID, sorted by payment date in descending order.
I can do this with a Table Calculation, but it doesn't meet my requirement, because after I get the output I need to use it in other calculated fields. So I need to compute the running total with a formula.
Thanks
Table calcs are the only calculations in Tableau that take the order of rows into account.
Your other option is to use custom SQL to write a windowing or analytic query. Read about the SQL keywords PARTITION and OVER. Not all databases support them, but most major ones do.
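For example, a windowed running total of Sales over the grouping dimensions, ordered by payment date, might look like this (table and column names are placeholders for your own data source):

SELECT ProductName, Location, PPID, payment_date, Sales,
       SUM(Sales) OVER (PARTITION BY ProductName, Location, PPID  -- the grouping dimensions
                        ORDER BY payment_date DESC                -- running total in payment-date order
                        ROWS UNBOUNDED PRECEDING) AS running_sales
FROM sales_source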

Distinct count SSAS

I'm facing a little issue calculating a distinct count of clients in an SSAS OLAP cube. The difficulty appears with credited client accounts, in other words, clients who have a credit (quantity = -1) or clients who have bought the product and received a credit afterwards (quantity = 0). My current distinct count in the cube treats these two cases as real buying transactions, but in fact they're not. I've looked in SSAS for a way to make a distinct count with an expression like (SUM Quantity > 1), but I didn't find anything. Now I'm thinking of modelling these cases directly in my data warehouse, but I don't see how to do it. Can anyone give me a little help?
Thanks.
I would feed this data into SSAS using a SQL view. Within that view I would define a calculation to return NULL for the rows you don't want to count, something like this:
CASE WHEN quantity <= 0 THEN NULL ELSE Client_Account END AS Client_Account_For_Distinct_Count
Then I would use that column as the basis of the SSAS Distinct Count measure.
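A minimal sketch of such a view, assuming a hypothetical fact table named dbo.FactSales (adjust the names to your schema); credited rows get NULL so they are excluded from the distinct count measure:

CREATE VIEW dbo.vw_FactSales_ForCube AS
SELECT f.*,
       -- NULL for credits (quantity <= 0) so only real purchases feed the distinct count
       CASE WHEN f.quantity <= 0 THEN NULL ELSE f.Client_Account END AS Client_Account_For_Distinct_Count
FROM dbo.FactSales AS f;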

Large Table vs Multiple Tables - Normalized Data

I am currently working on a project that collects a customer's demographics weekly and stores the delta (from previous weeks) as a new record. This process will encompass 160 variables and a couple of hundred million people (my management and a consulting firm require this, although ~100 of the variables are seemingly useless). These variables will be collected from 9 different tables in our Teradata warehouse.
I am planning to split this into 2 tables.
Table with often used demographics (~60 variables sourced from 3 tables)
Normalized (1 customer id and add date for each demographic variable)
Table with rarely or unused demographics (~100 variables sourced from 6 tables)
Normalized (1 customer id and add date for each demographic variable)
MVC (multi-value compression) is utilized to save as much space as possible, as the database it will live on is limited in size due to backup limitations. (Note that the customer id currently consumes 30% (3.5 GB) of table 1's size, so additional tables would add that storage cost.)
The table(s) will be accessed by finding the most recent record in relation to the date the Analyst has selected:
SELECT cus_id, demo
FROM db1.demo_test
WHERE (cus_id, add_dt) IN (
    SELECT cus_id, MAX(add_dt)
    FROM db1.dt_test
    WHERE add_dt <= '2013-03-01' -- Analyst selected Point-in-Time Date
    GROUP BY 1)
GROUP BY 1,2
This data will be used for modeling purposes, so a reasonable SELECT speed is acceptable.
Does this approach seem sound for storage and querying?
Is any individual table too large?
Is there a better suggested approach?
My concern with splitting further is
Space due to uncompressible fields such as dates and customer ids
Speed with joining 2-3 tables (I suspect an inner join may use very little resources.)
Please excuse my ignorance in this matter. I usually work with large tables that do not persist for long (I am a Data Analyst by profession) or the tables I build for long term data collection only contain a handful of columns.
In addition to Rob's remarks:
What is your current PI/partitioning?
Is the current performance unsatisfactory?
How do the analysts access the data; besides the point-in-time date, are there any other common conditions?
Depending on your needs, a (prev_dt, add_dt) pair might be better than a single add_dt. It is more overhead to load, but querying might be as simple as WHERE date BETWEEN prev_dt AND add_dt.
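As a sketch, assuming hypothetical prev_dt/add_dt columns that bound the period during which a row is the current record:

-- Point-in-time lookup without a MAX() subquery: the selected date simply
-- has to fall inside the row's validity period
SELECT cus_id, demo
FROM db1.demo_test
WHERE DATE '2013-03-01' BETWEEN prev_dt AND add_dt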
A Join Index on (cus_id), (add_dt) might be helpful, too.
You might replace the MAX() subquery with a RANK (MAX is usually slower; only when cus_id is the PI might RANK be worse):
SELECT *
FROM db1.demo_test
QUALIFY
RANK() OVER (PARTITION BY cus_id ORDER BY add_dt DESC) = 1
In TD14 you might split your single table into two row containers of a column-partitioned table.
...
The width of the table, at 160 sparsely populated columns, is not necessarily an incorrect physical implementation (normalized in 3NF or slightly de-normalized). I have also seen situations where attributes that are not regularly accessed are moved to a documentation table. If you elect to implement the latter in your physical implementation, it would be in your best interest that each table share the same primary index. This allows the joining of these two tables (60 attributes and 100 attributes) to be AMP-local on Teradata.
If access to the table(s) will also include the add_dt column, you may wish to create a partitioned primary index on this column. This will allow the optimizer to eliminate the other partitions from being scanned when the add_dt column is included in the WHERE clause of a query. Another option would be to test the behavior of a value-ordered secondary index on the add_dt column.
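A sketch of such a definition, with hypothetical types and partition bounds to adjust to your data; the shared PRIMARY INDEX keeps the split tables AMP-local for joins, while the RANGE_N partitioning enables partition elimination on add_dt:

CREATE TABLE db1.demo_core
(
   cus_id BIGINT NOT NULL,
   add_dt DATE NOT NULL
   -- ... the ~60 often-used demographic columns ...
)
PRIMARY INDEX (cus_id)
PARTITION BY RANGE_N(add_dt BETWEEN DATE '2012-01-01' AND DATE '2020-12-31' EACH INTERVAL '7' DAY);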
