Aggregating on groups of data ordered by date in Snowflake

I have the following data in my table:
I need the output to be the following in Snowflake:
Basically, I need to order by transaction date and get the first transaction, the last transaction, and the count of transactions for each country and city as they occur in sequence. I tried using window functions but I'm not getting the desired result. The tricky part, as you can see, is that the grouping has to be done in sequence: TEXAS and CALIFORNIA repeat depending on the sequence of transactions for the country and city.
Ideally this would be done in a single query; second best would be some other fast way of computing it. It has to work on batches of data. I'd rather not pull the data in order and walk through it row by row unless that's the only option, but I'm open to advice on that as well. Thanks!

Hint: GROUP BY, MIN, MAX, COUNT

I was able to work out the logic, and the following query works. Grouping on the difference of the two row numbers (the classic gaps-and-islands trick) isolates each consecutive run of the same country and region:
select countryid, regionid, min(requesttime), max(requesttime), count(*)
from (
    select deviceid, countryid, regionid, cityid, requesttime,
        row_number() over (partition by countryid order by requesttime) as seqnum_1,
        row_number() over (partition by countryid, regionid order by requesttime) as seqnum_2
    from my_table t  -- your source table
) t
group by countryid, regionid, (seqnum_1 - seqnum_2)
order by min(requesttime);
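To see why this works, consider a toy run of rows for one country, ordered by requesttime (values hypothetical):
requesttime  regionid     seqnum_1  seqnum_2  (seqnum_1 - seqnum_2)
t1           TEXAS        1         1         0
t2           TEXAS        2         2         0
t3           CALIFORNIA   3         1         2
t4           TEXAS        4         3         1
t5           TEXAS        5         4         1
The difference is constant within each consecutive run of the same region and changes whenever a run is broken, so grouping on (regionid, seqnum_1 - seqnum_2) yields exactly one group per run.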

Related

Sessions by hits.page.pagePath in GA bigquery tables

I am new to bigquery, so sorry if this is a noob question! I am interested in breaking out sessions by page path or title. I understand one session can contain multiple paths/titles so the sum would be greater than total sessions. Essentially, I want to create a 'session id' and do a count distinct of sessionids where path like a or b.
It might actually be helpful to start at the very beginning and manually calculate total sessions. I tried to concatenate visit id and full visitor id to create a unique visit id, but apparently that is quite different from sessions. Can someone help enlighten me? Thanks!
I am working with our GA site data. Schema is the standard in GA exports.
DATA SAMPLE
Let's use an example out of the sample BigQuery (London Helmet) data:
There are 63 sessions on this day:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
How many of those sessions are where hits.page.pagePath like /vests% or /helmets%? How many were vests only vs helmets only? Thanks!
Here is an example of how to calculate whether there were only helmets, or only vests or both helmets and vests or neither:
SELECT
  visitId,
  has_helmets AND has_vests AS both_helmets_and_vests,
  has_helmets AND NOT has_vests AS helmets_only,
  NOT has_helmets AND has_vests AS vests_only,
  NOT has_helmets AND NOT has_vests AS neither_helmets_nor_vests
FROM (
  SELECT
    visitId,
    SOME(hits.page.pagePath LIKE '/helmets%') WITHIN RECORD AS has_helmets,
    SOME(hits.page.pagePath LIKE '/vests%') WITHIN RECORD AS has_vests
  FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
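To get the session counts the question asks about, one way (a sketch in the same legacy SQL dialect) is to group on those flags:
SELECT
  both_helmets_and_vests,
  helmets_only,
  vests_only,
  neither_helmets_nor_vests,
  COUNT(*) AS sessions
FROM (
  SELECT
    visitId,
    has_helmets AND has_vests AS both_helmets_and_vests,
    has_helmets AND NOT has_vests AS helmets_only,
    NOT has_helmets AND has_vests AS vests_only,
    NOT has_helmets AND NOT has_vests AS neither_helmets_nor_vests
  FROM (
    SELECT
      visitId,
      SOME(hits.page.pagePath LIKE '/helmets%') WITHIN RECORD AS has_helmets,
      SOME(hits.page.pagePath LIKE '/vests%') WITHIN RECORD AS has_vests
    FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
  )
)
GROUP BY both_helmets_and_vests, helmets_only, vests_only, neither_helmets_nor_vests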
Way 1, easier, but you need to repeat it for each substring
Obviously you can do something like this:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] WHERE hits.page.pagePath LIKE '/helmets%'
And then run one query per substring of your own (one with '/vests%', one with '/helmets%', etc.).
Way 2, works fine, but not with repeated fields
If you want ONE query that just groups by the first part of the string, you can do something like this:
SELECT a, COUNT(*)
FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) AS a
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]) GROUP BY a
When I do this, it returns the counts grouped by the first part of the path across the 63 sessions, with a total count of 63 :).
Way 3, using FLATTEN on the table to get each hit individually
Since the "hits" field is repeated, you need a FLATTEN in your query:
SELECT a, COUNT(*)
FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) AS a
FROM FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits)) GROUP BY a
The reason you need FLATTEN here is that the "hits" field is repeated: if you don't flatten, the query won't look at ALL the "hits" in each row. Adding FLATTEN gives you a sub-table where each hit is on its own row, so you can query all of them.
If you want it by session instead of by hit (you'll get both), do something like this:
SELECT b, a, COUNT(*)
FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) AS a, visitId AS b
FROM FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits)) GROUP BY b, a

How to create database table dynamically and insert data selected by query

I'm working on a website where I need to find a user's rank based on their score. Until now I've calculated the score and rank with a SQL query:
select * from (
    select
        usrid,
        ROW_NUMBER() OVER (ORDER BY (count(*) + sum(sup) + sum(opp) + sum(visited) * 0.3) DESC) AS rank,
        (count(*) + sum(sup) + sum(opp) + sum(visited) * 0.3) As score
    from [DB_].[dbo].[dsas]
    group by usrid
) as cash
where usrid = #userid
Please don't focus too much on the query itself; it's only there to show how I select the data.
Problem: I can't keep using the query above, because every time I need a rank it has to recompute it over the whole dsas table, and that table grows day by day, which slows down my website.
What I need is to select the data with the query above and insert it into another table, named score. Can we do anything like this?
A better solution is to either include score as a field in your user table or have a separate table for scores. Any time you add new sup, opp, or visited data for a user, also recalculate their score at that time.
Then to get the highest ranking users, you will be able to perform a very simple select statement, ordering by score descending, and only fetching the number of rows you want. It will be very fast.
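A minimal sketch of that approach, assuming a users table with a score column (the users table, score column, and @userid parameter names here are hypothetical):
-- When new sup/opp/visited data arrives for a user, refresh that user's stored score once:
UPDATE u
SET u.score = d.score
FROM users u
JOIN (
    SELECT usrid, count(*) + sum(sup) + sum(opp) + sum(visited) * 0.3 AS score
    FROM [DB_].[dbo].[dsas]
    WHERE usrid = @userid
    GROUP BY usrid
) d ON d.usrid = u.usrid;
-- Ranking then becomes a cheap read over the small users table:
SELECT TOP 100 usrid, score,
    ROW_NUMBER() OVER (ORDER BY score DESC) AS rank
FROM users
ORDER BY score DESC;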

Separating DATE / DATETIME elements into different columns for fast querying in MySQL?

I want to store data in MySQL and query it based on the current day, and I'd like to know the best practice for doing so.
I want to store totals for each day, so queries for a day's totals will be quick. I thought about modeling my table as follows:
TotalsByCountry
- Year
- Month
- Day
- countryId
- totalNumber
When I query the totals for a specific day and a specific country, I will query the table on four columns: Year, Month, Day, and countryId.
I wanted to know whether this is good practice, or whether there is a better way, such as using one column that holds the month, day, and year, and querying only two columns: the datetime column and the countryId.
I need your help in choosing the right way to model the table. I also want to make another table that stores totals based on gender, so take that into consideration too.
The data will need to be accessed frequently, maybe in real time, because I want to show data changes in real time. I will be developing the web app in asp.net and will probably use web sockets to keep a constant connection that updates the data for the user in real time, so when the data changes it is reflected on the user's page immediately. That's why I need a table model that is ready for many queries. I will cache for a few seconds so it won't stress the db too much.
I hope I provided enough information, if not, please comment and I will reply.
Having three separate columns to store each individual element of a date (year/month/day) will add unnecessary overhead to your database in terms of insert performance and disk space.
What you will want to do is simply have a single DATETIME column to store the date and time, and have a composite index set up on (countryId, datetime_col).
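For example, the table from the question might be declared like this (a sketch; datetime_col is a placeholder name):
CREATE TABLE TotalsByCountry (
    countryId    INT NOT NULL,
    datetime_col DATETIME NOT NULL,
    totalNumber  INT NOT NULL,
    KEY idx_country_datetime (countryId, datetime_col)
);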
Even if you want to query all rows for a specific day or month, MySQL will still be able to utilize indexes on the DATETIME field, provided that you write your queries the right way and make sure never to wrap the DATETIME column in a function when you perform your conditional check.
Here is how you can write your query so that it will still be able to utilize indexes:
-- Get the sum of totalNumber of all rows based on current day
-- where countryId = 1
SELECT SUM(totalNumber) AS totalsum
FROM tbl
WHERE countryId = 1 AND
datetime_col >= CAST(CURDATE() AS DATETIME) AND
datetime_col < CAST(CURDATE() + INTERVAL 1 DAY AS DATETIME)
By making the comparison on the bare DATETIME column, the query remains sargable (i.e., able to utilize index range scans) and MySQL will be able to use indexes to quickly look up rows.
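If you want to confirm that the index is actually being used, MySQL's EXPLAIN will show it (the exact plan output depends on your version and data):
EXPLAIN SELECT SUM(totalNumber) AS totalsum
FROM tbl
WHERE countryId = 1 AND
datetime_col >= CAST(CURDATE() AS DATETIME) AND
datetime_col < CAST(CURDATE() + INTERVAL 1 DAY AS DATETIME);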
On the other hand, if you were to try to wrap the DATETIME column within a function to make the comparison:
-- Get the sum of totalNumber of all rows based on current day
-- where countryId = 1
SELECT SUM(totalNumber) AS totalsum
FROM tbl
WHERE countryId = 1 AND
DATE(datetime_col) = CURDATE()
...It would be quite inefficient because the DATE() function that wraps the column effectively renders the query as non-sargable, and any kind of index you have set up containing the DATETIME column will not be utilized.
You can also efficiently query for the total sum of all rows in the current month:
-- Get the sum of totalNumber of all rows based on current month
-- where countryId = 1
SELECT SUM(totalNumber) AS monthsum
FROM tbl
WHERE countryId = 1 AND
datetime_col >= CAST(CONCAT(YEAR(NOW()), '-', MONTH(NOW()), '-01') AS DATETIME) AND
datetime_col < CAST(CONCAT(YEAR(NOW()), '-', MONTH(NOW()), '-01') AS DATETIME) + INTERVAL 1 MONTH
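The same month filter can also be written with DATE_FORMAT, which some find easier to read (a stylistic alternative; equivalent behavior and still sargable):
SELECT SUM(totalNumber) AS monthsum
FROM tbl
WHERE countryId = 1 AND
datetime_col >= DATE_FORMAT(NOW(), '%Y-%m-01') AND
datetime_col < DATE_FORMAT(NOW(), '%Y-%m-01') + INTERVAL 1 MONTH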
And within the current year:
-- Get the sum of totalNumber of all rows based on current year
-- where countryId = 1
SELECT SUM(totalNumber) AS yearsum
FROM tbl
WHERE countryId = 1 AND
datetime_col >= CAST(CONCAT(YEAR(NOW()), '-01-01') AS DATETIME) AND
datetime_col < CAST(CONCAT(YEAR(NOW()), '-01-01') AS DATETIME) + INTERVAL 1 YEAR
My argument is:
If you want fast database lookups, you need well-built queries that use indexes.
Your approach requires indexing four columns (which means slower inserts); with a single date column you need just two. The query complexity will also increase if you ever need to search for date ranges.
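To make the range point concrete: a query spanning, say, the end of one year and the start of the next needs OR logic over the separate columns (a sketch against the TotalsByCountry model):
SELECT SUM(totalNumber)
FROM TotalsByCountry
WHERE countryId = 1 AND (
    (Year = 2023 AND Month = 12 AND Day >= 28) OR
    (Year = 2024 AND Month = 1 AND Day <= 3)
);
-- versus, with a single column:
-- WHERE countryId = 1 AND datetime_col >= '2023-12-28' AND datetime_col < '2024-01-04'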

How can I average number of items in our SQLite database by 24-hour period?

I'm trying to create a report for our support ticketing system, and I want two results in it: a rolling average of how many tickets were opened per day and how many were closed per day.
Basically: query the entire tickets table, separate everything out by the individual day each ticket was created on, count the tickets for each day, then average those counts.
My friend gave me this query:
SELECT AVG(ticket_count)
FROM (SELECT COUNT(*) AS ticket_count FROM tickets
GROUP BY DATE(created_at, '%Y'), DATE(created_at, '%m'), DATE(created_at, '%d')) AS ticket_part
But it doesn't seem to work for me. All I get is a single result with the number of tickets created last year.
Here's what finally worked for me:
SELECT round(CAST(AVG(TicketsOpened) AS REAL), 1) as DailyOpenAvg
FROM
(SELECT date(created_at) as Day, COUNT(*) as TicketsOpened
FROM tickets
GROUP BY date(created_at)
) AS X
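If you also want the daily average of closed tickets, the same shape works, assuming the tickets table has a closed_at column (a hypothetical name; use whatever your schema calls it):
SELECT round(CAST(AVG(TicketsClosed) AS REAL), 1) as DailyCloseAvg
FROM
(SELECT date(closed_at) as Day, COUNT(*) as TicketsClosed
FROM tickets
WHERE closed_at IS NOT NULL
GROUP BY date(closed_at)
) AS X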
The middle part of your query collapses the table to a single row (in SQLite, DATE() doesn't accept strftime-style format modifiers such as '%Y'; an unrecognized modifier makes the function return NULL, so every row falls into the same group), and the outer part then has only that one value to average. It's hard to say exactly what you need without seeing your schema, but at a guess I'd try this:
SELECT
    AVG(CAST(TicketsOpened AS REAL)) -- cast to REAL so the average isn't truncated; SQLite's AVG() already returns a float, so this is belt-and-braces
FROM
(
    SELECT
        date(created_at) AS Day, -- date() strips any time element; if you store dates alone, you can omit it
        COUNT(*) AS TicketsOpened
    FROM
        tickets
    GROUP BY
        date(created_at)
) AS X
Hope that helps!

Does a multi-column index work for single column selects too?

I've got (for example) an index:
CREATE INDEX someIndex ON orders (customer, date);
Does this index only accelerate queries where both customer and date are used, or does it also accelerate queries on a single column like this?
SELECT * FROM orders WHERE customer > 33;
I'm using SQLite.
If the answer is yes, why is it possible to create more than one index per table?
Yet another question: how much faster is a combined index compared with two separate indexes when you use both columns in a query?
marc_s has the correct answer to your first question. The first key in a multi-key index can work just like a single-key index, but any subsequent keys will not.
As for how much faster the composite index is, that depends on your data and how you structure your index and query, but it is usually significant. The indexes essentially allow SQLite to do a binary search on the fields.
Using the example you gave if you ran the query:
SELECT * FROM orders WHERE customer > 33 AND date > 99
SQLite would first get all results using a binary search on the index where customer > 33. Then it would do a binary search on only those results looking for date > 99.
If you did the same query with two separate indexes on customer and date, SQLite would have to binary search the whole table twice, first for the customer and again for the date.
So how much of a speed increase you will see depends on how you structure your index with regard to your query. Ideally, the first field in your index and your query should be the one that eliminates the most possible matches as that will give the greatest speed increase by greatly reducing the amount of work the second search has to do.
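For example, if in your data a date predicate eliminates far more rows than the customer predicate, an index with the columns in the other order may serve that query better (a sketch; measure against your own data):
CREATE INDEX someDateFirstIndex ON orders (date, customer);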
For more information see this:
http://www.sqlite.org/optoverview.html
I'm pretty sure this will work, yes - it does in MS SQL Server anyway.
However, this index doesn't help you if you need to select on just the date, e.g. a date range. In that case, you might need to create a second index on just the date to make those queries more efficient.
Marc
I commonly use combined indexes to sort through data I wish to paginate or request "streamily".
Assuming a customer can make more than one order, customers 0 through 11 exist, and there are several orders per customer, all inserted in random order: I want to sort a query by customer number and then by date. You should also sort by the id field last, to break ties where a customer has several identical dates (even if that may never happen).
sqlite> CREATE INDEX customer_asc_date_asc_index_asc ON orders
(customer ASC, date ASC, id ASC);
Get page 1 of a sorted query (limited to 10 items):
sqlite> SELECT id, customer, date FROM orders
ORDER BY customer ASC, date ASC, id ASC LIMIT 10;
2653|1|1303828585
2520|1|1303828713
2583|1|1303829785
1828|1|1303830446
1756|1|1303830540
1761|1|1303831506
2442|1|1303831705
2523|1|1303833761
2160|1|1303835195
2645|1|1303837524
Get the next page:
sqlite> SELECT id, customer, date FROM orders WHERE
(customer = 1 AND date = 1303837524 and id > 2645) OR
(customer = 1 AND date > 1303837524) OR
(customer > 1)
ORDER BY customer ASC, date ASC, id ASC LIMIT 10;
2515|1|1303837914
2370|1|1303839573
1898|1|1303840317
1546|1|1303842312
1889|1|1303843243
2439|1|1303843699
2167|1|1303849376
1544|1|1303850494
2247|1|1303850869
2108|1|1303853285
And so on...
Having the indexes in place reduces server-side index scanning compared with pairing OFFSET with LIMIT: with OFFSET, query time grows and the drive seeks harder the higher the offset goes. Using this method eliminates that.
Using this method is advised if you plan on joining data later but only need a limited set of data per request. Join against a SUBSELECT as described above to reduce memory overhead for large tables.
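A sketch of that join-against-a-subselect pattern: the inner query pages through the narrow indexed columns, and the join then fetches the full rows for only those ten ids:
SELECT o.*
FROM (
    SELECT id FROM orders
    ORDER BY customer ASC, date ASC, id ASC LIMIT 10
) AS page
JOIN orders AS o ON o.id = page.id
ORDER BY o.customer ASC, o.date ASC, o.id ASC;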
