Considerable difference between Firebase Performance and BigQuery export data - firebase

For the same date range, the Firebase console shows median slow rendering at around 30% and frozen frames at around 11.5%; however, calculating the same logic in BigQuery gives considerably different results: slow rendering at around 26% and frozen frames at around 0%. My query looks like this:
SELECT DISTINCT
  PERCENTILE_DISC(ROUND(trace_info.screen_info.frozen_frame_ratio, 2) * 100, 0.5) OVER () AS frozen_p50,
  PERCENTILE_DISC(ROUND(trace_info.screen_info.slow_frame_ratio, 2) * 100, 0.5) OVER () AS slow_p50
FROM
  `my-table-name`
WHERE
  event_type = "SCREEN_TRACE"
  AND event_name = "_st_MainActivity"
  AND DATE(_PARTITIONDATE) >= "2023-02-07"
  AND DATE(_PARTITIONDATE) < "2023-02-08"
For the given example, the date picker in the Firebase console is set to Feb 7 - Feb 8 and the percentile is set to 50% (median).
The questions I'm trying to answer are these:
What's wrong with my query, if anything?
Why such a big difference, assuming my query is OK?
---UPD:
I find it strange that for the same set of records Google Sheets gives p90 as 7%, while BigQuery's percentile_disc gives 0% and percentile_cont also gives 0%. Same timeframe, same number of records, same record values.
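One way to narrow this down is to compare percentiles computed on the raw ratios against the rounded values the original query uses. The sketch below reuses the same export table and filters as above and is only a diagnostic, not the console's exact aggregation:

SELECT DISTINCT
  -- percentiles over the raw ratios, no rounding
  PERCENTILE_CONT(trace_info.screen_info.frozen_frame_ratio, 0.5) OVER () * 100 AS frozen_p50_raw,
  PERCENTILE_CONT(trace_info.screen_info.slow_frame_ratio, 0.5) OVER () * 100 AS slow_p50_raw,
  -- percentiles over the rounded values, as in the original query
  PERCENTILE_DISC(ROUND(trace_info.screen_info.frozen_frame_ratio, 2) * 100, 0.5) OVER () AS frozen_p50_rounded,
  PERCENTILE_DISC(ROUND(trace_info.screen_info.slow_frame_ratio, 2) * 100, 0.5) OVER () AS slow_p50_rounded
FROM
  `my-table-name`
WHERE
  event_type = "SCREEN_TRACE"
  AND event_name = "_st_MainActivity"
  AND DATE(_PARTITIONDATE) >= "2023-02-07"
  AND DATE(_PARTITIONDATE) < "2023-02-08"

If the raw and rounded figures disagree, the rounding step is part of the gap; if they agree, the difference is more likely in how the console samples or aggregates the traces.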

Related

How to access unaggregated results when aggregation is needed due to dataset size in R

My task is to get total inbound leads for a group of customers, leads by month for the same group of customers, and the conversion rate of those leads.
The dataset I'm pulling from is 20 million records, so I can't query the whole thing. I have successfully done the first step (getting the total lead count for each org) with this:
inbound_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                                auto_limit = TRUE,
                                query = "select org_id,
                                         COUNT(*)
                                         from table
                                         GROUP BY org_id
                                         ORDER BY org_id")
DOMO is the BI tool I'm pulling from, and domo_get_query is an internal function from a custom library my company built; it takes a query argument, which is a MySQL query, plus various other arguments which aren't important right now.
Sample data looks like this:
org_id, inserted_at, lead_converted_at
1 10/17/2021 2021-01-27T03:39:03
2 10/18/2021 2021-01-28T03:39:03
1 10/17/2021 2021-01-28T03:39:03
3 10/19/2021 2021-01-29T03:39:03
2 10/18/2021 2021-01-29T03:39:03
I have looked through many online aggregation tutorials, but none of them cover how to get data that is only available pre-aggregation (such as the number of leads per month per org, which isn't possible once the aggregation has happened, because in the sample above the aggregation removes the ability to see more than one row for org_id 1, for example) from a dataset that has to be aggregated just to be pulled at all. Maybe I just don't understand this well enough to know the right questions to ask. Any direction is appreciated.
If you're unable to fit your data in memory, you have a few options. You could process the data in batches (e.g. one year at a time) so that it fits in memory, or use a package like chunked to help.
But in this case I would bet the easiest way to handle your problem is to solve it entirely in your SQL query. To get leads by month, you'll need to truncate your date column and group by org_id and month.
To get the conversion rate for leads in those months, you could add a column (in addition to your count column) that is something like:
sum(case when conversion_date is not null then 1 else 0 end) as convert_count
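Putting that together, a complete query in that spirit could be passed straight to domo_get_query. This is only a sketch: the table name and the lead_converted_at column come from the sample above, DATE_FORMAT is one MySQL-flavoured way to truncate to the month, and if inserted_at is stored as text in the 10/17/2021 format shown it would need STR_TO_DATE first:

select org_id,
       -- truncate the insert date to the first day of its month
       DATE_FORMAT(inserted_at, '%Y-%m-01') as lead_month,
       -- leads per org per month
       COUNT(*) as lead_count,
       -- leads that have a conversion timestamp
       sum(case when lead_converted_at is not null then 1 else 0 end) as convert_count,
       -- conversion rate for that org/month
       sum(case when lead_converted_at is not null then 1 else 0 end) / COUNT(*) as conversion_rate
from table
GROUP BY org_id, DATE_FORMAT(inserted_at, '%Y-%m-01')
ORDER BY org_id, lead_month

This keeps the aggregation on the database side, so only one row per org per month comes back into R.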

MariaDB update query taking a long time

I'm currently having some problems with our MySQL replication. We're using a master-master setup for failover purposes.
The replication itself is working and I believe it's set up right, but we're having trouble with some queries that take an excruciating time to execute.
Example:
| 166 | database | Connect | 35 | updating | update xx set xx = 'xx' where xx = 'xx' and xx = 'xx' | 0.000 |
These update queries sometimes take 20-30+ seconds to complete, and because of that the replication starts lagging behind; within a day it will be behind by a couple of hours. The strange part is that it eventually catches up with the other master.
The table is around 100 million rows and roughly 70 GB. On the master where the queries are originally executed they take less than a second.
Both configurations, MySQL and server, are near identical, and we have tried optimizing the table and the queries, but no luck so far.
Any recommendations on what we could try to solve this? Let me know if I can provide any more information.
Using:
MariaDB 10.1.35 -
CentOS 7.5.1804
The key aspect of this is how many rows you are updating:
If the percentage is low (less than 5% of the rows), then an index can help.
Otherwise, if you are updating a large number of rows (more than 5%), a full table scan will be optimal. If you have millions of rows this will be slow. Maybe partitioning the table could help, but I would say you have little chance of improving it.
I'm going to assume you are updating a small percentage of rows, so you can use an index. Look at the condition in the WHERE clause. If it looks like this:
WHERE col1 = 'xx' and col2 = 'yy'
Then, an index on those columns will make your query faster. Specifically:
create index ix1 on my_table (col1, col2);
Depending on the selectivity of your columns the flipped index could be faster:
create index ix2 on my_table (col2, col1);
You'll need to try both to see which one is better for your specific case.
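To confirm which plan the replica actually picks, you can check the execution plan there before and after adding the index. A sketch, with placeholder table and column names matching the ones above:

-- On the replica: show the plan MariaDB chooses for the update
EXPLAIN UPDATE my_table
SET col3 = 'xx'
WHERE col1 = 'xx' AND col2 = 'yy';

-- List the indexes that exist on the table
SHOW INDEX FROM my_table;

If the EXPLAIN output shows a full table scan (type ALL) instead of one of the new indexes, that is a strong hint the index is missing or not selective enough on the replica.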

How to find the nth largest value of each row in SQL

I have researched this problem and have found the answer for a single query, where you can find the nth value of a single column by using DESC OFFSET 2. What I am trying to do is find the nth value for each day. For example, I'm working with a database of bike share data. The database stores the duration of each trip and the date. I'm trying to find the 3rd longest duration for each day in the database. If I were going to find the max duration I would use the following code:
SELECT DATE(start_date) trip_date, MAX(duration)
FROM trips
GROUP BY 1
I want the output to be something like this.
Date 3rd_duration
1/1/2017 334
1/2/2017 587
etc
If the value of the third longest duration is the same for two or more different trips, I would like the trip with the lowest trip_id to be ranked 3rd.
I'm working in SQLite.
Any help would be appreciated.
Older versions of SQLite (before 3.25) and MySQL (before 8.0) do not have a ROW_NUMBER function built in, so get ready for an ugly query. We can still group by the date, but to find the third-longest duration we can use a correlated subquery.
SELECT
    DATE(t1.start_date) AS trip_date,
    t1.duration
FROM trips t1
WHERE
    (SELECT COUNT(*) FROM trips t2
     WHERE DATE(t2.start_date) = DATE(t1.start_date) AND
           t2.duration >= t1.duration) = 3;
Note that this approach might break down if, for a given date, more than one record has the same duration. In that case you might get multiple results, none of which is actually the third-longest duration; the tie rule you describe (lowest trip_id ranks first) is hard to express with the correlated subquery, but the window-function sketch below handles it.
Demo here:
Rextester
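If you are on SQLite 3.25+ (or MySQL 8.0+), window functions are available, and a ROW_NUMBER-based version handles the tie-breaking by lowest trip_id directly. A sketch, assuming the trips table has the trip_id column mentioned in the question:

SELECT trip_date, duration AS third_duration
FROM (
    SELECT
        DATE(start_date) AS trip_date,
        duration,
        -- rank trips within each day, longest first, ties broken by lowest trip_id
        ROW_NUMBER() OVER (
            PARTITION BY DATE(start_date)
            ORDER BY duration DESC, trip_id ASC
        ) AS rn
    FROM trips
) AS ranked
WHERE rn = 3;

Days with fewer than three trips simply produce no row.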

Google Analytics query for sessions

I am trying to analyze visits to purchase in Google Analytics through R.
Here is the code:
query.list <- Init(start.date = "2016-07-01",
                   end.date = "2016-08-01",
                   dimensions = c("ga:daysToTransaction", "ga:sessionsToTransaction"),
                   metrics = c("ga:transaction"),
                   sort = c("ga:date"),
                   table.id = "ga:104454195")
This code throws the following error:
Error in ParseDataFeedJSON(GA.Data) :
  code : 400 Reason : Sort key ga:date is not a dimension or metric in this query.
Can you help me get this desired output?
Days to Transaction   Transaction   %total
0                     44            50%
1                     11            20%
2-5                   22            30%
You are trying to sort your results by a dimension that is not included in your result set. You have the ga:daysToTransaction and ga:sessionsToTransaction dimensions, but you have applied a sort based on ga:date.
You'll need to use this for sorting:
sort = c("ga:daysToTransaction")
It is not clear to me whether you'll use ga:sessionsToTransaction in another part of your script, as it adds another breakdown compared to your desired output, which would then need to be aggregated to get your expected results.
Also, will you calculate %total in another part of the script, or do you expect it to be returned as part of the Analytics response? (I'm not sure whether the latter is possible with the GA API.)

Unexpected throughput with DynamoDB

I have a table in DDB with site_id as my hash key and person_id as the range key. There are another 6-8 columns on this table with numeric statistics about each person (e.g. times seen, last log in, etc.). This table has data for about 10 sites and 20 million rows (this is only used as a proof of concept now; the production table will have much bigger numbers).
I'm trying to retrieve all person_ids for a given site where time_seen > 10, so I'm doing a query using the hash key with time_seen > 10 as a criterion. This results in a few thousand entries, which I expected to get pretty much instantly. My test harness runs in AWS in the same region.
The read capacity on this table is 100 units. The results I'm getting are attached.
For some reason I'm hitting the limits. The only two limits I'm aware of are the maximum data size returned and the query time. I'm only returning 32 bytes per row (so approx. 100 KB per result), so there's no chance that's the issue, and the time, as you can see, doesn't hit the 5-second limit either. So why can't I get my results faster?
Results are retrieved in a single thread from C#.
Thanks
