How to access unaggregated results when aggregation is needed due to dataset size in R - r

My task is to get total inbound leads for a group of customers, leads by month for the same group of customers, and the conversion rate of those leads.
The dataset I'm pulling from is 20 million records, so I can't query the whole thing. I have successfully done the first step (getting total lead count for each org) with this:
inbound_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
  auto_limit = TRUE,
  query = "select org_id,
           COUNT(*)
           from table
           GROUP BY org_id
           ORDER BY org_id")
(DOMO is the BI tool I'm pulling from, and domo_get_query is an internal function from a custom library my company built. It takes a query argument, which is a MySQL query, and various other arguments which aren't important right now.)
Sample data looks like this:
org_id  inserted_at  lead_converted_at
1       10/17/2021   2021-01-27T03:39:03
2       10/18/2021   2021-01-28T03:39:03
1       10/17/2021   2021-01-28T03:39:03
3       10/19/2021   2021-01-29T03:39:03
2       10/18/2021   2021-01-29T03:39:03
I have looked through many online aggregation tutorials, but none of them cover how to get data that is only available pre-aggregation, such as the number of leads per month per org. Once the aggregation has happened that detail is gone; in the sample above, for example, grouping removes the ability to see more than one row for org_id 1. The catch is that the dataset has to be aggregated just to be pulled at all. Maybe I just don't understand this well enough to know the right questions to ask. Any direction appreciated.

If you're unable to fit your data in memory, you have a few options. You could process the data in batches (e.g. one year at a time) so that each piece fits in memory, or you could use a package like chunked to help.
But in this case I would bet the easiest way to handle your problem is to solve it entirely in your SQL query. To get leads by month, you'll need to truncate your date column and group by org_id, month.
To get conversion rate for leads in those months, you could add a column (in addition to your count column) that is something like:
sum(case when conversion_date is not null then 1 else 0 end) as convert_count
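Putting that together, a rough sketch of what the monthly query could look like (assuming the column names from the sample data, that inserted_at is stored as a date/datetime, and MySQL-style DATE_FORMAT for truncating to month):
select org_id,
       DATE_FORMAT(inserted_at, '%Y-%m') as lead_month,
       COUNT(*) as lead_count,
       sum(case when lead_converted_at is not null then 1 else 0 end) as converted_count
from table
GROUP BY org_id, lead_month
ORDER BY org_id, lead_month
That returns one row per org per month, so it stays small enough to pull through domo_get_query, and the conversion rate is just converted_count / lead_count, which you can compute in the query or afterwards in R.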

Related

Snowflake, Python/Jupyter analysis

I am new to Snowflake, and running a query to get a couple of days' data - this returns more than 200 million rows and takes a few days. I tried running the same query in Jupyter, and the kernel restarts/dies before the query ends. Even if it got into Jupyter, I doubt I could analyze the data in any reasonable timeline (but maybe using dask?).
I am not really sure where to start - I am trying to check the data for missing values, and my first instinct was to use Jupyter - but I am lost at the moment.
My next idea is to stay within Snowflake and check the columns there with case statements, e.g. sum(case when column_value = '' then 1 else 0 end) as number_missing_values.
Does anyone have any ideas/direction I could try - or know if I'm doing something wrong?
Thank you!
Not really the answer you are looking for, but:
sum(case when column_value = '' then 1 else 0 end) as number_missing_values
When you say missing value, this will only find values that are an empty string.
This can also be written in a simpler form as:
count_if(column_value = '') as number_missing_values
The database already knows how many rows it has and how many nulls each column contains. If you are loading data into a table, it might make more sense not to load empty strings and to use null instead; then, for almost no compute cost, you can run:
count(*) - count(column) as number_empty_values
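Putting those two checks together, a per-column summary could look something like this (the table and column names here are just placeholders):
select count(*) as total_rows,
       count_if(email_address = '') as empty_email_count,   -- empty strings
       count(*) - count(email_address) as null_email_count  -- nulls
from my_table;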
Also of note: if you have two tables in Snowflake, you can compare them via MINUS, i.e.
select * from table_1
minus
select * from table_2
This is useful for finding missing rows, though you do have to run it in both directions.
Then you can HASH rows, or hash the whole table via HASH_AGG.
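For example, a whole-table comparison might look something like this (table names are placeholders):
select hash_agg(*) from table_1;
select hash_agg(*) from table_2;
If the two hashes match, the row sets are identical (HASH_AGG does not depend on row order); if they differ, the MINUS queries above will show you which rows are missing.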
But normally when looking for missing data you have an external system to compare against, so the driver is "what can that system handle" and finding common ground.
Also, in the past we were searching for bugs in our processing that caused duplicate data (where we needed/wanted no duplicates), so the above, along with COUNT DISTINCT-like commands, comes in useful.

Tableau Weighted Average Per Capita Calc not aggregating right

I am trying to create a simple revenue per person calc that works with different filters within the data. I have it working for a single record, however, it breaks and aggregates incorrectly with multiple records.
The formula I have now is simply Sum([Revenue]) / Sum([Attendance]). This works when I only have a single event selected. However, as soon as I select multiple shows it aggregates and doesn't do the weighted avg.
I'm making some assumptions here, but hopefully this will help you out. I've created an .xlsx file with the following data:
Event Revenue Attendance
Event 1 63761 6685
Event 2 24065 3613
Event 3 69325 4635
Event 4 41996 5414
Inside Tableau I've created the calculated field for Rev Per Person.
Finally, in the Analysis dropdown I've enabled Show Column Grand Totals, which adds a grand-total row to the view.
Simple Fix
The problem is that all of the column totals are being calculated using the SUM aggregation. This is the desired behavior for Revenue and Attendance, but for Rev Per Person, you want to display the average.
In Analysis / Totals / Total All Using you can configure the default aggregation for all totals at once. We don't want to change all of them here, but it's useful to know about. Leave that setting where it is, and instead click on the Rev Per Person Grand Total value and change it from 'Automatic' to 'Average'.
Now you'll see a number much closer to the expected value.
But it's not exactly what you expect. The average of all the Rev Per Person values gives us $9.73; but if you take the total Revenue / total Attendance you'd expect a value of $9.79.
Slightly More Involved Fix
First - undo the simple fix. We'll keep all of the totals at 'Default'. Instead, we'll modify the Rev Per Person calculation.
IF Size() > 1 THEN
// Grand Total
SUM([Revenue]/[Attendance])
ELSE
// Regular View
SUM([Revenue])/SUM([Attendance])
END
Size() is being used to determine if the calculation is being done for an individual cell or not.
More information on Size() and similar functions can be found on Tableau's website here - https://onlinehelp.tableau.com/current/pro/desktop/en-us/functions_functions_tablecalculation.html
Now I see the expected value of $9.79.

How to find the nth largest value of each row in SQL

I have researched this problem and have found the answer for a single query, where you can find the nth value of a single column by using DESC OFFSET 2. What I am trying to do is find the nth value for each day rather than for the table as a whole. For example, I'm working with a database of bike share data. The database stores the duration of each trip and the date. I'm trying to find the 3rd longest duration for each day in the database. If I were going to find the max duration I would use the following code.
SELECT DATE(start_date) trip_date, MAX(duration)
FROM trips
GROUP BY 1
I want the output to be something like this.
Date 3rd_duration
1/1/2017 334
1/2/2017 587
etc
If the value of the third longest duration is the same for two or more different trips, I would like the trip with the lowest trip_id to be ranked 3rd.
I'm working in SQLite.
Any help would be appreciated.
Neither SQLite nor MySQL has a ROW_NUMBER function built in, so get ready for an ugly query. We can still group by the date, but to find the third-longest duration we can use a correlated subquery.
SELECT
    DATE(t1.start_date) AS trip_date,
    t1.duration
FROM trips t1
WHERE
    (SELECT COUNT(*) FROM trips t2
     WHERE DATE(t2.start_date) = DATE(t1.start_date) AND
           t2.duration >= t1.duration) = 3;
Note that this approach might break down if, for a given date, you have more than one record with the same duration. In that case you might get multiple rows back, a row that isn't really the third-longest duration, or no row at all. To handle such ties deterministically you need a tie-breaker, such as the lowest-trip_id rule described in the question.
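Since the question specifies breaking ties by taking the lowest trip_id, one way to fold that in, still without window functions, is to count how many trips rank strictly ahead of each row (this assumes the table has a trip_id column as described in the question):
SELECT
    DATE(t1.start_date) AS trip_date,
    t1.duration AS third_duration
FROM trips t1
WHERE
    (SELECT COUNT(*) FROM trips t2
     WHERE DATE(t2.start_date) = DATE(t1.start_date) AND
           (t2.duration > t1.duration OR
            (t2.duration = t1.duration AND t2.trip_id < t1.trip_id))) = 2;
Exactly two trips rank ahead of the row we want, so this returns one row per date whenever a date has at least three trips.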

R - Cluster x number of events within y time period

I have a dataset with 59k entries recorded over 63 years. I need to identify clusters of events, with the criterion being:
6 or more events within 6 hours
Each event has a unique ID, a time (HH:MM:SS) and a date (DD:MM:YY). An output would ideally have a cluster ID, the events that took place within each cluster, and the start and finish time and date.
Thinking about the problem in R, we would need to look at every date/time and count the number of events in the following 6 hours; if the number is 6 or greater, save the event IDs, and if not, move on to the next date and perform the same task. I have taken a data extract that just contains EventID, Date, Time and Year.
https://dl.dropboxusercontent.com/u/16400709/StackOverflow/DataStack.csv
If I come up with anything in the meantime I will post below.
Update: Having taken a break to think about the problem, I have a new approach.
Add 6 hours to the date/time of each event, then count the number of events that fall within that start/end window; if there are 6 or more, take the event IDs and assign them a cluster ID. Then move on to the next event and repeat, 59k times, as a loop.
Don't use clustering. It's the wrong tool. And the wrong term. You are not looking for abstract "clusters", but something much simpler and much more well defined. In particular, your data is 1 dimensional, which makes things a lot easier than the multivariate case omnipresent in clustering.
Instead, sort your data and use a sliding window.
If your data is sorted, and time[x+5] - time[x] < 6 hours, then these events satisfy your condition.
Sorting is O(n log n), but highly optimized. The remainder is O(n) in a single pass over your data. This will beat every single clustering algorithm, because they don't exploit your data characteristics.

Use LINQ to Total Columns

I'm trying to get LINQ to SQL to grab and total this data, and I'm having a heck of a time trying to do it.
Here is my code; it doesn't error out, but it doesn't total the data.
' Get Store Record Data
Dim MyStoreNumbers = (From tnumbers In db.Table_Numbers
Where tnumbers.Date > FirstOfTheMonth
Select tnumbers)
I'm trying to create a loop that will group the data by Date and give me the totals so I can graph it.
As you can see, I'd like to get totals for Internet, TV, Phone, etc. Any help would be great, thank you!
You can group and total the numbers one by one, like this:
From tnumbers In db.Table_Numbers
Where tnumbers.Date > FirstOfTheMonth
Group By tnumbers.Date
Into TotalPhone = Sum(tnumbers.Phone)
Select Date, TotalPhone
