Cumulative distinct count in Hive

Here is some sample data from the table daily_user. Each row represents an active user on a specific day; revenue is the money generated by that user on that day. The earliest date in this table is 1/1.
date user_id group revenue
1/1 1 a 1
1/1 2 b 0
1/1 3 a 0
1/2 2 b 10
1/2 3 a 0
1/3 3 a 1
The output I want is below. Each row tells me, for each group, how many users have ever paid from 1/1 through the observation date. For example, the last row means that from 1/1 to 1/3, group b had 1 user in total who paid us:
end_date group # users who ever paid
1/1 a 1
1/1 b 0
1/2 a 1
1/2 b 1
1/3 a 2
1/3 b 1
There seem to be some UDFs to do a cumulative sum, but I am not sure if there is any cumulative distinct count function that I can leverage here. Is there any way to structure a Hive query to implement this?

I think the solution is to collect_set the users (collect unique values) and take the size of the array, for small numbers of users (i.e., sets that fit in memory):
SELECT size( collect_set( user_id ) ) as uniques,
end_date, group
FROM daily_user
GROUP BY end_date, group;
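Note that this counts distinct users per (end_date, group) independently, not cumulatively from 1/1. One hedged sketch of a cumulative variant for the small-data case (table and column names from the question) pairs every observation date with all paying rows on or before it; a caveat is that groups with no payers yet, like b on 1/1, will be missing rather than reported as 0:
SELECT d.date AS end_date,
du.group,
size( collect_set( du.user_id ) ) AS users_ever_paid
FROM ( SELECT DISTINCT date FROM daily_user ) d
CROSS JOIN daily_user du -- cross join: keep this to small tables
WHERE du.date <= d.date
AND du.revenue > 0
GROUP BY d.date, du.group;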
For large numbers of uniques, you'll need a probabilistic data structure, like sketch sets or HyperLogLogs, available as UDFs from the Brickhouse library ( http://github.com/klout/brickhouse ). This will give you an estimate which is close, but not the exact number of uniques:
SELECT estimated_reach( sketch_set( user_id )) as uniques_est,
end_date, group
FROM daily_user
GROUP BY end_date, group;
You can also merge these, so that you can combine pre-calculated collections/sketches from previous days. Something like:
SELECT size(combine_unique( unique_set ) ) as uniques,
group
FROM daily_uniques
WHERE end_date > date_add( today, -30 )
GROUP BY group;
or
SELECT estimated_reach( union_sketch( unique_sketch) ) as uniques,
group
FROM daily_uniques
WHERE end_date > date_add( today, -30 )
GROUP BY group;
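These merge queries assume a daily_uniques table holding one pre-aggregated collection or sketch per day and group. A minimal sketch of how such a table might be populated (the table layout here is an assumption; collect_set is built in, sketch_set is the Brickhouse UDF used above):
CREATE TABLE daily_uniques AS
SELECT date AS end_date,
group,
collect_set( user_id ) AS unique_set,   -- exact, for small cardinalities
sketch_set( user_id ) AS unique_sketch  -- approximate, for large ones
FROM daily_user
GROUP BY date, group;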

The expression if(revenue=0,1,0) has value 1 if the revenue is 0 and value 0 otherwise. Summing it per date and group gives the number of users who had revenue of 0 on that day:
select
date as end_date,
group,
sum(if(revenue=0,1,0)) as users_with_no_revenue_that_day
from
daily_user
group by
date,
group
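Flipping the condition counts the users who did pay on each day; note this is still a per-day figure, not the cumulative distinct count the question asks for (a one-line variant of the query above):
select
date as end_date,
group,
sum(if(revenue > 0, 1, 0)) as users_who_paid_that_day
from
daily_user
group by
date,
group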

The simplest way of doing this, without writing a custom UDF, would be some sort of cartesian join. Note that the join is on group alone, not on user_id: a user who paid earlier may have no activity row on a later date, so joining on user_id would miss them. Restricting payments to those on or before each observation date and counting distinct payers gives the cumulative figure:
select
du.date as end_date,
du.group,
count(distinct if(mon.date <= du.date, mon.user_id, NULL)) as cumulative_spenders
from
(select distinct date, group from daily_user) du
LEFT OUTER JOIN
(
select
distinct
user_id,
date,
group
from
daily_user
where
revenue > 0
) mon
ON
(du.group = mon.group)
group by
du.date,
du.group
This generates a row per paying user per observation date within each group, then counts the distinct payers up to each date.

Related

In SAS, how do get distinct counts of a variable that has multiple observations for each individual?

A simple question, but I have three variables in my dataset: ID, ICDCode, and a visit date. There are multiple occurrences of each ICDCode per person (ID). How do I obtain a total, distinct count of ICD codes for the entire dataset, without counting a certain ICDCode twice for an individual? For example, I want to say that there are 100 cases of heart disease in my dataset (without counting heart disease 10 times for the same person).
Below is code I have tried:
proc freq data= cases;
table ICDCode;
run;
proc sql;
select ICDCode,
count(*) as Frequency
from cases
group by ID;
quit;
How about simply the following? (Given that 429.9 represents heart disease.)
data cases;
input ID ICDCode;
datalines;
1 429.9
1 429.9
1 2
1 3
3 429.9
3 429.9
3 3
2 1
2 2
;
proc sql;
select count(distinct ID) as n
from cases
where ICDCode = 429.9;
quit;
Count the distinct patient ids when grouping by icd code.
Example:
data have;
call streaminit(123);
do patid = 1 to 100;
do dxseq = 1 to 10;
if rand('uniform') < 0.25 or dxseq = 1 then
code = '429.9'; /* hey Oprah, everybody gets heart disease! */
else
code = put(428 + round(3*rand('uniform'),0.1), 5.1);
output;
end;
end;
run;
proc sql;
create table codefreq as
select code, count(distinct patid) as pat_count
from have
group by code;
quit;
First sort with nodupkey to restrict to one copy of each observed ID/ICDcode combination, then run a simple frequency table.
proc sort data=cases out=want nodupkey;
by id icdcode;
proc freq data=want;
tables icdcode;
run;
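Hand-working this against the cases data above: after NODUPKEY, each ID/ICDCode pair appears once, so PROC FREQ reports each code counted once per person:
ICDCode Frequency
1       1
2       2
3       2
429.9   2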

Count Number of Values Based on Subgroups and 2 Columns

I was wondering if someone could help me produce this data in R. It is rather complicated, and I am not sure how to start. I apologize in advance if my question is not clear. I am trying to create a unique dataset. Essentially, I am trying to divide my data into four groups and count how many times an individual receives a certain value(s) within a group based on a certain column’s value.
I am looking at roll call data among legislators and how they voted. Specifically, I have panel data with four variables: id is the individual legislator’s identification number; the struggle variable is whether a member had trouble voting (dichotomous); vote indicates how the member voted (it can take on any value between 0 and 9 and it is a categorical variable); and rollcall is the roll call number or an id for each roll call.
First, I would like the data separated into two groups. This separation would be based on whether member 999 (id) took any value for the vote column that equals 1 through 6. If he did, I would like all those roll call votes (and their members) in one category. All the remaining roll call votes (where his vote does not equal 1 through 6) would go, with their members, into a separate group.
Second, I would like to separate both groups that were created from the above step (did member 999 take any value that equals 1-6 on the vote variable or not) by whether an individual legislator struggled to vote (struggle) or they did not struggle to vote. Thus, I would have four groups total.
Third, based on the vote variable, I would like to add up the total number of times an individual legislator received the value 7, 8, or 9 (in each of the four groups). Thus, I would have four new variables and values for each member.
Here is the code to produce example data:
id=c(999,1,2, 999,1,2,999,1,2,999,1,2)
Struggle=c("NO", "YES", "NO", "NO", "NO", "YES", "NO", "NO", "YES", "YES", "YES", "YES")
Vote=c(1,9,1,9,0,1,2,9,9,9,9,1)
Rollcall=c(1,1,1,2,2,2,3,3,3,4,4,4)
data=cbind(id, Struggle, Vote, Rollcall)
I would like the output to contain, for each legislator (id), counts in four groups:
A indicates the group in which member 999 received a value between 1-6 on the Vote variable AND the legislator (id) struggled.
B indicates the group in which member 999 received a value between 1-6 on the Vote variable and the legislator (id) did not struggle.
C indicates the group in which member 999 did not receive a value between 1-6 on the Vote variable and the legislator (id) struggled.
D indicates the group in which member 999 did not receive a value between 1-6 on the Vote variable and the legislator (id) did not struggle.
The number in each group indicates the number of times a legislator received a 7, 8, or 9 within that group (A, B, C, or D).
Does anyone have any advice or potential code to produce this data? I appreciate any assistance someone could provide. Again, I apologize for this complicated question and any lack of clarity.
Interesting question! From what I understand, every group A, B, C, or D in your output is defined by two conditions: whether id = 999 has Vote in 1:6 or not, and whether Struggle is YES or NO.
Within each Rollcall, the first condition evaluates the same for every member. So we first determine the first condition per Rollcall, then left_join it to the original data, and then summarize.
library(tidyverse)
data <- data.frame(id, Struggle, Vote, Rollcall)
data %>%
filter(id==999) %>%
mutate(cond = Vote %in% 1:6) %>%
select(Rollcall, cond) %>%
left_join(data, by='Rollcall') %>%
group_by(id) %>%
summarize(A = sum( (cond == TRUE) & (Struggle == 'YES') ),
B = sum( (cond == TRUE) & (Struggle == 'NO') ),
C = sum( (cond == FALSE) & (Struggle == 'YES') ),
D = sum( (cond == FALSE) & (Struggle == 'NO') ))
The first four lines of the pipeline evaluate the first condition (whether the Vote of 999 is between 1 and 6) for each Rollcall.
We left_join that to the original data and create the 4 groups based on your criteria.
Output:
id A B C D
<dbl> <int> <int> <int> <int>
1 1 1 1 1 1
2 2 1 1 2 0
3 999 0 2 1 1
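One caveat: the summarize above counts every matching row, regardless of the legislator's own vote. If the counts should be restricted to votes of 7, 8, or 9, as the question describes, a sketch of that variant adds the filter to each sum (this will change the output shown above):
data %>%
filter(id == 999) %>%
mutate(cond = Vote %in% 1:6) %>%
select(Rollcall, cond) %>%
left_join(data, by = 'Rollcall') %>%
group_by(id) %>%
summarize(A = sum( cond & Struggle == 'YES' & Vote %in% 7:9),
B = sum( cond & Struggle == 'NO' & Vote %in% 7:9),
C = sum(!cond & Struggle == 'YES' & Vote %in% 7:9),
D = sum(!cond & Struggle == 'NO' & Vote %in% 7:9))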

Moving sum with dates for Teradata

I have a situation where I have to create a moving sum for the past 6 months. My data looks like
A B 20-Jan-18 20
A B 20-Mar-18 45
A B 10-Apr-18 15
A B 21-May-18 30
A B 30-Jul-18 10
A B 15-Aug-18 25
And the expected result is
A B 20-Jan-18 20 20 Sum of row1
A B 20-Mar-18 45 65 Sum of row1+2
A B 10-Apr-18 15 80 Sum of row1+2+3
A B 21-May-18 30 110 Sum of row1+2+3+4
A B 30-Jul-18 10 100 Sum of row2+3+4+5 (as row1 is > 6 months in the past)
A B 15-Aug-18 25 125 Sum of row2+3+4+5+6
I tried to use the solution proposed in an earlier thread by inserting dummy records for dates where there is no record and then using ROWS BETWEEN 181 PRECEDING AND CURRENT ROW.
But there may be situations where there are multiple records on the same day, which means that choosing the last 181 rows will lead to the earliest records getting dropped.
I have checked a lot of cases on this forum and others but can't find a solution for this moving sum where the window size is not constant. Please help.
Teradata doesn't implement RANGE in windowed aggregates, but you can use old-style SQL to get the same result. If the number of rows per group is not too high, it's very efficient, but it needs an intermediate table (unless the GROUP BY columns are the PI of the source table). The self-join on the PI columns results in an AMP-local direct join plus local aggregation; without matching PIs it will be a less efficient join plus global aggregation:
create volatile table vt as
( select a,b,datecol,sumcol
from mytable
) with data
primary index(a,b);
select t1.a, t1.b, t1.datecol
,sum(t2.sumcol)
from vt as t1
join vt as t2
on t1.a=t2.a
and t1.b=t2.b
and t2.datecol between t1.datecol -181 and t1.datecol
group by 1,2,3
Of course this will not work as expected if there are multiple rows per day (the n*m self-join would inflate the number of rows feeding the sum). You need some unique column combination; this defect_id might be useful.
Otherwise you need to switch to a scalar subquery, which takes care of non-uniqueness but is usually less efficient:
create volatile table vt as
( select a,b,defect_id,datecol,sumcol
from mytable
) with data
primary index(a,b);
select t1.*
,(select sum(t2.sumcol)
from vt as t2
where t1.a=t2.a
and t1.b=t2.b
and t2.datecol between t1.datecol -181 and t1.datecol
)
from vt as t1
To use your existing approach, you must aggregate those multiple rows per day first, as sketched below.
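A minimal sketch of that pre-aggregation, assuming the same column names as above — collapse to one row per a/b/day first, so each day enters the 181-day self-join exactly once:
create volatile table vt_daily as
( select a, b, datecol, sum(sumcol) as sumcol
from mytable
group by a, b, datecol
) with data
primary index(a,b);
-- then run the earlier self-join against vt_daily instead of vt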

Counting dummy variables for each person in SAS?

I have a large data set with multiple rows for each individual. Each individual has a unique ID, and each row has dummy variables coded 1 or 0 for the type of doctor's visit it is. E.g., if a visit is at the doctor's office, doctor is coded 1; if not, 0. I want to count how many visits of each type each individual has. I tried using count distinct:
proc sql;
create table all as select ID,
count (distinct doctor) as doctor1
from data
group by ID;
quit;
However, this does not seem to be giving me what I want.
Any help or pointers on what codes to use would be really appreciated.
Sample data:
data this;
input rid dateofvisit :mmddyy10. doctor hospital clinic;
format dateofvisit mmddyy10.;
datalines;
1 1/1/2014 1 0 0
1 1/3/2014 0 1 0
2 1/5/2014 1 0 0
3 1/6/2014 1 0 0
1 1/7/2014 1 0 0
3 1/8/2014 0 0 1
;
run;
The count function normally counts all occurrences. Together with distinct, it counts the number of different values. That is not what you want, if I understand you correctly.
Since your occurrences are coded as ones, you can use the sum function to calculate how many times your patient visited each kind of doctor:
proc sql;
create table all as select rid,
sum (doctor) as doctor_visits,
sum (hospital) as hospital_visits,
sum (clinic) as clinic_visits,
sum(sum(doctor, hospital, clinic)) as total_visits
from this
group by rid;
quit;
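For the sample data above, this yields (hand-worked from the rows shown): rid 1 with 2 doctor visits, 1 hospital visit, and 0 clinic visits (3 total); rid 2 with 1 doctor visit (1 total); rid 3 with 1 doctor visit and 1 clinic visit (2 total).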

How to calculate the columns adding the value of the previous row in SQLite?

What I want to do is compute a running total when I select records from the table, where the last column is derived from two columns of the table, [Value1] and [Value2]. For the first record, the last column is simply [Value1] - [Value2].
The second record will then be:
value of (previous row's last column) + ([Value1] - [Value2])
and so on for each subsequent record.
The columns are as below,
[ID],[Value1],[Value2]
Now the records will be like below,
[ID] [Value1] [Value2] [Result]
1 10 5 10 - 5 = 5
2 15 7 5 + (15 - 7) = 13
3 100 50 13 + (100 - 50) = 63
and so on......
SQLite (before version 3.25, which added window functions) doesn't support running totals, but for your data and your desired result it's possible to factor out the arithmetic and write the query like this:
SELECT t.id, t.value1, t.value2, SUM(t1.value1 - t1.value2)
FROM table1 AS t
JOIN table1 AS t1 ON t.id >= t1.id
GROUP BY t.id, t.value1, t.value2
http://sqlfiddle.com/#!7/efaf1/2/0
This query will slow down as your row count increases, so if you're planning to run this on a large table, you may want to run the calculation outside of SQLite, or use a window function on a newer SQLite, as sketched below.
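On SQLite 3.25 or later, a window function expresses the running total directly and avoids the quadratic self-join (a sketch using the question's table and column names):
SELECT id, value1, value2,
SUM(value1 - value2) OVER (ORDER BY id) AS result
FROM table1;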
