Random sampling without replacement in longitudinal data - r

My data is longitudinal.
VISIT ID VAR1
1 001 ...
1 002 ...
1 003 ...
1 004 ...
...
2 001 ...
2 002 ...
2 003 ...
2 004 ...
Our end goal is to pick out 10% of each visit to run a test. I tried PROC SURVEYSELECT to do SRS without replacement, using VISIT as the stratum, but the final sample contains duplicated IDs across strata. For example, ID=001 might be selected in both VISIT=1 and VISIT=2.
Is there any way to do this with SURVEYSELECT or another procedure (R is also fine)? Thanks a lot.

This is possible with some fairly creative DATA step programming. The code below uses a greedy approach: it samples from each visit in turn, considering only IDs that have not previously been sampled. If more than 90% of the IDs for a visit have already been sampled, fewer than 10% are output. In the extreme case, when every ID for a visit has already been sampled, no rows are output for that visit.
/* Create some test data */
data test_data;
  call streaminit(1);
  do visit = 1 to 1000;
    do id = 1 to ceil(rand('uniform')*1000);
      output;
    end;
  end;
run;

data sample;
  /* Create a hash object to keep track of unique IDs not sampled yet */
  if 0 then set test_data;
  call streaminit(0);
  if _n_ = 1 then do;
    declare hash h();
    rc = h.definekey('id');
    rc = h.definedata('available');
    rc = h.definedone();
  end;
  /* Find out how many not-previously-sampled IDs there are for the current visit */
  do ids_per_visit = 1 by 1 until(last.visit);
    set test_data;
    by visit;
    if h.find() ne 0 then do;
      available = 1;
      rc = h.add();
    end;
    available_per_visit = sum(available_per_visit, available);
  end;
  /* Read through the current visit again, randomly sampling from the not-yet-sampled IDs */
  samprate = 0.1;
  number_to_sample = round(available_per_visit * samprate, 1);
  do _n_ = 1 to ids_per_visit;
    set test_data;
    if available_per_visit > 0 then do;
      rc = h.find();
      if available = 1 then do;
        if rand('uniform') < number_to_sample / available_per_visit then do;
          available = 0;
          rc = h.replace();
          samples_per_visit = sum(samples_per_visit, 1);
          output;
          number_to_sample = number_to_sample - 1;
        end;
        available_per_visit = available_per_visit - 1;
      end;
    end;
  end;
run;

/* Check that there are no duplicate IDs */
proc sort data = sample out = sample_dedup nodupkey;
  by id;
run;
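The same greedy idea is easy to prototype outside SAS. Below is a minimal Python sketch (the function name and data layout are mine, not from the post): it walks visits in order, draws roughly 10% of the IDs not yet taken, and never re-selects an ID, which is exactly the guarantee the hash object provides above.

```python
import random
from collections import defaultdict

def sample_per_visit(rows, rate=0.1, seed=1):
    """Greedy stratified sample without ID replacement across strata.

    `rows` is a list of (visit, id) pairs. For each visit in order,
    sample `rate` of the IDs that have not been taken in any earlier
    visit. A sketch of the approach, not the SURVEYSELECT output.
    """
    rng = random.Random(seed)
    by_visit = defaultdict(list)
    for visit, id_ in rows:
        by_visit[visit].append(id_)
    taken = set()          # plays the role of the SAS hash object
    sample = []
    for visit in sorted(by_visit):
        available = [i for i in by_visit[visit] if i not in taken]
        k = round(len(available) * rate)
        for id_ in rng.sample(available, k):
            taken.add(id_)
            sample.append((visit, id_))
    return sample
```

As in the DATA step version, later visits draw from a shrinking pool, so the per-visit fraction can fall below 10% once most IDs are used up.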


Teradata: Sum Total Amount of Paid Claims by Duplicate DX Flag per Member

First, I'm sure there is a cleaner way to do this, but it's the only way I've been able to make the code combine the DXs into one column. Originally they were in separate columns as 0/1 flags and I needed them in one column. I tried the PIVOT function but wasn't able to figure it out.
The issue is that I need the paid amounts to be duplicated per DX. That sounds counterintuitive, but it's what this report requires.
For example, if member A has COPD, ASTHMA, and DIABETES, and the member's paid claims were 40,000, then I need the paid amount for that member to reflect 120,000 (40,000 per diagnosis), and so forth.
The code:
SELECT
    DX_FLAG
    ,Sum(AMT_PAID) AS PHARM_PAID_AMT
    ,Count(DISTINCT(MEMBER_AMISYS_NBR)) AS MEMBER_COUNT
FROM
    (SELECT
        st.MEMBER_AMISYS_NBR
        ,ph.PHARMACY_CLAIM_CK
        ,ph.AMT_PAID
        ,FILL.DATE_DATE AS Fill_Date
        ,Coalesce(CASE WHEN DX_ASTHMA = 'ASTHMA' THEN 'Asthma' END,
                  CASE WHEN DX_COPD = 'COPD' THEN 'COPD' END,
                  CASE WHEN DX_DIABETES = 'DIABETES' THEN 'DIABETES' END,
                  CASE WHEN DX_HEART_FAILURE = 'HEART FAILURE' THEN 'HEART_FAILURE' END,
                  CASE WHEN DX_HYPERTENSION = 'HYPERTENSION' THEN 'HYPERTENSION' END)
         AS DX_FLAG
     FROM
        STATE_OVERALL_MBRS st
        JOIN FT_PHARMACY_CLAIM ph ON st.MEMBER_CURR_CK = ph.PRESCRIBER_MEMBER_CURR_CK AND ph.DELETED_IND = 'N'
        JOIN DIM_DATE FILL ON ph.FILL_DATE_DIM_CK = FILL.DATE_DIM_CK
     WHERE FILL.DATE_DATE BETWEEN '2021-10-01' AND '2022-09-30'
        AND ph.PLAN_DIM_CK = 10
        AND ph.REVERSAL_IND = 'N'
        AND ph.AMT_PAID > 0
    ) rx
GROUP BY DX_FLAG
My output looks like this:
DX_FLAG        PHARM_PAID_AMT  MEMBER_COUNT
DIABETES       70,000,000      14,144
COPD           38,266,409      6,641
HEART_FAILURE  10,908,000      2,544
ASTHMA         125,000,000     30,000
HYPERTENSION   52,900          22,325
I have tried adding/removing DISTINCT from each select statement; the only change that made a difference was removing DISTINCT from the line below, in which case I ended up with far too many member counts (even taking into account the duplicate DX counts).
,Count(DISTINCT(MEMBER_AMISYS_NBR)) AS MEMBER_COUNT
The State_Overall_Mbrs table with the DX flags looks like this, and I needed all the diagnoses to be in one column (with duplicate rows for members depending on how many diagnoses they have):
Member ID Asthma COPD Hypertension Diabetes CHF
55555555 0 1 1 1 0
66666666 1 0 0 1 0
77777777 0 0 1 0 0
Normalize the members table, then join and aggregate; something like this:
SELECT
DX_FLAG
,Sum(AMT_PAID) AS PHARM_PAID_AMT
,Count(DISTINCT(MEMBER_AMISYS_NBR)) AS MEMBER_COUNT
FROM
(SELECT * FROM State_Overall_Members
UNPIVOT (has_dx /* New column to hold the 0 or 1 value */
FOR DX_FLAG IN (Asthma,COPD,Hypertension,Diabetes,CHF)
/* Original column names become the values in new column DX_FLAG */
) nmlz
WHERE has_dx = 1 /* Only unpivot rows with a 1 in original column */
) st
JOIN FT_PHARMACY_CLAIM ph ON st.MEMBER_CURR_CK = ph.PRESCRIBER_MEMBER_CURR_CK AND ph.DELETED_IND = 'N'
JOIN DIM_DATE FILL ON ph.FILL_DATE_DIM_CK = FILL.DATE_DIM_CK
WHERE FILL.DATE_DATE BETWEEN '2021-10-01' AND '2022-09-30'
AND ph.PLAN_DIM_CK =10
AND ph.REVERSAL_IND = 'N'
AND ph.AMT_PAID > 0
GROUP BY DX_FLAG;
Another option to normalize the members table would be to have a subquery for each DX and UNION those, along these lines:
... FROM
(SELECT MEMBER_CURR_CK, MEMBER_AMISYS_NBR, AMT_PAID, 'Asthma' (VARCHAR(16)) AS DX_FLAG
FROM State_Overall_Members
WHERE Asthma = 1
UNION ALL
SELECT MEMBER_CURR_CK, MEMBER_AMISYS_NBR, AMT_PAID, 'COPD' (VARCHAR(16)) AS DX_FLAG
FROM State_Overall_Members
WHERE COPD = 1
UNION ALL
...
) st
JOIN ...
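The reshaping itself is easy to check in a general-purpose language. Here is a hedged Python sketch of the same normalization (function and column names are my own): each member becomes one row per diagnosis flag set to 1, which is why a 40,000 paid amount triples to 120,000 for a member with three diagnoses.

```python
def unpivot_dx(rows, dx_cols):
    """Normalize wide 0/1 diagnosis flags into (member, dx) pairs.

    `rows` are dicts with a 'member_id' key plus one 0/1 key per
    diagnosis column. Only flags set to 1 produce output rows -- the
    same job the UNPIVOT / UNION ALL queries above do in Teradata.
    A sketch for illustration, not the actual table layout.
    """
    out = []
    for row in rows:
        for col in dx_cols:
            if row.get(col) == 1:
                out.append((row["member_id"], col))
    return out
```

After this step, joining claims on member and summing per diagnosis naturally multiplies each member's paid amount by their diagnosis count, while COUNT(DISTINCT member) within each diagnosis stays correct.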

Count ID based on start date

I have a source table that looks like this.
Starting from the first date, I count the IDs whose inv is Pd. Then I move to the 2nd date and, if an ID is still Pd, I add it to the count. At the 3rd date I check whether the Pd IDs from the previous date have changed; if they have, I count them in their new group. Please see the desired output. Could you please help?
Thank you
In a single-pass solution you will need to track each id's prior inv. With this tracking in place you will:
decrement an inv's count based on the id's prior inv
increment an inv's count based on the id's current inv
in the tracker, replace the id's prior inv with the current inv
The number of ids is dynamic and not known a priori, and the prior-inv lookup is keyed on id. The best DATA step feature for dynamic lookup is a HASH object.
Also, because the counts output is a pivot on the inv values, you will need to either:
use a series of if/then or select/when statements to increment/decrement the inv counts, or
output the data as date inv count and run Proc TRANSPOSE.
Data
data have;
  format id 4. date yymmdd10. inv $2.;
  input id date yymmdd10. event $ e_seq inv;
datalines;
100 2018-01-01 In 1 Pd
101 2018-01-01 In 1 Pd
102 2018-02-04 In 1 Pd
100 2018-02-07 N 2 NG
101 2018-02-14 P 2 G
101 2018-02-18 A 3 Pd
100 2018-03-15 A 3 Pd
102 2018-05-01 P 2 G
103 2018-06-03 In 1 Pd
run;
Sample code
Nested DOW loops are used to test for end of input data and to ensure one row is output for each date (the group).
data want(keep=date G NG Pd);
  if 0 then set have;  * prep pdv for hash;
  * ids is the 'tracker';
  declare hash ids();
  ids.defineKey('id');
  ids.defineData('id', 'lastinv');
  ids.defineDone();
  lastinv = inv;  * prep lastinv in pdv;
  do until (end);
    do until (last.date);
      set have end=end;
      where inv in ('Pd' 'G' 'NG');
      by date;
      if ids.find() = 0 then do;  * decrement count based on id's prior inv;
        select (lastinv);
          when ('G')  G + -1;
          when ('NG') NG + -1;
          when ('Pd') Pd + -1;
          otherwise;
        end;
      end;
      * update id's prior inv;
      lastinv = inv;
      ids.replace();
      * increment count based on id's current inv;
      select (lastinv);
        when ('G')  G + 1;
        when ('NG') NG + 1;
        when ('Pd') Pd + 1;
        otherwise;
      end;
    end;
    output;  * <------------ output one row of counts per date;
  end;
run;
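The decrement/increment bookkeeping is small enough to prototype outside SAS. Here is a hedged Python sketch of the same tracker (a dict in place of the hash object; names are my own), fed with the data from the post:

```python
def counts_by_date(events):
    """Track each id's prior inv and keep running counts per inv value.

    `events` is a list of (date, id, inv) tuples sorted by date. For
    each event, decrement the count of the id's prior inv (if any),
    record the current inv as the new prior, and increment its count.
    Returns one (date, counts) row per date. A sketch of the logic,
    not the SAS output format.
    """
    last = {}                       # id -> prior inv (the 'tracker')
    counts = {"G": 0, "NG": 0, "Pd": 0}
    out = []
    prev_date = None
    for date, id_, inv in events:
        if inv not in counts:       # mirrors the WHERE filter
            continue
        if prev_date is not None and date != prev_date:
            out.append((prev_date, dict(counts)))
        if id_ in last:
            counts[last[id_]] -= 1  # decrement id's prior inv
        last[id_] = inv             # replace prior inv with current
        counts[inv] += 1            # increment id's current inv
        prev_date = date
    if prev_date is not None:
        out.append((prev_date, dict(counts)))
    return out
```

Each output row is already the pivoted shape (one column per inv value), which is what the select/when blocks achieve in the DATA step.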

substr instr length for loop

Good morning, guys. I have a problem with some code. I work in health care, and the complaint code must be a checkbox, but they asked for a report containing the treatment codes, which appear in the database like this: 1:15:2:3, etc. So I need to handle each code separately.
I have to count until I get ":", then take the number (which can be 1 or 2 digits) and inner join it with the other table.
Can anyone help me fix this function, find the problem in the loop, and get each number?
create or replace function hcc_get_tcd_codes (p_id in number)
  return varchar2 is
  x number := 0;
  y number := 0;
  z number;
  code1 number;
  code_name varchar2(15);
begin
  for i in 0 .. x
  loop
    select length(t.tcd_codes) into x from hcc_patient_sheet t where t.id = p_id;           -- full length (9)
    select instr(tcd_codes, ':') into y from hcc_patient_sheet t where t.id = p_id;         -- position of the first code (3)
    select instr(tcd_codes, ':') + 1 + y into z from hcc_patient_sheet t where t.id = p_id; -- full code position + 1
    i := x - y;
    select substr(t.tcd_codes, z, instr(tcd_codes, ':') - 1) into code1
      --,select substr(t.tcd_codes, 0, instr(tcd_codes, ':')-1) as code2
      from Hcc_Patient_Sheet t
     where t.id = 631;
    select t.alt_name into code_name from hcc_complaint_codes t where t.code = code1;
    select instr(tcd_codes, ':') into y from hcc_patient_sheet t where t.id = p_id;         -- position of the first code
    return code_name;
  end loop;
end;
As is often the case with common-sounding string-processing problems, a wheel has already been invented, and even packaged:
select * from table(apex_string.split('THIS:IS:GREAT',':'));
Piecemeal SUBSTR doesn't seem to be the best option; I'd suggest you split that colon-separated-values string into rows, as follows:
SQL> with test (col) as
2 (select '1:15:2:3' from dual)
3 select regexp_substr(col, '[^:]+', 1, level) one_value
4 from test
5 connect by level <= regexp_count(col, ':') + 1;
ONE_VALUE
--------------------------------
1
15
2
3
SQL>
and use such an option in your query; something like this:
select ...
into ...
from some_table t
where t.id in (select regexp_substr(that_string, '[^:]+', 1, level) one_value
from dual
connect by level <= regexp_count(that_string, ':') + 1
);
If it has to be row-by-row, use the above option as a source for the cursor FOR loop, as
for cur_r in (select regexp_substr(that_string, '[^:]+', 1, level) one_value
from dual
connect by level <= regexp_count(that_string, ':') + 1
)
loop
do_something_here
end loop;
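For comparison, the split itself is a one-liner in a general-purpose language; the lookup against the codes table then works on the resulting list. A minimal Python sketch (function name is mine):

```python
def split_codes(s):
    """Split a colon-separated treatment-code string such as '1:15:2:3'
    into its individual codes, dropping empty pieces. This is the same
    job regexp_substr / apex_string.split do in the Oracle answers
    above; a sketch for illustration only.
    """
    return [code for code in s.split(":") if code]
```

Each returned code (1 or 2 digits) can then be joined against the lookup table, which is what the IN (subquery) version does set-wise in SQL.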

How to calculate a row value based on the previous row value in the same column

I have the following data set:
DATE CODE RANK PARTITION
? ABS 0 1
12/04/2014 RET 1 1
20/04/2014 RET 2 1
01/05/2014 ABS 2 1
13/05/2014 RET 2 1
01/06/2015 ABS 2 1
09/10/2015 RETk 2 1
? ABS 0 2
02/04/2015 RET 1 2
03/04/2015 RET 2 2
04/04/2015 ABS 2 2
05/04/2015 STT 3 2
06/04/2015 RETk 4 2
07/04/2015 RETk 4 2
RANK is the column I want to calculate in my SQL given the columns DATE, CODE AND the previous value of the same column. It's initialized here to 0.
The logic I want to implement is as follows:
If RANK-1 (previous row) IS NULL AND CODE = ABS THEN RANK = 0
If RANK-1 (previous row) IS NULL AND CODE <> ABS THEN RANK <- (RANK-1) + 1
If RANK-1 = 0 or 1 AND CODE = RET THEN RANK <- (RANK-1) + 1
If RANK-1 = 2 AND CODE = STT THEN RANK <- (RANK-1) + 1
If RANK-1 = 3 AND CODE = RETk THEN RANK <- (RANK-1) + 1
If CODE = ABS THEN RANK <- (RANK-1) (previous row)
Else 0
The Teradata release I am using is R14. The calculation is done on a partition basis, as shown in the example above. I have added some more constraints to the model to make it clearer. In this example, if the current code is RET, I do not increase the rank unless the previous one is 0 or 1. Similarly, if my current code is RETk, I do not increase the rank unless the previous one is equal to 3; otherwise, I do not change the rank. I repeat the same process in the following partition, and so on...
I cannot figure out how to update the current column value given the previous one... I tried many logic implementations with OLAP functions without success.
Can anyone give me a hint?
Thank you very much for your help
You can always use a recursive query for tasks like this, but performance will be bad unless the number of rows per group is low.
First you need a way to advance to the next row. As the next row's date can't be calculated from the current row's date, you must materialize the data and add a ROW_NUMBER:
CREATE TABLE tab(dt DATE, CODE VARCHAR(10), rnk INT, part INT);
INSERT INTO tab( NULL,'ABS' ,0 , 1);
INSERT INTO tab(DATE'2014-04-12','RET' ,1 , 1);
INSERT INTO tab(DATE'2014-04-20','RET' ,2 , 1);
INSERT INTO tab(DATE'2014-05-01','ABS' ,2 , 1);
INSERT INTO tab(DATE'2014-05-13','RET' ,2 , 1);
INSERT INTO tab(DATE'2014-06-01','ABS' ,2 , 1);
INSERT INTO tab(DATE'2014-10-09','RETk',2 , 1);
INSERT INTO tab( NULL,'ABS' ,0 , 2);
INSERT INTO tab(DATE'2015-04-02','RET' ,1 , 2);
INSERT INTO tab(DATE'2015-04-03','RET' ,2 , 2);
INSERT INTO tab(DATE'2015-04-04','ABS' ,2 , 2);
INSERT INTO tab(DATE'2015-04-05','STT' ,3 , 2);
INSERT INTO tab(DATE'2015-04-06','RETk',4 , 2);
INSERT INTO tab(DATE'2015-04-07','RETk',4 , 2);
CREATE VOLATILE TABLE vt AS
(
SELECT dt, code, part
-- used to find the next row
,ROW_NUMBER() OVER (PARTITION BY part ORDER BY dt) AS rn
FROM tab
) WITH DATA
PRIMARY INDEX(part, rn)
ON COMMIT PRESERVE ROWS
;
And now it's just applying your logic using CASE row after row:
WITH RECURSIVE cte (dt, code, rnk, part, rn) AS
(
SELECT
dt
,code
,CASE WHEN code = 'ABS' THEN 0 ELSE 1 END
,part
,rn
FROM vt
WHERE rn = 1
UNION ALL
SELECT
vt.dt
,vt.code
,CASE
WHEN cte.rnk IN (0,1) AND vt.CODE = 'RET' THEN cte.rnk + 1
WHEN cte.rnk = 2 AND vt.CODE = 'STT' THEN cte.rnk + 1
WHEN cte.rnk = 3 AND vt.CODE = 'RETk' THEN cte.rnk + 1
WHEN vt.CODE = 'ABS' THEN cte.rnk
ELSE cte.rnk
END
,vt.part
,vt.rn
FROM vt JOIN cte
ON vt.part =cte.part
AND vt.rn =cte.rn + 1
)
SELECT *
FROM cte
ORDER BY part, dt;
But I think your logic is not actually like this (based on the previous row's exact RANK value); you're just stuck in procedural thinking :-)
You might be able to do what you want using OLAP functions only...
Something along the lines of:
create table table1
(
datecol date,
code varchar(10),
rankcol integer
);
--insert into table1 select '2014/05/13', 'RETj', 0;
select
case
when s1.code='ABS' and s2.rankcol = 1 then 1
when s1.code='RET' and s2.rankcol = 0 then 1
when s1.code='RET' and s2.rankcol = 1 then 2
else 0
end RET_res,
s1.*, s2.*
from
(select rankcol, code, row_number() OVER (order by datecol) var1 from table1) s1,
(select rankcol, code, row_number() OVER (order by datecol) var1 from table1) s2
where s1.var1=s2.var1-1
order by s1.var1
;
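To see the rule set in isolation, here is a small Python sketch of the same row-by-row logic the recursive CTE implements (function name and tuple layout are my own, and the carry-forward branch collapses the redundant ABS/ELSE cases):

```python
def compute_rank(rows):
    """Apply the stated rank rules sequentially within each partition.

    `rows` is a list of (date, code, partition) tuples sorted by
    partition, then date. The first row of a partition gets 0 for ABS
    and 1 otherwise; afterwards the rank only advances when the
    (prev_rank, code) pair matches a rule, else it carries forward.
    A sketch of the logic, not Teradata code.
    """
    out = []
    prev_rank = None
    prev_part = None
    for date, code, part in rows:
        if part != prev_part:                      # first row of a partition
            rank = 0 if code == "ABS" else 1
        elif prev_rank in (0, 1) and code == "RET":
            rank = prev_rank + 1
        elif prev_rank == 2 and code == "STT":
            rank = prev_rank + 1
        elif prev_rank == 3 and code == "RETk":
            rank = prev_rank + 1
        else:                                      # ABS or no rule matched
            rank = prev_rank
        out.append((date, code, rank, part))
        prev_rank, prev_part = rank, part
    return out
```

Run against partition 2 of the sample data, this reproduces the expected RANK column 0, 1, 2, 2, 3, 4, 4, which matches what the CTE computes row after row.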

PL/SQL Simple logic error, can't figure out

OK, so this is my code...
DECLARE
  V_INVENTORY_ITEM     INVENTORY.ITEM%TYPE;
  V_INVENTORY_PRICE    INVENTORY.PRICE%TYPE;
  V_INVENTORY_ONHAND   INVENTORY.ONHAND%TYPE;
  V_TRANS_ITEM         TRANSACTION.ITEM%TYPE;
  V_TRANS_CODE         TRANSACTION.CODE%TYPE;
  V_NEW_INVE_ITEM      NEW_INVENTORY.ITEM%TYPE;
  V_NEW_INVE_SOLD      NEW_INVENTORY.SOLD%TYPE;
  V_NEW_INVE_RETURNED  NEW_INVENTORY.RETURNED%TYPE;
  V_NEW_INVE_ONHAND    NEW_INVENTORY.ONHANDNEW%TYPE;
  V_NEW_INVE_PURCHASED NEW_INVENTORY.PURCHASED%TYPE;
  V_NEW_INVE_ORIGINAL  NEW_INVENTORY.ONHANDORIG%TYPE;
  CURSOR INVEN_CURSOR IS
    SELECT ITEM, PRICE, ONHAND FROM INVENTORY
    ORDER BY ITEM;
  CURSOR TRANS_CURSOR IS
    SELECT ITEM, CODE FROM TRANSACTION
    WHERE V_INVENTORY_ITEM = ITEM
    ORDER BY ITEM;
BEGIN
  OPEN INVEN_CURSOR;
  LOOP
    FETCH INVEN_CURSOR INTO V_INVENTORY_ITEM, V_INVENTORY_PRICE, V_INVENTORY_ONHAND;
    EXIT WHEN INVEN_CURSOR%NOTFOUND;
    V_NEW_INVE_SOLD := 0;
    V_NEW_INVE_RETURNED := 0;
    V_NEW_INVE_ONHAND := 0;
    V_NEW_INVE_PURCHASED := 0;
    V_NEW_INVE_ORIGINAL := V_INVENTORY_ONHAND;
    OPEN TRANS_CURSOR;
    LOOP
      FETCH TRANS_CURSOR INTO V_TRANS_ITEM, V_TRANS_CODE;
      EXIT WHEN TRANS_CURSOR%NOTFOUND;
      IF V_TRANS_CODE = 'P' THEN
        V_NEW_INVE_ONHAND := V_INVENTORY_ONHAND + 1;
        V_NEW_INVE_PURCHASED := V_NEW_INVE_PURCHASED + 1;
        V_NEW_INVE_ORIGINAL := V_INVENTORY_ONHAND;
      END IF;
      IF V_TRANS_CODE = 'R' THEN
        V_NEW_INVE_RETURNED := V_NEW_INVE_RETURNED + 1;
        V_NEW_INVE_ONHAND := V_INVENTORY_ONHAND + 1;
        V_NEW_INVE_ORIGINAL := V_INVENTORY_ONHAND;
      END IF;
      IF V_TRANS_CODE = 'S' THEN
        V_NEW_INVE_SOLD := V_NEW_INVE_SOLD + 1;
        V_NEW_INVE_ONHAND := V_INVENTORY_ONHAND - 1;
        V_NEW_INVE_ORIGINAL := V_INVENTORY_ONHAND;
      END IF;
    END LOOP;
    INSERT INTO NEW_INVENTORY
    VALUES (V_INVENTORY_ITEM, V_NEW_INVE_SOLD, V_NEW_INVE_PURCHASED, V_NEW_INVE_RETURNED, V_INVENTORY_ONHAND, V_NEW_INVE_ONHAND);
    CLOSE TRANS_CURSOR;
  END LOOP;
  CLOSE INVEN_CURSOR;
END;
/
I'm trying to update a table, which is an inventory table...
This reads the transaction table and updates a new table (the new inventory).
Something is wrong in my IF statements, because every variable comes out as 0.
Any suggestions?
my tables
SQL> select * from inventory;
ITEM PRICE ONHAND
--------------- ---------- ----------
BALL 12.99 5
PEN 1.99 10
PENCIL 2.99 1
PAPER 5.99 3
ERASER .99 6
BACKPACK 19.99 10
STAPLER 3.99 12
RULER 4.99 9
NOTEBOOK 6.99 12
9 rows selected.
SQL>
SQL> select * from transaction;
ITEM CO
--------------- --
BALL P
BALL R
BALL S
BALL S
BALL S
PEN R
PEN S
PEN S
PEN P
PENCIL S
PENCIL R
PENCIL S
PENCIL P
PAPER S
PAPER S
PAPER S
ERASER R
ERASER S
ERASER S
ERASER P
BACKPACK S
BACKPACK S
BACKPACK S
BACKPACK P
STAPLER R
STAPLER S
RULER S
NOTEBOOK S
NOTEBOOK S
NOTEBOOK S
NOTEBOOK S
NOTEBOOK S
NOTEBOOK S
33 rows selected.
SQL>
SQL> select * from new_inventory;
ITEM SOLD RETURNED ONHAND
--------------- ---------- ---------- ----------
BACKPACK 0 0 0
BALL 0 0 0
ERASER 0 0 0
NOTEBOOK 0 0 0
PAPER 0 0 0
PEN 0 0 0
PENCIL 0 0 0
RULER 0 0 0
STAPLER 0 0 0
9 rows selected.
Try putting the following statements before opening the TRANS_CURSOR:
V_NEW_INVE_SOLD := 0;
V_NEW_INVE_RETURNED := 0;
V_NEW_INVE_ONHAND := 0;
There is no reason to a) reinvent the join (you've got two cursors that you're manually doing a nested loop join on - why do that, when the Oracle optimizer is perfectly capable of joining two tables together and deciding the best join method itself?) or b) do the calculations and inserts row by row (aka slow-by-slow).
Instead, you can achieve the whole thing in a single insert statement like so:
insert into new_inventory (item,
new_inve_purchased,
new_inve_returned,
new_inve_sold,
orig_onhand,
new_inve_onhand) -- Amend as appropriate; I guessed at what the new_inventory column names were.
select item,
nvl(new_inve_purchased, 0) new_inve_purchased,
nvl(new_inve_returned, 0) new_inve_returned,
nvl(new_inve_sold, 0) new_inve_sold,
nvl(onhand, 0) orig_onhand,
nvl(onhand, 0)
+ nvl(new_inve_purchased, 0)
+ nvl(new_inve_returned, 0)
- nvl(new_inve_sold, 0) new_inve_onhand
from (select inv.item,
inv.onhand,
trn.code
from inventory inv
inner join transaction trn on (inv.item = trn.item))
pivot (sum(1) for code in ('P' as new_inve_purchased,
'R' as new_inve_returned,
'S' as new_inve_sold));
The benefits of using a single SQL statement to do the work are:
It's easier to debug - you can run the select statement on its own to see what it's doing
It'll be more performant; you're letting the database SQL engine do the majority of the work, rather than having PL/SQL talk to SQL, and SQL returning results back to PL/SQL for each row in the inventory table.
Because it's much more compact than the corresponding PL/SQL, there's less code to keep track of, making it much easier to read and understand.
Note also that I've specified the list of columns that you're inserting into (although I had to guess at their names - you'll need to amend as appropriate!).
This is good practice (especially if it's code that will end up in production!) as failing to do so could lead to trouble down the line when someone adds a new column into the table. Best be specific now, and avoid such problems entirely!
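To sanity-check the arithmetic that single INSERT ... SELECT performs, here is a hedged Python sketch of the same per-item aggregation (function name and data layout are mine, not the actual tables): count P/R/S transactions per item, then derive the new on-hand figure.

```python
from collections import Counter

def new_inventory(inventory, transactions):
    """Aggregate P/R/S transaction counts per item, as the PIVOT does.

    `inventory` maps item -> original on-hand quantity; `transactions`
    is a list of (item, code) pairs with code in {'P', 'R', 'S'}.
    New on-hand = original + purchased + returned - sold, matching the
    nvl(...) arithmetic in the SQL above. A sketch for illustration.
    """
    counts = Counter(transactions)          # (item, code) -> occurrences
    result = {}
    for item, onhand in inventory.items():
        p = counts[(item, "P")]
        r = counts[(item, "R")]
        s = counts[(item, "S")]
        result[item] = {
            "purchased": p, "returned": r, "sold": s,
            "onhand_orig": onhand,
            "onhand_new": onhand + p + r - s,
        }
    return result
```

For BALL in the sample data (on hand 5, transactions P, R, S, S, S) this gives purchased 1, returned 1, sold 3, and a new on-hand of 4, which is what the set-based statement inserts in one pass.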
