In SAS, how to plot counts over weeks including zero counts?

I am trying to create a plot in SAS that shows the number of lab results individual laboratories submit each week over the course of the year. I have managed to plot this out, but the plot skips weeks in which the laboratory submitted zero lab results, i.e. the count would be zero.
data testlabs;
input labdate:datetime22.3 labname$;
cards;
08JAN2019:09:40:37.000 A
07AUG2019:09:36:16.000 A
08AUG2019:13:16:51.000 B
21APR2019:09:33:54.000 B
22APR2016:12:47:51.000 B
08JUN2019:09:25:50.000 B
09JAN2019:13:48:24.000 A
10JAN2019:12:21:02.000 C
19FEB2019:14:40:39.000 C
09MAR2019:09:38:48.000 C
20NOV2019:09:50:30.000 A
07AUG2019:14:03:55.000 A
09MAR2019:09:31:39.000 B
09JUN2019:12:11:29.000 B
04APR2019:17:00:00.000 B
26NOV2019:13:05:28.000 C
09JUN2019:09:38:50.000 C
06MAY2019:12:44:20.000 C
08MAY2019:10:14:52.000 A
08JUN2019:08:43:17.000 A
02DEC2019:12:26:51.000 A
05MAY2019:12:53:17.000 B
06SEP2019:09:52:36.000 C
10MAR2019:09:31:41.000 A
08MAR2019:09:40:40.000 C
14JUL2019:09:38:59.000 B
08JAN2019:10:40:37.000 A
;
run;
proc sql;
create table testlabs1 as
select distinct count(*) as lab_count,
labname,
put(datepart(labdate),weeku6.)as wk
from testlabs
where year(datepart(labdate))>2018
group by wk, labname
order by labname, wk
;quit;
symbol color=blue interpol=join;
proc gplot data=testlabs1;
plot lab_count*(wk);
by labname;
run;quit;
This creates three plots with points only on weeks with at least one lab. I would like to plot all 52 weeks of the year, including weeks where the count is zero.

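You need a process that can create something from nothing. The COMPLETETYPES option in PROC SUMMARY/MEANS does that: it outputs every combination of the CLASS variable levels, with a frequency of zero for combinations that never occur together. Note that it only combines levels that appear somewhere in the data, so a week in which no laboratory submitted anything would still need CLASSDATA= or PRELOADFMT to show up.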
data testlabs;
input labdate:datetime22.3 labname$;
lbdate = datepart(labdate);
format lbdate weeku6.;
cards;
08JAN2019:09:40:37.000 A
07AUG2019:09:36:16.000 A
08AUG2019:13:16:51.000 B
21APR2019:09:33:54.000 B
22APR2016:12:47:51.000 B
08JUN2019:09:25:50.000 B
09JAN2019:13:48:24.000 A
10JAN2019:12:21:02.000 C
19FEB2019:14:40:39.000 C
09MAR2019:09:38:48.000 C
20NOV2019:09:50:30.000 A
07AUG2019:14:03:55.000 A
09MAR2019:09:31:39.000 B
09JUN2019:12:11:29.000 B
04APR2019:17:00:00.000 B
26NOV2019:13:05:28.000 C
09JUN2019:09:38:50.000 C
06MAY2019:12:44:20.000 C
08MAY2019:10:14:52.000 A
08JUN2019:08:43:17.000 A
02DEC2019:12:26:51.000 A
05MAY2019:12:53:17.000 B
06SEP2019:09:52:36.000 C
10MAR2019:09:31:41.000 A
08MAR2019:09:40:40.000 C
14JUL2019:09:38:59.000 B
08JAN2019:10:40:37.000 A
;;;;
run;
proc print;
run;
proc summary data=testlabs completetypes nway;
class labname lbdate / mlf;
output out=testlabs2(drop=_type_ rename=(_freq_=lab_count));
run;
proc print;
run;
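For readers who want to see the zero-filling idea outside SAS, here is a minimal Python sketch. It is not the answer's method: it assumes weeks start on Sunday, uses only a handful of the sample rows, and the names `week_start` and `complete` are made up for illustration. It builds every week between the first and last observed week, crosses that with the lab names, and fills in 0 where no count exists.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import product


def week_start(d):
    """Return the Sunday that starts the week containing d (assumed convention)."""
    # date.weekday() is Monday=0 ... Sunday=6, so Sunday lies (weekday + 1) % 7 days back.
    return d - timedelta(days=(d.weekday() + 1) % 7)


# A few (date, lab) pairs taken from the sample data in the question.
records = [
    (date(2019, 1, 8), "A"),
    (date(2019, 1, 9), "A"),
    (date(2019, 1, 10), "C"),
    (date(2019, 2, 19), "C"),
    (date(2019, 3, 9), "B"),
]

# Observed counts per (week, lab); combinations with no records are simply absent.
counts = Counter((week_start(d), lab) for d, lab in records)

# Every week from the first to the last observed week, whether or not it has data.
first = min(wk for wk, _ in counts)
last = max(wk for wk, _ in counts)
weeks = []
wk = first
while wk <= last:
    weeks.append(wk)
    wk += timedelta(days=7)

labs = sorted({lab for _, lab in counts})

# Complete grid: missing (week, lab) combinations get an explicit zero.
complete = {(w, lab): counts.get((w, lab), 0) for w, lab in product(weeks, labs)}
```

Because the week list is generated rather than taken from the data, weeks with no submissions from any lab are included as well.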

Join your aggregate to a cross join of every week with every labname.
The cross join supplies all week x labname combinations, the left join attaches the counts that exist, and COALESCE turns the missing ones into zero, which forces full coverage.
Example:
data weeks;
  do week = intnx ('week', '01jan2019'd, 0) by 7 while (year(week) <= 2019);
    output;
  end;
  format week weeku6.;
run;
data labs;
  do labname = 'A', 'B', 'C', 'D'; output; end;
run;
proc sql;
  create table testlabs1 as
  select
    labs.labname,
    year(weeks.week) as year,
    weeks.week,
    coalesce(aggregate.lab_count,0) as lab_count
  from
    labs
  cross join
    weeks
  left join
    (
      select distinct count(*) as lab_count,
        labname,
        intnx('year', datepart(labdate), 0) as yr format=year4.,
        intnx('week', datepart(labdate), 0) as wk format=weeku6.
      from testlabs
      where year(datepart(labdate))>2018
      group by yr, wk, labname
    ) aggregate
  on aggregate.labname = labs.labname
   & aggregate.wk = weeks.week
  order by
    year, labname, week
  ;
quit;
symbol color=blue interpol=join;
proc gplot data=testlabs1;
plot lab_count*(week);
by year labname;
where year = 2019;
run;quit;
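The same cross join, left join and COALESCE pattern can be tried on a toy scale with Python's built-in sqlite3 module. This is only a sketch: the tables `labs`, `weeks` and `agg` and their contents are hypothetical stand-ins, with weeks reduced to plain integers.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE labs(labname TEXT);
    CREATE TABLE weeks(week INTEGER);
    -- agg stands in for the aggregated counts; it only has rows for non-zero counts.
    CREATE TABLE agg(labname TEXT, week INTEGER, lab_count INTEGER);
    INSERT INTO labs VALUES ('A'), ('B');
    INSERT INTO weeks VALUES (1), (2), (3);
    INSERT INTO agg VALUES ('A', 1, 3), ('B', 3, 1);
""")

# CROSS JOIN builds every lab x week pair; LEFT JOIN keeps pairs with no count,
# and COALESCE replaces their NULL count with 0.
rows = con.execute("""
    SELECT labs.labname, weeks.week, COALESCE(agg.lab_count, 0) AS lab_count
    FROM labs
    CROSS JOIN weeks
    LEFT JOIN agg
      ON agg.labname = labs.labname AND agg.week = weeks.week
    ORDER BY labs.labname, weeks.week
""").fetchall()

for row in rows:
    print(row)
```

Every lab gets one row per week, so a plot drawn from `rows` has a point at zero instead of a gap.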

Related

SAS: variable count

I would like to count parts of the dataset.
This is my dataset:
YEAR | FIRMCode | FIRMName
2000 | 10 | 1
2001 | 11 | 1
.
.
2020 | 17 | 1
2000 | 11 | 2
.
.
2020 | 16 | 2
I want to count the number of types of firm codes each year, regardless of the firm name. The firm codes are from 10 to 20. So my output would look like:
YEAR | FIRMCode(10) | FIRMCode(11) ... | FIRMCode(20)
2000 | #firms with code10 | #firms with code11
.
.
2020 | #firms with code10 | #firms(11)
Thank you so much in advance.
One way is to use SQL to compute count(distinct FIRMName) within each group, and then use Proc REPORT or Proc TABULATE to present the distinct counts as columns.
Your data does not appear to model a scenario in which a firm can have multiple codes within a year; if that holds, DISTINCT is not strictly needed.
Example:
Compute the number of firms with each code within a year, keeping the categorical (long) data structure, and present the counts in a wide layout. The example also shows a transformation of the two-level categories into a wide data set (generally not recommended).
data have;
  call streaminit (20210425);
  do firmname = 'a', 'b', 'c', 'd';
    do year = 2000 to 2010;
      code = 10 + rand('integer', 0, 5);
      output;
    end;
  end;
run;
proc sql;
  create table counts as
  select year, code, count(distinct firmname) as firm_ucount
  from have
  group by year, code
  order by year, code;
quit;
proc tabulate data=counts;
class year code;
var firm_ucount;
table year, code * firm_ucount='' * mean='' * f=4.;
run;
proc report data=counts;
columns year firm_ucount,code;
define year / group;
define code / '' across;
define firm_ucount / '#firms with code';
run;
proc transpose data=counts out=want prefix=n_code_;
by year;
id code;
var firm_ucount;
run;
TABULATE, count class in column-dimension
REPORT, count variable ACROSS
TRANSPOSE, code as part of column name
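To make the two steps concrete (distinct count per year and code, then a wide layout), here is a small Python sketch. The rows are invented for illustration and the names `firms` and `wide` are not from the answer.

```python
from collections import defaultdict

# Hypothetical (year, code, firmname) rows; the last one is a deliberate duplicate.
rows = [
    (2000, 10, "a"), (2000, 10, "b"), (2000, 11, "c"),
    (2001, 10, "a"), (2001, 11, "b"), (2001, 11, "b"),
]

# Step 1: distinct firms per (year, code), the equivalent of count(distinct firmname).
firms = defaultdict(set)
for year, code, firm in rows:
    firms[(year, code)].add(firm)

# Step 2: pivot to one entry per year with one column per code; absent pairs count as 0.
codes = sorted({code for _, code in firms})
years = sorted({year for year, _ in firms})
wide = {
    year: {code: len(firms.get((year, code), ())) for code in codes}
    for year in years
}
```

The duplicate (2001, 11, "b") row is counted once, which is the effect DISTINCT has in the SQL step.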

How to get final length of a line in a query?

I am just learning SQL and I got a task: I need to find the final length of a discontinuous line when I have input such as:
start | finish
0 | 3
2 | 7
15 | 17
And the correct answer here would be 9, because the line spans 0-3, then 3-7 (the part of the second segment from 2 to 3 is ignored because it already lies within 0-3), and then 15-17. Parts that are covered more than once count only once. I am supposed to get this answer solely through an SQL query (no functions) and I am unsure how. I have tried experimenting with some code using WITH, but I can't for the life of me figure out how to ignore the overlaps properly.
My half-attempt:
WITH temp AS(
SELECT s as l, f as r FROM lines LIMIT 1),
cte as(
select s, f from lines where s < (select l from temp) or f > (select r from temp)
)
select * from cte
This really only gives me the rows that are not completely useless and extend the length, but I don't know what to do from here.
Use a recursive CTE that splits every (start, finish) interval into intervals of length 1, as many as the interval is long, and then count the distinct unit intervals (this assumes integer endpoints):
WITH cte AS (
  SELECT start x1, start + 1 x2, finish FROM temp
  WHERE start < finish -- you can omit this if start < finish is always true
  UNION
  SELECT x2, x2 + 1, finish FROM cte
  WHERE x2 + 1 <= finish
)
SELECT COUNT(DISTINCT x1) length
FROM cte
See the demo.
Result:
length
9
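The unit-interval idea can be checked locally with Python's sqlite3 module. This is a sketch under assumptions: the table is called `segments` here (a made-up name), the `RECURSIVE` keyword is written out because some databases require it, and a plain-Python sort-and-merge function (`union_length`, also hypothetical) is included as a cross-check that does not depend on integer endpoints.

```python
import sqlite3

segments = [(0, 3), (2, 7), (15, 17)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE segments(start INTEGER, finish INTEGER)")
con.executemany("INSERT INTO segments VALUES (?, ?)", segments)

# Split each interval into unit steps [x1, x1 + 1) and count the distinct starts.
query = """
WITH RECURSIVE cte(x1, x2, finish) AS (
    SELECT start, start + 1, finish FROM segments WHERE start < finish
    UNION
    SELECT x2, x2 + 1, finish FROM cte WHERE x2 + 1 <= finish
)
SELECT COUNT(DISTINCT x1) FROM cte
"""
sql_length = con.execute(query).fetchone()[0]


def union_length(intervals):
    """Total length covered by the intervals, counting overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, finish in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the current run and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, finish
        else:
            # Overlapping or touching: extend the current run.
            cur_end = max(cur_end, finish)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


merged_length = union_length(segments)
print(sql_length, merged_length)
```

Both approaches agree on the sample input; the recursive query generates one row per unit of length, so the merge approach scales better for long intervals.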

Count ID based on start date

I have a source table that looks like this
I start by counting the IDs that are Pd on the first date. Then I go to the second date and, if an ID is Pd, add it to the count. Then I go to the third date and check whether the IDs that were Pd on the previous date have changed; if they have changed, I count them in their new group. Please see the desired output. Could you please help?
Thank you
In a single-pass solution you will need to track each id's prior inv. When this tracking is in place you will
decrement an inv's count based on the id's prior inv
increment an inv's count based on the id's current inv
replace the id's prior inv in the tracker with the current inv
The number of ids is dynamic and not known a priori, and the lookup of an id's prior inv is keyed on id. The best DATA step feature for dynamic lookup is HASH.
Also, because the counts output is a pivot based on inv values, you will need to either
have a series of if/then or select/when statements to increment/decrement the inv counts, or
output the data as date, inv, count and use Proc TRANSPOSE
Data
data have;
format id 4. date yymmdd10. inv $2.;
input id date yymmdd10. event $ e_seq inv ; datalines;
100 2018-01-01 In 1 Pd
101 2018-01-01 In 1 Pd
102 2018-02-04 In 1 Pd
100 2018-02-07 N 2 NG
101 2018-02-14 P 2 G
101 2018-02-18 A 3 Pd
100 2018-03-15 A 3 Pd
102 2018-05-01 P 2 G
103 2018-06-03 In 1 Pd
run;
Sample code
Nested DOW loops are used to test for end of input data and ensure one row output for each date (the group)
data want(keep=date G NG Pd);
  if 0 then set have; * prep pdv for hash;
  * ids is the 'tracker';
  declare hash ids();
  ids.defineKey('id');
  ids.defineData('id', 'lastinv');
  ids.defineDone();
  lastinv = inv; * prep lastinv in pdv;
  do until (end);
    do until (last.date);
      set have end=end;
      where inv in ('Pd' 'G' 'NG');
      by date;
      if ids.find() = 0 then do; * decrement count based on ids prior inv;
        select (lastinv);
          when ('G') G + -1;
          when ('NG') NG + -1;
          when ('Pd') Pd + -1;
          otherwise ;
        end;
      end;
      * update ids prior inv;
      lastinv = inv;
      ids.replace();
      * increment count based on ids current inv;
      select (lastinv);
        when ('G') G + 1;
        when ('NG') NG + 1;
        when ('Pd') Pd + 1;
        otherwise ;
      end;
    end;
    OUTPUT; * <------------ output one row of counts per date;
  end;
run;
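The tracker logic can also be followed in a short Python sketch, where a dict plays the role of the hash object. This is an illustration of the decrement, replace and increment steps rather than a translation of the DATA step. It uses the sample rows above (id, date, inv only) and assumes they are already sorted by date.

```python
from itertools import groupby

# (id, date, inv) taken from the sample data, already in date order.
rows = [
    (100, "2018-01-01", "Pd"),
    (101, "2018-01-01", "Pd"),
    (102, "2018-02-04", "Pd"),
    (100, "2018-02-07", "NG"),
    (101, "2018-02-14", "G"),
    (101, "2018-02-18", "Pd"),
    (100, "2018-03-15", "Pd"),
    (102, "2018-05-01", "G"),
    (103, "2018-06-03", "Pd"),
]

last_inv = {}                         # the tracker: id -> most recent inv
counts = {"G": 0, "NG": 0, "Pd": 0}   # running counts, carried across dates
result = []

for day, group in groupby(rows, key=lambda r: r[1]):
    for id_, _, inv in group:
        prior = last_inv.get(id_)
        if prior is not None:
            counts[prior] -= 1        # the id leaves its prior group
        last_inv[id_] = inv           # remember the current inv for next time
        counts[inv] += 1              # the id joins its current group
    result.append((day, dict(counts)))  # one snapshot of the counts per date

for day, snapshot in result:
    print(day, snapshot)
```

Each date yields one snapshot, matching the one-row-per-date output of the nested DOW loops.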

SAS Counting Occurrences based on multiple layers within set time period

I am trying to count occurrences where the same person was billed for an item, four or more times, by the same place within 30 days of each instance. For example, input would look something like:
person service place date
A x shop1 01/01/15
A x shop1 01/15/15
A x shop1 01/20/15
B y shop2 03/20/15
B y shop2 04/01/15
C z shop1 05/05/15
And output would look something like:
person service place date count
A x shop1 01/01/15 3
A x shop1 01/15/15 3
A x shop1 01/20/15 3
B y shop2 03/20/15 2
B y shop2 04/01/15 2
C z shop1 05/05/15 1
I have tried stuff similar to:
data work.want;
do _n_ =1 by 1 until (last.PLACE);
set work.rawdata;
by PERSON PLACE;
if first.PLACE then count=0;
count+1;
end;
frequency= count;
do _n_ = 1 by 1 until (last.PLACE);
set work.rawdata;
by PERSON PLACE;
output;
end;
run;
This gives a count based on person and place but does not factor in time. Any help or suggestions would be greatly appreciated! Thank you
This can be done easily with proc sql...
Your data:
data have;
input person $ service $ place $;
datalines;
A x shop1
A x shop1
A x shop1
B y shop2
B y shop2
C z shop1
;
run;
Then we count the occurrences of "place" for each (person, service) group (group by 1,2) and join the result back to the original table.
proc sql;
create table want as
select a.*, b._count
from have as a
inner join
(
select person, service, count(place) as _count
from have
group by 1,2
) as b
on a.person = b.person
and a.service = b.service
;
quit;
Is there a date field? We need it in order to group the data by month (or 30 days), for example.
proc sql;
create table summary as
select person, service, place, count(*) as count
from rawdata
group by person, service, place
having count>=4;
quit;
Note: This doesn't check to see if the events occurred within 30 days of each other. I didn't know the type of data you had for this in your dataset.
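Neither query above applies the 30-day condition. One possible reading of it, sketched in Python purely as an assumption, is: for each row, count the rows with the same person, service and place whose date lies within 30 days before or after it. With the sample input from the question this reproduces the counts in the desired output, but other interpretations of "within 30 days of each instance" are possible.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample input from the question: (person, service, place, date).
rows = [
    ("A", "x", "shop1", date(2015, 1, 1)),
    ("A", "x", "shop1", date(2015, 1, 15)),
    ("A", "x", "shop1", date(2015, 1, 20)),
    ("B", "y", "shop2", date(2015, 3, 20)),
    ("B", "y", "shop2", date(2015, 4, 1)),
    ("C", "z", "shop1", date(2015, 5, 5)),
]

window = timedelta(days=30)  # assumed meaning of "within 30 days"

# Collect the dates of each (person, service, place) group.
groups = defaultdict(list)
for person, service, place, d in rows:
    groups[(person, service, place)].append(d)

# For each row, count the group's dates within the window (the row itself included).
counted = [
    (person, service, place, d,
     sum(1 for other in groups[(person, service, place)] if abs(other - d) <= window))
    for person, service, place, d in rows
]

# Rows that meet the "four or more times" threshold from the question.
flagged = [row for row in counted if row[4] >= 4]
```

With this sample no row reaches four occurrences, so `flagged` is empty.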

Sum all field with the other field < itself in sqlite

Sorry, I couldn't think of a good title for my problem.
I have a table a(f1 integer, date Long), where date is increasing, with the data
f1 date
1 1
2 2
3 3
...
I need a running sum of f1 by date: for record 1 {1,1} the sum of f1 is 1, for record 2 the sum is 1+2, for record 3 the sum is 1+2+3, and so on.
How can I do that?
This can be done with a correlated subquery:
SELECT date,
(SELECT SUM(f1)
FROM a AS a2
WHERE a2.date <= a.date
) AS f1_sum
FROM a
ORDER BY date;
But it's inefficient. Consider just scanning the table, sorted by the date, and summing f1 as you're reading it.
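The scanning alternative could look like the following Python sketch with the built-in sqlite3 module. The table and data mirror the question; the variable names are made up. The rows are read once in date order and the total is accumulated on the client side.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a(f1 INTEGER, date INTEGER)")
con.executemany("INSERT INTO a VALUES (?, ?)", [(1, 1), (2, 2), (3, 3)])

running = []
total = 0
# Single ordered scan: each row adds its f1 to the total seen so far.
for date_value, f1 in con.execute("SELECT date, f1 FROM a ORDER BY date"):
    total += f1
    running.append((date_value, total))

print(running)
```

One difference to keep in mind: if several rows share the same date, the correlated subquery gives all of them the same sum, while this scan produces a separate running total per row.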
