PROC SQL with GROUP command extremely slow. Why? Workaround possible?

I have a MACRO which takes a data set D and essentially outputs k disjoint datasets, D_1,...,D_k. The value k is not fixed and depends on properties of the data that are not known in advance. We can assume that k is not larger than 10, though.
The dataset D contains the variables x and y, and I want to overlay the line/scatter plots of x and y for each of D_i over each other. In my particular case x is time, and I want to see the output y for each D_i and compare them to each other.
Hopefully that was clear.
How can I do this? I don't know k in advance, so I need some sort of %do loop. But it doesn't seem that I can put a do loop inside "proc sgplot".
I might be able to make a macro that includes a very long series of commands, but I'm not sure.
How can I overlay these plots in SAS?
EDIT: I am including for reference why I am trying to avoid doing a PROC SGPLOT with the GROUP clause. I tried the following code and it is taking over 30 minutes to compute (I canceled the calculation after this, so I don't know how long it will actually take). PROC SQL runs quite quickly, the program is stuck on PROC SGPLOT.
proc sql;
create table dataset as select
date, product_code, sum(num_of_records) as total_rec
from &filename
group by product_code, date
order by product_code, date
;
quit;
PROC SGPLOT Data = dataset;
scatter x = date y = total_rec/group=product_code;
title "Total records by product code";
run;
The number of observations in the file is 76,000,000.

Either change your macro to produce a single dataset with a variable d_i (or whatever name is logical) that identifies which dataset each row would have gone to (or holds whatever value determines that), or combine the datasets after the macro runs.
Then, you can use group to overlay your plots. So for example:
data my_data;
call streaminit(7);
do d_i = 1 to 5;
y = 10;
x = 0;
output;
do x = 1 to 10;
y + round(rand('Uniform')*3,.1)-1.5;
output;
end;
end;
run;
proc sgplot data=my_data;
series x=x y=y/group=d_i;
run;

Related

How to Plot a Series of Rates Over Time in SAS

I have 3 data sets: "Complete", "Incomplete", and "Case_List". "Complete" contains records of individuals who have had a full series of a vaccine; "Incomplete" is identical except that the number of doses is less than the full series; and "Case_List" contains confirmed cases of a specific infection. Each data set contains a date, which I have transformed into week of the year (1:53); the individual's age, which I have divided into age groups (easiest to refer to as 1:8, but they're character variables); and an ID. Every ID/record in "Complete" is, by definition, in "Incomplete", as the individual received dose 1 before dose 2, but I don't have access to any personal identifiers to link them to the "Case_List" IDs.
I am new to SAS and have yet to find enough instruction on plotting to be able to plot a graph with Case_List over Week (1:53) overlaid with Incomplete over Week (1:53) and Complete over Week (1:53), all broken down by Age_Group (1:8). If I can't get it figured out, I will just plot everything in R.
Other thoughts:
Is it easier to merge Incomplete and Complete so there are only two data sets?
Is 8 iterations of a graph that already contains 3 lines going to be too messy for one plot?
Thanks for your help.
In SAS, you can't overlay plots from multiple datasets - you need to combine everything into one dataset.
You don't have to "merge" anything, though, just set them together and add a "category" variable.
data incompletes completes case_list;
call streaminit(7);
do week = 1 to 53;
do _i = 1 to 200;
age = rand('Integer',1,8);
_output = rand('Uniform');
if _output lt (0.1+week/100) then output completes;
if _output lt (0.2+week/80) then output incompletes;
if _output lt (0.2-((week/150)**2)) then output case_list;
end;
end;
run;
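data total;
/* Set the length up front; otherwise the first assignment ("Complete") fixes it at 8 and truncates the longer values. */
length category $13;
set completes(in=_comp) incompletes(in=_incomp) case_list(in=_case);
if _comp then category="Complete";
else if _incomp then category="Incomplete";
else category="Disease Cases";
run;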
Then you can overlay plots, depending on exactly what you want to do.
proc sgplot data=total;
vline week/group=category;
run;
You could add paneling by age as noted in the comments, or you have a few other options depending on what exactly you do, but I think this gets at what you really want to know - how do I overlay plots in SAS.

Compare cell against series of cell pairs

I'm trying to make a LibreOffice spreadsheet formula that populates a column based on another input column, comparing each input with a series of range pairs defined in another sheet and finally outputting a symbol based on matched criteria. I have a series of ranges that specify a - output, and another series that corresponds to +, but not all inputs will fall into a category. I am using this trinary output later for another expression, which I already have in place.
My question becomes: how can I test input against each range pair without spelling out the cell coordinates for each individual cell (ie OR(AND(">= $A$1", "< $B$1"), AND(">=$A$2", "<$B$2"), ...))? Ideally I could just specify an array to compare against like $A$1:$B$4. Writing it in a python macro would work, too, since I don't plan on sharing this file.
I wrote a really quick list comp in python to illustrate what I'm after. This snippet would be one half, such as testing - qualification, and these values may be fed into a condition that outputs the symbol:
>>> def cmp(f, r):
...     return r[0] <= f < r[1]
>>> f = (1, 2, 3)
>>> ranges = ((2, 5), (4, 6), (3, 8))
>>> [any([cmp(i, r) for r in ranges]) for i in f]
[False, True, True]
Here is a small test example with real input and real ranges.
Change the range pairs so that they are in two columns starting from A13. Be sure that they are in sorted order (Data -> Sort).
A         B         C
~~~~~~~~  ~~~~~~~~  ~
145.1000  145.5000  -
146.0000  146.4000  +
146.6000  147.0000  -
147.0000  147.4000  +
147.6000  148.0000  -
440.0000  445.0000  +
In each row, specify whether it is negative or positive. To do this, I entered the following formula in C13 and filled down. If the range pairs are not consistent enough then enter values for C13 and below manually.
=IF(ISODD(ROW());"-";"+")
Now, enter the following formula in cell C3 and fill down.
=IFNA(IF(
VLOOKUP(A3;A$13:C$18;2;1) >= A3;
VLOOKUP(A3;A$13:C$18;3;1);
"None");"None")
The formula finds the closest pair and then checks if the number is inside that range or not. For better testing, I would also suggest using 145.7000 as input, which should result in no shift if I understood the question correctly.
The results in column C:
-
+
None
None
Documentation: VLOOKUP, IFNA, ROW.
EDIT:
The following formula produces correct results for the example data you gave, and it works for anything between 144.0 and 148.0.
=IFNA(VLOOKUP(A3;A$13:C$18;3;1); "None")
However, 150.0 produces - and 550.0 produces +. If that is not what you want, then use the formula above that has two VLOOKUP expressions.
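Since the question mentions that a Python macro would also be acceptable, the logic of the two-VLOOKUP formula can be sketched roughly as below. This is only an illustration, assuming the six sorted range rows from the example table; the names RANGES, LOWS, and classify are invented for this sketch and are not part of LibreOffice.

```python
from bisect import bisect_right

# Range pairs from the example table, sorted by lower bound (assumed data).
RANGES = [
    (145.1, 145.5, "-"),
    (146.0, 146.4, "+"),
    (146.6, 147.0, "-"),
    (147.0, 147.4, "+"),
    (147.6, 148.0, "-"),
    (440.0, 445.0, "+"),
]
LOWS = [low for low, _high, _sign in RANGES]


def classify(value):
    """Return "-", "+" or "None", mirroring the two-VLOOKUP formula."""
    # Like VLOOKUP with a sorted range: find the last row whose lower
    # bound is <= value.
    i = bisect_right(LOWS, value) - 1
    if i < 0:
        return "None"  # below every range (the IFNA branch)
    _low, high, sign = RANGES[i]
    # Second check: the value must not exceed that row's upper bound.
    return sign if value <= high else "None"


print([classify(v) for v in (145.3, 146.2, 145.7, 150.0)])
```

As in the spreadsheet formula, the upper bound is treated as inclusive here; change `<=` to `<` if the upper bound should be exclusive, as in the question's Python snippet.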

SAS: can the COUNT() function be used for categorical variables?

The categorical variable is "unique_carrier" (the original post showed its values in an image, not reproduced here).
I want to count the number of observations for each carrier in "unique_carrier". My SAS code is below:
PROC MEANS DATA=schedule_Jan NOPRINT;
BY unique_carrier _CHARACTER_ ;
OUTPUT OUT= flight_count COUNT(unique_carrier) =number_of_flights;
RUN;
However, this fails on the OUTPUT statement, and the log is shown below. Can the COUNT function be used to count a categorical variable?
222 OUTPUT OUT= flight_count COUNT(unique_carrier) =number_of_flights;
-----
22
76
ERROR 22-322: Syntax error, expecting one of the following: ;, (, /, CSS, CV, IDGROUP, IDGRP,
KURTOSIS, LCLM, MAX, MAXID, MEAN, MEDIAN, MIN, MINID, MODE, N, NMISS, OUT, P1,
P10, P20, P25, P30, P40, P5, P50, P60, P70, P75, P80, P90, P95, P99, PROBT, Q1,
Q3, QRANGE, RANGE, SKEWNESS, STDDEV, STDERR, SUM, SUMWGT, T, UCLM, USS, VAR.
ERROR 76-322: Syntax error, statement will be ignored.
You can get the count by using PROC FREQ.
proc freq data=schedule_Jan ;
tables unique_carrier / noprint out=flight_count ;
run;
This will have the number of observations per value of UNIQUE_CARRIER in the variable COUNT. You could add a dataset option to rename it to NUMBER_OF_FLIGHTS if you want.
Or you can use PROC MEANS (aka PROC SUMMARY).
proc summary data=schedule_Jan nway;
class unique_carrier;
output out=flight_count ;
run;
This will have the number of observations in the variable _FREQ_. You could use a dataset option to rename this, or add N=number_of_flights to the OUTPUT statement to add another variable with the same count.
Or you could even 'roll your own' by writing some SQL code.
proc sql ;
create table flight_count as
select unique_carrier
, count(*) as number_of_flights
from schedule_Jan
group by 1
order by 1
;
quit;

SAS to R conversion of merge

I am currently working on converting a SAS macro into R code. I have worked a lot with R but I am relatively new to SAS. I am having trouble understanding the SAS code for a merge command:
data dates;
merge A(keep=date rename=(date=beg))
A(keep=date firstobs= 5 rename=(date=end))
A(keep=date firstobs= 10 rename=(date=bega))
A(keep=date firstobs= 15 rename=(date=ee))
A(keep=date firstobs= 30 rename=(date=eend));
index+1;
if nmiss(beg,end,bega,eend,ee)=0;
run;
I understand that this command is joining the file A to itself 5 times, but I am not able to visualize the output. What do 'index+1' and 'if' stand for?
What is the R version for this code?
I'm not very familiar with R, but I know some SAS. I'm not sure I would call this a macro... The output of your merged data set will depend on what your input data set looks like. Just run your code, and you'll be able to see the result in your work library...
Generally, the data step is structured like an implicit loop. The index+1 looks like the sum statement with the syntax: variable+expression. In this case, the value of index after +1 will be retained for another iteration.
The if statement here contains a boolean condition (i.e. it can have the value of either True or False, but not both) to set a constraint when outputting the data step. If it's true, the current row of data will be outputted. nmiss(var1,var2,var3,...) is a function that will return the number of arguments specified inside nmiss() that are missing. E.g. if only var1 is missing, nmiss(var1,var2,var3,...) = 1.
As Yick says, the index+1 statement creates a new variable in your output data set that begins with one and increments for each observation processed.
The statement if nmiss(...)=0; is called a subsetting IF: observations for which nmiss() returns a non-zero result (at least one of the listed variables is missing) are not written out to your final dataset.
The best way to visualize the results will be for you to run this code twice using a small test dataset, once using that if statement and once without. For example:
data a;
do i=1 to 50;
date = today() + i;
output;
end;
run;
data dates1;
merge A(keep=date rename=(date=beg))
A(keep=date firstobs= 5 rename=(date=end))
A(keep=date firstobs= 10 rename=(date=bega))
A(keep=date firstobs= 15 rename=(date=ee))
A(keep=date firstobs= 30 rename=(date=eend));
index+1;
if nmiss(beg,end,bega,eend,ee)=0;
format beg end bega ee eend yymmdd10.;
run;
data dates2;
merge A(keep=date rename=(date=beg))
A(keep=date firstobs= 5 rename=(date=end))
A(keep=date firstobs= 10 rename=(date=bega))
A(keep=date firstobs= 15 rename=(date=ee))
A(keep=date firstobs= 30 rename=(date=eend));
index+1;
format beg end bega ee eend yymmdd10.;
run;
After running the above, open both datasets in SAS and compare them side-by-side. The effect of the subsetting-IF statement should be obvious, as well as probably help you understand why this was done (a clever trick, by the way). I added a FORMAT statement to make it a bit easier to see.
It's been a while since I wrote R (so this might not be the best code), but this would be roughly equivalent to
n = nrow(a)
dates = data.frame(cbind(
1:(n-29),
a[1:(n-29),"date"],
a[5:(n-25),"date"],
a[10:(n-20),"date"],
a[15:(n-15),"date"],
a[30:n,"date"]
))
names(dates) = c("index","beg","end","bega","ee","eend")
As you said, you are merging A onto itself 5 times. As others have said, the index+1 statement simply acts as a row index count. The if nmiss(...)=0; statement means you only get rows where everything lines up.
So use the cbind() function in R to do the merge. cbind() requires that you have like lengths on the inputs so you have to adjust your ranges. These ranges are the equivalent to the firstobs= option on the input Data Set plus the subsetting if ... ; statement.
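To make the alignment easier to picture, here is a rough sketch of the same self-merge in Python. It is only an illustration, assuming the dates are held in a plain list; FIRSTOBS and self_merge are names invented for this sketch, and the offsets are taken from the firstobs= options in the question.

```python
# firstobs= values from the MERGE statement, keyed by the renamed column.
FIRSTOBS = {"beg": 1, "end": 5, "bega": 10, "ee": 15, "eend": 30}


def self_merge(dates):
    """Pair each row with the rows 4, 9, 14 and 29 positions further down."""
    # The subsetting IF keeps a row only while the furthest-shifted copy
    # (firstobs=30) still has data, which leaves len(dates) - 29 rows.
    n_rows = max(len(dates) - (max(FIRSTOBS.values()) - 1), 0)
    rows = []
    for i in range(n_rows):
        row = {"index": i + 1}  # same role as the sum statement index+1
        for name, firstobs in FIRSTOBS.items():
            row[name] = dates[firstobs - 1 + i]
        rows.append(row)
    return rows


rows = self_merge(list(range(1, 51)))  # stand-in for 50 dates
print(len(rows), rows[0], rows[-1])
```

With 50 input rows this keeps 21 rows: the first pairs rows 1, 5, 10, 15 and 30, and the last pairs rows 21, 25, 30, 35 and 50.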

SAS - select n equally spaced values between a and b

How would you translate the following R command into SAS?
sequence <- seq(from=a, to=b, length.out=n)
In other words, how would you select n equally spaced values between a and b in SAS?
You could easily replicate this in SAS with a DO loop, having previously stored the required values in macro variables. I'm not sure in what context you are using this, however the code below will create a dataset with the required number of rows and equally spaced values. Hopefully this will point you in the right direction.
%let n=5;
%let a=1;
%let b=2;
%let x=%sysevalf((&b.-&a.)/(&n.-1));
%put n = &n.
a = &a.
b = &b.
x = &x.;
data test;
do i=&a. to &b. by &x.;
output;
end;
run;
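One thing to watch with a fractional step: repeatedly adding the step can accumulate floating-point error, so for some values of a, b and n the last point may be missed or slightly off. A common way around this is to compute each value from its integer position instead. The sketch below shows that idea in Python purely as an illustration; equally_spaced is a made-up name, and the same index-based formula could be used inside a DO loop over an integer counter.

```python
def equally_spaced(a, b, n):
    """Return n equally spaced values from a to b inclusive."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return [float(a)]  # a single point: just the start value
    step = (b - a) / (n - 1)
    # Compute each value from its index rather than accumulating the step,
    # so rounding errors do not build up along the sequence.
    return [a + k * step for k in range(n)]


print(equally_spaced(1, 2, 5))
```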
