SAS: can the COUNT() function be used for categorical variables?

The categorical variable is "unique_carrier" (shown in a screenshot in the original post).
I want to count the number of observations for each carrier in "unique_carrier". My SAS code is below:
PROC MEANS DATA=schedule_Jan NOPRINT;
BY unique_carrier _CHARACTER_ ;
OUTPUT OUT= flight_count COUNT(unique_carrier) =number_of_flights;
RUN;
But the OUTPUT statement fails when I run it, and the log is shown below. Can the COUNT function be used to count a categorical variable?
222 OUTPUT OUT= flight_count COUNT(unique_carrier) =number_of_flights;
-----
22
76
ERROR 22-322: Syntax error, expecting one of the following: ;, (, /, CSS, CV, IDGROUP, IDGRP,
KURTOSIS, LCLM, MAX, MAXID, MEAN, MEDIAN, MIN, MINID, MODE, N, NMISS, OUT, P1,
P10, P20, P25, P30, P40, P5, P50, P60, P70, P75, P80, P90, P95, P99, PROBT, Q1,
Q3, QRANGE, RANGE, SKEWNESS, STDDEV, STDERR, SUM, SUMWGT, T, UCLM, USS, VAR.
ERROR 76-322: Syntax error, statement will be ignored.

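COUNT is not one of the statistic keywords that the OUTPUT statement of PROC MEANS accepts, as the list in the error message shows; the keyword for a count is N. You can get the count by using PROC FREQ.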
proc freq data=schedule_Jan ;
tables unique_carrier / noprint out=flight_count ;
run;
This will have the number of observations per value of UNIQUE_CARRIER in the variable COUNT. You could add a dataset option to rename it to NUMBER_OF_FLIGHTS if you want.
Or you can use PROC MEANS (aka PROC SUMMARY).
proc summary data=schedule_Jan nway;
class unique_carrier;
output out=flight_count ;
run;
This will have the number of observations in the variable _FREQ_. You could use a dataset option to rename this, or add N=number_of_flights to the OUTPUT statement to add another variable with the same count.
Or you could even 'roll your own' by writing some SQL code.
proc sql ;
create table flight_count as
select unique_carrier
, count(*) as number_of_flights
from schedule_Jan
group by 1
order by 1
;
quit;
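For readers who want to see the counting logic outside SAS, here is a minimal Python sketch of the same per-category count. The carrier codes are made-up sample values, not data from the question.

```python
from collections import Counter

# Hypothetical sample values standing in for the unique_carrier column.
unique_carrier = ["AA", "DL", "AA", "UA", "DL", "AA"]

# Counter tallies how often each category value occurs, which is the
# same result as count(*) with GROUP BY in the SQL version above.
number_of_flights = Counter(unique_carrier)

# Print one line per carrier, sorted by carrier code.
for carrier in sorted(number_of_flights):
    print(carrier, number_of_flights[carrier])
```

The point is the same as in the SAS answers: counting a categorical variable means grouping by its values and counting rows, not computing a statistic on the variable itself.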

Related

using columns as lower and upper bound in RANDOM function in Teradata

This is a Teradata-specific question. In the RANDOM function, I want the lower bound to be taken directly from one of the columns. For example, I want a random value between the age of the subscriber and the present date, so I want to write RANDOM(int_tenure, 0). I receive the error below:
"Syntax error, expected something like an integer or a decimal number or a floating point number or '+' or '-' between '(' and the word 'int_tenure'"
RANDOM can take only literals (no field or column names), and the first parameter has to be less than or equal to the second one, so this is not possible directly. But you can work around it: generate a random factor in [0;1] and apply this factor to the interval.
select 10 as lower_bound
,20 as upper_bound
-- ,random(lower_bound, upper_bound) -- will not work
,random(0, 1000)/1000.0000 as RND_Factor -- a random factor between 0 and 1
,(upper_bound-lower_bound)*RND_Factor+lower_bound;
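The arithmetic of this workaround can be checked with a short Python sketch. The function name `random_between` is hypothetical, and `randint(0, 1000) / 1000.0` stands in for Teradata's `random(0, 1000)/1000.0000`.

```python
import random

def random_between(lower_bound, upper_bound):
    # Random factor between 0 and 1 in steps of 1/1000,
    # mirroring random(0, 1000)/1000.0000 in the SQL above.
    rnd_factor = random.randint(0, 1000) / 1000.0
    # Scale the factor to the width of the interval and shift it
    # so that 0 maps to lower_bound and 1 maps to upper_bound.
    return (upper_bound - lower_bound) * rnd_factor + lower_bound

# The bounds can now come from data rather than from literals.
value = random_between(10, 20)
print(value)
```

Because the factor is always between 0 and 1, the result always lies between the two bounds, whichever of them is larger.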

Compare cell against series of cell pairs

I'm trying to make a LibreOffice spreadsheet formula that populates a column based on another input column, comparing each input with a series of range pairs defined in another sheet and finally outputting a symbol based on matched criteria. I have a series of ranges that specify a - output, and another series that corresponds to +, but not all inputs will fall into a category. I am using this trinary output later for another expression, which I already have in place.
My question becomes: how can I test input against each range pair without spelling out the cell coordinates for each individual cell (ie OR(AND(">= $A$1", "< $B$1"), AND(">=$A$2", "<$B$2"), ...))? Ideally I could just specify an array to compare against like $A$1:$B$4. Writing it in a python macro would work, too, since I don't plan on sharing this file.
I wrote a really quick list comp in python to illustrate what I'm after. This snippet would be one half, such as testing - qualification, and these values may be fed into a condition that outputs the symbol:
>>> def cmp(f, r):
... return r[0] <= f < r[1]
>>> f = (1, 2, 3)
>>> ranges = ((2, 5), (4, 6), (3, 8))
>>> [any([cmp(i, r) for r in ranges]) for i in f]
[False, True, True]
Here is a small test example with real input and real ranges.
Change the range pairs so that they are in two columns starting from A13. Be sure that they are in sorted order (Data -> Sort).
A B C
~~~~~~~~ ~~~~~~~~ ~
145.1000 145.5000 -
146.0000 146.4000 +
146.6000 147.0000 -
147.0000 147.4000 +
147.6000 148.0000 -
440.0000 445.0000 +
In each row, specify whether it is negative or positive. To do this, I entered the following formula in C13 and filled down. If the range pairs are not consistent enough then enter values for C13 and below manually.
=IF(ISODD(ROW());"-";"+")
Now, enter the following formula in cell C3 and fill down.
=IFNA(IF(
VLOOKUP(A3;A$13:C$18;2;1) >= A3;
VLOOKUP(A3;A$13:C$18;3;1);
"None");"None")
The formula finds the closest pair and then checks if the number is inside that range or not. For better testing, I would also suggest using 145.7000 as input, which should result in no shift if I understood the question correctly.
The results in column C:
-
+
None
None
Documentation: VLOOKUP, IFNA, ROW.
EDIT:
The following formula produces correct results for the example data you gave, and it works for anything between 144.0 and 148.0.
=IFNA(VLOOKUP(A3;A$13:C$18;3;1); "None")
However, 150.0 produces - and 550.0 produces +. If that is not what you want, then use the formula above that has two VLOOKUP expressions.
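Since the question mentions that a Python macro would also be acceptable, here is a minimal sketch of the two-step lookup in plain Python. It assumes the six range pairs from the table above; `classify` is a hypothetical name, and `bisect_right` plays the role of VLOOKUP's sorted match. Like the formula, it treats the upper bound as inclusive.

```python
from bisect import bisect_right

# Range pairs from the example table: (lower, upper, symbol),
# sorted by lower bound as the sorted VLOOKUP mode requires.
ranges = [
    (145.1, 145.5, "-"),
    (146.0, 146.4, "+"),
    (146.6, 147.0, "-"),
    (147.0, 147.4, "+"),
    (147.6, 148.0, "-"),
    (440.0, 445.0, "+"),
]
lowers = [r[0] for r in ranges]

def classify(value):
    # Step 1: find the last row whose lower bound is <= value.
    i = bisect_right(lowers, value) - 1
    if i < 0:
        return "None"  # below every range (the IFNA case)
    lower, upper, symbol = ranges[i]
    # Step 2: accept the row only if value does not exceed its upper bound.
    return symbol if upper >= value else "None"

print([classify(v) for v in (145.2, 146.2, 145.7, 150.0)])
```

The second step is what separates the two formulas: without the upper-bound check, any value past the start of a range inherits that range's symbol.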

PROC SQL with GROUP command extremely slow. Why? Workaround possible?

I have a MACRO which takes a data set D and essentially outputs k disjoint datasets, D_1,...,D_k. The value k is not fixed and depends on properties of the data that are not known in advance. We can assume that k is not larger than 10, though.
The dataset D contains the variables x and y, and I want to overlay the line/scatter plots of x and y for each of D_i over each other. In my particular case x is time, and I want to see the output y for each D_i and compare them to each other.
Hopefully that was clear.
How can I do this? I don't know k in advance, so I need some sort of %do loop. But it doesn't seem that I can put a do loop inside "proc sgplot".
I might be able to make a macro that includes a very long series of commands, but I'm not sure.
How can I overlay these plots in SAS?
EDIT: I am including for reference why I am trying to avoid doing a PROC SGPLOT with the GROUP clause. I tried the following code and it is taking over 30 minutes to compute (I canceled the calculation after this, so I don't know how long it will actually take). PROC SQL runs quite quickly, the program is stuck on PROC SGPLOT.
proc sql;
create table dataset as select
date, product_code, sum(num_of_records) as total_rec
from &filename
group by product_code, date
order by product_code, date
;
quit;
PROC SGPLOT Data = dataset;
scatter x = date y = total_rec/group=product_code;
title "Total records by product code";
run;
The number of observations in the file is 76,000,000.
What you should do is produce one dataset instead of k. Either change your macro so that it outputs a single dataset with a variable d_i (or whatever you can logically name it) that identifies which dataset each row would have gone to, or combine the datasets after the macro has run.
Then, you can use group to overlay your plots. So for example:
data my_data;
call streaminit(7);
do d_i = 1 to 5;
y = 10;
x = 0;
output;
do x = 1 to 10;
y + round(rand('Uniform')*3,.1)-1.5;
output;
end;
end;
run;
proc sgplot data=my_data;
series x=x y=y/group=d_i;
run;

SPSS Count depending on Conditions in several variables

I am quite new to SPSS and I need to count the number of certain errors made in a test (Stroop Test). There are three kinds of variables:
theCongruencies - can be 'I' or 'C' for incongruent or congruent
theWordkeys - code for a key that indicates the first letter of a word
thePressedKeys - code for the key pressed by the user
Each type exists 80 times, named e.g. theCongruencies_1 to theCongruencies_80.
I want to count how many times there is the same value in theWordKeys_x and thePressedKeys_x when theCongruencies_x has the value 'I'.
Example: theCongruencies_42 = 'I' theWordKeys_42 = 88 thePressedKeys_42 = 88
So I need to do something like this in my SPSS Code:
COMPUTE InhibErrs = COUNT(
IF(
theCongruencies_1 to theCongruencies_80 EQ 'I'
AND theWordkeys_1 to theWordkeys_80 EQ thePressedKeys_1 to thePressedKeys_80)).
execute.
Thanks a lot
Deego
Try this:
compute countVar=0.
do repeat theCongruencies=theCongruencies_1 to theCongruencies_80
/theWordkeys=theWordkeys_1 to theWordkeys_80
/thePressedKeys=thePressedKeys_1 to thePressedKeys_80.
compute countVar=sum(countVar, (theCongruencies="I" and theWordkeys=thePressedKeys)).
end repeat.
exe.
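To make the logic of the DO REPEAT loop explicit, here is a minimal Python sketch of the same count for one participant. The function name `count_inhib_errs` and the four sample trials are hypothetical; the real data has 80 trials per case.

```python
def count_inhib_errs(congruencies, wordkeys, pressed_keys):
    # Walk the three parallel lists trial by trial, as DO REPEAT does,
    # and count trials that are incongruent ('I') and where the
    # pressed key equals the word key.
    return sum(
        1
        for c, w, p in zip(congruencies, wordkeys, pressed_keys)
        if c == "I" and w == p
    )

# Hypothetical data: 4 trials instead of 80.
congruencies = ["I", "C", "I", "I"]
wordkeys = [88, 70, 65, 82]
pressed_keys = [88, 70, 66, 82]

print(count_inhib_errs(congruencies, wordkeys, pressed_keys))  # prints 2
```

Trial 2 matches on the keys but is congruent, and trial 3 is incongruent but the keys differ, so only trials 1 and 4 are counted.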

SAS - select n equally spaced values between a and b

How would you translate the following R command into SAS?
sequence <- seq(from=a, to=b, length.out=n)
In other words, how would you select n equally spaced values between a and b in SAS?
You could easily replicate this in SAS with a DO loop, having previously stored the required values in macro variables. I'm not sure in what context you are using this, however the code below will create a dataset with the required number of rows and equally spaced values. Hopefully this will point you in the right direction.
%let n=5;
%let a=1;
%let b=2;
%let x=%sysevalf((&b.-&a.)/(&n.-1));
%put n = &n.
a = &a.
b = &b.
x = &x.;
data test;
do i=&a. to &b. by &x.;
output;
end;
run;
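One caveat with stepping by a fractional increment is that floating-point rounding can accumulate, so a loop of this kind can stop just short of b for some values of a, b and n. A minimal Python sketch of an index-based alternative is below; the function name `seq` is borrowed from R for illustration, and returning just a when n is 1 is an assumption of this sketch.

```python
def seq(a, b, n):
    # Return n equally spaced values from a to b inclusive.
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return [a]  # assumption of this sketch: a single value is just a
    step = (b - a) / (n - 1)
    # Compute each value from an integer index rather than by repeated
    # addition, so rounding errors do not build up along the way.
    values = [a + i * step for i in range(n)]
    values[-1] = b  # pin the last value to b exactly
    return values

print(seq(1, 2, 5))
```

The same idea carries over to the DATA step: loop an integer counter from 0 to n-1 and compute the value as a plus the counter times the step inside the loop.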
