Countifs in SAS - count

I have a SAS dataset with 3 columns: FirmIndex, ProductIndex, and a third column called PrChange. Within each FirmIndex and ProductIndex group I want to count how many PrChange values are neither missing (.) nor 0, and add that count to a column called Number. Then I want to divide Number by the number of observations within each group that are not missing (.).
Below is an example of the dataset and the desired output.
data prod;
input firmindex productindex PrChange Number Fract;
cards;
1 1 . 1 0.250
1 1 0.00 1 0.250
1 1 0.00 1 0.250
1 1 -0.40 1 0.250
1 1 0.00 1 0.250
1 2 . 2 1.000
1 2 1.00 2 1.000
1 2 0.30 2 1.000
1 3 . 4 0.800
1 3 0.70 4 0.800
1 3 1.00 4 0.800
1 3 0.70 4 0.800
1 3 0.00 4 0.800
1 3 -0.30 4 0.800
1 4 . 5 1.000
1 4 0.20 5 1.000
1 4 -1.00 5 1.000
1 4 -0.90 5 1.000
1 4 -0.50 5 1.000
1 4 1.00 5 1.000
2 1 . 2 1.000
2 1 0.30 2 1.000
2 1 -0.50 2 1.000
2 2 . 5 0.714
2 2 0.30 5 0.714
2 2 0.10 5 0.714
2 2 0.00 5 0.714
2 2 0.00 5 0.714
2 2 0.80 5 0.714
2 2 -0.20 5 0.714
2 2 0.40 5 0.714
2 3 . 1 1.000
2 3 0.60 1 1.000
2 4 . 5 0.714
2 4 -1.00 5 0.714
2 4 0.80 5 0.714
2 4 -0.20 5 0.714
2 4 0.00 5 0.714
2 4 0.00 5 0.714
2 4 -0.70 5 0.714
2 4 0.90 5 0.714
2 5 . 3 1.000
2 5 0.90 3 1.000
2 5 -0.70 3 1.000
2 5 -0.50 3 1.000
;
run;
Here is what I tried to generate the column number, but it is not working:
data work.prod;
set work.prod;
by firmindex productindex;
if first.productindex or first.firmindex then sum = 0;
else if PrChange ne 0 and PrChange ne .;
sum = sum + 1;
run;

The problem is that you need the group totals (the numerator and the denominator) before you can write them to the rows of that group. This is where SAS differs from Excel: SAS is row-based, meaning it runs your code against each row of data (more or less) one at a time, rather than dynamically looking at every cell from every other cell as Excel does. That is much faster and more efficient, but less flexible for tasks like this.
Your question calls for a DoW loop. It takes over the normal data step loop and runs its own loop twice per BY group: once to calculate the number and fract values, then once more to write those values to every row of the group. Note that I only check last.productIndex; whenever first. or last. is true for an earlier BY variable, it is also true for every BY variable after it.
Here the first loop reads the first set of values (the first 5 records), then the second loop re-reads the same 5 records. The same then happens for the next 3 records, and so on. Both loops always read the same number of rows, so they stay in sync.
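data want;
  /* First pass: accumulate the counts for one BY group */
  do _n_ = 1 by 1 until (last.productIndex);
    set have;
    by firmindex productindex;
    number_denom = sum(number_denom, not missing(PrChange)); /* nonmissing rows */
    number = sum(number, not (PrChange in (.,0)));           /* nonmissing and nonzero rows */
  end;
  fract = number/number_denom;
  /* Second pass: re-read the same rows and output them with the group totals */
  do _n_ = 1 by 1 until (last.productIndex);
    set have;
    by firmindex productindex;
    output;
  end;
run;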

I'm going to give the IML answer that I'm able to give. Rick or someone else more IML-savvy probably can do better than this. In R or other matrix language I think this would be much easier, but I don't have the IML chops to do this without looping; maybe it's possible.
proc iml;
use have;
read all var _all_ into h;
u = h[uniqueby(h,1:2), 1:2]; *generate the "unique" categories for the first two columns;
v = j(nrow(h),5); *generate a matrix to save this into;
v[,1:3] = h; *start it out with the first three columns of the dataset;
do i = 1 to nrow(u); *iterate over the unique category matrix;
number = ncol(loc(h[loc((h[,1:2] = u[i,1:2])[,#]),3]));
*the inner LOC produces a two column 1/0 matrix with match 1 / nomatch 0 for each col
then reduce to 1 column via subscript reduction product, to get correct 1/0 match vector
the outer LOC takes the rows of h from that (so rows of h matching u), then returns nonzero/nonmissing
which then ncol summarizes into a count;
fract_denom = ncol(loc(h[loc((h[,1:2] = u[i,1:2])[,#]),3] ^= .));
*similar, but here we have to verify they are not missing explicitly, considering 0 valid;
v[loc((v[,1:2] = u[i,1:2])[,#]),4] = number; *assign to col4 of V;
v[loc((v[,1:2] = u[i,1:2])[,#]),5] = number/fract_denom; *assign to col5 of V;
end;
print v;
quit;
This uses the unique-loc method, more or less, with some modifications; there is probably an easier way to get the matches.

A SQL in SAS solution - Parfait's is probably the better one overall, but SAS's willingness to remerge makes the SASsy solution a bit simpler.
proc sql;
create table want as
select firmindex, productindex, prchange,
sum (not (prchange in (0,.))) as number,
calculated number / (sum ( not missing(prchange))) as fract
from have
group by firmindex, productindex;
quit;
SAS will do the grouping/counting/etc. and then merge back to the original dataset with no problem, skipping the need for correlated subqueries. NOT standard SQL, but quite common in SAS nonetheless.

Consider proc sql using conditional CASE WHEN correlated subqueries:
proc sql;
create table ProdChangeCount as
SELECT p.firmindex, p.productindex,
(SELECT SUM(CASE WHEN sub.PrChange ^= . AND sub.PrChange ^= 0 THEN 1 ELSE 0 END)
FROM Prod sub
WHERE sub.firmindex = p.firmindex
AND sub.productindex = p.productindex) AS Number,
CALCULATED Number /
(SELECT Count(*)
FROM Prod sub
WHERE sub.PrChange ^= .
AND sub.firmindex = p.firmindex
AND sub.productindex = p.productindex) AS Frac
FROM Prod p;
quit;

Related

How to sort a vector in R without repeating ranks

Good afternoon,
My question may seem elementary, but I'm having trouble with it.
Assume we have the following vector:
x=c(0.75,0.75,1,1,0.5,0.5,0.5,0.25,0.25)
I want to sort the vector in decreasing order and then get the indices, that is:
sort.int(x, index.return=TRUE,decreasing=TRUE)
$x
[1] 1.00 1.00 0.75 0.75 0.50 0.50 0.50 0.25 0.25
$ix
[1] 3 4 1 2 5 6 7 8 9
However, the expected output should be:
y=c(2,2,1,1,3,3,3,4,4)
This means:
1 is the highest value -----> 1
0.75 is the second highest value -----> 2
0.5 is the third highest value -----> 3
0.25 is the lowest value -----> 4
I also tried:
x=c(0.75,0.75,1,1,0.5,0.5,0.5,0.25,0.25)
order(unique(sort(x)))
[1] 1 2 3 4
sort(unique(x),decreasing=TRUE)
[1] 1.00 0.75 0.50 0.25
But I don't know how to subset from x to get the expected output y.
Thank you for your help!
sort returns every value, so duplicated values each get their own index. It seems you want every duplicate to share the index of its first occurrence. We can use match for this, since it always returns the index of the first match.
match(sort.int(x, decreasing = TRUE), unique(x))
# [1] 2 2 1 1 3 3 3 4 4
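Note that the expression above returns its result in sorted order and indexes into unique(x) in order of first appearance, which happens to coincide with y for this particular x. If the goal is a dense rank (1 for the highest value, no gaps for ties) reported in the original order of x, a minimal sketch could look like the following. The second vector x2 is a made-up example, not data from the question.

```r
x <- c(0.75, 0.75, 1, 1, 0.5, 0.5, 0.5, 0.25, 0.25)

# Look up each element of x in the distinct values sorted from high to low
y <- match(x, sort(unique(x), decreasing = TRUE))
y
# [1] 2 2 1 1 3 3 3 4 4

# Hypothetical vector whose values are not already grouped
x2 <- c(0.5, 1, 0.5, 0.25)
y2 <- match(x2, sort(unique(x2), decreasing = TRUE))
y2
# [1] 2 1 2 3
```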

create list and generate descriptives for each variable

I want to generate descriptive statistics for multiple variables at a time (close to 50), rather than writing out the code several times.
Here is a very basic example of data:
id var1 var2
1 1 3
2 2 3
3 1 4
4 2 4
I typically write out each line of code to get a frequency count and descriptives, like so:
library(psych)
table(df1$var1)
table(df1$var2)
describe(df1$var1)
describe(df1$var2)
I would like to create a list and get the output from these analyses, rather than writing out 100 lines of code. I tried this, but it is not working:
variable_list<-list(df1$var, df2$var)
for (variable in variable_list){
table(df$variable_list))
describe(df$variable_list))}
Does anyone have advice on getting this to work?
The describe function from psych accepts a data.frame and returns the descriptive statistics for each column:
library(psych)
describe(df1)
# vars n mean sd median trimmed mad min max range skew kurtosis se
#id 1 4 2.5 1.29 2.5 2.5 1.48 1 4 3 0 -2.08 0.65
#var1 2 4 1.5 0.58 1.5 1.5 0.74 1 2 1 0 -2.44 0.29
#var2 3 4 3.5 0.58 3.5 3.5 0.74 3 4 1 0 -2.44 0.29
If you only need a subset of the columns, select them by column index or column name before calling describe:
describe(df1[2:3])
Another option is descr from collapse
library(collapse)
descr(slt(df1, 2:3))
Or to select numeric columns
descr(num_vars(df1))
Or for factors
descr(fact_vars(df1))
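The question also asked for frequency counts. In the loop from the question, df$variable_list looks up a column literally named variable_list, so nothing useful comes back. One possible sketch, using only base R, applies table to each selected column with lapply. The data frame below rebuilds the small example from the question, and the names vars and freqs are chosen here for illustration.

```r
# Example data from the question
df1 <- data.frame(id = 1:4, var1 = c(1, 2, 1, 2), var2 = c(3, 3, 4, 4))

# Names of the columns to summarise (extend this vector for ~50 variables)
vars <- c("var1", "var2")

# One frequency table per column, returned as a named list
freqs <- lapply(df1[vars], table)

freqs$var1   # counts for var1: the values 1 and 2 each occur twice
freqs$var2   # counts for var2: the values 3 and 4 each occur twice
```

Printing freqs shows every table at once, so the same two lines cover any number of columns.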

Calculating velocity from a fixed point zero in R

I am trying to calculate in R the velocity from acceleration in a data frame where the first value is fixed at 0. I would like to use v=u+at to find the velocity from velocity[2:nrow(trial.data)] where t is a constant 0.002. The initial data frame looks like this:
trial.data <- data.table("acceleration" = sample(-5:5,5), "velocity" = c(0))
acceleration velocity
1 0 0
2 5 0
3 -1 0
4 3 0
5 4 0
I tried using lag from the second row, but this gives a value of zero in row 2 and the correct value in row 3, and the values that follow are incorrect.
trial.data$velocity[2:nrow(trial.data)] =
(lag(trial.data$velocity,default=0)) + trial.data$acceleration * 0.002
acceleration velocity
1 0 0.000
2 5 0.000
3 -1 0.010
4 3 -0.002
5 4 0.006
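Velocity is accumulated acceleration, so use cumsum:
library(data.table)
trial.data <- data.table("acceleration" = c(0,5,-1,3,4))
u <- 0 # starting velocity
# v = u + a*t accumulated over the rows, with t = 0.002
velocity <- c(u,u+cumsum(trial.data$acceleration)*0.002)
# drop the extra final element so the column lengths match
trial.data$velocity <- velocity[-length(velocity)]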
Output:
> trial.data
acceleration velocity
1: 0 0.000
2: 5 0.000
3: -1 0.010
4: 3 0.008
5: 4 0.014
Note that the velocity vector has one extra final element (which happens to be 0.022). It is dropped before the vector is assigned to the data table, since otherwise the columns would be of unequal length. The above code starts with u = 0, but u could be changed to any other starting velocity and the code would work as intended.
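As a variation on the same idea, the extra element can be avoided by lagging the acceleration before accumulating it. This is only a sketch in base R with plain vectors; the names a, dt, and velocity2 are chosen here for illustration, and the second starting velocity of 2 is a made-up value.

```r
a  <- c(0, 5, -1, 3, 4)   # acceleration values from the answer above
dt <- 0.002               # constant time step
u  <- 0                   # starting velocity

# Row i uses the accelerations of rows 1..(i-1); row 1 is just u
velocity <- u + c(0, cumsum(head(a, -1))) * dt
velocity
# [1] 0.000 0.000 0.010 0.008 0.014

# Same calculation with a hypothetical non-zero starting velocity
velocity2 <- 2 + c(0, cumsum(head(a, -1))) * dt
```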

Extract and organise values from a structure

I have the following structure as an output of computation:
structure(c(2L,1L,1L,2L), .Label=c("high","low"),
class="factor", prob=c(1,0.667,0.8,0.333))
What is the best way to extract information from this structure and represent in a data frame?
For instance:
Val Label Prob
2 low 1
1 high 0.667
1 high 0.8
2 low 0.333
I have tried as.numeric(), unname() but neither worked.
We can assign the parts we'd like. And as in most problems there are a few ways to get the attribute:
data.frame(Val=as.integer(x), Label=x, Prob=attr(x,"prob"))
Val Label Prob
1 2 low 1.000
2 1 high 0.667
3 1 high 0.800
4 2 low 0.333
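For completeness, here is a self-contained sketch that first assigns the structure from the question to x and then reads the attribute a second way, through attributes(). The names x, prob, and res are chosen here for illustration. Converting the factor with as.character keeps the extra prob attribute out of the Label column.

```r
# The object from the question
x <- structure(c(2L, 1L, 1L, 2L), .Label = c("high", "low"),
               class = "factor", prob = c(1, 0.667, 0.8, 0.333))

# attributes() returns all attributes as a list; this equals attr(x, "prob")
prob <- attributes(x)$prob

res <- data.frame(Val   = as.integer(x),    # underlying integer codes
                  Label = as.character(x),  # level labels as plain text
                  Prob  = prob)
res
```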

Subtracting Values in Previous Rows: Ecological Lifetable Construction

I was hoping I could get some help. I am constructing a life table, not for insurance, but for ecology (a cross-section of the population of any kind of wild fauna), so essentially censoring variables like smoker/non-smoker, pregnant, gender, health-status, etc.:
AgeClass=c(1,2,3,4,5,6)
Sample=c(100,99,87,46,32,19)
for(i in 1:6){
PropSurv=c(Sample/100)
}
LifeTab1=data.frame(cbind(AgeClass,Sample,PropSurv))
Which gave me this:
ID AgeClass Sample PropSurv
1 1 100 1.00
2 2 99 0.99
3 3 87 0.87
4 4 46 0.46
5 5 32 0.32
6 6 19 0.19
I'm now trying to calculate the number that died in each row (DeathInt) by taking the number that survived in that row and subtracting the number in the row below it (i.e. 100-99, then 99-87, then 87-46, and so on). The result should look like this:
ID AgeClass Sample PropSurv DeathInt
1 1 100 1.00 1
2 2 99 0.99 12
3 3 87 0.87 41
4 4 46 0.46 14
5 5 32 0.32 13
6 6 19 0.19 NA
I found this and this, and I wasn't sure if they answered my question as these guys subtracted values based on groups. I just wanted to subtract values by row.
Also, just as a side note: I did a for() to get the proportion that survived in each age group. I was wondering if there was another way to do it or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!
If you have a numeric vector x, you can calculate the differences between consecutive elements with the diff function.
In your case that would be:
LifeTab1$DeathInt <- c(-diff(Sample), NA)
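On the side note about the for() loop: arithmetic in R is vectorised, so the proportion can be computed without a loop. The sketch below rebuilds the table from the question in that way. Dividing by the first age class's count (100 here) instead of the literal 100 is an assumption made for illustration.

```r
AgeClass <- 1:6
Sample   <- c(100, 99, 87, 46, 32, 19)

LifeTab1 <- data.frame(AgeClass, Sample)

# Vectorised division: one operation covers every row, no loop needed
LifeTab1$PropSurv <- LifeTab1$Sample / LifeTab1$Sample[1]

# Deaths per interval: this row minus the next row; the last row has no successor
LifeTab1$DeathInt <- c(-diff(LifeTab1$Sample), NA)

LifeTab1
```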
