Counting observations using multiple BY groups SAS - count

I am examining prescription patterns within a large EHR dataset. The data is structured so that we are given several key bits of information, such as patient_num, encounter_num, ordering_date, medication, age_event (age at event) etc. Example below:
Patient_num enc_num ordering_date medication age_event
1111 888888 07NOV2008 Wellbutrin 48
1111 876578 11MAY2011 Bupropion 50
2222 999999 08DEC2009 Amitriptyline 32
2222 999999 08DEC2009 Escitalopram 32
3333 656463 12APR2007 Imipramine 44
3333 643211 21DEC2008 Zoloft 45
3333 543213 02FEB2009 Fluoxetine 45
Currently I have the dataset sorted by patient_num and then by ordering_date, so that I can see what each individual was prescribed during their encounters in a longitudinal fashion. For now, I am most concerned with the prescription(s) that were made during their first visit. I wrote some code to count the number of prescriptions and had originally restricted later analyses to RX = 1, but as we can see, that doesn't work for people with multiple scripts in the same encounter (Patient 2222).
data pt_meds_;
set pt_meds;
by patient_num;
if first.patient_num then RX = 1;
else RX + 1;
run;
Patient_num enc_num ordering_date medication age_event RX
1111 888888 07NOV2008 Wellbutrin 48 1
1111 876578 11MAY2011 Bupropion 50 2
2222 999999 08DEC2009 Amitriptyline 32 1
2222 999999 08DEC2009 Escitalopram 32 2
3333 656463 12APR2007 Imipramine 44 1
3333 643211 21DEC2008 Zoloft 45 2
3333 543213 02FEB2009 Fluoxetine 45 3
I think it would be more appropriate to recode the encounter numbers into a new variable so that they reflect a style similar to the RX variable. Where each encounter is listed 1-n, and the number will repeat if multiple scripts are made in the same encounter. Such as below:
Patient_num enc_num ordering_date medication age_event RX Enc_
1111 888888 07NOV2008 Wellbutrin 48 1 1
1111 876578 11MAY2011 Bupropion 50 2 2
2222 999999 08DEC2009 Amitriptyline 32 1 1
2222 999999 08DEC2009 Escitalopram 32 2 1
3333 656463 12APR2007 Imipramine 44 1 1
3333 643211 21DEC2008 Zoloft 45 2 2
3333 543213 02FEB2009 Fluoxetine 45 3 3
From what I have seen, this could be possible with a variant of the above code using 2 BY groups (patient_num & enc_num), but I can't seem to get it. I think the first. / last. codes require sorting, but if I am to sort by enc_num, they won't be in chronological order because the encounter numbers are generated by the system and depend on all other encounters going in at that time.
I tried the following code (using ordering_date instead, because it's already sorted properly), but everything under Enc_ is printed as a 1. I'm sure my logic is all wrong. Any thoughts?
data pt_meds_test;
set pt_meds_;
by patient_num ordering_date;
if first.patient_num;
if first.ordering_date then enc_ = 1;
else enc_ + 1;

The first./last. flags don't require a separate sort if the data are already in the right order, or if you use NOTSORTED in your BY statement. If the BY variables are not in order, the BY statement throws an error and the step stops executing at the first observation that is out of order. Like this:
data class;
set sashelp.class;
by age;
first = first.age;
last = last.age;
run;
ERROR: BY variables are not properly sorted on data set SASHELP.CLASS.
Name=Alfred Sex=M Age=14 Height=69 Weight=112.5 FIRST.Age=1 LAST.Age=1 first=. last=. _ERROR_=1 _N_=1
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 2 observations read from the data set SASHELP.CLASS.
Try this code to see exactly how the first./last. flags work:
data pt_meds_test;
set pt_meds_;
by patient_num ordering_date;
fp = first.patient_num;
lp = last.patient_num;
fo = first.ordering_date;
lo = last.ordering_date;
run;
Those conditions work differently than you think:
if expression;
If the expression is true, execution continues with the statements after the if.
Otherwise SAS returns to the beginning of the data step without the implicit output, so the observation is not written to the output data set.
In most cases, if without then is equivalent to where. However:
where is faster, but it is limited to variables that come from the data set you are reading
if can be used with any type of expression, including calculated fields
More info: IF Statement, Subsetting
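To make the effect of the subsetting if concrete, here is a rough sketch of what the data step in the question does, written in Python rather than SAS. The names ROWS and run_step are made up for illustration, and the first. flags are emulated by comparing each row with the previous one, which assumes the rows are already in BY order.

```python
# Rows are (patient_num, ordering_date, medication), already in BY order.
ROWS = [
    (1111, "07NOV2008", "Wellbutrin"),
    (1111, "11MAY2011", "Bupropion"),
    (2222, "08DEC2009", "Amitriptyline"),
    (2222, "08DEC2009", "Escitalopram"),
    (3333, "12APR2007", "Imipramine"),
    (3333, "21DEC2008", "Zoloft"),
    (3333, "02FEB2009", "Fluoxetine"),
]


def run_step(rows):
    """Emulate: by patient_num ordering_date; if first.patient_num; ..."""
    output = []
    enc = None
    prev_patient = prev_date = None
    for patient, date, med in rows:
        # Emulated first.patient_num and first.ordering_date flags.
        first_patient = patient != prev_patient
        first_date = first_patient or date != prev_date
        prev_patient, prev_date = patient, date

        # Subsetting IF: when false, skip the rest of the step, no output.
        if not first_patient:
            continue

        # Only first-patient rows get here, and on those rows
        # first.ordering_date is always true, so enc is always set to 1.
        if first_date:
            enc = 1
        else:
            enc += 1
        output.append((patient, date, med, enc))
    return output


KEPT = run_step(ROWS)
for row in KEPT:
    print(row)
```

Only one row per patient survives, and each of them carries enc = 1, which matches what the question reports.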
I think the lag() function can be your answer.
data pt_meds_test;
set pt_meds_;
by patient_num;
retain enc_;
prev_patient_num = lag(patient_num);
prev_ordering_date = lag(ordering_date);
if first.patient_num then enc_ = 1;
else if patient_num = prev_patient_num and ordering_date ne prev_ordering_date then enc_ + 1;
run;
With the lag() function you can look at the value a variable had on the previous observation and compare it with the current one later.
But be careful: lag() doesn't look up the variable's value on the previous observation. It takes the current value and stores it in a FIFO queue of size 1; on the next call it retrieves the stored value from the queue and puts the new value there. That is why lag() is called on every observation here, not inside a condition.
More info: LAG Function
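The queue behaviour is easier to see outside SAS. Below is a small Python sketch of it, not SAS itself: Lag1, number_encounters and conditional_lag are hypothetical names, and the queue is modelled with a deque of length 1. The first function repeats the encounter numbering from the step above; the second shows what goes wrong when the lag is only called conditionally.

```python
from collections import deque


class Lag1:
    """Model of lag(): a FIFO queue of size 1, updated on every call."""

    def __init__(self):
        self._queue = deque([None], maxlen=1)

    def __call__(self, value):
        previous = self._queue[0]   # value stored by the previous CALL
        self._queue.append(value)   # new value replaces it
        return previous


def number_encounters(rows):
    """rows: (patient_num, ordering_date) pairs sorted by patient, then date."""
    lag_patient, lag_date = Lag1(), Lag1()
    enc = 0
    numbers = []
    for patient, date in rows:
        # Both lags are called unconditionally, once per row.
        prev_patient = lag_patient(patient)
        prev_date = lag_date(date)
        if patient != prev_patient:
            enc = 1                 # first row of a new patient
        elif date != prev_date:
            enc += 1                # same patient, new ordering date
        numbers.append(enc)
    return numbers


def conditional_lag(values):
    """Pitfall: calling the lag only for even values skips queue updates."""
    lag = Lag1()
    return [lag(v) if v % 2 == 0 else None for v in values]


ROWS = [
    (1111, "07NOV2008"), (1111, "11MAY2011"),
    (2222, "08DEC2009"), (2222, "08DEC2009"),
    (3333, "12APR2007"), (3333, "21DEC2008"), (3333, "02FEB2009"),
]
print(number_encounters(ROWS))            # [1, 2, 1, 1, 1, 2, 3]
print(conditional_lag([1, 2, 3, 4, 5, 6]))  # [None, None, None, 2, None, 4]
```

In the second call the value returned for 4 is 2, not 3, because the queue was last filled when the function was called for 2.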

I'm not sure if this hurts the rest of your analysis, but what about just
proc freq data=pt_meds noprint;
tables patient_num*ordering_date / out=pt_meds_freq;
run;
data pt_meds_freq2;
set pt_meds_freq;
by patient_num ordering_date;
if first.patient_num;
run;


Output R dataframe to SAS format Issue

I have a dataset that looks like this:
df_dummy = data.frame(
df_dummy$Company <- as.character(df_dummy$Company)
df_dummy$Measure <- as.character(df_dummy$Measure)
I am using this to export to an .xpt file
write.xport(df_dummy, file = "data/tmp.xpt")
In SAS, I use this code to import:
libname sasfile 'PATH\data';
libname xptfile xport 'PATH\data\tmp.xpt' access=readonly;
proc copy inlib=xptfile outlib=sasfile;
The table looks fine, but the rate doesn't show the decimal point.
In my actual dataset, there are a lot more rows but it's the same format essentially and if I run a lookup.xport I get this:
Variables in data set `MEASURES':
dataset name type format flength fdigits iformat iflength ifdigits label nobs
MEASURES ID character 0 0 0 0 29064
MEASURES MEASURE character 0 0 0 0 29064
MEASURES NUM numeric 0 0 0 0 29064
MEASURES DEN numeric 0 0 0 0 29064
MEASURES RATE numeric 0 0 0 0 29064
However, if I use the same SAS code to import this, I get something that looks completely off and I can't figure out what's causing it.
I cannot replicate your issue using R (3.4.1) and SAS (9.4 TS1M4) on Mac OS X with both being 64 bit versions. The 32/64 bit versions can cause issues sometimes.
I used R Studio and SAS UE, both freely available for education usage.
Full R code:
df_dummy = data.frame(
df_dummy$Company <- as.character(df_dummy$Company)
df_dummy$Measure <- as.character(df_dummy$Measure)
write.xport(df_dummy, file = "tmp.xpt")
Full SAS Code:
libname sasfile '/folders/myfolders/';
libname xptfile xport '/folders/myfolders/tmp.xpt' access=readonly;
proc copy inlib=xptfile outlib=sasfile;
Your example works, even with an older version of R. Make sure your transport file has not been corrupted while being transferred between machines. A transport file is binary data with fixed-length 80-byte records, although much of the data looks like ASCII text.
SAS transport files follow the SAS V5 rules for names. Make sure that your member name and variable names are valid SAS names and are not longer than 8 characters. Character variables cannot be longer than 200 characters.
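If you want to check the names before exporting, a quick sketch in Python could look like this. The helper name is_valid_v5_name is made up, and the pattern assumes the usual rule for such names: a letter or underscore first, then letters, digits or underscores, at most 8 characters in total.

```python
import re

# Letter or underscore first, then up to 7 more letters, digits or underscores.
V5_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,7}")


def is_valid_v5_name(name):
    """Return True if name fits the 8-character naming rule described above."""
    return V5_NAME.fullmatch(name) is not None


for candidate in ["RATE", "DF_DUMMY", "Success_Rate", "1ST_VAR", "my.var"]:
    print(candidate, is_valid_v5_name(candidate))
```

Names such as Success_Rate (too long) or my.var (contains a dot) fail this check and are worth renaming on the R side before writing the file.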
You can quickly look at the file using a simple data step, especially for your small example. If the file length is not an exact multiple of 80, or the header records do not start at the beginning of an 80-byte record, then something has corrupted the file.
56 data _null_;
57 infile '/test/tmp.xpt' lrecl=80 recfm=f ;
58 input;
59 list;
60 run;
NOTE: The infile '/test/tmp.xpt' is:
Owner Name=xxxxx,Group Name=xxxxx,
Access Permission=-rw-r--r--,
Last Modified=29Sep2017:09:16:16,
File Size (bytes)=1680
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1 HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!000000000000000000000000000000
2 CHAR SAS SAS SASLIB 7.00 R 3.0.2. 29SEP17:09:16:16
ZONE 54522222545222225454442232332222523232302222222222222222222222223354533333333333
NUMR 3130000031300000313C92007E000000203E0E200000000000000000000000002935017A09A16A16
3 29SEP17:09:16:16
4 HEADER RECORD*******MEMBER HEADER RECORD!!!!!!!000000000000000001600000000140
5 HEADER RECORD*******DSCRPTR HEADER RECORD!!!!!!!000000000000000000000000000000
6 CHAR SAS DF_DUMMYSASDATA 7.00 R 3.0.2. 29SEP17:09:16:16
ZONE 54522222445454455454454232332222523232302222222222222222222222223354533333333333
NUMR 3130000046F45DD9313414107E000000203E0E200000000000000000000000002935017A09A16A16
7 29SEP17:09:16:16
8 HEADER RECORD*******NAMESTR HEADER RECORD!!!!!!!000000000500000000000000000000
9 CHAR ........COMPANY ........
ZONE 00000000444544522222222222222222222222222222222222222222222222220000000022222222
NUMR 020008013FD01E900000000000000000000000000000000000000000000000000000000000000000
10 CHAR ....................................................................MEASURE
ZONE 00000000000000000000000000000000000000000000000000000000000000000000444555422222
NUMR 00000000000000000000000000000000000000000000000000000000000002000802D51352500000
11 CHAR ........ ....................
ZONE 22222222222222222222222222222222222222222222000000002222222200000000000000000000
NUMR 00000000000000000000000000000000000000000000000000000000000000000008000000000000
12 CHAR ................................................NUM
ZONE 00000000000000000000000000000000000000000000000045422222222222222222222222222222
NUMR 000000000000000000000000000000000000000001000803E5D00000000000000000000000000000
13 CHAR ........ ........................................
ZONE 22222222222222222222222200000000222222220000000100000000000000000000000000000000
NUMR 00000000000000000000000000000000000000000000000000000000000000000000000000000000
14 CHAR ............................DEN
ZONE 00000000000000000000000000004442222222222222222222222222222222222222222222222222
NUMR 000000000000000000000100080445E0000000000000000000000000000000000000000000000000
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
15 CHAR ........ ............................................................
ZONE 22220000000022222222000000010000000000000000000000000000000000000000000000000000
NUMR 00000000000000000000000000080000000000000000000000000000000000000000000000000000
16 CHAR ........RATE ........
ZONE 00000000545422222222222222222222222222222222222222222222222222220000000022222222
NUMR 01000805214500000000000000000000000000000000000000000000000000000000000000000000
17 CHAR ....... ....................................................
ZONE 00000002000000000000000000000000000000000000000000000000000022222222222222222222
NUMR 00000000000000000000000000000000000000000000000000000000000000000000000000000000
18 HEADER RECORD*******OBS HEADER RECORD!!!!!!!000000000000000000000000000000
19 CHAR 0001 A A ......B.......B2......0002 B A ......B.......B2......
ZONE 33332222422222224A000000410000004300000033332222422222224A0000004100000043000000
NUMR 00010000100000001000000024000000220000000002000020000000100000002400000022000000
20 CHAR 0003 C A ......B.......B2......0004 D A ......B.......B2......
ZONE 33332222422222224A000000410000004300000033332222422222224A0000004100000043000000
NUMR 00030000300000001000000024000000220000000004000040000000100000002400000022000000
21 CHAR 0005 E A ......B.......B2......
ZONE 33332222422222224A00000041000000430000002222222222222222222222222222222222222222
NUMR 00050000500000001000000024000000220000000000000000000000000000000000000000000000
NOTE: 21 records were read from the infile '/test/tmp.xpt'.
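The same two checks (length a multiple of 80, header records on record boundaries) can be scripted outside SAS. This is only a sketch in Python: check_xport_bytes is a hypothetical helper, and it assumes that the text "HEADER RECORD*******" never occurs inside the actual data, which could in principle give a false alarm.

```python
RECORD_LEN = 80
MARKER = b"HEADER RECORD*******"


def check_xport_bytes(data):
    """Return a list of problems found in the raw bytes of a transport file."""
    problems = []
    if len(data) % RECORD_LEN != 0:
        problems.append(f"length {len(data)} is not a multiple of {RECORD_LEN}")
    if not data.startswith(MARKER):
        problems.append("file does not begin with a header record")
    # Every header record should start on an 80-byte boundary.
    offset = data.find(MARKER)
    while offset != -1:
        if offset % RECORD_LEN != 0:
            problems.append(f"header record at byte {offset} is misaligned")
        offset = data.find(MARKER, offset + 1)
    return problems


# Synthetic demonstration: three well-formed records, then the same bytes
# with one stray carriage return inserted (as a text-mode transfer might do).
good = MARKER.ljust(RECORD_LEN, b" ") * 3
bad = good[:10] + b"\r" + good[10:]
print(check_xport_bytes(good))
print(check_xport_bytes(bad))
```

For a real file you would read it in binary mode, for example with open(path, "rb"), and pass the bytes to the function.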

Complex select in SQLite view

I have two tables: Security holds the access bit mask for a given NTFS file system scan, and FileSystemRights maps the well-known bit masks to their string representations. I need to create a view which exposes the expected (not just proper) string representations for a given bit mask. The problem is that several enum values are composites containing combinations of lower values, so the goal is not to repeat the implicit values.
For example, a value of 1179817 (Security.Id = 24) should only report ReadAndExecute and Synchronize, excluding ExecuteFile, ListDirectory, Read, ReadAttributes, ReadData, ReadExtendedAttributes, ReadPermissions and Traverse, as those are all part of ReadAndExecute (e.g. ReadAndExecute & Read == Read). It's obviously correct to show them all, but a user wants to see only the non-implicit values.
I'm lost within the constraints of SQL to produce a join that behaves like this without some abysmal nested CASE that would be a nightmare to look at.
Does a better programmatic approach exist?
Id Name Value
-- ---- -----
1 None 0
2 ListDirectory 1
3 ReadData 1
4 WriteData 2
5 CreateFiles 2
6 CreateDirectories 4
7 AppendData 4
8 ReadExtendedAttributes 8
9 WriteExtendedAttributes 16
10 ExecuteFile 32
11 Traverse 32
12 DeleteSubdirectoriesAndFiles 64
13 ReadAttributes 128
14 WriteAttributes 256
15 Write 278
16 Delete 65536
17 ReadPermissions 131072
18 Read 131209
19 ReadAndExecute 131241
20 Modify 197055
21 ChangePermissions 262144
22 TakeOwnership 524288
23 Synchronize 1048576
24 FullControl 2032127
25 GenericAll 268435456
26 GenericExecute 536870912
27 GenericWrite 1073741824
28 GenericRead 2147483648
Id FileSystemRights IdentityReference
-- ---------------- -----------------
20 2032127 BUILTIN\Administrators
21 2032127 BUILTIN\Administrators
23 268435456 CREATOR OWNER
24 1179817 BUILTIN\Users
25 4 BUILTIN\Users
26 2 BUILTIN\Users
SELECT s.Id AS SecurityId,
       f.Name
FROM Security s
JOIN FileSystemRights f
  ON CASE f.Value
       WHEN 0 THEN s.FileSystemRights = f.Value
       ELSE (s.FileSystemRights & f.Value) == f.Value
     END
ORDER BY s.Id, f.Name;
Add the actual value of the name to the query.
Then wrap another query around that to filter out values for the same entry that are a subset of another value:
WITH AllValues(SecurityId, Name, Value) AS (
  SELECT s.Id,
         f.Name,
         f.Value
  FROM Security s
  JOIN FileSystemRights f
    ON CASE f.Value
         WHEN 0 THEN s.FileSystemRights = f.Value
         ELSE (s.FileSystemRights & f.Value) == f.Value
       END
)
SELECT SecurityId,
       Name
FROM AllValues
WHERE NOT EXISTS (SELECT 1
                  FROM AllValues AS AV2
                  WHERE AV2.SecurityId = AllValues.SecurityId
                    AND (AV2.Value & AllValues.Value) != 0
                    AND AV2.Value > AllValues.Value)
ORDER BY 1, 2;
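To try this approach without touching the real database, here is a small Python sketch using the standard sqlite3 module and an in-memory database. It assumes the filter is a NOT EXISTS over the common table expression, loads only a subset of the FileSystemRights rows from the question, and uses made-up helper names (RIGHTS, SECURITY, explicit_rights).

```python
import sqlite3

# Subset of the FileSystemRights rows shown in the question.
RIGHTS = [
    ("None", 0), ("ListDirectory", 1), ("ReadData", 1),
    ("CreateDirectories", 4), ("AppendData", 4),
    ("ReadExtendedAttributes", 8), ("ExecuteFile", 32), ("Traverse", 32),
    ("ReadAttributes", 128), ("Write", 278), ("ReadPermissions", 131072),
    ("Read", 131209), ("ReadAndExecute", 131241),
    ("Synchronize", 1048576), ("FullControl", 2032127),
]
SECURITY = [(24, 1179817, "BUILTIN\\Users"), (25, 4, "BUILTIN\\Users")]

QUERY = """
WITH AllValues(SecurityId, Name, Value) AS (
  SELECT s.Id, f.Name, f.Value
  FROM Security s
  JOIN FileSystemRights f
    ON CASE f.Value
         WHEN 0 THEN s.FileSystemRights = f.Value
         ELSE (s.FileSystemRights & f.Value) == f.Value
       END
)
SELECT SecurityId, Name
FROM AllValues
WHERE NOT EXISTS (SELECT 1
                  FROM AllValues AS AV2
                  WHERE AV2.SecurityId = AllValues.SecurityId
                    AND (AV2.Value & AllValues.Value) != 0
                    AND AV2.Value > AllValues.Value)
ORDER BY 1, 2
"""


def explicit_rights():
    """Return {SecurityId: [names]} with implied lower rights filtered out."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE FileSystemRights "
                 "(Id INTEGER PRIMARY KEY, Name TEXT, Value INTEGER)")
    conn.execute("CREATE TABLE Security "
                 "(Id INTEGER PRIMARY KEY, FileSystemRights INTEGER, "
                 "IdentityReference TEXT)")
    conn.executemany(
        "INSERT INTO FileSystemRights (Name, Value) VALUES (?, ?)", RIGHTS)
    conn.executemany("INSERT INTO Security VALUES (?, ?, ?)", SECURITY)
    result = {}
    for security_id, name in conn.execute(QUERY):
        result.setdefault(security_id, []).append(name)
    conn.close()
    return result


RESULT = explicit_rights()
print(RESULT)
```

Note that names sharing the same value (such as CreateDirectories and AppendData, both 4) are both kept, because the filter only removes values that are strictly smaller than an overlapping one.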

GameTheory package: Convert data frame of games to Coalition Set

I am looking to explore the GameTheory package from CRAN, but I would appreciate help in converting my data (in the form of a data frame of unique combinations and results) in to the required coalition object. The precursor to this I believe to be an ordered list of all coalition values (
My real data has n ~ 30 'players', and unique combinations = large (say 1000 unique combinations), for which I have 1 and 0 identifiers to describe the combinations. This data is sparsely populated in that I do not have data for all combinations, but will assume combinations not described have zero value. I plan to have one specific 'player' who will appear in all combinations, and act as a baseline.
By way of example this is the data frame I am starting with:
games <- read.csv('C:\\Users\\me\\Desktop\\SampleGames.csv', header = TRUE, row.names = 1)
n1 n2 n3 n4 Stakes Wins Success_Rate
1 1 1 0 0 800 60 7.50%
2 1 0 1 0 850 45 5.29%
3 1 0 0 1 150000 10 0.01%
4 1 1 1 0 300 25 8.33%
5 1 1 0 1 1800 65 3.61%
6 1 0 1 1 1900 55 2.89%
7 1 1 1 1 700 40 5.71%
8 1 0 0 0 3000000 10 0.00333%
where n1 is my universal player, and in this instance, I have described all combinations.
To calculate my 'base' coalition value from player {1} alone, I am looking to perform the calculation: 0.00333% (success rate) * all stakes, i.e.
0.00333% * (800 + 850 + 150000 + 300 + 1800 + 1900 + 700 + 3000000) = 105
I'll then have zero values for {2}, {3} and {4} as they never "play" alone in this example.
To calculate my first pair coalition value, I am looking to perform the calculation:
7.5%(800 + 300 + 1800 + 700) + 0.00333%(850 + 150000 + 1900 + 3000000) = 375
This is calculated as players {1,2} base win rate (7.5%) by the stakes they feature in, plus player {1} base win rate (0.00333%) by the combinations he features in that player {2} does not - i.e. exclusive sets.
This logic is repeated for the other unique combinations. For example row 4 would be the combination of {1,2,3} so the calculation is:
7.5%(800+1800) + 5.29%(850+1900) + 8.33%(300+700) + 0.00333%(3000000+150000) = 529 which descriptively is set {1,2} success rate% by Stakes for the combinations it appears in that {3} does not, {1,3} by where {2} does not feature, {1,2,3} by their occurrences, and the base player {1} by examples where neither {2} nor {3} occur.
My expected outcome therefore should look like this I believe:
c(105,0,0,0, 375,304,110,0,0,0, 529,283,246,0, 400)
where the first four numbers are the single player combinations {1} {2} {3} and {4}, the next six numbers are two player combinations {1,2} {1,3} {1,4} (and the null cases {2,3} {2,4} {3,4} which don't exist), then the next four are the three player combinations {1,2,3} {1,2,4} {1,3,4} and the null case {2,3,4}, and lastly the full combination set {1,2,3,4}.
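One possible reading of the rule described above, written as a Python sketch rather than R: each row's stakes are multiplied by the success rate of the game whose player set equals the intersection of that row's players with the coalition, and the rate is taken as zero when no such game exists. The names GAMES, RATE, coalition_value and coalition_vector are invented for illustration, and the rates are the percentages exactly as printed in the table, so some results may differ slightly from hand-rounded figures.

```python
from itertools import combinations

# (players in the game, stakes, success rate as a fraction), from the table.
GAMES = [
    ({1, 2}, 800, 0.075),
    ({1, 3}, 850, 0.0529),
    ({1, 4}, 150000, 0.0001),
    ({1, 2, 3}, 300, 0.0833),
    ({1, 2, 4}, 1800, 0.0361),
    ({1, 3, 4}, 1900, 0.0289),
    ({1, 2, 3, 4}, 700, 0.0571),
    ({1}, 3000000, 0.0000333),
]

# Success rate of each observed player set.
RATE = {frozenset(players): rate for players, _, rate in GAMES}


def coalition_value(coalition):
    """Sum of stakes times the rate of (row players intersected with coalition)."""
    members = frozenset(coalition)
    return sum(
        stakes * RATE.get(frozenset(players) & members, 0.0)
        for players, stakes, _ in GAMES
    )


def coalition_vector(n_players=4):
    """Values ordered by coalition size, then lexicographically."""
    players = range(1, n_players + 1)
    return [
        coalition_value(c)
        for size in range(1, n_players + 1)
        for c in combinations(players, size)
    ]


print(round(coalition_value({1})))        # base player alone
print(round(coalition_value({1, 2})))     # first pair
print(round(coalition_value({1, 2, 3})))  # first triple
print([round(v) for v in coalition_vector()])
```

The ordering produced by coalition_vector follows the singles, pairs, triples, full set layout described above. Enumerating every coalition this way grows as 2^n, so with around 30 players it would need a different strategy.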
I'd then feed this in to the DefineGame function of the package to create my coalitions object.
Appreciate any help: I have tried to be as descriptive as possible. I really don't know where to start on generating the necessary sets and set exclusions.

Stata counting substring

My table looks like this:
. "A05" 1
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 1
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 1
441 "C05AA03" 130
The count for the full 8-character AQ_ATC codes is already correct.
The shorter codes are unique in the table and are substrings of the complete 8-character codes (they represent the first x characters).
What I am looking for is the count of the appearances of the shorter codes throughout the entire table.
For example in this case the resulting table would be
. "A05" 2715 <-- 2525 + 190
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 7430 <-- 4330 + 3100
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 130 <-- 130
441 "C05AA03" 130
The partial codes do not overlap, by which I mean that if there is "C05" there won't be another partial code "C05A1".
I created the amountATC column using
bysort ATC: egen amountATC = total(AQ_ATC==AQ_ATC)
I attempted recycling the code that I had received yesterday but failed in doing so.
My attempt looks like this:
levelsof AQ_ATC, local(ATCvals)
quietly foreach y in AQ_ATC {
    local i = 0
    quietly foreach x of local ATCvals {
        if strpos(`y', `"`x'"') == 1{
            local i = `i'+1
        }
    }
    replace amountATC = `i'
}
My idea was to use a counter "i" and increase it by 1 every time an AQ_ATC starts with another AQ_ATC code. Then I write "i" into amountATC, and after I have iterated over the entire table for my AQ_ATC, I will have an "i" value equal to the number of occurrences of the substring. Then I reset "i" to 0 and continue with the next AQ_ATC.
At least that's how I intended it to work; what it did in the end was set all amountATC values to 1.
I also attempted looking into different egen functions such as noccur and moss, but my connection keeps timing out when I attempt to install the packages.
It seems you come from another language and insist on using loops when they are not strictly necessary. Stata does many things without explicit loops, precisely because commands already apply to all observations.
One way is:
clear
set more off
input ///
ID str15 AQ_ATC amountATC
. "A05" 1
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 1
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 1
441 "C05AA03" 130
end
*----- what you want -----
sort AQ_ATC ID
gen grou = sum(missing(ID))
bysort grou AQ_ATC: gen tosum = amountATC if _n == 1 & !missing(ID)
by grou: egen s = total(tosum)
replace amountATC = s if missing(ID)
list, sepby(grou)
With your edit the same principles apply. The code below adjusts to your change and is slightly shorter (one line less):
*----- what you want -----
sort AQ_ATC
gen grou = sum(missing(ID))
bysort grou: gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1] & !missing(ID)
by grou: replace amountATC = s[_N] if missing(ID)
More efficient should be:
bysort grou: gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1]
by grou: replace amountATC = s[_N] - 1 if missing(ID)
Some comments:
sort is a very handy command. If you sort the data by AQ_ATC they are arranged in such a way that the short (sub)strings are placed before corresponding long strings.
The by: prefix is fundamental and very helpful, and you can use it once appropriate groups are defined. I created the groups taking advantage of the fact that all short (sub)strings have a missing(ID).
Then (by the groups just defined) you only want to add up one value (observation) per distinct AQ_ATC. That's what the condition if AQ_ATC != AQ_ATC[_n+1] does.
Finally, replace back into your original variable. I would usually generate a copy and work with that, so my original variable doesn't suffer.
An excellent read for the by: prefix is Speaking Stata: How to move step by: step, by Nick Cox.
Yet another slightly different way:
*----- what you want -----
sort AQ_ATC
gen grou = sum(missing(ID))
egen t = tag(grou AQ_ATC)
bysort grou: gen s = sum(amountATC * t)
by grou: replace amountATC = s[_N] - 1 if missing(ID)
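For readers who think more naturally in loops, the same logic can be spelled out as a short Python sketch. This is not a replacement for the Stata code: the function name fill_prefix_totals is made up, and it relies on the same assumption as above, namely that the short codes are exactly the rows with a missing ID.

```python
# (ID, AQ_ATC, amountATC); None stands for Stata's missing value.
ROWS = [
    (None, "A05", 1),
    (123, "A05AA02", 2525),
    (234, "A05AA02", 2525),
    (991, "A05AD39", 190),
    (None, "C10", 1),
    (441, "C10AA11", 4330),
    (229, "C10AA22", 3100),
    (None, "C05AA", 1),
    (441, "C05AA03", 130),
]


def fill_prefix_totals(rows):
    """Replace the count of each short code by the sum over matching full codes."""
    # Keep one count per distinct full code (repeated rows carry the same count).
    full_counts = {}
    for row_id, code, amount in rows:
        if row_id is not None:
            full_counts[code] = amount

    result = []
    for row_id, code, amount in rows:
        if row_id is None:
            # Sum the counts of all distinct full codes starting with this prefix.
            amount = sum(v for c, v in full_counts.items() if c.startswith(code))
        result.append((row_id, code, amount))
    return result


FILLED = fill_prefix_totals(ROWS)
for row in FILLED:
    print(row)
```

The dictionary plays the role of the "one value per distinct AQ_ATC" condition, and the startswith test plays the role of the groups built from the sorted data.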

How can I apply fisher test on this set of data (nominal variables)

I'm pretty new to statistics:
fisher = function(idxToTest, idxATI){
p = c()
for(i in c(1:length(idxToTest)))
tbl = table(data[[idxToTest[i]]], data[[idxATI]])
rez = fisher.test(tbl, workspace = 20000000000)
dependent=c(dependent, TRUE)
idxDependent = c(idxDependent, idxToTest[i])
dependent = c(dependent, FALSE)
p = c(p, rez$p.value)
This is the function I use. It seems to work.
What I understood until now is that I have to pass as first parameter data like:
Men Women
Dieting 10 30
Non-dieting 5 60
My data comes from a CSV:
data = read.csv('***.csv', header = TRUE, sep=',');
My first problem is that I don't know how to convert from:
Loan.Purpose Home.Ownership
lp_value_1 ho_value_2
lp_value_1 ho_value_2
lp_value_2 ho_value_1
lp_value_3 ho_value_2
lp_value_2 ho_value_3
lp_value_4 ho_value_2
lp_value_3 ho_value_3
to:
           ho_value_1 ho_value_2 ho_value_3
lp_value_1          0          2          0
lp_value_2          1          0          1
lp_value_3          0          1          1
lp_value_4          0          1          0
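The conversion itself is just counting how often each pair of values occurs. In R that is what table() does when given the two columns, as the function above already does. As a language-neutral illustration, here is a small Python sketch of the same counting; the name crosstab is made up for the example.

```python
from collections import Counter

# (Loan.Purpose, Home.Ownership) pairs from the example rows.
PAIRS = [
    ("lp_value_1", "ho_value_2"),
    ("lp_value_1", "ho_value_2"),
    ("lp_value_2", "ho_value_1"),
    ("lp_value_3", "ho_value_2"),
    ("lp_value_2", "ho_value_3"),
    ("lp_value_4", "ho_value_2"),
    ("lp_value_3", "ho_value_3"),
]


def crosstab(pairs):
    """Count each (row value, column value) pair into a contingency table."""
    counts = Counter(pairs)
    row_labels = sorted({row for row, _ in pairs})
    col_labels = sorted({col for _, col in pairs})
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


ROW_LABELS, COL_LABELS, TABLE = crosstab(PAIRS)
for label, counts in zip(ROW_LABELS, TABLE):
    print(label, counts)
```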
The second issue is that I don't know what the second parameter should be
POST UPDATE: This is what I get using fisher.test(myTable):
Error in fisher.test(test) : FEXACT error 501.
The hash table key cannot be computed because the largest key
is larger than the largest representable int.
The algorithm cannot proceed.
Reduce the workspace size or use another algorithm.
where myTable is:
car 18 0 0 5 27
credit_card 190 0 2 38 214
debt_consolidation 620 0 2 87 598
educational 5 0 0 3 7
Basically, Fisher tests only work on smallish data sets because they require a lot of memory. But all is good, because chi-square tests make minimal additional assumptions and are easier on the computer. Just run chisq.test on the same table to get your p-values.
Make sure you read through and understand the help page for chisq.test, especially the examples at the bottom.
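To see why the chi-square test is cheap, here is a sketch of the Pearson statistic computed by hand in Python, using the small dieting table from the question. The function name chi_square_statistic is hypothetical. It assumes no row or column is entirely zero (otherwise an expected count is zero and the division fails), and it applies no continuity correction, so for a 2x2 table the number can differ from what R prints by default.

```python
def chi_square_statistic(table):
    """Return (Pearson chi-square statistic, degrees of freedom) for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Rows: Dieting, Non-dieting; columns: Men, Women.
STAT, DOF = chi_square_statistic([[10, 30], [5, 60]])
print(round(STAT, 4), DOF)
```

The work is a single pass over the cells, whereas the exact test has to enumerate tables with the same margins, which is where the memory goes.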
Then look at a mosaicplot to see the quantities like:
this reference explains how mosaicplots work.
