Hopefully this has a straightforward answer, but how can I determine whether a variable has more than one observation that equals the maximum? For example, if the following is my data:
ID Doctor COUNT
576434 Tim 1
576434 Lynn 1
576434 Moran 1
576434 Wade 2
576434 Ashwin 2
Looking at the variable "COUNT", we can see that two observations equal the maximum value 2. I would then want to flag this "ID" as having a tie. The data is already in order by ID then COUNT, so my thought was maybe to just check whether the values in the last two rows are equal, but I'm hoping for a better way. Thanks!
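For reference, a literal version of that "check the last two rows" idea could be a short BY-group DATA step - a sketch only, assuming the table is a data set called have that really is sorted by ID and then ascending COUNT:
data tied_ids;
  set have;
  by ID;
  prev_count = lag(COUNT);   /* COUNT from the previous row */
  /* with that sort order, the last two rows of each ID hold its two largest counts */
  if last.ID and not first.ID and COUNT = prev_count then output;
  keep ID;
run;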
Ok, take a look at this strange solution...
First, your data (with a few more lines to test it out)
data have;
input ID Doctor :$10. COUNT;   /* list input; the :$10. informat sets Doctor's length */
format ID 8. Doctor $10. COUNT 8.;
DATALINES;
576434 Tim 1
576434 Lynn 1
576434 Moran 1
576434 Wade 2
576434 Ashwin 2
111111 AAAAAA 1
111111 BBBBBB 2
111111 CCCCCC 3
111111 DDDDDD 3
111111 EEEEEE 3
222222 ZZZZZZ 1
222222 WWWWWW 2
;
RUN;
proc sort data=have;
by ID;
run;
Create a helper dataset that, for each ID, looks only at COUNT values occurring more than once and takes the maximum of those:
proc sql;
CREATE TABLE AUX AS
SELECT
ID
,MAX(COUNT) AS AUXCOUNT
FROM (
SELECT
ID
,COUNT
,COUNT(*) AS COUNTOBS
FROM HAVE
GROUP BY 1,2
HAVING CALCULATED COUNTOBS > 1
)
GROUP BY 1
ORDER BY 1;
quit;
Merge with the first data set to flag the data when there's a tie:
data want;
merge
have (in=a)
aux (in=b);
by ID;
if count = auxcount then TIEFLAG = "TIE";
else TIEFLAG = "";
drop auxcount;
run;
RESULT
ID DOCTOR COUNT TIEFLAG
111111 AAAAAA 1
111111 BBBBBB 2
111111 CCCCCC 3 TIE
111111 DDDDDD 3 TIE
111111 EEEEEE 3 TIE
222222 ZZZZZZ 1
222222 WWWWWW 2
576434 Tim 1
576434 Lynn 1
576434 Moran 1
576434 Wade 2 TIE
576434 Ashwin 2 TIE
Let's assume your dataset has additional rows for example's sake:
data have;
input ID Doctor$ Count;
datalines;
576434 Tim 1
576434 Lynn 1
576434 Moran 1
576434 Wade 2
576434 Ashwin 2
576435 Barry 8
576435 Jim 10
576435 Bart 10
576391 Tom 1
576391 Bill 2
;
run;
Step 1: Sort your dataset by ID, descending Count
proc sort data=have;
by ID descending count;
run;
We now have the original dataset in an order we can work with. Next, we'll remove duplicate entries.
Step 2: Remove duplicates for ID descending Count
proc sort data=have
out=_temp_
dupout=dupes
nodupkey;
by ID descending count;
run;
We don't care about the output dataset from proc sort, but we do care about the dupout dataset.
Dupes
ID Doctor Count
576434 Ashwin 2 <---- Duplicate max
576434 Lynn 1
576434 Moran 1
576435 Bart 10 <---- Duplicate max
Step 3: Pick out the duplicates
Notice that the start of each ID group is the max duplicate value. Since it's sorted by ID, then count in descending order, the very first entry of each ID group in the dupout dataset will get us every instance of a duplicate max. Because everything is sorted thanks to proc sort, this trick will work without errors.
data dupe_max;
set dupes;
by ID descending count;
if(first.ID);
keep ID;
run;
Step 4: Merge these IDs back with the original sorted dataset
data want;
merge dupe_max(in=dupmax)
      have(in=all);
by ID;
Duplicate_Max = (all = dupmax);   /* flags every row of an ID that appears in dupe_max */
run;
Two things are going on here:
The in= option allows us to create a Boolean variable that tells us whether a given dataset contributed to the current observation. In other words, if the ID exists in dupe_max, the variable dupmax = 1. If the ID exists in have, the variable all = 1.
variable = (logic here) is a shorthand way of creating Boolean 1/0 variables. You can get the same result by doing:
if(all = dupmax) then Duplicate_Max = 1; else Duplicate_Max = 0;
Behind the scenes, this is what's happening:
(the all flag comes from have; the dupmax flag comes from dupe_max)
ID Doctor Count Duplicate_Max all dupmax Match?
576391 Bill 2 0 1 0 No
576391 Tom 1 0 1 0 No
576434 Wade 2 1 1 1 Yes
576434 Ashwin 2 1 1 1 Yes
576434 Tim 1 1 1 1 Yes
576434 Lynn 1 1 1 1 Yes
576434 Moran 1 1 1 1 Yes
576435 Jim 10 1 1 1 Yes
576435 Bart 10 1 1 1 Yes
576435 Barry 8 1 1 1 Yes
It's simpler to do it in SQL. You don't even need to sort. Assuming your source data set is "have" as either of the two other posters provided:
PROC SQL ;
CREATE TABLE dupemax AS
SELECT a.id, a.count, a.numobs
FROM (SELECT id, count, COUNT(*) AS numobs
FROM have
GROUP BY id, count
) a
%* gives the number of rows with each id/count combination ;
INNER JOIN
(SELECT id, MAX(count) AS count
FROM have
GROUP BY id
) b
%* finds the id/MAX(count) value ;
ON a.id EQ b.id AND a.count EQ b.count
%* so only getting the id and the MAX(count), not all "count" values ;
WHERE a.numobs GE 2
%* and of the id/MAX(count), only where there were multiple rows of the MAX(count) ;
;
QUIT ;
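If you then want to flag the tied rows in the original table, one possible follow-up (a sketch only, not part of the answer above; the names want and tieflag are made up here) is to left join have against the dupemax table:
PROC SQL ;
CREATE TABLE want AS
SELECT h.*,
       CASE WHEN d.id IS NOT NULL THEN "TIE" ELSE "" END AS tieflag
FROM have AS h
LEFT JOIN dupemax AS d
     ON h.id EQ d.id AND h.count EQ d.count ;
QUIT ;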
Related
I have a data set of different patient IDs, clinic visit dates, and attendance (see example data below, separated by patient ID for clarity).
I am interested in sequentially counting treatment episodes, which are defined as attending >= 4 visits in the starting month, followed by >= 1 visit every month afterwards. If a patient attends < 1 visit in a month after starting (i.e., after completing their initial >= 4 visits in the starting month), that treatment episode is considered ended. A new treatment episode subsequently starts the next time a patient attends >= 4 visits in a given month, and that same episode continues as long as the patient attends >= 1 visit/month thereafter. When patients either do not meet or break this pattern, I'd like to record 0.
Example data (note: I've excluded each day's date to prevent the example from being excessively long and re-produced dates to give a clearer picture of the desired data):
Patient ID   Visit Date   Attendance
1            01/01/2023   Yes
1            01/02/2023   Yes
1            01/03/2023   Yes
1            01/04/2023   Yes
1            02/01/2023   Yes
1            03/01/2023   Yes
1            04/01/2023   No
1            05/01/2023   Yes
1            06/01/2023   No
1            07/01/2023   Yes
1            07/02/2023   Yes
1            07/03/2023   Yes
1            07/04/2023   Yes
1            08/01/2023   Yes
----------
2            01/01/2023   Yes
2            02/01/2023   Yes
2            03/01/2023   Yes
2            03/02/2023   Yes
2            03/03/2023   Yes
2            03/04/2023   Yes
2            04/01/2023   Yes
2            05/01/2023   Yes
2            07/01/2023   Yes
Desired data:
Patient ID   Visit Date   Attendance   Tx Episode
1            01/01/2023   Yes          1
1            01/02/2023   Yes          1
1            01/03/2023   Yes          1
1            01/04/2023   Yes          1
1            02/01/2023   Yes          1
1            03/01/2023   Yes          1
1            04/01/2023   No           0
1            05/01/2023   Yes          0
1            06/01/2023   No           0
1            07/01/2023   Yes          2
1            07/02/2023   Yes          2
1            07/03/2023   Yes          2
1            07/04/2023   Yes          2
1            08/01/2023   Yes          2
----------
2            01/01/2023   Yes          0
2            02/01/2023   Yes          0
2            03/01/2023   Yes          1
2            03/02/2023   Yes          1
2            03/03/2023   Yes          1
2            03/04/2023   Yes          1
2            04/01/2023   Yes          1
2            05/01/2023   Yes          1
2            07/01/2023   Yes          0
I am somewhat new to programming in R. I initially attempted to use ifelse() but wasn't able to come up with logicals that worked, and I've also attempted to write loops, which failed to run.
Any help would be greatly appreciated and I'm happy to provide more detail if the above isn't clear.
Thanks in advance for your time/effort!
This seems fairly complex, and I'm not sure of the entire logic, but I thought this may help. This uses the lubridate library but otherwise base R functions. A helper function, elapsed_months, was borrowed from here.
First, an empty list enc_list is created that will store results for the final data.frame.
We construct two loops - the first to analyze data for each Patient_ID, and the second to evaluate encounters for that given patient.
Note that I subset based on Attendance being "Yes" - if not attended, would not want to include that data in evaluation. This is an assumption on my part.
A table of months for the Visit_Date is made so that we know which months have >= 4 encounters.
The enc_active is a simple flag for whether, row by row, we are dealing with an active treatment episode. The enc_num is the treatment-episode number, which is incremented when a new treatment episode is discovered.
Going row by row through the encounter data, first check whether we are in an active episode. If we are, check whether the number of elapsed months is 0 (same month) or 1 (consecutive month). If true, record that episode; if not, the treatment episode is over.
If we are not in an active episode, check whether the row falls in a month with 4+ encounters, and if it does, start a new active treatment episode. Note that where this is not true, it will record 0 for Tx_Episode and leave the flag unset.
The final results are stored back in the list which will be bound together with rbind (row bind) in the end.
The merge will combine the results with the original data.frame, which is needed since the rows with Attendance of "No" were removed early on. Since the merge will leave Tx_Episode missing for those "No"s, we'll replace NA with 0.
Some example data was adapted from your comment. Please let me know of questions - happy to do a StackOverflow chat to go over as well. I do have an interest in this form of data from my own experiences.
library(lubridate)
# assumes df has columns Patient_ID, Visit_Date (Date class), and Attendance
elapsed_months <- function(end_date, start_date) {
  ed <- as.POSIXlt(end_date)
  sd <- as.POSIXlt(start_date)
  12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}

enc_list <- list()
for (id in unique(df$Patient_ID)) {
  # keep only attended visits for this patient
  enc_data <- df[df$Patient_ID == id & df$Attendance == "Yes", ]
  # visits per calendar month, used to find months with >= 4 encounters
  enc_month <- table(cut(enc_data$Visit_Date, 'month'))
  enc_active <- FALSE
  enc_num <- 0
  for (i in 1:nrow(enc_data)) {
    if (enc_active) {
      if (elapsed_months(enc_data$Visit_Date[i], enc_data$Visit_Date[i - 1]) <= 1) {
        enc_data[i, "Tx_Episode"] <- enc_num
      } else {
        enc_active <- FALSE
        enc_data[i, "Tx_Episode"] <- 0
      }
    } else {
      if (enc_month[as.character(floor_date(enc_data$Visit_Date[i], unit = "month"))] >= 4) {
        enc_active <- TRUE
        enc_num <- enc_num + 1
        enc_data[i, "Tx_Episode"] <- enc_num
      } else {
        enc_data[i, "Tx_Episode"] <- 0
      }
    }
  }
  enc_list[[id]] <- enc_data
}

df_final <- merge(
  do.call('rbind', enc_list),
  df,
  all.y = TRUE
)
df_final$Tx_Episode[is.na(df_final$Tx_Episode)] <- 0
Output
Patient_ID Visit_Date Attendance Tx_Episode
1 1 2023-01-01 Yes 1
2 1 2023-01-02 Yes 1
3 1 2023-01-03 Yes 1
4 1 2023-01-04 Yes 1
5 1 2023-02-01 Yes 1
6 1 2023-03-01 Yes 1
7 1 2023-04-01 No 0
8 1 2023-05-01 Yes 0
9 1 2023-06-01 No 0
10 1 2023-07-01 Yes 2
11 1 2023-07-02 Yes 2
12 1 2023-07-03 Yes 2
13 1 2023-07-04 Yes 2
14 1 2023-08-01 Yes 2
15 2 2023-01-01 Yes 0
16 2 2023-02-01 Yes 0
17 2 2023-03-01 Yes 1
18 2 2023-03-02 Yes 1
19 2 2023-03-03 Yes 1
20 2 2023-03-04 Yes 1
21 2 2023-04-01 Yes 1
22 2 2023-04-02 Yes 1
23 2 2023-04-03 Yes 1
24 2 2023-04-04 Yes 1
25 2 2023-06-12 Yes 0
I am new to R and data analysis. I have a database similar to the one below, just a lot bigger, and I am trying to find a general way to count, for each country, how many actions there are and how many subquestions have value 1, value 2, and so on. For each action there are multiple questions, subquestions and subsubquestions, but I would love to find a way to count:
1: how many actions there are per country, excluding subquestions;
2: how many subquestions 1 or 2 with value 1 there are for each country, actionn and questionn.
id country questionn subquestion value actionn
06 NIE 1 1 1 1
05 NIG 1 1 1 1
07 TAN 1 1 1 1
08 BEN 1 1 1 1
03 TOG 1 1 2 1
45 MOZ 1 1 2 1
40 ZIM 1 1 1 1
56 COD 1 1 1 1
87 BFA 1 1 1 1
09 IVC 1 1 2 1
08 SOA 1 1 2 1
02 MAL 1 1 2 1
78 MAI 1 1 2 1
35 GUB 1 1 2 1
87 RWA 1 1 2 1
41 ETH 1 1 1 1
06 NIE 1 2 2 1
05 NIG 1 2 1 1
87 BFA 1 2 1 2
I have tried to create subsets of the data frame and count everything for each country one at a time, but it is going to take forever, and I was wondering if there was a general way to do it.
For the first question I have done this
df1<-df %>% group_by (country) %>% summarise (countries=county)
unique(df1)
count(df1)
For the second question I was thinking of individually selecting and counting the rows which have questionn=1, subquestion=1, value=1 and actionn=1, then selecting and counting how many rows per country have questionn=1, subquestion=2, value=1, actionn=1, etc. Value refers to whether the answer to the question is 1=yes or 2=no.
I would be grateful for any help, thank you so much :)
For the first question you can try to do something like this:
df %>%
filter(subquestion != 2) %>%
group_by(country) %>%
summarise(num_actions = n())
This will return the number of actions per country, removing the rows where the subquestion column equals 2. Note that n() in the summarise function counts the number of observations in each group (in this case countries).
I am not sure I fully understand the second question, but my suggestion would be to make a new label for the particular observation you want to know (how many subquestions 1 or 2 with value 1 there are for each country, actionn and questionn):
df %>%
mutate(country_question_code = paste(country, actionn, questionn, sep = "_")) %>%
group_by(country_question_code) %>%
summarize(num_subquestion = n())
For question 1, a possible solution (assuming country names are not unique and actionn can be 0, 1, 2 or more):
For just the total count:
df %>%
  group_by(country) %>%
  summarise(
    Count_actions = sum(actionn)
  )   # ignores all other columns
If you want to count how many times a country appears, use n() in place of sum(actionn, na.rm=TRUE). This may not be what you want, but sometimes the simple solution is the best (it just counts the frequency of each country).
Or df %>% group_by(country, actionn) %>% summarise(count_actions = n()) will give a country-wise count for each type (say 1, 2 or more actions).
The data.table version: dt[, .N, by = .(country, actionn)]
For question 2: group by the "for each" variables in your question after filtering the data as required. Here, filter to subquestion 1 or 2 with value 1, then group by country, questionn and actionn:
df %>% filter(subquestion <= 2 & value == 1) %>% group_by(country, questionn, actionn) %>% summarise(counts_desired = n(), sums_desired = sum(actionn, na.rm = TRUE))
Hope this works. I am also learning and applying it on similar data.
I have not tested it and made certain assumptions about your data (numerical and clean). (Also on mobile while traveling! Cheers!!)
I want to create a new variable called recency - how recent the customer's transaction is - which is useful for RFM analysis. The definition is as follows: we observe the transaction log of each customer weekly and assign a dummy variable called "trans" if the customer made a transaction. The recency variable will equal the week number if she made a transaction in that week; otherwise recency will equal the previous recency value. To make it more clear, I have also created a demo data.table for you.
demo<-data.table( cust=rep(c(1:3), 3))
demo[,week:=seq(1,3,1),by=cust]
demo[, trans:=c(1,1,1,0,1,0,1,1,0)]
demo[, rec:=c(1,1,1, 1,2,1,3,3,1)]
I need to calculate the "rec" variable, which I entered manually in the demo data.table. Please also consider that I can handle it with looping, which takes a lot of time. Therefore, I would be grateful if you could help me with a data.table way. Thanks in advance.
This works for the example:
demo[, v := cummax(week*trans), by=cust]
cust week trans rec v
1: 1 1 1 1 1
2: 2 1 1 1 1
3: 3 1 1 1 1
4: 1 2 0 1 1
5: 2 2 1 2 2
6: 3 2 0 1 1
7: 1 3 1 3 3
8: 2 3 1 3 3
9: 3 3 0 1 1
We observe transaction log of each customer weekly and assign dummy variable called "trans" if the customers made a transaction. Recency variable will equal to the number of the week if she made a transaction at that week, otherwise recency will be equal to the previous recency value.
This means taking the cumulative max week, ignoring weeks where there is no transaction. Since weeks are positive numbers, we can treat the no-transaction weeks as zero.
I may be missing some elegant way in Stata to get to this example, which has to do with electrical parts and observed monthly failures, etc.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
I would like to group by (bysort) each PartID and record the highest frequency for FailType within each PartID type. Ties can be broken arbitrarily, and preferably, the lower one can be picked.
I looked at groups etc., but do not know how to peel off certain elements from the result set. So that is a major question for me. If you execute a query, how do you select only the elements you want for the next computation? Something like n(0) is the count, n(1) is the mean, etc. I was able to use contract, bysort etc. and create a separate data set which I then merged back into the main set with an extra column. There must be something simple using gen or egen so that there is no need to create an extra data set.
The expected results here will be:
PartID Freq
ABD 4 #(4 occurs twice)
ABC 2 #(tie broken with minimum)
BBB 0 #(0 occurs 3 times)
Please let me know how I can pick off specific elements that I need from a result set (can be from duplicate reports, tab etc.)
Part II - Clarification: Perhaps I should have clarified and split the question into two parts. For example, suppose I issue this follow-up command after running your code: tabdisp Type, c(Freq). It may print out a nice table. Can I then use that (derived) table to perform more computations programmatically?
For example, get the first row of the table.
----------------------
     Type |       Freq
----------+-----------
        A |         -1
        B |         -1
        C |         -1
        D |         -3
        S |         -3
----------------------
I found this difficult to follow (see comment on question), but some technique is demonstrated here. The numbers of observations in subsets of observations defined by by: are given by _N. The rest is sorting tricks. Negating the frequency is a way to select the highest frequency and the lowest Type which I think is what you are after when splitting ties. Negating back gets you the positive frequencies.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
bysort PartID FailType: gen Freq = -_N
bysort PartID (Freq Type) : gen ToShow = _n == 1
replace Freq = -Freq
list PartID Type FailType Freq if ToShow
+---------------------------------+
| PartID Type FailType Freq |
|---------------------------------|
1. | ABC A 2 1 |
3. | ABD A 4 2 |
7. | BBB A 0 3 |
+---------------------------------+
I am taking my very first steps with SAS, and I have run into the following problem, which I am not able to solve.
Suppose my dataset is:
data dat;
input id score gender;
cards;
1 10 1
1 10 1
1 9 1
1 9 1
1 9 1
1 8 1
2 9 2
2 8 2
2 9 2
2 9 2
2 10 2
;
run;
What I need to do is count the number of times the score variable takes the values 8, 9 and 10, by id, and then create new variables count8, count9 and count10 so that I get the following output:
id gender count8 count9 count10
1 1 1 3 2
2 2 1 3 1
How would you suggest I proceed? Any help would be greatly appreciated.
Lots of ways to do that. Here's a simple one-data-step approach.
data want;
  set dat;
  by id;
  if first.id then do;
    count8 = 0;
    count9 = 0;
    count10 = 0;
  end;
  select(score);
    when(8)  count8  + 1;
    when(9)  count9  + 1;
    when(10) count10 + 1;
    otherwise;
  end;
  if last.id then output;
  keep id count8 count9 count10;
run;
SELECT...WHEN is a shortening of a bunch of IF statements, basically (like CASE..WHEN in other languages).
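For comparison, the same tally written with IF/ELSE IF statements (equivalent logic, shown only to illustrate the point):
if score = 8 then count8 + 1;
else if score = 9 then count9 + 1;
else if score = 10 then count10 + 1;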
Gender should be dropped, by the way, unless it's always the same by ID (or unless you intend to count by it.)
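If gender really is constant within each id (as in your example) and you want it in the output, one option is to carry it through the BY and KEEP statements - a sketch of the same step:
data want;
  set dat;
  by id gender;   /* gender is constant within id, so the existing sort order still satisfies BY */
  if first.id then do;
    count8 = 0;
    count9 = 0;
    count10 = 0;
  end;
  select(score);
    when(8)  count8  + 1;
    when(9)  count9  + 1;
    when(10) count10 + 1;
    otherwise;
  end;
  if last.id then output;
  keep id gender count8 count9 count10;
run;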
A more flexible approach than this is to use a PROC FREQ (or PROC MEANS or ...) and transpose it:
proc freq data=dat noprint;
tables id*score/out=want_pre;
run;
proc transpose data=want_pre out=want prefix=count;
by id;
id score;
var count;
run;
If you really only want 8,9,10 and want to drop records less than 8, do so in the data=dat part of PROC FREQ:
proc freq data=dat(where=(score ge 8)) noprint;