I have a large data set that has multiple rows for each individual. Each individual has a unique ID, and each row is coded with 0/1 dummy variables for the type of doctor's visit it is. For example, if a visit took place at the doctor's office, that column is coded as 1; if not, it is coded as 0. I want to count how many visits of each type each individual has. I tried using count distinct:
proc sql;
create table all as select ID,
count (distinct doctor) as doctor1
from data
group by ID;
quit;
However, this does not seem to be giving me what I want.
Any help or pointers on what codes to use would be really appreciated.
Sample data:
data this;
input rid dateofvisit :mmddyy10. doctor hospital clinic;
format dateofvisit mmddyy10.;
datalines;
1 1/1/2014 1 0 0
1 1/3/2014 0 1 0
2 1/5/2014 1 0 0
3 1/6/2014 1 0 0
1 1/7/2014 1 0 0
3 1/8/2014 0 0 1
;
run;
The count function normally counts all occurrences. Combined with distinct, it counts the number of different values that occur. If I understand you correctly, that is not what you want.
Since your occurrences are coded as ones, you can use the sum function to calculate how many times each patient has made each kind of visit.
proc sql;
create table all as select rid,
sum (doctor) as doctor_visits,
sum (hospital) as hospital_visits,
sum (clinic) as clinic_visits,
sum(sum(doctor, hospital, clinic)) as total_visits
from this
group by rid;
quit;
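As a cross-check of the same idea outside SAS, here is a minimal Python sketch that sums the 0/1 flags per rid on the sample rows from the question. The names visits and totals are illustrative and do not come from the original post.

```python
from collections import defaultdict

# Sample rows from the question: (rid, doctor, hospital, clinic)
visits = [
    (1, 1, 0, 0),
    (1, 0, 1, 0),
    (2, 1, 0, 0),
    (3, 1, 0, 0),
    (1, 1, 0, 0),
    (3, 0, 0, 1),
]

# Summing a 0/1 flag counts the rows where that flag is set.
totals = defaultdict(lambda: {"doctor": 0, "hospital": 0, "clinic": 0, "total": 0})
for rid, doctor, hospital, clinic in visits:
    t = totals[rid]
    t["doctor"] += doctor
    t["hospital"] += hospital
    t["clinic"] += clinic
    t["total"] += doctor + hospital + clinic

for rid in sorted(totals):
    print(rid, totals[rid])
```

Counting distinct values of a 0/1 column can only ever return 1 or 2, which is why the sum is the right aggregate here.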
This is what my data.table looks like. The A:E columns are just to draw comparison to excel. Column NewShares is my desired column. I DO NOT have that column in my data.
A B C D E F
dt<-fread('
InitialShares Level Price Amount CashPerShare NewShares
1573.333 0 9.5339 13973.71 0 1573.333
0 1 10.2595 0 .06689 1584.73
0 1 10.1575 0 .06689 1596.33
0 1 9.6855 0 .06689 1608.58')
I am trying to calculate NewShares on the assumption that new shares are added to InitialShares by reinvesting dividends (NewShares*CashPerShare) at 90% of the price (Price*.9). In Excel, the formula is =F2+((F2*E3*B3)/(C3*0.9)) from the second row onward. The first row is just equal to InitialShares.
In R, I am trying the following, which is not quite right:
dt[,NewShares:= cumsum(InitialShares[1]*Level * CashPerShare/(Price*.9)+InitialShares[1])]
Please pay attention to the decimal places of NewShares once you generate the field in order to validate your approach.
If you expand your formula, you'll see that each row multiplies the previous NewShares by 1 + Level*CashPerShare/(Price*0.9), so a cumulative product works:
dt[, NewShares := cumprod(1+Level*CashPerShare/Price/0.9)*InitialShares[1]]
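To sanity-check the cumulative-product form against the numbers in the question, here is a small Python sketch using only the standard library. The variable names (rows, factors, new_shares) are illustrative assumptions; the figures are copied from the question's table.

```python
from itertools import accumulate
from operator import mul

initial_shares = 1573.333

# (Level, Price, CashPerShare) for each row of the question's table
rows = [
    (0, 9.5339, 0.0),
    (1, 10.2595, 0.06689),
    (1, 10.1575, 0.06689),
    (1, 9.6855, 0.06689),
]

# Per-row growth factor: 1 + Level * CashPerShare / (Price * 0.9)
factors = [1 + level * cash / (price * 0.9) for level, price, cash in rows]

# Running product of the factors, scaled by the starting share count
new_shares = [initial_shares * f for f in accumulate(factors, mul)]
print([round(x, 2) for x in new_shares])
```

The results should agree with the desired NewShares column (1573.333, 1584.73, 1596.33, 1608.58) to within rounding.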
Assuming my dataframe has one column, I wish to add another column to indicate if my ith element is unique within the first i elements. The results I want is:
c1 c2
1 1
2 1
3 1
2 0
1 0
For example, 1 is unique in {1}, 2 is unique in {1,2}, 3 is unique in {1,2,3}, 2 is not unique in {1,2,3,2}, 1 is not unique in {1,2,3,2,1}.
Here is my code, but it runs extremely slowly given that I have nearly 1 million rows.
for(i in 1:nrow(df)){
k <- sum(df$c1[1:i] == df$c1[i])
if(k>1){df[i,"c2"]=0}
else{df[i,"c2"]=1}
}
Is there a quicker way of achieving this?
The following works:
x$c2 = as.numeric(! duplicated(x$c1))
Or, if you prefer more explicit code (I do, but it’s slower in this case):
x$c2 = ifelse(duplicated(x$c1), 0, 1)
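To illustrate why this can be done in a single pass rather than re-scanning the first i elements for every row, here is a minimal Python sketch of the same first-occurrence flag. The function name first_occurrence_flags is hypothetical, and this is not a description of how R implements duplicated.

```python
def first_occurrence_flags(values):
    """Return 1 where a value appears for the first time, 0 otherwise."""
    seen = set()
    flags = []
    for v in values:
        # A value is "unique so far" only if it has not been seen before.
        flags.append(0 if v in seen else 1)
        seen.add(v)
    return flags

print(first_occurrence_flags([1, 2, 3, 2, 1]))  # [1, 1, 1, 0, 0]
```

Each element is looked up once in the set, so the work grows roughly linearly with the number of rows instead of quadratically as in the loop from the question.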
I am trying to run cumsum on two separate columns of a data frame. They are essentially tabulations of events for two different variables, and only one variable can have an event recorded per row. My approach was to create a new variable holding the value ‘1’ and then create two new columns holding the running totals for each variable. This works fine and I get the correct total number of occurrences. The problem is that in my current ifelse statement, if the event recorded is for variable “A”, then variable “B” is assigned 0. Instead, I want each row to carry forward the other variable’s previous total, so that I don’t end up with gaps where a column goes from 1 to 2, to 0, to 3.
I don't want to run summarize on this either, I would prefer to keep each recorded instance and run new columns through mutate.
CURRENT DF:
Event Value Variable Total.A Total.B
1 1 A 1 0
2 1 A 2 0
3 1 B 0 1
4 1 A 3 0
DESIRED RESULT:
Event Value Variable Total.A Total.B
1 1 A 1 0
2 1 A 2 0
3 1 B 2 1
4 1 A 3 1
Thanks!
Logical values can be summed as ones and zeroes, so you can apply the cumsum function directly to a comparison:
DF$Total.A <- cumsum(DF$Variable == "A")
Or, as a more general approach provided by @Frank, you can do:
uv = unique(as.character(DF$Variable))
DF[, paste0("Total.",uv)] <- lapply(uv, function(x) cumsum(DF$Variable == x))
If your factor has many levels, you can get this in one line by dummy coding and then taking the cumulative sum of each column of the resulting matrix.
X <- model.matrix(~Variable+0, DF)
apply(X, 2, cumsum)
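The same running-total idea can be sketched in Python for readers who want to check the expected columns by hand. The names variable and running are illustrative; the input is the Variable column from the question.

```python
from itertools import accumulate

# The Variable column from the question's data frame
variable = ["A", "A", "B", "A"]

# One running total per level: cumulative sum of the 0/1 indicator
levels = sorted(set(variable))
running = {
    f"Total.{lv}": list(accumulate(int(v == lv) for v in variable))
    for lv in levels
}

for name, column in running.items():
    print(name, column)
```

Because each indicator is 0 on rows belonging to the other level, its cumulative sum simply carries the previous total forward, which is the behaviour asked for in the desired result.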
I'm having a lot of trouble figuring out how to subset a data set in R despite reading through many pages here. The set contains information from over 3000 participants. Each participant was asked about five different health conditions and gave binary answers (i.e., yes/no diabetes; yes/no obesity, etc.). How do I make a subset that includes people who have only ONE of the conditions? For instance, everyone in this new subset would have either obesity or diabetes or high cholesterol, but none would have two or more conditions.
Thank you!!
ETA: After a night's sleep, I looked at everything (and the comments) again. Here's some clarification and what I've done since.
Sample data (mydata) (0 = no, 1 = yes)
Participant HighCho Diabetes Obesity
1 1 1 0
2 0 1 1
3 1 0 0
4 0 0 0
5 0 1 0
I want my subset outcome to include only those with none of the three conditions or only one of the three:
Participant HighCho Diabetes Obesity
3 1 0 0
4 0 0 0
5 0 1 0
I've written:
new.data <- subset(mydata, (HighCho == 0 & Diabetes == 0 & Obesity == 0) | HighCho == 1 | Diabetes == 1 | Obesity == 1)
My problem is that even though I capture everyone who is free from all conditions, I still include people who have more than one condition. I thought with my "or" statement, I would only include those with only one of the three conditions (rather than two). Any insights as to what I might be doing incorrectly?
You can use the apply function to sum the number of conditions each participant has.
mydata[apply(mydata[, c('HighCho', 'Diabetes', 'Obesity')], 1, sum) %in% 0:1, ]
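As an illustration of the row-sum filter on the sample data, here is a short Python sketch. The list-of-dicts layout and the names conditions and subset are assumptions made for the example, not part of the original post.

```python
# Sample data from the question (0 = no, 1 = yes)
mydata = [
    {"Participant": 1, "HighCho": 1, "Diabetes": 1, "Obesity": 0},
    {"Participant": 2, "HighCho": 0, "Diabetes": 1, "Obesity": 1},
    {"Participant": 3, "HighCho": 1, "Diabetes": 0, "Obesity": 0},
    {"Participant": 4, "HighCho": 0, "Diabetes": 0, "Obesity": 0},
    {"Participant": 5, "HighCho": 0, "Diabetes": 1, "Obesity": 0},
]

conditions = ["HighCho", "Diabetes", "Obesity"]

# Keep participants whose condition count is 0 or 1.
subset = [row for row in mydata if sum(row[c] for c in conditions) <= 1]
print([row["Participant"] for row in subset])  # [3, 4, 5]
```

The original "or" chain keeps anyone with at least one condition, which includes people with two or more; counting the conditions per row and testing the count avoids that.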
I want to count the days when a subject did not receive treatment (a "0" in my file). If a subject did receive treatment, it is denoted with "1". A subject can get multiple courses of treatment, and I would like to count the days between the first and second treatment. I am not (yet) interested in the time between the second and third treatment.
Basically my spss file looks like this:
id  day1  day2  day3  day4  day28
A   1     0     0     1     0
B   1     0     1     0     1
C   etc
I am only interested in the first series of zeros. The output I hope to get is:
id first_series_zero
A 2
B 1
C ...
Can anyone help me out here? Obviously, just counting all the zeros isn't going to work, because there might be multiple sets of zeroes in one row.
Cheers, Dylan
Here is one pretty general approach that will allow you to calculate the times between all of the different treatments. First I create a vector that stores the locations of all of the treatments, Loc1 TO Loc5 (using day1 to day5 as an example).
DATA LIST FREE / day1 day2 day3 day4 day5.
BEGIN DATA
1 0 0 1 0
1 0 1 0 1
END DATA.
VECTOR day = day1 TO day5.
VECTOR Loc(5,F2.0).
COMPUTE #id = 1.
LOOP #i = 1 TO 5.
DO IF day(#i) = 1.
COMPUTE Loc(#id) = #i.
COMPUTE #id = #id + 1.
END IF.
END LOOP.
Now if you run this transformation, the Loc vector will look like this for this example data.
Loc1 Loc2 Loc3 Loc4 Loc5
1 4 . . .
1 3 5 . .
Now to calculate the difference for the first series is as simple as:
COMPUTE first_series_zero = Loc2 - Loc1 - 1.
This will return missing if there is never a second (or first) treatment, and is not dependent on day1 always being the first day of the treatment. Now to calculate the difference between all of the treatments is quite simple, and here is a DO REPEAT approach.
VECTOR DifS(4,F2.0).
DO REPEAT F = Loc1 TO Loc4 /B = Loc2 TO Loc5 /D = DifS1 TO DifS4.
COMPUTE D = B - F - 1.
END REPEAT.
And so DifS1 would be the zeroes between the 1st and 2nd treatment, DifS2 would be the zeroes between the 2nd and 3rd treatment etc. (Both this do repeat and the first loop could be made more efficient with a loop that only goes over valid/possible values.)
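The logic of the two steps (record treatment positions, then difference consecutive positions) can be sketched in Python as a cross-check. The function names gaps_between_treatments and first_series_zero are hypothetical, and None stands in for the system-missing value SPSS would return.

```python
def gaps_between_treatments(days):
    """Count the untreated days between each pair of consecutive treatments."""
    # 1-based positions of the treatment days, like Loc1, Loc2, ...
    locs = [i for i, d in enumerate(days, start=1) if d == 1]
    # Difference of consecutive positions minus one, like DifS1, DifS2, ...
    return [b - a - 1 for a, b in zip(locs, locs[1:])]

def first_series_zero(days):
    """Untreated days between the 1st and 2nd treatment, or None if undefined."""
    gaps = gaps_between_treatments(days)
    return gaps[0] if gaps else None

print(first_series_zero([1, 0, 0, 1, 0]))  # 2
print(first_series_zero([1, 0, 1, 0, 1]))  # 1
```

As in the SPSS version, the result does not depend on day1 being a treatment day, and rows with fewer than two treatments yield no value.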