SAS to R conversion of merge

I am currently working on converting a SAS macro into R code. I have worked a lot with R but I am relatively new to SAS. I am having trouble understanding the SAS code for a merge statement:
data dates;
merge A(keep=date rename=(date=beg))
A(keep=date firstobs= 5 rename=(date=end))
A(keep=date firstobs= 10 rename=(date=bega))
A(keep=date firstobs= 15 rename=(date=ee))
A(keep=date firstobs= 30 rename=(date=eend));
index+1;
if nmiss(beg,end,bega,eend,ee)=0;
run;
I understand that this statement joins the file A to itself 5 times, but I am not able to visualize the output. What do 'index+1' and 'if' stand for?
What is the R version for this code?

I'm not very familiar with R, but I know some SAS. I'm not sure I would call this a macro... The output of your merged data set will depend on what your input data set looks like. Just run your code, and you'll be able to see the result in your work library.
Generally, the data step is structured like an implicit loop over the input observations. index+1 is a sum statement, with the syntax variable+expression. The variable starts at 0 and its value is retained from one iteration to the next instead of being reset, so index ends up counting the observations processed.
The if statement here contains a boolean condition (i.e. it is either true or false) that restricts which rows the data step outputs. If the condition is true, the current row is written out; if it is false, the row is dropped. nmiss(var1,var2,var3,...) is a function that returns the number of its arguments that are missing. E.g. if only var1 is missing, nmiss(var1,var2,var3,...) = 1.

As Yick says, the index+1 statement creates a new variable in your output data set that begins with one and increments for each observation processed.
Used like this, the nmiss(...) test forms a subsetting IF statement, meaning that observations with a non-zero result (at least one missing value) are not written out to your final dataset.
The best way to visualize the results will be for you to run this code twice using a small test dataset, once using that if statement and once without. For example:
data a;
do i=1 to 50;
date = today() + i;
output;
end;
run;
data dates1;
merge A(keep=date rename=(date=beg))
A(keep=date firstobs= 5 rename=(date=end))
A(keep=date firstobs= 10 rename=(date=bega))
A(keep=date firstobs= 15 rename=(date=ee))
A(keep=date firstobs= 30 rename=(date=eend));
index+1;
if nmiss(beg,end,bega,eend,ee)=0;
format beg end bega ee eend yymmdd10.;
run;
data dates2;
merge A(keep=date rename=(date=beg))
A(keep=date firstobs= 5 rename=(date=end))
A(keep=date firstobs= 10 rename=(date=bega))
A(keep=date firstobs= 15 rename=(date=ee))
A(keep=date firstobs= 30 rename=(date=eend));
index+1;
format beg end bega ee eend yymmdd10.;
run;
After running the above, open both datasets in SAS and compare them side by side. The effect of the subsetting IF statement should be obvious, and seeing it will probably help you understand why this was done (a clever trick, by the way). I added a FORMAT statement to make the dates a bit easier to read.
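For readers coming from R, the same comparison can be sketched without SAS. This is only an illustration under assumptions: the data frame a with 50 consecutive dates and the helper from_obs() are hypothetical stand-ins, and the sketch imitates the one-to-one merge by padding each shifted column with NA.

```r
# Hypothetical stand-in for data set A: 50 consecutive dates
a <- data.frame(date = as.Date("2024-01-01") + 0:49)
n <- nrow(a)

# Imitate firstobs=k: start reading at element k and keep n values;
# indexing past the end of the vector yields NA, like a source that ran out
from_obs <- function(x, k) x[seq(k, length.out = length(x))]

# dates2: the merge without the subsetting IF (all n rows, NA-padded tails)
dates2 <- data.frame(
  beg   = a$date,
  end   = from_obs(a$date, 5),
  bega  = from_obs(a$date, 10),
  ee    = from_obs(a$date, 15),
  eend  = from_obs(a$date, 30),
  index = seq_len(n)               # index+1 counts every row processed
)

# dates1: keep only rows where the count of missing values is zero,
# mirroring "if nmiss(beg,end,bega,eend,ee)=0;"
vars <- c("beg", "end", "bega", "ee", "eend")
n_missing <- rowSums(is.na(dates2[vars]))
dates1 <- dates2[n_missing == 0, ]
```

Comparing tail(dates2) with tail(dates1) shows the NA-padded rows that the subsetting IF removes.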

It's been a while since I wrote R (so this might not be the best code), but this would be roughly equivalent to
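n = nrow(a)
# Note: cbind() builds a numeric matrix, so SAS-style numeric dates pass
# through unchanged, but R Date values are reduced to their day counts
dates = data.frame(cbind(
  1:(n-29),              # index
  a[1:(n-29),"date"],    # beg: no firstobs, starts at row 1
  a[5:(n-25),"date"],    # end: firstobs=5
  a[10:(n-20),"date"],   # bega: firstobs=10
  a[15:(n-15),"date"],   # ee: firstobs=15
  a[30:n,"date"]         # eend: firstobs=30
))
names(dates) = c("index","beg","end","bega","ee","eend")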
As you said, you are merging A onto itself 5 times. As others have said, the index+1 statement simply acts as a row index count. The if nmiss(...)=0; statement means you only get rows where everything lines up.
So use the cbind() function in R to do the merge. cbind() needs inputs of equal length here (shorter ones would be recycled), so you have to adjust your ranges. These ranges are the equivalent of the firstobs= option on each input data set combined with the subsetting if ... ; statement.
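If date is stored as an R Date rather than a plain number, a variant that skips the matrix step keeps the Date class. This is a sketch under assumptions, not the poster's code: the 50-row data frame a and the variable keep are made up for illustration.

```r
# Hypothetical input: 50 consecutive dates
a <- data.frame(date = as.Date("2024-01-01") + 0:49)
n <- nrow(a)
keep <- n - 29   # number of rows that survive the subsetting IF

# Build the columns directly in data.frame() so each stays a Date
dates <- data.frame(
  index = seq_len(keep),
  beg   = a$date[1:keep],
  end   = a$date[5:(keep + 4)],
  bega  = a$date[10:(keep + 9)],
  ee    = a$date[15:(keep + 14)],
  eend  = a$date[30:n]
)

# For comparison: going through cbind() drops the Date class
as_matrix <- cbind(1:2, a$date[1:2])
```

Here class(dates$beg) is "Date", while as_matrix is an ordinary numeric matrix of day counts.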

Related

How to Plot a Series of Rates Over Time in SAS

I have 3 data sets: "Complete", "Incomplete", and "Case_List". "Complete" contains records of individuals who have had a full series of a vaccine; "Incomplete" is identical except that the number of doses is less than the full series; and "Case_List" contains confirmed cases of a specific infection. Each data set contains a date, which I have transformed into week of the year (1:53); the individual's age, which I have divided into age groups (easiest to refer to them as 1:8, but they're character variables); and an ID. Every ID/record in "Complete" is, by definition, in "Incomplete", as the individual received dose 1 before dose 2, but I don't have access to any personal identifiers to link them to the "Case_List" IDs.
I am new to SAS and have yet to find enough instruction on plotting to be able to plot a graph with Case_List over Week (1:53) overlaid with Incomplete over Week (1:53) and Complete over Week (1:53), all of that broken down by Age_Group (1:8). If I can't get it figured out, I will just plot everything in R.
Other thoughts:
Is it easier to merge Incomplete and Complete so there are only two data sets?
Is 8 iterations of a graph that already contains 3 lines going to be too messy for one plot?
Thanks for your help.
In SAS, you can't overlay plots from multiple datasets - you need to combine everything into one dataset.
You don't have to "merge" anything, though, just set them together and add a "category" variable.
data incompletes completes case_list;
call streaminit(7);
do week = 1 to 53;
do _i = 1 to 200;
age = rand('Integer',1,8);
_output = rand('Uniform');
if _output lt (0.1+week/100) then output completes;
if _output lt (0.2+week/80) then output incompletes;
if _output lt (0.2-((week/150)**2)) then output case_list;
end;
end;
run;
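data total;
length category $13; /* long enough for "Disease Cases"; otherwise the first assignment fixes the length at 8 and truncates the longer values */
set completes(in=_comp) incompletes(in=_incomp) case_list(in=_case);
if _comp then category="Complete";
else if _incomp then category="Incomplete";
else category="Disease Cases";
run;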
Then you can overlay plots, depending on exactly what you want to do.
proc sgplot data=total;
vline week/group=category;
run;
You could add paneling by age as noted in the comments, or you have a few other options depending on what exactly you do, but I think this gets at what you really want to know - how do I overlay plots in SAS.
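Since the question mentions falling back to R, the same "stack the data sets and add a category variable" idea can be sketched there too. The three tiny data frames below are made-up toy inputs, not the real data, and only the week column is modelled.

```r
# Hypothetical toy inputs; only the week column matters for this sketch
completes   <- data.frame(week = c(1, 1, 2))
incompletes <- data.frame(week = c(1, 2, 2, 3))
case_list   <- data.frame(week = c(2, 3))

# Stack the three data frames, tagging each row with its source
total <- rbind(
  cbind(completes,   category = "Complete"),
  cbind(incompletes, category = "Incomplete"),
  cbind(case_list,   category = "Disease Cases")
)

# One row per week, one column per category: the counts behind each line
counts <- table(total$week, total$category)
```

A single long table like total is also the shape that grouped plotting functions generally expect, so one plotting call can draw all three lines.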

Vectorizing R custom calculation with dynamic day range

I have a big dataset (around 100k rows) with 2 columns referencing a device_id and a date and the rest of the columns being attributes (e.g. device_repaired, device_replaced).
I'm building a ML algorithm to predict when a device will have to be maintained. To do so, I want to calculate certain features (e.g. device_reparations_on_last_3days, device_replacements_on_last_5days).
I have a function that subsets my dataset and returns a calculation:
For the specified device,
That happened before the day in question,
As long as there's enough data (e.g. if I want last 3 days, but only 2 records exist this returns NA).
Here's a sample of the data and the function outlined above:
data = data.frame(device_id=c(rep(1,5),rep(2,10))
,day=c(1:5,1:10)
,device_repaired=sample(0:1,15,replace=TRUE)
,device_replaced=sample(0:1,15,replace=TRUE))
# Example: how many times device 1 was repaired over the last 2 days before day 3
# => getCalculation(3,1,data,"device_repaired",2)
getCalculation <- function(fday,fdeviceid,fdata,fattribute,fpreviousdays){
# Subset dataset
df = subset(fdata,day<fday & day>(fday-fpreviousdays-1) & device_id==fdeviceid)
# Make sure there's enough data; if so, make calculation
if(nrow(df)<fpreviousdays){
calculation = NA
} else {
calculation = sum(df[,fattribute])
}
return(calculation)
}
My problem is that the number of attributes available (e.g. device_repaired) and of features to calculate (e.g. device_reparations_on_last_3days) has grown exponentially, and my script takes around 4 hours to execute, since I need to loop over each row and calculate all these features.
I'd like to vectorize this logic using some apply approach which would also allow me to parallelize its execution, but I don't know if/how it's possible to add these arguments to a lapply function.
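On the narrow point of passing extra arguments to an apply-style call, here is a minimal sketch using mapply with MoreArgs. It reuses the getCalculation logic from the question on a small made-up data set with fixed values, and the output column name is chosen only for illustration. It shows the calling pattern, not a speed-up: the function still subsets the data once per row.

```r
# Same logic as the question's function, restated so the sketch is self-contained
getCalculation <- function(fday, fdeviceid, fdata, fattribute, fpreviousdays) {
  df <- subset(fdata, day < fday & day > (fday - fpreviousdays - 1) &
                 device_id == fdeviceid)
  # Not enough history: return NA, otherwise the sum over the window
  if (nrow(df) < fpreviousdays) NA else sum(df[, fattribute])
}

# Hypothetical sample with fixed values so the result is reproducible
data <- data.frame(
  device_id       = c(rep(1, 5), rep(2, 3)),
  day             = c(1:5, 1:3),
  device_repaired = c(1, 0, 1, 1, 0, 0, 1, 1)
)

# The per-row arguments (day, device_id) are passed as vectors;
# the arguments shared by every call go in MoreArgs
data$device_reparations_on_last_2days <- mapply(
  getCalculation, data$day, data$device_id,
  MoreArgs = list(fdata = data, fattribute = "device_repaired",
                  fpreviousdays = 2)
)
```

parallel::mcmapply accepts the same MoreArgs argument, so the call shape should carry over if the work is later spread across cores.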

Giving a unique identifier to every execution in R

Can anyone please tell me how to assign a unique value to a result set every time it's executed? As displayed in the table below, an entry should be added to every record, and this entry should be the same for the whole result set obtained during a single execution. The purpose of this is to extract all the records in future with a short statement like (where Unique ID = A_Ground_01). Thanks
User DateTime Latitude Longitude Floor **Unique ID**
1 A 2017-06-15 47.29404 5.010650 Ground A_Ground_01
2 A 2017-06-15 47.29403 5.010634 Ground A_Ground_01
3 A 2017-06-15 47.29403 5.010668 Ground A_Ground_02
4 A 2017-06-15 47.29403 5.010663 Ground A_Ground_02
Without knowing anything about your initial dataframe or the function being executed, I might recommend something similar to the following.
In this example I'll assume you have a main dataframe, which we'll call df.main, and some new data you will be binding to the main dataframe, which we'll call df.newdata.
Create a column in your main dataframe called df.main$ExecID that will contain integer values.
Run whatever your function is and assign df.newdata$ExecID <- max(df.main$ExecID) + 1
Generate the unique id using df.newdata$UniqueID <- paste(df.newdata$User, df.newdata$Floor, df.newdata$ExecID, sep = "_")
Then run rbind(df.main, df.newdata)
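Put together, the steps above might look like the following sketch. The contents of df.main and df.newdata are invented for illustration; only the column names User, Floor, ExecID and UniqueID come from the answer.

```r
# Hypothetical main data frame, already holding the results of execution 1
df.main <- data.frame(
  User   = c("A", "A"),
  Floor  = c("Ground", "Ground"),
  ExecID = c(1L, 1L)
)
df.main$UniqueID <- paste(df.main$User, df.main$Floor, df.main$ExecID, sep = "_")

# Hypothetical result set from a new execution
df.newdata <- data.frame(
  User  = c("A", "A"),
  Floor = c("Ground", "Ground")
)

# Every row of this execution gets the next ExecID, then its UniqueID
df.newdata$ExecID   <- max(df.main$ExecID) + 1L
df.newdata$UniqueID <- paste(df.newdata$User, df.newdata$Floor,
                             df.newdata$ExecID, sep = "_")

# Append, then pull one execution back out by its identifier
df.main <- rbind(df.main, df.newdata)
second_run <- df.main[df.main$UniqueID == "A_Ground_2", ]
```

To get zero-padded identifiers such as A_Ground_02, as in the question's table, the ExecID could be wrapped in sprintf("%02d", ...) before pasting.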
To provide a better solution for your specific situation, we really would need to see example code of how your script is written.

PROC SQL with GROUP command extremely slow. Why? Workaround possible?

I have a MACRO which takes a data set D and essentially outputs k disjoint datasets, D_1,...,D_k. The value k is not fixed and depends on properties of the data that are not known in advance. We can assume that k is not larger than 10, though.
The dataset D contains the variables x and y, and I want to overlay the line/scatter plots of x and y for each of D_i over each other. In my particular case x is time, and I want to see the output y for each D_i and compare them to each other.
Hopefully that was clear.
How can I do this? I don't know k in advance, so I need some sort of %do loop. But it doesn't seem that I can put a do loop inside "proc sgplot".
I might be able to make a macro that includes a very long series of commands, but I'm not sure.
How can I overlay these plots in SAS?
EDIT: I am including for reference why I am trying to avoid doing a PROC SGPLOT with the GROUP clause. I tried the following code and it is taking over 30 minutes to compute (I canceled the calculation after this, so I don't know how long it will actually take). PROC SQL runs quite quickly, the program is stuck on PROC SGPLOT.
proc sql;
create table dataset as select
date, product_code, sum(num_of_records) as total_rec
from &filename
group by product_code, date
order by product_code, date
;
quit;
PROC SGPLOT Data = dataset;
scatter x = date y = total_rec/group=product_code;
title "Total records by product code";
run;
The number of observations in the file is 76,000,000.
What you should do is either change your macro to produce one dataset with a variable d_i (or whatever you can logically name it) which identifies which dataset it would've gone to (or identifies it with whatever determines what dataset it would've gone to), or post-macro combine the datasets.
Then, you can use group to overlay your plots. So for example:
data my_data;
call streaminit(7);
do d_i = 1 to 5;
y = 10;
x = 0;
output;
do x = 1 to 10;
y + round(rand('Uniform')*3,.1)-1.5;
output;
end;
end;
run;
proc sgplot data=my_data;
series x=x y=y/group=d_i;
run;

Conditional Label in R without Loops

I'm trying to find the best way (best as in performance) to add a new column called "Season", holding one of the four seasons of the year, to a data frame of the form:
MON DAY YEAR
1 1 1 2010
2 1 1 2010
3 1 1 2010
4 1 1 2010
5 1 1 2010
6 1 1 2010
One straightforward way to do this is to create a loop conditioned on the MON and DAY columns and assign the values one by one, but I think there is a better way. I've seen suggestions in other posts for ifelse or := or apply, but in most of them the problem is binary or the value can be assigned by a single function f of the parameters.
In my situation I believe a vector containing the four season labels plus the conditions would somehow suffice, but I don't see how to put everything together. My situation is more like a switch case.
Using modulo arithmetic and the fact that arithmetic operators coerce logical-values to 0/1 will be far more efficient if the number of rows is large:
d$SEASON <- with(d, c( "Winter","Spring", "Summer", "Autumn")[
1+(( (DAY>=21) + MON-1) %/% 3)%%4 ] )
The first added "1" shifts the results of the %%4 operation on everything inside the parentheses from 0:3 to 1:4. The subtracted "1" shifts the (inner) 1:12 month range to 0:11, and the (DAY >= 21) term advances the boundary months forward by one.
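To see the arithmetic at work, the expression can be wrapped in a small function and tried on dates either side of each boundary. The function name season and the checks data frame are illustrative only; the expression itself is the one from the answer.

```r
# Wrap the answer's expression so it can be tried on chosen dates
season <- function(MON, DAY) {
  c("Winter", "Spring", "Summer", "Autumn")[
    1 + (((DAY >= 21) + MON - 1) %/% 3) %% 4
  ]
}

# Dates on both sides of the boundaries (month, day)
checks <- data.frame(
  MON = c(1, 3, 3, 6, 9, 12, 12),
  DAY = c(1, 20, 21, 21, 20, 20, 21)
)
checks$SEASON <- season(checks$MON, checks$DAY)
# checks$SEASON is:
# "Winter" "Winter" "Spring" "Summer" "Summer" "Autumn" "Winter"
```

For 21 December the inner sum is 12, the integer division gives 4, and the %%4 wraps it back to 0, which is why the year ends in "Winter" again.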
I'll start by giving a simple answer then I'll delve into the details.
A quick way to do this would be to check the values of MON and DAY and output the correct season. This is trivial:
f=function(m,d){
# season codes: 0 = Spring, 1 = Summer, 2 = Autumn, 3 = Winter
if(m==12 && d>=21) i=3
else if(m>9 || (m==9 && d>=21)) i=2
else if(m>6 || (m==6 && d>=21)) i=1
else if(m>3 || (m==3 && d>=21)) i=0
else i=3
i # return the code explicitly
}
This f function, given a month and a day, will return an integer corresponding to the season (it doesn't matter much whether it's an integer or a string; an integer just saves a bit of memory, but that's a technicality).
Now you want to apply it to your data.frame. There is no need for a loop; we'll use mapply. d will be our simulated data.frame. We'll convert the output to a factor to get nice season names.
d=data.frame(MON=rep(1:12,each=30),DAY=rep(1:30,12),YEAR=2012)
d$SEA=factor(
mapply(f,d$MON,d$DAY),
levels=0:3,
labels=c("Spring","Summer","Autumn","Winter")
)
There you have it !
I realize seasons don't always change on the 21st. If you need fine tuning, you could define a 3-dimensional array as a global variable to store the accurate days. Given a season and a year, you could look up the corresponding day and replace the "21"s in the f function with the right lookups (you would obviously add a third argument for the year).
About the things you mentioned in your question:
ifelse is the "functional" way to make a conditional test. On atomic variables it's only slightly better than conditional statements, but it is vectorized, meaning that if the argument is a vector, it works through the elements itself. I'm not familiar with it, but it's the way to go for an optimized solution.
mapply is a multivariate version of sapply, from the "apply family", and calls a function with several arguments on vectors (see ?mapply).
I don't think := is a standard operator in R, which brings me to my next point:
data.table! It's a package that provides a new structure extending data.frame for fast computing and typing (among other things). := is an operator in that package that defines new columns. In our case you could write d[,SEA:=mapply(f,MON,DAY)] if d is a data.table.
If you really care about performance, I can't insist enough on using data.table, as it is a major improvement if you have a lot of data. I don't know whether it would really change the computing time of the solution I proposed, though.
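For completeness, the vectorized ifelse route mentioned above might look like the sketch below. It is an assumption-laden illustration rather than a benchmarked solution: the helper on_or_after is hypothetical, and the same 21st-of-the-month boundaries as in f are assumed.

```r
# Simulated data: 12 months of 30 days each, as in the mapply example
d <- data.frame(MON = rep(1:12, each = 30), DAY = rep(1:30, 12), YEAR = 2012)

# Hypothetical helper: TRUE where the date falls on or after the 21st of `month`.
# It uses | and & (not || and &&) so it works element-wise on whole columns.
on_or_after <- function(m, day, month) m > month | (m == month & day >= 21)

# Nested ifelse() calls test the latest boundary first, like the if/else chain in f
d$SEA <- with(d,
  ifelse(on_or_after(MON, DAY, 12), "Winter",
  ifelse(on_or_after(MON, DAY, 9),  "Autumn",
  ifelse(on_or_after(MON, DAY, 6),  "Summer",
  ifelse(on_or_after(MON, DAY, 3),  "Spring", "Winter"))))
)
```

Because every comparison runs on whole columns at once, no per-row function call is needed, which is what the mapply version does for each of the 360 rows.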
