I have a MACRO which takes a data set D and essentially outputs k disjoint datasets, D_1,...,D_k. The value k is not fixed and depends on properties of the data that are not known in advance. We can assume that k is not larger than 10, though.
The dataset D contains the variables x and y, and I want to overlay the line/scatter plots of x and y for each of D_i over each other. In my particular case x is time, and I want to see the output y for each D_i and compare them to each other.
Hopefully that was clear.
How can I do this? I don't know k in advance, so I need some sort of %do loop. But it doesn't seem that I can put a do loop inside "proc sgplot".
I might be able to make a macro that includes a very long series of commands, but I'm not sure.
How can I overlay these plots in SAS?
EDIT: I am including for reference why I am trying to avoid doing a PROC SGPLOT with the GROUP clause. I tried the following code and it is taking over 30 minutes to compute (I canceled the calculation after this, so I don't know how long it will actually take). PROC SQL runs quite quickly, the program is stuck on PROC SGPLOT.
proc sql;
create table dataset as select
date, product_code, sum(num_of_records) as total_rec
from &filename
group by product_code, data
order by product_code, date
;
quit;
PROC SGPLOT Data = dataset;
scatter x = date y = total_rec/group=product_code;
title "Total records by product code";
run;
The number of observations in the file is 76,000,000.
What you should do is either change your macro to produce one dataset with a variable d_i (or whatever you can logically name it) which identifies which dataset it would've gone to (or identifies it with whatever determines what dataset it would've gone to), or post-macro combine the datasets.
Then, you can use group to overlay your plots. So for example:
data my_data;
call streaminit(7);
do d_i = 1 to 5;
y = 10;
x = 0;
output;
do x = 1 to 10;
y + round(rand('Uniform')*3,.1)-1.5;
output;
end;
end;
run;
proc sgplot data=my_data;
series x=x y=y/group=d_i;
run;
I am working with a large federal dataset with thousands of observations and thousands of variables. Replicate weights are provided. I am using the "survey" package in R to apply these weights:
els.weighted=svrepdesign(data=els, repweights = ~els$F3F1PNLWT,
combined.weights = TRUE).
I am interested in some categorical descriptive characteristics of a subset of the population, such as family living arrangements. I want to get these sorted out into a contingency table that shows frequency. I would like to sort people based on four variables (none of which are binary, but all of which are numeric) This is what I would like to get:
.
The blank boxes are where the cross-tabulation/frequency counts would show. (I only put in 3 columns beneath F1COMP for brevity's sake, but it has 9 outcomes – indexed 1-9)
My current code: svyby(~F1FCOMP, ~F1RTRCC +BYS33C +F1A10 +byurban, els.weighted, svytotal)
This code does sort the data, but it sorts every single combination, by default. I want them pared down to represent only specific subpopulations of each variable. I tried:
svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C==1 +F1A10==2 | F1A10==3 +byurban==3, els.weighted, svytotal)
But got stopped:
Error: unexpected '==' in "svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C=="
Additionally, my current version of the code tells me how many cases occur for each combination, This is a picture of what my current output looks like. There are hundreds more rows, 1 for each combination, when I keep scrolling down.
This is a picture of what my current output looks like. There are hundreds more rows, 1 for each combination, when I keep scrolling down
.
You can see in that picture that I only get one number for F1FCOMP per row – the number of cases who fit the specified combination – a specific subpopulation. I want to know more about that subpopulation. That is, F1COMP has nine different outcomes (indexed 1-9), and I want to see how many of each subpopulation fits into each of the 9 outcomes of F1COMP.
I'm afraid this question has two sub parts. My project is to determine which insurance carrier has the lowest cost based on CPT Codes. Since there are so many CPT Codes I wanted to group them using cut like this:
uCPTCode<- unique(data$CPTCode)
uCPTCode <- cut(uCPTCode,
breaks = c(-Inf, "01999", "69979", "79999", "89398", "99091", "99499", Inf),
labels = c("NA","Anesthesia", "Surgery", "Radiology", "Pathology&Laboratory", "Medicine","Evaluation&Management", "Temp"),
right = FALSE)
Not sure unique is required or wise, but seemed to make sense to me. The issue is that some codes have leading zeros and terminating letters like this
2608 Levels: 0014F 0159T 0164T 0191T 0195T 0232T 0319T 0326T 0513F 0517F 0518F
So question 1 is what is the process to convert these ranges into integers corresponding to the labels I have in the cut function so I can graph the grouped results the x axis?
Question 2 is that I expected the ranges to be continuous, but they are not. How to I manage what happens around code 99000 through 99216 where previous groups (Medicine, Anesthesiology and Evaluation and Management) get combined? Here is a link to the CPT grouper file https://www.dropbox.com/s/wm55n17pufoacww/CPTGrouper.xlsx?dl=0
Here is a smattering of results to see where I am going with it
https://www.dropbox.com/s/h6sdnvm9yew6jdg/SampleStudyResults.xlsx?dl=0
Thanks very much for your time and attention
I'm new to R and my problem is I know what I need to do, just not how to do it in R. I have an very large data frame from a web services load test, ~20M observations. I has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a coule of things. I need to subset my data frame for the top 50 distinct URIs then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time by URI
I have read, and am still reading various posts here on this topic but R is pretty new and I have a deadline. I'd appreciate any help I can get
EDIT:
I can't provide exact data but it looks like this, its at least 20M observations I'm retrieving from a Mongo database. Time is epoch and we're recording many thousands per second so time has a lot of dupes, thats expected. There could be more than 50 uri, I only care about the top 50. The end result would be a line plot over time of % TCP_HIT to the total occurrences by URI. Hope thats clearer
time uri action
1355683900 /some/uri TCP_HIT
1355683900 /some/other/uri TCP_HIT
1355683905 /some/other/uri TCP_MISS
1355683906 /some/uri TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the ratio of hits for a subset (using the order of factor levels, TCP_HIT=1, TCP_MISS=2 as alphabetical order is used by default), with ten-second intervals:
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
FUN=function(x) sum((2-as.numeric(x))/length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
function(l) list(uri=levels(u$uri)[l],
hits=ratio(u[as.numeric(u$uri) == l,])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"
[[1]]$hits
u$time%/%10 u$action
1 135568390 0.5
[[2]]
[[2]]$uri
[1] "/some/uri"
[[2]]$hits
u$time%/%10 u$action
1 135568390 0.5
Or otherwise filter the data frame by URI before computing the ratio.
#MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
You can see why visually here--data.table is just faster.
Thought it would be useful to share my solution to the plotting part of them problem.
My R "noobness" my shine here but this is what I came up with. It makes a basic line plot. Its plotting the actual value, I haven't done any conversions.
for ( i in 1:length(h)) {
name <- unlist(h[[i]][1])
dftemp <- as.data.frame(do.call(rbind,h[[i]][2]))
names(dftemp) <- c("time", "cache")
plot(dftemp$time,dftemp$cache, type="o")
title(main=name)
}
I have some data formatted as below. I have done some analysis on this and would like to be able to plot the price development in the same graph as the analyzed data.
This requires me to have the same x-axes for the data.
So I would like to aggregate the "shares" column in say 150 increments, and add the "finalprice" and "time" to this.
The aggregation should include the latest time and price, so if the aggregation needs to occur over two or more rows of data then the last row should provide the price and time data.
My question is how to create a new vector with 150 shares per row.
The length of the vector will equal sum(shares)/150.
Is there an easy way to do this? Thanks in advance.
Edit:
I thought about expanding the observations using rep(finalprice, shares) and then getting each 150th value of the expanded vector.
Data sample:
"date","ord","shares","finalprice","time","stock"
20120702,E,2000,99.35,540.84753333,500
20120702,E,28000,99.35,540.84753333,500
20120702,E,50,99.5,542.03073333,500
20120702,E,13874,99.5,542.29411667,500
20120702,E,292,99.5,542.30191667,500
20120702,E,784,99.5,542.30193333,500
20120702,E,13300,99.35,543.04805,500
20120702,E,16658,99.35,543.04805,500
20120702,E,42,99.5,543.04805,500
20120702,E,400,99.4,546.17173333,500
20120702,E,100,99.4,547.07,500
20120702,E,2219,99.3,549.47988333,500
20120702,E,781,99.3,549.5238,500
20120702,E,50,99.3,553.4052,500
20120702,E,1500,99.35,559.86275,500
20120702,E,103,99.5,567.56726667,500
20120702,E,1105,99.7,573.93326667,500
20120702,E,4100,99.5,582.2657,500
20120702,E,900,99.5,582.2657,500
20120702,E,1024,99.45,582.43891667,500
20120702,E,8214,99.45,582.43891667,500
20120702,E,10762,99.45,582.43895,500
20120702,E,1250,99.6,586.86446667,500
20120702,E,5000,99.45,594.39061667,500
20120702,E,20000,99.45,594.39061667,500
20120702,E,15000,99.45,594.39061667,500
20120702,E,4000,99.45,601.34491667,500
20120702,E,8700,99.45,603.53608333,500
20120702,E,3290,99.6,609.23213333,500
I think I got it solved.
expand <- rep(finalprice, shares)
Increment <- expand[seq(from = 1, to = length(expand), by = 150)]