Plotting pre-aggregated data - r

I have pre-calculated data with the amount on the x axis and the count (as a proportion) on the y axis.
What I would like is the functionality I would get from stat="bin". Because the dataset is large, I can't simply use rep to explode the data back to its original form and then rebin it.
For example:
I would like to be able to smooth the data, like I would have been able to by using binwidth.
Also, I'm plotting this data using geom_freqpoly. However, where a specific amount is missing from the x axis I'd like the line to drop to 0 there, as ggplot's binning would do, instead of joining straight to the next point.

Since no one had a response for ggplot, I used rep to re-expand and sample the data.
So, if I had 18 million observations originally, I used 180,000 for the times argument of rep, multiplied by my previously calculated proportion of the data. I'm not sure what the threshold would then be for the times argument (if it's less than 1, will no data point be created?). This means I lose the less frequent observations altogether, but this is OK in my case.
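On the threshold question: as far as I know, rep coerces a non-integer times to an integer by truncating towards zero, so any value below 1 produces no copies at all. A small sketch with made-up values (amount, prop and scale are hypothetical names, and the scale of 20 stands in for the 180,000 above):

```r
# Hypothetical pre-aggregated data: distinct amounts and their proportions
amount <- c(10, 20, 30)
prop   <- c(0.50, 0.12, 0.04)
scale  <- 20                      # stand-in for the 180,000 used above

times <- prop * scale             # roughly 10, 2.4 and 0.8
expanded <- rep(amount, times = times)

# Fractional times are truncated: 2.4 gives 2 copies, 0.8 gives none
table(expanded)
```

If values of 0.5 or more should survive, wrapping the product in round() before passing it to rep is one option.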

Many of the ggplot stat functions will accept a weight as part of the aesthetic, e.g.: aes(x=X, y=Y, weight=n). Depending on your version, a couple even complain about the "unused aesthetic, 'weight'", but then go on to do the right thing! I've used this with geom_histogram, stat_bin2d, and probably others.
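A minimal sketch of the weight approach, assuming ggplot2 is installed. The data frame agg and its columns amount and prop are made up for illustration. Because the binning is done by stat_bin, bins with no data come out as 0 rather than being skipped:

```r
library(ggplot2)

# Hypothetical pre-aggregated data: one row per distinct amount
agg <- data.frame(
  amount = c(1, 2, 3, 5, 8, 9, 10),
  prop   = c(0.10, 0.20, 0.25, 0.15, 0.10, 0.15, 0.05)
)

# weight makes each row count as `prop` observations when binning
p <- ggplot(agg, aes(x = amount, weight = prop)) +
  geom_freqpoly(binwidth = 2)

# Inspect the binned values that ggplot2 computed for the layer
binned <- layer_data(p)
print(p)
```

Changing binwidth then smooths the curve just as it would with the raw data.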

Related

Adding multiple lines to plot, without ggplot

I would like to plot multiple lines on the same plot, without using ggplot.
I have scores for different individuals across a set time period and wish to plot a line between yearly scores for each individual. Data is organised with each row representing an individual and each column an observed value in a given year.
Currently I am using a for loop, but am aware that this is often not efficient in R, and am interested if there are any more suitable approaches available within base R.
I will be working with up to 100,000 individuals.
Thanks.
Code:
df=data.frame(runif(10,0,100),runif(10,0,100),runif(10,0,100),runif(10,0,100))
df=data.frame(t(df))
Years=seq(1,10,1)
plot(1,type="n",xlab="Year",ylab="Score", xlim=c(1,10), ylim=c(0,100))
for(x in 1:4){lines(Years,df[x,])}
Efficiency is not much of a consideration when plotting since plotting to a device is a slow operation in itself. You can use matplot (which uses a loop internally). It's basically a more sophisticated version of your code wrapped in a function.
matplot(Years, t(df), xlab="Year", ylab="Score", type = "l")
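If the loop itself is the concern, another base R option is to draw every individual with a single lines() call, separating the individuals with NA so the line is broken between them. This is only a sketch with made-up data; scores and years are hypothetical names, with one row per individual as in the question:

```r
set.seed(1)
scores <- matrix(runif(40, 0, 100), nrow = 4)  # 4 individuals x 10 years
years  <- 1:10

# Append an NA after each individual's series; lines() breaks at NA
x <- rep(c(years, NA), times = nrow(scores))
y <- as.vector(t(cbind(scores, NA)))

plot(1, type = "n", xlab = "Year", ylab = "Score",
     xlim = range(years), ylim = c(0, 100))
lines(x, y)
```

All lines share one colour here; matplot remains the simpler choice when each line needs its own colour or line type.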

R - Plotting 2 different sized variables with different time axis with ggplot

I have two variables that are different sizes. These variables were measured on very similar time scales, but they are different by a few days (data spans about half a year). Here are the variables, along with the time axes:
data1 # double, length of 229081
time1 # double, length of 229081
data2 # double, length of 230842
time2 # double, length of 230842
I want to plot these two variables as line plots on the same graph using ggplot. I've tried the following but to no avail:
data = data.frame(data1, data2)
qplot(time2, data, color=colors, scale_colour_identity(guide="legend"),
geom=c("line", "line"))
I feel like the above should work, since I'm using the bigger time axis, which encompasses the smaller data set. I tried looking for a similar question on Stack Overflow, but couldn't quite find one.
You will need to pass a separate data frame to an additional call to geom_line.
As you didn't provide a reproducible example (it's good to do so!), here is some made-up data:
data1<-data.frame(d1=runif(100),t1=sample(1:1000,100))
data2<-data.frame(d1=runif(98),t1=sample(1:2000,98))
Then you can plot, with time on the x axis and the colour set outside aes() so that the second line is actually drawn in red:
ggplot(data1,aes(t1,d1))+geom_line()+geom_line(data=data2,aes(t1,d1),color="red")
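A variation that also produces a legend is to stack the two series into one long data frame and map colour to a label column. This is a sketch on made-up data of the same shape; the series column is a name introduced here for illustration:

```r
library(ggplot2)

set.seed(1)
data1 <- data.frame(d1 = runif(100), t1 = sample(1:1000, 100))
data2 <- data.frame(d1 = runif(98),  t1 = sample(1:2000, 98))

# Stack both series and record which one each row came from
both <- rbind(cbind(data1, series = "data1"),
              cbind(data2, series = "data2"))

# One geom_line call; colour is mapped to the label, so a legend appears
p <- ggplot(both, aes(x = t1, y = d1, colour = series)) +
  geom_line()
print(p)
```

The two series can have different lengths and different time values, since each row carries its own time.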

ggplot making a descriptive bar graph with no clear y variable

I've been trying to create a proportional stacked bar graph using ggplot and a huge data set that is one column of a dummy variable and one column a factor variable with 14 different levels.
I posted a small sample of the data here.
Despite not having a clear y-variable in my data, I can produce a plot, but it is only really useful for the factor levels that have a lot of observations. When a level has only one or two, you can't see the proportion at all. The code I used is here.
ggplot(data,aes(factor(data$factor),fill=data$dummy))+
geom_bar()
ggplot says you need to apply a ddply function to the data frame.
ce<-ddply(data,"factor",transform, percent_y=y/sum(y)*100)
Their example doesn't really apply in the case of this data since there's no clear y-variable to call in the plot; just counts of each factor that is 1 or 0.
My best guess for a ddply function spits out an error about differing numbers of rows.
ce<-ddply(plot,"factor(data$factor)",transform,
percent=sum(data$dummy)*100/(dim(data$dummy)[1]))
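One possible route that needs no y variable and no ddply step is to let geom_bar count the rows and rescale each bar to a proportion with position="fill". The sketch below uses made-up data; dat, group and dummy are hypothetical stand-ins for the factor and dummy columns described above:

```r
library(ggplot2)

# Hypothetical data: a factor with very uneven level counts and a 0/1 dummy
dat <- data.frame(
  group = factor(rep(c("a", "b", "c", "d"), times = c(120, 60, 15, 5))),
  dummy = factor(as.integer(seq_len(200) %% 3 == 0))
)

# position = "fill" stacks the counts and rescales every bar to height 1
p <- ggplot(dat, aes(x = group, fill = dummy)) +
  geom_bar(position = "fill") +
  ylab("Proportion")
print(p)

# The same proportions as a table, one row per factor level
props <- prop.table(table(dat$group, dat$dummy), margin = 1)
```

Every bar then has the same height, so levels with only a handful of observations stay readable.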

Is there a better way to plot multicolor lines in R than splitting the data?

Sequential portions of my time series are under different treatments, and I'd like to separately color a line connecting observations in each portion.
For example, in the series under treatment A I'd have a red line, and in the succeeding series under treatment B I'd have a blue line.
plot(response, type="l",col="treatment") failed - all observations were connected with a line the same color.
This mailing-list posting (http://r.789695.n4.nabble.com/Can-R-plot-multicolor-lines-td791081.html) proposed just splitting the data by treatment and then separately plotting each subset on the same plot.
Is there a more elegant way?
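An alternative using Map that avoids manually plotting segments (treatment is made a factor explicitly so that its integer codes can be used as colours):
dat <- data.frame(treatment=factor(rep(LETTERS[1:2],3:4)),
response=c(6,5,2,1,5,6,7),time=1:7)
plot(response ~ time, data=dat, type="n")
# one lines() call per treatment; invisible() hides the list that Map returns
invisible(Map(
function(x) lines(response ~ time, data=x, col=as.integer(x$treatment)),
split(dat, dat$treatment)
))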
There are two popular more elegant ways. One is to use the ggplot2 package. Without more information it's hard to advise you other than look at help or examples in various places. The other is to check out the function matplot. That will require you to first restructure your data as a matrix but it can easily do what you want. Keep in mind that while it says in the help, "Plot the columns of one matrix against the columns of another", the x-axis matrix can be a vector the same length as one column of a matrix containing your line information. The function will just recycle the x vector.
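The matplot route could look like the sketch below, using small made-up data. Each treatment becomes one matrix column with NA wherever the other treatment applies; the name wide is introduced here for illustration:

```r
dat <- data.frame(treatment = rep(c("A", "B"), times = 3:4),
                  response  = c(6, 5, 2, 1, 5, 6, 7),
                  time      = 1:7)

# One column per treatment, NA where that treatment does not apply
wide <- sapply(unique(dat$treatment),
               function(tr) ifelse(dat$treatment == tr, dat$response, NA))

# The x vector is recycled for every column of `wide`
matplot(dat$time, wide, type = "l", lty = 1, col = c("red", "blue"),
        xlab = "time", ylab = "response")
```

As with splitting, no segment joins the last point of treatment A to the first point of treatment B.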

Uniform plot points in R -- Research / HW

This is for research I am doing for my Masters Program in Public Health
I am plotting two variables against each other, a standard x-y plot, and on top of that I am plotting a predicted line. What I get is a funky-looking mix of points and boxplots with an x axis that is only half filled out, and I don't understand why, since I never call a boxplot function. My understanding is that when I call the plot function only the points will be plotted.
The data I am plotting looks like this
TOTAL.LACE | DAYS.TO.FAILURE
9 | 15
16 | 7
... | ...
The range of the TOTAL.LACE is from 0 to 19 and DAYS.TO.FAILURE is 0 - 30
My code is as follows, maybe it is something before the plot but I don't think it is:
# To control the type of symbol we use we will use psymbol, it takes
# value 1 and 2
psymbol <- unique(FAILURE + 1)
# Build a test frame that will predict values of the lace score due to
# a patient being in a state of failure
library(survival)  # provides Surv() and survreg()
test <- survreg(Surv(time = DAYS.TO.FAILURE, event = FAILURE) ~ TOTAL.LACE,
dist = "logistic")
pred <- predict(test, type="response")  # produces numbers from about 14 to 23
summary(pred)
ord <- order(TOTAL.LACE)
tl_ord <- TOTAL.LACE[ord]
pred_ord <- pred[ord]
plot(TOTAL.LACE, DAYS.TO.FAILURE, pch=unique(psymbol))  # produces goofy graph
lines(tl_ord, pred_ord)  # this produces the line, not the boxplots
Here is the resulting picture
Not too sure how to proceed from here. This is an offshoot of another problem I had with the same data set at this link here. I don't understand why boxplots are being drawn: I did not call the boxplot() command, so I don't know why they appear along with the points. When I issue the following command: plot(DAYS.TO.FAILURE, TOTAL.LACE) I only get points on the resulting plot like I expected, but when I change the order of what is plotted on x and y the boxplots show up, which to me is unexpected.
Here is a link to sample data that will hopefully help in reproducing the problem, as pointed out by @Dwin et al.: Some Sample Data
Thank you,
Since you don't have a reproducible example, it is a little hard to provide an answer that deals with your situation. Here I generate some vaguely similar-looking data:
set.seed(4)
TOTAL.LACE <- rep(1:19, each=1000)
zero.prob <- rbinom(19000, size=1, prob=.01)
DAYS.TO.FAILURE <- rpois(19000, lambda=15)
DAYS.TO.FAILURE <- ifelse(zero.prob==1, DAYS.TO.FAILURE, 0)
And here is the plot:
First, the problem with some of the categories not being printed on the x-axis is because they don't fit. When you have so many categories, to make them all fit you have to display them in a smaller font. The code to do this is to use cex.axis and set the value <1 (you can read more about this here):
boxplot(DAYS.TO.FAILURE~TOTAL.LACE, cex.axis=.8)
As to the question of why your plot is "goofy" or "funky-looking", it is a bit hard to say, because those terms are rather nebulous. My guess is that you need to more clearly understand how boxplots work, and then understand what these plots are telling you about the distribution of your data. In a boxplot, the midline of the box is the 50th percentile of your data, while the bottom and top of the box are the 25th and 75th percentiles. Typically, the 'whiskers' will extend out to the furthest datapoint that is at most 1.5 times the inter-quartile range beyond the ends of the box. In your case, for the first 9 TOTAL.LACEs, more than 75% of your data are 0's, so there is no box and thus no whiskers are possible. Everything beyond the whisker limits is plotted as an individual point. I don't think your plots are "funky" (although I'll admit I have no idea what you mean by that), I think your data may be "funky" and your boxplots are representing the distributions of your data accurately according to the rules by which boxplots are constructed.
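To see these rules in numbers, boxplot.stats returns the values a boxplot would draw. The vector below is made up so that more than 75% of it is 0, which collapses the box and whiskers onto a single line and leaves everything else as individual points:

```r
# Hypothetical sample: 80 zeros followed by the values 1 to 20
x <- c(rep(0, 80), 1:20)

bs <- boxplot.stats(x)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values beyond the whiskers, drawn as individual points
```

Here all five entries of bs$stats are 0, so no box is visible, and the 20 non-zero values are all reported in bs$out.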
In the future (and I mean this politely), it will help you get more useful and faster answers if you can write questions that are more clearly specified, and contain a reproducible example.
Update: Thanks for providing more information. I gather by "funky" you mean that it is a boxplot, rather than a typical scatterplot. The thing to realize is that plot() is a generic function that will call different methods depending on what you pass to it. If you pass simple continuous data, it will produce a scatterplot, but if you pass continuous data and a factor, then it will produce a boxplot, even if you don't call boxplot explicitly. Consider:
plot(TOTAL.LACE, DAYS.TO.FAILURE)
plot(as.factor(TOTAL.LACE), DAYS.TO.FAILURE)
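Evidently, the variable on the x axis, TOTAL.LACE, has been converted to a factor without your meaning to. This is consistent with plot(DAYS.TO.FAILURE, TOTAL.LACE) giving only points, since boxplots are drawn only when the x variable is a factor. Separately, psymbol <- unique(FAILURE + 1) followed by pch=unique(psymbol) passes only the distinct symbol values rather than one symbol per observation. Although I haven't had time to try this, I suspect that making sure TOTAL.LACE is numeric, eliminating that line of code, and using pch=(FAILURE + 1) will accomplish your goals.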
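A quick way to check for an accidental factor and undo it, sketched on made-up data (lace, days and failure are hypothetical stand-ins for TOTAL.LACE, DAYS.TO.FAILURE and FAILURE):

```r
set.seed(4)
lace    <- factor(rep(1:19, each = 10))  # a score that has become a factor
days    <- rpois(190, lambda = 15)
failure <- rbinom(190, size = 1, prob = 0.3)

is.factor(lace)   # TRUE, so plot(lace, days) would draw boxplots

# Go through as.character() to recover the original numbers;
# as.numeric() alone would return the internal level codes
lace_num <- as.numeric(as.character(lace))

# Numeric x gives a scatterplot; pch gets one symbol per observation
plot(lace_num, days, pch = failure + 1)
```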
