R - Infer legend details from variable - r

This is my first question on SO, so hopefully this is enough detail.
I've made a Kaplan-Meier plot and would like to add a legend, but I am not having much luck. I know how to make a legend when I know which line represents which category. Unfortunately, because I'm using a large data set, I don't know which line is which, so I am having a hard time creating the legend manually. Is there any way R can infer which category corresponds to which colour on the following graph? (The current selections in the legend aren't right; I was just guessing.)
kmsurv1 <- survfit(Surv(as.numeric(time),hydraulic)~type)
# Specify axis options within plot()
plot(kmsurv1, col=c(1:12), main="Hydraulic Breakdown of Vehicles", sub="subtitle", xlab="Time", ylab="Probability of Being Operational", xlim=c(15400, 16500),ylim=c(.6,1.0))
legend("bottomleft", inset = 0, title = "Vehicle Type",legend= c("Hitachi Backhoe","Transport Trucks", "Water Trucks","Cat D8 Dozers", "D10 Dozers")
,fill = c(1:12), horiz=TRUE)

I'm assuming you are using the survival package. The package documentation specifies that the order of the printout from print(obj) is the order in which the curves are plotted. You can then extract those names with rownames(summary(sfit)$table). Just make sure that the colors are in the same order in the plot and legend calls. Here's an example:
library(survival)
sfit <- survfit(Surv(start, stop, event) ~ sex, mgus1, subset=(enum==1))
print(sfit) # the order of printout is the order in which they plot
plot(sfit, col=1:4)
legend("topleft",legend=rownames(summary(sfit)$table),col=1:4, lty=1)
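As a hedged alternative sketch (not part of the answer above): a survfit object fitted with a grouping variable also stores the curve names in its strata component, in plotting order, so names(fit$strata) can feed the legend directly. The example assumes the lung data set that ships with survival; the object names fit, labs and cols are illustrative only.

```r
library(survival)

# Fit one curve per level of sex, using the lung data bundled with survival
fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Curve labels in plotting order
labs <- names(fit$strata)

# One colour per curve, reused for both the plot and the legend
cols <- seq_along(labs)

plot(fit, col = cols, xlab = "Time", ylab = "Survival probability")
legend("topright", legend = labs, col = cols, lty = 1, title = "Group")
```

Building the colour vector from the labels themselves keeps the plot and the legend in step, however many groups the data contain.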

I found an extremely easy answer from: http://rpubs.com/sinhrks/plot_surv
install.packages('ggfortify')
install.packages("ggplot2")
library(survival) # provides Surv() and survfit()
library(ggplot2)
library(ggfortify)
fit <- survfit(Surv(as.numeric(time),hydraulic)~type, data = mydata)
autoplot(fit,xlim = c(15000,16500))
Thank you to https://stackoverflow.com/users/3396821/mlavoie for pointing me in the right direction.

Related

How to create histogram plot in ggplot2 without data frame?

I am plotting two histograms in R by using the following code.
x1<-rnorm(100)
x2<-rnorm(50)
h1<-hist(x1)
h2<-hist(x2)
plot(h1, col=rgb(0,0,1,.25), xlim=c(-4,4), ylim=c(0,0.6), main="", xlab="Index", ylab="Percent",freq = FALSE)
plot(h2, col=rgb(1,0,0,.25), xlim=c(-4,4), ylim=c(0,0.6), main="", xlab="Index", ylab="Percent",freq = FALSE,add=TRUE)
legend("topright", c("H1", "H2"), fill=c(rgb(0,0,1,.25),rgb(1,0,0,.25)))
The code produces the following output.
I need a better-looking (more stylish) version of the above plot, and I want to use ggplot2. I am looking for something like this (see the "Change fill colors" section). However, I think ggplot2 only works with data frames, and I do not have data frames in this case. How can I create a good-looking histogram plot in ggplot2? Thanks in advance.
You can (and should) put your data into a data.frame if you want to use ggplot. Ideally for ggplot, the data.frame should be in long format. Here's a simple example:
library(ggplot2)
# Long format: one row per observation, with grp naming the source vector
df1 = rbind(data.frame(grp='x1', x=x1), data.frame(grp='x2', x=x2))
ggplot(df1, aes(x, fill=grp)) +
  geom_histogram(color='black', alpha=0.5)
There are lots of options to change the appearance however you like. If you want to have the histograms stacked or grouped, or shown as percent versus count, or as densities, etc., you will find many resources in previous questions showing how to implement each of those options.
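For instance, to mimic the original base-graphics plot (two overlaid, semi-transparent histograms with freq = FALSE), something like the following might work. This is a minimal sketch, assuming a ggplot2 version that provides after_stat(); the seed, bin count and axis labels are arbitrary choices.

```r
library(ggplot2)

set.seed(1)  # only so that the sketch is reproducible
x1 <- rnorm(100)
x2 <- rnorm(50)

# Long format: one row per observation, with a column naming the group
df1 <- rbind(data.frame(grp = "x1", x = x1),
             data.frame(grp = "x2", x = x2))

# Overlaid (not stacked) histograms on the density scale
p <- ggplot(df1, aes(x, fill = grp)) +
  geom_histogram(aes(y = after_stat(density)),
                 position = "identity", bins = 30,
                 color = "black", alpha = 0.5) +
  labs(x = "Index", y = "Density", fill = "Sample")

print(p)
```

Using position = "identity" draws the two histograms on top of each other instead of stacking them, which is what add=TRUE did in the base version.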

Contour plot via Scatter plot

Scatter plots are useless when the number of points is large.
So, e.g., using a normal approximation, we can get a contour plot.
My question: Is there any package that produces a contour plot from a scatter plot?
Thank you @G5W!! I can do it!!
You don't offer any data, so I will respond with some artificial data,
constructed at the bottom of the post. You also don't say how much data
you have although you say it is a large number of points. I am illustrating
with 20000 points.
You used the group number as the plotting character to indicate the group.
I find that hard to read. But just plotting the points doesn't show the
groups well. Coloring each group a different color is a start, but does
not look very good.
plot(x,y, pch=20, col=rainbow(3)[group])
Two tricks that can make a lot of points more understandable are:
1. Make the points transparent. The dense places will appear darker. AND
2. Reduce the point size.
plot(x,y, pch=20, col=rainbow(3, alpha=0.1)[group], cex=0.8)
That looks somewhat better, but did not address your actual request.
Your sample picture seems to show confidence ellipses. You can get
those using the function dataEllipse from the car package.
library(car)
plot(x,y, pch=20, col=rainbow(3, alpha=0.1)[group], cex=0.8)
dataEllipse(x,y,factor(group), levels=c(0.70,0.85,0.95),
    plot.points=FALSE, col=rainbow(3), group.labels=NA, center.pch=FALSE)
But if there are really a lot of points, the points can still overlap
so much that they are just confusing. You can also use dataEllipse
to create what is basically a 2D density plot without showing the points
at all. Just plot several ellipses of different sizes over each other filling
them with transparent colors. The center of the distribution will appear darker.
This can give an idea of the distribution for a very large number of points.
plot(x,y,pch=NA)
dataEllipse(x,y,factor(group), levels=c(seq(0.15,0.95,0.2), 0.995),
    plot.points=FALSE, col=rainbow(3), group.labels=NA,
    center.pch=FALSE, fill=TRUE, fill.alpha=0.15, lty=1, lwd=1)
You can get a more continuous look by plotting more ellipses and leaving out the border lines.
plot(x,y,pch=NA)
dataEllipse(x,y,factor(group), levels=seq(0.11,0.99,0.02),
    plot.points=FALSE, col=rainbow(3), group.labels=NA,
    center.pch=FALSE, fill=TRUE, fill.alpha=0.05, lty=0)
Please try different combinations of these to get a nice picture of your data.
Additional response to comment: Adding labels
Perhaps the most natural place to add group labels is the centers of the
ellipses. You can get that by simply computing the centroids of the points in each group. So for example,
plot(x,y,pch=NA)
dataEllipse(x,y,factor(group), levels=c(seq(0.15,0.95,0.2), 0.995),
    plot.points=FALSE, col=rainbow(3), group.labels=NA,
    center.pch=FALSE, fill=TRUE, fill.alpha=0.15, lty=1, lwd=1)
## Now add labels
for(i in unique(group)) {
    text(mean(x[group==i]), mean(y[group==i]), labels=i)
}
Note that I just used the number as the group label, but if you have a more elaborate name, you can change labels=i to something like
labels=GroupNames[i].
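A vectorised variant of the labelling step might look like the sketch below. GroupNames and its contents are hypothetical, the data are a smaller stand-in for the artificial data set, and plain points are drawn instead of ellipses so that only base R is needed.

```r
set.seed(42)
x <- c(rnorm(200, 0, 1), rnorm(700, 1, 1), rnorm(1100, 5, 1))
y <- c(rnorm(200, 0, 1), rnorm(700, 5, 1), rnorm(1100, 6, 1))
group <- c(rep(1, 200), rep(2, 700), rep(3, 1100))

# Hypothetical descriptive names, indexed by group number
GroupNames <- c("Control", "Low dose", "High dose")

# Centroid of each group; the result is named by group number
cx <- tapply(x, group, mean)
cy <- tapply(y, group, mean)

plot(x, y, pch = 20, cex = 0.5, col = rainbow(3, alpha = 0.1)[group])
text(cx, cy, labels = GroupNames[as.integer(names(cx))], font = 2)
```

Indexing GroupNames by the names of the tapply() result keeps each label attached to the right centroid even if the groups are not stored in order.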
Data
x = c(rnorm(2000,0,1), rnorm(7000,1,1), rnorm(11000,5,1))
twist = c(rep(0,2000),rep(-0.5,7000), rep(0.4,11000))
y = c(rnorm(2000,0,1), rnorm(7000,5,1), rnorm(11000,6,1)) + twist*x
group = c(rep(1,2000), rep(2,7000), rep(3,11000))
You can use hexbin::hexbin() to show very large datasets.
@G5W gave a nice dataset:
x = c(rnorm(2000,0,1), rnorm(7000,1,1), rnorm(11000,5,1))
twist = c(rep(0,2000),rep(-0.5,7000), rep(0.4,11000))
y = c(rnorm(2000,0,1), rnorm(7000,5,1), rnorm(11000,6,1)) + twist*x
group = c(rep(1,2000), rep(2,7000), rep(3,11000))
If you don't know the group information, then the ellipses are inappropriate; this is what I'd suggest:
library(hexbin)
plot(hexbin(x,y))
which produces
If you really want contours, you'll need a density estimate to plot. The MASS::kde2d() function can produce one; see the examples in its help page for plotting a contour based on the result. This is what it gives for this dataset:
library(MASS)
contour(kde2d(x,y))
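To combine the two views, the contours can be drawn over faint points. This is only a sketch: the grid size n = 50, the number of contour levels and the colours are arbitrary assumptions, and kde2d() is left to choose its default bandwidth.

```r
library(MASS)

set.seed(1)
x <- c(rnorm(2000, 0, 1), rnorm(7000, 1, 1), rnorm(11000, 5, 1))
y <- c(rnorm(2000, 0, 1), rnorm(7000, 5, 1), rnorm(11000, 6, 1))

# Kernel density estimate on a 50 x 50 grid (n is the grid size per axis)
dens <- kde2d(x, y, n = 50)

# Faint points underneath, contour lines of the estimated density on top
plot(x, y, pch = 20, cex = 0.3, col = rgb(0, 0, 0, 0.05))
contour(dens, add = TRUE, col = "red", nlevels = 8)
```

A finer grid gives smoother contour lines at the cost of more computation.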

axis labels on y-axis only using xyplot in R

I am relatively new to the site and to R, so please forgive any protocols I may not adhere to.
I am producing a plot with xyplot. My code
library(lattice)
height <- c(1,3,5)
mass <- c(10, 12, 14)
d <- data.frame (height,mass)
xyplot(height ~ mass, type = 'a', scales = list(alternating = 1, tck = c(1,0)))
and I get this
My problem is that I cannot remove the labels from the x-axis so only the y-axis ticks are labelled. This is so I can stack a number of plots with data.arrange. I have looked here and on other places online and found some answers but I clearly do not understand the code because I still cannot do it. I have tried removing the axes and rebuilding with "scales" to no success.
Can someone please assist me with this?
Regards
Aaron
Inside the scales argument you can add a list for the attributes of the x-axis and set labels=NULL. For additional options, see the scales section in the help for xyplot. I've also removed the x-axis title since you probably won't want that repeated for each graph either:
xyplot(height ~ mass, type ='a', xlab="",
scales=list(alternating=1, tck=c(1,0), x=list(labels=NULL)))
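For stacking, one possible approach that stays within lattice is the split and more arguments of the print method for trellis objects. The sketch below reuses the data from the question; the object names top and bottom are illustrative, and only the upper plot has its x-axis labels removed.

```r
library(lattice)

d <- data.frame(height = c(1, 3, 5), mass = c(10, 12, 14))

# Upper plot: x-axis labels and title suppressed, as in the answer above
top <- xyplot(height ~ mass, data = d, type = "a", xlab = "",
              scales = list(alternating = 1, tck = c(1, 0),
                            x = list(labels = NULL)))

# Lower plot: keeps its x-axis labels
bottom <- xyplot(height ~ mass, data = d, type = "a",
                 scales = list(alternating = 1, tck = c(1, 0)))

# split = c(column, row, number of columns, number of rows)
print(top, split = c(1, 1, 1, 2), more = TRUE)
print(bottom, split = c(1, 2, 1, 2))
```

Setting more = TRUE on every plot except the last tells lattice to keep drawing on the same page.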

Mosaic plot with labels in each box showing a name and percentage of all observations

I would like to create a mosaic plot (R package vcd, see e.g. http://cran.r-project.org/web/packages/vcd/vignettes/residual-shadings.pdf ) with labels inside the plot. The labels should show either a combination of the various factors or some custom label and the percentage of total observations in this combination of categories (see e.g. http://i.usatoday.net/communitymanager/_photos/technology-live/2011/07/28/nielsen0728x-large.jpg , despite this not quite being a mosaic plot).
I suspect something like the labeling_values function might play a role here, but I cannot quite get it to work.
library(vcd)
library(MASS)
data("Titanic")
mosaic(Titanic, labeling = labeling_values)
Alternative ways to represent two variables with categorical data in a friendly way for non-statisticians are also welcome and are acceptable solutions.
Here is an example of adding proportions as labels. As usual, the degree of customization of a plot is a matter of taste, but this shows at least the principles. See ?labeling_cells for further possibilities.
labs <- round(prop.table(Titanic), 2)
mosaic(Titanic, pop = FALSE)
labeling_cells(text = labs, margin = 0)(Titanic)
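To show percentages of all observations rather than proportions, the label table can be converted to text first. This is a sketch along the same lines as the example above; the one-decimal rounding is an arbitrary choice.

```r
library(vcd)

# Share of all observations in each cell, as a percentage
pct <- round(100 * prop.table(Titanic), 1)

# Character table with the same dimensions and dimnames as Titanic
labs <- pct
labs[] <- paste0(pct, "%")

mosaic(Titanic, pop = FALSE)
labeling_cells(text = labs, margin = 0)(Titanic)
```

Assigning into labs[] rather than overwriting labs keeps the dimensions and dimnames, which labeling_cells() needs in order to match each label to its cell.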

How to extract coordinates to plot line segments connecting legend keys in ggplot2?

I've long puzzled over a concise way to communicate the significance of an interaction between numeric and categorical variables in a line plot (response on the Y-axis, numeric predictor variable on the X-axis, and each level of the categorical variable drawn as a line of a different color or pattern on those axes). I finally came up with the idea of drawing the traditional "brackets and p-values" connecting legend keys instead of lines of data.
Here is a mockup of what I mean:
library(ggplot2);
mydat <- do.call(rbind,lapply(1:3,function(ii) data.frame(
y=seq(0,10)*c(.695,.78,1.39)[ii]+c(.322,.663,.847)[ii],
a=factor(ii-1),b=0:10)));
myplot <- ggplot(data=mydat,aes(x=b,y=y,colour=a,group=a)) +
geom_line()+theme(legend.position=c(.1,.9));
# Plotting with p-value bracket:
myplot +
# The three line segments making up the bracket
geom_segment(x=1.2,xend=1.2,y=13.8,yend=13) +
geom_segment(x=1.1,xend=1.2,y=13,yend=13) +
geom_segment(x=1.1,xend=1.2,y=13.8,yend=13.8) +
# The text accompanying the bracket.
geom_text(label='p < 0.001',x=2,y=13.4);
This is less cluttered than trying to plot brackets someplace on the line-plot itself.
The problem is that the x and y values for the geom_segments and geom_text were obtained by trial and error and for another dataset these coordinates would be completely wrong. That's a problem if I'm trying to write a function whose purpose is to automate the process of pulling these contrasts out of models and plotting them (kind of like the effects package, but with more flexibility about how to represent the data).
My question is: is there a way to somehow pull the actual coordinates of each box comprising the legend and convert them to the scale used by geom_segment and geom_text, or manually specify the coordinates of each box when creating the myplot object, or reliably predict where the individual boxes will be and convert them to the plot's scale given that myplot$theme$legend.position returns 0.1 0.9?
I'd like to do this within ggplot2, because it's robust, elegant, and perfect for all the other things I want to do with my script. I'm open to using additional packages that extend ggplot2, and I'm also open to other approaches to visually indicating significance level on line plots. However, suggestions that amount to "you shouldn't even do that" are not constructive, because whether or not I personally agree with you, my collaborators and their editors don't read Stack Overflow (unfortunately).
Update:
This question kind of simplifies to: if the myplot$theme$legend.key.height is in lines and myplot$theme$legend.position seems to be roughly in fractions of the overall plot area (but not exactly) how can I convert these to the units in which the x and y axes are delineated, or alternatively, convert the x and y axis scales to the units of legend.key.height and legend.position?
I don't know the answer to your question as posed. But another approach, less fancy but quick to do, is to change the names of the levels so that they include significance codes. In your first example, you could use
levels(mydat$a) <- list("0" = "0", "1 *" = "1", "2 *" = "2")
And then the legend will reflect this:
With more levels and combos of significance, you could probably work out a set of symbols. Then mention in your figure legend the p level reflected in each set of symbols.
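If the p-values are available as a vector, the relabelling could be automated along these lines. This is a sketch only: the p_values vector, its numbers and the sig_stars() helper are hypothetical, and the thresholds follow the common 0.05 / 0.01 / 0.001 convention.

```r
# Factor standing in for mydat$a
a <- factor(c(0, 1, 2, 0, 1, 2))

# Hypothetical p-values per level, named by level; NA marks the reference level
p_values <- c("0" = NA, "1" = 0.0004, "2" = 0.03)

# Hypothetical helper: map p-values to star suffixes
sig_stars <- function(p) {
  out <- rep("", length(p))
  out[!is.na(p) & p < 0.05]  <- " *"
  out[!is.na(p) & p < 0.01]  <- " **"
  out[!is.na(p) & p < 0.001] <- " ***"
  out
}

# Append the stars to the existing level names
levels(a) <- paste0(levels(a), sig_stars(p_values[levels(a)]))
levels(a)
```

Because the legend is built from the factor levels, the stars then appear next to the legend keys without any coordinates having to be worked out.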
This might be a related way to convey the information: The figure below is produced by rxnNorm in HandyStuff here. Unfortunately, this is another non-answer as I have not been able to make this work with the new version of ggplot2. Hopefully I can figure it out soon.
My answer is not using ggplot2, but the lattice package. I think dotplot is what I would use if I want to compare a continuous variable versus categorical variables.
Here I use dotplot in two ways: one that reproduces your plot (p2, with the lines grouped by a), and another that shows each level of a in its own panel (p1).
library(lattice)
library(latticeExtra) ## to get ggplot2 theme
#y versus levels of B, in different panel of A
p1 <- dotplot(b~y|a ,
data = mydat,
groups = a,
type = c("p", "h"),
main = "interaction between numeric and categorical variables ",
xlab = "continuous value",
par.settings = ggplot2like())
#y versus levels of B , grouped by a(color and line are defined by a)
p2 <- dotplot(b~y, groups= a ,
data = mydat,
type = c("l"),
main = "interaction between numeric and categorical variables ",
xlab = "continuous value",
par.settings = ggplot2like())
library(gridExtra) ## to arrange many grid plots
grid.arrange(p1,p2)
