I have a simple scatter plot
x<-rnorm(100)
y<-rnorm(100)
z<-rnorm(100)
I want to draw plot(x,y), but with the points colour-coded by z.
Also, I would like to be able to define how many groups (and thus colours) z should have, and this grouping should be resistant to outliers (maybe split the z density into n equal-density groups).
Until now I have done this manually; is there a way to do it automatically?
Note: I want to do this with base R not with ggplot.
You can pass a vector of colours to the col parameter, so it is just a matter of defining your z groups in a way that makes sense for your application. There is the cut() function in base, or cut2() in Hmisc which offers a bit more flexibility. To assist in picking reasonable colour palettes, the RColorBrewer package is invaluable. Here's a quick example after defining x,y,z:
z.cols <- cut(z, 3, labels = c("pink", "green", "yellow"))
plot(x,y, col = as.character(z.cols), pch = 16)
You can add a legend manually. Unfortunately, I don't think all types of plots accept vectors for the col argument, but type = "p" does. For instance, plot(x,y, type = "l", col = as.character(z.cols)) comes out as a single colour for me. For these plots, you can add different colours with lines() or segments() or whichever low-level plotting command you need. See the answer by @Andrie for doing this with type = "l" plots in base graphics here.
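The question also asks for groups of equal density that are not distorted by outliers. A minimal sketch of one way to do that in base R, assuming quantile-based breaks are acceptable: cut() with equal-width intervals lets a single extreme value stretch the bins, whereas breaks taken from quantile() put roughly the same number of points in each group. The names n_groups, z_group and group_cols, and the blue-grey-red ramp, are illustrative choices and not part of the original answer.

```r
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)
z <- rnorm(100)

n_groups <- 4  # hypothetical: number of colour groups wanted

# Quantile breaks give (roughly) equally populated groups, so a few
# extreme z values cannot dominate the binning.
breaks  <- quantile(z, probs = seq(0, 1, length.out = n_groups + 1))
z_group <- cut(z, breaks = breaks, include.lowest = TRUE)

# One colour per group; indexing by the factor maps each point to its group colour.
group_cols <- colorRampPalette(c("blue", "grey", "red"))(n_groups)

plot(x, y, col = group_cols[z_group], pch = 16)
legend("topright", legend = levels(z_group), col = group_cols,
       pch = 16, cex = 0.8, title = "z")
```

Changing n_groups is the only adjustment needed to get more or fewer colours; the legend reuses the same colour vector, so it stays in step with the points.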
Related
I am making a scatter plot of two variables and would like to colour the points by a factor variable. Here is some reproducible code:
data <- iris
plot(data$Sepal.Length, data$Sepal.Width, col=data$Species)
This is all well and good, but how do I know which factor level has been given which colour?
data<-iris
plot(data$Sepal.Length, data$Sepal.Width, col=data$Species)
legend(7,4.3,levels(data$Species),col=seq_along(levels(data$Species)),pch=1)
should do it for you. But I prefer ggplot2 and would suggest that for better graphics in R.
The palette() function tells you the colours and their order when col = somefactor. It can also be used to set the colours.
palette()
[1] "black" "red" "green3" "blue" "cyan" "magenta" "yellow" "gray"
In order to see that in your graph you could use a legend.
legend('topright', legend = levels(iris$Species), col = 1:3, cex = 0.8, pch = 1)
You'll notice that I only specified the new colours with 3 numbers. This will work like using a factor. I could have used the factor originally used to colour the points as well. This would make everything logically flow together... but I just wanted to show you can use a variety of things.
You could also be specific about the colours. Try ?rainbow for starters and go from there. You can specify your own or have R do it for you. As long as you use the same method for each you're OK.
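As a rough sketch of the "same method for each" advice, the snippet below sets the palette once and then lets both the points and the legend draw from it. The three colour names and the variable names are arbitrary choices made for illustration.

```r
species_cols <- c("darkorange", "purple", "steelblue")  # illustrative colours

# palette() returns the previous palette invisibly, so it can be restored later.
old_palette <- palette(species_cols)

# col = factor uses the integer codes 1, 2, 3, which now index species_cols.
plot(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species, pch = 16)
legend("topright", legend = levels(iris$Species),
       col = seq_len(nlevels(iris$Species)), pch = 16, cex = 0.8)

palette(old_palette)  # put the previous palette back
```

Because the legend uses the level order and the same integer codes as the plot, the two cannot drift apart.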
Like Maiasaura, I prefer ggplot2. The transparent reference manual is one of the reasons.
However, this is one quick way to get it done.
require(ggplot2)
data(diamonds)
qplot(carat, price, data = diamonds, colour = color)
# example taken from Hadley's ggplot2 book
And because someone famous said that plot-related posts are not complete without the plot, here's the result:
Here's a couple of references:
qplot.R example,
note basically this uses the same diamond dataset I use, but crops the data before to get better performance.
http://ggplot2.org/book/
the manual: http://docs.ggplot2.org/current/
There are two ways that I know of to color plot points by factor and then also have a corresponding legend automatically generated. I'll give examples of both:
Using ggplot2 (generally easier)
Using R's built in plotting functionality in combination with the colorRampPalette function (trickier, but many people prefer/need R's built-in plotting facilities)
For both examples, I will use the ggplot2 diamonds dataset. We'll be using the numeric columns diamonds$carat and diamonds$price, and the factor/categorical column diamonds$color. You can load the dataset with the following code if you have ggplot2 installed:
library(ggplot2)
data(diamonds)
Using ggplot2 and qplot
It's a one liner. Key item here is to give qplot the factor you want to color by as the color argument. qplot will make a legend for you by default.
qplot(
x = carat,
y = price,
data = diamonds,
color = color # color by the factor column named color (I know, confusing)
)
Your output should look like this:
Using R's built in plot functionality
Using R's built in plot functionality to get a plot colored by a factor and an associated legend is a 4-step process, and it's a little more technical than using ggplot2.
First, we will make a colorRampPalette function. colorRampPalette() returns a new function that generates a vector of colors. In the snippet below, calling color_palette_function(5) would return a vector of 5 colors on a scale from red to orange to blue:
color_palette_function <- colorRampPalette(
colors = c("red", "orange", "blue"),
space = "Lab" # Option used when colors do not represent a quantitative scale
)
Second, we need to make a vector of colors, with exactly one color per diamond color. This is the mapping we will use both to assign colors to individual plot points, and to create our legend.
num_colors <- nlevels(diamonds$color)
diamond_color_colors <- color_palette_function(num_colors)
Third, we create our plot. This is done just like any other plot you've likely done, except that for the col argument we index the colors we made by the factor diamonds$color, so each point picks up the color for its level. As long as we always use this same vector, our mapping between colors and diamonds$color will be consistent across our R script.
plot(
x = diamonds$carat,
y = diamonds$price,
xlab = "Carat",
ylab = "Price",
pch = 20, # solid dots increase the readability of this data plot
col = diamond_color_colors[diamonds$color]
)
Fourth and finally, we add our legend so that someone reading our graph can clearly see the mapping between the plot point colors and the actual diamond colors.
legend(
x ="topleft",
legend = paste("Color", levels(diamonds$color)), # for readability of legend
col = diamond_color_colors,
pch = 19, # solid circle like pch = 20, just larger
cex = .7 # scale the legend to look attractively sized
)
Your output should look like this:
Nifty, right?
The col argument in the plot function assigns colors automatically to a vector of integers. If you convert iris$Species to numeric, notice that you get a vector of 1s, 2s and 3s, so you can apply this as:
plot(iris$Sepal.Length, iris$Sepal.Width, col=as.numeric(iris$Species))
Suppose you want red, blue and green instead of the default colors, then you can simply adjust it:
plot(iris$Sepal.Length, iris$Sepal.Width, col=c('red', 'blue', 'green')[as.numeric(iris$Species)])
You can probably see how to further modify the code above to get any unique combination of colors.
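One variation that may be easier to keep straight, offered here only as a sketch: name the colour vector by level and look colours up by name instead of by position. The mapping then no longer depends on the order of the factor levels. The names species_cols and point_cols are made up for this example.

```r
# Named vector: each level is tied to its colour explicitly.
species_cols <- c(setosa = "red", versicolor = "blue", virginica = "green")

# Look up by level name rather than by integer code.
point_cols <- species_cols[as.character(iris$Species)]

plot(iris$Sepal.Length, iris$Sepal.Width, col = point_cols, pch = 16)
legend("topright", legend = names(species_cols), col = species_cols, pch = 16)
```

The legend is built from the same named vector, so labels and colours stay paired.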
The lattice library is another good option. Here I've added a legend on the right side and jittered the points because some of them overlapped.
library(lattice)
xyplot(Sepal.Width ~ Sepal.Length, groups=Species, data=iris,
auto.key=list(space="right"),
jitter.x=TRUE, jitter.y=TRUE)
I created several plots in R. Occasionally, the program does not match the color of the variables in the plot to the variable colors in the legend. In the attached file (Unfortunately, I can't yet attach images b/c of reputation), the first 2 graphs are assigned a black/red color scheme. But, the third chart automatically uses a green/black and keeps the legend with black/red. I cannot understand why this happens.
How can I prevent this from happening?
I know it's possible to assign color, but I am struggling to find a clear way to do this.
Code:
plot(rank, abundance, pch=16, col=type, cex=0.8)
legend(60,50,unique(type),col=1:length(type),pch=16)
plot(rank, abundance, pch=16, col=Origin, cex=0.8)
legend(60,50,unique(Origin),col=1:length(Origin),pch=16)
Below is where the color pattern doesn't match:
plot(rank, abundance, pch=16, col=Lifecycle, cex=0.8)
legend(60,50,unique(Lifecycle),col=1:length(Lifecycle),pch=16)
data frame looks like this:
Plant rank abundance Lifecycle Origin type
X 1 23 Perennial Native Weedy
Y 2 10 Annual Exotic Ornamental
Z 3 9 Perennial Native Ornamental
First, I create some fake data.
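# factor() is needed explicitly since R 4.0, where data.frame() no longer converts strings to factors
df <- data.frame(rank = 1:10, abundance = runif(10,10,100),
Lifecycle = factor(sample(c('Perennial', 'Annual'), 10, replace=TRUE)))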
Then I explicitly say what colors I want my points to be.
cols=c('dodgerblue', 'plum')
Then I plot, using the factor df$Lifecycle to color points.
plot(df$rank, df$abundance, col = cols[df$Lifecycle], pch=16)
When the factor df$Lifecycle is used as an index above, R uses its integer codes to pick from cols, and those codes follow the factor levels, which are sorted alphabetically by default. Therefore, in the legend, we just need to sort the unique df$Lifecycle values, and then hand it the same color vector (cols).
legend(5, 40, sort(unique(df$Lifecycle)), col=cols, pch=16, bty='n')
Hopefully this helps.
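To see why the legend in the question can come out swapped, here is a small illustration using the three Lifecycle values from the example data frame: unique() returns values in order of first appearance, while the colours are assigned from the integer codes, which follow the level order.

```r
Lifecycle <- factor(c("Perennial", "Annual", "Perennial"))

levels(Lifecycle)                # "Annual" "Perennial"  (alphabetical)
as.integer(Lifecycle)            # 2 1 2                 (codes used by col =)
as.character(unique(Lifecycle))  # "Perennial" "Annual"  (order of appearance)
```

So col=Lifecycle colours "Perennial" with colour 2, but a legend built from unique(Lifecycle) with col=1:2 labels "Perennial" with colour 1. Using levels(Lifecycle) for the legend labels avoids the mismatch.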
I'm using prcomp to do PCA in R, and I want to plot PC1 vs PC2 with differently coloured text labels for each of the two categories.
I do the plot with:
plot(pca$x, main = "PC1 Vs PC2", xlim=c(-120,+120), ylim = c(-70,50))
then to draw in all the text with the different colors I've tried:
text(pca$x[,1][1:18], pca$[,1][1:18], labels=rownames(cava), col="green",
adj=c(0.3,-0.5))
text(pca$x[,1][19:35], pca$[,1][19:35], labels=rownames(cava), col="red",
adj=c(0.3,-0.5))
But R seems to plot two numbers on top of each other instead of one. I know that pca$x[,1][1:18] selects the correct points, because if I use it to plot the points it works and produces the same plot as plot(pca$x).
It would be great if anyone could help me plot the labels for the two categories, or
even plot the points in different colours to make it easy to differentiate between them.
You need to specify your x and y coordinates a bit differently:
text(pca$x[1:18,1], pca$x[1:18,2] ...)
This means take the first 18 rows and the first column (which is PC1) for the x coord, etc.
I'm surprised what you did doesn't throw an error.
If you want the points themselves colored, you can do it this way:
plot(pca$x, main = "PC1 Vs PC2", col = c(rep("green", 18), rep("red", 17)))
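If the categories are available as a variable rather than as row ranges, the same thing can be done in one text() call. This is only a sketch on the built-in mtcars data, since the cava data from the question is not available; the chosen columns, the group variable (transmission type) and the names group and label_cols are all stand-ins.

```r
# PCA on a few numeric columns of a built-in dataset (stand-in for the real data).
pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt")], scale. = TRUE)

# Two-level grouping variable: am is 0 (automatic) or 1 (manual).
group <- factor(mtcars$am, labels = c("automatic", "manual"))
label_cols <- c("green", "red")[group]  # one colour per row, via the factor codes

# Set up empty axes, then draw the row names in their group colour.
plot(pca$x[, 1:2], type = "n", main = "PC1 Vs PC2")
text(pca$x[, 1], pca$x[, 2], labels = rownames(mtcars),
     col = label_cols, cex = 0.7)
legend("topright", legend = levels(group), text.col = c("green", "red"), bty = "n")
```

Note that x comes from column 1 and y from column 2 of pca$x, which is the fix described above.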
I have created a NMDS plot using the 'vegan' package, like this:
y=metaMDS(data,type="p")
plot(y)
Now I have this NMDS with a good spread of my points. However, I would like to adjust the graphics of the plot. I would like to give the points in the plot a different colour, depending on a categorical variable (the variable is called 'regio') in my dataset, which has two values (1 or 2).
Is this possible? And if so, how?
Best,
Koen
The easiest way is to use the grouping variable regio to index into a vector of colours you want to plot with. E.g., (untested as I don't have your data...)
colvec <- c("red","blue")
plot(y, type = "n")
points(y, display = "sites", col = colvec[data$regio])
## or
text(y, display = "sites", col = colvec[data$regio])
## depending on how you want to represent the sample scores
I've long puzzled over a concise way to communicate the significance of an interaction between numeric and categorical variables in a line plot (response on the Y-axis, numeric predictor variable on the X-axis, and each level of the categorical variable as a line of a different color or pattern plotted on those axes). I finally came up with the idea of drawing the traditional "brackets and p-values" connecting legend keys instead of lines of data.
Here is a mockup of what I mean:
library(ggplot2);
mydat <- do.call(rbind,lapply(1:3,function(ii) data.frame(
y=seq(0,10)*c(.695,.78,1.39)[ii]+c(.322,.663,.847)[ii],
a=factor(ii-1),b=0:10)));
myplot <- ggplot(data=mydat,aes(x=b,y=y,colour=a,group=a)) +
geom_line()+theme(legend.position=c(.1,.9));
# Plotting with p-value bracket:
myplot +
# The three line segments making up the bracket
geom_segment(x=1.2,xend=1.2,y=13.8,yend=13) +
geom_segment(x=1.1,xend=1.2,y=13,yend=13) +
geom_segment(x=1.1,xend=1.2,y=13.8,yend=13.8) +
# The text accompanying the bracket.
geom_text(label='p < 0.001',x=2,y=13.4);
This is less cluttered than trying to plot brackets someplace on the line-plot itself.
The problem is that the x and y values for the geom_segments and geom_text were obtained by trial and error and for another dataset these coordinates would be completely wrong. That's a problem if I'm trying to write a function whose purpose is to automate the process of pulling these contrasts out of models and plotting them (kind of like the effects package, but with more flexibility about how to represent the data).
My question is: is there a way to somehow pull the actual coordinates of each box comprising the legend and convert them to the scale used by geom_segment and geom_text, or manually specify the coordinates of each box when creating the myplot object, or reliably predict where the individual boxes will be and convert them to the plot's scale given that myplot$theme$legend.position returns 0.1 0.9?
I'd like to do this within ggplot2, because it's robust, elegant, and perfect for all the other things I want to do with my script. I'm open to using additional packages that extend ggplot2 and I'm also open to other approaches to visually indicating significance level on line-plots. However, suggestions that amount to "you shouldn't even do that" are not constructive, because whether or not I personally agree with you, my collaborators and their editors don't read Stack Overflow (unfortunately).
Update:
This question kind of simplifies to: if the myplot$theme$legend.key.height is in lines and myplot$theme$legend.position seems to be roughly in fractions of the overall plot area (but not exactly) how can I convert these to the units in which the x and y axes are delineated, or alternatively, convert the x and y axis scales to the units of legend.key.height and legend.position?
I don't know the answer to your question as posed. But another approach, quick to do if less fancy, is to change the names of the levels so that they include significance codes. In your first example, you could use
levels(mydat$a) <- list("0" = "0", "1 *" = "1", "2 *" = "2")
And then the legend will reflect this:
With more levels and combos of significance, you could probably work out a set of symbols. Then mention in your figure legend the p level reflected in each set of symbols.
This might be a related way to convey the information: The figure below is produced by rxnNorm in HandyStuff here. Unfortunately, this is another non-answer as I have not been able to make this work with the new version of ggplot2. Hopefully I can figure it out soon.
My answer is not using ggplot2, but the lattice package. I think dotplot is what I would use if I want to compare a continuous variable versus categorical variables.
Here I use dotplot in two ways: one where I reproduce your plot, and another where each level of a gets its own panel.
library(lattice)
library(latticeExtra) ## to get ggplot2 theme
#y versus levels of B, in different panel of A
p1 <- dotplot(b~y|a ,
data = mydat,
groups = a,
type = c("p", "h"),
main = "interaction between numeric and categorical variables ",
xlab = "continuous value",
par.settings = ggplot2like())
#y versus levels of B , grouped by a(color and line are defined by a)
p2 <- dotplot(b~y, groups= a ,
data = mydat,
type = c("l"),
main = "interaction between numeric and categorical variables ",
xlab = "continuous value",
par.settings = ggplot2like())
library(gridExtra) ## to arrange many grid plots
grid.arrange(p1,p2)