I want to draw a scatter plot with R. I use ggplot2 to draw the picture:
data<-data.frame(x=runif(50),y=runif(50))
ggplot(data, aes(x,y))+geom_point()
But I want the dots colored according to their "x" value: dots falling in each of the following "x" ranges must get a different color: [0,0.2), [0.2,0.4), [0.4,0.6), [0.6,0.8), [0.8,1].
There's probably a better way to do this, but here's my solution:
library(ggplot2)
# what we started with
data<-data.frame(x=runif(50),y=runif(50))
# create discretized variable z from x to determine plotted color.
# Since you wanted 5 levels, multiplied by 5 and took the floor, and then
# converted to a factor
z<-factor(floor((data$x)*5)) # or z<-factor(floor((data[,1])*5))
# add z to previous data frame and store in new variable dat
dat<-cbind(data,z)
# make pretty labels
lolim<-seq(0,0.8,0.2)
hilim<-seq(0.2,1,0.2)
lbls<-paste(lolim,'-',hilim)
# plot, changed x-axis ticks to show cutoff values
ggplot(dat,aes(x=x,y=y,color=z))+
geom_point()+
scale_color_hue(name='x',labels=lbls)+
scale_x_continuous(breaks=seq(0,1,0.2))
last_plot() + aes(colour=cut(x, breaks = seq(0,1,by=0.2), right = FALSE, include.lowest = TRUE))
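By default, cut() builds intervals that are closed on the right, while the question asks for [0,0.2), ..., [0.8,1]. The following base R sketch compares the two settings on a few hand-picked values (the vector x and the variable names are illustrative only, not from the original post):

```r
# Hand-picked values sitting on or near the requested boundaries
x <- c(0, 0.1, 0.2, 0.4, 0.79, 0.8, 1)
breaks <- seq(0, 1, by = 0.2)

# Default: intervals are (a, b], so x = 0 falls outside every bin and becomes NA
default_bins <- cut(x, breaks = breaks)

# right = FALSE gives [a, b); include.lowest = TRUE closes the last bin at 1
half_open <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)

print(data.frame(x, default_bins, half_open))
```

With the second form, a value of exactly 0.2 is assigned to [0.2,0.4), which matches the ranges listed in the question.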
Trying to run a simple and quick analysis of some variables. I run this code:
ggplot(data, aes(var1)) +
geom_bar()
This results in a histogram. However, despite var1 having only 6 possible values, the x axis only shows 2, 4, 6. Is it possible to easily include all 6 possible values as labels?
You want a frequency bar plot for six individual numbers. However, you wish to see all of these numbers on the x axis, which makes me think that you actually treat them as categorical data rather than numeric data, so you would prefer a categorical x axis which shows all the values. Turning x into a factor should do the trick:
data <- data.frame(var1=floor(6*runif(200) + 1))
ggplot(data, aes(factor(var1))) + geom_bar()
Below: left - without factor, right - with factor.
What does your data look like?
Assuming you have a numeric x, adding scale_x_continuous(breaks = seq(1,6, by = 1)) should work.
Of course this would only work if the x values go from 1 to 6... Otherwise you can replace the seq call with a vector that contains the values you want.
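One caveat with the factor approach: factor() only creates levels for values that actually occur. The sketch below uses made-up data in which the value 4 never appears (an assumption for illustration, not the asker's data) and shows how passing levels explicitly keeps all six categories:

```r
set.seed(42)
# Made-up data: six possible values, but 4 never occurs
var1 <- sample(c(1, 2, 3, 5, 6), 200, replace = TRUE)

# Levels inferred from the data: only five categories
inferred <- factor(var1)
print(table(inferred))

# Levels given explicitly: six categories, with a zero count for 4
f <- factor(var1, levels = 1:6)
print(table(f))
```

When plotting such a factor with ggplot2, adding scale_x_discrete(drop = FALSE) keeps the empty category on the axis.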
I wish to create something similar to the following types of discontinuous heat map in R:
My data is arranged as follows:
k_e percent time
.. .. ..
.. .. ..
I wish k_e to be x-axis, percent on y-axis and time to denote the color.
All links I could find plotted a continuous matrix http://www.r-bloggers.com/ggheat-a-ggplot2-style-heatmap-function/ or interpolated. But I wish neither of the aforementioned, I want to plot discontinuous heatmap as in the images above.
The second one is a hexbin plot
If your (x,y) pairs are unique, you can do an x-y plot. If that's what you want, you can try using base R plot functions:
x <- runif(100)
y<-runif(100)
time<-runif(100)
pal <- colorRampPalette(c('white','black'))
# cut() splits the range of time into 10 intervals and assigns each value
# to one of them; the interval number is then used to pick a color from
# the 10 colors generated by the pal palette function.
cols <- pal(10)[as.numeric(cut(time,breaks = 10))]
# plot(x,y) creates the plot; pch sets the symbol to use and col the color
# of the points
plot(x,y,pch=19,col = cols)
With ggplot, you can also try:
library(ggplot2)
qplot(x,y,color=time)
Generate data
d <- data.frame(x=runif(100),y=runif(100),w=runif(100))
Using ggplot2
require(ggplot2)
Sample Count
The following code produces a discontinuous heatmap where color represents the number of items falling into a bin:
ggplot(d,aes(x=x,y=y)) + stat_bin2d(bins=10)
Average weight
The following code creates a discontinuous heatmap where color represents the average value of variable w for all samples inside the current bin.
ggplot(d,aes(x=x,y=y,z=w)) + stat_summary2d(bins=10)
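To see what the two heatmaps summarize, here is a rough base R sketch of the same idea: count the samples per cell and average w per cell. It uses cut() to form the bins, so the exact bin edges may differ from the ones ggplot2 chooses; the names xbin, ybin, counts and mean_w are illustrative only.

```r
set.seed(1)
d <- data.frame(x = runif(100), y = runif(100), w = runif(100))

# Assign every sample to one of 10 bins along each axis
xbin <- cut(d$x, breaks = 10)
ybin <- cut(d$y, breaks = 10)

# Sample count per cell (what the first heatmap colors by)
counts <- table(xbin, ybin)

# Mean of w per cell (what the second heatmap colors by); empty cells are NA
mean_w <- tapply(d$w, list(xbin, ybin), mean)

print(counts)
print(round(mean_w, 2))
```

Cells with no samples come out as NA in mean_w, which corresponds to the gaps that make the heatmap discontinuous.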
I have a variable ceroonce, which is the number of schools per county (integers) in 2011. When I plot it with boxplot(), it only requires the ceroonce variable. A boxplot is then produced in which the y axis is the number of schools and the x axis is... the "factor" ceroonce. But in ggplot, geom_boxplot requires me to supply both the x and y axes, and I just want a boxplot of ceroonce. I have tried inputting ceroonce as both the x and y axis, but then a weird boxplot is produced in which the y axis is the number of schools and the x axis (which should be the factor variable) is also the number of schools. I am assuming this is very basic statistics, but I am just confused. I am attaching the images hoping this will clarify my question.
This is the code I am using:
ggplot(escuelas, aes(x=ceroonce, y=ceroonce))+geom_boxplot()
boxplot(escuelas$ceroonce)
ggplot(escuelas, aes(x="ceroonce", y=ceroonce))+geom_boxplot()
ggplot will interpret the character string "ceroonce" as a vector with the same length as the ceroonce column and it will give the result you're looking for.
There are no fancy statistics happening here. boxplot is simply assuming that, since you've given it a single vector, you want a single box in your boxplot. ggplot and geom_boxplot simply don't make that assumption.
If you want a bit less typing, you can do this:
qplot(y=escuelas$ceroonce, x= 1, geom = "boxplot")
ggplot2 will automatically create a vector of 1s equal in length to the length of escuelas$ceroonce
This could work for you:
ggplot(escuelas, aes(x= "", y=ceroonce)) + geom_boxplot()
I have a dataset that looks pretty much like this one from diamonds:
diamonds2 = subset(diamonds, cut!='Good' & cut!='Very Good', -c(table, x, y, z, clarity, depth, price))
I want to make a boxplot like this one:
ggplot(diamonds2, aes(x=color, y=carat, col=cut))+geom_boxplot()
And the hard question comes here. My idea is to perform pairwise wilcox.test for each distribution of the variable y (carat) by group (cut) and for each of the columns (color).
library(plyr)
ddply(diamonds2,"color",
function(x) {
w <- wilcox.test(carat~cut,data=diamonds2)
with(w,data.frame(statistic,p.value))
})
The code fails because it is asking for 2 levels (obviously). I can make a subset before applying the function (to remove one of the "cut" levels), but it's not giving me what I want and I can't understand why.
Additionally, I would like to plot the results as asterisks colored according to the two distributions I'm comparing.
In the first boxplot (D), I would like to plot 3 asterisks: a purple one (red and blue are significantly different), a yellow one and a cyan one.
About the asterisk color plotting I've been playing a bit with the function geom_text from ggplot2 but I can't figure out how to plot below the X axis or plot text in different colors.
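Note that the function passed to ddply above tests the full diamonds2 data rather than the per-color piece x, so every group would return the same result. One possible way to get all pairwise comparisons within each color is pairwise.wilcox.test() from the stats package, applied to each color separately. The sketch below runs on a small synthetic stand-in (the toy data frame and its values are made up for illustration, not the real diamonds data):

```r
set.seed(1)
# Synthetic stand-in for diamonds2: 2 colors x 3 cuts x 30 observations
toy <- data.frame(
  color = rep(c("D", "E"), each = 90),
  cut   = factor(rep(rep(c("Fair", "Premium", "Ideal"), each = 30), times = 2)),
  carat = rexp(180) + rep(c(0, 0.5, 1), each = 30)
)

# For each color, run every pairwise Wilcoxon test between cuts and
# keep the matrix of (Holm-adjusted) p-values
pvals <- lapply(split(toy, toy$color), function(d) {
  pairwise.wilcox.test(d$carat, d$cut, p.adjust.method = "holm")$p.value
})

print(pvals)
```

Each element of pvals is a lower-triangular matrix with one p-value per pair of cuts, which could then be turned into asterisk labels for geom_text.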
How can I add text to points rendered with geom_jitter to label them? geom_text will not work because I don't know the coordinates of the jittered dots. Could you capture the position of the jittered points so I can pass them to geom_text?
My practical usage would be to plot a boxplot with the geom_jitter over it to show the data distribution and I would like to label the outliers dots or the ones that match certain condition (for example the lower 10% for the values used for color the plots).
One solution would be to capture the xy positions of the jittered plots and use it later in another layer, is that possible?
[update]
From Joran's answer, a solution would be to calculate the jittered values with the jitter function from the base package, add them to a data frame and use them with geom_point. For filtering, he used ddply to create a filter column (a logical vector) and used it for subsetting the data in geom_text.
He asked for a minimal dataset. I just modified his example (adding a unique identifier in the label column):
dat <- data.frame(x=rep(letters[1:3],times=100),y=runif(300),
lab=paste('id_',1:300,sep=''))
This is the result of Joran's example with my data, restricting the displayed ids to the lowest 1%:
And this is a modification of the code to have colors by another variable and displaying some values of this variable (the lowest 1% for each group):
library("ggplot2")
library("plyr")  # provides ddply
#Create some example data
dat <- data.frame(x=rep(letters[1:3],times=100),y=runif(300),
lab=paste('id_',1:300,sep=''),quality= rnorm(300))
#Create a copy of the data and a jittered version of the x variable
datJit <- dat
datJit$xj <- jitter(as.numeric(factor(dat$x)))
#Create indicator variables that pick out, within each x group,
# the obs in the lowest 1% of y (grp) and of quality (top_q)
datJit <- ddply(datJit,.(x),.fun=function(g){
g$grp <- g$y <= quantile(g$y,0.01);
g$top_q <- g$quality <= quantile(g$quality,0.01);
g})
#Create a boxplot, overlay the jittered points and
# label the bottom 1% points
ggplot(dat,aes(x=x,y=y)) +
geom_boxplot() +
geom_point(data=datJit,aes(x=xj,colour=quality)) +
geom_text(data=subset(datJit,grp),aes(x=xj,label=lab)) +
geom_text(data=subset(datJit,top_q),aes(x=xj,label=sprintf("%0.2f",quality)))
Your question isn't completely clear; for example, you mention labeling points at one point but also mention coloring points, so I'm not sure which you really mean, or perhaps both. A reproducible example would be very helpful. But using a little guesswork on my part, the following code does what I think you're describing:
library(ggplot2)
library(plyr)  # provides ddply
#Create some example data
dat <- data.frame(x=rep(letters[1:3],times=100),y=runif(300),
lab=rep('label',300))
#Create a copy of the data and a jittered version of the x variable
datJit <- dat
datJit$xj <- jitter(as.numeric(factor(dat$x)))
#Create an indicator variable that picks out those
# obs that are in the lowest 10% of y within each x group
datJit <- ddply(datJit,.(x),.fun=function(g){
g$grp <- g$y <= quantile(g$y,0.1); g})
#Create a boxplot, overlay the jittered points and
# label the bottom 10% points
ggplot(dat,aes(x=x,y=y)) +
geom_boxplot() +
geom_point(data=datJit,aes(x=xj)) +
geom_text(data=subset(datJit,grp),aes(x=xj,label=lab))
Just an addition to Joran's wonderful solution:
I ran into trouble with the x-axis positioning when I tried to use this in a faceted plot using facet_wrap(). The problem is that ggplot2 uses 1 as the x-value on every facet. The solution is to create a vector of jittered 1s:
datJit$xj <- jitter(rep(1,length(dat$x)),amount=0.1)
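As a quick check of what the two jitter calls used above produce, the base R sketch below computes both sets of positions and how far they stray from the box centers. The variable names are illustrative; the 0.2 bound for the default call follows from how jitter() chooses its amount for integer positions 1, 2, 3.

```r
set.seed(1)
x <- rep(letters[1:3], times = 100)
centers <- as.numeric(factor(x))  # 1, 2, 3: the box positions on the x axis

# Default jitter around each box position (as in Joran's answer)
xj <- jitter(centers)

# Faceted case: every box sits at x = 1, so jitter a vector of 1s
xj_facet <- jitter(rep(1, length(x)), amount = 0.1)

print(range(xj - centers))   # offsets around the box centers
print(range(xj_facet))       # stays between 0.9 and 1.1
```

Because the jittered coordinates are stored explicitly, the same xj values can be reused by geom_point and geom_text so labels land on their points.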