ggplot sequence patterns - r

I am trying to plot a sequence of small coloured squares representing different types of activities. For example, in the following data frame, type represents the type of activity and count represents how many of those activities occurred before an activity of a different type took place.
df3 <- data.frame(type=c(1,6,4,6,1,4,1,4,1,1,1,1,6,6,1,1,3,1,4,1,4,6,4,6,4,4,6,4,6,4),
count=c(6,1,1,1,2,1,6,3,1,6,8,10,3,1,2,2,1,2,1,1,1,1,1,1,3,3,1,17,1,12) )
So far I am not using count in ggplot. I am just giving consecutive numbers as x values and 1 as y values. However, it gives me the following plot (image not included).
This is the code I used; note that for y I always use 1 and for x I use just consecutive numbers:
ggplot(df3,aes(x=1:nrow(df3),y=rep(1,30))) + geom_bar(stat="identity",aes(color=as.factor(type)))
I would like to get small squares with the width=df3$count.
Do you have any suggestions? Thanks in advance

I am not entirely clear on what you need, but I offer one possible way to plot your data. I have used geom_rect() to draw rectangles of width equal to your count column. The rectangles are plotted in the same order as the rows of your data.
df3 <- data.frame(type=c(1,6,4,6,1,4,1,4,1,1,1,1,6,6,1,
1,3,1,4,1,4,6,4,6,4,4,6,4,6,4),
count=c(6,1,1,1,2,1,6,3,1,6,8,10,3,1,2,
2,1,2,1,1,1,1,1,1,3,3,1,17,1,12))
library(ggplot2)
df3$type <- factor(df3$type)
df3$ymin <- 0
df3$ymax <- 1
df3$xmax <- cumsum(df3$count)
df3$xmin <- c(0, head(df3$xmax, n=-1))
plot_1 <- ggplot(df3,
aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax, fill=type)) +
geom_rect(colour="grey40", size=0.5)
png("plot_1.png", height=200, width=800)
print(plot_1)
dev.off()
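If literal unit squares are wanted rather than variable-width rectangles, one possible alternative (a sketch, not part of the answer above) is to expand each run with rep() so that every single activity gets its own row, then draw one tile per row. The names squares, pos and plot_squares are made up for this illustration.

```r
library(ggplot2)

df3 <- data.frame(
  type  = c(1,6,4,6,1,4,1,4,1,1,1,1,6,6,1,1,3,1,4,1,4,6,4,6,4,4,6,4,6,4),
  count = c(6,1,1,1,2,1,6,3,1,6,8,10,3,1,2,2,1,2,1,1,1,1,1,1,3,3,1,17,1,12)
)

# Repeat each type `count` times: one row per individual activity
squares <- data.frame(type = factor(rep(df3$type, times = df3$count)))
squares$pos <- seq_len(nrow(squares))   # position in the sequence

plot_squares <- ggplot(squares, aes(x = pos, y = 1, fill = type)) +
  geom_tile(colour = "grey40") +        # one tile per row of `squares`
  coord_fixed() +                       # equal units on both axes keeps tiles square
  theme_void() +
  theme(legend.position = "bottom")

print(plot_squares)
```

Compared with geom_rect(), this trades a larger data frame for squares of identical size, which may read better when the runs are short.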

Related

Histogram: Combine continuous and discrete values in ggplot2

I have a set of times that I would like to plot on a histogram.
Toy example:
df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))
The problem is that one value is ">10" and refers to the number of times that more than 10 seconds were observed. The other time points are all numbers referring to the actual time. Now, I would like to create a histogram that treats all numbers as numeric and combines them in bins when appropriate, while plotting the counts of the ">10" at the side of the distribution, but not in a separate plot. I have tried to call geom_histogram twice, once with the continuous data and once with the discrete data in a separate column but that gives me the following error:
Error: Discrete value supplied to continuous scale
Happy to hear suggestions!
Here's a somewhat involved solution, but I believe it best answers your question: you want to place a bar representing the ">10" values (or any non-numeric values) next to a typical histogram. Critically, you want to keep the binning that comes with a histogram, which means you are not looking to simply switch to a discrete scale and imitate a histogram with an ordinary barplot.
The Data
Since you want to retain histogram features, I'm going to use an example dataset that is a bit more involved than that you gave us. I'm just going to specify a uniform distribution (n=100) with 20 ">10" values thrown in there.
set.seed(123)
df<- data.frame(time=c(runif(100,0,10), rep(">10",20)))
As prepared, df$time is a character vector, but a histogram needs it to be numeric. We simply force it to numeric and accept that the ">10" values are coerced to NA. This is fine, since in the end we just count those NA values and represent them with a bar. While I'm at it, I create a subset of df for the bar representing our NAs (">10") using the count() function, which returns a data frame with one row and one column: df_na$n = 20 in this case.
library(ggplot2)
library(dplyr)
library(cowplot) # provides plot_grid(), used further down
df$time <- as.numeric(df$time) # force numeric; ">10" becomes NA (with a coercion warning)
df_na <- count(subset(df, is.na(time)))
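The coercion step above can be tried in isolation with base R only. This is a minimal sketch on a made-up vector (time_raw, time_num and n_over are illustrative names); it assumes that every non-numeric entry really is a ">10" marker, since a genuinely missing value would be counted as well.

```r
# Hypothetical toy vector mixing numeric strings with a ">10" marker
time_raw <- c("1.5", "2", "7.25", ">10", "3", ">10")

# as.numeric() turns anything unparseable into NA (with a warning,
# silenced here because the NAs are intended)
time_num <- suppressWarnings(as.numeric(time_raw))

# Number of entries that could not be parsed, i.e. the ">10" markers
n_over <- sum(is.na(time_num))

time_num
n_over
```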
The Plot(s)
For the actual plot, you are asking to create a combination of (1) a histogram, and (2) a barplot. These are not the same plot, but more importantly, they cannot share the same axis, since by definition, the histogram needs a continuous axis and "NA" values or ">10" is not a numeric/continuous value. The solution here is to make two separate plots, then combine them with a bit of magic thanks to cowplot.
The histogram is created quite easily. I'm saving the number of bins for demonstration purposes later. Here's the basic plot:
bin_num <- 12 # using this later
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
Thanks to the subsetting previously, the barplot for the NA values is easy too:
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3)
Yikes! That looks horrible, but have patience.
Stitching them together
You can simply run plot_grid(p1, p2) and you get something workable... but it leaves quite a lot to be desired:
There are problems here. I'll enumerate them, then show you the final code for how I address them:
1. Need to remove some elements from the NA barplot. Namely, the y axis entirely and the title for the x axis (but it can't be NULL or the x axes won't line up properly). These are theme() elements that are easily removed via ggplot.
2. The NA barplot is taking up WAY too much room. Need to cut the width down. We address this through the rel_widths= argument of plot_grid(). Easy peasy.
3. How do we know how to set the y scale upper limit? This is a bit more involved, since it depends on the ..count.. stat for p1 as well as the number of NA values. You can access the maximum count for a histogram using ggplot_build(), which is part of ggplot2.
So, the final code requires the creation of the basic p1 and p2 plots, then adds to them in order to fix the limits. I'm also adding an annotation for number of bins to p1 so that we can track how well the upper limit setting works. Here's the code and some example plots where bin_num is set at 12 and 5, respectively:
# basic plots
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3) +
labs(x="") + theme(axis.line.y=element_blank(), axis.text.y=element_blank(),
axis.title.y=element_blank(), axis.ticks.y=element_blank()
) +
scale_x_discrete(expand=expansion(add=1))
#set upper y scale limit
max_count <- max(c(max(ggplot_build(p1)$data[[1]]$count), df_na$n))
# fix limits for plots
p1 <- p1 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15))) +
annotate('text', x=0, y=max_count, label=paste('Bins:', bin_num)) # for demo purposes
p2 <- p2 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15)))
plot_grid(p1, p2, rel_widths=c(1,0.2))
So, our upper limit fixing works. You can get really crazy playing around with positioning, etc and the plot_grid() function, but I think it works pretty well this way.
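To see what the upper-limit step works with, the computed bin counts can be inspected on their own. The following is a small sketch on simulated data (x, p and built are illustrative names), assuming a single histogram layer so that the first element of the built data is the one of interest.

```r
library(ggplot2)

set.seed(123)
x <- data.frame(time = runif(100, 0, 10))

p <- ggplot(x, aes(x = time)) + geom_histogram(bins = 12)

# ggplot_build() computes the data for every layer; [[1]] is the first (only) layer
built <- ggplot_build(p)$data[[1]]

max(built$count)   # tallest bar, usable as a shared y-axis limit
sum(built$count)   # total across bins, which should match nrow(x)
```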
Perhaps, this is what you are looking for:
library(dplyr)
library(ggplot2)
df1 <- data.frame(x=sample(1:12,50,replace=TRUE))
df2 <- df1 %>% group_by(x) %>%
dplyr::summarise(y=n()) %>% subset(x<11)
df3 <- subset(df1, x>10) %>% dplyr::summarise(y=n()) %>% mutate(x=11)
df <- rbind(df2,df3 )
label <- ifelse((df$x<11),as.character(df$x),">10")
p <- ggplot(df, aes(x=x,y=y,color=x,fill=x)) +
geom_bar(stat="identity", position = "dodge") +
scale_x_continuous(breaks=df$x,labels=label)
p
and you get the following output:
Please note that sometimes you could have some of the bars missing depending on the sample.

plot points in front of lines for each group/ggplot2 equivalent of type="o"

Suppose I want to plot a graph with both points and lines where points appear in front of their corresponding lines in each group. In particular, I want group 1 to be plotted with red filled points, where the points are connected by a line, but group 2 to be plotted with (just) a blue line, but I want group 2 to be plotted over group 1. For example, in base graphics:
set.seed(101)
dd <- data.frame(x=rep(1:10,2),
y=rep(1:10,2),
f=factor(rep(1:2,each=10)))
dd$y[11:20] <- dd$y[11:20] + rnorm(10)
d1 <- subset(dd,f=="1")
d2 <- subset(dd,f=="2")
par(cex=1.5)
plot(y~x,data=d1,bg="red",pch=21,type="o")
lines(y~x,data=d2,col="blue",lwd=2)
legend("bottomright",c("group 1","group 2"),
col=c("black","blue"),
pch=c(21,NA),
pt.bg=c("red",NA),
lty=1,
lwd=c(1,2))
(My real data are a little more complex.) I'm going a little nuts trying to do this cleanly in ggplot2.
If I draw points before lines, group 1's points get overlaid by the lines in the same group:
library(ggplot2); theme_set(theme_bw())
g0 <- ggplot(dd,aes(x,y,fill=f,colour=f,shape=f))+
scale_fill_manual(values=c("red",NA))+
scale_colour_manual(values=c("black","blue")) +
scale_shape_manual(values=c(21,NA))
g0 + geom_point()+ geom_line()
ggsave("order2.png",width=3,height=3)
If I draw lines before points, group 2's lines get overlaid by group 1's points:
g0 + geom_line()+ geom_point()
ggsave("order3.png",width=3,height=3)
The desired order is (group 1 lines), (group 1 points), (group 2 lines). I can do this by manually overlaying the geoms again, one group at a time, but this is way ugly.
g0 + geom_line() + geom_point()+
geom_point(data=d1)+
geom_line(data=d2,show.legend=FALSE)
ggsave("order4.png",width=3,height=3)
I think the "best" solution to this is to write a low-level geom_linepoint that works as desired; I've looked into this a bit and it's not entirely trivial ... can anyone suggest a cleaner, simpler solution?
Here's a "low tech" [1] solution. Below is a function that adds a line layer and then a point layer successively for each level of a given grouping variable.
linepoint = function(data, group.var, lsize=1.2, psize=4) {
lapply(split(data, data[,group.var]), function(dg) {
list(geom_line(data=dg, size=lsize),
geom_point(data=dg, size=psize))
})
}
ggplot(dd, aes(x,y, fill=f, colour=f,shape=f))+
scale_fill_manual(values=c("red",NA))+
scale_colour_manual(values=c("black","blue")) +
scale_shape_manual(values=c(21,NA)) +
linepoint(dd, "f")
[1] "Low tech" compared to writing a new geom. @baptiste's (now deleted) answer does create a new geom and seems to get the job done, so I'm not sure why he deleted it.
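The function relies on the fact that a list of layers can be added to a plot with +, and that the layers are appended in list order. A stripped-down sketch of that mechanism, on a tiny made-up data set and without the size arguments, might look like this:

```r
library(ggplot2)

dd <- data.frame(x = rep(1:3, 2), y = c(1, 2, 3, 3, 2, 1),
                 f = factor(rep(1:2, each = 3)))

# One list(line, point) pair per level of f, in level order
layers <- lapply(split(dd, dd$f), function(dg) {
  list(geom_line(data = dg), geom_point(data = dg))
})

names(layers)     # one entry per level of f
lengths(layers)   # number of layers contributed by each group

# Adding the nested list appends the layers in order:
# group 1 line, group 1 points, group 2 line, group 2 points
p <- ggplot(dd, aes(x, y, colour = f)) + layers
length(p$layers)
```

Because split() follows the factor's level order, reordering the levels of the grouping variable is one way to change which group ends up on top.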

5 dimensional plot in r

I am trying to plot a 5 dimensional plot in R. I am currently using the rgl package to plot my data in 4 dimensions, using 3 variables as the x, y, z coordinates and another variable as the color. I am wondering if I can add a fifth variable using this package, for example as the size or the shape of the points in the space. Here's an example of my data, and my current code:
set.seed(1)
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
colnames(df) <- c("var1","var2","var3","var4","var5")
require(rgl)
plot3d(df$var1, df$var2, df$var3, col=as.numeric(df$var4), size=0.5, type='s',xlab="var1",ylab="var2",zlab="var3")
I hope it is possible to do the 5th dimension.
Many thanks,
Here is a ggplot2 option. I usually shy away from 3D plots as they are hard to interpret properly. I also almost never put in 5 continuous variables in the same plot as I have here...
ggplot(df, aes(x=var1, y=var2, fill=var3, color=var4, size=var5^2)) +
geom_point(shape=21) +
scale_color_gradient(low="red", high="green") +
scale_size_continuous(range=c(1,12))
While this is a bit messy, you can actually reasonably read all 5 dimensions for most points.
A better approach to multi-dimensional plotting opens up if some of your variables are categorical. If all your variables are continuous, you can turn some of them to categorical with cut and then use facet_wrap or facet_grid to plot those.
For example, here I break up var3 and var4 into quintiles and use facet_grid on them. Note that I also keep the color aesthetics as well to highlight that most of the time turning a continuous variable to categorical in high dimensional plots is good enough to get the key points across (here you'll notice that the fill and border colors are pretty uniform within any given grid cell):
df$var4.cat <- cut(df$var4, quantile(df$var4, (0:5)/5), include.lowest=T)
df$var3.cat <- cut(df$var3, quantile(df$var3, (0:5)/5), include.lowest=T)
ggplot(df, aes(x=var1, y=var2, fill=var3, color=var4, size=var5^2)) +
geom_point(shape=21) +
scale_color_gradient(low="red", high="green") +
scale_size_continuous(range=c(1,12)) +
facet_grid(var3.cat ~ var4.cat)
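The binning step is worth checking by itself before faceting on it. Here is a minimal base R sketch on a made-up continuous variable (v, breaks and v_cat are illustrative names) showing how cut() and quantile() combine into quintiles.

```r
set.seed(1)
v <- runif(1000, 0, 200)               # hypothetical continuous variable

# Quintile break points: 0%, 20%, ..., 100%
breaks <- quantile(v, probs = (0:5) / 5)

# include.lowest = TRUE keeps the minimum value in the first bin;
# without it the smallest observation would become NA
v_cat <- cut(v, breaks = breaks, include.lowest = TRUE)

nlevels(v_cat)   # number of bins
table(v_cat)     # observations per bin
```

Note that quantile-based breaks can collide when a variable has many tied values, in which case cut() refuses duplicate breaks; that situation is not handled in this sketch.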

log-scaled density plot: ggplot2 and freqpoly, but with points instead of lines

What I really want to do is plot a histogram with the y-axis on a log scale. Obviously this is a problem with the ggplot2 geom_histogram, since the bottom of the bar is at zero, and the log of that gives you trouble.
My workaround is to use the freqpoly geom, and that more or less does the job. The following code works just fine:
ggplot(zcoorddist) +
geom_freqpoly(aes(x=zcoord,y=..density..),binwidth = 0.001) +
scale_y_continuous(trans = 'log10')
The issue is that at the edges of my data, I get a couple of garish vertical lines that really throw you off visually when combining a bunch of these freqpoly curves in one plot. What I'd like to be able to do is use points at every vertex of the freqpoly curve, and no lines connecting them. Is there a way to do this easily?
The easiest way to get the desired plot is to just recast your data. Then you can use geom_point. Since you don't provide an example, I used the standard example for geom_histogram to show this:
# load packages
require(ggplot2)
require(reshape)
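# get data (the movies data set now ships in the ggplot2movies package rather than ggplot2)
data(movies, package = "ggplot2movies")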
movies <- movies[, c("title", "rating")]
# here's the equivalent of your plot
ggplot(movies) + geom_freqpoly(aes(x=rating, y=..density..), binwidth=.001) +
scale_y_continuous(trans = 'log10')
# recast the data
df1 <- recast(movies, value~., measure.var="rating")
names(df1) <- c("rating", "number")
# alternative way to recast data
df2 <- as.data.frame(table(movies$rating))
names(df2) <- c("rating", "number")
df2$rating <- as.numeric(as.character(df2$rating)) # factor levels back to numbers
# plot
p <- ggplot(df1, aes(x=rating)) + scale_y_continuous(trans="log10", name="density")
# with lines
p + geom_linerange(aes(ymax=number, ymin=.9))
# only points
p + geom_point(aes(y=number))
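A related way to get one point per histogram bin, sketched here as an alternative rather than as the answer's method, is to do the binning with base R's hist() and hand the result to geom_point(). The data are simulated as a stand-in for zcoord, and z, h, bins and p_log are illustrative names.

```r
library(ggplot2)

set.seed(42)
z <- rnorm(5000)                          # stand-in for zcoord

# Bin without plotting; hist() returns mids, counts and density
h <- hist(z, breaks = 50, plot = FALSE)
bins <- data.frame(mid = h$mids, density = h$density)

# Empty bins have density 0, which has no finite log10, so drop them
bins <- bins[bins$density > 0, ]

p_log <- ggplot(bins, aes(x = mid, y = density)) +
  geom_point() +
  scale_y_continuous(trans = "log10")

print(p_log)
```

Dropping the empty bins up front also avoids the vertical lines at the edges of the data, since no point is ever placed at zero.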

2d color plot in R

I have a data frame with many events, each of them having a timestamp.
I need a 2-dimensional plot of this: x axis represents days, y axis represents the time of a day (e.g. hours), and the number of events in this hour of this day is represented by the color (or maybe another way?) of the corresponding cell.
First I've tried to use
ggplot(events) +
geom_jitter(aes(x = round(TimeStamp / (3600*24)),
y = TimeStamp %% (3600*24)))
but due to the large number of events (more than 1 million per month) it is only possible to see that there were events during a specific hour, not how many there were (almost all cells are just filled with black). So the question is: how can I create such a plot in R?
You could make a hexbin plot:
set.seed(42)
events <- data.frame(x=round(rbinom(1000,1000, 0.1)),y=round(rnorm(1000,10,3)))
library(ggplot2)
library(hexbin)
p1 <- ggplot(events,aes(x,y)) + geom_hex()
print(p1)
The way I do it is to use a small alpha (i.e. transparency) for each event, so that superimposed events have a higher (cumulative) alpha, which gives an idea of the number of superimposed events:
library(ggplot2)
events <- data.frame(x=round(rbinom(1000,1000, 0.1)),y=round(rnorm(1000,10,3)))
ggplot(events) +
geom_point(aes(x=x, y=y), colour="black", alpha=0.2)
Another solution would be to represent it as a heatmap:
hm <- table(events)
xhm <- as.numeric(rownames(hm))
yhm <- as.numeric(colnames(hm))
image(xhm,yhm,hm)
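For the day-by-hour layout described in the question, the counts per cell can also be computed explicitly and drawn with geom_tile(). This is a sketch under the assumption that TimeStamp holds seconds since some origin; the data are simulated, and the day, hour, cells and p_heat names are made up for the example.

```r
library(ggplot2)

set.seed(42)
# Hypothetical timestamps: seconds since an arbitrary origin, 30 days' worth
events <- data.frame(
  TimeStamp = sample.int(30 * 86400, 100000, replace = TRUE) - 1
)

events$day  <- events$TimeStamp %/% 86400            # whole days since origin
events$hour <- (events$TimeStamp %% 86400) %/% 3600  # hour of day, 0..23

# Count events per (day, hour) cell
cells <- as.data.frame(table(day = events$day, hour = events$hour))
cells$day  <- as.numeric(as.character(cells$day))
cells$hour <- as.numeric(as.character(cells$hour))

p_heat <- ggplot(cells, aes(x = day, y = hour, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkred")

print(p_heat)
```

Aggregating first keeps the plotted data at one row per cell, so the number of raw events no longer affects drawing time.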
