I have a data frame (100 x 4). The first column is a set of "bins" 0-100, the remaining columns are the counts for each variable of events within each bin (0 to the maximum number of events).
What I'm trying to do is plot each of the three columns of data (2:4) alongside each other. Because the counts in each bin are close to identical across the data sets, the bars overlap in the histograms/barplots I've created, despite my use of beside=true and position = dodge.
I've tried the first column as both numeric and character, but the results are identical: the bars are overlaid on top of each other. (Semi-transparent density plots don't work because I want counts, not distribution densities.)
The attached code, based on both R and other documentation produced the attached chart.
barplot(BinCntDF$preT,main=NewMain_Trigger, plot=TRUE,
xlab="sample frequency interval counts (0-100 msec bins)",
names.arg=BinCntDF$dT, las=0,
ylab="bin counts", axes=TRUE, xlim=c(0,100),
ylim=c(0,1000), col="red")
geom_bar(position="dodge")
barplot(BinCntDF$postT, beside=TRUE, add=TRUE)
geom_bar()
The goal is to be able to compare the two (or more) data sets side by side on the same axes, without either overlapping the other(s).
I think you have confused barplot with ggplot2. The function geom_bar comes from the ggplot2 package and cannot be combined with barplot, which is part of base R graphics.
Compare ?barplot and ?geom_bar and you will see that geom_bar belongs to ggplot2. To achieve what you're after, I have used ggplot2 and reshape2.
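For comparison, base R's barplot() can also draw side-by-side bars, but only when all series are passed together as a matrix with beside=TRUE. Calling barplot() a second time with add=TRUE just draws on top of the first plot. A minimal sketch, assuming invented counts in a data frame shaped like the one in the question (the values in BinCntDF below are made up for illustration):

```r
set.seed(1)
# Hypothetical stand-in for the asker's BinCntDF: 10 bins, two count columns
BinCntDF <- data.frame(dT = 1:10,
                       preT = rpois(10, lambda = 50),
                       postT = rpois(10, lambda = 50))
# barplot() expects one row per series and one column per bin
counts <- t(as.matrix(BinCntDF[, c("preT", "postT")]))
# beside = TRUE places the bars of each bin next to each other
mids <- barplot(counts, beside = TRUE, names.arg = BinCntDF$dT,
                col = c("red", "blue"), legend.text = rownames(counts),
                xlab = "bin", ylab = "bin counts")
```

The returned mids matrix holds the x position of every bar, which can be handy for adding labels afterwards.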
Step 1
Based on your description, I have assumed that your data looks roughly like this:
df <- data.frame(x = 1:10,
c1 = sample(0:100, replace=TRUE, size=10),
c2 = sample(0:50, replace=TRUE, size=10),
c3 = sample(0:70, replace=TRUE, size=10))
To plot it using ggplot2, you first have to transform the data from wide format to long format. You can do this with the melt function from reshape2.
library(reshape2)
a <- melt(df, id=c("x"))
The output would look something like this
> head(a)
x variable value
1 1 c1 62
2 2 c1 47
3 3 c1 20
4 4 c1 64
5 5 c1 4
6 6 c1 52
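If reshape2 is not available, roughly the same long layout can be built in base R. This is only a sketch of an alternative, not the method used in this answer; it assumes the same column names (x, c1, c2, c3) as the example data above:

```r
set.seed(1)
df <- data.frame(x = 1:10,
                 c1 = sample(0:100, replace = TRUE, size = 10),
                 c2 = sample(0:50, replace = TRUE, size = 10),
                 c3 = sample(0:70, replace = TRUE, size = 10))
# stack() concatenates the value columns and records the source column in 'ind'
stacked <- stack(df[, c("c1", "c2", "c3")])
# Repeat the id column once per stacked column to line it up with the values
long <- data.frame(x = rep(df$x, times = 3),
                   variable = stacked$ind,
                   value = stacked$values)
head(long)
```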
Step 2
There are plenty of tutorials online explaining what ggplot2 does and what its arguments mean. I would recommend searching Google or the many threads on SO to understand them.
library(ggplot2)
ggplot(a, aes(x=x, y=value, group=variable, fill=variable)) +
geom_bar(stat='identity', position='dodge')
Which gives you the output:
In a nutshell:
group groups the variables of interest
stat=identity ensures that no additional aggregations are made on your data
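As a side note, ggplot2 also provides geom_col(), which plots the y values as given and so behaves like geom_bar(stat='identity'). A small sketch with invented long-format data (the data frame below is hypothetical, not the asker's):

```r
library(ggplot2)
# Hypothetical long-format data: 3 bins x 2 series
long <- data.frame(x = rep(1:3, times = 2),
                   variable = rep(c("c1", "c2"), each = 3),
                   value = c(5, 3, 8, 4, 6, 2))
# geom_col() uses the values as bar heights, with no counting or aggregation
p <- ggplot(long, aes(x = x, y = value, fill = variable)) +
  geom_col(position = "dodge")
p
```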
With that many bins (100) and groups (3) the plot will look messy, but try this:
library(reshape2)
library(ggplot2)
set.seed(123)
myDF <- data.frame(bins=1:100, x=sample(1:100, replace=TRUE), y=sample(1:100, replace=TRUE), z=sample(1:100, replace=TRUE))
myDF.m <- melt(myDF, id.vars='bins')
ggplot(myDF.m, aes(x=bins, y=value, fill=variable)) + geom_bar(stat='identity', position='dodge')
You could also try plotting w/ facets:
ggplot(myDF.m, aes(x=bins, y=value, fill=variable)) + geom_bar(stat='identity') + facet_wrap(~ variable)
I am trying to show different growing season lengths by displaying crop planting and harvest dates at multiple regions.
My final goal is a graph that looks like this:
which was taken from an answer to this question. Note that the dates are in julian days (day of year).
My first attempt to reproduce a similar plot is:
library(data.table)
library(ggplot2)
mydat <- "Region\tCrop\tPlanting.Begin\tPlanting.End\tHarvest.Begin\tHarvest.End\nCenter-West\tSoybean\t245\t275\t1\t92\nCenter-West\tCorn\t245\t336\t32\t153\nSouth\tSoybean\t245\t1\t1\t122\nSouth\tCorn\t183\t336\t1\t153\nSoutheast\tSoybean\t275\t336\t1\t122\nSoutheast\tCorn\t214\t336\t32\t122"
# read data as data table
mydat <- setDT(read.table(textConnection(mydat), sep = "\t", header=T))
# melt data table
m <- melt(mydat, id.vars=c("Region","Crop"), variable.name="Period", value.name="value")
# plot stacked bars
ggplot(m, aes(x=Crop, y=value, fill=Period, colour=Period)) +
geom_bar(stat="identity") +
facet_wrap(~Region, nrow=3) +
coord_flip() +
theme_bw(base_size=18) +
scale_colour_manual(values = c("Planting.Begin" = "black", "Planting.End" = "black",
"Harvest.Begin" = "black", "Harvest.End" = "black"), guide = "none")
However, there are a few issues with this plot:
Because the bars are stacked, the values on the x-axis are aggregated and end up too high - out of the 1-365 scale that represents day of year.
I need to combine Planting.Begin and Planting.End in the same color, and do the same to Harvest.Begin and Harvest.End.
Also, a "void" (or a completely uncolored bar) needs to be created between Planting.Begin and Harvest.End.
Perhaps the graph could be achieved with geom_rect or geom_segment, but I really want to stick to geom_bar since it's more customizable (for example, it accepts scale_colour_manual in order to add black borders to the bars).
Any hints on how to create such graph?
I don't think this is something you can do with geom_bar or geom_col. A more general approach is to use geom_rect to draw rectangles. To do this, we need to reshape the data a bit:
library(dplyr)  # attaches the %>% pipe used below
plotdata <- mydat %>%
dplyr::mutate(Crop = factor(Crop)) %>%
tidyr::pivot_longer(Planting.Begin:Harvest.End, names_to="period") %>%
tidyr::separate(period, c("Type","Event")) %>%
tidyr::pivot_wider(names_from=Event, values_from=value)
# Region Crop Type Begin End
# <chr> <fct> <chr> <int> <int>
# 1 Center-West Soybean Planting 245 275
# 2 Center-West Soybean Harvest 1 92
# 3 Center-West Corn Planting 245 336
# 4 Center-West Corn Harvest 32 153
# 5 South Soybean Planting 245 1
# ...
We've used tidyr to reshape the data so that we have one row per rectangle we want to draw, and we've also made Crop a factor. We can then plot it like this:
ggplot(plotdata) +
aes(ymin=as.numeric(Crop)-.45, ymax=as.numeric(Crop)+.45, xmin=Begin, xmax=End, fill=Type) +
geom_rect(color="black") +
facet_wrap(~Region, nrow=3) +
theme_bw(base_size=18) +
scale_y_continuous(breaks=seq_along(levels(plotdata$Crop)), labels=levels(plotdata$Crop))
The part that's a bit messy here is that we want a discrete scale for y, but geom_rect needs numeric values. Since Crop is now a factor, we use its numeric level codes to create the ymin and ymax positions, and then relabel the y axis with the names of the factor levels.
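To make the factor-to-number step concrete, here is a tiny base R sketch. The crop vector is made up for illustration; the 0.45 half-height is the value used in the plot code above:

```r
crop <- factor(c("Soybean", "Corn", "Soybean"))
levels(crop)               # levels are sorted alphabetically by default
codes <- as.numeric(crop)  # position of each value within levels(crop)
# Each rectangle spans code - 0.45 to code + 0.45 on the y axis
data.frame(crop, ymin = codes - 0.45, ymax = codes + 0.45)
```

Because the codes are simply 1, 2, ... in level order, seq_along(levels(...)) gives the matching axis breaks.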
If you also wanted to get the month names on the x axis you could do something like
dateticks <- seq.Date(as.Date("2020-01-01"), as.Date("2020-12-01"),by="month")
# then add this to your plot
... +
scale_x_continuous(breaks=lubridate::yday(dateticks),
labels=lubridate::month(dateticks, label=TRUE, abbr=TRUE))
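If lubridate is not installed, the same breaks and labels can be derived in base R. This is a sketch of an equivalent calculation, not part of the original answer. Note that 2020 is a leap year, so in a non-leap year the tick positions from March onward would be one day earlier:

```r
dateticks <- seq(as.Date("2020-01-01"), as.Date("2020-12-01"), by = "month")
# %j formats a date as its day of the year (001-366)
breaks <- as.integer(format(dateticks, "%j"))
# month.abb is a built-in vector of English month abbreviations
labels <- month.abb[as.integer(format(dateticks, "%m"))]
```

These two vectors can be passed to scale_x_continuous(breaks = breaks, labels = labels).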
Let's say I have a histogram with two overlapping groups. Here's a possible command from ggplot2 and a pretend output graph.
ggplot(data, aes(x=Variable1, fill=BinaryVariable)) + geom_histogram(position="identity")
So what I have is the frequency or count of each event. What I'd like to do instead is to get the difference between the two events in each bin. Is this possible? How?
For example, if we do RED minus BLUE:
Value at x=2 would be ~ -10
Value at x=4 would be ~ 40 - 200 = -160
Value at x=6 would be ~ 190 - 25 = 155
Value at x=8 would be ~ 10
I'd prefer to do this using ggplot2, but another way would be fine. My dataframe is set up with items like this toy example (dimensions are actually 25000 rows x 30 columns) EDITED: Here is example data to work with GIST Example
ID Variable1 BinaryVariable
1 50 T
2 55 T
3 51 N
.. .. ..
1000 1001 T
1001 1944 T
1002 1042 N
As you can see from my example, I'm interested in a histogram to plot Variable1 (a continuous variable) separately for each BinaryVariable (T or N). But what I really want is the difference between their frequencies.
So, in order to do this we need to make sure that the "bins" we use for the histograms are the same for both levels of your indicator variable. Here's a somewhat naive solution (in base R):
df = data.frame(y = c(rnorm(50), rnorm(50, mean = 1)),
x = rep(c(0,1), each = 50))
#full hist
fullhist = hist(df$y, breaks = 20) #specify more breaks than probably necessary
#create histograms for 0 & 1 using breaks from full histogram
zerohist = with(subset(df, x == 0), hist(y, breaks = fullhist$breaks))
oneshist = with(subset(df, x == 1), hist(y, breaks = fullhist$breaks))
#combine the hists
combhist = fullhist
combhist$counts = zerohist$counts - oneshist$counts
plot(combhist)
So we specify how many breaks should be used (based on values from the histogram on the full data), and then we compute the differences in the counts at each of those breaks.
PS It might be helpful to examine what the non-graphical output of hist() is.
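As a sketch of that non-graphical route: hist(..., plot = FALSE) returns the breaks and counts without drawing anything, so the per-bin difference can be computed directly. The data are simulated as in the answer above, and the explicit breaks vector built with pretty() is an assumption of this sketch rather than what the answer's code does:

```r
set.seed(42)
df <- data.frame(y = c(rnorm(50), rnorm(50, mean = 1)),
                 x = rep(c(0, 1), each = 50))
# One shared set of breaks that covers the full range of the data
breaks <- pretty(range(df$y), n = 20)
# plot = FALSE returns the histogram object without drawing it
h0 <- hist(df$y[df$x == 0], breaks = breaks, plot = FALSE)
h1 <- hist(df$y[df$x == 1], breaks = breaks, plot = FALSE)
# Per-bin difference in counts, aligned with the bin midpoints
diffs <- data.frame(mid = h0$mids, diff = h0$counts - h1$counts)
head(diffs)
```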
Here's a solution that uses ggplot as requested.
The key idea is to use ggplot_build to get the rectangles computed by stat_histogram. From that you can compute the differences in each bin and then create a new plot using geom_rect.
setup and create a mock dataset with lognormal data
library(ggplot2)
library(data.table)
theme_set(theme_bw())
n1<-500
n2<-500
k1 <- exp(rnorm(n1,8,0.7))
k2 <- exp(rnorm(n2,10,1))
df <- data.table(k=c(k1,k2),label=c(rep('k1',n1),rep('k2',n2)))
Create the first plot
p <- ggplot(df, aes(x=k,group=label,color=label)) + geom_histogram(bins=40) + scale_x_log10()
Get the rectangles using ggplot_build
p_data <- as.data.table(ggplot_build(p)$data[[1]])[,.(count,xmin,xmax,group)]
p1_data <- p_data[group==1]
p2_data <- p_data[group==2]
Join on the x-coordinates to compute the differences. Note that the y-values aren't the counts, but the y-coordinates of the first plot.
newplot_data <- merge(p1_data, p2_data, by=c('xmin','xmax'), suffixes = c('.p1','.p2'))
newplot_data <- newplot_data[,diff:=count.p1 - count.p2]
# only 'count' was kept from the build data, so rename the merged count columns
setnames(newplot_data, old=c('count.p1','count.p2'), new=c('k1','k2'))
df2 <- melt(newplot_data,id.vars =c('xmin','xmax'),measure.vars=c('k1','diff','k2'))
make the final plot
ggplot(df2, aes(xmin=xmin,xmax=xmax,ymax=value,ymin=0,group=variable,color=variable)) + geom_rect()
Of course the scales and legends still need to be fixed, but that's a different topic.
I have a plot where request is a factor with long values, so the labels don't display on the chart axis.
plot( time_taken ~ request )
The data in this case looks like:
time_taken request
1 7 /servlet1/endpoint2/
2 2 /session/
3 10 /servlet1/endpoint3/
4 2 /servlet1/endpoint2/
5 8 /servlet4/endpoint2/
6 5 /session/
...
Question: Is there a way to plot something like the factor level id on the x axis, and the factor level id + factor full string in the legend?
The code in your question generates a boxplot, so I assume that's what you want. Here are four ways to go about it.
This will generate a boxplot with the x-axis numbered, and the full names in the legend.
library(ggplot2)
ggplot(df) +
geom_boxplot(aes(x=as.integer(request),y=time_taken, color=request))+
labs(x="request")
As you can see below, though, with ggplot the labels are discernible (at least in the example).
ggp <- ggplot(df) + geom_boxplot(aes(x=request,y=time_taken))
ggp
In a situation like this I'd be inclined to rotate the plot.
ggp + coord_flip()
Finally, here's a way in base R, although IMO it's the least appealing option.
plot(time_taken~factor(as.integer(request)),df, xlab="request")
labs <- with(df,paste(as.integer(sort(unique(request))),sort(unique(request)),sep=" - "))
legend("topright",legend=labs)
A possible solution uses ggplot2. Here is an example with some sample data.
df <- data.frame(factor = c("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb",
"ccccccccccccccccccccccccccccccccccccc"),
time = c(5, 7, 9))
library(ggplot2)
qplot(data = df, factor, time) + scale_x_discrete(labels = abbreviate)
You can also apply the abbreviate function directly to the factor levels in your data frame. That way you work with abbreviated labels everywhere and can avoid ggplot2 if you're not familiar with it.
Look at ?abbreviate
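A minimal sketch of abbreviating the levels directly, using a few request strings like those in the question. The minlength of 8 is an arbitrary choice for illustration:

```r
request <- factor(c("/servlet1/endpoint2/", "/session/",
                    "/servlet1/endpoint3/", "/session/"))
short <- request
# Replace each level with a shortened version; abbreviate() keeps them unique
levels(short) <- unname(abbreviate(levels(request), minlength = 8))
# Lookup table from abbreviated label to the full string, e.g. for a legend
data.frame(short = levels(short), full = levels(request))
```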
I have a time series with multiple days of data. In between each day there's one period with no data points. How can I omit these periods when plotting the time series using ggplot2?
An artificial example shown as below, how can I get rid of the two periods where there's no data?
code:
Time = Sys.time()+(seq(1,100)*60+c(rep(1,100)*3600*24, rep(2, 100)*3600*24, rep(3, 100)*3600*24))
Value = rnorm(length(Time))
g <- ggplot()
g <- g + geom_line (aes(x=Time, y=Value))
g
First, create a grouping variable. Here, two groups are different if the time difference is larger than 1 minute:
Group <- c(0, cumsum(diff(Time) > 1))
Now three distinct panels could be created using facet_grid and the argument scales = "free_x":
library(ggplot2)
g <- ggplot(data.frame(Time, Value, Group)) +
geom_line (aes(x=Time, y=Value)) +
facet_grid(~ Group, scales = "free_x")
The problem is: how does ggplot2 know you have missing values? I see two options:
Pad out your time series with NA values
Add an additional variable representing a "group". For example,
dd = data.frame(Time, Value)
##type contains three distinct values
dd$type = factor(cumsum(c(0, as.numeric(diff(dd$Time) - 1))))
##Plot, but use the group aesthetic
ggplot(dd, aes(x=Time, y=Value)) +
geom_line (aes(group=type))
gives
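The units of diff() on date-times are chosen automatically, so a somewhat more defensive way to build the grouping variable is to request minutes explicitly and use a threshold comfortably above the sampling interval. A sketch under those assumptions (the fixed start time and the 1.5-minute cutoff are choices made here, not part of the answers above):

```r
# Three days of one-minute samples, 100 points per day, as in the question
start <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC")
Time <- start + (seq(1, 100) * 60 + rep(1:3, each = 100) * 3600 * 24)
# Gap to the previous sample, always expressed in minutes
gap_mins <- as.numeric(diff(Time), units = "mins")
# Start a new group whenever the gap is clearly larger than one minute
Group <- cumsum(c(0, gap_mins > 1.5))
table(Group)
```

Group can then be used either as the group aesthetic or as the faceting variable.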
csgillespie mentioned padding by NA, but a simpler method is to add one NA after each block:
Value[seq(1,length(Value)-1,by=100)]=NA
where the -1 avoids a warning.
I am trying to plot a sequence of small coloured squares representing different types of activities. For example, in the following data frame, type represents the type of activity and count represents how many of those activities occurred before an activity of a different type took place.
df3 <- data.frame(type=c(1,6,4,6,1,4,1,4,1,1,1,1,6,6,1,1,3,1,4,1,4,6,4,6,4,4,6,4,6,4),
count=c(6,1,1,1,2,1,6,3,1,6,8,10,3,1,2,2,1,2,1,1,1,1,1,1,3,3,1,17,1,12) )
For now I am not using count in ggplot. I am just giving consecutive numbers as x values and 1 as y values. However, it gives me something like ggplot Image
This is the code I used. Note that for y I always use 1 and for x I just use consecutive numbers:
ggplot(df3,aes(x=1:nrow(df3),y=rep(1,30))) + geom_bar(stat="identity",aes(color=as.factor(type)))
I would like to get small squares with the width=df3$count.
Do you have any suggestions? Thanks in advance
I am not entirely clear on what you need, but I offer one possible way to plot your data. I have used geom_rect() to draw rectangles of width equal to your count column. The rectangles are plotted in the same order as the rows of your data.
df3 <- data.frame(type=c(1,6,4,6,1,4,1,4,1,1,1,1,6,6,1,
1,3,1,4,1,4,6,4,6,4,4,6,4,6,4),
count=c(6,1,1,1,2,1,6,3,1,6,8,10,3,1,2,
2,1,2,1,1,1,1,1,1,3,3,1,17,1,12))
library(ggplot2)
df3$type <- factor(df3$type)
df3$ymin <- 0
df3$ymax <- 1
df3$xmax <- cumsum(df3$count)
df3$xmin <- c(0, head(df3$xmax, n=-1))
plot_1 <- ggplot(df3,
aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax, fill=type)) +
geom_rect(colour="grey40", size=0.5)
png("plot_1.png", height=200, width=800)
print(plot_1)
dev.off()
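To check the bookkeeping behind xmin and xmax, here is the same cumulative-sum step on the first five counts from the data above. It is a small worked sketch, not additional plotting code:

```r
count <- c(6, 1, 1, 1, 2)
# Right edge of each rectangle is the running total of the widths
xmax <- cumsum(count)
# Left edge is the previous rectangle's right edge, starting at 0
xmin <- c(0, head(xmax, n = -1))
data.frame(count, xmin, xmax)
```

Each rectangle's width, xmax - xmin, equals its count, and the rectangles tile the x axis without gaps.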