I have a data frame with many events, each of them having a timestamp.
I need a 2-dimensional plot of this: x axis represents days, y axis represents the time of a day (e.g. hours), and the number of events in this hour of this day is represented by the color (or maybe another way?) of the corresponding cell.
First I've tried to use
ggplot(events) +
geom_jitter(aes(x = floor(TimeStamp / (3600*24)),
y = TimeStamp %% (3600*24))),
but due to a large number of events (more than 1 million per month) it's possible to see only the fact that there were events during a specific hour, not how many there were (almost all cells are just filled with black). So, the question is - how to create such a plot in R?
You could make a hexbin plot:
set.seed(42)
events <- data.frame(x=round(rbinom(1000,1000, 0.1)),y=round(rnorm(1000,10,3)))
library(ggplot2)
library(hexbin)
p1 <- ggplot(events,aes(x,y)) + geom_hex()
print(p1)
The way I do it is to use a small alpha (i.e. transparency) for each event, so that overlapping events have a higher (cumulative) alpha, which gives an idea of the number of superimposed events:
library(ggplot2)
events <- data.frame(x=round(rbinom(1000,1000, 0.1)),y=round(rnorm(1000,10,3)))
ggplot(events) +
geom_point(aes(x=x, y=y), colour="black", alpha=0.2)
Another solution would be to represent it as a heatmap:
hm <- table(events)
xhm <- as.numeric(rownames(hm))
yhm <- as.numeric(colnames(hm))
image(xhm,yhm,hm)
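The same heatmap idea can be applied directly to the timestamps from the question. This is only a sketch: it assumes TimeStamp holds seconds since some epoch (as the arithmetic in the question suggests), and the simulated events data frame and the day, hour and n columns are made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the real data: TimeStamp in seconds, spanning 30 days
set.seed(1)
events <- data.frame(TimeStamp = sort(runif(50000, 0, 30 * 24 * 3600)))

# Integer division gives the day index; the remainder gives the hour within that day
events$day  <- events$TimeStamp %/% (3600 * 24)
events$hour <- (events$TimeStamp %% (3600 * 24)) %/% 3600

# One row per (day, hour) cell, with the number of events that fall into it
counts <- aggregate(n ~ day + hour, data = transform(events, n = 1), FUN = sum)

# Colour each cell by its event count
p <- ggplot(counts, aes(x = day, y = hour, fill = n)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkred")
print(p)
```

Counting before plotting keeps the plot small (at most 24 cells per day), however many events there are.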
I have a set of times that I would like to plot on a histogram.
Toy example:
df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))
The problem is that one value is ">10" and refers to the number of times that more than 10 seconds were observed. The other time points are all numbers referring to the actual time. Now, I would like to create a histogram that treats all numbers as numeric and combines them in bins when appropriate, while plotting the counts of the ">10" at the side of the distribution, but not in a separate plot. I have tried to call geom_histogram twice, once with the continuous data and once with the discrete data in a separate column but that gives me the following error:
Error: Discrete value supplied to continuous scale
Happy to hear suggestions!
Here's a somewhat involved solution, but I believe it best answers your question: you want to place, next to a typical histogram, a bar representing the ">10" values (or whichever values are non-numeric). Critically, you want to keep the binning of a histogram, which means you are not looking to simply make the scale discrete and draw the histogram as an ordinary barplot.
The Data
Since you want to retain histogram features, I'm going to use an example dataset that is a bit more involved than that you gave us. I'm just going to specify a uniform distribution (n=100) with 20 ">10" values thrown in there.
set.seed(123)
df<- data.frame(time=c(runif(100,0,10), rep(">10",20)))
As prepared, df$time is a character vector, but for a histogram, we need that to be numeric. We're simply going to force it to be numeric and accept that the ">10" values are going to be coerced to be NAs. This is fine, since in the end we're just going to count up those NA values and represent them with a bar. While I'm at it, I'm creating a subset of df that will be used for creating the bar representing our NAs (">10") using the count() function, which returns a dataframe consisting of one row and column: df$n = 20 in this case.
library(dplyr)
library(ggplot2)
library(cowplot) # provides plot_grid(), used below
df$time <- as.numeric(df$time) #force numeric and get NA for everything else
df_na <- count(subset(df, is.na(time)))
The Plot(s)
For the actual plot, you are asking to create a combination of (1) a histogram, and (2) a barplot. These are not the same plot, but more importantly, they cannot share the same axis, since by definition, the histogram needs a continuous axis and "NA" values or ">10" is not a numeric/continuous value. The solution here is to make two separate plots, then combine them with a bit of magic thanks to cowplot.
The histogram is created quite easily. I'm saving the number of bins for demonstration purposes later. Here's the basic plot:
bin_num <- 12 # using this later
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
Thanks to the subsetting previously, the barplot for the NA values is easy too:
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3)
Yikes! That looks horrible, but have patience.
Stitching them together
You can simply run plot_grid(p1, p2) and you get something workable... but it leaves quite a lot to be desired:
There are problems here. I'll enumerate them, then show you the final code for how I address them:
1. Need to remove some elements from the NA barplot, namely the y axis entirely and the title for the x axis (but it can't be NULL or the x axes won't line up properly). These are theme() elements that are easily removed via ggplot.
2. The NA barplot is taking up WAY too much room. Need to cut the width down. We address this with the rel_widths= argument of plot_grid(). Easy peasy.
3. How do we know how to set the y scale upper limit? This is a bit more involved, since it depends on the ..count.. stat for p1 as well as the number of NA values. You can access the maximum count for a histogram using ggplot_build(), which is part of ggplot2.
So, the final code requires the creation of the basic p1 and p2 plots, then adds to them in order to fix the limits. I'm also adding an annotation for number of bins to p1 so that we can track how well the upper limit setting works. Here's the code and some example plots where bin_num is set at 12 and 5, respectively:
# basic plots
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3) +
labs(x="") + theme(axis.line.y=element_blank(), axis.text.y=element_blank(),
axis.title.y=element_blank(), axis.ticks.y=element_blank()
) +
scale_x_discrete(expand=expansion(add=1))
#set upper y scale limit
max_count <- max(c(max(ggplot_build(p1)$data[[1]]$count), df_na$n))
# fix limits for plots
p1 <- p1 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15))) +
annotate('text', x=0, y=max_count, label=paste('Bins:', bin_num)) # for demo purposes
p2 <- p2 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15)))
plot_grid(p1, p2, rel_widths=c(1,0.2))
So, our upper limit fixing works. You can get really crazy playing around with positioning, etc and the plot_grid() function, but I think it works pretty well this way.
Perhaps, this is what you are looking for:
library(dplyr)
library(ggplot2)
df1 <- data.frame(x=sample(1:12,50,replace=TRUE))
df2 <- df1 %>% group_by(x) %>%
dplyr::summarise(y=n()) %>% subset(x<11)
df3 <- subset(df1, x>10) %>% dplyr::summarise(y=n()) %>% mutate(x=11)
df <- rbind(df2,df3 )
label <- ifelse((df$x<11),as.character(df$x),">10")
p <- ggplot(df, aes(x=x,y=y,color=x,fill=x)) +
geom_bar(stat="identity", position = "dodge") +
scale_x_continuous(breaks=df$x,labels=label)
p
and you get the following output:
Please note that sometimes you could have some of the bars missing depending on the sample.
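One possible way around the missing bars, as a rough sketch rather than a drop-in fix: bin the values into a factor with fixed levels, so that empty bins are kept with a count of zero. The names bins and counts are made up for this example.

```r
library(ggplot2)

set.seed(1)
x <- sample(1:12, 50, replace = TRUE)

# Collapse everything above 10 into one label; fixed levels keep empty bins
bins <- factor(ifelse(x > 10, ">10", as.character(x)),
               levels = c(as.character(1:10), ">10"))

# table() counts every level, including those with zero observations
counts <- as.data.frame(table(bin = bins))

p <- ggplot(counts, aes(x = bin, y = Freq)) +
  geom_col()
print(p)
```

Because the x axis is now discrete, this is a barplot of counts rather than a true histogram, which is the same trade-off as in the code above.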
I have searched considerably for what I want to accomplish, but I haven't run across examples or plots that are specifically what I'm looking for, so I am reaching out to the community.
What I have (data downloadable here):
Time-series data (each record 2 hours apart and spanning nearly a year) with associated elevation and property ownership.
library(ggplot2)
data <- read.csv("dataex.csv")
data$timestamp <-as.POSIXct(as.character(data$timestamp),format="%m/%d/%Y %H:%M", tz="GMT")
What I want (via ggplot):
A line or bar plot showing elevation (y-axis) across time (x-axis) for each data record, colored by ownership (for a line plot, filling the area under the line, or for a bar plot, filling the bar). I've tried iterations of geom_line, geom_bar, and geom_area (with geom_bar, below, being the closest I have come). I'd like at least one of the following options to come true!
Option A - The closest I have come to achieving this (plotting per data record) is with the following code:
ggplot(data, aes(x=timestamp, y=elev, fill=OWNER)) + geom_bar(stat="identity")
However, I'd like the bars to be touching each other, but if I adjust the width in geom_bar(), everything disappears. (Also, if I run the above code on other batches of similar data, it will only show a fraction of the bars, likely because they have more data records.) It seems like it's just too much data to plot. So I tried another route...
Option B - Plotting by day, which turns out to be more informative, showing each day the variability in ownership.
ggplot(data, aes(x=as.Date(Date, format='%Y-%m-%d'), y=elev, fill=OWNER)) + geom_bar(stat="identity", width=1)
However, this sums the y-axis, so the elevation is not interpretable. I could divide the y-axis by 12 (the typical number of records per day) but there are occasional days with fewer than 12 records, which causes the y-axis to be incorrect. Is there a function or a way to divide the y-axis by the respective number of records per day that is being represented in the plot? Or does someone have advice for a better solution?
Something like:
library(readr)
library(dplyr)
library(ggplot2)
library(ggalt)
readr::read_csv("~/Desktop/dataex.csv") %>%
mutate(timestamp=lubridate::mdy_hm(timestamp)) %>%
select(timestamp, elev, Owner=OWNER) -> df
ggplot(df, aes(timestamp, elev, group=Owner, color=Owner)) +
geom_segment(aes(xend=timestamp, yend=0), size=0.1) +
scale_x_datetime(expand=c(0,0), date_breaks="2 months") +
scale_y_continuous(expand=c(0,0), limits=c(0,2250), label=scales::comma) +
ggthemes::scale_color_tableau() +
hrbrmisc::theme_hrbrmstr(grid="Y") +
labs(x=NULL, y="Elevation") +
theme(legend.position="bottom") +
theme(axis.title.y=element_text(angle=0, margin=margin(r=-20)))
?
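For Option B specifically, one way to divide by the number of records per day is to let each record contribute elev / n, where n is that day's record count. The stacked bar for a day then adds up to the daily mean elevation, split by owner. This is only a sketch on simulated data: the toy data frame stands in for dataex.csv, and the elev_share column is a made-up name.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for dataex.csv: one record every 2 hours for 20 days
set.seed(1)
toy <- data.frame(
  timestamp = seq(as.POSIXct("2015-01-01 00:00", tz = "GMT"),
                  by = "2 hour", length.out = 240),
  elev  = runif(240, 1500, 2200),
  OWNER = sample(c("Private", "State"), 240, replace = TRUE)
)

daily <- toy %>%
  mutate(day = as.Date(timestamp, tz = "GMT")) %>%
  group_by(day) %>%
  mutate(elev_share = elev / n()) %>%  # divide by that day's own record count
  ungroup()

# Stacked bars: each day's total height is the mean elevation for that day
p <- ggplot(daily, aes(x = day, y = elev_share, fill = OWNER)) +
  geom_col(width = 1)
print(p)
```

Days with fewer than 12 records are handled automatically, because n() is evaluated per day.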
I am visualizing a 4 dimensional data set.
Let's denote the variables as x, y1, y2 and y3, where x is for dates, y1 is a continuous variable and y2, y3 are the components of 2-dimensional vectors (y2, y3). Now I want to draw a line plot for x and y1, additionally attaching arrows for (y2, y3) at the points (x, y1).
I have tried
ggplot(data=data,aes(x=x,y=y1)) + geom_line() +
geom_segment(aes(xend=x+y2,yend=y1+y3), arrow = arrow())
but it doesn't work well so I think I may need to do some rescaling. How can I do this with ggplot?
UPDATE: I've attached a sample data set (together with its column definition). The data set contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. The data is expected to aid in the understanding and prediction of El Nino/Southern Oscillation (ENSO) cycles (from the description of the repository). Now, for example, I want to visualize x=day, y1=humidity, y2=zon.winds, y3=mer.winds with the symbol described above.
UPDATE2: for example, I want to plot this for a particular buoy
I am having trouble figuring out what you want to display.
As far as I can see, your dataset has 50 buoys that each deliver a measurement each day.
library(ggplot2)
elnino <- read.table('elnino.txt', col.names=c('buoy','day','latitude','longitude','zon.winds','mer.winds','humidity','air.temp','ss.temp'), as.is=TRUE, na='.')
elnino <- elnino[elnino$humidity > 40,] # removing a single point that seems to be an outlier.
ggplot(elnino, aes(x=day,y=humidity, group=buoy)) + geom_point()
ggplot(elnino, aes(x=day,y=humidity, group=buoy)) + geom_line()
Which gives these two results.
What I cannot see is how you want to display the ''zon.winds'' and ''mer.winds'' variables. I figure these in combination give a vector, but where do you want these placed? You would get ~ 700 arrows filling your plot.
Update
In that case, you got it right, that you have to use geom_segment and calculate the ''x'', ''xend'', ''y'' and ''yend'', see geom_segment.
# We select a single buoy
el <- subset(elnino, buoy==1)
library(grid)
ggplot(el, aes(x=day,y=humidity, group=buoy)) + geom_line() + geom_segment(aes(yend=humidity+zon.winds, xend=day+mer.winds), arrow = arrow(length = unit(0.1,"cm")))
This, however, does not look very nice, because the coordinates in ''zon.winds'' and ''mer.winds'' are taken as absolute values! So to use them, we need to transform them manually. The scaling factors I use here are completely arbitrary.
el <- transform(el, zon.winds = zon.winds * -0.3, mer.winds=mer.winds * -0.3)
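Instead of a hand-picked factor, the arrows could be rescaled relative to the axis ranges. The sketch below is one possible choice, not the only one: it uses simulated data in place of elnino.txt, follows the mapping from the question (zonal component along x, meridional along y), and the 8% fraction and the names kx, ky, xend and yend are arbitrary assumptions.

```r
library(ggplot2)
library(grid)  # arrow() and unit()

# Hypothetical stand-in for a single buoy from elnino.txt
set.seed(1)
el <- data.frame(
  day       = 1:14,
  humidity  = runif(14, 75, 90),
  zon.winds = rnorm(14, -4, 2),
  mer.winds = rnorm(14, 0, 2)
)

# Scale each component so its largest value spans 8% of the matching axis range
kx <- 0.08 * diff(range(el$day))      / max(abs(el$zon.winds))
ky <- 0.08 * diff(range(el$humidity)) / max(abs(el$mer.winds))

el$xend <- el$day      + kx * el$zon.winds
el$yend <- el$humidity + ky * el$mer.winds

p <- ggplot(el, aes(x = day, y = humidity)) +
  geom_line() +
  geom_segment(aes(xend = xend, yend = yend),
               arrow = arrow(length = unit(0.1, "cm")))
print(p)
```

Since the two axes have different units, the arrow directions are only meaningful relative to each other, not as true compass bearings.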
I'm an undergrad researcher and I've been teaching myself R over the past few months. I just started trying ggplot, and have run into some trouble. I've made a series of boxplots looking at the depth of fish at different acoustic receiver stations. I'd like to add a scatterplot that shows the depths of the receiver stations. This is what I have so far:
data <- read.csv(".....MPS.csv", header=TRUE)
df <- data.frame(f1=factor(data$Tagging.location), #$
f2=factor(data$Station),data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), data$depth)
df$f1f2 <- interaction(df$f1, df$f2) #$
plot1 <- ggplot(aes(y = data$Detection.depth, x = f2, fill = f1), data = df) + #$
geom_boxplot() + stat_summary(fun.data = give.n, geom = "text",
position = position_dodge(height = 0, width = 0.75), size = 3)
plot1+xlab("MPS Station") + ylab("Depth(m)") +
theme(legend.title=element_blank()) + scale_y_reverse() +
coord_cartesian(ylim=c(150, -10))
plot2 <- ggplot(aes(y=data$depth, x=f2), data=df2) + geom_point()
plot2+scale_y_reverse() + coord_cartesian(ylim=c(150,-10)) +
xlab("MPS Station") + ylab("Depth (m)")
Unfortunately, since I'm a new user in this forum, I'm not allowed to upload images of these two plots. My x-axis is "Stations" (which has 12 options) and my y-axis is "Depth" (0-150 m). The boxplots are colour-coded by tagging site (which has 2 options). The depths are coming from two different columns in my spreadsheet, and they cannot be combined into one.
My goal is to combine those two plots by adding "plot2" (station depth scatterplot) to the "plot1" boxplots (detection depths). They both look at the same variables (depth and station), and must share the same y-axis scale.
I think I could figure out a messy workaround if I were using the R base program, but I would like to learn ggplot properly, if possible. Any help is greatly appreciated!
Update: I was confused by the language used in the original post, and wrote a slightly more complicated answer than necessary. Here is the cleaned up version.
Step 1: Setting up. Here, we make sure the depth values in both data frames have the same variable name (for readability).
df <- data.frame(f1=factor(data$Tagging.location), f2=factor(data$Station), depth=data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), depth=data$depth)
Step 2: Now you can plot this with the 'ggplot' function and split the data by using the `col=f1` argument. We'll plot the detection data separately, since that requires a boxplot, and then we'll plot the depths of the stations with colored points (assuming each station only has one depth). We specify the two different plots by referencing the data from within the 'geom' functions, instead of specifying the data inside the main 'ggplot' function. It should look something like this:
ggplot()+geom_boxplot(data=df, aes(x=f2, y=depth, col=f1)) + geom_point(data=df2, aes(x=f2, y=depth), colour="blue") + scale_y_reverse()
In this plot example, we use boxplots to represent the detection data and color those boxplots by the site label. The stations, however, we plot separately using a specific color of points, so we will be able to see them clearly in relation to the boxplots.
You should be able to adjust the plot from here to suit your needs.
I've created some dummy data and loaded into the chart to show you what it would look like. Keep in mind that this is purely random data and doesn't really make sense.
I am trying to plot a sequence of small coloured squares representing different types of activities. For example, in the following data frame, type represents the type of activity and count represents how many of those activities occurred before one of a different type took place.
df3 <- data.frame(type=c(1,6,4,6,1,4,1,4,1,1,1,1,6,6,1,1,3,1,4,1,4,6,4,6,4,4,6,4,6,4),
count=c(6,1,1,1,2,1,6,3,1,6,8,10,3,1,2,2,1,2,1,1,1,1,1,1,3,3,1,17,1,12) )
In ggplot, I am not using count so far. I am just giving consecutive numbers as x values and 1 as y values. However, it gives me something like this: (ggplot image)
This is the code I used; note that for y I always use 1 and for x I just use consecutive numbers:
ggplot(df3,aes(x=1:nrow(df3),y=rep(1,30))) + geom_bar(stat="identity",aes(color=as.factor(type)))
I would like to get small squares with the width=df3$count.
Do you have any suggestions? Thanks in advance
I am not entirely clear on what you need, but I offer one possible way to plot your data. I have used geom_rect() to draw rectangles of width equal to your count column. The rectangles are plotted in the same order as the rows of your data.
df3 <- data.frame(type=c(1,6,4,6,1,4,1,4,1,1,1,1,6,6,1,
1,3,1,4,1,4,6,4,6,4,4,6,4,6,4),
count=c(6,1,1,1,2,1,6,3,1,6,8,10,3,1,2,
2,1,2,1,1,1,1,1,1,3,3,1,17,1,12))
library(ggplot2)
df3$type <- factor(df3$type)
df3$ymin <- 0
df3$ymax <- 1
df3$xmax <- cumsum(df3$count)
df3$xmin <- c(0, head(df3$xmax, n=-1))
plot_1 <- ggplot(df3,
aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax, fill=type)) +
geom_rect(colour="grey40", size=0.5)
png("plot_1.png", height=200, width=800)
print(plot_1)
dev.off()