Histogram / Density plot from overlapping uneven bins and counts - r

I have data where sample counts have been pre-calculated for bins across a range, and the bins overlap and are of uneven sizes. It looks something like:
x2 <- data.frame("BinFrom" = c(1,1,2,2,4,4,4,5,5,5,8,8,8,9,11,14,17,18,19),
"BinTo" = c(3,6,4,8,5,8,6,10,12,6,7,15,11,10,20,20,18,19,20),
"Count" = c(1000,2400,15,2000,20,3800,10,6000,4200,10,25,3000,2800,10,1300,9000,10,5,40))
I wish to generate a histogram and density plot for these data. Is there a way to do this?
ggdensity etc expect the expanded data. I attempted to force that format by expanding on the mid-point of the bins, e.g.:
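library(dplyr)           # mutate(), %>%
library(splitstackshape) # expandRows()
library(ggpubr)          # ggdensity()
x2 <- x2 %>% mutate(MidBin = BinFrom + ((BinTo-BinFrom)/2))
xp <- x2 %>% expandRows(., "Count")
ggdensity(xp, "MidBin")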
but this loses important data, and is not possible with my actual data frame as the row expansion exhausts the vector memory.
All help appreciated

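Create a data frame with one row per base position and, for each position, sum the counts of every bin that overlaps it:
# one row per integer position across the full range of the bins
base <- data.frame(base = min(x2$BinFrom):max(x2$BinTo))
# total count of all bins covering each position
base$overlap <- sapply(base$base, function(x) sum(x2$Count[x >= x2$BinFrom & x <= x2$BinTo]))
Then plot:
library(ggplot2)
ggplot(base,aes(x=base,y=overlap))+geom_bar(stat = "identity")
# or
ggplot(base,aes(x=base,y=overlap))+geom_area(alpha=0.25)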
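Note that the sum above adds each bin's full Count to every position the bin covers, so a wide bin contributes more total mass than a narrow one with the same count. A possible alternative is sketched below. It assumes each count is spread uniformly over the integer positions of its bin, which is my assumption and not something stated in the question. The names valid, bins, width and weighted are illustrative. One row of the sample data has BinFrom greater than BinTo (8 to 7); it covers no position, so it is dropped here.

```r
x2 <- data.frame(
  BinFrom = c(1,1,2,2,4,4,4,5,5,5,8,8,8,9,11,14,17,18,19),
  BinTo   = c(3,6,4,8,5,8,6,10,12,6,7,15,11,10,20,20,18,19,20),
  Count   = c(1000,2400,15,2000,20,3800,10,6000,4200,10,25,3000,2800,10,1300,9000,10,5,40)
)

# keep only bins whose upper edge is not below the lower edge
valid <- x2$BinTo >= x2$BinFrom
bins  <- x2[valid, ]

# number of integer positions each bin covers (both edges inclusive)
bins$width <- bins$BinTo - bins$BinFrom + 1

# one row per integer position across the full range
base <- data.frame(base = min(bins$BinFrom):max(bins$BinTo))

# at each position, add Count / width for every bin covering it
base$weighted <- sapply(base$base, function(x) {
  inside <- x >= bins$BinFrom & x <= bins$BinTo
  sum(bins$Count[inside] / bins$width[inside])
})

# plotting is optional and only attempted if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(base, ggplot2::aes(x = base, y = weighted)) +
    ggplot2::geom_col()
}
```

With this weighting the column heights add up to the total of the retained counts, so the result can be read as an approximate histogram. Neither version expands the rows, so memory use stays proportional to the number of bins times the number of positions.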

Related

How to make "interactive" time series plots for exploratory data analysis

I have a time series data frame similar to the data created below. Measurements of 5 variables are taken on each individual. Individuals have unique ID numbers. Note that in this data set each individual has the same number of observations (1000 each), but in my real data set each individual has a different number of observations. For each individual, I want to plot all 5 variables on top of one another (i.e. all on the y axis) against time (x axis). I want to print each of these plots to an external document of some kind (pdf, or whatever is recommended for this application) with one plot per page, meaning each individual will have its own page with a single plot. I want these time series plots to be "interactive", in that I can move my mouse over a point and it will tell me what time that data point is at. My goal in doing this is to explore the association between peaks, valleys, and other regions across the 5 variables. I am not sure if ggplot2 is still the best tool for this, but I would still like the plots to be aesthetically appealing so that it will be easier to see patterns in the data. Also, is pasting these plots to a pdf the most sensible route? Or would I be better off using an R notebook or some other application?
ID <- rep(c("A","B","C"), each=1000)
time <- rep(c(1:1000), times = 3)
one <- rnorm(1000)
two <- rnorm(1000)
three <- rnorm(1000)
four <- rnorm(1000)
five<-rnorm(1000)
data<- data.frame(cbind(ID,time,one,two,three,four,five))
Try using the plotly package. And since you want it to be interactive, you'll want to export to something like html rather than pdf.
To produce a single faceted plot (note I added stringsAsFactors = FALSE to your sample data):
library(tidyverse)
library(plotly)
ID <- rep(c("A","B","C"), each=1000)
time <- rep(c(1:1000), times = 3)
one <- rnorm(1000)
two <- rnorm(1000)
three <- rnorm(1000)
four <- rnorm(1000)
five<-rnorm(1000)
data<- data.frame(cbind(ID,time,one,two,three,four,five),
stringsAsFactors = FALSE)
data_long <- data %>%
gather(variable,
value,
one:five) %>%
mutate(time = as.numeric(time),
value = as.numeric(value))
plot <- data_long %>%
ggplot(aes(x = time,
y = value,
color = variable)) +
geom_point() +
facet_wrap(~ID)
interactive_plot <- ggplotly(plot)
htmlwidgets::saveWidget(interactive_plot, "example.html")
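As a side note on why the as.numeric() calls are needed: cbind() on a character vector and numeric vectors returns a character matrix, so every column arrives in the data frame as text. A minimal sketch, assuming you are free to change how the sample data is built, passes the columns straight to data.frame() instead. The names n and ids, and the seed, are illustrative only.

```r
# cbind() coerces everything to character when one input is character
m <- cbind(ID = c("A", "B"), time = 1:2)
is.character(m)   # TRUE

set.seed(42)      # arbitrary seed, only for reproducibility
n   <- 1000
ids <- c("A", "B", "C")

# building the data frame directly keeps the numeric columns numeric
data <- data.frame(
  ID    = rep(ids, each = n),
  time  = rep(seq_len(n), times = length(ids)),
  one   = rnorm(n * length(ids)),
  two   = rnorm(n * length(ids)),
  three = rnorm(n * length(ids)),
  four  = rnorm(n * length(ids)),
  five  = rnorm(n * length(ids)),
  stringsAsFactors = FALSE
)

str(data)
```

With the data built this way, the mutate() step that converts time and value back to numeric should no longer be necessary.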
If you want to produce and export an interactive plot for every ID programmatically:
walk(unique(data_long$ID),
~ htmlwidgets::saveWidget(ggplotly(data_long %>%
filter(ID == .x) %>%
ggplot(aes(x = time,
y = value,
color = variable)) +
geom_point() +
labs(title = paste(.x))),
paste("plot_for_ID_", .x, ".html", sep = "")))
Edit: I changed map() to walk() so that the plots are produced without console output (previously just a list with 3 empty elements).

How to plot the difference between two ggplot density distributions?

I would like to use ggplot2 to illustrate the difference between two similar density distributions. Here is a toy example of the type of data I have:
library(ggplot2)
# Make toy data
n_sp <- 100000
n_dup <- 50000
D <- data.frame(
event=c(rep("sp", n_sp), rep("dup", n_dup) ),
q=c(rnorm(n_sp, mean=2.0), rnorm(n_dup, mean=2.1))
)
# Standard density plot
ggplot( D, aes( x=q, y=after_stat(density), col=event ) ) + # after_stat() replaces the older ..density.. notation
geom_freqpoly()
Rather than separately plot the density for each category ( dup and sp ) as above, how could I plot a single line that shows the difference between these distributions?
In the toy example above, if I subtracted the dup density distribution from the sp density distribution, the resulting line would be above zero on the left side of the plot (since there is an abundance of smaller sp values) and below 0 on the right (since there is an abundance of larger dup values). Note that there may be a different number of observations of type dup and sp.
More generally - what is the best way to show differences between similar density distributions?
There may be a way to do this within ggplot, but frequently it's easiest to do the calculations beforehand. In this case, call density on each subset of q over the same range, then subtract the y values. Using dplyr (translate to base R or data.table if you wish),
library(dplyr)
library(ggplot2)
D %>% group_by(event) %>%
# calculate densities for each group over same range; store in list column
summarise(d = list(density(q, from = min(.$q), to = max(.$q)))) %>%
# make a new data.frame from two density objects
do(data.frame(x = .$d[[1]]$x, # grab one set of x values (which are the same)
y = .$d[[1]]$y - .$d[[2]]$y)) %>% # subtract the y values; groups sort alphabetically, so this is dup minus sp
ggplot(aes(x, y)) + # now plot
geom_line()
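For anyone who prefers base R, here is a rough equivalent of the same idea. It is only a sketch: the sample sizes are smaller than in the question to keep it quick, the seed is arbitrary, and the names rng, d_sp, d_dup and diff_df are mine. It subtracts dup from sp, which is the direction described in the question.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
n_sp  <- 1000
n_dup <- 500
D <- data.frame(
  event = c(rep("sp", n_sp), rep("dup", n_dup)),
  q     = c(rnorm(n_sp, mean = 2.0), rnorm(n_dup, mean = 2.1))
)

# evaluate both densities on the same grid so the y values line up
rng   <- range(D$q)
d_sp  <- density(D$q[D$event == "sp"],  from = rng[1], to = rng[2])
d_dup <- density(D$q[D$event == "dup"], from = rng[1], to = rng[2])

# sp minus dup at each grid point
diff_df <- data.frame(x = d_sp$x, y = d_sp$y - d_dup$y)

# plotting is optional and only attempted if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(diff_df, ggplot2::aes(x, y)) +
    ggplot2::geom_line() +
    ggplot2::geom_hline(yintercept = 0, linetype = "dashed")
}
```

Because density() normalises each curve to integrate to roughly 1, the different number of observations in the two groups does not matter here. Note that each call picks its own bandwidth by default, so you may want to pass a common bw if the groups differ a lot in size.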

R - ggplot2 - Get histogram of difference between two groups

Let's say I have a histogram with two overlapping groups. Here's a possible command from ggplot2 and a pretend output graph.
ggplot(data, aes(x=Variable1, fill=BinaryVariable)) + geom_histogram(position="identity")
So what I have is the frequency or count of each event. What I'd like to do instead is to get the difference between the two events in each bin. Is this possible? How?
For example, if we do RED minus BLUE:
Value at x=2 would be ~ -10
Value at x=4 would be ~ 40 - 200 = -160
Value at x=6 would be ~ 190 - 25 = 155
Value at x=8 would be ~ 10
I'd prefer to do this using ggplot2, but another way would be fine. My dataframe is set up with items like this toy example (dimensions are actually 25000 rows x 30 columns). EDITED: Here is example data to work with: GIST Example
ID Variable1 BinaryVariable
1 50 T
2 55 T
3 51 N
.. .. ..
1000 1001 T
1001 1944 T
1002 1042 N
As you can see from my example, I'm interested in a histogram to plot Variable1 (a continuous variable) separately for each BinaryVariable (T or N). But what I really want is the difference between their frequencies.
So, in order to do this we need to make sure that the "bins" we use for the histograms are the same for both levels of your indicator variable. Here's a somewhat naive solution (in base R):
df = data.frame(y = c(rnorm(50), rnorm(50, mean = 1)),
x = rep(c(0,1), each = 50))
#full hist
fullhist = hist(df$y, breaks = 20) #specify more breaks than probably necessary
#create histograms for 0 & 1 using breaks from full histogram
zerohist = with(subset(df, x == 0), hist(y, breaks = fullhist$breaks))
oneshist = with(subset(df, x == 1), hist(y, breaks = fullhist$breaks))
#combine the hists
combhist = fullhist
combhist$counts = zerohist$counts - oneshist$counts
plot(combhist)
So we specify how many breaks should be used (based on values from the histogram on the full data), and then we compute the differences in the counts at each of those breaks.
PS It might be helpful to examine what the non-graphical output of hist() is.
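To illustrate that last point, here is a small sketch of the same approach with plot = FALSE, so hist() only returns its list of breaks, counts, mids and so on without drawing anything. The seed and the names breaks, zero, ones and diff_counts are illustrative.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
df <- data.frame(y = c(rnorm(50), rnorm(50, mean = 1)),
                 x = rep(c(0, 1), each = 50))

# breaks computed once from the full data, without plotting
breaks <- hist(df$y, breaks = 20, plot = FALSE)$breaks

# per-group histograms that share those breaks
zero <- hist(df$y[df$x == 0], breaks = breaks, plot = FALSE)
ones <- hist(df$y[df$x == 1], breaks = breaks, plot = FALSE)

# inspect the non-graphical output
str(zero)

# difference in counts per bin, keyed by bin midpoint
diff_counts <- data.frame(mid  = zero$mids,
                          diff = zero$counts - ones$counts)
```

A data frame like diff_counts can then be handed to any plotting function, for example as bars of height diff at position mid.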
Here's a solution that uses ggplot as requested.
The key idea is to use ggplot_build to get the rectangles computed by stat_histogram. From that you can compute the differences in each bin and then create a new plot using geom_rect.
Set up and create a mock dataset with lognormal data
library(ggplot2)
library(data.table)
theme_set(theme_bw())
n1<-500
n2<-500
k1 <- exp(rnorm(n1,8,0.7))
k2 <- exp(rnorm(n2,10,1))
df <- data.table(k=c(k1,k2),label=c(rep('k1',n1),rep('k2',n2)))
Create the first plot
p <- ggplot(df, aes(x=k,group=label,color=label)) + geom_histogram(bins=40) + scale_x_log10()
Get the rectangles using ggplot_build
p_data <- as.data.table(ggplot_build(p)$data[[1]])[,.(count,xmin,xmax,group)]
p1_data <- p_data[group==1]
p2_data <- p_data[group==2]
Join on the x-coordinates to compute the differences. Note that the count column is used rather than y, since the y-values are the y-coordinates of the first plot, not the counts.
newplot_data <- merge(p1_data, p2_data, by=c('xmin','xmax'), suffixes = c('.p1','.p2'))
newplot_data <- newplot_data[,diff:=count.p1 - count.p2]
setnames(newplot_data, old=c('count.p1','count.p2'), new=c('k1','k2'))
df2 <- melt(newplot_data,id.vars =c('xmin','xmax'),measure.vars=c('k1','diff','k2'))
Make the final plot
ggplot(df2, aes(xmin=xmin,xmax=xmax,ymax=value,ymin=0,group=variable,color=variable)) + geom_rect()
Of course the scales and legends still need to be fixed, but that's a different topic.

2D Histogram in R: Converting from Count to Frequency within a Column

Would appreciate help with generating a 2D histogram of frequencies, where frequencies are calculated within a column. My main issue: converting from counts to column based frequency.
Here's my starting code:
# expected packages
library(ggplot2)
library(plyr)
# generate example data corresponding to expected data input
x_data = sample(101:200,10000, replace = TRUE)
y_data = sample(1:100,10000, replace = TRUE)
my_set = data.frame(x_data,y_data)
# define x and y interval cut points
x_seq = seq(100,200,10)
y_seq = seq(0,100,10)
# label samples as belonging within x and y intervals
my_set$x_interval = cut(my_set$x_data,x_seq)
my_set$y_interval = cut(my_set$y_data,y_seq)
# determine count for each x,y block
xy_df = ddply(my_set, c("x_interval","y_interval"),"nrow") # still need to convert for use with dplyr
# convert from count to frequency based on formula: freq = count/sum(count in given x interval)
################ TRYING TO FIGURE OUT #################
# plot results
fig_count <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) + geom_tile(aes(fill = nrow)) # count
fig_freq <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) + geom_tile(aes(fill = freq)) # frequency
I would appreciate any help in how to calculate the frequency within a column.
Thanks!
jac
EDIT: I think the solution will require the following steps
1) Calculate and store overall counts for each x-interval factor
2) Divide the individual bin count by its corresponding x-interval factor count to obtain frequency.
Not sure how to carry this out though.
If you want to normalize over the x_interval values, you can create a column with the total count per interval and then divide by that. I must admit I'm not a ddply wiz, so maybe it has an easier way, but I would do
xy_df$xnrows<-with(xy_df, ave(nrow, x_interval, FUN=sum))
then
fig_freq <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) +
geom_tile(aes(fill = nrow/xnrows))
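If you would rather skip ddply altogether, the counting and the normalisation can both be done with base R's table() and prop.table(). This is just a sketch of the same calculation; the seed and the names counts and freq are mine, and the data mirrors the example in the question.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
my_set <- data.frame(x_data = sample(101:200, 10000, replace = TRUE),
                     y_data = sample(1:100,   10000, replace = TRUE))

my_set$x_interval <- cut(my_set$x_data, seq(100, 200, 10))
my_set$y_interval <- cut(my_set$y_data, seq(0, 100, 10))

# count per (x_interval, y_interval) block
counts <- table(x_interval = my_set$x_interval,
                y_interval = my_set$y_interval)

# margin = 1 divides by the total of each x_interval,
# i.e. freq = count / sum(count in given x interval)
freq <- prop.table(counts, margin = 1)

# back to long format for plotting
xy_df <- as.data.frame(freq, responseName = "freq")

# plotting is optional and only attempted if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  fig_freq <- ggplot2::ggplot(xy_df, ggplot2::aes(x = x_interval, y = y_interval)) +
    ggplot2::geom_tile(ggplot2::aes(fill = freq))
}
```

Each x_interval column of the resulting tile plot then sums to 1, which is the column-based frequency asked for.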

Subset of data included in more than one ggplot facet

I have a population and a sample of that population. I've made a few plots comparing them using ggplot2 and its faceting option, but it occurred to me that having the sample in its own facet will distort the population plots (however slightly). Is there a way to facet the plots so that all records are in the population plot, and just the sampled records in the second plot?
Matt,
If I understood your question properly - you want to have a faceted plot where one panel contains all of your data, and the subsequent facets contain only a subset of that first plot?
There's probably a cleaner way to do this, but you can create a new data.frame object with the appropriate faceting variable that corresponds to each subset. Consider:
library(ggplot2)
df <- data.frame(x = rnorm(100), y = rnorm(100), sub = sample(letters[1:5], 100, TRUE))
df2 <- rbind(
cbind(df, faceter = "Whole Sample")
, cbind(df[df$sub == "a" ,], faceter = "Subset A")
#other subsets go here...
)
ggplot(df2, aes(x, y)) + geom_point() + facet_wrap(~ faceter) # qplot() is deprecated in recent ggplot2
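If there are many subsets, the "other subsets go here" part can be generated rather than typed out. A minimal sketch, assuming you want one facet per value of sub plus the whole-sample facet; the seed and the name subsets are illustrative.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
df <- data.frame(x = rnorm(100), y = rnorm(100),
                 sub = sample(letters[1:5], 100, TRUE))

# one labelled copy of the rows for each value of sub
subsets <- lapply(sort(unique(df$sub)), function(s) {
  cbind(df[df$sub == s, ], faceter = paste("Subset", toupper(s)))
})

# the whole sample first, then every subset stacked underneath
df2 <- rbind(cbind(df, faceter = "Whole Sample"),
             do.call(rbind, subsets))

# plotting is optional and only attempted if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(df2, ggplot2::aes(x, y)) +
    ggplot2::geom_point() +
    ggplot2::facet_wrap(~ faceter)
}
```

Every record appears once in the "Whole Sample" panel and once in its own subset panel, so df2 has twice as many rows as df.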
Let me know if I've misunderstood your question.
-Chase
