One of the variables in my data frame is a factor denoting whether an amount was gained or spent. Every event has a "gain" value; there may or may not be a corresponding "spend" amount. Here is an image with the observations overplotted:
Adding some random jitter helps visually; however, the "spend" amounts become separated from their corresponding gain events:
I'd like to see the blue circles "bullseyed" in their gain circles (where the "id" values are equal), and jittered as a pair. Here are some sample data (three days) and code:
library(ggplot2)
ccode<-c(Gain="darkseagreen",Spend="darkblue")
ef<-data.frame(
date=as.Date(c("2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03")),
site=c("Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace","Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace"),
id=c("C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99","C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99"),
gainspend=c("Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend"),
amount=c(6,14,34,31,3,10,6,14,2,16,16,14,1,1,15,11,8,7,2,10,15,4,3,NA,NA,4,5,NA,NA,NA,NA,NA,NA,2,NA,1,NA,3,NA,NA,2,NA,NA,2,NA,3))
#▼ 3 day, points centered
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
#▼ 3 day, jittered
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5,position=position_jitter(w=0,h=0.2)) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
My main idea is the old "add jitter manually" approach. I'm wondering if a nicer approach could be something like plotting little pie charts as points a la package scatterpie.
In this case you could add a random number for the amount of jitter to each ID so points within groups will be moved the same amount. This takes doing work outside of ggplot2.
First, draw the "jitter" to add for each ID. Since a categorical axis is 1 unit wide, I choose numbers between -.3 and .3. I use dplyr for this work and set the seed so you will get the same results.
library(dplyr)
set.seed(16)
ef2 = ef %>%
group_by(id) %>%
mutate(jitter = runif(1, min = -.3, max = .3)) %>%
ungroup()
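For readers without dplyr, here is a minimal base-R sketch of the same per-id idea on toy data. The names toy and offsets are made up for illustration and are not part of the original data.

```r
set.seed(16)

# toy data: each id appears once as a Gain row and once as a Spend row
toy <- data.frame(
  id = c("A", "B", "A", "B"),
  gainspend = c("Gain", "Gain", "Spend", "Spend")
)

# draw one offset per unique id, then look it up by id for every row
ids <- unique(toy$id)
offsets <- setNames(runif(length(ids), min = -0.3, max = 0.3), ids)
toy$jitter <- unname(offsets[toy$id])

toy
```

Rows that share an id end up with an identical jitter value, which is what keeps each gain/spend pair together on the plot.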
Then the plot. I use a geom_blank() layer so that the categorical site axis is drawn before I add the jitter. I then convert site to a factor and from there to numeric, and add the jitter on; this works because ggplot2 places the levels of a discrete axis at the integer positions 1, 2, 3, and so on.
Now paired IDs move together.
ggplot(ef2, aes(x = date, y = site)) +
geom_blank() +
geom_point(aes(size = amount, color = gainspend,
y = as.numeric(factor(site)) + jitter),
alpha=0.5) +
scale_color_manual(values = ccode) +
scale_size_continuous(range = c(1, 15), breaks = c(5, 10, 20))
#> Warning: Removed 15 rows containing missing values (geom_point).
Created on 2021-09-23 by the reprex package (v2.0.0)
You can add some jitter by id outside the ggplot() call.
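# one jitter value per unique id (runif(nrow(ef)) would recycle the ids and duplicate rows after the merge)
jj <- data.frame(id = unique(ef$id))
jj$jtr <- runif(nrow(jj), -0.3, 0.3)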
ef <- merge(ef, jj, by = 'id')
ef$sitej <- as.numeric(factor(ef$site)) + ef$jtr
But you need to make site integer/numeric to do this. So when it comes to making the plot, you need to manually add axis labels with scale_y_continuous(). (Update: the geom_blank() trick from aosmith above is a better solution!)
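Why the manual labels line up can be checked on a toy vector. This is only a sketch; site_toy, pos and labs are hypothetical names used for illustration.

```r
# toy site vector in arbitrary order
site_toy <- c("Temple", "Castle", "Palace", "Castle")

# factor() sorts the unique values to build its levels,
# so the integer codes follow alphabetical order
pos <- as.numeric(factor(site_toy))

# the same sorted unique values serve as the axis labels
labs <- sort(unique(site_toy))

# indexing the labels by position recovers the original sites
labs[pos]
```

With that mapping in mind, the plot is: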
ggplot(ef,aes(date,sitej)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20)) +
scale_y_continuous(breaks = 1:3, labels= sort(unique(ef$site)))
This seems to work, but there are still a few gain/spend circles without a partner--perhaps there is a problem with the id variable.
Perhaps someone else has a better approach!
I previously got help overlaying two graphs with different x-axes in this question: "I have 2 graphs on R. They have different x axis, but similar trend profile. how do I overlay them on r?".
However, I am now trying to overlay 4 graphs. I tried to overlay them but they are not aligned.
I need assistance to overlay these four graphs.
My initial attempt is shown below.
My raw data are at the following link: https://drive.google.com/drive/folders/1ZZQAATkbeV-Nvq1YYZMYdneZwMvKVUq1?usp=sharing.
Code used:
# store each plot under its own name so the data frames are not overwritten
p_first <- ggplot(data = first,
aes(x, y)) +
geom_line()
p_second <- ggplot(data = second,
aes(x, y)) +
geom_line()
p_third <- ggplot(data = third,
aes(x, y)) +
geom_line()
p_fourth <- ggplot(data = fourth,
aes(x, y)) +
geom_line()
first$match <- first$x
second$match <- second$x - second$x[second$y == max(second$y)] + first$x[first$y == max(first$y)]
third$match <- third$x
fourth$match <- fourth$x
first$series = "first"
second$series = "second"
third$series = "third"
fourth$series = "fourth"
all_data <- rbind(first, second, third, fourth)
ggplot(all_data) + geom_line(aes(x = match, y, color = series)) +
scale_x_continuous(name = "X, arbitrary units") +
theme(axis.text.x = element_blank())
Would greatly appreciate the help indeed.
OP, I thought I would propose a solution for your question. OP has 4 datasets with x and y columns, and wants to align the peaks in each dataset so that they stack on top of one another. Here's what it looks like when we plot all datasets together:
p <- ggplot(mapping=aes(x=x, y=y)) + theme_bw() +
geom_line(data=first, aes(color="first")) +
geom_line(data=second, aes(color="second")) +
geom_line(data=third, aes(color="third")) +
geom_line(data=fourth, aes(color="fourth"))
The approach will be as follows:
Find the peak x value for each dataset
Shift each dataset's x values so that its peak lines up with the smallest peak x value among the datasets
Combine the datasets into one long data frame, which respects Tidy Data principles, and plot them together
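The three steps above can be sketched in base R on toy data. Note the assumptions: the two toy curves are invented, and which.max() stands in for a proper peak finder, which is only safe when the global maximum really is the peak of interest.

```r
# two toy curves whose peaks sit at different x positions
a <- data.frame(x = 1:7,   y = c(0, 1, 4, 9, 4, 1, 0))
b <- data.frame(x = 11:17, y = c(0, 2, 8, 2, 1, 0, 0))
series <- list(a = a, b = b)

# step 1: x position of each peak (which.max is a simplification)
peak_x <- sapply(series, function(d) d$x[which.max(d$y)])

# step 2: shift every curve so its peak lands on the smallest peak x
target <- min(peak_x)
shifted <- Map(function(d, px, nm) {
  d$x <- d$x + (target - px)
  d$series <- nm
  d
}, series, peak_x, names(series))

# step 3: stack into one long data frame, ready for a single ggplot() call
long <- do.call(rbind, shifted)
```

After the shift both toy peaks share the same x value, so mapping color to series would draw them on top of one another.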
Finding peaks and adjusting x values
To find the peaks, I like to use the findpeaks() function from the pracma library. You feed the function your dataset's y values (arranged by increasing x value), and the function will return a matrix with each row representing a "peak" and the columns give you height of peak in y value, index or row of dataset for the peak, where the peak begins, and where the peak ends. As an example, here's how we can apply this principle and what the result looks like on one of the datasets:
library(pracma)
library(dplyr) # for arrange()
first <- arrange(first, x) # arrange first by increasing x
findpeaks(first$y, sortstr = TRUE, npeaks=1)
[,1] [,2] [,3] [,4]
[1,] 1047.54 402 286 515
The argument sortstr= indicates we want the list of peaks sorted by "highest" first, and we are only interested in picking the first peak. In this case, we can see that 402 is the index of the x,y value in first for the peak. So we can access that x value via first[index,]$x.
The one concern we may have here is that this may not work for fourth, since the max value of y is actually not the peak of interest; however, if we run the function and test this out, using the findpeaks() method where we return the highest peak works fine: apparently the function does not find there is a "peak" at the right since it has an "up", but not a "down".
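The "up, then down" idea can be illustrated with a small base-R stand-in. This is not pracma's implementation; tallest_interior_peak is a hypothetical helper written only to show why a rising tail at the edge does not count as a peak.

```r
# hypothetical helper: index of the tallest point that has a rise
# before it and a fall after it
tallest_interior_peak <- function(y) {
  d <- diff(y)
  # position i is a local maximum if the step into it is up
  # and the step out of it is down
  idx <- which(d[-length(d)] > 0 & d[-1] < 0) + 1
  if (length(idx) == 0) return(NA_integer_)
  idx[which.max(y[idx])]
}

# toy curve: a real peak at index 3, then a tail that only rises
y_toy <- c(0, 1, 3, 1, 0, 2, 5, 9)

tallest_interior_peak(y_toy)  # picks the interior peak
which.max(y_toy)              # picks the rising tail instead
```

On this toy curve the helper selects index 3 while which.max() selects the last point, which mirrors the situation described for fourth.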
The code below handles all the steps we need: arranging, finding peaks, and adjusting the x values.
# sort each dataset by x first so findpeaks() sees the points in order
first <- arrange(first, x)
second <- arrange(second, x)
third <- arrange(third, x)
fourth <- arrange(fourth, x)
# find the minimum peak. We know it's from third, but here's
# how you do it if you don't "know" that
peaks_first <- findpeaks(first$y, sortstr = TRUE, npeaks=1)
peaks_second <- findpeaks(second$y, sortstr = TRUE, npeaks=1)
peaks_third <- findpeaks(third$y, sortstr = TRUE, npeaks=1)
peaks_fourth <- findpeaks(fourth$y, sortstr = TRUE, npeaks=1)
# minimum peak x value
peak_x <- min(c(first[peaks_first[2],]$x, second[peaks_second[2],]$x, third[peaks_third[2],]$x, fourth[peaks_fourth[2],]$x))
# function to use to fix each dataset
fix_x <- function(peak_x, dataset) {
dataset <- arrange(dataset, x)
d_peak <- findpeaks(dataset$y, sortstr = TRUE, npeaks=1)
d_peak_x <- dataset[d_peak[2],]$x
x_adj <- peak_x - d_peak_x
dataset$x <- dataset$x + x_adj
return(dataset)
}
# apply and fix each dataset
fix_first <- fix_x(peak_x, first)
fix_second <- fix_x(peak_x, second)
fix_third <- fix_x(peak_x, third)
fix_fourth <- fix_x(peak_x, fourth)
# combine datasets
fix_first$measure <- 'First'
fix_second$measure <- 'Second'
fix_third$measure <- 'Third'
fix_fourth$measure <- 'Fourth'
fixed <- rbind(fix_first, fix_second, fix_third, fix_fourth)
fixed$measure <- factor(fixed$measure, levels=c('First','Second','Third','Fourth'))
Plot Together
Now fixed contains all the data, and we can plot them all together:
ggplot(fixed, aes(x=x, y=y, color=measure)) + theme_bw() +
geom_line()
Alternate Plotting Methods
If you want to "stack" the lines on top of one another, this is what is known as a ridgeline plot. There are two methods I can show for how to create the ridgeline plot: faceting or using ggridges and geom_ridgeline(). I can demonstrate both.
# Using facets
ggplot(fixed, aes(x=x, y=y, color=measure)) + theme_bw() +
geom_line(show.legend = FALSE) +
facet_grid(measure~.)
Note I chose not to show the legend, since the strip text indicates this same information.
# Using ggridges and geom_ridgeline
library(ggridges)
ggplot(fixed, aes(x=x, y=measure, color=measure)) + theme_bw() +
geom_ridgeline(aes(height=y), fill=NA, scale=0.001)
When using geom_ridgeline(), you'll notice that the y= aesthetic becomes the column used for the stacking, and your original y value is instead mapped to the height= aesthetic. I also had to play around with scale=, since for discrete values, each measure will be treated as integers (1, 2, 3, 4). Your height= values are waaaay higher than that, so we have to scale them down so that they are around this range (scaled down by about 1000).
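Instead of guessing scale=, one option is to derive it from the data. The sketch below assumes you want the tallest ridge to be about one category step high; heights_toy and scale_guess are illustrative names, and only the value 1047.54 comes from the findpeaks() output above.

```r
# toy heights; 1047.54 is the peak height reported by findpeaks() earlier
heights_toy <- c(0, 250, 1047.54, 300)

# choose a scale so the tallest value spans roughly one unit,
# i.e. the distance between two neighbouring categories
scale_guess <- 1 / max(heights_toy)

heights_toy * scale_guess
```

Here scale_guess comes out close to the 0.001 used above, and it could be passed as scale = scale_guess.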
I've written an R script that loops through a data.frame making multiple complex plots, each of which includes a histogram. The problem is that the histograms often show a tall, uninformative peak at x=0 or x=1 that obscures the rest of the data, which is more informative. I have figured out that I can hide the tall peak by defining the limits of the x and y axes of each histogram, as seen in the code below, but what I really need to figure out is how to define the y-axis limits so that they are optimized for the second-largest peak in my histogram.
Here's some code that simulates my data and plots histograms with different sorts of axis limits imposed:
require(ggplot2)
set.seed(5)
df = data.frame(matrix(sample(c(1:10), 1000, replace = TRUE, prob = c(0.8,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01)), nrow=100))
cols = names(df)
for (i in c(1:length(cols))) {
my_col = cols[i]
p1 = ggplot(df, aes_string(my_col)) + geom_histogram(bins = 10)
print(p1)
p2 = p1 + ggtitle(paste("Fixed X Limits", my_col)) + scale_x_continuous(limits = c(1,10))
print(p2)
p3 = p1 + ggtitle(paste("Fixed Y Limits", my_col)) + scale_y_continuous(limits = c(0,3))
print(p3)
p4 = p1 + ggtitle(paste("Fixed X & Y Limits", my_col)) + scale_y_continuous(limits = c(0,3)) + scale_x_continuous(limits = c(1,10))
print(p4)
}
The problem is that in this data, I can hard-code y-limits and have a reasonable expectation that they will work well for all the histograms. With my real data the size of the peaks varies wildly between the numerous histograms I am producing. I've tried defining the y-limit with various equations based on descriptive numbers like the mean, median and range but nothing I've come up with works well for all cases.
If I could define the y-limit in relation to the second-tallest peak of the histogram, I would have something that was perfectly suited for each situation.
I am not sure how ggplot builds its histograms, but one method would be to grab the results from hist:
# second-highest bar count within each column's histogram (plot = FALSE skips drawing)
myYlim <- sapply(df, function(i) sort(hist(i, plot = FALSE)$counts, decreasing = TRUE)[2])
I would process the data to determine the height you need.
Something along the lines of:
sort(table(cut(df$X1, breaks = 10)), decreasing = TRUE)[2]
Working from the inside out:
cut bins the data (not really needed with integer data like yours, but probably needed with real data)
table then creates a table with the count in each of those bins
sort sorts the table from highest to lowest
[2] takes the 2nd-highest value
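The pipeline above can be wrapped in a small function and applied per column. This is a sketch under assumptions: second_tallest is a hypothetical helper, x_toy is invented data, and the bins produced by cut() will not necessarily match the bins geom_histogram() draws.

```r
# hypothetical helper: count in the second-tallest bin of x
second_tallest <- function(x, bins = 10) {
  counts <- table(cut(x, breaks = bins))
  sort(as.vector(counts), decreasing = TRUE)[2]
}

# toy column: one dominant value plus a thin spread over the rest
x_toy <- c(rep(1, 80), rep(2, 7), rep(3, 5), 4:10)

limit <- second_tallest(x_toy)
limit
```

The result could then be used as the upper y limit, for example with coord_cartesian(ylim = c(0, limit)), which zooms the view without removing the tall bar from the data the way scale_y_continuous(limits = ...) does.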