I have data in R with overlapping points.
x = c(4,4,4,7,3,7,3,8,6,8,9,1,1,1,8)
y = c(5,5,5,2,1,2,5,2,2,2,3,5,5,5,2)
plot(x,y)
How can I plot these points so that overlapping points are drawn proportionally larger than points that do not overlap? For example, if 3 points lie at (4,5), then the dot at position (4,5) should be three times as large as a dot with only one point.
Here's one way using ggplot2:
library(ggplot2)
x = c(4,4,4,7,3,7,3,8,6,8,9,1,1,1,8)
y = c(5,5,5,2,1,2,5,2,2,2,3,5,5,5,2)
df <- data.frame(x = x,y = y)
ggplot(data = df,aes(x = x,y = y)) + stat_sum()
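By default, stat_sum maps point size to the proportion of instances. You can use raw counts instead by doing something like the following (after_stat() needs ggplot2 3.3.0 or later; older versions used the ..n.. notation):
ggplot(data = df,aes(x = x,y = y)) + stat_sum(aes(size = after_stat(n)))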
Here's a simpler (I think) solution:
x <- c(4,4,4,7,3,7,3,8,6,8,9,1,1,1,8)
y <- c(5,5,5,2,1,2,5,2,2,2,3,5,5,5,2)
size <- sapply(1:length(x), function(i) { sum(x==x[i] & y==y[i]) })
plot(x,y, cex=size)
## Tabulate the number of occurrences of each coordinate
df <- data.frame(x, y)
## aggregate() keeps each count on the same row as its (x, y) pair
df2 <- aggregate(list(value = rep(1, nrow(df))), by = df, FUN = sum)
## Use cex to set point size to some function of coordinate count
## (By using sqrt(value), the _area_ of each point will be proportional
## to the number of observations it represents)
plot(y ~ x, cex = sqrt(value), data = df2, pch = 16)
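To see why the square root matters: cex scales a symbol's linear size, so its area grows with the square of cex. Here is a minimal sketch comparing the two scalings side by side. The aggregate() call and the name counts are my own choices for illustration, not part of the answer above.

```r
x <- c(4,4,4,7,3,7,3,8,6,8,9,1,1,1,8)
y <- c(5,5,5,2,1,2,5,2,2,2,3,5,5,5,2)

# One row per unique (x, y) pair; n is the number of observations at that pair
counts <- aggregate(list(n = rep(1, length(x))),
                    by = list(x = x, y = y), FUN = sum)

op <- par(mfrow = c(1, 2))
# cex = n: diameter proportional to the count, so area grows like n^2
plot(counts$x, counts$y, cex = counts$n, pch = 16,
     xlab = "x", ylab = "y", main = "cex = n")
# cex = sqrt(n): area proportional to the count
plot(counts$x, counts$y, cex = sqrt(counts$n), pch = 16,
     xlab = "x", ylab = "y", main = "cex = sqrt(n)")
par(op)
```

Which of the two is "right" depends on whether readers should compare diameters or areas.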
You didn't really ask for this approach but alpha may be another way to address this:
library(ggplot2)
ggplot(data.frame(x=x, y=y), aes(x, y)) + geom_point(alpha=.3, size = 3)
You need to add the parameter cex to your plot function. First what I would do is use the function as.data.frame and table to reduce your data to unique (x,y) pairs and their frequencies:
new.data = as.data.frame(table(x,y))
new.data = new.data[new.data$Freq != 0,] # Remove points with zero frequency
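The only downside to this is that table() converts the numeric coordinates to factors. So convert back to numeric (going through as.character, because as.numeric on a factor returns the level codes rather than the original values), and plot!
plot(as.numeric(as.character(new.data$x)), as.numeric(as.character(new.data$y)), cex = new.data$Freq)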
You may also want to try sunflowerplot.
sunflowerplot(x,y)
Let me propose alternatives to adjusting the size of the points. One of the drawbacks of using size (radius? area?) is that the reader's evaluation of spot size vs. the underlying numeric value is subjective.
So, option 1: plot each point with transparency (ninja'd by Tyler!).
Option 2: use jitter to push your data around slightly so the plotted points don't overlap.
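A minimal sketch of the jitter option in base R, combined with a semi-transparent colour. The amount of 0.15 and the alpha of 0.4 are arbitrary values picked for this data, not recommendations.

```r
x <- c(4,4,4,7,3,7,3,8,6,8,9,1,1,1,8)
y <- c(5,5,5,2,1,2,5,2,2,2,3,5,5,5,2)

set.seed(1)  # jitter() is random; fix the seed for a reproducible picture

# Move every point by at most 0.15 units in each direction
xj <- jitter(x, amount = 0.15)
yj <- jitter(y, amount = 0.15)

# Semi-transparent black, so any remaining overlap shows up as darker ink
plot(xj, yj, pch = 16, col = rgb(0, 0, 0, alpha = 0.4),
     xlab = "x", ylab = "y")
```

Keep the amount well below the spacing between distinct values, otherwise neighbouring clusters start to blend into each other.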
A solution using lattice and table (similar to @R_User's, but there is no need to remove the zero counts, since lattice does the job):
library(lattice)
dt <- as.data.frame(table(x,y))
xyplot(dt$y~dt$x, cex = dt$Freq^2, col =dt$Freq)
Frequently I want to plot raster data with a lot of missing values, including entire missing rows or columns. Consider the following as a toy example:
library(ggplot2)
set.seed(50)
d = expand.grid(x = 1:100, y = 1:100)
d$v = rnorm(nrow(d))
d[d$x %in% sample(d$x, 5), "v"] = NA_real_
ggplot() + geom_raster(aes(x, y, fill = v), data = d)
This works so far, but what if I want to omit plotting the missing values at all, instead of plotting gray squares for them? If I change data = d to data = d[!is.na(d$v),], then I get the warning "Raster pixels are placed at uneven horizontal intervals and will be shifted. Consider using geom_tile() instead." I don't see a shift in this example, but I worry that if ggplot2 shifts the data, that could lead to squares being plotted at the wrong coordinates for real data. How do I avoid this shifting?
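One possible way around this, following the hint in the warning itself, is geom_tile(), which draws each cell at its own coordinates rather than assuming an evenly spaced grid. A minimal sketch, rebuilding d as above; the names d_ok, p_tile and p_raster are mine. The second variant keeps geom_raster() on the full grid and makes the missing cells transparent through na.value, which assumes that "omit" just means "do not draw gray squares".

```r
library(ggplot2)

set.seed(50)
d <- expand.grid(x = 1:100, y = 1:100)
d$v <- rnorm(nrow(d))
d[d$x %in% sample(d$x, 5), "v"] <- NA_real_

# Variant 1: drop the missing cells and draw the rest as tiles
d_ok <- d[!is.na(d$v), ]
p_tile <- ggplot(d_ok) + geom_tile(aes(x, y, fill = v))

# Variant 2: keep the regular grid, but render NA cells as transparent
p_raster <- ggplot(d) +
  geom_raster(aes(x, y, fill = v)) +
  scale_fill_continuous(na.value = "transparent")

print(p_tile)
print(p_raster)
```

Variant 2 never removes rows, so the grid stays evenly spaced and the warning has no reason to appear.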
I previously got help overlaying two graphs with different x-axes in the question "I have 2 graphs on R. They have different x axis, but similar trend profile. how do I overlay them on r?".
However, I am now trying to overlay 4 graphs. I tried to overlay them but they are not aligned.
I need assistance to overlay these four graphs.
My raw data is at the following link: https://drive.google.com/drive/folders/1ZZQAATkbeV-Nvq1YYZMYdneZwMvKVUq1?usp=sharing.
The code from my initial attempt was as follows:
# Store the plots under new names so the data frames are not overwritten
# (geom_line() has no pch argument, so it is dropped)
p_first <- ggplot(data = first,
aes(x, y)) +
geom_line()
p_second <- ggplot(data = second,
aes(x, y)) +
geom_line()
p_third <- ggplot(data = third,
aes(x, y)) +
geom_line()
p_fourth <- ggplot(data = fourth,
aes(x, y)) +
geom_line()
first$match <- first$x
second$match <- second$x - second$x[second$y == max(second$y)] + first$x[first$y == max(first$y)]
third$match <- third$x
fourth$match <- fourth$x
first$series = "first"
second$series = "second"
third$series = "third"
fourth$series = "fourth"
all_data <- rbind(first, second, third, fourth)
ggplot(all_data) + geom_line(aes(x = match, y, color = series)) +
scale_x_continuous(name = "X, arbitrary units") +
theme(axis.text.x = element_blank())
Would greatly appreciate the help indeed.
OP, I thought I would propose a solution for your question. OP has 4 datasets with x and y columns, and wants to align the peaks in each dataset so that they stack on top of one another. Here's what it looks like when we plot all datasets together:
p <- ggplot(mapping=aes(x=x, y=y)) + theme_bw() +
geom_line(data=first, aes(color="first")) +
geom_line(data=second, aes(color="second")) +
geom_line(data=third, aes(color="third")) +
geom_line(data=fourth, aes(color="fourth"))
The approach will be as follows:
1. Find the peak x value for each dataset.
2. Shift each dataset so that its peak x value matches a common reference (here, the smallest peak x value across the datasets).
3. Combine the datasets and plot them together, which respects Tidy Data principles.
Finding peaks and adjusting x values
To find the peaks, I like to use the findpeaks() function from the pracma library. You feed the function your dataset's y values (arranged by increasing x value), and the function will return a matrix with each row representing a "peak" and the columns give you height of peak in y value, index or row of dataset for the peak, where the peak begins, and where the peak ends. As an example, here's how we can apply this principle and what the result looks like on one of the datasets:
library(pracma)
library(dplyr) # for arrange()
first <- arrange(first, x) # arrange first by increasing x
findpeaks(first$y, sortstr = TRUE, npeaks=1)
[,1] [,2] [,3] [,4]
[1,] 1047.54 402 286 515
The argument sortstr= indicates we want the list of peaks sorted by "highest" first, and we are only interested in picking the first peak. In this case, we can see that 402 is the index of the x,y value in first for the peak. So we can access that x value via first[index,]$x.
The one concern we may have here is that this may not work for fourth, since the max value of y is actually not the peak of interest; however, if we run the function and test this out, using the findpeaks() method where we return the highest peak works fine: apparently the function does not find there is a "peak" at the right since it has an "up", but not a "down".
The function below handles all the steps to do what we need to: arranging, finding peaks, and adjusting peaks.
# find the minimum peak. We know it's from third, but here's
# how you do it if you don't "know" that
peaks_first <- findpeaks(first$y, sortstr = TRUE, npeaks=1)
peaks_second <- findpeaks(second$y, sortstr = TRUE, npeaks=1)
peaks_third <- findpeaks(third$y, sortstr = TRUE, npeaks=1)
peaks_fourth <- findpeaks(fourth$y, sortstr = TRUE, npeaks=1)
# minimum peak x value
peak_x <- min(c(first[peaks_first[2],]$x, second[peaks_second[2],]$x, third[peaks_third[2],]$x, fourth[peaks_fourth[2],]$x))
# function to use to fix each dataset
fix_x <- function(peak_x, dataset) {
dataset <- arrange(dataset, x)
d_peak <- findpeaks(dataset$y, sortstr = TRUE, npeaks=1)
d_peak_x <- dataset[d_peak[2],]$x
x_adj <- peak_x - d_peak_x
dataset$x <- dataset$x + x_adj
return(dataset)
}
# apply and fix each dataset
fix_first <- fix_x(peak_x, first)
fix_second <- fix_x(peak_x, second)
fix_third <- fix_x(peak_x, third)
fix_fourth <- fix_x(peak_x, fourth)
# combine datasets
fix_first$measure <- 'First'
fix_second$measure <- 'Second'
fix_third$measure <- 'Third'
fix_fourth$measure <- 'Fourth'
fixed <- rbind(fix_first, fix_second, fix_third, fix_fourth)
fixed$measure <- factor(fixed$measure, levels=c('First','Second','Third','Fourth'))
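The alignment idea can also be sketched without the OP's files. The example below uses synthetic single-peak curves (hypothetical data), and which.max() stands in for findpeaks(); that swap only works when the peak of interest is also the global maximum, which the text above notes is not the case for fourth. The names make_curve, peak_x, curves and aligned are made up for this illustration.

```r
# Synthetic bell-shaped curve with its peak at `center` (hypothetical data)
make_curve <- function(center) {
  x <- seq(0, 100, by = 0.5)
  data.frame(x = x, y = exp(-(x - center)^2 / 50))
}
curves <- list(a = make_curve(30), b = make_curve(45), c = make_curve(60))

# x position of the highest point of one curve
peak_x <- function(d) d$x[which.max(d$y)]

# Shift every curve so that its peak lands on the smallest peak position
target <- min(sapply(curves, peak_x))
aligned <- lapply(curves, function(d) {
  d$x <- d$x + (target - peak_x(d))
  d
})

# After shifting, all peaks share the same x position
sapply(aligned, peak_x)
```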
Plot Together
Now fixed contains all the data, and we can plot them all together:
ggplot(fixed, aes(x=x, y=y, color=measure)) + theme_bw() +
geom_line()
Alternate Plotting Methods
If you want to "stack" the lines on top of one another, this is what is known as a ridgeline plot. There are two methods I can show for how to create the ridgeline plot: faceting or using ggridges and geom_ridgeline(). I can demonstrate both.
# Using facets
ggplot(fixed, aes(x=x, y=y, color=measure)) + theme_bw() +
geom_line(show.legend = FALSE) +
facet_grid(measure~.)
Note I chose not to show the legend, since the strip text indicates this same information.
# Using ggridges and geom_ridgeline
library(ggridges)
ggplot(fixed, aes(x=x, y=measure, color=measure)) + theme_bw() +
geom_ridgeline(aes(height=y), fill=NA, scale=0.001)
When using geom_ridgeline(), you'll notice that the y= aesthetic becomes the column used for the stacking, and your original y value is instead mapped to the height= aesthetic. I also had to play around with scale=, since for discrete values, each measure will be treated as integers (1, 2, 3, 4). Your height= values are waaaay higher than that, so we have to scale them down so that they are around this range (scaled down by about 1000).
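The scale factor does not have to be found by trial and error. As a rough sketch, it can be derived from the largest height so that the tallest ridge rises a bit less than one unit, which is the spacing between adjacent discrete levels. The heights vector is a hypothetical stand-in for fixed$y, and the 0.9 target is an arbitrary choice of mine.

```r
# Hypothetical heights standing in for fixed$y
heights <- c(120, 480, 1047.54, 300)

# Discrete y levels sit 1 unit apart, so aim for the tallest ridge
# to rise about 0.9 units above its own baseline
ridge_scale <- 0.9 / max(heights)

ridge_scale                 # factor to pass as scale =
max(heights) * ridge_scale  # height of the tallest ridge after scaling
```

The result would then go into the call as geom_ridgeline(aes(height=y), fill=NA, scale=ridge_scale).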
facet_grid and facet_wrap have the scales parameter, which as far as I know allows each plot to adjust the scales of the x and/or y axis to the data being plotted. Since according to the grammar of ggplot x and y are just two among many aesthetics, and there's a scale for each aesthetic, I figured it would be reasonable to have the option of letting each aesthetic be free, but so far I didn't find a way to do it.
I was trying to set it in particular for size, since sometimes a variable lives in a different order of magnitude depending on the group I'm using for the facet, and having the same scale for every group blocks the possibility of seeing within-group variation.
A reproducible example:
set.seed(1)
x <- runif(20,0,1)
y <- runif(20,0,1)
groups <- c(rep('small', 10), rep('big', 10))
size_small <- runif(10,0,1)
size_big <- runif(10,0,1) * 1000
df <- data.frame(x, y, groups, sizes = c(size_small, size_big))
And an auxiliary function for plotting:
basic_plot <- function(df) ggplot(df) +
geom_point(aes(x, y, size = sizes, color = groups)) +
scale_color_manual(values = c('big' = 'red', 'small' = 'blue')) +
coord_cartesian(xlim=c(0,1), ylim=c(0,1))
If we plot the data as is, we get the following:
basic_plot(df)
Non faceted plot
The blue dots are relatively small, but there is nothing we can do.
If we add the facet:
basic_plot(df) +
facet_grid(~groups, scales = 'free')
Faceted plot
The blue dots continue being small. But I would like to take advantage of the fact that I'm dividing the data in two, and allow the size scale to adjust to the data of each plot. I would like to have something like the following:
library(gridExtra) # for grid.arrange()
plot_big <- basic_plot(df[df$groups == 'big',])
plot_small <- basic_plot(df[df$groups == 'small',])
grid.arrange(plot_big, plot_small, ncol = 2)
What I want
Can it be done without resorting to this kind of micromanaging, or a manual rescaling of the sizes like the following?
library(dplyr)
df %>%
group_by(groups) %>%
mutate(maximo = max(sizes),
sizes = as.numeric(scale(sizes, center = FALSE))) %>%
basic_plot() +
facet_grid(~groups)
I can manage to do those things, I'm just trying to see if I'm not missing another option, or if I'm misunderstanding the grammar of graphics.
Thank you for your time!
As mentioned, original plot aesthetics are maintained when calling facet_wrap. Since you need grouped graphs, consider base::by (the subsetting data frame function) wrapped in do.call:
do.call(grid.arrange,
args=list(grobs=by(df, df$groups, basic_plot),
ncol=2,
top="Grouped Point Plots"))
Should you need to share a legend, I always use this wrapper from @Steven Lockton's answer:
do.call(grid_arrange_shared_legend, by(df, df$groups, basic_plot))
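For comparison, the manual per-group rescaling mentioned in the question can be written in base R with ave(). This is only a sketch on simplified stand-in data (just the groups and sizes columns), and sizes_rel is a name I made up. Note that after rescaling, the size legend no longer shows the original units.

```r
set.seed(1)
groups <- c(rep("small", 10), rep("big", 10))
sizes <- c(runif(10, 0, 1), runif(10, 0, 1) * 1000)
df <- data.frame(groups, sizes)

# Rescale sizes to [0, 1] within each group; ave() returns the results
# in the original row order, so they can be assigned straight back
df$sizes_rel <- ave(df$sizes, df$groups,
                    FUN = function(s) (s - min(s)) / (max(s) - min(s)))

# Each group now spans the full 0..1 range
tapply(df$sizes_rel, df$groups, range)
```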
In R I have created a simple matrix of one column yielding a list of numbers with a set mean and a given standard deviation.
rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }
r <- rnorm2(100,4,1)
I now would like to plot how these numbers differ from the mean. I can do this in Excel as shown below:
But I would like to use ggplot2 to create a graph in R. In the Excel graph I have cheated by using a line graph, but it would be better if I could do this as columns. I have tried using a scatter plot, but I can't work out how to turn this into deviations from the mean.
Perhaps you want:
rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }
set.seed(101)
r <- rnorm2(100,4,1)
x <- seq_along(r) ## sets up a vector from 1 to length(r)
par(las=1,bty="l") ## cosmetic preferences
plot(x, r, col = "green", pch=16) ## draws the points
## if you don't want points at all, use
## plot(x, r, type="n")
## to set up the axes without drawing anything inside them
segments(x0=x, y0=4, x1=x, y1=r, col="green") ## connects them to the mean line
abline(h=4)
If you were plotting around 0 you could do this automatically with type="h":
plot(x,r-4,type="h", col="green")
To do this in ggplot2:
library("ggplot2")
theme_set(theme_bw()) ## my cosmetic preferences
ggplot(data.frame(x,r))+
geom_segment(aes(x=x,xend=x,y=mean(r),yend=r),colour="green")+
geom_hline(yintercept=mean(r))
Ben's answer using ggplot2 works great, but if you don't want to manually adjust the line width, you could do this:
# Half of Ben's data
rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }
set.seed(101)
r <- rnorm2(50,4,1)
x <- seq_along(r) ## sets up a vector from 1 to length(r)
# New variable for the difference between each value and the mean
value <- r - mean(r)
ggplot(data.frame(x, value)) +
# geom_bar anchors each bar at zero (which is the mean minus the mean)
geom_bar(aes(x, value), stat = "identity"
, position = "dodge", fill = "green") +
# but you can change the y-axis labels with a function, to add the mean back on
scale_y_continuous(labels = function(x) {x + mean(r)})
In base R it's quite simple; just do:
plot(r, col = "green", type = "l")
abline(4, 0)
You also tagged ggplot2, so in that case it will be a bit more complicated, because this approach builds a data frame and then melts it into long format.
library(ggplot2)
library(reshape2)
df <- melt(data.frame(x = 1:100, mean = 4, r = r), 1)
ggplot(df, aes(x, value, color = variable)) +
geom_line()
I've made a quick example data frame for this below. Basically I want to create a line plot with the average value as a line and a shaded band around that line representing the range of the values. I realise I'll likely have to find the row min/max, but I am unsure how to do this for rows and also don't know how I would go about plotting it.
TEST <- data.frame(a=c(1,5,7,2), b=c(3,8,2,5), c=c(6,10,2,1))
TEST$mean <- rowMeans(TEST)
Any help appreciated - Thanks
It is probably easily done with base R too, but here's a ggplot approach
Adding Min and Max and some index for the x axis
TEST <- transform(TEST, Min = pmin(a,b,c), Max = pmax(a,b,c), indx = seq_len(dim(TEST)[1]))
Plotting, using geom_ribbon
library(ggplot2)
ggplot(TEST) +
geom_line(aes(indx, mean), group = 1) +
geom_ribbon(aes(x = indx, ymax = Max, ymin = Min), alpha = 0.6, fill = "skyblue")
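If the number of columns is not fixed, the row-wise minimum and maximum can be computed without naming each column. A small sketch using do.call() with pmin() and pmax(); the names vals, row_min, row_max and row_mean are mine.

```r
TEST <- data.frame(a = c(1,5,7,2), b = c(3,8,2,5), c = c(6,10,2,1))

# Pick the value columns once; everything below works for any number of them
vals <- TEST[c("a", "b", "c")]

# pmin()/pmax() are vectorised across their arguments, so passing the
# columns as a list yields one minimum/maximum per row
row_min  <- do.call(pmin, vals)
row_max  <- do.call(pmax, vals)
row_mean <- rowMeans(vals)

data.frame(row_min, row_mean, row_max)
```

The three vectors can then be fed to geom_ribbon() and geom_line() in the same way as Min, Max and mean above.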
Just to add another option, here's a possible solution using only base R:
TEST <- data.frame(a=c(1,5,7,2), b=c(3,8,2,5), c=c(6,10,2,1))
# compute mean, min and max of rows
means <- rowMeans(TEST)
maxs <- apply(TEST,1,max)
mins <- apply(TEST,1,min)
# create x-coordinates
xcoords <- 1:nrow(TEST)
# create an empty plot to make space for everything
plot(x=c(min(xcoords),max(xcoords)),y=c(min(mins),max(maxs)),
type="n", main="Average",xlab="X",ylab="Y")
# add min-max ranges (color is DodgerBlue with 80/255 of opacity,
# for rgb values of colors see http://en.wikipedia.org/wiki/Web_colors)
rangecolor <- rgb(30,144,255,alpha=80,maxColorValue=255)
polygon(x=c(xcoords,rev(xcoords)),y=c(maxs,rev(means)),col=rangecolor,border=NA)
polygon(x=c(xcoords,rev(xcoords)),y=c(mins,rev(means)),col=rangecolor,border=NA)
# add average line (black)
meancolor <- "black"
lines(x=xcoords,y=means,col=meancolor)
For future reuse, you can also wrap it into a helper function:
plotLineWithRange <- function(x, yVal, yMin, yMax,
lineColor="Black", rangeColor="LightBlue",
main="", xlab="X", ylab="Y"){
if(missing(x)){
x <- 1:length(yVal)
}
stopifnot(length(yVal) == length(yMin) && length(yVal) == length(yMax))
plot(x=c(min(x),max(x)),y=c(min(yMin),max(yMax)),type="n", main=main,xlab=xlab,ylab=ylab)
polygon(x=c(x,rev(x)),y=c(yMax,rev(yVal)),col=rangeColor,border=NA)
polygon(x=c(x,rev(x)),y=c(yMin,rev(yVal)),col=rangeColor,border=NA)
lines(x=x,y=yVal,col=lineColor)
}
# usage example:
plotLineWithRange(yVal=means,yMin=mins,yMax=maxs,main="Average")