I have a couple of box-and-whisker plots in R. In both, the x-axis corresponds to one categorical variable while the grouping colours correspond to the other.
If I draw both plots with an untransformed y-axis, they are both fine. However, if I try to square-root transform the y-axis (using coord_trans(y = "sqrt")), one of the graphs is still fine while the other drops the lines corresponding to the median in most boxes (except those for which there are only two groups and where the boxes are therefore slightly wider; see "Numbers" 1 and 2 on the first plot). Furthermore, for the graph that does not draw properly, if I reduce the number of categories on the x-axis (making the boxes wider again), the median lines reappear.
Is this a bug with coord_trans (if so, how can I get around it) or a problem with my code?
Thank you very much for any suggestion.
library(car)
library(gplots)
library(plyr)
library(ggplot2)
library(gridExtra)
library(gdata)
Category <- factor(c(rep(1, times = 3240), rep(2, times = 2160)),
                   labels = c("A", "B"), levels = c(1, 2))
ID <- factor(rep(seq(from = 1, to = 45), each = 120))
Months <- factor(rep(seq(from = 1, to = 3), each = 40, times = 45),
                 labels = c("Jan", "Feb", "Mar"), levels = c(1:3))
Obs <- rnorm(5400, mean = 25, sd = 15)
Data <- data.frame(Category, ID, Months, Obs)
Data <- subset(Data, (Data$Category == "B") | !(Data$ID %in% c(1, 2)) |
                 (Data$Months %in% c("Jan", "Feb")))
for (j in 1:2) {
  sel <- which(Data$Category == levels(Data$Category)[j])
  Observ <- Data$Obs[sel]
  Month <- Data$Months[sel]
  Number <- droplevels(Data$ID[sel])
  Data_used <- data.frame(Number, Month, Observ)
  plot1 <- ggplot(Data_used, aes(Number, Observ)) +
    geom_boxplot(aes(fill = Month), na.rm = TRUE) +
    scale_y_continuous(breaks = c(0, 20, 40, 60, 80, 100), limits = c(0, 115)) +
    coord_trans(y = "sqrt")
  print(plot1)
}
Dennis is correct in his comment that scale_y_sqrt() will fix this. Because the median and quartiles are order statistics, it doesn't matter whether the data are transformed before or after they are calculated.
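For example, a minimal sketch of the fix, swapping the transformed coordinate for a transformed scale inside the loop above (same data and variables as in the question):
# scale_y_sqrt() transforms the data before the boxplot statistics are drawn,
# whereas coord_trans() transforms the already-computed geometry afterwards
plot1 <- ggplot(Data_used, aes(Number, Observ)) +
  geom_boxplot(aes(fill = Month), na.rm = TRUE) +
  scale_y_sqrt(breaks = c(0, 20, 40, 60, 80, 100), limits = c(0, 115))
print(plot1)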
Related
I'm making a common R ggplot2 graph: boxplots supplemented with the individual samples drawn as points by geom_jitter(), to show the position and number of samples in each group. Normally I have not noticed a problem, but with some recent data I've noticed substantial inaccuracy and variation in the y-position of the jittered points. The boxplot stays stable with respect to y, and so does geom_point() when used to show the same points the jitter is plotting. The error is probably not noticeable when you have many data points, but with 5-10 samples in a group it can produce an obvious discrepancy that may mislead you if you are not aware of the issue.
I first thought this may have always happened and I just hadn't noticed, so I generated some random numbers and made a ggplot with geom_jitter(), but at first the problem disappeared. Example data and plots for the normal and problematic cases are given below.
Data generation and plotting that worked as expected:
df <- data.frame("X" = rep("X", 5), "Y" = rnorm(5, 100, 30))
check the plot:
library(ggplot2)
ggplot(df, aes(X, Y)) + geom_boxplot() + geom_jitter(col = "red") + geom_point(col = "blue")
The red and blue dots are almost exactly aligned, and if you re-run the code five times and watch the plot in the RStudio preview, you see no variation in the jittered points' y-positions (only horizontal variation along the x-axis, as expected). In a problematic case like the one below, you quickly see y-axis variation, especially because it sometimes shifts the range of the y-axis.
With more variation in the random numbers, I found a visible difference between the red and blue points, which changed each time the same data were plotted:
df <- data.frame("X" = rep("X", 5), "Y" = rnorm(5, 100, 400))
The actual numbers to get this problem were:
X Y
1 X 610.78026
2 X -38.58905
3 X -196.00943
4 X 94.37797
5 X 415.58417
In my result, the lowest point, -196, was sometimes drawn at about -170 and sometimes at about -250, and the range of the y-axis shifted each time. It's similar to the problem I had with real data.
Further testing showed that more variance, or a larger range between points, did not by itself explain the variability of the jitter y-position; in some cases with more variance, geom_jitter() again produced near-perfect y-positions. So I wondered whether it might have something to do with how ggplot2 maps points into certain plot areas. I tried to test that by forcing ggplot to keep the same y-limits using ylim(-206, 621), but it failed to stabilise the plot for the problematic case above. It gives a mysterious, yet consistent, error: "Warning message: Removed 1 rows containing missing values (geom_point)."
(In the corresponding plot, the red jitter point for the 610.7 value was lost, despite there being enough pixel space in the preview window for about 10 more points between the blue point and the top of the graph. In another attempt, 2 jitter points were lost, because the bottom jitter sometimes goes past the lower limit.)
A roundabout solution would be to generate random x-positions myself while keeping the same y-value and group identity for each point, but it's not efficient. When non-numerical groups are used on x, I found the labels get a numerical position of 1. Adding the following to the last data frame gives the proper appearance: + geom_point(aes(x = rnorm(5, 1, .2), y = Y), col = "yellow") - but that would become quite cumbersome with many groups unless there is some way to automatically get the correct x-position for each group of boxplots.
To solve the problem, any input on what the cause of it is would be a great help.
It sounds like you do not want the default geom_jitter behaviour, which adds a uniformly distributed amount of noise separately to the x and y values before plotting - by default "40% of the resolution of the data: this means the jitter values will occupy 80% of the implied bins."
For a continuous variable like yours, "resolution" is "the smallest non-zero distance between adjacent values."
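That also explains the size of the effect here: with only five widely spaced values, the smallest gap between adjacent values is itself large, so 40% of it is a big vertical shift. A quick check with the question's numbers, using ggplot2's exported resolution() helper (the exact internal call is an implementation detail, so treat the figure as approximate):
library(ggplot2)

# the problematic y values from the question
y <- c(610.78026, -38.58905, -196.00943, 94.37797, 415.58417)

# "resolution" = smallest non-zero gap between adjacent sorted values;
# the default jitter amount is 40% of this in each direction
resolution(y, zero = FALSE)        # ~133
0.4 * resolution(y, zero = FALSE)  # ~53, so each point can move up to about +/-53 on y
This matches the roughly -170 to -250 spread observed for the -196 point.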
Try this:
geom_jitter(col = "red", height = 0) +
That will tell ggplot you want no noise applied to the y values before plotting.
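In full, for the question's example, the call might look like this (a sketch; only the height argument is new):
library(ggplot2)

df <- data.frame("X" = rep("X", 5), "Y" = rnorm(5, 100, 400))

# height = 0 keeps every y value exact; jitter is applied horizontally only
ggplot(df, aes(X, Y)) +
  geom_boxplot() +
  geom_jitter(col = "red", height = 0) +
  geom_point(col = "blue")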
Another approach would be to add noise yourself before the plotting step, giving you the ability to control its distribution and range specifically.
e.g. instead of having the jittering fill a uniform rectangle: ...
library(dplyr)
library(ggplot2)
tibble(x = rep(1:2, each = 1000),
       y = rep(3:4, each = 1000)) -> point_data
ggplot(point_data, aes(x, y)) + geom_jitter()
We could add whatever noise function we want. Here, for no particular reason, I make donuts around the real data, and compare that to the default jitter:
point_data %>%
mutate(angle = runif(2000, 0, 2*pi),
dist = rnorm(2000, 0.3, 0.05),
x2 = x + dist*cos(angle),
y2 = y + dist*sin(angle)) %>%
ggplot() +
geom_jitter(aes(x,y), color = "red", alpha = 0.2) +
geom_point(aes(x2,y2))
I am attempting to place individual points on a plot using ggplot2; however, as there are many points, it is difficult to gauge how densely packed they are. Here, there are two factors being compared against a continuous variable, and I want the colour of the points to reflect how closely packed they are with their neighbours. I am using the geom_point function in ggplot2 to plot the points, but I don't know how to feed it the right information on colour.
Here is the code I am using:
s1 <- rnorm(1000, 1, 10)
s2 <- rnorm(1000, 1, 10)
data <- data.frame(task_number = as.factor(rep(c(1, 2), each = 1000)),
                   S = c(s1, s2))
ggplot(data, aes(x = task_number, y = S)) + geom_point()
Which generates this plot:
However, I want it to look more like this image, but with one dimension rather than two (which I borrowed from this website: https://slowkow.com/notes/ggplot2-color-by-density/):
How do I change the colors of the first plot so it resembles that of the second plot?
I think the tricky thing about this is that you want to show the original values and evaluate the density at those values. I borrowed ideas from here to achieve that.
library(dplyr)
library(ggplot2)
data <- data %>%
  group_by(task_number) %>%
  # Use approxfun to interpolate the density back to
  # the original points
  mutate(dens = approxfun(density(S))(S))
ggplot(data, aes(x = task_number, y = S, colour = dens)) +
geom_point() +
scale_colour_viridis_c()
Result:
One could, of course, come up with a measure of proximity to neighbouring values for each value... However, wouldn't adjusting the transparency basically achieve the same goal (gauging how densely packed the points are)?
geom_point(alpha = 0.03)
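For instance, a minimal sketch with the question's data (the alpha value is just a starting point to tune by eye):
library(ggplot2)

# overplotted regions accumulate opacity, so dense areas read as darker
ggplot(data, aes(x = task_number, y = S)) +
  geom_point(alpha = 0.03)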
I've written an R script that loops through a data.frame making multiple complex plots, each of which includes a histogram. The problem is that the histograms often show a tall, uninformative peak at x = 0 or x = 1 which obscures the rest of the data, which is more informative. I have figured out that I can hide the tall peak by defining the limits of the x- and y-axes of each histogram, as seen in the code below - but what I really need is a way to define the y-axis limits so that they are optimized for the second-largest peak in each histogram.
Here's some code that simulates my data and plots histograms with different sorts of axis limits imposed:
require(ggplot2)
set.seed(5)
df = data.frame(matrix(sample(c(1:10), 1000, replace = TRUE,
                              prob = c(0.8, rep(0.01, 9))),
                       nrow = 100))
cols = names(df)
for (i in c(1:length(cols))) {
my_col = cols[i]
p1 = ggplot(df, aes_string(my_col)) + geom_histogram(bins = 10)
print(p1)
p2 = p1 + ggtitle(paste("Fixed X Limits", my_col)) + scale_x_continuous(limits = c(1,10))
print(p2)
p3 = p1 + ggtitle(paste("Fixed Y Limits", my_col)) + scale_y_continuous(limits = c(0,3))
print(p3)
p4 = p1 + ggtitle(paste("Fixed X & Y Limits", my_col)) + scale_y_continuous(limits = c(0,3)) + scale_x_continuous(limits = c(1,10))
print(p4)
}
With this simulated data I can hard-code y-limits and have a reasonable expectation that they will work well for all the histograms. With my real data, however, the size of the peaks varies wildly between the numerous histograms I am producing. I've tried defining the y-limit with various formulas based on descriptive statistics like the mean, median, and range, but nothing I've come up with works well in all cases.
If I could define the y-limit in relation to the second-tallest peak of the histogram, I would have something that was perfectly suited for each situation.
I am not sure how ggplot builds its histograms, but one method would be to grab the results from hist():
# plot = FALSE suppresses the base-graphics plot; $counts is on the same
# count scale that geom_histogram() uses for its y-axis
maxCounts <- sapply(df, function(i) max(hist(i, plot = FALSE)$counts))
# take the second highest peak:
myYlim <- rev(sort(maxCounts))[2]
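That limit could then be dropped into the question's loop, e.g. (a sketch; note that hist()'s default binning only approximates geom_histogram(bins = 10)):
p3 = p1 + ggtitle(paste("Second-Peak Y Limit", my_col)) +
  scale_y_continuous(limits = c(0, myYlim))
print(p3)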
I would process the data to determine the height you need.
Something along the lines of:
sort(table(cut(df$X1, breaks = 10)), decreasing = TRUE)[2]
Working from the inside out:
cut() bins the data (not really needed with integer data like yours, but probably needed with real data)
table() then counts how many values fall into each of those bins
sort() sorts the table from highest to lowest
[2] takes the 2nd-highest value
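To get a limit for every column automatically, the same one-liner can be wrapped in sapply() - a sketch reusing df and the loop variable my_col from the question:
# second-highest bin count for each column of df
ylims <- sapply(df, function(v)
  as.numeric(sort(table(cut(v, breaks = 10)), decreasing = TRUE)[2]))

# then, inside the question's loop:
p3 = p1 + scale_y_continuous(limits = c(0, ylims[[my_col]]))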
I've seen many examples of a density plot, but the density plot's y-axis is the probability density. What I am looking for is a line plot (like a density plot) whose y-axis contains counts (like a histogram).
I can do this in Excel, where I manually make the bins and the frequencies, make a bar histogram, and then change the chart type to a line - but I can't find anything similar in R.
I've checked both base graphics and ggplot2 and can't seem to find an answer. I understand that histograms are meant to be bars, but I think representing them as a continuous line makes more visual sense.
Using default R graphics (i.e. without installing ggplot) you can do the following, which might also make what the density function does a bit clearer:
# Generate some data
data <- rnorm(1000)
# Get the density estimate
dens <- density(data)
# Plot y-values scaled by number of observations against x values
plot(dens$x, length(data) * dens$y,
     type = "l", xlab = "Value", ylab = "Count estimate")
This is an old question, but I thought it might be helpful to post a solution that specifically addresses your question.
In ggplot2, you can plot a histogram and display the count with bars using:
df <- data.frame(x = rnorm(1000))
ggplot(df, aes(x)) +
  geom_histogram()
You can also plot a histogram and display the count with lines using a frequency polygon:
ggplot(df, aes(x)) +
  geom_freqpoly()
For more info --
ggplot2 reference
To adapt the example on the ?stat_density help page (the movies dataset now lives in the ggplot2movies package):
library(ggplot2movies)
m <- ggplot(movies, aes(x = rating))
# Standard density plot.
m + geom_density()
# Density plot with y-axis scaled to counts.
m + geom_density(aes(y = ..count..))
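As an aside, newer ggplot2 releases deprecate the ..count.. notation in favour of after_stat(); the equivalent call would be:
# equivalent to aes(y = ..count..) in current ggplot2
m + geom_density(aes(y = after_stat(count)))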
Although this is old, I thought the following might be useful.
Let's say you have a data set of 10,000 points, you believe they belong to a certain distribution, and you would like to plot the histogram of the actual data with the probability density line of the ideal distribution on top of it.
library(fitdistrplus)

noise <- 2
#
# the noise is tagged onto the end using runif,
# just to demo issues with real data and fitting;
# the subtraction causes the data to have some
# negative values, which must be addressed in
# the fit later on
#
noisylognorm <- rlnorm(10000,
                       meanlog = 0.25,
                       sdlog = 1) +
  (noise * runif(10000) - noise / 10)
#
# subset is used to remove the negative values,
# as the lognormal distribution needs positive values only
#
fitlnorm <- fitdist(subset(noisylognorm,
                           noisylognorm > 0),
                    "lnorm")
fitlnorm_density <- density(rlnorm(10000,
                                   meanlog = fitlnorm$estimate[1],
                                   sdlog = fitlnorm$estimate[2]))
hist(subset(noisylognorm,
noisylognorm < 25),
breaks = seq(-1, 25, 0.5),
col = "lightblue",
xlim = c(0, 25),
xlab = "value",
ylab = "frequency",
main = paste0("Log Normal Distribution\n",
"noise = ", noise))
lines(fitlnorm_density$x,
10000 * fitlnorm_density$y * 0.5,
type="l",
col = "red")
Note the * 0.5 in the lines() call. As far as I can tell, this is necessary to account for the width of the hist() bars: a bin's expected count is n * density * binwidth, and the bin width here is 0.5.
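If the bin width might change, it can be read back off the object hist() returns instead of being hard-coded - a small sketch along the same lines:
# hist() invisibly returns its binning; keep it to recover the bin width
h <- hist(subset(noisylognorm, noisylognorm < 25),
          breaks = seq(-1, 25, 0.5),
          col = "lightblue")

binwidth <- diff(h$breaks)[1]  # 0.5 here

# expected count per bin = n * density * bin width
lines(fitlnorm_density$x,
      10000 * fitlnorm_density$y * binwidth,
      col = "red")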
There is a very simple and fast way for count data.
First let's generate some dummy count data:
my.count.data = rpois(n = 10000, lambda = 3)
And then the plotting command (assuming you have called library(magrittr)):
my.count.data %>% table %>% plot
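Equivalently, without the pipe (plot() has a method for table objects that draws a vertical spike at each distinct value):
plot(table(my.count.data))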
I have data in percentages and would like to use ggplot to create a graph, but I cannot get it to work the way I would like. Since the data are very skewed, a simple stacked column doesn't work well because the really small values don't show up. Here is a sample set:
  Actual Predicted
a    0.5         5
b    9.5         5
c     90        90
On the left is an Excel plot and on the right is the R ggplot.
The problem is that in R the columns do not stack up to the same height.
Here is my R code:
a = c("a","b","c","a","b","c")
b = c("Actual","Actual","Actual","Predicted","Predicted","Predicted")
c = c(0.5,2.5,97,0.2,2.2,97.6)
c = c+1
dat = data.frame(Type=a, Case=b, Percentage=c)
ggplot(dat, aes(x=Case, y=Percentage, fill=Type)) + geom_bar(stat="identity") + scale_y_log10()
*In both Excel and R I add +1 to deal with values between 0 and 1, so the y-axis is slightly off.
If I use:
ggplot(dat, aes(x=Case, y=Percentage, fill=Type)) + geom_bar(stat="identity",position = "fill") + scale_y_log10()
The total heights match; however, the two blue portions do not match in size (they both represent 90%).
Just because two sets of numbers add up to the same value (103 in this case) doesn't mean the sums of their logs add up to the same value! When you stack the bars without "fill", you get different heights because the sums of the logs of the values differ. When you then scale both to the same height, you have to squash the blue boxes down by different rates, so they look different.
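A quick check with the question's +1-shifted values makes this concrete:
actual    <- c(1.5, 10.5, 91)  # 0.5, 9.5, 90 after the +1 shift
predicted <- c(6, 6, 91)       # 5, 5, 90 after the +1 shift

sum(actual); sum(predicted)                # both 103
sum(log10(actual)); sum(log10(predicted))  # ~3.16 vs ~3.52 - not equal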
The Excel bar chart is deliberately misleading. The left red bar is the same size as the blue bar above it, but represents a value about a tenth of the blue bar's. You can't make a bar chart on a log scale of proportions - it's just wrong.
There is a brilliant way to show small numbers without losing them or misrepresenting them. It's an amazing visualisation technique called 'writing the numbers in a table'.
I managed to get it to work like Excel. As Spacedman said, the plot is visually misleading but numerically correct: we want to compare the actual heights of the bar segments, whereas numerically you need to look at where each segment starts and ends on the y-axis. It's similar to bar charts whose y-axis doesn't start at zero. Here is an example.
I am not sure if I will use this method for visualizing my data, but I had to figure it out.
Here is the result:
Here is the code (I might clean it up into a function that can be called when assigning the y values in ggplot).
a = c("a","b","c","a","b","c")
b = c("Actual","Actual","Actual","Predicted","Predicted","Predicted")
c = c(0.5,9.5,90,5,5,90)
c = c+1
dat = data.frame(Type=a, Case=b, Percentage=c, Cumsum_L=c, Cumsum=c, Norm=c)

# For each segment, find the cumulative percentage at its bottom (Cumsum_L)
# and top (Cumsum) within its Case, then take the difference of the logs so
# the stacked segments land at the right log-scale positions
for(i in 1:length(dat$Percentage)){
  cumsum=0
  for(j in 1:i){
    if(dat$Case[j]==dat$Case[i]){
      cumsum=cumsum+(dat$Percentage[j])
    }
  }
  dat$Cumsum_L[i]=cumsum-dat$Percentage[i]
  dat$Cumsum[i]=cumsum
  if(dat$Cumsum_L[i]==0){
    dat$Cumsum_L[i]=1    # avoid log(0) for the bottom segment
  }
  dat$Norm[i] = log(dat$Cumsum[i])-log(dat$Cumsum_L[i])
}

# Label the axis in the original percentage units at log-scale positions
intervals = seq(from = 0, to = 100, by = 10)
intervals_log = log(intervals)
intervals_log[1]=0    # log(0) is -Inf; pin the first break to 0

ggplot(dat, aes(x=Case, y=Norm, fill=Type)) + geom_bar(stat="identity") +
  scale_y_continuous(name="Percent", breaks = intervals_log, labels = intervals)
*I also need to fix the end points +1 kinda thing.
**I also might be butchering maths.