I want to create a scatter plot, but the scale of the axes is messed up. I want it to have an increasing order, but in the plot y = 7 lies between y = 8.8 and y = 11.8.
It is a bit difficult to explain, so I uploaded a picture of the plot.
splot <- ggplot(df, aes(x_val, y_val)) + geom_point() + ggtitle(title) + xlab(label) + ylab(label)
df looks like this:
x_val y_val x_min x_max y_min y_max series
1 8.2640626 7.1605616 7.43370308695577 9.09442211304423 5.62731954407747 8.69380365592253 1IWG
2 10.0321728 8.8790822 8.43774194466477 11.6266036553352 6.97682936735609 10.7813350326439 1J4N
3 13.4994332665331 11.8238683366733 12.4200921869666 14.5787743460995 9.99549351881522 13.6522431545315 1KPL
Thanks for any help.
Use str(df) to examine your data frame df. If the variables you are trying to plot are factors, convert them with as.numeric(as.character()) so that they are interpreted as numbers; calling as.numeric() directly on a factor returns the internal level codes rather than the values. Alternatively, specify that they are numeric when you create your data set, depending on how the frame is defined.
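A minimal sketch of that conversion, using three y values from the question. It assumes the column was read in as a factor, which the question does not confirm since no str(df) output is shown:

```r
# Assumption: y_val ended up as a factor rather than a numeric column.
y_val <- factor(c("7.1605616", "8.8790822", "11.8238683366733"))

# Factor levels are sorted as text, so "11.8..." sorts before "7.1...".
print(levels(y_val))

# as.numeric() on a factor returns the internal level codes, not the values.
codes <- as.numeric(y_val)
print(codes)

# Going through character first recovers the real numbers.
values <- as.numeric(as.character(y_val))
print(values)
```

Text-sorted levels of this kind would explain an axis on which 7 sits out of numeric order.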
So my first ggplot2 box plot was just one big stretched out box plot, the second one was correct but I don't understand what changed and why the second one worked. I'm new to R and ggplot2, let me know if you can, thanks.
#----------------------------------------------------------
# This is the original ggplot that didn't work:
#----------------------------------------------------------
zSepalFrame <- data.frame(zSepalLength, zSepalWdth)
zPetalFrame <- data.frame(zPetalLength, zPetalWdth)
p1 <- ggplot(data = zSepalFrame, mapping = aes(x=zSepalWdth, y=zSepalLength, group = 4)) + #fill = zSepalLength
geom_boxplot(notch=TRUE) +
stat_boxplot(geom = 'errorbar', width = 0.2) +
theme_classic() +
labs(title = "Iris Data Box Plot") +
labs(subtitle ="Z Values of Sepals From Iris.R")
p1
#----------------------------------------------------------
# This is the new ggplot box plot line that worked:
#----------------------------------------------------------
bp = ggplot(zSepalFrame, aes(x=factor(zSepalWdth), y=zSepalLength, color = zSepalWdth)) + geom_boxplot() + theme(legend.position = "none")
bp
This is what the ggplot box plot looked like
I don't have your precise dataset, OP, but it seems to stem from assigning a continuous variable to your x axis, when boxplots require a discrete variable.
A continuous variable is something like a numeric column in a dataframe. So something like this:
x <- c(4,4,4,8,8,8,8)
Even though the variable x only contains 4's and 8's, R treats it as a numeric variable, which is continuous. This means that if you plot it on the x axis, ggplot has no issue with a value falling anywhere between 4 and 8, and positions it accordingly.
The other type of variable is called discrete, which would be something like this:
y <- c("Green", "Green", "Flags", "Flags", "Cars")
The variable y contains only characters. It must be discrete, since there is no such thing as something between "Green" and "Cars". If plotted on an x axis, ggplot will group things as either being "Green", "Flags", or "Cars".
The cool thing is that you can change a continuous variable into a discrete one. One way to do that is to factorize or force R to consider a variable as a factor. If you typed factor(x), you get this:
[1] 4 4 4 8 8 8 8
Levels: 4 8
The values in x are the same, but now there is no such thing as a number between 4 and 8 when x is a factor - it would just add another level.
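As a small base R sketch of the same idea, the snippet below turns x into a factor and computes one statistic per level, which is roughly the grouping a boxplot relies on. The vals vector is hypothetical data made up for illustration:

```r
x <- c(4, 4, 4, 8, 8, 8, 8)
f <- factor(x)

print(is.numeric(x))  # TRUE: x is continuous
print(is.factor(f))   # TRUE: f is discrete
print(levels(f))      # the only two groups that exist

# Hypothetical y values, one per element of x.
vals <- c(1, 2, 3, 10, 20, 30, 40)

# One median per factor level, i.e. one summary per group.
group_medians <- tapply(vals, f, median)
print(group_medians)
```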
That is in short why your box plot changes. Let's demonstrate with the iris dataset. First, an example like yours. Notice that I'm assigning x=Sepal.Length. In the iris dataset, Sepal.Length is numeric, so continuous.
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
geom_boxplot()
This is similar to yours. The reason is that the boxplot is drawn by grouping according to x and then calculating statistics on those groups. If a variable is continuous, there are no "groups", even if values are repeated (as in x above). One way to make groups is to force the data to be discrete, as in factor(Sepal.Length). Here's what it looks like when you do that:
ggplot(iris, aes(x=factor(Sepal.Length), y=Sepal.Width)) +
geom_boxplot()
The other way to have this same effect would be to use the group= aesthetic, which does what you might think: it groups according to that column in the dataset.
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, group=Sepal.Length)) +
geom_boxplot()
I'm currently trying to develop a surface plot that examines the results of the below data frame. I want to plot the increasing values of noise on the x-axis and the increasing values of mu on the y-axis, with the point estimate values on the z-axis. After looking at ggplot2 and ggplotly, it's not clear how I would plot each of these columns in surface or 3D plot.
df <- "mu noise0 noise1 noise2 noise3 noise4 noise5
1 1 0.000000 0.9549526 0.8908646 0.919630 1.034607
2 2 1.952901 1.9622004 2.0317115 1.919011 1.645479
3 3 2.997467 0.5292921 2.8592976 3.034377 3.014647
4 4 3.998339 4.0042379 3.9938346 4.013196 3.977212
5 5 5.001337 4.9939060 4.9917115 4.997186 5.009082
6 6 6.001987 5.9929932 5.9882173 6.015318 6.007156
7 7 6.997924 6.9962483 7.0118066 6.182577 7.009172
8 8 8.000022 7.9981131 8.0010066 8.005220 8.024569
9 9 9.004437 9.0066182 8.9667536 8.978415 8.988935
10 10 10.006595 9.9987245 9.9949733 9.993018 10.000646"
Thanks in advance.
Here's one way using geom_tile(). First, you will want to get your data frame into more of a Tidy format, where the goal is to have columns:
mu: nothing changes here
noise: need to combine your "noise0", "noise1", ... columns together, and
z: serves as the value of the noise and we will apply the fill= aesthetic using this column.
To do that, I'm using dplyr and tidyr's gather(), but there are other ways (melt() or pivot_longer() get you there too). I'm also adding some code to pull out just the number portion of the "noise" column names and then converting that to an integer, so that both the x and y axes are numeric:
# assumes that df is your data as data.frame
library(dplyr)  # %>% and select()
library(tidyr)  # gather() and separate()
df <- df %>% gather(key="noise", value="z", -mu)
df <- df %>% separate(col = "noise", into=c('x', "noise"), sep=5) %>% select(-x)
df$noise <- as.integer(df$noise)
Here's an example of how you could plot it, but aesthetics are up to you. I decided to also include geom_text() to show the actual values of df$z so that we can see better what's going on. Also, I'm using rainbow because "it's pretty" - you may want to choose a more appropriate quantitative comparison scale from the RColorBrewer package.
ggplot(df, aes(x=noise, y=mu, fill=z)) + theme_bw() +
geom_tile() +
geom_text(aes(label=round(z, 2))) +
scale_fill_gradientn(colors = rainbow(5))
EDIT: To answer OP's follow up, yes, you can also showcase this via plotly. Here's a direct transition:
library(plotly)
p <- plot_ly(
df, x= ~noise, y= ~mu, z= ~z,
type='mesh3d', intensity = ~z,
colors= colorRamp(rainbow(5))
)
p
Static image here:
A much more informative way to show this particular set of information is to look at how df$z deviates from df$mu by creating df$delta_z and plotting that instead (you can also plot it via ggplot() + geom_tile() as above):
df$delta_z <- df$z - df$mu
p1 <- plot_ly(
df, x= ~noise, y= ~mu, z= ~delta_z,
type='mesh3d', intensity = ~delta_z,
colors= colorRamp(rainbow(5))
)
Giving you this (static image here):
ggplot accepts data in the long format, which means that you need to melt your dataset using, for example, a function from the reshape2 package:
dfLong = melt(df,
id.vars = "mu",
variable.name = "noise",
value.name = "meas")
The resulting column noise contains entries such as noise0, noise1, etc. You can extract the numbers and convert to a numeric column:
dfLong$noise = with(dfLong, as.numeric(gsub("noise", "", noise)))
This converts your data to:
mu noise meas
1 1 0 1.0000000
2 2 0 2.0000000
3 3 0 3.0000000
...
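For readers without reshape2, the same wide-to-long step can be sketched in base R. This is an alternative illustration rather than the method used above; it takes the first three rows and only the noise0 and noise1 columns from the question:

```r
# Three-row subset of the question's data, same wide layout.
wide <- data.frame(mu     = 1:3,
                   noise0 = c(1, 2, 3),
                   noise1 = c(0, 1.952901, 2.997467))

# stack() puts the measurement columns into one 'values' column and
# records the source column name in the factor 'ind'.
stacked <- stack(wide[, c("noise0", "noise1")])

# Rebuild the long frame: repeat mu once per stacked column and
# strip the "noise" prefix to get a numeric noise level.
long <- data.frame(
  mu    = rep(wide$mu, times = 2),
  noise = as.numeric(gsub("noise", "", as.character(stacked$ind))),
  meas  = stacked$values
)
print(long)
```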
As per ggplot documentation:
ggplot2 can not draw true 3D surfaces, but you can use geom_contour(), geom_contour_filled(), and geom_tile() to visualise 3D surfaces in 2D.
So, for example:
ggplot(dfLong,
aes(x = noise,
y = mu,
fill = meas)) +
geom_tile() +
scale_fill_gradientn(colours = terrain.colors(10))
Produces:
I'm studying the example of coord_trans() of ggplot2:
library(ggplot2)
library(scales)
set.seed(4747)
df <- data.frame(a = abs(rnorm(26)),letters)
plot <- ggplot(df,aes(a,letters)) + geom_point()
plot + coord_trans(x = "log10")
plot + coord_trans(x = "sqrt")
I modified the code plot + coord_trans(x = "log10") as follows and got what I expected:
plot + scale_x_log10(breaks=trans_breaks("log10", function(x) 10^x),
labels=trans_format("log10", math_format(10^.x)))
I modified the code plot + coord_trans(x = "sqrt") as follows and got a strange x-axis:
plot + scale_x_sqrt(breaks=trans_breaks("sqrt", function(x) sqrt(x)),
labels=trans_format("sqrt", math_format(.x^0.5)))
How could I fix the problem?
I get why you said it was a strange / terrible axis. The documentation for trans_breaks even warns you about this in its first line:
These often do not produce very attractive breaks.
To make it less unattractive, I would use round(,2) so the axis labels have only 2 decimal places instead of the default 8 or 9, which clutter the axis. Then I would set a sensible range, say 0 to 5 in your case (c(0,5)).
Finally, you can specify the number of ticks for your axis using n in the trans_breaks call.
So putting it together, here's how you can format your x-axis and its tick label in the scale_x_sqrt(x) format:
plot <- ggplot(df,aes(a,letters)) + geom_point()
plot + scale_x_sqrt(breaks=trans_breaks("sqrt", function(x) round(sqrt(x),2), n=5)(c(0, 5)))
Produces this:
The breaks themselves come from pretty(), a lesser-known base R function; as far as I can tell, trans_breaks() calls it on the transformed range (here sqrt(c(0, 5))) rather than on c(0,5) directly. From the documentation, pretty does the following:
Compute a sequence of about n+1 equally spaced "round" values which cover the range of the values in x.
For example, pretty(c(0,5)) simply produces [1] 0 1 2 3 4 5.
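To see what pretty() contributes, here is a base R sketch. The helper sketch_breaks() is hypothetical. It rests on an assumption about the scales internals, not on anything stated in this thread: that trans_breaks() transforms the range, calls pretty(), then applies the supplied second function:

```r
# pretty() returns "round" values covering a range.
print(pretty(c(0, 5)))

# Hypothetical helper (not part of scales): transform the range,
# pick pretty values there, then map them back with `inv`.
sketch_breaks <- function(trans, inv, n = 5) {
  function(x) inv(pretty(trans(x), n))
}

# Using squaring as the inverse of sqrt gives breaks that are
# evenly spaced on the square-root scale.
breaks <- sketch_breaks(sqrt, function(v) v^2)(c(0, 25))
print(breaks)
```

Under that assumption, the second argument is meant to undo the transformation, which is one possible reason the version with sqrt() in both places looks odd.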
You can fine-tune the axis further by changing the parameters. Here the code uses 3 decimal places (round(x,3)) and asks for 3 ticks (n=3):
plot <- ggplot(df,aes(a,letters)) + geom_point()
plot + scale_x_sqrt(breaks=trans_breaks("sqrt", function(x) round(sqrt(x),3), n=3)(c(0, 5)))
Produces this:
EDIT based on OP's additional comments:
To get round integer values, floor() or round(x,0) works, so the following code:
plot <- ggplot(df,aes(a,letters)) + geom_point()
plot + scale_x_sqrt(breaks=trans_breaks("sqrt", function(x) round(sqrt(x),0), n=5)(c(0, 5)))
Produces this:
I was happy to discover that ggplot has binned scatter plots, which are useful for exploring and visualizing relationships in large data. Yet the top bin appears to misbehave. Here's an example: All bin averages are roughly linearly aligned, as they should be, but the top one is off on both dimensions:
the code:
library(ggplot2)
# simulate an example of linear data
set.seed(1)
N <- 10^4
x <- runif(N)
y <- x + rnorm(N)
dt <- data.frame(x=x, y=y)
ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point')
Is there a simple workaround (and where should this be posted)?
stat_summary_bin is actually excluding the two rows with the largest x-values from the bins, and those two values are ending up with bin = NA. The mean of those two excluded values is plotted as a separate bin to the right of the regular bins. First, I show what is going wrong in your original plot then I provide a workaround to get the desired behavior.
What's going wrong in the original plot
To see what's going wrong in your original plot, create a plot with two calls to stat_summary_bin where we calculate the mean of each bin and the number of values in each bin. Then use ggplot_build to capture all of the internal data that ggplot generated to create the plot.
p1 = ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y=mean, bins=10, size=5, geom='text',
aes(label=..y..)) +
stat_summary_bin(fun.y=length, bins=10, size=5, geom='text',
aes(label=..y.., y=0))
p1b = ggplot_build(p1)
Now let's look at the data for the mean and length layers, respectively. I've printed only bins 9 through 11 (the three right-most bins) for brevity. Bin 11 is the "extra" bin and you can see that it contains only 2 values (its label is 2 in the second table below), and that the mean of those two values is -0.1309998, as can be seen in the first table below.
p1b$data[[2]][9:11,c(1,2,4,6,7)]
label bin y x width
9 0.8158320 9 0.8158320 0.8498505 0.09998242
10 0.9235531 10 0.9235531 0.9498329 0.09998242
11 -0.1309998 11 -0.1309998 1.0498154 0.09998244
p1b$data[[3]][9:11,c(1,2,4,6,7)]
label bin y x width
9 1025 9 1025 0.8498505 0.09998242
10 1042 10 1042 0.9498329 0.09998242
11 2 11 2 1.0498154 0.09998244
Which two values are those? It looks like they come from the two rows with the highest x values in the original data frame:
mean(dt[order(-dt$x), "y"][1:2])
[1] -0.1309998
I'm not sure how stat_summary_bin is managing to bin the data such that the two highest x values are excluded.
Workaround to get the desired behavior
A workaround is to summarize the data yourself, so you'll have complete control over how the bins are created. The example below uses your original code and then plots pre-summarized values in blue, so you can compare the behavior. I've included the dplyr package so that I can use the chaining operator (%>%) to summarize the data on the fly:
library(dplyr)
ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point') +
geom_point(data=dt %>%
group_by(bins=cut(x,breaks=seq(min(x),max(x),length.out=11), include.lowest=TRUE)) %>%
summarise(x=mean(x), y=mean(y)),
aes(x,y), size=3, color="blue") +
theme_bw()
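The manual binning in that workaround depends on include.lowest=TRUE. A small base R sketch with made-up values (not the simulated data above) shows what the argument changes and how per-bin means can be computed without dplyr:

```r
# Hypothetical toy data, chosen so the bin means are easy to check.
x <- c(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
y <- c(0.1, 0.3, 0.2, 0.7, 0.9, 1.1)

# Two equal-width bins spanning the full range of x.
breaks <- seq(min(x), max(x), length.out = 3)

# By default intervals are (a, b], so the minimum of x falls outside.
open_bins <- cut(x, breaks = breaks)
print(sum(is.na(open_bins)))

# include.lowest = TRUE closes the first interval, so nothing is dropped.
closed_bins <- cut(x, breaks = breaks, include.lowest = TRUE)
print(sum(is.na(closed_bins)))

# Mean of y within each bin.
bin_means <- tapply(y, closed_bins, mean)
print(bin_means)
```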
@eipi10 has already explained why this is happening.
Perhaps the simplest solution is to add a scale_x_continuous with limits to your plot, so that the extra "NA" bin is excluded from the plot.
ggplot(dt, aes(x, y)) +
geom_point(alpha = 0.1, size = 0.01) +
stat_summary_bin(fun.y='mean', bins=10, color='orange', size=5, geom='point') +
scale_x_continuous(limits = range(x))
This should be acceptable with large data such as in the example, where the small number of data points excluded from the bins will not significantly bias the statistics. However, if missing a couple of data points from the summary statistics matters, then the solution provided by @eipi10 will be better.
In short:
I would like to have separate legends for each "panel" of a two-panel plot made using facet_wrap. Using facet_wrap(scales="free") works fine when I want different axis scales, but not for the size of points.
Background:
I have data for several samples with three measurements each: x, y, and z. Each sample is from either class 1 or class 2. x and y have the same distributions in each class. However, all the z measurements for class 1 are less than 1.0; z measurements for class 2 range from 0 to 100.
Where I'm stuck:
Plot x and y on the x and y axes, respectively. Make the area of each point proportional to its z value.
d = matrix(c(runif(100),runif(20)*100),ncol=3)
e = data.frame( gl(2,20), d )
colnames(e) = c("class","x","y","z")
ggplot( data = e, aes(x=x, y=y, size=z) ) +
geom_point() + scale_area() +
facet_wrap( ~ class, ncol=1, scales="free" )
Problem:
Note that the dots on the first panel are difficult to see because they are on the very low end of the scale used for the single legend which ranges from 0 to 100. Is it even possible to have two separate legends (each with a different range) or should I make two plots and combine them with viewports?
A solution using grid.arrange. I've left in the call to facet_wrap so the strip.text remains. You could easily remove this.
# plot for class 1
c1 <- ggplot(e[e$class==1,], aes(x=x,y=y,size=z)) + geom_point() + scale_area() + facet_wrap(~class)
# plot for class 2
c2 <- c1 %+% e[e$class==2,]
library(gridExtra)
grid.arrange(c1,c2, ncol=1)
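Splitting the data helps because each subset has its own z range, so each plot can build its size legend from that range alone. A quick base R check on data simulated as in the question (the seed is an arbitrary addition for reproducibility):

```r
set.seed(42)  # arbitrary seed, not part of the original question

# Same construction as in the question: 40 rows, z is the third column.
d <- matrix(c(runif(100), runif(20) * 100), ncol = 3)
e <- data.frame(gl(2, 20), d)
colnames(e) <- c("class", "x", "y", "z")

# Range of z within each class; class 1 stays at or below 1.
z_ranges <- tapply(e$z, e$class, range)
print(z_ranges)
```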