How to annotate text on individual facet in ggplot2 - r

This question is a follow-up to: Annotating text on individual facet in ggplot2
I was trying out the code provided in the accepted answer and got something that was strangely different than the result provided. Granted the post is older and I'm using R 3.5.3 and ggplot2 3.1.0, but what I'm getting doesn't seem to make sense.
library(ggplot2)
p <- ggplot(mtcars, aes(mpg, wt)) + geom_point()
p <- p + facet_grid(. ~ cyl)
#below is how the original post created a dataframe for the text annotation
#this will produce an extra facet in the plot for reasons I don't know
ann_text <- data.frame(mpg = 15,wt = 5,lab = "Text",cyl = factor(8,levels = c("4","6","8")))
p+geom_text(data = ann_text,label = "Text")
This is the code from the accepted answer in the linked question. For me it produces the following graph with an extra facet (i.e., an additional category, 3, seems to have been added to cyl):
#below is an alternative version that produces the correct plot, that is,
#without any extra facets.
ann_text_alternate <- data.frame(mpg = 15,wt = 5,lab = "Text",cyl = 8)
p+geom_text(data = ann_text_alternate,label = "Text")
This gives me the correct graph:
Anybody have any explanations?

What is going on is a factors issue.
First, you facet by cyl, a column in dataset mtcars. This is an object of class "numeric" taking 3 different values.
unique(mtcars$cyl)
#[1] 6 4 8
Then, you create a new dataset, the dataframe ann_text. But you define cyl as an object of class "factor". And what is in this column can be seen with str.
str(ann_text)
#'data.frame': 1 obs. of 4 variables:
# $ mpg: num 15
# $ wt : num 5
# $ lab: Factor w/ 1 level "Text": 1
# $ cyl: Factor w/ 3 levels "4","6","8": 3
R stores factors as integer codes starting at 1, so level "8", being the third level, is stored as the integer 3.
So when you combine both datasets, there are 4 values for cyl, the original numbers 4, 6 and 8 plus the new number, 3. Hence the extra facet.
This is also why the alternative works: in dataframe ann_text_alternate, column cyl is a numeric variable taking one of the already existing values.
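The integer coding described above can be checked directly in base R. This is a minimal sketch that reuses the ann_text definition from the question; the names label and code are introduced here only for illustration.

```r
# Same annotation data frame as in the question
ann_text <- data.frame(mpg = 15, wt = 5, lab = "Text",
                       cyl = factor(8, levels = c("4", "6", "8")))

# The label of the single value is the string "8" ...
label <- as.character(ann_text$cyl)

# ... but the stored integer code is its position among the levels
code <- as.integer(ann_text$cyl)

print(label)
print(code)
```

The code, not the label, is what shows up as the unexpected fourth facet value.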
Another way of making it work would be to coerce cyl to factor when faceting. Note that
levels(factor(mtcars$cyl))
#[1] "4" "6" "8"
With the same three levels on both sides, the dataframe ann_text no longer adds a 4th value. Start plotting the graph as in the question
p <- ggplot(mtcars, aes(mpg, wt)) + geom_point()
p <- p + facet_grid(. ~ factor(cyl))
and add the text.
p + geom_text(data = ann_text, label = "Text")
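A further option, sketched here on the assumption that working on a copy of mtcars is acceptable, is to store cyl as a factor with identical levels in both data frames, so that the facet formula can stay as . ~ cyl. The names mtcars_f, ann_text_f and p_f are made up for this example.

```r
library(ggplot2)

# Copy of mtcars with cyl stored as a factor (mtcars itself is untouched)
mtcars_f <- transform(mtcars, cyl = factor(cyl, levels = c(4, 6, 8)))

# Annotation data frame whose cyl column uses the same levels
ann_text_f <- data.frame(mpg = 15, wt = 5, lab = "Text",
                         cyl = factor(8, levels = c(4, 6, 8)))

# Both layers now agree on the type and levels of the faceting variable
same_levels <- identical(levels(mtcars_f$cyl), levels(ann_text_f$cyl))

p_f <- ggplot(mtcars_f, aes(mpg, wt)) +
  geom_point() +
  facet_grid(. ~ cyl) +
  geom_text(data = ann_text_f, aes(label = lab))
```

Printing p_f should then place the text only in the facet for 8 cylinders.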

Related

Box plots not appearing properly in RStudio

I am creating box plots within R, however, they are appearing incorrectly. My data is based off of German Credit Dataset on Kaggle.
My code with two different attributes trying to be tested:
data %>%
ggplot(aes(x = Creditability, y = Purpose, fill = Creditability)) +
geom_boxplot() +
ggtitle("Creditability vs Purpose")
data %>%
ggplot(aes(x = Creditability, y = Account.Balance, fill = Creditability)) +
geom_boxplot() +
ggtitle("Creditability vs Account Balance")
I've tried a few different attributes, but each one results in the same error.
Edited info: Is it because the attributes have too much information? I have split the sample into test (300) vs train (700) and I am currently using train. Would it simply be because there's too much info?
Edit picture: [image: Factors]
Edit for graph error: [image: Error]
As others have explained in the comments, you cannot show boxplots where the y axis is set to be a factor. Factors are by their nature discrete variables, even if the levels are named as numbers. In order to utilize the stat function for the boxplot geom, you need the y axis to be continuous and the x axis to be discrete (or able to be separated into discrete values via the group= aesthetic).
Let me demonstrate with the mtcars dataset built into R (it ships with the datasets package):
library(ggplot2)
ggplot(mtcars, aes(x=factor(carb), y=mpg)) + geom_boxplot()
Here we can draw boxplots because the x aesthetic is forced to be discrete (via factor(carb)), while the y axis uses mpg, which is a numeric column in the mtcars dataset.
If you set both carb and mpg to be factors, you get something that should look pretty similar to what you're seeing:
ggplot(mtcars, aes(x=factor(carb), y=factor(mpg))) + geom_boxplot()
In your case, all the columns in your dataset are factors. If they are factors whose levels can be coerced to numbers, you can turn them into continuous vectors using as.numeric(levels(column_name)[column_name]). Alternatively, you can use as.numeric(as.character(column_name)). Here's what it looks like to first convert the mtcars$mpg column to a factor of numeric values, and then back to numeric via this method.
df <- mtcars
# convert to a factor
df$mpg <- factor(df$mpg)
# back to numeric!
df$mpg <- as.numeric(levels(df$mpg)[df$mpg])
# this plot looks like it did before when we did the same with mtcars
ggplot(df, aes(x=factor(carb), y=mpg)) + geom_boxplot()
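The conversion above only makes sense when every level label really is a number. As a rough sketch, a small guard can check that first; factor_to_numeric is a hypothetical helper written for this example, not part of base R or ggplot2, and the sample factors are made up.

```r
# Hypothetical helper: convert a factor to numeric only when every
# non-missing label parses as a number; otherwise stop with a message.
factor_to_numeric <- function(f) {
  values <- suppressWarnings(as.numeric(as.character(f)))
  bad <- !is.na(f) & is.na(values)
  if (any(bad)) {
    stop("non-numeric labels: ",
         paste(unique(as.character(f)[bad]), collapse = ", "))
  }
  values
}

f_num <- factor(c("21", "3", "10.5"))
converted <- factor_to_numeric(f_num)  # the labels, as numbers
codes <- as.numeric(f_num)             # the integer codes, NOT the labels

# A factor with text labels is rejected instead of silently becoming NA
f_txt <- factor(c("car", "furniture"))
result <- tryCatch(factor_to_numeric(f_txt), error = function(e) "error")
```

Comparing converted with codes shows why calling as.numeric() directly on a factor is not a substitute for the two methods described above.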
So, for your case, do this two step process:
data$Purpose <- as.numeric(levels(data$Purpose)[data$Purpose])
data %>%
ggplot(aes(x = Creditability, y = Purpose, fill = Creditability)) +
geom_boxplot() +
ggtitle("Creditability vs Purpose")
That should work. You can follow in a similar fashion for your other variables.

Is there a way to reorder the y axis values in decreasing order while keeping the x axis values the same in R?

I have a set of numeric data that corresponds with specific dates and I would like to make a line chart showing the changes. When I enter my code, I end up with a graph that is not ordered on the y axis so the numbers are all over the place. I would like the graph to range from 0-15 with the values associated with the dates just as points.
ggplot(TomFrostDO, aes(TomFrostDO$Date, TomFrostDO$Surface)) +
geom_point(aes(x= TomFrostDO$Date, y=TomFrostDO$Surface), color="red") +
geom_line(aes(y=TomFrostDO$Surface, color="red"))
I will try my luck (if I'm wrong I will delete this post) ... did you get this kind of graph ?
If so, it is because your surface data are not numerical values (I bet that when you read your file into R, these values get converted to a factor). You can check this by doing:
> str(data)
'data.frame': 13 obs. of 2 variables:
$ surface: Factor w/ 13 levels "0.69","1.9","2.05",..: 6 5 2 3 7 13 1 12 4 8 ...
$ Date : Date, format: "2019-05-29" "2019-06-10" "2019-07-08" "2019-07-11" ...
To solve this issue, you can convert your variable to numeric by doing:
data$surface = as.numeric(as.vector(data$surface))
> str(data)
'data.frame': 13 obs. of 2 variables:
$ surface: num 6.66 5.31 1.9 2.05 6.72 ...
$ Date : Date, format: "2019-05-29" "2019-06-10" "2019-07-08" "2019-07-11" ...
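Why the column was read as a factor is not known from the question. One common cause, assumed here purely for illustration, is a stray text entry in an otherwise numeric column. The sketch below uses made-up file contents and shows how declaring that entry as missing at read time avoids the later conversion.

```r
# Hypothetical file contents: one Surface entry is the text "n/a"
csv_text <- "Date,Surface
2019-05-29,6.66
2019-06-10,n/a
2019-07-08,1.90"

# With the stray text entry, the whole column is read as a factor
raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)
print(class(raw$Surface))

# Declaring the marker as missing lets read.csv parse the column as numbers
clean <- read.csv(text = csv_text, na.strings = "n/a")
print(class(clean$Surface))
```

Checking str() right after reading the file, as shown above, is the quickest way to see which case applies.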
And now, if you are plotting these values, you can do:
library(ggplot2)
ggplot(data, aes( x = Date, y = surface))+
geom_point(color = "red")+
geom_line(color = "red")+
scale_x_date(date_breaks = "months",
date_labels = "%b%y")+
scale_y_continuous(limits = c(0,15), breaks = c(0,5,10,15))
Is it what you are looking for ?
If not, can you please add a reproducible example of your dataset and also an image of the current plot you are getting.
BTW, when you write an expression for ggplot(...), you do not need to use $ inside aes(): because you passed data as the first argument, ggplot knows where to look for the column names.
So simply ggplot(data = data, aes(x = Date, y = surface)) is enough. There is also no need to repeat the aesthetics in geom_point or geom_line, because they inherit what you passed to ggplot(...).
You only need to pass these kinds of arguments to a geom_ function if you want to plot data that you did not define in the first ggplot(...) call (for example, a second dataset or a second y axis).
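As an illustration of that last case, here is a small sketch in which one layer receives its own dataset while the other layers inherit everything from ggplot(). Both data frames (main and other) are made up for the example and are not the asker's data.

```r
library(ggplot2)

# Main data: a small made-up example
main <- data.frame(Date = as.Date("2019-05-29") + c(0, 12, 40),
                   surface = c(6.66, 5.31, 1.90))

# Hypothetical second dataset drawn on top of the first
other <- data.frame(Date = as.Date("2019-05-29") + c(5, 30),
                    surface = c(10, 12))

p <- ggplot(main, aes(x = Date, y = surface)) +
  geom_point(color = "red") +               # inherits data and aes
  geom_line(color = "red") +                # inherits data and aes
  geom_point(data = other, color = "blue")  # own data, inherited aes
```

Because other has the same column names, the third layer can reuse the aesthetics from ggplot() and only needs the data argument.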
Data
data = data.frame(surface = c(6.66,5.31,1.90,2.05,6.72,13.65,0.69,12.80,3.83,7.57,9.33,11.63,9.82),
Date = as.Date(c("29/05/2019","10/06/2019","08/07/2019","11/07/2019","22/07/2019","5/08/2019",
"19/08/2019","22/08/2019","04/09/2019","16/09/2019","30/09/2019","14/10/2019","14/11/2019"), format = "%d/%m/%Y"))

force boxplots from geom_boxplot to constant width

I'm making a boxplot in which x and fill are mapped to different variables, a bit like this:
ggplot(mpg, aes(x=as.factor(cyl), y=cty, fill=as.factor(drv))) +
geom_boxplot()
As in the example above, the widths of my boxes come out differently at different x values, because I do not have all possible combinations of x and fill values.
I would like for all the boxes to be the same width. Can this be done (ideally without manipulating the underlying data frame, because I fear that adding fake data will cause me confusion during further analysis)?
My first thought was
+ geom_boxplot(width=0.5)
but this doesn't help; it adjusts the width of the full set of boxplots for a given x factor level.
This post almost seems relevant, but I don't quite see how to apply it to my situation. Using + scale_fill_discrete(drop=FALSE) doesn't seem to change the widths of the bars.
The problem is due to some cells of factor combinations being not present. The number of data points for all combinations of the levels of cyl and drv can be checked via xtabs:
tab <- xtabs( ~ drv + cyl, mpg)
tab
# cyl
# drv 4 5 6 8
# 4 23 0 32 48
# f 58 4 43 1
# r 0 0 4 21
There are three empty cells. I will add fake data to override the visualization problems.
Check the range of the dependent variable (y-axis). The fake data needs to be out of this range.
range(mpg$cty)
# [1] 9 35
Create a subset of mpg with the data needed for the plot:
tmp <- mpg[c("cyl", "drv", "cty")]
Create an index for the empty cells:
idx <- which(tab == 0, arr.ind = TRUE)
idx
# row col
# r 3 1
# 4 1 2
# r 3 2
Create three fake lines (with -1 as value for cty):
fakeLines <- apply(idx, 1,
function(x)
setNames(data.frame(as.integer(dimnames(tab)[[2]][x[2]]),
dimnames(tab)[[1]][x[1]],
-1),
names(tmp)))
fakeLines
# $r
# cyl drv cty
# 1 4 r -1
#
# $`4`
# cyl drv cty
# 1 5 4 -1
#
# $r
# cyl drv cty
# 1 5 r -1
Add the rows to the existing data:
tmp2 <- rbind(tmp, do.call(rbind, fakeLines))
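The same fake rows can also be built without apply(). The following is only a sketch of an equivalent, vectorized construction; the names fake and tmp2_alt are introduced here and do not appear in the answer.

```r
library(ggplot2)  # provides the mpg dataset

tab <- xtabs(~ drv + cyl, mpg)
idx <- which(tab == 0, arr.ind = TRUE)  # column 1 = row index, column 2 = column index

# One row per empty drv/cyl cell, with cty set outside the observed range
fake <- data.frame(cyl = as.integer(colnames(tab)[idx[, 2]]),
                   drv = rownames(tab)[idx[, 1]],
                   cty = -1)

# Append the fake rows to the columns needed for the plot
tmp2_alt <- rbind(as.data.frame(mpg[c("cyl", "drv", "cty")]), fake)
```

tmp2_alt can then be plotted in the same way as tmp2.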
Plot:
library(ggplot2)
ggplot(tmp2, aes(x = as.factor(cyl), y = cty, fill = as.factor(drv))) +
geom_boxplot() +
coord_cartesian(ylim = c(min(tmp$cty) - 3, max(tmp$cty) + 3))
# The axis limits have to be changed to suppress displaying the fake data.
You can now use the position_dodge() function with preserve = "single":
ggplot(mpg, aes(x=as.factor(cyl), y=cty, fill=as.factor(drv))) +
geom_boxplot(position = position_dodge(preserve = "single"))
Just use the facet_grid() function; it makes things a lot easier to visualize:
ggplot(mpg, aes(x=as.factor(drv), y=cty, fill=as.factor(drv))) +
geom_boxplot() +
facet_grid(.~cyl)
See how I switch from x=as.factor(cyl) to x=as.factor(drv).
Once you have done this you can always change the way you want the strips to be displayed and remove margins between the panels... it can easily look like your expected display.
By the way, you don't even need as.factor() here: drv is a character column, which ggplot() already treats as discrete. Dropping it improves the readability of your code.

facet_wrap with ggparcoord() from GGally

I want to plot multiple parallel coordinate plots from one dataset. Currently I have a working solution with split and l_ply which produces 4 ggplot2 objects. I would like to solve this with facet_wrap or facet_grid to have a more compact layout and a single legend. Is this possible?
With a normal ggplot2 object (boxplot) facet_wrap works perfectly. With the GGally function ggparcoord() I get the error:
Error in layout_base(data, vars, drop = drop) :
At least one layer must contain all variables used for facetting
What am I doing wrong?
require(GGally)
require(ggplot2)
# Example Data
x <- data.frame(var1=rnorm(40,0,1),
var2=rnorm(40,0,1),
var3=rnorm(40,0,1),
type=factor(rep(c("x", "y"), length.out=40)),
set=factor(rep(c("A","B","C","D"), each=10))
)
# this works
p1 <- ggplot(x, aes(x=type, y=var1, group=type)) + geom_boxplot()
p1 <- p1 + facet_wrap(~ set)
p1
# this does not work
p2 <- ggparcoord(x, columns=1:3, groupColumn=4)
p2 <- p2 + facet_wrap(~ set)
p2
Any suggestions are appreciated! Thank you!
You can't use facet_wrap() directly with ggparcoord(), because this function keeps only those columns of the data that are specified in the call. This can be seen by looking at the data element of p2: there is no column named set.
p2 <- ggparcoord(x, columns=1:3, groupColumn=4)
head(p2$data)
type .ID anyMissing variable value
1 x 1 FALSE var1 0.95473093
2 y 2 FALSE var1 -0.05566205
3 x 3 FALSE var1 2.57548872
4 y 4 FALSE var1 0.14508261
5 x 5 FALSE var1 -0.92022584
6 y 6 FALSE var1 -0.05594902
To get the same type of plot with faceting, first add a new ID column (just the case numbers) to the existing data frame, and then reshape this data frame to long format.
library(reshape2) # provides melt()
x$ID<-1:40
df.x<-melt(x,id.vars=c("set","ID","type"))
Then use function ggplot() and geom_line() to plot data.
ggplot(df.x,aes(x=variable,y=value,colour=type,group=ID))+
geom_line()+facet_wrap(~set)
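If reshape2 is not available, the long format can be built with base R alone. The sketch below also standardizes each variable first, on the assumption that ggparcoord() was called with its default scale = "std"; without that step the manual plot shows raw values instead. The seed and the helper name vars are additions for this example.

```r
set.seed(1)  # only so the example is reproducible

# Same structure as the example data in the question
x <- data.frame(var1 = rnorm(40), var2 = rnorm(40), var3 = rnorm(40),
                type = factor(rep(c("x", "y"), length.out = 40)),
                set = factor(rep(c("A", "B", "C", "D"), each = 10)))
x$ID <- seq_len(nrow(x))

vars <- c("var1", "var2", "var3")

# Standardize each variable (assumed to mirror ggparcoord's default scaling)
x[vars] <- lapply(x[vars], function(v) as.numeric(scale(v)))

# Long format with base R only: one row per ID and variable
df.x <- data.frame(set = rep(x$set, times = length(vars)),
                   ID = rep(x$ID, times = length(vars)),
                   type = rep(x$type, times = length(vars)),
                   variable = factor(rep(vars, each = nrow(x)), levels = vars),
                   value = unlist(x[vars], use.names = FALSE))
```

The resulting df.x has the same columns as the melted data frame and can be passed to the ggplot() call above unchanged.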

How to make pretty ordered facet labels for numeric values?

I often facet according to a numeric variable, but I want the facet label to be more explanatory than the simple number. I usually create a new label variable that has the numeric value pasted to explanatory text. However, when values have more than one digit before the decimal point, the labels are sorted as text, so the leading digit determines the order of the factor levels. Any suggestions to avoid this?
iris[,1:4]<-iris[,1:4]*10
This would work fine for iris when no value has more than one digit before the decimal point.
iris$Petal.Width.label<-paste("Petal.Width=", iris$Petal.Width)
qplot(data=iris,
x=Sepal.Length,
y=Sepal.Width,
colour=Species)+facet_wrap(~Petal.Width.label)
Related to:
ggplot: How to change facet labels?
How to change the order of facet labels in ggplot (custom facet wrap labels)
Just reorder the levels of your label:
data(iris)
iris[ , 1:4] <- iris[ , 1:4] * 10
iris$Petal.Width.label <- paste("Petal.Width=", iris$Petal.Width)
# reorder levels by Petal.Width
iris$Petal.Width.label2 <- factor(iris$Petal.Width.label,
levels = unique(iris$Petal.Width.label[order(iris$Petal.Width)]))
qplot(data = iris,
x = Sepal.Length,
y = Sepal.Width,
colour = Species)+
facet_wrap( ~Petal.Width.label2)
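A possible variation, shown here only as a sketch on made-up numbers rather than on iris, is to build the labelled factor straight from the numeric values via the levels and labels arguments of factor(). The level order then follows the numbers and no separate text column is needed.

```r
# Made-up widths with one and two digits before the decimal point
width <- c(2, 10, 25, 13, 2)

# Pasted labels sort as text, so "10" comes before "2"
text_order <- levels(factor(paste("Petal.Width =", width)))

# Building the factor from the numbers keeps numeric order
vals <- sort(unique(width))
label <- factor(width, levels = vals, labels = paste("Petal.Width =", vals))
numeric_order <- levels(label)

print(text_order)
print(numeric_order)
```

The factor label can be used in facet_wrap() just like Petal.Width.label2 above.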
