How to obtain a 'normal' boxplot? (R)

I was trying to make a boxplot in R, following the many guides that I found online (such as this one: http://www.sthda.com/english/wiki/ggplot2-box-plot-quick-start-guide-r-software-and-data-visualization), using my dataframe:
library(ggplot2)
value=c('2000000','115000','500000','20000','3000','1000000')
condition=c('C','C','C','H','H','H')
df=data.frame(value,condition)
df$value=as.factor(df$value)
ggplot(df, aes(x=condition, y=value))+
geom_boxplot()
However, following these steps, my result looks like this figure:
https://i.stack.imgur.com/HloKG.png
I can't figure out why ggplot does not understand that I'm using two conditions!
Thanks for your help

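Why are your values character (originally) or factor (after as.factor)? They need to be numeric for a boxplot y axis. Convert through character so that you get the actual numbers rather than the factor's level codes:
library(ggplot2)
df$value <- as.numeric(as.character(df$value))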
ggplot(df, aes(x = condition, y = value))+
geom_boxplot()
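To see why the conversion route matters, here is a minimal sketch using the values from the question. The variable names f, codes and nums are just for illustration. Calling as.numeric() directly on a factor returns its internal level codes, while going through as.character() first recovers the numbers themselves.

```r
# Same values as in the question, stored as a factor
value <- c('2000000', '115000', '500000', '20000', '3000', '1000000')
f <- factor(value)

# as.numeric() on a factor returns the internal level codes (integers 1..6 here)
codes <- as.numeric(f)

# Going through character first recovers the actual numbers
nums <- as.numeric(as.character(f))

print(codes)
print(nums)
```

If the column is still character (never turned into a factor), a plain as.numeric() is enough.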

The value column should be numeric, not a factor:
df$value=as.numeric(as.character(df$value))
Then you will get two boxplots, one per condition.
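For completeness, a sketch of the whole script with the values typed as numbers from the start (an assumption: that the data can be entered without quotes), so no conversion step is needed. The object name p is only for illustration.

```r
library(ggplot2)

# Values entered as numbers (no quotes), so the column is numeric from the start
value <- c(2000000, 115000, 500000, 20000, 3000, 1000000)
condition <- c('C', 'C', 'C', 'H', 'H', 'H')
df <- data.frame(value, condition)

# One box per level of condition, with value on a continuous y axis
p <- ggplot(df, aes(x = condition, y = value)) +
  geom_boxplot()
print(p)
```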

Related

Box plots not appearing properly in RStudio

I am creating box plots within R, however, they are appearing incorrectly. My data is based off of German Credit Dataset on Kaggle.
My code with two different attributes trying to be tested:
data %>%
ggplot(aes(x = Creditability, y = Purpose, fill = Creditability)) +
geom_boxplot() +
ggtitle("Creditability vs Purpose")
data %>%
ggplot(aes(x = Creditability, y = Account.Balance, fill = Creditability)) +
geom_boxplot() +
ggtitle("Creditability vs Account Balance")
I've tried a few different attributes, but they all result in the same error.
Edited info: Is it because the attributes have too much information? I have split the sample into test (300) vs train (700) and I am currently using train. Would it simply be because there's too much info?
Edit picture: [image: Factors]
Edit for graph error: [image: Error]

As others have explained in the comments, you cannot show boxplots where the y axis is set to be a factor. Factors are by their nature discrete variables, even if the levels are named as numbers. In order to utilize the stat function for the boxplot geom, you need the y axis to be continuous and the x axis to be discrete (or able to be separated into discrete values via the group= aesthetic).
Let me demonstrate with the mtcars dataset that ships with R:
library(ggplot2)
ggplot(mtcars, aes(x=factor(carb), y=mpg)) + geom_boxplot()
Here we can draw boxplots because the x aesthetic is forced to be discrete (via factor(carb)), while the y axis uses mpg, which is a numeric column in the mtcars dataset.
If you set both carb and mpg to be factors, you get something that should look pretty similar to what you're seeing:
ggplot(mtcars, aes(x=factor(carb), y=factor(mpg))) + geom_boxplot()
In your case, all the columns in your dataset are factors. If their levels can be coerced to numbers, you can turn them into continuous vectors using as.numeric(levels(column_name)[column_name]). Alternatively, you can use as.numeric(as.character(column_name)). Here's what it looks like to first convert the mtcars$mpg column to a factor of numeric values, and then back to plain numeric with this method.
df <- mtcars
# convert to a factor
df$mpg <- factor(df$mpg)
# back to numeric!
df$mpg <- as.numeric(levels(df$mpg)[df$mpg])
# this plot looks like it did before when we did the same with mtcars
ggplot(df, aes(x=factor(carb), y=mpg)) + geom_boxplot()
So, for your case, follow this two-step process:
data$Purpose <- as.numeric(levels(data$Purpose)[data$Purpose])
data %>%
ggplot(aes(x = Creditability, y = Purpose, fill = Creditability)) +
geom_boxplot() +
ggtitle("Creditability vs Purpose")
That should work. You can follow in a similar fashion for your other variables.
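If several columns need the same treatment, the conversion can be wrapped in a small helper. This is only a sketch: factor_to_numeric is a hypothetical name, and the toy data frame below reuses two column names from the question with made-up values, since the real dataset is not shown here.

```r
# Hypothetical helper: convert a factor whose levels look like numbers
factor_to_numeric <- function(f) {
  # Index the numeric levels by the factor's codes
  as.numeric(levels(f))[f]
}

# Toy stand-in for the credit data (values are made up)
toy <- data.frame(
  Purpose = factor(c("3", "0", "9", "3")),
  Account.Balance = factor(c("1", "4", "2", "4"))
)

# Convert only the columns that should be continuous
cols <- c("Purpose", "Account.Balance")
toy[cols] <- lapply(toy[cols], factor_to_numeric)

str(toy)
```

Columns that are meant to stay categorical, such as the one mapped to x and fill, should be left as factors.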

Multiple columns on x-axis in R

I'm quite new to R, and there has been a question similar to mine asked before, however it doesn't quite get to what I need.
I have a table as follows:
I wish to plot the Value, and Threshold alongside each other on the X-axis for each metric, so effectively, I will have three pairs of plots on the X-axis. I have attempted to use reshape2 and ggplot2 for this as follows:
library(reshape2)
df <- melt(msi, id.vars="Average Metric Value (Abbr)")
# I get an error message, but the output seems ok.
library(ggplot2)
ggplot(df, aes(x="Average Metric Value (Abbr)", y=value, fill=variable)) + geom_bar(stat='identity', position='dodge')
The output graph is as follows:
I'm sure I can work out how to separate each of the three pairs later, but as you can see, I don't have the metric names for each of the three pairs along the x-axis, and I am missing the first "Value" bar, presumably because it equals the same as the second and I am only getting unique values plotted.
How do I get around that and have the name of each metric beneath each pair of values?

In your call, x is given a quoted string, so ggplot treats it as a single constant value rather than as a column. Either pass the names through aes_string, or wrap column names that contain spaces in backquotes inside aes:
library(dplyr)
library(tidyr)
gather(msi, variable, value, Value:Threshold) %>%
ggplot(., aes(x= `Average Metric Value (Abbr)`,
y=value,
fill=variable)) +
geom_bar(stat='identity', position='dodge')
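Since the original table was only posted as an image, here is a self-contained sketch with a made-up msi data frame (the metric names and numbers are assumptions). It uses tidyr's pivot_longer in place of gather and geom_col in place of geom_bar(stat='identity'), which are equivalent for this purpose.

```r
library(ggplot2)
library(tidyr)

# Hypothetical stand-in for the asker's `msi` table
msi <- data.frame(
  "Average Metric Value (Abbr)" = c("M1", "M2", "M3"),
  Value = c(10, 10, 7),
  Threshold = c(8, 12, 9),
  check.names = FALSE  # keep the spaces in the column name
)

# One row per metric/variable pair
long <- pivot_longer(msi, cols = c(Value, Threshold),
                     names_to = "variable", values_to = "value")

# Backquotes make aes() read the name as a column, not as a string constant
p <- ggplot(long, aes(x = `Average Metric Value (Abbr)`,
                      y = value, fill = variable)) +
  geom_col(position = "dodge")
print(p)
```

Because the x aesthetic is now the column itself, each metric gets its own pair of bars, including metrics whose Value is identical.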

How do I create a barplot in R with a cumulative standard deviation?

I want to make a plot similar to the one attached by Lindfield et al. 2016. I'm familiar with the ggplot command in R with the format:
ggplot(dataframe, aes(x, y)) + geom_bar(stat = 'identity')
However, I don't know how to make a cumulative se error for a stacked barplot; only one that employs a position_dodge command.
I know that there are disadvantages to using stacked bars with se errors, but for my data set, it is more presentable than using the unstacked barplots.
Thanks.

I don't know how you would get the cumulative standard errors in an appropriate way (I guess it depends on how your values are generated), but I think you need to calculate them and store them in a second DF. For example, if you have an initial data.frame created like this:
DF <- data.frame( x=c("a","a","b","b"),
sp=c("shark","cod","shark","cod"),
y=c(10,5,15,7),
stringsAsFactors=FALSE )
where y is the value associated with each species at each x point, then you'd create a second DF containing the lower and upper limits of your s.e. for each x value, e.g.
seDF <- data.frame( x=c('a','b'),
yl=c(12,18),
yu=c(17,24),
stringsAsFactors=FALSE )
Then you can create your plot with:
ggplot() +
geom_bar( data=DF, mapping=aes(x=x,y=y,fill=sp),
position="stack", stat="identity") +
geom_linerange( data=seDF, mapping=aes(x=x, ymin=yl, ymax=yu) )
I used geom_linerange rather than geom_errorbar as it doesn't create crossbars at either end.
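One possible way to build seDF from the data, sketched under strong assumptions: each row carries its own standard error in a hypothetical se column (the numbers below are made up), and those errors are independent, so they combine as the square root of the sum of squares. Whether that is appropriate depends on how your values were generated.

```r
# Same DF as above, plus a hypothetical per-species standard error column
DF <- data.frame(x = c("a", "a", "b", "b"),
                 sp = c("shark", "cod", "shark", "cod"),
                 y = c(10, 5, 15, 7),
                 se = c(1.5, 2, 3, 4),
                 stringsAsFactors = FALSE)

# Height of each stacked bar
total <- aggregate(y ~ x, data = DF, FUN = sum)

# Combined s.e. per bar, ASSUMING independent errors
se_tot <- aggregate(se ~ x, data = DF, FUN = function(s) sqrt(sum(s^2)))

# Lower and upper limits, in the layout expected by geom_linerange
seDF <- merge(total, se_tot, by = "x")
seDF$yl <- seDF$y - seDF$se
seDF$yu <- seDF$y + seDF$se

print(seDF)
```

The resulting seDF has the x, yl and yu columns used in the geom_linerange call.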

ggplot2: axis markers when grouping by two variables

I have a simple bar graph in ggplot, with two factor variables on the x axis:
library(ggplot2)
dat <- data.frame(group1= c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
group2= rep(1:4,4),
val = 1:16)
ggplot(dat, aes(x=group1,y=val,group=group2))+
geom_bar(stat="identity", position="dodge")
What is the simplest way to add a second x axis label (for group2)? There is a more complex version of this question here, but I don't see how to apply this logic to this simple case.

As suggested in the question linked by Jimbou, one solution is:
ggplot(dat, aes(y=val,x=group2))+
geom_bar(stat="identity")+
facet_grid(.~group1,scales="free")
I'd be curious to know whether there is another solution using annotate, as also suggested in that question, that works in the case in which the grouping variables are two factors.
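A variation on the faceting approach, offered as a sketch rather than as the annotate solution: moving the facet strips below the panels makes the group1 labels read like a second x axis. It relies on the switch argument of facet_grid and the strip.placement theme setting; the axis title text is just an example.

```r
library(ggplot2)

dat <- data.frame(group1 = rep(1:4, each = 4),
                  group2 = rep(1:4, 4),
                  val = 1:16)

# group2 on the x axis inside each panel, group1 as strips under the axis
p <- ggplot(dat, aes(x = factor(group2), y = val)) +
  geom_col() +
  facet_grid(. ~ group1, switch = "x") +
  theme(strip.placement = "outside") +
  labs(x = "group2 (upper labels) within group1 (lower labels)")
print(p)
```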

Using density in stat_bin with factor variables

It seems density plot in stat_bin doesn't work as expected for factor variables. The density is 1 for each category on y-axis.
For example, using diamonds data:
diamonds_small <- diamonds[sample(nrow(diamonds), 1000), ]
ggplot(diamonds_small, aes(x = cut)) + stat_bin(aes(y=..density.., fill=cut))
I understand I could use
stat_bin(aes(y=..count../sum(..count..), fill=cut))
to make it work. However, according to the docs of stat_bin, it should work with categorical variables.

You can get what you (might) want by setting the group aesthetic manually.
ggplot(diamonds_small, aes(x = cut)) + stat_bin(aes(y=..density..,group=1))
However, you can't easily fill differently within a group. You can summarize the data yourself:
library(plyr)
dd_dens <- ddply(diamonds_small,.(cut),
function(x) data.frame(dens=nrow(x)/nrow(diamonds_small)))
ggplot(dd_dens,aes(x=cut,y=dens))+geom_bar(aes(fill=cut),stat="identity")
A slightly more compact version of the summarization step:
as.data.frame.table(prop.table(table(diamonds_small$cut)))
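The compact version returns columns named Var1 and Freq by default, so a little renaming is needed before plotting. A sketch with base R only for the summary step (the seed is arbitrary, and dd_dens and dens are just the names used for illustration):

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the sample is reproducible
diamonds_small <- diamonds[sample(nrow(diamonds), 1000), ]

# Proportion of rows in each cut, as a data frame with a 'dens' column
dd_dens <- as.data.frame.table(prop.table(table(diamonds_small$cut)),
                               responseName = "dens")
names(dd_dens)[1] <- "cut"

# Bars filled by category, with heights that sum to 1
p <- ggplot(dd_dens, aes(x = cut, y = dens, fill = cut)) +
  geom_col()
print(p)
```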