It seems the density plot in stat_bin doesn't work as expected for factor variables: the density is 1 for each category on the y-axis.
For example, using the diamonds data:
diamonds_small <- diamonds[sample(nrow(diamonds), 1000), ]
ggplot(diamonds_small, aes(x = cut)) + stat_bin(aes(y=..density.., fill=cut))
I understand I could use
stat_bin(aes(y=..count../sum(..count..), fill=cut))
to make it work. However, according to the docs of stat_bin, it should work with categorical variables.
You can get what you (might) want by setting the group aesthetic manually.
ggplot(diamonds_small, aes(x = cut)) + stat_bin(aes(y=..density..,group=1))
However, you can't easily fill differently within a group. You can summarize the data yourself:
library(plyr)
# store the per-cut proportions so the plot call below can use them
dd_dens <- ddply(diamonds_small, .(cut),
                 function(x) data.frame(dens = nrow(x) / nrow(diamonds_small)))
ggplot(dd_dens,aes(x=cut,y=dens))+geom_bar(aes(fill=cut),stat="identity")
A slightly more compact version of the summarization step:
as.data.frame.table(prop.table(table(diamonds_small$cut)))
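The compact summary can also be fed straight into the plot. A minimal sketch, assuming ggplot2 2.2.0 or later for geom_col(); the set.seed() call and the column name dens are illustrative choices, not part of the original answer:

```r
library(ggplot2)

set.seed(42)  # assumption: fixed seed so the sample is reproducible
diamonds_small <- diamonds[sample(nrow(diamonds), 1000), ]

# Proportion of rows in each cut. Naming the table dimension "cut" and
# setting responseName gives readable column names in the data frame.
dd_dens <- as.data.frame(
  prop.table(table(cut = diamonds_small$cut)),
  responseName = "dens"
)

# One bar per cut, height = proportion, each bar filled by its own cut.
p <- ggplot(dd_dens, aes(x = cut, y = dens, fill = cut)) +
  geom_col()
print(p)
```

Because the proportions are computed before plotting, the fill aesthetic no longer interferes with the grouping used to normalize the bars.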
I am creating box plots in R, but they are rendering incorrectly. My data is based on the German Credit Dataset from Kaggle.
My code, tested with two different attributes:
data %>%
ggplot(aes(x = Creditability, y = Purpose, fill = Creditability)) +
geom_boxplot() +
ggtitle("Creditability vs Purpose")
data %>%
ggplot(aes(x = Creditability, y = Account.Balance, fill = Creditability)) +
geom_boxplot() +
ggtitle("Creditability vs Account Balance")
I've tried a few of the other attributes, but each one results in the same error.
Edited info: Is it because the attributes have too much information? I have split the sample into test (300) vs train (700) and I am currently using train. Would it simply be because there's too much info?
Edit picture: [image: Factors]
Edit for graph error: [image: Error]
As others have explained in the comments, you cannot show boxplots where the y axis is set to be a factor. Factors are by their nature discrete variables, even if the levels are named as numbers. In order to utilize the stat function for the boxplot geom, you need the y axis to be continuous and the x axis to be discrete (or able to be separated into discrete values via the group= aesthetic).
Let me demonstrate with the mtcars dataset built into R:
library(ggplot2)
ggplot(mtcars, aes(x=factor(carb), y=mpg)) + geom_boxplot()
Here we can draw boxplots because the x aesthetic is forced to be discrete (via factor(carb)), while the y axis uses mpg, which is a numeric column in the mtcars dataset.
If you set both carb and mpg to be factors, you get something that should look pretty similar to what you're seeing:
ggplot(mtcars, aes(x=factor(carb), y=factor(mpg))) + geom_boxplot()
In your case, all the columns in your dataset are factors. If their levels can be coerced to numbers, you can turn them into continuous vectors using as.numeric(levels(column_name)[column_name]) or, alternatively, as.numeric(as.character(column_name)). Calling as.numeric() directly on the factor would return the internal level codes instead. Here's what it looks like to first convert the mtcars$mpg column to a factor of numeric values, and then back to numeric via this method.
df <- mtcars
# convert to a factor
df$mpg <- factor(df$mpg)
# back to numeric!
df$mpg <- as.numeric(levels(df$mpg)[df$mpg])
# this plot looks like it did before when we did the same with mtcars
ggplot(df, aes(x=factor(carb), y=mpg)) + geom_boxplot()
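The difference between the conversions is easy to check on a tiny factor. This is a small base R sketch; the vector f and the variable names are made up for illustration:

```r
# A factor whose labels are numbers stored as text.
f <- factor(c("10", "5", "20"))

# as.numeric() on the factor itself returns the internal level codes,
# which depend on the (alphabetical) ordering of the levels.
codes <- as.numeric(f)

# Going through the labels recovers the actual numbers.
via_chr <- as.numeric(as.character(f))
via_lev <- as.numeric(levels(f)[f])

print(codes)
print(via_chr)
print(via_lev)
```

Both label-based conversions give the original values, while the level codes only reflect the sort position of each label, which is why a boxplot built on them looks wrong.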
So, for your case, do this two step process:
data$Purpose <- as.numeric(levels(data$Purpose)[data$Purpose])
data %>%
ggplot(aes(x = Creditability, y = Purpose, fill = Creditability)) +
geom_boxplot() +
ggtitle("Creditability vs Purpose")
That should work. You can follow in a similar fashion for your other variables.
I was trying to make a boxplot in R following the many guides that I found online (such as this one: http://www.sthda.com/english/wiki/ggplot2-box-plot-quick-start-guide-r-software-and-data-visualization) using my dataframe:
library(ggplot2)
value=c('2000000','115000','500000','20000','3000','1000000')
condition=c('C','C','C','H','H','H')
df=data.frame(value,condition)
df$value=as.factor(df$value)
ggplot(df, aes(x=condition, y=value))+
geom_boxplot()
However, following these steps, my result is similar to this figure:
https://i.stack.imgur.com/HloKG.png
I can't figure it out why ggplot cannot understand that I'm using two conditions!
Thanks for your help
Why are your value values character (originally) or factor (after as.factor)? They need to be numeric for a boxplot y axis.
library(ggplot2)
# convert via character: as.numeric() on a factor would return the level codes
df$value <- as.numeric(as.character(df$value))
ggplot(df, aes(x = condition, y = value))+
geom_boxplot()
The value attribute should be numerical, not a factor:
df$value=as.numeric(as.character(df$value))
Then you will get one boxplot per condition.
I want to make a plot similar to the one attached by Lindfield et al. 2016. I'm familiar with the ggplot command in R with the format:
ggplot(dataframe, aes(x, y)) + geom_bar(stat = 'identity')
However, I don't know how to make a cumulative se error for a stacked barplot; only one that employs a position_dodge command.
I know that there are disadvantages to using stacked bars with se errors, but for my data set, it is more presentable than using the unstacked barplots.
Thanks.
I don't know how you would get the cumulative standard errors in an appropriate way (I guess it depends on how your values are generated), but I think you need to calculate them and store them in a second DF. For example, if you have an initial data.frame created like this:
DF <- data.frame( x=c("a","a","b","b"),
sp=c("shark","cod","shark","cod"),
y=c(10,5,15,7),
stringsAsFactors=FALSE )
where y is the value associated with each species at each x point, then you'd create a second DF containing the lower and upper limits of your s.e. for each x value, eg
seDF <- data.frame( x=c('a','b'),
yl=c(12,18),
yu=c(17,24),
stringsAsFactors=FALSE )
Then you can create your plot with:
ggplot() +
geom_bar( data=DF, mapping=aes(x=x,y=y,fill=sp),
position="stack", stat="identity") +
geom_linerange( data=seDF, mapping=aes(x=x, ymin=yl, ymax=yu) )
I used geom_linerange rather than geom_errorbar as it doesn't create crossbars at either end.
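One common way to fill in the limits for the stacked total is to add the per-species standard errors in quadrature, which is only valid if the species estimates are independent. The sketch below makes that assumption; the se column and its values are hypothetical and not from the question:

```r
# Same layout as DF above, plus a hypothetical per-species standard error.
DF <- data.frame(x  = c("a", "a", "b", "b"),
                 sp = c("shark", "cod", "shark", "cod"),
                 y  = c(10, 5, 15, 7),
                 se = c(3, 4, 6, 8),   # assumption: illustrative values
                 stringsAsFactors = FALSE)

# Height of each stacked bar.
total <- tapply(DF$y, DF$x, sum)

# Standard error of a sum of independent estimates: sqrt(sum(se^2)).
se_total <- sqrt(tapply(DF$se^2, DF$x, sum))

# Same columns as seDF above, so it can be passed to geom_linerange().
seDF <- data.frame(x  = names(total),
                   yl = as.numeric(total - se_total),
                   yu = as.numeric(total + se_total),
                   stringsAsFactors = FALSE)
print(seDF)
```

If the species values are correlated (for example, counted on the same transects), the quadrature rule does not apply, and the standard error of the total should be computed from the per-sample totals instead.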
I am trying to find the best way to create barplots in R with standard errors displayed. I have seen other articles but I cannot figure out the code to use with my own data (having not used ggplot before and this seeming to be the most used way and barplot not cooperating with dataframes). I need to use this in two cases for which I have created two example dataframes:
Plot df1 so that the x-axis has sites a-c, with the y-axis displaying the mean value for V1 and the standard errors highlighted, similar to this example with a grey colour. Here, plant biomass should be the mean V1 value and treatments should be each of my sites.
Plot df2 in the same way, but so that before and after are located next to each other in a similar way to this, so pre-test and post-test equate to before and after in my example.
x <- factor(LETTERS[1:3])
site <- rep(x, each = 8)
values <- as.data.frame(matrix(sample(0:10, 3*8, replace=TRUE), ncol=1))
df1 <- cbind(site,values)
z <- factor(c("Before","After"))
when <- rep(z, each = 4)
df2 <- data.frame(when,df1)
Apologies for the simplicity to more experienced R users, particularly those who use ggplot, but I cannot apply the snippets of code that I have found elsewhere to my data. I cannot even get enough code together to produce the start of a graph, so I hope my descriptions are sufficient. Thank you in advance.
Something like this?
library(ggplot2)
get.se <- function(y) {
se <- sd(y)/sqrt(length(y))
mu <- mean(y)
c(ymin=mu-se, ymax=mu+se)
}
ggplot(df1, aes(x=site, y=V1)) +
# fun= replaces the deprecated fun.y= argument (ggplot2 >= 3.3.0)
stat_summary(fun=mean, geom="bar", fill="lightgreen", color="grey70")+
stat_summary(fun.data=get.se, geom="errorbar", width=0.1)
ggplot(df2, aes(x=site, y=V1, fill=when)) +
stat_summary(fun=mean, geom="bar", position="dodge", color="grey70")+
stat_summary(fun.data=get.se, geom="errorbar", width=0.1, position=position_dodge(width=0.9))
So this takes advantage of the stat_summary(...) function in ggplot to, first, summarize y for given x using mean(...) (for the bars), and then to summarize y for given x using the get.se(...) function for the error-bars. Another option would be to summarize your data prior to using ggplot, and then use geom_bar(...) and geom_errorbar(...).
Also, plotting +/- 1 se is not great practice (although it's used often enough). You'd be better served plotting proper confidence limits, which you could do, for instance, using ggplot2's mean_cl_normal function (it relies on the Hmisc package being installed) instead of the hand-written get.se(...). mean_cl_normal returns the 95% confidence limits based on the assumption that the data are normally distributed (or you can set the confidence level to something else; read the documentation).
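For readers who want to see what such a confidence interval involves, here is a hand-rolled sketch in base R. The name get.ci and its conf argument are hypothetical; it is meant to play a similar role to mean_cl_normal (a t-based interval around the mean), not to reproduce that function exactly:

```r
# t-based confidence interval for the mean, returned in the y/ymin/ymax
# layout that stat_summary(fun.data = ...) expects.
get.ci <- function(y, conf = 0.95) {
  n    <- length(y)
  mu   <- mean(y)
  se   <- sd(y) / sqrt(n)
  # critical value of the t distribution with n - 1 degrees of freedom
  half <- qt((1 + conf) / 2, df = n - 1) * se
  data.frame(y = mu, ymin = mu - half, ymax = mu + half)
}

ci <- get.ci(c(1, 2, 3, 4, 5))
print(ci)
```

It could then be used in place of get.se, for example stat_summary(fun.data=get.ci, geom="errorbar", width=0.1). With only a handful of observations per bar, the interval is noticeably wider than +/- 1 se.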
I used the group_by and summarise_each functions from dplyr for this, and the std.error function from the plotrix package.
library(plotrix) # for std error function
library(dplyr) # for group_by and summarise_each function
library(ggplot2) # for creating ggplot
For df1 plot
# Group data by site
grouped_df1<-group_by(df1,site)
#summarise grouped data and calculate mean and standard error using function mean and std.error(from plotrix)
summarised_df1<-summarise_each(grouped_df1,funs(mean=mean,std_error=std.error))
# Define the top and bottom of the errorbars
limits <- aes(ymax = mean + std_error, ymin=mean-std_error)
#Begin your ggplot
#Here we are plotting site vs mean (df1 has no second factor to fill by)
g<-ggplot(summarised_df1,aes(site,mean))
#Creating bar to show the factor variable position_dodge
#ensures side by side creation of factor bars
g<-g+geom_bar(stat = "identity",position = position_dodge())
#creation of error bar
g<-g+geom_errorbar(limits,width=0.25,position = position_dodge(width = 0.9))
#print graph
g
For df2 plot
# Group data by when and site
grouped_df2<-group_by(df2,when,site)
#summarise grouped data and calculate mean and standard error using function mean and std.error
summarised_df2<-summarise_each(grouped_df2,funs(mean=mean,std_error=std.error))
# Define the top and bottom of the errorbars
limits <- aes(ymax = mean + std_error, ymin=mean-std_error)
#Begin your ggplot
#Here we are plotting site vs mean and filling by another factor variable when
g<-ggplot(summarised_df2,aes(site,mean,fill=when))
#Creating bar to show the factor variable position_dodge
#ensures side by side creation of factor bars
g<-g+geom_bar(stat = "identity",position = position_dodge())
#creation of error bar
g<-g+geom_errorbar(limits,width=0.25,position = position_dodge(width = 0.9))
#print graph
g
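In recent dplyr releases summarise_each() and funs() are deprecated, so the summarisation step may need updating. A sketch of the same step with plain summarise(), assuming dplyr 1.0.0 or later (for the .groups argument) and computing the standard error by hand rather than with plotrix; the seed is an arbitrary choice:

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the example data is reproducible
df1 <- data.frame(site = rep(factor(LETTERS[1:3]), each = 8),
                  V1   = sample(0:10, 3 * 8, replace = TRUE))
df2 <- data.frame(when = rep(factor(c("Before", "After")), each = 4), df1)

# One row per when/site combination with its mean and standard error.
summarised_df2 <- df2 %>%
  group_by(when, site) %>%
  summarise(mean      = mean(V1),
            std_error = sd(V1) / sqrt(n()),
            .groups   = "drop")
print(summarised_df2)
```

The resulting columns are named mean and std_error, so the limits and plotting code above should work unchanged on this data frame.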
I'm going to use the diamonds data set that comes standard with the ggplot2 package to illustrate what I'm looking for.
I want to build a graph that is like this:
library(ggplot2)
ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar(position="dodge")
However, instead of having a count, I would like to return the mean of a continuous variable. I'd like to return cut and color and get the mean carat. If I put in this code:
ggplot(diamonds, aes(carat, fill=cut)) + geom_bar(position="dodge")
My output is a count of the number of carats vs the cut.
Anyone know how to do this?
You can get a new data frame with mean(carat) grouped by cut and color and then plot:
library(plyr)
data <- ddply(diamonds, .(cut, color), summarise, mean_carat = mean(carat))
ggplot(data, aes(color, mean_carat,fill=cut))+geom_bar(stat="identity", position="dodge")
If you want faster solutions you can use either dplyr or data.table
With dplyr:
library(dplyr)
# %>% replaces the old %.% chaining operator, which is no longer part of dplyr
data <- group_by(diamonds, cut, color) %>% summarise(mean_carat = mean(carat))
With data.table:
library(data.table)
data <- data.table(diamonds)[,list(mean_carat=mean(carat)), by=c('cut', 'color')]
The code for the plot is the same for both.
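If none of those packages is available, the same summary can be sketched in base R with aggregate(). This is an alternative to the answer's approach rather than part of it; the renaming step is only there so the plotting code can stay the same:

```r
library(ggplot2)  # for the diamonds data and the plot

# Mean carat for every cut/color combination.
data <- aggregate(carat ~ cut + color, data = diamonds, FUN = mean)

# aggregate() keeps the name "carat"; rename it to match the plot code.
names(data)[names(data) == "carat"] <- "mean_carat"

p <- ggplot(data, aes(color, mean_carat, fill = cut)) +
  geom_bar(stat = "identity", position = "dodge")
print(p)
```

aggregate() is likely to be the slowest of the options on large data, but for a data set the size of diamonds the difference is rarely noticeable.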