ggplot boxplot: too many outliers? - r

The dataset is available here, but I am only using the subset from 2010 to 2016: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results/
I am trying to plot height by gender with a boxplot, and it returns this plot:
I don't think it is correct, since there are far too many outliers (mean = 175, min = 133, max = 221).
Do I need to adjust the y-axis to include more data points in this boxplot? If so, how can I do that?
Here is my code:
ggplot(data = olympics, aes(x = Sex, y = Height)) +
  geom_boxplot() +
  labs(title = "Height Distribution of Olympics Athletes by Gender")
Also, I was wondering if it is possible to plot such a graph with base R language as well? Thank you!
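On the base R part of the question, a minimal sketch follows. It assumes the data frame has columns named Sex and Height as in the ggplot call above; olympics_demo is a synthetic stand-in, not the Kaggle data. It also counts the points drawn beyond the whiskers: under the usual 1.5 x IQR rule, a small fraction of a large sample is always flagged, so thousands of rows can produce many outlier points without anything being wrong.

```r
# Synthetic stand-in for the Olympic subset (assumption: columns Sex and Height)
set.seed(42)
olympics_demo <- data.frame(
  Sex = rep(c("F", "M"), each = 5000),
  Height = c(rnorm(5000, 168, 8), rnorm(5000, 180, 9))
)

# Base R counterpart of geom_boxplot(): formula interface, one box per Sex
bp <- boxplot(Height ~ Sex, data = olympics_demo,
              main = "Height Distribution of Olympics Athletes by Gender",
              xlab = "Sex", ylab = "Height")

# bp$out holds the points drawn beyond the whiskers
n_out <- length(bp$out)
share_out <- n_out / nrow(olympics_demo)
print(c(outliers = n_out, share = share_out))
```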

Welcome to Stack Overflow, @VanLindert. The best way to get help is to give us code to run that replicates the problem. The datapasta and reprex packages make this easy to do. https://reprex.tidyverse.org/articles/articles/datapasta-reprex.html
What I suspect is going on is that you are readjusting the y-axis limits and the boxplot keeps changing. When you use plot + scale_y_continuous(limits = c(130, 225)) or the shorthand plot + ylim(130, 225), ggplot removes the values below 130 and above 225 before computing the statistics, so the quartiles are recalculated from the remaining data. If you just want to zoom in on a specific range of the plot, you can use
plot + coord_cartesian(ylim = c(130, 225))
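The difference can be checked on synthetic data. The sketch below is only an illustration: the heights data frame and the limits of 160 and 190 are made up for the example. It uses layer_data() to read back the statistics that geom_boxplot() computed in each case.

```r
library(ggplot2)

# Synthetic heights (assumption: stands in for the Olympic data)
set.seed(1)
heights <- data.frame(
  Sex = rep(c("F", "M"), each = 500),
  Height = c(rnorm(500, 168, 8), rnorm(500, 180, 9))
)

p <- ggplot(heights, aes(x = Sex, y = Height)) + geom_boxplot()

# Scale limits: rows outside the range are dropped before the boxplot statistics are computed
p_limits <- p + scale_y_continuous(limits = c(160, 190))
# Coordinate limits: the statistics use all rows, only the visible region changes
p_zoom <- p + coord_cartesian(ylim = c(160, 190))

# layer_data() returns the computed boxplot statistics (lower, middle, upper, ...)
stats_full <- layer_data(p)
stats_limits <- suppressWarnings(layer_data(p_limits))  # warns about removed rows
stats_zoom <- layer_data(p_zoom)

print(rbind(full = stats_full$lower, limits = stats_limits$lower, zoom = stats_zoom$lower))
```

The lower quartiles of the coord_cartesian() version match the unrestricted plot, while the scale-limit version shifts them.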

Related

Is there a way to programmatically ensure geom_density gets to the x-axis?

I found this on the Tidyverse Github:
https://github.com/tidyverse/ggplot2/issues/3716
but I can't find the resolution of yutannihilation's question.
For exploratory data analysis, I would like for the outline stroke to reach the x-axis as it does with base R, including facets with scales="free".
Is there a way to do this programmatically? The user may have multiple facets of data, on the same or different scales. Can I ensure the x-axis is wide enough to take the density to zero?
I have tried outline.type = "full" and "both", but neither seems to work.
The MRE shows the issue. The use case is within a Shiny app and can be facet_wrap-ed as well.
Thanks!
#R base
plot(density(diamonds$carat, adjust = 5))
#ggplot
library(ggplot2)
ggplot(diamonds, aes(carat)) +
  geom_density(adjust = 5)
A straightforward solution would be to calculate the density yourself and plot that:
library(ggplot2)
ggplot(as.data.frame(density(diamonds$carat, adjust = 5)[1:2]), aes(x, y)) +
  geom_line()
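For the faceted case mentioned in the question, the same idea can be applied per group. This is a sketch under assumptions: it facets diamonds by cut purely as an example, and the names groups and dens_by_cut are illustrative. density() extends its evaluation grid beyond the data range by default, which is why the curve comes down toward zero at both ends.

```r
library(ggplot2)

# Split carat by the faceting variable (here: cut, chosen only for illustration)
groups <- split(diamonds$carat, diamonds$cut)

# Run density() once per group and stack the x/y curves into one data frame
dens_by_cut <- do.call(rbind, lapply(names(groups), function(g) {
  d <- density(groups[[g]], adjust = 5)
  data.frame(cut = g, x = d$x, y = d$y)
}))

# Each facet gets its own precomputed curve and its own axis ranges
p_dens <- ggplot(dens_by_cut, aes(x, y)) +
  geom_line() +
  facet_wrap(~cut, scales = "free")
print(p_dens)
```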

How do I build a better boxplot with an extreme outlier using ggplot

I have a dataset with an extreme outlier. I am trying to build a boxplot using the following code:
ggplot(arrange_df, aes(x = "", y = idleS)) + geom_boxplot()
It displays the plot like this:
How can I present the boxplot better when there is an extreme outlier? Thanks for any help or suggestions.
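Two common options are sketched below, on made-up data: demo_df is a hypothetical stand-in for arrange_df, and the 99th-percentile cutoff is an arbitrary choice for the example. The first zooms the view with coord_cartesian() without dropping rows; the second uses a log scale, which only works for strictly positive values and computes the box on the log-transformed data.

```r
library(ggplot2)

set.seed(7)
# Hypothetical stand-in for arrange_df: mostly small values plus a few extreme ones
demo_df <- data.frame(idleS = c(rexp(500, rate = 1 / 20), 5000, 8000, 12000))

p <- ggplot(demo_df, aes(x = "", y = idleS)) + geom_boxplot()

# Option 1: zoom the view to the bulk of the data; the statistics still use every row
upper <- unname(quantile(demo_df$idleS, 0.99))
p_zoom <- p + coord_cartesian(ylim = c(0, upper))

# Option 2: log scale (strictly positive values only); quartiles are computed on log10(idleS)
p_log <- p + scale_y_log10()

print(p_zoom)
print(p_log)
```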

using a boxplot in R

I am trying to make a density plot out of 4000 rows (height.values) with 4 different categories (height.ind). This is the code I used:
library(ggplot2)
library(dplyr)  # provides the %>% pipe used below
plom %>%
  ggplot(aes(x = height.values, color = height.ind)) +
  geom_density() +
  labs(title = "height alimony")
I am able to get a density plot, but there are a lot of lines instead of the 4 I want.
Does anyone have an idea how to fix it?

Creating nice overlayed histogram in R with ggplot

I'm hoping to get some help making the following histogram look as nice and understandable as possible. I am plotting the salaries of immigrant versus US-born workers. I am wondering:
1. How would you modify colors, axis intervals, etc. to make the graph more clear/appealing?
2. How could I add a key to indicate purple is for US born workers, and pink is for foreign born?
3. How can I add two different lines to indicate the median of each group? And a corresponding label for each?
My current code is set up as this:
ggplot(NHIS1, aes(x = adj_SALARY, y = ..density..)) +
  geom_histogram(data = subset(NHIS1, IMMIGRANT == '0'), alpha = .5, binwidth = 800, fill = "purple", position = "identity") + xlim(4430.4, 50000) +
  geom_vline(xintercept = median(NHIS1$adj_SALARY), col = "black", linetype = "dashed") +
  geom_histogram(data = subset(NHIS1, IMMIGRANT == '1'), alpha = .5, binwidth = 800, fill = "red") + xlim(4430.4, 50000) +
  geom_vline(xintercept = median(NHIS1$adj_SALARY), col = "black", linetype = "dashed")
And my final histogram at the moment appears as this:
If you have two variables, one for income and one for immigrant status, you do not need to plot two histograms; one will suffice if you specify the grouping. I'd also suggest adding density lines, which help smooth over the histogram's bumps.
Assuming this is roughly like your data:
df <- data.frame(income = sample(1000:5000, 1000),
                 born = sample(c("US", "Foreign"), 1000, replace = TRUE))
Then a crude way to plot one histogram as well as density lines for the two groups would be this:
ggplot(df, aes(x = income, color = born, fill = born)) +
  geom_histogram(aes(y = ..density..), alpha = 0.5, binwidth = 100,
                 position = "identity") +
  geom_density(alpha = .2)
This question has been asked before: overlaying-histograms-with-ggplot2-in-r discusses several options with many examples. You should definitely take a look at it.
Another option to compare the distributions could be violin plots using geom_violin(). I see violin plots as the better option when you need to compare distributions because they give you more flexibility and are still clearer. But that may be just me. Refer to the examples in the manual.
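For the third point of the question (a median line and label per group), one possible sketch is to compute the medians outside ggplot and pass them as a separate data frame. The data here are synthetic and shaped like the example above; medians and p_med are illustrative names, and after_stat() assumes ggplot2 3.3.0 or later. Mapping fill and color to the grouping variable also produces the legend asked for in the second point.

```r
library(ggplot2)

set.seed(123)
# Synthetic data with the same shape as the answer's example
df <- data.frame(income = sample(1000:5000, 1000),
                 born = sample(c("US", "Foreign"), 1000, replace = TRUE))

# One median per group, computed outside ggplot
medians <- aggregate(income ~ born, data = df, FUN = median)

p_med <- ggplot(df, aes(x = income, color = born, fill = born)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5, binwidth = 100,
                 position = "identity") +
  # One dashed line per group, colored like its histogram
  geom_vline(data = medians, aes(xintercept = income, color = born),
             linetype = "dashed") +
  # Label each line at the top of the panel
  geom_text(data = medians,
            aes(x = income, y = Inf, label = paste("median:", income)),
            vjust = 1.5, hjust = -0.1, show.legend = FALSE)
print(p_med)
```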

Overlay ggplot2 stat_density2d plots with alpha channels constant across groups

I would like to plot multiple groups in a stat_density2d plot with alpha values related to the counts of observations in each group. However, the levels formed by stat_density2d seem to be normalized to the number of observations in each group. For example,
library(ggplot2)
library(ggplot2movies)  # the movies dataset now lives in the ggplot2movies package
temp <- rbind(movies[1:2,], movies[movies$mpaa == "R" | movies$mpaa == "PG-13",])
ggplot(temp, aes(x = rating, y = length)) +
  stat_density2d(geom = "tile", aes(fill = mpaa, alpha = ..density..), contour = FALSE) +
  theme_minimal()
Creates a plot like this:
Because I only included 2 points without ratings, they result in densities that look much tighter/stronger than the other two, and so wash out the other two densities. I've tried looking at Overlay two ggplot2 stat_density2d plots with alpha channels and Specifying the scale for the density in ggplot2's stat_density2d but they don't really address this specific issue.
Ultimately, what I'm trying to accomplish with my real data, is I have "power" samples from discrete 2d locations for multiple conditions, and I am trying to plot what their relative powers/spatial distributions are. I am duplicating points in locations relative to their powers, but this has resulted in low power conditions with just a few locations looking the strongest when using stat_density2d. Please let me know if there is a better way of going about doing this!
Thanks!
stat_binhex, which understands ..count.. in addition to ..density.., may work for you:
ggplot(temp, aes(x = rating, y = length)) +
  stat_binhex(geom = "hex", aes(fill = mpaa, alpha = ..count..)) +
  theme_minimal()
You may want to adjust the bin width, though.
Not the most elegant R code, but this seems to work. I normalize my real data a bit differently, but this gets the idea across. I use a for loop in which I find the average power for each condition and add a new stat_density2d layer with the alpha scaled by that average power.
temp <- rbind(movies[1:2,], movies[movies$mpaa == "R" | movies$mpaa == "PG-13",])
mpaa = unique(temp$mpaa)
p <- ggplot() + theme_minimal()
for (ii in seq(1,3)) {
  ratio = length(which(temp$mpaa == mpaa[ii]))
  p <- p + stat_density2d(data = temp[temp$mpaa == mpaa[ii],],
                          aes(x = rating, y = length, fill = mpaa, alpha = ..level..),
                          geom = "polygon",
                          contour = TRUE,
                          alpha = ratio/20,
                          lineType = "none")
}
print(p)
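A variant of the same loop, with the alpha normalized to each group's share of the observations so that it stays between 0 and 1, might look like the sketch below. It runs on synthetic points rather than the movies data; pts, share and p_share are illustrative names, and using the share of rows as the weight is an assumption, not the normalization the answer's author applied to the real data.

```r
library(ggplot2)

set.seed(3)
# Synthetic points: one large group and one small, tight group
pts <- data.frame(
  x = c(rnorm(300, 0, 1), rnorm(30, 3, 0.5)),
  y = c(rnorm(300, 0, 1), rnorm(30, 3, 0.5)),
  grp = rep(c("large", "small"), c(300, 30))
)

# Share of observations per group, used as a constant alpha for that group's layer
counts <- table(pts$grp)
share <- setNames(as.numeric(counts) / nrow(pts), names(counts))

p_share <- ggplot() + theme_minimal()
for (g in names(share)) {
  # One density layer per group; smaller groups are drawn more transparent
  p_share <- p_share + stat_density_2d(
    data = pts[pts$grp == g, ],
    aes(x = x, y = y, fill = grp),
    geom = "polygon",
    alpha = share[[g]]
  )
}
print(p_share)
```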
