Creating nice overlayed histogram in R with ggplot - r

I'm hoping to get some help on making the following histogram looks as nice and understandable as possible. I am plotting the salaries of Immigrant versus US Born workers. I am wondering
1. How would you modify colors, axis intervals, etc. to make the graph more clear/appealing?
2. How could I add a key to indicate purple is for US born workers, and pink is for foreign born?
3. How can I add two different lines to indicate the median of each group? And a corresponding label for each?
My current code is set up as this:
ggplot(NHIS1,aes(x=adj_SALARY, y=..density..)) +
geom_histogram(data=subset(NHIS1,IMMIGRANT=='0'), alpha=.5,binwidth=800, fill="purple",position="identity") + xlim(4430.4,50000) +
geom_vline(xintercept=median(NHIS1$adj_SALARY), col="black", linetype="dashed") +
geom_histogram(data=subset(NHIS1,IMMIGRANT=='1'), alpha=.5,binwidth=800,fill="red") + xlim(4430.4,50000)
geom_vline(xintercept=median(NHIS1$adj_SALARY), col="black", linetype="dashed")
And my final histogram at the moment appears as this:

If you have two variables, one for income , one for immigrant status, you do not need to plot two histograms but one will suffice if you specify the grouping. Also, I'd suggest you also use density lines, which help smooth over the histogram's bumps:
Assuming this is roughly like your data:
df <- data.frame(income = sample(1000:5000, 1000),
born = sample(c("US", "Foreign"), 1000, replace = T))
Then a crude way to plot one histogram as well as density lines for the two groups would be this:
ggplot(df, aes(x=income, color=born, fill=born)) +
geom_histogram(aes(y=..density..), alpha=0.5, binwidth=100,
position="identity") +
geom_density(alpha=.2)

This question has been asked before: overlaying-histograms-with-ggplot2-in-r discusses several options with many examples. You should definitely take a look at it.
Another option to compare the distributions could be violin plots using geom_violin(). I see violin plots as the better option when you need to compare distributions because they give you more flexibility and are still clearer. But that may be just me. Refer to the examples in the manual.

Related

ggplot boxplot: too many outliers?

The dataset is available here but I am only using the ones from Year 2010 - 2016 as a subset: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results/
I am trying to plot the height of different gender with a boxplot and it returns this plot:
I felt that it is not correct since there are way too many outliers...(mean=175, min=133, max=221).
I was wondering if I need to adjust the Y-axis to include more data points in this boxplot? If so, how can I do that?
Here is my code:
ggplot(data = olympics, aes(x = Sex, y = Height) +
geom_boxplot() +
labs(title= "Height Distribution of Olympics Athletes by Gender")
Also, I was wondering if it is possible to plot such a graph with base R language as well? Thank you!
Welcome to stackoverflow #VanLindert. The best way to get help is to give us code to run that replicates the problem. The datapasta and reprex packages make this easy to do. https://reprex.tidyverse.org/articles/articles/datapasta-reprex.html
What I suspect is going on is that you are readjusting the y-axis limits and the boxplot keeps changing. When you use plot + scale_y_continuous(limits = c(130, 225)) or the shorthand plot + ylim(130, 225) ggplot filters out values above/below those 130 and 225 and the quartiles are recalculated. If you want to just zoom in on the plot to a specific range, you can use
plot + coord_cartesian(ylim = c(130, 225))

Filling cross over under a Cumulative Frequency plot using ggplot in R

I am trying to plot two Cumulative Frequency curves in ggplot, and shade the cross over at a certain cut off. I haven't been using ggplot for long, so I was hoping someone might be able to help me with this one.
The plot without filled regions, looks like this...
Which I have created using the following code...
library(ggplot2) # required
north <- rnorm(3060, mean=277,sd=3.01) # to create synthetic data
south <- rnorm(3060, mean=278, sd=3.26) # in place of my real data.
#placing in dataframe
df_temp <- data.frame(temp=c(north,south),
region=c(rep("north",length=3060),rep("south",length=3060)))
#manipulating into cdf, as I've seen in other examples
temp.regions <- ddply(df_temp, .(region), summarize,
temp = unique(temp),
ecdf = ecdf(temp)(unique(temp)))
# feeding into ggplot.
ggplot(temp.regions,aes(x=temp, y=ecdf, color = region)) +
geom_line(aes(x=temp,color=region))+
scale_colour_manual(values = c("blue","red"))
What I would then like, would be to shade both curves for temperatures below 0.2 on the y axis. Ideally I'd like to see the blue one shaded in blue, and the red one shaded in red. Then, where they cross over in purple.
However, the closest I have managed is as follows... ]
Which I have achieved using the following additions to my code.
# creating a dataframe with just the temperatures for below 0.2
# to try and aid control when plotting
temp.below <- temp.regions[which(temp.regions$ecdf<0.2),]
# plotting routine again.
ggplot(temp.regions, aes(x=temp, y=ecdf, color = region)) +
geom_line(aes(x=temp,color=region))+
scale_colour_manual(values = c("blue","red"))+
# with additional line for shading.
geom_ribbon(data=temp.below,
aes(x=temp,ymin=0,ymax=0.2), alpha=0.5)
I've seen a few examples of people shading for a normal distribution density plot, which is where I have adapted my code from. But for some reason my boxes don't seem to want anything to do with the temperature curve.
Please help! I'm sure it's quite simple, I'm just really lost and have tried a few, producing less convincing results than these.
Thank you so much for taking a look.
PROBLEM SOLVED THANKS TO HELP BELOW...
running suggested code from below
geom_ribbon(aes(ymin=0,ymax=ecdf, fill=region), alpha=0.5)
gives...
which is so very almost the solution I'm after, but with one final addition... like so
#geom_ribbon(aes(ymin=0,ymax=ecdf, fill=region), alpha=0.5)
geom_ribbon(data=temp.below, aes(ymin=0,ymax=ecdf, fill=region), alpha=0.5)
I get what I'm after...
The reason I set the data again is so that it only fills the lowest 20% of the two regions.
Thank you so much for the help :-)
Looks like you're thinking about it in the right way.
With geom_ribbon i dont think you need to set data to anything else. Just set aes(ymin = 0, ymax = ecdf, fill = region). I think that should do it.

Calculation of density estimate in density2d?

I have a more general question regarding the principle behind density2d.
I'm using ggplot and the density2d function to visualize animal movements. My idea was calculating heat maps showing where the animal is most of the time and/or to identify areas of particular interest. Yet, the density2d function sometimes generates rather inexplicable plots.
Here's what I mean:
set.seed(4)
x<-runif(50,1,599)
y<-runif(50,1,599)
df<-data.table(x,y)
ggplot(df,aes(x=x,y=y))
+stat_density2d(aes(x=x,y=y,fill=..level..,alpha=..level..),bins=50,geom="polygon")
+coord_equal(xlim=c(0,600),ylim=c(0,600))
+expand_limits(x=c(0,600),y=c(0,600))
+geom_path()
which looks like this:
There are areas with a density estimate but without data (around x:50, y:300).
Now compare with this:
set.seed(13)
x<-runif(50,1,599)
y<-runif(50,1,599)
df<-data.table(x,y)
ggplot(df,aes(x=x,y=y))
+stat_density2d(aes(x=x,y=y,fill=..level..,alpha=..level..),bins=50,geom="polygon")
+coord_equal(xlim=c(0,600),ylim=c(0,600))
+expand_limits(x=c(0,600),y=c(0,600))
+geom_path()
which looks like this:
Here there are regions "wihtout" a density estimate but with actual data (around x:100,y:550).
Someone asked a related question:
Create heatmap with distribution of attribute values in R (not density heatmap)
but there are no satisfactory answers to be found.
So my question would be (i) Why? and (ii) How to avoid/adjust if possible?
This may be helpful. I am not that familiar with stat_density2d. After seeing your code and ggplot documents (http://docs.ggplot2.org/0.9.2.1/stat_density2d.html), I thought ..level.. might not be the one. I, then, tried ..density.. Someone will be able to explain why you need density meanwhile I think this is the graph you wanted.
ggplot(data = df, aes(x = x, y = y)) +
stat_density2d(geom="tile", aes(fill = ..density..), contour = FALSE) +
geom_path() +
coord_equal(xlim=c(0,600),ylim=c(0,600)) +
expand_limits(x=c(0,600),y=c(0,600))

selecting only some facets to print in facet_wrap, ggplot2

my question is very simple, but I have failed to solve it after many attempts. I just want to print some facets of a facetted plot (made with facet_wrap in ggplot2), and remove the ones I am no interested in.
I have facet_wrap with ggplot2, as follows:
#anomalies linear trends
an.trends <- ggplot()+
geom_smooth(method="lm", data=tndvilong.anomalies, aes(x=year, y=NDVIan, colour=TenureZone,
group=TenureZone))+
scale_color_manual(values=miscol) +
ggtitle("anomalies' trends")
#anomalies linear trends by VEG
an.trendsVEG <- an.trends + facet_wrap(~VEG,ncol=2)
print(an.trendsVEG)
And I get the plot as I expected (you can see it in te link below):
anomalies' trends by VEG
The question is: how do I get printed only the facest I am interested on?
I only want to print "CenKal_ShWoodl", "HlShl_ShDens", "NKal_ShWoodl", and "ThShl_ShDens"
Thanks
I suggest the easiest way to do that is to simply give ggplot() an appropriate subset. In this case:
facets <- c("CenKal_ShWoodl", "HlShl_ShDens", "NKal_ShWoodl", "ThShl_ShDens")
an.trends.sub <- ggplot(tndvilong.anomalies[tndvilong.anomalies$VEG %in% facets,])+
geom_smooth(method="lm" aes(x=year, y=NDVIan, colour=TenureZone,
group=TenureZone))+
scale_color_manual(values=miscol) +
ggtitle("anomalies' trends") +
facet_wrap(~VEG,ncol=2)
Obviously without your data I can't be sure this will give you what you want, but based on your description, it should work. I find that with ggplot, it is generally best to pass it the data you want plotted, rather than finding ways of changing the plot itself.

How can I make a legend in ggplot2 with one point entry and one line entry?

I am making a graph in ggplot2 consisting of a set of datapoints plotted as points, with the lines predicted by a fitted model overlaid. The general idea of the graph looks something like this:
names <- c(1,1,1,2,2,2,3,3,3)
xvals <- c(1:9)
yvals <- c(1,2,3,10,11,12,15,16,17)
pvals <- c(1.1,2.1,3.1,11,12,13,14,15,16)
ex_data <- data.frame(names,xvals,yvals,pvals)
ex_data$names <- factor(ex_data$names)
graph <- ggplot(data=ex_data, aes(x=xvals, y=yvals, color=names))
print(graph + geom_point() + geom_line(aes(x=xvals, y=pvals)))
As you can see, both the lines and the points are colored by a categorical variable ('names' in this case). I would like the legend to contain 2 entries: a dot labeled 'Data', and a line labeled 'Fitted' (to denote that the dots are real data and the lines are fits). However, I cannot seem to get this to work. The (awesome) guide here is great for formatting, but doesn't deal with the actual entries, while I have tried the technique here to no avail, i.e.
print(graph + scale_colour_manual("", values=c("green", "blue", "red"))
+ scale_shape_manual("", values=c(19,NA,NA))
+ scale_linetype_manual("",values=c(0,1,1)))
The main trouble is that, in my actual data, there are >200 different categories for 'names,' while I only want the 2 entries I mentioned above in the legend. Doing this with my actual data just produces a meaningless legend that runs off the page, because the legend is trying to be a key for the colors (of which I have way too many).
I'd appreciate any help!
I think this is close to what you want:
ggplot(ex_data, aes(x=xvals, group=names)) +
geom_point(aes(y=yvals, shape='data', linetype='data')) +
geom_line(aes(y=pvals, shape='fitted', linetype='fitted')) +
scale_shape_manual('', values=c(19, NA)) +
scale_linetype_manual('', values=c(0, 1))
The idea is that you specify two aesthetics (linetype and shape) for both lines and points, even though it makes no sense, say, for a point to have a linetype aesthetic. Then you manually map these "nonsense" aesthetics to "null" values (NA and 0 in this case), using a manual scale.
This has been answered already, but based on feedback I got to another question (How can I fix this strange behavior of legend in ggplot2?) this tweak may be helpful to others and may save you headaches (sorry couldn't put as a comment to the previous answer):
ggplot(ex_data, aes(x=xvals, group=names)) +
geom_point(aes(y=yvals, shape='data', linetype='data')) +
geom_line(aes(y=pvals, shape='fitted', linetype='fitted')) +
scale_shape_manual('', values=c('data'=19, 'fitted'=NA)) +
scale_linetype_manual('', values=c('data'=0, 'fitted'=1))

Resources