Trying to better understand the models I've created with ggplot2 - r

I was pointed toward ggplot2 as an effective tool for data visualization and it's been tremendously helpful, but I'm trying to fully understand exactly what I've done and I'm having a bit of trouble finding the proper resources to tell me.
library(ggplot2)
bone <- read.csv('/Users/kylehammerberg/Desktop/ML Extra Credit/spnbmd.csv')
### Generate scatter plot of data
ggplot(bone) + aes(age, spnbmd, color=sex) + geom_point()
### Fit splines to both male and female bone density data
### geom_point to create scatter plot
### geom_smooth to fit splines
ggplot(bone) + aes(age, spnbmd, color=sex) + geom_point() +
geom_smooth(formula = y~splines::bs(x, knots=c(10,15,20)), method="lm")
### Fit splines to bone density by race
ggplot(bone) + aes(age, spnbmd, color=ethnic) + geom_point() +
geom_smooth(formula = y~splines::bs(x, knots=c(10,15,20)), method="lm")
I don't know exactly what the bs() part of the code is doing and I want to better understand the shaded regions around the generated splines. Are they some sort of confidence interval?

They are.
The help function in R is, well, helpful.
help(geom_smooth)
help("bs", package = "splines")
Although it does not state outright what the shaded region is, the geom_smooth manual page says this about the se argument:
se: Display confidence interval around smooth? (‘TRUE’ by
default, see ‘level’ to control.)
Also, for this kind of illustration it's pretty much a given that the shaded region is some sort of uncertainty representation.
The manual page for splines::bs will tell you that it creates a B-spline basis. Explaining what that is and isn't is beyond the scope of this site. You will have better luck at stats.stackexchange.com, and for that matter Wikipedia:
https://stats.stackexchange.com/search?q=B-spline
https://en.wikipedia.org/wiki/B-spline
But suffice it to say, they construct the solid lines that you see: smooth piecewise-polynomial curves fitted by least squares, which behave in a way like a running average of the data behind them. In your case this is done separately for each group, as per the color you specify.
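To make that a bit more concrete, here is a minimal sketch of what I understand geom_smooth(method="lm") to be doing with that formula: bs() expands x into a matrix of basis columns, lm() fits them by least squares, and the shaded band is the pointwise confidence interval for the fitted mean. The data here are simulated stand-ins (the real spnbmd.csv is not available to me), so the variable values are assumptions for illustration only.

```r
library(splines)

set.seed(1)
# Hypothetical stand-in for the bone data: age in years and a noisy response
age <- runif(200, 9, 25)
spnbmd <- 0.1 * exp(-(age - 12)^2 / 20) + rnorm(200, sd = 0.02)

# bs() turns age into cubic B-spline basis columns:
# 3 interior knots + degree 3 = 6 columns
basis <- bs(age, knots = c(10, 15, 20))
dim(basis)

# Ordinary least squares on those basis columns
fit <- lm(spnbmd ~ bs(age, knots = c(10, 15, 20)))

# Fitted curve plus a pointwise 95% confidence interval for the mean
grid <- data.frame(age = seq(min(age), max(age), length.out = 50))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
head(band)
```

The lwr and upr columns should correspond to the edges of the shaded region, and the level argument of geom_smooth changes the 0.95.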

Related

aes parameter to anchor start and end points for ggplot geom_smooth regression (loess)?

Is there a parameter that can anchor start and end points for a loess geom_smooth regression? If I increase the span (so that the regression isn't too wiggly), the starting and ending points seem to be drastically different (I have multiple lines on a graph, using as.factor) when in reality they are not (quite close together). I can't share my data as it is for confidential academic research, and I'm not sure how to reproduce an example for this... just wondering if this is possible with ggplot.
Here are some pictures that illustrate the problem, though...
Low span (span = 0.1), just the first 10 out of the 750 points to be graphed --> with this you can see the true starting points:
And then with the high span (span = 1.0), and all 750 points, the starting value and ending values are completely different. I'm not sure why this happens, but it is very misleading:
Basically, I want the smoothness of the second picture, but the specific and accurate starting points of the first when I graph all of the data (i.e., all 750 points). Let me know if there's any way to do this. Thanks for all your help.
Without seeing your code, I can already tell that you're setting your axis limits for the "span = 1.0" version using xlim(0,10) or scale_x_continuous(limits=c(0,10)) - is that correct? Change it to the following:
coord_cartesian(xlim = c(0, 10))
This is because xlim() (which is just a wrapper for scale_x_continuous(limits=...)) does not just zoom in on your data, but in fact discards any of the data outside of those limits before performing any calculations. Check the documentation on xlim() and the documentation on coord_cartesian() for more info.
It's easy to see how this is working using the following example:
# create dataset
set.seed(8675309)
df <- data.frame(x=1:1000, y=rnorm(1000))
# basic plot
p <- ggplot(df, aes(x,y)) + theme_bw() +
geom_point(color='gray75', size=1) + geom_smooth()
p
We get a basic plot, and as we expect, the result of geom_smooth() on this dataset is a straight line parallel to the x axis at y=0.
If we use xlim() or scale_x_continuous(limits=...) to see the first 10 points, you see that the geom_smooth() line is not the same:
p + xlim(0,10)
# or this one... results in the same plot
p + scale_x_continuous(limits=c(0,10))
The resulting line has a much wider confidence band and sits a bit above y=0, since the first 10 points happen to be just a bit above the average of the other 990 points. If you use coord_cartesian(xlim=...), the zooming in of the plot happens after the calculations are made and no points are discarded, giving you the same points plotted, but with a geom_smooth() line that matches that of the full dataset:
p + coord_cartesian(xlim=c(0,10))
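If you want to check this without looking at the plots, one rough way is to pull the fitted curve out of each version with layer_data() and compare the x range it covers. This is only a sketch: method = "lm" is my substitution here to keep the check quick and deterministic, not what the plots above use.

```r
library(ggplot2)

set.seed(8675309)
df <- data.frame(x = 1:1000, y = rnorm(1000))

# method = "lm" is used here only to keep the check fast and deterministic
p <- ggplot(df, aes(x, y)) + geom_smooth(method = "lm", formula = y ~ x)

# layer_data() returns the fitted curve that the smooth layer would draw
fit_scale <- suppressWarnings(layer_data(p + xlim(0, 10)))
fit_coord <- layer_data(p + coord_cartesian(xlim = c(0, 10)))

range(fit_scale$x)  # the fit only covers the rows that survived the scale limits
range(fit_coord$x)  # the fit covers the full data; only the view is zoomed
```

The warning suppressed in the first call is ggplot2 reporting the rows it dropped, which is itself a useful hint that the scale limits are discarding data.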

Creating nice overlayed histogram in R with ggplot

I'm hoping to get some help on making the following histogram look as nice and understandable as possible. I am plotting the salaries of Immigrant versus US Born workers. I am wondering:
1. How would you modify colors, axis intervals, etc. to make the graph more clear/appealing?
2. How could I add a key to indicate purple is for US born workers, and pink is for foreign born?
3. How can I add two different lines to indicate the median of each group? And a corresponding label for each?
My current code is set up as this:
ggplot(NHIS1,aes(x=adj_SALARY, y=..density..)) +
geom_histogram(data=subset(NHIS1,IMMIGRANT=='0'), alpha=.5,binwidth=800, fill="purple",position="identity") + xlim(4430.4,50000) +
geom_vline(xintercept=median(NHIS1$adj_SALARY), col="black", linetype="dashed") +
geom_histogram(data=subset(NHIS1,IMMIGRANT=='1'), alpha=.5,binwidth=800,fill="red") + xlim(4430.4,50000) +
geom_vline(xintercept=median(NHIS1$adj_SALARY), col="black", linetype="dashed")
And my final histogram at the moment appears as this:
If you have two variables, one for income and one for immigrant status, you do not need to plot two histograms; one will suffice if you specify the grouping. I'd also suggest adding density lines, which help smooth over the histogram's bumps:
Assuming this is roughly like your data:
df <- data.frame(income = sample(1000:5000, 1000),
born = sample(c("US", "Foreign"), 1000, replace = TRUE))
Then a crude way to plot one histogram as well as density lines for the two groups would be this:
ggplot(df, aes(x=income, color=born, fill=born)) +
geom_histogram(aes(y=..density..), alpha=0.5, binwidth=100,
position="identity") +
geom_density(alpha=.2)
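For the median lines asked about in point 3, one possible sketch is to compute one median per group first and hand that small data frame to geom_vline(). The data frame and column names below are the made-up ones from this answer, not your NHIS1 columns, and I am using after_stat(density), which assumes a reasonably recent ggplot2 (3.3.0 or later).

```r
library(ggplot2)

set.seed(42)
# Hypothetical data in the same shape as the example in this answer
df <- data.frame(income = sample(1000:5000, 1000),
                 born = sample(c("US", "Foreign"), 1000, replace = TRUE))

# One row per group holding that group's median income
medians <- aggregate(income ~ born, data = df, FUN = median)

p <- ggplot(df, aes(x = income, color = born, fill = born)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5,
                 binwidth = 100, position = "identity") +
  # geom_vline() does not inherit the plot aesthetics, so map color here
  geom_vline(data = medians, aes(xintercept = income, color = born),
             linetype = "dashed") +
  # Label each line near the top of the panel
  geom_text(data = medians,
            aes(x = income, y = Inf, label = paste("median:", born)),
            vjust = 1.5, hjust = -0.05, show.legend = FALSE)
p
```

Mapping fill and color to the grouping variable is also what produces the legend asked about in point 2.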
This question has been asked before: overlaying-histograms-with-ggplot2-in-r discusses several options with many examples. You should definitely take a look at it.
Another option to compare the distributions could be violin plots using geom_violin(). I see violin plots as the better option when you need to compare distributions because they give you more flexibility and are still clearer. But that may be just me. Refer to the examples in the manual.

Filling cross over under a Cumulative Frequency plot using ggplot in R

I am trying to plot two Cumulative Frequency curves in ggplot, and shade the cross over at a certain cut off. I haven't been using ggplot for long, so I was hoping someone might be able to help me with this one.
The plot without filled regions, looks like this...
Which I have created using the following code...
library(ggplot2) # required
library(plyr) # required for ddply()
north <- rnorm(3060, mean=277,sd=3.01) # to create synthetic data
south <- rnorm(3060, mean=278, sd=3.26) # in place of my real data.
#placing in dataframe
df_temp <- data.frame(temp=c(north,south),
region=c(rep("north",length=3060),rep("south",length=3060)))
#manipulating into cdf, as I've seen in other examples
temp.regions <- ddply(df_temp, .(region), summarize,
temp = unique(temp),
ecdf = ecdf(temp)(unique(temp)))
# feeding into ggplot.
ggplot(temp.regions,aes(x=temp, y=ecdf, color = region)) +
geom_line(aes(x=temp,color=region))+
scale_colour_manual(values = c("blue","red"))
What I would then like, would be to shade both curves for temperatures below 0.2 on the y axis. Ideally I'd like to see the blue one shaded in blue, and the red one shaded in red. Then, where they cross over in purple.
However, the closest I have managed is as follows...
Which I have achieved using the following additions to my code.
# creating a dataframe with just the temperatures for below 0.2
# to try and aid control when plotting
temp.below <- temp.regions[which(temp.regions$ecdf<0.2),]
# plotting routine again.
ggplot(temp.regions, aes(x=temp, y=ecdf, color = region)) +
geom_line(aes(x=temp,color=region))+
scale_colour_manual(values = c("blue","red"))+
# with additional line for shading.
geom_ribbon(data=temp.below,
aes(x=temp,ymin=0,ymax=0.2), alpha=0.5)
I've seen a few examples of people shading for a normal distribution density plot, which is where I have adapted my code from. But for some reason my boxes don't seem to want anything to do with the temperature curve.
Please help! I'm sure it's quite simple, I'm just really lost and have tried a few, producing less convincing results than these.
Thank you so much for taking a look.
PROBLEM SOLVED THANKS TO HELP BELOW...
running suggested code from below
geom_ribbon(aes(ymin=0,ymax=ecdf, fill=region), alpha=0.5)
gives...
which is so very almost the solution I'm after, but with one final addition... like so
#geom_ribbon(aes(ymin=0,ymax=ecdf, fill=region), alpha=0.5)
geom_ribbon(data=temp.below, aes(ymin=0,ymax=ecdf, fill=region), alpha=0.5)
I get what I'm after...
The reason I set the data again is so that it only fills the lowest 20% of the two regions.
Thank you so much for the help :-)
Looks like you're thinking about it in the right way.
With geom_ribbon I don't think you need to set data to anything else. Just set aes(ymin = 0, ymax = ecdf, fill = region). I think that should do it.
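Putting the pieces of this thread together, a self-contained sketch might look like the following. It uses the same synthetic data as the question, but computes the ECDF with base R's ave() instead of plyr (my substitution, to avoid the extra dependency), and the scale_fill_manual() line is an assumption about wanting the fill colours to match the lines.

```r
library(ggplot2)

set.seed(1)
# Synthetic temperatures, as in the question
df_temp <- data.frame(
  temp = c(rnorm(3060, mean = 277, sd = 3.01),
           rnorm(3060, mean = 278, sd = 3.26)),
  region = rep(c("north", "south"), each = 3060)
)

# ECDF value of every observation within its own region
df_temp$ecdf <- ave(df_temp$temp, df_temp$region,
                    FUN = function(v) ecdf(v)(v))

# Keep only the lowest 20% of each curve for the shaded part
temp.below <- df_temp[df_temp$ecdf < 0.2, ]

p <- ggplot(df_temp, aes(x = temp, y = ecdf, color = region)) +
  geom_line() +
  # ymax follows the curve, so the fill sits under each line
  geom_ribbon(data = temp.below,
              aes(ymin = 0, ymax = ecdf, fill = region), alpha = 0.5) +
  scale_colour_manual(values = c("blue", "red")) +
  scale_fill_manual(values = c("blue", "red"))
p
```

Because both ribbons are semi-transparent, the overlap of blue and red shows up as purple without any extra work.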

Calculation of density estimate in density2d?

I have a more general question regarding the principle behind density2d.
I'm using ggplot and the density2d function to visualize animal movements. My idea was calculating heat maps showing where the animal is most of the time and/or to identify areas of particular interest. Yet, the density2d function sometimes generates rather inexplicable plots.
Here's what I mean:
library(ggplot2)
library(data.table)
set.seed(4)
x<-runif(50,1,599)
y<-runif(50,1,599)
df<-data.table(x,y)
# the + must end each line, otherwise R stops parsing after ggplot(...)
ggplot(df,aes(x=x,y=y)) +
stat_density2d(aes(x=x,y=y,fill=..level..,alpha=..level..),bins=50,geom="polygon") +
coord_equal(xlim=c(0,600),ylim=c(0,600)) +
expand_limits(x=c(0,600),y=c(0,600)) +
geom_path()
which looks like this:
There are areas with a density estimate but without data (around x:50, y:300).
Now compare with this:
set.seed(13)
x<-runif(50,1,599)
y<-runif(50,1,599)
df<-data.table(x,y)
ggplot(df,aes(x=x,y=y)) +
stat_density2d(aes(x=x,y=y,fill=..level..,alpha=..level..),bins=50,geom="polygon") +
coord_equal(xlim=c(0,600),ylim=c(0,600)) +
expand_limits(x=c(0,600),y=c(0,600)) +
geom_path()
which looks like this:
Here there are regions "without" a density estimate but with actual data (around x:100,y:550).
Someone asked a related question:
Create heatmap with distribution of attribute values in R (not density heatmap)
but there are no satisfactory answers to be found.
So my question would be (i) Why? and (ii) How to avoid/adjust if possible?
This may be helpful. I am not that familiar with stat_density2d, but after looking at your code and the ggplot2 documentation (http://docs.ggplot2.org/0.9.2.1/stat_density2d.html), I thought ..level.. might not be the right variable, so I tried ..density.. instead. Someone else may be able to explain why you need density; meanwhile, I think this is the graph you wanted.
ggplot(data = df, aes(x = x, y = y)) +
stat_density2d(geom="tile", aes(fill = ..density..), contour = FALSE) +
geom_path() +
coord_equal(xlim=c(0,600),ylim=c(0,600)) +
expand_limits(x=c(0,600),y=c(0,600))
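As for the "why", my understanding (not something I have confirmed against the ggplot2 source for your version) is that stat_density2d relies on a 2D kernel density estimate such as MASS::kde2d. A kernel estimate spreads every point out over a neighbourhood, so it is positive in places with no data, and contour polygons are only drawn above the lowest contour level, so isolated points can fall outside every polygon. The sketch below uses kde2d directly on the data from the question; the narrower bandwidth of one third of the default is an arbitrary choice for illustration.

```r
library(MASS)

set.seed(4)
x <- runif(50, 1, 599)
y <- runif(50, 1, 599)

# Default bandwidth: a smooth surface evaluated on a 100 x 100 grid
dens <- kde2d(x, y, n = 100, lims = c(0, 600, 0, 600))

# The estimate is positive in every grid cell, including cells
# that contain no observation at all
min(dens$z) > 0

# A narrower bandwidth hugs the observations more closely
h_default <- c(bandwidth.nrd(x), bandwidth.nrd(y))
dens_narrow <- kde2d(x, y, h = h_default / 3, n = 100,
                     lims = c(0, 600, 0, 600))

# Sharper peaks around the points compared with the default
max(dens_narrow$z) > max(dens$z)
```

If the default surface looks too smeared for movement data, passing a smaller h to stat_density2d (it is forwarded to the density estimator) may be worth trying.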

selecting only some facets to print in facet_wrap, ggplot2

My question is very simple, but I have failed to solve it after many attempts. I just want to print some facets of a faceted plot (made with facet_wrap in ggplot2), and remove the ones I am not interested in.
I have facet_wrap with ggplot2, as follows:
#anomalies linear trends
an.trends <- ggplot()+
geom_smooth(method="lm", data=tndvilong.anomalies, aes(x=year, y=NDVIan, colour=TenureZone,
group=TenureZone))+
scale_color_manual(values=miscol) +
ggtitle("anomalies' trends")
#anomalies linear trends by VEG
an.trendsVEG <- an.trends + facet_wrap(~VEG,ncol=2)
print(an.trendsVEG)
And I get the plot as I expected (you can see it in the link below):
anomalies' trends by VEG
The question is: how do I print only the facets I am interested in?
I only want to print "CenKal_ShWoodl", "HlShl_ShDens", "NKal_ShWoodl", and "ThShl_ShDens"
Thanks
I suggest the easiest way to do that is to simply give ggplot() an appropriate subset. In this case:
facets <- c("CenKal_ShWoodl", "HlShl_ShDens", "NKal_ShWoodl", "ThShl_ShDens")
an.trends.sub <- ggplot(tndvilong.anomalies[tndvilong.anomalies$VEG %in% facets,])+
geom_smooth(method="lm", aes(x=year, y=NDVIan, colour=TenureZone,
group=TenureZone))+
scale_color_manual(values=miscol) +
ggtitle("anomalies' trends") +
facet_wrap(~VEG,ncol=2)
Obviously without your data I can't be sure this will give you what you want, but based on your description, it should work. I find that with ggplot, it is generally best to pass it the data you want plotted, rather than finding ways of changing the plot itself.
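Since I can't run this on your data, here is the same idea as a small reproducible sketch on the mpg dataset that ships with ggplot2. The choice of classes to keep is arbitrary and only stands in for your VEG values.

```r
library(ggplot2)

# Facet values to keep (arbitrary picks standing in for the VEG levels)
keep <- c("compact", "suv")

# Subset the data before it reaches ggplot()
mpg_sub <- mpg[mpg$class %in% keep, ]

p <- ggplot(mpg_sub, aes(x = displ, y = hwy, colour = drv, group = drv)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  facet_wrap(~class, ncol = 2)
print(p)
```

Only the facets present in the subset are drawn, so there is nothing to remove from the plot afterwards.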

Resources