I have a more general question regarding the principle behind density2d.
I'm using ggplot and the density2d function to visualize animal movements. My idea was to calculate heat maps showing where the animal spends most of its time and/or to identify areas of particular interest. Yet the density2d function sometimes generates rather inexplicable plots.
Here's what I mean:
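library(ggplot2)
library(data.table)

set.seed(4)
x <- runif(50, 1, 599)
y <- runif(50, 1, 599)
df <- data.table(x, y)
# each `+` must end a line; a line that starts with `+` is not added to the plot
ggplot(df, aes(x = x, y = y)) +
  stat_density2d(aes(x = x, y = y, fill = ..level.., alpha = ..level..), bins = 50, geom = "polygon") +
  coord_equal(xlim = c(0, 600), ylim = c(0, 600)) +
  expand_limits(x = c(0, 600), y = c(0, 600)) +
  geom_path()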
which looks like this:
There are areas with a density estimate but without data (around x:50, y:300).
Now compare with this:
set.seed(13)
x <- runif(50, 1, 599)
y <- runif(50, 1, 599)
df <- data.table(x, y)
ggplot(df, aes(x = x, y = y)) +
  stat_density2d(aes(x = x, y = y, fill = ..level.., alpha = ..level..), bins = 50, geom = "polygon") +
  coord_equal(xlim = c(0, 600), ylim = c(0, 600)) +
  expand_limits(x = c(0, 600), y = c(0, 600)) +
  geom_path()
which looks like this:
Here there are regions "without" a density estimate but with actual data (around x:100, y:550).
Someone asked a related question:
Create heatmap with distribution of attribute values in R (not density heatmap)
but there are no satisfactory answers to be found.
So my question would be (i) Why? and (ii) How to avoid/adjust if possible?
This may be helpful. I am not that familiar with stat_density2d. After looking at your code and the ggplot2 documentation (http://docs.ggplot2.org/0.9.2.1/stat_density2d.html), I suspected that ..level.. might not be the right variable, so I tried ..density.. instead. Someone else may be able to explain why ..density.. is needed; meanwhile, I think this is the graph you wanted.
ggplot(data = df, aes(x = x, y = y)) +
  stat_density2d(geom = "tile", aes(fill = ..density..), contour = FALSE) +
  geom_path() +
  coord_equal(xlim = c(0, 600), ylim = c(0, 600)) +
  expand_limits(x = c(0, 600), y = c(0, 600))
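As for the "why": according to the ggplot2 documentation, stat_density2d runs a 2D kernel density estimate via MASS::kde2d(), so every point is smeared out by a Gaussian kernel. Here is a minimal sketch of that step, using the seed-4 data from the question. The grid size n = 100 and the 0 to 600 limits are assumptions chosen for illustration, not necessarily what the plot itself used.

```r
library(MASS)

set.seed(4)
x <- runif(50, 1, 599)
y <- runif(50, 1, 599)

# Rule-of-thumb bandwidths; kde2d falls back to these when no h is supplied
h <- c(bandwidth.nrd(x), bandwidth.nrd(y))

# Evaluate the smoothed density on a 100 x 100 grid over the plot area
# (n and lims are illustrative assumptions)
dens <- kde2d(x, y, h = h, n = 100, lims = c(0, 600, 0, 600))

# The estimate is positive everywhere, including cells with no points nearby
range(dens$z)
```

Because the estimate is smooth, a gap between points can still sit above a contour level, and an isolated point in a sparse region can sit below the lowest one. Passing a smaller bandwidth, for example stat_density2d(h = c(100, 100), ...), should make the contours follow the data more closely; the value 100 is only an illustrative guess.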
I was pointed toward ggplot2 as an effective tool for data visualization and it's been tremendously helpful, but I'm trying to fully understand exactly what I've done and I'm having a bit of trouble finding the proper resources to tell me.
library(ggplot2)
bone <- read.csv('/Users/kylehammerberg/Desktop/ML Extra Credit/spnbmd.csv')
### Generate scatter plot of data
ggplot(bone) + aes(age, spnbmd, color=sex) + geom_point()
### Fit splines to both male and female bone density data
### geom_point to create scatter plot
### geom_smooth to fit splines
ggplot(bone) + aes(age, spnbmd, color=sex) + geom_point() +
geom_smooth(formula = y~splines::bs(x, knots=c(10,15,20)), method="lm")
### Fit splines to bone density by race
ggplot(bone) + aes(age, spnbmd, color=ethnic) + geom_point() +
geom_smooth(formula = y~splines::bs(x, knots=c(10,15,20)), method="lm")
I don't know exactly what the bs() part of the code is doing and I want to better understand the shaded regions around the generated splines. Are they some sort of confidence interval?
They are.
The help function in R is, well, helpful.
help(geom_smooth)
help("bs", package = "splines")
Although it does not state outright what the shaded region is, the geom_smooth manual page says this about the se argument:
se: Display confidence interval around smooth? (‘TRUE’ by
default, see ‘level’ to control.)
Also for this kind of illustration it's pretty much a given that the shaded region is in fact some sort of uncertainty representation.
The manual page for splines::bs will tell you that it creates a B-spline basis. To say what that is and isn't is beyond the scope of this site. You will have better luck at stats.stackexchange.com, and for that matter Wikipedia:
https://stats.stackexchange.com/search?q=B-spline
https://en.wikipedia.org/wiki/B-spline
But suffice it to say that they construct the solid lines you see, which are in a way a sort of running average of the data behind them. In your case this is done separately for each group, according to the color you specify.
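To make both pieces concrete, here is a minimal sketch on made-up data (the age and spnbmd values below are simulated assumptions, not the asker's spnbmd.csv). It shows the basis matrix that bs() builds and the pointwise confidence band from the same kind of lm fit that geom_smooth(method = "lm") relies on.

```r
library(splines)

set.seed(1)
# Simulated stand-in for the bone density data (assumption)
age <- runif(200, 9, 25)
spnbmd <- 0.1 * exp(-((age - 13)^2) / 10) + rnorm(200, sd = 0.02)

# bs() expands age into cubic B-spline basis columns:
# 3 interior knots + degree 3 = 6 columns
basis <- bs(age, knots = c(10, 15, 20))
dim(basis)

# The "spline fit" is an ordinary linear model on those columns
fit <- lm(spnbmd ~ bs(age, knots = c(10, 15, 20)))

# Pointwise 95% confidence band for the fitted mean curve
new_age <- data.frame(age = seq(10, 24, by = 1))
band <- predict(fit, newdata = new_age, interval = "confidence", level = 0.95)
head(band)  # columns: fit, lwr, upr
```

The lwr and upr columns correspond to the edges of the shaded region, and fit to the solid line.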
I'm hoping to get some help on making the following histogram look as nice and understandable as possible. I am plotting the salaries of Immigrant versus US Born workers. I am wondering:
1. How would you modify colors, axis intervals, etc. to make the graph more clear/appealing?
2. How could I add a key to indicate purple is for US born workers, and pink is for foreign born?
3. How can I add two different lines to indicate the median of each group? And a corresponding label for each?
My current code is set up as this:
ggplot(NHIS1, aes(x=adj_SALARY, y=..density..)) +
  geom_histogram(data=subset(NHIS1,IMMIGRANT=='0'), alpha=.5, binwidth=800, fill="purple", position="identity") +
  geom_vline(xintercept=median(NHIS1$adj_SALARY), col="black", linetype="dashed") +
  geom_histogram(data=subset(NHIS1,IMMIGRANT=='1'), alpha=.5, binwidth=800, fill="red") +
  geom_vline(xintercept=median(NHIS1$adj_SALARY), col="black", linetype="dashed") +
  xlim(4430.4,50000)
And my final histogram at the moment appears as this:
If you have two variables, one for income and one for immigrant status, you do not need to plot two histograms; one will suffice if you specify the grouping. I'd also suggest adding density lines, which help smooth over the histogram's bumps.
Assuming this is roughly like your data:
df <- data.frame(income = sample(1000:5000, 1000),
                 born = sample(c("US", "Foreign"), 1000, replace = TRUE))
Then a crude way to plot one histogram as well as density lines for the two groups would be this:
ggplot(df, aes(x=income, color=born, fill=born)) +
  geom_histogram(aes(y=..density..), alpha=0.5, binwidth=100,
                 position="identity") +
  geom_density(alpha=.2)
This question has been asked before: overlaying-histograms-with-ggplot2-in-r discusses several options with many examples. You should definitely take a look at it.
Another option to compare the distributions could be violin plots using geom_violin(). I see violin plots as the better option when you need to compare distributions because they give you more flexibility and are still clearer. But that may be just me. Refer to the examples in the manual.
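For the per-group median lines asked about in point 3, one possible approach is to compute the medians into a small data frame and hand that to geom_vline. This is a sketch on the same made-up df as above; the med data frame and the label text are hypothetical names introduced here, and after_stat(density) is the newer spelling of ..density.. (it needs ggplot2 3.3.0 or later).

```r
library(ggplot2)

set.seed(42)
# Same simulated data as in the answer above (assumption, not the real NHIS1)
df <- data.frame(income = sample(1000:5000, 1000),
                 born = sample(c("US", "Foreign"), 1000, replace = TRUE))

# One median per group, as a two-row data frame
med <- aggregate(income ~ born, data = df, FUN = median)

p <- ggplot(df, aes(x = income, color = born, fill = born)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5,
                 binwidth = 100, position = "identity") +
  # one dashed line per group, colored like its histogram
  geom_vline(data = med, aes(xintercept = income, color = born),
             linetype = "dashed") +
  # label each line at the top of the panel
  geom_text(data = med,
            aes(x = income, y = Inf, label = paste("median:", income)),
            vjust = 1.5, hjust = -0.1, show.legend = FALSE)
p
```

Mapping fill and color to born also produces the legend (the "key" from point 2) automatically; scale_fill_manual() can then set the purple and pink colors.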
I have a dataset myData which contains x and y values for various Samples. I can create a line plot for a dataset which contains a few Samples with the following pseudocode, and it is a good way to represent this data:
myData <- data.frame(x = 290:450, X52241 = c(..., ..., ...), X75123 = c(..., ..., ...))
myData <- myData %>% gather(Sample, y, -x)
ggplot(myData, aes(x, y)) + geom_line(aes(color=Sample))
Which generates:
This turns into a Spaghetti Plot when I have a lot more Samples added, which makes the information hard to understand, so I want to represent the "hills" of each sample in another way. Preferably, I would like to represent the data as a series of stacked bars, one for each myData$Sample, with transparency inversely related to what is in myData$y. I've tried to represent that data in photoshop (badly) here:
Is there a way to do this? Creating faceted plots using facet_wrap() or facet_grid() doesn't give me what I want (far too many Samples). I would also be open to stacked ridgeline plots using ggridges, but I am not understanding how I would be able to convert absolute values to a stat(density) value needed to plot those.
Any suggestions?
Thanks to u/Joris for the helpful suggestion! Since I did not find this question elsewhere, I'll go ahead and post the pretty simple solution to my question here for others to find.
Basically, I needed to apply the alpha aesthetic via aes(alpha=y, ...). In theory, I could apply this over any geom. I tried geom_col(), which worked, but the best solution was to use geom_segment(), since all my "bars" were going to be the same length. Also note that I had to "slice" up the segments in order to avoid the problem of overplotting similar to those found here, here, and here.
ggplot(myData, aes(x, Sample)) +
geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=y), color='blue3', size=14)
That gives us the nice gradient:
Since the max y values are not the same for both lines, I normalized the data (myDataNorm) to "match" the intensities and made the same plot. In my particular case, I kind of preferred bars that did not have a gradient, but which showed a hard edge for the maximum values of y. Here was one solution:
ggplot(myDataNorm, aes(x, Sample)) +
geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=ifelse(y>0.9,1,0)), color='blue3', size=14) +
theme(legend.position='none')
Better, but I did not like the faint-colored areas that were left. The final code is what gave me something that perfectly captured what I was looking for. I simply moved the ifelse() statement to apply to the x aesthetic, so the parts of the segment drawn were only those with high enough y values. Note my data "starts" at x=290 here. Probably more elegant ways to combine those x and xend terms, but whatever:
ggplot(myDataNorm, aes(x, Sample)) +
geom_segment(aes(
x=ifelse(y>0.9,x,290), xend=ifelse(y>0.9,x-1,290),
y=Sample, yend=Sample), color='blue3', size=14) +
xlim(290,400) # needed to show entire scale
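The normalization step behind myDataNorm is not shown above. One way it might be done, sketched on a tiny made-up long-format data set (the y values are invented for illustration), is to divide each Sample's y by that Sample's maximum:

```r
# Tiny stand-in for the gathered data (values are assumptions)
myData <- data.frame(
  x = rep(290:294, times = 2),
  Sample = rep(c("X52241", "X75123"), each = 5),
  y = c(1, 4, 10, 4, 1, 2, 5, 3, 1, 0)
)

# Scale y within each Sample so every Sample peaks at exactly 1
myDataNorm <- myData
myDataNorm$y <- ave(myData$y, myData$Sample, FUN = function(v) v / max(v))

# Per-Sample maxima after scaling
tapply(myDataNorm$y, myDataNorm$Sample, max)
```

With every Sample scaled to a peak of 1, a single threshold such as y > 0.9 picks out the top of each "hill" regardless of its absolute height.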
I'm a beginner in R, so please be patient with me if there are very obvious mistakes in my code and for my question! For a homework problem, I am struggling to make what I think is a geom_area plot look like this:
As background, we are using the diamonds dataframe from ggplot2 library. We were given the plot and asked to reproduce it. My biggest problem is with the y-axis. The graph given indicated that the y-axis represents density, which I think is the percentage/proportion of each clarity grade given the title. Originally, I thought perhaps I needed to create a new dataframe with "Price" and "Clarity Proportion (or, density)", but I wasn't sure how to do that. The professor hinted that we should not need to create a new variable for this problem.
Here's what I have so far. It produces the error message: "In Ops.ordered(left, right): '/' is not meaningful for ordered factors":
set.seed(123)
d <- ggplot(diamonds[sample(nrow(diamonds),5000),]) #these were given in the homework
d + geom_area(aes(x = price, y = lapply(count(diamonds$clarity), FUN = count(diamonds$clarity)/53940), colour = clarity), position = "fill") +
labs(title = "Clarity Proportion by Price")
I know my y-axis is wrong, but I'm just not sure how to transform it. Your explanation and insight are greatly appreciated!
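One possible direction, offered as a sketch rather than the intended homework answer: let ggplot2 compute the density itself, so that no new variable is needed. geom_density draws an area per clarity grade, and position = "fill" stacks the areas and rescales them to sum to 1 at each price. Whether the target plot was built this way is an assumption.

```r
library(ggplot2)

set.seed(123)
d <- ggplot(diamonds[sample(nrow(diamonds), 5000), ])  # as given in the homework

# The density stat estimates a smoothed density of price for each clarity grade;
# position = "fill" rescales the stacked areas so they sum to 1 at every price
p <- d +
  geom_density(aes(x = price, fill = clarity), position = "fill") +
  labs(title = "Clarity Proportion by Price")
p
```

By default each grade's density integrates to 1 before stacking; mapping y to after_stat(count) instead would weight each grade by its number of diamonds, which may or may not match the plot you were given.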
I am trying to plot two Cumulative Frequency curves in ggplot, and shade the cross over at a certain cut off. I haven't been using ggplot for long, so I was hoping someone might be able to help me with this one.
The plot without filled regions, looks like this...
Which I have created using the following code...
library(ggplot2) # required
library(plyr)    # provides ddply() and .() used below
north <- rnorm(3060, mean=277,sd=3.01) # to create synthetic data
south <- rnorm(3060, mean=278, sd=3.26) # in place of my real data.
#placing in dataframe
df_temp <- data.frame(temp=c(north,south),
region=c(rep("north",length=3060),rep("south",length=3060)))
#manipulating into cdf, as I've seen in other examples
temp.regions <- ddply(df_temp, .(region), summarize,
temp = unique(temp),
ecdf = ecdf(temp)(unique(temp)))
# feeding into ggplot.
ggplot(temp.regions,aes(x=temp, y=ecdf, color = region)) +
geom_line(aes(x=temp,color=region))+
scale_colour_manual(values = c("blue","red"))
What I would then like, would be to shade both curves for temperatures below 0.2 on the y axis. Ideally I'd like to see the blue one shaded in blue, and the red one shaded in red. Then, where they cross over in purple.
However, the closest I have managed is as follows...
Which I have achieved using the following additions to my code.
# creating a dataframe with just the temperatures for below 0.2
# to try and aid control when plotting
temp.below <- temp.regions[which(temp.regions$ecdf<0.2),]
# plotting routine again.
ggplot(temp.regions, aes(x=temp, y=ecdf, color = region)) +
geom_line(aes(x=temp,color=region))+
scale_colour_manual(values = c("blue","red"))+
# with additional line for shading.
geom_ribbon(data=temp.below,
aes(x=temp,ymin=0,ymax=0.2), alpha=0.5)
I've seen a few examples of people shading for a normal distribution density plot, which is where I have adapted my code from. But for some reason my boxes don't seem to want anything to do with the temperature curve.
Please help! I'm sure it's quite simple, I'm just really lost and have tried a few, producing less convincing results than these.
Thank you so much for taking a look.
PROBLEM SOLVED THANKS TO HELP BELOW...
running suggested code from below
geom_ribbon(aes(ymin=0,ymax=ecdf, fill=region), alpha=0.5)
gives...
which is so very almost the solution I'm after, but with one final addition... like so
#geom_ribbon(aes(ymin=0,ymax=ecdf, fill=region), alpha=0.5)
geom_ribbon(data=temp.below, aes(ymin=0,ymax=ecdf, fill=region), alpha=0.5)
I get what I'm after...
The reason I set the data again is so that it only fills the lowest 20% of the two regions.
Thank you so much for the help :-)
Looks like you're thinking about it in the right way.
With geom_ribbon I don't think you need to set data to anything else. Just set aes(ymin = 0, ymax = ecdf, fill = region). I think that should do it.