I want to generate a density plot of observed temperatures that is scaled by the number of events observed for each temperature data point. My data contains two columns: Temperature and Number [of observations].
Right now, I have a density plot that only incorporates the Temperature frequency according to:
plot(density(Temperature, na.rm=T), type="l", bty="n")
How do I scale this density to account for the Number of observations at each temperature? For example, I want to be able to see the temperature density plot scaled to show if there are greater/fewer observations for each temperature at higher/lower temperatures.
I think I'm looking for something that could weight the temperatures?
I think you can get what you want by passing a weights argument to density. Here's an example using ggplot2 (note that the ggplot2 aesthetic is singular, weight, not weights):
library(ggplot2)
dat <- data.frame(Temperature = sort(runif(10)), Number = 1:10)
ggplot(dat, aes(Temperature)) + geom_density(aes(weight = Number/sum(Number)))
And to do this in base (using DanM's data):
plot(density(dat$Temperature, weights = dat$Number/sum(dat$Number), na.rm = TRUE), type = "l", bty = "n")
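To see what the weights actually do, you can overlay the weighted and unweighted curves. A minimal sketch using the same made-up dat as above:

```r
set.seed(1)
dat <- data.frame(Temperature = sort(runif(10)), Number = 1:10)

# Unweighted vs. weighted density: weights must sum to 1
d_unw <- density(dat$Temperature, na.rm = TRUE)
d_wt  <- density(dat$Temperature, weights = dat$Number / sum(dat$Number), na.rm = TRUE)

plot(d_unw, type = "l", bty = "n", main = "Weighted vs. unweighted density")
lines(d_wt, lty = 2)  # mass shifts toward temperatures with larger Number
legend("topleft", c("unweighted", "weighted"), lty = 1:2, bty = "n")
```

Since Number increases with Temperature here, the dashed weighted curve is pulled toward the high end.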
I've combed through several questions on here already and I can't seem to figure out what's happening with my density plots.
I have a set of radiocarbon dates which are attributed to different cultures. I need to display the frequencies of dates through time, but distinguish the dates by culture. A stacked histogram does the job (Fig. 1), but their use is generally discouraged, so that's out of the question, yet I want something smoother than a frequency plot (Fig. 2).
Figure 1: Histogram
Figure 2: Frequency plot
When I produce a density plot coloured by culture (Fig. 3), the relative distribution of the cultures on the y-axis changes drastically from their counts. For example, in the density plot, the blue density curve is much higher than the purple one; yet, in the histogram, we can see that there are far more dates attributed to the purple group.
Figure 3: Density plot
Am I doing something wrong with my code (see below)? Or perhaps I need to scale the density curves in some way? Or is there something about density plots I'm not understanding? (Disclaimer: my knowledge of stats is fairly weak)
Thanks in advance!
ggplot(test, aes(x=CalBP))+
theme_tufte(base_family="sans")+
theme(axis.line=element_line(), axis.text=element_text(color="black")) +
theme(legend.position="none") +
theme(text=element_text(size=14)) +
geom_density(aes(color=factor(Culture), fill=factor(Culture)), alpha = 0.5) +
scale_x_reverse() +
labs(x="Cal. B.P.") +
ylab(expression("Density")) +
coord_cartesian(xlim = c(4773, 225)) +
scale_fill_manual(values=c("#cf9045", "#ebe332", "#5f9388", "#6abeef", "#9d88d6")) +
scale_color_manual(values=c("#cf9045", "#ebe332", "#5f9388", "#6abeef", "#9d88d6"))
The difference is that a density plot is scaled so that the total area under the curve is 1. Its function is to model a probability density function, which (by definition) has area 1.
If every group in your data had the same number of observations, then the only difference between the density plot and the histogram would be the y-axis. When you have different numbers of observations, the density plot normalizes for this (each will have total area 1), while the bars of the histogram are much higher for the group with more observations.
In base R, you can fix this in the histogram by setting freq = FALSE, but I've not seen density plots scaled up to histograms - it's usually more interesting to ignore the effects of the relative sample sizes.
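That said, ggplot2 can scale density curves up to the histogram's count scale by mapping y to the count-scaled density with after_stat(count). A minimal sketch with made-up stand-in data (the data frame and column names mirror the question; the actual dates are not reproduced here):

```r
library(ggplot2)

set.seed(1)
# Two groups of deliberately unequal size, standing in for the radiocarbon dates
test <- data.frame(
  CalBP   = c(rnorm(300, 2000, 300), rnorm(50, 3500, 200)),
  Culture = rep(c("A", "B"), c(300, 50))
)

# after_stat(count) multiplies each group's density by its n, so the
# areas under the curves are proportional to group sizes, as in the histogram
p <- ggplot(test, aes(x = CalBP, fill = Culture)) +
  geom_density(aes(y = after_stat(count)), alpha = 0.5)
print(p)
```

With this mapping the larger group's curve dominates, matching the visual impression the histogram gives.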
I work with a massive 4D nifti file (x - y - z - subject; MRI data) and due to the size I can't convert to a csv file and open in R. I would like to get a series of overlaying density plots (classic example here) one for each subject with the idea to just visualise that there is not much variance in density distributions across the sample.
I could however, extract summary statistics for each subject (mean, median, SD, range etc. of the variable of interest) and use these to create the density plots (at least for the variables that are normally distributed). Something like this would be fantastic but I am not sure how to do it for density plots.
Your help will be much appreciated.
So these really aren't density plots per se - they are plots of densities of normal distributions with given means and standard deviations.
That can be done in ggplot2, but you need to expand your table of subjects and summaries into grids of points and normal densities at those points.
Here's an example. First, make up some data, consisting of subject IDs and some simulated sample averages and sample standard deviations.
library(tidyverse)
set.seed(1)
foo <- data_frame(Subject = LETTERS[1:10], avg=runif(10, 10,20), stdev=runif(10,1,2))
Now, for each subject we need to obtain a suitable grid of "x" values along with the normal density (for that subject's avg and stdev) evaluated at those "x" values. I've chosen plus/minus 4 standard deviations. This can be done using do. But that produces a funny data frame with a column consisting of data frames. I use unnest to explode out the data frame.
bar <- foo %>%
group_by(Subject) %>%
do(densities=data_frame(x=seq(.$avg-4*.$stdev, .$avg+4*.$stdev, length.out = 50),
density=dnorm(x, .$avg, .$stdev))) %>%
unnest()
Have a look at bar to see what happened. Now we can use ggplot2 to put all these normal densities on the same plot. I'm guessing with lots of subjects you wouldn't want a legend for the plot.
bar %>%
ggplot(aes(x=x, y=density, color=Subject)) +
geom_line(show.legend = FALSE)
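In current tidyverse versions, do() and bare unnest() are superseded; the same expansion can be written with reframe(), which returns multiple rows per group. A sketch of the equivalent pipeline, assuming dplyr >= 1.1 (same made-up foo data as above):

```r
library(dplyr)
library(ggplot2)

set.seed(1)
foo <- tibble(Subject = LETTERS[1:10], avg = runif(10, 10, 20), stdev = runif(10, 1, 2))

# reframe() expands each subject's single row into a 50-point grid
# of x values and the corresponding normal density
bar <- foo %>%
  group_by(Subject) %>%
  reframe(x = seq(avg - 4 * stdev, avg + 4 * stdev, length.out = 50),
          density = dnorm(x, avg, stdev))

ggplot(bar, aes(x = x, y = density, color = Subject)) +
  geom_line(show.legend = FALSE)
```

The result has one row per subject per grid point (10 subjects x 50 points), just like the do()/unnest() version.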
stat_density2d is really a nice display for dense scatter plots; however, I could not find any explanation of what the density actually means. I have a plot with densities ranging from 0 to 400. What is the unit of this scale?
Thanks!
The density values will depend on the range of x and y in your dataset.
stat_density2d(...) uses kde2d(...) in the MASS package to calculate the 2-dimensional kernel density estimate, based on bivariate normal distributions. The density at a point is scaled so that the integral of density over all x and y = 1. So if your data is highly localized, or if the range for x and y is small, you can get large numbers for density.
You can see this in the following simple example:
library(ggplot2)
set.seed(1)
df1 <- data.frame(x = c(rnorm(50, 0, 5), rnorm(50, 20, 5)),
                  y = c(rnorm(50, 0, 5), rnorm(50, 20, 5)))
ggplot(df1, aes(x, y)) + geom_point() +
  stat_density2d(geom = "path", aes(color = after_stat(level)))
set.seed(1)
df2 <- data.frame(x = c(rnorm(50, 0, 5), rnorm(50, 20, 5))/100,
                  y = c(rnorm(50, 0, 5), rnorm(50, 20, 5))/100)
ggplot(df2, aes(x, y)) + geom_point() +
  stat_density2d(geom = "path", aes(color = after_stat(level)))
These two data frames are identical except that in df2 the scale is 1/100 that in df1 (in each direction), and therefore the density levels are 10,000 times greater in the plot of df2.
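You can verify the integrates-to-1 scaling directly by calling kde2d yourself and numerically integrating the density surface. A quick sketch using the same simulated data:

```r
library(MASS)

set.seed(1)
x <- c(rnorm(50, 0, 5), rnorm(50, 20, 5))
y <- c(rnorm(50, 0, 5), rnorm(50, 20, 5))

# Evaluate the 2D kernel density estimate on a 100x100 grid
d <- kde2d(x, y, n = 100)
dx <- diff(d$x[1:2])
dy <- diff(d$y[1:2])

# Riemann-sum approximation of the integral over the grid;
# slightly below 1 because the default grid clips the tails
sum(d$z) * dx * dy
```

Rescaling x and y by 1/100 shrinks dx and dy by the same factor, so z must grow by 100 x 100 = 10,000 for the integral to stay at 1.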
I have a file which contains time-series data for multiple variables from a to k.
I would like to create a graph that plots the average of the variables a to k over time and above and below that average line adds a smoothed area representing maximum and minimum variation on each day.
So something like confidence intervals but in a smoothed version.
Here's the dataset:
https://dl.dropbox.com/u/22681355/co.csv
and here's the code I have so far:
library(ggplot2)
library(reshape2)
df <- read.csv("co.csv")  # the linked dataset
meltdf <- melt(df, id = "Year")
ggplot(meltdf, aes(x = Year, y = value, colour = variable, group = variable)) + geom_line()
This depicts bootstrapped 95% confidence intervals (mean_cl_boot requires the Hmisc package):
ggplot(meltdf, aes(x = Year, y = value, colour = variable, group = variable)) +
  stat_summary(fun.data = "mean_cl_boot", geom = "smooth")
This depicts the mean of all values of all variables +-1SD:
ggplot(meltdf, aes(x = Year, y = value)) +
  stat_summary(fun.data = "mean_sdl", fun.args = list(mult = 1), geom = "smooth")
You might want to calculate the year means before calculating the means and SD over the variables, but I leave that to you.
However, I believe a bootstrap confidence interval would be more sensible, since the distribution is clearly not symmetric. It would also be narrower. ;)
And of course you could log-transform your values.
I have a file with 2 numeric columns: value and count. File may have > 5000 rows. I do plot(value, count) to find the shape of distribution. But because there are too many data points the picture is not very clear.
Do you know better visualization approach? Probably histograms or barplot with grouping close values on x axis will be the better way to look on data? I cannot figure out the syntax of using histogram or barplot for my case.
If you want to relate the two (continuous) quantities value and count to each other, then you want a scatterplot. The problem is that if you have too many observations, the points overlap and the plot ends up as a big opaque mass with a few scattered outliers. There are a few ways to solve this:
Use a smaller plotting symbol: plot(value, count, pch=".")
Plot the data points with a transparency factor: plot(value, count, col=rgb(0, 0, 1, alpha=0.1))
Why not plot a subset of the data? For example, plot the counts associated with values corresponding to the 5th, 10th, ..., 90th, 95th percentiles, e.g.,
value.subset <- quantile(value, seq(0, 1, 0.05))
Then plot the quantiles against their respective counts.
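Putting the quantile idea together: since the quantiles themselves may not be exact observed values, one possible way to pair each quantile with a count is to match it to the nearest observation. A sketch on made-up data (the value/count columns and the nearest-match step are illustrative assumptions, not from the original file):

```r
set.seed(1)
# Made-up stand-in for the two-column file: many values, counts densest in the middle
value <- rnorm(5000)
count <- rpois(5000, lambda = exp(2 - value^2))

# Quantiles of value at 0%, 5%, ..., 100%
value.subset <- quantile(value, seq(0, 1, 0.05))

# For each quantile, index of the nearest observed value
idx <- sapply(value.subset, function(q) which.min(abs(value - q)))

# Plot the matched subset: far fewer points, shape still visible
plot(value[idx], count[idx], type = "b",
     xlab = "value (at quantiles)", ylab = "count")
```

This reduces 5000 points to 21 representative ones while preserving the overall shape of the relationship.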