Probability density plots from count/value ("binned") data - r

I have a data set which contains 27M samples per day. I can reduce this, using count(), to 1500 samples per day, without loss.
When I come to plot, for example, histograms from this, I can use stat="identity" to process the count data considerably faster than the original data.
Is there a similar way to process the count data to obtain ridges using ggridges::geom_density_ridges(), or similar, to get the probability density without having to process the original data set?

It sounds like your current set-up is something like this (obviously with far more cases): a data frame containing a large vector of numeric measurements, with at least one grouping variable to specify different ridge lines.
We will stick to 2000 samples rather than 27M samples for demonstration purposes:
set.seed(1)
df <- data.frame(x = round(c(rnorm(1000, 35, 5), rnorm(1000, 60, 12))),
                 group = rep(c('A', 'B', 'C'), len = 2000))
We can reduce these 2000 observations down to ~200 by using count, and plot with geom_histogram using stat = 'identity':
library(dplyr)
library(ggplot2)

df %>%
  group_by(x, group) %>%
  count() %>%
  ggplot(aes(x, y = n, fill = group)) +
  geom_histogram(stat = 'identity', color = 'black')
But we want to create density ridgelines from these 200 rows of counts rather than the original data. Of course, we could uncount them and create a density ridgeline normally, but this would be tremendously inefficient. What we can do is use the counts as weights for a density calculation. It seems that geom_density_ridges doesn't take a weight parameter, but stat_density does, and you can tell it to use the density_ridges geom. This allows us to pass our counts as weights for the density calculation.
library(ggridges)

df %>%
  group_by(x, group) %>%
  count() %>%
  ggplot(aes(x, fill = group)) +
  stat_density(aes(weight = n, y = group, height = after_stat(density)),
               geom = 'density_ridges', position = 'identity')
Note that this should give us the same result as creating a ridgeline from our whole data set before counting, since our 'bins' are unique integer values. If your real data bins continuous data before counting, the kernel density estimate from the count data will be slightly less accurate, depending on how 'thin' your bins are.
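For example, if your real data were continuous, a minimal sketch of binning before counting might look like this (df_cont and the 0.5 bin width are illustrative assumptions, not part of the original answer; narrower bins keep the weighted density closer to the unbinned estimate):
binwidth <- 0.5
df_cont <- data.frame(x = c(rnorm(1000, 35, 5), rnorm(1000, 60, 12)),
                      group = rep(c('A', 'B', 'C'), len = 2000))

df_cont %>%
  mutate(x = round(x / binwidth) * binwidth) %>%  # snap each value to its bin midpoint
  count(x, group) %>%
  ggplot(aes(x, fill = group)) +
  stat_density(aes(weight = n, y = group, height = after_stat(density)),
               geom = 'density_ridges', position = 'identity')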

Related

How to make the points in a geom_point scatter plot depend on the count?

I'm using ggplot to create a scatterplot of a dataframe. The x and y axes are two columns in the frame, and the following code gives me a scatter plot:
ggplot(df, aes(x = Season, y = type)) +
  geom_point(fill = "blue")
But the points are all the same size. I want each point's size to depend on how many rows match that combination of x and y. Anyone know how?
You haven't provided any sample data, so I'm generating some:
library(tidyverse)

df <- data.frame(Season = sample(c('W', 'S'), size = 20, replace = TRUE),
                 type = sample(c('A', 'B'), size = 20, replace = TRUE))
df %>%
  group_by(Season, type) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = Season, y = type, size = count)) +
  geom_point(col = "blue")
The idea is to count all the occurrences of your Season–type combinations, and then use that new count field to adjust the point size in your ggplot.
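As an aside, ggplot2 also provides geom_count() (built on stat_sum), which tallies the observations at each x–y position and maps the count to point size, so the explicit group_by/summarise step could arguably be skipped. A sketch with the same dummy data:
# geom_count() counts overlapping points and maps the count to size automatically
ggplot(df, aes(x = Season, y = type)) +
  geom_count(col = "blue")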

How can I plot 3 repeat observations per sample on a scatter in R?

I have a dataframe with the following columns; Sample, Read_length, Length, Rep, Year, Sex. Each unique sample has 6 Length values (2 Read_length conditions x 3 Reps). I would like to plot Length vs Year in such a way that each group of 3 repeats is visually linked on the plot, so I can see the variation. I am using colour and point shape to distinguish between the 2 read-lengths and between Male & Female.
ggplot(data1, aes(x = Year, y = Length, shape = Sex, colour = Read_length)) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(1, 4))
Is there a way to group first by read_length, and then by sample name, to generate the groups of three (and how to then plot that)?
Take your input data and use group_by() from dplyr. This will allow ggplot, and many other tidyverse functions, to process each sample separately.
data %>% group_by(Sample)
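In the plot itself, that grouping is typically expressed through ggplot's group aesthetic. A minimal sketch, assuming the column names described in the question (data1 with Sample, Read_length, Length, Year, Sex); the line styling is an assumption, not from the original answer:
library(ggplot2)

# Each Sample/Read_length pair forms one group of three repeats; geom_line()
# connects the points within a group so the variation is visually linked.
ggplot(data1, aes(x = Year, y = Length, shape = Sex, colour = Read_length,
                  group = interaction(Sample, Read_length))) +
  geom_line(alpha = 0.5) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(1, 4))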

How to visualize data in R with different starting values? (i.e. standardizing so that every data starts from zero)

I've been trying to wrap my head around this question. The chart should look something like this:
So I am basically trying to plot returns, but "standardizing" them first. Is there a quick way to do this? I thought about dividing each row entry by the value of the first row, e.g. if a stock starts trading at 200, data point 1 will be 200/200 = 1, data point 2, say, 210/200 = 1.05, etc. I could then also multiply each value by 100, so the first would be 100, the second 105, etc.
Does this make sense or is there a smarter way to do this?
Many thanks!
You may want seq_along(). I don't have your data, so here is an example with some dummy data:
set.seed(12345)
df <- data.frame(company = c(rep("A", 100), rep("B", 100), rep("C", 100), rep("D", 100)),
                 value = c(rnorm(100, 150, 25), rnorm(100, 250, 25), rnorm(100, 50, 25), rnorm(100, 300, 25)),
                 time = c(151:250, 100:199, 200:299, 251:350))
Add a new column to your data after grouping by the group/color variable. Use seq_along() to populate that column with a sequence of integers starting at 1 for each group. If needed, transform that new column to whatever scale you want. Note that this only works if your horizontal-axis data is evenly spaced; if the intervals are not the same, this will cause trouble.
library(dplyr)
library(ggplot2)

df %>%
  group_by(company) %>%
  mutate(time2 = seq_along(time)) %>%
  ggplot(aes(x = time2, y = value, color = company)) +
  geom_line(size = 2) +
  xlab("relative time")
If your data is unevenly spaced, consider subtracting the per-group minimum from each value instead. This has the bonus of preserving the interval widths; if you divided by the minimum value, the time intervals would be compressed differently in each group. Again, you could manipulate the new variable in other ways, like adding 100 so that all values start at 100.
df %>%
  group_by(company) %>%
  mutate(time2 = time - min(time)) %>%
  ggplot(aes(x = time2, y = value, color = company)) +
  geom_line(size = 2) +
  xlab("relative time")
To come back to the solution I proposed before, I've prepared the following dataset:
close <- aapl %>%
  select(AAPL.Close)  # keep only the closing prices for Apple
The first closing price for Apple is 32.1875. To make it comparable with another firm, independent of the share price, I want to "standardize" this value. Thus I will divide each row in the dataset by 32.1875 and multiply the result by 100. This produces a new column, which I call Relative, that begins at 100 (the base value).
close$Relative <- (close$AAPL.Close/32.1875)*100
Now I do the same with AMZN (I'll spare you the code; the concept is the same). When done, I bind both data frames together:
close <- cbind(close,amzn)
And plot the data:
ggplot(close, aes(x = close$dates)) +
  geom_line(aes(y = Relative), color = "Red") +
  geom_line(aes(y = Relative1), color = "Blue")
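A tidier sketch of the same indexing idea, assuming a single long-format data frame (here hypothetically called prices, with columns ticker, date and close, one row per ticker per date), avoids the manual division and the cbind() step:
library(dplyr)
library(ggplot2)

# Divide each closing price by the first price within its ticker and multiply
# by 100, so every series starts at 100. Assumes rows are ordered by date
# within each ticker.
prices %>%
  group_by(ticker) %>%
  mutate(relative = close / first(close) * 100) %>%
  ggplot(aes(x = date, y = relative, colour = ticker)) +
  geom_line()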

Averaging time series groups and plotting them against one another

In the time series data created below (data), individuals (denoted by a unique ID) were sampled from 2 populations (NC and SC). All individuals have the same number of observations. I want to average the data at each time point across all individuals that belong to the same State (the average line), and I want to plot the average lines from the two states against each other. I want it to look something like this:
library(tidyverse)

set.seed(123)
ID <- rep(1:10, each = 500)
Time <- rep(1:500, 10)
Location <- rep(c("NC", "SC"), each = 2500)
Var <- rnorm(5000)

data <- data.frame(
  ID = factor(ID),
  Time = Time,
  State = Location,
  Variable = Var
)
I would recommend getting familiar with the various dplyr functions, specifically group_by and summarise. You may want to read through the Introduction to dplyr vignette or this series of blog posts.
In short, we group the data by the Time and State variables and then summarize each group with an average (i.e., mean(Variable)). To plot the data, we put Time on our x-axis, the newly created avg_var on our y-axis, and use State to represent color. These are assigned as our chart's aesthetics (i.e., aes(...)). Finally, we add geom_line() to render the lines on our visualization.
data %>%
  group_by(Time, State) %>%
  summarise(avg_var = mean(Variable)) %>%
  ggplot(aes(x = Time, y = avg_var, color = State)) +
  geom_line()

Density of each group of weighted geom_density sum to one

How can I group a density plot and have the density of each group sum to one, when using weighted data?
The ggplot2 help for geom_density() suggests a hack for using weighted data: dividing by the sum of the weights. But when grouped, this means that the combined density of the groups totals one. I would like the density of each group to total one.
I have found two clumsy ways to do this. The first is to treat each group as a separate dataset:
library(ggplot2)
library(ggplot2movies)  # load the movies dataset

m <- ggplot()
m + geom_density(data = movies[movies$Action == 0, ],
                 aes(rating, weight = votes / sum(votes)),
                 fill = NA, colour = "black") +
  geom_density(data = movies[movies$Action == 1, ],
               aes(rating, weight = votes / sum(votes)),
               fill = NA, colour = "blue")
Obvious disadvantages are the manual handling of factor levels and aesthetics. I also tried using the windowing functionality of the data.table package to create a new column for the total votes per Action group, dividing by that instead:
library(data.table)

movies.dt <- data.table(movies)
setkey(movies.dt, Action)
movies.dt[, votes.per.group := sum(votes), Action]

m <- ggplot(movies.dt, aes(x = rating, weight = votes / votes.per.group,
                           group = Action, colour = Action))
m + geom_density(fill = NA)
Are there neater ways to do this? Because of the size of my tables, I'd rather not replicate rows by their weighting for the sake of using frequency.
Using dplyr
library(dplyr)
library(ggplot2)
library(ggplot2movies)

movies %>%
  group_by(Action) %>%
  mutate(votes.grp = sum(votes)) %>%
  ggplot(aes(x = rating, weight = votes / votes.grp, group = Action, colour = Action)) +
  geom_density()
I think an auxiliary table might be your only option. I had a similar problem here. The issue, it seems, is that when ggplot uses aggregating functions in aes(...), it applies them to the whole dataset, not the subsetted data. So when you write
aes(weight = votes / sum(votes))
the votes in the numerator are subsetted based on Action, but the sum(votes) in the denominator is not. The same is true for the implicit grouping with facets.
If someone else has a way around this I'd love to hear it.
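For completeness, one possible workaround (a sketch, not part of either answer above) is to compute a weighted density for each group separately outside ggplot with stats::density(), normalising the weights within that group, and then draw the curves with geom_line():
library(dplyr)
library(ggplot2)
library(ggplot2movies)

# Compute a weighted density per Action group, so each group's weights sum
# to one and each curve integrates to ~1.
dens <- movies %>%
  group_by(Action) %>%
  group_modify(~ {
    d <- density(.x$rating, weights = .x$votes / sum(.x$votes))
    data.frame(rating = d$x, density = d$y)
  }) %>%
  ungroup()

ggplot(dens, aes(x = rating, y = density, colour = factor(Action))) +
  geom_line()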
