Rolling mean based on conditions R - r

Here is a brief description of my data: the first column is the date (monthly), the second column is a binary variable (0 or 1), and the third column is the stock return, so each month's return is labeled either 1 or 0.
I want to calculate the 12-month rolling mean return separately for each value of the second column (0 or 1). The number of 0s and 1s will differ from one 12-month window to the next. There should be two outcomes (mean_rolling_0 and mean_rolling_1).

Use rollmean() from the zoo package, and apply this per group with group_by() in dplyr.
Here's an example. I'm guessing at your data structure, but it will also work for similar structures.
library(tidyverse)
library(zoo)
# sample data
d = tibble(a = 1:100,
b = sample(c(0,1), 100, replace = T),
c = a/10 + rnorm(100))
# compute rolling mean
d2 = d %>%
group_by(b) %>%
mutate(roll = rollmean(c, 12, fill = NA, align = "right"))  # fill = NA replaces the deprecated na.pad = TRUE
# plot to see the effect
ggplot(data = d2) + geom_line(aes(x = a, y = c, colour = factor(b))) +
geom_line(aes(x = a, y = roll, colour = factor(b)), linetype = 'dashed')
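Note that grouping first and then rolling averages the last 12 observations within each group, which can span more than 12 calendar months. If the window should instead be the trailing 12 months, with the mean taken only over the 0 months or only over the 1 months inside it, a minimal base R sketch might look like this. The helper roll_mean_if and the column names month, flag and ret are made up for illustration, and the data is a deterministic toy example, not real returns.

```r
# toy data: 36 consecutive months, a 0/1 flag and a made-up "return"
d <- data.frame(
  month = seq(as.Date("2015-01-01"), by = "month", length.out = 36),
  flag  = rep(c(0, 1, 1), length.out = 36),
  ret   = seq_len(36) / 100
)

# hypothetical helper: for each row, look at the trailing `width` rows
# and average only those values of x for which `keep` is TRUE
roll_mean_if <- function(x, keep, width = 12) {
  sapply(seq_along(x), function(i) {
    if (i < width) return(NA_real_)          # window not complete yet
    idx  <- (i - width + 1):i                # rows in the trailing window
    vals <- x[idx][keep[idx]]                # values matching the condition
    if (length(vals) == 0) NA_real_ else mean(vals)
  })
}

d$mean_rolling_0 <- roll_mean_if(d$ret, d$flag == 0)
d$mean_rolling_1 <- roll_mean_if(d$ret, d$flag == 1)
head(d, 14)
```

This assumes one row per month with no gaps, sorted by date. Both output columns are defined for every month from the 12th onwards, whatever that month's own flag is.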

Related

Plot multiple measurements

I have a data set with 12 individuals measured at 25 time points. I want a graph with one line per individual (1-12) and measurement (A, B, or C), with the time points on the x axis and the value on the y axis.
The columns of my dataset look like this (so it is already in long format):
Individuum (1 x 25; 2 x 25...) / Measurement (A B or C) / timepoint (1 - 25, 1- 25,...) / value
I already tried this:
ggplot(data = Replicate1, mapping = aes(x = Reading, y = value, linetype = Group))
but no lines are shown, and I don't know how to add the measurement.
You could do something like this, shown here on sample data.
set.seed(12)
df <- data.frame(individual = rep(1:12, each =3),
obs = LETTERS[1:3],
time = rep(1:25, each = 36),
val = sample(25:100, 900, T))
library(tidyverse)
df %>%
ggplot(aes(x= time, y = val, group = individual, color = as.factor(individual))) +
geom_line() +
facet_wrap(. ~ obs, ncol = 1)
Created on 2021-07-03 by the reprex package (v2.0.0)

Group observations in ggplot2 with long form data

I have data in long form that looks like this:
id <- rep(seq(1:16), each = 3)
trial <- rep(seq(1:3), times = 16)
repeatedMeasure <- round(rnorm(48, mean = 3, sd = 2))
measuredOnce <- rep(10:14, times = c(9,6,6,12,15))
con1 <- rep(c('hi', 'lo'), each = 6, times = 4)
con2 <- rep(c('up', 'down'), each = 3, times = 8)
dat <- as.data.frame(cbind(id, trial, con1, con2, repeatedMeasure, measuredOnce))
dat$measuredOnce <- as.character(dat$measuredOnce)
dat$measuredOnce <- as.numeric(dat$measuredOnce)
Participants complete multiple trials. There is a unique measurement for each trial in the 'repeatedMeasure' variable. However, they are only measured once for the variable titled 'measuredOnce'. I want to produce a bar plot of the measuredOnce variable, something like this:
ggplot(data = dat) +
aes(x = measuredOnce) +
geom_bar() +
facet_wrap(~con1*con2)
However, I want to specify that the measurements for measuredOnce are grouped by id, so that the number of observations (and hence the height of the bar) is divided by three.
I know I could produce what I want by using spread() or taking every third row, but would like to work with the same (long) data frame.
Edit: plot using code above with group = id and fill = id added to aesthetics.
Edit 2: What I am looking for is something that looks like the plot produced by this code
dat %>%
spread(key = trial, value = repeatedMeasure) %>%
ggplot() +
aes(x = measuredOnce) +
geom_bar() +
facet_wrap(~con1*con2)
but without creating a new data frame using spread().
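One possible approach, sketched under the assumption that measuredOnce, con1 and con2 are constant within each id (as they are in the example data): collapse to one row per id on the fly and hand that to ggplot, so the long data frame itself stays as it is. The name once is only for illustration, and the data is rebuilt here without the random column so the snippet runs on its own.

```r
# same structure as the question's data, minus the random repeatedMeasure
dat <- data.frame(
  id           = rep(1:16, each = 3),
  trial        = rep(1:3, times = 16),
  con1         = rep(c("hi", "lo"), each = 6, times = 4),
  con2         = rep(c("up", "down"), each = 3, times = 8),
  measuredOnce = rep(10:14, times = c(9, 6, 6, 12, 15))
)

# one row per participant: drop the per-trial columns, keep distinct rows
once <- unique(dat[, c("id", "con1", "con2", "measuredOnce")])

# plotting is guarded so the data step can be run without ggplot2 installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data = once) +
    aes(x = measuredOnce) +
    geom_bar() +
    facet_wrap(~con1 * con2)
  # print(p) draws the plot
}
```

The unique() call could equally be written inline as the data argument, so no extra object has to be kept around. With dplyr loaded, distinct(dat, id, con1, con2, measuredOnce) should give the same rows.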

Control relative sizes of discrete scale in ggplot2

I'm trying to generate a polar violin plot with ggplot2. I'd like to control the relative size of each category (the width of each category of the factor on the x axis, which then translates to angle once I make the coordinates polar).
Is there any way to do this?
Example code:
means <- runif(n = 10, min=0.1, max=0.6)
sds <- runif(n = 10, min=0.2, max=0.4)
frame <- data.frame(
cat = sample(1:10, size=10000, replace=TRUE),
value = rnorm(10000)
) %>%
mutate(
mn = means[cat],
sd = sds[cat],
value = (value * sd) + mn,
cat = factor(cat)
)
frame %>%
ggplot(aes(x = cat, y = value)) + geom_violin() +
coord_polar()
Any help or advice is appreciated.
Alternatively (and perhaps better), I'd like to be able to make a polar coordinates chart that isn't centered. Where the angles are the same for each discrete category, but the points converge, say, 1/3 of the way from the bottom of the circle, rather than in the center of the circle.
Based on comments, I'm redoing my previous answer. If what you want is a fan/weed leaf shape, you can add dummy data for additional cat values. In this example, I just doubled the number of levels in cat, but you could change this. Then I set the x breaks to only show the values that actually have data, but let the dummy values take up space to change the shape. Still not sure if this is what you meant but it's interesting to try.
library(tidyverse)
means <- runif(n = 10, min=0.1, max=0.6)
sds <- runif(n = 10, min=0.2, max=0.4)
frame <- data.frame(
cat = sample(1:10, size=10000, replace=TRUE),
value = rnorm(10000)
) %>%
mutate(
mn = means[cat],
sd = sds[cat],
value = (value * sd) + mn,
cat = factor(cat)
)
frame %>%
mutate(cat = as.integer(cat)) %>%
bind_rows(tibble(cat = 11:20, value = NA)) %>%
ggplot(aes(x = as.factor(cat), y = value)) +
geom_violin(scale = "area") +
coord_polar(start = -pi / 2) +
scale_x_discrete(breaks = 1:10)
#> Warning: Removed 10 rows containing non-finite values (stat_ydensity).
Created on 2018-05-08 by the reprex package (v0.2.0).

Apply MASS::fitdistr to multiple data by a factor

My question is at the end in bold.
I know how to fit the beta distribution to some data. For instance:
library(Lahman)
library(dplyr)
# clean up the data and calculate batting averages by playerID
batting_by_decade <- Batting %>%
filter(AB > 0) %>%
group_by(playerID, Decade = round(yearID - 5, -1)) %>%
summarize(H = sum(H), AB = sum(AB)) %>%
ungroup() %>%
filter(AB > 500) %>%
mutate(average = H / AB)
# fit the beta distribution
library(MASS)
m <- MASS::fitdistr(batting_by_decade$average, dbeta,
start = list(shape1 = 1, shape2 = 10))
alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]
# plot the histogram of data and the beta distribution
library(ggplot2)
ggplot(batting_by_decade) +
geom_histogram(aes(average, y = ..density..), binwidth = .005) +
stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red",
size = 1) +
xlab("Batting average")
Which yields:
Now I want to calculate different beta parameters alpha0 and beta0 for each batting_by_decade$Decade column of the data so I end up with 15 parameter sets, and 15 beta distributions that I can fit to this ggplot of batting averages faceted by Decade:
batting_by_decade %>%
ggplot() +
geom_histogram(aes(x=average)) +
facet_wrap(~ Decade)
I can hard code this by filtering for each decade, passing that decade's worth of data into the fitdistr function, and repeating this for all decades, but is there a way of calculating all beta parameters per decade quickly and reproducibly, perhaps with one of the apply functions?
You can leverage summarise together with two custom functions for this:
getAlphaEstimate = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate[1]}
getBetaEstimate = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate[2]}
batting_by_decade %>%
group_by(Decade) %>%
summarise(alpha = getAlphaEstimate(average),
beta = getBetaEstimate(average)) -> decadeParameters
However, you will not be able to plot it with stat_summary according to Hadley's post here: https://stackoverflow.com/a/1379074/3124909
Here's an example of how you'd go from generating dummy data all the way through to plotting.
temp.df <- data_frame(yr = 10*187:190,
al = rnorm(length(yr), mean = 4, sd = 2),
be = rnorm(length(yr), mean = 10, sd = 2)) %>%
group_by(yr, al, be) %>%
do(data_frame(dats = rbeta(100, .$al, .$be)))
First I made up some shape parameters for four years, grouped by each combination, and then used do to create a dataframe with 100 samples from each distribution. Aside from knowing the "true" parameters, this dataframe should look a lot like your original data: a vector of samples with an associated year.
temp.ests <- temp.df %>%
group_by(yr, al, be) %>%
summarise(ests = list(MASS::fitdistr(dats, dbeta, start = list(shape1 = 1, shape2 = 1))$estimate)) %>%
unnest %>%
mutate(param = rep(letters[1:2], length(ests)/2)) %>%
spread(key = param, value = ests)
This is the bulk of your question here, very much solved the way you solved it. If you step through this snippet line by line, you'll see you have a dataframe with a column of type list, containing <dbl [2]> in each row. When you unnest() it splits those two numbers into separate rows, so then we identify them by adding a column that goes "a, b, a, b, ..." and spread them back apart to get two columns with one row for each year. Here you can also see how closely fitdistr matched the true population we sampled from, looking at a vs al and b vs be.
temp.curves <- temp.ests %>%
group_by(yr, al, be, a, b) %>%
do(data_frame(prop = 1:99/100,
trueden = dbeta(prop, .$al, .$be),
estden = dbeta(prop, .$a, .$b)))
Now we turn that process inside out to generate the data to plot the curves. For each row, we use do to make a dataframe with a sequence of values prop, and calculate the beta density at each value for both the true population parameters and our estimated sample parameters.
ggplot() +
geom_histogram(data = temp.df, aes(dats, y = ..density..), colour = "black", fill = "white") +
geom_line(data = temp.curves, aes(prop, trueden, color = "population"), size = 1) +
geom_line(data = temp.curves, aes(prop, estden, color = "sample"), size = 1) +
geom_text(data = temp.ests,
aes(1, 2, label = paste("hat(alpha)==", round(a, 2))),
parse = T, hjust = 1) +
geom_text(data = temp.ests,
aes(1, 1, label = paste("hat(beta)==", round(b, 2))),
parse = T, hjust = 1) +
facet_wrap(~yr)
Finally we put it together, plotting a histogram of our sample data. Then a line from our curve data for the true density. Then a line from our curve data for our estimated density. Then some labels from our parameter estimate data to show the sample parameters, and facets by year.
This is an apply solution, but I prefer @CMichael's dplyr solution.
calc_beta <- function(decade){
dummy <- batting_by_decade %>%
dplyr::filter(Decade == decade) %>%
dplyr::select(average)
m <- fitdistr(dummy$average, dbeta, start = list(shape1 = 1, shape2 = 10))
alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]
return(c(alpha0,beta0))
}
decade <- seq(1870, 2010, by =10)
params <- sapply(decade, calc_beta)
colnames(params) <- decade
Re: @CMichael's comment about avoiding a double fitdistr, we could rewrite the function to getAlphaBeta.
getAlphaBeta = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate}
batting_by_decade %>%
group_by(Decade) %>%
summarise(params = list(getAlphaBeta(average))) -> decadeParameters
decadeParameters$params[1] # it works!
Now we just need to unlist the second column in a nice way....
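One way to do that last step, as a rough sketch: if every element of the list column is a named numeric vector of length two (which is what $estimate returns here), the elements can be row-bound into a matrix and attached back to the Decade column. The numbers below are made-up placeholders standing in for decadeParameters, not real fits.

```r
# stand-in for decadeParameters: a Decade column plus a list column of
# named estimates (placeholder values, NOT actual fitdistr output)
decadeParameters <- data.frame(Decade = c(1870, 1880, 1890))
decadeParameters$params <- list(
  c(shape1 = 30, shape2 = 90),
  c(shape1 = 35, shape2 = 100),
  c(shape1 = 40, shape2 = 110)
)

# rbind the list elements into a matrix with columns shape1 / shape2,
# then put it next to the Decade column
flat <- cbind(decadeParameters["Decade"],
              do.call(rbind, decadeParameters$params))
flat
```

With tidyr available, tidyr::unnest_wider(decadeParameters, params) should give an equivalent result without leaving the pipeline.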

Add shape at the start and end of lines, and at some interval along the lines, defined by a grouping variable

This is my df (almost 100,000 rows and 10 ID values):
Date.time P ID
1 2013-07-03 12:10:00 1114.3 J9335
2 2013-07-03 12:20:00 1114.5 K0904
3 2013-07-03 12:30:00 1114.3 K0904
4 2013-07-03 12:40:00 1114.1 K1136
5 2013-07-03 12:50:00 1114.1 K1148
............
With ggplot I create this graph:
ggplot(df) + geom_line(aes(Date.time, P, group=ID, colour=ID))
No problem with this graph. But since I also have to print it in b/w, separating the lines by colour is not a smart choice.
I tried mapping ID to the line type, but the result is not very exciting.
So my idea is to add a different symbol at the beginning and at the end of every line, so that the different IDs can also be identified on a b/w printout.
I add the lines:
geom_point(data=df, aes(x=min(Date.time), y=P, shape=ID))+
geom_point(data=df, aes(x=max(Date.time), y=P, shape=ID))
But an error occurs.
Any suggestions?
Given that every line is composed of around 5000 or 10000 values, it's impossible to plot all the values as different characters. A solution could be to plot the lines and then plot points with a different symbol for every ID at regular breaks (for example one character every 500 values). Is it possible to do that?
What about adding the geom_points using a subset of your data with only the min-max time values?
# some data
df <- data.frame(
ID = rep(c("a", "b"), each = 4),
Date.time = rep(seq(Sys.time(), by = "hour", length.out = 4), 2),
P = sample(1:10, 8))
df
# create a subset with min and max time values
# if min(x) and max(x) is the same for each ID:
df_minmax <- subset(x= df, subset = Date.time == min(Date.time) | Date.time == max(Date.time))
# if min(x) and max(x) may differ between ID,
# calculate min and max values *per* ID
# Here I use ddply, but several other aggregating functions in base R will do as well.
library(plyr)
df_minmax <- ddply(.data = df, .variables = .(ID), subset,
Date.time == min(Date.time) | Date.time == max(Date.time))
gg <- ggplot(data = df, aes(x = Date.time, y = P)) +
geom_line(aes(group = ID, colour = ID)) +
geom_point(data = df_minmax, aes(shape = ID))
gg
If you wish to have some control over your shapes, you may have a look at ?scale_shape_discrete (with examples here).
Edit following updated question
For each ID, add a shape to the line at some interval.
# create a slightly larger data set
df <- data.frame(
ID = rep(c("a", "b"), each = 100),
Date.time = rep(seq(Sys.time(), by = "day", length.out = 100), 2),
P = c(sample(1:10, 100, replace = TRUE), sample(11:20, 100, replace = TRUE)))
# for each ID:
# create a time sequence from min(time) to max(time), by some time step
# e.g. a week
df_gap <- ddply(.data = df, .variables = .(ID), summarize,
Date.time =
seq(from = min(Date.time), to = max(Date.time), by = "week"))
# add P from df to df_gap
df_gap <- merge(x = df_gap, y = df)
gg <- ggplot(data = df, aes(x = Date.time, y = P)) +
geom_line(aes(group = ID, colour = ID)) +
geom_point(data = df_gap, aes(shape = ID)) +
# if your gaps are not a multiple of the length of the data
# you may wish to add the max points as well
geom_point(data = df_minmax, aes(shape = ID))
gg
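If the markers should be placed by row count rather than by time step (the "one character every 500 values" idea from the question), a base R sketch along these lines might work. The step of 25 and the name df_marks are arbitrary choices for this small example, and the data is deterministic toy data.

```r
# toy data: two IDs with 100 readings each, 10 minutes apart
df <- data.frame(
  ID        = rep(c("a", "b"), each = 100),
  Date.time = rep(seq(as.POSIXct("2013-07-03 12:10:00", tz = "UTC"),
                      by = "10 min", length.out = 100), 2),
  P         = c(1100 + (1:100) %% 7, 1120 + (1:100) %% 5)
)

df   <- df[order(df$ID, df$Date.time), ]             # sort within ID
pos  <- ave(seq_len(nrow(df)), df$ID, FUN = seq_along)  # row position per ID
n    <- ave(pos, df$ID, FUN = length)                 # number of rows per ID
step <- 25                                            # marker every 25th row

# keep the first row, every step-th row after it, and the last row of each ID
df_marks <- df[(pos - 1) %% step == 0 | pos == n, ]

# plotting is guarded so the subsetting can be run without ggplot2 installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  gg <- ggplot(data = df, aes(x = Date.time, y = P)) +
    geom_line(aes(group = ID)) +
    geom_point(data = df_marks, aes(shape = ID))
  # print(gg) draws the plot
}
```

Because the first and last rows of each ID are always kept, no separate min/max subset is needed with this variant.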
The error stems from the fact that the single numeric value min(Date.time) doesn't match up in length with the vectors P or ID. Another problem might be that you're re-declaring your data variable even though you already have ggplot(df).
The solution that immediately comes to mind is to figure out what the row indexes are for your minimum and maximum dates. If they all share the same minimum and maximum time stamps, then it's easy. Use the which() function to come up with a vector of the row numbers you'll need.
min.index <- which(df$Date.time == min(df$Date.time))
max.index <- which(df$Date.time == max(df$Date.time))
Then use those arrays as your indexes.
geom_point(aes(x=Date.time[min.index], y=P[min.index], shape=ID[min.index]))+
geom_point(aes(x=Date.time[max.index], y=P[max.index], shape=ID[max.index]))
