Apply MASS::fitdistr to multiple data by a factor - r

My question is at the end in bold.
I know how to fit the beta distribution to some data. For instance:
library(Lahman)
library(dplyr)
# clean up the data and calculate batting averages by playerID
batting_by_decade <- Batting %>%
filter(AB > 0) %>%
group_by(playerID, Decade = round(yearID - 5, -1)) %>%
summarize(H = sum(H), AB = sum(AB)) %>%
ungroup() %>%
filter(AB > 500) %>%
mutate(average = H / AB)
# fit the beta distribution
library(MASS)
m <- MASS::fitdistr(batting_by_decade$average, dbeta,
start = list(shape1 = 1, shape2 = 10))
alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]
# plot the histogram of data and the beta distribution
library(ggplot2)
ggplot(batting_by_decade) +
geom_histogram(aes(average, y = ..density..), binwidth = .005) +
stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red",
size = 1) +
xlab("Batting average")
Which yields a histogram of the batting averages with the fitted beta density drawn over it in red.
Now I want to calculate separate beta parameters alpha0 and beta0 for each value of batting_by_decade$Decade, so that I end up with 15 parameter sets and 15 beta distributions that I can overlay on this ggplot of batting averages faceted by Decade:
batting_by_decade %>%
ggplot() +
geom_histogram(aes(x=average)) +
facet_wrap(~ Decade)
I can hard-code this by filtering for each decade and passing that decade's worth of data into the fitdistr function, repeating this for all decades, but is there a way of calculating all beta parameters per decade quickly and reproducibly, perhaps with one of the apply functions?

You can leverage summarise together with two custom functions for this:
getAlphaEstimate = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate[1]}
getBetaEstimate = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate[2]}
batting_by_decade %>%
group_by(Decade) %>%
summarise(alpha = getAlphaEstimate(average),
beta = getBetaEstimate(average)) -> decadeParameters
However, you will not be able to plot it with stat_summary according to Hadley's post here: https://stackoverflow.com/a/1379074/3124909

Here's an example of how you'd go from generating dummy data all the way through to plotting.
temp.df <- data_frame(yr = 10*187:190,
al = rnorm(length(yr), mean = 4, sd = 2),
be = rnorm(length(yr), mean = 10, sd = 2)) %>%
group_by(yr, al, be) %>%
do(data_frame(dats = rbeta(100, .$al, .$be)))
First I made up some shape parameters for four years, grouped by each combination, and then used do to create a dataframe with 100 samples from each distribution. Aside from knowing the "true" parameters, this dataframe should look a lot like your original data: a vector of samples with an associated year.
temp.ests <- temp.df %>%
group_by(yr, al, be) %>%
summarise(ests = list(MASS::fitdistr(dats, dbeta, start = list(shape1 = 1, shape2 = 1))$estimate)) %>%
unnest %>%
mutate(param = rep(letters[1:2], length(ests)/2)) %>%
spread(key = param, value = ests)
This is the bulk of your question here, very much solved the way you solved it. If you step through this snippet line by line, you'll see you have a dataframe with a column of type list, containing <dbl [2]> in each row. When you unnest() it splits those two numbers into separate rows, so then we identify them by adding a column that goes "a, b, a, b, ..." and spread them back apart to get two columns with one row for each year. Here you can also see how closely fitdistr matched the true population we sampled from, looking at a vs al and b vs be.
temp.curves <- temp.ests %>%
group_by(yr, al, be, a, b) %>%
do(data_frame(prop = 1:99/100,
trueden = dbeta(prop, .$al, .$be),
estden = dbeta(prop, .$a, .$b)))
Now we turn that process inside out to generate the data to plot the curves. For each row, we use do to make a dataframe with a sequence of values prop, and calculate the beta density at each value for both the true population parameters and our estimated sample parameters.
ggplot() +
geom_histogram(data = temp.df, aes(dats, y = ..density..), colour = "black", fill = "white") +
geom_line(data = temp.curves, aes(prop, trueden, color = "population"), size = 1) +
geom_line(data = temp.curves, aes(prop, estden, color = "sample"), size = 1) +
geom_text(data = temp.ests,
aes(1, 2, label = paste("hat(alpha)==", round(a, 2))),
parse = T, hjust = 1) +
geom_text(data = temp.ests,
aes(1, 1, label = paste("hat(beta)==", round(b, 2))),
parse = T, hjust = 1) +
facet_wrap(~yr)
Finally we put it together, plotting a histogram of our sample data. Then a line from our curve data for the true density. Then a line from our curve data for our estimated density. Then some labels from our parameter estimate data to show the sample parameters, and facets by year.

This is an apply solution, but I prefer @CMichael's dplyr solution.
calc_beta <- function(decade){
dummy <- batting_by_decade %>%
dplyr::filter(Decade == decade) %>%
dplyr::select(average)
m <- fitdistr(dummy$average, dbeta, start = list(shape1 = 1, shape2 = 10))
alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]
return(c(alpha0,beta0))
}
decade <- seq(1870, 2010, by =10)
params <- sapply(decade, calc_beta)
colnames(params) <- decade
Re: @CMichael's comment about avoiding a double fitdistr, we could rewrite the function to getAlphaBeta.
getAlphaBeta = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate}
batting_by_decade %>%
group_by(Decade) %>%
summarise(params = list(getAlphaBeta(average))) -> decadeParameters
decadeParameters$params[1] # it works!
Now we just need to unlist the second column in a nice way....
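One possible way to spread that list column into two ordinary columns is tidyr::unnest_wider(), which uses the names of each estimate vector (shape1, shape2) as column names. The sketch below runs on made-up beta samples rather than the Lahman data; the toy data frame and the renamed columns alpha and beta are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)

set.seed(42)
# Made-up stand-in for batting_by_decade: two "decades" of beta-distributed averages
toy <- data.frame(
  Decade  = rep(c(1990, 2000), each = 300),
  average = c(rbeta(300, 4, 10), rbeta(300, 5, 12))
)

# Same helper as above: returns the named vector c(shape1 = ..., shape2 = ...)
getAlphaBeta <- function(x) {
  MASS::fitdistr(x, dbeta, start = list(shape1 = 1, shape2 = 10))$estimate
}

decadeParameters <- toy %>%
  group_by(Decade) %>%
  summarise(params = list(getAlphaBeta(average))) %>%  # one fit per decade
  unnest_wider(params) %>%                             # list column -> shape1, shape2
  rename(alpha = shape1, beta = shape2)

decadeParameters
```

fitdistr may emit NaN warnings while the optimiser explores invalid shape values; the estimates are still returned.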

Related

Referring to the input data of ggplot and use that in a custom function within a geom

I'm using ggplot geom_vline in combination with a custom function to plot certain values on top of a histogram.
The example function below e.g. returns a vector of three values (the mean and x sds below or above the mean). I can now plot these values in geom_vline(xintercept) and see them in my graph.
#example function
sds_around_the_mean <- function(x, multiplier = 1) {
mean <- mean(x, na.rm = TRUE)
sd <- sd(x, na.rm = TRUE)
tibble(low = mean - multiplier * sd,
mean = mean,
high = mean + multiplier * sd) %>%
pivot_longer(cols = everything()) %>%
pull(value)
}
Reproducible data
#data
set.seed(123)
normal <- tibble(data = rnorm(1000, mean = 100, sd = 5))
outliers <- tibble(data = runif(5, min = 150, max = 200))
df <- bind_rows(lst(normal, outliers), .id = "type")
df %>%
ggplot(aes(x = data)) +
geom_histogram(bins = 100) +
geom_vline(xintercept = sds_around_the_mean(df$data, multiplier = 3),
linetype = "dashed", color = "red") +
geom_vline(xintercept = sds_around_the_mean(df$data, multiplier = 2),
linetype = "dashed")
The problem is that, as you can see, I would have to reference df$data in several places.
This becomes more error-prone when I apply any change to the original df that I pipe into ggplot, e.g. filtering out outliers before plotting. I would have to apply the same changes again at multiple places.
E.g.
df %>% filter(type == "normal")
#also requires
df$data
#to be changed to
df$data[df$type == "normal"]
#in geom_vline to obtain the correct input values for the xintercept.
So instead, how could I replace the df$data argument with the respective column of whatever has been piped into ggplot() in the first place? Something similar to the "." operator, I assume. I've also tried stat_summary with geom = "vline" to achieve this, but without the desired effect.
You can enclose the ggplot part in curly brackets and reference the incoming dataset with the . symbol both in the ggplot command and when calculating the sds_around_the_mean. This will make it dynamic.
df %>%
{ggplot(data = ., aes(x = data)) +
geom_histogram(bins = 100) +
geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 3),
linetype = "dashed", color = "red") +
geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 2),
linetype = "dashed")}
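If the magrittr `.` placeholder is not available (the native `|>` pipe, for instance, does not provide it), another option is to wrap the plot in a small function so the data frame is named only once, as the function argument. This is a sketch under that assumption; sd_lines and plot_with_sd_lines are hypothetical names, and sd_lines is a simplified base R stand-in for sds_around_the_mean.

```r
library(ggplot2)

# Hypothetical helper: mean and mean +/- multiplier * sd as a plain numeric vector
sd_lines <- function(x, multiplier = 1) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(m - multiplier * s, m, m + multiplier * s)
}

# Hypothetical wrapper: whatever is passed in as `d` feeds both the
# histogram and the vertical lines, so filtering happens in one place
plot_with_sd_lines <- function(d) {
  ggplot(d, aes(x = data)) +
    geom_histogram(bins = 100) +
    geom_vline(xintercept = sd_lines(d$data, multiplier = 3),
               linetype = "dashed", color = "red") +
    geom_vline(xintercept = sd_lines(d$data, multiplier = 2),
               linetype = "dashed")
}

set.seed(123)
df <- data.frame(
  type = rep(c("normal", "outliers"), times = c(1000, 5)),
  data = c(rnorm(1000, mean = 100, sd = 5), runif(5, min = 150, max = 200))
)

p_all    <- plot_with_sd_lines(df)
p_normal <- plot_with_sd_lines(subset(df, type == "normal"))
```

Filtering the input changes both the histogram and the lines, because both read from the same argument.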

Extrapolate dataset with limited data points and add all values to new dataset

I have a dataset with very limited data points.
x<- c(4, 8, 13, 24)
y<- c(40, 37, 28, 20)
df<- data.frame(x,y)
Now I want to extrapolate this data, creating a dataset where the value of y will be given for every value (no decimals) of x between 1-100. x and y have a linear relationship.
Secondly, could this be done for multiple dataframes by using something like a loop?
Thank you!
This is a short snippet that does this:
linear_xy <- lm(y ~ x, data = df)
# df <- broom:::augment.lm(linear_xy, newdata = complete(df, x = 1:100)) # one way
df <- df %>% # another way
complete(x = 1:100) %>%
mutate(.fitted = predict(linear_xy, newdata = .))
ggplot(df, aes(x, y)) +
geom_line(aes(y = .fitted)) +
geom_point() +
ggpubr::theme_pubr()
This requires that you have the packages {tidyverse}, {broom}, and {ggpubr} installed.
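For comparison, the same extrapolation can be sketched in base R alone, with lm() and predict() on a grid of x values. The second data frame in the list is made up purely to show the loop over several inputs; the names extrapolated, dfs and predictions are hypothetical.

```r
x <- c(4, 8, 13, 24)
y <- c(40, 37, 28, 20)
df <- data.frame(x, y)

# Fit the line once, then predict y for every integer x from 1 to 100
fit <- lm(y ~ x, data = df)
extrapolated <- data.frame(x = 1:100)
extrapolated$y_hat <- predict(fit, newdata = extrapolated)

# Several data frames: keep them in a list and apply the same steps to each
dfs <- list(first = df, second = transform(df, y = y + 5))  # `second` is made up
predictions <- lapply(dfs, function(d) {
  grid <- data.frame(x = 1:100)
  grid$y_hat <- predict(lm(y ~ x, data = d), newdata = grid)
  grid
})

head(predictions$first)
```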
Second part
Assuming we want to do this with multiple data-frames, we have to
restructure things a bit.
x <- c(4, 8, 13, 24)
y <- c(40, 37, 28, 20)
df <- tibble(x, y)
I don't have multiple data-frames (or tibbles), so I'll make this the
primary one, and make up a function (a factory) that yields data-frames, that are a bit different from the above df.
df_factory <- . %>%
mutate(x_new = x + sample.int(100, size = n()),
x = if_else(x_new >= 100, x, x_new),
y_new = y + rnorm(n(), mean = median(y), sd = sd(y)),
y = y_new,
y_new = NULL,
x_new = NULL)
Thus df_factory is a function of one variable, which must be a
data-frame that has an x and a y column:
df1 <- df_factory(df)
df2 <- df_factory(df)
df3 <- df_factory(df)
all_dfs <- list(df1, df2, df3)
all_dfs <- bind_rows(all_dfs, .id = "df_id")
Here we ensure that the relation to the original data-frame is preserved in the all_dfs data-frame via the new variable df_id.
Next we want to:
Collapse the variables into their individual data-frame, and we put
that in a list-column named data.
For each (see rowwise) we have to perform:
An "interpolating" linear model (not a piece-wise one so...)
Predict on each of these linear_xy (which are also stored in a list-column).
Unnest it all back, so it can be fed into ggplot as one contiguous data-frame.
all_dfs %>%
nest(data = c(x,y)) %>%
rowwise() %>%
mutate(linear_xy = list(lm(y ~ x, data = data)),
augment = list(broom:::augment.lm(linear_xy,
newdata = complete(data, x = 1:100)))) %>%
ungroup() %>%
select(-data, -linear_xy) %>%
unnest(augment) ->
all_dfs_predictions
Note: -> at the end shows what the pipe result is now assigned to.
The group aesthetic tells ggplot to treat the rows as separate via their
df_id. And for fun we make color and fill depend on df_id as well. In fact I could have chosen something else for the color aesthetic to depend on, like "original df" vs. "others", or whether a certain threshold distinguishes them, etc. But then the group aesthetic would still tell ggplot to separate the rows by df_id.
ggplot(all_dfs_predictions, aes(x, y, group = df_id, color = df_id, fill = df_id)) +
geom_line(aes(y = .fitted)) +
geom_point() +
lims(x = c(1,100)) +
ggpubr::theme_pubr()

Adding a power curve to scatterplot

I want to add a power curve with confidence intervals to my diameter-weight relationship, which clearly follows a y=a*x^b regression. So far, I have used the geom_smooth "loess" version, but this is not quite right. Any suggestion how to add a power regression line would be much appreciated. Below is the code used:
p2<-ggplot(Data,aes(x=Diameter,y=Wet_weight,colour=Site))+
geom_point(size=3.5,alpha=0.3)+
geom_smooth(aes(group=Species),method=loess,colour="black")+
labs(x="\nUmbrella diameter (mm)",y="Wet weight (mg)\n")+theme_classic()+
scale_colour_manual(values=c("black","dark blue","blue","dark green","green"))+
theme(axis.title.x=element_text(size=20),
axis.text.x=element_text(size=18,colour="black"),
axis.title.y=element_text(size=20),
axis.text.y=element_text(size=18,colour="black"),
axis.ticks=element_line(colour="black",size=1),
axis.line=element_line(colour="black",size=1,linetype="solid"),
legend.position=c(0.18,0.75),
legend.text=element_text(colour="black",size=17),
legend.title=element_text(colour="black",size=18))
p2
Thank you!
I used this to get many equations, R2, and plots.
df= #change your data frame so it fits the current code
variables=c("group","year") #if you have multiple groups/seasons/years/elements add them here
df$y= #which variable will be your y
df$x= #which variable will be your x
#No changes get the equations
text=df %>%
group_by(across(all_of(variables))) %>% #your grouping variables
do(broom::tidy(lm(log(y) ~ log(x), data = .))) %>%
ungroup() %>%
mutate(y = round(ifelse(term=='(Intercept)',exp(estimate),estimate),digits = 2)) %>% #your equation values rounded to 2
select(-estimate,-std.error,-statistic ,-p.value) %>%
pivot_wider(names_from = term,values_from = y) %>%
rename(.,a=`(Intercept)`,b=`log(x)`)
#CHANGE before running!! add your grouping variables
rsq=df %>%
split(list(.$group,.$year)) %>% #---- HERE add the names after $
map(~lm(log(y) ~ log(x), data = .)) %>%
map(summary) %>%
map_dbl("r.squared") %>%
data.frame()
#Join the R2 and y results for the plot in a single data frame and write the equations
labels.df=mutate(rsq,groups=row.names(rsq)) %>%
separate(col = groups,into = c(variables),sep = "[.]",
convert = TRUE, remove = T, fill = "right") %>%
rename("R"='.') %>%
left_join(text,.) %>%
mutate(R=round(R,digits = 4), #round your R2 digits
eq= paste('y==',a,"~x^(",b,")", sep = ""),
rsql=paste("R^2==",R),
full= paste('y==',a,"~x^(",b,")","~~R^2==",R, sep = ""))
# plot
ggplot(df,aes(x = x,y = y)) +
geom_point(size=4,mapping = aes(
colour=factor(ifelse(is.na(get(variables[2])),"",(get(variables[2])))), #points colour
shape=get(variables[1]))) + # different shapes
facet_wrap(get(variables[1])~ifelse(is.na(get(variables[2])),"",get(variables[2])),
scales = "free",labeller = labeller(.multi_line = F))+ #for multiple groups; join text in one line
stat_smooth(mapping=aes(colour=get(variables[1])), #colours for our trend
method = 'nls', formula = 'y~a*x^b',
method.args = list(start=c(a=1,b=1)),se=FALSE) +
geom_text(labels.df,x = Inf, y = Inf,size=5, mapping = aes(label = (eq)), parse = T,vjust=1, hjust=1)+
geom_text(labels.df,x = Inf,y = Inf,size=5, mapping = aes(label = (rsql)), parse = T,vjust=2.5, hjust=1)+
#scale_y_log10() + #add this to avoid problems with big y values
labs(x="Your x label",y="your y label")+
theme_bw(base_size = 16) +
theme(legend.position = "none",
strip.background = element_rect(fill="#b2d6e2"))
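As a smaller, self-contained illustration of the same idea, the sketch below fits y = a*x^b through a log-log linear model and back-transforms the confidence band to the original scale. The data are simulated with made-up parameters (a = 0.05, b = 2.7), and the names sim, curve_df, a_hat and b_hat are hypothetical. Note that the back-transformed curve from a log-log fit describes the median of y rather than its mean.

```r
library(ggplot2)

set.seed(1)
# Simulated data (made-up values): y = 0.05 * x^2.7 with multiplicative noise
sim <- data.frame(x = runif(200, min = 5, max = 60))
sim$y <- 0.05 * sim$x^2.7 * exp(rnorm(200, sd = 0.15))

# Log-log linear fit: log(y) = log(a) + b * log(x)
fit   <- lm(log(y) ~ log(x), data = sim)
a_hat <- exp(coef(fit)[[1]])
b_hat <- coef(fit)[[2]]

# Confidence band on the log scale, back-transformed with exp()
grid     <- data.frame(x = seq(min(sim$x), max(sim$x), length.out = 100))
band     <- exp(predict(fit, newdata = grid, interval = "confidence"))
curve_df <- cbind(grid, as.data.frame(band))  # columns: x, fit, lwr, upr

p <- ggplot(sim, aes(x, y)) +
  geom_point(alpha = 0.3) +
  geom_ribbon(data = curve_df, aes(x = x, ymin = lwr, ymax = upr),
              inherit.aes = FALSE, alpha = 0.3) +
  geom_line(data = curve_df, aes(x = x, y = fit))
```

For grouped data such as the Species groups in the question, the same fit would have to be repeated per group before plotting.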

How to plot line graph of normalized differences from binned data with ggplot?

I have several sets of data that I calculate binned normalized differences for. The results I want to plot within a single line plot using ggplot. The lines representing different combinations of the paired differences are supposed to be distinguished by colors and line types.
I am stuck on taking the computed values from the bins (would be y-axis values now), and plotting these onto an x-axis.
Below is the code I use for importing the data and calculating the normalized differences.
# Read data from column 3 as data table for different number of rows
# you could use replicate here for test
# dat1 <- data.frame(replicate(1,sample(25:50,10000,rep=TRUE)))
# dat2 <- data.frame(replicate(1,sample(25:50,9500,rep=TRUE)))
dat1 <- fread("/dir01/a/dat01.txt", header = FALSE, data.table=FALSE, select=c(3))
dat2 <- fread("/dir02/c/dat02.txt", header = FALSE, data.table=FALSE, select=c(3))
# Change column names
colnames(dat1) <- c("Dat1")
colnames(dat2) <- c("Dat2")
# Perhaps there is a better way to compute the following as all-in-one? I have broken these down step by step.
# 1) Sum for each bin
bin1 = cut(dat1$Dat1, breaks = seq(25, 50, by = 2))
sum1 = tapply(dat1$Dat1, bin1, sum)
bin2 = cut(dat2$Dat2, breaks = seq(25, 50, by = 2))
sum2 = tapply(dat2$Dat2, bin2, sum)
# 2) Total sum of all bins
sumt1 = sum(sum1)
sumt2 = sum(sum2)
# 3) Divide each bin by total sum of all bins
sumn1 = lapply(sum1, `/`, sumt1)
sumn2 = lapply(sum2, `/`, sumt2)
# 4) Convert to data frame as I'm not sure how to difference otherwise
df_sumn1 = data.frame(sumn1)
df_sumn2 = data.frame(sumn2)
# 5) Difference between the two as percentage
dbin = (df_sumn1 - df_sumn2)*100
How can I plot those results using ggplot() and geom_line()?
I want
dbin values on the x-axis ranging from 25-50
different colors and line types for the lines
Here is what I tried:
p1 <- ggplot(dbin, aes(x = ?, color=Data, linetype=Data)) +
geom_line() +
scale_linetype_manual(values=c("solid")) +
scale_x_continuous(limits = c(25, 50)) +
scale_color_manual(values = c("#000000"))
dput(dbin) outputs:
structure(list(X.25.27. = -0.0729132928804117, X.27.29. = -0.119044772581772,
X.29.31. = 0.316016473225017, X.31.33. = -0.292812782147632,
X.33.35. = 0.0776336591308158, X.35.37. = 0.0205584754637611,
X.37.39. = -0.300768421159599, X.39.41. = -0.403235174844081,
X.41.43. = 0.392510458816457, X.43.45. = 0.686758883448307,
X.45.47. = -0.25387105113263, X.47.49. = -0.0508324553382303), class = "data.frame", row.names = c(NA,
-1L))
Edit
The final piece of code that works, using only the dbin and plots multiple dbins:
dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100)))
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100)))
dat3 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 12:37/100)))
dat4 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 37:12/100)))
calc_bin_props <- function(data) {
as_tibble(data) %>%
mutate(bin = cut(a, breaks = seq(25, 50, by = 2))) %>%
group_by(bin) %>%
summarise(sum = sum(a), .groups = "drop") %>%
filter(!is.na(bin)) %>%
ungroup() %>%
mutate(sum = sum / sum(sum))
}
diff_data <-
full_join(
calc_bin_props(data = dat1),
calc_bin_props(dat2),
by = "bin") %>%
separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
mutate(dbinA = (sum.x - sum.y) * 100) %>%
select(-starts_with("trsh"))
diff_data2 <-
full_join(
calc_bin_props(data = dat3),
calc_bin_props(dat4),
by = "bin") %>%
separate(bin, c("trsh", "bin", "trshb", "trshc")) %>%
mutate(dbinB = (sum.x - sum.y) * 100) %>%
select(-starts_with("trsh"))
# Combine two differences, and remove sum.x and sum.y
full_data <- cbind(diff_data, diff_data2[,4])
full_data <- full_data[,-c(2:3)]
# Melt the data to plot more than 1 variable on a plot
m <- melt(full_data, id.vars="bin")
theme_update(plot.title = element_text(hjust = 0.5))
ggplot(m, aes(as.numeric(bin), value, col=variable, linetype = variable)) +
geom_line() +
scale_linetype_manual(values=c("solid", "longdash")) +
scale_color_manual(values = c("black", "black"))
dev.off()
library(tidyverse)
Creating example data as shown in the question, but adding different probabilities to the two sample() calls, to create a visible difference
between the two sets of randomized data.
dat1 <- data.frame(a = replicate(1,sample(25:50,10000,rep=TRUE, prob = 25:0/100))) %>% as_tibble()
dat2 <- data.frame(a = replicate(1,sample(25:50,9500,rep=TRUE, prob = 0:25/100))) %>% as_tibble()
Using dplyr we can handle this within data.frames (tibbles) without
the need to switch to other datatypes.
Let’s define a function that can be applied to both datasets to get
the preprocessing done.
We use base::cut() to create
a new column that pairs each value with its bin. We then group the data
by bin, calculate the sum for each bin and finally divide the bin sums
by the total sum.
calc_bin_props <- function(data) {
as_tibble(data) %>%
mutate(bin = cut(a, breaks = seq(25, 50, by = 2), labels = seq(25, 48, by = 2))) %>%
group_by(bin) %>%
summarise(sum = sum(a), .groups = "drop") %>%
filter(!is.na(bin)) %>%
ungroup() %>%
mutate(sum = sum / sum(sum))
}
Now we call calc_bin_props() on both datasets and join them by bin.
This gives us a dataframe with the columns bin, sum.x and sum.y.
The latter two correspond to the bin sums derived from dat1 and
dat2. With the mutate() line we calculate the differences between the
two columns.
diff_data <-
full_join(
calc_bin_props(data = dat1),
calc_bin_props(dat2),
by = "bin") %>%
mutate(dbin = (sum.x - sum.y),
bin = as.numeric(as.character(bin))) %>%
select(-starts_with("trsh"))
Before we feed the data into ggplot() we convert it to the long
format using pivot_longer(). This allows us to instruct ggplot() to
plot the results for sum.x, sum.y and dbin as separate lines.
diff_data %>%
pivot_longer(-bin) %>%
ggplot(aes(as.numeric(bin), value, color = name, linetype = name)) +
geom_line() +
scale_linetype_manual(values=c("longdash", "solid", "solid")) +
scale_color_manual(values = c("black", "purple", "green"))
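The question also asked whether the five numbered steps could be computed all in one go. A base R sketch of that, keeping the same breaks, could look like the following; bin_props and plot_df are hypothetical names, and the two vectors are random stand-ins for the Dat1 and Dat2 columns.

```r
# Hypothetical helper: per-bin sums divided by the total over all bins
bin_props <- function(v, breaks = seq(25, 50, by = 2)) {
  sums <- tapply(v, cut(v, breaks = breaks), sum)
  sums[is.na(sums)] <- 0   # empty bins contribute nothing
  sums / sum(sums)
}

set.seed(1)
v1 <- sample(25:50, 10000, replace = TRUE)  # stand-in for dat1$Dat1
v2 <- sample(25:50, 9500, replace = TRUE)   # stand-in for dat2$Dat2

# Difference between the two normalized bin sums, as a percentage
dbin <- (bin_props(v1) - bin_props(v2)) * 100

# One row per bin, with the lower bin edge as a numeric x value for geom_line()
plot_df <- data.frame(bin = seq(25, 47, by = 2), dbin = as.numeric(dbin))
```

As in the original steps, values outside the intervals (here 25 and 50) fall into no bin and are dropped.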

Reordering a factor based on a summary statistic of a subset of the data

I'm trying to reorder a factor from a subset of my data frame, defined by another factor using forcats::fct_reorder().
Consider the following data frame df:
set.seed(12)
df <- data.frame(fct1 = as.factor(rep(c("A", "B", 'C'), each = 200)),
fct2 = as.factor(rep(c("j", "k"), each = 100)),
val = c(rnorm(100, 2), # A - j
rnorm(100, 1), # A - k
rnorm(100, 1), # B - j
rnorm(100, 6), # B - k
rnorm(100, 8), # C - j
rnorm(100, 4)))# C - k
I want to plot facetted group densities using the ggridges package. For example:
ggplot(data = df, aes(y = fct2, x = val)) +
stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = T,
quantile_fun = median,
quantile_lines = T) +
facet_wrap(~fct1, ncol = 1)
I would now like to order fct1 by the median (default in fct_reorder()) of the values of the upper density in each facet, i.e. where fct2 == "k". The goal in this example would therefore be that the facets appear in the order B - C - A.
This seems very similar to this question here, with the difference that I do not want to summarize the data first because I need the raw data to plot the densities.
I've tried to adapt the code in the answer of the linked question:
df <- df %>% mutate(fct1 = forcats::fct_reorder(fct1, filter(., fct2 == 'k') %>% pull(val)))
But it returns the following error:
Error in forcats::fct_reorder(fct1, filter(., fct2 == "k") %>% pull(val)) :
length(f) == length(.x) is not TRUE
It's obvious that they are not the same length, but I don't quite get why this error is necessary. My guess is that it's generally not guaranteed that all levels of fct1 are present in the subset, which would certainly be problematic. Yet, this isn't the case in my example. Is there a way to work around this error or am I doing something wrong more generally?
I'm aware that I can work around this with a couple of lines of extra code, e.g. create a helper variable of the subsetted data, reorder that and then take the level order to my factor in the original data set. I would still like a prettier solution, because I regularly face that very same task.
You can do this with a little helper function:
f <- function(i) -median(df$val[df$fct2 == "k" & df$fct1 == df$fct1[i]])
Which allows you to reorder like this:
df$fct1 <- forcats::fct_reorder(df$fct1, sapply(seq(nrow(df)), f))
Which gives you this plot:
ggplot(data = df, aes(y = fct2, x = val)) +
stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = T,
quantile_fun = median,
quantile_lines = T) +
facet_wrap(~fct1, ncol = 1)
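Since the task comes up regularly, the same reordering can also be wrapped in a small reusable function that computes the statistic once per level instead of once per row. This is a base R sketch; reorder_by_subset is a hypothetical name, not a forcats function.

```r
# Hypothetical helper: order the levels of `f` by `fun` applied to `x`,
# using only the rows where `keep` is TRUE
reorder_by_subset <- function(f, x, keep, fun = median, decreasing = TRUE) {
  f <- as.factor(f)
  stat <- tapply(x[keep], f[keep], fun)  # one value per level of f
  # na.last = TRUE keeps levels that are absent from the subset, at the end
  factor(f, levels = names(sort(stat, decreasing = decreasing, na.last = TRUE)))
}

set.seed(12)
df <- data.frame(fct1 = as.factor(rep(c("A", "B", "C"), each = 200)),
                 fct2 = as.factor(rep(c("j", "k"), each = 100)),
                 val = c(rnorm(100, 2), rnorm(100, 1),   # A - j, A - k
                         rnorm(100, 1), rnorm(100, 6),   # B - j, B - k
                         rnorm(100, 8), rnorm(100, 4)))  # C - j, C - k

df$fct1 <- reorder_by_subset(df$fct1, df$val, keep = df$fct2 == "k")
levels(df$fct1)
```

The raw data stay untouched, so the density plot above works unchanged on the reordered factor.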
