I have estimated a set of distributions by grouping the data on time period and gender, using the following code:
df.weibull <- tapply(df$attribute, list(time=df$time, gender=df$gender), fitdist, "weibull")
I would like to graph the scale parameter of these distributions over time, with a separate line for each gender. I know I can access an individual scale parameter by:
df.weibull[1,"M"][[1]]$estimate["scale"]
but I cannot figure out how to access all the scale parameters at once in a direct manner. Solutions that either access all the parameters directly or rewrite the original call so it returns a more accessible data structure are both fine.
EDIT: Here is some code that reproduces the data structure:
gender.df <- c("M","M","M","M","M","M","F","F","F","F","F","F")
time.df <- c(1,1,1,2,2,2,1,1,1,2,2,2)
attribute.df <- c(10,20,30,11,21,31,45,55,65,1,2,3)
df <- data.frame(attribute.df,time.df,gender.df)
names(df) <- c("attribute", "time", "gender")
library(fitdistrplus)
df.weibull <- tapply(df$attribute, list(time=df$time, gender=df$gender), fitdist, "weibull")
It seems like you are trying to fit a Weibull distribution within each gender and time group. I suppose this is just a tiny subset of your dataset, because you have only 3 observations per group. How about:
library(tidyverse)
library(data.table)
sumdf <- setDT(df)[, as.list(fitdist(attribute, dist = "weibull")$estimate), by = .(time, gender)]
   time gender    shape     scale
1:    1      M 2.738085 22.587353
2:    2      M 2.893080 23.666143
3:    1      F 7.793204 58.553205
4:    2      F 2.738652  2.258509
Then you could plot e.g.:
ggplot(sumdf) + geom_line(aes(x = time, y = scale, col = gender))
Or
ggplot(sumdf) + geom_line(aes(x = time, y = shape, col = gender))
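If you would rather keep the original tapply() array from the question, here is a minimal sketch (assuming df.weibull as built above) that pulls all of the scale parameters out at once and rebuilds the grouping labels from the array's dimnames:
# sapply() walks the list-matrix in column-major order, so the time and gender
# labels can be reconstructed from the dimnames in the same order
scale.df <- data.frame(
  time   = as.numeric(rep(rownames(df.weibull), times = ncol(df.weibull))),
  gender = rep(colnames(df.weibull), each = nrow(df.weibull)),
  scale  = sapply(df.weibull, function(fit) fit$estimate["scale"])
)
ggplot(scale.df) + geom_line(aes(x = time, y = scale, col = gender))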
I am working with ggplot to plot bivariate data in groups along with standard ellipses of these data using a separate set of tools. These return n=100 x,y coordinates that define each ellipse, and then for each group, I would like to plot about 10-25 ellipses.
Conceptually, how can this be achieved? I can plot a single ellipse easily using geom_polygon, but I am confused about how to organize the data so that multiple ellipses are plotted and guides (colors, fills, linetypes, etc.) are applied per group.
In the traditional R plotting, I could just keep adding lines using a for loop.
Thanks!
UPDATE: Here is a CSV containing 100 coordinates for a single ellipse.
Data
Let's say I have three groups of bivariate data to which the ellipse fitting has been applied: Green, Red, Blue. For each group, I'd like to plot several ellipses.
I don't know how I would organize the data so that it works in the long format preferred by ggplot while preserving the group affiliations. Would a list work?
UPDATE2:
Here is a csv of raw x and y data organized into two groups: river and lake
Data
The data plot like this:
test.data <- read.csv("ellipse_test_data.csv")
ggplot(test.data) +
geom_point(aes(x, y, color = group)) +
theme_classic()
I am using a package called SIBER, which fits Bayesian ellipses to the data for comparing groups by ellipse area, etc. The output of the following creates a list with one element per group of data; each element is an n x 6 matrix (n = number of posterior draws) for the fitted ellipse, where the first four columns are the covariance matrix Sigma in vector format and the last two are the bivariate means:
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^5 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 100 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.test <- siberMVN(siber.test, parms, priors)
First few rows of the first element in the list:
$`1.river`
Sigma2[1,1] Sigma2[2,1] Sigma2[1,2] Sigma2[2,2] mu[1] mu[2]
[1,] 1.2882740 2.407070e-01 2.407070e-01 1.922637 -15.52846 12.14774
[2,] 1.0677979 -3.997169e-02 -3.997169e-02 2.448872 -15.49182 12.37709
[3,] 1.1440816 7.257331e-01 7.257331e-01 4.040416 -15.30151 12.14947
I would like to be able to extract a random sample of these ellipses and plot them with ggplot using alpha transparency.
The package SIBER has a function (addEllipse) to convert each of these entries to a set number of x and y points that define an ellipse, but I don't know how to organize that output for ggplot. I thought there might be an elegant way to do it all internally with ggplot.
The ideal output would be something like this, but in ggplot so the ellipses could match the aesthetics of the levels of data:
Here is some code to do this on the bundled demo dataset from SIBER.
In this example we try to create some plots of the multiple samples of the posterior ellipses using ggplot2.
library(SIBER)
library(ggplot2)
library(dplyr)
library(ellipse)
Fit a basic SIBER model to the example data bundled with the package.
# load in the included demonstration dataset
data("demo.siber.data")
#
# create the siber object
siber.example <- createSiberObject(demo.siber.data)
# Calculate summary statistics for each group: TA, SEA and SEAc
group.ML <- groupMetricsML(siber.example)
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^4 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 10 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.posterior <- siberMVN(siber.example, parms, priors)
# The posterior estimates of the ellipses for each group can be used to
# calculate the SEA.B for each group.
SEA.B <- siberEllipses(ellipses.posterior)
siberDensityPlot(SEA.B, xticklabels = colnames(group.ML),
xlab = c("Community | Group"),
ylab = expression("Standard Ellipse Area " ('\u2030' ^2) ),
bty = "L",
las = 1,
main = "SIBER ellipses on each group"
)
Now we want to create some plots of sample ellipses from these distributions. We need to create a data.frame object of all the ellipses for each group. In this example we simply take the first 10 posterior draws, assuming them to be independent of one another, but you could take a random sample if you prefer (see the sketch after the loop below).
# how many of the posterior draws do you want?
n.posts <- 10
# decide how big an ellipse you want to draw
p.ell <- 0.95
# for a standard ellipse use
# p.ell <- pchisq(1,2)
# a list to store the results
all_ellipses <- list()
# loop over groups
for (i in 1:length(ellipses.posterior)){
# a dummy variable to build in the loop
ell <- NULL
post.id <- NULL
for ( j in 1:n.posts){
# covariance matrix
Sigma <- matrix(ellipses.posterior[[i]][j,1:4], 2, 2)
# mean
mu <- ellipses.posterior[[i]][j,5:6]
# ellipse points
out <- ellipse::ellipse(Sigma, centre = mu , level = p.ell)
ell <- rbind(ell, out)
post.id <- c(post.id, rep(j, nrow(out)))
}
ell <- as.data.frame(ell)
ell$rep <- post.id
all_ellipses[[i]] <- ell
}
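If you would rather sample the posterior draws at random (as mentioned above) instead of taking the first n.posts rows, here is a minimal sketch of an equivalent loop; the bind_rows() step below works unchanged. The seed is an assumption, fixed purely so the sampled ellipses are reproducible.
# sample n.posts random posterior draws per group instead of the first ones;
# unname() keeps the numeric .id behaviour of bind_rows() used below
set.seed(123)
all_ellipses <- lapply(unname(ellipses.posterior), function(post) {
  draw.ids <- sample(nrow(post), n.posts)
  draws <- lapply(draw.ids, function(j) {
    Sigma <- matrix(post[j, 1:4], 2, 2)
    mu    <- post[j, 5:6]
    out   <- as.data.frame(ellipse::ellipse(Sigma, centre = mu, level = p.ell))
    out$rep <- j
    out
  })
  do.call(rbind, draws)
})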
ellipse_df <- bind_rows(all_ellipses, .id = "id")
# now we need the group and community names
# extract them from the ellipses.posterior list
group_comm_names <- names(ellipses.posterior)[as.numeric(ellipse_df$id)]
# split them and convert to a matrix, NB byrow = T
split_group_comm <- matrix(unlist(strsplit(group_comm_names, "[.]")),
nrow(ellipse_df), 2, byrow = TRUE)
ellipse_df$community <- split_group_comm[,1]
ellipse_df$group <- split_group_comm[,2]
ellipse_df <- dplyr::rename(ellipse_df, iso1 = x, iso2 = y)
Now to create the plots. First plot all the raw data as we want.
first.plot <- ggplot(data = demo.siber.data, aes(iso1, iso2)) +
geom_point(aes(color = factor(group):factor(community)), size = 2)+
ylab(expression(paste(delta^{15}, "N (\u2030)")))+
xlab(expression(paste(delta^{13}, "C (\u2030)"))) +
theme(text = element_text(size=15))
print(first.plot)
Now we can add the posterior ellipses on top and facet by group:
second.plot <- first.plot + facet_wrap(~factor(group):factor(community))
print(second.plot)
# add the posterior ellipses; ellipse_df columns were renamed above to match the aesthetics
third.plot <- second.plot +
geom_polygon(data = ellipse_df,
mapping = aes(iso1, iso2,
group = rep,
color = factor(group):factor(community),
fill = NULL),
fill = NA,
alpha = 0.2)
print(third.plot)
Facet-wrapped plot of sample of posterior ellipses by group
I've got a dataset of different energies (eV) and related counts. I changed the detection wavelength during the measurement, which resulted in a first column containing all the wavelengths and then further columns in which many rows are filled with NAs because no data were measured at that wavelength.
I would like to plot the spectra in R, but it doesn't work because the lengths of the x and y values differ between columns.
It would be great, if someone could help me.
Thank you very much.
It would be better if we could work with (simulated) data that you provide. Here's my attempt at visualizing your problem the way I see it.
library(ggplot2)
library(tidyr)
# create and fudge the data
xy <- data.frame(measurement = 1:20, red = rnorm(20), green = rnorm(20, mean = 10), uv = NA)
xy[16:20, "green"] <- NA
xy[16:20, "uv"] <- rnorm(5, mean = -3)
# flow it into "long" format
xy <- gather(xy, key = color, value = value, - measurement)
# plot
ggplot(xy, aes(x = measurement, y = value, group = color)) +
theme_bw() +
geom_line()
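A small optional variant (same simulated xy as above): drop the NA rows and map colour to the series, so each wavelength range gets its own coloured line and no missing-value warnings:
# na.omit() removes the rows where a series was not measured
ggplot(na.omit(xy), aes(x = measurement, y = value, colour = color)) +
  theme_bw() +
  geom_line()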
Let's say I have a histogram with two overlapping groups. Here's a possible command from ggplot2 and a pretend output graph.
ggplot(data, aes(x=Variable1, fill=BinaryVariable)) + geom_histogram(position="identity")
So what I have is the frequency or count of each event. What I'd like to do instead is to get the difference between the two events in each bin. Is this possible? How?
For example, if we do RED minus BLUE:
Value at x=2 would be ~ -10
Value at x=4 would be ~ 40 - 200 = -160
Value at x=6 would be ~ 190 - 25 = 155
Value at x=8 would be ~ 10
I'd prefer to do this using ggplot2, but another way would be fine. My dataframe is set up with items like this toy example (the real dimensions are 25000 rows x 30 columns).
EDITED: Here is example data to work with: GIST Example
ID Variable1 BinaryVariable
1 50 T
2 55 T
3 51 N
.. .. ..
1000 1001 T
1001 1944 T
1002 1042 N
As you can see from my example, I'm interested in a histogram to plot Variable1 (a continuous variable) separately for each BinaryVariable (T or N). But what I really want is the difference between their frequencies.
So, in order to do this we need to make sure that the "bins" we use for the histograms are the same for both levels of your indicator variable. Here's a somewhat naive solution (in base R):
df = data.frame(y = c(rnorm(50), rnorm(50, mean = 1)),
x = rep(c(0,1), each = 50))
#full hist
fullhist = hist(df$y, breaks = 20) #specify more breaks than probably necessary
#create histograms for 0 & 1 using breaks from full histogram
zerohist = with(subset(df, x == 0), hist(y, breaks = fullhist$breaks))
oneshist = with(subset(df, x == 1), hist(y, breaks = fullhist$breaks))
#combine the hists
combhist = fullhist
combhist$counts = zerohist$counts - oneshist$counts
plot(combhist)
So we specify how many breaks should be used (based on values from the histogram on the full data), and then we compute the differences in the counts at each of those breaks.
PS It might be helpful to examine what the non-graphical output of hist() is.
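For example, the pieces the trick above relies on can be inspected directly:
# hist() returns a list; breaks, counts and mids are the parts used above
str(fullhist[c("breaks", "counts", "mids")])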
Here's a solution that uses ggplot as requested.
The key idea is to use ggplot_build to get the rectangles computed by stat_bin (the stat behind geom_histogram). From that you can compute the differences in each bin and then create a new plot using geom_rect.
Set up and create a mock dataset with lognormal data
library(ggplot2)
library(data.table)
theme_set(theme_bw())
n1<-500
n2<-500
k1 <- exp(rnorm(n1,8,0.7))
k2 <- exp(rnorm(n2,10,1))
df <- data.table(k=c(k1,k2),label=c(rep('k1',n1),rep('k2',n2)))
Create the first plot
p <- ggplot(df, aes(x=k,group=label,color=label)) + geom_histogram(bins=40) + scale_x_log10()
Get the rectangles using ggplot_build
p_data <- as.data.table(ggplot_build(p)$data[[1]])[,.(count,xmin,xmax,group)]
p1_data <- p_data[group==1]
p2_data <- p_data[group==2]
Join on the x-coordinates to compute the differences. Note that the y-values aren't the counts, but the y-coordinates of the first plot.
newplot_data <- merge(p1_data, p2_data, by=c('xmin','xmax'), suffixes = c('.p1','.p2'))
newplot_data <- newplot_data[,diff:=count.p1 - count.p2]
setnames(newplot_data, old=c('count.p1','count.p2'), new=c('k1','k2'))
df2 <- melt(newplot_data,id.vars =c('xmin','xmax'),measure.vars=c('k1','diff','k2'))
Make the final plot
ggplot(df2, aes(xmin=xmin,xmax=xmax,ymax=value,ymin=0,group=variable,color=variable)) + geom_rect()
Of course the scales and legends still need to be fixed, but that's a different topic.
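One hedged follow-up on that point: because the first plot used scale_x_log10(), the xmin/xmax values coming out of ggplot_build() are already on the log10 scale, so the new plot should not be transformed again; relabelling the axes and legend is enough:
# x is already log10(k) here, so only the labels need adjusting
ggplot(df2, aes(xmin = xmin, xmax = xmax, ymax = value, ymin = 0,
                group = variable, color = variable)) +
  geom_rect() +
  labs(x = "log10(k)", y = "count", color = NULL)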
Would appreciate help with generating a 2D histogram of frequencies, where the frequencies are calculated within a column. My main issue: converting from counts to column-based frequencies.
Here's my starting code:
# expected packages
library(ggplot2)
library(plyr)
# generate example data corresponding to expected data input
x_data = sample(101:200,10000, replace = TRUE)
y_data = sample(1:100,10000, replace = TRUE)
my_set = data.frame(x_data,y_data)
# define x and y interval cut points
x_seq = seq(100,200,10)
y_seq = seq(0,100,10)
# label samples as belonging within x and y intervals
my_set$x_interval = cut(my_set$x_data,x_seq)
my_set$y_interval = cut(my_set$y_data,y_seq)
# determine count for each x,y block
xy_df = ddply(my_set, c("x_interval","y_interval"),"nrow") # still need to convert for use with dplyr
# convert from count to frequency based on formula: freq = count/sum(count in given x interval)
################ TRYING TO FIGURE OUT #################
# plot results
fig_count <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) + geom_tile(aes(fill = nrow)) # count
fig_freq <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) + geom_tile(aes(fill = freq)) # frequency
I would appreciate any help in how to calculate the frequency within a column.
Thanks!
jac
EDIT: I think the solution will require the following steps
1) Calculate and store overall counts for each x-interval factor
2) Divide the individual bin count by its corresponding x-interval factor count to obtain frequency.
Not sure how to carry this out, though.
If you want to normalize over the x_interval values, you can create a column with a count per interval and then divide by that. I must admit I'm not a ddply whiz, so maybe it has an easier way, but I would do:
xy_df$xnrows<-with(xy_df, ave(nrow, x_interval, FUN=sum))
then
fig_freq <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) +
geom_tile(aes(fill = nrow/xnrows))
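If you prefer to stay closer to the plyr/dplyr world the question hints at, here is a hedged dplyr sketch of the same normalization (each bin count divided by the total count of its x interval):
library(dplyr)
# per-column frequency: bin count divided by the total count in its x_interval
xy_df <- xy_df %>%
  group_by(x_interval) %>%
  mutate(freq = nrow / sum(nrow)) %>%
  ungroup()
fig_freq <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) +
  geom_tile(aes(fill = freq))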
I have a long format data frame where rows represent the responses (one of four categories) of different people. An example dataset is provided here:
df <- data.frame(person=c(rep("A",100),rep("B",100)),resp=c(sample(4,100,replace=TRUE),sample(4,100,replace=TRUE)))
df$resp <- factor(df$resp)
summary(df)
 person   resp
 A:100    1:52
 B:100    2:55
          3:54
          4:39
I want to present a chart where the x-axis plots the response category, the y-axis shows the proportion of responses in a category, and where error bars are calculated via bootstrapping (sampling with replacement).
I can calculate the proportion (in an extremely kludgy way; I'm sure this could be improved but this is not my main concern):
pFrame <- ddply(df,.(person,resp),summarise,trials = length(resp))
# can't figure out how to calculate the proportion with plyr.
pFrame$prop <- NA
people <- unique(df$person)
responses <- unique(df$resp)
for (i in 1 : length(people)){
nTrials <- nrow(subset(df,person==people[i]))
for (j in 1 : 4){
pFrame$prop[pFrame$person==people[i] & pFrame$resp==responses[j]] <- pFrame$trials[pFrame$person==people[i] & pFrame$resp==responses[j]] / nTrials
}
}
and plot it:
ggplot(pFrame,aes(x=resp,y=prop,colour=person)) + geom_point()
but I would really like to use something like stat_summary(fun.data="mean_cl_boot") to show the variability of the proportions (i.e. acting on the original data frame df and bootstrapping over the rows). I've made a few attempts at writing custom functions, but this doesn't seem trivial because the factor levels need to be transformed for the bootstrap first.
I couldn't get ggplot's "mean_cl_boot" to work. Here is an alternative solution though:
library(boot)
library(reshape2)  # for melt()
library(ggplot2)
summary_for_plot <- melt(prop.table(table(df), 1))
names(summary_for_plot) <- c("person", "resp", "V1")
# function for boot()
summary_function <- function(df, d){
melt(prop.table(table(df[d,]), 1))[, 3]
}
bootres <- boot(df, statistic = summary_function, R=100)
# get the standard deviation, used for the confidence intervals
summary_for_plot$sd <- sd(bootres$t)
ggplot(summary_for_plot, aes(x= resp, y = V1, color = person)) + geom_point() +
geom_errorbar(aes(ymin = V1-sd, ymax = V1+sd), width = 0.2)
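Note that sd(bootres$t) above assigns a single pooled standard deviation to every point. If you want a separate interval per person/response combination, a small sketch: the columns of bootres$t come back in the same order as the rows of summary_for_plot, so a column-wise sd lines up directly.
# one sd per (person, resp) cell, taken column-wise over the bootstrap replicates
summary_for_plot$sd <- apply(bootres$t, 2, sd)
The same ggplot call as above then draws per-point error bars.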