Bootstrap a proportion (of factor levels) in ggplot2 - r

I have a long format data frame where rows represent the responses (one of four categories) of different people. An example dataset is provided here:
df <- data.frame(person=c(rep("A",100),rep("B",100)),resp=c(sample(4,100,replace=TRUE),sample(4,100,replace=TRUE)))
df$resp <- factor(df$resp)
summary(df)
 person   resp
 A:100    1:52
 B:100    2:55
          3:54
          4:39
I want to present a chart where the x-axis plots the response category, the y-axis shows the proportion of responses in a category, and where error bars are calculated via bootstrapping (sampling with replacement).
I can calculate the proportion (in an extremely kludgy way; I'm sure this could be improved but this is not my main concern):
pFrame <- ddply(df,.(person,resp),summarise,trials = length(resp))
# can't figure out how to calculate the proportion with plyr.
pFrame$prop <- NA
people <- unique(df$person)
responses <- unique(df$resp)
for (i in 1 : length(people)){
nTrials <- nrow(subset(df,person==people[i]))
for (j in 1 : 4){
pFrame$prop[pFrame$person==people[i] & pFrame$resp==responses[j]] <- pFrame$trials[pFrame$person==people[i] & pFrame$resp==responses[j]] / nTrials
}
}
and plot it:
ggplot(pFrame,aes(x=resp,y=prop,colour=person)) + geom_point()
but I would really like to use something like stat_summary(fun.data="mean_cl_boot") to show the variability on the proportions (i.e. acting on the original data frame df, and bootstrapping over the rows). I've tried a few attempts at creating custom functions but this doesn't seem trivial because the factor levels need to be transformed for the bootstrap first.

I couldn't get ggplot's "mean_cl_boot" to work. Here is an alternative solution though:
library(boot)
library(reshape2) # provides melt()
library(ggplot2)
summary_for_plot <- melt(prop.table(table(df), 1))
names(summary_for_plot) <- c("person", "resp", "V1")
# function for boot()
summary_function <- function(df, d){
melt(prop.table(table(df[d,]), 1))[, 3]
}
bootres <- boot(df, statistic = summary_function, R=100)
# bootstrap standard error of each proportion (bootres$t has one column per
# person/response cell), used as the half-width of the error bars
summary_for_plot$sd <- apply(bootres$t, 2, sd)
ggplot(summary_for_plot, aes(x= resp, y = V1, color = person)) + geom_point() +
geom_errorbar(aes(ymin = V1-sd, ymax = V1+sd), width = 0.2)
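If percentile intervals are preferred over plus or minus one bootstrap standard error, the same resampling can be sketched in base R alone. This is only an illustration under assumptions: the names prop_vec, boot_props and boot_summary are made up here, the data are simulated as in the question, and R = 500 resamples is an arbitrary choice.

```r
set.seed(42)
# Simulated data in the same shape as the question's df
df <- data.frame(
  person = factor(rep(c("A", "B"), each = 100)),
  resp   = factor(sample(4, 200, replace = TRUE), levels = 1:4)
)

# Proportion of each response level within each person, flattened
# column-major (person varies fastest)
prop_vec <- function(d) as.vector(prop.table(table(d$person, d$resp), 1))

# Resample the rows with replacement R times; one column per resample
R <- 500
boot_props <- replicate(R, prop_vec(df[sample(nrow(df), replace = TRUE), ]))

# Point estimates and 95% percentile intervals for each person/response cell
est <- prop.table(table(df$person, df$resp), 1)
ci  <- apply(boot_props, 1, quantile, probs = c(0.025, 0.975))

boot_summary <- data.frame(
  person = rep(rownames(est), times = ncol(est)),
  resp   = rep(colnames(est), each = nrow(est)),
  prop   = as.vector(est),
  lower  = ci[1, ],
  upper  = ci[2, ]
)
boot_summary
```

The resulting data frame can be passed to geom_errorbar(aes(ymin = lower, ymax = upper)) in place of the V1 - sd / V1 + sd bounds.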

Related

Export results from LOESS plot

I am trying to export the underlying data from a LOESS plot (the blue line).
I found this post on the subject and was able to get it to export like the post says:
Can I export the result from a loess regression out of R?
However, as the last comment from the poster in that post says, I am not getting the results for my LOESS line. Does anyone have any insights on how to get it to export properly?
Thanks!
Code for my export is here:
#loess object
CL111_loess <- loess(dur_cleaned~TS_LightOn, data = CL111)
#get SE
CL111_predict <- predict(CL111_loess, se = TRUE)
CL111_ouput <- data.frame("fitted" = CL111_predict$fit, "SE"=CL111_predict$se.fit)
write.csv(CL111_ouput, "CL111_output.csv")
Data for the original plot is here:
Code for my original plot is here:
#individual plot
ggplot(data = CL111) + geom_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "lm", se = FALSE, colour = "Green") +
labs(x = "TS Light On (Seconds)", y = "TS Response Time (Seconds)", title = "Layout 1, Condition AO, INS High") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "loess", se = TRUE) + xlim(0, 400) + ylim (0, 1.0)
#find coefficients for best fit line
lm(CL111_LM$dur_cleaned ~ CL111_LM$TS_LightOn)
You can get this information via ggplot_build().
If your plot is saved as gg1, run ggplot_build(gg1); then examine its data element (a list with one data frame per layer) and work out which layer you need. In this case, I looked for the data layer whose colour column matched the smooth line:
bb <- ggplot_build(gg1)
## extract the right component, just the x/y coordinates
out <- bb$data[[2]][,c("x","y")]
## check
plot(y~x, data = out)
You can do whatever you want with this output now (write.csv(), save(), saveRDS() ...)
I agree that there is something weird/that I don't understand about the way that ggplot2 is setting up the loess fit. You do have to do predict() with the right newdata (e.g. a data frame with a single column TS_LightOn that ranges from 0 to 400) - otherwise you get predictions of the points in your data set, which may not be properly spaced/in the right order - but that doesn't resolve the difference for me.
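The newdata point can be sketched as follows. The data here are simulated stand-ins for CL111 (the real data are not available), and the grid of 80 equally spaced x values is an assumption chosen to mirror geom_smooth()'s default of n = 80 evaluation points. Note also that xlim() and ylim() drop observations outside the limits before the smoother is fitted, which is one possible reason the ggplot curve differs from a loess() fit on the full data.

```r
set.seed(1)
# Hypothetical stand-in for the CL111 data frame
CL111 <- data.frame(TS_LightOn = runif(200, 0, 400))
CL111$dur_cleaned <- 0.5 + 0.2 * sin(CL111$TS_LightOn / 60) +
  rnorm(200, sd = 0.05)

# Fit the loess model (default span = 0.75)
fit <- loess(dur_cleaned ~ TS_LightOn, data = CL111)

# Predict on an ordered, equally spaced grid inside the observed x range
grid <- data.frame(
  TS_LightOn = seq(min(CL111$TS_LightOn), max(CL111$TS_LightOn),
                   length.out = 80)
)
pred <- predict(fit, newdata = grid, se = TRUE)

# Collect x, fitted value and standard error, then export
out <- data.frame(grid, fitted = pred$fit, SE = pred$se.fit)
write.csv(out, file.path(tempdir(), "CL111_output.csv"), row.names = FALSE)
```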
To complement @ben-bolker's answer, I have written a small function that may be useful for retrieving the internal dataset that ggplot creates for a geom_smooth call. It takes the resulting ggplot object as input and returns the smoothed data. The problem it solves is that, as Ben observed, ggplot internally evaluates the smoothed fit at its own set of x values, which differ from the x values of the input data. This function gives you back the ggplot fit on integer, equally spaced x values. It does so by running a loess fit on the already smoothed data with a small span (0.1), which is adjusted upward on the fly to cope with small numbers of values.
This is useful if you used geom_smooth with a method that is not 'loess' or using 'NULL' and you cannot easily build a model that replicates what geom_smooth is doing internally.
The function separates different series on the same plot as well as series located on different facets. It also returns the 'ymin' and 'ymax' values.
Note that this function uses an interval based on integer values of x. You can modify this if you need an interval based on equally-spaced values of x, but not integral. In that case, pass your x interval of choice in the xInterval parameter, or tweak the line:
outOne <- data.frame(x=c(min(trunc(sub$x)):max(trunc(sub$x))))
get_geom_smooth_dataFromPlot <- function (a_ggplot, xInterval=NULL) {
#internal ggplot values read in ggTable
ggTable <- ggplot_build(a_ggplot)$data[[1]]
#facet panels
panels <- as.numeric(names(table(ggTable$PANEL)))
nPanel <- length(panels)
onePanel <- (nPanel==1)
#number of series in each plot
groups <- as.numeric(names(table(ggTable$group)))
nGroup <- length(groups)
oneGroup <- (nGroup==1)
out <- data.frame()
#are there 'ymin' and 'ymax' values?
SE_data <- "ymin" %in% colnames(ggTable)
for (pan in (1:nPanel)) {
for (grp in (1:nGroup)) {
sub <- subset(ggTable, (PANEL==panels[pan])&(group==groups[grp]))
#no group series for this facet panel?
if (dim(sub)[1] == 0) next
if (is.null(xInterval)) {
outOne <- data.frame(x=c(min(trunc(sub$x)):max(trunc(sub$x))))
} else {
outOne <- data.frame(x=xInterval)
}
nObs <- dim(outOne)[1]
#hack to avoid problems with a small range for the x interval
# when there are more than 90 x values
# we use a span of 0.1, but
# we adjust on-the-fly up to a span of 0.5
# for 10 values of the x interval
cSpan <- max (0.1, 0.5 * 10 / (nObs-(nObs-10)/2))
if (!onePanel) outOne$panel <- pan
if (!oneGroup) outOne$group <- grp
mod <- loess(y~x, data=sub, span=cSpan)
outOne$y <- predict(mod, outOne$x, se=FALSE)
if (SE_data) {
mod <- loess(ymin~x, data=sub, span=cSpan)
outOne$ymin <- predict(mod, outOne$x, se=FALSE)
mod <- loess(ymax~x, data=sub, span=cSpan)
outOne$ymax <- predict(mod, outOne$x, se=FALSE)
}
out <- rbind(out, outOne)
}
}
return (out)
}

How to efficiently draw lots of graphs in R from data in a wide format?

I'm trying to draw 18 graphs using R and the ggplot2 package. My data look like this:
v1 v2 v3 ... v18 subject group
534 543 512 ... 410 1 (6.5, 18]
437 576 465 ... 420 2 (0, 6.5]
466 487 492 ... 501 3 (18, 55]
And I need to create a "faceted" histogram showing distributions for all of the groups in one frame (i. e. to conveniently present all of the subgroups' distributions) like this:
I came up with this code for a single plot:
ggplot(data = df, aes (x = v1)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
But since there are 18 variables (v1, v2,...), I'm looking for a way to write an efficient function/loop/command that would draw all the 18 graphs without me having to copy/paste and change the variable name 18 times. Like this:
ggplot(data = df, aes (x = **v1**)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
ggplot(data = df, aes (x = **v2**)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
ggplot(data = df, aes (x = **v3**)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
I know that the solution probably lies in looping and it seems like a useful skill to have, so I'm also using this opportunity to learn this right.
Thank you, any help is appreciated! (And thanks to all the suggestions so far!)
This is where I've gotten so far with the kind help of the user below:
for (v in c(v1,v2)) {
pdf("plots.pdf")
histograms <- ggplot(data = data, aes (x = v)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
print(histograms)
}
dev.off()
EDIT A significantly revised answer is provided having clarified the needs.
The problem presents several common issues, each of which is addressed in other posts. However, perhaps this suggestion allows for a one-stop solution to these common issues.
My first suggestion is to reformat the data into a "long" format. There are many resources describing this and packages to help. Many users embrace the "tidyverse" set of tools and I'll leave that to others. I'll demonstrate a simple approach using base functions. I don't recommend the reshape() function in the stats package. I find it to be useful for repeated measures with time as one of the variables but find it rather complicated for other data.
A large fake data set will be generated in the "wide" format with demographic data (id, sex, weight, age, group) and 18 variables named "v01", "v02", ..., "v18", drawn as random integers between 400 and 600.
# Set random number generator and number of "individuals" in fake data
set.seed(1234) # to ensure reproducibility
N <- 936 # number of "individuals" in the fake data
# Create typical fake demographic data and divide the age into 4 groups
id <- factor(sample(1e4:9e4, N, replace = FALSE))
age <- rpois(N, 36)
sex <- sample(c("F","M"), N, replace = TRUE)
weight <- 16 * log(age)
group <- cut(age, breaks = c(12, 32, 36, 40, 62))
Generate 18 fake values for each individual for the wide format and then create the fake "wide" data.frame.
# 18 variable measurements for wide format
V <- replicate(18, sample(400:600, N, replace = TRUE), simplify = FALSE)
names(V) <- sprintf("v%02d", 1:18)
# Add a little variation to the fake data
adj <- sample(1:6, 18, replace = TRUE)
V <- Map("/", V, adj) # divide each value by the number in 'adj'
V <- lapply(V, round, 1) # simplify
# Create data.frame with variable data in wide format
vars <- as.data.frame(V)
names(vars)
# Assemble demographic and variable data into a typical "wide" data set
wide <- data.frame(id, sex, weight, age, group, vars)
names(wide)
head(wide)
In the "wide" format, each row corresponds to a unique individual with demographic information and 18 values for 18 variables. This is going to be changed into the "long" format with each value represented by a row. The new "long" data frame will have two new variables for the data (values) and a factor indicating the group from which the data came (ind). Typically they get renamed but I will simply work with the default names here.
As noted above, the simple base function stack() will be used to stack the variables into a single vector. In contrast to cbind(), the data.frame() function will replicate values only as long as they are an even multiple of each other. The following code takes advantage of this property to build the "long" data.frame.
# Identify those variables to be stacked (they all start with 'v')
sel <- grepl("^v", names(wide))
long <- data.frame(wide[!sel], stack(wide[sel]))
head(long)
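To see what stack() and the recycling in data.frame() do, here is the same step on a tiny made-up table (the name wide_small and its values are purely illustrative):

```r
# Three subjects, two measured variables
wide_small <- data.frame(
  subject = 1:3,
  group   = c("a", "b", "a"),
  v01     = c(534, 437, 466),
  v02     = c(543, 576, 487)
)

# Columns whose names start with "v" are the ones to stack
sel <- grepl("^v", names(wide_small))

# stack() returns 6 rows (values, ind); the 3 rows of id columns are
# recycled twice to match because 6 is a multiple of 3
long_small <- data.frame(wide_small[!sel], stack(wide_small[sel]))
long_small
```

Each subject now appears once per variable, with ind recording which original column the value came from.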
My second suggestion is to use one of the "apply" functions to create a list of ggplot objects. By storing the plots in this variable, you have the option of plotting them with different formats without running the plotting code each time.
The code creates a plot for each of the 18 different variables, which are identified by the new variable ind. I changed boundary = 500 to bins = 10 since I don't know what your actual data looks like. I also added a "caption" to each plot identifying the original variable.
library(ggplot2) # to use ggplot...
plotList <- lapply(levels(long$ind), function(i)
ggplot(data = subset(long, ind == i), aes(x = values))
+ geom_histogram(bins = 10)
+ facet_wrap(~ group, nrow = 2)
+ labs(caption = paste("Variable", i)))
names(plotList) <- levels(long$ind) # name the list elements for convenience
Now to examine each of the 18 plots (this may not work in RStudio):
opar <- par(ask = TRUE)
plotList # This is the same as print(plotList)
par(opar) # turn off the 'ask' option
To save the plots to file, the advice of Imo is good. But it would be wise to take control of the size and nature of the file output. I suggest you look at the help files for pdf() and dev.print(). The last part of this answer shows one possibility with the pdf() function using a for loop to generate single page plots.
for (v in levels(long$ind)) {
fname <- paste(v, "pdf", sep = ".")
fname <- file.path("~", fname) # change this to specify a directory
pdf(fname, width = 6.5, height = 7, paper = "letter")
print(plotList[[v]])
dev.off()
}
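If reshaping is not wanted, a loop over the column names of the wide data is another option. The sketch below assumes ggplot2 3.0.0 or later (for the .data pronoun), uses a small simulated data frame in place of the real one, and writes to a temporary file; the names wide_demo and out_file are made up for the example. Opening the device once before the loop and closing it once afterwards gives a single multi-page PDF.

```r
library(ggplot2)

set.seed(1)
# Hypothetical wide data: two measured variables and a grouping column
wide_demo <- data.frame(
  v01   = rnorm(60, 500, 30),
  v02   = rnorm(60, 450, 30),
  group = rep(c("g1", "g2", "g3"), each = 20)
)

# Names of the columns to plot, as character strings
vars <- grep("^v", names(wide_demo), value = TRUE)

out_file <- file.path(tempdir(), "plots.pdf")
pdf(out_file, width = 6.5, height = 7)   # open the device once
for (v in vars) {
  p <- ggplot(wide_demo, aes(x = .data[[v]])) +  # look the column up by name
    geom_histogram(bins = 10) +
    facet_wrap(~ group, nrow = 2) +
    labs(x = v)
  print(p)                                # one page per variable
}
dev.off()                                 # close the device once
```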
And just to add another possible approach, here's a solution with lattice showing 6 groups of variables per plot. (Personally, I'm a fan of this simpler approach.)
library(lattice)
idx <- split(levels(long$ind), gl(3, 6, 18))
opar <- par(ask = TRUE)
for (i in idx)
plot(histogram(~values | group + ind, data = long,
subset = ind %in% i, as.table = TRUE))
par(opar)

Graph estimated parameters from a set of estimated distributions

I have estimated a set of distributions through grouping the data on time period and gender using the following code:
df.weibull <- tapply(df$attribute, list(time=df$time, gender=df$gender), fitdist, "weibull")
I would like to graph the scale parameter of these distributions over time, with a separate line for each gender. I know I can access an individual scale parameter by:
df.weibull[1,"M"][[1]]$estimate["scale"]
but I cannot figure out how to access all the scale parameters at once in a direct manner. Solutions to either access all the parameters or how to write the original function to return a more accessible data structure are fine.
EDIT: Here is some code that reproduces the data structure:
gender.df <- c("M","M","M","M","M","M","F","F","F","F","F","F")
time.df <- c(1,1,1,2,2,2,1,1,1,2,2,2)
attribute.df <- c(10,20,30,11,21,31,45,55,65,1,2,3)
df <- data.frame(attribute.df,time.df,gender.df)
names(df) <- c("attribute", "time", "gender")
library(fitdistrplus)
df.weibull <- tapply(df$attribute, list(time=df$time, gender=df$gender), fitdist, "weibull")
It seems like you are trying to fit a Weibull distribution by gender and time groups. I suppose that this is just a tiny subsection of your dataset because you have just 3 observations per group. How about:
library(tidyverse)
library(data.table)
sumdf <- setDT(df)[, as.list(fitdist(attribute, dist= "weibull")$estimate), by = .(time, gender)]
   time gender    shape     scale
1:    1      M 2.738085 22.587353
2:    2      M 2.893080 23.666143
3:    1      F 7.793204 58.553205
4:    2      F 2.738652  2.258509
Then you could plot e.g.:
ggplot(sumdf) + geom_line(aes(x = time, y = scale, col = gender))
Or
ggplot(sumdf) + geom_line(aes(x = time, y = shape, col = gender))
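If you would rather keep the original tapply() result, the parameters can also be pulled out of it directly. The sketch below is only an illustration: to stay free of package dependencies it uses a made-up stand-in, fake_fit(), which returns a list with an estimate element shaped like the one fitdist() produces; with real fits you would replace fits by df.weibull.

```r
df <- data.frame(
  attribute = c(10, 20, 30, 11, 21, 31, 45, 55, 65, 1, 2, 3),
  time      = c(1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2),
  gender    = rep(c("M", "F"), each = 6)
)

# Hypothetical stand-in for fitdist(); NOT a Weibull fit
fake_fit <- function(x, ...) {
  list(estimate = c(shape = 1, scale = mean(x)), n = length(x))
}

# Same structure as df.weibull: a time x gender array of mode "list"
fits <- tapply(df$attribute, list(time = df$time, gender = df$gender), fake_fit)

# Extract one parameter from every cell, then restore the array layout
scale_vals <- sapply(fits, function(f) f$estimate[["scale"]])
scale_mat  <- array(scale_vals, dim = dim(fits), dimnames = dimnames(fits))

# Long form, convenient for plotting one line per gender
scale_long <- as.data.frame(as.table(scale_mat), responseName = "scale")
scale_long
```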

Trying to vertically scale the graph of a data set with R, ggplot2

I'm working with a data frame of size 2 x 400. I need to graph this (let's call it data set A) on the same graph as the main data set for my project.
All I need is the general shape of data set A's graph, i.e. I only need to see the trend.
The scale that data set A takes place on happens to be much smaller than that of the main graph. So dataset A just looks like a horizontal line.
I decided to scale data set A by multiplying it by a factor of... I tried various values to get the optimum vertical scaling, which leads me to the problem I'm having.
When trying to find the ideal multiplicative factor by trial and error, I expected data set A's graph to retain its general shape and vary only in its vertical positions, i.e. the horizontal coordinates of all maxima and minima shouldn't move and only the vertical coordinates should change. But this wasn't happening, and I'd like to know why.
Here's the data set A (yellow), when multiplied by factor of 3:
factor of 5:
The yellow dots are the geom_point and the yellow curve is the corresponding geom_smooth.
EDIT:
Here is my original code:
I haven't had much formal training with code. I apologize for any messiness!
library("ggplot2")
library("dplyr")
# READ IN DATA
temp_data <-read.table(col.names = "y",
"C:/Users/Ben/Documents/Visual Studio 2013/Projects/Home/Home/steamdata2.txt")
boilpoint <- which(temp_data$y == "boil") # JUST A MARKER..
temp_data <- filter(temp_data, y != "boil") # GETTING RID OF THE MARKER ENTRY
# DON'T KNOW WHY BUT I HAD TO DO THIS INTERMEDIATE STEP
# BEFORE I COULD CONVERT FROM FACTOR -> NUMERIC
temp_data$y <- as.character(temp_data$y)
# CONVERTING TO NUMERIC
temp_data$y <- as.numeric(temp_data$y)
# GETTING RID OF BASICALLY THE LAST ENTRY WHICH HAS THE LARGEST VALUE
temp_data <- filter(temp_data, y<max(temp_data$y))
# ADD ANOTHER COLUMN WITH THE ROW NUMBER,
# BECAUSE I DON'T KNOW HOW TO ACCESS THIS FOR GGPLOT
temp_data <- transform(temp_data, x = 1:nrow(temp_data))
n <- nrow(temp_data) # Num of readings
period <- temp_data[n,1] # (sec)
RpS <- n / period # Avg Readings per Second
MIN <- min(temp_data$y)
MAX <- max(temp_data$y)
# DERIVATIVE OF ORIGINAL
deriv <- data.frame(matrix(ncol=2, nrow=n))
# ADD ANOTHER COLUMN TO ACCESS ROW NUMBERS FOR GGPLOT LATER
colnames(deriv) <- c("y","x")
deriv <- transform(deriv, x = c(1:n))
# FILL DERIVATIVE DATAFRAME
deriv[1, 1] <- 0
for(i in 2:n){
deriv[i - 1, 1] <- temp_data[i, 1] - temp_data[i - 1, 1]
}
deriv <- filter(deriv, y != 0)
# DID THE SAME FOR SECOND DERIVATIVE
dderiv <- data.frame(matrix(ncol = 2, nrow = nrow(deriv)))
colnames(dderiv) <- c("y", "x")
dderiv <- transform(dderiv, x=rep(0, nrow(deriv)))
dderiv[1, 1] <- 0
for(i in 2:nrow(deriv)) {
dderiv$y[i - 1] <- (deriv$y[i] - deriv$y[i - 1]) /
(deriv$x[i] - deriv$x[i - 1])
dderiv$x[i - 1] <- deriv$x[i] + (deriv$x[i] - deriv$x[i - 1]) / 2
}
dderiv <- filter(dderiv, y!=0)
# HERE'S WHERE I FACTOR BY VARIOUS MULTIPLES
deriv <- MIN + deriv * 3
dderiv <- MIN + dderiv * 3
graph <- ggplot(temp_data, aes(x, y)) + geom_smooth()
graph <- graph + geom_point(data = deriv, color = "yellow")
graph <- graph + geom_smooth(data = deriv, color = "yellow")
graph <- graph + geom_point(data = dderiv, color = "green")
graph <- graph + geom_smooth(data = dderiv, color = "green")
graph <- graph + geom_vline(xintercept = boilpoint, color = "red")
graph <- graph + xlab("Readings (n)") +
ylab(expression(paste("Temperature (",degree,"C)")))
graph <- graph + xlim(c(0,n)) + ylim(c(MIN, MAX))
It's hard to check without your raw data, but I'm 99% sure that your main problem is that you're hard-coding the y limits with ylim(c(MIN, MAX)). This is exacerbated by accidentally scaling both variables in your deriv and dderiv data frames, not just y.
I was able to debug the problem when I noticed that your top "scale by 3" graph has a lot more yellow points than your bottom "scale by 5" graph.
The quick fix is don't scale the row numbers, only scale the y values, which is to say, replace this
# scales entire data frame: bad!
deriv <- MIN + deriv * 3
dderiv <- MIN + dderiv * 3
with this:
# only scale y
deriv$y <- MIN + deriv$y * 3
dderiv$y <- MIN + dderiv$y * 3
I think there is another problem too: even with my correction above, negative values of your derivatives will be excluded. If deriv$y or dderiv$y is ever negative, then MIN + deriv$y * 3 will be less than MIN, and since your y axis begins at MIN it won't be plotted.
So I think the whole fix would be to instead do something like
# keep the original y values around so we can experiment with scaling
# without running *all* the code again
deriv$y_orig <- deriv$y
# multiplicative scale
# fill in the value of `prop` to be the proportion of the vertical plot area
# that you want taken up by the derivative
deriv$y <- deriv$y_orig * diff(c(MIN, MAX)) / diff(range(deriv$y_orig)) * prop
# shift into plot range
# this places the lowest point of the line at MIN + 1; change the 1
# to move the line up or down
deriv$y <- deriv$y + MIN - min(deriv$y) + 1
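The same idea can be wrapped in a small helper that maps a vector linearly onto a chosen vertical band while leaving x untouched. This is a sketch only; rescale_to(), lo and hi are names invented here, not part of the original code.

```r
# Linearly map y onto the interval [lo, hi]; NAs are ignored when
# finding the range and stay NA in the result
rescale_to <- function(y, lo, hi) {
  rng <- range(y, na.rm = TRUE)
  lo + (y - rng[1]) / (rng[2] - rng[1]) * (hi - lo)
}

# Toy derivative data with negative values, as in the question
deriv_demo <- data.frame(x = 1:3, y = c(-1, 0, 1))

# Squeeze y into the band 20..100; the x column is not modified
deriv_demo$y_scaled <- rescale_to(deriv_demo$y, lo = 20, hi = 100)
deriv_demo
```

Because the mapping is linear, maxima and minima keep their x positions, and negative derivative values end up inside the plotted range instead of being cut off.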
I normally don't answer questions that aren't reproducible with data because I hate lack of clarity and I hate the inability to test. However, your question was very clear and I'm pretty sure this will work even without testing. Fingers crossed!
A few other, more general comments:
It's good you know that to convert factor to numeric you need to go via character. It's an annoyance, but if you want to understand more here's the r-faq on it.
I'm not sure why you bother with (deriv$x[i] - deriv$x[i - 1]) in your for loop. Since you define x to be 1, 2, 3, ... the difference is always 1. I'm more confused by why you divide by 2 in the second derivative.
Your for loop can probably be replaced by the diff() function. (See below.)
You seem to have just gotten your foot in the dplyr door, so I used base functions in my recommendation. Keep working with dplyr, I think you'll like it. The big dplyr function you're not using is mutate. It works like base::transform for adding new columns.
I dislike that you've created all these different data frames, it clutters things up. I think your code could be simplified to something like this
all_data = filter(temp_data, y != "boil") %>%
mutate(y = as.numeric(as.character(y))) %>%
filter(y < max(y)) %>%
mutate(
x = 1:n(),
deriv = c(NA, diff(y)) / c(NA, diff(x)),
dderiv = c(NA, diff(deriv)) / 2
)
Rather than having separate data frames for the original data, first derivative and second derivative, this puts them all in the same data frame.
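The diff() idea on its own can be sketched with a toy vector (the values are made up; they are the squares 1, 4, 9, ... so the differences are easy to check by hand):

```r
y <- c(1, 4, 9, 16, 25)

# First differences: y[i] - y[i - 1]
d1 <- diff(y)

# Second differences: differences of the first differences
d2 <- diff(y, differences = 2)

# diff() returns one value fewer per order, so pad with NA to keep the
# result aligned with the original rows of a data frame
d1_padded <- c(NA, d1)
```

Here d1 is 3, 5, 7, 9 and d2 is 2, 2, 2, which replaces the explicit for loops over row indices.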
The big benefit of having things in one data frame is that you could then "gather" it into a nice, long (rather than wide) tidy format and simplify your plotting call:
library(tidyr)
# "function" is a reserved word in R, so the column name has to be quoted
long_data = gather(all_data, key = "function", value = "y", y, deriv, dderiv)
Then your ggplot call would look more like this:
graph <- ggplot(long_data, aes(x, y, color = `function`)) +
geom_smooth() +
geom_point() +
geom_vline(xintercept = boilpoint, color = "red") +
scale_color_manual(values = c("green", "yellow", "blue")) +
xlab("Readings (n)") +
ylab(expression(paste("Temperature (",degree,"C)"))) +
xlim(c(0,n)) + ylim(c(MIN, MAX))
With data in long format, you'd have a column of your data (I've named it "function") that maps to color, so you don't have to add all the layers one at a time, and you get a nicely generated legend!

Transform time series to same length

I am looking for a way to transform time series of different length into a unique length.
I think this question has already been asked but I can't find it. I guess I am just not using the right vocabulary for the question.
Data 1: 20 variables x 250 observations (time points)
Data 2: 20 variables x 50 observations (time points)
I would like to transform these data into 100 observations while keeping the shape of curves for the 20 variables in both cases.
Thanks a lot
Sample data
set.seed(123)
data <- matrix(0, 250, 20)
data[1, ] <- rnorm(20)
for (i in 2:nrow(data)) {
data[i, ] <- data[i - 1, ] + rnorm(20, 0, 0.02)
}
rownames(data) <- 0:249
One way of handling this is with reshape2 and dplyr:
library("reshape2")
library("dplyr")
library("ggplot2")
molten <- melt(data, varnames = c("Time", "Variable"))
Plot of original data:
ggplot(molten, aes(x = Time, y = value, colour = factor(Variable))) + geom_line()
Now reduce the data.frame by a factor of 5 using means of the values in each time period:
shorter <- molten %>%
group_by(Variable, Time %/% 5) %>%
summarise(value = mean(value), Time = mean(Time))
Plot new data:
ggplot(shorter, aes(x = Time, y = value, colour = factor(Variable))) + geom_line()
If you want original wide form of data:
shorterWide <- acast(shorter, Time ~ Variable)
I think I found a way, using the function described in this post:
Basic two-dimensional cubic spline fitting in R
I guess the keyword I was missing was "cubic spline".
In my case I want to do something like this:
spline(Data1, n = 100)
spline(Data2, n = 100)
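Since each data set holds 20 series, the interpolation has to be applied column by column. A minimal sketch, assuming the series are stored as columns of a matrix and the time points are equally spaced: the helper names make_series() and resample_columns() are invented for the example, the data are simulated, and approx() (linear interpolation) is used for simplicity; spline(seq_along(col), col, n = n_out)$y could be swapped in for a cubic spline.

```r
set.seed(123)

# Simulated random-walk series: n_obs time points x n_var variables
make_series <- function(n_obs, n_var = 20) {
  apply(matrix(rnorm(n_obs * n_var, 0, 0.02), n_obs, n_var), 2, cumsum)
}
Data1 <- make_series(250)
Data2 <- make_series(50)

# Interpolate every column onto n_out equally spaced time points that
# span the original time range
resample_columns <- function(m, n_out = 100) {
  apply(m, 2, function(col) approx(seq_along(col), col, n = n_out)$y)
}

Data1_100 <- resample_columns(Data1)   # 250 -> 100 observations
Data2_100 <- resample_columns(Data2)   #  50 -> 100 observations
dim(Data1_100)
dim(Data2_100)
```

Both results have 100 rows and 20 columns and keep the first and last observation of each series. When shortening a noisy series, averaging within time bins (as in the other answer) smooths more than interpolation does.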
