How to programmatically overlap arbitrary stat_functions in ggplot?

I am looking for a way to automatically plot an arbitrary number of stat_function objects in a single ggplot, each one with a different set of parameters and its own color.
Initially I thought of having one big data.table with a large number of samples from each distribution, each set associated with an index, and using geom_density, grouping and coloring by the index.
This is, however, very inefficient: there is, in my opinion, no need to spend time and memory producing and keeping large sets of sampled values when we already have parameters that perfectly describe each distribution.
I present my initial solution below, but is there a more elegant and/or practical way of doing this?
library(data.table)
library(ggplot2)

distrData.dt <- data.table(Shape = c(2.1, 2.2, 2.3), Scale = c(1.1, 1.2, 1.3), time = c(1, 2, 3))

ggplot(data.table(x = 0:15), aes(x)) +
  apply(distrData.dt, 1, FUN = function(x)
    stat_function(fun = dgamma,
                  args = list(shape = as.numeric(x[1]), scale = as.numeric(x[2])),
                  mapping = aes_string(color = x[3]))) +
  scale_colour_gradient("Time Step", low = "blue", high = "red", space = "Lab")
This is the current result (the plot itself is omitted here):
It produces the main result, that is, it plots as many "perfect" densities as the number of parameter sets you give it. However, I am not using aesthetics to pass the parameters from the column names ("Shape" and "Scale") or to get the color of each line. As far as I understand that is not possible, but is there another way?

First of all, your solution looks absolutely fine to me: it does the job, and it does it elegantly. I just wanted both to expand on @joran's comment and to show one useful trick called a "function factory", which is perfectly suitable for a case like yours.
So I'm building a function that returns a function with fixed parameters. Note that using force prevents shape and scale from being lazily evaluated, which is necessary since we'll be using a for loop.
I'm using data.frame instead of data.table, but there shouldn't be a significant difference. The vector("list", n) construction preallocates space for a list, as described in ?vector. I don't think it's obligatory in this particular case (significant overhead appears only for lengths of, say, >100, unlikely here), but it's always better to avoid iteratively growing objects; that's bad practice.
As a last remark, check the stat_function call: it is reasonably readable; at least you can see what belongs to the mapping and what is related to the dgamma parameters.
dgamma_factory <- function(shape, scale) {
  # force() evaluates shape and scale now, so each returned function
  # keeps its own parameter values
  force(shape)
  force(scale)
  function(x) dgamma(x, shape = shape, scale = scale)
}

# preallocate a list with one slot per parameter set
l <- vector("list", nrow(distrData.dt))
for (i in seq.int(nrow(distrData.dt))) {
  params <- distrData.dt[i, ]
  l[[i]] <- stat_function(
    fun = dgamma_factory(params$Shape, params$Scale),
    mapping = aes_string(color = params$time))
}

ggplot(data.frame(x = 0:15), aes(x)) +
  l +
  scale_colour_gradient("Time Step", low = "blue", high = "red", space = "Lab")

Related

Advise a Chemist: Automate/Streamline his Voltammetry Data Graphing Code

I am a chemist dealing with a significant amount of voltammetry data recently. Let me be very clear and give some research information. I run scans from a starting voltage to an ending voltage on solid state conductive films. These scans are saved as .txt files (name scheme: run#.txt) in a single folder. I am looking at how conductance changes as temperature changes. The LINEST line plotting current v. voltage at a given temperature gives me a line with slope = conductance. Once I have the conductances (slopes) for each scan, I plot conductance v. temperature to see the temperature dependent conductance characteristics.

I had been doing this in Excel, but have found quicker ways to get the job done using R. I am brand new to R (RStudio) and recognize that my coding is not the best. Without doubt, this process can be streamlined and sped up, which would help immensely. This is how I am performing the process currently:
# Set working directory with folder containing all .txt files for inspection
# Add all .txt files to the global environment
allruns<-list.files(pattern=".txt")
for(i in 1:length(allruns)) assign(allruns[i], read.table(allruns[i]))
Since the voltage column (a column of 1000 values) is the same for all runs and sits in column V1 of each .txt file, I assign x to be the voltage column from the first file:
x<-run1.txt$V1
All currents (these change as voltage changes) are found in the V2 column of all the .txt files, so I assign a y# to each. These are entered one at a time:
y1<-run1.txt$V2
y2<-run2.txt$V2
y3<-run3.txt$V2
# ...
yn<-runn.txt$V2
Then I fit the equation for each LINEST (one per scan, to be plotted with abline later), again entered one at a time:
run1<-lm(y1~x)
run2<-lm(y2~x)
run3<-lm(y3~x)
# ...
runn<-lm(yn~x)
To obtain a single graph with all the LINEST lines (one for each scan) on the same plot, without the data points showing up, I have been using this pattern of code to first get all data points onto a single plot as separate series:
plot(x,y1,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y3,yn)))
par(new=TRUE)
plot(x,y2,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y3,yn)))
par(new=TRUE)
plot(x,y3,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y1,yn)))
# ...
par(new=TRUE)
plot(x,yn,col="transparent",main="LSV Solid Film", xlab = "potential(V)",ylab="current(A)", xlim=rev(range(x)),ylim=range(c(y1,yn)))
#To obtain all LINEST lines (one for each scan, on the single graph):
abline(run1, col="", lwd=1)
abline(run2, col="", lwd=1)
abline(run3, col="", lwd=1)
# ...
abline(runn, col="", lwd=1)
# Then to get each LINEST equation:
summary(run1)
summary(run2)
summary(run3)
# ...
summary(runn)
Each time I use summary(), I copy the slope and paste it into an Excel sheet, along with the corresponding scan temperature, which I have recorded separately. I then graph the conductance v. temp points for the film as an X-Y scatter with smooth lines to give the temperature-dependent conductance curve. That gives me a single LINEST-lines plot in R and the conductance v. temp curve in Excel.
This technique is actually MUCH quicker than doing it all in Excel, but it could be done much more quickly and efficiently! Also, if I need to change something, this entire process needs to be re-executed with whatever change is necessary. This process takes me maybe 5 hours in Excel and 1.5 hours in R (maybe I am too slow). Nonetheless, any tips to help automate/streamline this further are greatly appreciated.
There are plenty of questions about operating on data in lists; storing a list of matrices or a list of data.frames is fast, and code that operates cleanly on one can be applied to the remaining n-1 very easily.
(Note: what I'm showing here is one technique: maintaining everything in well-compartmentalized lists. Others will suggest -- very justifiably -- that combining things into a single data.frame and adding a group variable (to identify from which file/experiment the data originated) will help with more advanced multi-experiment regression or combined plotting, such as with ggplot2. I'm not going to go into that latter technique here, not yet.)
The pattern for(...) assign(..., read.table(...)) has long been decried; you have the important part done, so this is relatively easy:
allruns <- sapply(list.files(pattern = "\\.txt$"), read.table, simplify = FALSE)
(The use of sapply(..., simplify=FALSE) is similar to lapply(...), but it has a nice side-effect of naming the individual list-ified elements with, in this case, each filename. It may not be critical here but is quite handy elsewhere.)
Extracting your invariant and variable data is simple enough:
allLMs <- lapply(allruns, function(mdl) lm(V2 ~ V1, data = mdl))
I'm using each table's V1 here instead of a once-extracted x. You might wonder why; I argue for keeping it this way for two reasons: (1) just in case the V1 column is ever even one row different, this will save you; (2) it is very easy to construct the model like this.
At this point, each object within allLMs is an lm object, meaning we might do:
summary(allLMs[[1]])
Plotting: I think I understand why you are using par(new=TRUE), and I have to laugh ... I had been deep in R for a while before I started using that technique. What I think you need is actually much simpler:
xlim <- rev(range(allruns[[1]]$V1))
ylim <- range(sapply(allruns, `[`, "V2"))
# this next plot just sets the box and axes, no points
plot(NA, type = "n", xlim = xlim, ylim = ylim)
# no need to plot points with "transparent" ...
ign <- sapply(allLMs, abline, col = "red") # colour of your choice, and other abline options ...
Copying all models into Excel, again, using lists:
out <- do.call(rbind, lapply(allLMs, function(m) summary(m)$coefficients[, 1]))
This will now be a single matrix with one row per model and the coefficients in two columns (intercept and slope). (Feel free to use similar techniques to extract the other model summary attributes, including the std. error, t value, or Pr(>|t|) (in $coefficients), or $r.squared, $adj.r.squared, etc.)
write.table(out, file = "clipboard", sep = "\t", col.names = NA) # "clipboard" works on Windows; otherwise write to a file
and paste into Excel. (Or, better yet, save it to a CSV file and import that, since you might want to keep it around.)
One of the tricks to using lists for this is to persevere: keep things in lists as long as you can, so that you don't have to deal with models individually. One mantra is that if you do something once, you shouldn't have to type it again; just loop/apply/map/whatever. Don't extract too much from the lists before you have to.
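To close the Excel loop entirely, here is a sketch of the final conductance-vs-temperature step (mine, not part of the answer above; temps is assumed to be your separately recorded vector of scan temperatures, in the same order as the files):
slopes <- sapply(allLMs, function(m) coef(m)[["V1"]])  # slope = conductance
cond <- data.frame(temperature = temps, conductance = slopes)
plot(cond$temperature, cond$conductance, type = "b",
     xlab = "temperature", ylab = "conductance (slope)")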
Note: r2evans' answer provides good general advice and doesn't require heavy package dependencies. But it probably doesn't hurt to see alternative strategies.
The tidyverse can be quite handy for this sort of thing, here's a dummy example for illustration,
library(tidyverse)
# creating dummy data files
dummy <- function(T) {
  V <- seq(-5, 5, length = 20)
  I <- jitter(T*V + T, factor = 1)
  write.table(data.frame(V = V, I = I),
              file = paste0(T, ".txt"),
              row.names = FALSE)
}
purrr::walk(300:320, dummy)
# reading
lf <- list.files(pattern = "\\.txt")
read_one <- function(f, ...) {
  cbind(T = as.numeric(gsub("\\.txt", "", f)), read.table(f, ...))
}
m <- purrr::map_df(lf, read_one, header = TRUE, .id = "id")
head(m)
ggplot(m, aes(V, I, group = T)) +
facet_wrap( ~ T) +
geom_point() +
geom_smooth(se = FALSE)
models <- m %>%
split(.$T) %>%
map(~lm(I ~ V, data = .))
coefs <- models %>% map_df(broom::tidy, .id = "T")
ggplot(coefs, aes(as.numeric(T), estimate)) +
geom_line() +
facet_wrap(~term, scales = "free")
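Since the slope is the conductance here, one further step (my addition, not in the original answer) is to keep only the V term from the tidied coefficients before plotting conductance against temperature:
conduct <- coefs %>% dplyr::filter(term == "V")
head(conduct)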

How can you use ggplot to superimpose many plots of related functions in an automatic way?

I have a family of functions that are all the same except for one adjustable parameter, and I want to plot all these functions on one set of axes, superimposed on one another. For instance, this could be sin(n*x) with various values of n, say 1:30, and I don't want to have to type out each command individually -- I figure there should be some way to do it programmatically.
library(ggplot2)
define trig functions as a function of frequency: sin(x), sin(2x), sin(3x) etc.
trigf <- function(i)(function(x)(sin(i*x)))
Superimpose two function plots -- this works manually of course
ggplot(data.frame(x=c(0,pi)), aes(x)) + stat_function(fun=trigf(1)) + stat_function(fun=trigf(2))
now try to generalize -- my idea was to make a list of the stat_functions using lapply
plotTrigf <- lapply(1:5, function(i)(stat_function(fun=function(x)(sin(i*x))) ))
try using the elements of the list manually -- it doesn't really work: only the i=5 curve is shown, and I'm not sure why, since that's not what I referenced
ggplot(data.frame(x=c(0,pi)), aes(x)) +plotTrigf[[1]] + plotTrigf[[2]]
I thought this Reduce might handle the 'generalized sum' to add to a ggplot(), but it doesn't work -- it complains of a non-numeric argument to a binary operator
Reduce("+", plotTrigf)
So I'm kind of stuck both in executing this strategy, or perhaps there's some other way to do this.
Are you using R < 3.2? The problem is that you actually need to evaluate your i parameter in your lapply call. Right now it's being left as a promise and not getting evaluated until you try to plot, and at that point i has the last value it had in the lapply loop, which is 5. Use:
plotTrigf <- lapply(1:5, function(i) {force(i);stat_function(fun=function(x)(sin(i*x))) })
You can't just add stat_function calls together, even without Reduce() you get the error
stat_function(fun=sin) + stat_function(fun=cos)
# Error in stat_function(fun = sin) + stat_function(fun = cos) :
# non-numeric argument to binary operator
You need to add them to a ggplot object. You can do this with Reduce() if you just specify the init= parameter
Reduce("+", plotTrigf, ggplot(data.frame(x=c(0,pi)), aes(x)))
And actually the special + operator for ggplot objects allows you to add a list of objects so you don't even need the Reduce at all (see code for ggplot2:::add_ggplot)
ggplot(data.frame(x=c(0,pi)), aes(x)) + plotTrigf
The final result is a single plot with all five sine curves superimposed (figure omitted).
You need to use force in order to make sure the parameter is evaluated at the right time. It's a very useful technique and a common source of confusion in loops; you should read about it in Hadley's book: http://adv-r.had.co.nz/Functions.html
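As a tiny illustration of what force() guards against (my sketch; since R 3.2.0, lapply itself forces its arguments, which is why the R version matters above):
fs <- lapply(1:3, function(i) function() i)
fs[[1]]()
# on R < 3.2.0 this could return 3 (the last value of i) rather than 1,
# because i was captured as an unevaluated promise; force(i) inside the
# lapply function pins each closure to its own value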
To solve your question: you just need to add force(i) when defining the plots, inside the lapply function, before the call to stat_function. Then you can use Reduce or any other method to combine them. Here's a way to combine the plots using lapply (note that I'm using the <<- operator, which is discouraged):
p <- ggplot(data.frame(x = c(0, pi)), aes(x))
lapply(plotTrigf, function(x) {
  p <<- p + x
  return()
})

Setting equal xlim and ylim in plot function

Is there a way to get the plot function to generate equal xlim and ylim automatically?
I do not want to define a fixed range beforehand; I want the plot function to decide about the range itself. However, I expect it to pick the same range for x and y.
A possible solution is to define a wrapper to the plot function:
plot.Custom <- function(x, y, ...) {
  .limits <- range(x, y)
  plot(x, y, xlim = .limits, ylim = .limits, ...)
}
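For example (my usage sketch, with the built-in cars data; both axes get the shared range of the two variables):
plot.Custom(cars$speed, cars$dist)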
One way is to manipulate interactively and then choose the right one. A slider will appear once you run the following code.
library(manipulate)
manipulate(
  plot(cars, xlim = c(x.min, x.max)),
  x.min = slider(0, 15),
  x.max = slider(15, 30))
I'm not aware of any way to do this using plot (that doesn't mean there isn't one). ggplot might be the way to go; it lends itself more to being retroactively changed, since it is designed around a layer system.
library(ggplot2)
# Creating our ggplot object
loop_plot <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point()

# pulling out the 'auto' x & y axis limits
# (note: this slot is version-dependent; in ggplot2 >= 3.0 it is roughly
#  ggplot_build(loop_plot)$layout$panel_params[[1]]$x.range)
rangepull <- t(cbind(
  ggplot_build(loop_plot)$panel$ranges[[1]]$x.range,
  ggplot_build(loop_plot)$panel$ranges[[1]]$y.range))

# taking the max and min (so we don't cut out data points)
newrange <- list(cor.min = min(rangepull[,1]), cor.max = max(rangepull[,2]))

# changing our plot size to be nice and symmetric
loop_plot <- loop_plot +
  xlim(newrange$cor.min, newrange$cor.max) +
  ylim(newrange$cor.min, newrange$cor.max)
Note that the loop_plot object is of class ggplot, and won't actually print until it's called.
I used the cars dataset in the code above to show what's going on, but just sub in your data set(s) and then do whatever post-processing your end goal requires.
You'll also be able to add titles and the like based on the dataset name, et cetera, which will likely end up producing a clearer visualization out of your loop.
Hopefully this works for your needs.
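As an aside, a shorter route to the same symmetric limits (my sketch, not from the answer above) is to skip ggplot_build entirely and compute one shared range directly from the data:
lims <- range(cars$speed, cars$dist)
ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlim(lims[1], lims[2]) +
  ylim(lims[1], lims[2])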

Manipulating contrast etc within a vector of colours

I'm seeking an efficient way to perform simple manipulations on colour vectors in R, such as brightness and contrast. I have a hacky method that converts each hex string to numerical values and adjusts these, for example increasing/decreasing values for lightness, or rescaling them for lightness contrast, before converting back to hex. It works but is too slow to run interactively, and I can't see any libraries (e.g. colorspace) that have this functionality. Does anyone know of an alternative method? Thanks in advance.
This illustrates the flow for darkening (the simplest manipulation):
cols = rainbow(100)
d = data.frame(x = 1:100, y0 = rep(0, 100), y1 = rep(100, 100))

plot_cols = function(colours) {
  plot.new(); plot.window(xlim = c(0, 100), ylim = c(0, 100))
  segments(d$x, d$y0, d$x, d$y1, col = colours, lwd = 5)
}
plot_cols(cols)

# darkening: halve each RGB channel and convert back to hex
cols_num = t(col2rgb(cols)) / 255
plot_cols(rgb(cols_num * .5))
Contrast effects require the standard deviation (sd), which I think is the bottleneck.
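For what it's worth, here is a vectorised sketch of a contrast adjustment along the same lines (my own, and it assumes "contrast" means stretching each RGB channel about its mean; an sd-based rescaling would be analogous):
adjust_contrast <- function(colours, factor = 1.5) {
  m <- t(col2rgb(colours)) / 255                        # n x 3 matrix in [0, 1]
  mid <- colMeans(m)                                    # per-channel centre
  out <- sweep(sweep(m, 2, mid) * factor, 2, mid, `+`)  # stretch deviations
  rgb(pmin(pmax(out, 0), 1))                            # clamp, back to hex
}
plot_cols(adjust_contrast(cols, 1.5))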

Graphing a polynomial output of calc.poly

I apologize first for bringing what I imagine to be a ridiculously simple problem here, but I have been unable to glean from the help file for the polynom package how to solve this problem. For one out of several years, I have two vectors of x (d, for day of year) and y (e, for an index of egg production) data:
d=c(169,176,183,190,197,204,211,218,225,232,239,246)
e=c(0,0,0.006839425,0.027323127,0.024666883,0.005603878,0.016599262,0.002810977,0.005603878,0,0.002810977,0.002810977)
I want to, for each year, use the poly.calc function to create a polynomial function that I can use to interpolate the timing of maximum egg production. I want then to superimpose the function on a plot of the data. To begin, I have no problem with the poly.calc function:
egg1996<-poly.calc(d,e)
egg1996
3216904000 - 173356400*x + 4239900*x^2 - 62124.17*x^3 + 605.9178*x^4 - 4.13053*x^5 +
0.02008226*x^6 - 6.963636e-05*x^7 + 1.687736e-07*x^8
I can then simply
plot(d,e)
But when I try to use the lines function to superimpose the function on the plot, I get confused. The help file states that the output of poly.calc is an object of class polynomial, and so I assume that "egg1996" will be the "x" in:
lines(x, len = 100, xlim = NULL, ylim = NULL, ...)
But I cannot seem to, based on the example listed:
lines(poly.calc(2:4), lty = 2)
Or based on the arguments:
x an object of class "polynomial".
len size of vector at which evaluations are to be made.
xlim, ylim the range of x and y values with sensible defaults
Come up with a command that successfully graphs the polynomial "egg1996" onto the raw data.
I understand that this question is beneath you folks, but I would be very grateful for a little help. Many thanks.
I don't work with the polynom package, but the resulting data set is on a completely different scale (on both the X & Y axes) than the first plot() call. If you don't mind having it in two separate panels, this provides both plots for comparison:
library(polynom)
d <- c(169,176,183,190,197,204,211,218,225,232,239,246)
e <- c(0,0,0.006839425,0.027323127,0.024666883,0.005603878,
0.016599262,0.002810977,0.005603878,0,0.002810977,0.002810977)
egg1996 <- poly.calc(d,e)
par(mfrow=c(1,2))
plot(d, e)
plot(egg1996)
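If you do want the curve on top of the raw points, one option (a sketch of mine; polynom provides a predict() method for polynomial objects) is to evaluate the polynomial on a fine grid and overlay it with lines():
xx <- seq(min(d), max(d), length.out = 200)
plot(d, e)
# the interpolating polynomial oscillates far outside the data's range,
# so much of the curve will be clipped by the plot region
lines(xx, predict(egg1996, xx), lty = 2)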
