Export results from LOESS plot - r

I am trying to export the underlying data from a LOESS plot (the blue line).
I found this post on the subject and was able to get it to export like the post says:
Can I export the result from a loess regression out of R?
However, as the last comment from the poster in that post says, I am not getting the results for my LOESS line. Does anyone have any insights on how to get it to export properly?
Thanks!
Code for my export is here:
#loess object
CL111_loess <- loess(dur_cleaned~TS_LightOn, data = CL111)
#get SE
CL111_predict <- predict(CL111_loess, se=T)
CL111_output <- data.frame("fitted" = CL111_predict$fit, "SE" = CL111_predict$se.fit)
write.csv(CL111_output, "CL111_output.csv")
Data for the original plot is here:
Code for my original plot is here:
#individual plot
ggplot(data = CL111) +
  geom_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "lm", se = FALSE, colour = "Green") +
  labs(x = "TS Light On (Seconds)", y = "TS Response Time (Seconds)", title = "Layout 1, Condition AO, INS High") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "loess", se = TRUE) +
  xlim(0, 400) + ylim(0, 1.0)
#find coefficients for best fit line
lm(CL111_LM$dur_cleaned ~ CL111_LM$TS_LightOn)

You can get this information via ggplot_build().
If your plot is saved as gg1, run ggplot_build(gg1); then examine the data object (which is a list of data for the different layers) and figure out which layer you need (in this case, I looked for the data layer whose colour column matched the smooth line).
bb <- ggplot_build(gg1)
## extract the right component, just the x/y coordinates
out <- bb$data[[2]][,c("x","y")]
## check
plot(y~x, data = out)
You can do whatever you want with this output now (write.csv(), save(), saveRDS() ...)
I agree that there is something weird/that I don't understand about the way that ggplot2 is setting up the loess fit. You do have to do predict() with the right newdata (e.g. a data frame with a single column TS_LightOn that ranges from 0 to 400) - otherwise you get predictions of the points in your data set, which may not be properly spaced/in the right order - but that doesn't resolve the difference for me.
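For reference, a minimal sketch of what that predict() call could look like with an explicit newdata grid (the grid spacing and output file name are my assumptions; the object and column names come from the question):
## hypothetical 1-second grid over the plotted x range
new_x <- data.frame(TS_LightOn = seq(0, 400, by = 1))
CL111_predict <- predict(CL111_loess, newdata = new_x, se = TRUE)
CL111_output <- data.frame(TS_LightOn = new_x$TS_LightOn,
                           fitted = CL111_predict$fit,
                           SE = CL111_predict$se.fit)
## values of TS_LightOn outside the range of the data come back as NA
write.csv(CL111_output, "CL111_output.csv", row.names = FALSE)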

To complement Ben Bolker's answer, I have written a small function that may be useful for retrieving the internal dataset created by ggplot for a geom_smooth call. It takes the resulting ggplot as input and returns the smoothed data. The problem it solves is that, as Ben observed, ggplot internally creates the smoothed fit with predicted values at x positions that differ from those of the input data. This function gets you back the ggplot fit data on an interval of equally spaced integer values. It does this with a loess fit on the already smoothed data, using a small span (0.1) that is adjusted upward on the fly to cope with small numbers of values.
This is useful if you used geom_smooth with a method other than 'loess', or with method = NULL, and you cannot easily rebuild the model that geom_smooth fits internally.
The function separates different series on the same plot, as well as series located on different facets. It also returns the 'ymin' and 'ymax' values.
Note that the function uses an interval based on integer values of x. If you need equally spaced but non-integer x values, pass your x interval of choice in the xInterval parameter, or tweak the line:
outOne <- data.frame(x=c(min(trunc(sub$x)):max(trunc(sub$x))))
get_geom_smooth_dataFromPlot <- function(a_ggplot, xInterval=NULL) {
  # internal ggplot values read into ggTable
  ggTable <- ggplot_build(a_ggplot)$data[[1]]
  # facet panels
  panels <- as.numeric(names(table(ggTable$PANEL)))
  nPanel <- length(panels)
  onePanel <- (nPanel == 1)
  # number of series in each plot
  groups <- as.numeric(names(table(ggTable$group)))
  nGroup <- length(groups)
  oneGroup <- (nGroup == 1)
  out <- data.frame()
  # are there 'ymin' and 'ymax' values?
  SE_data <- "ymin" %in% colnames(ggTable)
  for (pan in 1:nPanel) {
    for (grp in 1:nGroup) {
      sub <- subset(ggTable, (PANEL == panels[pan]) & (group == groups[grp]))
      # no series for this group in this facet panel?
      if (dim(sub)[1] == 0) next
      if (is.null(xInterval)) {
        outOne <- data.frame(x = c(min(trunc(sub$x)):max(trunc(sub$x))))
      } else {
        outOne <- data.frame(x = xInterval)
      }
      nObs <- dim(outOne)[1]
      # hack to avoid problems with a small range for the x interval:
      # with more than 90 x values we use a span of 0.1,
      # adjusted on the fly up to a span of 0.5 for 10 x values
      cSpan <- max(0.1, 0.5 * 10 / (nObs - (nObs - 10) / 2))
      if (!onePanel) outOne$panel <- pan
      if (!oneGroup) outOne$group <- grp
      mod <- loess(y ~ x, data = sub, span = cSpan)
      outOne$y <- predict(mod, outOne$x, se = FALSE)
      if (SE_data) {
        mod <- loess(ymin ~ x, data = sub, span = cSpan)
        outOne$ymin <- predict(mod, outOne$x, se = FALSE)
        mod <- loess(ymax ~ x, data = sub, span = cSpan)
        outOne$ymax <- predict(mod, outOne$x, se = FALSE)
      }
      out <- rbind(out, outOne)
    }
  }
  return(out)
}
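A minimal usage sketch, assuming the ggplot from the question is stored in gg1 (a hypothetical name) and contains a single geom_smooth/stat_smooth layer:
gg1 <- ggplot(CL111, aes(TS_LightOn, dur_cleaned)) +
  stat_smooth(method = "loess", se = TRUE)
smooth_data <- get_geom_smooth_dataFromPlot(gg1)
head(smooth_data)  # columns x, y and (here) ymin/ymax, on integer x values
write.csv(smooth_data, "CL111_smooth.csv", row.names = FALSE)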

Related

Plotting dr4pl dose response curves, and how to integrate them with ggplot2?

I am trying to set up a high-throughput way of plotting out dose response curves from a large screening experiment. Prism obviously has the easiest way of doing dose-response curves well, but I can't copy and paste this much data.
Since CRAN removed drc, the package dr4pl seems the way to go, but there is very little instruction available yet.
#make data frame
dose <- c("0.078125","0.156250","0.312500","0.625000","1.250000","2.500000","5.000000","10.000000","20.000000")
POC<-c("1.05637425", "0.87380081", "0.79171200", "0.83166848", "0.77361290", "0.35199288", "0.19404609", "0.09079221", "0.09850658")
data<-data.frame(dose, POC)
# use the dr4pl package to calculate the curve, IC50, etc.
model<-dr4pl(POC~dose, data)
summary.model <- summary(model)
summary.model$coefficients
#plot this
plot(dr4pl(POC~dose, data=data))
The above will generate the type of curve I need using dr4pl, and get me the IC50, but how would I plot several datasets/curves on one plot?
Ideally I'd rather plot the data using ggplot2: plot+geom_point() and add in the dose response line by using the dr4pl summary as a +stat_smooth() model, if that makes sense? But I don't know how to do this.
Any help would be appreciated
I can get most of the way but not all the way. The main step is to write a predict() method for dr4pl objects:
predict.dr4pl <- function(object, newdata=NULL, se.fit=FALSE, level, interval) {
  xseq <- if (is.null(newdata)) object$data$Dose else newdata$x
  pred <- MeanResponse(xseq, object$parameters)
  if (!se.fit) {
    return(pred)
  }
  qq <- qnorm((1 + level)/2)
  se <- sapply(xseq,
               function(x) car::deltaMethod(object,
                   "UpperLimit + (LowerLimit - UpperLimit)/(1 + (x/IC50)^Slope)")[["Estimate"]])
  return(list(fit = data.frame(fit = pred, lwr = pred - qq*se, upr = pred + qq*se),
              se.fit = se))
}
I included a slightly hacky way to compute the confidence intervals via the delta method - this might not be too reliable (bootstrapping would be better ...)
It works OK (sort of) for your data (I changed the name to dd because it's sometimes dicey to name the data data; see fortunes::fortune("dog")).
dd <- data.frame(dose = c(0.078125,0.156250,0.312500,0.625000,1.25,
2.50,5.0,10.0,20.0),
POC = c(1.05637425, 0.87380081, 0.79171200,
0.83166848, 0.77361290, 0.35199288,
0.19404609, 0.09079221, 0.09850658))
library(dr4pl)
ggplot(dd, aes(dose,POC)) + geom_point() +
geom_smooth(method="dr4pl",se=TRUE) + coord_trans(x="log10")
The confidence intervals are terrible; turn them off with se=FALSE.
dr4pl puts the x-axis on a log10-scale by default, but the standard scale_x_log10() screws this up because it is applied before the fitting and prediction, so I use coord_trans(x="log10") instead.
However, coord_trans() doesn't play so nicely if the axes are on a very broad logarithmic scale - I tried the example above with the sample_data_1 data from the package and it didn't work.
But I'm afraid I've spent enough time on this for now.
It would be more robust to use the predict method above to generate the values you want, over the range you want, separately, and then use geom_line() + geom_ribbon() to add the information to the plot.
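For example, a rough sketch of that manual route, reusing the predict.dr4pl method defined above (the model name, prediction grid, and confidence level are my assumptions):
model1 <- dr4pl(POC ~ dose, data = dd)  # hypothetical fit to the dd data above
newd <- data.frame(x = 10^seq(log10(0.078), log10(20), length.out = 100))
pp <- predict(model1, newdata = newd, se.fit = TRUE, level = 0.95)
pred_df <- cbind(newd, pp$fit)  # columns x, fit, lwr, upr
ggplot(dd, aes(dose, POC)) +
  geom_point() +
  geom_ribbon(data = pred_df, aes(x = x, ymin = lwr, ymax = upr),
              alpha = 0.2, inherit.aes = FALSE) +
  geom_line(data = pred_df, aes(x = x, y = fit),
            colour = "blue", inherit.aes = FALSE) +
  scale_x_log10()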
If you're willing to fit the model first (outside geom_smooth) you can do this (this is using sample_data_1 from dr4pl package - it's from the first example in ?dr4pl)
model2 <- dr4pl(dose = sample_data_1$Dose,
response = sample_data_1$Response)
ggplot(sample_data_1, aes(Dose,Response)) + geom_point() +
stat_function(fun=function(x) predict(model2,newdata=data.frame(x=x))) +
scale_x_log10()
which is less sensitive to the order of scaling/unscaling the x axis.
Improved but slow bootstrap CIs:
predictdf.dr4pl <- function(model, xseq, se, level, nboot=200) {
  pred <- MeanResponse(xseq, model$parameters)
  if (!se) {
    return(base::data.frame(x=xseq, y=pred))
  }
  ## bootstrap residuals
  pred0 <- MeanResponse(model$data$Dose, model$parameters)
  res <- pred0 - model$data$Response
  bootres <- matrix(nrow=length(xseq), ncol=nboot)
  pb <- txtProgressBar(max=nboot, style=3)
  for (i in seq(nboot)) {
    setTxtProgressBar(pb, i)
    mboot <- dr4pl(model$data$Dose,
                   pred0 + sample(res, size=length(pred0), replace=TRUE))
    bootres[,i] <- MeanResponse(xseq, mboot$parameters)
  }
  fit <- data.frame(x = xseq,
                    y = pred,
                    ymin = apply(bootres, 1, quantile, (1-level)/2),
                    ymax = apply(bootres, 1, quantile, (1+level)/2))
  return(fit)
}
print(ggplot(dd, aes(dose,POC))
+ geom_point()
+ geom_smooth(method="dr4pl",se=TRUE) + coord_trans(x="log10")
)

plotting log(10) lengths differ

I am having difficulty plotting a log(10) formula onto existing data points. I derived a logarithmic function based on a list of data where "Tout_F_6am" is my independent variable and "clo" is my dependent variable.
When I go to plot it, I am getting the error that the lengths of x and y differ. Can someone please help me figure out what's going wrong?
logKT=lm(log10(clo)~ Tout_F_6am,data=passive)
summary(logKT) #r2=0.12
coef(logKT)
plot(passive$Tout_F_6am,passive$clo) #plot data points
x = seq(53, 84, length = 6381) # match length of x variable
y=logKT
lines(x,y,type="l",lwd=2,col="red")
length(passive$Tout_F_6am) #6381
length(passive$clo) #6381
Additionally, can the formula curve(-0.0219-0.005*log10(x), add=TRUE, col=2) be written as eq = (10^-0.022)*(10^-0.005*x)? Thanks!
The problem is that you are trying to plot the model object, not the predictions from the model. Try something like this:
Define the explanatory values you want to plot, in a data frame (or tibble). It doesn't have to be as many as there are data points.
library(dplyr)
explanatory_data <- tibble(
Tout_F_6am = seq(53, 84, 0.1)
)
Add a column of predicted values using predict(). This takes a model and your explanatory data. predict() will return the transformed values, so you have to backtransform them.
prediction_data <- explanatory_data %>%
mutate(
log10_clo = predict(logKT, explanatory_data),
clo = 10 ^ log10_clo
)
Finally, draw your plot.
plot(clo ~ Tout_F_6am, data = prediction_data, log="y", type = "l")
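Or, to overlay the fitted curve on your original data points instead (a sketch reusing explanatory_data and prediction_data from above):
plot(passive$Tout_F_6am, passive$clo)  # original data points
lines(explanatory_data$Tout_F_6am, prediction_data$clo, lwd = 2, col = "red")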
The plotting is actually easier using ggplot2. This should give you more or less what you want.
library(ggplot2)
ggplot(passive, aes(Tout_F_6am, clo)) +
geom_point() +
geom_smooth(method = "lm") +
scale_y_log10()

Code to find the hull of a list of ROC curves (upper and lower limits of the set of curves)

I have made code that computes the two lines I am asking for in the question, as shown in the image below (desired lines are in red).
EDIT: This is the expected graph using my snippet to generate the ROC curves (at least I'm pretty sure this is right):
The problem is that said code is very, very ugly (too long to even post here) and the process I came up with seems extremely tedious to me. Yet I can't seem to come up with anything better.
Here is a quick snippet to produce an input list of ROC curves:
library(MASS)
library(dplyr)
simple_roc <- function(labels, scores){
  labels <- labels[order(scores, decreasing=TRUE)]
  return(rbind(c(0,0,0),
               data.frame(TPR=cumsum(labels)/sum(labels),
                          FPR=cumsum(!labels)/sum(!labels),
                          labels)))
}
diab_data = rbind(data.frame(Pima.tr), data.frame(Pima.te))
roc_curves_list_logisitic = list()
for (k in 1:100) {
  # Set a fixed seed for reproducibility
  set.seed(k)
  # sampled_rows <- createDataPartition(diab_data$type, p = .7, list = FALSE)
  sampled_rows <- sample(1:nrow(diab_data), size=floor(0.7*nrow(diab_data)))
  diab_data_train = diab_data[sampled_rows,]
  diab_data_test = diab_data[-sampled_rows,]
  diab_data_train[,1:7] = scale(diab_data_train[,1:7])
  diab_data_test[,1:7] = scale(diab_data_test[,1:7])
  diab_data_train[,"type"] = as.numeric(as.character(recode_factor(diab_data_train[,"type"], `Yes` = "1", `No` = "0")))
  diab_data_test[,"type"] = as.numeric(as.character(recode_factor(diab_data_test[,"type"], `Yes` = "1", `No` = "0")))
  logistic_model_simple = glm(data=diab_data_train,
                              as.formula(paste(colnames(diab_data_train)[8], "~",
                                               paste(colnames(diab_data_train)[-8], collapse = "+"),
                                               sep = "")),
                              family=binomial(link = "logit"))
  roc_curves_list_logisitic[[k]] = simple_roc(diab_data_test[,"type"],
                                              ifelse(predict(logistic_model_simple, diab_data_test, type='response')>0.5, 1, 0))
}
I am now asking for help, in case anyone has a "beautiful" solution to produce the two red lines in this graph (in ggplot2) using the list of ROC curves I provided as input.
Preferably I would like to end up with two dataframes, lower_bound_roc_curves and upper_bound_roc_curves, containing the necessary values to plot the two lines separately if I need them.
Thanks in advance,
EDIT 2: @denis Here are some parts I think your code gets wrong:
I have a solution with data.table and zoo. The first step is to establish a common set of FPR values across all your curves, so that the maximum and minimum can be computed point-wise. To do so:
library(data.table)
library(zoo)
FPRlist <- unique(rbindlist(lapply(roc_curves_list_logisitic, function(ROC){
  rccurve <- as.data.table(ROC)
  rccurve[,.(FPR = FPR)]
})))
I create a table FPRlist containing all the FPR values existing in all your curves. I then merge each curve with this table of all FPR values and use na.locf to fill in the missing values.
I use rbindlist to make one table, with an ID for each ROC curve:
results <- rbindlist(lapply(seq(roc_curves_list_logisitic), function(idx){
  rccurve <- as.data.table(roc_curves_list_logisitic[[idx]])
  rccurve <- merge(FPRlist, rccurve, all = T)
  rccurve[,TPR := na.locf(TPR, na.rm = F)] # complete the missing TPR values
  rccurve[,ID := idx]                      # create an ID
  rccurve
}))
I then calculate the max and min across all IDs (all ROC curves) for each FPR step:
resultmax <- results[,.(TPR = max(TPR)),by = FPR]
resultmin <- results[,.(TPR = min(TPR)),by = FPR]
And plot it the same way you did:
ggplot() +
  geom_line(data = results, aes(FPR, TPR, color = as.factor(ID))) +
  theme_light() %+replace% theme(legend.position = "none") +
  geom_line(data = resultmax, aes(FPR, TPR), color = "red", size = 1) +
  geom_line(data = resultmin, aes(FPR, TPR), color = "red", size = 1)
I leave the dplyr translation to dplyr users, because I am not used to it.
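For what it's worth, here is an untested dplyr/tidyr sketch of the same idea (tidyr::complete and fill play the roles of the merge and na.locf steps; object and column names follow the data.table version above):
library(dplyr)
library(tidyr)
all_fpr <- sort(unique(unlist(lapply(roc_curves_list_logisitic, `[[`, "FPR"))))
results_df <- bind_rows(lapply(seq_along(roc_curves_list_logisitic), function(idx) {
  mutate(roc_curves_list_logisitic[[idx]], ID = idx)
})) %>%
  group_by(ID) %>%
  complete(FPR = all_fpr) %>%          # add the missing FPR steps as NA rows
  arrange(ID, FPR) %>%
  fill(TPR, .direction = "down") %>%   # same role as zoo::na.locf
  ungroup()
bounds <- results_df %>%
  group_by(FPR) %>%
  summarise(TPR_max = max(TPR, na.rm = TRUE),
            TPR_min = min(TPR, na.rm = TRUE))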
Edit
I modified my plot to make a comparison with a plot of just the raw ROC curves, without any merge or na.locf. One can see that the red lines I propose do follow the max and the min of all curves. The second plot is obtained as follows:
results2 <- rbindlist(lapply(seq(roc_curves_list_logisitic), function(idx){
  rccurve <- as.data.table(roc_curves_list_logisitic[[idx]])
  rccurve[,ID := idx] # create an ID
  rccurve
}))
p2 <- ggplot() +
  geom_line(data = results2, aes(FPR, TPR, color = as.factor(ID))) +
  theme_light() %+replace% theme(legend.position = "none")
It just plots all the ROC curves contained in the list provided in the OP's question. The two-column plot is obtained with the multiplot function (see here).
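If you don't have that multiplot helper handy, gridExtra gives a similar side-by-side layout (assuming the first combined plot is saved as p1, a hypothetical name):
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)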

How to customize title, axis labels, etc. in a plot of a decomposed time series

I'm reasonably familiar with the usual ways of modifying a plot by writing your own x axis labels or a main title, but I've been unable to customize the output when plotting the results of a time series decomposition.
For example,
library(TTR)
t <- ts(co2, frequency=12, start=1, deltat=1/12)
td <- decompose(t)
plot(td)
plot(td, main="Title Doesn't Work") # gets you an error message
gives you a nice, basic plot of the observed time series, trend, etc. With my own data (changes in depth below the water surface), however, I'd like to be able to switch the orientation of the y axes (e.g. ylim=c(40,0) for 'observed', or ylim=c(18,12) for 'trend'), change 'seasonal' to 'tidal', include the units for the x axis ('Time (days)'), and provide a more descriptive title for the figure.
My impression is that the kind of time series analyses I'm doing is pretty basic and, eventually, I may be better off using another package, perhaps with better graphical control, but I'd like to use ts() and decompose() if I can for now (yeah, cake and consumption). Assuming this doesn't get too horrendous.
Is there a way to do this?
Thanks! Pete
You can modify the plot.decomposed.ts function (that's the plot "method" that gets dispatched when you run plot on an object of class decomposed.ts, which is the class of td).
getAnywhere(plot.decomposed.ts)
function (x, ...)
{
    xx <- x$x
    if (is.null(xx))
        xx <- with(x, if (type == "additive")
            random + trend + seasonal
        else random * trend * seasonal)
    plot(cbind(observed = xx, trend = x$trend, seasonal = x$seasonal, random = x$random),
         main = paste("Decomposition of", x$type, "time series"), ...)
}
Notice in the code above that the function hard-codes the title. So let's modify it so that we can choose our own title:
my_plot.decomposed.ts = function(x, title="", ...) {
  xx <- x$x
  if (is.null(xx))
    xx <- with(x, if (type == "additive")
      random + trend + seasonal
      else random * trend * seasonal)
  plot(cbind(observed = xx, trend = x$trend, seasonal = x$seasonal, random = x$random),
       main=title, ...)
}
my_plot.decomposed.ts(td, "My Title")
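As a rough extension of the same idea (an untested sketch), you can also relabel a panel and set the x-axis label, since plot.ts uses the column names of the cbind for the panel strips; the 'tidal' label and 'Time (days)' axis label are taken from the question, and the function name is hypothetical:
my_plot2.decomposed.ts = function(x, title = "", xlab = "Time (days)",
                                  seasonal_label = "tidal", ...) {
  xx <- x$x
  if (is.null(xx))
    xx <- with(x, if (type == "additive")
      random + trend + seasonal
      else random * trend * seasonal)
  dat <- cbind(observed = xx, trend = x$trend,
               seasonal = x$seasonal, random = x$random)
  colnames(dat)[3] <- seasonal_label  # rename the 'seasonal' panel
  plot(dat, main = title, xlab = xlab, ...)
}
my_plot2.decomposed.ts(td, "Depth Decomposition", xlab = "Time (days)")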
Here's a ggplot version of the plot. ggplot requires a data frame, so the first step is to get the decomposed time series into data frame form and then plot it.
library(tidyverse) # Includes the packages ggplot2 and tidyr, which we use below
# Get the time values for the time series
Time = attributes(co2)[[1]]
Time = seq(Time[1], Time[2], length.out = (Time[2]-Time[1])*Time[3] + 1)  # +1 so the length matches the number of observations
# Convert td to data frame
dat = cbind(Time, with(td, data.frame(Observed=x, Trend=trend, Seasonal=seasonal, Random=random)))
ggplot(gather(dat, component, value, -Time), aes(Time, value)) +
facet_grid(component ~ ., scales="free_y") +
geom_line() +
theme_bw() +
labs(y=expression(CO[2]~(ppm)), x="Year") +
ggtitle(expression(Decomposed~CO[2]~Time~Series)) +
theme(plot.title=element_text(hjust=0.5))

Log Log Probability Chart in R

I'm sure this is easy, but I've been tearing my hair out trying to find out how to do this in R.
I have some data that I am trying to fit to a power law distribution. To do this, you need to plot the data on a log-log cumulative probability chart. The y-axis is the LOG of the frequency of the data (or log-probability, if you like), and the x-axis is the log of the values. If it's a straight line, then it fits a power law distribution, and the gradient determines the power law parameter.
If I want the frequency of the data, I can just use the ecdf() function:
My data set is called Profits.negative, and it's just a long list of trading profits that were less than zero (and I've notionally converted them all to positive numbers to avoid logging problems later on).
So I can type
plot(ecdf(Profits.negative))
And I get a handy empirical CDF function plotted. All I need to do is to convert both axes to log scales. I can do the x-axis:
Profits.negative.logs <- log(Profits.negative)
plot(ecdf(Profits.negative.logs))
Almost there! I just need to work out how to log the y-axis! But I can't seem to do it, and I can't work out how to extract the figures from the ecdf object. Can anyone help?
I know there is a power.law.fit function, but that just estimates the parameters - I want to plot the data and see if it lines up.
You can fit and plot power-laws using the poweRlaw package. Here's an example. First we generate some data from a heavy tailed distribution:
set.seed(1)
x = round(rlnorm(100, 3, 2)+1)
Next we load the package and create a data object and a displ object:
library(poweRlaw)
m = displ$new(x)
We can estimate xmin and the scaling parameter:
est = estimate_xmin(m)
and set the parameters:
m$setXmin(est[[2]])
m$setPars(est[[3]])
Then plot the data and add the fitted line:
plot(m)
lines(m, col=2)
This produces a log-log plot of the data with the fitted power-law line overlaid.
Data generation first (your part, actually ;)):
set.seed(1)
Profits.negative <- runif(1e3, 50, 100) + rnorm(1e2, 5, 5)
Logging and ecdf:
Profits.negative.logs <- log(Profits.negative)
fn <- ecdf(Profits.negative.logs)
ecdf returns a function, and if you want to extract something from it, it's a good idea to look into the function's closure:
ls(environment(fn))
# [1] "f" "method" "n" "nobs" "x" "y" "yleft" "yright"
Well, now we can access x and y:
x <- environment(fn)$x
y <- environment(fn)$y
That's probably what you need. Indeed, plot(fn) and plot(x, y, type="l") show virtually the same results. To put the y-axis on a log scale you just need:
plot(x,log(y),type="l")
Here is an approach using ggplot2:
library(ggplot2)
library(scales)  # for trans_breaks, trans_format, math_format
# data
set.seed(1)
x = round(rlnorm(100, 3, 2)+1)
# organize data into a df
df <- data.frame(x = sort(x, decreasing = T),
                 pk = ecdf(x)(x),
                 k = seq_along(x))
# plot
ggplot(df, aes(x = k, y = pk)) + geom_point(alpha = 0.5) +
  coord_trans(x = 'log10', y = 'log10') +
  scale_x_continuous(breaks = trans_breaks("log10", function(x) 10^x),
                     labels = trans_format("log10", math_format(10^.x))) +
  scale_y_continuous(breaks = trans_breaks("log10", function(x) 10^x),
                     labels = trans_format("log10", math_format(10^.x)))
