Specify grouping variable in a function - r

I wrote a function to ease the visualization of a bunch of correlations that I was doing. Specifically, I was interested in viewing bivariate relationships side by side in ggplot2 panels with the p-value and rho value printed directly on the graph. I wrote this function using the iris dataset:
library(ggplot2)
library(dplyr)
grouped_cor_ <- function(data, x, y, group.col){
x <- lazyeval::as.lazy(x)
y <- lazyeval::as.lazy(y)
cor1 <- lazyeval::interp(~ cor.test(x, y,method="spearman",na.action = "na.exclude")$estimate, x = x, y = y)
corp <- lazyeval::interp(~ cor.test(x, y,method="spearman", na.action = "na.exclude")$p.value, x = x, y = y)
mnx <- lazyeval::interp(~ mean(x, na.rm=TRUE), x = x, y = y)
mny <- lazyeval::interp(~ mean(y, na.rm=TRUE), x = x, y = y)
summarise_(group_by(data, Species), rho=cor1, pval=corp, xcoord=mnx, ycoord=mny)
}
This is the data frame that I am using to print the statistics from the correlation:
grouped_cor_(data=iris, x=~Petal.Width, y=~Petal.Length)
Then this is the function that calls the plot:
corHighlight <- function(Data, x, y){
cordf<-grouped_cor_(Data, x = substitute(x), y = substitute(y))
cordf$prho <- paste("rho=",round(cordf$rho,3), "\n p-value=",round(cordf$pval,3), sep=" ")
plt<-ggplot(Data, aes_q(x = substitute(x), y = substitute(y))) +
geom_text(data=cordf, aes_q(x=substitute(xcoord),
y=substitute(ycoord),
label=substitute(prho)), colour='red') +
geom_point(size=2, alpha=0.3) +
facet_wrap(~Species)
print(plt)
}
corHighlight(Data=iris,
x=Petal.Width,
y=Petal.Length)
The function, though a little clunky, works well now, with one small detail that I can't seem to figure out: how to add a column specification for the grouping variable. Right now the function is tied to the iris dataset because it only accepts a grouping variable named `Species`. My question, then, is how do I separate this function from the iris dataset and generalize the grouping variable?
Can anyone recommend an efficient way of doing this? Happy to accept any comments that improve the function as well.

This would let you pass a single grouping factor to your helper function. It does require using group_by_, since I extract the name from the formula as a character and then coerce it back to a name:
grouped_cor_ <- function(data, x, y, form){
x <- lazyeval::as.lazy(x)
y <- lazyeval::as.lazy(y)
fac <- as.name(as.character(form)[2]) # grouping column name taken from the one-sided formula
cor1 <- lazyeval::interp(~ cor.test(x, y,method="spearman",na.action = "na.exclude")$estimate, x = x, y = y)
corp <- lazyeval::interp(~ cor.test(x, y,method="spearman", na.action = "na.exclude")$p.value, x = x, y = y)
mnx <- lazyeval::interp(~ mean(x, na.rm=TRUE), x = x, y = y)
mny <- lazyeval::interp(~ mean(y, na.rm=TRUE), x = x, y = y)
summarise_( group_by_(data, fac), rho=cor1, pval=corp, xcoord=mnx, ycoord=mny)
}
To illustrate what I said in the comment (allow the function to accept a formula that can be processed by `facet_wrap`):
corHighlight <- function(Data, x, y, form){
cordf<-grouped_cor_(Data, x = substitute(x), y = substitute(y), form=substitute(form))
cordf$prho <- paste("rho=",round(cordf$rho,3), "\n p-value=",round(cordf$pval,3), sep=" ")
plt<-ggplot(Data, aes_q(x = substitute(x), y = substitute(y))) +
geom_text(data=cordf, aes_q(x=substitute(xcoord),
y=substitute(ycoord),
label=substitute(prho)), colour='red') +
geom_point(size=2, alpha=0.3) +
facet_wrap(form)
print(plt)
}
corHighlight(Data=iris,
x=Petal.Width,
y=Petal.Length, form = ~Species)
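For readers on a current dplyr, here is a minimal sketch of the same idea without lazyeval or the underscore verbs. It assumes dplyr >= 1.0 with the `{{ }}` (embrace) operator available; the name grouped_cor and its argument names are hypothetical, not part of the answer above.

```r
library(dplyr)

# Hypothetical helper: per-group Spearman correlation plus label coordinates.
# The grouping column is passed unquoted and forwarded with {{ }}.
grouped_cor <- function(data, x, y, group) {
  data %>%
    group_by({{ group }}) %>%
    summarise(
      # exact = FALSE avoids the "cannot compute exact p-value with ties" warning
      rho    = cor.test({{ x }}, {{ y }}, method = "spearman", exact = FALSE)$estimate,
      pval   = cor.test({{ x }}, {{ y }}, method = "spearman", exact = FALSE)$p.value,
      xcoord = mean({{ x }}, na.rm = TRUE),
      ycoord = mean({{ y }}, na.rm = TRUE),
      .groups = "drop"
    )
}

res <- grouped_cor(iris, Petal.Width, Petal.Length, Species)
res
```

In the plotting wrapper, the same column could probably be forwarded to the facets with facet_wrap(vars({{ group }})), though I have only sketched the summary step here.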

geom_smooth() with median instead of mean

I am building a plot with ggplot. I have data where y is mostly independent of X, but I randomly have a few extreme values of Y at low values of X. Like this:
set.seed(1)
X <- rnorm(500, mean=5)
y <- rnorm(500)
y[X < 3] <- sample(c(0, 1000), size=length(y[X < 3]),prob=c(0.9, 0.1),
replace=TRUE)
I want to make the point that the MEDIAN y-value is still constant over X values. I can see that this is basically true here:
mean(y[X < 3])
median(y[X < 3])
If I make a geom_smooth() plot, it does mean, and is very affected by outliers:
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth()
I have a few potential fixes. For example, I could first use group_by/summarize to make a dataset of binned medians and then plot that. I would rather NOT do this because in my real data I have a lot of faceting and grouping variables, and it would be a lot to keep track of (non-ideal). A log plot definitely looks better, but log does not have a nice interpretation in my application (the median does):
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth() +
scale_y_log10()
Finally, I know about geom_quantile, but I think I'm using it wrong. Is there a way to add an error bar? Also, this geom_quantile plot looks way too smooth, and I don't understand why it is sloping down.
ggplot(data=NULL, aes(x=X, y=y)) +
geom_quantile(quantiles=c(0.5))
I realize that this problem probably has a LOT of workarounds, but if possible I would love to use geom_smooth and just provide an argument that tells it to use a median. I want geom_smooth for a side-by-side comparison with consistency. I want to put the mean and median geom_smooths side-by-side to show "hey look, super strong pattern between Y and X is driven by a few large outliers, if we look only at median the pattern disappears".
Thanks!!
You can create your own method to use in geom_smooth, as long as you have a function that produces an object on which the predict generic works: predict must take a data frame with a column called x and translate it into appropriate values of y.
As an example, let's create a simple model that interpolates along a running median. We wrap it in its own class and give it its own predict method:
rolling_median <- function(formula, data, n_roll = 11, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
Now we can use our method in geom_smooth:
ggplot(data = NULL, aes(x = X, y = y)) +
geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now of course, this doesn't look very "flat", but it is way flatter than the line calculated by the loess method of the standard geom_smooth():
ggplot(data = NULL, aes(x = X, y = y)) +
geom_smooth(formula = y ~ x, color = "red", se = FALSE) +
geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now, I understand that this is not the same thing as "regressing on the median", so you may wish to explore different methods, but if you want to get geom_smooth to plot them, this is how you can go about it. Note that if you want standard errors, you will need to have your predict function return a list with members called fit and se.fit.
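To check the "model object plus predict method" contract outside of ggplot, here is a small self-contained sketch. It swaps in stats::runmed from base R instead of zoo::rollmedian, so it is a variant rather than the code above; the names runmed_fit and predict.runmed_fit are hypothetical.

```r
# Hypothetical variant of the running-median "model", using base R only.
runmed_fit <- function(formula, data, k = 11, ...) {
  ord <- order(data$x)
  x <- data$x[ord]
  y <- stats::runmed(data$y[ord], k = k)   # k must be odd
  # Store an interpolating function so predictions work at arbitrary x
  structure(list(f = approxfun(x, y, ties = mean)), class = "runmed_fit")
}

# predict method: takes a data frame with a column x, returns fitted y values
predict.runmed_fit <- function(object, newdata, ...) {
  object$f(newdata$x)
}

set.seed(1)
d <- data.frame(x = runif(200), y = rnorm(200))
d$y[c(10, 50)] <- 1000                      # two extreme outliers

fit <- runmed_fit(y ~ x, d)
p <- predict(fit, data.frame(x = c(0.25, 0.5, 0.75)))
p
```

The predictions stay near zero despite the outliers. I would expect the same function to be usable as geom_smooth(method = runmed_fit, se = FALSE), since it follows the same pattern as rolling_median, but I have only exercised it directly here.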
Here's a modification of @Allan's answer that uses a fixed x window rather than a fixed number of points. This is useful for irregular time series and series with multiple observations at the same time (x value). It uses a loop, so it's not very efficient and will be slow for larger data sets.
# running median with time window
library(dplyr)
library(ggplot2)
library(zoo)
# some irregular and skewed data
set.seed(1)
x <- seq(2000, 2020, length.out = 400) # normal time series, gives same result for both methods
x <- sort(rep(runif(40, min = 2000, max = 2020), 10)) # irregular and repeated time series
y <- exp(runif(length(x), min = -1, max = 3))
data <- data.frame(x = x, y = y)
# ggplot(data) + geom_point(aes(x = x, y = y))
# 2 year window
xwindow <- 2
nwindow <- xwindow * length(x) / 20 - 1
# rolling median
rolling_median <- function(formula, data, n_roll = 11, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
# rolling time window median
rolling_median2 <- function(formula, data, xwindow = 2, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
ys <- rep(NA, length(x)) # for the smoothed y values
xs <- setdiff(unique(x), NA) # the unique x values
i <- 1 # for testing
for (i in seq_along(xs)){
j <- xs[i] - xwindow/2 < x & x < xs[i] + xwindow/2 # x points in this window
ys[x == xs[i]] <- median(y[j], na.rm = TRUE) # y median over this window
}
y <- ys
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed2")
}
predict.rollmed2 <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
# plot smooth
ggplot(data) +
geom_point(aes(x = x, y = y)) +
geom_smooth(aes(x = x, y = y, colour = "nwindow"), formula = y ~ x, method = "rolling_median", se = FALSE, method.args = list(n_roll = nwindow)) +
geom_smooth(aes(x = x, y = y, colour = "xwindow"), formula = y ~ x, method = "rolling_median2", se = FALSE, method.args = list(xwindow = xwindow))
Created on 2022-01-05 by the reprex package (v2.0.1)

ggplot2: How to get geom_text() to play nice with facet_grid()?

So I'm trying to plot a couple of curves using ggplot(), and I would like to have each curve sitting in its own plot in a facet_grid. All of this works fine.
The problem is that I'd also like to annotate the curve with the x value corresponding to the peak y value. I tried using geom_text(), and I tried implementing it as shown below, but it doesn't seem to quite work. It's clearly printing something onto the plot, but not the way I hoped it would; i.e., each plot has its corresponding x value printed on it at the location (x, max(y)).
I suspect I've not implemented the ifelse() correctly, but I'm not experienced enough with R to figure out what exactly the problem is.
Any suggestions on where I'm going wrong?
Output:
Data + code:
library('ggplot2')
x <- seq(5, 15, length=1000)
y <- dnorm(x, mean=10, sd=1)
z <- rep_len("z", length.out = 1000)
x1 <- seq(5, 15, length=1000)
y1 <- dnorm(x1, mean=10, sd=2)
z1 <- rep_len("z1", length.out = 1000)
x <- c(x, x1)
y <- c(y, y1)
z <- c(z, z1)
df <- data.frame(x, y, z)
ggplot(data = df, aes(x, y)) + geom_line() + facet_grid(.~z) + geom_text(data = df, aes(x, y, label = ifelse(y == max(y), as.numeric(x), '')), inherit.aes = FALSE, hjust = 0, vjust = 0)
Edit: the output I'm expecting is something like this:
You need to fix two things.
(1) calculate max per z
(2) avoid duplicate y_values
The following code should fix both:
library(dplyr)
df2 <- df %>%
distinct(y, .keep_all = TRUE) %>%
group_by(z) %>%
mutate(y_label = ifelse(y == max(y), as.numeric(x), ''))
as.data.frame(df2)
ggplot(data = df2, aes(x, y)) + geom_line() + facet_grid(.~z) + geom_text(aes(label = y_label), hjust = 0, vjust = 0)
You need to provide geom_text a data.frame with data for z and z1.
          x         y  z
z  9.994995 0.3989373  z
z1 9.994995 0.1994705 z1
How to get that? Well, here's one way.
df.split <- split(df, f = df$z)
df.max <- sapply(df.split, FUN = function(x) which.max(x$y))
df.max <- mapply(function(x1, x2) x1[x2, ], x1 = df.split, x2 = df.max, SIMPLIFY = FALSE)
df.max <- do.call(rbind, df.max)
which you can then plot
ggplot(data = df, aes(x, y)) +
geom_line() +
geom_text(data = df.max, aes(x = x, y = y, label = round(y, 2))) +
facet_grid(. ~ z)
Get the means and maxes for each z:
library(dplyr)
Ys <- df %>% group_by(z) %>% summarise(maxY = max(y))
Xs <- df %>% group_by(z) %>% summarise(meanX = mean(x))
Plot with the geom_text
ggplot(data = df, aes(x, y)) +
geom_line() +
geom_text(data = left_join(Xs,Ys), aes(meanX, maxY, label = meanX)) +
facet_grid(.~z)
Or more succinctly
ggplot(data = df, aes(x, y)) +
geom_line() +
geom_text(data =
df %>%
group_by(z) %>%
summarise(maxY = max(y), meanX = mean(x)),
aes(meanX, maxY, label = meanX)) +
facet_grid(.~z)
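All of the answers above come down to handing geom_text a small data frame with one row per facet. As a compact sketch of that idea, assuming dplyr >= 1.0 (for slice_max) and rebuilding the question's data so the block runs on its own; the name peaks is my own:

```r
library(dplyr)
library(ggplot2)

# Rebuild the example data from the question
x <- seq(5, 15, length.out = 1000)
df <- data.frame(
  x = rep(x, 2),
  y = c(dnorm(x, mean = 10, sd = 1), dnorm(x, mean = 10, sd = 2)),
  z = rep(c("z", "z1"), each = 1000)
)

# One row per facet: the row where y is largest (ties broken arbitrarily)
peaks <- df %>%
  group_by(z) %>%
  slice_max(y, n = 1, with_ties = FALSE) %>%
  ungroup()

# The label layer uses only the two peak rows, so each panel gets one label
p <- ggplot(df, aes(x, y)) +
  geom_line() +
  geom_text(data = peaks, aes(label = round(x, 2)), vjust = -0.5) +
  facet_grid(. ~ z)
```

Because peaks keeps the z column, facet_grid can route each label to its own panel.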

How to use the rlang `!!!` operator to define a function that wraps around a ggplot call? (Error: Can't use `!!!` at top level)

Context
Reading the vignette Programming with dplyr I tried to use the ... and !!! operators to implement a function that would wrap around ggplot functions and would accept an arbitrary number of arguments that would define which variables in a dataframe were to be mapped to each aesthetic.
My goal
I wanted to define a function plot_points2() such that
plot_points2(df, x = x, y = y, color = z) would be equivalent to df %>% ggplot( mapping = aes(x = x, y = y, color = z) ) + geom_point(alpha = 0.1)
plot_points2(df, x = x, y = z, color = y) would be equivalent to df %>% ggplot( mapping = aes(x = x, y = z, color = y) ) + geom_point(alpha = 0.1)
plot_points2(df, x = x, y = z) would be equivalent to df %>% ggplot( mapping = aes(x = x, y = z) ) + geom_point(alpha = 0.1)
What failed
packages
require(tidyverse)
require(rlang)
reduced example dataset
df <- tibble(g1= sample(x = c(1,2,3), replace = T, size = 10000),
g2= sample(x = c("a","b","c"), replace = T, size = 10000),
x = rnorm(10000, 50, 10),
y = rnorm(10000, 0, 20) + x*2,
z = rnorm(10000, 10, 5))
df
my attempt
plot_points2 <- function(d, ...){
args <- quos(...)
print(args)
ggplot(data = d, mapping = aes(!!!args)) + geom_point(alpha = 0.1)
}
plot_points2(df, x = x, y = y, color = z)
the error
Error: Can't use `!!!` at top level
Call `rlang::last_error()` to see a backtrace
Why I think it should work
I figure what I wanted to accomplish isn't much different from an example in the vignette that uses these operators to make a function that wraps around mutate() and passes multiple arguments that define the grouping variables (indeed, I was able to implement a function that does that on the example dataset above, which I'm posting as an example), but somehow the latter works and the former doesn't:
this works
add_dif_to_group_mean <- function(df, ...) {
groups <- quos(...)
df %>% group_by(!!!groups) %>% mutate(x_dif = x-mean(x),
y_dif = y-mean(y),
z_dif = z-mean(z))
}
df %>% add_dif_to_group_mean(g1)
df %>% add_dif_to_group_mean(g1, g2)
this doesn't
plot_points2 <- function(d, ...){
args <- quos(...)
print(args)
ggplot(data = d, mapping = aes(!!!args)) + geom_point(alpha = 0.1)
}
plot_points2(df, x = x, y = y, color = z)
I also read that the problem could be related to aes() being evaluated only when the plot is printed, but in that case I think using !! and unpacking manually should raise the same error, and it doesn't:
plot_points2b <- function(d, ...){
args <- quos(...)
print(args)
ggplot(data = d, mapping = aes(x = !!args[[1]],
y = !!args[[2]],
color = !!args[[3]])) +
geom_point(alpha = 0.1)
}
plot_points2b(df, x = x, y = y, color = z)
Indeed, this last example works fine if you plot 3 variables, but it doesn't allow you to plot a number of variables different from 3,
e.g. plot_points2b(df, x = x, y = z) is not equivalent to
df %>% ggplot( mapping = aes(x = x, y = z) ) + geom_point(alpha = 0.1)
Instead, it raises the error:
Error in args[[3]] : subscript out of bounds
Does anyone know what concept I am missing here? Thank you in advance!
Your specific use case is an example in ?aes. aes automatically quotes its arguments. One can simply directly pass the dots. Try:
plot_points3 <- function(d, ...){
print(aes(...))
ggplot(d, aes(...)) + geom_point(alpha = 0.1)
}
plot_points3(df, x = x, y = y, color = z)
This nicely prints:
Aesthetic mapping:
* `x` -> `x`
* `y` -> `y`
* `colour` -> `z`
And yields the required plot.
As mentioned in my comment, I think you may already have x and y in your environment and that is why some of your code is working. I'm not totally sure what you are trying to achieve but I think you are doing too much rlang for getting your code to run without error.
For example:
plot_points <- function(d, ...){
ggplot(data = d, mapping = aes(x = x, y = y)) +
geom_point(alpha = 0.1)
}
plot_points (df, x, y)
will make your plot without any reason to add the overhead and complexity of !!! or enquo().
You were on this path here too, where this much simpler code works fine:
add_dif_to_group_mean <- function(., ...) {
df %>% group_by(g1) %>% mutate(x_dif = x-mean(x),
y_dif = y-mean(y),
z_dif = z-mean(z))
}
df %>% add_dif_to_group_mean(g1)
Likewise:
plot_points2 <- function(d, ...){
ggplot(data = d, mapping = aes(x=x, y=y, color=z)) +
geom_point(alpha = 0.1)
}
plot_points2(df, x = x, y = y, color = z)
works fine from what I can tell.
So I understand that maybe you are working through the examples in the book, which is great. But I think there is a missing issue somewhere that would make it so you have to do all the extra stuff in a real world function. For example, maybe you want to pass in strings like "x" and "y" instead of x and y?
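If passing strings is indeed the goal (that is an assumption about the use case, not something the question states), one option is the .data pronoun, which ggplot2 supports inside aes() from version 3.0.0 onward. A minimal sketch, with a hypothetical helper name and a toy data frame:

```r
library(ggplot2)

# Hypothetical helper: column names arrive as character strings
plot_points_chr <- function(d, xvar, yvar) {
  ggplot(d, aes(x = .data[[xvar]], y = .data[[yvar]])) +
    geom_point(alpha = 0.1)
}

set.seed(1)
toy <- data.frame(x = rnorm(50), z = rnorm(50))
p <- plot_points_chr(toy, "x", "z")
```

Printing p draws the scatter plot; the axis titles will show the .data[[...]] expressions unless you set them with labs().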

Adding orthogonal regression line in ggplot

I have plotted a scatter graph in R, comparing expected to observed values, using the following script:
library(ggplot2)
library(dplyr)
library(readr) # provides read_csv()
r<-read_csv("Uni/MSci/Project/DATA/new data sheets/comparisons/for comarison graphs/R Regression/GAcAs.csv")
x<-r[1]
y<-r[2]
ggplot()+geom_point(aes(x=x,y=y))+
scale_size_area() +
xlab("Expected") +
ylab("Observed") +
ggtitle("G - As x Ac")+ xlim(0, 40)+ylim(0, 40)
My plot is as follows:
I then want to add an orthogonal regression line (as there could be errors in both the expected and observed values). I have calculated the beta value using the following:
v <- prcomp(cbind(x,y))$rotation
beta <- v[2,1]/v[1,1]
Is there a way to add an orthogonal regression line to my plot?
Borrowed from this blog post & this answer. Basically, you will need the Deming function from MethComp, or prcomp from the stats package, together with a custom function perp.segment.coord. Below is an example taken from the above-mentioned blog post.
library(ggplot2)
library(MethComp)
data(airquality)
airquality <- na.exclude(airquality)
# Orthogonal, total least squares or Deming regression
deming <- Deming(y=airquality$Wind, x=airquality$Temp)[1:2]
deming
#> Intercept Slope
#> 24.8083259 -0.1906826
# Check with prcomp {stats}
r <- prcomp( ~ airquality$Temp + airquality$Wind )
slope <- r$rotation[2,1] / r$rotation[1,1]
slope
#> [1] -0.1906826
intercept <- r$center[2] - slope*r$center[1]
intercept
#> airquality$Wind
#> 24.80833
# https://stackoverflow.com/a/30399576/786542
perp.segment.coord <- function(x0, y0, ortho){
# finds endpoint for a perpendicular segment from the point (x0,y0) to the line
# defined by ortho as y = a + b*x
a <- ortho[1] # intercept
b <- ortho[2] # slope
x1 <- (x0 + b*y0 - a*b)/(1 + b^2)
y1 <- a + b*x1
list(x0=x0, y0=y0, x1=x1, y1=y1)
}
perp.segment <- perp.segment.coord(airquality$Temp, airquality$Wind, deming)
perp.segment <- as.data.frame(perp.segment)
# plot
plot.y <- ggplot(data = airquality, aes(x = Temp, y = Wind)) +
geom_point() +
geom_abline(intercept = deming[1],
slope = deming[2]) +
geom_segment(data = perp.segment,
aes(x = x0, y = y0, xend = x1, yend = y1),
colour = "blue") +
theme_bw()
Created on 2018-03-19 by the reprex package (v0.2.0).
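If you only need the line itself and would rather not install MethComp, the prcomp route from the question is enough. Below is a sketch on simulated data (the data and variable names are made up for illustration) that turns the first principal component into a slope and intercept and cross-checks the slope against the closed-form orthogonal regression expression.

```r
set.seed(42)
x <- rnorm(200, mean = 20, sd = 5)
y <- 3 + 0.8 * x + rnorm(200, sd = 2)

# First principal component gives the direction of the orthogonal regression line
pc <- prcomp(cbind(x, y))
slope <- unname(pc$rotation[2, 1] / pc$rotation[1, 1])
# The line passes through the point of means
intercept <- mean(y) - slope * mean(x)

# Cross-check: closed-form total least squares slope (equal error variances)
s_xx <- var(x); s_yy <- var(y); s_xy <- cov(x, y)
slope_cf <- (s_yy - s_xx + sqrt((s_yy - s_xx)^2 + 4 * s_xy^2)) / (2 * s_xy)

c(slope = slope, slope_closed_form = slope_cf, intercept = intercept)
```

With these two numbers, adding geom_abline(intercept = intercept, slope = slope) to an existing scatter plot draws the line.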
The MethComp package no longer seems to be maintained (it was removed from CRAN).
Russel88/COEF lets you use stat_smooth/geom_smooth with method="tls" to add an orthogonal regression line.
Based on this and wikipedia:Deming_regression, I created the following functions, which allow noise ratios other than 1:
deming.fit <- function(x, y, noise_ratio = sd(y)/sd(x)) {
if(missing(noise_ratio) || is.null(noise_ratio)) noise_ratio <- eval(formals(sys.function(0))$noise_ratio) # this is just a complicated way to write `sd(y)/sd(x)`
delta <- noise_ratio^2
x_name <- deparse(substitute(x))
s_yy <- var(y)
s_xx <- var(x)
s_xy <- cov(x, y)
beta1 <- (s_yy - delta*s_xx + sqrt((s_yy - delta*s_xx)^2 + 4*delta*s_xy^2)) / (2*s_xy)
beta0 <- mean(y) - beta1 * mean(x)
res <- c(beta0 = beta0, beta1 = beta1)
names(res) <- c("(Intercept)", x_name)
class(res) <- "Deming"
res
}
deming <- function(formula, data, R = 100, noise_ratio = NULL, ...){
ret <- boot::boot(
data = model.frame(formula, data),
statistic = function(data, ind) {
data <- data[ind, ]
args <- rlang::parse_exprs(colnames(data))
names(args) <- c("y", "x")
rlang::eval_tidy(rlang::expr(deming.fit(!!!args, noise_ratio = noise_ratio)), data, env = rlang::current_env())
},
R=R
)
class(ret) <- c("Deming", class(ret))
ret
}
predictdf.Deming <- function(model, xseq, se, level) {
pred <- as.vector(tcrossprod(model$t0, cbind(1, xseq)))
if(se) {
preds <- tcrossprod(model$t, cbind(1, xseq))
data.frame(
x = xseq,
y = pred,
ymin = apply(preds, 2, function(x) quantile(x, probs = (1-level)/2)),
ymax = apply(preds, 2, function(x) quantile(x, probs = 1-((1-level)/2)))
)
} else {
return(data.frame(x = xseq, y = pred))
}
}
# unrelated helper function to create a nicer plot:
fix_plot_limits <- function(p) p + coord_cartesian(xlim=ggplot_build(p)$layout$panel_params[[1]]$x.range, ylim=ggplot_build(p)$layout$panel_params[[1]]$y.range)
Demonstration:
library(ggplot2)
#devtools::install_github("Russel88/COEF")
library(COEF)
fix_plot_limits(
ggplot(data.frame(x = (1:5) + rnorm(100), y = (1:5) + rnorm(100)*2), mapping = aes(x=x, y=y)) +
geom_point()
) +
geom_smooth(method=deming, aes(color="deming"), method.args = list(noise_ratio=2)) +
geom_smooth(method=lm, aes(color="lm")) +
geom_smooth(method = COEF::tls, aes(color="tls"))
Created on 2019-12-04 by the reprex package (v0.3.0)
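For reference, the estimate computed inside deming.fit above can be written out as follows. This is transcribed from the code rather than derived independently: $\delta$ stands for noise_ratio squared, and $s_{xx}$, $s_{yy}$, $s_{xy}$ are the sample variances and covariance.

```latex
\hat\beta_1 = \frac{s_{yy} - \delta\, s_{xx} + \sqrt{\left(s_{yy} - \delta\, s_{xx}\right)^2 + 4\,\delta\, s_{xy}^2}}{2\, s_{xy}},
\qquad
\hat\beta_0 = \bar{y} - \hat\beta_1\, \bar{x}
```

Setting $\delta = 1$ gives the ordinary orthogonal (total least squares) slope.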
I'm not sure I completely understand the question, but if you want line segments to show errors along both x and y axis, you can do this using geom_segment.
Something like this:
library(ggplot2)
df <- data.frame(x = rnorm(10), y = rnorm(10), w = rnorm(10, sd=.1))
ggplot(df, aes(x = x, y = y, xend = x, yend = y)) +
geom_point() +
geom_segment(aes(x = x - w, xend = x + w)) +
geom_segment(aes(y = y - w, yend = y + w))

Multiple Regression lines in ggplot2

Here is some test code, and I don't understand why it is not working as expected. It is a ggplot2 question, not an R one.
library(ggplot2)
K = 10
x <- 1:100/100
y <- sapply (x, FUN= function(x) 1+x)
xy <- data.frame(x,y)
set.seed(1234)
xy$yrand <- xy$y + runif(100,min = -0.35, max = 0.5)
folds <- cut(seq(1, nrow(xy)), breaks = K, labels = FALSE)
p1 <- ggplot(xy, aes(x = xy$x, y = xy$yrand))+geom_point() +ggtitle ("Simple x vs y plot with added random noise") + xlab("X") + ylab("Y")
for(i in 1:K){
#Segment your data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- xy[testIndexes, ]
trainData <- xy[-testIndexes, ]
lmTemp <- lm(yrand ~ x, data = trainData)
p1 <- p1 + geom_line(data = trainData, aes(x = trainData$x, y = predict(lmTemp, newdata = trainData)))
}
p1
Now what I would like to see is a plot with 10 lines (the regression lines), but I only see one. Can someone help me out? Is it the ggplot2 syntax that is wrong?
Thanks, Umberto
EDITED:
I marked the answer I got since it is a nice way of doing it. I just wanted to add a simple way of doing it preparing the datasets for the graph I wanted to create. I think this method is slightly easier to understand if you don't have so much R experience.
fitLines <- NULL # start empty so rbind() has something to append to
for(i in 1:K){
#Segment your data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- xy[testIndexes, ]
trainData <- xy[-testIndexes, ]
lmTemp <- lm(yrand ~ x, data = trainData)
# Let's build a data set for the lines
fitLines <- rbind(fitLines, data.frame(rep(paste("set",i),nrow(trainData)),trainData[,1], predict(lmTemp, newdata = trainData)))
}
names(fitLines) <- c("set", "x","y")
p1 + geom_line(data = fitLines, aes(x = x, y = y, col = set))
And this is what you get
You could use the crossv_kfold() function from the modelr package, and put your complete code into a "pipe workflow":
library(modelr)
library(tidyverse)
x <- 1:100/100
y <- sapply (x, FUN= function(x) 1+x)
xy <- data.frame(x,y)
set.seed(1234)
xy$yrand <- xy$y + runif(100,min = -0.35, max = 0.5)
xy %>%
crossv_kfold() %>%
mutate(
models = map(train, ~ lm(yrand ~ x, data = .x)),
predictions = map2(models, test, ~predict(.x, newdata = .y, type = "response"))
) %>%
select(-train, -test, -models) %>%
unnest(predictions) %>%
bind_cols(xy) %>%
ggplot(aes(x = x, y = predictions)) +
stat_smooth(aes(colour = .id), method = "lm", se = FALSE) +
geom_point(aes(y = yrand))
Putting the colour-aes inside the ggplot-call would also map the points to the groups:
xy %>%
crossv_kfold() %>%
mutate(
models = map(train, ~ lm(yrand ~ x, data = .x)),
predictions = map2(models, test, ~predict(.x, newdata = .y, type = "response"))
) %>%
select(-train, -test, -models) %>%
unnest(predictions) %>%
bind_cols(xy) %>%
ggplot(aes(x = x, y = predictions, colour = .id)) +
stat_smooth(method = "lm", se = FALSE) +
geom_point(aes(y = yrand))
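As for why the original loop drew only one line: as far as I can tell, aes() stores its expressions unevaluated and ggplot2 evaluates them when the plot is drawn, so every layer ends up looking at the last trainData and lmTemp. Precomputing the fitted values per fold avoids that. Here is a base R sketch of that step that builds the line data in one go instead of growing it with rbind inside a for loop (fit_lines is my own name):

```r
K <- 10
x <- 1:100 / 100
set.seed(1234)
xy <- data.frame(x = x, yrand = 1 + x + runif(100, min = -0.35, max = 0.5))
folds <- cut(seq_len(nrow(xy)), breaks = K, labels = FALSE)

# One data frame of fitted values per fold, stacked at the end
fit_lines <- do.call(rbind, lapply(seq_len(K), function(i) {
  train <- xy[folds != i, ]                 # leave fold i out
  fit <- lm(yrand ~ x, data = train)
  data.frame(set = paste("set", i),
             x = train$x,
             y = predict(fit, newdata = train))
}))

head(fit_lines)
```

The result can then be added to the scatter plot with geom_line(data = fit_lines, aes(x = x, y = y, colour = set)), as in the edit to the question.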
