I have written a function that plots (in both base R and ggplot) the misclassification rate for various values of K in a KNN classification problem. My problem is that the base R plot displays, but the ggplot graph does not. When I take the ggplot code out of the function, it works.
Can someone please point out what I am doing wrong?
Code:
library(ISLR)
library(ggplot2)
library(class)
data("Weekly")
train <- (Weekly$Year < 2009)
Weekly.train <- Weekly[ train, ]
Weekly.test <- Weekly[ !train, ]
knn.train.x <- scale( as.data.frame(Weekly$Lag2[train]) )
knn.test.x <- scale( as.data.frame(Weekly$Lag2[!train]) )
train.Direction <- Weekly$Direction[train]
set.seed(1234)
#Function for choosing k in knn
misclassknn <- function(train, test,
response.train,
response.test,
Kmax){
K <- 1:Kmax
misclass <- numeric(Kmax)
for( k in K){
knn.pred <- knn(train,test,response.train, k=k)
misclass[k] <- mean(knn.pred!=response.test)
}
# base R
plot(c(1, Kmax), c(0, 1), type = "n",
main = "Misclassification Rate for K Values",
xlab = "K", ylab = "Misclassification Rate")
points(1 : Kmax, misclass, type = "b", pch = 16)
# ggplot
df <- data.frame(K = 1 : Kmax, misclass = misclass)
ggplot(df, aes(x = K, y = misclass)) + geom_line() + ylim(0, 1) +
geom_point() + labs( title = "Misclassification Rate for K Values",
y = "Misclassification Rate", x = "K")
return(list(K = Kmax, misclass = misclass,
Kmin = which.min(misclass)))
}
misclassknn(train = knn.train.x,
test = knn.test.x,
response.train = train.Direction,
response.test = Weekly$Direction[!train],
Kmax = 15)
We take for granted the way plotting works in R. A ggplot() call does not draw anything by itself; it returns an object that describes how to build the plot. At the top level (the console, or a script run line by line), R auto-prints the value of an expression, and printing a ggplot object is what draws it. Inside a function, intermediate values are not auto-printed, so the plot object is created and then silently discarded. You either need to display it explicitly by wrapping the ggplot command in print() (or plot()), or return the object from your function so that it gets printed at the top level. Note that ggplot_build() only computes the data behind the plot; it does not draw it.
It's common to store the ggplot object with something like
p <- ggplot(etc) + geom_etc() + ...
Now the object p can be drawn when you want it to be. Inside the function you can call print(p), or you can return it, e.g. return(p), and let it print in your main code.
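As a minimal sketch of both options combined, the function below draws the plot explicitly and also returns it. The name plot_rates and the toy rates are hypothetical, not taken from the question.

```r
library(ggplot2)

# Hypothetical helper (not from the question): builds a plot inside a function.
plot_rates <- function(rates) {
  df <- data.frame(K = seq_along(rates), misclass = rates)
  p <- ggplot(df, aes(x = K, y = misclass)) +
    geom_line() +
    geom_point() +
    ylim(0, 1)
  print(p)  # draw explicitly: values are not auto-printed inside a function
  invisible(list(plot = p, Kmin = which.min(rates)))
}

# Toy misclassification rates, for illustration only.
res <- plot_rates(c(0.48, 0.45, 0.42, 0.44))
```

Typing res$plot at the top level afterwards draws the same plot again, because the object is auto-printed there.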
I want to draw multiple simulated paths from any distribution (lognormal in the present case) on the same plot using ggplot2.
Using print(ggplot()) inside a for loop does not show the paths all together.
library(ggplot2)
t <- 1000 # length of a simulation
time <- seq(0,t-1,by = 1) # make vector of time points
s <- cumsum(rlnorm(t, meanlog = 0, sdlog = 1)) # simulate trajectory of lognormal variable
df <- data.frame(cbind(time,s)) # make dataframe
colnames(df) <- c("t","s") # colnames
ggplot(df, aes(t,s )) + geom_line() # Get one trajectory
Now I want (say) 100 such paths in the same plot:
nsim <- 100 # number of paths
for (i in seq(1,nsim, by =1)) {
s <- cumsum(rlnorm(t, meanlog = 0, sdlog = 1))
df <- data.frame(cbind(time,s))
colnames(df) <- c("t","s")
print(ggplot(df, aes(t,s, color = i)) + geom_line())
}
The above loop obviously cannot do the job.
Any way to visualize such simulations using simple R with ggplot?
Instead of adding each line iteratively, you could iteratively simulate in a loop, collect all results in a data.frame, and plot all lines at once.
library(ggplot2)
nsim <- 100
npoints <- 1000
sims <- lapply(seq_len(nsim), function(i) {
data.frame(x = seq_len(npoints),
y = cumsum(rlnorm(npoints, meanlog = 0, sdlog = 1)),
iteration = i)
})
sims <- do.call(rbind, sims)
ggplot(sims, aes(x, y, colour = iteration, group = iteration)) +
geom_line()
Created on 2019-08-13 by the reprex package (v0.3.0)
In ggplot2, one way to achieve this is to add an extra layer to the plot at each iteration. With that approach, a small change to the code in the question is sufficient.
library(ggplot2)
nsim <- 100 # number of paths
dat <- vector("list", nsim)
p <- ggplot()
t <- 1000 # length of a simulation
time <- seq(0, t-1, by = 1)
for (i in seq(nsim)) {
s <- cumsum(rlnorm(t, meanlog = 0, sdlog = 1))
dat[[i]] <- data.frame(t = time, s = s)
p <- p + geom_line(data = dat[[i]], mapping = aes(x = t, y = s), col = i)
}
p #or print(p)
Note how I initialize the plot before the loop, in the same way that I initialize a list to contain the data frames. The loop then builds the plot step by step, but nothing is drawn until I print the plot after the for loop. At that point every layer is evaluated, so it can take a bit longer than standard R plots.
Additionally, as I want to specify the colour for each specific line, the col argument has to be moved outside the aes.
I have a two dimensional dataset (say columns x and y). I use the following function to plot a QQ-plot of this data.
# Creating a toy data for presentation
df = cbind(x = c(1,5,8,2,9,6,1,7,12), y = c(1,4,10,1,6,5,2,1,32))
# Plotting the QQ-plot
df_qq = as.data.frame(qqplot(df[,1], df[,2], plot.it=FALSE))
ggplot(df_qq) +
geom_point(aes(x=x, y=y), size = 2) +
geom_abline(intercept = c(0,0), slope = 1)
That is the resulting graph:
My question is: how can I avoid plotting the last point, i.e. (12, 32)? I would rather not delete it manually because I have several of these data pairs and there are similar outliers in each of them. What I would like to do is write code that identifies the points that are too far from the 45 degree line and removes them from df_qq (for instance, a point that is 5 times further away than the average distance to the 45 degree line can be removed). My main objective is to make the graph easier to read. When outliers are not removed, the more regular part of the QQ-plot occupies too small a part of the graph, which prevents me from visually evaluating the similarity of the two vectors apart from the outliers.
I would appreciate any help.
There is a CRAN package, referenceIntervals, that uses Cook's distance to detect outliers. Applying it to the values of df_qq$y gives the outlying values, which can then be turned into an index of rows to remove from df_qq.
library(referenceIntervals)
out <- cook.outliers(df_qq$y)$outliers
i <- which(df_qq$y %in% out)
ggplot(df_qq[-i, ]) +
geom_point(aes(x=x, y=y), size = 2) +
geom_abline(intercept = c(0,0), slope = 1)
Edit.
Following the OP's comment,
But as far as I understand this function does not look at
the relation between x & y,
maybe the following function is what is needed to remove outliers only if they are outliers in one of the vectors but not in both.
cookOut <- function(X){
out1 <- cook.outliers(X[[1]])$outliers
out2 <- cook.outliers(X[[2]])$outliers
i <- X[[1]] %in% out1
j <- X[[2]] %in% out2
w <- which((!i & j) | (i & !j))
if(length(w)) X[-w, ] else X
}
Test with the second data set, the one in the comment.
The extra vector, id, is just to make faceting easier.
df1 <- data.frame(x = c(1,5,8,2,9,6,1,7,12), y = c(1,4,10,1,6,5,2,1,32))
df2 <- data.frame(x = c(1,5,8,2,9,6,1,7,32), y = c(1,4,10,1,6,5,2,1,32))
df_qq1 = as.data.frame(qqplot(df1[,1], df1[,2], plot.it=FALSE))
df_qq2 = as.data.frame(qqplot(df2[,1], df2[,2], plot.it=FALSE))
df_qq_out1 <- cookOut(df_qq1)
df_qq_out2 <- cookOut(df_qq2)
df_qq_out1$id <- "A"
df_qq_out2$id <- "B"
df_qq_out <- rbind(df_qq_out1, df_qq_out2)
ggplot(df_qq_out) +
geom_point(aes(x=x, y=y), size = 2) +
geom_abline(intercept = c(0,0), slope = 1) +
facet_wrap(~ id)
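For comparison, the rule described in the question (drop points that are more than 5 times further from the 45 degree line than average) can be sketched in base R. This is an assumption about what that rule means, not part of the answer above; the names dist45 and kept are made up for illustration.

```r
# Toy data from the question.
df <- cbind(x = c(1, 5, 8, 2, 9, 6, 1, 7, 12),
            y = c(1, 4, 10, 1, 6, 5, 2, 1, 32))
df_qq <- as.data.frame(qqplot(df[, 1], df[, 2], plot.it = FALSE))

# Perpendicular distance of each QQ point from the line y = x.
dist45 <- abs(df_qq$y - df_qq$x) / sqrt(2)

# Keep points within 5 times the average distance (the 5 is the
# question's example threshold, not a recommended value).
kept <- df_qq[dist45 <= 5 * mean(dist45), ]
```

The resulting kept data frame can be passed to ggplot() in place of df_qq. Because the outlier itself inflates the mean distance, a median-based threshold may be more robust on other data.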
I need to visualize data using a boxplot, but boxplot cannot work with a list object. I tried simply using unlist on the lm object, but it still says that the data is a list. I have read in the R documentation that an lm fit is a list which has the individual residuals as one of its components. How can I do it?
new_data.ref_conc <- lm(formula = conc~OD, data=new_data)
unlist(new_data.ref_conc)
boxplot(new_data.ref_conc~control+treat, data=new_data)
Here is one way to draw a boxplot of the residuals (epsilon) and the fitted values (yhat) of a linear regression. Since you did not provide your data, I created my own:
set.seed(1)
x <- rexp(100, 1)
y <- 1 + 2*x + rnorm(100)
lm_obj <- lm(y~x)
plotdata <- data.frame(type = rep(c("res", "yhat"), each = 100),
value = c(residuals(lm_obj), fitted(lm_obj)))
boxplot(value~type, data = plotdata, col = c("dodgerblue", "hotpink2"), pch = 16,
names = c("Residuals", "Fitted Values"), main = "My Boxplot")
I need help with the following question:
Consider the following R function, named negloglike that has two input arguments: lam and x, in that order.
Use this function to produce a plot of the log-likelihood function over a range of values λ ∈ (0, 2).
negloglike <- function(lam, x) {
l = -sum(log(dexp(x, lam)))
return(l)
}
Can anyone please help? Is it possible to do something like this with ggplot? I've been trying to do it with a set value of lam (like 0.2 here for example) using stat_function:
ggplot(data = data.frame(x = 0), mapping = aes(x = x)) +
stat_function(fun = negloglike, args = list(lam = 0.2)) +
xlim(0,10)
but the plot always returns a horizontal line at some y-value instead of returning a curve.
Should I be possibly using a different geom? Or even a different package altogether?
Much appreciated!
The trick is to Vectorize the function over the argument of interest.
Credit for the tip goes to the most voted answer to this question. It uses base graphics only, so here is a ggplot2 equivalent.
First I will define the negative log-likelihood using the function dexp:
library(ggplot2)
negloglike <- function(lam, x) {
l = -sum(dexp(x, lam, log = TRUE))
return(l)
}
nllv <- Vectorize(negloglike, "lam")
But it's better to use the analytic form, which is easy to establish by hand.
negloglike2 <- function(lam, x) {
l = lam*sum(x) - length(x)*log(lam)
return(l)
}
nllv2 <- Vectorize(negloglike2, "lam")
ggplot(data = data.frame(lam = seq(0, 2, by = 0.2)), mapping = aes(x = lam)) +
stat_function(fun = nllv2, args = list(x = 0:10))
Both nllv and nllv2 give the same graph.
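The analytic form follows from the exponential density: log f(x; lam) = log(lam) - lam*x, so the negative log-likelihood is lam*sum(x) - n*log(lam). A quick numerical check that the two definitions agree, as a sketch on an arbitrary grid of lam values (the grid lams is an assumption, chosen to avoid lam = 0):

```r
# Negative log-likelihood via dexp, vectorized over lam.
negloglike <- function(lam, x) -sum(dexp(x, lam, log = TRUE))
nllv <- Vectorize(negloglike, "lam")

# Analytic form: lam * sum(x) - n * log(lam).
negloglike2 <- function(lam, x) lam * sum(x) - length(x) * log(lam)
nllv2 <- Vectorize(negloglike2, "lam")

x <- 0:10                          # same data as in the plot above
lams <- seq(0.1, 2, by = 0.1)      # arbitrary grid, excluding lam = 0
agree <- isTRUE(all.equal(nllv(lams, x = x), nllv2(lams, x = x)))
```

If agree is TRUE, the two functions coincide on the grid up to floating-point tolerance.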
BACKGROUND
My current plot looks like this:
PROBLEM
I want to force the regression line to start at 1 for station_1.
CODE
library(ggplot2)
#READ IN DATA
var_x = c(2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011)
var_y = c(1.000000,1.041355,1.053106,1.085738,1.126375,1.149899,1.210831,1.249480,1.286305,1.367923,1.486978,1.000000,0.9849343,0.9826141,0.9676000,0.9382975,0.9037476,0.8757748,0.8607960,0.8573634,0.8536138,0.8258877)
var_z = c('Station_1','Station_1','Station_1','Station_1','Station_1','Station_1','Station_1','Station_1','Station_1','Station_1','Station_1','Station_2','Station_2','Station_2','Station_2','Station_2','Station_2','Station_2','Station_2','Station_2','Station_2','Station_2')
df_data = data.frame(var_x,var_y,var_z)
out = ggplot(df_data,aes(x=var_x,y=var_y,group=var_z))
out = out + geom_line(aes(linetype=var_z),size=1)
out = out + theme_classic()
#SELECT DATA FOR Station_1
PFI_data=subset(df_data,var_z=="Station_1")
#PLOT REGRESSION FOR Station_1
out = out+ stat_smooth(data = PFI_data,
method=lm,
formula = y~x,
se=T,size = 1.4,colour = "blue",linetype=1)
Any help would be appreciated - this has been driving me crazy for too long!
First of all, you should be careful when forcing a regression line to some fixed point. Here's a link to a discussion why.
Now, from a technical perspective, I'm relying heavily on these questions and answers: one, two. The outline of my solution is the following: precompute the desired intercept, run a regression without it, add the intercept to the resulting prediction.
I'm using the internal ggplot2:::predictdf.default function to save some typing. The rbind(df, df) part may look strange, but it's a simple hack to make geom_smooth work properly, since there are two factor levels in var_z.
# Previous code should remain intact, replace the rest with this:
# SELECT DATA FOR Station_1
PFI_data=subset(df_data,var_z=="Station_1")
names(PFI_data) <- c("x", "y", "z")
x0 <- df_data[df_data$var_z == "Station_1", "var_x"][1]
y0 <- df_data[df_data$var_z == "Station_1", "var_y"][1]
model <- lm(I(y-y0) ~ I(x-x0) + 0, data = PFI_data)
xrange <- range(PFI_data$x)
xseq <- seq(from=xrange[1], to=xrange[2])
df <- ggplot2:::predictdf.default(model, xseq, se=T, level=0.95)
df <- rbind(df, df)
df[c("y", "ymin", "ymax")] <- df[c("y", "ymin", "ymax")] + y0
out + geom_smooth(aes_auto(df), data=df, stat="identity")
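The same idea can be sketched without internal or deprecated ggplot2 functions by calling predict() directly and drawing the band with geom_ribbon. This is an alternative under stated assumptions, not the method above: the data frame station is a rounded copy of the Station_1 values, and the names grid, pred and band are made up for illustration.

```r
library(ggplot2)

# Rounded Station_1 values from the question, for illustration.
station <- data.frame(
  x = 2001:2011,
  y = c(1.000, 1.041, 1.053, 1.086, 1.126, 1.150,
        1.211, 1.249, 1.286, 1.368, 1.487)
)
x0 <- station$x[1]
y0 <- station$y[1]

# Regression through (x0, y0): shift the data, fit without an intercept.
fit <- lm(I(y - y0) ~ I(x - x0) + 0, data = station)

# predict() applies the I() transformations to newdata itself.
grid <- data.frame(x = seq(min(station$x), max(station$x)))
pred <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
band <- data.frame(x = grid$x,
                   y = unname(pred[, "fit"]) + y0,     # shift back by y0
                   ymin = unname(pred[, "lwr"]) + y0,
                   ymax = unname(pred[, "upr"]) + y0)

# inherit.aes = FALSE keeps the extra layers independent of the base mapping.
p <- ggplot(station, aes(x, y)) +
  geom_line() +
  geom_ribbon(aes(x = x, ymin = ymin, ymax = ymax), data = band,
              inherit.aes = FALSE, alpha = 0.2) +
  geom_line(aes(x = x, y = y), data = band,
            inherit.aes = FALSE, colour = "blue")
```

Setting inherit.aes = FALSE means the band layers do not pick up a group mapping from the base plot, so the row-duplication hack is not needed in this variant.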