ggplot2: adding lines in a loop and retaining colour mappings - r

When running the following two pieces of code, I unexpectedly get different results. I need to add lines in a loop as in EX2, but all lines end up having the same colour. Why is this?
EX1
economics2 <- economics
economics2$unemploy <- economics$unemploy + 1000
economics3 <- economics
economics3$unemploy <- economics$unemploy + 2000
economics4 <- economics
economics4$unemploy <- economics$unemploy + 3000
b <- ggplot() +
geom_line(aes(x = date, y = unemploy, colour = as.character(1)), data=economics2) +
geom_line(aes(x = date, y = unemploy, colour = as.character(2)), data=economics3) +
geom_line(aes(x = date, y = unemploy, colour = as.character(3)), data=economics4)
print(b)
EX2
#economics2, economics3, economics4 are reused from EX1.
b <- ggplot()
econ <- list(economics2, economics3, economics4)
for(i in 1:3){
b <- b + geom_line(aes(x = date, y = unemploy, colour = as.character(i)), data=econ[[i]])
}
print(b)

This is not a good way to use ggplot. Try this way:
econ <- list(e1=economics2, e2=economics3, e3=economics4)
df <- cbind(cat=rep(names(econ),sapply(econ,nrow)),do.call(rbind,econ))
ggplot(df, aes(date,unemploy, color=cat)) + geom_line()
This puts your three versions of economics into a single data.frame, in long format (all the data in 1 column, with a second column, cat in this example, identifying the source). Once you've done that, ggplot takes care of everything else. No loops.
The specific reason your loop failed, as pointed out in the comment, is that using aes(...) stores the expression in the ggplot object, and that expression is evaluated when you call print(...). At that point i is 3.
Note that this does not apply to the data=... argument, so you could have done something like this:
b=ggplot()
for(i in 1:3){
b <- b + geom_line(aes(x=date,y=unemploy,colour=cat),
data=cbind(cat=as.character(i),econ[[i]]))
}
print(b)
But, this is still the wrong way to use ggplot.
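The delayed evaluation described above can be checked directly. The sketch below is only an illustration. It assumes ggplot2 3.0.0 or later (where aes() captures quosures and supports !! unquoting) and the rlang package that ggplot2 depends on. The names lazy_map and forced_map are made up for this example.

```r
library(ggplot2)

i <- 1
lazy_map   <- aes(colour = as.character(i))    # stores the expression, not "1"
forced_map <- aes(colour = as.character(!!i))  # !! inlines the current value of i
i <- 3

# Evaluate the stored expressions later, as rendering would do
lazy_value   <- rlang::eval_tidy(lazy_map$colour)    # looks up i at this point
forced_value <- rlang::eval_tidy(forced_map$colour)  # uses the inlined value
```

Here lazy_value follows the later change to i, while forced_value keeps the value i had when aes() was called. Writing colour = as.character(!!i) inside the EX2 loop should therefore give each layer its own label, although the long-format approach remains the more idiomatic choice.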

Related

Why does R behave differently when parsing parameters of plotting?

I am attempting to plot multiple time series variables on a single line chart using ggplot. I am using a data.frame which contains n time series variables, and a column of time periods. Essentially, I want to loop through the data.frame, and add exactly n geom_line layers to a single chart.
Initially I tried using the following code, where:
df = data.frame containing n time series variables, and 1 column of time periods
wid = n (number of time series variables)
p <- ggplot() +
scale_color_manual(values=c(colours[1:wid]))
for (i in 1:wid) {
p <- p + geom_line(aes(x=df$Time, y=df[,i], color=var.lab[i]))
}
ggplotly(p)
However, this only produces a plot of the final time series variable in the data.frame. I then investigated further, and found that the following two sets of code produce completely different results:
p <- ggplot() +
scale_color_manual(values=c(colours[1:wid]))
i = 1
p = p + geom_line(aes(x=df$Time, y=df[,i], color=var.lab[i]))
i = 2
p = p + geom_line(aes(x=df$Time, y=df[,i], color=var.lab[i]))
i = 3
p = p + geom_line(aes(x=df$Time, y=df[,i], color=var.lab[i]))
ggplotly(p)
Plot produced by code above
p <- ggplot() +
scale_color_manual(values=c(colours[1:wid]))
p = p + geom_line(aes(x=df$Time, y=df[,1], color=var.lab[1]))
p = p + geom_line(aes(x=df$Time, y=df[,2], color=var.lab[2]))
p = p + geom_line(aes(x=df$Time, y=df[,3], color=var.lab[3]))
ggplotly(p)
Plot produced by code above
In my mind, these two sets of code are identical, so could anyone explain why they produce such different results?
I know this could probably be done quite easily using autoplot, but I am more interested in the behavior of these two snippets of code.
What you're trying to do, adding one layer per column in a loop, is a 'hack' and not ideal in ggplot terms. To do it successfully, I'd use aes_string. But it's a hack.
df <- data.frame(Time = 1:20,
Var1 = rnorm(20),
Var2 = rnorm(20, mean = 0.5),
Var3 = rnorm(20, mean = 0.8))
vars <- paste0("Var", 1:3)
col_vec <- RColorBrewer::brewer.pal(3, "Accent")
library(ggplot2)
p <- ggplot(df, aes(Time))
for (i in seq_along(vars)) {
p <- p + geom_line(aes_string(y = vars[i]), color = col_vec[i], lwd = 1)
}
p + labs(y = "value")
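aes_string() has been soft-deprecated in more recent ggplot2 releases. A rough sketch of the same one-layer-per-column idea using the .data pronoun is shown below, assuming ggplot2 3.0.0 or later. Building the layers inside a function (here via lapply) gives each layer its own copy of v, which sidesteps the delayed-evaluation problem of a plain for loop. The names layers and v are illustrative only.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(Time = 1:20,
                 Var1 = rnorm(20),
                 Var2 = rnorm(20, mean = 0.5),
                 Var3 = rnorm(20, mean = 0.8))
vars <- paste0("Var", 1:3)

# One geom_line() per column; v is local to each function call,
# so every layer keeps its own column name
layers <- lapply(vars, function(v) {
  geom_line(aes(y = .data[[v]], colour = v))
})

p <- ggplot(df, aes(Time)) + layers + labs(y = "value", colour = "var")
```

This is still the "hack" route; it only replaces aes_string with a currently supported equivalent.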
How to do it properly
To build this plot properly, you need to pivot the data first, so that each aesthetic (aes) is mapped to a variable in your data frame. That means we need a single variable in our data frame to map to color. Hence, we pivot_longer and plot again:
library(tidyr)
df_melt <- pivot_longer(df, cols = Var1:Var3, names_to = "var")
ggplot(df_melt, aes(Time, value, color = var)) +
geom_line(lwd = 1) +
scale_color_manual(values = col_vec)

A ggplot2 problem: pic1 looks fine, then pic2 looks fine, but pic1 is broken when I view it again

Recently I ran into a puzzling ggplot2 problem. I build a first plot, "pic1", and it renders fine. I then build a second plot, "pic2", which also renders fine. But when I go back and display "pic1" again, its regression line has turned into a vertical line. For example:
"pic1"
p <- ggplot()
p <- p + geom_line(data = MyData, aes(x = otherCrop, y = eta ))
p <- p+ geom_point(data = dat,aes(x =otherCrop,
y = dat$sumEnemies, colour = YEAR ),position = position_jitter(width = .01),size = 1)
p <- p+labs(colour = "年份\nYear") + theme_classic(base_size=18) +
theme(axis.title.x=element_text( vjust=0))
p=p + theme(text=element_text(family="Times", size=18))
pic1=p
"pic2"
p <- ggplot()
p <- p + geom_line(data = MyData, aes(x = SHDI, y = eta ))
p <- p+ geom_point(data = dat,aes(x = dat$SHDI,
y = eta,colour = YEAR ),position = position_jitter(width = .01),size = 1)
p <- p+labs(colour = "年份\nYear") + theme_classic(base_size=18) +
theme(axis.title.x=element_text( vjust=0))
p=p + theme(text=element_text(family="Times", size=18))
pic2=p
When I then display "pic1" again, the regression line has become a strange, short vertical line. This is a problem because I cannot put both plots in the same paper. Does anybody know what is going wrong?
I think this is a great example of why using the dataframe$column syntax inside an aes call is discouraged: it makes your plot vulnerable to subsequent changes in your data. Here's a simple example. Start with a data frame with columns x and y:
library(ggplot2)
df <- data.frame(x = 1:10, y = 1:10)
Now make a ggplot, but instead of using aes(x = x, y = y), we make the mistake of doing aes(x = df$x, y = df$y):
vulnerable_plot <- ggplot()
vulnerable_plot <- vulnerable_plot + geom_line(data = df, aes(x = df$x, y = df$y))
pic1 <- vulnerable_plot
Now we review our plot. Sure, ggplot nags us to say we shouldn't use this syntax, but the plot looks fine, so who cares, right?
pic1
#> Warning: Use of `df$x` is discouraged. Use `x` instead.
#> Warning: Use of `df$y` is discouraged. Use `y` instead.
Now, let's make pic2 identical to pic1 except we use the correct syntax:
invulnerable_plot <- ggplot()
invulnerable_plot <- invulnerable_plot + geom_line(data = df, aes(x = x, y = y))
pic2 <- invulnerable_plot
Now we don't get any warning, but the plot looks the same.
pic2
So there's no difference between pic1 and pic2. Or is there? What happens when we change our data frame?
df$y <- 10:1
vulnerable_plot
Oh dear. Our first plot has changed because the plot object has a reference to an external variable that it relies on to build the plot. That's not what we wanted.
However, with the version where we used the correct syntax, a copy of the data was taken and is kept with the plot data, so it remains unaffected by subsequent changes to df:
invulnerable_plot
Created on 2020-08-23 by the reprex package (v0.3.0)
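The same comparison can be made without looking at the rendered plots, by inspecting the data each plot would draw. This is a minimal sketch that assumes layer_data() from ggplot2 is an acceptable way to peek at the computed layer data. The names bad, good, bad_y and good_y are invented for the illustration.

```r
library(ggplot2)

df <- data.frame(x = 1:10, y = 1:10)

bad  <- ggplot() + geom_line(data = df, aes(x = df$x, y = df$y))  # refers to the outside df
good <- ggplot() + geom_line(data = df, aes(x = x, y = y))        # refers to columns of the stored data

df$y <- 10:1  # change the data frame after both plots were defined

# Data each plot would actually draw (the df$ form triggers warnings)
bad_y  <- suppressWarnings(layer_data(bad, 1)$y)
good_y <- layer_data(good, 1)$y
```

After the change to df, bad_y reflects the new values, while good_y still holds the values that were present when the plot was created.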

ggplot2 multiplot using changing variables

I am trying to create multiple plots with ggplot2 and then combine them using multiplot. However, when I try to create X graphs I end up with X copies of the same graph.
My problem code pretty much boils down to this, assuming df is the data frame:
library(ggplot2)
i = 1
j = 2
xVar = df[[i]]
yVar = df[[j]]
plot1 = ggplot(data = df, aes(xVar, yVar)) + geom_point(shape=1)
i = 1
j = 3
xVar = df[[i]]
yVar = df[[j]]
plot2 = ggplot(data = df, aes(xVar, yVar)) + geom_point(shape=1)
multiplot(plot1,plot2, cols=2)
At this point plot1 is identical to plot2, and I don't understand why.
My full code if interested:
n = 1
columns = colnames(df)
plots = list()
for(i in 3:7)
{
for(j in (i+1):7)
{
if(j < 8 & i < 7) {
xVar = df[[i]]
yVar = df[[j]]
plots[[n]] = ggplot(data = df, aes(x=xVar, y=yVar)) +
geom_point(shape=1) +
labs(x=columns[[i]], y=columns[[j]]) +
theme(axis.title=element_text(size=8))
n = n + 1
}
}
}
multiplot(plotlist = plots, cols=3)
There are lots of things going on here.
First, it is a really, really, really bad idea to use external variables in calls to aes(...). The arguments to aes(...) are evaluated in the context of the data=... argument, so in the context of df in your case. If a name is not found there, it is looked up in the environment the plot was created in (the global environment in your case). So it is highly preferable to do something like this:
gg <- data.frame(x=df[[i]],y=df[[j]])
plots[[n]] = ggplot(data = gg, aes(x,y)) +...
Second, ggplot stores the expressions from aes(...) and evaluates them when the plot is rendered (so, during the call to multiplot(...)). All of your plots use variables named xVar and yVar in aes(...). So when these plots are rendered, ggplot uses whatever is stored in those variables at the time - presumably from the last plot definition. That's why all your plots look like the last one. This is the reference to "lazy evaluation" in the other answer.
On the other hand, ggplot evaluates the data=... argument immediately and stores the dataset as part of the plot definition (in the plot object itself). So creating a different data frame (called gg above) for each plot will work.
Finally, it looks like you are trying to create a pairs plot (every column vs. every other column, more or less). Unless this is a homework assignment, there are much easier ways to do this. You could use ggpairs(...) in the GGally package (which uses grid graphics), or you could do it this way using basic ggplot with facets:
# make up some data
set.seed(1) # for reproducible example
df <- data.frame(matrix(rnorm(700),ncol=7))
df[4] <- 1+2*df[3] + rnorm(100)
df[5] <- 3*df[3] - 2*df[4] + rnorm(100)
df[6] <- -10*df[5] + rnorm(100)
# you start here...
gg.pairs <- function(data) { # scatterplot matrix using ggplot facets
require(ggplot2)
require(data.table)
require(reshape2) # for melt(...)
DT <- data.table(melt(cbind(id=1:nrow(data),data),id="id"),key="id")
gg <- DT[DT,allow.cartesian=TRUE]
setnames(gg,c("id","H","x","V","y"))
ggplot(gg[as.integer(gg$H)<as.integer(gg$V),], aes(x,y)) +
geom_point(shape=1) +
facet_grid(V~H, scales="free")
}
gg.pairs(df[3:7])
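For completeness, the "one data frame per plot" advice from earlier in this answer could be applied to the original pairwise loop roughly as follows. This is a sketch on made-up data, not the asker's df. It uses combn() to enumerate the column pairs, and the names pairs and plots are illustrative.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(matrix(rnorm(70), ncol = 7))  # made-up stand-in data

# All pairs (i, j) with 3 <= i < j <= 7, as in the nested loops of the question
pairs <- combn(3:7, 2, simplify = FALSE)

plots <- lapply(pairs, function(ij) {
  # A fresh data frame per plot is stored with the plot immediately,
  # so later iterations cannot change it
  gg <- data.frame(x = df[[ij[1]]], y = df[[ij[2]]])
  ggplot(gg, aes(x, y)) +
    geom_point(shape = 1) +
    labs(x = names(df)[ij[1]], y = names(df)[ij[2]]) +
    theme(axis.title = element_text(size = 8))
})
```

The resulting list can then be passed on as multiplot(plotlist = plots, cols = 3), as in the question.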
I think your problem comes from R's lazy evaluation. The aesthetics of plot1 and plot2 are not evaluated when you assign the plots but when you display them, and at that moment there is only one copy (the last one) of xVar and yVar, so the plots are the same.
Well, I can't explain what is happening, but a workaround is to use column names instead of columns, with aes_string. The following makes two unique plots in multiplot for me, and this change could easily be incorporated into your plot loop.
dat = data.frame(x = rnorm(10), y1 = rnorm(10), y2 = rpois(10, 5))
xVar = names(dat)[1]
yVar = names(dat)[2]
plot1 = ggplot(data = dat, aes_string(xVar, yVar)) + geom_point(shape=1)
yVar = names(dat)[3]
plot2 = ggplot(data = dat, aes_string(xVar, yVar)) + geom_point(shape=1)
multiplot(plot1, plot2, cols=2)

Data driven plot names in data.table

This is a personal project to learn the syntax of the data.table package. I am trying to use the data values to create multiple graphs and label each based on the by group value. For example, given the following data:
# Generate dummy data
require(data.table)
set.seed(222)
DT = data.table(grp=rep(c("a","b","c"),each=10),
x = rnorm(30, mean=5, sd=1),
y = rnorm(30, mean=8, sd=1))
setkey(DT, grp)
The data consists of random x and y values for 3 groups (a, b, and c). I can create a formatted plot of all values with the following code:
# Example of plotting all groups in one plot
require(ggplot2)
p <- ggplot(data=DT, aes(x = x, y = y)) +
aes(shape = factor(grp))+
geom_point(aes(colour = factor(grp), shape = factor(grp)), size = 3) +
labs(title = "Group: ALL")
p
This creates the following plot:
Instead I would like to create a separate plot for each by group, and change the plot title from “Group: ALL” to “Group: a”, “Group: b”, “Group: c”, etc. The documentation for data.table says:
.BY is a list containing a length 1 vector for each item in by. This can be useful when by is not known in advance. The by variables are also available to j directly by name; useful for example for titles of graphs if j is a plot command, or to branch with if()
That being said, I do not understand how to use .BY or .SD to create separate plots for each group. Your help is appreciated.
Here is the data.table solution, though again, not what I would recommend:
make_plot <- function(dat, grp.name) {
print(
ggplot(dat, aes(x=x, y=y)) +
geom_point() + labs(title=paste0("Group: ", grp.name$grp))
)
NULL
}
DT[, make_plot(.SD, .BY), by=grp]
What you really should do for this particular application is what @dmartin recommends. At least, that's what I would do.
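If the goal is to keep the per-group plots as objects rather than print them as a side effect, one possible variation is to split the table first and build one plot per piece. The sketch below assumes the by argument of data.table's split() method. The names by_grp and plots are made up here.

```r
library(data.table)
library(ggplot2)

set.seed(222)
DT <- data.table(grp = rep(c("a", "b", "c"), each = 10),
                 x = rnorm(30, mean = 5, sd = 1),
                 y = rnorm(30, mean = 8, sd = 1))

# One data.table per group, named after the group value
by_grp <- split(DT, by = "grp")

# Build (but do not print) one titled plot per group
plots <- Map(function(dat, g) {
  ggplot(dat, aes(x = x, y = y)) +
    geom_point() +
    labs(title = paste0("Group: ", g))
}, by_grp, names(by_grp))
```

Each element of plots can then be printed or saved individually, for example print(plots$a).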
Instead of using data.table, you could use facet_grid in ggplot with the labeller argument:
p <- ggplot(data=DT, aes(x = x, y = y)) + aes(shape = factor(grp)) +
geom_point(aes(colour = factor(grp), shape = factor(grp)), size = 3) +
facet_grid(. ~ grp, labeller = label_both)
See the ggplot documentation for more information.
I see you already have a "facetting" option. I had done this
p+facet_wrap('grp')
But this gives the same result:
p+facet_wrap(~grp)

ggplot2-line plotting with TIME series and multi-spline

This question's theme is simple but drives me crazy:
1. how to use melt()
2. how to deal with multi-lines in single one image?
Here is my raw data:
a 4.17125 41.33875 29.674375 8.551875 5.5
b 4.101875 29.49875 50.191875 13.780625 4.90375
c 3.1575 29.621875 78.411875 25.174375 7.8012
Q1:
From the post "Plotting two variables as lines using ggplot2 on the same graph" I learned how to draw multiple lines for multiple variables, like this:
The following code produces the plot above. However, the x-axis is really a time series.
df <- read.delim("~/Desktop/df.b", header=F)
colnames(df)<-c("sample",0,15,30,60,120)
df2<-melt(df,id="sample")
ggplot(data = df2, aes(x=variable, y= value, group = sample, colour=sample)) + geom_line() + geom_point()
I want 0, 15, 30, 60 and 120 to be treated as real numbers on a time axis rather than as category names. I also tried the following, but it failed:
row.names(df)<-df$sample
df<-df[,-1]
df<-as.matrix(df)
df2 <- data.frame(sample = factor(rep(row.names(df),each=5)), Time = factor(rep(c(0,15,30,60,120),3)),Values = c(df[1,],df[2,],df[3,]))
ggplot(data = df2, aes(x=Time, y= Values, group = sample, colour=sample)) +
geom_line() +
geom_point()
Loooooooooking forward to your help.
Q2:
I've learnt that the following script adds a spline() curve for a single line. How can I apply spline() to all three lines in a single plot?
n <-10
d <- data.frame(x =1:n, y = rnorm(n))
ggplot(d,aes(x,y))+ geom_point()+geom_line(data=data.frame(spline(d, n=n*10)))
Your variable column is a factor (you can verify by calling str(df2)). Just convert it back to numeric:
df2$variable <- as.numeric(as.character(df2$variable))
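The detour through as.character() matters because as.numeric() on a factor returns the internal level codes, not the labels. A small stand-alone illustration (the vector f is made up to mimic the melted column):

```r
# A factor whose labels look like numbers, as produced by melt() on column names
f <- factor(c("0", "15", "30", "60", "120"),
            levels = c("0", "15", "30", "60", "120"))

codes  <- as.numeric(f)                # level codes: positions within levels(f)
values <- as.numeric(as.character(f))  # the numbers the labels spell out
```

Here codes is 1 to 5, whereas values holds the actual time points 0, 15, 30, 60 and 120.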
For your other question, you might want to stick with using geom_smooth or stat_smooth, something like this:
p <- ggplot(data = df2, aes(x=variable, y= value, group = sample, colour=sample)) +
geom_line() +
geom_point()
library(splines)
p + geom_smooth(aes(group = sample),method = "lm",formula = y~bs(x),se = FALSE)
which gives one smoothed curve per sample.
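As an alternative that stays closer to the spline() call in the question, the interpolation can be computed per sample and drawn with geom_line(). This is only a sketch: the data frame is typed in from the raw data in the question, and the names spl and p_spline as well as the choice of n = 50 points per curve are arbitrary.

```r
library(ggplot2)

# Long-format data typed in from the question's raw table
df2 <- data.frame(
  sample   = rep(c("a", "b", "c"), each = 5),
  variable = rep(c(0, 15, 30, 60, 120), times = 3),
  value    = c(4.17125, 41.33875, 29.674375, 8.551875, 5.5,
               4.101875, 29.49875, 50.191875, 13.780625, 4.90375,
               3.1575, 29.621875, 78.411875, 25.174375, 7.8012)
)

# Interpolating spline for each sample, evaluated at 50 points
spl <- do.call(rbind, lapply(split(df2, df2$sample), function(d) {
  s <- spline(d$variable, d$value, n = 50)
  data.frame(sample = d$sample[1], variable = s$x, value = s$y)
}))

p_spline <- ggplot(df2, aes(variable, value, colour = sample)) +
  geom_point() +          # original measurements
  geom_line(data = spl)   # one spline curve per sample
```

Unlike the geom_smooth() fit above, these curves pass exactly through the measured points.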
