Passing through data frames into functions and into ggplot by column - r

I'm trying to write my first function in R. I have a data frame with an indeterminate number of columns, and I want to create a ggplot for each pair of columns: 1 and 2, 1 and 3, 1 and 4, etc.
However, when I try the following function I get the "object not found" error, but only when it reaches the ggplot portion.
Thanks,
BrandPlot=function(Brand){
NoCol=ncol(Brand)
count=2
while (count<=NoCol){
return(ggplot(Brand, aes(x=Brand[,1], y=Brand[,count]))+geom_point())
count=(count+1)
}
}
To clarify,
I'm trying to get the same effect as the following calls:
ggplot(Brand, aes(x=Brand[,1], y=Brand[,2]))+geom_point()
ggplot(Brand, aes(x=Brand[,1], y=Brand[,3]))+geom_point()
ggplot(Brand, aes(x=Brand[,1], y=Brand[,4]))+geom_point()
ggplot(Brand, aes(x=Brand[,1], y=Brand[,5]))+geom_point()
I also plan on adding additional things like geom_smooth(), but I want to get it working first.

Per the note above, something like this may be what you're looking for...
brandplot <- function(x){
require(reshape2)
require(ggplot2)
x_melt <- melt(x, id.vars = names(x)[1])
ggplot(x_melt,
aes_string(x = names(x_melt)[1],
y = 'value',
group = 'variable')) +
geom_point() +
facet_wrap( ~ variable)
}
dat <- data.frame(a = sample(1:10, 25, T),
b = sample(20:30, 25, T),
c = sample(40:50, 25, T))
brandplot(dat)

[Note: @maloneypatr's solution is a better way to use ggplot for your application.]
To answer your question directly, there are a couple of problems.
Your function returns during the first pass through the loop (i.e., when count=2), so you will never get more than one plot from it.
ggplot evaluates the arguments to aes(...) in the context of the data frame passed as data=..., so it is looking for something like Brand$Brand (i.e., a column named Brand in the data frame Brand). Since there is no such column, you get the "object not found" error.
The following code will generate a series of n-1 plots where n = ncol(Brand).
BrandPlot=function(Brand){
for (count in 2:ncol(Brand)){
ggp <- ggplot(Brand, aes_string(x=names(Brand)[1], y=names(Brand)[count]))
ggp <- ggp + geom_point()
ggp <- ggp + ggtitle(paste(names(Brand)[count], " vs. ", names(Brand)[1]))
plot(ggp)
}
}
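If you would rather keep the plots than draw them immediately, returning a list is one option. This is only a sketch: brand_plots is a hypothetical name, and it assumes ggplot2 3.0.0 or later, where the .data pronoun can be used inside aes() in place of aes_string().

```r
library(ggplot2)

# Hypothetical helper: one scatter plot per column, each plotted against column 1
brand_plots <- function(df) {
  x_name <- names(df)[1]
  y_names <- names(df)[-1]
  plots <- lapply(y_names, function(y_name) {
    # .data[[name]] looks the column up by its name (a string) in df
    ggplot(df, aes(x = .data[[x_name]], y = .data[[y_name]])) +
      geom_point() +
      labs(x = x_name, y = y_name, title = paste(y_name, "vs.", x_name))
  })
  names(plots) <- y_names
  plots
}

set.seed(1)
dat <- data.frame(a = sample(1:10, 25, TRUE),
                  b = sample(20:30, 25, TRUE),
                  c = sample(40:50, 25, TRUE))
plots <- brand_plots(dat)
```

Each element can then be printed on its own, e.g. print(plots[["b"]]), or passed to a layout function later.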

Related

R GGPLOT2 lapply and function not finding object?

I hope I can get a contextual clue as to what may be wrong here without providing the data frame (I can if necessary). Ultimately I want to use lapply to create multiple boxplots across multiple Ys with the same X, but I get the following error, even though Termed is definitely in my CMrecruitdat data.frame:
Error in aes_string(x = Termed, y = RecVar, fill = Termed) :
object 'Termed' not found
RecVar <- CMrecruitdat[,c("Req.Open.To.System.Entry", "Req.Open.To.Hire", "Tenure")]
BP <- function (RecVar){
require(ggplot2)
ggplot(CMrecruitdat, aes_string(x=Termed, y=RecVar, fill=Termed))+
geom_boxplot()+
guides(fill=false)
}
lapply(RecVar, FUN=BP)
If you use aes_string, you should pass strings rather than vectors and use strings for all your fields.
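RecVar <- CMrecruitdat[,c("Termed", "Req.Open.To.System.Entry", "Req.Open.To.Hire", "Tenure")]
# yvar is a column name (a string); RecVar is the data frame defined above
BP <- function (yvar){
require(ggplot2)
ggplot(RecVar, aes_string(x="Termed", y=yvar, fill="Termed"))+
geom_boxplot()+
guides(fill="none")
}
# skip Termed itself, since it is the x variable
lapply(setdiff(names(RecVar), "Termed"), FUN=BP)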

for loop with ggplots produces graphs with identical values but different headings

I have read lots of posts about using loops for ggplot to generate lots of graphs, but cannot find any that explain my problem...
I have a dataframe and am trying to loop over 92 columns, creating a new graph for each column. I want to save each plot as a separate object. When I run my loop (code below) and print the graphs, all the graphs are correct. However, when I replace the print() command with assign(), the graphs are not correct. The titles change as they should, but the plotted values are all identical (they are all the values for the final graph). I found this out because when I used plot_grid() to generate a figure of 10 plots, the graph titles and axis labels were all correct, but the values were identical!
My data set is large, so I have provided a small data set for illustration below.
Sample data frame:
library(ggplot2)
library(cowplot)
df <- as.data.frame(cbind(group=c(rep("A", 4), rep("B", 4)), a=sample(1:100, 8), b=sample(100:200, 8), c=sample(300:400, 8))) #make data frame
cols <- 2:4 #define columns for plots
for(i in 1:length(cols)){
df[,cols[i]] <- as.numeric(as.character(df[,cols[i]]))
} #convert columns to numeric
Plots:
for (i in 1:length(cols)){
g <- ggplot(df, aes(x=group, y=df[,cols[i]])) +
geom_boxplot() +
ggtitle(colnames(df)[cols[i]])
print(g)
assign(colnames(df)[cols[i]], g) #generate an object for each plot
}
plot_grid(a, b, c)
I am thinking that when ggplot makes the plot, it only renders the data from the final value of i? Or something like that? Is there a way around this?
I wish to do it like this, as there are a lot of graphs I wish to make and then I want to mix and match plots for figures.
Thanks!
I have cleaned up how you generated your sample data frame.
library(ggplot2)
library(cowplot)
df <- data.frame(group=c(rep("A", 4), rep("B", 4)),
a=sample(1:100, 8),
b=sample(100:200, 8),
c=sample(300:400, 8)) #make data frame
Just using data.frame() will suffice. This makes your code clearer and avoids the post-processing in your for loop that converts the columns back to numeric. Note that cbind() on mixed character and numeric vectors builds a character matrix, and as.data.frame() then turns those strings into factors unless you set stringsAsFactors = FALSE (which only became the default in R 4.0.0). The numeric-to-character conversion can be avoided by using cbind.data.frame() rather than cbind().
I have also refactored the for loop that generates your plots. You create a vector of integers called cols (cols <- 2:4) and then iterate over it to generate a plot from each column of data. This is unnecessary: we can put the range directly in the for statement, 'for (i in 2:ncol(df))', which iterates from 2 to 4 (the number of columns in your data frame). Starting from 2 skips column 1, which holds the group labels rather than values to plot. This is preferable because:
i) When reviewing your code the condition used is immediately apparent without searching through the rest of your code
ii) R has a number of functions/parameters similarly named to your variable 'cols' and it is best to avoid confusion.
With the code cleaned up we can now try to locate the cause of the bug:
library(ggplot2)
library(cowplot)
df <- data.frame(group=c(rep("A", 4), rep("B", 4)),
a=sample(1:100, 8),
b=sample(100:200, 8),
c=sample(300:400, 8)) #make data frame
for (i in 2:ncol(df)){
g <- ggplot(df, aes(x=group, y=df[,i])) +
geom_boxplot() +
ggtitle(colnames(df)[i])
print(g)
assign(colnames(df)[i], g) #generate an object for each plot
}
It's not immediately obvious why your code doesn't work. The suggestion by Imo has merit: saving your plots to a list would prevent your environment from getting cluttered with objects. However, it would not solve this bug. The cause is unintuitive: aes() stores the expression df[,i] unevaluated, and ggplot only evaluates it when the plot is drawn, so every saved plot looks up whatever value i holds at that moment (after the loop, its final value). See the answer provided here by Konrad Rudolph. The following should work and retains the style of your original code. As Konrad suggests in his answer, it might be more "R"-like to use lapply. Note that the loop body now runs inside local() and that i is re-defined there, so each plot keeps its own copy of i. Note also the use of <<- to assign g to the global environment.
for (i in 2:ncol(df))
local({
i <- i
g <<- ggplot(df, aes(x=group, y=df[,i])) +
geom_boxplot() +
ggtitle(colnames(df)[i])
print(i)
print(g)
assign(colnames(df)[i], g, pos =1) #generate an object for each plot
})
plot_grid(a, b, c)
You owe me a drink.
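To check the behaviour yourself, here is a small sketch (not from the original answers) that uses layer_data() to inspect the y values a stored plot actually draws. It assumes ggplot2 3.0.0 or later for the .data pronoun; the names plots, fixed, lazy_a and fixed_a are made up for the example, and geom_point() is used only to keep the built data easy to compare.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(group = rep(c("A", "B"), each = 4),
                 a = sample(1:100, 8),
                 b = sample(100:200, 8))

# Problematic pattern: aes() keeps the expression df[, i] and evaluates it
# only when the plot is built, after the loop has finished
plots <- list()
for (i in 2:ncol(df)) {
  plots[[colnames(df)[i]]] <- ggplot(df, aes(x = group, y = df[, i])) + geom_point()
}
lazy_a <- layer_data(plots[["a"]])$y   # built with the final i, so these are column b

# Referring to the column by name avoids the dependency on i
fixed <- lapply(colnames(df)[-1], function(col) {
  ggplot(df, aes(x = group, y = .data[[col]])) + geom_point()
})
names(fixed) <- colnames(df)[-1]
fixed_a <- layer_data(fixed[["a"]])$y  # these are column a
```

Comparing lazy_a and fixed_a against df$a and df$b shows which column each stored plot really uses.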
There are two standard ways to deal with this problem:
1- Work with a long-format data.frame
2- Use aes_string to refer to variable names in the wide format data.frame
Here's an illustration of possible strategies.
library(ggplot2)
library(gridExtra)
# data from other answer
df <- data.frame(group=c(rep("A", 4), rep("B", 4)),
a=sample(1:100, 8),
b=sample(100:200, 8),
c=sample(300:400, 8))
## first method: long format
m <- reshape2::melt(df, id = "group")
p <- ggplot(m, aes(x=group, y=value)) +
geom_boxplot()
pl <- plyr::dlply(m, "variable", function(.d) p %+% .d + ggtitle(unique(.d$variable)))
grid.arrange(grobs=pl)
## second method: keep wide format
one_plot <- function(col = "a") ggplot(df, aes_string(x="group", y=col)) + geom_boxplot() + ggtitle(col)
pl <- plyr::llply(colnames(df)[-1], one_plot)
grid.arrange(grobs=pl)
## third method: more explicit looping
pl <- vector("list", length = ncol(df)-1)
for(ii in seq_along(pl)){
.col <- colnames(df)[-1][ii]
.p <- ggplot(df, aes_string(x="group", y=.col)) + geom_boxplot() + ggtitle(.col)
pl[[ii]] <- .p
}
grid.arrange(grobs=pl)
Sometimes, when wrapping a ggplot call inside a function/for loop one faces issues with local variables (not the case here, if aes_string is used). In such cases one can define a local environment.
Note that using a construct like aes(y=df[,i]) may appear to work, but can produce very wrong results. Consider a facetted plot, the data.frame will be split into different groups for each panel, and this subsetting can fail miserably to group the right data if numeric values are passed directly to aes() instead of variable names.

Manually added legend not working in ggplot2?

Here's a facsimile of my data:
d1 <- data.frame(
e=rnorm(3000,10,10)
)
d2 <- data.frame(
e=rnorm(2000,30,30)
)
So, I got around the problem of plotting two different density distributions from two very different datasets on the same graph by doing this:
ggplot() +
geom_density(aes(x=e),fill="red",data=d1) +
geom_density(aes(x=e),fill="blue",data=d2)
But when I try to manually add a legend, like so:
ggplot() +
geom_density(aes(x=e),fill="red",data=d1) +
geom_density(aes(x=e),fill="blue",data=d2) +
scale_fill_manual(name="Data", values = c("XXXXX" = "red","YYYYY" = "blue"))
Nothing happens. Does anybody know what's going wrong? I thought I could actually manually add legends if need be.
Generally ggplot works best when your data is in a single data.frame and in long format. In your case we therefore want to combine the data from both data.frames. For this simple example, we just concatenate the data into a long variable called d and use an additional column id to indicate to which dataset that value belongs.
d.f <- data.frame(id = rep(c("XXXXX", "YYYYY"), c(3000, 2000)),
d = c(d1$e, d2$e))
More complex data manipulations can be done using packages such as reshape2 and tidyr. I often find this cheat sheet useful. Then when we plot, we map fill to id, and ggplot will take care of the legend automatically.
ggplot(d.f, aes(x = d, fill = id)) +
geom_density()
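If you would rather keep the two data frames separate, another common approach is to map fill to a constant label inside aes() in each layer. A legend is only drawn for mapped aesthetics, which is why fill="red" outside aes() produced none. A sketch using the question's simulated data (alpha is added here only so the overlap is visible):

```r
library(ggplot2)

set.seed(1)
d1 <- data.frame(e = rnorm(3000, 10, 10))
d2 <- data.frame(e = rnorm(2000, 30, 30))

p <- ggplot() +
  # fill is mapped to a label inside aes(), so the fill scale and legend apply
  geom_density(aes(x = e, fill = "XXXXX"), data = d1, alpha = 0.5) +
  geom_density(aes(x = e, fill = "YYYYY"), data = d2, alpha = 0.5) +
  scale_fill_manual(name = "Data", values = c("XXXXX" = "red", "YYYYY" = "blue"))
```

Printing p should show both densities with a legend titled "Data".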

Plot one numeric variable against n numeric variables in n plots

I have a huge data frame and I would like to make some plots to get an idea of the associations among different variables. I cannot use pairs(data), because that would give me 400+ plots. However, there's one response variable y I'm particularly interested in. Thus, I'd like to plot y against all variables, which would reduce the number of plots from n^2 to n. How can I do it?
EDIT: I add an example for the sake of clarity. Let's say I have the dataframe
foo=data.frame(x1=1:10,x2=seq(0.1,1,0.1),x3=-7:2,x4=runif(10,0,1))
and my response variable is x3. Then I'd like to generate four plots arranged in a row: x1 vs x3, x2 vs x3, a histogram of x3 and finally x4 vs x3. I know how to make each plot
plot(foo$x1,foo$x3)
plot(foo$x2,foo$x3)
hist(foo$x3)
plot(foo$x4,foo$x3)
However I have no idea how to arrange them in a row. Also, it would be great if there was a way to automatically make all the n plots, without having to call the command plot (or hist) each time. When n=4, it's not that big of an issue, but I usually deal with n=20+ variables, so it can be a drag.
You could use a combination of the reshape2, ggplot2 and gridExtra packages. This way you don't need to specify the number of plots. This code will work with any number of explanatory variables without modification:
foo <- data.frame(x1=1:10,x2=seq(0.1,1,0.1),x3=-7:2,x4=runif(10,0,1))
library(reshape2)
foo2 <- melt(foo, "x3")
library(ggplot2)
p1 <- ggplot(foo2, aes(value, x3)) + geom_point() + facet_grid(.~variable)
p2 <- ggplot(foo, aes(x = x3)) + geom_histogram()
library(gridExtra)
grid.arrange(p1, p2, ncol=2)
The tidyr package helps do this efficiently. Please refer here for more options.
library(tidyr)
library(ggplot2)
# data is your data frame and y_value is the name of its response column
data %>%
gather(-y_value, key = "some_var_name", value = "some_value_name") %>%
ggplot(aes(x = some_value_name, y = y_value)) +
geom_point() +
facet_wrap(~ some_var_name, scales = "free")
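In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). Here is a sketch of the same idea with the foo data frame from the question, assuming x3 is the response; the column names variable and value are arbitrary choices:

```r
library(tidyr)
library(ggplot2)

foo <- data.frame(x1 = 1:10, x2 = seq(0.1, 1, 0.1), x3 = -7:2, x4 = runif(10, 0, 1))

# Stack every column except the response x3 into name/value pairs
long <- pivot_longer(foo, cols = -x3, names_to = "variable", values_to = "value")

# One panel per explanatory variable, each with its own x range
p <- ggplot(long, aes(x = value, y = x3)) +
  geom_point() +
  facet_wrap(~ variable, scales = "free_x")
```

Printing p draws one scatter panel per explanatory variable against x3.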
If your goal is only to get an idea of the associations among different variables, you can also use:
plot(x3 ~ ., data = foo) # x3 is the response in the question's example
It is not as nice as using ggplot and it doesn't automatically put all the graphs in one window (although you can change that using par(mfrow = c(a, b))), but it is a quick way to get what you want.
I faced the same problem, and I don't have any experience of ggplot2, so I created a function using plot which takes the data frame, and the variables to be plotted as arguments and generate graphs.
dfplot <- function(data.frame, xvar, yvars=NULL)
{
df <- data.frame
if (is.null(yvars)) {
yvars = names(data.frame[which(names(data.frame)!=xvar)])
}
if (length(yvars) > 25) {
print("Warning: number of variables to be plotted exceeds 25, only first 25 will be plotted")
yvars = yvars[1:25]
}
#choose a format to display charts
ncharts <- length(yvars)
nrows = ceiling(sqrt(ncharts))
ncols = ceiling(ncharts/nrows)
par(mfrow = c(nrows,ncols))
for(i in 1:ncharts){
plot(df[,xvar],df[,yvars[i]],main=yvars[i], xlab = xvar, ylab = "")
}
}
Notes:
You can provide the list of variables to be plotted as yvars; otherwise it will plot all (or the first 25, whichever is less) of the variables in the data frame against xvar.
Margins were going out of bounds when the number of plots exceeded 25, so I kept a limit of 25 charts. Any suggestions to handle this nicely are welcome.
The y-axis labels are removed because the titles of the graphs take care of them. The x-axis label is set to xvar.

Plotting inside function: subset(df,id_==...) gives wrong plot, df[df$id_==...,] is right

I have a df with multiple y-series which I want to plot individually, so I wrote a fn that selects one particular series, assigns to a local variable dat, then plots it. However ggplot/geom_step when called inside the fn doesn't treat it properly like a single series. I don't see how this can be a scoping issue, since if dat wasn't visible, surely ggplot would fail?
You can verify the code is correct when executed from the toplevel environment, but not inside the function. This is not a duplicate question. I understand the problem (this is a recurring issue with ggplot), but I've read all the other answers; this is not a duplicate and they do not give the solution.
set.seed(1234)
require(ggplot2)
require(scales)
N = 10
df <- data.frame(x = 1:N,
id_ = c(rep(20,N), rep(25,N), rep(33,N)),
y = c(runif(N, 1.2e6, 2.9e6), runif(N, 5.8e5, 8.9e5) ,runif(N, 2.4e5, 3.3e5)),
row.names=NULL)
plot_series <- function(id_, envir=environment()) {
dat <- subset(df,id_==id_)
p <- ggplot(data=dat, mapping=aes(x,y), color='red') + geom_step()
# Unsuccessfully trying the approach from http://stackoverflow.com/questions/22287498/scoping-of-variables-in-aes-inside-a-function-in-ggplot
p$plot_env <- envir
plot(p)
# Displays wrongly whether we do the plot here inside fn, or return the object to parent environment
return(p)
}
# BAD: doesn't plot geom_step!
plot_series(20)
# GOOD! but what's causing the difference?
ggplot(data=subset(df,id_==20), mapping=aes(x,y), color='red') + geom_step()
#plot_series(25)
#plot_series(33)
This works fine:
plot_series <- function(id_) {
dat <- df[df$id_ == id_,]
p <- ggplot(data=dat, mapping=aes(x,y), color='red') + geom_step()
return(p)
}
print(plot_series(20))
If you simply step through the original function using debug, you'll quickly see that the subset line did not actually subset the data frame at all: it returned all rows!
Why? Because subset uses non-standard evaluation and you used the same name for both the column and the function argument. As jlhoward demonstrates in another answer, it would have worked (though it is probably not advisable) to simply use different names for the two.
The reason is that subset evaluates with the data frame first. So all it sees in the logical expression is the always true id_ == id_ within that data frame.
One way to think about it is to play dumb (like a computer) and ask yourself when presented with the condition id_ == id_ how do you know what exactly each symbol refers to. It's ambiguous, and subset makes a consistent choice: use what's in the data frame.
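A quick way to see this without any plotting is to count the rows each approach returns. The sketch below uses a small stand-in for the question's df; the two helper functions are hypothetical names used only for illustration.

```r
df <- data.frame(x = rep(1:10, 3),
                 id_ = rep(c(20, 25, 33), each = 10),
                 y = seq_len(30))

# subset() looks names up in the data frame first, so both id_ refer to the column
select_with_subset <- function(id_) subset(df, id_ == id_)

# With plain indexing, df$id_ is the column and the bare id_ is the function argument
select_with_index <- function(id_) df[df$id_ == id_, ]

n_subset <- nrow(select_with_subset(20))  # every row: the condition is always TRUE
n_index <- nrow(select_with_index(20))    # only the rows where id_ is 20
```

Here n_subset is 30 and n_index is 10.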
Notwithstanding the comments, this works:
plot_series <- function(z, envir=environment()) {
dat <- subset(df,id_==z)
p <- ggplot(data=dat, mapping=aes(x,y), color='red') + geom_step()
p$plot_env <- envir
plot(p)
# Displays wrongly whether we do the plot here inside fn, or return the object to parent environment
return(p)
}
plot_series(20)
The problem seems to be that subset interprets the id_ on the RHS of the == as the same column as the LHS, so this is equivalent to subsetting on TRUE, which of course includes all the rows of df. That's the plot you are seeing.
