This is the basic example given in the iNEXT package:
library(iNEXT)
data(spider)
# multiple abundance-based data with multiple order q
z <- iNEXT(spider, q=c(0,1,2), datatype="abundance")
p1 <- ggiNEXT(z, facet.var="site", color.var="order")
In my dataset, I have more samples and the facetting does not work so well, so I want to change the ncol/nrow arguments in the facet_wrap/facet_grid call inside the object "p1". p1 is a ggplot object, so it can be altered (e.g. p1 + xlab("") removes the x-axis title).
In general, it would be nice to know how ggiNEXT() can be decomposed into single lines, and which objects are used in the data arguments, so I can change the order of the samples and reduce the number of samples used per plot. I wasn't able to find that out by looking at the function itself, and I get "Error: ggplot2 doesn't know how to deal with data of class iNEXT" when I try to follow ggiNEXT() step by step.
You could use facet_wrap(~site, ncol=3) to tune your plot. Take the following simple example:
library(iNEXT)
library(ggplot2)
set.seed(123)
p <- 1/1:sample(1:50, 1)
p <- p/sum(p)
dat <- as.data.frame(rmultinom(9, 200, p))
z <- iNEXT(dat, q=c(0,1,2))
p1 <- ggiNEXT(z, facet.var="site", color.var="order")
p1 + facet_wrap(~site, ncol=3)
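To see why adding facet_wrap() to an already facetted plot works, here is a minimal sketch that does not need iNEXT at all. The mpg data and the names p1, p2, n_cols and n_rows are only stand-ins for illustration; the assumption is that the object returned by ggiNEXT() behaves like any other facetted ggplot.

```r
library(ggplot2)

# Stand-in for the ggiNEXT() result: any ggplot that is already facetted.
p1 <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class)            # layout chosen by ggplot2

# Adding a second facet specification replaces the first one.
p2 <- p1 + facet_wrap(~class, ncol = 2)

# The built plot records where every panel ends up (ROW / COL per panel).
panel_layout <- ggplot_build(p2)$layout$layout
n_cols <- max(panel_layout$COL)
n_rows <- max(panel_layout$ROW)
```

mpg has seven classes, so with ncol = 2 the panels are spread over four rows.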
I have the following loop to produce several histograms based off certain columns (columns 2 to 5) in a larger dataset (df):
loop.vector <- 2:5
for (i in loop.vector){
x <- df[,i]
print(ggplot(df,aes(x=x)) + geom_histogram(binwidth=1)+scale_x_continuous(breaks=seq(0,max(x),1)))
}
I'd like to have my y-axis scale done automatically as I have for the x-axis, where it ranges between zero and whatever the maximum frequency value is, at increments of 1.
I know how to set these values manually (plot, take a look, and enter the maximum y-axis value by hand), but I'd like to do this automatically within the loop.
Thanks!
Answering the question: how to access max counts for a histogram plot?
The information you're missing on each plot in order to create your scale_y_continuous command is the maximum number of counts. There is a nice way to access this information once you have created a ggplot object, which is to use the built-in ggplot_build() function from ggplot2. For a given plot, myPlot, the following will give you a list of dataframes that are used for each layer in your plot:
ggplot_build(myPlot)$data
In the case of your example, you can access the count column of the first data frame (since you only have one histogram geom layer). Here's how you can write the loop to do what you need. I'll use an example dataset that can show you the results. Note that I've also changed your scale_x_continuous line to accommodate positive and negative numbers by using a combination of min(), max(), ceiling() and floor():
set.seed(1234)
df <- data.frame(
y1=rnorm(100,10,1),
y2=rnorm(100,12,3),
y3=rnorm(100,5,4),
y4=rnorm(100,13,5))
for (i in 1:ncol(df)) {
p <- ggplot(df, aes(df[,i])) +
geom_histogram(alpha=0.5, color='black', fill='red', binwidth=1) +
scale_x_continuous(breaks=seq(floor(min(df[,i])),ceiling(max(df[,i])))) +
ggtitle(names(df)[i])
# get max counts
max_count <- max(ggplot_build(p)$data[[1]]$count)
p <- p + scale_y_continuous(breaks=seq(0,max_count,1))
print(p)
}
Is there a better way?
While that gets you what you need, it's typically hard to deal with multiple plots sent to your graphics device one after another. I would recommend wrapping the above code in a function, calling it with lapply(), and using something like plot_grid() from cowplot to display the output. This approach is shown in the code below:
myPlots <- function(data, column, fill_color = 'red') {
# column = character name of column
# fill_color = histogram fill (defaults to the 'red' used above)
p <- ggplot(data, aes_string(x=column)) +
geom_histogram(fill=fill_color, binwidth=1, alpha=0.5, color='black') +
scale_x_continuous(breaks=seq(floor(min(data[[column]])), ceiling(max(data[[column]])),1)) +
ggtitle(column)
# get max counts from the built plot
max_count <- max(ggplot_build(p)$data[[1]]$count)
p <- p + scale_y_continuous(breaks=seq(0,max_count,1))
return(p)
}
library(cowplot)
plotList <- lapply(names(df), myPlots, data=df)
plot_grid(plotlist = plotList)
Figured it out - my values are integers, so what ended up working was a variation on Duck's response. See below:
loop.vector <- 2:5
for (i in loop.vector){
x <- df[,i]
print(ggplot(df,aes(x=x)) + geom_histogram(binwidth=1)+scale_x_continuous(breaks=seq(0,max((x),1)))+scale_y_continuous(breaks=seq(0,max(table(x)),1)))
}
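For integer data with binwidth=1, the two ways of getting the maximum count (ggplot_build() and table()) should agree, because each integer falls into its own bin. A small sketch to check this; the data frame counts_df and its values are made up for the illustration:

```r
library(ggplot2)

set.seed(42)
# Hypothetical integer-valued column, similar in spirit to the asker's data.
counts_df <- data.frame(x = sample(0:10, 200, replace = TRUE))

p <- ggplot(counts_df, aes(x = x)) +
  geom_histogram(binwidth = 1)

# Maximum bar height as computed by ggplot2 itself ...
max_from_plot <- max(ggplot_build(p)$data[[1]]$count)
# ... and as computed directly from the data.
max_from_table <- max(table(counts_df$x))

# Either value can drive the y-axis breaks.
p <- p + scale_y_continuous(breaks = seq(0, max_from_table, 1))
```

For non-integer data the table() shortcut no longer matches the bins, so ggplot_build() is the safer choice there.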
I am creating a scatter plot matrix using GGally::ggpairs. I am using a custom function (below called my_fn) to create the bottom-left non-diagonal subplots. In the process of calling that custom function, there is information about each of these subplots that is calculated, and that I would like to store for later.
In the example below, each h@cID is an int[] structure with 100 values. In total, it is created 10 times in my_fn (once for each of the 10 bottom-left non-diagonal subplots). I am trying to store all 10 of these h@cID structures in the listCID list object.
I have not had success with this approach, and I have tried a few other variants (such as trying to put listCID as an input parameter to my_fn, or trying to return it in the end).
Is it possible for me to store the ten h@cID structures efficiently through my_fn to be used later? I feel there are several syntax issues that I am not entirely familiar with that may explain why I am stuck, and likewise I would be happy to change the title of this question if I am not using appropriate terminology. Thank you!
library(hexbin)
library(GGally)
library(ggplot2)
set.seed(1)
bindata <- data.frame(
ID = paste0("ID", 1:100),
A = rnorm(100), B = rnorm(100), C = rnorm(100),
D = rnorm(100), E = rnorm(100))
bindata$ID <- as.character(bindata$ID)
maxVal <- max(abs(bindata[ ,2:6]))
maxRange <- c(-1 * maxVal, maxVal)
listCID <- c()
my_fn <- function(data, mapping, ...){
x <- data[ ,c(as.character(mapping$x))]
y <- data[ ,c(as.character(mapping$y))]
h <- hexbin(x=x, y=y, xbins=5, shape=1, IDs=TRUE,
xbnds=maxRange, ybnds=maxRange)
hexdf <- data.frame(hcell2xy(h), hexID=h@cell, counts=h@count)
listCID <- c(listCID, h@cID)
print(listCID)
p <- ggplot(hexdf, aes(x=x, y=y, fill=counts, hexID=hexID)) +
geom_hex(stat="identity")
p
}
p <- ggpairs(bindata[ ,2:6], lower=list(continuous=my_fn))
p
If I understand your problem correctly this is quite easily, albeit inelegantly, achieved using the <<- operator.
With it you may assign something like a global variable from inside the scope of your function.
Set listCID <- NULL before executing the function and use listCID <<- c(listCID, h@cID) inside the function.
listCID = NULL
my_fn <- function(data, mapping, ...){
x = data[,c(as.character(mapping$x))]
y = data[,c(as.character(mapping$y))]
h <- hexbin(x=x, y=y, xbins=5, shape=1, IDs=TRUE, xbnds=maxRange, ybnds=maxRange)
hexdf <- data.frame(hcell2xy(h), hexID = h@cell, counts = h@count)
if(exists("listCID")) listCID <<- c(listCID, h@cID)
print(listCID)
p <- ggplot(hexdf, aes(x=x, y=y, fill = counts, hexID=hexID)) + geom_hex(stat="identity")
p
}
For more on scoping, please refer to Hadley's excellent Advanced R: http://adv-r.had.co.nz/Environments.html
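The difference between <- and <<- inside a function can be seen without hexbin or ggpairs. A minimal base R sketch; the names collected, local_append and global_append are invented for this illustration:

```r
collected <- NULL

# `<-` creates a new local variable; the outer `collected` is untouched.
local_append <- function(v) {
  collected <- c(collected, v)
  invisible(NULL)
}

# `<<-` walks up the enclosing environments and modifies the existing variable.
global_append <- function(v) {
  collected <<- c(collected, v)
  invisible(NULL)
}

local_append(1:3)
after_local <- collected      # still NULL

global_append(1:3)
global_append(4:6)
after_global <- collected     # now holds all six values
```

This is the reason the original my_fn printed a growing listCID inside each call but left the outer listCID empty.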
In general it is not a good practice to try to return two different results with one function. In your case, you want to return the plot and the result of a calculation (the hexbin cIDs).
Better would be to calculate your results in steps. Each step would be a separate function. The result of the first function (calculating the hexbins) can then be used as an input for multiple follow-up functions (finding the cIDs and creating the plot). Next follows one of the many ways in which you could refactor your code:
calc_hexbins() in which you generate all the hexbins. This function could return a named list of hexbins (e.g. list(AB = h1, AC = h2, BC = h3)). This is achieved by enumerating all the possible pairs of your variables (A, B, C, D and E). The drawback is that you are duplicating some of the logic that is already in ggpairs().
gen_cids() takes the hexbins as an input and generates all the cIDs. This is a simple operation where you loop (or lapply) over all the elements in your list and take the cID.
create_plot() also takes the hexbins as an input and this is the function in which you actually generate the plot. Here you can add an extra parameter for the list of hexbins (there is a function wrap() in your package GGally to do this). Instead of calculating the hexbins, you can look them up in the named list that you've generated earlier by combining the A and the B in a string.
This avoids hacky methods such as working with attributes or using global variables. These work of course, but are often a headache when maintaining code. Unfortunately, this will also make your code a little longer, but this can be a good thing.
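A rough sketch of the first two steps, under the assumption that the pairs are named by pasting the two column names together ("AB", "AC", ...). The function bodies are my own guess at one possible implementation, and create_plot() is left out here:

```r
library(hexbin)

set.seed(1)
bindata <- data.frame(A = rnorm(100), B = rnorm(100), C = rnorm(100),
                      D = rnorm(100), E = rnorm(100))
maxVal <- max(abs(bindata))
maxRange <- c(-1 * maxVal, maxVal)

# Step 1: one hexbin object per pair of columns, returned as a named list.
calc_hexbins <- function(data, bounds, xbins = 5) {
  var_pairs <- combn(names(data), 2, simplify = FALSE)
  bins <- lapply(var_pairs, function(v) {
    hexbin(x = data[[v[1]]], y = data[[v[2]]], xbins = xbins, shape = 1,
           IDs = TRUE, xbnds = bounds, ybnds = bounds)
  })
  names(bins) <- vapply(var_pairs, paste, character(1), collapse = "")
  bins
}

# Step 2: pull the cell IDs out of every hexbin object.
gen_cids <- function(bins) {
  lapply(bins, function(h) h@cID)
}

hexbins <- calc_hexbins(bindata, maxRange)
cids <- gen_cids(hexbins)
```

cids is then a named list with one integer vector per pair, which can be looked up later by name (for example cids[["AB"]]).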
I have read lots of posts about using loops for ggplot to generate lots of graphs, but cannot find any that explain my problem...
I have a dataframe and am trying to loop over 92 columns, creating a new graph for each column. I want to save each plot as a separate object. When I run my loop (code below) and print the graphs, all the graphs are correct. However, when I replace the print() command with assign(), the graphs are not correct. The titles change as they should, but the plotted values are all identical (they are all the values for the final graph). I found this out because when I used plot_grid() to generate a figure of 10 plots, the graph titles and axis labels were all correct, but the values were identical!
My data set is large, so I have provided a small data set for illustration below.
Sample data frame:
library(ggplot2)
library(cowplot)
df <- as.data.frame(cbind(group=c(rep("A", 4), rep("B", 4)), a=sample(1:100, 8), b=sample(100:200, 8), c=sample(300:400, 8))) #make data frame
cols <- 2:4 #define columns for plots
for(i in 1:length(cols)){
df[,cols[i]] <- as.numeric(as.character(df[,cols[i]]))
} #convert columns to numeric
Plots:
for (i in 1:length(cols)){
g <- ggplot(df, aes(x=group, y=df[,cols[i]])) +
geom_boxplot() +
ggtitle(colnames(df)[cols[i]])
print(g)
assign(colnames(df)[cols[i]], g) #generate an object for each plot
}
plot_grid(a, b, c)
I am thinking that when ggplot makes the plot, it only renders the data from the final value of i? Or something like that? Is there a way around this?
I wish to do it like this, as there are a lot of graphs I wish to make and then I want to mix and match plots for figures.
Thanks!
I have cleaned up how you generated your sample data frame.
library(ggplot2)
library(cowplot)
df <- data.frame(group=c(rep("A", 4), rep("B", 4)),
a=sample(1:100, 8),
b=sample(100:200, 8),
c=sample(300:400, 8)) #make data frame
Just using data.frame() will suffice. This makes your code clearer and avoids the need for all that post-processing in your for loop to convert the columns back to numeric. Note that cbind() on mixed character and numeric vectors produces a character matrix, which as.data.frame() then turns into factors unless you set stringsAsFactors = FALSE; the numeric-to-character conversion can also be avoided by using cbind.data.frame() rather than cbind().
I have also refactored the for loop that generates your plots. You create a vector of integers called 'cols' (cols <- 2:4) which you then iterate over to generate a plot from each column of data. This is unnecessary; we can create the range directly in the for statement, 'for (i in 2:ncol(df))', which iterates from 2 to 4 (the number of columns in your dataframe). Starting from 2 is required to skip column 1, which contains metadata. This is preferable because:
i) When reviewing your code the condition used is immediately apparent without searching through the rest of your code
ii) R has a number of functions/parameters similarly named to your variable 'cols' and it is best to avoid confusion.
With the code cleaned up we can now try to locate the cause of the bug:
library(ggplot2)
library(cowplot)
df <- data.frame(group=c(rep("A", 4), rep("B", 4)),
a=sample(1:100, 8),
b=sample(100:200, 8),
c=sample(300:400, 8)) #make data frame
for (i in 2:ncol(df)){
g <- ggplot(df, aes(x=group, y=df[,i])) +
geom_boxplot() +
ggtitle(colnames(df)[i])
print(g)
assign(colnames(df)[i], g) #generate an object for each plot
}
It's not immediately obvious why your code doesn't work. The suggestion by Imo has merit: saving your plots to a list would prevent your environment from getting cluttered with objects. However, it would not solve this bug. The cause is unintuitive: the expression df[,i] inside aes() is not evaluated when the plot object is created but only when the plot is drawn, so every stored plot ends up using the last value of i from the loop. See the answer provided here by Konrad Rudolph. The following should work and retains the style of your original code. As Konrad suggests in his answer, it might be more "R"-like to use lapply. Note that we have given the body of the for loop local scope and that we now re-define i locally, so each plot keeps its own copy of i. Note the use of <<- to assign g to the global environment.
for (i in 2:ncol(df))
local({
i <- i
g <<- ggplot(df, aes(x=group, y=df[,i])) +
geom_boxplot() +
ggtitle(colnames(df)[i])
print(i)
print(g)
assign(colnames(df)[i], g, pos =1) #generate an object for each plot
})
plot_grid(a, b, c)
You owe me a drink.
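The "last value of i" behaviour is not specific to ggplot2 and can be reproduced in base R with plain functions. A small sketch (all names here are made up for the demonstration):

```r
# Each function refers to `i`, but `i` is only looked up when the function runs.
late <- list()
for (i in 1:3) {
  late[[i]] <- function() i
}
late_values <- vapply(late, function(f) f(), integer(1))

# Wrapping the loop body in local() gives every function its own copy (`j`).
fixed <- list()
for (i in 1:3) local({
  j <- i
  fixed[[j]] <<- function() j
})
fixed_values <- vapply(fixed, function(f) f(), integer(1))
```

late_values comes out as 3, 3, 3, while fixed_values is 1, 2, 3, which mirrors what happens to the stored plots before and after adding local().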
There are two standard ways to deal with this problem:
1- Work with a long-format data.frame
2- Use aes_string to refer to variable names in the wide format data.frame
Here's an illustration of possible strategies.
library(ggplot2)
library(gridExtra)
# data from other answer
df <- data.frame(group=c(rep("A", 4), rep("B", 4)),
a=sample(1:100, 8),
b=sample(100:200, 8),
c=sample(300:400, 8))
## first method: long format
m <- reshape2::melt(df, id = "group")
p <- ggplot(m, aes(x=group, y=value)) +
geom_boxplot()
pl <- plyr::dlply(m, "variable", function(.d) p %+% .d + ggtitle(unique(.d$variable)))
grid.arrange(grobs=pl)
## second method: keep wide format
one_plot <- function(col = "a") ggplot(df, aes_string(x="group", y=col)) + geom_boxplot() + ggtitle(col)
pl <- plyr::llply(colnames(df)[-1], one_plot)
grid.arrange(grobs=pl)
## third method: more explicit looping
pl <- vector("list", length = ncol(df)-1)
for(ii in seq_along(pl)){
.col <- colnames(df)[-1][ii]
.p <- ggplot(df, aes_string(x="group", y=.col)) + geom_boxplot() + ggtitle(.col)
pl[[ii]] <- .p
}
grid.arrange(grobs=pl)
Sometimes, when wrapping a ggplot call inside a function/for loop one faces issues with local variables (not the case here, if aes_string is used). In such cases one can define a local environment.
Note that using a construct like aes(y=df[,i]) may appear to work, but can produce very wrong results. Consider a facetted plot, the data.frame will be split into different groups for each panel, and this subsetting can fail miserably to group the right data if numeric values are passed directly to aes() instead of variable names.
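As a possible variation on the second method: aes_string() is deprecated in recent ggplot2 releases, and the .data pronoun (available since ggplot2 3.0.0, as far as I know) does the same job. A sketch that keeps the plots in a named list instead of using assign(); make_boxplot and plots are names chosen just for this example:

```r
library(ggplot2)

set.seed(1)
df <- data.frame(group = rep(c("A", "B"), each = 4),
                 a = sample(1:100, 8),
                 b = sample(100:200, 8),
                 c = sample(300:400, 8))

value_cols <- setdiff(names(df), "group")

# `col` is a column name; .data[[col]] looks it up in the plot data by name.
make_boxplot <- function(col) {
  ggplot(df, aes(x = group, y = .data[[col]])) +
    geom_boxplot() +
    ggtitle(col)
}

plots <- lapply(value_cols, make_boxplot)
names(plots) <- value_cols

# Medians drawn in the plot for column "b", one per group (A, B).
medians_b <- ggplot_build(plots[["b"]])$data[[1]]$middle
```

Individual plots can then be mixed and matched by name, e.g. cowplot::plot_grid(plots$a, plots$c), assuming cowplot is installed.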
I have a dataset of leaf trait measurements made at multiple sites at two contrasting seasons. I am interested to explore the association/line fit between a pair of traits and to differentiate the seasons at each site.
Rather than a linear regression, I would prefer to use the Standardised Major Axis approach within the smatr package:
e.g. sma.site1 <- sma(TraitA ~ TraitB * Visit, data=subset(myfile, Site=="Site1")) # testing the null hypothesis of common slopes for the two Visits (Seasons) at a given Site.
I can produce a handy lattice plot in ggplot2 with a separate panel for each Site and the points differentiated by Visit:
e.g. qplot(TraitB, TraitA, data=myfile, colour=Visit) + facet_wrap(~Site, ncol=2)
However, if I add trend lines fitted with the additional argument in ggplot2:
+ geom_smooth(aes(group=Visit), method="lm", se=F)
..., those lines are not a good match for the sma coefficients.
What I would like to do is fit the lines suggested by the sma test onto the ggplot lattice. Is there an easy, or efficient, way to do that?
I know that I can subset the data, produce a plot for each site, add the relevant lines with + geom_abline() and then stitch the separate plots up together with grid.arrange(). But that feels very long-winded.
I would be grateful for any pointers.
I don't know anything about the smatr package but you should be able to tweak this to get the right values. Since you provided no data I used the leaf data from the example in the pkg. The basic idea is to pull out the slope & intercept from the returned sma object and then facet the geom_abline. I may be misinterpreting the object, though.
library(smatr)
library(ggplot2)
data(leaflife)
do.call(rbind, lapply(unique(leaflife$site), function(x) {
obj <- sma(longev~lma*rain, data=subset(leaflife, site=x))
data.frame(site=x,
intercept=obj$coef[[1]][1, 1],
slope=obj$coef[[1]][2, 1])
})) -> fits
gg <- ggplot(leaflife)
gg <- gg + geom_point(aes(x=lma, y=longev, color=soilp))
gg <- gg + geom_abline(data=fits, aes(slope=slope, intercept=intercept))
gg <- gg + facet_wrap(~site, ncol=2)
gg
I just saw this question and am not sure if you are still interested in it. I ran the code by hrbrmstr and found that the only thing you need to change is:
obj <- sma(longev~lma*rain, data=subset(leaflife, site == x))
Then you can get the plot with four lines for each group.
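For readers without smatr at hand, the "one fitted line per facet" idea can also be sketched with hand-computed standardised major axis estimates (slope = sign(r) * sd(y) / sd(x), intercept from the means). This is only an illustration on the built-in iris data: sma_line is a helper I made up, and it does none of the testing or confidence intervals that sma() provides.

```r
library(ggplot2)

# Hand-computed SMA line for one group (illustrative helper, not from smatr).
sma_line <- function(x, y) {
  slope <- sign(cor(x, y)) * sd(y) / sd(x)
  c(intercept = mean(y) - slope * mean(x), slope = slope)
}

# One row of coefficients per facet; the Species column links rows to panels.
fits <- do.call(rbind, lapply(split(iris, iris$Species), function(d) {
  est <- sma_line(d$Petal.Length, d$Sepal.Length)
  data.frame(Species = d$Species[1],
             intercept = est[["intercept"]],
             slope = est[["slope"]])
}))

gg <- ggplot(iris, aes(Petal.Length, Sepal.Length)) +
  geom_point() +
  geom_abline(data = fits, aes(slope = slope, intercept = intercept)) +
  facet_wrap(~Species, ncol = 2)
```

Because fits carries the facetting variable, geom_abline() draws each line only in its own panel. Adding a second grouping column (such as Visit) to fits and mapping it to colour should give one line per season per site in the same way.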
I want to create a correlation matrix plot, i.e. a plot where each variable is plotted in a scatterplot against each other variable like with pairs() or splom(). I want to do this with ggplot2. See here for examples. The link mentions some code someone wrote for doing this in ggplot2, however, it is outdated and no longer works (even after you swap out the deprecated parts).
One could do this with a loop in a loop and then multiplot(), but there must be a better way. I tried melting the dataset to long, and copying the value and variable variables and then using facets. This almost gives you something correct.
d = data.frame(x1=rnorm(100),
x2=rnorm(100),
x3=rnorm(100),
x4=rnorm(100),
x5=rnorm(100))
library(reshape2)
d = melt(d)
d$value2 = d$value
d$variable2 = d$variable
library(ggplot2)
ggplot(data=d, aes(x=value, y=value2)) +
geom_point() +
facet_grid(variable ~ variable2)
This gets the general structure right, but only works for plotting each variable against itself. Is there some more clever way of doing this without resorting to 2 loops?
library(GGally)
set.seed(42)
d = data.frame(x1=rnorm(100),
x2=rnorm(100),
x3=rnorm(100),
x4=rnorm(100),
x5=rnorm(100))
# estimated density in diagonal
ggpairs(d)
# blank
ggpairs(d, diag = list("continuous"="blank"))
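If you would rather stay in plain ggplot2, the melt-and-facet idea from the question can be made to work by building one block of rows per (x variable, y variable) pair. A sketch with a smaller made-up data set; combos and long are names invented here:

```r
library(ggplot2)

set.seed(42)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))

# Every ordered pair of variable names, including each variable with itself.
vars <- names(d)
combos <- expand.grid(xvar = vars, yvar = vars, stringsAsFactors = FALSE)

# Stack one block of rows per pair: the values of xvar in x, those of yvar in y.
long <- do.call(rbind, lapply(seq_len(nrow(combos)), function(k) {
  data.frame(xvar = combos$xvar[k],
             yvar = combos$yvar[k],
             x = d[[combos$xvar[k]]],
             y = d[[combos$yvar[k]]])
}))

p <- ggplot(long, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(yvar ~ xvar)
```

The data grows with the square of the number of variables, so for wide data sets ggpairs() is probably still the more practical option.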
Using PerformanceAnalytics library :
library("PerformanceAnalytics")
chart.Correlation(df, histogram = T, pch= 19)