R Get objects from environment and feed to function - r

This is probably a pretty trivial question, but I'm much more used to python than to R (the fact I'm mostly a biologist might also play a role...)
What the code below does is plot the counts for each gene in the provided data in an independent panel, and rearrange the legends in order to have a single one for all the panels in the plot.
# function to rearrange plot legend, from here:
# http://rpubs.com/sjackman/grid_arrange_shared_legend. Give credit where credit is due ;)
grid_arrange_shared_legend <- function(...) {
plots <- list(...)
g <- ggplotGrob(plots[[1]] + theme(legend.position="bottom"))$grobs
legend <- g[[which(sapply(g, function(x) x$name) == "guide-box")]]
lheight <- sum(legend$height)
grid.arrange(
do.call(arrangeGrob, lapply(plots, function(x)
x + theme(legend.position="none"))),
legend,
ncol = 1,
heights = unit.c(unit(1, "npc") - lheight, lheight))
}
# make plot for the given gene and assign it to a named object
plot_genes <- function(gene, gID){
name<-paste0("plotted_counts_for_", gene)
counts = plotCounts("whatever") # get data using plotCounts from DESeq2 package. the gID is used in here
return(assign(name, ggplot(counts,
# + a bunch of plotting aestetics
envir = .GlobalEnv)) #make plot available outside function. Probably I can also use parent.frame()
}
# call plot_genes() for each cluster of genes, adjust the legend for multiple plots with grid_arrange_shared_legend()
plot_cluster_count <- function(cluster,name) {
genes = as.vector(as.data.frame(cluster)$Symbol)
gIDs = as.vector(as.data.frame(cluster)$EMSEMBL)
pdf(paste0(name,"_counts.pdf"))
plt = lapply(seq_along(genes), function(x) plot_genes(genes[x], gIDs[x]))
grid_arrange_shared_legend(plotted_counts_for_gene1, plotted_counts_for_gene2, plotted_counts_for_gene3,plotted_counts_for_gene4)
dev.off()
}
# call the whole thing
plot_cluster_count(Cluster_1,"Cluster_1")
This code works.
The issue is that it works only when I explicitely hard-code the names of the plots as in grid_arrange_shared_legend(plotted_counts_for_gene1, plotted_counts_for_gene2, plotted_counts_for_gene3,plotted_counts_for_gene4).
However I have plenty of clusters to plot, with different number of genes with different names, so I need to automate the selection of objects to feed to grid_arrange_shared_legend().
I tried to play around with ls()/objects(), mget() and Google but I can't find a way to get it working, I always end up with
Error in plot_clone(plot) : attempt to apply non-function.
I traced the error back with options(error=recover) and indeed it comes from grid_arrange_shared_legend(), so to me it looks like I'm not able to feed the objects to the function.
The ultimate goal would be to be able to call plot_cluster_count() within a lapply() statement feeding a list of clusters to iterate through. This should result in one pdf per cluster, each containing one panel per gene.
PS
I'm aware that getting the object names from the environment is not the most elegant way to go, it just seemed more straightforward. Any alternative approach is more than welcome
Thanks!

One solution:
# mock data (3 "objects")
set.seed(1)
obj_1 <- list(var = sample(1:100, 1), name = "obj_1")
obj_2 <- list(var = sample(1:100, 1), name = "obj_2")
obj_3 <- list(var = sample(1:100, 1), name = "obj_3")
# any kind of function you want to apply
f1 <- function(obj, obj_name) {
print(paste(obj, " -- ", obj_name))
}
# Find all objects in your environment
list_obj <- ls(pattern = "obj_")
# Apply the previous function on this list
output <- lapply(list_obj, function(x) f1(get(x)$var, get(x)$name))
output
#> [[1]]
#> [1] "27 -- obj_1"
#>
#> [[2]]
#> [1] "38 -- obj_2"
#>
#> [[3]]
#> [1] "58 -- obj_3"

Related

Plot function by condition and subsample for factor data

I'm working on a plotting function for the likert data from a survey and I'm trying to optimize it to be as automated as possible since I have to make quite a lot of plots and make it as user-friendly as possible, but I'm having some problems and really need help finishing this function...
These are the data:
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
df1[colnames(df1)] <- lapply(df1[colnames(df1)], factor)
Columns A and B pertain to the "Technology" section of my survey, while C, D and E are in "Social".
I have transformed my data using the likertpackage and compiled them in a list to be more easily called in my function (don't know if it's the best way to go about it, I'm still quite new to R, so feel free to make suggestions even concerning this point):
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
So far, here is the function I could come up with (with great help of user #gaut):
mynames <- sapply(names(tbls), function(x) {
paste("How do they rank? -",gsub("\\.",": ",x))
})
myfilenames <- names(tbls)
plot_likert <- function(x, myname, myfilename){
p <- plot(likert(x),
type ="bar",center=3,
group.order=names(x))+
labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x)))+
guides(fill=guide_legend("Rank"))+
ggtitle(myname)
p
I then lapply the function to get a list of plots:
list_plots <- lapply(1:length(tbls),function(i) {
plot_likert(tbls[[i]], mynames[i], myfilenames[i])
})
And then save them all as .png
sapply(1:length(list_plots), function(i) ggsave(
filename = paste0("plots ",i,".png"),
plot = list_plots[[i]],
width = 15, height = 9
))
Now, there are 3 main things I want my function to do but don't really know how to approach:
1) Right now I can export all the plots in one batch, but I would also like to be able to export a single plot, for example obtaining the above graph by writing:
plot_likert(tbls$dummy1.no)
2) In my mind, my ideal plotting function would also take into account the sections of my data mentioned above, so that if I specify the section Technology only get a Likert plot considering only columns A and B, and specifying the subsample gets me the dummy. Like so:
plot_likert(section=Technology, subsample=dummy1.no)
3) As you maybe have already noted, I need the titles of the plot to be fully automatic, so that by changing section or subsample they too change accordingly.
Apologies for the long/intricate question but I've been stuck on this function for quite some time and really need help finalizing it. For any further clarification/info, do not hesitate to ask!
Thank you in advance for any advice!
There are many ways to get what you want. Essentially, you need to add a few arguments to your function.
I agree with Limey though (and of course Hadley) - generally better to have a few simple functions that do a little step and then you can collate everything in one bigger function.
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
## this can be shortened
df1 <- data.frame(lapply(df1, factor))
## the rest of dummy data creation probably too, but I won't dig too much into this now
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
library(ggplot2)
library(likert)
#> Loading required package: xtable
## no need for sapply, really!
mynames <- paste("How do they rank? -", gsub("\\.",": ",names(tbls)))
myfilenames <- names(tbls)
## defining arguments with NULL makes it possible to not specify it without giving it a value
plot_likert <- function(x, myname, myfilename, section = NULL, subsample = NULL){
## first take only the tbl of interest
if(!is.null(subsample)) x <- x[subsample]
## then filter for your section and subsample
if(!is.null(section)) x <- lapply(x, function(y) y[, section])
## you can run your lapply within the function -
## ideally make a separate funciton and call the smaller function in the bigger one
## use seq_along
lapply(seq_along(x), function(i) {
plot(likert(x[[i]]),
type ="bar",center=3,
group.order=names(x[[i]]))+
labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x)))+
guides(fill=guide_legend("Rank")) +
## programmatic title
ggtitle(names(x)[i])
})
}
## you need to pass character vectors to your arguments
patchwork::wrap_plots(plot_likert(tbls))
patchwork::wrap_plots(plot_likert(tbls, section = LETTERS[1:2], subsample = paste("dummy1", c("no", "yes"), sep = ".")))
Created on 2022-08-17 by the reprex package (v2.0.1)

Is it somehow possibe to link a mathematical axis unit expression from a list or dataframe onto an ggplot axis?

I wonder if it's possible to craft a semi-automatic parsing of 'complicated' mathematical expressions into the axis of ggplot by maintaining some sort of lookup table?
So, for example, for data-mining, I regularly have to produce hundreds of scatterplots, which I want to discuss with colleagues. To do so, I want correct axis-legends, of course - this is rather cumbersome.
Here a simple example of what would like to read out from a database into the labs() by using a formula: expression(paste(delta^{18},"O (\u2030)")
So, what I was wondering is if there's a way to link those labs() to predefined lists or tables in a way like labs(y = list[3])?
This works just fine for simple names like: "Dissolved oxygen saturation / %", but when trying the same for the above, it generates:
paste(delta^{
18
}, "O (‰)")
(including the breaks - which is obviously not what I want)
Thanks,
Alex
You could tinker with the math_format() function from the scales package a bit to take in pre-substituted expressions:
library(patchwork)
library(ggplot2)
splitiris <- split(iris, iris$Species)
# Example expressions
exprs <- list(
substitute(10^.x),
substitute(log[.x]~"(%)"),
substitute(frac(.x, 2))
)
# Near-copy of scales::math_format
math_format2 <- function(expr = subsitute(10^.x), format = force) {
.x <- NULL
subs <- function(x) {
do.call("substitute", list(expr, list(.x = x)))
}
function(x) {
x <- format(x)
ret <- lapply(x, subs)
ret <- as.expression(ret)
ret[is.na(x)] <- NA
names(ret) <- names(x)
ret
}
}
# Generate plots
plots <- lapply(seq_along(splitiris), function(i) {
ggplot(splitiris[[i]], aes(Sepal.Width, Sepal.Length)) +
geom_point() +
scale_x_continuous(labels = math_format2(exprs[[i]]))
})
plots[[1]] + plots[[2]] + plots[[3]]
Created on 2020-05-27 by the reprex package (v0.3.0)

How do I repeat codes with names changing at every block? (with R)

I'm dealing with several outputs I obtain from QIIME, texts which I want to manipulate for obtaining boxplots. Every input is formatted in the same way, so the manipulation is always the same, but it changes the source name. For each input, I want to extract the last 5 rows, have a mean for each column/sample, associate the values to sample experimental labels (Group) taken from the mapfile and put them in the order I use for making a boxplot of all the 6 data obtained.
In bash, I do something like "for i in GG97 GG100 SILVA97 SILVA100 NCBI RDP; do cp ${i}/alpha/collated_alpha/chao1.txt alpha_tot/${i}_chao1.txt; done" to do a command various times changing the names in the code in an automatic way through ${i}.
I'm struggling to find a way to do the same with R. I thought creating a vector containing the names and then using a for cycle by moving the i with [1], [2] etc., but it doesn't work, it stops at the read.delim line not finding the file in the wd.
Here's the manipulation code I wrote. After the comment, it will repeat itself 6 times with the 6 databases I'm using (GG97 GG100 SILVA97 SILVA100 NCBI RDP).
PLUS, I repeat this process 4 times because I have 4 metrics to use (here I'm showing shannon, but I also have a copy of the code for chao1, observed_species and PD_whole_tree).
library(tidyverse)
library(labelled)
mapfile <- read.delim(file="mapfile_HC+BV.txt", check.names=FALSE);
mapfile <- mapfile[,c(1,4)]
colnames(mapfile) <- c("SampleID","Pathology_group")
#GG97
collated <- read.delim(file="alpha_diversity/GG97_shannon.txt", check.names=FALSE);
collated <- tail(collated,5); collated <- collated[,-c(1:3)]
collated_reorder <- collated[,match(mapfile[,1], colnames(collated))]
labels <- t(mapfile)
colnames(collated_reorder) <- labels[2,]
mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
mean = as.matrix(mean); mean <- t(mean)
GG97_shannon <- as.data.frame(rbind(labels[2,],mean))
GG97_shannon <- t(GG97_shannon);
DB_type <- list(DB = "GG97"); DB_type <- rep(DB_type, 41)
GG97_shannon <- as.data.frame(cbind(DB_type,GG97_shannon))
colnames(GG97_shannon) <- c("DB","Group","value")
rm(collated,collated_reorder,DB_type,labels,mean)
Here I paste all the outputs together, freeze the order and make the boxplot.
alpha_shannon <- as.data.frame(rbind(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon))
rownames(alpha_shannon) <- NULL
rm(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon)
alpha_shannon$Group = factor(alpha_shannon$Group, unique(alpha_shannon$Group))
alpha_shannon$DB = factor(alpha_shannon$DB, unique(alpha_shannon$DB))
library(ggplot2)
ggplot(data = alpha_shannon) +
aes(x = DB, y = value, colour = Group) +
geom_boxplot()+
labs(title = 'Shannon',
x = 'Database',
y = 'Diversity') +
theme(legend.position = 'bottom')+
theme_grey(base_size = 16)
How do I keep this code "DRY" and don't need 146 rows of code to repeat the same things over and over? Thank you!!
You didn't provide a Minimal reproducible example, so this answer cannot guarantee correctness.
An important point to note is that you use rm(...), so this means some variables are only relevant within a certain scope. Therefore, encapsulate this scope into a function. This makes your code reusable and spares you the manual variable removal:
process <- function(file, DB){
# -> Use the function parameter `file` instead of a hardcoded filename
collated <- read.delim(file=file, check.names=FALSE);
collated <- tail(collated,5); collated <- collated[,-c(1:3)]
collated_reorder <- collated[,match(mapfile[,1], colnames(collated))]
labels <- t(mapfile)
colnames(collated_reorder) <- labels[2,]
mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
mean = as.matrix(mean); mean <- t(mean)
# -> rename this variable to a more general name, e.g. `result`
result <- as.data.frame(rbind(labels[2,],mean))
result <- t(result);
# -> Use the function parameter `DB` instead of a hardcoded string
DB_type <- list(DB = DB); DB_type <- rep(DB_type, 41)
result <- as.data.frame(cbind(DB_type,result))
colnames(result) <- c("DB","Group","value")
# -> After the end of this function, the variables defined in this function
# vanish automatically, you just need to specify the result
return(result)
}
Now you can reuse that block:
GG97_shannon <- process(file = "alpha_diversity/GG97_shannon.txt", DB = "GG97")
GG100_shannon <- process(file =...., DB = ....)
SILVA97_shannon <- ...
SILVA100_shannon <- ...
NCBI_shannon <- ...
RDP_shannon <- ...
Alternatively, you can use looping structures:
General-purpose for:
datasets <- c("GG97_shannon", "GG100_shannon", "SILVA97_shannon",
"SILVA100_shannon", "NCBI_shannon", "RDP_shannon")
files <- c("alpha_diversity/GG97_shannon.txt", .....)
DBs <- c("GG97", ....)
result <- list()
for(i in seq_along(datasets)){
result[[datasets[i]]] <- process(files[i], DBs[i])
}
mapply, a "specialized for" for looping over several vectors in parallel:
# the first argument is the function from above, the other ones are given as arguments
# to our process(.) function
results <- mapply(process, files, DBs)

Repeat n times a subset and plot in R

first of all thanks to this forum because I've finded a lot of answers!!
Now my time to ask for help.
I can solve this.... function?, loop?... didn't find a good example
# from a data.frame = data
# A:F are name of columns
x<-unique(data$A) # in the example c('var1','var2','var3','var4')
y<-unique(data$B) # in the example c('varA','varB')
z<-unique(data$C) # in the example c('var1a','var2a')
# I NEED TO REPEAT THIS (based on x_y_z combinations)#
x_y_z<-subset(data,data$A==x & data$B==y & data$C==z)
plot_x_y<-qic(y=D
,n=E
,x=F
,data=x_y_z
,chart = 'p')
The idea is to repeat the subsetting and make a plot for each x_y_z combination. The subset should have the name of the variables combinated separated by a '_'.
I guess it should work like this:
var1_varA_var1a<-subset(data,data$A==var1 & data$B==varA & data$C==var1a)
plot_var1_varA<-qic(y=D
,n=E
,x=F
,data=var1_varA_var1a
,chart = 'p')
And obtain all this plots:
plot_var1_varA_var1a
plot_var1_varB_var1a
plot_var1_varA_var2a
plot_var1_varB_var2a
plot_var2_varA_var1a
plot_var2_varB_var1a
plot_var2_varA_var2a
plot_var2_varB_var2a
plot_var3_varA_var1a
plot_var3_varB_var1a
plot_var3_varA_var2a
plot_var3_varB_var2a
plot_var4_varA_var1a
plot_var4_varB_var1a
plot_var4_varA_var2a
plot_var4_varB_var2a
Sorry for the basic question, but I'm stuck on this.
Cristobal
Consider by which slices like subset but allows dataframe operations in its FUN arg. And use a list of plots instead of many separately named plot objects.
plot_list <- by(data, data[, c("A","B","C")], FUN = function(df) {
qic(y = D,n = E,x = F, data = df, chart = 'p')
})
Should you want to rename this list:
dfnames <- expand.grid(x,y,z)
listnames <- vapply(1:nrow(dfnames), function(i)
paste(dfnames$Var1[[i]], dfnames$Var2[[i]], dfnames$Var3[[i]], sep="_"), character(1))
# [1] "var1_varA_var1a" "var2_varA_var1a" "var3_varA_var1a" "var4_varA_var1a" "var1_varB_var1a"
# [6] "var2_varB_var1a" "var3_varB_var1a" "var4_varB_var1a" "var1_varA_var2a" "var2_varA_var2a"
# [11] "var3_varA_var2a" "var4_varA_var2a" "var1_varB_var2a" "var2_varB_var2a" "var3_varB_var2a"
# [16] "var4_varB_var2a"
# RENAME LIST ELEMENTS
plot_list <- setNames(plot_list, paste0("plot_", listnames))
plot_list$plot_var1_varA_var1a # FIRST PLOT
plot_list$plot_var2_varA_var1a # SECOND PLOT
...

Strange performance of multi qplots to a figure in R when parameter of qplot includes list

Aim: save multi qplots into a figure.
The point is there is a list parameter to loop (see code. It's more clear!). The output is different as I expected. They are the same!
# Make y be a list containing different values.
y <- list()
for (j in 1:6) {
y[[j]] <- rnorm(10)
}
# plot multi qplots into one figure
plots <- list() # new empty list
for (i in 1:6) {
p1 = qplot(1:10, y[[i]], main = i)
plots[[i]] <- p1 # add each plot into plot list
}
do.call(grid.arrange, plots)
The code is revised from http://rstudio-pubs-static.s3.amazonaws.com/2852_379274d7c5734f979e106dcf019ec46c.html
There is y partly: (to illustrate they are different!). The figure only used the last one in y. It's strange.
[[1]]
[1] 2.01846525 -2.32504052 -1.07201485 -0.21105479 0.25706024 0.50934754
[7] 0.39844954 0.18110421 1.03368602 0.01185971
[[2]]
[1] 0.01824317 -1.51801208 1.68385158 -0.30159404 -0.34894329 0.62840458
[7] -0.45447576 1.18625774 -0.36671100 -0.05502285
...
[[6]]
[1] 0.1134854 0.1806742 -0.9491033 0.7279389 -0.2193326 0.1595183 -1.1751557
[8] -0.4416456 -0.7074360 -0.3887882
This has to with with the lazy exaluation of parameters passed to qplot. The values aren't actually resolved till you print the plot. At that point, the value if i is just 6 after the looping. A better strategy would be
plots <- lapply(1:6, function(i) {
force(i) #required if you didn't have main=i to force the evaluation of i
qplot(1:10, y[[i]], main = i)
})
do.call(grid.arrange, plots)

Resources