Plot function by condition and subsample for factor data - r

I'm working on a plotting function for Likert data from a survey. I have to make quite a lot of plots, so I'm trying to make the function as automated and user-friendly as possible, but I'm having some problems and really need help finishing it.
These are the data:
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
df1[colnames(df1)] <- lapply(df1[colnames(df1)], factor)
Columns A and B pertain to the "Technology" section of my survey, while C, D and E are in "Social".
I have transformed my data using the likert package and compiled them in a list so they can be called more easily in my function (I don't know if it's the best way to go about it, I'm still quite new to R, so feel free to make suggestions even concerning this point):
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
So far, here is the function I could come up with (with great help of user #gaut):
mynames <- sapply(names(tbls), function(x) {
paste("How do they rank? -",gsub("\\.",": ",x))
})
myfilenames <- names(tbls)
plot_likert <- function(x, myname, myfilename){
p <- plot(likert(x),
type ="bar",center=3,
group.order=names(x))+
labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x)))+
guides(fill=guide_legend("Rank"))+
ggtitle(myname)
p
}
I then lapply the function to get a list of plots:
list_plots <- lapply(1:length(tbls),function(i) {
plot_likert(tbls[[i]], mynames[i], myfilenames[i])
})
And then save them all as .png
sapply(1:length(list_plots), function(i) ggsave(
filename = paste0("plots ",i,".png"),
plot = list_plots[[i]],
width = 15, height = 9
))
Now, there are 3 main things I want my function to do but don't really know how to approach:
1) Right now I can export all the plots in one batch, but I would also like to be able to export a single plot, for example obtaining the above graph by writing:
plot_likert(tbls$dummy1.no)
2) Ideally, my plotting function would also take into account the sections of my data mentioned above, so that if I specify the section Technology I only get a Likert plot for columns A and B, and specifying the subsample gets me the dummy. Like so:
plot_likert(section=Technology, subsample=dummy1.no)
3) As you maybe have already noted, I need the titles of the plot to be fully automatic, so that by changing section or subsample they too change accordingly.
Apologies for the long/intricate question but I've been stuck on this function for quite some time and really need help finalizing it. For any further clarification/info, do not hesitate to ask!
Thank you in advance for any advice!

There are many ways to get what you want. Essentially, you need to add a few arguments to your function.
I agree with Limey though (and of course Hadley): it's generally better to have a few simple functions that each do one small step, which you can then combine in one bigger function.
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
## this can be shortened
df1 <- data.frame(lapply(df1, factor))
## the rest of dummy data creation probably too, but I won't dig too much into this now
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
library(ggplot2)
library(likert)
#> Loading required package: xtable
## no need for sapply, really!
mynames <- paste("How do they rank? -", gsub("\\.",": ",names(tbls)))
myfilenames <- names(tbls)
## giving an argument a default of NULL makes it optional: the caller can simply leave it out
plot_likert <- function(x, myname, myfilename, section = NULL, subsample = NULL){
## first take only the tbl of interest
if(!is.null(subsample)) x <- x[subsample]
## then filter for your section and subsample
if(!is.null(section)) x <- lapply(x, function(y) y[, section, drop = FALSE])
## you can run your lapply within the function -
## ideally make a separate function and call the smaller function in the bigger one
## use seq_along
lapply(seq_along(x), function(i) {
plot(likert(x[[i]]),
type ="bar",center=3,
group.order=names(x[[i]]))+
## x is a list here, so count the rows of the i-th data frame
labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x[[i]])))+
guides(fill=guide_legend("Rank")) +
## programmatic title
ggtitle(names(x)[i])
})
}
## you need to pass character vectors to your arguments
patchwork::wrap_plots(plot_likert(tbls))
patchwork::wrap_plots(plot_likert(tbls, section = LETTERS[1:2], subsample = paste("dummy1", c("no", "yes"), sep = ".")))
Created on 2022-08-17 by the reprex package (v2.0.1)
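Following the "small functions" advice, here is a rough sketch of a helper that handles only the selection and the title, so a single plot can be requested by section and subsample. The names `sections` and `prepare_likert` are made up for illustration, and the section lookup simply encodes the Technology/Social split described in the question. The plotting call is left as a comment because it needs the likert package.

```r
df1 <- data.frame(A = c(1,2,2,3,4,5,1,1,2,3),
                  B = c(4,4,2,3,4,2,1,5,2,2),
                  C = c(3,3,3,3,4,2,5,1,2,3),
                  D = c(1,2,5,5,5,4,5,5,2,3),
                  E = c(1,4,2,3,4,2,5,1,2,3),
                  dummy1 = c("yes","yes","no","no","no","no","yes","no","yes","yes"),
                  dummy2 = c("high","low","low","low","high","high","high","low","low","high"))
df1[] <- lapply(df1, factor)

# hypothetical lookup: which columns belong to which survey section
sections <- list(Technology = c("A", "B"), Social = c("C", "D", "E"))

# hypothetical helper: pick the rows of one subsample and the columns of one
# section, and build the matching plot title
prepare_likert <- function(data, section, dummy, level, sections) {
  cols <- sections[[section]]
  rows <- data[[dummy]] == level
  list(
    data  = data[rows, cols, drop = FALSE],
    title = paste0("How do they rank? - ", section, " (", dummy, ": ", level, ")")
  )
}

prep <- prepare_likert(df1, "Technology", "dummy1", "no", sections)
# with library(likert) and library(ggplot2) loaded, one could then try:
# plot(likert(prep$data), type = "bar", center = 3) + ggtitle(prep$title)
```

Because the title is built from the same arguments that select the data, it changes automatically with the section and subsample.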

Related

Passing parameters through wrappers in R

I have a question about passing parameters for functions down through a series of wrappers and the correct way to do these sorts of things. Each of these functions aside from the wrapper function are supposed to also work outside of the wrapper if that makes any difference. I apologize for the stupidity or errors within the example, I was struggling to think of something that would explain the problem.
# data sets
data("mtcars")
data("starwars")
# Data set list
d.list <- list(mtcars, starwars)
names(d.list) <- c("mtcars", "starwars")
foo_1 <- function(event,data.in){
data.in[grep(event, names(data.in))]
}
foo_2 <- function(event, data.in, extra, ...){
a.df <- foo_1(...)
a.df %>%
mutate(across(is.numeric, ~ . + extra))
}
foo_wrapper <- function(event, data.in, extra, ...){
b.df <- foo_2(...)
c.df <- foo_2(..., extra = 15)
return(list(b.df, c.df))}
foo.this <- foo_wrapper(event = "starwars",
data.in = d.list,
extra = 12)
# Error in foo_1(...) : argument "data.in" is missing, with no default
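A likely cause of this error: `event` and `data.in` are named formals of foo_wrapper and foo_2, so they are matched there and never end up in `...`, which is then forwarded empty. Below is a minimal base R sketch of one way around it, assuming the intermediate functions only name the arguments they use themselves. `pick_data`, `add_extra` and `wrapper` are hypothetical stand-ins for foo_1, foo_2 and foo_wrapper, and `iris` replaces `starwars` to avoid the dplyr dependency.

```r
# stand-in for the question's list of data sets (base R data only)
d.list <- list(mtcars = mtcars, iris = iris)

# keep the list elements whose name matches `event`
pick_data <- function(event, data.in) {
  data.in[grep(event, names(data.in))]
}

# only `extra` is a named formal here, so everything else travels in `...`
add_extra <- function(extra, ...) {
  selected <- pick_data(...)
  lapply(selected, function(df) {
    num <- vapply(df, is.numeric, logical(1))
    df[num] <- lapply(df[num], function(col) col + extra)
    df
  })
}

# the wrapper also names only `extra`; `event` and `data.in` land in `...`
wrapper <- function(extra, ...) {
  list(add_extra(extra = extra, ...),
       add_extra(extra = 15, ...))
}

res <- wrapper(event = "mtcars", data.in = d.list, extra = 12)
```

Each function still works on its own, since `pick_data(event = "iris", data.in = d.list)` needs nothing from the wrapper.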

How do I repeat codes with names changing at every block? (with R)

I'm dealing with several outputs I obtain from QIIME, texts which I want to manipulate for obtaining boxplots. Every input is formatted in the same way, so the manipulation is always the same, but it changes the source name. For each input, I want to extract the last 5 rows, have a mean for each column/sample, associate the values to sample experimental labels (Group) taken from the mapfile and put them in the order I use for making a boxplot of all the 6 data obtained.
In bash, I do something like "for i in GG97 GG100 SILVA97 SILVA100 NCBI RDP; do cp ${i}/alpha/collated_alpha/chao1.txt alpha_tot/${i}_chao1.txt; done" to do a command various times changing the names in the code in an automatic way through ${i}.
I'm struggling to find a way to do the same in R. I thought of creating a vector containing the names and then using a for loop, indexing it with [1], [2], etc., but it doesn't work: it stops at the read.delim line because it can't find the file in the working directory.
Here's the manipulation code I wrote. After the comment, it will repeat itself 6 times with the 6 databases I'm using (GG97 GG100 SILVA97 SILVA100 NCBI RDP).
PLUS, I repeat this process 4 times because I have 4 metrics to use (here I'm showing shannon, but I also have a copy of the code for chao1, observed_species and PD_whole_tree).
library(tidyverse)
library(labelled)
mapfile <- read.delim(file="mapfile_HC+BV.txt", check.names=FALSE);
mapfile <- mapfile[,c(1,4)]
colnames(mapfile) <- c("SampleID","Pathology_group")
#GG97
collated <- read.delim(file="alpha_diversity/GG97_shannon.txt", check.names=FALSE);
collated <- tail(collated,5); collated <- collated[,-c(1:3)]
collated_reorder <- collated[,match(mapfile[,1], colnames(collated))]
labels <- t(mapfile)
colnames(collated_reorder) <- labels[2,]
mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
mean = as.matrix(mean); mean <- t(mean)
GG97_shannon <- as.data.frame(rbind(labels[2,],mean))
GG97_shannon <- t(GG97_shannon);
DB_type <- list(DB = "GG97"); DB_type <- rep(DB_type, 41)
GG97_shannon <- as.data.frame(cbind(DB_type,GG97_shannon))
colnames(GG97_shannon) <- c("DB","Group","value")
rm(collated,collated_reorder,DB_type,labels,mean)
Here I paste all the outputs together, freeze the order and make the boxplot.
alpha_shannon <- as.data.frame(rbind(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon))
rownames(alpha_shannon) <- NULL
rm(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon)
alpha_shannon$Group = factor(alpha_shannon$Group, unique(alpha_shannon$Group))
alpha_shannon$DB = factor(alpha_shannon$DB, unique(alpha_shannon$DB))
library(ggplot2)
ggplot(data = alpha_shannon) +
aes(x = DB, y = value, colour = Group) +
geom_boxplot()+
labs(title = 'Shannon',
x = 'Database',
y = 'Diversity') +
theme(legend.position = 'bottom')+
theme_grey(base_size = 16)
How do I keep this code "DRY" so that I don't need 146 lines of code to repeat the same things over and over? Thank you!!
You didn't provide a Minimal reproducible example, so this answer cannot guarantee correctness.
An important point to note is that you use rm(...), so this means some variables are only relevant within a certain scope. Therefore, encapsulate this scope into a function. This makes your code reusable and spares you the manual variable removal:
process <- function(file, DB){
# -> Use the function parameter `file` instead of a hardcoded filename
collated <- read.delim(file=file, check.names=FALSE);
collated <- tail(collated,5); collated <- collated[,-c(1:3)]
collated_reorder <- collated[,match(mapfile[,1], colnames(collated))]
labels <- t(mapfile)
colnames(collated_reorder) <- labels[2,]
mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
mean = as.matrix(mean); mean <- t(mean)
# -> rename this variable to a more general name, e.g. `result`
result <- as.data.frame(rbind(labels[2,],mean))
result <- t(result);
# -> Use the function parameter `DB` instead of a hardcoded string
DB_type <- list(DB = DB); DB_type <- rep(DB_type, 41)
result <- as.data.frame(cbind(DB_type,result))
colnames(result) <- c("DB","Group","value")
# -> After the end of this function, the variables defined in this function
# vanish automatically, you just need to specify the result
return(result)
}
Now you can reuse that block:
GG97_shannon <- process(file = "alpha_diversity/GG97_shannon.txt", DB = "GG97")
GG100_shannon <- process(file =...., DB = ....)
SILVA97_shannon <- ...
SILVA100_shannon <- ...
NCBI_shannon <- ...
RDP_shannon <- ...
Alternatively, you can use looping structures:
General-purpose for:
datasets <- c("GG97_shannon", "GG100_shannon", "SILVA97_shannon",
"SILVA100_shannon", "NCBI_shannon", "RDP_shannon")
files <- c("alpha_diversity/GG97_shannon.txt", .....)
DBs <- c("GG97", ....)
result <- list()
for(i in seq_along(datasets)){
result[[datasets[i]]] <- process(files[i], DBs[i])
}
mapply, a "specialized for" for looping over several vectors in parallel:
# the first argument is the function from above, the other ones are given as arguments
# to our process(.) function; SIMPLIFY = FALSE keeps each result as its own data frame
results <- mapply(process, files, DBs, SIMPLIFY = FALSE)
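The bash idiom of substituting `${i}` into a path maps onto building the file names with paste0, and the per-database results can then be stacked in one step. The sketch below assumes the file naming pattern shown in the question; `process_demo` is a hypothetical stand-in for process() that does not read any file, so the example runs anywhere.

```r
# hypothetical stand-in for process(): returns a small data frame per database
process_demo <- function(file, DB) {
  data.frame(DB = DB, file = file, value = nchar(file))
}

DBs <- c("GG97", "GG100", "SILVA97")
metric <- "shannon"

# build one path per database, like ${i} in the bash loop
files <- file.path("alpha_diversity", paste0(DBs, "_", metric, ".txt"))

# Map() always returns a list, one data frame per database
pieces <- Map(process_demo, files, DBs)

# stack all pieces into a single data frame
combined <- do.call(rbind, pieces)
rownames(combined) <- NULL
```

Wrapping the last few lines in a function of `metric` would cover the four metrics without copying the code.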

R Get objects from environment and feed to function

This is probably a pretty trivial question, but I'm much more used to python than to R (the fact I'm mostly a biologist might also play a role...)
What the code below does is plot the counts for each gene in the provided data in an independent panel, and rearrange the legends in order to have a single one for all the panels in the plot.
# function to rearrange plot legend, from here:
# http://rpubs.com/sjackman/grid_arrange_shared_legend. Give credit where credit is due ;)
grid_arrange_shared_legend <- function(...) {
plots <- list(...)
g <- ggplotGrob(plots[[1]] + theme(legend.position="bottom"))$grobs
legend <- g[[which(sapply(g, function(x) x$name) == "guide-box")]]
lheight <- sum(legend$height)
grid.arrange(
do.call(arrangeGrob, lapply(plots, function(x)
x + theme(legend.position="none"))),
legend,
ncol = 1,
heights = unit.c(unit(1, "npc") - lheight, lheight))
}
# make plot for the given gene and assign it to a named object
plot_genes <- function(gene, gID){
name<-paste0("plotted_counts_for_", gene)
counts = plotCounts("whatever") # get data using plotCounts from DESeq2 package. the gID is used in here
p <- ggplot(counts) # + a bunch of plotting aesthetics
return(assign(name, p, envir = .GlobalEnv)) # make plot available outside function. Probably I can also use parent.frame()
}
# call plot_genes() for each cluster of genes, adjust the legend for multiple plots with grid_arrange_shared_legend()
plot_cluster_count <- function(cluster,name) {
genes = as.vector(as.data.frame(cluster)$Symbol)
gIDs = as.vector(as.data.frame(cluster)$EMSEMBL)
pdf(paste0(name,"_counts.pdf"))
plt = lapply(seq_along(genes), function(x) plot_genes(genes[x], gIDs[x]))
grid_arrange_shared_legend(plotted_counts_for_gene1, plotted_counts_for_gene2, plotted_counts_for_gene3,plotted_counts_for_gene4)
dev.off()
}
# call the whole thing
plot_cluster_count(Cluster_1,"Cluster_1")
This code works.
The issue is that it works only when I explicitly hard-code the names of the plots as in grid_arrange_shared_legend(plotted_counts_for_gene1, plotted_counts_for_gene2, plotted_counts_for_gene3,plotted_counts_for_gene4).
However I have plenty of clusters to plot, with different number of genes with different names, so I need to automate the selection of objects to feed to grid_arrange_shared_legend().
I tried to play around with ls()/objects(), mget() and Google but I can't find a way to get it working, I always end up with
Error in plot_clone(plot) : attempt to apply non-function.
I traced the error back with options(error=recover) and indeed it comes from grid_arrange_shared_legend(), so to me it looks like I'm not able to feed the objects to the function.
The ultimate goal would be to be able to call plot_cluster_count() within a lapply() statement feeding a list of clusters to iterate through. This should result in one pdf per cluster, each containing one panel per gene.
PS
I'm aware that getting the object names from the environment is not the most elegant way to go, it just seemed more straightforward. Any alternative approach is more than welcome
Thanks!
One solution:
# mock data (3 "objects")
set.seed(1)
obj_1 <- list(var = sample(1:100, 1), name = "obj_1")
obj_2 <- list(var = sample(1:100, 1), name = "obj_2")
obj_3 <- list(var = sample(1:100, 1), name = "obj_3")
# any kind of function you want to apply
f1 <- function(obj, obj_name) {
print(paste(obj, " -- ", obj_name))
}
# Find all objects in your environment
list_obj <- ls(pattern = "obj_")
# Apply the previous function on this list
output <- lapply(list_obj, function(x) f1(get(x)$var, get(x)$name))
output
#> [[1]]
#> [1] "27 -- obj_1"
#>
#> [[2]]
#> [1] "38 -- obj_2"
#>
#> [[3]]
#> [1] "58 -- obj_3"
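An alternative that avoids looking objects up by name at all: let the per-gene function return its result, collect the results in a list with lapply, and hand the whole list to the `...` function with do.call. The sketch below uses strings instead of plots so it runs without ggplot2; `make_item` and `combine_all` are hypothetical stand-ins for plot_genes() and grid_arrange_shared_legend().

```r
# stand-in for grid_arrange_shared_legend(): accepts any number of objects via ...
combine_all <- function(...) {
  items <- list(...)
  paste(unlist(items), collapse = " | ")
}

# stand-in for plot_genes(): returns the object instead of assign()-ing it
make_item <- function(gene) {
  paste0("plotted_counts_for_", gene)
}

genes <- c("gene1", "gene2", "gene3")

# one element per gene, however many genes the cluster has
items <- lapply(genes, make_item)

# do.call() spreads the list elements out as separate arguments
combined <- do.call(combine_all, items)
```

If the objects already live in an environment, `mget(ls(pattern = "..."))` returns them as a named list that can be passed to do.call in the same way.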

Looping dataset R

I'm trying to make a loop to automate a lot of actions in R. The code I have looks like this:
datA <- droplevels(datSUM[datSUM$Conc=="a",])
datB <- droplevels(datSUM[datSUM$Conc=="b",])
datC <- droplevels(datSUM[datSUM$Conc=="c",])
datD <- droplevels(datSUM[datSUM$Conc=="d",])
datE <- droplevels(datSUM[datSUM$Conc=="e",])
datX <- droplevels(datSUM[datSUM$Conc=="x",])
datY <- droplevels(datSUM[datSUM$Conc=="y",])
datAf <- droplevels(datA[datA$Sex=="f",])
datAf1 <- droplevels(datAf[datAf$rep=="1",])
datAf2 <- droplevels(datAf[datAf$rep=="2",])
datAf3 <- droplevels(datAf[datAf$rep=="3",])
datAm <- droplevels(datA[datA$Sex=="m",])
datAm1 <- droplevels(datAm[datAm$rep=="1",])
datAm2 <- droplevels(datAm[datAm$rep=="2",])
datAm3 <- droplevels(datAm[datAm$rep=="3",])
So since I have to do this 7 times, it seems like making a loop for this operation is the best way to do it. Can someone help me make that? I'm new to R so please bear that in mind.
Well I will have a stab at this.
concs <- c(a='a',b='b',c='c',d='d',e='e',x='x',y='y')
sex <- c(m='m',f='f')
reps <- c(rep1='1',rep2='2',rep3='3')
# By using m='m' we can label the objects within the list, making it
# easier to navigate the final object, otherwise use:
# concs <- c('a','b','c','d','e','x','y')
# sex <- c('m','f')
# reps <- c('1','2','3')
dfs <- lapply(concs, function(x){
droplevels(datSUM[datSUM$Conc==x,])}
)
sdfs <- lapply(sex, function(x){
lapply(dfs, function(y){
droplevels(y[y$Sex==x,])}
)}
)
rsdfs <- lapply(reps, function(x){
lapply(sdfs, function(y){
lapply(y, function(z){
droplevels(z[z$rep==x,])}
)}
)}
)
There is probably a better way to do this, that may involve using more lapplys but I think this "should" do the trick.
The only downside to this method is that you will have to access certain objects with rsdfs[[1]][[1]][[1]] or rsdfs[['rep1']][['m']][['a']], etc.
And applying functions to these would in itself require a bunch of lapplys
Let me know if this helps.
This is one method to do so - I will work on a more elegant solution later.
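One possibly simpler route is a single split() on all three grouping columns, which gives a flat named list instead of nested ones. This is only a sketch on made-up data: the column names Conc, Sex and rep come from the question, while the toy `datSUM` and the `obs` and `value` columns are invented here.

```r
# toy data with the question's grouping columns
datSUM <- expand.grid(Conc = c("a", "b"),
                      Sex  = c("f", "m"),
                      rep  = c("1", "2", "3"),
                      obs  = 1:2)
datSUM$value <- seq_len(nrow(datSUM))

# one data frame per Conc/Sex/rep combination, named like "a.f.1";
# drop = TRUE skips combinations that do not occur in the data
groups <- split(datSUM, datSUM[c("Conc", "Sex", "rep")], drop = TRUE)

# remove unused factor levels in every piece, as droplevels() did in the question
groups <- lapply(groups, droplevels)

# access a single subset by its combined name
datAf1 <- groups[["a.f.1"]]
```

A function can then be applied to every subset with a single lapply over `groups`.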

How do I convert this for loop into something cooler like by in R

uniq <- unique(file[,12])
pdf("SKAT.pdf")
for(i in 1:length(uniq)) {
dat <- subset(file, file[,12] == uniq[i])
names <- paste("Sample_filtered_on_", uniq[i], sep="")
qq.chisq(-2*log(as.numeric(dat[,10])), df = 2, main = names, pvals = T,
sub=subtitle)
}
dev.off()
file[,12] is an integer so I convert it to a factor when I'm trying to run it with by instead of a for loop as follows:
pdf("SKAT.pdf")
by(file, as.factor(file[,12]), function(x) {
qq.chisq(-2*log(as.numeric(x[,10])), df = 2,
main = paste("Sample_filtered_on_", file[1,12], sep=""),
pvals = T, sub=subtitle)
})
dev.off()
It works fine to sort the data frame by this (now a factor) column. My problem is that for the plot title, I want to label it with the correct index from that column. This is easy to do in the for loop by uniq[i]. How do I do this in a by function?
Hope this makes sense.
A more vectorized (== cooler?) version would pull the common operations out of the loop and let R do the book-keeping about unique factor levels.
dat <- split(-2 * log(as.numeric(file[,10])), file[,12])
names(dat) <- paste0("IoOPanos_filtered_on_pc_", names(dat))
(paste0 is a convenience function for the common use case where normally one would use paste with the argument sep=""). The for loop is entirely appropriate when you're running it for its side effects (plotting pretty pictures) rather than trying to capture values for further computation; it's definitely un-cool to use T instead of TRUE, while seq_along(dat) means that your code won't produce unexpected results when length(dat) == 0.
pdf("SKAT.pdf")
for(i in seq_along(dat)) {
vals <- dat[[i]]
nm <- names(dat)[[i]]
qq.chisq(vals, main = nm, df = 2, pvals = TRUE, sub=subtitle)
}
dev.off()
If you did want to capture values, the basic observation is that your function takes 2 arguments that vary. So by or tapply or sapply or ... are not appropriate; each of these assumes that just a single argument is varying. Instead, use mapply or the comparable Map:
Map(qq.chisq, dat, main=names(dat),
MoreArgs=list(df=2, pvals=TRUE, sub=subtitle))
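To see the Map pattern run without the qq.chisq dependency, here is a small sketch on made-up data. `describe` is a hypothetical stand-in for qq.chisq that takes one varying data argument, one varying title and one fixed `df`; the toy data frame and its column names are assumptions.

```r
set.seed(42)
# toy stand-in for `file`: a p-value column and an integer grouping column
file <- data.frame(pval = runif(12), group = rep(c(1L, 2L, 3L), each = 4))

# split the transformed values by group; the list names are the group values
stat <- split(-2 * log(file$pval), file$group)
titles <- paste0("Sample_filtered_on_", names(stat))

# hypothetical stand-in for qq.chisq(): returns a label instead of plotting
describe <- function(x, main, df) {
  paste0(main, ": n = ", length(x), ", df = ", df)
}

# `stat` and `titles` vary together; `df` is the same for every call
out <- Map(describe, stat, main = titles, MoreArgs = list(df = 2))
```

The result is a list named after the groups, so `out[["2"]]` holds the value computed for group 2 with its matching title.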
