I'm working on a plotting function for the likert data from a survey and I'm trying to optimize it to be as automated as possible since I have to make quite a lot of plots and make it as user-friendly as possible, but I'm having some problems and really need help finishing this function...
These are the data:
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
df1[colnames(df1)] <- lapply(df1[colnames(df1)], factor)
Columns A and B pertain to the "Technology" section of my survey, while C, D and E are in "Social".
I have transformed my data using the likertpackage and compiled them in a list to be more easily called in my function (don't know if it's the best way to go about it, I'm still quite new to R, so feel free to make suggestions even concerning this point):
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
So far, here is the function I could come up with (with great help of user #gaut):
mynames <- sapply(names(tbls), function(x) {
paste("How do they rank? -",gsub("\\.",": ",x))
})
myfilenames <- names(tbls)
plot_likert <- function(x, myname, myfilename){
p <- plot(likert(x),
type ="bar",center=3,
group.order=names(x))+
labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x)))+
guides(fill=guide_legend("Rank"))+
ggtitle(myname)
p
I then lapply the function to get a list of plots:
list_plots <- lapply(1:length(tbls),function(i) {
plot_likert(tbls[[i]], mynames[i], myfilenames[i])
})
And then save them all as .png
sapply(1:length(list_plots), function(i) ggsave(
filename = paste0("plots ",i,".png"),
plot = list_plots[[i]],
width = 15, height = 9
))
Now, there are 3 main things I want my function to do but don't really know how to approach:
1) Right now I can export all the plots in one batch, but I would also like to be able to export a single plot, for example obtaining the above graph by writing:
plot_likert(tbls$dummy1.no)
2) In my mind, my ideal plotting function would also take into account the sections of my data mentioned above, so that if I specify the section Technology only get a Likert plot considering only columns A and B, and specifying the subsample gets me the dummy. Like so:
plot_likert(section=Technology, subsample=dummy1.no)
3) As you maybe have already noted, I need the titles of the plot to be fully automatic, so that by changing section or subsample they too change accordingly.
Apologies for the long/intricate question but I've been stuck on this function for quite some time and really need help finalizing it. For any further clarification/info, do not hesitate to ask!
Thank you in advance for any advice!
There are many ways to get what you want. Essentially, you need to add a few arguments to your function.
I agree with Limey though (and of course Hadley) - generally better to have a few simple functions that do a little step and then you can collate everything in one bigger function.
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
## this can be shortened
df1 <- data.frame(lapply(df1, factor))
## the rest of dummy data creation probably too, but I won't dig too much into this now
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
library(ggplot2)
library(likert)
#> Loading required package: xtable
## no need for sapply, really!
mynames <- paste("How do they rank? -", gsub("\\.",": ",names(tbls)))
myfilenames <- names(tbls)
## defining arguments with NULL makes it possible to not specify it without giving it a value
plot_likert <- function(x, myname, myfilename, section = NULL, subsample = NULL){
## first take only the tbl of interest
if(!is.null(subsample)) x <- x[subsample]
## then filter for your section and subsample
if(!is.null(section)) x <- lapply(x, function(y) y[, section])
## you can run your lapply within the function -
## ideally make a separate funciton and call the smaller function in the bigger one
## use seq_along
lapply(seq_along(x), function(i) {
plot(likert(x[[i]]),
type ="bar",center=3,
group.order=names(x[[i]]))+
labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x)))+
guides(fill=guide_legend("Rank")) +
## programmatic title
ggtitle(names(x)[i])
})
}
## you need to pass character vectors to your arguments
patchwork::wrap_plots(plot_likert(tbls))
patchwork::wrap_plots(plot_likert(tbls, section = LETTERS[1:2], subsample = paste("dummy1", c("no", "yes"), sep = ".")))
Created on 2022-08-17 by the reprex package (v2.0.1)
I'm dealing with several outputs I obtain from QIIME, texts which I want to manipulate for obtaining boxplots. Every input is formatted in the same way, so the manipulation is always the same, but it changes the source name. For each input, I want to extract the last 5 rows, have a mean for each column/sample, associate the values to sample experimental labels (Group) taken from the mapfile and put them in the order I use for making a boxplot of all the 6 data obtained.
In bash, I do something like "for i in GG97 GG100 SILVA97 SILVA100 NCBI RDP; do cp ${i}/alpha/collated_alpha/chao1.txt alpha_tot/${i}_chao1.txt; done" to do a command various times changing the names in the code in an automatic way through ${i}.
I'm struggling to find a way to do the same with R. I thought creating a vector containing the names and then using a for cycle by moving the i with [1], [2] etc., but it doesn't work, it stops at the read.delim line not finding the file in the wd.
Here's the manipulation code I wrote. After the comment, it will repeat itself 6 times with the 6 databases I'm using (GG97 GG100 SILVA97 SILVA100 NCBI RDP).
PLUS, I repeat this process 4 times because I have 4 metrics to use (here I'm showing shannon, but I also have a copy of the code for chao1, observed_species and PD_whole_tree).
library(tidyverse)
library(labelled)
mapfile <- read.delim(file="mapfile_HC+BV.txt", check.names=FALSE);
mapfile <- mapfile[,c(1,4)]
colnames(mapfile) <- c("SampleID","Pathology_group")
#GG97
collated <- read.delim(file="alpha_diversity/GG97_shannon.txt", check.names=FALSE);
collated <- tail(collated,5); collated <- collated[,-c(1:3)]
collated_reorder <- collated[,match(mapfile[,1], colnames(collated))]
labels <- t(mapfile)
colnames(collated_reorder) <- labels[2,]
mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
mean = as.matrix(mean); mean <- t(mean)
GG97_shannon <- as.data.frame(rbind(labels[2,],mean))
GG97_shannon <- t(GG97_shannon);
DB_type <- list(DB = "GG97"); DB_type <- rep(DB_type, 41)
GG97_shannon <- as.data.frame(cbind(DB_type,GG97_shannon))
colnames(GG97_shannon) <- c("DB","Group","value")
rm(collated,collated_reorder,DB_type,labels,mean)
Here I paste all the outputs together, freeze the order and make the boxplot.
alpha_shannon <- as.data.frame(rbind(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon))
rownames(alpha_shannon) <- NULL
rm(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon)
alpha_shannon$Group = factor(alpha_shannon$Group, unique(alpha_shannon$Group))
alpha_shannon$DB = factor(alpha_shannon$DB, unique(alpha_shannon$DB))
library(ggplot2)
ggplot(data = alpha_shannon) +
aes(x = DB, y = value, colour = Group) +
geom_boxplot()+
labs(title = 'Shannon',
x = 'Database',
y = 'Diversity') +
theme(legend.position = 'bottom')+
theme_grey(base_size = 16)
How do I keep this code "DRY" and don't need 146 rows of code to repeat the same things over and over? Thank you!!
You didn't provide a Minimal reproducible example, so this answer cannot guarantee correctness.
An important point to note is that you use rm(...), so this means some variables are only relevant within a certain scope. Therefore, encapsulate this scope into a function. This makes your code reusable and spares you the manual variable removal:
process <- function(file, DB){
# -> Use the function parameter `file` instead of a hardcoded filename
collated <- read.delim(file=file, check.names=FALSE);
collated <- tail(collated,5); collated <- collated[,-c(1:3)]
collated_reorder <- collated[,match(mapfile[,1], colnames(collated))]
labels <- t(mapfile)
colnames(collated_reorder) <- labels[2,]
mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
mean = as.matrix(mean); mean <- t(mean)
# -> rename this variable to a more general name, e.g. `result`
result <- as.data.frame(rbind(labels[2,],mean))
result <- t(result);
# -> Use the function parameter `DB` instead of a hardcoded string
DB_type <- list(DB = DB); DB_type <- rep(DB_type, 41)
result <- as.data.frame(cbind(DB_type,result))
colnames(result) <- c("DB","Group","value")
# -> After the end of this function, the variables defined in this function
# vanish automatically, you just need to specify the result
return(result)
}
Now you can reuse that block:
GG97_shannon <- process(file = "alpha_diversity/GG97_shannon.txt", DB = "GG97")
GG100_shannon <- process(file =...., DB = ....)
SILVA97_shannon <- ...
SILVA100_shannon <- ...
NCBI_shannon <- ...
RDP_shannon <- ...
Alternatively, you can use looping structures:
General-purpose for:
datasets <- c("GG97_shannon", "GG100_shannon", "SILVA97_shannon",
"SILVA100_shannon", "NCBI_shannon", "RDP_shannon")
files <- c("alpha_diversity/GG97_shannon.txt", .....)
DBs <- c("GG97", ....)
result <- list()
for(i in seq_along(datasets)){
result[[datasets[i]]] <- process(files[i], DBs[i])
}
mapply, a "specialized for" for looping over several vectors in parallel:
# the first argument is the function from above, the other ones are given as arguments
# to our process(.) function
results <- mapply(process, files, DBs)
This is probably a pretty trivial question, but I'm much more used to python than to R (the fact I'm mostly a biologist might also play a role...)
What the code below does is plot the counts for each gene in the provided data in an independent panel, and rearrange the legends in order to have a single one for all the panels in the plot.
# function to rearrange plot legend, from here:
# http://rpubs.com/sjackman/grid_arrange_shared_legend. Give credit where credit is due ;)
grid_arrange_shared_legend <- function(...) {
plots <- list(...)
g <- ggplotGrob(plots[[1]] + theme(legend.position="bottom"))$grobs
legend <- g[[which(sapply(g, function(x) x$name) == "guide-box")]]
lheight <- sum(legend$height)
grid.arrange(
do.call(arrangeGrob, lapply(plots, function(x)
x + theme(legend.position="none"))),
legend,
ncol = 1,
heights = unit.c(unit(1, "npc") - lheight, lheight))
}
# make plot for the given gene and assign it to a named object
plot_genes <- function(gene, gID){
name<-paste0("plotted_counts_for_", gene)
counts = plotCounts("whatever") # get data using plotCounts from DESeq2 package. the gID is used in here
return(assign(name, ggplot(counts,
# + a bunch of plotting aestetics
envir = .GlobalEnv)) #make plot available outside function. Probably I can also use parent.frame()
}
# call plot_genes() for each cluster of genes, adjust the legend for multiple plots with grid_arrange_shared_legend()
plot_cluster_count <- function(cluster,name) {
genes = as.vector(as.data.frame(cluster)$Symbol)
gIDs = as.vector(as.data.frame(cluster)$EMSEMBL)
pdf(paste0(name,"_counts.pdf"))
plt = lapply(seq_along(genes), function(x) plot_genes(genes[x], gIDs[x]))
grid_arrange_shared_legend(plotted_counts_for_gene1, plotted_counts_for_gene2, plotted_counts_for_gene3,plotted_counts_for_gene4)
dev.off()
}
# call the whole thing
plot_cluster_count(Cluster_1,"Cluster_1")
This code works.
The issue is that it works only when I explicitely hard-code the names of the plots as in grid_arrange_shared_legend(plotted_counts_for_gene1, plotted_counts_for_gene2, plotted_counts_for_gene3,plotted_counts_for_gene4).
However I have plenty of clusters to plot, with different number of genes with different names, so I need to automate the selection of objects to feed to grid_arrange_shared_legend().
I tried to play around with ls()/objects(), mget() and Google but I can't find a way to get it working, I always end up with
Error in plot_clone(plot) : attempt to apply non-function.
I traced the error back with options(error=recover) and indeed it comes from grid_arrange_shared_legend(), so to me it looks like I'm not able to feed the objects to the function.
The ultimate goal would be to be able to call plot_cluster_count() within a lapply() statement feeding a list of clusters to iterate through. This should result in one pdf per cluster, each containing one panel per gene.
PS
I'm aware that getting the object names from the environment is not the most elegant way to go, it just seemed more straightforward. Any alternative approach is more than welcome
Thanks!
One solution:
# mock data (3 "objects")
set.seed(1)
obj_1 <- list(var = sample(1:100, 1), name = "obj_1")
obj_2 <- list(var = sample(1:100, 1), name = "obj_2")
obj_3 <- list(var = sample(1:100, 1), name = "obj_3")
# any kind of function you want to apply
f1 <- function(obj, obj_name) {
print(paste(obj, " -- ", obj_name))
}
# Find all objects in your environment
list_obj <- ls(pattern = "obj_")
# Apply the previous function on this list
output <- lapply(list_obj, function(x) f1(get(x)$var, get(x)$name))
output
#> [[1]]
#> [1] "27 -- obj_1"
#>
#> [[2]]
#> [1] "38 -- obj_2"
#>
#> [[3]]
#> [1] "58 -- obj_3"
In the R environment, I have already have some variable, their name:
id_01_r
id_02_l
id_05_l
id_06_r
id_07_l
id_09_1
id_11_l
So, their pattern seems like id_ and follows two figures, then _ and r or l randomly.
Each of them corresponds to one frame but different dim() output.
Also, there are some other variables in the environment, so first I should extract these frames. For this, I'm going to adopt:
> a <- list(ls()[grep("id*",ls())])` #a little sample for just id* I know
But, this function put them as one element, so I don't think it's good way
> length(a) [1] 1
I know how to read them in like below, but now for extact and same processes, I'm so confused.
i_set <- Sys.glob(paths='mypath/////id*.txt')
for (i in i_set) {
assign(substring(i, startx, endx),read.table(file=i,header=F))
}
Here, the key point is I want to do a series of same data processing for each of these frames. But based on these, what can I do instead of one by one?
Thanks your kind consideration.
Here is an example:
id_01_r <- iris
id_02_l <- mtcars
foo <- 42
vars <- grep("^id_\\d{2}_[rl]$", ls(), value = TRUE)
# [1] "id_01_r" "id_02_l"
process_data <- function(df) {
dim(df)
}
processed_data <- lapply(
mget(vars),
process_data
)
# $id_01_r
# [1] 150 5
#
# $id_02_l
# [1] 32 11
I would like to rename a bunch of data frames with name function but not able to use lapply or loop.
I have group of data frames name qcew.2007, qcew.2014, etc... I have vector with name I would like all of data frame to have. They are all the same. the vector is name colnm:
colnm = c("area_fips" , "own_code", "industry_code", "agglvl_code") # example shortened
# groups has names of all data frames and goes to 2013
group =c("qcew.2007", "qcew.2008", "qcew.2009")
# using lapply
names <- lapply(group, function(d){
n = paste0(d)
names(n) = colnm
})
# using loop does not work either
for (i in seq(group)) {
names(group[[i]]) = colnm
}
Neither option works, as it is saying I am comparing vectors with uneven lengths. I must be missing something obvious. Thanks
Here you go. You need to use get otherwise you're assigning names to the character vectors in group:
# sample data
qcew.2007 <- data.frame(a=1, b=2, c=3, d=4)
qcew.2008 <- data.frame(a=3, b=4, c=5, d=6)
qcew.2009 <- data.frame(a=5, b=6, c=7, d=8)
for(i in 1:3)
assign(group[i], `names<-`(get(group[i]), colnm))
names(qcew.2007)
# [1] "area_fips" "own_code" "industry_code" "agglvl_code"
names(qcew.2008)
# [1] "area_fips" "own_code" "industry_code" "agglvl_code"
names(qcew.2009)
# [1] "area_fips" "own_code" "industry_code" "agglvl_code"
Here you use get to get the object named in each position in group and then use assign to reassign the modified object (modified by changing column names) back into that named object.
Also:
list2env(lapply(mget(group), setNames, colnm),envir=.GlobalEnv)
names(qcew.2007)
#[1] "area_fips" "own_code" "industry_code" "agglvl_code"