Related
I'm working on a plotting function for the likert data from a survey and I'm trying to optimize it to be as automated as possible since I have to make quite a lot of plots and make it as user-friendly as possible, but I'm having some problems and really need help finishing this function...
These are the data:
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
df1[colnames(df1)] <- lapply(df1[colnames(df1)], factor)
Columns A and B pertain to the "Technology" section of my survey, while C, D and E are in "Social".
I have transformed my data using the likertpackage and compiled them in a list to be more easily called in my function (don't know if it's the best way to go about it, I'm still quite new to R, so feel free to make suggestions even concerning this point):
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
So far, here is the function I could come up with (with great help of user #gaut):
mynames <- sapply(names(tbls), function(x) {
paste("How do they rank? -",gsub("\\.",": ",x))
})
myfilenames <- names(tbls)
plot_likert <- function(x, myname, myfilename){
p <- plot(likert(x),
type ="bar",center=3,
group.order=names(x))+
labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x)))+
guides(fill=guide_legend("Rank"))+
ggtitle(myname)
p
I then lapply the function to get a list of plots:
list_plots <- lapply(1:length(tbls),function(i) {
plot_likert(tbls[[i]], mynames[i], myfilenames[i])
})
And then save them all as .png
sapply(1:length(list_plots), function(i) ggsave(
filename = paste0("plots ",i,".png"),
plot = list_plots[[i]],
width = 15, height = 9
))
Now, there are 3 main things I want my function to do but don't really know how to approach:
1) Right now I can export all the plots in one batch, but I would also like to be able to export a single plot, for example obtaining the above graph by writing:
plot_likert(tbls$dummy1.no)
2) In my mind, my ideal plotting function would also take into account the sections of my data mentioned above, so that if I specify the section Technology only get a Likert plot considering only columns A and B, and specifying the subsample gets me the dummy. Like so:
plot_likert(section=Technology, subsample=dummy1.no)
3) As you maybe have already noted, I need the titles of the plot to be fully automatic, so that by changing section or subsample they too change accordingly.
Apologies for the long/intricate question but I've been stuck on this function for quite some time and really need help finalizing it. For any further clarification/info, do not hesitate to ask!
Thank you in advance for any advice!
There are many ways to get what you want. Essentially, you need to add a few arguments to your function.
I agree with Limey though (and of course Hadley) - generally better to have a few simple functions that do a little step and then you can collate everything in one bigger function.
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
## this can be shortened
df1 <- data.frame(lapply(df1, factor))
## the rest of dummy data creation probably too, but I won't dig too much into this now
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
library(ggplot2)
library(likert)
#> Loading required package: xtable
## no need for sapply, really!
mynames <- paste("How do they rank? -", gsub("\\.",": ",names(tbls)))
myfilenames <- names(tbls)
## defining arguments with NULL makes it possible to not specify it without giving it a value
plot_likert <- function(x, myname, myfilename, section = NULL, subsample = NULL){
## first take only the tbl of interest
if(!is.null(subsample)) x <- x[subsample]
## then filter for your section and subsample
if(!is.null(section)) x <- lapply(x, function(y) y[, section])
## you can run your lapply within the function -
## ideally make a separate funciton and call the smaller function in the bigger one
## use seq_along
lapply(seq_along(x), function(i) {
plot(likert(x[[i]]),
type ="bar",center=3,
group.order=names(x[[i]]))+
labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x)))+
guides(fill=guide_legend("Rank")) +
## programmatic title
ggtitle(names(x)[i])
})
}
## you need to pass character vectors to your arguments
patchwork::wrap_plots(plot_likert(tbls))
patchwork::wrap_plots(plot_likert(tbls, section = LETTERS[1:2], subsample = paste("dummy1", c("no", "yes"), sep = ".")))
Created on 2022-08-17 by the reprex package (v2.0.1)
I'm dealing with several outputs I obtain from QIIME, texts which I want to manipulate for obtaining boxplots. Every input is formatted in the same way, so the manipulation is always the same, but it changes the source name. For each input, I want to extract the last 5 rows, have a mean for each column/sample, associate the values to sample experimental labels (Group) taken from the mapfile and put them in the order I use for making a boxplot of all the 6 data obtained.
In bash, I do something like "for i in GG97 GG100 SILVA97 SILVA100 NCBI RDP; do cp ${i}/alpha/collated_alpha/chao1.txt alpha_tot/${i}_chao1.txt; done" to do a command various times changing the names in the code in an automatic way through ${i}.
I'm struggling to find a way to do the same with R. I thought creating a vector containing the names and then using a for cycle by moving the i with [1], [2] etc., but it doesn't work, it stops at the read.delim line not finding the file in the wd.
Here's the manipulation code I wrote. After the comment, it will repeat itself 6 times with the 6 databases I'm using (GG97 GG100 SILVA97 SILVA100 NCBI RDP).
PLUS, I repeat this process 4 times because I have 4 metrics to use (here I'm showing shannon, but I also have a copy of the code for chao1, observed_species and PD_whole_tree).
library(tidyverse)
library(labelled)
mapfile <- read.delim(file="mapfile_HC+BV.txt", check.names=FALSE);
mapfile <- mapfile[,c(1,4)]
colnames(mapfile) <- c("SampleID","Pathology_group")
#GG97
collated <- read.delim(file="alpha_diversity/GG97_shannon.txt", check.names=FALSE);
collated <- tail(collated,5); collated <- collated[,-c(1:3)]
collated_reorder <- collated[,match(mapfile[,1], colnames(collated))]
labels <- t(mapfile)
colnames(collated_reorder) <- labels[2,]
mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
mean = as.matrix(mean); mean <- t(mean)
GG97_shannon <- as.data.frame(rbind(labels[2,],mean))
GG97_shannon <- t(GG97_shannon);
DB_type <- list(DB = "GG97"); DB_type <- rep(DB_type, 41)
GG97_shannon <- as.data.frame(cbind(DB_type,GG97_shannon))
colnames(GG97_shannon) <- c("DB","Group","value")
rm(collated,collated_reorder,DB_type,labels,mean)
Here I paste all the outputs together, freeze the order and make the boxplot.
alpha_shannon <- as.data.frame(rbind(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon))
rownames(alpha_shannon) <- NULL
rm(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon)
alpha_shannon$Group = factor(alpha_shannon$Group, unique(alpha_shannon$Group))
alpha_shannon$DB = factor(alpha_shannon$DB, unique(alpha_shannon$DB))
library(ggplot2)
ggplot(data = alpha_shannon) +
aes(x = DB, y = value, colour = Group) +
geom_boxplot()+
labs(title = 'Shannon',
x = 'Database',
y = 'Diversity') +
theme(legend.position = 'bottom')+
theme_grey(base_size = 16)
How do I keep this code "DRY" and don't need 146 rows of code to repeat the same things over and over? Thank you!!
You didn't provide a Minimal reproducible example, so this answer cannot guarantee correctness.
An important point to note is that you use rm(...), so this means some variables are only relevant within a certain scope. Therefore, encapsulate this scope into a function. This makes your code reusable and spares you the manual variable removal:
process <- function(file, DB){
# -> Use the function parameter `file` instead of a hardcoded filename
collated <- read.delim(file=file, check.names=FALSE);
collated <- tail(collated,5); collated <- collated[,-c(1:3)]
collated_reorder <- collated[,match(mapfile[,1], colnames(collated))]
labels <- t(mapfile)
colnames(collated_reorder) <- labels[2,]
mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
mean = as.matrix(mean); mean <- t(mean)
# -> rename this variable to a more general name, e.g. `result`
result <- as.data.frame(rbind(labels[2,],mean))
result <- t(result);
# -> Use the function parameter `DB` instead of a hardcoded string
DB_type <- list(DB = DB); DB_type <- rep(DB_type, 41)
result <- as.data.frame(cbind(DB_type,result))
colnames(result) <- c("DB","Group","value")
# -> After the end of this function, the variables defined in this function
# vanish automatically, you just need to specify the result
return(result)
}
Now you can reuse that block:
GG97_shannon <- process(file = "alpha_diversity/GG97_shannon.txt", DB = "GG97")
GG100_shannon <- process(file =...., DB = ....)
SILVA97_shannon <- ...
SILVA100_shannon <- ...
NCBI_shannon <- ...
RDP_shannon <- ...
Alternatively, you can use looping structures:
General-purpose for:
datasets <- c("GG97_shannon", "GG100_shannon", "SILVA97_shannon",
"SILVA100_shannon", "NCBI_shannon", "RDP_shannon")
files <- c("alpha_diversity/GG97_shannon.txt", .....)
DBs <- c("GG97", ....)
result <- list()
for(i in seq_along(datasets)){
result[[datasets[i]]] <- process(files[i], DBs[i])
}
mapply, a "specialized for" for looping over several vectors in parallel:
# the first argument is the function from above, the other ones are given as arguments
# to our process(.) function
results <- mapply(process, files, DBs)
I am trying to assign a plot to a list that is named via variable (snm). My code snippet is all the variations that I tried to do to make it work. What other option am I missing? Thanks.
My goal is to loop over my graph assignments, using an IF statement to change the snm and a few other variables that I will use in the graphs.
for (x in seq(0,1)) {
if (x==0) {
snm="grad"
}
else if (x==1) {
snm="start"
}
assign(snm,list(),envir=.GlobalEnv) #works
assign(snm[[1]],ggplot(data=TDSF, aes(x=Graduation))+geom_histogram()+labs(title="A"),envir=.GlobalEnv) #works
assign(snm[[2]],ggplot(data=TDSF, aes(x=Graduation,weights=Donation))+geom_bar()+labs(title="B"),envir=.GlobalEnv) #fails "subscript out of bonds"
assign(snm[[3]],ggplot(data=TDSF, aes(x=State,weights=Donation))+geom_bar()+labs(title="B")+scale_y_sqrt(),envir=.GlobalEnv) #fails "subscript out of bonds"
grid.arrange(grad[[1]],grad[[2]],grad[[3]])
}
Partial solution based on #MrFlick and #hrbrmstr, but 1) I have to use do.call in the loop or I get the same graphs and 2) seeing mapply I feel that I should be able to use it, but cannot get it to work.
library(ggplot2)
library(gridExtra)
set.seed(1492)
TDSF <- data.frame(Graduation=sample(1950:2010, 30),
Donation=sample(300:10000, 30),
Start.Year=sample(1950:2010,30),
State=sample(state.abb,30,replace=TRUE))
plots <- list()
for (x in seq(0,1)) {
if (x==0) {
nm=quote(Graduation)
snm="grad"
}
else if (x==1) {
nm=quote(Start.Year)
snm="start"
}
plots[[snm]]<-list()
plots[[snm]][[1]] <- ggplot(data=TDSF, aes(x=eval(nm)))+geom_histogram()+labs(title=paste("Number of People per",snm,"Year"))
plots[[snm]][[2]] <- ggplot(data=TDSF, aes(x=eval(nm),weights=Donation))+geom_bar()+labs(title=paste("Donations by",snm,"Year"))
plots[[snm]][[3]] <- ggplot(data=TDSF, aes(x=State,weights=Donation))+geom_bar()+labs(title="Donations by State")+scale_y_sqrt()
}
do.call(grid.arrange,plots[["grad"]])
do.call(grid.arrange,plots[["start"]])
mapply attempt:
plot<-mapply(function(snm,nm) list(
{ggplot(data=TDSF, aes(x=nm))+geom_histogram()+labs(title=paste("Number of People per",snm,"Year"))},
{ggplot(data=TDSF, aes(x=nm,weights=Donation))+geom_bar()+labs(title=paste("Donations by",snm,"Year"))},
{ggplot(data=TDSF, aes(x=State,weights=Donation))+geom_bar()+labs(title="Donations by State")+scale_y_sqrt()}
), c("grad","start"),c("Graduation","Start.Year"),SIMPLIFY = FALSE)
do.call(grid.arrange,plot[["grad"]])
do.call(grid.arrange,plot[["start"]])
Despite a lengthy code snippet, your question really isn't clear. Are you trying to do something like this?
library(ggplot2)
library(gridExtra)
set.seed(1492)
TDSF <- data.frame(Graduation=sample(1950:2010, 30),
Donation=sample(300:10000, 30))
snm <- mapply(function(x, title) {
ggplot(TDSF, aes_(x=as.name(x))) +
geom_histogram() +
labs(title=title)
}, c("Graduation", "Donation"), c("A", "B"), SIMPLIFY=FALSE)
do.call(grid.arrange, snm)
What I wanted to do originally is not possible in that exact manner. However, based on the feedback I was able to create a list of lists of ggplots. This condenses my out down to 1 variable versus the 2 I was targeting. Instead of start[[1]] to plot the first graph I use plot[["start"]][[1]]. Makes sense when you read it, but was not how I expected. I come from VBA and python so R is new format.
Mapply is also incredibly more powerful/simple approach than I was originally doing, so thanks for that input.
library(ggplot2)
library(gridExtra)
set.seed(1492)
TDSF <- data.frame(Graduation=sample(1950:2010, 30),
Donation=sample(300:10000, 30),
Start.Year=sample(1950:2010,30),
State=sample(state.abb,30,replace=TRUE))
plot<-mapply(function(snm,nm) list(
{ggplot(data=TDSF, aes_q(x=as.name(nm)))+geom_histogram()+labs(title=paste("Number of People per",snm,"Year"))},
{ggplot(data=TDSF, aes_q(x=as.name(nm),weights=~Donation))+geom_bar()+labs(title=paste("Donations by",snm,"Year"))},
{ggplot(data=TDSF, aes(x=State,weights=Donation))+geom_bar()+labs(title="Donations by State")+scale_y_sqrt()}
), c("grad","start"),c("Graduation","Start.Year"),SIMPLIFY = FALSE)
do.call(grid.arrange,plot[["grad"]])
do.call(grid.arrange,plot[["start"]])
I'm trying to make a loop to automate a lot of actions in R. The code I have looks like this:
datA <- droplevels(datSUM[datSUM$Conc=="a",])
datB <- droplevels(datSUM[datSUM$Conc=="b",])
datC <- droplevels(datSUM[datSUM$Conc=="c",])
datD <- droplevels(datSUM[datSUM$Conc=="d",])
datE <- droplevels(datSUM[datSUM$Conc=="e",])
datX <- droplevels(datSUM[datSUM$Conc=="x",])
datY <- droplevels(datSUM[datSUM$Conc=="y",])
datAf <- droplevels(datA[datA$Sex=="f",])
datAf1 <- droplevels(datAf[datAf$rep=="1",])
datAf2 <- droplevels(datAf[datAf$rep=="2",])
datAf3 <- droplevels(datAf[datAf$rep=="3",])
datAm <- droplevels(datA[datA$Sex=="m",])
datAm1 <- droplevels(datAm[datAm$rep=="1",])
datAm2 <- droplevels(datAm[datAm$rep=="2",])
datAm3 <- droplevels(datAm[datAm$rep=="3",])
So since I have to do this 7 times, it seems like making a loop for this operation is the best way to do it. Can someone help me make that? I'm new to R so please bear that in mind.
Well I will have a stab at this.
concs <- c(a='a',b='b',c='c',d='d',e='e',x='x',y='y')
sex <- c(m='m',f='f')
reps <- c(rep1='1',rep2='2',rep3='3')
# By using m='m' we can label the objects within the list, making it
# easier to navigate the final object, otherwise use:
# concs <- c('a','b','c','d','e','x','y')
# sex <- c('m','f')
# reps <- c('1','2','3')
dfs <- lapply(concs, function(x){
droplevels(datSUM[datSUM$Conc==x,])}
)
sdfs <- lapply(sex, function(x){
lapply(dfs, function(y){
droplevels(y[y$Sex==x,])}
)}
)
rsdfs <- lapply(reps, function(x){
lapply(sdfs, function(y){
lapply(y, function(z){
droplevels(z[z$rep==x,])}
)}
)}
)
There is probably a better way to do this, that may involve using more lapplys but I think this "should" do the trick.
The only downside to this method you will have to access certain objects with rsdfs[[1]][[1]][[1]] or rsdfs[['rep1']][['m']][['a']] e.t.c
And applying functions to these would in itself require a bunch of lapplys
Let me know if this helps.
This is one method to do so - I will work on a more elegant solution later.
I'm writing a program in R and I need to select variables based in a particular value of one of the variable. The program is the next:
a1961 <- base[base[,5]==1961,]
a1962 <- base[base[,5]==1962,]
a1963 <- base[base[,5]==1963,]
a1964 <- base[base[,5]==1964,]
a1965 <- base[base[,5]==1965,]
a1966 <- base[base[,5]==1966,]
a1967 <- base[base[,5]==1967,]
a1968 <- base[base[,5]==1968,]
a1969 <- base[base[,5]==1969,]
a1970 <- base[base[,5]==1970,]
a1971 <- base[base[,5]==1971,]
a1972 <- base[base[,5]==1972,]
a1973 <- base[base[,5]==1973,]
a1974 <- base[base[,5]==1974,]
a1975 <- base[base[,5]==1975,]
a1976 <- base[base[,5]==1976,]
a1977 <- base[base[,5]==1977,]
a1978 <- base[base[,5]==1978,]
a1979 <- base[base[,5]==1979,]
a1980 <- base[base[,5]==1980,]
a1981 <- base[base[,5]==1981,]
a1982 <- base[base[,5]==1982,]
a1983 <- base[base[,5]==1983,]
a1984 <- base[base[,5]==1984,]
a1985 <- base[base[,5]==1985,]
a1986 <- base[base[,5]==1986,]
a1987 <- base[base[,5]==1987,]
a1988 <- base[base[,5]==1988,]
a1989 <- base[base[,5]==1989,]
...
a2012 <- base[base[,5]==2012,]
Is there a way (like modules in SAS) in which I can avoid writing the same thing over and over again?
In general, coding/implementation questions really belong on StackOverflow. That said, my recommendation is instead of naming individual variables for each result, just throw them all into a list:
a = lapply(1961:1989, function(x) base[base[,5]==x,]
You can also use the assign command.
years <- 1961:2012
for(i in 1:length(years)) {
assign(x = paste0("a", years[i]), value = base[base[,5]==years[i],])
}