I'm working on a plotting function for the Likert data from a survey. I have to make quite a lot of plots, so I'm trying to make the function as automated and user-friendly as possible, but I'm having some problems and really need help finishing it.
These are the data:
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
df1[colnames(df1)] <- lapply(df1[colnames(df1)], factor)
Columns A and B pertain to the "Technology" section of my survey, while C, D and E are in "Social".
I have transformed my data using the likert package and compiled them in a list so they can be called more easily in my function (I don't know if it's the best way to go about it; I'm still quite new to R, so feel free to make suggestions on this point too):
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
So far, here is the function I could come up with (with great help from user #gaut):
mynames <- sapply(names(tbls), function(x) {
paste("How do they rank? -",gsub("\\.",": ",x))
})
myfilenames <- names(tbls)
plot_likert <- function(x, myname, myfilename){
p <- plot(likert(x),
type ="bar",center=3,
group.order=names(x))+
labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x)))+
guides(fill=guide_legend("Rank"))+
ggtitle(myname)
p
}
I then lapply the function to get a list of plots:
list_plots <- lapply(1:length(tbls),function(i) {
plot_likert(tbls[[i]], mynames[i], myfilenames[i])
})
And then I save them all as .png:
sapply(1:length(list_plots), function(i) ggsave(
filename = paste0("plots ",i,".png"),
plot = list_plots[[i]],
width = 15, height = 9
))
Now, there are 3 main things I want my function to do but don't really know how to approach:
1) Right now I can export all the plots in one batch, but I would also like to be able to export a single plot, for example obtaining the above graph by writing:
plot_likert(tbls$dummy1.no)
2) In my mind, my ideal plotting function would also take into account the sections of my data mentioned above, so that if I specify the section Technology I only get a Likert plot for columns A and B, and specifying the subsample gets me the corresponding dummy subset. Like so:
plot_likert(section=Technology, subsample=dummy1.no)
3) As you may have already noticed, I need the titles of the plots to be fully automatic, so that they change accordingly when I change the section or subsample.
Apologies for the long/intricate question but I've been stuck on this function for quite some time and really need help finalizing it. For any further clarification/info, do not hesitate to ask!
Thank you in advance for any advice!
There are many ways to get what you want. Essentially, you need to add a few arguments to your function.
I agree with Limey though (and of course Hadley): it is generally better to have a few simple functions that each do one small step, and then collate everything in one bigger function.
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
## this can be shortened
df1 <- data.frame(lapply(df1, factor))
## the rest of dummy data creation probably too, but I won't dig too much into this now
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
library(ggplot2)
library(likert)
#> Loading required package: xtable
## no need for sapply, really!
mynames <- paste("How do they rank? -", gsub("\\.",": ",names(tbls)))
myfilenames <- names(tbls)
## giving an argument a NULL default makes it optional: the caller can simply leave it out
plot_likert <- function(x, myname, myfilename, section = NULL, subsample = NULL){
  ## first take only the tbls of interest
  if(!is.null(subsample)) x <- x[subsample]
  ## then keep only the columns of your section
  ## (drop = FALSE keeps a data frame even when section is a single column)
  if(!is.null(section)) x <- lapply(x, function(y) y[, section, drop = FALSE])
  ## you can run your lapply within the function -
  ## ideally make a separate function and call the smaller function in the bigger one
  ## use seq_along
  lapply(seq_along(x), function(i) {
    plot(likert(x[[i]]),
         type ="bar",center=3,
         group.order=names(x[[i]]))+
      ## x is a list here, so count the rows of the i-th table
      labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x[[i]])))+
      guides(fill=guide_legend("Rank")) +
      ## programmatic title
      ggtitle(names(x)[i])
  })
}
## you need to pass character vectors to your arguments
patchwork::wrap_plots(plot_likert(tbls))
patchwork::wrap_plots(plot_likert(tbls, section = LETTERS[1:2], subsample = paste("dummy1", c("no", "yes"), sep = ".")))
Created on 2022-08-17 by the reprex package (v2.0.1)
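To address the named sections ("Technology", "Social") and the automatic titles from the question, one option is a small preparation step that runs before plotting. This is only a sketch under assumptions: the lookup list `sections` and the helper `prepare_likert` are hypothetical names, the column-to-section mapping is taken from the question text, and the toy tables stand in for `tbls` (which holds factors in the real data).

```r
# Hypothetical lookup table: section name -> item columns (from the question text)
sections <- list(Technology = c("A", "B"), Social = c("C", "D", "E"))

# Hypothetical helper: pick one table, restrict it to a section, build a title
prepare_likert <- function(tbls, subsample, section = NULL, lookup = sections) {
  stopifnot(subsample %in% names(tbls))
  x <- tbls[[subsample]]
  if (!is.null(section)) {
    stopifnot(section %in% names(lookup))
    # drop = FALSE keeps a data frame even for a one-column section
    x <- x[, lookup[[section]], drop = FALSE]
  }
  # "dummy1.no" becomes "dummy1: no"
  title <- paste("How do they rank? -", sub("\\.", ": ", subsample))
  if (!is.null(section)) title <- paste0(title, " (", section, ")")
  list(data = x, title = title, n = nrow(x))
}

# Toy stand-in for `tbls`
toy_tbls <- list(
  dummy1.no  = data.frame(A = c(2, 3, 4), B = c(2, 3, 4), C = c(3, 3, 4),
                          D = c(5, 5, 5), E = c(2, 3, 4)),
  dummy1.yes = data.frame(A = c(1, 2), B = c(4, 4), C = c(3, 3),
                          D = c(1, 2), E = c(1, 4))
)

res <- prepare_likert(toy_tbls, subsample = "dummy1.no", section = "Technology")
```

`res$data` could then go to `plot(likert(...))`, `res$title` to `ggtitle()` and `res$n` to the subtitle, so the plotting function itself stays small.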
I have the following data as an example:
IID<-c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
KB<-c(345,1234,2000,1567,
376,1657,9656,234,
1865,1565,123,111,
1999,2032,1565,234)
data<-cbind(IID,KB)
I wrote a script to process it and give me some outcomes:
results_kb <- function(class) {
this_iids_roh <- dat[class,]
my_list<-c("Sum_long"=sum(this_iids_roh$KB[this_iids_roh$KB>=1500]),
"N_long"=length(this_iids_roh$KB[this_iids_roh$KB>=1500]),
"Sum_short"=sum(this_iids_roh$KB[this_iids_roh$KB<1500]),
"N_short"=length(this_iids_roh$KB[this_iids_roh$KB<1500]))
return(my_list)
}
dat <- data.table::as.data.table(data)
dat$IID<-as.factor(dat$IID)
data.table::setkey(dat,"IID")
results <- c()
nLevels <- length(levels(dat$IID))
start <- proc.time()
pb <- txtProgressBar(min = 0, max = nLevels, style = 3)
for (i in 1:nLevels){
this_iid <- levels(dat$IID)[i]
results <- rbind(results,results_kb(this_iid))
setTxtProgressBar(pb,i)
}
close(pb)
proc.time()-start
results<-data.frame(levels(dat$IID),results)
results$IID<-results$levels.dat.IID.
results[results==0] <- NA
I first created a function that gives me the outcome table, and then I processed the dataset I want to analyze. I added a timer because this script is meant to process very large samples. As you can verify, it works perfectly.
Now I want to write my own package, so I have to reduce all this into functions. The first part of the script is already in a function. For the second part I have tried the following:
rohsum<-function(data){
dat <- data.table::as.data.table(data)
dat$IID<-as.factor(dat$IID)
data.table::setkey(dat,"IID")
results <- c()
nLevels <- length(levels(dat$IID))
start <- proc.time()
pb <- txtProgressBar(min = 0, max = nLevels, style = 3)
for (i in 1:nLevels){
this_iid <- levels(dat$IID)[i]
results <- rbind(results,results_kb(this_iid))
setTxtProgressBar(pb,i)
}
close(pb)
proc.time()-start
results<-data.frame(levels(dat$IID),results)
results$IID<-results$levels.dat.IID.
results[results==0] <- NA
return(results)
}
However, this does not seem to work: when I run rohsum(data) I get the following error message:
Error in results_kb(this_iid) : object 'dat' not found
Moreover, if I run it several times I get the following:
How can I solve this issue to be able to build my own package?
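A likely cause of the error is R's lexical scoping: `results_kb` looks for `dat` in the environment where it was defined (the global environment), not inside `rohsum`, where `dat` is now a local variable. One possible way around this is to pass the table explicitly. The sketch below is an assumption-laden rewrite, not the original code: it uses base R subsetting instead of the data.table key lookup, adds a hypothetical `cutoff` argument, and leaves out the progress bar and timing.

```r
# Assumption: `dat` is passed in explicitly instead of being found globally
results_kb <- function(class, dat, cutoff = 1500) {
  kb <- dat$KB[dat$IID == class]
  c(Sum_long  = sum(kb[kb >= cutoff]),
    N_long    = sum(kb >= cutoff),
    Sum_short = sum(kb[kb < cutoff]),
    N_short   = sum(kb < cutoff))
}

rohsum <- function(data) {
  dat <- as.data.frame(data)
  ids <- sort(unique(dat$IID))
  # one named vector per IID, bound together once instead of growing with rbind
  rows <- lapply(ids, results_kb, dat = dat)
  results <- data.frame(IID = ids, do.call(rbind, rows))
  results[results == 0] <- NA
  results
}

# Example data from the question
IID <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
KB  <- c(345,1234,2000,1567,
         376,1657,9656,234,
         1865,1565,123,111,
         1999,2032,1565,234)
out <- rohsum(cbind(IID, KB))
```

Because both functions now depend only on their arguments, they should behave the same inside a package as at the console.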
I'm working with the raster package and I'm trying to switch to terra, but for reasons I don't understand, terra cannot reproduce the same operation as raster when working in parallel with packages such as snowfall and future.apply. Here is a reproducible example.
library(terra)
r <- rast()
r[] <- 1:ncell(r)
m <- rast()
m[] <- c(rep(1,ncell(m)/5),rep(2,ncell(m)/5),rep(3,ncell(m)/5),rep(4,ncell(m)/5),rep(5,ncell(m)/5))
ms <- separate(m,other=NA)
plot(ms)
mymask <- function(ind){
tipo <- tipo_tav[ind]
mask <- ms[[ind]]
masked <-
terra::mask(
r,
mask
)
richard <- function(x){
k <-0.2
v <-0.3
a <-200
y0 <-2
y <- k/v*x*(1-((x/a)^v))+y0
return(y)
}
pred <- richard(masked)
pred <- clamp(pred,lower=0)
return(pred)
}
#the sequential usage works fine, faster than the `raster` counterpart
system.time(x <- mymask(1))#0.03
#when I try to run my function in parallel I receive an error
library(future.apply) # also attaches future, which provides plan()
plan(multisession,workers=5)
system.time(pred_list <- future_lapply(1:5, FUN = mymask))
Error in .External(list(name = "CppMethod__invoke_notvoid", address = <pointer: (nil)>, :
NULL value as symbol address.
Exactly the same code works well if I replace rast with raster and terra::mask with raster::mask. See below:
library(raster)
r <- raster(r)
ms <- stack(ms)
mymask <- function(ind){
tipo <- tipo_tav[ind]
mask <- ms[[ind]]
masked <-
raster::mask(
r,
mask
)
richard <- function(x){
k <-0.2
v <-0.3
a <-200
y0 <-2
y <- k/v*x*(1-((x/a)^v))+y0
return(y)
}
pred <- richard(masked)
pred <- clamp(pred,lower=0)
return(pred)
}
#this works fine
system.time(x <- mymask(1))#0.06
#this works too
plan(multisession,workers=5)
system.time(pred_list <- future_lapply(1:5, FUN = mymask))#15.48
I get the same behavior if I use snowfall instead of future:
library(snowfall)
sfInit(parallel = TRUE, cpus =5)
sfLibrary(terra)
sfExportAll()
system.time(pred_list <- sfLapply(1:5, fun = mymask))
sfStop()
This returns the same error as future_lapply.
Why is this happening? I've never seen such an error. I was hoping to take advantage of the higher speed of terra, but now I'm stuck.
A SpatRaster cannot be serialized, so you cannot send it to parallel compute nodes. Have a look here for more discussion.
Instead you can (a) send and receive filenames; (b) parallelize your custom function that you supply to app or lapp; (c) use the cores=n argument (where available, e.g. app and predict); (d) use a mechanism like wrap; (e) send a filename and a vector to make a SpatExtent to process and create a virtual raster from the output tiles (see ?vrt).
For example, you could use a function like this (option "a"):
prich <- function(filein, fileout) {
r <- rast(filein)
richard <- function(x) {
k <-0.2
v <-0.3
a <-200
y0 <-2
y <- k/v*x*(1-((x/a)^v))+y0
y[y<0] <- 0
return(y)
}
x <- app(r, richard, filename=fileout, overwrite=TRUE) # r is the raster read from filein
return(TRUE)
}
I use app because it is much more efficient for large rasters, as it can avoid writing temp files for each of the 10 arithmetic operations with a SpatRaster. Given that you want to parallelize this relatively simple function, I assume the files are very large.
Or option "c":
richard <- function(x) {
k <-0.2
v <-0.3
a <-200
y0 <-2
y <- k/v*x*(1-((x/a)^v))+y0
y[y<0] <- 0
return(y)
}
x <- app(masked, richard, cores=12)
I did not include the masking in either case. You could include it in option "a", but mask is disk I/O intensive, not computationally intensive, so it might be just as efficient to do it in one step rather than in parallel.
With wrap you could do something like this:
f <- function(w) {
x <- rast(w)
y <- richard(x)
wrap(y)
}
r <- rast(nrow=10, ncol=10, vals=1:100)
x <- f(wrap(r))
x <- rast(x)
Where f would be run in parallel. That only works for small rasters, but you could parallelize over tiles, and you can create tiles with terra::makeTiles.
More internal parallelization options will be coming, but don't hold your breath.
I am trying to produce the same graph as in an example, but using different data. Here is my code:
library(SciViews)
args <- commandArgs(TRUE)
pdfname <- args[1]
datafile <- args[2]
pdf(pdfname)
eqdata = read.csv(datafile , header = T,sep=",")
(longley.cor <- correlation(eqdata$feqs))
# Synthetic view of the correlation matrix
summary(longley.cor)
p <- plot(longley.cor)
print(p)
dev.off()
And the data:
ques,feqs
"abc",20
"def",10
"ghi",40
"jkl",10
"mno",20
"pqr",10
I use this command:
Rscript ./rscript/correlation.R "/home/co.pdf" "/home/data_correlation.csv"
Code output:
I want to generate something like this:
You can try the plotcorr function in the ellipse package. The help page gives, among others, this example:
Which seems to be what you are looking for?
Edit:
You can add text afterwards; the circles are placed on an integer grid running from 1 to the number of variables. E.g.:
data(mtcars)
Corrmat <- cor(mtcars)
cols <- ifelse(Corrmat>0, rgb(0,0,abs(Corrmat)), rgb(abs(Corrmat),0,0))
library(ellipse)
plotcorr(Corrmat,col=cols)
n <- nrow(Corrmat)
for (i in 1:n)
{
for (j in 1:n)
{
text(j,i,round(Corrmat[n-i+1,j],2),col="white",cex=0.6)
}
}
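The index arithmetic in the loop above (row n - i + 1 of the matrix is drawn at height i, because the plot runs bottom to top) can also be written without loops. A minimal sketch, assuming a hypothetical helper name `label_grid` that is not part of the ellipse package:

```r
# Hypothetical helper: one row per cell with its plot position and label
label_grid <- function(cm, digits = 2) {
  n <- nrow(cm)
  grid <- expand.grid(i = seq_len(n), j = seq_len(n)) # i = height, j = horizontal position
  data.frame(
    x = grid$j,
    y = grid$i,
    # matrix row n - i + 1 sits at height i
    label = round(cm[cbind(n - grid$i + 1, grid$j)], digits)
  )
}

lab <- label_grid(cor(mtcars))
# After plotcorr(Corrmat, col = cols), a single vectorized call would add the labels:
# text(lab$x, lab$y, lab$label, col = "white", cex = 0.6)
```

Keeping the coordinates in a data frame also makes it easy to label only some cells, for example one triangle of the matrix.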