In order to streamline future data analysis, I'm trying to write a script that will identify the different self-report scales included in a data.frame and perform routine analyses on each scale's items. Currently, I want it to identify which scales are present, find the responses for each scale's items, and then calculate Cronbach's alpha for each scale.
Everything seems to be working, except that when I run the function that should produce a list of alpha() outputs for each scale, I get the following error:
> Cronbach.Alphas(scales.data, scale.names)
Error in alpha(data[, responses[[i]]]) :
Data must either be a data frame or a matrix
Obviously, I know this is saying that the object being passed to the alpha() function is not a data.frame or matrix. What confuses me, though, is that when I do these calculations manually, step by step, outside of my Cronbach.Alphas() function, R clearly tells me the object is a data.frame, and everything works like a charm:
> class(scales.data[,responses[[1]]])
[1] "data.frame"
This is driving me crazy, and I'd be extremely appreciative of any help figuring it out. My full code is pasted below. (Note: I'm pretty new to writing functions in R, so the way I'm doing things is probably not optimal. Any additional advice is welcome as well.)
Also, it might help to mention that my code identifies scale names by the presence of an underscore in a column name. That is, "rsq_12" marks the column as holding responses to item 12 of the rsq scale.
require(psych)
##### Function for identifying names of scales present in the data file #####
GetScales <- function(x) {
  # Capture everything before the first underscore in each column name
  find.scale.names <- regexec("^(([^_]+)_)", colnames(x))
  # Element 3 of each match is the scale name without the trailing underscore
  scales <- do.call(rbind, lapply(regmatches(colnames(x), find.scale.names), `[`, 3L))
  colnames(scales) <- "scale"
  # Flag columns whose names contain no underscore (no scale) and drop them
  na.find <- ifelse(is.na(scales[, 1]), 0, 1)
  scales <- cbind(scales, na.find)
  output <- scales[scales[, 2] == 1, ]
  output[, 1]
}
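For illustration, a quick hypothetical check of GetScales() on a toy data frame (the column names here are invented, following the underscore convention):
toy <- data.frame(rsq_1 = 1:3, rsq_2 = 3:1, bfi_1 = 2:4, id = 1:3)
GetScales(toy)
# [1] "rsq" "rsq" "bfi"   ('id' has no underscore, so it is dropped)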
##### Function for calculating cronbach's alpha for each scale #####
Cronbach.Alphas <- function(data, scales){
  for(i in 1:length(scales)){
    if(i == 1) {
      # Columns whose names contain the scale name (note: grep matches
      # substrings, so scale names that are prefixes of one another may collide)
      responses <- list(grep(scales[i], colnames(data)))
      alphas <- list(alpha(data[, responses[[i]]]))
    } else {
      responses <- append(responses, list(grep(scales[i], colnames(data))))
      alphas <- append(alphas, list(alpha(data[, responses[[i]]])))
    }
  }
  return(alphas)
}
### Import data from .csv file ###
scales.data <- read.csv(file.choose())  # read.csv already returns a data.frame
### Identify each item's scale ###
scale.items <- GetScales(scales.data)
### Reduce to names of scales ###
scale.names <- unique(scale.items)
scale.names
### Calculate list of alphas ###
Cronbach.Alphas(scales.data, scale.names)
Thank you to anyone who has taken the time to look over my code. I appreciate your help. I was working through the suggestions left here when I realized a simple mistake on my part...
One of the scales in the dataset that I've been using as a test while working on this script had only one item. Thus, data[, responses[[i]]] in my Cronbach.Alphas() function was passing a vector (rather than a data.frame or matrix) to the alpha() function at that point in the for loop, because single-bracket subsetting drops a one-column result to a bare vector. It is in any case impossible to calculate Cronbach's alpha for a single-item scale, because it is an index of inter-item reliability...
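A minimal illustration of that dropping behavior (toy data; passing drop = FALSE would be an alternative guard, though alpha() still needs more than one item):
df <- data.frame(a = 1:3, b = 4:6)
class(df[, 1:2])              # "data.frame"
class(df[, 1])                # "integer" -- a single column drops to a bare vector
class(df[, 1, drop = FALSE])  # "data.frame" -- drop = FALSE keeps the frame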
Sooooo, all my code needed was a way to identify scales with just one item:
Cronbach.Alphas <- function(data, scales){
  # starting from empty lists also avoids an undefined 'alphas' when the
  # very first scale turns out to have a single item
  responses <- list()
  alphas <- list()
  for(i in 1:length(scales)){
    responses[[i]] <- grep(scales[i], colnames(data))
    # alpha() needs at least two items; a single matching column would also
    # be passed to it as a bare vector rather than a data.frame
    if(length(responses[[i]]) > 1){
      alphas <- append(alphas, list(alpha(data[, responses[[i]]])))
    }
  }
  return(alphas)
}
Sorry for wasting anyone's time with my mistake. On the plus side, by substituting this new Cronbach.Alphas() function into the script above, I've now posted a script that will automatically identify scales and produce a list of Cronbach's alphas (provided the columns are named with an underscore after the scale name) for anyone who might be interested. Thanks again!
I am running a function that returns a custom ggplot from input data (it is in fact a plot with several layers on it). I run the function over several different input datasets and obtain a list of ggplots.
I want to create a grid with these plots to compare them, but they all have different y axes.
I guess what I have to do is extract the maximum and minimum y-axis limits from the ggplot list and apply those to each plot in the list.
How can I do that? I guess it's through the use of ggplot_build. Something like this:
test = ggplot_build(plot_list[[1]])
> test$layout$panel_scales_x
[[1]]
<ScaleContinuousPosition>
Range:
Limits: 0 -- 1
I am not familiar with the structure of a ggplot_build object, and maybe this one in particular is not a standard one, as it comes from a "custom" ggplot.
For reference, these plots are created with the gseaplot2 function from the enrichplot package.
I don't know how to "upload" an R object, but if that would help, let me know how to do it.
Thanks!
Edit after comments (thanks for your suggestions!)
Here is an example of a gseaplot2 plot [figure omitted]. GSEA stands for Gene Set Enrichment Analysis, a technique used in genomic studies. The gseaplot2 function calculates a running average and then plots it, along with a bar plot underneath.
And here is the grid I create to compare the plots generated from different data [figure omitted].
I would like to have a common scale for the "Running Enrichment Score" part.
I guess I could try to recreate the gseaplot2 function, feed it all of the datasets, and then create the grid with facet_wrap, but I was wondering if there was an easy way of extracting parameters from a plot list.
As a reproducible example (from the enrichplot package):
library(clusterProfiler)
library(enrichplot)  # for GSEA plotting via gseaplot2
library(dplyr)       # for %>% and select()
library(tidyr)

data(geneList, package = "DOSE")
gene <- names(geneList)[abs(geneList) > 2]
wpgmtfile <- system.file("extdata/wikipathways-20180810-gmt-Homo_sapiens.gmt", package = "clusterProfiler")
wp2gene <- read.gmt(wpgmtfile)
wp2gene <- wp2gene %>% tidyr::separate(term, c("name", "version", "wpid", "org"), "%")
wpid2gene <- wp2gene %>% dplyr::select(wpid, gene)  # TERM2GENE
wpid2name <- wp2gene %>% dplyr::select(wpid, name)  # TERM2NAME
ewp2 <- GSEA(geneList, TERM2GENE = wpid2gene, TERM2NAME = wpid2name, verbose = FALSE)
gseaplot2(ewp2, geneSetID = 1, subplots = 1:2)
And this is how I generate the plot list (there is probably a much more elegant way):
library(ggpubr)  # for ggarrange()

plot_list <- list()
for (i in 1:3) {
  plot_list[[i]] <- gseaplot2(ewp2, geneSetID = i, subplots = 1:2)
}
ggarrange(plotlist = plot_list)
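One possible approach, sketched under assumptions: common_ylim() below is a hypothetical helper that pulls each plot's computed y range out of ggplot_build() (via layout$panel_params) and re-imposes the shared range with coord_cartesian(). It assumes each list element behaves like a single-panel ggplot; since gseaplot2 returns a composite object, you may need to extract the running-score subplot first.
library(ggplot2)

common_ylim <- function(plots) {
  # each built plot stores its panel's y range in layout$panel_params
  ranges <- lapply(plots, function(p) ggplot_build(p)$layout$panel_params[[1]]$y.range)
  lims <- range(unlist(ranges))
  # coord_cartesian() zooms without dropping data, unlike ylim()
  lapply(plots, function(p) p + coord_cartesian(ylim = lims))
}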
In the following reproducible example I try to create a function for a ggplot distribution plot and save it as an R object, with the intention of displaying two plots in a grid.
library(ggplot2)
library(gridExtra)  # for grid.arrange() later on

ggplothist <- function(dat, var1)
{
  # allow the variable to be passed by name or by position
  if (is.character(var1)) {
    var1 <- which(names(dat) == var1)
  }
  distribution <- ggplot(data = dat, aes(dat[, var1]))
  distribution <- distribution + geom_histogram(aes(y = ..density..), binwidth = 0.1, colour = "black", fill = "white")
  output <- list(distribution, var1, dat)
  return(output)
}
Call to function:
set.seed(100)
df <- data.frame(x = rnorm(100, mean = 10), y = rep(1, 100))
output1 <- ggplothist(dat=df,var1='x')
output1[1]
All fine until now.
Then I want to make a second plot (note mean = 100 instead of the previous 10):
df2 <- data.frame(x = rep(1, 1000), y = rnorm(1000, mean = 100))
output2 <- ggplothist(dat=df2,var1='y')
output2[1]
Then I try to replot the first distribution, with mean 10:
output1[1]
I get the same distribution as the second plot (mean = 100), not the one I originally drew.
If, however, I take the information returned by the function and reset it as global variables, it works:
var1 <- as.numeric(output1[2]); dat <- as.data.frame(output1[3]); p1 <- output1[1]
p1
If anyone can explain why this happens, I would like to know. It seems that in order to draw the intended distribution I have to reset the data frame and variable to what was used to draw the plot. Is there a way to save the plot as an object without having to do this? Luckily I can replot the first distribution,
but I can't plot them both at the same time:
var1 <- as.numeric(output2[2]); dat <- as.data.frame(output2[3]); p2 <- output2[1]
grid.arrange(p1, p2)
ERROR: Error in gList(list(list(data = list(x = c(9.66707664902549, 11.3631137069225, :
only 'grobs' allowed in "gList"
In this" Grid of multiple ggplot2 plots which have been made in a for loop " answer is suggested to use a list for containing the plots
ggplothist <- function(dat, var1)
{
  if (is.character(var1)) {
    var1 <- which(names(dat) == var1)
  }
  distribution <- ggplot(data = dat, aes(dat[, var1]))
  distribution <- distribution + geom_histogram(aes(y = ..density..), binwidth = 0.1, colour = "black", fill = "white")
  plot(distribution)
  pltlist <- list()
  pltlist[["plot"]] <- distribution
  output <- list(pltlist, var1, dat)
  return(output)
}
output1 <- ggplothist(dat = df, var1 = 'x')
p1 <- output1[1]
output2 <- ggplothist(dat = df2, var1 = 'y')
p2 <- output2[1]
output1[1]
will again produce the distribution with mean = 100 instead of mean = 10,
and:
grid.arrange(p1,p2)
will produce the same error:
Error in gList(list(list(plot = list(data = list(x = c(9.66707664902549, :
only 'grobs' allowed in "gList"
As a last attempt, I tried to use recordPlot() to record everything about the plot into an object. The following is now inside the function:
ggplothist <- function(dat, var1)
{
  if (is.character(var1)) {
    var1 <- which(names(dat) == var1)
  }
  distribution <- ggplot(data = dat, aes(dat[, var1]))
  distribution <- distribution + geom_histogram(aes(y = ..density..), binwidth = 0.1, colour = "black", fill = "white")
  plot(distribution)
  distribution <- recordPlot()
  output <- list(distribution, var1, dat)
  return(output)
}
This function produces the same errors as before, still depends on resetting the dat and var1 variables to what is needed for drawing the distribution, and similarly can't be put inside a grid.
I've tried similar things with arrangeGrob(), as in the question "R saving multiple ggplot2 plots as R-object in list and re-displaying in grid", but with no luck.
I would really like a solution that creates an R object containing the plot, one that can be redrawn by itself and used inside a grid without having to reset the variables that were used to draw it. I would also like to understand why this happens, as I don't consider it intuitive at all.
The only solution I can think of is to save the plot to a png file somewhere and have the function return the path so the file can be reused. Is that what other people are doing?
Thanks for reading, and sorry for the long question.
Found a solution
How can I reference the local environment within a function, in R?
by inserting
localenv <- environment()
and referencing that in the ggplot call:
distribution <- ggplot(data = dat, aes(dat[, var1]), environment = localenv)
made it all work, even with grid.arrange!
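For completeness, a sketch of an alternative that sidesteps the environment capture entirely: map the column by name inside aes() with the .data pronoun (ggplot2 >= 3.0; after_stat() needs >= 3.3), so each plot carries its own data and nothing is looked up globally at draw time. The function name and the character-only interface are my own simplifications:
library(ggplot2)
library(gridExtra)

ggplothist2 <- function(dat, var1) {
  # .data refers to the plot's own data, so no external variables are
  # consulted when the plot is later redrawn or arranged in a grid
  ggplot(dat, aes(x = .data[[var1]])) +
    geom_histogram(aes(y = after_stat(density)),
                   binwidth = 0.1, colour = "black", fill = "white")
}

p1 <- ggplothist2(df, "x")
p2 <- ggplothist2(df2, "y")
grid.arrange(p1, p2)  # both plots keep their intended distributions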
Basically, I have a data frame with three numeric vectors (x, y, z), and let's say I want to make a scatter plot of x and y colored by z. I want to transform the color scale with a square root that respects sign, so I made my own with trans_new. Here is a simple dataset, but with the actual transform:
library(ggplot2)
library(scales)

set.seed(1)
plot <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))

super_trans <- function(){
  trans_new('super',
            function(X) sapply(X, function(x) { if (x > 0) { x^0.5 } else { -(-x)^0.5 } }),
            function(X) sapply(X, function(x) { if (x > 0) { x^2 } else { -x^2 } }))
}

ggplot(plot, aes(x, y)) + geom_point(aes(colour = z)) + scale_colour_gradient(trans = "super")
It gives an error:
Error in if (x > 0) { : missing value where TRUE/FALSE needed
I don't understand it. I tried to trace the mistake, and my guess is that the error happens when trans_new tries to compute breaks.
However, I do not understand how the "breaks" parameter in trans_new works.
Is there a ggplot2/scales hero out there who can help me transform my color scale correctly?
It may be relevant that only some datasets give the error.
There is a vectorized if, called ifelse. It also seems you are missing an extra minus.
super_trans <- function() {
  trans_new('super',
            function(x) ifelse(x > 0, x^0.5, -(-x)^0.5),
            function(x) ifelse(x > 0, x^2, -(-x)^2))
}
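The NA point is the crux: when scales computes breaks it can push NA through the transform, and a scalar if() stops on NA, while ifelse() works element-wise. A minimal illustration:
x <- c(-4, NA, 9)
ifelse(x > 0, x^0.5, -(-x)^0.5)  # -2 NA 3
# (a harmless NaN warning can appear, since ifelse evaluates both branches
#  over the whole vector; if (x > 0) would instead stop with
#  "missing value where TRUE/FALSE needed")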
I use heatmap.2 from gplots to make a heatmap:
library(gplots)
# some fake data
m = matrix(c(0,1,2,3), nrow=2, ncol=2)
# make heatmap
hm = heatmap.2(m)
When I call heatmap.2 directly I get a plot that I can output to a device. How can I make the plot again from my variable hm? Obviously this is a toy example; in real life I have a function that generates and returns a heatmap, which I would like to plot later.
There are several alternatives, although none of them is particularly elegant. It depends on whether the variables used by your function are available in the plotting environment. heatmap.2 doesn't return a proper "heatmap" object, but the return value contains the necessary information for plotting the graphics again. See str(hm) to inspect the object.
If the variables are available in your environment, you could just re-evaluate the original plotting call:
library(gplots)
# some fake data (adjusted a bit)
set.seed(1)
m = matrix(rnorm(100), nrow=10, ncol=10)
# make heatmap
hm = heatmap.2(m, col=rainbow(4))
# Below fails if all variables are not available in the global environment
eval(hm$call)
I assume this won't be the case, though, as you mentioned that you are calling the plot command from inside a function, and I think you're not using any global variables. You could instead re-construct the heatmap drawing call from the fields available in your hm object. The problem is that the original matrix is not available; instead we have the re-organized $carpet field. It requires some tinkering to obtain the original matrix, as the projection has been:
# hm2$carpet = t(m[hm2$rowInd, hm2$colInd])
At least in the case when the data matrix has not been scaled, the below should work. Add extra parameters according to your specific plotting call.
func <- function(mat){
  h <- heatmap.2(mat, col = rainbow(4))
  h
}
# eval(hm2$call) does not work, 'mat' is not available
hm2 <- func(m)
# here hm2$carpet = t(m[hm2$rowInd, hm2$colInd])
# Finding the projection back can be a bit cumbersome:
revRowInd <- match(c(1:length(hm2$rowInd)), hm2$rowInd)
revColInd <- match(c(1:length(hm2$colInd)), hm2$colInd)
heatmap.2(t(hm2$carpet)[revRowInd, revColInd],
          Rowv = hm2$rowDendrogram, Colv = hm2$colDendrogram, col = hm2$col)
Furthermore, I think you may be able to work your way to evaluating hm$call in the function's environment. Perhaps the with() function would be useful.
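For example, a sketch of that idea (replot() is a hypothetical helper; it assumes the recorded call only needs the data matrix, supplied here under the name mat):
replot <- function(hm, mat) {
  # evaluate the recorded plotting call with 'mat' bound locally;
  # everything else (heatmap.2, rainbow, ...) resolves as usual
  eval(hm$call, list(mat = mat))
}
replot(hm2, m)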
You could also make mat available by attaching it to the global environment, but I think this is considered bad practice, as too eager use of attach can result in problems. Notice that in my example every call to func creates the original plot.
I would do some functional programming:
create_heatmap <- function(...) {
  # the closure is the value of the last expression, so it is what gets returned
  plot_heatmap <- function() heatmap.2(...)
}
data = matrix(rnorm(100), nrow = 10)
show_heatmap <- create_heatmap(x = data)
show_heatmap()
Pass all of the arguments you need to send to heatmap.2 through the .... The outer function call sets up an environment in which the inner function looks first for its arguments. The inner function is returned as an object and is now completely portable. This should produce the exact same plot each time!
Does anyone know how to take advantage of ggplot or lattice in doing survival analysis? It would be nice to do trellis or facet-like survival graphs.
So, in the end, I played around and found a solution for a Kaplan-Meier plot. I apologize for the messy code that pulls the list elements into a data frame, but I couldn't figure out another way.
Note: it only works with two levels of strata. If anyone knows how I can use x <- length(stratum) to generalize this, please let me know (in Stata I could append to a macro; I'm unsure how this works in R).
library(survival)
library(ggplot2)

ggkm <- function(time, event, stratum) {
  m2s <- Surv(time, as.numeric(event))
  fit <- survfit(m2s ~ stratum)
  # collect the pieces of the survfit object into a data frame for ggplot
  f <- data.frame(
    time   = fit$time,
    surv   = fit$surv,
    strata = c(rep(names(fit$strata[1]), fit$strata[1]),
               rep(names(fit$strata[2]), fit$strata[2])),
    upper  = fit$upper,
    lower  = fit$lower
  )
  # note: each '+' must end a line, or R treats the ggplot call as complete
  r <- ggplot(f, aes(x = time, y = surv, fill = strata, group = strata)) +
    geom_line() +
    geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3)
  return(r)
}
I have been using the following code in lattice. The first function draws KM curves for one group and would typically be supplied as the panel.groups function, while the second adds the log-rank test p-value for the entire panel:
km.panel <- function(x, y, type, mark.time = TRUE, ...){
  na.part <- is.na(x) | is.na(y)
  x <- x[!na.part]
  y <- y[!na.part]
  if (length(x) == 0) return()
  fit <- survfit(Surv(x, y) ~ 1)
  if (mark.time){
    # mark censored observations on the curve
    cens <- which(fit$time %in% x[y == 0])
    panel.xyplot(fit$time[cens], fit$surv[cens], type = "p", ...)
  }
  panel.xyplot(c(0, fit$time), c(1, fit$surv), type = "s", ...)
}

logrank.panel <- function(x, y, subscripts, groups, ...){
  lr <- survdiff(Surv(x, y) ~ groups[subscripts])
  otmp <- lr$obs
  etmp <- lr$exp
  df <- (sum(1 * (etmp > 0))) - 1
  p <- 1 - pchisq(lr$chisq, df)
  p.text <- paste("p=", signif(p, 2))
  grid.text(p.text, 0.95, 0.05, just = c("right", "bottom"))
  panel.superpose(x = x, y = y, subscripts = subscripts, groups = groups, ...)
}
The censoring indicator has to be 0-1 for this code to work. The usage would be along the following lines:
library(survival)
library(lattice)
library(grid)
data(colon) #built-in example data set
xyplot(status~time, data=colon, groups=rx, panel.groups=km.panel, panel=logrank.panel)
If you just use 'panel=panel.superpose' then you won't get the p-value.
I started out following almost exactly the approach you use in your updated answer. But the thing that's irritating about survfit is that it only marks the changes, not each tick: e.g., it will give you 0 - 100%, 3 - 88% instead of 0 - 100%, 1 - 100%, 2 - 100%, 3 - 88%. If you feed that into ggplot, your lines will slope from 0 to 3 rather than remaining flat and dropping straight down at 3. That might be fine depending on your application and assumptions, but it's not the classic KM plot. This is how I handled the varying numbers of strata:
groupvec <- c()
for (i in seq_along(x$strata)) {
  groupvec <- append(groupvec, rep(x = names(x$strata[i]), times = x$strata[i]))
}
f$strata <- groupvec
For what it's worth, this is how I ended up doing it, but this isn't really a KM plot either, because I'm not calculating the KM estimate per se (although I have no censoring, so this is equivalent... I believe).
survcurv <- function(surv.time, group = NA) {
  # Must be able to coerce surv.time and group to vectors
  if (!is.vector(as.vector(surv.time)) | !is.vector(as.vector(group))) {
    stop("surv.time and group must be coercible to vectors.")
  }
  # Make sure that surv.time is numeric
  if (!is.numeric(surv.time)) {
    stop("Survival times must be numeric.")
  }
  # group can be just about anything, but must be the same length as surv.time
  if (length(surv.time) != length(group)) {
    stop("The vectors passed to the surv.time and group arguments must be of equal length.")
  }
  # What is the maximum number of ticks recorded?
  max.time <- max(surv.time)
  # What is the number of groups in the data?
  n.groups <- length(unique(group))
  # Use the number of ticks (plus one for t = 0) times the number of groups
  # to create an empty skeleton of the results.
  curves <- data.frame(tick = rep(0:max.time, n.groups), group = NA, surv.prop = NA)
  # Label each block of ticks with its group name; rep(..., each = ...) keeps
  # the groups aligned with their tick blocks (plain recycling would not).
  curves$group <- rep(unique(group), each = max.time + 1)
  # For each row, calculate the proportion of group[i] surviving past tick[i]
  for (i in seq_len(nrow(curves))) {
    curves$surv.prop[i] <- sum(surv.time[group %in% curves$group[i]] > curves$tick[i]) /
      length(surv.time[group %in% curves$group[i]])
  }
  # Return the results, ordered by group and tick; easier for humans to read.
  return(curves[order(curves$group, curves$tick), ])
}
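A hypothetical usage sketch (invented data), plotting the resulting step curves with ggplot2:
library(ggplot2)

set.seed(1)
d <- data.frame(t = rpois(60, 10), g = rep(c("a", "b"), each = 30))
curves <- survcurv(d$t, d$g)
ggplot(curves, aes(x = tick, y = surv.prop, colour = group)) + geom_step()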