I am prototyping an application in R, using the parallel package and parApply to run a function on the columns of a data frame. I understand that the same question applies to the non-parallel apply family as well. I have a line similar to:
myBigList <- parApply(myCluster, myInputData, 2, myFunction)
where myFunction is a function I have written that takes a vector as input. The function itself performs quite a few operations that I can't go into. It returns a list of variables of various classes. For the purposes of a MWE, say:
myFunction <- function(vectorIn){
  # CODE GOES HERE
  return(list(
    mean = mean(vectorIn),
    sd = sd(vectorIn),
    vectorOut = sumUserFunction(vectorIn),
    plot1 = aPlotGeneratingFunction(vectorIn)
  ))
}
What is returned to me is a list containing the results from the function. I can address elements from the list, eg:
myBigList$Column1$mean
But that isn't really helpful for my purposes. I'd like to know how to unpack the list so that I can look at all the mean values, e.g.:
listOfMeans <- myBigList$*ALL_ITEMS*$mean
so that listOfMeans is a vector with row.names, or data.frame with col.names.
Is this possible? I can think of a solution using a for loop, but that doesn't seem very elegant.
I'd also like to do something similar with the plots that I return, so that I can automatically build a PDF containing all of them. I'm guessing that learning the above will help.
tl;dr: What is the best method of extracting commonly named elements from a list of lists?
EDIT: An actual MWE
library('ggplot2')
exampleData <- data.frame(Col1 = rnorm(100), Col2 = rnorm(100), Col3 = rnorm(100))
myFunction <- function(xIn){
meanX <- mean(xIn)
sdX <- sd(xIn)
vecX <- xIn^2 + xIn
plotX <-
ggplot(data.frame(xIn, vecX), aes(x = xIn, y = vecX)) +
geom_point()
return(list(
mean = meanX,
sd = sdX,
vect = vecX,
plot = plotX
))
}
myBigList <- apply(exampleData,
2,
myFunction)
From @docendo discusimus's comment:
mymeans <- sapply(myBigList, '[[', 'mean')
returns a vector of all the values stored in mean. To return a list instead, which is useful for storing objects such as the plots, the command should be:
myplots <- lapply(myBigList, '[[', 'plot')
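To make the extraction pattern above concrete, here is a minimal sketch on a toy list. The list contents and the names toyList and summaryDF are made up for illustration, and vapply is swapped in for sapply only because it checks the type of each result; it is not part of the original comment.

```r
# Toy stand-in for myBigList: one sub-list per column (hypothetical values, no plots)
toyList <- lapply(
  list(Col1 = c(1, 2, 3), Col2 = c(4, 5, 6)),
  function(x) list(mean = mean(x), sd = sd(x), vect = x^2 + x)
)

# Named numeric vector with one mean per column
mymeans <- vapply(toyList, `[[`, numeric(1), "mean")

# Several scalar elements collected into a data frame, one row per column
summaryDF <- data.frame(
  mean = mymeans,
  sd   = vapply(toyList, `[[`, numeric(1), "sd")
)

# Non-scalar elements stay in a list
myvects <- lapply(toyList, `[[`, "vect")
```

The same idea carries over to the plots: open a device with pdf(), print() each element of myplots in a loop, then close the device with dev.off().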
Related
My problem is the following
I want to create variables e_1, e_2, e_3, ..., e_50, each composed of 100 draws from the uniform distribution on [-1, 1].
That is, e_1 is a vector of 100 draws from U[-1, 1], and likewise for e_2, ..., e_50.
Here is what I thought I could do :
periods <- c(1:50)
people <- c(1:100)
for (t in periods){
sprint('e_', t) <- runif(100, -1,1)
}
This did not work, and I am not sure how to change it to obtain what I want.
Thank you so much for your help!!
It is better not to create many objects in the global environment. Regarding the issue in the code, the name has to be built with sprintf and the assignment done with assign:
for(t in periods) {
assign(sprintf('e_%d', t), runif(100, -1, 1))
}
An approach that avoids creating multiple objects in the global environment is to build a named list with replicate:
lst1 <- replicate(length(periods), runif(100, -1, 1), simplify = FALSE)
names(lst1) <- sprintf('e_%d', periods)
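As a rough sketch of why the list form is convenient, the following shows how the draws might be accessed and summarised afterwards. The seed and the names draw_means and e_mat are assumptions added for illustration only.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
periods <- 1:50

# Same construction as above, with the names attached in one step
lst1 <- setNames(
  replicate(length(periods), runif(100, -1, 1), simplify = FALSE),
  sprintf("e_%d", periods)
)

# A single vector is reached by name, much like a separate variable would be
head(lst1$e_7)

# Summaries over all 50 vectors need no loop over variable names
draw_means <- vapply(lst1, mean, numeric(1))

# If a 100 x 50 matrix is more convenient, bind the vectors column-wise
e_mat <- do.call(cbind, lst1)
dim(e_mat)
```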
I am trying to write a function that removes multiple outliers, identified by Cook's distance, from a list of data frames.
There are some problems at the moment:
Part 1: Can I formulate part 1 as a function? I tried several things that did not work out. I want to use several different variables for the lm, so it would be great if I could pass column numbers, using the usual data frame indexing syntax, as input arguments.
Part 2: The file names of the plots are not correct. Each file is named after the first observation in the corresponding data frame of the list. How can I correct this?
Part 3: The data frames without the outliers are not created. The function ends after the message is printed. I can't find my mistake.
data(iris)
iris.lst <- split(iris[, 1:2], iris$Species)
new_names <- c(paste0(unlist(levels(iris$Species)),"_data"))
for (i in 1:length(iris.lst)) {
assign(new_names[i], iris.lst[[i]])
}
# Part 1: Then cooks distances
fit <- lapply(mget(ls(pattern = "_data")),
function(x) lm(x[,1] ~ x[,3], data = x))
cooksd <-lapply(fit,cooks.distance)
# Part 2: Plot each data frame with suspected outlier
plots <- function(x){
jpeg(file=paste0(names(x),".jpeg")) # file names are numbers
#par(mfrow=c(2,1))
plot(x, pch="*", cex=2, main="Influential cases by Cooks distance") # plot cook's distance
abline(h = 3*mean(x, na.rm=T), col="red") # add cutoff line
text(x=1:length(x)+1, y=x, labels=ifelse(x > 3*mean(x, na.rm=T),
names(x),""), col="red")
dev.off()
}
myplots <- lapply(cooksd, plots)
# Part 3: give me new data frames without influential cases
show_influential_cases <- function(x){
# invisible(cooksd[["n_OG"]] <- lapply(cooksd, length)
influential <- lapply(x,function(x) names(x)[x > 3*mean(x, na.rm=T)])
test <- as.data.frame(unlist(influential))[,1]
test <- as.numeric(test)
}
tested <- show_influential_cases(result)
cleaned_data <- add_new[-tested,] # removing outliers by indexing
Could someone please help me to improve my code?
Many thanks,
Nadine
In general, it is not good practice to create multiple data frames in the global environment. A list is usually the better option because it is easier to manage.
Part 1 -
You can combine multiple steps in one lapply call. Here in part 1 we apply the lm and cooks.distance functions together in the same lapply call.
master_data <- split(iris[, 1:2], iris$Species)
data <- lapply(master_data, function(x) {
cooks.distance(lm(Sepal.Length ~ Sepal.Width, data = x))
})
new_names <- paste0(levels(iris$Species),"_data")
names(data) <- new_names
Part 2 -
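lapply does not pass the names of the list elements to the function. Pass them separately and use Map to call the plots function.
plots <- function(x, y){
  # x: named vector of Cook's distances; y: name of the list element, used as file name
  jpeg(file = paste0(y, ".jpeg"))
  plot(x, pch = "*", cex = 2, main = "Influential cases by Cooks distance")
  cutoff <- 3 * mean(x, na.rm = TRUE)
  abline(h = cutoff, col = "red") # add cutoff line
  # label only the influential points, using their observation names
  text(x = seq_along(x) + 1, y = x, labels = ifelse(x > cutoff, names(x), ""), col = "red")
  dev.off()
}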
Map(plots, data, names(data))
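The point about names can be seen on a tiny example. This is only an illustrative sketch; the list scores and the labels it builds are hypothetical and unrelated to the iris data.

```r
scores <- list(a = 1:3, b = 4:6)

# Map walks over the values and the names in parallel,
# so the function receives each element together with its name
labelled <- Map(function(x, nm) paste0(nm, ": ", sum(x)), scores, names(scores))

# An lapply alternative: iterate over the names and index back into the list
labelled2 <- lapply(names(scores), function(nm) paste0(nm, ": ", sum(scores[[nm]])))

labelled$a
```

Map keeps the names of its first list argument on the result, whereas the lapply version over names(scores) returns an unnamed list.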
Part 3 -
I am not exactly clear about how you want to perform Part 3, so for now the function below takes the Cook's distances and the data separately and returns each data frame with the influential cases removed.
remove_influential_cases <- function(x, y){
inds <- x > 3*mean(x, na.rm=TRUE)
y[!inds, ]
}
result <- Map(remove_influential_cases, data, master_data)
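For a quick check that the filtering does what is intended, the three parts can be run end to end on iris and the row counts compared. This is a sketch under the same 3 * mean cutoff used above; the names cooks, cleaned, removed and combined are introduced here for illustration.

```r
# Split the data and compute Cook's distances per species
master_data <- split(iris[, 1:2], iris$Species)
cooks <- lapply(master_data, function(d) {
  cooks.distance(lm(Sepal.Length ~ Sepal.Width, data = d))
})

# Drop rows whose Cook's distance exceeds 3 times the group mean
cleaned <- Map(function(cd, d) {
  d[!(cd > 3 * mean(cd, na.rm = TRUE)), ]
}, cooks, master_data)

# Number of rows removed per species
removed <- vapply(master_data, nrow, integer(1)) - vapply(cleaned, nrow, integer(1))
removed

# Optionally stack the cleaned groups back into one data frame
combined <- do.call(rbind, cleaned)
```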
I used the mcmc_trace function from the bayesplot package to draw a trace plot from an mcmc list. The result is a ggplot object, so it can be edited further with ggplot functions.
Below is the plot produced by the function. I need to change the panel titles k[1]...k[20] to subject 1...subject 20. Is there a way to achieve this with ggplot functions?
Below is a simple reproducible model.
library (R2jags)
library (bayesplot)
library (ggplot2)
# data
dlist <- list(
NSubjects = 20,
k = rep (5,20),
n = rep (10,20)
)
# monitor
parameter <- 'theta'
# model
minimodel <- function(){
for (i in 1:NSubjects){
theta [i] ~ dbeta (1,1)
k[i] ~ dbin(theta[i],n[i])
}
}
samples <- jags(dlist, inits=NULL, parameter,
model.file = minimodel,
n.chains=1, n.iter=10, n.burnin=1, n.thin=1, DIC=T)
# mcmc list
codaSamples = as.mcmc.list(samples$BUGSoutput)
# select subjects
colstheta <- sprintf("theta[%d]",1:20)
# plot (here is where I need to change title, in this example: theta[1]...theta[20] to subject [1].. subject [20]
mcmc_trace(codaSamples[,colstheta]) +
labs (x='Iteration',y='theta value',
title='Traceplot - theta')
Use colnames<- to modify the column names. Since the object is a 1-element list containing a matrix-like object, you need to use [[1]]; if you have multiple chains you'll need to lapply() (or use a for loop) to apply the solution to every chain (i.e., every element in the list).
cc <- codaSamples[,colstheta]
colnames(cc[[1]]) <- gsub("theta\\[([0-9]+)\\]","subject \\1",colnames(cc[[1]]))
mcmc_trace(cc, ...)
The code above finds the numerical element in each name and inserts it into the new name; since you happen to know in this case that these are elements 1:20, you could simplify considerably, e.g.
colnames(cc[[1]]) <- paste("subject",seq(ncol(cc[[1]])))
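For the multi-chain case mentioned above, a minimal sketch of the lapply approach might look as follows. Plain matrices stand in for the chains so that the example runs without JAGS, and make_chain and rename_cols are hypothetical helper names. With a real mcmc.list, the result of lapply is a plain list, so it would presumably need to be converted back (for example with coda::as.mcmc.list) before plotting; that step is an assumption and is not tested here.

```r
# Two toy "chains" standing in for the elements of an mcmc.list (made-up values)
make_chain <- function() {
  matrix(runif(12), nrow = 4,
         dimnames = list(NULL, c("theta[1]", "theta[2]", "theta[10]")))
}
chains <- list(make_chain(), make_chain())

# Rename the columns of one chain: theta[i] becomes "subject i"
rename_cols <- function(m) {
  colnames(m) <- gsub("theta\\[([0-9]+)\\]", "subject \\1", colnames(m))
  m
}

# Apply the same renaming to every chain
chains <- lapply(chains, rename_cols)
colnames(chains[[2]])
```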
I'm trying to wrap a few lines of code in a function that takes a single variable as input. The code below creates an object using the Surv function (survival package). The second line takes the variable in question, in this case a column named Variable_X, and produces a fit that can then be visualized using ggsurvplot. The output is a Kaplan-Meier survival curve. What I'd like is a function such that I can type f(Variable_X) and get the KM curve for whichever column I choose from the data. I want f(y) to output the KM curve as if I had put y where ~Variable_X currently is. I'm new to R and very new to how functions work. I've tried the code below, but it obviously doesn't work. I'm working through DataCamp and reading posts, but I'm having a hard time with it and would appreciate any help.
surv_object <- Surv(time = KMeier_DF$Followup_Duration, event = KMeier_DF$Death_Indicator)
fitX <- survfit(surv_object ~ Variable_X, data = KMeier_DF)
ggsurvplot(fitX, data = KMeier_DF, pval = TRUE)
f<- function(x) {
dat<-read.csv("T:/datafile.csv")
KMeier_DF < - dat
surv_object <- Surv(time = KMeier_DF$Followup_Duration, event =
KMeier_DF$Death_Indicator)
fitX<-survfit(surv_object ~ x, data = KMeier_DF)
PlotX<- ggsurvplot(fitX, data = KMeier_DF, pval = TRUE)
return(PlotX)
}
The crux of your problem is a stumbling block that is tough to figure out at first: how to pass variable or data frame column names into a function. I created some example data. In the example below I give the function four arguments, one of which is your data. You can see two ways of selecting the columns, [[ ]] and [ , ], which you can think of as equivalent to using $. With a literal column name they are, but not when the column name is stored in a variable, as it is inside a function: df$varb1 would look for a column literally called varb1. The print calls are there just to show you the data along the way. If those objects already exist in your global environment, remove them one by one, e.g. rm(surv_object), or clear them all with rm(list = ls()).
duration <- c(1, 3, 4, 3, 3, 4, 2)
di <- c(1, 1, 0, 0, 0, 0, 1)
color <- c(1, 1, 2, 2, 3, 3, 4)
KMdf <- data.frame(duration, di, color)
testfun <- function(df, varb1, varb2, varb3) {
surv_object <- Surv(time = df[[varb1]], event = df[ , varb2])
print(surv_object)
fitX <- survfit(surv_object ~ df[[varb3]], data = df)
print(fitX)
# plotx <- ggsurvplot(fitX, data = df, pval = TRUE) # this gives an error that surv_object is not found
# return(plotx)
}
testfun(KMdf, "duration", "di", "color") # notice the use of quotes here, if not you'll get an error about object not found.
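The difference between $ and [[ ]] inside a function can be shown without any survival code. The sketch below reuses the example data and, as an assumption for illustration only, uses lm as a stand-in model; by_dollar, by_bracket and fit_by are hypothetical helper names. Whether building the formula this way also avoids the ggsurvplot error discussed here has not been verified.

```r
KMdf <- data.frame(
  duration = c(1, 3, 4, 3, 3, 4, 2),
  di       = c(1, 1, 0, 0, 0, 0, 1),
  color    = c(1, 1, 2, 2, 3, 3, 4)
)

# $ takes the name literally: this looks for a column called "varb"
by_dollar <- function(df, varb) df$varb
# [[ ]] evaluates varb first, so it finds the column named by the string
by_bracket <- function(df, varb) df[[varb]]

is.null(by_dollar(KMdf, "color"))  # TRUE, no such column
by_bracket(KMdf, "color")

# Column names passed as strings can also be turned into a formula
fit_by <- function(df, response, predictor) {
  f <- reformulate(predictor, response = response)
  lm(f, data = df)  # lm is only a stand-in for the model of interest
}
fit <- fit_by(KMdf, "duration", "color")
formula(fit)
```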
And, even better, you have an even tougher stumbling block: how R handles variables and where it looks for them. From what I can tell, you are running into this because of a possible bug in ggsurvplot, which looks for variables in the global environment rather than inside the function. The issue was closed, but as far as I can tell the problem is still there. When you try to run the ggsurvplot line, you get the same error you would get if the variable had never been defined:
Error in eval(inp, data, env) : object 'surv_object' not found.
Hopefully that helps. I'd submit a bug report if I were you.
edit
I was hoping this solution would help, but it doesn't.
testfun <- function(df, varb1, varb2, varb3) {
surv_object <- Surv(time = df[[varb1]], event = df[,varb2])
print(surv_object)
fitX <- survfit(surv_object ~ df[[varb3]], data = df)
print(fitX)
attr(fitX[['strata']], "names") <- c("color = 1", "color = 2", "color = 3", "color = 4")
plotx <- ggsurvplot(fitX, data = df, pval = TRUE) # this gives an error that surv_object is not found
return(plotx)
}
Error in eval(inp, data, env) : object 'surv_object' not found
This is homework, right?
First, you need to try to run the code before you provide it as an example. Your example has several fatal errors. ggsurvplot() needs either a library call to survminer or to be summoned as follows: survminer::ggsurvplot().
You have defined a function f, but you never used it. In the function definition, you have a wayward space < -. It never would have worked.
I suggest you start by defining a function that calculates the sum of two numbers, or concatenates two strings. Start here or here. Then, you can return to the Kaplan-Meier stuff.
Second, in another class or two, you will need to know the three parts of a function. You will need to understand the scope of a function. You might as well dig into the basics before you start copy-and-pasting.
Third, before you post another question, please read How to make a great R reproducible example?.
Best of luck.
Let's say I generate 9 groups of data in a list data and plot them each with a for loop. I could use *apply here too, whichever you prefer.
data = list()
layout(mat = matrix(1:9, nrow = 3))
for(i in 1:9){
data[[i]] = rnorm(n = 100, mean = i, sd = 1)
plot(data[[i]])
}
After creating all the data, I want to decide which one is best:
best_data = which.min(sapply(data, sd))
Now I want to highlight that best data on the plot to distinguish it. Is there a plotting function that lets me go back to a specified sub-plot in the active device and add an element (maybe a title)?
I know I could make a second for loop: for loop 1 generates the data, then I assess which is best, then for loop 2 creates the plots, but this seems less efficient and more verbose.
Does such a plotting function exist for base R graphics?
@rawr's answer is simple and easy. But I thought I'd point out another option that lets you select the "best" data set before you plot, in case you want more flexibility to plot the "best" data set differently from the rest.
For example:
# Create the data
data = lapply(1:9, function(i) rnorm(n = 100, mean = i, sd = 1))
par(mar=c(4,4,1,1))
layout(mat = matrix(1:9, nrow = 3))
rng = range(data)
# Select the data set with the lowest SD once, before plotting
best = which.min(sapply(data, sd))
# Plot each data set, highlighting the one with the lowest SD with red crosses
invisible(lapply(1:9, function(i) {
  plot(data[[i]], col=ifelse(best==i,"red","black"), pch=ifelse(best==i, 3, 1), ylim=rng)
}))
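As for the original question of going back to a sub-plot after all panels are drawn, one base graphics option is par(mfg = c(row, col)), which sets the figure that is drawn in next. The sketch below is one possible way to use it and may differ from @rawr's answer, which is not reproduced here. It assumes a par(mfrow = ...) grid filled row by row; the seed and the name pos are illustrative. The user coordinates of the earlier panel are not restored, so this suits annotations such as a title or a frame rather than adding data points.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible
data <- lapply(1:9, function(i) rnorm(n = 100, mean = i, sd = 1))

# 3 x 3 grid filled row by row
op <- par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))
for (i in 1:9) plot(data[[i]])

# Decide which data set is best only after everything is drawn
best <- which.min(vapply(data, sd, numeric(1)))

# Convert the panel index into (row, column) for a row-wise grid
pos <- c((best - 1) %/% 3 + 1, (best - 1) %% 3 + 1)

# Make that panel the current figure again and annotate it
par(mfg = pos)
title(main = "best", col.main = "red")
box(which = "figure", col = "red", lwd = 2)

par(op)  # restore the previous layout settings
```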