Applying an operation to every dataframe in the global environment - r

I would like to create a prediction matrix (using mice) for each dataframe in my workspace. I thought of doing the following:
library(mice)
PredMatr = list()
try (for (i in 1:length(ls())) {
PredMatr [[i]]=quickpred(get(ls()[i]), mincor=.1)
})
But it stops when it encounters something different than a dataframe in the workspace. How could I adapt my code to make the operation conditional on the object being a dataframe?

You can use eapply to test which objects in the environment are data frames and work only with those. For example:
Myls <- names(Filter(isTRUE, eapply(.GlobalEnv, is.data.frame)))
Now Myls is a character vector holding the names of the objects that are data frames. These can then be fed into get(). Testing with is.data.frame rather than comparing class() to "data.frame" also covers objects whose class vector has more than one element, such as tibbles.
eapply is like lapply, but it applies a function to every object in an environment rather than to every element of a list.
Edit to add:
To use this in the original problem you can do the following:
library(mice)
PredMatr = list()
Myls <- names(Filter(isTRUE, eapply(.GlobalEnv, is.data.frame)))
try(for (i in seq_along(Myls)) {
  PredMatr[[i]] = quickpred(get(Myls[i]), mincor = .1)
})
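As a minimal sketch of the same filtering idea without relying on the contents of the global environment: the example below builds a small private environment (env, df_a, df_b and not_df are made-up names) and uses dim() as a hypothetical stand-in for quickpred(), so that it runs without mice installed.

```r
# Toy environment standing in for .GlobalEnv (all names here are illustrative)
env <- new.env()
assign("df_a", data.frame(x = 1:3, y = c(2, 4, 6)), envir = env)
assign("df_b", data.frame(u = 1:2), envir = env)
assign("not_df", 1:10, envir = env)

# eapply() returns a named list; unlist() turns it into a named logical vector
is_df <- unlist(eapply(env, is.data.frame))
df_names <- sort(names(is_df)[is_df])

# Apply the operation to the data frames only; dim() stands in for quickpred()
results <- lapply(mget(df_names, envir = env), dim)
```

Because mget() returns a named list, results keeps the data frame names, which makes it easy to look up the output for a particular object later.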

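You could add
if(!is.data.frame(get(ls()[i]))) next
to your code, so that the loop skips to the next iteration when it encounters an object that is not a data frame.
Answer to comment
library(mice)
PredMatr = list()
objs <- ls()  # take the names once: ls() can change as soon as the loop creates i
try(for (i in seq_along(objs)) {
  if (!is.data.frame(get(objs[i]))) next
  PredMatr[[objs[i]]] = quickpred(get(objs[i]), mincor = .1)
})
This should do the trick. Storing each result under the object's name avoids empty slots for the skipped objects.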

How do I pass a function argument into map_df()?

I am trying to create a function to clean data and return as a data.frame in R.
I'm using the map_df() function to return the cleaned data as a data.frame, and have a function written to clean the data.
The first thing I do is pull a list of files from a folder, then iterate through them and clean each file. I have a pre-defined set specifying which column names to pull (stored in selectCols) in case of variation between files:
files <- list.files(filepath,full.names=F)
colInd <- which(names(fread(files[i],nrows=0)) %in% gsub("_","",selectCols))
I also have a function to clean my data, which uses fread() to read in the .csv files. It takes colInd and i as arguments to clean files iteratively.
cleanData <- function(files,i,colInd) {
addData <- fread(files[i],select=c(colInd))
[...]
}
Overall it looks like this (as a recursive function):
i <- 1
files <- list.files(filepath,full.names=F)
iterateCleaning <- function(files,i) {
  colInd <- which(names(fread(files[i],nrows=0)) %in% gsub("_","",selectCols))
  if (length(colInd)==length(selectCols)) {
    newData <- map_df(files,cleanData)
    saveToFolder(newData,i,files)
  }
  i=i+1
  if (i<=length(files)){
    iterateCleaning(files,i)
  }
}
When I try to run without specifying the arguments for my function I get this error:
Error in fread(files,select=c(colInd)):
argument "colInd" is missing, with no default.
When I insert it into my map_df() I do it like so:
newData <- map_df(files,i,colInd,cleanData)
Then I get this error:
Error in as_mapper(.f,...): object 'colInd' not found.
Any suggestions for resolving this error? As I understand it, map_df() applies to each element in the function, but I don't need it applied to the i and colInd inputs, I just need them for the function I am calling in map_df(). How can I call map_df() on a function that requires additional arguments?
I read the documentation but found it a bit confusing. It says to use "." for a single-argument function and .x and .y for two-argument functions, but I'm not sure what that means. My initial guess is something like these, but neither line works:
newData <- map_df(files,cleanData,.i,.colInd)
newData <- map_df(files,cleanData,.x=i,.y=colInd)
Any recommendations? Will I have the same output if I just call map_df() afterwards on the output of my function?
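One possible approach, sketched here under assumptions: apply-style functions take the function first and then forward any further named arguments to every call. The example uses base lapply() so that it runs without extra packages; clean_one() and the in-memory tables list are hypothetical stand-ins for cleanData() and the .csv files. The commented lines show what the equivalent map_df() calls would look like, assuming purrr and dplyr are installed.

```r
# Hypothetical stand-in for cleanData(): keep the selected columns of one "file"
clean_one <- function(file, col_ind, tables) {
  tables[[file]][, col_ind, drop = FALSE]
}

# In-memory tables standing in for the .csv files on disk
tables <- list(
  a.csv = data.frame(id = 1:2, x = c(10, 20), junk = c("p", "q")),
  b.csv = data.frame(id = 3:4, x = c(30, 40), junk = c("r", "s"))
)
files <- names(tables)
col_ind <- c(1, 2)

# Named arguments after the function are passed unchanged to every call
pieces <- lapply(files, clean_one, col_ind = col_ind, tables = tables)
new_data <- do.call(rbind, pieces)

# Equivalent purrr calls (not run here):
# new_data <- purrr::map_df(files, clean_one, col_ind = col_ind, tables = tables)
# new_data <- purrr::map_df(files, function(f) clean_one(f, col_ind, tables))
```

The anonymous-function form is often the easier one to read when the mapped element is not the first argument of the function being called.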

R, dplyr and snow: how to parallelize functions which use dplyr

Let's suppose that I want to apply, in a parallel fashion, myfunction to each row of myDataFrame. Suppose that otherDataFrame is a dataframe with two columns, COLUMN1_odf and COLUMN2_odf, which myfunction uses. I would like to write code using parApply like this:
clus <- makeCluster(4)
clusterExport(clus, list("myfunction","%>%"))
myfunction <- function(fst, snd) {
#otherFunction and aGlobalDataFrame are defined in the global env
otherFunction(aGlobalDataFrame)
# some code to create otherDataFrame **INTERNALLY** to this function
otherDataFrame %>% filter(COLUMN1_odf==fst & COLUMN2_odf==snd)
return(otherDataFrame)
}
do.call(bind_rows,parApply(clus,myDataFrame,1,function(r) { myfunction(r[1],r[2]) }))
The problem here is that R doesn't recognize COLUMN1_odf and COLUMN2_odf even if I insert them in clusterExport. How can I solve this problem? Is there a way to "export" all the objects that snow needs, so that I don't have to enumerate each of them?
EDIT 1: I've added a comment (in the code above) to specify that otherDataFrame is created internally to myfunction.
EDIT 2: I've added some pseudo-code to generalize myfunction: it now uses a global dataframe (aGlobalDataFrame) and another function (otherFunction).
After some experiments I solved my problem (following Benjamin's suggestion and taking into account the edits I added to the question) with:
clus <- makeCluster(4)
clusterEvalQ(clus, {library(dplyr); library(magrittr)})
myfunction <- function(fst, snd) {
  # otherFunction and aGlobalDataFrame are defined in the global env
  otherFunction(aGlobalDataFrame)
  # some code to create otherDataFrame **INTERNALLY** to this function
  otherDataFrame <- otherDataFrame %>% dplyr::filter(COLUMN1_odf==fst & COLUMN2_odf==snd)
  return(otherDataFrame)
}
# export by name, as a character vector, once the objects exist
clusterExport(clus, c("myfunction", "otherFunction", "aGlobalDataFrame"))
do.call(bind_rows, parApply(clus, myDataFrame, 1,
  function(r) { myfunction(r[1], r[2]) }))
In this way I've registered aGlobalDataFrame, myfunction and otherFunction on the workers; in short, all the functions and data used by the function that parallelizes the job (myfunction itself).
Now that I'm not looking at this on my phone, I can see a couple of issues.
First, you are not actually creating otherDataFrame in your function. You are trying to pipe an existing otherDataFrame into filter, and if otherDataFrame doesn't exist in the environment, the function will fail.
Second, unless you have already loaded the dplyr package into your cluster environments, you will be calling the wrong filter function.
Lastly, when you've called parApply, you haven't specified anywhere what fst and snd are supposed to be. Give the following a try:
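clus <- makeCluster(4)
clusterEvalQ(clus, {library(dplyr); library(magrittr)})
myfunction <- function(otherDataFrame, fst, snd) {
  dplyr::filter(otherDataFrame, COLUMN1_odf==fst & COLUMN2_odf==snd)
}
# export after the definition; otherDataFrame must also exist on the workers
clusterExport(clus, c("myfunction", "otherDataFrame"))
# "[fst]" and "[snd]" are placeholders for the two columns of myDataFrame to use
do.call(bind_rows, parApply(clus, myDataFrame, 1,
  function(r, fst, snd) { myfunction(otherDataFrame, r[fst], r[snd]) }, "[fst]", "[snd]"))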
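The export step can be tried on a toy case. The sketch below assumes the base parallel package (which provides the same makeCluster, clusterExport and parApply interface) and uses base subsetting instead of dplyr so that nothing needs to be loaded on the workers; lookup, pick and keys are made-up names.

```r
library(parallel)

# Toy lookup table and the function that each worker will call
lookup <- data.frame(a = c(1, 1, 2), b = c(1, 2, 2), value = c("x", "y", "z"))
pick <- function(fst, snd) {
  lookup[lookup$a == fst & lookup$b == snd, , drop = FALSE]
}

clus <- makeCluster(2)
# Export by name, after the objects have been defined
clusterExport(clus, c("lookup", "pick"), envir = environment())

# One row of keys per call; each call returns a data frame
keys <- data.frame(a = c(1, 2), b = c(2, 2))
res <- parApply(clus, keys, 1, function(r) pick(r[1], r[2]))
stopCluster(clus)

# Combine the per-row results into a single data frame
out <- do.call(rbind, res)
```

Anything the worker function refers to by name (here lookup) has to be exported as well, which is why the data frame appears next to the function in clusterExport().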

Get object name after passed to function, and used in for loop, in R

This question has been asked here: In R, how to get an object's name after it is sent to a function?
However, that approach doesn't work inside a for loop. For example, the following function is meant to write multiple dataframes to a postgresql database:
write_multiple_to_postgres <- function(list_of_frames) {
for(i in 1:length(list_of_frames)) {
object_name <- deparse(substitute(list_of_frames[[i]]))
write_to_postgresql(object_name, list_of_frames[[i]])
}
}
Where the list_of_frames looks like this:
my_list <- list(data_frame_susan, data_frame_bobby, data_frame_melissa)
....and is called as:
write_multiple_to_postgres(my_list)
I want the object_name to be passed as a string to the write_to_postgresql method. But instead I get the following outputs for object_name
my_list[[1L]],
my_list[[2L]],
my_list[[3L]]
Where what I want is:
data_frame_susan,
data_frame_bobby,
data_frame_melissa
How can I use "deparse(substitute) trick" or some other method to get the object name after being passed into a function and used in a for loop?
With the way you define my_list, there is no way to get the data frame names back. You should use a named list such as my_list=list(data_frame_susan=data_frame_susan...) and then loop over names(my_list):
for (df in names(my_list)){
  write_to_postgresql(df, my_list[[df]])
}
The remaining question is how to build my_list with the corresponding names, but from what you say we don't know where it comes from or how those data.frames are populated.
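If the data frames already exist as separate objects, one way to build the named list is mget(), which returns a list named after the objects it fetches. This is only a sketch: the data frames are toy ones, and describe_frame() is a hypothetical stand-in for write_to_postgresql().

```r
# Toy data frames with the names used in the question
data_frame_susan   <- data.frame(x = 1:2)
data_frame_bobby   <- data.frame(x = 1:3)
data_frame_melissa <- data.frame(x = 1:4)

# mget() looks the objects up by name and keeps the names on the result
frame_names <- c("data_frame_susan", "data_frame_bobby", "data_frame_melissa")
my_list <- mget(frame_names)

# Hypothetical stand-in for write_to_postgresql(): report what would be written
describe_frame <- function(name, df) {
  sprintf("%s: %d rows", name, nrow(df))
}

written <- character(0)
for (df in names(my_list)) {
  written <- c(written, describe_frame(df, my_list[[df]]))
}
```

Inside the loop, df is the name as a string and my_list[[df]] is the data frame itself, so both pieces are available without deparse(substitute()).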

How to call a result from a function in another one in R

Can somebody please tell me how I can use my output, which consists of two matrices, as input to another function?
X1=function(y,z)
{
output1=y*z
output2=y/z
}
X2=function(p,q)
{
input=X1(y,z)
input1=input$output1 # ??? How do I specify the output so that I can call it this way? output1 and output2 are matrices!
input2=input$output2
equation=input1+input2
}
I tried return() and data.frame, but neither worked. Any tips?
You can't use c as some might otherwise expect because you'll lose the structure of the matrices. Instead, use list when you want to return multiple objects from an R function.
X1 <- function(y,z)
{
list(
output1=y*z,
output2=y/z
)
}
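Put together, the calling side might look like the sketch below. It assumes that X2 is meant to pass its own arguments p and q on to X1, which the question does not state explicitly; m and n are made-up test matrices.

```r
# X1 returns both matrices in a named list
X1 <- function(y, z) {
  list(
    output1 = y * z,
    output2 = y / z
  )
}

# X2 calls X1 and picks the matrices out of the list by name
X2 <- function(p, q) {
  input <- X1(p, q)
  input$output1 + input$output2
}

m <- matrix(1:4, nrow = 2)
n <- matrix(2, nrow = 2, ncol = 2)
result <- X2(m, n)
```

Each element of the list keeps its matrix structure, so input$output1 and input$output2 can be used in matrix arithmetic directly.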

output from "for" loop

Based on Roland's suggestion from Plot titles in R using sapply(), I have created the following loop to make boxplots out of every selected variable in my dataset.
all.box=function(x) {
for (i in seq_along(x)) {
boxplot(x[,i], main = names(x)[i])
}
}
It does the job nicely in that it produces the graphs. Could someone point out how to make the loop return some output, say the $out from each boxplot, so that I can see the number of outliers it calculated?
Thanx a lot!
Using lapply is better here, as it avoids the side effects of the for loop:
all.box=function(x) {
  res <- lapply(seq_along(x),function(i){
    boxplot(x[,i], main = names(x)[i])$out
  })
  res
}
PS: you can continue to use for, but you will need either to append to a list within your loop or to allocate memory for the output object before calling boxplot. So I think it is simpler to use an *apply family function here.
If you want to return something from a for loop, it's very important to pre-allocate the return object if it's not a list. Otherwise, for loops with many iterations will be slow. I suggest reading The R Inferno, Circle 2 in particular.
all.box=function(x) {
  result <- list()
  for (i in seq_along(x)) {
    result[[i]] <- boxplot(x[,i], main = names(x)[i])$out
  }
  result
}
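When only the number of outliers per column is needed, the result is an atomic vector, and that is the case where pre-allocation matters. A minimal sketch, using boxplot.stats() (the computation behind boxplot's $out) so that no plot is drawn; count_outliers and the toy data frame are made-up names.

```r
count_outliers <- function(x) {
  # Pre-allocate an integer vector with one slot per column
  n_out <- integer(ncol(x))
  names(n_out) <- names(x)
  for (i in seq_along(x)) {
    n_out[i] <- length(boxplot.stats(x[[i]])$out)
  }
  n_out
}

# Column a contains one extreme value, column b contains none
df <- data.frame(a = c(1:10, 100), b = 1:11)
n <- count_outliers(df)
```

The loop only fills existing slots, so the vector is never copied and regrown as it would be with n_out <- c(n_out, ...).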
