How do I pass a function argument into map_df()? - r

I am trying to create a function to clean data and return as a data.frame in R.
I'm using the map_df() function to return the cleaned data as a data.frame, and have a function written to clean the data.
The first thing I do is pull a list of files from a folder, then iterate through them and clean each file. I have a pre-defined set specifying which column names to pull (stored in selectCols) in case of variation between files:
files <- list.files(filepath,full.names=F)
colInd <- which(names(fread(files[i],nrows=0)) %in% gsub("_","",selectCols))
I also have a function to clean my data, which uses fread() to read in the .csv files. It takes colInd and i as arguments to clean files iteratively.
cleanData <- function(files,i,colInd) {
addData <- fread(files[i],select=c(colInd))
[...]
}
Overall it looks like this (as a recursive function):
i <- 1
files <- list.files(filepath,full.names=F)
iterateCleaning <- function(files,i) {
colInd <- which(names(fread(files[i],nrows=0)) %in% gsub("_","",selectCols))
if (length(colInd)==length(selectCols)) {
newData <- map_df(files,cleanData)
saveToFolder(newData,i,files)
}
else {}
i=i+1
if (i<=length(files)){
iterateCleaning(files,i)
}
else {}
}
When I try to run without specifying the arguments for my function I get this error:
Error in fread(files,select=c(colInd)):
argument "colInd" is missing, with no default.
When I insert it into my map_df() I do it like so:
newData <- map_df(files,i,colInd,cleanData)
Then I get this error:
Error in as_mapper(.f,...): object 'colInd' not found.
Any suggestions for resolving this error? As I understand it, map_df() applies the function to each element of its first argument, but I don't need it applied to the i and colInd inputs; I just need them for the function I am calling in map_df(). How can I call map_df() on a function that requires additional arguments?
I read the documentation but found it a bit confusing. It says to use "." for a single-argument function and .x and .y for two-argument functions, but I'm not sure what that means. My initial guess is something like these, but neither line works:
newData <- map_df(files,cleanData,.i,.colInd)
newData <- map_df(files,cleanData,.x=i,.y=colInd)
Any recommendations? Will I have the same output if I just call map_df() afterwards on the output of my function?
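One way to pass fixed extra arguments is sketched below. It assumes purrr (and dplyr, which map_df() needs for row-binding) are installed. clean_one() is a hypothetical stand-in for cleanData() that takes the current element as its first argument; it is not the function from the question.

```r
library(purrr)

# Hypothetical stand-in for cleanData(): receives ONE element plus fixed extras.
clean_one <- function(x, cols, scale = 1) {
  data.frame(id = x, n_cols = length(cols), value = x * scale)
}

cols <- c(1, 3)

# Anything named after the function is forwarded unchanged to every call.
res_dots <- map_df(1:3, clean_one, cols = cols, scale = 10)

# Equivalent formula form: .x is the current element being mapped over.
res_formula <- map_df(1:3, ~ clean_one(.x, cols = cols, scale = 10))
```

Applied to the question, this would mean having cleanData() accept a single file as its first argument, so the call could look like map_df(files, cleanData, colInd = colInd), provided colInd exists in the calling scope.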

Related

R Why do I have to assign a formal argument variable to itself in order for this function to work?

I have developed the following two functions:
save_sysdata <- function(...) {
data <- eval(substitute(alist(...)))
data <- purrr::map_chr(data, add_dot)
save(list = data, file = "sysdata.rda", compress = "bzip2", version = 2)
}
add_dot <- function(object) {
object <- object # Why is this required?
name <- paste0(".", deparse(substitute(object)))
# parent.frame(3) because evaluating in global (or caller function); 2 because assigning in save_sysdata.
assign(name, eval(object, envir = parent.frame(3)), envir = parent.frame(2))
return(name)
}
The purpose of this set of functions is to provide an object (x) and save it as a sysdata.rda file but as a hidden object. This requires adding a . to the object symbol (.x).
The set of functions as I have it works and accomplishes what I want. However, it requires a bit of code that I don't understand why it works or what it's doing. I'm not even sure how I came up with this particular line as a solution.
If I remove the line object <- object from the add_dot function, the whole thing fails to work. It actually just generates an empty sysdata.rda file.
Can anyone explain why this line is necessary and what it is doing?
And if you have a more efficient way of accomplishing this, please let me know. It was a fun exercise to figure this out myself but I'm sure there is a better way.
For a reprex, simply copy the above functions and run:
x <- "test"
save_sysdata(x)
Then load the sysdata.rda file into your global environment and type .x. It should return [1] "test".
Here's an alternative version
save_sysdata <- function(...) {
pnames <- sapply(match.call(expand.dots=FALSE)$..., deparse)
snames <- paste0(".", pnames)
senv <- setNames(list(...), snames)
save(list = snames, envir=list2env(senv), file = "sysdata.rda", compress = "bzip2", version = 2)
}
We grab the names of the parameters with match.call() and dump the values into a named list. We add dots to the names and then turn that list into an environment that we can use with save.
The reason your version required object <- object is that function parameters are lazily evaluated. Since you never actually use the value of object in your function without the assignment, it remains a promise and is never added to the function environment. Sometimes you'll see force(object) instead, which does the same thing.
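Lazy evaluation is easy to see in isolation. In this small sketch, f_lazy() and f_forced() are made-up names for illustration: an argument that is never touched is never evaluated, so even an argument that would raise an error goes unnoticed until something forces it.

```r
# The argument is never used, so the promise is never evaluated.
f_lazy <- function(x) {
  "never touched x"
}

# force(x) evaluates the promise immediately.
f_forced <- function(x) {
  force(x)
  "touched x"
}

lazy_result <- f_lazy(stop("boom"))                      # no error is raised
forced_result <- try(f_forced(stop("boom")), silent = TRUE)  # error is raised
```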

Applying an operation to every dataframe in the global environment

I would like to create a prediction matrix (using mice) for each dataframe in my workspace. I thought of doing the following:
library(mice)
PredMatr = list()
try (for (i in 1:length(ls())) {
PredMatr [[i]]=quickpred(get(ls()[i]), mincor=.1)
})
But it stops when it encounters something different than a dataframe in the workspace. How could I adapt my code to make the operation conditional on the object being a dataframe?
You can use eapply to test which objects in the environment are data frames and only work with those. For example use:
Myls <- names(which(unlist(eapply(.GlobalEnv, is.data.frame))))
and now Myls is a character vector of the names of the objects that are data frames. These can then be fed into get(). Testing with is.data.frame, rather than comparing class() to "data.frame", also covers objects that carry more than one class, such as tibbles.
eapply is like lapply but it applies to every object in an environment rather than every element of a list.
Edit to add:
To use this in the original problem you can do the following:
library(mice)
PredMatr = list()
Myls <- names(which(unlist(eapply(.GlobalEnv, is.data.frame))))
try (for (i in seq_along(Myls)) {
PredMatr [[i]]=quickpred(get(Myls[i]), mincor=.1)
})
You could add
if(!is.data.frame(get(ls()[i]))) next;
to your code, then the loop will skip to the next iteration when it encounters a non-data.frame structure.
Answer to comment
library(mice)
PredMatr = list()
try (for (i in 1:length(ls())) {
if(!is.data.frame(get(ls()[i]))) next;
PredMatr [[i]]=quickpred(get(ls()[i]), mincor=.1)
})
Should do the trick.
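A loop-free variant of the same idea is sketched below. To keep it runnable without mice, it uses a throwaway environment in place of the workspace and ncol() in place of quickpred(); both substitutions are assumptions made for the demo.

```r
# Hypothetical demo environment standing in for the workspace (.GlobalEnv).
env <- new.env()
assign("df_a", data.frame(x = 1:3, y = 4:6), envir = env)
assign("df_b", data.frame(z = letters[1:2]), envir = env)
assign("not_df", 1:10, envir = env)

# Fetch every object by name, then keep only the data frames (names preserved).
dfs <- Filter(is.data.frame, mget(ls(env), envir = env))

# Apply an operation to each data frame.
# ncol() stands in for something like quickpred(df, mincor = .1).
results <- lapply(dfs, ncol)
```

Because the result is a named list, each output stays attached to the name of the data frame it came from.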

how to use lapply in R

I am trying to use lapply so that I can apply a custom function on all elements of a vector in R. I am trying to avoid using a for loop here.
This is what I have so far:
writeToHDFS <- function(fileName){
print(fileName)
hdfs.init()
modelfile <- hdfs.file(fileName, "w")
hdfs.write(gaDataUser, modelfile)
hdfs.close(modelfile)
}
fileNames <- c("gaDataUser", "gaDataSession", "gaDataSources","gaDataEventTracking", "gaDataEcommerce", "gaDataEcommerce2", "gaDataEcommerce3", "gaDataEcommerce4")
lapply(fileNames,writeToHDFS(x))
I have variables with the names mentioned in the character vector fileNames.
What I need to know:
1. How to pass each string from the vector fileNames to the function writeToHDFS, since I would like this function to be executed for every element in the vector.
2. How to use a string to access the variable of that name inside the function. For example, at this line I have to access the variable whose name is the string passed as fileName:
hdfs.write(variableWithData, modelfile)
3. Can I pass the variable fileName to
modelfile <- hdfs.file(fileName, "w")
instead of passing a string for the file name?
"I am trying to use lapply so that I can apply a custom function on all elements of a vector in R"
One option in this situation is tapply:
tapply(fileNames, 1:length(fileNames), writeToHDFS)
lapply is short for "list-apply" and always returns a list, but it also accepts an atomic vector such as fileNames, which it coerces with as.list.
Since you are already set on using lapply, it is worth reading ?lapply.
writeToHDFS(x) calls the function and gives you its return value.
But you want to pass the function itself, so:
lapply(fileNames,writeToHDFS)
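For the second part of the question (reaching a variable through its name), get() is the usual tool. The sketch below uses two small made-up data frames and a stand-in function, describe(), instead of the HDFS calls, which are left out so that it runs anywhere.

```r
# Hypothetical data objects named like the strings in the vector.
gaDataUser <- data.frame(id = 1:2)
gaDataSession <- data.frame(id = 1:3)
fileNames <- c("gaDataUser", "gaDataSession")

# Stand-in for writeToHDFS(): looks the object up by name and reports its size.
describe <- function(fileName) {
  obj <- get(fileName)  # fetch the object whose name is stored in fileName
  sprintf("%s: %d rows", fileName, nrow(obj))
}

# Pass the function itself; lapply calls it once per element.
out <- lapply(fileNames, describe)
```

The same string can then serve both as the file name and as the key for get(), which covers the third part of the question as well.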

Subsetting data as generic function in R

I am trying to create a function that plots graphs for either an entire dataset, or a subset of the data. The function needs to be able to do both so that you can plot the subset if you so wish. I am struggling with just coming up with the generic subset function.
I currently have this code (I am more of a SAS user so R is confusing me a bit):
subset<-function(dat, varname, val)
if(dat$varname==val) {
data<-subset(dat, dat$varname==val)
}
But R keeps returning this error message:
Error in if (dat$varname == val) { : argument is of length zero
Could someone help me to resolve this? Thanks so much! I figure it may have to do with the way I wrote it.
First of all, the $ operator cannot take a variable: dat$varname always looks up a column literally named varname. Use dat[[varname]] instead.
The next error is that you are conditioning on a vector: dat[[varname]]==val is a logical vector, but if() expects a single TRUE or FALSE.
A third error is that you named your function subset, which masks the subset function in the base package, so the inner call to subset becomes a recursive call to your own function. To fix this, rename your function, or call the base version explicitly with base::subset(dat, dat[[varname]]==val).
The final error is that your function does not return the result. Do not assign it to the variable data; return it instead.
Here is how the code should look:
mySubset<-function(dat, varname, val)
if(any(dat[[varname]]==val, na.rm=TRUE)) {
subset(dat, dat[[varname]]==val)
} else {
NA
}
Or even better:
mySubset <- function(dat,varname,val) dat[which(dat[[varname]] == val), , drop = FALSE]
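A quick check of the column-name-as-string approach, on a made-up data frame. The sketch uses [[varname]] so the column comes back as a plain vector, and which() so that NA comparisons are dropped rather than producing NA rows; the data and column names are invented for the demo.

```r
# Row subset by a column whose name is supplied as a string.
mySubset <- function(dat, varname, val) {
  dat[which(dat[[varname]] == val), , drop = FALSE]
}

# Made-up example data.
dat <- data.frame(group = c("a", "b", "a"), score = c(1, 2, 3))

res <- mySubset(dat, "group", "a")     # keeps rows 1 and 3
empty <- mySubset(dat, "group", "z")   # no match gives a zero-row data frame
```

Returning a zero-row data frame on no match, rather than NA, lets a plotting function handle the full data and any subset through the same code path.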

How to list all the functions signatures in an R file?

Is there an R function that lists all the functions in an R script file along with their arguments?
i.e. an output of the form:
func1(var1, var2)
func2(var4, var10)
.
.
.
func10(varA, varB)
Using [sys.]source has the very undesirable side-effect of executing the source inside the file. At the worst this has security problems, but even “benign” code may simply have unintended side-effects when executed. At best it just takes unnecessary time (and potentially a lot).
It’s actually unnecessary to execute the code, though: it is enough to parse it, and then do some syntactical analysis.
The actual code is trivial:
file_parsed = parse(filename)
functions = Filter(is_function, file_parsed)
function_names = unlist(Map(function_name, functions))
And there you go, function_names contains a vector of function names. Extending this to also list the function arguments is left as an exercise to the reader. Hint: there are two approaches. One is to eval the function definition (now that we know it’s a function definition, this is safe); the other is to “cheat” and just get the list of arguments to the function call.
The implementation of the functions used above is also not particularly hard. There’s probably even something already in R core packages (‘utils’ has a lot of stuff) but since I’m not very familiar with this, I’ve just written them myself:
is_function = function (expr) {
if (! is_assign(expr)) return(FALSE)
value = expr[[3L]]
# is.name() guards against heads such as pkg::fun, which deparse to several strings
is.call(value) && is.name(value[[1L]]) && as.character(value[[1L]]) == 'function'
}
function_name = function (expr) {
as.character(expr[[2L]])
}
is_assign = function (expr) {
is.call(expr) && is.name(expr[[1L]]) && as.character(expr[[1L]]) %in% c('=', '<-', 'assign')
}
This correctly recognises function declarations of the forms
f = function (…) …
f <- function (…) …
assign('f', function (…) …)
It won’t work for more complex code, since assignments can be arbitrarily complex and in general are only resolvable by actually executing the code. However, the three forms above probably account for ≫ 99% of all named function definitions in practice.
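The second approach from the hint (reading the arguments straight off the parsed call) might look like the sketch below. It is an illustration rather than the answer author's code: is_fun_def() and signature_of() are names invented here, only the = and <- forms are handled, and parse(text = ...) stands in for parse(filename) so the example is self-contained.

```r
# Parse source text without evaluating it.
src <- "
func1 <- function(var1, var2) var1 + var2
func2 = function(var4, var10 = 2) NULL
x <- 42
"
exprs <- parse(text = src, keep.source = FALSE)

# TRUE for top-level expressions of the form name <- function(...) / name = function(...).
is_fun_def <- function(expr) {
  is.call(expr) &&
    is.name(expr[[1L]]) &&
    as.character(expr[[1L]]) %in% c("=", "<-") &&
    length(expr) == 3L &&
    is.call(expr[[3L]]) &&
    identical(expr[[3L]][[1L]], as.name("function"))
}

# In a `function` call, element 2 is the named pairlist of formal arguments.
signature_of <- function(expr) {
  arg_names <- names(expr[[3L]][[2L]])
  sprintf("%s(%s)", as.character(expr[[2L]]), paste(arg_names, collapse = ", "))
}

signatures <- vapply(Filter(is_fun_def, exprs), signature_of, character(1))
```

Default values are dropped here; they are available as the values of the same pairlist if the output should include them.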
UPDATE: Please refer to the answer by @Konrad Rudolph instead
You can create a new environment, source your file in that environment and then list the functions in it using lsf.str() e.g.
test.env <- new.env()
sys.source("myfile.R", envir = test.env)
lsf.str(envir=test.env)
rm(test.env)
or if you want to wrap it as a function:
listFunctions <- function(filename) {
temp.env <- new.env()
sys.source(filename, envir = temp.env)
functions <- lsf.str(envir=temp.env)
rm(temp.env)
return(functions)
}
