Subsetting data as a generic function in R

I am trying to create a function that plots graphs for either an entire dataset, or a subset of the data. The function needs to be able to do both so that you can plot the subset if you so wish. I am struggling with just coming up with the generic subset function.
I currently have this code (I am more of a SAS user so R is confusing me a bit):
subset <- function(dat, varname, val)
  if (dat$varname == val) {
    data <- subset(dat, dat$varname == val)
  }
But R keeps returning this error message:
Error in if (dat$varname == val) { : argument is of length zero
Could someone help me to resolve this? Thanks so much! I figure it may have to do with the way I wrote it.

First of all, the $ operator cannot take a variable on its right-hand side. In your code you are always looking up a column literally named varname.
Replace dat$varname with dat[varname] instead.
The next error is that you are conditioning on a vector: dat[varname] == val is a vector of logicals, while if() expects a single TRUE or FALSE.
A third error is that you are naming your function subset, thereby masking the subset() function in the base package, so the inner call to subset becomes a recursive call to your own function. To fix this, rename your function, or explicitly call the base version with base::subset(dat, dat[varname] == val).
The final error is that your function does not return anything. Do not assign the result to the variable data; return it instead.
Here is how the code could look:
mySubset <- function(dat, varname, val) {
  if (any(dat[varname] == val)) {
    subset(dat, dat[varname] == val)
  } else {
    NA
  }
}
Or, even better:
mySubset <- function(dat, varname, val) dat[dat[[varname]] == val, ]
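For illustration, a quick usage sketch on the built-in mtcars data set (the column name and value here are just examples):
mySubset(mtcars, "cyl", 6)   # rows of mtcars where cyl equals 6
mySubset(mtcars, "cyl", 7)   # no match: NA with the any() version, an empty data frame with the one-liner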

Related

Is it valid to access global variables in R function and how to assign it in a package?

I have a package which provides a script and some functions. Within the script I assign a variable which is later used by the functions. This works if a function is executed from within the script, but it might fail if I just call the function on its own, since the variable doesn't exist.
If I use devtools::check() I get warnings that the variable used within the function isn't defined. How can I handle this properly?
Edit
I am thinking about using get() within the function to assign the variable inside the function and get rid of these warnings. So the question is: is myp2 the correct way of doing something like this? Maybe some tryCatch to handle errors?
ab <- c(1, 2, 3)

myp1 <- function() {
  print(ab)
  return(1)
}

myp2 <- function() {
  ab <- get('ab')
  print(ab)
  return(1)
}

myp1()
myp2()
You could do something like
if (!exists("your variable")) {
  stop("You have not defined your variable")
}
This checks whether the thing you are looking for exists. A better practice is to pass the variable into the function as an argument, with the default value being the object you are looking for.
myp <- function(x) {
  print(x)
  return(1)
}

ab <- c(1, 2, 3)
myp(x = ab)
If possible, it would also be better to replace the script itself with a function.
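To illustrate the default-value suggestion above, a minimal sketch (the guard clause and the default are assumptions about what fits your package, not part of the answer):
myp <- function(x = ab) {
  # Fail loudly if no argument is given and the global `ab` does not exist
  if (missing(x) && !exists("ab")) {
    stop("You have not defined 'ab'")
  }
  print(x)
  return(1)
}

ab <- c(1, 2, 3)
myp()          # falls back to the default, i.e. the global `ab`
myp(x = 4:6)   # or pass the data explicitly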

How do I pass a function argument into map_df()?

I am trying to create a function to clean data and return it as a data.frame in R.
I'm using the map_df() function to return the cleaned data as a data.frame, and I have a function written to clean the data.
The first thing I do is pull a list of files from a folder, then iterate through them and clean each file. I have a pre-defined set specifying which column names to pull (stored in selectCols) in case of variation between files:
files <- list.files(filepath, full.names = FALSE)
colInd <- which(names(fread(files[i], nrows = 0)) %in% gsub("_", "", selectCols))
I also have a function to clean my data, which uses fread() to read in the .csv files. It takes colInd and i as arguments to clean files iteratively.
cleanData <- function(files, i, colInd) {
  addData <- fread(files[i], select = c(colInd))
  [...]
}
Overall it looks like this (as a recursive function):
i <- 1
files <- list.files(filepath, full.names = FALSE)

iterateCleaning <- function(files, i) {
  colInd <- which(names(fread(files[i], nrows = 0)) %in% gsub("_", "", selectCols))
  if (length(colInd) == length(selectCols)) {
    newData <- map_df(files, cleanData)
    saveToFolder(newData, i, files)
  }
  i <- i + 1
  if (i <= length(files)) {
    iterateCleaning(files, i)
  }
}
When I try to run without specifying the arguments for my function I get this error:
Error in fread(files,select=c(colInd)):
argument "colInd" is missing, with no default.
When I insert it into my map_df() I do it like so:
newData <- map_df(files,i,colInd,cleanData)
Then I get this error:
Error in as_mapper(.f,...): object 'colInd' not found.
Any suggestions for resolving this error? As I understand it, map_df() applies the function to each element of its input, but I don't want it applied to the i and colInd inputs; I just need them passed along to the function I am calling inside map_df(). How can I call map_df() with a function that requires additional arguments?
I read the documentation but it seemed a bit confusing. It says to use "." for a single-argument function and .x and .y for a two-argument function, but I'm not sure what that means. My initial guesses were something like these, but neither line works:
newData <- map_df(files,cleanData,.i,.colInd)
newData <- map_df(files,cleanData,.x=i,.y=colInd)
Any recommendations? Will I get the same output if I just call map_df() afterwards on the output of my function?
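For what it's worth, purrr's map functions forward any additional arguments after .f to every call of .f. A minimal sketch, assuming cleanData is rewritten to take a single file path plus colInd (the one-file signature is my assumption, not the original code):
library(data.table)
library(purrr)

cleanDataOne <- function(file, colInd) {
  addData <- fread(file, select = colInd)
  # ... further cleaning steps would go here ...
  addData
}

# colInd is forwarded through map_df()'s ... to every cleanDataOne(file, colInd) call
newData <- map_df(files, cleanDataOne, colInd = colInd)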

using callCC with higher-order functions in R

I'm trying to figure out how to get R's callCC function, which allows short-circuiting evaluation of a function, to work with higher-order functions like lapply and Reduce.
Motivation
This would let Reduce and lapply exit a computation early, so they no longer always have to traverse the whole input (better than O(n) in the best case).
For example, if I'm searching for a value in a list I could map a 'finder' function across the list, and the second the value is found lapply stops running and that value is returned (much like breaking out of a loop, or using a return statement to exit early).
The problem is that I am having trouble writing the functions that lapply and Reduce should take in the style that callCC requires.
Example
Say I'm trying to write a function to find the value '100' in a list: something equivalent to
imperativeVersion <- function (xs) {
  for (val in xs) if (val == 100) return (val)
}
The function to pass to lapply would look like:
find100 <- function (val) { if (val == 100) SHORT_CIRCUIT(val) }
functionalVersion <- function (xs) lapply(xs, find100)
This (obviously) crashes, since the short circuiting function hasn't been defined yet.
callCC( function (SHORT_CIRCUIT) lapply(1:1000, find100) )
The problem is that this also crashes, because the short-circuiting function wasn't in scope when find100 was defined. I would like something similar to this to work.
The following works, because SHORT_CIRCUIT is defined at the time the function passed to lapply is created:
callCC(
  function (SHORT_CIRCUIT) {
    lapply(1:1000, function (val) {
      if (val == 100) SHORT_CIRCUIT(val)
    })
  }
)
How can I make SHORT_CIRCUIT visible to the function passed to lapply without defining it inline as above?
I'm aware this example can be achieved using loops, Reduce or any number of other ways. I am looking specifically for a solution to the problem of using callCC with lapply and Reduce.
If I was vague or any clarification is needed please leave a comment below. I hope someone can help with this :)
Edit One:
The approach should be 'production-quality'; no deparsing functions or similar black magic.
I found a solution to this problem:
find100 <- function (val) {
  if (val == 100) SHORT_CIRCUIT(val)
}

short_map <- function (fn, coll) {
  callCC(function (SHORT_CIRCUIT) {
    clone_env <- new.env(parent = environment(fn))
    clone_env$SHORT_CIRCUIT <- SHORT_CIRCUIT
    environment(fn) <- clone_env
    lapply(coll, fn)
  })
}

short_map(find100, c(1, 2, 100, 3))
The trick to making higher-order functions work with callCC is to assign the short-circuiting function into the input function's environment before carrying on with the rest of the program. I cloned the environment to avoid unintended side effects.
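Since the question also asks about Reduce, the same environment-cloning trick carries over directly. A sketch (short_reduce and the example fold are my own names, not part of the original answer):
short_reduce <- function (fn, coll, init) {
  callCC(function (SHORT_CIRCUIT) {
    clone_env <- new.env(parent = environment(fn))
    clone_env$SHORT_CIRCUIT <- SHORT_CIRCUIT
    environment(fn) <- clone_env
    Reduce(fn, coll, init)
  })
}

# Stop folding as soon as the running total exceeds 100
short_reduce(function (acc, x) {
  if (acc + x > 100) SHORT_CIRCUIT(acc) else acc + x
}, 1:1000, 0)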
You can achieve this using metaprogramming in R.
#alexis_laz's approach was in fact already metaprogramming, but it used strings, which are a dirty hack and error prone, so you did well to reject it.
The correct way to follow that idea is to manipulate the code itself. In base R this is done with substitute(). There are better packages for this, e.g. rlang by Hadley Wickham, but here is a base R solution (fewer dependencies).
lapply_ <- function(lst, FUN) {
  eval.parent(
    substitute(
      callCC(function(return_) {
        lapply(lst_, FUN_)
      }),
      list(lst_ = lst, FUN_ = substitute(FUN))
    )
  )
}
Your SHORT_CIRCUIT function is really a general control-flow return function (or a break that takes the value to return), so I call it return_.
We want a lapply_ function in whose FUN= part we can use return_ to break out of the usual lapply().
As you showed, this is the aim:
callCC(
  function (return_) {
    lapply(1:1000, function (x) if (x == 100) return_(x))
  }
)
The only problem is that we want to be able to generalize this expression. We want
callCC(
  function(return_) lapply(lst_, FUN_)
)
where, inside the function definition we supply for FUN_, we can use return_.
The function definition can only see return_, however, if its code is inserted literally into this expression.
This is exactly what #alexis_laz did using a string and eval().
You did the same by manipulating the function's environment.
We can safely achieve the insertion of literal code using substitute(expr, replacer_list), where expr is the code to be manipulated and replacer_list is a lookup table for the replacements.
With substitute(FUN) we take the literal code given for FUN= in lapply_ without evaluating it. This returns quoted code (better than the string in #alexis_laz's approach).
The big substitute() call says: "Take the expression callCC(function(return_) lapply(lst_, FUN_)), replace lst_ with the list given for lst, and replace FUN_ with the quoted expression given for FUN."
The replaced expression is then evaluated in the parent environment (eval.parent()), meaning: it takes the place of the lapply_() call and is executed exactly where that call was made.
Such use of eval.parent() (or eval(..., envir = parent.frame())) is foolproof; otherwise tidyverse packages wouldn't be production quality.
So in this way, you can generalize callCC() calls.
lapply_(1:1000, FUN=function(x) if (x==100) return_(x))
## [1] 100
I don't know if it can be of use, but:
find100 <- "function (val) { if (val == 100) SHORT_CIRCUIT(val) }"
callCC( function (SHORT_CIRCUIT) lapply(1:1000, eval(parse(text = find100))) )
#[1] 100

"could not find function" when using functions as arguments

I have two .R files: plotDataSet(..) and plotAllDataSets(). plotDataSet(..) calls curve(..) (from the R graphics package), while plotAllDataSets() calls plotDataSet(..). plotDataSet(..) takes a function as a parameter and passes it on to curve(..).
I want to pass my function argument for curve(..) into plotDataSet(..) from a list of functions, such as:
v <- c(function(x) {x}, function(x) {x*x}, function(x) {x*x}, function(x) {x*x*x},
       function(x) {x*x}, function(x) {x*x*x}, function(x) {x*x*x})

for (i in 1:7) {
  plotSaveData(data, v[i], i)
}
I get the following output:
Error in eval(expr, envir, enclos) :
  could not find function "expectedOrderEquation"
Interestingly, when I call plotDataSet(..) and pass in a function literal like function(x){x*x}, it works fine:
for (i in 1:7) {
  plotSaveData(data, function(x) {x}, i)
}
But this doesn't let me call plotSaveData(..) while cycling through a list of functions.
Can someone please explain why this does not work?
I hope this is sufficient, but I am happy to provide more context as needed. Also, I am a bit new to R, so any corrections to my descriptions would be helpful.
Use double brackets instead of single brackets: v[[i]] instead of v[i].
Have a look at the difference between these two:
v[[i]](3)
v[i](3)   # Error: attempt to apply non-function
The single bracket returns a list whose contents is the function; the double bracket returns the function itself.
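Applied to the loop from the question, that would look like this (sketch):
for (i in 1:7) {
  plotSaveData(data, v[[i]], i)  # [[ extracts the function itself
}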

Writing a user-function to return column position, column name, mode and class for every variable

I need to write a user-defined function that, when applied to a data frame, returns the column position, the column name, the mode, and the class of each variable. I am able to create one that returns mode and class, but I keep getting errors when I include the position/name. I have been doing this:
myFunction <- function(x) {
  data.frame(mode(x), class(x))
}

data.frame(names(myData), myFunction(myData))
and it returns the right information, but it doesn't combine everything into the single function I need. Any advice?
You can combine it as follows:
myFunction <- function(x)
  data.frame(mode(x), class(x), cname = names(x), cpos = 1:ncol(x))
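Note that mode(x) and class(x) applied to the whole data frame each give a single value ("list" and "data.frame") that gets recycled across rows. If the goal is the mode and class of each individual column, a sketch using sapply (my variation, not the answer above):
myColInfo <- function(x)
  data.frame(cpos  = seq_along(x),
             cname = names(x),
             mode  = sapply(x, mode),
             class = sapply(x, class))

myColInfo(iris)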
