R retrieving values from data.tables while within custom functions - r

I'm trying to retrive a value from a datatable as part of a larger custom function. I can gnerate the row number, but I can't retrive the values from that row. The formula works outside of the function environment but not inside.
example_outlier_table <- data.table(dataset = c("doe", "ray", "me", "fa", "so"),
upper_limit = c(2,6,9,11,7))
example_function <- function(dt,otable){
return(match(deparse(substitute(dt)), otable$dataset))
}
example_function(ray, example_outlier_table)
result = 2
This is correct, 'ray' is the second entry in the 'dataset' column
In this example, 'ray' is both the character string in 'example_outlier_table$dataset' and the name of another data table object, hence the 'deparse(substitute(dt))' step.
The issue is this: I want to use the value that 'ray' indicates in the example_outlier_table, number 6, in another place within my custom function.
example_function <- function(dt,otable){
return(otable[dataset == as.character(deparse(substitute(dt))),
upper_limit])
}
example_function(ray, example_outlier_table)
result = numeric(0)
incorrect
example_function <- function(dt,otable){
return(otable[match(deparse(substitute(dt)), otable$dataset),
upper_limit])
}
example_function(ray, example_outlier_table)
result = [1] NA

We could directly extract the column with [[
example_function <- function(dt,otable){
dt <- deparse(substitute(dt))
otable[["upper_limit"]][otable[["dataset"]] == dt]
}
-testing
example_function(ray, example_outlier_table)
[1] 6
Or using the data.table methods
example_function <- function(dt,otable){
dt <- deparse(substitute(dt))
otable[dataset == dt, upper_limit][[1]]
}
example_function(ray, example_outlier_table)
[1] 6

Related

How to get the original name of a object when sending a list of objects to a function

Let's say I have the following function:
return_name <- function(data){
for(datasets in data)
print(deparse(substitute(datasets)))
}
my_data_1 <- data.frame(a = "a", b = "b")
my_data_2 <- data.frame(a = "a", b = "b")
return_name(list(my_data, my_data_2))
I'd like to be able for this function to print my_data_1 followed by my_data_2 (the name of the object in memory.
Instead it prints the structure of the entire object.
Edit for #Ronak
In my actual code I am iterating over a list of dataframes (not a named list). I need to be able to grep on the name of the current object in the for loop.
It looks something like this:
data_list = list(my_data_1, my_data_2)
random_function <- function(data_list){
for(datasets in data_list)
value = ifelse(grepl("my_data_1", return_name(datasets)), 1, 0)
}
The problem is that the return_name function described in your answer will return "datasets" and not the actual, original object name.
This is kind of a hack :
return_name <- function(data){
strsplit(gsub('list|[()]', '', deparse(substitute(data))), ',\\s*')[[1]]
}
return_name(list(my_data, my_data_2))
#[1] "my_data" "my_data_2"
return_name(my_data_1)
#[1] "my_data_1"

Function in R for validating existence of specific columns on a data.frame

I'd like to validate that a data.frame contains columns with specific names. Ideally this would be a utility function that I can just pass the data.frame and expected column names and the function will raise an error if the data.frame does not contain the expected columns. I have written my own function below, however, this seems like something that would already exist in the R ecosystem.
My questions are:
Does such a function (or one-liner) already exist either in base R or in a common package?
If not, any suggestions for my function (below)?
Example of the function I have written to do this:
validate_df_columns <- function(df, columns) {
chr_df <- deparse(substitute(df))
chr_columns <- paste(columns, collapse = ", ")
if (!('data.frame' %in% class(df))) {
stop(paste("Argument", df, "must be a data.frame."))
}
if (sum(colnames(df) %in% columns) != length(columns)) {
stop(paste(chr_df, "must contain the columns", chr_columns))
}
}
validate_df_columns(data.frame(a=1:3, b=4:6), c("a", "b", "c'"))
## Error in validate_df_columns(data.frame(a = 1:3, b = 4:6), c("a", "b", :
## data.frame(a = 1:3, b = 4:6) must contain the columns a, b, c'
The packages tibble and rlang, part of tidyverse have a function to check this :
library(tibble) # or library(rlang) or library(tidyverse)
has_name(iris, c("Species","potatoe"))
# [1] TRUE FALSE
Technically it lives in rlang and its code is just :
function (x, name)
{
name %in% names2(x)
}
where rlang::names2 is an enhanced version of base::names which returns a vector of empty strings rather than NULL when the object doesn't have names.
Here's a way to rewrite your function :
validate_df_columns <- function(df, columns){
if (!is.data.frame(df)) {
stop(paste("Argument", deparse(substitute(df)), "must be a data.frame."))
}
if(!all(i <- rlang::has_name(df,columns)))
stop(sprintf(
"%s doesn't contain: %s",
deparse(substitute(df)),
paste(columns[!i], collapse=", ")))
}
validate_df_columns(iris, c("Species","potatoe","banana"))
# Error in validate_df_columns(iris, c("Species", "potatoe", "banana")) :
# iris doesn't contain: potatoe, banana
Using deparse(substitute(...)) here makes little sense to me though, as it's not used interactively, clearer in my opinion to just say "df".
The %in% operator works with pairs of vectors, so there is already a one-liner we can use here. Consider:
df <- data.frame(a=c(1:3), b=c(4:6), c=c(7:9))
names <- c("a", "c", "blah", "doh")
names[names %in% names(df)]
[1] "a" "c"
If you want to assert that the data frame contains all the input names, then just use:
length(names %in% names(df)) == length(names) # to check all inputs are present
length(names %in% names(df)) == length(names(df)) # to check that input matches df

Converting Dataframe into List and inputting the function

I have a dataset with different receipts, items and its value:
library(arules)
library(VennDiagram)
Data <- data.frame(
Receipt_ID = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,5,5,6,6,6),
item = c('a','b','c','k','a','b','d','k','a','k','c','q','k', 'a','b','a','a', 'b', 'c'
)
,
value = c(2,3,2,4,2,5,6,7,8,1,2,3,4,5,6,7,8,3,4
)
)
And I created a market basket, with some output like below:
rules <- data.frame(
Target = c("a","a","b"),
item1 = c("b","b","k" )
,
item2 = c("k","",""),
lift = c(1,2,3),
confidence = c(0.08,0.45,0.1)
)
)
I created a function which will take the Vector of items as input(Combination of Target, item1, item2 in vector form)
findvalue <- function (vectormb) {
keyvalue <- Data[Data$item %in% vectormb, ]
keyvaluetran <- sapply(vectormb,function(x){
ret <- unique (keyvalue$Receipt_ID[keyvalue$item==x])
})
#reducetran <- calculate.overlap(keyvaluetran)
reducetran1 <- Reduce(intersect,keyvaluetran)
totalsales <- sum (keyvalue$value[keyvalue$Receipt_ID %in% reducetran1])
return(totalsales)
}
And I comverted the Dataframe to List
createlist <- function(data){
subset <- data.frame(data$Target,data$item1,data$item2)
listdata <- apply(subset,1,
function(x){
ret<-list( x[1:ncol(subset)])
names(ret)<-as.character(x[1])
return(ret)
})
listdata <- createlist(rules)
List data converts the dataframe to vector.
My question is how can I pass the Listdata Output to the function findvalue.
I tried couple of things:
findvalue(c("a","b","k")) is giving proper value as 23.
When I tried findvalue(listdata[1]), this is not giving any value.
What mistake I did. Listdata should be sent to findvalue in a loop.
Your createlist() function has no return value - hence it returns nothing for you. Here is what I would suggest:
createlist <- function(data) {
subset <- data.frame(data$Target,data$item1,data$item2)
apply(subset,1,
function(x){
ret<-list( x[1:ncol(subset)])
names(ret)<-as.character(x[1])
return(ret)
})
}
listdata <- createlist(rules)
This will create your listdata variable properly.
Then, in order to run findvalue() on it, the way the function is currently written, you need to pass on a vector, and not a list. Thus:
> class(listdata[1])
[1] "list"
> findvalue(unlist(listdata[1]))
[1] 23

How to use the for loop with function needing for a string field?

I am using the smbinning R package to compute the variables information value included in my dataset.
The function smbinning() is pretty simple and it has to be used as follows:
result = smbinning(df= dataframe, y= "target_variable", x="characteristic_variable", p = 0.05)
So, df is the dataset you want to analyse, y the target variable and x is the variable of which you want to compute the information value statistics; I enumerate all the characteristic variables as z1, z2, ... z417 to be able to use a for loop to mechanize all the analysis process.
I tried to use the following for loop:
for (i in 1:417) {
result = smbinning(df=DATA, y = "FLAG", x = "DATA[,i]", p=0.05)
}
in order to be able to compute the information value for each variable corresponding to i column of the dataframe.
The DATA class is "data.frame" while the resultone is "character".
So, my question is how to compute the information value of each variable and store that in the object denominated result?
Thanks! Any help will be appreciated!
No sample data is provided I can only hazard a guess that the following will work:
results_list = list()
for (i in 1:417) {
current_var = paste0('z', i)
current_result = smbinning(df=DATA, y = "FLAG", x = current_var, p=0.05)
results_list[i] = current_result$iv
}
You could try to use one of the apply methods, iterating over the z-counts. The x value to smbinning should be the column name not the column.
results = sapply(paste0("z",1:147), function(foo) {
smbinning(df=DATA, y = "FLAG", x = foo, p=0.05)
})
class(results) # should be "list"
length(results) # should be 147
names(results) # should be z1,...
results[[1]] # should be the first result, so you can also iterate by indexing
I tried the following, since you had not provided any data
> XX=c("IncomeLevel","TOB","RevAccts01")
> res = sapply(XX, function(z) smbinning(df=chileancredit.train,y="FlagGB",x=z,p=0.05))
Warning message:
NAs introduced by coercion
> class(res)
[1] "list"
> names(res)
[1] "IncomeLevel" "TOB" "RevAccts01"
> res$TOB
...
HTH

Simplify ave() or aggregate() with several inputs

How can I write this all in one line?
mydata is a "zoo" series, limit is a numeric vector of the same size
tmp <- ave(coredata(mydata), as.Date(index(mydata)),
FUN = function(x) cummax(x)-x)
tmp <- (tmp < limit)
final <- ave(tmp, as.Date(index(mydata)),
FUN = function(x) cumprod(x))
I've tried to use two vectors as argument to ave(...) but it seems to accept just one even if I join them into a matrix.
This is just an example, but any other function could be use.
Here I need to compare the value of cummax(mydata)-mydata with a numeric vector and
once it surpasses it I'll keep zeros till the end of the day. The cummax is calculated from the beginning of each day.
If limit were a single number instead of a vector (with different possible numbers) I could write it:
ave(coredata(mydata), as.Date(index(mydata)),
FUN = function(x) cumprod((cummax(x) - x) < limit))
But I can't introduce there a vector longer than x (it should have the same length than each day) and I don't know how to introduce it as another argument in ave().
Seems like this routine imposes intraday stoploss based on maxdrawdown. So I assume you want to be able to pass in variable limit as a second argument to your aggregation function which only currently only takes 1 function due to the way ave works.
If putting all this in one line is not an absolute must, I can share a function I've written that generalizes aggregation via "cut variables". Here's the code:
mtapplylist2 <- function(t, IDX, DEF, MoreArgs=NULL, ...)
{
if(mode(DEF) != "list")
{
cat("Definition must be list type\n");
return(NULL);
}
a <- c();
colnames <- names(DEF);
for ( i in 1:length(DEF) )
{
def <- DEF[[i]];
func <- def[1];
if(mode(func) == "character") { func <- get(func); }
cols <- def[-1];
# build the argument to be called
arglist <- list();
arglist[[1]] <- func;
for( j in 1:length(cols) )
{
col <- cols[j];
grp <- split(t[,col], IDX);
arglist[[1+j]] <- grp;
}
arglist[["MoreArgs"]] <- MoreArgs;
v <- do.call("mapply", arglist);
# print(class(v)); print(v);
if(class(v) == "matrix")
{
a <- cbind(a, as.vector(v));
} else {
a <- cbind(a, v);
}
}
colnames(a) <- colnames;
return(a);
}
And you can use it like this:
# assuming you have the data in the data.frame
df <- data.frame(date=rep(1:10,10), ret=rnorm(100), limit=rep(c(0.25,0.50),50))
dfunc <- function(x, ...) { return(cummax(x)-x ) }
pfunc <- function(x,y, ...) { return((cummax(x)-x) < y) }
# assumes you have the function declared in the same namespace
def <- list(
"drawdown" = c("dfunc", "ret"),
"hasdrawdown" = c("pfunc", "ret", "limit")
);
# from R console
> def <- list("drawdown" = c("dfunc", "ret"),"happened" = c("pfunc","ret","limit"))
> dim( mtapplylist2(df, df$date, def) )
[1] 100 2
Notice that the "def" variable is a list containing the following items:
computed column name
vector arg function name as a string
name of the variable in the input data.frame that are inputs into the function
If you look at the guts of "mtapplylist2" function, the key components would be "split" and "mapply". These functions are sufficiently fast (I think split is implemented in C).
This works with functions requiring multiple arguments, and also for functions returning vector of the same size or aggregated value.
Try it out and let me know if this solves your problem.

Resources