I need to use the below function in loop as i have 100s of variables.
binning <- function (df,vars,by=0.1,eout=TRUE,verbose=FALSE) {
for (col in vars) {
breaks <- numeric(0)
if(eout) {
x <- boxplot(df[,col][!df[[col]] %in% boxplot.stats(df[[col]])$out],plot=FALSE)
non_outliers <- df[,col][df[[col]] <= x$stats[5] & df[[col]] >= x$stats[1]]
if (!(min(df[[col]])==min(non_outliers))) {
breaks <- c(breaks, min(df[[col]]))
}
}
breaks <- c(breaks, quantile(if(eout) non_outliers else df[[col]], probs=seq(0,1, by=by)))
if(eout) {
if (!(max(df[[col]])==max(non_outliers))) {
breaks <- c(breaks, max(df[[col]]))
}
}
return (cut(df[[col]],breaks=breaks,include.lowest=TRUE))
}}
It creates a variable with binned score. The naming convention of variable is "the original name" plus "_bin".
data$credit_amount_bin <- iv.binning.simple(data,"credit_amount",eout=FALSE)
I want the function runs for all the NUMERIC variables and store the converted bins variables in a different data frame and name them with "the original name _bin".
Any help would be highly appreciated.
Using your function, you could go via lapply, looping over all values that are numeric.
# some data
dat0 <- data.frame(a=letters[1:10], x=rnorm(10), y=rnorm(10), z=rnorm(10))
# find all numeric by names
vars <- colnames(dat0)[which(sapply(dat0,is.numeric))]
# target data set
dat1 <- as.data.frame( lapply(vars, function(x) binning(dat0,x,eout=FALSE)) )
colnames(dat1) <- paste(vars, "_bin", sep="")
Personally, I would prefer having this function with vector input instead of data frame plus variable names. It might run more efficiently, too.
Related
i'm performing Anova testing for my current datasets that has multiple columns which i am trying to loop to make things easier but it seems to me that i am always facing the same error called "variable lengths differ"
here is my code for the loop:
for(i in 5:125){
WL<- colnames(NB[i])
model <- lm(WL ~ Treatment , data = NB)
if(!exists("aovNB")){
aovNB<-anova(model)
}
if(exists("aovNB")){
aovNB <- rbind(aovNB,anova(model))
}
}
and i'm wondering if it is possible that way to store the column names into WL variable which i can use to read the multiple columns i have.
thanks if anyone could solve it. i'm using base R.
Use reformulate/as.formula to create formula from strings. Also instead of rbinding the datasets in a loop store them in a list.
cols <- colnames(NB)[5:125]
result <- vector('list', length(cols))
for(i in seq_along(cols)){
model <- lm(reformulate('Treatment', cols[i]) , data = NB)
result[[i]] <- anova(model)
}
If needed you can combine them using result <- do.call(rbind, result)
We may do this with paste
cols <- colnames(NB)[5:125]
result <- vector('list', length(cols))
for(i in seq_along(cols)) {
result[[i]] <- anova(lm(as.formula(paste(cols[i], '~ Treatment')), data = NB))
}
I'm trying to build a safety catch for my function.
I'd like it to take an input of a data frame, and a vector of variable names to check the class of. I'd like it to attempt to coerce any variable that fails to character, send a warning, or if coercing fails, stop the process.
This is what I've got so far:
#Inputs
varlist <- c("varA","varB","varC")
df <- data.frame(matrix(ncol=4,nrow=0, dimnames=list(NULL, c("varA", "varB", "varC","varD"))))
df$varA <- as.character(df$varA)
df$varB <- as.character(df$varB)
df$varC <- as.factor(df$varC)
df$varD <- as.numeric(df$varD)
for(i in 1:length(varlist)){
if(!(purrr::pmap(select(df, varlist[i]), ~is.character)==TRUE)){
df$varlist[i] <- as.character(df$varlist[i])
warning(paste0("Coercing variable", varlist[i], "to character"))
stopifnot(is.character(df$varlist[i]))
}
}
However, this fails,
Error in if (!(purrr::pmap(select(df, varlist[i]), ~is.character) == TRUE)) { :
argument is of length zero
Is there a simple way to do this?
Expected Output
#output console warning
Warning: Coercing variable varC to character
#Manual checks
is.character(df$varA) == TRUE #included in varlist
is.character(df$varB) == TRUE #included in varlist
is.character(df$varC) == TRUE #included in varlist, coerced from factor to character
is.character(df$varD) == FALSE #not included in varlist
I've got something that I think would work in a for loop. If your dataframe is quite large and has a lot of columns, there is a way to apply it to speed it up, but this more closely matches what you originally had.
Make data
varlist <- c("varA","varB","varC")
df <- data.frame(varA = letters[1:4],
varB = letters[11:14],
varC = letters[21:24],
varD = 1:4)
df$varA <- as.character(df$varA)
df$varB <- as.character(df$varB)
df$varC <- as.factor(df$varC)
df$varD <- as.numeric(df$varD)
I haven't added a manual Error for if as.character fails because that function will do it on its own.
for(i in (1:ncol(df))[colnames(df) %in% varlist]){
df[,i] <- as.character(df[,i])
if(!(is.character(df[,i]))){
warning(paste("Coercing variable", colnames(df)[i], "to character"))
}
}
Using Map
Update to do the same thing using Map which is the multivariate version of apply:
char_function <- function(x, name){
y <- as.character(x)
if(!(is.character(x))){
warning((paste("Coercing variable", name, "to character")))
}
return(y)
}
df.char <- data.frame(Map(char_function, df[varlist], names(df[varlist])), stringsAsFactors = F)
For a dataframe with 4 columns and 2600 rows, the for loop took 0.015 sec and Map took 0.012 sec. Depending on the size of your dataframe, it may be worth using Map
I have the following code:
df<- iris
library(svDialogs)
columnFunction <- function (x) {
column.D <- dlgList(names(x), multiple = T, title = "Spalten auswaehlen")$res
if (!length((column.D))) {
cat("No column selected\n")
} else {
cat("The following columns are choosen:\n")
print(column.D)
for (z in column.D) {
x[[z]] <- NULL #with this part I wanted to delete the above selected columns
}
}
}
columnFunction(df)
So how is it possible to address data.frame columns "dynamically" so: x[[z]] <- NULL should translate to:
df$Species <- NULL
df[["Species"]] <- NULL
df[,"Species"] <- NULL
and that for every selected column in every data.frame chosen for the function.
Well does anyone know how to archive something like that? I tried several things like with the paste command or sprintf, deparse but i didnt get it working. I also tied to address the data.frame as a global variable by using <<- but didn`t help, too. (Well its the first time i even heard about that). It looks like i miss the right method transferring x and z to the variable assignment.
If you want to create a function columnFunction that removes columns from a passed data frame df, all you need to do is pass the data frame to the function, return the modified version of df, and replace df with the result:
library(svDialogs)
columnFunction <- function (x) {
column.D <- dlgList(names(x), multiple = T, title = "Spalten auswaehlen")$res
if (!length((column.D))) {
cat("No column selected\n")
} else {
cat("The following columns are choosen:\n")
print(column.D)
x <- x[,!names(x) %in% column.D]
}
return(x)
}
df <- columnFunction(df)
uniq <- unique(file[,12])
pdf("SKAT.pdf")
for(i in 1:length(uniq)) {
dat <- subset(file, file[,12] == uniq[i])
names <- paste("Sample_filtered_on_", uniq[i], sep="")
qq.chisq(-2*log(as.numeric(dat[,10])), df = 2, main = names, pvals = T,
sub=subtitle)
}
dev.off()
file[,12] is an integer so I convert it to a factor when I'm trying to run it with by instead of a for loop as follows:
pdf("SKAT.pdf")
by(file, as.factor(file[,12]), function(x) { qq.chisq(-2*log(as.numeric(x[,10])), df = 2, main = paste("Sample_filtered_on_", file[1,12], sep=""), pvals = T, sub=subtitle) } )
dev.off()
It works fine to sort the data frame by this (now a factor) column. My problem is that for the plot title, I want to label it with the correct index from that column. This is easy to do in the for loop by uniq[i]. How do I do this in a by function?
Hope this makes sense.
A more vectorized (== cooler?) version would pull the common operations out of the loop and let R do the book-keeping about unique factor levels.
dat <- split(-2 * log(as.numeric(file[,10])), file[,12])
names(dat) <- paste0("IoOPanos_filtered_on_pc_", names(dat))
(paste0 is a convenience function for the common use case where normally one would use paste with the argument sep=""). The for loop is entirely appropriate when you're running it for its side effects (plotting pretty pictures) rather than trying to capture values for further computation; it's definitely un-cool to use T instead of TRUE, while seq_along(dat) means that your code won't produce unexpected results when length(dat) == 0.
pdf("SKAT.pdf")
for(i in seq_along(dat)) {
vals <- dat[[i]]
nm <- names(dat)[[i]]
qq.chisq(val, main = nm, df = 2, pvals = TRUE, sub=subtitle)
}
dev.off()
If you did want to capture values, the basic observation is that your function takes 2 arguments that vary. So by or tapply or sapply or ... are not appropriate; each of these assume that just a single argument is varying. Instead, use mapply or the comparable Map
Map(qq.chisq, dat, main=names(dat),
MoreArgs=list(df=2, pvals=TRUE, sub=subtitle))
How can I write this all in one line?
mydata is a "zoo" series, limit is a numeric vector of the same size
tmp <- ave(coredata(mydata), as.Date(index(mydata)),
FUN = function(x) cummax(x)-x)
tmp <- (tmp < limit)
final <- ave(tmp, as.Date(index(mydata)),
FUN = function(x) cumprod(x))
I've tried to use two vectors as argument to ave(...) but it seems to accept just one even if I join them into a matrix.
This is just an example, but any other function could be use.
Here I need to compare the value of cummax(mydata)-mydata with a numeric vector and
once it surpasses it I'll keep zeros till the end of the day. The cummax is calculated from the beginning of each day.
If limit were a single number instead of a vector (with different possible numbers) I could write it:
ave(coredata(mydata), as.Date(index(mydata)),
FUN = function(x) cumprod((cummax(x) - x) < limit))
But I can't introduce there a vector longer than x (it should have the same length than each day) and I don't know how to introduce it as another argument in ave().
Seems like this routine imposes intraday stoploss based on maxdrawdown. So I assume you want to be able to pass in variable limit as a second argument to your aggregation function which only currently only takes 1 function due to the way ave works.
If putting all this in one line is not an absolute must, I can share a function I've written that generalizes aggregation via "cut variables". Here's the code:
mtapplylist2 <- function(t, IDX, DEF, MoreArgs=NULL, ...)
{
if(mode(DEF) != "list")
{
cat("Definition must be list type\n");
return(NULL);
}
a <- c();
colnames <- names(DEF);
for ( i in 1:length(DEF) )
{
def <- DEF[[i]];
func <- def[1];
if(mode(func) == "character") { func <- get(func); }
cols <- def[-1];
# build the argument to be called
arglist <- list();
arglist[[1]] <- func;
for( j in 1:length(cols) )
{
col <- cols[j];
grp <- split(t[,col], IDX);
arglist[[1+j]] <- grp;
}
arglist[["MoreArgs"]] <- MoreArgs;
v <- do.call("mapply", arglist);
# print(class(v)); print(v);
if(class(v) == "matrix")
{
a <- cbind(a, as.vector(v));
} else {
a <- cbind(a, v);
}
}
colnames(a) <- colnames;
return(a);
}
And you can use it like this:
# assuming you have the data in the data.frame
df <- data.frame(date=rep(1:10,10), ret=rnorm(100), limit=rep(c(0.25,0.50),50))
dfunc <- function(x, ...) { return(cummax(x)-x ) }
pfunc <- function(x,y, ...) { return((cummax(x)-x) < y) }
# assumes you have the function declared in the same namespace
def <- list(
"drawdown" = c("dfunc", "ret"),
"hasdrawdown" = c("pfunc", "ret", "limit")
);
# from R console
> def <- list("drawdown" = c("dfunc", "ret"),"happened" = c("pfunc","ret","limit"))
> dim( mtapplylist2(df, df$date, def) )
[1] 100 2
Notice that the "def" variable is a list containing the following items:
computed column name
vector arg function name as a string
name of the variable in the input data.frame that are inputs into the function
If you look at the guts of "mtapplylist2" function, the key components would be "split" and "mapply". These functions are sufficiently fast (I think split is implemented in C).
This works with functions requiring multiple arguments, and also for functions returning vector of the same size or aggregated value.
Try it out and let me know if this solves your problem.