Fixing a function to find and remove outliers from the dataset

I am trying to make a simple function which will find and remove outliers automatically. This is the function I have created so far:
fOutlier <- function(x, y) {
outlier <- with(x, boxplot.stats(y)$out)
subset(x, !(y %in% outlier))
}
data <- fOutlier(data, variable)
The problem is that the function does not read x as dataset name. It works if I use the following:
data <- fOutlier(data, data$variable)

Non-standard evaluation seems to be the culprit.
This is what I would personally do.
set.seed(1)
# mock data set
d<-data.frame(var1=rnorm(1000,500,50),
var2=rnorm(1000,1000,100),
var3=rnorm(1000,1000,100),
var4=rnorm(1000,1000,100))
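fOutlier<-function(dat, var_name){
# extract the column by name; [[ ]] also works for tibbles
var_vec<-dat[[var_name]]
outliers<-boxplot.stats(var_vec)$out
# return the rows whose value is not flagged as an outlier
dat[!(var_vec %in% outliers),]
}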
# test with different variables
d_var1_clean<-fOutlier(d, 'var1')
d_var2_clean<-fOutlier(d, 'var2')
d_var3_clean<-fOutlier(d, 'var3')
If you really like the non-standard evaluation, then you can add eval() and substitute() to maintain this functionality.
This function is a workable version of what you posted (note the creation of y_vec):
fOutlier2 <- function(x, y) {
y_vec<-eval(substitute(y),eval(x))
outlier <- boxplot.stats(y_vec)$out
subset(x, !(y_vec %in% outlier))
}
d_var1_clean2<-fOutlier2(d, var1)
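
To make the eval()/substitute() idea concrete, here is a minimal sketch of a variant that accepts either a bare column name or a quoted one. The name remove_outliers and the small data set are made up for illustration, and the check for a single string is a simplification that would misfire on a one-row character column.

```r
# Hypothetical helper: accepts remove_outliers(d, v) or remove_outliers(d, "v")
remove_outliers <- function(dat, var) {
  # capture the argument unevaluated, then evaluate it with the data frame
  # as the first place to look and the caller's environment as fallback
  var_vec <- eval(substitute(var), dat, parent.frame())
  # a single string is treated as a column name (simplifying assumption)
  if (is.character(var_vec) && length(var_vec) == 1L) {
    var_vec <- dat[[var_vec]]
  }
  keep <- !(var_vec %in% boxplot.stats(var_vec)$out)
  dat[keep, , drop = FALSE]
}

# mock data: ten ordinary values and one extreme value
d_demo <- data.frame(v = c(1:10, 100), id = 1:11)

clean_bare   <- remove_outliers(d_demo, v)
clean_quoted <- remove_outliers(d_demo, "v")
nrow(clean_bare)
```

Both calls drop the row holding the extreme value. The quoted form is usually easier to use from inside other functions, which is why the first answer prefers it.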

Related

Plot function by condition and subsample for factor data

I'm working on a plotting function for the Likert data from a survey. I have to make quite a lot of plots, so I'm trying to make the function as automated and user-friendly as possible, but I'm having some problems and really need help finishing it.
These are the data:
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
df1[colnames(df1)] <- lapply(df1[colnames(df1)], factor)
Columns A and B pertain to the "Technology" section of my survey, while C, D and E are in "Social".
I have transformed my data using the likert package and compiled them in a list so they can be called more easily in my function (I don't know if it's the best way to go about it; I'm still quite new to R, so feel free to make suggestions even on this point):
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
So far, here is the function I could come up with (with great help of user #gaut):
mynames <- sapply(names(tbls), function(x) {
paste("How do they rank? -",gsub("\\.",": ",x))
})
myfilenames <- names(tbls)
plot_likert <- function(x, myname, myfilename){
p <- plot(likert(x),
type ="bar",center=3,
group.order=names(x))+
labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x)))+
guides(fill=guide_legend("Rank"))+
ggtitle(myname)
p
}
I then lapply the function to get a list of plots:
list_plots <- lapply(1:length(tbls),function(i) {
plot_likert(tbls[[i]], mynames[i], myfilenames[i])
})
And then save them all as .png
sapply(1:length(list_plots), function(i) ggsave(
filename = paste0("plots ",i,".png"),
plot = list_plots[[i]],
width = 15, height = 9
))
Now, there are 3 main things I want my function to do but don't really know how to approach:
1) Right now I can export all the plots in one batch, but I would also like to be able to export a single plot, for example obtaining the above graph by writing:
plot_likert(tbls$dummy1.no)
2) Ideally, my plotting function would also take into account the sections of my data mentioned above, so that if I specify the section Technology I only get a Likert plot for columns A and B, and specifying the subsample gets me the dummy. Like so:
plot_likert(section=Technology, subsample=dummy1.no)
3) As you maybe have already noted, I need the titles of the plot to be fully automatic, so that by changing section or subsample they too change accordingly.
Apologies for the long/intricate question but I've been stuck on this function for quite some time and really need help finalizing it. For any further clarification/info, do not hesitate to ask!
Thank you in advance for any advice!

There are many ways to get what you want. Essentially, you need to add a few arguments to your function.
I agree with Limey though (and of course Hadley) - generally better to have a few simple functions that do a little step and then you can collate everything in one bigger function.
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
## this can be shortened
df1 <- data.frame(lapply(df1, factor))
## the rest of dummy data creation probably too, but I won't dig too much into this now
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
library(ggplot2)
library(likert)
#> Loading required package: xtable
## no need for sapply, really!
mynames <- paste("How do they rank? -", gsub("\\.",": ",names(tbls)))
myfilenames <- names(tbls)
## defining arguments with NULL makes it possible to not specify it without giving it a value
plot_likert <- function(x, myname, myfilename, section = NULL, subsample = NULL){
## first take only the tbl of interest
if(!is.null(subsample)) x <- x[subsample]
## then filter for your section and subsample
if(!is.null(section)) x <- lapply(x, function(y) y[, section, drop = FALSE])
## you can run your lapply within the function -
## ideally make a separate function and call the smaller function in the bigger one
## use seq_along
lapply(seq_along(x), function(i) {
plot(likert(x[[i]]),
type ="bar",center=3,
group.order=names(x[[i]]))+
labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x[[i]])))+
guides(fill=guide_legend("Rank")) +
## programmatic title
ggtitle(names(x)[i])
})
}
## you need to pass character vectors to your arguments
patchwork::wrap_plots(plot_likert(tbls))
patchwork::wrap_plots(plot_likert(tbls, section = LETTERS[1:2], subsample = paste("dummy1", c("no", "yes"), sep = ".")))
Created on 2022-08-17 by the reprex package (v2.0.1)
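
The advice about small functions can be sketched without any plotting at all. The two helpers below, select_tbls and make_title, are hypothetical names and not part of the answer above; the section lookup comes from the question (A and B are "Technology", C to E are "Social"), and tbls_demo is a tiny stand-in for the real tbls list.

```r
# Section lookup taken from the question text
sections <- list(Technology = c("A", "B"), Social = c("C", "D", "E"))

# Hypothetical helper: pick subsamples by name and columns by section name
select_tbls <- function(tbls, section = NULL, subsample = NULL, lookup = sections) {
  if (!is.null(subsample)) tbls <- tbls[subsample]
  if (!is.null(section)) {
    cols <- unlist(lookup[section], use.names = FALSE)
    tbls <- lapply(tbls, function(tbl) tbl[, cols, drop = FALSE])
  }
  tbls
}

# Hypothetical helper: build the plot title from the subsample and section names
make_title <- function(subsample, section = NULL) {
  title <- paste("How do they rank? -", gsub("\\.", ": ", subsample))
  if (!is.null(section)) {
    title <- paste0(title, " (", paste(section, collapse = ", "), ")")
  }
  title
}

# Tiny stand-in for the tbls list built in the question
tbls_demo <- list(
  dummy1.no  = data.frame(A = 1:2, B = 3:4, C = 5:6, D = 1:2, E = 3:4),
  dummy1.yes = data.frame(A = 2:3, B = 4:5, C = 1:2, D = 3:4, E = 5:6)
)

picked <- select_tbls(tbls_demo, section = "Technology", subsample = "dummy1.no")
picked_title <- make_title(names(picked), "Technology")
picked_title
```

A plotting function could then take one element of picked plus its title, which keeps the selection logic, the title logic and the drawing separate.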

R - Creating function call within function using relational operator as variable

I am trying to write a function that will apply a user-specified binary operator (e.g. < ) to a raster object. To do so is fairly simple. For example:
selection <- raster::overlay(x = data, fun = function(x) {return(x < 2)})
My issue is that this code would be running within a function, with which I would like to specify both the binary operator and the criteria value (which is 2 in the example above) as variables. For example:
my.func <- function(data, binary_operator, value){
selection <- raster::overlay(x=data, fun=function(x) {x binary_operator value})
return(selection)
}
I have tried to construct the function as a call without success.
my.func <- function(data, binary_operator, value){
selection <- raster::overlay(x=data, fun=function(x) {call(sprintf("x %s %s", binary_operator, value))})
return(selection)
}
Is there a way to construct the call of the second function using variables in the first function?
Thanks for your help.

Write your code like this:
my.func <- function(data, binary_operator, value){
selection <- raster::overlay(x=data, fun=function(x) binary_operator(x, value))
return(selection)
}
You need to call this as
my.func(data, `<`, 2)
(with backticks for quotes). If you want to allow "<" for the operator, you could use do.call:
my.func <- function(data, binary_operator, value){
selection <- raster::overlay(x=data, fun=function(x)
do.call(binary_operator, list(x, value)))
return(selection)
}
This will work with either form of argument.
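
The same idea can be tried without the raster package. The sketch below uses match.fun(), which is a different base R route from do.call but likewise accepts either a function or its name as a string; apply_operator is a made-up name and plain numeric vectors stand in for the raster data.

```r
# Hypothetical helper: apply a user-supplied binary operator to a vector
apply_operator <- function(values, binary_operator, value) {
  # match.fun() returns a function whether it is given `<` or "<"
  op <- match.fun(binary_operator)
  op(values, value)
}

res_string   <- apply_operator(c(1, 2, 3), "<", 2)
res_function <- apply_operator(c(1, 2, 3), `<`, 2)
res_string
```

Resolving the operator once at the top of the function also gives an early error if the caller passes something that is not a function.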

The example is probably simpler than the real case, but in the example you use, it would be more direct to do:
selection <- data < 2

Object not found - nested function - R

I am still getting used to functions. I had a look at the documentation on environments but I can't figure out how to solve the error. Let's see what I have tried so far:
I have a list of documents. Let's suppose it is "core"
library(dplyr)
table_1 <- data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
table_2 <- data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
core <- list(table_1, table_2)
Then, I have to run the function documents_ for each element of the list. This function gives some parameters to execute in another nested function:
documents_ <- function(i) {
core_processed <- as.data.frame(core[[i]])
x <- 1:nrow(core_processed)
y <- 1:ncol(core_processed)
temp <- sapply(x, function(x) mapply(calc_dens_,x,y))
return(temp)
}
Inside that, there is the function calc_dens_, which is:
calc_dens_ <- function(x, y) {
core_temp <- core_processed %>%
filter(X2 == x & X3 == y)
return(core_temp)
}
Then, to iterate over each element of the list, I tried without success:
calc <- lapply(c(1:2), function(i) documents_(i))
Error in eval(lhs, parent, parent) : object 'core_processed' not found
The calc_dens_ function doesn't get the results of documents_ (an environment problem). Is there a way to solve this, or a better approach? My function is more complex than this, but the main elements are in this example. Thank you in advance.

As the other commenters have said, the problem is that you are referring to a variable, core_processed that is not in scope. You can make it a global variable, but it might be more sensible just to use it in a closure like this:
table_1 <- data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
table_2 <- data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
cores <- list(table_1, table_2)
documents_ <- function(core_processed) {
x <- 1:nrow(core_processed)
y <- 1:ncol(core_processed)
calc_dens <- function(x, y) core_processed %>% filter(X2 == x & X3 == y)
sapply(x, function(x) mapply(calc_dens, x, y))
}
calc <- lapply(cores, documents_)
If cores is a list of data frames, you do not need to use as.data.frame, and since you use lapply, there is no need to apply over indices and then index into the list. So the code I wrote here is simplified but does the same as your code.
I have to wonder, though, is this really what you want? The sapply over x and then mapply over x and y -- where x is the one from the sapply and not the list you built in documents_ -- looks mighty strange to me.
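
The scoping rule behind the error can be shown in a few lines of base R. This is a minimal sketch with made-up function names: a function looks up free variables where it was defined, not where it is called from, so a helper defined at top level cannot see the caller's local variables.

```r
# Defined at top level: local_value is looked up in the global environment
helper_global <- function(x) local_value + x

caller_broken <- function() {
  local_value <- 10      # local to caller_broken, invisible to helper_global
  helper_global(1)
}

# Fix 1: define the helper inside the caller so it closes over local_value
caller_closure <- function() {
  local_value <- 10
  helper <- function(x) local_value + x
  helper(1)
}

# Fix 2: pass the value explicitly as an argument
helper_arg <- function(x, local_value) local_value + x
caller_argument <- function() {
  local_value <- 10
  helper_arg(1, local_value)
}

# Capture the error instead of stopping the script
broken_result <- tryCatch(caller_broken(), error = function(e) conditionMessage(e))
closure_result <- caller_closure()
argument_result <- caller_argument()
```

The closure form matches the answer above. Passing the data as an argument is often easier to test, since the helper no longer depends on where it is defined.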

dplyr and overlapping variable names with surrounding environment

Let's say I have a (dplyr/tibble) data-frame/tbl constructed like so:
df <- data_frame(x = 1:10)
Now, I'd like to use this within a function that works with df via some dplyr verbs, like so:
myfun <- function(df, x) {
x <- doSomeStuffTo(x)
filter(df, x == x)
}
But this will always return the full df... I'm trying to figure out a way to implement scoping within a dplyr verb, something like:
filter_(df, ~x == x)
... which doesn't work, either. In some other languages, you might be able to achieve this via something like:
df.filter(this.x == x)
... where this refers to the df instance.
My only work-around so far is naming the function's variable like so:
myfun <- function(df, query_x) {
query_x <- doSomeStuffTo(query_x)
filter(df, x == query_x)
}
I suspect this is doable (without using a name like query_x) somehow with SE dplyr verbs (e.g. filter_), but I haven't stumbled upon the correct pattern yet. Anyone here have the answer?

To dynamically build different dplyr commands you typically use the standard evaluation versions of the functions (the ones with the underscores) and the lazyeval package. Here's how you could change your function
doSomeStuffTo <- function(x) {x+1}
myfun <- function(df, x) {
x <- doSomeStuffTo(x)
filter_(df, lazyeval::interp(~x == y, y=x))
}
df <- data_frame(x = 1:10)
myfun(df,3)
but even in the interp we can't have x==x because it's not clear which x you want to replace. Both filter(df, 3==x) and filter(df, x==3) work with dplyr. You can have constants or column names on either side of the equality.

If you use filter_ you can pass logical expressions via quote:
myfun <- function(df, t) {
df$x <- 5*df$x
filter_(df, t )
}
myfun(df, t = quote(x < 25))
#> # A tibble: 4 x 1
#>       x
#>   <dbl>
#> 1     5
#> 2    10
#> 3    15
#> 4    20

I stumbled into the same issue. Instead of wrangling with even more complex evaluations, it's usually easier to just rename the function argument. Like this:
myfun <- function(df, x) {
x_ <- doSomeStuffTo(x)
filter(df, x == x_)
}
This solution is still dangerous because we might hit another variable called x_. One can be defensive about this by checking the variable names in df and making sure to pick one that isn't there. Or more lazily, one can use very implausible variable names. I often use stuff like _____temp.
Maybe the new dplyr 0.6.0 evaluation system will handle this better. See the notes about the new system, tidyeval.
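
For readers on a current dplyr release, where filter_ and the lazyeval approach are deprecated, the tidy evaluation system does offer a direct way to disambiguate. The sketch below assumes a recent dplyr is installed and reuses the doSomeStuffTo stand-in from the first answer; inside a data-masking verb, .data refers to columns of the data frame and .env to variables in the calling environment.

```r
library(dplyr)

# stand-in transformation, as in the first answer
doSomeStuffTo <- function(x) {x + 1}

myfun <- function(df, x) {
  x <- doSomeStuffTo(x)
  # .data$x is the column, .env$x is the function's local variable
  filter(df, .data$x == .env$x)
}

df_demo <- data.frame(x = 1:10)
res <- myfun(df_demo, 3)
res
```

Writing x == !!x inside filter() should give the same result, because the right-hand side is replaced by the value of the local variable before the expression is evaluated against the data.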

How to use transformations to variables in formulas in R

I'm trying to use transformations of my outcomevar in a function that runs a few variants of models and stores the result in a list.
The runpanels function first calls the preparedata function, which creates the lagged and differenced variables of the outcome variable specified as an argument. So after preparedata, modeldata contains outcomevar, doutcomevar and loutcomevar.
My problem is that I now need to get these transformations of the outcomevar to subset the data such that loutcomevar and doutcomevar are not zero.
And then I need to use doutcomevar and loutcomevar in the models.
set.seed(1)
df <- data.frame(firm=rep(LETTERS[1:5],each=10),
date=as.Date("2014-01-01")+1:10,
y1=sample(1:100,50),y2=sample(1:100,50),y3=sample(1:100,50),
x1=sample(1:100,50), x2=sample(1:100,50))
preparedata<-function(testData,outcomevar){
require(data.table)
DT <- as.data.table(testData)
setkey(DT,firm,date)
DT[,lag := c(NA,unlist(.SD)[-.N]), by=firm, .SDcols=outcomevar]
DT[,diff := c(NA,diff(unlist(.SD))), by=firm, .SDcols=outcomevar]
setnames(DT,c("lag","diff"),paste0(c("loutcomevar","doutcomevar")))
return(DT)
modeldata<-as.data.frame(DT)
}
runpanels <- function(testData,outcomevar) {
modeldata<-preparedata(testData,outcomevar)
modeldata<-subset(modeldata,loutcomevar!=0& doutcomevar!=0)
modellist<-list()
modellist$m1<-lm(log(outcomevar)~-1+x1+x2,data=modeldata)
modellist$m2<-lm(log(doutcomevar)~-1+x1+date,data=modeldata)
modellist$m3<-lm(log(outcomevar)~-1+log(loutcomevar)+x1+x2,data=modeldata)
return(modellist)
}
Example use: modelsID1<-runpanels(df,outcomevar="y1")
Unsurprisingly, I get the error when it gets to evaluating "loutcomevar!=0"
: Error in eval(expr, envir, enclos) : object 'loutcomevar' not found
Called from: eval(e, x, parent.frame())
So it does not find the lagged variable I created in the preparedata function in the environment of the runpanels function.
How can I call those variables?
The example solution below, from another question, uses call and is similar to my problem, but I also want to call transformations of my outcomevar, which is an argument of the function.
Any ideas how to tackle this will be much appreciated!
Example solution from other question that was kind of similar:
air <- data(airquality)
fm <- lm(Ozone ~ Solar.R, data=airquality)
myfun <- function(fm, name){
dn <- fm$call[['data']]
varname <- deparse(substitute(name))
get(as.character(dn),envir=.GlobalEnv)[varname]
}
Usage: myfun(fm, Temp)

You are assuming way too much capacity of the R interpreter to think like you do. Its powers of abstraction are much more limited. In particular there is no interpretation that would allow doutcomevar and loutcomevar to be constructed within a formula or in the subset call.
Something along these (untested) lines might work:
runpanels <- function(testData,outcomevar) {
modeldata<-preparedata(testData,outcomevar)
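# the lag and diff columns live in modeldata under the names set in preparedata
idx <- modeldata[["loutcomevar"]] != 0 &
modeldata[["doutcomevar"]] != 0
# which() drops the NA rows produced by lagging and differencing
modeldata<-modeldata[which(idx) ,]
modellist<-list()
# as.formula needs a single string, so paste the pieces together first
form1 <- as.formula( paste0("log(", outcomevar,")~-1+x1+x2") )
modellist$m1<-lm(form1,data=modeldata)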
#similar construction of formula objects for models 2 and 3
# .........
modellist$m2<-lm(log(doutcomevar)~-1+x1+date,data=modeldata)
modellist$m3<-lm(log(outcomevar)~-1+log(loutcomevar)+x1+x2,data=modeldata)
return(modellist)
}
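
The formula-building step is the part most worth checking in isolation. Here is a minimal, self-contained sketch with mock data; build_formula and d_demo are made-up names, and the single model stands in for the three in the question.

```r
set.seed(1)
# mock data with a strictly positive outcome so that log() is defined
d_demo <- data.frame(y1 = runif(20, 1, 100),
                     x1 = rnorm(20),
                     x2 = rnorm(20))

# Hypothetical helper: turn a column name into a model formula
build_formula <- function(outcomevar) {
  as.formula(paste0("log(", outcomevar, ") ~ -1 + x1 + x2"))
}

form_y1 <- build_formula("y1")
fit_y1 <- lm(form_y1, data = d_demo)
form_y1
```

Because the column name is pasted into the formula text, the fitted model records log(y1) rather than a placeholder name, so its printed call and coefficients stay readable when the function is reused for y2 or y3.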

set.seed(1)
df <- data.frame(firm=rep(LETTERS[1:5],each=10),
date=as.Date("2014-01-01")+1:10,
y1=sample(1:100,50),y2=sample(1:100,50),y3=sample(1:100,50),
x1=sample(1:100,50), x2=sample(1:100,50))
preparedata<-function(testData,outcomevar){
require(data.table)
DT <- as.data.table(testData)
setkey(DT,firm,date)
DT[,lag := c(NA,unlist(.SD)[-.N]), by=firm, .SDcols=outcomevar]
DT[,diff := c(NA,diff(unlist(.SD))), by=firm, .SDcols=outcomevar]
setnames(DT,c("lag","diff"),paste0(c("loutcomevar","doutcomevar")))
DT$outcomevar <- with(DT, eval(parse(text=outcomevar)))
return(DT)
modeldata<-as.data.frame(DT)
}
runpanels <- function(testData,outcomevar) {
modeldata<-preparedata(testData,outcomevar)
modeldata<-subset(modeldata,loutcomevar!=0& doutcomevar!=0)
modellist<-list()
modellist$m1<-lm(log(outcomevar)~-1+x1+x2,data=modeldata)
modellist$m2<-lm(log(doutcomevar)~-1+x1+date,data=modeldata)
modellist$m3<-lm(log(outcomevar)~-1+log(loutcomevar)+x1+x2,data=modeldata)
return(modellist)
}
Example use: modelsID1<-runpanels(df,outcomevar="y1")
Example use: modelsID1<-runpanels(df,outcomevar="y2")
