Making subsets by means of quantiles - r

I made quantiles of equal size with the cut2 function, now I want to make 4 different subsets, by means of the 4 quantiles.
The first and fourth quantile I can make with the subset function:
quantile1 <- subset (trial, NAG <22.1)
quantile4 <- subset(trial, NAG >=61.6)
But if I try to make subsets of the second and third quantile, it doesn’t quite work and I don’t understand why. This is what I’ve tried:
quantile2<- subset(trial, NAG >=22.1 | NAG<36.8)
quantile3<-subset(trial, NAG >=36.8 | NAG <61.6)
If I use this function, R makes a subset, but the subset consists of the total number of observations, which can’t be right. Anyone an idea about what's wrong with the syntax is or how to fix it?
Thanks in advance!

I had the same kind of problem a while ago (here). I made a GetQuantile function which could be helpful to you:
GetQuantile<-function(x,q,n){
# Extract the nth quantile from a time series
#
# args:
# x = xts object
# q = quantile of xts object
# n = nthe quantile to extract
#
# Returns:
# Returns an xts object of quantiles
# TRUE / FALSE depending on the quantile we are looking for
if(n==1) # first quantile
test<-xts((coredata(x[,])<c(coredata(q[,2]))),order.by = index(x))
else if (n== dim(q)[2]-1) # last quantile
test<-xts((coredata(x[,])>=c(coredata(q[,n]))),order.by = index(x))
else # else
test<-xts( (coredata(monthly.returns[,])>=c(coredata(q[,n]))) &
(coredata(monthly.returns[,])<c(coredata(q[,(n+1)]))) ,order.by = index(x))
# replace NA by FALSE
test[is.na(test)]<-FALSE
# we only keep returns for which we need the quantile
x[test==FALSE]<-NA
return(x)
}
with this function I can have an xts with all the monthly returns of the quantile I want and NA everywhere else. With this xts I can do some stuff like computing the mean for each quantile ect..
monthly.returns.stock.Q1<-GetQuantile(stocks.returns,stocks.quantile,1)
rowMeans(monthly.returns.stock.Q1,na.rm = TRUE)

I had the same problem. I used this:
df$cumsum <- cumsum(df$var)
# makes cumulative sum of variable; my data were in shares, so they added up
# to 100
df$quantile <- cut(df$cumsum, c(0, 25, 50, 75, 100, NA), names=TRUE)
# cuts the cumulative sum at desired percentile
For variables that do not come in shares, I used the info from the summary, where R gives you quantiles and then cut the data according to those values.
Question: Are your quantiles equal? I mean, do they all contain exactly 25% of observations? Because mine were lumpy... ie some were 22%, some 28% etc. Just curious how you may have worked around that.

Related

How do I remove outliers from my dataset? [duplicate]

I've got some multivariate data of beauty vs ages. The ages range from 20-40 at intervals of 2 (20, 22, 24....40), and for each record of data, they are given an age and a beauty rating from 1-5. When I do boxplots of this data (ages across the X-axis, beauty ratings across the Y-axis), there are some outliers plotted outside the whiskers of each box.
I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.
Nobody has posted the simplest answer:
x[!x %in% boxplot.stats(x)$out]
Also see this: http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/
OK, you should apply something like this to your dataset. Do not replace & save or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
To see it in action:
set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
And once again, you should never do this on your own, outliers are just meant to be! =)
EDIT: I added na.rm = TRUE as default.
EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)
Use outline = FALSE as an option when you do the boxplot (read the help!).
> m <- c(rnorm(10),5,10)
> bp <- boxplot(m, outline = FALSE)
The boxplot function returns the values used to do the plotting (which is actually then done by bxp():
bstats <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
#need to "waste" this plot
bstats$out <- NULL
bstats$group <- NULL
bxp(bstats) # this will plot without any outlier points
I purposely did not answer the specific question because I consider it statistical malpractice to remove "outliers". I consider it acceptable practice to not plot them in a boxplot, but removing them just because they exceed some number of standard deviations or some number of inter-quartile widths is a systematic and unscientific mangling of the observational record.
I looked up for packages related to removing outliers, and found this package (surprisingly called "outliers"!): https://cran.r-project.org/web/packages/outliers/outliers.pdf
if you go through it you see different ways of removing outliers and among them I found rm.outlier most convenient one to use and as it says in the link above:
"If the outlier is detected and confirmed by statistical tests, this function can remove it or replace by
sample mean or median" and also here is the usage part from the same source:
"Usage
rm.outlier(x, fill = FALSE, median = FALSE, opposite = FALSE)
Arguments
x a dataset, most frequently a vector. If argument is a dataframe, then outlier is
removed from each column by sapply. The same behavior is applied by apply
when the matrix is given.
fill If set to TRUE, the median or mean is placed instead of outlier. Otherwise, the
outlier(s) is/are simply removed.
median If set to TRUE, median is used instead of mean in outlier replacement.
opposite if set to TRUE, gives opposite value (if largest value has maximum difference
from the mean, it gives smallest and vice versa)
"
x<-quantile(retentiondata$sum_dec_incr,c(0.01,0.99))
data_clean <- data[data$attribute >=x[1] & data$attribute<=x[2],]
I find this very easy to remove outliers. In the above example I am just extracting 2 percentile to 98 percentile of attribute values.
Wouldn't:
z <- df[df$x > quantile(df$x, .25) - 1.5*IQR(df$x) &
df$x < quantile(df$x, .75) + 1.5*IQR(df$x), ] #rows
accomplish this task quite easily?
Adding to #sefarkas' suggestion and using quantile as cut-offs, one could explore the following option:
newdata <- subset(mydata,!(mydata$var > quantile(mydata$var, probs=c(.01, .99))[2] | mydata$var < quantile(mydata$var, probs=c(.01, .99))[1]) )
This will remove the points points beyond the 99th quantile. Care should be taken like what aL3Xa was saying about keeping outliers. It should be removed only for getting an alternative conservative view of the data.
1 way to do that is
my.NEW.data.frame <- my.data.frame[-boxplot.stats(my.data.frame$my.column)$out, ]
or
my.high.value <- which(my.data.frame$age > 200 | my.data.frame$age < 0)
my.NEW.data.frame <- my.data.frame[-my.high.value, ]
Outliers are quite similar to peaks, so a peak detector can be useful for identifying outliers. The method described here has quite good performance using z-scores. The animation part way down the page illustrates the method signaling on outliers, or peaks.
Peaks are not always the same as outliers, but they're similar frequently.
An example is shown here:
This dataset is read from a sensor via serial communications. Occasional serial communication errors, sensor error or both lead to repeated, clearly erroneous data points. There is no statistical value in these point. They are arguably not outliers, they are errors. The z-score peak detector was able to signal on spurious data points and generated a clean resulting dataset:
It is more difficult to remove outliers with grouped data because there is a risk of removing data points that are considered outliers in one group but not in others.
Because no dataset is provided I assume that there is a dependent variable "attractiveness", and two independent variables "age" and "gender". The boxplot shown in the original post above is then created with boxplot(dat$attractiveness ~ dat$gender + dat$age). To remove outliers you can use the following approach:
# Create a separate dataset for each group
group_data = split(dat, list(dat$age, dat$gender))
# Remove outliers from each dataset
group_data = lapply(group_data, function(x) {
# Extract outlier values from boxplot
outliers = boxplot.stats(x$attractiveness)$out
# Remove outliers from data
return(subset(x, !x$attractiveness %in% outliers))
})
# Combine datasets into a single dataset
dat = do.call(rbind, group_data)
Try this. Feed your variable in the function and save the o/p in the variable which would contain removed outliers
outliers<-function(variable){
iqr<-IQR(variable)
q1<-as.numeric(quantile(variable,0.25))
q3<-as.numeric(quantile(variable,0.75))
mild_low<-q1-(1.5*iqr)
mild_high<-q3+(1.5*iqr)
new_variable<-variable[variable>mild_low & variable<mild_high]
return(new_variable)
}

Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project with a variable I havent seen before. The variable essentially is an interaction between the mean and standard deviation of a sample drawn from a gaussian, and im trying to see what this distribution might look like. Below is what im trying to do, any help is much appreciated.
Generate a set of 1000 individuals with a variable x following the gaussian distribution, draw 50 random samples of 5 individuals from this distribution with replacement, calculate the means and standard deviation of x for each sample, create an interaction variable named y which is calculated by multiplying the mean and standard deviation of x for each sample, plot the distribution of y.
Beginners version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
# draw 5 samples from population (without replacement)
# I assume you want to replace for each turn of taking 5
# If you want to replace between drawing each of the 5,
# I think it should be obvious how to adapt the following code
smpl <- sample(stat_pop, size = 5, replace = FALSE)
# the data.frame currently has two columns. In each row i, we put mean and sd
samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
Somewhat more efficient version
In R, loops are very inefficient and, thus, ought to be avoided. In case of your question, it does not make any noticeable difference. However, for large data sets, performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
General note
Usually, you should do some research on the problem before posting here. Then, you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random standard distribution R" in order to find out about the function rnorm().
Run ?rnorm to get help on the function in RStudio.

Is there an inbuilt function to identify outliers in all columns of a dataframe? [duplicate]

I've got some multivariate data of beauty vs ages. The ages range from 20-40 at intervals of 2 (20, 22, 24....40), and for each record of data, they are given an age and a beauty rating from 1-5. When I do boxplots of this data (ages across the X-axis, beauty ratings across the Y-axis), there are some outliers plotted outside the whiskers of each box.
I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.
Nobody has posted the simplest answer:
x[!x %in% boxplot.stats(x)$out]
Also see this: http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/
OK, you should apply something like this to your dataset. Do not replace & save or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
To see it in action:
set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
And once again, you should never do this on your own, outliers are just meant to be! =)
EDIT: I added na.rm = TRUE as default.
EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)
Use outline = FALSE as an option when you do the boxplot (read the help!).
> m <- c(rnorm(10),5,10)
> bp <- boxplot(m, outline = FALSE)
The boxplot function returns the values used to do the plotting (which is actually then done by bxp():
bstats <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
#need to "waste" this plot
bstats$out <- NULL
bstats$group <- NULL
bxp(bstats) # this will plot without any outlier points
I purposely did not answer the specific question because I consider it statistical malpractice to remove "outliers". I consider it acceptable practice to not plot them in a boxplot, but removing them just because they exceed some number of standard deviations or some number of inter-quartile widths is a systematic and unscientific mangling of the observational record.
I looked up for packages related to removing outliers, and found this package (surprisingly called "outliers"!): https://cran.r-project.org/web/packages/outliers/outliers.pdf
if you go through it you see different ways of removing outliers and among them I found rm.outlier most convenient one to use and as it says in the link above:
"If the outlier is detected and confirmed by statistical tests, this function can remove it or replace by
sample mean or median" and also here is the usage part from the same source:
"Usage
rm.outlier(x, fill = FALSE, median = FALSE, opposite = FALSE)
Arguments
x a dataset, most frequently a vector. If argument is a dataframe, then outlier is
removed from each column by sapply. The same behavior is applied by apply
when the matrix is given.
fill If set to TRUE, the median or mean is placed instead of outlier. Otherwise, the
outlier(s) is/are simply removed.
median If set to TRUE, median is used instead of mean in outlier replacement.
opposite if set to TRUE, gives opposite value (if largest value has maximum difference
from the mean, it gives smallest and vice versa)
"
x<-quantile(retentiondata$sum_dec_incr,c(0.01,0.99))
data_clean <- data[data$attribute >=x[1] & data$attribute<=x[2],]
I find this very easy to remove outliers. In the above example I am just extracting 2 percentile to 98 percentile of attribute values.
Wouldn't:
z <- df[df$x > quantile(df$x, .25) - 1.5*IQR(df$x) &
df$x < quantile(df$x, .75) + 1.5*IQR(df$x), ] #rows
accomplish this task quite easily?
Adding to #sefarkas' suggestion and using quantile as cut-offs, one could explore the following option:
newdata <- subset(mydata,!(mydata$var > quantile(mydata$var, probs=c(.01, .99))[2] | mydata$var < quantile(mydata$var, probs=c(.01, .99))[1]) )
This will remove the points points beyond the 99th quantile. Care should be taken like what aL3Xa was saying about keeping outliers. It should be removed only for getting an alternative conservative view of the data.
1 way to do that is
my.NEW.data.frame <- my.data.frame[-boxplot.stats(my.data.frame$my.column)$out, ]
or
my.high.value <- which(my.data.frame$age > 200 | my.data.frame$age < 0)
my.NEW.data.frame <- my.data.frame[-my.high.value, ]
Outliers are quite similar to peaks, so a peak detector can be useful for identifying outliers. The method described here has quite good performance using z-scores. The animation part way down the page illustrates the method signaling on outliers, or peaks.
Peaks are not always the same as outliers, but they're similar frequently.
An example is shown here:
This dataset is read from a sensor via serial communications. Occasional serial communication errors, sensor error or both lead to repeated, clearly erroneous data points. There is no statistical value in these point. They are arguably not outliers, they are errors. The z-score peak detector was able to signal on spurious data points and generated a clean resulting dataset:
It is more difficult to remove outliers with grouped data because there is a risk of removing data points that are considered outliers in one group but not in others.
Because no dataset is provided I assume that there is a dependent variable "attractiveness", and two independent variables "age" and "gender". The boxplot shown in the original post above is then created with boxplot(dat$attractiveness ~ dat$gender + dat$age). To remove outliers you can use the following approach:
# Create a separate dataset for each group
group_data = split(dat, list(dat$age, dat$gender))
# Remove outliers from each dataset
group_data = lapply(group_data, function(x) {
# Extract outlier values from boxplot
outliers = boxplot.stats(x$attractiveness)$out
# Remove outliers from data
return(subset(x, !x$attractiveness %in% outliers))
})
# Combine datasets into a single dataset
dat = do.call(rbind, group_data)
Try this. Feed your variable in the function and save the o/p in the variable which would contain removed outliers
outliers<-function(variable){
iqr<-IQR(variable)
q1<-as.numeric(quantile(variable,0.25))
q3<-as.numeric(quantile(variable,0.75))
mild_low<-q1-(1.5*iqr)
mild_high<-q3+(1.5*iqr)
new_variable<-variable[variable>mild_low & variable<mild_high]
return(new_variable)
}

log- and z-transforming my data in R

I'm preparing my data for a PCA, for which I need to standardize it. I've been following someone else's code in vegan but am not getting a mean of zero and SD of 1, as I should be.
I'm using a data set called musci which has 13 variables, three of which are labels to identify my data.
log.musci<-log(musci[,4:13],10)
stand.musci<-decostand(log.musci,method="standardize",MARGIN=2)
When I then check for mean=0 and SD=1...
colMeans(stand.musci)
sapply(stand.musci,sd)
I get mean values ranging from -8.9 to 3.8 and SD values are just listed as NA (for every data point in my data set rather than for each variable). If I leave out the last variable in my standardization, i.e.
log.musci<-log(musci[,4:12],10)
the means don't change, but the SDs now all have a value of 1.
Any ideas of where I've gone wrong?
Cheers!
You data is likely a matrix.
## Sample data
dat <- as.matrix(data.frame(a=rnorm(100, 10, 4), b=rexp(100, 0.4)))
So, either convert to a data.frame and use sapply to operate on columns
dat <- data.frame(dat)
scaled <- sapply(dat, scale)
colMeans(scaled)
# a b
# -2.307095e-16 2.164935e-17
apply(scaled, 2, sd)
# a b
# 1 1
or use apply to do columnwise operations
scaled <- apply(dat, 2, scale)
A z-transformation is quite easy to do manually.
See below using a random string of data.
data <- c(1,2,3,4,5,6,7,8,9,10)
data
mean(data)
sd(data)
z <- ((data - mean(data))/(sd(data)))
z
mean(z) == 0
sd(z) == 1
The logarithm transformation (assuming you mean a natural logarithm) is done using the log() function.
log(data)
Hope this helps!

How to remove outliers from a dataset

I've got some multivariate data of beauty vs ages. The ages range from 20-40 at intervals of 2 (20, 22, 24....40), and for each record of data, they are given an age and a beauty rating from 1-5. When I do boxplots of this data (ages across the X-axis, beauty ratings across the Y-axis), there are some outliers plotted outside the whiskers of each box.
I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.
Nobody has posted the simplest answer:
x[!x %in% boxplot.stats(x)$out]
Also see this: http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/
OK, you should apply something like this to your dataset. Do not replace & save or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
To see it in action:
set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
And once again, you should never do this on your own, outliers are just meant to be! =)
EDIT: I added na.rm = TRUE as default.
EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)
Use outline = FALSE as an option when you do the boxplot (read the help!).
> m <- c(rnorm(10),5,10)
> bp <- boxplot(m, outline = FALSE)
The boxplot function returns the values used to do the plotting (which is actually then done by bxp():
bstats <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
#need to "waste" this plot
bstats$out <- NULL
bstats$group <- NULL
bxp(bstats) # this will plot without any outlier points
I purposely did not answer the specific question because I consider it statistical malpractice to remove "outliers". I consider it acceptable practice to not plot them in a boxplot, but removing them just because they exceed some number of standard deviations or some number of inter-quartile widths is a systematic and unscientific mangling of the observational record.
I looked up for packages related to removing outliers, and found this package (surprisingly called "outliers"!): https://cran.r-project.org/web/packages/outliers/outliers.pdf
if you go through it you see different ways of removing outliers and among them I found rm.outlier most convenient one to use and as it says in the link above:
"If the outlier is detected and confirmed by statistical tests, this function can remove it or replace by
sample mean or median" and also here is the usage part from the same source:
"Usage
rm.outlier(x, fill = FALSE, median = FALSE, opposite = FALSE)
Arguments
x a dataset, most frequently a vector. If argument is a dataframe, then outlier is
removed from each column by sapply. The same behavior is applied by apply
when the matrix is given.
fill If set to TRUE, the median or mean is placed instead of outlier. Otherwise, the
outlier(s) is/are simply removed.
median If set to TRUE, median is used instead of mean in outlier replacement.
opposite if set to TRUE, gives opposite value (if largest value has maximum difference
from the mean, it gives smallest and vice versa)
"
x<-quantile(retentiondata$sum_dec_incr,c(0.01,0.99))
data_clean <- data[data$attribute >=x[1] & data$attribute<=x[2],]
I find this very easy to remove outliers. In the above example I am just extracting 2 percentile to 98 percentile of attribute values.
Wouldn't:
z <- df[df$x > quantile(df$x, .25) - 1.5*IQR(df$x) &
df$x < quantile(df$x, .75) + 1.5*IQR(df$x), ] #rows
accomplish this task quite easily?
Adding to #sefarkas' suggestion and using quantile as cut-offs, one could explore the following option:
newdata <- subset(mydata,!(mydata$var > quantile(mydata$var, probs=c(.01, .99))[2] | mydata$var < quantile(mydata$var, probs=c(.01, .99))[1]) )
This will remove the points points beyond the 99th quantile. Care should be taken like what aL3Xa was saying about keeping outliers. It should be removed only for getting an alternative conservative view of the data.
1 way to do that is
my.NEW.data.frame <- my.data.frame[-boxplot.stats(my.data.frame$my.column)$out, ]
or
my.high.value <- which(my.data.frame$age > 200 | my.data.frame$age < 0)
my.NEW.data.frame <- my.data.frame[-my.high.value, ]
Outliers are quite similar to peaks, so a peak detector can be useful for identifying outliers. The method described here has quite good performance using z-scores. The animation part way down the page illustrates the method signaling on outliers, or peaks.
Peaks are not always the same as outliers, but they're similar frequently.
An example is shown here:
This dataset is read from a sensor via serial communications. Occasional serial communication errors, sensor error or both lead to repeated, clearly erroneous data points. There is no statistical value in these point. They are arguably not outliers, they are errors. The z-score peak detector was able to signal on spurious data points and generated a clean resulting dataset:
It is more difficult to remove outliers with grouped data because there is a risk of removing data points that are considered outliers in one group but not in others.
Because no dataset is provided I assume that there is a dependent variable "attractiveness", and two independent variables "age" and "gender". The boxplot shown in the original post above is then created with boxplot(dat$attractiveness ~ dat$gender + dat$age). To remove outliers you can use the following approach:
# Create a separate dataset for each group
group_data = split(dat, list(dat$age, dat$gender))
# Remove outliers from each dataset
group_data = lapply(group_data, function(x) {
# Extract outlier values from boxplot
outliers = boxplot.stats(x$attractiveness)$out
# Remove outliers from data
return(subset(x, !x$attractiveness %in% outliers))
})
# Combine datasets into a single dataset
dat = do.call(rbind, group_data)
Try this. Feed your variable in the function and save the o/p in the variable which would contain removed outliers
outliers<-function(variable){
iqr<-IQR(variable)
q1<-as.numeric(quantile(variable,0.25))
q3<-as.numeric(quantile(variable,0.75))
mild_low<-q1-(1.5*iqr)
mild_high<-q3+(1.5*iqr)
new_variable<-variable[variable>mild_low & variable<mild_high]
return(new_variable)
}

Resources