Surprising behaviour survey R package with missing values - r

I'm reproducing a question that I couldn't find an answer to.
"I got some surprising results when using the svytotal routine from the survey package with data containing missing values.
Some example code demonstrating the behaviour is included below.
I have a stratified sampling design where I want to estimate the total
income. In some strata some of the incomes are missing. I want to
ignore these missing incomes. I would have expected that
svytotal(\~income, design=mydesign, na.rm=TRUE) would do the trick.
However, when calculating the estimates 'by hand' the estimates were
different from those obtained from svytotal. The estimated mean
incomes do agree with each other. It seems that using the na.rm option
with svytotal is the same as replacing the missing values with zero's,
which is not what I would have expected, especially since this
behaviour seems to differ from that of svymean. Is there a reason for
this behaviour?
I can of course remove the missing values myself before creating the
survey object. However, with many different variables with different
missing values, this is not very practical. Is there an easy way to
get the behaviour I want?"
library(survey)
library(plyr)
# generate some data
data <- data.frame(
id = 1:20,
stratum = rep(c("a", "b"), each=10),
income = rnorm(20, 100),
n = rep(c(100, 200), each=10)
)
data$income[5] <- NA
# calculate mean and total income for every stratum using survey package
des <- svydesign(ids=~id, strata=~stratum, data=data, fpc=~n)
svyby(~income, by=~stratum, FUN=svytotal, design=des, na.rm=TRUE)
mn <- svyby(~income, by=~stratum, FUN=svymean, design=des, na.rm=TRUE)
mn
n <- svyby(~n, by=~stratum, FUN=svymean, design=des)
# total does not equal mean times number of persons in stratum
mn[2] * n[2]
# calculate mean and total income 'by hand'. This does not give the same total
# as svytotal, but it does give the same mean
ddply(data, .(stratum), function(d) {
data.frame(
mean = mean(d$income, na.rm=TRUE),
n = mean(d$n),
total = mean(d$income, na.rm=TRUE) * mean(d$n)
)
})
# when we set income to 0 for missing cases and repeat the previous estimation
# we get the same answer as svytotal (but not svymean)
data2 <- data
data2$income[is.na(data$income )] <- 0
ddply(data2, .(stratum), function(d) {
data.frame(
mean = mean(d$income, na.rm=TRUE),
n = mean(d$n),
total = mean(d$income, na.rm=TRUE) * mean(d$n)
)
})

Yes, there is a reason for this behaviour!
The easiest way to think about the answer survey is trying to give here is it sets the weights for the missing observations to zero. That is, the package gives population estimates for the subdomain of non-missing values. This is important for getting the right standard errors. [Note: it doesn't actually do it by just setting the weights to zero, there are some optimisations, but that's the answer it gives]
If you set the weights to zero in svytotal, you get the sum of the non-missing values, which is the same as you get if you set the missing values to 0 or if they weren't ever sampled. When you come to compute standard errors it matters exactly which one you did, but not for point estimates.
If you set the weights to zero in svymean you get the mean of the non-missing values, which is not the same as you get if you set the missing values to zero (though it is the same as if they just weren't ever sampled).
I don't know exactly what you mean when you say you want to 'ignore' the missing incomes, but if you want to multiply mn[2] and n[2] meaningfully, they need to be computed on the same subdomain: you have one of them computed only where income is not missing and the other computed on all observations.

Related

How do I remove outliers from my dataset? [duplicate]

I've got some multivariate data of beauty vs ages. The ages range from 20-40 at intervals of 2 (20, 22, 24....40), and for each record of data, they are given an age and a beauty rating from 1-5. When I do boxplots of this data (ages across the X-axis, beauty ratings across the Y-axis), there are some outliers plotted outside the whiskers of each box.
I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.
Nobody has posted the simplest answer:
x[!x %in% boxplot.stats(x)$out]
Also see this: http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/
OK, you should apply something like this to your dataset. Do not replace & save or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
To see it in action:
set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
And once again, you should never do this on your own, outliers are just meant to be! =)
EDIT: I added na.rm = TRUE as default.
EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)
Use outline = FALSE as an option when you do the boxplot (read the help!).
> m <- c(rnorm(10),5,10)
> bp <- boxplot(m, outline = FALSE)
The boxplot function returns the values used to do the plotting (which is actually then done by bxp():
bstats <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
#need to "waste" this plot
bstats$out <- NULL
bstats$group <- NULL
bxp(bstats) # this will plot without any outlier points
I purposely did not answer the specific question because I consider it statistical malpractice to remove "outliers". I consider it acceptable practice to not plot them in a boxplot, but removing them just because they exceed some number of standard deviations or some number of inter-quartile widths is a systematic and unscientific mangling of the observational record.
I looked up for packages related to removing outliers, and found this package (surprisingly called "outliers"!): https://cran.r-project.org/web/packages/outliers/outliers.pdf
if you go through it you see different ways of removing outliers and among them I found rm.outlier most convenient one to use and as it says in the link above:
"If the outlier is detected and confirmed by statistical tests, this function can remove it or replace by
sample mean or median" and also here is the usage part from the same source:
"Usage
rm.outlier(x, fill = FALSE, median = FALSE, opposite = FALSE)
Arguments
x a dataset, most frequently a vector. If argument is a dataframe, then outlier is
removed from each column by sapply. The same behavior is applied by apply
when the matrix is given.
fill If set to TRUE, the median or mean is placed instead of outlier. Otherwise, the
outlier(s) is/are simply removed.
median If set to TRUE, median is used instead of mean in outlier replacement.
opposite if set to TRUE, gives opposite value (if largest value has maximum difference
from the mean, it gives smallest and vice versa)
"
x<-quantile(retentiondata$sum_dec_incr,c(0.01,0.99))
data_clean <- data[data$attribute >=x[1] & data$attribute<=x[2],]
I find this very easy to remove outliers. In the above example I am just extracting 2 percentile to 98 percentile of attribute values.
Wouldn't:
z <- df[df$x > quantile(df$x, .25) - 1.5*IQR(df$x) &
df$x < quantile(df$x, .75) + 1.5*IQR(df$x), ] #rows
accomplish this task quite easily?
Adding to #sefarkas' suggestion and using quantile as cut-offs, one could explore the following option:
newdata <- subset(mydata,!(mydata$var > quantile(mydata$var, probs=c(.01, .99))[2] | mydata$var < quantile(mydata$var, probs=c(.01, .99))[1]) )
This will remove the points points beyond the 99th quantile. Care should be taken like what aL3Xa was saying about keeping outliers. It should be removed only for getting an alternative conservative view of the data.
1 way to do that is
my.NEW.data.frame <- my.data.frame[-boxplot.stats(my.data.frame$my.column)$out, ]
or
my.high.value <- which(my.data.frame$age > 200 | my.data.frame$age < 0)
my.NEW.data.frame <- my.data.frame[-my.high.value, ]
Outliers are quite similar to peaks, so a peak detector can be useful for identifying outliers. The method described here has quite good performance using z-scores. The animation part way down the page illustrates the method signaling on outliers, or peaks.
Peaks are not always the same as outliers, but they're similar frequently.
An example is shown here:
This dataset is read from a sensor via serial communications. Occasional serial communication errors, sensor error or both lead to repeated, clearly erroneous data points. There is no statistical value in these point. They are arguably not outliers, they are errors. The z-score peak detector was able to signal on spurious data points and generated a clean resulting dataset:
It is more difficult to remove outliers with grouped data because there is a risk of removing data points that are considered outliers in one group but not in others.
Because no dataset is provided I assume that there is a dependent variable "attractiveness", and two independent variables "age" and "gender". The boxplot shown in the original post above is then created with boxplot(dat$attractiveness ~ dat$gender + dat$age). To remove outliers you can use the following approach:
# Create a separate dataset for each group
group_data = split(dat, list(dat$age, dat$gender))
# Remove outliers from each dataset
group_data = lapply(group_data, function(x) {
# Extract outlier values from boxplot
outliers = boxplot.stats(x$attractiveness)$out
# Remove outliers from data
return(subset(x, !x$attractiveness %in% outliers))
})
# Combine datasets into a single dataset
dat = do.call(rbind, group_data)
Try this. Feed your variable in the function and save the o/p in the variable which would contain removed outliers
outliers<-function(variable){
iqr<-IQR(variable)
q1<-as.numeric(quantile(variable,0.25))
q3<-as.numeric(quantile(variable,0.75))
mild_low<-q1-(1.5*iqr)
mild_high<-q3+(1.5*iqr)
new_variable<-variable[variable>mild_low & variable<mild_high]
return(new_variable)
}

Is there an inbuilt function to identify outliers in all columns of a dataframe? [duplicate]

I've got some multivariate data of beauty vs ages. The ages range from 20-40 at intervals of 2 (20, 22, 24....40), and for each record of data, they are given an age and a beauty rating from 1-5. When I do boxplots of this data (ages across the X-axis, beauty ratings across the Y-axis), there are some outliers plotted outside the whiskers of each box.
I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.
Nobody has posted the simplest answer:
x[!x %in% boxplot.stats(x)$out]
Also see this: http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/
OK, you should apply something like this to your dataset. Do not replace & save or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
To see it in action:
set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
And once again, you should never do this on your own, outliers are just meant to be! =)
EDIT: I added na.rm = TRUE as default.
EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)
Use outline = FALSE as an option when you do the boxplot (read the help!).
> m <- c(rnorm(10),5,10)
> bp <- boxplot(m, outline = FALSE)
The boxplot function returns the values used to do the plotting (which is actually then done by bxp():
bstats <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
#need to "waste" this plot
bstats$out <- NULL
bstats$group <- NULL
bxp(bstats) # this will plot without any outlier points
I purposely did not answer the specific question because I consider it statistical malpractice to remove "outliers". I consider it acceptable practice to not plot them in a boxplot, but removing them just because they exceed some number of standard deviations or some number of inter-quartile widths is a systematic and unscientific mangling of the observational record.
I looked up for packages related to removing outliers, and found this package (surprisingly called "outliers"!): https://cran.r-project.org/web/packages/outliers/outliers.pdf
if you go through it you see different ways of removing outliers and among them I found rm.outlier most convenient one to use and as it says in the link above:
"If the outlier is detected and confirmed by statistical tests, this function can remove it or replace by
sample mean or median" and also here is the usage part from the same source:
"Usage
rm.outlier(x, fill = FALSE, median = FALSE, opposite = FALSE)
Arguments
x a dataset, most frequently a vector. If argument is a dataframe, then outlier is
removed from each column by sapply. The same behavior is applied by apply
when the matrix is given.
fill If set to TRUE, the median or mean is placed instead of outlier. Otherwise, the
outlier(s) is/are simply removed.
median If set to TRUE, median is used instead of mean in outlier replacement.
opposite if set to TRUE, gives opposite value (if largest value has maximum difference
from the mean, it gives smallest and vice versa)
"
x<-quantile(retentiondata$sum_dec_incr,c(0.01,0.99))
data_clean <- data[data$attribute >=x[1] & data$attribute<=x[2],]
I find this very easy to remove outliers. In the above example I am just extracting 2 percentile to 98 percentile of attribute values.
Wouldn't:
z <- df[df$x > quantile(df$x, .25) - 1.5*IQR(df$x) &
df$x < quantile(df$x, .75) + 1.5*IQR(df$x), ] #rows
accomplish this task quite easily?
Adding to #sefarkas' suggestion and using quantile as cut-offs, one could explore the following option:
newdata <- subset(mydata,!(mydata$var > quantile(mydata$var, probs=c(.01, .99))[2] | mydata$var < quantile(mydata$var, probs=c(.01, .99))[1]) )
This will remove the points points beyond the 99th quantile. Care should be taken like what aL3Xa was saying about keeping outliers. It should be removed only for getting an alternative conservative view of the data.
1 way to do that is
my.NEW.data.frame <- my.data.frame[-boxplot.stats(my.data.frame$my.column)$out, ]
or
my.high.value <- which(my.data.frame$age > 200 | my.data.frame$age < 0)
my.NEW.data.frame <- my.data.frame[-my.high.value, ]
Outliers are quite similar to peaks, so a peak detector can be useful for identifying outliers. The method described here has quite good performance using z-scores. The animation part way down the page illustrates the method signaling on outliers, or peaks.
Peaks are not always the same as outliers, but they're similar frequently.
An example is shown here:
This dataset is read from a sensor via serial communications. Occasional serial communication errors, sensor error or both lead to repeated, clearly erroneous data points. There is no statistical value in these point. They are arguably not outliers, they are errors. The z-score peak detector was able to signal on spurious data points and generated a clean resulting dataset:
It is more difficult to remove outliers with grouped data because there is a risk of removing data points that are considered outliers in one group but not in others.
Because no dataset is provided I assume that there is a dependent variable "attractiveness", and two independent variables "age" and "gender". The boxplot shown in the original post above is then created with boxplot(dat$attractiveness ~ dat$gender + dat$age). To remove outliers you can use the following approach:
# Create a separate dataset for each group
group_data = split(dat, list(dat$age, dat$gender))
# Remove outliers from each dataset
group_data = lapply(group_data, function(x) {
# Extract outlier values from boxplot
outliers = boxplot.stats(x$attractiveness)$out
# Remove outliers from data
return(subset(x, !x$attractiveness %in% outliers))
})
# Combine datasets into a single dataset
dat = do.call(rbind, group_data)
Try this. Feed your variable in the function and save the o/p in the variable which would contain removed outliers
outliers<-function(variable){
iqr<-IQR(variable)
q1<-as.numeric(quantile(variable,0.25))
q3<-as.numeric(quantile(variable,0.75))
mild_low<-q1-(1.5*iqr)
mild_high<-q3+(1.5*iqr)
new_variable<-variable[variable>mild_low & variable<mild_high]
return(new_variable)
}

Fama Macbeth Regression in R pmg

In the past few days I have been trying to find how to do Fama Macbeth regressions in R. It is advised to use the plm package with pmg, however every attempt I do returns me that I have an insufficient number of time periods.
My Dataset consists of 2828419 observations with 13 columns of variables of which I am looking to do multiple cross-sectional regressions.
My firms are specified by seriesis, I have got a variable date and want to do the following Fama Macbeth regressions:
totret ~ size
totret ~ momentum
totret ~ reversal
totret ~ volatility
totret ~ value size
totret ~ value + size + momentum
totret ~ value + size + momentum + reversal + volatility
I have been using this command:
fpmg <- pmg(totret ~ momentum, Data, index = c("date", "seriesid")
Which returns: Error in pmg(totret ~ mom, Dataset, index = c("seriesid", "datem")) : Insufficient number of time periods
I tried it with my dataset being a datatable, dataframe and pdataframe. Switching the index does not work as well.
My data contains NAs as well.
Who can fix this, or find a different way for me to do Fama Macbeth?
This is almost certainly due to having NAs in the variables in your formula. The error message is not very helpful - it is probably not a case of "too few time periods to estimate" and very likely a case of "there are firm/unit IDs that are not represented across all time periods" due to missing data being dropped.
You have two options - impute the missing data or drop observations with missing data (the latter being a quick test that the model works without missing points before deciding what you want to do that is valid for estimtation).
If the missingness in your data is truly random, you might be okay just dropping observations with missingness. Otherwise you should probably impute. A common strategy here is to impute multiple times - at least 5 - and then estimate for each of those 5 resulting data sets and average the effect together. Amelia or mice are very strong imputation packages. I like Amelia because with one call you can impute n times for that many resulting data sets and it's easy to pass in a set of variables to not impute (e.g., id variable or time period) with the idvars parameter.
EDIT: I dug into the source code to see where the error was triggered and here is what the issue is - again likely caused by missing data, but it does interact with your degrees of freedom:
...
# part of the code where error is triggered below, here is context:
# X = matrix of the RHS of your model including intercept, so X[,1] is all 1s
# k = number of coefficients used determined by length(coef(plm.model))
# ind = vector of ID values
# so t here is the minimum value from a count of occurrences for each unique ID
t <- min(tapply(X[,1], ind, length))
# then if the minimum number of times a single ID appears across time is
# less than the number of coefficients + 1, you do not have enough time
# points (for that ID/those IDs) to estimate.
if (t < (k + 1))
stop("Insufficient number of time periods")
That is what is triggering your error. So imputation is definitely a solution, but there might be a single offender in your data and importantly, once this condition is satisfied your model will run just fine with missing data.
Lately, I fixed the Fama Macbeth regression in R.
From a Data Table with all of the characteristics within the rows, the following works and gives the opportunity to equally weight or apply weights to the regression (remove the ",weights = marketcap" for equally weighted). totret is a total return variable, logmarket is the logarithm of market capitalization.
logmarket<- df %>%
group_by(date) %>%
summarise(constant = summary(lm(totret~logmarket, weights = marketcap))$coefficient[1], rsquared = summary(lm(totret~logmarket*, weights = marketcap*))$r.squared, beta= summary(lm(totret~logmarket, weights = marketcap))$coefficient[2])
You obtain a DataFrame with monthly alphas (constant), betas (beta), the R squared (rsquared).
To retrieve coefficients with t-statistics in a dataframe:
Summarystatistics <- as.data.frame(matrix(data=NA, nrow=6, ncol=1)
names(Summarystatistics) <- "logmarket"
row.names(Summarystatistics) <- c("constant","t-stat", "beta", "tstat", "R^2", "observations")
Summarystatistics[1,1] <- mean(logmarket$constant)
Summarystatistics[2,1] <- coeftest(lm(logmarket$constant~1))[1,3]
Summarystatistics[3,1] <- mean(logmarket$beta)
Summarystatistics[4,1] <- coeftest(lm(logmarket$beta~1))[1,3]
Summarystatistics[5,1] <- mean(logmarket$rsquared)
Summarystatistics[6,1] <- nrow(subset(df, !is.na(logmarket)))
There are some entries of "seriesid" with only one entry. Therefore the pmg gives the error. If you do something like this (with variable names you use), it will stop the error:
try2 <- try2 %>%
group_by(cusip) %>%
mutate(flag = (if (length(cusip)==1) {1} else {0})) %>%
ungroup() %>%
filter(flag == 0)

Remove the outlier in the dataset before running factor analysis in R

The introduction about my dataset: It is the questionnaire data, mentioning the different reasons for students' antisocial behaviours. And I want to run the factor analysis to organize similar reasons to a factor.
For instance, there is one reason that students have antisocial behaviour because of their parents' educating, and another reason is that this happens because of their parents' educational background. There are some similarity between these two reasons, so that I am wondering whether these two reasons could be merged into one factor, so I want to run a factor analysis to see whether I could merge different reasons in one factor.
In order to run the factor analysis, removing the outlier(those which is smaller than mean minus 3 standard deviation, and bigger than mean add 3 standard deviation) is quite important from my understanding. However, I am not sure whether it is necessary for the questionnaire data, and if it is necessary, or at least it is not completely redundant, then with which R code could I reach this aim?
I did some research on Median Absolute Deviation (MAD) method, which could partial out the outliers. And I also wrote the R code as below:
mad.mean.D.O <- as.numeric(D.O.Mean.data$D.O_Mean)
median(mad.mean.D.O)
mad(mad.mean.D.O, center = median(mad.mean.D.O), constant = 1.4826,
na.rm = FALSE, low = FALSE, high = FALSE)
print(Upper.MAD <- (median(mad.mean.D.O)+3*(mad(mad.mean.D.O, center = median(mad.mean.D.O), constant = 1.4826,
na.rm = FALSE, low = FALSE, high = FALSE))))
print(Lower.MAD <- (median(mad.mean.D.O)-3*(mad(mad.mean.D.O, center = median(mad.mean.D.O), constant = 1.4826,
na.rm = FALSE, low = FALSE, high = FALSE))))
D.O.clean.mean.data <- D.O.Mean.data %>%
select(ID_t,
anonymity,
fail_exm,
pregnant,
deg_job,
new_job,
crowded,
stu_req,
int_sub,
no_org,
child,
exm_cont,
lec_sup,
fals_exp,
fin_prob,
int_pro,
family,
illness,
perf_req,
abroad,
relevanc,
quickcash,
deg_per,
lack_opp,
prac_work,
D.O_Mean) %>%
filter(D.O_Mean < 4.197032 & D.O_Mean > 0.282968)
This R code works.
However, I just wonder whether there are also other methods which could reach the same aim, but in a simpler approach.
In addition, my data set looks like this:
All the variables are questionnaire data, being measured by likert scale. And all of those are reasons for antisocial behaviour. For example, the first participants, she/he give 1 to anonymity, that means from not exactly yo exactly, he/ she think anonymity not exactly contribute to his/ her antisocial behaviour.
I would be really thankful for all of your input here.
You can try this function to remove outliers. It'll comb through all columns to identify outliers, so be sure to temporarily remove columns do not need outliers removed and you can cbind() it back later.
#identify outliers
idoutlier<- function(data, cutoff = 3) {
# Calculate the sd
sds <- apply(data, 2, sd, na.rm = TRUE)
# Identify the cells with value greater than cutoff * sd (column wise)
result <- mapply(function(d, s) {
which(d > cutoff * s)
}, data, sds)
result
}
#remove outliers
rmoutlier<- function(data, outliers) {
result <- mapply(function(d, o) {
res <- d
res[o] <- NA
return(res)
}, data, outliers)
return(as.data.frame(result))
}
cbind() if necessary, and then na.omit() to remove your outliers

How to remove outliers from a dataset

I've got some multivariate data of beauty vs ages. The ages range from 20-40 at intervals of 2 (20, 22, 24....40), and for each record of data, they are given an age and a beauty rating from 1-5. When I do boxplots of this data (ages across the X-axis, beauty ratings across the Y-axis), there are some outliers plotted outside the whiskers of each box.
I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.
Nobody has posted the simplest answer:
x[!x %in% boxplot.stats(x)$out]
Also see this: http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/
OK, you should apply something like this to your dataset. Do not replace & save or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
To see it in action:
set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
And once again, you should never do this on your own, outliers are just meant to be! =)
EDIT: I added na.rm = TRUE as default.
EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)
Use outline = FALSE as an option when you do the boxplot (read the help!).
> m <- c(rnorm(10),5,10)
> bp <- boxplot(m, outline = FALSE)
The boxplot function returns the values used to do the plotting (which is actually then done by bxp():
bstats <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
#need to "waste" this plot
bstats$out <- NULL
bstats$group <- NULL
bxp(bstats) # this will plot without any outlier points
I purposely did not answer the specific question because I consider it statistical malpractice to remove "outliers". I consider it acceptable practice to not plot them in a boxplot, but removing them just because they exceed some number of standard deviations or some number of inter-quartile widths is a systematic and unscientific mangling of the observational record.
I looked up for packages related to removing outliers, and found this package (surprisingly called "outliers"!): https://cran.r-project.org/web/packages/outliers/outliers.pdf
if you go through it you see different ways of removing outliers and among them I found rm.outlier most convenient one to use and as it says in the link above:
"If the outlier is detected and confirmed by statistical tests, this function can remove it or replace by
sample mean or median" and also here is the usage part from the same source:
"Usage
rm.outlier(x, fill = FALSE, median = FALSE, opposite = FALSE)
Arguments
x a dataset, most frequently a vector. If argument is a dataframe, then outlier is
removed from each column by sapply. The same behavior is applied by apply
when the matrix is given.
fill If set to TRUE, the median or mean is placed instead of outlier. Otherwise, the
outlier(s) is/are simply removed.
median If set to TRUE, median is used instead of mean in outlier replacement.
opposite if set to TRUE, gives opposite value (if largest value has maximum difference
from the mean, it gives smallest and vice versa)
"
x<-quantile(retentiondata$sum_dec_incr,c(0.01,0.99))
data_clean <- data[data$attribute >=x[1] & data$attribute<=x[2],]
I find this very easy to remove outliers. In the above example I am just extracting 2 percentile to 98 percentile of attribute values.
Wouldn't:
z <- df[df$x > quantile(df$x, .25) - 1.5*IQR(df$x) &
df$x < quantile(df$x, .75) + 1.5*IQR(df$x), ] #rows
accomplish this task quite easily?
Adding to #sefarkas' suggestion and using quantile as cut-offs, one could explore the following option:
newdata <- subset(mydata,!(mydata$var > quantile(mydata$var, probs=c(.01, .99))[2] | mydata$var < quantile(mydata$var, probs=c(.01, .99))[1]) )
This will remove the points points beyond the 99th quantile. Care should be taken like what aL3Xa was saying about keeping outliers. It should be removed only for getting an alternative conservative view of the data.
1 way to do that is
my.NEW.data.frame <- my.data.frame[-boxplot.stats(my.data.frame$my.column)$out, ]
or
my.high.value <- which(my.data.frame$age > 200 | my.data.frame$age < 0)
my.NEW.data.frame <- my.data.frame[-my.high.value, ]
Outliers are quite similar to peaks, so a peak detector can be useful for identifying outliers. The method described here has quite good performance using z-scores. The animation part way down the page illustrates the method signaling on outliers, or peaks.
Peaks are not always the same as outliers, but they're similar frequently.
An example is shown here:
This dataset is read from a sensor via serial communications. Occasional serial communication errors, sensor error or both lead to repeated, clearly erroneous data points. There is no statistical value in these point. They are arguably not outliers, they are errors. The z-score peak detector was able to signal on spurious data points and generated a clean resulting dataset:
It is more difficult to remove outliers with grouped data because there is a risk of removing data points that are considered outliers in one group but not in others.
Because no dataset is provided I assume that there is a dependent variable "attractiveness", and two independent variables "age" and "gender". The boxplot shown in the original post above is then created with boxplot(dat$attractiveness ~ dat$gender + dat$age). To remove outliers you can use the following approach:
# Create a separate dataset for each group
group_data = split(dat, list(dat$age, dat$gender))
# Remove outliers from each dataset
group_data = lapply(group_data, function(x) {
# Extract outlier values from boxplot
outliers = boxplot.stats(x$attractiveness)$out
# Remove outliers from data
return(subset(x, !x$attractiveness %in% outliers))
})
# Combine datasets into a single dataset
dat = do.call(rbind, group_data)
Try this. Feed your variable in the function and save the o/p in the variable which would contain removed outliers
outliers<-function(variable){
iqr<-IQR(variable)
q1<-as.numeric(quantile(variable,0.25))
q3<-as.numeric(quantile(variable,0.75))
mild_low<-q1-(1.5*iqr)
mild_high<-q3+(1.5*iqr)
new_variable<-variable[variable>mild_low & variable<mild_high]
return(new_variable)
}

Resources