grouped loess scores - r

I've got this data.frame: http://sprunge.us/TMGS, and I'd like to calculate the loess of Intermediate.MAP.Score ~ x, so I get one curve from the whole dataset. But every group (by name) should have the same wight as every other group, I'm not sure what happens if I call loess over the whole data.frame. Do I need to call it once per group and combine them? If yes, how do I do that?

If you want to average over all of the values in 'loess.fits' produced in my earlier answer to a different question, you will get one answer. If you want to just get a loess fit on the entire dataset (which would not fit your "equal weighting" spec at least as I interpret that phrase), you will get another answer.
This would produce averaged 'yhat' values at the 51 equally spaced data values for 'x' in the range of [0,1]. Because of missing values, it may not be exactly "equally weighted" but only at the extremes. The estimates are dense elsewhere:
apply( as.data.frame(loess.fits), 1, mean, na.rm=TRUE)
Earlier answer:
I would have titled the question "loess scores split by group":
plot(dat$x, dat$Intermediate.MAP.Score, col=as.numeric(factor(dat$name)) )
If you proceed with loess(Intermediate.MAP.Score ~ x, data=dat) you will get an overall average with no distinction among groups. And loess doesn't accept factor or character arguments in its formula. You need to split by 'name' and calculate separately. The other gotcha to avoid is plotting on the default limits which will be driven varying data ranges:
loess.fits <- lapply(split(dat, dat$name), function(xdf) {
list( yhat=predict( loess(Intermediate.MAP.Score ~ x,
data=xdf[ complete.cases(
xdf[ , c("Intermediate.MAP.Score", "x") ]
),
] ) ,
newdata=data.frame(x=seq(0,1,by=0.02))))})
plot(dat$x, dat$Intermediate.MAP.Score,
col=as.numeric(factor(dat$name)),
ylim=c(0.2,1) )
lapply(loess.fits, function(xdf) { par(new=TRUE);
# so the plots can be compared to predictions
plot(x= seq(0,1,by=0.02), y=xdf$yhat,
ylab="", xlab="",
ylim=c(0.2,1), axes=FALSE) })

Related

How do I remove outliers from my dataset? [duplicate]

I've got some multivariate data of beauty vs ages. The ages range from 20-40 at intervals of 2 (20, 22, 24....40), and for each record of data, they are given an age and a beauty rating from 1-5. When I do boxplots of this data (ages across the X-axis, beauty ratings across the Y-axis), there are some outliers plotted outside the whiskers of each box.
I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.
Nobody has posted the simplest answer:
x[!x %in% boxplot.stats(x)$out]
Also see this: http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/
OK, you should apply something like this to your dataset. Do not replace & save or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
To see it in action:
set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
And once again, you should never do this on your own, outliers are just meant to be! =)
EDIT: I added na.rm = TRUE as default.
EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)
Use outline = FALSE as an option when you do the boxplot (read the help!).
> m <- c(rnorm(10),5,10)
> bp <- boxplot(m, outline = FALSE)
The boxplot function returns the values used to do the plotting (which is actually then done by bxp():
bstats <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
#need to "waste" this plot
bstats$out <- NULL
bstats$group <- NULL
bxp(bstats) # this will plot without any outlier points
I purposely did not answer the specific question because I consider it statistical malpractice to remove "outliers". I consider it acceptable practice to not plot them in a boxplot, but removing them just because they exceed some number of standard deviations or some number of inter-quartile widths is a systematic and unscientific mangling of the observational record.
I looked up for packages related to removing outliers, and found this package (surprisingly called "outliers"!): https://cran.r-project.org/web/packages/outliers/outliers.pdf
if you go through it you see different ways of removing outliers and among them I found rm.outlier most convenient one to use and as it says in the link above:
"If the outlier is detected and confirmed by statistical tests, this function can remove it or replace by
sample mean or median" and also here is the usage part from the same source:
"Usage
rm.outlier(x, fill = FALSE, median = FALSE, opposite = FALSE)
Arguments
x a dataset, most frequently a vector. If argument is a dataframe, then outlier is
removed from each column by sapply. The same behavior is applied by apply
when the matrix is given.
fill If set to TRUE, the median or mean is placed instead of outlier. Otherwise, the
outlier(s) is/are simply removed.
median If set to TRUE, median is used instead of mean in outlier replacement.
opposite if set to TRUE, gives opposite value (if largest value has maximum difference
from the mean, it gives smallest and vice versa)
"
x<-quantile(retentiondata$sum_dec_incr,c(0.01,0.99))
data_clean <- data[data$attribute >=x[1] & data$attribute<=x[2],]
I find this very easy to remove outliers. In the above example I am just extracting 2 percentile to 98 percentile of attribute values.
Wouldn't:
z <- df[df$x > quantile(df$x, .25) - 1.5*IQR(df$x) &
df$x < quantile(df$x, .75) + 1.5*IQR(df$x), ] #rows
accomplish this task quite easily?
Adding to #sefarkas' suggestion and using quantile as cut-offs, one could explore the following option:
newdata <- subset(mydata,!(mydata$var > quantile(mydata$var, probs=c(.01, .99))[2] | mydata$var < quantile(mydata$var, probs=c(.01, .99))[1]) )
This will remove the points points beyond the 99th quantile. Care should be taken like what aL3Xa was saying about keeping outliers. It should be removed only for getting an alternative conservative view of the data.
1 way to do that is
my.NEW.data.frame <- my.data.frame[-boxplot.stats(my.data.frame$my.column)$out, ]
or
my.high.value <- which(my.data.frame$age > 200 | my.data.frame$age < 0)
my.NEW.data.frame <- my.data.frame[-my.high.value, ]
Outliers are quite similar to peaks, so a peak detector can be useful for identifying outliers. The method described here has quite good performance using z-scores. The animation part way down the page illustrates the method signaling on outliers, or peaks.
Peaks are not always the same as outliers, but they're similar frequently.
An example is shown here:
This dataset is read from a sensor via serial communications. Occasional serial communication errors, sensor error or both lead to repeated, clearly erroneous data points. There is no statistical value in these point. They are arguably not outliers, they are errors. The z-score peak detector was able to signal on spurious data points and generated a clean resulting dataset:
It is more difficult to remove outliers with grouped data because there is a risk of removing data points that are considered outliers in one group but not in others.
Because no dataset is provided I assume that there is a dependent variable "attractiveness", and two independent variables "age" and "gender". The boxplot shown in the original post above is then created with boxplot(dat$attractiveness ~ dat$gender + dat$age). To remove outliers you can use the following approach:
# Create a separate dataset for each group
group_data = split(dat, list(dat$age, dat$gender))
# Remove outliers from each dataset
group_data = lapply(group_data, function(x) {
# Extract outlier values from boxplot
outliers = boxplot.stats(x$attractiveness)$out
# Remove outliers from data
return(subset(x, !x$attractiveness %in% outliers))
})
# Combine datasets into a single dataset
dat = do.call(rbind, group_data)
Try this. Feed your variable in the function and save the o/p in the variable which would contain removed outliers
outliers<-function(variable){
iqr<-IQR(variable)
q1<-as.numeric(quantile(variable,0.25))
q3<-as.numeric(quantile(variable,0.75))
mild_low<-q1-(1.5*iqr)
mild_high<-q3+(1.5*iqr)
new_variable<-variable[variable>mild_low & variable<mild_high]
return(new_variable)
}

Segmented Regression with two zero constraints at beginning and end boundaries

I am having trouble with this segmented regression as it requires two constraints and so far I have only treated single constraints.
Here is an example of some data I am trying to fit:
library(segmented)
library("readxl")
library(ggplot2)
#DATA PRE-PROCESSING
yields <- c(-0.131, 0.533, -0.397, -0.429, -0.593, -0.778, -0.92, -0.987, -1.113, -1.314, -0.808, -1.534, -1.377, -1.459, -1.818, -1.686, -1.73, -1.221, -1.595, -1.568, -1.883, -1.53, -1.64, -1.396, -1.679, -1.782, -1.033, -0.539, -1.207, -1.437, -1.521, -0.691, -0.879, -0.974, -1.816, -1.854, -1.752, -1.61, -0.602, -1.364, -1.303, -1.186, -1.336)
maturities <- c(2.824657534246575, 2.9013698630136986, 3.106849315068493, 3.1534246575342464, 3.235616438356164, 3.358904109589041, 3.610958904109589, 3.654794520547945, 3.778082191780822, 3.824657534246575, 3.9013698630136986, 3.9863013698630136, 4.153424657534247, 4.273972602739726, 4.32054794520548, 4.654794520547945, 4.778082191780822, 4.986301369863014, 5.153424657534247, 5.32054794520548, 5.443835616438356, 5.572602739726028, 5.654794520547945, 5.824425480949174, 5.941911819746988, 6.275245153080321, 6.4063926940639275, 6.655026573845348, 6.863013698630137, 7.191780821917808, 7.32054794520548, 7.572602739726028, 7.693150684931507, 7.901369863013699, 7.986301369863014, 8.32054794520548, 8.654794520547945, 8.986301369863014, 9.068493150684931, 9.32054794520548, 9.654794520547945, 9.903660453626769, 10.155026573845348)
off_2 <- 2.693277939965566
off_10 <- 10.655026573845348
bond_data = data.frame(yield_change = yields, maturity = maturities) code here
I am trying to fit a segmented model (with a formula of "yield_change~maturity") that has the following constraints:
At maturity = 2 I want the yield_change to be zero
At maturity = 10, I want the yield_change to zero
I want breakpoints(fixed in x) at the 3, 5 and 7-year maturity values.
The off_2 and off_10 variables are the offsets I must use (to set the yields to zero at the 2 and 10-year mark)
As I mentioned before, my previous regressions only required one initial constraint, having one offset value I had to use, I did the following:
I subtracted the offset value from the maturity vector (for example I had maturity = c(10.8,10.9,11,14,16,18, etc... then subtracted the offset, always lower than the initial vector value, 10,4 for example and then fitted a lm with an origin constraint)
From there I could use the segmented package and fit as many breakpoints as I wanted)
As the segmented() function requires a lm object as an input that was possible.
However in this case I cannot to the previous approach as I have two offsets and cannot subtract all the values by off_2 or off_10 as it would fix one point at the zero and not the other.
What I have tried doing is the following:
Separate the dataset into maturities below 5 and maturities over 5 (and essentially apply a segmented model to each of these (with only one breakpoint being 3 or 7).
The issue is that I need to have the 5 year point yield the same for the two models.
I have done this:
bond_data_sub5 <- bond_data[bond_data$maturity < 5,]
bond_data_over5 <- bond_data[bond_data$maturity > 5,]
bond_data_sub5["maturity"] <- bond_data_sub5$maturity - off_2
#2 to 5 year model
model_sub5 <- lm(yield_change~maturity+0, data = bond_data_sub5)
plot(bond_data_sub5$maturity,bond_data_sub5$yield_change, pch=16, col="blue",
xlab = "maturity",ylab = "yield_change", xlim = c(0,12))
abline(model_sub5)
Which gives me the following graph:
The fact that the maturities have an offset of off_2 is not a problem as when I input my predictions to the function I will create, I will then subtract them by off_2.
The worrying thing is that the 5-year prediction is not at all close to where the actual 5 year should be. Looking at the scatter plot of all maturities we can see this:
five_yr_yield <- predict(model_sub5,data.frame(maturity = 5 - off_2))
plot(bond_data$maturity,bond_data$yield_change, pch=16, col="blue",
xlab = "maturity",ylab = "yield_change", xlim = c(0,12), ylim = c(-3,0.5))
points(5,five_yr_yield, pch=16, col = "red")
Gives:
The issue with this method is that if I set the model_sub5 5-year prediction as the beginning constraint of model_over5, I will have the exact same problem I am trying to resolve, two constraints in one lm (but this time (5,five_yr_yield) and (10,0) constraints.
Isn't there a way to fit a lm with no slope and zero as an intercept from (2,0) to (10,0) and then apply the segmented function with breakpoints at 3,5 and 7?
If that isn't possible how would I make the logic I am trying to apply work? Or is there another way of doing this?
If anyone has any suggestions I would greatly appreciate them!
Thank you very much!

Is there an inbuilt function to identify outliers in all columns of a dataframe? [duplicate]

I've got some multivariate data of beauty vs ages. The ages range from 20-40 at intervals of 2 (20, 22, 24....40), and for each record of data, they are given an age and a beauty rating from 1-5. When I do boxplots of this data (ages across the X-axis, beauty ratings across the Y-axis), there are some outliers plotted outside the whiskers of each box.
I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.
Nobody has posted the simplest answer:
x[!x %in% boxplot.stats(x)$out]
Also see this: http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/
OK, you should apply something like this to your dataset. Do not replace & save or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
To see it in action:
set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
And once again, you should never do this on your own, outliers are just meant to be! =)
EDIT: I added na.rm = TRUE as default.
EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)
Use outline = FALSE as an option when you do the boxplot (read the help!).
> m <- c(rnorm(10),5,10)
> bp <- boxplot(m, outline = FALSE)
The boxplot function returns the values used to do the plotting (which is actually then done by bxp():
bstats <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
#need to "waste" this plot
bstats$out <- NULL
bstats$group <- NULL
bxp(bstats) # this will plot without any outlier points
I purposely did not answer the specific question because I consider it statistical malpractice to remove "outliers". I consider it acceptable practice to not plot them in a boxplot, but removing them just because they exceed some number of standard deviations or some number of inter-quartile widths is a systematic and unscientific mangling of the observational record.
I looked up for packages related to removing outliers, and found this package (surprisingly called "outliers"!): https://cran.r-project.org/web/packages/outliers/outliers.pdf
if you go through it you see different ways of removing outliers and among them I found rm.outlier most convenient one to use and as it says in the link above:
"If the outlier is detected and confirmed by statistical tests, this function can remove it or replace by
sample mean or median" and also here is the usage part from the same source:
"Usage
rm.outlier(x, fill = FALSE, median = FALSE, opposite = FALSE)
Arguments
x a dataset, most frequently a vector. If argument is a dataframe, then outlier is
removed from each column by sapply. The same behavior is applied by apply
when the matrix is given.
fill If set to TRUE, the median or mean is placed instead of outlier. Otherwise, the
outlier(s) is/are simply removed.
median If set to TRUE, median is used instead of mean in outlier replacement.
opposite if set to TRUE, gives opposite value (if largest value has maximum difference
from the mean, it gives smallest and vice versa)
"
x<-quantile(retentiondata$sum_dec_incr,c(0.01,0.99))
data_clean <- data[data$attribute >=x[1] & data$attribute<=x[2],]
I find this very easy to remove outliers. In the above example I am just extracting 2 percentile to 98 percentile of attribute values.
Wouldn't:
z <- df[df$x > quantile(df$x, .25) - 1.5*IQR(df$x) &
df$x < quantile(df$x, .75) + 1.5*IQR(df$x), ] #rows
accomplish this task quite easily?
Adding to #sefarkas' suggestion and using quantile as cut-offs, one could explore the following option:
newdata <- subset(mydata,!(mydata$var > quantile(mydata$var, probs=c(.01, .99))[2] | mydata$var < quantile(mydata$var, probs=c(.01, .99))[1]) )
This will remove the points points beyond the 99th quantile. Care should be taken like what aL3Xa was saying about keeping outliers. It should be removed only for getting an alternative conservative view of the data.
1 way to do that is
my.NEW.data.frame <- my.data.frame[-boxplot.stats(my.data.frame$my.column)$out, ]
or
my.high.value <- which(my.data.frame$age > 200 | my.data.frame$age < 0)
my.NEW.data.frame <- my.data.frame[-my.high.value, ]
Outliers are quite similar to peaks, so a peak detector can be useful for identifying outliers. The method described here has quite good performance using z-scores. The animation part way down the page illustrates the method signaling on outliers, or peaks.
Peaks are not always the same as outliers, but they're similar frequently.
An example is shown here:
This dataset is read from a sensor via serial communications. Occasional serial communication errors, sensor error or both lead to repeated, clearly erroneous data points. There is no statistical value in these point. They are arguably not outliers, they are errors. The z-score peak detector was able to signal on spurious data points and generated a clean resulting dataset:
It is more difficult to remove outliers with grouped data because there is a risk of removing data points that are considered outliers in one group but not in others.
Because no dataset is provided I assume that there is a dependent variable "attractiveness", and two independent variables "age" and "gender". The boxplot shown in the original post above is then created with boxplot(dat$attractiveness ~ dat$gender + dat$age). To remove outliers you can use the following approach:
# Create a separate dataset for each group
group_data = split(dat, list(dat$age, dat$gender))
# Remove outliers from each dataset
group_data = lapply(group_data, function(x) {
# Extract outlier values from boxplot
outliers = boxplot.stats(x$attractiveness)$out
# Remove outliers from data
return(subset(x, !x$attractiveness %in% outliers))
})
# Combine datasets into a single dataset
dat = do.call(rbind, group_data)
Try this. Feed your variable in the function and save the o/p in the variable which would contain removed outliers
outliers<-function(variable){
iqr<-IQR(variable)
q1<-as.numeric(quantile(variable,0.25))
q3<-as.numeric(quantile(variable,0.75))
mild_low<-q1-(1.5*iqr)
mild_high<-q3+(1.5*iqr)
new_variable<-variable[variable>mild_low & variable<mild_high]
return(new_variable)
}

Plot "regression line" from multiple regression in R

I ran a multiple regression with several continuous predictors, a few of which came out significant, and I'd like to create a scatterplot or scatter-like plot of my DV against one of the predictors, including a "regression line". How can I do this?
My plot looks like this
D = my.data; plot( D$probCategorySame, D$posttestScore )
If it were simple regression, I could add a regression line like this:
lmSimple <- lm( posttestScore ~ probCategorySame, data=D )
abline( lmSimple )
But my actual model is like this:
lmMultiple <- lm( posttestScore ~ pretestScore + probCategorySame + probDataRelated + practiceAccuracy + practiceNumTrials, data=D )
I would like to add a regression line that reflects the coefficient and intercept from the actual model instead of the simplified one. I think I'd be happy to assume mean values for all other predictors in order to do this, although I'm ready to hear advice to the contrary.
This might make no difference, but I'll mention just in case, the situation is complicated slightly by the fact that I probably will not want to plot the original data. Instead, I'd like to plot mean values of the DV for binned values of the predictor, like so:
D[,'probCSBinned'] = cut( my.data$probCategorySame, as.numeric( seq( 0,1,0.04 ) ), include.lowest=TRUE, right=FALSE, labels=FALSE )
D = aggregate( posttestScore~probCSBinned, data=D, FUN=mean )
plot( D$probCSBinned, D$posttestScore )
Just because it happens to look much cleaner for my data when I do it this way.
To plot the individual terms in a linear or generalised linear model (ie, fit with lm or glm), use termplot. No need for binning or other manipulation.
# plot everything on one page
par(mfrow=c(2,3))
termplot(lmMultiple)
# plot individual term
par(mfrow=c(1,1))
termplot(lmMultiple, terms="preTestScore")
You need to create a vector of x-values in the domain of your plot and predict their corresponding y-values from your model. To do this, you need to inject this vector into a dataframe comprised of variables that match those in your model. You stated that you are OK with keeping the other variables fixed at their mean values, so I have used that approach in my solution. Whether or not the x-values you are predicting are actually legal given the other values in your plot should probably be something you consider when setting this up.
Without sample data I can't be sure this will work exactly for you, so I apologize if there are any bugs below, but this should at least illustrate the approach.
# Setup
xmin = 0; xmax=10 # domain of your plot
D = my.data
plot( D$probCategorySame, D$posttestScore, xlim=c(xmin,xmax) )
lmMultiple <- lm( posttestScore ~ pretestScore + probCategorySame + probDataRelated + practiceAccuracy + practiceNumTrials, data=D )
# create a dummy dataframe where all variables = their mean value for each record
# except the variable we want to plot, which will vary incrementally over the
# domain of the plot. We need this object to get the predicted values we
# want to plot.
N=1e4
means = colMeans(D)
dummyDF = t(as.data.frame(means))
for(i in 2:N){dummyDF=rbind(dummyDF,means)} # There's probably a more elegant way to do this.
xv=seq(xmin,xmax, length.out=N)
dummyDF$probCSBinned = xv
# if this gives you a warning about "Coercing LHS to list," use bracket syntax:
#dummyDF[,k] = xv # where k is the column index of the variable `posttestScore`
# Getting and plotting predictions over our dummy data.
yv=predict(lmMultiple, newdata=subset(dummyDF, select=c(-posttestScore)))
lines(xv, yv)
Look at the Predict.Plot function in the TeachingDemos package for one option to plot one predictor vs. the response at a given value of the other predictors.

How to remove outliers from a dataset

I've got some multivariate data of beauty vs ages. The ages range from 20-40 at intervals of 2 (20, 22, 24....40), and for each record of data, they are given an age and a beauty rating from 1-5. When I do boxplots of this data (ages across the X-axis, beauty ratings across the Y-axis), there are some outliers plotted outside the whiskers of each box.
I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.
Nobody has posted the simplest answer:
x[!x %in% boxplot.stats(x)$out]
Also see this: http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/
OK, you should apply something like this to your dataset. Do not replace & save or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
To see it in action:
set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
And once again, you should never do this on your own, outliers are just meant to be! =)
EDIT: I added na.rm = TRUE as default.
EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)
Use outline = FALSE as an option when you do the boxplot (read the help!).
> m <- c(rnorm(10),5,10)
> bp <- boxplot(m, outline = FALSE)
The boxplot function returns the values used to do the plotting (which is actually then done by bxp():
bstats <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
#need to "waste" this plot
bstats$out <- NULL
bstats$group <- NULL
bxp(bstats) # this will plot without any outlier points
I purposely did not answer the specific question because I consider it statistical malpractice to remove "outliers". I consider it acceptable practice to not plot them in a boxplot, but removing them just because they exceed some number of standard deviations or some number of inter-quartile widths is a systematic and unscientific mangling of the observational record.
I looked up for packages related to removing outliers, and found this package (surprisingly called "outliers"!): https://cran.r-project.org/web/packages/outliers/outliers.pdf
if you go through it you see different ways of removing outliers and among them I found rm.outlier most convenient one to use and as it says in the link above:
"If the outlier is detected and confirmed by statistical tests, this function can remove it or replace by
sample mean or median" and also here is the usage part from the same source:
"Usage
rm.outlier(x, fill = FALSE, median = FALSE, opposite = FALSE)
Arguments
x a dataset, most frequently a vector. If argument is a dataframe, then outlier is
removed from each column by sapply. The same behavior is applied by apply
when the matrix is given.
fill If set to TRUE, the median or mean is placed instead of outlier. Otherwise, the
outlier(s) is/are simply removed.
median If set to TRUE, median is used instead of mean in outlier replacement.
opposite if set to TRUE, gives opposite value (if largest value has maximum difference
from the mean, it gives smallest and vice versa)
"
x<-quantile(retentiondata$sum_dec_incr,c(0.01,0.99))
data_clean <- data[data$attribute >=x[1] & data$attribute<=x[2],]
I find this very easy to remove outliers. In the above example I am just extracting 2 percentile to 98 percentile of attribute values.
Wouldn't:
z <- df[df$x > quantile(df$x, .25) - 1.5*IQR(df$x) &
df$x < quantile(df$x, .75) + 1.5*IQR(df$x), ] #rows
accomplish this task quite easily?
Adding to #sefarkas' suggestion and using quantile as cut-offs, one could explore the following option:
newdata <- subset(mydata,!(mydata$var > quantile(mydata$var, probs=c(.01, .99))[2] | mydata$var < quantile(mydata$var, probs=c(.01, .99))[1]) )
This will remove the points points beyond the 99th quantile. Care should be taken like what aL3Xa was saying about keeping outliers. It should be removed only for getting an alternative conservative view of the data.
1 way to do that is
my.NEW.data.frame <- my.data.frame[-boxplot.stats(my.data.frame$my.column)$out, ]
or
my.high.value <- which(my.data.frame$age > 200 | my.data.frame$age < 0)
my.NEW.data.frame <- my.data.frame[-my.high.value, ]
Outliers are quite similar to peaks, so a peak detector can be useful for identifying outliers. The method described here has quite good performance using z-scores. The animation part way down the page illustrates the method signaling on outliers, or peaks.
Peaks are not always the same as outliers, but they're similar frequently.
An example is shown here:
This dataset is read from a sensor via serial communications. Occasional serial communication errors, sensor error or both lead to repeated, clearly erroneous data points. There is no statistical value in these point. They are arguably not outliers, they are errors. The z-score peak detector was able to signal on spurious data points and generated a clean resulting dataset:
It is more difficult to remove outliers with grouped data because there is a risk of removing data points that are considered outliers in one group but not in others.
Because no dataset is provided I assume that there is a dependent variable "attractiveness", and two independent variables "age" and "gender". The boxplot shown in the original post above is then created with boxplot(dat$attractiveness ~ dat$gender + dat$age). To remove outliers you can use the following approach:
# Create a separate dataset for each group
group_data = split(dat, list(dat$age, dat$gender))
# Remove outliers from each dataset
group_data = lapply(group_data, function(x) {
# Extract outlier values from boxplot
outliers = boxplot.stats(x$attractiveness)$out
# Remove outliers from data
return(subset(x, !x$attractiveness %in% outliers))
})
# Combine datasets into a single dataset
dat = do.call(rbind, group_data)
Try this. Feed your variable in the function and save the o/p in the variable which would contain removed outliers
outliers<-function(variable){
iqr<-IQR(variable)
q1<-as.numeric(quantile(variable,0.25))
q3<-as.numeric(quantile(variable,0.75))
mild_low<-q1-(1.5*iqr)
mild_high<-q3+(1.5*iqr)
new_variable<-variable[variable>mild_low & variable<mild_high]
return(new_variable)
}

Resources