R: Winsorizing with specific cut-off values does not work

I want to winsorize my data using the mean plus (/minus) 2 standard deviations of the variable as cut-off points. I thus want to winsorize every variable one by one.
The variable I want to winsorize in the example code I provided below has 5 outliers.
I have created two benchmarks (high and low) and have inserted them in minval and maxval.
Just to prevent misunderstandings: I have several timepoints and groups in my data frame, the grepl-part is to just get one group at one measurement point for the winsorizing.
My code so far:
library(DescTools)
benchhigh <- mean(ds$RRS[grepl('^34.*', ds$QUESTNNR)], na.rm=TRUE) +
2*sd(ds$RRS[grepl('^34.*', ds$QUESTNNR)], na.rm=TRUE)
benchlow <- mean(ds$RRS[grepl('^34.*', ds$QUESTNNR)], na.rm=TRUE) -
2*sd(ds$RRS[grepl('^34.*', ds$QUESTNNR)], na.rm=TRUE)
ds$RRSout <- Winsorize( ds$RRS[ grepl('^34.*', ds$QUESTNNR) ],
minval = benchlow , maxval = benchhigh, na.rm = TRUE)
The error I get is:
"Error in $<-.data.frame(*tmp*, RRSout, value = c(1, 1.33333333333333, :
replacement has 38 rows, data has 510"
My replacement only has 38 rows because the ^34.* group has only 38 participants. I have to winsorize the outliers per group and measurement point though...
How can I replace/winsorize the outliers of the specific participant group in the RRS variable?
Thank you very much in advance!

Your input to Winsorize() is restricted to certain observations (grepl('^34.*', ds$QUESTNNR)), so its output has only as many elements as that subset. You can only assign the result back to the same number of rows (ideally the very same rows):
ds$RRSout[ grepl('^34.*', ds$QUESTNNR) ] <-
Winsorize( ds$RRS[ grepl('^34.*', ds$QUESTNNR) ],
minval = benchlow , maxval = benchhigh, na.rm = TRUE)
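The same idea can be sketched without DescTools, for illustration only. The block below assumes a toy data frame that borrows the question's column names (RRS, QUESTNNR) but uses made-up values, and a hypothetical helper winsorize_2sd() that clamps with pmin()/pmax(); it is not the DescTools implementation.

```r
# Hypothetical helper (not part of DescTools): clamp x to mean +/- k * sd.
winsorize_2sd <- function(x, k = 2) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  pmin(pmax(x, m - k * s), m + k * s)
}

# Toy data, made up for illustration; only the column names follow the question.
ds <- data.frame(
  QUESTNNR = c(rep("34a", 10), rep("12b", 4)),
  RRS      = c(1, 1.2, 1.1, 0.9, 1.3, 1, 1.1, 0.8, 1.2, 9, 2, 2.5, 3, 2.2)
)

idx <- grepl("^34", ds$QUESTNNR)              # rows of the group to winsorize
ds$RRSout <- ds$RRS                           # start from a full-length copy
ds$RRSout[idx] <- winsorize_2sd(ds$RRS[idx])  # overwrite only that group's rows
```

Starting from a full-length copy keeps the other groups' values in RRSout instead of leaving them NA. If every group should be treated in one pass, something like ave(ds$RRS, group, FUN = winsorize_2sd) would apply the helper within each level of a grouping factor (group being whatever factor identifies group and measurement point in your data).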

Related

How do I remove outliers from my dataset? [duplicate]

I've got some multivariate data of beauty vs ages. The ages range from 20-40 at intervals of 2 (20, 22, 24....40), and for each record of data, they are given an age and a beauty rating from 1-5. When I do boxplots of this data (ages across the X-axis, beauty ratings across the Y-axis), there are some outliers plotted outside the whiskers of each box.
I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.
Nobody has posted the simplest answer:
x[!x %in% boxplot.stats(x)$out]
Also see this: http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/
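As for how the boxplot decides what an outlier is: boxplot.stats() flags values lying more than coef (1.5 by default) times the hinge spread beyond the hinges. A small sketch with made-up numbers, reproducing the rule by hand:

```r
x <- c(1, 2, 3, 4, 5, 100)      # made-up values; 100 is the obvious outlier
bs <- boxplot.stats(x)          # coef = 1.5 by default
hinges <- bs$stats[c(2, 4)]     # lower and upper hinge
fence <- 1.5 * diff(hinges)     # allowed distance beyond each hinge
# Values outside [lower hinge - fence, upper hinge + fence] are flagged.
flagged <- x[x < hinges[1] - fence | x > hinges[2] + fence]
kept <- x[!x %in% bs$out]       # the one-liner from the answer above
```

Note that the hinges come from fivenum() and can differ slightly from quantile(x, c(.25, .75)), so hand-rolled quantile-based filters will not always match the plotted points exactly.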
OK, you should apply something like this to your dataset. Do not replace & save or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
To see it in action:
set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
And once again, you should never do this on your own, outliers are just meant to be! =)
EDIT: I added na.rm = TRUE as default.
EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)
Use outline = FALSE as an option when you do the boxplot (read the help!).
> m <- c(rnorm(10),5,10)
> bp <- boxplot(m, outline = FALSE)
The boxplot function returns the values used to do the plotting (which is actually then done by bxp()):
bstats <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
#need to "waste" this plot
bstats$out <- NULL
bstats$group <- NULL
bxp(bstats) # this will plot without any outlier points
I purposely did not answer the specific question because I consider it statistical malpractice to remove "outliers". I consider it acceptable practice to not plot them in a boxplot, but removing them just because they exceed some number of standard deviations or some number of inter-quartile widths is a systematic and unscientific mangling of the observational record.
I looked for packages related to removing outliers and found this one (surprisingly called "outliers"!): https://cran.r-project.org/web/packages/outliers/outliers.pdf
If you go through it you will see different ways of removing outliers. Among them I found rm.outlier the most convenient one to use; as it says in the link above:
"If the outlier is detected and confirmed by statistical tests, this function can remove it or replace by
sample mean or median" and also here is the usage part from the same source:
"Usage
rm.outlier(x, fill = FALSE, median = FALSE, opposite = FALSE)
Arguments
x a dataset, most frequently a vector. If argument is a dataframe, then outlier is
removed from each column by sapply. The same behavior is applied by apply
when the matrix is given.
fill If set to TRUE, the median or mean is placed instead of outlier. Otherwise, the
outlier(s) is/are simply removed.
median If set to TRUE, median is used instead of mean in outlier replacement.
opposite if set to TRUE, gives opposite value (if largest value has maximum difference
from the mean, it gives smallest and vice versa)
"
x <- quantile(data$attribute, c(0.01, 0.99))
data_clean <- data[data$attribute >= x[1] & data$attribute <= x[2], ]
I find this a very easy way to remove outliers. In the above example I am just keeping the attribute values between the 1st and the 99th percentile.
Wouldn't:
z <- df[df$x > quantile(df$x, .25) - 1.5*IQR(df$x) &
df$x < quantile(df$x, .75) + 1.5*IQR(df$x), ] #rows
accomplish this task quite easily?
Adding to @sefarkas' suggestion and using quantiles as cut-offs, one could explore the following option:
newdata <- subset(mydata,!(mydata$var > quantile(mydata$var, probs=c(.01, .99))[2] | mydata$var < quantile(mydata$var, probs=c(.01, .99))[1]) )
This will remove the points below the 1st percentile and above the 99th percentile. Care should be taken, as aL3Xa was saying, about keeping outliers. They should be removed only to get an alternative, more conservative view of the data.
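One way to do that is
# boxplot.stats()$out holds outlier values, not row numbers, so match on values
out <- boxplot.stats(my.data.frame$my.column)$out
my.NEW.data.frame <- my.data.frame[!my.data.frame$my.column %in% out, ]
or
my.high.value <- which(my.data.frame$age > 200 | my.data.frame$age < 0)
# setdiff() is safe even when no row matches (a bare -integer(0) would drop every row)
my.NEW.data.frame <- my.data.frame[setdiff(seq_len(nrow(my.data.frame)), my.high.value), ]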
Outliers are quite similar to peaks, so a peak detector can be useful for identifying outliers. The method described here has quite good performance using z-scores. The animation part way down the page illustrates the method signaling on outliers, or peaks.
Peaks are not always the same as outliers, but the two frequently coincide.
An example is shown here:
This dataset is read from a sensor via serial communications. Occasional serial communication errors, sensor errors, or both lead to repeated, clearly erroneous data points. There is no statistical value in these points. They are arguably not outliers, they are errors. The z-score peak detector was able to signal on the spurious data points and generated a clean resulting dataset:
It is more difficult to remove outliers with grouped data because there is a risk of removing data points that are considered outliers in one group but not in others.
Because no dataset is provided I assume that there is a dependent variable "attractiveness", and two independent variables "age" and "gender". The boxplot shown in the original post above is then created with boxplot(dat$attractiveness ~ dat$gender + dat$age). To remove outliers you can use the following approach:
# Create a separate dataset for each group
group_data = split(dat, list(dat$age, dat$gender))
# Remove outliers from each dataset
group_data = lapply(group_data, function(x) {
# Extract outlier values from boxplot
outliers = boxplot.stats(x$attractiveness)$out
# Remove outliers from data
return(subset(x, !x$attractiveness %in% outliers))
})
# Combine datasets into a single dataset
dat = do.call(rbind, group_data)
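A variant of the same idea that avoids splitting and re-binding, sketched on made-up data. The column names follow the assumption above (attractiveness, age, gender), the ratings are invented, and the helper is_box_outlier() is hypothetical. The data are chosen so that a rating of 5 is an outlier in one group but typical in the other, which is exactly the risk mentioned above.

```r
# Made-up ratings: 5 is unusual at age 20, 1 is unusual at age 22.
dat <- data.frame(
  age = rep(c(20, 22), each = 7),
  gender = "f",
  attractiveness = c(1, 1, 1, 1, 1, 1, 5,
                     5, 5, 5, 4, 5, 5, 1)
)

# Hypothetical helper: TRUE where a value is a boxplot outlier within its own vector.
is_box_outlier <- function(v) v %in% boxplot.stats(v)$out

grp <- interaction(dat$age, dat$gender, drop = TRUE)
# ave() applies the helper inside each group and returns 0/1 in the original row order.
flag <- ave(dat$attractiveness, grp, FUN = is_box_outlier) == 1
dat_clean <- dat[!flag, ]
```

Because the flag is computed inside each group, the 5s at age 22 and the 1s at age 20 are kept, while row order and row names of the original data frame are preserved.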
Try this. Pass your variable to the function and save the output in a variable, which will then contain the values with the outliers removed.
outliers<-function(variable){
iqr<-IQR(variable)
q1<-as.numeric(quantile(variable,0.25))
q3<-as.numeric(quantile(variable,0.75))
mild_low<-q1-(1.5*iqr)
mild_high<-q3+(1.5*iqr)
new_variable<-variable[variable>mild_low & variable<mild_high]
return(new_variable)
}

Surprising behaviour survey R package with missing values

I'm reproducing a question that I couldn't find an answer to.
"I got some surprising results when using the svytotal routine from the survey package with data containing missing values.
Some example code demonstrating the behaviour is included below.
I have a stratified sampling design where I want to estimate the total
income. In some strata some of the incomes are missing. I want to
ignore these missing incomes. I would have expected that
svytotal(~income, design=mydesign, na.rm=TRUE) would do the trick.
However, when calculating the estimates 'by hand' the estimates were
different from those obtained from svytotal. The estimated mean
incomes do agree with each other. It seems that using the na.rm option
with svytotal is the same as replacing the missing values with zeros,
which is not what I would have expected, especially since this
behaviour seems to differ from that of svymean. Is there a reason for
this behaviour?
I can of course remove the missing values myself before creating the
survey object. However, with many different variables with different
missing values, this is not very practical. Is there an easy way to
get the behaviour I want?"
library(survey)
library(plyr)
# generate some data
data <- data.frame(
id = 1:20,
stratum = rep(c("a", "b"), each=10),
income = rnorm(20, 100),
n = rep(c(100, 200), each=10)
)
data$income[5] <- NA
# calculate mean and total income for every stratum using survey package
des <- svydesign(ids=~id, strata=~stratum, data=data, fpc=~n)
svyby(~income, by=~stratum, FUN=svytotal, design=des, na.rm=TRUE)
mn <- svyby(~income, by=~stratum, FUN=svymean, design=des, na.rm=TRUE)
mn
n <- svyby(~n, by=~stratum, FUN=svymean, design=des)
# total does not equal mean times number of persons in stratum
mn[2] * n[2]
# calculate mean and total income 'by hand'. This does not give the same total
# as svytotal, but it does give the same mean
ddply(data, .(stratum), function(d) {
data.frame(
mean = mean(d$income, na.rm=TRUE),
n = mean(d$n),
total = mean(d$income, na.rm=TRUE) * mean(d$n)
)
})
# when we set income to 0 for missing cases and repeat the previous estimation
# we get the same answer as svytotal (but not svymean)
data2 <- data
data2$income[is.na(data$income )] <- 0
ddply(data2, .(stratum), function(d) {
data.frame(
mean = mean(d$income, na.rm=TRUE),
n = mean(d$n),
total = mean(d$income, na.rm=TRUE) * mean(d$n)
)
})
Yes, there is a reason for this behaviour!
The easiest way to think about the answer survey is trying to give here is that it sets the weights for the missing observations to zero. That is, the package gives population estimates for the subdomain of non-missing values. This is important for getting the right standard errors. [Note: it doesn't actually do it by just setting the weights to zero; there are some optimisations, but that's the answer it gives.]
If you set the weights to zero in svytotal, you get the sum of the non-missing values, which is the same as you get if you set the missing values to 0 or if they weren't ever sampled. When you come to compute standard errors it matters exactly which one you did, but not for point estimates.
If you set the weights to zero in svymean you get the mean of the non-missing values, which is not the same as you get if you set the missing values to zero (though it is the same as if they just weren't ever sampled).
I don't know exactly what you mean when you say you want to 'ignore' the missing incomes, but if you want to multiply mn[2] and n[2] meaningfully, they need to be computed on the same subdomain: you have one of them computed only where income is not missing and the other computed on all observations.
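The point-estimate part of this argument can be checked by hand. The sketch below uses plain base R rather than the survey package, with made-up incomes and the stratum sizes from the question (10 sampled out of 100 and out of 200). All variable names are illustrative, and standard errors are deliberately left out.

```r
stratum <- rep(c("a", "b"), each = 10)
N       <- rep(c(100, 200), each = 10)   # population size per stratum
income  <- c(101, 99, 100, 102, NA, 98, 100, 101, 99, 100,
             100, 100, 101, 99, 100, 102, 98, 100, 100, 100)  # made-up values
w  <- N / 10                             # sampling weight N_h / n_h
ok <- !is.na(income)

# Zero weight for the missing row: total over the non-missing subdomain.
total_subdomain <- tapply(w * ifelse(ok, income, 0), stratum, sum)
# Replacing the missing income by 0 gives the same point estimate.
total_zero_fill <- tapply(w * replace(income, !ok, 0), stratum, sum)
# Estimated size of the non-missing subdomain, and the mean over it.
size_subdomain <- tapply(w * ok, stratum, sum)
mean_subdomain <- total_subdomain / size_subdomain
```

Here mean_subdomain * size_subdomain reproduces total_subdomain, whereas multiplying the mean by the full stratum size (100 in stratum a) does not, because the estimated size of the non-missing subdomain in that stratum is only 90.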

Is there an inbuilt function to identify outliers in all columns of a dataframe? [duplicate]

Calculate the average of a subsample with cast in R reshape2

I am attempting to calculate the average of a subsample with the function acast. As the subsample, I want to use the data within a percentile range, for which I use quantile within the subset argument. The problem seems to be that the quantiles are calculated before the data are arranged by groups, so the same values are used for every group. See the example below:
library(reshape2)
library(plyr)
data(airquality)
aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
## here I calculate the length for each group for the whole sample
acast(aqm, variable + Month ~ . , length, value.var = "value")
## here I calculate the length for the range within the quantiles 0.05 - 0.5
acast(aqm, variable + Month ~ . , length, value.var = "value", subset = .(value >= quantile(value,c(0.05)) & value <= quantile(value,c(0.5))))
With the subset I should get about half of the observations for each group, but instead I get way less than half in some cases and way more in others. It seems to me that the quantiles are calculated on the whole melted data, so the function applies the same quantile values to all groups.
Does anyone have an idea how to have the quantiles calculated for each group? Any help would be appreciated. I know this would be possible with a loop over the categories, but I want to see if there is a way to do it all at once.
Thanks,
Sergio René
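One possible way around this, offered as a sketch rather than a reshape2 solution: compute the cut-offs per group first with base R's ave(), then count or average the rows that fall inside their own group's range. The long data frame is built by hand here to keep the example free of package dependencies; the names long, grp, lo, hi and keep are illustrative.

```r
aq <- airquality                        # ships with R (datasets package)
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# Long format built by hand, mirroring melt(..., na.rm = TRUE).
long <- data.frame(
  Month    = rep(aq$Month, times = length(vars)),
  variable = rep(vars, each = nrow(aq)),
  value    = unlist(aq[vars], use.names = FALSE)
)
long <- long[!is.na(long$value), ]

grp <- interaction(long$variable, long$Month, drop = TRUE)
# Per-group cut-offs, repeated along the rows of their own group.
lo <- ave(long$value, grp, FUN = function(v) quantile(v, 0.05))
hi <- ave(long$value, grp, FUN = function(v) quantile(v, 0.50))
keep <- long$value >= lo & long$value <= hi

n_all  <- tapply(keep, grp, length)     # group sizes
n_kept <- tapply(keep, grp, sum)        # rows inside the per-group range
mean_kept <- tapply(long$value[keep], grp[keep], mean)
```

Since lo and hi are ordinary columns aligned with the rows, the filtered data long[keep, ] could also be handed to acast afterwards if the wide layout is needed.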

How to remove outliers from a dataset
