How to find a percentile that maximizes the correlation coefficient between two vectors?

Suppose I have two continuous vectors such as:
set.seed(123)
df <- data.frame(x = rnorm(100),
y = rnorm(100,3,5))
with(df, cor(x,y))
My question is how to find the percentile of x that maximizes the absolute correlation of x and y when only the observations above that percentile are kept, for example:
perc <- quantile(df$x, 0.3)
df1 <- subset(df, x > perc)
with(df1, cor(x,y))
In other words, how do I find perc?

This problem is ill-defined. Take your example data set and the function you want to maximize (copied from @coffeinjunky's answer):
set.seed(123)
df <- data.frame(x = rnorm(100),
y = rnorm(100,3,5))
findperc <- function(prop, dat) {
perc <- quantile(dat$x, prop)
with(subset(dat, dat$x > perc), abs(cor(x,y)))
}
Now plot the result of findperc for percentiles between 0 and 1.
x <- seq(0,1,0.01)
plot(x,sapply(x,findperc,df),type="l")
The circled point in the resulting plot marks the value found by optimize in @coffeinjunky's answer. This is clearly only a local maximum. The relevance of the warning from @Thierry, "You need to rethink the question. As soon as x and y contain only 2 elements the correlation will be either 1 or -1", should be apparent on the right-hand side of the plot.
In general, the fact that you are getting moderate to high correlations when starting from independently generated random variables should warn you that your results are spurious and the method is suspect.
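To illustrate how easily such a search produces a seemingly large correlation, here is a minimal sketch that repeats it on freshly simulated independent data. The helper name `best_abs_cor`, the object `grid`, the upper cut-off of 0.9, and the 50 repetitions are assumptions made for this example; they do not come from the answers.

```r
# Same objective as above: |cor| on the subset with x above the given quantile.
findperc <- function(prop, dat) {
  perc <- quantile(dat$x, prop)
  with(subset(dat, x > perc), abs(cor(x, y)))
}

# Hypothetical grid of cut-offs; it stops at 0.9 so that about 10 of the
# 100 observations always remain in the subset.
grid <- seq(0, 0.9, by = 0.01)

# Hypothetical helper: for one simulated data set, return |cor| at the
# lowest cut-off and the largest |cor| found anywhere on the grid.
best_abs_cor <- function(seed) {
  set.seed(seed)
  d <- data.frame(x = rnorm(100), y = rnorm(100, 3, 5))  # independent by construction
  c(at_zero = findperc(0, d),
    best    = max(sapply(grid, findperc, dat = d)))
}

# One row per simulated data set.
res <- t(sapply(1:50, best_abs_cor))
summary(res)
```

Because the grid contains the cut-off 0, the `best` column can never be smaller than `at_zero`. Any gap between the two comes purely from searching over cut-offs, since x and y are unrelated by design.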

Well, why not take your question literally, and just search for it? For instance, try:
findperc <- function(prop, dat) {
perc <- quantile(dat$x, prop)
with(subset(dat, dat$x > perc), abs(cor(x,y)))
}
optimize(findperc, lower=0, upper=1, maximum=TRUE, dat=df)
This defines a function that computes the absolute correlation between your vectors on the subset above a given percentile (a single value here), just as in your example code. I then pass this function to a one-dimensional optimizer, which searches for the input that produces the maximum output.
Edit: Thanks to @A. Webb's answer I learned that optimize performs a local search (its documentation describes a combination of golden section search and successive parabolic interpolation) rather than a grid search, so it can stop at a local maximum. I thought that this was the main difference between optim and optimize, a clearly wrong assumption I should have checked myself. However, to provide a solution using grid search that will get you closer to the global maximum, one could use the following:
x <- seq(0,0.97,0.01)
x[which.max(sapply(x, findperc, dat=df))]
Note that I have cut x here at 97%. This ensures that at least 3 observations are left in the subset (given a sample size of 100).

Related

How do I remove outliers from my dataset? [duplicate]

I've got some multivariate data of beauty vs ages. The ages range from 20-40 at intervals of 2 (20, 22, 24....40), and for each record of data, they are given an age and a beauty rating from 1-5. When I do boxplots of this data (ages across the X-axis, beauty ratings across the Y-axis), there are some outliers plotted outside the whiskers of each box.
I want to remove these outliers from the data frame itself, but I'm not sure how R calculates outliers for its box plots. Below is an example of what my data might look like.

Nobody has posted the simplest answer:
x[!x %in% boxplot.stats(x)$out]
Also see this: http://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/
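Since the question asks how R decides which points are outliers in a boxplot, the following sketch reproduces the rule by hand. It assumes the default multiplier of 1.5 and made-up example data; the variable names (`five`, `spread`, `is_out`, `manual`) are only for illustration.

```r
set.seed(1)
x <- c(-10, rnorm(100), 10)

# boxplot.stats() works from Tukey's five-number summary: the "hinges"
# (elements 2 and 4 of fivenum) play the role of the quartiles.
five   <- fivenum(x)
spread <- five[4] - five[2]   # distance between the two hinges
coef   <- 1.5                 # default multiplier used by boxplot()

# A point is flagged when it lies more than coef * spread beyond a hinge.
is_out <- x < five[2] - coef * spread | x > five[4] + coef * spread
manual <- x[is_out]

# Compare with what boxplot.stats() reports.
identical(sort(manual), sort(boxplot.stats(x)$out))
```

Note that the hinges are not always identical to quantile(x, c(.25, .75)), so filters built on quantile() and IQR() may occasionally disagree with the points a boxplot draws.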

OK, you should apply something like this to your dataset. Do not replace & save or you'll destroy your data! And, btw, you should (almost) never remove outliers from your data:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
To see it in action:
set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
And once again, you should never do this on your own, outliers are just meant to be! =)
EDIT: I added na.rm = TRUE as default.
EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)

Use outline = FALSE as an option when you do the boxplot (read the help!).
m <- c(rnorm(10),5,10)
bp <- boxplot(m, outline = FALSE)

The boxplot function returns the values used to do the plotting (which is actually then done by bxp()):
bstats <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
#need to "waste" this plot
bstats$out <- NULL
bstats$group <- NULL
bxp(bstats) # this will plot without any outlier points
I purposely did not answer the specific question because I consider it statistical malpractice to remove "outliers". I consider it acceptable practice to not plot them in a boxplot, but removing them just because they exceed some number of standard deviations or some number of inter-quartile widths is a systematic and unscientific mangling of the observational record.

I looked for packages related to removing outliers and found this one (unsurprisingly called "outliers"!): https://cran.r-project.org/web/packages/outliers/outliers.pdf
If you go through it you will see different ways of removing outliers. Among them I found rm.outlier the most convenient to use. As the manual linked above says:
"If the outlier is detected and confirmed by statistical tests, this function can remove it or replace by sample mean or median"
Here is the usage part from the same source:
"Usage
rm.outlier(x, fill = FALSE, median = FALSE, opposite = FALSE)
Arguments
x: a dataset, most frequently a vector. If argument is a dataframe, then outlier is removed from each column by sapply. The same behavior is applied by apply when the matrix is given.
fill: If set to TRUE, the median or mean is placed instead of outlier. Otherwise, the outlier(s) is/are simply removed.
median: If set to TRUE, median is used instead of mean in outlier replacement.
opposite: if set to TRUE, gives opposite value (if largest value has maximum difference from the mean, it gives smallest and vice versa)"

x <- quantile(data$attribute, c(0.01, 0.99))
data_clean <- data[data$attribute >= x[1] & data$attribute <= x[2], ]
I find this a very easy way to remove outliers. In the above example I keep only the rows whose attribute value lies between the 1st and the 99th percentile.

Wouldn't:
z <- df[df$x > quantile(df$x, .25) - 1.5*IQR(df$x) &
df$x < quantile(df$x, .75) + 1.5*IQR(df$x), ] #rows
accomplish this task quite easily?

Adding to @sefarkas' suggestion and using quantiles as cut-offs, one could explore the following option:
newdata <- subset(mydata,!(mydata$var > quantile(mydata$var, probs=c(.01, .99))[2] | mydata$var < quantile(mydata$var, probs=c(.01, .99))[1]) )
This removes the points below the 1st and above the 99th percentile. Keep in mind what aL3Xa said about keeping outliers: they should be removed only to get an alternative, conservative view of the data.

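One way to do that is
# $out holds the outlying values (not row numbers), so match on values
out <- boxplot.stats(my.data.frame$my.column)$out
my.NEW.data.frame <- my.data.frame[!my.data.frame$my.column %in% out, ]
or
# logical filter; also works when no row is flagged (assumes age has no NA)
my.high.value <- my.data.frame$age > 200 | my.data.frame$age < 0
my.NEW.data.frame <- my.data.frame[!my.high.value, ]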

Outliers are quite similar to peaks, so a peak detector can be useful for identifying outliers. The method described here performs quite well using z-scores. The animation partway down the page illustrates the method signaling on outliers, or peaks.
Peaks are not always the same as outliers, but they are frequently similar.
An example is shown here:
This dataset is read from a sensor via serial communications. Occasional serial communication errors, sensor errors, or both lead to repeated, clearly erroneous data points. There is no statistical value in these points. They are arguably not outliers but errors. The z-score peak detector was able to signal on the spurious data points and generated a clean resulting dataset.

It is more difficult to remove outliers with grouped data because there is a risk of removing data points that are considered outliers in one group but not in others.
Because no dataset is provided I assume that there is a dependent variable "attractiveness", and two independent variables "age" and "gender". The boxplot shown in the original post above is then created with boxplot(dat$attractiveness ~ dat$gender + dat$age). To remove outliers you can use the following approach:
# Create a separate dataset for each group
group_data = split(dat, list(dat$age, dat$gender))
# Remove outliers from each dataset
group_data = lapply(group_data, function(x) {
# Extract outlier values from boxplot
outliers = boxplot.stats(x$attractiveness)$out
# Remove outliers from data
return(subset(x, !x$attractiveness %in% outliers))
})
# Combine datasets into a single dataset
dat = do.call(rbind, group_data)

Try this. Pass your variable to the function and save the output in a new variable, which will contain the data with the outliers removed:
outliers<-function(variable){
iqr<-IQR(variable)
q1<-as.numeric(quantile(variable,0.25))
q3<-as.numeric(quantile(variable,0.75))
mild_low<-q1-(1.5*iqr)
mild_high<-q3+(1.5*iqr)
new_variable<-variable[variable>mild_low & variable<mild_high]
return(new_variable)
}

Does cattell's profile similarity coefficient (Rp) exist as a function in R?

I'm comparing different measures of distance and similarity for vector profiles (subtest results) in R. Most of them are easy to compute and/or exist in dist().
Unfortunately, one that might be interesting but is too difficult for me to calculate myself is Cattell's Rp. I cannot find it in R.
Does anybody know if this already exists?
Or can you help me write a function?
The formula (Cattell 1994) of Rp is this:
(2k-d^2)/(2k + d^2)
where:
k is the median of the chi-square distribution for a sample of size n;
d is the sum of the (weighted = m) differences between the two profiles,
something like: sum(m(x(i)-y(i)));
One thing I don't know is how to get the chi-square median in there.
Thank you
What I have so far, without defining k, is:
Rp.Cattell <- function(x,y){z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);return(z)}
Vector examples are:
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
They are measured by the same device, but relate to different body parts. They don't need to be standardised or weighted, I would say.

This page gives a general formula for k, and then gives a more thorough method using SAS/IML which pretty much gives the same results. So I used the general formula, added calculation of degrees of freedom, which leads to this:
Rp.Cattell <- function(x,y) {
dof <- (2-1) * (length(y)-1)
k <- (1-2/(9*dof))^3
z <- (2*k-sum(sum(x-y))^2)/(2*k+sum(sum(x-y))^2)
return(z)
}
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
Rp.Cattell(x, y)
# [1] -0.9012083
Does this figure appear to make sense?

While trying to verify the function, I found out that the median of the chi-square distribution is simply its 50% quantile. So the function should be:
Rp.Cattell <- function(x,y){
dof <- (2-1) * (length(y)-1)
k <- qchisq(.50, df=dof)
z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);
return(z)}
It is necessary, though, to standardize the values first so that the results are distributed correctly.
So:
library ("stringr")
# they are centered already
x <- as.vector(scale(c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758),center=F, scale=T))
y <- as.vector(scale(c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925),center=F, scale=T))
Rp.Cattell(x, y)
# [1] -0.584423
This sounds reasonable now - or not?

I think the calculation of z is incorrect.
You need the sum of the squared differences, not the square of the sum of the differences. Besides, the multiplication operator is missing in 2k.
It should be
z <- (2*k-sum((x-y)^2))/(2*k+sum((x-y)^2))
Do you agree?
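Putting the pieces from this thread together, a minimal sketch could look like the following. It takes k as the chi-square median via qchisq and d^2 as the sum of squared differences. The function name `rp_cattell`, the length check, and the choice of degrees of freedom (copied from the answers above, which may not match every textbook definition) are assumptions, and the profiles are assumed to be on a common scale already.

```r
# Sketch of Cattell's profile similarity coefficient:
#   (2k - d^2) / (2k + d^2)
# Assumes x and y are numeric profiles of equal length on a common scale.
rp_cattell <- function(x, y) {
  stopifnot(is.numeric(x), is.numeric(y), length(x) == length(y))
  dof <- length(x) - 1          # degrees of freedom as used in the answers above
  k   <- qchisq(0.5, df = dof)  # median of the chi-square distribution
  d2  <- sum((x - y)^2)         # sum of squared differences between profiles
  (2 * k - d2) / (2 * k + d2)
}

x <- c(-1.2357, -1.1999, -1.4727, -0.3915, -0.2547, -0.4758)
y <- c(0.7785, 0.9357, 0.7165, -0.6067, -0.4668, -0.5925)

rp_cattell(x, y)   # similarity of the two example profiles
rp_cattell(x, x)   # identical profiles: d2 is 0, so the result is 1
```

With this form the coefficient is bounded between -1 and 1 and reaches 1 only when the two profiles coincide, which gives a quick sanity check for any implementation.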

R: Finding solutions for new x values with nlmrt

Good day,
I have tried to figure this out, but I really can't!! I'll supply an example of my data in R:
x <- c(36,71,106,142,175,210,246,288,357)
y <- c(19.6,20.9,19.8,21.2,17.6,23.6,20.4,18.9,17.2)
table <- data.frame(x,y)
library(nlmrt)
curve <- "y~ a + b*exp(-0.01*x) + (c*x)"
ones <- list(a=1, b=1, c=1)
Then I use wrapnls to fit the curve and to find a solution:
solve <- wrapnls(curve, data=table, start=ones, trace=FALSE)
This is all fine and works for me. Then, using the following, I obtain a prediction of y for each of the x values:
predict(solve)
But how do I find the prediction of y for new x values? For instance:
new_x <- c(10, 30, 50, 70)
I have tried:
predict(solve, new_x)
predict(solve, 10)
It just gives the same output as:
predict(solve)
I really hope someone can help! I know that I could substitute the values of a, b, and c from 'solve' into the curve formula together with the desired x value, but I'm wondering if there is a simpler option that also does not require plotting the data first.

Predict requires the new data to be a data.frame with column names that match the variable names used in your model (whether your model has one or many variables). All you need to do is use
predict(solve, data.frame(x=new_x))
# [1] 18.30066 19.21600 19.88409 20.34973
And that will give you a prediction for just those 4 values. It's somewhat unfortunate that any mistake in specifying the new data results in the fitted values for the original data being returned. An error message probably would have been more useful, but oh well.
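For comparison, the manual route mentioned in the question (substituting the estimates into the formula) can be sketched as follows. To keep the example free of extra packages it assumes base R's nls() as a stand-in for wrapnls(); the names `dat`, `fit`, `p`, and `manual` are made up for the illustration.

```r
x <- c(36, 71, 106, 142, 175, 210, 246, 288, 357)
y <- c(19.6, 20.9, 19.8, 21.2, 17.6, 23.6, 20.4, 18.9, 17.2)
dat <- data.frame(x, y)

# Assumption: stats::nls() is used here instead of nlmrt::wrapnls().
fit <- nls(y ~ a + b * exp(-0.01 * x) + c * x, data = dat,
           start = list(a = 1, b = 1, c = 1))

new_x <- c(10, 30, 50, 70)
p <- coef(fit)   # named vector of the estimates a, b, c

# Substitute the estimates into the model formula by hand ...
manual <- p[["a"]] + p[["b"]] * exp(-0.01 * new_x) + p[["c"]] * new_x

# ... and compare with predict(), given a data frame whose column is named x.
all.equal(manual, as.numeric(predict(fit, newdata = data.frame(x = new_x))))
```

Both routes evaluate the same fitted curve, so the data frame version of predict() is simply the shorter way to write it.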
