Removing univariate outliers from a data frame (±3 SDs) - r

I'm so new to R that I'm having trouble finding what I need in other people's questions. I think my question is so easy that nobody else has bothered to ask it.
What would be the simplest code to create a new data frame that excludes univariate outliers (which I'm defining as points more than 3 SDs from their condition's mean), within their condition, on a certain variable?
I'm embarrassed to show what I've tried, but here it is:
greaterthan <- mean(dat$var2[dat$condition == "one"]) +
  2.5 * sd(dat$var2[dat$condition == "one"])
lessthan <- mean(dat$var2[dat$condition == "one"]) -
  2.5 * sd(dat$var2[dat$condition == "one"])
withoutliersremovedone1 <- dat$var2[dat$condition == "one"] < greaterthan
and I'm pretty much already stuck there.
Thanks

> dat <- data.frame(
var1=sample(letters[1:2],10,replace=TRUE),
var2=c(1,2,3,1,2,3,102,3,1,2)
)
> dat
var1 var2
1 b 1
2 a 2
3 a 3
4 a 1
5 b 2
6 b 3
7 a 102 #outlier
8 b 3
9 b 1
10 a 2
Now return only those rows which are not (!) more than 2 absolute SDs from the mean of the variable in question. Obviously, change 2 to however many SDs you want as the cutoff.
> dat[!(abs(dat$var2 - mean(dat$var2))/sd(dat$var2)) > 2,]
var1 var2
1 b 1
2 a 2
3 a 3
4 a 1
5 b 2
6 b 3
8 b 3 # row 7, the outlier, has been dropped
9 b 1
10 a 2
Or, more succinctly, using the scale function:
dat[!abs(scale(dat$var2)) > 2,]
var1 var2
1 b 1
2 a 2
3 a 3
4 a 1
5 b 2
6 b 3
8 b 3
9 b 1
10 a 2
Edit
This can be extended to filtering within groups using by:
do.call(rbind,by(dat,dat$var1,function(x) x[!abs(scale(x$var2)) > 2,] ))
This assumes dat$var1 is your variable defining the group each row belongs to.
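For comparison, here is a sketch of the same group-wise filtering with dplyr (assuming you have the dplyr package installed; same 2 SD cutoff as above):
library(dplyr)
# keep only rows within 2 SDs of their own group's mean (sketch)
dat %>%
  group_by(var1) %>%
  filter(abs(var2 - mean(var2)) <= 2 * sd(var2)) %>%
  ungroup()
One thing to watch with any within-group cutoff: an extreme value inflates its own group's SD, so with few observations per group the outlier may survive the filter.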

I use the winsorize() function in the robustHD package for this task. Here is its example:
R> example(winsorize)
winsrzR> ## generate data
winsrzR> set.seed(1234) # for reproducibility
winsrzR> x <- rnorm(10) # standard normal
winsrzR> x[1] <- x[1] * 10 # introduce outlier
winsrzR> ## winsorize data
winsrzR> x
[1] -12.070657 0.277429 1.084441 -2.345698 0.429125 0.506056
[7] -0.574740 -0.546632 -0.564452 -0.890038
winsrzR> winsorize(x)
[1] -3.250372 0.277429 1.084441 -2.345698 0.429125 0.506056
[7] -0.574740 -0.546632 -0.564452 -0.890038
winsrzR>
This defaults to median ± 2 MAD, but you can set the parameters for mean ± 3 SD.
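If I remember the arguments correctly (a sketch; check ?winsorize to be sure), switching to mean ± 3 SD looks like this:
library(robustHD)
# winsorize at mean +/- 3 sd instead of the default median +/- 2 mad (sketch)
winsorize(x, centerFun = mean, scaleFun = sd, const = 3)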

Related

How to filter variables by fold change difference in R

I'm trying to filter a very heterogeneous dataset.
I have numerous variables with several replicates each. I have a factor with two levels (let's say X and Y), and I would like to keep the variables whose mean shows a fold change of at least 2 between levels (X/Y >= 2 OR Y/X >= 2).
How can I achieve that in R? I can think of some ways, but they seem like too much of a hassle; I'm sure there is a better way. I would later run multivariate tests on the filtered variables.
This would be an example dataset:
d <- read.table(text = "a b c d factor replicate
1 2 2 3 X 1
3 2 4 4 X 2
2 3 1 2 X 3
1 2 3 2 X 4
5 2 6 4 Y 1
7 4 5 5 Y 2
8 5 7 4 Y 3
6 4 3 3 Y 4", header = TRUE)
From this example, only variables a and c should be kept.
Using colMeans:
#subset
x <- d[ d$factor == "X", 1:4 ]
y <- d[ d$factor == "Y", 1:4 ]
# check colmeans, and get index
which(colMeans(x/y) >= 2 | colMeans(y/x) >= 2)
# a c
# 1 3
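One caveat: colMeans(x/y) averages the per-replicate ratios, which assumes the rows of x and y pair up replicate-for-replicate. If you instead want the fold change of the group means, as the question literally states it, a sketch:
# fold change of the group means (sketch)
fc <- colMeans(y) / colMeans(x)
which(fc >= 2 | fc <= 0.5)
# a c
# 1 3
For this example, both readings select the same variables.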

How to write the remaining data frame in R after randomly subsetting the data

I took a random sample from a data frame. But I don't know how to get the remaining data frame.
df <- data.frame(x=rep(1:3,each=2),y=6:1,z=letters[1:6])
#select 3 random rows
df[sample(nrow(df),3)]
What I want is to get the remaining data frame with the other 3 rows.
sample returns a different random result each time you run it, so to reproduce its results you either need to set.seed or save its output in a variable.
Addressing your question: you simply need to add - before your index in order to get the rest of the data set.
Also, don't forget to add a comma after indx if you want to select rows (unlike in your question):
set.seed(1)
indx <- sample(nrow(df), 3)
Your subset
df[indx, ]
# x y z
# 2 1 5 b
# 6 3 1 f
# 3 2 4 c
Remaining data set
df[-indx, ]
# x y z
# 1 1 6 a
# 4 2 3 d
# 5 3 2 e
Try:
> df
x y z
1 1 6 a
2 1 5 b
3 2 4 c
4 2 3 d
5 3 2 e
6 3 1 f
>
> df2 = df[sample(nrow(df),3),]
> df2
x y z
5 3 2 e
3 2 4 c
1 1 6 a
> df[!rownames(df) %in% rownames(df2),]
x y z
2 1 5 b
4 2 3 d
6 3 1 f
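If you prefer a single logical mask that hands you both halves at once, a sketch:
in.sample <- rownames(df) %in% rownames(df2)
df[in.sample, ]   # the random sample (in original row order)
df[!in.sample, ]  # the remainder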

Permutation Testing

I have the following data set, for which I've written some code to do permutation testing:
df <- read.table(text="Group var1 var2 var3 var4 var5
1 3 5 7 3 7
1 3 7 5 9 6
1 5 2 6 7 6
1 9 5 7 0 8
1 2 4 5 7 8
1 2 3 1 6 4
2 4 2 7 6 5
2 0 8 3 7 5
2 1 2 3 5 9
2 1 5 3 8 0
2 2 6 9 0 7
2 3 6 7 8 8
2 10 6 3 8 0", header = TRUE)
This is my code. However, it doesn't seem to work for some reason: all the p-values I get at the end are about 0.5. Can anyone see what I'm doing wrong?
data = df[,2:6]
t.test.pvals = matrix(NA,nrow=1000,ncol=5)
ids.group1 = c(1,2,3,4,5,6)
ids.group2 = c(7,8,9,10,11,12,13)
#Define binary vector type for the t test
group1.binary <- rep(0,times=6)
group2.binary <- rep(1,times=7)
type <- c(group1.binary,group2.binary)
#Permutation testing
for (i in 1:1000) {
  index = sample(1:13, size=13, replace=F)
  group1 = data[which(index %in% ids.group1),]
  group2 = data[which(index %in% ids.group2),]
  group.total = rbind(group1, group2)
  temp = t(sapply(group.total, function(x)
    unlist(t.test(x ~ type)[c("p.value")])))
  temp = as.vector(temp)
  t.test.pvals[i,] = temp
}
You can either do a t-test or do permutation testing; in permutation testing, you don't use t-tests. See for instance here for a tutorial on permutation testing. Below you'll find the code for your particular example (e.g. var5):
# t-test
with(df, t.test(var5~Group))$p.value
# Permutation testing
# mean difference
mean.diff <- with(df, abs(mean(var5[Group==1])-mean(var5[Group==2])))
# function that calculates resampled mean
one.test <- function(x, y) {
  xstar <- sample(x)
  abs(mean(y[xstar==1]) - mean(y[xstar==2]))
}
# calculating the resampled means
many.diff <- c(mean.diff, with(df, replicate(1000, one.test(Group, var5))))
# pvalue
p5 <- mean(abs(many.diff) >= abs(mean.diff))
p5
The way you did it, you resampled and then calculated p-values from a t-test. After the resampling, the null hypothesis is true by construction, so the p-values are uniformly distributed between 0 and 1. Therefore when you look at summary(t.test.pvals), you see uniformly distributed p-values (as expected), with a mean around 0.5, which is exactly the symptom you observed.
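Extending the var5 example above to all five variables at once, a sketch:
# permutation p-values for every variable (sketch)
pvals <- sapply(df[, -1], function(v) {
  obs  <- abs(mean(v[df$Group == 1]) - mean(v[df$Group == 2]))
  perm <- replicate(1000, {
    g <- sample(df$Group)
    abs(mean(v[g == 1]) - mean(v[g == 2]))
  })
  mean(c(obs, perm) >= obs)
})
pvals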
@shadow explained the issue with your code well. If I were you, I would generally refrain from coding this kind of thing from scratch. The coin package implements all the permutation tests you could ever want to use. No need to re-invent the wheel.
This code
library(coin)
sapply(df[,-1], function(x) pvalue(oneway_test(x ~ as.factor(df$Group))))
## var1 var2 var3 var4 var5
## 0.548 0.544 0.898 0.685 0.304
does what you seem to want to do (i.e., test whether there is a shift in the distribution of varX in Group 1 versus Group 2).

Sort and number within levels of a factor in R

If I have the following data frame G:
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
I am trying to get:
z type x y
3 a 6 3
2 a 5 2
1 a 4 1
4 b 1 2
5 b 0.9 1
6 c 4 1
I.e., I want to sort the whole data frame within the levels of factor type, based on vector x; get the length of each level (a = 3, b = 2, c = 1); and then number the rows in a decreasing fashion in a new vector y.
My starting place is currently with sort():
tapply(y, x, sort)
Would it be best to use sapply to split everything first?
There are many ways to skin this cat. Here is one solution using base R and vectorized code, in two steps (without any apply):
Sort the data using order and xtfrm.
Use rle and sequence to generate the sequence.
Replicate your data:
dat <- read.table(text="
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
", header=TRUE, stringsAsFactors=FALSE)
Two lines of code:
r <- dat[order(dat$type, -xtfrm(dat$x)), ]
r$y <- sequence(rle(r$type)$lengths)
Results in:
r
z type x y
3 3 a 6.0 1
2 2 a 5.0 2
1 1 a 4.0 3
4 4 b 1.0 1
5 5 b 0.9 2
6 6 c 4.0 1
The call to order is slightly complicated. Since you are sorting one column in ascending order and a second in descending order, use the helper function xtfrm. See ?xtfrm for details, but it is also described in ?order.
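In short, unary minus is only defined for numeric values, so -xtfrm(col) gives you a descending sort key even when col is a character or factor column. A sketch:
# x is numeric here, so these two calls are equivalent;
# only the xtfrm form also works for character/factor columns
dat[order(dat$type, -xtfrm(dat$x)), ]
dat[order(dat$type, -dat$x), ]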
I like Andrie's better:
dat <- read.table(text="z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4", header=T)
Three lines of code:
dat <- dat[order(dat$type), ]
x <- by(dat, dat$type, nrow)
dat$y <- unlist(sapply(x, function(z) z:1))
I edited my response to adapt to the comments Andrie mentioned. This works, but if you went this route instead of Andrie's, you're crazy.
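Incidentally, the asker's target output numbers downwards within each level (3, 2, 1 for a). A base-R sketch with ave that reproduces that, starting from the freshly read-in dat:
r <- dat[order(dat$type, -xtfrm(dat$x)), ]
# count down within each level of type (sketch)
r$y <- ave(r$x, r$type, FUN = function(v) rev(seq_along(v)))
r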

How to calculate correlation in R

I wanted to calculate the correlation coefficient between columns of a subset of a data set x in R.
I have 40 models with 200 simulations each, 8000 rows in total.
I wanted to calculate the correlation coefficient between columns for each simulation (40 rows at a time).
cor(x[c(3,5)]) calculates it from all 8000 rows.
I need cor(x[c(3,5)]) but only where X$nsimul == 1, and so on.
Would you help me in this regard?
San
I'm not sure what exactly you're doing with x[c(3,5)], but it looks like you want to do something like the following. You have a data frame X like this:
set.seed(123)
X <- data.frame(nsimul = rep(1:2, each=5), a = sample(1:10), b = sample(1:10))
> X
nsimul a b
1 1 1 6
2 1 8 2
3 1 9 1
4 1 10 4
5 1 3 9
6 2 4 8
7 2 6 5
8 2 7 7
9 2 2 10
10 2 5 3
And you want to split this data frame by the nsimul column and calculate the correlation between a and b in each group. This is a classic split-apply-combine problem, for which the plyr package is very well-suited:
require(plyr)
> ddply(X, .(nsimul), summarize, cor_a_b = cor(a,b))
nsimul cor_a_b
1 1 -0.7549232
2 2 -0.5964848
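For what it's worth, plyr has since been superseded by dplyr; the same computation there (a sketch):
library(dplyr)
X %>%
  group_by(nsimul) %>%
  summarise(cor_a_b = cor(a, b))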
You can use the by function, e.g.:
correlations <- as.list(by(data=x, INDICES=x$nsimul, FUN=function(x) cor(x[3], x[5])))
# now you can access the correlation for each simulation;
# the list elements are named by the values of x$nsimul
correlations[["1"]]
correlations[["2"]]
...
correlations[["40"]]
