Removing outliers more than 3 SD above the mean of a time series profile in R

I want to remove outliers (by column) that lie more than 3 standard deviations above the mean, in R, for a time series with multiple columns; any row that contains an outlier should be removed.
In the example below, the last row would be removed because there is an outlier in column B.
See the example data and output below.
Example data
A B C
1 0.1 2
2 0.2 3
3 0.3 4
4 0.4 5
5 8.0 6
Example output
A B C
1 0.1 2
2 0.2 3
3 0.3 4
4 0.4 5

You should probably base your cut on the median (or a quantile) rather than the mean, since the mean itself is dragged upward by the very outliers you are trying to remove! A per-column cut at the 0.997 quantile is roughly the analogue of mean + 3 SD for normal data:
> d <- data.frame(A=1:5, B=c(0.1*(1:4),8), C=2:6)
> cut <- apply(d, 2, quantile, 0.997)
> sel <- apply(d, 1, function(x) all(x<cut))
> d[sel,]
A B C
1 1 0.1 2
2 2 0.2 3
3 3 0.3 4
4 4 0.4 5
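If you want the cut tied literally to the median, a robust analogue of "mean + 3 SD" is "median + 3 MAD" (R's mad() is already scaled to be consistent with the SD for normal data). A minimal sketch of that variant, keeping only the upper tail as in the question:

d <- data.frame(A=1:5, B=c(0.1*(1:4),8), C=2:6)
# per-column cut at median + 3*MAD, a robust stand-in for mean + 3*SD
cut <- apply(d, 2, function(x) median(x) + 3 * mad(x))
sel <- apply(d, 1, function(x) all(x < cut))
d[sel, ]
#   A   B C
# 1 1 0.1 2
# 2 2 0.2 3
# 3 3 0.3 4
# 4 4 0.4 5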

Related

R: Compare elements from a column based upon other column conditions?

I would like to create a new df based upon whether each subject's values in the second or third condition are greater than their value in the first condition.
Example df:
df1 <- data.frame(subject = rep(1:5, 3),
                  condition = rep(c("first", "second", "third"), each = 5),
                  values = c(.4, .4, .4, .4, .4, .6, .6, .6, .6, .4, .6, .6, .6, .4, .4))
> df1
subject condition values
1 1 first 0.4
2 2 first 0.4
3 3 first 0.4
4 4 first 0.4
5 5 first 0.4
6 1 second 0.6
7 2 second 0.6
8 3 second 0.6
9 4 second 0.6
10 5 second 0.4
11 1 third 0.6
12 2 third 0.6
13 3 third 0.6
14 4 third 0.4
15 5 third 0.4
The resulting df would be this:
> df2
subject condition values
1 1 first 0.4
2 2 first 0.4
3 3 first 0.4
4 4 first 0.4
6 1 second 0.6
7 2 second 0.6
8 3 second 0.6
9 4 second 0.6
11 1 third 0.6
12 2 third 0.6
13 3 third 0.6
14 4 third 0.4
Here, subject #5 does not meet the criteria: it is the only subject whose values in the second and third conditions are not greater than its value in the first condition.
Thanks.
We may group by 'subject' and filter the groups where any of the second or third 'values' are greater than the 'first':
library(dplyr)
df1 %>%
  group_by(subject) %>%
  filter(any(values[2:3] > first(values))) %>%
  ungroup()
-output
# A tibble: 12 × 3
subject condition values
<int> <chr> <dbl>
1 1 first 0.4
2 2 first 0.4
3 3 first 0.4
4 4 first 0.4
5 1 second 0.6
6 2 second 0.6
7 3 second 0.6
8 4 second 0.6
9 1 third 0.6
10 2 third 0.6
11 3 third 0.6
12 4 third 0.4
Using ave.
df1[with(df1, ave(values, subject, FUN=\(x) any(x[2:3] > x[1])) == 1), ]
# subject condition values
# 1 1 first 0.4
# 2 2 first 0.4
# 3 3 first 0.4
# 4 4 first 0.4
# 6 1 second 0.6
# 7 2 second 0.6
# 8 3 second 0.6
# 9 4 second 0.6
# 11 1 third 0.6
# 12 2 third 0.6
# 13 3 third 0.6
# 14 4 third 0.4
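Note that both solutions index values by position (values[2:3], x[2:3]), so they assume the rows within each subject are ordered first, second, third. A sketch that matches on the condition labels instead, and therefore does not depend on row order:

library(dplyr)
df1 %>%
  group_by(subject) %>%
  # compare second/third values against the (single) first value by label
  filter(any(values[condition %in% c("second", "third")] >
               values[condition == "first"])) %>%
  ungroup()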

Filter a group of a data.frame based on multiple conditions

I am looking for an elegant way to filter the values of a specific group of a big data.frame based on multiple conditions.
My data frame looks like this:
data <- data.frame(group = c("A","B","C","A","B","C","A","B","C"),
                   time = c(rep(1,3), rep(2,3), rep(3,3)),
                   value = c(0.2, 1, 1, 0.1, 10, 20, 10, 20, 30))
group time value
1 A 1 0.2
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
For time point 1 only, I would like to keep just the values that are greater than 0.1 and smaller than 1, and drop the rest.
I want my data.frame to look like this:
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Any help is highly appreciated.
With dplyr you can do
library(dplyr)
data %>% filter(!(time == 1 & (value <= 0.1 | value >= 1)))
# group time value
# 1 A 1 0.2
# 2 A 2 0.1
# 3 B 2 10.0
# 4 C 2 20.0
# 5 A 3 10.0
# 6 B 3 20.0
# 7 C 3 30.0
Or if you have too much free time and decide to avoid dplyr:
# flag the rows to keep at time 1 (the two lines below are equivalent)
ind <- with(data, time == 1 & value > 0.1 & value < 1)
ind <- ifelse(data$time == 1 & data$value > 0.1 & data$value < 1, TRUE, FALSE)
# drop the time-1 rows that were not flagged
data <- data[!(data$time == 1 & !ind), ]
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Another simple option would be to use subset twice and then append the results row-wise:
rbind(
  subset(data, time == 1 & value > 0.1 & value < 1),
  subset(data, time != 1)
)
# group time value
# 1 A 1 0.2
# 4 A 2 0.1
# 5 B 2 10.0
# 6 C 2 20.0
# 7 A 3 10.0
# 8 B 3 20.0
# 9 C 3 30.0
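Because !(time == 1 & value out of range) is the same as time != 1 | value in range, the two subset() calls can also be collapsed into a single one; this is just a reformulation of the dplyr condition above:

# keep rows that are not at time 1, or that are at time 1 with 0.1 < value < 1
subset(data, time != 1 | (value > 0.1 & value < 1))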

Data frame with multiple columns: for each group of equal values in one column, select the maximum in another column

I have the following data frame:
DF <- data.frame(A = c(0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.3, 0.4, 0.4),
                 B = c(1, 2, 1, 5, 10, 2, 3, 1, 6, 2),
                 B = c(1000, 50, 400, 6, 300, 2000, 20, 30, 40, 50))
and I want to filter DF so that, for each group of equal values in A, the row with the maximum in B is selected.
For example, for 0.1 in A the maximum in B is 5.
(Note that data.frame() silently renames the duplicated column name B to B.1, which is why the answers below show B.1 where the expected output says C.)
Ending with the new data frame:
A B C
0.1 5 6
0.2 10 300
0.3 1 30
0.4 6 40
I am not sure if this is a problem to solve with base R or with a library, because I am thinking of using dplyr and grouping by A. Am I correct?
There are a couple of base R options:
Using subset + ave
> subset(DF,as.logical(ave(B,A,FUN = function(x) x == max(x))))
A B B.1
4 0.1 5 6
5 0.2 10 300
8 0.3 1 30
9 0.4 6 40
Using merge + aggregate
> merge(aggregate(B~A,DF,max),DF)
A B B.1
1 0.1 5 6
2 0.2 10 300
3 0.3 1 30
4 0.4 6 40
An option with data.table: grouped by 'A', get the index where 'B' is max with which.max, wrapped in .I to return the row index. If we don't specify or rename it, by default it is returned as the 'V1' column, which we extract as a vector to subset the rows of the dataset.
library(data.table)
setDT(DF)[DF[, .I[which.max(B)], A]$V1]
-output
# A B B.1
#1: 0.1 5 6
#2: 0.2 10 300
#3: 0.3 1 30
#4: 0.4 6 40
You're right: using dplyr and grouping by A, you can use slice_max() (also from dplyr) to select the max value in B for each group.
library(dplyr)
DF %>%
  group_by(A) %>%
  slice_max(B)
Output:
# A tibble: 4 x 3
# Groups: A [4]
A B C
<dbl> <dbl> <dbl>
1 0.1 5 6
2 0.2 10 300
3 0.3 1 30
4 0.4 6 40
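One caveat: slice_max() keeps all tied rows by default (with_ties = TRUE), just as the subset + ave and merge + aggregate answers do. If you ever need exactly one row per group even when the maximum is tied, a sketch with ties disabled:

library(dplyr)
DF %>%
  group_by(A) %>%
  slice_max(B, n = 1, with_ties = FALSE) %>%  # break ties: one row per group
  ungroup()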

Calculate group mean with the same grouping factors several times

I have genetic data. It is quite big, about 17 000 genetic markers (SNPs) and 700 individuals. These SNPs can be assigned to a founder.
Now I want to calculate the average probability per 'founder segment'. A segment is defined as a part of the chromosome that is assigned to one founder uninterrupted.
In the example below I would have 3 segments.
In the end I want to know the average probability over all SNPs within a segment.
Chromosome SNP Founder Probability
1 1 7 0.6
1 2 7 0.5
1 3 7 0.7
1 4 2 0.5
1 5 2 0.8
1 6 7 0.6
1 7 7 0.5
I can group easily with dplyr, but I don't want the first segment from founder 7 grouped together with the later segment from founder 7.
So what I want:
Chromosome SNP Founder Probability Average
1 1 7 0.6 0.6
1 2 7 0.5 0.6
1 3 7 0.7 0.6
1 4 2 0.5 0.65
1 5 2 0.8 0.65
1 6 7 0.6 0.55
1 7 7 0.5 0.55
How can I calculate a group mean when I have the same grouping factors several times?
With dplyr we can compare the adjacent elements of 'Founder' to create a grouping variable along with 'Chromosome', and then get the mean of 'Probability':
library(dplyr)
df1 %>%
  group_by(Chromosome,
           grp1 = cumsum(Founder != lag(Founder, default = Founder[n()]))) %>%
  mutate(Average = mean(Probability))
# Chromosome SNP Founder Probability grp1 Average
# <int> <int> <int> <dbl> <int> <dbl>
#1 1 1 7 0.6 0 0.60
#2 1 2 7 0.5 0 0.60
#3 1 3 7 0.7 0 0.60
#4 1 4 2 0.5 1 0.65
#5 1 5 2 0.8 1 0.65
#6 1 6 7 0.6 2 0.55
#7 1 7 7 0.5 2 0.55
Or using data.table: we convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'Chromosome' and the run-length-type id (rleid) of 'Founder', assign (:=) the mean of 'Probability' as the 'Average' column.
library(data.table)
setDT(df1)[, Average := mean(Probability) , .(Chromosome, grp1 = rleid(Founder))]
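The same run-length segment id can also be built in base R with rle(), should you want to avoid both packages; a sketch (pasting Chromosome and Founder together so a segment cannot span chromosomes):

# segment id: consecutive runs of the same Chromosome/Founder combination
r <- rle(paste(df1$Chromosome, df1$Founder))
seg <- rep(seq_along(r$lengths), r$lengths)
# ave()'s default FUN is mean
df1$Average <- ave(df1$Probability, seg)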

Randomize or permute values in a data.frame

I have a data.frame that looks like this (my real data frame is bigger):
df <- data.frame(A = c("a","b","c","d","e","f","g","h","i"),
                 B = c("1","1","1","2","2","2","3","3","3"),
                 C = c(0.1, 0.2, 0.4, 0.1, 0.5, 0.7, 0.1, 0.2, 0.5))
> df
A B C
1 a 1 0.1
2 b 1 0.2
3 c 1 0.4
4 d 2 0.1
5 e 2 0.5
6 f 2 0.7
7 g 3 0.1
8 h 3 0.2
9 i 3 0.5
I want to add several n columns (something similar to permutations) where column D would be a random value from df$C, but this value should only be picked from those rows with the same value of df$B. An example of the desired output would be:
df <- data.frame(A = c("a","b","c","d","e","f","g","h","i"),
                 B = c("1","1","1","2","2","2","3","3","3"),
                 C = c(0.1, 0.2, 0.4, 0.1, 0.5, 0.7, 0.1, 0.2, 0.5),
                 D = c(0.2, 0.2, 0.1, 0.5, 0.7, 0.1, 0.5, 0.5, 0.2))
> df
A B C D
1 a 1 0.1 0.2
2 b 1 0.2 0.2
3 c 1 0.4 0.1
4 d 2 0.1 0.5
5 e 2 0.5 0.7
6 f 2 0.7 0.1
7 g 3 0.1 0.5
8 h 3 0.2 0.5
9 i 3 0.5 0.2
I've tried with the plyr package but my approach does not work properly:
ddply(df, levels(.(B)), transform, D=sample(C))
I have also thought about splitting the data frame on df$B and then adding the column to each piece with lapply, however I have no clue how to select the levels of df$B.
Many thanks
No need for plyr; ave will do the trick:
transform(df, D=ave(C, B, FUN=function(b) sample(b, replace=TRUE)))
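Note that replace=TRUE samples with replacement, which matches the desired output above (0.2 appears twice within group 1). If you instead want a true permutation, where each value of C is used exactly once per group, drop the replacement; and since several such columns were asked for, here is a sketch of both (the D1, D2, ... names are just for illustration):

# true permutation within each level of B: every value of C used exactly once
df <- transform(df, D = ave(C, B, FUN = sample))

# add n permuted columns D1 ... Dn (illustrative names)
n <- 3
for (i in seq_len(n)) {
  df[[paste0("D", i)]] <- ave(df$C, df$B, FUN = sample)
}
# caveat: sample() on a length-1 numeric group x would expand to sample(1:x)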
