removing outliers in a vector


The aim is to remove outliers in a vector.
x = datasets::islands ($area)
x = 12 13 13 13 14 14 15 15 16 16 16 19 21 23 25 26 29 29 30 30
32 33 36 40 42 43 43 44 49 58 73 82 82 84 89 183 184 227 280 306
840 2968 3745 5500 6795 9390 11506 16988
So far I have been using:
x_rm_out <- x[!x %in% boxplot.stats(x, coef = .05, do.conf = TRUE, do.out = TRUE)$out]
Result
[1] 12 13 13 13 14 14 15 15 16 16 16 19 21 23 25 26 29 29 30 30 32 33 36 40 42 43 43 44 49 58 73
[32] 82 82 84 89 183 184
Is there a way to remove 183 & 184 from vector (x)?

Finding Outliers
An easy way to find outliers is the rstatix package; you can then filter them out with dplyr:
# Load libraries:
library(rstatix)
library(dplyr)
# Turn x into a data frame (its single column is named x):
x <- data.frame(x)
# Identify outliers in column x:
x %>%
  identify_outliers(x)
You should get an output like this now:
x is.outlier is.extreme
1 840 TRUE TRUE
2 2968 TRUE TRUE
3 3745 TRUE TRUE
4 5500 TRUE TRUE
5 6795 TRUE TRUE
6 9390 TRUE TRUE
7 11506 TRUE TRUE
8 16988 TRUE TRUE
Creating Dataframe Without Them
Now filter out the outliers and store the result in a new data frame. Filtering with x < 840 removes exactly the values flagged above. You can also use your previously established cutoff (x < 183), which drops 183 and 184 as well:
# Filter out the outliers and create a new data frame:
x2 <- x %>%
  filter(x < 183)
x2
Printing x2 gives this output without outliers:
x
1 12
2 13
3 13
4 13
5 14
6 14
7 15
8 15
9 16
10 16
11 16
12 19
13 21
14 23
15 25
16 26
17 29
18 29
19 30
20 30
21 32
22 33
23 36
24 40
25 42
26 43
27 43
28 44
29 49
30 58
31 73
32 82
33 82
34 84
35 89

To supplement Shawn's answer, you can also use the rstatix::is_outlier() function on numeric vectors.
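A minimal base R sketch of the same idea on a plain numeric vector, with no data frame needed. It assumes the usual 1.5 * IQR rule computed from quantile(), which is what the output above is consistent with; whether rstatix::is_outlier() uses exactly this rule internally is an assumption here. The variable names are illustrative only.

```r
# Sorted island areas, as in the question
x <- sort(as.vector(datasets::islands))

# Quartiles and the 1.5 * IQR fences (assumed rule, see the note above)
q <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]
is_out <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr

x_no_outliers <- x[!is_out]  # drops 840 and everything above it
x_below_183 <- x[x < 183]    # the question's stricter cutoff, also drops 183 and 184

# With rstatix installed, the first step is intended to correspond to:
# x[!rstatix::is_outlier(x)]
```

The 1.5 * IQR rule keeps 183 and 184, so removing them needs an explicit cutoff such as x < 183 or a smaller fence multiplier.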

Related

R plot numbers of factor levels having n, n+1, .... counts

I have a very large dataset (> 200000 lines) with 6 variables (only the first two shown)
>head(gt7)
ChromKey POS
1 2447 25
2 2447 183
3 26341 75
4 26341 2213
5 26341 2617
6 54011 1868
I have converted the ChromKey variable to a factor variable made up of > 55000 levels.
> gt7[1] <- lapply(gt7[1], factor)
> is.factor(gt7$ChromKey)
[1] TRUE
I can further make a table with counts of ChromKey levels
> table(gt7$ChromKey)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
88 88 44 33 11 11 33 22 121 11 22 11 11 11 22 11 33
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
22 22 44 55 22 11 22 66 11 11 11 22 11 11 11 187 77
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
77 11 44 11 11 11 11 11 11 22 66 11 22 11 44 22 22
... output cropped
Which I can save in table format
> table <- table(gt7$ChromKey)
> head(table)
1 2 3 4 5 6
88 88 44 33 11 11
I would like to know whether it is possible to have a table (and histogram) of the number of levels with specific count numbers. From the example above, I would expect
88 44 33 11
2 1 1 2
I would very much appreciate any hint.

We can apply table again to the output to get the frequency of each count:
table(table(gt7$ChromKey))
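As a small illustration, the sketch below rebuilds only the six levels shown in head(table) from the question (the real gt7 data is not available, so chrom_key is a stand-in) and applies table twice.

```r
# Stand-in for gt7$ChromKey: six levels with the counts shown in the question
chrom_key <- factor(rep(1:6, times = c(88, 88, 44, 33, 11, 11)))

level_counts <- table(chrom_key)   # rows per level: 88 88 44 33 11 11
count_freq <- table(level_counts)  # number of levels sharing each count
count_freq

# For the histogram-style view, a bar plot of the same table is one option:
# barplot(count_freq, xlab = "rows per level", ylab = "number of levels")
```

The counts come back in ascending order (11, 33, 44, 88); rev(count_freq) gives the descending order used in the question.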

Translate this R geometric problem using numpy random geometric

How can I translate this geometric distribution problem to numpy?
Products produced by a machine have a 3% defective rate.
What is the probability that the first defective occurs in the fifth item inspected?
P(X = 5) = P(first 4 non-defective) P(5th defective) = (0.97^4)(0.03)
In R:
> dgeom(x = 4, prob = .03)
[1] 0.02655878
The convention in R is to record X as the number of failures that occur before the first success.
Is my numpy code OK?
result = np.random.geometric(p=0.03, size=1000)
print(result);
result = (result == 5).sum() / 1000.
print(result * 1000,"%");
I get 17% as a result with numpy. Is that OK? It seems wrong because the defect rate is only 3%.
This is the numpy result array:
""" [ 31 20 37 9 47 31 22 7 44 15 52 15 4 14 36 45 26 27
9 48 30 5 7 17 7 24 121 22 23 49 2 26 25 8 4 5
3 27 70 71 3 1 19 22 103 18 14 20 34 45 8 169 11 63
29 71 30 79 75 19 56 9 5 8 15 44 8 12 40 29 46 2
144 69 65 1 4 90 20 187 100 52 46 76 3 105 12 110 31 3
113 18 6 15 127 22 6 7 3 18 123 41 69 104 13 18 2 8
52 35 54 27 74 22 31 27 3 15 21 26 13 3 32 10 131 20
I guess that 31 is the number of integrity checks before a failure .... 20 , 37 etc ...

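This is what I would do:
import numpy as np
np.random.seed(1)
# 1 marks a defective item (3% rate); each row is a run of 5 inspected items
tests = np.random.choice([0, 1], size=(1000, 5), p=[0.97, 0.03])
# The first defective is at index 4, i.e. the fifth item inspected
((np.argmax(tests, axis=1) == 4) & (tests[:, 4] == 1)).mean()
# The estimate approaches 0.97**4 * 0.03 ≈ 0.0266 as the number of rows grows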
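For comparison, here is a sketch of the same check on the R side. np.random.geometric counts the number of trials up to and including the first success (so its values start at 1), while R's dgeom and rgeom count the failures before it, which is why numpy's `result == 5` corresponds to `x = 4` in R. Note also that the question's code prints `result * 1000` with a percent sign, which inflates the percentage tenfold. The seed and sample size below are arbitrary choices.

```r
p <- 0.03

# Exact probability that the first defective is the 5th item:
# 4 failures before the first success
exact <- dgeom(4, prob = p)
by_hand <- (1 - p)^4 * p   # same value from the formula in the question

# Monte Carlo estimate with R's geometric sampler (counts failures, starts at 0)
set.seed(42)
draws <- rgeom(1e5, prob = p)
estimate <- mean(draws == 4)

c(exact = exact, by_hand = by_hand, estimate = estimate)
```

With only 1000 draws, as in the question, the estimate can easily be off by half a percentage point or more, so a larger sample gives a more convincing comparison.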

Repeating elements in a vector with a for loop

I want to make a vector from 3:50 in R, looking like
3 4 4 5 6 6 7 8 8 .. 50 50
I want to use a for loop inside a for loop, but it's not doing what I want.
f <- c()
for (i in 3:50) {
for(j in 1:2) {
f = c(f, i)
}
}
What is wrong with it?

Another option is to use an embedded rep:
rep(3:50, rep(1:2, 24))
which gives:
[1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16 16 17 18 18 19 20 20
[28] 21 22 22 23 24 24 25 26 26 27 28 28 29 30 30 31 32 32 33 34 34 35 36 36 37 38 38
[55] 39 40 40 41 42 42 43 44 44 45 46 46 47 48 48 49 50 50
This utilizes the fact that the times argument of rep can also be an integer vector whose length equals the length of the x argument.
You can generalize this to:
s <- 3
e <- 50
v <- 1:2
rep(s:e, rep(v, (e-s+1)/2))
Yet another option, using a mix of rep and rep_len:
v <- 3:50
rep(v, rep_len(1:2, length(v)))
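A short sketch of how a vector-valued times argument behaves, using a toy input that is not from the answer: element i of x is repeated times[i] times.

```r
# times has the same length as x, so each element gets its own repeat count
letters_out <- rep(c("a", "b", "c"), times = c(1, 2, 3))
letters_out   # "a" "b" "b" "c" "c" "c"

# Same idea on a short range: alternate repeat counts 1, 2, 1, 2
v <- 3:6
out <- rep(v, rep_len(1:2, length(v)))
out           # 3 4 4 5 6 6

# rep_len also copes with an odd number of elements
odd <- rep(3:7, rep_len(1:2, 5))
odd           # 3 4 4 5 6 6 7
```

The rep_len variant does not need the length of the range to be even, whereas the (e-s+1)/2 form above assumes it is.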

A solution based on sapply.
as.vector(sapply(0:23 * 2 + 2, function(x) x + c(1, 2, 2)))
# [1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16 16 17 18 18 19 20 20 21 22 22 23 24 24 25 26 26
# [37] 27 28 28 29 30 30 31 32 32 33 34 34 35 36 36 37 38 38 39 40 40 41 42 42 43 44 44 45 46 46 47 48 48 49 50 50
Benchmarking
Here is a performance comparison of all the current answers. The results show that cumsum(rep(c(1, 1, 0), 24)) + 2L (m8) is the fastest, while rep(3:50, rep(1:2, 24)) (m1) is almost as fast.
library(microbenchmark)
library(ggplot2)
perf <- microbenchmark(
m1 = {rep(3:50, rep(1:2, 24))},
m2 = {rep(3:50, each = 2)[c(TRUE, FALSE, TRUE, TRUE)]},
m3 = {v <- 3:50; sort(c(v,v[v %% 2 == 0]))},
m4 = {as.vector(t(cbind(seq(3,49,2),seq(4,50,2),seq(4,50,2))))},
m5 = {as.vector(sapply(0:23 * 2 + 2, function(x) x + c(1, 2, 2)))},
m6 = {sort(c(3:50, seq(4, 50, 2)))},
m7 = {rep(seq(3, 50, 2), each=3) + c(0, 1, 1)},
m8 = {cumsum(rep(c(1, 1, 0), 24)) + 2L},
times = 10000L
)
perf
# Unit: nanoseconds
#  expr   min    lq      mean median    uq     max neval
#    m1   514  1028  1344.980   1029  1542  190200 10000
#    m2  1542  2570  3083.716   3084  3085  191229 10000
#    m3 26217 30329 35593.596  31871 34442 5843267 10000
#    m4 43180 48321 56988.386  50891 55518 6626173 10000
#    m5 30843 35984 42077.543  37526 40611 6557289 10000
#    m6 40611 44209 50092.131  46779 50891  446714 10000
#    m7 13879 16449 19314.547  17478 19020 6309001 10000
#    m8     0  1028  1256.715   1028  1542   71454 10000

Use the rep function together with a recycled logical index, ...[c(TRUE, FALSE, TRUE, TRUE)]:
rep(3:50, each = 2)[c(TRUE, FALSE, TRUE, TRUE)]
## [1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16 16 17 18 18 19
## [26] 20 20 21 22 22 23 24 24 25 26 26 27 28 28 29 30 30 31 32 32 33 34 34 35 36
## [51] 36 37 38 38 39 40 40 41 42 42 43 44 44 45 46 46 47 48 48 49 50 50
If you use a logical vector (TRUE/FALSE) as an index (inside [ ]), a TRUE leads to selection of the corresponding element and a FALSE leads to omission. If the logical index vector (c(TRUE, FALSE, TRUE, TRUE)) is shorter than the indexed vector (rep(3:50, each = 2) in your case), the index vector is recycled.
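To make the recycling visible, here is a sketch on a shorter vector (the names x, idx and doubled are just for illustration). The four-element index is reused for positions 5 to 8.

```r
x <- 1:8
idx <- c(TRUE, FALSE, TRUE, TRUE)

# idx is recycled to TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
kept <- x[idx]
kept     # 1 3 4 5 7 8

# Applied to a doubled range, this drops the duplicate of every odd number
doubled <- rep(3:6, each = 2)   # 3 3 4 4 5 5 6 6
result <- doubled[idx]
result   # 3 4 4 5 6 6
```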
Also a side note: Whenever you use R code like
x = c(x, something)
or
x = rbind(x, something)
or similar, you are adopting a C-like programming style in R. This makes your code unnecessarily complex and might lead to low performance and out-of-memory issues if you work with large (say, 200MB+) data sets. R is designed to spare you that low-level tinkering with data structures.
For more information about the gluttons and their punishment, read the R Inferno, Circle 2: Growing Objects.

The easiest way I have found is to create another vector containing only the even values (based on the OP's intention) and then simply join the two vectors. For example:
v <- 3:50
sort(c(v,v[v %% 2 == 0]))
# [1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16 16
#[22] 17 18 18 19 20 20 21 22 22 23 24 24 25 26 26 27 28 28
#[40] 29 30 30 31 32 32 33 34 34 35 36 36 37 38 38 39 40 40 41 42 42
#[61] 43 44 44 45 46 46 47 48 48 49 50 50

Here is a loop-free one-line solution:
> as.vector(t(cbind(seq(3,49,2),seq(4,50,2),seq(4,50,2))))
[1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16 16 17
[23] 18 18 19 20 20 21 22 22 23 24 24 25 26 26 27 28 28 29 30 30 31 32
[45] 32 33 34 34 35 36 36 37 38 38 39 40 40 41 42 42 43 44 44 45 46 46
[67] 47 48 48 49 50 50
It forms a matrix whose first column is the odd numbers in the range 3:50 and whose second and third columns are the even numbers in that range and then (by taking the transpose) reads it off row by row.
The problem with your nested loop approach is that the fundamental pattern is one of length 3, repeated 24 times (instead of a pattern of length 2 repeated 48 times). If you wanted to use a nested loop, the outer loop could iterate 24 times and the inner loop 3. The first pass through the outer loop could construct 3,4,4. The second pass could construct 5,6,6. Etc. Since there are 24*3 = 72 elements, you can pre-allocate the vector (by using f <- vector("numeric", 72)) so that you aren't growing it 1 element at a time. The idiom f <- c(f,i) that you are using at each stage copies all of the old elements just to create a new vector which is only 1 element longer. Here there are too few elements for it to really make a difference, but if you try to create large vectors that way the performance can be shockingly bad.
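The nested loop described above could look like the following sketch. It is one possible way to write it, not the only one; the counter k and the helper name base are illustrative. The vector is pre-allocated with 24 * 3 = 72 elements and filled in place.

```r
# Pre-allocate all 72 slots instead of growing the vector
f <- numeric(72)
k <- 1                      # next position to fill

for (i in 1:24) {           # 24 blocks of the pattern odd, even, even
  base <- 2 * i + 1         # 3, 5, ..., 49
  for (j in 1:3) {
    # j = 1 writes the odd number; j = 2 and j = 3 write the following even number
    f[k] <- base + (j > 1)
    k <- k + 1
  }
}

head(f, 9)   # 3 4 4 5 6 6 7 8 8
```

For a vector this small the vectorised answers are simpler and faster; the loop is mainly useful for seeing where the original attempt went wrong.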

Here is a method that combines portions of a couple of the other answers.
rep(seq(3, 50, 2), each=3) + c(0, 1, 1)
[1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16
[21] 16 17 18 18 19 20 20 21 22 22 23 24 24 25 26 26 27 28 28 29
[41] 30 30 31 32 32 33 34 34 35 36 36 37 38 38 39 40 40 41 42 42
[61] 43 44 44 45 46 46 47 48 48 49 50 50
Here is a second method, using cumsum:
cumsum(rep(c(1, 1, 0), 24)) + 2L
This should be very quick.

This should do too.
sort(c(3:50, seq(4, 50, 2)))

Another idea, though not competing in speed with fastest solutions:
mat <- matrix(3:50,nrow=2)
c(rbind(mat,mat[2,]))
# [1] 3 4 4 5 6 6 7 8 8 9 10 10 11 12 12 13 14 14 15 16 16 17 18 18 19 20 20 21 22 22
# [31] 23 24 24 25 26 26 27 28 28 29 30 30 31 32 32 33 34 34 35 36 36 37 38 38 39 40 40 41 42 42
# [61] 43 44 44 45 46 46 47 48 48 49 50 50

Creating a sequence in R [duplicate]

Let's say, I created two vectors like:
Ncla = 10
CC.1 = seq(2,((Ncla *Ncla)-Ncla),(Ncla+1))
CC.2 = seq(Ncla,((Ncla *Ncla)-Ncla),(Ncla))
and, I tried to create the following sequence:
#[1] 2 3 4 5 6 7 8 9 10 13 14 15 16 17 18 19 20 24 25 26
# 27 28 29 30 35 36 37 38 39 40 46 47 48 49 50 57 58 59 60 68 69 70 79 80 90
using the statement:
for(i in 1:(Ncla-1)) A.1[i]={c(seq(CC.1[i],CC.2[i],length = 1))}
but it doesn't work.
Any help is greatly appreciated.

Try
unlist(Map(seq, CC.1, CC.2))
# [1] 2 3 4 5 6 7 8 9 10 13 14 15 16 17 18 19 20 24 25 26 27 28 29 30 35
#[26] 36 37 38 39 40 46 47 48 49 50 57 58 59 60 68 69 70 79 80 90
Or
unlist(sapply(seq_along(CC.1), function(i) seq(CC.1[i], CC.2[i])))
Or
A.1 <- list()
for(i in seq_along(CC.1)) A.1[[i]] <- seq(CC.1[i], CC.2[i])
unlist(A.1)
# [1] 2 3 4 5 6 7 8 9 10 13 14 15 16 17 18 19 20 24 25 26 27 28 29 30 35
#[26] 36 37 38 39 40 46 47 48 49 50 57 58 59 60 68 69 70 79 80 90
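To see what Map is doing in the first option, the sketch below rebuilds the question's vectors and inspects the intermediate list before flattening it. The name pieces is illustrative.

```r
Ncla <- 10
CC.1 <- seq(2, Ncla * Ncla - Ncla, Ncla + 1)   # start points: 2 13 24 ... 90
CC.2 <- seq(Ncla, Ncla * Ncla - Ncla, Ncla)    # end points:  10 20 30 ... 90

# Map pairs the i-th start with the i-th end and calls seq(start, end) on each pair
pieces <- Map(seq, CC.1, CC.2)
lengths(pieces)   # 9 8 7 6 5 4 3 2 1

# unlist concatenates the pieces into one vector
result <- unlist(pieces)
length(result)    # 45
```

Collecting the pieces in a list and flattening once at the end avoids assigning a multi-element sequence to a single vector position, which is one thing the original loop attempted.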

test<-NULL
for(i in 1:(Ncla-1)) {
A.1=c(seq(CC.1[i],CC.2[i],1))
test<-c(test,A.1)
}
test
Your mistake: You were not saving your results.

R: Which components of vector are out of order

Say I have vector:
x <- c(11,6,5,3,2,1,25,10,16,12,22,24,19,14,18,32,17,15,8,7,
33,4,27,9,29,13,30,23,20,31,26,21,28)
x
[1] 11 6 5 3 2 1 25 10 16 12 22 24 19 14 18 32 17 15 8 7 33 4 27 9 29 13 30 23 20
[30] 31 26 21 28
I want to identify which elements are not ascending. So, for example, elements 2 to 6 (values 6,5,3,2,1) are out of order because they are less than element 1 (11). Then element 7 (25) is in order because it's greater than 11, then all elements until element 16 (32) are out of order. I want to remove those elements.
Is there a vectorized/shortcut way of doing this?

Create some data:
set.seed(1)
x <- sample(100, 30)
x
[1] 27 37 57 89 20 86 97 62 58 6 19 16 61 34 67 43 88 83 32 63 75 17 51 10 21 29 1 28 81 25
Select only those elements that are greater than or equal to the cumulative maximum:
x[x >= cummax(x)]
[1] 27 37 57 89 97
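Applied to the vector from the question, the same idea looks like the sketch below. cummax(x)[i] is the largest value seen up to position i, so an element is in order exactly when it equals that running maximum. The names in_order and out_of_order are illustrative.

```r
x <- c(11, 6, 5, 3, 2, 1, 25, 10, 16, 12, 22, 24, 19, 14, 18, 32, 17, 15, 8, 7,
       33, 4, 27, 9, 29, 13, 30, 23, 20, 31, 26, 21, 28)

running_max <- cummax(x)

in_order <- x[x >= running_max]           # elements that set a new maximum
out_of_order <- which(x < running_max)    # positions of the elements to remove

in_order            # 11 25 32 33
head(out_of_order)  # 2 3 4 5 6 8
```

Using >= rather than > keeps repeated maxima; with strictly distinct values, as here, both give the same result.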
