I have a data frame with a series of values, containing certain anomalous readings I want to identify. I would like to add a third column to my data frame marking each reading as either "anomaly" or "normal". Looking over a plot of my data, the odd dips are pretty obvious by eye, but I am having trouble getting R to recognize the odd readings, since the baseline average changes over time. The best I can come up with is three rules for classifying something as an "anomaly".
1: Starting with the second value, if the second value is within a close range of the first value, then mark as "N" for normal in the third column. And so on through the rest of the data set.
2: If the second value represents a large increase or decrease from the first value, mark as "A" for anomaly in the third column.
3: If a value is marked as "A", the following value will be marked as "A" as well if it is within a small range of the previous anomalous value. If the following value represents a large increase or decrease from the previous anomalous value, it is to be marked as "N".
This is the best logic I could come up with, but if you can think of a better idea after looking at the data below, I'm all for it.
So given a dummy data set:
SampleNum <- 1:50
Value <- c(1, 2, 2, 2, 23, 22, 2, 3, 2, -23, -23, 4, 4, 5, 5, 25, 24,
6, 7, 6, 35, 38, 20, 21, 22, -22, 2, 2, 6, 7, 7, 6, 30, 31,
6, 6, 6, 5, 22, 22, 4, 5, 4, 5, 30, 39, 18, 18, 19, 18)
DF <- data.frame(SampleNum, Value)
This is how I might see the final data, with a third column identifying which values are anomalous.
SampleNum Value Name
1 1 N
2 2 N
3 2 N
4 2 N
5 23 A
6 22 A
7 2 N
8 3 N
9 2 N
10 -23 A
11 -23 A
12 4 N
13 4 N
14 5 N
15 5 N
16 25 A
17 24 A
18 6 N
19 7 N
20 6 N
21 35 A
22 38 A
23 20 N
24 21 N
25 22 N
26 -22 A
27 2 N
28 2 N
29 6 N
30 7 N
31 7 N
32 6 N
33 30 A
34 31 A
35 6 N
36 6 N
37 6 N
38 5 N
39 22 A
40 22 A
41 4 N
42 5 N
43 4 N
44 5 N
45 30 A
46 39 A
47 18 N
48 18 N
49 19 N
50 18 N
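For reference, here is a minimal sketch of the three rules above as a simple loop. The threshold of 10 for what counts as a "large" jump is an assumption; with this data it reproduces the table above, but tune it to your readings.

classify <- function(x, threshold = 10) {
  name <- character(length(x))
  name[1] <- "N"  # the first value has no predecessor, so treat it as normal
  for (i in 2:length(x)) {
    jump <- abs(x[i] - x[i - 1])
    if (name[i - 1] == "N") {
      # Rules 1 and 2: a large jump from a normal value starts an anomaly
      name[i] <- if (jump > threshold) "A" else "N"
    } else {
      # Rule 3: stay "A" while close to the previous anomalous value,
      # and return to "N" on another large jump
      name[i] <- if (jump > threshold) "N" else "A"
    }
  }
  name
}
DF$Name <- classify(DF$Value)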
You need to distinguish anomalies from mixtures of different distributions. This is usually NOT a statistical question but rather something that comes from domain-specific knowledge. If you plot the density estimates from your data you get:
png(); plot(density(DF$Value)); dev.off()
So how are we supposed to know that the two values below zero are not real? They are 4% of your sample, so applying a rule like "anomalies == items outside the 99% confidence interval" would not define them as "anomalies". Are these activity measurements of some sort where the instrument should have given a positive value? The much larger bump peaking at 20 is surely not an anomaly by any reasonable definition.
You should do some searching on the topic of statistical process control. There are R packages with SPC-oriented functions in them.
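For instance, a hedged sketch with the qcc package (one of several SPC packages on CRAN; the package choice here is an assumption, not a specific endorsement):

# install.packages("qcc")  # assumed package; other SPC packages would also work
library(qcc)
q <- qcc(DF$Value, type = "xbar.one")  # Shewhart chart for individual readings
q$violations  # points beyond the control limits or in violating runs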
Related
I am currently working on a programming puzzle that sounds straightforward, but is apparently pretty difficult to do efficiently in R without a for loop over a column with 100k+ rows in a data frame. I have been trying to apply dplyr (particularly group_by and mutate), data.table, and the apply family, but it's quite tough. Could anyone give some help?
The problem is as follows: given a data frame df with a column key ("string" data type) and columns x, y, z (all "numeric" data type), where values in the key column may repeat. For all rows sharing the same value in the key column, we determine whether their value in column x is smaller than the sum of their elements in column y (for example, with key = aa_bb_1 there are 6 rows, and all of these rows always carry the same value in column x; please see the Sample Output to see how the rule works). If it is, keep that value in column x while distributing it across the elements of column y, in decreasing order of the corresponding values in column z. How do we do this efficiently, given that we need to go through every distinct element of column key?
Sample Input
df <- data.frame(key = c('aa_bb_1', 'aa_bb_0', 'ab_ca_0', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1',
'aa_bb_1', 'aa_bb_1', 'ab_ca_0', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1',
'aa_bb_0', 'aa_bb_1', 'ab_ca_0', 'abc_bbb_0', 'abbbc_aa_0', 'aaa_ccc_1',
'aa_bb_0', 'aa_bb_1', 'ab_ca_1', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1',
'aa_bb_1', 'aa_bb_0', 'ab_ca_0', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1'),
x = c(20, 19, 30, 25, 37, 13, 20, 20, 30, 25, 37, 13, 19, 20, 30, 43,
      71, 13, 19, 20, 10, 25, 37, 13, 20, 19, 30, 25, 37, 13),
y = c(3, 10, 18, 15, 32, 4, 12, 29, 71, 92, 11, 7, 21, 19, 13,
      26, 28, 11, 8, 8, 5, 23, 3, 12, 19, 7, 9, 11, 7, 12),
z = c(8, 13, 15, 16, 10, 10, 25, 21, 32, 15, 45, 8, 10, 50, 12, 10, 35,
      23, 10, 12, 2, 40, 45, 57, 66, 49, 100, 5, 11, 30))
key x y z
1 aa_bb_1 20 3 8
2 aa_bb_0 19 10 13
3 ab_ca_0 30 18 15
4 abc_bbb_1 25 15 16
5 abbbc_aa_1 37 32 10
6 aaa_ccc_1 13 4 10
7 aa_bb_1 20 12 25
8 aa_bb_1 20 29 21
9 ab_ca_0 30 71 32
10 abc_bbb_1 25 92 15
11 abbbc_aa_1 37 11 45
12 aaa_ccc_1 13 7 8
13 aa_bb_0 19 21 10
14 aa_bb_1 20 19 50
15 ab_ca_0 30 13 12
16 abc_bbb_0 43 26 10
17 abbbc_aa_0 71 28 35
18 aaa_ccc_1 13 11 23
19 aa_bb_0 19 8 10
20 aa_bb_1 20 8 12
21 ab_ca_1 10 5 2
22 abc_bbb_1 25 23 40
23 abbbc_aa_1 37 3 45
24 aaa_ccc_1 13 12 57
25 aa_bb_1 20 19 66
26 aa_bb_0 19 7 49
27 ab_ca_0 30 9 100
28 abc_bbb_1 25 11 5
29 abbbc_aa_1 37 7 11
30 aaa_ccc_1 13 12 30
Sample Output for aa_bb_1 and aa_bb_0
key x y z
1 aa_bb_1 20 0 8
2 aa_bb_0 19 10 13 -- Second largest value of z among rows with key aa_bb_0. Gets the second distribution, equal to min(10, 19-7) = min(10, 12) = 10.
7 aa_bb_1 20 0 25
8 aa_bb_1 20 0 21 -- Nothing left to be distributed => 0 in column y.
13 aa_bb_0 19 0 10 -- Nothing left, so distribute 0.
14 aa_bb_1 20 1 50 -- Second largest value of z among rows with key aa_bb_1. So distribute min(19, 20-19) = 1 to column y.
19 aa_bb_0 19 2 10 -- Tie as third largest value of z among rows with key aa_bb_0. Pick *randomly* for now (in reality, I would have another column to decide which row gets distributed first). Since min(8, 19-7-10) = min(8, 2) = 2, only 2 is distributed.
20 aa_bb_1 20 0 12
25 aa_bb_1 20 19 66 -- Largest value of z among rows with key aa_bb_1. Gets the first distribution = min(20, 19) = 19.
26 aa_bb_0 19 7 49 -- Largest value of z among rows with key aa_bb_0. Gets the first distribution, equal to min(7, 19) = 7.
Caveat: only perform the above operations if the sum of all the elements in column y with the same key is greater than the value in column x for that key. An example is aa_bb_1, where x = 20 < 3+19+8+19.
This is pretty much something you can do with a for loop.
Here I apply a function to the data.frame split by key, that function being a for loop. Then I assign the output to the ordered df, because the output of the loop over the split data frame is ordered by key.
df <- dplyr::arrange(df, key, desc(z))
df$y <- lapply(split(df, df$key), \(x) {
  ndf <- x
  base <- min(ndf$x)
  ## output values for y; defaults to the original y if nothing is distributed
  yout <- as.list(x$y)
  if (sum(x$y) > min(x$x)) {
    for (i in seq(nrow(x))) {
      ## the row with the largest remaining z gets the next distribution
      maxz <- which.max(ndf$z)
      ## distribute no more than what is left of x
      minv <- min(base, ndf$y[maxz])
      yout[[i]] <- minv
      ## shrink the remaining budget and drop the processed row
      base <- base - minv
      ndf <- ndf[-maxz, ]
    }
  }
  return(yout)
}) |> unlist()
key x y z
1 aa_bb_0 19 7 49
2 aa_bb_0 19 10 13
3 aa_bb_0 19 2 10
4 aa_bb_0 19 0 10
5 aa_bb_1 20 19 66
6 aa_bb_1 20 1 50
7 aa_bb_1 20 0 25
8 aa_bb_1 20 0 21
9 aa_bb_1 20 0 12
10 aa_bb_1 20 0 8
11 aaa_ccc_1 13 12 57
12 aaa_ccc_1 13 1 30
13 aaa_ccc_1 13 0 23
14 aaa_ccc_1 13 0 10
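If you do want to avoid the explicit loop, the greedy distribution within each group can also be written with a cumulative sum. Here is a hedged dplyr sketch of the same logic (my rewrite, not the answer's method; ties in z are broken by row order, which the question says is acceptable):

library(dplyr)
out <- df |>
  arrange(key, desc(z)) |>   # largest z first within each key
  group_by(key) |>
  # each row takes min(y, whatever of x is still undistributed);
  # cumsum(y) - y is the amount consumed by the rows before it
  mutate(y = if (sum(y) > first(x))
               pmin(y, pmax(first(x) - cumsum(y) + y, 0))
             else y) |>
  ungroup()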
I have a data matrix of order 2000 x 20, and I want to select a specific sequence of entire rows from the matrix, for example the 1st, 7th, 8th, 14th, 15th, 21st, 22nd rows, and so on in that pattern up to the 2000th row.
[1, 7, 8, 14, 15, 21, 22, ...]
Selecting such a sequence manually is very difficult; is there an alternative way to do this task in R? Would a for loop help in solving such a problem?
Using the updated question data, something like:
cumsum(rep(c(1,6), 2000/7))
# [1] 1 7 8 14 15
# ...
#[566] 1981 1982 1988 1989 1995
Since your pattern is +1/+6 up until 2000, you can repeat the two values c(1, 6) as many times as sum(c(1, 6)) goes into 2000 (rep truncates the non-integer 2000/7 for you), and then take a cumulative sum.
Try this
mat[c(1, sort(c(k <- seq(7, 2000, by = 7), k + 1))), ]
You can define your sequence first, using e.g. sequence, and then subset using [].
n = 2000 %/% 7
s = sequence(nvec = c(1, rep(2, n)), from = c(1, 7 * 1:n))
# s
# [1] 1 7 8 14 15 21 22 28 29 35 36 42 43 49 50 56 57 63 64 70 71 ...
yourMatrix[s, ]
sequence creates a sequence of sequences, with lengths given by nvec and starting points given by from.
I have the following vector:
v <- c(37, 15, 30, 37, 4, 11, 35, 37)
I want to extract the intervals between repeats of a particular number. An interval starts and ends with the same number, one that appears more than once in the vector.
For example, in this case the number is 37, and the intervals are 15, 30 and 4, 11, 35 (and the full span 15, 30, 37, 4, 11, 35).
Can this example be converted to a matrix?
After finding the boundary value using table, use split and cumsum:
names(table(v)[table(v)>2])
[1] "37"
split(v[v!=37],cumsum(v==37)[v!=37])
$`1`
[1] 15 30
$`2`
[1] 4 11 35
I am working with R version i386 3.1.1 and RStudio 0.99.442.
I have large datasets of tree species that I've collected from 7 plots, each of which is divided into 5 subplots (i.e. 35 distinct subplots). I am trying to get R to run through my dataset and print the species that are present within each plot.
I thought I could use "aggregate" to apply the "levels" function to the Species data column and have it return the species present for each Plot and Subplot; however, it returns the levels of the entire factor (12 species in total) rather than the 3 or 4 species that are actually present in the subplot.
To provide a reproducible example of what I'm trying to do, we can use the "warpbreaks" dataset that comes with R.
I convert the 'breaks' variable in warpbreaks to a factor variable to recreate the problem; it thus plays the role of my 'species' variable, whereas 'warpbreaks$wool' would represent 'plot' and 'warpbreaks$tension' would represent 'subplot'.
require(stats)
warpbreaks$breaks = as.factor(warpbreaks$breaks)
aggregate(breaks ~ wool + tension, data = warpbreaks, FUN="levels")
If we look at the warpbreaks data, then for "Plot" A (wool) and "Subplot" L (tension) - the desired script would print the species "26, 30, 54, 25, etc."
breaks wool tension
1 26 A L
2 30 A L
3 54 A L
4 25 A L
5 70 A L
6 52 A L
7 51 A L
8 26 A L
9 67 A L
10 18 A M
11 21 A M
12 29 A M
...
Instead, R returns something of this sort, where it is printing ALL of the levels of the factor variable for ALL of the plots:
wool tension breaks.1 breaks.2 breaks.3 breaks.4 breaks.5 breaks...
1 A L 10 12 13 14 15 ...
2 B L 10 12 13 14 15 ...
3 A M 10 12 13 14 15 ...
4 B M 10 12 13 14 15 ...
5 A H 10 12 13 14 15 ...
6 B H 10 12 13 14 15 ...
How do I get it to print only the factor levels that are present within that Plot/Subplot combination? Am I totally off in my use of "aggregate"? I'd imagine this is a relatively easy task for an experienced R user...
First time stackoverflow post - would appreciate any help or nudges towards the right code!
Many kind thanks.
Try FUN=unique rather than FUN=levels. levels will return every level of the factor, as you have surmised already; unique(...) returns only the values actually present in each group.
y <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN=unique)
wool tension breaks
1 A L 26, 30, 54, 25, 70, 52, 51, 67
2 B L 27, 14, 29, 19, 31, 41, 20, 44
3 A M 18, 21, 29, 17, 12, 35, 30, 36
4 B M 42, 26, 19, 16, 39, 28, 21, 29
5 A H 36, 21, 24, 18, 10, 43, 28, 15, 26
6 B H 20, 21, 24, 17, 13, 15, 16, 28
NOTE the breaks column is a little weird: in each row of that column, instead of one value (which is what a data frame cell normally holds), you have a vector of values. I.e. each cell of that breaks column is NOT a string; it's a vector!
> class(y$wool)
[1] "factor"
> class(y$breaks) # list !!
[1] "list"
> y$breaks[[1]] # first row in breaks
[1] 26 30 54 25 70 52 51 67
Levels: 10 12 13 14 15 16 17 18 19 20 21 24 25 26 27 28 29 30 31 35 36 39 41 42 43 44 51 52 54 67 70
Note that to access the first element of the breaks column, instead of doing y$breaks[1] (like you would with the wool or tension column) you need to do y$breaks[[1]] because of this.
Data frames are not really meant to work like this; a single cell in a dataframe is supposed to have a single value, and most functions will expect a dataframe to conform to this, so just keep this in mind when doing future processing.
If you wanted to convert to a string, use (e.g.) FUN=function (x) paste(unique(x), collapse=', '); then y$breaks will be a column of strings and behaves as normal.
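For example, a quick sketch of that string-valued variant:

## same aggregation, but collapsing each group's unique values to one string
y2 <- aggregate(breaks ~ wool + tension, data = warpbreaks,
                FUN = function(x) paste(unique(x), collapse = ", "))
y2$breaks[1]  # "26, 30, 54, 25, 70, 52, 51, 67" -- an ordinary character value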
I have two vectors of uneven lengths.
> starts
[1] 1 4 7 11 13 15 18 20 37 41 53 61
> ends
[1] 3 6 10 17 19 35 52 60 63
Each corresponding pair in starts and ends is supposed to form a boundary, e.g. (1, 3) for the first, (4, 6) for the second, etc. However, you will notice that starts has 12 elements while ends has just 9. What happens is that, due to some anomaly, there may be consecutive starts, e.g. the 4th to 6th elements of starts (11, 13, 15) are all smaller than the 4th element of ends (17).
Edit: please note also that the next start is not always 1 higher than the preceding end; the sample above was edited to reflect this, i.e. after the end 35, the next start is 37.
My question is: how do I find all these extraneous unpaired starts? My aim is to lengthen ends to the same length as starts, pairing every extraneous start with a corresponding NA in ends. The actual vector lengths are in the thousands, with mismatches in the hundreds. I can imagine a nested for loop to address this, but am wondering whether there is a more efficient solution.
Edit: the expected result would be (starts unchanged, displayed for comparison):
> starts
[1] 1 4 7 11 13 15 18 20 37 41 53 61
> ends
[1] 3 6 10 NA NA 17 19 35 NA 52 60 63
or equivalent, not particular about format.
> starts = c(1, 4, 7, 11, 15, 19, 23, 27)
> ends = c(3, 5, 14, 22, 25)
> e = ends[findInterval(starts, ends)+1]
> e
[1] 3 5 14 14 22 22 25 NA
> e[duplicated(e, fromLast=T)]=NA
> e
[1] 3 5 NA 14 NA 22 25 NA
findInterval seems to work
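Applied to the question's own vectors (assuming both are sorted and only ends has the gaps), the same idiom reproduces the expected result:

starts <- c(1, 4, 7, 11, 13, 15, 18, 20, 37, 41, 53, 61)
ends   <- c(3, 6, 10, 17, 19, 35, 52, 60, 63)
e <- ends[findInterval(starts, ends) + 1]  # first end after each start
e[duplicated(e, fromLast = TRUE)] <- NA    # repeats mark the unpaired starts
e
# [1]  3  6 10 NA NA 17 19 35 NA 52 60 63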
Assuming both starts and ends are sorted and that it's only in ends where the values are missing, you might be able to do something as straightforward as:
ends[c(match(starts, ends + 1)[-1], length(ends))]
# [1]  3  6 10 NA NA 17 19 NA NA 52 60 63
(Note that the edited sample breaks the end + 1 assumption at the start 37, so its true partner 35 also comes back as NA here.)