Removing outliers from repeated dataframe in R

I would like to remove outliers (that is, remove the rows containing outliers) from each group (by each BRMA_Name) of a data frame. My example data is as follows:
BRMA_No BRMA_Name Price
1 A 5
1 A 6
1 A 100
1 A 90
2 B 50
2 B 60
2 B 40
2 B 400
2 B 4
3 C 4
3 C 2
I looked around but could not find an answer (sorry). Could anyone shed some light on it?
Kind regards
Lutfor

You could try this:
#outlier based on IQR - returns TRUE or FALSE based on the outlier condition
outlier <- function(x) {
ifelse(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x),
TRUE,
FALSE)
}
library(data.table)
#apply the function per group
setDT(df)[, out := outlier(Price), by = 'BRMA_Name']
df
# BRMA_No BRMA_Name Price out
# 1: 1 A 5 FALSE
# 2: 1 A 6 FALSE
# 3: 1 A 100 FALSE
# 4: 1 A 90 FALSE
# 5: 2 B 50 FALSE
# 6: 2 B 60 FALSE
# 7: 2 B 40 FALSE
# 8: 2 B 400 TRUE
# 9: 2 B 4 TRUE
#10: 3 C 4 FALSE
#11: 3 C 2 FALSE
Then just select the rows where out is FALSE (e.g. df[out == FALSE]).
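For readers without data.table, the same per-group rule can be sketched in base R. This is only an illustration: the data frame is rebuilt from the question's example, and `is_out` and `cleaned` are names chosen here, not part of the original answer.

```r
# Example data copied from the question
df <- data.frame(
  BRMA_No   = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3),
  BRMA_Name = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C"),
  Price     = c(5, 6, 100, 90, 50, 60, 40, 400, 4, 4, 2)
)

# Same IQR rule as above; the comparison already yields TRUE/FALSE
outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# ave() applies the function within each BRMA_Name group and returns a vector
# aligned with the rows. Because Price is numeric, the logical result is
# stored as 0/1, hence the comparison with 1.
is_out <- ave(df$Price, df$BRMA_Name, FUN = outlier) == 1

# Keep only the rows that are not flagged
cleaned <- df[!is_out, ]
```

With the example data this drops the two group B rows with prices 400 and 4 and keeps the other nine rows.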

Here's an option using boxplot to determine the outliers:
library(data.table)
setDT(mydf)[, rm := !Price %in% boxplot(Price, plot = FALSE)$out, BRMA_Name][(rm)]
# BRMA_No BRMA_Name Price rm
# 1: 1 A 5 TRUE
# 2: 1 A 6 TRUE
# 3: 1 A 100 TRUE
# 4: 1 A 90 TRUE
# 5: 2 B 50 TRUE
# 6: 2 B 60 TRUE
# 7: 2 B 40 TRUE
# 8: 3 C 4 TRUE
# 9: 3 C 2 TRUE
I suppose the more appropriate approach would be:
setDT(mydf)[, rm := !Price %in% boxplot.stats(Price)$out, BRMA_Name][(rm)]
From the help page for boxplot.stats, the function's default for the coef argument is 1.5. If you wanted to change your outlier detection rule, you can change that value.
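As a small sketch of what changing coef does, here are the group B prices from the question run through boxplot.stats with the default and with a wider fence. The vector name `price_b` is made up for the illustration.

```r
# Prices of group B from the question
price_b <- c(50, 60, 40, 400, 4)

# Default rule: points more than 1.5 * IQR beyond the hinges are outliers
boxplot.stats(price_b)$out
# 400 4

# A more lenient rule: only points more than 3 * IQR beyond the hinges
boxplot.stats(price_b, coef = 3)$out
# 400
```

With coef = 3 the value 4 is no longer treated as an outlier, so it would be kept by the filter above.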

Define the wrapper:
TukeyRangeFilter <- function(x) {
normrange <- quantile(x, c(0.25, 0.75)) + c(-1.5, 1.5) * IQR(x)
findInterval(x, normrange)==1
}
Then loop across the elements of BRMA using by:
by(df, df$BRMA_Name, function(x) x[TukeyRangeFilter(x$Price), ])
Concatenate with do.call(rbind, <output>).
BRMA_No BRMA_Name Price
A.1 1 A 5
A.2 1 A 6
A.3 1 A 100
A.4 1 A 90
B.5 2 B 50
B.6 2 B 60
B.7 2 B 40
C.10 3 C 4
C.11 3 C 2
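Put together, the three steps above might look like the following sketch. The data frame is rebuilt from the question's example; `pieces` and `filtered` are names chosen here for illustration.

```r
df <- data.frame(
  BRMA_No   = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3),
  BRMA_Name = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C"),
  Price     = c(5, 6, 100, 90, 50, 60, 40, 400, 4, 4, 2)
)

# TRUE for values inside the Tukey fences, FALSE for outliers
TukeyRangeFilter <- function(x) {
  normrange <- quantile(x, c(0.25, 0.75)) + c(-1.5, 1.5) * IQR(x)
  findInterval(x, normrange) == 1
}

# One filtered data frame per BRMA_Name
pieces <- by(df, df$BRMA_Name, function(x) x[TukeyRangeFilter(x$Price), ])

# Stack the pieces back into a single data frame
filtered <- do.call(rbind, pieces)
```

The row names of `filtered` combine the group label and the original row number, which is where labels such as A.1 and C.10 in the output come from.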

Related

add grouping number based on consecutive TRUE/FALSE column values

Given this data frame:
library(dplyr)
dat <- data.frame(
bar = c(letters[1:10]),
foo = c(1,2,3,5,8,9,11,13,14,15)
)
bar foo
1 a 1
2 b 2
3 c 3
4 d 5
5 e 8
6 f 9
7 g 11
8 h 13
9 i 14
10 j 15
I first want to identify groups, if the foo number is consecutive:
dat <- dat %>% mutate(in_cluster =
ifelse( lead(foo) == foo +1 | lag(foo) == foo -1,
TRUE,
FALSE))
Which leads to the following data frame:
bar foo in_cluster
1 a 1 TRUE
2 b 2 TRUE
3 c 3 TRUE
4 d 5 FALSE
5 e 8 TRUE
6 f 9 TRUE
7 g 11 FALSE
8 h 13 TRUE
9 i 14 TRUE
10 j 15 TRUE
As can be seen, the values 1,2,3 form a group, then value 5 is on its own and does not belong to a cluster, then values 8,9 form another cluster and so on.
I would like to add cluster numbers to these "groups".
Expected output:
bar foo in_cluster cluster_number
1 a 1 TRUE 1
2 b 2 TRUE 1
3 c 3 TRUE 1
4 d 5 FALSE NA
5 e 8 TRUE 2
6 f 9 TRUE 2
7 g 11 FALSE NA
8 h 13 TRUE 3
9 i 14 TRUE 3
10 j 15 TRUE 3
There is probably a better tidyverse approach for something like this. For example, group_indices could be used if in_cluster is defined through an arbitrary length case_when. However, we can also implement our own method to specifically deal with logical value run lengths, using the rle function.
solution 1 (R version >= 4.1, which introduced the native pipe |> and the \(x) lambda syntax)
lgl_indices <- function(var){
x <- rle(var)
cumsum(x[[2]]) |> (\(.){ .[which(!x[[2]], T)] <- NA ; .})() |> rep(x[[1]])
}
solution 2
lgl_indices <- function(var){
x <- rle(var)
y <- cumsum(x$values)
y[which(x$values == F)] <- NA
rep(y, x$lengths)
}
solution 3
lgl_indices <- function(var){
x <- rle(var)
l <- vector("list", length(x$lengths))
n <- 1L
for (i in seq_along(x[[1]])) {
if(!x$values[i]) grp <- NA else {
grp <- n
n <- n + 1L
}
l[[i]] <- rep(grp, x$lengths[i])
}
Reduce(c, l)
}
dat %>%
mutate(cluster_number = lgl_indices(in_cluster))
bar foo in_cluster cluster_number
1 a 1 TRUE 1
2 b 2 TRUE 1
3 c 3 TRUE 1
4 d 5 FALSE NA
5 e 8 TRUE 2
6 f 9 TRUE 2
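To see why the rle-based functions work, it may help to look at the intermediate values for the in_cluster column of the example. This is just solution 2 unrolled step by step; the variable names `runs` and `ids` are chosen here.

```r
in_cluster <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

# Run-length encoding: one entry per run of identical values
runs <- rle(in_cluster)
runs$lengths  # 3 1 2 1 3
runs$values   # TRUE FALSE TRUE FALSE TRUE

# Number the TRUE runs: the cumulative sum increases by one at each TRUE run
ids <- cumsum(runs$values)   # 1 1 2 2 3

# FALSE runs do not belong to any cluster
ids[!runs$values] <- NA      # 1 NA 2 NA 3

# Expand back to one value per row
cluster_number <- rep(ids, runs$lengths)
cluster_number
# 1 1 1 NA 2 2 NA 3 3 3
```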
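This may not be the most efficient way, but it works for this data. Note that the numbering is consecutive only if the column starts with TRUE and every run of FALSE values has length one, as is the case here:
# Cumulative sum of the negated logical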
dat$new_cluster <- cumsum(!dat$in_cluster)+1
# using the in_cluster to subset and replacing the cluster number for FALSE by NA
dat[!dat$in_cluster,]$new_cluster <- NA
dat
bar foo in_cluster new_cluster
1 a 1 TRUE 1
2 b 2 TRUE 1
3 c 3 TRUE 1
4 d 5 FALSE NA
5 e 8 TRUE 2
6 f 9 TRUE 2
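If the column can start with FALSE or contain several FALSE values in a row, counting the FALSE values no longer gives consecutive cluster numbers. One possible variant, sketched here in base R with a made-up test vector, counts the starts of TRUE runs instead:

```r
# Hypothetical input with a leading FALSE and two consecutive FALSE values
in_cluster <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)

# A cluster starts where the value is TRUE and the previous value is not
previous <- c(FALSE, head(in_cluster, -1))
starts <- in_cluster & !previous

# Each start increases the cluster counter by one
cluster_number <- cumsum(starts)

# Rows outside a cluster get NA
cluster_number[!in_cluster] <- NA
cluster_number
# NA 1 1 NA NA 2
```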

How can I flag a row where a value appears in more than one group in R?

I would like to add a column that indicates whether a value appears in more than one group. Using the below example, value '4' appears in groups '1' and '2', so I would like to flag that value.
x = matrix(c(1,1,1,2,2,2,3,3,4,4,5,4), nrow = 6, ncol = 2, byrow = F)
x = data.frame(x)
x
# X1 X2
# 1 1 3
# 2 1 3
# 3 1 4
# 4 2 4
# 5 2 5
# 6 2 4
This would be the desired output:
# X1 X2 FLAG
# 1 1 3 False
# 2 1 3 False
# 3 1 4 True
# 4 2 4 True
# 5 2 5 False
# 6 2 4 True
We can create the flag by using n_distinct after grouping by 'X2'
library(dplyr)
x %>%
group_by(X2) %>%
mutate(FLAG = n_distinct(X1) > 1)
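To see what the grouped count looks like before it is turned into a flag, the number of distinct X1 values per X2 can be tabulated in base R. This is only a sketch of the same idea; `counts` and `flag` are names chosen here.

```r
x <- data.frame(X1 = c(1, 1, 1, 2, 2, 2), X2 = c(3, 3, 4, 4, 5, 4))

# Number of distinct groups (X1) in which each value (X2) occurs
counts <- tapply(x$X1, x$X2, function(v) length(unique(v)))
counts
# 3 4 5
# 1 2 1

# Look up the count for every row and flag values seen in more than one group
flag <- as.vector(counts[as.character(x$X2)]) > 1
flag
# FALSE FALSE TRUE TRUE FALSE TRUE
```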
Here is a base R option using ave
transform(
x,
FLAG = ave(X1, X2, FUN = function(v) length(unique(v))) > 1
)
or aggregate + subset
transform(
x,
FLAG = X2 %in% subset(aggregate(. ~ X2, x, function(x) length(unique(x))), X1 > 1)$X2
)
which gives
X1 X2 FLAG
1 1 3 FALSE
2 1 3 FALSE
3 1 4 TRUE
4 2 4 TRUE
5 2 5 FALSE
6 2 4 TRUE
For completeness, here is a data.table version:
library(data.table)
setDT(x)[, FLAG := uniqueN(X1) > 1, X2]
x
# X1 X2 FLAG
#1: 1 3 FALSE
#2: 1 3 FALSE
#3: 1 4 TRUE
#4: 2 4 TRUE
#5: 2 5 FALSE
#6: 2 4 TRUE

R: Compare across IDs within the same data frame

I have the following dataset:
df <- data.frame(c(1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5), c("a","a","a","b","b","b","b","b","b","b","b",
"a","a","a","b","b","b"),
c(300,295,295,25,25,25,25,25,20,20,20,300,295,295,300, 295,295),
c("c","d","e","f","g","h","i","j","l","m","n","o","p","q","r","s","t"))
colnames(df) <- c("ID", "Group", "Price", "OtherNumber")
> df
ID Group Price OtherNumber
1 1 a 300 c
2 1 a 295 d
3 1 a 295 e
4 2 b 25 f
5 2 b 25 g
6 2 b 25 h
7 2 b 25 i
8 3 b 25 j
9 3 b 20 l
10 3 b 20 m
11 3 b 20 n
12 4 a 300 o
13 4 a 295 p
14 4 a 295 q
15 5 b 300 r
16 5 b 295 s
17 5 b 295 t
I want to compare the first price of consecutive IDs and flag two consecutive IDs only if they have the same initial price and are in the same group. In case this is not clear, here is an example: comparing the first and second ID, both the group (a vs. b) and the initial price (300 vs. 25) differ. IDs 2 and 3, on the other hand, are both in group b and share an initial price of 25 (cf. rows 4 and 8). The later prices do not matter, as they may differ.
I figure, I must be able to work with the dplyr package and have determined a very rough solution (which does not yet work).
# Load dplyr
library(dplyr)
# Assign row numbers within IDs
df1 <- df %>%
group_by(ID) %>%
mutate(subID = row_number())
# Isolate first observation in ID
df2 <- df1[df1$subID == 1,]
# Set up loop to iterate through IDs
for (i in 2:length(df2)) {
if (df2$Price[i] - df2$Price[i - 1] == 0) {
df2$flag <- TRUE
} else {
df2$flag <- FALSE
}
}
If you tell me that this is the only possible solution, I will obviously devote more resources to it, but I am sure there must be an easier solution. I checked on SO and maybe I missed something, but I was not able to find anything going into this direction. Thanks!
The output I want to get looks something like this:
ID Group Price OtherNumber flag
1 1 a 300 c FALSE
2 1 a 295 d FALSE
3 1 a 295 e FALSE
4 2 b 25 f TRUE
5 2 b 25 g TRUE
6 2 b 25 h TRUE
7 2 b 25 i TRUE
8 3 b 25 j TRUE
9 3 b 20 l TRUE
10 3 b 20 m TRUE
11 3 b 20 n TRUE
12 4 a 300 o FALSE
13 4 a 295 p FALSE
14 4 a 295 q FALSE
15 5 b 300 r FALSE
16 5 b 295 s FALSE
17 5 b 295 t FALSE
Here is a data.table oneliner... cut into smaller pieces to view intermediate results; also see explanation at the bottom of the answer.
library(data.table)
dt <- as.data.table( df )
dt[ dt[ , .SD[1], ID][ ( Group == shift( Group, type = "lead") & Price == shift( Price, type = "lead") ) |
( Group == shift( Group, type = "lag") & Price == shift( Price, type = "lag") ),
flag := TRUE][is.na(flag), flag := FALSE], flag := i.flag, on = .(ID)][]
# ID Group Price OtherNumber flag
# 1: 1 a 300 c FALSE
# 2: 1 a 295 d FALSE
# 3: 1 a 295 e FALSE
# 4: 2 b 25 f TRUE
# 5: 2 b 25 g TRUE
# 6: 2 b 25 h TRUE
# 7: 2 b 25 i TRUE
# 8: 3 b 25 j TRUE
# 9: 3 b 20 l TRUE
# 10: 3 b 20 m TRUE
# 11: 3 b 20 n TRUE
# 12: 4 a 300 o FALSE
# 13: 4 a 295 p FALSE
# 14: 4 a 295 q FALSE
# 15: 5 b 300 r FALSE
# 16: 5 b 295 s FALSE
# 17: 5 b 295 t FALSE
explanation:
dt[ , .SD[1], ID] creates a data.table with the first row of each ID.
[ Group == shift( ... , flag := TRUE] sets the column flag to TRUE when the next (or previous) row has a matching Price and Group.
[is.na(flag), flag := FALSE] fills in the remaining rows (those not set to TRUE) with FALSE.
..flag := i.flag, on = .(ID)] performs a left join (by reference, so it's fast and efficient) on the original data.table, to get the final result.
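The same three steps (take the first row per ID, compare it with its neighbours, map the result back to all rows) can also be sketched in base R. This is an illustration rather than a translation of the data.table code; it assumes the rows of each ID are contiguous, and `first`, `same_next` and `same_prev` are names chosen here. The OtherNumber column is left out because it plays no role in the comparison.

```r
df <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  Group = c("a", "a", "a", "b", "b", "b", "b", "b", "b", "b", "b",
            "a", "a", "a", "b", "b", "b"),
  Price = c(300, 295, 295, 25, 25, 25, 25, 25, 20, 20, 20,
            300, 295, 295, 300, 295, 295)
)

# Step 1: first row of each ID
first <- df[!duplicated(df$ID), ]
n <- nrow(first)

# Step 2: does an ID match the next ID (or the previous one) on Group and Price?
same_next <- c(first$Group[-1] == first$Group[-n] &
               first$Price[-1] == first$Price[-n], FALSE)
same_prev <- c(FALSE, same_next[-n])
first$flag <- same_next | same_prev

# Step 3: copy the per-ID flag back to every row of that ID
df$flag <- first$flag[match(df$ID, first$ID)]
```

For the example data only IDs 2 and 3 are flagged, that is rows 4 to 11.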

Replace NA value with next or previous non-NA value conditional on other column

Below is an example data set similar to what I'm working with.
df<-data.frame(Loc=c(rev(seq(-4,5,1)),seq(-4,5,1)),
Reg=c("A",rep(NA,8),"B",rep(NA,9),"C"))
In this example we have a string of values ranging from + to - values or vice versa (Loc). What I am trying to accomplish is to fill these NA values, where B is always associated with negative values of Loc; positive values, however, take the value A if the NAs lie between A and B, or C if the NAs lie between B and C.
The desired output should look like the following
df2<-data.frame(Loc=c(rev(seq(-4,5,1)),seq(-4,5,1)),
Reg=c(rep("A",6),rep("B",8),rep("C",6)))
I have looked into na.locf from the zoo package, but I'm not sure how to control the direction in which the function looks for the non-NA value to get the desired output.
df$Reg2<-ifelse(df$Loc<=0,df$Reg2<-"B",na.locf(df$Reg,fromLast = F))
The above code is only returning the right response for some of the rows depending on the direction (i.e. fromLast = T or F)
Any help on this would be much appreciated.
Use ave splitting by a grouping variable generated from rleid of the sign. Then omit the NAs leaving the single non-NA in each group which ave will copy for all values in that group.
library(data.table)
transform(df, Reg = ave(Reg, rleid(Loc >= 0), FUN = na.omit))
giving:
Loc Reg
1 5 A
2 4 A
3 3 A
4 2 A
5 1 A
6 0 A
7 -1 B
8 -2 B
9 -3 B
10 -4 B
11 -4 B
12 -3 B
13 -2 B
14 -1 B
15 0 C
16 1 C
17 2 C
18 3 C
19 4 C
20 5 C
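If data.table is not available, the run id that rleid provides can be approximated in base R by counting sign changes. The following is a sketch under that assumption; `nonneg` and `run_id` are names chosen here, and stringsAsFactors = FALSE is set explicitly so that Reg is a character column on older R versions as well.

```r
df <- data.frame(Loc = c(rev(seq(-4, 5, 1)), seq(-4, 5, 1)),
                 Reg = c("A", rep(NA, 8), "B", rep(NA, 9), "C"),
                 stringsAsFactors = FALSE)

# TRUE for non-negative Loc values
nonneg <- df$Loc >= 0

# A new run starts at the first row and wherever the sign indicator changes
run_id <- cumsum(c(TRUE, nonneg[-1] != nonneg[-length(nonneg)]))

# Within each run, spread the single non-NA label over all rows
df$Reg <- ave(df$Reg, run_id, FUN = function(v) v[!is.na(v)][1])
```

This relies on every run containing at least one non-NA label, as in the example data.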
Here is a data.table solution which reproduces OP's expected answer:
library(data.table)
result <- as.data.table(df)[, Reg := first(Reg[!is.na(Reg)]), by = rleid(Loc >= 0)][]
result
Loc Reg
1: 5 A
2: 4 A
3: 3 A
4: 2 A
5: 1 A
6: 0 A
7: -1 B
8: -2 B
9: -3 B
10: -4 B
11: -4 B
12: -3 B
13: -2 B
14: -1 B
15: 0 C
16: 1 C
17: 2 C
18: 3 C
19: 4 C
20: 5 C
identical(as.data.frame(result), df2)
[1] TRUE
Note that this approach is similar to G. Grothendieck's base R solution in that it uses rleid(Loc >= 0) to group the data, but instead of calling transform() and ave() it updates Reg by reference, i.e., without copying the whole object.
Here is a quick solution with dplyr:
library(dplyr)
df<-data.frame(Loc=c(rev(seq(-4,5,1)),seq(-4,5,1)),
Reg=c("A",rep(NA,8),"B",rep(NA,9),"C"))
c <- match("C",df$Reg)
a <- match("A",df$Reg)
df2 <- df %>%
mutate(newReg=case_when(Loc < 0 ~ "B",
Loc >= 0 & abs(row_number()-c)<abs(row_number()-a)~ "C",
TRUE ~ "A"))
Note: This is hideous and I am doubtful this is reproducible for more use cases... this is probably better suited for some type of dplyr::case_when function, but I just couldn't think it through at this point.
lapply(2:nrow(df), function(i){
this_row <- df[i, ]
last_row <- i - 1
if(is.na(this_row[['Reg']])){
if(this_row[['Loc']] < 0){
df[i, 'Reg'] <<- "B"
}else if(df[i - 1, 'Reg'] == "A"){
df[i, 'Reg'] <<- "A"
}else {
df[i, "Reg"] <<- "C"
}
}
})
> df
Loc Reg
1 5 A
2 4 A
3 3 A
4 2 A
5 1 A
6 0 A
7 -1 B
8 -2 B
9 -3 B
10 -4 B
11 -4 B
12 -3 B
13 -2 B
14 -1 B
15 0 C
16 1 C
17 2 C
18 3 C
19 4 C
20 5 C

get rows of unique values by group

I have a data.table and want to pick those lines of the data.table where some values of a variable x are unique relative to another variable y
It's possible to get the unique values of x, grouped by y in a separate dataset, like this
dt[,unique(x),by=y]
But I want to pick the rows in the original dataset where this is the case. I don't want a new data.table because I also need the other variables.
So, what do I have to add to my code to get the rows in dt for which the above is true?
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
y x z
1: a 1 1
2: a 2 2
3: a 2 3
4: b 3 4
5: b 2 5
6: b 1 6
What I want:
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
The idiomatic data.table way is:
require(data.table)
unique(dt, by = c("y", "x"))
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 3 4
# 4: b 2 5
# 5: b 1 6
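data.table is a bit different in how to use duplicated. Here's the approach I've seen around here somewhere before. Since data.table 1.9.8, duplicated considers all columns by default rather than the key, so the key is passed explicitly through by:
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
setkey(dt, "y", "x")
key(dt)
# [1] "y" "x"
!duplicated(dt, by = key(dt))
# [1] TRUE TRUE FALSE TRUE TRUE TRUE
dt[!duplicated(dt, by = key(dt))]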
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 1 6
# 4: b 2 5
# 5: b 3 4
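For a plain data frame the same idea can be sketched in base R by calling duplicated on just the columns that define a group. The name `d` is chosen here to avoid clashing with the data.table above.

```r
d <- data.frame(y = rep(letters[1:2], each = 3),
                x = c(1, 2, 2, 3, 2, 1),
                z = 1:6)

# duplicated() marks every repeat of a (y, x) combination after its first
# occurrence; negating it keeps the first row of each combination together
# with all other columns
first_rows <- d[!duplicated(d[c("y", "x")]), ]
first_rows
```

Here the third row (y = a, x = 2, z = 3) is dropped and the original row order is preserved.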
The simpler data.table solution is to grab the first element of each group
> dt[, head(.SD, 1), by=.(y, x)]
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
With dplyr:
library(dplyr)
col1 = c(1,1,3,3,5,6,7,8,9)
col2 = c("cust1", 'cust1', 'cust3', 'cust4', 'cust5', 'cust5', 'cust5', 'cust5', 'cust6')
df1 = data.frame(col1, col2)
df1
distinct(select(df1, col1, col2))
