R: Compare across IDs within the same data frame - r

I have the following dataset:
df <- data.frame(c(1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5), c("a","a","a","b","b","b","b","b","b","b","b",
"a","a","a","b","b","b"),
c(300,295,295,25,25,25,25,25,20,20,20,300,295,295,300, 295,295),
c("c","d","e","f","g","h","i","j","l","m","n","o","p","q","r","s","t"))
colnames(df) <- c("ID", "Group", "Price", "OtherNumber")
> df
ID Group Price OtherNumber
1 1 a 300 c
2 1 a 295 d
3 1 a 295 e
4 2 b 25 f
5 2 b 25 g
6 2 b 25 h
7 2 b 25 i
8 3 b 25 j
9 3 b 20 l
10 3 b 20 m
11 3 b 20 n
12 4 a 300 o
13 4 a 295 p
14 4 a 295 q
15 5 b 300 r
16 5 b 295 s
17 5 b 295 t
I want to compare the first price of consecutive IDs. I want to flag two consecutive IDs only if they have the same initial price and are in the same group. In case this is not clear, here is an example: comparing the first and second ID, both the group (a vs. b) and the initial price (300 vs. 25) are mismatched. IDs 2 and 3, on the other hand, are both in group b and have the same initial price of 25 (cf. rows 4 and 8). The later prices do not matter, as they may differ.
I figure I should be able to do this with the dplyr package, and I have put together a very rough solution (which does not yet work).
# Load dplyr
library(dplyr)
# Assign row numbers within IDs
df1 <- df %>%
group_by(ID) %>%
mutate(subID = row_number())
# Isolate first observation in ID
df2 <- df1[df1$subID == 1,]
# Set up loop to iterate through IDs
for (i in 2:length(df2)) {
if (df2$Price[i] - df2$Price[i - 1] == 0) {
df2$flag <- TRUE
} else {
df2$flag <- FALSE
}
}
If you tell me that this is the only possible solution, I will obviously devote more resources to it, but I am sure there must be an easier solution. I checked on SO and maybe I missed something, but I was not able to find anything going into this direction. Thanks!
The output I want to get looks something like this:
ID Group Price OtherNumber flag
1 1 a 300 c FALSE
2 1 a 295 d FALSE
3 1 a 295 e FALSE
4 2 b 25 f TRUE
5 2 b 25 g TRUE
6 2 b 25 h TRUE
7 2 b 25 i TRUE
8 3 b 25 j TRUE
9 3 b 20 l TRUE
10 3 b 20 m TRUE
11 3 b 20 n TRUE
12 4 a 300 o FALSE
13 4 a 295 p FALSE
14 4 a 295 q FALSE
15 5 b 300 r FALSE
16 5 b 295 s FALSE
17 5 b 295 t FALSE

Here is a data.table one-liner, cut into smaller pieces so you can view the intermediate results; see also the explanation at the bottom of the answer.
library(data.table)
dt <- as.data.table( df )
dt[ dt[ , .SD[1], ID][ ( Group == shift( Group, type = "lead") & Price == shift( Price, type = "lead") ) |
( Group == shift( Group, type = "lag") & Price == shift( Price, type = "lag") ),
flag := TRUE][is.na(flag), flag := FALSE], flag := i.flag, on = .(ID)][]
# ID Group Price OtherNumber flag
# 1: 1 a 300 c FALSE
# 2: 1 a 295 d FALSE
# 3: 1 a 295 e FALSE
# 4: 2 b 25 f TRUE
# 5: 2 b 25 g TRUE
# 6: 2 b 25 h TRUE
# 7: 2 b 25 i TRUE
# 8: 3 b 25 j TRUE
# 9: 3 b 20 l TRUE
# 10: 3 b 20 m TRUE
# 11: 3 b 20 n TRUE
# 12: 4 a 300 o FALSE
# 13: 4 a 295 p FALSE
# 14: 4 a 295 q FALSE
# 15: 5 b 300 r FALSE
# 16: 5 b 295 s FALSE
# 17: 5 b 295 t FALSE
Explanation:
dt[ , .SD[1], ID] creates a data.table with the first row of each ID.
[ Group == shift( ... , flag := TRUE] sets the column flag to TRUE when the next (or previous) row has a matching Price and Group.
[is.na(flag), flag := FALSE] fills in the rest (everything that is not TRUE) with FALSE.
..flag := i.flag, on = .(ID)] performs a left join (by reference, so it's fast and efficient) on the original data.table, to get the final result.
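For readers without data.table, the same steps can be sketched in base R. This is only an illustration, not the answer's method: it assumes the rows are already ordered by ID, and the helper names first_rows, same_as_next and same_as_prev are made up for this example.

```r
# Data from the question, with named columns
df <- data.frame(
  ID = c(1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5),
  Group = c("a","a","a","b","b","b","b","b","b","b","b","a","a","a","b","b","b"),
  Price = c(300,295,295,25,25,25,25,25,20,20,20,300,295,295,300,295,295),
  OtherNumber = c("c","d","e","f","g","h","i","j","l","m","n","o","p","q","r","s","t")
)

# First row of each ID (assumes df is sorted by ID)
first_rows <- df[!duplicated(df$ID), c("ID", "Group", "Price")]
n <- nrow(first_rows)

# Does this ID's first row match the first row of the next ID?
same_as_next <- c(
  first_rows$Group[-n] == first_rows$Group[-1] &
    first_rows$Price[-n] == first_rows$Price[-1],
  FALSE  # the last ID has no successor
)
# Shift by one position to get "matches the previous ID"
same_as_prev <- c(FALSE, same_as_next[-n])

first_rows$flag <- same_as_next | same_as_prev

# Copy the per-ID flag back onto every row of the original data
df$flag <- first_rows$flag[match(df$ID, first_rows$ID)]
```

With the example data, only IDs 2 and 3 end up flagged, which matches the desired output in the question.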

Related

add grouping number based on consecutive TRUE/FALSE column values

Given this data frame:
library(dplyr)
dat <- data.frame(
bar = c(letters[1:10]),
foo = c(1,2,3,5,8,9,11,13,14,15)
)
bar foo
1 a 1
2 b 2
3 c 3
4 d 5
5 e 8
6 f 9
7 g 11
8 h 13
9 i 14
10 j 15
I first want to identify groups, if the foo number is consecutive:
dat <- dat %>% mutate(in_cluster =
ifelse( lead(foo) == foo +1 | lag(foo) == foo -1,
TRUE,
FALSE))
Which leads to the following data frame:
bar foo in_cluster
1 a 1 TRUE
2 b 2 TRUE
3 c 3 TRUE
4 d 5 FALSE
5 e 8 TRUE
6 f 9 TRUE
7 g 11 FALSE
8 h 13 TRUE
9 i 14 TRUE
10 j 15 TRUE
As can be seen, the values 1,2,3 form a group, then value 5 is on its own and does not belong to a cluster, then values 8,9 form another cluster and so on.
I would like to add cluster numbers to these "groups".
Expected output:
bar foo in_cluster cluster_number
1 a 1 TRUE 1
2 b 2 TRUE 1
3 c 3 TRUE 1
4 d 5 FALSE NA
5 e 8 TRUE 2
6 f 9 TRUE 2
7 g 11 FALSE NA
8 h 13 TRUE 3
9 i 14 TRUE 3
10 j 15 TRUE 3
There is probably a better tidyverse approach for something like this. For example, group_indices could be used if in_cluster is defined through an arbitrary length case_when. However, we can also implement our own method to specifically deal with logical value run lengths, using the rle function.
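As a quick reminder of what rle returns, here is a small sketch using the in_cluster values from the question (typed in by hand, so no packages are needed). The variable name runs is just for illustration.

```r
in_cluster <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

# rle() collapses the vector into runs of identical values
runs <- rle(in_cluster)
runs$lengths  # how long each run is
runs$values   # the value repeated in each run

# inverse.rle() rebuilds the original vector from the runs
rebuilt <- inverse.rle(runs)
```

Every TRUE run is one cluster, so numbering the TRUE runs and repeating each number by its run length gives the cluster numbers. That is what the functions below do.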
solution 1 (R version >= 4.1, which introduced the native pipe |> and the \(x) lambda shorthand)
lgl_indices <- function(var){
x <- rle(var)
cumsum(x[[2]]) |> (\(.){ .[which(!x[[2]], T)] <- NA ; .})() |> rep(x[[1]])
}
solution 2
lgl_indices <- function(var){
x <- rle(var)
y <- cumsum(x$values)
y[which(x$values == F)] <- NA
rep(y, x$lengths)
}
solution 3
lgl_indices <- function(var){
x <- rle(var)
l <- vector("list", length(x$lengths))
n <- 1L
for (i in seq_along(x[[1]])) {
if(!x$values[i]) grp <- NA else {
grp <- n
n <- n + 1L
}
l[[i]] <- rep(grp, x$lengths[i])
}
Reduce(c, l)
}
dat %>%
mutate(cluster_number = lgl_indices(in_cluster))
bar foo in_cluster cluster_number
1 a 1 TRUE 1
2 b 2 TRUE 1
3 c 3 TRUE 1
4 d 5 FALSE NA
5 e 8 TRUE 2
6 f 9 TRUE 2
This may not be the most efficient way, but it works:
# Cumulative sum of the negated logical
dat$new_cluster <- cumsum(!dat$in_cluster)+1
# using the in_cluster to subset and replacing the cluster number for FALSE by NA
dat[!dat$in_cluster,]$new_cluster <- NA
dat
bar foo in_cluster new_cluster
1 a 1 TRUE 1
2 b 2 TRUE 1
3 c 3 TRUE 1
4 d 5 FALSE NA
5 e 8 TRUE 2
6 f 9 TRUE 2
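One caveat, as far as I can tell: the cumsum approach numbers the clusters 1, 2, 3 only when every FALSE stands alone between two clusters, as it does in the question's data. A minimal sketch with a made-up vector containing two consecutive FALSE values shows the difference from the rle-based approach (by_cumsum and by_rle are illustrative names):

```r
# Hypothetical input with two FALSE values in a row
in_cluster <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)

# cumsum-based numbering: every FALSE bumps the counter,
# so the second cluster is numbered 3 instead of 2
by_cumsum <- cumsum(!in_cluster) + 1
by_cumsum[!in_cluster] <- NA

# run-length based numbering: only TRUE runs advance the counter
runs <- rle(in_cluster)
ids <- cumsum(runs$values)
ids[!runs$values] <- NA
by_rle <- rep(ids, runs$lengths)
```

The cumsum version still gives a distinct number per cluster here, so it remains usable as a grouping key. The numbers just stop being consecutive.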

How to find values shared between groups in a data frame?

I have a tidy data.frame with two columns: exp and val. I want to find which values of val are shared among all different experiments.
df <- data.frame(exp = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'),
val = c(10, 20, 15, 10, 10, 15, 99, 2, 15, 20, 10, 4))
df
exp val
1 A 10
2 A 20
3 A 15
4 A 10
5 B 10
6 B 15
7 B 99
8 B 2
9 C 15
10 C 20
11 C 10
12 C 4
Expected result could be either a vector of values:
10, 15
or a column on the data frame telling if that value is shared:
exp val shared
<fct> <dbl> <lgl>
1 A 10 TRUE
2 A 20 FALSE
3 A 15 TRUE
4 A 10 TRUE
5 B 10 TRUE
6 B 15 TRUE
7 B 99 FALSE
8 B 2 FALSE
9 C 15 TRUE
10 C 20 FALSE
11 C 10 TRUE
12 C 4 FALSE
I was able to find an answer (see the self-answer below), but this seems like a common enough question that there must be a better way than the really hacky solution I came up with.
I tried to solve this problem in dplyr since that's what I'm familiar with, but I'm interested in any kind of solution.
Or you can group by val and then check whether the number of distinct exp for that val is equal to the data frame level number of distinct exp:
df %>%
group_by(val) %>%
mutate(shared = n_distinct(exp) == n_distinct(.$exp))
# notice the first exp refers to exp for each group while .$exp refers
# to the overall exp column in the data frame
# A tibble: 12 x 3
# Groups: val [6]
# exp val shared
# <fct> <dbl> <lgl>
# 1 A 10 TRUE
# 2 A 20 FALSE
# 3 A 15 TRUE
# 4 A 10 TRUE
# 5 B 10 TRUE
# 6 B 15 TRUE
# 7 B 99 FALSE
# 8 B 2 FALSE
# 9 C 15 TRUE
#10 C 20 FALSE
#11 C 10 TRUE
#12 C 4 FALSE
Using base R you can use table:
as.numeric(colnames(a<-table(df))[colSums(a>0)==nrow(a)])
[1] 10 15
you can also do:
df %>%
mutate(s = val %in% as.numeric(colnames(a<-table(df))[colSums(a>0)==nrow(a)]))
exp val s
1 A 10 TRUE
2 A 20 FALSE
3 A 15 TRUE
4 A 10 TRUE
5 B 10 TRUE
6 B 15 TRUE
7 B 99 FALSE
8 B 2 FALSE
9 C 15 TRUE
10 C 20 FALSE
11 C 10 TRUE
12 C 4 FALSE
Here is another base R solution:
x <- split(df$val, df$exp)
Reduce(intersect, x)
## [1] 10 15
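If the logical column from the question is wanted rather than the vector, the intersect result can be fed to %in%. A small sketch using the question's data (the names shared_vals and shared are my own):

```r
df <- data.frame(exp = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'),
                 val = c(10, 20, 15, 10, 10, 15, 99, 2, 15, 20, 10, 4))

# Values present in every experiment
shared_vals <- Reduce(intersect, split(df$val, df$exp))

# Flag each row whose value is one of the shared values
df$shared <- df$val %in% shared_vals
```

Because intersect drops duplicates, the repeated 10 in experiment A needs no special handling.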
We can go through the data.frame row by row and count up how many times that row's value is found in the vector df$val.
To deal with possible repeat values, we have to use group_by %>% distinct to remove repeated values of val within groups. But then to get just the values of val as a vector, we need to ungroup %>% select(val) %>% unlist, which just seems needlessly complicated.
Finally, we can check whether the number of groups the value is found in equals the total number of groups.
df %>%
rowwise() %>%
mutate(num_groups = sum(group_by(., exp) %>%
distinct(val) %>%
ungroup() %>%
select(val) %>%
unlist() %in% val),
shared = num_groups == length(unique(.$exp)))
# A tibble: 12 x 4
exp val num_groups shared
<fct> <dbl> <int> <lgl>
1 A 10 3 TRUE
2 A 20 2 FALSE
3 A 15 3 TRUE
4 A 10 3 TRUE
5 B 10 3 TRUE
6 B 15 3 TRUE
7 B 99 1 FALSE
8 B 2 1 FALSE
9 C 15 3 TRUE
10 C 20 2 FALSE
11 C 10 3 TRUE
12 C 4 1 FALSE

Replace NA value with next or previous non-NA value conditional on other column

Below is an example data set similar to what I'm working with.
df<-data.frame(Loc=c(rev(seq(-4,5,1)),seq(-4,5,1)),
Reg=c("A",rep(NA,8),"B",rep(NA,9),"C"))
In this example we have a string of values (Loc) running from positive to negative or vice versa. What I am trying to accomplish is to fill in the NA values of Reg: B is always associated with negative values of Loc, whereas positive values take A if the NAs are between A and B, or C if the NAs are between B and C.
The desired output should look like the following
df2<-data.frame(Loc=c(rev(seq(-4,5,1)),seq(-4,5,1)),
Reg=c(rep("A",6),rep("B",8),rep("C",6)))
I have looked into na.locf from the zoo package, but I'm not sure how to control the direction in which the function looks for the non-NA value to get the desired output.
df$Reg2<-ifelse(df$Loc<=0,df$Reg2<-"B",na.locf(df$Reg,fromLast = F))
The above code is only returning the right response for some of the rows depending on the direction (i.e. fromLast = T or F)
Any help on this would be much appreciated.
Use ave splitting by a grouping variable generated from rleid of the sign. Then omit the NAs leaving the single non-NA in each group which ave will copy for all values in that group.
library(data.table)
transform(df, Reg = ave(Reg, rleid(Loc >= 0), FUN = na.omit))
giving:
Loc Reg
1 5 A
2 4 A
3 3 A
4 2 A
5 1 A
6 0 A
7 -1 B
8 -2 B
9 -3 B
10 -4 B
11 -4 B
12 -3 B
13 -2 B
14 -1 B
15 0 C
16 1 C
17 2 C
18 3 C
19 4 C
20 5 C
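To see why the grouping works, it can help to build the run ids by hand. The sketch below uses base R's rle in place of data.table's rleid (which, as far as I know, produces the same ids for a single vector), and a small anonymous function in place of na.omit. The names grp and filled are illustrative only.

```r
Loc <- c(rev(seq(-4, 5, 1)), seq(-4, 5, 1))
Reg <- c("A", rep(NA, 8), "B", rep(NA, 9), "C")

# One id per run of consecutive rows with the same sign condition
grp <- with(rle(Loc >= 0), rep(seq_along(lengths), lengths))

# Each run holds exactly one non-NA label; ave() recycles it across the run
filled <- ave(Reg, grp, FUN = function(x) x[!is.na(x)][1])
```

This relies on every run containing at least one non-NA value. A run with none would simply stay NA.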
Here is a data.table solution which reproduces OP's expected answer:
library(data.table)
result <- as.data.table(df)[, Reg := first(Reg[!is.na(Reg)]), by = rleid(Loc >= 0)][]
result
Loc Reg
1: 5 A
2: 4 A
3: 3 A
4: 2 A
5: 1 A
6: 0 A
7: -1 B
8: -2 B
9: -3 B
10: -4 B
11: -4 B
12: -3 B
13: -2 B
14: -1 B
15: 0 C
16: 1 C
17: 2 C
18: 3 C
19: 4 C
20: 5 C
identical(as.data.frame(result), df2)
[1] TRUE
Note that this approach is similar to G. Grothendieck's ave() solution in that it uses rleid(Loc >= 0) to group the data, but instead of calling transform() and ave() it updates Reg by reference, i.e., without copying the whole object.
Here is a quick solution with dplyr:
library(dplyr)
df<-data.frame(Loc=c(rev(seq(-4,5,1)),seq(-4,5,1)),
Reg=c("A",rep(NA,8),"B",rep(NA,9),"C"))
# Row positions of the "C" and "A" labels (named idx_* to avoid masking base::c)
idx_c <- match("C",df$Reg)
idx_a <- match("A",df$Reg)
df2 <- df %>%
mutate(newReg=case_when(Loc < 0 ~ "B",
Loc >= 0 & abs(row_number()-idx_c)<abs(row_number()-idx_a)~ "C",
TRUE ~ "A"))
Note: This is hideous and I am doubtful this is reproducible for more use cases... this is probably better suited for some type of dplyr::case_when function, but I just couldn't think it through at this point.
lapply(2:nrow(df), function(i){
this_row <- df[i, ]
last_row <- i - 1
if(is.na(this_row[['Reg']])){
if(this_row[['Loc']] < 0){
df[i, 'Reg'] <<- "B"
}else if(df[i - 1, 'Reg'] == "A"){
df[i, 'Reg'] <<- "A"
}else {
df[i, "Reg"] <<- "C"
}
}
})
> df
Loc Reg
1 5 A
2 4 A
3 3 A
4 2 A
5 1 A
6 0 A
7 -1 B
8 -2 B
9 -3 B
10 -4 B
11 -4 B
12 -3 B
13 -2 B
14 -1 B
15 0 C
16 1 C
17 2 C
18 3 C
19 4 C
20 5 C

removing outliers from repeated dataframe in R

I would like to remove outliers (i.e., remove the rows containing them) from each group (by BRMA_Name) in a data frame. My example data is as follows:
BRMA_No BRMA_Name Price
1 A 5
1 A 6
1 A 100
1 A 90
2 B 50
2 B 60
2 B 40
2 B 400
2 B 4
3 C 4
3 C 2
I looked around but could not find an answer (sorry). Could anyone shed some light on it?
Kind regards
Lutfor
You could try this:
#outlier based on IQR - returns TRUE or FALSE based on the outlier condition
outlier <- function(x) {
ifelse(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x),
TRUE,
FALSE)
}
library(data.table)
#apply the function per group
setDT(df)[, out := outlier(Price), by = 'BRMA_Name']
df
# BRMA_No BRMA_Name Price out
# 1: 1 A 5 FALSE
# 2: 1 A 6 FALSE
# 3: 1 A 100 FALSE
# 4: 1 A 90 FALSE
# 5: 2 B 50 FALSE
# 6: 2 B 60 FALSE
# 7: 2 B 40 FALSE
# 8: 2 B 400 TRUE
# 9: 2 B 4 TRUE
#10: 3 C 4 FALSE
#11: 3 C 2 FALSE
Then just select the rows where out is FALSE (e.g. df[out == FALSE]).
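For completeness, here is a base R sketch of the whole flag-then-filter sequence on the question's data, using the same IQR rule. It swaps data.table for ave(), so treat it as an alternative rather than the answer's code. The names is_outlier, out and cleaned are made up for this example.

```r
df <- data.frame(
  BRMA_No   = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3),
  BRMA_Name = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C"),
  Price     = c(5, 6, 100, 90, 50, 60, 40, 400, 4, 4, 2)
)

# TRUE where a value lies more than 1.5 * IQR outside the quartiles
is_outlier <- function(x) {
  bounds <- unname(quantile(x, c(0.25, 0.75))) + c(-1.5, 1.5) * IQR(x)
  x < bounds[1] | x > bounds[2]
}

# ave() applies the rule within each BRMA_Name; it returns 0/1, hence as.logical()
out <- as.logical(ave(df$Price, df$BRMA_Name, FUN = is_outlier))

# Keep only the non-outlier rows
cleaned <- df[!out, ]
```

On this data only the prices 400 and 4 in group B are dropped.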
Here's an option using boxplot to determine the outliers:
library(data.table)
setDT(mydf)[, rm := !Price %in% boxplot(Price, plot = FALSE)$out, BRMA_Name][(rm)]
# BRMA_No BRMA_Name Price rm
# 1: 1 A 5 TRUE
# 2: 1 A 6 TRUE
# 3: 1 A 100 TRUE
# 4: 1 A 90 TRUE
# 5: 2 B 50 TRUE
# 6: 2 B 60 TRUE
# 7: 2 B 40 TRUE
# 8: 3 C 4 TRUE
# 9: 3 C 2 TRUE
I suppose the more appropriate approach would be:
setDT(mydf)[, rm := !Price %in% boxplot.stats(Price)$out, BRMA_Name][(rm)]
From the help page for boxplot.stats, the function's default for the coef argument is 1.5. If you wanted to change your outlier detection rule, you can change that value.
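A short sketch of what changing coef does, using the group B prices from the question (the value 3 is an arbitrary choice for illustration):

```r
price_b <- c(50, 60, 40, 400, 4)

# Default rule: points beyond 1.5 * IQR from the hinges are outliers
default_out <- boxplot.stats(price_b)$out

# A wider rule: only points beyond 3 * IQR from the hinges are outliers
wide_out <- boxplot.stats(price_b, coef = 3)$out
```

With the default, both 400 and 4 are reported. With coef = 3, the lower fence drops below 4, so only 400 remains.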
Define the wrapper:
TukeyRangeFilter <- function(x) {
normrange <- quantile(x, c(0.25, 0.75)) + c(-1.5, 1.5) * IQR(x)
findInterval(x, normrange)==1
}
Then loop across the groups of BRMA_Name using by:
by(df, df$BRMA_Name, function(x) x[TukeyRangeFilter(x$Price), ])
Concatenate with do.call(rbind, <output>).
BRMA_No BRMA_Name Price
A.1 1 A 5
A.2 1 A 6
A.3 1 A 100
A.4 1 A 90
B.5 2 B 50
B.6 2 B 60
B.7 2 B 40
C.10 3 C 4
C.11 3 C 2

Compare each row within each group

The following dataset is reproducible:
group <- c(1,1,2,2,3,3)
parameter <- c("A","B","A","B","A","B")
values <- c(10,20,20,5,30,50)
df <- data.frame(group,parameter,values)
group parameter values
1 A 10
1 B 20
2 A 20
2 B 5
3 A 30
3 B 50
I want to check within each group whether A > B (store this result in fourth column for entire group)
If yes -> TRUE, If no -> FALSE
New Df:
group parameter values status
1 A 10 FALSE
1 B 20 FALSE
2 A 20 TRUE
2 B 5 TRUE
3 A 30 FALSE
3 B 50 FALSE
Approach
with(df, ave(values,group, FUN = function(x) ))
I cannot work out what code should go inside the function. Can someone please help me?
Updated: Status should be ranked as per the values column (highest to lowest) per group
group parameter values status
1 A 10 2
1 B 20 1
2 A 20 1
2 B 5 2
3 A 30 2
3 B 50 1
We can try with data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'group', compare the 'values' where 'parameter' is 'A' with that of 'B' and assign (:=) to create 'status'
library(data.table)
setDT(df)[, status := values[parameter=="A"]>values[parameter=="B"], by = group]
df
# group parameter values status
#1: 1 A 10 FALSE
#2: 1 B 20 FALSE
#3: 2 A 20 TRUE
#4: 2 B 5 TRUE
#5: 3 A 30 FALSE
#6: 3 B 50 FALSE
and for the rank, use frank on the 'values' after grouping by 'group'.
setDT(df)[, status:= frank(-values), group]
df
# group parameter values status
#1: 1 A 10 2
#2: 1 B 20 1
#3: 2 A 20 1
#4: 2 B 5 2
#5: 3 A 30 2
#6: 3 B 50 1
Or with ave, we can compare the first value with the second one (assuming that 'parameter' is ordered and that there are only two elements per 'group').
df$status <- with(df, as.logical(ave(values, group, FUN = function(x) x[1] > x[2])))
Another option is to order the dataset by the first two columns (in case it is not ordered), then subset the 'values' by recycling a logical index, compare, and replicate each of the logical values twice.
df1 <- df[do.call(order, df[1:2]), ]
rep(df1$values[c(TRUE, FALSE)] > df1$values[c(FALSE, TRUE)], each = 2)
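To make the recycling step concrete, here is a small sketch on the question's values vector alone. It assumes, as the answer does, that rows alternate A, B within each group. The names a_vals, b_vals and status are illustrative.

```r
values <- c(10, 20, 20, 5, 30, 50)

# c(TRUE, FALSE) is recycled along the vector, picking positions 1, 3, 5 (the A rows)
a_vals <- values[c(TRUE, FALSE)]
# c(FALSE, TRUE) picks positions 2, 4, 6 (the B rows)
b_vals <- values[c(FALSE, TRUE)]

# One comparison per group, repeated so both rows of the group get it
status <- rep(a_vals > b_vals, each = 2)
```

If a group had more or fewer than two rows, the alternating index would drift out of step, so the data.table or dplyr versions are safer for irregular data.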
There is also the tidyverse solution using dplyr:
library(dplyr)
df %>%
group_by(group) %>%
mutate(status = values[parameter == "A"] > values[parameter == "B"],
rank = min_rank(-values))
Source: local data frame [6 x 5]
Groups: group [3]
group parameter values status rank
(dbl) (fctr) (dbl) (lgl) (int)
1 1 A 10 FALSE 2
2 1 B 20 FALSE 1
3 2 A 20 TRUE 1
4 2 B 5 TRUE 2
5 3 A 30 FALSE 2
6 3 B 50 FALSE 1
