I have a tidy data.frame with two columns: exp and val. I want to find which values of val are shared among all different experiments.
df <- data.frame(exp = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'),
val = c(10, 20, 15, 10, 10, 15, 99, 2, 15, 20, 10, 4))
df
exp val
1 A 10
2 A 20
3 A 15
4 A 10
5 B 10
6 B 15
7 B 99
8 B 2
9 C 15
10 C 20
11 C 10
12 C 4
Expected result could be either a vector of values:
10, 15
or a column on the data frame telling if that value is shared:
exp val shared
<fct> <dbl> <lgl>
1 A 10 TRUE
2 A 20 FALSE
3 A 15 TRUE
4 A 10 TRUE
5 B 10 TRUE
6 B 15 TRUE
7 B 99 FALSE
8 B 2 FALSE
9 C 15 TRUE
10 C 20 FALSE
11 C 10 TRUE
12 C 4 FALSE
I was able to find an answer (see the self-answer below), but this seems like a common enough question that there must be a better way than the really hacky solution I came up with.
I tried to solve this problem in dplyr since that's what I'm familiar with, but I'm interested in any kind of solution.
Or you can group by val and then check whether the number of distinct exp for that val is equal to the data frame level number of distinct exp:
df %>%
group_by(val) %>%
mutate(shared = n_distinct(exp) == n_distinct(.$exp))
# notice the first exp refers to exp for each group while .$exp refers
# to the overall exp column in the data frame
# A tibble: 12 x 3
# Groups: val [6]
# exp val shared
# <fct> <dbl> <lgl>
# 1 A 10 TRUE
# 2 A 20 FALSE
# 3 A 15 TRUE
# 4 A 10 TRUE
# 5 B 10 TRUE
# 6 B 15 TRUE
# 7 B 99 FALSE
# 8 B 2 FALSE
# 9 C 15 TRUE
#10 C 20 FALSE
#11 C 10 TRUE
#12 C 4 FALSE
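For comparison, the same group-wise check can be sketched in base R with ave(). This is only an illustration of the idea above, not the answerer's code; the intermediate name exp_per_val is made up for the example, and the question's data is re-created so the snippet runs on its own.

```r
# Sample data from the question
df <- data.frame(exp = rep(c("A", "B", "C"), each = 4),
                 val = c(10, 20, 15, 10, 10, 15, 99, 2, 15, 20, 10, 4))

# For every row, count the distinct experiments that contain its val.
# exp is recoded to integers so that ave() works on a numeric vector.
exp_per_val <- ave(as.integer(factor(df$exp)), df$val,
                   FUN = function(e) length(unique(e)))

# A value is shared when it occurs in every experiment
df$shared <- exp_per_val == length(unique(df$exp))
df
```

The comparison against length(unique(df$exp)) plays the role of n_distinct(.$exp) in the dplyr version.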
Using base R you can use table:
as.numeric(colnames(a<-table(df))[colSums(a>0)==nrow(a)])
[1] 10 15
You can also add the result as a column:
df %>%
mutate(s = val %in% as.numeric(colnames(a<-table(df))[colSums(a>0)==nrow(a)]))
exp val s
1 A 10 TRUE
2 A 20 FALSE
3 A 15 TRUE
4 A 10 TRUE
5 B 10 TRUE
6 B 15 TRUE
7 B 99 FALSE
8 B 2 FALSE
9 C 15 TRUE
10 C 20 FALSE
11 C 10 TRUE
12 C 4 FALSE
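The one-liner above is fairly dense, so here is a sketch that unpacks it step by step. The intermediate names (present, in_all, shared_vals) are introduced only for this illustration.

```r
# Sample data from the question
df <- data.frame(exp = rep(c("A", "B", "C"), each = 4),
                 val = c(10, 20, 15, 10, 10, 15, 99, 2, 15, 20, 10, 4))

# Rows are experiments, columns are values, cells are counts
a <- table(df$exp, df$val)

# TRUE where a value occurs at least once in an experiment
present <- a > 0

# Keep the columns (values) that are present in every row (experiment)
in_all <- colSums(present) == nrow(a)

# Column names are character, so convert back to numeric
shared_vals <- as.numeric(colnames(a)[in_all])
shared_vals
```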
Here is another base R solution:
x <- split(df$val, df$exp)
Reduce(intersect, x)
## [1] 10 15
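If the logical column from the question is wanted rather than the vector, the result of Reduce() can be fed to %in%. A minimal sketch, assuming the question's data; the name shared_vals is made up for the example.

```r
# Sample data from the question
df <- data.frame(exp = rep(c("A", "B", "C"), each = 4),
                 val = c(10, 20, 15, 10, 10, 15, 99, 2, 15, 20, 10, 4))

# Values present in every experiment
shared_vals <- Reduce(intersect, split(df$val, df$exp))

# Flag each row whose val is one of the shared values
df$shared <- df$val %in% shared_vals
df
```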
We can go through the data.frame row by row and count how many groups contain that row's value of val.
To deal with possible repeat values, we have to use group_by %>% distinct to remove repeated values of val within groups. But then to get just the values of val as a vector, we need to ungroup %>% select(val) %>% unlist, which just seems needlessly complicated.
Finally, we can check whether the number of groups the value is found in equals the total number of groups.
df %>%
rowwise() %>%
mutate(num_groups = sum(group_by(., exp) %>%
distinct(val) %>%
ungroup() %>%
select(val) %>%
unlist() %in% val),
shared = num_groups == length(unique(.$exp)))
# A tibble: 12 x 4
exp val num_groups shared
<fct> <dbl> <int> <lgl>
1 A 10 3 TRUE
2 A 20 2 FALSE
3 A 15 3 TRUE
4 A 10 3 TRUE
5 B 10 3 TRUE
6 B 15 3 TRUE
7 B 99 1 FALSE
8 B 2 1 FALSE
9 C 15 3 TRUE
10 C 20 2 FALSE
11 C 10 3 TRUE
12 C 4 1 FALSE
Given this data frame:
library(dplyr)
dat <- data.frame(
bar = c(letters[1:10]),
foo = c(1,2,3,5,8,9,11,13,14,15)
)
bar foo
1 a 1
2 b 2
3 c 3
4 d 5
5 e 8
6 f 9
7 g 11
8 h 13
9 i 14
10 j 15
I first want to identify groups, if the foo number is consecutive:
dat <- dat %>% mutate(in_cluster =
ifelse( lead(foo) == foo +1 | lag(foo) == foo -1,
TRUE,
FALSE))
Which leads to the following data frame:
bar foo in_cluster
1 a 1 TRUE
2 b 2 TRUE
3 c 3 TRUE
4 d 5 FALSE
5 e 8 TRUE
6 f 9 TRUE
7 g 11 FALSE
8 h 13 TRUE
9 i 14 TRUE
10 j 15 TRUE
As can be seen, the values 1,2,3 form a group, then value 5 is on its own and does not belong to a cluster, then values 8,9 form another cluster and so on.
I would like to add cluster numbers to these "groups".
Expected output:
bar foo in_cluster cluster_number
1 a 1 TRUE 1
2 b 2 TRUE 1
3 c 3 TRUE 1
4 d 5 FALSE NA
5 e 8 TRUE 2
6 f 9 TRUE 2
7 g 11 FALSE NA
8 h 13 TRUE 3
9 i 14 TRUE 3
10 j 15 TRUE 3
There is probably a better tidyverse approach for something like this. For example, group_indices could be used if in_cluster is defined through an arbitrary length case_when. However, we can also implement our own method to specifically deal with logical value run lengths, using the rle function.
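To make the rle idea concrete, here is a small trace on the in_cluster vector from the question. It is only a sketch of what rle returns and how the run values turn into cluster numbers; the names runs and ids are made up for the illustration.

```r
# in_cluster column from the question
in_cluster <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

# rle() compresses the vector into runs of equal values
runs <- rle(in_cluster)
runs$lengths  # length of each run
runs$values   # value of each run

# Number the TRUE runs: cumsum() increases by one at every TRUE run
ids <- cumsum(runs$values)

# FALSE runs do not belong to a cluster
ids[!runs$values] <- NA

# Expand each run id back to one entry per row
cluster_number <- rep(ids, runs$lengths)
cluster_number
```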
solution 1 (R version >= 4.1, which introduced the native pipe |> and the \(x) lambda syntax)
lgl_indices <- function(var){
x <- rle(var)
cumsum(x[[2]]) |> (\(.){ .[which(!x[[2]], T)] <- NA ; .})() |> rep(x[[1]])
}
solution 2
lgl_indices <- function(var){
x <- rle(var)
y <- cumsum(x$values)
y[which(x$values == F)] <- NA
rep(y, x$lengths)
}
solution 3
lgl_indices <- function(var){
x <- rle(var)
l <- vector("list", length(x$lengths))  # one slot per run; length(x) is always 2 for an rle object
n <- 1L
for (i in seq_along(x[[1]])) {
if(!x$values[i]) grp <- NA else {
grp <- n
n <- n + 1L
}
l[[i]] <- rep(grp, x$lengths[i])
}
Reduce(c, l)
}
dat %>%
mutate(cluster_number = lgl_indices(in_cluster))
bar foo in_cluster cluster_number
1 a 1 TRUE 1
2 b 2 TRUE 1
3 c 3 TRUE 1
4 d 5 FALSE NA
5 e 8 TRUE 2
6 f 9 TRUE 2
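This may not be the most efficient way, but it works for this data (it assumes FALSE rows are never adjacent; otherwise the cluster numbers skip values):
# Cumulative sum of the negated logical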
dat$new_cluster <- cumsum(!dat$in_cluster)+1
# using the in_cluster to subset and replacing the cluster number for FALSE by NA
dat[!dat$in_cluster,]$new_cluster <- NA
dat
bar foo in_cluster new_cluster
1 a 1 TRUE 1
2 b 2 TRUE 1
3 c 3 TRUE 1
4 d 5 FALSE NA
5 e 8 TRUE 2
6 f 9 TRUE 2
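As a hedged variation on the cumsum idea, the sketch below counts the starts of TRUE runs instead of the FALSE rows, so the numbering stays consecutive even when several FALSE rows are adjacent. The function name cluster_ids is hypothetical, and the sketch assumes in_cluster contains no NA.

```r
# Hypothetical helper: number runs of TRUE, NA for FALSE rows
cluster_ids <- function(x) {
  # A run starts where x is TRUE and the previous element is not
  starts <- x & !c(FALSE, head(x, -1))
  ids <- cumsum(starts)
  ids[!x] <- NA
  ids
}

# in_cluster column from the question
in_cluster <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
cluster_ids(in_cluster)

# Two adjacent FALSE rows: the second cluster is still numbered 2
cluster_ids(c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE))
```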
I have the following dataset:
df <- data.frame(c(1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5), c("a","a","a","b","b","b","b","b","b","b","b",
"a","a","a","b","b","b"),
c(300,295,295,25,25,25,25,25,20,20,20,300,295,295,300, 295,295),
c("c","d","e","f","g","h","i","j","l","m","n","o","p","q","r","s","t"))
colnames(df) <- c("ID", "Group", "Price", "OtherNumber")
df
ID Group Price OtherNumber
1 1 a 300 c
2 1 a 295 d
3 1 a 295 e
4 2 b 25 f
5 2 b 25 g
6 2 b 25 h
7 2 b 25 i
8 3 b 25 j
9 3 b 20 l
10 3 b 20 m
11 3 b 20 n
12 4 a 300 o
13 4 a 295 p
14 4 a 295 q
15 5 b 300 r
16 5 b 295 s
17 5 b 295 t
I want to compare the first price of consecutive IDs. I want to flag two consecutive IDs only if they have the same initial price and are in the same group. In case this is not clear, here is an example: comparing the first and second ID, both the group (a vs. b) and the initial price (300 vs. 25) are a mismatch. IDs 2 and 3, on the other hand, are both in group b and have the same initial price of 25 (cf. rows 4 and 8). The later prices do not matter, as they may differ.
I figure I should be able to do this with the dplyr package, and I have a very rough attempt (which does not yet work).
# Load dplyr
library(dplyr)
# Assign row numbers within IDs
df1 <- df %>%
group_by(ID) %>%
mutate(subID = row_number())
# Isolate first observation in ID
df2 <- df1[df1$subID == 1,]
# Set up loop to iterate through IDs
for (i in 2:length(df2)) {
if (df2$Price[i] - df2$Price[i - 1] == 0) {
df2$flag <- TRUE
} else {
df2$flag <- FALSE
}
}
If you tell me that this is the only possible solution, I will obviously devote more resources to it, but I am sure there must be an easier solution. I checked on SO and maybe I missed something, but I was not able to find anything going into this direction. Thanks!
The output I want to get looks something like this:
ID Group Price OtherNumber flag
1 1 a 300 c FALSE
2 1 a 295 d FALSE
3 1 a 295 e FALSE
4 2 b 25 f TRUE
5 2 b 25 g TRUE
6 2 b 25 h TRUE
7 2 b 25 i TRUE
8 3 b 25 j TRUE
9 3 b 20 l TRUE
10 3 b 20 m TRUE
11 3 b 20 n TRUE
12 4 a 300 o FALSE
13 4 a 295 p FALSE
14 4 a 295 q FALSE
15 5 b 300 r FALSE
16 5 b 295 s FALSE
17 5 b 295 t FALSE
Here is a data.table oneliner... cut into smaller pieces to view intermediate results; also see explanation at the bottom of the answer.
library(data.table)
dt <- as.data.table( df )
dt[ dt[ , .SD[1], ID][ ( Group == shift( Group, type = "lead") & Price == shift( Price, type = "lead") ) |
( Group == shift( Group, type = "lag") & Price == shift( Price, type = "lag") ),
flag := TRUE][is.na(flag), flag := FALSE], flag := i.flag, on = .(ID)][]
# ID Group Price OtherNumber flag
# 1: 1 a 300 c FALSE
# 2: 1 a 295 d FALSE
# 3: 1 a 295 e FALSE
# 4: 2 b 25 f TRUE
# 5: 2 b 25 g TRUE
# 6: 2 b 25 h TRUE
# 7: 2 b 25 i TRUE
# 8: 3 b 25 j TRUE
# 9: 3 b 20 l TRUE
# 10: 3 b 20 m TRUE
# 11: 3 b 20 n TRUE
# 12: 4 a 300 o FALSE
# 13: 4 a 295 p FALSE
# 14: 4 a 295 q FALSE
# 15: 5 b 300 r FALSE
# 16: 5 b 295 s FALSE
# 17: 5 b 295 t FALSE
explanation:
dt[ , .SD[1], ID] creates a data.table with the first row of each ID.
[ Group == shift( ... , flag := TRUE] sets the column flag to TRUE when the next (or previous) row has a matching Price and Group.
[is.na(flag), flag := FALSE] fills in the rest (everything that is not TRUE) with FALSE.
..flag := i.flag, on = .(ID)] performs a left join (by reference, so it's fast and efficient) on the original data.table, to get the final result.
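The same steps can be sketched in base R, for readers who do not use data.table. This is an illustration under the question's assumptions (rows of one ID are contiguous and IDs appear in the order they should be compared), not a translation endorsed by the answerer. The names first, same_next and same_prev are made up, and the OtherNumber column is left out because it plays no role.

```r
# Data from the question (without OtherNumber)
df <- data.frame(
  ID    = rep(1:5, times = c(3, 4, 4, 3, 3)),
  Group = rep(c("a", "b", "b", "a", "b"), times = c(3, 4, 4, 3, 3)),
  Price = c(300, 295, 295, 25, 25, 25, 25, 25, 20, 20, 20,
            300, 295, 295, 300, 295, 295)
)

# First row of each ID
first <- df[!duplicated(df$ID), c("ID", "Group", "Price")]
n <- nrow(first)

# Does the next ID have the same Group and the same first Price?
same_next <- c(first$Group[-n] == first$Group[-1] &
               first$Price[-n] == first$Price[-1], FALSE)

# Shift by one to get the comparison with the previous ID
same_prev <- c(FALSE, head(same_next, -1))
first$flag <- same_next | same_prev

# Copy the per-ID flag back to every row of the original data
df$flag <- first$flag[match(df$ID, first$ID)]
df
```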
In the following dataset, I want to remove all rows starting at the first instance, sorted by Time and grouped by ID, that Var is TRUE. Put differently, I want to subset all rows for each ID by those which are FALSE up until the first TRUE, sorted by Time.
ID <- c('A','B','C','A','B','C','A','B','C','A','B','C')
Time <- c(3,3,3,6,6,6,9,9,9,12,12,12)
Var <- c(F,F,F,T,T,F,T,T,F,T,F,T)
data = data.frame(ID, Time, Var)
data
ID Time Var
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 A 6 TRUE
5 B 6 TRUE
6 C 6 FALSE
7 A 9 TRUE
8 B 9 TRUE
9 C 9 FALSE
10 A 12 TRUE
11 B 12 FALSE
12 C 12 TRUE
The desired result for this data frame should be:
ID Time Var
A 3 FALSE
B 3 FALSE
C 3 FALSE
C 6 FALSE
C 9 FALSE
Note that the solution should not only remove rows where Var == TRUE, but should also remove rows where Var == FALSE but this follows (in Time) another instance where Var == TRUE for that ID.
I've tried many different things but can't seem to figure this out. Any help is much appreciated!
Here's how to do that with dplyr using group_by and cumsum.
The rationale is that Var is a logical vector where FALSE is equal to 0 and TRUE is equal to 1. cumsum will remain at 0 until it hits the first TRUE.
library(dplyr)
data%>%
group_by(ID)%>%
filter(cumsum(Var)<1)
ID Time Var
<fctr> <dbl> <lgl>
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 C 6 FALSE
5 C 9 FALSE
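The cumsum rationale can also be sketched in base R with ave(). This is only an illustration of the same idea; the names keep and result are made up, and the explicit sort is there because the question asks for rows ordered by Time within each ID.

```r
# Data from the question
data <- data.frame(
  ID   = rep(c("A", "B", "C"), times = 4),
  Time = rep(c(3, 6, 9, 12), each = 3),
  Var  = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE,
           TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Sort so that cumsum() runs in Time order within each ID
data <- data[order(data$ID, data$Time), ]

# cumsum(Var) stays at 0 until the first TRUE of each ID
keep <- ave(data$Var, data$ID, FUN = function(v) cumsum(v) < 1)

result <- data[as.logical(keep), ]
result
```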
Here's the equivalent code with data.table:
library(data.table)
setDT(data)  # convert the data.frame to a data.table by reference
data[data[, .I[cumsum(Var) < 1], by = ID]$V1]
ID Time Var
1: A 3 FALSE
2: B 3 FALSE
3: C 3 FALSE
4: C 6 FALSE
5: C 9 FALSE
This data.table solution should work.
library(data.table)
setDT(data)[, .SD[1:(which.max(Var)-1)], by=ID]
ID Time Var
1: A 3 FALSE
2: B 3 FALSE
3: C 3 FALSE
4: C 6 FALSE
5: C 9 FALSE
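Given that you want all the rows up to the first TRUE value, which.max does the job here: on a logical vector it returns the position of the first TRUE. Note that 1:(which.max(Var)-1) assumes every ID has at least one TRUE and does not start with TRUE; otherwise it evaluates to 1:0 and keeps the first row.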
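A small sketch of how which.max behaves on logical vectors, including the two cases where the index 1:(which.max(Var) - 1) does not select the intended rows. The helper rows_before_first_true is hypothetical and assumes the vector contains no NA.

```r
# Typical case: the first TRUE is at position 3, so positions 1:2 are kept
which.max(c(FALSE, FALSE, TRUE, FALSE))

# Vector starts with TRUE: which.max() is 1, and 1:0 counts down to c(1, 0)
1:(which.max(c(TRUE, FALSE)) - 1)

# No TRUE at all: which.max() is also 1 (the first maximum)
which.max(c(FALSE, FALSE))

# Hypothetical helper that handles both edge cases
rows_before_first_true <- function(v) {
  if (any(v)) seq_len(which.max(v) - 1L) else seq_along(v)
}

rows_before_first_true(c(FALSE, FALSE, TRUE, FALSE))
rows_before_first_true(c(TRUE, FALSE))
rows_before_first_true(c(FALSE, FALSE))
```

seq_len(0) returns an empty index, whereas 1:0 does not, which is why the helper uses it.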
You can do this with the cumall verb as well:
library(dplyr)
data %>%
dplyr::group_by(ID) %>%
dplyr::filter(dplyr::cumall(!Var))
ID Time Var
<chr> <dbl> <lgl>
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 C 6 FALSE
5 C 9 FALSE
cumall(!x): all cases until the first TRUE
The following dataset is reproducible:
group <- c(1,1,2,2,3,3)
parameter <- c("A","B","A","B","A","B")
values <- c(10,20,20,5,30,50)
df <- data.frame(group,parameter,values)
group parameter values
1 A 10
1 B 20
2 A 20
2 B 5
3 A 30
3 B 50
I want to check within each group whether A > B (store this result in fourth column for entire group)
If yes -> TRUE, If no -> FALSE
New Df:
group parameter values status
1 A 10 FALSE
1 B 20 FALSE
2 A 20 TRUE
2 B 5 TRUE
3 A 30 FALSE
3 B 50 FALSE
Approach
with(df, ave(values,group, FUN = function(x) ))
I cannot work out what the code inside the function should be. Can someone please help?
Updated: Status should be ranked as per the values column (highest to lowest) per group
group parameter values status
1 A 10 2
1 B 20 1
2 A 20 1
2 B 5 2
3 A 30 2
3 B 50 1
We can try with data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'group', compare the 'values' where 'parameter' is 'A' with that of 'B' and assign (:=) to create 'status'
library(data.table)
setDT(df)[, status := values[parameter=="A"]>values[parameter=="B"], by = group]
df
# group parameter values status
#1: 1 A 10 FALSE
#2: 1 B 20 FALSE
#3: 2 A 20 TRUE
#4: 2 B 5 TRUE
#5: 3 A 30 FALSE
#6: 3 B 50 FALSE
and for the rank, use frank on the 'values' after grouping by 'group'.
setDT(df)[, status:= frank(-values), group]
df
# group parameter values status
#1: 1 A 10 2
#2: 1 B 20 1
#3: 2 A 20 1
#4: 2 B 5 2
#5: 3 A 30 2
#6: 3 B 50 1
Or with ave, we can compare the first value with the second one (assuming that 'parameter' is ordered and that there are only two elements per 'group')
df$status <- with(df, as.logical(ave(values, group, FUN = function(x) x[1] > x[2])))
Or another option is to order the dataset by the first two columns (in case it is not ordered), then subset the 'values' by recycling a logical index, compare, and replicate each of the logical values twice.
df1 <- df[do.call(order, df[1:2]), ]
rep(df1$values[c(TRUE, FALSE)] > df1$values[c(FALSE, TRUE)], each = 2)
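Both base R options above rely on row order. A sketch that looks the values up by name instead, assuming each group has exactly one "A" row and one "B" row, could look like this. The names is_a, a, b and a_gt_b are made up for the illustration, and the rank column reproduces the updated request with ave() and rank().

```r
# Data from the question
df <- data.frame(group = c(1, 1, 2, 2, 3, 3),
                 parameter = c("A", "B", "A", "B", "A", "B"),
                 values = c(10, 20, 20, 5, 30, 50))

# Values of A and B, named by group so they can be matched by name
is_a <- df$parameter == "A"
is_b <- df$parameter == "B"
a <- setNames(df$values[is_a], df$group[is_a])
b <- setNames(df$values[is_b], df$group[is_b])

# One comparison per group, then looked up for every row
a_gt_b <- a > b[names(a)]
df$status <- unname(a_gt_b[as.character(df$group)])

# Rank within each group, highest value first
df$rank <- ave(-df$values, df$group, FUN = rank)
df
```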
There is also the tidyverse solution using dplyr:
library(dplyr)
df %>%
group_by(group) %>%
mutate(status = ifelse(values[parameter == "A"] > values[parameter == "B"], TRUE, FALSE),
rank = min_rank(-values))
Source: local data frame [6 x 5]
Groups: group [3]
group parameter values status rank
(dbl) (fctr) (dbl) (lgl) (int)
1 1 A 10 FALSE 2
2 1 B 20 FALSE 1
3 2 A 20 TRUE 1
4 2 B 5 TRUE 2
5 3 A 30 FALSE 2
6 3 B 50 FALSE 1
Say I wanted to work with hospital Medicare data showing procedure prices by hospital and by county, and my data frame was called df with columns price, procedure and county. If I wanted to find the minimum and maximum prices for each procedure by county, I could do something like
library(plyr)
mostexpensive <- ddply(df,c('county','procedure'),function(x)x[which(x$price==max(x$price)),])
to get a table showing the hospitals with the most expensive procedures in each county. I can then see how many times each hospital is listed with
summary(mostexpensive$hospital)
For the final step I want to add a column to the original df dataframe that says TRUE if the row is most expensive and FALSE otherwise but I can't figure out how to get a logical vector from a plyr function. Thanks.
Posting reproducible code would be useful. Try this anyway,
For the summary
pricey <- ddply(df, c('county','procedure'), summarise, most = max(price), less=min(price))
and for the logical indexing
testing <- ddply(df, c('county','procedure'), mutate, expensive = price == max(price))
It is easier to get an answer with a reproducible example. Keep that in mind the next time you ask for help on SO.
That being said, you can use the transform function to add a new column to your existing data.
The first step is to create a toy data set.
set.seed(123)
df <- data.frame(
county = sample(LETTERS[1:3], size = 20, replace = TRUE),
procedure = sample(c(1, 2), size = 20, replace = TRUE),
price = rpois(20, 10)
)
str(df)
## 'data.frame': 20 obs. of 3 variables:
## $ county : Factor w/ 3 levels "A","B","C": 1 3 2 3 3 1 2 3 2 2 ...
## $ procedure: num 2 2 2 2 2 2 2 2 1 1 ...
## $ price : int 6 8 6 8 4 6 6 8 5 12 ...
Now we can use plyr and the transform function
require(plyr)
expensive <- ddply(df, .(county, procedure),
transform, ismax = price == max(price))
expensive
## county procedure price ismax
## 1 A 1 9 FALSE
## 2 A 1 7 FALSE
## 3 A 1 12 TRUE
## 4 A 2 6 FALSE
## 5 A 2 6 FALSE
## 6 A 2 8 TRUE
## 7 B 1 5 FALSE
## 8 B 1 12 TRUE
## 9 B 2 6 FALSE
## 10 B 2 6 FALSE
## 11 B 2 12 TRUE
## 12 B 2 11 FALSE
## 13 C 1 9 TRUE
## 14 C 1 9 TRUE
## 15 C 2 8 FALSE
## 16 C 2 8 FALSE
## 17 C 2 4 FALSE
## 18 C 2 8 FALSE
## 19 C 2 12 TRUE
## 20 C 2 12 TRUE
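For readers without plyr, the same logical column can be sketched in base R with ave(). The toy data below is hypothetical (including the hospital names) and is only meant to show the pattern; grouping on a pasted key is a choice made for this sketch so that only county/procedure combinations that actually occur are evaluated.

```r
# Hypothetical toy data with the columns named in the question
df <- data.frame(
  county    = c("A", "A", "A", "B", "B"),
  procedure = c(1, 1, 2, 1, 1),
  hospital  = c("h1", "h2", "h1", "h3", "h4"),
  price     = c(9, 12, 6, 5, 5)
)

# One grouping key per county/procedure combination
grp <- paste(df$county, df$procedure)

# ave() returns the group maximum (or minimum) repeated for every row
df$ismax <- df$price == ave(df$price, grp, FUN = max)
df$ismin <- df$price == ave(df$price, grp, FUN = min)
df
```

As in the plyr output, ties are all flagged TRUE, so a group can contain more than one most expensive row.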