Select grouped rows with at least one matching criterion - r

I want to select all groupings that contain at least one of the elements I am interested in. I was able to do this by creating an intermediate array, but I am looking for something simpler and faster, because my actual data set has over 1M rows (and 20 columns) and I am not sure I will have enough memory to create an intermediate array. More importantly, the method below takes a long time on my original file.
Here's my code and data:
a) Data
dput(Data_File)
structure(list(Group_ID = c(123, 123, 123, 123, 234, 345, 444,
444), Product_Name = c("ABCD", "EFGH", "XYZ1", "Z123", "ABCD",
"EFGH", "ABCD", "ABCD"), Qty = c(2, 3, 4, 5, 6, 7, 8, 9)), .Names = c("Group_ID",
"Product_Name", "Qty"), row.names = c(NA, 8L), class = "data.frame")
b) Code: I want to select every Group_ID that has at least one Product_Name equal to "ABCD"
#Find out the qualifying transactions
Data_T <- Data_File %>%
  group_by(Group_ID) %>%
  dplyr::filter(Product_Name == "ABCD") %>%
  select(Group_ID) %>%
  distinct()
#Now filter them
Filtered_T <- Data_File %>%
  group_by(Group_ID) %>%
  dplyr::filter(Group_ID %in% Data_T$Group_ID)
c) Expected output is
Group_ID Product_Name Qty
<dbl> <chr> <dbl>
123 ABCD 2
123 EFGH 3
123 XYZ1 4
123 Z123 5
234 ABCD 6
444 ABCD 8
444 ABCD 9
I've been struggling with this for over three hours now. I looked at the thread auto-suggested by SO, Select rows with at least two conditions from all conditions, but my question is very different.

I would do it like this:
Data_File %>% group_by(Group_ID) %>%
filter(any(Product_Name %in% "ABCD"))
# Source: local data frame [7 x 3]
# Groups: Group_ID [3]
#
# Group_ID Product_Name Qty
# <dbl> <chr> <dbl>
# 1 123 ABCD 2
# 2 123 EFGH 3
# 3 123 XYZ1 4
# 4 123 Z123 5
# 5 234 ABCD 6
# 6 444 ABCD 8
# 7 444 ABCD 9
Explanation: any() returns TRUE if any rows within the group match the condition. That length-1 result is then recycled to the full length of the group, so the entire group is kept. You could also use sum(Product_Name %in% "ABCD") > 0 as the condition, but the any() version reads more nicely. Use sum() instead if you want a more complicated condition, such as requiring 3 or more matching product names.
I prefer %in% to == for things like this because it behaves better with NA and is easy to extend if you want to check for any of several products per group.
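For example, to keep the groups that contain any of several products, only the right-hand side of %in% changes (a small sketch; the product set here is made up):

wanted <- c("ABCD", "EFGH")  # hypothetical set of products of interest

Data_File %>%
  group_by(Group_ID) %>%
  filter(any(Product_Name %in% wanted))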
If speed and efficiency are an issue, data.table will be faster. I would do it like this, which relies on a keyed join for the filtering and uses only data.table operations, so it should be very fast:
library(data.table)
df <- as.data.table(Data_File)  # work on a data.table copy of the data
setkey(df)                      # key on all columns (Group_ID first)
# unique Group_IDs that have at least one "ABCD" row
groups <- unique(subset(df, Product_Name %in% "ABCD", Group_ID))
# keyed join: keep only rows whose Group_ID appears in 'groups'
df[groups, nomatch = 0]
# Group_ID Product_Name Qty
# 1: 123 ABCD 2
# 2: 123 EFGH 3
# 3: 123 XYZ1 4
# 4: 123 Z123 5
# 5: 234 ABCD 6
# 6: 444 ABCD 8
# 7: 444 ABCD 9

Related

Rebuild tibble under condition

My Tibble:
df1 <- tibble(a = c("123*", "123", "124", "678*", "678", "679", "677"))
# A tibble: 7 x 1
a
<chr>
1 123*
2 123
3 124
4 678*
5 678
6 679
7 677
What it should become:
# A tibble: 3 x 2
a b
<chr> <chr>
1 123 124
2 678 679
3 678 677
The values with stars refer to the values without stars that follow them, until the next starred value appears, and so on.
Each starred value should go into the first column, and the other values (except the ones that are identical to the starred value once the star is removed) should go into the second column. If a starred value is followed by several values, they should all stay linked to it, so the value in the first column is duplicated to keep the connection.
I know how to filter and put the values into each column, but I'm not sure how to keep the connection.
Regards
We can use tidyverse. Create a grouping column based on the occurrence of * in 'a', extract the numeric part with parse_number, and keep the distinct rows. Then, grouped by 'grp', create a new column 'a' holding the first value of 'b', drop the first row of each group, and select the two columns.
library(dplyr)
library(stringr)
df1 %>%
  transmute(grp = cumsum(str_detect(a, fixed("*"))),
            b = readr::parse_number(a)) %>%
  distinct(b, .keep_all = TRUE) %>%
  group_by(grp) %>%
  mutate(a = first(b)) %>%
  slice(-1) %>%
  ungroup %>%
  select(a, b)
-output
# A tibble: 3 × 2
a b
<dbl> <dbl>
1 123 124
2 678 679
3 678 677
Here is one base R option -
Using cumsum and grepl we split the data on occurrence of *.
In each group, we drop the values which are similar to the star values and create a dataframe with two columns.
Finally, combine the list of dataframes in one combined dataframe.
result <- do.call(rbind, lapply(split(df1,
  cumsum(grepl('*', df1$a, fixed = TRUE))), function(x) {
    a <- x[[1]]
    a[1] <- sub('*', '', a[1], fixed = TRUE)
    data.frame(a = a[1], b = a[a != a[1]])
}))
rownames(result) <- NULL
result
# a b
#1 123 124
#2 678 679
#3 678 677
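Another tidyverse route (a sketch, assuming tidyr is available for fill()) keeps the values as character, which matches the expected output more closely:

library(dplyr)
library(stringr)
library(tidyr)

df1 %>%
  # copy each starred value (star removed) into its own column, NA elsewhere
  mutate(star = if_else(str_detect(a, fixed("*")),
                        str_remove(a, fixed("*")), NA_character_)) %>%
  # carry the starred value down to the rows that follow it
  fill(star) %>%
  # drop the starred rows themselves and the rows equal to the starred value
  filter(!str_detect(a, fixed("*")), a != star) %>%
  transmute(b = a, a = star) %>%
  select(a, b)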

Is there a way in R to sum all items in a column using as condition the values from another?

I have two data frames containing string names with a specific condition and a numeric index value. What I want is to count how many names there are for a condition, using an index value as reference.
The data frames are big, so I'll just put an example.
I want to summarize all values in NAME from a, taking into account CONDITION and whether INDEX falls between INDEX-MIN and INDEX-MAX from b. It is important to note that not all names in a will be captured or summarized in the final result.
The result should be as shown in c.
a <- data.frame(c(1,1,2,3,3,3),c("A","B","C","D","E","F"),c(100,500,233,74,2750,10043))
colnames(a) <- c("CONDITION","NAME","INDEX")
b <- data.frame(c(1,2,3,3),c(1,75,2700,9872),c(600,245,3500,10500))
colnames(b) <- c("CONDITION","INDEX-MIN","INDEX-MAX")
c <- data.frame(c(1,2,3,3),c(1,75,2700,9872),c(600,245,3500,10500),c(2,1,1,1),c("A, B","C", "E", "F"))
colnames(c) <- c("CONDITION","INDEX-MIN","INDEX-MAX","NAME-COUNT","NAME")
We can do this with a non-equi join in data.table
library(data.table)
setDT(a)[b, .(NAME_COUNT = .N, NAME = toString(NAME)),
on = .(CONDITION, INDEX >=`INDEX-MIN`, INDEX < `INDEX-MAX`), by = .EACHI]
-output
CONDITION INDEX INDEX NAME_COUNT NAME
1: 1 1 600 2 A, B
2: 2 75 245 1 C
3: 3 2700 3500 1 E
4: 3 9872 10500 1 F
You can join the two dataframes with fuzzyjoin -
library(dplyr)
fuzzyjoin::fuzzy_inner_join(a, b,
by = c('CONDITION', 'INDEX' = 'INDEX-MIN', 'INDEX' = 'INDEX-MAX'),
match_fun = c(`==`, `>=`, `<=`)) %>%
group_by(`INDEX-MIN`, `INDEX-MAX`) %>%
summarise(CONDITION = first(CONDITION.x),
`NAME-COUNT` = n(),
NAME = toString(NAME)) %>%
ungroup
# `INDEX-MIN` `INDEX-MAX` CONDITION `NAME-COUNT` NAME
# <dbl> <dbl> <dbl> <int> <chr>
#1 1 600 1 2 A, B
#2 75 245 2 1 C
#3 2700 3500 3 1 E
#4 9872 10500 3 1 F
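If you are on dplyr 1.1.0 or later, the same non-equi join can be written without fuzzyjoin using join_by() (a sketch under that version assumption):

library(dplyr)  # >= 1.1.0 for inequality conditions in join_by()

inner_join(a, b,
           by = join_by(CONDITION,
                        INDEX >= `INDEX-MIN`,
                        INDEX <= `INDEX-MAX`)) %>%
  group_by(CONDITION, `INDEX-MIN`, `INDEX-MAX`) %>%
  summarise(`NAME-COUNT` = n(),
            NAME = toString(NAME),
            .groups = "drop")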

Keeping one row and discarding others in R using specific criteria?

I'm working with the data frame below, which is just part of the full data, and I need to condense the duplicate numbers in the id column into one row. I want to preserve the row that has the highest sbp number, unless it's 300 or over, in which case I want to discard that too.
So for example, for the first three rows that have id as 13480, I want to keep the row that has 124 and discard the other two.
id,sex,visits,sbp
13480,M,2,124
13480,M,3,306
13480,M,4,116
13520,M,2,124
13520,M,3,116
13520,M,4,120
13580,M,2,NA
13580,M,3,124
This is the farthest I got; I've been trying to tweak it, but I'm not sure I'm on the right track:
maxsbp <- split(sbp, sbp$sbp)
r <- data.frame()
for (i in 1:length(maxsbp)){
one <- maxsbp[[i]]
index <- which(one$sbp == max(one$sbp))
select <- one[index,]
r <- rbind(r, select)
}
r1 <- r[!(sbp$sbp>=300),]
r1
I think a tidy solution would work quite well here. I would first filter out all values of 300 and above, since you do not want to keep anything at or over that threshold. Then group_by id, arrange in descending order, and keep the first row.
my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
"sex" = c("M","M","M","M","M","M","M","M"),
"sbp"= c(124,306,116,124,116,120,NA,124))
my.df %>%
  filter(sbp < 300) %>%  # keep only values below 300 (this also drops the NA)
  group_by(id) %>%       # group by id
  arrange(-sbp) %>%      # arrange by sbp in descending order
  top_n(1, sbp)          # retain the first value, i.e. the largest
# A tibble: 3 x 3
# Groups: id [3]
# id sex sbp
# <dbl> <chr> <dbl>
#1 13480 M 124
#2 13520 M 124
#3 13580 M 124
In R, you will very rarely need explicit for loops for tasks like this.
There are functions available that will perform such grouped operations for you.
For example, in base R you can use subset and ave:
subset(df, sbp == ave(sbp, id, FUN = function(x) max(x[x <= 300], na.rm = TRUE)))
# id sex visits sbp
#1 13480 M 2 124
#4 13520 M 2 124
#8 13580 M 3 124
The same can be done using dplyr whose syntax is a little bit easier to understand.
library(dplyr)
df %>%
group_by(id) %>%
filter(sbp == max(sbp[sbp <= 300], na.rm = TRUE))
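A closely related variant (a sketch, assuming dplyr >= 1.0.0 for slice_max() and using the same df as above) filters first and then takes the per-group maximum directly:

df %>%
  filter(sbp < 300) %>%                     # drop readings of 300 or over (and NAs)
  group_by(id) %>%
  slice_max(sbp, n = 1, with_ties = FALSE)  # keep the single largest remaining sbp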
slice_head can also be used
my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
"sex" = c("M","M","M","M","M","M","M","M"),
"sbp"= c(124,306,116,124,116,120,NA,124))
> my.df
id sex sbp
1 13480 M 124
2 13480 M 306
3 13480 M 116
4 13520 M 124
5 13520 M 116
6 13520 M 120
7 13580 M NA
8 13580 M 124
Proceed simply like this:
my.df %>%
  filter(sbp < 300) %>%   # drop readings of 300 or over (and the NA row)
  group_by(id, sex) %>%
  arrange(desc(sbp)) %>%
  slice_head()
# A tibble: 3 x 3
# Groups: id, sex [3]
id sex sbp
<dbl> <chr> <dbl>
1 13480 M 124
2 13520 M 124
3 13580 M 124

Collapse observation rows based on first and last occurence in R

I have a dataset like this.
ID EQP_ID DATE ENTRY EXIT
10 1232 10/01/2018 0058 NA
10 8123 10/01/2018 NA 0059
11 8231 10/02/2018 0063 NA
11 233 10/03/2018 0064 NA
11 2512 10/04/2018 NA 0099
11 2111 10/05/2018 NA 1000
I want to collapse the observations such that the earliest row I see with an 'ENTRY' for a given ID is combined with the latest row with an EXIT value, and I also get the EQP_ID associated with the exit record:
ID EQP_ID ENTRY EXIT
10 8123 0058 0059
11 2111 0063 1000
I'm fairly new to R and this was complicated enough that I couldn't think of a good way to do it without resorting to a loop, and performance is predictably not very good.
Edit
I think this does it, but I'd still be curious if other more experienced folks have a better answer
group_by(dataset, ID) %>%
  arrange(ENTRY) %>%
  summarize(ENTRY = first(ENTRY), EXIT = last(EXIT), EQP_ID = last(EQP_ID))
Using dplyr::first and dplyr::last we can do the following; another option would be to use min and max.
library(dplyr)
df %>% group_by(ID) %>%
summarise(EQP_ID=dplyr::last(EQP_ID), First=dplyr::first(ENTRY),Last=dplyr::last(EXIT))
# A tibble: 2 x 4
ID EQP_ID First Last
<int> <int> <int> <int>
1 10 8123 58 59
2 11 2111 63 1000
This solution uses dplyr. First, define the data frame.
df <- read.table(text = "ID EQP_ID DATE ENTRY EXIT
10 1232 10/01/2018 0058 NA
10 8123 10/01/2018 NA 0059
11 8231 10/02/2018 0063 NA
11 233 10/03/2018 0064 NA
11 2512 10/04/2018 NA 0099
11 2111 10/05/2018 NA 1000", header = TRUE)
Next, group by ID and take either the first or last value of variables in the group using head or tail, respectively.
df %>%
group_by(ID) %>%
summarise(EQP_ID = tail(EQP_ID, 1),
ENTRY = head(ENTRY, 1),
EXIT = tail(EXIT, 1))
This gives,
# # A tibble: 2 x 4
# ID EQP_ID ENTRY EXIT
# <int> <int> <int> <int>
# 1 10 8123 58 59
# 2 11 2111 63 1000
One option with data.table:
library(data.table)
#create example data
dt <- data.table(
id = c(10, 10, 11, 11, 11, 11),
date = seq(as.Date("2018-10-1"), as.Date("2018-10-6"), by="day"),
entry = c(58, NA, 63, 64, NA, NA),
exit = c(NA, 59, NA, NA, 99, 100)
)
# number rows by id
dt[order(id, date), num := 1:.N, by=id]
# get first-entry and last-exit values by id
dt[ , keepentry := entry[1],by=id]
dt[ , keepexit := exit[.N],by=id]
# keep one row per id
dt[num==1, .(id, keepentry, keepexit)]
Not my most elegant work, but it will get the job done.
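A more compact data.table version (a sketch on the same example data, without EQP_ID since the example table omits that column) computes both values in one grouped call:

library(data.table)

dt[order(date),
   .(entry = first(na.omit(entry)),  # earliest non-missing entry per id
     exit  = last(na.omit(exit))),   # latest non-missing exit per id
   by = id]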

How to count the different number of variables in a column, then list that count by numbers in another column

Please see attached image for the best way I can describe my question.
I promise I did attempt to research this first, and I saw a few answers that came close, but many of them required listing each variable (in this image, that would be each encounter #), and my data has approximately 15 million rows, with about 10,000 different encounter #'s.
I would appreciate any assistance!
As an alternative, you can also use the data.table package. Especially on large datasets, data.table will give you an enormous performance boost. Applied to the data as used by @r2evans:
library(data.table)
setDT(df)[, .(n_uniq_enc = uniqueN(encounter)), by = patient]
this will lead to the following result:
patient n_uniq_enc
1: 123 5
2: 456 5
Lacking a reproducible example, here's some sample data:
set.seed(42)
df <- data.frame(patient = sample(c(123,456), size=30, replace=TRUE), encounter=sample(c(12,34,56,78,90), size=30, replace=TRUE))
head(df)
# patient encounter
# 1 456 78
# 2 456 90
# 3 123 34
# 4 456 78
# 5 456 12
# 6 456 90
Base R:
aggregate(x = df$encounter, by = list(patient = df$patient),
FUN = function(a) length(unique(a)))
# patient x
# 1 123 5
# 2 456 5
or (by @20100721's suggestion):
aggregate(encounter ~ ., FUN = function(t) length(unique(t)), data = df)
Using dplyr:
library(dplyr)
group_by(df, patient) %>%
summarize(numencounters = length(unique(encounter)))
# # A tibble: 2 x 2
# patient numencounters
# <dbl> <int>
# 1 123 5
# 2 456 5
Update: @2100721 informed me of n_distinct, which is effectively the same as length(unique(...)):
group_by(df, patient) %>%
summarize(numencounters = n_distinct(encounter))
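For a quick check on the same sample data, base R's tapply() returns the counts as a named vector (a minimal sketch):

tapply(df$encounter, df$patient, function(x) length(unique(x)))
# 123 456
#   5   5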