Fast way to find min in groups after excluding observations using R

I need to do something similar to below on a very large data set (with many groups), and read somewhere that using .SD is slow. Is there any faster way to perform the following operation?
To be more precise, I need to create a new column containing the minimum value for each group after excluding a subset of observations in that group (similar to MINIFS in Excel).
library(data.table)
dt <- data.table(valid = c(0, 1, 1, 0, 1),
                 a = c(1, 1, 2, 3, 4),
                 groups = c("A", "A", "A", "B", "B"))
dt[, valid_min := .SD[valid == 1, min(a, na.rm = TRUE)], by = groups]
With the output:
> dt
   valid a groups valid_min
1:     0 1      A         1
2:     1 1      A         1
3:     1 2      A         1
4:     0 3      B         4
5:     1 4      B         4
To make it even more complicated, a group may have no valid entries at all, or its valid entries may all have missing values. My current code is similar to this:
dt <- data.table(valid = c(0, 1, 1, 0, 1, 0, 1, 1),
                 a = c(1, 1, 2, 3, 4, 3, NA, NA),
                 k = c("A", "A", "A", "B", "B", "C", "D", "D"))
dt[, valid_min := .SD[valid == 1,
                      ifelse(all(is.na(a)), NA_real_, min(a, na.rm = TRUE))],
   by = k]
Output:
> dt
   valid  a k valid_min
1:     0  1 A         1
2:     1  1 A         1
3:     1  2 A         1
4:     0  3 B         4
5:     1  4 B         4
6:     0  3 C        NA
7:     1 NA D        NA
8:     1 NA D        NA

There's...
dt[dt[valid == 1 & !is.na(a), min(a), by=k], on=.(k), the_min := i.V1]
This should be fast since the inner call to min is optimized for groups. (See ?GForce.)
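If you want to confirm that GForce actually kicks in, data.table reports its optimized j when verbose output is on. A quick sanity check (not part of the original answer):
# With verbose on, look for "GForce optimized j to ..." in the console output
options(datatable.verbose = TRUE)
dt[valid == 1 & !is.na(a), min(a), by = k]
options(datatable.verbose = FALSE)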

We can do the same using dplyr:
library(dplyr)
dt %>%
  group_by(groups) %>%
  mutate(valid_min = min(ifelse(valid == 1, a, NA), na.rm = TRUE))
Which gives:
  valid     a groups valid_min
  <dbl> <dbl> <chr>      <dbl>
1     0     1 A              1
2     1     1 A              1
3     1     2 A              1
4     0     3 B              4
5     1     4 B              4
Alternatively, if you are not interested in keeping the 'non-valid' rows, we can do the following:
dt %>%
  filter(valid == 1) %>%
  group_by(groups) %>%
  mutate(valid_min = min(a))
Looks like I provided the slowest approach. Comparing each approach (using a larger, replicated data frame called df) with a microbenchmark test:
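The larger data frame df isn't defined in the original post; one plausible reconstruction (an assumption on my part) is simply the example data replicated many times:
# Hypothetical df: the 8-row example repeated 10,000 times (keeps only groups A-D)
df <- rbindlist(rep(list(dt), 1e4))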
library(microbenchmark)
library(ggplot2)
mbm <- microbenchmark(
  dplyr.test = suppressWarnings(df %>%
    group_by(k) %>%
    mutate(valid_min = min(ifelse(valid == 1, a, NA), na.rm = TRUE),
           valid_min = ifelse(valid_min == Inf, NA, valid_min))),
  data.table.test = df[, valid_min := .SD[valid == 1,
                                          ifelse(all(is.na(a)), NA_real_,
                                                 min(a, na.rm = TRUE))],
                       by = k],
  GForce.test = df[df[valid == 1 & !is.na(a), min(a), by = k],
                   on = .(k), the_min := i.V1]
)
autoplot(mbm)
...well, I tried...

Related

Identify rows with a value greater than threshold, but only the first one above per group

Suppose we have a dataset with a grouping variable, a value, and a threshold that is unique per group. Say I want to flag a value that is greater than the threshold, but only the first such value per group.
test <- data.frame(
  grp = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 3, 5, 1, 3, 5),
  threshold = c(4, 4, 4, 2, 2, 2)
)
want <- data.frame(
  grp = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 3, 5, 1, 3, 5),
  threshold = c(4, 4, 4, 2, 2, 2),
  want = c(NA, NA, "yes", NA, "yes", NA)
)
In the table above, group A has a threshold of 4, and only the value 5 is higher. But in group B the threshold is 2, and both 3 and 5 are higher; however, only the row with value 3 is marked.
I was able to do this by identifying which rows had value greater than threshold, then removing the repeated value:
library(dplyr)
test %>%
  group_by(grp) %>%
  mutate(want = if_else(value > threshold, "yes", NA_character_)) %>%
  mutate(across(want, ~ replace(.x, duplicated(.x), NA)))
I was wondering if there was a direct way to do this using a single logical statement rather than this two-step method, something along the lines of:
test %>%
  group_by(grp) %>%
  mutate(want = if_else(???, "yes", NA_character_))
The answer doesn't have to be in R either; a logical step-by-step explanation would suffice as well. Perhaps using a rank?
Thank you!
library(dplyr)
test %>%
  group_by(grp) %>%
  mutate(want = (value > threshold), want = want & !lag(cumany(want))) %>%
  ungroup()
# # A tibble: 6 × 4
#   grp   value threshold want
#   <chr> <dbl>     <dbl> <lgl>
# 1 A         1         4 FALSE
# 2 A         3         4 FALSE
# 3 A         5         4 TRUE
# 4 B         1         2 FALSE
# 5 B         3         2 TRUE
# 6 B         5         2 FALSE
If you really want strings, you can if_else after this.
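For example, appending one more step to the pipe above:
test %>%
  group_by(grp) %>%
  mutate(want = (value > threshold), want = want & !lag(cumany(want))) %>%
  ungroup() %>%
  mutate(want = if_else(want, "yes", NA_character_))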
Here is a more direct way.
The essential part: with min(which(value > threshold)) we get the position of the first TRUE in our column.
Next we use ifelse to compare that number to the row number and set the value accordingly:
library(dplyr)
test %>%
  group_by(grp) %>%
  mutate(want = ifelse(row_number() == min(which(value > threshold)),
                       "yes", NA_character_))
  grp   value threshold want
  <chr> <dbl>     <dbl> <chr>
1 A         1         4 NA
2 A         3         4 NA
3 A         5         4 yes
4 B         1         2 NA
5 B         3         2 yes
6 B         5         2 NA
This is a perfect chance for a data.table answer using its non-equi matching and multiple match handling capabilities:
library(data.table)
setDT(test)
test[test, on=.(grp, value>threshold), mult="first", flag := TRUE]
test
#      grp value threshold   flag
#   <char> <num>     <num> <lgcl>
#1:      A     1         4     NA
#2:      A     3         4     NA
#3:      A     5         4   TRUE
#4:      B     1         2     NA
#5:      B     3         2   TRUE
#6:      B     5         2     NA
Find the "first" matching value in each group that is greater than > the threshold and set := it to TRUE

Check whether a vector element of one value is placed between vector elements of two other values in R

I have not found any method for checking whether categorical elements of a vector lie between elements of two other categories.
A dataframe is given:
id letter
 1      B
 2      A
 3      B
 4      B
 5      C
 6      B
 7      A
 8      B
 9      C
Everything I found relates to numerical values and to the notion of general order (rather than to the index of an element in a specific vector).
I want to add a new column with boolean values (1 if B is between A and C; 0 if B is between C and A) to the dataframe:
id letter between
 1      B       0
 2      A      NA
 3      B       1
 4      B       1
 5      C      NA
 6      B       0
 7      A      NA
 8      B       1
 9      C      NA
A combination of rle (run length encoding) and zoo::rollapply is one option:
library(zoo)
d <- structure(list(id = 1:9,
                    letter = structure(c(2L, 1L, 2L, 2L, 3L, 2L, 1L, 2L, 3L),
                                       .Label = c("A", "B", "C"),
                                       class = "factor")),
               class = "data.frame", row.names = c(NA, -9L))
rl <- rle(as.numeric(d$letter))
rep(rollapply(c(NA, rl$values, NA),
              3,
              function(x) if (x[2] == 2)
                            ifelse(x[1] == 1 && x[3] == 3, 1, 0)
                          else NA),
    rl$lengths)
# [1] 0 NA 1 1 NA 0 NA 1 NA
Explanation
With rle you identify blocks of consecutive values (see the demo after this explanation).
With rollapply you "roll" a function with a given window size (here 3) over a vector.
Our vector rl$values contains the distinct run values, and the function we apply to it is pretty straightforward:
if the second element is anything but a 2 (corresponding to B), return NA;
if the second element is a 2, return 1 when element 1 is a 1 (an "A") and element 3 is a 3 (a "C"), and 0 otherwise.
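For the example data, rle gives the following (the factor is coded A = 1, B = 2, C = 3):
rle(as.numeric(d$letter))
# Run Length Encoding
#   lengths: int [1:8] 1 1 2 1 1 1 1 1
#   values : num [1:8] 2 1 2 3 2 1 2 3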
A different tidyverse possibility could be:
df %>%
  group_by(grp = with(rle(letter), rep(seq_along(lengths), lengths))) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  mutate(res = ifelse(lag(letter, default = first(letter)) == "A" &
                      lead(letter, default = last(letter)) == "C", 1, 0)) %>%
  select(-letter, -grp) %>%
  full_join(df, by = c("id" = "id")) %>%
  arrange(id) %>%
  fill(res) %>%
  mutate(res = ifelse(letter != "B", NA, res))
     id   res letter
  <int> <dbl> <chr>
1     1     0 B
2     2    NA A
3     3     1 B
4     4     1 B
5     5    NA C
6     6     0 B
7     7    NA A
8     8     1 B
9     9    NA C
In this case it first groups by a run-length-type ID and keeps the first row of each run. Second, it checks the condition. Third, it performs a full join with the original df on the "id" column. Finally, it arranges by "id", fills the missing values, and assigns NA to rows where letter != "B".
It's unclear from the question whether "A" and "C" must alternate, though that's implied because there is no coding for "B" between "A" and "A" or vice versa. Supposing that they do, for the vector
x = c("B", "A", "B", "B", "C", "B", "A", "B", "C")
map to numeric values c(A=1, B=0, C=-1) and form the cumulative sum
v = cumsum(c(A=1, B=0, C=-1)[x])
(increment by 1 when encountering "A", decrement by 1 when encountering "C"). Replace positions not corresponding to "B" with NA
v[x != "B"] = NA
giving
> v
B A B B C B A B C
0 NA 1 1 NA 0 NA 1 NA
This could be captured as a function
fun = function(x, map = c(A = 1, B = 0, C = -1)) {
  x = map[x]
  v = cumsum(x)
  v[x != 0] = NA
  v
}
and used to transform a data.frame or tibble, e.g.,
tibble(x) %>% mutate(v = fun(x))
Here's one solution, which I hope is fairly easy conceptually. For 'special' cases such as B being at the top or bottom of the list, or having an A or a C on both sides, I've set such values to 0.
# Create dummy data - you use your own
df <- data.frame(id = 1:100, letter = sample(c("A", "B", "C"), 100, replace = TRUE))

# Copy down info on whether A or C is above each B
acup <- df$letter
for (i in 2:nrow(df))
  if (df$letter[i] == "B")
    acup[i] <- acup[i - 1]

# Copy up info on whether A or C is below each B
acdown <- df$letter
for (i in (nrow(df) - 1):1)
  if (df$letter[i] == "B")
    acdown[i] <- acdown[i + 1]

# Set appropriate values for column 'between'
df$between <- NA
df$between[acup == "A" & acdown == "C"] <- 1
df$between[df$letter == "B" & is.na(df$between)] <- 0  # Includes special cases
You can use the lead and lag functions to get the letters before and after, and then mutate as below:
library(dplyr)
df %>%
  mutate(letter_lag = lag(letter, 1),
         letter_lead = lead(letter, 1)) %>%
  mutate(between = case_when(letter_lag == "A" | letter_lead == "C" ~ 1,
                             letter_lag == "C" | letter_lead == "A" ~ 0,
                             TRUE ~ NA_real_)) %>%
  select(id, letter, between)
  id letter between
1  1      B       0
2  2      A      NA
3  3      B       1
4  4      B       1
5  5      C      NA
6  6      B       0
7  7      A      NA
8  8      B       1
9  9      C      NA

Conditionally removing duplicates

I have a dataset in which I need to conditionally remove duplicated rows based on values in another column.
Specifically, I need to delete any row where size = 0 only if SampleID is duplicated.
SampleID<-c("a", "a", "b", "b", "b", "c", "d", "d", "e")
size<-c(0, 1, 1, 2, 3, 0, 0, 1, 0)
data<-data.frame(SampleID, size)
I want to delete rows with:
SampleID size
       a    0
       d    0
And keep:
SampleID size
       a    1
       b    1
       b    2
       b    3
       c    0
       d    1
       e    0
Note: the actual dataset is very large, so I am not looking for a way to just remove a known row by row number.
In dplyr we can do this using group_by and filter:
library(dplyr)
data %>%
  group_by(SampleID) %>%
  filter(!(size == 0 & n() > 1))  # equivalently: filter(size != 0 | n() == 1)
#> # A tibble: 7 x 2
#> # Groups:   SampleID [5]
#>   SampleID  size
#>   <fct>    <dbl>
#> 1 a            1
#> 2 b            1
#> 3 b            2
#> 4 b            3
#> 5 c            0
#> 6 d            1
#> 7 e            0
Using the data.table framework: first transform your data set to a data.table:
require(data.table)
setDT(data)
Build a list of IDs for which rows may be deleted (those that have at least one non-zero size):
dropable_ids = unique(data[size != 0, SampleID])
Finally, keep rows that are not in the droppable list or that have a non-zero value:
data = data[!(SampleID %in% dropable_ids & size == 0), ]
Please note that not(a and b) is equivalent to (not a) or (not b), but the data.table framework doesn't handle | as well as &.
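For illustration, the equivalent positive form of the same filter:
# Keep-condition rewritten via De Morgan's law; same result as above
data = data[size != 0 | !(SampleID %in% dropable_ids), ]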
Hope it helps
A solution that works in base R without data.table and is easy to follow through for R starters:
#Find all duplicates
data$dup1 <- duplicated(data$SampleID)
data$dup2 <- duplicated(data$SampleID, fromLast = TRUE)
data$dup <- ifelse(data$dup1 == TRUE | data$dup2 == TRUE, 1, 0)
#Subset to relevant
data$drop <- ifelse(data$dup == 1 & data$size == 0, 1, 0)
data2 <- subset(data, drop == 0)
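The same idea can be written more compactly without the helper columns (a condensed variant, not from the original answer):
# Flag SampleIDs that occur more than once (duplicates in either direction),
# then drop their zero-size rows in one step, starting from the original data
dup <- duplicated(data$SampleID) | duplicated(data$SampleID, fromLast = TRUE)
data2 <- data[!(dup & data$size == 0), ]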

Create column with minimum (first) level of factors by group r

I have a data frame with a mixture of patient and appointment information. Each patient may have attended multiple appointments. Some patient information is collected at each appointment, with the result that some patient information is conflicting and some is missing.
I want to fill in the missing patient information for appointments where it wasn't recorded, based upon data recorded at other appointments. And (this is where I'm stuck) I want to take the 'minimum' level (earliest in the order of levels) recorded for factors that have conflicting information for the same patient. (In the example below the factor levels are in alphabetical order, but this isn't always the case.)
This is similar to this question, but I am using factors instead of characters, and I have multiple factors that I want the minimum for so can't filter by row.
eg. I have
df.have <- data.frame(
grp_id = rep(1:3, each = 2),
grpvar1 = factor(c("B", "A", "B", "C", NA, "A")),
grpvar2 = factor(c("a", "b", "c", NA, NA, "x")),
appt_id = 1:6)
I want
grp_id grpvar1 grpvar2 appt_id
     1       A       a       1
     1       A       a       2
     2       B       c       3
     2       B       c       4
     3       A       x       5
     3       A       x       6
or at least
grp_id grpvar1 grpvar1.1
     1       A         a
     2       B         c
     3       A         x
We can try with summarise_each. As we need the first level of 'grpvar' variables, we need to make sure that the unused levels are dropped before we do that (using droplevels).
df.have %>%
  group_by(grp_id) %>%
  summarise_each(funs(first(levels(droplevels(.)))), grpvar1:grpvar2)
#  grp_id grpvar1 grpvar2
#   <int>   <chr>   <chr>
#1      1       A       a
#2      2       B       c
#3      3       A       x
Or, if we use mutate_each, we get the first desired output:
df.have %>%
  group_by(grp_id) %>%
  mutate_each(funs(levels(droplevels(.))[1]), grpvar1:grpvar2)
#  grp_id grpvar1 grpvar2 appt_id
#   <int>   <chr>   <chr>   <int>
#1      1       A       a       1
#2      1       A       a       2
#3      2       B       c       3
#4      2       B       c       4
#5      3       A       x       5
#6      3       A       x       6
If we need the output as factor columns for the 'grpvar's:
df.have %>%
  group_by(grp_id) %>%
  mutate_each(funs(factor(levels(droplevels(.))[1])), grpvar1:grpvar2)
#  grp_id grpvar1 grpvar2 appt_id
#   <int>  <fctr>  <fctr>   <int>
#1      1       A       a       1
#2      1       A       a       2
#3      2       B       c       3
#4      2       B       c       4
#5      3       A       x       5
#6      3       A       x       6
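Note that summarise_each()/mutate_each() have since been deprecated in dplyr; an across() equivalent (assuming dplyr >= 1.0.0) keeps the same logic:
df.have %>%
  group_by(grp_id) %>%
  summarise(across(grpvar1:grpvar2, ~ levels(droplevels(.x))[1]))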
Or using data.table
library(data.table)
setDT(df.have)[, lapply(.SD, function(x) levels(droplevels(x))[1]),
               grp_id, .SDcols = grpvar1:grpvar2]
While revisiting this for a different project with a larger dataset I realised that it was much more efficient (though required more keystrokes) to convert the factors to numbers, take the minimum, then reconvert back to factors.
library(data.table)
library(dplyr)
set.seed(1)
n <- 100L
dat <- data.table(
  grp_id = rep(1:(n/10), each = 10),
  grpvar1 = factor(sample(c(LETTERS, NA), n, replace = TRUE), levels = LETTERS),
  grpvar2 = factor(sample(c(letters, NA), n, replace = TRUE), levels = letters),
  appt_id = 1:n)
cols <- c("grpvar1","grpvar2")
dplyr_fct <- function(data, cols) {
  data %>%
    group_by(grp_id) %>%
    mutate_each(funs(factor(levels(droplevels(.))[1])), one_of(cols))
}
dt_fct <- function(data, cols) {
  data[, lapply(.SD, function(x) levels(droplevels(x))[1]), grp_id, .SDcols = cols]
}
dt_nmbr <- function(data, cols) {
  dat_out <- copy(data)
  v_lvl <- lapply(dat_out[, .SD, .SDcols = cols], levels)
  # Convert factors to numeric
  for (col in cols) set(dat_out, j = col, value = as.numeric(dat_out[[col]]))
  # Select the minimum value per group
  dat_out[, (cols) := lapply(.SD, min, na.rm = TRUE), by = grp_id, .SDcols = cols]
  # Convert back to factor
  for (col in cols) set(dat_out, j = col,
                        value = factor(dat_out[[col]], levels = 1:length(v_lvl[[col]]),
                                       labels = v_lvl[[col]]))
  assign("dat_out", dat_out, envir = .GlobalEnv)
}
mbm <- microbenchmark::microbenchmark(
  mbm_dplyr_fct = dplyr_fct(dat, cols),
  mbm_dt_fct = dt_fct(dat, cols),
  mbm_dt_nmbr = dt_nmbr(dat, cols)
)
mbm
Unit: milliseconds
          expr       min        lq      mean    median        uq       max neval cld
 mbm_dplyr_fct 84.487484 85.829834 90.988740 87.015878 91.159178 120.22171   100   c
    mbm_dt_fct 56.768529 58.007094 60.988083 58.831850 60.269427  87.11799   100   b
   mbm_dt_nmbr  4.181538  4.406392  4.540248  4.557948  4.619757   6.04197   100   a

Sum of two Columns of Data Frame with NA Values

I have a data frame with some NA values. I need the sum of two of the columns. If a value is NA, I need to treat it as zero.
a  b c d
1  2 3 4
5 NA 7 8
Column e should be the sum of b and c:
e
5
7
I have tried a lot of things, and done two dozen searches with no luck. It seems like a simple problem. Any help would be appreciated!
dat$e <- rowSums(dat[, c("b", "c")], na.rm = TRUE)
dat
#   a  b c d e
# 1 1  2 3 4 5
# 2 5 NA 7 8 7
dplyr solution, taken from here:
library(dplyr)
dat %>%
  rowwise() %>%
  mutate(e = sum(b, c, na.rm = TRUE))
Here is another solution, with nested ifelse():
dat$e <- ifelse(is.na(dat$b) & is.na(dat$c), 0,
                ifelse(is.na(dat$b), dat$c, dat$b + dat$c))
# a b c d e
#1 1 2 3 4 5
#2 5 NA 7 8 7
Edit: here is another solution that uses with(), as suggested by @kasterma in the comments; this is much more readable and straightforward:
dat$e <- with(dat, ifelse(is.na(b) & is.na(c), 0, ifelse(is.na(b), 0 + c, b + c)))
If you want to keep NA when both columns have it, you can use:
Sample data:
dt <- data.table(x = sample(c(NA, 1, 2, 3), 100, replace = TRUE),
                 y = sample(c(NA, 1, 2, 3), 100, replace = TRUE))
Solution:
dt[, z := ifelse(is.na(x) & is.na(y), NA_real_, rowSums(.SD, na.rm = TRUE)),
   .SDcols = c("x", "y")]
(the data.table way)
I hope it helps.
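A terser data.table alternative with the same NA behaviour (not from the original answer) is fcoalesce(), available in recent data.table versions: x + y is NA whenever either input is NA, so falling back to x and then y yields the sum, the lone non-NA value, or NA when both are missing:
# Sum treating a single NA as zero, but keeping NA when both inputs are NA
dt[, z2 := fcoalesce(x + y, x, y)]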
In some cases you have a few columns that are not numeric; this approach handles those too.
Note that c_across() requires dplyr version 1.0.0 or later.
df <- data.frame(
  TEXT = c("text1", "text2"), a = c(1, 5), b = c(2, NA), c = c(3, 7), d = c(4, 8))
df2 <- df %>%
  rowwise() %>%
  mutate(e = sum(c_across(a:d), na.rm = TRUE))
# A tibble: 2 x 6
# Rowwise:
#   TEXT      a     b     c     d     e
#   <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 text1     1     2     3     4    10
# 2 text2     5    NA     7     8    20
