How to count occurrences of a specific value across multiple columns? - r

I want to count how many times a specific value occurs across multiple columns and store the count in a new column. My dataset has a lot of missing values, but the result should be NA only if the entire row consists of NAs. If possible, I would prefer something that works with dplyr pipelines.
Example dataset:
df <- data.frame(c1 = sample(1:4, 20, replace = TRUE),
c2 = sample(1:4, 20, replace = TRUE),
c3 = sample(1:4, 20, replace = TRUE),
c4 = sample(1:4, 20, replace = TRUE),
c5 = sample(1:4, 20, replace = TRUE))
for (i in 1:5) {
  df[sample(1:20, 1), sample(1:5, 1)] <- NA
  df[sample(1:20, 1), ] <- NA
}
c1 c2 c3 c4 c5
1 1 2 4 4 1
2 2 2 1 3 4
3 2 4 4 3 3
4 4 2 3 2 1
5 4 2 4 1 3
6 NA 1 2 4 4
7 3 NA 4 NA 4
9 1 3 3 2 2
I have tried with rowwise() and rowSums. Some non-working examples here:
# First attempt
df <- df %>%
rowwise() %>%
mutate(count2 = sum(c_across(c1:c5, ~.x %in% 2)))
# Second attempt
df <- df %>%
rowwise() %>%
mutate(count2 = sum(c_across(select(where(c1:c5 %in% 2)))))
# With rowSums
df <- df %>%
rowwise() %>%
mutate(count4 = rowSums(select(c1:c5 %in% 4), na.rm = TRUE))

How about this:
df <- data.frame(c1 = sample(1:4, 20, replace = TRUE),
c2 = sample(1:4, 20, replace = TRUE),
c3 = sample(1:4, 20, replace = TRUE),
c4 = sample(1:4, 20, replace = TRUE),
c5 = sample(1:4, 20, replace = TRUE))
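for (i in 1:5) {
  df[sample(1:20, 1), sample(1:5, 1)] <- NA
  df[sample(1:20, 1), ] <- NA
}
library(dplyr)
df %>%
  rowwise() %>%
  # count the 2s among the non-missing values of the row
  mutate(count2 = sum(na.omit(c_across(c1:c5)) == 2),
         # rows made up entirely of NA get NA instead of 0
         count2 = ifelse(all(is.na(c_across(c1:c5))), NA, count2))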
#> # A tibble: 20 × 6
#> # Rowwise:
#> c1 c2 c3 c4 c5 count2
#> <int> <int> <int> <int> <int> <int>
#> 2 2 2 3 4 2 3
#> 3 1 1 1 4 4 0
#> 4 2 3 3 2 4 2
#> 6 1 1 1 2 1 1
#> 7 3 3 2 3 4 1
#> 8 1 1 4 3 4 0
#> 11 2 3 3 4 1 1
#> 12 2 1 4 2 NA 2
#> 13 4 4 2 NA 2 2
#> 14 4 2 3 3 2 2
#> 15 1 3 4 2 2 2
#> 16 1 1 3 3 2 1
#> 17 1 1 1 4 4 0
#> 18 2 4 4 NA 1 1
#> 20 4 1 1 NA 4 0
Created on 2022-12-08 by the reprex package (v2.0.1)
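If rowwise() turns out to be slow on a larger data frame, the same logic can be sketched with vectorised row sums in base R. The data below is a small made-up example (not the random data from the question), and cols and m are names chosen here purely for illustration.

```r
# Hypothetical example data: the last row is entirely NA
df <- data.frame(c1 = c(2, NA, 1, NA),
                 c2 = c(2, 2, 3, NA),
                 c3 = c(4, NA, 1, NA))

cols <- c("c1", "c2", "c3")   # columns to search (assumed names)
m <- as.matrix(df[cols])

# Count the cells equal to 2 in each row, ignoring NAs
df$count2 <- rowSums(m == 2, na.rm = TRUE)

# Rows that contain nothing but NA get NA instead of 0
df$count2[rowSums(!is.na(m)) == 0] <- NA

print(df)
```

Because rowSums() works on the whole matrix at once, this avoids evaluating the expression once per row.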


How to Filter by group and move all values to new column if any value in any of the affected columns is greater than 5 in R

I have a data frame like this:
dt <- tibble(
TRIAL = c("A", "A", "A", "B", "B", "B", "C", "C", "C","D","D","D"),
RL = c(1, NA, 3, 1, 6, 3, 2, 3, 1, 0, 1.5, NA),
SL = c(6, 1.5, 1, 0, 0, 1, 1, 2, 0, 1, 1.5, NA),
HC = c(0, 1, 5, 6, 7, 8, 9, 3, 4, 5, 4, 2)
)
# A tibble: 12 x 4
TRIAL RL SL HC
<chr> <dbl> <dbl> <dbl>
1 A 1 6 0
2 A NA 1.5 1
3 A 3 1 5
4 B 1 0 6
5 B 6 0 7
6 B 3 1 8
7 C 2 1 9
8 C 3 2 3
9 C 1 0 4
10 D 0 1 5
11 D 1.5 1.5 4
12 D NA NA 2
I want to group the data frame by TRIAL and check the values in RL and SL within each group. If any value in either column is greater than 5, all RL and SL values for that group should be moved to RLCT and SLCT respectively.
# A tibble: 12 x 6
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0 1 6 NA NA
2 A 1 NA 1.5 NA NA
3 A 5 3 1 NA NA
4 B 6 1 0 NA NA
5 B 7 6 0 NA NA
6 B 8 3 1 NA NA
7 C 9 NA NA 1 3
8 C 3 NA NA 3 5
9 C 4 NA NA 1 1
10 D 5 NA NA 1 0
11 D 4 NA NA 1.5 1.5
12 D 2 NA NA NA NA
When I run the code below, I do not get the expected output:
dt0 <- dt %>%
mutate(RLCT = NA,
SLCT = NA) %>%
group_by(TRIAL) %>%
filter(!any(RL > 5.0 | SL > 5.0))
dt1 <- dt %>%
group_by(TRIAL) %>%
filter(any(RL > 5.0 | SL > 5.0)) %>%
mutate(RLCT = RL,
SLCT = SL) %>%
rbind(dt0, .) %>%
mutate(RL = ifelse(!is.na(RLCT), NA, RL),
SL = ifelse(!is.na(SLCT), NA, SL)) %>% arrange(TRIAL)
This is what I get
# A tibble: 9 x 6
# Groups: TRIAL [3]
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A NA NA 0 1 6
2 A NA NA 1 NA 1.5
3 A NA NA 5 3 1
4 B NA NA 6 1 0
5 B NA NA 7 6 0
6 B NA NA 8 3 1
7 C 2 1 9 NA NA
8 C 3 2 3 NA NA
9 C 1 0 4 NA NA
You can define a column to store the condition, and change RL and SL with ifelse inside across.
dt %>%
group_by(TRIAL) %>%
mutate(cond = any(RL > 5.0 | SL > 5.0, na.rm = TRUE),
across(c(RL, SL), ~ ifelse(cond, ., NA), .names = "{.col}CT"),
across(c(RL, SL), ~ ifelse(!cond, ., NA)),
cond = NULL)
# A tibble: 12 x 6
# Groups: TRIAL [4]
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A NA NA 0 1 6
2 A NA NA 1 NA 1.5
3 A NA NA 5 3 1
4 B NA NA 6 1 0
5 B NA NA 7 6 0
6 B NA NA 8 3 1
7 C 2 1 9 NA NA
8 C 3 2 3 NA NA
9 C 1 0 4 NA NA
10 D 0 1 5 NA NA
11 D 1.5 1.5 4 NA NA
12 D NA NA 2 NA NA
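Group D drops out of the question's attempt because any() returns NA when no comparison is TRUE and at least one is NA, and filter() discards rows whose condition is NA. A minimal base R illustration using group D's RL and SL values (rl_d, sl_d and the result names are made up for this sketch):

```r
# RL and SL values of group D from the question
rl_d <- c(0, 1.5, NA)
sl_d <- c(1, 1.5, NA)

# Without na.rm the unknown comparison makes the whole result unknown
without_na_rm <- any(rl_d > 5 | sl_d > 5)

# With na.rm = TRUE the NA comparison is ignored
with_na_rm <- any(rl_d > 5 | sl_d > 5, na.rm = TRUE)

print(without_na_rm)
print(with_na_rm)
```

This is why the answer passes na.rm = TRUE to any(): the condition then evaluates to FALSE for group D and the group is kept.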
With dplyr, you could use group_modify():
dt %>%
  group_by(TRIAL) %>%
  group_modify(~ {
    if(any(select(.x, c(RL, SL)) > 5, na.rm = TRUE)) {
      rename_with(.x, ~ paste0(.x, 'CT'), c(RL, SL))
    } else {
      .x
    }
  })
# A tibble: 12 × 6
# Groups: TRIAL [4]
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 6 0 NA NA
2 A NA 1.5 1 NA NA
3 A 3 1 5 NA NA
4 B 1 0 6 NA NA
5 B 6 0 7 NA NA
6 B 3 1 8 NA NA
7 C NA NA 9 2 1
8 C NA NA 3 3 2
9 C NA NA 4 1 0
10 D NA NA 5 0 1
11 D NA NA 4 1.5 1.5
12 D NA NA 2 NA NA

Group_by id and count the consecutive NA's and then restart counting when a new series of NA's is encountered

I have a dataframe like this:
df <- data_frame(id = c(rep('A', 10), rep('B', 10)),
value = c(1:3, rep(NA, 2), 1:2, rep(NA, 3), 1, rep(NA, 4), 1:3, rep(NA, 2)))
I need to count the number of consecutive NAs in the value column. The count needs to be grouped by id, and it needs to restart at 1 every time a new NA or a new series of NAs is encountered. The expected output should look like this:
df$expected_output <- c(rep(NA, 3), 1:2, rep(NA, 2), 1:3, NA, 1:4, rep(NA, 3), 1:2)
If anyone can give me a dplyr solution that would also be great :)
I've tried a few things, but nothing gives a sensible result. Thanks in advance!
A solution using dplyr and data.table.
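library(dplyr)
library(data.table)
df2 <- df %>%
  group_by(id) %>%
  mutate(info = rleid(value)) %>%   # run id: consecutive equal values (or NAs) share one id
  group_by(id, info) %>%
  mutate(expected_output = row_number()) %>%
  ungroup() %>%
  mutate(expected_output = ifelse(!is.na(value), NA, expected_output)) %>%
  select(-info)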
# # A tibble: 20 x 3
# id value expected_output
# <chr> <dbl> <int>
# 1 A 1 NA
# 2 A 2 NA
# 3 A 3 NA
# 4 A NA 1
# 5 A NA 2
# 6 A 1 NA
# 7 A 2 NA
# 8 A NA 1
# 9 A NA 2
# 10 A NA 3
# 11 B 1 NA
# 12 B NA 1
# 13 B NA 2
# 14 B NA 3
# 15 B NA 4
# 16 B 1 NA
# 17 B 2 NA
# 18 B 3 NA
# 19 B NA 1
# 20 B NA 2
We can use rle to get length of groups that are or are not na, and use purrr::map2 to apply seq if they are NA and get the growing count or just fill in with NA values using rep.
library(dplyr)
library(purrr)
count_na <- function(x) {
  r <- rle(is.na(x))
  consec <- map2(r$lengths, r$values, ~ if (.y) seq(.x) else rep(NA, .x))
  unlist(consec)
}
df %>%
  mutate(expected_output = count_na(value))
#> # A tibble: 20 × 3
#> id value expected_output
#> <chr> <dbl> <int>
#> 1 A 1 NA
#> 2 A 2 NA
#> 3 A 3 NA
#> 4 A NA 1
#> 5 A NA 2
#> 6 A 1 NA
#> 7 A 2 NA
#> 8 A NA 1
#> 9 A NA 2
#> 10 A NA 3
#> 11 B 1 NA
#> 12 B NA 1
#> 13 B NA 2
#> 14 B NA 3
#> 15 B NA 4
#> 16 B 1 NA
#> 17 B 2 NA
#> 18 B 3 NA
#> 19 B NA 1
#> 20 B NA 2
Here is a solution using rle:
x <- rle(is.na(df$value))
df$new[is.na(df$value)] <- sequence(x$lengths[x$values])
# A tibble: 20 x 3
id value new
<chr> <dbl> <int>
1 A 1 NA
2 A 2 NA
3 A 3 NA
4 A NA 1
5 A NA 2
6 A 1 NA
7 A 2 NA
8 A NA 1
9 A NA 2
10 A NA 3
11 B 1 NA
12 B NA 1
13 B NA 2
14 B NA 3
15 B NA 4
16 B 1 NA
17 B 2 NA
18 B 3 NA
19 B NA 1
20 B NA 2
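Note that running rle over the whole column ignores id, so a run of NAs that spans the boundary between two ids would keep counting. That does not happen in the example data, but a sketch that restarts per id could use ave() in base R. The small data frame and the helper name count_na_runs are made up for illustration.

```r
# Hypothetical data: the NA run crosses the boundary between id A and id B
df <- data.frame(
  id    = c("A", "A", "A", "B", "B", "B"),
  value = c(1, NA, NA, NA, 2, NA)
)

# Number the elements of every NA run 1, 2, ...; non-NA positions become NA
count_na_runs <- function(x) {
  miss <- is.na(x)
  runs <- rle(miss)
  out <- sequence(runs$lengths)   # 1..n within each run (NA runs and non-NA runs alike)
  out[!miss] <- NA_integer_       # keep the counter only where the value is missing
  out
}

# ave() applies the helper separately within each id
df$na_run <- ave(df$value, df$id, FUN = count_na_runs)

print(df)
```

Here the first row of id B gets 1 rather than 3, because the counter restarts with each id.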
Yet another solution:
df %>%
mutate(aux =data.table::rleid(value)) %>%
group_by(id, aux) %>%
mutate(eout = ifelse(is.na(value), row_number(), NA_real_)) %>%
ungroup %>% select(-aux)
#> # A tibble: 20 × 4
#> id value expected_output eout
#> <chr> <dbl> <int> <dbl>
#> 1 A 1 NA NA
#> 2 A 2 NA NA
#> 3 A 3 NA NA
#> 4 A NA 1 1
#> 5 A NA 2 2
#> 6 A 1 NA NA
#> 7 A 2 NA NA
#> 8 A NA 1 1
#> 9 A NA 2 2
#> 10 A NA 3 3
#> 11 B 1 NA NA
#> 12 B NA 1 1
#> 13 B NA 2 2
#> 14 B NA 3 3
#> 15 B NA 4 4
#> 16 B 1 NA NA
#> 17 B 2 NA NA
#> 18 B 3 NA NA
#> 19 B NA 1 1
#> 20 B NA 2 2

Calculate value based on multiple other values

I would like to make some new variables in R based on multiple (>100) other variables.
My dataset looks like this
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br
1 1 1 1 2 2 1 6 0 1 6 1
2 2 2 3 2 5 1 3
3 3 0 0 <NA> 4 1 0 0 <NA> 2 2
4 4 NA 1 2 2 NA 1 1 4
5 5 NA 4 2 3 5 NA 4 3 4 3
The variables diag_x_ais can take integers from 0-6, and diag_x_br can take integers between 1-6.
I would like to make 6 new variables corresponding to the 6 possible diag_x_br values, i.e. the new variables would be called br_1, br_2 ... br_6. These new variables shall then be filled with the maximum value of the corresponding diag_x_ais variables, i.e.
if diag_1_br, diag_2_br, and diag_4_br are all 3, then br_3 should take the maximum value of diag_1_ais, diag_2_ais, and diag_4_ais.
Please also see the example dataset below:
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6
1 1 1 1 2 2 1 6 0 1 6 1 2 NA NA NA NA 2
2 2 2 1 4 3 5 5 2 2 1 3 3 4 5 NA 2 NA
3 3 0 0 NA 4 1 0 0 NA 2 2 NA 4 NA NA NA NA
4 4 NA 1 2 2 NA 1 1 4 2 NA NA 2 NA NA
5 5 NA 4 2 3 5 NA 4 3 4 3 NA NA 5 4 NA NA
Hereafter, I would like a final variable which calculates the sum of the up to three largest br_x variables, example displayed below:
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6 sum3
1 1 1 1 2 2 1 6 0 1 6 1 2 NA NA NA NA 2 4
2 2 2 1 4 3 5 5 2 2 1 3 3 4 5 NA 2 NA 12
3 3 0 0 NA 4 1 0 0 NA 2 2 NA 4 NA NA NA NA 4
4 4 NA 1 2 2 NA 1 1 4 2 NA NA 2 NA NA 4
5 5 NA 4 2 3 5 NA 4 3 4 3 NA NA 5 4 NA NA 9
My actual dataset has 60 diag_x_ais variables and 60 diag_x_br variables and 4000 rows.
I hope that someone can help me do this in R. Thank you!
I think you could use the following solution. I made a slight modification so that we only sum the first 3 max values:
library(dplyr)
library(purrr)
df %>%
  bind_cols(as.data.frame(t(map_dfr(1:6, function(a) pmap_dfc(df, ~ {x <- c(...)[grepl("br", names(df))]
  inds <- which(x == a)
  if(length(inds) != 0) {
    y <- c(...)[grepl("ais", names(df))]
    max(y[inds])
  } else {
    NA
  }})))) %>%
    setNames(paste0("br", 1:6))) %>%
  rowwise() %>%
  mutate(sum = sum(sort(as.numeric(c_across(starts_with("br"))), decreasing = TRUE)[1:3], na.rm = TRUE)) %>%
  select(starts_with("br"), sum)
Resulting output
# A tibble: 5 x 7
# Rowwise:
br1 br2 br3 br4 br5 br6 sum
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2 NA NA NA NA 2 4
2 3 4 5 NA 2 NA 12
3 NA 4 NA NA NA NA 4
4 2 NA NA 2 NA NA 4
5 NA NA 5 4 NA NA 9
You could use some heavy data transformation, although it is most likely not very efficient on large datasets. There are some empty values, NA and 0 in your dataset. I didn't handle them (and replaced the empty values with NA to make importing easier).
library(dplyr)
library(tidyr)
data %>%
pivot_longer(-sub_id,
names_to = c("name", "cat"),
names_pattern = ".*_(\\d+)_(.*)") %>%
pivot_wider(names_from = "cat") %>%
group_by(sub_id, br) %>%
summarise(value = max(ais), .groups = "drop") %>%
filter(br %in% 1:6) %>%
group_by(sub_id) %>%
mutate(sum = sum(tail(sort(value), 3))) %>%
pivot_wider(names_from = br,
names_glue = "br_{br}") %>%
select(sub_id, paste0("br_", 1:6), sum)
This returns
# A tibble: 5 x 8
sub_id br_1 br_2 br_3 br_4 br_5 br_6 sum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 NA NA NA NA 2 4
2 2 3 4 5 NA 2 NA 12
3 3 NA 4 NA NA NA NA 4
4 4 2 NA NA 2 NA NA 4
5 5 NA NA 5 4 NA NA 9
Piping an additional right_join(data, by = "sub_id") gives you your example output (apart from the order of your columns).
I took an idea from this answer.
data <- structure(list(sub_id = c(1, 2, 3, 4, 5), diag_1_ais = c(1, 2,
0, NA, NA), diag_2_ais = c(1, 1, 0, NA, 4), diag_3_ais = c(2,
4, NA, 1, 2), diag_4_ais = c(2, 3, 4, 2, 3), diag_5_ais = c(1,
5, 1, 2, 5), diag_1_br = c(6, 5, 0, NA, NA), diag_2_br = c(0,
2, 0, NA, 4), diag_3_br = c(1, 2, NA, 1, 3), diag_4_br = c(6,
1, 2, 1, 4), diag_5_br = c(1, 3, 2, 4, 3)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
For the first part:
data <- data.frame(sub_id = c(1,2,3,4,5),
diag_1_ais = c(1,2,0,NA,NA),
diag_2_ais = c(1,1,0,NA,4),
diag_3_ais = c(2,4,NA,1,2),
diag_4_ais = c(2,3,4,2,3),
diag_5_ais = c(1,5,1,2,5),
diag_1_br = c(6,5,0,NA,NA),
diag_2_br = c(0,2,0,NA,4),
diag_3_br = c(1,2,NA,1,3),
diag_4_br = c(6,1,2,1,4),
diag_5_br = c(1,3,2,4,3))
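calc_br <- function(data, value, firstBr, lastBr) {
  br <- c()
  for (i in 1:nrow(data)){
    if (length(which(data[i,c(firstBr:lastBr)] %in% value))!=0){
      # +1 maps a position within the br columns to the matching ais column
      # (the ais columns start at column 2, right after sub_id)
      br <- c(br, c(max(data[i,which(data[i,c(firstBr:lastBr)] %in% value)+1])))
    }
    else {
      br <- c(br, c(NA))
    }
  }
  result <- br
  return(result)
}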
firstBr = 7
lastBr = 11
data$br_1 <- calc_br(data,1,firstBr,lastBr)
data$br_2 <- calc_br(data,2,firstBr,lastBr)
data$br_3 <- calc_br(data,3,firstBr,lastBr)
data$br_4 <- calc_br(data,4,firstBr,lastBr)
data$br_5 <- calc_br(data,5,firstBr,lastBr)
data$br_6 <- calc_br(data,6,firstBr,lastBr)
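This should yield the same results as in your example. For your full dataset you should only have to change firstBr and lastBr (to 62 and 121, I would guess, if sub_id is followed by the 60 diag_x_ais columns and then the 60 diag_x_br columns).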
For the second part this should do the trick:
br_sum <- c()
for (i in 1:nrow(data)){
  br_sum <- c(br_sum, sum(data[i,lastBr+tail(order(data[i,c((lastBr+1):(lastBr+6))], na.last = NA), 3)]))
}
data$br_sum <- br_sum
For completeness, here are my results:
sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br
1 1 1 1 2 2 1 6
2 2 2 1 4 3 5 5
3 3 0 0 NA 4 1 0
4 4 NA NA 1 2 2 NA
5 5 NA 4 2 3 5 NA
diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6 br_sum
1 0 1 6 1 2 NA NA NA NA 2 4
2 2 2 1 3 3 4 5 NA 2 NA 12
3 0 NA 2 2 NA 4 NA NA NA NA 4
4 NA 1 1 4 2 NA NA 2 NA NA 4
5 4 3 4 3 NA NA 5 4 NA NA 9
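For 60 column pairs and 4000 rows, a matrix-based variant may be easier to scale than a row-by-row loop. The following is only a sketch in base R: it assumes the diag_x_ais and diag_x_br columns appear in the same order, the helper max_or_na and the names ais, br, br_max and sum3 are made up here, and unlike max(y[inds]) it ignores NA ais values within a matching group.

```r
data <- data.frame(sub_id = 1:5,
                   diag_1_ais = c(1, 2, 0, NA, NA),
                   diag_2_ais = c(1, 1, 0, NA, 4),
                   diag_3_ais = c(2, 4, NA, 1, 2),
                   diag_4_ais = c(2, 3, 4, 2, 3),
                   diag_5_ais = c(1, 5, 1, 2, 5),
                   diag_1_br = c(6, 5, 0, NA, NA),
                   diag_2_br = c(0, 2, 0, NA, 4),
                   diag_3_br = c(1, 2, NA, 1, 3),
                   diag_4_br = c(6, 1, 2, 1, 4),
                   diag_5_br = c(1, 3, 2, 4, 3))

# Two matrices whose columns pair up by position (assumption: same ordering)
ais <- as.matrix(data[grep("_ais$", names(data))])
br  <- as.matrix(data[grep("_br$", names(data))])

# Maximum of a vector, or NA if it holds nothing but NA
max_or_na <- function(v) if (all(is.na(v))) NA_real_ else max(v, na.rm = TRUE)

# For each br value k: keep the ais values where br == k, then take row maxima
br_max <- sapply(1:6, function(k) {
  masked <- ifelse(br == k, ais, NA)
  apply(masked, 1, max_or_na)
})
colnames(br_max) <- paste0("br_", 1:6)

# Sum of the up to three largest br_k values per row
# (a row without any br_k value would give 0 here, not NA)
sum3 <- apply(br_max, 1, function(r) sum(head(sort(r, decreasing = TRUE), 3)))

result <- cbind(data, br_max, sum3 = sum3)
print(result[, c("sub_id", colnames(br_max), "sum3")])
```

Selecting the columns by name pattern rather than by position means firstBr and lastBr do not have to be adjusted for the full dataset.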

trying to calculate sum of row with dataframe having NA values

I am trying to sum the values in each row when at least one column has a value (and get NA when all of them are missing), but my code below does not work:
df$summ <- ifelse(is.na(c(df[,"x3"] & df[,"x4"])),NA,rowSums(df[,c("x3","x4")], na.rm=TRUE))
the output should be like
An alternative solution:
library(data.table)
setDT(df)[!(is.na(x3) & is.na(x4)), summ := rowSums(.SD, na.rm = TRUE)]
You can do :
df <- transform(df, summ = ifelse(is.na(x3) & is.na(x4), NA,
rowSums(df, na.rm = TRUE)))
# x3 x4 summ
#1 2 0 2
#3 3 NA 3
#4 5 6 11
#5 4 5 9
#6 6 6 12
#8 NA 0 0
#9 3 4 7
#10 3 2 5
In general for any number of columns :
cols <- c('x3', 'x4')
df <- transform(df, summ = ifelse(rowSums(is.na(df[cols])) == length(cols),
NA, rowSums(df, na.rm = TRUE)))
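The snippets above pass the whole data frame to rowSums(), which is fine as long as df only contains x3 and x4. If there are other numeric columns, a small helper restricted to the chosen columns might look like the sketch below. The function name row_sum_or_na, the extra column other and the data are made up for illustration.

```r
# Hypothetical data with an extra numeric column that must not be summed
df <- data.frame(x3 = c(2, NA, 3, NA),
                 x4 = c(0, NA, NA, 0),
                 other = c(10, 10, 10, 10))

# Row sums over the selected columns only; NA where all of them are missing
row_sum_or_na <- function(d, cols) {
  m <- as.matrix(d[cols])
  out <- rowSums(m, na.rm = TRUE)
  out[rowSums(is.na(m)) == length(cols)] <- NA
  out
}

df$summ <- row_sum_or_na(df, c("x3", "x4"))
print(df)
```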
Try the code below with rowSums + replace
df$summ <- replace(rowSums(df, na.rm = TRUE), rowSums(is.na(df)) == 2, NA)
which gives
> df
x3 x4 summ
1 2 0 2
3 3 NA 3
4 5 6 11
5 4 5 9
6 6 6 12
8 NA 0 0
9 3 4 7
10 3 2 5
This is not much different from already posted answers, however, it contains some useful functions:
df %>%
rowwise() %>%
mutate(Count = ifelse(all(is.na(c_across(everything()))), NA,
sum(c_across(everything()), na.rm = TRUE)))
# A tibble: 10 x 3
# Rowwise:
x3 x4 Count
<dbl> <dbl> <dbl>
1 2 0 2
3 3 NA 3
4 5 6 11
5 4 5 9
6 6 6 12
8 NA 0 0
9 3 4 7
10 3 2 5

Applying a conditional replace function over every cell in a data frame

I'm trying to subset a dataframe in R by checking if each value is present in a specific list and keeping it if it is. For instance in the following dataframe:
x <- data.frame(A = sample(1:5, 5),
B = sample(1:5, 5),
C = sample(1:5, 5))
1 2 2 1
2 3 3 3
3 1 4 4
4 4 5 2
5 5 1 5
How could I subset it to include only the values 1, 3 and 4, giving the following as a result:
1 1
2 3 3 3
3 4 4
4 4
5 1
It doesn't matter what happens to the missing values - they could be changed to NA if this is easier. From browsing similar questions it seems that lapply might do it, but as a novice I'm struggling to apply what I've seen to this scenario.
x <- data.frame(A = sample(1:5, 5),
B = sample(1:5, 5),
C = sample(1:5, 5))
# with lapply
keep_vals = c(1, 3, 4)
x[] = lapply(x, function(y) {
  y[! y %in% keep_vals] = NA
  y
})
# A B C
# 1 3 1 1
# 2 1 NA NA
# 3 NA NA 4
# 4 4 3 NA
# 5 NA 4 3
Or with a for loop:
set.seed(47) # reset data
x <- data.frame(A = sample(1:5, 5),
B = sample(1:5, 5),
C = sample(1:5, 5))
keep_vals = c(1, 3, 4)
for (i in 1:ncol(x)) {
  x[, i][!x[, i] %in% keep_vals] <- NA
}
# A B C
# 1 3 1 1
# 2 1 NA NA
# 3 NA NA 4
# 4 4 3 NA
# 5 NA 4 3
With dplyr
library(dplyr)
x %>% mutate_all(
  ~replace(., !. %in% keep_vals, NA)
)
# A B C
# 1 3 1 1
# 2 1 NA NA
# 3 NA NA 4
# 4 4 3 NA
# 5 NA 4 3
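The same replacement can also be sketched without a loop or lapply by indexing the data frame with a logical matrix. The data below is the fixed example from the question rather than a random sample, and keep_mask is a name made up for this sketch.

```r
# Fixed example data from the question
x <- data.frame(A = c(2, 3, 1, 4, 5),
                B = c(2, 3, 4, 5, 1),
                C = c(1, 3, 4, 2, 5))
keep_vals <- c(1, 3, 4)

# Logical matrix with the same shape as x: TRUE where the value should be kept
keep_mask <- sapply(x, function(col) col %in% keep_vals)

# Assigning through a logical matrix replaces every non-matching cell at once
x[!keep_mask] <- NA

print(x)
```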
Using dplyr::bind_rows:
do.call(dplyr::bind_rows, apply(x, 1, function(a) a[a %in% c(1, 3, 4)]))
# A tibble: 5 x 3
<int> <int> <int>
1 4 NA NA
2 1 1 1
3 3 3 NA
4 NA NA 4
5 NA 4 3
Collapsing each row to the matching numbers, and adjusting each length to ncol. (Assuming you want to "left-align" your numbers, as shown in your expected output.)
d <- setNames(as.data.frame(t(apply(d, 1, function(x) {
  x <- x[x %in% c(1, 3, 4)]
  `length<-`(x, ncol(d))
}))), names(d))
# A B C
# 1 1 NA NA
# 2 3 3 3
# 3 1 4 4
# 4 4 NA NA
# 5 NA NA NA
Since apply returns a matrix with one column per input row, we transpose it with t() and use setNames to restore the column names.
Note, that I changed line 5 of your example data so that it doesn't contain any of the matching numbers, so as not to make it too easy.
d <- read.table(text="A B C
1 2 2 1
2 3 3 3
3 1 4 4
4 4 5 2
5 5 2 5", header=TRUE)
