Counting non-NA values per row including certain columns in R [duplicate] - r

How can I count non-NA values per row including columns from "b" to "c"?
library(tidyverse)
d = tibble(a = c("Tom", "Mary", "Ben", "Jane", "Lucas", "Mark"),
b = c(NA, 3, 6, NA, 5, NA),
c = c(2, NA, 6, 7, 1, 9))
d
Output should have an extra column with values as follows: 1, 1, 2, 1, 2, 1
Tidyverse solutions are especially appreciated!

A possible solution:
library(dplyr)
d %>%
mutate(result = rowSums(!is.na(across(b:c))))
#> # A tibble: 6 × 4
#> a b c result
#> <chr> <dbl> <dbl> <dbl>
#> 1 Tom NA 2 1
#> 2 Mary 3 NA 1
#> 3 Ben 6 6 2
#> 4 Jane NA 7 1
#> 5 Lucas 5 1 2
#> 6 Mark NA 9 1

Using select instead of across:
library(tidyverse)
d = tibble(a = c("Tom", "Mary", "Ben", "Jane", "Lucas", "Mark"),
b = c(NA, 3, 6, NA, 5, NA),
c = c(2, NA, 6, 7, 1, 9))
d %>%
mutate(output = rowSums(!is.na(select(., -a))))
#> # A tibble: 6 × 4
#> a b c output
#> <chr> <dbl> <dbl> <dbl>
#> 1 Tom NA 2 1
#> 2 Mary 3 NA 1
#> 3 Ben 6 6 2
#> 4 Jane NA 7 1
#> 5 Lucas 5 1 2
#> 6 Mark NA 9 1
Created on 2022-07-15 by the reprex package (v2.0.1)
base R option:
library(tidyverse)
d = tibble(a = c("Tom", "Mary", "Ben", "Jane", "Lucas", "Mark"),
b = c(NA, 3, 6, NA, 5, NA),
c = c(2, NA, 6, 7, 1, 9))
d$output <- apply(d[2:3], 1, function(x) sum(!is.na(x)))
d
#> # A tibble: 6 × 4
#> a b c output
#> <chr> <dbl> <dbl> <int>
#> 1 Tom NA 2 1
#> 2 Mary 3 NA 1
#> 3 Ben 6 6 2
#> 4 Jane NA 7 1
#> 5 Lucas 5 1 2
#> 6 Mark NA 9 1
Created on 2022-07-15 by the reprex package (v2.0.1)
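The apply() call above walks the rows one at a time. As a minimal sketch of a vectorised alternative, the block below looks up the column range "b" to "c" by name and lets rowSums() count the non-NA cells. It uses a plain data.frame so it runs without extra packages; the helper variable cols is only illustrative.

```r
# Same data as in the question, as a plain data.frame
d <- data.frame(a = c("Tom", "Mary", "Ben", "Jane", "Lucas", "Mark"),
                b = c(NA, 3, 6, NA, 5, NA),
                c = c(2, NA, 6, 7, 1, 9))

# Resolve the range "b" to "c" to column names (illustrative helper variable)
cols <- names(d)[which(names(d) == "b"):which(names(d) == "c")]

# is.na() on a data frame returns a logical matrix; rowSums() counts the TRUEs
d$output <- rowSums(!is.na(d[cols]))
d
```

Selecting by name rather than by position (d[2:3]) keeps the code working if columns are reordered.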

Related

How to add rows so that each group has equal number of rows?

I have a data frame with unequal numbers of rows per group, see df in the example below. I would like to add rows containing the group name and NAs in all other columns so that there is an equal number of rows per group like in df.desired. The rows should be added after the last row from the respective group.
Example:
df = data.frame(group = c("A","A","A","A","B","B","B","C","C"),
col1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
col2 = c(12, 13, 14, 15, 21, 22, 23, 31, 32))
> df
group col1 col2
1 A 1 12
2 A 1 13
3 A 1 14
4 A 1 15
5 B 2 21
6 B 2 22
7 B 2 23
8 C 3 31
9 C 3 32
df.desired = data.frame(group = c("A","A","A","A","B","B","B","B","C","C","C","C"),
col1 = c(1, 1, 1, 1, 2, 2, 2, NA, 3, 3, NA, NA),
col2 = c(12, 13, 14, 15, 21, 22, 23, NA, 31, 32, NA, NA))
> df.desired
group col1 col2
1 A 1 12
2 A 1 13
3 A 1 14
4 A 1 15
5 B 2 21
6 B 2 22
7 B 2 23
8 B NA NA
9 C 3 31
10 C 3 32
11 C NA NA
12 C NA NA
I know how to do this with a loop but that would be super slow and I would prefer to use dplyr if possible. Does anyone have any ideas?
How about this:
library(dplyr)
df = data.frame(group = c("A","A","A","A","B","B","B","C","C"),
col1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
col2 = c(12, 13, 14, 15, 21, 22, 23, 31, 32))
maxgp <- max(table(df$group))
df %>%
group_by(group) %>%
summarise(across(everything(), ~c(.x, rep(NA, maxgp-n()))))
#> `summarise()` has grouped output by 'group'. You can override using the
#> `.groups` argument.
#> # A tibble: 12 × 3
#> # Groups: group [3]
#> group col1 col2
#> <chr> <dbl> <dbl>
#> 1 A 1 12
#> 2 A 1 13
#> 3 A 1 14
#> 4 A 1 15
#> 5 B 2 21
#> 6 B 2 22
#> 7 B 2 23
#> 8 B NA NA
#> 9 C 3 31
#> 10 C 3 32
#> 11 C NA NA
#> 12 C NA NA
Created on 2023-02-01 by the reprex package (v2.0.1)
You can create row numbers for each group and then use tidyr::complete:
library(dplyr)
df %>%
group_by(group) %>%
mutate(id = row_number()) %>%
ungroup() %>%
tidyr::complete(group, id) %>%
select(-id)
# # A tibble: 12 × 3
# group col1 col2
# <chr> <dbl> <dbl>
# 1 A 1 12
# 2 A 1 13
# 3 A 1 14
# 4 A 1 15
# 5 B 2 21
# 6 B 2 22
# 7 B 2 23
# 8 B NA NA
# 9 C 3 31
# 10 C 3 32
# 11 C NA NA
# 12 C NA NA
Update (from @Maël's answer)
As of dplyr 1.1.0, per-operation grouping with .by/by is supported for mutate(), summarise(), filter(), and the slice() family. The code can be simplified to:
df %>%
mutate(id = row_number(), .by = group) %>%
tidyr::complete(group, id) %>%
select(-id)
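For comparison, here is a minimal base R sketch of the same padding idea, without a loop over rows. It relies on the fact that indexing a data frame past its last row returns rows of NA; the names max_n and padded are illustrative and not taken from the answers above.

```r
df <- data.frame(group = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
                 col1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
                 col2 = c(12, 13, 14, 15, 21, 22, 23, 31, 32))

# Size of the largest group; every group is padded up to this length
max_n <- max(table(df$group))

padded <- do.call(rbind, lapply(split(df, df$group), function(g) {
  out <- g[seq_len(max_n), ]  # positions beyond nrow(g) become all-NA rows
  out$group <- g$group[1]     # restore the group label on the padded rows
  out
}))
rownames(padded) <- NULL
padded
```

The padded rows land after the last real row of each group because each piece is extended before the pieces are bound back together.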

How can I group_by NAs as well?

With this code:
datanew <- df_bsp %>%
group_by(id_mother) %>%
dplyr::mutate(Family = cur_group_id())
I got this output:
datanew <- data.frame(id_pers=c(1, 2, 3, 4, 5, 6),
id_mother=c(11, 11, 11, 12, 12, 12),
FAMILY=c(1,1,1,2,2,2))
Now the problem:
There are also some NAs in the id_mother variable.
It looks like this:
datanew_1 <- data.frame(id_pers=c(1, 2, 3, 4, 5, 6, 7, 8, 9,10),
id_mother=c(11, 11, 11, 12, 12, 12, NA, NA, NA, NA))
How can I get this result?
datanew <- data.frame(id_pers=c(1, 2, 3, 4, 5, 6, 7, 8, 9,10),
id_mother=c(11, 11, 11, 12, 12, 12, NA, NA, NA, NA),
FAMILY=c(1,1,1,2,2,2,3,4,5,6))
Thanks!
If you want each NA value treated as its own group, give each one a unique value:
datanew_1 %>%
mutate(
id_mother_na = ifelse(
is.na(id_mother),
paste("g", "na", cumsum(is.na(id_mother))),
paste("g", id_mother)
)
) %>%
group_by(id_mother_na) %>%
mutate(Family = cur_group_id()) %>%
ungroup()
# # A tibble: 10 × 4
# id_pers id_mother id_mother_na Family
# <dbl> <dbl> <chr> <int>
# 1 1 11 g 11 1
# 2 2 11 g 11 1
# 3 3 11 g 11 1
# 4 4 12 g 12 2
# 5 5 12 g 12 2
# 6 6 12 g 12 2
# 7 7 NA g na 1 3
# 8 8 NA g na 2 4
# 9 9 NA g na 3 5
# 10 10 NA g na 4 6
Along the same lines as the other answer, you need to make a unique group for each NA:
library(tidyverse)
make_grp <- function(x){
coalesce(x, cumsum(is.na(x))) + (max(x, na.rm = TRUE)*is.na(x))
}
datanew_1 |>
group_by(grp = make_grp(id_mother)) |>
mutate(Family = cur_group_id()) |>
ungroup() |>
select(-grp)
#> # A tibble: 10 x 3
#> id_pers id_mother Family
#> <dbl> <dbl> <int>
#> 1 1 11 1
#> 2 2 11 1
#> 3 3 11 1
#> 4 4 12 2
#> 5 5 12 2
#> 6 6 12 2
#> 7 7 NA 3
#> 8 8 NA 4
#> 9 9 NA 5
#> 10 10 NA 6
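The same idea (give every NA its own key, then number the keys) can be sketched in base R. The variable key is illustrative. Note one difference, stated here as an assumption about what is wanted: match() numbers the groups in order of first appearance, whereas cur_group_id() numbers them by sorted key, so the two agree on this data but may differ on unsorted input.

```r
datanew_1 <- data.frame(id_pers = 1:10,
                        id_mother = c(11, 11, 11, 12, 12, 12, NA, NA, NA, NA))

# Build a grouping key: real ids keep their value, each NA gets a unique label
na_pos <- is.na(datanew_1$id_mother)
key <- ifelse(na_pos,
              paste0("na_", cumsum(na_pos)),
              as.character(datanew_1$id_mother))

# Number the keys in order of first appearance
datanew_1$Family <- match(key, unique(key))
datanew_1
```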

Subset data based on variable prefix

I have a large dataset in which the answers to one question are distributed among various columns. Columns that belong together share the same prefix. How can I create a subset of the data for each question, based on the prefix?
Here is an example dataset. I would like an efficient and easily adaptable solution that creates a dataset containing only the values of question one, two, or three.
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8), Question1a = c(1,
1, NA, NA, 1, 1, 1, NA), Question1b = c(NA, 1, NA, 1, NA, 1,
NA, 1), Question1c = c(1, 1, NA, NA, 1, NA, NA, NA), Question2a = c(1,
NA, NA, NA, 1, 1, NA, NA), Question2b = c(NA, 1, NA, 1, NA, NA,
NA, NA), Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA), Question3b = c(NA,
NA, 1, 1, NA, NA, NA, NA)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L))
You can use sapply and a function:
list_data <- sapply(c("Question1", "Question2", "Question3"),
function(x) df[startsWith(names(df),x)], simplify = FALSE)
This will store everything in a list. To get the individual data sets in the global environment as individual objects, use:
list2env(list_data, globalenv())
Output
# $Question1
# # A tibble: 8 × 3
# Question1a Question1b Question1c
# <dbl> <dbl> <dbl>
# 1 1 NA 1
# 2 1 1 1
# 3 NA NA NA
# 4 NA 1 NA
# 5 1 NA 1
# 6 1 1 NA
# 7 1 NA NA
# 8 NA 1 NA
#
# $Question2
# # A tibble: 8 × 2
# Question2a Question2b
# <dbl> <dbl>
# 1 1 NA
# 2 NA 1
# 3 NA NA
# 4 NA 1
# 5 1 NA
# 6 1 NA
# 7 NA NA
# 8 NA NA
#
# $Question3
# # A tibble: 8 × 2
# Question3a Question3b
# <dbl> <dbl>
# 1 NA NA
# 2 NA NA
# 3 NA 1
# 4 NA 1
# 5 1 NA
# 6 1 NA
# 7 1 NA
# 8 NA NA
I believe the underlying question is about data-formats.
Here are a few:
library(tidyverse)
structure(
list(
ID = c(1, 2, 3, 4, 5, 6, 7, 8),
Question1a = c(1,
1, NA, NA, 1, 1, 1, NA),
Question1b = c(NA, 1, NA, 1, NA, 1,
NA, 1),
Question1c = c(1, 1, NA, NA, 1, NA, NA, NA),
Question2a = c(1,
NA, NA, NA, 1, 1, NA, NA),
Question2b = c(NA, 1, NA, 1, NA, NA,
NA, NA),
Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA),
Question3b = c(NA,
NA, 1, 1, NA, NA, NA, NA)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -8L)
) -> square_df
square_df %>%
pivot_longer(-ID,
names_to = c("Question", "Item"),
names_pattern = "Question(\\d+)(\\w+)") ->
long_df
long_df
#> # A tibble: 56 × 4
#> ID Question Item value
#> <dbl> <chr> <chr> <dbl>
#> 1 1 1 a 1
#> 2 1 1 b NA
#> 3 1 1 c 1
#> 4 1 2 a 1
#> 5 1 2 b NA
#> 6 1 3 a NA
#> 7 1 3 b NA
#> 8 2 1 a 1
#> 9 2 1 b 1
#> 10 2 1 c 1
#> # … with 46 more rows
long_df %>%
drop_na(value) ->
sparse_long_df
sparse_long_df
#> # A tibble: 22 × 4
#> ID Question Item value
#> <dbl> <chr> <chr> <dbl>
#> 1 1 1 a 1
#> 2 1 1 c 1
#> 3 1 2 a 1
#> 4 2 1 a 1
#> 5 2 1 b 1
#> 6 2 1 c 1
#> 7 2 2 b 1
#> 8 3 3 b 1
#> 9 4 1 b 1
#> 10 4 2 b 1
#> # … with 12 more rows
sparse_long_df %>%
nest(data = c(ID, Item, value)) ->
nested_long_df
nested_long_df
#> # A tibble: 3 × 2
#> Question data
#> <chr> <list>
#> 1 1 <tibble [12 × 3]>
#> 2 2 <tibble [5 × 3]>
#> 3 3 <tibble [5 × 3]>
Created on 2022-05-12 by the reprex package (v2.0.1)
You could also use map to store each data frame in a list, e.g.
library(dplyr)
library(purrr)
# 3 = number of questions
map(c(1:3),
function(x){
quest <- paste0("Question",x)
select(df, ID, starts_with(quest))
})
Output:
[[1]]
# A tibble: 8 x 4
ID Question1a Question1b Question1c
<dbl> <dbl> <dbl> <dbl>
1 1 1 NA 1
2 2 1 1 1
3 3 NA NA NA
4 4 NA 1 NA
5 5 1 NA 1
6 6 1 1 NA
7 7 1 NA NA
8 8 NA 1 NA
[[2]]
# A tibble: 8 x 3
ID Question2a Question2b
<dbl> <dbl> <dbl>
1 1 1 NA
2 2 NA 1
3 3 NA NA
4 4 NA 1
5 5 1 NA
6 6 1 NA
7 7 NA NA
8 8 NA NA
[[3]]
# A tibble: 8 x 3
ID Question3a Question3b
<dbl> <dbl> <dbl>
1 1 NA NA
2 2 NA NA
3 3 NA 1
4 4 NA 1
5 5 1 NA
6 6 1 NA
7 7 1 NA
8 8 NA NA
I found a really intuitive solution using the dplyr package with the select and starts_with functions. Alternatively, you can replace starts_with with contains if you are identifying the related variables not by a prefix but by some other common feature.
Q1 <- Survey %>%
select(
starts_with("Question1")
)
Q2 <- Survey %>%
select(
starts_with("Question2")
)
Q3 <- Survey %>%
select(
starts_with("Question3")
)
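Writing one select() per question does not scale well when there are many questions. As a rough base R sketch, the prefixes can be derived from the column names themselves, assuming every question column is named as a prefix followed by a lowercase letter suffix (as in the example data). The names question_cols, prefix and by_question are illustrative.

```r
df <- data.frame(
  ID = 1:8,
  Question1a = c(1, 1, NA, NA, 1, 1, 1, NA),
  Question1b = c(NA, 1, NA, 1, NA, 1, NA, 1),
  Question1c = c(1, 1, NA, NA, 1, NA, NA, NA),
  Question2a = c(1, NA, NA, NA, 1, 1, NA, NA),
  Question2b = c(NA, 1, NA, 1, NA, NA, NA, NA),
  Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA),
  Question3b = c(NA, NA, 1, 1, NA, NA, NA, NA)
)

# Strip the trailing letter(s) to recover the prefix of each question column
question_cols <- setdiff(names(df), "ID")
prefix <- sub("[a-z]+$", "", question_cols)

# Group the column names by prefix, then pull ID plus those columns from df
by_question <- lapply(split(question_cols, prefix),
                      function(cols) df[c("ID", cols)])
names(by_question)
```

Each element of by_question is a data frame for one question, so by_question$Question2 plays the role of Q2 above, with the ID column kept alongside.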

Flagging an id when similar columns have different values in R

I need to flag an id when it has different grade values in the grade columns. Here is what my sample dataset looks like:
df <- data.frame(id = c(11,22,33,44,55),
grade.1 = c(3,4,5,6,7),
grade.2 = c(3,4,5,NA,7),
grade.3 = c(4,4,6,5,7),
grade.4 = c(NA,NA,NA, 5, 7 ))
df$Grade <- paste0(df$grade.1, df$grade.2, df$grade.3, df$grade.4)
> df
id grade.1 grade.2 grade.3 grade.4 Grade
1 11 3 3 4 NA 334NA
2 22 4 4 4 NA 444NA
3 33 5 5 6 NA 556NA
4 44 6 NA 5 5 6NA55
5 55 7 7 7 7 7777
When an id has different grade values in grade.1, grade.2, grade.3 and grade.4, that row needs to be flagged. An NA in those columns does not affect the flagging.
In other words, if the Grade column at the end contains any differing numbers, that id needs to be flagged.
My desired output should look like this:
> df
id grade.1 grade.2 grade.3 grade.4 flag
1 11 3 3 4 NA flagged
2 22 4 4 4 NA Not_flagged
3 33 5 5 6 NA flagged
4 44 6 NA 5 5 flagged
5 55 7 7 7 7 Not_flagged
Any ideas?
Thanks!
A base R solution using rle, omitting NA values:
df$flag <- apply(df[,2:5], 1, function(x)
ifelse(length(rle(x[!is.na(x)])$lengths)==1, "not_flagged", "flagged"))
df
id grade.1 grade.2 grade.3 grade.4 flag
1 11 3 3 4 NA flagged
2 22 4 4 4 NA not_flagged
3 33 5 5 6 NA flagged
4 44 6 NA 5 5 flagged
5 55 7 7 7 7 not_flagged
Data
df <- structure(list(id = c(11, 22, 33, 44, 55), grade.1 = c(3, 4,
5, 6, 7), grade.2 = c(3, 4, 5, NA, 7), grade.3 = c(4, 4, 6, 5,
7), grade.4 = c(NA, NA, NA, 5, 7)), class = "data.frame", row.names = c(NA,
-5L))
Here is a base R approach.
df$flag <- c("not_flagged", "flagged")[
apply(df[-1L], 1L, \(x) length( (ux <- unique(x))[!is.na(ux)] ) > 1L) + 1L
]
Output
> df
id grade.1 grade.2 grade.3 grade.4 flag
1 11 3 3 4 NA flagged
2 22 4 4 4 NA not_flagged
3 33 5 5 6 NA flagged
4 44 6 NA 5 5 flagged
5 55 7 7 7 7 not_flagged
A possible solution:
library(tidyverse)
df <- data.frame(id = c(11,22,33,44,55),
grade.1 = c(3,4,5,6,7),
grade.2 = c(3,4,5,NA,7),
grade.3 = c(4,4,6,5,7),
grade.4 = c(NA,NA,NA, 5, 7 ))
df %>%
rowwise %>%
mutate(flag = if_else(length(unique(na.omit(c_across(2:5)))) == 1,
"not-flagged", "flagged")) %>% ungroup
#> # A tibble: 5 × 6
#> id grade.1 grade.2 grade.3 grade.4 flag
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 11 3 3 4 NA flagged
#> 2 22 4 4 4 NA not-flagged
#> 3 33 5 5 6 NA flagged
#> 4 44 6 NA 5 5 flagged
#> 5 55 7 7 7 7 not-flagged
Using data.table::uniqueN, which counts the number of unique elements in a vector (and allows for NA removal):
library(data.table)
library(dplyr)
df %>%
rowwise %>%
mutate(flag = if_else(uniqueN(c_across(2:5), na.rm = TRUE) == 1,
"not-flagged", "flagged")) %>% ungroup
n_distinct from dplyr is very helpful. Here is a version using a combination of pivot_longer and pivot_wider:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(
-c(id, Grade),
names_to = "name",
values_to = "value"
) %>%
group_by(id) %>%
mutate(flag = ifelse(n_distinct(value, na.rm = TRUE)==1, "Not flagged", "Flagged")) %>%
pivot_wider(
names_from = name,
values_from = value
)
id Grade flag grade.1 grade.2 grade.3 grade.4
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 11 334NA Flagged 3 3 4 NA
2 22 444NA Not flagged 4 4 4 NA
3 33 556NA Flagged 5 5 6 NA
4 44 6NA55 Flagged 6 NA 5 5
5 55 7777 Not flagged 7 7 7 7
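Since the grades are numeric, a row has more than one distinct value exactly when its smallest and largest non-NA grades differ. The block below is a minimal base R sketch of that shortcut using pmin() and pmax(), which avoids row-wise iteration. It assumes all grade columns are numeric and share the "grade." prefix; a row consisting only of NA would get an NA flag.

```r
df <- data.frame(id = c(11, 22, 33, 44, 55),
                 grade.1 = c(3, 4, 5, 6, 7),
                 grade.2 = c(3, 4, 5, NA, 7),
                 grade.3 = c(4, 4, 6, 5, 7),
                 grade.4 = c(NA, NA, NA, 5, 7))

# Grade columns as an unnamed list, so each column is one argument to pmin/pmax
grades <- unname(as.list(df[startsWith(names(df), "grade.")]))

# Element-wise minimum and maximum across the columns, ignoring NA
row_min <- do.call(pmin, c(grades, list(na.rm = TRUE)))
row_max <- do.call(pmax, c(grades, list(na.rm = TRUE)))

df$flag <- ifelse(row_min == row_max, "not_flagged", "flagged")
df
```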

R Removing row if three or more values are NA

I feel like I should be able to do this with filter or subset, but can't figure out how.
How do I remove a row if three or more of the cells in that row are "NA"?
So in this dataset, rows with titles 1A-C2 and 3A-C2 would be removed.
my_data <- data.frame(Title = c("1A-C2", "1D-T2", "1F-T1", "1E-C2", "3A-C2", "3F-T2"),
Group1 = c(NA, 10, 2, 9, NA, 4), Group2 = c(1, 3, 6, 1, NA, 3), Group3=c(NA, 3, 3, 8, NA, 4), Group4=c(NA, NA, 4, 5, 1, 7), Group5=c(1, 4, 3, 3, 9, NA), Group6=c(NA, 4, 5, 6, 1, NA))
Thank you!!
With Base R,
my_data[rowSums(is.na(my_data))<3,]
gives,
Title Group1 Group2 Group3 Group4 Group5 Group6
2 1D-T2 10 3 3 NA 4 4
3 1F-T1 2 6 3 4 3 5
4 1E-C2 9 1 8 5 3 6
6 3F-T2 4 3 4 7 NA NA
Using dplyr :
library(dplyr)
my_data %>%
rowwise() %>%
filter(sum(is.na(c_across(starts_with('Group')))) < 3)
# Title Group1 Group2 Group3 Group4 Group5 Group6
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1D-T2 10 3 3 NA 4 4
#2 1F-T1 2 6 3 4 3 5
#3 1E-C2 9 1 8 5 3 6
#4 3F-T2 4 3 4 7 NA NA
In base R, we can use Reduce with is.na
subset(my_data, Reduce(`+`, lapply(my_data[startsWith(names(my_data), "Group")],
is.na)) < 3)
# Title Group1 Group2 Group3 Group4 Group5 Group6
#2 1D-T2 10 3 3 NA 4 4
#3 1F-T1 2 6 3 4 3 5
#4 1E-C2 9 1 8 5 3 6
#6 3F-T2 4 3 4 7 NA NA
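If the threshold or the set of columns is likely to change, the rowSums() approach can be wrapped in a small helper. This is only a sketch: the function name drop_sparse_rows and its arguments are hypothetical, not part of any package.

```r
my_data <- data.frame(
  Title  = c("1A-C2", "1D-T2", "1F-T1", "1E-C2", "3A-C2", "3F-T2"),
  Group1 = c(NA, 10, 2, 9, NA, 4),
  Group2 = c(1, 3, 6, 1, NA, 3),
  Group3 = c(NA, 3, 3, 8, NA, 4),
  Group4 = c(NA, NA, 4, 5, 1, 7),
  Group5 = c(1, 4, 3, 3, 9, NA),
  Group6 = c(NA, 4, 5, 6, 1, NA)
)

# Hypothetical helper: keep rows with at most `max_na` missing values in `cols`
drop_sparse_rows <- function(data, cols, max_na = 2) {
  data[rowSums(is.na(data[cols])) <= max_na, , drop = FALSE]
}

group_cols <- names(my_data)[startsWith(names(my_data), "Group")]
kept <- drop_sparse_rows(my_data, group_cols)
kept
```

Restricting the count to the Group columns means an NA in Title would not count toward the threshold, unlike rowSums(is.na(my_data)) over the whole data frame.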
