Total multiple columns at once with n() in R

I don't know if I am missing something very obvious here or not, but I am having trouble getting the desired result format for a count. These are all yes, no or NA answers to a question.
My data looks a bit like:
df <- read.table(text = " A B C
0 NA 1
1 0 NA
0 1 0
NA NA 1
0 0 1
1 0 NA
0 1 NA ", header = TRUE)
library(dplyr)
df %>%
  group_by(A, B, C) %>%
  summarise(count = n())
I have also tried
count(A, B, C)
with exactly the same results.
I want to count the total number of 0, 1 and NA responses for each column (rows and columns are interchangeable here; it's the response-by-column layout of the table that I'm after):
Response 0 1 NA
Column A 4 2 1
Column B 3 2 2
Column C 1 3 3
What I am getting instead is
A B C n
0 0 1 1
0 1 0 1
0 1 NA 1
0 NA 1 1
1 0 NA 2
NA NA 1 1
In other words, it's counting the number of times each unique combination of ABC appears. How do I get it to focus on counting the columns and not the rows?

You can apply the table() function across the columns:
df <- read.table(text = " A B C
0 NA 1
1 0 NA
0 1 0
NA NA 1
0 0 1
1 0 NA
0 1 NA ", header = TRUE)
t(apply(df, 2, table, useNA = "always"))
#> 0 1 <NA>
#> A 4 2 1
#> B 3 2 2
#> C 1 3 3
Created on 2022-08-05 by the reprex package (v2.0.1)
An alternative tidyverse solution would be the following:
library(tidyverse)
df <- read.table(text = " A B C
0 NA 1
1 0 NA
0 1 0
NA NA 1
0 0 1
1 0 NA
0 1 NA ", header = TRUE)
x <- df %>%
  mutate(across(everything(), ~ fct_explicit_na(as.factor(.x), "NA"))) %>%
  map(., ~ c(table(.x))) %>%
  bind_rows(.id = 'Response')
x
#> # A tibble: 3 × 4
#> Response `0` `1` `NA`
#> <chr> <int> <int> <int>
#> 1 A 4 2 1
#> 2 B 3 2 2
#> 3 C 1 3 3
Created on 2022-08-05 by the reprex package (v2.0.1)

It needs some data reshaping if you want to use dplyr::n().
First transform df into a "long" format; you'll get a two-column data frame, which we can group by everything (group_by_all()) before doing your summarize(n()). Finally, transform it back to a "wide" format.
library(tidyverse)
df %>%
  pivot_longer(everything(), names_to = "Response") %>%
  group_by_all() %>%
  summarize(n = n()) %>%
  pivot_wider(names_from = "value", values_from = "n")
# A tibble: 3 × 4
# Groups: Response [3]
Response `0` `1` `NA`
<chr> <int> <int> <int>
1 A 4 2 1
2 B 3 2 2
3 C 1 3 3

Using table() and stack() (stack() reshapes the data frame into a two-column long format: values plus ind, the originating column name), you can try:
t(table(stack(df), useNA = "ifany"))
Output
values
ind 0 1 <NA>
A 4 2 1
B 3 2 2
C 1 3 3

If you find yourself wanting to apply the same operation to multiple columns in your data, it could be a hint that you should reshape your data to a "longer" format, such that each row represents a single observation. Once your data is in this format you can use table() to get the summary you're after:
library(tidyverse)
df_tidy <- df %>%
  pivot_longer(cols = everything(), names_to = "group", values_to = "response")
print(df_tidy)
#> # A tibble: 21 x 2
#> group response
#> <chr> <int>
#> 1 A 0
#> 2 B NA
#> 3 C 1
#> 4 A 1
#> 5 B 0
#> 6 C NA
#> 7 A 0
#> 8 B 1
#> 9 C 0
#> 10 A NA
#> # … with 11 more rows
table(df_tidy, useNA = "ifany")
#> response
#> group 0 1 <NA>
#> A 4 2 1
#> B 3 2 2
#> C 1 3 3
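If you'd rather stay in dplyr/tidyr and literally rely on n(), the same summary can be built from df_tidy above (a sketch on my part, assuming dplyr and tidyr are loaded; count() is shorthand for group_by() plus summarise(n = n())):
library(dplyr)
library(tidyr)

df_tidy %>%
  count(group, response) %>%   # count() = group_by() + summarise(n = n())
  pivot_wider(names_from = response, values_from = n)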

Related

Filling in NA values with a sequence by group

I have a data set that looks like the following:
ID Count
1 0
1 1
1 NA
1 2
1 NA
1 NA
1 NA
1 NA
1 NA
2 0
2 NA
2 NA
2 3
The first row of each ID starts with 0. I want to fill the NA values with sequential values by group. If there are values before and after the NAs, I need to fill them with a sequence counting up to the first value after the NAs. If there are no values after the NAs, I need to fill them with a sequence counting up from the last value before them. The output should look like the following:
ID Count
1 0
1 1
1 1
1 2
1 3
1 4
1 5
1 6
1 7
2 0
2 1
2 2
2 3
This is a little complicated, but I think this does what you want. I left all my helper columns in so you can see what's happening, but the non-needed columns can all be dropped at the end.
library(dplyr)
library(vctrs)
df %>%
  group_by(ID, na_group = cumsum(!is.na(Count))) %>%
  # reverse position of each NA within its run of NAs (0 for non-NA rows)
  mutate(n_til_non_na = ifelse(is.na(Count), rev(row_number()), 0L)) %>%
  group_by(ID) %>%
  mutate(
    # carry the last known value forward / the next known value backward
    fill_down = vec_fill_missing(Count, direction = "down"),
    fill_up = vec_fill_missing(Count, direction = "up"),
    result = case_when(
      is.na(fill_up) ~ fill_down + cumsum(is.na(fill_up)),  # trailing NAs: count up from the last value
      is.na(Count) ~ fill_up - n_til_non_na,                # interior NAs: count up towards the next value
      TRUE ~ Count
    )
  ) %>%
  ungroup()
# # A tibble: 13 × 7
# ID Count na_group n_til_non_na fill_down fill_up result
# <int> <int> <int> <int> <int> <int> <int>
# 1 1 0 1 0 0 0 0
# 2 1 1 2 0 1 1 1
# 3 1 NA 2 1 1 2 1
# 4 1 2 3 0 2 2 2
# 5 1 NA 3 5 2 NA 3
# 6 1 NA 3 4 2 NA 4
# 7 1 NA 3 3 2 NA 5
# 8 1 NA 3 2 2 NA 6
# 9 1 NA 3 1 2 NA 7
# 10 2 0 4 0 0 0 0
# 11 2 NA 4 2 0 3 1
# 12 2 NA 4 1 0 3 2
# 13 2 3 5 0 3 3 3
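To actually drop the helpers at the end, a select() can be appended to the pipeline above (a minimal sketch on my part; result_df is just a hypothetical name for that pipeline's output):
library(dplyr)

# result_df stands for the output of the pipeline above (hypothetical name)
result_df %>%
  select(ID, Count = result)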
Using this sample data:
df = read.table(text = 'ID Count
1 0
1 1
1 NA
1 2
1 NA
1 NA
1 NA
1 NA
1 NA
2 0
2 NA
2 NA
2 3', header = T)
You can use purrr::accumulate(), first backward, then forward. While going backward, replace each missing value with the value that follows it minus 1, counting down; then, moving forward, replace the remaining missing values with the value before them plus 1, counting up.
library(dplyr)
library(purrr)
df %>%
  group_by(ID) %>%
  mutate(
    Count = accumulate(
      Count,
      \(x, y) ifelse(is.na(x), y - 1, x),
      .dir = "backward"
    ),
    Count = accumulate(
      Count,
      \(x, y) ifelse(is.na(y), x + 1, y)
    )
  ) %>%
  ungroup()
# A tibble: 13 × 2
ID Count
<dbl> <dbl>
1 1 0
2 1 1
3 1 1
4 1 2
5 1 3
6 1 4
7 1 5
8 1 6
9 1 7
10 2 0
11 2 1
12 2 2
13 2 3
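A small portability note (my addition, not part of the original answer): the \(x, y) lambda shorthand needs R >= 4.1; on older versions the purrr formula syntax expresses the same two passes:
library(dplyr)
library(purrr)

df %>%
  group_by(ID) %>%
  mutate(
    Count = accumulate(Count, ~ ifelse(is.na(.x), .y - 1, .x), .dir = "backward"),
    Count = accumulate(Count, ~ ifelse(is.na(.y), .x + 1, .y))
  ) %>%
  ungroup()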

How to count the cumulative number of subgroupings using dplyr?

I'm trying to count the cumulative number of subgroupings using dplyr, as illustrated and explained in the image below. I am trying to solve for Flag2 in the image. Any recommendations for how to do this?
Beneath the image I also have reproducible code that builds all columns up through Flag1, which works fine.
Reproducible code:
library(dplyr)
myData <- data.frame(
  Element = c("A","B","B","B","B","B","A","C","C","C","C","C"),
  Group = c(0,0,1,1,2,2,0,3,3,0,0,0)
)
excelCopy <- myData %>%
  group_by(Element) %>%
  mutate(Element_Count = row_number()) %>%
  mutate(Flag1 = case_when(Group > 0 ~ match(Group, unique(Group)), TRUE ~ Element_Count)) %>%
  ungroup()
print.data.frame(excelCopy)
Using row_number() within each Element/Group pair and setting the Group == 0 rows to NA:
library(dplyr)
excelCopy |>
  group_by(Element, Group) |>
  mutate(Flag2 = ifelse(Group == 0, NA, row_number()))
Element Group Element_Count Flag1 Flag2
<chr> <dbl> <int> <int> <int>
1 A 0 1 1 NA
2 B 0 1 1 NA
3 B 1 2 2 1
4 B 1 3 2 2
5 B 2 4 3 1
6 B 2 5 3 2
7 A 0 2 2 NA
8 C 3 1 1 1
9 C 3 2 1 2
10 C 0 3 3 NA
11 C 0 4 4 NA
12 C 0 5 5 NA

Insert NA if previous variable is 1, grouped by unique id

I have clinical data that records a patient at three time points with a disease outcome indicated by a binary variable. It looks something like this
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,0,1,NA,0,0,0,1,0)
Data <- data.frame(patientid = patientid, time = time, outcome = outcome)
Data
I want to create an outcome incidence variable. There are two conditions:
When a patient is coded 1, I would like every later time point for that patient to be NA.
If a patient has NA at time point 1, then time points 2 and 3 should also be NA.
For the example data it should now look like this:
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,0,1,NA,NA,NA,0,1,NA)
Data <- data.frame(patientid = patientid, time = time, outcome = outcome)
Data
Not the smartest way, but you could use dplyr:
library(dplyr)
Data %>%
  group_by(patientid) %>%
  mutate(outcome = outcome * (1 - na_if(lag(cumsum(outcome) > 0, default = 0), 1))) %>%
  ungroup()
which returns
# A tibble: 9 x 3
patientid time outcome
<dbl> <dbl> <dbl>
1 100 1 0
2 100 2 0
3 100 3 1
4 101 1 NA
5 101 2 NA
6 101 3 NA
7 102 1 0
8 102 2 1
9 102 3 NA
We can group_by() patientid, then use purrr::accumulate() and ifelse() here:
library(purrr)
library(dplyr)
Data %>%
  group_by(patientid) %>%
  mutate(outcome = accumulate(outcome, ~ ifelse(.x == 1 | is.na(.x), NA, .y)))
# A tibble: 9 x 3
# Groups: patientid [3]
patientid time outcome
<dbl> <dbl> <dbl>
1 100 1 0
2 100 2 0
3 100 3 1
4 101 1 NA
5 101 2 NA
6 101 3 NA
7 102 1 0
8 102 2 1
9 102 3 NA
You can create a function and apply it for each patientid.
library(dplyr)
return_outcome <- function(x) {
  if (is.na(first(x))) return(NA)
  else {
    val <- max(which(x == 1))
    if (length(val) && val < length(x))
      x[(val + 1):length(x)] <- NA
  }
  x
}
Data %>%
  group_by(patientid) %>%
  mutate(result = return_outcome(outcome)) %>%
  ungroup()
# patientid time outcome result
# <dbl> <dbl> <dbl> <dbl>
#1 100 1 0 0
#2 100 2 0 0
#3 100 3 1 1
#4 101 1 NA NA
#5 101 2 0 NA
#6 101 3 0 NA
#7 102 1 0 0
#8 102 2 1 1
#9 102 3 0 NA
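For completeness, a vectorised sketch of the same rule (my addition, not one of the original answers; it relies only on first(), lag(), cumsum(), coalesce() and case_when() from dplyr):
library(dplyr)

Data %>%
  group_by(patientid) %>%
  mutate(result = case_when(
    is.na(first(outcome)) ~ NA_real_,                               # rule 2: NA at time 1 blanks the whole patient
    lag(cumsum(coalesce(outcome, 0)), default = 0) > 0 ~ NA_real_,  # rule 1: everything after the first 1 becomes NA
    TRUE ~ outcome
  )) %>%
  ungroup()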

Return the column name of the second largest value of a row

df = data.frame(ID = c(1,2,3,4,5), a = c(0,2,0,1,0),
                b = c(0,3,2,NA,0), c = c(0,4,NA,NA,1),
                d = c(2,5,4,NA,1))
maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
df <- df %>% mutate(second_largest = apply(.[2:5], 1, function(x) names(x)[maxn(2)(x)]))
I used the R code above to obtain the column name of the second largest value among a, b, c, d. For ID = 4, because there are missing values for b, c and d, the name of the second largest value should be NA. However, the code returns b. How should I remove the missing values?
One more approach:
df = data.frame(ID = c(1,2,3,4,5), a = c(0,2,0,1,0),
                b = c(0,3,2,NA,0), c = c(0,4,NA,NA,1),
                d = c(2,5,4,NA,1))
library(dplyr, warn.conflicts = F)
df %>% group_by(ID) %>% rowwise() %>%
  mutate(name = {x <- c_across(everything());
    if (sum(!is.na(x)) >= 2) tail(head(names(cur_data())[order(x, decreasing = T)], 2), 1) else NA})
#> # A tibble: 5 x 6
#> # Rowwise: ID
#> ID a b c d name
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 1 0 0 0 2 a
#> 2 2 2 3 4 5 c
#> 3 3 0 2 NA 4 b
#> 4 4 1 NA NA NA <NA>
#> 5 5 0 0 1 1 d
If you only need to do it for a subset of the columns instead:
df %>% group_by(ID) %>% rowwise() %>%
  mutate(name = {x <- c_across(c('a', 'c'));
    if (sum(!is.na(x)) >= 2) tail(head(c('a', 'c')[order(x, decreasing = T)], 2), 1) else NA})
# A tibble: 5 x 6
# Rowwise: ID
ID a b c d name
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 0 0 0 2 c
2 2 2 3 4 5 a
3 3 0 2 NA 4 NA
4 4 1 NA NA NA NA
5 5 0 0 1 1 a
I think you can use the following solution. I tested some possible configurations of numbers and it worked:
library(dplyr)
library(purrr)
df %>%
  mutate(Name = pmap_chr(., ~ {
    x <- c(...)[-1]
    if (sum(is.na(x)) >= 3) {
      NA
    } else {
      ind <- which(x == max(x[!is.na(x)]))
      if (length(ind) > 1) {
        colnames(df[-1])[ind[2]]
      } else {
        colnames(df[-1])[which(x == sort(x)[length(sort(x)) - 1])][1]
      }
    }
  }))
ID a b c d Name
1 1 0 0 0 2 a
2 2 2 3 4 5 c
3 3 0 2 NA 4 b
4 4 1 NA NA NA <NA>
5 5 0 0 1 1 d
We can change the function to -
maxn <- function(n) function(x) order(x, decreasing = TRUE)[!is.na(x)][n]
The code will then work with your approach -
library(dplyr)
df %>%
  mutate(second_largest = apply(.[2:5], 1, function(x) names(x)[maxn(2)(x)]))
# ID a b c d second_largest
#1 1 0 0 0 2 a
#2 2 2 3 4 5 c
#3 3 0 2 NA 4 b
#4 4 1 NA NA NA <NA>
#5 5 0 0 1 1 d
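A long-format sketch of the same idea (my addition, not one of the answers above; it assumes df is the original five-column data frame from the question): pivot to long, drop the NAs, sort within each ID, and take the second name when at least two values remain.
library(dplyr)
library(tidyr)

df %>%
  pivot_longer(-ID, names_to = "col", values_to = "val") %>%
  filter(!is.na(val)) %>%
  arrange(ID, desc(val)) %>%
  group_by(ID) %>%
  summarise(second_largest = if (n() >= 2) col[2] else NA_character_)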

R - Grouping rows by matching value then adding rows to matching columns in another data frame

I am trying to add values from one data frame (ex2) to an existing data frame (ex1) based on two different columns. As you can see, there is an ID column in both data frames. But in ex2, each column of ex1 is represented by a different row instead of a column. For each matching ID, I want to add the result from ex2$result to the matching row in ex1 under the appropriate column heading (if ex2$alpha[i] = a then ex2$result[i] gets added to ex1$a[z] where ex2$id[i] = ex1$id[z]). Another complication is that not all of the columns from ex1 will have an alpha value in ex2, so those should be set to NA.
ex1 <- data.frame(
  id = c(1:20),
  a = c(rep(1,5), rep(0,5), rep(NA,10)),
  b = c(rep(c(1,0),5), rep(NA,10)),
  c = c(rep(c(0,1),5), rep(NA,10)),
  d = c(rep(0,5), rep(1,5), rep(NA,10))
)
ex2 <- data.frame(
  id = c(rep(11,3), rep(12,3), rep(13,3),
         rep(14,2), rep(15,2),
         rep(16,4), rep(17,4), rep(18,4), rep(19,4), rep(20,4)),
  alpha = c(rep(c('a','b','d'),3), rep(c('a','b'),2),
            rep(c('a','b','c','d'),5)),
  result = c(rep(c(0,1,1),11))
)
Thanks for your help!
I believe the snippet below does what you want it to do. But it is hard to know from your toy data whether it is feasible to write out the columns a to d in the mutate statements. There is surely a more clever, programmatic way to approach this problem.
ex1 <- data.frame(
  id = c(1:20),
  a = c(rep(1,5), rep(0,5), rep(NA,10)),
  b = c(rep(c(1,0),5), rep(NA,10)),
  c = c(rep(c(0,1),5), rep(NA,10)),
  d = c(rep(0,5), rep(1,5), rep(NA,10))
)
ex2 <- data.frame(
  id = c(rep(11,3), rep(12,3), rep(13,3),
         rep(14,2), rep(15,2),
         rep(16,4), rep(17,4), rep(18,4), rep(19,4), rep(20,4)),
  alpha = c(rep(c('a','b','d'),3), rep(c('a','b'),2),
            rep(c('a','b','c','d'),5)),
  result = c(rep(c(0,1,1),11))
)
library(tidyverse)
ex_2_wide <- pivot_wider(ex2, id_cols = id, names_from = alpha, values_from = result)
joined <- full_join(ex1, ex_2_wide, by = c("id" = "id")) %>%
  mutate(a = coalesce(a.x, a.y)) %>%
  mutate(b = coalesce(b.x, b.y)) %>%
  mutate(c = coalesce(c.x, c.y)) %>%
  mutate(d = coalesce(d.x, d.y)) %>%
  select(-(a.x:c.y))
joined
#> id a b c d
#> 1 1 1 1 0 0
#> 2 2 1 0 1 0
#> 3 3 1 1 0 0
#> 4 4 1 0 1 0
#> 5 5 1 1 0 0
#> 6 6 0 0 1 1
#> 7 7 0 1 0 1
#> 8 8 0 0 1 1
#> 9 9 0 1 0 1
#> 10 10 0 0 1 1
#> 11 11 0 1 NA 1
#> 12 12 0 1 NA 1
#> 13 13 0 1 NA 1
#> 14 14 0 1 NA NA
#> 15 15 1 0 NA NA
#> 16 16 1 1 0 1
#> 17 17 1 0 1 1
#> 18 18 0 1 1 0
#> 19 19 1 1 0 1
#> 20 20 1 0 1 1
Created on 2021-01-07 by the reprex package (v0.3.0)
EDIT:
If we turn the problem around (first make long tables, then join and coalesce, then pivot back to wide), only a single merging step is needed, no matter how many columns you have.
library(tidyverse)
ex1_long <- pivot_longer(ex1, cols = a:d, names_to = "alpha")
joined <- full_join(ex1_long, ex2, by = c("id" = "id", "alpha" = "alpha")) %>%
  mutate(value = coalesce(value, result)) %>%
  select(-result) %>%
  pivot_wider(id_cols = id, names_from = alpha, values_from = value)
joined
#> # A tibble: 20 x 5
#> id a b c d
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 0 0
#> 2 2 1 0 1 0
#> 3 3 1 1 0 0
#> 4 4 1 0 1 0
#> 5 5 1 1 0 0
#> 6 6 0 0 1 1
#> 7 7 0 1 0 1
#> 8 8 0 0 1 1
#> 9 9 0 1 0 1
#> 10 10 0 0 1 1
#> 11 11 0 1 NA 1
#> 12 12 0 1 NA 1
#> 13 13 0 1 NA 1
#> 14 14 0 1 NA NA
#> 15 15 1 0 NA NA
#> 16 16 1 1 0 1
#> 17 17 1 0 1 1
#> 18 18 0 1 1 0
#> 19 19 1 1 0 1
#> 20 20 1 0 1 1
Created on 2021-01-07 by the reprex package (v0.3.0)
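A further follow-up (my addition, assuming dplyr >= 1.0.0): dplyr::rows_patch() was built for this kind of patching. It overwrites only the NA cells of ex1 with matching values from the widened ex2, keyed by id:
library(dplyr)
library(tidyr)

ex2_wide <- pivot_wider(ex2, id_cols = id, names_from = alpha, values_from = result)

# fill NA cells in ex1 from rows of ex2_wide that share the same id
patched <- rows_patch(ex1, ex2_wide, by = "id")
patched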
