How to summarise a categorical variable with missing data? - r

I'm trying to perform a group_by summarise on a categorical variable, frailty score. The data is structured such that there are multiple observations for each subject, some of which contain missing data e.g.
Subject Frailty
1 Managing well
1 NA
1 NA
2 NA
2 NA
2 Vulnerable
3 NA
3 NA
3 NA
I would like the data to be summarised so that a frailty description appears if there is one available, and NA if not e.g.
Subject Frailty
1 Managing well
2 Vulnerable
3 NA
I tried the following two approaches which both returned errors:
Mode <- function(x) {
ux <- na.omit(unique(x[!is.na(x)]))
tab <- tabulate(match(x, ux)); ux[tab == max(tab)]
}
data %>%
group_by(Subject) %>%
summarise(frailty = Mode(frailty)) %>%
Error: Expecting a single value: [extent=2].
condense <- function(x){unique(x[!is.na(x)])}
data %>%
group_by(subject) %>%
summarise(frailty = condense(frailty))
Error: Column frailty must be length 1 (a summary value), not 0

One solution involving dplyr could be:
df %>%
group_by(Subject) %>%
slice(which.min(is.na(Frailty)))
Subject Frailty
<int> <chr>
1 1 Managing_well
2 2 Vulnerable
3 3 <NA>
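A minimal base-R sketch of why which.min(is.na(...)) selects the intended row. The vectors x and y are made-up illustrations, not taken from the question:

```r
# is.na() returns a logical vector; which.min() treats FALSE as 0 and TRUE as 1,
# so it returns the position of the first non-missing value.
x <- c(NA, NA, "Vulnerable")
first_x <- which.min(is.na(x))

# When every value is NA, all entries are TRUE and which.min() falls back to
# position 1, so slice() keeps one (NA) row instead of dropping the group.
y <- c(NA, NA, NA)
first_y <- which.min(is.na(y))
```

This is why Subject 3 still appears in the result with a missing Frailty value.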

If there is only a single non-NA element per group, then after grouping by 'Subject', extract the first non-NA element
library(dplyr)
data %>%
group_by(Subject) %>%
summarise(Frailty = Frailty[which(!is.na(Frailty))[1]])
# A tibble: 3 x 2
# Subject Frailty
# <int> <chr>
#1 1 Managing well
#2 2 Vulnerable
#3 3 <NA>
If there is more than one unique non-NA element, we can either paste them together (as below) or return them as a list
data %>%
group_by(Subject) %>%
summarise(Frailty = na_if(toString(unique(na.omit(Frailty))), ""))
# A tibble: 3 x 2
# Subject Frailty
# <int> <chr>
#1 1 Managing well
#2 2 Vulnerable
#3 3 <NA>
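The list alternative mentioned above is not shown in the answer. A possible sketch, assuming the same example data and a list-column as the desired result (the name res is only illustrative):

```r
library(dplyr)

# Example data mirroring the question
data <- data.frame(
  Subject = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  Frailty = c("Managing well", NA, NA, NA, NA, "Vulnerable", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Wrapping the result in list() yields exactly one list element per group,
# so summarise() accepts it even when a group has zero or several values.
res <- data %>%
  group_by(Subject) %>%
  summarise(Frailty = list(unique(Frailty[!is.na(Frailty)])), .groups = "drop")
```

Each element of res$Frailty is then a character vector; a subject with no recorded score gets an empty vector rather than NA.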
data
data <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L
), Frailty = c("Managing well", NA, NA, NA, NA, "Vulnerable",
NA, NA, NA)), class = "data.frame", row.names = c(NA, -9L))

Related

Coalescing multiple chunks of columns with the same suffix in names (R)

I have a dataset with various "chunks" of columns with different prefixes, but the same suffix:
ID A034 B034 C034 D034 A099 B099 A123 B123 ...
1 NA 1 NA NA NA 3 1 NA ...
2 2 NA NA NA 2 NA NA 2 ...
3 NA NA 2 NA NA 2 1 NA ...
The number of columns within each "chunk" also varies. Is there any way (other than manually, which is what I have been painstakingly doing with coalesce(!!! select(., contains("XXX")))) to automatically coalesce by chunk based on the shared suffix? That is, the result should resemble
ID 034 099 123 ...
1 1 3 1 ...
2 2 2 2 ...
3 2 2 1 ...
I'm not sure how to begin doing something like this, so any suggestions would be very helpful.
We reshape the data into 'long' format with pivot_longer, group by 'ID', and then loop across the other columns, applying na.omit to remove the NA elements (this assumes there is only one non-NA value per column within each group)
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -ID, names_to = ".value",
names_pattern = "[A-Z](\\d+)") %>%
group_by(ID) %>%
summarise(across(everything(), na.omit), .groups = 'drop')
-output
# A tibble: 3 x 4
ID `034` `099` `123`
<int> <int> <int> <int>
1 1 1 3 1
2 2 2 2 2
3 3 2 2 1
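To see what the names_pattern regular expression captures, here is a small base-R sketch that applies the same capture group with sub(). The vector nms simply repeats the column names from the question, and the added anchors ^ and $ are an assumption for stricter matching:

```r
nms <- c("A034", "B034", "C034", "D034", "A099", "B099", "A123", "B123")

# Keep only the captured digits, i.e. the shared suffix of each chunk
suffix <- sub("^[A-Z](\\d+)$", "\\1", nms)

# Number of columns in each chunk
chunk_sizes <- table(suffix)
```

Columns that share a captured suffix end up in the same output column when ".value" is used in names_to.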
Or, to be safe, use complete.cases to create a logical vector marking the non-NA elements, and extract the first one (assuming we need only a single non-NA value; if the number of non-NA values differs, we may need to return a list)
df1 %>%
pivot_longer(cols = -ID, names_to = ".value",
names_pattern = "[A-Z](\\d+)") %>%
group_by(ID) %>%
summarise(across(everything(), ~ .[complete.cases(.)][1]))
data
df1 <- structure(list(ID = 1:3, A034 = c(NA, 2L, NA), B034 = c(1L, NA,
NA), C034 = c(NA, NA, 2L), D034 = c(NA, NA, NA), A099 = c(NA,
2L, NA), B099 = c(3L, NA, 2L), A123 = c(1L, NA, 1L), B123 = c(NA,
2L, NA)), class = "data.frame", row.names = c(NA, -3L))
One more approach:
library(tidyverse)
split(names(df1)[-1], gsub('^\\D*(\\d+)$', '\\1', names(df1)[-1])) %>% map(~df1[c('ID', .x)]) %>%
imap(~ .x %>% group_by(ID) %>% rowwise %>% transmute(!!.y := first(na.omit(c_across(everything())))) %>% ungroup) %>%
reduce(left_join, by = 'ID')
#> # A tibble: 3 x 4
#> ID `034` `099` `123`
#> <int> <int> <int> <int>
#> 1 1 1 3 1
#> 2 2 2 2 2
#> 3 3 2 2 1
Created on 2021-06-20 by the reprex package (v2.0.0)
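For comparison, a base-R sketch of the same idea without reshaping: split the column names by suffix and coalesce each chunk row-wise. The helper coalesce_cols is hypothetical (it is not part of any package), and the approach assumes every non-ID column name is one letter followed by the suffix:

```r
df1 <- data.frame(
  ID = 1:3,
  A034 = c(NA, 2L, NA), B034 = c(1L, NA, NA), C034 = c(NA, NA, 2L),
  D034 = c(NA, NA, NA),
  A099 = c(NA, 2L, NA), B099 = c(3L, NA, 2L),
  A123 = c(1L, NA, 1L), B123 = c(NA, 2L, NA)
)

# Hypothetical helper: element-wise "first non-NA" across a list of vectors
coalesce_cols <- function(cols) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), cols)
}

value_cols <- setdiff(names(df1), "ID")
suffix <- sub("^[A-Z]", "", value_cols)

# One list element per suffix, each holding that chunk's column names
chunks <- split(value_cols, suffix)

out <- data.frame(ID = df1$ID)
for (s in names(chunks)) {
  out[[s]] <- coalesce_cols(df1[chunks[[s]]])
}
```

Unlike the na.omit version, this takes the first non-NA value from left to right within each chunk, so it does not fail when a row has more than one non-NA value.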

R function to paste information from different rows with a common column? [duplicate]

This question already has an answer here:
dplyr::first() to choose first non NA value
(1 answer)
Closed 2 years ago.
I understand we can use the dplyr function coalesce() to unite different columns, but is there such function to unite rows?
I am struggling with a confusing incomplete/doubled dataframe with duplicate rows for the same id, but with different columns filled. E.g.
id sex age source
12 M NA 1
12 NA 3 1
13 NA 2 2
13 NA NA NA
13 F 2 NA
and I am trying to achieve:
id sex age source
12 M 3 1
13 F 2 2
You can try:
library(dplyr)
library(tidyr) # fill() comes from tidyr
#Data
df <- structure(list(id = c(12L, 12L, 13L, 13L, 13L), sex = structure(c(2L,
NA, NA, NA, 1L), .Label = c("F", "M"), class = "factor"), age = c(NA,
3L, 2L, NA, 2L), source = c(1L, 1L, 2L, NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
df %>%
group_by(id) %>%
fill(everything(), .direction = "down") %>%
fill(everything(), .direction = "up") %>%
slice(1)
# A tibble: 2 x 4
# Groups: id [2]
id sex age source
<int> <fct> <int> <int>
1 12 M 3 1
2 13 F 2 2
As mentioned by @A5C1D2H2I1M1N2O1R2T1, you can select the first non-NA value in each group. This can be done using dplyr:
library(dplyr)
df %>% group_by(id) %>% summarise(across(everything(), ~ na.omit(.x)[1]))
# A tibble: 2 x 4
# id sex age source
# <int> <fct> <int> <int>
#1 12 M 3 1
#2 13 F 2 2
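The na.omit(x)[1] idiom works because indexing past the end of a vector returns NA. A small sketch, where first_non_na is a hypothetical wrapper name used only for illustration:

```r
# Hypothetical wrapper around the idiom used in the answers
first_non_na <- function(x) na.omit(x)[1]

a <- first_non_na(c(NA, 3L, NA))   # the only non-missing value
b <- first_non_na(c(NA, 2L, 5L))   # first of several non-missing values
d <- first_non_na(c(NA, NA))       # nothing left after na.omit(), so [1] gives NA
```

This guarantees a length-1 result per group, which is what summarise() needs.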
Base R :
aggregate(.~id, df, function(x) na.omit(x)[1], na.action = 'na.pass')
Or data.table :
library(data.table)
setDT(df)[, lapply(.SD, function(x) na.omit(x)[1]), id]

Equivalent of summarise_all for group_by and slice

I'm currently using group_by then slice, to get the maximum dates in my data. There are a few rows where the date is NA, and when using slice(which.max(END_DT)), the NAs end up getting dropped. Is there an equivalent of summarise_all, so that I can keep the NAs in my data?
ID Date INitials
1 01-01-2020 AZ
1 02-01-2020 BE
2 NA CC
I'm using
df %>%
group_by(ID) %>%
slice(which.max(Date))
I need the final results to look like below, but it's dropping the NA entirely
ID Date Initials
1 02-01-2020 BE
2 NA CC
which.max() is not suitable in this case because (1) it drops missing values and (2) it only finds the first position of maxima. Here is a general solution:
library(dplyr)
df %>%
mutate(Date = as.Date(Date, "%m-%d-%Y")) %>%
group_by(ID) %>%
filter(Date == max(Date) | all(is.na(Date)))
# # A tibble: 2 x 3
# # Groups: ID [2]
# ID Date INitials
# <int> <date> <fct>
# 1 1 2020-02-01 BE
# 2 2 NA CC
df <- structure(list(ID = c(1L, 1L, 2L), Date = structure(c(1L, 2L,
NA), .Label = c("01-01-2020", "02-01-2020"), class = "factor"),
INitials = structure(1:3, .Label = c("AZ", "BE", "CC"), class = "factor")),
class = "data.frame", row.names = c(NA, -3L))
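One case the filter above does not cover is a group that mixes real dates with NA: max(Date) is then NA and the whole group is dropped. A possible variant, sketched on made-up data (ID 3 is an added hypothetical group, and the columns are assumed to already be Date and character):

```r
library(dplyr)

df <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 3L),
  Date = as.Date(c("2020-01-01", "2020-02-01", NA, "2020-03-01", NA)),
  Initials = c("AZ", "BE", "CC", "DD", "EE"),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(ID) %>%
  filter(
    if (all(is.na(Date))) {
      TRUE                                   # keep groups with no date at all
    } else {
      !is.na(Date) & Date == max(Date, na.rm = TRUE)
    }
  ) %>%
  ungroup()
```

Groups with no dates are kept as they are, while groups with at least one date keep only the row(s) at the latest date.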
It's dropping the NA because you're asking it to find the max date, which NA would not fall into. If you want to go the "which.max" route, then I'd run the dataset again, using filter, grab the NA(s), and bind them to the sliced result.
df.1 <- df %>% filter(is.na(Date))
df.2 <- df %>% group_by(ID) %>% slice(which.max(Date)) %>% ungroup()
df <- bind_rows(df.2, df.1)

Summarise number of specific rows containing string variables in R (dplyr/tidyverse codes are appreciated)

I have a big dataset with a variety of variables concerning infectious complications. There are columns, containing symptoms written as strings in the corresponding columns ("Dysuria", "Fever", etc.). I would like to know the number of positive symptoms in each observation. I have tried to write different codes, using rowSums within mutate_at with is.character and !is.na, trying to do it simpler and as short as a single line of code, but it did not work.
example:
symps_na %>%
mutate_if(~any(is.character(.), rowSums)) %>%
View()
Then, I wrote a code for each column separately, trying to recode string variables to 1, convert them to numeric and then sum these ones to get the number of symptoms (see the codes below).
symps_na<-
pb_table_ord %>%
select(ID, dysuria:fever)%>%
mutate(dysuria=ifelse(dysuria=="Dysuria", 1, dysuria)) %>%
mutate(frequency=ifelse(frequency=="Frequency", 1, frequency)) %>%
mutate(urgency=ifelse(urgency=="Urgency", 1, urgency)) %>%
mutate(prostatepain=ifelse(prostatepain=="Prostate pain", 1, prostatepain)) %>%
mutate(rigor=ifelse(!is.na(rigor), 1, rigor)) %>%
mutate(loinpain=ifelse(!is.na(loinpain), 1, loinpain)) %>%
mutate(fever=ifelse(!is.na(fever), 1, fever)) %>%
mutate_at(vars(dysuria:fever), as.numeric) %>%
mutate(symptoms.sum=rowSums(select(., dysuria:fever)))
but the column symptoms.sum returns NAs instead of numbers.
Oh, sorry, I have just realized that I missed na.rm=TRUE! But anyway, can anyone suggest a more elegant way to get the number of non-NA string values for each observation in a separate column?
You can split the columns into two sets: one where you check whether the value matches the column name, and another where you only check for non-NA values. Using the sample data shared at the end of this answer, cols1 holds the names of the columns whose values match their own name, and cols2 holds the columns to check for NA. Change both to match the column names in your data.
library(dplyr)
cols1 <- c('b', 'c')
cols2 <- c('d')
purrr::imap_dfc(df %>% select(cols1), `==`) %>% mutate_all(as.numeric) %>%
bind_cols(df %>% transmute_at(vars(cols2), ~+(!is.na(.)))) %>%
mutate(symptoms.sum = rowSums(select(., b:d), na.rm = TRUE))
# A tibble: 5 x 4
# b c d symptoms.sum
# <dbl> <dbl> <int> <dbl>
#1 1 1 0 2
#2 0 1 1 2
#3 1 0 1 2
#4 NA NA 1 1
#5 1 NA 0 1
data
Tested on this data which looks like this
df <- structure(list(a = 1:5, b = structure(c(1L, 2L, 1L, NA, 1L), .Label = c("b",
"c"), class = "factor"), c = structure(c(1L, 1L, 2L, NA, NA), .Label = c("c",
"d"), class = "factor"), d = c(NA, 1, 2, 4, NA)), class = "data.frame",
row.names = c(NA, -5L))
df
# a b c d
#1 1 b c NA
#2 2 c c 1
#3 3 b d 2
#4 4 <NA> <NA> 4
#5 5 b <NA> NA
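If every non-NA entry in the symptom columns counts as a positive symptom (an assumption; the answer above is stricter for some columns), a shorter base-R sketch is to count non-missing cells per row. The data frame symps and its values are made up for illustration:

```r
symps <- data.frame(
  ID = 1:3,
  dysuria = c("Dysuria", NA, "Dysuria"),
  fever = c(NA, NA, "Fever"),
  rigor = c("Rigor", NA, "Rigor"),
  stringsAsFactors = FALSE
)

symptom_cols <- c("dysuria", "fever", "rigor")

# !is.na() on a data frame gives a logical matrix; rowSums() counts the TRUEs
symps$symptoms.sum <- unname(rowSums(!is.na(symps[symptom_cols])))
```

No na.rm argument is needed here, because the logical matrix itself contains no NA.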

Looping through columns and duplicating data in R

I am trying to iterate through columns, and if the column is a whole year, it should be duplicated four times, and renamed to quarters
So this
2000 Q1-01 Q2-01 Q3-01
1 2 3 3
Should become this:
Q1-00 Q2-00 Q3-00 Q4-00 Q1-01 Q2-01 Q3-01
1 1 1 1 2 3 3
Any ideas?
We can use stringr::str_detect to look for colnames with 4 digits then take the last two digits from those columns
library(dplyr)
library(tidyr)
library(stringr)
df %>% gather(key,value) %>% group_by(key) %>%
mutate(key_new = ifelse(str_detect(key,'\\d{4}'),paste0('Q',1:4,'-',str_extract(key,'\\d{2}$'),collapse = ','),key)) %>%
ungroup() %>% select(-key) %>%
separate_rows(key_new,sep = ',') %>% spread(key_new,value)
PS: I hope you don't have a large dataset
Since you want repeated columns, you can just re-index your data frame and then update the column names
df <- structure(list(`2000` = 1L, Q1.01 = 2L, Q2.01 = 3L, Q3.01 = 3L,
`2002` = 1L, Q1.03 = 2L, Q2.03 = 3L, Q3.03 = 3L), row.names = c(NA,
-1L), class = "data.frame")
#> df
#2000 Q1.01 Q2.01 Q3.01 2002 Q1.03 Q2.03 Q3.03
#1 1 2 3 3 1 2 3 3
# Get indices of columns that consist of 4 numbers
col.ids <- grep('^[0-9]{4}$', names(df))
# For each of those, create new names, and for the rest preserve the old names
new.names <- lapply(seq_along(df), function(i) {
if (i %in% col.ids)
return(paste(substr(names(df)[i], 3, 4), c('Q1', 'Q2', 'Q3', 'Q4'), sep = '.'))
return(names(df)[i])
})
# Now repeat each of those columns 4 times
df <- df[rep(seq_along(df), ifelse(seq_along(df) %in% col.ids, 4, 1))]
# ...and finally set the column names to the desired new names
names(df) <- unlist(new.names)
#> df
#00.Q1 00.Q2 00.Q3 00.Q4 Q1.01 Q2.01 Q3.01 02.Q1 02.Q2 02.Q3 02.Q4 Q1.03 Q2.03 Q3.03
#1 1 1 1 1 2 3 3 1 1 1 1 2 3 3
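The same re-indexing idea can be adapted to produce the "Q1-00" style names from the question. A sketch, assuming the column names are kept as-is via check.names = FALSE and that year columns are exactly four digits:

```r
df <- data.frame(`2000` = 1L, `Q1-01` = 2L, `Q2-01` = 3L, `Q3-01` = 3L,
                 check.names = FALSE)

# TRUE for columns whose name is a four-digit year
is_year <- grepl("^[0-9]{4}$", names(df))

# Repeat year columns four times, keep the others once
out <- df[rep(seq_along(df), ifelse(is_year, 4, 1))]

# Build the new names: Q1..Q4 plus the last two digits of the year
names(out) <- unlist(lapply(names(df), function(nm) {
  if (grepl("^[0-9]{4}$", nm)) paste0("Q", 1:4, "-", substr(nm, 3, 4)) else nm
}))
```

The column order of the original data frame is preserved, with each year expanded in place.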
