I have a dataset that I need to apply dplyr::coalesce() to. I want to do this many times, though, and I'm not sure what the most efficient approach is (e.g. a loop, apply, etc.).
To give you a toy example, say my dataset is:
df = data.frame(
a = c(1, NA, NA),
a.1 = c(NA, 1, NA),
a.2 = c(NA, NA, 1),
b = c(2, NA, NA),
b.1 = c(NA, 2, NA),
b.2 = c(NA, NA, 2),
c = c(3, NA, NA),
c.1 = c(NA, 3, NA),
c.2 = c(NA, NA, 3)
)
And I could do this:
new_df = df |>
dplyr::mutate(
a = dplyr::coalesce(a, a.1, a.2),
b = dplyr::coalesce(b, b.1, b.2),
c = dplyr::coalesce(c, c.1, c.2)
) |>
dplyr::select(a, b, c)
Which would give me:
new_df
a b c
1 1 2 3
2 1 2 3
3 1 2 3
First, how could I do this efficiently without writing coalesce n times? This is only a toy example; with the real dataset I'd need to do it forty times.
Also, is there a way to keep just a, b, and c, as I have here, rather than ending up with names like a.1?
If the columns follow a something / something.suffix naming pattern,
you may try
library(dplyr)
library(purrr) # provides map_df()
library(stringr)
df %>%
split.default(str_remove(names(.), "\\..*")) %>%
map_df(~ coalesce(!!! .x))
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 1 2 3
3 1 2 3
Here is an alternative with pivoting:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(everything()) %>%
mutate(name = sub("\\..*", "", name)) %>%
drop_na %>%
pivot_wider(names_from = name, values_from = value, values_fn = list) %>%
unnest(cols = c(a, b, c))
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 1 2 3
3 1 2 3
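The same grouping idea can also be sketched without purrr or pivoting. This is only a minimal sketch: it assumes, as the answers above do, that related columns share everything before the first dot, and the names prefix and new_df are chosen here for illustration.

```r
library(dplyr)

# Toy data in the same shape as the question (two groups shown for brevity)
df <- data.frame(
  a = c(1, NA, NA), a.1 = c(NA, 1, NA), a.2 = c(NA, NA, 1),
  b = c(2, NA, NA), b.1 = c(NA, 2, NA), b.2 = c(NA, NA, 2)
)

# Assumption: the part of each name before the first dot identifies the group
prefix <- sub("\\..*", "", names(df))

# Split the columns by prefix, then coalesce each group into a single vector
coalesced <- lapply(
  split.default(df, prefix),
  function(cols) do.call(coalesce, unname(as.list(cols)))
)

# The list is named by prefix, so the result keeps the plain names a and b
new_df <- as.data.frame(coalesced)
new_df
```

Because the list returned by split.default() is named by prefix, no renaming step is needed afterwards.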
I have a dataset in which I want to convert any values duplicated across columns to NA. I've found answers that look for duplicates in just one column, and I've found ways to remove duplicates entirely (e.g., distinct()). Instead, I have this data:
library(dplyr)
test <- tibble(job = c(1:6),
name = c("j", "j", "j", "c", "c", "c"),
id = c(1, 1, 2, 1, 5, 1))
And want this result:
library(dplyr)
answer <- tibble(job = c(1:6),
name = c("j", NA, "j", "c", "c", NA),
id = c(1, NA, 2, 1, 5, NA))
And I've tried a solution like this using duplicated(), but it fails:
#Attempted solution
library(dplyr)
test %>%
mutate_at(vars(id, name), ~case_when(
duplicated(id, name) ~ NA,
TRUE ~ .
))
I'd prefer to use tidy solutions, but I can be flexible as long as the answer can be piped.
We could create a helper and then identify duplicates and replace them with NA in an ifelse statement using across:
library(dplyr)
test %>%
mutate(helper = paste(id, name)) %>%
mutate(across(c(name, id), ~ifelse(duplicated(helper), NA, .)), .keep="unused")
job name id
<int> <chr> <dbl>
1 1 j 1
2 2 NA NA
3 3 j 2
4 4 c 1
5 5 c 5
6 6 NA NA
If we want to convert to NA, create a column that combines all the relevant columns with paste or unite, and then mutate with across:
library(dplyr)
library(tidyr)
test %>%
unite(full_nm, -job, remove = FALSE) %>%
mutate(across(-c(job, full_nm), ~ replace(.x, duplicated(full_nm), NA))) %>%
select(-full_nm)
-output
# A tibble: 6 × 3
job name id
<int> <chr> <dbl>
1 1 j 1
2 2 <NA> NA
3 3 j 2
4 4 c 1
5 5 c 5
6 6 <NA> NA
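A possible variation skips the pasted helper string. The attempt in the question fails because the second argument of duplicated() is incomparables, not a second column; passing a data frame makes duplicated() compare whole rows instead. The sketch below is one way to do that, with dup as a temporary column name chosen here for illustration.

```r
library(dplyr)

test <- tibble(job = 1:6,
               name = c("j", "j", "j", "c", "c", "c"),
               id = c(1, 1, 2, 1, 5, 1))

result <- test %>%
  # duplicated() on a data frame flags rows whose (name, id) pair was seen before
  mutate(dup = duplicated(data.frame(name, id))) %>%
  # blank out both columns on the flagged rows
  mutate(across(c(name, id), ~ replace(.x, dup, NA))) %>%
  select(-dup)

result
```

Comparing rows directly also avoids accidental collisions that pasting can introduce when values contain the separator.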
I have a table like the following:
A, B, C
1, Yes, 3
1, No, 2
2, Yes, 4
2, No, 6
etc
I want to convert it to:
A, Yes, No
1, 3, 2
2, 4, 6
I have tried using:
dat <- dat %>%
spread(B, C) %>%
group_by(A)
However, now I have a bunch of NA values. Is it possible to use pivot_longer to do this instead?
We can use pivot_wider
library(tidyr)
pivot_wider(dat, names_from = B, values_from = C)
-output
# A tibble: 2 x 3
# A Yes No
# <dbl> <dbl> <dbl>
#1 1 3 2
#2 2 4 6
If there are duplicate rows, then an option is to create a sequence by that column
library(data.table)
library(dplyr)
dat1 <- bind_rows(dat, dat) # example with duplicates
dat1 %>%
mutate(rn = rowid(B)) %>%
pivot_wider(names_from = B, values_from = C) %>%
select(-rn)
-output
# A tibble: 4 x 3
# A Yes No
# <dbl> <dbl> <dbl>
#1 1 3 2
#2 2 4 6
#3 1 3 2
#4 2 4 6
data
dat <- structure(list(A = c(1, 1, 2, 2), B = c("Yes", "No", "Yes", "No"
), C = c(3, 2, 4, 6)), class = "data.frame", row.names = c(NA,
-4L))
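One common reason for unexpected NA values after spreading is an extra column that is unique per row, since every column that is not pivoted is treated as an identifier. Whether that applies to the question's real table is an assumption; the sketch below uses a hypothetical dat2 with an added row column to show the effect and how id_cols restricts the identifiers.

```r
library(tidyr)

# Hypothetical data: same values as dat, plus a row-unique column
dat2 <- data.frame(
  row = 1:4,
  A = c(1, 1, 2, 2),
  B = c("Yes", "No", "Yes", "No"),
  C = c(3, 2, 4, 6)
)

# Without id_cols, `row` also identifies observations: four rows, padded with NA
naive <- pivot_wider(dat2, names_from = B, values_from = C)

# Restricting the identifiers to A gives one row per value of A
wide <- pivot_wider(dat2, id_cols = A, names_from = B, values_from = C)
wide
```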
I am processing a large dataset adapted to my research. Suppose that I have 4 observations (records) and 5 columns as follows:
x <- data.frame("ID" = c(1, 2, 3, 4),
"group1" = c("A", NA, "B", NA),
"group2" = c("B", "A", NA, "C"),
"hours1" = c(3, NA, 5, NA),
"hours2" = c(1, 2, NA, 5))
> x
ID group1 group2 hours1 hours2
1 A B 3 1
2 <NA> A NA 2
3 B <NA> 5 NA
4 <NA> C NA 5
The "group1" and "group2" columns are reference columns containing the character values A, B, and C, and the last two columns, "hours1" and "hours2," are numeric and give hours.
The column "group1" corresponds to the column "hours1"; likewise, "group2" corresponds to "hours2."
I want to create multiple columns according to the values, A, B, and C, of the reference columns matching to values of "hours1" and "hours2" as follows:
ID group1 group2 hours1 hours2 A B C
1 A B 3 1 3 1 NA
2 <NA> A NA 2 2 NA NA
3 B <NA> 5 NA NA 5 NA
4 <NA> C NA 5 NA NA 5
For example, ID 1 has A in "group1," corresponding to 3 in "hours1," which is placed under the column "A." ID 3 has B in "group1," corresponding to 5 in "hours1," which is placed under the column "B." In "group2," ID 4 has C, corresponding to 5 in "hours2," which is placed under the column "C."
Is there a way to do it using R?
One way would be to combine all the "hours" columns into one column and all the "group" columns into another. This can be done using pivot_longer. After that, we can reshape the data to wide format and join it with the original data.
library(dplyr)
library(tidyr)
x %>%
pivot_longer(cols = -ID,
names_to = c('.value'),
names_pattern = '(.*?)\\d+',
values_drop_na = TRUE) %>%
pivot_wider(names_from = group, values_from = hours) %>%
left_join(x, by = 'ID') %>%
select(ID, starts_with('group'), starts_with('hour'), everything())
# A tibble: 4 x 8
# ID group1 group2 hours1 hours2 A B C
# <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 A B 3 1 3 1 NA
#2 2 NA A NA 2 2 NA NA
#3 3 B NA 5 NA NA 5 NA
#4 4 NA C NA 5 NA NA 5
For OP's dataset we can slightly modify the code to achieve the desired result.
zz %>%
pivot_longer(cols = -id,
names_to = c('.value'),
names_pattern = '(.*)_',
values_drop_na = TRUE) %>%
arrange(fu2a) %>%
pivot_wider(names_from = fu2a, values_from = fu2b) %>%
left_join(zz, by = 'id') %>%
select(id, starts_with('fu2a'), starts_with('fu2b'), everything())
Another approach using dplyr is to reshape the group and hours variables separately, compute the desired columns from them, and then merge the result with the original x:
library(tidyverse)
#Data
x <- data.frame("ID" = c(1, 2, 3, 4),
"group1" = c("A", NA, "B", NA),
"group2" = c("B", "A", NA, "C"),
"hours1" = c(3, NA, 5, NA),
"hours2" = c(1, 2, NA, 5),stringsAsFactors = F)
#Reshape
x %>%
left_join(x %>% select(1:3) %>%
pivot_longer(cols = -ID) %>%
group_by(ID) %>% mutate(id=1:n()) %>%
left_join(x %>% select(c(1,4:5)) %>%
pivot_longer(cols = -ID) %>%
rename(name2=name,value2=value) %>%
group_by(ID) %>% mutate(id=1:n())) %>%
filter(!is.na(value)) %>% select(ID,value,value2) %>%
pivot_wider(names_from = value,values_from=value2))
Output:
ID group1 group2 hours1 hours2 A B C
1 1 A B 3 1 3 1 NA
2 2 <NA> A NA 2 2 NA NA
3 3 B <NA> 5 NA NA 5 NA
4 4 <NA> C NA 5 NA NA 5
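When the set of group labels is known in advance, the same table can be built without reshaping. This is a minimal sketch under two assumptions that are not stated in the question: the labels are exactly A, B, and C, and there are exactly two group/hours pairs. The names labels and out are chosen here for illustration.

```r
library(dplyr)

x <- data.frame(ID = c(1, 2, 3, 4),
                group1 = c("A", NA, "B", NA),
                group2 = c("B", "A", NA, "C"),
                hours1 = c(3, NA, 5, NA),
                hours2 = c(1, 2, NA, 5))

labels <- c("A", "B", "C")  # assumed to be the complete set of group values
out <- x

for (g in labels) {
  # Take hours1 where group1 matches g, otherwise hours2 where group2 matches g
  out[[g]] <- coalesce(
    ifelse(x$group1 == g, x$hours1, NA_real_),
    ifelse(x$group2 == g, x$hours2, NA_real_)
  )
}

out
```

A comparison against NA yields NA, so rows with a missing group fall through to the next pair or stay NA.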
I have a big dataset with a variety of variables concerning infectious complications. There are columns, containing symptoms written as strings in the corresponding columns ("Dysuria", "Fever", etc.). I would like to know the number of positive symptoms in each observation. I have tried to write different codes, using rowSums within mutate_at with is.character and !is.na, trying to do it simpler and as short as a single line of code, but it did not work.
example:
symps_na %>%
mutate_if(~any(is.character(.), rowSums)) %>%
View()
Then I wrote code for each column separately, recoding the string values to 1, converting them to numeric, and summing them to get the number of symptoms (see the code below).
symps_na<-
pb_table_ord %>%
select(ID, dysuria:fever)%>%
mutate(dysuria=ifelse(dysuria=="Dysuria", 1, dysuria)) %>%
mutate(frequency=ifelse(frequency=="Frequency", 1, frequency)) %>%
mutate(urgency=ifelse(urgency=="Urgency", 1, urgency)) %>%
mutate(prostatepain=ifelse(prostatepain=="Prostate pain", 1, prostatepain)) %>%
mutate(rigor=ifelse(!is.na(rigor), 1, rigor)) %>%
mutate(loinpain=ifelse(!is.na(loinpain), 1, loinpain)) %>%
mutate(fever=ifelse(!is.na(fever), 1, fever)) %>%
mutate_at(vars(dysuria:fever), as.numeric) %>%
mutate(symptoms.sum=rowSums(select(., dysuria:fever)))
but the column symptoms.sum returns NAs instead of numbers.
Oh, sorry, I have just realized that I missed na.rm=TRUE! Still, can anyone suggest a more elegant way to get the number of non-NA/string variables for each observation in a separate column?
You can split the columns into two sets: one where the value has to match the column name, and one where you only need to check for NA. I have created sample data (shared at the end of the answer) and two vectors: cols1 holds the names of columns whose values should equal the column name, and cols2 holds the names of columns to check for NA. Change these to match your own column names.
library(dplyr)
cols1 <- c('b', 'c')
cols2 <- c('d')
purrr::imap_dfc(df %>% select(cols1), `==`) %>% mutate_all(as.numeric) %>%
bind_cols(df %>% transmute_at(vars(cols2), ~+(!is.na(.)))) %>%
mutate(symptoms.sum = rowSums(select(., b:d), na.rm = TRUE))
# A tibble: 5 x 4
# b c d symptoms.sum
# <dbl> <dbl> <int> <dbl>
#1 1 1 0 2
#2 0 1 1 2
#3 1 0 1 2
#4 NA NA 1 1
#5 1 NA 0 1
data
Tested on this data which looks like this
df <- structure(list(a = 1:5, b = structure(c(1L, 2L, 1L, NA, 1L), .Label = c("b",
"c"), class = "factor"), c = structure(c(1L, 1L, 2L, NA, NA), .Label = c("c",
"d"), class = "factor"), d = c(NA, 1, 2, 4, NA)), class = "data.frame",
row.names = c(NA, -5L))
df
# a b c d
#1 1 b c NA
#2 2 c c 1
#3 3 b d 2
#4 4 <NA> <NA> 4
#5 5 b <NA> NA
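If any non-NA entry counts as a positive symptom, which is a simplification of the name-matching check above, the count can be sketched in one step with rowSums over is.na. The data frame symps and the column n_symptoms below are hypothetical and only mimic the shape described in the question.

```r
library(dplyr)

# Hypothetical data: each symptom column holds its label when present, NA otherwise
symps <- data.frame(
  ID = 1:3,
  dysuria = c("Dysuria", NA, "Dysuria"),
  fever = c("Fever", NA, NA),
  rigor = c(NA, NA, "Rigor")
)

result <- symps %>%
  # is.na() on the selected columns gives a logical matrix; count the non-NA cells per row
  mutate(n_symptoms = rowSums(!is.na(across(dysuria:rigor))))

result
```

No na.rm is needed here because the missing values are turned into FALSE before summing.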
I have two data frames:
Harry <- c(1, NA, NA, NA)
Tom <- c(NA, 2, NA, NA)
Sally <- c(NA, NA, 3, NA)
Jane <- c(NA, NA, NA, 4)
df <- data.frame(Harry, Tom, Sally, Jane)
Harry <- c(1, NA, NA, NA)
Tom <- c(1, NA, NA, NA)
Mary <- c(NA, NA, 3, NA)
Sarah <- c(NA, NA, NA, 4)
df2 <- data.frame(Harry, Tom, Mary, Sarah)
... where there's only one value per column. I'd like to flatten the data frames into single rows and then vertically concatenate such that each data frame becomes an observation in the new frame. There may be different columns, in which case these columns would be added and hence why I can't use rbind.
In addition and since these are numeric, the NAs should be zeroes and the resulting frame would look as below:
Harry <- c(1, 1)
Tom <- c(2, 1)
Sally <- c(3, 0)
Jane <- c(4, 0)
Mary <- c(0, 3)
Sarah <- c(0, 4)
df <- data.frame(Harry, Tom, Sally, Jane, Mary, Sarah)
I realise I could make everything numeric and total to get each row, but my issue is to get this into a single object.
We can use the gather and spread approach from tidyr, with bind_rows from dplyr to stack the results.
library(dplyr)
library(tidyr)
df_2 <- df %>% gather(Col, Val, na.rm = TRUE)
df2_2 <- df2 %>% gather(Col, Val, na.rm = TRUE)
df3 <- bind_rows(df_2, df2_2, .id = "ID") %>%
spread(Col, Val, fill = 0) %>%
select(-ID)
df3
# Harry Jane Mary Sally Sarah Tom
# 1 1 4 0 3 0 2
# 2 1 0 3 0 4 1
We can combine the datasets into a single one with bind_rows, creating a grouping column 'grp' with .id, then group by 'grp' and sum every column with summarise and across:
library(dplyr)
bind_rows(df, df2, .id = 'grp') %>%
group_by(grp) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) %>%
ungroup %>%
select(-grp)
# A tibble: 2 x 6
# Harry Tom Sally Jane Mary Sarah
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 2 3 4 0 0
#2 1 1 0 0 3 4
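Another way to read the task is to flatten each data frame to one row first and stack afterwards. The sketch below assumes all columns are numeric, as in the question; flatten_one is a hypothetical helper name, not a library function.

```r
library(dplyr)

df <- data.frame(Harry = c(1, NA, NA, NA), Tom = c(NA, 2, NA, NA),
                 Sally = c(NA, NA, 3, NA), Jane = c(NA, NA, NA, 4))
df2 <- data.frame(Harry = c(1, NA, NA, NA), Tom = c(1, NA, NA, NA),
                  Mary = c(NA, NA, 3, NA), Sarah = c(NA, NA, NA, 4))

# Hypothetical helper: collapse a data frame to one row of column sums
flatten_one <- function(d) as.data.frame(t(colSums(d, na.rm = TRUE)))

# bind_rows() fills columns missing from one frame with NA, which we then zero out
out <- bind_rows(lapply(list(df, df2), flatten_one))
out[is.na(out)] <- 0
out
```

Putting the frames in a list keeps the approach unchanged when there are more than two of them.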