R Mutating multiple columns with matching - r

I am processing a large dataset adapted to my research. Suppose that I have 4 observations (records) and 5 columns as follows:
x <- data.frame("ID" = c(1, 2, 3, 4),
"group1" = c("A", NA, "B", NA),
"group2" = c("B", "A", NA, "C"),
"hours1" = c(3, NA, 5, NA),
"hours2" = c(1, 2, NA, 5))
> x
ID group1 group2 hours1 hours2
1 A B 3 1
2 <NA> A NA 2
3 B <NA> 5 NA
4 <NA> C NA 5
The "group1" and "group2" are reference columns containing the character values of A, B, and C, and the last two columns, "hours1" and "hours2," are numeric indicating hours obviously.
The column "group1" is corresponding to the column "hours1"; likewise, "group2" is corresponding to "hours 2."
I want to create multiple columns according to the values, A, B, and C, of the reference columns matching to values of "hours1" and "hours2" as follows:
ID group1 group2 hours1 hours2 A B C
1 A B 3 1 3 1 NA
2 <NA> A NA 2 2 NA NA
3 B <NA> 5 NA NA 5 NA
4 <NA> C NA 5 NA NA 5
For example, ID 1 has A in "group1," corresponding to 3 in "hours1" which is found under the column "A." ID 3 has B in "group1," corresponding to 5 in "hours1" which is found under the columns "B." In "group 2," ID 4 has C, corresponding to 5 in hours2 which is found under column "C."
Is there a way to do it using R?

One way would be to combine all the "hour" column in one column and "group" columns in another column. This can be done using pivot_longer. After that we can get data in wide format and join it with original data.
library(dplyr)
library(tidyr)
x %>%
pivot_longer(cols = -ID,
names_to = c('.value'),
names_pattern = '(.*?)\\d+',
values_drop_na = TRUE) %>%
pivot_wider(names_from = group, values_from = hours) %>%
left_join(x, by = 'ID') %>%
select(ID, starts_with('group'), starts_with('hour'), everything())
# A tibble: 4 x 8
# ID group1 group2 hours1 hours2 A B C
# <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 A B 3 1 3 1 NA
#2 2 NA A NA 2 2 NA NA
#3 3 B NA 5 NA NA 5 NA
#4 4 NA C NA 5 NA NA 5
For OP's dataset we can slightly modify the code to achieve the desired result.
zz %>%
pivot_longer(cols = -id,
names_to = c('.value'),
names_pattern = '(.*)_',
values_drop_na = TRUE) %>%
arrange(fu2a) %>%
pivot_wider(names_from = fu2a, values_from = fu2b) %>%
left_join(zz, by = 'id') %>%
select(id, starts_with('fu2a'), starts_with('fu2b'), everything())

Another approach using dplyr could be done separating group and hours variables to compute the desired variables and then merge with the original x:
library(tidyverse)
#Data
x <- data.frame("ID" = c(1, 2, 3, 4),
"group1" = c("A", NA, "B", NA),
"group2" = c("B", "A", NA, "C"),
"hours1" = c(3, NA, 5, NA),
"hours2" = c(1, 2, NA, 5),stringsAsFactors = F)
#Reshape
x %>%
left_join(x %>% select(1:3) %>%
pivot_longer(cols = -ID) %>%
group_by(ID) %>% mutate(id=1:n()) %>%
left_join(x %>% select(c(1,4:5)) %>%
pivot_longer(cols = -ID) %>%
rename(name2=name,value2=value) %>%
group_by(ID) %>% mutate(id=1:n())) %>%
filter(!is.na(value)) %>% select(ID,value,value2) %>%
pivot_wider(names_from = value,values_from=value2))
Output:
ID group1 group2 hours1 hours2 A B C
1 1 A B 3 1 3 1 NA
2 2 <NA> A NA 2 2 NA NA
3 3 B <NA> 5 NA NA 5 NA
4 4 <NA> C NA 5 NA NA 5

Related

How can you convert duplicates across multiple columns to be NA in R?

I have a dataset that I want to convert any duplicates across columns to be NA. I've found answers to help with just looking for duplicates in one column, and I've found ways to remove duplicates entirely (e.g., distinct()). Instead, I have this data:
library(dpylr)
test <- tibble(job = c(1:6),
name = c("j", "j", "j", "c", "c", "c"),
id = c(1, 1, 2, 1, 5, 1))
And want this result:
library(dpylr)
answer <- tibble(job = c(1:6),
id = c("j", NA, "j", "c", NA, "c"),
name = c(1, NA, 2, 1, NA, 5))
And I've tried a solution like this using duplicated(), but it fails:
#Attempted solution
library(dpylr)
test %>%
mutate_at(vars(id, name), ~case_when(
duplicated(id, name) ~ NA,
TRUE ~ .
))
I'd prefer to use tidy solutions, but I can be flexible as long as the answer can be piped.
We could create a helper and then identify duplicates and replace them with NA in an ifelse statement using across:
library(dplyr)
test %>%
mutate(helper = paste(id, name)) %>%
mutate(across(c(name, id), ~ifelse(duplicated(helper), NA, .)), .keep="unused")
job name id
<int> <chr> <dbl>
1 1 j 1
2 2 NA NA
3 3 j 2
4 4 c 1
5 5 c 5
6 6 NA NA
If we want to convert to NA, create a column that includes all the columns with paste or unite and then mutate with across
library(dplyr)
library(tidyr)
test %>%
unite(full_nm, -job, remove = FALSE) %>%
mutate(across(-c(job, full_nm), ~ replace(.x, duplicated(full_nm), NA))) %>%
select(-full_nm)
-output
# A tibble: 6 × 3
job name id
<int> <chr> <dbl>
1 1 j 1
2 2 <NA> NA
3 3 j 2
4 4 c 1
5 5 c 5
6 6 <NA> NA

Coalescing multiple chunks of columns with the same suffix in names (R)

I have a dataset with various "chunks" of columns with different prefixes, but the same suffix:
ID
A034
B034
C034
D034
A099
B099
A123
B123
...
1
NA
1
NA
NA
NA
3
1
NA
...
2
2
NA
NA
NA
2
NA
NA
2
...
3
NA
NA
2
NA
NA
2
1
NA
...
The number of columns within each "chunk" also varies. Is there any way (other than manually, which is what I have been painstakingly doing with coalesce(!!! select(., contains("XXX")))) to automatically coalesce by chunk based on the shared suffix? That is, the result should resemble
ID
034
099
123
...
1
1
3
1
...
2
2
2
2
...
3
2
2
1
...
I'm not sure how to begin doing something like this, so any suggestions would be very helpful.
We reshape the data into 'long' format with pivot_longer, then we group by 'ID' and loop across the other columns, apply the na.omit to remove the NA elements (we assume that there is only one non-NA per each column by group)
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -ID, names_to = ".value",
names_pattern = "[A-Z](\\d+)") %>%
group_by(ID) %>%
summarise(across(everything(), na.omit), .groups = 'drop')
-output
# A tibble: 3 x 4
ID `034` `099` `123`
<int> <int> <int> <int>
1 1 1 3 1
2 2 2 2 2
3 3 2 2 1
Or to be safe, use complete.cases to create a logical vector for non-NA elements, and extract the first element (assuming we need only a single non-NA - if the non-NA lengths are different, we may need to return a list)
df1 %>%
pivot_longer(cols = -ID, names_to = ".value",
names_pattern = "[A-Z](\\d+)") %>%
group_by(ID) %>%
summarise(across(everything(), ~ .[complete.cases(.)][1]))
data
df1 <- structure(list(ID = 1:3, A034 = c(NA, 2L, NA), B034 = c(1L, NA,
NA), C034 = c(NA, NA, 2L), D034 = c(NA, NA, NA), A099 = c(NA,
2L, NA), B099 = c(3L, NA, 2L), A123 = c(1L, NA, 1L), B123 = c(NA,
2L, NA)), class = "data.frame", row.names = c(NA, -3L))
one more approach
library(tidyverse)
split(names(df1)[-1], gsub('^\\D*(\\d+)$', '\\1', names(df1)[-1])) %>% map(~df1[c('ID', .x)]) %>%
imap(~ .x %>% group_by(ID) %>% rowwise %>% transmute(!!.y := first(na.omit(c_across(everything())))) %>% ungroup) %>%
reduce(left_join, by = 'ID')
#> # A tibble: 3 x 4
#> ID `034` `099` `123`
#> <int> <int> <int> <int>
#> 1 1 1 3 1
#> 2 2 2 2 2
#> 3 3 2 2 1
Created on 2021-06-20 by the reprex package (v2.0.0)

Merge data frames and divide rows by group

I would like to divide the values from df1 over the values from df2. In this reproducible example, I am able to sum these values. What about the division? Thanks in advance!
df1 <- data.frame(country = c("a", "b", "c"), year1 = c(1, 2, 3), year2 = c(1, 2, 3))
df2 <- data.frame(country = c("a", "b", "d"), year1 = c(1, 2, NA), year2 = c(1, 2, 3))
df3 <- bind_rows(df1, df2) %>%
mutate_if(is.numeric, tidyr::replace_na, 0) %>%
group_by(country) %>%
summarise_all(., sum, na.rm = TRUE) %>%
na_if(., 0)
Expected result is:
# A tibble: 4 x 3
country year1 year2
<chr> <dbl> <dbl>
1 a 1 1
2 b 1 1
3 c NA NA
4 d NA NA
As there are groups with 2 rows and some with 1, use an if/else condition within summarise/across to divide the first element by the last if there are two elements or else return NA
library(dplyr) # version 1.0.4
library(tidyr)
bind_rows(df1, df2) %>%
mutate(across(where(is.numeric), replace_na, 0)) %>%
group_by(country) %>%
summarise(across(everything(), ~ if(n() == 2) first(.)/last(.)
else NA_real_))
-output
# A tibble: 4 x 3
# country year1 year2
#* <chr> <dbl> <dbl>
#1 a 1 1
#2 b 1 1
#3 c NA NA
#4 d NA NA
Here is a base R option using merge + split.default
df <- merge(df1, df2, by = "country", all = TRUE)
cbind(
df[1],
list2DF(lapply(
split.default(df[-1], gsub("\\.(x|y)", "", names(df)[-1])),
function(v) do.call("/", v)
))
)
which gives
country year1 year2
1 a 1 1
2 b 1 1
3 c NA NA
4 d NA NA

Conditionally Create New Column Based on Row Values

thanks in advance for any assistance.
I have a dataframe:
df <- structure(list(ID = c("0001", "0002", "0003", "0004"), May_1 = c(1,
2, 1, 3), May_5 = c(NA, 1, 2, 1), May_10 = c(NA, 3, 3, NA), May_16 = c(2,
NA, NA, NA), May_20 = c(3, NA, NA, 2)), row.names = c(NA, -4L
), class = c("tbl_df", "tbl", "data.frame"))
I would like to create new columns named "First Preference", "Second Preference" and "Third Preference" based on the row values for each response.
If a row value == 1, I would like to append a column called "First Preference" that contains the column name where the row value == 1.
My actual data contains about 40 dates that will be changing week over week, so a generalizable solution is most appreciated.
Here's the ideal df:
df_ideal <- structure(list(ID = c("0001", "0002", "0003", "0004"), May_1 = c(1,
2, 1, 3), May_5 = c(NA, 1, 2, 1), May_10 = c(NA, 3, 3, NA), May_16 = c(2,
NA, NA, NA), May_20 = c(3, NA, NA, 2), First_Preference = c("May_1",
"May_5", "May_1", "May_5"), Second_Preference = c("May_16", "May_1",
"May_5", "May_20"), Third_Preference = c("May_20", "May_10",
"May_10", "May_1")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
A tidyverse solution would be preferred, but I'm certainly open to anything.
Thanks!
In base R, we can use apply row-wise order the values removing NA values and get corresponding column names.
cols <- paste(c('First', 'Second', 'Third'), "Preference", sep = "_")
df[cols] <- t(apply(df[-1], 1, function(x) names(df)[-1][order(x, na.last= NA)]))
df
# A tibble: 4 x 9
# ID May_1 May_5 May_10 May_16 May_20 First_Preference Second_Preference Third_Preference
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#1 0001 1 NA NA 2 3 May_1 May_16 May_20
#2 0002 2 1 3 NA NA May_5 May_1 May_10
#3 0003 1 2 3 NA NA May_1 May_5 May_10
#4 0004 3 1 NA NA 2 May_5 May_20 May_1
We can reshape it to 'long' format, while dropping the NA elements with values_drop_na, then use the 'value' column as index to change the labels and then convert back to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -ID, values_drop_na = TRUE) %>%
group_by(ID) %>%
mutate(value = c("First_Preference", "Second_Preference",
"Third_Preference")[value]) %>%
ungroup %>%
pivot_wider(names_from = value, values_from = name) %>%
left_join(df, .)
# A tibble: 4 x 9
# ID May_1 May_5 May_10 May_16 May_20 First_Preference Second_Preference Third_Preference
#* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#1 0001 1 NA NA 2 3 May_1 May_16 May_20
#2 0002 2 1 3 NA NA May_5 May_1 May_10
#3 0003 1 2 3 NA NA May_1 May_5 May_10
#4 0004 3 1 NA NA 2 May_5 May_20 May_1
To get the column names automatically, we can use ordinal from english
library(english)
library(stringr)
df %>%
pivot_longer(cols = -ID, values_drop_na = TRUE) %>%
group_by(ID) %>%
mutate(value = str_c(ordinal(value), "_preference")) %>%
ungroup %>%
pivot_wider(names_from = value, values_from = name) %>%
left_join(df, .)
Or using data.table
library(data.table)
setDT(df)[dcast(melt(df, id.var = 'ID', na.rm = TRUE),
ID ~ paste0(ordinal(value), "_preference"), value.var = 'variable'), on = .(ID)]
# ID May_1 May_5 May_10 May_16 May_20 first_preference second_preference third_preference
#1: 0001 1 NA NA 2 3 May_1 May_16 May_20
#2: 0002 2 1 3 NA NA May_5 May_1 May_10
#3: 0003 1 2 3 NA NA May_1 May_5 May_10
#4: 0004 3 1 NA NA 2 May_5 May_20 May_1

Forward and backward difference between rows with missing values

Here is the sample dataframe:
df <- data.frame(
id = c("A", "A", "A", "A", "B", "B", "B", "B"),
num = c(1, NA, 6, 3, 7, NA , NA, 2))
How do I get forward and backward difference between rows over id category? There should be two new columns: one difference between between current raw and previous, and the other should be difference between current raw and next raw. If the previous raw is NA then it should calculate the difference between current row and the first previous raw that contains real number. The same holds for the other forward difference case.
Many thanks!!
require(magrittr)
df$backdiff <- c(NA, sapply(2:nrow(df),
function(i){
df$num[i] - df$num[(i-1):1] %>% .[!is.na(.)][1]
}))
df$forward.diff <- c(sapply(2:nrow(df) - 1,
function(i){
df$num[i] - df$num[(i+1):nrow(df)] %>% .[!is.na(.)][1]
}), NA)
One solution could be achieved by using fill function from tidyr to create two columns (one each for prev and next calculation) where NA values are removed.
df <- data.frame(
id = c("A", "A", "A", "A", "B", "B", "B", "B"),
num = c(1, NA, 6, 3, 7, NA , NA, 2))
library("tidyverse")
df %>% mutate(dup_num_prv = num, dup_num_nxt = num) %>%
group_by(id) %>%
fill(dup_num_prv, .direction = "down") %>%
fill(dup_num_nxt, .direction = "up") %>%
mutate(prev_diff = ifelse(is.na(num), NA, num - lag(dup_num_prv))) %>%
mutate(next_diff = ifelse(is.na(num), NA, num - lead(dup_num_nxt))) %>%
as.data.frame()
# Result is shown in columns 'prev_diff' and 'next_diff'
# id num dup_num_prv dup_num_nxt prev_diff next_diff
#1 A 1 1 1 NA -5
#2 A NA 1 6 NA NA
#3 A 6 6 6 5 3
#4 A 3 3 3 -3 NA
#5 B 7 7 7 NA 5
#6 B NA 7 2 NA NA
#7 B NA 7 2 NA NA
#8 B 2 2 2 -5 NA
Note: There are few queries which OP needs to clarify. The solution can be fine-tuned afterwards. dup_num_prv and dup_num_nxtare kept just for understanding purpose. These column can be removed.

Resources