I have a large dataset in which the answers to one question are spread across several columns. Columns that belong together share the same prefix. How can I create a subset for each question by selecting columns based on that prefix?
Here is an example dataset. I am looking for an efficient and easily adaptable solution that creates a dataset containing only the values of question one, two, or three.
df <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8), Question1a = c(1,
1, NA, NA, 1, 1, 1, NA), Question1b = c(NA, 1, NA, 1, NA, 1,
NA, 1), Question1c = c(1, 1, NA, NA, 1, NA, NA, NA), Question2a = c(1,
NA, NA, NA, 1, 1, NA, NA), Question2b = c(NA, 1, NA, 1, NA, NA,
NA, NA), Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA), Question3b = c(NA,
NA, 1, 1, NA, NA, NA, NA)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L))
You can use sapply and a function:
list_data <- sapply(c("Question1", "Question2", "Question3"),
function(x) df[startsWith(names(df),x)], simplify = FALSE)
This will store everything in a list. To get the individual data sets in the global environment as individual objects, use:
list2env(list_data, globalenv())
Output
# $Question1
# # A tibble: 8 × 3
# Question1a Question1b Question1c
# <dbl> <dbl> <dbl>
# 1 1 NA 1
# 2 1 1 1
# 3 NA NA NA
# 4 NA 1 NA
# 5 1 NA 1
# 6 1 1 NA
# 7 1 NA NA
# 8 NA 1 NA
#
# $Question2
# # A tibble: 8 × 2
# Question2a Question2b
# <dbl> <dbl>
# 1 1 NA
# 2 NA 1
# 3 NA NA
# 4 NA 1
# 5 1 NA
# 6 1 NA
# 7 NA NA
# 8 NA NA
#
# $Question3
# # A tibble: 8 × 2
# Question3a Question3b
# <dbl> <dbl>
# 1 NA NA
# 2 NA NA
# 3 NA 1
# 4 NA 1
# 5 1 NA
# 6 1 NA
# 7 1 NA
# 8 NA NA
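If the question prefixes should not be typed out by hand, they can be derived from the column names. The following is only a sketch in base R, run on a smaller stand-in data frame and assuming that every prefix has the form "Question" followed by digits; the names question_cols and prefixes are made up for this example.

```r
# Small stand-in for the question's data (same column naming scheme).
df <- data.frame(
  ID         = 1:3,
  Question1a = c(1, 1, NA),
  Question1b = c(NA, 1, NA),
  Question2a = c(1, NA, NA),
  Question3a = c(NA, NA, 1)
)

# Assumption: a prefix is "Question" plus digits; anything after it is the item.
question_cols <- grep("^Question[0-9]+", names(df), value = TRUE)
prefixes <- sub("^(Question[0-9]+).*$", "\\1", question_cols)

# split.default() on a data frame splits it by columns rather than by rows.
list_data <- split.default(df[question_cols], prefixes)

names(list_data)
names(list_data$Question1)
```

The result is a named list like the one above, so list2env(list_data, globalenv()) can still be applied to it if separate objects are wanted.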
I believe the underlying question is about data formats.
Here are a few:
library(tidyverse)
structure(
list(
ID = c(1, 2, 3, 4, 5, 6, 7, 8),
Question1a = c(1,
1, NA, NA, 1, 1, 1, NA),
Question1b = c(NA, 1, NA, 1, NA, 1,
NA, 1),
Question1c = c(1, 1, NA, NA, 1, NA, NA, NA),
Question2a = c(1,
NA, NA, NA, 1, 1, NA, NA),
Question2b = c(NA, 1, NA, 1, NA, NA,
NA, NA),
Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA),
Question3b = c(NA,
NA, 1, 1, NA, NA, NA, NA)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -8L)
) -> square_df
square_df %>%
pivot_longer(-ID,
names_to = c("Question", "Item"),
names_pattern = "Question(\\d+)(\\w+)") ->
long_df
long_df
#> # A tibble: 56 × 4
#> ID Question Item value
#> <dbl> <chr> <chr> <dbl>
#> 1 1 1 a 1
#> 2 1 1 b NA
#> 3 1 1 c 1
#> 4 1 2 a 1
#> 5 1 2 b NA
#> 6 1 3 a NA
#> 7 1 3 b NA
#> 8 2 1 a 1
#> 9 2 1 b 1
#> 10 2 1 c 1
#> # … with 46 more rows
long_df %>%
drop_na(value) ->
sparse_long_df
sparse_long_df
#> # A tibble: 22 × 4
#> ID Question Item value
#> <dbl> <chr> <chr> <dbl>
#> 1 1 1 a 1
#> 2 1 1 c 1
#> 3 1 2 a 1
#> 4 2 1 a 1
#> 5 2 1 b 1
#> 6 2 1 c 1
#> 7 2 2 b 1
#> 8 3 3 b 1
#> 9 4 1 b 1
#> 10 4 2 b 1
#> # … with 12 more rows
sparse_long_df %>%
nest(data = c(ID, Item, value)) ->
nested_long_df
nested_long_df
#> # A tibble: 3 × 2
#> Question data
#> <chr> <list>
#> 1 1 <tibble [12 × 3]>
#> 2 2 <tibble [5 × 3]>
#> 3 3 <tibble [5 × 3]>
Created on 2022-05-12 by the reprex package (v2.0.1)
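Once the data is in long format, a single question can be pulled out with a filter and, if needed, turned back into a wide table. This is a rough sketch on a smaller stand-in data frame; question1_wide is just an illustrative name, and the names_prefix value assumes the original columns were called Question1a, Question1b, and so on.

```r
library(dplyr)
library(tidyr)

# Small stand-in with the same naming scheme as square_df.
square_df <- data.frame(
  ID         = c(1, 2, 3),
  Question1a = c(1, 1, NA),
  Question1b = c(NA, 1, NA),
  Question2a = c(1, NA, NA)
)

long_df <- square_df %>%
  pivot_longer(-ID,
               names_to = c("Question", "Item"),
               names_pattern = "Question(\\d+)(\\w+)")

# Keep only question 1 and rebuild its original wide columns.
question1_wide <- long_df %>%
  filter(Question == "1") %>%
  select(-Question) %>%
  pivot_wider(names_from = Item, values_from = value,
              names_prefix = "Question1")

question1_wide
```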
You could also use map to store each dataframe in a list, e.g.
library(dplyr) # for select() and starts_with()
library(purrr)
# 3 = number of questions
map(c(1:3),
function(x){
quest <- paste0("Question",x)
select(df, ID, starts_with(quest))
})
Output:
[[1]]
# A tibble: 8 x 4
ID Question1a Question1b Question1c
<dbl> <dbl> <dbl> <dbl>
1 1 1 NA 1
2 2 1 1 1
3 3 NA NA NA
4 4 NA 1 NA
5 5 1 NA 1
6 6 1 1 NA
7 7 1 NA NA
8 8 NA 1 NA
[[2]]
# A tibble: 8 x 3
ID Question2a Question2b
<dbl> <dbl> <dbl>
1 1 1 NA
2 2 NA 1
3 3 NA NA
4 4 NA 1
5 5 1 NA
6 6 1 NA
7 7 NA NA
8 8 NA NA
[[3]]
# A tibble: 8 x 3
ID Question3a Question3b
<dbl> <dbl> <dbl>
1 1 NA NA
2 2 NA NA
3 3 NA 1
4 4 NA 1
5 5 1 NA
6 6 1 NA
7 7 1 NA
8 8 NA NA
I found a really intuitive solution using the dplyr package, with the select and starts_with functions. Alternatively, you can replace starts_with with contains if the related variables are identified not by a prefix but by some other common feature.
Q1 <- Survey %>%
select(
starts_with("Question1")
)
Q2 <- Survey %>%
select(
starts_with("Question2")
)
Q3 <- Survey %>%
select(
starts_with("Question3")
)
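One thing to watch with prefix matching: if the survey has ten or more questions, starts_with("Question1") would also pick up Question10, Question11, and so on. A possible way around that is matches() with an anchored regular expression. The sketch below uses a made-up survey data frame and assumes that item suffixes are single lowercase letters.

```r
library(dplyr)

# Hypothetical data with a Question1 and a Question10 block.
survey <- data.frame(
  ID          = 1:2,
  Question1a  = c(1, NA),
  Question1b  = c(NA, 1),
  Question10a = c(1, 1)
)

# Prefix match: also selects Question10a.
loose <- survey %>% select(starts_with("Question1"))

# Anchored regex: "Question1" followed by exactly one letter.
strict <- survey %>% select(matches("^Question1[a-z]$"))

names(loose)
names(strict)
```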
How can I count non-NA values per row including columns from "b" to "c"?
library(tidyverse)
d = tibble(a = c("Tom", "Mary", "Ben", "Jane", "Lucas", "Mark"),
b = c(NA, 3, 6, NA, 5, NA),
c = c(2, NA, 6, 7, 1, 9))
d
Output should have an extra column with values as follows: 1, 1, 2, 1, 2, 1
Tidyverse solutions are especially appreciated!
A possible solution:
library(dplyr)
d %>%
mutate(result = rowSums(!is.na(across(b:c))))
#> # A tibble: 6 × 4
#> a b c result
#> <chr> <dbl> <dbl> <dbl>
#> 1 Tom NA 2 1
#> 2 Mary 3 NA 1
#> 3 Ben 6 6 2
#> 4 Jane NA 7 1
#> 5 Lucas 5 1 2
#> 6 Mark NA 9 1
Using select instead of across:
library(tidyverse)
d = tibble(a = c("Tom", "Mary", "Ben", "Jane", "Lucas", "Mark"),
b = c(NA, 3, 6, NA, 5, NA),
c = c(2, NA, 6, 7, 1, 9))
d %>%
mutate(output = rowSums(!is.na(select(., -a))))
#> # A tibble: 6 × 4
#> a b c output
#> <chr> <dbl> <dbl> <dbl>
#> 1 Tom NA 2 1
#> 2 Mary 3 NA 1
#> 3 Ben 6 6 2
#> 4 Jane NA 7 1
#> 5 Lucas 5 1 2
#> 6 Mark NA 9 1
Created on 2022-07-15 by the reprex package (v2.0.1)
base R option:
library(tidyverse)
d = tibble(a = c("Tom", "Mary", "Ben", "Jane", "Lucas", "Mark"),
b = c(NA, 3, 6, NA, 5, NA),
c = c(2, NA, 6, 7, 1, 9))
d$output <- apply(d[2:3], 1, function(x) sum(!is.na(x)))
d
#> # A tibble: 6 × 4
#> a b c output
#> <chr> <dbl> <dbl> <int>
#> 1 Tom NA 2 1
#> 2 Mary 3 NA 1
#> 3 Ben 6 6 2
#> 4 Jane NA 7 1
#> 5 Lucas 5 1 2
#> 6 Mark NA 9 1
Created on 2022-07-15 by the reprex package (v2.0.1)
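For completeness, the same count can be obtained in base R without apply, since is.na() on a data frame returns a logical matrix that rowSums() can sum directly. A minimal sketch, using a plain data.frame instead of a tibble:

```r
d <- data.frame(
  a = c("Tom", "Mary", "Ben", "Jane", "Lucas", "Mark"),
  b = c(NA, 3, 6, NA, 5, NA),
  c = c(2, NA, 6, 7, 1, 9)
)

# !is.na() gives TRUE for every non-missing cell; rowSums() counts them per row.
d$output <- rowSums(!is.na(d[c("b", "c")]))
d$output
```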
I have a huge list samplelist containing various tibbles (in the simplified sample called Alpha, Beta and Gamma). Each tibble contains various elements (in the sample called sample_0 and sample_1). However, not each tibble contains each element (Gamma contains only sample_0, but not sample_1). What I would like to do is to rename the elements based on a condition: if there is an element sample_1 in a tibble, rename it to sampling. However, if the tibble does not contain sample_1, rename sample_0 to sampling (so that the list now contains an element called sampling for each tibble).
samplelist <- list(Alpha = structure(list(sample_0 = c(3, NA, 7, 9, 2),
sample_1 = c(NA, 8, 5, 4, NA)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")),
Beta = structure (list(sample_0 = c(2, 9, NA, 3, 7),
sample_1 = c(3, 7, 9, 3, NA)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")),
Gamma = structure(list(sample_0 = c(NA, NA, 4, 6, 3)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")))
Does anybody know how to get the following desired output?
samplelist
$Alpha
# A tibble: 5 x 2
sample_0 sampling
<dbl> <dbl>
1 3 NA
2 NA 8
3 7 5
4 9 4
5 2 NA
$Beta
# A tibble: 5 x 2
sample_0 sampling
<dbl> <dbl>
1 2 3
2 9 7
3 NA 9
4 3 3
5 7 NA
$Gamma
# A tibble: 5 x 1
sampling
<dbl>
1 NA
2 NA
3 4
4 6
5 3
EDIT
With the code provided by @akrun:
map(errorlist, ~ if(ncol(.x) == 1 && names(.x) == 'sample_0')
setNames(.x, 'sampling') else
rename_with(.x, ~ 'sampling', matches('sample_1')))
I got the desired output for my samplelist. However, if there's more than one group in Gamma, the (adjusted) code only works for Alpha and Beta, yet leaves Gamma unchanged (Delta added from before editing):
errorlist <- list(Alpha = structure(list(sample_0 = c(3, NA, 7, 9, 2),
sample_1 = c(NA, 8, 5, 4, NA),
sample_2 = c(7, 3, 5, NA, NA)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")),
Beta = structure (list(sample_0 = c(2, 9, NA, 3, 7),
sample_1 = c(3, 7, 9, 3, NA),
sample_2 = c(4, 2, 6, 4, 6)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")),
Gamma = structure(list(sample_0 = c(NA, NA, 4, 6, 3),
sample_1 = c(3, 7, 3, NA, 8)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")),
Delta = structure (list(error = c(3, 7, 9, 3, NA)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")))
map(errorlist, ~ if(ncol(.x) == 1 && names(.x) == 'sample_1')
setNames(.x, 'sampling') else
rename_with(.x, ~ 'sampling', matches('sample_2')))
Output:
$Alpha
# A tibble: 5 x 3
sample_0 sample_1 sampling
<dbl> <dbl> <dbl>
1 3 NA 7
2 NA 8 3
3 7 5 5
4 9 4 NA
5 2 NA NA
$Beta
# A tibble: 5 x 3
sample_0 sample_1 sampling
<dbl> <dbl> <dbl>
1 2 3 4
2 9 7 2
3 NA 9 6
4 3 3 4
5 7 NA 6
$Gamma
# A tibble: 5 x 2
sample_0 sample_1
<dbl> <dbl>
1 NA 3
2 NA 7
3 4 3
4 6 NA
5 3 8
$Delta
# A tibble: 5 x 1
error
<dbl>
1 3
2 7
3 9
4 3
5 NA
Here is an option: loop over the list with map and apply the change conditionally (if/else). Here we use errorlist because it is more general; the same code also works with samplelist.
library(dplyr)
library(purrr)
map(errorlist, ~ if(ncol(.x) == 1 && names(.x) == 'sample_0')
setNames(.x, 'sampling') else
rename_with(.x, ~ 'sampling', matches('sample_1')))
-output
$Alpha
# A tibble: 5 × 2
sample_0 sampling
<dbl> <dbl>
1 3 NA
2 NA 8
3 7 5
4 9 4
5 2 NA
$Beta
# A tibble: 5 × 2
sample_0 sampling
<dbl> <dbl>
1 2 3
2 9 7
3 NA 9
4 3 3
5 7 NA
$Gamma
# A tibble: 5 × 1
sampling
<dbl>
1 NA
2 NA
3 4
4 6
5 3
$Delta
# A tibble: 5 × 1
error
<dbl>
1 3
2 7
3 9
4 3
5 NA
Update
Based on the OP's comments
lapply(errorlist, \(x) {
nm1 <- stringr::str_subset(names(x), "^sample_\\d+$")
i1 <- which.max(as.numeric(stringr::str_extract(nm1,
"(?<=sample_)\\d+")))
if(length(i1) > 0) names(x)[names(x) == nm1[i1]] <- "sampling"
x})
-output
$Alpha
# A tibble: 5 × 3
sample_0 sample_1 sampling
<dbl> <dbl> <dbl>
1 3 NA 7
2 NA 8 3
3 7 5 5
4 9 4 NA
5 2 NA NA
$Beta
# A tibble: 5 × 3
sample_0 sample_1 sampling
<dbl> <dbl> <dbl>
1 2 3 4
2 9 7 2
3 NA 9 6
4 3 3 4
5 7 NA 6
$Gamma
# A tibble: 5 × 2
sample_0 sampling
<dbl> <dbl>
1 NA 3
2 NA 7
3 4 3
4 6 NA
5 3 8
$Delta
# A tibble: 5 × 1
error
<dbl>
1 3
2 7
3 9
4 3
5 NA
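The key step in the update is picking the sample_ column with the highest numeric suffix. Here is that step in isolation, as a base R sketch; pick_last_sample is a hypothetical helper, not part of any package. The suffix is converted to a number before comparing, so sample_10 ranks above sample_9.

```r
# Hypothetical helper: return the sample_<n> name with the largest n,
# or NA if no such name exists.
pick_last_sample <- function(nms) {
  candidates <- grep("^sample_[0-9]+$", nms, value = TRUE)
  if (length(candidates) == 0) return(NA_character_)
  suffix <- as.numeric(sub("^sample_", "", candidates))
  candidates[which.max(suffix)]
}

pick_last_sample(c("sample_0", "sample_1", "sample_2"))
pick_last_sample(c("sample_0", "sample_10", "sample_9"))
pick_last_sample("error")
```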
Say I have the following dataframe:
ABC1_old <- c(1, 5, 3, 4, 3, NA, NA, NA, NA, NA)
ABC2_old <- c(4, 2, 1, 1, 5, NA, NA, NA, NA, NA)
ABC1_adj <- c(NA, NA, NA, NA, NA, 5, 5, 1, 2, 4)
ABC2_adj <- c(NA, NA, NA, NA, NA, 3, 2, 1, 4, 2)
df <- data.frame(ABC1_old, ABC2_old, ABC1_adj, ABC2_adj)
I want to create a column that compares each pair of ABCn_old with its corresponding ABCn_adj. (So ABC1_old would be compared against ABC1_adj, etc.) The resulting column would be called ABCn_new. The evaluation would be that if ABCn_old is NA, fill in the blank with the corresponding value in ABCn_adj, otherwise use ABCn_old's value. The new columns would look like this:
df$ABC1_new <- c(1, 5, 3, 4, 3, 5, 5, 1, 2, 4)
df$ABC2_new <- c(4, 2, 1, 1, 5, 3, 2, 1, 4, 2)
I know a simple mutate could work here, but I would like to use some kind of tidyverse looping via purrr if possible since the dataset is much larger in reality. Any ideas for the best way to achieve this?
library(tidyverse) # purrr::map_dfc, stringr::str_remove, dplyr::coalesce
map_dfc(split.default(df, str_remove(names(df), "_.*")), ~coalesce(!!!.x))
# A tibble: 10 x 2
ABC1 ABC2
<dbl> <dbl>
1 1 4
2 5 2
3 3 1
4 4 1
5 3 5
6 5 3
7 5 2
8 1 1
9 2 4
10 4 2
Putting it together:
df %>%
split.default(str_replace(names(.), "_.*", "_new")) %>%
map_dfc(~coalesce(!!!.x))%>%
cbind(df, .)
ABC1_old ABC2_old ABC1_adj ABC2_adj ABC1_new ABC2_new
1 1 4 NA NA 1 4
2 5 2 NA NA 5 2
3 3 1 NA NA 3 1
4 4 1 NA NA 4 1
5 3 5 NA NA 3 5
6 NA NA 5 3 5 3
7 NA NA 5 2 5 2
8 NA NA 1 1 1 1
9 NA NA 2 4 2 4
10 NA NA 4 2 4 2
Using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_to = c(".value", 'grp'),
names_sep = '_', values_drop_na = TRUE) %>%
select(-grp, -rn) %>%
rename_all(~ str_c(., '_new')) %>% bind_cols(df, .)
# ABC1_old ABC2_old ABC1_adj ABC2_adj ABC1_new ABC2_new
#1 1 4 NA NA 1 4
#2 5 2 NA NA 5 2
#3 3 1 NA NA 3 1
#4 4 1 NA NA 4 1
#5 3 5 NA NA 3 5
#6 NA NA 5 3 5 3
#7 NA NA 5 2 5 2
#8 NA NA 1 1 1 1
#9 NA NA 2 4 2 4
#10 NA NA 4 2 4 2
Or using dplyr
# Name each result ABCn_new (with '{.col}_new' it would be ABCn_old_new)
df %>%
mutate(across(ends_with('old'),
~ coalesce(., get(str_replace(cur_column(),
'old', 'adj'))), .names = "{sub('_old$', '_new', .col)}"))
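To make the pairing logic explicit, here is the same idea written as a plain base R loop. It is only a sketch on a shortened version of the data and assumes every _old column has a matching _adj column with the same prefix.

```r
# Shortened version of the example data.
df <- data.frame(
  ABC1_old = c(1, 5, NA, NA),
  ABC2_old = c(4, 2, NA, NA),
  ABC1_adj = c(NA, NA, 5, 1),
  ABC2_adj = c(NA, NA, 3, 2)
)

old_cols <- grep("_old$", names(df), value = TRUE)

for (old in old_cols) {
  adj <- sub("_old$", "_adj", old)   # partner column, e.g. ABC1_adj
  new <- sub("_old$", "_new", old)   # result column, e.g. ABC1_new
  # Take the _old value unless it is NA, in which case fall back to _adj.
  df[[new]] <- ifelse(is.na(df[[old]]), df[[adj]], df[[old]])
}

df
```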
I have a package on GitHub to solve this and similar problems. In this case we could use dplyover::across2 to apply one (or more) functions to two sets of columns, which can be selected with tidyselect. In the .names argument we can specify "{pre}" to refer to the common prefix of both sets of columns.
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
df %>%
mutate(across2(ends_with("_old"),
ends_with("_adj"),
~ coalesce(.x, .y),
.names = "{pre}_new"))
#> ABC1_old ABC2_old ABC1_adj ABC2_adj ABC1_new ABC2_new
#> 1 1 4 NA NA 1 4
#> 2 5 2 NA NA 5 2
#> 3 3 1 NA NA 3 1
#> 4 4 1 NA NA 4 1
#> 5 3 5 NA NA 3 5
#> 6 NA NA 5 3 5 3
#> 7 NA NA 5 2 5 2
#> 8 NA NA 1 1 1 1
#> 9 NA NA 2 4 2 4
#> 10 NA NA 4 2 4 2
Created on 2021-05-16 by the reprex package (v0.3.0)
I have two columns of data like this:
I want to add a column or modify the second column resulting in a sequence of integers starting with 1, wherever the 1 already appears. Result changes to:
I can do this with a loop, but what is the "right" R way of doing it?
Here's my loop:
for(i in 1:length(df2$col2)) {
df2$col3[i] <- ifelse(df2$col2[i] == 1, 1, df2$col3[i - 1] + 1)
if(is.na(df2$col2[i])) df2$col3[i] <- df2$col3[i - 1] + 1
}
Here is a sample data set with 20 rows:
478.69, 320.45, 503.7, 609.3, 478.19, 419.633683050051, 552.939975773916,
785.119385505095, 18.2542654918507, 98.6469651805237, 132.587260054424,
697.119552921504, 512.560374778695, 916.425200179219, 14.3385051051155
), col2 = c(1, NA, 1, NA, NA, 1, NA, 1, NA, NA, NA, NA, 1, NA,
NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-20L))
I don't know if this is the way to do it, but it's one way:
df$col3 <- unlist(sapply(diff(c(which(!is.na(df$col2)), nrow(df) + 1)), seq))
df
#> col1 col2 col3
#> 1 478.69000 1 1
#> 2 320.45000 NA 2
#> 3 503.70000 1 1
#> 4 609.30000 NA 2
#> 5 478.19000 NA 3
#> 6 478.69000 1 1
#> 7 320.45000 NA 2
#> 8 503.70000 1 1
#> 9 609.30000 NA 2
#> 10 478.19000 NA 3
#> 11 419.63368 NA 4
#> 12 552.93998 NA 5
#> 13 785.11939 1 1
#> 14 18.25427 NA 2
#> 15 98.64697 NA 3
#> 16 132.58726 NA 4
#> 17 697.11955 NA 5
#> 18 512.56037 NA 6
#> 19 916.42520 NA 7
#> 20 14.33851 NA 8
Note that the first 5 values of col1 were missing from your dput, so I added the second 5 numbers twice - they're not relevant to the question anyway.
Data
df <- structure(list(col1 = c(478.69, 320.45, 503.7, 609.3, 478.19,
478.69, 320.45, 503.7, 609.3, 478.19, 419.633683050051, 552.939975773916,
785.119385505095, 18.2542654918507, 98.6469651805237, 132.587260054424,
697.119552921504, 512.560374778695, 916.425200179219, 14.3385051051155
), col2 = c(1, NA, 1, NA, NA, 1, NA, 1, NA, NA, NA, NA, 1, NA,
NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-20L))
df
#> col1 col2
#> 1 478.69000 1
#> 2 320.45000 NA
#> 3 503.70000 1
#> 4 609.30000 NA
#> 5 478.19000 NA
#> 6 478.69000 1
#> 7 320.45000 NA
#> 8 503.70000 1
#> 9 609.30000 NA
#> 10 478.19000 NA
#> 11 419.63368 NA
#> 12 552.93998 NA
#> 13 785.11939 1
#> 14 18.25427 NA
#> 15 98.64697 NA
#> 16 132.58726 NA
#> 17 697.11955 NA
#> 18 512.56037 NA
#> 19 916.42520 NA
#> 20 14.33851 NA
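Another way to build the counter, sketched here in base R: cumsum() over the non-NA positions gives every run its own group number, and ave() then numbers the rows within each group. This assumes, as in the sample data, that col2 starts with a 1.

```r
col2 <- c(1, NA, 1, NA, NA, 1, NA, 1, NA, NA, NA, NA, 1, NA,
          NA, NA, NA, NA, NA, NA)

# Each non-NA value starts a new group: 1 1 2 2 2 3 3 4 ...
group <- cumsum(!is.na(col2))

# Number the rows within each group, restarting at 1.
col3 <- ave(seq_along(col2), group, FUN = seq_along)
col3
```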
I have a dataframe with several columns, and I create a new column that randomly samples a single value from any of the other columns. How can I trace back which column the value came from?
I've seen the exact same question and solution here, but it's in Python, and I couldn't find an R equivalent.
Data 1 :: each row has different values across columns
df_uniques <-
data.frame(
col_a = c(2, 2, 5, 5, 3),
col_b = c(NA, 4, 2, 3, 1),
col_c = c(4, 5, 3, 1, 2),
col_d = c(1, NA, 4, 2, 4),
col_e = c(3, 3, 1, 4, 5)
)
> df_uniques
## col_a col_b col_c col_d col_e
## 1 2 NA 4 1 3
## 2 2 4 5 NA 3
## 3 5 2 3 4 1
## 4 5 3 1 2 4
## 5 3 1 2 4 5
Mutate a new column to sample from any of the previous columns
library(dplyr)
set.seed(2020)
df_uniques %>%
rowwise() %>%
mutate(sampled = sample(c(col_a, col_b, col_c, col_d, col_e), size = n()))
## # A tibble: 5 x 6
## # Rowwise:
## col_a col_b col_c col_d col_e sampled
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 NA 4 1 3 1
## 2 2 4 5 NA 3 NA
## 3 5 2 3 4 1 5
## 4 5 3 1 2 4 5
## 5 3 1 2 4 5 4
Data 2 :: each row has duplicating values across columns
df_duplicates <-
data.frame(
col_a = c(1, 4, 2, 5, 2),
col_b = c(NA, 4, NA, 3, 1),
col_c = c(4, NA, 5, NA, NA),
col_d = c(1, NA, NA, 2, NA),
col_e = c(2, 3, NA, NA, 5)
)
> df_duplicates
## col_a col_b col_c col_d col_e
## 1 1 NA 4 1 2
## 2 4 4 NA NA 3
## 3 2 NA 5 NA NA
## 4 5 3 NA 2 NA
## 5 2 1 NA NA 5
Mutate a new column to sample from any of the previous columns
set.seed(2020)
df_duplicates %>%
rowwise() %>%
mutate(sampled = sample(c(col_a, col_b, col_c, col_d, col_e), size = n()))
## # A tibble: 5 x 6
## # Rowwise:
## col_a col_b col_c col_d col_e sampled
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 NA 4 1 2 NA
## 2 4 4 NA NA 3 4
## 3 2 NA 5 NA NA NA
## 4 5 3 NA 2 NA 3
## 5 2 1 NA NA 5 1
Tracing back: which column is the origin of sampled?
Desired Output (Data 1 :: uniques)
# A tibble: 5 x 7
# Rowwise:
col_a col_b col_c col_d col_e sampled origin_col
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 2 NA 4 1 3 1 col_d
2 2 4 5 NA 3 NA col_d
3 5 2 3 4 1 5 col_a
4 5 3 1 2 4 5 col_a
5 3 1 2 4 5 4 col_d
Desired Output (Data 2 :: duplicates)
# A tibble: 5 x 7
# Rowwise:
col_a col_b col_c col_d col_e sampled origin_col
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 NA 4 1 2 1 col_a, col_d
2 4 4 NA NA 3 NA col_c, col_d
3 2 NA 5 NA NA 2 col_a
4 5 3 NA 2 NA 5 col_a
5 2 1 NA NA 5 NA col_c, col_d
Are you looking for something like this?
cols <- c("col_a", "col_b", "col_c", "col_d", "col_e")
workflow <-
. %>%
rowwise() %>%
mutate(
sampled = sample(c_across(!!cols), 1L),
origin_col = toString(cols[which(c_across(!!cols) %in% sampled)])
)
Output
> set.seed(2020L); workflow(df_uniques)
# A tibble: 5 x 7
# Rowwise:
col_a col_b col_c col_d col_e sampled origin_col
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 2 NA 4 1 3 1 col_d
2 2 4 5 NA 3 NA col_d
3 5 2 3 4 1 5 col_a
4 5 3 1 2 4 5 col_a
5 3 1 2 4 5 4 col_d
> set.seed(2020L); workflow(df_duplicates)
# A tibble: 5 x 7
# Rowwise:
col_a col_b col_c col_d col_e sampled origin_col
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 NA 4 1 2 1 col_a, col_d
2 4 4 NA NA 3 NA col_c, col_d
3 2 NA 5 NA NA 2 col_a
4 5 3 NA 2 NA 5 col_a
5 2 1 NA NA 5 NA col_c, col_d
Method 1: create a temporary variable for your selected columns
workflow <-
. %>%
rowwise() %>%
mutate(
d = across(starts_with("col_")),
sampled = sample(c_across(names(d)), 1L),
origin_col = toString(names(d)[which(c_across(names(d)) %in% sampled)]),
d = NULL
)
Method 2: wrap everything in a function
workflow <- function(df) {
cols <- names(df)
cols <- cols[starts_with("col_", vars = cols)]
# or cols <- cols[startsWith(cols, "col_")]
# or cols <- cols[grepl("^col_", cols)]
# ...
df %>%
rowwise() %>%
mutate(
sampled = sample(c_across(!!cols), 1L),
origin_col = toString(cols[which(c_across(!!cols) %in% sampled)]),
)
}
I prefer the second method as it is more flexible.
One option could be:
df_duplicates %>%
rowwise() %>%
mutate(sampled = sample(c_across(col_a:col_e), size = n()),
origin_col = if(is.na(sampled))
toString(names(.)[which(is.na(c_across(col_a:col_e)))]) else
toString(names(.)[which(c_across(col_a:col_e) == sampled)]))
col_a col_b col_c col_d col_e sampled origin_col
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 NA 4 1 2 1 col_a, col_d
2 4 4 NA NA 3 4 col_a, col_b
3 2 NA 5 NA NA NA col_b, col_d, col_e
4 5 3 NA 2 NA NA col_c, col_e
5 2 1 NA NA 5 2 col_a
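If a single, unambiguous origin per row is enough, a different approach (not the one used in the answers above) is to sample the column position instead of the value and then look the value up. Duplicated values and NAs then need no special handling. A base R sketch; idx and values are illustrative names.

```r
df_duplicates <- data.frame(
  col_a = c(1, 4, 2, 5, 2),
  col_b = c(NA, 4, NA, 3, 1),
  col_c = c(4, NA, 5, NA, NA),
  col_d = c(1, NA, NA, 2, NA),
  col_e = c(2, 3, NA, NA, 5)
)

cols   <- names(df_duplicates)
values <- as.matrix(df_duplicates[cols])

set.seed(2020)
# One randomly chosen column position per row.
idx <- sample(length(cols), nrow(df_duplicates), replace = TRUE)

# Matrix indexing with (row, column) pairs picks one cell per row.
df_duplicates$sampled    <- values[cbind(seq_len(nrow(values)), idx)]
df_duplicates$origin_col <- cols[idx]

df_duplicates
```

Unlike the answers above, this reports only the column that was actually drawn, not every column that happens to hold the same value.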