Rowwise, how to specify which column a certain value is from?


I have a data frame with several columns, and I create a new column that randomly samples a single value from one of the other columns in each row. How can I trace back which column the value came from?
I've seen the exact same question and solution here, but it's in Python, and I couldn't find an R equivalent.
Data 1 :: each row has different values across columns
df_uniques <-
data.frame(
col_a = c(2, 2, 5, 5, 3),
col_b = c(NA, 4, 2, 3, 1),
col_c = c(4, 5, 3, 1, 2),
col_d = c(1, NA, 4, 2, 4),
col_e = c(3, 3, 1, 4, 5)
)
> df_uniques
## col_a col_b col_c col_d col_e
## 1 2 NA 4 1 3
## 2 2 4 5 NA 3
## 3 5 2 3 4 1
## 4 5 3 1 2 4
## 5 3 1 2 4 5
Mutate a new column to sample from any of the previous columns
library(dplyr)
set.seed(2020)
df_uniques %>%
rowwise() %>%
mutate(sampled = sample(c(col_a, col_b, col_c, col_d, col_e), size = n()))
## # A tibble: 5 x 6
## # Rowwise:
## col_a col_b col_c col_d col_e sampled
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 NA 4 1 3 1
## 2 2 4 5 NA 3 NA
## 3 5 2 3 4 1 5
## 4 5 3 1 2 4 5
## 5 3 1 2 4 5 4
Data 2 :: each row has duplicating values across columns
df_duplicates <-
data.frame(
col_a = c(1, 4, 2, 5, 2),
col_b = c(NA, 4, NA, 3, 1),
col_c = c(4, NA, 5, NA, NA),
col_d = c(1, NA, NA, 2, NA),
col_e = c(2, 3, NA, NA, 5)
)
> df_duplicates
## col_a col_b col_c col_d col_e
## 1 1 NA 4 1 2
## 2 4 4 NA NA 3
## 3 2 NA 5 NA NA
## 4 5 3 NA 2 NA
## 5 2 1 NA NA 5
Mutate a new column to sample from any of the previous columns
set.seed(2020)
df_duplicates %>%
rowwise() %>%
mutate(sampled = sample(c(col_a, col_b, col_c, col_d, col_e), size = n()))
## # A tibble: 5 x 6
## # Rowwise:
## col_a col_b col_c col_d col_e sampled
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 NA 4 1 2 NA
## 2 4 4 NA NA 3 4
## 3 2 NA 5 NA NA NA
## 4 5 3 NA 2 NA 3
## 5 2 1 NA NA 5 1
Tracing back: which column is the origin of sampled?
Desired Output (Data 1 :: uniques)
# A tibble: 5 x 7
# Rowwise:
col_a col_b col_c col_d col_e sampled origin_col
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 2 NA 4 1 3 1 col_d
2 2 4 5 NA 3 NA col_d
3 5 2 3 4 1 5 col_a
4 5 3 1 2 4 5 col_a
5 3 1 2 4 5 4 col_d
Desired Output (Data 2 :: duplicates)
# A tibble: 5 x 7
# Rowwise:
col_a col_b col_c col_d col_e sampled origin_col
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 NA 4 1 2 1 col_a, col_d
2 4 4 NA NA 3 NA col_c, col_d
3 2 NA 5 NA NA 2 col_a
4 5 3 NA 2 NA 5 col_a
5 2 1 NA NA 5 NA col_c, col_d

Are you looking for something like this?
cols <- c("col_a", "col_b", "col_c", "col_d", "col_e")
workflow <-
  . %>%
  rowwise() %>%
  mutate(
    sampled = sample(c_across(!!cols), 1L),
    origin_col = toString(cols[which(c_across(!!cols) %in% sampled)])
  )
Output
> set.seed(2020L); workflow(df_uniques)
# A tibble: 5 x 7
# Rowwise:
col_a col_b col_c col_d col_e sampled origin_col
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 2 NA 4 1 3 1 col_d
2 2 4 5 NA 3 NA col_d
3 5 2 3 4 1 5 col_a
4 5 3 1 2 4 5 col_a
5 3 1 2 4 5 4 col_d
> set.seed(2020L); workflow(df_duplicates)
# A tibble: 5 x 7
# Rowwise:
col_a col_b col_c col_d col_e sampled origin_col
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 NA 4 1 2 1 col_a, col_d
2 4 4 NA NA 3 NA col_c, col_d
3 2 NA 5 NA NA 2 col_a
4 5 3 NA 2 NA 5 col_a
5 2 1 NA NA 5 NA col_c, col_d
Method 1: create a temporary variable for your selected columns
workflow <-
  . %>%
  rowwise() %>%
  mutate(
    # temporary list-column holding the selected columns; removed at the end
    d = across(starts_with("col_")),
    sampled = sample(c_across(names(d)), 1L),
    origin_col = toString(names(d)[which(c_across(names(d)) %in% sampled)]),
    d = NULL
  )
Method 2: wrap everything in a function
workflow <- function(df) {
  cols <- names(df)
  cols <- cols[starts_with("col_", vars = cols)]
  # or cols <- cols[startsWith(cols, "col_")]
  # or cols <- cols[grepl("^col_", cols)]
  # ...
  df %>%
    rowwise() %>%
    mutate(
      sampled = sample(c_across(!!cols), 1L),
      origin_col = toString(cols[which(c_across(!!cols) %in% sampled)])
    )
}
I prefer the second method as it is more flexible.
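If the goal is to know exactly which column was drawn, rather than every column that happens to hold the same value, one option is to sample the column position first and derive both the value and the name from it. This is a minimal base-R sketch, not the method used above; `df_demo`, `idx` and `m` are illustrative names and the data is a made-up subset shaped like df_duplicates.

```r
# Illustrative data (hypothetical, shaped like df_duplicates above)
df_demo <- data.frame(
  col_a = c(1, 4, 2),
  col_b = c(NA, 4, NA),
  col_c = c(4, NA, 5)
)
cols <- c("col_a", "col_b", "col_c")

set.seed(2020)
# Draw one column position per row
idx <- sample(length(cols), nrow(df_demo), replace = TRUE)

# Indexing a matrix with (row, column) pairs picks one cell per row
m <- as.matrix(df_demo[cols])
df_demo$sampled <- m[cbind(seq_len(nrow(df_demo)), idx)]
df_demo$origin_col <- cols[idx]

df_demo
```

Because the position is recorded, duplicated values in a row no longer give an ambiguous origin. The trade-off is that the random numbers are consumed differently than in the rowwise version, so results for a given seed will not match the outputs shown above.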

One option could be:
df_duplicates %>%
  rowwise() %>%
  mutate(
    sampled = sample(c_across(col_a:col_e), size = n()),
    origin_col = if (is.na(sampled)) {
      toString(names(.)[which(is.na(c_across(col_a:col_e)))])
    } else {
      toString(names(.)[which(c_across(col_a:col_e) == sampled)])
    }
  )
col_a col_b col_c col_d col_e sampled origin_col
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 NA 4 1 2 1 col_a, col_d
2 4 4 NA NA 3 4 col_a, col_b
3 2 NA 5 NA NA NA col_b, col_d, col_e
4 5 3 NA 2 NA NA col_c, col_e
5 2 1 NA NA 5 2 col_a
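The separate is.na() branch in this answer is needed because == never returns TRUE when one side is NA, whereas %in% (used in the first answer) does match NA against NA. A small sketch with an illustrative vector `x`:

```r
x <- c(1, NA, 4, 1, 2)

which(x == NA)    # integer(0): every comparison with NA yields NA
which(x %in% NA)  # 2: %in% treats NA as a matchable value
which(is.na(x))   # 2
which(x %in% 1)   # 1 4
```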

Related

Extract mismatch by groups

I have a data frame like this:
ID  col1  col2
AB  1     3
AB  1     3
CD  2     4
CD  2     3
I would like to compare the rows within each ID.
For each column whose values differ within an ID, add a column that records the mismatching values.
Output:
ID  col1  col2  mismatch_extract_col1  mismatch_extract_col2
AB  1     3     NA                     NA
AB  1     3     NA                     NA
CD  2     4     NA                     4:3
CD  2     3     NA                     4:3

You can use n_distinct() == 1 to test whether all values of a column agree within each ID group; when they do not, there is a mismatch.
library(dplyr)
df %>%
  mutate(across(col1:col2, ~ if_else(n_distinct(.x) == 1, NA, toString(.x)),
                .names = "mismatch_extract_{.col}"),
         .by = ID)
# # A tibble: 4 × 5
# ID col1 col2 mismatch_extract_col1 mismatch_extract_col2
# <chr> <int> <int> <lgl> <chr>
# 1 AB 1 3 NA NA
# 2 AB 1 3 NA NA
# 3 CD 2 4 NA 4, 3
# 4 CD 2 3 NA 4, 3
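The question does not include code to build the data, so here is a self-contained base-R sketch of the same idea, assuming the four-row example above. `mismatch_or_na` is a hypothetical helper name, and ave() is swapped in for the grouped mutate().

```r
df <- data.frame(
  ID   = c("AB", "AB", "CD", "CD"),
  col1 = c(1L, 1L, 2L, 2L),
  col2 = c(3L, 3L, 4L, 3L)
)

# NA when all values in the group agree, otherwise the values joined by ", "
mismatch_or_na <- function(x) {
  if (length(unique(x)) == 1) NA_character_ else toString(x)
}

for (col in c("col1", "col2")) {
  # ave() applies the helper per ID and recycles the single result over the group
  df[[paste0("mismatch_extract_", col)]] <-
    ave(as.character(df[[col]]), df$ID, FUN = mismatch_or_na)
}

df
```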

How to Filter by group and move all values to new column if any value in any of the affected columns is greater than 5 in R

I have a data frame like this:
dt <- tibble(
TRIAL = c("A", "A", "A", "B", "B", "B", "C", "C", "C","D","D","D"),
RL = c(1, NA, 3, 1, 6, 3, 2, 3, 1, 0, 1.5, NA),
SL = c(6, 1.5, 1, 0, 0, 1, 1, 2, 0, 1, 1.5, NA),
HC = c(0, 1, 5, 6,7, 8, 9, 3, 4, 5, 4, 2)
)
# A tibble: 12 x 4
TRIAL RL SL HC
<chr> <dbl> <dbl> <dbl>
1 A 1 6 0
2 A NA 1.5 1
3 A 3 1 5
4 B 1 0 6
5 B 6 0 7
6 B 3 1 8
7 C 2 1 9
8 C 3 2 3
9 C 1 0 4
10 D 0 1 5
11 D 1.5 1.5 4
12 D NA NA 2
I want to group the data frame by TRIAL and check the values in RL and SL within each group. If any value in either column is greater than 5, all RL and SL values for that group should be moved to RLCT and SLCT, respectively.
# A tibble: 12 x 6
TRIAL HC RLCT SLCT SL RL
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0 1 6 NA NA
2 A 1 NA 1.5 NA NA
3 A 5 3 1 NA NA
4 B 6 1 0 NA NA
5 B 7 6 0 NA NA
6 B 8 3 1 NA NA
7 C 9 NA NA 1 3
8 C 3 NA NA 3 5
9 C 4 NA NA 1 1
10 D 5 NA NA 1 0
11 D 4 NA NA 1.5 1.5
12 D 2 NA NA NA NA
When I run the code below, I do not get the expected output:
dt0 <- dt %>%
mutate(RLCT = NA,
SLCT = NA) %>%
group_by(TRIAL) %>%
filter(!any(RL > 5.0 | SL > 5.0))
dt1 <- dt %>%
group_by(TRIAL) %>%
filter(any(RL > 5.0 | SL > 5.0)) %>%
mutate(RLCT = RL,
SLCT = SL) %>%
rbind(dt0, .) %>%
mutate(RL = ifelse(!is.na(RLCT), NA, RL),
SL = ifelse(!is.na(SLCT), NA, SL)) %>% arrange(TRIAL)
This is what I get
# A tibble: 9 x 6
# Groups: TRIAL [3]
TRIAL RL SL HC RLCT SLCT
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A NA NA 0 1 6
2 A NA NA 1 NA 1.5
3 A NA NA 5 3 1
4 B NA NA 6 1 0
5 B NA NA 7 6 0
6 B NA NA 8 3 1
7 C 2 1 9 NA NA
8 C 3 2 3 NA NA
9 C 1 0 4 NA NA

You can define a column to store the condition, and then fill RL and SL with ifelse() inside across().
dt %>%
  group_by(TRIAL) %>%
  mutate(cond = any(RL > 5.0 | SL > 5.0, na.rm = TRUE),
         across(c(RL, SL), ~ ifelse(cond, ., NA), .names = "{.col}CT"),
         across(c(RL, SL), ~ ifelse(!cond, ., NA)),
         cond = NULL)
Result:
# A tibble: 12 x 6
# Groups: TRIAL [4]
TRIAL RL SL HC RLCT SLCT
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A NA NA 0 1 6
2 A NA NA 1 NA 1.5
3 A NA NA 5 3 1
4 B NA NA 6 1 0
5 B NA NA 7 6 0
6 B NA NA 8 3 1
7 C 2 1 9 NA NA
8 C 3 2 3 NA NA
9 C 1 0 4 NA NA
10 D 0 1 5 NA NA
11 D 1.5 1.5 4 NA NA
12 D NA NA 2 NA NA
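The reason the original attempt lost group D is the NA in RL and SL: without na.rm = TRUE, any() returns NA when no value is TRUE and at least one is missing, and filter() drops rows whose condition is NA. Negating NA still gives NA, so the group fell out of both halves. A minimal illustration using group D's RL values (`rl_d` is an illustrative name):

```r
rl_d <- c(0, 1.5, NA)   # RL values of TRIAL "D" in the question

any(rl_d > 5)                 # NA
!any(rl_d > 5)                # NA
any(rl_d > 5, na.rm = TRUE)   # FALSE
```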

With dplyr, you could use group_modify():
library(dplyr)
dt %>%
  group_by(TRIAL) %>%
  group_modify(~ {
    if (any(select(.x, c(RL, SL)) > 5, na.rm = TRUE)) {
      rename_with(.x, ~ paste0(.x, 'CT'), c(RL, SL))
    } else {
      .x
    }
  })
Output
# A tibble: 12 × 6
# Groups: TRIAL [4]
TRIAL RLCT SLCT HC RL SL
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 6 0 NA NA
2 A NA 1.5 1 NA NA
3 A 3 1 5 NA NA
4 B 1 0 6 NA NA
5 B 6 0 7 NA NA
6 B 3 1 8 NA NA
7 C NA NA 9 2 1
8 C NA NA 3 3 2
9 C NA NA 4 1 0
10 D NA NA 5 0 1
11 D NA NA 4 1.5 1.5
12 D NA NA 2 NA NA

Subset data based on variable prefix

I have a large dataset in which the answers to one question are distributed among various columns. Columns that belong together share the same prefix. How can I create a subset of the dataset for each question, selecting columns by prefix?
Here is an example dataset. I would like an efficient and easily adaptable solution that creates a dataset containing only the values of question one, two, or three.
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8), Question1a = c(1,
1, NA, NA, 1, 1, 1, NA), Question1b = c(NA, 1, NA, 1, NA, 1,
NA, 1), Question1c = c(1, 1, NA, NA, 1, NA, NA, NA), Question2a = c(1,
NA, NA, NA, 1, 1, NA, NA), Question2b = c(NA, 1, NA, 1, NA, NA,
NA, NA), Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA), Question3b = c(NA,
NA, 1, 1, NA, NA, NA, NA)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L))

You can use sapply and a function:
list_data <- sapply(c("Question1", "Question2", "Question3"),
                    function(x) df[startsWith(names(df), x)],
                    simplify = FALSE)
This will store everything in a list. To get the individual data sets in the global environment as individual objects, use:
list2env(list_data, globalenv())
Output
# $Question1
# # A tibble: 8 × 3
# Question1a Question1b Question1c
# <dbl> <dbl> <dbl>
# 1 1 NA 1
# 2 1 1 1
# 3 NA NA NA
# 4 NA 1 NA
# 5 1 NA 1
# 6 1 1 NA
# 7 1 NA NA
# 8 NA 1 NA
#
# $Question2
# # A tibble: 8 × 2
# Question2a Question2b
# <dbl> <dbl>
# 1 1 NA
# 2 NA 1
# 3 NA NA
# 4 NA 1
# 5 1 NA
# 6 1 NA
# 7 NA NA
# 8 NA NA
#
# $Question3
# # A tibble: 8 × 2
# Question3a Question3b
# <dbl> <dbl>
# 1 NA NA
# 2 NA NA
# 3 NA 1
# 4 NA 1
# 5 1 NA
# 6 1 NA
# 7 1 NA
# 8 NA NA
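If the question prefixes should not be typed out by hand, they can be derived from the column names. The following base-R sketch assumes that every item column ends in one or more lowercase letters after the question number, and uses a small made-up data frame named `survey` (a hypothetical name) in place of the full example:

```r
# Illustrative subset of the example data (hypothetical object name)
survey <- data.frame(
  ID = 1:3,
  Question1a = c(1, 1, NA),
  Question1b = c(NA, 1, NA),
  Question2a = c(1, NA, NA),
  Question3a = c(NA, NA, 1)
)

item_cols <- setdiff(names(survey), "ID")
# Strip the trailing item letter(s) to get the question prefix
prefix <- sub("[a-z]+$", "", item_cols)

# One data frame per question, each keeping the ID column
by_question <- lapply(
  split(item_cols, prefix),
  function(cols) survey[c("ID", cols)]
)

names(by_question)
```

Keeping ID in every piece is a design choice here; drop it from the selection if only the answer columns are wanted, as in the sapply() version above.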

I believe the underlying question is about data formats.
Here are a few:
library(tidyverse)
structure(
list(
ID = c(1, 2, 3, 4, 5, 6, 7, 8),
Question1a = c(1,
1, NA, NA, 1, 1, 1, NA),
Question1b = c(NA, 1, NA, 1, NA, 1,
NA, 1),
Question1c = c(1, 1, NA, NA, 1, NA, NA, NA),
Question2a = c(1,
NA, NA, NA, 1, 1, NA, NA),
Question2b = c(NA, 1, NA, 1, NA, NA,
NA, NA),
Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA),
Question3b = c(NA,
NA, 1, 1, NA, NA, NA, NA)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -8L)
) -> square_df
square_df %>%
pivot_longer(-ID,
names_to = c("Question", "Item"),
names_pattern = "Question(\\d+)(\\w+)") ->
long_df
long_df
#> # A tibble: 56 × 4
#> ID Question Item value
#> <dbl> <chr> <chr> <dbl>
#> 1 1 1 a 1
#> 2 1 1 b NA
#> 3 1 1 c 1
#> 4 1 2 a 1
#> 5 1 2 b NA
#> 6 1 3 a NA
#> 7 1 3 b NA
#> 8 2 1 a 1
#> 9 2 1 b 1
#> 10 2 1 c 1
#> # … with 46 more rows
long_df %>%
drop_na(value) ->
sparse_long_df
sparse_long_df
#> # A tibble: 22 × 4
#> ID Question Item value
#> <dbl> <chr> <chr> <dbl>
#> 1 1 1 a 1
#> 2 1 1 c 1
#> 3 1 2 a 1
#> 4 2 1 a 1
#> 5 2 1 b 1
#> 6 2 1 c 1
#> 7 2 2 b 1
#> 8 3 3 b 1
#> 9 4 1 b 1
#> 10 4 2 b 1
#> # … with 12 more rows
sparse_long_df %>%
nest(data = c(ID, Item, value)) ->
nested_long_df
nested_long_df
#> # A tibble: 3 × 2
#> Question data
#> <chr> <list>
#> 1 1 <tibble [12 × 3]>
#> 2 2 <tibble [5 × 3]>
#> 3 3 <tibble [5 × 3]>
Created on 2022-05-12 by the reprex package (v2.0.1)
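To see what the names_pattern regular expression in this answer captures, the two groups can be tried on a few column names with base R's sub(). This is only an illustration of the pattern, with `nm` and `pattern` as illustrative names:

```r
nm <- c("Question1a", "Question1b", "Question2a", "Question3b")
pattern <- "Question(\\d+)(\\w+)"

sub(pattern, "\\1", nm)  # "1" "1" "2" "3"  -> ends up in the Question column
sub(pattern, "\\2", nm)  # "a" "b" "a" "b"  -> ends up in the Item column
```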

You could also use map() to store each data frame in a list, e.g.
library(dplyr)  # for select() and starts_with()
library(purrr)
# 3 = number of questions
map(1:3,
    function(x) {
      quest <- paste0("Question", x)
      select(df, ID, starts_with(quest))
    })
Output:
[[1]]
# A tibble: 8 x 4
ID Question1a Question1b Question1c
<dbl> <dbl> <dbl> <dbl>
1 1 1 NA 1
2 2 1 1 1
3 3 NA NA NA
4 4 NA 1 NA
5 5 1 NA 1
6 6 1 1 NA
7 7 1 NA NA
8 8 NA 1 NA
[[2]]
# A tibble: 8 x 3
ID Question2a Question2b
<dbl> <dbl> <dbl>
1 1 1 NA
2 2 NA 1
3 3 NA NA
4 4 NA 1
5 5 1 NA
6 6 1 NA
7 7 NA NA
8 8 NA NA
[[3]]
# A tibble: 8 x 3
ID Question3a Question3b
<dbl> <dbl> <dbl>
1 1 NA NA
2 2 NA NA
3 3 NA 1
4 4 NA 1
5 5 1 NA
6 6 1 NA
7 7 1 NA
8 8 NA NA

I found a really intuitive solution with the dplyr package, using select() and starts_with(). Alternatively, you can replace starts_with() with contains() if you are not identifying the related variables by a prefix but by some other common feature.
Q1 <- Survey %>%
select(
starts_with("Question1")
)
Q2 <- Survey %>%
select(
starts_with("Question2")
)
Q3 <- Survey %>%
select(
starts_with("Question3")
)

How to move all values of the same ID to a new column provided if one of the value is greater than 5 in R

I'm struggling with a problem in R. I'm trying to move all values in the RL column that share the same ID in the TRIAL column into a new column, provided that any of the RL values for that ID is greater than 5.
I have a data set like this:
dt <- tibble(
TRIAL = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
RL = c(1, 2, 3, 1, 6, 3, 2, 3, 1),
SL = c(1, 1.5, 1, 0, 0, 1, 1, 1.5, 0)
)
# # A tibble: 9 x 3
# TRIAL RL SL
# <chr> <dbl> <dbl>
# 1 A 1 1
# 2 A 2 1.5
# 3 A 3 1
# 4 B 1 0
# 5 B 6 0
# 6 B 3 1
# 7 C 2 1
# 8 C 3 1.5
# 9 C 1 0
This is what I want to achieve: I want all values from one column in a group to be moved to a new column if the max value for that group is greater than 5, see example below.
# # A tibble: 9 x 4
# TRIAL RL SL RLCT
# <chr> <dbl> <dbl> <dbl>
# 1 A 1 1 NA
# 2 A 2 1.5 NA
# 3 A 3 1 NA
# 4 B NA 0 1
# 5 B NA 0 6
# 6 B NA 1 3
# 7 C 2 1 NA
# 8 C 3 1.5 NA
# 9 C 1 0 NA
When I run this code, I do not get the expected output:
dt %>% group_by("TRIAL") %>% mutate(RLCT = case_when ("RL"> 5 ~ "RL"))
# # A tibble: 9 x 5
# # Groups: "TRIAL" [1]
# TRIAL RL SL `"TRIAL"` RLCT
# <chr> <dbl> <dbl> <chr> <chr>
# 1 A 1 1 TRIAL RL
# 2 A 2 1.5 TRIAL RL
# 3 A 3 1 TRIAL RL
# 4 B 1 0 TRIAL RL
# 5 B 6 0 TRIAL RL
# 6 B 3 1 TRIAL RL
# 7 C 2 1 TRIAL RL
# 8 C 3 1.5 TRIAL RL
# 9 C 1 0 TRIAL RL

Surely not the most straightforward solution, but it seems to work:
dt0 <- dt %>%
  mutate(RLCT = NA) %>%
  group_by(TRIAL) %>%
  filter(!any(RL > 5))
dt %>%
  group_by(TRIAL) %>%
  filter(any(RL > 5)) %>%
  mutate(RLCT = RL) %>%
  rbind(dt0, .) %>%
  mutate(RL = ifelse(!is.na(RLCT), NA, RL))
# A tibble: 9 x 4
# Groups: TRIAL [3]
TRIAL RL SL RLCT
<chr> <dbl> <dbl> <dbl>
1 A 1 1 NA
2 A 2 1.5 NA
3 A 3 1 NA
4 C 2 1 NA
5 C 3 1.5 NA
6 C 1 0 NA
7 B NA 0 1
8 B NA 0 6
9 B NA 1 3
Add arrange(TRIAL) at the end of the pipeline for alphabetical ordering.
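The same result can probably be reached in a single grouped mutate(), which also keeps the original row order. This is a sketch under the assumption that dplyr is installed; `move` and `result` are illustrative names, not part of the answer above.

```r
library(dplyr)

dt <- tibble(
  TRIAL = rep(c("A", "B", "C"), each = 3),
  RL = c(1, 2, 3, 1, 6, 3, 2, 3, 1),
  SL = c(1, 1.5, 1, 0, 0, 1, 1, 1.5, 0)
)

result <- dt %>%
  group_by(TRIAL) %>%
  mutate(
    # TRUE for every row of a group in which any RL exceeds 5
    move = any(RL > 5, na.rm = TRUE),
    RLCT = if_else(move, RL, NA_real_),
    RL   = if_else(move, NA_real_, RL)
  ) %>%
  ungroup() %>%
  select(-move)

result
```

RLCT is computed before RL is overwritten, so the order of the two assignments inside mutate() matters.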

How to combine the values of various columns in a tibble by the same row ID

So I have a tibble (data frame) like this (the actual data frame is like 100+ rows)
sample_ID <- c(1, 2, 2, 3)
A <- c(NA, NA, 1, 3)
B <- c(1, 2, NA, 1)
C <- c(5, 1, NA, 2)
D <- c(NA, NA, 3, 1)
tibble(sample_ID,A,B,C,D)
# which reads
# A tibble: 4 × 5
sample_ID A B C D
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 1 5 NA
2 2 NA 2 1 NA
3 2 1 NA NA 3
4 3 3 1 2 1
As can be seen here, the second and third rows have the same sample ID. I want to combine these two rows so that the tibble looks like
# A tibble: 3 × 5
sample_ID A B C D
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 1 5 NA
2 2 1 2 1 3
3 3 3 1 2 1
In other words, I want the rows for sample_ID to be unique (order doesn't matter), and the values of other columns are merged (overwrite NA when possible). Can this be achieved in a simple way, such as using gather and spread? Many thanks.

We can use summarise_each after grouping by 'sample_ID'
library(dplyr)
df %>%
group_by(sample_ID) %>%
summarise_each(funs(na.omit))
# A tibble: 3 × 5
# sample_ID A B C D
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 NA 1 5 NA
#2 2 1 2 1 3
#3 3 3 1 2 1
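summarise_each() and funs() are deprecated in current dplyr releases. A sketch of the same idea with summarise() and across() follows; it assumes the tibble from the question is stored as `df`, and `first_non_na` is a hypothetical helper that returns NA when a group has no non-missing value.

```r
library(dplyr)

df <- tibble(
  sample_ID = c(1, 2, 2, 3),
  A = c(NA, NA, 1, 3),
  B = c(1, 2, NA, 1),
  C = c(5, 1, NA, 2),
  D = c(NA, NA, 3, 1)
)

# First non-missing value; indexing an empty vector with [1] yields NA
first_non_na <- function(x) x[!is.na(x)][1]

result <- df %>%
  group_by(sample_ID) %>%
  summarise(across(everything(), first_non_na))

result
```

If a group holds several different non-missing values in one column, this keeps only the first, which is an assumption about how conflicts should be resolved.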
