How to recode values in haven_labelled vectors in R - r

I am working with data imported from SPSS with read_sav() from the haven package.
The data exists in columns of class haven_labelled, which is somewhat similar to a factor in that it contains a value and a label but is different in other ways.
I want to recode the values in the data and associated label values.
Here is an example:
library(haven)
library(dplyr)
library(labelled)
library(tidyr)
x <- structure(list(q0015_0001 = structure(c(3, 5, NA, 3, 1, 2, NA, NA, 3, 4, 2, NA, 2, 2, 4, NA,
4, 3, 3, 3, 3, 2, NA, NA, 2), label = "Menu Options/Variety", format.spss = "F8.2", labels =
c(`Very Dissatisfied` = 1, Dissatisfied = 2, Neutral = 3, Satisfied = 4, `Very Satisfied` = 5),
class = c("haven_labelled", "vctrs_vctr", "double")), q0015_0002 = structure(c(4, 4, NA, 5, 3, 3,
NA, NA, 3, 4, 2, NA, 5, 2, 4, NA, 4, 3, 4, 4, 4, 4, NA, NA, 2), label = "Cleanliness", format.spss
= "F8.2", labels = c(`Very Dissatisfied` = 1, Dissatisfied = 2, Neutral = 3, Satisfied = 4, `Very
Satisfied` = 5), class = c("haven_labelled", "vctrs_vctr", "double")), q0015_0003 =
structure(c(2, 2, NA, 3, 1, 2, NA, NA, 3, 4, 3, NA, 4, 3, 4, NA, 3, 2, 4, 4, 2, 2, NA, NA, 1),
label = "Taste and Quality of Food", format.spss = "F8.2", labels = c(`Very Dissatisfied` = 1,
Dissatisfied = 2, Neutral = 3, Satisfied = 4, `Very Satisfied` = 5), class = c("haven_labelled",
"vctrs_vctr", "double"))), row.names = c(NA, -25L), class = c("tbl_df", "tbl", "data.frame"),
label = "File created by user")
x
# A tibble: 25 x 3
# q0015_0001 q0015_0002 q0015_0003
# <dbl+lbl> <dbl+lbl> <dbl+lbl>
# 1 3 [Neutral] 4 [Satisfied] 2 [Dissatisfied]
# 2 5 [Very Satisfied] 4 [Satisfied] 2 [Dissatisfied]
# 3 NA NA NA
# 4 3 [Neutral] 5 [Very Satisfied] 3 [Neutral]
# 5 1 [Very Dissatisfied] 3 [Neutral] 1 [Very Dissatisfied]
# 6 2 [Dissatisfied] 3 [Neutral] 2 [Dissatisfied]
# 7 NA NA NA
# 8 NA NA NA
# 9 3 [Neutral] 3 [Neutral] 3 [Neutral]
#10 4 [Satisfied] 4 [Satisfied] 4 [Satisfied]
# ... with 15 more rows
To illustrate the column structure better
x$q0015_0001
#<labelled<double>[25]>: Menu Options/Variety
# [1] 3 5 NA 3 1 2 NA NA 3 4 2 NA 2 2 4 NA 4 3 3 3 3 2 NA NA 2
#
#Labels:
# value label
# 1 Very Dissatisfied
# 2 Dissatisfied
# 3 Neutral
# 4 Satisfied
# 5 Very Satisfied
The data include values from 1 to 5, each with a corresponding label (i.e., 1 = "Very Dissatisfied", etc.). haven_labelled allows numeric or character values.
I wish to change the values from c(1, 2, 3, 4, 5) to c(-2, -1, 0, 1, 2) but preserve the labels in the same order (i.e., -2 = "Very Dissatisfied", etc.).
Label              Old Value  New Value
Very Dissatisfied  1          -2
Dissatisfied       2          -1
Neutral            3          0
Satisfied          4          1
Very Satisfied     5          2
The closest I have come is using dplyr::recode(). The labelled package is supposed to extend the dplyr::recode() method to work with labelled vectors [1], but I haven't noticed a difference with/without it being loaded.
dplyr::recode(x$q0015_0001,`1` = -2, `2` = -1, `3` = 0, `4` = 1, `5` = 2)
#<labelled<double>[25]>: Menu Options/Variety
# [1] 0 2 NA 0 -2 -1 NA NA 0 1 -1 NA -1 -1 1 NA 1 0 0 0 0 -1 NA NA -1
#
#Labels:
# value label
# 1 Very Dissatisfied
# 2 Dissatisfied
# 3 Neutral
# 4 Satisfied
# 5 Very Satisfied
Notice that the values in the data changed as expected (3 became 0, 5 became 2, etc.) but not the label values. This means that if you were to attempt to use as_factor (the labelled vector equivalent to as.factor from the haven package) to reference the labels instead of the values, the labels will be incorrect. The effect on the data is further illustrated when viewing the values and labels together.
x %>%
mutate(across(starts_with("q0015"),
~recode(., `1` = -2, `2` = -1, `3` = 0, `4` = 1, `5` = 2)))
# A tibble: 25 x 3
#q0015_0001 q0015_0002 q0015_0003
#<dbl+lbl> <dbl+lbl> <dbl+lbl>
#1 0 1 [Very Dissatisfied] -1
#2 2 [Dissatisfied] 1 [Very Dissatisfied] -1
#3 NA NA NA
#4 0 2 [Dissatisfied] 0
#5 -2 0 -2
#6 -1 0 -1
#7 NA NA NA
#8 NA NA NA
#9 0 0 0
#10 1 [Very Dissatisfied] 1 [Very Dissatisfied] 1 [Very Dissatisfied]
# ... with 15 more rows
As shown, the labels still map to the old values. In the recoded version, 1 and 2 are positive scores but still map to Very Dissatisfied/Dissatisfied, while -2, -1 and 0 are not recognized as labelled values.
Question
How may I recode labelled vectors such that the data values and label values are updated together and labels are preserved/mapped to the new values?

It's ugly, but it does the job. The problem is that setting value labels is not straightforward. The labelled package offers functions for it, but these aren't "tidyverse-ready", i.e. they don't work within a mutate(), nor do they allow selecting variables with tidyselect helpers like starts_with().
However, set_value_labels() allows passing a list in which each element carries the name of the variable you want to apply labels to, with the labels themselves provided as a named vector:
x |>
mutate(across(starts_with("q0015"),
~dplyr::recode(., `1` = -2, `2` = -1, `3` = 0, `4` = 1, `5` = 2))) |>
set_value_labels(.labels = rep(list(c("Very Dissatisfied" = -2,
"Dissatisfied" = -1,
"Neutral" = 0,
"Satisfied" = 1,
"Very Satisfied" = 2)),
x |>
select(starts_with("q0015")) |>
ncol()) |>
setNames(nm = x |>
select(starts_with("q0015")) |>
names()))
which gives:
# A tibble: 25 × 3
q0015_0001 q0015_0002 q0015_0003
<dbl+lbl> <dbl+lbl> <dbl+lbl>
1 0 [Neutral] 1 [Satisfied] -1 [Dissatisfied]
2 2 [Very Satisfied] 1 [Satisfied] -1 [Dissatisfied]
3 NA NA NA
4 0 [Neutral] 2 [Very Satisfied] 0 [Neutral]
5 -2 [Very Dissatisfied] 0 [Neutral] -2 [Very Dissatisfied]
6 -1 [Dissatisfied] 0 [Neutral] -1 [Dissatisfied]
7 NA NA NA
8 NA NA NA
9 0 [Neutral] 0 [Neutral] 0 [Neutral]
10 1 [Satisfied] 1 [Satisfied] 1 [Satisfied]
# … with 15 more rows
# ℹ Use `print(n = ...)` to see more rows
I was curious and checked with the package developer of the labelled package, and an alternative would be to write a small function for recoding and relabeling a single variable and then run this function within across:
https://github.com/larmarange/labelled/issues/126
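A minimal sketch of that idea follows. It is not the code from the linked issue: the helper name recode_with_labels and its old/new arguments are hypothetical, and it assumes that every data value and every label value appears in old. It relies only on base R attribute handling, so the class, the variable label and format.spss are carried over unchanged.

```r
# Hypothetical helper (not part of haven or labelled):
# recode the data values and the label values in one step.
recode_with_labels <- function(x, old, new) {
  attrs <- attributes(x)               # label, labels, class, format.spss, ...
  vals  <- as.vector(unclass(x))       # bare values, attributes dropped
  out   <- new[match(vals, old)]       # NA values stay NA
  labs  <- attrs[["labels"]]
  # map the label values through the same old -> new table, keep the label names
  attrs[["labels"]] <- setNames(new[match(labs, old)], names(labs))
  attributes(out) <- attrs
  out
}

# Small vector built the same way as the columns in the question
q <- structure(
  c(3, 5, NA, 1),
  label  = "Menu Options/Variety",
  labels = c(`Very Dissatisfied` = 1, Dissatisfied = 2, Neutral = 3,
             Satisfied = 4, `Very Satisfied` = 5),
  class  = c("haven_labelled", "vctrs_vctr", "double")
)

recoded <- recode_with_labels(q, old = 1:5, new = c(-2, -1, 0, 1, 2))
as.vector(unclass(recoded))   # 0 2 NA -2
attr(recoded, "labels")       # label names now attached to -2, -1, 0, 1, 2
```

With dplyr loaded, the same helper could be applied to all matching columns with mutate(across(starts_with("q0015"), ~ recode_with_labels(.x, old = 1:5, new = c(-2, -1, 0, 1, 2)))).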

Related

Create new variable based on outcome of other variable in group - R

This is a similar/follow-up question to "R: How to code new variable based on grouped variable and conditioned on earlier row", but it is different because within donors there are potentially two match runs.
I have a data file with organ donors. I'm looking at lungs that are donated - there are two lungs.
If the lungs are split (L and R) and put up for donation, they are each attempted to match with recipients ("matchrun"). They go through eligible recipients until one matches ("sequence").
If the lung is matched to a recipient, it goes to them ("organ_placed").
If the lung doesn't match, it continues in the sequence and then just remains NA at the maximum sequence number.
I would like to create a new variable that has the outcome of the match run such that if one lung is placed and the other is not, it tells you that the lung was discarded. i.e. see case of Donor 2 in the data - the left lung is placed, but the right doesn't match.
In donor 3, the first match run doesn't match but the match run for the other lung does.
I figure it will be something like group_by(donorid, matchrun) but then how do you make a condition based on the match run?
library(tibble) # tribble() lives in the tibble package
library(dplyr)
data <- tribble(
~donorid, ~matchrun, ~sequence, ~organ_placed,
2, 3, 1, NA,
2, 3, 2, NA,
2, 3, 3, "L",
2, 4, 1, NA,
2, 4, 2, NA,
2, 4, 3, NA,
3, 5, 1, NA,
3, 5, 1, NA,
3, 5, 1, NA,
3, 6, 1, NA,
3, 6, 2, NA,
3, 6, 3, "L"
)
desired_outcome <- tribble(
~donorid, ~matchrun, ~sequence, ~organ_placed, ~organ,
2, 3, 1, NA, NA,
2, 3, 2, NA, NA,
2, 3, 3, "L", "Left Single",
2, 4, 1, NA, NA,
2, 4, 2, NA, NA,
2, 4, 3, NA, "Right Discarded",
3, 5, 1, NA, NA,
3, 5, 1, NA, NA,
3, 5, 1, NA, "Right Discarded",
3, 6, 1, NA, NA,
3, 6, 2, NA, NA,
3, 6, 3, "L", "Left Single")
You can try this:
data %>%
group_by(donorid) %>%
mutate(temp = ifelse(n_distinct(organ_placed, na.rm = TRUE) == 1, unique(na.omit(organ_placed)), "B")) %>%
group_by(matchrun, .add = TRUE) %>%
mutate(organ = case_when(organ_placed == "L" ~ "Left Single",
organ_placed == "R" ~ "Right Single",
all(is.na(organ_placed)) & row_number() == max(sequence) & temp == "L" ~ "Right Discarded",
all(is.na(organ_placed)) & row_number() == max(sequence) & temp == "R" ~ "Left Discarded")) %>%
ungroup()
output
donorid matchrun sequence organ_placed temp organ
1 1 1 1 NA B NA
2 1 1 2 NA B NA
3 1 1 3 L B Left Single
4 1 2 1 NA B NA
5 1 2 2 NA B NA
6 1 2 3 R B Right Single
7 2 3 1 NA L NA
8 2 3 2 NA L NA
9 2 3 3 L L Left Single
10 2 4 1 NA L NA
11 2 4 2 NA L NA
12 2 4 3 NA L Right Discarded
Update: we have to add matchrun to the group. Removed prior solution:
data %>%
group_by(donorid, matchrun) %>%
mutate(outcome = case_when(organ_placed == "L" ~ "Left Single",
organ_placed == "R" ~ "Right Single",
organ_placed == "B" ~ "Bilateral",
(is.na(organ_placed) &
row_number() == max(row_number())) &
"L" %in% organ_placed ~ "Right Discarded",
(is.na(organ_placed) &
row_number() == max(row_number())) &
"R" %in% organ_placed ~ "Left Discarded",
TRUE ~ NA_character_))
Groups: donorid, matchrun [4]
donorid matchrun sequence organ_placed outcome
<dbl> <dbl> <dbl> <chr> <chr>
1 2 3 1 NA NA
2 2 3 2 NA NA
3 2 3 3 L Left Single
4 2 4 1 NA NA
5 2 4 2 NA NA
6 2 4 3 NA NA
7 3 5 1 NA NA
8 3 5 1 NA NA
9 3 5 1 NA NA
10 3 6 1 NA NA
11 3 6 2 NA NA
12 3 6 3 L Left Single
We can use
library(data.table)
library(stringr)
setDT(data)[, seq2 := rowid(donorid, matchrun) ]
data[, organ := str_replace_all(organ_placed,
setNames(c("Left Single", "Right Single"), c("L", "R")))]
data[seq2 == max(seq2),
organ := fcase(!is.na(organ), organ, default =
str_replace_all(setdiff(c("Left Single", "Right Single"), organ),
setNames(c("Left Discarded", "Right Discarded"),
c("Left Single", "Right Single")))), donorid
][, seq2 := NULL][]
-output
> data
donorid matchrun sequence organ_placed organ
1: 2 3 1 <NA> <NA>
2: 2 3 2 <NA> <NA>
3: 2 3 3 L Left Single
4: 2 4 1 <NA> <NA>
5: 2 4 2 <NA> <NA>
6: 2 4 3 <NA> Right Discarded
7: 3 5 1 <NA> <NA>
8: 3 5 1 <NA> <NA>
9: 3 5 1 <NA> Right Discarded
10: 3 6 1 <NA> <NA>
11: 3 6 2 <NA> <NA>
12: 3 6 3 L Left Single
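For comparison, here is a hedged dplyr sketch that combines the two grouping levels used in the answers above: the placed side is looked up per donor, and the "discarded" label is written on the last row of a match run that placed nothing. The helper columns placed_side, run_unplaced and last_row are names introduced here for illustration, and the sketch assumes at most one side is placed per donor (a bilateral placement is not handled).

```r
library(dplyr)

# Same values as the question's example data
data <- data.frame(
  donorid      = c(2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  matchrun     = c(3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6),
  sequence     = c(1, 2, 3, 1, 2, 3, 1, 1, 1, 1, 2, 3),
  organ_placed = c(NA, NA, "L", NA, NA, NA, NA, NA, NA, NA, NA, "L")
)

result <- data %>%
  group_by(donorid) %>%
  # side that was placed for this donor (NA if nothing was placed)
  mutate(placed_side = organ_placed[!is.na(organ_placed)][1]) %>%
  group_by(donorid, matchrun) %>%
  mutate(
    run_unplaced = all(is.na(organ_placed)),  # this match run placed nothing
    last_row     = row_number() == n(),       # last row of the match run
    organ = case_when(
      organ_placed == "L" ~ "Left Single",
      organ_placed == "R" ~ "Right Single",
      run_unplaced & last_row & placed_side == "L" ~ "Right Discarded",
      run_unplaced & last_row & placed_side == "R" ~ "Left Discarded"
    )
  ) %>%
  ungroup() %>%
  select(-placed_side, -run_unplaced, -last_row)

result$organ
```

Rows that match none of the conditions, including every row of a donor with no placement at all, are left as NA.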

Assign value to a column where column name is a concatenation of other columns' values

Assume the following data:
dat <- structure(list(row = c("467", "537", "236", "257"), x_11 = c(5,
5, 5, 4), x_12 = c(5, 5, 6, 1), x_13 = c(4, 7, 6, 5), x_14 = c(4,
6, 4, 1), x_15 = c(4, 5, 4, 4), x_16 = c(2, 6, 5, 2), x_17 = c(3,
4, 3, 3), mode_1 = c(4, 5, 4, 1), mode_2 = c(NA, NA, 5, 4), mode_3 = c(NA,
NA, 6, NA), mean = c(3.85714285714286, 5.42857142857143, 4.71428571428571,
2.85714285714286), sd = c(1.0690449676497, 0.975900072948533,
1.11269728052837, 1.57359158493889), nearest = c(1L, 1L, 2L,
2L)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
which gives:
# A tibble: 4 x 14
row x_11 x_12 x_13 x_14 x_15 x_16 x_17 mode_1 mode_2 mode_3 mean sd nearest
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 467 5 5 4 4 4 2 3 4 NA NA 3.86 1.07 1
2 537 5 5 7 6 5 6 4 5 NA NA 5.43 0.976 1
3 236 5 6 6 4 4 5 3 4 5 6 4.71 1.11 2
4 257 4 1 5 1 4 2 3 1 4 NA 2.86 1.57 2
I now want to create a new column based on the following condition:
if mode_2 is NA, then take the value from mode_1
if mode_2 is NOT NA, then take the value from the column position that is specified in "nearest". Note: the column position in "nearest" refers to the column position of the mode_ columns, NOT the overall column positions of the data frame.
I tried the following, but I always get an error that object "nearest" is not found:
dat %>%
mutate(test = case_when(is.na(mode_2) ~ mode_1,
TRUE ~ !!paste0("mode_", nearest)))
Expected output:
# A tibble: 4 x 15
row [...] mode_1 mode_2 mode_3 mean sd nearest test
<chr> [...] <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 467 [...] 4 NA NA 3.86 1.07 1 4
2 537 [...] 5 NA NA 5.43 0.976 1 5
3 236 [...] 4 5 6 4.71 1.11 2 5
4 257 [...] 1 4 NA 2.86 1.57 2 4
Note in reality I have ~ 20-50 mode_ columns so I can't hard code all potential combinations.
You can create a new column which has corresponding value from take and use coalesce to select any one of the non-NA value.
library(dplyr)
dat %>%
mutate(take_value = as.numeric(.[cbind(1:n(),
match(paste0('x_', take), names(.)))]),
test = coalesce(take_value, x_1)) %>%
select(-take_value)
# x_1 x_2 take test
#1 1 NA 2 1
#2 9 2 1 9
#3 3 NA 2 3
#4 7 8 2 8
#5 5 NA 1 5
Using base R :
dat$take_value <- dat[cbind(1:nrow(dat), match(paste0('x_', dat$take), names(dat)))]
transform(dat, test = ifelse(is.na(take_value), x_1, take_value))
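The answers above use the column names x_1, x_2 and take. Adapted to the mode_ and nearest columns shown in the question, the same matrix-indexing idea could be sketched as below. This is only an illustration: it assumes the mode_ columns appear in the data frame in the order mode_1, mode_2, ..., so that nearest can be used directly as a column position, and it keeps just the columns needed for the example.

```r
# Reduced version of the question's data (only the columns used here)
dat <- data.frame(
  row     = c("467", "537", "236", "257"),
  mode_1  = c(4, 5, 4, 1),
  mode_2  = c(NA, NA, 5, 4),
  mode_3  = c(NA, NA, 6, NA),
  nearest = c(1L, 1L, 2L, 2L)
)

# All mode_ columns, however many there are
mode_cols <- grep("^mode_", names(dat), value = TRUE)
mode_mat  <- as.matrix(dat[mode_cols])

# Pick, for each row i, the value in mode column number nearest[i]
picked <- mode_mat[cbind(seq_len(nrow(dat)), dat$nearest)]

# Fall back to mode_1 whenever mode_2 is missing
dat$test <- ifelse(is.na(dat$mode_2), dat$mode_1, picked)
dat$test   # 4 5 5 4
```

Because the columns are found by pattern, nothing has to be hard-coded when there are 20 to 50 mode_ columns.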

count occurrences in multiple columns (but for each row) based on value in another column

I am currently trying to analyse a data set in which I have one column that gives me the value of interest for each row (column called value_needed) and then a bunch of columns (in reality around 150) that have values and also a lot of NA's. For each row I would like to count the number of occurrences of that value from column value_needed in all the other columns, here position_1:position_6.
Here is some fake data:
position_1 <- c(6, -8, 8, 0, 0, -6)
position_2 <- c(NA, 6, -8, 8, 8, 0)
position_3 <- c(NA, NA, 6, -8, 0, 8)
position_4 <- c(NA, NA, NA, 6, -8, -8)
position_5 <- c(NA, NA, NA, NA, 6, 8)
position_6 <- c(NA, NA, NA, NA, NA, 6)
value_needed <- c(0, 6, -8, 8, 0, 8)
df <- data.frame(position_1, position_2, position_3,position_4, position_5, position_6,value_needed)
In the ideal case I would need to create a new column (name it occ) that counts the occurrences of the value in column value_needed from all position columns in that particular row.
The output for this fake data set above would be then:
occ = c(0,1,1,1,2,2)
If anyone has any hints, I really appreciate that.
Thanks
base solution
df$occ <- rowSums(df[1:6] == df$value_needed, na.rm = T)
dplyr solution
library(dplyr)
df %>%
rowwise() %>%
mutate(occ = sum(c_across(position_1:position_6) == value_needed, na.rm = TRUE)) %>%
ungroup()
output
# # A tibble: 6 x 8
# position_1 position_2 position_3 position_4 position_5 position_6 value_needed occ
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 6 NA NA NA NA NA 0 0
# 2 -8 6 NA NA NA NA 6 1
# 3 8 -8 6 NA NA NA -8 1
# 4 0 8 -8 6 NA NA 8 1
# 5 0 8 0 -8 6 NA 0 2
# 6 -6 0 8 -8 8 6 8 2
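With around 150 position columns, selecting them by index or by range gets fragile. A small base R sketch that picks the columns by name pattern instead is shown below; it assumes every column to be searched starts with "position_", as in the fake data. Comparing a data frame with a vector of length nrow(df) compares each column element-wise against that vector, which is what makes the row-wise count work.

```r
# The fake data from the question
df <- data.frame(
  position_1 = c(6, -8, 8, 0, 0, -6),
  position_2 = c(NA, 6, -8, 8, 8, 0),
  position_3 = c(NA, NA, 6, -8, 0, 8),
  position_4 = c(NA, NA, NA, 6, -8, -8),
  position_5 = c(NA, NA, NA, NA, 6, 8),
  position_6 = c(NA, NA, NA, NA, NA, 6),
  value_needed = c(0, 6, -8, 8, 0, 8)
)

# Select the columns to search by name instead of by position
pos_cols <- grep("^position_", names(df), value = TRUE)

# TRUE where a cell equals that row's value_needed; NA comparisons are dropped
df$occ <- rowSums(df[pos_cols] == df$value_needed, na.rm = TRUE)
df$occ   # 0 1 1 1 2 2
```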

R: What is an efficient way to recode variables? How do I prorate means?

I was wondering if anyone could point me in the direction of how I would go about recoding multiple variables with the same rules. I have the following df bhs1:
structure(list(bhs1_1 = c(NA, 1, NA, 2, 1, 2), bhs1_2 = c(NA,
2, NA, 2, 1, 1), bhs1_3 = c(NA, 1, NA, 2, 2, 2), bhs1_4 = c(NA,
2, NA, 1, 1, 1), bhs1_5 = c(NA, 1, NA, 1, 2, 2), bhs1_6 = c(NA,
1, NA, 2, 1, 2), bhs1_7 = c(NA, 1, NA, 1, 2, 1), bhs1_8 = c(NA,
2, NA, 2, 2, 2), bhs1_9 = c(NA, 1, NA, 2, 1, 1), bhs1_10 = c(NA,
2, NA, 1, 2, 2), bhs1_11 = c(NA, 2, NA, 2, 2, 1), bhs1_12 = c(NA,
2, NA, 2, 1, 1), bhs1_13 = c(NA, 1, NA, 1, 2, 2), bhs1_14 = c(NA,
2, NA, 2, 1, 1), bhs1_15 = c(NA, 1, NA, 2, 2, 2), bhs1_16 = c(NA,
2, NA, 2, 2, 2), bhs1_17 = c(NA, 2, NA, 2, 2, 1), bhs1_18 = c(NA,
1, NA, 1, 2, 1), bhs1_19 = c(NA, 1, NA, 2, 1, 2), bhs1_20 = c(NA,
2, NA, 2, 1, 1)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
There are two transformation rules, for half of the data set, e.g.,:
(bhs1_2, bhs1_4, bhs1_7, bhs1_9, bhs1_11, bhs1_12, bhs1_14, bhs1_16, bhs1_17,
bhs1_18, bhs1_20)
(if_else(1, 1, 0))
and
(bhs1_1, bhs1_3, bhs1_5, bhs1_6, bhs1_8, bhs1_10, bhs1_13,
bhs1_15, bhs1_19)
(if_else(2, 1, 0))
Is there an elegant way to write code to meet this use case? If so, can someone please point me in the right direction and/or provide me with a sample?
Here's a solution using dplyr
library(dplyr)
case1 <- vars(bhs1_2, bhs1_4, bhs1_7, bhs1_9, bhs1_11, bhs1_12, bhs1_14, bhs1_16, bhs1_17,
bhs1_18, bhs1_20)
case2 <- vars(bhs1_1, bhs1_3, bhs1_5, bhs1_6, bhs1_8, bhs1_10, bhs1_13,
bhs1_15, bhs1_19)
result <- df %>%
mutate_at(case1, ~ (. == 1) * 1L) %>%
mutate_at(case2, ~ (. == 2) * 1L)
Note - I skipped the ifelse statement: I'm just testing for your condition, then converting the TRUE/FALSE responses to numbers by multiplying by 1. I'm also not sure how you want NAs to be handled; as written, NA values stay NA.
If you aren't familiar with the pipe operator (%>%), it takes the result of the previous function, and sets it as the first argument of the next function. It's designed to improve code legibility by avoiding lots of function nesting.
We can create the column names of interest, then convert to binary (as.integer) from the logical expression
case1 <- c("bhs1_2", "bhs1_4", "bhs1_7", "bhs1_9", "bhs1_11", "bhs1_12",
"bhs1_14", "bhs1_16", "bhs1_17", "bhs1_18", "bhs1_20")
case2 <- c("bhs1_1", "bhs1_3", "bhs1_5", "bhs1_6", "bhs1_8",
"bhs1_10", "bhs1_13", "bhs1_15", "bhs1_19")
library(dplyr)
library(magrittr)
# %<>% assigns the result back to df1; the rest of the chain uses the ordinary pipe
df1 %<>%
mutate_at(case1, ~ as.integer(. == 1)) %>%
mutate_at(case2, ~ as.integer(. == 2))
df1
# A tibble: 6 x 20
# bhs1_1 bhs1_2 bhs1_3 bhs1_4 bhs1_5 bhs1_6 bhs1_7 bhs1_8 bhs1_9 bhs1_10
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 NA NA NA NA NA NA NA NA NA NA
#2 0 0 0 0 0 0 1 1 1 1
#3 NA NA NA NA NA NA NA NA NA NA
#4 1 0 1 1 0 1 1 1 0 0
#5 0 1 1 1 1 0 0 1 1 1
#6 1 1 1 1 1 1 1 1 1 1
# ... with 10 more variables: bhs1_11 <int>, bhs1_12 <int>, bhs1_13 <int>,
# bhs1_14 <int>, bhs1_15 <int>, bhs1_16 <int>, bhs1_17 <int>, bhs1_18 <int>,
# bhs1_19 <int>, bhs1_20 <int>
Or an efficient option would be to use data.table
library(data.table)
setDT(df1)[, (case1) := lapply(.SD, function(x) as.integer(x == 1 )),
.SDcols = case1
][, (case2) := lapply(.SD, function(x) as.integer(x == 2)),
.SDcols = case2][]
NOTE This doesn't assume that all the values are of the same
You can use a very fast base R way of doing this as below:
case1=c("bhs1_2", "bhs1_4", "bhs1_7", "bhs1_9", "bhs1_11", "bhs1_12", "bhs1_14", "bhs1_16", "bhs1_17", "bhs1_18", "bhs1_20")
case2=c("bhs1_1", "bhs1_3", "bhs1_5", "bhs1_6", "bhs1_8", "bhs1_10", "bhs1_13", "bhs1_15", "bhs1_19")
dat[case1]=abs(dat[case1]-2)
dat[case2]=dat[case2]-1
A simple ifelse can be helpful, considering the OP wants NA to be converted based on the specified rules:
case1 = c("bhs1_2", "bhs1_4", "bhs1_7", "bhs1_9", "bhs1_11", "bhs1_12",
"bhs1_14", "bhs1_16", "bhs1_17", "bhs1_18", "bhs1_20")
case2 = c("bhs1_1", "bhs1_3", "bhs1_5", "bhs1_6", "bhs1_8", "bhs1_10",
"bhs1_13", "bhs1_15", "bhs1_19")
df[case1] = ifelse(!is.na(df[case1]) & df[case1]==1,1,0)
df[case2] = ifelse(!is.na(df[case2]) & df[case2]==2,1,0)
#Test solution
df[1:7]
# bhs1_1 bhs1_2 bhs1_3 bhs1_4 bhs1_5 bhs1_6 bhs1_7
# 1 0 0 0 0 0 0 0
# 2 0 0 0 0 0 0 1
# 3 0 0 0 0 0 0 0
# 4 1 0 1 1 0 1 1
# 5 0 1 1 1 1 0 0
# 6 1 1 1 1 1 1 1
Updated: If NA is to be left as is, then the solution can be:
df[case1] = ifelse(df[case1]==1,1,0)
df[case2] = ifelse(df[case2]==2,1,0)
df[1:7]
# bhs1_1 bhs1_2 bhs1_3 bhs1_4 bhs1_5 bhs1_6 bhs1_7
# 1 NA NA NA NA NA NA NA
# 2 0 0 0 0 0 0 1
# 3 NA NA NA NA NA NA NA
# 4 1 0 1 1 0 1 1
# 5 0 1 1 1 1 0 0
# 6 1 1 1 1 1 1 1
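The mutate_at() calls in the answers above still work, but current dplyr versions express the same thing with across(). The sketch below uses a cut-down, made-up two-column version of bhs1 to keep it short; case1 and case2 play the same role as in the answers, and NA values are left as NA.

```r
library(dplyr)

# Cut-down stand-in for bhs1 (illustrative values only)
bhs1 <- data.frame(
  bhs1_1 = c(NA, 1, 2),
  bhs1_2 = c(NA, 2, 1)
)

case1 <- c("bhs1_2")   # items scored 1 when the answer is 1
case2 <- c("bhs1_1")   # items scored 1 when the answer is 2

result <- bhs1 %>%
  mutate(
    across(all_of(case1), ~ as.integer(.x == 1)),
    across(all_of(case2), ~ as.integer(.x == 2))
  )

result
```

all_of() makes it explicit that case1 and case2 are character vectors of column names rather than columns of the data frame.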

gather multiple columns with nested, repeated measures

I have a dataset of people (pid) of different types (type2=c("dad", "mom", "kid"; and for ease, type=c("a", "b", "c")) nested in households (hid) with repeated measurements (time).
Some variables like v1_ are asked to everyone, but the values are spread across three columns. For instance, v1_a contains the values for all of the dads (type==a).
Variables like v2_ are only asked of dads and moms (a's and b's), and the values are spread across two columns.
Variables like v3 are also only asked to dads and moms, but the values are contained in one column.
Variables like v4 are asked to everyone, and the values are contained in one column.
Have:
hid pid type type2 time v1_a v1_b v1_c v2_a v2_b v3 v4
1 1 1 a dad 1 6 NA NA 2 NA 4 3
2 1 2 b mom 1 NA 2 NA NA 5 6 6
3 1 3 c kid 1 NA NA 1 NA NA NA 5
4 2 4 a dad 1 3 NA NA 6 NA 2 6
5 2 5 b mom 1 NA 5 NA NA 2 4 3
6 2 6 c kid 1 NA NA 3 NA NA NA 5
7 1 1 a dad 2 3 NA NA 2 NA 4 3
8 1 2 b mom 2 NA 3 NA NA 5 6 6
9 1 3 c kid 2 NA NA 2 NA NA NA 5
10 2 4 a dad 2 2 NA NA 6 NA 2 6
11 2 5 b mom 2 NA 3 NA NA 2 4 3
12 2 6 c kid 2 NA NA 2 NA NA NA 5
Here is the end result I want:
hid pid type type2 time v1 v2 v3 v4
1 1 1 a dad 1 6 2 4 3
2 1 2 b mom 1 2 5 6 6
3 1 3 c kid 1 1 NA NA 5
4 2 4 a dad 1 3 6 2 6
5 2 5 b mom 1 5 2 4 3
6 2 6 c kid 1 3 NA NA 5
7 1 1 a dad 2 3 2 4 3
8 1 2 b mom 2 3 5 6 6
9 1 3 c kid 2 2 NA NA 5
10 2 4 a dad 2 2 6 2 6
11 2 5 b mom 2 3 2 4 3
12 2 6 c kid 2 2 NA NA 5
I'm looking for a tidyverse approach that will handle a larger actual use case of mixed variables as shown here. The variable naming is consistent. Where do I go after gather()?
library(tidyverse)
df_have <- data.frame(hid=c(1, 1, 1, 2, 2, 2,
1, 1, 1, 2, 2, 2),
pid=c(1, 2, 3, 4, 5, 6,
1, 2, 3, 4, 5, 6),
type=c("a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c"),
type2=c("dad", "mom", "kid", "dad", "mom", "kid",
"dad", "mom", "kid", "dad", "mom", "kid"),
time=c(1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2),
v1_a=c(6, NA, NA, 3, NA, NA,
3, NA, NA, 2, NA, NA),
v1_b=c(NA, 2, NA, NA, 5, NA,
NA, 3, NA, NA, 3, NA),
v1_c=c(NA, NA, 1, NA, NA, 3,
NA, NA, 2, NA, NA, 2),
v2_a=c(2, NA, NA, 6, NA, NA,
2, NA, NA, 6, NA, NA),
v2_b=c(NA, 5, NA, NA, 2, NA,
NA, 5, NA, NA, 2, NA),
v3=c(4, 6, NA, 2, 4, NA,
4, 6, NA, 2, 4, NA),
v4=c(3, 6, 5, 6, 3, 5,
3, 6, 5, 6, 3, 5)
)
df_want <- data.frame(hid=c(1, 1, 1, 2, 2, 2,
1, 1, 1, 2, 2, 2),
pid=c(1, 2, 3, 4, 5, 6,
1, 2, 3, 4, 5, 6),
type=c("a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c"),
type2=c("dad", "mom", "kid", "dad", "mom", "kid",
"dad", "mom", "kid", "dad", "mom", "kid"),
time=c(1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2),
v1=c(6, 2, 1, 3, 5, 3,
3, 3, 2, 2, 3, 2),
v2=c(2, 5, NA, 6, 2, NA,
2, 5, NA, 6, 2, NA),
v3=c(4, 6, NA, 2, 4, NA,
4, 6, NA, 2, 4, NA),
v4=c(3, 6, 5, 6, 3, 5,
3, 6, 5, 6, 3, 5)
)
df_have %>%
gather(key, value, -hid, -pid, -type, -type2, -time)
Here is another idea using coalesce from dplyr and map from purrr.
library(tidyverse)
# Set target column names
cols <- paste0("v", 1:4)
# Coalesce the numbers based on column names
nums <- map(cols, ~coalesce(!!!as.list(df_have %>% select(starts_with(.x)))))
# Create a data frame
nums_df <- nums %>%
setNames(cols) %>%
as_tibble() # as_data_frame() is deprecated in favour of as_tibble()
# Create the final output by bind_cols
df_test <- df_have %>%
select(-starts_with("v")) %>%
bind_cols(nums_df)
df_test
# hid pid type type2 time v1 v2 v3 v4
# 1 1 1 a dad 1 6 2 4 3
# 2 1 2 b mom 1 2 5 6 6
# 3 1 3 c kid 1 1 NA NA 5
# 4 2 4 a dad 1 3 6 2 6
# 5 2 5 b mom 1 5 2 4 3
# 6 2 6 c kid 1 3 NA NA 5
# 7 1 1 a dad 2 3 2 4 3
# 8 1 2 b mom 2 3 5 6 6
# 9 1 3 c kid 2 2 NA NA 5
# 10 2 4 a dad 2 2 6 2 6
# 11 2 5 b mom 2 3 2 4 3
# 12 2 6 c kid 2 2 NA NA 5
This gets me there, but the filter(!is.na(value)) step seems like a hack. Better ideas?
df_test <-
df_have %>%
gather(key, value, -hid, -pid, -type, -time, -type2) %>%
mutate(key = str_replace(key, "_.*", "")) %>%
filter(!is.na(value)) %>%
spread(key, value) %>%
arrange(time, hid, type, pid)
Update from @www:
df_test <-
df_have %>%
gather(key, value, -hid, -pid, -type, -time, -type2, na.rm=TRUE) %>%
mutate(key = str_replace(key, "_.*", "")) %>%
spread(key, value) %>%
arrange(time, hid, type, pid)
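The gather()/spread() pipeline above can also be written with pivot_longer() and pivot_wider(), which tidyr now recommends for this kind of reshaping. The sketch below is an assumption-laden illustration on a three-row subset of df_have (one household, one time point, type2 omitted); the object names df_small and df_small_wide are made up for the example.

```r
library(dplyr)
library(tidyr)

# Three-row subset of df_have: one dad, one mom, one kid
df_small <- data.frame(
  hid  = c(1, 1, 1),
  pid  = c(1, 2, 3),
  type = c("a", "b", "c"),
  time = c(1, 1, 1),
  v1_a = c(6, NA, NA),
  v1_b = c(NA, 2, NA),
  v1_c = c(NA, NA, 1),
  v2_a = c(2, NA, NA),
  v2_b = c(NA, 5, NA),
  v3   = c(4, 6, NA),
  v4   = c(3, 6, 5)
)

df_small_wide <- df_small %>%
  # long format, dropping the structurally missing cells right away
  pivot_longer(cols = matches("^v[0-9]"), names_to = "key",
               values_to = "value", values_drop_na = TRUE) %>%
  # strip the "_a" / "_b" / "_c" suffix so v1_a, v1_b, v1_c all become v1
  mutate(key = sub("_.*", "", key)) %>%
  pivot_wider(names_from = key, values_from = value)

df_small_wide
```

values_drop_na = TRUE plays the role of gather(..., na.rm = TRUE), so no separate filter(!is.na(value)) step is needed.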
