I have a survey where people rank some schools. The survey results in multiple columns with 1s and 2s. I need to colapse these into a persons first and second choice.
Here is an exampe of the data that I have.
df1 <- tibble(Person = c(1 , 2 , 3 , 4 , 5 , 6 , 7),
School1 = c(NA, 1 , 2 , NA, NA, NA, 1 ),
School2 = c(NA, 2 , 1 , NA, NA, 1 , NA),
School3 = c(1 , NA, NA, NA, NA, 2 , NA),
School4 = c(2 , NA, NA, 1 , 2 , NA, NA),
School5 = c(NA, NA, NA, 2 , 1 , NA, 2))
Person School1 School2 School3 School4 School5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA NA 1 2 NA
2 2 1 2 NA NA NA
3 3 2 1 NA NA NA
4 4 NA NA NA 1 2
5 5 NA NA NA 2 1
6 6 NA 1 2 NA NA
7 7 1 NA NA NA 2
Here is the result that I need.
df2 <- tibble(Person = c(1 , 2 , 3 , 4 , 5 , 6 , 7),
School1 = c(NA, 1 , 2 , NA, NA, NA, 1 ),
School2 = c(NA, 2 , 1 , NA, NA, 1 , NA),
School3 = c(1 , NA, NA, NA, NA, 2 , NA),
School4 = c(2 , NA, NA, 1 , 2 , NA, NA),
School5 = c(NA, NA, NA, 2 , 1 , NA, 2),
Firstchoice = c('School3', 'School1', 'School2', 'School4', 'School5', 'School2', 'School1'),
Secondchoice = c('School4', 'School2', 'School1', 'School5', 'School4', 'School3', 'School5'))
Person School1 School2 School3 School4 School5 Firstchoice Secondchoice
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1 NA NA 1 2 NA School3 School4
2 2 1 2 NA NA NA School1 School2
3 3 2 1 NA NA NA School2 School1
4 4 NA NA NA 1 2 School4 School5
5 5 NA NA NA 2 1 School5 School4
6 6 NA 1 2 NA NA School2 School3
7 7 1 NA NA NA 2 School1 School5
I have looked at mutate, and using a for loop however I can't figure out how to get them to work since they would need to do inline updates of a column.
Any help would be appreciated.
One tidyverse possibility could be:
df1 %>%
gather(var, val, -Person) %>%
mutate(val = ifelse(val == 1, "Firstchoice",
ifelse(val == 2, "Secondchoice", NA_character_))) %>%
na.omit() %>%
spread(val, var) %>%
left_join(df1, by = c("Person" = "Person"))
Person Firstchoice Secondchoice School1 School2 School3 School4 School5
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 School3 School4 NA NA 1 2 NA
2 2 School1 School2 1 2 NA NA NA
3 3 School2 School1 2 1 NA NA NA
4 4 School4 School5 NA NA NA 1 2
5 5 School5 School4 NA NA NA 2 1
6 6 School2 School3 NA 1 2 NA NA
7 7 School1 School5 1 NA NA NA 2
Piece of cake with tidyr:
choices <- gather(df1, key = "school", value = "choice", -Person, na.rm=TRUE)
choices <- arrange(choices, Person, choice)
Related
Here is some example data:
transactions <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8), day = c("day1",
"day2", "day3", "day4", "day5", "day6", "day7", "day8"), sent_to = c(NA,
"Garden Cinema", "Pasta House", NA, "Blue Superstore", NA, NA,
NA), received_from = c("ATM", NA, NA, "Sarah", NA, "Jane", "Joe",
"Emily"), reference = c("add_cash", "cinema_tickets", "meal",
"gift", "shopping", "reimbursed", "reimbursed", "reimbursed"),
decrease = c(NA, 10.8, 12.5, NA, 15.25, NA, NA, NA), increase = c(50,
NA, NA, 30, NA, 5.4, 7.25, 2), reimbursed_id = c(NA, "R",
"R", NA, NA, "2", "3", "3")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -8L))
# id day sent_to received_from reference decrease increase reimbursed_id
# <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
# 1 1 day1 NA ATM add_cash NA 50 NA
# 2 2 day2 Garden Cinema NA cinema_tickets 10.8 NA R
# 3 3 day3 Pasta House NA meal 12.5 NA R
# 4 4 day4 NA Sarah gift NA 30 NA
# 5 5 day5 Blue Superstore NA shopping 15.2 NA NA
# 6 6 day6 NA Jane reimbursed NA 5.4 2
# 7 7 day7 NA Joe reimbursed NA 7.25 3
# 8 8 day8 NA Emily reimbursed NA 2 3
Note that this is linked to a question I have previously asked here:
How to conditionally select a column, and subtract values in those rows from rows in another conditionally selected column in R?
I would like a similar solution to the above question but this time accounting for the fact that there are multiple people who are reimbursing the user for the same day.
This is the desired outcome:
# id day sent_to received_from reference decrease increase reimbursed_id actual_decrease
# <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
# 1 1 day1 NA ATM add_cash NA 50 NA NA
# 2 2 day2 Garden Cinema NA cinema_tickets 10.8 NA R 5.4
# 3 3 day3 Pasta House NA meal 12.5 NA R 3.25
# 4 4 day4 NA Sarah gift NA 30 NA NA
# 5 5 day5 Blue Superstore NA shopping 15.2 NA NA 15.2
# 6 6 day6 NA Jane reimbursed NA 5.4 2 NA
# 7 7 day7 NA Joe reimbursed NA 7.25 3 NA
# 8 8 day8 NA Emily reimbursed NA 2 3 NA
Any help is appreciated :)
Well you should first summarise the increase per id:
increase_df <- transactions %>%
filter(!is.na(as.numeric(reimbursed_id))) %>%
group_by(id = as.numeric(reimbursed_id)) %>%
summarise(increase_sum = sum(increase))
id increase_sum
<dbl> <dbl>
1 2 5.4
2 3 9.25
To then merge it and make the subtraction:
left_join(transactions,increase_df,by = "id") %>%
mutate(decrease = ifelse(!is.na(increase_sum),decrease - increase_sum,decrease))
id day sent_to received_from reference decrease increase reimbursed_id increase_sum
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 1 day1 NA ATM add_cash NA 50 NA NA
2 2 day2 Garden Cinema NA cinema_tickets 5.4 NA R 5.4
3 3 day3 Pasta House NA meal 3.25 NA R 9.25
4 4 day4 NA Sarah gift NA 30 NA NA
5 5 day5 Blue Superstore NA shopping 15.2 NA NA NA
6 6 day6 NA Jane reimbursed NA 5.4 2 NA
7 7 day7 NA Joe reimbursed NA 7.25 3 NA
8 8 day8 NA Emily reimbursed NA 2 3 NA
I have a large dataset in which the answers to one question are distributed among various columns. However, if the columns belong together, they share the same prefix. I wonder how I can create a subset dataset of each question sorting based on the prefix.
Here is an example dataset. I would like to receive an efficient and easy adaptable solution to create a dataset only containing the values of either question one, two or three.
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8), Question1a = c(1,
1, NA, NA, 1, 1, 1, NA), Question1b = c(NA, 1, NA, 1, NA, 1,
NA, 1), Question1c = c(1, 1, NA, NA, 1, NA, NA, NA), Question2a = c(1,
NA, NA, NA, 1, 1, NA, NA), Question2b = c(NA, 1, NA, 1, NA, NA,
NA, NA), Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA), Question3b = c(NA,
NA, 1, 1, NA, NA, NA, NA)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L))
You can use sapply and a function:
list_data <- sapply(c("Question1", "Question2", "Question3"),
function(x) df[startsWith(names(df),x)], simplify = FALSE)
This will store everything in a list. To get the individual data sets in the global environment as individual objects, use:
list2env(list_data, globalenv())
Output
# $Question1
# # A tibble: 8 × 3
# Question1a Question1b Question1c
# <dbl> <dbl> <dbl>
# 1 1 NA 1
# 2 1 1 1
# 3 NA NA NA
# 4 NA 1 NA
# 5 1 NA 1
# 6 1 1 NA
# 7 1 NA NA
# 8 NA 1 NA
#
# $Question2
# # A tibble: 8 × 2
# Question2a Question2b
# <dbl> <dbl>
# 1 1 NA
# 2 NA 1
# 3 NA NA
# 4 NA 1
# 5 1 NA
# 6 1 NA
# 7 NA NA
# 8 NA NA
#
# $Question3
# # A tibble: 8 × 2
# Question3a Question3b
# <dbl> <dbl>
# 1 NA NA
# 2 NA NA
# 3 NA 1
# 4 NA 1
# 5 1 NA
# 6 1 NA
# 7 1 NA
# 8 NA NA
I believe the underlying question is about data-formats.
Here's a few:
library(tidyverse)
structure(
list(
ID = c(1, 2, 3, 4, 5, 6, 7, 8),
Question1a = c(1,
1, NA, NA, 1, 1, 1, NA),
Question1b = c(NA, 1, NA, 1, NA, 1,
NA, 1),
Question1c = c(1, 1, NA, NA, 1, NA, NA, NA),
Question2a = c(1,
NA, NA, NA, 1, 1, NA, NA),
Question2b = c(NA, 1, NA, 1, NA, NA,
NA, NA),
Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA),
Question3b = c(NA,
NA, 1, 1, NA, NA, NA, NA)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -8L)
) -> square_df
square_df %>%
pivot_longer(-ID,
names_to = c("Question", "Item"),
names_pattern = "Question(\\d+)(\\w+)") ->
long_df
long_df
#> # A tibble: 56 × 4
#> ID Question Item value
#> <dbl> <chr> <chr> <dbl>
#> 1 1 1 a 1
#> 2 1 1 b NA
#> 3 1 1 c 1
#> 4 1 2 a 1
#> 5 1 2 b NA
#> 6 1 3 a NA
#> 7 1 3 b NA
#> 8 2 1 a 1
#> 9 2 1 b 1
#> 10 2 1 c 1
#> # … with 46 more rows
long_df %>%
na.omit(value) ->
sparse_long_df
sparse_long_df
#> # A tibble: 22 × 4
#> ID Question Item value
#> <dbl> <chr> <chr> <dbl>
#> 1 1 1 a 1
#> 2 1 1 c 1
#> 3 1 2 a 1
#> 4 2 1 a 1
#> 5 2 1 b 1
#> 6 2 1 c 1
#> 7 2 2 b 1
#> 8 3 3 b 1
#> 9 4 1 b 1
#> 10 4 2 b 1
#> # … with 12 more rows
sparse_long_df %>%
nest(data = c(ID, Item, value)) ->
nested_long_df
nested_long_df
#> # A tibble: 3 × 2
#> Question data
#> <chr> <list>
#> 1 1 <tibble [12 × 3]>
#> 2 2 <tibble [5 × 3]>
#> 3 3 <tibble [5 × 3]>
Created on 2022-05-12 by the reprex package (v2.0.1)
You could also use map to store each dataframe in a list, e.g.
library(purrr)
# 3 = number of questions
map(c(1:3),
function(x){
quest <- paste0("Question",x)
select(df, ID, starts_with(quest))
})
Output:
[[1]]
# A tibble: 8 x 4
ID Question1a Question1b Question1c
<dbl> <dbl> <dbl> <dbl>
1 1 1 NA 1
2 2 1 1 1
3 3 NA NA NA
4 4 NA 1 NA
5 5 1 NA 1
6 6 1 1 NA
7 7 1 NA NA
8 8 NA 1 NA
[[2]]
# A tibble: 8 x 3
ID Question2a Question2b
<dbl> <dbl> <dbl>
1 1 1 NA
2 2 NA 1
3 3 NA NA
4 4 NA 1
5 5 1 NA
6 6 1 NA
7 7 NA NA
8 8 NA NA
[[3]]
# A tibble: 8 x 3
ID Question3a Question3b
<dbl> <dbl> <dbl>
1 1 NA NA
2 2 NA NA
3 3 NA 1
4 4 NA 1
5 5 1 NA
6 6 1 NA
7 7 1 NA
8 8 NA NA
I found a really intuitive solution using the dplyr package, using the select and starts_with commands. Alternatively, you can also replace the starts_with command with contains, if the you are not identifying the similar variables by a prefix but some other common feature.
Q1 <- Survey %>%
select(
starts_with("Question1")
)
Q2 <- Survey %>%
select(
starts_with("Question2")
)
Q3 <- Survey %>%
select(
starts_with("Question3")
)
I have a large data.table in the format below
Name Value 1 2 3 4 5
A 58 1 NA NA NA NA
B 47 NA 1 NA NA NA
C 89 NA NA 1 NA NA
D 68 NA NA NA 1 NA
E 75 NA NA NA NA 1
I would like to forward rows of the data table to achieve below results. I know how to forward fill columns.
Name Value 1 2 3 4 5
A 58 1 1 1 1 1
B 47 NA 1 1 1 1
C 89 NA NA 1 1 1
D 68 NA NA NA 1 1
E 75 NA NA NA NA 1
Help!
data.table has it's own nafill function.
library(data.table) #v>=1.12.8
library(magrittr)
melt(dt, id = 1:2) %>%
.[, value := nafill(value, "locf"), by = Name] %>%
dcast(., ... ~ variable)
# Name Value 1 2 3 4 5
# 1: A 58 1 1 1 1 1
# 2: B 47 NA 1 1 1 1
# 3: C 89 NA NA 1 1 1
# 4: D 68 NA NA NA 1 1
# 5: E 75 NA NA NA NA 1
Data
dt <- fread("Name Value 1 2 3 4 5
A 58 1 NA NA NA NA
B 47 NA 1 NA NA NA
C 89 NA NA 1 NA NA
D 68 NA NA NA 1 NA
E 75 NA NA NA NA 1")
Use fill in tidyr to fill in missing values with previous value.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(3:7) %>%
group_by(Name) %>%
fill(value) %>%
ungroup() %>%
pivot_wider()
# # A tibble: 5 x 7
# Name Value `1` `2` `3` `4` `5`
# <fct> <int> <int> <int> <int> <int> <int>
# 1 A 58 1 1 1 1 1
# 2 B 47 NA 1 1 1 1
# 3 C 89 NA NA 1 1 1
# 4 D 68 NA NA NA 1 1
# 5 E 75 NA NA NA NA 1
Note: The output above is the same as
df %>% fill(3:7, .direction = "up")
but the logic is different. The former belongs to "filling rows forward" and the latter is "filling columns backward". They will differ in other cases.
Data
df <- structure(list(Name = structure(1:5, .Label = c("A", "B", "C",
"D", "E"), class = "factor"), Value = c(58L, 47L, 89L, 68L, 75L
), `1` = c(1L, NA, NA, NA, NA), `2` = c(NA, 1L, NA, NA, NA),
`3` = c(NA, NA, 1L, NA, NA), `4` = c(NA, NA, NA, 1L, NA),
`5` = c(NA, NA, NA, NA, 1L)), class = "data.frame", row.names = c(NA, -5L))
I have a dataframe which looks like this:
`Row Labels` Female Male
<chr> <chr> <chr>
1 London <NA> <NA>
2 42 <NA> 1
3 Paris <NA> <NA>
4 36 1 <NA>
5 Belgium <NA> <NA>
6 18 1
7 21 <NA> 1
8 Madrid <NA> <NA>
9 20 1 <NA>
10 Berlin <NA> <NA>
11 37 <NA> 1
12 23 1
13 25 1
14 44 1
The code I used to produce this dataframe looks like this:
structure(list(`Row Labels` = c("London", "42", "Paris","36", "Belgium","18" ,"21", "Madrid", "20", "Berlin", "37","23","25","44"),
Female = c(NA, NA, NA, "1", NA, NA,NA, NA, "1", NA, NA,"1","1","1"), Male = c(NA,"1", NA, NA, NA, "1", NA, NA, NA, "1",NA,NA,NA,NA)),
.Names = c("Row Labels","Female", "Male"), row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))
I would like to know how I can change multiple rows in this dataframe to become columns.
My ideal output looks like this:
'Row Labels' Female Male 42 36 21 20 37 18 23 25 44
London 1 1
Paris 1 1
Belgium 1 1 1 1
Madrid 1 1
Berlin 3 1 1 1 1 1
Seems very mechanical. Calling your data d:
d1 = d[seq(1, nrow(d), by = 2), ]
d2 = d[seq(2, nrow(d), by = 2), ]
d1[, c("Male", "Female")] = d2[, c("Male", "Female")]
d3 = matrix(nrow = nrow(d2), ncol = nrow(d2))
diag(d3) = 1
colnames(d3) = d2$`Row Labels`
cbind(d2, d3)
# Row Labels Female Male 42 36 21 20 37
# 1 42 <NA> 1 1 NA NA NA NA
# 2 36 1 <NA> NA 1 NA NA NA
# 3 21 <NA> 1 NA NA 1 NA NA
# 4 20 1 <NA> NA NA NA 1 NA
# 5 37 <NA> 1 NA NA NA NA 1
Using tidyverse.
library(dplyr)
library(tidyr)
#cumsum based on country names
df %>% group_by(gr=cumsum(grepl('\\D+',`Row Labels`))) %>%
#Sum Female and Male
mutate_at(vars('Female','Male'), list(~sum(as.numeric(.), na.rm = T))) %>%
#Create RL from country name and number where we are at numbers
mutate(RL=ifelse(row_number()>1, paste0(first(`Row Labels`),',',`Row Labels`), NA)) %>%
filter(!is.na(RL)) %>%
select(RL, gr, Male, Female) %>%
separate(RL, into = c('RL','Age')) %>% mutate(flag=1) %>% spread(Age, flag) %>%
ungroup() %>% select(-gr)
# A tibble: 5 x 12
RL Male Female `18` `20` `21` `23` `25` `36` `37` `42` `44`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Belgium 1 0 1 NA 1 NA NA NA NA NA NA
2 Berlin 1 3 NA NA NA 1 1 NA 1 NA 1
3 London 1 0 NA NA NA NA NA NA NA 1 NA
4 Madrid 0 1 NA 1 NA NA NA NA NA NA NA
5 Paris 0 1 NA NA NA NA NA 1 NA NA NA
I have a tibble with information about diagnoses:
data <- tibble(
id = c(1:10),
diagnosis_1 = c("F32", "F431", "R58", "S32", "F11", NA, NA, "Y67", "F32", "Z032"),
diagnosis_2 = c(NA, NA, NA, NA, NA, NA, "G35", NA, NA, NA),
diagnosis_3 = c("F40", NA, "R67", "F431", NA, "F60", "S58", "R68", "F11", NA),
diagnosis_4 = c(NA, NA, "F65", NA, "F19", NA, NA, "F32", NA, NA)
)
As a part of the cleaning process, I have removed all diagnoses not fulfilling certain criteria (i.e. not starting with the letter F, G, or Z). With the following code:
data$diagnosis_1[str_sub(data$diagnosis_1, 1,1) %in% c("R", "S", "Y")] <- NA
data$diagnosis_2[str_sub(data$diagnosis_2, 1,1) %in% c("R", "S", "Y")] <- NA
data$diagnosis_3[str_sub(data$diagnosis_3, 1,1) %in% c("R", "S", "Y")] <- NA
data$diagnosis_4[str_sub(data$diagnosis_4, 1,1) %in% c("R", "S", "Y")] <- NA
Ending up with this tibble:
I now need to move the data to the left to fill the columns from left to right (i.e diagnosis_1 not being empty if diagnosis_2, diagnosis_3 or diagnosis_4 has data). I have tried using ifelse() as it is vectorized but I can`t seem to get it to work with several nested ifelse().
ifelse(is.na(data$diagnosis_1), data$diagnosis_2, data$diagnosis_1))
All suggestions are much appreciated.
Edit: adding expected output:
Using dplyr and tidyr. Reshape from wide to long, exclude "^RSY" and NA diagnosis, reshape long to wide.
library(dplyr)
library(tidyr)
gather(data, key = "k", value = "v", -id) %>%
filter(!(grepl("^[R|S|Y]", v) | is.na(v))) %>%
group_by(id) %>%
mutate(diagN = paste0("diagnosis_", row_number())) %>%
select(-k) %>%
spread(key = "diagN", value = "v") %>%
ungroup()
# # A tibble: 10 x 3
# id diagnosis_1 diagnosis_2
# <int> <chr> <chr>
# 1 1 F32 F40
# 2 2 F431 NA
# 3 3 F65 NA
# 4 4 F431 NA
# 5 5 F11 F19
# 6 6 F60 NA
# 7 7 G35 NA
# 8 8 F32 NA
# 9 9 F32 F11
# 10 10 Z032 NA
We first replace values which start with either "R", "S" or "Y" to NA and then left shift the non-NA values.
data[-1] <- lapply(data[-1], function(x) replace(x, grepl("^[R|S|Y]", x), NA))
data[] <- t(apply(data, 1, function(x) `length<-`(na.omit(x), length(x))))
data
# A tibble: 10 x 5
# id diagnosis_1 diagnosis_2 diagnosis_3 diagnosis_4
# <chr> <chr> <chr> <chr> <chr>
# 1 " 1" F32 F40 NA NA
# 2 " 2" F431 NA NA NA
# 3 " 3" F65 NA NA NA
# 4 " 4" F431 NA NA NA
# 5 " 5" F11 F19 NA NA
# 6 " 6" F60 NA NA NA
# 7 " 7" G35 NA NA NA
# 8 " 8" F32 NA NA NA
# 9 " 9" F32 F11 NA NA
#10 10 Z032 NA NA NA
Shifting the non-NA value to left has been taken from David's answer from here. You can try any other approach to shift values from the same question as well.
You can try a tidyverse
library(tidyverse)
data %>%
mutate_at(vars(starts_with("diagnosis")), funs(ifelse(str_sub(., 1, 1) %in% c("R", "S", "Y"), NA, .))) %>%
gather(k,v, -id) %>%
group_by(id) %>%
arrange(id) %>%
mutate(v=ifelse(k == "diagnosis_1", v[!is.na(v)][1], v)) %>%
spread(k, v)
# A tibble: 10 x 5
# Groups: id [10]
id diagnosis_1 diagnosis_2 diagnosis_3 diagnosis_4
<int> <chr> <chr> <chr> <chr>
1 1 F32 NA F40 NA
2 2 F431 NA NA NA
3 3 F65 NA NA F65
4 4 F431 NA F431 NA
5 5 F11 NA NA F19
6 6 F60 NA F60 NA
7 7 G35 G35 NA NA
8 8 F32 NA NA F32
9 9 F32 NA F11 NA
10 10 Z032 NA NA NA
As its unclear what OP wants (see discussion below) you can also try
data %>%
mutate_at(vars(starts_with("diagnosis")), funs(ifelse(str_sub(., 1, 1) %in% c("R", "S", "Y"), NA, .))) %>%
gather(k,v, -id) %>%
group_by(id) %>%
arrange(id) %>%
mutate(v=c(v[!is.na(v)], rep(NA, length(v) - length(v[!is.na(v)])))) %>%
spread(k, v)
# A tibble: 10 x 5
# Groups: id [10]
id diagnosis_1 diagnosis_2 diagnosis_3 diagnosis_4
<int> <chr> <chr> <chr> <chr>
1 1 F32 F40 NA NA
2 2 F431 NA NA NA
3 3 F65 NA NA NA
4 4 F431 NA NA NA
5 5 F11 F19 NA NA
6 6 F60 NA NA NA
7 7 G35 NA NA NA
8 8 F32 NA NA NA
9 9 F32 F11 NA NA
10 10 Z032 NA NA NA
You can use Reduce along with coalesce from dplyr, i.e.
df$diagnosis_1 <- Reduce(dplyr::coalesce, df[-1])
#id diagnosis_1 diagnosis_2 diagnosis_3 diagnosis_4
# <int> <chr> <chr> <chr> <chr>
# 1 1 F32 <NA> F40 <NA>
# 2 2 F431 <NA> <NA> <NA>
# 3 3 F65 <NA> <NA> F65
# 4 4 F431 <NA> F431 <NA>
# 5 5 F11 <NA> <NA> F19
# 6 6 F60 <NA> F60 <NA>
# 7 7 G35 G35 <NA> <NA>
# 8 8 F32 <NA> <NA> F32
# 9 9 F32 <NA> F11 <NA>
#10 10 Z032 <NA> <NA> <NA>
Below solution using function na_move from package dedupewider.
library(dedupewider)
na_move(data) # 'right' direction is by default
#> # A tibble: 10 x 5
#> id diagnosis_1 diagnosis_2 diagnosis_3 diagnosis_4
#> * <chr> <chr> <chr> <lgl> <lgl>
#> 1 1 F32 F40 NA NA
#> 2 2 F431 <NA> NA NA
#> 3 3 F65 <NA> NA NA
#> 4 4 F431 <NA> NA NA
#> 5 5 F11 F19 NA NA
#> 6 6 F60 <NA> NA NA
#> 7 7 G35 <NA> NA NA
#> 8 8 F32 <NA> NA NA
#> 9 9 F32 F11 NA NA
#> 10 10 Z032 <NA> NA NA
A tidyr update, using pivot_longer and unnest_wider.
dplyr 1.0.10 CRAN release: 2022-09-01
tidyr 1.2.1 CRAN release: 2022-09-08
Step 1: clean up data
library(dplyr)
library(tidyr)
data <- data %>%
mutate(across(starts_with("diag"), ~
replace(.x, grepl(paste0("^", c("R", "S", "Y"), collapse="|"), .x), NA)))
Step 2: left-compact data
data %>%
pivot_longer(starts_with("diag")) %>%
group_by(id) %>%
mutate(value = value[order(is.na(value))]) %>%
summarize(col = list(value)) %>%
unnest_wider(col, names_sep="_") %>%
setNames(colnames({{data}}))
# A tibble: 10 × 5
id diagnosis_1 diagnosis_2 diagnosis_3 diagnosis_4
<int> <chr> <chr> <chr> <chr>
1 1 F32 F40 NA NA
2 2 F431 NA NA NA
3 3 F65 NA NA NA
4 4 F431 NA NA NA
5 5 F11 F19 NA NA
6 6 F60 NA NA NA
7 7 G35 NA NA NA
8 8 F32 NA NA NA
9 9 F32 F11 NA NA
10 10 Z032 NA NA NA
data
data <- structure(list(id = 1:10, diagnosis_1 = c("F32", "F431", "R58",
"S32", "F11", NA, NA, "Y67", "F32", "Z032"), diagnosis_2 = c(NA,
NA, NA, NA, NA, NA, "G35", NA, NA, NA), diagnosis_3 = c("F40",
NA, "R67", "F431", NA, "F60", "S58", "R68", "F11", NA), diagnosis_4 = c(NA,
NA, "F65", NA, "F19", NA, NA, "F32", NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))