How do I pivot_wider a char column?

I'm trying to pivot_wider a tibble of random alpha strings:
stri_rand_strings(252, 5, '[a-z]') %>%
sort() %>%
as_tibble() %>%
mutate(id = row_number(),
col = rep(letters[1:4], each = length(value) / 4)) %>%
pivot_wider(names_from = col, values_from = value)
I get three columns of NA in a tibble (252 x 5):
# A tibble: 252 × 5
id a b c d
<int> <chr> <chr> <chr> <chr>
1 1 aarup NA NA NA
2 2 abhir NA NA NA
3 3 afpgt NA NA NA
4 4 apjts NA NA NA
5 5 arlst NA NA NA
6 6 awkjn NA NA NA
7 7 babro NA NA NA
8 8 bbrpn NA NA NA
9 9 bbrzt NA NA NA
10 10 bedzs NA NA NA
# … with 242 more rows
instead of the desired 63 x 5.

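Your id column is the problem. Row numbers are unique, so every row is its own identifier and pivot_wider has nothing to collapse: each id gets a value in exactly one of the four columns and NA in the rest.
Make id repeat within each col group instead, for example: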
stringi::stri_rand_strings(252, 5, '[a-z]') %>%
sort() %>%
as_tibble() %>%
mutate(id = rep(1:(length(value) / 4), 4), # !! <-- !!
col = rep(letters[1:4], each = length(value) / 4)) %>%
pivot_wider(names_from = col, values_from = value)
# A tibble: 63 x 5
id a b c d
<int> <chr> <chr> <chr> <chr>
1 1 ababk glynv mottj tqcbv
2 2 abysq gmfhc mujcw twjix
3 3 aerkp godcs mycak tzqny
4 4 agtoa gpler naetp ucuvg
5 5 ahebl grqgz nfali ufbqv
6 6 amdvv gswwu nhmnu ulgup
7 7 apgut gvkwh nkcks umwih
8 8 atgxy gynef nkklm uojxc
9 9 bcklx hcdup nngfz upfhx
10 10 bcnxz hcpzy nnvpd uqlgs
# ... with 53 more rows
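The same idea can be written without computing the id by hand: numbering the rows within each col group makes the id restart at 1 for every target column. The following is only a minimal sketch under assumptions; it swaps the random strings for a small fixed vector (`values`, a made-up stand-in) so that the result is reproducible.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the sorted random strings: 8 values, 4 target columns.
values <- sprintf("s%02d", 1:8)

wide <- tibble(value = values) %>%
  mutate(col = rep(letters[1:4], each = n() / 4)) %>%
  group_by(col) %>%
  mutate(id = row_number()) %>%   # id restarts at 1 inside each col group
  ungroup() %>%
  pivot_wider(names_from = col, values_from = value)

wide
```

With 8 values spread over 4 columns, this should give a 2 x 5 tibble with no NA cells, because every id now occurs once in each of a, b, c and d.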

Related

R Lag Variable And Skip Value Between

DATA = data.frame(STUDENT = c(1,1,1,2,2,2,3,3,4,4),
SCORE = c(6,4,8,10,9,0,2,3,3,7),
CLASS = c('A', 'B', 'C', 'A', 'B', 'C', 'B', 'C', 'A', 'B'),
WANT = c(NA, NA, 2, NA, NA, -10, NA, NA, NA, NA))
I have DATA and wish to create 'WANT', which is calculated as follows:
For each STUDENT, WANT equals SCORE(CLASS = C) - SCORE(CLASS = A), placed on the CLASS = C row; every other row is NA.
EX: SCORE(STUDENT = 1, CLASS = C) - SCORE(STUDENT = 1, CLASS = A) = 8-6=2
Assuming at most one 'C' and one 'A' CLASS per 'STUDENT', group by 'STUDENT', subset 'SCORE' where CLASS is 'C' and where it is 'A', subtract, and keep the result only on the 'C' row. Multiplying by NA^(CLASS != "C") does the masking: NA^0 is 1 and NA^1 is NA, so every non-'C' row becomes NA.
library(dplyr)
DATA <- DATA %>%
group_by(STUDENT) %>%
mutate(WANT2 = (SCORE[CLASS == 'C'][1] - SCORE[CLASS == 'A'][1]) *
NA^(CLASS != "C")) %>%
ungroup()
-output
# A tibble: 10 × 5
STUDENT SCORE CLASS WANT WANT2
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 6 A NA NA
2 1 4 B NA NA
3 1 8 C 2 2
4 2 10 A NA NA
5 2 9 B NA NA
6 2 0 C -10 -10
7 3 2 B NA NA
8 3 3 C NA NA
9 4 3 A NA NA
10 4 7 B NA NA
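The NA^(...) masking step can look cryptic, so here is a small base R sketch of what it does in isolation. The vectors `cls` and `diff_val` are made-up illustration values, not part of the original answer; the ifelse line is a more explicit alternative that should give the same result.

```r
# Hypothetical one-student example: three classes and an already computed difference.
cls <- c("A", "B", "C")
diff_val <- 2

# Logical exponent: TRUE -> NA^1 = NA, FALSE -> NA^0 = 1.
mask <- NA^(cls != "C")

masked <- diff_val * mask

# A more explicit spelling of the same masking.
masked_explicit <- ifelse(cls == "C", diff_val, NA_real_)
```

Both `masked` and `masked_explicit` keep the difference on the "C" position and hold NA elsewhere.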
Here is a solution that first reshapes the data to a wider format, then back to a longer format below. It works regardless of the order of the "CLASS" column (for instance, if CLASS appears as CBA or BCA instead of ABC for some student, this solution still works).
Solution
library(dplyr)
library(tidyr)
wider <- DATA %>% select(-WANT) %>%
pivot_wider( names_from = "CLASS", values_from = "SCORE") %>%
rowwise() %>%
mutate(WANT = C-A) %>%
ungroup()
output wider
# A tibble: 4 × 5
STUDENT A B C WANT
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6 4 8 2
2 2 10 9 0 -10
3 3 NA 2 3 NA
4 4 3 7 NA NA
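If you really want the same layout as your example output, the wider data can be reshaped back as shown here. The result has one row per STUDENT and CLASS combination, so classes that were missing in the original data appear with SCORE = NA.
Reorganizing wider to long format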
wider %>%
pivot_longer(A:C, values_to = "SCORE", names_to = "CLASS") %>%
relocate(WANT, .after = SCORE) %>%
mutate(WANT = if_else(CLASS == "C", WANT, NA_real_))
Final Output
# A tibble: 12 × 4
STUDENT CLASS SCORE WANT
<dbl> <chr> <dbl> <dbl>
1 1 A 6 NA
2 1 B 4 NA
3 1 C 8 2
4 2 A 10 NA
5 2 B 9 NA
6 2 C 0 -10
7 3 A NA NA
8 3 B 2 NA
9 3 C 3 NA
10 4 A 3 NA
11 4 B 7 NA
12 4 C NA NA

R - How can I combine values across multiple columns in a pairwise matching fashion, possibly using 'mutate' or 'coalesce'?

I have a dataset taken from a large survey that consists of ~2k participants (rows) and some 160 variables (columns). The relevant columns include the following:
participant IDS
cl.exp which is a "yes"/"no" response
cl.yes.Q1 up to cl.yes.Q7 which have values if the participant answered "yes" to cl.exp, and NA if they answered "no" to cl.exp
cl.no.Q1 up to cl.no.Q7 which have values if the participant answered "no" to cl.exp, and NA if they answered "yes" to cl.exp
The questions for both cl.exp "yes" and "no" are synonymous, with the exception that no.Q6 and no.Q7 are the inverse of yes.Q6 and yes.Q7; i.e., no.Q7 is synonymous with yes.Q6 and no.Q6 is synonymous with yes.Q7.
The first few rows could be as follows:
ID cl.exp cl.yes.Q1 cl.yes.Q2 cl.yes.Q3 cl.yes.Q4 cl.yes.Q5 cl.yes.Q6 cl.yes.Q7 cl.no.Q1 cl.no.Q2 cl.no.Q3 cl.no.Q4 cl.no.Q5 cl.no.Q6 cl.no.Q7
1  No     NA        NA        NA        NA        NA        NA        NA        2        6        3        4        3        7        4
2  No     NA        NA        NA        NA        NA        NA        NA        3        6        6        6        5        7        3
3  Yes    2         5         6         6         4         2         7         NA       NA       NA       NA       NA       NA       NA
4  Yes    7         1         5         6         7         2         5         NA       NA       NA       NA       NA       NA       NA
You'll notice that participants either have values in cl.yes.Q1-7 or cl.no.Q1-7, but never both. I'd like to do either one of the following: a) move the values from cl.no.Q1-7 into the respective columns of cl.yes.Q1-7, or b) create new columns that combine the appropriate columns from cl.yes and cl.no, i.e., cl.yes.Q1 and cl.no.Q1, cl.yes.Q2 and cl.no.Q2, and so on.
I solve the no.Q6 and no.Q7 reverse issue by using the following code:
df[15:16] <- df[16:15]
I then do the following:
df.yes <- df %>%
select(contains("cl.yes"), id, cl.exp) %>%
drop_na()
df.no <- df %>%
select(contains("cl.no"), id, cl.exp) %>%
drop_na()
names(df.yes) <- gsub("cl.yes.", "cl.", names(df.yes))
names(df.no) <- gsub("cl.no.", "cl.", names(df.no))
df.cl <- merge(df.yes, df.no, all = TRUE)
This gives me a new dataframe that has the merged columns. However, I believe there must be a simpler/cleaner/more elegant solution than this, particularly with the ability to keep the data in the original dataframe. I tried some iterations with mutate and coalesce and could never succeed. If anyone has a one or two line code that basically does the same thing that I did here, I would greatly appreciate your insight. Thank you!
This solution uses a for loop to coalesce the columns that end with the same question number. You need to know the last question number in order to write the loop's range.
library(dplyr)
dat2 <- dat %>% select(ID, cl.exp)
for (i in 1:7){
temp <- dat %>%
select(ends_with(paste0("Q", i))) %>%
as.list()
dat2[[paste0("cl.Q", i)]] <- coalesce(!!!temp)
}
dat2
# ID cl.exp cl.Q1 cl.Q2 cl.Q3 cl.Q4 cl.Q5 cl.Q6 cl.Q7
# 1 1 No 2 6 3 4 3 7 4
# 2 2 No 3 6 6 6 5 7 3
# 3 3 Yes 2 5 6 6 4 2 7
# 4 4 Yes 7 1 5 6 7 2 5
Note: I did not swap Q6 and Q7, but I am sure you have figured the best way to do it.
DATA
dat <- read.table(text = "ID cl.exp cl.yes.Q1 cl.yes.Q2 cl.yes.Q3 cl.yes.Q4 cl.yes.Q5 cl.yes.Q6 cl.yes.Q7 cl.no.Q1 cl.no.Q2 cl.no.Q3 cl.no.Q4 cl.no.Q5 cl.no.Q6 cl.no.Q7
1 No NA NA NA NA NA NA NA 2 6 3 4 3 7 4
2 No NA NA NA NA NA NA NA 3 6 6 6 5 7 3
3 Yes 2 5 6 6 4 2 7 NA NA NA NA NA NA NA
4 Yes 7 1 5 6 7 2 5 NA NA NA NA NA NA NA",
header = TRUE)
How about something like this:
library(tidyverse)
dat <- tibble::tribble(~ID, ~cl.exp, ~cl.yes.Q1, ~cl.yes.Q2, ~cl.yes.Q3, ~cl.yes.Q4, ~cl.yes.Q5, ~cl.yes.Q6, ~cl.yes.Q7, ~cl.no.Q1, ~cl.no.Q2, ~cl.no.Q3, ~cl.no.Q4, ~cl.no.Q5, ~cl.no.Q6, ~cl.no.Q7,
1, "No",NA,NA,NA,NA,NA,NA,NA,2,6,3,4,3,7,4,
2, "Yes",3,6,6,6,5,7,3, NA,NA,NA,NA,NA,NA,NA)
bind_cols(dat %>% select(c(ID, cl.exp)),
coalesce(dat %>%
select(contains("no")) %>%
setNames(gsub("\\.no", "", names(.))),
dat %>%
select(contains("yes"))%>%
setNames(gsub("\\.yes", "", names(.)))))
#> # A tibble: 2 × 9
#> ID cl.exp cl.Q1 cl.Q2 cl.Q3 cl.Q4 cl.Q5 cl.Q6 cl.Q7
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 No 2 6 3 4 3 7 4
#> 2 2 Yes 3 6 6 6 5 7 3
Created on 2022-06-10 by the reprex package (v2.0.1)
I think you need this one:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(ID, cl.exp),
names_to = c('.value', 'name'),
names_pattern = '(.*)(\\d+)') %>%
mutate(cl.yes.Q = coalesce(cl.yes.Q, cl.no.Q), .keep="unused") %>%
pivot_wider(names_from = name, values_from = cl.yes.Q)
ID cl.exp `1` `2` `3` `4` `5` `6` `7`
<int> <chr> <int> <int> <int> <int> <int> <int> <int>
1 1 No 2 6 3 4 3 7 4
2 2 No 3 6 6 6 5 7 3
3 3 Yes 2 5 6 6 4 2 7
4 4 Yes 7 1 5 6 7 2 5
You should be able to use mutate and coalesce like so (after swapping no.Q6 and no.Q7 as you did above):
library(dplyr)
result <- df %>% mutate(cl.Q1 = coalesce(cl.yes.Q1, cl.no.Q1),
cl.Q2 = coalesce(cl.yes.Q2, cl.no.Q2),
cl.Q3 = coalesce(cl.yes.Q3, cl.no.Q3),
cl.Q4 = coalesce(cl.yes.Q4, cl.no.Q4),
cl.Q5 = coalesce(cl.yes.Q5, cl.no.Q5),
cl.Q6 = coalesce(cl.yes.Q6, cl.no.Q6),
cl.Q7 = coalesce(cl.yes.Q7, cl.no.Q7)) %>%
select(-(cl.yes.Q1:cl.no.Q7))
We simply use mutate to create the cl.Q* columns coalescing the values from the cl.yes.Q* and cl.no.Q* columns, respectively. Then, we remove the original cl.yes.Q* and cl.no.Q* columns using select.
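If you would rather not reorder the columns by position first, the Q6/Q7 inversion can be handled by name while coalescing. The following is only a sketch under assumptions: `no_for_yes` and `combined` are hypothetical names introduced here, the pairing direction follows the question's description (no.Q7 matches yes.Q6 and no.Q6 matches yes.Q7), and the data is the four-row sample from the question.

```r
library(dplyr)

dat <- read.table(text = "ID cl.exp cl.yes.Q1 cl.yes.Q2 cl.yes.Q3 cl.yes.Q4 cl.yes.Q5 cl.yes.Q6 cl.yes.Q7 cl.no.Q1 cl.no.Q2 cl.no.Q3 cl.no.Q4 cl.no.Q5 cl.no.Q6 cl.no.Q7
1 No NA NA NA NA NA NA NA 2 6 3 4 3 7 4
2 No NA NA NA NA NA NA NA 3 6 6 6 5 7 3
3 Yes 2 5 6 6 4 2 7 NA NA NA NA NA NA NA
4 Yes 7 1 5 6 7 2 5 NA NA NA NA NA NA NA",
header = TRUE)

# Hypothetical lookup: which cl.no question pairs with each cl.yes question.
# Positions 6 and 7 are swapped, as described in the question.
no_for_yes <- c(1, 2, 3, 4, 5, 7, 6)

combined <- dat[c("ID", "cl.exp")]
for (q in 1:7) {
  combined[[paste0("cl.Q", q)]] <- coalesce(
    dat[[paste0("cl.yes.Q", q)]],
    dat[[paste0("cl.no.Q", no_for_yes[q])]]
  )
}

combined
```

For the two "No" participants, cl.Q6 then takes its value from cl.no.Q7 and cl.Q7 from cl.no.Q6, while the "Yes" participants keep their cl.yes values unchanged.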

Casting multiple values in R without dropping rows with duplicate IDs but different values

I need to cast multiple values from this df:
test_df <- data.frame(ID=c("409_012", rep("409_003", 2)),
type=c("a", rep("b", 2)),
val1=sample(1:10, 3),
val2=sample(1:10, 3) )
ID type val1 val2
1 409_012 a 5 9
2 409_003 b 10 2
3 409_003 b 2 3
To get the following df:
ID val1_a val2_a val1_b val2_b
1 409_012 5 9 NA NA
2 409_003 NA NA 10 2
3 409_003 NA NA 2 3
I tried the following command. However, I have multiple rows with ID 409_003 that hold different values and must stay as separate rows, and this command collapses them into a single row:
as.data.frame(tidyr::pivot_wider(test_df, names_from=type, values_from=c(val1, val2)))
And wrongly gives me this:
ID val1_a val1_b val2_a val2_b
1 409_012 5 NULL 9 NULL
2 409_003 NULL 10, 2 NULL 2, 3
Can anyone help with this? It would be much appreciated.
Update (a better approach), using names_glue:
library(dplyr)
library(tidyr)
test_df %>%
mutate(id = row_number()) %>%
pivot_wider(
names_from = type,
values_from = c(val1, val2),
names_glue = "{.value}_{type}"
) %>%
select(-id)
ID val1_a val1_b val2_a val2_b
<chr> <int> <int> <int> <int>
1 409_012 5 NA 10 NA
2 409_003 NA 10 NA 7
3 409_003 NA 7 NA 8
First answer:
One way could be:
library(dplyr)
library(tidyr)
test_df %>%
pivot_longer(
-c(ID, type)
) %>%
unite(names, c(name, type)) %>%
pivot_wider(
names_from = names,
values_from = value,
values_fn = list
) %>%
unnest(c(val1_a, val2_a, val1_b, val2_b))
ID val1_a val2_a val1_b val2_b
<chr> <int> <int> <int> <int>
1 409_012 5 10 NA NA
2 409_003 NA NA 10 7
3 409_003 NA NA 7 8
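Because the outputs above come from sample(), their values differ from the question's example. As a rough check, the sketch below reruns the row-number approach on the fixed values printed in the question and puts the columns in the requested order. The helper column name `row` is an assumption made for illustration.

```r
library(dplyr)
library(tidyr)

# Fixed values copied from the question's printed data frame.
test_df <- data.frame(
  ID   = c("409_012", "409_003", "409_003"),
  type = c("a", "b", "b"),
  val1 = c(5L, 10L, 2L),
  val2 = c(9L, 2L, 3L)
)

wide <- test_df %>%
  mutate(row = row_number()) %>%   # unique key so duplicate IDs stay on separate rows
  pivot_wider(names_from = type, values_from = c(val1, val2)) %>%
  select(ID, val1_a, val2_a, val1_b, val2_b)   # drops `row` and sets the column order

wide
```

The temporary `row` column is what keeps the two 409_003 rows apart: pivot_wider treats every column not named in names_from or values_from as part of the row identifier.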

A computationally efficient way to find the IDs of the Type 1 rows just above and below each Type 2 row?

I have the following data
df <- tibble(Type=c(1,2,2,1,1,2),ID=c(6,4,3,2,1,5))
Type ID
1 6
2 4
2 3
1 2
1 1
2 5
For each of the type 2 rows, I want to find the IDs of the type 1 rows just below and above them. For the above dataset, the output will be:
Type ID IDabove IDbelow
1 6 NA NA
2 4 6 2
2 3 6 2
1 2 NA NA
1 1 NA NA
2 5 1 NA
Naively, I can write a for loop to achieve this, but that would be too time consuming for the dataset I am dealing with.
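One approach uses dplyr's lag and lead to get the previous and next ID, and data.table's rleid to label runs of consecutive equal Type values. Within a run of Type 2 rows, the first row's lag is the Type 1 ID above the run and the last row's lead is the Type 1 ID below it, so taking first(IDabove) and last(IDbelow) per run fills the whole run.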
library(dplyr)
library(data.table)
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = rleid(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
# Type ID IDabove IDbelow
# <dbl> <dbl> <dbl> <dbl>
#1 1 6 NA NA
#2 2 4 6 2
#3 2 3 6 2
#4 1 2 NA NA
#5 1 1 NA NA
#6 2 5 1 NA
A dplyr only solution:
You could write your own rleid function and then apply the logic provided by Ronak (many thanks, upvoted).
library(dplyr)
my_func <- function(x) {
x <- rle(x)$lengths
rep(seq_along(x), times=x)
}
# this part is the same as provided by Ronak.
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = my_func(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
Output:
Type ID IDabove IDbelow
<dbl> <dbl> <dbl> <dbl>
1 1 6 NA NA
2 2 4 6 2
3 2 3 6 2
4 1 2 NA NA
5 1 1 NA NA
6 2 5 1 NA
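To see what the run grouping looks like for the sample data, here is a small base R sketch. The name `run_id` is a hypothetical one chosen for illustration; the function body follows the same rle-based idea as the helper above.

```r
# Label each run of consecutive equal values with an increasing integer.
run_id <- function(x) {
  lens <- rle(x)$lengths
  rep(seq_along(lens), times = lens)
}

type <- c(1, 2, 2, 1, 1, 2)
groups <- run_id(type)
groups
```

For this Type vector the two consecutive Type 2 rows share one group label, which is why first() and last() within the group can spread the neighbouring Type 1 IDs across the whole run.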

Mutate row sum but only if NA count is 2 or less

I'm trying to mutate a new variable (sum) of 5 columns of data but only if NA count across affected columns (v2 to v6) is 2 or less otherwise return an NA. The code below sums only where there are no NA's. Help appreciated.
df <- data.frame(v1=c("A","B","C","D","E","F"), v2=c(4,NA,5,6,NA,NA), v3=c(7,8,9,NA,NA,NA),
v4=c(NA,3,5,NA,1,4), v5=c(NA,3,5,NA,1,NA), v6=c(NA,3,5,NA,1,4))
df
library(dplyr)
df = df %>%
rowwise() %>%
mutate(sum(v2, v3, v4, v5, v6))
df
In base R, we can use rowSums twice: once to count the NAs in each row and once to sum the values in each row.
ifelse(rowSums(is.na(df[-1])) <= 2, rowSums(df[-1], na.rm = TRUE), NA)
#[1] NA 17 29 NA 3 NA
Using dplyr row-wise, you can do this as:
library(dplyr)
df %>%
rowwise() %>%
mutate(col = ifelse(sum(is.na(c_across(v2:v6))) <= 2,
sum(c_across(v2:v6), na.rm = TRUE), NA))
# A tibble: 6 x 7
# v1 v2 v3 v4 v5 v6 col
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A 4 7 NA NA NA NA
#2 B NA 8 3 3 3 17
#3 C 5 9 5 5 5 29
#4 D 6 NA NA NA NA NA
#5 E NA NA 1 1 1 3
#6 F NA NA 4 NA 4 NA
Shortened the code using the ifelse suggestion from @rpolicastro.
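For larger data, rowwise() can be slow. A possible vectorised variant, offered only as a sketch, combines the base R rowSums idea with dplyr's across() inside an ordinary mutate. The intermediate column names `n_missing` and `total` are assumptions introduced here for readability.

```r
library(dplyr)

df <- data.frame(v1 = c("A", "B", "C", "D", "E", "F"),
                 v2 = c(4, NA, 5, 6, NA, NA),
                 v3 = c(7, 8, 9, NA, NA, NA),
                 v4 = c(NA, 3, 5, NA, 1, 4),
                 v5 = c(NA, 3, 5, NA, 1, NA),
                 v6 = c(NA, 3, 5, NA, 1, 4))

out <- df %>%
  mutate(
    # Number of NA cells per row across v2..v6.
    n_missing = unname(rowSums(is.na(across(v2:v6)))),
    # Row total that ignores NA cells.
    total = unname(rowSums(across(v2:v6), na.rm = TRUE)),
    # Keep the total only when at most two cells are missing.
    col = if_else(n_missing <= 2, total, NA_real_)
  ) %>%
  select(-n_missing, -total)

out
```

This computes the NA counts and totals for all rows at once instead of evaluating one row at a time, and it should return the same col values as the row-wise version.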

Resources