Wide to long format with several variables - r

This question builds on a previous question I asked about converting from wide to long format in R, with an additional complication.
The previous question is here: Wide to long data conversion
The wide data I start with looks like the following:
d2 <- data.frame('id' = c(1,2),
'Q1' = c(2,3),
'Q2' = c(1,3),
'Q3' = c(3,1),
'Q1_X_Opt_1' = c(0,0),
'Q1_X_Opt_2' = c(75,200),
'Q1_X_Opt_3' = c(150,300),
'Q2_X_Opt_1' = c(0,0),
'Q2_X_Opt_2' = c(150,200),
'Q2_X_Opt_3' = c(75,300),
'Q3_X_Opt_1' = c(0,0),
'Q3_X_Opt_2' = c(100,500),
'Q3_X_Opt_3' = c(150,300))
In this example, there are two individuals who have answered three questions. The answer to each question takes one of the values {1,2,3}, encoded in Q1, Q2, and Q3. So, in this example, individual 1 chose option 2 in Q1, option 1 in Q2, and option 3 in Q3.
Each option also has an associated variable X (a cost) that I need to convert to long format as well. The output I am seeking looks like the following:
id question option choice cost
1 1 1 1 0 0
2 1 1 2 1 75
3 1 1 3 0 150
4 1 2 1 1 0
5 1 2 2 0 150
6 1 2 3 0 75
7 1 3 1 0 0
8 1 3 2 0 100
9 1 3 3 1 150
10 2 1 1 0 0
11 2 1 2 0 200
12 2 1 3 1 300
13 2 2 1 0 0
14 2 2 2 0 200
15 2 2 3 1 300
16 2 3 1 1 0
17 2 3 2 0 500
18 2 3 3 0 300
I have tried adapting the code from the answer to the prior question, but with no success thus far. Thanks for any suggestions or comments.

It's not exactly elegant, but here's a tidyverse version:
library(tidyverse)
d3 <- d2 %>%
gather(option, cost, -id:-Q3) %>%
gather(question, choice, Q1:Q3) %>%
separate(option, c('question2', 'option'), extra = 'merge') %>%
filter(question == question2) %>%
mutate_at(vars(question, option), parse_number) %>%
mutate(choice = as.integer(option == choice)) %>%
select(1, 5, 3, 6, 4) %>%
arrange(id)
d3
#> id question option choice cost
#> 1 1 1 1 0 0
#> 2 1 1 2 1 75
#> 3 1 1 3 0 150
#> 4 1 2 1 1 0
#> 5 1 2 2 0 150
#> 6 1 2 3 0 75
#> 7 1 3 1 0 0
#> 8 1 3 2 0 100
#> 9 1 3 3 1 150
#> 10 2 1 1 0 0
#> 11 2 1 2 0 200
#> 12 2 1 3 1 300
#> 13 2 2 1 0 0
#> 14 2 2 2 0 200
#> 15 2 2 3 1 300
#> 16 2 3 1 1 0
#> 17 2 3 2 0 500
#> 18 2 3 3 0 300
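For comparison, the same reshape can be sketched in base R without any packages. This is only an illustrative alternative, not the approach of the answer above. It assumes the column names follow the Q<question> and Q<question>_X_Opt_<option> pattern from the question; the helper names (long, vals, row, answer_col, cost_col) are made up for the example.

```r
d2 <- data.frame(id = c(1, 2),
                 Q1 = c(2, 3), Q2 = c(1, 3), Q3 = c(3, 1),
                 Q1_X_Opt_1 = c(0, 0), Q1_X_Opt_2 = c(75, 200), Q1_X_Opt_3 = c(150, 300),
                 Q2_X_Opt_1 = c(0, 0), Q2_X_Opt_2 = c(150, 200), Q2_X_Opt_3 = c(75, 300),
                 Q3_X_Opt_1 = c(0, 0), Q3_X_Opt_2 = c(100, 500), Q3_X_Opt_3 = c(150, 300))

# One row per id/question/option; the first argument varies fastest,
# which reproduces the row order asked for in the question.
long <- expand.grid(option = 1:3, question = 1:3, id = d2$id)

vals <- as.matrix(d2)          # every column is numeric, so this is a numeric matrix
row  <- match(long$id, d2$id)  # source row in d2 for each output row

# choice: 1 if the answer stored in Q<question> equals this option
answer_col  <- match(paste0("Q", long$question), names(d2))
long$choice <- as.integer(vals[cbind(row, answer_col)] == long$option)

# cost: look up Q<question>_X_Opt_<option> for the same source row
cost_col  <- match(paste0("Q", long$question, "_X_Opt_", long$option), names(d2))
long$cost <- vals[cbind(row, cost_col)]

long <- long[, c("id", "question", "option", "choice", "cost")]
long
```

Indexing a matrix with a two-column (row, column) matrix picks one cell per output row, which avoids reshaping twice and filtering afterwards.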

1) First melt the input, transforming it to long form. Then break apart the variable column on underscore using read.table, giving columns named V1, V2, V3, V4 representing the question as a factor, junk, junk and the option parts, respectively. Append that back to m and set the question to the factor level of V1 and option to V4. Sort it by id to give the same ordering as in the question. (If the order does not matter this line could be omitted.)
Now put the parts together noting that choice is 1 if the appropriate column among the Q1/Q2/Q3 columns equals the option and 0 otherwise.
library(reshape2)
m <- melt(d2, id = 1:4)
m <- cbind(m, read.table(text = as.character(m$variable), sep = "_"))
m <- transform(m, question = as.numeric(factor(V1)), option = V4) # factor() is needed because read.table returns character columns in R >= 4.0
m <- m[order(m$id), ]
n <- nrow(m)
with(m, data.frame(id,
question,
option,
choice = (m[cbind(1:n, question + 1)] == option) + 0,
value))
The result is:
id question option choice value
1 1 1 1 0 0
2 1 1 2 1 75
3 1 1 3 0 150
4 1 2 1 1 0
5 1 2 2 0 150
6 1 2 3 0 75
7 1 3 1 0 0
8 1 3 2 0 100
9 1 3 3 1 150
10 2 1 1 0 0
11 2 1 2 0 200
12 2 1 3 1 300
13 2 2 1 0 0
14 2 2 2 0 200
15 2 2 3 1 300
16 2 3 1 1 0
17 2 3 2 0 500
18 2 3 3 0 300
2) This could also be expressed using magrittr, giving the same answer. Note that the last two pipes use the exposition operator %$%, which provides an implicit with(., ...) around the subsequent expression:
library(magrittr)
library(reshape2)
d2 %>%
melt(id = 1:4) %>%
cbind(read.table(text = as.character(.$variable), sep = "_")) %>%
transform(question = as.numeric(factor(V1)), option = V4) %$%
.[order(id), ] %$%
data.frame(id,
question,
option,
choice = (.[cbind(1:nrow(.), question + 1)] == option) + 0,
value)
3) This can be translated to reshape2/dplyr/tidyr:
library(reshape2)
library(dplyr)
library(tidyr)
d2 %>%
melt(id = 1:4) %>%
separate(variable, c("question", "X", "Opt", "option")) %>%
arrange(id) %>%
mutate(question = as.numeric(factor(question)),
choice = (.[cbind(1:n(), question + 1)] == option) + 0) %>%
select(id, question, option, choice, value)

Related

Counting Frequencies of Sequences

Suppose there are two students. Each student takes an exam multiple times (e.g. result_id = 1 is the first exam, result_id = 2 is the second exam, etc.). The student can either "pass" (1) or "fail" (0).
The data looks something like this:
library(data.table)
my_data = data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,2,2), results = c(0,1,0,1,0,0,1,1,1,0,1,1,0,1,0), result_id = c(1,2,3,4,5,6,1,2,3,4,5,6,7,8,9))
my_data = setDT(my_data)
id results result_id
1: 1 0 1
2: 1 1 2
3: 1 0 3
4: 1 1 4
5: 1 0 5
6: 1 0 6
7: 2 1 1
8: 2 1 2
9: 2 1 3
10: 2 0 4
11: 2 1 5
12: 2 1 6
13: 2 0 7
14: 2 1 8
15: 2 0 9
I am interested in counting the number of times that a student passes an exam, given that the student passed the previous two exams.
I tried to do this with the following code:
my_data$current_exam = shift(my_data$results, 0)
my_data$prev_exam = shift(my_data$results, 1)
my_data$prev_2_exam = shift(my_data$results, 2)
# Count the number of exam results for each record
out <- my_data[!is.na(prev_exam), .(tally = .N), by = .(id, current_exam, prev_exam, prev_2_exam)]
out = na.omit(out)
My code produces the following results:
> out
id current_exam prev_exam prev_2_exam tally
1: 1 0 1 0 2
2: 1 1 0 1 1
3: 1 0 0 1 1
4: 2 1 0 0 1
5: 2 1 1 0 2
6: 2 1 1 1 1
7: 2 0 1 1 2
8: 2 1 0 1 2
9: 2 0 1 0 1
However, I do not think that my code is correct.
For example, with Student_ID = 2 :
My code says that "Current_Exam = 1, Prev_Exam = 1, Prev_2_Exam = 0" happens 1 time, but looking at the actual data - this does not happen at all
Can someone please show me what I am doing wrong and how I can correct this?
Note: I think that this should be the expected output:
> expected_output
id current_exam prev_exam prev_2_exam tally
1: 1 0 1 0 2
2: 1 1 0 1 1
3: 1 0 0 1 1
4: 2 1 0 0 1
5: 2 1 1 0 1
6: 2 1 1 1 1
7: 2 0 1 1 2
8: 2 1 0 1 2
9: 2 0 1 0 0
Your shift() calls run over the whole table, so the first rows of student 2 pick up the last results of student 1. Shift within each id instead, so that those rows get NA.
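. <- as.data.frame(my_data) # plain data.frame, so that .[c(...)] below selects columns even if my_data is a data.table
. <- .[order(.$id, .$result_id),] #sort if needed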
.$p1 <- ave(.$results, .$id, FUN = \(x) c(NA, x[-length(x)]))
.$p2 <- ave(.$p1, .$id, FUN = \(x) c(NA, x[-length(x)]))
aggregate(list(tally=.$p1), .[c("id","results", "p1", "p2")], length)
# id results p1 p2 tally
#1 1 0 1 0 2
#2 2 0 1 0 1
#3 2 1 1 0 1
#4 1 0 0 1 1
#5 1 1 0 1 1
#6 2 1 0 1 2
#7 2 0 1 1 2
#8 2 1 1 1 1
.
# id results result_id p1 p2
#1 1 0 1 NA NA
#2 1 1 2 0 NA
#3 1 0 3 1 0
#4 1 1 4 0 1
#5 1 0 5 1 0
#6 1 0 6 0 1
#7 2 1 1 NA NA
#8 2 1 2 1 NA
#9 2 1 3 1 1
#10 2 0 4 1 1
#11 2 1 5 0 1
#12 2 1 6 1 0
#13 2 0 7 1 1
#14 2 1 8 0 1
#15 2 0 9 1 0
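For readers who want to stay in data.table, as the question does, the same per-student lag can be sketched with shift() and by = id. This is a minimal sketch rather than a definitive fix: it rebuilds the data from the question, and the column name current_exam is only a label borrowed from the question's output.

```r
library(data.table)

my_data <- data.table(
  id        = c(1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2),
  results   = c(0,1,0,1,0,0, 1,1,1,0,1,1,0,1,0),
  result_id = c(1,2,3,4,5,6, 1,2,3,4,5,6,7,8,9)
)
setorder(my_data, id, result_id)

# by = id restarts the lag for every student, so the first rows of each id get NA
my_data[, prev_exam   := shift(results, 1), by = id]
my_data[, prev_2_exam := shift(results, 2), by = id]

# Tally each (current, previous, two-back) pattern per student,
# skipping rows that lack two earlier exams
out <- my_data[!is.na(prev_2_exam),
               .(tally = .N),
               by = .(id, current_exam = results, prev_exam, prev_2_exam)]
out
```

Filtering on prev_2_exam instead of prev_exam drops exactly the rows where the full three-exam window is not available.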
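An option would be to use stats::filter() to compute a rolling sum over the current and the two previous exams; n == 3 marks three passes in a row. (The stats:: prefix avoids a clash with dplyr::filter() if dplyr is attached.)
cbind(., n=ave(.$results, .$id, FUN = \(x) stats::filter(x, c(1,1,1), sides=1)))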
# id results result_id n
#1 1 0 1 NA
#2 1 1 2 NA
#3 1 0 3 1
#4 1 1 4 2
#5 1 0 5 1
#6 1 0 6 1
#7 2 1 1 NA
#8 2 1 2 NA
#9 2 1 3 3
#10 2 0 4 2
#11 2 1 5 2
#12 2 1 6 2
#13 2 0 7 2
#14 2 1 8 2
#15 2 0 9 1
If only the number of times that a student passes an exam after passing the previous two is needed:
sum(ave(.$results, .$id, FUN = \(x) stats::filter(x, c(1,1,1))==3), na.rm=TRUE)
#[1] 1
sum(ave(.$results, .$id, FUN = \(x)
x==1 & c(x[-1], 0) == 1 & c(x[-1:-2], 0, 0) == 1))
#[1] 1
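The same count can also be read off the run lengths of the results, which generalizes to any streak length. The sketch below is an illustrative alternative, not part of the answer above; count_streak_ends and its argument k are hypothetical names introduced for the example.

```r
# Count passes that directly follow at least (k - 1) passes, i.e. positions
# that end a window of k consecutive 1s. A run of n ones contains
# max(n - k + 1, 0) such positions.
count_streak_ends <- function(x, k = 3) {
  r <- rle(x)
  ones <- r$lengths[r$values == 1]
  sum(pmax(ones - k + 1, 0))
}

results <- c(0,1,0,1,0,0, 1,1,1,0,1,1,0,1,0)
id      <- c(1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2)

# Apply per student so that runs never span two ids
per_student <- tapply(results, id, count_streak_ends)
per_student
sum(per_student)
```

Because rle() works on each student's vector separately, no NA padding or lagged helper columns are needed.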
When trying to count events that happen in series, cumsum() comes in quite handy. As opposed to creating multiple lagged variables, this scales well to counts across a larger number of events:
library(tidyverse)
d <- my_data |>
group_by(id) |> # group to cumulate within student only
mutate(
csum = cumsum(results), # cumulative sum of results
i = csum - lag(csum, 3, 0) # subtract the cumulative sum from 3 observations before; this gives the number of exams passed in the current and previous 2 observations
)
# Ungroup to get global count
d |>
ungroup() |>
count(i == 3) # Count the number of cases where the number of exams passed within 3 observations equals 3
#> # A tibble: 2 × 2
#> `i == 3` n
#> <lgl> <int>
#> 1 FALSE 14
#> 2 TRUE 1
# Retaining the group gives counts by student
d |>
count(i == 3) # Count the number of cases where the number of exams passed within 3 observations equals 3
#> # A tibble: 3 × 3
#> # Groups: id [2]
#> id `i == 3` n
#> <dbl> <lgl> <int>
#> 1 1 FALSE 6
#> 2 2 FALSE 8
#> 3 2 TRUE 1
Since you provided the data as a data.table, here is how to do the same in that ecosystem, using data.table's shift() for the lag:
my_data[ , csum := cumsum(results), by = id]
my_data[ , i := csum - shift(csum, 3, fill = 0), by = id]
my_data[ , .(n_cases = sum(i == 3)), by = id]
#> id n_cases
#> 1: 1 0
#> 2: 2 1
Here's an approach using dplyr. It uses the lag function to look back 1 and 2 results. If the sum together with the current result is 3, then the condition is met. In the example you provided, the condition is only met once.
my_data %>%
group_by(id) %>%
mutate(threex = ifelse(results + lag(results,1) + lag(results, 2) == 3, 1, 0)) %>%
filter(!is.na(threex))
id results result_id threex
<dbl> <dbl> <dbl> <dbl>
1 1 0 3 0
2 1 1 4 0
3 1 0 5 0
4 1 0 6 0
5 2 1 3 1
6 2 0 4 0
7 2 1 5 0
8 2 1 6 0
9 2 0 7 0
10 2 1 8 0
11 2 0 9 0
If you then just want to capture the cases when the condition is met, add a filter.
my_data %>%
group_by(id) %>%
mutate(threex = ifelse(results + lag(results,1) + lag(results, 2) == 3, 1, 0)) %>%
filter(threex == 1)
id results result_id threex
<dbl> <dbl> <dbl> <dbl>
1 2 1 3 1
If you are looking to understand how many times the condition is met per id, you can do this.
my_data %>%
group_by(id) %>%
mutate(threex = ifelse(results + lag(results,1) + lag(results, 2) == 3, 1, 0)) %>%
filter(threex == 1) %>%
select(id) %>%
summarize(count = n())
id count
<dbl> <int>
1 2 1

dplyr: streamline creating matching and absolute difference variables

I have a dataset of friendships and characteristics of each individual. I'm trying to create variables that indicate whether the two friends match on the binary measures and what their absolute difference is on the continuous measures.
I can do this easily, but I was wondering if there is a different way to do it that is more streamlined than my method given that I have ~60 variables to do this with.
Sample Data:
dat <- read.table(text = "id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y
1 6 0 2 4 1 3 0 2 1 0 2
2 7 0 2 4 0 5 0 2 3 1 4
3 8 1 4 4 1 2 0 4 2 1 1
4 9 0 2 3 1 2 0 3 2 0 1
5 10 1 2 4 0 4 1 4 1 0 4", header = TRUE)
Here's what I've done:
dat <- dat %>%
mutate(sex_match = case_when(male.x == male.y ~ 1,
TRUE ~ 0),
fight_match = case_when(everfight.x == everfight.y ~ 1,
TRUE ~ 0),
smoke_diff = abs(smoke.x - smoke.y),
drink_diff = abs(drink.x - drink.y),
grades_diff = abs(grades.x - grades.y))
This gives me exactly what I want:
id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y sex_match fight_match smoke_diff drink_diff grades_diff
1 6 0 2 4 1 3 0 2 1 0 2 1 0 0 3 1
2 7 0 2 4 0 5 0 2 3 1 4 1 0 0 1 1
3 8 1 4 4 1 2 0 4 2 1 1 0 1 0 2 1
4 9 0 2 3 1 2 0 3 2 0 1 1 0 1 1 1
5 10 1 2 4 0 4 1 4 1 0 4 1 1 2 3 0
However, I'm wondering if there's a way to do this with a loop or apply that identifies corresponding variables and creates the matching and absolute difference variables shown in the sample output above.
UPDATE
Ended up using most of what Jon answered with and one part of akrun, here's what worked best for me:
non_binary <- dat %>% select(., contains(".x")) %>%
select(., -id.x) %>%
select_if(~!all(. %in% 0:1)) %>%
rename_with(~str_remove(., '\\.x$')) %>%
names()
dat %>%
pivot_longer(-c(id.x:id.y),
names_to = c("var", ".value"),
names_pattern = "(.+)\\.(.+)") %>%
mutate(match = if_else(var %in% non_binary, abs(x - y), 1L * (x == y))) %>%
mutate(col_name = paste(var, ifelse(var %in% non_binary, "diff", "match"), sep = "_")) %>%
select(-c(var:y)) %>%
pivot_wider(names_from = col_name, values_from = match)
Thanks to both of you!
Here's a tidyr/dplyr approach. First I reshape to a long format with a row for each id/variable combination, with columns for each version. Then I can compare those for every pair at once, and reshape wide.
library(dplyr); library(tidyr)
non_binary <- c("smoke", "drink", "grades")
dat %>%
pivot_longer(-c(id.x:id.y),
names_to = c("var", ".value"),
names_pattern = "(.+)\\.(.+)") %>%
mutate(match = if_else(var %in% non_binary, abs(x - y), 1L * (x == y))) %>%
mutate(col_name = paste(var, ifelse(var %in% non_binary, "diff", "match"), sep = "_")) %>%
select(-c(var:y)) %>%
pivot_wider(names_from = col_name, values_from = match)
Result, which could be appended to original data:
# A tibble: 5 x 7
id.x id.y male_match smoke_diff drink_diff everfight_match grades_diff
<int> <int> <int> <int> <int> <int> <int>
1 1 6 1 0 3 0 1
2 2 7 1 0 1 0 1
3 3 8 0 0 2 1 1
4 4 9 1 1 1 0 1
5 5 10 1 2 3 1 0
We can use across() from dplyr together with stringr. Loop across the .x columns of the binary variables ('male', 'everfight'), fetch the corresponding .y column with get(), and compare the two to create the match column. Do the same for the other columns, taking the absolute difference instead. In .names, build the new column name with str_replace.
library(dplyr)
library(stringr)
dat %>%
mutate(across(c(male.x, everfight.x ),
~ +(. == get(str_replace(cur_column(), 'x$', 'y'))),
.names = "{str_replace(.col, '.x', '_match')}"),
across(c(smoke.x, drink.x, grades.x),
~
abs(. - get(str_replace(cur_column(), 'x$', 'y'))),
.names = "{str_replace(.col, '.x', '_diff')}"))
Output:
id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y male_match everfight_match smoke_diff drink_diff grades_diff
1 1 6 0 2 4 1 3 0 2 1 0 2 1 0 0 3 1
2 2 7 0 2 4 0 5 0 2 3 1 4 1 0 0 1 1
3 3 8 1 4 4 1 2 0 4 2 1 1 0 1 0 2 1
4 4 9 0 2 3 1 2 0 3 2 0 1 1 0 1 1 1
5 5 10 1 2 4 0 4 1 4 1 0 4 1 1 2 3 0
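Or do this in a single across() call. Note that with a single .names specification every new column gets the _diff suffix, including the binary match columns: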
dat %>%
mutate(across(ends_with('.x'), ~ {
other <- get(str_replace(cur_column(), 'x$', 'y'))
if(all(. %in% c(0, 1)) ) +(. == other) else abs(. - other)
}, .names = "{str_replace(.col, '.x', '_diff')}"))
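Since the question asks about a loop or apply, here is a minimal base R sketch of that idea. It is not taken from either answer: the names stems, is_binary_pair and compare_pair are hypothetical, and it assumes that a variable counts as binary when both its .x and .y columns contain only 0 and 1.

```r
dat <- read.table(text = "id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y
1 6 0 2 4 1 3 0 2 1 0 2
2 7 0 2 4 0 5 0 2 3 1 4
3 8 1 4 4 1 2 0 4 2 1 1
4 9 0 2 3 1 2 0 3 2 0 1
5 10 1 2 4 0 4 1 4 1 0 4", header = TRUE)

# Variable stems that come as a .x/.y pair, ignoring the id columns
stems <- sub("\\.x$", "", grep("\\.x$", names(dat), value = TRUE))
stems <- setdiff(stems, "id")

# Assumption: binary means both sides only ever take the values 0 and 1
is_binary_pair <- function(stem) {
  all(c(dat[[paste0(stem, ".x")]], dat[[paste0(stem, ".y")]]) %in% 0:1)
}

# Match indicator for binary pairs, absolute difference otherwise
compare_pair <- function(stem) {
  x <- dat[[paste0(stem, ".x")]]
  y <- dat[[paste0(stem, ".y")]]
  if (is_binary_pair(stem)) as.integer(x == y) else abs(x - y)
}

binary   <- vapply(stems, is_binary_pair, logical(1))
new_cols <- lapply(stems, compare_pair)
names(new_cols) <- paste0(stems, ifelse(binary, "_match", "_diff"))

res <- cbind(dat, as.data.frame(new_cols))
res
```

With around 60 variables nothing changes in the code, because the pairs are discovered from the column names rather than listed by hand.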

Subsetting data based on a value within ids in r

I'm trying to subset a dataset based on two criteria. Here is a snapshot of my data:
ids <- c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3)
seq <- c(1,2,3,4,5,6, 1,2,3,4,5,6, 1,2,3,4,5,6)
type <- c(1,1,5,1,1,1, 1,1,1,8,1,1, 1,1,1,1,1,1)
data <- data.frame(ids, seq, type)
ids seq type
1 1 1 1
2 1 2 1
3 1 3 5
4 1 4 1
5 1 5 1
6 1 6 1
7 2 1 1
8 2 2 1
9 2 3 1
10 2 4 8
11 2 5 1
12 2 6 1
13 3 1 1
14 3 2 1
15 3 3 1
16 3 4 1
17 3 5 1
18 3 6 1
ids is the student id, seq is the sequence of the questions (items) a student takes, and type is the type of the question: 1 is a simple item, 5 or 8 is a complicated item. First, I would like to generate a variable (complex) indicating whether or not the student has a complicated item (type 5 or 8). The result should be:
> data
ids seq type complex
1 1 1 1 1
2 1 2 1 1
3 1 3 5 1
4 1 4 1 1
5 1 5 1 1
6 1 6 1 1
7 2 1 1 1
8 2 2 1 1
9 2 3 1 1
10 2 4 8 1
11 2 5 1 1
12 2 6 1 1
13 3 1 1 0
14 3 2 1 0
15 3 3 1 0
16 3 4 1 0
17 3 5 1 0
18 3 6 1 0
The second step is to split data within students.
(a) For the student who has non-complex items (complex=0), I would like to split the dataset from half point and get this below:
>simple.split.1
ids seq type complex
13 3 1 1 0
14 3 2 1 0
15 3 3 1 0
>simple.split.2
ids seq type complex
16 3 4 1 0
17 3 5 1 0
18 3 6 1 0
(b) for the students who have complex items (complex=1), I would like to set the complex item as a cutting point and split the data from there. So the data should look like this (excluding complex item):
>complex.split.1
ids seq type complex
1 1 1 1 1
2 1 2 1 1
7 2 1 1 1
8 2 2 1 1
9 2 3 1 1
>complex.split.2
ids seq type complex
4 1 4 1 1
5 1 5 1 1
6 1 6 1 1
11 2 5 1 1
12 2 6 1 1
Any thoughts?
Thanks
Here's a way to do it using data.table, zoo packages and split function:
library(data.table)
library(zoo)
setDT(data) ## set data to data.table
data[, complex := ifelse(type == 5 | type == 8, 1, NA_integer_), by = ids] ## add a flag 1 where type is 5 or 8
data[, complex := na.locf(na.locf(complex, na.rm=FALSE), na.rm=FALSE, fromLast=TRUE), by = ids] ## carry forward and backward of complex flag
data[, complex := ifelse(is.na(complex), 0, complex)] ## replace na values in complex column with 0
data <- data[!(type == 5 | type == 8), ] ## removing rows where type equals 5 or 8
complex <- split(data, data$complex) ## split data based on complex flag
complex_0 <- as.data.frame(complex$`0`) ## saving as data frame based on complex flag
complex_1 <- as.data.frame(complex$`1`)
split(complex_0, cut(complex_0$seq, 2)) ## split into equal parts
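split(complex_1, cut(complex_1$seq, 2)) ## also cuts at the midpoint of seq; this matches the complex item's position only because that item sits mid-sequence in this example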
#$`(0.995,3.5]`
# ids seq type complex
#1 3 1 1 0
#2 3 2 1 0
#3 3 3 1 0
#$`(3.5,6]`
# ids seq type complex
#4 3 4 1 0
#5 3 5 1 0
#6 3 6 1 0
#$`(0.995,3.5]`
# ids seq type complex
#1 1 1 1 1
#2 1 2 1 1
#6 2 1 1 1
#7 2 2 1 1
#8 2 3 1 1
#$`(3.5,6]`
# ids seq type complex
#3 1 4 1 1
#4 1 5 1 1
#5 1 6 1 1
#9 2 5 1 1
#10 2 6 1 1
If you prefer using the tidyverse, here's an approach:
library(dplyr)
ids <- c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3)
seq <- c(1,2,3,4,5,6, 1,2,3,4,5,6, 1,2,3,4,5,6)
type <- c(1,1,5,1,1,1, 1,1,1,8,1,1, 1,1,1,1,1,1)
data <- data.frame(ids, seq, type)
step1.data <- data %>%
group_by(ids) %>%
mutate(complex = ifelse(any(type %in% c(5,8)), 1, 0)) %>%
ungroup()
simple.split.1 <- step1.data %>%
filter(complex == 0) %>%
group_by(ids) %>%
filter(seq <= mean(seq)) %>% #if you happen to have more than 6 questions in seq, this gives the midpoint
ungroup()
simple.split.2 <- step1.data %>%
filter(complex == 0) %>%
group_by(ids) %>%
filter(seq > mean(seq)) %>%
ungroup()
complex.split.1 <- step1.data %>%
filter(complex == 1) %>%
arrange(ids, seq) %>%
group_by(ids) %>%
filter(seq < min(seq[type %in% c(5,8)])) %>%
ungroup()
complex.split.2 <- step1.data %>%
filter(complex == 1) %>%
arrange(ids, seq) %>%
group_by(ids) %>%
filter(seq > min(seq[type %in% c(5,8)])) %>%
ungroup()
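The four filters above share one rule: pick a cut point per student, then keep what falls before or after it. As a hedged sketch of that rule in base R (label_student, cut_at, labelled and pieces are hypothetical names, and the sketch assumes that all complex items are dropped and that the first one is the cut point):

```r
data <- data.frame(ids  = rep(1:3, each = 6),
                   seq  = rep(1:6, times = 3),
                   type = c(1,1,5,1,1,1, 1,1,1,8,1,1, 1,1,1,1,1,1))

# Label every row of one student with complex (0/1) and half (1/2)
label_student <- function(d) {
  hit <- d$seq[d$type %in% c(5, 8)]   # positions of complex items
  d$complex <- as.integer(length(hit) > 0)
  # Cut point: first complex item if there is one, otherwise the midpoint of seq
  cut_at <- if (length(hit) > 0) min(hit) else mean(d$seq)
  d$half <- ifelse(d$seq <= cut_at, 1L, 2L)
  d[!d$type %in% c(5, 8), ]           # drop the complex items themselves
}

labelled <- do.call(rbind, lapply(split(data, data$ids), label_student))

# Named "<complex>.<half>": "0.1", "0.2", "1.1", "1.2"
pieces <- split(labelled, list(labelled$complex, labelled$half))
pieces[["1.1"]]   # complex students, before the complex item
pieces[["1.2"]]   # complex students, after the complex item
```

Keeping complex and half as columns in one table makes it easy to check the split before breaking the data into separate objects.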

Get edge list that includes alter's alters

I need a data frame that includes three columns: i, j (alter), and k (j's alter). I have an adjacency matrix (sample below). From there I can get a graph object and extract the edge list. How can I manipulate the data to get an output like the WANT data frame below?
HAVE (matrix & edgelist):
1 2 3 4 5
1 0 0 0 1 0
2 0 0 1 1 1
3 0 0 0 0 0
4 1 1 0 0 1
5 1 1 0 1 0
g <- graph_from_adjacency_matrix(mat)
get.edgelist(g)
i j
1 4
2 3
2 4
2 5
4 1
4 2
4 5
5 1
5 2
5 4
WANT (ijk edge list):
i j k
1 4 2
1 4 5
2 4 1
2 4 5
4 2 3
4 5 1
4 5 2
5 1 4
5 2 3
5 2 4
5 4 1
5 4 2
The ijk edge list should show all possible triples with ij, excluding self loops (e.g. 1 4 1).
Data:
as.matrix(read.table(text = "0 0 0 1 0
0 0 1 1 1
0 0 0 0 0
1 1 0 0 1
1 1 0 1 0",
header = F, stringsAsFactors = F)) -> m1
dimnames(m1) <- list(1:5, 1:5)
Libraries:
library(igraph)
library(dplyr)
library(tidyr)
library(magrittr)
Solution:
g1 <- graph_from_adjacency_matrix(m1)
e1 <- get.edgelist(g1) %>% as.data.frame %>% mutate_if(is.factor, as.character)
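e1 %>%
nest(k = V2) %>% # one row per source node, with its alters in list-column k
right_join(e1, ., by = c("V2" = "V1")) %>%
unnest(k, names_sep = "_") %>% # the alter's alter arrives as column k_V2
filter(V1 != k_V2) %>%
set_colnames(c("i", "j", "k"))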
Output:
#> i j k
#> 1 1 4 2
#> 2 1 4 5
#> 3 2 4 1
#> 4 2 4 5
#> 5 2 5 1
#> 6 2 5 4
#> 7 4 2 3
#> 8 4 2 5
#> 9 4 5 1
#> 10 4 5 2
#> 11 5 1 4
#> 12 5 2 3
#> 13 5 2 4
#> 14 5 4 1
#> 15 5 4 2
I was actually able to get a way to do it using igraph and dplyr:
# make graph of matrix
g <- graph_from_adjacency_matrix(mat)
# put edgelist into two objects, one where columns are "i, j" and the other "j, k"
df1 <- get.edgelist(g) %>%
as.data.frame() %>%
select(i = V1, j = V2)
df2 <- get.edgelist(g) %>%
as.data.frame() %>%
select(j = V1, k = V2)
# combine the dataframes, filter out rows where i and k are the same observation
df_combn <- inner_join(df1, df2, by = c("j" = "j")) %>%
mutate_all(as.character) %>%
filter(., !(i == k))
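The self-join idea also works without igraph or dplyr. The following base R sketch is an illustration under the assumption that the adjacency matrix is the 5 x 5 matrix from the question; the names edges and ijk are made up for the example.

```r
m1 <- matrix(c(0, 0, 0, 1, 0,
               0, 0, 1, 1, 1,
               0, 0, 0, 0, 0,
               1, 1, 0, 0, 1,
               1, 1, 0, 1, 0),
             nrow = 5, byrow = TRUE, dimnames = list(1:5, 1:5))

# Edge list straight from the matrix: row index = i, column index = j
idx   <- which(m1 == 1, arr.ind = TRUE)
edges <- data.frame(i = as.integer(idx[, "row"]),
                    j = as.integer(idx[, "col"]))

# Join the edge list to itself: the j of the first copy is the source of the second
ijk <- merge(edges, setNames(edges, c("j", "k")), by = "j")

# Drop triples that return to the start (i == k), then tidy up column and row order
ijk <- ijk[ijk$i != ijk$k, c("i", "j", "k")]
ijk <- ijk[order(ijk$i, ijk$j, ijk$k), ]
rownames(ijk) <- NULL
ijk
```

Edges whose target has no outgoing ties (here 2 to 3) disappear in the join, since there is no k to attach.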

Wide to long data conversion

I have a dataset in 'wide' format that I would like to convert to a non-standard long format. At least, that is how I would characterize this problem.
The original dataset mimics the following:
d1 <- data.frame('id' = c(1,2),
'Q1' = c(2,3),
'Q2' = c(1,3),
'Q3' = c(3,1))
d1
id Q1 Q2 Q3
1 1 2 1 3
2 2 3 3 1
In this example, there are two individuals who have answered three questions. The answer to each question takes one of the values {1,2,3}. So, in this example, individual 1 answered 2 to Q1, 1 to Q2, and 3 to Q3. I now need to convert to a 'long' format with one row for each individual, question, and possible answer:
d2 <- data.frame('id'= rep(seq(1:2),each=9),
'question' = rep(seq(1:3), each=3),
'option' = rep(seq(1:3)),
'choice' = 0)
d2
id question option choice
1 1 1 1 0
2 1 1 2 0
3 1 1 3 0
4 1 2 1 0
5 1 2 2 0
6 1 2 3 0
7 1 3 1 0
8 1 3 2 0
9 1 3 3 0
10 2 1 1 0
11 2 1 2 0
12 2 1 3 0
13 2 2 1 0
14 2 2 2 0
15 2 2 3 0
16 2 3 1 0
17 2 3 2 0
18 2 3 3 0
The part I am struggling with is how to 'merge' or 'reshape' the data from d1 into d2 so that the final outcome looks like the following, with the choice column reflecting the answers given in dataframe d1:
id question option choice
1 1 1 1 0
2 1 1 2 1
3 1 1 3 0
4 1 2 1 1
5 1 2 2 0
6 1 2 3 0
7 1 3 1 0
8 1 3 2 0
9 1 3 3 1
10 2 1 1 0
11 2 1 2 0
12 2 1 3 1
13 2 2 1 0
14 2 2 2 0
15 2 2 3 1
16 2 3 1 1
17 2 3 2 0
18 2 3 3 0
Individual 1 did not choose option 1 or 3 in question 1, but DID choose option 2, as indicated by the dummy coding in the choice column.
Any thoughts on this would be greatly appreciated.
d3 is the final output.
d1 <- data.frame('id' = c(1,2),
'Q1' = c(2,3),
'Q2' = c(1,3),
'Q3' = c(3,1))
library(dplyr)
library(tidyr)
d2 <- d1 %>%
gather(question, option, -id)
d3 <- d2 %>%
complete(id, question, option) %>%
left_join(d2, by = c("id", "question")) %>%
mutate(question = sub("Q", "", question)) %>%
mutate(option.y = ifelse(option.y == option.x, 1, 0)) %>%
rename(option = option.x, choice = option.y)
Update
Here is a more concise approach. d2 is the final output.
d2 <- d1 %>%
gather(question, option, -id) %>%
mutate(choice = 1) %>%
complete(id, question, option, fill = list("choice" = 0)) %>%
mutate(question = sub("Q", "", question))

Resources