dplyr: streamline creating matching and absolute difference variables

I have a dataset of friendships and the characteristics of each individual. I'm trying to create variables that indicate whether the two individuals match on the binary measures and what their absolute difference is on the continuous measures.
I can do this easily, but I was wondering whether there is a more streamlined way than my method, given that I have ~60 variables to do this with.
Sample Data:
dat <- read.table(text = "id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y
1 6 0 2 4 1 3 0 2 1 0 2
2 7 0 2 4 0 5 0 2 3 1 4
3 8 1 4 4 1 2 0 4 2 1 1
4 9 0 2 3 1 2 0 3 2 0 1
5 10 1 2 4 0 4 1 4 1 0 4", header = TRUE)
Here's what I've done:
dat <- dat %>%
mutate(sex_match = case_when(male.x == male.y ~ 1,
TRUE ~ 0),
fight_match = case_when(everfight.x == everfight.y ~ 1,
TRUE ~ 0),
smoke_diff = abs(smoke.x - smoke.y),
drink_diff = abs(drink.x - drink.y),
grades_diff = abs(grades.x - grades.y))
This gives me exactly what I want:
id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y sex_match fight_match smoke_diff drink_diff grades_diff
1 6 0 2 4 1 3 0 2 1 0 2 1 0 0 3 1
2 7 0 2 4 0 5 0 2 3 1 4 1 0 0 1 1
3 8 1 4 4 1 2 0 4 2 1 1 0 1 0 2 1
4 9 0 2 3 1 2 0 3 2 0 1 1 0 1 1 1
5 10 1 2 4 0 4 1 4 1 0 4 1 1 2 3 0
However, I'm wondering if there's a way to do this with a loop or apply that identifies corresponding variables and creates the matching and absolute difference variables shown in the sample output above.
UPDATE
Ended up using most of what Jon answered with and one part of akrun, here's what worked best for me:
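# requires library(stringr) for str_remove()
non_binary <- dat %>% select(ends_with(".x")) %>%
select(-id.x) %>%
select_if(~!all(. %in% 0:1)) %>%
# escape the dot and anchor at the end so only the ".x" suffix is removed
rename_with(~str_remove(., '\\.x$')) %>%
names()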
dat %>%
pivot_longer(-c(id.x:id.y),
names_to = c("var", ".value"),
names_pattern = "(.+).(.+)") %>%
mutate(match = if_else(var %in% non_binary, abs(x - y), 1L * (x == y))) %>%
mutate(col_name = paste(var, ifelse(var %in% non_binary, "diff", "match"), sep = "_")) %>%
select(-c(var:y)) %>%
pivot_wider(names_from = col_name, values_from = match)
Thanks to both of you!

Here's a tidyr/dplyr approach. First I reshape to a long format with a row for each id/variable combination, with columns for each version. Then I can compare those for every pair at once, and reshape wide.
library(dplyr); library(tidyr)
non_binary <- c("smoke", "drink", "grades")
dat %>%
pivot_longer(-c(id.x:id.y),
names_to = c("var", ".value"),
names_pattern = "(.+).(.+)") %>%
mutate(match = if_else(var %in% non_binary, abs(x - y), 1L * (x == y))) %>%
mutate(col_name = paste(var, ifelse(var %in% non_binary, "diff", "match"), sep = "_")) %>%
select(-c(var:y)) %>%
pivot_wider(names_from = col_name, values_from = match)
Result, which could be appended to original data:
# A tibble: 5 x 7
id.x id.y male_match smoke_diff drink_diff everfight_match grades_diff
<int> <int> <int> <int> <int> <int> <int>
1 1 6 1 0 3 0 1
2 2 7 1 0 1 0 1
3 3 8 0 0 2 1 1
4 4 9 1 1 1 0 1
5 5 10 1 2 3 1 0

We can use across from dplyr, together with stringr, to do this without reshaping: loop across the .x columns 'male' and 'everfight', get the values of the corresponding .y columns, and create the binary match columns. Then do the same for the other columns and take the absolute difference. In .names, the new column names are built with str_replace.
library(dplyr)
library(stringr)
dat %>%
mutate(across(c(male.x, everfight.x ),
~ +(. == get(str_replace(cur_column(), 'x$', 'y'))),
.names = "{str_replace(.col, '.x', '_match')}"),
across(c(smoke.x, drink.x, grades.x),
~
abs(. - get(str_replace(cur_column(), 'x$', 'y'))),
.names = "{str_replace(.col, '.x', '_diff')}"))
-output
id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y male_match everfight_match smoke_diff drink_diff grades_diff
1 1 6 0 2 4 1 3 0 2 1 0 2 1 0 0 3 1
2 2 7 0 2 4 0 5 0 2 3 1 4 1 0 0 1 1
3 3 8 1 4 4 1 2 0 4 2 1 1 0 1 0 2 1
4 4 9 0 2 3 1 2 0 3 2 0 1 1 0 1 1 1
5 5 10 1 2 4 0 4 1 4 1 0 4 1 1 2 3 0
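Or we may do this in a single across as well. Note that with this .names specification every new column gets the _diff suffix, including the binary match columns, and that ends_with('.x') also selects id.x: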
dat %>%
mutate(across(ends_with('.x'), ~ {
other <- get(str_replace(cur_column(), 'x$', 'y'))
if(all(. %in% c(0, 1)) ) +(. == other) else abs(. - other)
}, .names = "{str_replace(.col, '.x', '_diff')}"))
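With ~60 variable pairs, both the binary-versus-continuous decision and the output names can be derived from the data. The following is a minimal base R sketch, not either answerer's method. It assumes that every measure appears as a .x/.y pair, that id is the only pair to skip, and that a measure counts as binary when all of its values are 0 or 1.

```r
dat <- read.table(text = "id.x id.y male.x smoke.x drink.x everfight.x grades.x male.y smoke.y drink.y everfight.y grades.y
1 6 0 2 4 1 3 0 2 1 0 2
2 7 0 2 4 0 5 0 2 3 1 4
3 8 1 4 4 1 2 0 4 2 1 1
4 9 0 2 3 1 2 0 3 2 0 1
5 10 1 2 4 0 4 1 4 1 0 4", header = TRUE)

# Stems of all paired measures (assumption: "id" is the only pair to skip)
stems <- setdiff(sub("\\.x$", "", grep("\\.x$", names(dat), value = TRUE)), "id")

for (s in stems) {
  x <- dat[[paste0(s, ".x")]]
  y <- dat[[paste0(s, ".y")]]
  if (all(c(x, y) %in% c(0, 1))) {
    # binary measure: 1 if both individuals have the same value, else 0
    dat[[paste0(s, "_match")]] <- as.integer(x == y)
  } else {
    # continuous measure: absolute difference
    dat[[paste0(s, "_diff")]] <- abs(x - y)
  }
}
```

The new columns are appended to dat in the order of the .x columns, named with _match or _diff depending on the detected type.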

Related

R dplyr filter direct and indirect sequences of string

Assume I have the table below. And I want to find all id where A is directly or indirectly followed by B.
A->B is the direct sequence for id 1, while A->B is the indirect sequence for id 2, id 3, and id 4.
have <- tibble(id=c(rep(1,2),rep(2,4),rep(3,3),rep(4,4))
,sequence=c('A','B','A','D','C','B','D','A','C','B','D','A','B'))
have
# A tibble: 13 × 2
id sequence
<dbl> <chr>
1 1 A
2 1 B
3 2 A
4 2 D
5 2 C
6 2 B
7 3 D
8 3 A
9 3 C
10 4 B
11 4 D
12 4 A
13 4 B
For A->B direct sequences, I do the following. But I don't think the same logic applies to indirect sequences unless I use regular expressions on the concatenated sequence.
have %>% group_by(id) %>%
dplyr::mutate(process_seq = paste(lag(sequence), '->', sequence)) %>%
dplyr::filter(process_seq == 'A -> B')
want
# A tibble: 4 × 2
id sequence-type
<dbl> <chr>
1 1 direct
2 2 indirect
3 3 indirect
4 4 indirect
ABs = have %>%
group_by(id) %>%
mutate(rownum = row_number(),
letternum = match(sequence, LETTERS[1:26])) %>%
filter(sequence == "A" | sequence == "B") %>%
mutate(dif_row = rownum - lag(rownum),
dif_let = letternum - lag(letternum)) %>%
filter(!is.na(dif_row)) %>%
mutate(has_direct_link = max(dif_row==1),
has_indirect_link = max(dif_row >1 & dif_let == 1),
has_reverse_link = max(dif_row >1 & dif_let < 0)) %>%
select(id, starts_with("has_")) %>%
distinct()
res = have %>% left_join(ABs) %>%
mutate(has_no_link = as.integer(is.na(has_direct_link))) %>%
mutate_if(is.numeric,coalesce,0)
> res
# A tibble: 13 x 6
id sequence has_direct_link has_indirect_link has_reverse_link has_no_link
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 A 1 0 0 0
2 1 B 1 0 0 0
3 2 A 0 1 0 0
4 2 D 0 1 0 0
5 2 C 0 1 0 0
6 2 B 0 1 0 0
7 3 D 0 0 0 1
8 3 A 0 0 0 1
9 3 C 0 0 0 1
10 4 B 1 0 1 0
11 4 D 1 0 1 0
12 4 A 1 0 1 0
13 4 B 1 0 1 0
#deschen's answer above is elegant, but somewhat incomplete: I don't think the "definition" of an indirect link is correct (every direct link is also an indirect link). But my answer could probably be improved by deschen.
Here's one approach:
library(tidyverse)
have %>%
group_by(id) %>%
# default = '' avoids an NA in direct when a group ends with 'A'
mutate(direct = if_else(sequence == 'A' & lead(sequence, default = '') == 'B', 1, 0)) %>%
mutate(indirect = any(sequence == 'A') & any(sequence == 'B')) %>%
filter(any(direct == 1) | indirect == TRUE) %>%
ungroup()
which gives:
# A tibble: 10 x 4
id sequence direct indirect
<dbl> <chr> <dbl> <lgl>
1 1 A 1 TRUE
2 1 B 0 TRUE
3 2 A 0 TRUE
4 2 D 0 TRUE
5 2 C 0 TRUE
6 2 B 0 TRUE
7 4 B 0 TRUE
8 4 D 0 TRUE
9 4 A 1 TRUE
10 4 B 0 TRUE
You can of course deselect the created direct/indirect columns.
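The direct/indirect distinction can also be computed once per id from the positions of A and B. This is a minimal base R sketch under stated assumptions: "direct" means some B immediately follows an A, and "indirect" means a B occurs somewhere after an A without any direct pair. The helper name classify_seq is hypothetical.

```r
have <- data.frame(
  id = c(rep(1, 2), rep(2, 4), rep(3, 3), rep(4, 4)),
  sequence = c("A", "B", "A", "D", "C", "B", "D", "A", "C", "B", "D", "A", "B")
)

# Classify one id's sequence (hypothetical helper, assumptions as stated above)
classify_seq <- function(s) {
  a <- which(s == "A")  # positions of A
  b <- which(s == "B")  # positions of B
  if (any((a + 1) %in% b)) {
    "direct"            # some B sits right after an A
  } else if (length(a) > 0 && length(b) > 0 && min(a) < max(b)) {
    "indirect"          # some B comes later than some A
  } else {
    "none"
  }
}

type <- sapply(split(have$sequence, have$id), classify_seq)
res <- data.frame(id = as.numeric(names(type)), sequence_type = unname(type))
```

With the sample data this labels id 1 and id 4 as direct, id 2 as indirect, and id 3 as none (it contains no B), which agrees with the has_direct_link and has_no_link columns in the first answer.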

How to merge two dataset with unequal rows depending on dummy value in R

I have two R data frames with unequal numbers of rows that need to be merged based on the value of a dummy variable in a column.
x <- c(3,4,5,3,5,1,4,5)
y <- c(0,0,1,0,1,1,0,0)
df1 <- data.frame(x,y)
x y
1 3 0
2 4 0
3 5 1
4 3 0
5 5 1
6 1 1
7 4 0
8 5 0
z <- c(7,8,9)
y <- c(1,1,1)
df2 <- data.frame(z,y)
z y
1 7 1
2 8 1
3 9 1
Is it possible to merge the two such that the resulting dataframe is the following, without the use of a loop?
x y z
1 3 0 0
2 4 0 0
3 5 1 7
4 3 0 0
5 5 1 8
6 1 1 9
7 4 0 0
8 5 0 0
When the first value 1 appears in y, the value of z is set to the first value of z in df2.
You may try
library(dplyr)
library(tidyr) # for replace_na()
df1 <-df1 %>%
group_by(y) %>%
mutate(n = 1:n())
df2 <- df2 %>%
group_by(y) %>%
mutate(n = 1:n())
df1 %>%
left_join(df2, by =c("n", "y")) %>%
mutate(z = replace_na(z, 0)) %>%
select(-n)
x y z
<dbl> <dbl> <dbl>
1 3 0 0
2 4 0 0
3 5 1 7
4 3 0 0
5 5 1 8
6 1 1 9
7 4 0 0
8 5 0 0
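Because df2 only contains rows with y equal to 1, the same result can be reached by direct index assignment. A minimal base R sketch, assuming the k-th 1 in df1$y should receive the k-th value of df2$z and that any unmatched rows keep z = 0:

```r
df1 <- data.frame(x = c(3, 4, 5, 3, 5, 1, 4, 5),
                  y = c(0, 0, 1, 0, 1, 1, 0, 0))
df2 <- data.frame(z = c(7, 8, 9),
                  y = c(1, 1, 1))

ones <- which(df1$y == 1)              # rows of df1 where the dummy is 1
k <- min(length(ones), nrow(df2))      # guard against unequal counts

df1$z <- 0                             # default for rows with y == 0
df1$z[ones[seq_len(k)]] <- df2$z[seq_len(k)]
```

This avoids the join and the helper column n, at the cost of relying on row order in both data frames.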

R - Mutate column based on another column

Using R:
For the dataframe:
A<-c(3,3,3,3,1,1,2,2,2,2,2)
df<-data.frame(A)
How do you add a column such that the output is the same as:
A<-c(3,3,3,3,1,1,2,2,2,2,2)
df<-data.frame(A)
B<-c(1,1,1,0,1,0,1,1,0,0,0)
mutate(df,B)
In other words, is there a formula for column 'B' such that it looks at column 'A' and lists '1' 3 times, then puts a '0', and so on?
So - the desired output (given column 'A') is:
Thank you.
Here I assign a new group each time A changes, then within each group put a 1 in B for the first A rows and a 0 for the rest.
(If the values of A are distinct for each group, you could replace the first two lines with group_by(A), but unclear if that's a fair assumption.)
library(dplyr)
df %>%
mutate(group = cumsum(A != lag(A, default = 0))) %>%
group_by(group) %>%
mutate(B = 1 * (row_number() <= A)) %>%
ungroup()
result
# A tibble: 11 x 3
A group B
<dbl> <int> <dbl>
1 3 1 1
2 3 1 1
3 3 1 1
4 3 1 0
5 1 2 1
6 1 2 0
7 2 3 1
8 2 3 1
9 2 3 0
10 2 3 0
11 2 3 0
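The same grouping logic can be sketched in base R, without dplyr. This is only an illustration of the idea above: run marks a new group each time A changes, and pos is the row position within that group.

```r
A <- c(3, 3, 3, 3, 1, 1, 2, 2, 2, 2, 2)
df <- data.frame(A)

# New run id each time A differs from the previous value
run <- cumsum(c(TRUE, diff(df$A) != 0))

# Position of each row within its run
pos <- ave(seq_along(df$A), run, FUN = seq_along)

# 1 for the first A rows of each run, 0 afterwards
df$B <- as.integer(pos <= df$A)
```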
After grouping by 'A' and a run id from rleid, use rep to repeat 1 and 0: 1 is repeated A times, and 0 is repeated for the remaining rows of the group (the number of rows minus A).
library(dplyr)
library(data.table)
df %>%
group_by(A, grp = rleid(A)) %>%
mutate(B = rep(c(1, 0), c(first(A), n() - first(A)))) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 11 x 2
# A B
# <dbl> <dbl>
# 1 3 1
# 2 3 1
# 3 3 1
# 4 3 0
# 5 1 1
# 6 1 0
# 7 2 1
# 8 2 1
# 9 2 0
#10 2 0
#11 2 0
Or using rle from base R
# rbind() interleaves the counts so each run gets its 1s followed by its 0s
with(rle(df$A), rep(rep(c(1, 0), length(values)), c(rbind(values, lengths - values))))
#[1] 1 1 1 0 1 0 1 1 0 0 0

Wide to long format with several variables

This question is related to a previous question I asked on converting from wide to long format in R with an additional complication.
previous question is here: Wide to long data conversion
The wide data I start with looks like the following:
d2 <- data.frame('id' = c(1,2),
'Q1' = c(2,3),
'Q2' = c(1,3),
'Q3' = c(3,1),
'Q1_X_Opt_1' = c(0,0),
'Q1_X_Opt_2' = c(75,200),
'Q1_X_Opt_3' = c(150,300),
'Q2_X_Opt_1' = c(0,0),
'Q2_X_Opt_2' = c(150,200),
'Q2_X_Opt_3' = c(75,300),
'Q3_X_Opt_1' = c(0,0),
'Q3_X_Opt_2' = c(100,500),
'Q3_X_Opt_3' = c(150,300))
In this example, there are two individuals who have answered three questions. The answer to each question takes a value in {1,2,3}, encoded in Q1, Q2, and Q3. So, in this example, individual 1 chose option 2 in Q1, option 1 in Q2, and option 3 in Q3.
There is also a variable X associated with each option that I need to convert to long format. The output I am seeking looks like the following:
id question option choice cost
1 1 1 1 0 0
2 1 1 2 1 75
3 1 1 3 0 150
4 1 2 1 1 0
5 1 2 2 0 150
6 1 2 3 0 75
7 1 3 1 0 0
8 1 3 2 0 100
9 1 3 3 1 150
10 2 1 1 0 0
11 2 1 2 0 200
12 2 1 3 1 300
13 2 2 1 0 0
14 2 2 2 0 200
15 2 2 3 1 300
16 2 3 1 1 0
17 2 3 2 0 500
18 2 3 3 0 300
I have tried adapting the code from the answer to the prior question, but with no success thus far. Thanks for any suggestions or comments.
It's not exactly elegant, but here's a tidyverse version:
library(tidyverse)
d3 <- d2 %>%
gather(option, cost, -id:-Q3) %>%
gather(question, choice, Q1:Q3) %>%
separate(option, c('question2', 'option'), extra = 'merge') %>%
filter(question == question2) %>%
mutate_at(vars(question, option), parse_number) %>%
mutate(choice = as.integer(option == choice)) %>%
select(1, 5, 3, 6, 4) %>%
arrange(id)
d3
#> id question option choice cost
#> 1 1 1 1 0 0
#> 2 1 1 2 1 75
#> 3 1 1 3 0 150
#> 4 1 2 1 1 0
#> 5 1 2 2 0 150
#> 6 1 2 3 0 75
#> 7 1 3 1 0 0
#> 8 1 3 2 0 100
#> 9 1 3 3 1 150
#> 10 2 1 1 0 0
#> 11 2 1 2 0 200
#> 12 2 1 3 1 300
#> 13 2 2 1 0 0
#> 14 2 2 2 0 200
#> 15 2 2 3 1 300
#> 16 2 3 1 1 0
#> 17 2 3 2 0 500
#> 18 2 3 3 0 300
1) First melt the input, transforming it to long form. Then break apart the variable column on underscores using read.table, giving columns named V1, V2, V3, V4 that represent the question, junk, junk and the option, respectively. Append that back to m, set question to the numeric code of V1 (converted to a factor first, because read.table returns character columns by default since R 4.0.0) and set option to V4. Sort by id to give the same ordering as in the question. (If the order does not matter, this line could be omitted.)
Now put the parts together, noting that choice is 1 if the appropriate column among the Q1/Q2/Q3 columns equals the option and 0 otherwise.
library(reshape2)
m <- melt(d2, id = 1:4)
m <- cbind(m, read.table(text = as.character(m$variable), sep = "_"))
m <- transform(m, question = as.numeric(factor(V1)), option = V4)
m <- m[order(m$id), ]
n <- nrow(m)
with(m, data.frame(id,
question,
option,
choice = (m[cbind(1:n, question + 1)] == option) + 0,
value))
The result is:
id question option choice value
1 1 1 1 0 0
2 1 1 2 1 75
3 1 1 3 0 150
4 1 2 1 1 0
5 1 2 2 0 150
6 1 2 3 0 75
7 1 3 1 0 0
8 1 3 2 0 100
9 1 3 3 1 150
10 2 1 1 0 0
11 2 1 2 0 200
12 2 1 3 1 300
13 2 2 1 0 0
14 2 2 2 0 200
15 2 2 3 1 300
16 2 3 1 1 0
17 2 3 2 0 500
18 2 3 3 0 300
2) This could also be expressed using magrittr, giving the same answer. Note that the last two pipes use the exposition operator %$%, which provides an implicit with(., ...) around the subsequent expression:
library(magrittr)
library(reshape2)
d2 %>%
melt(id = 1:4) %>%
cbind(read.table(text = as.character(.$variable), sep = "_")) %>%
transform(question = as.numeric(factor(V1)), option = V4) %$%
.[order(id), ] %$%
data.frame(id,
question,
option,
choice = (.[cbind(1:nrow(.), question + 1)] == option) + 0,
value)
3) This can be translated to reshape2/dplyr/tidyr:
library(reshape2)
library(dplyr)
library(tidyr)
d2 %>%
melt(id = 1:4) %>%
separate(variable, c("question", "X", "Opt", "option")) %>%
arrange(id) %>%
mutate(question = as.numeric(factor(question)),
choice = (.[cbind(1:n(), question + 1)] == option) + 0) %>%
select(id, question, option, choice, value)
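The target layout can also be built directly: generate every id/question/option combination, then look up the chosen answer and the cost by matrix indexing. This is a minimal base R sketch, not a translation of the answers above. It assumes three questions with three options each and column names of the form Qi and Qi_X_Opt_j, as in the sample data.

```r
d2 <- data.frame(id = c(1, 2),
                 Q1 = c(2, 3), Q2 = c(1, 3), Q3 = c(3, 1),
                 Q1_X_Opt_1 = c(0, 0), Q1_X_Opt_2 = c(75, 200), Q1_X_Opt_3 = c(150, 300),
                 Q2_X_Opt_1 = c(0, 0), Q2_X_Opt_2 = c(150, 200), Q2_X_Opt_3 = c(75, 300),
                 Q3_X_Opt_1 = c(0, 0), Q3_X_Opt_2 = c(100, 500), Q3_X_Opt_3 = c(150, 300))

# All combinations; the first argument of expand.grid varies fastest
out <- expand.grid(option = 1:3, question = 1:3, id = d2$id)
out <- out[, c("id", "question", "option")]

M <- as.matrix(d2)               # all columns are numeric
row <- match(out$id, d2$id)      # source row in d2 for each output row

# Answer given to the row's question, compared with the row's option
chosen <- M[cbind(row, match(paste0("Q", out$question), colnames(M)))]
out$choice <- as.integer(chosen == out$option)

# Cost stored in the column for this question/option pair
cost_col <- paste0("Q", out$question, "_X_Opt_", out$option)
out$cost <- M[cbind(row, match(cost_col, colnames(M)))]
```

Because the combinations are generated in id, question, option order, no separate sort step is needed.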

Wide to long data conversion

I have a dataset in 'wide' format that I would like to convert to a non-standard long format. At least, that is how I would characterize this problem.
The original dataset mimics the following:
d1 <- data.frame('id' = c(1,2),
'Q1' = c(2,3),
'Q2' = c(1,3),
'Q3' = c(3,1))
d1
id Q1 Q2 Q3
1 1 2 1 3
2 2 3 3 1
In this example, there are two individuals who have answered three questions. The answer to each question takes a value in {1,2,3}. So, in this example, individual 1 answered 2 to Q1, 1 to Q2, and 3 to Q3. I now need to convert to a 'long' format that takes the following form, with one row for each individual, each question, and each possible answer:
d2 <- data.frame('id'= rep(seq(1:2),each=9),
'question' = rep(seq(1:3), each=3),
'option' = rep(seq(1:3)),
'choice' = 0)
d2
id question option choice
1 1 1 1 0
2 1 1 2 0
3 1 1 3 0
4 1 2 1 0
5 1 2 2 0
6 1 2 3 0
7 1 3 1 0
8 1 3 2 0
9 1 3 3 0
10 2 1 1 0
11 2 1 2 0
12 2 1 3 0
13 2 2 1 0
14 2 2 2 0
15 2 2 3 0
16 2 3 1 0
17 2 3 2 0
18 2 3 3 0
The part I am struggling with is how to 'merge' or 'reshape' the data from d1 into d2 so that the final outcome looks like the following, with the choice column reflecting the answers given in dataframe d1:
id question option choice
1 1 1 1 0
2 1 1 2 1
3 1 1 3 0
4 1 2 1 1
5 1 2 2 0
6 1 2 3 0
7 1 3 1 0
8 1 3 2 0
9 1 3 3 1
10 2 1 1 0
11 2 1 2 0
12 2 1 3 1
13 2 2 1 0
14 2 2 2 0
15 2 2 3 1
16 2 3 1 1
17 2 3 2 0
18 2 3 3 0
Individual 1 did not choose option 1 or 3 in question 1, but DID choose option 2, as indicated by the dummy coding in the choice column.
Any thoughts on this would be greatly appreciated.
d3 is the final output.
d1 <- data.frame('id' = c(1,2),
'Q1' = c(2,3),
'Q2' = c(1,3),
'Q3' = c(3,1))
library(dplyr)
library(tidyr)
d2 <- d1 %>%
gather(question, option, -id)
d3 <- d2 %>%
complete(id, question, option) %>%
left_join(d2, by = c("id", "question")) %>%
mutate(question = sub("Q", "", question)) %>%
mutate(option.y = ifelse(option.y == option.x, 1, 0)) %>%
rename(option = option.x, choice = option.y)
Update
Here is a more concise approach. d2 is the final output.
d2 <- d1 %>%
gather(question, option, -id) %>%
mutate(choice = 1) %>%
complete(id, question, option, fill = list("choice" = 0)) %>%
mutate(question = sub("Q", "", question))
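For comparison, the dummy coding can be filled in without tidyr by building the full grid first and then looking up each individual's answer. A minimal base R sketch, assuming three questions named Q1 to Q3 with options 1 to 3; the name long is only illustrative:

```r
d1 <- data.frame(id = c(1, 2),
                 Q1 = c(2, 3),
                 Q2 = c(1, 3),
                 Q3 = c(3, 1))

# One row per id, question and option (first argument varies fastest)
long <- expand.grid(option = 1:3, question = 1:3, id = d1$id)
long <- long[, c("id", "question", "option")]

# Answer matrix: rows are individuals, columns are questions
answers <- as.matrix(d1[, c("Q1", "Q2", "Q3")])

# choice is 1 where the row's option equals the individual's answer
given <- answers[cbind(match(long$id, d1$id), long$question)]
long$choice <- as.integer(given == long$option)
```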
