Duplicate rows based on common value in other column? - r

I have a data frame with two columns: one holds individual IDs (unique), and the other holds an action shared by several individuals. That is, everyone with the same value in the second column took the same action.
I'd like to write code in R that creates new rows pairing everyone in the first column who shares a value in the second column.
That is, given this example:
person <- c("a", "b", "c", "d", "e", "f")
action <- c("x", "x", "x", "y", "y", "y")
data.frame(person, action)
I'd want to create this:
person1 <- c("a", "a", "b", "d", "d", "e")
person2 <- c("b", "c", "c", "e", "f", "f")
data.frame(person1, person2)

A method using group_modify() and combn():
library(dplyr)
# Build the data frame from the question's vectors
df <- data.frame(person, action)
df %>%
  group_by(action) %>%
  group_modify(~ as_tibble(t(combn(pull(.x, person), 2))))
# A tibble: 6 × 3
# Groups: action [2]
action V1 V2
<chr> <chr> <chr>
1 x a b
2 x a c
3 x b c
4 y d e
5 y d f
6 y e f
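For readers unfamiliar with combn(), here is a minimal sketch of the step that runs inside group_modify() for a single group. It uses base R only, and the vector name members is made up for illustration.

```r
# Hypothetical vector standing in for the people in one action group
members <- c("a", "b", "c")

# combn() returns a matrix with one combination per column
pairs <- combn(members, 2)

# Transposing gives one pair per row, which is the shape the answer needs
pair_rows <- t(pairs)
pair_rows
```

Each group of n people yields choose(n, 2) rows, so large groups grow quadratically.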

How about this:
library(dplyr)
library(tidyr)
person<-c("a", "b", "c", "d", "e", "f")
action<-c("x", "x", "x", "y", "y", "y")
dat <- data.frame(person, action)
dat %>%
group_by(action) %>%
summarise(person = as.data.frame(t(combn(person, 2)))) %>%
unnest(person) %>%
rename(person1=V1, person2=V2)
#> `summarise()` has grouped output by 'action'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups: action [2]
#> action person1 person2
#> <chr> <chr> <chr>
#> 1 x a b
#> 2 x a c
#> 3 x b c
#> 4 y d e
#> 5 y d f
#> 6 y e f
Created on 2022-04-21 by the reprex package (v2.0.1)

Here is a one-liner in base R.
person <- c("a", "b", "c", "d", "e", "f")
action <- c("x", "x", "x", "y", "y", "y")
df <- data.frame(person, action)
setNames(
do.call(
rbind,
lapply(split(df, df$action),
function(x) as.data.frame(t(combn(x$person, 2))))),
c("person1", "person2"))
# person1 person2
# x.1 a b
# x.2 a c
# x.3 b c
# y.1 d e
# y.2 d f
# y.3 e f

Using base R
subset(merge(dat, dat, by = 'action'), person.x != person.y &
duplicated(paste(pmin(person.x, person.y), pmax(person.x, person.y))))
action person.x person.y
4 x b a
7 x c a
8 x c b
13 y e d
16 y f d
17 y f e
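A possibly simpler variant of the same self-join idea, offered as a sketch rather than a drop-in replacement: since each unordered pair appears twice in the merge, keeping only the rows where person.x sorts before person.y retains one copy of each. This assumes the person column is character (not factor) and that the IDs compare sensibly with `<`.

```r
person <- c("a", "b", "c", "d", "e", "f")
action <- c("x", "x", "x", "y", "y", "y")
dat <- data.frame(person, action, stringsAsFactors = FALSE)

# Self-join on action, then keep each unordered pair exactly once
pairs <- subset(merge(dat, dat, by = "action"), person.x < person.y)

# Rename to match the column names asked for in the question
names(pairs) <- c("action", "person1", "person2")
pairs
```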

Related

Take the letter that comes first in the alphabet (in R)

In the following dataframe df,
structure(list(Name = c("Gregory", "Jane", "Joey", "Mark", "Rachel", "Phoebe", "Liza"), code = c("xx11-9090", "1367-88uu", "117y-xxxh", "cf56-gh67", "1888-ddf5", "rf52-628u", "hj69-5kk5"), `CLASS IF5` = c("E", "C", "C", "D", "D", "A", "A"), `CLASS AIS` = c("E",
"C", "C", "D", "D", "A", "A"), `CLASS IPP` = c("C", "C", "C",
"E", "E", "B", "A"), `CLASS SJR` = c("D", "C", "C", "D", "D",
"B", "A")), row.names = c(1682L, 1683L, 1768L, 333L, 443L, 510L,
897L), class = "data.frame")
the letters denote a ranking. For example: A is the first position, B is the second and so on. The letters range between A and E. I would like to collapse the columns that begin with CLASS (i.e., the last four columns of the dataframe) in only one column keeping, for each row of the dataframe, only the letter that corresponds to the highest position in the ranking.
The desired result is:
Name code new column
1682 Gregory xx11-9090 C
1683 Jane 1367-88uu C
1768 Joey 117y-xxxh C
333 Mark cf56-gh67 D
443 Rachel 1888-ddf5 D
510 Phoebe rf52-628u A
897 Liza hj69-5kk5 A
You can use apply() to take the row-wise minimum of the CLASS columns and assign the result to a new column. For character values, min() returns the one that sorts first, which here is the best rank:
df$new_column <- apply(df[, grep("^CLASS", names(df))], 1, min, na.rm = TRUE)
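If the data frame is large, a vectorised alternative to the row-wise apply() may be worth trying. The sketch below uses pmin() across the CLASS columns. The data frame scores is a cut-down copy of the question's data (three people, two CLASS columns) made up here for illustration.

```r
# Cut-down copy of the question's data; the name `scores` is hypothetical
scores <- data.frame(
  Name = c("Gregory", "Mark", "Phoebe"),
  `CLASS IF5` = c("E", "D", "A"),
  `CLASS IPP` = c("C", "E", "B"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Positions of the columns whose names start with CLASS
class_cols <- grep("^CLASS", names(scores))

# pmin() compares the columns element by element, so no per-row loop is needed
scores$new_column <- do.call(
  pmin,
  c(unname(as.list(scores[class_cols])), na.rm = TRUE)
)
scores
```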
A possible solution in base R:
df$new_column <- apply(df, 1, \(x) sort(x[-(1:2)])[1])
df[,c(1,2,7)]
#> Name code new_column
#> 1682 Gregory xx11-9090 C
#> 1683 Jane 1367-88uu C
#> 1768 Joey 117y-xxxh C
#> 333 Mark cf56-gh67 D
#> 443 Rachel 1888-ddf5 D
#> 510 Phoebe rf52-628u A
#> 897 Liza hj69-5kk5 A
Using dplyr:
library(dplyr)
df %>%
rowwise %>%
mutate(new_column = c_across(starts_with("CLASS")) %>% sort %>% .[1]) %>%
select(Name, code, new_column) %>% ungroup
#> # A tibble: 7 × 3
#> Name code new_column
#> <chr> <chr> <chr>
#> 1 Gregory xx11-9090 C
#> 2 Jane 1367-88uu C
#> 3 Joey 117y-xxxh C
#> 4 Mark cf56-gh67 D
#> 5 Rachel 1888-ddf5 D
#> 6 Phoebe rf52-628u A
#> 7 Liza hj69-5kk5 A

How to crosstabulate two variables to classify a third categorical variable in R

I want to crosstabulate x by y to obtain in the table cells, the values of z.
library(tidyverse)
df <- tibble(x = c("a", "a", "b", "b"),
y = c("c", "d", "c", "d"),
z = c("e", "g", "f", "h"))
# I want to obtain this result:
# c d
# a e g
# b f h
Created on 2021-07-18 by the reprex package (v2.0.0)
I think you want tidyr::pivot_wider...
df %>% pivot_wider(names_from = y, values_from = z)
# A tibble: 2 x 3
x c d
<chr> <chr> <chr>
1 a e g
2 b f h
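For comparison, the same table can be built in base R as a character matrix. This is only a sketch, and it assumes each (x, y) combination occurs at most once; with repeats, later values would silently overwrite earlier ones.

```r
x <- c("a", "a", "b", "b")
y <- c("c", "d", "c", "d")
z <- c("e", "g", "f", "h")

# Empty character matrix with one row per x value and one column per y value
tab <- matrix(
  NA_character_,
  nrow = length(unique(x)),
  ncol = length(unique(y)),
  dimnames = list(unique(x), unique(y))
)

# A two-column character matrix indexes cells by row and column name
tab[cbind(x, y)] <- z
tab
```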

Filter directed co-occurrences in a table

I have co-occurrence data that can be represented in two columns. The entries in each column are from the same set of possibilities. Ultimately I am aiming to plot a directed network, but first I would like to split the table into co-occurrences that are reciprocal (i.e. both X->Y and Y->X) and those that occur in just one direction (i.e. only Y->Z). Here is an example:
library(tidyverse)
# Example data
from <- c("A", "B", "F", "Q", "T", "S", "D", "E", "A", "T", "F")
to <- c("E", "D", "Q", "S", "F", "T", "B", "A", "D", "A", "E")
df <- tibble(from, to)
df
# A tibble: 11 x 2
from to
<chr> <chr>
1 A E
2 B D
3 F Q
4 Q S
5 T F
6 S T
7 D B
8 E A
9 A D
10 T A
11 F E
and here is my desired output:
# Desired output 1 - reciprocal co-occurrences
df %>%
slice(c(1,2)) %>%
rename(item1 = from, item2 = to)
# A tibble: 2 x 2
item1 item2
<chr> <chr>
1 A E
2 B D
# Desired output 2 - single occurrences
df %>%
slice(c(3,4,5,6,9,10,11))
# A tibble: 7 x 2
from to
<chr> <chr>
1 F Q
2 Q S
3 T F
4 S T
5 A D
6 T A
7 F E
If the co-occurrences are reciprocal, it does not matter what order the entries are in; I only need their names. If they are not reciprocal, I need to know the direction.
This feels like a graph problem, so I have had a go, but I am unfamiliar with working with this type of data, and most tutorials seem to cover undirected graphs. Looking at the tidygraph package, which I understand uses the igraph package, I have tried this:
library(tidygraph)
df %>%
as_tbl_graph(directed = TRUE) %>%
activate(edges) %>%
mutate(recip_occur = edge_is_mutual()) %>%
as_tibble() %>%
filter(recip_occur == TRUE)
# A tibble: 4 x 3
from to recip_occur
<int> <int> <lgl>
1 1 8 TRUE
2 2 7 TRUE
3 7 2 TRUE
4 8 1 TRUE
However, this divorces the edges from the nodes and repeats the reciprocal co-occurrences. Does anyone have experience with this sort of data?
Try this:
data:
from <- c("A", "B", "F", "Q", "T", "S", "D", "E", "A", "T", "F")
to <- c("E", "D", "Q", "S", "F", "T", "B", "A", "D", "A", "E")
df <- tibble(from, to)
code:
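# TRUE for each row whose reverse (to -> from) also appears in df
recursive_IND <-
  1:nrow(df) %>%
  sapply(function(x) {
    any((as.character(df[x, ]) == t(df[, c(2, 1)])) %>% {.[1, ] & .[2, ]})
  })
# Reciprocal co-occurrences, keeping one row per unordered pair
df[recursive_IND, ][!(df[recursive_IND, ] %>% apply(1, sort) %>% t %>% duplicated(.)), ]
# One-directional co-occurrences
df[!recursive_IND, ]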
result:
# A tibble: 2 x 2
# from to
# <chr> <chr>
#1 A E
#2 B D
# A tibble: 7 x 2
# from to
# <chr> <chr>
#1 F Q
#2 Q S
#3 T F
#4 S T
#5 A D
#6 T A
#7 F E
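The same split can also be sketched without a per-row loop by building a text key for each edge and checking whether the reversed key occurs. This is an illustration, not the answer's method. It assumes node labels contain no spaces (so the pasted keys are unambiguous) and that there are no self-loops you want to keep. The name edges is made up here.

```r
from <- c("A", "B", "F", "Q", "T", "S", "D", "E", "A", "T", "F")
to   <- c("E", "D", "Q", "S", "F", "T", "B", "A", "D", "A", "E")
edges <- data.frame(from, to, stringsAsFactors = FALSE)

# An edge is reciprocal when its reversed key appears among the forward keys
is_recip <- paste(edges$to, edges$from) %in% paste(edges$from, edges$to)

# Keep one representative per reciprocal pair: the row where from sorts first
recip <- edges[is_recip & edges$from < edges$to, ]
names(recip) <- c("item1", "item2")

# Everything else occurs in one direction only, so direction is preserved
single <- edges[!is_recip, ]
```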

Match columns with multiple entries in a row and mutate result

I have a data frame:
col_1 <- c("A", "A", "B", "B", "C", "C")
col_2 <- c("A", "B", "C", "D", "E", "F")
col_3 <- c("A", "B", "C", "C", "B", "A")
df <- data.frame(col_1, col_2, col_3)
I want to mutate a new column that contains TRUE or FALSE depending on whether each row has two or more identical entries.
e.g.:
t_f <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
Even better, if I could have a column that contains the repeated values, e.g.:
name <- c("A", "B", "C", NA, NA, NA)
For your first requirement:
df$t_f <- apply(df, 1, function(x) any(duplicated(x)))
And your second:
df$name <- apply(df, 1, function(x) ifelse(any(duplicated(x)), x[which(duplicated(x))], NA))
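To make the two apply() calls above easier to follow, here is what happens for a single row. The vector row_values is a stand-in for one row of the data frame.

```r
# Stand-in for one row of the data frame (row 2 in the question)
row_values <- c("A", "B", "B")

# duplicated() flags every element that has already appeared earlier
dup <- duplicated(row_values)

# First requirement: does the row contain any repeated value?
has_repeat <- any(dup)

# Second requirement: the first repeated value, or NA if there is none
first_repeat <- if (has_repeat) row_values[dup][1] else NA_character_
```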
For your second requirement:
col_1 <- c("A", "A", "B", "B", "C", "C")
col_2 <- c("A", "B", "C", "D", "E", "F")
col_3 <- c("A", "B", "C", "C", "B", "A")
df <- data.frame(col_1, col_2, col_3)
df$name <- apply(df, 1,
function(row)ifelse(max(table(row))>=2,
names(table(row))[which.max(table(row))], NA))
df
#> col_1 col_2 col_3 name
#> 1 A A A A
#> 2 A B B B
#> 3 B C C C
#> 4 B D C <NA>
#> 5 C E B <NA>
#> 6 C F A <NA>
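In base R you can try the following. Note that the result is indexed by value (A to F), not by row; it lines up with the rows here only because of how the example data is arranged: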
ifelse(colSums(table(row(df), as.matrix(df)) >= 2) == 1, colnames(table(row(df), as.matrix(df))), NA)
A B C D E F
"A" "B" "C" NA NA NA
In tidyverse you can do
library(tidyverse)
df %>%
mutate_if(is.factor, as.character) %>%
rowwise() %>%
mutate(dup=anyDuplicated(c(col_1, col_2, col_3))!=0) %>%
mutate(which.dup=c(col_1, col_2, col_3)[which(duplicated(c(col_1, col_2, col_3)))[1]])
Source: local data frame [6 x 5]
Groups: <by row>
# A tibble: 6 x 5
col_1 col_2 col_3 dup which.dup
<chr> <chr> <chr> <lgl> <chr>
1 A A A TRUE A
2 A B B TRUE B
3 B C C TRUE C
4 B D C FALSE NA
5 C E B FALSE NA
6 C F A FALSE NA

Loop to Replace Matching Values

I'm looking for an easy and elegant way to accomplish this.
Suppose I have dataset x, where the relationships are A -> B -> Z -> Y and D -> H -> G. I would like to create dataset y, in which every value in to is replaced by the final element of its chain. Unfortunately, the rows are not necessarily in order:
> x <- data.frame(
+ from = as.character(c("A", "E", "B", "D", "H", "Z")),
+ to = as.character(c("B", "E", "Z", "H", "G", "Y")))
>
> y <- data.frame(
+ from = as.character(c("A", "E", "B", "D", "H", "Z")),
+ to = as.character(c("Y", "E", "Y", "G", "G", "Y")))
>
> x
from to
1 A B
2 E E
3 B Z
4 D H
5 H G
6 Z Y
> y
from to
1 A Y
2 E E
3 B Y
4 D G
5 H G
6 Z Y
I have a fairly large dataset (currently 500k rows; will grow in the future) and actually care about the performance; I'm not sure if there are any other ways to do this without a for-loop or even to vectorize/parallelize the process.
I'm thinking about splitting and removing all rows where from == to or creating an indicator to skip certain rows so the loop does not have to go through the entire dataset each time.
I'd also like to know what the breakpoint should be if I do create a loop; I'm not sure how to define when the loop should stop.
Any suggestions would be appreciated. Thanks!
We can use dplyr to create a grouping variable by comparing adjacent elements of 'to' and 'from', and then change the values in 'to' to the last element of 'to' within each group:
library(dplyr)
x %>%
group_by(grp = cumsum(lag(lead(from, default = last(from)) !=
as.character(to), default = TRUE))) %>%
mutate(to = last(to)) %>%
ungroup %>%
select(-grp)
# A tibble: 4 x 2
# from to
# <fctr> <fctr>
#1 A D
#2 B D
#3 C D
#4 E E
Another solution can be achieved using lag from dplyr and fill from tidyr as:
library(tidyverse)
x %>% arrange(from) %>%
mutate(samegroup = ifelse(from == lag(to), 1, 0)) %>%
mutate(group = ifelse(samegroup == 0 | is.na(samegroup), row_number(), NA)) %>%
fill(group) %>%
group_by(group) %>%
mutate(to = last(to)) %>%
ungroup() %>%
select(-samegroup, - group)
# A tibble: 6 x 2
# from to
# <chr> <chr>
#1 A D
#2 B D
#3 C D
#4 E E
#5 F H
#6 G H
Data used
x <- data.frame(from = as.character(c("A", "B", "F", "C", "G", "E")),
to = as.character(c("B", "C", "G", "D", "H", "E")),
stringsAsFactors = FALSE)
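On the question of when a loop should stop, one option is to repeat a vectorised lookup until a full pass changes nothing. The sketch below is an illustration under stated assumptions, not a tested solution for 500k rows: it assumes the values in from are unique and that there are no cycles other than self-loops such as E -> E. The helper name resolve_chain and the max_iter cap are made up here. It does not depend on row order.

```r
x <- data.frame(
  from = c("A", "E", "B", "D", "H", "Z"),
  to   = c("B", "E", "Z", "H", "G", "Y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: follow each 'to' along its chain until nothing changes
resolve_chain <- function(from, to, max_iter = 100L) {
  for (i in seq_len(max_iter)) {
    idx <- match(to, from)                  # row whose 'from' equals this 'to'
    nxt <- ifelse(is.na(idx), to, to[idx])  # chain ends stay as they are
    if (identical(nxt, to)) return(to)      # stop: a full pass changed nothing
    to <- nxt
  }
  stop("no fixed point reached; the data may contain a cycle")
}

x$to <- resolve_chain(x$from, x$to)
x
```

Because every pass looks up the already updated to values, each one jumps further along the chain than the last, so the number of passes should stay small even for long chains.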
