I have a data.frame containing a list of people and who they are neighbours with. However, the data suggest that Josh is a neighbour of himself, Emma, and Nick, while Emma is not listed as a neighbour of Josh.
x <- read.table(text = "
Name ID Neighbour_ID
Josh 1 1,2,3
Emma 2 4
Nick 3 1
Mark 4 5
Claire 5
", sep = " ", header = TRUE, fill = TRUE)
x
Name ID Neighbour_ID
1 Josh 1 1,2,3
2 Emma 2 4
3 Nick 3 1
4 Mark 4 5
5 Claire 5
This of course needs to be fixed, and I am looking for a way to do that. The outcome should look like this:
Name ID Neighbour_ID
1 Josh 1 2,3
2 Emma 2 1,4
3 Nick 3 1
4 Mark 4 2,5
5 Claire 5 4
Using the igraph package, convert it to a graph object:
library(dplyr)
library(tidyr)
library(igraph)
g <- separate_rows(x, Neighbour_ID, convert = TRUE) %>%
select(from = ID, to = Neighbour_ID) %>%
filter(!is.na(to) & from != to) %>%
graph_from_data_frame(directed = FALSE)
g
# IGRAPH 1eeedee UN-- 5 5 --
# + attr: name (v/c)
# + edges from 1eeedee (vertex names):
# [1] 1--2 1--3 2--4 1--3 4--5
plot(g)
I'd stop here, as we now have our data in graph format. But if you need the output as a data.frame, get the edge list and merge it back into the original data:
gEdge <- get.edgelist(g)
left_join(x %>% select(Name, ID),
data.frame(unique(rbind(gEdge[, 1:2], gEdge[, 2:1]))) %>%
mutate(X1 = as.integer(X1), X2 = as.integer(X2)) %>%
summarise(Neighbour_ID = paste(sort(X2), collapse = ","), .by = X1),
by = c("ID" = "X1"))
# Name ID Neighbour_ID
# 1 Josh 1 2,3
# 2 Emma 2 1,4
# 3 Nick 3 1
# 4 Mark 4 2,5
# 5 Claire 5 4
x %>%
separate_rows(Neighbour_ID, convert = TRUE) %>%
select(-Name) %>%
rbind(setNames(rev(.), names(.))) %>%
filter(ID != Neighbour_ID) %>%
distinct()%>%
left_join(select(x, -Neighbour_ID), c(ID = 'ID')) %>%
summarise(Neighbour_ID = toString(sort(Neighbour_ID)), .by = c(Name, ID))
# A tibble: 5 × 3
Name ID Neighbour_ID
<chr> <int> <chr>
1 Josh 1 2, 3
2 Emma 2 1, 4
3 Nick 3 1
4 Mark 4 2, 5
5 Claire 5 4
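For comparison, the same symmetrisation can be sketched in base R without igraph or tidyr. This is only an illustration under the assumption that the IDs are integers and that an empty Neighbour_ID means "no neighbours listed"; the helper names nb, pairs and both are made up here and do not come from either answer.

```r
# Sample data from the question (Claire has no listed neighbours)
x <- data.frame(
  Name = c("Josh", "Emma", "Nick", "Mark", "Claire"),
  ID = 1:5,
  Neighbour_ID = c("1,2,3", "4", "1", "5", ""),
  stringsAsFactors = FALSE
)

# One row per (from, to) pair; an empty string splits to zero neighbours
nb <- strsplit(x$Neighbour_ID, ",", fixed = TRUE)
pairs <- data.frame(from = rep(x$ID, lengths(nb)),
                    to = as.integer(unlist(nb)))

# Drop self-references, then add every pair in reverse and de-duplicate
pairs <- pairs[pairs$from != pairs$to, ]
both <- unique(rbind(pairs, data.frame(from = pairs$to, to = pairs$from)))

# Collapse the neighbours of each ID back into a sorted, comma-separated string
x$Neighbour_ID <- vapply(
  x$ID,
  function(i) paste(sort(both$to[both$from == i]), collapse = ","),
  character(1)
)
x
```

The reverse-and-deduplicate step plays the same role as treating the edges as undirected in the igraph answer.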
There should be a fairly simple solution to this but it's giving me trouble. I have a DF similar to this:
> df <- data.frame(name = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
id_num = c(1, 1, 2, 3, 3, 4, 5, 5))
> df
name id_num
1 george 1
2 george 1
3 george 2
4 sara 3
5 sara 3
6 sam 4
7 bill 5
8 bill 5
I'm looking for a way to find rows where the name and ID numbers are inconsistent in a very large dataset. I.e., George should always be "1" but in row three there is a mistake and he has also been assigned ID number "2".
I think the easiest way is to use dplyr::count() twice. For your example:
library(dplyr)
df %>%
count(name, id_num) %>%
count(name)
The first count() will give:
name id_num n
bill 5 2
george 1 2
george 2 1
sam 4 1
sara 3 2
Then the second count() will give:
name n
bill 1
george 2
sam 1
sara 1
Of course, you could also add filter(n > 1) or arrange(desc(n)) to the end of your pipe:
df %>%
count(name, id_num) %>%
count(name) %>%
arrange(desc(n)) %>%
filter(n > 1)
Use tapply() to count the distinct IDs per name, then keep the names that have more than one.
res <- with(df, tapply(id_num, list(name), \(x) length(unique(x))))
res[res > 1]
# george
# 2
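To list the offending rows themselves rather than only the names, one option is ave(), which returns a per-row value for each group. This is a minimal sketch on the question's sample data; the names n_ids and inconsistent are illustrative.

```r
df <- data.frame(
  name = c("george", "george", "george", "sara", "sara", "sam", "bill", "bill"),
  id_num = c(1, 1, 2, 3, 3, 4, 5, 5),
  stringsAsFactors = FALSE
)

# For every row, the number of distinct IDs seen for that row's name
n_ids <- ave(df$id_num, df$name, FUN = function(v) length(unique(v)))

# Keep all rows that belong to a name with more than one ID
inconsistent <- df[n_ids > 1, ]
inconsistent
```

This shows every row for an affected name, so the minority ID (here the single 2 for george) can be inspected in context.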
You probably want to correct this. A safe way is to rebuild the numeric ID's using as.factor(),
df$id_new <- as.integer(as.factor(df$name))
df
# name id_num id_new
# 1 george 1 2
# 2 george 1 2
# 3 george 2 2
# 4 sara 3 4
# 5 sara 3 4
# 6 sam 4 3
# 7 bill 5 1
# 8 bill 5 1
where numbers are assigned to the names in alphabetical order, or factor() with the levels given in order of appearance:
df$id_new2 <- as.integer(factor(df$name, levels=unique(df$name)))
df
# name id_num id_new id_new2
# 1 george 1 2 1
# 2 george 1 2 1
# 3 george 2 2 1
# 4 sara 3 4 2
# 5 sara 3 4 2
# 6 sam 4 3 3
# 7 bill 5 1 4
# 8 bill 5 1 4
Note: R >= 4.1 is needed for the \(x) anonymous-function shorthand used above.
Data:
df <- structure(list(name = c("george", "george", "george", "sara",
"sara", "sam", "bill", "bill"), id_num = c(1, 1, 2, 3, 3, 4,
5, 5)), class = "data.frame", row.names = c(NA, -8L))
I have the dataframe below
name<-c("Jack","Bob","Jack","Bill","Jack","Bob")
items<-c("car","house","ball","desk","bike","chair")
d<-data.frame(name,items)
name items
1 Jack car
2 Bob house
3 Jack ball
4 Bill desk
5 Jack bike
6 Bob chair
and I want to convert it to a dataframe in which the unique items will be summarized based on the name and a new column will be added with their count, so it will be like:
name items count
1 Jack car,ball,bike 3
2 Bob house,chair 2
3 Bill desk 1
Although I'm not convinced that a comma-separated sequence is the best way to approach further data processing, here is code that does what you want:
library(dplyr)
d %>%
group_by(name) %>%
summarize(count = n(),
items = toString(items)) %>%
ungroup()
# A tibble: 3 x 3
name count items
<chr> <int> <chr>
1 Bill 1 desk
2 Bob 2 house, chair
3 Jack 3 car, ball, bike
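If the comma-separated string is only a stepping stone to further processing, a named list keeps each item separately addressable. The following is a small base R sketch of that alternative, not part of the answer above; items_by_name and counts are illustrative names.

```r
d <- data.frame(
  name = c("Jack", "Bob", "Jack", "Bill", "Jack", "Bob"),
  items = c("car", "house", "ball", "desk", "bike", "chair"),
  stringsAsFactors = FALSE
)

# A named list with one character vector of items per name
items_by_name <- split(d$items, d$name)

# The per-name counts fall out of the list lengths
counts <- lengths(items_by_name)

items_by_name$Jack
counts
```

A string per name can still be produced at the very end, for example with vapply(items_by_name, toString, character(1)).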
A base R solution without extra packages
d <-data.frame(name = c("Jack","Bob","Jack","Bill","Jack","Bob"),
items = c("car","house","ball","desk","bike","chair"))
Get the frequencies for names, then concatenate the items separately for each name.
result <- margin.table(table(d), 1)
# tapply() pastes the items within each name; indexing by names(result) keeps the order aligned
items <- tapply(d$items, d$name, paste, collapse = ", ")
sdf <- data.frame(items = as.vector(items[names(result)]), result)
Reorder the columns:
sdf <- sdf[, c(2, 1, 3)]
sdf
#> name items Freq
#> 1 Bill desk 1
#> 2 Bob house, chair 2
#> 3 Jack car, ball, bike 3
Created on 2021-01-10 by the reprex package (v0.3.0)
d %>% group_by(name) %>%
mutate(foo = paste0(items, collapse = ",")) %>%
mutate(count_w = length(foo)) %>%
dplyr::select(-items) %>%
distinct()
# A tibble: 3 x 3
# Groups: name [3]
name foo count_w
<chr> <chr> <int>
1 Jack car,ball,bike 3
2 Bob house,chair 2
3 Bill desk 1
I have two dfs like this:
df1
name <- c("Ted","Bill","James","Randy","Mark","Jimmy","Eric","Allen")
team <- c("Hawks","Tigers","Bears","Tigers","Lions","Bears","Hawks","Lions")
df1 <- data.frame(name,team)
df2
name <- c("Ted","Bill","Mark","Jimmy","Eric","James","Allen","Randy","Bill","James","Mark")
team <- c("Hawks","Tigers","Lions","Bears","Hawks","Bears","Lions","Tigers","Tigers","Bears","Lions")
game_id <- c("21","23","28","21","21","21","29","22","22","32","42")
df2 <- data.frame(name,team,game_id)
I want to mark the game_ids in df2 with NA if the game_id does not have ALL of the names for its respective team in df1. In the sample data I provided, for example, game_id 32 in the row containing "James" and "Bears" would be one of the game_ids marked NA because "Jimmy" isn't represented for game_id 32 in df2. We know that Jimmy must be represented because he appears in a row in df1 with "Bears" indicated for his team.
My desired output for my sample data would look like this:
df3
name <- c("Ted","Bill","Mark","Jimmy","Eric","James","Allen","Randy","Bill","James","Mark")
team <- c("Hawks","Tigers","Lions","Bears","Hawks","Bears","Lions","Tigers","Tigers","Bears","Lions")
game_id <- c("21",NA,NA,"21","21","21",NA,"22","22",NA,NA)
df3 <- data.frame(name,team,game_id)
I think the solution starts by spreading df1 (after adding a unique ID column), like this:
df1$row_index <- seq.int(nrow(df1))
df1 <- spread(df1,team,name)
But I get stuck after that point. What is the best way to go about doing this?
You should be able to do this via an "anti-join" against all the correct combinations of team/name:
badgames <- df1 %>%
full_join(distinct(select(df2, game_id, team)), by="team") %>%
anti_join(df2, by=c("team", "game_id", "name")) %>%
select(game_id,team) %>%
mutate(hit = 1)
df2 %>%
left_join(badgames, by=c("game_id","team")) %>%
mutate(game_id = replace(game_id, hit==1, NA), hit = NULL)
The same logic works in data.table keyed joins, where you can specify an anti-join by putting ! in front of the joined table. You can also do the update all in the same step using := instead of creating an intermediary dataset:
library(data.table)
setDT(df1)
setDT(df2)
df2[
df1[unique(df2[, .(game_id,team)]), on=.(team)][
!df2, on=.(game_id, team, name)], on=.(game_id,team),
game_id := NA
]
Both resulting in:
# name team game_id
#1 Ted Hawks 21
#2 Bill Tigers <NA>
#3 Mark Lions <NA>
#4 Jimmy Bears 21
#5 Eric Hawks 21
#6 James Bears 21
#7 Allen Lions <NA>
#8 Randy Tigers 22
#9 Bill Tigers 22
#10 James Bears <NA>
#11 Mark Lions <NA>
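The anti-join idea can also be sketched in base R, which may help to see what each step contributes. This is an illustration only: the helpers key2 and key3 and the objects games, expected and absent are made-up names, and the "|" separator assumes that no team, game_id or name contains that character.

```r
df1 <- data.frame(
  name = c("Ted", "Bill", "James", "Randy", "Mark", "Jimmy", "Eric", "Allen"),
  team = c("Hawks", "Tigers", "Bears", "Tigers", "Lions", "Bears", "Hawks", "Lions"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("Ted", "Bill", "Mark", "Jimmy", "Eric", "James", "Allen", "Randy",
           "Bill", "James", "Mark"),
  team = c("Hawks", "Tigers", "Lions", "Bears", "Hawks", "Bears", "Lions",
           "Tigers", "Tigers", "Bears", "Lions"),
  game_id = c("21", "23", "28", "21", "21", "21", "29", "22", "22", "32", "42"),
  stringsAsFactors = FALSE
)

# Every (team, game) that occurs, crossed with that team's full roster
games <- unique(df2[c("team", "game_id")])
expected <- merge(df1, games, by = "team")

# Anti-join: expected appearances that are absent from df2
key3 <- function(d) paste(d$team, d$game_id, d$name, sep = "|")
absent <- expected[!key3(expected) %in% key3(df2), ]

# Blank out game_id for every row of an incomplete (team, game) combination
key2 <- function(d) paste(d$team, d$game_id, sep = "|")
df3 <- df2
df3$game_id[key2(df2) %in% key2(absent)] <- NA
df3
```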
Here's another way using counts. We're comparing the number of players on each team in df1 to the number of players at each game for each team in df2. This could be tripped up if df1 was an incomplete list of players, e.g. if the Lions had two players in df1 and two totally different players played for them in a game in df2, but if I understand the setting that shouldn't be the case.
library(tidyverse)
df1 <- tibble(
name = c("Ted","Bill","James","Randy","Mark","Jimmy","Eric","Allen"),
team = c("Hawks","Tigers","Bears","Tigers","Lions","Bears","Hawks","Lions")
)
df2 <- tibble(
name = c("Ted","Bill","Mark","Jimmy","Eric","James","Allen","Randy","Bill","James","Mark"),
team = c("Hawks","Tigers","Lions","Bears","Hawks","Bears","Lions","Tigers","Tigers","Bears","Lions"),
game_id = c("21","23","28","21","21","21","29","22","22","32","42")
)
df2 %>%
add_count(team, game_id) %>%
left_join(add_count(df1, team), by = c("name", "team")) %>%
mutate(game_id = ifelse(n.x == n.y, game_id, NA)) %>%
select(name:game_id)
#> # A tibble: 11 x 3
#> name team game_id
#> <chr> <chr> <chr>
#> 1 Ted Hawks 21
#> 2 Bill Tigers <NA>
#> 3 Mark Lions <NA>
#> 4 Jimmy Bears 21
#> 5 Eric Hawks 21
#> 6 James Bears 21
#> 7 Allen Lions <NA>
#> 8 Randy Tigers 22
#> 9 Bill Tigers 22
#> 10 James Bears <NA>
#> 11 Mark Lions <NA>
Created on 2018-04-10 by the reprex package (v0.2.0).
Using sqldf you can skip the annoying NA replacements.
library(dplyr)
library(sqldf)
dfx <- inner_join(count(df2,game_id,team),count(df1,team))
sqldf("SELECT name, team, dfx.game_id from df2 natural left join dfx")
# or finish the dplyr chain with:
# %>% right_join(df2) %>% mutate(game_id = `is.na<-`(game_id,is.na(n))) %>% select(-n)
# name team game_id
# 1 Ted Hawks 21
# 2 Bill Tigers <NA>
# 3 Mark Lions <NA>
# 4 Jimmy Bears 21
# 5 Eric Hawks 21
# 6 James Bears 21
# 7 Allen Lions <NA>
# 8 Randy Tigers 22
# 9 Bill Tigers 22
# 10 James Bears <NA>
# 11 Mark Lions <NA>
data.table has this feature as well:
setDT(df1)
setDT(df2)
dfx <- df2[,.N, by=c("team","game_id")][df1[,.N, by=team],on=c("team","N")]
dfx[df2,.(name,team,game_id=x.game_id),on=c("team","game_id")]
# name team game_id
# 1: Ted Hawks 21
# 2: Bill Tigers NA
# 3: Mark Lions NA
# 4: Jimmy Bears 21
# 5: Eric Hawks 21
# 6: James Bears 21
# 7: Allen Lions NA
# 8: Randy Tigers 22
# 9: Bill Tigers 22
# 10: James Bears NA
# 11: Mark Lions NA
And the base version for completeness. Notice that merge() accepts tables directly, without converting them to data.frame first:
dfx <- merge(table(df2[-1]), table(df1[-1], dnn = names(df1[-1])))
df3 <- merge(df2, dfx, all.x = TRUE)
# the counts from table() land in a column named Freq; rows without a match get NA there
is.na(df3$game_id) <- is.na(df3$Freq)
df3 <- df3[-4]
# team game_id name
# 1 Bears 21 Jimmy
# 2 Bears 21 James
# 3 Bears <NA> James
# 4 Hawks 21 Ted
# 5 Hawks 21 Eric
# 6 Lions <NA> Mark
# 7 Lions <NA> Allen
# 8 Lions <NA> Mark
# 9 Tigers 22 Randy
# 10 Tigers 22 Bill
# 11 Tigers <NA> Bill
Here is one method:
how_many_players <- aggregate(name ~ team, data = df1, function(x) length(unique(x)))
names(how_many_players)[2] <- "total_players"
num_played <- aggregate(name ~ game_id + team, data = df2, function(x) length(unique(x)))
names(num_played)[3] <- "num_played"
check <- merge(how_many_players, num_played)
full_games <- check[check$total_players == check$num_played, "game_id"]
df3 <- df2
df3$game_id[!df3$game_id %in% full_games] <- NA
df3
# name team game_id
# 1 Ted Hawks 21
# 2 Bill Tigers <NA>
# 3 Mark Lions <NA>
# 4 Jimmy Bears 21
# 5 Eric Hawks 21
# 6 James Bears 21
# 7 Allen Lions <NA>
# 8 Randy Tigers 22
# 9 Bill Tigers 22
# 10 James Bears <NA>
# 11 Mark Lions <NA>
The other solutions count the number of players, which may not catch the scenario where the same number of players, but a different set of players, is playing.
Hence, if you want to be specific about which players are playing, you may want to concatenate all the player names in sorted order and compare the resulting strings.
name <- c("Ted","Bill","James","Randy","Mark","Jimmy","Eric","Allen")
team <- c("Hawks","Tigers","Bears","Tigers","Lions","Bears","Hawks","Lions")
df1 <- data.frame(name,team)
name <- c("Ted","Bill","Mark","Jimmy","Eric","James","Allen","Randy","Bill","James","Mark")
team <- c("Hawks","Tigers","Lions","Bears","Hawks","Bears","Lions","Tigers","Tigers","Bears","Lions")
game_id <- c("21","23","28","21","21","21","29","22","22","32","42")
# Note the game_id needs to be a string, otherwise the NAs may be improperly captured
df2 <- data.frame(name,team,game_id, stringsAsFactors = FALSE)
# Concatenate all players names by group in df1
df1.all.members <- df1 %>%
group_by(team) %>%
arrange(name) %>%
summarise(all_players = paste0(name, collapse = "_"))
# Perform the same concatenation in df2
df2.all.members <- df2 %>%
group_by(team, game_id) %>%
arrange(name) %>%
mutate(all_players2 = paste0(name, collapse = "_")) %>%
# Left join with the new df1
left_join(df1.all.members, by = "team") %>%
ungroup %>%
# Compare if all names are the same
mutate(game_id = ifelse(all_players2 == all_players, game_id, NA)) %>%
# Select required fields
select(name, team, game_id)
# # A tibble: 11 x 3
# name team game_id
# <chr> <chr> <chr>
# 1 Allen Lions <NA>
# 2 Bill Tigers <NA>
# 3 Bill Tigers 22
# 4 Eric Hawks 21
# 5 James Bears 21
# 6 James Bears <NA>
# 7 Jimmy Bears 21
# 8 Mark Lions <NA>
# 9 Mark Lions <NA>
# 10 Randy Tigers 22
# 11 Ted Hawks 21
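The same exact-roster check can be sketched in base R with setequal(), which compares the sets directly instead of relying on a separator character. To show the case that plain counting would miss, the sketch appends a hypothetical game "50" in which the Lions field Mark plus a made-up player "Sub"; those two rows, and the names df2x, roster and complete, are assumptions for illustration and not part of the question's data.

```r
df1 <- data.frame(
  name = c("Ted", "Bill", "James", "Randy", "Mark", "Jimmy", "Eric", "Allen"),
  team = c("Hawks", "Tigers", "Bears", "Tigers", "Lions", "Bears", "Hawks", "Lions"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("Ted", "Bill", "Mark", "Jimmy", "Eric", "James", "Allen", "Randy",
           "Bill", "James", "Mark"),
  team = c("Hawks", "Tigers", "Lions", "Bears", "Hawks", "Bears", "Lions",
           "Tigers", "Tigers", "Bears", "Lions"),
  game_id = c("21", "23", "28", "21", "21", "21", "29", "22", "22", "32", "42"),
  stringsAsFactors = FALSE
)

# Hypothetical extra game: right head count for the Lions, wrong players
df2x <- rbind(df2, data.frame(name = c("Mark", "Sub"), team = "Lions",
                              game_id = "50", stringsAsFactors = FALSE))

# Full roster per team, as a named list of character vectors
roster <- split(df1$name, df1$team)

# For each row: do the players listed for that (team, game) equal the roster?
complete <- mapply(
  function(tm, gid) {
    setequal(df2x$name[df2x$team == tm & df2x$game_id == gid], roster[[tm]])
  },
  df2x$team, df2x$game_id
)

df3 <- df2x
df3$game_id[!complete] <- NA
df3
```

The two rows of the hypothetical game "50" end up as NA even though two Lions players are listed, which is the situation a count-based comparison would accept.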
I am currently running a randomization where individuals of a given population are sampled and placed into groups of defined size. The result is a data frame seen below:
Ind Group
Sally 1
Bob 1
Sue 1
Joe 2
Jeff 2
Jess 2
Mary 2
Jim 3
James 3
Is there a function which will allow me to expand the data set to show every possible within group pairing? (Desired output below). The pairings do not need to be reciprocal.
Group Ind1 Ind2
1 Sally Bob
1 Sally Sue
1 Sue Bob
2 Joe Jeff
2 Joe Jess
2 Joe Mary
2 Jeff Jess
2 Jess Mary
2 Jeff Mary
3 Jim James
I feel like there must be a way to do this in dplyr, but for the life of me I can't seem to sort it out.
An alternative dplyr & tidyr approach: The pipeline is a little longer, but the wrangling feels more straightforward to me. Start with combining all records in each group together. Next, pool and alphabetize all the names together to be able to eliminate the reciprocal/duplicates. Then finally separate the results back apart again.
left_join(dt, dt, by = "Group") %>%
filter(Ind.x != Ind.y) %>%
rowwise %>%
mutate(name = toString(sort(c(Ind.x,Ind.y)))) %>%
select(Group, name) %>%
distinct %>%
separate(name, into = c("Ind1", "Ind2")) %>%
arrange(Group, Ind1, Ind2)
Start off with a weak cross join of all records in each group.
Filter out the self joins.
Collect up all the names in each row, sort them, and set them down together in the name column.
Now that the names are alphabetized, remove the alphabetized reciprocals.
Pull the data apart back into separate columns.
# A tibble: 10 x 3
Group Ind1 Ind2
* <int> <chr> <chr>
1 1 Bob Sally
2 1 Bob Sue
3 1 Sally Sue
4 2 Jeff Jess
5 2 Jeff Joe
6 2 Jeff Mary
7 2 Jess Joe
8 2 Jess Mary
9 2 Joe Mary
10 3 James Jim
Here is an option using data.table: convert to data.table (setDT(dt)), do a cross join (CJ) grouped by 'Group', and remove the duplicated pairs.
library(data.table)
setDT(dt)[, CJ(Ind1 = Ind, Ind2 = Ind, unique = TRUE)[Ind1 != Ind2],
Group][!duplicated(data.table(pmax(Ind1, Ind2), pmin(Ind1, Ind2)))]
# Group Ind1 Ind2
#1: 1 Bob Sally
#2: 1 Bob Sue
#3: 1 Sally Sue
#4: 2 Jeff Jess
#5: 2 Jeff Joe
#6: 2 Jeff Mary
#7: 2 Jess Joe
#8: 2 Jess Mary
#9: 2 Joe Mary
#10: 3 James Jim
Or using combn by 'Group'
setDT(dt)[, {temp <- combn(Ind, 2); .(Ind1 = temp[1,], Ind2 = temp[2,])}, Group]
A solution using dplyr. We can use group_by and do to apply the combn function to each group and combine the results to form a data frame.
library(dplyr)
dt2 <- dt %>%
group_by(Group) %>%
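# as.data.frame() also names the columns V1 and V2; as_data_frame() is deprecated in current tibble
do(as.data.frame(t(combn(.$Ind, m = 2)), stringsAsFactors = FALSE)) %>%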
ungroup() %>%
setNames(sub("V", "Ind", colnames(.)))
dt2
# # A tibble: 10 x 3
# Group Ind1 Ind2
# <int> <chr> <chr>
# 1 1 Sally Bob
# 2 1 Sally Sue
# 3 1 Bob Sue
# 4 2 Joe Jeff
# 5 2 Joe Jess
# 6 2 Joe Mary
# 7 2 Jeff Jess
# 8 2 Jeff Mary
# 9 2 Jess Mary
# 10 3 Jim James
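For reference, the combn()-per-group idea also works in base R alone. This is a minimal sketch on the question's data; it assumes every group has at least two members, since combn() raises an error when asked for pairs from a single element. The name pairs is illustrative.

```r
dt <- data.frame(
  Ind = c("Sally", "Bob", "Sue", "Joe", "Jeff", "Jess", "Mary", "Jim", "James"),
  Group = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  stringsAsFactors = FALSE
)

# Split by group, build all unordered pairs within each group, then stack them
pairs <- do.call(rbind, lapply(split(dt, dt$Group), function(g) {
  p <- t(combn(g$Ind, 2))  # one row per unordered pair
  data.frame(Group = g$Group[1], Ind1 = p[, 1], Ind2 = p[, 2],
             stringsAsFactors = FALSE)
}))
rownames(pairs) <- NULL
pairs

# A group of n members contributes choose(n, 2) pairs
sum(choose(table(dt$Group), 2))
```

The last line gives the expected number of rows (3 + 6 + 1), which is a quick way to check that no pairing was dropped or duplicated.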
DATA
dt <- read.table(text = "Ind Group
Sally 1
Bob 1
Sue 1
Joe 2
Jeff 2
Jess 2
Mary 2
Jim 3
James 3",
header = TRUE, stringsAsFactors = FALSE)