How to join and bind similar dataframes in R, iterative solutions welcome

In R markdown through R Studio (R v. 4.0.3), I'm looking for a better solution to combining similarly structured dataframes while keeping all rows and matching entries on a key. Piping full_join() into a filter into a bind_rows() directly wasn't working, possibly because of the error message:
Error: Can't combine `..1$term_code` <character> and `..2$term_code` <integer>.
I have 23 dataframes (let's call these "semester data") that I'm looking to combine into a single dataframe (intended to be a single dataset of individuals' outcomes from semester to semester).
Each semester dataframe is roughly 3000-4000 observations (individuals) with 45-47 variables of relevant data. A simplified example of a semester (or term) dataframe is shown below.
Simplified example of a "semester" dataframe:
id    ACT_math  course_code  section_code  term_code  grade  term_GPA
0001  23        101          001           FA12       3.45   3.8
0002  28        201          003           FA12       3.2    3.4
Individuals will show up in multiple semester dataframes as they progress through the program (taking course 101 in the fall and course 102 in the spring).
I want to use the dplyr full_join() to match these individuals on an ID key.
Using the suffix argument, I hope to keep track of which semester and course a set of data (grade, term_GPA, etc) for an individual comes from.
There's some data (ACT score, gender, state residency, etc.) that is stable for an individual across semester dataframes. Ideally I could take the first input and drop the rest, but if I had to clean this afterwards, that's fine.
I started by defining an object programmatic_database using the first semester of data, SP11. To cut down on the duplication of stable data for an individual, I selected only the relevant columns that I wanted to join.
programmatic_database <- programmatic_database %>%
full_join(select(fa12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".sp11", ".fa12"), keep = FALSE, name = "id")
However, every semester new students join the program. I would like to add these entries to the bottom of the growing programmatic_database.
I'm also looking to use rbind() or bind_rows() to add these individuals to the bottom of the programmatic_database, along with their relevant data.
After full_join(), I'm filtering out the entries that have already been added horizontally to the dataframe, then piping the remaining entries into bind_rows()
programmatic_database <- fa12[!which(fa12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, fa12)
Concatenated example of what my code is producing after several iterations:
id    ACT_math  course_code  section_code  section_code.db  section_code.db.db  term_code  grade.sp11  grade.fa12  grade.sp13  grade.sp15  term_GPA.sp11  term_GPA.fa12  term_GPA.sp15
0001  23        102          001           001              001                 FA12       3.45        3.8         3.0         -           3.8            3.7            -
0002  28        201          003           003              003                 FA12       3.2         3.4         3.0         -           3.8            3.7            -
1020  28        201          003           003              003                 FA12       3.2         3.4         -           -           3.8            3.7            -
6783  30        101          -             -                -                   SP15       -           -           -           3.8         -              -              4.0
where I have successfully added horizontally for students 0001 and 0002 for outcomes in subsequent courses in subsequent semesters. I have also managed to add vertically, like with student 6783, leaving blanks for previous semesters before they enrolled but still adding the relevant columns.
Questions:
Is there a way to pipe full_join() into a filter() into a bind_rows() without running into these errors?
rbind number of columns do not match
OR
Error: Can't combine `..1$term_code` <character> and `..2$term_code` <integer>.
Is there an easy way to keep certain columns and only add the suffix ".fa12" to certain columns? As you can see, the .db is piling up.
Is there any way to automate this? Loops aren't my strong suit, but I'm sure there's better-looking code than doing each of the 23 joins/binds by hand.
Thank you for assistance!
Current code for simplicity:
#reproducible example
fa11 <- structure(list(id = c("1001", "1002", "1003",
"1013"), act6_05_composite = c(33L, 26L, 27L, 25L), course_code = c("101",
"101", "101", "101"), term_code = c("FA11", "FA11", "FA11", "FA11"
), section_code = c(1L, 1L, 1L, 1L), grade = c(4, 0, 0, 2.5
), repeat_status_flag = c(NA, "PR", NA, NA), class_code = c(1L,
1L, 1L, 1L), cum_atmpt_credits_prior = c(16, 0, 0, 0), cum_completed_credits_prior = c(0L,
0L, 0L, 0L), cum_passed_credits_prior = c(16, 0, 0, 0), cum_gpa_prior = c(0,
0, 0, 0), cum_atmpt_credits_during = c(29, 15, 18, 15), cum_completed_credits_during = c(13L,
1L, 10L, 15L), cum_passed_credits_during = c(29, 1, 14, 15),
term_gpa = c(3.9615, 0.2333, 2.3214, 2.9666), row.names = c(NA, 4L
), class = "data.frame")
sp12 <- structure(list(id = c("1007", "1013", "1355",
"2779", "2302"), act6_05_composite = c(24L, 26L, 25L, 24L,
24L), course_code = c(101L, 102L, 101L, 101L, 101L
), term_code = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_), section_code = c(1L, 1L, 1L, 1L, 1L), grade = c(2,
2.5, 2, 1.5, 3.5), repeat_status_flag = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), class_code = c(2L, 2L, 1L, 2L, 2L), cum_atmpt_credits_prior = c(44,
43, 12, 43, 30), cum_completed_credits_prior = c(41L, 43L,
12L, 43L, 12L), cum_passed_credits_prior = c(41, 43, 12,
43, 30), cum_gpa_prior = c(3.3125, 3.186, 3.5416, 3.1785,
3.8636), cum_atmpt_credits_during = c(56, 59, 25, 64, 43),
cum_completed_credits_during = c(53L, 56L, 25L, 56L, 25L),
cum_passed_credits_during = c(53, 59, 25, 64, 43), term_gpa = c(2.8333,
3.423, 3.1153, 2.1923, 3.6153), row.names = c(NA,
5L), class = "data.frame")
# make object from fall 2011 semester dataframe
programmatic_database <- fa11
# join the spring 2012 semester dataframe by id using select variables and attaching relevant suffix
programmatic_database <- programmatic_database %>%
full_join(select(sp12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".fa11", ".sp12"), keep = FALSE, name = "id")
#view results of join, force integer type on certain variables as needed (see error above)
#filter the joined entries from the spring 2012 dataframe, then bind the remaining entries to the bottom of the growing dataset
programmatic_database <- sp12[!which(sp12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, sp12)

It would be possible to use bind_rows here if you make the column types consistent between tables. For instance, you could make a function to re-type any particular columns that aren't consistent in your original data. (That might also be something you could fix upstream as you read it in.)
library(dplyr)
set_column_types <- function(df) {
df %>%
mutate(term_code = as.character(term_code),
course_code = as.character(course_code))
}
bind_rows(
fa11 %>% set_column_types(),
sp12 %>% set_column_types() %>% mutate(term_code = "SP12")
)
This will stack your data into a relatively "long" format, like below. You may want to then reshape it depending on what kind of subsequent calculations you want to do.
id act6_05_composite course_code term_code section_code grade repeat_status_flag class_code cum_atmpt_credits_prior cum_completed_credits_prior cum_passed_credits_prior cum_gpa_prior cum_atmpt_credits_during cum_completed_credits_during cum_passed_credits_during term_gpa
1 1001 33 101 FA11 1 4.0 <NA> 1 16 0 16 0.0000 29 13 29 3.9615
2 1002 26 101 FA11 1 0.0 PR 1 0 0 0 0.0000 15 1 1 0.2333
3 1003 27 101 FA11 1 0.0 <NA> 1 0 0 0 0.0000 18 10 14 2.3214
4 1013 25 101 FA11 1 2.5 <NA> 1 0 0 0 0.0000 15 15 15 2.9666
5 1007 24 101 SP12 1 2.0 <NA> 2 44 41 41 3.3125 56 53 53 2.8333
6 1013 26 102 SP12 1 2.5 <NA> 2 43 43 43 3.1860 59 56 59 3.4230
7 1355 25 101 SP12 1 2.0 <NA> 1 12 12 12 3.5416 25 25 25 3.1153
8 2779 24 101 SP12 1 1.5 <NA> 2 43 43 43 3.1785 64 56 64 2.1923
9 2302 24 101 SP12 1 3.5 <NA> 2 30 12 30 3.8636 43 25 43 3.6153
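For the automation part of the question, one possible sketch is to keep the semester dataframes in a named list, retype each one, and stack them in a single call. The list names (FA11, SP12), the trimmed-down example columns, and the final pivot_wider() step are assumptions for illustration; pivot_wider() comes from tidyr, which the answer above does not use.

```r
library(dplyr)
library(tidyr)

# Trimmed-down stand-ins for two semester dataframes (illustrative values only)
fa11 <- data.frame(id = c("1001", "1013"),
                   course_code = c("101", "101"),
                   term_code = c("FA11", "FA11"),
                   grade = c(4, 2.5),
                   stringsAsFactors = FALSE)
sp12 <- data.frame(id = c("1013", "1355"),
                   course_code = c(102L, 101L),
                   term_code = c(NA_integer_, NA_integer_),
                   grade = c(2.5, 2),
                   stringsAsFactors = FALSE)

# Same helper as in the answer: force consistent column types
set_column_types <- function(df) {
  df %>%
    mutate(term_code = as.character(term_code),
           course_code = as.character(course_code))
}

# A named list lets bind_rows() record which semester each row came from
semesters <- list(FA11 = fa11, SP12 = sp12)

long <- semesters %>%
  lapply(set_column_types) %>%
  bind_rows(.id = "term") %>%
  mutate(term_code = term)   # fill in missing term codes from the list names

# Optional: one row per student, with the term appended to selected columns only
wide <- long %>%
  pivot_wider(id_cols = id,
              names_from = term,
              values_from = c(course_code, grade))
```

Students who first appear in a later semester simply show up as new rows in long (and as rows with NA for earlier terms in wide), so no separate filter-and-bind step should be needed. With 23 semesters, only the semesters list has to grow.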

Related

Display only rows values in which the difference in between a column is below 30

I keep trying, unsuccessfully, to filter data from an Excel file so that only the rows where consecutive values in a column differ by less than 30 units are kept. For example, in the following table:
Name    age  height  speed
Helen   12   1.20    40
Alan    14   1.40    75
Hector  15   1.25    80
Ana     11   1.02    81
Sophie  16   1.40    50
When the difference in column speed is below 30 within consecutive rows it should give as a result:
Name    age  height  speed
Alan    14   1.40    75
Hector  15   1.25    80
Ana     11   1.02    81
Thank you!!!
If your data is like this:
x = structure(list(Name = structure(c(4L, 1L, 3L, 2L, 5L), .Label = c("Alan",
"Ana", "Hector", "Helen", "Sophie"), class = "factor"), age = c(12,
14, 15, 11, 16), height = c(1.2, 1.4, 1.25, 1.02, 1.4), speed = c(40L,
75L, 80L, 81L, 50L)), class = "data.frame", row.names = c(NA,
-5L))
Hope I got the numbers right:
Name age height speed
1 Helen 12 1.20 40
2 Alan 14 1.40 75
3 Hector 15 1.25 80
4 Ana 11 1.02 81
5 Sophie 16 1.40 50
Then do:
x[diff(x$speed)<30,]
Name age height speed
2 Alan 14 1.40 75
3 Hector 15 1.25 80
4 Ana 11 1.02 81
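Note that diff() returns one value fewer than there are rows, so the logical index above is recycled, and a large drop (such as 81 to 50) also counts as "below 30" because it is negative. A slightly more explicit sketch, assuming the intent is to keep every row whose speed is within 30 units of the row before or after it, could look like this (the variable names close_to_next and keep are made up for the example):

```r
x <- data.frame(Name = c("Helen", "Alan", "Hector", "Ana", "Sophie"),
                age = c(12, 14, 15, 11, 16),
                height = c(1.2, 1.4, 1.25, 1.02, 1.4),
                speed = c(40L, 75L, 80L, 81L, 50L))

# TRUE where row i and row i + 1 differ by less than 30 (in either direction)
close_to_next <- abs(diff(x$speed)) < 30

# Keep a row if it is close to its previous or its next neighbour
keep <- c(FALSE, close_to_next) | c(close_to_next, FALSE)

result <- x[keep, ]
result
```

On the example data this keeps Alan, Hector and Ana, the same rows as the one-liner.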
Next time you post here, it is useful to include some toy data, like below:
rm(list=ls())
#### Toy data ###
dfnames <- c("Name", "age", "height", "speed")
size <- 20 # number of rows
name <- LETTERS[1:size]
age <- sample(20:26, size, replace = TRUE)
height <- sample(160:180, size, replace = TRUE)
speed <- sample(0:60, size, replace = TRUE)
df <- cbind.data.frame(name, age, height, speed)
Solution:
# 1:nrow(df)-1 parses as (1:nrow(df)) - 1 and would start at 0, so use seq_len() instead
for (i in seq_len(nrow(df) - 1)) {
  df[i, "test"] <- (df[i + 1, "speed"] - df[i, "speed"]) < 30
}
df[nrow(df), "test"] <- "last_row" # this coerces the test column to character
df <- df[df[, "test"] != "FALSE", ]

Merging two Dataframes in R by ID, One is the subset of the other

I have 2 dataframes in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 dataframes are matched by a primary key ('pid'). dfnew is a subset of dfold, so all variables in dfnew are also in dfold but with updated, imputed values (no NAs anymore). At the same time dfold has more variables, and I will need them in the analysis phase. I would like to merge the 2 dataframes into dfmerge so as to update the common variables from dfnew --> dfold while retaining the pre-existing variables in dfold. I have tried merge(), match(), dplyr, and the sqldf package, but I either obtain a dfmerge with the updated 75 variables only (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I found (here) is an elegant but fairly long (10 rows) loop that eliminates the *.x variables after a merge by pid with the all.x = TRUE option. Could you please advise on a more efficient way to obtain this result, if one is available?
Thank you in advance
P.S: To make things easier, I have created a minimal version of dfold and dfnew: dfnew has now 3 variables, no NAs, while dfold has 5 variables, NAs included. Here it is the dataframes structure
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although here the issue is limited to just 2 variables, please keep in mind that the real scenario will involve 75 variables.
Alright, this solution assumes that you don't really need a merge but only want to update NA values within your dfold with imputed values in dfnew.
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
To do this for a single column, try
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[dfnew$pid == dfold$pid], dfold$Salary)
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse:
dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[dfnew$pid == dfold$pid], x), dfold[,cols], dfnew[,cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, use
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")
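The comparison dfnew$pid == dfold$pid relies on both dataframes having the same rows in the same order. In the question's minimal data dfnew has only 3 of the 5 pids, so a lookup with match() may be safer. The following is a sketch under that assumption; it overwrites the common columns with the dfnew values wherever a pid is found and leaves everything else in dfold untouched (idx and has_update are names made up for the example):

```r
dfold <- data.frame(Country = c("France", "Spain", "Germany", "Spain", "Germany"),
                    Age = c(44L, 27L, 30L, 38L, 40L),
                    Salary = c(72000L, 48000L, 54000L, 61000L, NA),
                    Purchased = c("No", "Yes", "No", "No", "Yes"),
                    pid = 1:5)
dfnew <- data.frame(Age = c(44, 27, 30),
                    Salary = c(72000, 48000, 54000),
                    pid = c(1, 2, 3))

# Columns present in both dataframes, excluding the key
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")

# For each row of dfold, the matching row number in dfnew (NA if the pid is absent)
idx <- match(dfold$pid, dfnew$pid)
has_update <- !is.na(idx)

# Overwrite the common columns only where dfnew has a row for that pid
for (col in cols) {
  dfold[[col]][has_update] <- dfnew[[col]][idx[has_update]]
}
```

dfold keeps all of its original columns, and rows without a counterpart in dfnew (pids 4 and 5 here) are left as they were.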

Subset multiple observations by ID then select first observation by time [duplicate]

I have a large dataset of observations, with several observations in rows and several different variables for each ID.
e.g.
Data
ID V1 V2 V3 time
1 35 100 5.2 2015-07-03 07:49
2 25 111 6.2 2015-04-01 11:52
3 41 120 NA 2015-04-01 14:17
1 25 NA NA 2015-07-03 07:51
2 NA 122 6.2 2015-04-01 11:50
3 40 110 4.1 2015-04-01 14:25
I would like to extract the earliest (first) observation for each variable independently based on the time column, for each unique ID. i.e. I would like to combine multiple rows of the same ID together so that I have one row of the first observation for each variable (time variable will not be equal for all).
The min() function will return the earliest time for a set of observations, but the problem is I need to do this for each variable. To do this I have tried using the tapply function with minimum time
tapply(Data, ID, min(time)
but get an error saying
"Error in match.fun(FUN) :
'min(Data$time)' is not a function, character or symbol.
I suspect that there is also a problem because many of the rows of observations have missing data.
Alternatively I have tried to just do each variable one at a time using aggregate, and select the min(time) this way:
firstV1 <-aggregate(V1[min(time)]~ID, data=Data, na.rm=T)
From the example dataset, what I would like to see is:
Data
ID V1 V2 V3
1 35 100 5.2
2 25 122 6.2
3 41 120 4.1
Note the '25' for ID2 V1 was from the later observation because the first observation was missing. Same for ID3 V3.
Input data
structure(list(ID = c(1L, 2L, 3L, 1L, 2L, 3L), V1 = c(35L, 25L,
41L, 25L, NA, 40L), V2 = c(100L, 111L, 120L, NA, 122L, 110L),
V3 = c(5.2, 6.2, 4.2, NA, 6.2, 4.1), time = structure(c(1435906140,
1427885520, 1427894220, 1435906260, 1427885400, 1427894700
), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("ID",
"V1", "V2", "V3", "time"), row.names = c(NA, -6L), class = "data.frame")
This should do what you need.
library(data.table)
Data <- rbind(cbind(1,35,100,5.2,"2015-07-03 07:49"),
cbind(2,25,111,6.2,"2015-04-01 11:52"),
cbind(3,41,120,4.2,"2015-04-01 14:17"),
cbind(1,25,NA,NA,"2015-07-03 07:51"),
cbind(2,NA,122,6.2,"2015-04-01 11:50"),
cbind(3,40,110,4.1,"2015-04-01 14:25"))
colnames(Data) <- c("ID","V1","V2","V3","time")
Data <- data.table(Data)
class(Data[,time])
Data[,time:=as.POSIXct(time)]
minTime.Data <- Data[,lapply(.SD, function(x) x[time==min(time)]),by=ID]
minTime.Data
The outcome will be
ID V1 V2 V3 time
1: 1 35 100 5.2 2015-07-03 07:49:00
2: 2 NA 122 6.2 2015-04-01 11:50:00
3: 3 41 120 4.2 2015-04-01 14:17:00
Let me know if this is what you were looking for, because there is a little ambiguity in your question.
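If the goal is instead the reading in the question's expected output (the earliest non-missing value of each variable per ID, so ID 2 gets V1 = 25 from its later row), a sketch with dplyr could look like the following. It assumes dplyr 1.0.0 or later for across(), uses the values from the question's printed table, and sets the time zone to UTC purely for the example:

```r
library(dplyr)

Data <- data.frame(
  ID = c(1, 2, 3, 1, 2, 3),
  V1 = c(35, 25, 41, 25, NA, 40),
  V2 = c(100, 111, 120, NA, 122, 110),
  V3 = c(5.2, 6.2, NA, NA, 6.2, 4.1),
  time = as.POSIXct(c("2015-07-03 07:49", "2015-04-01 11:52", "2015-04-01 14:17",
                      "2015-07-03 07:51", "2015-04-01 11:50", "2015-04-01 14:25"),
                    tz = "UTC")
)

first_obs <- Data %>%
  arrange(ID, time) %>%          # earliest observation first within each ID
  group_by(ID) %>%
  # first non-missing value of each variable; NA if a variable is never observed
  summarise(across(c(V1, V2, V3), ~ .x[!is.na(.x)][1]), .groups = "drop")

first_obs
```

This gives one row per ID, with each variable taken independently from the earliest row in which it is not missing.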

R dplyr: min max function not working in mutate

I have an issue with dplyr I cannot resolve. Also I do not have a full workable example, since the problem only occurs with the full set of data (that I cannot share with you).
I do the following:
t %>% group_by(id, add=TRUE) %>%
summarise(minbplevel = min(ref, na.rm=T)
,maxbplevel = max(ref, na.rm=T)
) %>% filter(id %in% c(caseA,caseB))
Which results in
id minbplevel maxbplevel
(dbl) (dbl) (dbl)
1 B 33.0 73.0
2 A 39.4 80.4
But when I do
t %>% group_by(id, add=TRUE) %>%
mutate(minbplevel = min(ref, na.rm=T)
,maxbplevel = max(ref, na.rm=T)
) %>% filter(id %in% c(caseA,caseB))
It results in:
id Level refparmax refparmin ref meanbptest minbplevel maxbplevel
(dbl) (chr) (int) (int) (dbl) (dbl) (dbl) (dbl)
1 B 0SD 69 68 49.0 52.00000 33 73
2 B min1SD 69 68 41.0 52.00000 33 73
3 B min2SD 69 68 33.0 52.00000 33 73
4 B plus1SD 69 68 59.0 52.00000 33 73
5 B plus2SD 69 68 73.0 52.00000 33 73
6 A 0SD 100 95 56.4 35.33333 NA NA
7 A min1SD 100 95 47.4 35.33333 NA NA
8 A min2SD 100 95 39.4 35.33333 NA NA
9 A plus1SD 100 95 67.4 35.33333 NA NA
10 A plus2SD 100 95 80.4 35.33333 NA NA
Why the NA's in case A are produced, I have no clue. It seems that each time I try it on a subset of the data, the second case with data is the problem, but that is just a hunch.
It is only one case of the 18850 that gives this issue, but there is nothing identifiable that makes the problem case different than the rest.
Please advise what I can try to solve this.
I can think of workarounds, creating the summarized data and then merging the result with the original data. But I thought that dplyr would allow me to do this in one step.
I tried removing or adding the add = TRUE option. That does not make any difference.
Maybe I am using this in the wrong way.
Based on comment I tried:
subset(with(t,aggregate(ref~id, t, FUN= min, na.rm=TRUE, na.action= na.pass)),id %in% c(caseA,caseB))
Which results in
id ref
4 B 33.0
5 A 39.4
I have to mask some parts of the data.
dput(head(subset(t,id %in% c(caseA,caseB)) , 12))
gives:
Again I replaced the actual id's with variables caseB and caseA. Also this is not the full dataset in which the problem occurs.
structure(list(id = c(caseB, caseB, caseB, caseB, caseB,
caseA, caseA, caseA, caseA, caseA), Level = c("0SD", "min1SD",
"min2SD", "plus1SD", "plus2SD", "0SD", "min1SD", "min2SD", "plus1SD",
"plus2SD"), refparmax = c(69L, 69L, 69L, 69L, 69L, 100L, 100L,
100L, 100L, 100L), refparmin = c(68L, 68L, 68L, 68L, 68L, 95L,
95L, 95L, 95L, 95L), ref = c(49, 41, 33, 59, 73, 56.4, 47.4,
39.4, 67.4, 80.4), meanbptest = c(52, 52, 52, 52, 52, 35.3333333333333,
35.3333333333333, 35.3333333333333, 35.3333333333333, 35.3333333333333
)), .Names = c("id", "Level", "refparmax", "refparmin", "ref",
"meanbptest"), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), vars = list(id), drop = TRUE, indices = list(
0:4, 5:9), group_sizes = c(5L, 5L), biggest_group_size = 5L, labels = structure(list(
id = c(caseB, caseA)), class = "data.frame", row.names = c(NA,
-2L), vars = list(id), drop = TRUE, .Names = "id"))
If I replace all NAs in the ref column with zeros, the mutate step works fine. As aosmith suggested, it probably has something to do with the mutate and NA issue that is fixed in the development version of dplyr.
I cannot test this suggestion due to workstation restrictions though. So I will work around the issue, with the NA replacement step, and process the zero values after the summary steps.
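The other workaround mentioned in the question (summarise first, then merge the result back onto the original rows) avoids replacing NAs with zeros. A minimal sketch, using a made-up dataframe bp with only the id and ref columns and assuming dplyr 1.0.0 or later for the .groups argument:

```r
library(dplyr)

# Hypothetical stand-in for the real data; only id and ref matter here
bp <- data.frame(id = rep(c("B", "A"), each = 5),
                 ref = c(49, 41, 33, 59, 73, 56.4, 47.4, 39.4, 67.4, 80.4))

# Step 1: one row per id with the group minimum and maximum
limits <- bp %>%
  group_by(id) %>%
  summarise(minbplevel = min(ref, na.rm = TRUE),
            maxbplevel = max(ref, na.rm = TRUE),
            .groups = "drop")

# Step 2: attach the per-group values to every original row
result <- left_join(bp, limits, by = "id")
```

This should give the same columns as the grouped mutate() while relying only on summarise(), which behaved correctly in the question.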

Identifying Duplicate/Unique Teams (and Restructuring Data) in R

I have a data set that looks like this:
Person Team
1 30
2 30
3 30
4 30
11 40
22 40
1 50
2 50
3 50
4 50
15 60
16 60
17 60
1 70
2 70
3 70
4 70
11 80
22 80
My overall goal is to organize the team identification codes so that it is easy to see which teams are duplicates of one another and which teams are unique. I want to summarize the data so that it looks like this:
Team Duplicate1 Duplicate2
30 50 70
40 80
60
As you can see, teams 30, 50, and 70 have identical members, so they share a row. Similarly, teams 40 and 80 have identical members, so they share a row. Only team 60 (in this example) is unique.
In situations where teams are duplicated, I don't care which team id goes in which column. Also, there may be more than 2 duplicates of a team. Teams range in size from 2 members to 8 members.
This answer gives the output data format you asked for. I left the duplicate teams in a single variable because I think it's a better way to handle an arbitrary number of duplicates.
library(dplyr)
df %>%
  arrange(Team, Person) %>% # this line is necessary in case the rest of your data isn't sorted
  group_by(Team) %>%
  summarize(players = paste0(Person, collapse = ",")) %>%
  group_by(players) %>%
  summarize(teams = paste0(Team, collapse = ",")) %>%
  mutate(
    # sub() is vectorized, so the split point is found separately for each row
    original_team = sub(",.*", "", teams),
    dup_teams = ifelse(grepl(",", teams), sub("^[^,]*,", "", teams), NA)
  )
The result:
Source: local data frame [3 x 4]
players teams original_team dup_teams
1 1,2,3,4 30,50,70 30 50,70
2 11,22 40,80 40 80
3 15,16,17 60 60 NA
Not exactly the format you're wanting, but pretty useful:
# using MrFlick's data
library(dplyr)
dd %>% group_by(Team) %>%
arrange(Person) %>%
summarize(team.char = paste(Person, collapse = "_")) %>%
group_by(team.char) %>%
arrange(team.char, Team) %>%
mutate(duplicate = 1:n())
Source: local data frame [6 x 3]
Groups: team.char
Team team.char duplicate
1 40 11_22 1
2 80 11_22 2
3 60 15_16_17 1
4 30 1_2_3_4 1
5 50 1_2_3_4 2
6 70 1_2_3_4 3
(Edited in the arrange(Person) line in case the data isn't already sorted, got the idea from @Reed's answer.)
Using this for your sample data
dd<-structure(list(Person = c(1L, 2L, 3L, 4L, 11L, 22L, 1L, 2L, 3L,
4L, 15L, 16L, 17L, 1L, 2L, 3L, 4L, 11L, 22L), Team = c(30L, 30L,
30L, 30L, 40L, 40L, 50L, 50L, 50L, 50L, 60L, 60L, 60L, 70L, 70L,
70L, 70L, 80L, 80L)), .Names = c("Person", "Team"),
class = "data.frame", row.names = c(NA, -19L))
You could try a table()/interaction() to find duplicate groups. For example
tt <- with(dd, table(Team, Person))
grp <- do.call("interaction", c(data.frame(unclass(tt)), drop=TRUE))
split(rownames(tt), grp)
this returns
$`1.1.1.1.0.0.0.0.0`
[1] "30" "50" "70"
$`0.0.0.0.0.1.1.1.0`
[1] "60"
$`0.0.0.0.1.0.0.0.1`
[1] "40" "80"
so the group "names" are really just indicators for membership for each person. You could easily rename them if you like with setNames(). But here it collapses the appropriate teams.
Two more base R options (though not exactly the desired output):
DF2 <- aggregate(Person ~ Team, DF, toString)
> split(DF2$Team, DF2$Person)
$`1, 2, 3, 4`
[1] 30 50 70
$`11, 22`
[1] 40 80
$`15, 16, 17`
[1] 60
Or
( DF2$DupeGroup <- as.integer(factor(DF2$Person)) )
Team Person DupeGroup
1 30 1, 2, 3, 4 1
2 40 11, 22 2
3 50 1, 2, 3, 4 1
4 60 15, 16, 17 3
5 70 1, 2, 3, 4 1
6 80 11, 22 2
Note that the expected output as shown in the question would require adding NAs or empty strings to some of the column entries, because in a data.frame all columns must have the same number of rows. That is different for lists, as you can see in some of the answers.
The second option, but using data.table, since aggregate tends to be slow for large data:
library(data.table)
setDT(DF)[, toString(Person), by=Team][,DupeGroup := .GRP, by=V1][]
Team V1 DupeGroup
1: 30 1, 2, 3, 4 1
2: 40 11, 22 2
3: 50 1, 2, 3, 4 1
4: 60 15, 16, 17 3
5: 70 1, 2, 3, 4 1
6: 80 11, 22 2
Using uniquecombs from the mgcv package:
library(mgcv)
library(magrittr) # for the pipe %>%
# Using MrFlick's data
team_names <- sort(unique(dd$Team))
unique_teams <- with(dd, table(Team, Person)) %>% uniquecombs %>% attr("index")
printout <- unstack(data.frame(team_names, unique_teams))
> printout
$`1`
[1] 60
$`2`
[1] 40 80
$`3`
[1] 30 50 70
Now you could use something like this answer to print it in tabular form (note that the groups are column-wise, not row-wise as in your question):
attributes(printout) <- list(names = names(printout)
, row.names = 1:max(sapply(printout, length))
, class = "data.frame")
> printout
1 2 3
1 60 40 30
2 <NA> 80 50
3 <NA> <NA> 70
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
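The warning arises because the attributes are forced onto a list whose elements have different lengths. If the row-wise layout from the question is wanted as a regular data frame, one option is to pad each group with NA explicitly. This is a base R sketch; the helper names members, groups and width are made up for the example, and the row order of the result depends on how the membership strings sort:

```r
dd <- data.frame(
  Person = c(1L, 2L, 3L, 4L, 11L, 22L, 1L, 2L, 3L, 4L,
             15L, 16L, 17L, 1L, 2L, 3L, 4L, 11L, 22L),
  Team = c(30L, 30L, 30L, 30L, 40L, 40L, 50L, 50L, 50L, 50L,
           60L, 60L, 60L, 70L, 70L, 70L, 70L, 80L, 80L)
)

# One membership string per team, e.g. "1,2,3,4" for team 30
members <- sapply(split(dd$Person, dd$Team),
                  function(p) paste(sort(p), collapse = ","))

# Teams with identical membership strings end up in the same group
groups <- unname(split(as.integer(names(members)), members))

# Pad every group with NA so that all rows have the same width
width <- max(lengths(groups))
wide <- as.data.frame(do.call(rbind, lapply(groups, function(g) {
  c(g, rep(NA, width - length(g)))
})))
names(wide) <- c("Team", paste0("Duplicate", seq_len(width - 1)))

wide
```

Each row holds one set of identical teams, with NA filling the unused Duplicate columns.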

Resources