Percent change for grouped subjects at multiple timepoints R - r

id timepoint dv.a
1 baseline 100
1 1min 105
1 2min 90
2 baseline 70
2 1min 100
2 2min 80
3 baseline 80
3 1min 80
3 2min 90
I have repeated measures data for a given subject in long format as above. I'm looking to calculate percent change relative to baseline for each subject.
id timepoint dv pct.chg
1 baseline 100 100
1 1min 105 105
1 2min 90 90
2 baseline 70 100
2 1min 100 143
2 2min 80 114
3 baseline 80 100
3 1min 80 100
3 2min 90 113

df <- expand.grid( time=c("baseline","1","2"), id=1:4)
df$dv <- sample(100,12)
df %>% group_by(id) %>%
mutate(perc=dv*100/dv[time=="baseline"]) %>%
ungroup()
You're wanting to do something for each 'id' group, so that's the group_by, then you need to create a new column, so there's a mutate. That new variable is the old dv, scaled by the value that dv takes at the baseline - hence the inner part of the mutate. And finally it's to remove the grouping you'd applied.

Try creating a helper column, group and arrange on that. Then use the window function first in your mutate function:
df %>% mutate(clean_timepoint = str_remove(timepoint,"min") %>% if_else(. == "baseline", "0", .) %>% as.numeric()) %>%
group_by(id) %>%
arrange(id,clean_timepoint) %>%
mutate(pct.chg = (dv / first(dv)) * 100) %>%
select(-clean_timepoint)

in Base Ryou can do this
for(i in 1:(NROW(df)/3)){
df[1+3*(i-1),4] <- 100
df[2+3*(i-1),4] <- df[2+3*(i-1),3]/df[1+3*(i-1),3]*100
df[3+3*(i-1),4] <- df[3+3*(i-1),3]/df[1+3*(i-1),3]*100
}
colnames(df)[4] <- "pct.chg"
output:
> df
id timepoint dv.a pct.chg
1 1 baseline 100 100.0000
2 1 1min 105 105.0000
3 1 2min 90 90.0000
4 2 baseline 70 100.0000
5 2 1min 100 142.8571
6 2 2min 80 114.2857
7 3 baseline 80 100.0000
8 3 1min 80 100.0000
9 3 2min 90 112.5000

Base R solution: (assuming "baseline" always appears as first record per group)
data.frame(do.call("rbind", lapply(split(df, df$id),
function(x){x$pct.change <- x$dv/x$dv[1]; return(x)})), row.names = NULL)
Data:
df <- structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
timepoint = c(
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min"
),
dv = c(100L, 105L, 90L, 70L, 100L, 80L, 80L, 80L, 90L)
),
class = "data.frame",
row.names = c(NA,-9L)
)

Related

How to join and bind similar dataframes in R, iterative solutions welcome

In R markdown through R Studio (R v. 4.0.3), I'm looking for a better solution to combining similarly structured dataframes while keeping all rows and matching entries on a key. Piping full_join() into a filter into a bind_rows() directly wasn't working, possibly because of the error message:
Error: Can't combine ..1$term_code 'character> and ..2$term_code '<integer.
I have 23 dataframes (let's call these "semester data") of data I'm looking to combine into a single dataframe (intended to be a single dataset of individuals outcomes from semester-to-semester).
Each semester dataframe is roughly 3000-4000 observations (individuals) with 45-47 variables of relevant data. A simplified example of a semester (or term) dataframe is shown below.
Simplified example of a "semester" dataframe:
id
ACT_math
course_code
section_code
term_code
grade
term_GPA
0001
23
101
001
FA12
3.45
3.8
0002
28
201
003
FA12
3.2
3.4
Individuals will show up in multiple semester dataframes as they progress through the program (taking course 101 in the fall and course 102 in the spring).
I want to use the dplyr full_join() to match these individuals on an ID key.
Using the suffix argument, I hope to keep track of which semester and course a set of data (grade, term_GPA, etc) for an individual comes from.
There's some data (ACT score, gender, state residency, etc) that is the stable for an individual across semester dataframes. Ideally I could take the first input and drop the rest, but if I had to clean this afterwards, that's fine.
I started by defining an object programatic_database using the first semester of data SP11. To cut down on the duplication of stable data for an individual, I selected the relevant columns that I wanted to join.
programmatic_database <- programmatic_database %>%
full_join(select(fa12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".sp11", ".fa12"), keep = FALSE, name = "id")
However, every semester new students join the program. I would like to add these entries to the bottom of the growing programmatic_database.
I'm also looking to use rbind() or bind_rows() to add these individuals to the bottom of the programmatic_database, along with their relevant data.
After full_join(), I'm filtering out the entries that have already been added horizontally to the dataframe, then piping the remaining entries into bind_rows()
programmatic_database <- fa12[!which(fa12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, fa12)
Concatenated example of what my code is producing after several iterations:
id
ACT_math
course_code
section_code
section_code.db
section_code.db.db
term_code
grade.sp11
grade.fa12
grade.sp13
grade.sp15
term_GPA.sp11
term_GPA.fa12
term_GPA.sp15
0001
23
102
001
001
001
FA12
3.45
3.8
3.0
-
3.8
3.7
-
0002
28
201
003
003
003
FA12
3.2
3.4
3.0
-
3.8
3.7
-
1020
28
201
003
003
003
FA12
3.2
3.4
-
-
3.8
3.7
-
6783
30
101
-
-
-
SP15
-
-
-
3.8
-
-
4.0
where I have successfully added horizontally for students 0001 and 0002 for outcomes in subsequent courses in subsequent semesters. I have also managed to add vertically, like with student 6783, leaving blanks for previous semesters before they enrolled but still adding the relevant columns.
Questions:
Is there a way to pipe full_join() into a filter() into a bind_rows() without running into these errors?
rbind number of columns do not match
OR
Error: Can't combine ..1$term_code 'character> and ..2$term_code '<integer.
Is there a easy way to keep certain columns and only add the suffix ".fa12" to certain columns? As you can see, the .db is piling up.
Is there any way to automate this? Loops aren't my strong suit, but I'm sure there's a better-looking code than doing each of the 23 joins/binds by hand.
Thank you for assistance!
Current code for simplicity:
#reproducible example
fa11 <- structure(list(id = c("1001", "1002", "1003",
"1013"), act6_05_composite = c(33L, 26L, 27L, 25L), course_code = c("101",
"101", "101", "101"), term_code = c("FA11", "FA11", "FA11", "FA11"
), section_code = c(1L, 1L, 1L, 1L), grade = c(4, 0, 0, 2.5
), repeat_status_flag = c(NA, "PR", NA, NA), class_code = c(1L,
1L, 1L, 1L), cum_atmpt_credits_prior = c(16, 0, 0, 0), cum_completed_credits_prior = c(0L,
0L, 0L, 0L), cum_passed_credits_prior = c(16, 0, 0, 0), cum_gpa_prior = c(0,
0, 0, 0), cum_atmpt_credits_during = c(29, 15, 18, 15), cum_completed_credits_during = c(13L,
1L, 10L, 15L), cum_passed_credits_during = c(29, 1, 14, 15),
term_gpa = c(3.9615, 0.2333, 2.3214, 2.9666), row.names = c(NA, 4L
), class = "data.frame")
sp12 <- structure(list(id = c("1007", "1013", "1355",
"2779", "2302"), act6_05_composite = c(24L, 26L, 25L, 24L,
24L), course_code = c(101L, 102L, 101L, 101L, 101L
), term_code = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_), section_code = c(1L, 1L, 1L, 1L, 1L), grade = c(2,
2.5, 2, 1.5, 3.5), repeat_status_flag = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), class_code = c(2L, 2L, 1L, 2L, 2L), cum_atmpt_credits_prior = c(44,
43, 12, 43, 30), cum_completed_credits_prior = c(41L, 43L,
12L, 43L, 12L), cum_passed_credits_prior = c(41, 43, 12,
43, 30), cum_gpa_prior = c(3.3125, 3.186, 3.5416, 3.1785,
3.8636), cum_atmpt_credits_during = c(56, 59, 25, 64, 43),
cum_completed_credits_during = c(53L, 56L, 25L, 56L, 25L),
cum_passed_credits_during = c(53, 59, 25, 64, 43), term_gpa = c(2.8333,
3.423, 3.1153, 2.1923, 3.6153), row.names = c(NA,
5L), class = "data.frame")
# make object from fall 2011 semester dataframe
programmatic_database <- fa11
# join the spring 2012 semester dataframe by id using select variables and attaching relevant suffix
programmatic_database <- programmatic_database %>%
full_join(select(sp12, id, course_code, section_code, grade, term_gpa), by = "id", copy = TRUE, suffix = c(".fa11", ".sp12"), keep = FALSE, name = "id")
#view results of join, force integer type on certain variables as needed (see error above)
#filter the joined entries from fall 2012 database, then bind the remaining entries to the bottom of the growing dataset
programmatic_database <- sp12[!which(sp12$id %in% programmatic_database),] %>% dplyr::bind_rows(programmatic_database, sp12)
It would be possible to use bind_rows here if you make the column types consistent between tables. For instance, you could make a function to re-type any particular columns that aren't consistent in your original data. (That might also be something you could fix upstream as you read it in.)
library(dplyr)
set_column_types <- function(df) {
df %>%
mutate(term_code = as.character(term_code),
course_code = as.character(course_code))
}
bind_rows(
fa11 %>% set_column_types(),
sp12 %>% set_column_types() %>% mutate(term_code = "SP12")
)
This will stack your data into a relatively "long" format, like below. You may want to then reshape it depending on what kind of subsequent calculations you want to do.
id act6_05_composite course_code term_code section_code grade repeat_status_flag class_code cum_atmpt_credits_prior cum_completed_credits_prior cum_passed_credits_prior cum_gpa_prior cum_atmpt_credits_during cum_completed_credits_during cum_passed_credits_during term_gpa
1 1001 33 101 FA11 1 4.0 <NA> 1 16 0 16 0.0000 29 13 29 3.9615
2 1002 26 101 FA11 1 0.0 PR 1 0 0 0 0.0000 15 1 1 0.2333
3 1003 27 101 FA11 1 0.0 <NA> 1 0 0 0 0.0000 18 10 14 2.3214
4 1013 25 101 FA11 1 2.5 <NA> 1 0 0 0 0.0000 15 15 15 2.9666
5 1007 24 101 SP12 1 2.0 <NA> 2 44 41 41 3.3125 56 53 53 2.8333
6 1013 26 102 SP12 1 2.5 <NA> 2 43 43 43 3.1860 59 56 59 3.4230
7 1355 25 101 SP12 1 2.0 <NA> 1 12 12 12 3.5416 25 25 25 3.1153
8 2779 24 101 SP12 1 1.5 <NA> 2 43 43 43 3.1785 64 56 64 2.1923
9 2302 24 101 SP12 1 3.5 <NA> 2 30 12 30 3.8636 43 25 43 3.6153

R - Aggregate and count instances of grouping [duplicate]

This question already has answers here:
Aggregate and reshape from long to wide
(2 answers)
Closed 2 years ago.
Dataset is a breakdown of responders and the number of contacts they have had within a given time period along with details on their age bracket, something similar to:
participant participant_age contact contact_age
1 18-30 1 18-30
1 18-30 2 30-40
2 30-40 1 18-30
3 18-30 1 18-30
3 18-30 2 50-60
My aim is to calculate the mean number of contacts each age group of participant has had with each age bracket of contact. Something similar to:
age_bracket 18-30 30-40 40-50
18-30 1 3 2
30-40 1.5 4 2
40-50 3 4 1
I have been attempting to use the group_by and spread functions available in dplyr. The closest I have come is using
data%>%
group_by(participant_age, contact_age) %>%
tally() %>%
spread(key = participant_age, value = n)
But this produces the total number (n) of each contact, rather than the mean number of contacts per age bracket.
In base R use tapply.
t(with(dat, tapply(contact, list(contact_age, participant_age), mean)))
# 18-30 30-40 50-60
# 18-30 1 2 2
# 30-40 1 NA NA
Data:
dat <- structure(list(participant = c(1L, 1L, 2L, 3L, 3L), participant_age = c("18-30",
"18-30", "30-40", "18-30", "18-30"), contact = c(1L, 2L, 1L,
1L, 2L), contact_age = c("18-30", "30-40", "18-30", "18-30",
"50-60")), class = "data.frame", row.names = c(NA, -5L))
If I understood correctly your aim, you were pretty close to the right solution:
data %>%
group_by(participant_age, contact_age) %>%
summarise(mean = mean(contact), .groups = "drop") %>%
spread(key = participant_age, value = mean)
You can use pivot_wider and pass the function to apply in values_fn :
tidyr::pivot_wider(df, names_from = contact_age, values_from = contact, values_fn = mean)

Subtracting rows in R based on matching value

I am trying to substract two rows in my dataset from each other:
Name Period Time Distance Load
Tim A 01:06:20 6000 680
Max A 01:06:20 5000 600
Leo A 01:06:20 5500 640
Noa A 01:06:20 6500 700
Tim B 00:04:10 500 80
Max B 00:04:10 500 50
Leo B 00:04:10 400 40
I want to subtract the Time, Distance and Load values of Period B from Period A for matching Names.
eg. Subtract row 5 (Tim, Period B) from row 1 (Tim, Period A)
The new values should be written into a new table looking like this:
Name Period Time Distance Load
Tim C 01:02:10 5500 600
Max C 01:02:10 4500 550
Leo C 01:02:10 5100 600
Noa C 01:06:20 6500 700
The real dataset contains many more rows. I tried to play around with dplyr but could not get the result I am looking for.
Thanks in advance
There are so many answers already that this is just a bit of fun at this stage. I think this way is nice as it uses unnest_wider():
library(dplyr)
library(tidyr)
library(purrr)
diff <- function(data) {
if(apply(data[2, -1], 1, function(x) all(is.na(x)))) {
data[1, -1]
} else {
data[1, -1] - data[2, -1]
}
}
df %>% group_by(Name) %>% nest() %>%
mutate(diff = map(data, diff)) %>% unnest_wider(diff) %>%
mutate(Period = "C") %>% select(Period, Time, Distance, Load)
# A tibble: 4 x 5
Name Period Time Distance Load
<chr> <chr> <time> <dbl> <dbl>
1 Tim C 01:02:10 5500 600
2 Max C 01:02:10 4500 550
3 Leo C 01:02:10 5100 600
4 Noa C 01:06:20 6500 700
Apart from the diff() function (which can probably be made neater and 'exclusively' tidyverse), this way is also shorter.
DATA
library(readr)
# courtesy of #MartinGal
df <- read_table2("Name Period Time Distance Load
Tim A 01:06:20 6000 680
Max A 01:06:20 5000 600
Leo A 01:06:20 5500 640
Noa A 01:06:20 6500 700
Tim B 00:04:10 500 80
Max B 00:04:10 500 50
Leo B 00:04:10 400 40")
You could filter on the two periods and then join them together, thus facilitating the subtraction of columns.
library(dplyr)
inner_join(filter(df, Period=="A"), filter(df, Period=="B"), by="Name") %>%
mutate(Period="C",
Time=Time.x-Time.y,
Distance=Distance.x-Distance.y,
Load=Load.x-Load.y) %>%
select(Name, Period, Time, Distance, Load)
Name Period Time Distance Load
1 Tim C 1.036111 hours 5500 600
2 Max C 1.036111 hours 4500 550
3 Leo C 1.036111 hours 5100 600
It's basically the same idea as #Edward. You could use dplyr and tidyr:
df %>%
pivot_wider(names_from="Period", values_from=c("Time", "Distance", "Load")) %>%
mutate(Period = "C",
Time = coalesce(Time_A - Time_B, Time_A),
Distance = coalesce(Distance_A - Distance_B, Distance_A),
Load = coalesce(Load_A - Load_B, Load_A)
) %>%
select(-matches("_\\w"))
returns
# A tibble: 4 x 5
Name Period Time Distance Load
<chr> <chr> <time> <dbl> <dbl>
1 Tim C 01:02:10 5500 600
2 Max C 01:02:10 4500 550
3 Leo C 01:02:10 5100 600
4 Noa C 01:06:20 6500 700
Data
df <- read_table2("Name Period Time Distance Load
Tim A 01:06:20 6000 680
Max A 01:06:20 5000 600
Leo A 01:06:20 5500 640
Noa A 01:06:20 6500 700
Tim B 00:04:10 500 80
Max B 00:04:10 500 50
Leo B 00:04:10 400 40")
Here is a different approach which groups by Name to get the difference.
library(dplyr)
library(chron)
df <- structure(list(Name = structure(c(4L, 2L, 1L, 3L, 4L, 2L, 1L), .Label = c("Leo", "Max", "Noa", "Tim"), class = "factor"),
Period = structure(c(1L,1L, 1L, 1L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
Time = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("0:04:10", "1:06:20"), class = "factor"),
Distance = c(6000L, 5000L, 5500L, 6500L, 500L, 500L, 400L),
Load = c(680L, 600L, 640L, 700L, 80L, 50L, 40L)), class = "data.frame", row.names = c(NA, -7L))
df %>%
mutate(Time = times(Time)) %>%
group_by(Name) %>%
mutate(Time = lag(Time) - Time,
Distance = lag(Distance) - Distance,
Load = lag(Load) - Load,
Period = LETTERS[which(LETTERS == Period) + 1]) %>%
filter(!is.na(Time))
You can use data.table too.
dt <- data.table(Name = c('Tim', 'Max', 'Leo', 'Noa', 'Tim', 'Max', 'Leo'),
Period = c('A', 'A', 'A', 'A', 'B', 'B', 'B'),
Time = c('01:06:20', '01:06:20' , '01:06:20' , '01:06:20' , '00:04:10' , '00:04:10' , '00:04:10' ),
Distance = c(6000, 5000, 5500, 6500, 500, 500, 400 ),
Load = c(680, 600, 640, 700, 80, 50, 40))
Then the first thing to do is to convert the Time var:
dt[, Time := as.POSIXct(Time, format = "%H:%M:%S")]
sapply(dt, class)
Then you use dcast.data.table:
dtCast <- dcast.data.table(dt, Name ~ Period, value.var = c('Time', 'Distance', 'Load'))
And then you create a new object:
dtFinal <- dtCast[,list(Period = 'C',
Time = Time_A - Time_B,
Distance = Distance_A - Distance_B,
Load = Load_A - Load_B),
by = 'Name']
Mind that if you want to convert the Time to the same format as above, you need to do the following:
library(hms)
dtFinal[, Time := as_hms(Time)]

Merging two Dataframes in R by ID, One is the subset of the other

I have 2 dataframes in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 datframes are matched by a primary key (that is 'pid'). dfnew is a subset of dfold, so that all variables in dfnew are also on dfold but with updated, imputed values (no NAs anymore). At the same time dfold has more variables, and I will need them in the analysis phase. I would like to merge the 2 dataframes in dfmerge so to update common variables from dfnew --> dfold but at the same time retaining pre-existing variables in dfold. I have tried merge(), match(), dplyr, and sqldf packages, but either I obtain a dfmerge with the updated 75 variables only (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I found (here) is an elegant but pretty long (10 rows) loop that is eliminating *.x variables after a merge by pid with all.x = TRUE option). Might you please advice on a more efficient way to obtain such result if available ?
Thank you in advance
P.S: To make things easier, I have created a minimal version of dfold and dfnew: dfnew has now 3 variables, no NAs, while dfold has 5 variables, NAs included. Here it is the dataframes structure
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although here the issue is limited to just 2 variables Please remind that the real scenario will involve 75 variables.
Alright, this solution assumes that you don't really need a merge but only want to update NA values within your dfold with imputed values in dfnew.
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
To do this for a single column, try
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[dfnew$pid == dfold$pid], dfold$Salary)
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse:
dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[dfnew$pid == dfold$pid], x), dfold[,cols], dfnew[,cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, use
cols <- names(dfnew)[which(names(dfnew) %in% names(dfold))][names(dfnew) != "pid"]

Identifying Duplicate/Unique Teams (and Restructuring Data) in R

I have a data set that looks like this:
Person Team
1 30
2 30
3 30
4 30
11 40
22 40
1 50
2 50
3 50
4 50
15 60
16 60
17 60
1 70
2 70
3 70
4 70
11 80
22 80
My overall goal is to organize that team identification codes so that it is easy to see which teams are duplicates of one another and which teams are unique. I want to summarize the data so that it looks like this:
Team Duplicate1 Duplicate2
30 50 70
40 80
60
As you can see, teams 30, 50, and 70 have identical members, so they share a row. Similarly, teams 40 and 80 have identical members, so they share a row. Only team 60 (in this example) is unique.
In situations where teams are duplicated, I don't care which team id goes in which column. Also, there may be more than 2 duplicates of a team. Teams range in size from 2 members to 8 members.
This answer gives the output data format you asked for. I left the duplicate teams in a single variable because I think it's a better way to handle an arbitrary number of duplicates.
require(dplyr)
df %>%
arrange(Team, Person) %>% # this line is necessary in case the rest of your data isn't sorted
group_by(Team) %>%
summarize(players = paste0(Person, collapse = ",")) %>%
group_by(players) %>%
summarize(teams = paste0(Team, collapse = ",")) %>%
mutate(
original_team = ifelse(grepl(",", teams), substr(teams, 1, gregexpr(",", teams)[[1]][1]-1), teams),
dup_teams = ifelse(grepl(",", teams), substr(teams, gregexpr(",", teams)[[1]][1]+1, nchar(teams)), NA)
)
The result:
Source: local data frame [3 x 4]
players teams original_team dup_teams
1 1,2,3,4 30,50,70 30 50,70
2 11,22 40,80 40 80
3 15,16,17 60 60 NA
Not exactly the format you're wanting, but pretty useful:
# using MrFlick's data
library(dplyr)
dd %>% group_by(Team) %>%
arrange(Person) %>%
summarize(team.char = paste(Person, collapse = "_")) %>%
group_by(team.char) %>%
arrange(team.char, Team) %>%
mutate(duplicate = 1:n())
Source: local data frame [6 x 3]
Groups: team.char
Team team.char duplicate
1 40 11_22 1
2 80 11_22 2
3 60 15_16_17 1
4 30 1_2_3_4 1
5 50 1_2_3_4 2
6 70 1_2_3_4 3
(Edited in the arrange(Person) line in case the data isn't already sorted, got the idea from #Reed's answer.)
Using this for your sample data
dd<-structure(list(Person = c(1L, 2L, 3L, 4L, 11L, 22L, 1L, 2L, 3L,
4L, 15L, 16L, 17L, 1L, 2L, 3L, 4L, 11L, 22L), Team = c(30L, 30L,
30L, 30L, 40L, 40L, 50L, 50L, 50L, 50L, 60L, 60L, 60L, 70L, 70L,
70L, 70L, 80L, 80L)), .Names = c("Person", "Team"),
class = "data.frame", row.names = c(NA, -19L))
You could try a table()/interaction() to find duplicate groups. For example
tt <- with(dd, table(Team, Person))
grp <- do.call("interaction", c(data.frame(unclass(tt)), drop=TRUE))
split(rownames(tt), grp)
this returns
$`1.1.1.1.0.0.0.0.0`
[1] "30" "50" "70"
$`0.0.0.0.0.1.1.1.0`
[1] "60"
$`0.0.0.0.1.0.0.0.1`
[1] "40" "80"
so the group "names" are really just indicators for membership for each person. You could easily rename them if you like with setNames(). But here it collapse the appropriate teams.
Two more base R options (though not exactly the desired output):
DF2 <- aggregate(Person ~ Team, DF, toString)
> split(DF2$Team, DF2$Person)
$`1, 2, 3, 4`
[1] 30 50 70
$`11, 22`
[1] 40 80
$`15, 16, 17`
[1] 60
Or
( DF2$DupeGroup <- as.integer(factor(DF2$Person)) )
Team Person DupeGroup
1 30 1, 2, 3, 4 1
2 40 11, 22 2
3 50 1, 2, 3, 4 1
4 60 15, 16, 17 3
5 70 1, 2, 3, 4 1
6 80 11, 22 2
Note that the expected output as shown in the question would either require to add NAs or empty strings in some of the columns entries because in a data.frame, all columns must have the same number of rows. That is different for lists in, as you can see in some of the answers.
The second option, but using data.table, since aggregate tends to be slow for large data:
library(data.table)
setDT(DF)[, toString(Person), by=Team][,DupeGroup := .GRP, by=V1][]
Team V1 DupeGroup
1: 30 1, 2, 3, 4 1
2: 40 11, 22 2
3: 50 1, 2, 3, 4 1
4: 60 15, 16, 17 3
5: 70 1, 2, 3, 4 1
6: 80 11, 22 2
Using uniquecombs from the mgcv package:
library(mgcv)
library(magrittr) # for the pipe %>%
# Using MrFlick's data
team_names <- sort(unique(dd$Team))
unique_teams <- with(dd, table(Team, Person)) %>% uniquecombs %>% attr("index")
printout <- unstack(data.frame(team_names, unique_teams))
> printout
$`1`
[1] 60
$`2`
[1] 40 80
$`3`
[1] 30 50 70
Now you could use something like this answer to print it in tabular form (note that the groups are column-wise, not row-wise as in your question):
attributes(printout) <- list(names = names(printout)
, row.names = 1:max(sapply(printout, length))
, class = "data.frame")
> printout
1 2 3
1 60 40 30
2 <NA> 80 50
3 <NA> <NA> 70
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs

Resources