Get max value per id, then only one value per id, in R

I would like to make my df smaller by taking just one observation per person per date, based on a person's biggest quantity for that date.
Here's my df:
names dates quantity
1 tom 2010-02-01 28
3 tom 2010-03-01 7
2 mary 2010-05-01 30
6 tom 2010-06-01 21
4 john 2010-07-01 45
5 mary 2010-07-01 30
8 mary 2010-07-01 28
11 tom 2010-08-01 28
7 john 2010-09-01 28
10 john 2010-09-01 30
9 john 2010-07-01 45
12 mary 2010-11-01 28
13 john 2010-12-01 7
14 john 2010-12-01 14
I do this first by finding the max quantity per person per date. This works OK, but as you can see, if a person has tied quantities on a date, all of the tied observations are retained.
merge(df, aggregate(quantity ~ names+dates, df, max))
names dates quantity
1 john 2010-07-01 45
2 john 2010-07-01 45
3 john 2010-09-01 30
4 john 2010-12-01 14
5 mary 2010-05-01 30
6 mary 2010-07-01 30
7 mary 2010-11-01 28
8 tom 2010-02-01 28
9 tom 2010-03-01 7
10 tom 2010-06-01 21
11 tom 2010-08-01 28
So, my next step would be to take just the first obs per person per date (given that I have already selected the biggest quantity). I can't get the code right for this. This is what I have tried:
merge(l, aggregate(names ~ dates, l, FUN=function(z) z[1]))->m ##doesn't get rid of one obs for john
and a data.table option
l[, .SD[1], by=c(names,dates)] ##doesn't work at all
I like the aggregate and data.table options as they are fast, and my df has ~100k rows.
Thank you in advance for this!
SOLUTION
I posted too fast, apologies! An easy solution to this problem is just to find duplicates and then remove them, e.g.:
merge(df, aggregate(quantity ~ names + dates, df, max)) -> toy
toy$dup <- duplicated(toy)
toy <- toy[!toy$dup, ]
Here are the system times on my real data (hence the different column names):
system.time(dt2[, max(new_quan), by = list(hai_dispense_number, date_of_claim)]->method1)
user system elapsed
20.04 0.04 20.07
system.time(aggregate(new_quan ~ hai_dispense_number+date_of_claim, dt2, max)->rpp)
user system elapsed
19.129 0.028 19.148
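For what it's worth, the data.table attempt above fails only because the by columns need quoting, i.e. by = c("names", "dates") or by = .(names, dates). Combined with an ordering step, the whole max-then-dedupe can be done in one pass; a minimal sketch, assuming the data are in a data.table with the same columns:

library(data.table)
dt <- as.data.table(df)
# sort so the largest quantity comes first within each names/dates group,
# then keep the first row of each group (ties are resolved automatically)
setorder(dt, names, dates, -quantity)
res <- dt[, .SD[1], by = .(names, dates)]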

Here's a data.table solution:
dt[, max(quantity), by = list(names, dates)]
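Note that an unnamed expression like max(quantity) in j comes back in a column called V1; to keep the original column name, a small variant of the same call:

dt[, .(quantity = max(quantity)), by = .(names, dates)]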
Bench:
N = 1e6
dt = data.table(names = sample(letters, N, T), dates = sample(LETTERS, N, T), quantity = rnorm(N))
df = data.frame(dt)
op = function(df) aggregate(quantity ~ names+dates, df, max)
eddi = function(dt) dt[, max(quantity), by = list(names, dates)]
microbenchmark(op(df), eddi(dt), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# op(df) 2535.241 3025.1485 3195.078 3398.4404 3533.209 10
# eddi(dt) 148.088 162.8073 198.222 220.1217 286.058 10

I am not sure this gives you the output you want, but it definitely takes care of the "duplicate rows":
# Replicating your dataframe
df <- data.frame(names = c("tom", "tom", "mary", "tom", "john", "mary", "mary", "tom", "john", "john", "john", "mary", "john", "john"),
                 dates = c("2010-02-01", "2010-03-01", "2010-05-01", "2010-06-01", "2010-07-01", "2010-07-01", "2010-07-01", "2010-08-01", "2010-09-01", "2010-09-01", "2010-07-01", "2010-11-01", "2010-12-01", "2010-12-01"),
                 quantity = c(28, 7, 30, 21, 45, 30, 28, 28, 28, 30, 45, 28, 7, 14))
temp = merge(df, aggregate(quantity ~ names+dates, df, max))
df.unique = unique(temp)

If you are working with a data.frame, you can use plyr:
library(plyr)
ddply(mydata,.(names,dates),summarize, maxquantity=max(quantity))
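plyr is no longer actively developed; the dplyr equivalent (a sketch, using the same placeholder data name) is:

library(dplyr)
mydata %>%
  group_by(names, dates) %>%
  summarize(maxquantity = max(quantity))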

do.call(rbind,
        lapply(split(df, df[, c("names", "dates")]), function(d) {
          d[which.max(d$quantity), ]
        }))
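Because which.max() returns only the position of the first maximum, this variant also resolves ties in one step. A tie-safe base one-liner in the same spirit (my sketch, not part of the answer above): sort by descending quantity, then keep the first row per group:

tmp <- df[order(-df$quantity), ]
tmp[!duplicated(tmp[c("names", "dates")]), ]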

Related

Remove inconsistent duplicate entries from data frame with Base R

I want to remove duplicate entries from a data frame that are inconsistent, the following gives a simplified example:
df <- data.frame(name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank", "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"),
                 amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25))
df
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 Cindy 30
## 5 David 200
## 6 Edgar 65
## 7 Edgar 55
## 8 Frank 90
## 9 George 120
## 10 George 120
## 11 George 120
## 12 Herbert 300
## 13 Iris 15
## 14 Iris 25
## 15 Iris 25
Version A)
The two Edgar entries refer to the same person, as do the Iris entries, yet the given amounts are inconsistent, so I want to remove those entries:
#remove inconsistent duplicate entries
df2
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 Cindy 30
## 5 David 200
## 6 Frank 90
## 7 George 120
## 8 George 120
## 9 George 120
## 10 Herbert 300
Version B)
Another possibility would be to keep only one instance of the consistent entries:
#keep only one instance of consistent entries
df3
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 David 200
## 5 Frank 90
## 6 George 120
## 7 Herbert 300
I am interested in (elegant?) ways to solve both versions in Base R. Efficiency should not be a problem because the dataset I have is not that huge.
A base solution that solves both at once. This has the side effect of requiring row name changes.
A Remove "inconsistent" values
new_df <- do.call("rbind",
                  Filter(function(x) all(x$amount == x$amount[1]),
                         split(df, df$name)))
name amount
Andy Andy 100
Bert Bert 50
Cindy.3 Cindy 30
Cindy.4 Cindy 30
David David 200
Frank Frank 90
George.9 George 120
George.10 George 120
George.11 George 120
Herbert Herbert 300
The above needs further cleaning of row names (an unwanted side effect, perhaps, but we deal with that below).
B Remove duplicates
new_df<-new_df[!duplicated(new_df$name),]
row.names(new_df) <- 1:nrow(new_df)
Combined result
new_df
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
The question specifically requests a base solution. But if for whatever reason someone from the future wants to use dplyr, I will leave this solution here.
Using dplyr, we can check if all values are equal to the first value of amount. If not, make them NA and delete them. Proceed with removing duplicates for what remains.
A Remove inconsistent ones
library(dplyr)
(df %>%
   group_by(name) %>%
   mutate(name = ifelse(!all(amount == first(amount)), NA, name)) %>%
   na.omit() -> new_df)
# A tibble: 10 x 2
# Groups: name [7]
name amount
<chr> <dbl>
1 Andy 100
2 Bert 50
3 Cindy 30
4 Cindy 30
5 David 200
6 Frank 90
7 George 120
8 George 120
9 George 120
10 Herbert 300
B Remove duplicates
new_df %>%
  filter(!duplicated(name)) %>%
  ungroup()
# A tibble: 7 x 2
name amount
<chr> <dbl>
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
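For completeness, a more compact dplyr route to both versions (my sketch; n_distinct() counts the distinct amounts per group):

df %>% group_by(name) %>% filter(n_distinct(amount) == 1) %>% ungroup()                 # version A
df %>% group_by(name) %>% filter(n_distinct(amount) == 1) %>% distinct() %>% ungroup()  # version B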
A) First aggregate to apply the conditions, then filter the data and finally stack the result.
t <- aggregate(amount ~ name, df, function(x) c(unique(x), length(x)))
t_m <- t[!sapply(t$amount, function(x) length(x) > 2), ]
setNames(stack(setNames(lapply(t_m$amount, function(x) rep(x[1], x[2])),
                        t_m$name))[, c("ind", "values")], colnames(df))
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 Cindy 30
5 David 200
6 Frank 90
7 George 120
8 George 120
9 George 120
10 Herbert 300
B) is a bit more straightforward: just aggregate and filter.
t <- aggregate( amount ~ name, df, unique )
t[lengths(t$amount) == 1,]
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
6 Frank 90
7 George 120
8 Herbert 300
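If you prefer to keep the original rows rather than rebuild them, version A can also be done by filtering with ave() (my sketch, in the same base-R spirit):

df[ave(df$amount, df$name, FUN = function(x) length(unique(x))) == 1, ]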
You can use duplicated(), but you need to flag duplicates from both directions so that every duplicated name is removed (your option B).
That result can then be used to filter the original data frame for all matching rows (your option A).
df <- data.frame(name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank", "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"),
                 amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25))
df_unq <- unique(df)
df3 <- df_unq[!(duplicated(df_unq$name)|duplicated(df_unq$name, fromLast = TRUE)), ]
df3
#> name amount
#> 1 Andy 100
#> 2 Bert 50
#> 3 Cindy 30
#> 5 David 200
#> 8 Frank 90
#> 9 George 120
#> 12 Herbert 300
df[df$name %in% df3$name, ]
#> name amount
#> 1 Andy 100
#> 2 Bert 50
#> 3 Cindy 30
#> 4 Cindy 30
#> 5 David 200
#> 8 Frank 90
#> 9 George 120
#> 10 George 120
#> 11 George 120
#> 12 Herbert 300
Created on 2021-12-12 by the reprex package (v2.0.1)
For the first requirement, where you need to get rid of duplicate entries, there's a built-in function in R called duplicated.
Here's the code:
df[!duplicated(df), ]      # removes exact duplicate rows
df[!duplicated(df$name), ] # keeps only the first row per name
The output of the second line looks like this:
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
5 David 200
6 Edgar 65
8 Frank 90
9 George 120
12 Herbert 300
13 Iris 15
And for the second requirement, you'll need to do something like this:
df <- unique(df)
df <- split(df, df$name)
df <- df[sapply(df, nrow) == 1]
df <- do.call(rbind, df)
rownames(df) <- 1:nrow(df)
The output looks like this:
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
Both versions use base R. You could do the same with the dplyr package.
Problem B is a sub-problem of problem A. To solve A we can use var() to find inconsistent values, relying on the behavior of Filter(), which treats NA as FALSE. To solve B we just need to get rid of the duplicated rows left in A by applying unique().
Case A
with(df, df[!name %in% names(Filter(var, split(amount, name))), ])
# name amount
# 1 Andy 100
# 2 Bert 50
# 3 Cindy 30
# 4 Cindy 30
# 5 David 200
# 8 Frank 90
# 9 George 120
# 10 George 120
# 11 George 120
# 12 Herbert 300
Case B
with(df, df[!name %in% names(Filter(var, split(amount, name))), ]) |>
  unique()
# name amount
# 1 Andy 100
# 2 Bert 50
# 3 Cindy 30
# 5 David 200
# 8 Frank 90
# 9 George 120
# 12 Herbert 300
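To see the mechanism on this df: var() is positive for the inconsistent names, 0 for consistent duplicates, and NA for single entries, and Filter() keeps only the positive (TRUE) cases:

sapply(split(df$amount, df$name), var)
#    Andy    Bert   Cindy   David   Edgar   Frank  George Herbert    Iris
#      NA      NA       0      NA      50      NA       0      NA   33.33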

More efficient methods than nested for loops in R -- matching

I'm trying to match people when they have identical first names, last names, and dates of birth, and keep the smallest numerical ID value.
I've created a test database below (much smaller than my actual dataset) and written a nested for-loop that looks like it's doing what it's supposed to.
But it's slow as hell on bigger datasets.
I'm relatively new to the apply functions, but they seem like a more natural fit for this kind of data wrangling.
What's a more efficient alternative for what I'm doing here? I'm sure there's a simple solution that will have me shaking my head for asking here, but I'm not coming to it.
dta.test<- NULL
dta.test$Person_id <- c(1,2,3,4,5,6,7,8,9,10, 11)
dta.test$FirstName <- c("John", "James", "John", "Alex", "Alexander", "Jonathan", "John", "Alex", "James", "John", "John")
dta.test$LastName <- c("Smith", "Jones", "Jones", "Jones", "Jones", "Smith", "Jones", "Smith", "Johnson", "Smith", "Smith")
dta.test$DOB <- c("2001-01-01", "2002-01-01", "2003-01-01", "2004-01-01", "2004-01-01", "2001-01-01", "2003-01-01", "2006-01-01", "2006-01-01", "2001-01-01", "2009-01-01")
dta.test$Actual_ID <- c(1, 2, 3, 4, 5, 6, 3, 8, 9, 1, 11)
dta.test <- as.data.frame(dta.test)
for (i in unique(dta.test$FirstName)) {
  for (j in unique(dta.test$LastName)) {
    for (k in unique(dta.test$DOB)) {
      dta.test$Person_id[dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k] <-
        min(dta.test$Person_id[dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k], na.rm = TRUE)
    }
  }
}
Here's a dplyr solution
library(dplyr)
dta.test %>%
  group_by(FirstName, LastName, DOB) %>%
  mutate(Person_id = min(Person_id))
# A tibble: 11 x 5
# Groups: FirstName, LastName, DOB [9]
# Person_id FirstName LastName DOB Actual_ID
# <dbl> <fct> <fct> <fct> <dbl>
# 1 1. John Smith 2001-01-01 1.
# 2 2. James Jones 2002-01-01 2.
# 3 3. John Jones 2003-01-01 3.
# 4 4. Alex Jones 2004-01-01 4.
# 5 5. Alexander Jones 2004-01-01 5.
# 6 6. Jonathan Smith 2001-01-01 6.
# 7 3. John Jones 2003-01-01 3.
# 8 8. Alex Smith 2006-01-01 8.
# 9 9. James Johnson 2006-01-01 9.
# 10 1. John Smith 2001-01-01 1.
# 11 11. John Smith 2009-01-01 11.
EDIT - Added Performance comparison
for_loop_approach <- function() {
  for (i in unique(dta.test$FirstName)) {
    for (j in unique(dta.test$LastName)) {
      for (k in unique(dta.test$DOB)) {
        dta.test$Person_id[dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k] <-
          min(dta.test$Person_id[dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k], na.rm = TRUE)
      }
    }
  }
}
dplyr_approach <- function() {
  require(dplyr)
  dta.test %>%
    group_by(FirstName, LastName, DOB) %>%
    mutate(Person_id = min(Person_id))
}
library(microbenchmark)
microbenchmark(for_loop_approach(), dplyr_approach(), unit="relative", times=100L)
Unit: relative
expr min lq mean median uq max neval
for_loop_approach() 20.97948 20.6478 18.8189 17.81437 17.91815 11.76743 100
dplyr_approach() 1.00000 1.0000 1.0000 1.00000 1.00000 1.00000 100
There were 50 or more warnings (use warnings() to see the first 50)
I've implemented a base R approach rather than dplyr and it comes out (according to microbenchmark) 7.46 times faster than the dplyr approach of CPak, and 139.4 times faster than the for loop approach. I've just used the match and paste0 functions to get this working, and it will automatically retain the smallest matching id:
dta.test[, "Actual_id"] <- match(paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB),
                                 paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB))
This approach also outputs it straight to a data frame, rather than a tibble (which you would need to extract the new column from, and add to your data frame):
Person_id FirstName LastName DOB Actual_id
1 1 John Smith 2001-01-01 1
2 2 James Jones 2002-01-01 2
3 3 John Jones 2003-01-01 3
4 4 Alex Jones 2004-01-01 4
5 5 Alexander Jones 2004-01-01 5
6 6 Jonathan Smith 2001-01-01 6
7 7 John Jones 2003-01-01 3
8 8 Alex Smith 2006-01-01 8
9 9 James Johnson 2006-01-01 9
10 10 John Smith 2001-01-01 1
11 11 John Smith 2009-01-01 11
In your real data I expect the person id is not so simple (not just an integer) and doesn't run in numerical order, e.g.
dta.test$Person_id <- paste0(LETTERS[1:11],1:11)
You just need a small tweak to make this still work, to make it extract value from the Person_id column:
dta.test[, "Actual_id"] <- dta.test[match(paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB),
                                          paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB)), "Person_id"]
Giving:
Person_id FirstName LastName DOB Actual_id
1 A1 John Smith 2001-01-01 A1
2 B2 James Jones 2002-01-01 B2
3 C3 John Jones 2003-01-01 C3
4 D4 Alex Jones 2004-01-01 D4
5 E5 Alexander Jones 2004-01-01 E5
6 F6 Jonathan Smith 2001-01-01 F6
7 G7 John Jones 2003-01-01 C3
8 H8 Alex Smith 2006-01-01 H8
9 I9 James Johnson 2006-01-01 I9
10 J10 John Smith 2001-01-01 A1
11 K11 John Smith 2009-01-01 K11
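One caveat on the paste0() key (my note, not the answerer's): concatenating fields without a separator can in principle collide, e.g. paste0("Ann", "aSmith") equals paste0("Anna", "Smith"). Pasting with a separator that cannot occur in the data avoids this:

key <- paste(dta.test$FirstName, dta.test$LastName, dta.test$DOB, sep = "\r")
dta.test[, "Actual_id"] <- dta.test[match(key, key), "Person_id"]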
A data.table solution will probably be quickest on large data with lots of groups:
library(data.table)
setDT(dta.test, key = c("FirstName", "LastName", "DOB"))
dta.test[, Actual_ID := min(Person_id, na.rm = TRUE), by = .(FirstName, LastName, DOB)]

In R: add rows based on a date and another condition

I have a data frame df:
df <- data.frame(names = c("john", "mary", "tom"),
                 dates = c(as.Date("2010-06-01"), as.Date("2010-07-09"), as.Date("2010-06-01")),
                 tours_missed = c(2, 12, 6))
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
tom 2010-06-01 6
I want to be able to add a row with the dates the person missed. There are 2 tours every day the person works. Each person works every 4 days.
The result should be (though the order doesn't matter):
names dates tours_missed
john 2010-06-01 2
mary 2010-07-09 12
mary 2010-07-13 12
mary 2010-07-17 12
mary 2010-07-21 12
mary 2010-07-25 12
mary 2010-07-29 12
tom 2010-06-01 6
tom 2010-06-05 6
tom 2010-06-09 6
I have already tried looking at these topics but was unable to produce the above result: Add rows to a data frame based on date in previous row, In R: Add rows with data of previous row to data frame, and add new row to dataframe. Thanks for your help!
library(data.table)
dt = as.data.table(df) # or convert in-place using setDT
# all of the relevant dates
dates.all = dt[, seq(dates, length = tours_missed/2, by = "4 days"), by = names]
# set the key and merge filling in the blanks with previous observation
setkey(dt, names, dates)
dt[dates.all, roll = T]
# names dates tours_missed
# 1: john 2010-06-01 2
# 2: mary 2010-07-09 12
# 3: mary 2010-07-13 12
# 4: mary 2010-07-17 12
# 5: mary 2010-07-21 12
# 6: mary 2010-07-25 12
# 7: mary 2010-07-29 12
# 8: tom 2010-06-01 6
# 9: tom 2010-06-05 6
#10: tom 2010-06-09 6
Or if merging is unnecessary (not quite clear from OP), just construct the answer:
dt[, list(dates = seq(dates, length = tours_missed/2, by = "4 days"), tours_missed)
, by = names]
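The same expansion is possible in plain base R (my sketch, for readers without data.table), since each name has a single starting row here:

do.call(rbind, lapply(split(df, df$names), function(d)
  data.frame(names = d$names,
             dates = seq(d$dates, length.out = d$tours_missed / 2, by = "4 days"),
             tours_missed = d$tours_missed)))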

How do I find last date in which a value increased in another column?

I have a data frame in R that looks something like this:
person date level
Alex 2007-06-01 3
Alex 2008-12-01 4
Alex 2009-12-01 3
Beth 2008-03-01 6
Beth 2010-10-01 6
Beth 2010-12-01 6
Mary 2009-11-04 9
Mary 2012-04-25 9
Mary 2013-09-10 10
I have sorted it first by "person" and second by "date".
I am trying to find out when the last increase in "level" occurred for each person. Ideally, the output would look something like:
person date
Alex 2008-12-01
Beth NA
Mary 2013-09-10
Using dplyr
library(dplyr)
dat %>% group_by(person) %>%
  mutate(inc = c(F, diff(level) > 0)) %>%
  summarize(date = last(date[inc], default = NA))
Yielding:
Source: local data frame [3 x 2]
person date
1 Alex 2008-12-01
2 Beth <NA>
3 Mary 2013-09-10
Try a data.table version:
library(data.table)
setDT(dat)[order(person), diff := c(NA, diff(level)), by = person][
  diff > 0, tail(.SD, 1), by = person][, -c(3, 4), with = FALSE]
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
If the NA cases also need to be included:
dd = setDT(dat)[order(person), diff := c(NA, diff(level)), by = person][
  diff > 0, tail(.SD, 1), by = person][, -c(3, 4), with = FALSE]
dd2 = data.frame(unique(dat[!(person %in% dd$person), ]$person), NA)
names(dd2) = c('person','date')
rbind(dd, dd2)
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
3: Beth NA
A base-R version, using the data frame df (assuming person is stored as a factor):
sapply(levels(df$person), function(p) {
  s <- df[df$person == p, ]
  i <- 1 + nrow(s) - match(TRUE, rev(diff(s$level) > 0))
  ifelse(is.na(i), NA, as.character(s$date[i]))
})
produces the named vector
Alex Beth Mary
"2008-12-01" NA "2013-09-10"
Easy to wrap this to produce any output format you need:
last.level.up <- function(df) {
  data.frame(date = sapply(levels(df$person), function(p) {
    s <- df[df$person == p, ]
    i <- 1 + nrow(s) - match(TRUE, rev(diff(s$level) > 0))
    ifelse(is.na(i), NA, as.character(s$date[i]))
  }))
}
last.level.up(df)
date
Alex 2008-12-01
Beth <NA>
Mary 2013-09-10

Merging data frames and combining columns into one

I've got the following three dataframes:
df1 <- data.frame(name = c("John", "Anne", "Christine", "Andy"),
                  age = c(31, 26, 54, 48),
                  height = c(180, 175, 160, 168),
                  group = c("Student", 3, 5, "Employer"), stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Anne", "Christine"),
                  age = c(26, 54),
                  height = c(175, 160),
                  group = c(3, 5),
                  group2 = c("Teacher", 6), stringsAsFactors = FALSE)
df3 <- data.frame(name = c("Christine"),
                  age = c(54),
                  height = c(160),
                  group = c(5),
                  group2 = c(6),
                  group3 = c("Scientist"), stringsAsFactors = FALSE)
I'd like to combine them so that I get the following result:
df.all <- data.frame(name = c("John", "Anne", "Christine", "Andy"),
                     age = c(31, 26, 54, 48),
                     height = c(180, 175, 160, 168),
                     group = c("Student", "Teacher", "Scientist", "Employer"))
At the moment I'm doing it this way:
df.all <- merge(merge(df1[,c(1,4)], df2[,c(1,5)], all=TRUE, by="name"),
df3[,c(1,6)], all=TRUE, by="name")
row.ind <- which(df.all$group %in% c(6,5))
df.all[row.ind, c("group")] <- df.all[row.ind, c("group2")]
row.ind2 <- which(df.all$group2 %in% c(6))
df.all[row.ind2, c("group")] <- df.all[row.ind2, c("group3")]
This isn't generalisable and it is really messy. Maybe there would be a way to use merge_all or merge_recurse for the merging step (especially as there might be more than two dataframes to be merged), but I haven't figured out how. These two don't produce the right result:
df.all <- merge_all(list(df1, df2, df3))
df.all <- merge_recurse(list(df1, df2, df3), by=c("name"))
Is there a more general and elegant way to solve this problem?
Here is another possible approach, if I understand what you're ultimately after. (It is not clear what the numeric values in the "group" columns are, so I'm not sure this is exactly what you're looking for.)
Use Reduce() to merge your multiple data.frames.
temp <- Reduce(function(x, y) merge(x, y, all=TRUE), list(df1, df2, df3))
names(temp)[4] <- "group1" # Rename "group" to "group1" for reshaping
temp
# name age height group1 group2 group3
# 1 Andy 48 168 Employer <NA> <NA>
# 2 Anne 26 175 3 Teacher <NA>
# 3 Christine 54 160 5 6 Scientist
# 4 John 31 180 Student <NA> <NA>
Use reshape() to reshape your data from wide to long.
df.all <- reshape(temp, direction = "long", idvar="name", varying=4:6, sep="")
df.all
# name age height time group
# Andy.1 Andy 48 168 1 Employer
# Anne.1 Anne 26 175 1 3
# Christine.1 Christine 54 160 1 5
# John.1 John 31 180 1 Student
# Andy.2 Andy 48 168 2 <NA>
# Anne.2 Anne 26 175 2 Teacher
# Christine.2 Christine 54 160 2 6
# John.2 John 31 180 2 <NA>
# Andy.3 Andy 48 168 3 <NA>
# Anne.3 Anne 26 175 3 <NA>
# Christine.3 Christine 54 160 3 Scientist
# John.3 John 31 180 3 <NA>
Take advantage of the fact that as.numeric() will coerce characters to NA, and use na.omit() to remove all of the rows with NA values.
na.omit(df.all[is.na(as.numeric(df.all$group)), ])
# name age height time group
# Andy.1 Andy 48 168 1 Employer
# John.1 John 31 180 1 Student
# Anne.2 Anne 26 175 2 Teacher
# Christine.3 Christine 54 160 3 Scientist
Again, this might be over-generalizing your problem--there might be NA values in other columns, for example--but it might help direct you towards a solution to your problem.
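One small practical note on the coercion step: as.numeric() on non-numeric strings raises an "NAs introduced by coercion" warning; wrapping it in suppressWarnings() silences that without changing the result:

na.omit(df.all[is.na(suppressWarnings(as.numeric(df.all$group))), ])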
First step is to use merge_recurse with all.x = TRUE:
library(reshape)
merge.all <- merge_recurse(list(df1, df2, df3), all.x = TRUE)
# name age height group group2 group3
# 1 Anne 26 175 3 Teacher <NA>
# 2 Christine 54 160 5 6 Scientist
# 3 John 31 180 Student <NA> <NA>
# 4 Andy 48 168 Employer <NA> <NA>
Then you can use apply to get the last non-NA group from all the "group" columns:
group.cols <- grep("group", colnames(merge.all))
merge.all <- data.frame(merge.all[-group.cols],
                        group = apply(merge.all[group.cols], 1,
                                      function(x) tail(na.omit(x), 1)))
# name age height group
# 1 Anne 26 175 Teacher
# 2 Christine 54 160 Scientist
# 3 John 31 180 Student
# 4 Andy 48 168 Employer
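A caveat worth flagging in the apply() step (my note, not the answerer's): tail(na.omit(x), 1) returns a zero-length vector for a row with no non-NA group at all, which would break the data.frame() call. A defensive drop-in replacement for the inner function:

pick.last <- function(x) {
  g <- tail(na.omit(x), 1)        # last non-NA group, if any
  if (length(g)) g else NA        # fall back to NA for all-NA rows
}
# then: group = apply(merge.all[group.cols], 1, pick.last)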
