More efficient methods than nested for loops in R -- matching

I'm trying to match people when they have identical first names, last names, and dates of birth, and keep the smallest numerical ID value for each match.
I've created a test database below (much smaller than my actual dataset) and written a nested for-loop that looks like it's doing what it's supposed to.
But it's slow as hell on bigger datasets.
I'm relatively new to the apply functions, but they seem more suited to applying a function than to this kind of data wrangling.
What's a more efficient alternative for what I'm doing here? I'm sure there's a simple solution that will have me shaking my head for asking here, but I'm not coming to it.
dta.test <- NULL
dta.test$Person_id <- c(1,2,3,4,5,6,7,8,9,10, 11)
dta.test$FirstName <- c("John", "James", "John", "Alex", "Alexander", "Jonathan", "John", "Alex", "James", "John", "John")
dta.test$LastName <- c("Smith", "Jones", "Jones", "Jones", "Jones", "Smith", "Jones", "Smith", "Johnson", "Smith", "Smith")
dta.test$DOB <- c("2001-01-01", "2002-01-01", "2003-01-01", "2004-01-01", "2004-01-01", "2001-01-01", "2003-01-01", "2006-01-01", "2006-01-01", "2001-01-01", "2009-01-01")
dta.test$Actual_ID <- c(1, 2, 3, 4, 5, 6, 3, 8, 9, 1, 11)
dta.test <- as.data.frame(dta.test)
for (i in unique(dta.test$FirstName))
  for (j in unique(dta.test$LastName))
    for (k in unique(dta.test$DOB))
    {
      matched <- dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k
      dta.test$Person_id[matched] <- min(dta.test$Person_id[matched], na.rm = TRUE)
    }

Here's a dplyr solution
library(dplyr)
dta.test %>%
  group_by(FirstName, LastName, DOB) %>%
  mutate(Person_id = min(Person_id))
# A tibble: 11 x 5
# Groups:   FirstName, LastName, DOB [9]
#    Person_id FirstName LastName DOB        Actual_ID
#        <dbl> <fct>     <fct>    <fct>          <dbl>
#  1        1. John      Smith    2001-01-01        1.
#  2        2. James     Jones    2002-01-01        2.
#  3        3. John      Jones    2003-01-01        3.
#  4        4. Alex      Jones    2004-01-01        4.
#  5        5. Alexander Jones    2004-01-01        5.
#  6        6. Jonathan  Smith    2001-01-01        6.
#  7        3. John      Jones    2003-01-01        3.
#  8        8. Alex      Smith    2006-01-01        8.
#  9        9. James     Johnson  2006-01-01        9.
# 10        1. John      Smith    2001-01-01        1.
# 11       11. John      Smith    2009-01-01       11.
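Note that this returns a grouped tibble rather than a plain data frame. If you want the grouping dropped and a data.frame back, something like this should do it (ungroup() and as.data.frame() are standard dplyr/base functions, just not shown in the answer above):
dta.test %>%
  group_by(FirstName, LastName, DOB) %>%
  mutate(Person_id = min(Person_id)) %>%
  ungroup() %>%
  as.data.frame()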
EDIT - Added Performance comparison
for_loop_approach <- function() {
  for (i in unique(dta.test$FirstName))
    for (j in unique(dta.test$LastName))
      for (k in unique(dta.test$DOB))
      {
        matched <- dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k
        dta.test$Person_id[matched] <- min(dta.test$Person_id[matched], na.rm = TRUE)
      }
}
dplyr_approach <- function() {
  require(dplyr)
  dta.test %>%
    group_by(FirstName, LastName, DOB) %>%
    mutate(Person_id = min(Person_id))
}
library(microbenchmark)
microbenchmark(for_loop_approach(), dplyr_approach(), unit="relative", times=100L)
Unit: relative
                expr      min      lq    mean   median       uq      max neval
 for_loop_approach() 20.97948 20.6478 18.8189 17.81437 17.91815 11.76743   100
    dplyr_approach()  1.00000  1.0000  1.0000  1.00000  1.00000  1.00000   100
There were 50 or more warnings (use warnings() to see the first 50)

I've implemented a base R approach rather than dplyr, and according to microbenchmark it comes out 7.46 times faster than CPak's dplyr approach and 139.4 times faster than the for loop approach. It just uses the match and paste0 functions, and it retains the smallest matching id automatically here because match() returns the first occurrence and the rows are already ordered by Person_id:
dta.test[, "Actual_id"] <- match(paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB), paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB))
This approach also assigns straight into the data frame, rather than producing a tibble (from which you would need to extract the new column and add it back to your data frame):
   Person_id FirstName LastName        DOB Actual_id
1          1      John    Smith 2001-01-01         1
2          2     James    Jones 2002-01-01         2
3          3      John    Jones 2003-01-01         3
4          4      Alex    Jones 2004-01-01         4
5          5 Alexander    Jones 2004-01-01         5
6          6  Jonathan    Smith 2001-01-01         6
7          7      John    Jones 2003-01-01         3
8          8      Alex    Smith 2006-01-01         8
9          9     James  Johnson 2006-01-01         9
10        10      John    Smith 2001-01-01         1
11        11      John    Smith 2009-01-01        11
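One caveat worth flagging: paste0() with no separator can, in principle, build the same key from different field combinations (e.g. paste0("Sam", "Mills") and paste0("Samm", "Ills") are identical). A defensive variant of the same idea uses a separator character that can't occur in the data; the "|" here is an arbitrary choice:
# build the key with an explicit separator so fields can't run together
key <- paste(dta.test$FirstName, dta.test$LastName, dta.test$DOB, sep = "|")
dta.test[, "Actual_id"] <- match(key, key)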
In your real data I expect the person id is not so simple (not just an integer) and doesn't run in numerical order, e.g.
dta.test$Person_id <- paste0(LETTERS[1:11],1:11)
You just need a small tweak to make this still work: index back into the Person_id column with the matched row positions:
dta.test[, "Actual_id"] <- dta.test[match(paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB), paste0(dta.test$FirstName, dta.test$LastName, dta.test$DOB)), "Person_id"]
Giving:
   Person_id FirstName LastName        DOB Actual_id
1         A1      John    Smith 2001-01-01        A1
2         B2     James    Jones 2002-01-01        B2
3         C3      John    Jones 2003-01-01        C3
4         D4      Alex    Jones 2004-01-01        D4
5         E5 Alexander    Jones 2004-01-01        E5
6         F6  Jonathan    Smith 2001-01-01        F6
7         G7      John    Jones 2003-01-01        C3
8         H8      Alex    Smith 2006-01-01        H8
9         I9     James  Johnson 2006-01-01        I9
10       J10      John    Smith 2001-01-01        A1
11       K11      John    Smith 2009-01-01       K11
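As noted above, match() keeps the first occurrence, so this only retains the smallest ID when the rows are already ordered by Person_id. If your real data isn't, sort first; a sketch, assuming the IDs sort correctly with order():
# sort so the row to keep comes first within each group
dta.test <- dta.test[order(dta.test$Person_id), ]
key <- paste(dta.test$FirstName, dta.test$LastName, dta.test$DOB, sep = "|")
dta.test[, "Actual_id"] <- dta.test[match(key, key), "Person_id"]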

A data.table solution will probably be quickest on large data with lots of groups:
library(data.table)
setDT(dta.test, key = c("FirstName", "LastName", "DOB"))
dta.test[, Actual_ID := min(Person_id, na.rm = TRUE), by = .(FirstName, LastName, DOB)]
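Keep in mind that setDT() converts dta.test to a data.table by reference, so it stays a data.table afterwards. If you need a plain data.frame again:
setDF(dta.test)  # also by reference, no copy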

Related

How can I group the same value across multiple columns and sum subsequent values?

I have a table of information that looks like the following:
  rusher_full_name receiver_full_name rushing_fpts receiving_fpts
  <chr>            <chr>                     <dbl>          <dbl>
1 Aaron Jones      NA                          5                0
2 NA               Aaron Jones                 0                5
3 Mike Davis       NA                          0.5              0
4 NA               Allen Robinson              0                3
5 Mike Davis       NA                          0.7              0
What I'm trying to do is sum up the values of rushing_fpts and receiving_fpts for each player, depending on the rusher_full_name and receiver_full_name values. For example, for every instance of "Aaron Jones" (whether it's in rusher_full_name or receiver_full_name), sum up the values of rushing_fpts and receiving_fpts.
In the end, this is what I'd like it to look like:
  player_full_name total_fpts
  <chr>                 <dbl>
1 Aaron Jones            10
2 Mike Davis              1.2
3 Allen Robinson          3
I'm pretty new to using R and have Googled a number of things but can't find any solution. Any suggestions on how to accomplish this?
library(tidyverse)
df %>%
  mutate(player_full_name = coalesce(rusher_full_name, receiver_full_name)) %>%
  group_by(player_full_name) %>%
  summarise(total_fpts = sum(rushing_fpts + receiving_fpts))
Output
# A tibble: 3 x 2
  player_full_name total_fpts
  <chr>                 <dbl>
1 Aaron Jones            10
2 Allen Robinson          3
3 Mike Davis              1.2
Data
df <- data.frame(
  rusher_full_name = c("Aaron Jones", NA, "Mike Davis", NA, "Mike Davis"),
  receiver_full_name = c(NA, "Aaron Jones", NA, "Allen Robinson", NA),
  rushing_fpts = c(5, 0, 0.5, 0, 0.7),
  receiving_fpts = c(0, 5, 0, 3, 0),
  stringsAsFactors = FALSE
)
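If the real data is messier than this example and the points columns can themselves contain NA, note that sum(rushing_fpts + receiving_fpts) would propagate the NA; passing both columns to sum() with na.rm = TRUE is a more forgiving variant of the same summarise:
df %>%
  mutate(player_full_name = coalesce(rusher_full_name, receiver_full_name)) %>%
  group_by(player_full_name) %>%
  summarise(total_fpts = sum(rushing_fpts, receiving_fpts, na.rm = TRUE))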

Getting Data in a single row into multiple rows

I have data showing which people work in certain groups. When I ask the leader of each group, in a survey, to list those who work for them, I get a single row containing all of the team members. What I need is to clean the data into multiple rows, one per member, with their group information.
I don't know where to start.
This is what my data frame looks like,
LeaderName <- c('John','Jane','Louis','Carl')
Group <- c('3','1','4','2')
Member1 <- c('Lucy','Stephanie','Chris','Leslie')
Member1ID <- c('1','2','3','4')
Member2 <- c('Earl','Carlos','Devon','Francis')
Member2ID <- c('5','6','7','8')
Member3 <- c('Luther','Peter','','Severus')
Member3ID <- c('9','10','','11')
GroupInfo <- data.frame(LeaderName, Group, Member1, Member1ID, Member2 ,Member2ID, Member3, Member3ID)
This is what I would like the result to look like:
LeaderName_ <- c('John','Jane','Louis','Carl','John','Jane','Louis','Carl','John','Jane','','Carl')
Group_ <- c('3','1','4','2','3','1','4','2','3','1','','2')
Member <- c('Lucy','Stephanie','Chris','Leslie','Earl','Carlos','Devon','Francis','Luther','Peter','','Severus')
MemberID <- c('1','2','3','4','5','6','7','8','9','10','','11')
ActualGroupInfor <- data.frame(LeaderName_,Group_,Member,MemberID)
An option would be melt from data.table, specifying the column-name patterns in the measure argument:
library(data.table)
melt(setDT(GroupInfo), measure = patterns("^Member\\d+$", "^Member\\d+ID$"),
     value.name = c("Member", "MemberID"))[, variable := NULL][]
#     LeaderName Group    Member MemberID
#  1:       John     3      Lucy        1
#  2:       Jane     1 Stephanie        2
#  3:      Louis     4     Chris        3
#  4:       Carl     2    Leslie        4
#  5:       John     3      Earl        5
#  6:       Jane     1    Carlos        6
#  7:      Louis     4     Devon        7
#  8:       Carl     2   Francis        8
#  9:       John     3    Luther        9
# 10:       Jane     1     Peter       10
# 11:      Louis     4
# 12:       Carl     2   Severus       11
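Because the missing third member is stored as an empty string rather than NA, the melted result keeps a blank row for Louis (row 11 above). If you'd rather drop such rows, filtering afterwards should work:
result <- melt(setDT(GroupInfo), measure = patterns("^Member\\d+$", "^Member\\d+ID$"),
               value.name = c("Member", "MemberID"))[, variable := NULL][]
result[Member != ""]  # drop placeholder rows for groups with fewer members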
Here is a solution in base R:
reshape(
  data = GroupInfo,
  idvar = c("LeaderName", "Group"),
  varying = list(
    Member = grep("^Member[0-9]$", names(GroupInfo)),
    MemberID = grep("^Member[0-9]ID$", names(GroupInfo))
  ),
  direction = "long",
  v.names = c("Member", "MemberID"),
  sep = "_"
)[, -3]
#>           LeaderName Group    Member MemberID
#> John.3.1        John     3      Lucy        1
#> Jane.1.1        Jane     1 Stephanie        2
#> Louis.4.1      Louis     4     Chris        3
#> Carl.2.1        Carl     2    Leslie        4
#> John.3.2        John     3      Earl        5
#> Jane.1.2        Jane     1    Carlos        6
#> Louis.4.2      Louis     4     Devon        7
#> Carl.2.2        Carl     2   Francis        8
#> John.3.3        John     3    Luther        9
#> Jane.1.3        Jane     1     Peter       10
#> Louis.4.3      Louis     4
#> Carl.2.3        Carl     2   Severus       11
Created on 2019-05-23 by the reprex package (v0.2.1)

Replace multiple strings/values based on separate list

I have a data frame that looks similar to this:
  EVENT           ID  GROUP YEAR          X.1          X.2           X.3          Y.1          Y.2            Y.3
1     1 1 John Smith GROUP1 2015 1 John Smith 5 Adam Smith 12 Mike Smith 20 Sam Smith 7 Luke Smith 3 George Smith
Each row repeats for new logs, but the values in X.1 : Y.3 change often.
The IDs in the ID column and those present in X.1 : Y.3 consist of a numeric value followed by the name, i.e., "1 John Smith" or "20 Sam Smith" will be the string.
I have an issue where in certain instances, the ID will remain as "1 John Smith" but in X.1 : Y.3 the number may change preceding "John Smith", so for example it might be "14 John Smith". The names will always be correct, it's just the number that sometimes gets mixed up.
I have a list of 200+ ID's that are impacted by this mismatch - what is the most efficient way to replace the values in X.1 : Y.3 so that they match the correct ID in column ID?
I won't know which column "14 John Smith" shows up in, it could be X.1, or Y.2, or Y.3 depending on the row.
I can use a replace function in a dplyr line of code, or gsub for each of the 200+ IDs and for each column affected, but it seems very inefficient. Is there a quicker way than repeating something like the below x times?
df %>% mutate(X.1 = replace(X.1, grepl('John Smith', X.1), "1 John Smith")) %>% as.data.frame()
Sometimes it helps to temporarily reshape the data. That way we can operate on all the X and Y values without iterating over them.
library(stringr)
library(tidyr)
## some data to work with
exd <- read.csv(text = "EVENT,ID,GROUP,YEAR,X.1,X.2,X.3,Y.1,Y.2,Y.3
1,1 John Smith,GROUP1,2015,19 John Smith,11 Adam Smith,9 Sam Smith,5 George Smith,13 Mike Smith,12 Luke Smith
2,2 John Smith,GROUP1,2015,1 George Smith,9 Luke Smith,19 Adam Smith,7 Sam Smith,17 Mike Smith,11 John Smith
3,3 John Smith,GROUP1,2015,5 George Smith,18 John Smith,12 Sam Smith,6 Luke Smith,2 Mike Smith,4 Adam Smith",
stringsAsFactors = FALSE)
## re-arrange to put X and Y columns into a single column
exd <- gather(exd, key = "var", value = "value", X.1, X.2, X.3, Y.1, Y.2, Y.3)
## find the X and Y values that contain the ID name
matches <- str_detect(exd$value, str_replace_all(exd$ID, "^\\d+ *", ""))
## replace X and Y values with the matching ID
exd[matches, "value"] <- exd$ID[matches]
## put it back in the original shape
exd <- spread(exd, key = "var", value = value)
exd
##   EVENT           ID  GROUP YEAR            X.1           X.2           X.3            Y.1           Y.2           Y.3
## 1     1 1 John Smith GROUP1 2015   1 John Smith 11 Adam Smith   9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
## 2     2 2 John Smith GROUP1 2015 1 George Smith  9 Luke Smith 19 Adam Smith    7 Sam Smith 17 Mike Smith  2 John Smith
## 3     3 3 John Smith GROUP1 2015 5 George Smith  3 John Smith  12 Sam Smith   6 Luke Smith  2 Mike Smith  4 Adam Smith
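One caution with this approach: the stripped-down ID is used as a regular expression inside str_detect(). That's fine for plain names, but if an ID could ever contain regex metacharacters, wrapping the pattern in stringr's fixed() makes the match literal:
## match literally rather than as a regular expression
matches <- str_detect(exd$value, fixed(str_replace_all(exd$ID, "^\\d+ *", "")))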
Not sure if you're set on dplyr and piping, but I think this is a plyr solution that does what you need. Given this example dataset:
> df
  EVENT           ID  GROUP YEAR            X.1           X.2           X.3            Y.1           Y.2           Y.3
1     1 1 John Smith GROUP1 2015  19 John Smith 11 Adam Smith   9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2     2 2 John Smith GROUP1 2015 1 George Smith  9 Luke Smith 19 Adam Smith    7 Sam Smith 17 Mike Smith 11 John Smith
3     3 3 John Smith GROUP1 2015 5 George Smith 18 John Smith  12 Sam Smith   6 Luke Smith  2 Mike Smith  4 Adam Smith
This adply function goes row by row and replaces any matching X:Y column values with the one from the ID column:
library(plyr)
adply(df, .margins = 1, function(x) {
  idcol <- as.character(x$ID)
  searchname <- trimws(gsub('[[:digit:]]+', "", idcol))
  sapply(x[5:10], function(y) {
    ifelse(grepl(searchname, y), idcol, as.character(y))
  })
})
Output:
  EVENT           ID  GROUP YEAR            X.1           X.2           X.3            Y.1           Y.2           Y.3
1     1 1 John Smith GROUP1 2015   1 John Smith 11 Adam Smith   9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2     2 2 John Smith GROUP1 2015 1 George Smith  9 Luke Smith 19 Adam Smith    7 Sam Smith 17 Mike Smith  2 John Smith
3     3 3 John Smith GROUP1 2015 5 George Smith  3 John Smith  12 Sam Smith   6 Luke Smith  2 Mike Smith  4 Adam Smith
Data:
names <- c("EVENT","ID",'GROUP','YEAR', paste(rep(c("X.", "Y."), each = 3), 1:3, sep = ""))
first <- c("John", "Sam", "Adam", "Mike", "Luke", "George")
set.seed(2017)
randvals <- t(sapply(1:3, function(x) paste(sample(1:20, size = 6),
paste(sample(first, replace = FALSE, size = 6), "Smith"))))
df <- cbind(data.frame(1:3, paste(1:3, "John Smith"), "GROUP1", 2015), randvals)
names(df) <- names
I think that the most efficient way to accomplish this is by building a loop. The reason is that you will have to repeat the replacement for every name in your ID list, and with a loop you can automate this.
I will make some assumptions first:
- The ID list can be read as a character vector.
- You don't have any typos in the ID list or in your data.frame, including different lowercase and uppercase letters in the names.
- Your ID list does not contain the numbers. In case it does contain numbers, you have to use gsub to erase them first.
- The example works with a data.frame (DF) with the same structure that you put in your question.
ID <- c("John Smith", "Adam Smith", "George Smith")
for (i in seq_along(ID)) {
  # replace any "<number> <name>" cell that contains ID[i] with ID[i] itself
  DF[, 5:10] <- lapply(DF[, 5:10], function(col)
    ifelse(grepl(ID[i], col), ID[i], as.character(col)))
}
With each round this loop will:
1. Identify the positions in the columns X.1:Y.3 (columns 5 to 10 in your question) where the name "i" appears.
2. Change all those values to the one in the "i" position of the ID vector.
So, the first iteration will: 1) search for every position where the name "John Smith" appears in the data frame, and 2) replace all those "# John Smith" with "John Smith".
Note: If you simply want to delete the numbers, you can use gsub to replace them. Take into account that you probably want to erase the space between the number and the name too. gsub works on vectors, so apply it column by column:
DF[, 5:10] <- lapply(DF[, 5:10], function(col) gsub("^[0-9]+ ", "", col))

Match text across multiple rows in R

My data.frame(Networks) contains the following:
Location <- c("Farm", "Supermarket", "Farm", "Conference",
"Supermarket", "Supermarket")
Instructor <- c("Bob", "Bob", "Louise", "Sally", "Lee", "Jeff")
Operator <- c("Lee", "Lee", "Julie", "Louise", "Bob", "Louise")
Networks <- data.frame(Location, Instructor, Operator, stringsAsFactors=FALSE)
MY QUESTION
I wish to include a new column, Transactions$Count, in a new data.frame Transactions that sums the exchanges between each Instructor and Operator for every Location.
EXPECTED OUTPUT
Location <- c("Farm", "Supermarket", "Farm", "Conference", "Supermarket")
Person1 <- c("Bob", "Louise", "Sally", "Jeff")
Person2 < - c("Lee", "Julie", "Louise", "Louise")
Count < - c(1, 2, 1, 1, 1)
Transactions <- data.frame(Location, Person1, Person2, Count,
stringsAsFactors=FALSE)
For example, there would be a total of 2 exchanges between Bob and Lee at the Supermarket. It does not matter whether a person is the instructor or the operator; I am interested in their exchange. In the expected output, the two exchanges between Bob and Lee at the Supermarket are noted. There is one exchange for every other combination at the other locations.
WHAT I HAVE TRIED
I thought grepl may be of use, but I wish to iterate across 1300 rows of this data, so it may be computationally expensive.
Thank you.
You can consider using "data.table" with pmin and pmax in the "by" argument.
Example:
Networks <- data.frame(Location, Instructor, Operator, stringsAsFactors = FALSE)
library(data.table)
as.data.table(Networks)[
  , TransCount := .N,
  by = list(Location,
            pmin(Instructor, Operator),
            pmax(Instructor, Operator))][]
#       Location Instructor Operator TransCount
# 1:        Farm        Bob      Lee          1
# 2: Supermarket        Bob      Lee          2
# 3:        Farm     Louise    Julie          1
# 4:  Conference      Sally   Louise          1
# 5: Supermarket        Lee      Bob          2
# 6: Supermarket       Jeff   Louise          1
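The trick is that pmin() and pmax() work element-wise, so each Instructor/Operator pair is put into a canonical order before grouping, no matter who played which role. A quick illustration:
# the same pair in either order yields the same (pmin, pmax) key
pmin(c("Bob", "Lee"), c("Lee", "Bob"))  # "Bob" "Bob"
pmax(c("Bob", "Lee"), c("Lee", "Bob"))  # "Lee" "Lee"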
Based on your update, it sounds like this might be more appropriate for you:
as.data.table(Networks)[
  , c("Person1", "Person2") := list(
      pmin(Instructor, Operator),
      pmax(Instructor, Operator)),
  by = 1:nrow(Networks)
][
  , list(TransCount = .N),
  by = .(Location, Person1, Person2)
]
#       Location Person1 Person2 TransCount
# 1:        Farm     Bob     Lee          1
# 2: Supermarket     Bob     Lee          2
# 3:        Farm   Julie  Louise          1
# 4:  Conference  Louise   Sally          1
# 5: Supermarket    Jeff  Louise          1
You may try
library(dplyr)
Networks %>%
  group_by(Location, Person1 = pmin(Instructor, Operator),
           Person2 = pmax(Instructor, Operator)) %>%
  summarise(Count = n())
#     Location Person1 Person2 Count
# 1  Conference  Louise   Sally     1
# 2        Farm     Bob     Lee     1
# 3        Farm   Julie  Louise     1
# 4 Supermarket     Bob     Lee     2
# 5 Supermarket    Jeff  Louise     1
Or using base R
d1 <- cbind(Location = Networks[, 1],
            data.frame(setNames(Map(do.call, c('pmin', 'pmax'),
                                    list(Networks[-1])), c('Person1', 'Person2'))))
aggregate(cbind(Count = 1:nrow(d1)) ~ ., d1, FUN = length)
#      Location Person1 Person2 Count
# 1        Farm     Bob     Lee     1
# 2 Supermarket     Bob     Lee     2
# 3 Supermarket    Jeff  Louise     1
# 4        Farm   Julie  Louise     1
# 5  Conference  Louise   Sally     1
Data
Networks <- data.frame(Location, Instructor, Operator,
                       stringsAsFactors = FALSE)

get max value per id, then only value per id R

I would like to make my df smaller by taking just one observation per person per date, based on a person's biggest quantity per date.
Here's my df:
   names      dates quantity
1    tom 2010-02-01       28
3    tom 2010-03-01        7
2   mary 2010-05-01       30
6    tom 2010-06-01       21
4   john 2010-07-01       45
5   mary 2010-07-01       30
8   mary 2010-07-01       28
11   tom 2010-08-01       28
7   john 2010-09-01       28
10  john 2010-09-01       30
9   john 2010-07-01       45
12  mary 2010-11-01       28
13  john 2010-12-01        7
14  john 2010-12-01       14
I do this first by finding the max quantity per person per date. This works OK, but as you can see, if a person has tied quantities on a date, all of those observations are retained for that date.
merge(df, aggregate(quantity ~ names + dates, df, max))
   names      dates quantity
1   john 2010-07-01       45
2   john 2010-07-01       45
3   john 2010-09-01       30
4   john 2010-12-01       14
5   mary 2010-05-01       30
6   mary 2010-07-01       30
7   mary 2010-11-01       28
8    tom 2010-02-01       28
9    tom 2010-03-01        7
10   tom 2010-06-01       21
11   tom 2010-08-01       28
So, my next step would be to just take the first obs per date (given that I have already selected the biggest quantity). I can't get the code right for this. This is what I have tried:
merge(l, aggregate(names ~ dates, l, FUN = function(z) z[1])) -> m  ## doesn't get rid of one obs for john
and a data.table option
l[, .SD[1], by = c(names, dates)]  ## doesn't work at all
I like the aggregate and data.table options as they are fast, and my df has ~100k rows.
Thank you in advance for this!
SOLUTION
I posted too fast - apologies! An easy solution to this problem is just to find duplicates and then remove them, e.g.:
merge(df, aggregate(quantity ~ names + dates, df, max)) -> toy
toy$dup <- duplicated(toy)
toy <- toy[toy$dup != TRUE, ]
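A slightly shorter equivalent, skipping the helper column:
toy <- merge(df, aggregate(quantity ~ names + dates, df, max))
toy <- toy[!duplicated(toy), ]  # keep the first row of each duplicate set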
Here are the system times:
system.time(dt2[, max(new_quan), by = list(hai_dispense_number, date_of_claim)] -> method1)
   user  system elapsed
  20.04    0.04   20.07
system.time(aggregate(new_quan ~ hai_dispense_number + date_of_claim, dt2, max) -> rpp)
   user  system elapsed
 19.129   0.028  19.148
Here's a data.table solution:
dt[, max(quantity), by = list(names, dates)]
Bench:
N = 1e6
dt = data.table(names = sample(letters, N, T), dates = sample(LETTERS, N, T),
                quantity = rnorm(N))
df = data.frame(dt)
op = function(df) aggregate(quantity ~ names + dates, df, max)
eddi = function(dt) dt[, max(quantity), by = list(names, dates)]
microbenchmark(op(df), eddi(dt), times = 10)
#Unit: milliseconds
#     expr      min        lq   median        uq      max neval
#   op(df) 2535.241 3025.1485 3195.078 3398.4404 3533.209    10
# eddi(dt)  148.088  162.8073  198.222  220.1217  286.058    10
I am not sure this gives you the output you want, but it definitely takes care of the "duplicate rows":
# Replicating your dataframe
df <- data.frame(
  names = c("tom", "tom", "mary", "tom", "john", "mary", "mary",
            "tom", "john", "john", "john", "mary", "john", "john"),
  dates = c("2010-02-01", "2010-03-01", "2010-05-01", "2010-06-01",
            "2010-07-01", "2010-07-01", "2010-07-01", "2010-08-01",
            "2010-09-01", "2010-09-01", "2010-07-01", "2010-11-01",
            "2010-12-01", "2010-12-01"),
  quantity = c(28, 7, 30, 21, 45, 30, 28, 28, 28, 30, 45, 28, 7, 14)
)
temp <- merge(df, aggregate(quantity ~ names + dates, df, max))
df.unique <- unique(temp)
If you are using a data.frame:
library(plyr)
ddply(mydata, .(names, dates), summarize, maxquantity = max(quantity))
# base R: split by names/dates, keep the row with the max quantity in each piece
do.call(rbind,
        lapply(split(df, df[, c("names", "dates")]), function(d) {
          d[which.max(d$quantity), ]  # which.max() returns the first maximum, so ties are dropped
        }))
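For completeness: on current dplyr (1.0.0 or later) the whole task is a one-liner with slice_max(); with_ties = FALSE keeps a single row per group even when quantities tie:
library(dplyr)
df %>%
  group_by(names, dates) %>%
  slice_max(quantity, n = 1, with_ties = FALSE) %>%
  ungroup()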
