This question already has answers here:
How do I select a subset of rows after group by a specific column in R Data table [duplicate]
(2 answers)
Closed 7 years ago.
How to drop groups when there are not enough observations?
In the following reproducible example, each person (identified by name) has 10 observations:
install.packages('randomNames') # install package if required
install.packages('data.table') # install package if required
lapply(c('data.table', 'randomNames'), require, character.only = TRUE) # load packages
set.seed(1)
testDT <- data.table( date = rep(seq(as.Date("2010/1/1"), as.Date("2019/1/1"), "years"), 10),
                      name = rep(randomNames(10, which.names='first'), times=1, each=10),
                      Y = runif(100, 5, 15),
                      X = rnorm(100, 2, 9))   # closing parenthesis was missing
testDT <- testDT[X > 0]
Now I want to keep only the persons with at least 6 observations, so Gracelline, Anna, Aesha and Michael must be removed, because they have
only 3, 2, 4 and 5 observations respectively.
testDT[, length(X), by=name]
name V1
1: Blake 6
2: Alexander 6
3: Leigha 8
4: Gracelline 3
5: Epifanio 7
6: Keasha 6
7: Robyn 6
8: Anna 2
9: Aesha 4
10: Michael 5
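As a side note, the same counts can be obtained from data.table's built-in row counter .N, without referring to a specific column:
testDT[, .N, by = name]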
How do I do this in an automatic way (real dataset is much larger)?
Edit:
Yes, it's a duplicate. :(
Of the proposed methods, the last one benchmarked below (if (.N >= 6) .SD) was the fastest:
> system.time(testDT[, .SD[.N>=6], by = name])
user system elapsed
0.293 0.227 0.517
> system.time(testDT[testDT[, .I[.N>=6], by = name]$V1])
user system elapsed
0.163 0.243 0.415
> system.time(testDT[,if(.N>=6) .SD , by = name])
user system elapsed
0.073 0.323 0.399
We group by 'name', get the number of rows per group (.N), and if it is at least 6, we return the Subset of the Data.table (.SD) for that group.
testDT[,if(.N>=6) .SD , by = name]
# name date Y X
# 1: Blake 2010-01-01 9.820801 3.69913070
# 2: Blake 2012-01-01 9.935413 15.18999375
# 3: Blake 2013-01-01 6.862176 3.37928004
# 4: Blake 2014-01-01 13.273733 21.55350503
# 5: Blake 2015-01-01 11.684667 6.27958576
# 6: Blake 2017-01-01 6.079436 7.49653718
# 7: Alexander 2010-01-01 13.209463 4.62301612
# 8: Alexander 2012-01-01 12.829328 2.00994816
# 9: Alexander 2013-01-01 10.530363 2.66907192
#10: Alexander 2016-01-01 5.233312 0.78339246
#11: Alexander 2017-01-01 9.772301 12.60278297
#12: Alexander 2019-01-01 11.927316 7.34551569
#13: Leigha 2010-01-01 9.776196 4.99655334
#14: Leigha 2011-01-01 13.612095 11.56789854
#15: Leigha 2013-01-01 7.447973 5.33016929
#16: Leigha 2014-01-01 5.706790 4.40388912
#17: Leigha 2016-01-01 8.162717 12.87081025
#18: Leigha 2017-01-01 10.186343 12.44362354
#19: Leigha 2018-01-01 11.620051 8.30192285
#20: Leigha 2019-01-01 9.068302 16.28150109
#21: Epifanio 2010-01-01 8.390729 17.90558542
#22: Epifanio 2011-01-01 13.394404 8.45036728
#23: Epifanio 2012-01-01 8.466835 10.19156807
#24: Epifanio 2013-01-01 8.337749 5.45766822
#25: Epifanio 2014-01-01 9.763512 17.13958472
#26: Epifanio 2017-01-01 8.899895 14.89054015
#27: Epifanio 2019-01-01 14.606180 0.13357331
#28: Keasha 2013-01-01 8.253522 6.44769498
#29: Keasha 2014-01-01 12.570871 0.40402566
#30: Keasha 2016-01-01 12.111212 14.08734943
#31: Keasha 2017-01-01 6.216919 0.06878532
#32: Keasha 2018-01-01 7.454885 0.38399123
#33: Keasha 2019-01-01 6.433044 1.09828333
#34: Robyn 2010-01-01 7.396294 8.41399676
#35: Robyn 2011-01-01 5.589344 1.33792036
#36: Robyn 2012-01-01 11.422883 1.66129246
#37: Robyn 2015-01-01 12.973088 2.54144396
#38: Robyn 2017-01-01 9.100841 6.78346573
#39: Robyn 2019-01-01 11.049333 4.75902075
Or, instead of if, we can use the .N >= 6 condition directly as a subscript of .SD:
testDT[, .SD[.N>=6], by = name]
It could be a little slow, so another option would be .I to get the row indices and then subset:
testDT[testDT[, .I[.N>=6], by = name]$V1]
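Running the inner expression on its own shows why this works: it returns one row per kept observation, with the original row number in V1, and $V1 then feeds those row numbers back in as a plain, fast subset:
testDT[, .I[.N >= 6], by = name]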
Related
I have a large dataframe with patient data (14,000 cases; df.A). We have a separate data file (df.B) including long-term outcomes, and we want to match the cases based on admission date as a first step.
The first dataframe (df.A) includes estimates of admission date for which we have included a margin of error (2 days before and 7 days after). The second data frame is the large dataframe of hospital data (400k entries; df.B). The data frames could look like this.
IDA <- c(1:40)
AdmissionDtA <- seq(as.Date("2014-01-01"), as.Date("2014-02-09"), by="days")
FirstDtA <- AdmissionDtA - 2
LastDtA <- AdmissionDtA + 7
df.A <- data.frame(IDA, AdmissionDtA, FirstDtA, LastDtA)
IDB <- c(41:80)
AdmissionDtB <- seq(as.Date("2014-01-16"), as.Date("2014-02-24"), by="days")
df.B <- data.frame(IDB, AdmissionDtB)
We would like to extract potential matches when the AdmissionDtB lies between FirstDtA and LastDtA. The goal is to add a column to df.A, including a list of potential matching IDB's per case (IDA).
Result would look something like this:
IDA  AdmissionDtA  FirstDtA    LastDtA     MatchingIDB
8    2014-01-08    2014-01-06  2014-01-15  no match
9    2014-01-09    2014-01-07  2014-01-16  41
10   2014-01-10    2014-01-08  2014-01-17  41;42
11   2014-01-11    2014-01-09  2014-01-18  41;42;43
We have tried this in Excel with INDEX and AGGREGATE, which worked, but the data files were too big for Excel to handle. There are around 2k matching IDBs per case (IDA).
We are relatively new to R and could not make it work with merge, join, etc.
Any help would be much appreciated. Thanks in advance!
You can use between inside a map call:
library(purrr)
library(dplyr)
df.A %>%
mutate(MatchingIDB = map2(df.A$FirstDtA, df.A$LastDtA, ~ c(df.B$IDB[between(df.B$AdmissionDtB, .x, .y)])))
output
> df.A[8:11, ]
IDA AdmissionDtA FirstDtA LastDtA MatchingIDB
8 8 2014-01-08 2014-01-06 2014-01-15
9 9 2014-01-09 2014-01-07 2014-01-16 41
10 10 2014-01-10 2014-01-08 2014-01-17 41, 42
11 11 2014-01-11 2014-01-09 2014-01-18 41, 42, 43
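If the list column should be collapsed into the semicolon-separated strings from the desired result (with "no match" for empty matches), a follow-up map_chr step is one way to do it; this is just a sketch building on the pipeline above:
df.A %>%
  mutate(MatchingIDB = map2(FirstDtA, LastDtA,
                            ~ df.B$IDB[between(df.B$AdmissionDtB, .x, .y)])) %>%
  mutate(MatchingIDB = map_chr(MatchingIDB,
                               ~ if (length(.x) == 0) "no match"
                                 else paste(.x, collapse = ";")))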
An alternative approach uses data.table and a non-equi join, which might improve performance with very large datasets.
library(data.table)
library(magrittr)
setDT(df.B)[setDT(df.A),
.(IDA, AdmissionDtA, FirstDtA, LastDtA, IDB),
on = c("AdmissionDtB >= FirstDtA", "AdmissionDtB <= LastDtA")] %>%
.[j = .(matchingIDB = paste0(IDB, collapse = ",")),
by = names(df.A)]
#> IDA AdmissionDtA FirstDtA LastDtA matchingIDB
#> 1: 1 2014-01-01 2013-12-30 2014-01-08 NA
#> 2: 2 2014-01-02 2013-12-31 2014-01-09 NA
#> 3: 3 2014-01-03 2014-01-01 2014-01-10 NA
#> 4: 4 2014-01-04 2014-01-02 2014-01-11 NA
#> 5: 5 2014-01-05 2014-01-03 2014-01-12 NA
#> 6: 6 2014-01-06 2014-01-04 2014-01-13 NA
#> 7: 7 2014-01-07 2014-01-05 2014-01-14 NA
#> 8: 8 2014-01-08 2014-01-06 2014-01-15 NA
#> 9: 9 2014-01-09 2014-01-07 2014-01-16 41
#> 10: 10 2014-01-10 2014-01-08 2014-01-17 41,42
#> 11: 11 2014-01-11 2014-01-09 2014-01-18 41,42,43
#> 12: 12 2014-01-12 2014-01-10 2014-01-19 41,42,43,44
#> 13: 13 2014-01-13 2014-01-11 2014-01-20 41,42,43,44,45
#> 14: 14 2014-01-14 2014-01-12 2014-01-21 41,42,43,44,45,46
#> 15: 15 2014-01-15 2014-01-13 2014-01-22 41,42,43,44,45,46,47
...
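One caveat: paste0 turns a missing IDB into the literal string "NA" rather than a true missing value, so the NA entries printed above are actually "NA" strings. If real NAs are wanted, a small cleanup afterwards should do it (res here is just an illustrative name for the stored join result):
res[matchingIDB == "NA", matchingIDB := NA_character_]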
The following dataset is a dummy example of the problem I am having. There are 3 columns in my data, viz. date, PlayerName and score; thus each player's date-wise score is recorded.
The task is to find the player with the maximum total score (over all observations) from the subset of players that fulfills the two following criteria:
There should be a steady increase in a player's yearly performance (meaning each year's total score should be larger than the previous year's total score for that player)
The growth rate of performance should also increase (meaning the rate of increase of yearly total scores should itself increase over time)
The dataframe looks like:
date <- as.Date(x = c('2010/01/01','2010/02/02',
'2011/01/01','2011/02/02',
'2012/01/01','2012/02/02',
'2013/01/01','2013/02/02',
'2014/01/01','2014/02/02'),format = "%Y/%m/%d") #toy date column
PlayerName <- rep(LETTERS[1:5],each=10) # Name Players as A:E
score <- c(100,150,270,300,400,
100,120,200,400,900,
100,80,130,70,300,
100,120,230,650,870,
100,90,110,450,342)
df <- data.table(date=date,Name=PlayerName,score=score)
> df
date Name score
1: 2010-01-01 A 100
2: 2010-02-02 A 150
3: 2011-01-01 A 270
4: 2011-02-02 A 300
5: 2012-01-01 A 400
6: 2012-02-02 A 100
7: 2013-01-01 A 120
8: 2013-02-02 A 200
9: 2014-01-01 A 400
10: 2014-02-02 A 900
11: 2010-01-01 B 100
12: 2010-02-02 B 80
13: 2011-01-01 B 130
14: 2011-02-02 B 70
15: 2012-01-01 B 300
16: 2012-02-02 B 100
17: 2013-01-01 B 120
18: 2013-02-02 B 230
19: 2014-01-01 B 650
20: 2014-02-02 B 870
21: 2010-01-01 C 100
22: 2010-02-02 C 90
23: 2011-01-01 C 110
24: 2011-02-02 C 450
25: 2012-01-01 C 342
26: 2012-02-02 C 100
27: 2013-01-01 C 150
28: 2013-02-02 C 270
29: 2014-01-01 C 300
30: 2014-02-02 C 400
31: 2010-01-01 D 100
32: 2010-02-02 D 120
33: 2011-01-01 D 200
34: 2011-02-02 D 400
35: 2012-01-01 D 900
36: 2012-02-02 D 100
37: 2013-01-01 D 80
38: 2013-02-02 D 130
39: 2014-01-01 D 70
40: 2014-02-02 D 300
41: 2010-01-01 E 100
42: 2010-02-02 E 120
43: 2011-01-01 E 230
44: 2011-02-02 E 650
45: 2012-01-01 E 870
46: 2012-02-02 E 100
47: 2013-01-01 E 90
48: 2013-02-02 E 110
49: 2014-01-01 E 450
50: 2014-02-02 E 342
What I have managed to do so far is below :
df[,year := lubridate::year(date)] # extract the year
df1 <- df[,.(total_score =sum(score)),.(Name,year)] # Yearly Aggregated Scores
df1[,total_score_lag := shift(x=total_score,type = 'lag'),.(Name)] ## creates a players lagged column of score
df1[,growth_rate := round(total_score/total_score_lag,2)] ## creates ratio of current and past years scores column
df1[,growth_rate_lag := shift(x=growth_rate,type = 'lag'),.(Name)] #### Creates a lag column of growth column
> df1
Name year total_score total_score_lag growth_rate growth_rate_lag
1: A 2010 100 NA NA NA
2: A 2011 150 100 1.50 NA
3: A 2012 270 150 1.80 1.50
4: A 2013 300 270 1.11 1.80
5: A 2014 400 300 1.33 1.11
6: B 2010 100 NA NA NA
7: B 2011 120 100 1.20 NA
8: B 2012 200 120 1.67 1.20
9: B 2013 400 200 2.00 1.67
10: B 2014 900 400 2.25 2.00
11: C 2010 100 NA NA NA
12: C 2011 80 100 0.80 NA
13: C 2012 130 80 1.62 0.80
14: C 2013 70 130 0.54 1.62
15: C 2014 300 70 4.29 0.54
16: D 2010 100 NA NA NA
17: D 2011 120 100 1.20 NA
18: D 2012 230 120 1.92 1.20
19: D 2013 650 230 2.83 1.92
20: D 2014 870 650 1.34 2.83
21: E 2010 100 NA NA NA
22: E 2011 90 100 0.90 NA
23: E 2012 110 90 1.22 0.90
24: E 2013 450 110 4.09 1.22
25: E 2014 342 450 0.76 4.09
Now I understand that I need to validate two conditions:
filter the growth_rate column player-wise for values greater than 1 throughout;
filter the growth_rate_lag column for the players whose consecutive row values are greater than the previous row's.
But I can't code the said logic. There could also be an alternative way of looking at it. I would appreciate it if anyone helps. Thanks in advance.
Edit 1 :
The example I used was not accurate, so here is an updated one:
date <- as.Date(x = c('2010/01/01','2010/02/02',
'2011/01/01','2011/02/02',
'2012/01/01','2012/02/02',
'2013/01/01','2013/02/02',
'2014/01/01','2014/02/02'),format = "%Y/%m/%d")
PlayerName <- rep(LETTERS[1:5],each=10) # Name Players as A:E
score <- c(40,60,100,50,70,200,120,180,380,20,
40,60,20,100,150,50,300,100,800,100,
10,90,30,50,100,30,10,60,100,200,
50,50,100,20,200,30,400,60,570,400,
80,20,70,20,100,10,400,50,142,200)
df <- data.table(date=date,Name=PlayerName,score=score)
df[,year := lubridate::year(date)] # extract the year
df1 <- df[,.(total_score =sum(score)),.(Name,year)] # Yearly Aggregated Scores
df1[,total_score_lag := shift(x=total_score,type = 'lag'),.(Name)] ## creates a players lagged column of score
df1[,growth_rate := round(total_score/total_score_lag,2)] ## creates ratio of current and past years scores column
df1[,growth_rate_lag := shift(x=growth_rate,type = 'lag'),.(Name)] #### Creates a lag column of growth column
Name year total_score total_score_lag growth_rate growth_rate_lag
1: A 2010 100 NA NA NA
2: A 2011 150 100 1.50 NA
3: A 2012 270 150 1.80 1.50
4: A 2013 300 270 1.11 1.80
5: A 2014 400 300 1.33 1.11
6: B 2010 100 NA NA NA
7: B 2011 120 100 1.20 NA
8: B 2012 200 120 1.67 1.20
9: B 2013 400 200 2.00 1.67
10: B 2014 900 400 2.25 2.00
11: C 2010 100 NA NA NA
12: C 2011 80 100 0.80 NA
13: C 2012 130 80 1.62 0.80
14: C 2013 70 130 0.54 1.62
15: C 2014 300 70 4.29 0.54
16: D 2010 100 NA NA NA
17: D 2011 120 100 1.20 NA
18: D 2012 230 120 1.92 1.20
19: D 2013 460 230 2.00 1.92
20: D 2014 970 460 2.11 2.00
21: E 2010 100 NA NA NA
22: E 2011 90 100 0.90 NA
23: E 2012 110 90 1.22 0.90
24: E 2013 450 110 4.09 1.22
25: E 2014 342 450 0.76 4.09
Now here clearly players A, B and D meet condition 1, but only B and D meet condition 2. And as D has the highest total_score, the answer is D.
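For reference, both conditions can also be checked directly on df1 with all() and diff(). This is only a sketch: it assumes df1 is ordered by year within Name (as produced above), so the first growth_rate per player is the NA dropped via [-1]:
qualified <- df1[, .(meets_both = all(growth_rate[-1] > 1) &&
                                  all(diff(growth_rate[-1]) > 0),
                     total_score = sum(total_score)),
                 by = Name]
qualified[meets_both == TRUE][which.max(total_score)]
# with the table above this picks D (total_score 1880)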
With data.table you could use cumsum to select each player's rows up to the last year it achieved an increasing score growth rate:
df1[,selected :=cumsum(fifelse(growth_rate>growth_rate_lag|is.na(growth_rate_lag),1L,NA_integer_)),by=Name]
df1[selected>0]
Name year total_score total_score_lag growth_rate growth_rate_lag selected
1: A 2010 250 NA NA NA 1
2: A 2011 570 250 2.28 NA 2
3: B 2010 180 NA NA NA 1
4: B 2011 200 180 1.11 NA 2
5: B 2012 400 200 2.00 1.11 3
6: C 2010 190 NA NA NA 1
7: C 2011 560 190 2.95 NA 2
8: D 2010 220 NA NA NA 1
9: D 2011 600 220 2.73 NA 2
10: E 2010 220 NA NA NA 1
11: E 2011 880 220 4.00 NA 2
As noted in the other answers, no player achieved increasing rate in this dataset.
Do you need something like this?
df %>% group_by(Name) %>%
mutate(grth = (score - lag(score))/lag(score),
grth_grth = (grth - lag(grth))/lag(grth)) %>%
filter(min(grth, na.rm = T) > 0, min(grth_grth, na.rm = T) >0) %>%
summarise(scrore = sum(score))
# A tibble: 0 x 2
# ... with 2 variables: Name <chr>, scrore <dbl>
This means none of the players fulfills the criteria.
I believe you provided not-so-good example data. That said, a possible solution with dplyr (I am not familiar with data.table):
data %>%
  group_by(PlayerName) %>%
  mutate(steady_growth = identical(score, sort(score)),
         positive_growth_rate = ifelse(is.na(lag(score)), TRUE,
                                       score/lag(score) >= 1)) %>%   # misplaced parenthesis fixed
  ungroup()
This will create two additional logical columns.
Then you can filter the desired subset with:
data%>%filter(steady_growth & positive_growth_rate)
which gives a data.frame with zero rows in your example
All in one call:
data %>%
  group_by(PlayerName) %>%
  mutate(steady_growth = identical(score, sort(score)),
         positive_growth_rate = ifelse(is.na(lag(score)), TRUE,
                                       score/lag(score) >= 1)) %>%
  filter(steady_growth & positive_growth_rate)
Please be aware that the steady_growth column is all TRUE or all FALSE for a given player.
In an earlier question, I learned that graphs are useful to collapse these data
require(data.table)
set.seed(333)
t <- data.table(old=1002:2001, dif=sample(1:10,1000, replace=TRUE))
t$new <- t$old + t$dif; t$foo <- rnorm(1000); t$dif <- NULL
> head(t)
old new foo
1: 1002 1007 -0.7889534
2: 1003 1004 0.3901869
3: 1004 1014 0.7907947
4: 1005 1011 2.0964612
5: 1006 1007 1.1834171
6: 1007 1015 1.1397910
to obtain only those rows such that old[i] = new[i-1], i.e. chains in which each row starts where the previous one ended. The result could then be joined into a table with users who each have their own starting points
i <- data.table(id=1:3, start=sample(1000:1990,3))
> i
id start
1: 1 1002
2: 2 1744
3: 3 1656
Specifically, when only the first n=3 steps are calculated, the solution was
> library(igraph)
> i[, t[old %in% subcomponent(g, start, "out")[1:n]], by=.(id)]
id old new foo
1: 1 1002 1007 -0.7889534
2: 1 1007 1015 1.1397910
3: 1 1015 1022 -1.2193666
4: 2 1744 1750 -0.1368320
5: 2 1750 1758 0.3331686
6: 2 1758 1763 1.3040357
7: 3 1656 1659 -0.1556208
8: 3 1659 1663 0.1663042
9: 3 1663 1669 0.3781835
When implementing this with the same setup, but with new, old, and start of class POSIXct,
set.seed(333)
u <- data.table(old=seq(from=as.POSIXct("2013-01-01"),
to=as.POSIXct("2013-01-02"), by="15 mins"),
dif=as.difftime(sample(seq(15,120,15),97,replace=TRUE),units="mins"))
u$new <- u$old + u$dif; u$foo <- rnorm(97); u$dif <- NULL
j <- data.table(id=1:3, start=sample(seq(from=as.POSIXct("2013-01-01"),
to=as.POSIXct("2013-01-01 22:00:00"), by="15 mins"),3))
> head(u)
old new foo
1: 2013-01-01 00:00:00 2013-01-01 01:00:00 -1.5434407
2: 2013-01-01 00:15:00 2013-01-01 00:30:00 -0.2753971
3: 2013-01-01 00:30:00 2013-01-01 02:30:00 -1.5986916
4: 2013-01-01 00:45:00 2013-01-01 02:00:00 -0.6288528
5: 2013-01-01 01:00:00 2013-01-01 01:15:00 -0.8967041
6: 2013-01-01 01:15:00 2013-01-01 02:45:00 -1.2145590
> j
id start
1: 1 2013-01-01 22:00:00
2: 2 2013-01-01 21:00:00
3: 3 2013-01-01 13:30:00
the command
> j[, u[old %in% subcomponent(h, V(h)$name %in% as.character(start), "out")[1:n]], by=.(id)]
Empty data.table (0 rows and 4 cols): id,old,new,foo
returns an empty data.table, which appears to be due to the inner part u[...]. I do not quite see where the problem is in this case and wonder whether anyone spots the mistake.
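For what it's worth, a likely culprit is the comparison old %in% subcomponent(...): subcomponent() returns a vertex sequence, so matching a POSIXct column against it compares against vertex ids rather than against the character timestamps used as vertex names. A sketch of a workaround, assuming h is built with graph_from_data_frame() on character versions of old and new (as g presumably was in the integer case):
library(igraph)
h <- graph_from_data_frame(u[, .(as.character(old), as.character(new))])
n <- 3
j[, {
  v <- subcomponent(h, as.character(start), "out")[1:n]  # pick the start vertex by name
  u[as.character(old) %in% names(v)]                     # compare names, not vertex ids
}, by = .(id)]
Matching on the character names sidesteps both the id/name confusion and any timezone pitfalls from converting names back to POSIXct.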
I can't get my head around something that looks obvious...
library(data.table)
DT1<-data.table(MyDate=as.Date(rep("2019-02-01")),MyName=c("John","Peter","Paul"),Rate=c(210,180,190))
DT2<-data.table(MyDate=seq(as.Date("2019-01-27"),as.Date("2019-02-03"),by="days"))
setkey(DT1,MyDate)
setkey(DT2,MyDate)
I would like to see the rates for John, Peter and Paul rolled forward to the end of the date range. When I do
DT1[DT2,on=.(MyDate),roll=TRUE]
I get :
MyDate MyName Rate
1: 2019-01-27 <NA> NA
2: 2019-01-28 <NA> NA
3: 2019-01-29 <NA> NA
4: 2019-01-30 <NA> NA
5: 2019-01-31 <NA> NA
6: 2019-02-01 John 210
7: 2019-02-01 Paul 190
8: 2019-02-01 Peter 180
9: 2019-02-02 Peter 180
10: 2019-02-03 Peter 180
While I want this :
MyDate MyName Rate
1: 2019-01-27 <NA> NA
2: 2019-01-28 <NA> NA
3: 2019-01-29 <NA> NA
4: 2019-01-30 <NA> NA
5: 2019-01-31 <NA> NA
6: 2019-02-01 John 210
7: 2019-02-01 Paul 190
8: 2019-02-01 Peter 180
9: 2019-02-02 John 210
10: 2019-02-02 Paul 190
11: 2019-02-02 Peter 180
12: 2019-02-03 John 210
13: 2019-02-03 Paul 190
14: 2019-02-03 Peter 180
It's obvious I'm overlooking something.
A convoluted way (found by trial and error):
DT1[DT2, on=.(MyDate <= MyDate), allow.cartesian = TRUE]
MyDate MyName Rate
1: 2019-01-27 <NA> NA
2: 2019-01-28 <NA> NA
3: 2019-01-29 <NA> NA
4: 2019-01-30 <NA> NA
5: 2019-01-31 <NA> NA
6: 2019-02-01 John 210
7: 2019-02-01 Peter 180
8: 2019-02-01 Paul 190
9: 2019-02-02 John 210
10: 2019-02-02 Peter 180
11: 2019-02-02 Paul 190
12: 2019-02-03 John 210
13: 2019-02-03 Peter 180
14: 2019-02-03 Paul 190
The difficult part was the cross-join-esque rows you need after a matching date but not before that matching date. I think the steps below get at this issue.
Perform a rolling join for each name, then blank out the MyName column wherever there is no Rate, and keep the unique rows.
library(magrittr)
DT1[, .SD[DT2, roll = TRUE], by = MyName][
, MyName := ifelse(is.na(Rate), NA, MyName)
][order(MyDate, MyName), .(MyDate, MyName, Rate)] %>%
unique()
MyDate MyName Rate
1: 2019-01-27 <NA> NA
2: 2019-01-28 <NA> NA
3: 2019-01-29 <NA> NA
4: 2019-01-30 <NA> NA
5: 2019-01-31 <NA> NA
6: 2019-02-01 John 210
7: 2019-02-01 Paul 190
8: 2019-02-01 Peter 180
9: 2019-02-02 John 210
10: 2019-02-02 Paul 190
11: 2019-02-02 Peter 180
12: 2019-02-03 John 210
13: 2019-02-03 Paul 190
14: 2019-02-03 Peter 180
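A more compact variant of the same idea, sketched below, is to cross-join every name with every date via CJ() and then roll within each name. Note that, unlike the desired output, dates before 2019-02-01 then appear once per name (with NA Rate) rather than as single NA rows:
DT1[CJ(MyName = unique(DT1$MyName), MyDate = DT2$MyDate),
    on = .(MyName, MyDate), roll = TRUE]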
Trying to fix a de-duplication problem using data.table in R.
Column A is a list of names, some of which appear multiple times. Column B is a list of dates. There are a bunch of other columns that I want to copy over as well (things that happened to Name on Date.)
However, I only want the most recent activity for each person: a new data.table with one entry per name, corresponding to that person's most recent date.
The example data
name.last date
1: Adams 2014-10-20
2: Adams 2014-07-07
3: Barnett 2014-11-06
4: Barnett 2014-09-22
5: Bell 2014-10-22
6: Bell 2014-07-29
7: Burns 2014-09-08
8: Burns 2014-09-03
9: Camacho 2014-08-12
10: Camacho 2014-07-08
11: Casillas 2014-10-07
12: Casillas 2014-07-17
13: Chavez 2014-09-23
14: Chavez 2014-09-17
15: Chavira 2014-07-15
16: Chavira 2014-07-07
17: Claren 2014-10-30
18: Claren 2014-10-23
19: Colleary 2014-11-11
20: Colleary 2014-11-07
The answer would return only the first row of each name (since here the rows are sorted with the most recent date first within each name). However, if I set the key setkey(dt, name.last) in order to use unique() to remove duplicates, it reorders the table in key order (alphabetical on the names). unique(dt) then returns the first appearance of each name, which is not necessarily the most recent date.
If I set the key over both columns, setkeyv(dt, c("name.last", "date")), I cannot then remove duplicates using unique(), as all key combinations are unique.
The problem is similar to the one post here: Collapsing data frame by selecting one row per group . However I cannot assume the data to be selected is first or last unless you can suggest a way to manipulate my data to make it so after setting the key.
There are plenty of ways of doing this without ordering the data.table (though ordering is preferred, because duplicated is very efficient and you also avoid using by - I will get to that).
First of all, you have to make sure that date is of class Date in order to make things easier
dt[, date := as.Date(date)]
First simple method (though not the most efficient)
dt[, max(date), name.last]
# name.last V1
# 1: Adams 2014-10-20
# 2: Barnett 2014-11-06
# 3: Bell 2014-10-22
# 4: Burns 2014-09-08
# 5: Camacho 2014-08-12
# 6: Casillas 2014-10-07
# 7: Chavez 2014-09-23
# 8: Chavira 2014-07-15
# 9: Claren 2014-10-30
# 10: Colleary 2014-11-11
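Since the question mentions other columns that should be carried along, note that the max(date) call above returns only the name and the date; a by-group which.max keeps the whole row instead (a sketch of the same idea):
dt[, .SD[which.max(date)], by = name.last]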
Second (preferred) method is similar to yours but uses data.table's setorder (for data.table version >= 1.9.4) and should be the most efficient
setorder(dt, name.last, -date)[!duplicated(name.last)]
# name.last date
# 1: Adams 2014-10-20
# 2: Barnett 2014-11-06
# 3: Bell 2014-10-22
# 4: Burns 2014-09-08
# 5: Camacho 2014-08-12
# 6: Casillas 2014-10-07
# 7: Chavez 2014-09-23
# 8: Chavira 2014-07-15
# 9: Claren 2014-10-30
# 10: Colleary 2014-11-11
You could achieve the same using setkey (as you already did) and specifying fromLast = TRUE in duplicated, keeping the ! since after the ascending sort the most recent date is the last row per name
setkey(dt, name.last, date)[!duplicated(name.last, fromLast = TRUE)]
# name.last date
# 1: Adams 2014-10-20
# 2: Barnett 2014-11-06
# 3: Bell 2014-10-22
# 4: Burns 2014-09-08
# 5: Camacho 2014-08-12
# 6: Casillas 2014-10-07
# 7: Chavez 2014-09-23
# 8: Chavira 2014-07-15
# 9: Claren 2014-10-30
# 10: Colleary 2014-11-11
Third method is using data.table's unique function (which should be very efficient too)
unique(setorder(dt, name.last, -date), by = "name.last")
# name.last date
# 1: Adams 2014-10-20
# 2: Barnett 2014-11-06
# 3: Bell 2014-10-22
# 4: Burns 2014-09-08
# 5: Camacho 2014-08-12
# 6: Casillas 2014-10-07
# 7: Chavez 2014-09-23
# 8: Chavira 2014-07-15
# 9: Claren 2014-10-30
# 10: Colleary 2014-11-11
Last method is using .SD. It is the least efficient, but it is useful in cases where you want all the columns returned and you can't use functions such as duplicated
setorder(dt, name.last, -date)[, .SD[1], name.last]
# name.last date
# 1: Adams 2014-10-20
# 2: Barnett 2014-11-06
# 3: Bell 2014-10-22
# 4: Burns 2014-09-08
# 5: Camacho 2014-08-12
# 6: Casillas 2014-10-07
# 7: Chavez 2014-09-23
# 8: Chavira 2014-07-15
# 9: Claren 2014-10-30
# 10: Colleary 2014-11-11
If I am understanding your question correctly, I think you can do this more cleanly with the sqldf package, but the downside is that you have to know SQL.
install.packages("sqldf")
library("sqldf")
dt <-data.frame(read.table(header = TRUE, text = " name.last date
1: Adams 2014-10-20
2: Adams 2014-07-07
3: Barnett 2014-11-06
4: Barnett 2014-09-22
5: Bell 2014-10-22
6: Bell 2014-07-29
7: Burns 2014-09-08
8: Burns 2014-09-03
9: Camacho 2014-08-12
10: Camacho 2014-07-08
11: Casillas 2014-10-07
12: Casillas 2014-07-17
13: Chavez 2014-09-23
14: Chavez 2014-09-17
15: Chavira 2014-07-15
16: Chavira 2014-07-07
17: Claren 2014-10-30
18: Claren 2014-10-23
19: Colleary 2014-11-11
20: Colleary 2014-11-07")
)
head(dt)
colnames(dt) <- c('names', 'date')
sqldf("select names, min(date), max(date) from dt group by names")
Hopefully this was helpful.
In writing this up I figured it out. For posterity....
Order the table by name and date so that you can depend on the date you want being first or last within each group, and keep the ordered result. For example: dt <- dt[order(names, -date)].
Then rather than setting a key and using unique(), just a simple:
dt[!duplicated(names)]
where names is the column containing duplicates.
This should output the desired table. If there are more elegant / reliable ways to do this, I'd be interested in hearing them.