Selecting distinct entries based on specific variables in R - r

I want to select distinct entries for my dataset based on two specific variables. I may, in fact, like to create a subset and do analysis using each subset.
The data set looks like this
id <- c(3,3,6,6,4,4,3,3)
date <- c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7", "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3")
date_cat <- c(1,1,1,1,2,2,3,3)
measurement <- c(10, 13, 14,13, 12, 11, 14, 17)
myData <- data.frame(id, date, date_cat, measurement)
myData
myData$date1 <- as.Date(myData$date)
myData
id date date_cat measurement date1
1 3 2017-1-1 1 10 2017-01-01
2 3 2017-3-3 1 13 2017-03-03
3 6 2017-4-3 1 14 2017-04-03
4 6 2017-4-7 1 13 2017-04-07
5 4 2017-10-1 2 12 2017-10-01
6 4 2017-11-1 2 11 2017-11-01
7 3 2018-3-1 3 14 2018-03-01
8 3 2018-4-3 3 17 2018-04-03
#select the last date for the ID in each date category.
Here date_cat is the date category and date1 is date formatted as date. How can I get the last date for each ID in each date_category?
I want my data to show up as
id date date_cat measurement date1
1 3 2017-3-3 1 13 2017-03-03
2 6 2017-4-7 1 13 2017-04-07
3 4 2017-11-1 2 11 2017-11-01
4 3 2018-4-3 3 17 2018-04-03
Thanks!

I am not sure if you want something like below
subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
which gives
> subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
id date date_cat measurement date1
2 3 2017-3-3 1 13 2017-03-03
4 6 2017-4-7 1 13 2017-04-07
6 4 2017-11-1 2 11 2017-11-01
8 3 2018-4-3 3 17 2018-04-03

Using data.table:
library(data.table)
myData_DT <- as.data.table(myData)
myData_DT[, .SD[.N] , by = .(date_cat, id)]
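Note that .SD[.N] takes the last row of each group as the rows happen to be ordered, which matches the last date only when the data are already sorted by date within each group. A minimal sketch of a variant that picks the row with the latest date1 explicitly, so it does not depend on row order (the names shuffled and last_rows are illustrative, not from the answer above):

```r
library(data.table)

# Same example data as in the question
myData_DT <- data.table(
  id = c(3, 3, 6, 6, 4, 4, 3, 3),
  date = c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7",
           "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3"),
  date_cat = c(1, 1, 1, 1, 2, 2, 3, 3),
  measurement = c(10, 13, 14, 13, 12, 11, 14, 17)
)
myData_DT[, date1 := as.Date(date)]

# Shuffle the rows on purpose to show the result does not rely on input order
set.seed(1)
shuffled <- myData_DT[sample(.N)]

# .I gives row indices; which.max(date1) locates the latest date per group
idx <- shuffled[, .I[which.max(date1)], by = .(id, date_cat)]$V1
last_rows <- shuffled[idx][order(date1)]
last_rows
```

If two rows in a group share the same latest date, which.max keeps only the first of them.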

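We could create a group with rleid on the 'id' column, slice the last row of each group, and remove the temporary grouping column. This assumes the rows of each id/date_cat combination are consecutive and sorted by date, as in the example.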
library(dplyr)
library(data.table)
myData %>%
group_by(grp = rleid(id)) %>%
slice(n()) %>%
ungroup %>%
select(-grp)
# A tibble: 4 x 5
# id date date_cat measurement date1
# <dbl> <chr> <dbl> <dbl> <date>
#1 3 2017-3-3 1 13 2017-03-03
#2 6 2017-4-7 1 13 2017-04-07
#3 4 2017-11-1 2 11 2017-11-01
#4 3 2018-4-3 3 17 2018-04-03
Or this can be done on the fly without creating a temporary column
myData %>%
filter(!duplicated(rleid(id), fromLast = TRUE))
Or using base R with subset and rle
subset(myData, !duplicated(with(rle(id),
rep(seq_along(values), lengths)), fromLast = TRUE))
# id date date_cat measurement date1
#2 3 2017-3-3 1 13 2017-03-03
#4 6 2017-4-7 1 13 2017-04-07
#6 4 2017-11-1 2 11 2017-11-01
#8 3 2018-4-3 3 17 2018-04-03
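To see what the rle-based run index in the base R version looks like, here is a small sketch using only the id vector from the question (the names runs, run_id and keep are illustrative):

```r
id <- c(3, 3, 6, 6, 4, 4, 3, 3)

# rle() compresses consecutive repeats into (values, lengths)
runs <- rle(id)

# Expand back to one run number per row; this is what rleid(id) returns
run_id <- rep(seq_along(runs$values), runs$lengths)
run_id
# 1 1 2 2 3 3 4 4

# Keep the last row of every run
keep <- !duplicated(run_id, fromLast = TRUE)
which(keep)
# 2 4 6 8
```

The second block of id 3 gets its own run number (4), which is why this approach separates it from the first block without looking at date_cat. It also means the approach depends on the rows being ordered as in the example.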

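Using dplyr (ranking on date1, the Date column, because date is a character string and would be compared alphabetically):
library(dplyr)
myData %>%
group_by(id, date_cat) %>%
top_n(1, date1)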

Related

R conditional count of unique value over date range/window

In R, how can you count the number of observations fulfilling a condition over a time range?
Specifically, I want to count the number of different id by country over the last 8 months, but only if id occurs at least twice during these 8 months. Hence, for the count, it does not matter whether an id occurs 2x or 100x (doing this in 2 steps is maybe easier). NA exists both in id and country. Since this could otherwise be taken care of, accounting for this is not necessary but still helpful.
My current best attempt is below. It does not account for the restriction (an id must appear at least twice in the previous 8 months), and its counting also looks odd to me: for date "2017-12-12", desired_unrestricted should equal 4 by my count, but the code gives 2.
dt[, date := as.Date(date)][
, totalids := sapply(date,
function(x) length(unique(id[between(date, x - lubridate::month(8), x)]))),
by = country]
Data
library(data.table)
library(lubridate)
ID <- c("1","1","1","1","1","1","2","2","2","3","3",NA,"4")
Date <- c("2017-01-01","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2018-05-02","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2017-05-01","2017-12-12","2017-12-12" )
Value <- c(2,4,3,5,2,5,8,17,17,3,7,5,3)
Country <- c("UK","UK","US","US",NA,"US","UK","UK","US","US","US","US","US")
Desired <- c(1,1,0,2,NA,0,1,2,2,2,2,1,1)
Desired_unrestricted <- c(2,2,1,3,NA,1,2,2,3,3,3,4,4)
dt <- data.frame(id=ID, date=Date, value=Value, country=Country, desired_output=Desired, desired_unrestricted=Desired_unrestricted)
setDT(dt)
Thanks in advance.
This data.table-only answer is motivated by a comment:
dt[, date := as.Date(date)] # if not already `Date`-class
dt[, date8 := do.call(c, lapply(dt$date, function(z) seq(z, length=2, by="-8 months")[2]))
][, results := dt[dt, on = .(country, date > date8, date <= date),
length(Filter(function(z) z > 1, table(id))), by = .EACHI]$V1
][, date8 := NULL ]
# id date value country desired_output desired_unrestricted results
# <char> <Date> <num> <char> <num> <num> <int>
# 1: 1 2017-01-01 2 UK 1 2 1
# 2: 1 2017-01-01 4 UK 1 2 1
# 3: 1 2017-01-05 3 US 0 1 0
# 4: 1 2017-05-01 5 US 1 3 2
# 5: 1 2017-05-01 2 <NA> NA NA 0
# 6: 1 2018-05-02 5 US 0 1 0
# 7: 2 2017-01-01 8 UK 1 2 1
# 8: 2 2017-01-05 17 UK 2 2 2
# 9: 2 2017-05-01 17 US 1 3 2
# 10: 3 2017-05-01 3 US 2 3 2
# 11: 3 2017-05-01 7 US 2 3 2
# 12: <NA> 2017-12-12 5 US 2 4 1
# 13: 4 2017-12-12 3 US 2 4 1
That's a lot to absorb.
Quick walk-through:
"8 months ago":
seq(z, length=2, by="-8 months")[2]
seq.Date (inferred by calling seq with a Date-class first argument) starts at z (current date for each row) and produces a sequence of length 2 with 8 months between them. seq always starts at the first argument, so length=1 won't work (it'll only return z); length=2 guarantees that the second value in the returned vector will be the "8 months before date" that we need.
Date subtraction:
[, date8 := do.call(c, lapply(dt$date, function(z) seq(...)[2])) ]
A simple base-R method for subtracting 8 months is seq(date, length=2, by="-8 months")[2]. seq.Date requires its first argument to be length-1, so we need to sapply or lapply it; unfortunately, sapply drops the class, so we lapply it and then programmatically combine them with do.call(c, ...) (since c(..) creates a list-column, and unlist will de-class it). (Perhaps this part can be improved.)
We need that in dt first since we do a non-equi (range-based) join based on this value.
Counting id with 2 or more visits:
length(Filter(function(z) z > 1, table(id)))
We produce a table(id), which gives us the count of each id within the join period. Filter(fun, ...) lets us drop the ids that have a count below 2, leaving a named vector of ids that had 2 or more visits. Its length is the number we need.
Self non-equi join:
dt[dt, on = .(country, date > date8, date <= date), ... ]
Relatively straightforward. This is an open/closed range (date > date8 and date <= date); it can be changed to both-closed if you prefer.
Self non-equi join but count ids by-row: by=.EACHI.
Retrieve the results of that and assign into the original dt:
[, results := dt[...]$V1 ]
Since the non-equi join included a value (length(Filter(...))) without a name, it's named V1, and all we want is that. (To be honest, I don't know exactly why assigning it more directly doesn't work ... but the counts are all wrong. Perhaps it's backwards by-row tallying.)
Cleanup:
[, date8 := NULL ]
(Nothing fancy here, just proper data-stewardship :-)
There are some discrepancies between my counts and your desired_output. I wonder if those are just typos in the OP; I think the math is right ...
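The two building blocks of the walk-through can be tried in isolation with base R only. This is a small sketch on made-up inputs (the vectors dates and ids and the names eight_back and n_repeat are illustrative, not part of the answer):

```r
# "8 months ago" for a vector of dates: seq() takes one start date at a
# time, so apply it per element and recombine with do.call(c, ...) to keep
# the Date class
dates <- as.Date(c("2017-01-01", "2017-05-01", "2017-12-12"))
eight_back <- do.call(c, lapply(dates, function(z) {
  seq(z, length.out = 2, by = "-8 months")[2]
}))
eight_back
# "2016-05-01" "2016-09-01" "2017-04-12"

# Count ids seen at least twice; table() drops NA by default
ids <- c("1", "1", "2", "3", "3", NA)
n_repeat <- length(Filter(function(n) n > 1, table(ids)))
n_repeat
# 2  (ids "1" and "3")
```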
Here is another option:
setkey(dt, country, date, id)
dt[, date := as.IDate(date)][,
eightmthsago := as.IDate(sapply(as.IDate(date), function(x) seq(x, by="-8 months", length.out=2L)[2L]))]
dt[, c("out", "out_unres") :=
dt[dt, on=.(country, date>=eightmthsago, date<=date),
by=.EACHI, {
v <- id[!is.na(id)]
.(uniqueN(v[duplicated(v)]), uniqueN(v))
}][,1L:3L := NULL]
]
dt
output (like r2evans, I am also getting different output from desired as there seems to be a miscount in the desired output):
id date value country desired_output desired_unrestricted eightmthsago out out_unres
1: 1 2017-05-01 2 <NA> NA NA 2016-09-01 0 1
2: 1 2017-01-01 2 UK 1 2 2016-05-01 1 2
3: 1 2017-01-01 4 UK 1 2 2016-05-01 1 2
4: 2 2017-01-01 8 UK 1 2 2016-05-01 1 2
5: 2 2017-01-05 17 UK 2 2 2016-05-05 2 2
6: 1 2017-01-05 3 US 0 1 2016-05-05 0 1
7: 1 2017-05-01 5 US 1 3 2016-09-01 2 3
8: 2 2017-05-01 17 US 1 3 2016-09-01 2 3
9: 3 2017-05-01 3 US 2 3 2016-09-01 2 3
10: 3 2017-05-01 7 US 2 3 2016-09-01 2 3
11: <NA> 2017-12-12 5 US 2 4 2017-04-12 1 4
12: 4 2017-12-12 3 US 2 4 2017-04-12 1 4
13: 1 2018-05-02 5 US 0 1 2017-09-02 0 2
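The counting expression inside the join can be checked on a plain vector. A small sketch, using base R's length(unique(...)) as a stand-in for data.table's uniqueN so that it runs without any packages (the vector ids_in_window is made up for illustration):

```r
ids_in_window <- c("1", "1", "1", "2", "3", "3", NA)

# Drop missing ids first, as the answer does
v <- ids_in_window[!is.na(ids_in_window)]

# duplicated() flags the 2nd, 3rd, ... occurrence of a value, so the unique
# values among the flagged elements are exactly the ids seen at least twice
n_at_least_twice <- length(unique(v[duplicated(v)]))
n_distinct_ids <- length(unique(v))

n_at_least_twice  # 2  ("1" and "3")
n_distinct_ids    # 3
```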
Although this question is tagged with data.table, here is a dplyr::rowwise solution to the problem. Is this what you had in mind? The output looks valid to me: the number of ids in the last 8 months that occur at least twice.
library(dplyr)
library(lubridate)
dt <- dt %>% mutate(date = as.Date(date))
dt %>%
group_by(country) %>%
group_modify(~ .x %>%
rowwise() %>%
mutate(totalids = .x %>%
filter(date <= .env$date, date >= .env$date %m-% months(8)) %>%
pull(id) %>%
table() %>%
`[`(. >1) %>%
length
))
#> # A tibble: 13 x 7
#> # Groups: country [3]
#> country id date value desired_output desired_unrestricted totalids
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <int>
#> 1 UK 1 2017-01-01 2 1 2 1
#> 2 UK 1 2017-01-01 4 1 2 1
#> 3 UK 2 2017-01-01 8 1 2 1
#> 4 UK 2 2017-01-05 17 2 2 2
#> 5 US 1 2017-01-05 3 0 1 0
#> 6 US 1 2017-05-01 5 1 3 2
#> 7 US 1 2018-05-02 5 0 1 0
#> 8 US 2 2017-05-01 17 1 3 2
#> 9 US 3 2017-05-01 3 2 3 2
#> 10 US 3 2017-05-01 7 2 3 2
#> 11 US <NA> 2017-12-12 5 2 4 1
#> 12 US 4 2017-12-12 3 2 4 1
#> 13 <NA> 1 2017-05-01 2 NA NA 0
Created on 2021-09-02 by the reprex package (v2.0.1)
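On the odd counts produced by the attempt in the question: one likely cause, offered here as an observation rather than something the answers above state, is that lubridate::month() is the month accessor, and for a plain number it returns that number unchanged. So x - lubridate::month(8) steps back 8 days, not 8 months. A short sketch of the difference:

```r
library(lubridate)

x <- as.Date("2017-12-12")

# month(8) is just the number 8, so this subtracts 8 days
days_back <- x - lubridate::month(8)
days_back
# "2017-12-04"

# months(8) is an 8-month period; %m-% also handles short months safely
months_back <- x %m-% months(8)
months_back
# "2017-04-12"
```

With an 8-day window, only the two rows dated 2017-12-12 fall inside it for the US, which would explain the count of 2 reported in the question.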

Interpolating Mid-Year Averages

I have yearly observations of income for a series of geographies, like this:
library(dplyr)
library(lubridate)
date <- c("2004-01-01", "2005-01-01", "2006-01-01",
"2004-01-01", "2005-01-01", "2006-01-01")
geo <- c(1, 1, 1, 2, 2, 2)
inc <- c(10, 12, 14, 32, 34, 50)
data <- tibble(date = ymd(date), geo, inc)
date geo inc
<date> <dbl> <dbl>
1 2004-01-01 1 10
2 2005-01-01 1 12
3 2006-01-01 1 14
4 2004-01-01 2 32
5 2005-01-01 2 34
6 2006-01-01 2 50
I need to insert mid-year values, as averages of the start-of-year and end-of-year observations, so that the data is every 6 months. The outcome would look like this:
2004-01-01 1 10
2004-06-01 1 11
2005-01-01 1 12
2005-06-01 1 13
2006-01-01 1 14
2004-01-01 2 32
2004-06-01 2 33
2005-01-01 2 34
2005-06-01 2 42
2006-01-01 2 50
Would appreciate any ideas.
Grouped by 'geo', add (+) the 'inc' to the next value (lead) and take the average (/2), add 5 months to the 'date', then filter out the NA elements in 'inc' and bind the rows with the original data
library(dplyr)
library(lubridate)
data %>%
group_by(geo) %>%
summarise(date = date %m+% months(5),
inc = (inc + lead(inc))/2, .groups = 'drop') %>%
filter(!is.na(inc)) %>%
bind_rows(data, .) %>%
arrange(geo, date)
-output
# A tibble: 10 x 3
# date geo inc
# <date> <dbl> <dbl>
# 1 2004-01-01 1 10
# 2 2004-06-01 1 11
# 3 2005-01-01 1 12
# 4 2005-06-01 1 13
# 5 2006-01-01 1 14
# 6 2004-01-01 2 32
# 7 2004-06-01 2 33
# 8 2005-01-01 2 34
# 9 2005-06-01 2 42
#10 2006-01-01 2 50
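The same idea can be written with mutate instead of summarise, which may be preferable on recent dplyr versions where summarise is expected to return one row per group. This is a sketch under that assumption, not the answerer's code; the names mid and result are illustrative:

```r
library(dplyr)
library(lubridate)

data <- tibble(
  date = ymd(c("2004-01-01", "2005-01-01", "2006-01-01",
               "2004-01-01", "2005-01-01", "2006-01-01")),
  geo = c(1, 1, 1, 2, 2, 2),
  inc = c(10, 12, 14, 32, 34, 50)
)

# Mid-year rows: average each value with the next one in the same geo and
# move the date from 1 January to 1 June
mid <- data %>%
  group_by(geo) %>%
  mutate(inc = (inc + lead(inc)) / 2,
         date = date %m+% months(5)) %>%
  ungroup() %>%
  filter(!is.na(inc))   # the last year of each geo has no following value

result <- bind_rows(data, mid) %>%
  arrange(geo, date)
result
```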
You can use complete to create a sequence of dates for 6 months and then use na.approx to fill the NA values with interpolated values.
library(dplyr)
library(lubridate)
data %>%
group_by(geo) %>%
tidyr::complete(date = seq(min(date), max(date), by = '6 months')) %>%
mutate(date = if_else(is.na(inc), date %m-% months(1), date),
inc = zoo::na.approx(inc))
# geo date inc
# <dbl> <date> <dbl>
# 1 1 2004-01-01 10
# 2 1 2004-06-01 11
# 3 1 2005-01-01 12
# 4 1 2005-06-01 13
# 5 1 2006-01-01 14
# 6 2 2004-01-01 32
# 7 2 2004-06-01 33
# 8 2 2005-01-01 34
# 9 2 2005-06-01 42
#10 2 2006-01-01 50

Group records with time interval overlap

I have a data frame (with N=16) that contains ID, w_from (date), and w_to (date). Each record represents a task.
Here’s the data in R.
ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
w_from <- c("2010-01-01","2010-01-05","2010-01-29","2010-01-29",
"2010-03-01","2010-03-15","2010-07-15","2010-09-10",
"2010-11-01","2010-11-30","2010-12-15","2010-12-31",
"2011-02-01","2012-04-01","2011-07-01","2011-07-01")
w_to <- c("2010-01-31","2010-01-15", "2010-02-13","2010-02-28",
"2010-03-16","2010-03-16","2010-08-14","2010-10-10",
"2010-12-01","2010-12-30","2010-12-20","2011-02-19",
"2011-03-23","2012-06-30","2011-07-31","2011-07-06")
df <- data.frame(ID, w_from, w_to)
df$w_from <- as.Date(df$w_from)
df$w_to <- as.Date(df$w_to)
I need to generate a group number by ID for records whose time intervals overlap. As an example, and in general terms, if record#1 overlaps with record#2, and record#2 overlaps with record#3, then record#1, record#2, and record#3 overlap.
Also, if record#1 overlaps with record#2 and record#3, but record#2 doesn't overlap with record#3, then record#1, record#2, and record#3 all overlap.
In the example above and for ID=1, the first four records overlap.
Here is the final output:
Also, if this can be done using dplyr, that would be great!
Try this:
library(dplyr)
df %>%
group_by(ID) %>%
arrange(w_from) %>%
mutate(group = 1+cumsum(
cummax(lag(as.numeric(w_to), default = first(as.numeric(w_to)))) < as.numeric(w_from)))
# A tibble: 16 x 4
# Groups: ID [2]
ID w_from w_to group
<dbl> <date> <date> <dbl>
1 1 2010-01-01 2010-01-31 1
2 1 2010-01-05 2010-01-15 1
3 1 2010-01-29 2010-02-13 1
4 1 2010-01-29 2010-02-28 1
5 1 2010-03-01 2010-03-16 2
6 1 2010-03-15 2010-03-16 2
7 1 2010-07-15 2010-08-14 3
8 1 2010-09-10 2010-10-10 4
9 1 2010-11-01 2010-12-01 5
10 1 2010-11-30 2010-12-30 5
11 1 2010-12-15 2010-12-20 5
12 1 2010-12-31 2011-02-19 6
13 1 2011-02-01 2011-03-23 6
14 2 2011-07-01 2011-07-31 1
15 2 2011-07-01 2011-07-06 1
16 2 2012-04-01 2012-06-30 2
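The logic of the one-liner above is: after sorting by start date, a new group begins whenever a record starts after the latest end date seen so far. A base R sketch of that rule for a single ID, using five of the records from the question (the names from, to, prev_max_end and group are illustrative):

```r
w_from <- as.Date(c("2010-01-01", "2010-01-05", "2010-01-29",
                    "2010-03-01", "2010-03-15"))
w_to   <- as.Date(c("2010-01-31", "2010-01-15", "2010-02-13",
                    "2010-03-16", "2010-03-16"))

# Sort by start date and work on plain numbers
ord  <- order(w_from)
from <- as.numeric(w_from[ord])
to   <- as.numeric(w_to[ord])

# Latest end date among all earlier records (the first record is compared
# with its own end date, so it never opens a new group)
prev_max_end <- cummax(c(to[1], to[-length(to)]))

# A record that starts after everything before it has ended opens a new group
group <- 1 + cumsum(prev_max_end < from)
group
# 1 1 1 2 2
```

The running maximum matters for the third record: it overlaps the first record (ending 2010-01-31) even though the record directly before it ended on 2010-01-15.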

Fill in missing cases till specific condition per group

I'm attempting to create a data frame that shows all of the in between months for my data set, by subject. Here is an example of what the data looks like:
dat <- data.frame(c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4), c(rep(30, 2), rep(25, 5), rep(20, 3)), c('2017-01-01', '2017-02-01', '2017-04-01', '2017-02-01', '2017-01-01', '2017-02-01', '2017-03-01', '2017-01-01',
'2017-02-01', '2017-04-01'))
colnames(dat) <- c('id', 'value', 'date')
dat$Out.Of.Study <- c("", "", "Out", "Out", "", "", "Out", "", "", "Out")
dat
id value date Out.Of.Study
1 1 30 2017-01-01
2 1 30 2017-02-01
3 1 25 2017-04-01 Out
4 2 25 2017-02-01 Out
5 3 25 2017-01-01
6 3 25 2017-02-01
7 3 25 2017-03-01 Out
8 4 20 2017-01-01
9 4 20 2017-02-01
10 4 20 2017-04-01 Out
If I want to show the in between months where no data was collected (but the subject was still enrolled in the study) I can use the complete() function. However, the issue is that I get all missing months for each subject id based on the min and max month identified in the data set:
## Add Dates by Group
library(tidyr)
complete(dat, id, date)
id date value Out.Of.Study
1 1 2017-01-01 30
2 1 2017-02-01 30
3 1 2017-03-01 NA <NA>
4 1 2017-04-01 25 Out
5 2 2017-01-01 NA <NA>
6 2 2017-02-01 25 Out
7 2 2017-03-01 NA <NA>
8 2 2017-04-01 NA <NA>
9 3 2017-01-01 25
10 3 2017-02-01 25
11 3 2017-03-01 25 Out
12 3 2017-04-01 NA <NA>
13 4 2017-01-01 20
14 4 2017-02-01 20
15 4 2017-03-01 NA <NA>
16 4 2017-04-01 20 Out
The issue with this is that I don't want the missing months to exceed the subject's final observed month (essentially, I have subjects who are censored and would need to be removed from the study) or show up prior to the month a subject started the study. For example, subject 2 was only a participant in the month '2017-02-01'. Therefore, I'd like the data to show that this was the only month they were in the study, without the extra months after and the extra month before, as shown above. The same is the case with subject 3, who has an extra month, even though they are out of the study.
Perhaps the complete() isn't the best way to go about this?
This can be solved by creating a sequence of months individually for each id and by joining the sequences with dat to complete the missing months.
1. data.table
(The question is tagged with tidyr. But as I am more acquainted with data.table I have tried this first.)
library(data.table)
# coerce date strings to class Date
setDT(dat)[, date := as.Date(date)]
# create sequence of months for each id
sdt <- dat[, .(date = seq(min(date), max(date), "month")), by = id]
# join
dat[sdt, on = .(id, date)]
id value date Out.Of.Study
1: 1 30 2017-01-01
2: 1 30 2017-02-01
3: 1 NA 2017-03-01 <NA>
4: 1 25 2017-04-01 Out
5: 2 25 2017-02-01 Out
6: 3 25 2017-01-01
7: 3 25 2017-02-01
8: 3 25 2017-03-01 Out
9: 4 20 2017-01-01
10: 4 20 2017-02-01
11: 4 NA 2017-03-01 <NA>
12: 4 20 2017-04-01 Out
Note that there is only one row for id == 2 as requested by the OP.
This approach requires coercing date from factor to class Date to make sure that all missing months will be completed.
This is also safer than relying on the available date factors in the dataset. For illustration, let's assume that id == 4 is Out in month 2017-06-01 (June) instead of 2017-04-01 (April). Then, there would be no month 2017-05-01 (May) in the whole dataset and the final result would be incomplete.
Without creating the temporary variable sdt the code becomes
library(data.table)
setDT(dat)[, date := as.Date(date)][
dat[, .(date = seq(min(date), max(date), "month")), by = id], on = .(id, date)]
2. tidyr / dplyr
library(dplyr)
library(tidyr)
# coerce date strings to class Date
dat <- dat %>%
mutate(date = as.Date(date))
dat %>%
# create sequence of months for each id
group_by(id) %>%
expand(date = seq(min(date), max(date), "month")) %>%
# join to complete the missing month for each id
left_join(dat, by = c("id", "date"))
# A tibble: 12 x 4
# Groups: id [?]
id date value Out.Of.Study
<dbl> <date> <dbl> <chr>
1 1 2017-01-01 30 ""
2 1 2017-02-01 30 ""
3 1 2017-03-01 NA NA
4 1 2017-04-01 25 Out
5 2 2017-02-01 25 Out
6 3 2017-01-01 25 ""
7 3 2017-02-01 25 ""
8 3 2017-03-01 25 Out
9 4 2017-01-01 20 ""
10 4 2017-02-01 20 ""
11 4 2017-03-01 NA NA
12 4 2017-04-01 20 Out
There is a variant which does not update dat:
library(dplyr)
library(tidyr)
dat %>%
mutate(date = as.Date(date)) %>%
right_join(group_by(., id) %>%
expand(date = seq(min(date), max(date), "month")),
by = c("id", "date"))
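For comparison, the same "sequence per id, then join" idea can be sketched in base R without any packages. This is an illustrative variant, not part of the answer above; the names grid and filled are made up:

```r
dat <- data.frame(
  id = c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4),
  value = c(rep(30, 2), rep(25, 5), rep(20, 3)),
  date = as.Date(c("2017-01-01", "2017-02-01", "2017-04-01", "2017-02-01",
                   "2017-01-01", "2017-02-01", "2017-03-01", "2017-01-01",
                   "2017-02-01", "2017-04-01")),
  Out.Of.Study = c("", "", "Out", "Out", "", "", "Out", "", "", "Out")
)

# One monthly sequence per id, from that id's first to last observed month
grid <- do.call(rbind, lapply(split(dat, dat$id), function(d) {
  data.frame(id = d$id[1],
             date = seq(min(d$date), max(d$date), by = "month"))
}))

# Left join: keep every month of the grid, fill unobserved months with NA
filled <- merge(grid, dat, by = c("id", "date"), all.x = TRUE)
filled
```

As in the data.table and dplyr versions, id 2 keeps a single row and only the two missing March rows (ids 1 and 4) are added.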
I would still use complete (probably the right method to use here), but afterwards drop the rows that come after the row with "Out". You can do this with dplyr::between.
dat %>%
group_by(id) %>%
complete(date) %>%
# Filter rows that are between 1 and the one that has "Out"
filter(between(row_number(), 1, which(Out.Of.Study == "Out")))
id date value Out.Of.Study
<dbl> <fct> <dbl> <chr>
1 1 2017-01-01 30 ""
2 1 2017-02-01 30 ""
3 1 2017-03-01 NA NA
4 1 2017-04-01 25 Out
5 2 2017-01-01 NA NA
6 2 2017-02-01 25 Out
7 3 2017-01-01 25 ""
8 3 2017-02-01 25 ""
9 3 2017-03-01 25 Out
10 4 2017-01-01 20 ""
11 4 2017-02-01 20 ""
12 4 2017-03-01 NA NA
13 4 2017-04-01 20 Out

Looping over unique values [closed]

I have a data frame in long format, with one observation row per measurement. I want to loop through each unique ID and find the "minimum" date for each unique individual. For example, patient 1 may be measured at three different times, but I want the earliest time. I thought about sorting the dataset by the date (in increasing order) and removing all duplicates, but I'm not sure if this is the best way to go. Any help or suggestions would be greatly appreciated. Thank you!
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), order by 'Date' (assuming it is of class Date; otherwise convert it with as.Date and the correct format), and, grouped by 'ID', take the first observation with head.
library(data.table)
setDT(df1)[order(Date), head(.SD, 1), by = ID]
Here is another way using base R:
earliestDates = aggregate(list(date = df$date), list(ID = df$ID), min)
result = merge(earliestDates,df)
earliestDates is a two column data frame that has the minimum date by ID. The merge will join the values in the other columns.
Example:
set.seed(1)
ID = floor(runif(20,1,5))
day = as.Date(floor(runif(20,1,25)),origin = "2017-1-1")
weight = floor(runif(20,80,95))
df = data.frame(ID = ID, date = day, weight = weight)
> df
ID date weight
1 2 2017-01-24 92
2 2 2017-01-07 89
3 3 2017-01-17 91
4 4 2017-01-05 88
5 1 2017-01-08 87
6 4 2017-01-11 91
7 4 2017-01-02 80
8 3 2017-01-11 87
9 3 2017-01-22 90
10 1 2017-01-10 90
11 1 2017-01-13 87
12 1 2017-01-16 92
13 3 2017-01-13 86
14 2 2017-01-06 83
15 4 2017-01-21 81
16 2 2017-01-18 81
17 3 2017-01-21 84
18 4 2017-01-04 87
19 2 2017-01-19 89
20 4 2017-01-11 86
After the aggregate and merge, the result is:
> result
ID date weight
1 1 2017-01-08 87
2 2 2017-01-06 83
3 3 2017-01-11 87
4 4 2017-01-02 80
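The approach the question itself suggests, sorting by date and dropping duplicate IDs, also works in base R. A minimal sketch on a small made-up data frame (the values and the names sorted and earliest are illustrative):

```r
df <- data.frame(
  ID = c(2, 2, 3, 1, 1, 3),
  date = as.Date(c("2017-01-24", "2017-01-07", "2017-01-17",
                   "2017-01-08", "2017-01-10", "2017-01-11")),
  weight = c(92, 89, 91, 87, 90, 87)
)

# Sort by ID, then by date, so the earliest row of each ID comes first
sorted <- df[order(df$ID, df$date), ]

# duplicated() marks every occurrence of an ID after the first one
earliest <- sorted[!duplicated(sorted$ID), ]
earliest
```

Unlike the merge-based version, this always returns exactly one row per ID, even when an ID has several measurements on its earliest date.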
Try the following dplyr code:
library(dplyr)
set.seed(12345)
###Create test dataset
tb <- tibble(id = rep(1:10, each = 3),
date = rep(seq(as.Date("2017-07-01"), by=10, len=10), 3),
obs = rnorm(30))
# # A tibble: 30 × 3
# id date obs
# <int> <date> <dbl>
# 1 2017-07-01 0.5855288
# 1 2017-07-11 0.7094660
# 1 2017-07-21 -0.1093033
# 2 2017-07-31 -0.4534972
# 2 2017-08-10 0.6058875
# 2 2017-08-20 -1.8179560
# 3 2017-08-30 0.6300986
# 3 2017-09-09 -0.2761841
# 3 2017-09-19 -0.2841597
# 4 2017-09-29 -0.9193220
# # ... with 20 more rows
###Pipe the dataset through dplyr's 'group_by' and 'filter' commands
tb %>% group_by(id) %>%
filter(date == min(date)) %>%
ungroup() %>%
distinct()
# # A tibble: 10 × 3
# id date obs
# <int> <date> <dbl>
# 1 2017-07-01 0.5855288
# 2 2017-07-31 -0.4534972
# 3 2017-08-30 0.6300986
# 4 2017-07-01 -0.1162478
# 5 2017-07-21 0.3706279
# 6 2017-08-20 0.8168998
# 7 2017-07-01 0.7796219
# 8 2017-07-11 1.4557851
# 9 2017-08-10 -1.5977095
# 10 2017-09-09 0.6203798
