Situation
I have a data frame df:
df <- structure(list(person = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L), .Label = c("pA", "pB", "pC"), class = "factor"), date = structure(c(16071,
16102, 16130, 16161, 16071, 16102, 16130, 16071, 16102), class = "Date")), .Names = c("person",
"date"), row.names = c(NA, -9L), class = "data.frame")
> df
person date
1 pA 2014-01-01
2 pA 2014-02-01
3 pA 2014-03-01
4 pA 2014-04-01
5 pB 2014-01-01
6 pB 2014-02-01
7 pB 2014-03-01
8 pC 2014-01-01
9 pC 2014-02-01
Question
How can I select the last 2 (or 'n') entries, ordered by date, for each person, so that I have a resulting data frame df1:
> df1
person date
1 pA 2014-03-01
2 pA 2014-04-01
3 pB 2014-02-01
4 pB 2014-03-01
5 pC 2014-01-01
6 pC 2014-02-01
?
I've tried combinations of
library(dplyr)
df1 <- df %>%
group_by(person) %>%
select(tail(df, 2))
with no joy.
You can try slice
library(dplyr)
df %>%
group_by(person) %>%
arrange(date, person) %>%
slice((n()-1):n())
# person date
#1 pA 2014-03-01
#2 pA 2014-04-01
#3 pB 2014-02-01
#4 pB 2014-03-01
#5 pC 2014-01-01
#6 pC 2014-02-01
Or in place of the last step
do(tail(., 2))
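For an arbitrary n, the same idea can also be sketched in base R. This is only a minimal sketch and not part of the answers here: last_n is a hypothetical helper name, and it assumes the columns are called person and date as in the question.

```r
# Rebuild the example data from the question
df <- data.frame(
  person = c("pA", "pA", "pA", "pA", "pB", "pB", "pB", "pC", "pC"),
  date = as.Date(c("2014-01-01", "2014-02-01", "2014-03-01", "2014-04-01",
                   "2014-01-01", "2014-02-01", "2014-03-01",
                   "2014-01-01", "2014-02-01")),
  stringsAsFactors = FALSE
)

# last_n is a hypothetical helper: sort by person and date,
# then keep the last n rows of each person's block
last_n <- function(data, n = 2) {
  data <- data[order(data$person, data$date), ]
  out <- do.call(rbind, lapply(split(data, data$person), tail, n))
  row.names(out) <- NULL
  out
}

df1 <- last_n(df, 2)
df1
```

A person with fewer than n rows simply contributes all of their rows, because tail() returns whatever is available.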
Using data.table:
setDT(df)[order(person), tail(.SD, 2L), by=person]
# person date
# 1: pA 2014-03-01
# 2: pA 2014-04-01
# 3: pB 2014-02-01
# 4: pB 2014-03-01
# 5: pC 2014-01-01
# 6: pC 2014-02-01
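We order by person and then group by person and select the last two rows from the subset of data .SD for each group. This relies on the dates already being sorted within each person, as they are in the example; otherwise order by both columns, i.e. order(person, date).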
Since you order the data by person and date (i.e. you want the 2 latest dates per person), you can also use top_n() in dplyr:
df %>% group_by(person) %>% top_n(2, date)
#Source: local data frame [6 x 2]
#Groups: person
#
# person date
#1 pA 2014-03-01
#2 pA 2014-04-01
#3 pB 2014-02-01
#4 pB 2014-03-01
#5 pC 2014-01-01
#6 pC 2014-02-01
Or, if you already order it, you could arrange it the other way before using slice:
df %>% arrange(person, desc(date)) %>% group_by(person) %>% slice(1:2)
#Source: local data frame [6 x 2]
#Groups: person
#
# person date
#1 pA 2014-04-01
#2 pA 2014-03-01
#3 pB 2014-03-01
#4 pB 2014-02-01
#5 pC 2014-02-01
#6 pC 2014-01-01
See here for a benchmark of a similar question.
I need to replace NAs with the mean of previous three values, by group.
Once an NA is replaced, it will serve as input for computing the mean corresponding to the next NA (if next NA is within the next three months).
Here it is an example:
id date value
1 2017-04-01 40
1 2017-05-01 40
1 2017-06-01 10
1 2017-07-01 NA
1 2017-08-01 NA
2 2014-01-01 27
2 2014-02-01 13
Data:
dt <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), date = structure(c(17257, 17287, 17318, 17348, 17379, 16071, 16102), class = "Date"), value = c(40, 40, 10, NA, NA, 27, 13)), row.names = c(1L, 2L, 3L, 4L, 5L, 8L, 9L), class = "data.frame")
The output should look like:
id date value
1 2017-04-01 40.00
1 2017-05-01 40.00
1 2017-06-01 10.00
1 2017-07-01 30.00
1 2017-08-01 26.66
2 2014-01-01 27.00
2 2014-02-01 13.00
where 26.66 = (30 + 10 + 40)/3
What is an efficient way to do this (i.e. to avoid for loops)?
The following uses base R only and does what you need.
sp <- split(dt, dt$id)
sp <- lapply(sp, function(DF){
for(i in which(is.na(DF$value))){
tmp <- DF[seq_len(i - 1), ]
DF$value[i] <- mean(tail(tmp$value, 3))
}
DF
})
result <- do.call(rbind, sp)
row.names(result) <- NULL
result
# id date value
#1 1 2017-04-01 40.00000
#2 1 2017-05-01 40.00000
#3 1 2017-06-01 10.00000
#4 1 2017-07-01 30.00000
#5 1 2017-08-01 26.66667
#6 2 2014-01-01 27.00000
#7 2 2014-02-01 13.00000
Define a roll function that takes up to 3 previous values (as a list) and the current value. It returns a list holding the previous 2 values together with the current value if the current value is not NA, or together with the mean of the previous values if the current value is NA. Use that with Reduce and pick off the last value of each list in the result. Then apply all that to each group using ave.
roll <- function(prev, cur) {
prev <- unlist(prev)
list(tail(prev, 2), if (is.na(cur)) mean(prev) else cur)
}
reduce_roll <- function(x) {
sapply(Reduce(roll, init = x[1], x[-1], acc = TRUE), tail, 1)
}
transform(dt, value = ave(value, id, FUN = reduce_roll))
giving:
id date value
1 1 2017-04-01 40
2 1 2017-05-01 40
3 1 2017-06-01 10
4 1 2017-07-01 30
5 1 2017-08-01 26.66667
8 2 2014-01-01 27
9 2 2014-02-01 13
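To see what Reduce accumulates, the values of id 1 can be traced step by step. This is a small sketch, not the answer's exact code: roll is copied from above, while the extraction with vapply is my own variation that returns a plain numeric vector.

```r
# roll() as defined above: keep the last 2 previous values plus the new one
roll <- function(prev, cur) {
  prev <- unlist(prev)
  list(tail(prev, 2), if (is.na(cur)) mean(prev) else cur)
}

x <- c(40, 40, 10, NA, NA)   # values of id 1

# One accumulated state per element of x, starting from x[1]
steps <- Reduce(roll, x[-1], x[1], accumulate = TRUE)

# steps[[4]] holds the window c(40, 10) and the filled value 30;
# the last element of each state is the (possibly filled) value
filled <- vapply(steps, function(s) tail(unlist(s), 1), numeric(1))
filled
```

The second NA is filled from the window 40, 10, 30, which is where (30 + 10 + 40)/3 comes from.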
I have a dataset for patient medications with Start.Date and Stop.Date. Each is represented in a row. I would like to merge rows where the time intervals are sequential as below:
ID = c(2, 2, 2, 2, 3, 5)
Medication = c("aspirin", "aspirin", "aspirin", "tylenol", "lipitor", "advil")
Start.Date = c("05/01/2017", "05/05/2017", "06/20/2017", "05/01/2017", "05/06/2017", "05/28/2017")
Stop.Date = c("05/04/2017", "05/10/2017", "06/27/2017", "05/15/2017", "05/12/2017", "06/13/2017")
df = data.frame(ID, Medication, Start.Date, Stop.Date)
ID Medication Start.Date Stop.Date
2 aspirin 05/01/2017 05/04/2017
2 aspirin 05/05/2017 05/10/2017
2 aspirin 06/20/2017 06/27/2017
2 tylenol 05/01/2017 05/15/2017
3 lipitor 05/06/2017 05/12/2017
5 advil 05/28/2017 06/13/2017
I would like to reduce rows by ID and medication if the Stop.Date for one is a day before the next Start.Date. It should look like below:
ID Medication Start.Date Stop.Date
2 aspirin 05/01/2017 05/10/2017
2 aspirin 06/20/2017 06/27/2017
2 tylenol 05/01/2017 05/15/2017
3 lipitor 05/06/2017 05/12/2017
5 advil 05/28/2017 06/13/2017
library(tidyverse)
library(lubridate)
df%>%
group_by(ID,Medication)%>%
mutate_at(vars(Start.Date,Stop.Date),mdy)%>%
mutate(Start.Date = coalesce(
if_else((Start.Date-lag(Stop.Date))==1,lag(Start.Date),Start.Date),Start.Date),
s = lead(Start.Date)!=Start.Date)%>%
filter(s|is.na(s))%>%
select(-s)
# A tibble: 5 x 4
# Groups: ID, Medication [4]
ID Medication Start.Date Stop.Date
<dbl> <chr> <date> <date>
1 2 aspirin 2017-05-01 2017-05-10
2 2 aspirin 2017-06-20 2017-06-27
3 2 tylenol 2017-05-01 2017-05-15
4 3 lipitor 2017-05-06 2017-05-12
5 5 advil 2017-05-28 2017-06-13
How about this?
df %>%
mutate_at(vars(ends_with("Date")), function(x) as.Date(x, format = "%m/%d/%Y")) %>%
group_by(ID, Medication) %>%
mutate(
isConsecutive = lead(Start.Date) - Stop.Date == 1,
isConsecutive = ifelse(
is.na(isConsecutive) & lag(isConsecutive) == TRUE, FALSE, isConsecutive),
grp = cumsum(isConsecutive)) %>%
group_by(ID, Medication, grp) %>%
mutate(Start.Date = min(Start.Date), Stop.Date = max(Stop.Date)) %>%
slice(1) %>%
ungroup() %>%
select(-isConsecutive, -grp)
## A tibble: 5 x 4
# ID Medication Start.Date Stop.Date
# <dbl> <fct> <date> <date>
#1 2. aspirin 2017-05-01 2017-05-10
#2 2. aspirin 2017-06-20 2017-06-27
#3 2. tylenol 2017-05-01 2017-05-15
#4 3. lipitor 2017-05-06 2017-05-12
#5 5. advil 2017-05-28 2017-06-13
Best to test this with a few more examples to ensure robustness. Let's try a more complex example
df <- structure(list(ID = c(2, 2, 2, 2, 2, 3, 5, 5), Medication = structure(c(2L,
2L, 2L, 2L, 4L, 3L, 1L, 1L), .Label = c("advil", "aspirin", "lipitor",
"tylenol"), class = "factor"), Start.Date = structure(c(1L, 2L,
6L, 7L, 1L, 3L, 4L, 5L), .Label = c("05/01/2017", "05/05/2017",
"05/06/2017", "05/28/2017", "06/14/2017", "06/20/2017", "06/28/2017"
), class = "factor"), Stop.Date = structure(c(2L, 3L, 8L, 1L,
5L, 4L, 6L, 7L), .Label = c("04/30/2017", "05/04/2017", "05/10/2017",
"05/12/2017", "05/15/2017", "06/13/2017", "06/20/2017", "06/27/2017"
), class = "factor")), .Names = c("ID", "Medication", "Start.Date",
"Stop.Date"), row.names = c(NA, -8L), class = "data.frame")
df;
# ID Medication Start.Date Stop.Date
#1 2 aspirin 05/01/2017 05/04/2017
#2 2 aspirin 05/05/2017 05/10/2017
#3 2 aspirin 06/20/2017 06/27/2017
#4 2 aspirin 06/28/2017 04/30/2017
#5 2 tylenol 05/01/2017 05/15/2017
#6 3 lipitor 05/06/2017 05/12/2017
#7 5 advil 05/28/2017 06/13/2017
#8 5 advil 06/14/2017 06/20/2017
Note that here we have two consecutive blocks for ID=2 (rows 1+2 and rows 3+4), as well as one consecutive block for ID=5 (rows 7+8).
Output is
df %>%
mutate_at(vars(ends_with("Date")), function(x) as.Date(x, format = "%m/%d/%Y")) %>%
group_by(ID, Medication) %>%
mutate(
isConsecutive = lead(Start.Date) - Stop.Date == 1,
isConsecutive = ifelse(
is.na(isConsecutive) & lag(isConsecutive) == TRUE, FALSE, isConsecutive),
grp = cumsum(isConsecutive)) %>%
group_by(ID, Medication, grp) %>%
mutate(Start.Date = min(Start.Date), Stop.Date = max(Stop.Date)) %>%
slice(1) %>%
ungroup() %>%
select(-isConsecutive, -grp)
## A tibble: 5 x 4
# ID Medication Start.Date Stop.Date
# <dbl> <fct> <date> <date>
#1 2. aspirin 2017-05-01 2017-05-10
#2 2. aspirin 2017-06-20 2017-06-27
#3 2. tylenol 2017-05-01 2017-05-15
#4 3. lipitor 2017-05-06 2017-05-12
#5 5. advil 2017-05-28 2017-06-20
Results seem to be robust.
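One case the examples above do not cover is a run of three or more back-to-back intervals, where grp = cumsum(isConsecutive) appears to put the first row of the run in a different group from the rest. A possible alternative, sketched here in base R under my own assumptions, is to flag the rows that start a new block and take the cumulative sum of that flag. merge_runs is a hypothetical helper name, and the test data with three consecutive aspirin intervals is made up for illustration.

```r
# merge_runs is a hypothetical helper: merge rows of the same ID and
# Medication whose Start.Date is exactly one day after the previous Stop.Date
merge_runs <- function(d) {
  d$Start.Date <- as.Date(as.character(d$Start.Date), format = "%m/%d/%Y")
  d$Stop.Date  <- as.Date(as.character(d$Stop.Date),  format = "%m/%d/%Y")
  d <- d[order(d$ID, d$Medication, d$Start.Date), ]
  n <- nrow(d)
  key <- paste(d$ID, d$Medication)
  # TRUE when a row belongs to the same ID/Medication as the row before it
  same_key <- c(FALSE, key[-1] == key[-n])
  # days between this row's start and the previous row's stop
  gap <- c(NA, as.numeric(d$Start.Date[-1] - d$Stop.Date[-n]))
  # a row opens a new block unless it directly continues the previous row
  new_block <- !(same_key & !is.na(gap) & gap == 1)
  grp <- cumsum(new_block)
  out <- do.call(rbind, lapply(split(d, grp), function(g) {
    data.frame(ID = g$ID[1], Medication = as.character(g$Medication[1]),
               Start.Date = min(g$Start.Date), Stop.Date = max(g$Stop.Date),
               stringsAsFactors = FALSE)
  }))
  row.names(out) <- NULL
  out
}

# Made-up data: three consecutive aspirin intervals, then a gap
meds <- data.frame(
  ID = c(2, 2, 2, 2, 2),
  Medication = c("aspirin", "aspirin", "aspirin", "aspirin", "tylenol"),
  Start.Date = c("05/01/2017", "05/05/2017", "05/11/2017", "06/20/2017", "05/01/2017"),
  Stop.Date  = c("05/04/2017", "05/10/2017", "05/20/2017", "06/27/2017", "05/15/2017"),
  stringsAsFactors = FALSE
)
res <- merge_runs(meds)
res
```

Because each row is compared only with the row before it, a run of any length collapses into a single block.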
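Convert the 'Start.Date' and 'Stop.Date' columns to Date class with mdy (from lubridate), group by 'ID' and 'Medication', and filter out the rows where the absolute difference between the lead of 'Start.Date' and 'Stop.Date' equals 1. Note that this keeps the last row of each consecutive run as it is, so the start date of the run is not carried over (aspirin starts on 2017-05-05 in the output below, not on 2017-05-01).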
library(dplyr)
library(lubridate)
df %>%
mutate_at(3:4, mdy) %>%
group_by(ID, Medication) %>%
filter(abs(lead(Start.Date, default = last(Start.Date)) - Stop.Date) != 1)
# A tibble: 5 x 4
# Groups: ID, Medication [4]
# ID Medication Start.Date Stop.Date
# <dbl> <fct> <date> <date>
#1 2 aspirin 2017-05-05 2017-05-10
#2 2 aspirin 2017-06-20 2017-06-27
#3 2 tylenol 2017-05-01 2017-05-15
#4 3 lipitor 2017-05-06 2017-05-12
#5 5 advil 2017-05-28 2017-06-13
Or using the similar methodology in data.table
library(data.table)
setDT(df)[df[, (shift(mdy(Start.Date), type = 'lead',
fill = last(Start.Date)) - mdy(Stop.Date)) != 1 , ID]$V1]
# ID Medication Start.Date Stop.Date
#1: 2 aspirin 05/05/2017 05/10/2017
#2: 2 aspirin 06/20/2017 06/27/2017
#3: 2 tylenol 05/01/2017 05/15/2017
#4: 3 lipitor 05/06/2017 05/12/2017
#5: 5 advil 05/28/2017 06/13/2017
NOTE: We could convert the Date columns to Date class first as before
NOTE2: Both are simple methods based on the example provided by the OP
Input:
Aim:
Create a new column named 'dayDifference' with the following rule: for each 'item-city' pair, calculate the day difference between the dates of that pair.
Desired output:
Rows 1 and 2 [pair Pizza-Berlin] correspond to 2 because there are 2 days between 2 Feb and 4 Feb
Row 3 [pair Pizza-Hamburg] corresponds to 0 because there is no day difference
Rows 4 and 5 [pair Pasta-Hamburg] correspond to 10 because there are 10 days from 10 Oct to 20 Oct
Row 6 [pair Pasta-Berlin] corresponds to 0 because there is no day difference
Info: Of course there can be more than 2 rows per pair [for instance, the pair 'pizza-berlin' could have 100 rows]: if so, always take the max(date) and subtract the min(date) of the pizza-berlin pair.
Constraint:
Need to be done in R [e.g. no outside connection with a database]
Source code:
df <- structure(list(id = c(4848L, 4887L, 4899L, 4811L, 4834L, 4892L
), item = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Pasta",
"Pizza"), class = "factor"), city = structure(c(1L, 1L, 2L, 2L,
2L, 1L), .Label = c("Berlin", "Hamburg"), class = "factor"),
date = structure(c(17199, 17201, -643892, 17449, 17459, 17515
), class = "Date")), .Names = c("id", "item", "city", "date"
), row.names = c(NA, -6L), class = "data.frame")
I would do it using data.table:
library(data.table)
setDT(df)
df[, min_date := min(date), by = c("item", "city")]
df[, max_date := max(date), by = c("item", "city")]
df[, dayDifference := difftime(max_date, min_date, units = "days")]
df[, c("min_date", "max_date") := NULL]
It'll give you the desired output:
id item city date dayDifference
1: 4848 Pizza Berlin 2017-02-02 2 days
2: 4887 Pizza Berlin 2017-02-04 2 days
3: 4899 Pizza Hamburg 0207-02-01 0 days
4: 4811 Pasta Hamburg 2017-10-10 10 days
5: 4834 Pasta Hamburg 2017-10-20 10 days
6: 4892 Pasta Berlin 2017-12-15 0 days
You can also use df[, dayDifference := max_date - min_date] instead of df[, dayDifference := difftime(max_date, min_date, units = "days")].
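The same max minus min rule can also be sketched in base R with ave, without any packages. This is a minimal sketch under my own assumptions: the grouping key built with paste is a hypothetical helper variable, and the data frame is rebuilt here from the dput in the question.

```r
# Rebuild the data from the question (dates kept as in the dput,
# including the odd year in row 3)
df <- data.frame(
  id = c(4848L, 4887L, 4899L, 4811L, 4834L, 4892L),
  item = c("Pizza", "Pizza", "Pizza", "Pasta", "Pasta", "Pasta"),
  city = c("Berlin", "Berlin", "Hamburg", "Hamburg", "Hamburg", "Berlin"),
  date = structure(c(17199, 17201, -643892, 17449, 17459, 17515), class = "Date"),
  stringsAsFactors = FALSE
)

# One key per item-city pair; ave() applies the function within each key
# and returns a vector aligned with the original rows
key <- paste(df$item, df$city)
df$dayDifference <- ave(as.numeric(df$date), key,
                        FUN = function(x) max(x) - min(x))
df
```

A pair with a single row gets 0 because its max and min are the same date, and a pair with many rows still uses only its earliest and latest date.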
Reduce is an awesome function
library(dplyr)
df %>%
group_by(item, city) %>%
mutate(dayDifference=abs(Reduce(`-`, as.numeric(range(date)))))
# A tibble: 6 x 5
# Groups: item, city [4]
id item city date dayDifference
<int> <fctr> <fctr> <date> <dbl>
1 4848 Pizza Berlin 2017-02-02 2
2 4887 Pizza Berlin 2017-02-04 2
3 4899 Pizza Hamburg 0207-02-01 0
4 4811 Pasta Hamburg 2017-10-10 10
5 4834 Pasta Hamburg 2017-10-20 10
6 4892 Pasta Berlin 2017-12-15 0
Not pretty, but...
i<-unique(lapply(1:nrow(df),function(x) which(paste(df[,2],df[,3]) %in% paste(df[x,2],df[x,3]))))
for(j in 1:length(i)) df[i[[j]],"days"]<-abs(difftime(df[i[[j]],][1,"date"],df[i[[j]],][2,"date"]))
> df
id item city date days
1 4848 Pizza Berlin 2017-02-02 2
2 4887 Pizza Berlin 2017-02-04 2
3 4899 Pizza Hamburg 0207-02-01 NA
4 4811 Pasta Hamburg 2017-10-10 10
5 4834 Pasta Hamburg 2017-10-20 10
6 4892 Pasta Berlin 2017-12-15 NA
ID Date
1 1-1-2016
1 2-1-2016
1 3-1-2016
2 5-1-2016
3 6-1-2016
3 11-1-2016
3 12-1-2016
4 7-1-2016
5 9-1-2016
5 19-1-2016
5 20-1-2016
6 11-04-2016
6 12-04-2016
6 16-04-2016
6 04-08-2016
6 05-08-2016
6 06-08-2016
The expected data frame is based on runs of consecutive dates:
1st_Date is the date of the first visit
2nd_Date is the first date from which he visited on 2 consecutive days
3rd_Date is the first date from which he visited on 3 consecutive days
For example:
For ID = 1, he visited for the first time on 1-1-2016, and his first runs of 2 and of 3 consecutive visits also began on 1-1-2016.
Similarly, for ID = 2, he visited only once, so the rest remain blank.
For ID = 3, he visited for the first time on 6-1-2016 but visited on 2 consecutive days starting on 11-1-2016.
NOTE: This only has to be done up to the earliest 3rd date.
Expected Output
ID 1st_Date 2nd_Date 3rd_Date
1 1-1-2016 1-1-2016 1-1-2016
2 5-1-2016 NA NA
3 6-1-2016 11-1-2016 NA
4 7-1-2016 NA NA
5 9-1-2016 19-1-2016 NA
6 11-04-2016 11-04-2016 04-08-2016
Here is an attempt using dplyr and tidyr. The first thing to do is to convert your Date column to Date class with as.Date and group_by the IDs. We next create a few new variables. The first one, new, checks which dates are consecutive. Date is then updated to give NA for those consecutive dates. However, if not all the dates are consecutive, we filter out the ones that were converted to NA. We then fill (replace NA with the latest non-NA date for each ID), remove the unwanted columns and spread.
library(dplyr)
library(tidyr)
df %>%
mutate(Date = as.Date(Date, format = '%d-%m-%Y')) %>%
group_by(ID) %>%
mutate(new = cumsum(c(1, diff.difftime(Date, units = 'days'))),
Date = replace(Date, c(0, diff(new)) == 1, NA),
new1 = sum(is.na(Date)),
new2 = seq(n())) %>%
filter(!is.na(Date)|new1 != 1) %>%
fill(Date) %>%
select(-c(new, new1)) %>%
spread(new2, Date) %>%
select(ID:`3`)
# ID `1` `2` `3`
#* <int> <date> <date> <date>
#1 1 2016-01-01 2016-01-01 2016-01-01
#2 2 2016-01-05 <NA> <NA>
#3 3 2016-01-06 2016-01-11 <NA>
#4 4 2016-01-07 <NA> <NA>
#5 5 2016-01-09 2016-01-09 2016-01-09
With your updated data set, it gives
# ID `1` `2` `3`
#* <int> <date> <date> <date>
#1 1 2016-01-01 2016-01-01 2016-01-01
#2 2 2016-01-05 <NA> <NA>
#3 3 2016-01-06 2016-01-11 <NA>
#4 4 2016-01-07 <NA> <NA>
#5 5 2016-01-09 2016-01-19 <NA>
DATA USED
dput(df)
structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L, 5L,
5L), Date = structure(c(1L, 5L, 7L, 8L, 9L, 2L, 3L, 10L, 11L,
4L, 6L), .Label = c("1-1-2016", "11-1-2016", "12-1-2016", "19-1-2016",
"2-1-2016", "20-1-2016", "3-1-2016", "5-1-2016", "6-1-2016",
"7-1-2016", "9-1-2016"), class = "factor")), .Names = c("ID",
"Date"), class = "data.frame", row.names = c(NA, -11L))
Use reshape. Code below assumes z is your data frame where date is a numeric date/time variable, ordered increasingly.
# a "set" variable represents a set of consecutive dates
z$set <- unsplit(tapply(z$date, z$ID, function(x) cumsum(diff(c(x[1], x)) > 1)), z$ID)
# "first.date" represents the first date in the set (of consecutive dates)
z$first.date <- unsplit(lapply(split(z$date, z[, c("ID", "set")]), min), z[, c("ID", "set")])
# "occurrence" is a consecutive occurrence #
z$occurrence <- unsplit(lapply(split(seq(nrow(z)), z$ID), seq_along), z$ID)
reshape(z[, c("ID", "first.date", "occurrence")], direction = "wide",
idvar = "ID", v.names = "first.date", timevar = "occurrence")
The result:
ID first.date.1 first.date.2 first.date.3
1 1 2016-01-01 2016-01-01 2016-01-01
4 2 2016-01-05 <NA> <NA>
5 3 2016-01-06 2016-01-11 2016-01-11
8 4 2016-01-07 <NA> <NA>
9 5 2016-01-09 2016-01-09 2016-01-09
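Another way to read the expected output in the question is: for each ID, find the start of the first run of at least 1, 2 and 3 consecutive days. Below is a base R sketch of that reading. It is my interpretation rather than either answer's method; first_runs is a hypothetical helper name, and it assumes at most one row per ID and day.

```r
# first_runs is a hypothetical helper: for k = 1..max_len, return the start
# date of the first run of at least k consecutive days (NA if there is none)
first_runs <- function(dates, max_len = 3) {
  d <- sort(as.numeric(dates))
  # a new run starts whenever the gap to the previous date is not 1 day
  run_id <- cumsum(c(TRUE, diff(d) != 1))
  starts <- d[!duplicated(run_id)]   # first date of each run
  lens <- tabulate(run_id)           # length of each run
  picks <- vapply(seq_len(max_len),
                  function(k) starts[which(lens >= k)[1]], numeric(1))
  as.Date(picks, origin = "1970-01-01")
}

# IDs 3 and 6 from the question
visits <- data.frame(
  ID = c(3, 3, 3, 6, 6, 6, 6, 6, 6),
  Date = as.Date(c("6-1-2016", "11-1-2016", "12-1-2016",
                   "11-04-2016", "12-04-2016", "16-04-2016",
                   "04-08-2016", "05-08-2016", "06-08-2016"),
                 format = "%d-%m-%Y")
)
res <- lapply(split(visits$Date, visits$ID), first_runs)
res
```

For ID 6 the 2-day run starts on the first visit, while the 3-day run only starts in August, which matches the last row of the expected output.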
I have a dataframe that looks like this:
id time value
01 2014-02-26 13:00:00 6
02 2014-02-26 15:00:00 6
01 2014-02-26 18:00:00 6
04 2014-02-26 21:00:00 7
02 2014-02-27 09:00:00 6
03 2014-02-27 12:00:00 6
The dataframe consists of a mood score at different time stamps throughout the day of multiple patients.
I want the dataframe to become like this:
id 2014-02-26 2014-02-27
01 6.25 4.32
02 5.39 8.12
03 9.23 3.18
04 5.76 3.95
Each row should be a patient and each column the daily mean for one of the days in the dataframe. If a patient has no mood score on a specific date, I want the value to be NA.
What is the easiest way to do so using functions like ddply, or from other packages?
df <- structure(list(id = c(1L, 2L, 1L, 4L, 2L, 3L), time = structure(c(1393437600,
1393444800, 1393455600, 1393466400, 1393509600, 1393520400), class = c("POSIXct",
"POSIXt"), tzone = ""), value = c(6L, 6L, 6L, 7L, 6L, 6L)), .Names = c("id",
"time", "value"), row.names = c(NA, -6L), class = "data.frame")
Based on your description, this seems to be what you need,
library(tidyverse)
df %>%
group_by(id, time1 = format(time, '%Y-%m-%d')) %>%
summarise(new = mean(value)) %>%
spread(time1, new)
#Source: local data frame [4 x 3]
#Groups: id [4]
# id `2014-02-26` `2014-02-27`
#* <int> <dbl> <dbl>
#1 1 6 NA
#2 2 6 6
#3 3 NA 6
#4 4 7 NA
In base R, you could combine aggregate with reshape like this:
# get means by id-date
temp <- setNames(aggregate(value ~ id + format(time, "%y-%m-%d"), data=df, FUN=mean),
c("id", "time", "value"))
# reshape to get dates as columns
reshape(temp, direction="wide", idvar="id", timevar="time")
id value.14-02-26 value.14-02-27
1 1 6 NA
2 2 6 6
3 4 7 NA
5 3 NA 6
I'd recommend using the data.table package; the approach is then very similar to Sotos' tidyverse solution.
library(data.table)
df <- data.table(df)
df[, time1 := format(time, '%Y-%m-%d')]
aggregated <- df[, list(meanvalue = mean(value)), by=c("id", "time1")]
aggregated <- dcast(aggregated, id~time1, value.var="meanvalue")
aggregated
# id 2014-02-26 2014-02-27
# 1: 1 6 NA
# 2: 2 6 6
# 3: 3 NA 6
# 4: 4 NA 7
(I think my result differs because my system runs in another time zone; I imported the datetime objects as UTC.)
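The time zone effect can be made explicit by passing tz when the timestamp is turned into a calendar day. The following base R sketch assumes UTC and uses tapply, which fills missing id-day combinations with NA; daily_mean is a hypothetical helper name, not something from the answers above.

```r
# Rebuild the data from the dput, with the time zone fixed to UTC
df <- data.frame(
  id = c(1L, 2L, 1L, 4L, 2L, 3L),
  time = as.POSIXct(c(1393437600, 1393444800, 1393455600,
                      1393466400, 1393509600, 1393520400),
                    origin = "1970-01-01", tz = "UTC"),
  value = c(6L, 6L, 6L, 7L, 6L, 6L)
)

# daily_mean is a hypothetical helper: mean value per id and calendar day,
# where the calendar day depends on the chosen time zone
daily_mean <- function(data, tz = "UTC") {
  day <- format(data$time, "%Y-%m-%d", tz = tz)
  tapply(data$value, list(id = data$id, day = day), mean)
}

m <- daily_mean(df, tz = "UTC")
m
```

In UTC the score of id 4 falls on 2014-02-27, as in the data.table result, whereas in a zone a few hours behind UTC it would still count towards 2014-02-26.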