I have a database of information pertaining to individuals observed over time. I would like to obtain the age of each individual at the time each record was taken. Assuming BIRTH corresponds to an age of 0, I would like the age, in either days or months, at each subsequent visit. It would also be helpful to obtain a final age (in days or months) for each individual (*not included in the code). For example, for ID (A), the final age would be 10 months. I would like to use the lubridate package, as its built-in date features make it easier to work with dates. Any help with this is much appreciated.
date<-c("2000-01-01","2000-01-14","2000-01-25","2000-02-12","2000-02-27","2000-06-05","2000-10-30",
"2001-02-04","2001-06-15","2001-12-26","2002-05-22","2002-06-04",
"2000-01-08","2000-07-11","2000-08-18","2000-11-27")
ID<-c("A","A","A","A","A","A","A",
"B","B","B","B","B",
"C","C","C","C")
status<-c("BIRTH","ETC","ETC","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
df1<-data.frame(date,ID,status)
print(df1)
date ID status
1 2000-01-01 A BIRTH
2 2000-01-14 A ETC
3 2000-01-25 A ETC
4 2000-02-12 A ETC
5 2000-02-27 A ETC
6 2000-06-05 A ETC
7 2000-10-30 A ETC
8 2001-02-04 B BIRTH
9 2001-06-15 B ETC
10 2001-12-26 B ETC
11 2002-05-22 B ETC
12 2002-06-04 B ETC
13 2000-01-08 C BIRTH
14 2000-07-11 C ETC
15 2000-08-18 C ETC
16 2000-11-27 C ETC
date.new<-c("2000-01-01","2000-01-14","2000-01-25","2000-02-12","2000-02-27","2000-06-05","2000-10-30",
"2001-02-04","2001-06-15","2001-12-26","2002-05-22","2001-02-04",
"2000-01-08","2000-07-11","2000-08-18","2000-11-27")
ID.new<-c("A","A","A","A","A","A","A",
"B","B","B","B","B",
"C","C","C","C")
status.new<-c("BIRTH","ETC","ETC","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
age<-c(0,1,1,2,2,6,10,
0,4,10,15,16,
0,6,7,10)
df2<-data.frame(date.new,ID.new,status.new,age)
print(df2)
date.new ID.new status.new age
1 2000-01-01 A BIRTH 0
2 2000-01-14 A ETC 1
3 2000-01-25 A ETC 1
4 2000-02-12 A ETC 2
5 2000-02-27 A ETC 2
6 2000-06-05 A ETC 6
7 2000-10-30 A ETC 10
8 2001-02-04 B BIRTH 0
9 2001-06-15 B ETC 4
10 2001-12-26 B ETC 10
11 2002-05-22 B ETC 15
12 2001-02-04 B ETC 16
13 2000-01-08 C BIRTH 0
14 2000-07-11 C ETC 6
15 2000-08-18 C ETC 7
16 2000-11-27 C ETC 10
For calculations related to age in years or months, I'd like to encourage you to try the clock package rather than lubridate. lubridate is a great package, but produces some unexpected results with these kinds of calculations if you aren't 100% sure of what you are doing. In clock, the function to do this is date_count_between(). Notice that one of the results is different between clock and lubridate here:
library(clock)
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
date = c("2000-01-01","2000-01-14",
"2000-01-25","2000-02-12","2000-02-27","2000-06-05",
"2000-10-30","2001-02-04","2001-06-15","2001-12-26",
"2002-05-22","2002-06-04","2000-01-08","2000-07-11",
"2000-08-18","2000-11-27"),
ID = c("A","A","A","A","A","A",
"A","B","B","B","B","B","C","C","C","C"),
status = c("BIRTH","ETC","ETC","ETC",
"ETC","ETC","ETC","BIRTH","ETC","ETC","ETC","ETC",
"BIRTH","ETC","ETC","ETC")
)
df %>%
mutate(date = date_parse(date)) %>%
group_by(ID) %>%
mutate(birth_date = date[status == "BIRTH"]) %>%
ungroup() %>%
mutate(
age_clock = date_count_between(birth_date, date, "month"),
age_lubridate = as.period(date - birth_date) %/% months(1))
#> # A tibble: 16 × 6
#> date ID status birth_date age_clock age_lubridate
#> <date> <chr> <chr> <date> <int> <dbl>
#> 1 2000-01-01 A BIRTH 2000-01-01 0 0
#> 2 2000-01-14 A ETC 2000-01-01 0 0
#> 3 2000-01-25 A ETC 2000-01-01 0 0
#> 4 2000-02-12 A ETC 2000-01-01 1 1
#> 5 2000-02-27 A ETC 2000-01-01 1 1
#> 6 2000-06-05 A ETC 2000-01-01 5 5
#> 7 2000-10-30 A ETC 2000-01-01 9 9
#> 8 2001-02-04 B BIRTH 2001-02-04 0 0
#> 9 2001-06-15 B ETC 2001-02-04 4 4
#> 10 2001-12-26 B ETC 2001-02-04 10 10
#> 11 2002-05-22 B ETC 2001-02-04 15 15
#> 12 2002-06-04 B ETC 2001-02-04 16 15
#> 13 2000-01-08 C BIRTH 2000-01-08 0 0
#> 14 2000-07-11 C ETC 2000-01-08 6 6
#> 15 2000-08-18 C ETC 2000-01-08 7 7
#> 16 2000-11-27 C ETC 2000-01-08 10 10
clock says that 2001-02-04 to 2002-06-04 is 16 months, while the lubridate method here only says it is 15 months. This has to do with the fact that the lubridate calculation uses the length of an average month, which doesn't always accurately reflect how we think about months.
Consider this simple example. I think most people would agree that a child born on February 28 is "1 month and 1 day" old on March 29. But lubridate shows 0 months!
library(clock)
library(lubridate, warn.conflicts = FALSE)
# "1 month and 1 day apart"
feb <- as.Date("2020-02-28")
mar <- as.Date("2020-03-29")
# As expected when thinking about age in months
date_count_between(feb, mar, "month")
#> [1] 1
# Not expected
as.period(mar - feb) %/% months(1)
#> [1] 0
secs_in_day <- 86400
secs_in_month <- as.numeric(months(1))
secs_in_month / secs_in_day
#> [1] 30.4375
# Less than 30.4375 days, so not 1 month
mar - feb
#> Time difference of 30 days
The issue is that lubridate uses the length of an average month in the computation, which is 30.4375 days. But there are only 30 days between these two dates, so it isn't considered a full month.
clock, on the other hand, uses the day component of the starting date to determine if a "full month" has passed or not. In other words, because we have passed the 28th of March, clock decides that 1 month has passed, which is consistent with how we generally think about age.
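To make that rule concrete, here is a minimal base R sketch of "count a month once the day-of-month of the start date has been reached". The helper name months_between is hypothetical, and this is not clock's implementation: it ignores month-end edge cases (for example a start date on the 31st), which clock handles with its own rules.

```r
# Count complete calendar months between two Dates (base R only).
# Hypothetical helper for illustration; not how clock is implemented.
months_between <- function(start, end) {
  s <- as.POSIXlt(start)
  e <- as.POSIXlt(end)
  # Difference in calendar months, ignoring the day-of-month
  whole <- (e$year - s$year) * 12L + (e$mon - s$mon)
  # Subtract one when the end day has not yet reached the start day
  whole - as.integer(e$mday < s$mday)
}

months_between(as.Date("2020-02-28"), as.Date("2020-03-29"))
#> [1] 1
months_between(as.Date("2001-02-04"), as.Date("2002-06-04"))
#> [1] 16
```

Both results agree with the date_count_between() values shown above.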
Using dplyr and lubridate, we can do the following. We first turn the date column into a date. Then we group by ID, find the birth date and calculate the number of months since that date via some lubridate magic (see How do I use the lubridate package to calculate the number of months between two date vectors where one of the vectors has NA values?).
library(dplyr)
library(lubridate)
df1 %>%
mutate(date = as_date(date)) %>%
group_by(ID) %>%
mutate(birth_date = date[status == "BIRTH"],
age = as.period(date - birth_date) %/% months(1)) %>%
ungroup()
Which gives:
date ID status birth_date age
<date> <fct> <fct> <date> <dbl>
1 2000-01-01 A BIRTH 2000-01-01 0
2 2000-01-14 A ETC 2000-01-01 0
3 2000-01-25 A ETC 2000-01-01 0
4 2000-02-12 A ETC 2000-01-01 1
5 2000-02-27 A ETC 2000-01-01 1
6 2000-06-05 A ETC 2000-01-01 5
7 2000-10-30 A ETC 2000-01-01 9
8 2001-02-04 B BIRTH 2001-02-04 0
9 2001-06-15 B ETC 2001-02-04 4
10 2001-12-26 B ETC 2001-02-04 10
11 2002-05-22 B ETC 2001-02-04 15
12 2002-06-04 B ETC 2001-02-04 15
13 2000-01-08 C BIRTH 2000-01-08 0
14 2000-07-11 C ETC 2000-01-08 6
15 2000-08-18 C ETC 2000-01-08 7
16 2000-11-27 C ETC 2000-01-08 10
Which is your expected output except for some rounding differences. See my comment on your question.
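The question also asks for a final age per individual, which the code above does not compute. A minimal base R sketch, assuming each ID has exactly one BIRTH row, could look like this. The data frame visits is a shortened, hypothetical subset of df1, and age is measured in days to sidestep the month-counting question.

```r
# Shortened subset of the question's data (IDs A and B only)
visits <- data.frame(
  date = as.Date(c("2000-01-01", "2000-06-05", "2000-10-30",
                   "2001-02-04", "2002-06-04")),
  ID = c("A", "A", "A", "B", "B"),
  status = c("BIRTH", "ETC", "ETC", "BIRTH", "ETC")
)

# Named lookup of birth dates, one per ID (assumes one BIRTH row per ID)
is_birth <- visits$status == "BIRTH"
birth <- visits$date[is_birth]
names(birth) <- as.character(visits$ID[is_birth])

# Age in days at every visit, then the maximum per individual
visits$age_days <- as.integer(visits$date - birth[as.character(visits$ID)])
final_age_days <- tapply(visits$age_days, visits$ID, max)
final_age_days
```

With this subset, final_age_days is 303 for A and 485 for B.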
I currently have the following data frame:
> head(Coyote_reports_garbage)
# A tibble: 6 x 4
name_1 Date Day Collection
<chr> <date> <chr> <chr>
1 PLEASANTVIEW 2013-02-20 Wednesday Friday
2 MCCONACHIE AREA 2012-11-20 Tuesday Friday
3 MAYLIEWAN 2013-11-28 Thursday Friday
4 BROOKSIDE 2013-12-18 Wednesday Thursday
5 KIRKNESS 2012-11-14 Wednesday Friday
6 RIDEAU PARK 2013-11-15 Friday Friday
Where "name_1" represents the name of a neighbourhood, "Date" represents the date when a report was made, "Day" represents the day of the week on which that report was made (in relation to the date), and "Collection" represents the garbage day in that neighbourhood. "Collection" therefore varies per neighbourhood and year.
I am trying to add a column (Day_in_relation_to_collection) where the day would be related to Collection day. If the day of the week is the same as the garbage collection day, Day_in_relation_to_collection = 0. If the day of the week is a day after collection day, Day_in_relation_to_collection = 1, etc.
name_1 Date Day Collection Day_in_relation_to_collection
<chr> <date> <chr> <chr>
1 PLEASANTVIEW 2013-02-20 Wednesday Friday 5
2 MCCONACHIE AREA 2012-11-20 Tuesday Friday 4
3 MAYLIEWAN 2013-11-28 Thursday Friday 6
4 BROOKSIDE 2013-12-18 Wednesday Thursday 6
5 KIRKNESS 2012-11-14 Wednesday Friday 5
6 RIDEAU PARK 2013-11-15 Friday Friday 0
I'm not quite sure how to do this, so any help would be appreciated.
I'm assuming here that Day will always be after Collection, and it will always be the next instance of that day. If so, a simple way to do that would be to make a reference matrix setting up the number of days between a combination of 2 days of the week and then using that to fill in this value:
dnames <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
day_table <- matrix(c(0:6,6,0:5,5:6,0:4,4:6,0:3,3:6,0:2,2:6,0:1,1:6,0),
nrow=7, ncol=7, byrow=T,
dimnames = list(dnames, dnames))
day_table
Sunday Monday Tuesday Wednesday Thursday Friday Saturday
Sunday 0 1 2 3 4 5 6
Monday 6 0 1 2 3 4 5
Tuesday 5 6 0 1 2 3 4
Wednesday 4 5 6 0 1 2 3
Thursday 3 4 5 6 0 1 2
Friday 2 3 4 5 6 0 1
Saturday 1 2 3 4 5 6 0
Now we can just access the values of Coyote_reports_garbage$Collection and Coyote_reports_garbage$Day to access values in that table to get the appropriate value. We can either run this as a tidyverse mutate statement, or insert it using base R.
Either way, we need to use diag here, as subsetting a matrix with 2 vectors gives a matrix with all combinations of the selected values. The diagonal of that matrix will give the result you want here:
library(tidyverse)
Coyote_reports_garbage %>%
mutate(Day_in_relation_to_collection = diag(day_table[Collection,Day]))
name_1 Date Day Collection Day_in_relation_to_collection
1 PLEASANTVIEW 2013-02-20 Wednesday Friday 5
2 MCCONACHIE AREA 2012-11-20 Tuesday Friday 4
3 MAYLIEWAN 2013-11-28 Thursday Friday 6
4 BROOKSIDE 2013-12-18 Wednesday Thursday 6
5 KIRKNESS 2012-11-14 Wednesday Friday 5
6 RIDEAU PARK 2013-11-15 Friday Friday 0
Or in base R
Coyote_reports_garbage$dr_collect <- diag(day_table[Coyote_reports_garbage$Collection,
Coyote_reports_garbage$Day])
Coyote_reports_garbage
name_1 Date Day Collection dr_collect
1 PLEASANTVIEW 2013-02-20 Wednesday Friday 5
2 MCCONACHIE AREA 2012-11-20 Tuesday Friday 4
3 MAYLIEWAN 2013-11-28 Thursday Friday 6
4 BROOKSIDE 2013-12-18 Wednesday Thursday 6
5 KIRKNESS 2012-11-14 Wednesday Friday 5
6 RIDEAU PARK 2013-11-15 Friday Friday 0
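As an alternative sketch under the same assumption (Day is the next occurrence on or after collection day), the lookup table can be replaced by modular arithmetic on weekday positions. This avoids building the n-by-n matrix that diag() extracts from, which may matter for large data. The function name days_since_collection is hypothetical.

```r
dnames <- c("Sunday", "Monday", "Tuesday", "Wednesday",
            "Thursday", "Friday", "Saturday")

# Days elapsed since the most recent collection day, in 0..6
days_since_collection <- function(day, collection) {
  (match(day, dnames) - match(collection, dnames)) %% 7
}

day        <- c("Wednesday", "Tuesday", "Thursday", "Wednesday", "Wednesday", "Friday")
collection <- c("Friday", "Friday", "Friday", "Thursday", "Friday", "Friday")
days_since_collection(day, collection)
#> [1] 5 4 6 6 5 0
```

The same function can be called inside mutate() on the Day and Collection columns.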
I'm trying, for each row, to calculate the difference with the closest previous row belonging to the same group which meets a certain criterion.
Suppose I have the following dataframe:
s <- read.table(text = "Visit_num Patient Day Admitted
1 1 2015/01/01 Yes
2 1 2015/01/10 No
3 1 2015/01/15 Yes
4 1 2015/02/10 No
5 1 2015/03/08 Yes
6 2 2015/01/01 Yes
7 2 2015/04/01 No
8 2 2015/04/10 No
9 3 2015/04/01 No
10 3 2015/04/10 No", header = T, sep = "")
For each Visit_num and for each Patient, I'd like to get the difference with the closest previous row for which the patient was admitted (i.e. Yes). Note that the Day column is sorted chronologically, and the time unit for this example is days.
Here is what I wanted my dataframe to look like:
Visit_num Patient Day Admitted Diff_days
1 1 2015/01/01 Yes NA
2 1 2015/01/10 No 9
3 1 2015/01/15 Yes 14
4 1 2015/02/10 No 26
5 1 2015/03/08 Yes 52
6 2 2015/01/01 Yes NA
7 2 2015/04/01 No 90
8 2 2015/04/10 No 99
9 3 2015/04/01 No NA
10 3 2015/04/10 No NA
Any help is appreciated.
Here is an option with tidyverse. Convert 'Day' to Date class, arrange by 'Patient' and 'Day', and, grouped by 'Patient', get the difference between adjacent 'Day' values. Then create a group 'grp' based on the occurrence of 'Yes' in 'Admitted' and take the cumulative sum of 'Diff_days' within it.
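library(tidyverse)
library(lubridate) # for ymd(); not attached by tidyverse versions before 2.0.0
s %>%
mutate(Day = ymd(Day)) %>%
arrange(Patient, Day) %>%
group_by(Patient) %>%
mutate(Diff_days = c(NA, diff(Day))) %>%
# .add replaces the deprecated add argument (dplyr >= 1.0.0)
group_by(grp = cumsum(lag(Admitted == "Yes", default = TRUE)), .add = TRUE) %>%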
mutate(Diff_days = cumsum(replace_na(Diff_days, 0))) %>%
ungroup %>%
select(-grp) %>%
mutate(Diff_days = na_if(Diff_days, 0))
# A tibble: 8 x 5
# Visit_num Patient Day Admitted Diff_days
# <int> <int> <date> <fct> <dbl>
#1 1 1 2015-01-01 Yes NA
#2 2 1 2015-01-10 No 9
#3 3 1 2015-01-15 Yes 14
#4 4 1 2015-02-10 No 26
#5 5 1 2015-03-08 Yes 52
#6 6 2 2015-01-01 Yes NA
#7 7 2 2015-04-01 No 90
#8 8 2 2015-04-10 No 99
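For comparison, here is a base R sketch that looks up the most recent earlier admission directly, rather than accumulating adjacent differences. It assumes rows are already sorted by Day within each Patient. The helper name days_since_admission is hypothetical. Unlike the pipeline above, it returns NA for every visit of a patient who was never admitted, as in the question's desired output.

```r
s <- data.frame(
  Patient = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Day = as.Date(c("2015/01/01", "2015/01/10", "2015/01/15", "2015/02/10",
                  "2015/03/08", "2015/01/01", "2015/04/01", "2015/04/10",
                  "2015/04/01", "2015/04/10"), format = "%Y/%m/%d"),
  Admitted = c("Yes", "No", "Yes", "No", "Yes", "Yes", "No", "No", "No", "No")
)

# Days since the most recent strictly earlier admitted visit (NA if none)
days_since_admission <- function(day, admitted) {
  n <- length(day)
  idx <- ifelse(admitted == "Yes", seq_len(n), 0L)  # row index where admitted
  prev <- c(0L, cummax(idx)[-n])                    # last admitted row before each row
  out <- rep(NA_integer_, n)
  has_prev <- prev > 0
  out[has_prev] <- as.integer(day[has_prev] - day[prev[has_prev]])
  out
}

# Apply the helper patient by patient
s$Diff_days <- NA_integer_
for (rows in split(seq_len(nrow(s)), s$Patient)) {
  s$Diff_days[rows] <- days_since_admission(s$Day[rows], s$Admitted[rows])
}
s$Diff_days
#> [1] NA  9 14 26 52 NA 90 99 NA NA
```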
I have a dataset that contains the residence period (start.date to end.date) of marked individuals (ID) at different sites. My goal is to generate a column that tells me the average number of other individuals per day that were also present at the same site (across the total residence period of each individual).
To do this, I need to determine the total number of individuals that were present per site on each date, summed across the total residence period of each individual. Ultimately, I will divide this sum by the total residence days of each individual to calculate the average. Can anyone help me accomplish this?
I calculated the total number of residence days (total.days) using lubridate and dplyr
mutate(total.days = end.date - start.date + 1)
site ID start.date end.date total.days
1 1 16 5/24/17 6/5/17 13
2 1 46 4/30/17 5/20/17 21
3 1 26 4/30/17 5/23/17 24
4 1 89 5/5/17 5/13/17 9
5 1 12 5/11/17 5/14/17 4
6 2 14 5/4/17 5/10/17 7
7 2 18 5/9/17 5/29/17 21
8 2 19 5/24/17 6/10/17 18
9 2 39 5/5/17 5/18/17 14
First of all, it is always advisable to share a sample of the data in a friendlier format using dput(yourData), so that others can easily recreate your data. Here is the dput() output that would have been better to share:
> dput(dat)
structure(list(site = c(1, 1, 1, 1, 1, 2, 2, 2, 2), ID = c(16,
46, 26, 89, 12, 14, 18, 19, 39), start.date = structure(c(17310,
17286, 17286, 17291, 17297, 17290, 17295, 17310, 17291), class = "Date"),
end.date = structure(c(17322, 17306, 17309, 17299, 17300,
17296, 17315, 17327, 17304), class = "Date")), class = "data.frame",
row.names = c(NA, -9L))
To do this easily we first need to unpack the start.date and end.date to individual dates:
newDat <- data.frame()
for (i in 1:nrow(dat)){
expand <- data.frame(site = dat$site[i],
ID = dat$ID[i],
Dates = seq.Date(dat$start.date[i], dat$end.date[i], 1))
newDat <- rbind(newDat, expand)
}
newDat
site ID Dates
1 1 16 2017-05-24
2 1 16 2017-05-25
3 1 16 2017-05-26
4 1 16 2017-05-27
5 1 16 2017-05-28
6 1 16 2017-05-29
7 1 16 2017-05-30
. . .
. . .
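Growing a data frame with rbind() inside a loop can get slow on larger inputs. A sketch of the same expansion that builds all pieces first and binds them once is shown below, using a hypothetical two-row subset named small. It iterates over row indices because mapply()/Map() would strip the Date class from the columns.

```r
# Two-row subset of the data, for illustration only
small <- data.frame(
  site = c(1, 1),
  ID = c(89, 12),
  start.date = as.Date(c("2017-05-05", "2017-05-11")),
  end.date = as.Date(c("2017-05-13", "2017-05-14"))
)

# One data frame per individual, with one row per day of residence
pieces <- lapply(seq_len(nrow(small)), function(i) {
  data.frame(site = small$site[i],
             ID = small$ID[i],
             Dates = seq(small$start.date[i], small$end.date[i], by = "day"))
})

# Bind everything in a single step
expanded <- do.call(rbind, pieces)
nrow(expanded)
#> [1] 13
```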
Then we calculate the number of other individuals present in each site in each day:
individualCount = newDat %>%
group_by(site, Dates) %>%
summarise(individuals = n_distinct(ID) - 1)
individualCount
# A tibble: 75 x 3
# Groups: site [?]
site Dates individuals
<dbl> <date> <int>
1 1 2017-04-30 1
2 1 2017-05-01 1
3 1 2017-05-02 1
4 1 2017-05-03 1
5 1 2017-05-04 1
6 1 2017-05-05 2
7 1 2017-05-06 2
8 1 2017-05-07 2
9 1 2017-05-08 2
10 1 2017-05-09 2
# ... with 65 more rows
Then, we augment our data with the new information using left_join() and calculate the required average:
newDat <- left_join(newDat, individualCount, by = c("site", "Dates")) %>%
group_by(site, ID) %>%
summarise(duration = max(Dates) - min(Dates)+1,
av.individuals = mean(individuals))
newDat
# A tibble: 9 x 4
# Groups: site [?]
site ID duration av.individuals
<dbl> <dbl> <time> <dbl>
1 1 12 4 2.75
2 1 16 13 0
3 1 26 24 1.42
4 1 46 21 1.62
5 1 89 9 2.33
6 2 14 7 1.14
7 2 18 21 0.857
8 2 19 18 0.333
9 2 39 14 1.14
The final step is to add the required column to the original dataset (dat) again with left_join():
dat <- dat %>% left_join(newDat, by = c("site", "ID"))
dat
site ID start.date end.date duration av.individuals
1 1 16 2017-05-24 2017-06-05 13 days 0.000000
2 1 46 2017-04-30 2017-05-20 21 days 1.619048
3 1 26 2017-04-30 2017-05-23 24 days 1.416667
4 1 89 2017-05-05 2017-05-13 9 days 2.333333
5 1 12 2017-05-11 2017-05-14 4 days 2.750000
6 2 14 2017-05-04 2017-05-10 7 days 1.142857
7 2 18 2017-05-09 2017-05-29 21 days 0.857143
8 2 19 2017-05-24 2017-06-10 18 days 0.333333
9 2 39 2017-05-05 2017-05-18 14 days 1.142857
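As a cross-check that does not expand the date ranges, the same average can be sketched in base R from pairwise overlaps: for each individual, sum the number of days shared with every other individual at the same site and divide by that individual's residence days. The helper name overlap_days is hypothetical, and both start and end dates are assumed to be inclusive.

```r
dat <- data.frame(
  site = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  ID = c(16, 46, 26, 89, 12, 14, 18, 19, 39),
  start.date = as.Date(c("2017-05-24", "2017-04-30", "2017-04-30", "2017-05-05",
                         "2017-05-11", "2017-05-04", "2017-05-09", "2017-05-24",
                         "2017-05-05")),
  end.date = as.Date(c("2017-06-05", "2017-05-20", "2017-05-23", "2017-05-13",
                       "2017-05-14", "2017-05-10", "2017-05-29", "2017-06-10",
                       "2017-05-18"))
)

# Number of days two inclusive date ranges have in common (0 if disjoint)
overlap_days <- function(s1, e1, s2, e2) {
  pmax(0, pmin(as.numeric(e1), as.numeric(e2)) -
          pmax(as.numeric(s1), as.numeric(s2)) + 1)
}

dat$av.individuals <- sapply(seq_len(nrow(dat)), function(i) {
  others <- dat$site == dat$site[i] & dat$ID != dat$ID[i]
  shared <- overlap_days(dat$start.date[i], dat$end.date[i],
                         dat$start.date[others], dat$end.date[others])
  total_days <- as.numeric(dat$end.date[i] - dat$start.date[i]) + 1
  sum(shared) / total_days
})
round(dat$av.individuals, 3)
#> [1] 0.000 1.619 1.417 2.333 2.750 1.143 0.857 0.333 1.143
```

These values match the av.individuals column in the final table above.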
I'm looking for a quicker way to replace the days of the week in an R data frame with numbers. Essentially, given one vector of values and a corresponding vector of replacements, is there a quick way to apply the replacement to a data frame column?
Here is my dataframe:
month day_of_week skies
1 APR Tuesday Clear
2 APR Wednesday Cloudy
3 APR Thursday Cloudy
4 APR Friday Cloudy
5 APR Saturday Cloudy
6 APR Sunday Clear
The days of the week are in the following vector:
daysweek <- unique(df$day_of_week)
daysweek
[1] Tuesday Wednesday Thursday Friday Saturday Sunday Monday
The corresponding vector is:
days_num <- c(2,3,4,5,6,7,1)
The long way I would do it is without the corresponding vector and using gsub individually. I was wondering if there was a quick way to do it. I couldn't figure it out with a for loop.
for (i in c(1:7)) {
df$result <- gsub(daysweek[i], days_num[i], df$day_of_week)
}
The desired output would be:
month day_of_week skies
1 APR 2 Clear
2 APR 3 Cloudy
3 APR 4 Cloudy
4 APR 5 Cloudy
5 APR 6 Cloudy
6 APR 7 Clear
Create an index of weekdays and match it with the day_of_week column.
Date <- as.Date('2014-12-29') #Monday
Wdays <- weekdays(seq(Date, length.out=7, by= '1 day'))
df[,2] <- match(df[,2],Wdays)
df[,2]
#[1] 2 3 4 5 6 7
Or you can convert the column to factor with levels from Monday to Sunday and convert it to numeric
as.numeric(factor(df$day_of_week, levels=c("Monday", "Tuesday",
"Wednesday", "Thursday", "Friday", "Saturday", "Sunday")))
#[1] 2 3 4 5 6 7
Update
If you have a vector of numeric indices that correspond to the unique values in the day_of_week column, you can use a named vector as a lookup:
Un <- c('Tuesday', 'Wednesday', 'Thursday', 'Friday',
'Saturday', 'Sunday', 'Monday')
days_num <- c(2,3,4,5,6,7,1)
set.seed(24)
day_of_week <- sample(Un, 20, replace=TRUE)
unname(setNames(days_num, Un)[day_of_week])
#[1] 4 3 6 5 6 1 3 7 7 3 6 4 6 6 4 1 3 2 5 2
Because you used gsub, another option would be mgsub from qdap
library(qdap)
as.numeric(mgsub(Un, days_num, day_of_week))
#[1] 4 3 6 5 6 1 3 7 7 3 6 4 6 6 4 1 3 2 5 2
or
library(qdapTools)
day_of_week %l% data.frame(Un, days_num)
#[1] 4 3 6 5 6 1 3 7 7 3 6 4 6 6 4 1 3 2 5 2