Calculating age over multiple dataframes based on name of dataframe - r

I was wondering if someone here can help me with an lapply question.
Every month, data are extracted and the data frames are named according to the date extracted (01-08-2019,01-09-2019,01-10-2019 etc). The contents of each data frame are similar to the example below:
01-09-2019
ID DOB
3 01-07-2019
5 01-06-2019
7 01-05-2019
8 01-09-2019
01-10-2019
ID DOB
2 01-10-2019
5 01-06-2019
8 01-09-2019
9 01-02-2019
As the months roll on, there are more data sets being downloaded.
I want to calculate the ages of people in each of the data sets based on the date the data was extracted. In essence, the age would be the difference between the date in the data frame name and the DOB variable.
01-09-2019
ID DOB AGE(months)
3 01-07-2019 2
5 01-06-2019 3
7 01-05-2019 4
8 01-09-2019 0
01-10-2019
ID DOB AGE(months)
2 01-10-2019 0
5 01-06-2019 4
8 01-09-2019 1
9 01-02-2019 8
I was thinking of putting all of the data frames together in a list (as there are a lot) and then using lapply to calculate age across all data frames. How do I go about calculating the difference between a data frame name and a column?

If I may suggest a slightly different approach: it might make more sense to combine your list into a single data frame before calculating the ages. Suppose your data looks something like this, i.e. it is a list of data frames where the list element names are the dates of access:
$`01-09-2019`
# A tibble: 4 x 2
ID DOB
<dbl> <date>
1 3 2019-07-01
2 5 2019-06-01
3 7 2019-05-01
4 8 2019-09-01
$`01-10-2019`
# A tibble: 4 x 2
ID DOB
<dbl> <date>
1 2 2019-10-01
2 5 2019-06-01
3 8 2019-09-01
4 9 2019-02-01
You can call bind_rows first with parameter .id = "date_extracted" to turn your list into a data frame, and then calculate age in months.
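library(tidyverse)
library(lubridate)
tib <- bind_rows(tib_list, .id = "date_extracted") %>%
  mutate(
    # the list names become the date_extracted column; parse them as day-month-year
    date_extracted = dmy(date_extracted),
    # DOB is already a <date> in the list above; wrap it in dmy() only if yours is character
    # whole months elapsed, which stays correct when the dates span different years
    age_months = interval(DOB, date_extracted) %/% months(1)
  )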
#### OUTPUT ####
# A tibble: 8 x 4
date_extracted ID DOB age_months
<date> <dbl> <date> <dbl>
1 2019-09-01 3 2019-07-01 2
2 2019-09-01 5 2019-06-01 3
3 2019-09-01 7 2019-05-01 4
4 2019-09-01 8 2019-09-01 0
5 2019-10-01 2 2019-10-01 0
6 2019-10-01 5 2019-06-01 4
7 2019-10-01 8 2019-09-01 1
8 2019-10-01 9 2019-02-01 8
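If the extraction dates ever span more than one calendar year, subtracting month numbers alone is no longer enough, because December to January would come out as -11. Here is a minimal base R sketch of a year-aware count. Note that months_between is a hypothetical helper name (it is not a lubridate function), and it deliberately ignores the day of the month, which suits data that always falls on the first.

```r
# Hypothetical helper: calendar months between two dates,
# counting year changes and ignoring the day of the month.
months_between <- function(from, to) {
  from <- as.POSIXlt(from)
  to <- as.POSIXlt(to)
  # $year counts years since 1900 and $mon runs from 0 to 11
  12 * (to$year - from$year) + (to$mon - from$mon)
}

# November 2019 to February 2020 crosses a year boundary
months_between(as.Date("2019-11-01"), as.Date("2020-02-01"))
```

The function is vectorised, so it can be applied directly to a whole DOB column against a single extraction date.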

This can be solved with lapply as well, but once all the data frames are in a list we can also use Map to iterate over the list elements and their names together. In base R,
Map(function(x, y) {
  # the dates are day-month-year strings, so give as.Date an explicit format
  x$DOB <- as.Date(x$DOB, format = "%d-%m-%Y")
  transform(x, age = as.integer(format(as.Date(y, format = "%d-%m-%Y"), "%m")) -
    as.integer(format(x$DOB, "%m")))
}, list_df, names(list_df))
#$`01-09-2019`
#  ID        DOB age
#1  3 2019-07-01   2
#2  5 2019-06-01   3
#3  7 2019-05-01   4
#4  8 2019-09-01   0
#$`01-10-2019`
#  ID        DOB age
#1  2 2019-10-01   0
#2  5 2019-06-01   4
#3  8 2019-09-01   1
#4  9 2019-02-01   8
We can also do the same in tidyverse
library(dplyr)
library(lubridate)
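# parse both the list name (.y) and DOB as day-month-year before taking the month
purrr::imap(list_df, ~.x %>% mutate(age = month(dmy(.y)) - month(dmy(as.character(DOB)))))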
data
list_df <- list(`01-09-2019` = structure(list(ID = c(3L, 5L, 7L, 8L),
DOB = structure(c(3L, 2L, 1L, 4L), .Label = c("01-05-2019", "01-06-2019",
"01-07-2019", "01-09-2019"), class = "factor")), class = "data.frame",
row.names = c(NA, -4L)), `01-10-2019` = structure(list(ID = c(2L, 5L, 8L, 9L),
DOB = structure(c(4L, 2L, 3L, 1L), .Label = c("01-02-2019",
"01-06-2019", "01-09-2019", "01-10-2019"), class = "factor")),
class = "data.frame", row.names = c(NA, -4L)))
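Since the question asked specifically about lapply, the same idea can be sketched by looping over the names of the list and looking each data frame up by name. This is only an illustration under the assumption that DOB is stored as day-month-year text; the column name age_months and the result name with_age are made up for the example.

```r
# Small stand-in for the list in the question (DOB kept as text)
list_df <- list(
  `01-09-2019` = data.frame(ID = c(3, 5, 7, 8),
    DOB = c("01-07-2019", "01-06-2019", "01-05-2019", "01-09-2019")),
  `01-10-2019` = data.frame(ID = c(2, 5, 8, 9),
    DOB = c("01-10-2019", "01-06-2019", "01-09-2019", "01-02-2019"))
)

# Loop over the names so the extraction date is available inside the function
with_age <- lapply(names(list_df), function(nm) {
  x <- list_df[[nm]]
  extracted <- as.POSIXlt(as.Date(nm, format = "%d-%m-%Y"))
  dob <- as.POSIXlt(as.Date(as.character(x$DOB), format = "%d-%m-%Y"))
  # calendar months between DOB and the extraction date, year-aware
  x$age_months <- 12 * (extracted$year - dob$year) + (extracted$mon - dob$mon)
  x
})
# lapply over a character vector does not keep the names, so restore them
names(with_age) <- names(list_df)
```

Looping over names rather than over the list itself is what gives lapply access to the data frame name.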

Using dates and numbers as data frame names is bad practice; consider prefixing the date with an "x", as in this base R solution:
df_list <- list(x01_09_2019 = `01-09-2019`, x01_10_2019 = `01-10-2019`)
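# add each list name as a report_date column
df_list <- mapply(cbind, "report_date" = names(df_list), df_list, SIMPLIFY = FALSE)
df_list <- lapply(df_list, function(x){
  # strip the "x" prefix and parse the rest as day-month-year
  x$report_date <- as.Date(gsub("_", "-", gsub("x", "", x$report_date)), "%d-%m-%Y")
  # difference of two Dates: age in days (a difftime), not months
  x$Age <- x$report_date - x$DOB
  return(x)
}
)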
Data:
`01-09-2019` <- structure(list(ID = c(3, 5, 7, 8),
DOB = structure(c(18078, 18048, 18017, 18140), class = "Date")),
class = "data.frame", row.names = c(NA, -4L))
`01-10-2019` <- structure(list(ID = c(2, 5, 8, 9),
DOB = structure(c(18170, 18048, 18140, 17928), class = "Date")),
class = "data.frame", row.names = c(NA, -4L))
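When there are many monthly data frames, typing each one into list() by hand gets tedious. One possible shortcut, sketched here under the assumption that every data frame in the workspace is named in the dd-mm-yyyy pattern and nothing else is, is to collect them with ls() and mget(). The two tiny data frames are stand-ins for the real extracts.

```r
# Stand-ins for two monthly extracts
`01-09-2019` <- data.frame(ID = 3, DOB = as.Date("2019-07-01"))
`01-10-2019` <- data.frame(ID = 2, DOB = as.Date("2019-10-01"))

# Find every object whose name looks like dd-mm-yyyy ...
df_names <- ls(pattern = "^[0-9]{2}-[0-9]{2}-[0-9]{4}$")
# ... and fetch them into a named list in one step
df_list <- mget(df_names)

names(df_list)
```

The list names are the original object names, so they can be parsed as dates afterwards.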

Related

Find if a time exists between two different times, type issue

I have a dataframe as so
df <- structure(list(TIME = c("11:15:00", NA, "15:15:00", "12:00:00",
"18:40:00", "18:15:00", "7:10:00", "15:58:00", "10:00:00", "10:00:00"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
And I basically want to create a new variable which tells me if the time is in a certain group.
I wrote the following but it's not correct. I tried changing to as.POSIXct but no dice.
df <- df %>%
mutate(time_groups = ifelse(between(as.POSIXct(TIME),00:00, 5:59), 1,
ifelse(between(as.POSIXct(TIME),06:00, 8:59), 2,
ifelse(between(as.POSIXct(TIME),09:00,11:59), 3,
ifelse(between(as.POSIXct(TIME),12:00,14:59), 4,
ifelse(between(as.POSIXct(TIME),15:00,17:59), 5,
ifelse(between(as.POSIXct(TIME),18:00,23:59), 6,
), NA)
You could use the findInterval function:
library(tidyverse)
library(lubridate)
# lower bound of each group; findInterval() returns the index of the last break <= the time
a <- c("00:00", "06:00", "09:00", "12:00", "15:00", "18:00")
b <- ymd_hm(paste(Sys.Date(), a))
df %>%
mutate(Interval = findInterval(ymd_hms(paste(Sys.Date(), TIME)), b))
TIME Interval
<chr> <int>
1 11:15:00 3
2 NA NA
3 15:15:00 5
4 12:00:00 4
5 18:40:00 6
6 18:15:00 6
7 7:10:00 2
8 15:58:00 5
9 10:00:00 3
10 10:00:00 3
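The same grouping can also be sketched without attaching today's date to the times, by converting each time to minutes since midnight first. This is only an illustration: to_minutes is a hypothetical helper, the sample vector is made up, and the breaks are the lower bounds of the six groups from the question.

```r
# Hypothetical sample; the last value sits just before the 06:00 boundary
time_strings <- c("11:15:00", NA, "15:15:00", "7:10:00", "05:59:30")

# Convert "H:MM:SS" text to minutes since midnight; NA stays NA
to_minutes <- function(s) {
  parts <- strsplit(s, ":", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 3) return(NA_real_)
    sum(as.numeric(p) * c(60, 1, 1 / 60))
  }, numeric(1))
}

# Lower bounds of the groups (00:00, 06:00, 09:00, 12:00, 15:00, 18:00) in minutes
breaks <- c(0, 6, 9, 12, 15, 18) * 60
groups <- findInterval(to_minutes(time_strings), breaks)
groups
```

Because the breaks are lower bounds, a time such as 05:59:30 still falls in group 1 rather than slipping into group 2.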

how to filter data based on the latest date of a date group?

I know my question is not as clear as it should be, so I hope my explanation makes it more comprehensible. I have data like this:
# total_call data
call_id | from_number | retrieved_date
1 1 2020-01-12 12:03:34
2 1 2020-01-12 12:06:34
3 2 2020-01-15 13:02:40
4 2 2020-01-15 13:05:40
5 1 2020-01-12 13:09:34
I want to group the calls by from_number and retrieved_date, where each call's time must be within 1 hour of the earliest call in its group. After 1 hour, a call belongs to a new group. Then I want to keep the latest call of each group. This is the result I want:
# total_call data
call_id | from_number | retrieved_date
2 1 2020-01-12 12:06:34
4 2 2020-01-15 13:05:40
5 1 2020-01-12 13:09:34
Thanks for your attention. I’m looking forward to your reply.
We convert retrieved_date to POSIXct, arrange the data, start a new group whenever the current retrieved_date is more than an hour after the previous one, and select the row with the maximum retrieved_date in each group.
library(dplyr)
df %>%
mutate(retrieved_date = lubridate::ymd_hms(retrieved_date)) %>%
arrange(from_number, retrieved_date) %>%
group_by(from_number) %>%
group_by(gr = cumsum(difftime(retrieved_date, lag(retrieved_date,
default = first(retrieved_date)), units = "hours") > 1), .add = TRUE) %>%
slice(which.max(retrieved_date)) %>%
ungroup() %>%
select(-gr)
# A tibble: 3 x 3
# call_id from_number retrieved_date
# <int> <int> <dttm>
#1 2 1 2020-01-12 12:06:34
#2 5 1 2020-01-12 13:09:34
#3 4 2 2020-01-15 13:05:40
data
df <- structure(list(call_id = 1:5, from_number = c(1L, 1L, 2L, 2L,
1L), retrieved_date = structure(c(1L, 2L, 4L, 5L, 3L),
.Label = c("2020-01-12 12:03:34","2020-01-12 12:06:34", "2020-01-12 13:09:34",
"2020-01-15 13:02:40", "2020-01-15 13:05:40"), class = "factor")),
class = "data.frame", row.names = c(NA, -5L))
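The question words the rule as "within 1 hour of the earliest call in the group", which is slightly different from comparing each call with the previous one: a long chain of calls 50 minutes apart would stay in one group under the gap rule but not under the window rule. A base R sketch of the window rule follows; window_id and the calls data frame are hypothetical names, and the timestamps are assumed to be in UTC.

```r
calls <- data.frame(
  call_id = 1:5,
  from_number = c(1L, 1L, 2L, 2L, 1L),
  retrieved_date = as.POSIXct(c("2020-01-12 12:03:34", "2020-01-12 12:06:34",
    "2020-01-15 13:02:40", "2020-01-15 13:05:40", "2020-01-12 13:09:34"), tz = "UTC")
)
calls <- calls[order(calls$from_number, calls$retrieved_date), ]

# Hypothetical helper: number the windows for one caller's sorted times (in seconds).
# A new window opens when a call is more than width_secs after the window's first call.
window_id <- function(secs, width_secs = 3600) {
  id <- numeric(length(secs))
  start <- secs[1]
  current <- 1
  for (i in seq_along(secs)) {
    if (secs[i] - start > width_secs) {
      current <- current + 1
      start <- secs[i]
    }
    id[i] <- current
  }
  id
}

calls$gr <- ave(as.numeric(calls$retrieved_date), calls$from_number, FUN = window_id)

# Keep the latest call of every (from_number, window) pair
pieces <- split(calls, list(calls$from_number, calls$gr), drop = TRUE)
latest <- do.call(rbind, lapply(pieces, function(g) g[which.max(as.numeric(g$retrieved_date)), ]))
```

On the sample data both rules keep the same three calls; they only diverge when calls arrive in long chains.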

Fill Dates based on Consecutive occurrences

ID Date
1 1-1-2016
1 2-1-2016
1 3-1-2016
2 5-1-2016
3 6-1-2016
3 11-1-2016
3 12-1-2016
4 7-1-2016
5 9-1-2016
5 19-1-2016
5 20-1-2016
6 11-04-2016
6 12-04-2016
6 16-04-2016
6 04-08-2016
6 05-08-2016
6 06-08-2016
The expected data frame is based on runs of consecutive dates:
1st_Date is the date of the first visit
2nd_Date is the date on which the first run of 2 consecutive days of visits began
3rd_Date is the date on which the first run of 3 consecutive days of visits began
For example:
For ID = 1, the first visit was on 1-1-2016, and the first runs of 2 and of 3 consecutive visits also began on 1-1-2016.
Similarly, for ID = 2 there was only 1 visit, so the rest remain blank.
For ID = 3, the first visit was on 6-1-2016, but the first run of 2 consecutive days started on 11-1-2016.
NOTE: This only has to be done up to the earliest 3rd date.
Expected Output
ID 1st_Date 2nd_Date 3rd_Date
1 1-1-2016 1-1-2016 1-1-2016
2 5-1-2016 NA NA
3 6-1-2016 11-1-2016 NA
4 7-1-2016 NA NA
5 9-1-2016 19-1-2016 NA
6 11-04-2016 11-04-2016 04-08-2016
Here is an attempt using dplyr and tidyr. The first thing to do is to convert your Date column with as.Date and group_by the IDs. We next create a few new variables. The first one, new, checks which dates are consecutive. Date is then updated to give NA for those consecutive dates. However, if not all the dates are consecutive, we filter out the ones that were converted to NA. We then fill (replace NA with the latest non-NA date for each ID), remove unwanted columns and spread.
library(dplyr)
library(tidyr)
df %>%
mutate(Date = as.Date(Date, format = '%d-%m-%Y')) %>%
group_by(ID) %>%
mutate(new = cumsum(c(1, diff.difftime(Date, units = 'days'))),
Date = replace(Date, c(0, diff(new)) == 1, NA),
new1 = sum(is.na(Date)),
new2 = seq(n())) %>%
filter(!is.na(Date)|new1 != 1) %>%
fill(Date) %>%
select(-c(new, new1)) %>%
spread(new2, Date) %>%
select(ID:`3`)
# ID `1` `2` `3`
#* <int> <date> <date> <date>
#1 1 2016-01-01 2016-01-01 2016-01-01
#2 2 2016-01-05 <NA> <NA>
#3 3 2016-01-06 2016-01-11 <NA>
#4 4 2016-01-07 <NA> <NA>
#5 5 2016-01-09 2016-01-09 2016-01-09
With your updated data set, it gives
# ID `1` `2` `3`
#* <int> <date> <date> <date>
#1 1 2016-01-01 2016-01-01 2016-01-01
#2 2 2016-01-05 <NA> <NA>
#3 3 2016-01-06 2016-01-11 <NA>
#4 4 2016-01-07 <NA> <NA>
#5 5 2016-01-09 2016-01-19 <NA>
DATA USED
dput(df)
structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L, 5L,
5L), Date = structure(c(1L, 5L, 7L, 8L, 9L, 2L, 3L, 10L, 11L,
4L, 6L), .Label = c("1-1-2016", "11-1-2016", "12-1-2016", "19-1-2016",
"2-1-2016", "20-1-2016", "3-1-2016", "5-1-2016", "6-1-2016",
"7-1-2016", "9-1-2016"), class = "factor")), .Names = c("ID",
"Date"), class = "data.frame", row.names = c(NA, -11L))
Use reshape. Code below assumes z is your data frame where date is a numeric date/time variable, ordered increasingly.
# a "set" variable represents a set of consecutive dates
z$set <- unsplit(tapply(z$date, z$ID, function(x) cumsum(diff(c(x[1], x)) > 1)), z$ID)
# "first.date" represents the first date in the set (of consecutive dates)
z$first.date <- unsplit(lapply(split(z$date, z[, c("ID", "set")]), min), z[, c("ID", "set")])
# "occurrence" numbers the visits within each ID
z$occurrence <- unsplit(lapply(split(seq(nrow(z)), z$ID), seq_along), z$ID)
reshape(z[, c("ID", "first.date", "occurrence")], direction = "wide",
idvar = "ID", v.names = "first.date", timevar = "occurrence")
The result:
ID first.date.1 first.date.2 first.date.3
1 1 2016-01-01 2016-01-01 2016-01-01
4 2 2016-01-05 <NA> <NA>
5 3 2016-01-06 2016-01-11 2016-01-11
8 4 2016-01-07 <NA> <NA>
9 5 2016-01-09 2016-01-09 2016-01-09
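Both answers rest on the same idea: a new run of consecutive dates starts wherever the gap to the previous date is not exactly one day. As a small illustration of that idea for a single ID, the sketch below returns the start of the first run that is at least k days long. The name first_run_start is hypothetical, and the dates are assumed to be Date objects already.

```r
# Hypothetical helper: start date of the first run of at least k consecutive days
first_run_start <- function(dates, k) {
  dates <- sort(unique(dates))
  # a new run begins wherever the gap to the previous date is not exactly 1 day
  run_id <- cumsum(c(1, diff(as.numeric(dates)) != 1))
  # length of the run that each date belongs to
  run_len <- tabulate(run_id)[run_id]
  hit <- which(run_len >= k)
  if (length(hit) == 0) return(as.Date(NA))
  dates[hit[1]]
}

# ID 3 from the question
id3 <- as.Date(c("6-1-2016", "11-1-2016", "12-1-2016"), format = "%d-%m-%Y")
first_three <- lapply(1:3, function(k) first_run_start(id3, k))
```

For ID 3 this gives 2016-01-06, 2016-01-11 and NA, which matches the expected output row in the question.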

Aggregating time-based data of multiple patients to daily averages per patient in R

I have a dataframe that looks like this:
id time value
01 2014-02-26 13:00:00 6
02 2014-02-26 15:00:00 6
01 2014-02-26 18:00:00 6
04 2014-02-26 21:00:00 7
02 2014-02-27 09:00:00 6
03 2014-02-27 12:00:00 6
The dataframe consists of a mood score at different time stamps throughout the day of multiple patients.
I want the dataframe to become like this:
id 2014-02-26 2014-02-27
01 6.25 4.32
02 5.39 8.12
03 9.23 3.18
04 5.76 3.95
Each row should be a patient and each column the daily mean for one of the days in the dataframe. If a patient has no mood score on a specific date, I want the value to be NA.
What is the easiest way to do so using functions like ddply, or from other packages?
df <- structure(list(id = c(1L, 2L, 1L, 4L, 2L, 3L), time = structure(c(1393437600,
1393444800, 1393455600, 1393466400, 1393509600, 1393520400), class = c("POSIXct",
"POSIXt"), tzone = ""), value = c(6L, 6L, 6L, 7L, 6L, 6L)), .Names = c("id",
"time", "value"), row.names = c(NA, -6L), class = "data.frame")
Based on your description, this seems to be what you need,
library(tidyverse)
df %>%
group_by(id, time1 = format(time, '%Y-%m-%d')) %>%
summarise(new = mean(value)) %>%
spread(time1, new)
#Source: local data frame [4 x 3]
#Groups: id [4]
# id `2014-02-26` `2014-02-27`
#* <int> <dbl> <dbl>
#1 1 6 NA
#2 2 6 6
#3 3 NA 6
#4 4 7 NA
In base R, you could combine aggregate with reshape like this:
# get means by id-date
temp <- setNames(aggregate(value ~ id + format(time, "%y-%m-%d"), data=df, FUN=mean),
c("id", "time", "value"))
# reshape to get dates as columns
reshape(temp, direction="wide", idvar="id", timevar="time")
id value.14-02-26 value.14-02-27
1 1 6 NA
2 2 6 6
3 4 7 NA
5 3 NA 6
I'd recommend using the data.table package; the approach is then very similar to Sotos' tidyverse solution.
library(data.table)
df <- data.table(df)
df[, time1 := format(time, '%Y-%m-%d')]
aggregated <- df[, list(meanvalue = mean(value)), by=c("id", "time1")]
aggregated <- dcast.data.table(aggregated, id~time1, value.var="meanvalue")
aggregated
# id 2014-02-26 2014-02-27
# 1: 1 6 NA
# 2: 2 6 6
# 3: 3 NA 6
# 4: 4 NA 7
(I think my result differs because my system runs in another time zone; I imported the datetime objects as UTC.)
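The time zone remark matters because a timestamp near midnight can land on different calendar days depending on the zone used to format it. A base R sketch that fixes the zone explicitly and builds the patient-by-day table with tapply is shown below; the choice of UTC is an assumption for the example, and the timestamps are retyped from the table in the question.

```r
df <- data.frame(
  id = c(1L, 2L, 1L, 4L, 2L, 3L),
  time = as.POSIXct(c("2014-02-26 13:00:00", "2014-02-26 15:00:00",
    "2014-02-26 18:00:00", "2014-02-26 21:00:00",
    "2014-02-27 09:00:00", "2014-02-27 12:00:00"), tz = "UTC"),
  value = c(6, 6, 6, 7, 6, 6)
)

# Derive the calendar day in an explicit time zone so the result
# does not depend on the machine's local settings
day <- format(df$time, "%Y-%m-%d", tz = "UTC")

# Rows are patients, columns are days; missing combinations are NA
daily <- tapply(df$value, list(id = df$id, day = day), mean)
daily
```

The result is a matrix rather than a data frame; as.data.frame.matrix(daily) would convert it if needed.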

How to divide contents of one column by different values, conditional on contents of a second column?

I've got a data frame that looks like something along these lines:
Day Salesperson Value
==== ============ =====
Monday John 40
Monday Sarah 50
Tuesday John 60
Tuesday Sarah 30
Wednesday John 50
Wednesday Sarah 40
I want to divide the value for each salesperson by the number of times that each of the days of the week has occurred. So: there have been 3 Mondays, 3 Tuesdays, and 2 Wednesdays. I don't have this information digitally, but can create a vector along the lines of
c(3, 3, 2)
How can I conditionally divide the Value column based on the number of times each day occurs?
I've found an inelegant solution, which entails copying the Day column to a temp column, replacing each of the names of the week in the new column with the number of times each day occurs using
df$temp <- sub("Monday", 3, df$temp)
but doing this seems kinda clunky. Is there a neat way to do this?
Suppose your auxiliary data is in another data.frame:
Day N_Day
1 Monday 3
2 Tuesday 3
3 Wednesday 2
The simplest way would be to merge:
DF_new <- merge(DF, DF2, by="Day")
DF_new$newcol <- DF_new$Value / DF_new$N_Day
which gives
Day Salesperson Value N_Day newcol
1 Monday John 40 3 13.33333
2 Monday Sarah 50 3 16.66667
3 Tuesday John 60 3 20.00000
4 Tuesday Sarah 30 3 10.00000
5 Wednesday John 50 2 25.00000
6 Wednesday Sarah 40 2 20.00000
The mergeless shortcut is
DF$newcol <- DF$Value / DF2$N_Day[match(DF$Day, DF2$Day)]
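If the counts only exist as a vector like c(3, 3, 2), another option is to name the vector and index it by day. This is a sketch that assumes Day is (or is converted to) character; n_days is a made-up name for the lookup vector.

```r
DF <- data.frame(
  Day = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday", "Wednesday"),
  Salesperson = c("John", "Sarah", "John", "Sarah", "John", "Sarah"),
  Value = c(40, 50, 60, 30, 50, 40),
  stringsAsFactors = FALSE
)

# Named lookup vector: how many times each day has occurred
n_days <- c(Monday = 3, Tuesday = 3, Wednesday = 2)

# Index by name; as.character() guards against Day being a factor,
# which would otherwise be used as integer positions
DF$newcol <- DF$Value / unname(n_days[as.character(DF$Day)])
```

This avoids building a second data frame, at the cost of keeping the counts outside the data.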
Data:
DF <- structure(list(Day = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label =
c("Monday",
"Tuesday", "Wednesday"), class = "factor"), Salesperson = structure(c(1L,
2L, 1L, 2L, 1L, 2L), .Label = c("John", "Sarah"), class = "factor"),
Value = c(40L, 50L, 60L, 30L, 50L, 40L)), .Names = c("Day",
"Salesperson", "Value"), class = "data.frame", row.names = c(NA,
-6L))
DF2 <- structure(list(Day = structure(1:3, .Label = c("Monday", "Tuesday",
"Wednesday"), class = "factor"), N_Day = c(3, 3, 2)), .Names = c("Day",
"N_Day"), row.names = c(NA, -3L), class = "data.frame")
You can use the dplyr library to join your data frame with the frequency of each day.
library(dplyr)
df <- data.frame(
  Day=c("Monday","Monday","Tuesday","Tuesday","Wednesday","Wednesday"),
  Salesperson=c("John","Sarah","John","Sarah","John","Sarah"),
  Value=c(40,50,60,30,50,40), stringsAsFactors=FALSE)
# the count column is named n so that it matches the mutate() call below
aux <- data.frame(
  Day=c("Monday","Tuesday","Wednesday"),
  n=c(3,3,2), stringsAsFactors=FALSE
)
output <- df %>% left_join(aux, by="Day") %>% mutate(Value2=Value/n)
To build the auxiliary table from the number of times each day appears in your original data, instead of typing it manually, you could use:
aux <- df %>% group_by(Day) %>% summarise(n=n())
Re-running the join with this aux gives:
> output
Day Salesperson Value n Value2
1 Monday John 40 2 20
2 Monday Sarah 50 2 25
3 Tuesday John 60 2 30
4 Tuesday Sarah 30 2 15
5 Wednesday John 50 2 25
6 Wednesday Sarah 40 2 20
If you want to overwrite the actual Value column, use mutate(Value=Value/n), and to remove the additional column, add select(-n):
output <- df %>% left_join(aux, by="Day") %>% mutate(Value=Value/n) %>% select(-n)

Resources