Related
I have a list of accounts (300k plus rows), going back six years, with a user number, open and close dates, and other information, such as location. We offer a variety of accounts, and a user can have one or several, in any combination, and both in succession as well as overlapping.
I've been asked to find out how many users we have in any given month. They'd like it split by location, as well as total.
so I have a table like this:
User Open Close Area
1 A 2018-02-13 2018-07-31 West
2 B 2018-02-26 2018-06-04 North
3 B 2018-02-27 2018-03-15 North
4 C 2018-02-27 2018-05-26 South
5 C 2018-03-15 2018-06-03 South
6 D 2018-03-20 2018-07-02 East
7 E 2018-04-01 2018-06-19 West
8 E 2018-04-14 2018-05-04 West
9 F 2018-03-20 2018-04-19 North
10 G 2018-04-26 2018-07-04 South
11 H 2017-29-12 2018-03-21 East
12 I 2016-11-29 2020-04-10 West
13 J 2018-01-31 2018-12-20 West
14 K 2017-10-31 2018-10-30 North
15 K 2018-10-31 2019-10-30 North
And I want to get to one that looks something like this:
Month Total North East South West
1 Feb 18 3 1 0 1 1
2 Mar 18 5 2 1 1 1
3 Apr 18 7 2 1 2 2
4 May 18 6 1 1 2 2
5 Jun 18 6 1 1 2 2
6 Jul 18 3 0 1 1 1
I can filter the data to get to what I need for individual months using
df%>%
filter(Open <= as.Date("2018-04-30") & Close >= as.Date("2018-04-01")) %>%
distinct(PERSON_ID, .keep_all = TRUE) %>%
count(Area)
But what I can't figure out is how to repeat that for every month in the data set automatically. Is there any where of getting r to repeat the above for every month in my data set, and then pass the results into a second table?
Any and all help gratefully received, and many thanks for your time.
Edit: added examples to the source data where Matin Gal's solution returned NA for years
This is a general solution working for dates spanning over more than one year.
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
group_by(rn = row_number()) %>%
mutate(seq = list(seq(month(Open), month(Close) + 12 * (year(Close) - year(Open))))) %>%
unnest(seq) %>%
mutate(
seq_2 = (seq - 1) %% 12 + 1,
month = month(seq_2, label = TRUE),
year = year(Open + months(seq - first(seq)))
) %>%
ungroup() %>%
distinct(User, month, year, Area) %>%
count(month, year, Area) %>%
pivot_wider(
names_from = "Area",
values_from = "n",
values_fill = 0
) %>%
mutate(Total = rowSums(across(c(North, South, West, East))))
returns
month year North South West East Total
<ord> <dbl> <int> <int> <int> <int> <dbl>
1 Feb 2018 1 1 1 0 3
2 Mar 2018 2 1 1 1 5
3 Apr 2018 2 2 2 1 7
4 May 2018 1 2 2 1 6
5 Jun 2018 1 2 2 1 6
6 Jul 2018 0 1 1 1 3
Data
df <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), User = c("A",
"B", "B", "C", "C", "D", "E", "E", "F", "G"), Open = structure(c(17575,
17588, 17589, 17589, 17605, 17610, 17622, 17635, 17610, 17647
), class = "Date"), Close = structure(c(17743, 17686, 17605,
17677, 17685, 17714, 17701, 17655, 17640, 17716), class = "Date"),
Area = c("West", "North", "North", "South", "South", "East",
"West", "West", "North", "South")), problems = structure(list(
row = 10L, col = "Area", expected = "", actual = "embedded null",
file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L), spec = structure(list(
cols = list(id = structure(list(), class = c("collector_double",
"collector")), User = structure(list(), class = c("collector_character",
"collector")), Open = structure(list(format = ""), class = c("collector_date",
"collector")), Close = structure(list(format = ""), class = c("collector_date",
"collector")), Area = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Here's how I'd do it:
library(tidyverse)
set.seed(14159)
## generating some data that looks roughly
## like your data
data <- tibble(
user = sample(LETTERS[1:5], size = 20, replace = TRUE),
open = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 20),
close = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 20),
area = sample(c("N", "E", "S", "W"), 20, replace = T)
) %>%
filter(
close > open
)
data
#> # A tibble: 9 × 4
#> user open close area
#> <chr> <date> <date> <chr>
#> 1 A 1999-04-03 1999-07-28 N
#> 2 B 1999-01-27 1999-05-12 W
#> 3 B 1999-06-05 1999-12-29 W
#> 4 C 1999-09-26 1999-12-30 W
#> 5 C 1999-04-21 1999-12-04 E
#> 6 C 1999-08-11 1999-12-12 N
#> 7 A 1999-02-13 1999-09-16 W
#> 8 E 1999-02-17 1999-05-21 E
#> 9 B 1999-07-26 1999-08-16 S
## figuring out what months are in between open and close
get_months_in_range <- function(open, close) {
seq.Date(
open,
close,
by = "month"
) %>%
list()
}
data %>%
rowwise() %>%
mutate(
Month = get_months_in_range(open, close)
) %>%
ungroup() %>%
unnest_longer(
col = Month
) %>%
count(Month, area) %>%
pivot_wider(
names_from = area,
values_from = n,
values_fill = 0
) %>%
rowwise() %>%
mutate(
Total = sum(
c_across(
-Month
)
)
) %>%
ungroup()
#> # A tibble: 45 × 6
#> Month W E N S Total
#> <date> <int> <int> <int> <int> <int>
#> 1 1999-01-27 1 0 0 0 1
#> 2 1999-02-13 1 0 0 0 1
#> 3 1999-02-17 0 1 0 0 1
#> 4 1999-02-27 1 0 0 0 1
#> 5 1999-03-13 1 0 0 0 1
#> 6 1999-03-17 0 1 0 0 1
#> 7 1999-03-27 1 0 0 0 1
#> 8 1999-04-03 0 0 1 0 1
#> 9 1999-04-13 1 0 0 0 1
#> 10 1999-04-17 0 1 0 0 1
#> # … with 35 more rows
Created on 2021-08-18 by the reprex package (v2.0.1)
It's not the world's sexiest solution, but I think it'll get you where you're trying to go. Basically, I just make a helper function that gives me all the dates between open and close and then you can group by those to figure out how many users you have in any given month. Let me know if you want more explanation about what the long chain of dplyr stuff is doing.
welcome to SO. I can't test this code as you haven't provided a snippet of your data in the right format (see below for a suggestion on this point), but I think the basic idea of what you want to do is extract a month-year value from Open and then use group_by. For example:
library(lubridate)
library(dplyr)
df %>% mutate(
Date = dmy(Open),
Month_Yr = format_ISO8601(Date, precision = "ym")) %>%
group_by(Month_Yr) %>%
distinct(PERSON.ID, .keep_all = TRUE) %>%
count(Area)
Generally when sharing data on SO it's best to use a dput. See ?dput for info on how to use it if you're unsure.
I have a dataframe as so
df <- structure(list(TIME = c("11:15:00", NA, "15:15:00", "12:00:00",
"18:40:00", "18:15:00", "7:10:00", "15:58:00", "10:00:00", "10:00:00"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
And I basically want to create a new variable which tells me if the time is in a certain group.
I wrote the following but it's not correct, tried changing to as.POSICxt but no dice.
df <- df %>%
mutate(time_groups = ifelse(between(as.POSIXct(TIME),00:00, 5:59), 1,
ifelse(between(as.POSIXct(TIME),06:00, 8:59), 2,
ifelse(between(as.POSIXct(TIME),09:00,11:59), 3,
ifelse(between(as.POSIXct(TIME),12:00,14:59), 4,
ifelse(between(as.POSIXct(TIME),15:00,17:59), 5,
ifelse(between(as.POSIXct(TIME),18:00,23:59), 6,
), NA)
You could use the findInterval function:
library(tidyverse)
library(lubridate)
a <- c("00:00","5:59", "8:59", "11:59", "14:59", "17:59", "23:59")
b <- ymd_hm(paste(Sys.Date(), a))
df %>%
mutate(Interval = findInterval(ymd_hms(paste(Sys.Date(), TIME)), b))
TIME Interval
<chr> <int>
1 11:15:00 3
2 NA NA
3 15:15:00 5
4 12:00:00 4
5 18:40:00 6
6 18:15:00 6
7 7:10:00 2
8 15:58:00 5
9 10:00:00 3
10 10:00:00 3
I was wondering if someone here can help me with a lapply question.
Every month, data are extracted and the data frames are named according to the date extracted (01-08-2019,01-09-2019,01-10-2019 etc). The contents of each data frame are similar to the example below:
01-09-2019
ID DOB
3 01-07-2019
5 01-06-2019
7 01-05-2019
8 01-09-2019
01-10-2019
ID DOB
2 01-10-2019
5 01-06-2019
8 01-09-2019
9 01-02-2019
As the months roll on, there are more data sets being downloaded.
I am wanting to calculate the ages of people in each of the data sets based on the date the data was extracted - so in essence, the age would be the date difference between the data frame name and the DOB variable.
01-09-2019
ID DOB AGE(months)
3 01-07-2019 2
5 01-06-2019 3
7 01-05-2019 4
8 01-09-2019 0
01-10-2019
ID DOB AGE(months)
2 01-10-2019 0
5 01-06-2019 4
8 01-09-2019 1
9 01-02-2019 8
I was thinking of putting all of the data frames together in a list (as there are a lot) and then using lapply to calculate age across all data frames. How do I go about calculating the difference between a data frame name and a column?
If I may suggest a slightly differen approach: It might make more sense to compress your list into a single data frame before calculating the ages. Given your data looks something like this, i.e. it is a list of data frames, where the list element names are the dates of access:
$`01-09-2019`
# A tibble: 4 x 2
ID DOB
<dbl> <date>
1 3 2019-07-01
2 5 2019-06-01
3 7 2019-05-01
4 8 2019-09-01
$`01-10-2019`
# A tibble: 4 x 2
ID DOB
<dbl> <date>
1 2 2019-10-01
2 5 2019-06-01
3 8 2019-09-01
4 9 2019-02-01
You can call bind_rows first with parameter .id = "date_extracted" to turn your list into a data frame, and then calculate age in months.
library(tidyverse)
library(lubridate)
tib <- bind_rows(tib_list, .id = "date_extracted") %>%
mutate(date_extracted = dmy(date_extracted),
DOB = dmy(DOB),
age_months = month(date_extracted) - month(DOB)
)
#### OUTPUT ####
# A tibble: 8 x 4
date_extracted ID DOB age_months
<date> <dbl> <date> <dbl>
1 2019-09-01 3 2019-07-01 2
2 2019-09-01 5 2019-06-01 3
3 2019-09-01 7 2019-05-01 4
4 2019-09-01 8 2019-09-01 0
5 2019-10-01 2 2019-10-01 0
6 2019-10-01 5 2019-06-01 4
7 2019-10-01 8 2019-09-01 1
8 2019-10-01 9 2019-02-01 8
This can be solved with lapply as well but we can also use Map in this case to iterate over list and their names after adding all the dataframes in a list. In base R,
Map(function(x, y) {
x$DOB <- as.Date(x$DOB)
transform(x, age = as.integer(format(as.Date(y), "%m")) -
as.integer(format(x$DOB, "%m")))
}, list_df, names(list_df))
#$`01-09-2019`
# ID DOB age
#1 3 0001-07-20 2
#2 5 0001-06-20 3
#3 7 0001-05-20 4
#4 8 0001-09-20 0
#$`01-10-2019`
# ID DOB age
#1 2 0001-10-20 0
#2 5 0001-06-20 4
#3 8 0001-09-20 1
#4 9 0001-02-20 8
We can also do the same in tidyverse
library(dplyr)
library(lubridate)
purrr::imap(list_df, ~.x %>% mutate(age = month(.y) - month(DOB)))
data
list_df <- list(`01-09-2019` = structure(list(ID = c(3L, 5L, 7L, 8L),
DOB = structure(c(3L, 2L, 1L, 4L), .Label = c("01-05-2019", "01-06-2019",
"01-07-2019", "01-09-2019"), class = "factor")), class = "data.frame",
row.names = c(NA, -4L)), `01-10-2019` = structure(list(ID = c(2L, 5L, 8L, 9L),
DOB = structure(c(4L, 2L, 3L, 1L), .Label = c("01-02-2019",
"01-06-2019", "01-09-2019", "01-10-2019"), class = "factor")),
class = "data.frame", row.names = c(NA, -4L)))
It's bad practice to use dates and numbers as dataframe names consider prefix the date with an "x" as shown below in this base R solution:
df_list <- list(x01_09_2019 = `01-09-2019`, x01_10_2019 = `01-10-2019`)
df_list <- mapply(cbind, "report_date" = names(df_list), df_list, SIMPLIFY = F)
df_list <- lapply(df_list, function(x){
x$report_date <- as.Date(gsub("_", "-", gsub("x", "", x$report_date)), "%d-%m-%Y")
x$Age <- x$report_date - x$DOB
return(x)
}
)
Data:
`01-09-2019` <- structure(list(ID = c(3, 5, 7, 8),
DOB = structure(c(18078, 18048, 18017, 18140), class = "Date")),
class = "data.frame", row.names = c(NA, -4L))
`01-10-2019` <- structure(list(ID = c(2, 5, 8, 9),
DOB = structure(c(18170, 18048, 18140, 17928), class = "Date")),
class = "data.frame", row.names = c(NA, -4L))
I have a DataFrame of "dbGet_TRUE_EVENTS_DATA"
> dbGet_TRUE_EVENTS_DATA
LONGITUDE LATITUDE DATE_START DATE_END FLAG EQNUM
1 -39.5 80.5 2008-07-06 2008-07-10 1 1
2 -39.5 81.5 2008-07-06 2008-07-10 1 1
3 -38.5 80.5 2008-07-06 2008-07-10 1 2
4 -38.5 81.5 2008-07-06 2008-07-10 1 2
5 -39.5 79.5 2008-07-06 2008-07-10 1 3
6 -38.5 79.5 2008-07-06 2008-07-10 1 3
7 -39.5 79.5 2008-07-06 2008-07-10 1 4
8 -38.5 79.5 2008-07-06 2008-07-10 1 4
9 -39.5 79.5 2008-07-06 2008-07-10 1 5
10 -38.5 79.5 2008-07-06 2008-07-10 1 6
11 -39.5 79.5 2008-07-06 2008-07-10 1 7
and a list "TRUE_EVENTS_split_up"
> TRUE_EVENTS_split_up
$Fold1
[1] 3 4 6
$Fold2
[1] 5 7
$Fold3
[1] 1 2
I can subset data like the following:
newdata1 <-
subset(dbGet_TRUE_EVENTS_DATA, EQNUM %in% TRUE_EVENTS_split_up$Fold1)
newdata2 <-
subset(dbGet_TRUE_EVENTS_DATA, EQNUM %in% TRUE_EVENTS_split_up$Fold2)
newdata3 <-
subset(dbGet_TRUE_EVENTS_DATA, EQNUM %in% TRUE_EVENTS_split_up$Fold3)
newdata <- list(newdata1 ,newdata2 ,newdata3 )
I would like to apply directly my subset on dbGet_TRUE_EVENTS_DATA. I know that I have to use lapply function but don't know how ?
> dput(dbGet_TRUE_EVENTS_DATA)
structure(list(LONGITUDE = c(-39.5, -39.5, -38.5, -38.5, -39.5,
-38.5, -39.5, -38.5, -39.5, -38.5, -39.5), LATITUDE = c(80.5,
81.5, 80.5, 81.5, 79.5, 79.5, 79.5, 79.5, 79.5, 79.5, 79.5),
DATE_START = structure(c(1215298800, 1215298800, 1215298800,
1215298800, 1215298800, 1215298800, 1215298800, 1215298800,
1215298800, 1215298800, 1215298800), class = c("POSIXct",
"POSIXt")), DATE_END = structure(c(1215644400, 1215644400,
1215644400, 1215644400, 1215644400, 1215644400, 1215644400,
1215644400, 1215644400, 1215644400, 1215644400), class = c("POSIXct",
"POSIXt")), FLAG = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), EQNUM = c(1,
1, 2, 2, 3, 3, 4, 4, 5, 6, 7)), .Names = c("LONGITUDE", "LATITUDE",
"DATE_START", "DATE_END", "FLAG", "EQNUM"), row.names = c(NA,
-11L), class = "data.frame")
> dput(TRUE_EVENTS_split_up)
structure(list(Fold1 = c(3, 4, 6), Fold2 = c(5, 7), Fold3 = c(1,
2)), .Names = c("Fold1", "Fold2", "Fold3"))
Having problem calculating the date difference in business days, i.e. exclude weekends like networkdays function in Excel.
Here is my data.
e <- structure(list(date.pr = structure(c(15909, 15933, 16517, 15961, 15974, 15978), class = "Date"), date.po = structure(c(15909, 15933, 15954, 15961, 15974, 15978), class = "Date")), .Names = c("date.1", "date.2"), class = c("tbl_df", "data.frame"), row.names = c(NA, -6L))
Found the "bizdays" package for this task. Which works fine for this one.
> bizdays(e$date.2,e$date.1)
[1] 0 0 563 0 0 0
But my data contains cases when date.2 is before date.1.
e2 <- structure(list(date.pr = structure(c(15909, 15933, 16517, 15961, 5974, 15978, 15978), class = "Date"), date.po = structure(c(15909, 15933, 15954, 15961, 15974, 15978, 15979), class = "Date")), .Names = c("date.1", "date.2"), class = c("tbl_df", "data.frame"), row.names = c(NA, -7L))
Now it gives the following error:
> cal <- Calendar(holidaysANBIMA, weekdays=c("saturday","sunday"))
> bizdays(e2$date.2,e2$date.1,cal)
Error in bizdays.Date(e2$date.2, e2$date.1, cal) :
All from dates must be greater than all to dates.
I'm thinking using the ifelse() logic, but it gives me the same error.
> ifelse(e2$date.2 < e2$date.1, NA, bizdays(e2$date.2,e2$date.1,cal))
Error in bizdays.Date(e2$date.2, e2$date.1, cal) :
All from dates must be greater than all to dates.
Help appreciated.
Nweekdays() function is adapted from #J. Won. solution at Calculate the number of weekdays between 2 dates in R
This modified function takes into account of date differences of either positive or negative,
whereas the above link has accepted solution for positive date difference.
library("dplyr")
e2 <- structure(list(date.pr = structure(c(16524, 16524, 16507, 16510, 16510, 16524, 16510, 5974), class = "Date"),
date.po = structure(c(16524, 16525, 16510, 16517, 16524, 16510, 16531, 15974), class = "Date")),
.Names = c("date.1", "date.2"), class = c("tbl_df", "data.frame"), row.names = c(NA, -8L))
Nweekdays <- Vectorize(
function(a, b)
{
ifelse(a < b,
return(sum(!weekdays(seq(a, b, "days")) %in% c("Saturday", "Sunday")) - 1),
return(sum(!weekdays(seq(b, a, "days")) %in% c("Saturday", "Sunday")) - 1))
})
> e2 %>%
mutate(wkd1 = format(date.1, "%A"),
wkd2 = format(date.2, "%A"),
ndays_with_wkends = ifelse((date.2 > date.1), (date.2 - date.1), (date.1 - date.2)),
ndays_no_wkends = Nweekdays(date.1, date.2))
Source: local data frame [8 x 6]
date.1 date.2 wkd1 wkd2 ndays_with_wkends ndays_no_wkends
(date) (date) (chr) (chr) (dbl) (dbl)
1 2015-03-30 2015-03-30 Monday Monday 0 0
2 2015-03-30 2015-03-31 Monday Tuesday 1 1
3 2015-03-13 2015-03-16 Friday Monday 3 1
4 2015-03-16 2015-03-23 Monday Monday 7 5
5 2015-03-16 2015-03-30 Monday Monday 14 10
6 2015-03-30 2015-03-16 Monday Monday 14 10
7 2015-03-16 2015-04-06 Monday Monday 21 15
8 1986-05-11 2013-09-26 Sunday Thursday 10000 7143
> e2 %>% mutate(ndays_no_wkends = Nweekdays(date.1, date.2))
Source: local data frame [8 x 3]
date.1 date.2 ndays_no_wkends
(date) (date) (dbl)
1 2015-03-30 2015-03-30 0
2 2015-03-30 2015-03-31 1
3 2015-03-13 2015-03-16 1
4 2015-03-16 2015-03-23 5
5 2015-03-16 2015-03-30 10
6 2015-03-30 2015-03-16 10
7 2015-03-16 2015-04-06 15
8 1986-05-11 2013-09-26 7143