Apply function to get months from today's date - r

I'm trying to create a column in a dataset that tells me the (approximate) number of months a customer has been with the company.
This is my current attempt:
dat <- data.frame(ID = c(1:4), start.date = as.Date(c('2015-04-09', '2014-03-24', '2016-07-01', '2011-02-02')))
dat$months.customer <- apply(dat[2], 1, function(x) (as.numeric(Sys.Date()) - as.numeric(x))/30)
It's returning all NAs.

You can use difftime:
dat$months.customer <-
  as.numeric(floor(difftime(Sys.Date(), dat$start.date, units = "days")/30))
# ID start.date months.customer
# 1 1 2015-04-09 16
# 2 2 2014-03-24 29
# 3 3 2016-07-01 1
# 4 4 2011-02-02 67
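As for why the original attempt returns NAs: apply() coerces the data frame to a character matrix, so as.numeric() runs on strings like "2015-04-09" and produces NA. A minimal base-R sketch that works on the Date column directly, using a hypothetical fixed "today" of 2016-08-12 so the output is reproducible:

```r
dat <- data.frame(ID = 1:4,
                  start.date = as.Date(c("2015-04-09", "2014-03-24",
                                         "2016-07-01", "2011-02-02")))
# Date arithmetic works directly on the column; no apply() needed
today <- as.Date("2016-08-12")  # hypothetical fixed date for reproducibility
dat$months.customer <- as.numeric(today - dat$start.date) %/% 30
dat$months.customer
# [1] 16 29  1 67
```

In real use you would substitute Sys.Date() for the fixed date.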

Related

Plotting single-variable duration time series data with ggplot2

I have a list of companies with a start and end date of an event. I want to plot a figure that displays the date on the x-axis and the count of companies currently undergoing the event on the y-axis. The only way I can think of doing this at the moment is generating a column for every day and giving it a 1/0 for whether or not that day is between the start and end date for every company, and then reshaping it. Is there a more efficient way to produce this?
Here's some example data:
set.seed(123)
df <- data.frame(id = sample(100:500, 100, replace = FALSE))
df$start <- sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by = "day"), 100)
df$end <- df$start + sample(1:50, 100, replace = TRUE)
Here's another option, though I doubt that it's much more efficient than what you're already doing. It is also making all of the days and then identifying whether or not any particular observation is "active" that day.
library(tibble)
outdf <- tibble(
  date = seq(min(df$start), max(df$end), by = 1),
  num = rowSums(outer(date,
                      seq_len(nrow(df)),
                      function(x, y) x > df$start[y] & x < df$end[y]))
)
outdf
# # A tibble: 404 x 2
# date num
# <date> <dbl>
# 1 2020-01-05 0
# 2 2020-01-06 1
# 3 2020-01-07 1
# 4 2020-01-08 1
# 5 2020-01-09 1
# 6 2020-01-10 1
# 7 2020-01-11 2
# 8 2020-01-12 2
# 9 2020-01-13 2
# 10 2020-01-14 2
# # … with 394 more rows
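If efficiency matters on larger data, a sweep-line style count avoids building the full day-by-row outer() matrix: tally +1 at each start date and -1 at each end date, then take a cumulative sum. Note this sketch counts a company as active when start <= day < end, a slightly different boundary convention than the strict inequalities above, so it is not a drop-in replacement:

```r
set.seed(123)
df <- data.frame(id = sample(100:500, 100, replace = FALSE))
df$start <- sample(seq(as.Date("2020/01/01"), as.Date("2020/12/31"), by = "day"), 100)
df$end <- df$start + sample(1:50, 100, replace = TRUE)

days <- seq(min(df$start), max(df$end), by = 1)
# +1 on each start date, -1 on each end date, cumulated over the calendar
starts <- tabulate(match(df$start, days), nbins = length(days))
ends   <- tabulate(match(df$end,   days), nbins = length(days))
outdf <- data.frame(date = days, num = cumsum(starts) - cumsum(ends))
```

This does one pass over the rows and one over the days instead of a days-by-rows comparison.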

Assigning consecutive day numbers to dates

The dataframe I am working with has two columns: 1) person ID and 2) date. I am trying to assign numeric day values of date for each person.
For instance, person 1 has date from 2016-01-01 (baseline) to 2016-01-05 (last date for person 1). I want to create a day column that would translate this to 1, 2, 3, 4, 5. If person 2 has date from 2016-01-13 to 2016-01-16, the day column for person 2 would be 1, 2, 3, 4.
df <- for(i in length(unique(per1$date))){df$day[per1$date[1] + i] <- i+1}
This is basically what I am trying to do, but I get an error message saying:
"replacement has 17119 rows, data has 1670"
Please let me know how I can write the code for this. Thank you.
You can use data.table:
library(data.table)
## Create Data
df <- data.table(personID = c(1, 1, 1, 2, 2, 2, 2),
                 Date = c("2016-01-01", "2016-01-02", "2016-01-03",
                          "2016-01-13", "2016-01-14", "2016-01-15", "2016-01-16"))
## Order the data according to date, per user
df <- df[order(Date), .SD, by = personID]
## Rank the date within each personID group (:= adds Day by reference)
df[, Day := 1:.N, by = .(personID)]
df
personID Date Day
1: 1 2016-01-01 1
2: 1 2016-01-02 2
3: 1 2016-01-03 3
4: 2 2016-01-13 1
5: 2 2016-01-14 2
6: 2 2016-01-15 3
7: 2 2016-01-16 4
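The same within-person day numbering can be done in base R with ave(), in case data.table isn't available. A sketch, assuming the rows can be sorted by date within each person:

```r
df <- data.frame(personID = c(1, 1, 1, 2, 2, 2, 2),
                 Date = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03",
                                  "2016-01-13", "2016-01-14", "2016-01-15",
                                  "2016-01-16")))
df <- df[order(df$personID, df$Date), ]  # sort by person, then date
# seq_along numbers rows 1..n within each personID group
df$Day <- ave(seq_len(nrow(df)), df$personID, FUN = seq_along)
df$Day
# [1] 1 2 3 1 2 3 4
```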

R: cumulative total at a daily level

I have the following dataset:
I want to measure the cumulative total at a daily level. So the result look something like:
I can use cumsum (base R) inside a dplyr pipeline, but the counts for "missing days" won't show up. For example, the date 1/3/18 does not exist in the original dataframe. I want this missing date to appear in the resulting dataframe, with its cumulative sum carried forward from the last known date, i.e. 1/2/18 with a sum of 5.
Any help is appreciated! I am new to the language.
I'll use this second data.frame to fill out the missing dates:
daterange <- data.frame(Date = seq(min(x$Date), max(x$Date), by = "1 day"))
Base R:
transform(merge(x, daterange, all = TRUE),
          Count = cumsum(ifelse(is.na(Count), 0, Count)))
# Date Count
# 1 2018-01-01 2
# 2 2018-01-02 5
# 3 2018-01-03 5
# 4 2018-01-04 5
# 5 2018-01-05 10
# 6 2018-01-06 10
# 7 2018-01-07 10
# 8 2018-01-08 11
# ...
# 32 2018-02-01 17
dplyr:
library(dplyr)
x %>%
  right_join(daterange, by = "Date") %>%
  mutate(Count = cumsum(if_else(is.na(Count), 0, Count)))
Data:
x <- data.frame(Date = as.Date(c("1/1/18", "1/2/18", "1/5/18", "1/8/18", "2/1/18"),
                               format = "%m/%d/%y"),
                Count = c(2, 3, 5, 1, 6))
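A base-R-only alternative sketch: take the cumulative sum on the sparse data first, then carry the last known value forward to every calendar day with findInterval():

```r
x <- data.frame(Date = as.Date(c("1/1/18", "1/2/18", "1/5/18", "1/8/18", "2/1/18"),
                               format = "%m/%d/%y"),
                Count = c(2, 3, 5, 1, 6))
x$Cum <- cumsum(x$Count)
daterange <- seq(min(x$Date), max(x$Date), by = "1 day")
# findInterval() returns, for each day, the index of the latest known date,
# so each day picks up the most recent cumulative total
out <- data.frame(Date = daterange, Count = x$Cum[findInterval(daterange, x$Date)])
head(out, 3)
#         Date Count
# 1 2018-01-01     2
# 2 2018-01-02     5
# 3 2018-01-03     5
```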

R count distinct elements based on two columns by group

I have this data basically, but larger:
I want to count the number of distinct combinations of (customer_id, account_id) - that is, distinct or unique values based on two columns - for each start_date. I can't find the solution anywhere. The result should be another column added to my data.table that should look like this:
That is, for each start_date, it calculates number of distinct values based on both customer_id and account_id.
For example, for start_date equal to 2.2.2018, I have distinct combinations in (customer_id,account_id) being (4,22) (5,38) and (6,13), so I want count to be equal to 3 because I have 3 distinct combinations. I also need the solution to work with character values in customer_id and account_id columns.
Code to replicate the data:
customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02-02","%Y-%m-%d"),each=4))
data <- data.table(customer_id,account_id,start_date)
Another dplyr option:
library(dplyr)
customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02-02","%Y-%m-%d"),each=4))
data <- data.frame(customer_id,account_id,start_date)
data %>%
  group_by(start_date) %>%
  mutate(distinct_values = n_distinct(customer_id, account_id)) %>%
  ungroup()
dplyr option
customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02-02","%Y-%m-%d"),each=4))
data <- data.frame(customer_id,account_id,start_date)
data %>%
  group_by(start_date, customer_id, account_id) %>%
  summarise(Total = 1) %>%
  group_by(start_date) %>%
  summarise(Count = n())
Here is a data.table option
data[, N := uniqueN(paste(customer_id, account_id, sep = "_")), by = start_date]
# customer_id account_id start_date N
# 1: 1 11 2017-01-01 4
# 2: 1 11 2017-01-01 4
# 3: 1 11 2017-01-01 4
# 4: 2 11 2017-01-01 4
# 5: 3 55 2017-01-01 4
# 6: 3 88 2017-01-01 4
# 7: 4 22 2018-02-02 3
# 8: 5 38 2018-02-02 3
# 9: 5 38 2018-02-02 3
#10: 6 13 2018-02-02 3
Or
data[, N := uniqueN(.SD, by = c("customer_id", "account_id")), by = start_date]
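For completeness, a base R sketch: interaction() builds the combination key without the separator pitfalls of pasting ids together, and it works for character ids as well. This uses the same example data as the question:

```r
customer_id <- c(1, 1, 1, 2, 3, 3, 4, 5, 5, 6)
account_id <- c(11, 11, 11, 11, 55, 88, 22, 38, 38, 13)
start_date <- c(rep(as.Date("2017-01-01"), 6), rep(as.Date("2018-02-02"), 4))
data <- data.frame(customer_id, account_id, start_date)

# one factor level per distinct (customer_id, account_id) pair
key <- interaction(data$customer_id, data$account_id, drop = TRUE)
# count distinct keys within each start_date group
data$N <- ave(as.integer(key), data$start_date,
              FUN = function(k) length(unique(k)))
data$N
# [1] 4 4 4 4 4 4 3 3 3 3
```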

Looping to subset dataframe by timestamps at the minute scale in R

I have a large dataframe that I am trying to subset into smaller dataframes by timestamps, all the way down to the minute scale. Let's say we have the following dummy dataset:
> mydata
date id
1 3/29/17 18:16 A
2 3/30/17 18:05 B
3 3/30/17 18:16 C
4 3/30/17 18:16 D
I want to run a loop to sort and create mini dataframes by their timestamp on the scale of minutes, like this:
> mydata1
           date id
1 3/29/17 18:16  A
> mydata2
           date id
2 3/30/17 18:05  B
> mydata3
           date id
3 3/30/17 18:16  C
4 3/30/17 18:16  D
(I do plan on merging dataframes later so that all ids are present)
What is the most efficient way to do this in R? Thanks in advance for any help!
One option is to use the split() function to divide your data.frame based on the date column. Since the date column in your data.frame is only precise to the minute, split() will group rows by minute. It returns a list of data frames.
listDfs <- split(mydata, mydata$date)
listDfs
# $`3/29/17 18:16`
# date id
# 1 3/29/17 18:16 A
#
# $`3/30/17 18:05`
# date id
# 2 3/30/17 18:05 B
#
# $`3/30/17 18:16`
# date id
# 3 3/30/17 18:16 C
# 4 3/30/17 18:16 D
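If you really want separate mydata1, mydata2, ... objects in the workspace (though the list itself is usually easier to work with), the split list can be renamed and assigned in one step. A sketch, assuming the numbering should follow split()'s key order:

```r
mydata <- data.frame(date = c("3/29/17 18:16", "3/30/17 18:05",
                              "3/30/17 18:16", "3/30/17 18:16"),
                     id = c("A", "B", "C", "D"),
                     stringsAsFactors = FALSE)
listDfs <- split(mydata, mydata$date)
# name the pieces mydata1, mydata2, ... and push them into the global env
list2env(setNames(listDfs, paste0("mydata", seq_along(listDfs))),
         envir = .GlobalEnv)
nrow(mydata3)
# [1] 2
```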
Another option (and, I'd say, the preferred one) is to group on date and arrange the data accordingly. You can add a column for the data frame number (if that helps); dplyr::group_indices can be used to assign a unique number to each group. A solution using dplyr and lubridate:
library(dplyr)
library(lubridate)
mydata %>%
  mutate(date = mdy_hm(date)) %>%
  mutate(df_num = group_indices(., date)) %>%
  group_by(df_num) %>%
  select(df_num, date, id)
# # A tibble: 4 x 3
# # Groups: df_num [3]
# df_num date id
# <int> <dttm> <chr>
# 1 1 2017-03-29 18:16:00 A
# 2 2 2017-03-30 18:05:00 B
# 3 3 2017-03-30 18:16:00 C
# 4 3 2017-03-30 18:16:00 D
Data:
mydata <- read.table(text =
"date id
1 '3/29/17 18:16' A
2 '3/30/17 18:05' B
3 '3/30/17 18:16' C
4 '3/30/17 18:16' D",
header = TRUE, stringsAsFactors = FALSE)
