I have a large dataframe that I am trying to subset into smaller dataframes by timestamps, all the way down to the minute scale. Let's say we have the following dummy dataset:
> mydata
date id
1 3/29/17 18:16 A
2 3/30/17 18:05 B
3 3/30/17 18:16 C
4 3/30/17 18:16 D
I want to run a loop to sort and create mini dataframes by their timestamp on the scale of minutes, like this:
> mydata1
date id
1 3/29/17 18:16 A
> mydata2
date id
2 3/30/17 18:05 B
> mydata3
date id
3 3/30/17 18:16 C
4 3/30/17 18:16 D
(I do plan on merging dataframes later so that all ids are present)
What is the most efficient way to do this in R? Thanks in advance for any help!
One option is to use split() to divide your data.frame on the date column. Because the date column is only precise to the minute, rows with identical date strings fall in the same minute, so splitting on it works. The result is a list of data frames.
listDfs <- split(mydata, mydata$date)
listDfs
# $`3/29/17 18:16`
# date id
# 1 3/29/17 18:16 A
#
# $`3/30/17 18:05`
# date id
# 2 3/30/17 18:05 B
#
# $`3/30/17 18:16`
# date id
# 3 3/30/17 18:16 C
# 4 3/30/17 18:16 D
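If the real timestamps carry seconds, splitting on the raw string would no longer group by minute. A minimal base R sketch, assuming the same dummy data and a UTC time zone (the names stamp, minute_key and pieces are made up for illustration), parses the strings first and splits on a key truncated to the minute:

```r
# Dummy data from the question, rebuilt so the sketch is self-contained
mydata <- data.frame(
  date = c("3/29/17 18:16", "3/30/17 18:05", "3/30/17 18:16", "3/30/17 18:16"),
  id = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Parse the strings as date-times (the time zone is an assumption)
stamp <- as.POSIXct(mydata$date, format = "%m/%d/%y %H:%M", tz = "UTC")

# Truncate to the minute; an ISO-style key also sorts chronologically
minute_key <- format(stamp, "%Y-%m-%d %H:%M")

# One data frame per minute, stored in a named list
pieces <- split(mydata, minute_key)
names(pieces)
pieces[["2017-03-30 18:16"]]
```

Keeping the pieces in a list rather than as separate mydata1, mydata2, ... objects makes it easier to loop over them or bind them back together later.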
Another option (the one I'd prefer) is to keep a single data frame, group on date, and arrange the data accordingly. You can add a column for the data frame number if that helps. dplyr::group_indices assigns a unique number to each group (newer dplyr releases may warn that this calling form is deprecated). A solution using dplyr and lubridate:
library(dplyr)
library(lubridate)
mydata %>% mutate(date = mdy_hm(date)) %>%
mutate(df_num = group_indices(., date)) %>%
group_by(df_num) %>%
select(df_num, date, id)
# # A tibble: 4 x 3
# # Groups: df_num [3]
# df_num date id
# <int> <dttm> <chr>
# 1 1 2017-03-29 18:16:00 A
# 2 2 2017-03-30 18:05:00 B
# 3 3 2017-03-30 18:16:00 C
# 4 3 2017-03-30 18:16:00 D
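The same group number can be sketched in base R, without dplyr or lubridate. This assumes the dummy data above and a UTC time zone; the names stamp and secs are only illustrative:

```r
# Dummy data from the question
mydata <- data.frame(
  date = c("3/29/17 18:16", "3/30/17 18:05", "3/30/17 18:16", "3/30/17 18:16"),
  id = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Parse to date-times, then work with the underlying seconds
stamp <- as.POSIXct(mydata$date, format = "%m/%d/%y %H:%M", tz = "UTC")
secs <- as.numeric(stamp)

# Position of each timestamp among the sorted unique timestamps
mydata$df_num <- match(secs, sort(unique(secs)))
mydata
```

Rows that share a timestamp get the same df_num, and the numbers increase in chronological order.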
Data:
mydata <- read.table(text =
"date id
1 '3/29/17 18:16' A
2 '3/30/17 18:05' B
3 '3/30/17 18:16' C
4 '3/30/17 18:16' D",
header = TRUE, stringsAsFactors = FALSE)
I have the following dataset:
I want to measure the cumulative total at a daily level. So the result look something like:
I can use dplyr's cumsum function but the count for "missing days" won't show up. As an example, the date 1/3/18 does not exist in the original dataframe. I want this missed date to be in the resultant dataframe and its cumulative sum should be the same as the last known date i.e. 1/2/18 with the sum being 5.
Any help is appreciated! I am new to the language.
I'll use this second data.frame to fill out the missing dates:
daterange <- data.frame(Date = seq(min(x$Date), max(x$Date), by = "1 day"))
Base R:
transform(merge(x, daterange, all = TRUE),
Count = cumsum(ifelse(is.na(Count), 0, Count)))
# Date Count
# 1 2018-01-01 2
# 2 2018-01-02 5
# 3 2018-01-03 5
# 4 2018-01-04 5
# 5 2018-01-05 10
# 6 2018-01-06 10
# 7 2018-01-07 10
# 8 2018-01-08 11
# ...
# 32 2018-02-01 17
dplyr:
library(dplyr)
x %>%
right_join(daterange, by = "Date") %>%
arrange(Date) %>% # cumsum needs the rows in date order
mutate(Count = cumsum(if_else(is.na(Count), 0, Count)))
Data:
x <- data.frame(Date = as.Date(c("1/1/18", "1/2/18", "1/5/18", "1/8/18", "2/1/18"), format="%m/%d/%y"),
Count = c(2,3,5,1,6))
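The fill-then-accumulate idea can also be sketched without any join: build the full daily sequence, drop the known counts into a zero vector, and take the cumulative sum. This is only a sketch on the sample data; all_days, daily and filled are illustrative names:

```r
x <- data.frame(
  Date = as.Date(c("2018-01-01", "2018-01-02", "2018-01-05", "2018-01-08", "2018-02-01")),
  Count = c(2, 3, 5, 1, 6)
)

# Every calendar day between the first and last observation
all_days <- seq(min(x$Date), max(x$Date), by = "1 day")

# Zero for days without an observation, the observed count otherwise
daily <- numeric(length(all_days))
daily[match(x$Date, all_days)] <- x$Count

# Missing days inherit the running total of the last known day
filled <- data.frame(Date = all_days, Count = cumsum(daily))
head(filled, 8)
```

This assumes each date appears at most once in x; with repeated dates the counts would need to be summed per day first.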
I have this data basically, but larger:
I want to count a number of distinct combinations of (customer_id, account_id) - that is, distinct or unique values based on two columns, but for each start_date. I can't find the solution anywhere. The result should be another column added to my data.table that should look like this:
That is, for each start_date, it calculates number of distinct values based on both customer_id and account_id.
For example, for start_date equal to 2.2.2018, I have distinct combinations in (customer_id,account_id) being (4,22) (5,38) and (6,13), so I want count to be equal to 3 because I have 3 distinct combinations. I also need the solution to work with character values in customer_id and account_id columns.
Code to replicate the data:
library(data.table)
customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02-02","%Y-%m-%d"),each=4))
data <- data.table(customer_id,account_id,start_date)
Another dplyr option:
library(dplyr)
customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02-02","%Y-%m-%d"),each=4))
data <- data.frame(customer_id,account_id,start_date)
data %>%
group_by(start_date)%>%
mutate(distinct_values = n_distinct(customer_id, account_id)) %>%
ungroup()
dplyr option
customer_id <- c(1,1,1,2,3,3,4,5,5,6)
account_id <- c(11,11,11,11,55,88,22,38,38,13)
start_date <- c(rep(as.Date("2017-01-01","%Y-%m-%d"),each=6),rep(as.Date("2018-02-02","%Y-%m-%d"),each=4))
data <- data.frame(customer_id,account_id,start_date)
data %>%
group_by(start_date, customer_id, account_id) %>%
summarise(Total = 1) %>%
group_by(start_date) %>%
summarise(Count =n())
Here is a data.table option
data[, N := uniqueN(paste(customer_id, account_id, sep = "_")), by = start_date]
# customer_id account_id start_date N
# 1: 1 11 2017-01-01 4
# 2: 1 11 2017-01-01 4
# 3: 1 11 2017-01-01 4
# 4: 2 11 2017-01-01 4
# 5: 3 55 2017-01-01 4
# 6: 3 88 2017-01-01 4
# 7: 4 22 2018-02-02 3
# 8: 5 38 2018-02-02 3
# 9: 5 38 2018-02-02 3
#10: 6 13 2018-02-02 3
Or
data[, N := uniqueN(.SD, by = c("customer_id", "account_id")), by = start_date]
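For comparison, a base R sketch of the same per-date count, assuming the sample data above. It builds the pair key with interaction(), which combines factor codes rather than pasted strings, so it should also work when customer_id and account_id are character columns. The names pair_key and N are illustrative:

```r
customer_id <- c(1, 1, 1, 2, 3, 3, 4, 5, 5, 6)
account_id <- c(11, 11, 11, 11, 55, 88, 22, 38, 38, 13)
start_date <- c(rep(as.Date("2017-01-01"), 6), rep(as.Date("2018-02-02"), 4))
data <- data.frame(customer_id, account_id, start_date)

# One factor level per (customer_id, account_id) combination
pair_key <- interaction(data$customer_id, data$account_id, drop = TRUE)

# For each start_date, count the distinct pair codes and repeat the
# count on every row of that group
data$N <- ave(
  as.integer(pair_key),
  format(data$start_date),
  FUN = function(k) length(unique(k))
)
data
```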
I have a data frame as such:
data <- data.frame(daytime = c('2005-05-03 11:45:23', '2005-05-03 11:47:45',
'2005-05-03 12:00:32', '2005-05-03 12:25:01',
'2006-05-02 10:45:15', '2006-05-02 11:15:14',
'2006-05-02 11:16:15', '2006-05-02 11:18:03'),
category = c("A", "A", "A", "B", "B", "B", "B", "A"))
print(data)
daytime category
1 2005-05-03 11:45:23 A
2 2005-05-03 11:47:45 A
3 2005-05-03 12:00:32 A
4 2005-05-03 12:25:01 B
5 2006-05-02 10:45:15 B
6 2006-05-02 11:15:14 B
7 2006-05-02 11:16:15 B
8 2006-05-02 11:18:03 A
I would like to turn this data frame into a time series of daily categorical frequencies like this:
day cat_A_freq cat_B_freq
1 2005-05-03 3 1
2 2006-05-02 1 3
I have tried doing:
library(anytime)
data$daytime <- anytime(data$daytime)
data$day <- factor(format(data$daytime, "%D"))
table(data$day, data$category)
A B
05/02/06 1 3
05/03/05 3 1
But as you can see, formatting the new variable, day, changes the appearance of the date. The table also does not return the days in the proper order (the years are out of order), so I cannot easily convert it to a time series.
Any ideas on how to get frequencies in an easier way, or if this is the way, how to get the frequencies in correct date order and into a dataframe for easy conversion to a time series object?
A solution using tidyverse. Your daytime column is already in the standard YYYY-MM-DD HH:MM:SS format, so as.Date can be applied directly without a format string or other helper functions.
library(tidyverse)
data2 <- data %>%
mutate(day = as.Date(daytime)) %>%
count(day, category) %>%
spread(category, n)
data2
# # A tibble: 2 x 3
# day A B
# * <date> <int> <int>
# 1 2005-05-03 3 1
# 2 2006-05-02 1 3
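The original table() attempt can also be kept in date order by tabulating a real Date instead of a formatted string. A base R sketch, assuming the sample data and that daytime is stored as character; tab and freq are illustrative names:

```r
data <- data.frame(
  daytime = c("2005-05-03 11:45:23", "2005-05-03 11:47:45",
              "2005-05-03 12:00:32", "2005-05-03 12:25:01",
              "2006-05-02 10:45:15", "2006-05-02 11:15:14",
              "2006-05-02 11:16:15", "2006-05-02 11:18:03"),
  category = c("A", "A", "A", "B", "B", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# Keep only the calendar day; Date values sort chronologically
data$day <- as.Date(data$daytime)

# Cross-tabulate day by category; rows come out in date order
tab <- table(data$day, data$category)

# Rebuild a plain data frame with a proper Date column
freq <- data.frame(
  day = as.Date(rownames(tab)),
  cat_A_freq = as.vector(tab[, "A"]),
  cat_B_freq = as.vector(tab[, "B"])
)
freq
```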
It seems easy, but after a long time searching and trying I still haven't got it:
I have a list of time series, a short example for reproducing:
a <- seq(as.Date("1970-01-01"), as.Date("1970-01-05"), "days")
b <- seq(as.Date("1985-10-01"), as.Date("1985-10-05"), "days")
c <- seq(as.Date("2014-03-01"), as.Date("2014-03-05"), "days")
d <- c(a, b, c)
df1 <- data.frame(d)
colnames(df1) <- c("date")
e <- seq(as.Date("1975-01-01"), as.Date("1975-01-05"), "days")
f <- seq(as.Date("1990-10-01"), as.Date("1990-10-05"), "days")
g <- c(e, f)
df2 <- data.frame(g)
colnames(df2) <- c("date")
ll <- list(df1, df2)
Now I want to subset the listed data.frames to:
> llsubset
[[1]]
date
1 1970-01-01
2 1970-01-05
3 1985-10-01
4 1985-10-05
5 2014-03-01
6 2014-03-05
[[2]]
date
1 1975-01-01
2 1975-01-05
3 1990-10-01
4 1990-10-05
I've tried rollapply, but it doesn't work and the attempt isn't worth showing. Can you help? Thank you!
Determine which points differ from the prior by more than 1 day and from that construct a logical with TRUE at the ends of each sequence and FALSE elsewhere. Subset by it. No packages are used.
lapply(ll, subset, { dif <- diff(date) > 1; c(TRUE, dif) | c(dif, TRUE) } )
giving:
[[1]]
date
1 1970-01-01
5 1970-01-05
6 1985-10-01
10 1985-10-05
11 2014-03-01
15 2014-03-05
[[2]]
date
1 1975-01-01
5 1975-01-05
6 1990-10-01
10 1990-10-05
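To see how the logical mask is built, here is a small sketch on a plain date vector (a shortened, made-up version of the example data; gap, is_start and is_end are illustrative names):

```r
dates <- as.Date(c("1970-01-01", "1970-01-02", "1970-01-03",
                   "1985-10-01", "1985-10-02"))

# TRUE where the step to the next date is more than one day
gap <- diff(dates) > 1

# A date starts a run if it is first or follows a gap
is_start <- c(TRUE, gap)

# A date ends a run if it is last or precedes a gap
is_end <- c(gap, TRUE)

ends <- dates[is_start | is_end]
ends
```

A run consisting of a single date is both a start and an end, so it appears once in the result.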
Maybe something like this? Use cumsum and diff to create a group variable and then subset your data (assuming you want the min and max date within each consecutive time period, and that date is sorted in ascending order beforehand):
library(dplyr)
lapply(ll, function(df) {
df %>%
group_by(cumsum(c(TRUE, diff(date) != 1))) %>%
slice(c(1, n())) %>%
ungroup() %>%
select(date) }
)
#[[1]]
# A tibble: 6 × 1
# date
# <date>
#1 1970-01-01
#2 1970-01-05
#3 1985-10-01
#4 1985-10-05
#5 2014-03-01
#6 2014-03-05
#[[2]]
# A tibble: 4 × 1
# date
# <date>
#1 1975-01-01
#2 1975-01-05
#3 1990-10-01
#4 1990-10-05
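The grouping variable itself can be inspected in base R. This sketch uses a shortened, made-up date vector and assumes it is already sorted; run_id and ends are illustrative names:

```r
dates <- as.Date(c("1970-01-01", "1970-01-02", "1970-01-03",
                   "1985-10-01", "1985-10-02"))

# The counter increases by one each time the step is not exactly one day
run_id <- cumsum(c(TRUE, diff(dates) != 1))
run_id

# First and last date of every run, one row per run
ends <- do.call(rbind, lapply(split(dates, run_id), function(d) {
  data.frame(first = min(d), last = max(d))
}))
ends
```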
There's probably a package that does exactly this, but I don't yet know its name.
Using diff() on the dates can highlight which dates have only one day between them, like so:
diff(df1$date)
Time differences in days
[1] 1 1 1 1 5748 1 1 1 1 10374 1
[12] 1 1 1
We can use that.
end_finder <- function(x) {
# find the gap between consecutive dates.
# mark dates where the diff > 1 (the start of a new run),
# also mark the entry prior to each of those;
# this will be the end of the previous run.
# also include the first and last element.
diff_dates <- c(100, diff(x$date))
diff_idx <- which(diff_dates > 1)
diff_idx <- c((diff_idx - 1), diff_idx)
# remove any elements < 1
diff_idx <- diff_idx[diff_idx >= 1]
# include the first element
diff_idx <- c(1, diff_idx)
# include the last element
diff_idx <- c(diff_idx, length(x$date))
# remove duplicates and sort for easier reading
diff_idx <- sort(unique(diff_idx))
x$date[diff_idx]
}
Now run that.
> lapply(ll, end_finder)
[[1]]
[1] "1970-01-01" "1970-01-05" "1985-10-01" "1985-10-05" "2014-03-01"
[6] "2014-03-05"
[[2]]
[1] "1975-01-01" "1975-01-05" "1990-10-01" "1990-10-05"
Another solution using dplyr: first compute the year of each date, then find the min and max date within each year, using year() from lubridate and melt() from reshape2. Note that this assumes every consecutive run falls within its own calendar year, as in the example data.
library(dplyr)
library(lubridate)
library(reshape2)
ll <- list(df1, df2)
fn_endPoint_Years = function(DF) {
newDF = DF %>%
mutate(Year=year(date)) %>%
group_by(Year) %>%
do(.,data.frame(minDate=min(.$date),maxDate=max(.$date) )) %>%
melt(id="Year",value.name = "date") %>%
arrange(date) %>%
select(date)
}
lapply(ll,fn_endPoint_Years)
# [[1]]
# date
# 1 1970-01-01
# 2 1970-01-05
# 3 1985-10-01
# 4 1985-10-05
# 5 2014-03-01
# 6 2014-03-05
# [[2]]
# date
# 1 1975-01-01
# 2 1975-01-05
# 3 1990-10-01
# 4 1990-10-05
I'm trying to create a column in a dataset that tells me the (approximate) number of months a customer has been with the company.
This is my current attempt:
dat <- data.frame(ID = c(1:4), start.date = as.Date(c('2015-04-09', '2014-03-24', '2016-07-01', '2011-02-02')))
dat$months.customer <- apply(dat[2], 1, function(x) (as.numeric(Sys.Date())- as.numeric(x))/30)
It's returning all NAs
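apply() first coerces the data frame to a character matrix, so as.numeric(x) is applied to strings like "2015-04-09" and returns NA. You can use difftime instead (the output below depends on the date the code is run):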
dat$months.customer <-
as.numeric(floor(difftime(Sys.Date(),dat$start.date,units="days")/30))
# ID start.date months.customer
# 1 1 2015-04-09 16
# 2 2 2014-03-24 29
# 3 3 2016-07-01 1
# 4 4 2011-02-02 67
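Dividing by 30 is only an approximation. If whole calendar months are wanted instead, one possible sketch counts month boundaries directly. The helper months_between and the fixed reference date ref_date are hypothetical names and values chosen so the result is reproducible; they are not part of the answer above:

```r
dat <- data.frame(
  ID = 1:4,
  start.date = as.Date(c("2015-04-09", "2014-03-24", "2016-07-01", "2011-02-02"))
)

# Hypothetical helper: completed calendar months from start to end
months_between <- function(start, end) {
  s <- as.POSIXlt(start)
  e <- as.POSIXlt(end)
  # Difference in (year, month), counted in months
  m <- (e$year - s$year) * 12 + (e$mon - s$mon)
  # Subtract one if the day of month has not been reached yet
  m - (e$mday < s$mday)
}

# Fixed reference date (an assumption; use Sys.Date() for "today")
ref_date <- as.Date("2016-08-15")
dat$months.customer <- months_between(dat$start.date, ref_date)
dat
```

With a fixed reference date the numbers do not change from run to run, which makes the calculation easier to check.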