Aggregate data frame on 2 columns, counting the leftover column by occurrence

I have a data frame:
station person_id date
1 0037 103103 2015-02-02
2 0037 306558 2015-02-02
3 0037 306558 2015-02-04
4 0037 306558 2015-02-05
I need to aggregate the frame by station and date, so that each unique station/date combination becomes one row in the result showing how many people fall on it.
For example, the first 2 rows would collapse into a single row showing 2 people for station 0037 on date 2015-02-02.
I tried:
result <- data_frame %>% group_by(station, week = week(date)) %>% summarise_each(funs(length), -date)

You could try:
library(dplyr)
group_by(df, station, date) %>% summarise(num_people = length(person_id))
Source: local data frame [3 x 3]
Groups: station [?]
station date num_people
(int) (fctr) (int)
1 37 2015-02-02 2
2 37 2015-02-04 1
3 37 2015-02-05 1
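As an aside, dplyr's count() wraps this group-and-tally pattern in one call; a minimal sketch, assuming a newer dplyr that supports the name argument:
library(dplyr)
# count() groups by station and date, tallies the rows, and ungroups
df %>% count(station, date, name = "num_people")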

In base R, you could use aggregate:
# sample dataset
set.seed(1234)
df <- data.frame(station=sample(1:3, 50, replace=T),
person_id=sample(30000:35000, 50, replace=T),
date=sample(seq(as.Date("2015-02-05"), as.Date("2015-02-12"),
by="day"), 50, replace=T))
# calculate number of people per station on a particular date
aggregate(cbind("passengerCount"=person_id) ~ station + date, data=df, FUN=length)
The cbind function is not necessary, but it lets you provide a variable name.
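Another base R route worth knowing, as a sketch: table() cross-tabulates station by date, and as.data.frame() flattens the result into long form (the subset keeps only combinations that actually occur, mirroring aggregate's output):
counts <- as.data.frame(table(station = df$station, date = df$date))
counts[counts$Freq > 0, ]  # drop the empty station/date pairs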

With data.table, we convert the 'data.frame' (df1, the original data) to a 'data.table' and, grouping by 'station' and 'date', count the rows per group with .N.
library(data.table)
setDT(df1)[, .(num_people = .N), .(station, date)]
# station date num_people
#1: 37 2015-02-02 2
#2: 37 2015-02-04 1
#3: 37 2015-02-05 1
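A hedged variation: if the same person can appear more than once on a given station/date and should be counted only once, data.table's uniqueN() counts distinct ids instead of rows:
setDT(df1)[, .(num_people = uniqueN(person_id)), by = .(station, date)]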

New data frame with unique values and counts

I'd like to create a new data table from my old one that includes a count of all the "article_id" that occur for each date (i.e. there are three article_id's listed for the date 2001-10-01, so I'd like one column with the date and one column that has the article count, "3").
Here is the output of the data table:
date article_id N
1: 2001-09-01 FAS_200109_11104 3
2: 2001-10-01 FAS_200110_11126 6
3: 2001-10-01 FAS_200110_11157 21
4: 2001-10-01 FAS_200110_11160 5
5: 2001-11-01 FAS_200111_11220 26
---
7359: 2019-08-01 FAZ_201908_2958 7
7360: 2019-09-01 FAZ_201909_3316 8
7361: 2019-09-01 FAZ_201909_3515 13
7362: 2000-12-01 FAZ_200012_92981 3
7363: 2001-08-01 FAZ_200108_86041 14
So I'll have to move over the unique date values to a new data frame (so that each date is only shown once), as well as a count of article_id's shown for each date.
I've been trying to figure this out but haven't found exactly the right answer regarding how to count the occurrence of a character vector (the article_id) by group (date). I think this is something pretty simple in R, but I'm new to the program and don't have much support so I would very much appreciate your suggestions - thank you so much!
The expected output is not clear, so here are a few options depending on what is wanted.
Sum of 'N' by 'date':
library(data.table)
dt[, .(N = sum(N, na.rm = TRUE)), by = date]
Count of unique 'article_id' for each date:
dt[, .(N = uniqueN(article_id)), by = date]
Get the first count by 'date':
dt[, .(N = first(N)), by = date]
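Given the OP's description (three article_id rows listed for 2001-10-01 should yield a count of 3), a plain row count per date may be the closest match; a minimal sketch:
dt[, .(N = .N), by = date]  # number of article_id rows per date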
We could group and then summarise:
library(dplyr)
df %>%
group_by(date) %>%
summarise(n = n())
date n
<chr> <int>
1 2000-12-01 1
2 2001-08-01 1
3 2001-09-01 1
4 2001-10-01 3
5 2001-11-01 1
6 2019-08-01 1
7 2019-09-01 2
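As an aside, count() collapses the group_by/summarise pair into one call, so the same result can be written as:
library(dplyr)
df %>% count(date)  # equivalent to group_by(date) %>% summarise(n = n())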
Here are 2 tidyverse solutions:
Libraries
library(tidyverse)
library(lubridate) # for ymd()
Example Data
df <-
tibble(
date = ymd(c("2001-09-01","2001-10-01","2001-10-01")),
article_id = c("FAS_200109_11104","FAS_200110_11126","FAS_200110_11157"),
N = c(3,6,21)
)
Solution 1
df %>%
group_by(date) %>%
summarise(N = sum(N,na.rm = TRUE))
Solution 2
df %>%
count(date,wt = N)
Result
# A tibble: 2 x 2
date n
<date> <dbl>
1 2001-09-01 3
2 2001-10-01 27

Calculate a rolling sum of 3 month in R data frame based on a date column and Product

I am looking to calculate a 3 month rolling sum of values in one column of a data frame based upon the dates in another column and product.
newResults data frame columns: Product, Date, Value
In this example, I wish to calculate the rolling sum of value for Product for 3 months. I have sorted the data frame on Product and Date.
Dataset example: (shown as a screenshot in the original post; the answer below constructs comparable sample data)
My Code:
newResults = newResults %>%
group_by(Product) %>%
mutate(Roll_12Mth =
rollapplyr(Value, width = 1:n() - findInterval( Date %m-% months(3), date), sum)) %>%
ungroup
Error: Problem with mutate() input Roll_12Mth.
x could not find function "%m-%"
i Input Roll_12Mth is rollapplyr(...).
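The immediate error is just that %m-% is lubridate's month-arithmetic operator, so lubridate has to be attached; a quick illustration:
library(lubridate)
# %m-% subtracts months while clipping to the nearest valid date
as.Date("2017-03-31") %m-% months(1)  # "2017-02-28"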
Expected output: (shown as a screenshot in the original post)
If the dates are always spaced 1 month apart, it is easy.
dat=data.frame(Date=seq(as.Date("2/1/2017", "%m/%d/%Y"), as.Date("1/1/2018", "%m/%d/%Y"), by="month"),
Product=rep(c("A", "B"), each=6),
Value=c(4182, 4822, 4805, 6235, 3665, 3326, 3486, 3379, 3596, 3954, 3745, 3956))
library(zoo)
library(dplyr)
dat %>%
group_by(Product) %>%
arrange(Date, .by_group=TRUE) %>%
mutate(Value=rollapplyr(Value, 3, sum, partial=TRUE))
Date Product Value
<date> <fct> <dbl>
1 2017-02-01 A 4182
2 2017-03-01 A 9004
3 2017-04-01 A 13809
4 2017-05-01 A 15862
5 2017-06-01 A 14705
6 2017-07-01 A 13226
7 2017-08-01 B 3486
8 2017-09-01 B 6865
9 2017-10-01 B 10461
10 2017-11-01 B 10929
11 2017-12-01 B 11295
12 2018-01-01 B 11655
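If the dates are not evenly spaced, the OP's original width-vector idea works once lubridate is attached and the column-name case is made consistent; a sketch against the dat above (Roll_3Mth is a hypothetical name):
library(lubridate)
dat %>%
  group_by(Product) %>%
  arrange(Date, .by_group = TRUE) %>%
  # width per row = number of rows whose Date falls in the trailing 3-month window
  mutate(Roll_3Mth = rollapplyr(Value, seq_along(Date) - findInterval(Date %m-% months(3), Date), sum)) %>%
  ungroup()
rollapplyr accepts a vector of widths, one per row, which is what the seq_along/findInterval expression computes.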

Count date observations in a month

I have a dataframe containing daily prices of a stock exchange with corresponding dates for several years. These dates are trading dates and thus exclude weekends and holidays. Ex:
df$date <- as.Date(c("2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04"))
I have used lubridate to extract a column containing which month each date is in, but what I struggle with is creating a column that, for each month of every year, gives the trading-day number within that month. I.e. from the example, a counter that starts at 1 for 2017-04-03 (since it is the first observation of the month, even though it is the third calendar day) and ends at the last observation of the month. So the column would look like this:
df$DayofMonth <- c(22, 23, 1, 2)
and not
df$DayofMonth <- c(30, 31, 3, 4)
Is there anybody that can help me?
Maybe this helps:
library(data.table)
library(stringr)
df <- setDT(df)
df[,YearMonth:=str_sub(Date,1,7)]
df[, DayofMonth := seq(.N), by = YearMonth]
This creates a column called YearMonth with values like '2020-01'. Then, within each such month group, every date gets an index, which in your case corresponds to the trading day. This yields 1 for the date '2017-04-03', since it is the first trading day of that month. Note that this assumes your df is sorted from earliest to latest date.
Another way is to extract the date components with lubridate and then group with dplyr.
library(dplyr)
library(lubridate)
df <- data.frame(date = as.Date(c("2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04")))
df %>%
mutate(month = month(date),
year = year(date),
day = day(date)) %>%
group_by(year, month) %>%
mutate(DayofMonth = day - min(day) + 1)
# A tibble: 4 x 5
# Groups: year, month [2]
date month year day DayofMonth
<date> <dbl> <dbl> <int> <dbl>
1 2017-03-30 3 2017 30 1
2 2017-03-31 3 2017 31 2
3 2017-04-03 4 2017 3 1
4 2017-04-04 4 2017 4 2
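One caveat on this approach, as a hedged variant: day - min(day) + 1 numbers by calendar day, so the counter jumps whenever a weekend falls inside the month (e.g. a Friday followed by a Monday would get 1 and 4, not 1 and 2). Numbering the observed rows directly avoids that, assuming the data is sorted by date:
df %>%
  group_by(year = year(date), month = month(date)) %>%
  mutate(DayofMonth = row_number()) %>%  # index of each observed trading day
  ungroup()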
You can try the following:
For each date, find the first day of that month.
Then count how many working days lie between first_day_of_month and the current date.
library(dplyr)
library(lubridate)
df %>%
mutate(first_day_of_month = floor_date(date, 'month'),
day_of_month = purrr::map2_dbl(first_day_of_month, date,
~sum(!weekdays(seq(.x, .y, by = 'day')) %in% c('Saturday', 'Sunday'))))
# date first_day_of_month day_of_month
#1 2017-03-30 2017-03-01 22
#2 2017-03-31 2017-03-01 23
#3 2017-04-03 2017-04-01 1
#4 2017-04-04 2017-04-01 2
You can drop the first_day_of_month column if not needed. Note that this counts calendar weekdays, so it can overcount whenever a market holiday falls inside the month; the row-numbering approaches above only count the dates actually present in the data.
data
df <- data.frame(Date = as.Date(c("2017-03-30", "2017-03-31",
"2017-04-03", "2017-04-04")))

Looping to subset dataframe by timestamps at the minute scale in R

I have a large dataframe that I am trying to subset into smaller dataframes by timestamps, all the way down to the minute scale. Let's say we have the following dummy dataset:
> mydata
date id
1 3/29/17 18:16 A
2 3/30/17 18:05 B
3 3/30/17 18:16 C
4 3/30/17 18:16 D
I want to run a loop to sort and create mini dataframes by their timestamp on the scale of minutes, like this:
> mydata1
date id
1 3/29/17 18:16 A
> mydata2
date id
2 3/30/17 18:05 B
> mydata3
date id
3 3/30/17 18:16 C
4 3/30/17 18:16 D
(I do plan on merging dataframes later so that all ids are present)
What is the most efficient way to do this in R? Thanks in advance for any help!
One option is to use the split function to divide your data.frame based on the date column. Since the date column in your data.frame is precise only up to the minute, split will work here. It returns a list of data frames.
listDfs <- split(mydata, mydata$date)
listDfs
# $`3/29/17 18:16`
# date id
# 1 3/29/17 18:16 A
#
# $`3/30/17 18:05`
# date id
# 2 3/30/17 18:05 B
#
# $`3/30/17 18:16`
# date id
# 3 3/30/17 18:16 C
# 4 3/30/17 18:16 D
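The individual mini data frames can then be pulled out of the list by index or by name:
listDfs[[1]]  # first mini data frame
listDfs[["3/30/17 18:16"]]  # or by the timestamp it was split on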
Another option (the preferred one, I'd say) is to group on date and arrange the data accordingly. You can add a column with the data frame number (if that helps); dplyr::group_indices can be used to assign a unique number to each group. A solution using dplyr and lubridate:
library(dplyr)
library(lubridate)
mydata %>% mutate(date = mdy_hm(date)) %>%
mutate(df_num = group_indices(., date)) %>%
group_by(df_num) %>%
select(df_num, date, id)
# # A tibble: 4 x 3
# # Groups: df_num [3]
# df_num date id
# <int> <dttm> <chr>
# 1 1 2017-03-29 18:16:00 A
# 2 2 2017-03-30 18:05:00 B
# 3 3 2017-03-30 18:16:00 C
# 4 3 2017-03-30 18:16:00 D
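A hedged side note: in dplyr 1.0+, calling group_indices() inside a pipeline is superseded; cur_group_id() is the replacement, used inside a grouped mutate:
mydata %>%
  mutate(date = mdy_hm(date)) %>%
  group_by(date) %>%
  mutate(df_num = cur_group_id()) %>%  # unique number per date group
  ungroup() %>%
  select(df_num, date, id)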
Data:
mydata <- read.table(text =
"date id
1 '3/29/17 18:16' A
2 '3/30/17 18:05' B
3 '3/30/17 18:16' C
4 '3/30/17 18:16' D",
header = TRUE, stringsAsFactors = FALSE)

How to filter rows based on difference in dates between rows in R?

Within each id, I would like to keep rows that are at least 91 days apart. In my dataframe df below, id=1 has 5 rows and id=2 has 1 row.
For id=1, I would like to keep only the 1st, 3rd and 5th rows.
This is because if we compare 1st date and 2nd date, they differ by 32 days. So, remove 2nd date. We proceed to comparing 1st and 3rd date, and they differ by 152 days. So, we keep 3rd date.
Now, instead of using the 1st date as reference, we use the 3rd date. The 3rd and 4th dates differ by 61 days. So, remove the 4th date. We proceed to comparing the 3rd and 5th dates, and they differ by 547 days. So, we keep the 5th date.
In the end, the dates we keep are 1st, 3rd and 5th dates. As for id=2, there is only one row, so we keep that. The desired result is shown in dfnew.
df <- read.table(header = TRUE, text = "
id var1 date
1 A 2006-01-01
1 B 2006-02-02
1 C 2006-06-02
1 D 2006-08-02
1 E 2007-12-01
2 F 2007-04-20
",stringsAsFactors=FALSE)
dfnew <- read.table(header = TRUE, text = "
id var1 date
1 A 2006-01-01
1 C 2006-06-02
1 E 2007-12-01
2 F 2007-04-20
",stringsAsFactors=FALSE)
I can only think of starting with grouping the df by id as follows:
library(dplyr)
dfnew <- df %>% group_by(id)
However, I am not sure of how to continue from here. Should I proceed with filter function or slice? If so, how?
Here's an attempt using rolling joins in data.table, which I believe should be efficient.
library(data.table)
# Set minimum distance
mindist <- 91L
# Make sure it is a real Date
setDT(df)[, date := as.IDate(date)]
# Create a new column, offset by distance + 1, to roll-join to
df[, date2 := date - (mindist + 1L)]
# Perform a rolling join for each value in df$date2 that is at least 91 days from df$date
unique(df[df, on = c(id = "id", date = "date2"), roll = -Inf], by = c("id", "var1"))
# id var1 date date2 i.var1 i.date
# 1: 1 A 2005-10-01 2005-10-01 A 2006-01-01
# 2: 1 C 2006-03-02 2006-03-02 C 2006-06-02
# 3: 1 E 2007-08-31 2007-08-31 E 2007-12-01
# 4: 2 F 2007-01-18 2007-01-18 F 2007-04-20
This will give you two additional columns, but that's not a big deal IMO. Logically this makes sense, and I've tested it successfully on different scenarios, but it may need some additional proof tests.
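If the extra columns bother you, they can be dropped right after the join; the i.-prefixed columns carry the matched original rows (a sketch based on the output above):
res <- unique(df[df, on = c(id = "id", date = "date2"), roll = -Inf], by = c("id", "var1"))
res[, .(id, var1 = i.var1, date = i.date)]  # recovers exactly dfnew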
An alternative that uses slice from dplyr is to define the following recursive function:
library(dplyr)
f <- function(d, ind=1) {
ind.next <- first(which(difftime(d,d[ind], units="days") > 90))
if (is.na(ind.next))
return(ind)
else
return(c(ind, f(d,ind.next)))
}
This function operates on the date column starting at ind = 1. It then finds the next index ind.next that is the first index for which the date is greater than 90 days (at least 91 days) from the date indexed by ind. Note that if there is no such ind.next, ind.next==NA and we just return ind. Otherwise, we recursively call f starting at ind.next and return its result concatenated with ind. The end result of this function call are the row indices separated by at least 91 days.
With this function, we can do:
result <- df %>% group_by(id) %>% slice(f(as.Date(date, format="%Y-%m-%d")))
##Source: local data frame [4 x 3]
##Groups: id [2]
##
## id var1 date
## <int> <chr> <chr>
##1 1 A 2006-01-01
##2 1 C 2006-06-02
##3 1 E 2007-12-01
##4 2 F 2007-04-20
The use of this function assumes that the date column is sorted in ascending order within each id group. If not, we can just sort the dates before slicing. I'm not sure about the efficiency of this or the dangers of recursive calls in R; hopefully David Arenburg or others can comment on this.
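On the recursion concern: an iterative version of f with the same logic sidesteps deep call stacks for long groups (a sketch):
f_iter <- function(d) {
  keep <- 1L  # always keep the first row
  last <- 1L  # index of the most recently kept date
  for (i in seq_along(d)[-1]) {
    if (difftime(d[i], d[last], units = "days") > 90) {
      keep <- c(keep, i)
      last <- i
    }
  }
  keep
}
# drop-in replacement: df %>% group_by(id) %>% slice(f_iter(as.Date(date)))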
As suggested by David Arenburg, it is better to convert date to a Date class first instead of by group:
result <- df %>% mutate(date=as.Date(date, format="%Y-%m-%d")) %>%
group_by(id) %>% slice(f(date))
##Source: local data frame [4 x 3]
##Groups: id [2]
##
## id var1 date
## <int> <chr> <date>
##1 1 A 2006-01-01
##2 1 C 2006-06-02
##3 1 E 2007-12-01
##4 2 F 2007-04-20
