Get count by column values - r

I have a dataframe that looks like this:
month create_time request_id weekday
1 4 2014-04-25 3647895 Friday
2 12 2013-12-06 2229374 Friday
3 4 2014-04-18 3568796 Friday
4 4 2014-04-18 3564933 Friday
5 3 2014-03-07 3081503 Friday
6 4 2014-04-18 3568889 Friday
And I'd like to get the count of request_ids by the weekday. How would I do this in R?
I've tried a lot of stuff based on ddply and aggregate with no luck.

Try using aggregate:
> aggregate(request_id ~ weekday, FUN = length, data = df)
weekday request_id
1 Friday 6

There are several valid ways to do it. I usually go with my trusty sqldf(). If the dataframe is named D, then
library(sqldf)
counts <- sqldf('select weekday, count(request_id) as nrequests from D group by weekday')
sqldf() can be wordy, but it is just so easy to remember and get right the first time!

Or you could try count() from plyr (load the package first):
library(plyr)
count(df, "weekday")
or
ddply(df, .(weekday), summarise, count = length(request_id))

Another option is to use a table and take the rowSums
> rowSums(with(dat, table(weekday, request_id)))
Friday
6
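For comparison, here is a minimal base R sketch of the same count without any extra packages. The data frame is retyped from the question, and the column name n_requests is just an illustrative choice, not something from the original posts.

```r
# Data retyped from the question (weekday is recycled to all six rows).
df <- data.frame(
  month       = c(4, 12, 4, 4, 3, 4),
  create_time = as.Date(c("2014-04-25", "2013-12-06", "2014-04-18",
                          "2014-04-18", "2014-03-07", "2014-04-18")),
  request_id  = c(3647895, 2229374, 3568796, 3564933, 3081503, 3568889),
  weekday     = "Friday"
)

# table() counts rows per weekday; as.data.frame() turns the result
# into a two-column data frame (weekday, n_requests).
counts <- as.data.frame(table(weekday = df$weekday),
                        responseName = "n_requests")
counts
```

This counts rows per weekday, which equals the number of request_ids as long as request_id has no missing values.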

Is there a possibility to lag values of a data frame in r indexed by time?

My question concerns lagging data in R where R should be aware of the time index. I hope the question has not already been asked in another thread. Let's consider a simple setup:
df <- data.frame(date=as.Date(c("1990-01-01","1990-02-01","1990-01-15","1990-03-01","1990-05-01","1990-07-01","1993-01-02")), value=1:7)
This code should generate a table like
date value
1990-01-01 1
1990-02-01 2
1990-01-15 3
1990-03-01 4
1990-05-01 5
1990-07-01 6
My aim is to lag "value" by, e.g., one month. When I compute the lagged value for "1990-05-01", the date one month earlier (1990-04-01) is not present in the data, so that row should get an NA. The standard lag function is not aware of the time index and simply uses the value "4" from 1990-03-01, which is not what I want. Does anyone have an idea what I could do here?
Thanks in advance! :)
All the best,
Leon
You can try lubridate's %m-% operator to look up the value from one month earlier, like below
library(lubridate)
transform(
df,
value_lag = value[match(date %m-% months(1), date)]
)
which gives
date value value_lag
1 1990-01-01 1 NA
2 1990-02-01 2 1
3 1990-01-15 3 NA
4 1990-03-01 4 2
5 1990-05-01 5 NA
6 1990-07-01 6 NA
7 1993-01-02 7 NA
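If you would rather not depend on lubridate, the same match()-based lookup can be sketched in base R. The helper month_before() is a hypothetical name introduced here. It relies on seq() with by = "-1 month", and it assumes every day-of-month is 28 or lower; month-end dates are handled differently by seq() than by %m-%, so treat this as a sketch for data like the example only.

```r
df <- data.frame(
  date  = as.Date(c("1990-01-01", "1990-02-01", "1990-01-15", "1990-03-01",
                    "1990-05-01", "1990-07-01", "1993-01-02")),
  value = 1:7
)

# Hypothetical helper: the date one calendar month before each element.
# Assumption: day-of-month <= 28, so no month-end overflow can occur.
month_before <- function(d) {
  prev <- vapply(seq_along(d), function(i) {
    as.numeric(seq(d[i], by = "-1 month", length.out = 2)[2])
  }, numeric(1))
  as.Date(prev, origin = "1970-01-01")
}

# match() returns NA when the earlier date is absent, which yields NA lags.
df$value_lag <- df$value[match(month_before(df$date), df$date)]
df
```

The lookup is by date rather than by row position, so the unsorted row for 1990-01-15 does not disturb the result.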
For an example with multiple columns, let's consider:
df <- data.frame(date=as.Date(c("1990-01-01","1990-02-01","1990-01-15","1990-03-01","1990-05-01","1990-07-01","1993-01-02")), value=1:7, value2=7:13)
I recently found the following solution myself:
library(dplyr)
library(lubridate)
lags <- 1  # number of months to lag by
df %>%
  as_tibble() %>%
  mutate(across(2:ncol(df), .fns = function(x) x[match(date %m-% months(lags), date)], .names = "{.col}_lag"))
Thanks for your code, @ThomasisCoding. :)

Identify if a day of the week is 2nd/3rd etc Mon/Tues/etc day of the month in R

Given a date and the day of the week it falls on, I want to know if there is code that tells me which occurrence of that weekday in the month it is. For example, given 2/12/2020 and "Wednesday", I want the output "2" because it is the second Wednesday of the month.
You can do that in base R in essentially one operation. You also do not need the second input column.
Here is a slower walkthrough:
Code
dates <- c("2/12/2020","2/11/2020","2/10/2020","2/7/2020","2/6/2020", "2/5/2020")
Dates <- anytime::anydate(dates) ## one of several parsers
dow <- weekdays(Dates) ## for illustration, base R function
cnt <- (as.integer(format(Dates, "%d")) - 1) %/% 7 + 1
res <- data.frame(dt=Dates, dow=dow, cnt=cnt)
res
(Final) Output
R> res
dt dow cnt
1 2020-02-12 Wednesday 2
2 2020-02-11 Tuesday 2
3 2020-02-10 Monday 2
4 2020-02-07 Friday 1
5 2020-02-06 Thursday 1
6 2020-02-05 Wednesday 1
R>
Functionality like this is often found in dedicated date/time libraries. I wrapped some code from the (C++) Boost date_time library in the package RcppBDT; that made it easy to find 'the third Wednesday in the last month of each quarter' and the like.
(lubridate::day(your_date) - 1) %/% 7 + 1
The idea here is that the first 7 days of the month are all the first for their weekday. Next 7 are 2nd, etc.
> (1:30 - 1) %/% 7 + 1
# [1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5
Just to offer an alternative calculation for the nth weekday of the month, you can divide the day of the month by 7 and always round up:
date <- lubridate::mdy("02/12/2020")
ceiling(lubridate::day(date)/7)
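As a quick sanity check, here is a small base R sketch that wraps the integer-division formula in a function and confirms that it agrees with the round-up formula for every possible day of the month. The function name nth_weekday_of_month is made up for this illustration.

```r
# Hypothetical helper: which occurrence of its weekday a date is in its month.
nth_weekday_of_month <- function(d) {
  (as.integer(format(d, "%d")) - 1L) %/% 7L + 1L
}

d <- as.Date("2020-02-12")
n <- nth_weekday_of_month(d)  # the second Wednesday of February 2020

# Both formulas agree for every day of the month from 1 to 31.
days <- 1:31
formulas_agree <- all(ceiling(days / 7) == (days - 1) %/% 7 + 1)
```

Neither formula needs the weekday name itself, because days 1 to 7 are always the first occurrence of their weekday, days 8 to 14 the second, and so on.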

Calculating Average Time between Dates

I'm having difficulty calculating the average time between the payment dates in my CSV. I have tried multiple methods that I have seen online (changing to data.table, using ddply) with no success.
WorkerID PaymentDate
1 2015-07-18
1 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18
This is an example of my dataset. I want to calculate the average time between the PaymentDates (in number of days) in the simplest way possible, grouped by WorkerID.
Thank you!
This is a perfect job for aggregate(). It groups PaymentDate by WorkerID and applies the function mean(diff(.)) to each group.
tt <- read.table(text="
WorkerID PaymentDate
1 2015-06-18
1 2015-07-18
1 2015-08-18
2 2015-09-18
3 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18
4 2015-12-16", header=TRUE)
tt$PaymentDate <- as.Date(tt$PaymentDate)
aggregate(PaymentDate ~ WorkerID, data=tt, FUN=function(x) mean(diff(x)))
# WorkerID PaymentDate
# 1 1 30.5
# 2 2 NaN
# 3 3 31.0
# 4 4 29.5
As an alternative to AkselA's answer, one can use the data.table package if one prefers this over base R.
This is similar to using aggregate, but may sometimes give a speed boost. In my example below I've handled workers with a single payment date by setting the difference to 0, to illustrate how this can be achieved.
library(data.table)
df <- fread("WorkerID PaymentDate
1 2015-07-18
1 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18")
df[,PaymentDate := as.Date(PaymentDate)]
df[,{
if(length(PaymentDate) > 1){
mean(diff(as.numeric(PaymentDate)))
}else
0
}, by = WorkerID]
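A third variant, sketched here in base R with tapply(), sorts the dates first and returns NA (rather than NaN or 0) for workers with only one payment. The data are retyped from the question, and treating a single payment as NA is an assumption made for this illustration.

```r
tt <- data.frame(
  WorkerID    = c(1, 1, 3, 4, 4),
  PaymentDate = as.Date(c("2015-07-18", "2015-08-18", "2015-09-18",
                          "2015-10-18", "2015-11-18"))
)

# Sort so that diff() measures gaps between consecutive payments.
tt <- tt[order(tt$WorkerID, tt$PaymentDate), ]

# Mean gap in days per worker; NA when there is only one payment (assumption).
avg_gap <- tapply(as.numeric(tt$PaymentDate), tt$WorkerID, function(x) {
  if (length(x) > 1) mean(diff(x)) else NA_real_
})
avg_gap
```

Converting the dates with as.numeric() gives day counts, so the result is a plain numeric vector named by WorkerID.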

select date ranges for multiple years in r

I have a data set containing data for about 4.5 years. I'm trying to create two different data frames from this, for what I will call holiday and non-holiday periods. There are multiple periods per year, and these periods will repeat over multiple years.
For example, I'd like to choose a time period between Thanksgiving and New Year's Day, as well as periods prior to Valentine's Day and Mother's Day for each year, and make this my holiday data frame. Everything else would be non-holiday.
I apologize if this has been asked before, I just can't find it. I found a similar question for SQL, but I'm trying to figure out how to do this in R.
I've tried filtering and selecting, to no avail.
wine.holiday <- wine.sub2 %>%
select(total, cdate) %>%
subset(cdate>=2011-11-25, cdate<=2011-12-31)
wine.holiday
Source: local data frame [27,628 x 3]
Groups: clubgroup_id.x [112]
clubgroup_id.x total cdate
(chr) (dbl) (date)
1 1 45 2011-10-04
2 1 45 2011-10-08
3 1 45 2011-10-09
4 1 45 2011-10-09
5 1 45 2011-10-11
6 1 45 2011-10-15
7 1 45 2011-10-24
8 1 90 2011-11-13
9 1 45 2011-11-18
10 1 45 2011-11-26
.. ... ... ...
Clearly something isn't right, because not only is it not limiting the date range, but it's including a column in the data frame that I'm not even selecting.
As mentioned in the comments, dplyr uses filter not subset. Just a simple change to the code you've got (therefore not a complete solution to your issue, but hopefully helps) should get the subset working.
wine.holiday <- wine.sub2 %>%
select(total, cdate)
wine.holiday <- subset(wine.holiday, cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31"))
wine.holiday
Or, to stick with dplyr piping:
wine.holiday <- wine.sub2 %>%
select(total, cdate) %>%
filter( cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31") )
wine.holiday
EDIT to add: If the dplyr select isn't working (it looks fine to me), you could try this:
wine.holiday <- subset( wine.sub2, select = c( total, cdate ) )
wine.holiday <- subset(wine.holiday, cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31"))
wine.holiday
You could, of course, combine those two lines into one. This makes it harder to read, but would probably improve the processing efficiency:
wine.holiday <- subset(wine.sub2, cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31"), select=c(total,cdate) )
I figured out another method for this by looking through SO posts (took a while). The holiday functions come from the timeDate package.
> library(timeDate)
> library(data.table)
> wine.holiday <- data.table(start = c(as.Date(USThanksgivingDay(2010:2020))),
+ end = as.Date(USNewYearsDay(2011:2021))-1)
> wine.holiday
start end
1: 2010-11-25 2010-12-31
2: 2011-11-24 2011-12-31
3: 2012-11-22 2012-12-31
4: 2013-11-28 2013-12-31
5: 2014-11-27 2014-12-31
6: 2015-11-26 2015-12-31
7: 2016-11-24 2016-12-31
8: 2017-11-23 2017-12-31
9: 2018-11-22 2018-12-31
10: 2019-11-28 2019-12-31
11: 2020-11-26 2020-12-31
I still need to figure out how to add other ranges (e.g. two weeks before Valentine's Day or Mother's Day) to this, and will update this answer if/when I figure it out.
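One possible way to add the other ranges, sketched in base R only. Everything here is an assumption for illustration: the helper names nth_weekday and in_holiday, the choice of years, and the reading of "two weeks before" as the 14 days up to and including the holiday. Thanksgiving is taken as the fourth Thursday of November and Mother's Day as the second Sunday of May.

```r
# Hypothetical helper: date of the n-th given weekday in a month
# (wday: 0 = Sunday ... 6 = Saturday, as in as.POSIXlt()$wday).
nth_weekday <- function(year, month, wday, n) {
  first  <- as.Date(sprintf("%d-%02d-01", year, month))
  offset <- (wday - as.POSIXlt(first)$wday) %% 7
  first + offset + 7 * (n - 1)
}

years        <- 2011:2013                      # illustrative years
thanksgiving <- nth_weekday(years, 11, 4, 4)   # 4th Thursday of November
valentines   <- as.Date(sprintf("%d-02-14", years))
mothers_day  <- nth_weekday(years, 5, 0, 2)    # 2nd Sunday of May

# One row per holiday period (assumption: 14 days leading up to the day).
ranges <- data.frame(
  start = c(thanksgiving, valentines - 14, mothers_day - 14),
  end   = c(as.Date(sprintf("%d-12-31", years)), valentines, mothers_day)
)

# TRUE when a date falls inside any of the ranges.
in_holiday <- function(d) {
  vapply(seq_along(d), function(i) {
    any(d[i] >= ranges$start & d[i] <= ranges$end)
  }, logical(1))
}

cdate <- as.Date(c("2011-10-04", "2011-11-26", "2012-02-10", "2012-05-01"))
flags <- in_holiday(cdate)
```

With a logical vector like flags you could then split the data, for example wine.sub2[in_holiday(wine.sub2$cdate), ] for the holiday rows and the negation for everything else.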

Aggregating multiple events on the same date

I have event log data that looks like this:
id date order_value
1 2015-01-01 19.42
1 2015-01-22 21.23
1 2015-07-14 54.16
1 2015-08-13 36.28
2 2015-01-01 13.55
2 2015-03-15 16.77
2 2015-03-15 21.31
Notice how id 2 has 2 events on the same date. I want to sum those but I'm at a total loss.
I tried to use dplyr but I don't see any logical constructs that will let me do this. I think I have to use some sort of an if-statement, but I've heard those should be avoided at all costs.
In dplyr you can use group_by and summarize to do this task:
library(dplyr)
df_grouped <- group_by(df, id, date)
df_summarized <- summarize(df_grouped, order_value_per_date = sum(order_value))
df_summarized
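The same grouping can also be sketched in base R with aggregate(), in case dplyr is not available. The data frame below is retyped from the question; the name totals is just illustrative.

```r
df <- data.frame(
  id          = c(1, 1, 1, 1, 2, 2, 2),
  date        = as.Date(c("2015-01-01", "2015-01-22", "2015-07-14", "2015-08-13",
                          "2015-01-01", "2015-03-15", "2015-03-15")),
  order_value = c(19.42, 21.23, 54.16, 36.28, 13.55, 16.77, 21.31)
)

# Sum order_value within each (id, date) pair; duplicate dates collapse to one row.
totals <- aggregate(order_value ~ id + date, data = df, FUN = sum)
totals <- totals[order(totals$id, totals$date), ]
totals
```

No if-statement is needed: rows that share an id and a date land in the same group, and sum() is applied once per group.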
