I have event log data that looks like this:
id date order_value
1 2015-01-01 19.42
1 2015-01-22 21.23
1 2015-07-14 54.16
1 2015-08-13 36.28
2 2015-01-01 13.55
2 2015-03-15 16.77
2 2015-03-15 21.31
Notice how id 2 has two events on the same date. I want to sum those, but I'm at a total loss.
I tried to use dplyr, but I don't see any logical constructs that will let me do this. I think I need some sort of if-statement, but I've heard those should be avoided at all costs.
In dplyr you can use group_by and summarize to do this task:
library(dplyr)
df_grouped <- group_by(df, id, date)
df_summarized <- summarize(df_grouped, order_value_per_date = sum(order_value))
df_summarized
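For a self-contained check, the sample data can be read straight from the question and piped through the same two steps (a sketch; building df with read.table is my assumption, as the question only shows printed data):
df <- read.table(text = "
id date order_value
1 2015-01-01 19.42
1 2015-01-22 21.23
1 2015-07-14 54.16
1 2015-08-13 36.28
2 2015-01-01 13.55
2 2015-03-15 16.77
2 2015-03-15 21.31", header = TRUE)
df$date <- as.Date(df$date)  # parse the date column

df %>%
  group_by(id, date) %>%
  summarize(order_value_per_date = sum(order_value))
# id 2's two 2015-03-15 rows collapse into one: 16.77 + 21.31 = 38.08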
I'm having difficulty calculating the average time between the payment dates in my CSV. I have tried multiple methods that I have seen online (converting to a data.table, using ddply) with no success.
WorkerID PaymentDate
1 2015-07-18
1 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18
This is an example of my dataset. I want to calculate the average time between the PaymentDates (in number of days), grouped by WorkerID, in the simplest way possible.
Thank you!
This is a perfect job for aggregate(). It groups PaymentDate by WorkerID and applies the function mean(diff(.)) to each group.
tt <- read.table(text="
WorkerID PaymentDate
1 2015-06-18
1 2015-07-18
1 2015-08-18
2 2015-09-18
3 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18
4 2015-12-16", header=TRUE)
tt$PaymentDate <- as.Date(tt$PaymentDate)
aggregate(PaymentDate ~ WorkerID, data=tt, FUN=function(x) mean(diff(x)))
# WorkerID PaymentDate
# 1 1 30.5
# 2 2 NaN
# 3 3 31.0
# 4 4 29.5
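Note that WorkerID 2 comes out as NaN because that group has a single payment date: diff() on a length-one vector returns an empty vector, and mean() of an empty vector is NaN.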
As an alternative to AkselA's answer, one can use the data.table package if one prefers it over base R.
This is similar to using aggregate, but may sometimes give a speed boost. In the example below I've handled workers with only a single payment date by setting the difference to 0, to illustrate how this can be achieved.
library(data.table)
df <- fread("WorkerID PaymentDate
1 2015-07-18
1 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18")
df[, PaymentDate := as.Date(PaymentDate)]
# Mean gap in days per worker; a single payment yields 0 rather than NaN
df[, {
  if (length(PaymentDate) > 1) {
    mean(diff(as.numeric(PaymentDate)))
  } else {
    0
  }
}, by = WorkerID]
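With the five sample rows this should produce (an output sketch; the unnamed j expression gets data.table's default column name V1):
   WorkerID V1
1:        1 31
2:        3  0
3:        4 31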
I have a big data frame with dates, and I need to find, for each ID, the first date (going backwards in time) at which the coverage stops being continuous, as follows:
ID ID_2 END BEG
1 55 2017-06-30 2016-01-01
1 55 2015-12-31 2015-11-12 --> Gap (required date)
1 88 2008-07-26 2003-02-24
2 19 2014-09-30 2013-05-01
2 33 2013-04-30 2011-01-01 --> Not Gap (overlapping)
2 19 2012-12-31 2011-01-01
2 33 2010-12-31 2008-01-01
2 19 2007-12-31 2006-01-01
2 19 2005-12-31 1980-10-20 --> No actual gap (required date)
As shown, not all the date ranges overlap, and I need to return, by ID (not ID_2), the date at which the first gap (going backwards in time) appears. I've tried using a for loop, but it's extremely slow (the data frame has 150k rows). I've been messing around with dplyr and mutate as follows:
df <- df %>%
  group_by(ID) %>%
  mutate(END_lead = lead(END))
df$FLAG <- df$BEG - days(1) == df$END_lead
df <- df %>%
  group_by(ID) %>%
  filter(cumsum(cumsum(FLAG == FALSE)) <= 1)
But this set of instructions stops at the first overlap, filtering the wrong date. I've tried everything I could think of, ordering in descending or ascending order and using min and max, but could not figure out a solution.
The actual result wanted would be:
ID ID_2 END BEG
1 55 2015-12-31 2015-11-12
2 19 2005-12-31 1980-10-20
Is there a way of doing this using dplyr, tidyr and lubridate?
A possible solution using dplyr:
library(dplyr)
df %>%
  mutate_at(vars(END, BEG), funs(as.Date)) %>%
  group_by(ID) %>%
  slice(which.max(BEG > (lead(END) + 1) | is.na(BEG > (lead(END) + 1))))
With your example data, it gives:
# A tibble: 2 x 4
# Groups: ID [2]
ID ID_2 END BEG
<int> <int> <date> <date>
1 1 55 2015-12-31 2015-11-12
2 2 19 2005-12-31 1980-10-20
What the solution does is basically:
Changes the dates to Date format (no need for lubridate);
Groups by ID;
Selects the first row, scanning backwards in time, that satisfies the criterion: either the row is a gap (its BEG starts more than one day after the next row's END), or, if the group has no gap at all, the last row is selected, because its gap check uses lead(END), which is NA there; this is why the is.na(BEG > (lead(END) + 1)) part is needed.
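For completeness, here is one way the example data could be built as runnable R (my assumption; the question only shows a printed table, and the answer's mutate_at then handles the Date conversion):
df <- read.table(text = "
ID ID_2 END BEG
1 55 2017-06-30 2016-01-01
1 55 2015-12-31 2015-11-12
1 88 2008-07-26 2003-02-24
2 19 2014-09-30 2013-05-01
2 33 2013-04-30 2011-01-01
2 19 2012-12-31 2011-01-01
2 33 2010-12-31 2008-01-01
2 19 2007-12-31 2006-01-01
2 19 2005-12-31 1980-10-20", header = TRUE)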
I would use the xts package: first create an xts object for each ID you have, then use the first() and last() functions on each object.
https://www.datacamp.com/community/blog/r-xts-cheat-sheet
I have a data set containing data for about 4.5 years. I'm trying to create two different data frames from this, for what I will call holiday and non-holiday periods. There are multiple periods per year, and these periods will repeat over multiple years.
For example, I'd like to choose a time period between Thanksgiving and New Year's Day, as well as periods prior to Valentine's Day and Mother's Day for each year, and make this my holiday data frame. Everything else would be non-holiday.
I apologize if this has been asked before, I just can't find it. I found a similar question for SQL, but I'm trying to figure out how to do this in R.
I've tried filtering and selecting, to no avail.
wine.holiday <- wine.sub2 %>%
select(total, cdate) %>%
subset(cdate>=2011-11-25, cdate<=2011-12-31)
wine.holiday
Source: local data frame [27,628 x 3]
Groups: clubgroup_id.x [112]
clubgroup_id.x total cdate
(chr) (dbl) (date)
1 1 45 2011-10-04
2 1 45 2011-10-08
3 1 45 2011-10-09
4 1 45 2011-10-09
5 1 45 2011-10-11
6 1 45 2011-10-15
7 1 45 2011-10-24
8 1 90 2011-11-13
9 1 45 2011-11-18
10 1 45 2011-11-26
.. ... ... ...
Clearly something isn't right, because not only is it not limiting the date range, but it's including a column in the data frame that I'm not even selecting.
As mentioned in the comments, dplyr uses filter, not subset. There are two other issues: an unquoted 2011-11-25 is plain arithmetic (2011 minus 11 minus 25, i.e. 1975), so your bounds compare cdate against a number rather than a date; and the extra clubgroup_id.x column appears because wine.sub2 is grouped, and select() on a grouped data frame always keeps the grouping variables. Just a simple change to the code you've got (therefore not a complete solution to your issue, but hopefully it helps) should get the subset working.
wine.holiday <- wine.sub2 %>%
select(total, cdate)
wine.holiday <- subset(wine.holiday, cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31"))
wine.holiday
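To see why the original comparison misbehaves, note what an unquoted date literal actually evaluates to:
2011-11-25             # unquoted, this is just subtraction
# [1] 1975
as.Date("2011-11-25")  # what was intended
# [1] "2011-11-25"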
Or, to stick with dplyr piping:
wine.holiday <- wine.sub2 %>%
select(total, cdate) %>%
filter( cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31") )
wine.holiday
EDIT to add: If the dplyr select isn't working (it looks fine to me), you could try this:
wine.holiday <- subset( wine.sub2, select = c( total, cdate ) )
wine.holiday <- subset(wine.holiday, cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31"))
wine.holiday
You could, of course, combine those two lines into one. This makes it harder to read, but avoids creating an intermediate object:
wine.holiday <- subset(wine.sub2, cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31"), select=c(total,cdate) )
I figured out another method for this by looking through SO posts (it took a while).
> library(timeDate)
> library(data.table)
> wine.holiday <- data.table(start = as.Date(USThanksgivingDay(2010:2020)),
+                            end = as.Date(USNewYearsDay(2011:2021)) - 1)
> wine.holiday
start end
1: 2010-11-25 2010-12-31
2: 2011-11-24 2011-12-31
3: 2012-11-22 2012-12-31
4: 2013-11-28 2013-12-31
5: 2014-11-27 2014-12-31
6: 2015-11-26 2015-12-31
7: 2016-11-24 2016-12-31
8: 2017-11-23 2017-12-31
9: 2018-11-22 2018-12-31
10: 2019-11-28 2019-12-31
11: 2020-11-26 2020-12-31
I still need to figure out how to add other ranges (e.g. two weeks before Valentine's Day or Mother's Day) to this, and will update this answer if/when I figure it out.
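One possible way to bolt on a fixed-date range, such as the two weeks before Valentine's Day, is sketched below (my own assumption, reusing the wine.holiday table above; Mother's Day floats as the second Sunday in May, so it would need to be computed separately):
> valentines <- as.Date(paste0(2011:2021, "-02-14"))
> wine.holiday <- rbind(wine.holiday,
+                       data.table(start = valentines - 14, end = valentines))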
I am working with a data set that includes roughly 400 unique subjects; for this example, however, I will only be working with two. You can generate sample data with this code:
set.seed(100)
library(tidyr)
library(dplyr)
Subject<-c("A","A","A","A","A","A","B","B","B","B")
Event1<-c("01/01/2001","01/01/2001","01/01/2001","01/01/2001","09/09/2001","09/09/2001","09/09/2009","09/09/2009","09/09/2009","09/09/2009")
random.dates<-function(N,sd="2001-01-01",ed="2010-01-01"){
sd<-as.Date(sd,"%Y-%m-%d")
ed<-as.Date(ed,"%Y-%m-%d")
dt<-as.numeric(difftime(ed,sd))
ev<-sort(runif(N,0,dt))
rt<-sd+ev
}
Event1<-as.Date(Event1,"%m/%d/%Y")
Event1
Event2<-print(random.dates(10))
df<-data.frame(Subject,Event1,Event2)
df
and produces something close to this output:
Subject Event1 Event2
1 A 2001-01-01 2001-05-04
2 A 2001-01-01 2001-09-24
3 A 2001-01-01 2002-10-22
4 A 2001-01-01 2003-02-25
5 A 2001-09-09 2007-07-16
6 A 2001-09-09 2008-04-06
7 B 2009-09-09 2008-07-12
8 B 2009-09-09 2008-07-24
9 B 2009-09-09 2009-04-01
10 B 2009-09-09 2009-09-11
In this case I am interested in first grouping unique Subjects with unique Event1 values, which I can do easily. From there I need to select the Event2 that falls closest to Event1 for that unique Subject-Event1 combination, which is where I really need help. For this example these data should decompose to 3 different records:
Subject Event1 Event2
1 A 2001-01-01 2001-05-04
2 A 2001-09-09 2008-04-06
3 B 2009-09-09 2009-09-11
I've jerry-rigged a solution to produce the 3 records of Subject-Event1 combinations:
df2 <- df
df2$SubEv <- paste(df2$Subject, df2$Event1)
df2$Event1 <- NULL
df2$Subject <- NULL
df2$Event2 <- NULL
df2 <- unique(df2)
df2 <- separate(df2, SubEv, c("Subject", "Event1"), sep = " ")
From here I'm just lost as to how to make R select from df the date of Event2 that is closest to Event1.
I already know that my code is super inefficient and sloppy (probably because of my approach at the get go). I'd like to know how to do this (at all honestly), and if there's a way I can do this calling fewer than 10 lines of code that would be pretty boss.
With dplyr:
library(dplyr)
df %>%
group_by(Subject, Event1) %>%
slice(which.min(abs(Event1 - Event2)))
# Subject Event1 Event2
# (chr) (date) (date)
# 1 A 2001-01-01 2001-07-05
# 2 A 2001-09-09 2004-05-02
# 3 B 2009-09-09 2008-04-24
Comments:
group_by can work with multiple columns.
slice selects row numbers within a group. Alternatively...
... %>% filter( row_number() == which.min(abs(Event1 - Event2)) )
For a tie, which.min will return the first min. See ?which.min for details.
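A quick toy illustration of that tie behavior (my own example, not from the OP's data):
which.min(c(3, 1, 1))
# [1] 2   (the first of the two minima)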
Data: When I run the OP's code, I get df looking like
Subject Event1 Event2
1 A 2001-01-01 2001-07-05
2 A 2001-01-01 2002-07-14
3 A 2001-01-01 2003-04-27
4 A 2001-01-01 2003-10-09
5 A 2001-09-09 2004-05-02
6 A 2001-09-09 2005-03-21
7 B 2009-09-09 2005-05-10
8 B 2009-09-09 2005-12-02
9 B 2009-09-09 2005-12-21
10 B 2009-09-09 2008-04-24
which explains why my result doesn't match exactly the OP's expected result.
I have a dataframe that looks like this:
month create_time request_id weekday
1 4 2014-04-25 3647895 Friday
2 12 2013-12-06 2229374 Friday
3 4 2014-04-18 3568796 Friday
4 4 2014-04-18 3564933 Friday
5 3 2014-03-07 3081503 Friday
6 4 2014-04-18 3568889 Friday
And I'd like to get the count of request_ids by the weekday. How would I do this in R?
I've tried a lot of stuff based on ddply and aggregate with no luck.
Try using aggregate():
> aggregate(request_id ~ weekday, FUN = length, data = df)
weekday request_id
1 Friday 6
There are several valid ways to do it. I usually go with my trusty sqldf(). If the dataframe is named D, then
library(sqldf)
counts <- sqldf('select weekday, count(request_id) as nrequests from D group by weekday')
sqldf() can be wordy, but it is just so easy to remember and get right the first time!
or... you could try:
library(plyr)
count(df, "weekday")
or
ddply(df, .(weekday), summarise, count = length(month))
Another option is to use a table and take the rowSums:
> rowSums(with(df, table(weekday, request_id)))
Friday
6
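This works because table() cross-tabulates weekday against request_id, and rowSums() then totals each weekday's row. Since every request_id occurs only once here, a one-way table would give the same counts:
table(df$weekday)
# Friday
#      6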