Building User Activity Cohorts in R

Thank you for your help. I am trying to build cohorts, and I do get what I am looking for with:
cohort3 <- transactions %>%
  group_by(userId) %>%
  mutate(first_transaction = min(createDate)) %>%
  group_by(first_transaction, createDate) %>%
  summarize(clients = n())
BUT ... as you can see by the result, I get data back for every single day.
We had 7 users that transacted on 2017-01-03 the first time.
2 of these users transacted on 2017-01-04.
4 of these users transacted on 2017-01-05 and so forth.
This is great - but it's too granular.
How do I modify the above code to summarize by month, or better, by quarter?
Like:
Jan-2017 - 25 users transacted the first time.
Feb-2017 - 12 users from that cohort transacted again ... and so on.
Even better.
Q1 2017 - 78 users transacted.
Q2 2017 - 35 users of that Q1 2017 cohort transacted, etc.
Thank you.

The lubridate package includes a quarter() function for determining what quarter of the year a given date falls into. Something along these lines should do what you want:
library(dplyr)
library(lubridate)
cohort3 <- transactions %>%
  group_by(userId) %>%
  mutate(first_transaction = min(createDate),
         quarter = quarter(first_transaction, with_year = TRUE)) %>%
  group_by(quarter) %>%
  summarize(clients = n())
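As written, this counts rows (transactions) per cohort quarter; if you want unique users, swap n() for n_distinct(userId). To get the quarter-over-quarter retention described in the question (how many of the Q1 2017 cohort transacted again in Q2, and so on), one sketch is to keep both the cohort quarter and the quarter of each transaction. The column names cohort_q and activity_q below are illustrative, assuming the same transactions data:
library(dplyr)
library(lubridate)
cohort_retention <- transactions %>%
  group_by(userId) %>%
  mutate(cohort_q   = quarter(min(createDate), with_year = TRUE),  # quarter of first transaction
         activity_q = quarter(createDate, with_year = TRUE)) %>%   # quarter of this transaction
  ungroup() %>%
  distinct(userId, cohort_q, activity_q) %>%   # one row per user per active quarter
  count(cohort_q, activity_q, name = "clients")
Reading across one cohort_q then gives that cohort's counts for Q1, Q2, and so on.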

Related

How can I filter data based on date?

I have downloaded stock price data using the tidyquant package in R. This company's financial year ends on 31st March every year, so I want to filter the data for 31st March of all the available years. The data is available from 1-Jan-2013 to 1-Feb-2023. Out of these, I want to keep only the data points where the dates are as follows:
31-Mar-2013;
31-Mar-2014;
31-Mar-2015;
31-Mar-2016;
31-Mar-2017;
31-Mar-2018;
31-Mar-2019;
31-Mar-2020;
31-Mar-2021;
31-Mar-2022
How can I apply such a filter?
library(lubridate)
your_data_frame$DATE_COLUMN <- dmy(your_data_frame$DATE_COLUMN)
dplyr::filter(your_data_frame, month(DATE_COLUMN) == 3, day(DATE_COLUMN) == 31)
Given that March 31 will sometimes fall on a weekend, it might make sense to pull the last day of March instead.
your_data_frame %>%
  filter(month(DATE_COLUMN) == 3) %>%
  group_by(year = year(DATE_COLUMN)) %>%
  slice_max(DATE_COLUMN) %>%
  ungroup()
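For illustration, a minimal run with made-up March dates (the data frame below is an assumption, not the asker's tidyquant data) shows slice_max() picking the last observed March date in each year even when the 31st is missing:
library(dplyr)
library(lubridate)
your_data_frame <- tibble(
  DATE_COLUMN = dmy(c("29-Mar-2013", "31-Mar-2014", "30-Mar-2015")),
  close       = c(101, 105, 99)   # made-up prices
)
your_data_frame %>%
  filter(month(DATE_COLUMN) == 3) %>%
  group_by(year = year(DATE_COLUMN)) %>%
  slice_max(DATE_COLUMN) %>%
  ungroup()
# keeps 2013-03-29, 2014-03-31, and 2015-03-30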

R / dplyr: How to get each Monday in a specific date range + count it

I have data on when an ATP Tennis tournament took place with two columns in the following format:
| Tournament | Date Range |
| Australian Open | 20.01.2020 - 02.02.2020 |
Now the goal is to predict the participation, but solely for each Monday, if the date range contains a Monday of course! Also, since once you lose you are out in tennis, I am assuming that the participation in the second week is higher, since then only the good players are left in the tournament. That is why we would need to know whether it is week one or two of the tournament.
Hence for the above example we would need something like this at the end:
| Tournament | Date | Number of week |
| Australian Open | 20.01.2020 | 1 |
| Australian Open | 27.01.2020 | 2 |
I know that I can count in dplyr, but how would I get "only Mondays" in a way that is compatible with dplyr? Essentially SQL's "WHERE DAYOFWEEK(Column) = 2".
I guess one would first need to separate the date range into two columns?
The search function didn't yield anything covering such a specific problem, so this could help someone in the future.
Cheers, vie2bgd
#############################################
#############################################
Edit:
after working day and night I almost have it, but I am missing the last step. Also, sorry, it's my first post; no need to ghost me or give me immediate minus points. Thanks @NirGraham for at least giving me some hints, much appreciated; I will try to implement them. (I did share data above, following the instructions here: I separated it with | and simply forgot some points.)
here is what I did so far:
# first I separated the initial range into 2 columns
library(dplyr)
library(tidyr)      # separate(), unnest()
library(purrr)      # map2()
library(lubridate)  # wday()
tennis.orf.2020.2 <- tennis.orf.2020 %>%
  separate(Datum, c("Start", "End"), sep = " - ")
x <- tennis.orf.2020.2 %>%
  mutate(across(c(Start, End), ~ as.Date(.x, "%d.%m.%Y"))) %>%
  transmute(Tournament, date = map2(Start, End, seq, by = "day")) %>%
  unnest(c(date)) %>%
  filter(wday(date) == 2) %>%   # 2 = Monday with the default Sunday week start
  count(Tournament, date)
| Tournament | Date | Number of week |
| Australian Open | 2020-01-20 | 1 |
| Australian Open | 2020-01-27 | 1 |
This should be the result:
| Tournament | Date | Number of week |
| Australian Open | 2020-01-20 | 1 |
| Australian Open | 2020-01-27 | 2 |
If I group by tournament I lose a row :(
################################################
EDIT2:
Never mind, I finally got it, although it makes little sense to me.
Hopefully this will help somebody out there and save them some valuable time:
x %>%
  arrange(date) %>%
  group_by(Tournament) %>%
  mutate(dummy = 1) %>%
  mutate(times = cumsum(dummy)) %>%
  select(-dummy)
You can use the row_number() helper function for this:
x %>%
  arrange(date) %>%
  group_by(Tournament) %>%
  mutate(times = row_number())
This is a more concise equivalent of your code with the cumsum(dummy).
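Putting the whole thread together, a self-contained sketch (the one-row input below is made up to match the question's format):
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
tournaments <- tibble(Tournament = "Australian Open",
                      Datum      = "20.01.2020 - 02.02.2020")
tournaments %>%
  separate(Datum, c("Start", "End"), sep = " - ") %>%
  mutate(across(c(Start, End), ~ as.Date(.x, "%d.%m.%Y"))) %>%
  transmute(Tournament, date = map2(Start, End, seq, by = "day")) %>%
  unnest(date) %>%
  filter(wday(date) == 2) %>%              # keep only Mondays
  arrange(date) %>%
  group_by(Tournament) %>%
  mutate(`Number of week` = row_number()) %>%
  ungroup()
This returns 2020-01-20 with week 1 and 2020-01-27 with week 2, the expected result above.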

Using indexing to perform mathematical operations on data frame in r

I'm struggling to use basic indexing on a data frame to perform mathematical operations. I have a data frame containing all 50 US states with an entry for each month of the year, so there are 600 observations. I wish to find the difference between the December value and the January value for each of the states. My data looks like this:
> head(df)
state year month value
1 AL 2020 01 2.7
2 AK 2020 01 5
3 AZ 2020 01 4.8
4 AR 2020 01 3.7
5 CA 2020 01 4.2
7 CO 2020 01 2.7
For instance, AL has a value in Dec of 4.7 and Jan value of 2.7 so I'd like to return 2 for that state.
I have been trying to do this with the group_by and summarize functions, but can't figure out the indexing piece of it to grab values that correspond to a condition. I couldn't find a resource for performing these mathematical operations using indexing on a data frame, and would appreciate assistance as I have other transformations I'll be using.
With dplyr:
library(dplyr)
df %>%
  group_by(state) %>%
  summarize(year_change = value[month == "12"] - value[month == "01"])
This assumes that your data is as you describe: every state has a single value for every month. If you have missing rows, or multiple observations for a state in a given month, I would not expect this code to work.
Another approach, based on row order rather than the month value, might look like this:
library(dplyr)
df %>%
  ## make sure things are in the right order
  arrange(state, month) %>%
  group_by(state) %>%
  summarize(year_change = last(value) - first(value))
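As a quick check of both versions, here is a toy data set with two states and only the January and December rows (the AL values come from the question; the AK values are made up):
library(dplyr)
df <- tibble(
  state = rep(c("AL", "AK"), each = 2),
  year  = 2020,
  month = rep(c("01", "12"), times = 2),
  value = c(2.7, 4.7, 5.0, 6.1)
)
df %>%
  group_by(state) %>%
  summarize(year_change = value[month == "12"] - value[month == "01"])
# year_change is 2 for AL (4.7 - 2.7) and 1.1 for AK (6.1 - 5.0)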

How do I summarize values attributed to several variables in a data set?

First of all I have to describe my data set. It has three columns, where number 1 is country, number 2 is date (%Y-%m-%d), and number 3 is a value associated with each row (average hotel room prices). It continues like that in rows from 1990 to 2019. It works as such:
Country Date Value
France 2011-01-01 700
etc.
I'm trying to turn the date into years instead of the normal %Y-%m-%d format, so it will instead sum the average values for each country each year (instead of each month). How would I go about doing that?
I thought about summarizing the values totally for each country each year, but that is hugely tedious and takes a long time (plus the code will look horrible). So I'm wondering if there is a better solution for this problem that I'm not seeing.
Here is the task at hand so far. My dataset priceOnly shows the average price for each month. I have also restricted it to values not equal to 0.
diffyear <- priceOnly %>%
  group_by(Country, Date) %>%
  summarize(averagePrice = mean(Value[which(Value != 0.0)]))
You can use the lubridate package to extract years and then summarise accordingly.
Something like this:
diffyear <- priceOnly %>%
  mutate(Year = year(Date)) %>%
  filter(Value > 0) %>%
  group_by(Country, Year) %>%
  summarize(averagePrice = mean(Value, na.rm = TRUE))
And in general, you should always provide a minimal reproducible example with your questions.
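For instance, a minimal reproducible version of this answer, with made-up prices:
library(dplyr)
library(lubridate)
priceOnly <- tibble(
  Country = "France",
  Date    = as.Date(c("2011-01-01", "2011-02-01", "2012-01-01")),
  Value   = c(700, 680, 0)   # made-up monthly averages
)
priceOnly %>%
  mutate(Year = year(Date)) %>%
  filter(Value > 0) %>%
  group_by(Country, Year) %>%
  summarize(averagePrice = mean(Value, na.rm = TRUE))
# France 2011: 690; the 2012 row drops out because its Value is 0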

Aggregate hourly data for each month of the year

I've looked around for something similar, but couldn't find anything. I have an airport data set which looks something like this (I rounded the hours):
Date Arrival_Time Departure_Time ...
2017-01-01 13:00 14:00 ...
2017-01-01 16:00 17:00 ...
2017-01-01 17:00 18:00 ...
2017-01-01 11:00 12:00 ...
The problem is that for some months there isn't a flight at a specific time, which means I have missing data for some hours. How can I extract hourly arrivals for each hour of every month so that there are no missing values?
I've tried using dplyr and doing the following:
arrivals <- allFlights %>%
  group_by(month(Date), Arrival_Time) %>%
  summarise(n()) %>%
  na.omit()
but the problem clearly arises as group_by cannot fill in my missing data. I end up with data for every month, but no entries for some hours (e.g. no entry for month 1, hour 22:00).
I could currently get my answer by filtering out every month in its own list, and then fully merging them with a complete list of hours, but that's really slow as I have to do this 12 times. Ideally I'm trying to end up with something like this:
Hour Month January February March ... December
00:00 1 ### ### ### ... ###
01:00 1 ### ### ### ... ###
...
00:00 12 ### ### ### ... ###
23:00 12 ### ### ### ... ###
where ### is the number of flights for that hour of that month. Is there a nice way of doing this?
Note: I was thinking if I could somehow join every month's hours with my complete list of hours, and replace all na's with 0's, then that would work, but I couldn't figure out how to do it properly.
Hopefully the question makes sense. I'd gladly clarify if anything is unclear.
EDIT:
If you want to try it with the nycflights13 package, you could reproduce my attempt with the following code:
allFlights <- nycflights13::flights
allFlights$arr_time <- format(strptime(substr(as.POSIXct(
  sprintf("%04.0f", allFlights$arr_time), format = "%H%M"), 12, 16), "%H:%M"), "%H:00")
arrivals <- allFlights %>% filter(carrier == "MQ") %>%
  group_by(month, arr_time) %>% summarise(n()) %>% na.omit()
Notice how arrivals doesn't have anything for month 1, hour 02:00, 03:00, etc. What I'm trying to do is have this be a complete data set with the missing hours filled in as 0.
I think you can use the code below to generate what you need.
library(stringr)
dim_month_hour <- expand.grid(
  hour  = paste(str_pad(seq(0, 23, 1), 2, "left", "0"), "00", sep = ":"),
  month = sort(unique(allFlights$month)),
  stringsAsFactors = FALSE)
arrivals_full <- left_join(dim_month_hour, arrivals,
                           by = c("hour" = "arr_time", "month" = "month"))
arrivals_full[is.na(arrivals_full$`n()`), "n()"] <- 0
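For what it's worth, tidyr::complete() can do the same padding in one step. A sketch, under the assumption that arrivals is rebuilt with count() so the tally column has the plain name n instead of `n()`:
library(dplyr)
library(tidyr)
arrivals <- allFlights %>%
  filter(carrier == "MQ") %>%
  count(month, arr_time) %>%   # like group_by + summarise(n = n())
  na.omit()
arrivals_full <- arrivals %>%
  complete(month, arr_time, fill = list(n = 0))  # add missing combos as 0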
Is this what you're trying to do? I'm not sure if I'm aggregating exactly how you want, but counting non-missing arrival times with !is.na should do what you're looking for.
arrivals <- allFlights %>%
  group_by(month(Date), Arrival_Time) %>%
  summarise(n = sum(!is.na(Arrival_Time)))
Edit: I may not be clear. Do you want a zero to show for hours where there are no data?
So I'm circling back to it. There's a cool package called padr that will "pad" the date/time entries with rows of NAs for missing values. Because there is a time_hour field, you can use pad():
library(padr)
allFlightsPad <- allFlights %>% pad()
Then you can summarize from there. See the padr documentation for more info.
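A sketch of that approach, using the nycflights13 data from the question's edit. Counting per time_hour first makes the timestamps unique so pad() can insert the missing hours, and fill_by_value() (a padr helper) turns the padded NAs into zeros:
library(dplyr)
library(padr)
allFlights <- nycflights13::flights
allFlights %>%
  count(time_hour) %>%           # flights per hour, with gaps where nothing arrived
  pad(interval = "hour") %>%     # insert rows for the missing hours
  fill_by_value(n, value = 0)    # replace the padded NAs with 0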
