I have the following data:
signup_date purchase_date nbr_purchase
2010-12-12 7 2
2011-01-03 4 1
2010-11-28 6 2
2011-01-05 19 9
2010-11-10 26 3
2010-11-25 11 2
Where each row corresponds to a customer: signup_date is the sign-up date, purchase_date is the number of days elapsed between sign-up and first purchase, and nbr_purchase is the number of items purchased. I would like to carry out a cohort analysis and transform the data to look like:
cohort signed_up active_m0 active_m1 active_m2
2011-10 12345 10432 8765 6754
2011-11 12345 10432 8765 6754
2011-12 12345 10432 8765 6754
Cohort here is in “YYYY-MM” format, signed_up is the number of users who created accounts in the given month, active_m0 is the number of users who made their first purchase in the same month they registered, active_m1 is the number who made their first purchase in the following month, and so forth.
Assuming your input data is in the following format
dd<-structure(list(signup_date = structure(c(14955, 14977, 14941,
14979, 14923, 14938), class = "Date"), purchase_date = c(7L,
4L, 6L, 19L, 26L, 11L), nbr_purchase = c(2L, 1L, 2L, 9L, 3L,
2L)), .Names = c("signup_date", "purchase_date", "nbr_purchase"
), row.names = c(NA, -6L), class = "data.frame")
Then you can do
dd$cohort <- strftime(dd$signup_date, "%Y-%m")
dd$interval <- paste0("active_m", (dd$purchase_date %/% 10) + 1)
tt <- with(dd, table(cohort, interval))
cbind(tt, signed_up = rowSums(tt))
to get the data you need
active_m1 active_m2 active_m3 signed_up
2010-11 1 1 1 3
2010-12 1 0 0 1
2011-01 1 1 0 2
Note that I used 10-day intervals here rather than 30-day intervals, since your sample data has no purchase observations more than 30 days after signup. For your real data, change %/% 10 to %/% 30.
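For real, month-scale data the interval step might look like the sketch below. Note it drops the +1 so labels start at active_m0, matching the labeling asked for in the question (whether you want 30-day windows or calendar months is an assumption to settle against your definition of a cohort):

```r
# 30-day buckets, zero-based so the first bucket is active_m0
dd$interval <- paste0("active_m", dd$purchase_date %/% 30)
tt <- with(dd, table(cohort, interval))
cbind(tt, signed_up = rowSums(tt))
```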
My dataframe is like this:
Device_id Group Nb_burst Date_time
24 1 3 2018-09-02 10:04:04
24 1 5 2018-09-02 10:08:00
55 2 3 2018-09-03 10:14:34
55 2 7 2018-09-03 10:02:29
16 3 2 2018-09-20 08:17:11
16 3 71 2018-09-20 06:03:40
22 4 10 2018-10-02 11:33:55
22 4 14 2018-10-02 16:22:18
For rows with the same Device_id, the same Group number, and the same Date, I would like to compute the time lag between two rows.
If the time lag is > 1 hour, keep both rows.
If the time lag is < 1 hour, keep only the row with the biggest Nb_burst.
Which means a dataframe like:
Device_id Group Nb_burst Date_time
24 1 5 2018-09-02 10:08:00
55 2 7 2018-09-03 10:02:29
16 3 71 2018-09-20 06:03:40
22 4 10 2018-10-02 11:33:55
22 4 14 2018-10-02 16:22:18
I tried :
Data$timelag <- c(NA, difftime(Data$Min_start.time[-1], Data$Min_start.time[-nrow(Data)], units="hours"))
But I don't know how to test this only when Date, ID, and Group are the same; it probably needs a loop.
My df has 1500 rows.
Hope someone can help me. Thank you!
I'm not sure why your group 3 is not duplicated in the expected output, since its time difference is greater than one hour.
But you could create two indexing variables using ave: first, the order of Nb_burst within each grouping; second, the time differences within each grouping.
dat <- within(dat, {
score <- ave(Nb_burst, Device_id, Group, as.Date(Date_time),
FUN=order)
thrsh <- abs(ave(as.numeric(Date_time), Device_id, Group, as.Date(Date_time),
FUN=diff)/3600) > 1
})
Finally subset by rowSums.
dat[rowSums(dat[c("score", "thrsh")]) > 1,1:4]
# Device_id Group Nb_burst Date_time
# 2 24 1 5 2018-09-02 10:08:00
# 3 55 2 7 2018-09-03 10:14:34
# 5 16 3 2 2018-09-20 08:17:11
# 6 16 3 71 2018-09-20 06:03:40
# 7 22 4 10 2018-10-02 11:33:55
# 8 22 4 14 2018-10-02 16:22:18
Data
dat <- structure(list(Device_id = c(24L, 24L, 55L, 55L, 16L, 16L, 22L,
22L), Group = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Nb_burst = c(3L,
5L, 7L, 3L, 2L, 71L, 10L, 14L), Date_time = structure(c(1535875444,
1535875680, 1535962474, 1535961749, 1537424231, 1537416220, 1538472835,
1538490138), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(NA,
-8L), class = "data.frame")
I have a dataset that look something like this:
Person date Amount
A 2019-01 900
A 2019-03 600
A 2019-04 300
A 2019-05 0
B 2019-04 1200
B 2019-07 800
B 2019-08 400
B 2019-09 0
As you'll notice in the "date" column, there are missing dates, such as '2019-02' for person A and '2019-05' and '2019-06' for person B. I would like to insert rows with the missing date and amount equal to the one before it (see expected result below).
I have tried performing group by but I don't know how to proceed from there. I've also tried converting the 'date' and 'amount' columns as lists, and from there fill in the gaps before putting them back to the dataframe. I was wondering if there is a more convenient way of doing this. In particular, getting the same results without having to extract lists from the original dataframe.
Ideally, I would want a dataframe that looks something like this:
Person date Amount
A 2019-01 900
A 2019-02 900
A 2019-03 600
A 2019-04 300
A 2019-05 0
B 2019-04 1200
B 2019-05 1200
B 2019-06 1200
B 2019-07 800
B 2019-08 400
B 2019-09 0
I hope I was able to make my problem clear.
Thanks in advance.
We can first convert date to an actual date object (date1) by pasting "-01" onto the end. Then, using complete, we create a sequence of one-month dates for each Person. We then use fill to set Amount equal to the value before it, and to return the data to its original form we strip the "-01" from date1 again.
library(dplyr)
library(tidyr)
df %>%
mutate(date1 = as.Date(paste0(date, "-01"))) %>%
group_by(Person) %>%
complete(date1 = seq(min(date1), max(date1), by = "1 month")) %>%
fill(Amount) %>%
mutate(date = sub("-01$", "", date1)) %>%
select(-date1)
# Person date Amount
# <fct> <chr> <int>
# 1 A 2019-01 900
# 2 A 2019-02 900
# 3 A 2019-03 600
# 4 A 2019-04 300
# 5 A 2019-05 0
# 6 B 2019-04 1200
# 7 B 2019-05 1200
# 8 B 2019-06 1200
# 9 B 2019-07 800
#10 B 2019-08 400
#11 B 2019-09 0
data
df <- structure(list(Person = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("A", "B"), class = "factor"), date = structure(c(1L,
2L, 3L, 4L, 3L, 5L, 6L, 7L), .Label = c("2019-01", "2019-03",
"2019-04", "2019-05", "2019-07", "2019-08", "2019-09"), class = "factor"),
Amount = c(900L, 600L, 300L, 0L, 1200L, 800L, 400L, 0L)),
class = "data.frame", row.names = c(NA, -8L))
In R, I am looking at data that shows simulated power system outages and need a way to tag continuous outages. The data are hourly, so I am looking for something that can recognize sequential hours and then breaks in the sequence. I am having trouble tagging outages that stretch out over midnight.
I have tried a couple approaches, but keep running into issues with outages that extend multiple days. For example, I can tag a 12 hour outage that runs from hour 8 to hour 20, but it splits up the tag if the outage runs from say, hour 20 on day 1 to hour 12 on day 2 (these end up looking like 2 different, shorter, outages).
month day hour outage_tag
1 2 23 1
1 2 24 1
1 3 1 1
1 3 2 1
3 5 13 2
3 5 14 2
3 5 15 2
The goal is to create the outage_tag column shown above. I am having trouble creating tags that wrap around midnight (tag 1 in the example would be broken into 2 different tags, which is not useful). I have the data to create a year-month-day-hour date if needed.
Any help (or suggestions for improving this question) would be much appreciated. Thanks!
If outages can extend from February to March then we will have to know the year as well, so, assuming that year stores the year, convert to POSIXct using ISOdatetime, take successive differences, compare to 1 hour, and take the cumulative sum. (hour - 1 is used because your hours run from 1 to 24 rather than 0 to 23.)
year <- 2000
transform(DF, outage_tag =
cumsum(c(1, diff(ISOdatetime(year, month, day, hour-1, 0, 0, tz = "GMT")) != 1)))
giving:
month day hour outage_tag
1 1 2 23 1
2 1 2 24 1
3 1 3 1 1
4 1 3 2 1
5 3 5 13 2
6 3 5 14 2
7 3 5 15 2
Note
DF <- structure(list(month = c(1L, 1L, 1L, 1L, 3L, 3L, 3L), day = c(2L,
2L, 3L, 3L, 5L, 5L, 5L), hour = c(23L, 24L, 1L, 2L, 13L, 14L,
15L)), class = "data.frame",
row.names = c(NA, -7L))
I am very new to programming with R, but I am trying to replace a column name with the dataframe's name using a for loop. I have 25 dataframes with cryptocurrency time series data.
ls(pattern="USD")
[1] "ADA.USD" "BCH.USD" "BNB.USD" "BTC.USD" "BTG.USD" "DASH.USD" "DOGE.USD" "EOS.USD" "ETC.USD" "ETH.USD" "IOT.USD"
[12] "LINK.USD" "LTC.USD" "NEO.USD" "OMG.USD" "QTUM.USD" "TRX.USD" "USDT.USD" "WAVES.USD" "XEM.USD" "XLM.USD" "XMR.USD"
[23] "XRP.USD" "ZEC.USD" "ZRX.USD"
Every object is a dataframe that stands for a cryptocurrency expressed in USD, and every dataframe has 2 columns: Date and Close (closing price).
For example: the dataframe "BTC.USD" stands for Bitcoin in USD:
head(BTC.USD)
# A tibble: 6 x 2
Date Close
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
4 2016-01-03 431.
5 2016-01-04 433.
Now I want to replace the name of the second column ("Close") by the name of the dataframe ("BTC.USD")
For this case I used the following code:
colnames(BTC.USD)[2] <-deparse(substitute(BTC.USD))
And this code works as I imagined:
> head(BTC.USD)
# A tibble: 6 x 2
Date BTC.USD
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
Now I am trying to create a loop to change the second column name for all 25 dataframes of cryptocurrency data:
df_list <- ls(pattern="USD")
for(i in df_list){
aux <- get(i)
(colnames(aux)[2] =df_list)
assign(i,aux)
}
But the code does not work as I thought. Can someone help me figure out what step I am missing?
Thanks in advance!
You can use Map to assign the names, i.e.
Map(function(x, y) {names(x)[2] <- y; x}, l2, names(l2))
#$`a`
# v1 a
#1 3 8
#2 5 6
#3 2 7
#4 1 5
#5 4 4
#$b
# v1 b
#1 9 47
#2 18 48
#3 17 6
#4 5 25
#5 13 12
DATA
dput(l2)
list(a = structure(list(v1 = c(3L, 5L, 2L, 1L, 4L), v2 = c(8L,
6L, 7L, 5L, 4L)), class = "data.frame", row.names = c(NA, -5L
)), b = structure(list(v1 = c(9L, 18L, 17L, 5L, 13L), v2 = c(47L,
48L, 6L, 25L, 12L)), class = "data.frame", row.names = c(NA,
-5L)))
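For completeness, the OP's get/assign loop also works once the bug is fixed: the second column must be named after the current element i, not the whole df_list vector (assigning a length-25 vector to a single name only uses its first element, with a warning). A sketch, assuming the 25 dataframes live in the global environment:

```r
df_list <- ls(pattern = "USD")
for (i in df_list) {
  aux <- get(i)            # fetch the dataframe whose name is i
  colnames(aux)[2] <- i    # use the single name i, not the whole df_list vector
  assign(i, aux)           # write it back under the same name
}
```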
I have a data set of 300k+ cases where a customer ID may be repeated several times. Each customer also has a date and rank associated with it. I'd like to keep only unique customer IDs, sorted first by date; if a duplicate ID has a duplicate date, it should then sort by rank (keeping the rank closest to 1). An example of my data:
Customer.ID Date Rank
576293 8/13/2012 2
576293 11/16/2015 6
581252 11/22/2013 4
581252 11/16/2011 6
581252 1/4/2016 5
581600 1/12/2015 3
581600 1/12/2015 2
582560 4/13/2016 1
591674 3/21/2012 6
586334 3/30/2014 1
Ideal outcome would then be like this:
Customer.ID Date Rank
576293 11/16/2015 6
581252 1/4/2016 5
581600 1/12/2015 2
582560 4/13/2016 1
591674 3/21/2012 6
586334 3/30/2014 1
With the desired output of the OP clarified:
We can also do this with base R, which will be faster than the dplyr approach below using group_by(Customer.ID), since we do not have to loop over each unique Customer.ID:
df <- df[order(-df$Customer.ID,as.Date(df$Date, format="%m/%d/%Y"),-df$Rank, decreasing=TRUE),]
res <- df[!duplicated(df$Customer.ID),]
Notes:
First, sort by Customer.ID in ascending order followed by Date in descending order followed by Rank in ascending order.
Remove the duplicates in Customer.ID so that only the first row for each Customer.ID is kept.
The result using your posted data as a data frame df (without converting the Date column) in ascending order for Customer.ID:
print(res)
## Customer.ID Date Rank
##2 576293 11/16/2015 6
##5 581252 1/4/2016 5
##7 581600 1/12/2015 2
##8 582560 4/13/2016 1
##10 586334 3/30/2014 1
##9 591674 3/21/2012 6
Data:
df <- structure(list(Customer.ID = c(591674L, 586334L, 582560L, 581600L,
581252L, 576293L), Date = c("3/21/2012", "3/30/2014", "4/13/2016",
"1/12/2015", "1/4/2016", "11/16/2015"), Rank = c(6L, 1L, 1L,
2L, 5L, 6L)), .Names = c("Customer.ID", "Date", "Rank"), row.names = c(9L,
10L, 8L, 7L, 5L, 2L), class = "data.frame")
If you want to keep only the latest date (followed by lower rank) row for each Customer.ID, you can do the following using dplyr:
library(dplyr)
res <- df %>% group_by(Customer.ID) %>% arrange(desc(Date),Rank) %>%
summarise_all(funs(first)) %>%
ungroup() %>% arrange(Customer.ID)
Notes:
group_by Customer.ID and sort using arrange by Date in descending order and Rank by ascending order.
summarise_all to keep only the first row from each Customer.ID.
Finally, ungroup and sort by Customer.ID to get your desired result.
Using your data as a data frame df with the Date column converted to the Date class:
print(res)
## A tibble: 6 x 3
## Customer.ID Date Rank
## <int> <date> <int>
##1 576293 2015-11-16 6
##2 581252 2016-01-04 5
##3 581600 2015-01-12 2
##4 582560 2016-04-13 1
##5 586334 2014-03-30 1
##6 591674 2012-03-21 6
Data:
df <- structure(list(Customer.ID = c(576293L, 576293L, 581252L, 581252L,
581252L, 581600L, 581600L, 582560L, 591674L, 586334L), Date = structure(c(15565,
16755, 16031, 15294, 16804, 16447, 16447, 16904, 15420, 16159
), class = "Date"), Rank = c(2L, 6L, 4L, 6L, 5L, 3L, 2L, 1L,
6L, 1L)), .Names = c("Customer.ID", "Date", "Rank"), row.names = c(NA,
-10L), class = "data.frame")