I have a dataset of user logins to my website, but no logouts. When a user comes back the next day, another row is recorded. I want to calculate how much time each user spent on the site.
The working assumption is that the site is quite interactive, so the time between a user's first and last action can be taken as the time they spent on the site. The difficulty is defining the final action: a user could in principle sit on the same page all night, but that is unlikely without taking any action. Conceptually the calculation is clear to me, but I haven't been able to write proper code for it.
library(tidyverse)
df <- read_csv(url("https://srv-file9.gofile.io/download/eBInE9/sheet1.csv"))
df %>% group_by(`User ID`)
lag(df$Time,1)
## Now I am wondering how to cluster the items by time and calculate the lags...
Any thoughts?
Thanks!
I am assuming the CSV file you read in has a POSIXct column containing the login time; if not, you should create one first.
Here's some code that generates time differences by ID using the lag function. The first time difference in each group will be NA. I generate some random time data first, since you have not provided sample data (as you should have).
library(dplyr)
library(lubridate)
library(tibble)
random_times <- seq(as.POSIXct('2020/01/01'), as.POSIXct('2020/02/01'), by = "1 mins")

# hour() returns an integer 0-23, so compare numerically (here: drop the midnight hour)
login <- tibble(login = sort(sample(random_times[hour(random_times) > 0], 1000)),
                ID = sample(LETTERS[1:6], 1000, replace = TRUE))

login <- login %>%
  group_by(ID) %>%
  mutate(login_delayed = lag(login, 1)) %>%
  mutate(login_time = login - login_delayed)
With this output:
> login
# A tibble: 1,000 x 4
# Groups: ID [6]
login ID login_delayed login_time
<dttm> <chr> <dttm> <drtn>
1 2020-01-01 01:03:00 A NA NA mins
2 2020-01-01 01:11:00 A 2020-01-01 01:03:00 8 mins
3 2020-01-01 01:46:00 E NA NA mins
4 2020-01-01 02:33:00 E 2020-01-01 01:46:00 47 mins
5 2020-01-01 02:47:00 A 2020-01-01 01:11:00 96 mins
6 2020-01-01 10:43:00 F NA NA mins
7 2020-01-01 11:44:00 A 2020-01-01 02:47:00 537 mins
8 2020-01-01 11:57:00 A 2020-01-01 11:44:00 13 mins
9 2020-01-01 12:02:00 F 2020-01-01 10:43:00 79 mins
10 2020-01-01 12:57:00 D NA NA mins
# ... with 990 more rows
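To go from these per-login gaps to an estimate of time spent on the site, one option is to cap the gaps with an idle cutoff: a gap longer than the cutoff is treated as the start of a new visit rather than as time on the site. A minimal sketch, assuming a 30-minute cutoff (the threshold itself is an assumption you would tune for your site):
idle_cutoff <- 30  # minutes; assumed idle threshold, not derived from the data

time_on_site <- login %>%
  group_by(ID) %>%
  mutate(gap = as.numeric(difftime(login, lag(login), units = "mins"))) %>%
  summarise(est_minutes = sum(gap[!is.na(gap) & gap <= idle_cutoff]))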
I'm attempting to create reports in R using the googleAnalyticsR package. We have 4 custom dimensions defined in Google Analytics, and when I try to access 2 of them, no data is retrieved.
In the code below, when I use Dimension1 or Dimension2, everything runs smoothly, but if I try Dimension3 or Dimension4, no data is retrieved:
library(googleAnalyticsR)
sampled_data_fetch2 <- google_analytics(ga_id,
                                        date_range = c("2018-01-01", "2018-02-28"),
                                        metrics = c("sessions"),
                                        dimensions = c("date", "dimension2"))
2019-09-09 17:58:16> Downloaded [1000] rows from a total of [8079].

sampled_data_fetch2 <- google_analytics(ga_id,
                                        date_range = c("2018-01-01", "2018-02-28"),
                                        metrics = c("sessions"),
                                        dimensions = c("date", "dimension4"))
2019-09-09 17:59:16> Downloaded [0] rows from a total of [].
I suspect that Dimension3 and Dimension4 are not session-scoped custom dimensions (hence the sessions metric returns no results). The output of ga_custom_vars_list() should confirm this (look at the scope column):
cds <- ga_custom_vars_list(
accountId = account_details$accountId,
webPropertyId = account_details$webPropertyId
)
head(cds)
Output:
id accountId webPropertyId name index
1 ga:dimension1 418XXXXX UA-XXXXX-5 CD1 NAME (1) 1
2 ga:dimension2 418XXXXX UA-XXXXX-5 CD2 NAME (2) 2
3 ga:dimension3 418XXXXX UA-XXXXX-5 CD3 NAME (3) 3
4 ga:dimension4 418XXXXX UA-XXXXX-5 CD4 NAME (4) 4
5 ga:dimension5 418XXXXX UA-XXXXX-5 CD5 NAME (5) 5
6 ga:dimension6 418XXXXX UA-XXXXX-5 CD6 NAME (6) 6
scope active created updated
1 SESSION TRUE 2014-02-18 18:42:23 2017-05-25 16:34:20
2 SESSION TRUE 2015-09-17 21:11:19 2017-05-25 16:34:29
3 HIT TRUE 2016-06-01 15:12:18 2016-06-01 15:12:18
4 HIT TRUE 2016-06-01 15:12:27 2017-05-25 16:36:24
5 HIT TRUE 2016-06-01 15:12:42 2017-05-25 16:36:29
6 HIT TRUE 2016-06-02 11:27:14 2016-06-02 11:27:14
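If that is the cause, one sanity check (a sketch, assuming the same ga_id and date range as above) is to pair the hit-scoped dimension with a hit-level metric such as pageviews instead of sessions and see whether data comes back:
# hit_fetch is just an illustrative name for this check
hit_fetch <- google_analytics(ga_id,
                              date_range = c("2018-01-01", "2018-02-28"),
                              metrics = c("pageviews"),
                              dimensions = c("date", "dimension4"))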
I have a dataset with periods
library(data.table)
active <- data.table(id  = c(1, 1, 2, 3),
                     beg = as.POSIXct(c("2018-01-01 01:10:00", "2018-01-01 01:50:00",
                                        "2018-01-01 01:50:00", "2018-01-01 01:50:00")),
                     end = as.POSIXct(c("2018-01-01 01:20:00", "2018-01-01 02:00:00",
                                        "2018-01-01 02:00:00", "2018-01-01 02:00:00")))
> active
id beg end
1: 1 2018-01-01 01:10:00 2018-01-01 01:20:00
2: 1 2018-01-01 01:50:00 2018-01-01 02:00:00
3: 2 2018-01-01 01:50:00 2018-01-01 02:00:00
4: 3 2018-01-01 01:50:00 2018-01-01 02:00:00
during which an id was active. I would like to aggregate across ids and determine, for every point in
time <- data.table(seq(from=min(active$beg),to=max(active$end),by="mins"))
the number of IDs that are inactive and the average number of minutes until they become active. Ideally, the result would look like
>ans
time inactive av.time
1: 2018-01-01 01:10:00 2 30
2: 2018-01-01 01:11:00 2 29
...
50: 2018-01-01 02:00:00 0 0
I believe this can be done using data.table but I cannot figure out the syntax to get the time differences.
Using dplyr, we can join by a dummy variable to create the Cartesian product of time and active. The definitions of inactive and av.time might not be exactly what you're looking for, but it should get you started. If your data is very large, I agree that data.table will be a better way of handling this.
library(tidyverse)
time %>%
  rename(time = V1) %>%   # data.table(seq(...)) names the single column V1
  mutate(dummy = TRUE) %>%
  inner_join({
    active %>%
      mutate(dummy = TRUE)
    # join by the dummy variable to get the Cartesian product
  }, by = c("dummy" = "dummy")) %>%
  select(-dummy) %>%
  # define what makes an id inactive and the time until it becomes active
  mutate(inactive = time < beg | time > end,
         TimeUntilActive = ifelse(beg > time, difftime(beg, time, units = "mins"), NA)) %>%
  # group by time and summarise
  group_by(time) %>%
  summarise(inactive = sum(inactive),
            av.time = mean(TimeUntilActive, na.rm = TRUE))
# A tibble: 51 x 3
time inactive av.time
<dttm> <int> <dbl>
1 2018-01-01 01:10:00 3 40
2 2018-01-01 01:11:00 3 39
3 2018-01-01 01:12:00 3 38
4 2018-01-01 01:13:00 3 37
5 2018-01-01 01:14:00 3 36
6 2018-01-01 01:15:00 3 35
7 2018-01-01 01:16:00 3 34
8 2018-01-01 01:17:00 3 33
9 2018-01-01 01:18:00 3 32
10 2018-01-01 01:19:00 3 31
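Since the question asks for data.table, here is a sketch of the same approach there: a cross join on a constant key, then aggregation by time. It mirrors the definitions above and assumes the column created by data.table(seq(...)) keeps its default name V1:
library(data.table)

setnames(time, "V1", "time")  # the seq() column gets the default name V1

# cross join on a constant key k (the data.table analogue of the dummy-variable join)
cross <- active[, .(id, beg, end, k = 1)][time[, .(time, k = 1)], on = "k", allow.cartesian = TRUE]
cross[, k := NULL]

ans <- cross[, .(inactive = sum(time < beg | time > end),
                 av.time  = mean(ifelse(beg > time,
                                        as.numeric(difftime(beg, time, units = "mins")),
                                        NA_real_),
                                 na.rm = TRUE)),
             by = time]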
I was agonizing over how to phrase my question. I have a data frame of accounts, and I want to create a new column that flags whether another account with the same email address was created within 30 days of that account.
I have a table like this.
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John#gmail.com","John#gmail.com","Alex#outlook.com","Alan#yahoo.com","Stan#aol.com","Mary#outlook.com","Adam#outlook.com","Tom#aol.com","Jane#yahoo.com","John#gmail.com")
Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")
df <- data.frame(AccountNumbers,EmailAddress,Dates)
print(df)
AccountNumbers EmailAddress Dates
3748 John#gmail.com 2018-05-01
8894 John#gmail.com 2018-05-05
9923 Alex#outlook.com 2018-05-10
4502 Alan#yahoo.com 2018-05-15
7283 Stan#aol.com 2018-05-20
8012 Mary#outlook.com 2018-05-25
2938 Adam#outlook.com 2018-05-30
7485 Tom#aol.com 2018-06-01
1010 Jane#yahoo.com 2018-06-05
9877 John#gmail.com 2018-06-10
John#gmail.com appears three times. I want to flag the first two rows because they fall within 30 days of each other, but I don't want to flag the third.
AccountNumbers EmailAddress Dates DuplicateEmailFlag
3748 John#gmail.com 2018-05-01 1
8894 John#gmail.com 2018-05-05 1
9923 Alex#outlook.com 2018-05-10 0
4502 Alan#yahoo.com 2018-05-15 0
7283 Stan#aol.com 2018-05-20 0
8012 Mary#outlook.com 2018-05-25 0
2938 Adam#outlook.com 2018-05-30 0
7485 Tom#aol.com 2018-06-01 0
1010 Jane#yahoo.com 2018-06-05 0
9877 John#gmail.com 2018-06-10 0
I've been trying to use an ifelse() inside of mutate, but I don't know if it's possible to tell dplyr to only consider rows that are within 30 days of the row being considered.
Edit: To clarify, I want to look at the 30 days around each account, so that if the same email address were added exactly every 30 days, all occurrences of that email would be flagged.
This seems to work. First, I define the data frame.
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John#gmail.com","John#gmail.com","Alex#outlook.com","Alan#yahoo.com","Stan#aol.com","Mary#outlook.com","Adam#outlook.com","Tom#aol.com","Jane#yahoo.com","John#gmail.com")
Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")
df <- data.frame(number = AccountNumbers, email = EmailAddress, date = as.Date(Dates))
Next, I group by email and check if there's an entry in the preceding or following 30 days. I also replace NAs (corresponding to cases with only one entry) with 0. Finally, I ungroup.
library(dplyr)
library(tidyr)

df %>%
  group_by(email) %>%
  mutate(dupe = coalesce(date - lag(date) < 30, date - lead(date) < 30)) %>%
  mutate(dupe = replace_na(as.numeric(dupe), 0)) %>%
  ungroup()
This gives,
# # A tibble: 10 x 4
# number email date dupe
# <dbl> <fct> <date> <dbl>
# 1 3748 John#gmail.com 2018-05-01 1
# 2 8894 John#gmail.com 2018-05-05 1
# 3 9923 Alex#outlook.com 2018-05-10 0
# 4 4502 Alan#yahoo.com 2018-05-15 0
# 5 7283 Stan#aol.com 2018-05-20 0
# 6 8012 Mary#outlook.com 2018-05-25 0
# 7 2938 Adam#outlook.com 2018-05-30 0
# 8 7485 Tom#aol.com 2018-06-01 0
# 9 1010 Jane#yahoo.com 2018-06-05 0
# 10 9877 John#gmail.com 2018-06-10 0
as required.
Edit: This makes the implicit assumption that your data are sorted by date. If not, you'd need to add an extra step to do so.
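A sketch of that extra step, placed before the grouping so that lag() and lead() compare consecutive dates within each email group:
df <- df %>% arrange(date)  # sort chronologically before grouping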
I think this gets at what you want:
library(dplyr)

df %>%
  group_by(EmailAddress) %>%
  mutate(helper = cumsum(coalesce(if_else(difftime(Dates, lag(Dates), units = "days") <= 30, 0, 1), 0))) %>%
  group_by(EmailAddress, helper) %>%
  mutate(DuplicateEmailFlag = (n() >= 2) * 1) %>%
  ungroup() %>%
  select(-helper)
# A tibble: 10 x 4
AccountNumbers EmailAddress Dates DuplicateEmailFlag
<dbl> <chr> <date> <dbl>
1 3748 John#gmail.com 2018-05-01 1
2 8894 John#gmail.com 2018-05-05 1
3 9923 Alex#outlook.com 2018-05-10 0
4 4502 Alan#yahoo.com 2018-05-15 0
5 7283 Stan#aol.com 2018-05-20 0
6 8012 Mary#outlook.com 2018-05-25 0
7 2938 Adam#outlook.com 2018-05-30 0
8 7485 Tom#aol.com 2018-06-01 0
9 1010 Jane#yahoo.com 2018-06-05 0
10 9877 John#gmail.com 2018-06-10 0
Note:
I think @Lyngbakr's solution is better for the circumstances in your question. Mine would be more appropriate if the size of the duplicate group might change (e.g., if you wanted to check for 3 or 4 entries within 30 days of each other, rather than 2).
Slightly modified data (Dates converted to the Date class):
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John#gmail.com","John#gmail.com","Alex#outlook.com","Alan#yahoo.com","Stan#aol.com","Mary#outlook.com","Adam#outlook.com","Tom#aol.com","Jane#yahoo.com","John#gmail.com")
Dates <- as.Date(c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10"))
df <- data.frame(AccountNumbers,EmailAddress,Dates, stringsAsFactors = FALSE)
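For instance, to require at least three accounts with the same email within 30 days of each other (a hypothetical variation, not what the question asks for), only the flag threshold in the pipeline changes:
df %>%
  group_by(EmailAddress) %>%
  mutate(helper = cumsum(coalesce(if_else(difftime(Dates, lag(Dates), units = "days") <= 30, 0, 1), 0))) %>%
  group_by(EmailAddress, helper) %>%
  mutate(DuplicateEmailFlag = (n() >= 3) * 1) %>%  # require runs of at least 3 accounts
  ungroup() %>%
  select(-helper)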
I'm new here, so I apologize if I miss any conventions.
I have a ~2000-row dataset of unique cases occurring over a three-year period. Each case has a start date and an end date. I want to get a new data frame that shows how many cases occur in each week of this three-year period.
The structure of the dataset I have is like this:
ID Start_Date End_Date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03
This problem could perhaps be solved more easily with the sqldf package, but I decided to stick with dplyr.
The approach:
library(dplyr)
library(lubridate)

# First create a data frame holding all weeks from a chosen start date to an end date,
# here 2015-01-01 to 2017-12-31.
df_week <- data.frame(weekStart = seq(floor_date(as.Date("2015-01-01"), "week"),
                                      as.Date("2017-12-31"), by = 7))

df_week <- df_week %>%
  mutate(weekEnd = weekStart + 7,
         weekNum = format(weekStart, "%V-%Y"),
         dummy = TRUE)
# The dummy column is only for joining purposes.
# Header looks like
#> head(df_week)
# weekStart weekEnd weekNum dummy
#1 2014-12-28 2015-01-04 52-2014 TRUE
#2 2015-01-04 2015-01-11 01-2015 TRUE
#3 2015-01-11 2015-01-18 02-2015 TRUE
#4 2015-01-18 2015-01-25 03-2015 TRUE
#5 2015-01-25 2015-02-01 04-2015 TRUE
#6 2015-02-01 2015-02-08 05-2015 TRUE
# Prepare the data as mentioned in OP
df <- read.table(text = "ID Start_Date End_Date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03", header = TRUE, stringsAsFactors = FALSE)
df$Start_Date <- as.Date(df$Start_Date)
df$End_Date <- as.Date(df$End_Date)
df <- df %>% mutate(dummy = TRUE) # just for joining
# Use dplyr to join, filter and then group on week to find number of cases
# in each week
df_week %>%
left_join(df, by = "dummy") %>%
select(-dummy) %>%
filter((weekStart >= Start_Date & weekStart <= End_Date) |
(weekEnd >= Start_Date & weekEnd <= End_Date)) %>%
group_by(weekStart, weekEnd, weekNum) %>%
summarise(cases = n())
# Result
# weekStart weekEnd weekNum cases
# <date> <date> <chr> <int>
# 1 2014-12-28 2015-01-04 52-2014 1
# 2 2015-01-04 2015-01-11 01-2015 3
# 3 2015-01-11 2015-01-18 02-2015 5
# 4 2015-01-18 2015-01-25 03-2015 8
# 5 2015-01-25 2015-02-01 04-2015 8
# 6 2015-02-01 2015-02-08 05-2015 8
# 7 2015-02-08 2015-02-15 06-2015 8
# 8 2015-02-15 2015-02-22 07-2015 8
# 9 2015-02-22 2015-03-01 08-2015 8
#10 2015-03-01 2015-03-08 09-2015 8
# ... with 139 more rows
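Note that the filter above keeps a week only if weekStart or weekEnd falls inside a case, so a case that both starts and ends strictly inside a single week would be missed. A sketch of a more general interval-overlap test (two intervals overlap exactly when each starts on or before the other ends):
df_week %>%
  left_join(df, by = "dummy") %>%
  select(-dummy) %>%
  # keep a week whenever the case interval and the week interval intersect
  filter(Start_Date <= weekEnd & weekStart <= End_Date) %>%
  group_by(weekStart, weekEnd, weekNum) %>%
  summarise(cases = n())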
Welcome to SO!
Before tackling the problem, make sure the required packages are installed by running
install.packages(c("readr", "tidyr", "dplyr", "lubridate"))
if you haven't installed them yet.
I'll present a modern R solution next; those packages do most of the heavy lifting.
This is one way to solve it:
library(readr)
library(dplyr)
library(lubridate)
raw_data <- 'id start_date end_date
1 2015-01-04 2017-11-02
2 2015-01-05 2015-10-26
3 2015-01-07 2015-03-04
4 2015-01-12 2016-05-17
5 2015-01-15 2015-04-08
6 2015-01-21 2016-07-31
7 2015-01-21 2015-07-16
8 2015-01-22 2015-03-03'
curated_data <- read_table(raw_data) %>% # the pasted data is whitespace-separated
  mutate(start_date = as.Date(start_date)) %>% # convert to Date, assuming yyyy-mm-dd format
  mutate(weeks_lapse = as.integer((start_date - min(start_date)) / dweeks(1))) # whole weeks since the earliest start date

curated_data %>%
  group_by(weeks_lapse) %>% # group by week
  summarise(cases_per_week = n()) # count the cases in each week
And the solution is:
# A tibble: 3 x 2
weeks_lapse cases_per_week
<int> <int>
1 0 3
2 1 2
3 2 3
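If you also want weeks with zero new cases to appear in the result (this small sample has none, but a larger dataset might), here is a sketch that fills the gaps with tidyr, which is in the install line above:
library(tidyr)

curated_data %>%
  count(weeks_lapse, name = "cases_per_week") %>%     # same counts as above
  complete(weeks_lapse = full_seq(weeks_lapse, 1),    # add any missing week numbers
           fill = list(cases_per_week = 0))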