I have a nice dataset that includes user logins to my website, but no user logouts. A user can come back the next day, and another row is registered. I'm looking to calculate how much time each user spent on the site.
The working assumption is that the site is quite interactive, so the time between a user's first and last action can be taken as the time the user was on the site. The problem is defining the final action: a user could, for example, sit on the same page all night without taking any action, but this is unlikely. It is very clear to me how to calculate this by hand, but I haven't been able to write proper code for it.
library(tidyverse)
df <- read_csv(url("https://srv-file9.gofile.io/download/eBInE9/sheet1.csv"))
df %>% group_by(`User ID`)
lag(df$Time,1)
## Now I am wondering how to cluster the items by time and calculate the lags...
Any thoughts?
Thanks!
I am assuming that the csv file you read has a POSIXct column containing the login time. If not, you should create one first.
Here's some code which generates time differences using the lag function by ID. The first time difference for each group will be NA. I generate some random time data first, as you have not provided sample data (as you should have).
library(dplyr)
library(lubridate)
library(tibble)
random_times <- seq(as.POSIXct('2020/01/01'), as.POSIXct('2020/02/01'), by = "1 mins")
# hour() returns an integer 0-23, so compare numerically rather than to strings
login <- tibble(login = sort(sample(random_times[hour(random_times) > 0 & hour(random_times) < 23], 1000)),
                ID = sample(LETTERS[1:6], 1000, replace = TRUE))
login <- login %>%
  group_by(ID) %>%
  mutate(login_delayed = lag(login, 1)) %>%
  mutate(login_time = login - login_delayed)
With this output:
> login
# A tibble: 1,000 x 4
# Groups: ID [6]
login ID login_delayed login_time
<dttm> <chr> <dttm> <drtn>
1 2020-01-01 01:03:00 A NA NA mins
2 2020-01-01 01:11:00 A 2020-01-01 01:03:00 8 mins
3 2020-01-01 01:46:00 E NA NA mins
4 2020-01-01 02:33:00 E 2020-01-01 01:46:00 47 mins
5 2020-01-01 02:47:00 A 2020-01-01 01:11:00 96 mins
6 2020-01-01 10:43:00 F NA NA mins
7 2020-01-01 11:44:00 A 2020-01-01 02:47:00 537 mins
8 2020-01-01 11:57:00 A 2020-01-01 11:44:00 13 mins
9 2020-01-01 12:02:00 F 2020-01-01 10:43:00 79 mins
10 2020-01-01 12:57:00 D NA NA mins
# ... with 990 more rows
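To get from these lagged differences to the time each user actually spent on the site, one approach (a sketch, not part of the original answer) is to treat any gap above a cutoff as the start of a new session, then measure each session from its first to its last action. The 30-minute cutoff below is an assumption you would tune to your data:
# a sketch: gaps longer than 30 minutes (an assumed cutoff) start a new session
sessions <- login %>%
  group_by(ID) %>%
  arrange(login, .by_group = TRUE) %>%
  mutate(gap = as.numeric(difftime(login, lag(login), units = "mins")),
         new_session = is.na(gap) | gap > 30,
         session = cumsum(new_session)) %>%
  group_by(ID, session) %>%
  summarise(time_on_site = difftime(max(login), min(login), units = "mins"),
            .groups = "drop")
# total time per user across all sessions
sessions %>% group_by(ID) %>% summarise(total_time = sum(time_on_site))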
I have a df in R with numerous records in the format below, with 'arrival_time' values covering a 12-hour period.
id      arrival_time         wait_time_value
1       2020-02-20 12:02:00  10
2       2020-02-20 12:04:00  5
99900   2020-02-20 23:47:00  8
10000   2020-02-20 23:59:00  21
I would like to create a new df that has a row for each 15 minute slot of the arrival time period and the wait_time_value of the record with the earliest arrival time in that slot. So, in the above example, the first and last row of the new df would look like:
id   period_start         wait_time_value
1    2020-02-20 12:00:00  10
48   2020-02-20 23:45:00  8
I have used the code below to get the mean wait time of all records in each 15-minute range, but I'm not sure how to select the value for the earliest record?
library(xts)  # for align.time()
df$period_start <- align.time(df$arrival_time - 899, n = 60 * 15)
avgwait_df <- aggregate(wait_time_value ~ period_start, df, mean)
Use DataFrame.resample with GroupBy.first, drop the NaN rows, and convert back to a DataFrame:
import pandas as pd

df['arrival_time'] = pd.to_datetime(df['arrival_time'])
df = (df.resample('15Min', on='arrival_time')['wait_time_value']
        .first()
        .dropna()
        .reset_index(name='wait_time_value'))
print(df)
arrival_time wait_time_value
0 2020-02-20 12:00:00 10.0
1 2020-02-20 23:45:00 8.0
Using dplyr:
df %>%
  group_by(period_start) %>%
  summarise(wait_time = wait_time_value[which.min(arrival_time)])  # value of the earliest arrival in each slot
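For a complete pipeline in R, one could combine align.time() from the question with arrange() and slice() (a sketch; it assumes df has a POSIXct arrival_time column and the id column from the example):
library(xts)    # align.time() comes from xts
library(dplyr)

df$period_start <- align.time(df$arrival_time - 899, n = 60 * 15)

df %>%
  arrange(arrival_time) %>%
  group_by(period_start) %>%
  slice(1) %>%             # the first row per slot is the earliest arrival
  ungroup() %>%
  select(id, period_start, wait_time_value)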
I have a dataset with a rather complicated problem. It includes 600,000 observations. The main issue is related to the data collection process. As an example, I provide the following dataset, which has a similar structure to the real dataset I have in hand:
df <- data.frame(row_number = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                 date = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03", "2020-01-03"),
                 time = c("01:00:00", "09:00:00", "17:00:00", "09:00:00", "01:00:00", "17:00:00", "01:00:00", "NA", "09:00:00"),
                 order = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                 value = c(10, 20, 30, 40, 10, 20, 30, NA, 50))
I know that each day the data was recorded 3 times (the order variable): the first recording of each day was at 01:00:00, the second at 09:00:00, and the last at 17:00:00.
However, the person who collected the data made mistakes. For instance, in row_number 4 the time is supposed to be 01:00:00, but the data collector recorded 09:00:00.
Also, in row_number 8 I expect the time to be 09:00:00; however, since no information was recorded in value, the person skipped that row and instead recorded 09:00:00 at order number 3, where the time is expected to be 17:00:00.
Given the fact that we know the order of the data collection, I was wondering if you have any solution to deal with such an issue in the dataset.
Thanks in advance for your time.
Create a group for every 3 rows and assign the times in the order we want:
library(dplyr)
df %>%
  group_by(grp = ceiling(row_number / 3)) %>%
  mutate(time = c('01:00:00', '09:00:00', '17:00:00')) %>%
  ungroup() %>%
  select(-grp)
# row_number date time order value
# <dbl> <chr> <chr> <dbl> <dbl>
#1 1 2020-01-01 01:00:00 1 10
#2 2 2020-01-01 09:00:00 2 20
#3 3 2020-01-01 17:00:00 3 30
#4 4 2020-01-02 01:00:00 1 40
#5 5 2020-01-02 09:00:00 2 10
#6 6 2020-01-02 17:00:00 3 20
#7 7 2020-01-03 01:00:00 1 30
#8 8 2020-01-03 09:00:00 2 NA
#9 9 2020-01-03 17:00:00 3 50
time <- c("01:00:00","09:00:00","17:00:00")
rep(time, 200000)
The rep() function repeats a vector as many times as you want. This lets you create the 3 time slots and repeat them across your 600,000 observations, eliminating the human error.
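Applied to the example df from the question, that looks like this (a sketch; length.out guards against a row count that is not an exact multiple of 3):
time_slots <- c("01:00:00", "09:00:00", "17:00:00")
# recycle the three slots down the whole (already ordered) data frame
df$time <- rep(time_slots, length.out = nrow(df))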
I have a dataset of temperature values taken at specific datetimes across five locations. For whatever reason, the readings are sometimes every hour and sometimes every four hours. Another issue is that when the time changed as a result of daylight saving, the readings are off by one hour. I am interested in the readings taken every four hours and would like to subset these by day and night to ultimately get daily and nightly mean temperatures.
To summarise, the readings I am interested in are either:
0800, 1200, 1600 = day
2000, 0000, 0400 = night
Recordings between 0800-1600 and 2000-0400 each day should be averaged.
During daylight savings, the equivalent times are:
0900, 1300, 1700 = day
2100, 0100, 0500 = night
Recordings between 0900-1700 and 2100-0500 each day should be averaged.
In the process, I am hoping to subset by site.
There are also some NA values or blank cells which should be ignored.
So far, I tried to subset by one hour of interest just to see if it worked, but haven't got any further than that. Any tips on how to subset by a series of times of interest? Thanks!
temperature <- read.csv("SeaTemperatureData.csv",
                        stringsAsFactors = FALSE)
temperature <- subset(temperature, select = -c(X))  # remove last column that contains comments, not needed
temperature$Date.Time <- as.POSIXct(temperature$Date.Time,
                                    format = "%d/%m/%Y %H:%M",
                                    tz = "Pacific/Auckland")
#subset data by time, we only want to include temperatures recorded at certain times
temperature.goat <- subset(temperature, Date.Time==c('01:00:00'), select=c("Goat.Island"))
Date.Time Goat.Island Tawharanui Kawau Tiritiri Noises
1 2019-06-10 16:00:00 16.820 16.892 16.749 16.677 15.819
2 2019-06-10 20:00:00 16.773 16.844 16.582 16.654 15.796
3 2019-06-11 00:00:00 16.749 16.820 16.749 16.606 15.819
4 2019-06-11 04:00:00 16.487 16.796 16.654 16.558 15.796
5 2019-06-11 08:00:00 16.582 16.749 16.487 16.463 15.867
6 2019-06-11 12:00:00 16.630 16.773 16.725 16.654 15.867
One possible solution is to extract the hours from your DateTime variable and then filter for the particular hours of interest.
Here is a fake example over 4 days:
library(lubridate)
df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
                 Value = sample(1:100, 97, replace = TRUE))
DateTime Value
1 2020-02-01 00:00:00 99
2 2020-02-01 01:00:00 51
3 2020-02-01 02:00:00 44
4 2020-02-01 03:00:00 49
5 2020-02-01 04:00:00 60
6 2020-02-01 05:00:00 56
Now, you can extract hours with the hour() function from lubridate and subset for the desired hour:
library(lubridate)
subset(df, hour(DateTime) == 5)
DateTime Value
6 2020-02-01 05:00:00 56
30 2020-02-02 05:00:00 31
54 2020-02-03 05:00:00 65
78 2020-02-04 05:00:00 80
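Since the question asks for a series of hours rather than a single one, %in% extends the same idea to several hours at once (a sketch using the standard-time hours from the question):
day_hours   <- c(8, 12, 16)
night_hours <- c(20, 0, 4)

day_df   <- subset(df, hour(DateTime) %in% day_hours)
night_df <- subset(df, hour(DateTime) %in% night_hours)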
EDIT: Getting the mean of each site per subset of hours
Per OP's request in the comments, the goal is to calculate the mean of the values for the various sites over different periods of time.
Basically, you want two periods per day, one from 8:00 to 17:00 and the other from 18:00 to 7:00.
Here, a more elaborated example based on the previous one:
df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
                 Site1 = sample(1:100, 97, replace = TRUE),
                 Site2 = sample(1:100, 97, replace = TRUE))
DateTime Site1 Site2
1 2020-02-01 00:00:00 100 6
2 2020-02-01 01:00:00 9 49
3 2020-02-01 02:00:00 86 12
4 2020-02-01 03:00:00 34 55
5 2020-02-01 04:00:00 76 29
6 2020-02-01 05:00:00 41 1
....
So, now you can label each time point as day or night, group by this category for each day, and calculate the mean of each individual site using summarise_at:
library(lubridate)
library(dplyr)
df %>%
  mutate(Date = date(DateTime),
         Hour = hour(DateTime),
         Category = ifelse(between(hour(DateTime), 8, 17), "Daily", "Night")) %>%
  group_by(Date, Category) %>%
  summarise_at(vars(c(Site1, Site2)), ~ mean(., na.rm = TRUE))
# A tibble: 9 x 4
# Groups: Date [5]
Date Category Site1 Site2
<date> <chr> <dbl> <dbl>
1 2020-02-01 Daily 56.9 63.1
2 2020-02-01 Night 58.9 46.6
3 2020-02-02 Daily 54.5 47.6
4 2020-02-02 Night 36.9 41.7
5 2020-02-03 Daily 42.3 56.9
6 2020-02-03 Night 44.1 55.9
7 2020-02-04 Daily 54.3 50.4
8 2020-02-04 Night 54.8 34.3
9 2020-02-05 Night 75 16
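Note that the one-hour daylight-saving shift from the question is not handled above. One possible fix, assuming the timestamps were parsed with tz = "Pacific/Auckland" as in the question's code, is to convert them to a fixed-offset zone before extracting hours, so readings taken at fixed real-world intervals always land on the same clock hours:
library(lubridate)
# "Etc/GMT-12" is fixed at UTC+12 (NZ standard time, no DST); after conversion
# the 0900/1300/1700 daylight-saving readings line up with 0800/1200/1600
temperature$Date.Time <- with_tz(temperature$Date.Time, tzone = "Etc/GMT-12")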
Does this answer your question?
I have a dataset with periods
active <- data.table(id  = c(1, 1, 2, 3),
                     beg = as.POSIXct(c("2018-01-01 01:10:00", "2018-01-01 01:50:00", "2018-01-01 01:50:00", "2018-01-01 01:50:00")),
                     end = as.POSIXct(c("2018-01-01 01:20:00", "2018-01-01 02:00:00", "2018-01-01 02:00:00", "2018-01-01 02:00:00")))
> active
id beg end
1: 1 2018-01-01 01:10:00 2018-01-01 01:20:00
2: 1 2018-01-01 01:50:00 2018-01-01 02:00:00
3: 2 2018-01-01 01:50:00 2018-01-01 02:00:00
4: 3 2018-01-01 01:50:00 2018-01-01 02:00:00
during which an id was active. I would like to aggregate across ids and determine for every point in
time <- data.table(time = seq(from = min(active$beg), to = max(active$end), by = "mins"))
the number of IDs that are inactive and the average number of minutes until they get active. That is, ideally, the table looks like
>ans
time inactive av.time
1: 2018-01-01 01:10:00 2 30
2: 2018-01-01 01:11:00 2 29
...
50: 2018-01-01 02:00:00 0 0
I believe this can be done using data.table but I cannot figure out the syntax to get the time differences.
Using dplyr, we can join by a dummy variable to create the Cartesian product of time and active. The definitions of inactive and av.time might not be exactly what you're looking for, but it should get you started. If your data is very large, I agree that data.table will be a better way of handling this.
library(tidyverse)
time %>%
  mutate(dummy = TRUE) %>%
  #join by the dummy variable to get the Cartesian product
  inner_join(active %>% mutate(dummy = TRUE), by = "dummy") %>%
  select(-dummy) %>%
  #define what makes an id inactive and the time until it becomes active
  mutate(inactive = time < beg | time > end,
         TimeUntilActive = ifelse(beg > time, difftime(beg, time, units = "mins"), NA)) %>%
  #group by time and summarise
  group_by(time) %>%
  summarise(inactive = sum(inactive),
            av.time = mean(TimeUntilActive, na.rm = TRUE))
# A tibble: 51 x 3
time inactive av.time
<dttm> <int> <dbl>
1 2018-01-01 01:10:00 3 40
2 2018-01-01 01:11:00 3 39
3 2018-01-01 01:12:00 3 38
4 2018-01-01 01:13:00 3 37
5 2018-01-01 01:14:00 3 36
6 2018-01-01 01:15:00 3 35
7 2018-01-01 01:16:00 3 34
8 2018-01-01 01:17:00 3 33
9 2018-01-01 01:18:00 3 32
10 2018-01-01 01:19:00 3 31
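For completeness, here is a rough data.table sketch of the same Cartesian-product idea, keeping the definitions of inactive and av.time used above:
library(data.table)
tsq <- seq(from = min(active$beg), to = max(active$end), by = "mins")
# expand every active period over the full minute grid (the Cartesian product)
grid <- active[, .(time = tsq), by = .(id, beg, end)]
ans  <- grid[, .(inactive = sum(time < beg | time > end),
                 av.time  = mean(as.numeric(difftime(beg, time, units = "mins"))[beg > time])),
             by = time]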
I want to convert the seconds in "TOTALSEC" to a 24-hr time of day (i.e. 14:32:40). However, it needs to be based on the starting time in "Time". So, for example, if I have "12:00:00" as the start time under Time and "3630" under TOTALSEC, I want the converted time to read 13:00:30. For context, Time is the start time of an audio recorder, and TOTALSEC is the number of seconds elapsed since the start of recording when the audio clip occurs. So in the above example, I started the recorder at 12:00:00, and the audio clip occurred 3630 seconds after it started recording.
> head(stacksubset)
# A tibble: 6 x 3
TOTALSEC Time Date
<dbl> <chr> <chr>
1 10613.67 15:53:50 2017-05-30
2 56404.35 17:29:44 2017-05-29
3 20480.54 16:16:12 2017-06-13
4 60613.47 15:53:50 2017-05-30
5 80034.30 16:16:12 2017-06-02
6 50710.37 16:16:12 2017-05-27
I was thinking it might have to be some sort of loop (there are 2,000 additional rows after the 6 above).
Data
data <- data.frame(TOTALSEC = c(10613.67, 56404.35, 20480.54, 60613.47, 80034.30, 50710.37),
Time = c("15:53:50", "17:29:44", "16:16:12", "15:53:50", "16:16:12", "16:16:12"))
Code
So, because the date will actually matter here, each of the times is pasted to a date (today's date) and then converted to a datetime in the timezone of your system.
Then the two are added together.
library(lubridate)
library(dplyr)
data %>% mutate(Time = as_datetime(paste(Sys.Date(), Time), tz = Sys.timezone()),
                end_time = TOTALSEC + Time)
Result
TOTALSEC Time end_time
1 10613.67 2017-11-23 15:53:50 2017-11-23 18:50:43
2 56404.35 2017-11-23 17:29:44 2017-11-24 09:09:48
3 20480.54 2017-11-23 16:16:12 2017-11-23 21:57:32
4 60613.47 2017-11-23 15:53:50 2017-11-24 08:44:03
5 80034.30 2017-11-23 16:16:12 2017-11-24 14:30:06
6 50710.37 2017-11-23 16:16:12 2017-11-24 06:21:22
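If you want just the 24-hour clock time rather than the full datetime, format() will strip the date. A small follow-up sketch, assuming the pipeline above is assigned to a variable (here called result):
result <- data %>% mutate(Time = as_datetime(paste(Sys.Date(), Time), tz = Sys.timezone()),
                          end_time = TOTALSEC + Time)
result$clip_time <- format(result$end_time, "%H:%M:%S")   # e.g. "18:50:43"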