Dealing with mistakes in recording the dataset in R

I have a dataset with a rather complicated problem: it includes 600,000 observations, and the main issue stems from the data collection process. As an example, the following data frame has a structure similar to the real dataset I have in hand:
df <- data.frame(row_number = c(1,2,3,4,5,6,7,8,9),
                 date = c("2020-01-01","2020-01-01","2020-01-01","2020-01-02","2020-01-02","2020-01-02","2020-01-03","2020-01-03","2020-01-03"),
                 time = c("01:00:00","09:00:00","17:00:00","09:00:00","01:00:00","17:00:00","01:00:00",NA,"09:00:00"),
                 order = c(1,2,3,1,2,3,1,2,3),
                 value = c(10,20,30,40,10,20,30,NA,50))
I know that on each day the data was recorded 3 times (the order variable): the first recording of each day was at 01:00:00, the second at 09:00:00, and the last at 17:00:00.
However, the person who collected the data made mistakes. For instance, in row_number 4 the time is supposed to be 01:00:00, yet the data collector recorded 09:00:00.
Also, in row number 8 I expect the time to be 09:00:00; however, since no value was recorded, the person left that row's time empty and instead recorded 09:00:00 at order number 3, where the expected time is 17:00:00.
Given that we know the order of data collection, I was wondering if you have any solution to deal with such an issue in the dataset.
Thanks in advance for your time.

Create groups of 3 rows and assign the times in the order we want:
library(dplyr)
df %>%
  group_by(grp = ceiling(row_number / 3)) %>%
  mutate(time = c('01:00:00', '09:00:00', '17:00:00')) %>%
  ungroup() %>%
  select(-grp)
# row_number date time order value
# <dbl> <chr> <chr> <dbl> <dbl>
#1 1 2020-01-01 01:00:00 1 10
#2 2 2020-01-01 09:00:00 2 20
#3 3 2020-01-01 17:00:00 3 30
#4 4 2020-01-02 01:00:00 1 40
#5 5 2020-01-02 09:00:00 2 10
#6 6 2020-01-02 17:00:00 3 20
#7 7 2020-01-03 01:00:00 1 30
#8 8 2020-01-03 09:00:00 2 NA
#9 9 2020-01-03 17:00:00 3 50
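Alternatively, since the question states that the order column is reliable, a variant that indexes the slot vector by order directly keeps working even if row_number has gaps. A sketch on a trimmed version of the example data:

```r
library(dplyr)

df <- data.frame(row_number = 1:9,
                 date = rep(c("2020-01-01", "2020-01-02", "2020-01-03"), each = 3),
                 order = rep(1:3, 3))

slots <- c("01:00:00", "09:00:00", "17:00:00")

# look up the expected slot for each recorded order value
df2 <- df %>% mutate(time = slots[order])
```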

time <- c("01:00:00","09:00:00","17:00:00")
rep(time, 200000)
The rep() function repeats a vector as many times as you want. This lets you create the 3 time slots once and repeat them across your 600,000 observations, eliminating the human error.
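A sketch of that idea, assuming the data are already sorted by date and order and that every day contributes exactly 3 rows (otherwise rep() will silently misalign):

```r
times <- c("01:00:00", "09:00:00", "17:00:00")
n <- 600000                        # number of observations

# rep() only lines up if n is an exact multiple of the slot count
stopifnot(n %% length(times) == 0)
full <- rep(times, times = n / length(times))
head(full, 4)  # "01:00:00" "09:00:00" "17:00:00" "01:00:00"
```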

Related

R create week numbers with specified start date

This seems like it should be straightforward but I cannot find a way to do this.
I have a sales cycle that begins around August 1 of each year and need to sum sales by week number. I need to create a "week number" field where week #1 begins on a date that I specify. Thus far I have looked at lubridate, base R, and strftime, and I cannot find a way to change the "start" date from 01/01/YYYY to something else.
Solution needs to let me specify the start date and iterate week numbers as 7 days from the start date. The actual start date doesn't always occur on a Sunday or Monday.
EG Data Frame
eg_data <- data.frame(
cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018", "cycle2017", "cycle2017", "cycle2016", "cycle2016"),
dates = as.POSIXct(c("2019-08-01" , "2019-08-10" ,"2018-07-31" , "2018-08-16", "2017-08-03" , "2017-08-14" , "2016-08-05", "2016-08-29")),
week_n = c("1", "2","1","3","1","2","1","4"))
I'd like the result to look like what is above - it would take the min date for each cycle and use that as a starting point, then iterate up week numbers based on a given date's distance from the cycle starting date.
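For reference, the spec boils down to integer division of each date's day offset within its cycle. A base-R sketch, using the first four rows of eg_data with dates coerced to Date for day arithmetic:

```r
# first four rows of eg_data, dates as Date for day arithmetic
eg <- data.frame(
  cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018"),
  dates = as.Date(c("2019-08-01", "2019-08-10", "2018-07-31", "2018-08-16"))
)

# per cycle: days since the cycle's earliest date, integer-divided by 7
eg$week_n <- ave(as.numeric(eg$dates), eg$cycle,
                 FUN = function(d) (d - min(d)) %/% 7 + 1)
```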
This almost works. (Doing date arithmetic gives us durations in seconds: there may be a smoother way to convert with lubridate tools?)
secs_per_week <- 60*60*24*7
(eg_data
%>% group_by(cycle)
%>% mutate(nw=1+as.numeric(round((dates-min(dates))/secs_per_week)))
)
The results don't match for 2017, because there is an 11-day gap between the first and second observation ...
cycle dates week_n nw
<chr> <dttm> <chr> <dbl>
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 3
If someone has a better answer please post it, but this works:
Take the dataframe in the example, eg_data -
eg_data %>%
  group_by(cycle) %>%
  mutate(cycle_start = as.Date(min(dates)),
         days_diff = as.Date(dates) - cycle_start,
         week_n = days_diff / 7,
         week_n_whole = ceiling(days_diff / 7)) -> eg_data_check
(First time I've answered my own question)
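One caveat with ceiling(days_diff / 7): the cycle-start row (days_diff = 0) lands in "week 0". A tiny sketch of the usual fix, flooring and adding 1, with hypothetical day offsets:

```r
days_diff <- c(0, 9, 16)      # hypothetical days since cycle start
ceiling(days_diff / 7)        # 0 2 3 -- the start date lands in "week 0"
floor(days_diff / 7) + 1      # 1 2 3 -- the start date is week 1
```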
library("lubridate")
eg_data %>%
as_tibble() %>%
group_by(cycle) %>%
mutate(new_week = week(dates)-31)
This doesn't quite work the same as your example, but perhaps with some fiddling based on your domain experience you could adapt it:
library(lubridate)
library(stringr)  # for str_sub()
eg_data %>%
  mutate(aug1 = ymd_h(paste(str_sub(cycle, start = -4), "080100")),
         week_n2 = ceiling((dates - aug1) / ddays(7)))
EDIT: If you have specific known dates for the start of each cycle, it might be helpful to join those dates to your data for the calc:
library(lubridate)
cycle_starts <- data.frame(
cycle = c("cycle2019", "cycle2018", "cycle2017", "cycle2016"),
start_date = ymd_h(c(2019080100, 2018072500, 2017080500, 2016071300))
)
eg_data %>%
left_join(cycle_starts) %>%
mutate(week_n2 = ceiling((dates - start_date)/ddays(7)))
#Joining, by = "cycle"
# cycle dates week_n start_date week_n2
#1 cycle2019 2019-08-01 1 2019-08-01 1
#2 cycle2019 2019-08-10 2 2019-08-01 2
#3 cycle2018 2018-07-31 1 2018-07-25 1
#4 cycle2018 2018-08-16 3 2018-07-25 4
#5 cycle2017 2017-08-03 1 2017-08-05 0
#6 cycle2017 2017-08-14 2 2017-08-05 2
#7 cycle2016 2016-08-05 1 2016-07-13 4
#8 cycle2016 2016-08-29 4 2016-07-13 7
This is a concise solution using lubridate:
library(lubridate)
eg_data %>%
group_by(cycle) %>%
mutate(new_week = floor(as.period(ymd(dates) - ymd(min(dates))) / weeks()) + 1)
# A tibble: 8 x 4
# Groups: cycle [4]
cycle dates week_n new_week
<chr> <dttm> <chr> <dbl>
1 cycle2019 2019-08-01 00:00:00 1 1
2 cycle2019 2019-08-10 00:00:00 2 2
3 cycle2018 2018-07-31 00:00:00 1 1
4 cycle2018 2018-08-16 00:00:00 3 3
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 2
7 cycle2016 2016-08-05 00:00:00 1 1
8 cycle2016 2016-08-29 00:00:00 4 4

Calculate time of session from logs in R

I have a nice dataset that includes user logins to my website, but no user logouts. A user can access the site again the next day, and then another line is registered. I'm looking to calculate how much time each user spent on the site.
The working assumption is that the site is quite interactive, and one can assume that the time between the first and last action is the time the user was on the site. The problem starts with defining the final action: for example, a user could also stay on the same page all night, but that is unlikely (without taking any action). It is very clear to me how to calculate this conceptually, but I haven't been able to find proper code.
library(tidyverse)
df <- read_csv(url("https://srv-file9.gofile.io/download/eBInE9/sheet1.csv"))
df %>% group_by(`User ID`)
lag(df$Time,1)
## Now I am wondering how to cluster the items by time and calculate lags...
Any thoughts?
Thanks!
I am assuming that the csv file you read has a POSIXct column which contains the login time. If not, you should make sure this exists.
Here's some code which generates time differences using the lag function by ID. The first time difference for each group will be NA. I generate some random time data first as you have not provided sample data (as you should have).
library(dplyr)
library(lubridate)
library(tibble)
random_times <- seq(as.POSIXct('2020/01/01'), as.POSIXct('2020/02/01'), by="1 mins")
login <- tibble(login = sort(sample(random_times[hour(random_times) > 0], 1000)),
                ID = sample(LETTERS[1:6], 1000, replace = TRUE))
login <- login %>%
group_by(ID) %>%
mutate(login_delayed = lag(login, 1)) %>%
mutate(login_time = login - login_delayed)
With this output:
> login
# A tibble: 1,000 x 4
# Groups: ID [6]
login ID login_delayed login_time
<dttm> <chr> <dttm> <drtn>
1 2020-01-01 01:03:00 A NA NA mins
2 2020-01-01 01:11:00 A 2020-01-01 01:03:00 8 mins
3 2020-01-01 01:46:00 E NA NA mins
4 2020-01-01 02:33:00 E 2020-01-01 01:46:00 47 mins
5 2020-01-01 02:47:00 A 2020-01-01 01:11:00 96 mins
6 2020-01-01 10:43:00 F NA NA mins
7 2020-01-01 11:44:00 A 2020-01-01 02:47:00 537 mins
8 2020-01-01 11:57:00 A 2020-01-01 11:44:00 13 mins
9 2020-01-01 12:02:00 F 2020-01-01 10:43:00 79 mins
10 2020-01-01 12:57:00 D NA NA mins
# ... with 990 more rows
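Building on those lag differences, a hedged sketch of the full calculation: start a new session whenever the gap to the previous login exceeds a threshold (30 minutes here, an assumed cutoff), then sum the first-to-last span within each session:

```r
library(dplyr)

logins <- data.frame(
  ID = c("A", "A", "A", "B"),
  login = as.POSIXct(c("2020-01-01 01:00:00", "2020-01-01 01:10:00",
                       "2020-01-01 09:00:00", "2020-01-01 02:00:00"))
)

session_time <- logins %>%
  arrange(ID, login) %>%
  group_by(ID) %>%
  # a gap larger than the threshold starts a new session
  mutate(gap = as.numeric(difftime(login, lag(login), units = "mins")),
         session = cumsum(is.na(gap) | gap > 30)) %>%
  group_by(ID, session) %>%
  # time on site per session = last action minus first action
  summarize(minutes = as.numeric(difftime(max(login), min(login), units = "mins")),
            .groups = "drop")
```

With this toy data, user A gets one 10-minute session plus a zero-length single-login session; how to count single-login sessions is a judgment call for your domain.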

How to check for continuity minding possible gaps in dates

I have a big data frame with dates and I need to find the first date in a continuous sequence, as follows:
ID ID_2 END BEG
1 55 2017-06-30 2016-01-01
1 55 2015-12-31 2015-11-12 --> Gap (required date)
1 88 2008-07-26 2003-02-24
2 19 2014-09-30 2013-05-01
2 33 2013-04-30 2011-01-01 --> Not Gap (overlapping)
2 19 2012-12-31 2011-01-01
2 33 2010-12-31 2008-01-01
2 19 2007-12-31 2006-01-01
2 19 2005-12-31 1980-10-20 --> No actual Gap(required date)
As shown, not all the dates overlap, and I need to return, by ID (not ID_2), the date at which the first gap (going backwards in time) appears. I've tried using a for loop but it's extremely slow (the data frame has 150k rows). I've been messing around with dplyr and mutate as follows:
df <- df %>%
  group_by(ID) %>%
  mutate(END_lead = lead(END))
df$FLAG <- df$BEG - days(1) == df$END_lead
df <- df %>%
  group_by(ID) %>%
  filter(cumsum(cumsum(FLAG == FALSE)) <= 1)
But this set of instructions stops at the first overlap, filtering the wrong date. I've tried everything I could think of, ordering in decreasing or ascending order, and using min and max, but could not figure out a solution.
The actual result wanted would be:
ID ID_2 END BEG
1 55 2015-12-31 2015-11-12
2 19 2008-07-26 1980-10-20
Is there a way of doing this using dplyr,tidyr and lubridate?
A possible solution using dplyr:
library(dplyr)
df %>%
mutate_at(vars(END, BEG), funs(as.Date)) %>%
group_by(ID) %>%
slice(which.max(BEG > ( lead(END) + 1 ) | is.na(BEG > ( lead(END) + 1 ))))
With your last data, it gives:
# A tibble: 2 x 4
# Groups: ID [2]
ID ID_2 END BEG
<int> <int> <date> <date>
1 1 55 2015-12-31 2015-11-12
2 2 19 2005-12-31 1980-10-20
What the solution does is basically:
Changes the dates to Date format (no need for lubridate);
Groups by ID;
Selects the first row (per group) that satisfies your criteria, i.e. either the first row that is a gap (TRUE), or, if there is no gap, the chronologically first record, whose gap check yields a missing value; this is why is.na(BEG > ( lead(END) + 1 )) is included.
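The steps above can be sketched on a minimal frame to see which row which.max() picks:

```r
library(dplyr)

dd <- data.frame(
  ID  = c(1, 1, 1),
  BEG = as.Date(c("2016-01-01", "2015-11-12", "2003-02-24")),
  END = as.Date(c("2017-06-30", "2015-12-31", "2008-07-26"))
)

gap_row <- dd %>%
  group_by(ID) %>%
  # TRUE where BEG does not continue directly from the next row's END
  slice(which.max(BEG > (lead(END) + 1) | is.na(BEG > (lead(END) + 1))))
```

Here the middle row wins: 2015-11-12 does not follow on from 2008-07-26 + 1 day, so it is the first gap going backwards in time.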
I would use the xts package, first creating an xts object for each ID you have, then using the first() and last() functions on each object.
https://www.datacamp.com/community/blog/r-xts-cheat-sheet

Count number of rows for each row that meet a logical condition

So I have some data with a time stamp, and for each row, I want to count the number of rows that fall within a certain time window. For example, if I have the data below with a time stamp in h:mm (column ts), I want to count the number of rows that occur from that time stamp to five minutes in the past (column count). The first n rows that are less than five minutes from the first data point should be NAs.
ts data count
1:01 123 NA
1:02 123 NA
1:03 123 NA
1:04 123 NA
1:06 123 5
1:07 123 5
1:10 123 3
1:11 123 4
1:12 123 4
This is straightforward to do with a for loop, but I've been trying to implement with the apply() family and have not yet found any success. Any suggestions?
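Since the apply() family was mentioned: a direct sapply() sketch (quadratic, but fine for moderate sizes) that counts rows in the inclusive five-minute window ending at each time stamp, assuming ts has been parsed to POSIXct:

```r
ts <- as.POSIXct(paste("2020-01-01",
                       c("1:01", "1:02", "1:03", "1:04", "1:06",
                         "1:07", "1:10", "1:11", "1:12")),
                 format = "%Y-%m-%d %H:%M")

# for each time stamp, count rows in the inclusive window [t - 5 min, t]
count <- sapply(ts, function(t) sum(ts >= t - 5 * 60 & ts <= t))

# rows less than five minutes after the first reading become NA
count[ts < ts[1] + 5 * 60] <- NA
```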
EDIT: modified to account for the potential for multiple readings per minute, raised in comment.
Data with new mid-minute reading:
library(dplyr)
df %>%
# Take the text above and convert to datetime
mutate(ts = lubridate::ymd_hms(paste(Sys.Date(), ts))) %>%
# Count how many observations per minute
group_by(ts_min = lubridate::floor_date(ts, "1 minute")) %>%
summarize(obs_per_min = sum(!is.na(data))) %>%
# Add rows for any missing minutes, count as zero observations
padr::pad(interval = "1 min") %>%
tidyr::replace_na(list(obs_per_min = 0)) %>%
# Count cumulative observations, and calc how many in window that
# begins 5 minutes ago and ends at end of current minute
mutate(cuml_count = cumsum(obs_per_min),
prior_cuml = lag(cuml_count) %>% tidyr::replace_na(0),
in_window = cuml_count - lag(prior_cuml, 5)) %>%
# Exclude unneeded columns and rows
select(-cuml_count, -prior_cuml) %>%
filter(obs_per_min > 0)
Output (now reflects the additional reading at 1:06:30):
# A tibble: 12 x 3
ts_min obs_per_min in_window
<dttm> <dbl> <dbl>
1 2018-09-26 01:01:00 1 NA
2 2018-09-26 01:02:00 1 NA
3 2018-09-26 01:03:00 1 NA
4 2018-09-26 01:04:00 1 NA
5 2018-09-26 01:06:00 2 6
6 2018-09-26 01:07:00 1 6
7 2018-09-26 01:10:00 1 4
8 2018-09-26 01:11:00 1 5
9 2018-09-26 01:12:00 1 4

Lookup based on several columns

I have a data frame that looks like this (of course it is way bigger):
> df1
# A tibble: 10 x 4
index1 index2 date1 date2
<int> <int> <date> <date>
1 5800032 6 2012-07-02 2013-09-18
2 5800032 7 2013-09-18 1970-01-01
3 5800254 6 2013-01-04 1970-01-01
4 5800261 5 2012-01-23 2013-02-11
5 5800261 6 2013-02-11 2014-02-05
6 5800261 7 2014-02-05 1970-01-01
7 3002704 7 2012-01-23 1970-01-01
8 3002728 7 2012-10-20 1970-01-01
9 3002810 7 2012-07-18 1970-01-01
10 8504593 3 2012-01-11 1970-01-01
The original variables are index1, index2 and date1. There are one or more records with the same index1 value (their sequence is determined by index2). My objective is to work out the intervals between subsequent values of date1 for the same index1 value. This means that there must be at least two records with the same index1 value to create an interval.
So I created the date2 variable, which provides the end date of the interval that starts on date1. It simply equals date1 of the subsequent record (date2[n] = date1[n+1]). If date1[n] is the latest (or the only) date for the given index1 value, then date2[n] <- 0.
I couldn't come up with a better idea than ordering the df by index1 and index2 and running a for loop:
for (i in 1:(nrow(df1) - 1)) {
  if (df1$index1[i] == df1$index1[i + 1]) {
    df1$date2[i] <- df1$date1[i + 1]
  } else {
    df1$date2[i] <- 0
  }
}
It sort of worked, but it was visibly slow and for some reason it did not "find" all values it should have. Also, I'm sure there must be a much more intelligent way of doing this task - possibly with sapply function. Any ideas are appreciated!
You can create date2 using lead from dplyr (date2 is the next row's date1, so lead, not lag, is what you want):
df1 %>%
  group_by(index1) %>%
  arrange(index2, .by_group = TRUE) %>%
  mutate(date2 = lead(date1, default = as.Date("1970-01-01")))
I didn't clearly understand the filtering part of your question. Your problem may have to do with filtering on the default date (1970-01-01, i.e. zero as a Date).
