I have a data set that looks like:
Trans_ID Time_Stamp Emp_ID
1111 9/15/16 01:12:50 9999
1112 9/15/16 01:12:59 9999
1113 9/15/16 01:13:01 9999
I need to get the difference (in seconds) between the current row and the row before it.
I'm looking for something like:
Trans_ID Time_Stamp Emp_ID Diff
1111 9/15/16 01:12:50 9999
1112 9/15/16 01:12:59 9999 9
1113 9/15/16 01:13:01 9999 2
The format can be flexible, but I mostly just need it to be calculated in some way. Any advice is greatly appreciated.
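No language is specified here, so as one option, a minimal sketch in R with dplyr and lubridate, assuming the rows sit in a data frame called 'trans' (a hypothetical name) already sorted by Time_Stamp:
library(dplyr)
library(lubridate)
trans %>%
  mutate(ts   = mdy_hms(Time_Stamp),                       # parses "9/15/16 01:12:50"
         Diff = as.numeric(ts - lag(ts), units = "secs"))  # NA for the first row
lag() shifts the parsed timestamps down one row, so each row is compared with the one before it; the first row gets NA, matching the blank Diff in the example.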
Problem
I have a very large data frame (almost 20,000 rows) where data was documented approximately every 1-3 minutes. Unfortunately, I am unable to upload any authentic data to this post.
Aim
The aim is to reduce the size of the data frame by filtering the rows by date and time into 5-minute intervals for each month. The data spans over 4 years.
Over the last couple of days I have tried different packages and functions in R, such as dplyr, aggregate(), and the tidyverse, to figure out how to do this, to no avail; I just can't solve this conundrum.
Structure of the data frame
I have a data frame like this
ID Date Time
1 9/15/16 6:48:00 AM
2 9/15/16 6:54:00 AM
3 9/15/16 6:57:00 AM
4 9/15/16 6:59:00 AM
5 9/15/16 7:03:00 AM
6 9/15/16 7:05:00 AM
I would like to convert this data into a data frame like the example below by calculating the number of minutes and seconds between each 'ID' row and the next.
ID Date Start_Time End_Time Minutes Seconds
1 9/15/16 6:48:00 AM 6:54:00 AM 6.0 360.00
2 9/15/16 6:54:00 AM 6:57:00 AM 3.0 180.00
3 9/15/16 6:57:00 AM 6:59:00 AM 2.0 120.00
4 9/15/16 6:59:00 AM 7:03:00 AM 4.0 240.00
5 9/15/16 7:03:00 AM 7:05:00 AM 2.0 120.00
6 9/15/16 7:05:00 AM etc
Afterwards, I'd like to filter the data frame containing the new calculations between the subsequent rows of 'IDs' by date and time into 5-minute (300-second) time intervals per month, to reduce the size of the data frame.
The output should be something like this, unless someone has a more efficient method.
ID Date Start_Time End_Time Minutes Seconds
1 9/15/16 6:48:00 AM 6:54:00 AM 6.0 360.00
I appreciate your thoughts on this
Many thanks in advance.
Progress
Many thanks in advance for this solution; it worked really well. Sorry for all these questions, I am a novice with R. Could I please ask what the warning message below means (ℹ 232 failed to parse) and what went wrong with the calculations of minutes and seconds for the rows in my data frame ('New_Track', see below)? The values in the columns called 'Minutes' and 'Seconds' are exactly the same. In addition, for IDs 7 and 8, the new calculations show a difference of 950520 minutes and seconds, when the correct calculation is approximately 2 minutes or 120 seconds (see below).
Rows 7 and 8
7 7 9/15/16 7:07:00 AM 2016-09-15 07:07:00 2016-09-26 07:09:00 950520M 0S 950520
8 8 9/26/16 7:09:00 AM 2016-09-26 07:09:00 2016-09-26 07:11:00 120M 0S 120
##My data frame is called 'track' and the columns are: (1) ID; (2) Date; and (3) Time
##Code:
New_Track <- data.frame(
  stringsAsFactors = FALSE,
  ID = track$ID,
  Date = track$Date,
  Time = track$Time
) %>%
  mutate(Start_Time = mdy_hms(paste(Date, Time)),
         End_Time = lead(Start_Time),
         Minutes = minutes(End_Time - Start_Time),
         Seconds = (End_Time - Start_Time) / dseconds(1))
Warning message:
Problem while computing `Start_Time = mdy_hms(paste(Date, Time))`.
ℹ 232 failed to parse.
New Data frame Layout - 'New_Track'
ID Date Time Start_Time End_Time Minutes Seconds
1 1 9/15/16 6:48:00 AM 2016-09-15 06:48:00 2016-09-15 06:54:00 360M 0S 360
2 2 9/15/16 6:54:00 AM 2016-09-15 06:54:00 2016-09-15 06:57:00 180M 0S 180
3 3 9/15/16 6:57:00 AM 2016-09-15 06:57:00 2016-09-15 06:59:00 120M 0S 120
4 4 9/15/16 6:59:00 AM 2016-09-15 06:59:00 2016-09-15 07:03:00 240M 0S 240
5 5 9/15/16 7:03:00 AM 2016-09-15 07:03:00 2016-09-15 07:05:00 120M 0S 120
6 6 9/15/16 7:05:00 AM 2016-09-15 07:05:00 2016-09-15 07:07:00 120M 0S 120
7 7 9/15/16 7:07:00 AM 2016-09-15 07:07:00 2016-09-26 07:09:00 950520M 0S 950520
8 8 9/26/16 7:09:00 AM 2016-09-26 07:09:00 2016-09-26 07:11:00 120M 0S 120
9 9 9/26/16 7:11:00 AM 2016-09-26 07:11:00 2016-09-26 07:13:00 120M 0S 120
library(dplyr); library(lubridate)
data.frame(
stringsAsFactors = FALSE,
ID = c(1L, 2L, 3L, 4L, 5L, 6L),
Date = c("9/15/16","9/15/16","9/15/16",
"9/15/16","9/15/16","9/15/16"),
Time = c("6:48:00","6:54:00","6:57:00",
"6:59:00","7:03:00","7:05:00")
) %>%
mutate(Start_Time = mdy_hms(paste(Date, Time)),
End_Time = lead(Start_Time),
Minutes = minutes(End_Time-Start_Time),
Seconds = (End_Time-Start_Time) / dseconds(1))
Result
ID Date Time Start_Time End_Time Minutes Seconds
1 1 9/15/16 6:48:00 2016-09-15 06:48:00 2016-09-15 06:54:00 6M 0S 360
2 2 9/15/16 6:54:00 2016-09-15 06:54:00 2016-09-15 06:57:00 3M 0S 180
3 3 9/15/16 6:57:00 2016-09-15 06:57:00 2016-09-15 06:59:00 2M 0S 120
4 4 9/15/16 6:59:00 2016-09-15 06:59:00 2016-09-15 07:03:00 4M 0S 240
5 5 9/15/16 7:03:00 2016-09-15 07:03:00 2016-09-15 07:05:00 2M 0S 120
6 6 9/15/16 7:05:00 2016-09-15 07:05:00 <NA> <NA> NA
It's still unclear to me what you're looking for with this:
I'd like to filter the data frame containing the new calculations
between the subsequent rows of 'IDs' by date and time into 5-minute
(300-second) time intervals per month, to reduce the size of the data frame.
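As an aside on the warning: "232 failed to parse" means mdy_hms() returned NA for 232 rows whose pasted Date/Time strings do not match the month/day/year hours:minutes:seconds pattern, so it is worth inspecting those rows. And minutes(End_Time - Start_Time) just wraps the bare numeric value of the difftime, whose unit R picks automatically, which is why the 'Minutes' column can end up holding the seconds count. Note also that row 7's End_Time comes from row 8, which is dated 9/26/16, so the 950520-second gap reflects an 11-day jump in the data itself rather than a calculation error. A unit-safe sketch, assuming the same 'track' columns as above:
library(dplyr); library(lubridate)
New_Track <- track %>%
  mutate(Start_Time = mdy_hms(paste(Date, Time)),
         End_Time   = lead(Start_Time),
         Minutes    = time_length(End_Time - Start_Time, unit = "minute"),
         Seconds    = time_length(End_Time - Start_Time, unit = "second"))
# Show the rows that fail to parse, to see what format they actually use:
track[is.na(mdy_hms(paste(track$Date, track$Time))), ]
If the 5-minute filtering simply means keeping rows whose gap is at least 300 seconds, a final filter(Seconds >= 300) would do it, but that depends on what the intervals are meant to be.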
I have run into this issue and I really have no clue how to do it. I have two data.frames, both with date columns. However, the first one, which is a big object, contains measurements every 3 seconds, while the second contains measurements every 10 minutes. I want to include the measurement variable of object 2 in object 1 (something like a left_join or merge) by the date variable. My data looks like this (df1):
date_time            measurement1
yyyy-mm-dd HH:MM:03  val1
yyyy-mm-dd HH:MM:06  val2
df2:
date_time            measurement2
yyyy-mm-dd HH:10:00  val1
yyyy-mm-dd HH:20:00  val2
I hope that is enough info, otherwise please comment. I have explored foverlaps and fuzzyjoin but without success.
Thank you in advance
Here is what I have in a bit more detail (df1):
date_time           measurement1
05/06/2018 0:00:03  73
05/06/2018 0:00:06  73.5
05/06/2018 0:00:09  48.5
05/06/2018 0:00:12  50.7
05/06/2018 0:00:15  80
05/06/2018 0:00:18  81
The data continue like this, every 3 seconds, for a number of months.
df2:
date_time           measurement2
05/06/2018 0:00:00  110
05/06/2018 0:10:00  120
05/06/2018 0:20:00  180
What I want is this:
df:
date_time           measurement1  measurement2
05/06/2018 0:00:03  73            110
05/06/2018 0:00:06  73.5          110
05/06/2018 0:00:09  48.5          110
05/06/2018 0:00:12  50.7          110
05/06/2018 0:00:15  80            110
05/06/2018 0:00:18  81            110
I hope it is clearer now. By the way, there might be an issue with the tables: I am using the format Stack Overflow suggests, and I can see the tables being rendered in the preview, but the formatting is lost when I submit.
Thank you
Every minute has 20 observations if those observations occur every 3 seconds, hence there are 200 observations for every 10-minute interval. If your data is complete, it would suffice to stretch out your 10-minute-interval observations accordingly, i.e. you copy every 10-minute-interval value 200 times next to the 3-second-interval values.
Try the following and tell me what you get
df1$measurement2 <- rep(df2$measurement2, each = 200)
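If the data is not complete, the rep() alignment will drift, so a timestamp-based match is safer. A base R sketch, assuming the date_time columns are day/month/year strings as shown (swap the format string if they are month/day/year):
# Parse the timestamps (format assumed; adjust if needed)
df1$date_time <- as.POSIXct(df1$date_time, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
df2$date_time <- as.POSIXct(df2$date_time, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
# For each 3-second reading, find the most recent 10-minute reading
idx <- findInterval(as.numeric(df1$date_time), as.numeric(df2$date_time))
# idx == 0 means the reading predates all of df2, so it maps to NA
df1$measurement2 <- c(NA, df2$measurement2)[idx + 1]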
I have a SQLite database, and I want to create a query that will group records if their DateTime values are within 60 minutes of each other. The hard part is that the grouping is cumulative: if we have 3 records with DateTimes 2019-12-14 15:40:00, 2019-12-14 15:56:00 and 2019-12-14 16:55:00, they all fall into one group, because each is within 60 minutes of the previous one. Please see the Hands table and the desired output of the query to help you understand the requirement.
Database Table "Hands"
ID DateTime Result
1 2019-12-14 15:40:00 -100
2 2019-12-14 15:56:00 1000
3 2019-12-14 16:55:00 -2000
4 2012-01-12 12:00:00 400
5 2016-10-01 21:00:00 900
6 2016-10-01 20:55:00 1000
Desired output of query
StartTime Count Result
2019-12-14 15:40:00 3 -1100
2012-01-12 12:00:00 1 400
2016-10-01 20:55:00 2 1900
You can use some window functions to indicate at which record a new group should start (because its datetime differs from the previous record's by 60 minutes or more), and then turn that information into a unique group number. Finally, you can group by that group number and apply the aggregation functions:
with base as (
    -- flag the rows that start a new group: the gap with the previous row
    -- (julianday difference is in days, so * 24 gives hours) is 1 hour or
    -- more, or there is no previous row at all (lag() is null, coalesced to 1)
    select DateTime, Result,
    coalesce(cast((
        julianday(DateTime) - julianday(
            lag(DateTime) over (order by DateTime)
        )
    ) * 24 >= 1 as integer), 1) as firstInGroup
    from Hands
), step as (
    -- a running sum of those flags numbers the groups 1, 2, 3, ...
    select DateTime, Result,
    sum(firstInGroup) over (
        order by DateTime rows
        between unbounded preceding and current row) as grp
    from base
)
select min(DateTime) StartTime,
       count(*) Count,
       sum(Result) Result
from step
group by grp;
DB-fiddle
I have a table (pay_period) as follows:
pay_period
period_id list_id start_date end_date price
1 100 2017-01-01 2017-08-31 100
2 100 2017-09-01 2017-12-31 110
3 101 2017-01-01 2017-08-31 75
Now I have a list_id, checkin_date and checkout_date:
list_id 100
checkin_date 2017-08-25
checkout_date 2017-09-10
I need to calculate the price of a list for the period from the checkin date to the checkout date.
Therefore the calculation is supposed to be:
7 * 100 + 10 * 110
(7 days at price 100 in the first period, then 10 days at price 110 in the second.)
I am thinking of doing it with a for loop; if there is a better way to do it, can you please suggest one?
You have to see whether the checkin_date and checkout_date fall into the same period_id.
1.1 If yes, you multiply the price by the number of days.
1.2 If no, you count the days from checkin_date until the end of your period 1 and multiply by the corresponding price, then do the same with checkout_date and the beginning of the next period.
Note: I guess it might happen that there are more than 2 prices per list_id. For example:
period_id list_id start_date end_date price
1 100 2017-01-01 2017-04-30 100
2 100 2017-05-01 2017-09-30 110
3 100 2017-10-01 2017-12-31 120
4 101 2017-01-01 2017-08-31 75
and the calculation period to be:
list_id 100
checkin_date 2017-03-01
checkout_date 2017-11-10
In this case, yes, the solution would be to have a CURSOR that keeps the prices for the list_id and its periods, loop through it, and compare the checkin_date and checkout_date with each record.
Best,
Mikcutu.
You can do the following for much cleaner code. Although it is purely SQL, I am using a function to make the code easier to understand.
Create a generic function which gets you the number of overlapping days between 2 different date ranges.
CREATE OR REPLACE FUNCTION fn_count_range
( p_start_date1 IN DATE,
  p_end_date1   IN DATE,
  p_start_date2 IN DATE,
  p_end_date2   IN DATE ) RETURN NUMBER AS
  v_days NUMBER;
BEGIN
  IF p_end_date1 < p_start_date1 OR p_end_date2 < p_start_date2 THEN
    RETURN 0;
  END IF;
  -- expand both ranges into lists of days and count the days they share
  SELECT COUNT(*) INTO v_days
  FROM (
    (SELECT p_start_date1 + LEVEL - 1
       FROM dual CONNECT BY LEVEL <= p_end_date1 - p_start_date1 + 1 ) INTERSECT
    (SELECT p_start_date2 + LEVEL - 1
       FROM dual CONNECT BY LEVEL <= p_end_date2 - p_start_date2 + 1 ) );
  RETURN v_days;
END;
/
Now, your query to calculate the total price is simplified.
WITH lists ( list_id,
checkin_date,
checkout_date) AS
( SELECT 100,
TO_DATE('2017-08-25','YYYY-MM-DD'),
TO_DATE('2017-09-10','YYYY-MM-DD')
FROM dual) --Not required if you have a lists table.
SELECT l.list_id,
SUM(fn_count_range(start_date,end_date,checkin_date,checkout_date) * price) total_price
FROM pay_period p
JOIN lists l ON p.list_id = l.list_id
GROUP BY l.list_id;
I have 2 data frames: one with a list of IDs and dates for 700 persons, and another with 400,000 rows containing a date and several other variables for over 1,000 persons.
example df1:
ID date
1010 2014-05-31
1011 2015-08-27
1015 2011-04-15
...
example df2:
ID Date Operationcode
1010 2008-01-03 456
1010 2016-06-09 1234
1010 1999-10-04 123186
1010 2017-02-30 71181
1010 2005-05-05 201
1011 2008-04-02 46
1011 2009-09-09 1231
1515 2017-xx-xx 156
1015 2013-xx-xx 123
1615 1998-xx-xx 123
1015 2005-xx-xx 4156
1015 2007-xx-xx 123
1015 2016-xx-xx 213
Now I want to create a df3 where I only keep the rows from df2 whose date is before the date in df1 (when matched by ID).
So I get:
ID Date Operationcode
1010 2008-01-03 456
1010 1999-10-04 123186
1010 2005-05-05 201
1015 2005-xx-xx 4156
1015 2007-xx-xx 123
I've tried
df3 <- subset(df1, ID %in% df2$ID & df2$date < df1$date)
but it keeps giving me an error saying that the lengths in the last part, df2$date < df1$date, don't match, and when I take a sample test (looking up the Operationcode for one ID) I can see that I miss a lot of rows dated before the date from df1. Any ideas or solutions?
And I've only got base R, as it's the hospital's computer, which doesn't allow any downloading -.-
In base R you could do something like this...
df3 <- merge(df2,df1,by="ID",all.x=TRUE) #merge in df1 date column
df3 <- df3[as.Date(df3$Date)<as.Date(df3$date),] #remove rows with invalid dates
#note that 'Date' is the df2 column, 'date' is the df1 version
df3 <- df3[!is.na(df3$ID),] #remove NA rows
df3$date <- NULL #remove df1 date column
df3
ID Date Operationcode
1 1010 2008-01-03 456
2 1010 1999-10-04 123186
3 1010 2005-05-05 201
6 1011 2009-09-09 1231
7 1011 2008-04-02 46
I'm not sure what is supposed to happen with the dates containing xx in your data. Are they real? If they appear in the actual data, they will need special handling, as otherwise they will not be converted to a proper date format and the comparison fails.
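For instance, a minimal sketch of that special handling, assuming the xx placeholders should simply be dropped (with an explicit format, unparseable strings become NA rather than raising an error):
df3 <- merge(df2, df1, by = "ID", all.x = TRUE)   # merge in the df1 date column
d2  <- as.Date(df3$Date, format = "%Y-%m-%d")     # df2 dates; "2017-xx-xx" becomes NA
d1  <- as.Date(df3$date, format = "%Y-%m-%d")     # df1 cutoff dates
df3 <- df3[!is.na(d2) & !is.na(d1) & d2 < d1, ]   # keep parseable rows before the cutoff
df3$date <- NULL                                  # drop the helper column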