Finding difference in time between two data frames using R

I have two data frames: one holds the in-times of employees and the other holds their out-times. Both contain timestamps for about 4000 employees over the last year (weekends and public holidays excluded), so each data frame has 4000 rows and 250 columns. I would like to find the number of hours each employee spent at work each day. My approach is to take the difference in time between the two data frames using the difftime() function. I used the code below and expected a resulting data frame with 4000 rows and 250 columns of time differences; however, the result came back as one single column. How can I get the difference in time between the two data frames as a data frame with 4000 rows and 250 columns?
hours_spent <- as.data.frame(as.matrix(difftime(as.matrix(out_time_data_hrs),as.matrix(in_time_data_hrs),unit='hour')))
The input data consists of an In_time data frame and an Out_time data frame; the expected output is a data frame of the same shape (the original tables are not reproduced here).

Here's a small and simple example based on the data you posted and a possible solution:
# example data in_times
df1 = data.frame(`2018-08-01` = c("2018-08-01 10:30:00", "2018-08-01 10:25:00"),
                 `2018-08-02` = c("2018-08-02 10:20:00", "2018-08-02 10:45:00"))
# example data out_times
df2 = data.frame(`2018-08-01` = c("2018-08-01 17:33:00", "2018-08-01 18:06:00"),
                 `2018-08-02` = c("2018-08-02 17:11:00", "2018-08-02 17:45:00"))
library(tidyverse)
# reshape datasets
df1_resh = df1 %>%
  mutate(empl_id = row_number()) %>% # add an employee id (using the row number)
  gather(day, in_time, -empl_id)     # reshape dataset
df2_resh = df2 %>%
  mutate(empl_id = row_number()) %>%
  gather(day, out_time, -empl_id)
# join datasets and calculate hours spent
left_join(df1_resh, df2_resh, by=c("empl_id","day")) %>%
  mutate(hours_spent = difftime(out_time, in_time))
#   empl_id         day             in_time            out_time    hours_spent
# 1       1 X2018.08.01 2018-08-01 10:30:00 2018-08-01 17:33:00 7.050000 hours
# 2       2 X2018.08.01 2018-08-01 10:25:00 2018-08-01 18:06:00 7.683333 hours
# 3       1 X2018.08.02 2018-08-02 10:20:00 2018-08-02 17:11:00 6.850000 hours
# 4       2 X2018.08.02 2018-08-02 10:45:00 2018-08-02 17:45:00 7.000000 hours
You can use this as the final piece of code if you want to reshape back to your initial format:
left_join(df1_resh, df2_resh, by=c("empl_id","day")) %>%
  mutate(hours_spent = difftime(out_time, in_time)) %>%
  select(empl_id, day, hours_spent) %>%
  spread(day, hours_spent)
#   empl_id    X2018.08.01 X2018.08.02
# 1       1 7.050000 hours  6.85 hours
# 2       2 7.683333 hours  7.00 hours
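gather() and spread() still work, but tidyr documents pivot_longer() and pivot_wider() as their newer replacements. A minimal sketch of the same reshape, join, reshape idea with those functions follows. The helper name to_long, the UTC time zone, and the check.names = FALSE choice (which keeps the date column names unchanged) are assumptions made for illustration and are not part of the original answer.

```r
library(dplyr)
library(tidyr)

# toy data, same values as the example above; check.names = FALSE keeps
# the column names as "2018-08-01" instead of "X2018.08.01"
df1 <- data.frame(`2018-08-01` = c("2018-08-01 10:30:00", "2018-08-01 10:25:00"),
                  `2018-08-02` = c("2018-08-02 10:20:00", "2018-08-02 10:45:00"),
                  check.names = FALSE, stringsAsFactors = FALSE)
df2 <- data.frame(`2018-08-01` = c("2018-08-01 17:33:00", "2018-08-01 18:06:00"),
                  `2018-08-02` = c("2018-08-02 17:11:00", "2018-08-02 17:45:00"),
                  check.names = FALSE, stringsAsFactors = FALSE)

# hypothetical helper: add an employee id and reshape to one row per employee and day
to_long <- function(df, value_name) {
  df %>%
    mutate(empl_id = row_number()) %>%
    pivot_longer(-empl_id, names_to = "day", values_to = value_name)
}

hours_wide <- to_long(df1, "in_time") %>%
  left_join(to_long(df2, "out_time"), by = c("empl_id", "day")) %>%
  # fix the unit to hours and store plain numbers (UTC is an assumed time zone)
  mutate(hours_spent = as.numeric(difftime(as.POSIXct(out_time, tz = "UTC"),
                                           as.POSIXct(in_time, tz = "UTC"),
                                           units = "hours"))) %>%
  select(empl_id, day, hours_spent) %>%
  pivot_wider(names_from = day, values_from = hours_spent)
```

Setting units = "hours" explicitly avoids difftime() picking a different unit when the differences happen to be small or large.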

My requirement was satisfied by simply doing the following, which is pretty straightforward:
employee_hrs_df <- out_time_data - in_time_data
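Subtracting whole data frames like this relies on the columns already being date-time values. As a hedged sketch for the case where the timestamps are stored as text, each pair of columns can be converted and subtracted in turn, which keeps the rows-by-columns shape. The toy values, the column names d1 and d2, and the UTC time zone are assumptions for illustration only.

```r
# hypothetical toy inputs; the real data frames would have about 4000 rows and 250 columns
in_time_data <- data.frame(
  d1 = c("2018-08-01 10:30:00", "2018-08-01 10:25:00"),
  d2 = c("2018-08-02 10:20:00", "2018-08-02 10:45:00"),
  stringsAsFactors = FALSE
)
out_time_data <- data.frame(
  d1 = c("2018-08-01 17:33:00", "2018-08-01 18:06:00"),
  d2 = c("2018-08-02 17:11:00", "2018-08-02 17:45:00"),
  stringsAsFactors = FALSE
)

# compute out - in column by column, always in hours, as plain numbers
hours_by_col <- Map(
  function(out_col, in_col) {
    as.numeric(difftime(as.POSIXct(out_col, tz = "UTC"),
                        as.POSIXct(in_col, tz = "UTC"),
                        units = "hours"))
  },
  out_time_data, in_time_data
)

# reassemble the list of columns into a data frame with the original column names
employee_hrs_df <- as.data.frame(hours_by_col)
```

Map() walks over the two data frames one column at a time, so the result has one numeric column per day and one row per employee.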

Related

how to get specific date in time series in r

I want to build a forecasting project from a time series data frame, but the time span is too big.
The data frame has this date column:
Date
2010-06-29
2010-06-30
2010-07-01
2010-07-02
How can I change it so that it only shows every 7 days?
Date
2010-06-29
2010-07-05
2010-07-12
2010-07-19
etc
dataframe.new = dataframe[seq(1, nrow(dataframe), 7),]
seq documentation - https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/seq
Basically, seq(1, 100, 7) generates 1, 8, 15, ..., so the line above keeps every 7th row.
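A small runnable sketch of the same row-index idea, assuming a toy data frame with a single Date column (the 22 consecutive days are made up for illustration). Note that drop = FALSE is needed here because subsetting the rows of a one-column data frame would otherwise return a plain vector.

```r
# toy data: 22 consecutive days starting at 2010-06-29 (assumed for illustration)
dataframe <- data.frame(Date = seq(as.Date("2010-06-29"), by = "day", length.out = 22))

# keep rows 1, 8, 15, 22; drop = FALSE keeps the result a data frame
dataframe.new <- dataframe[seq(1, nrow(dataframe), by = 7), , drop = FALSE]

print(dataframe.new$Date)
```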
Daniel's answer is very simple and direct.
However, it will return only data from a specified weekday, which could lead to biased results depending on the nature of your data.
You can create an index of weekdays that is balanced with random sampling of weekdays:
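# load packages (ymd() comes from lubridate; %>% and filter() from dplyr)
library(lubridate)
library(dplyr)
# example data
df <- data.frame(date = seq.Date(from = ymd("2021/01/01"),
                                 to = ymd("2021/12/31"),
                                 by = "day"))
# add the weekday name of each date
df$weekday <- weekdays(df$date)
# create index by sampling weekdays randomly (one random ordering per week)
set.seed(1)
index <- replicate(floor(nrow(df)/7), {sample(unique(df$weekday), replace = FALSE)}) %>%
  as.vector()
# subsetting to a roughly 7-fold smaller dataset
# (index is recycled explicitly to the number of rows)
output <- df %>% filter(weekday == rep_len(index, n()))
# checking table of weekdays in the final dataset
table(output$weekday)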
Friday Monday Saturday Sunday Thursday Tuesday Wednesday
13 6 5 9 8 10 6

R filtering/selecting data by POSIXct time and a condition

I have measured temperature at a high time resolution of 10 minutes on different urban tree species, whose reactions should be compared. I am therefore especially interested in periods of heat. The task I fail to do on my dataset is to select complete days based on a maximum value, e.g. days with at least one measurement above 30 °C should be subsetted from my data frame completely.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether each individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether they should pick a day or not when producing a new data frame. When the value is above 30 °C at any time of the day, I want to include that whole date, from 00:00 to 23:59, in the new data frame for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
Time = tseq,
Temp = sample(20:35,1000, replace = TRUE),
Variable1 = sample(1:200,1000, replace = TRUE),
Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example yields a data frame analogous to the structure of my data:
head(Measurings)
                 Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00   28        56       377 normal             0
2 2018-05-18 01:00:00   23        65       408 normal             0
3 2018-05-18 02:00:00   29        78       324 normal             0
4 2018-05-18 03:00:00   24       157       432 normal             0
5 2018-05-18 04:00:00   32       129       794   heat             1
6 2018-05-18 05:00:00   25        27       574 normal             0
So how do I subset to get a new data frame containing all the days on which at least one entry is marked as "heat"?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how could I tell it to take the whole day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions for a working solution to my problem. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable that specifies the day (dropping hours, minutes, etc.). Then iterate over the unique dates and keep only those subsets whose heat30 column contains "heat" at least once:
library(dplyr)
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
newdf <- lapply(unique(Measurings$Time2), function(x){
  rr <- Measurings %>% filter(Time2 == x) # all rows for date x
  # keep the day only if its heat30 values contain "heat" at least once;
  # days returned as NULL are dropped by bind_rows()
  if(any(rr$heat30 == "heat")) rr else NULL
}) %>% bind_rows()
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
# break the time stamp into Day and Hour
# (format() gives every time stamp the same "date time" layout before splitting)
time_df <- as.data.frame(str_split(format(Measurings$Time, "%Y-%m-%d %H:%M:%S"), " ", simplify = TRUE))
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by keeping the days that contain at least one "heat" entry
new_df <- new_measurings_df %>%
  filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 hourly temperature observations, ranging from 20 to 35, across roughly 40 days. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is always good practice to set a seed to ensure reproducibility.
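Another way to express "keep whole days that contain at least one hot reading" is a grouped filter in dplyr. The sketch below is only an illustration: it uses a small deterministic toy data set (three days, with a single reading above 30 on the second day) rather than the random data from the question, and the column name Day is an assumption.

```r
library(dplyr)

# toy data: 72 hourly readings over three days; only the second day has a value above 30
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
Measurings <- data.frame(
  Time = seq(from = start, length.out = 72, by = "hour"),
  Temp = c(rep(25, 24), 25, 31, rep(25, 22), rep(28, 24))
)

hot_days <- Measurings %>%
  mutate(Day = as.Date(Time, tz = "CET")) %>% # calendar day in the data's time zone
  group_by(Day) %>%
  filter(any(Temp > 30)) %>%                  # keep every row of a day with a hot reading
  ungroup()
```

Because any() is evaluated once per group, filter() keeps or drops all rows of a day together, which is the behaviour asked for in the question.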

How to expand a column in a dataframe following another column's formatting?

I have a huge data frame consisting of 10 million rows in the following format in RStudio.
ID DATE reading
100845 2014-08-17 0,0,0,0,3,0,0,0,0,1,1,0,0,0,0,0,2,0,0,1,0,0,2,0
100845 2014-08-18 0,0,4,0,0,0,0,1,0,0,0,0,1,1,1,1,0,1,1,2,1,1,0,1
100845 2014-08-19 0,1,0,1,0,1,1,1,2,0,1,0,1,0,1,0,1,0,1,0,2,1,1,0
100918 2015-07-02 1,0,0,1,0,1,3,1,1,0,1,0,1,0,1,0,0,1,0,1,0,0,1,1
100920 2013-02-07 0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,5,6,4,2,1,0,1
100920 2013-02-08 0,1,0,0,1,3,5,4,2,1,0,1,0,1,0,0,1,3,7,5,1,1,1,0
The 24 readings per row refer to hourly meter readings during a day. I would like to convert the daily dates to hourly and convert the strings of readings to a column format instead. The IDs should follow this format.
For example, I have implemented the following:
hourly <- data.frame(Hourly=seq(min(as.POSIXct(paste(df$DATE, "00:00"),tz="")),max(as.POSIXct(paste(df$DATE, "23:00"),tz="")),by="hour"))
How can I fill in the new fields that are created due to the hourly setting with the same IDs as in the daily format? As the full dataset I have is extremely big, I would appreciate a solution that can run very fast.
I can't speak to the speed of this approach on a data set as large as yours, but I think this code does the steps you want:
library(dplyr)
library(tidyr)
df2 <- df %>%
  # use separate to spread the readings across separate columns
  separate(reading, into = paste0("hour.", seq(24)), sep = ",") %>%
  # use gather to convert that wide data frame into a long one
  gather(key = hour, value = reading, hour.1:hour.24) %>%
  # make the hour marker into a number
  mutate(hour = as.numeric(gsub("hour.", "", hour)) - 1) %>%
  # order the data
  arrange(ID, DATE, hour) %>%
  # create a new column that combines the date and time stamp
  mutate(datetime = as.POSIXct(paste(DATE, hour), format = "%Y-%m-%d %H")) %>%
  # shed unneeded columns
  select(ID, datetime, reading)
Result:
> head(df2)
      ID            datetime reading
1 100845 2014-08-17 00:00:00       0
2 100845 2014-08-17 01:00:00       0
3 100845 2014-08-17 02:00:00       0
4 100845 2014-08-17 03:00:00       0
5 100845 2014-08-17 04:00:00       3
6 100845 2014-08-17 05:00:00       0
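Since speed was a concern, here is a hedged base R sketch of the same expansion that avoids the wide intermediate table by splitting the strings once and repeating the ID and date for each reading. Whether it is actually faster on 10 million rows has not been measured here. The two-row toy data frame and the UTC time zone are assumptions for illustration.

```r
# toy data in the same layout as the question (two of the posted rows)
df <- data.frame(
  ID = c(100845, 100918),
  DATE = c("2014-08-17", "2015-07-02"),
  reading = c("0,0,0,0,3,0,0,0,0,1,1,0,0,0,0,0,2,0,0,1,0,0,2,0",
              "1,0,0,1,0,1,3,1,1,0,1,0,1,0,1,0,0,1,0,1,0,0,1,1"),
  stringsAsFactors = FALSE
)

# split each comma-separated string into its hourly readings
readings <- strsplit(df$reading, ",", fixed = TRUE)
n_per_row <- lengths(readings)       # number of readings per original row
hour <- sequence(n_per_row) - 1L     # 0, 1, ..., 23 within each original row

long <- data.frame(
  ID = rep(df$ID, times = n_per_row),
  # midnight of each date plus the hour offset in seconds (UTC is assumed)
  datetime = as.POSIXct(rep(df$DATE, times = n_per_row), tz = "UTC") + 3600 * hour,
  reading = as.integer(unlist(readings))
)
```

Unlike the pipeline above, this version also converts reading to integer, since strsplit() returns character values.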

R aggregate a dataframe by hours from a date with time field

I'm relatively new to R but I am very familiar with Excel and T-SQL.
I have a simple dataset that has a date with time and a numeric value associated with it. What I'd like to do is summarize the numeric values by hour of the day. I've found a couple of resources for working with time types in R, but I was hoping to find a solution similar to what Excel offers (where I can call a function, pass in my date/time data, and have it return the hour of the day).
Any suggestions would be appreciated - thanks!
library(readr)
library(dplyr)
library(lubridate)
df <- read_delim('DateTime|Value
3/14/2015 12:00:00|23
3/14/2015 13:00:00|24
3/15/2015 12:00:00|22
3/15/2015 13:00:00|40',"|")
df %>%
  mutate(hour_of_day = hour(as.POSIXct(strptime(DateTime, "%m/%d/%Y %H:%M:%S")))) %>%
  group_by(hour_of_day) %>%
  summarise(meanValue = mean(Value))
Breakdown:
Convert the DateTime column (character) into a date-time, then use hour() from lubridate to pull out just the hour value and put it into a new column named hour_of_day.
> df %>%
    mutate(hour_of_day = hour(as.POSIXct(strptime(DateTime, "%m/%d/%Y %H:%M:%S"))))
Source: local data frame [4 x 3]
            DateTime Value hour_of_day
1 3/14/2015 12:00:00    23          12
2 3/14/2015 13:00:00    24          13
3 3/15/2015 12:00:00    22          12
4 3/15/2015 13:00:00    40          13
group_by(hour_of_day) sets the groups over which mean(Value) is computed in the summarise(...) call.
This gives the result:
  hour_of_day meanValue
1          12      22.5
2          13      32.0
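For comparison, the same summary can be sketched in base R without extra packages: format(x, "%H") plays the role of an "hour of the day" function and aggregate() does the grouping. This is an illustrative alternative under the assumption that the timestamps are text in month/day/year order; the UTC time zone is also an assumption.

```r
# same four rows as in the example above
df <- data.frame(
  DateTime = c("3/14/2015 12:00:00", "3/14/2015 13:00:00",
               "3/15/2015 12:00:00", "3/15/2015 13:00:00"),
  Value = c(23, 24, 22, 40),
  stringsAsFactors = FALSE
)

# parse the text timestamps, then extract the hour as an integer
parsed <- as.POSIXct(df$DateTime, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
df$hour_of_day <- as.integer(format(parsed, "%H"))

# mean of Value for each hour of the day
hourly_mean <- aggregate(Value ~ hour_of_day, data = df, FUN = mean)
```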

Split Date to Day, Month and Year for ffdf Data in R

I'm using R's ff package with an ffdf object named MyData (dim = c(10819740, 16)). I'm trying to split the Date variable into Day, Month and Year and add these three variables to the existing ffdf data MyData.
For instance, my date column is named SalesReportDate, with VirtualVmode and PhysicalVmode = double after I converted SalesReportDate using as.Date(, format = "%m/%d/%Y").
Examples of SalesReportDate are as follows:
> B
SalesReportDate
1 2013-02-01
2 2013-05-02
3 2013-05-04
4 2013-10-06
5 2013-15-10
6 2013-11-01
7 2013-11-03
8 2013-30-02
9 2013-12-12
10 2014-01-01
I've referred to "Split date into different columns for year, month and day" and tried to apply it, but I keep getting errors and warnings.
So, is there any way for me to do this? Thanks in advance.
Credit to @jwijffels for this great solution:
require(ffbase)
MyData$SalesReportDateYear <- with(MyData["SalesReportDate"], format(SalesReportDate, "%Y"), by = 250000)
MyData$SalesReportDateMonth <- with(MyData["SalesReportDate"], format(SalesReportDate, "%m"), by = 250000)
MyData$SalesReportDateDay <- with(MyData["SalesReportDate"], format(SalesReportDate, "%d"), by = 250000)
