Here is my sample dataset
id hour
1 15:10
2 12:10
3 22:10
4 06:30
I need to find the earliest and the latest time. The class of hour is factor, so I need to convert it to an appropriate class and then compare the times. I tried to format hour using the code below, but it did not work as expected:
format(as.Date(date),"%H:%M")
Use times() from the chron package:
#Data
xx
# id hour
#1 1 15:10
#2 2 12:10
#3 3 22:10
#4 4 06:30
library(chron)
xx$hour = times(paste0(as.character(xx$hour), ":00"))
xx
# id hour
#1 1 15:10:00
#2 2 12:10:00
#3 3 22:10:00
#4 4 06:30:00
#Min and Max
range(xx$hour)
#[1] 06:30:00 22:10:00
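If you also need the rows that contain the earliest and latest times, which.min / which.max work on the times column as well (a quick follow-up sketch):

xx[which.min(xx$hour), ]   # row with the earliest time
xx[which.max(xx$hour), ]   # row with the latest time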
Data:
xx = structure(list(id = 1:4, hour = structure(c(3L, 2L, 4L, 1L), .Label = c("06:30",
"12:10", "15:10", "22:10"), class = "factor")), .Names = c("id",
"hour"), row.names = c(NA, -4L), class = "data.frame")
If all you need is to find earliest (min) and latest (max) times, you can just convert the times to a character and use min, max: e.g.,
hour <- c("15:10", "12:10", "22:10", "06:30")
hour[which(hour == max(hour))]
#[1] "22:10"
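If you'd rather compare actual time values instead of strings, a minimal base-R sketch along the same lines (no extra packages, using the hour vector above):

# parse onto an arbitrary date, take the range, then format back to HH:MM
hr <- as.POSIXct(hour, format = "%H:%M", tz = "UTC")
format(range(hr), "%H:%M")
#[1] "06:30" "22:10"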
I have two data frames with timestamps (in as.POSIXct, format="%Y-%m-%d %H:%M:%S") as below.
df_ID1
ID DATETIME TIMEDIFF EV
A 2019-03-26 06:13:00 2019-03-26 00:13:00 1
B 2019-04-03 08:00:00 2019-04-03 02:00:00 1
B 2019-04-04 12:35:00 2019-04-04 06:35:00 1
df_ID0
ID DATETIME
A 2019-03-26 00:02:00
A 2019-03-26 04:55:00
A 2019-03-26 11:22:00
B 2019-04-02 20:43:00
B 2019-04-04 11:03:00
B 2019-04-06 03:12:00
I want to compare each DATETIME in df_ID1 with the DATETIME values in df_ID0 that have the same ID, and find the df_ID0 DATETIME that is "smaller than but closest to" the one in df_ID1.
For each matched pair, I then want to compare the TIMEDIFF in df_ID1 with the matched DATETIME in df_ID0: if TIMEDIFF is greater than that DATETIME, change EV from 1 to 4 in df_ID1.
My desired result is
df_ID1
ID DATETIME TIMEDIFF EV
A 2019-03-26 06:13:00 2019-03-26 00:13:00 1
B 2019-04-03 08:00:00 2019-04-03 02:00:00 4
B 2019-04-04 12:35:00 2019-04-04 06:35:00 1
I've looked at how to compare timestamps and calculate time differences, and also how to change values based on criteria, but I cannot find anything that selects the "smaller than but closest to" timestamp, and I cannot figure out how to combine all of this logic.
Any help would be appreciated!
You can do this with a for loop, keeping in mind that if your actual dataset is very big, the overhead will hurt performance.
for(i in 1:nrow(df_ID1)){
  sub <- subset(df_ID0, ID == df_ID1$ID[i])                        # filter on ID
  df_0_dt <- max(sub$DATETIME[sub$DATETIME < df_ID1$DATETIME[i]])  # latest DATETIME that is still earlier, i.e. "smaller than but closest to"
  if(df_0_dt < df_ID1$TIMEDIFF[i]){                                # final condition
    df_ID1[i, "EV"] <- 4
  }
}
df_ID1
# A tibble: 3 x 4
ID DATETIME TIMEDIFF EV
<chr> <dttm> <dttm> <dbl>
1 A 2019-03-26 06:13:00 2019-03-26 00:13:00 1
2 B 2019-04-03 08:00:00 2019-04-03 02:00:00 4
3 B 2019-04-04 12:35:00 2019-04-04 06:35:00 1
One option, using nested mapply, is to first split df_ID1 and df_ID0 by ID. For each DATETIME in df_ID1, calculate the difference to every DATETIME in df_ID0 with the same ID, take the index of the "smaller than but closest to" value (inds), and change EV to 4 if the corresponding TIMEDIFF value is greater than the matched DATETIME. (The unlisted logical vector indexes df_ID1$EV in ID order, which matches the row order of the sample data.)
df_ID1$EV[unlist(mapply(function(x, y) {
  mapply(function(p, q) {
    vals = as.numeric(difftime(p, y$DATETIME))   # difference to every df_ID0 DATETIME of this ID
    inds = which(vals == min(vals[vals > 0]))    # index of "smaller than but closest to"
    q > y$DATETIME[inds]                         # TRUE if TIMEDIFF exceeds the matched DATETIME
  }, x$DATETIME, x$TIMEDIFF)
}, split(df_ID1, df_ID1$ID), split(df_ID0, df_ID0$ID)))] <- 4
df_ID1
# ID DATETIME TIMEDIFF EV
#1 A 2019-03-26 06:13:00 2019-03-26 00:13:00 1
#2 B 2019-04-03 08:00:00 2019-04-03 02:00:00 4
#3 B 2019-04-04 12:35:00 2019-04-04 06:35:00 1
Data:
df_ID0 <- structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L),
.Label = c("A",
"B"), class = "factor"), DATETIME = structure(c(1553529720, 1553547300,
1553570520, 1554208980, 1554346980, 1554491520), class = c("POSIXct",
"POSIXt"), tzone = "")), row.names = c(NA, -6L), class = "data.frame")
df_ID1 <- structure(list(ID = structure(c(1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), DATETIME = structure(c(1553551980, 1554249600,
1554352500), class = c("POSIXct", "POSIXt"), tzone = ""), TIMEDIFF =
structure(c(1553530380,
1554228000, 1554330900), class = c("POSIXct", "POSIXt"), tzone = ""),
EV = c(1, 1, 1)), row.names = c(NA, -3L), class = "data.frame")
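For completeness, a loop-free sketch of the same matching logic using dplyr (this assumes dplyr >= 1.0 for slice_max, and that every df_ID1 row has at least one earlier df_ID0 timestamp for its ID):

library(dplyr)
df_ID1 %>%
  mutate(row = row_number()) %>%
  left_join(df_ID0, by = "ID", suffix = c("", "_0")) %>%   # all candidate df_ID0 times per ID
  filter(DATETIME_0 < DATETIME) %>%                        # keep only earlier timestamps
  group_by(row) %>%
  slice_max(DATETIME_0, n = 1) %>%                         # "smaller than but closest to"
  ungroup() %>%
  mutate(EV = ifelse(TIMEDIFF > DATETIME_0, 4, EV)) %>%    # apply the TIMEDIFF condition
  select(ID, DATETIME, TIMEDIFF, EV)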
I have a dataset that looks like below:
PPID join_date week date visit
A 2017-10-01 1 NA 0
A 2017-10-01 2 2017-10-08 2
A 2017-10-01 3 2017-10-15 1
A 2017-10-01 4 NA 0
B 2017-05-23 1 2017-05-21 4
B 2017-05-23 2 2017-05-28 2
B 2017-05-23 3 NA 0
week indicates the difference between the Sunday of the week of join_date and date in weeks (e.g. for participant B, the Sunday of the week of 2017-05-23 is 2017-05-21; thus participant B's week1 starts on 2017-05-21, and week2 starts on 2017-05-28).
My goal is to fill in date where it is currently NA, such that the output looks like below:
PPID join_date week date visit
A 2017-10-01 1 2017-10-01 0
A 2017-10-01 2 2017-10-08 2
A 2017-10-01 3 2017-10-15 1
A 2017-10-01 4 2017-10-22 0
B 2017-05-23 1 2017-05-21 4
B 2017-05-23 2 2017-05-28 2
B 2017-05-23 3 2017-06-04 0
The code I currently have is:
library(dplyr)
library(lubridate)
df2 <- df %>%
group_by(PPID) %>%
mutate(date = seq(unique(floor_date(as.Date(join_date), "weeks")),
unique(floor_date(as.Date(join_date), "weeks") + 7*(max(week)-1)),
by="week"))
The problem with this approach is that I'm working with a large dataset (~8 million observations) and it takes forever to run! I read some posts suggesting that all those date conversions/calculations (e.g. floor_date or as.Date) are what take so long, and I was wondering if there are ways to make my code more efficient.
Thanks!
How about simply
df2$date = floor_date(df2$join_date, 'week') + 7*(df2$week-1)
# PPID join_date week date visit
# 1 A 2017-10-01 1 2017-10-01 0
# 2 A 2017-10-01 2 2017-10-08 2
# 3 A 2017-10-01 3 2017-10-15 1
# 4 A 2017-10-01 4 2017-10-22 0
# 5 B 2017-05-23 1 2017-05-21 4
# 6 B 2017-05-23 2 2017-05-28 2
# 7 B 2017-05-23 3 2017-06-04 0
Although this calculates floor_date for every row, it is vectorised rather than looping (as your grouped mutate did implicitly), so it should be fast enough for most purposes. If you need an even bigger speed-up, you could subset on is.na(df2$date) to only compute the rows you need to impute, as sketched below.
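For instance, restricting the computation to just the missing rows might look like this (a sketch using the same df2 columns as above):

# impute only the rows where date is missing
idx <- is.na(df2$date)
df2$date[idx] <- floor_date(df2$join_date[idx], 'week') + 7*(df2$week[idx] - 1)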
Data:
df2 = structure(list(PPID = c("A", "A", "A", "A", "B", "B", "B"), join_date = structure(c(17440,
17440, 17440, 17440, 17309, 17309, 17309), class = "Date"), week = c(1L,
2L, 3L, 4L, 1L, 2L, 3L), date = structure(c(NA, 17447, 17454,
NA, 17307, 17314, NA), class = "Date"), visit = c(0L, 2L, 1L,
0L, 4L, 2L, 0L)), row.names = c(NA, -7L), class = "data.frame")
I have a dataframe consisting of: two columns with start and end timestamps (POSIXct class) for various projects; another timestamp column (POSIXct class) showing events which occurred within that start and end timeframe; and a project id column.
Projects have multiple events naturally.
Projid Event BEGIN_DT END_DT
1 04/12/2013 09:00:00 04/12/2013 08:12:00 04/14/2013 20:14:00
1 04/13/2013 15:16:24 04/12/2013 08:12:00 04/14/2013 20:14:00
2 06/06/2012 18:00:00 06/06/2012 13:54:32 08/06/2012 23:59:43
2 06/07/2012 22:54:32 06/06/2012 13:54:32 08/06/2012 23:59:43
I would like to add a field showing for each event the 60 min time bucket it belongs to (as in first hour or second hour or n-th hour of the project etc...). How could this be done?
How about the following using a floor on the difftime in hours:
# Your sample data
df <- structure(list(
Projid = c(1L, 1L, 2L, 2L),
Event = structure(c(1365721200, 1365830184, 1338969600, 1339073672), class = c("POSIXct", "POSIXt"), tzone = ""),
BEGIN_DT = structure(c(1365718320, 1365718320, 1338954872, 1338954872), class = c("POSIXct", "POSIXt"), tzone = ""),
END_DT = structure(c(1365934440, 1365934440, 1344261583, 1344261583), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("Projid", "Event", "BEGIN_DT", "END_DT"), row.names = c(NA, -4L), class = "data.frame")
# Add hour bin
df$hourBin <- floor(difftime(df$Event, df$BEGIN_DT, units = "hours")) + 1
df
#Projid Event BEGIN_DT END_DT hourBin
#1 1 2013-04-12 09:00:00 2013-04-12 08:12:00 2013-04-14 20:14:00 1 hours
#2 1 2013-04-13 15:16:24 2013-04-12 08:12:00 2013-04-14 20:14:00 32 hours
#3 2 2012-06-06 18:00:00 2012-06-06 13:54:32 2012-08-06 23:59:43 5 hours
#4 2 2012-06-07 22:54:32 2012-06-06 13:54:32 2012-08-06 23:59:43 34 hours
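If you'd rather store a plain number than a difftime in that column, a small variation on the same idea (same data as above):

# same binning, but kept as a plain numeric hour index
df$hourBin <- as.numeric(floor(difftime(df$Event, df$BEGIN_DT, units = "hours"))) + 1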
I've got a data frame that looks like something along these lines:
Day Salesperson Value
==== ============ =====
Monday John 40
Monday Sarah 50
Tuesday John 60
Tuesday Sarah 30
Wednesday John 50
Wednesday Sarah 40
I want to divide the value for each salesperson by the number of times that each day of the week has occurred. So: there have been 3 Mondays, 3 Tuesdays, and 2 Wednesdays. I don't have this information digitally, but I can create a vector along the lines of
c(3, 3, 2)
How can I conditionally divide the Value column based on the number of times each day occurs?
I've found an inelegant solution, which entails copying the Day column to a temp column, replacing each of the names of the week in the new column with the number of times each day occurs using
df$temp <- sub("Monday", 3, df$temp)
but doing this seems kinda clunky. Is there a neat way to do this?
Suppose your auxiliary data is in another data.frame:
Day N_Day
1 Monday 3
2 Tuesday 3
3 Wednesday 2
The simplest way would be to merge:
DF_new <- merge(DF, DF2, by="Day")
DF_new$newcol <- DF_new$Value / DF_new$N_Day
which gives
Day Salesperson Value N_Day newcol
1 Monday John 40 3 13.33333
2 Monday Sarah 50 3 16.66667
3 Tuesday John 60 3 20.00000
4 Tuesday Sarah 30 3 10.00000
5 Wednesday John 50 2 25.00000
6 Wednesday Sarah 40 2 20.00000
The mergeless shortcut is
DF$newcol <- DF$Value / DF2$N_Day[match(DF$Day, DF2$Day)]
Data:
DF <- structure(list(Day = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label =
c("Monday",
"Tuesday", "Wednesday"), class = "factor"), Salesperson = structure(c(1L,
2L, 1L, 2L, 1L, 2L), .Label = c("John", "Sarah"), class = "factor"),
Value = c(40L, 50L, 60L, 30L, 50L, 40L)), .Names = c("Day",
"Salesperson", "Value"), class = "data.frame", row.names = c(NA,
-6L))
DF2 <- structure(list(Day = structure(1:3, .Label = c("Monday", "Tuesday",
"Wednesday"), class = "factor"), N_Day = c(3, 3, 2)), .Names = c("Day",
"N_Day"), row.names = c(NA, -3L), class = "data.frame")
You can use the library dplyr to merge your data frame with the frequency of each day.
df <- data.frame(
Day=c("Monday","Monday","Tuesday","Tuesday","Wednesday","Wednesday"),
Salesperson=c("John","Sarah","John","Sarah","John","Sarah"),
Value=c(40,50,60,30,50,40), stringsAsFactors=F)
aux <- data.frame(
Day=c("Monday","Tuesday","Wednesday"),
n=c(3,3,2)
)
output <- df %>% left_join(aux, by="Day") %>% mutate(Value2=Value/n)
To create this auxiliary table from the counts of the days that actually appear in your original data, instead of specifying it manually, you could use:
aux <- df %>% group_by(Day) %>% summarise(n=n())
(The output below was produced with this counted aux, which is why n is 2 for every day rather than the 3, 3, 2 from the question.)
> output
Day Salesperson Value n Value2
1 Monday John 40 2 20
2 Monday Sarah 50 2 25
3 Tuesday John 60 2 30
4 Tuesday Sarah 30 2 15
5 Wednesday John 50 2 25
6 Wednesday Sarah 40 2 20
If you want to overwrite the actual Value column, use mutate(Value=Value/n) instead, and to remove the additional column you can add a select(-n):
output <- df %>% left_join(aux, by="Day") %>% mutate(Value=Value/n) %>% select(-n)
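Alternatively, if the counts live in a plain named vector rather than a data frame, a base-R lookup does the same division (a sketch; day_counts is just an illustrative name):

day_counts <- c(Monday = 3, Tuesday = 3, Wednesday = 2)
df$Value2 <- df$Value / unname(day_counts[df$Day])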
I have one text file that looks like:
wd <- read.table("C:\\Users\\value.txt", sep ='' , header =TRUE)
head(wd) # hourly values
# Year day hour mint valu1
# 1 2002 1 7 30 0.5
# 2 2002 1 8 0 0.3
# 3 2002 1 8 30 0.4
I want to add another column with the date formatted like this:
"2002-01-01 07:30:00 UTC"
Thanks for your help
Try this. No packages are used:
transform(wd,
Date = as.POSIXct(paste(Year, day, hour, mint), format = "%Y %j %H %M", tz = "UTC")
)
## Year day hour mint valu1 Date
## 1 2002 1 7 30 0.5 2002-01-01 07:30:00
## 2 2002 1 8 0 0.3 2002-01-01 08:00:00
## 3 2002 1 8 30 0.4 2002-01-01 08:30:00
Note: Input is:
wd <- structure(list(Year = c(2002L, 2002L, 2002L), day = c(1L, 1L,
1L), hour = c(7L, 8L, 8L), mint = c(30L, 0L, 30L), valu1 = c(0.5,
0.3, 0.4)), .Names = c("Year", "day", "hour", "mint", "valu1"
), class = "data.frame", row.names = c(NA, -3L))
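Note that %j is the day of the year, so day values past 31 roll over into later months; for example:

as.POSIXct(paste(2002, 40, 7, 30), format = "%Y %j %H %M", tz = "UTC")
## [1] "2002-02-09 07:30:00 UTC"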
You might be able to simplify things with a package like lubridate, but to illustrate the solution I think this will work for you. Next time it would save time for the people answering if you provided code to create the sample data, as I've done here.
d <- read.table(header=T, stringsAsFactors=F, text="
Year day hour mint valu1
2002 1 7 30 0.5
2002 1 8 0 0.3
2002 1 8 30 0.4
")
require(stringr)
# build a "YYYY-DDD HH:MM" string (day of year) and parse it with %j
d$datetime <- strptime(
  paste0(
    d$Year, "-",
    str_pad(d$day, 3, pad = "0"), " ",
    str_pad(d$hour, 2, pad = "0"),
    ":",
    str_pad(d$mint, 2, pad = "0")
  ),
  format = "%Y-%j %H:%M",
  tz = "UTC"
)
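One last note: strptime() returns POSIXlt; if you want the column stored as POSIXct (as in the first answer), you can convert the result:

# convert POSIXlt to POSIXct for easier handling in a data frame
d$datetime <- as.POSIXct(d$datetime)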