I am a big fan of Hyndman's packages, but I stumbled with the Box-Cox transformation.
I have a data frame:
class(chicago_sales)
[1] "tbl_ts" "tbl_df" "tbl" "data.frame"
I am trying to mutate an extra column in which the Median_price variable is transformed.
foo <- chicago_sales %>%
mutate(bc = BoxCox(x = chicago_sales$Median_price, lambda =
BoxCox.lambda(chicago_sales$Median_price)))
This gives me some result (probably wrong too), and I cannot apply autoplot() to it.
I also tried to apply the code from Hyndman's book, but failed.
What am I doing wrong? Thanks!
UPDATED:
The issue: inside tsibbles, when using dplyr, you do not call chicago_sales$Median_price but just Median_price. When working with tsibbles I would advise using fable and fabletools (a sketch of that approach follows the output below), but if you are using forecast, it should work like this:
library(tsibble)
library(dplyr)
library(forecast)
pedestrian %>%
mutate(bc = BoxCox(Count, BoxCox.lambda(Count)))
# A tsibble: 66,037 x 6 [1h] <Australia/Melbourne>
# Key: Sensor [4]
Sensor Date_Time Date Time Count bc
<chr> <dttm> <date> <int> <int> <dbl>
1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630 11.3
2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01 1 826 9.87
3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01 2 567 9.10
4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01 3 264 7.65
5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01 4 139 6.52
6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01 5 77 5.54
7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01 6 44 4.67
8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01 7 56 5.04
9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01 8 113 6.17
10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01 9 166 6.82
# ... with 66,027 more rows
I used a built-in dataset from the tsibble package, as you did not provide a dput() of chicago_sales.
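If you do want to go the fable/fabletools route instead, a minimal sketch of the equivalent transformation on the same built-in data could look like the following. Note that using the guerrero feature from the feasts package to estimate lambda is my choice here, not something from the forecast code above:
library(tsibble)
library(dplyr)
library(fabletools)  # provides box_cox() and features()
library(feasts)      # provides the guerrero feature

# estimate lambda for one sensor, then transform that sensor's counts
lambda <- pedestrian %>%
  filter(Sensor == "Birrarung Marr") %>%
  features(Count, features = guerrero) %>%
  pull(lambda_guerrero)

pedestrian %>%
  filter(Sensor == "Birrarung Marr") %>%
  mutate(bc = box_cox(Count, lambda))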
I am struggling with nested ifelse. I want to create a new variable using dplyr::mutate based on the values of other variables. See the reproducible example below.
library(dplyr)
library(hms)
# make a test dataframe
datetime <- as.POSIXct(c("2015-01-26 10:10:00 UTC","2015-01-26 10:20:00 UTC","2015-01-26 10:30:00 UTC", "2015-01-26 10:40:00 UTC","2015-01-26 10:50:00 UTC","2015-01-26 11:00:00 UTC","2015-01-26 00:10:00 UTC","2015-01-26 11:20:00 UTC","2015-01-26 11:30:00 UTC","2017-03-10 10:00:00 UTC"))
time <- hms::as_hms(datetime)
pco2_corr <- c(90,135,181,226,272,317,363,NA,454,300)
State_Zero <- c(NA,NA,1,rep(NA,7))
State_Flush <- c(rep(NA,4),1,rep(NA,5))
z <- tibble(datetime, time, pco2_corr, State_Zero, State_Flush)
# now create a new variable
z <- z %>%
dplyr::mutate(pco2_corr_qf = ifelse(is.na(pco2_corr), 15,
ifelse((State_Zero >= 1 | State_Flush >= 1), 4,
ifelse(pco2_corr < 100 | pco2_corr > 450, 7,
ifelse((time >= "00:00:00" & time <= "01:30:00") |
(time >= "12:00:00" & time <= "13:00:00"), 16,
ifelse((datetime >= "2017-03-10 08:00:00" &
datetime < "2017-03-21 20:00:00"), 99,
1))))))
z
# A tibble: 10 x 6
datetime time pco2_corr State_Zero State_Flush pco2_corr_qf
<dttm> <time> <dbl> <dbl> <dbl> <dbl>
1 2015-01-26 10:10:00 10:10 90 NA NA NA
2 2015-01-26 10:20:00 10:20 135 NA NA NA
3 2015-01-26 10:30:00 10:30 181 1 NA 4
4 2015-01-26 10:40:00 10:40 226 NA NA NA
5 2015-01-26 10:50:00 10:50 272 NA 1 4
6 2015-01-26 11:00:00 11:00 317 NA NA NA
7 2015-01-26 00:10:00 00:10 363 NA NA NA
8 2015-01-26 11:20:00 11:20 NA NA NA 15
9 2015-01-26 11:30:00 11:30 454 NA NA NA
10 2017-03-10 10:00:00 10:00 300 NA NA NA
The first two ifelse conditions work fine, but the next three do not. The new variable pco2_corr_qf should not have any NAs; those rows should instead get the values 7, 16, 99 and 1.
What am I doing wrong?
You are comparing time with a character string, which gives incorrect output; convert the string to the relevant class instead. Note also that conditions such as State_Zero >= 1 evaluate to NA when State_Zero is NA, and ifelse() propagates that NA, which is where the NAs in your result come from. We can use case_when(), which treats an NA condition as not matched and is a better alternative to nested ifelse.
library(dplyr)
library(hms)
z %>%
mutate(pco2_corr_qf = case_when(
is.na(pco2_corr) ~ 15,
State_Zero >= 1 | State_Flush >= 1 ~ 4,
pco2_corr < 100 | pco2_corr > 450 ~ 7,
(time >= as_hms("00:00:00") & time <= as_hms("01:30:00")) |
(time >= as_hms("12:00:00") & time <= as_hms("13:00:00")) ~ 16,
datetime >= as.POSIXct("2017-03-10 08:00:00") &
datetime < as.POSIXct("2017-03-21 20:00:00") ~ 99,
TRUE ~ 1))
# datetime time pco2_corr State_Zero State_Flush pco2_corr_qf
# <dttm> <time> <dbl> <dbl> <dbl> <dbl>
# 1 2015-01-26 10:10:00 10:10 90 NA NA 7
# 2 2015-01-26 10:20:00 10:20 135 NA NA 1
# 3 2015-01-26 10:30:00 10:30 181 1 NA 4
# 4 2015-01-26 10:40:00 10:40 226 NA NA 1
# 5 2015-01-26 10:50:00 10:50 272 NA 1 4
# 6 2015-01-26 11:00:00 11:00 317 NA NA 1
# 7 2015-01-26 00:10:00 00:10 363 NA NA 16
# 8 2015-01-26 11:20:00 11:20 NA NA NA 15
# 9 2015-01-26 11:30:00 11:30 454 NA NA 7
#10 2017-03-10 10:00:00 10:00 300 NA NA 99
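As a minimal illustration of why the nested ifelse() produced NAs while case_when() does not:
ifelse(NA >= 1, 4, 1)                    # NA: ifelse() propagates an NA condition
dplyr::case_when(NA >= 1 ~ 4, TRUE ~ 1)  # 1: an NA condition simply does not match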
I am trying to split rows in an Excel file based on day and time. The data come from a study in which participants wear a tracking watch. Each row of the data set starts when the participant puts on the watch (variable: 'Wear Time Start') and ends when they take the device off (variable: 'Wear Time End').
I need to calculate how many hours each participant wore the device on each day (NOT per time period, i.e. per row).
Data set before split:
ID WearStart WearEnd
1 01 2018-05-14 09:00:00 2018-05-14 20:00:00
2 01 2018-05-14 21:30:00 2018-05-15 02:00:00
3 01 2018-05-15 07:00:00 2018-05-16 22:30:00
4 01 2018-05-16 23:00:00 2018-05-16 23:40:00
5 01 2018-05-17 01:00:00 2018-05-19 15:00:00
6 02 ...
Some explanation about the data set before the split: the data type of 'WearStart' and 'WearEnd' is POSIXlt.
Desired output after split:
ID WearStart WearEnd Interval
1 01 2018-05-14 09:00:00 2018-05-14 20:00:00 11
2 01 2018-05-14 21:30:00 2018-05-15 00:00:00 2.5
3 01 2018-05-15 00:00:00 2018-05-15 02:00:00 2
4 01 2018-05-15 07:00:00 2018-05-16 00:00:00 17
5 01 2018-05-16 00:00:00 2018-05-16 22:30:00 22.5
4 01 2018-05-16 23:00:00 2018-05-16 23:40:00 0.4
5 01 2018-05-17 01:00:00 2018-05-18 00:00:00 23
6 01 2018-05-18 00:00:00 2018-05-19 00:00:00 24
7 01 2018-05-19 00:00:00 2018-05-19 15:00:00 15
Then I need to accumulate hours based on day:
ID Wear_Day Total_Hours
1 01 2018-05-14 13.5
2 01 2018-05-15 19
3 01 2018-05-16 22.9
4 01 2018-05-17 23
5 01 2018-05-18 24
4 01 2018-05-19 15
So I reworked the entire answer. Please review the code; I am pretty sure this is what you want.
Short summary
The problem is that you need to split rows which start and end on different dates, and you need to do this recursively. So I split the dataframe into a list of 1-row dataframes. For each one I check whether the start and end are on the same day. If not, I turn it into a 2-row dataframe with adjusted start and end times. This is then split up again into a list of 1-row dataframes, and so on.
In the end there is a nested list of 1-row dataframes in which start and end are on the same day, and this list is then recursively bound back together.
# Load Packages ---------------------------------------------------------------------------------------------------
library(tidyverse)
library(lubridate)
df <- tribble(
~ID, ~WearStart, ~WearEnd
, 01, "2018-05-14 09:00:00", "2018-05-14 20:00:00"
, 01, "2018-05-14 21:30:00", "2018-05-15 02:00:00"
, 01, "2018-05-15 07:00:00", "2018-05-16 22:30:00"
, 01, "2018-05-16 23:00:00", "2018-05-16 23:40:00"
, 01, "2018-05-17 01:00:00", "2018-05-19 15:00:00"
)
df <- df %>% mutate_at(vars(starts_with("Wear")), ymd_hms)
# Helper Functions ------------------------------------------------------------------------------------------------
endsOnOtherDay <- function(df){
as_date(df$WearStart) != as_date(df$WearEnd)
}
split1rowInto2Days <- function(df){
df1 <- df
df2 <- df
df1$WearEnd <- as_date(df1$WearStart) + days(1) - milliseconds(1)
df2$WearStart <- as_date(df2$WearStart) + days(1)
rbind(df1, df2)
}
splitDates <- function(df){
if (nrow(df) > 1){
return(df %>%
split(f = 1:nrow(df)) %>%
lapply(splitDates) %>%
reduce(rbind))
}
if (df %>% endsOnOtherDay()){
return(df %>%
split1rowInto2Days() %>%
splitDates())
}
df
}
# The actual Calculation ------------------------------------------------------------------------------------------
df %>%
splitDates() %>%
mutate(wearDuration = difftime(WearEnd, WearStart, units = "hours")
, wearDay = as_date(WearStart)) %>%
group_by(ID, wearDay) %>%
summarise(wearDuration_perDay = sum(wearDuration))
ID wearDay wearDuration_perDay
<dbl> <date> <drtn>
1 1 2018-05-14 13.50000 hours
2 1 2018-05-15 19.00000 hours
3 1 2018-05-16 23.16667 hours
4 1 2018-05-17 23.00000 hours
5 1 2018-05-18 24.00000 hours
6 1 2018-05-19 15.00000 hours
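If you prefer a plain numeric column instead of the <drtn> (difftime) column shown above, a small optional variation is to coerce the duration explicitly:
df %>%
  splitDates() %>%
  mutate(wearDuration = as.numeric(difftime(WearEnd, WearStart, units = "hours"))
         , wearDay = as_date(WearStart)) %>%
  group_by(ID, wearDay) %>%
  summarise(wearDuration_perDay = sum(wearDuration))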
Here is my solution to your question using just base R functions:
#step 1: read data from file
d <- read.csv("dt.csv", header = TRUE)
d
ID WearStart WearEnd
1 1 2018-05-14 09:00:00 2018-05-14 20:00:00
2 1 2018-05-14 21:30:00 2018-05-15 02:00:00
3 1 2018-05-15 07:00:00 2018-05-16 22:30:00
4 1 2018-05-16 23:00:00 2018-05-16 23:40:00
5 1 2018-05-17 01:00:00 2018-05-19 15:00:00
6 2 2018-05-16 11:30:00 2018-05-16 11:40:00
7 2 2018-05-16 22:05:00 2018-05-22 22:42:00
#step 2: change the class of WearStart and WearEnd to POSIXlt
d$WearStart <- as.POSIXlt(d$WearStart, tryFormats = "%Y-%m-%d %H:%M")
d$WearEnd <- as.POSIXlt(d$WearEnd, tryFormats = "%Y-%m-%d %H:%M")
#step 3: calculate time interval (days and hours) for each record
timeInt <- function(d) {
WearStartDay <- as.Date(d$WearStart, "%Y/%m/%d")
Interval_days <- as.numeric(difftime(d$WearEnd,d$WearStart, units = "days"))
Days <- WearStartDay + seq(0, Interval_days,1)
N_FullBTWDays <- length(Days) - 2
if (N_FullBTWDays >= 0) {
sd <- d$WearStart
sd_h <- 24 - sd$hour -1
sd_m <- (60 - sd$min)/60
sd_total <- sd_h + sd_m
hours <- sd_total
hours <- c(hours, rep(24,N_FullBTWDays))
ed <- d$WearEnd
ed_h <- ed$hour
ed_m <- ed$min/60
ed_total <- ed_h + ed_m
hours <- c(hours,ed_total)
} else {
hours <- as.numeric(difftime(d$WearEnd,d$WearStart, units = "hours"))
}
df <- data.frame(id = rep(d$ID, length(Days)), days = Days, hours = hours)
return(df)
}
df <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df) <- c("id", "days", "hours")
for ( i in 1:nrow(d)) {
df <- rbind(df,timeInt(d[i,]))
}
df
id days hours
1 1 2018-05-14 11.0000000
2 1 2018-05-14 4.5000000
3 1 2018-05-15 17.0000000
4 1 2018-05-16 22.5000000
5 1 2018-05-16 0.6666667
6 1 2018-05-17 23.0000000
7 1 2018-05-18 24.0000000
8 1 2018-05-19 15.0000000
9 2 2018-05-16 0.1666667
10 2 2018-05-16 1.9166667
11 2 2018-05-17 24.0000000
12 2 2018-05-18 24.0000000
13 2 2018-05-19 24.0000000
14 2 2018-05-20 24.0000000
15 2 2018-05-21 24.0000000
16 2 2018-05-22 22.7000000
#daily usage of device for each customer
res <- as.data.frame(tapply(df$hours, list(df$days,df$id), sum))
res[is.na(res)] <- 0
res$date <- rownames(res)
res
1 2 date
2018-05-14 15.50000 0.000000 2018-05-14
2018-05-15 17.00000 0.000000 2018-05-15
2018-05-16 23.16667 2.083333 2018-05-16
2018-05-17 23.00000 24.000000 2018-05-17
2018-05-18 24.00000 24.000000 2018-05-18
2018-05-19 15.00000 24.000000 2018-05-19
2018-05-20 0.00000 24.000000 2018-05-20
2018-05-21 0.00000 24.000000 2018-05-21
2018-05-22 0.00000 22.700000 2018-05-22
I have the following data frame:
Date.POSIXct Date WeekDay DayCategory Hour Holidays value
1 2018-05-01 00:00:00 2018-05-01 MA MA-MI-JU 0 0 30
2 2018-05-01 01:00:00 2018-05-01 MA MA-MI-JU 1 0 80
3 2018-05-01 02:00:00 2018-05-01 MA MA-MI-JU 2 0 42
4 2018-05-01 03:00:00 2018-05-01 MA MA-MI-JU 3 0 90
5 2018-05-01 04:00:00 2018-05-01 MA MA-MI-JU 4 0 95
6 2018-05-01 05:00:00 2018-05-01 MA MA-MI-JU 5 0 5
DayCategory groups days of the week in the following way: Mondays go to the LU DayCategory; Tuesdays, Wednesdays and Thursdays go to the MA-MI-JU DayCategory.
Fridays go to VI, Saturdays to SA and Sundays to DO, respectively.
I would like to find the value for the same hour on the previous day (Date) with the same DayCategory, while Holidays stays within the same group (e.g. if one instance has Holidays 0 but the previous day with the same DayCategory has 1, we should look for the one before that, and so on).
As an intermediate step, and to understand the process, I would like to add a column PreviousDaySameDayCategory with the Date of the previous day that has the same DayCategory as the corresponding row. Sometimes it will just be the same date minus seven days ("LU", "VI", "SA", "DO"), but on other days it will be only one day back.
Reproducible data:
library(lubridate)
Date.POSIXct <- seq(as.POSIXct("2018-05-01"), as.POSIXct("2018-05-31"), "hour")
mydf <- as.data.frame(Date.POSIXct)
mydf$Date <- as.Date(substr(as.character(mydf$Date.POSIXct),1,10))
mydf$WeekDay <- substr(toupper((weekdays(mydf$Date))),1,2)
mydf$DayCategory <-as.factor(ifelse(mydf$WeekDay == "MA" | mydf$WeekDay == "MI" | mydf$WeekDay == "JU",
"MA-MI-JU", mydf$WeekDay))
mydf$Hour <- hour(mydf$Date.POSIXct)
mydf$Holidays <- c(rep(0, 24*7),rep(1, 24*7), rep(0, 24*16+1))
set.seed(123)
mydf$myvalue <- sample.int(101,size=nrow(mydf),replace=TRUE)
I have manually worked out the first days and created a vector of how the solution should look:
a <- rep(NA, 24)
b <- mydf$myvalue[1:24]
c <- mydf$myvalue[25:48]
d <- rep(NA, 24)
e <- rep(NA,24)
f <- rep(NA,24)
g <- rep(NA,24)
h <- rep(NA,24)
i <- mydf$myvalue[169:192]
solution <- c(a,b,c,d,e,f,g,h,i)
solution
I would appreciate any hint on the thinking process for solving this kind of problem, which I face relatively frequently.
Here is a data.table solution which uses a "grouped shift()" and multiple joins to copy value from the same hour of the PreviousDaySameDayCategory.
Create reproducible data
OP's code to create reproducible data was not fully reproducible because he used the weekdays() function which returns the weekday names in the current locale (which seems to be Spanish for the OP). To be independent of the current locale, I switched to format(Date, "%u") which returns the numbers 1 to 7 for Monday to Sunday. Furthermore, the fct_collapse() from the forcats package is used to collapse the days 2, 3, and 4 (Tuesday to Thursday) into one factor level.
library(data.table)
# note that package lubridate is not required
myDT <- data.table(Date.POSIXct = seq(as.POSIXct("2018-05-01"),
as.POSIXct("2018-05-31"), "hour"))
myDT[, Date := as.Date(Date.POSIXct)]
myDT[, Weekday := format(Date, "%u")]
myDT[, DayCategory := forcats::fct_collapse(Weekday, "234" = c("2", "3", "4"))]
myDT[, hour := hour(Date.POSIXct)]
myDT[, Holidays := c(rep(0, 24 * 7), rep(1, 24 * 7), rep(0, 24 * 16 + 1))]
set.seed(123)
myDT[, myvalue := sample.int(101, size = nrow(myDT), replace = TRUE)]
Intermediate step: PreviousDaySameDayCategory
The sample data set consists of hourly data, but in order to determine the PreviousDaySameDayCategory we need to work day-wise and thus only have to deal with the unique values of Date, DayCategory, and Holidays. The data is grouped by DayCategory and the Holidays indicator. For each group separately, the previous day is picked by lagging Date. As the result of shift() operations depends on the order of rows, the dataset has been ordered before shifting.
tmp <- unique(myDT[order(Date), .(Date, DayCategory, Holidays)])[
, .(Date, PreviousDaySameDayCategory = shift(Date)), by = .(DayCategory, Holidays)][
order(Date)]
tmp
DayCategory Holidays Date PreviousDaySameDayCategory
1: 234 0 2018-05-01 <NA>
2: 234 0 2018-05-02 2018-05-01
3: 234 0 2018-05-03 2018-05-02
4: 5 0 2018-05-04 <NA>
5: 6 0 2018-05-05 <NA>
6: 7 0 2018-05-06 <NA>
7: 1 0 2018-05-07 <NA>
8: 234 1 2018-05-08 <NA>
9: 234 1 2018-05-09 2018-05-08
10: 234 1 2018-05-10 2018-05-09
11: 5 1 2018-05-11 <NA>
12: 6 1 2018-05-12 <NA>
13: 7 1 2018-05-13 <NA>
14: 1 1 2018-05-14 <NA>
15: 234 0 2018-05-15 2018-05-03
16: 234 0 2018-05-16 2018-05-15
17: 234 0 2018-05-17 2018-05-16
18: 5 0 2018-05-18 2018-05-04
19: 6 0 2018-05-19 2018-05-05
20: 7 0 2018-05-20 2018-05-06
21: 1 0 2018-05-21 2018-05-07
22: 234 0 2018-05-22 2018-05-17
23: 234 0 2018-05-23 2018-05-22
24: 234 0 2018-05-24 2018-05-23
25: 5 0 2018-05-25 2018-05-18
26: 6 0 2018-05-26 2018-05-19
27: 7 0 2018-05-27 2018-05-20
28: 1 0 2018-05-28 2018-05-21
29: 234 0 2018-05-29 2018-05-24
30: 234 0 2018-05-30 2018-05-29
31: 234 0 2018-05-31 2018-05-30
DayCategory Holidays Date PreviousDaySameDayCategory
For days 3 and 4 (Wednesdays and Thursdays) the preceding Tuesday and Wednesday, respectively, of the same week are picked. For day 2 (Tuesday) the preceding Thursday of the preceding week is picked if both weeks have the same holiday indicator set. If the preceding week has a different holiday indicator, the most recent Thursday of the same holiday period is picked. This is why, e.g., 2018-05-03 is picked in row 15.
Copying value from matching PreviousDaySameDayCategory
This is done in two steps. First, the hourly values are picked from the matching PreviousDaySameDayCategory by joining with the matching days table tmp:
tmp2 <- myDT[tmp, on = .(Date = PreviousDaySameDayCategory), .(Date = i.Date, hour, myvalue), nomatch = 0L]
tmp2
Date hour myvalue
1: 2018-05-02 0 30
2: 2018-05-02 1 80
3: 2018-05-02 2 42
4: 2018-05-02 3 90
5: 2018-05-02 4 95
---
500: 2018-05-31 19 39
501: 2018-05-31 20 1
502: 2018-05-31 21 1
503: 2018-05-31 22 101
504: 2018-05-31 23 11
Second, a new column previousValue in myDT is created by updating in a join which contains the corresponding value from PreviousDaySameDayCategory:
myDT[tmp2, on = .(Date, hour), previousValue := i.myvalue]
Here, the first two days of the result are shown:
myDT[Date %between% c(as.Date("2018-05-01"), as.Date("2018-05-02"))]
Date.POSIXct Date Weekday DayCategory hour Holidays myvalue previousValue
1: 2018-05-01 00:00:00 2018-05-01 2 234 0 0 30 NA
2: 2018-05-01 01:00:00 2018-05-01 2 234 1 0 80 NA
3: 2018-05-01 02:00:00 2018-05-01 2 234 2 0 42 NA
4: 2018-05-01 03:00:00 2018-05-01 2 234 3 0 90 NA
5: 2018-05-01 04:00:00 2018-05-01 2 234 4 0 95 NA
6: 2018-05-01 05:00:00 2018-05-01 2 234 5 0 5 NA
7: 2018-05-01 06:00:00 2018-05-01 2 234 6 0 54 NA
8: 2018-05-01 07:00:00 2018-05-01 2 234 7 0 91 NA
9: 2018-05-01 08:00:00 2018-05-01 2 234 8 0 56 NA
10: 2018-05-01 09:00:00 2018-05-01 2 234 9 0 47 NA
11: 2018-05-01 10:00:00 2018-05-01 2 234 10 0 97 NA
12: 2018-05-01 11:00:00 2018-05-01 2 234 11 0 46 NA
13: 2018-05-01 12:00:00 2018-05-01 2 234 12 0 69 NA
14: 2018-05-01 13:00:00 2018-05-01 2 234 13 0 58 NA
15: 2018-05-01 14:00:00 2018-05-01 2 234 14 0 11 NA
16: 2018-05-01 15:00:00 2018-05-01 2 234 15 0 91 NA
17: 2018-05-01 16:00:00 2018-05-01 2 234 16 0 25 NA
18: 2018-05-01 17:00:00 2018-05-01 2 234 17 0 5 NA
19: 2018-05-01 18:00:00 2018-05-01 2 234 18 0 34 NA
20: 2018-05-01 19:00:00 2018-05-01 2 234 19 0 97 NA
21: 2018-05-01 20:00:00 2018-05-01 2 234 20 0 90 NA
22: 2018-05-01 21:00:00 2018-05-01 2 234 21 0 70 NA
23: 2018-05-01 22:00:00 2018-05-01 2 234 22 0 65 NA
24: 2018-05-01 23:00:00 2018-05-01 2 234 23 0 101 NA
25: 2018-05-02 00:00:00 2018-05-02 3 234 0 0 67 30
26: 2018-05-02 01:00:00 2018-05-02 3 234 1 0 72 80
27: 2018-05-02 02:00:00 2018-05-02 3 234 2 0 55 42
28: 2018-05-02 03:00:00 2018-05-02 3 234 3 0 61 90
29: 2018-05-02 04:00:00 2018-05-02 3 234 4 0 30 95
30: 2018-05-02 05:00:00 2018-05-02 3 234 5 0 15 5
31: 2018-05-02 06:00:00 2018-05-02 3 234 6 0 98 54
32: 2018-05-02 07:00:00 2018-05-02 3 234 7 0 92 91
33: 2018-05-02 08:00:00 2018-05-02 3 234 8 0 70 56
34: 2018-05-02 09:00:00 2018-05-02 3 234 9 0 81 47
35: 2018-05-02 10:00:00 2018-05-02 3 234 10 0 3 97
36: 2018-05-02 11:00:00 2018-05-02 3 234 11 0 49 46
37: 2018-05-02 12:00:00 2018-05-02 3 234 12 0 77 69
38: 2018-05-02 13:00:00 2018-05-02 3 234 13 0 22 58
39: 2018-05-02 14:00:00 2018-05-02 3 234 14 0 33 11
40: 2018-05-02 15:00:00 2018-05-02 3 234 15 0 24 91
41: 2018-05-02 16:00:00 2018-05-02 3 234 16 0 15 25
42: 2018-05-02 17:00:00 2018-05-02 3 234 17 0 42 5
43: 2018-05-02 18:00:00 2018-05-02 3 234 18 0 42 34
44: 2018-05-02 19:00:00 2018-05-02 3 234 19 0 38 97
45: 2018-05-02 20:00:00 2018-05-02 3 234 20 0 16 90
46: 2018-05-02 21:00:00 2018-05-02 3 234 21 0 15 70
47: 2018-05-02 22:00:00 2018-05-02 3 234 22 0 24 65
48: 2018-05-02 23:00:00 2018-05-02 3 234 23 0 48 101
Date.POSIXct Date Weekday DayCategory hour Holidays myvalue previousValue
Verification
The result is in line with OP's expectations
identical(myDT[, previousValue[seq_along(solution)]], solution)
[1] TRUE
The OP has posted the same question in the Data Science section as well. I am including the same solution I gave there, in case it might help others.
It is similar to Uwe's solution but with the dplyr library instead.
library(dplyr)
rankedDf <- mydf %>%
group_by(DayCategory, Hour, Holidays) %>%
arrange(Date) %>%
mutate(rowRank = order(Date), previousRowRank = order(Date) - 1) %>%
left_join(., ., by = c("previousRowRank" = "rowRank", "DayCategory", "Hour", "Holidays")) %>%
select(
Date.POSIXct = Date.POSIXct.x,
Date = Date.x,
WeekDay = WeekDay.x,
DayCategory,
Hour,
Holidays,
myvalue = myvalue.x,
PreviousDaySameDayCategory = Date.y,
PreviousValueSameDayCategory = myvalue.y
)
print.data.frame(rankedDf)
P.S. I love the way Uwe reworked the original sample code.
I have a data set that has dates and times for in and out. Each line is an in/out pair, but some are blank. I can remove the blanks with na.omit after a clean read-in (it was a CSV, and na.strings = c("") works with read.csv).
Of course, because the real world is never like the tutorial, some of the entries are dates only, so my as.POSIXlt(Dataset$In, format="%m/%d/%Y %H:%M") returns NA for the date-only values.
na.omit does not remove these lines, so I have two questions:
1. Why doesn't na.omit work, or how can I get it to work?
2. Better: how can I convert one column into both dates and times (in POSIX format) without two calls, or with some sort of optional parameter in the format string? (Or is this even possible?)
This is a sample of the dates and times. I can't share the real file: (1) it's huge, (2) it's PII.
Id,In,Out
1,8/15/2015 8:00,8/15/2015 17:00
1,8/16/2015 8:04,8/16/2015
1,8/17/2015 8:50,8/17/2015 18:00
1,8/18/2015,8/18/2015 17:00
2,8/15/2015,8/15/2015 13:00
2,8/16/2015 8:00,8/16/2015 17:00
3,8/15/2015 4:00,8/15/2015 11:00
3,8/16/2015 9:00,8/16/2015 19:00
3,8/17/2015,8/17/2015 17:00
3,,
4,,
4,8/16/2015 6:00,8/16/2015 20:00
DF <- read.table(text = "Id,In,Out
1,8/15/2015 8:00,8/15/2015 17:00
1,8/16/2015 8:04,8/16/2015
1,8/17/2015 8:50,8/17/2015 18:00
1,8/18/2015,8/18/2015 17:00
2,8/15/2015,8/15/2015 13:00
2,8/16/2015 8:00,8/16/2015 17:00
3,8/15/2015 4:00,8/15/2015 11:00
3,8/16/2015 9:00,8/16/2015 19:00
3,8/17/2015,8/17/2015 17:00", header = TRUE, sep = ",",
stringsAsFactors = FALSE) #set this option during import
DF$In[nchar(DF$In) < 13] <- paste(DF$In[nchar(DF$In) < 13], "0:00")
DF$Out[nchar(DF$Out) < 13] <- paste(DF$Out[nchar(DF$Out) < 13], "0:00")
DF$In <- as.POSIXct(DF$In, format = "%m/%d/%Y %H:%M", tz = "GMT")
DF$Out <- as.POSIXct(DF$Out, format = "%m/%d/%Y %H:%M", tz = "GMT")
# Id In Out
#1 1 2015-08-15 08:00:00 2015-08-15 17:00:00
#2 1 2015-08-16 08:04:00 2015-08-16 00:00:00
#3 1 2015-08-17 08:50:00 2015-08-17 18:00:00
#4 1 2015-08-18 00:00:00 2015-08-18 17:00:00
#5 2 2015-08-15 00:00:00 2015-08-15 13:00:00
#6 2 2015-08-16 08:00:00 2015-08-16 17:00:00
#7 3 2015-08-15 04:00:00 2015-08-15 11:00:00
#8 3 2015-08-16 09:00:00 2015-08-16 19:00:00
#9 3 2015-08-17 00:00:00 2015-08-17 17:00:00
na.omit doesn't work with POSIXlt objects because it is documented to "handle vectors, matrices and data frames comprising vectors and matrices (only)." (see help("na.omit")). And in the strict sense, POSIXlt objects are not vectors:
unclass(as.POSIXlt(DF$In))
#$sec
#[1] 0 0 0 0 0 0 0 0 0
#
#$min
#[1] 0 4 50 0 0 0 0 0 0
#
#$hour
#[1] 8 8 8 0 0 8 4 9 0
#
#$mday
#[1] 15 16 17 18 15 16 15 16 17
#
#$mon
#[1] 7 7 7 7 7 7 7 7 7
#
#$year
#[1] 115 115 115 115 115 115 115 115 115
#
#$wday
#[1] 6 0 1 2 6 0 6 0 1
#
#$yday
#[1] 226 227 228 229 226 227 226 227 228
#
#$isdst
#[1] 0 0 0 0 0 0 0 0 0
#
#attr(,"tzone")
#[1] "GMT"
There is hardly any reason to prefer POSIXlt over POSIXct (which internally is simply the number of seconds since the origin and thus needs less memory).
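To address the first question directly: if you parse the raw columns as POSIXct rather than POSIXlt, na.omit() behaves as expected on the data frame, because the failed (date-only) parses become ordinary NAs in atomic vectors. A minimal sketch, assuming you do want to drop those rows; the file name and the DF2 object are just placeholders:
DF2 <- read.csv("times.csv", na.strings = "", stringsAsFactors = FALSE)
DF2$In  <- as.POSIXct(DF2$In,  format = "%m/%d/%Y %H:%M", tz = "GMT")
DF2$Out <- as.POSIXct(DF2$Out, format = "%m/%d/%Y %H:%M", tz = "GMT")
DF2 <- na.omit(DF2)  # rows whose In or Out failed to parse are dropped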
You've been given a couple of strategies that bring these character values in and process them "in place". I almost never use as.POSIXlt since there are so many pitfalls in dealing with the list-in-list structures that it returns, especially considering its effective incompatibility with dataframes. Here's a method that does the testing and coercion at the read.* level by defining an 'as' method:
setOldClass("inTime", prototype="POSIXct")
setAs("character", "inTime",
function(from) structure( ifelse( is.na(as.POSIXct(from, format="%m/%d/%Y %H:%M") ),
as.POSIXct(from, format="%m/%d/%Y") ,
as.POSIXct(from, format="%m/%d/%Y %H:%M") ),
class="POSIXct" ) )
# txt holds the CSV text shown in the question
read.csv(text = txt, colClasses = c("numeric", "inTime", "inTime"))
Id In Out
1 1 2015-08-15 08:00:00 2015-08-15 17:00:00
2 1 2015-08-16 08:04:00 2015-08-16 00:00:00
3 1 2015-08-17 08:50:00 2015-08-17 18:00:00
4 1 2015-08-18 00:00:00 2015-08-18 17:00:00
5 2 2015-08-15 00:00:00 2015-08-15 13:00:00
6 2 2015-08-16 08:00:00 2015-08-16 17:00:00
7 3 2015-08-15 04:00:00 2015-08-15 11:00:00
8 3 2015-08-16 09:00:00 2015-08-16 19:00:00
9 3 2015-08-17 00:00:00 2015-08-17 17:00:00
The structure "envelope" is needed because of the rather strange behavior of ifelse, which otherwise would return a numeric object rather than an object of class-'POSIXct'.