mutate_at column names error - r

I'm trying to apply the anytime() function from the anytime package in a dplyr chain to all columns ending with "Date".
However, I'm getting this error:
Error: Unsupported Type
when I use
invoicePayment <- head(raw.InvoicePayment) %>%
  mutate_at(ends_with("Date"), funs(anytime))
but it's fine when I use
invoicePayment <- head(raw.InvoicePayment) %>%
  select(ends_with("Date")) %>%
  mutate_at(ends_with("Date"), funs(anytime))
Any help is appreciated. Thanks!

We need to wrap the selection helper in vars(), since mutate_at() expects its column selection inside vars():
library(anytime)
library(dplyr)
df1 %>%
  mutate_at(vars(ends_with("Date")), anytime)
#  col1           col2_Date           col3_Date
#1    1 2017-06-07 05:30:00 2017-06-07 05:30:00
#2    2 2017-06-08 05:30:00 2017-06-06 05:30:00
#3    3 2017-06-09 05:30:00 2017-06-05 05:30:00
#4    4 2017-06-10 05:30:00 2017-06-04 05:30:00
#5    5 2017-06-11 05:30:00 2017-06-03 05:30:00
data
df1 <- data.frame(col1 = 1:5, col2_Date = Sys.Date() + 0:4, col3_Date = Sys.Date() - 0:4)
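As a side note (assuming dplyr >= 1.0, where mutate_at() is superseded), the same can be written with across(), which takes the selection helper directly:
library(anytime)
library(dplyr)
df1 %>%
  mutate(across(ends_with("Date"), anytime))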

Related

Converting Time Range into Readable format in R

I am having trouble converting a time range in a column to a readable date format for R. How would I go about converting this?
[1] "05:30P -08:00P" "07:00A -09:35A" "08:00A -10:30A" "08:55P -11:00P" "06:00P -06:30P"
c("05:30P -08:00P", "07:00A -09:35A", "08:00A -10:30A", "08:55P -11:00P",
"06:00P -06:30P")
If we want to convert to datetime, an option is to split at the - into two columns and then use as.POSIXct to do the conversion:
library(stringr)
library(dplyr)
library(tidyr)
str_replace_all(str1, "([AP])", "\\1M") %>%
  tibble(str1 = .) %>%
  separate(str1, into = c('start', 'end'), sep = "\\s*-") %>%
  mutate(across(c(start, end), ~ as.POSIXct(., format = '%I:%M %p')))
# A tibble: 5 x 2
#  start               end
#  <dttm>              <dttm>
#1 2020-08-19 17:30:00 2020-08-19 20:00:00
#2 2020-08-19 07:00:00 2020-08-19 09:35:00
#3 2020-08-19 08:00:00 2020-08-19 10:30:00
#4 2020-08-19 20:55:00 2020-08-19 23:00:00
#5 2020-08-19 18:00:00 2020-08-19 18:30:00
Or using lubridate
library(lubridate)
str_replace_all(str1, "([AP])", "\\1M") %>%
  tibble(str1 = .) %>%
  separate(str1, into = c('start', 'end'), sep = "\\s*-") %>%
  mutate(across(c(start, end), ~ parse_date_time(., 'IMp')))
data
str1 <- c("05:30P -08:00P", "07:00A -09:35A", "08:00A -10:30A", "08:55P -11:00P",
"06:00P -06:30P")
A base R attempt, using strcapture to separate the timestamps into two parts:
dr <- c("05:30P -08:00P", "07:00A -09:35A", "08:00A -10:30A", "08:55P -11:00P",
"06:00P -06:30P")
tms <- strcapture(r"((\d+:\d+[AP])[- ]+(\d+:\d+[AP]))", dr, proto=list(start="",end=""))
tms[] <- lapply(tms, function(x) as.POSIXct(paste0(x, "M"), format="%I:%M%p", tz="UTC"))
#                start                 end
#1 2020-08-20 17:30:00 2020-08-20 20:00:00
#2 2020-08-20 07:00:00 2020-08-20 09:35:00
#3 2020-08-20 08:00:00 2020-08-20 10:30:00
#4 2020-08-20 20:55:00 2020-08-20 23:00:00
#5 2020-08-20 18:00:00 2020-08-20 18:30:00
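As a small follow-up (not part of the original answers): once both endpoints are POSIXct, the interval lengths fall out directly with difftime():
tms$duration <- difftime(tms$end, tms$start, units = "mins")
#Time differences in mins
#[1] 150 155 150 125  30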

dplyr::mutate_at iterate through columns in function

require(dplyr)
df <- data.frame(Date.time = c("2015-01-01 00:00:00", "2015-01-01 00:30:00",
                               "2015-01-01 01:00:00", "2015-01-01 01:30:00",
                               "2015-01-01 02:00:00"),
                 RH33HMP = c(99.6, 99.6, 99.5, 99.3, 98.63),
                 RH33HMP_f = c(9, 9, 92, 93, 9),
                 RH38HMP = c(99.6, 99.6, 99.5, 99.3, 98.63),
                 RH38HMP_f = c(9, 902, 9, 9, 91))
Here is an example data.frame. I'd like to set every value to NA where the corresponding quality column (_f) contains something other than 9. First, I grep the column numbers of the actual measurement columns:
col_var <- grep("^Date.|_f$", names(df), invert = T)
Then I use dplyr and mutate_at with an ifelse function. My problem is that mutate_at iterates through all the columns of col_var, but the function itself does not. I tried several examples that I found on Stack Overflow, but none of them seem to work.
# does not work
df_qc <- df %>%
  mutate_at(.vars = col_var,
            .funs = list(~ ifelse(df[, col_var + 1] == 9, ., NA)))

i <- 1
df_qc <- df %>%
  mutate_at(.vars = col_var,
            .funs = list(~ ifelse(df[, i + 1] == 9, ., NA)))
I think I am quite close, any help appreciated.
We can use Map:
df[col_var] <- Map(function(x, y) {y[x != 9] <- NA; y}, df[col_var + 1], df[col_var])
df
#            Date.time RH33HMP RH33HMP_f RH38HMP RH38HMP_f
#1 2015-01-01 00:00:00   99.60         9    99.6         9
#2 2015-01-01 00:30:00   99.60         9      NA       902
#3 2015-01-01 01:00:00      NA        92    99.5         9
#4 2015-01-01 01:30:00      NA        93    99.3         9
#5 2015-01-01 02:00:00   98.63         9      NA        91
Similarly, you can use map2 from purrr if you prefer the tidyverse.
df[col_var] <- purrr::map2(df[col_var + 1], df[col_var], ~ {.y[.x != 9] <- NA; .y})
One dplyr and purrr option could be:
map2_dfr(.x = df %>% select(ends_with("HMP")),
         .y = df %>% select(ends_with("_f")),
         ~ replace(.x, .y != 9, NA)) %>%
  bind_cols(df %>% select(-ends_with("HMP")))
  RH33HMP RH38HMP Date.time           RH33HMP_f RH38HMP_f
    <dbl>   <dbl> <fct>                   <dbl>     <dbl>
1    99.6    99.6 2015-01-01 00:00:00         9         9
2    99.6      NA 2015-01-01 00:30:00         9       902
3      NA    99.5 2015-01-01 01:00:00        92         9
4      NA    99.3 2015-01-01 01:30:00        93         9
5    98.6      NA 2015-01-01 02:00:00         9        91
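For completeness, a hedged sketch in current dplyr (>= 1.0): across() with cur_column() can look up each measurement column's matching _f flag by name, reproducing the Map() result:
library(dplyr)
df %>%
  mutate(across(ends_with("HMP"),
                # e.g. for RH33HMP, fetch RH33HMP_f from the data mask
                ~ replace(.x, get(paste0(cur_column(), "_f")) != 9, NA)))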

Create a time interval of 15 minutes from minutely data in R?

I have some data which is formatted in the following way:
time count
00:00 17
00:01 62
00:02 41
So I have times from 00:00 to 23:59 with a count per minute. I'd like to group the data in intervals of 15 minutes, such that:
time count
00:00-00:15 148
00:16-00:30 284
I have tried to do it manually, but this is exhausting, so I am sure there has to be a function or something to do it easily; I just haven't figured out how.
I'd really appreciate some help! Thank you very much!
For data that's in POSIXct format, you can use the cut function to create 15-minute groupings, and then aggregate by those groups. The code below shows how to do this in base R and with the dplyr and data.table packages.
First, create some fake data:
set.seed(4984)
dat = data.frame(time = seq(as.POSIXct("2016-05-01"), as.POSIXct("2016-05-01") + 60*99, by = 60),
                 count = sample(1:50, 100, replace = TRUE))
Base R
cut the data into 15 minute groups:
dat$by15 = cut(dat$time, breaks="15 min")
                   time count                by15
1   2016-05-01 00:00:00    22 2016-05-01 00:00:00
2   2016-05-01 00:01:00    11 2016-05-01 00:00:00
3   2016-05-01 00:02:00    31 2016-05-01 00:00:00
...
98  2016-05-01 01:37:00    20 2016-05-01 01:30:00
99  2016-05-01 01:38:00    29 2016-05-01 01:30:00
100 2016-05-01 01:39:00    37 2016-05-01 01:30:00
Now aggregate by the new grouping column, using sum as the aggregation function:
dat.summary = aggregate(count ~ by15, FUN=sum, data=dat)
                 by15 count
1 2016-05-01 00:00:00   312
2 2016-05-01 00:15:00   395
3 2016-05-01 00:30:00   341
4 2016-05-01 00:45:00   318
5 2016-05-01 01:00:00   349
6 2016-05-01 01:15:00   397
7 2016-05-01 01:30:00   341
dplyr
library(dplyr)
dat.summary = dat %>%
  group_by(by15 = cut(time, "15 min")) %>%
  summarise(count = sum(count))
data.table
library(data.table)
dat.summary = setDT(dat)[ , list(count=sum(count)), by=cut(time, "15 min")]
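A further option (not from the original answer; it assumes the lubridate package is available) is to build the grouping column with floor_date() instead of cut():
library(dplyr)
library(lubridate)
dat.summary = dat %>%
  group_by(by15 = floor_date(time, "15 minutes")) %>%
  summarise(count = sum(count))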
UPDATE: To answer the comment, for this case the end point of each grouping interval is as.POSIXct(as.character(dat$by15)) + 60*15 - 1. In other words, the endpoint of the grouping interval is 15 minutes minus one second from the start of the interval. We add 60*15 - 1 because POSIXct is denominated in seconds. The as.POSIXct(as.character(...)) is because cut returns a factor and this just converts it back to date-time so that we can do math on it.
If you want the end point to be the nearest minute before the next interval (instead of the nearest second), you could use as.POSIXct(as.character(dat$by15)) + 60*14.
If you don't know the break interval, for example, because you chose the number of breaks and let R pick the interval, you could find the number of seconds to add by doing max(unique(diff(as.POSIXct(as.character(dat$by15))))) - 1.
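For instance, a minimal sketch of that endpoint arithmetic on the dat from above:
# cut() returns a factor, so convert back to POSIXct before doing math on it
int_start <- as.POSIXct(as.character(dat$by15))
int_end <- int_start + 60*15 - 1 # one second before the next interval starts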
The cut approach is handy but slow with large data frames. The following approach is approximately 1,000x faster than the cut approach (tested with 400k records).
# Function: Truncate (floor) POSIXct to time interval (specified in seconds)
# Author: Stephen McDaniel # PowerTrip Analytics
# Date : 2017MAY
# Copyright: (C) 2017 by Freakalytics, LLC
# License: MIT
floor_datetime <- function(date_var, floor_seconds = 60,
                           origin = "1970-01-01") { # defaults to minute rounding
  if (!is(date_var, "POSIXct")) stop("Please pass in a POSIXct variable")
  # NAs propagate through the arithmetic, so no scalar is.na() branch is needed
  # (the original if (is.na(date_var)) check errors on vector input in R >= 4.2)
  as.POSIXct(floor(as.numeric(date_var) / floor_seconds) * floor_seconds,
             origin = origin)
}
Sample output:
test <- data.frame(good = as.POSIXct(Sys.time()),
                   bad1 = as.Date(Sys.time()),
                   bad2 = as.POSIXct(NA))
test$good_15 <- floor_datetime(test$good, 15 * 60)
test$bad1_15 <- floor_datetime(test$bad1, 15 * 60)
Error in floor_datetime(test$bad1, 15 * 60) :
  Please pass in a POSIXct variable
test$bad2_15 <- floor_datetime(test$bad2, 15 * 60)
test
                    good       bad1 bad2             good_15 bad2_15
1 2017-05-06 13:55:34.48 2017-05-06 <NA> 2017-05-06 13:45:00    <NA>
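A hedged sketch of plugging floor_datetime() into the original 15-minute aggregation (reusing the dat example from the cut() section above):
# floor_datetime() works vectorised over the whole column
dat$by15 <- floor_datetime(dat$time, 15 * 60)
dat.summary <- aggregate(count ~ by15, FUN = sum, data = dat)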
You can do it in one line by using the trs() function from the foqat package, like:
df_15mins=trs(df, "15 mins")
Below is a reproducible example:
library(foqat)
head(aqi[,c(1,2)])
# Time NO
#1 2017-05-01 01:00:00 0.0376578
#2 2017-05-01 01:01:00 0.0341483
#3 2017-05-01 01:02:00 0.0310285
#4 2017-05-01 01:03:00 0.0357016
#5 2017-05-01 01:04:00 0.0337507
#6 2017-05-01 01:05:00 0.0238120
#mean
aqi_15mins=trs(aqi[,c(1,2)], "15 mins")
head(aqi_15mins)
# Time NO
#1 2017-05-01 01:00:00 0.02736549
#2 2017-05-01 01:15:00 0.03244958
#3 2017-05-01 01:30:00 0.03743626
#4 2017-05-01 01:45:00 0.02769419
#5 2017-05-01 02:00:00 0.02901817
#6 2017-05-01 02:15:00 0.03439455

R time series missing values

I was working with a time series dataset of hourly data. The data contained a few missing values, so I tried to create a dataframe (time_seq) with the correct time sequence and merge it with the original data so the missing values become NA.
> data
date value
7980 2015-03-30 20:00:00 78389
7981 2015-03-30 21:00:00 72622
7982 2015-03-30 22:00:00 65240
7983 2015-03-30 23:00:00 47795
7984 2015-03-31 08:00:00 37455
7985 2015-03-31 09:00:00 70695
7986 2015-03-31 10:00:00 68444
# converting the date in the data to POSIXct format
> data$date <- format.POSIXct(data$date,'%Y-%m-%d %H:%M:%S')
# creating a dataframe with the correct sequence of dates
> time_seq <- seq(from = as.POSIXct("2014-05-01 00:00:00"),
to = as.POSIXct("2015-04-30 23:00:00"), by = "hour")
> df <- data.frame(date=time_seq)
> df
date
8013 2015-03-30 20:00:00
8014 2015-03-30 21:00:00
8015 2015-03-30 22:00:00
8016 2015-03-30 23:00:00
8017 2015-03-31 00:00:00
8018 2015-03-31 01:00:00
8019 2015-03-31 02:00:00
8020 2015-03-31 03:00:00
8021 2015-03-31 04:00:00
8022 2015-03-31 05:00:00
8023 2015-03-31 06:00:00
8024 2015-03-31 07:00:00
# merging with the original data
> a <- merge(data,df, x.by = data$date, y.by = df$date ,all=TRUE)
> a
date value
4005 2014-07-23 07:00:00 37003
4006 2014-07-23 07:30:00 NA
4007 2014-07-23 08:00:00 37216
4008 2014-07-23 08:30:00 NA
The values I get after merging are incorrect, and they contain half-hourly values. What would be the correct approach for solving this?
Why is the merge result in 30-minute intervals when both my data frames are hourly?
PS: I looked into this question: Fastest way for filling-in missing dates for data.table and followed the steps, but it didn't help.
You can use the padr package to solve this problem.
library(padr)
library(dplyr) #for the pipe operator
data %>%
  pad() %>%
  fill_by_value()
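As a base R alternative, a hedged sketch: the half-hourly rows likely appear because format.POSIXct() turned data$date into character while df$date stayed POSIXct, so merge() cannot line the two columns up. Keeping both sides as POSIXct and merging by column name avoids that (tz = "UTC" is an assumption here):
data$date <- as.POSIXct(data$date, tz = "UTC") # keep POSIXct; don't format() it
time_seq <- seq(from = as.POSIXct("2014-05-01 00:00:00", tz = "UTC"),
                to = as.POSIXct("2015-04-30 23:00:00", tz = "UTC"), by = "hour")
a <- merge(data, data.frame(date = time_seq), by = "date", all = TRUE)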

Fill missing sequence values with dplyr

I have a data frame with missing values for "SNAP_ID". I'd like to fill in the missing values with floating point values based on a sequence from the previous non-missing value (lag()?). I would really like to achieve this using just dplyr if possible.
Assumptions:
There will never be missing data in the first or last row; I'm generating the missing dates based on missing days between the min and max of the data set.
There can be multiple gaps in the data set
Current data:
end SNAP_ID
1 2015-06-26 12:59:00 365
2 2015-06-26 13:59:00 366
3 2015-06-27 00:01:00 NA
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367
8 2015-06-29 09:59:00 368
What I want to achieve:
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 366.2
5 2015-06-28 00:01:00 366.3
6 2015-06-28 23:00:00 366.4
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
As a data frame:
df <- structure(list(end = structure(c(1435323540, 1435327140, 1435363260,
1435446000, 1435449660, 1435532400, 1435568400, 1435571940), tzone = "UTC", class = c("POSIXct",
"POSIXt")), SNAP_ID = c(365, 366, NA, NA, NA, NA, 367, 368)), .Names = c("end",
"SNAP_ID"), row.names = c(NA, -8L), class = "data.frame")
This was my attempt at achieving this goal, but it only works for the first missing value:
df %>%
  arrange(end) %>%
  mutate(SNAP_ID = ifelse(is.na(SNAP_ID), lag(SNAP_ID) + 0.1, SNAP_ID))
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
The outstanding answer from @mathematical.coffee below:
df %>%
  arrange(end) %>%
  group_by(tmp = cumsum(!is.na(SNAP_ID))) %>%
  mutate(SNAP_ID = SNAP_ID[1] + 0.1*(0:(length(SNAP_ID) - 1))) %>%
  ungroup() %>%
  select(-tmp)
EDIT: new version works for any number of NA runs.
This one doesn't need zoo, either.
First, notice that tmp = cumsum(!is.na(SNAP_ID)) groups the SNAP_IDs such that each group with the same tmp consists of one non-NA value followed by a run of NA values.
Then group by this variable and just add .1 to the first SNAP_ID to fill out the NAs:
df %>%
  arrange(end) %>%
  group_by(tmp = cumsum(!is.na(SNAP_ID))) %>%
  mutate(SNAP_ID = SNAP_ID[1] + 0.1*(0:(length(SNAP_ID) - 1)))
                  end SNAP_ID tmp
1 2015-06-26 12:59:00   365.0   1
2 2015-06-26 13:59:00   366.0   2
3 2015-06-27 00:01:00   366.1   2
4 2015-06-27 23:00:00   366.2   2
5 2015-06-28 00:01:00   366.3   2
6 2015-06-28 23:00:00   366.4   2
7 2015-06-29 09:00:00   367.0   3
8 2015-06-29 09:59:00   368.0   4
Then you can drop the tmp column afterwards (add %>% select(-tmp) to the end).
EDIT: this is the old version, which doesn't work for subsequent runs of NAs: cumsum(is.na(SNAP_ID)) counts all NAs seen so far, so the increments never reset and later runs end up offset by the NAs of earlier runs.
If your aim is to fill each NA with the previous value + 0.1, you can use zoo's na.locf (which fills each NA with the previous non-NA value), along with cumsum(is.na(SNAP_ID))*0.1 to add the extra 0.1.
library(zoo)
df %>%
  arrange(end) %>%
  mutate(SNAP_ID = ifelse(is.na(SNAP_ID),
                          na.locf(SNAP_ID) + cumsum(is.na(SNAP_ID))*0.1,
                          SNAP_ID))
