I have the chunk of code below, where I am trying to fill in the missing minutes in my data df_stuff by joining it to a time series that contains every minute of an entire year. I would actually like to aggregate this data at 15-minute intervals instead of by minute. Does anyone know a simple way of doing this? I was looking at to.minutes15 from the xts package, but it seems to have problems with my POSIXct-format time series.
Code:
library("sqldf")
##Filling Gaps in time by minute
myTZ <- "America/Los_Angeles"
tseries <- seq(as.POSIXct("2015-01-01 00:00:00", tz=myTZ),
as.POSIXct("2015-12-31 23:59:00", tz=myTZ), by="min")
df2 <- data.frame(SeqDateTime=tseries)
finaldf <- sqldf("select df2.SeqDateTime,
median(df_stuff.brooms) as broomsTot
from df2
left outer join df_stuff on df2.SeqDateTime = df_stuff.broomTime
group by df2.SeqDateTime
order by df2.SeqDateTime asc")
Data:
df_stuff <- structure(list(brooms = c(27, 53, 10, 55, 14, 49, 26,
13, 12, NA, NA, 23, 28, 31, NA, 46, NA, 13, NA, 33, 12, 4, 28,
34, 0, 24, 7, 31, 33, 37, 56, 41, 50, 55, 41, 15, 23, 26, 14,
27, 22, 41, 48, 19, 28, 11, 11, NA, 49, NA), broomTime = structure(c(1423970100,
1424122200, 1424136180, 1424035260, 1424141580, 1424122440, 1423274580,
1424129580, 1424146320, 1429129320, 1429032060, 1429142940, 1428705000,
1429142460, 1429128720, 1429204560, 1422909480, 1424137200, 1424042100,
1424149620, 1424131920, 1424108940, 1424144820, 1424040600, 1424119620,
1424148660, 1443593040, 1443657120, 1424125860, 1424223120, 1424235240,
1424232720, 1424234940, 1424234640, 1424230440, 1424115300, 1429208280,
1429131720, 1429148460, 1429151040, 1424129760, 1424125380, 1424123220,
1424137380, 1424115780, 1424219340, 1424131560, 1424233560, 1424224920,
1443640800), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("brooms",
"broomTime"), row.names = c(NA, 50L), class = "data.frame")
You can summarize over any time interval by using cut() inside dplyr's group_by().
library(dplyr)

ans <- finaldf %>%
  group_by(SeqDateTime = cut(SeqDateTime, breaks = "15 min")) %>%
  summarize(broomsTot = sum(as.numeric(broomsTot), na.rm = TRUE))
head(ans)
Source: local data frame [6 x 2]
SeqDateTime broomsTot
(fctr) (dbl)
1 2015-01-01 02:00:00 0
2 2015-01-01 02:15:00 0
3 2015-01-01 02:30:00 0
4 2015-01-01 02:45:00 0
5 2015-01-01 03:00:00 0
6 2015-01-01 03:15:00 0
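Note that this sums the per-minute medians within each 15-minute bin. If you instead want the 15-minute median of the per-minute values, the same pattern should work with median() swapped in for sum(); a sketch (ans_med is just an illustrative name):

ans_med <- finaldf %>%
  group_by(SeqDateTime = cut(SeqDateTime, breaks = "15 min")) %>%
  summarize(broomsTot = median(as.numeric(broomsTot), na.rm = TRUE))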
I can assure you that xts does not have a problem with your POSIXct time series; xts uses POSIXct for its internal time index.
Here's how to join df_stuff with a 1-minute series and then aggregate that result to a 15-minute series.
library(xts)
# create an xts object from the data
xts_stuff <- with(df_stuff, xts(brooms, broomTime))
# merge with an empty xts object that contains a regular 1-minute index
xts_stuff_1min <- merge(xts_stuff, xts(, tseries))
# aggregate to 15 minutes
ep15 <- endpoints(xts_stuff_1min, "minutes", 15)
final_df <- period.apply(xts_stuff_1min, ep15, median, na.rm = TRUE)
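If you need a plain data.frame again at the end, one way (a sketch; final_df2 and the column names are just suggestions) is to pull the index and values back out of the xts object:

final_df2 <- data.frame(SeqDateTime = index(final_df),
                        broomsTot = as.numeric(final_df))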
I have a df with some dates and I would like to filter the dates to show only the current month and 12 months ahead.
This is my df (see the dput below):
For each date in the Data column, I would like to keep the rows whose DataReferencia falls in the current month or 12 months ahead, and then subtract the values in the Valor column. For example, for the day 2003-01-17 those would be the DataReferencia dates 2003-01-01 and 2003-12-01. The df runs from 2003-01 to 2020-12.
I tried this code, but returns an empty df:
library(dplyr)
library(lubridate)
test %>%
  filter(year(DataReferencia) == Data.Ano & month(DataReferencia) == Data.Mes + 11,
         month(DataReferencia) == Data.Mes)
My dput:
structure(list(Instituicao = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1), Data = structure(c(12069, 12069, 12069,
12069, 12069, 12069, 12069, 12069, 12069, 12069, 12069, 12069,
12070, 12070, 12070, 12070, 12070), class = "Date"), DataReferencia = structure(c(12053,
12084, 12112, 12143, 12173, 12204, 12234, 12265, 12296, 12326,
12357, 12387, 12053, 12084, 12112, 12143, 12173), class = "Date"),
Valor = c(26, 24, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
26, 24, 22, 22, 22), DataReuniao = structure(c(12073, 12073,
12073, 12073, 12073, 12073, 12073, 12073, 12073, 12073, 12073,
12073, 12073, 12073, 12073, 12073, 12073), class = "Date"),
Reuniao = c(80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80,
80, 80, 80, 80, 80), MetaSelic = c(25.5, 25.5, 25.5, 25.5,
25.5, 25.5, 25.5, 25.5, 25.5, 25.5, 25.5, 25.5, 25.5, 25.5,
25.5, 25.5, 25.5)), row.names = c(NA, 17L), class = "data.frame")
If I understand your question correctly, you want to filter() for rows where the year and month are the same in Data and DataReferencia, or where the date in DataReferencia is 11 months ahead of Data. I'm not sure what Data.Ano and Data.Mes are in your failed code, or whether they are translated column names.
This code will do the job:
test %>%
  filter(
    format(DataReferencia, format = '%Y-%m') == format(Data, format = '%Y-%m') |
      format(DataReferencia, format = '%Y-%m') == format(Data + months(11), format = '%Y-%m')
  )
# Instituicao Data DataReferencia Valor DataReuniao Reuniao MetaSelic
# 1 1 2003-01-17 2003-01-01 26 2003-01-21 80 25.5
# 2 1 2003-01-17 2003-12-01 22 2003-01-21 80 25.5
# 3 1 2003-01-18 2003-01-01 26 2003-01-21 80 25.5
We use format() to represent the date columns in year-month form; we specify this with format = '%Y-%m', using the symbols and abbreviations explained here: %Y means the (4-digit) year and %m the (2-digit) month. Because Data is still an R-recognised Date at that point, we can add 11 months to it (with lubridate's months()) inside the second condition of filter().
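A quick illustration of those two building blocks (the returned values are shown as comments):

library(lubridate)

d <- as.Date("2003-01-17")
format(d, format = "%Y-%m")               # "2003-01"
format(d + months(11), format = "%Y-%m")  # "2003-12"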
The idea is to get the row sums over the columns whose names fall between 01/01/2021 and 01/08/2021:
library(lubridate)  # needed for the %m-% operator

# define range parameters {start-end}
first_date  <- format(Sys.Date(), "01/01/%Y")
actual_date <- format(Sys.Date() %m-% months(1), "01/%m/%Y")

# get the sum of the rows between first_date and actual_date
df$ytd <- rowSums(df[as.character(seq(first_date, actual_date))])
However, when applied, the following error arises:
Error in seq.default(first_date, to_date) :
'from' must be a finite number
The expected output is a new column with the sum of each row over the specified range.
data
df <- structure(list(country = c("Mexico", "Mexico", "Mexico", "Mexico"
), `01/01/2021` = c(12, 23, 13, 12), `01/02/2021` = c(12, 23,
13, 12), `01/03/2021` = c(12, 23, 13, 12), `01/04/2021` = c(12,
23, 13, 12), `01/05/2021` = c(12, 23, 13, 12), `01/06/2021` = c(12,
23, 13, 12), `01/07/2021` = c(12, 23, 13, 12), `01/08/2021` = c(12,
23, 13, 12), `01/09/2021` = c(12, 23, 13, 12), `01/10/2021` = c(12,
23, 13, 12), `01/11/2021` = c(12, 23, 13, 12), `01/12/2021` = c(12,
23, 13, 12)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
How could I properly apply a function to get this output?
format() and seq() don't work together here, i.e. seq() expects a Date class whereas format() returns a character class. Instead, make use of the range operator (:) inside across() or select():
library(dplyr)
out <- df %>%
  mutate(ytd = rowSums(across(all_of(first_date):all_of(actual_date))))
Output:
> out$ytd
[1] 96 184 104 96
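The select() variant mentioned above works the same way; a sketch, assuming the magrittr pipe so that . refers to df (out2 is just an illustrative name):

out2 <- df %>%
  mutate(ytd = rowSums(select(., all_of(first_date):all_of(actual_date))))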
A base R approach using match():
df$ytd <- rowSums(df[match(first_date, names(df)):match(actual_date, names(df))])
df$ytd
#[1] 96 184 104 96
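If you do want a seq()-based solution, the error disappears once the character dates are converted to Date before building the sequence and converted back afterwards; a sketch, assuming the dd/mm/yyyy column names from the dput:

dates <- seq(as.Date(first_date, "%d/%m/%Y"),
             as.Date(actual_date, "%d/%m/%Y"),
             by = "month")
df$ytd <- rowSums(df[format(dates, "%d/%m/%Y")])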
I have a dataset like this:
data <- data.frame(Time = c(1,4,6,9,11,13,16, 25, 32, 65),
A = c(10, NA, 13, 2, 32, 19, 32, 34, 93, 12),
B = c(1, 99, 32, 31, 12, 13, NA, 13, NA, NA),
C = c(2, 32, NA, NA, NA, NA, NA, NA, NA, NA))
What I want to retrieve are the values in Time that corresponds to the last numerical value in A, B, and C.
For example, the last numerical values for A, B, and C are 12, 13, and 32 respectively.
So, the Time values that correspond are 65, 25, and 4.
I've tried something like data[which(data$Time == max(data$A)), ], but this doesn't work.
We can multiply the row index by the logical matrix and take the column maxima with colMaxs() (from matrixStats) to subset the 'Time' column:
library(matrixStats)
data$Time[colMaxs((!is.na(data[-1])) * row(data[-1]))]
#[1] 65 25 4
Or, using base R, we get the indices with which(..., arr.ind = TRUE), find the maximum row index per column with a group-by operation (tapply), and use that to extract the 'Time' values:
m1 <- which(!is.na(data[-1]), arr.ind = TRUE)
data$Time[tapply(m1[,1], m1[,2], FUN = max)]
#[1] 65 25 4
Or with summarise/across in the devel version of dplyr
library(dplyr)
data %>%
summarise(across(A:C, ~ tail(Time[!is.na(.)], 1)))
# A B C
#1 65 25 4
Or using summarise_at with the current version of dplyr
data %>%
summarise_at(vars(A:C), ~ tail(Time[!is.na(.)], 1))
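A compact base R equivalent of the same idea, as a sketch: for each measurement column, take the position of its last non-NA value and look up the matching Time.

sapply(data[-1], function(x) data$Time[max(which(!is.na(x)))])
# A  B  C
#65 25  4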
I have time series data for 27 days (from 2018-04-09 to 2018-05-15, without weekends), with 7 observations per day (08:00 to 20:00, every two hours) and two variables per observation (di and eu).
I want to plot all days as line plots in one plot.
I found solutions to plot one plot per day with a ggplot facet plot, and solutions to plot the whole time series in one plot (di and eu from 2018-04-09 to 2018-05-15).
But nothing that lets me overlay the 27 daily plots for one variable in a single 08:00-to-20:00 plot.
The first three days as example data with dput():
structure(list(date_time = structure(c(1523260800, 1523268000,
1523275200, 1523282400, 1523289600, 1523296800, 1523304000, 1523347200,
1523354400, 1523361600, 1523368800, 1523376000, 1523383200, 1523390400,
1523433600, 1523440800, 1523448000, 1523455200, 1523462400, 1523469600,
1523476800), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
di = c(75, 90, 35, 70, 75, 15, 5, 65, 55, 15, 15, 0, NA,
15, 55, 55, 5, 25, NA, 60, NA), eu = c(15, 0, 65, 30, 15,
65, 70, 40, 45, 75, 75, 100, NA, 85, 45, 30, 90, 65, NA,
20, NA)), row.names = c(NA, -21L), class = c("tbl_df", "tbl",
"data.frame"))
A plot with all 27 days in one plot may look confusing, but I would like to try it, to see whether it makes a trend in the data obvious. A plot for each weekday would be a nice addition.
You could determine the day and hour up front and then plot with respective groups like this:
library(tidyverse)
library(lubridate)
df %>%
  gather(metric, value, -date_time) %>%
  mutate(
    hour_of_day = hour(date_time),
    day = as.Date(date_time)  # use the calendar date, so e.g. April 10 and May 10 are not joined into one line
  ) %>%
  ggplot(aes(x = hour_of_day, y = value)) +
  geom_line(aes(group = day)) +
  facet_wrap(~ metric)
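For the weekday plots mentioned in the question, a sketch that additionally facets by day of week using lubridate's wday():

df %>%
  gather(metric, value, -date_time) %>%
  mutate(
    hour_of_day = hour(date_time),
    day = as.Date(date_time),
    weekday = wday(date_time, label = TRUE)
  ) %>%
  ggplot(aes(x = hour_of_day, y = value, group = day)) +
  geom_line() +
  facet_grid(weekday ~ metric)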
I have started using data.table over the past week and am facing an issue. I have already looked at the solutions here and here, but I am not entirely sure how they help in my situation.
Here, is my sample data.
> dput(dt)
structure(list(link = c(1L, 1L, 1L, 1L, 1L, 1L), id = c(8395, 8738, 9788, 9789, 9908, 9920), person = c(2937837, 3092435, 3511555, 3511555, 3568112, 3575082), seqid = c(11, 14, 9, 1, 7, 10), time = c(NA, NA, 25372, 50700, NA, NA), max = c(14, 31, 9, 7, 8, 11), hr = c(NA, NA, 7, 14, NA, NA), minhr = c(11, 19, 7, 14, 7, 16), maxhr = c(11, 19, 7, 14, 7, 16), TRAVELTIME0.1avg = c(59, 59, 59, 59, 59, 59 ), TRAVELTIME1.2avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME2.3avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME3.4avg = c(59.2079086331819, 59.2079086331819, 59.2079086331819, 59.2079086331819, 59.2079086331819, 59.2079086331819 ), TRAVELTIME4.5avg = c(59.9182362587214, 59.9182362587214, 59.9182362587214, 59.9182362587214, 59.9182362587214, 59.9182362587214), TRAVELTIME5.6avg = c(60.4905040124798, 60.4905040124798, 60.4905040124798, 60.4905040124798, 60.4905040124798, 60.4905040124798), TRAVELTIME6.7avg = c(59.2897529410742, 59.2897529410742, 59.2897529410742, 59.2897529410742, 59.2897529410742, 59.2897529410742 ), TRAVELTIME7.8avg = c(59.2717176535874, 59.2717176535874, 59.2717176535874, 59.2717176535874, 59.2717176535874, 59.2717176535874), TRAVELTIME8.9avg = c(59.2569737174023, 59.2569737174023, 59.2569737174023, 59.2569737174023, 59.2569737174023, 59.2569737174023), TRAVELTIME9.10avg = c(59.2814811928216, 59.2814811928216, 59.2814811928216, 59.2814811928216, 59.2814811928216, 59.2814811928216 ), TRAVELTIME10.11avg = c(59.2084537775537, 59.2084537775537, 59.2084537775537, 59.2084537775537, 59.2084537775537, 59.2084537775537 ), TRAVELTIME11.12avg = c(59.0915653550983, 59.0915653550983, 59.0915653550983, 59.0915653550983, 59.0915653550983, 59.0915653550983 ), TRAVELTIME12.13avg = c(59.6765035434587, 59.6765035434587, 59.6765035434587, 59.6765035434587, 59.6765035434587, 59.6765035434587 ), TRAVELTIME13.14avg = c(59.246760177185, 59.246760177185, 59.246760177185, 59.246760177185, 59.246760177185, 59.246760177185), TRAVELTIME14.15avg = c(59.4095339982924, 59.4095339982924, 59.4095339982924, 59.4095339982924, 59.4095339982924, 59.4095339982924), TRAVELTIME15.16avg = c(59.5347570536373, 59.5347570536373, 59.5347570536373, 59.5347570536373, 59.5347570536373, 59.5347570536373 ), TRAVELTIME16.17avg = c(59.3799872977671, 59.3799872977671, 59.3799872977671, 59.3799872977671, 59.3799872977671, 59.3799872977671 ), TRAVELTIME17.18avg = c(59.1915498629857, 59.1915498629857, 59.1915498629857, 59.1915498629857, 59.1915498629857, 59.1915498629857 ), TRAVELTIME18.19avg = c(59.1663574471712, 59.1663574471712, 59.1663574471712, 59.1663574471712, 59.1663574471712, 59.1663574471712 ), TRAVELTIME19.20avg = c(59.0217772215269, 59.0217772215269, 59.0217772215269, 59.0217772215269, 59.0217772215269, 59.0217772215269 ), TRAVELTIME20.21avg = c(59.0893371757925, 59.0893371757925, 59.0893371757925, 59.0893371757925, 59.0893371757925, 59.0893371757925 ), TRAVELTIME21.22avg = c(59.0272727272727, 59.0272727272727, 59.0272727272727, 59.0272727272727, 59.0272727272727, 59.0272727272727 ), TRAVELTIME22.23avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME23.24avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME24.25avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME25.26avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME26.27avg = c(59, 59, 59, 59, 59, 59)), .Names = c("link", "id", "person", "seqid", "time", "max", "hr", "minhr", "maxhr", "TRAVELTIME0.1avg", "TRAVELTIME1.2avg", "TRAVELTIME2.3avg", "TRAVELTIME3.4avg", "TRAVELTIME4.5avg", "TRAVELTIME5.6avg", "TRAVELTIME6.7avg", "TRAVELTIME7.8avg", "TRAVELTIME8.9avg", "TRAVELTIME9.10avg", "TRAVELTIME10.11avg", 
"TRAVELTIME11.12avg", "TRAVELTIME12.13avg", "TRAVELTIME13.14avg", "TRAVELTIME14.15avg", "TRAVELTIME15.16avg", "TRAVELTIME16.17avg", "TRAVELTIME17.18avg", "TRAVELTIME18.19avg", "TRAVELTIME19.20avg", "TRAVELTIME20.21avg", "TRAVELTIME21.22avg", "TRAVELTIME22.23avg", "TRAVELTIME23.24avg", "TRAVELTIME24.25avg", "TRAVELTIME25.26avg", "TRAVELTIME26.27avg"), sorted = "link", class = c("data.table", "data.frame"), row.names = c(NA, -6L))
Update 1: To avoid the internal.selfref issue, run dt <- data.table(dt) after you create dt using the above sample.
I want to use the minhr and maxhr variables to subset the travel times and calculate the rowMeans for those subsetted travel times and add it to the current dt. If minhr (or maxhr) is 11, the corresponding travel time column is TRAVELTIME11.12avg; if it is 19, the corresponding travel time column is TRAVELTIME19.20avg. So, if minhr is 9 and maxhr is 10 for a row, then I need to get the mean of TRAVELTIME9.10avg and TRAVELTIME10.11avg; similarly, if minhr is 15 and maxhr is 17 then I need to get the mean of TRAVELTIME15.16avg, TRAVELTIME16.17avg, and TRAVELTIME17.18avg.
I tried to approach the problem step-wise and used the following code for the simple case of a uniform travel-time column range across all rows. It works fine.
> dt[,avg:=rowMeans(.SD[,TRAVELTIME10.11avg:TRAVELTIME12.13avg, with=FALSE]),by=.(id, seqid)]
Next, I tried to modify the above code by introducing paste0() to refer to the column names dynamically, but this results in an error. I have also tried as.symbol(paste0()), noquote(paste0()), and a couple of other unquoting techniques, without any success.
> dt[,avg:=rowMeans(.SD[,paste0("TRAVELTIME", minhr, "." , minhr+1, "avg"):paste0("TRAVELTIME", maxhr, "." , maxhr+1, "avg"), with=FALSE]),by=.(id, seqid)]
Error in paste0("TRAVELTIME", minhr, ".", minhr + 1, "avg"):paste0("TRAVELTIME", :
NA/NaN argument
In addition: Warning messages:
1: In eval(expr, envir, enclos) : NAs introduced by coercion
2: In eval(expr, envir, enclos) : NAs introduced by coercion
Given this, I have two questions:
1) Why doesn't data.table recognize the column names if the paste command is used (even after unquoting the pasted strings) to subset columns as opposed to directly using the column names? Does it have anything to do with the unequal number of columns for every row?
2) Since I have been unsuccessful, can you please suggest a way to find the mean over a variable number of columns for every row and add it back to dt? I would appreciate an efficient approach, because I already tried a simpler looping method and it takes a long time (approximately 12 to 15 hours for my entire dataset).
I believe this solves the problem you were having with paste0. The : operator in j doesn't accept character strings built on the fly; it tries to coerce them to numeric positions, which is exactly the NA/NaN error and the "NAs introduced by coercion" warnings you saw. Translating the pasted names into column positions with match() and grouping row by row works:
tmp  <- paste0("TRAVELTIME", dt$minhr, ".", dt$minhr + 1, "avg")
tmp1 <- paste0("TRAVELTIME", dt$maxhr, ".", dt$maxhr + 1, "avg")
dt[, avg := rowMeans(.SD[, match(tmp[.GRP], names(.SD)):match(tmp1[.GRP], names(.SD)),
                         with = FALSE]),
   by = seq_len(nrow(dt))]
Grouping with by = seq_len(nrow(dt)) makes each row its own group, so .GRP is the row number and the start and end columns can differ from row to row.
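An alternative sketch that skips positional ranges entirely: build the vector of needed column names per row and average it with mget() (avg2 is just an illustrative name; mean() here matches rowMeans() because each group is a single row):

dt[, avg2 := {
  hrs  <- minhr:maxhr
  cols <- paste0("TRAVELTIME", hrs, ".", hrs + 1, "avg")
  mean(unlist(mget(cols)))
}, by = seq_len(nrow(dt))]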