I have a daily time series data set on which I am attempting to perform an hourly linear interpolation. My code performs a linear interpolation between points, but I need the interpolation to restart from 0 after the last point for each ID, at the start of the new day.
Below is my output after adding missing hours to the daily raw data
Below is my output after performing the following code, but I am not sure how to start from 0:
library(dplyr)
library(zoo)

# Pipes must end a line, not start one, for R to parse the chain
dfYPO0_20171 <- dfYPO0_2017 %>%
  mutate(TIMESTAMP = as.POSIXct(as.character(TIMESTAMP))) %>%
  group_by(ID) %>%
  do(left_join(data.frame(ID = .$ID[1], TIMESTAMP = seq(min(.$TIMESTAMP), max(.$TIMESTAMP), by = "hour")), ., by = c("ID", "TIMESTAMP"))) %>%
  mutate(CALC_HOURLY_PROD = na.approx(.$"Total Prod Yest"))
Below is what I would like my output to look like:
Thanks for any help in advance!
Here's an approach using tidyverse packages. First, we'll create groups based on the runs of missing values, then we'll use approx to interpolate.
library(dplyr)

# Fake data
dat = data.frame(time=seq(as.Date("2015-01-01"), as.Date("2015-01-31"), "1 day"),
prod=c(10.4, rep(NA,19), 25.8, rep(NA,9), 14.2))
dat = dat %>%
# Create groups based on runs of NA followed by a value
mutate(group = rev(cumsum(!is.na(rev(prod))))) %>%
# Operate on the groups we just created
group_by(group) %>%
# First, add a zero at the beginning of each group, then run the approx function
# to interpolate values for all groups of length greater than 1
mutate(prod = replace(prod, row_number()==1 & n()>1, 0),
prod = if(n()>1) approx(time, prod, xout=time)$y else prod) %>%
ungroup %>% select(-group)
time prod
1 2015-01-01 10.400000
2 2015-01-02 0.000000
3 2015-01-03 1.357895
19 2015-01-19 23.084211
20 2015-01-20 24.442105
21 2015-01-21 25.800000
22 2015-01-22 0.000000
23 2015-01-23 1.577778
24 2015-01-24 3.155556
29 2015-01-29 11.044444
30 2015-01-30 12.622222
31 2015-01-31 14.200000
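For readers who want to trace the grouping trick step by step, here is a minimal base R sketch of the same idea on a six-day toy series. The toy values and the variable names `group` and `interpolated` are assumptions made for illustration; they are not taken from the question's data.

```r
# Toy daily series (hypothetical values): a value, a run of missing days,
# then the next recorded value.
time <- as.Date("2015-01-01") + 0:5
prod <- c(10.4, NA, NA, NA, NA, 8.0)

# Same grouping idea as above: each run of NAs is grouped with the value that ends it.
group <- rev(cumsum(!is.na(rev(prod))))

interpolated <- prod
for (g in unique(group)) {
  idx <- which(group == g)
  if (length(idx) > 1) {
    y <- prod[idx]
    y[1] <- 0  # restart from zero at the first missing point of the run
    x <- as.numeric(time[idx])
    # approx() drops the remaining NA pairs and interpolates between 0 and the end value
    interpolated[idx] <- approx(x, y, xout = x)$y
  }
}

print(interpolated)  # values: 10.4, 0, 2, 4, 6, 8
```

The first value forms a group of length 1 and is left untouched, which is why the original 10.4 survives while the following run restarts at 0.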
I have a frequency table in R that has 12 columns, A:L, and a weekly date column (format = "%Y-%m-%d"). There are 52 or 53 rows for the past year. Because of the calculations I need to do, I am looking to expand the rows into single-day observations by dividing every row value (A:L) by 7, so that each 7-day period is split into equal daily observations.
I am currently using expand to create a date column for every day we want to include, but I cannot get the data to fill in for the multiple columns and scale itself to the corresponding date range.
df %>%
expand(Date = seq(from = as.Date(first_of_this_month_last_year) , to = as.Date(first_of_this_month_this_year), by="day"), [HAVE TRIED MANY DIFFERENT FUNCTIONS HERE BUT NOTHING HAS WORKED])
Current Table :
Desired Output Table :
In the actual input and output table we have 12 columns of observations and 1 date column, and we want to expand the 52 rows of weeks into 365 days of equal observations. So I am looking for a generic formula/solution to apply across multiple rows and a far larger dataset than the one shown above.
I am still young in developing my dplyr and R cleaning and formatting skills, so any and all help is much appreciated!
Thank you in advance.
I would approach this by building each day of the week separately and then stacking them all together. Something like this:
table_0d = df %>%
mutate(A = A/7, B = B/7, ...)
table_1d = table_0d %>% mutate(Date = Date + 1)
table_2d = table_0d %>% mutate(Date = Date + 2)
table_3d = table_0d %>% mutate(Date = Date + 3)
table_6d = table_0d %>% mutate(Date = Date + 6)
output_df = rbind(table_0d, table_1d, table_2d, ... table_6d) %>%
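The same stacking idea can be written without naming each column or each day offset. The following is only a sketch in base R, assuming a weekly table with a `Date` column and numeric value columns; the toy table `weekly` and its two columns stand in for the real A:L columns.

```r
# Hypothetical weekly table: one row per week, value columns to be split evenly.
weekly <- data.frame(
  Date = as.Date(c("2021-01-04", "2021-01-11")),
  A = c(14, 7),
  B = c(70, 35)
)
value_cols <- setdiff(names(weekly), "Date")

# Repeat each weekly row 7 times, then shift the date by 0..6 days within each week.
daily <- weekly[rep(seq_len(nrow(weekly)), each = 7), ]
daily$Date <- daily$Date + rep(0:6, times = nrow(weekly))

# Divide every value column by 7 so each week's total is preserved.
daily[value_cols] <- daily[value_cols] / 7
rownames(daily) <- NULL

head(daily, 3)
```

Because `value_cols` is derived from the column names, the same code applies unchanged to 12 value columns and 52 or 53 weekly rows.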
I am working with these data on R. These are the first six rows —without counting the first column, which the write.csv function always adds—:
> head(my_data)
client_id contract_id contract_start contract_end inter_complex
1 1 15/07/2019 15/07/2020 18092+18458i
3 3 1/01/2015 1/01/2015 16436+16436i
5 5 12/06/2020 12/06/2020 18425+18425i
13 13 1/01/2015 1/01/2015 16436+16436i
18 18 1/01/2015 1/01/2015 16436+16436i
19 19 1/01/2015 1/01/2015 16436+16436i
Each row represents a different contract. The variable inter_complex is a complex number whose real part is the numeric representation of the date when a contract started, and whose imaginary part analogously represents the date when a contract ended. In case you're wondering, you can obtain that column by executing this:
chars_2_cplex = function(start, end) {
  cbind(start, end) %>%
    apply(2, compose(as.numeric, dmy)) %*% rbind(1, 1i)
}
my_data %>% transmute(inter_complex = chars_2_cplex(contract_start, contract_end))
What I want is, for each client id and each contract, to identify how many contracts associated to that same client id intersect with that contract. In other words: I want to create a new column called simultaneous which will depict for each row —i.e. for each contract— how many active contracts the corresponding client has during the very same period that the current contract is active. In case no intersection with any other contract is found for a given contract, then the value of simultaneous would have to be 1 —as while that contract is active it is also the only active contract that the respective client has—.
I figured it would help to obtain the combinations of inter_complex, then turn those combinations of complex numbers into combinations of intervals, and then use lubridate's intersect function to discern whether or not each combination of intervals intersect. For that purpose, I have written the following code:
## This function turns complex numbers into intervals.
cplex_2_inter = function(x) {
  start = x %>% Re() %>% as.integer()
  end = x %>% Im() %>% as.integer()
  interval(as_date(start), as_date(end))
}
## This other function returns a list whose j-th element is a data frame that shows the intersections
## between combinations of j + 1 intervals.
get_intersections = function(x) {
  max_m = length(x)
  output = vector(mode = "list", length = max_m - 1)
  for (i in 2:max_m) {
    output[[i - 1]] = combn(x, m = i) %>% t() %>% as.data.frame() %>%
      mutate_all(cplex_2_inter) %>% rowid_to_column("id") %>%
      pivot_longer(-id) %>% group_by(id) %>%
      mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>%
      mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", i, 1L))
  }
  output
}
In order to get a better grasp on what the function get_intersections does, I propose that you run the following:
example = my_data %>% filter(client_id == 1) %>% pull(inter_complex) %>% get_intersections()
The data frame example[[1]] shows whether there are intersections (or, to use a better word, overlaps) between pairs of intervals. The data frame example[[2]] shows whether there are overlaps between groups of three intervals, and so on.
You may notice that according to example[[1]] the interval 2019-07-15 UTC--2020-07-15 UTC overlaps with some other interval —and hence, the associated value of simultaneous is 2— while, according to example[[2]], that very same interval is associated to a value of 3 for the variable simultaneous. Naturally, the idea is to assign to each interval its highest simultaneous value.
Since I do not care about global overlaps but rather about overlaps within each client id I figured I would require to work on a grouped data frame. The furthest I got on this project was writing this:
my_data %>% group_by(client_id) %>% group_map(~ get_intersections(.x$inter_complex))
Now on to my questions. 1) I have executed the line above, but the process is not very efficient: it has already been running for a bit more than a whole day and still has not finished. I recently came across the concept of interval trees, but I'm not a computer scientist and I would need help to tackle this problem in a smarter way. 2) In case we stick to my not-so-smart approach, I would still need a function that accesses each element of the list returned by get_intersections, so as to identify and retrieve the highest simultaneous value associated with each interval. I would like to request help on that matter as well.
Regarding Wimpel's answer, I have examined their data table and I found this.
> DT %>% filter(client_id == 502 & contract_id == 3093) %>%
> select(contract_start, contract_end, contract_intersect)
# Output
contract_start contract_end contract_intersect
1: 2018-01-11 2019-01-11 7
That is, the displayed contract allegedly overlaps with seven other contracts that the same client has.
On the other hand, let's see if this holds true when using my combinations-based approach.
combs_10_502 = my_data %>% filter(client_id == 502) %>% pull(inter_complex) %>%
combn(10) %>% t() %>% as.data.frame() %>% mutate_all(cplex_2_inter) %>%
rowid_to_column("id") %>% pivot_longer(-id) %>% group_by(id) %>%
mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>%
ungroup() %>%
mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", 10L, 1L))
> combs_10_502 %>% filter(simultaneous == 10) %>% slice(11:20)
# A tibble: 10 x 4
id name value simultaneous
<int> <chr> <Interval> <int>
1 24311 V1 2018-01-11 UTC--2019-01-11 UTC 10
2 24311 V2 2018-03-01 UTC--2019-03-01 UTC 10
3 24311 V3 2018-07-11 UTC--2019-07-11 UTC 10
4 24311 V4 2018-04-20 UTC--2019-04-20 UTC 10
5 24311 V5 2018-05-21 UTC--2019-05-21 UTC 10
6 24311 V6 2018-08-10 UTC--2019-08-10 UTC 10
7 24311 V7 2018-08-09 UTC--2019-08-09 UTC 10
8 24311 V8 2018-09-27 UTC--2019-09-27 UTC 10
9 24311 V9 2020-01-03 UTC--2021-01-03 UTC 10
10 24311 V10 2019-12-19 UTC--2020-12-19 UTC 10
The same contract is displayed on the first row of the tibble above. As can be seen, that contract actually overlaps with nine other contracts of the given client —those nine are displayed on the remaining rows—.
I don't know how Wimpel's solution got this wrong, but I checked that it does get the number of intersections right for several other contracts. Now I know that a data table-based solution is what I am looking for, since the processes are made very fast, but there seems to be an issue with the proposed solution.
I believe you are looking for something like this?
library(data.table)

DT <- fread("https://raw.githubusercontent.com/pazos-feren/Data/main/contracts.csv")
#set dates as real dates
DT[, contract_start := as.Date(contract_start, format = "%d/%m/%Y")]
DT[, contract_end := as.Date(contract_end, format = "%d/%m/%Y")]
setkey(DT, V1)
DT[DT, c("contract_intersect", "contract_intersect_ids") := {
val = DT[ !V1 == i.V1 & client_id == i.client_id &
contract_start <= i.contract_end & contract_end >= i.contract_start, ]
list( nrow(val), paste0(val$contract_id, collapse = ";") )
}, by = .EACHI]
# V1 client_id contract_id contract_start contract_end inter_complex contract_intersect contract_intersect_ids
# 1: 1 1 1 2019-07-15 2020-07-15 18092+18458i 2 4162;4168
# 2: 2 3 3 2015-01-01 2015-01-01 16436+16436i 0
# 3: 3 5 5 2020-06-12 2020-06-12 18425+18425i 0
# 4: 4 13 13 2015-01-01 2015-01-01 16436+16436i 0
# 5: 5 18 18 2015-01-01 2015-01-01 16436+16436i 0
# 6: 6 19 19 2015-01-01 2015-01-01 16436+16436i 0
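The condition inside the join above is the usual closed-interval overlap test: two contracts overlap when each one starts no later than the other ends. As a rough illustration only, here is that test applied in base R to a few made-up rows; the data frame `contracts` and the helper `count_overlaps` are hypothetical names, not part of the answer's code.

```r
# Hypothetical contracts: client 1 has three contracts, client 2 has one.
contracts <- data.frame(
  client_id      = c(1, 1, 1, 2),
  contract_id    = c(10, 11, 12, 20),
  contract_start = as.Date(c("2018-01-11", "2018-03-01", "2020-01-03", "2018-01-01")),
  contract_end   = as.Date(c("2019-01-11", "2019-03-01", "2021-01-03", "2018-12-31"))
)

# Count the other contracts of the same client that overlap contract i.
count_overlaps <- function(i, d) {
  same_client <- d$client_id == d$client_id[i]
  overlaps <- d$contract_start <= d$contract_end[i] &
    d$contract_end >= d$contract_start[i]
  sum(same_client & overlaps) - 1  # subtract 1: a contract always overlaps itself
}

contracts$contract_intersect <- vapply(
  seq_len(nrow(contracts)), count_overlaps, numeric(1), d = contracts
)

print(contracts$contract_intersect)  # 1 1 0 0
```

Note that this is a pairwise count per contract. It differs from asking whether a whole set of contracts shares a common period: two contracts that each overlap a third one do not necessarily overlap each other.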
I have cumulative data like this:
df1 <- data.frame(code=c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,3,4,4,4,4,5,5,5,5),
date=c("2020-01-01", "2020-01-01","2020-01-02","2020-01-03","2020-01-04","2020-01-01","2020-01-02","2020-01-03",
From here, I want to extract the maximum cumulative number for each code and each date, like this:
df2 <- data.frame(code=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5),
Now I have cumulative numbers for each code on each day.
From here I want to calculate the incidence over a 2-day duration.
df3 <- data.frame(code=c(1,2,3,4,5),
Incidence1 is the difference between 2020-01-01 and 2020-01-03.
Incidence2 is the difference between 2020-01-02 and 2020-01-04.
What I want to know is:
1) How to extract the maximum number within the same day
2) How to calculate the difference between values 2 days apart
Please teach me, thanks.
Here is one way to do this by creating groups of every alternate row and get the difference of the cumulative value between them. To get the expected output in the same format as shown we can use pivot_wider from tidyr.
df2 %>%
group_by(code) %>%
group_by(gr = rep(seq(1, n()/2), 2), .add = TRUE) %>% # `.add` replaces the deprecated `add` argument in dplyr >= 1.0.0
summarise(incidence = diff(cumulative)) %>%
pivot_wider(names_from = gr, values_from = incidence, names_prefix = "incidence")
# code incidence1 incidence2
# <dbl> <dbl> <dbl>
#1 1 1 1
#2 2 2 2
#3 3 6 3
#4 4 2 3
#5 5 2 3
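The answer above covers the second question. For the first one (the maximum cumulative value per code and day), one possible sketch in base R uses aggregate. The toy data frame `df1_toy` is an assumption, since the question's df1 is cut off; only the column names code, date and cumulative follow the code above.

```r
# Hypothetical stand-in for df1: code 1 and code 2 each have two records on 2020-01-01.
df1_toy <- data.frame(
  code       = c(1, 1, 1, 2, 2),
  date       = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-01"),
  cumulative = c(3, 5, 6, 1, 4)
)

# Keep the largest cumulative value for every code/date combination.
df2_toy <- aggregate(cumulative ~ code + date, data = df1_toy, FUN = max)
df2_toy <- df2_toy[order(df2_toy$code, df2_toy$date), ]
rownames(df2_toy) <- NULL

print(df2_toy)
```

With dplyr loaded, grouping by code and date and summarising with max(cumulative) should give the same table.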
I have measured temperature at a high time resolution of 10 minutes on different urban tree species, whose reactions are to be compared. I am therefore particularly interested in periods of heat. The task I am failing at is selecting complete days based on a maximum value. For example, every day that has at least one measurement above 30 °C should be extracted from my data frame in full.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether each individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether or not to pick a day when producing a new data frame. If the value is above 30 °C at any time of the day, I want to include that whole day, from 00:00 to 23:59, in the new data frame for further analysis.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
  Time = tseq,
  Temp = sample(20:35,1000, replace = TRUE),
  Variable1 = sample(1:200,1000, replace = TRUE),
  Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example yields a data frame analogous to the structure of my data:
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a new data frame containing all the days on which at least one entry is marked as "heat"?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how can I tell it to take the whole day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions for a working solution to my problem. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable that specifies the day (dropping hours, minutes, etc.). Then iterate over the unique dates and keep only those subsets whose heat30 column contains "heat" at least once:
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
res <- NULL
newdf <- lapply(unique(Measurings$Time2), function(x){
  ss <- Measurings %>% filter(Time2 == x) %>% select(heat30) %>% pull(heat30) # take heat30 vector
  rr <- Measurings %>% filter(Time2 == x) # select date x
  # check if heat30 vector contains heat value at least once, if so bind that subset
  if(any(ss == "heat")){
    res <- rbind(res, rr)
  }
  res # the day's rows, or NULL if the day has no "heat" entry
}) %>% bind_rows()
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
library(tibble)
# break the time stamp into Day and Hour
time_df <- as_data_frame(str_split(Measurings$Time, " ", simplify = T))
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 hourly observations, with temperatures varying between 20 and 35, across roughly 42 days. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is always good practice to set a seed to ensure reproducibility.
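To make both points concrete, here is a small base R sketch that sets a seed and uses a narrower temperature range, so that some days are likely to stay below the threshold. The seed value, the four-day length, the 20:31 range, the UTC time zone and the helper columns `Day` and `day_max` are all assumptions chosen for the illustration.

```r
set.seed(42)  # arbitrary seed, only to make the sample reproducible

start <- as.POSIXct("2018-05-18 00:00", tz = "UTC")
Measurings <- data.frame(
  Time = seq(from = start, length.out = 96, by = "hours"),  # four full days
  Temp = sample(20:31, 96, replace = TRUE)
)

# Calendar day of each measurement, as a character label.
Measurings$Day <- format(Measurings$Time, "%Y-%m-%d", tz = "UTC")

# Daily maximum temperature, repeated on every row of that day.
day_max <- ave(Measurings$Temp, Measurings$Day, FUN = max)

# Keep every row of each day whose maximum exceeds 30 degrees C.
hot_days <- Measurings[day_max > 30, ]

unique(hot_days$Day)
```

Because the filter is applied to the daily maximum rather than to the individual reading, each selected day is kept in full, from 00:00 to 23:00 in this hourly example.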
I have data in the form of start and stop times (in the format minutes:seconds). A simplistic example might be the timestamp of a light turning on, and the subsequent timestamp of the light turning off.
For example:
Start Stop
00:03.1 00:40.9
00:55.0 01:38.2
01:40.0 02:01.1
I would like to rearrange the data so that I can eventually look at it in terms of whole-minute interval bins in R.
Option 1: Turn the data into a binary listing for each tenth of a second, then aggregate the data later by timestamp.
Time.in.sec Yes.or.No
0.0 N
0.1 N
... ...
3.0 N
3.1 Y
3.2 Y
... ...
40.8 Y
40.9 N
... ...
Option 2: Split the time intervals at the minute marks and aggregate total time per minute (starting at time = 0:00.0) with some sort of logical rule.
Start Stop
00:03.10 00:40.90
00:55.00 00:59.99
01:00.00 01:38.20
01:40.00 01:59.99
02:00.00 02:01.10
I have tried looking into lubridate functions (i.e., making each range into an interval class) and cut(), but I can’t seem to figure out how to make either of these ideas work. I also am unclear whether packages such as zoo would be appropriate for this; honestly, I have very little experience with date/time formats and time series.
Other questions on Stackoverflow seem to be addressing making bins from raw timestamps (e.g., What is an efficient method for partitioning and aggregating intervals from timestamped rows in a data frame? and Aggregate data by equally spaced time intervals in R), but I essentially want to do the opposite.
EDIT 1: Here is a CSV-format of the example data, up through minute 6.
Start, Stop
00:03.1, 00:40.9
00:55.0, 01:38.2
01:40.0, 02:01.1
03:03.1, 04:30.3
04:50.0, 05:01.5
05:08.7, 05:22.0
05:40.1, 05:47.9
EDIT 2: My ultimate goal for this is to have the data in a format that I can use to chunk the observations into standardized time bins (Minute 1, Minute 2, etc.) to get a by-minute percentage of when the data is "Yes". Basically I want to get a summary of the distribution of states by minute, and since the data is binary, I can do this by looking at the "yes" state.
For the first 3 minutes (from 00:00.0 up until 03:00.0), the output would be something like this:
Minute time.yes.sec perc.time.yes
1 42.8 71.33
2 58.2 96.98
3 1.1 1.83
# *NOTE: Here, Minute 1 = [0, 60), Minute 2 = [60, 120), etc.; I'm not opposed
# to the reverse definitions though (Minute 1 = (0, 60], etc.).
I could alternatively look at the data as a cumulative distribution plot, with each successive time point updating the value of "total time yes". However, If I could get the data in the format of option 1, I would have the flexibility to look at the data either way.
An option, lightly edited from my version in the comments:
df %>% mutate_all(funs(period_to_seconds(ms(.)))) %>% # convert each time to seconds
rowwise() %>% # evaluate the following row-by-row
# make a sequence from Start to Stop by 0.1, wrapped in a list
mutate(instant = list(seq(Start, Stop, by = 0.1))) %>%
unnest() %>% # expand list column
# make a factor, cutting instants into 60 second bins
mutate(minute = cut(instant, breaks = (0:6) * 60, labels = 1:6)) %>%
group_by(minute) %>% # evaluate the following grouped by new factor column
# for each group, count the rows, subtracting 1 for starting instants, and
# dividing by 10 to convert from tenths of seconds to seconds
summarise(elapsed = (n() - n_distinct(Start)) / 10,
pct_elapsed = elapsed / 60 * 100) # convert to percent
## # A tibble: 6 × 3
## minute elapsed pct_elapsed
## <fctr> <dbl> <dbl>
## 1 1 42.8 71.333333
## 2 2 58.1 96.833333
## 3 3 1.0 1.666667
## 4 4 56.9 94.833333
## 5 5 40.2 67.000000
## 6 6 22.5 37.500000
Note the correction for counting starting instants is imperfect, as it will subtract for every starting instant, even if it is a continuation of a sequence from the previous minute. It could be calculated more thoroughly if precision matters.
A more precise but somewhat difficult route is to add stops and starts at the turn of each minute:
df %>% mutate_all(funs(period_to_seconds(ms(.)))) %>% # convert to seconds
gather(var, val) %>% # gather to long form
# construct and rbind data.frame of breaks at minute changes
bind_rows(expand.grid(var = c('Start', 'Stop'),
val = seq(60, by = 60, length.out = floor(max(.$val)/60)))) %>%
arrange(val, desc(var)) %>% # sort
mutate(index = rep(seq(n()/2), each = 2)) %>% # make indices for spreading
spread(var, val) %>% # spread back to wide form
mutate(elapsed = Stop - Start) %>% # calculate elapsed time for each row
# add and group by factor of which minute each falls in
group_by(minute = cut(Stop, seq(0, by = 60, length.out = ceiling(max(Stop) / 60 + 1)),
labels = 1:6)) %>%
summarise(elapsed = sum(elapsed), # calculate summaries
pct_elapsed = elapsed / 60 * 100)
## # A tibble: 6 × 3
## minute elapsed pct_elapsed
## <fctr> <dbl> <dbl>
## 1 1 42.8 71.333333
## 2 2 58.2 97.000000
## 3 3 1.1 1.833333
## 4 4 56.9 94.833333
## 5 5 40.3 67.166667
## 6 6 22.6 37.666667
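The per-minute totals can also be computed directly, without expanding to tenths of a second or reshaping, by clipping each interval to each minute bin. This is only a sketch in base R, assuming the times have already been converted to seconds; the vector names `start_sec` and `stop_sec` are hypothetical, and the values are the first three rows of the question's data.

```r
# First three intervals from the question, converted to seconds by hand.
start_sec <- c(3.1, 55.0, 100.0)
stop_sec  <- c(40.9, 98.2, 121.1)

n_minutes <- ceiling(max(stop_sec) / 60)
bin_start <- (seq_len(n_minutes) - 1) * 60  # Minute 1 = [0, 60), Minute 2 = [60, 120), ...

# For each bin, clip every interval to the bin and sum the positive lengths.
time_yes <- vapply(bin_start, function(lo) {
  hi <- lo + 60
  sum(pmax(0, pmin(stop_sec, hi) - pmax(start_sec, lo)))
}, numeric(1))

result <- data.frame(
  Minute        = seq_len(n_minutes),
  time.yes.sec  = time_yes,               # 42.8, 58.2, 1.1
  perc.time.yes = round(time_yes / 60 * 100, 2)
)
print(result)
```

The overlap of an interval with a bin is min(stop, bin end) minus max(start, bin start), floored at zero, so intervals that span a minute boundary are split exactly at the boundary.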
I did the following using your original data prior to the edit:
Start Stop
00:03.1 00:40.9
00:55.0 01:38.2
01:40.0 02:01.1
agg <- read.table(con<-file("clipboard"), header=T)
The ms function below takes the raw character input I read in from the clipboard and changes it into minutes and seconds with an appropriate class, so that it can be used for comparisons. The same is true for the seconds function; the only difference there is that I'm dealing with data that is measured just in seconds, not minutes and seconds.
agg$Start <- lubridate::ms(agg$Start)
agg$Stop <- lubridate::ms(agg$Stop)
option1 <- data.frame(time = lubridate::seconds(seq(.1, 122, .1)),
flag = as.character("N"), stringsAsFactors = F)
for(i in 1:nrow(agg)){
  option1$flag[option1$time > agg$Start[i] & option1$time < agg$Stop[i]] <- "Y"
}
To verify that it worked, let's look at table():
table(option1$flag)
  N    Y
201 1019
option1$minute <- ifelse(option1$time < lubridate::seconds(60), 0, 1)
option1$minute[option1$time > lubridate::seconds(120)] <- 2
table(option1$flag, option1$minute)
0 1 2
N 172 19 10
Y 427 582 10
prop.table(table(option1$flag, option1$minute),2)
0 1 2
N 0.28714524 0.03161398 0.50000000
Y 0.71285476 0.96838602 0.50000000