I have two data frames: one with hourly data (dfa) and one with one measurement per day (dfb), taken at the same hour each day (15:29:05) (see below for examples covering 2 days).
I want to merge these data frames so that I keep all the hourly data, with the daily values aligned to the correct hour and NAs filled in for the other hours of the day.
Simply applying merge cuts the result down to the daily data, so I lose all the hourly information:
dfc <- merge(dfa, dfb, by = "datetime")
Any help would be appreciated.
e.g. for two days:
#hourly
dfa <- structure(list(datetime = structure(c(1466231345, 1466234945,
1466238545, 1466242145, 1466245745, 1466249345, 1466252945, 1466256545,
1466260145, 1466263745, 1466267345, 1466270945, 1466274545, 1466278145,
1466281745, 1466285345, 1466288945, 1466292545, 1466296145, 1466299745,
1466303345, 1466306945, 1466310545, 1466314145, 1466317745, 1466321345,
1466324945, 1466328545, 1466332145, 1466335745, 1466339345, 1466342945,
1466346545, 1466350145, 1466353745, 1466357345, 1466360945, 1466364545,
1466368145, 1466371745, 1466375345, 1466378945, 1466382545, 1466386145,
1466389745, 1466393345, 1466396945, 1466400545), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), DFQ1 = c(0.408025, 0.4355833335,
0.68485, 0.650875, 0.5307833335, 0.509775, 0.5273135595, 0.5763083335,
0.4954, 0.444308333, 0.4048083335, 0.419475, 0.35105, 0.2740416665,
0.3038666665, 0.351774317, 0.306025, 0.3183916665, 0.249175,
0.268133333, 0.3285083335, 0.2807666665, 0.351633333, 0.374516667,
0.3763, 0.3806583335, 0.366675, 0.411133333, 0.433291667, 0.408225,
0.3812, 0.380358333, 0.3557166665, 0.3701, 0.400788842, 0.396833333,
0.362991667, 0.3790083335, 0.3631666665, 0.367041667, 0.3899583335,
0.360658333, 0.359675, 0.356358333, 0.3864083335, 0.3965083335,
0.3901166665, 0.403976695)), class = "data.frame", row.names = c(NA,
-48L))
#daily
dfb <- structure(list(datetime = structure(c(1466263745, 1466350145), class
= c("POSIXct",
"POSIXt"), tzone = "UTC"), Tchl = c(0.1265, 0.1503), TCSE = structure(c(12L,
9L), .Label = c("", "#DIV/0!", "0.000", "0.001", "0.002", "0.003",
"0.004", "0.005", "0.007", "0.008", "0.009", "0.010", "0.011",
"0.012", "0.013", "0.015", "0.021", "0.026", "0.027", "CB2016",
"Std error"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
You can use merge with all.x = TRUE so that every row of the hourly data is kept:
dfc <- merge(dfa, dfb, by = "datetime", all.x = TRUE)
# datetime DFQ1 Tchl TCSE
# 1 2016-06-18 06:29:05 0.4080250 NA <NA>
# 2 2016-06-18 07:29:05 0.4355833 NA <NA>
# 3 2016-06-18 08:29:05 0.6848500 NA <NA>
# 4 2016-06-18 09:29:05 0.6508750 NA <NA>
# 5 2016-06-18 10:29:05 0.5307833 NA <NA>
# 6 2016-06-18 11:29:05 0.5097750 NA <NA>
# 7 2016-06-18 12:29:05 0.5273136 NA <NA>
# 8 2016-06-18 13:29:05 0.5763083 NA <NA>
# 9 2016-06-18 14:29:05 0.4954000 NA <NA>
# 10 2016-06-18 15:29:05 0.4443083 0.1265 0.010
# ...
Or a tidyverse solution:
library(tidyverse)
dfc <- left_join(dfa, dfb, by="datetime")
#> head(dfc,10)
# datetime DFQ1 Tchl TCSE
#1 2016-06-18 06:29:05 0.4080250 NA <NA>
#2 2016-06-18 07:29:05 0.4355833 NA <NA>
#3 2016-06-18 08:29:05 0.6848500 NA <NA>
#4 2016-06-18 09:29:05 0.6508750 NA <NA>
#5 2016-06-18 10:29:05 0.5307833 NA <NA>
#6 2016-06-18 11:29:05 0.5097750 NA <NA>
#7 2016-06-18 12:29:05 0.5273136 NA <NA>
#8 2016-06-18 13:29:05 0.5763083 NA <NA>
#9 2016-06-18 14:29:05 0.4954000 NA <NA>
#10 2016-06-18 15:29:05 0.4443083 0.1265 0.010
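Either way, a quick sanity check on the merged dfc confirms that all 48 hourly rows are kept and both daily measurements land on their matching hour:
nrow(dfc)               # 48: every hourly observation preserved
sum(!is.na(dfc$Tchl))   # 2: both daily measurements matched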
I'm trying to find whether a date falls between any of multiple pairs of dates that are stored wide in my dataset. The number of pairs I've given here is just an example; the eventual number may be larger or smaller. I'm not sure this is the most sensible layout, but working long didn't seem to work, and this is also a very common way to handle overlapping dates and date pairs in SPSS, where you can have multiple variables numbered as the dates are here and it works through each numbered 'set' to give you a response.
Here is an example dataset:
person key_date 1_end_date 2_end_date 3_end_date 4_end_date 1_start_date 2_start_date 3_start_date 4_start_date
1 1 2019-09-30 2019-05-23 2019-09-30 2016-07-22 <NA> 2019-05-23 2019-09-30 2016-07-22 <NA>
2 2 2019-06-07 2019-05-16 2019-06-07 <NA> <NA> 2019-05-16 <NA> <NA> <NA>
3 3 2020-03-09 2016-06-02 2019-08-09 2020-05-27 2020-02-12 2016-06-02 2019-08-09 2020-05-27 2020-03-09
test <- structure(list(person = 1:3, key_date = structure(c(18169, 18054,18330), class = "Date"), `1_end_date` = structure(c(18039, 18032,16954), class = "Date"), `2_end_date` = structure(c(18169, 18054,18117), class = "Date"), `3_end_date` = structure(c(17004, NA,18409), class = "Date"), `4_end_date` = structure(c(NA, NA, 18304), class = "Date"), `1_start_date` = structure(c(18039, 18032,16954), class = "Date"), `2_start_date` = structure(c(18169,NA, 18117), class = "Date"), `3_start_date` = structure(c(17004,NA, 18409), class = "Date"), `4_start_date` = structure(c(NA,NA, 18330), class = "Date")), row.names = c(NA, 3L), class = "data.frame")
The expected output would just be a binary flag indicating that the key_date falls between any pair of start_date and end_date. In the example given, that would mean persons 1 and 3. Any ideas how to do this? Is this approach really inefficient?
tidyverse approach
library(tidyverse)
result <- test %>%
  mutate(across(ends_with("end_date"),
                ~ key_date <= . & key_date >= get(str_replace(cur_column(), "end", "start")),
                .names = '{.col}_flag')) %>%
  rowwise() %>%
  mutate(Flag1 = sum(c_across(ends_with("flag")), na.rm = TRUE)) %>%
  ungroup() %>%
  select(-ends_with("flag"))
> result$Flag1
[1] 1 0 0
The complete output will look like this:
> result
# A tibble: 3 x 11
person key_date `1_end_date` `2_end_date` `3_end_date` `4_end_date` `1_start_date` `2_start_date` `3_start_date` `4_start_date` Flag1
<int> <date> <date> <date> <date> <date> <date> <date> <date> <date> <dbl>
1 1 2019-09-30 2019-05-23 2019-09-30 2016-07-22 NA 2019-05-23 2019-09-30 2016-07-22 NA 1
2 2 2019-06-07 2019-05-16 2019-06-07 NA NA 2019-05-16 NA NA NA 0
3 3 2020-03-09 2016-06-02 2019-08-09 2020-05-27 2020-02-12 2016-06-02 2019-08-09 2020-05-27 2020-03-09 0
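As a cross-check, here is a minimal base R sketch of the same test (not part of the answer above; n_pairs is a helper variable I am assuming for the number of start/end pairs). It loops over the numbered pairs and ORs the per-pair comparisons, and reproduces the Flag1 values shown:
n_pairs <- 4  # assumed number of start/end date pairs in the wide data
flags <- sapply(seq_len(nrow(test)), function(i) {
  any(sapply(seq_len(n_pairs), function(k) {
    s <- test[[paste0(k, "_start_date")]][i]
    e <- test[[paste0(k, "_end_date")]][i]
    !is.na(s) && !is.na(e) && test$key_date[i] >= s && test$key_date[i] <= e
  }))
})
as.integer(flags)
#> [1] 1 0 0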
I have a df (chpt4) with 1000+ participants and the dates when tests were taken. I would like to allocate the dates according to how many months have passed between each follow-up (t1:t4) and the baseline (t0). For this purpose I created 4 additional columns (difft0t1:difft0t4) that show exactly the months elapsed between the tests. The image shows what I have now.
I am grouping the months into 5 different categories (I also thought these vectors would help me as a counter):
FU6 <- 1:9
FU12 <- 10:18
FU24 <- 19:30
FU36 <- 31:42
FU48 <- 43:54
My original idea was to start by indexing the values of the difft0t1 column that belong to the above ranges, using which():
which(chpt4$difft0t1 %in% c(FU6)) #this works
which(chpt4$difft0t1 %in% c(FU14)) #this doesn't work at all
...and then use that outcome number as an index for which element to paste into another column. It's just not working.
Keeping with the image example (rows 243 and 244), I would like the outcome columns to look like this:
baseline     FU6   FU12         FU24         FU36         FU48
2012-02-24   NA    2013-09-06   2014-02-21   2015-06-23   NA
2012-05-24   NA    2013-05-16   NA           2015-04-20   2016-05-12
I think you need this:
library(tidyverse)
df %>%
  pivot_longer(cols = -id, names_to = "Test", values_to = "Dates") %>%
  group_by(id) %>%
  # months elapsed since the baseline (first date per id), counted in 30-day blocks
  mutate(new_col = as.numeric(round((Dates - first(Dates)) / 30, 0))) %>%
  # map the elapsed months onto the follow-up categories
  mutate(new_col = case_when(new_col == 0 ~ "Baseline",
                             new_col %in% 1:9 ~ "FU6",
                             new_col %in% 10:18 ~ "FU12",
                             new_col %in% 19:30 ~ "FU24",
                             new_col %in% 31:42 ~ "FU36",
                             new_col %in% 43:54 ~ "FU48")) %>%
  filter(!is.na(new_col)) %>%
  select(-Test) %>%
  # one column per category; if two dates fall in the same category, keep the earliest
  pivot_wider(id_cols = "id", names_from = "new_col", values_from = "Dates", values_fn = min)
# A tibble: 4 x 6
# Groups: id [4]
id Baseline FU12 FU24 FU36 FU48
<chr> <date> <date> <date> <date> <date>
1 waa000 2012-10-04 2013-09-05 NA NA NA
2 waf84 2012-02-24 NA 2013-09-06 2015-06-23 NA
3 waq593 2012-05-24 2013-05-16 NA 2015-04-20 2016-05-12
4 wcu776 2013-01-24 2014-01-23 NA NA NA
NOTE: whenever two dates fall into the same group, the minimum (earliest) of them is displayed (that is what values_fn = min does). The FU6 category will appear automatically once data that falls into it is used.
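As a quick check of the binning (a small sketch against the sample df below, using the same 30-day approximation as the pipeline above):
# elapsed months from baseline for id "waf84", approximated as 30-day blocks
with(df[df$id == "waf84", ],
     round(as.numeric(c(t0, t1, t2, t3, t4) - t0) / 30))
#> [1]  0 19 24 40 NA
# i.e. Baseline, FU24, FU24 (earliest of the two kept), FU36 - matching row 2 of the output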
sample data used
> dput(df)
structure(list(id = c("waa000", "waf84", "waq593", "wcu776"),
t0 = structure(c(15617, 15394, 15484, 15729), class = "Date"),
t1 = structure(c(15953, 15954, 15841, 16093), class = "Date"),
t2 = structure(c(NA, 16122, 16545, NA), class = "Date"),
t3 = structure(c(NA, 16609, 16933, NA), class = "Date"),
t4 = structure(c(NA_real_, NA_real_, NA_real_, NA_real_), class = "Date")), row.names = c(NA,
-4L), class = "data.frame")
> df
id t0 t1 t2 t3 t4
1 waa000 2012-10-04 2013-09-05 <NA> <NA> <NA>
2 waf84 2012-02-24 2013-09-06 2014-02-21 2015-06-23 <NA>
3 waq593 2012-05-24 2013-05-16 2015-04-20 2016-05-12 <NA>
4 wcu776 2013-01-24 2014-01-23 <NA> <NA> <NA>
I have the following data. Both columns are dates, and I have to take the difference in days. However, most of the values in one of the date columns are blank, so I have to return NA for those.
a b
02-07-2012
18-08-2012
13-08-2012
16-04-2012
26-04-2012
03-05-2012 12-05-2012
09-06-2012
30-05-2012
22-06-2012
05-07-2012
30-06-2012
09-05-2012
22-06-2012
02-07-2012
17-07-2012
17-08-2012
16-07-2012
01-08-2012
05-08-2012
17-08-2012
30-04-2012
05-07-2012
07-04-2012
27-04-2012
21-06-2012
03-07-2012
21-07-2012
24-04-2012
05-06-2012
03-07-2012
02-04-2012 01-06-2012
06-04-2012
15-04-2012
16-06-2012
01-08-2012
13-05-2012
09-07-2012
09-07-2012
18-04-2012
09-08-2012
10-04-2012
12-05-2012
04-04-2012
04-06-2012 04-06-2012
15-06-2012
02-07-2012
05-07-2012
21-08-2012
19-07-2012
06-08-2012
15-06-2012
06-04-2012
04-06-2012
23-07-2012
06-04-2012
12-04-2012 11-06-2012
24-05-2012
03-08-2012
04-05-2012 09-05-2012
07-05-2012
07-06-2012
06-07-2012
13-07-2012
26-07-2012
26-04-2012
22-06-2012
26-07-2012
12-04-2012
07-08-2012
27-06-2012
03-04-2012 02-06-2012
13-04-2012
28-07-2012
07-05-2012
29-06-2012
03-04-2012 02-06-2012
04-04-2012
04-04-2012 24-05-2012
04-04-2012
05-04-2012
07-04-2012
10-04-2012
11-04-2012
13-04-2012
13-04-2012
13-04-2012
13-04-2012
14-04-2012
14-04-2012
14-04-2012
18-04-2012
19-04-2012
21-04-2012
25-04-2012
25-04-2012
26-04-2012
26-04-2012
26-04-2012
27-04-2012
30-04-2012
04-06-2012
04-06-2012
05-06-2012
05-06-2012
05-06-2012
05-06-2012
05-06-2012 16-07-2012
06-06-2012 29-06-2012
I tried the following but couldn't succeed:
date_strings[date_strings==""]<-NA # Replaced blank spaces with NA & removed them
head(date_strings)
newdata<-na.omit(date_strings)
str(newdata)
newdata$a<-as.Date(newdata$a,"%m%d%y")
newdata$b<-as.Date(newdata$b,"%m%d%y")
diff_in_days = difftime(newdata$a, newdata$b, units = "days") # days
Change the dates to Date class, which will turn the blanks into NA automatically, and then subtract the days using difftime.
date_strings[] <- lapply(date_strings, as.Date, format = '%d-%m-%Y')
date_strings$diff_in_days = difftime(date_strings$b, date_strings$a,
units = "days")
date_strings
# a b diff_in_days
#1 2012-07-02 <NA> NA
#2 2012-08-18 <NA> NA
#3 2012-08-13 <NA> NA
#4 2012-04-16 <NA> NA
#5 2012-04-26 <NA> NA
#6 2012-05-03 2012-05-12 9
Or subtract directly:
date_strings$diff_in_days = date_strings$b - date_strings$a
data
date_strings <- structure(list(a = c("02-07-2012", "18-08-2012", "13-08-2012",
"16-04-2012", "26-04-2012", "03-05-2012"), b = c("", "", "",
"", "", "12-05-2012")), class = "data.frame", row.names = c(NA, -6L))
With the tidyverse, we can do:
library(dplyr)
library(lubridate)
date_strings %>%
  mutate(across(everything(), dmy)) %>%
  mutate(diff_in_days = b - a)
# a b diff_in_days
#1 2012-07-02 <NA> NA days
#2 2012-08-18 <NA> NA days
#3 2012-08-13 <NA> NA days
#4 2012-04-16 <NA> NA days
#5 2012-04-26 <NA> NA days
#6 2012-05-03 2012-05-12 9 days
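If you prefer a plain numeric column rather than a difftime (an optional tweak, not required by the question), wrap the subtraction in as.numeric():
date_strings %>%
  mutate(across(everything(), dmy)) %>%
  mutate(diff_in_days = as.numeric(b - a))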
data
date_strings <- structure(list(a = c("02-07-2012", "18-08-2012", "13-08-2012",
"16-04-2012", "26-04-2012", "03-05-2012"), b = c("", "", "",
"", "", "12-05-2012")), class = "data.frame", row.names = c(NA, -6L))
So I have a dataframe as follows called df1:
df1 <- structure(list(startTime = structure(c(1519903920, 1519905060,
1519913740, 1519919880), class = c("POSIXct", "POSIXt"), tzone = "America/New_York"),
endTime = structure(c(1519904880, 1519912200, 1519913940,
1522142880), class = c("POSIXct", "POSIXt"), tzone = "America/New_York"),
impact = c(92.17, 616.43, 63.69, 14.69), impactPercent = c(184.15,
1495.17, 138.69, 19.97), impactSpeedDiff = c(3587.72, 25726.22,
2616.01, 474.11), maxQueueLength = c(5.76053, 5.76053, 4.829511,
2.447619), tmcs = list(c("110N04623", "110-04623", "110N04624",
"110-04624", "110N04625", "110-04625", "110N04626", "110-04626",
"110N04627"), c("110N04623", "110-04623", "110N04624", "110-04624",
"110N04625", "110-04625", "110N04626", "110-04626", "110N04627"
), c("110N04623", "110-04623", "110N04624", "110-04624",
"110N04625", "110-04625", "110N04626", "110-04626"), c("110N04623",
"110-04623", "110N04624", "110-04624", "110N04625")), early_startTime = structure(c(1519903620,
1519904760, 1519913740, 1522133400), class = c("POSIXct",
"POSIXt"), tzone = "America/New_York")), row.names = c(NA,
4L), class = "data.frame")
Given this dataframe, I need to match it against the following dataframe (df2).
df2 <- structure(list(created_tstamp = structure(c(1519926899, 1519913840,
1519913840, 1519927924, 1522141200, 1522152619, 1522152708, 1522152728,
1519928416, 1519928785, 1519929080, 1519929306, 1519929964, 1519930050,
1522154148, 1519930311, 1519930139, 1519930470, 1519930660, 1519929579
), class = c("POSIXct", "POSIXt"), tzone = "America/New_York"),
closed_tstamp = structure(c(1519929764, 1519926987, 1519927686,
1519928360, 1522152738, 1522152779, 1522154882, 1522152819,
1519928464, 1519928914, 1519929266, 1519929741, 1519939420,
1519930622, 1522155300, 1519930334, 1519931054, 1519951230,
1519930766, 1519930830), class = c("POSIXct", "POSIXt"), tzone = "America/New_York"),
code = c("110-04508", "110N04623", "110N04623", "110P05583",
"", "", "110N04485", "110N04357", "110-05066", "110-04421",
"110N04421", "110P04577", "110-04204", "110-04269", "110+04673",
"110-04445", "", "110P05797", "110N04269", "110+04520")), row.names = c(NA,
20L), class = "data.frame")
A match is indicated by two criteria together:
created_tstamp in df2 is between early_startTime and endTime in df1
code in df2 exists in the same tmcs cell in df1
Both conditions need to be met for it to be considered a match. Ultimately I would like to create an identifier to match each row of df2 to its corresponding match in df1. This is probably done with a loop of some sort, but I am unsure how to write it. Note: this is a subset of the data.
If a data point in df2 doesn't have a match in df1, it should be NA in the identifier column, and both dfs should get an ID column in the end.
I believe this should work. It's hard to tell, as it returns no matches with the provided data; that is because none of the created_tstamp values are earlier than your endTime.
Edit: now that we have a match with the updated question, we can juggle the output as follows:
test <- apply(df2, 1, function(x) which(
  x[1] > df1$early_startTime &   # created_tstamp is after early_startTime
  x[1] < df1$endTime &           # ... and before endTime
  grepl(x[3], df1$tmcs) &        # code appears in that df1 row's tmcs list
  x[3] != ""                     # skip df2 rows with an empty code
))
# collapse multiple matching df1 row numbers into one "i;j" string, NA if no match
IDlist <- lapply(test, paste0, collapse = ";")
df2$ID <- unlist(ifelse(lengths(test) > 0, IDlist, NA))
output:
> df2
created_tstamp closed_tstamp code ID
1 2018-03-01 12:54:59 2018-03-01 13:42:44 110-04508 <NA>
2 2018-03-01 09:17:20 2018-03-01 12:56:27 110N04623 2
3 2018-03-01 09:17:20 2018-03-01 13:08:06 110N04623 2
4 2018-03-01 13:12:04 2018-03-01 13:19:20 110P05583 <NA>
5 2018-03-27 05:00:00 2018-03-27 08:12:18 <NA>
6 2018-03-27 08:10:19 2018-03-27 08:12:59 <NA>
7 2018-03-27 08:11:48 2018-03-27 08:48:02 110N04485 <NA>
8 2018-03-27 08:12:08 2018-03-27 08:13:39 110N04357 <NA>
9 2018-03-01 13:20:16 2018-03-01 13:21:04 110-05066 <NA>
10 2018-03-01 13:26:25 2018-03-01 13:28:34 110-04421 <NA>
11 2018-03-01 13:31:20 2018-03-01 13:34:26 110N04421 <NA>
12 2018-03-01 13:35:06 2018-03-01 13:42:21 110P04577 <NA>
13 2018-03-01 13:46:04 2018-03-01 16:23:40 110-04204 <NA>
14 2018-03-01 13:47:30 2018-03-01 13:57:02 110-04269 <NA>
15 2018-03-27 08:35:48 2018-03-27 08:55:00 110+04673 <NA>
16 2018-03-01 13:51:51 2018-03-01 13:52:14 110-04445 <NA>
17 2018-03-01 13:48:59 2018-03-01 14:04:14 <NA>
18 2018-03-01 13:54:30 2018-03-01 19:40:30 110P05797 <NA>
19 2018-03-01 13:57:40 2018-03-01 13:59:26 110N04269 <NA>
20 2018-03-01 13:39:39 2018-03-01 14:00:30 110+04520 <NA>
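If df1 should also get an ID column (a small addition on top of the above; the values in df2$ID are simply df1 row numbers, so numbering df1's rows lines the two up):
df1$ID <- seq_len(nrow(df1))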
I have a table of gas refuelings, a, like this:
a = setDT(structure(list(date = structure(c(NA, 16837, 16843, 16847, 16852,
16854, 16858, 16862, 16867, 16871, 16874), class = "Date"), km = c(NA,
NA, 421, 351, 286, 350, 414, 332, 401, 321, 350)), .Names = c("date",
"km"), class = c("data.table", "data.frame"), row.names = c(NA,
-11L)), key = "date")
It has the dates of each refuel and the km driven on that refuel. I'm also given a different table, actions, with the dates of tire pressure adjustments and oil changes:
actions = setDT(structure(list(date = structure(c(16841, 16843, 16858, 16869), class = "Date"),
action = structure(c(1L, 2L, 2L, 2L), .Label = c("oil", "tires"
), class = "factor")), .Names = c("date", "action"), row.names = c(NA,
-4L), class = c("data.table", "data.frame")), key = "action")
I need to relate the fuel consumption (in the real version of a I also have gallons) to the days elapsed since the last tire pressure check and since the last oil change. There must be a simple way to achieve this, but after some hours of trying I'm stuck.
This is what I've tried:
library(data.table)
library(lubridate)
library(reshape2)
b <- dcast(actions, date ~ action, value.var = "date")
d <- seq(min(a$date, b$date, na.rm = TRUE), max(a$date, b$date, na.rm = TRUE), by = "day")
d <- data.table(date=d)
d <- b[d,]
d$daysOil <- as.double(difftime(d$date, d$date[! is.na(d$oil)], units = "days"))
d$daysOil[which(d$daysOil < 0)] <- NA
Things get a lot more complicated when I try to calculate the number of days elapsed since the last "tires" event (the one closest before the refuel date), and that's where I'm stuck.
My expected output is:
expected
date km daysoil daysTires
1 <NA> NA NA NA
2 2016-02-06 NA NA NA
3 2016-02-12 421 2 0
4 2016-02-16 351 6 4
5 2016-02-21 286 11 9
6 2016-02-23 350 13 11
7 2016-02-27 414 17 0
8 2016-03-02 332 21 4
9 2016-03-07 401 26 9
10 2016-03-11 321 30 2
11 2016-03-14 350 33 5
I'd appreciate any solution, but preferably by using data.table or dplyr packages.
########## EDIT ##########
If you can think of a better information (table) structure to facilitate this task, that would be highly appreciated too!
Here's one option:
actions[, date.copy := date]
cbind(a,
dcast(actions[, .SD[a, .(days = date - date.copy, N = .I), roll = T, on = 'date']
, by = action],
N ~ action, value.var = 'days'))
# date km N oil tires
# 1: <NA> NA 1 NA days NA days
# 2: 2016-02-06 NA 2 NA days NA days
# 3: 2016-02-12 421 3 2 days 0 days
# 4: 2016-02-16 351 4 6 days 4 days
# 5: 2016-02-21 286 5 11 days 9 days
# 6: 2016-02-23 350 6 13 days 11 days
# 7: 2016-02-27 414 7 17 days 0 days
# 8: 2016-03-02 332 8 21 days 4 days
# 9: 2016-03-07 401 9 26 days 9 days
#10: 2016-03-11 321 10 30 days 2 days
#11: 2016-03-14 350 11 33 days 5 days
Several simple things are going on in the above; run it in pieces to understand.
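For instance, here is a sketch of the same expression broken into intermediate steps (same a and actions tables as above; the behaviour is unchanged):
library(data.table)

# same first step as above: keep a copy of the action date so it survives the rolling join
actions[, date.copy := date]

# for each action type, roll every refuel date in a back to the most recent action
# on or before it; days is the elapsed time, N is an index used as the reshape key
step1 <- actions[, .SD[a, .(days = date - date.copy, N = .I), roll = TRUE, on = 'date'],
                 by = action]

# one column of day counts per action type, keyed by N
step2 <- dcast(step1, N ~ action, value.var = 'days')

# bind the day counts back onto the refuelling table
cbind(a, step2)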