Merge and Left Join keeps duplicating line items in R

I am struggling with an issue that I am unable to fix! I have two data frames ("Master" and "Hours"). The 'Master' df has many columns, but the relevant ones are shown below:
Master
StoreNumber ... MON TUE WED THU FRI SAT SUN
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
...
NB: The Master df has many columns between StoreNumber and the days of the week and holds a lot of data (about 3000 stores).
Hours
BranchNumber Day TimeDiff
1 MON 7.50
1 TUE 6.00
1 WED 8.50
1 THU 2.00
1 FRI 1.00
1 SAT 2.50
3 MON 7.50
3 TUE 6.00
3 WED 8.50
3 THU 2.00
3 FRI 1.00
3 SAT 2.50
3 SUN 5.00
...
The idea is to match the 'Hours' BranchNumber with the 'Master' StoreNumber. Once there is a match, the Day column from the 'Hours' table should be matched with the days of the week in the 'Master' table, and each day column should be populated with the corresponding value from the 'TimeDiff' column. If a store number has no matching branch number (like StoreNumber 2), that row should be skipped. Likewise, if a day is missing for a branch (BranchNumber 1 has no data for SUN), the corresponding cell in the 'Master' table should stay 0. This should work for any day of the week.
The output should be the 'Master' Table but complete with all the days of the week data from the 'Hours' Table. In this example, it should look like:
StoreNumber ... MON TUE WED THU FRI SAT SUN
1 7.50 6.00 8.50 2.00 1.00 2.50 0
2 0 0 0 0 0 0 0
3 7.50 6.00 8.50 2.00 1.00 2.50 5.00
...
The code I have tried is semi-working, but I am not sure if it's the correct approach. The biggest problem is that it duplicates the rows, except for the first row. For instance, the output looks more like this:
StoreNumber
1
2
2
3
3
4
4
5
5
5
Every row is duplicated and some are tripled, with all 87 columns identical; however, the day-of-week values in the duplicated rows are all 0.
merged <- Master %>% select(-c("MON","TUE","WED","THU","FRI","SAT","SUN")) %>%
  left_join(
    Hours %>% pivot_wider(names_from = Day, values_from = TimeDiff),
    by = c('StoreNumber' = 'BranchNumber'))
merged <- merged %>% replace(is.na(.),0)
Sorry for the long question. This issue has been bugging me for a while, so any help or advice would be greatly appreciated.

If I understand correctly, the Master table has many columns and only the columns MON to SUN need to be updated.
Here are two approaches that use data.table's ability to update in a join. Only the relevant columns are modified by reference, i.e., without copying the whole data object. This avoids reshaping (or pivoting) the Master table back and forth.
Variant 1
library(data.table)
days <- names(Master)[which(names(Master) == "MON") + (0:6)]
setDT(Master)[, (days) := lapply(.SD, as.double), .SDcols = days]
for (d in days) {
Master[Hours, on =.(StoreNumber = BranchNumber), (d) := TimeDiff[d == Day], by = .EACHI]
}
Master[]
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN
1: 1 a 7.5 6 8.5 2 1 2.5 0
2: 2 b 0.0 0 0.0 0 0 0.0 0
3: 3 c 7.5 6 8.5 2 1 2.5 5
Explanation
days contains the names of the columns.
days <- names(Master)[which(names(Master) == "MON") + (0:6)] is equivalent to
days <- c("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")
data.table requires consistent data types when parts of a column are updated. The day columns in Master are initialized to integer zero but TimeDiff in Hours is numeric. Therefore, the day columns in Master are coerced to double before updating.
The for loop iterates over each day column and performs the update join for that column. For each match (by = .EACHI), the TimeDiff for the relevant day is picked.
In order to verify that Master has not been copied we can call
data.table::address(Master)
before and after the operation: The address of Master has not changed.
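As a minimal sketch of that check, assuming small toy tables rather than the real data (the names before and after are illustrative only):

```r
library(data.table)

# Toy stand-ins for Master and Hours (assumed data, not the real tables)
Master <- data.table(StoreNumber = 1:3, MON = c(0, 0, 0))
Hours  <- data.table(BranchNumber = c(1L, 3L), Day = "MON", TimeDiff = c(7.5, 6))

before <- address(Master)
# Update join: only the matching rows of the existing MON column are modified
Master[Hours, on = .(StoreNumber = BranchNumber), MON := i.TimeDiff]
after <- address(Master)

identical(before, after)  # TRUE if Master was updated in place
```

If the two addresses differ, the object was copied somewhere along the way.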
Variant 2
This approach is a bit leaner. It also uses an update join but it is different to variant 1 as it reshapes (or pivots) Hours from long to wide format and removes the days columns from Master instead of coercing a bunch of integer zeroes to type numeric:
library(data.table)
days <- c("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")
Hours_wide <- dcast(setDT(Hours)[, Day := ordered(Day, levels = days)], BranchNumber ~ Day)
setDT(Master)[, (days) := NULL][
Hours_wide, on = .(StoreNumber = BranchNumber), (days) := mget(paste0("i.", days))]
Master[]
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN
1: 1 a 7.5 6 8.5 2 1 2.5 NA
2: 2 b NA NA NA NA NA NA NA
3: 3 c 7.5 6 8.5 2 1 2.5 5
Note that missing elements are now initialized to / indicated by NA which is much easier to detect, IMHO. If required, the NAs can be turned into another numeric value by
Master[, (days) := lapply(.SD, nafill, fill = 0), .SDcols = days][]
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN
1: 1 a 7.5 6 8.5 2 1 2.5 0
2: 2 b 0.0 0 0.0 0 0 0.0 0
3: 3 c 7.5 6 8.5 2 1 2.5 5
This approach uses mget(paste0("i.", days)) to pick the days columns from Hours_wide. If there are columns with the same name in both data.tables in a join, we can distinguish them by prefixing the column names with x. or i., respectively. Thus, x.MON refers to the MON column from the first data.table, which is Master in this case, and i.MON refers to the MON column from the second data.table, which is Hours_wide. mget() takes the column names as character strings and returns a list of the values of the respective columns.
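The prefixes can be seen in isolation with a small sketch. The tables A and B below are made-up stand-ins for Master and Hours_wide, not the real data:

```r
library(data.table)

# Toy tables that share the column name MON (assumed data)
A <- data.table(id = 1:3, MON = c(0, 0, 0))
B <- data.table(id = c(1L, 3L), MON = c(7.5, 2))

# Inside A[B, ...], x.MON is taken from A and i.MON from B
both <- A[B, on = .(id), .(id, from_x = x.MON, from_i = i.MON)]

# mget() turns a character vector of names into a list of columns,
# which := then assigns to the columns named in cols
cols <- "MON"
A[B, on = .(id), (cols) := mget(paste0("i.", cols))]
A[]
```

Here both holds the two versions of MON side by side for the matched rows, and the final update overwrites MON in A only where an id matches.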
Variant 2 - Edit 1
The above code can be simplified to
setDT(Master)[, (days) := NULL][
Hours_wide, on = .(StoreNumber = BranchNumber), (days) := mget(days)][]
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN
1: 1 a 7.5 6 8.5 2 1 2.5 NA
2: 2 b NA NA NA NA NA NA NA
3: 3 c 7.5 6 8.5 2 1 2.5 5
Because setDT(Master)[, (days) := NULL] has already removed the columns MON to SUN from Master, there is no ambiguity in column names. Thus, the column names MON to SUN can be used without the i. prefix, as the only columns named MON to SUN are in Hours_wide.
Variant 2 - Edit 2
With development version 1.14.1 as of 2021-05-10, a new interface for programming on data.table has been added (see item 10 in NEWS and the new vignette programming on data.table). Instead of get()/ mget() the new env argument is recommended:
library(data.table) # development version 1.14.1 used
days <- c("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")
Hours_wide <- dcast(setDT(Hours)[, Day := ordered(Day, levels = days)], BranchNumber ~ Day)
setDT(Master)[, (days) := NULL][
Hours_wide, on = .(StoreNumber = BranchNumber), (days) := s,
env = list(s = as.list(days))][]
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN
1: 1 a 7.5 6 8.5 2 1 2.5 NA
2: 2 b NA NA NA NA NA NA NA
3: 3 c 7.5 6 8.5 2 1 2.5 5
Variant 3: env parameter and fcoalesce()
OP's expected result shows 0 instead of NA. With Variant 2 above, this was achieved by a separate update step using nafill().
This separate update step can be avoided by using the fcoalesce() function in the update join:
library(data.table) # development version 1.14.1 used
days <- c("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")
Hours_wide <- dcast(setDT(Hours)[, Day := ordered(Day, levels = days)], BranchNumber ~ Day)
setDT(Master)[, (days) := lapply(.SD, as.double), .SDcols = days][
Hours_wide, on = .(StoreNumber = BranchNumber), (days) := lapply(s, fcoalesce, 0),
env = list(s = as.list(paste0("i.", days)))][]
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN
1: 1 a 7.5 6 8.5 2 1 2.5 0
2: 2 b 0.0 0 0.0 0 0 0.0 0
3: 3 c 7.5 6 8.5 2 1 2.5 5
Data
library(data.table)
Master <- fread("
StoreNumber OtherCol MON TUE WED THU FRI SAT SUN
1 a 0 0 0 0 0 0 0
2 b 0 0 0 0 0 0 0
3 c 0 0 0 0 0 0 0
", data.table = FALSE)
Hours <- fread("
BranchNumber Day TimeDiff
1 MON 7.50
1 TUE 6.00
1 WED 8.50
1 THU 2.00
1 FRI 1.00
1 SAT 2.50
3 MON 7.50
3 TUE 6.00
3 WED 8.50
3 THU 2.00
3 FRI 1.00
3 SAT 2.50
3 SUN 5.00
", data.table = FALSE)

Based on @GregorThomas's comments, here is a longer-then-wider approach:
master <- data.frame(
StoreNumber = 1:3,
MON = 0,
TUE = 0,
WED = 0,
THU = 0,
FRI = 0,
SAT = 0,
SUN = 0
)
hours <- read.table(text = "BranchNumber Day TimeDiff
1 MON 7.50
1 TUE 6.00
1 WED 8.50
1 THU 2.00
1 FRI 1.00
1 SAT 2.50
3 MON 7.50
3 TUE 6.00
3 WED 8.50
3 THU 2.00
3 FRI 1.00
3 SAT 2.50
3 SUN 5.00", header = TRUE)
library(dplyr)
library(tidyr)
master %>%
pivot_longer(
cols = MON:SUN,
names_to = "Day",
values_to = "Time"
) %>%
left_join(hours, by = c("StoreNumber" = "BranchNumber", "Day")) %>%
mutate(TimeDiff = replace_na(TimeDiff, 0),
Time = TimeDiff) %>%
select(-TimeDiff) %>%
pivot_wider(
id_cols = StoreNumber,
names_from = Day,
values_from = Time
)
# A tibble: 3 x 8
StoreNumber MON TUE WED THU FRI SAT SUN
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 7.5 6 8.5 2 1 2.5 0
2 2 0 0 0 0 0 0 0
3 3 7.5 6 8.5 2 1 2.5 5
Edit
Here is a version where master has additional columns and the output is stored:
master <- data.frame(
StoreNumber = 1:3,
other_colum = c("A", "B", "C"),
MON = 0,
TUE = 0,
WED = 0,
THU = 0,
FRI = 0,
SAT = 0,
SUN = 0
)
hours <- read.table(text = "BranchNumber Day TimeDiff
1 MON 7.50
1 TUE 6.00
1 WED 8.50
1 THU 2.00
1 FRI 1.00
1 SAT 2.50
3 MON 7.50
3 TUE 6.00
3 WED 8.50
3 THU 2.00
3 FRI 1.00
3 SAT 2.50
3 SUN 5.00", header = TRUE)
library(dplyr)
library(tidyr)
master <- master %>%
pivot_longer(
cols = MON:SUN,
names_to = "Day",
values_to = "Time"
) %>%
left_join(hours, by = c("StoreNumber" = "BranchNumber", "Day")) %>%
mutate(TimeDiff = replace_na(TimeDiff, 0),
Time = TimeDiff) %>%
select(-TimeDiff) %>%
pivot_wider(
names_from = Day,
values_from = Time
)
master
# A tibble: 3 x 9
StoreNumber other_colum MON TUE WED THU FRI SAT SUN
<int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A 7.5 6 8.5 2 1 2.5 0
2 2 B 0 0 0 0 0 0 0
3 3 C 7.5 6 8.5 2 1 2.5 5

Related

How to get the lowest value per interval in R

I have data in the zoo format in the following structure
date val
2020-11-01 3244
2020-11-02 3273
2020-11-03 2974
2020-11-04 3283
2020-11-05 3922
2020-11-06 3669
2020-11-07 4246
2020-11-08 4594
2020-11-09 4086
2020-11-10 4302
2020-11-11 4559
2020-11-12 4981
2020-11-13 4741
2020-11-14 5267
that I am trying to get into this form
date val
Mon 2020-11-01 3244
Tue 2020-11-02 3273
Wed 2020-11-03 2974
Thu 2020-11-04 3283
Fri 2020-11-05 3922
Sat 2020-11-06 3669
Sun 2020-11-07 4246
Mon 2020-11-08 4594
Tue 2020-11-09 4086
Wed 2020-11-10 4302
Thu 2020-11-11 4559
Fri 2020-11-12 4981
Sat 2020-11-13 4741
Sun 2020-11-14 5267
in order to count the number of times each weekday holds the smallest value of its week:
Mon = 1
Tue = 1
Wed = 0
Thu = 0
Fri = 0
Sat = 0
Sun = 0
I tried keeping the data in a flat format before adding the dates with zoo, and I added the weekdays, but I could not get the count to work. Does anyone know an easier way to do it? I am open to visual solutions.
If you store the data in a dataframe you can create a new column with weekdays and week number, for each week keep the row with minimum value and count number of weekdays that have the minimum value.
library(dplyr)
df %>%
mutate(date = as.Date(date),
weekday = factor(weekdays(date)),
week_year = format(date, '%Y-%W')) %>%
group_by(week_year) %>%
slice(which.min(val)) %>%
ungroup %>%
count(weekday, .drop = FALSE)
The following should do the trick:
library(dplyr)
library(lubridate)
df$date <- as.Date(df$date)
df$day <- weekdays(df$date)
# lubridate offers two week definitions; pick one and use it consistently.
# week(): number of complete 7-day periods since January 1, plus one
df$week <- week(df$date)
# isoweek(): ISO 8601 week, which starts on Monday
df$isoweek <- isoweek(df$date)
df <- df %>% group_by(isoweek) %>% mutate(min_here = val == min(val))
df %>% group_by(day) %>% summarise(sum(min_here))
# A tibble: 7 x 2
day `sum(min_here)`
<chr> <int>
1 Friday 0
2 Monday 1
3 Saturday 0
4 Sunday 1
5 Thursday 0
6 Tuesday 1
7 Wednesday 0
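The choice of week definition matters because it decides which days are grouped together. A small base R sketch, using format() codes instead of lubridate (an assumption made here to keep the example dependency-free), shows two adjacent dates landing in the same or in different weeks:

```r
# A Sunday and the following Monday from the sample data
d <- as.Date(c("2020-11-01", "2020-11-02"))

weekdays(d)

week_sun <- format(d, "%U")  # week of year, weeks start on Sunday
week_mon <- format(d, "%W")  # week of year, weeks start on Monday

week_sun[1] == week_sun[2]   # same week when weeks start on Sunday
week_mon[1] == week_mon[2]   # different weeks when weeks start on Monday
```

This is why the counts per weekday can differ between answers that group by different week definitions.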
Base R
... though it seems a little clumsier in comparison to dplyr's mechanics in RonakShah's answer, or data.table below:
ismin <- ave(dat$val, list(format(dat$date, format = "%U")),
FUN = function(z) seq_along(z) == which.min(z))
aggregate(ismin, list(weekday = weekdays(dat$date)), FUN = sum)
# weekday x
# 1 Friday 0
# 2 Monday 1
# 3 Saturday 0
# 4 Sunday 0
# 5 Thursday 0
# 6 Tuesday 1
# 7 Wednesday 0
(The order is not emphasized here.)
data.table
library(data.table)
DT <- as.data.table(dat)
DT[, ismin := seq_len(.N) == which.min(val), by = format(date, format = "%U")
][, weekday := weekdays(date)][, .(n = sum(ismin)), by = .(weekday) ]
# weekday n
# <char> <int>
# 1: Sunday 0
# 2: Monday 1
# 3: Tuesday 1
# 4: Wednesday 0
# 5: Thursday 0
# 6: Friday 0
# 7: Saturday 0
Data
dat <- structure(list(date = structure(c(18567, 18568, 18569, 18570, 18571, 18572, 18573, 18574, 18575, 18576, 18577, 18578, 18579, 18580), class = "Date"), val = c(3244L, 3273L, 2974L, 3283L, 3922L, 3669L, 4246L, 4594L, 4086L, 4302L, 4559L, 4981L, 4741L, 5267L)), class = "data.frame", row.names = c(NA, -14L))

R - find change across same days of week for multiple variables and aggregate

With data like below
text = "
date,weekday,hour,a,b
12/2/2019,Mon,8,18.17183824,0.017741935
12/2/2019,Mon,9,18.11228506,0.020967742
12/9/2019,Mon,8,16.77932274,0.020322581
12/9/2019,Mon,9,16.97327971,0.019677419
12/3/2019,Tue,8,18.17183824,0.017741935
12/3/2019,Tue,10,18.11228506,0.020967742
12/10/2019,Tue,8,16.77932274,0.020322581
12/10/2019,Tue,10,16.97327971,0.019677419
"
df = read.table(textConnection(text), sep=",", header = T)
I need to find the change in the variables a and b on a weekday-to-weekday basis.
Example for a, the change would be calculated as follows
Change for hour 8 on Mondays = (16.77932274 - 18.17183824)/18.17183824
Change for hour 9 on Mondays = (16.97327971 - 18.11228506)/18.11228506
Change for hour 8 on Tuesdays = (16.77932274 - 18.17183824)/18.17183824
Change for hour 10 on Tuesdays = (16.97327971 - 18.11228506)/18.11228506
Average change for variable a in the dataset = Avg of 1,2,3,4
Would appreciate help
For one variable, I would have converted from long to wide format and computed the gain for each pair of same weekdays by adding week+number as a label for the values of a. The challenge is doing it for multiple variables (a and b here); my real data has more than these two variables.
We can group_by weekday and hour, use lead/lag to get next/previous value and use mutate_at to apply it for multiple columns.
library(dplyr)
df %>%
group_by(weekday, hour) %>%
mutate_at(vars(a:b), list(change = ~(lead(.) - .)/.))
# date weekday hour a b a_change b_change
# <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#1 12/2/2019 Mon 8 18.2 0.0177 -0.0766 0.145
#2 12/2/2019 Mon 9 18.1 0.0210 -0.0629 -0.0615
#3 12/9/2019 Mon 8 16.8 0.0203 NA NA
#4 12/9/2019 Mon 9 17.0 0.0197 NA NA
#5 12/3/2019 Tue 8 18.2 0.0177 -0.0766 0.145
#6 12/3/2019 Tue 10 18.1 0.0210 -0.0629 -0.0615
#7 12/10/2019 Tue 8 16.8 0.0203 NA NA
#8 12/10/2019 Tue 10 17.0 0.0197 NA NA
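The question also asks for the average change per variable, which the code above does not compute. A possible sketch, assuming dplyr 1.0.0 or later (where across() supersedes mutate_at()) and assuming the rows are already in chronological order within each weekday/hour group; the names changes and avg_change are illustrative:

```r
library(dplyr)

df <- read.csv(text = "date,weekday,hour,a,b
12/2/2019,Mon,8,18.17183824,0.017741935
12/2/2019,Mon,9,18.11228506,0.020967742
12/9/2019,Mon,8,16.77932274,0.020322581
12/9/2019,Mon,9,16.97327971,0.019677419
12/3/2019,Tue,8,18.17183824,0.017741935
12/3/2019,Tue,10,18.11228506,0.020967742
12/10/2019,Tue,8,16.77932274,0.020322581
12/10/2019,Tue,10,16.97327971,0.019677419")

# Relative change to the next observation of the same weekday and hour
changes <- df %>%
  group_by(weekday, hour) %>%
  mutate(across(a:b, ~ (lead(.x) - .x) / .x, .names = "{.col}_change")) %>%
  ungroup()

# Average of the available changes per variable (last week has no successor)
avg_change <- changes %>%
  summarise(across(ends_with("_change"), ~ mean(.x, na.rm = TRUE)))
avg_change
```

The NA values come from the last observation in each group and are dropped by na.rm = TRUE before averaging.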
Here is an option with data.table
library(data.table)
setDT(df)[, c('a_change', 'b_change') :=
(shift(.SD, type = 'lead') - .SD)/.SD , .(weekday, hour), .SDcols = a:b]

R - Is there a simple way to convert all prefixes of day of week to number in a data frame column?

Given the following dataframe generated by the code:
> df = data.frame(day=c('Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'),
+ value=c(2,4,1,2,3,4,1))
> df
day value
1 Mon 2
2 Tues 4
3 Wed 1
4 Thurs 2
5 Fri 3
6 Sat 4
7 Sun 1
I want to convert Mon, ... , Sun to 1, ...7 respectively to reflect the indexing of the day of the week.
Is there an easier/more straightforward way of doing this instead of using a defined function that checks the column value and converts the prefix to its assigned value?
Define the days in the order you want and then use match
wdays <- c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun")
match(df$day, wdays)
#[1] 1 2 3 4 5 6 7
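To store the result in the data frame, the same call can be assigned to a new column. The sketch below uses a hypothetical column name day_num and also shows that a label missing from wdays yields NA rather than an error:

```r
df <- data.frame(day = c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"),
                 value = c(2, 4, 1, 2, 3, 4, 1))
wdays <- c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun")

# Position of each label in wdays becomes the day index
df$day_num <- match(df$day, wdays)

# A label that is not in wdays has no position
unknown <- match("Monday", wdays)
is.na(unknown)
```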

Extract observations based on rownames and range of colnames

I have two data frames: one is the base data frame and the other is the query data frame.
Base Dataframe (base_df):
Mon Tue Wed Thu Fri Sat
A 5.23 0.01 6.81 8.67 0.10 6.21
B 6.26 2.19 4.28 5.57 0.16 2.81
C 7.41 2.63 4.32 6.57 0.20 1.69
D 6.17 1.50 5.30 9.22 2.19 5.47
E 1.23 9.01 8.09 1.29 7.65 4.57
Query Dataframe (query_df):
Person Start End
A Tue Thu
C Mon Wed
D Thu Sat
C Thu Sat
B Wed Fri
I want to extract all the observations for a particular person between the start and end days. The difference between start and end days is always three (inclusive of start and end days).
Hence the output wanted is:
Person Start End D1 D2 D3
A Tue Thu 0.01 6.81 8.67
C Mon Wed 7.41 2.63 4.32
D Thu Sat 9.22 2.19 5.47
C Thu Sat 6.57 0.20 1.69
B Wed Fri 4.28 5.57 0.16
I want to avoid a loop because the actual base_df is more than 35000 rows. Is there a data.table solution? Solutions using other data structures are good too. Thank you!
Another base R solution, using mapply...
query_df <- cbind(query_df,
t(mapply(function(p,s,e) {
base_df[p, match(s, names(base_df)):match(e, names(base_df))]},
query_df$Person,
query_df$Start,
query_df$End)))
names(query_df)[4:6] <- c("D1", "D2", "D3")
query_df
Person Start End D1 D2 D3
1 A Tue Thu 0.01 6.81 8.67
2 C Mon Wed 7.41 2.63 4.32
3 D Thu Sat 9.22 2.19 5.47
4 C Thu Sat 6.57 0.2 1.69
5 B Wed Fri 4.28 5.57 0.16
The data.table solution below should also work for varying numbers of days between the Start and End days (not just 3-day periods) thanks to a non-equi join and melt() / dcast() for reshaping:
library(data.table)
setDT(base_df)
setDT(query_df)
# reshape from wide to long
long <- melt(base_df, id.vars = "Person", variable.name = "Day")
# align factor levels
cols <- c("Start", "End")
query_df[, (cols) := lapply(.SD, factor, levels = levels(long$Day)), .SDcols = cols][
# add row id because Person is not unique
, rn := .I]
# non-equi right join, i.e., take all rows of query_df
long[query_df, on = .(Person, Day >= Start, Day <= End),
.(rn, Person, Start = i.Start, End = i.End, value)][
# reshape from long to wide
, dcast(.SD, rn + Person + ... ~ rowid(rn, prefix = "D"))]
rn Person Start End D1 D2 D3
1: 1 A Tue Thu 0.01 6.81 8.67
2: 2 C Mon Wed 7.41 2.63 4.32
3: 3 D Thu Sat 9.22 2.19 5.47
4: 4 C Thu Sat 6.57 0.20 1.69
5: 5 B Wed Fri 4.28 5.57 0.16
Note that Day is a factor with the names of weekdays as factor levels in order of appearance:
str(long)
Classes ‘data.table’ and 'data.frame': 30 obs. of 3 variables:
$ Person: chr "A" "B" "C" "D" ...
$ Day : Factor w/ 6 levels "Mon","Tue","Wed",..: 1 1 1 1 1 2 2 2 2 2 ...
$ value : num 5.23 6.26 7.41 6.17 1.23 0.01 2.19 2.63 1.5 9.01 ...
- attr(*, ".internal.selfref")=<externalptr>
Aligned factor levels are crucial for the non-equi join.
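The reason can be sketched in base R: a factor is compared through its integer codes, and those codes depend on the level order. The vector names below are illustrative only:

```r
day_levels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Levels given explicitly: codes follow the order of the week
aligned <- factor(c("Tue", "Thu"), levels = day_levels)
as.integer(aligned)

# Default levels are sorted alphabetically, so "Thu" comes before "Tue"
unaligned <- factor(c("Tue", "Thu"))
as.integer(unaligned)
```

With the alphabetical default, a range such as Tue to Thu would not map to increasing codes, so the >= and <= conditions of the join would select the wrong days.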
Data
library(data.table)
base_df <- fread(
"Person Mon Tue Wed Thu Fri Sat
A 5.23 0.01 6.81 8.67 0.10 6.21
B 6.26 2.19 4.28 5.57 0.16 2.81
C 7.41 2.63 4.32 6.57 0.20 1.69
D 6.17 1.50 5.30 9.22 2.19 5.47
E 1.23 9.01 8.09 1.29 7.65 4.57"
)
query_df <- fread(
"Person Start End
A Tue Thu
C Mon Wed
D Thu Sat
C Thu Sat
B Wed Fri"
)
A tidyverse answer
I reshape base_df, then join and slice the correct days, then reshape back.
library(tidyr)
library(dplyr)
base_df <- tibble::rownames_to_column(base_df, 'Person')
days <- names(base_df)[-1]
base_df %>%
gather(day, value, -Person) %>%
right_join(mutate(query_df, i = row_number())) %>%
group_by(i) %>%
slice(which(days == Start):which(days == End)) %>%
mutate(col = c('D1', 'D2', 'D3')) %>%
select(-day, -i) %>%
spread(col, value)
data.table solution:
Here I use get to extract columns (e.g. Mon) from a data.table object.
library(data.table)
# Prepare data
base_df$Person <- rownames(base_df)
d <- merge(query_df, base_df, "Person", sort = FALSE)
setDT(d)
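# Weekday column names in their original order (everything except Person)
days <- setdiff(names(base_df), "Person")
# Extract mid day (day between start and end)
d[, Mid := days[which(Start == days) + 1], 1:nrow(d)]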
# Extract columns using get
d[, .(Person, Start, End,
D1 = get(Start), D2 = get(Mid), D3 = get(End)), 1:nrow(d)][, nrow := NULL][]
Person Start End D1 D2 D3
1: A Tue Thu 0.01 6.81 8.67
2: C Mon Wed 7.41 2.63 4.32
3: D Thu Sat 9.22 2.19 5.47
4: C Thu Sat 6.57 0.20 1.69
5: B Wed Fri 4.28 5.57 0.16
Base R solution:
# Order of days
days <- names(base_df)
# Order of persons
subjects <- rownames(base_df)
res <- apply(query_df, 1, function(x) {
# Extract observation between start:end date
foo <- base_df[x[1] == subjects, which(x[2] == days):which(x[3] == days)]
colnames(foo) <- paste0("D", 1:3)
foo})
# Merge with original query_df
res <- cbind(query_df, do.call("rbind", res))
rownames(res) <- NULL
res
A base solution using indexing with a numeric matrix:
ri <- match(query_df$Person, rownames(base_df))
ci <- match(query_df$Start, names(base_df))
cbind(query_df, `dim<-`(base_df[cbind(ri, rep(ci, 3) + rep(0:2, each = nrow(query_df)))],
c(nrow(query_df), 3)))
# Person Start End 1 2 3
# 1 A Tue Thu 0.01 6.81 8.67
# 2 C Mon Wed 7.41 2.63 4.32
# 3 D Thu Sat 9.22 2.19 5.47
# 4 C Thu Sat 6.57 0.20 1.69
# 5 B Wed Fri 4.28 5.57 0.16
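Indexing with a two-column numeric matrix picks one element per matrix row, where the first column gives the row index and the second the column index. A toy sketch on a plain matrix (m and idx are made-up names):

```r
# 3 x 4 matrix filled column by column with 1..12
m <- matrix(1:12, nrow = 3)

# Each row of idx is one (row, column) pair
idx <- cbind(c(1, 3), c(2, 4))

picked <- m[idx]  # elements m[1, 2] and m[3, 4]
picked
```

The answer above builds such an index for three consecutive columns per query row and then reshapes the flat result into a three-column matrix with `dim<-`.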

How to fill missing and adjust irregular time intervals in a data.frame in R

I have several datasets, mostly with 15 min time intervals. However, some datasets have missing readings (e.g., the 3rd row in the sample dataset was supposed to be "May 1 2015 00:40AM"). In addition, there are some timesteps that are longer than 15 min (e.g., see the 3rd and 6th rows).
How can I add the missing time steps so that my Date continues in 15 min intervals and, at the same time, adjust those time steps with more than 15 min intervals to 15 min?
s <- data.frame(Date = c(
"May 1 2015 00:10AM","May 1 2015 00:25AM",
"May 1 2015 00:56AM","May 1 2015 01:10AM",
"May 1 2015 01:25AM","May 1 2015 01:41AM",
"May 1 2015 01:55AM"),
val = c(1:7)
)
My desired output would be the following:
> s
Date val
1 May 1 2015 00:10AM 1
2 May 1 2015 00:25AM 2
3 May 1 2015 00:40AM NA
4 May 1 2015 00:55AM 3
5 May 1 2015 01:10AM 4
6 May 1 2015 01:25AM 5
7 May 1 2015 01:40AM 6
8 May 1 2015 01:55AM 7
You could try the following:
First, turn your s dataframe variable "Date" into POSIXct, so you can work with it:
library(dplyr)  # provides the %>% pipe used below
s <- data.frame(Date = c(
"May 1 2015 00:10AM","May 1 2015 00:25AM",
"May 1 2015 00:56AM","May 1 2015 01:10AM",
"May 1 2015 01:25AM","May 1 2015 01:41AM",
"May 1 2015 01:55AM"),
val = c(1:7)
) %>% dplyr::mutate(Date = lubridate::parse_date_time(Date, "b d Y HM"))
Second, you can join this with another data frame that has all the time intervals you are expecting. First, we construct it, using a difference of time intervals (15 mins in this case):
one <- lubridate::parse_date_time("May 1 2015 00:10AM", orders = "b d Y HM")
two <- lubridate::parse_date_time("May 1 2015 00:25AM", orders = "b d Y HM")
dif <- two - one
Now the dataframe:
other_df <- data.frame(
Date = seq(from = lubridate::parse_date_time("May 1 2015 00:10AM",
orders = "b d Y HM"),
to = lubridate::parse_date_time("May 1 2015 01:55AM",
orders = "b d Y HM"),
by = dif))
Join the two:
result <- dplyr::full_join(other_df, s, by = "Date")
> result
Date val
1 2015-05-01 00:10:00 1
2 2015-05-01 00:25:00 2
3 2015-05-01 00:40:00 NA
4 2015-05-01 00:55:00 NA
5 2015-05-01 01:10:00 4
6 2015-05-01 01:25:00 5
7 2015-05-01 01:40:00 NA
8 2015-05-01 01:55:00 7
9 2015-05-01 00:56:00 3
10 2015-05-01 01:41:00 6
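Note that the join leaves the off-grid readings (00:56 and 01:41) as extra rows instead of aligning them to 00:55 and 01:40. One possible way to get the desired output is to snap each timestamp to the nearest 15-minute step before joining. The sketch below is in base R and makes a few assumptions: the timestamps are already parsed, the first reading defines the grid origin, and no two readings round to the same step.

```r
# Sample data with timestamps already parsed (UTC assumed for illustration)
s <- data.frame(
  Date = as.POSIXct(c("2015-05-01 00:10", "2015-05-01 00:25",
                      "2015-05-01 00:56", "2015-05-01 01:10",
                      "2015-05-01 01:25", "2015-05-01 01:41",
                      "2015-05-01 01:55"), tz = "UTC"),
  val = 1:7
)

step   <- 15 * 60        # grid step in seconds
origin <- min(s$Date)    # the first reading defines the grid

# Snap every timestamp to the nearest grid point
offset <- as.numeric(difftime(s$Date, origin, units = "secs"))
s$Date <- origin + round(offset / step) * step

# Complete 15-minute grid, then keep every grid point (NA where no reading)
grid   <- data.frame(Date = seq(min(s$Date), max(s$Date), by = step))
result <- merge(grid, s, by = "Date", all.x = TRUE)
result
```

If two readings could fall on the same grid point, they would need to be aggregated (for example averaged) before the merge.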
