The data frame currently looks like this
# A tibble: 20 x 3
Badge `Effective Date` day_off
<dbl> <dttm> <int>
1 3162 2013-01-16 00:00:00 1
2 3162 2013-01-19 00:00:00 2
3 3162 2013-02-21 00:00:00 3
5 3585 2015-10-21 00:00:00 5
6 3586 2014-05-21 00:00:00 6
7 3586 2014-05-23 00:00:00 7
I would like to create a new row for each badge number for each day between effective dates, so that it looks something like this. The data frame is huge, so resource-intensive tidyverse functions like complete won't work.
# A tibble: 20 x 3
Badge `Effective Date` day_off
<dbl> <dttm> <int>
1 3162 2013-01-16 00:00:00 1
2 3162 2013-01-17 00:00:00 1
3 3162 2013-01-18 00:00:00 1
4 3162 2013-01-19 00:00:00 2
5 3162 2013-01-20 00:00:00 2
6 3162 2013-01-21 00:00:00 3
7 3585 2015-10-21 00:00:00 5
8 3586 2014-05-21 00:00:00 6
9 3586 2014-05-22 00:00:00 6
10 3586 2014-05-23 00:00:00 7
You can create a table where, for each Badge group, you have a sequence of datetimes from the first to the last. A rolling join of this table to the original data then gives the desired output:
library(data.table)
## Create reproducible example as an R object
# Please do this yourself next time using dput. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
df <- fread('
Badge , Effective_Date , day_off
3162 , 2013-01-16 00:00:00 , 1
3162 , 2013-01-19 00:00:00 , 2
3162 , 2013-01-21 00:00:00 , 3
3585 , 2015-10-21 00:00:00 , 5
3586 , 2014-05-21 00:00:00 , 6
3586 , 2014-05-23 00:00:00 , 7
')
df[, Effective_Date := as.POSIXct(Effective_Date)]
## Rolling join
setDT(df) # required if data wasn't originally a data.table as above
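# The inner expression builds, for each Badge, the complete daily sequence of Effective_Date values;
# roll = TRUE then carries the most recent day_off forward onto dates with no exact match (LOCF).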
df[df[, .(Effective_Date = seq(min(Effective_Date), max(Effective_Date), by = '1 day')),
by = .(Badge)],
on = .(Badge, Effective_Date), roll = TRUE]
#> Badge Effective_Date day_off
#> 1: 3162 2013-01-16 1
#> 2: 3162 2013-01-17 1
#> 3: 3162 2013-01-18 1
#> 4: 3162 2013-01-19 2
#> 5: 3162 2013-01-20 2
#> 6: 3162 2013-01-21 3
#> 7: 3585 2015-10-21 5
#> 8: 3586 2014-05-21 6
#> 9: 3586 2014-05-22 6
#> 10: 3586 2014-05-23 7
Created on 2021-07-16 by the reprex package (v2.0.0)
A tidyverse way would be using complete and fill -
library(dplyr)
library(tidyr)
df %>%
  group_by(Badge) %>%
  complete(Effective_Date = seq(min(Effective_Date),
                                max(Effective_Date), by = '1 day')) %>%
  fill(day_off) %>%
  ungroup()
# Badge Effective_Date day_off
# <int> <dttm> <int>
# 1 3162 2013-01-16 00:00:00 1
# 2 3162 2013-01-17 00:00:00 1
# 3 3162 2013-01-18 00:00:00 1
# 4 3162 2013-01-19 00:00:00 2
# 5 3162 2013-01-20 00:00:00 2
# 6 3162 2013-01-21 00:00:00 3
# 7 3585 2015-10-21 00:00:00 5
# 8 3586 2014-05-21 00:00:00 6
# 9 3586 2014-05-22 00:00:00 6
#10 3586 2014-05-23 00:00:00 7
I am trying to identify clusters in a data frame that are within 4 consecutive days of the first event. Additionally, we have a grouping variable.
Here is an example:
library(data.table)

startDate <- as.POSIXct("2022-10-01")
dt1 <- data.table(
  id = 1:20,
  timestamp = startDate + lubridate::days(rep(1:10, 2)) + lubridate::hours(1:20),
  group_id = rep(c("A", "B"), each = 10)
)
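The t_diff column below is the offset of each timestamp from the first timestamp of its group; it can be reproduced with something like this (a sketch, as the code that produced the column isn't shown):
dt1[, t_diff := difftime(timestamp, min(timestamp), units = "days"), by = group_id]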
id timestamp group_id t_diff
1: 1 2022-10-02 01:00:00 A 0.000000 days
2: 2 2022-10-03 02:00:00 A 1.041667 days
3: 3 2022-10-04 03:00:00 A 2.083333 days
4: 4 2022-10-05 04:00:00 A 3.125000 days
5: 5 2022-10-06 05:00:00 A 4.166667 days
6: 6 2022-10-07 06:00:00 A 5.208333 days
7: 7 2022-10-08 07:00:00 A 6.250000 days
8: 8 2022-10-09 08:00:00 A 7.291667 days
9: 9 2022-10-10 09:00:00 A 8.333333 days
10: 10 2022-10-11 10:00:00 A 9.375000 days
11: 11 2022-10-02 11:00:00 B 0.000000 days
12: 12 2022-10-03 12:00:00 B 1.041667 days
13: 13 2022-10-04 13:00:00 B 2.083333 days
14: 14 2022-10-05 14:00:00 B 3.125000 days
15: 15 2022-10-06 15:00:00 B 4.166667 days
16: 16 2022-10-07 16:00:00 B 5.208333 days
17: 17 2022-10-08 17:00:00 B 6.250000 days
18: 18 2022-10-09 18:00:00 B 7.291667 days
19: 19 2022-10-10 19:00:00 B 8.333333 days
20: 20 2022-10-11 20:00:00 B 9.375000 days
The result should look like this:
id timestamp group_id t_diff cluster_id
1: 1 2022-10-02 01:00:00 A 0.000000 days 1
2: 2 2022-10-03 02:00:00 A 1.041667 days 1
3: 3 2022-10-04 03:00:00 A 2.083333 days 1
4: 4 2022-10-05 04:00:00 A 3.125000 days 1
5: 5 2022-10-06 05:00:00 A 4.166667 days 2
6: 6 2022-10-07 06:00:00 A 5.208333 days 2
7: 7 2022-10-08 07:00:00 A 6.250000 days 2
8: 8 2022-10-09 08:00:00 A 7.291667 days 2
9: 9 2022-10-10 09:00:00 A 8.333333 days 3
10: 10 2022-10-11 10:00:00 A 9.375000 days 3
11: 11 2022-10-02 11:00:00 B 0.000000 days 4
12: 12 2022-10-03 12:00:00 B 1.041667 days 4
13: 13 2022-10-04 13:00:00 B 2.083333 days 4
14: 14 2022-10-05 14:00:00 B 3.125000 days 4
15: 15 2022-10-06 15:00:00 B 4.166667 days 5
16: 16 2022-10-07 16:00:00 B 5.208333 days 5
17: 17 2022-10-08 17:00:00 B 6.250000 days 5
18: 18 2022-10-09 18:00:00 B 7.291667 days 5
19: 19 2022-10-10 19:00:00 B 8.333333 days 6
20: 20 2022-10-11 20:00:00 B 9.375000 days 6
I have tried an approach with lapply, but the code is ugly and very slow. I am looking for a data.table approach, but I don't know how to dynamically refer to the "first" observation, by which I mean the first observation of each 4-day interval.
You can use integer division.
Note that as.numeric, when called on a difftime object, has a units argument that converts the difference to the desired time unit.
startDate <- as.POSIXct("2022-10-01")
dt1 <- data.table::data.table(
id = 1:20,
timestamp = startDate + lubridate::days(rep(1:10,2)) + lubridate::hours(1:20),
group_id = rep(c("A","B"), each= 10)
)
#
dt1[, GRP := as.numeric(timestamp - min(timestamp),
units = "days") %/% 4,
by = group_id][]
#> id timestamp group_id GRP
#> 1: 1 2022-10-02 01:00:00 A 0
#> 2: 2 2022-10-03 02:00:00 A 0
#> 3: 3 2022-10-04 03:00:00 A 0
#> 4: 4 2022-10-05 04:00:00 A 0
#> 5: 5 2022-10-06 05:00:00 A 1
#> 6: 6 2022-10-07 06:00:00 A 1
#> 7: 7 2022-10-08 07:00:00 A 1
#> 8: 8 2022-10-09 08:00:00 A 1
#> 9: 9 2022-10-10 09:00:00 A 2
#> 10: 10 2022-10-11 10:00:00 A 2
#> 11: 11 2022-10-02 11:00:00 B 0
#> 12: 12 2022-10-03 12:00:00 B 0
#> 13: 13 2022-10-04 13:00:00 B 0
#> 14: 14 2022-10-05 14:00:00 B 0
#> 15: 15 2022-10-06 15:00:00 B 1
#> 16: 16 2022-10-07 16:00:00 B 1
#> 17: 17 2022-10-08 17:00:00 B 1
#> 18: 18 2022-10-09 18:00:00 B 1
#> 19: 19 2022-10-10 19:00:00 B 2
#> 20: 20 2022-10-11 20:00:00 B 2
# If you want a single cluster id
# (alternatively, just use the combination of group_id and GRP in subsequent `by`s)
dt1[, cluster_id := .GRP, by = .(group_id, GRP)][]
#> id timestamp group_id GRP cluster_id
#> 1: 1 2022-10-02 01:00:00 A 0 1
#> 2: 2 2022-10-03 02:00:00 A 0 1
#> 3: 3 2022-10-04 03:00:00 A 0 1
#> 4: 4 2022-10-05 04:00:00 A 0 1
#> 5: 5 2022-10-06 05:00:00 A 1 2
#> 6: 6 2022-10-07 06:00:00 A 1 2
#> 7: 7 2022-10-08 07:00:00 A 1 2
#> 8: 8 2022-10-09 08:00:00 A 1 2
#> 9: 9 2022-10-10 09:00:00 A 2 3
#> 10: 10 2022-10-11 10:00:00 A 2 3
#> 11: 11 2022-10-02 11:00:00 B 0 4
#> 12: 12 2022-10-03 12:00:00 B 0 4
#> 13: 13 2022-10-04 13:00:00 B 0 4
#> 14: 14 2022-10-05 14:00:00 B 0 4
#> 15: 15 2022-10-06 15:00:00 B 1 5
#> 16: 16 2022-10-07 16:00:00 B 1 5
#> 17: 17 2022-10-08 17:00:00 B 1 5
#> 18: 18 2022-10-09 18:00:00 B 1 5
#> 19: 19 2022-10-10 19:00:00 B 2 6
#> 20: 20 2022-10-11 20:00:00 B 2 6
I have a list of tibbles that look like this:
> head(temp)
$AT
# A tibble: 8,784 × 2
price_eur datetime
<dbl> <dttm>
1 50.9 2021-01-01 00:00:00
2 48.2 2021-01-01 01:00:00
3 44.7 2021-01-01 02:00:00
4 42.9 2021-01-01 03:00:00
5 40.4 2021-01-01 04:00:00
6 40.2 2021-01-01 05:00:00
7 39.6 2021-01-01 06:00:00
8 40.1 2021-01-01 07:00:00
9 41.3 2021-01-01 08:00:00
10 44.9 2021-01-01 09:00:00
# … with 8,774 more rows
$IE
# A tibble: 7,198 × 2
price_eur datetime
<dbl> <dttm>
1 54.0 2021-01-01 01:00:00
2 53 2021-01-01 02:00:00
3 51.2 2021-01-01 03:00:00
4 48.1 2021-01-01 04:00:00
5 47.3 2021-01-01 05:00:00
6 47.6 2021-01-01 06:00:00
7 45.4 2021-01-01 07:00:00
8 43.4 2021-01-01 08:00:00
9 47.8 2021-01-01 09:00:00
10 51.8 2021-01-01 10:00:00
# … with 7,188 more rows
$`IT-Calabria`
# A tibble: 8,736 × 2
price_eur datetime
<dbl> <dttm>
1 50.9 2021-01-01 00:00:00
2 48.2 2021-01-01 01:00:00
3 44.7 2021-01-01 02:00:00
4 42.9 2021-01-01 03:00:00
5 40.4 2021-01-01 04:00:00
6 40.2 2021-01-01 05:00:00
7 39.6 2021-01-01 06:00:00
8 40.1 2021-01-01 07:00:00
9 41.3 2021-01-01 08:00:00
10 41.7 2021-01-01 09:00:00
# … with 8,726 more rows
The number of rows is different because there are missing observations, usually one or several days.
Ideally I need a tibble with a single datetime index and a column per list element, with NAs where data is missing, and I'm stuck here.
We can do a full join by 'datetime'
library(dplyr)
library(purrr)
reduce(temp, full_join, by = "datetime")
If we need to rename the column 'price_eur' before the join, loop over the list with imap, rename 'price_eur' to the corresponding list name (.y), and do the join within reduce:
imap(temp, ~ .x %>%
rename(!! .y := price_eur)) %>%
reduce(full_join, by = 'datetime')
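The same wide result can also be reached by stacking the list into long format and pivoting, in which case missing hours become NA automatically. A sketch, with "zone" as an illustrative name for the column holding the list names:
library(dplyr)
library(tidyr)

bind_rows(temp, .id = "zone") %>%                     # stack the list; element names go into "zone"
  pivot_wider(names_from = zone, values_from = price_eur) %>%
  arrange(datetime)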
I'm looking to aggregate some pedometer data, gathered in steps per minute, so I get a summed number of steps up until an EMA assessment. The EMA assessments happened four times per day. An example of the two data sets are:
Pedometer Data
ID Steps Time
1 15 2/4/2020 8:32
1 23 2/4/2020 8:33
1 76 2/4/2020 8:34
1 32 2/4/2020 8:35
1 45 2/4/2020 8:36
...
2 16 2/4/2020 8:32
2 17 2/4/2020 8:33
2 0 2/4/2020 8:34
2 5 2/4/2020 8:35
2 8 2/4/2020 8:36
EMA Data
ID Time X Y
1 2/4/2020 8:36 3 4
1 2/4/2020 12:01 3 5
1 2/4/2020 3:30 4 5
1 2/4/2020 6:45 7 8
...
2 2/4/2020 8:35 4 6
2 2/4/2020 12:05 5 7
2 2/4/2020 3:39 1 3
2 2/4/2020 6:55 8 3
I'm looking to add the pedometer data to the EMA data as a new variable, where the steps taken are summed up to each EMA assessment. Ideally it would look something like this:
Combined Data
ID Time X Y Steps
1 2/4/2020 8:36 3 4 191
1 2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1 2/4/2020 3:30 4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1 2/4/2020 6:45 7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2 2/4/2020 8:35 4 6 38
2 2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2 2/4/2020 3:39 1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2 2/4/2020 6:55 8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]
I then need the process to continue over the entire 21 day EMA period, so the same process for the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.
This has pushed me to the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for all advice.
Here's a solution using rolling joins from data.table. The basic idea is to roll each time from the pedometer data up to the next time in the EMA data (while still matching on ID). Once the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.
Data creation and prep:
library(data.table)
pedometer <- data.table(ID = sort(rep(1:2, 500)),
Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"),
as.POSIXct("2020-02-08 17:00:00 EST"), length.out = 500), 2),
Steps = rpois(1000, 25))
EMA <- data.table(ID = sort(rep(1:2, 4*5)),
Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"),
as.POSIXct("2020-02-08 23:59:59 EST"), by = '6 hours'), 2),
X = sample(1:8, 2*4*5, rep = T),
Y = sample(1:8, 2*4*5, rep = T))
setkey(pedometer, Time)
setkey(EMA, Time)
EMA[,next_ema_time := Time]
And now the actual join and summation:
joined <- EMA[pedometer,
on = .(ID, Time),
roll = -Inf,
j = .(ID, Time, Steps, next_ema_time, X, Y)]
result <- joined[,.('X' = min(X),
'Y' = min(Y),
'Steps' = sum(Steps)),
.(ID, next_ema_time)]
result
#> ID next_ema_time X Y Steps
#> 1: 1 2020-02-04 11:00:00 1 2 167
#> 2: 2 2020-02-04 11:00:00 8 5 169
#> 3: 1 2020-02-04 17:00:00 3 6 740
#> 4: 2 2020-02-04 17:00:00 4 6 747
#> 5: 1 2020-02-04 23:00:00 2 2 679
#> 6: 2 2020-02-04 23:00:00 3 2 732
#> 7: 1 2020-02-05 05:00:00 7 5 720
#> 8: 2 2020-02-05 05:00:00 6 8 692
#> 9: 1 2020-02-05 11:00:00 2 4 731
#> 10: 2 2020-02-05 11:00:00 4 5 773
#> 11: 1 2020-02-05 17:00:00 1 5 757
#> 12: 2 2020-02-05 17:00:00 3 5 743
#> 13: 1 2020-02-05 23:00:00 3 8 693
#> 14: 2 2020-02-05 23:00:00 1 8 740
#> 15: 1 2020-02-06 05:00:00 8 8 710
#> 16: 2 2020-02-06 05:00:00 3 2 760
#> 17: 1 2020-02-06 11:00:00 8 4 716
#> 18: 2 2020-02-06 11:00:00 1 2 688
#> 19: 1 2020-02-06 17:00:00 5 2 738
#> 20: 2 2020-02-06 17:00:00 4 6 724
#> 21: 1 2020-02-06 23:00:00 7 8 737
#> 22: 2 2020-02-06 23:00:00 6 3 672
#> 23: 1 2020-02-07 05:00:00 2 6 726
#> 24: 2 2020-02-07 05:00:00 7 7 759
#> 25: 1 2020-02-07 11:00:00 1 4 737
#> 26: 2 2020-02-07 11:00:00 5 2 737
#> 27: 1 2020-02-07 17:00:00 3 5 766
#> 28: 2 2020-02-07 17:00:00 4 4 745
#> 29: 1 2020-02-07 23:00:00 3 3 714
#> 30: 2 2020-02-07 23:00:00 2 1 741
#> 31: 1 2020-02-08 05:00:00 4 6 751
#> 32: 2 2020-02-08 05:00:00 8 2 723
#> 33: 1 2020-02-08 11:00:00 3 3 716
#> 34: 2 2020-02-08 11:00:00 3 6 735
#> 35: 1 2020-02-08 17:00:00 1 5 696
#> 36: 2 2020-02-08 17:00:00 7 7 741
#> ID next_ema_time X Y Steps
Created on 2020-02-04 by the reprex package (v0.3.0)
I would left_join ema_df onto pedometer_df by ID and Time. This way you get all rows of pedometer_df, with missing values for X and Y (which I assume identify each assessment) whenever a row is not an EMA assessment time.
I then fill those values upwards with the next available ones (i.e. the next EMA assessment's X and Y),
and finally group_by ID, X and Y and summarise to keep the datetime of the assessment (the max) and the sum of steps.
library(dplyr)
library(tidyr)
pedometer_df %>%
  left_join(ema_df, by = c("ID", "Time")) %>%
  fill(X, Y, .direction = "up") %>%
  group_by(ID, X, Y) %>%
  summarise(
    Time = max(Time),
    Steps = sum(Steps)
  )
I have a data frame:
> dsa[1:20]
Ordered.Item date Qty
1: 2011001FAM002025001 2019-06-01 19440.00
2: 2011001FAM002025001 2019-05-01 24455.53
3: 2011001FAM002025001 2019-04-01 16575.06
4: 2011001FAM002025001 2019-03-01 880.00
5: 2011001FAM002025001 2019-02-01 5000.00
6: 2011001FAM002035001 2019-04-01 175.00
7: 2011001FAM004025001 2019-06-01 2000.00
8: 2011001FAM004025001 2019-05-01 2500.00
9: 2011001FAM004025001 2019-04-01 3000.00
10: 2011001FAM012025001 2019-06-01 1200.00
11: 2011001FAM012025001 2019-04-01 1074.02
12: 2011001FAM022025001 2019-06-01 350.00
13: 2011001FAM022025001 2019-05-01 110.96
14: 2011001FAM022025001 2019-04-01 221.13
15: 2011001FAM022035001 2019-06-01 500.00
16: 2011001FAM022035001 2019-05-01 18.91
17: 2011001FAM027025001 2019-06-01 210.00
18: 2011001FAM028025001 2019-04-01 327.21
19: 2011001FBK005035001 2019-05-01 500.00
20: 2011001FBL001025001 2019-06-01 15350.00
>str(dsa)
Classes ‘data.table’ and 'data.frame': 830 obs. of 3 variables:
$ Ordered.Item: Factor w/ 435 levels "2011001FAM002025001",..: 1 1 1 1 1 2 3 3 3 4 ...
$ date : Date, format: "2019-06-01" "2019-05-01" "2019-04-01" ...
$ Qty : num 19440 24456 16575 880 5000 ...
- attr(*, ".internal.selfref")=<externalptr>
This data contains the SKU and its quantity sold per month.
Because I plan to use ARIMA forecasting, I am trying to convert the data frame to a time series, but I get a weird output:
> timesr<-ts(data=dsa,start=c(12,2018),frequency = 12)
> head(timesr)
Ordered.Item date Qty
[1,] 1 18048 19440.00
[2,] 1 18017 24455.53
[3,] 1 17987 16575.06
[4,] 1 17956 880.00
[5,] 1 17928 5000.00
[6,] 2 17987 175.00
You might try something like this for your sku ARIMA modeling.
# Create dataframe
dsa = read.table(text = '
ID Ordered.Item date Qty
1 2011001FAM002025001 2019-06-01 19440.00
2 2011001FAM002025001 2019-05-01 24455.53
3 2011001FAM002025001 2019-04-01 16575.06
4 2011001FAM002025001 2019-03-01 880.00
5 2011001FAM002025001 2019-02-01 5000.00
6 2011001FAM002035001 2019-04-01 175.00
7 2011001FAM004025001 2019-06-01 2000.00
8 2011001FAM004025001 2019-05-01 2500.00
9 2011001FAM004025001 2019-04-01 3000.00
10 2011001FAM012025001 2019-06-01 1200.00
11 2011001FAM012025001 2019-04-01 1074.02
12 2011001FAM022025001 2019-06-01 350.00
13 2011001FAM022025001 2019-05-01 110.96
14 2011001FAM022025001 2019-04-01 221.13
15 2011001FAM022035001 2019-06-01 500.00
16 2011001FAM022035001 2019-05-01 18.91
17 2011001FAM027025001 2019-06-01 210.00
18 2011001FAM028025001 2019-04-01 327.21
19 2011001FBK005035001 2019-05-01 500.00
20 2011001FBL001025001 2019-06-01 15350.00
', header = T)
dsa$ID <- NULL
# Reshape
dsa2 <- reshape(data=dsa,idvar="date", v.names = "Qty", timevar = "Ordered.Item", direction="wide")
dsa2 <- dsa2[order(as.Date(dsa2$date, "%Y-%m-%d")),] # Sort by date
# Predict for sku 2011001FAM002025001
library(forecast)  # provides auto.arima() and forecast()
fit <- auto.arima(ts(dsa2$Qty.2011001FAM002025001))
fcast <- forecast(fit, h=60) # forecast 60 periods ahead
plot(fcast)
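If the goal is to forecast every SKU rather than a single column, the same idea can be applied across the wide columns. A rough sketch, assuming the forecast package and the dsa2 table built above (the 6-period horizon and the tryCatch guard are illustrative; very short or gappy series may not support a model):
library(forecast)

sku_cols <- setdiff(names(dsa2), "date")           # one Qty.* column per SKU
fits <- lapply(sku_cols, function(col) {
  y <- ts(dsa2[[col]])
  tryCatch(auto.arima(y), error = function(e) NULL) # skip series that can't be modeled
})
names(fits) <- sku_cols
fcasts <- lapply(Filter(Negate(is.null), fits), forecast, h = 6)  # 6 periods ahead per fitted SKU
plot(fcasts[["Qty.2011001FAM002025001"]])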
I have the following dataframe:
> df
Time_Start Time_End Cut Plot Inlet_NH4N Outlet_NH4N Pump_reading Anemometer_reading
1 2016-05-05 11:19:00 2016-05-06 09:30:00 1 1 0.2336795 0.30786350 79846.9 6296343
2 2016-05-05 11:25:00 2016-05-06 09:35:00 1 3 1.0905045 0.50816024 78776.5 333116
3 2016-05-05 11:33:00 2016-05-06 09:39:00 1 6 1.3538576 0.34866469 79585.1 8970447
4 2016-05-05 11:37:00 2016-05-06 09:51:00 1 7 0.6862018 0.34124629 80043.1 8436546
5 2016-05-05 11:43:00 2016-05-06 09:43:00 1 9 0.2633531 0.73813056 79227.7 9007387
6 2016-05-05 11:48:00 2016-05-06 09:47:00 1 12 0.5934718 1.10905045 79121.5 8070785
7 2016-05-06 09:33:00 2013-05-07 10:13:00 1 1 0.5213904 2.46791444 88800.2 7807792
8 2016-05-06 09:38:00 2013-05-07 10:23:00 1 3 0.1684492 0.22905526 89123.0 14127
9 2016-05-06 09:42:00 2013-05-07 10:28:00 1 6 0.4393939 0.09001782 89157.6 9844162
10 2016-05-06 09:53:00 2013-05-07 10:34:00 1 7 0.1470588 1.03832442 88852.6 9143733
11 2016-05-06 09:45:00 2013-05-07 10:40:00 1 9 0.1114082 0.32531194 89635.6 10122720
12 2016-05-06 09:50:00 2013-05-07 10:43:00 1 12 0.6853832 2.51426025 89582.6 8924198
Here is the str:
> str(df)
'data.frame': 12 obs. of 8 variables:
$ Time_Start : POSIXct, format: "2016-05-05 11:19:00" "2016-05-05 11:25:00" "2016-05-05 11:33:00" ...
$ Time_End : POSIXct, format: "2016-05-06 09:30:00" "2016-05-06 09:35:00" "2016-05-06 09:39:00" ...
$ Cut : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
$ Plot : Factor w/ 8 levels "1","3","6","7",..: 1 2 3 4 5 6 1 2 3 4 ...
$ Inlet_NH4N : num 0.234 1.091 1.354 0.686 0.263 ...
$ Outlet_NH4N : num 0.308 0.508 0.349 0.341 0.738 ...
$ Pump_reading : num 79847 78777 79585 80043 79228 ...
$ Anemometer_reading: int 6296343 333116 8970447 8436546 9007387 8070785 7807792 14127 9844162 9143733 ...
This is a small segment of a larger dataset.
I have a problem with these data in that the Anemometer_reading for plot "3" is always much lower than for the other plots, due to a mechanical problem. I want to remove this artifact, and I think the best way to do this is to take the average of the Anemometer_reading across all plots other than plot "3", calculated on a daily basis.
I can calculate the daily Anemometer_reading average, excluding plot "3" like this:
library(dplyr)
> df_avg <- df %>% filter(Plot != "3") %>% group_by(as.Date(Time_End)) %>% summarise(Anemometer_mean = mean(Anemometer_reading))
> df_avg
Source: local data frame [2 x 2]
as.Date(Time_End) Anemometer_mean
<date> <dbl>
1 2013-05-07 9168521
2 2016-05-06 8156302
I'm not sure how to go about using the resulting dataframe to replace the Anemometer_reading values from plot "3".
Can anyone point me in the right direction please?
Thanks
I would follow @Roland's comment. However, if you care about how you would use dplyr to do what you asked:
result <- df %>%
  group_by(as.Date(Time_End)) %>%
  mutate(Anemometer_mean = mean(Anemometer_reading[Plot != "3"])) %>%
  mutate(Anemometer_reading = replace(Anemometer_reading, Plot == "3", first(Anemometer_mean))) %>%
  ungroup() %>%
  select(-`as.Date(Time_End)`, -Anemometer_mean)
print(result)
## A tibble: 12 x 8
## Time_Start Time_End Cut Plot Inlet_NH4N Outlet_NH4N Pump_reading Anemometer_reading
## <fctr> <fctr> <int> <int> <dbl> <dbl> <dbl> <dbl>
##1 2016-05-05 11:19:00 2016-05-06 09:30:00 1 1 0.2336795 0.30786350 79846.9 6296343
##2 2016-05-05 11:25:00 2016-05-06 09:35:00 1 3 1.0905045 0.50816024 78776.5 8156302
##3 2016-05-05 11:33:00 2016-05-06 09:39:00 1 6 1.3538576 0.34866469 79585.1 8970447
##4 2016-05-05 11:37:00 2016-05-06 09:51:00 1 7 0.6862018 0.34124629 80043.1 8436546
##5 2016-05-05 11:43:00 2016-05-06 09:43:00 1 9 0.2633531 0.73813056 79227.7 9007387
##6 2016-05-05 11:48:00 2016-05-06 09:47:00 1 12 0.5934718 1.10905045 79121.5 8070785
##7 2016-05-06 09:33:00 2013-05-07 10:13:00 1 1 0.5213904 2.46791444 88800.2 7807792
##8 2016-05-06 09:38:00 2013-05-07 10:23:00 1 3 0.1684492 0.22905526 89123.0 9168521
##9 2016-05-06 09:42:00 2013-05-07 10:28:00 1 6 0.4393939 0.09001782 89157.6 9844162
##10 2016-05-06 09:53:00 2013-05-07 10:34:00 1 7 0.1470588 1.03832442 88852.6 9143733
##11 2016-05-06 09:45:00 2013-05-07 10:40:00 1 9 0.1114082 0.32531194 89635.6 10122720
##12 2016-05-06 09:50:00 2013-05-07 10:43:00 1 12 0.6853832 2.51426025 89582.6 8924198
Instead of filter and summarise, use mutate to create a new column Anemometer_mean that holds the mean over all rows with Plot != "3". Then replace Anemometer_reading for the rows with Plot == "3" with this mean.
In fact, you can do all this with just one mutate:
result <- df %>%
  group_by(as.Date(Time_End)) %>%
  mutate(Anemometer_reading = replace(Anemometer_reading, Plot == "3", mean(Anemometer_reading[Plot != "3"]))) %>%
  ungroup() %>%
  select(-`as.Date(Time_End)`)
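If you would rather build and reuse the separate df_avg table from the question, a join-based sketch could look like this (assuming dplyr; the Date column is a helper just for joining):
library(dplyr)

df_avg <- df %>%
  filter(Plot != "3") %>%
  group_by(Date = as.Date(Time_End)) %>%
  summarise(Anemometer_mean = mean(Anemometer_reading))

df %>%
  mutate(Date = as.Date(Time_End)) %>%
  left_join(df_avg, by = "Date") %>%
  mutate(Anemometer_reading = ifelse(Plot == "3", Anemometer_mean, Anemometer_reading)) %>%
  select(-Date, -Anemometer_mean)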
Hope this helps.