Anonymised example subset of a much larger dataset (now edited to show an option with multiple competing types):
structure(list(`Sample File` = c("A", "A", "A", "A", "A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C"),
Marker = c("X", "X", "X", "X", "Y", "Y", "Y", "Y", "Y", "Z",
"Z", "Z", "Z", "Z", "q", "q", "q", "q"), Allele = c(19, 20,
22, 23, 18, 18.2, 19, 19.2, 20, 12, 13, 14, 15, 16, 10, 10.2,
11, 12), Size = c(249.15, 253.13, 260.64, 264.68, 366, 367.81,
369.97, 372.02, 373.95, 91.65, 95.86, 100, 104.24, 108.38,
177.51, 179.4, 181.42, 185.49), Height = c(173L, 1976L, 145L,
1078L, 137L, 62L, 1381L, 45L, 1005L, 38L, 482L, 5766L, 4893L,
19L, 287L, 36L, 5001L, 50L), Type = c("minusone", "allele",
"minusone", "allele", "ambiguous", "minushalf", "allele",
"minushalf", "allele", "minustwo", "ambiguous", "allele",
"allele", "plusone", "minusone", "minushalf", "allele", "plusone"
), LUS = c(11.75, 11.286, 13.375, 13.5, 18, 9, 19, 10, 20,
12, 11, 14, 15, 16, 9.5, NA, 11, 11.5)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -18L), groups = structure(list(
`Sample File` = c("A", "A", "B", "C"), Marker = c("X", "Y",
"Z", "q"), .rows = structure(list(1:4, 5:9, 10:14, 15:18), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), .drop = TRUE))
I want to look up values based on the classification $Type.
"minustwo" means I want to look up the "Allele", "Height" and "LUS"
values for the row with "Allele" equal to the current row plus two,
with the same Sample File and Marker.
"minusone" means the same but for "Allele" equal to the current row plus one.
"minushalf" means the same but for "Allele" equal to the current row plus 0.2 but the dot values here are 25% each, so 12.1, 12.3, 12.3, 13, 13.1 etc - I have a helper function plusTwoBP() for this.
"plusone" means the same for "Allele" equal to the current row -1
"allele" or "ambiguous" don't need to do anything.
Ideal output:
# A tibble: 18 × 10
# Rowwise: Sample File, Marker
`Sample File` Marker Allele Size Height Type LUS ParentHeight ParentAllele ParentLUS
<chr> <chr> <dbl> <dbl> <int> <chr> <dbl> <int> <dbl> <dbl>
1 A X 19 249. 173 minusone 11.8 1976 20 11.3
2 A X 20 253. 1976 allele 11.3 NA NA NA
3 A X 22 261. 145 minusone 13.4 1078 23 13.5
4 A X 23 265. 1078 allele 13.5 NA NA NA
5 A Y 18 366 137 ambiguous 18 NA NA NA
6 A Y 18.2 368. 62 minushalf 9 1381 19 19
7 A Y 19 370. 1381 allele 19 NA NA NA
8 A Y 19.2 372. 45 minushalf 10 1005 20 20
9 A Y 20 374. 1005 allele 20 NA NA NA
10 B Z 12 91.6 38 minustwo 12 5766 14 14
11 B Z 13 95.9 482 ambiguous 11 NA NA NA
12 B Z 14 100 5766 allele 14 NA NA NA
13 B Z 15 104. 4893 allele 15 NA NA NA
14 B Z 16 108. 19 plusone 16 4893 15 15
15 C q 10 178. 287 minusone 9.5 5001 11 11
16 C q 10.2 179. 36 minushalf NA 5001 11 11
17 C q 11 181. 5001 allele 11 NA NA NA
18 C q 12 185. 50 plusone 11.5 5001 11 11
I have a rather belaboured way of doing it:
# eg for minustwo
sampleData %>%
filter(Type == "minustwo") %>%
rowwise() %>%
mutate(ParentHeight = sampleData$Height[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)],
ParentAllele = sampleData$Allele[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)],
ParentLUS = sampleData$LUS[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)]) %>%
right_join(sampleData)
I then have to redo that for each of my Types.
My real dataset is thousands of rows, so this ends up a little slow but manageable. More to the point, I want to learn a better way to do it: in particular, the sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker part seems like it should be doable with grouping, so I must be missing a trick there.
I have tried using group_map() but I've clearly not understood it correctly:
sampleData$ParentHeight <- sampleData %>%
group_by(`Sample File`, `Marker`) %>%
group_map(.f = \(.x, .y) {
pmap_dbl(.l = .x, .f = \(Allele, Height, Type, ...){
if(Type == "allele" | Type == "ambiguous") { return(0)
} else if (Type == "plusone") {
return(.x$Height[.x$Allele == round(Allele - 1, 1)])
} else if (Type == "minushalf") {
return(.x$Height[.x$Allele == round(plusTwoBP(Allele), 1)])
} else if (Type == "minusone") {
return(.x$Height[.x$Allele == round(Allele + 1, 1)])
} else if (Type == "minustwo") {
return(.x$Height[.x$Allele == round(Allele + 2, 1)])
} else { stop("unexpected peak type") }
})}) %>% unlist()
Initially seems to work, but on investigation it's not respecting both layers of grouping, so brings matches from the wrong Marker. Additionally, here I'm assigning the output to a new column in the data frame, but if I try to instead wrap a mutate() around this so that I can create all three new columns in one go then the group_map() no longer works at all.
I also considered using complete() to hugely extend the data frame with all possible values of Allele (including x.0, x.1, x.2, x.3 variants), then using lag() to select the corresponding rows, then dropping the spare rows. This seems like it would make the data frame enormous in the interim.
To summarise
This works, but it feels ugly and like I'm missing a more elegant and obvious solution. How would you approach this?
You can create two versions of Allele: one identical to the original, and one adjusted according to minusone, minustwo, etc.
Then do a self left join based on that adjusted version of Allele (plus Sample File and Marker):
sampleData = sampleData %>% group_by(`Sample File`,Marker) %>% mutate(id = Allele) %>% ungroup()
left_join(
sampleData %>%
mutate(id = case_when(
Type=="minusone"~id+1,
Type=="minustwo"~id+2,
Type=="plusone"~id-1,
Type=="minushalf"~ceiling(id))),
sampleData %>% select(-c(Size,Type)),
by=c("Sample File", "Marker", "id"),
suffix = c("", ".parent")
) %>% select(-id)
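Note that ceiling(id) stands in for the asker's plusTwoBP() in the case_when above: the two agree for the .2 alleles in this sample (18.2 becomes 19), but if .1 or .3 variants can also be minushalf rows you would substitute plusTwoBP(id) there.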
Output:
# A tibble: 18 × 10
`Sample File` Marker Allele Size Height Type LUS Allele.parent Height.parent LUS.parent
<chr> <chr> <dbl> <dbl> <int> <chr> <dbl> <dbl> <int> <dbl>
1 A X 19 249. 173 minusone 11.8 20 1976 11.3
2 A X 20 253. 1976 allele 11.3 NA NA NA
3 A X 22 261. 145 minusone 13.4 23 1078 13.5
4 A X 23 265. 1078 allele 13.5 NA NA NA
5 A Y 18 366 137 ambiguous 18 NA NA NA
6 A Y 18.2 368. 62 minushalf 9 19 1381 19
7 A Y 19 370. 1381 allele 19 NA NA NA
8 A Y 19.2 372. 45 minushalf 10 20 1005 20
9 A Y 20 374. 1005 allele 20 NA NA NA
10 B Z 12 91.6 38 minustwo 12 14 5766 14
11 B Z 13 95.9 482 ambiguous 11 NA NA NA
12 B Z 14 100 5766 allele 14 NA NA NA
13 B Z 15 104. 4893 allele 15 NA NA NA
14 B Z 16 108. 19 plusone 16 15 4893 15
15 C q 10 178. 287 minusone 9.5 11 5001 11
16 C q 10.2 179. 36 minushalf NA 11 5001 11
17 C q 11 181. 5001 allele 11 NA NA NA
18 C q 12 185. 50 plusone 11.5 11 5001 11
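If you want the exact Parent* column names from the ideal output, a cosmetic rename can be appended to the pipeline (names taken from the question):
  ... %>%
  select(-id) %>%
  rename(ParentHeight = Height.parent,
         ParentAllele = Allele.parent,
         ParentLUS = LUS.parent)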
I have longitudinal data with three follow-ups; columns 2, 3, and 4 (v_9, v_01, v_03) hold the follow-up values.
I want to set the value 99 in the columns v_9, v_01, and v_03 to NA, but I also want to set their corresponding columns ("d_9", "d_01", "d_03" and "a_9", "a_01", "a_03") to NA. An example for ID 101 is below:
How can I do this for all the individuals in my whole dataset in R? Thanks in advance for the help.
"id" "v_9" "v_01" "v_03" "d_9" "d_01" "d_03" "a_9" "a_01" "a_03"
101 12 NA 10 2015-03-23 NA 2003-06-19 40.50650 NA 44.1065
structure(list(id = c(101, 102, 103, 104), v_9 = c(12, 99, 16,
25), v_01 = c(99, 12, 16, NA), v_03 = c(10, NA, 99, NA), d_9 = structure(c(16517,
17613, 16769, 10667), class = "Date"), d_01 = structure(c(13291,
NA, 13566, NA), class = "Date"), d_03 = structure(c(12222, NA,
12119, NA), class = "Date"), a_9 = c(40.5065, 40.5065, 30.19713,
51.40862), a_01 = c(42.5065, 41.5112, 32.42847, NA), a_03 = c(44.1065,
NA, 35.46543, NA)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Try this function:
fn <- function(df){
  for (s in c("_9", "_01", "_03")) {
    # rows where the visit value for this suffix is 99
    i <- which(df[[paste0("v", s)]] == 99)
    # blank the visit value and its companion date and age columns
    df[i, paste0("v", s)] <- NA
    df[i, paste0("d", s)] <- NA
    df[i, paste0("a", s)] <- NA
  }
  df
}
df <- fn(df)
Output
# A tibble: 4 × 10
id v_9 v_01 v_03 d_9 d_01 d_03 a_9 a_01 a_03
<dbl> <dbl> <dbl> <dbl> <date> <date> <date> <dbl> <dbl> <dbl>
1 101 12 NA 10 2015-03-23 NA 2003-06-19 40.5 NA 44.1
2 102 NA 12 NA NA NA NA NA 41.5 NA
3 103 16 16 NA 2015-11-30 2007-02-22 NA 30.2 32.4 NA
4 104 25 NA NA 1999-03-17 NA NA 51.4 NA NA
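If you prefer a tidyverse flavour, a sketch of the same masking with dplyr and purrr (fn_tidy is a hypothetical name; this assumes the same v/d/a column naming, and reduce() threads the data frame through each suffix):
library(dplyr)
library(purrr)

fn_tidy <- function(df) {
  reduce(c("_9", "_01", "_03"), function(d, s) {
    bad <- coalesce(d[[paste0("v", s)]] == 99, FALSE)  # NAs are not 99
    d %>% mutate(across(all_of(paste0(c("v", "d", "a"), s)),
                        ~ replace(.x, bad, NA)))
  }, .init = df)
}
fn_tidy(df)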
I created a multilevel regression model with the nlme package and now I would like to illustrate the regression line obtained for some patients (unfortunately I cannot use geom_smooth with nlme).
So, using the model, I obtained the following predicted values (predicted_value) at different times (date_day), here for two patients (ID 1 and ID 2).
df <- data.frame(ID = c(rep(1, 10), rep(2, 10)),
                 date_day = c(7:16, 7:16),
                 predicted_value = c(33, 33, 33, 33, 33, NA, 34, NA, NA, NA,
                                     55, NA, NA, 53.3, NA, NA, 51.6, NA, 50.5, NA))
ID date_day predicted_value
1 1 7 33.0
2 1 8 33.0
3 1 9 33.0
4 1 10 33.0
5 1 11 33.0
6 1 12 NA
7 1 13 34.0
8 1 14 NA
9 1 15 NA
10 1 16 NA
11 2 7 55.0
12 2 8 NA
13 2 9 NA
14 2 10 53.3
15 2 11 NA
16 2 12 NA
17 2 13 51.6
18 2 14 NA
19 2 15 50.5
20 2 16 NA
Now I would like to draw the regression line for each of these patients. So I tried the following
ggplot(df%>% filter(ID %in% c("1", "2")))+
aes(x = date_day, y = predicted_value) +
geom_point(shape = "circle", size = 1.5, colour = "#112446", na.rm = T) +
geom_line(aes(y = predicted_value), na.rm = T, size = 1) +
theme_minimal() +
facet_wrap(vars(ID)) +
scale_x_continuous(name="days", limits=c(7, 16)) +
scale_y_continuous(name="predicted values", limits=c(0, 60))
But I end up with the following plots: for patient 1 the line is interrupted, and for patient 2 there is no line at all. How can I fix that?
Thanks a lot
Thank you @BenBolker; indeed, changing the first line
ggplot(df %>% filter(ID %in% c("1", "2")))
to
ggplot(na.omit(df) %>% filter(ID %in% c("1", "2")))
solved the problem.
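One caveat with na.omit(df): it drops rows with an NA in any column, not just predicted_value. Since only predicted_value matters here, filtering on that column alone is a slightly safer variant of the same fix:
ggplot(df %>% filter(!is.na(predicted_value), ID %in% c("1", "2"))) +
  aes(x = date_day, y = predicted_value) +
  geom_point(shape = "circle", size = 1.5, colour = "#112446") +
  geom_line(size = 1) +
  theme_minimal() +
  facet_wrap(vars(ID)) +
  scale_x_continuous(name = "days", limits = c(7, 16)) +
  scale_y_continuous(name = "predicted values", limits = c(0, 60))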
I have a df whose values are in different units. I would like to convert them into the same unit based on the conversion factors in cov: when df$Test == cov$Type and df$Unit == cov$Raw, we do the calculation. If df$Unit cannot be found in cov, keep the value as-is and flag it with a new variable Check == "Y".
What is the best way to complete this conversion? I have seen someone build an empty df and then read in each record one by one, calculating as they go. Is that a good approach, or does it just count as the more careful-looking one? I would like to know how you would handle such a task, with as many approaches as possible. Many thanks.
df<-structure(list(Test = c("Length", "Weight", "Weight", "Weight",
"Weight", "Length", "Length", "Length", "Length", "Length", "Length",
"Length", "Length", "Length"), Result = c(4.5, 36, 147, 55, 175,
2, 125, 222, 1.6, 3, 56, 512, 28, 78), Unit = c("m", "lb", "g",
"kg", "oz", "cm", "in", "mm", "ft", "m", "in", "cm", "cm", NA
)), row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))
cov<- structure(list(Type = c("Length", "Length", "Length", "Length",
"Length", "Weight", "Weight", "Weight", "Weight"), Raw = c("m",
"cm", "mm", "in", "ft", "lb", "g", "kg", "oz"), Standard = c("cm",
"cm", "cm", "cm", "cm", "g", "g", "g", "g"), Factor = c(100,
1, 0.1, 2.54, 30.48, 453, 1, 1000, 28)), row.names = c(NA, -9L
), class = c("tbl_df", "tbl", "data.frame"))
A sensible first step is to merge the two data frames so that each line in df gains the suitable cov$Factor, as in:
merge(df, cov, by.x = "Unit", by.y = "Raw", all.x = TRUE, all.y = FALSE)
which gives:
Unit Test Result Type Standard Factor
1 cm Length 512.0 Length cm 1.00
2 cm Length 28.0 Length cm 1.00
3 cm Length 2.0 Length cm 1.00
4 ft Length 1.6 Length cm 30.48
5 g Weight 147.0 Weight g 1.00
6 in Length 125.0 Length cm 2.54
7 in Length 56.0 Length cm 2.54
8 kg Weight 55.0 Weight g 1000.00
9 lb Weight 36.0 Weight g 453.00
10 m Length 4.5 Length cm 100.00
11 m Length 3.0 Length cm 100.00
12 mm Length 222.0 Length cm 0.10
13 oz Weight 175.0 Weight g 28.00
14 <NA> Length 78.0 <NA> <NA> NA
It is then easy to multiply Result by Factor to get results in unified units, and to run an ifelse to add a Check variable flagging rows where no factor was found.
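To make that concrete, one possible way to finish the base-R version (continuing from the merge above; Check marks rows where no factor was found):
m <- merge(df, cov, by.x = "Unit", by.y = "Raw", all.x = TRUE, all.y = FALSE)
m$final <- ifelse(is.na(m$Factor), m$Result, m$Result * m$Factor)  # keep as-is when unmatched
m$Check <- ifelse(is.na(m$Factor), "Y", "N")
m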
Using dplyr you can do -
library(dplyr)
left_join(df, cov, by = c('Test' = 'Type', 'Unit' = 'Raw')) %>%
mutate(final = Result * Factor,
Check = ifelse(is.na(final), 'Y', 'F'))
# Test Result Unit Standard Factor final Check
# <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
# 1 Length 4.5 m cm 100 450 F
# 2 Weight 36 lb g 453 16308 F
# 3 Weight 147 g g 1 147 F
# 4 Weight 55 kg g 1000 55000 F
# 5 Weight 175 oz g 28 4900 F
# 6 Length 2 cm cm 1 2 F
# 7 Length 125 in cm 2.54 318. F
# 8 Length 222 mm cm 0.1 22.2 F
# 9 Length 1.6 ft cm 30.5 48.8 F
#10 Length 3 m cm 100 300 F
#11 Length 56 in cm 2.54 142. F
#12 Length 512 cm cm 1 512 F
#13 Length 28 cm cm 1 28 F
#14 Length 78 NA NA NA NA Y
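Since the asker wanted unmatched rows kept as-is rather than NA, a small tweak to the above: coalesce() falls back to the raw Result where no factor was found, while Check still flags those rows:
left_join(df, cov, by = c('Test' = 'Type', 'Unit' = 'Raw')) %>%
  mutate(Check = ifelse(is.na(Factor), 'Y', 'N'),
         final = coalesce(Result * Factor, Result))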
I have a dataset of forest stands, each containing several tree layers of different age and volume.
I want to classify the stands as even- or uneven-aged, combining volume and age data. The forest is considered even-aged if more than 80% of the volume is allocated to age classes within 20 years of each other. I wonder how to implement the 'within 20 years apart' condition? I can easily calculate the sum of volume and its share for individual tree layers (strat), but how do I check how many years apart they are? Is it some sort of moving window?
Dummy example:
# investigate volume by age classes?
library(dplyr)
df <- data.frame(stand = c("id1", "id1", "id1", "id1",
'id2', 'id2', 'id2'),
strat = c(1,2,3,4,
1,2,3),
v = c(4,10,15,20,
11,15,18),
age = c(5,10,65,80,
10,15,20))
# even-aged = more than 80% of the volume is allocated to layers within a 20-year range
df %>%
group_by(stand) %>%
mutate(V_tot = sum(v)) %>%
mutate(V_share = v/V_tot*100)
Expected outcome:
stand strat v age V_tot V_share quality
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 id1 1 4 5 49 8.16 uneven-aged
2 id1 2 10 10 49 20.4 uneven-aged
3 id1 3 15 65 49 30.6 uneven-aged
4 id1 4 20 80 49 40.8 uneven-aged # because age classes 65 and 80, less than 20 years apart, hold only ~71% of the total volume
5 id2 1 11 10 44 25 even-aged
6 id2 2 15 15 44 34.1 even-aged
7 id2 3 18 20 44 40.9 even-aged
Another tidyverse solution, implementing a moving-window sum of volume shares:
library(tidyverse)
df <- structure(list(stand = c("id1", "id1", "id1", "id1", "id2", "id2", "id2"), strat = c(1, 2, 3, 4, 1, 2, 3), v = c(4, 10, 15, 20, 11, 15, 18), age = c(5, 10, 65, 80, 10, 15, 20), V_tot = c(49, 49, 49, 49, 44, 44, 44), V_share = c(8.16326530612245, 20.4081632653061, 30.6122448979592, 40.8163265306122, 25, 34.0909090909091, 40.9090909090909)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L))
df %>%
group_by(stand) %>%
mutate(range20 = map_dbl(age, ~ sum(V_share[which(abs(age - .x) <= 20)])),
quality = ifelse(any(range20 > 80), "even-aged", "uneven-aged"))
#> # A tibble: 7 × 8
#> # Groups: stand [2]
#> stand strat v age V_tot V_share range20 quality
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 id1 1 4 5 49 8.16 28.6 uneven-aged
#> 2 id1 2 10 10 49 20.4 28.6 uneven-aged
#> 3 id1 3 15 65 49 30.6 71.4 uneven-aged
#> 4 id1 4 20 80 49 40.8 71.4 uneven-aged
#> 5 id2 1 11 10 44 25 100 even-aged
#> 6 id2 2 15 15 44 34.1 100 even-aged
#> 7 id2 3 18 20 44 40.9 100 even-aged
Created on 2021-09-08 by the reprex package (v2.0.1)
Interesting issue; I think I have a solution using the runner package:
library(runner)

df %>%
  group_by(stand) %>%
mutate(
V_tot = sum(v),
V_share = v/V_tot*100,
test = sum_run(
V_share,
k = 20L,
idx = age,
na_rm = TRUE,
na_pad = FALSE
),
quality = if_else(any(test >= 80), 'even-aged', 'uneven-aged')
) %>%
select(-test)
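Both answers anchor a window at each observed age (centred in the first, trailing in the second). A stricter reading of "within 20 years apart" slides an interval [a, a + 20] starting from each observed age; a minimal sketch under that assumption (win20 is a hypothetical helper column):
df %>%
  group_by(stand) %>%
  mutate(V_tot = sum(v),
         V_share = v / V_tot * 100,
         # share of total volume falling inside each 20-year interval [age, age + 20]
         win20 = purrr::map_dbl(age, ~ sum(V_share[age >= .x & age <= .x + 20])),
         quality = if_else(any(win20 > 80), "even-aged", "uneven-aged")) %>%
  select(-win20)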
I have a dataframe of daily water chemistry values taken from deployed sensors. I'm trying to calculate rolling 7-day averages of daily maximum values. This is in-situ environmental data, so it can be a bit messy.
Here are the rules for calculating the averages and assigning quality levels:
Data is graded and given a quality value (DQL) for the day (dyDQL).
'A' is high quality, 'B' is medium, and 'E' is poor.
7 day average is calculated at the end of a 7 day period.
Dataset needs only 6 complete days to calculate a 7-day average (i.e., it can miss 1 day of data).
If there are at least 6 days' worth of 'A' and 'B' graded data and 1 day of 'E', discard the 'E' data and calculate the 7-day average using the 6 days of 'A' and 'B' data.
I have the code working using a loop that loops through each result, creates a new dataframe containing the 7 day window, and then calculates the moving average. See minimal example below.
Note that there are missing dates for the 11th, 16th, 17th, and 18th in this example:
daily_data <- tibble::tribble(
~Monitoring.Location.ID, ~date, ~dyMax, ~dyMin, ~dyDQL,
"River 1", as.Date("2018-07-01"), 24.219, 22.537, "A",
"River 1", as.Date("2018-07-02"), 24.557, 20.388, "A",
"River 1", as.Date("2018-07-03"), 24.847, 20.126, "A",
"River 1", as.Date("2018-07-04"), 25.283, 20.674, "A",
"River 1", as.Date("2018-07-05"), 25.501, 20.865, "A",
"River 1", as.Date("2018-07-06"), 25.04, 21.008, "A",
"River 1", as.Date("2018-07-07"), 24.847, 20.674, "A",
"River 1", as.Date("2018-07-08"), 23.424, 20.793, "B",
"River 1", as.Date("2018-07-09"), 22.657, 18.866, "E",
"River 1", as.Date("2018-07-10"), 22.298, 18.2, "A",
"River 1", as.Date("2018-07-12"), 22.92, 19.008, "A",
"River 1", as.Date("2018-07-13"), 23.978, 19.532, "A",
"River 1", as.Date("2018-07-14"), 24.508, 19.936, "A",
"River 1", as.Date("2018-07-15"), 25.137, 20.627, "A",
"River 1", as.Date("2018-07-19"), 24.919, 20.674, "A"
)
for (l in seq_len(nrow(daily_data))) {
  # all rows within the trailing 7-day window ending at this row's date
  station_7day <- filter(daily_data,
                         dplyr::between(date,
                                        daily_data[[l, "date"]] - lubridate::days(6),
                                        daily_data[[l, "date"]]))
  daily_data[l, "ma.max7"] <- dplyr::case_when(
    nrow(subset(station_7day, dyDQL %in% c("A"))) == 7 & l >= 7 ~ mean(station_7day$dyMax),
    nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >= 7 ~ mean(station_7day$dyMax),
    any(station_7day$dyDQL == "E") & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >= 7 ~
      mean(station_7day$dyMax[station_7day$dyDQL %in% c("A", "B")]),
    nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >= 7 ~ mean(station_7day$dyMax),
    TRUE ~ NA_real_)
  daily_data[l, "ma.max7_DQL"] <- dplyr::case_when(
    nrow(subset(station_7day, dyDQL %in% c("A"))) == 7 & l >= 7 ~ "A",
    nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >= 7 ~ "B",
    any(station_7day$dyDQL == "E") & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >= 7 ~ "B",
    nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >= 7 ~ "E",
    TRUE ~ NA_character_)
}
The expected results are:
tibble::tribble(
~Monitoring.Location.ID, ~date, ~dyMax, ~dyMin, ~dyDQL, ~ma.max7, ~ma.max7_DQL,
"River 1", as.Date("2018-07-01"), 24.219, 22.537, "A", NA, NA,
"River 1", as.Date("2018-07-02"), 24.557, 20.388, "A", NA, NA,
"River 1", as.Date("2018-07-03"), 24.847, 20.126, "A", NA, NA,
"River 1", as.Date("2018-07-04"), 25.283, 20.674, "A", NA, NA,
"River 1", as.Date("2018-07-05"), 25.501, 20.865, "A", NA, NA,
"River 1", as.Date("2018-07-06"), 25.04, 21.008, "A", NA, NA,
"River 1", as.Date("2018-07-07"), 24.847, 20.674, "A", 24.8991428571429, "A",
"River 1", as.Date("2018-07-08"), 23.424, 20.793, "B", 24.7855714285714, "B",
"River 1", as.Date("2018-07-09"), 22.657, 18.866, "E", 24.5141428571429, "B",
"River 1", as.Date("2018-07-10"), 22.298, 18.2, "A", 24.15, "B",
"River 1", as.Date("2018-07-12"), 22.92, 19.008, "A", 23.531, "E",
"River 1", as.Date("2018-07-13"), 23.978, 19.532, "A", 23.354, "E",
"River 1", as.Date("2018-07-14"), 24.508, 19.936, "A", 23.2975, "E",
"River 1", as.Date("2018-07-15"), 25.137, 20.627, "A", 23.583, "E",
"River 1", as.Date("2018-07-19"), 24.919, 20.674, "A", NA, NA
)
The code works fine, but it is very slow when calculating values for multiple years of data with multiple water quality parameters at multiple locations.
Because a 7-day value can be calculated from 6 days of data, I don't think I can use any of the rolling functions from the zoo package. I also don't think I can use the roll_mean function from the roll package, due to the variable nature of discarding one day's worth of 'E' data when there are 6 days of 'A' or 'B' data.
Is there a way to vectorize this, to avoid looping through every row of data?
I used tidyverse and runner and have done it in a single piped syntax. Explanation:
I first collected seven days' worth (per the logic provided) of DQL and dyMax values into a list using runner.
Before doing that, I converted DQL into an ordered factor, which is used in the last step.
Second, I used purrr::map to modify each list element according to the given conditions:
windows with fewer than six values are not counted;
if there is exactly one E among the 7 values, it is not counted.
Finally, I unnested the list using unnest_wider.
library(tidyverse)
library(runner)

daily_data %>%
  mutate(dyDQL = factor(dyDQL, levels = c("A", "B", "E"), ordered = TRUE),
         d = runner(x = data.frame(a = dyMax, b = dyDQL),
                    k = "7 days",
                    lag = 0,
                    idx = date,
                    f = function(x) list(x))) %>%
  mutate(d = map(d, ~ .x %>%
                   group_by(b) %>%
                   mutate(c = n()) %>%                          # days per DQL grade in the window
                   ungroup() %>%
                   filter(!n() < 6) %>%                         # drop windows with fewer than 6 days
                   filter(!(b == "E" & c == 1 & n() == 7)) %>%  # a lone E among 7 days is discarded
                   summarise(ma.max7 = ifelse(n() == 0, NA, mean(a)),
                             ma.max7.DQL = max(b)))) %>%
  unnest_wider(d)
# A tibble: 15 x 7
Monitoring.Location.ID date dyMax dyMin dyDQL ma.max7 ma.max7.DQL
<chr> <date> <dbl> <dbl> <ord> <dbl> <ord>
1 River 1 2018-07-01 24.2 22.5 A NA NA
2 River 1 2018-07-02 24.6 20.4 A NA NA
3 River 1 2018-07-03 24.8 20.1 A NA NA
4 River 1 2018-07-04 25.3 20.7 A NA NA
5 River 1 2018-07-05 25.5 20.9 A NA NA
6 River 1 2018-07-06 25.0 21.0 A 24.9 A
7 River 1 2018-07-07 24.8 20.7 A 24.9 A
8 River 1 2018-07-08 23.4 20.8 B 24.8 B
9 River 1 2018-07-09 22.7 18.9 E 24.8 B
10 River 1 2018-07-10 22.3 18.2 A 24.4 B
11 River 1 2018-07-12 22.9 19.0 A 23.5 E
12 River 1 2018-07-13 24.0 19.5 A 23.4 E
13 River 1 2018-07-14 24.5 19.9 A 23.3 E
14 River 1 2018-07-15 25.1 20.6 A 23.6 E
15 River 1 2018-07-19 24.9 20.7 A NA NA
Here's a vectorized approach using slider::slide_index to calculate the high-quality and backup-quality values, then combining them to take the best available:
library(tidyverse); library(slider)
The following function groups by location, calculates the weekly mean (including everything from date - 6 up to and including date) and the number of observations included, then filters to windows having 6+ observations.
get_weekly_by_loc <- function(df) {
df %>%
group_by(Monitoring.Location.ID) %>%
mutate(mean = slide_index_dbl(dyMax, date, mean, .complete = TRUE, .before = lubridate::days(6)),
count = slide_index_dbl(dyMax, date, ~sum(!is.na(.)), .before = lubridate::days(6))) %>%
ungroup() %>%
filter(count >= 6)
}
Then we can run this function on just A/B data and overall:
daily_data_high_quality <- daily_data %>%
filter(dyDQL %in% c("A", "B")) %>%
get_weekly_by_loc() %>%
select(Monitoring.Location.ID, date, high_qual_mean = mean)
daily_data_backup <- daily_data %>%
get_weekly_by_loc() %>%
select(Monitoring.Location.ID, date, backup_mean = mean)
Then join those and use the high quality if available:
daily_data %>%
left_join(daily_data_high_quality) %>%
left_join(daily_data_backup) %>%
mutate(max7_DQL = coalesce(high_qual_mean, backup_mean)) %>%
mutate(moar_digits = format(max7_DQL, nsmall = 6))
Result
# A tibble: 15 x 9
Monitoring.Location.ID date dyMax dyMin dyDQL high_qual_mean backup_mean max7_DQL moar_digits
<chr> <date> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 River 1 2018-07-01 24.2 22.5 A NA NA NA " NA"
2 River 1 2018-07-02 24.6 20.4 A NA NA NA " NA"
3 River 1 2018-07-03 24.8 20.1 A NA NA NA " NA"
4 River 1 2018-07-04 25.3 20.7 A NA NA NA " NA"
5 River 1 2018-07-05 25.5 20.9 A NA NA NA " NA"
6 River 1 2018-07-06 25.0 21.0 A NA NA NA " NA"
7 River 1 2018-07-07 24.8 20.7 A 24.9 24.9 24.9 "24.899143"
8 River 1 2018-07-08 23.4 20.8 B 24.8 24.8 24.8 "24.785571"
9 River 1 2018-07-09 22.7 18.9 E NA 24.5 24.5 "24.514143"
10 River 1 2018-07-10 22.3 18.2 A 24.4 24.2 24.4 "24.398833"
11 River 1 2018-07-12 22.9 19.0 A NA 23.5 23.5 "23.531000"
12 River 1 2018-07-13 24.0 19.5 A NA 23.4 23.4 "23.354000"
13 River 1 2018-07-14 24.5 19.9 A NA 23.3 23.3 "23.297500"
14 River 1 2018-07-15 25.1 20.6 A NA 23.6 23.6 "23.583000"
15 River 1 2018-07-19 24.9 20.7 A NA NA NA " NA"
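To also attach the letter grade from the expected output, one hedged extension of the same idea, assuming the grading rules reduce to: "A" needs seven A-graded days, "B" needs 6+ A/B days, and otherwise "E" when 6+ days of any grade exist:
daily_data_pure_A <- daily_data %>%
  filter(dyDQL == "A") %>%
  get_weekly_by_loc() %>%
  filter(count == 7) %>%                 # an "A" grade requires a full week of A data
  select(Monitoring.Location.ID, date, a_mean = mean)

daily_data %>%
  left_join(daily_data_pure_A, by = c("Monitoring.Location.ID", "date")) %>%
  left_join(daily_data_high_quality, by = c("Monitoring.Location.ID", "date")) %>%
  left_join(daily_data_backup, by = c("Monitoring.Location.ID", "date")) %>%
  mutate(ma.max7 = coalesce(a_mean, high_qual_mean, backup_mean),
         ma.max7_DQL = case_when(!is.na(a_mean) ~ "A",
                                 !is.na(high_qual_mean) ~ "B",
                                 !is.na(backup_mean) ~ "E",
                                 TRUE ~ NA_character_))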