I have a dataframe of daily water chemistry values taken from deployed sensors. I'm trying to calculate rolling 7-day averages of the daily maximum values. This is in-situ environmental data, so it can be a bit messy.
Here are the rules for calculating the averages and assigning quality levels:
Data is graded and given a quality value (DQL) for the day (dyDQL).
'A' is high quality, 'B' is medium, and 'E' is poor.
The 7-day average is calculated at the end of a 7-day period.
The dataset needs only 6 complete days to calculate a 7-day average (i.e. it can miss 1 day of data).
If there are at least 6 days' worth of 'A' and 'B' graded data and 1 day of 'E', discard the 'E' data and calculate the 7-day average using the 6 days of 'A' and 'B' data (see the sketch below).
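To make these rules concrete, here is a minimal standalone sketch (my literal reading of the rules above, not code from the pipeline) of how a single 7-day window would be graded:
# Sketch: grade one 7-day window, given parallel vectors of the daily
# maxima and daily DQL grades (one element per day actually present).
grade_window <- function(dymax, dql) {
  n_ab <- sum(dql %in% c("A", "B"))
  if (sum(dql == "A") == 7) {                     # a full week of A data
    list(mean = mean(dymax), dql = "A")
  } else if (n_ab >= 6 && sum(dql == "E") == 1) { # discard the single E day
    list(mean = mean(dymax[dql != "E"]), dql = "B")
  } else if (n_ab >= 6) {
    list(mean = mean(dymax), dql = "B")
  } else if (length(dql) >= 6) {                  # only reaches 6 days by using E data
    list(mean = mean(dymax), dql = "E")
  } else {                                        # fewer than 6 days: no average
    list(mean = NA_real_, dql = NA_character_)
  }
}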
I have the code working with a loop that iterates over each result, creates a new dataframe containing the 7-day window, and then calculates the moving average. See the minimal example below.
Note that there are missing dates for the 11th, 16th, 17th, and 18th in this example:
daily_data <- tibble::tribble(
~Monitoring.Location.ID, ~date, ~dyMax, ~dyMin, ~dyDQL,
"River 1", as.Date("2018-07-01"), 24.219, 22.537, "A",
"River 1", as.Date("2018-07-02"), 24.557, 20.388, "A",
"River 1", as.Date("2018-07-03"), 24.847, 20.126, "A",
"River 1", as.Date("2018-07-04"), 25.283, 20.674, "A",
"River 1", as.Date("2018-07-05"), 25.501, 20.865, "A",
"River 1", as.Date("2018-07-06"), 25.04, 21.008, "A",
"River 1", as.Date("2018-07-07"), 24.847, 20.674, "A",
"River 1", as.Date("2018-07-08"), 23.424, 20.793, "B",
"River 1", as.Date("2018-07-09"), 22.657, 18.866, "E",
"River 1", as.Date("2018-07-10"), 22.298, 18.2, "A",
"River 1", as.Date("2018-07-12"), 22.92, 19.008, "A",
"River 1", as.Date("2018-07-13"), 23.978, 19.532, "A",
"River 1", as.Date("2018-07-14"), 24.508, 19.936, "A",
"River 1", as.Date("2018-07-15"), 25.137, 20.627, "A",
"River 1", as.Date("2018-07-19"), 24.919, 20.674, "A"
)
library(dplyr)

for (l in seq_len(nrow(daily_data))) {
  # all observations in the 7-day window ending on this row's date
  station_7day <- filter(daily_data,
                         between(date,
                                 daily_data[[l, "date"]] - lubridate::days(6),
                                 daily_data[[l, "date"]]))
  daily_data[l, "ma.max7"] <- case_when(
    nrow(subset(station_7day, dyDQL %in% "A")) == 7 & l >= 7 ~ mean(station_7day$dyMax),
    nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >= 7 ~ mean(station_7day$dyMax),
    any(station_7day$dyDQL == "E") & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >= 7 ~
      mean(station_7day$dyMax[station_7day$dyDQL %in% c("A", "B")]),
    nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >= 7 ~ mean(station_7day$dyMax),
    TRUE ~ NA_real_)
  daily_data[l, "ma.max7_DQL"] <- case_when(
    nrow(subset(station_7day, dyDQL %in% "A")) == 7 & l >= 7 ~ "A",
    nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >= 7 ~ "B",
    any(station_7day$dyDQL == "E") & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >= 7 ~ "B",
    nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >= 7 ~ "E",
    TRUE ~ NA_character_)
}
The expected results are:
tibble::tribble(
~Monitoring.Location.ID, ~date, ~dyMax, ~dyMin, ~dyDQL, ~ma.max7, ~ma.max7_DQL,
"River 1", as.Date("2018-07-01"), 24.219, 22.537, "A", NA, NA,
"River 1", as.Date("2018-07-02"), 24.557, 20.388, "A", NA, NA,
"River 1", as.Date("2018-07-03"), 24.847, 20.126, "A", NA, NA,
"River 1", as.Date("2018-07-04"), 25.283, 20.674, "A", NA, NA,
"River 1", as.Date("2018-07-05"), 25.501, 20.865, "A", NA, NA,
"River 1", as.Date("2018-07-06"), 25.04, 21.008, "A", NA, NA,
"River 1", as.Date("2018-07-07"), 24.847, 20.674, "A", 24.8991428571429, "A",
"River 1", as.Date("2018-07-08"), 23.424, 20.793, "B", 24.7855714285714, "B",
"River 1", as.Date("2018-07-09"), 22.657, 18.866, "E", 24.5141428571429, "B",
"River 1", as.Date("2018-07-10"), 22.298, 18.2, "A", 24.15, "B",
"River 1", as.Date("2018-07-12"), 22.92, 19.008, "A", 23.531, "E",
"River 1", as.Date("2018-07-13"), 23.978, 19.532, "A", 23.354, "E",
"River 1", as.Date("2018-07-14"), 24.508, 19.936, "A", 23.2975, "E",
"River 1", as.Date("2018-07-15"), 25.137, 20.627, "A", 23.583, "E",
"River 1", as.Date("2018-07-19"), 24.919, 20.674, "A", NA, NA
)
The code works fine, but it is very slow when calculating values for multiple years of data with multiple water quality parameters at multiple locations.
Because a 7-day value can be calculated from only 6 days of data, I don't think I can use any of the rolling functions from the zoo package. I also don't think I can use roll_mean from the roll package, because of the need to conditionally discard 1 day's worth of 'E' data when there are 6 days of 'A' or 'B' data.
Is there a way to vectorize this, in order to avoid looping through every row of data?
I used the tidyverse and runner, and have done it in a single piped chain. Explanation of the syntax:
I first collected each 7-day window (as per the logic provided) of DQL and max values into a list column using runner.
Before doing that, I converted dyDQL into an ordered factor, which is used in the final summarise.
Secondly, I used purrr::map to modify each list element according to the given conditions:
windows with fewer than six days are not counted;
if there is exactly one 'E' among 7 values, that day is not counted.
Finally, I unnested the list using unnest_wider.
library(tidyverse)
library(runner)

daily_data %>%
  mutate(dyDQL = factor(dyDQL, levels = c("A", "B", "E"), ordered = TRUE),
         # collect each 7-day window (indexed by date, so gaps shrink the window) into a list column
         d = runner(x = data.frame(a = dyMax, b = dyDQL),
                    k = "7 days",
                    lag = 0,
                    idx = date,
                    f = function(x) list(x))) %>%
  mutate(d = map(d, ~ .x %>%
                   group_by(b) %>%
                   mutate(c = n()) %>%                          # days per grade in the window
                   ungroup() %>%
                   filter(n() >= 6) %>%                         # need at least 6 days of data
                   filter(!(b == "E" & c == 1 & n() == 7)) %>%  # drop a single E day from a full week
                   summarise(ma.max7 = ifelse(n() == 0, NA, mean(a)),
                             ma.max7.DQL = max(b)))) %>%        # worst grade present, via the ordered factor
  unnest_wider(d)
# A tibble: 15 x 7
Monitoring.Location.ID date dyMax dyMin dyDQL ma.max7 ma.max7.DQL
<chr> <date> <dbl> <dbl> <ord> <dbl> <ord>
1 River 1 2018-07-01 24.2 22.5 A NA NA
2 River 1 2018-07-02 24.6 20.4 A NA NA
3 River 1 2018-07-03 24.8 20.1 A NA NA
4 River 1 2018-07-04 25.3 20.7 A NA NA
5 River 1 2018-07-05 25.5 20.9 A NA NA
6 River 1 2018-07-06 25.0 21.0 A 24.9 A
7 River 1 2018-07-07 24.8 20.7 A 24.9 A
8 River 1 2018-07-08 23.4 20.8 B 24.8 B
9 River 1 2018-07-09 22.7 18.9 E 24.8 B
10 River 1 2018-07-10 22.3 18.2 A 24.4 B
11 River 1 2018-07-12 22.9 19.0 A 23.5 E
12 River 1 2018-07-13 24.0 19.5 A 23.4 E
13 River 1 2018-07-14 24.5 19.9 A 23.3 E
14 River 1 2018-07-15 25.1 20.6 A 23.6 E
15 River 1 2018-07-19 24.9 20.7 A NA NA
Here's a vectorized approach using slider::slide_index_dbl() to calculate the high-quality and backup means, then combining for the best available:
library(tidyverse); library(slider)
The following function groups by location, calculates the weekly mean (including everything from date-6 to and including date) and number of observations included, then filters to just having 6+ observations.
get_weekly_by_loc <- function(df) {
df %>%
group_by(Monitoring.Location.ID) %>%
mutate(mean = slide_index_dbl(dyMax, date, mean, .complete = TRUE, .before = lubridate::days(6)),
count = slide_index_dbl(dyMax, date, ~sum(!is.na(.)), .before = lubridate::days(6))) %>%
ungroup() %>%
filter(count >= 6)
}
Then we can run this function on just A/B data and overall:
daily_data_high_quality <- daily_data %>%
filter(dyDQL %in% c("A", "B")) %>%
get_weekly_by_loc() %>%
select(Monitoring.Location.ID, date, high_qual_mean = mean)
daily_data_backup <- daily_data %>%
get_weekly_by_loc() %>%
select(Monitoring.Location.ID, date, backup_mean = mean)
Then join those and use the high quality if available:
daily_data %>%
left_join(daily_data_high_quality) %>%
left_join(daily_data_backup) %>%
mutate(max7_DQL = coalesce(high_qual_mean, backup_mean)) %>%
mutate(moar_digits = format(max7_DQL, nsmall = 6))
Result
# A tibble: 15 x 9
Monitoring.Location.ID date dyMax dyMin dyDQL high_qual_mean backup_mean max7_DQL moar_digits
<chr> <date> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 River 1 2018-07-01 24.2 22.5 A NA NA NA " NA"
2 River 1 2018-07-02 24.6 20.4 A NA NA NA " NA"
3 River 1 2018-07-03 24.8 20.1 A NA NA NA " NA"
4 River 1 2018-07-04 25.3 20.7 A NA NA NA " NA"
5 River 1 2018-07-05 25.5 20.9 A NA NA NA " NA"
6 River 1 2018-07-06 25.0 21.0 A NA NA NA " NA"
7 River 1 2018-07-07 24.8 20.7 A 24.9 24.9 24.9 "24.899143"
8 River 1 2018-07-08 23.4 20.8 B 24.8 24.8 24.8 "24.785571"
9 River 1 2018-07-09 22.7 18.9 E NA 24.5 24.5 "24.514143"
10 River 1 2018-07-10 22.3 18.2 A 24.4 24.2 24.4 "24.398833"
11 River 1 2018-07-12 22.9 19.0 A NA 23.5 23.5 "23.531000"
12 River 1 2018-07-13 24.0 19.5 A NA 23.4 23.4 "23.354000"
13 River 1 2018-07-14 24.5 19.9 A NA 23.3 23.3 "23.297500"
14 River 1 2018-07-15 25.1 20.6 A NA 23.6 23.6 "23.583000"
15 River 1 2018-07-19 24.9 20.7 A NA NA NA " NA"
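The question's expected output also grades each 7-day average, which the above doesn't produce. A sketch of one way to add it under the same join structure (my addition, not part of the original answer: daily_data_all_A and a_count are made-up names, and I assume an all-A week grades "A", 6+ days of A/B grades "B", and anything that needed E data grades "E"):
# Windows backed by a full week of A-graded data (for the strict "A" grade)
daily_data_all_A <- daily_data %>%
  filter(dyDQL == "A") %>%
  group_by(Monitoring.Location.ID) %>%
  mutate(a_count = slide_index_dbl(dyMax, date, ~ sum(!is.na(.x)),
                                   .before = lubridate::days(6))) %>%
  ungroup() %>%
  filter(a_count == 7) %>%
  select(Monitoring.Location.ID, date, a_count)

daily_data %>%
  left_join(daily_data_high_quality, by = c("Monitoring.Location.ID", "date")) %>%
  left_join(daily_data_backup, by = c("Monitoring.Location.ID", "date")) %>%
  left_join(daily_data_all_A, by = c("Monitoring.Location.ID", "date")) %>%
  mutate(ma.max7 = coalesce(high_qual_mean, backup_mean),
         ma.max7_DQL = case_when(!is.na(a_count) ~ "A",        # 7 days of A data
                                 !is.na(high_qual_mean) ~ "B", # 6+ days of A/B data
                                 !is.na(backup_mean) ~ "E",    # needed E data to reach 6 days
                                 TRUE ~ NA_character_))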
Anonymised example subset of a much larger dataset (now edited to show an option with multiple competing types):
structure(list(`Sample File` = c("A", "A", "A", "A", "A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C"),
Marker = c("X", "X", "X", "X", "Y", "Y", "Y", "Y", "Y", "Z",
"Z", "Z", "Z", "Z", "q", "q", "q", "q"), Allele = c(19, 20,
22, 23, 18, 18.2, 19, 19.2, 20, 12, 13, 14, 15, 16, 10, 10.2,
11, 12), Size = c(249.15, 253.13, 260.64, 264.68, 366, 367.81,
369.97, 372.02, 373.95, 91.65, 95.86, 100, 104.24, 108.38,
177.51, 179.4, 181.42, 185.49), Height = c(173L, 1976L, 145L,
1078L, 137L, 62L, 1381L, 45L, 1005L, 38L, 482L, 5766L, 4893L,
19L, 287L, 36L, 5001L, 50L), Type = c("minusone", "allele",
"minusone", "allele", "ambiguous", "minushalf", "allele",
"minushalf", "allele", "minustwo", "ambiguous", "allele",
"allele", "plusone", "minusone", "minushalf", "allele", "plusone"
), LUS = c(11.75, 11.286, 13.375, 13.5, 18, 9, 19, 10, 20,
12, 11, 14, 15, 16, 9.5, NA, 11, 11.5)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -18L), groups = structure(list(
`Sample File` = c("A", "A", "B", "C"), Marker = c("X", "Y",
"Z", "q"), .rows = structure(list(1:4, 5:9, 10:14, 15:18), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), .drop = TRUE))
I want to look up values based on the classification $Type.
"minustwo" means I want to look up the "Allele", "Height" and "LUS"
values for the row with "Allele" equal to the current row plus two,
with the same Sample File and Marker.
"minusone" means the same but for "Allele" equal to the current row plus one.
"minushalf" means the same but for "Allele" equal to the current row plus 0.2 but the dot values here are 25% each, so 12.1, 12.3, 12.3, 13, 13.1 etc - I have a helper function plusTwoBP() for this.
"plusone" means the same for "Allele" equal to the current row -1
"allele" or "ambiguous" don't need to do anything.
Ideal output:
# A tibble: 18 × 10
# Rowwise: Sample File, Marker
`Sample File` Marker Allele Size Height Type LUS ParentHeight ParentAllele ParentLUS
<chr> <chr> <dbl> <dbl> <int> <chr> <dbl> <int> <dbl> <dbl>
1 A X 19 249. 173 minusone 11.8 1976 20 11.3
2 A X 20 253. 1976 allele 11.3 NA NA NA
3 A X 22 261. 145 minusone 13.4 1078 23 13.5
4 A X 23 265. 1078 allele 13.5 NA NA NA
5 A Y 18 366 137 ambiguous 18 NA NA NA
6 A Y 18.2 368. 62 minushalf 9 1381 19 19
7 A Y 19 370. 1381 allele 19 NA NA NA
8 A Y 19.2 372. 45 minushalf 10 1005 20 20
9 A Y 20 374. 1005 allele 20 NA NA NA
10 B Z 12 91.6 38 minustwo 12 5766 14 14
11 B Z 13 95.9 482 ambiguous 11 NA NA NA
12 B Z 14 100 5766 allele 14 NA NA NA
13 B Z 15 104. 4893 allele 15 NA NA NA
14 B Z 16 108. 19 plusone 16 4893 15 15
15 C q 10 178. 287 minusone 9.5 5001 11 11
16 C q 10.2 179. 36 minushalf NA 5001 11 11
17 C q 11 181. 5001 allele 11 NA NA NA
18 C q 12 185. 50 plusone 11.5 5001 11 11
I have a rather belaboured way of doing it:
# eg for minustwo
sampleData %>%
  filter(Type == "minustwo") %>%
  rowwise() %>%
  mutate(ParentHeight = sampleData$Height[sampleData$`Sample File` == `Sample File` &
                                            sampleData$Marker == Marker &
                                            sampleData$Allele == (Allele + 2)],
         ParentAllele = sampleData$Allele[sampleData$`Sample File` == `Sample File` &
                                            sampleData$Marker == Marker &
                                            sampleData$Allele == (Allele + 2)],
         ParentLUS = sampleData$LUS[sampleData$`Sample File` == `Sample File` &
                                      sampleData$Marker == Marker &
                                      sampleData$Allele == (Allele + 2)]) %>%
  right_join(sampleData)
I then have to redo that for each of my Types
My real dataset is thousands of rows, so this ends up being a little slow but manageable. More to the point, I want to learn a better way to do it; in particular, the sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker part seems like it should be doable with grouping, so I must be missing a trick there.
I have tried using group_map() but I've clearly not understood it correctly:
sampleData$ParentHeight <- sampleData %>%
group_by(`Sample File`, `Marker`) %>%
group_map(.f = \(.x, .y) {
pmap_dbl(.l = .x, .f = \(Allele, Height, Type, ...){
if(Type == "allele" | Type == "ambiguous") { return(0)
} else if (Type == "plusone") {
return(.x$Height[.x$Allele == round(Allele - 1, 1)])
} else if (Type == "minushalf") {
return(.x$Height[.x$Allele == round(plusTwoBP(Allele), 1)])
} else if (Type == "minusone") {
return(.x$Height[.x$Allele == round(Allele + 1, 1)])
} else if (Type == "minustwo") {
return(.x$Height[.x$Allele == round(Allele + 2, 1)])
} else { stop("unexpected peak type") }
})}) %>% unlist()
Initially this seems to work, but on investigation it's not respecting both layers of grouping, so it brings matches from the wrong Marker. Additionally, here I'm assigning the output to a new column in the data frame, but if I instead try to wrap a mutate() around it so that I can create all three new columns in one go, then the group_map() no longer works at all.
I also considered using complete() to hugely extend the data frame with all possible values of Allele (including the x.0, x.1, x.2, x.3 variants), then using lag() to select the corresponding rows, then dropping the spare rows. This seems like it would make the data frame enormous in the interim.
To summarise
This works, but it feels ugly and like I'm missing a more elegant and obvious solution. How would you approach this?
You can create two versions of Allele: one identical to the original Allele, and one adjusted according to Type (minusone, minustwo, etc.).
Then do a self left join based on that adjusted version of Allele (plus Sample File and Marker):
sampleData <- sampleData %>%
  group_by(`Sample File`, Marker) %>%
  mutate(id = Allele) %>%
  ungroup()

left_join(
  sampleData %>%
    mutate(id = case_when(Type == "minusone"  ~ id + 1,
                          Type == "minustwo"  ~ id + 2,
                          Type == "plusone"   ~ id - 1,
                          Type == "minushalf" ~ ceiling(id))),
  sampleData %>% select(-c(Size, Type)),
  by = c("Sample File", "Marker", "id"),
  suffix = c("", ".parent")
) %>%
  select(-id)
Output:
# A tibble: 18 × 10
`Sample File` Marker Allele Size Height Type LUS Allele.parent Height.parent LUS.parent
<chr> <chr> <dbl> <dbl> <int> <chr> <dbl> <dbl> <int> <dbl>
1 A X 19 249. 173 minusone 11.8 20 1976 11.3
2 A X 20 253. 1976 allele 11.3 NA NA NA
3 A X 22 261. 145 minusone 13.4 23 1078 13.5
4 A X 23 265. 1078 allele 13.5 NA NA NA
5 A Y 18 366 137 ambiguous 18 NA NA NA
6 A Y 18.2 368. 62 minushalf 9 19 1381 19
7 A Y 19 370. 1381 allele 19 NA NA NA
8 A Y 19.2 372. 45 minushalf 10 20 1005 20
9 A Y 20 374. 1005 allele 20 NA NA NA
10 B Z 12 91.6 38 minustwo 12 14 5766 14
11 B Z 13 95.9 482 ambiguous 11 NA NA NA
12 B Z 14 100 5766 allele 14 NA NA NA
13 B Z 15 104. 4893 allele 15 NA NA NA
14 B Z 16 108. 19 plusone 16 15 4893 15
15 C q 10 178. 287 minusone 9.5 11 5001 11
16 C q 10.2 179. 36 minushalf NA 11 5001 11
17 C q 11 181. 5001 allele 11 NA NA NA
18 C q 12 185. 50 plusone 11.5 11 5001 11
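Note that ceiling(id) reproduces the minushalf lookup only because every fractional allele in this data ends in .2; if x.1 or x.3 alleles occur, the plusTwoBP() helper sketched above could be swapped in instead (Type == "minushalf" ~ plusTwoBP(id)).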
I have 2 sets of data that look like this (this is a very small subset of it).
data1 <- data.frame(Metal = c("Al", "Al", "Al", "Al", "Al", "Al", "Al"),
                    Type = c("F", "F", "F", "F", "F", "F", "F"),
                    Date = c("2000-01-01", "2000-01-01", "2000-01-02", "2000-01-03",
                             "2000-01-03", "2000-01-07", "2000-01-07"),
                    Time = c("11:00:00", "12:00:00", "15:00:00", "13:00:00",
                             "17:00:00", "20:00:00", "20:00:00"),
                    Value = c(100, 200, 300, 100, 400, 500, 500))
data2 <- data.frame(Metal = c("Al", "Al", "Al", "Al", "Al", "Al", "Al"),
                    Type = c("P", "P", "P", "P", "P", "P", "P"),
                    Date = c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-03",
                             "2000-01-03", "2000-01-04", "2000-01-07"),
                    Time = c("11:00:00", "11:00:00", "14:00:00", "17:00:00",
                             "13:00:00", "16:00:00", "20:00:00"),
                    Value = c(100, 100, 200, 900, 100, 400, 999))
I want to keep the rows from both tables that have the same date and time, and create a new table (data3) from them. Sometimes there are duplicates within data1 and data2; I don't want data3 to contain those duplicates, just one of them, together with its pair from the other table. I would also like the output table ordered so the pairs from each table sit under each other (so my "Type" column alternates F, P, F, P, etc.).
Here is my desired output
data3 <- data.frame(Metal = c("Al", "Al", "Al", "Al", "Al", "Al", "Al", "Al"),
                    Type = c("F", "P", "F", "P", "F", "P", "F", "P"),
                    Date = c("2000-01-01", "2000-01-01", "2000-01-03", "2000-01-03",
                             "2000-01-03", "2000-01-03", "2001-01-07", "2001-01-07"),
                    Time = c("11:00:00", "11:00:00", "13:00:00", "13:00:00",
                             "17:00:00", "17:00:00", "20:00:00", "20:00:00"),
                    Value = c(100, 100, 100, 100, 400, 900, 500, 999))
I have tried various types of joins from dplyr, but they aren't joining the way I'd like.
Thank you for your help!!
We may need to bind the data, and then filter out the duplicates after grouping:
library(dplyr)
library(data.table)

bind_rows(data1, data2, .id = 'grp') %>%
  group_by(Metal, Date, Time) %>%
  filter(n() > 1) %>%                   # keep only date/times present in both tables
  arrange(Date, Time, rowid(grp)) %>%
  slice(match(c("F", "P"), Type)) %>%   # one F and one P per group, in that order
  ungroup %>%
  select(-grp)
Output:
# A tibble: 8 × 5
Metal Type Date Time Value
<chr> <chr> <chr> <chr> <dbl>
1 Al F 2000-01-01 11:00:00 100
2 Al P 2000-01-01 11:00:00 100
3 Al F 2000-01-03 13:00:00 100
4 Al P 2000-01-03 13:00:00 100
5 Al F 2000-01-03 17:00:00 400
6 Al P 2000-01-03 17:00:00 900
7 Al F 2000-01-07 20:00:00 500
8 Al P 2000-01-07 20:00:00 999
With OP's data:
> data3
Metal Type Date Time Value
1 Al F 2000-01-01 11:00:00 100
2 Al P 2000-01-01 11:00:00 100
3 Al F 2000-01-03 13:00:00 100
4 Al P 2000-01-03 13:00:00 100
5 Al F 2000-01-03 17:00:00 400
6 Al P 2000-01-03 17:00:00 900
7 Al F 2001-01-07 20:00:00 500
8 Al P 2001-01-07 20:00:00 999
This was not easy :-)
library(dplyr)
bind_rows(data1, data2) %>%
  group_by(Date, Time) %>%
  filter(n() > 1) %>%    # keep date/times occurring more than once
  ungroup() %>%
  group_by(Type) %>%
  arrange(Time) %>%
  ungroup() %>%
  # flag each P row immediately preceded by an F row, and the F row before it
  mutate(Flag = ifelse(Type == "P" & lag(Type, default = last(Type)) == "F", 1, NA)) %>%
  mutate(Flag1 = lead(Flag)) %>%
  filter(if_any(.cols = starts_with("Flag"), .fns = ~ . == 1)) %>%
  select(-starts_with("Flag"))
# A tibble: 8 × 5
Metal Type Date Time Value
<chr> <chr> <chr> <chr> <dbl>
1 Al F 2000-01-01 11:00:00 100
2 Al P 2000-01-01 11:00:00 100
3 Al F 2000-01-03 13:00:00 100
4 Al P 2000-01-03 13:00:00 100
5 Al F 2000-01-03 17:00:00 400
6 Al P 2000-01-03 17:00:00 900
7 Al F 2000-01-07 20:00:00 500
8 Al P 2000-01-07 20:00:00 999
An approach with inner_join.
The difficulty here is getting the right output format; the data filtering itself is done by the inner_join.
library(dplyr)
library(tidyr)
joined <- inner_join(data1 %>% distinct(), data2 %>% distinct(),
                     by = c("Metal", "Date", "Time"))
joined
Metal Type.x Date Time Value.x Type.y Value.y
1 Al F 2000-01-01 11:00:00 100 P 100
2 Al F 2000-01-03 13:00:00 100 P 100
3 Al F 2000-01-03 17:00:00 400 P 900
4 Al F 2000-01-07 20:00:00 500 P 999
Arranging data
joined %>%
pivot_longer(starts_with("Type"), values_to="Type") %>%
rowwise() %>%
mutate(Value = c_across(starts_with("Value"))[c(F=1, P=2)[Type]]) %>%
select(-contains("."), -name) %>%
ungroup()
# A tibble: 8 × 5
Metal Date Time Type Value
<chr> <chr> <chr> <chr> <dbl>
1 Al 2000-01-01 11:00:00 F 100
2 Al 2000-01-01 11:00:00 P 100
3 Al 2000-01-03 13:00:00 F 100
4 Al 2000-01-03 13:00:00 P 100
5 Al 2000-01-03 17:00:00 F 400
6 Al 2000-01-03 17:00:00 P 900
7 Al 2000-01-07 20:00:00 F 500
8 Al 2000-01-07 20:00:00 P 999
I'm unsure how to structure my pivot_longer() call when I have both annual and monthly data. For example, I have:
wide <- data.frame(region_name = character(), # Create empty data frame
total_population_2019 = numeric(),
total_population_2020 = numeric(),
mean_temperature_2019_1 = numeric(),
mean_temperature_2019_2 = numeric(),
mean_temperature_2020_1 = numeric(),
mean_temperature_2020_2 = numeric(),
stringsAsFactors = FALSE)
wide[1, ] <- list("funville", 50000, 51250, 26.3, 24.6, 25.7, 24.9)
region_name total_population_2019 total_population_2020 mean_temperature_2019_1 mean_temperature_2019_2 mean_temperature_2020_1 mean_temperature_2020_2
funville 50000 51250 26.3 24.6 25.7 24.9
I'm able to pivot on the monthly columns using pivot_longer(), then re-widen by layer with spread():
long <- pivot_longer(wide, cols = 4:7, names_to = c("layer" ,"year", "month"),
names_pattern = "(.*)_(.*)_?_(.*)") %>%
group_by(layer) %>%
mutate(n = 1:n()) %>%
spread(layer, value) %>%
select(-n)
which gives
region_name total_population_2019 total_population_2020 year month mean_temperature
1 funville 50000 51250 2019 1 26.3
2 funville 50000 51250 2019 2 24.6
3 funville 50000 51250 2020 1 25.7
4 funville 50000 51250 2020 2 24.9
I'd now like a population column where each year's value is attributed to every row/month that falls in that year; ideally it would look like:
desired.df <- data.frame(region_name = c("funville", "funville", "funville", "funville"),
year = c("2019", "2019", "2020", "2020"),
month = c("1", "2", "1", "2"),
population = c("50000", "50000", "51250", "51250"),
mean_temperature = c("26.3", "24.6", "25.7", "24.9"))
which gives
region_name year month population mean_temperature
1 funville 2019 1 50000 26.3
2 funville 2019 2 50000 24.6
3 funville 2020 1 51250 25.7
4 funville 2020 2 51250 24.9
Does anyone have a solution? Thanks in advance
One option would be to use the names_pattern argument together with the special ".value" sentinel, which tells pivot_longer that the matched part of each column name should become the name of an output column rather than a value. To make this work, I first add a helper month to your population columns. Additionally, I use tidyr::fill to fill up the population column:
library(dplyr)
library(tidyr)
wide |>
rename_with(~ paste(.x, 1, sep = "_"), starts_with("total")) |>
pivot_longer(-region_name,
names_to = c(".value", "year", "month"),
names_pattern = "^(.*?)_(\\d+)_(\\d+)$") |>
group_by(year) |>
fill(total_population) |>
arrange(year)
#> # A tibble: 4 × 5
#> # Groups: year [2]
#> region_name year month total_population mean_temperature
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 funville 2019 1 50000 26.3
#> 2 funville 2019 2 50000 24.6
#> 3 funville 2020 1 51250 25.7
#> 4 funville 2020 2 51250 24.9
I have a dataset that looks like this:
TYPE YEAR NUMBERS
A 2020 60
A 2019 NA
A 2018 88
A 2017 NA
A 2016 90
I want to impute the missing values with the mean of the values in the NUMBERS column.
I have searched through a lot of tutorials, but they just directly replace the missing values with the mean, which is not what I want. I tried using mice and Hmisc, but they produced errors. So, is there a better way to do this? Thanks!
I'd have done this :
df <- read.table(text = 'TYPE YEAR NUMBERS
A 2020 60
A 2019 NA
A 2018 88
A 2017 NA
A 2016 90', header=T)
a <- mean(na.omit(df$NUMBERS))
df[is.na(df$NUMBERS), "NUMBERS"] <- a
df
Output:
TYPE YEAR NUMBERS
1 A 2020 60.00000
2 A 2019 79.33333
3 A 2018 88.00000
4 A 2017 79.33333
5 A 2016 90.00000
Is this what you wanted?
I'm inferring from the presence of the TYPE column that you should be imputing based on the group's mean, not the population's mean.
Modified data:
dat <- structure(list(TYPE = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"), YEAR = c(2020L, 2019L, 2018L, 2017L, 2016L, 2020L, 2019L, 2018L, 2017L, 2016L), NUMBERS = c(60L, NA, 88L, NA, 90L, 160L, NA, 188L, NA, 190L)), class = "data.frame", row.names = c(NA, -10L))
base R
do.call(rbind, by(dat, dat$TYPE,
function(z) { z$NUMBERS[is.na(z$NUMBERS)] <- mean(z$NUMBERS, na.rm = TRUE); z}))
# TYPE YEAR NUMBERS
# A.1 A 2020 60.00000
# A.2 A 2019 79.33333
# A.3 A 2018 88.00000
# A.4 A 2017 79.33333
# A.5 A 2016 90.00000
# B.6 B 2020 160.00000
# B.7 B 2019 179.33333
# B.8 B 2018 188.00000
# B.9 B 2017 179.33333
# B.10 B 2016 190.00000
or
do.call(rbind, by(dat, dat$TYPE,
function(z) transform(z, NUMBERS = ifelse(is.na(NUMBERS), mean(NUMBERS, na.rm = TRUE), NUMBERS))))
dplyr
library(dplyr)
dat %>%
group_by(TYPE) %>%
mutate(NUMBERS = if_else(is.na(NUMBERS), mean(NUMBERS, na.rm = TRUE), as.numeric(NUMBERS))) %>%
ungroup()
# # A tibble: 10 x 3
# TYPE YEAR NUMBERS
# <chr> <int> <dbl>
# 1 A 2020 60
# 2 A 2019 79.3
# 3 A 2018 88
# 4 A 2017 79.3
# 5 A 2016 90
# 6 B 2020 160
# 7 B 2019 179.
# 8 B 2018 188
# 9 B 2017 179.
# 10 B 2016 190
I have the following problem.
I computed the average temperature per country, and also the difference between the actual daily temperature and that average. See the code below:
df1 <- data.frame(country = c("01", "01", "01","01", "01", "02", "02" , "03", "03","03"),
date = c("2020-01-01", "2020-01-02", "2020-01-03" , "2020-01-05", "2020-01-07", "2020-01-01", "2020-01-03", "2020-01-02", "2020-01-03", "2020-01-04"),
temperature = c(4, 3, -2, 0.1, -3, 1.5, 12, 10, 7, 5),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7, 6, 12)
)
library(dplyr)
df2 <- df1 %>%
group_by(country) %>%
mutate(mean_per_country = mean(temperature))
df2$difference <- df2$temperature - df2$mean_per_country
Now I need to create a new column that checks whether consecutive days (an unlimited number of them) in the same country have a negative or a positive difference between the actual daily temperature and the average. Is there an elegant way to do this in R?
Desired output is here:
desired_df <- data.frame(country = c("01", "01", "01","01", "01", "02", "02" , "03", "03","03"),
date = c("2020-01-01", "2020-01-02", "2020-01-03" , "2020-01-05", "2020-01-07", "2020-01-01", "2020-01-03", "2020-01-02", "2020-01-03", "2020-01-04"),
temperature = c(4, 3, -2, 2, -3, 1.5, 12, 10, 7, 5),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7, 6, 12),
mean_per_country = c(0.42, 0.42, 0.42, 0.42, 0.42, 6.75, 6.75, 7.33, 7.33, 7.33),
difference = c(3.58, 2.58, -2.42 , -0.32, -3.42 , -5.25, 5.25, 2.67, -0.333, -2.33),
new_column = c("hot",
"hot",
"", #day interrupted, therefor not "cold"
"", #day interrupted, therefor not "cold"
"", #day interrupted, therefor not "cold"
"",
"",
"",
"cold",
"cold")
)
Thank you very much
Here's an approach with dplyr:
library(dplyr)
df2 %>%
  group_by(country) %>%
  mutate(date = as.Date(date),
         consecutive = date - lag(date) == 1,
         result = (sign(difference) == sign(lead(difference)) & lead(consecutive)) |
                  (sign(difference) == sign(lag(difference)) & consecutive),
         # result * sign(difference) is -1, 0, or 1, so + 2 indexes into c("cold", NA, "hot")
         new_column = c("cold", NA_character_, "hot")[result * sign(difference) + 2])
# A tibble: 10 x 9
# Groups: country [3]
country date temperature blabla mean_per_country difference consecutive result new_column
<chr> <date> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <chr>
1 01 2020-01-01 4 23 0.42 3.58 NA TRUE hot
2 01 2020-01-02 3 41 0.42 2.58 TRUE TRUE hot
3 01 2020-01-03 -2 32 0.42 -2.42 TRUE FALSE NA
4 01 2020-01-05 0.1 8 0.42 -0.32 FALSE FALSE NA
5 01 2020-01-07 -3 50 0.42 -3.42 FALSE NA NA
6 02 2020-01-01 1.5 27 6.75 -5.25 NA NA NA
7 02 2020-01-03 12 8 6.75 5.25 FALSE NA NA
8 03 2020-01-02 10 7 7.33 2.67 NA NA NA
9 03 2020-01-03 7 6 7.33 -0.333 TRUE TRUE cold
10 03 2020-01-04 5 12 7.33 -2.33 TRUE TRUE cold
To get rid of the intermediate columns that I left there for illustration purposes, just use select(-(consecutive:result)).
You need to turn the dates into Date class; then you can calculate the differences between dates. Then group by country and use ifelse() to set the values where the differences are 1:
require(plyr)
require(dplyr)

df2$date <- as.Date(df2$date)
diffs <- c(0, diff(df2$date))
df2 %>%
  group_by(country) %>%
  plyr::mutate(new_column = ifelse((difference > 0) & (diffs == 1), "hot",
                                   ifelse((difference < 0) & (diffs == 1), "cold", " ")))
> df2
country date temperature blabla mean_per_country difference new_column
1 01 2020-01-01 4.0 23 0.420000 3.5800000
2 01 2020-01-02 3.0 41 0.420000 2.5800000 hot
3 01 2020-01-03 -2.0 32 0.420000 -2.4200000 cold
4 01 2020-01-05 0.1 8 0.420000 -0.3200000
5 01 2020-01-07 -3.0 50 0.420000 -3.4200000
6 02 2020-01-01 1.5 27 6.750000 -5.2500000
7 02 2020-01-03 12.0 8 6.750000 5.2500000
8 03 2020-01-02 10.0 7 7.333333 2.6666667
9 03 2020-01-03 7.0 6 7.333333 -0.3333333 cold
10 03 2020-01-04 5.0 12 7.333333 -2.3333333 cold