I'm trying to figure out how to do the following without looping. I have a melted dataset of time, study site, and flow that looks like:
datetime site flow
6/1/2009 00:00 EBT NA
6/2/2009 01:00 EBT NA
6/3/2009 02:00 EBT 0.1
6/4/2009 03:00 EBT NA
6/5/2009 04:00 EBT NA
6/1/2009 00:00 MUT 0.4
6/2/2009 01:00 MUT 0.3
6/3/2009 02:00 MUT 0.2
6/4/2009 03:00 MUT NA
6/5/2009 04:00 MUT NA
I need to subset this by site, and then for periods when there are at least two consecutive flow measurements I need to perform a couple of calculations, for example the mean of the current and previous measurement.
The trick is that I need to perform the average on each set of consecutive measurements, i.e. if there are three in a row, then for each of the latter two I need the average of that measurement and the previous one. I've added a goal column to the sample dataframe with the results I'd like to get.
I'd like to end up with a similar looking dataframe with the datetime, site, and result of the calculation. There is a full time series for each site.
Thanks for any help!
data generator:
structure(list(datetime = structure(c(1167627600, 1167717600,
1167807600, 1167897600, 1167987600, 1167627600, 1167717600, 1167807600,
1167897600, 1167987600, 1168077600, 1168167600, 1168257600, 1168347600,
1168437600), class = c("POSIXct", "POSIXt"), tzone = ""), site = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("EBT",
"MUT"), class = "factor"), flow = c(NA, 0.1, NA, NA, NA, NA,
0.4, 0.2, NA, NA, 0.4, 0.2, 0.1, NA, NA), goal = c(NA, NA, NA,
NA, NA, NA, NA, 0.3, NA, NA, NA, 0.3, 0.15, NA, NA)), .Names = c("datetime",
"site", "flow", "goal"), row.names = c(NA, -15L), class = "data.frame")
This will separate your dataframe by site and then filter only rows that have two or more consecutive non-NA values in flow:
by(sample, sample$site, function(d) d[with(rle(!is.na(d$flow)), rep(values & lengths>=2, lengths)),])
You can then work on the function inside to do your calculations as needed.
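To see what the rle() trick is doing, here is a small standalone sketch on a flow vector shaped like the sample data:

```r
flow <- c(NA, 0.4, 0.2, NA, NA, 0.4, 0.2, 0.1, NA, NA)

# rle() encodes runs of identical values; a run is kept only if it is a
# run of non-NA entries (values) of length at least two (lengths >= 2)
r <- rle(!is.na(flow))
keep <- rep(r$values & r$lengths >= 2, r$lengths)
keep
#  [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
```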
For instance, if you want to add the mean as a new column (assuming you want NA when not defined) you can use this:
f <- function(d)
{
x <- with(rle(!is.na(d$flow)), rep(values & lengths>=2, lengths))
within(d, {avg <- NA; avg[x] <- mean(d[x,"flow"])})
}
b <- by(sample, sample$site, f)
Reduce(rbind, b)
Result:
datetime site flow avg
1 2009-06-01 01:00:00 EBT NA NA
2 2009-06-02 02:00:00 EBT NA NA
3 2009-06-03 03:00:00 EBT 0.1 NA
4 2009-06-04 04:00:00 EBT NA NA
5 2009-06-05 05:00:00 EBT NA NA
6 2009-06-01 01:00:00 MUT 0.4 0.3
7 2009-06-02 02:00:00 MUT 0.3 0.3
8 2009-06-03 03:00:00 MUT 0.2 0.3
9 2009-06-04 04:00:00 MUT NA NA
10 2009-06-05 05:00:00 MUT NA NA
EDIT: To get the mean between the current flow measure and the previous one, you can use this:
f <- function(d)
{
within(d, avg <- (flow+c(NA,head(flow,-1)))/2)
}
Reduce(rbind, by(sample, sample$site, f))
Note that cases with a single measure are automatically set to NA. New result:
datetime site flow goal avg
1 2007-01-01 03:00:00 EBT NA NA NA
2 2007-01-02 04:00:00 EBT 0.1 NA NA
3 2007-01-03 05:00:00 EBT NA NA NA
4 2007-01-04 06:00:00 EBT NA NA NA
5 2007-01-05 07:00:00 EBT NA NA NA
6 2007-01-01 03:00:00 MUT NA NA NA
7 2007-01-02 04:00:00 MUT 0.4 NA NA
8 2007-01-03 05:00:00 MUT 0.2 0.30 0.30
9 2007-01-04 06:00:00 MUT NA NA NA
10 2007-01-05 07:00:00 MUT NA NA NA
11 2007-01-06 08:00:00 MUT 0.4 NA NA
12 2007-01-07 09:00:00 MUT 0.2 0.30 0.30
13 2007-01-08 10:00:00 MUT 0.1 0.15 0.15
14 2007-01-09 11:00:00 MUT NA NA NA
15 2007-01-10 12:00:00 MUT NA NA NA
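For reference, the same pairwise mean can be written with base R's stats::filter() as a one-sided, length-2 moving average; this is a sketch on a standalone vector, outside the by() split:

```r
flow <- c(NA, 0.4, 0.2, NA, NA, 0.4, 0.2, 0.1, NA, NA)

# avg[i] = 0.5*flow[i] + 0.5*flow[i-1]; NA propagates automatically,
# and the first element is NA because it has no predecessor
avg <- as.numeric(stats::filter(flow, c(0.5, 0.5), sides = 1))
```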
plyr functions are a good way to split a dataframe apart by certain variables, which is what you need to do here.
I thought of two ways to handle intervals on a vector: first with vector multiplication (for the mean of the data), and second with vectorizing a function (for generating the labels). They're both doing pretty much the same thing, though.
library(reshape2)
library(plyr)
library(lubridate)
meanBetween <- function(x){
l <- length(x)
diag(outer(x[1:(l-1)], x[2:l], "+"))/2
}
output <- ddply(sample, .(site), function(df){
df <- df[order(df$datetime, decreasing=FALSE), ]
result <- meanBetween(df$flow)
names(result) <- Reduce(c, (mapply(as.interval,
df$datetime[-1],
df$datetime[1:(length(df$datetime)-1)],
SIMPLIFY=FALSE)))
result
})
melt(output) # to make it look nicer
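As an aside, the diag(outer(...)) construction builds a full n-by-n matrix only to keep its diagonal; a lighter equivalent (a sketch, not part of the original answer) averages each element with its successor directly:

```r
# Mean of each consecutive pair, without the n x n outer product
meanBetween2 <- function(x) (head(x, -1) + tail(x, -1)) / 2

meanBetween2(c(0.4, 0.3, 0.2))
# [1] 0.35 0.25
```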
Related
I would like to create a time series with a monthly interval by extending an already existing time series.
I have "t1" time series:
structure(c(49.25, 49.25, 30, 99.25, 99.25, 100.5, 101,
91.25), .Dim = c(1L, 8L), .Dimnames = list(NULL, c("2021-03-31",
"2022-03-31", "2022-05-31", "2022-09-30", "2022-12-31", "2023-03-31",
"2023-05-31", "2023-09-30")), .Tsp = c(1, 1, 1), class = c("mts",
"ts", "matrix"))
I would like to extend the above series to include monthly observations. How can I do this?
The object in the question is in a strange form. It consists of 8 separate one-point time series with column names given by character dates. First extract the character dates and the values into a zoo object with yearmon time class -- yearmon directly represents a year and month without a day. Ensure that it has frequency 12 and convert it to ts class, which has the effect of filling in the missing months. Finally, extend it to the desired date.
library(zoo)
z <- zoo(c(t1), as.yearmon(colnames(t1)), frequency = 12)
tt <- window(as.ts(z), end = c(2024, 11), extend = TRUE)
tt
giving this ts object:
        Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
2021                49.25     NA     NA     NA     NA     NA     NA     NA     NA     NA
2022     NA     NA  49.25     NA  30.00     NA     NA     NA  99.25     NA     NA  99.25
2023     NA     NA 100.50     NA 101.00     NA     NA     NA  91.25     NA     NA     NA
2024     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
Note that you can use View(as.zoo(tt)) to view tt and can use na.approx(tt, na.rm = FALSE, rule = 2) to fill in internal NAs with interpolated values and trailing NAs with the last non-NA value.
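A minimal sketch of that interpolation behaviour on a toy zoo series (assuming the zoo package is available):

```r
library(zoo)

# na.approx() linearly interpolates internal NAs; rule = 2 additionally
# extends the nearest non-NA value over leading/trailing NAs
x <- zoo(c(1, NA, 3, NA, NA))
as.numeric(na.approx(x, na.rm = FALSE, rule = 2))
# [1] 1 2 3 3 3
```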
Note
The input is shown in the question as:
t1 <- structure(c(49.25, 49.25, 30, 99.25, 99.25, 100.5, 101, 91.25), .Dim = c(1L, 8L), .Dimnames = list(NULL, c("2021-03-31", "2022-03-31", "2022-05-31", "2022-09-30", "2022-12-31", "2023-03-31", "2023-05-31", "2023-09-30")), .Tsp = c(1, 1, 1), class = c("mts", "ts", "matrix"))
My dataset contains NDVI values and NDVI-QualityDescriptor values(PixelQa) for different areas in different dates. I basically want to erase (setting to NA) the NDVI values that are related to bad quality descriptor (PixelQa). The number suffix of the column names relates both data: PixelQa_1 is related to NDVI_1 and so on.
Therefore, to "clean" my data I have to check each PixelQa value in order to decide whether to change its related NDVI value. There are three possible situations:
PixelQa is NA -> then NDVI should be also NA.
Pixel Qa is within 66±0.5 OR 130±0.5 -> then NDVI keeps its value.
Pixel Qa is outside both 66±0.5 AND 130±0.5 -> then the NDVI value is set to NA (this is bad-quality data which needs to be ignored).
My dataset could be:
DataNDVI_split <- data.frame("21feb1987_NDVI" = c(0.123, NA, 0.192, 0.234, NA), "21feb1987_PixelQa" = c(66.30, NA, 66.00, 79.87, NA), "18jul1987_NDVI" = c(0.223, NA, 0.230, 0.334, NA), "18jul1987_PixelQa" = c(66.30, NA, 66.00, 79.87, NA), stringsAsFactors = FALSE)
DataNDVI_split
  X21feb1987_NDVI X21feb1987_PixelQa X18jul1987_NDVI X18jul1987_PixelQa
1 0.123 66.30 0.223 66.30
2 NA NA NA NA
3 0.192 66.00 0.230 66.00
4 0.234 79.87 0.334 79.87
5 NA NA NA NA
And "clean" it should look like:
DataNDVI_split <- data.frame("21feb1987_NDVI" = c(0.123, NA, 0.192, NA, NA), "21feb1987_PixelQa" = c(66.30, NA, 66.00, 79.87, NA), "18jul1987_NDVI" = c(0.223, NA, 0.230, NA, NA), "18jul1987_PixelQa" = c(66.30, NA, 66.00, 79.87, NA), stringsAsFactors = FALSE)
DataNDVI_split
  X21feb1987_NDVI X21feb1987_PixelQa X18jul1987_NDVI X18jul1987_PixelQa
1 0.123 66.30 0.223 66.30
2 NA NA NA NA
3 0.192 66.00 0.230 66.00
4 NA 79.87 NA 79.87
5 NA NA NA NA
Here's a tentative solution.
First, I'd split up the data into two separate dataframes, thus:
df_ndvi <- DataNDVI[grepl("NDVI", DataNDVI$Data), ]
df_ndvi
Data X21feb1987 X18jul1987
1 NDVI1 0.123 0.223
2 NDVI2 NA NA
3 NDVI3 0.192 0.230
4 NDVI4 0.234 0.334
5 NDVI5 NA NA
df_pixel <- DataNDVI[!grepl("NDVI", DataNDVI$Data), ]
df_pixel
Data X21feb1987 X18jul1987
6 PixelQa1 66.30 66.00
7 PixelQa2 NA NA
8 PixelQa3 66.00 124.23
9 PixelQa4 79.87 86.00
10 PixelQa5 NA NA
To perform the desired changes, there are many possible ways. One way is to use a for loop over all the columns in df_ndvi (except the first!) with an ifelse() call that checks whether the quality conditions hold and defines the action to take in either case:
for(i in 2:3){
  df_ndvi[,i] <- ifelse((df_pixel[,i] < 65.5 | df_pixel[,i] > 66.5) &
                        (df_pixel[,i] < 129.5 | df_pixel[,i] > 130.5), NA, df_ndvi[,i])
}
This results in these corrections in df_ndvi:
df_ndvi
Data X21feb1987 X18jul1987
1 NDVI1 0.123 0.223
2 NDVI2 NA NA
3 NDVI3 0.192 NA
4 NDVI4 NA NA
5 NDVI5 NA NA
EDIT:
If you prefer to split-up the data in this way:
DataNDVI_split <- data.frame("21feb1987_NDVI" = c(0.123, NA, 0.192, 0.234, NA), "21feb1987_PixelQa" = c(66.30, NA, 66.00, 79.87, NA), "18jul1987_NDVI" = c(0.223, NA, 0.230, 0.334, NA), "18jul1987_PixelQa" = c(66.30, NA, 66.00, 79.87, NA), stringsAsFactors = FALSE)
DataNDVI_split
  X21feb1987_NDVI X21feb1987_PixelQa X18jul1987_NDVI X18jul1987_PixelQa
1 0.123 66.30 0.223 66.30
2 NA NA NA NA
3 0.192 66.00 0.230 66.00
4 0.234 79.87 0.334 79.87
5 NA NA NA NA
then the for loop could be adapted thus:
for(i in c(1,3)){
  DataNDVI_split[,i] <- ifelse((DataNDVI_split[,i+1] < 65.5 | DataNDVI_split[,i+1] > 66.5) &
                               (DataNDVI_split[,i+1] < 129.5 | DataNDVI_split[,i+1] > 130.5), NA, DataNDVI_split[,i])
}
The result is this:
DataNDVI_split
  X21feb1987_NDVI X21feb1987_PixelQa X18jul1987_NDVI X18jul1987_PixelQa
1 0.123 66.30 0.223 66.30
2 NA NA NA NA
3 0.192 66.00 0.230 66.00
4 NA 79.87 NA 79.87
5 NA NA NA NA
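For completeness, the loop can also be replaced by a logical mask over whole columns. This sketch assumes the wide layout above (NDVI in columns 1 and 3, with the matching PixelQa immediately to the right) and checks both quality windows:

```r
DataNDVI_split <- data.frame(
  "21feb1987_NDVI"    = c(0.123, NA, 0.192, 0.234, NA),
  "21feb1987_PixelQa" = c(66.30, NA, 66.00, 79.87, NA),
  "18jul1987_NDVI"    = c(0.223, NA, 0.230, 0.334, NA),
  "18jul1987_PixelQa" = c(66.30, NA, 66.00, 79.87, NA),
  stringsAsFactors = FALSE
)

qa   <- as.matrix(DataNDVI_split[, c(2, 4)])
ndvi <- as.matrix(DataNDVI_split[, c(1, 3)])

# Keep NDVI only where PixelQa falls within 66 +/- 0.5 or 130 +/- 0.5;
# an NA quality value also wipes the corresponding NDVI value
good <- (abs(qa - 66) <= 0.5) | (abs(qa - 130) <= 0.5)
ndvi[!good | is.na(good)] <- NA
DataNDVI_split[, c(1, 3)] <- ndvi
```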
I need to replace NAs with the mean of previous three values, by group.
Once an NA is replaced, it will serve as input for computing the mean corresponding to the next NA (if next NA is within the next three months).
Here it is an example:
id date value
1 2017-04-01 40
1 2017-05-01 40
1 2017-06-01 10
1 2017-07-01 NA
1 2017-08-01 NA
2 2014-01-01 27
2 2014-02-01 13
Data:
dt <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), date = structure(c(17257, 17287, 17318, 17348, 17379, 16071, 16102), class = "Date"), value = c(40, 40, 10, NA, NA, 27, 13)), row.names = c(1L, 2L, 3L, 4L, 5L, 8L, 9L), class = "data.frame")
The output should look like:
id date value
1 2017-04-01 40.00
1 2017-05-01 40.00
1 2017-06-01 10.00
1 2017-07-01 30.00
1 2017-08-01 26.66
2 2014-01-01 27.00
2 2014-02-01 13.00
where 26.66 = (30 + 10 + 40)/3
What is an efficient way to do this (i.e. to avoid for loops)?
The following uses base R only and does what you need.
sp <- split(dt, dt$id)
sp <- lapply(sp, function(DF){
for(i in which(is.na(DF$value))){
tmp <- DF[seq_len(i - 1), ]
DF$value[i] <- mean(tail(tmp$value, 3))
}
DF
})
result <- do.call(rbind, sp)
row.names(result) <- NULL
result
# id date value
#1  1 2017-04-01 40.00000
#2  1 2017-05-01 40.00000
#3  1 2017-06-01 10.00000
#4  1 2017-07-01 30.00000
#5  1 2017-08-01 26.66667
#6  2 2014-01-01 27.00000
#7  2 2014-02-01 13.00000
Define a roll function that takes a list of up to three previous values plus the current value. It returns a list of the previous two values together with the current value if the current value is not NA, or together with the mean of the previous values if it is NA. Use that with Reduce, accumulating the intermediate results, and pick off the last element of each list in the result. Then apply all of that to each group using ave.
roll <- function(prev, cur) {
prev <- unlist(prev)
list(tail(prev, 2), if (is.na(cur)) mean(prev) else cur)
}
reduce_roll <- function(x) {
sapply(Reduce(roll, x[-1], init = x[1], accumulate = TRUE), tail, 1)
}
transform(dt, value = ave(value, id, FUN = reduce_roll))
giving:
id date value
1 1 2017-04-01 40
2 1 2017-05-01 40
3 1 2017-06-01 10
4 1 2017-07-01 30
5 1 2017-08-01 26.66667
8 2 2014-01-01 27
9 2 2014-02-01 13
I would like to compute the spatial average over a region of data that I define, by defining a longitude/latitude gridbox.
The data I have is ECMWF Sea-ice data, so it's spatio-temporal data for each .75x.75 lon/lat coordinate over the whole Northern Hemisphere. I've changed the data from NetCDF format into an R dataframe, so the head(var.df) looks like this with columns: Date, longitude, latitude, value
date_time lon lat ci
1 2016-01-01 18:00:00 0 87.75 1
2 2016-01-02 18:00:00 0 87.75 1
3 2016-01-03 18:00:00 0 87.75 1
4 2016-01-04 18:00:00 0 87.75 1
5 2016-01-05 18:00:00 0 87.75 1
6 2016-01-06 18:00:00 0 87.75 1
There is therefore a value for each lon/lat coordinate across the northern hemisphere (df is ordered by date, rather than lon for some reason).
How would I extract the spatial area that I want i.e.
BK <- subset(var.df, lon >= 30 & lon <= 105 & lat >= 70 & lat <= 80)
and then average all the values that fall within that area, for each timestep (day)? So I'd have the mean of a gridbox that I define.
Thanks in advance, I hope this wasn't phrased terribly.
Update
Using GGamba's suggested code below, I got the following output, with multiple values for the same day so it hadn't averaged the whole region by timeslice.
date_time binlat binlon mean
<dttm> <fctr> <fctr> <dbl>
1 2016-01-01 18:00:00 [80,90) [0,10) 0.4200042
2 2016-01-01 18:00:00 [80,90) [10,20) 0.4503899
3 2016-01-01 18:00:00 [80,90) [20,30) 0.5614429
4 2016-01-01 18:00:00 [80,90) [30,40) 0.6118528
5 2016-01-01 18:00:00 [80,90) [40,50) 0.5809092
6 2016-01-01 18:00:00 [80,90) [50,60) 0.5617919
7 2016-01-01 18:00:00 [80,90) [60,70) 0.6071370
8 2016-01-01 18:00:00 [80,90) [70,80) 0.6011818
9 2016-01-01 18:00:00 [80,90) [80,90] 0.5442770
10 2016-01-01 18:00:00 [80,90) NA 0.4120862
# ... with 610 more rows
I also had to add na.rm = TRUE to the mean() function at the end, as the averages were NA.
Using dplyr we can do:
library(dplyr)
df %>%
mutate(binlon = cut(lon, seq(from = min(lon), to = max(lon), by = .75), include.lowest = T, right = F),
binlat = cut(lat, seq(from = min(lat), to = max(lat), by = .75), include.lowest = T, right = F)) %>%
group_by(date_time, binlat, binlon) %>%
summarise(mean = mean(ci))
Data:
structure(list(date_time = structure(1:6, .Label = c("2016-01-01 18:00:00",
"2016-01-02 18:00:00", "2016-01-03 18:00:00", "2016-01-04 18:00:00",
"2016-01-05 18:00:00", "2016-01-06 18:00:00"), class = "factor"),
lon = c(0L, 0L, 0L, 0L, 0L, 90L), lat = c(0, 87.75, 87.75,
87.75, 87.75, 90), ci = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("date_time",
"lon", "lat", "ci"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Results:
# date_time binlat binlon mean
# <fctr> <fctr> <fctr> <dbl>
# 1 2016-01-01 18:00:00 [0,0.75) [0,0.75) 1
# 2 2016-01-02 18:00:00 [87.8,88.5) [0,0.75) 1
# 3 2016-01-03 18:00:00 [87.8,88.5) [0,0.75) 1
# 4 2016-01-04 18:00:00 [87.8,88.5) [0,0.75) 1
# 5 2016-01-05 18:00:00 [87.8,88.5) [0,0.75) 1
# 6 2016-01-06 18:00:00 [89.2,90] [89.2,90] 1
This creates two new columns, binning lat and lon into the intervals defined in the cut calls. The data are then grouped by date_time and the two bin columns, and the mean of ci is calculated within each group. Of course, you should adapt the cut breaks to suit your needs.
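Following up on the update: if the goal is a single regional mean per timestep over one fixed box (as in the question's subset call), rather than per-bin means, filter the box first and then aggregate by date only. A base R sketch with made-up toy values in the question's shape:

```r
# Toy stand-in for var.df; the lon/lat/ci values are illustrative only
var.df <- data.frame(
  date_time = rep(c("2016-01-01 18:00:00", "2016-01-02 18:00:00"), each = 3),
  lon = c(30, 60, 200, 30, 60, 200),
  lat = c(75, 75, 75, 75, 75, 75),
  ci  = c(1, 0.5, 0.2, 0.8, 0.4, 0.1)
)

# Fixed box: lon in [30, 105], lat in [70, 80]
BK <- subset(var.df, lon >= 30 & lon <= 105 & lat >= 70 & lat <= 80)

# One mean per timestep over the whole box
daily <- aggregate(ci ~ date_time, data = BK, FUN = mean, na.rm = TRUE)
```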
I have a data frame like this.
date X1 X2
1: 2001-12-31 96.32 NA
2: 2002-01-29 NA 100.7
3: 2002-01-31 96.59 NA
4: 2002-02-28 96.67 100.7
5: 2002-03-29 NA 100.7
6: 2002-03-31 97.36 NA
7: 2002-04-29 NA 87.3
8: 2002-04-30 97.72 NA
9: 2002-05-29 NA 87.3
10: 2002-05-31 97.60 NA
I have some values with different dates and I would like to align them to month end, so I would like to use X1 as a "base" and align the X2 values to the same month-end dates. The end product would be a clean data frame without NAs and with matching dates.
Expected output:
date X1 X2
1: 2001-12-31 96.32 NA
2: 2002-01-31 96.59 100.7
3: 2002-02-28 96.67 100.7
4: 2002-03-31 97.36 100.7
5: 2002-04-30 97.72 87.3
6: 2002-05-31 97.60 87.3
Data
df <- structure(list(date = structure(c(11687L, 11716L, 11718L, 11746L,
11775L, 11777L, 11806L, 11807L, 11836L, 11838L), class = "Date"),
X1 = c(96.32, NA, 96.59, 96.67, NA, 97.36, NA, 97.72, NA,
97.6), X2 = c(NA, 100.7, NA, 100.7, 100.7, NA, 87.3, NA,
87.3, NA)), .Names = c("date", "X1", "X2"), row.names = c(NA,
10L), class = "data.frame")
We could try the following using data.table.
library(data.table)
setDT(df)[,month := month(date)][,lapply(.SD, max, na.rm = TRUE), by = month]
# month date X1 X2
#1: 12 2001-12-31 96.32 -Inf
#2: 1 2002-01-31 96.59 100.7
#3: 2 2002-02-28 96.67 100.7
#4: 3 2002-03-31 97.36 100.7
#5: 4 2002-04-30 97.72 87.3
#6: 5 2002-05-31 97.60 87.3
A new variable month has been created for grouping purposes (and to keep the original date column); you can always get rid of it afterwards if it is not needed. With data spanning more than one year, group on format(date, "%Y-%m") instead of month() so that the same calendar month from different years is not merged.
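One caveat with this approach: max(..., na.rm = TRUE) over a group whose values are all NA returns -Inf with a warning, which is visible in the December X2 cell above. A small hypothetical helper restores NA in that case and can be passed to lapply(.SD, ...) in place of max:

```r
# Return NA (of the input's type) instead of -Inf for all-NA groups
max_or_na <- function(x) {
  if (all(is.na(x))) x[NA_integer_] else max(x, na.rm = TRUE)
}

max_or_na(c(NA, NA))
# [1] NA
max_or_na(c(NA, 100.7))
# [1] 100.7
```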