My dataset contains NDVI values and NDVI quality-descriptor values (PixelQa) for different areas on different dates. I basically want to erase (set to NA) the NDVI values that are associated with a bad quality descriptor (PixelQa). The number suffix of the column names links the two: PixelQa_1 relates to NDVI_1, and so on.
Therefore, to "clean" my data I have to check each PixelQa value to decide whether its related NDVI value has to be changed. There are 3 possible situations (sketched as a small helper function after this list):
PixelQa is NA -> then NDVI should also be NA.
PixelQa is 66±0.5 OR 130±0.5 -> then NDVI keeps its value.
PixelQa is anything else (outside both ranges) -> then NDVI is set to NA (this is bad quality data which needs to be ignored).
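For reference, these three rules can be written as one vectorized helper (a sketch; clean_ndvi is a hypothetical name, not something I already have):
# Sketch of the three rules: keep NDVI only where PixelQa is within
# 66 +/- 0.5 or 130 +/- 0.5; an NA in PixelQa propagates to an NA in NDVI
clean_ndvi <- function(ndvi, qa) {
  keep <- !is.na(qa) & (abs(qa - 66) <= 0.5 | abs(qa - 130) <= 0.5)
  ifelse(keep, ndvi, NA)
}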
My dataset could be:
DataNDVI_split <- data.frame("21feb1987_NDVI" = c(0.123, NA, 0.192, 0.234, NA),
                             "21feb1987_PixelQa" = c(66.30, NA, 66.00, 79.87, NA),
                             "18jul1987_NDVI" = c(0.223, NA, 0.230, 0.334, NA),
                             "18jul1987_PixelQa" = c(66.30, NA, 66.00, 79.87, NA),
                             stringsAsFactors = FALSE)
DataNDVI_split
X21feb1987_NDVI X21feb1987_PixelQa X18jul1987_NDVI X18jul1987_PixelQa
1 0.123 66.30 0.223 66.30
2 NA NA NA NA
3 0.192 66.00 0.230 66.00
4 0.234 79.87 0.334 79.87
5 NA NA NA NA
And "clean" it should look like:
DataNDVI_split
X21feb1987_NDVI X21feb1987_PixelQa X18jul1987_NDVI X18jul1987_PixelQa
1 0.123 66.30 0.223 66.30
2 NA NA NA NA
3 0.192 66.00 0.230 66.00
4 NA 79.87 NA 79.87
5 NA NA NA NA
Here's a tentative solution.
First, I'd split the data into two separate data frames (assuming the original DataNDVI is in long format with a Data column, as the output below shows):
df_ndvi <- DataNDVI[grepl("NDVI", DataNDVI$Data), ]
df_ndvi
Data X21feb1987 X18jul1987
1 NDVI1 0.123 0.223
2 NDVI2 NA NA
3 NDVI3 0.192 0.230
4 NDVI4 0.234 0.334
5 NDVI5 NA NA
df_pixel <- DataNDVI[!grepl("NDVI", DataNDVI$Data), ]
df_pixel
Data X21feb1987 X18jul1987
6 PixelQa1 66.30 66.00
7 PixelQa2 NA NA
8 PixelQa3 66.00 124.23
9 PixelQa4 79.87 86.00
10 PixelQa5 NA NA
To perform the desired changes there are many possible ways. One way is to use a for loop over all the columns of df_ndvi (except the first!) and an ifelse() that tests whether the quality conditions hold and sets the NDVI value accordingly:
for (i in 2:3) {
  # keep NDVI only where PixelQa is within 66 +/- 0.5 or 130 +/- 0.5;
  # an NA in PixelQa propagates to an NA in NDVI
  keep <- abs(df_pixel[, i] - 66) <= 0.5 | abs(df_pixel[, i] - 130) <= 0.5
  df_ndvi[, i] <- ifelse(keep, df_ndvi[, i], NA)
}
This results in these corrections in df_ndvi:
df_ndvi
Data X21feb1987 X18jul1987
1 NDVI1 0.123 0.223
2 NDVI2 NA NA
3 NDVI3 0.192 NA
4 NDVI4 NA NA
5 NDVI5 NA NA
EDIT:
If you prefer to split up the data in this way:
DataNDVI_split <- data.frame("21feb1987_NDVI" = c(0.123, NA, 0.192, 0.234, NA),
                             "21feb1987_PixelQa" = c(66.30, NA, 66.00, 79.87, NA),
                             "18jul1987_NDVI" = c(0.223, NA, 0.230, 0.334, NA),
                             "18jul1987_PixelQa" = c(66.30, NA, 66.00, 79.87, NA),
                             stringsAsFactors = FALSE)
DataNDVI_split
X21feb1987_NDVI X21feb1987_PixelQa X18jul1987_NDVI X18jul1987_PixelQa
1 0.123 66.30 0.223 66.30
2 NA NA NA NA
3 0.192 66.00 0.230 66.00
4 0.234 79.87 0.334 79.87
5 NA NA NA NA
then the for loop could be adapted thus:
for (i in c(1, 3)) {
  qa <- DataNDVI_split[, i + 1]
  keep <- abs(qa - 66) <= 0.5 | abs(qa - 130) <= 0.5
  DataNDVI_split[, i] <- ifelse(keep, DataNDVI_split[, i], NA)
}
The result is this:
DataNDVI_split
X21feb1987_NDVI X21feb1987_PixelQa X18jul1987_NDVI X18jul1987_PixelQa
1 0.123 66.30 0.223 66.30
2 NA NA NA NA
3 0.192 66.00 0.230 66.00
4 NA 79.87 NA 79.87
5 NA NA NA NA
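With more dates the column indices needn't be hard-coded; a sketch, assuming the columns strictly alternate NDVI, PixelQa, NDVI, PixelQa, ...:
# Sketch: loop over every NDVI/PixelQa pair, whatever the number of dates
for (i in seq(1, ncol(DataNDVI_split), by = 2)) {
  qa <- DataNDVI_split[, i + 1]
  keep <- abs(qa - 66) <= 0.5 | abs(qa - 130) <= 0.5
  DataNDVI_split[, i] <- ifelse(keep, DataNDVI_split[, i], NA)
}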
I have this NDVI timeseries dataset where the first column is dates and the next three are NDVI data for three different IDs (59231, 158157, 282302):
Date X59231 X158157 X282302
1 13149 NA 0.398 NA
2 13157 0.344 0.267 0.327
3 13165 NA 0.431 NA
. ..... ..... ..... .....
Here's the dput:
structure(list(Date = c(13149L, 13157L, 13165L, 13173L, 13181L,
13189L, 13197L, 13205L, 13213L, 13221L, 13229L, 13237L, 13245L,
13253L, 13261L, 13269L, 13277L, 13285L, 13293L, 13301L, 13309L,
13317L, 13325L, 13333L, 13341L, 13349L, 13357L, 13365L, 13373L,
13381L, 13389L, 13397L, 13405L, 13413L, 13421L, 13429L, 13437L,
13445L, 13453L, 13461L, 13469L, 13477L, 13485L, 13493L, 13501L,
13509L), X59231 = c(NA, 0.344, NA, 0.398, NA, 0.587, NA, NA,
0.451, 0.597, 0.593, 0.556, 0.559, 0.375, 0.374, 0.386, 0.425,
0.383, 0.349, 0.315, 0.282, 0.323, 0.315, 0.359, 0.292, 0.271,
0.297, 0.307, 0.322, 0.344, 0.297, 0.285, 0.273, 0.282, 0.281,
0.304, 0.314, NA, 0.391, 0.601, 0.65, NA, 0.653, 0.666, 0.519,
0.625), X158157 = c(0.398, 0.267, 0.431, NA, 0.36, 0.434, 0.434,
0.465, 0.447, 0.521, 0.539, 0.563, 0.595, 0.541, 0.553, 0.381,
0.533, 0.505, 0.551, NA, 0.546, 0.535, 0.523, 0.501, 0.508, 0.51,
0.506, 0.51, 0.514, 0.526, 0.555, 0.545, 0.53, 0.539, 0.531,
0.53, NA, 0.585, 0.597, 0.32, 0.569, 0.601, NA, NA, 0.52, 0.532
), X282302 = c(NA, 0.327, NA, 0.282, 0.26, 0.293, 0.25, 0.288,
0.336, 0.299, 0.29, 0.28, NA, 0.305, 0.319, NA, 0.255, 0.292,
0.294, NA, NA, 0.367, 0.331, 0.344, 0.283, 0.284, 0.291, 0.273,
0.239, 0.285, 0.249, 0.285, 0.247, 0.288, 0.276, NA, 0.317, 0.375,
0.38, 0.417, 0.374, 0.491, NA, NA, NA, 0.471)), class = "data.frame", row.names = c(NA,
-46L))
I run the following code to smooth the timeseries (get rid of noise) and find the multiple maxs and mins for each ID's NDVI timeseries.
rm(list=ls())
#Read in csv data
df=read.csv("Data.csv", header = TRUE)
date_col = df[,1]
num_cols = length(df[1,]) #count number of columns there are
num_Dcols = num_cols-1 #count the number of columns there are minus the index (first) column
#Function to append columns to a dataframe
cbind.fill <- function(...){
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x)))))
}
#Create an empty data frame
finalDF = data.frame(matrix(ncol=(0),nrow=0)) #create empty dataframe
#Create an empty vector for column names
CNames = c()
for (i in c(1:num_Dcols)){
df_sub = df[,c(1,i+1)] #create a data frame of the date column and the i+1 column
df_removeNA = na.omit(df_sub)
#Append the date column to the final data frame
df_date = df_removeNA[,1]
finalDF = cbind.fill(finalDF, df_date)
#Append the NDVI timeseries column to the final data frame
df_data = df_removeNA[,2]
finalDF = cbind.fill(finalDF, df_data)
stl_1=stl(ts(df_data, frequency=4), "periodic")
#Function to calculate all the maximums
ts_max<-function(signal)
{
points_max=which(diff(sign(diff(signal)))==-2)+1
return(points_max)
}
#Function to calculate all the minimums
ts_min<-function(signal)
{
points_min=which(diff(sign(diff(-signal)))==-2)+1
return(points_min)
}
#Smooth the timeseries
trend_1=as.numeric(stl_1$time.series[,2])
#Find max and mins of the smoothed timeseries
max_1=ts_max(trend_1)
min_1=ts_min(trend_1)
#Append max and mins to the final data frame
finalDF = cbind.fill(finalDF, df_data[max_1])
finalDF = cbind.fill(finalDF, df_data[min_1])
#Append column names to the column names vector
CNames = c(CNames, toString(colnames(df_sub[1])))
CNames = c(CNames, toString(colnames(df_sub[2])))
CNames = c(CNames, paste(c(toString(colnames(df_sub[2])), "_Max"), collapse=''))
CNames = c(CNames, paste(c(toString(colnames(df_sub[2])), "_Min"), collapse=''))
#Plot final results
plot(df_date, trend_1, type = 'l')
abline(v=df_date[max_1], col="red")
abline(v=df_date[min_1], col="blue")
}
#Rename final data frame's column names
colnames(finalDF) = CNames
#Export final data frame to CSV
write.csv(finalDF, file = "finalDF_smooth.csv")
Here's an image of all the maxs and mins for the first column of NDVI timeseries data.
What I'm trying to figure out is how to add two new columns to the original (or a new) data frame next to each ID column, where I can store the maximums and minimums. Each maximum and minimum needs to be placed in the cell that matches its corresponding date. In other words, I need two duplicated columns for each ID column, inserted next to it, with all values replaced by NA except the maximums and minimums calculated in the smoothing code above. For example, this is what I need the final dataframe to look like:
Date 59231 59231_Max 59231_Min 158157 158157_Max 158157_Min 282302 282302_Max 282302_Min
13149 NA NA NA 0.398 NA NA NA NA NA
13157 0.344 NA NA 0.267 NA NA 0.327 NA NA
13165 NA NA NA 0.431 NA NA NA NA NA
13173 0.398 NA NA NA NA NA 0.282 NA NA
13181 NA NA NA 0.360 NA NA 0.260 NA NA
13189 0.587 NA NA 0.434 NA NA 0.293 NA 0.293
13197 NA NA NA 0.434 NA NA 0.25 NA NA
13205 NA NA NA 0.465 NA NA 0.288 NA NA
13213 0.451 NA NA 0.447 NA NA 0.336 NA NA
13221 0.597 NA NA 0.521 NA NA 0.299 0.299 NA
... ... .. .. ... .. .. ... ... ..
This is what it looks like right now.
Date 59231 59231_Max 59231_Min Date 158157 158157_Max 158157_Min Date 282302 282302_Max 282302_Min
13157 0.344 0.593 0.386 13149 0.398 0.595 0.533 13157 0.327 0.299 0.293
13173 0.398 0.425 0.282 13157 0.267 0.546 0.508 13173 0.282 0.331 0.255
13189 0.587 0.315 0.297 13165 0.431 0.545 0.539 13181 0.260 NA 0.285
13213 0.451 0.322 0.273 13181 0.360 0.530 0.320 13189 0.293 NA NA
13221 0.597 0.653 NA 13189 0.434 NA NA 13197 0.250 NA NA
13229 0.593 NA NA 13197 0.434 NA NA 13205 0.288 NA NA
13237 0.556 NA NA 13205 0.465 NA NA 13213 0.336 NA NA
13245 0.559 NA NA 13213 0.447 NA NA 13221 0.299 NA NA
13253 0.375 NA NA 13221 0.521 NA NA 13229 0.290 NA NA
13261 0.374 NA NA 13229 0.539 NA NA 13237 0.280 NA NA
..... ... .. .. ..... ..... .. .. ..... ..... ... ..
Note: I had to omit NAs during each loop, so the code produces a CSV file with a different subset date column for each ID. I would love to have just one date column, as in the ideal table above.
In my code I started to create a new data frame, appending columns after each loop, but I can't figure out how to match up the maxs and mins with the right cells. Right now all the maxs and mins are stacked at the top of their columns. Any ideas? Thanks.
How about this? It adds the min and max columns.
df
df$max <- apply(df[2:4], 1, max, na.rm = TRUE)
df$min <- apply(df[2:4], 1, min, na.rm = TRUE)
head(df)
Which produces:
ID X59231 X158157 X282302 max min
1 13149 NA 0.398 NA 0.398 0.398
2 13157 0.344 0.267 0.327 0.344 0.267
3 13165 NA 0.431 NA 0.431 0.431
4 13173 0.398 NA 0.282 0.398 0.282
5 13181 NA 0.360 0.260 0.360 0.260
6 13189 0.587 0.434 0.293 0.587 0.293
I have added this based on the clarification that you have provided. You can ignore the bit above:
This will produce what you want. I have only done it for the first column, but you can just change the variables to get the other columns.
library(dplyr)
df2 <- as_tibble(df)
df2 <- df2 %>%
  mutate(X59231_min = min(X59231, na.rm = TRUE)) %>%
  mutate(X59231_min = ifelse(X59231 == X59231_min, X59231_min, NA)) %>%
  mutate(X59231_max = max(X59231, na.rm = TRUE)) %>%
  mutate(X59231_max = ifelse(X59231 == X59231_max, X59231_max, NA))
So:
df2 %>% filter(!is.na(X59231_min))
gives us:
# A tibble: 1 x 6
ID X59231 X158157 X282302 X59231_min X59231_max
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13349 0.271 0.51 0.284 0.271 NA
And:
df2 %>% filter(!is.na(X59231_max))
Shows:
# A tibble: 1 x 6
ID X59231 X158157 X282302 X59231_min X59231_max
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13493 0.666 NA NA NA 0.666
You should be able to do it for the other columns.
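If you would rather not repeat this per column, here is a hedged sketch that applies the same marking logic to every ID column at once, assuming dplyr >= 1.0 (for across()) and that the first column is named Date as in your dput:
library(dplyr)
# Sketch: mark the global max/min of each ID column in one mutate()
df2 <- df %>%
  mutate(across(
    -Date,
    list(Max = ~ ifelse(.x == max(.x, na.rm = TRUE), .x, NA),
         Min = ~ ifelse(.x == min(.x, na.rm = TRUE), .x, NA)),
    .names = "{.col}_{.fn}"
  ))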
df <- data.frame(
  Time = c("7/16/2017 18:46", "7/16/2017 21:52", "7/16/2017 23:16",
           "7/17/2017 4:03", "7/17/2017 5:13", "7/17/2017 5:27",
           "7/17/2017 18:57", "7/17/2017 19:25", "7/17/2017 23:58",
           "7/18/2017 2:59", "7/18/2017 3:27", "7/18/2017 3:59"),
  Flux = c(NA, NA, 4.51263406, NA, NA, 2.291454049, NA, 4.568703192,
           NA, NA, 3.392520428, NA),
  int = c(403.5413091, 421.5796345, NA, 410.0796897, NA, NA,
          363.5271212, NA, NA, 398.9564539, NA, NA),
  corr = c(422.745436, 447.6726631, NA, 420.4392183, NA, NA,
           408.7056493, NA, NA, 421.8799971, NA, NA),
  dat = c(NA, NA, NA, NA, 2.316481462, NA, NA, NA, 7.11779784,
          NA, NA, 2.953349661)
)
df$Time <- as.POSIXct(strptime(df$Time, format = "%m/%d/%Y %H:%M"))
which will look like...
Time Flux int corr dat
7/16/2017 18:46 NA 403.5413091 422.745436 NA
7/16/2017 21:52 NA 421.5796345 447.6726631 NA
7/16/2017 23:16 4.51263406 NA NA NA
7/17/2017 4:03 NA 410.0796897 420.4392183 NA
7/17/2017 5:13 NA NA NA 2.316481462
7/17/2017 5:27 2.291454049 NA NA NA
7/17/2017 18:57 NA 363.5271212 408.7056493 NA
7/17/2017 19:25 4.568703192 NA NA NA
7/17/2017 23:58 NA NA NA 7.11779784
7/18/2017 2:59 NA 398.9564539 421.8799971 NA
7/18/2017 3:27 3.392520428 NA NA NA
7/18/2017 3:59 NA NA NA 2.953349661
I have four columns (1 time column, 3 continuous data columns) with many NA values in each. I want to interpolate to fill the NAs in all columns. Since I don't know which interpolation method I need, I would like to try many interpolation methods (linear, spline, etc.). I tried na.approx but it didn't work.
Any help?
If you want to try and compare several interpolation methods as stated, you can use the na.interpolation() function from the imputeTS package.
For linear interpolation:
library("imputeTS")
na.interpolation(df, option = "linear")
For spline interpolation:
library("imputeTS")
na.interpolation(df, option = "spline")
For stineman interpolation:
library("imputeTS")
na.interpolation(df, option = "stine")
So as you can see, you just have to adapt the option parameter.
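Applied to this data, a minimal sketch (assuming the Time column should be left untouched; note that newer imputeTS versions rename the function to na_interpolation):
library(imputeTS)
# Sketch: interpolate every numeric column, leaving the Time column as-is
df[-1] <- lapply(df[-1], na.interpolation, option = "linear")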
df <- fill(df, everything(), .direction = "down")
But I don't know which technique it uses to fill the NAs. (For the record, tidyr::fill() does not interpolate; it simply carries the last non-missing observation forward.)
I want to push up (metaphorically) the dataframe in order to get rid of the gaps (NA values).
My Data:
> dput(df1)
structure(list(ID = c("CN1-1", "CN1-1", "CN1-1", "CN1-10", "CN1-10",
"CN1-10", "CN1-11", "CN1-11", "CN1-11", "CN1-12", "CN1-12", "CN1-12",
"CN1-13", "CN1-13", "CN1-13"), v1 = c(0.37673, NA, NA, 1.019972,
NA, NA, 0.515152, NA, NA, 0.375139, NA, NA, 0.508125, NA, NA),
v2 = c(NA, 0.732, NA, NA, 0, NA, NA, 0.748, NA, NA, 0.466,
NA, NA, 0.57, NA), v2 = c(NA, NA, 0.357, NA, NA, 0.816, NA,
NA, 0.519, NA, NA, 0.206, NA, NA, 0.464)), .Names = c("ID",
"v1", "v2", "v2"), row.names = c(NA, 15L), class = "data.frame")
Looks like:
ID v1 v2 v2
1 CN1-1 0.376730 NA NA
2 CN1-1 NA 0.732 NA
3 CN1-1 NA NA 0.357
4 CN1-10 1.019972 NA NA
5 CN1-10 NA 0.000 NA
6 CN1-10 NA NA 0.816
7 CN1-11 0.515152 NA NA
8 CN1-11 NA 0.748 NA
9 CN1-11 NA NA 0.519
10 CN1-12 0.375139 NA NA
11 CN1-12 NA 0.466 NA
12 CN1-12 NA NA 0.206
13 CN1-13 0.508125 NA NA
14 CN1-13 NA 0.570 NA
15 CN1-13 NA NA 0.464
Please note: I'm not sure the pattern is consistent across all rows. It could also be that one or more variables appear 2+ times per ID group.
Desired output:
ID v1 v2 v2
1 CN1-1 0.376730 0.732 0.357
2 CN1-10 1.019972 0.000 0.816
...
My idea was to melt, then drop all NA values, and then dcast. Any better approach?
EDIT:
Duplicates could look like this:
16 CN1-x 0.508125 NA NA
17 CN1-x NA 0.570 NA
18 CN1-x NA NA 0.464
19 CN1-x NA NA 0.134
do.call(rbind,
lapply(split(df1, df1$ID), function(a)
data.frame(ID = a$ID[1], lapply(a[-1], sum, na.rm = TRUE))))
# ID v1 v2 v2.1
#CN1-1 CN1-1 0.376730 0.732 0.357
#CN1-10 CN1-10 1.019972 0.000 0.816
#CN1-11 CN1-11 0.515152 0.748 0.519
#CN1-12 CN1-12 0.375139 0.466 0.206
#CN1-13 CN1-13 0.508125 0.570 0.464
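A base-R alternative with aggregate() (a sketch; note that, just like the sum() above, a variable measured 2+ times within an ID would be summed rather than kept separate):
# Sketch: collapse each ID group by summing the non-NA entries per column
aggregate(df1[-1], by = list(ID = df1$ID), FUN = sum, na.rm = TRUE)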
I have a data frame Depth which consists of LON and LAT coordinates with corresponding depth temperature data. For each coordinate (LON, LAT) I would like to pull out the last recorded depth value into a new data frame:
> Depth<-read.csv('depthdata.csv')
> head(Depth)
LAT LON X150 X175 X200 X225 X250 X275 X300 X325 X350 X375 X400 X425 X450
1 -78.375 -163.875 -1.167 -1.0 NA NA NA NA NA NA NA NA NA NA NA
2 -78.125 -168.875 -1.379 -1.3 -1.259 -1.6 -1.476 -1.374 -1.507 NA NA NA NA NA NA
3 -78.125 -167.625 -1.700 -1.7 -1.700 -1.7 NA NA NA NA NA NA NA NA NA
4 -78.125 -167.375 -2.100 -2.2 -2.400 -2.3 -2.200 NA NA NA NA NA NA NA NA
5 -78.125 -167.125 -1.600 -1.6 -1.600 -1.6 NA NA NA NA NA NA NA NA NA
6 -78.125 -166.875 NA NA NA NA NA NA NA NA NA NA NA NA NA
so that I will have this;
LAT LON LAST
-78.375 -163.875 -1
-78.125 -168.875 -1.507
-78.125 -167.625 -1.7
-78.125 -167.375 -2.2
-78.125 -167.125 -1.6
-78.125 -166.875 NA
I tried the tail() function but didn't get the desired result.
As I understand it, you want the last non-NA value in each row, for all columns except the first two.
We can use max.col() along with is.na() with our relevant columns to get us the column number for the last non-NA value. 2 is added (shown by + 2L) to compensate for the removal of the first two columns (shown by [-(1:2)]).
idx <- max.col(!is.na(Depth[-(1:2)]), ties.method = "last") + 2L
We can use idx in cbind() to create an index matrix for retrieving the values.
Depth[cbind(seq_len(nrow(Depth)), idx)]
# [1] -1.000 -1.507 -1.700 -2.200 -1.600 NA
Bind this together with the first two columns of the original data with cbind() and we're done.
cbind(Depth[1:2], LAST = Depth[cbind(seq_len(nrow(Depth)), idx)])
# LAT LON LAST
# 1 -78.375 -163.875 -1.000
# 2 -78.125 -168.875 -1.507
# 3 -78.125 -167.625 -1.700
# 4 -78.125 -167.375 -2.200
# 5 -78.125 -167.125 -1.600
# 6 -78.125 -166.875 NA
Data:
Depth <- structure(list(LAT = c(-78.375, -78.125, -78.125, -78.125, -78.125,
-78.125), LON = c(-163.875, -168.875, -167.625, -167.375, -167.125,
-166.875), X150 = c(-1.167, -1.379, -1.7, -2.1, -1.6, NA), X175 = c(-1,
-1.3, -1.7, -2.2, -1.6, NA), X200 = c(NA, -1.259, -1.7, -2.4,
-1.6, NA), X225 = c(NA, -1.6, -1.7, -2.3, -1.6, NA), X250 = c(NA,
-1.476, NA, -2.2, NA, NA), X275 = c(NA, -1.374, NA, NA, NA, NA
), X300 = c(NA, -1.507, NA, NA, NA, NA), X325 = c(NA, NA, NA,
NA, NA, NA), X350 = c(NA, NA, NA, NA, NA, NA), X375 = c(NA, NA,
NA, NA, NA, NA), X400 = c(NA, NA, NA, NA, NA, NA), X425 = c(NA,
NA, NA, NA, NA, NA), X450 = c(NA, NA, NA, NA, NA, NA)), .Names = c("LAT",
"LON", "X150", "X175", "X200", "X225", "X250", "X275", "X300",
"X325", "X350", "X375", "X400", "X425", "X450"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
I'm trying to figure out how to do the following without looping. I have a melted dataset of time, study site, and flow that looks like:
datetime site flow
6/1/2009 00:00 EBT NA
6/2/2009 01:00 EBT NA
6/3/2009 02:00 EBT 0.1
6/4/2009 03:00 EBT NA
6/5/2009 04:00 EBT NA
6/1/2009 00:00 MUT 0.4
6/2/2009 01:00 MUT 0.3
6/3/2009 02:00 MUT 0.2
6/4/2009 03:00 MUT NA
6/5/2009 04:00 MUT NA
I need to subset this by site, and then for periods when there are at least two consecutive flow measurements I need to perform a couple of calculations, for example the mean of the current and previous measurement.
The trick is that I need to perform the average on each pair of consecutive measurements: if there are three in a row, then for each of the latter two I need the average of that measurement and the previous one. I've added a goal column to the sample dataframe with the results I'd like to get.
I'd like to end up with a similar looking dataframe with the datetime, site, and result of the calculation. There is a full time series for each site.
Thanks for any help!
data generator:
structure(list(datetime = structure(c(1167627600, 1167717600,
1167807600, 1167897600, 1167987600, 1167627600, 1167717600, 1167807600,
1167897600, 1167987600, 1168077600, 1168167600, 1168257600, 1168347600,
1168437600), class = c("POSIXct", "POSIXt"), tzone = ""), site = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("EBT",
"MUT"), class = "factor"), flow = c(NA, 0.1, NA, NA, NA, NA,
0.4, 0.2, NA, NA, 0.4, 0.2, 0.1, NA, NA), goal = c(NA, NA, NA,
NA, NA, NA, NA, 0.3, NA, NA, NA, 0.3, 0.15, NA, NA)), .Names = c("datetime",
"site", "flow", "goal"), row.names = c(NA, -15L), class = "data.frame")
This will separate your dataframe by site and then filter only rows that have two or more consecutive non-NA values in flow:
by(sample, sample$site, function(d) d[with(rle(!is.na(d$flow)), rep(values & lengths>=2, lengths)),])
You can then work on the function inside to do your calculations as needed.
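To see what the rle() mask does, here is a minimal check on a plain vector (illustration only):
# TRUE marks positions inside runs of 2+ consecutive non-NA values
x <- c(NA, 0.4, 0.2, NA, NA, 0.4, 0.2, 0.1, NA)
with(rle(!is.na(x)), rep(values & lengths >= 2, lengths))
# [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE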
For instance, if you want to add the mean as a new column (assuming you want NA when not defined) you can use this:
f <- function(d)
{
x <- with(rle(!is.na(d$flow)), rep(values & lengths>=2, lengths))
within(d, {avg <- NA; avg[x] <- mean(d[x,"flow"])})
}
b <- by(sample, sample$site, f)
Reduce(rbind, b)
Result:
datetime site flow avg
1 2009-06-01 01:00:00 EBT NA NA
2 2009-06-02 02:00:00 EBT NA NA
3 2009-06-03 03:00:00 EBT 0.1 NA
4 2009-06-04 04:00:00 EBT NA NA
5 2009-06-05 05:00:00 EBT NA NA
6 2009-06-01 01:00:00 MUT 0.4 0.3
7 2009-06-02 02:00:00 MUT 0.3 0.3
8 2009-06-03 03:00:00 MUT 0.2 0.3
9 2009-06-04 04:00:00 MUT NA NA
10 2009-06-05 05:00:00 MUT NA NA
EDIT: To get the mean between the current flow measure and the previous one, you can use this:
f <- function(d)
{
within(d, avg <- (flow+c(NA,head(flow,-1)))/2)
}
Reduce(rbind, by(sample, sample$site, f))
Note that cases with a single measure are automatically set to NA. New result:
datetime site flow goal avg
1 2007-01-01 03:00:00 EBT NA NA NA
2 2007-01-02 04:00:00 EBT 0.1 NA NA
3 2007-01-03 05:00:00 EBT NA NA NA
4 2007-01-04 06:00:00 EBT NA NA NA
5 2007-01-05 07:00:00 EBT NA NA NA
6 2007-01-01 03:00:00 MUT NA NA NA
7 2007-01-02 04:00:00 MUT 0.4 NA NA
8 2007-01-03 05:00:00 MUT 0.2 0.30 0.30
9 2007-01-04 06:00:00 MUT NA NA NA
10 2007-01-05 07:00:00 MUT NA NA NA
11 2007-01-06 08:00:00 MUT 0.4 NA NA
12 2007-01-07 09:00:00 MUT 0.2 0.30 0.30
13 2007-01-08 10:00:00 MUT 0.1 0.15 0.15
14 2007-01-09 11:00:00 MUT NA NA NA
15 2007-01-10 12:00:00 MUT NA NA NA
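The same pairwise mean can also be computed per site without by()/Reduce(), using base ave() (a sketch over the same sample data):
# Sketch: mean of current and previous flow, computed within each site;
# ave() applies the function per group and keeps the original row order
sample$avg <- ave(sample$flow, sample$site,
                  FUN = function(x) (x + c(NA, head(x, -1))) / 2)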
plyr functions are a good way to split a dataframe apart by certain variables, which is what you need to do here.
I thought of two ways to handle intervals on a vector: first with vector multiplication (for the mean of the data), and second with vectorizing a function (for generating the labels). They're both doing pretty much the same thing, though.
library(reshape2)
library(plyr)
library(lubridate)
meanBetween <- function(x){
l <- length(x)
diag(outer(x[1:(l-1)], x[2:l], "+"))/2
}
output <- ddply(sample, .(site), function(df){
df <- df[order(df$datetime, decreasing=FALSE), ]
result <- meanBetween(df$flow)
names(result) <- Reduce(c, (mapply(as.interval,
df$datetime[-1],
df$datetime[1:(length(df$datetime)-1)],
SIMPLIFY=FALSE)))
result
})
melt(output) # to make it look nicer
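For intuition, meanBetween() returns the pairwise means of consecutive elements, e.g.:
meanBetween(c(0.4, 0.2, 0.1))
# [1] 0.30 0.15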