New value with one similar column and one different column in R - r

I need to mutate a new value, "new_value", based on the ID column "ï..record_id": all rows with the same ID should have the same value in "date_eortc".
My data1 looks like:
data1 %>%
select( ï..record_id, dato1, galbeta_date, date_eortc)
> ï..record_id dato1 galbeta_date date_eortc
1 1 <NA> <NA> <NA>
2 1 <NA> <NA> <NA>
3 1 <NA> 2018-01-16 <NA>
.....
99 10 2018-02-07 <NA> 2017-12-27
100 10 <NA> <NA> <NA>
101 10 <NA> <NA> <NA>
102 10 <NA> 2017-12-19 <NA>
103 10 <NA> 2017-12-26 <NA>
104 10 <NA> 2017-12-29 <NA>
105 10 <NA> 2018-01-02 <NA>
106 10 <NA> <NA> <NA>
107 10 <NA> <NA> <NA>
108 11 <NA> <NA> <NA>
In this case, for all rows with "ï..record_id" = 10, date_eortc should be "2017-12-27".
So it would look like:
ï..record_id dato1 galbeta_date date_eortc
99 10 2018-02-07 <NA> 2017-12-27
100 10 <NA> <NA> 2017-12-27
101 10 <NA> <NA> 2017-12-27
102 10 <NA> 2017-12-19 2017-12-27
103 10 <NA> 2017-12-26 2017-12-27
104 10 <NA> 2017-12-29 2017-12-27
105 10 <NA> 2018-01-02 2017-12-27
106 10 <NA> <NA> 2017-12-27
107 10 <NA> <NA> 2017-12-27
108 11 <NA> <NA> <NA>
I have tried to make an ifelse statement, but it's not the right one...
data2 <- data1 %>%
mutate(new_value= ifelse(ï..record_id == ï..record_id , date_eortc, NA))
I hope it makes sense.
Thank you for your time,
Julie

We could group_by the ï..record_id and fill the NA elements in 'date_eortc' with the adjacent non-NA element:
library(dplyr)
library(tidyr)
data1 %>%
  group_by(ï..record_id) %>%
  fill(date_eortc)
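If the non-NA date might not appear in the first row of an ID, a hedged variant (assuming tidyr >= 1.0.0) is to fill in both directions so every row in the group gets the value:
library(dplyr)
library(tidyr)
data1 %>%
  group_by(ï..record_id) %>%
  fill(date_eortc, .direction = "downup") %>%  # fill NAs downward, then upward, within each ID
  ungroup()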

Related

Add date points between separate dates in a dataframe and create blanks (NA) in the other columns where those new rows were created in r

This is what my data looks like:
> dput(head(h01_NDVI_specveg_data_spectra,6))
structure(list(ID = c("h01", "h01", "h01", "h01", "h01", "h01"
), collection_date = structure(c(15076, 15092, 15125, 15139,
15159, 15170), class = "Date"), NDVI = c(0.581769436997319, 0.539445628997868,
0.338541666666667, 0.302713987473904, 0.305882352941176, 0.269439421338155
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
I have scattered, non-consecutive dates, as you can see in the example (e.g. 2011-04-12; 2011-04-28; 2011-05-31...). What I want is to insert the missing dates between the dates that I have and, consequently, create new rows for the other columns, where NDVI would be NA.
Check this example of the desired output:
ID    collection_date   NDVI
h01   2011-04-12        0.5817694
h01   2011-04-13        NA
h01   2011-04-14        NA
h01   2011-04-15        NA
h01   2011-04-16        NA
h01   2011-04-17        NA
h01   2011-04-18        NA
h01   2011-04-19        NA
h01   2011-04-20        NA
h01   2011-04-21        NA
h01   2011-04-22        NA
h01   2011-04-23        NA
h01   2011-04-24        NA
h01   2011-04-25        NA
h01   2011-04-26        NA
h01   2011-04-27        NA
h01   2011-04-28        0.5394456
h01   2011-04-29        NA
h01   2011-04-30        NA
...
Any help will be much appreciated.
df1 <- structure(list(ID = c("h01", "h01", "h01", "h01", "h01", "h01"),
collection_date = structure(c(15076, 15092, 15125, 15139,
15159, 15170), class = "Date"),
NDVI = c(0.581769436997319, 0.539445628997868, 0.338541666666667, 0.302713987473904, 0.305882352941176, 0.269439421338155)),
row.names = c(NA, -6L), class = c("data.frame"))
We create a data.frame containing all dates and dplyr::left_join() it with the existing (incomplete) data. The NAs are created automatically.
library(dplyr)
library(tidyr)
data.frame(collection_date = seq.Date(min(df1$collection_date), max(df1$collection_date), "days")) %>%
  left_join(df1) %>%
  arrange(collection_date) %>%
  select(ID, collection_date, everything())
Returns:
ID collection_date NDVI
1 h01 2011-04-12 0.5817694
2 <NA> 2011-04-13 NA
3 <NA> 2011-04-14 NA
4 <NA> 2011-04-15 NA
5 <NA> 2011-04-16 NA
6 <NA> 2011-04-17 NA
7 <NA> 2011-04-18 NA
8 <NA> 2011-04-19 NA
9 <NA> 2011-04-20 NA
10 <NA> 2011-04-21 NA
11 <NA> 2011-04-22 NA
12 <NA> 2011-04-23 NA
13 <NA> 2011-04-24 NA
14 <NA> 2011-04-25 NA
15 <NA> 2011-04-26 NA
16 <NA> 2011-04-27 NA
17 h01 2011-04-28 0.5394456
18 <NA> 2011-04-29 NA
19 <NA> 2011-04-30 NA
20 <NA> 2011-05-01 NA
21 <NA> 2011-05-02 NA
22 <NA> 2011-05-03 NA
23 <NA> 2011-05-04 NA
24 <NA> 2011-05-05 NA
25 <NA> 2011-05-06 NA
26 <NA> 2011-05-07 NA
27 <NA> 2011-05-08 NA
28 <NA> 2011-05-09 NA
29 <NA> 2011-05-10 NA
30 <NA> 2011-05-11 NA
31 <NA> 2011-05-12 NA
32 <NA> 2011-05-13 NA
33 <NA> 2011-05-14 NA
34 <NA> 2011-05-15 NA
35 <NA> 2011-05-16 NA
36 <NA> 2011-05-17 NA
37 <NA> 2011-05-18 NA
38 <NA> 2011-05-19 NA
39 <NA> 2011-05-20 NA
40 <NA> 2011-05-21 NA
41 <NA> 2011-05-22 NA
42 <NA> 2011-05-23 NA
43 <NA> 2011-05-24 NA
44 <NA> 2011-05-25 NA
45 <NA> 2011-05-26 NA
46 <NA> 2011-05-27 NA
47 <NA> 2011-05-28 NA
48 <NA> 2011-05-29 NA
49 <NA> 2011-05-30 NA
50 h01 2011-05-31 0.3385417
51 <NA> 2011-06-01 NA
52 <NA> 2011-06-02 NA
53 <NA> 2011-06-03 NA
54 <NA> 2011-06-04 NA
55 <NA> 2011-06-05 NA
56 <NA> 2011-06-06 NA
57 <NA> 2011-06-07 NA
58 <NA> 2011-06-08 NA
59 <NA> 2011-06-09 NA
60 <NA> 2011-06-10 NA
61 <NA> 2011-06-11 NA
62 <NA> 2011-06-12 NA
63 <NA> 2011-06-13 NA
64 h01 2011-06-14 0.3027140
65 <NA> 2011-06-15 NA
66 <NA> 2011-06-16 NA
67 <NA> 2011-06-17 NA
68 <NA> 2011-06-18 NA
69 <NA> 2011-06-19 NA
70 <NA> 2011-06-20 NA
71 <NA> 2011-06-21 NA
72 <NA> 2011-06-22 NA
73 <NA> 2011-06-23 NA
74 <NA> 2011-06-24 NA
75 <NA> 2011-06-25 NA
76 <NA> 2011-06-26 NA
77 <NA> 2011-06-27 NA
78 <NA> 2011-06-28 NA
79 <NA> 2011-06-29 NA
80 <NA> 2011-06-30 NA
81 <NA> 2011-07-01 NA
82 <NA> 2011-07-02 NA
83 <NA> 2011-07-03 NA
84 h01 2011-07-04 0.3058824
85 <NA> 2011-07-05 NA
86 <NA> 2011-07-06 NA
87 <NA> 2011-07-07 NA
88 <NA> 2011-07-08 NA
89 <NA> 2011-07-09 NA
90 <NA> 2011-07-10 NA
91 <NA> 2011-07-11 NA
92 <NA> 2011-07-12 NA
93 <NA> 2011-07-13 NA
94 <NA> 2011-07-14 NA
95 h01 2011-07-15 0.2694394
Edit:
In order to have ID = "h01" everywhere we just add it to the constructed data.frame. I.e.:
library(dplyr)
library(tidyr)
data.frame(collection_date = seq.Date(min(df1$collection_date), max(df1$collection_date), "days"),
           ID = "h01") %>%
  left_join(df1) %>%
  arrange(collection_date) %>%
  select(ID, collection_date, everything())
library(tidyverse)
library(lubridate)
df = structure(list(ID = c("h01", "h01", "h01", "h01", "h01", "h01"
), collection_date = structure(c(15076, 15092, 15125, 15139,
15159, 15170), class = "Date"), NDVI = c(0.581769436997319, 0.539445628997868,
0.338541666666667, 0.302713987473904, 0.305882352941176, 0.269439421338155
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
df2 = tibble(
  ID = "h01",
  collection_date = seq(ymd("2011-04-10"), ymd("2011-07-16"), 1)
) %>% left_join(df, by = c("ID", "collection_date"))
df2 %>% head(10)
output
# A tibble: 98 x 3
ID collection_date NDVI
<chr> <date> <dbl>
1 h01 2011-04-10 NA
2 h01 2011-04-11 NA
3 h01 2011-04-12 0.582
4 h01 2011-04-13 NA
5 h01 2011-04-14 NA
6 h01 2011-04-15 NA
7 h01 2011-04-16 NA
8 h01 2011-04-17 NA
9 h01 2011-04-18 NA
10 h01 2011-04-19 NA
# ... with 88 more rows
output of df2 %>% tail(10)
# A tibble: 10 x 3
ID collection_date NDVI
<chr> <date> <dbl>
1 h01 2011-07-07 NA
2 h01 2011-07-08 NA
3 h01 2011-07-09 NA
4 h01 2011-07-10 NA
5 h01 2011-07-11 NA
6 h01 2011-07-12 NA
7 h01 2011-07-13 NA
8 h01 2011-07-14 NA
9 h01 2011-07-15 0.269
10 h01 2011-07-16 NA
You may use tidyr::complete -
library(dplyr)
library(tidyr)
df %>%
  group_by(ID) %>%
  complete(collection_date = seq(min(collection_date),
                                 max(collection_date), by = 'days')) %>%
  ungroup
# ID collection_date NDVI
# <chr> <date> <dbl>
# 1 h01 2011-04-12 0.582
# 2 h01 2011-04-13 NA
# 3 h01 2011-04-14 NA
# 4 h01 2011-04-15 NA
# 5 h01 2011-04-16 NA
# 6 h01 2011-04-17 NA
# 7 h01 2011-04-18 NA
# 8 h01 2011-04-19 NA
# 9 h01 2011-04-20 NA
#10 h01 2011-04-21 NA
#11 h01 2011-04-22 NA
#12 h01 2011-04-23 NA
#13 h01 2011-04-24 NA
#14 h01 2011-04-25 NA
#15 h01 2011-04-26 NA
#16 h01 2011-04-27 NA
#17 h01 2011-04-28 0.539
#18 h01 2011-04-29 NA
#19 h01 2011-04-30 NA
#20 h01 2011-05-01 NA
#...
#...
The benefit of this approach is that it creates the missing dates based on the min and max date within each ID.
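As a small illustration of that point (a sketch with a hypothetical second station "h02" and made-up NDVI values, not data from the question), each ID is completed only over its own date range:
library(dplyr)
library(tidyr)
df_two <- bind_rows(
  df,
  tibble(ID = "h02",
         collection_date = as.Date(c("2011-05-01", "2011-05-03")),
         NDVI = c(0.41, 0.39))  # hypothetical values, for illustration only
)
df_two %>%
  group_by(ID) %>%
  complete(collection_date = seq(min(collection_date), max(collection_date), by = "days")) %>%
  ungroup()
# h02 is expanded only from 2011-05-01 to 2011-05-03, independently of h01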

Error Converting a time series with NA values to data frame in r

I want to convert a time series into a data frame and keep the same format. The problem is that this ts has some NA values that come from a previous calculation step, and I can't fill them by interpolation. I tried to remove the NAs from the time series before converting it, but I get an error that I don't understand.
My time series is the following; it has only two NAs at the beginning (an excerpt):
>spi_ts_3
Jan Feb Mar Apr May Jun
1989 NA NA 0.765069346 1.565910141 1.461138946 1.372936681
1990 -0.157878028 0.097403112 0.099963471 0.729772909 0.569480219 -0.419761595
1991 -0.157878028 0.348524568 0.230534719 0.356331349 0.250889358 0.353116608
1992 1.662879078 2.178001602 1.379790538 1.367209519 1.367845061 1.451183431
1993 0.096554376 0.058881807 0.247172184 -0.085316621 0.020991171 -0.491276965
1994 0.258656104 0.716903968 0.847780489 0.440594371 0.474698780 -0.473765100
The code I'm using to convert it and handle the NAs is the following:
library(tseries)
na.remove(spi_ts_3)
df_fitted_3 <- as.data.frame(type.convert(.preformat.ts(spi_ts_3)))
When I check the data frame produced, I don't understand what is happening. I get something like this for each month, plus a warning after all the months:
type.convert(.preformat.ts(spi_ts_3)).Feb
1 NA
2 0.097403112
3 0.348524568
4 2.178001602
5 0.058881807
6 0.716903968
7 2.211192460
8 -1.123925787
9 0.395452064
10 -0.106514633
11 -1.637049815
12 -0.862751319
13 -0.010681104
14 -0.958173964
15 0.470583289
16 0.088061116
17 0.485598080
18 -0.661229419
19 1.323879689
20 -0.449031840
21 -1.867196593
22 0.598343928
23 -2.549778490
24 -0.174824280
25 0.892977124
26 -0.246675932
27 0.324195405
28 -0.296931389
29 0.356029416
30 0.171029515
31 <NA>
32 <NA>
33 <NA>
34 <NA>
35 <NA>
36 <NA>
37 <NA>
38 <NA>
39 <NA>
40 <NA>
41 <NA>
42 <NA>
43 <NA>
44 <NA>
45 <NA>
46 <NA>
47 <NA>
48 <NA>
49 <NA>
50 <NA>
51 <NA>
52 <NA>
53 <NA>
54 <NA>
55 <NA>
56 <NA>
57 <NA>
58 <NA>
59 <NA>
60 <NA>
61 <NA>
62 <NA>
63 <NA>
64 <NA>
65 <NA>
66 <NA>
67 <NA>
68 <NA>
69 <NA>
70 <NA>
71 <NA>
72 <NA>
73 <NA>
74 <NA>
75 <NA>
76 <NA>
77 <NA>
78 <NA>
79 <NA>
80 <NA>
81 <NA>
82 <NA>
83 <NA>
Warning message:
In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
corrupt data frame: columns will be truncated or padded with NAs
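One way to sidestep the problem (a minimal sketch that keeps the NAs rather than removing them): note that na.remove(spi_ts_3) returns a new series and is never assigned back, so spi_ts_3 still contains the NAs when it is converted. Building the data frame directly from the ts avoids .preformat.ts() entirely:
# long format: one row per month, NAs kept
df_fitted_3 <- data.frame(
  year  = as.integer(floor(time(spi_ts_3))),
  month = month.abb[as.integer(cycle(spi_ts_3))],
  spi   = as.numeric(spi_ts_3)
)
# optional: reshape to the Year x Month layout of the printed ts
df_wide <- reshape(df_fitted_3, idvar = "year", timevar = "month", direction = "wide")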

data.table and using roll over multiple columns

I can't get my head around something that looks obvious...
library(data.table)
DT1<-data.table(MyDate=as.Date(rep("2019-02-01")),MyName=c("John","Peter","Paul"),Rate=c(210,180,190))
DT2<-data.table(MyDate=seq(as.Date("2019-01-27"),as.Date("2019-02-03"),by="days"))
setkey(DT1,MyDate)
setkey(DT2,MyDate)
I would like the rate for John, Peter and Paul to be rolled forward to the end of the date range. When I do
DT1[DT2,on=.(MyDate),roll=TRUE]
I get:
MyDate MyName Rate
1: 2019-01-27 <NA> NA
2: 2019-01-28 <NA> NA
3: 2019-01-29 <NA> NA
4: 2019-01-30 <NA> NA
5: 2019-01-31 <NA> NA
6: 2019-02-01 John 210
7: 2019-02-01 Paul 190
8: 2019-02-01 Peter 180
9: 2019-02-02 Peter 180
10: 2019-02-03 Peter 180
While I want this:
MyDate MyName Rate
1: 2019-01-27 <NA> NA
2: 2019-01-28 <NA> NA
3: 2019-01-29 <NA> NA
4: 2019-01-30 <NA> NA
5: 2019-01-31 <NA> NA
6: 2019-02-01 John 210
7: 2019-02-01 Paul 190
8: 2019-02-01 Peter 180
9: 2019-02-02 John 210
10: 2019-02-02 Paul 190
11: 2019-02-02 Peter 180
12: 2019-02-03 John 210
13: 2019-02-03 Paul 190
14: 2019-02-03 Peter 180
It's obvious I'm overlooking something.
A convoluted way (found by trial and error):
DT1[DT2, on=.(MyDate <= MyDate), allow.cartesian = TRUE]
MyDate MyName Rate
1: 2019-01-27 <NA> NA
2: 2019-01-28 <NA> NA
3: 2019-01-29 <NA> NA
4: 2019-01-30 <NA> NA
5: 2019-01-31 <NA> NA
6: 2019-02-01 John 210
7: 2019-02-01 Peter 180
8: 2019-02-01 Paul 190
9: 2019-02-02 John 210
10: 2019-02-02 Peter 180
11: 2019-02-02 Paul 190
12: 2019-02-03 John 210
13: 2019-02-03 Peter 180
14: 2019-02-03 Paul 190
The difficult part is getting the cross-join-like rows you need after a matching date but not before it. I think the steps below get at this issue:
Perform a rolling join for each name, then blank out MyName where there is no Rate, and keep only the unique rows.
library(magrittr)
DT1[, .SD[DT2, roll = TRUE], by = MyName][
, MyName := ifelse(is.na(Rate), NA, MyName)
][order(MyDate, MyName), .(MyDate, MyName, Rate)] %>%
unique()
MyDate MyName Rate
1: 2019-01-27 <NA> NA
2: 2019-01-28 <NA> NA
3: 2019-01-29 <NA> NA
4: 2019-01-30 <NA> NA
5: 2019-01-31 <NA> NA
6: 2019-02-01 John 210
7: 2019-02-01 Paul 190
8: 2019-02-01 Peter 180
9: 2019-02-02 John 210
10: 2019-02-02 Paul 190
11: 2019-02-02 Peter 180
12: 2019-02-03 John 210
13: 2019-02-03 Paul 190
14: 2019-02-03 Peter 180
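A hedged alternative along the same lines (a sketch, not the answer above) is to build the full date-by-name grid with CJ() and roll on that. Note the small difference from the desired output: the names also appear on the dates before 2019-02-01, just with NA rates:
library(data.table)
# every date paired with every name, then roll each name's Rate forward
grid <- CJ(MyName = DT1$MyName, MyDate = DT2$MyDate)
DT1[grid, on = .(MyName, MyDate), roll = TRUE]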

How to convert data into a Time Series using R?

I have an intraday dataset of stock-related quotes. How do I convert it into a time series?
Time Size Ask Bid Trade
11-1-2016 9:00:12 100 <NA> 901 <NA>
11-1-2016 9:00:21 5 <NA> <NA> 950
11-1-2016 9:00:21 5 <NA> 950 <NA>
11-1-2016 9:00:21 10 905 <NA> <NA>
11-1-2016 9:00:24 500 <NA> 921 <NA>
11-1-2016 9:00:28 2 <NA> 879 <NA>
11-1-2016 9:00:31 6 1040 <NA> <NA>
11-1-2016 9:00:39 5 <NA> <NA> 950
11-1-2016 9:00:39 5 <NA> 950 <NA>
11-1-2016 9:00:39 10 905 <NA> <NA>
11-1-2016 9:00:39 5 <NA> <NA> 950
11-1-2016 9:00:44 2 <NA> 879 <NA>
11-1-2016 9:00:44 6 1040 <NA> <NA>
11-1-2016 9:00:45 1 1005 <NA> <NA>
11-1-2016 9:00:46 1 1000 <NA> <NA>
11-1-2016 9:00:47 1 <NA> 900 <NA>
11-1-2016 9:00:47 5 <NA> <NA> 950
11-1-2016 9:00:47 5 <NA> 950 <NA>
11-1-2016 9:00:47 10 905 <NA> <NA>
11-1-2016 9:00:48 1 <NA> 900 <NA>
11-1-2016 9:00:48 1 1000 <NA> <NA>
11-1-2016 9:00:52 5 <NA> <NA> 950
11-1-2016 9:00:52 5 <NA> 950 <NA>
11-1-2016 9:00:52 10 905 <NA> <NA>
11-1-2016 9:00:53 10 <NA> <NA> 939
11-1-2016 9:00:55 1 <NA> 900 <NA>
11-1-2016 9:00:55 1 1000 <NA> <NA>
11-1-2016 9:00:55 10 <NA> <NA> 939
11-1-2016 9:00:55 5 <NA> 950 <NA>
11-1-2016 9:00:55 10 905 <NA> <NA>
11-1-2016 9:00:59 10 <NA> <NA> 939
11-1-2016 9:01:04 10 <NA> <NA> 950
11-1-2016 9:01:04 25 <NA> 950 <NA>
11-1-2016 9:01:06 1 <NA> 900 <NA>
11-1-2016 9:01:06 1 1000 <NA> <NA>
11-1-2016 9:01:14 19 <NA> <NA> 972
11-1-2016 9:01:14 20 <NA> 972 <NA>
11-1-2016 9:01:14 10 905 <NA> <NA>
11-1-2016 9:01:17 19 <NA> <NA> 972
11-1-2016 9:01:17 1 <NA> 912 <NA>
The structure of the dataset is
'data.frame': 35797 obs. of 5 variables:
$ Time : POSIXct, format: "2016-11-01 09:00:12" "2016-11-01 09:00:21" ..
$ Size : chr "100" "5" "5" "10" ...
$ ASk : chr NA NA NA "905" ...
$ Bid : chr "901" NA "950" NA ...
$ Trade: chr NA "950" NA NA ...
Once the data is converted into a time series object, how do I aggregate the Ask, Bid and Trade columns over every 5 minutes?
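One hedged approach (a sketch, assuming the xts package, that the data frame is called df, and that the quote columns should be numeric; adjust "Ask" to the actual column spelling "ASk" if needed) is to build an xts object indexed by the POSIXct Time column and aggregate on 5-minute endpoints:
library(xts)
# coerce the character columns to numeric and index by Time
quotes <- xts(
  data.frame(Ask   = as.numeric(df$Ask),
             Bid   = as.numeric(df$Bid),
             Trade = as.numeric(df$Trade)),
  order.by = df$Time
)
# column means within every 5-minute window, ignoring the NAs
ep <- endpoints(quotes, on = "minutes", k = 5)
agg_5min <- period.apply(quotes, INDEX = ep, FUN = colMeans, na.rm = TRUE)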

R Loop by date calculations and put into a new dataframe/matrix

I have a database with 7,994,625 obs of 42 variables. It's basically water quality parameters taken from multiple stations every 15 minutes, for 1 to 12 years depending on the station.
Here is the head of the dataframe:
STATION DATE Time SONDE Layer TOTAL_DEPTH TOTAL_DEPTH_A BATT BATT_A WTEMP WTEMP_A SPCOND SPCOND_A
1 CCM0069 2001-05-01 09:45:52 AMY BS NA NND 11.6 <NA> 19.32 <NA> 0.387 <NA>
2 CCM0069 2001-05-01 10:00:52 AMY BS NA NND 11.5 <NA> 19.51 <NA> 0.399 <NA>
3 CCM0069 2001-05-01 10:15:52 AMY BS NA NND 11.5 <NA> 19.49 <NA> 0.407 <NA>
4 CCM0069 2001-05-01 10:30:52 AMY BS NA NND 11.5 <NA> 19.34 <NA> 0.428 <NA>
5 CCM0069 2001-05-01 10:45:52 AMY BS NA NND 11.5 <NA> 19.42 <NA> 0.444 <NA>
6 CCM0069 2001-05-01 11:00:52 AMY BS NA NND 11.5 <NA> 19.31 <NA> 0.460 <NA>
SALINITY SALINITY_A DO_SAT DO_SAT_A DO DO_A PH PH_A TURB_NTU TURB_NTU_A FLUOR FLUOR_A TCHL_PRE_CAL
1 0.19 <NA> 97.8 <NA> 9.01 <NA> 7.24 <NA> 19.5 <NA> 9.6 <NA> 63.4
2 0.19 <NA> 99.7 <NA> 9.14 <NA> 7.26 <NA> 21.1 <NA> 9.5 <NA> 63.2
3 0.20 <NA> 99.3 <NA> 9.11 <NA> 7.23 <NA> 19.2 <NA> 9.7 <NA> 64.3
4 0.21 <NA> 98.4 <NA> 9.05 <NA> 7.23 <NA> 20.0 <NA> 10.2 <NA> 67.6
5 0.21 <NA> 99.2 <NA> 9.12 <NA> 7.23 <NA> 21.2 <NA> 10.4 <NA> 68.7
6 0.22 <NA> 98.7 <NA> 9.09 <NA> 7.23 <NA> 18.3 <NA> 11.0 <NA> 72.5
TCHL_PRE_CAL_A CHLA CHLA_A COMMENTS month year day
1 <NA> <NA> <NA> <NA> May 2001 1
2 <NA> <NA> <NA> <NA> May 2001 1
3 <NA> <NA> <NA> <NA> May 2001 1
4 <NA> <NA> <NA> <NA> May 2001 1
5 <NA> <NA> <NA> <NA> May 2001 1
6 <NA> <NA> <NA> <NA> May 2001 1
I have been all through the R help sites and found similar questions, but when I tried to adapt them to my dataframe, no dice.
I'm trying to loop by date and calculate the total number of DO observations, the number of times DO falls below 5 mg/l, and then the % failure rate against 5 mg/l. I can do this over the entire dataset, or for each station and date subset individually, just fine, but I need to do it in a loop and put the results in a new dataframe along with other parameter calculations... I guess I just need a head start.
Here is what little I have figured out:
x <- levels(sub$DATE)
for(i in 1:length(x)){
x$c<-(sum(!is.na(x$DO)))/4 # number of DO measurements and put into hours(every 15 mins)
x$dur<-(sum(x$DO<= 5))/4 # number of DO measurement under 5 mg/l and put into hours
x$fail<-(x$dur/x$c)*100 # failure rate at station and day
}
I get errors about atomic vectors.
What I eventually want is this:
station date c dur fail
HGD2115 5/1/2001 24 5 20.83333333
HGD2115 5/2/2001 22 20 90.90909091
HGD2115 5/3/2001 24 12 50
JLD5564 5/1/2001 20 6 30
JLD5564 5/2/2001 12 2 16.66666667
JLD5564 5/3/2001 23 5 21.73913043
There are more calculations I need to do and add to the new dataframe, such as the monthly min, max and mean of salinity, temperature, etc. Hopefully I won't have to come back for help with that; I just need some advice and a push in the right direction.
And eventually I will get really wild by throwing out days with not enough DO measurements!
This seems like what you are asking (??)
# create sample dataset - you have this already
# 100 stations, 10 days, 15-minute intervals = 100*10*24*4
library(stringr) # for str_pad(...) in example only - you don't need this
set.seed(1) # for reproducible example...
data <- data.frame(STATION=paste0("CMM",str_pad(rep(1:100,each=4*24*10),3,pad="0")),
DATE = as.POSIXct("2001-05-01")+seq(0,15*60*24*1000,len=4*24*1000),
DO = rpois(4*24*1000,5))
# you start here
result <- aggregate(DO ~ as.Date(DATE) + STATION, data, function(x) {
  count <- sum(!is.na(x))
  fail <- sum(x[!is.na(x)] < 5)
  pct.fail <- 100 * fail / count
  c(count, fail, pct.fail)
})
result <- data.frame(result[,1:2],result[,3])
colnames(result) <- c("DATE","STATION","COUNT","FAIL","PCT.FAIL")
head(result)
# DATE STATION COUNT FAIL PCT.FAIL
# 1 2001-05-01 CMM001 320 147 45.93750
# 2 2001-05-02 CMM001 384 163 42.44792
# 3 2001-05-03 CMM001 256 119 46.48438
# 4 2001-05-03 CMM002 128 61 47.65625
# 5 2001-05-04 CMM002 384 191 49.73958
# 6 2001-05-05 CMM002 384 168 43.75000
This uses the so-called formula interface to aggregate(...) to subset data by date (using as.Date(DATE)) and STATION. For every subgroup, the column DO is passed to the function, which calculates count, fail, and pct.fail as you did.
When the function in aggregate(...) returns a vector, as this one does, the result is a data frame with 3 columns, one for date, one for station, and one containing the vector of results. But you want these in separate columns (so, 5 columns total in your case). The line:
result <- data.frame(result[,1:2],result[,3])
does this.
Here is a slight variation using the aggregate solution. Instead of having the relational operator inside the aggregate function, a second data set is made consisting only of the data that satisfies the requirement (DO < 5).
set.seed(5)
samp_times<- seq(as.POSIXct("2014-06-01 00:00:00", tz = "UTC"),
as.POSIXct("2014-12-31 23:45:00", tz = "UTC"),
by = 60*15)
ntimes=length(samp_times)
nSta<-15
sta <- character(nSta)  # pre-allocate a character vector for the station IDs
for (iSta in seq(1,nSta)) {
sta[iSta] <- paste(paste(sample(letters,3), collapse = ''), sample(1000:9999, 1), sep="")
}
df<-data.frame(DATETIME=rep(rep(samp_times,each=nSta)), STATION=sta, DO=runif(ntimes*nSta,.1,10))
df$DATE<-strftime(df$DATETIME, format="%Y-%m-%d")
df$TIME<-strftime(df$DATETIME, format="%H:%M:%S")
head(df,20)
do_small = 5
agr_1 <- aggregate(df$DO,list(station=df$STATION,date=df$DATE),length)
dfSmall <- df[df$DO<=do_small,]
agr_2 <- aggregate(dfSmall$DO,list(station=dfSmall$STATION,date=dfSmall$DATE),length)
names(agr_1)[3]="nDO"
names(agr_2)[3]="nDO_Small"
agr <- merge(agr_1,agr_2)
agr$pcnt_DO_SMALL <- agr$nDO_Small / agr$nDO * 100
head(agr)
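A rough dplyr equivalent of the same per-station, per-day summary (a sketch, assuming the water-quality data frame is called df and DO is numeric) would be:
library(dplyr)
df %>%
  group_by(STATION, DATE) %>%
  summarise(
    c    = sum(!is.na(DO)) / 4,             # hours of DO measurements (4 readings per hour)
    dur  = sum(DO <= 5, na.rm = TRUE) / 4,  # hours with DO at or below 5 mg/l
    fail = 100 * dur / c,                   # percent of measured time failing
    .groups = "drop"
  )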
