How to convert data into a Time Series using R?

I have an intraday dataset of stock-related quotes. How do I convert it into a time series?
Time Size Ask Bid Trade
11-1-2016 9:00:12 100 <NA> 901 <NA>
11-1-2016 9:00:21 5 <NA> <NA> 950
11-1-2016 9:00:21 5 <NA> 950 <NA>
11-1-2016 9:00:21 10 905 <NA> <NA>
11-1-2016 9:00:24 500 <NA> 921 <NA>
11-1-2016 9:00:28 2 <NA> 879 <NA>
11-1-2016 9:00:31 6 1040 <NA> <NA>
11-1-2016 9:00:39 5 <NA> <NA> 950
11-1-2016 9:00:39 5 <NA> 950 <NA>
11-1-2016 9:00:39 10 905 <NA> <NA>
11-1-2016 9:00:39 5 <NA> <NA> 950
11-1-2016 9:00:44 2 <NA> 879 <NA>
11-1-2016 9:00:44 6 1040 <NA> <NA>
11-1-2016 9:00:45 1 1005 <NA> <NA>
11-1-2016 9:00:46 1 1000 <NA> <NA>
11-1-2016 9:00:47 1 <NA> 900 <NA>
11-1-2016 9:00:47 5 <NA> <NA> 950
11-1-2016 9:00:47 5 <NA> 950 <NA>
11-1-2016 9:00:47 10 905 <NA> <NA>
11-1-2016 9:00:48 1 <NA> 900 <NA>
11-1-2016 9:00:48 1 1000 <NA> <NA>
11-1-2016 9:00:52 5 <NA> <NA> 950
11-1-2016 9:00:52 5 <NA> 950 <NA>
11-1-2016 9:00:52 10 905 <NA> <NA>
11-1-2016 9:00:53 10 <NA> <NA> 939
11-1-2016 9:00:55 1 <NA> 900 <NA>
11-1-2016 9:00:55 1 1000 <NA> <NA>
11-1-2016 9:00:55 10 <NA> <NA> 939
11-1-2016 9:00:55 5 <NA> 950 <NA>
11-1-2016 9:00:55 10 905 <NA> <NA>
11-1-2016 9:00:59 10 <NA> <NA> 939
11-1-2016 9:01:04 10 <NA> <NA> 950
11-1-2016 9:01:04 25 <NA> 950 <NA>
11-1-2016 9:01:06 1 <NA> 900 <NA>
11-1-2016 9:01:06 1 1000 <NA> <NA>
11-1-2016 9:01:14 19 <NA> <NA> 972
11-1-2016 9:01:14 20 <NA> 972 <NA>
11-1-2016 9:01:14 10 905 <NA> <NA>
11-1-2016 9:01:17 19 <NA> <NA> 972
11-1-2016 9:01:17 1 <NA> 912 <NA>
The structure of the dataset is
'data.frame': 35797 obs. of 5 variables:
$ Time : POSIXct, format: "2016-11-01 09:00:12" "2016-11-01 09:00:21" ..
$ Size : chr "100" "5" "5" "10" ...
$ ASk : chr NA NA NA "905" ...
$ Bid : chr "901" NA "950" NA ...
$ Trade: chr NA "950" NA NA ...
Once the data is converted into a time series object, how do I aggregate the Ask, Bid and Trade columns for every 5 minutes?
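One possible approach is a sketch, assuming the data frame is named quotes (a name not given in the question) and that the xts package is acceptable; the odd ASk spelling follows the str() output above:
library(xts)

# Sketch only: `quotes` is an assumed name for the data frame shown above.
# The Size/ASk/Bid/Trade columns are character, so convert them to numeric first.
num_cols <- c("Size", "ASk", "Bid", "Trade")
quotes[num_cols] <- lapply(quotes[num_cols], as.numeric)

# Build an xts time series indexed by the POSIXct Time column
quotes_xts <- xts(quotes[num_cols], order.by = quotes$Time)

# Aggregate ASk, Bid and Trade into 5-minute bins, here with the mean of the
# non-missing values per bin (substitute another summary function as needed)
ends <- endpoints(quotes_xts, on = "minutes", k = 5)
agg_5min <- period.apply(quotes_xts[, c("ASk", "Bid", "Trade")],
                         INDEX = ends, FUN = colMeans, na.rm = TRUE)
Each row of agg_5min is stamped with the time of the last observation in its 5-minute bin; whether the mean, the last quote, or some other summary is appropriate depends on what the aggregation is for.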

Related

New value with one similar column and on different column in R

I need to mutate a new value: "new_value" based on the same ID "ï..record_id". I need all with the same ID to have the same value in "date_eortc".
My data1 looks like:
data1 %>%
  select(ï..record_id, dato1, galbeta_date, date_eortc)
> ï..record_id dato1 galbeta_date date_eortc
1 1 <NA> <NA> <NA>
2 1 <NA> <NA> <NA>
3 1 <NA> 2018-01-16 <NA>
.....
99 10 2018-02-07 <NA> 2017-12-27
100 10 <NA> <NA> <NA>
101 10 <NA> <NA> <NA>
102 10 <NA> 2017-12-19 <NA>
103 10 <NA> 2017-12-26 <NA>
104 10 <NA> 2017-12-29 <NA>
105 10 <NA> 2018-01-02 <NA>
106 10 <NA> <NA> <NA>
107 10 <NA> <NA> <NA>
108 11 <NA> <NA> <NA>
In this case, for all rows with "ï..record_id" = 10, date_eortc should be "2017-12-27".
So it would look like:
ï..record_id dato1 galbeta_date date_eortc
99 10 2018-02-07 <NA> 2017-12-27
100 10 <NA> <NA> 2017-12-27
101 10 <NA> <NA> 2017-12-27
102 10 <NA> 2017-12-19 2017-12-27
103 10 <NA> 2017-12-26 2017-12-27
104 10 <NA> 2017-12-29 2017-12-27
105 10 <NA> 2018-01-02 2017-12-27
106 10 <NA> <NA> 2017-12-27
107 10 <NA> <NA> 2017-12-27
108 11 <NA> <NA> <NA>
I have tried to make an ifelse statement, but it's not the right one...
data2 <- data1 %>%
  mutate(new_value = ifelse(ï..record_id == ï..record_id, date_eortc, NA))
I hope it makes sense.
Thank you for your time,
Julie
We could group_by the ï..record_id and fill the NA elements in 'date_eortc' with the adjacent non-NA element:
library(dplyr)
library(tidyr)
data1 %>%
  group_by(ï..record_id) %>%
  fill(date_eortc)
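If the non-NA value is not guaranteed to be the first row of each ID (a case the question does not show), a hedged variant fills in both directions:
library(dplyr)
library(tidyr)

data1 %>%
  group_by(ï..record_id) %>%
  fill(date_eortc, .direction = "downup") %>%  # carry the non-NA value backwards as well as forwards
  ungroup()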

Error Converting a time series with NA values to data frame in r

I want to convert a time series into a data frame and keep the same format. The problem is that this ts has some NA values that come from a previous calculation step, and I can't fill them by interpolation. I tried to remove the NAs from the time series before converting it, but I get an error that I don't even understand.
My time series is the following; it has only two NAs at the beginning (an excerpt):
>spi_ts_3
Jan Feb Mar Apr May Jun
1989 NA NA 0.765069346 1.565910141 1.461138946 1.372936681
1990 -0.157878028 0.097403112 0.099963471 0.729772909 0.569480219 -0.419761595
1991 -0.157878028 0.348524568 0.230534719 0.356331349 0.250889358 0.353116608
1992 1.662879078 2.178001602 1.379790538 1.367209519 1.367845061 1.451183431
1993 0.096554376 0.058881807 0.247172184 -0.085316621 0.020991171 -0.491276965
1994 0.258656104 0.716903968 0.847780489 0.440594371 0.474698780 -0.473765100
The code I'm using to convert it and handle the NAs is the following:
library(tseries)
na.remove(spi_ts_3)
df_fitted_3 <- as.data.frame(type.convert(.preformat.ts(spi_ts_3)))
When I check the data frame produced, I don't understand what is happening. I get something like this for each month, and a warning after all the months.
type.convert(.preformat.ts(spi_ts_3)).Feb
1 NA
2 0.097403112
3 0.348524568
4 2.178001602
5 0.058881807
6 0.716903968
7 2.211192460
8 -1.123925787
9 0.395452064
10 -0.106514633
11 -1.637049815
12 -0.862751319
13 -0.010681104
14 -0.958173964
15 0.470583289
16 0.088061116
17 0.485598080
18 -0.661229419
19 1.323879689
20 -0.449031840
21 -1.867196593
22 0.598343928
23 -2.549778490
24 -0.174824280
25 0.892977124
26 -0.246675932
27 0.324195405
28 -0.296931389
29 0.356029416
30 0.171029515
31 <NA>
32 <NA>
... (rows 33 through 83 are all <NA>)
Warning message:
In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
corrupt data frame: columns will be truncated or padded with NAs
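Since the na.remove() call above is not assigned back (and removing values would change the series length anyway), one possible alternative is to keep the NAs and build the data frame straight from the ts index. A minimal sketch, assuming spi_ts_3 is the monthly ts shown above:
# A sketch, not a definitive fix: build the data frame from the ts index itself,
# keeping the NA values instead of removing them.
df_fitted_3 <- data.frame(
  year  = floor(as.numeric(time(spi_ts_3))),
  month = month.abb[cycle(spi_ts_3)],
  spi_3 = as.numeric(spi_ts_3)
)
head(df_fitted_3)

# If the wide year-by-month layout of print.ts is wanted, reshape afterwards, e.g.:
# reshape(df_fitted_3, idvar = "year", timevar = "month", direction = "wide")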

How to create a new table given X and Y from a data frame

I'm trying to get a new dataset that takes two columns and makes a new table based on a calculation on a third column.
Cust T S1 S2 S3 S4
1009 150 1007 1006 1001 1000
1010 50 1007 1006 1001 1000
1011 50 1007 1006 1001 1000
1013 10000 1007 1006 1001 1000
1931 60 1008 1007 1006 1005
1141 1000 1014 1013 1007 1006
I need to make a new table like this:
Cust 1014 1013 1008 1007 1006 1001 1000
1009 NA NA NA T*.1 T*.1 T*.05 T*.025
1010 NA NA NA T*.1 T*.1 T*.05 T*.025
1011 NA NA NA T*.1 T*.1 T*.05 T*.025
1013 NA NA NA T*.1 T*.1 T*.05 T*.025
1931 NA NA T*.1 T*.1 T*.05 T*.025 NA
1141 T*.1 T*.1 NA T*.05 T*.025 NA NA
I just can't seem to figure it out and I'm not even sure if it is possible.
A tidyverse solution:
library(tidyverse)
df %>% gather(select = -c(Cust, T)) %>%
  select(-key) %>%
  spread(value, T) %>%
  map2_dfc(c(1, .025, .05, rep(.1, 6)), ~ .x * .y)
# Cust `1000` `1001` `1005` `1006` `1007` `1008` `1013` `1014`
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1009 3.75 7.5 NA 15 15 NA NA NA
# 2 1010 1.25 2.5 NA 5 5 NA NA NA
# 3 1011 1.25 2.5 NA 5 5 NA NA NA
# 4 1013 250 500 NA 1000 1000 NA NA NA
# 5 1141 NA NA NA 100 100 NA 100 100
# 6 1931 NA NA 6 6 6 6 NA NA
library(dplyr)
library(tidyr)
library(data.table)
df %>% gather(key = k, value = val, -c('Cust','T')) %>%
  mutate(val_upd = ifelse(k=='S1' | k=='S2', 'T*.1', ifelse(k=='S3', 'T*.05', 'T*.025'))) %>%
  # Change 'T*.1' to T*.1 to get the actual value
  select(-T, -k) %>%
  dcast(Cust ~ val, value.var = 'val_upd')
Cust 1000 1001 1005 1006 1007 1008 1013 1014
1 1009 T*.025 T*.05 <NA> T*.1 T*.1 <NA> <NA> <NA>
2 1010 T*.025 T*.05 <NA> T*.1 T*.1 <NA> <NA> <NA>
3 1011 T*.025 T*.05 <NA> T*.1 T*.1 <NA> <NA> <NA>
4 1013 T*.025 T*.05 <NA> T*.1 T*.1 <NA> <NA> <NA>
5 1141 <NA> <NA> <NA> T*.025 T*.05 <NA> T*.1 T*.1
6 1931 <NA> <NA> T*.025 T*.05 T*.1 T*.1 <NA> <NA>
Data
df <- read.table(text = "
Cust T S1 S2 S3 S4
1009 150 1007 1006 1001 1000
1010 50 1007 1006 1001 1000
1011 50 1007 1006 1001 1000
1013 10000 1007 1006 1001 1000
1931 60 1008 1007 1006 1005
1141 1000 1014 1013 1007 1006
", header=TRUE)
This is one way using a combination of reshape2::melt, dplyr::select, tidyr::spread and dplyr::mutate. May not be the best way, but it should do what you want:
# Read the data (if you don't already have it loaded)
df <- read.table(text="Cust T S1 S2 S3 S4
1009 150 1007 1006 1001 1000
1010 50 1007 1006 1001 1000
1011 50 1007 1006 1001 1000
1013 10000 1007 1006 1001 1000", header=T)
# Manipulate your data.frame. Replace df with the name of your data.frame
reshape2::melt(df, c("Cust", "T"), c("S1", "S2", "S3", "S4")) %>%
  dplyr::select(-variable) %>%
  tidyr::spread(value, T) %>%
  dplyr::mutate(`1007` = `1007` * 0.1,
                `1006` = `1006` * 0.1,
                `1001` = `1001` * 0.05,
                `1000` = `1000` * 0.025)
# Cust 1000 1001 1006 1007
#1 1009 3.75 7.5 15 15
#2 1010 1.25 2.5 5 5
#3 1011 1.25 2.5 5 5
#4 1013 250.00 500.0 1000 1000
You'll need the backticks as R doesn't handle having numeric colnames very well.
Let me know if I've misunderstood anything/something doesn't make sense

Lookup value from multiple sets of columns

This is kind of a VLOOKUP problem in Excel. I have a data set like the following.
dat1 <- read.table(header=TRUE, text="
ID Name1 Name2
1384 Rem_Ps Tel_Nm
1442 Teq_Ls Sel_Nm
1340 Fem_Bs Tem_Mn
1419 Few_Bn Ten_Gf
1359 Fem_Bs Tem_Mn
1237 Qwl_Po Mnt_Pj
1288 Tem_na Tem_Rt
1261 Sem_Na Tel_Tr
1382 Rem_Ps Tel_Nm
1316 Fem_Bs Tem_Mn
1279 Sem_Na Yem_Rt
1366 Sel_Ve Mkl_Po
1269 Rem_Ps Tel_Nm
")
dat1
ID Name1 Name2
1 1384 Rem_Ps Tel_Nm
2 1442 Teq_Ls Sel_Nm
3 1340 Fem_Bs Tem_Mn
4 1419 Few_Bn Ten_Gf
5 1359 Fem_Bs Tem_Mn
6 1237 Qwl_Po Mnt_Pj
7 1288 Tem_na Tem_Rt
8 1261 Sem_Na Tel_Tr
9 1382 Rem_Ps Tel_Nm
10 1316 Fem_Bs Tem_Mn
11 1279 Sem_Na Yem_Rt
12 1366 Sel_Ve Mkl_Po
13 1269 Rem_Ps Tel_Nm
The above dataset would look up values from the following data set. Both lookup values, Name1 and Name2, are looked up in the seven dat2 columns QC1 to NC3. More clarification: only if Name1 is found in the seven columns and Name2 is also found in the seven columns (of the same dat2 row) do we consider the match valid. For example, the second row has the two values Teq_Ls and Sel_Nm; as Teq_Ls is not found in the seven columns, we toss this row.
dat2 <- read.table(header=TRUE, text="
ID1 REQ REM QC1 QC2 QC3 QC4 NC1 NC2 NC3
AB1 1123 44ed Fem_Bs Ten_Gf NA NA Tem_Mn Tem_Mn NA
AB2 123 331s Tem_Rt Qwl_Po NA Ten_Gf NA Tem_Mn Mnt_Pj
AB3 123 334q Ten_Gf Tem_Mn Sem_Na Tem-Mn Tel_Tr NA NA
AB4 1234 33ey Sem_Na NA NA NA Tem_Rt NA Yem_Rt
AB5 13243 ed43 Rem_Ps NA NA Tem_Mn NA Tel_Nm NA
AB6 123 34rt NA Ten_Gf NA Sel_Ve Mkl_Po Tem_Rt NA
")
dat2
ID1 REQ REM QC1 QC2 QC3 QC4 NC1 NC2 NC3
1 AB1 1123 44ed Fem_Bs Ten_Gf <NA> <NA> Tem_Mn Tem_Mn <NA>
2 AB2 123 331s Tem_Rt Qwl_Po <NA> Ten_Gf <NA> Tem_Mn Mnt_Pj
3 AB3 123 334q Ten_Gf Tem_Mn Sem_Na Tem-Mn Tel_Tr <NA> <NA>
4 AB4 1234 33ey Sem_Na <NA> <NA> <NA> Tem_Rt <NA> Yem_Rt
5 AB5 13243 ed43 Rem_Ps <NA> <NA> Tem_Mn <NA> Tel_Nm <NA>
6 AB6 123 34rt <NA> Ten_Gf <NA> Sel_Ve Mkl_Po Tem_Rt <NA>
The result would be like this.
ID Name1 Name2 ID1 REQ REM
1384 Rem_Ps Tel_Nm AB5 13243 ed43
1340 Fem_Bs Tem_Mn AB1 1123 44ed
1359 Fem_Bs Tem_Mn AB1 1123 44ed
1237 Qwl_Po Mnt_Pj AB2 123 331s
1261 Sem_Na Tel_Tr AB3 123 334q
1382 Rem_Ps Tel_Nm AB5 13243 ed43
1316 Fem_Bs Tem_Mn AB1 1123 44ed
1279 Sem_Na Yem_Rt AB4 1234 33ey
1366 Sel_Ve Mkl_Po AB6 123 34rt
1269 Rem_Ps Tel_Nm AB5 13243 ed43
Let's do it in base R:
# For every row of dat1, test every row of dat2 for containing both Name1 and Name2;
# which(..., arr.ind = TRUE) then returns the matching (dat2 row, dat1 row) index pairs.
z <- which(apply(dat1, 1, function(x) apply(dat2, 1, function(z) x[[2]] %in% z & x[[3]] %in% z)), arr.ind = TRUE)
# Bind the matched rows of the two data frames side by side
cbind(dat1[z[,2],], dat2[z[,1],])
ID Name1 Name2 ID1 REQ REM QC1 QC2 QC3 QC4 NC1 NC2 NC3
1 1384 Rem_Ps Tel_Nm AB5 13243 ed43 Rem_Ps <NA> <NA> Tem_Mn <NA> Tel_Nm <NA>
3 1340 Fem_Bs Tem_Mn AB1 1123 44ed Fem_Bs Ten_Gf <NA> <NA> Tem_Mn Tem_Mn <NA>
5 1359 Fem_Bs Tem_Mn AB1 1123 44ed Fem_Bs Ten_Gf <NA> <NA> Tem_Mn Tem_Mn <NA>
6 1237 Qwl_Po Mnt_Pj AB2 123 331s Tem_Rt Qwl_Po <NA> Ten_Gf <NA> Tem_Mn Mnt_Pj
8 1261 Sem_Na Tel_Tr AB3 123 334q Ten_Gf Tem_Mn Sem_Na Tem-Mn Tel_Tr <NA> <NA>
9 1382 Rem_Ps Tel_Nm AB5 13243 ed43 Rem_Ps <NA> <NA> Tem_Mn <NA> Tel_Nm <NA>
10 1316 Fem_Bs Tem_Mn AB1 1123 44ed Fem_Bs Ten_Gf <NA> <NA> Tem_Mn Tem_Mn <NA>
11 1279 Sem_Na Yem_Rt AB4 1234 33ey Sem_Na <NA> <NA> <NA> Tem_Rt <NA> Yem_Rt
12 1366 Sel_Ve Mkl_Po AB6 123 34rt <NA> Ten_Gf <NA> Sel_Ve Mkl_Po Tem_Rt <NA>
13 1269 Rem_Ps Tel_Nm AB5 13243 ed43 Rem_Ps <NA> <NA> Tem_Mn <NA> Tel_Nm <NA>
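Since the expected result keeps only ID1, REQ and REM from dat2, one could drop the QC/NC columns from the matched rows; a small follow-up using the same z:
# Keep only the dat2 columns shown in the expected output
cbind(dat1[z[, 2], ], dat2[z[, 1], c("ID1", "REQ", "REM")])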

R Loop by date calculations and put into a new dataframe/matrix

I have a database with 7,994,625 obs of 42 variables. It's basically water quality parameters taken from multiple stations every 15 minutes for 1 to 12 years, depending on the station...
Here is the head of the data frame:
STATION DATE Time SONDE Layer TOTAL_DEPTH TOTAL_DEPTH_A BATT BATT_A WTEMP WTEMP_A SPCOND SPCOND_A
1 CCM0069 2001-05-01 09:45:52 AMY BS NA NND 11.6 <NA> 19.32 <NA> 0.387 <NA>
2 CCM0069 2001-05-01 10:00:52 AMY BS NA NND 11.5 <NA> 19.51 <NA> 0.399 <NA>
3 CCM0069 2001-05-01 10:15:52 AMY BS NA NND 11.5 <NA> 19.49 <NA> 0.407 <NA>
4 CCM0069 2001-05-01 10:30:52 AMY BS NA NND 11.5 <NA> 19.34 <NA> 0.428 <NA>
5 CCM0069 2001-05-01 10:45:52 AMY BS NA NND 11.5 <NA> 19.42 <NA> 0.444 <NA>
6 CCM0069 2001-05-01 11:00:52 AMY BS NA NND 11.5 <NA> 19.31 <NA> 0.460 <NA>
SALINITY SALINITY_A DO_SAT DO_SAT_A DO DO_A PH PH_A TURB_NTU TURB_NTU_A FLUOR FLUOR_A TCHL_PRE_CAL
1 0.19 <NA> 97.8 <NA> 9.01 <NA> 7.24 <NA> 19.5 <NA> 9.6 <NA> 63.4
2 0.19 <NA> 99.7 <NA> 9.14 <NA> 7.26 <NA> 21.1 <NA> 9.5 <NA> 63.2
3 0.20 <NA> 99.3 <NA> 9.11 <NA> 7.23 <NA> 19.2 <NA> 9.7 <NA> 64.3
4 0.21 <NA> 98.4 <NA> 9.05 <NA> 7.23 <NA> 20.0 <NA> 10.2 <NA> 67.6
5 0.21 <NA> 99.2 <NA> 9.12 <NA> 7.23 <NA> 21.2 <NA> 10.4 <NA> 68.7
6 0.22 <NA> 98.7 <NA> 9.09 <NA> 7.23 <NA> 18.3 <NA> 11.0 <NA> 72.5
TCHL_PRE_CAL_A CHLA CHLA_A COMMENTS month year day
1 <NA> <NA> <NA> <NA> May 2001 1
2 <NA> <NA> <NA> <NA> May 2001 1
3 <NA> <NA> <NA> <NA> May 2001 1
4 <NA> <NA> <NA> <NA> May 2001 1
5 <NA> <NA> <NA> <NA> May 2001 1
6 <NA> <NA> <NA> <NA> May 2001 1
I have been all through the R help sites and found similar questions, but when I tried to adapt them to my data frame, no dice.
I'm trying to loop by date and calculate the total number of DO observations, the number of times DO falls below 5 mg/l, and then the % failure rate against the 5 mg/l threshold. I can do this over an entire dataset, and subset each station and date individually, just fine, but I need to do this in a loop and put the results in a new data frame with other parameter calculations... I guess I just need a head start.
Here is what little I have figured out (or not).
x <- levels(sub$DATE)
for(i in 1:length(x)){
  x$c <- (sum(!is.na(x$DO)))/4  # number of DO measurements, put into hours (every 15 mins)
  x$dur <- (sum(x$DO <= 5))/4   # number of DO measurements under 5 mg/l, put into hours
  x$fail <- (x$dur/x$c)*100     # failure rate at station and day
}
I get errors about atomic vectors.
What I eventually want is this
station date c dur fail
HGD2115 5/1/2001 24 5 20.83333333
HGD2115 5/2/2001 22 20 90.90909091
HGD2115 5/3/2001 24 12 50
JLD5564 5/1/2001 20 6 30
JLD5564 5/2/2001 12 2 16.66666667
JLD5564 5/3/2001 23 5 21.73913043
There are more calculations I need to do and add to the new data frame, such as the monthly min, max and mean of salinity, temperature, etc... Hopefully I won't have to come back for help with that; I just need some advice and a push in the right direction.
And eventually I will get really wild by throwing out days without enough DO measurements!
This seems like what you are asking (??)
# create sample dataset - you have this already
# 100 stations, 10 days, 15-minute intervals = 100*10*24*4
library(stringr) # for str_pad(...) in example only - you don't need this
set.seed(1) # for reproducible example...
data <- data.frame(STATION = paste0("CMM", str_pad(rep(1:100, each=4*24*10), 3, pad="0")),
                   DATE = as.POSIXct("2001-05-01") + seq(0, 15*60*24*1000, len=4*24*1000),
                   DO = rpois(4*24*1000, 5))
# you start here
result <- aggregate(DO ~ as.Date(DATE) + STATION, data, function(x) {
  count    <- sum(!is.na(x))
  fail     <- sum(x[!is.na(x)] < 5)
  pct.fail <- 100 * fail / count
  c(count, fail, pct.fail)
})
result <- data.frame(result[,1:2],result[,3])
colnames(result) <- c("DATE","STATION","COUNT","FAIL","PCT.FAIL")
head(result)
# DATE STATION COUNT FAIL PCT.FAIL
# 1 2001-05-01 CMM001 320 147 45.93750
# 2 2001-05-02 CMM001 384 163 42.44792
# 3 2001-05-03 CMM001 256 119 46.48438
# 4 2001-05-03 CMM002 128 61 47.65625
# 5 2001-05-04 CMM002 384 191 49.73958
# 6 2001-05-05 CMM002 384 168 43.75000
This uses the so-called formula interface to aggregate(...) to subset data by date (using as.Date(DATE)) and STATION. For every subgroup, the column DO is passed to the function, which calculates count, fail, and pct.fail as you did.
When the function in aggregate(...) returns a vector, as this one does, the result is a data frame with 3 columns, one for date, one for station, and one containing the vector of results. But you want these in separate columns (so, 5 columns total in your case). The line:
result <- data.frame(result[,1:2],result[,3])
does this.
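For comparison, a hedged dplyr sketch that produces the station/date/c/dur/fail layout the question asks for, using the question's <= 5 mg/l test and the divide-by-4 conversion to hours (the answer above counts DO < 5 instead):
library(dplyr)

data %>%
  group_by(STATION, date = as.Date(DATE)) %>%
  summarise(
    c    = sum(!is.na(DO)) / 4,             # hours with DO measurements (4 readings per hour)
    dur  = sum(DO <= 5, na.rm = TRUE) / 4,  # hours at or below 5 mg/l
    fail = 100 * dur / c,                   # daily failure rate in percent
    .groups = "drop"
  )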
Here is a slight variation on the aggregate solution. Instead of having the relational operator inside the aggregate function, a second data set is made consisting only of the data that satisfies the requirement (DO <= 5).
set.seed(5)
samp_times <- seq(as.POSIXct("2014-06-01 00:00:00", tz = "UTC"),
                  as.POSIXct("2014-12-31 23:45:00", tz = "UTC"),
                  by = 60*15)
ntimes <- length(samp_times)
nSta <- 15
sta <- character(nSta)  # a character vector of station names
for (iSta in seq(1, nSta)) {
  sta[iSta] <- paste(paste(sample(letters, 3), collapse = ''), sample(1000:9999, 1), sep = "")
}
df <- data.frame(DATETIME = rep(samp_times, each = nSta), STATION = sta,
                 DO = runif(ntimes*nSta, .1, 10))
df$DATE <- strftime(df$DATETIME, format = "%Y-%m-%d")
df$TIME <- strftime(df$DATETIME, format = "%H:%M:%S")
head(df, 20)
do_small <- 5
# Count all DO observations per station and date
agr_1 <- aggregate(df$DO, list(station = df$STATION, date = df$DATE), length)
# Count only the observations at or below the threshold
dfSmall <- df[df$DO <= do_small, ]
agr_2 <- aggregate(dfSmall$DO, list(station = dfSmall$STATION, date = dfSmall$DATE), length)
names(agr_1)[3] <- "nDO"
names(agr_2)[3] <- "nDO_Small"
agr <- merge(agr_1, agr_2)
agr$pcnt_DO_SMALL <- agr$nDO_Small / agr$nDO * 100
head(agr)
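One caveat with this variation (an observation, not something stated in the answer): merge() with its defaults keeps only station/date combinations present in both tables, so a day with no readings at or below the threshold is dropped rather than reported as a 0% failure rate. A hedged fix:
# Keep all station/date groups and treat "no low readings" as zero failures
agr <- merge(agr_1, agr_2, all.x = TRUE)
agr$nDO_Small[is.na(agr$nDO_Small)] <- 0
agr$pcnt_DO_SMALL <- agr$nDO_Small / agr$nDO * 100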
