I have two data frames, Table1 and Table2.
Table1:
code
CM171
CM114
CM129
CM131
CM154
CM197
CM42
CM54
CM55
Table2:
code;y;diff_y
CM60;1060;2.9
CM55;255;0.7
CM54;1182;3.2
CM53;1046;2.9
CM47;589;1.6
CM42;992;2.7
CM39;1596;4.4
CM36;1113;3
CM34;1975;5.4
CM226;155;0.4
CM224;46;0.1
CM212;43;0.1
CM197;726;2
CM154;1122;3.1
CM150;206;0.6
CM144;620;1.7
CM132;8;0
CM131;618;1.7
CM129;479;1.3
CM121;634;1.7
CM114;15;0
CM109;1050;2.9
CM107;1165;3.2
CM103;194;0.5
I want to add a column to Table2 based on the values in Table1. I tried to do this using dplyr:
result <- Table2 %>%
  mutate(fbp = case_when(
    code == Table1$code ~ "y"))
But this only works for a few rows. Does anyone know why it doesn't add all rows? The values are not repeated.
Try this. The == operator compares the two vectors element by element (recycling the shorter one), so it only matches rows by coincidence of position. Use %in% instead to test each code against all of the values in Table1$code. Here is the code:
# Code
result <- Table2 %>%
  mutate(fbp = case_when(
    code %in% Table1$code ~ "y"))
Output:
code y diff_y fbp
1 CM60 1060 2.9 <NA>
2 CM55 255 0.7 y
3 CM54 1182 3.2 y
4 CM53 1046 2.9 <NA>
5 CM47 589 1.6 <NA>
6 CM42 992 2.7 y
7 CM39 1596 4.4 <NA>
8 CM36 1113 3.0 <NA>
9 CM34 1975 5.4 <NA>
10 CM226 155 0.4 <NA>
11 CM224 46 0.1 <NA>
12 CM212 43 0.1 <NA>
13 CM197 726 2.0 y
14 CM154 1122 3.1 y
15 CM150 206 0.6 <NA>
16 CM144 620 1.7 <NA>
17 CM132 8 0.0 <NA>
18 CM131 618 1.7 y
19 CM129 479 1.3 y
20 CM121 634 1.7 <NA>
21 CM114 15 0.0 y
22 CM109 1050 2.9 <NA>
23 CM107 1165 3.2 <NA>
24 CM103 194 0.5 <NA>
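For what it's worth, here are two equivalent ways to write the same thing, in case they read more naturally (a sketch using the same Table1/Table2 objects):
library(dplyr)

# if_else() is enough for a single condition; unmatched codes become NA
result <- Table2 %>%
  mutate(fbp = if_else(code %in% Table1$code, "y", NA_character_))

# or flag the matches with a left join (also handy if Table1 gains more columns)
result <- Table2 %>%
  left_join(mutate(Table1, fbp = "y"), by = "code")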
R beginner here, my question is: How do I change this function so that it can be used across all my time periods without copying and pasting the function over and over? The time periods are indicated in the function by the from = X$pre.start1[i] and to = X$pre.start2[i] arguments. I would like to have all the results end up in a single .csv file as well. Is that possible?
I know this function works; I have used it in the past by copying it and changing the time periods. But with multiple spreadsheets of data like this, it is tedious to apply that way, so I am looking to modify it so that I am not copying and pasting hundreds of times.
The function:
ADIanalyzeFUN <- function(X) {
  adianalyzeFUN <- function(X, i) {
    r <- read_wave(X$sound.files[i], from = X$pre.start1[i], to = X$pre.start2[i])
    soundfile.adi <- acoustic_diversity(r)
    return(soundfile.adi$adi_left)
    return(soundfile.adi$adi_right)  # never reached: the function exits at the first return()
  }
  output <- vector("logical", ncol(X))
  for (i in seq_along(X$sound.files)) {
    output[[i]] <- adianalyzeFUN(X, i)
  }
  X$adi.values.pre1to2 <- output
  write.csv(X, "/media/parks/Seagate Portable Drive 2 (2tb)/Parks/2021 Threat Experiment/ADI index values/ADI01.csv", row.names = TRUE)
}
Below is a sample of the data. Each column is a list of times in seconds, and I am applying the function to the wave file between one time and the next, e.g. between pre.start1 and pre.start2.
pre.start1 pre.start2 pre.start3 pre.start4 pre.start5 pre.start6 pre.start7 pre.start8 pre.start9 pre.start10 pre.end duringpb.start1
1 2304 2364 2424 2484 2544 2604 2664 2724 2784 2844 2904 2964
2 1386 1446 1506 1566 1626 1686 1746 1806 1866 1926 1986 2046
3 1680 1740 1800 1860 1920 1980 2040 2100 2160 2220 2280 2340
4 1553 1613 1673 1733 1793 1853 1913 1973 2033 2093 2153 2213
5 1661 1721 1781 1841 1901 1961 2021 2081 2141 2201 2261 2321
6 1728 1788 1848 1908 1968 2028 2088 2148 2208 2268 2328 2388
duringpb.end1 duringpb.start2 duringpb.end2 duringpb.start3 duringpb.end3 duringpb.start4 duringpb.end4 duringpb.start5 duringpb.end5
1 3024 3084 3144 3204 3264 3324 3384 3444 3504
2 2106 2166 2226 2286 2346 2406 2466 2526 2586
3 2400 2460 2520 2580 2640 2700 2760 2820 2880
4 2273 2333 2393 2453 2513 2573 2633 2693 2753
5 2381 2441 2501 2561 2621 2681 2741 2801 2861
6 2448 2508 2568 2628 2688 2748 2808 2868 2928
Thanks for any help!
I would like the output to be something like:
X pre.start1-pre.start2 pre.start2-pre.start3 pre.start3-pre.start4
1 0.86 0.56 0.89
2 0.27 0.09 0.03
3 0.18 0.10 0.55
4 0.39 0.52 0.74
5 0.14 0.17 0.97
6 0.91 0.64 0.71
You could use the purrr package and its map variant map2_df() (purrr is part of the tidyverse).
Your example isn't easily reproducible, so here's an example that takes the first two columns of the iris dataset and constructs a data frame (a tibble in this case) containing the row-wise sum, all in a single data frame.
library(tidyverse)
iris %>% as_tibble()
#> # A tibble: 150 × 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
map2_df(
.x = iris$Sepal.Length,
.y = iris$Sepal.Width,
.f = ~ tibble("sum" = sum(c(.x, .y)))
)
#> # A tibble: 150 × 1
#> sum
#> <dbl>
#> 1 8.6
#> 2 7.9
#> 3 7.9
#> 4 7.7
#> 5 8.6
#> 6 9.3
#> 7 8
#> 8 8.4
#> 9 7.3
#> 10 8
#> # … with 140 more rows
Created on 2021-09-04 by the reprex package (v2.0.1)
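Applied to the ADI problem in the question, the same map2 idea might look roughly like this. This is only a sketch: it assumes read_wave() comes from warbleR and acoustic_diversity() from soundecology (as the question's function suggests), that X is the data frame holding sound.files and the period columns, and that consecutive pre.start columns define each period.
library(tidyverse)
library(warbleR)       # read_wave()          (assumed source of this function)
library(soundecology)  # acoustic_diversity() (assumed source of this function)

# consecutive column pairs that define each time period; extend as needed
starts <- c("pre.start1", "pre.start2", "pre.start3")
ends   <- c("pre.start2", "pre.start3", "pre.start4")

# for each (start, end) pair, compute the left-channel ADI for every row of X
adi_cols <- map2_dfc(starts, ends, function(s, e) {
  vals <- map_dbl(seq_len(nrow(X)), function(i) {
    r <- read_wave(X$sound.files[i], from = X[[s]][i], to = X[[e]][i])
    acoustic_diversity(r)$adi_left
  })
  tibble(!!paste0(s, "-", e) := vals)
})

# one csv with one column per period, instead of one copy of the function per period
write.csv(bind_cols(X["sound.files"], adi_cols),
          "ADI_all_periods.csv", row.names = TRUE)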
I am attempting to add a new column to the state sample data frame in R. I am hoping for this column to cluster the IDs of states into broader categories (1-4). My code is close to what I am looking for, but I am not getting it quite right. I know I could enter each state ID line by line, but is there a quicker way? Thank you!
library(tidyverse)
#Add column to denote each state
States=state.x77
States=data.frame(States)
States <- tibble::rowid_to_column(States, "ID")
States
#Create new variable for state buckets
States <- States %>%
mutate(WAGE_BUCKET=case_when(ID <= c(1,12) ~ '1',
ID <= c(13,24) ~ '2',
ID <= c(25,37) ~ '3',
ID <= c(38,50) ~ '4',
TRUE ~ 'NA'))
View(States) #It is not grouping the states in the way I want/I am still getting some NA values but unsure why!
You can use cut or findInterval if all of your groups will be using contiguous ID values:
findInterval(States$ID, c(0, 12, 24, 37, 51))
# [1] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
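For completeness, the cut() version of the same bucketing might look like this (cut() returns a factor, so wrap it in as.character() if you need character labels like case_when() gives):
# (0,12] -> "1", (12,24] -> "2", (24,37] -> "3", (37,50] -> "4"
States$WAGE_BUCKET <- cut(States$ID,
                          breaks = c(0, 12, 24, 37, 50),
                          labels = c("1", "2", "3", "4"))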
If you want to make it a bit more verbose, you can use dplyr::between in your case_when:
States %>%
mutate(
WAGE_BUCKET = case_when(
between(ID, 1, 12) ~ "1",
between(ID, 13, 24) ~ "2",
between(ID, 25, 37) ~ "3",
between(ID, 38, 50) ~ "4",
TRUE ~ NA_character_)
)
# ID Population Income Illiteracy Life Exp Murder HS Grad Frost Area WAGE_BUCKET
# 1 1 3615 3624 2.1 69.05 15.1 41.3 20 50708 1
# 2 2 365 6315 1.5 69.31 11.3 66.7 152 566432 1
# 3 3 2212 4530 1.8 70.55 7.8 58.1 15 113417 1
# 4 4 2110 3378 1.9 70.66 10.1 39.9 65 51945 1
# 5 5 21198 5114 1.1 71.71 10.3 62.6 20 156361 1
# 6 6 2541 4884 0.7 72.06 6.8 63.9 166 103766 1
# 7 7 3100 5348 1.1 72.48 3.1 56.0 139 4862 1
# 8 8 579 4809 0.9 70.06 6.2 54.6 103 1982 1
# 9 9 8277 4815 1.3 70.66 10.7 52.6 11 54090 1
# 10 10 4931 4091 2.0 68.54 13.9 40.6 60 58073 1
# 11 11 868 4963 1.9 73.60 6.2 61.9 0 6425 1
# 12 12 813 4119 0.6 71.87 5.3 59.5 126 82677 1
# 13 13 11197 5107 0.9 70.14 10.3 52.6 127 55748 2
# 14 14 5313 4458 0.7 70.88 7.1 52.9 122 36097 2
# 15 15 2861 4628 0.5 72.56 2.3 59.0 140 55941 2
# 16 16 2280 4669 0.6 72.58 4.5 59.9 114 81787 2
# 17 17 3387 3712 1.6 70.10 10.6 38.5 95 39650 2
# 18 18 3806 3545 2.8 68.76 13.2 42.2 12 44930 2
# 19 19 1058 3694 0.7 70.39 2.7 54.7 161 30920 2
# 20 20 4122 5299 0.9 70.22 8.5 52.3 101 9891 2
# 21 21 5814 4755 1.1 71.83 3.3 58.5 103 7826 2
# 22 22 9111 4751 0.9 70.63 11.1 52.8 125 56817 2
# 23 23 3921 4675 0.6 72.96 2.3 57.6 160 79289 2
# 24 24 2341 3098 2.4 68.09 12.5 41.0 50 47296 2
# 25 25 4767 4254 0.8 70.69 9.3 48.8 108 68995 3
# 26 26 746 4347 0.6 70.56 5.0 59.2 155 145587 3
# 27 27 1544 4508 0.6 72.60 2.9 59.3 139 76483 3
# 28 28 590 5149 0.5 69.03 11.5 65.2 188 109889 3
# 29 29 812 4281 0.7 71.23 3.3 57.6 174 9027 3
# 30 30 7333 5237 1.1 70.93 5.2 52.5 115 7521 3
# 31 31 1144 3601 2.2 70.32 9.7 55.2 120 121412 3
# 32 32 18076 4903 1.4 70.55 10.9 52.7 82 47831 3
# 33 33 5441 3875 1.8 69.21 11.1 38.5 80 48798 3
# 34 34 637 5087 0.8 72.78 1.4 50.3 186 69273 3
# 35 35 10735 4561 0.8 70.82 7.4 53.2 124 40975 3
# 36 36 2715 3983 1.1 71.42 6.4 51.6 82 68782 3
# 37 37 2284 4660 0.6 72.13 4.2 60.0 44 96184 3
# 38 38 11860 4449 1.0 70.43 6.1 50.2 126 44966 4
# 39 39 931 4558 1.3 71.90 2.4 46.4 127 1049 4
# 40 40 2816 3635 2.3 67.96 11.6 37.8 65 30225 4
# 41 41 681 4167 0.5 72.08 1.7 53.3 172 75955 4
# 42 42 4173 3821 1.7 70.11 11.0 41.8 70 41328 4
# 43 43 12237 4188 2.2 70.90 12.2 47.4 35 262134 4
# 44 44 1203 4022 0.6 72.90 4.5 67.3 137 82096 4
# 45 45 472 3907 0.6 71.64 5.5 57.1 168 9267 4
# 46 46 4981 4701 1.4 70.08 9.5 47.8 85 39780 4
# 47 47 3559 4864 0.6 71.72 4.3 63.5 32 66570 4
# 48 48 1799 3617 1.4 69.48 6.7 41.6 100 24070 4
# 49 49 4589 4468 0.7 72.48 3.0 54.5 149 54464 4
# 50 50 376 4566 0.6 70.29 6.9 62.9 173 97203 4
c(1, 12) is a vector of length > 1, and the comparison operators work elementwise (recycling the shorter vector) rather than treating it as a range, which is why only some rows match. We could use between:
library(dplyr)
States <- States %>%
mutate(WAGE_BUCKET=case_when(between(ID, 1, 12) ~ '1',
between(ID, 13,24) ~ '2',
between(ID, 25,37) ~ '3',
between(ID, 38,50) ~ '4',
TRUE ~ NA_character_))
Or another option is to combine >= and <= with &:
States %>%
  mutate(WAGE_BUCKET = case_when(ID >= 1 & ID <= 12 ~ '1',
                                 ID >= 13 & ID <= 24 ~ '2',
                                 ID >= 25 & ID <= 37 ~ '3',
                                 ID >= 38 & ID <= 50 ~ '4',
                                 TRUE ~ NA_character_))
Or maybe the OP meant to use %in% (which matches only the exact IDs listed):
States %>%
mutate(WAGE_BUCKET=case_when(ID %in% c(1,12) ~ '1',
ID %in% c(13,24) ~ '2',
ID %in% c(25,37) ~ '3',
ID %in% c(38,50) ~ '4',
TRUE ~ NA_character_))
I have two tables coming from devices that gather data at different sampling frequencies. One device samples every 30 seconds; the other samples roughly every 30 seconds but sometimes drops measurements (an example sequence might be 31, 61, 95, 151; notice how it missed the sample around ~120). My original data.frame contains a datetime instead of the number of seconds, but the toy data should work to illustrate.
q1 <-
read.table(text="
A 0 1.1
A 30 1.2
A 90 1.3
A 120 1.4
B 15 -5
B 45 -3
B 75 -3.5
C 10 0
C 40 -1.4
C 70 -1")
q2 <-
read.table(text="
A 10 10.1
A 40 10.2
A 110 10.4
B 30 -50
B 90 -30
C 5 0
C 35 -10.4
C 76 -10")
names(q1) <- c("key","datetime","x")
names(q2) <- c("key","timepoint","y")
# create a joint_time to keep the originals in place
q1$joint_time <- q1$datetime
q2$joint_time <- q2$timepoint
If I try to join by nearest, I get
# set the keys
data.table::setkey(data.table::setDT(q1), key, joint_time)
data.table::setkey(data.table::setDT(q2), key, joint_time)
q2[q1, roll="nearest"]
Notice the duplicated matches in rows 4 and 6.
key timepoint y joint_time datetime x
1: A 10 10.1 0 0 1.1
2: A 40 10.2 30 30 1.2
3: A 110 10.4 90 90 1.3
4: A 110 10.4 120 120 1.4
5: B 30 -50.0 15 15 -5.0
6: B 30 -50.0 45 45 -3.0
7: B 90 -30.0 75 75 -3.5
8: C 5 0.0 10 10 0.0
9: C 35 -10.4 40 40 -1.4
10: C 76 -10.0 70 70 -1.0
My ideal output would join by nearest but fill with NA instead of duplicating the y values:
key timepoint y joint_time datetime x
1: A 10 10.1 0 0 1.1
2: A 40 10.2 30 30 1.2
3: A 110 10.4 90 90 1.3
4: A NA NA 120 120 1.4
5: B 30 -50.0 15 15 -5.0
6: B NA NA 45 45 -3.0
7: B 90 -30.0 75 75 -3.5
8: C 5 0.0 10 10 0.0
9: C 35 -10.4 40 40 -1.4
10: C 76 -10.0 70 70 -1.0
I'm fine with doing the join first and then finding the duplicates and changing them to NA. I will later try to interpolate the y variable there. Not sure if there's a direct way to do the join and fill with NA or if it has to be done a posteriori.
Here's what I ended up doing. I don't think it's awesome, but as far as I can see it works as expected.
library(dplyr)  # for mutate_at()/vars() below

q1$joint_time <- q1$datetime
q2$joint_time <- q2$timepoint

# create a sample id using the key, since the data is grouped
q2$sample_id <- paste0(q2$key, as.character(1:nrow(q2)))

# join by nearest
res <- q2[q1, roll = "nearest"]

# set the duplicated matches to NA
res %>% mutate_at(vars(y, timepoint), ~ifelse(duplicated(sample_id), NA, .))
Which produces
key timepoint y joint_time sample_id datetime x
1: A 10 10.1 0 A1 0 1.1
2: A 40 10.2 30 A2 30 1.2
3: A 110 10.4 90 A3 90 1.3
4: A NA NA 120 A3 120 1.4
5: B 30 -50.0 15 B4 15 -5.0
6: B NA NA 45 B4 45 -3.0
7: B 90 -30.0 75 B5 75 -3.5
8: C 5 0.0 10 C6 10 0.0
9: C 35 -10.4 40 C7 40 -1.4
10: C 76 -10.0 70 C8 70 -1.0
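A slightly more compact variant of the same idea, kept entirely in data.table (a sketch, assuming the setkey() calls above and that repeated (key, timepoint) pairs can only come from the rolling join): blank out every repeated match in place with duplicated().
library(data.table)

res <- q2[q1, roll = "nearest"]

# any q2 row matched more than once shows up as a repeated (key, timepoint);
# keep the first match and set the later ones to NA by reference
res[duplicated(res, by = c("key", "timepoint")),
    c("timepoint", "y") := NA]
res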
I am using RStudio (version 0.99.903) on a Windows 8 PC. I have a follow-up question from yesterday, as the problem became more complicated. Here is what the data looks like:
Number Trial ID Open date Enrollment rate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0
What I need to do is compare the enrollment rate of the most current date within a given ID to the average of those values that are up to one year prior to it. For instance, for ID 53, the date of 1/19/2011 has an enrollment rate of 0.2 and I would want to compare this against the average of 8/17/2010 and 5/12/2010 enrollment rates (e.g., 0.15).
If there are no other dates within the ID prior to the current one, then the comparison should not be made. For instance, for ID 26, there would be no comparison. Similarly, for ID 53, there would be no comparison for 5/12/2010.
When I say "compare" I am not doing any analysis or visualization. I simply want a new column that takes the average value of those enrollment rates up to one year prior to the current one (I will be plotting them and percentile ranking them later). There are >20,000 data points. Any help would be much appreciated.
A verbose but possibly high-performance way of doing this: no giant for loop over all the rows of the data frame. The sapply loops only operate on a numeric vector, which should be relatively quick regardless of your row count. But I'm sure someone will waltz in with a trivial dplyr solution soon enough.
The approach assumes that your data is sorted first by ID and then by Opendate. If it is not sorted, sort it first.
# Find indices where the same ID occurs both above and below the row
A = which(unlist(sapply(X = rle(df$ID)$lengths,
                        FUN = function(x) {if (x == 1) return(F)
                                           if (x == 2) return(c(F, F))
                                           if (x >= 3) return(c(F, rep(T, x - 2), F))})))
# Store the dates in a vector; should speed up the code a tiny bit
# (Opendate must be of class Date -- see the as.Date() conversion below)
V_opendate = df$Opendate
# Further filter A to rows where the date difference to both neighbors is < 365 days
B = A[sapply(A, function(x) (abs(V_opendate[x] - V_opendate[x - 1]) < 365) & (abs(V_opendate[x] - V_opendate[x + 1]) < 365))]
# The actual indices of the rows above and below each qualifying row
C = sapply(B, function(x) c(x - 1, x + 1), simplify = F)
# Take the mean of the enrollment rates of those neighboring rows
D = sapply(C, function(x) mean(df[x, ]$Enrollmentrate))
# Create the new column Rate and fill it with D. You can do the comparison from here.
df[B, "Rate"] = D
Number Trial ID Opendate Enrollmentrate Rate
1 420 NCT00091442 9 2005-01-28 0.2 NA
2 1476 NCT00301457 26 2008-02-22 1.0 NA
3 10559 NCT01307397 34 2011-07-28 0.6 NA
4 6794 NCT00948675 53 2010-05-12 0.0 NA
5 6451 NCT00917384 53 2010-08-17 0.3 0.10
6 8754 NCT01168973 53 2011-01-19 0.2 1.35
7 8578 NCT01140347 53 2011-12-30 2.4 0.25
8 11655 NCT01358877 53 2012-04-02 0.3 NA
9 428 NCT00091442 55 2005-09-07 0.1 NA
10 112 NCT00065325 62 2003-10-15 0.2 NA
11 477 NCT00091442 62 2005-11-11 0.1 NA
12 16277 NCT01843374 62 2013-12-16 0.2 NA
13 17386 NCT01905657 62 2014-01-08 0.6 NA
14 411 NCT00091442 66 2005-01-12 0.0 NA
The relevant rows are calculated, and you can do your comparison with the newly created Rate column.
You might have to change the code a little, since I removed the spaces from the column names:
df = read.table(text = " Number Trial ID Opendate Enrollmentrate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0", header = T)
# convert Opendate to Date class so the 365-day date arithmetic above works
df$Opendate = as.Date(df$Opendate, format = "%m/%d/%Y")
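And since the original question asks for the average of all rates up to one year before each row (within each ID), here is a rough dplyr sketch of that calculation; it assumes the column names from the read.table above and that Opendate has been converted to Date as shown:
library(dplyr)

df %>%
  arrange(ID, Opendate) %>%
  group_by(ID) %>%
  mutate(
    # mean of all enrollment rates strictly before this row, at most 365 days back;
    # NA when there is nothing to compare against
    prior_year_avg = sapply(seq_along(Opendate), function(i) {
      prior <- Enrollmentrate[Opendate < Opendate[i] &
                                Opendate >= Opendate[i] - 365]
      if (length(prior) == 0) NA_real_ else mean(prior)
    })
  ) %>%
  ungroup()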
I have a database with 7,994,625 obs of 42 variables. It's basically water quality parameters taken from multiple stations every 15 minutes for 1 to 12 years, depending on the station.
Here is the head of the data frame:
STATION DATE Time SONDE Layer TOTAL_DEPTH TOTAL_DEPTH_A BATT BATT_A WTEMP WTEMP_A SPCOND SPCOND_A
1 CCM0069 2001-05-01 09:45:52 AMY BS NA NND 11.6 <NA> 19.32 <NA> 0.387 <NA>
2 CCM0069 2001-05-01 10:00:52 AMY BS NA NND 11.5 <NA> 19.51 <NA> 0.399 <NA>
3 CCM0069 2001-05-01 10:15:52 AMY BS NA NND 11.5 <NA> 19.49 <NA> 0.407 <NA>
4 CCM0069 2001-05-01 10:30:52 AMY BS NA NND 11.5 <NA> 19.34 <NA> 0.428 <NA>
5 CCM0069 2001-05-01 10:45:52 AMY BS NA NND 11.5 <NA> 19.42 <NA> 0.444 <NA>
6 CCM0069 2001-05-01 11:00:52 AMY BS NA NND 11.5 <NA> 19.31 <NA> 0.460 <NA>
SALINITY SALINITY_A DO_SAT DO_SAT_A DO DO_A PH PH_A TURB_NTU TURB_NTU_A FLUOR FLUOR_A TCHL_PRE_CAL
1 0.19 <NA> 97.8 <NA> 9.01 <NA> 7.24 <NA> 19.5 <NA> 9.6 <NA> 63.4
2 0.19 <NA> 99.7 <NA> 9.14 <NA> 7.26 <NA> 21.1 <NA> 9.5 <NA> 63.2
3 0.20 <NA> 99.3 <NA> 9.11 <NA> 7.23 <NA> 19.2 <NA> 9.7 <NA> 64.3
4 0.21 <NA> 98.4 <NA> 9.05 <NA> 7.23 <NA> 20.0 <NA> 10.2 <NA> 67.6
5 0.21 <NA> 99.2 <NA> 9.12 <NA> 7.23 <NA> 21.2 <NA> 10.4 <NA> 68.7
6 0.22 <NA> 98.7 <NA> 9.09 <NA> 7.23 <NA> 18.3 <NA> 11.0 <NA> 72.5
TCHL_PRE_CAL_A CHLA CHLA_A COMMENTS month year day
1 <NA> <NA> <NA> <NA> May 2001 1
2 <NA> <NA> <NA> <NA> May 2001 1
3 <NA> <NA> <NA> <NA> May 2001 1
4 <NA> <NA> <NA> <NA> May 2001 1
5 <NA> <NA> <NA> <NA> May 2001 1
6 <NA> <NA> <NA> <NA> May 2001 1
I have been all through the R help sites and found similar questions, but when I tried to adapt them to my data frame, no dice.
I'm trying to loop by date and calculate the total number of DO observations, the number of times DO falls below 5 mg/l, and then the % failure rate against the 5 mg/l threshold. I can do this over the entire dataset, and subsetting each station and date individually works just fine, but I need to do it in a loop and put the results in a new data frame along with other parameter calculations... I guess I just need a head start.
Here is what little I have figured out so far:
x <- levels(sub$DATE)
for(i in 1:length(x)){
x$c<-(sum(!is.na(x$DO)))/4 # number of DO measurements and put into hours(every 15 mins)
x$dur<-(sum(x$DO<= 5))/4 # number of DO measurement under 5 mg/l and put into hours
x$fail<-(x$dur/x$c)*100 # failure rate at station and day
}
I get errors about atomic vectors.
What I eventually want is this
station date c dur fail
HGD2115 5/1/2001 24 5 20.83333333
HGD2115 5/2/2001 22 20 90.90909091
HGD2115 5/3/2001 24 12 50
JLD5564 5/1/2001 20 6 30
JLD5564 5/2/2001 12 2 16.66666667
JLD5564 5/3/2001 23 5 21.73913043
There are more calculations I need to do and add to the new data frame, such as the monthly min, max, and mean of salinity, temperature, etc. Hopefully I won't have to come back for help with those; I just need some advice and a push in the right direction.
Eventually I will get really wild and throw out days without enough DO measurements!
This seems like what you are asking (??)
# create sample dataset - you have this already
# 100 stations, 10 days, 15-minute intervals = 100*10*24*4
library(stringr) # for str_pad(...) in example only - you don't need this
set.seed(1) # for reproducible example...
data <- data.frame(STATION=paste0("CMM",str_pad(rep(1:100,each=4*24*10),3,pad="0")),
DATE = as.POSIXct("2001-05-01")+seq(0,15*60*24*1000,len=4*24*1000),
DO = rpois(4*24*1000,5))
# you start here
result <- aggregate(DO~as.Date(DATE)+STATION,data,function(x) {
count <- sum(!is.na(x))
fail <- sum(x[!is.na(x)]<5)
pct.fail <- 100*fail/count
c(count,fail,pct.fail)
})
result <- data.frame(result[,1:2],result[,3])
colnames(result) <- c("DATE","STATION","COUNT","FAIL","PCT.FAIL")
head(result)
# DATE STATION COUNT FAIL PCT.FAIL
# 1 2001-05-01 CMM001 320 147 45.93750
# 2 2001-05-02 CMM001 384 163 42.44792
# 3 2001-05-03 CMM001 256 119 46.48438
# 4 2001-05-03 CMM002 128 61 47.65625
# 5 2001-05-04 CMM002 384 191 49.73958
# 6 2001-05-05 CMM002 384 168 43.75000
This uses the so-called formula interface to aggregate(...) to subset data by date (using as.Date(DATE)) and STATION. For every subgroup, the column DO is passed to the function, which calculates count, fail, and pct.fail as you did.
When the function in aggregate(...) returns a vector, as this one does, the result is a data frame with 3 columns, one for date, one for station, and one containing the vector of results. But you want these in separate columns (so, 5 columns total in your case). The line:
result <- data.frame(result[,1:2],result[,3])
does this.
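If you prefer the tidyverse, the same per-station, per-day summary can be written with dplyr (a sketch, reusing the simulated data above):
library(dplyr)

data %>%
  group_by(DATE = as.Date(DATE), STATION) %>%
  summarise(COUNT = sum(!is.na(DO)),
            FAIL = sum(DO < 5, na.rm = TRUE),
            PCT.FAIL = 100 * FAIL / COUNT,
            .groups = "drop")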
Here is a slight variation using the aggregate solution. Instead of having the relational operator inside the aggregate function, a second data set is made consisting only of the data that satisfies the requirement (DO < 5).
set.seed(5)
samp_times <- seq(as.POSIXct("2014-06-01 00:00:00", tz = "UTC"),
                  as.POSIXct("2014-12-31 23:45:00", tz = "UTC"),
                  by = 60*15)
ntimes <- length(samp_times)

# make up 15 random station names
nSta <- 15
sta <- character(nSta)
for (iSta in seq(1, nSta)) {
  sta[iSta] <- paste(paste(sample(letters, 3), collapse = ''), sample(1000:9999, 1), sep = "")
}

df <- data.frame(DATETIME = rep(samp_times, each = nSta),
                 STATION = sta,
                 DO = runif(ntimes*nSta, .1, 10))
df$DATE <- strftime(df$DATETIME, format = "%Y-%m-%d")
df$TIME <- strftime(df$DATETIME, format = "%H:%M:%S")
head(df, 20)

do_small <- 5

# count of all DO measurements per station and date
agr_1 <- aggregate(df$DO, list(station = df$STATION, date = df$DATE), length)
# count of only the low-DO measurements per station and date
dfSmall <- df[df$DO < do_small, ]
agr_2 <- aggregate(dfSmall$DO, list(station = dfSmall$STATION, date = dfSmall$DATE), length)

names(agr_1)[3] <- "nDO"
names(agr_2)[3] <- "nDO_Small"

# note: merge() keeps only station/date combinations present in both tables;
# use all.x = TRUE (and replace the resulting NAs with 0) to keep days with no low-DO samples
agr <- merge(agr_1, agr_2)
agr$pcnt_DO_SMALL <- agr$nDO_Small / agr$nDO * 100
head(agr)