I have a dataset with the answers of the applicants to a course. The original dataset has 2,000 rows and 105 columns, of which 100 correspond to questions from 4 basic areas: mathematics, language, science, and social studies.
I have created the following short example to show roughly what the table looks like:
sector<-c("Privado" ,"Publico" ,"Publico" ,"Publico", "Publico",
"Publico" ,"Publico", "Publico", "Publico" ,"Publico", "Publico",
"Publico" ,"Publico", "Publico", "Publico" ,"Privado" ,"Publico" ,
"Publico" ,"Publico" ,"Publico")
aspirante<-c("337877" ,"339161", "388425" ,"371828" ,"288598" ,"396295" ,"400196",
"370915", "276891" ,"335406" ,"358013", "404406", "356633", "284792", "372549" ,
"271082", "396135" ,"398664" ,"406397", "354609")
claves<-c("10" ,"9" , "10", "4" , "4" , "3" , "3" , "4" , "9" ,"10", "3",
"3" , "3" , "4" , "4" , "4" , "4", "4" ,"9" , "3")
question1<-c(1, 0, 0, 0 ,0, 0, 0, 0, 0 ,0, 0, 0 ,0, 0, 0 ,0, 0, 0 ,1, 0)
question2<-c(0, 1, 1 ,0 ,0, 0 ,0 ,0 ,1, 0, 0,0,1 ,0 ,1, 1, 0 ,0, 0, 0)
question3<-c( 0 ,0, 1, 1, 1 ,1 ,0, 0, 0 ,0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
question4<-c(0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1)
question5<-c(1, 0, 1, 0 ,0, 1, 0, 1, 1, 0, 1, 1, 0 ,0 ,0, 0, 1, 0, 0, 0)
note<-c(4 ,2, 6, 4, 2, 6, 0 ,4, 4 ,0, 4 ,4 ,6, 2, 4 ,4, 4, 2, 4, 2)
example<-data.frame("candidate"=aspirante,"sector"=sector,"p1"=question1,
"p2"=question2,"p3"=question3,"p4"=question4,"p5"=question5,"note"=note)
The note column is equal to the sum of the question columns in each row, multiplied by 2.
I have been asked to do a cluster analysis, but I have no idea how to approach it. I had planned to divide the final notes into 4 categories:
failed: notes below 3
considered: from 3 to below 5
space availability: from 5 to below 7
approved: from 7 to 10
However, in the original dataset the size of each category will vary, and I cannot create a new dataset that splits the notes by group. Do you have any suggestions, or an example where cluster analysis is applied to dichotomous data?
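One possible starting point, offered only as a sketch: hierarchical clustering on the 0/1 answers themselves rather than on the binned notes. On dichotomous data the Manhattan distance is simply the number of questions on which two applicants differ. The average linkage and the choice of 4 clusters below are assumptions made for illustration (4 is borrowed from the four planned categories), not recommendations for the full 100-question dataset.

```r
# Binary answers from the short example above (rows = applicants, columns = questions)
answers <- cbind(
  p1 = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0),
  p2 = c(0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0),
  p3 = c(0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
  p4 = c(0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1),
  p5 = c(1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0)
)

# On 0/1 data the Manhattan distance counts the questions answered differently
d <- dist(answers, method = "manhattan")

# Agglomerative clustering; average linkage is an assumption, other linkages are possible
hc <- hclust(d, method = "average")

# Cut the tree into 4 groups (4 is only borrowed from the four planned categories)
cluster <- cutree(hc, k = 4)

# Cluster sizes, and the proportion of correct answers per question in each cluster
table(cluster)
aggregate(as.data.frame(answers), by = list(cluster = cluster), FUN = mean)
```

The per-cluster proportions show which questions separate the groups, which is information the binned notes alone cannot give.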
I am trying to summarize our detection data so that I can easily see when an animal moves from one pool to another. Here is the structure of the data for one animal that I track:
tibble [22 x 13] (S3: tbl_df/tbl/data.frame)
$ Receiver : chr [1:22] "VR2Tx-480679" "VR2Tx-480690" "VR2Tx-480690" "VR2Tx-480690" ...
$ Transmitter : chr [1:22] "A69-9001-12418" "A69-9001-12418" "A69-9001-12418" "A69-9001-12418" ...
$ Species : chr [1:22] "PDFH" "PDFH" "PDFH" "PDFH" ...
$ LocalDATETIME: POSIXct[1:22], format: "2021-05-28 07:16:52" ...
$ StationName : chr [1:22] "1405U" "1406U" "1406U" "1406U" ...
$ LengthValue : num [1:22] 805 805 805 805 805 805 805 805 805 805 ...
$ WeightValue : num [1:22] 8.04 8.04 8.04 8.04 8.04 8.04 8.04 8.04 8.04 8.04 ...
$ Sex : chr [1:22] "NA" "NA" "NA" "NA" ...
$ Translocated : num [1:22] 0 0 0 0 0 0 0 0 0 0 ...
$ Pool : num [1:22] 16 16 16 16 16 16 16 16 16 16 ...
$ DeployDate : POSIXct[1:22], format: "2018-06-05" ...
$ Latitude : num [1:22] 41.6 41.6 41.6 41.6 41.6 ...
$ Longitude : num [1:22] -90.4 -90.4 -90.4 -90.4 -90.4 ...
I want to add columns that summarize this data: the date on which an animal entered a pool and, once it is detected in a different pool, the date on which it left.
Example: a fish enters Pool 19 on 1/1/22 and is next detected in Pool 20 on 1/2/22, so the new columns would show that the fish entered Pool 19 on 1/1/22 and exited it on 1/2/22. I have shared an Excel file with an example of what I am trying to do. I would also like to code upstream movement as 1 and downstream movement as 0.
I have millions of detections and hundreds of animals that I monitor, so I am looking for a way to extract the passages for each animal. Thank you!
Here is my dataset using dput:
structure(list(Receiver = c("VR2Tx-480679", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480692", "VR2Tx-480695",
"VR2Tx-480695", "VR2Tx-480713", "VR2Tx-480713", "VR2Tx-480702",
"VR100", "VR100", "VR100"), Transmitter = c("A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418"), Species = c("PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH"), LocalDATETIME = structure(c(1622186212, 1622381700,
1622384575, 1622184711, 1622381515, 1622381618, 1622381751, 1622381924,
1622382679, 1622383493, 1622384038, 1622384612, 1622183957, 1622381515,
1626905954, 1626905688, 1622971975, 1622970684, 1626929618, 1624616880,
1626084540, 1626954660), tzone = "UTC", class = c("POSIXct",
"POSIXt")), StationName = c("1405U", "1406U", "1406U", "1406U",
"1406U", "1406U", "1406U", "1406U", "1406U", "1406U", "1406U",
"1406U", "1406U", "1404L", "1401D", "1401D", "14Aux2", "14Aux2",
"15.Mid.Wall", "man_loc", "man_loc", "man_loc"), LengthValue = c(805,
805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805,
805, 805, 805, 805, 805, 805, 805, 805), WeightValue = c(8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04),
Sex = c("NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA"), Translocated = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Pool = c(16,
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
16, 16, 16, 14, 14, 16), DeployDate = structure(c(1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800), tzone = "UTC", class = c("POSIXct", "POSIXt"
)), Latitude = c(41.57471, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.57463, 41.5731, 41.5731, 41.57469, 41.57469,
41.57469, 41.57469, 41.57469, 41.57469), Longitude = c(-90.39944,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39984, -90.40391, -90.40391, -90.40462, -90.40462, -90.40462,
-90.40462, -90.40462, -90.40462)), row.names = c(NA, -22L
), class = c("tbl_df", "tbl", "data.frame"))
Here is one possibility for getting the date an animal entered a pool and the date it left. First, I group by Species and Transmitter (so that each animal is handled separately) and arrange by time. Then, I flag every change of Pool and number the resulting runs with cumsum. The first date recorded in each run is the date the animal entered the pool. Next, I do some grouping and ungrouping to grab the entry date of the following run (i.e., the date the animal left the pool) and copy that date to every row of the current run. For upstream/downstream, we can use case_when inside mutate. I'm also assuming that you want the movement code to line up with the dates, so I have filled it down so that every row of a run carries the movement that led into that pool.
library(tidyverse)
df_dates <- df %>%
  group_by(Species, Transmitter) %>%
  arrange(Species, Transmitter, LocalDATETIME) %>%
  # Start a new changeGroup each time the pool differs from the previous detection
  mutate(changeGroup = cumsum(Pool != lag(Pool, default = -1))) %>%
  group_by(Species, Transmitter, changeGroup) %>%
  # Date of the first detection in this pool visit
  mutate(EnterPool = first(format(as.Date(LocalDATETIME), "%m/%d/%Y"))) %>%
  ungroup(changeGroup) %>%
  # The entry date of the next visit is the exit date of this one
  mutate(LeftPool = lead(EnterPool)) %>%
  group_by(Species, Transmitter, changeGroup) %>%
  mutate(LeftPool = last(LeftPool)) %>%
  ungroup(changeGroup) %>%
  # 1 when the pool number decreases (treated here as upstream), 0 when it increases
  mutate(stream = case_when((Pool - lag(Pool)) > 0 ~ 0,
                            (Pool - lag(Pool)) < 0 ~ 1)) %>%
  fill(stream, .direction = "down")
Output
print(as_tibble(df_dates[1:24, c(1:5, 10:17)]), n=24)
# A tibble: 24 × 13
Receiver Transmitter Species LocalDATETIME StationName Pool DeployDate Latitude Longitude changeGroup EnterPool LeftPool stream
<chr> <chr> <chr> <dttm> <chr> <dbl> <dttm> <dbl> <dbl> <int> <chr> <chr> <dbl>
1 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-28 06:39:17 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
2 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-28 06:51:51 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
3 VR2Tx-480679 A69-9001-12418 PDFH 2021-05-28 07:16:52 1405U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
4 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:31:55 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
5 VR2Tx-480692 A69-9001-12418 PDFH 2021-05-30 13:31:55 1404L 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
6 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:33:38 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
7 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:35:00 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
8 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:35:51 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
9 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:38:44 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
10 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:51:19 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
11 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 14:04:53 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
12 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 14:13:58 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
13 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 14:22:55 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
14 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 14:23:32 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
15 VR2Tx-480713 A69-9001-12418 PDFH 2021-06-06 09:11:24 14Aux2 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
16 VR2Tx-480713 A69-9001-12418 PDFH 2021-06-06 09:32:55 14Aux2 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
17 VR100 A69-9001-12418 PDFH 2021-06-25 10:28:00 man_loc 14 2018-06-05 00:00:00 41.6 -90.4 2 06/25/2021 07/21/2021 1
18 VR100 A69-9001-12418 PDFH 2021-07-12 10:09:00 man_loc 14 2018-06-05 00:00:00 41.6 -90.4 2 06/25/2021 07/21/2021 1
19 VR2Tx-480695 A69-9001-12418 PDFH 2021-07-21 22:14:48 1401D 16 2018-06-05 00:00:00 41.6 -90.4 3 07/21/2021 NA 0
20 VR2Tx-480695 A69-9001-12418 PDFH 2021-07-21 22:19:14 1401D 16 2018-06-05 00:00:00 41.6 -90.4 3 07/21/2021 NA 0
21 VR2Tx-480702 A69-9001-12418 PDFH 2021-07-22 04:53:38 15.Mid.Wall 16 2018-06-05 00:00:00 41.6 -90.4 3 07/21/2021 NA 0
22 VR100 A69-9001-12418 PDFH 2021-07-22 11:51:00 man_loc 16 2018-06-05 00:00:00 41.6 -90.4 3 07/21/2021 NA 0
23 AR100 B80-9001-12420 PDFH 2021-07-22 11:51:00 man_loc 19 2018-06-05 00:00:00 42.6 -90.4 1 07/22/2021 07/22/2021 NA
24 AR100 B80-9001-12420 PDFH 2021-07-22 11:51:01 man_loc 18 2018-06-05 00:00:00 42.6 -90.4 2 07/22/2021 NA 1
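If one row per pool visit is more convenient than one row per detection, the same run-numbering idea can be collapsed with summarise. The following is only a sketch under stated assumptions: it uses five of the detections shown in the output above as toy data, the column names visit, LastSeen, and upstream are made up for illustration, and it follows the coding above in treating a lower pool number as upstream.

```r
library(dplyr)

# Toy data: five detections of one transmitter taken from the output above
det <- data.frame(
  Transmitter   = "A69-9001-12418",
  LocalDATETIME = as.POSIXct(c("2021-05-28 06:39:17", "2021-06-06 09:32:55",
                               "2021-06-25 10:28:00", "2021-07-12 10:09:00",
                               "2021-07-21 22:14:48"), tz = "UTC"),
  Pool          = c(16, 16, 14, 14, 16)
)

passages <- det %>%
  arrange(Transmitter, LocalDATETIME) %>%
  group_by(Transmitter) %>%
  # Number consecutive runs of detections in the same pool (0, 1, 2, ...)
  mutate(visit = cumsum(Pool != lag(Pool, default = first(Pool)))) %>%
  group_by(Transmitter, visit, Pool) %>%
  # One row per visit: first and last detection in that pool
  summarise(EnterPool = min(LocalDATETIME),
            LastSeen  = max(LocalDATETIME),
            .groups   = "drop") %>%
  group_by(Transmitter) %>%
  # Exit time = first detection in the next pool; NA while the animal is still there
  mutate(LeftPool = lead(EnterPool),
         # Assumption: lower pool number means upstream (1), higher means downstream (0)
         upstream = case_when(Pool < lag(Pool) ~ 1,
                              Pool > lag(Pool) ~ 0)) %>%
  ungroup()

passages
```

With millions of detections, a table of this shape is usually much smaller than the detection table, since each animal contributes one row per pool visit.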
Data
df <- structure(list(Receiver = c("VR2Tx-480679", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480692", "VR2Tx-480695",
"VR2Tx-480695", "VR2Tx-480713", "VR2Tx-480713", "VR2Tx-480702",
"VR100", "VR100", "VR100", "AR100", "AR100"), Transmitter = c("A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "B80-9001-12420", "B80-9001-12420"), Species = c("PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH"), LocalDATETIME = structure(c(1622186212, 1622381700,
1622384575, 1622184711, 1622381515, 1622381618, 1622381751, 1622381924,
1622382679, 1622383493, 1622384038, 1622384612, 1622183957, 1622381515,
1626905954, 1626905688, 1622971975, 1622970684, 1626929618, 1624616880,
1626084540, 1626954660, 1626954661, 1626954660), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
StationName = c("1405U", "1406U", "1406U", "1406U", "1406U",
"1406U", "1406U", "1406U", "1406U", "1406U", "1406U", "1406U",
"1406U", "1404L", "1401D", "1401D", "14Aux2", "14Aux2", "15.Mid.Wall",
"man_loc", "man_loc", "man_loc", "man_loc", "man_loc"), LengthValue = c(805, 805,
805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805,
805, 805, 805, 805, 805, 805, 805, 805, 805, 805), WeightValue = c(8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04,
8.04, 8.04, 8.04), Sex = c("NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA", "NA", "NA", "NA"), Translocated = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
Pool = c(16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
16, 16, 16, 16, 16, 16, 16, 14, 14, 16, 18, 19), DeployDate = structure(c(1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Latitude = c(41.57471, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.57463, 41.5731, 41.5731, 41.57469, 41.57469,
41.57469, 41.57469, 41.57469, 41.57469, 42.57469, 42.57469), Longitude = c(-90.39944,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39984, -90.40391, -90.40391, -90.40462, -90.40462, -90.40462,
-90.40462, -90.40462, -90.40462, -90.40470, -90.40470)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -24L))
Here is a sample of my data:
year <- c(1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,
1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,
1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,
1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,
1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,
1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980)
month <- c(1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1
,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,2
,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2
,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2)
Q <- c(NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA
,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,0.3 ,0.3 ,0.28
,0.26 ,0.26 ,0.25 ,0.25 ,0.24 ,0.24 ,0.24 ,0.24 ,0.23 ,0.23 ,NA ,NA
,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA)
I combined them into a dataframe called Flow
Flow <- data.frame(year,month,Q)
I can sum or count the number of missing or NA values in my Q column.
sum(is.na(Flow$Q))
Now I am trying to count the NA values in each month of each year, and eventually in each year.
This is where I'm stuck.
group_by(Flow$year, Flow$month) %>%
sum(is.na(Flow$Q)
After group_by, we can use summarise to compute one value per group. We also don't need Flow$ inside group_by or summarise, because the piped data frame already supplies the columns.
library(dplyr)
Flow %>%
  group_by(year, month) %>%
  summarise(Nas = sum(is.na(Q)))
# A tibble: 2 x 3
# Groups: year [1]
# year month Nas
# <dbl> <dbl> <int>
#1 1980 1 28
#2 1980 2 19
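The per-year count mentioned in the question follows the same pattern with only year in group_by. A minimal sketch, rebuilding the same Flow data compactly with rep() so that it runs on its own (the names by_month and by_year are made up here):

```r
library(dplyr)

# Same data as in the question, written compactly
Flow <- data.frame(
  year  = rep(1980, 60),
  month = rep(1:2, times = c(31, 29)),
  Q     = c(rep(NA, 28),
            0.3, 0.3, 0.28, 0.26, 0.26, 0.25, 0.25,
            0.24, 0.24, 0.24, 0.24, 0.23, 0.23,
            rep(NA, 19))
)

# NA count per month within each year; .groups = "drop" returns an ungrouped result
by_month <- Flow %>%
  group_by(year, month) %>%
  summarise(Nas = sum(is.na(Q)), .groups = "drop")

# NA count per year
by_year <- Flow %>%
  group_by(year) %>%
  summarise(Nas = sum(is.na(Q)), .groups = "drop")
```

Here by_month reproduces the 28 and 19 shown above, and by_year has a single row for 1980 with their total.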
I have a data frame and I want to filter rows by group. For each combination of cc, odd, tree1, and tree2, I want to keep the whole group if any of its rows has day > 4, and drop the whole group otherwise.
df <- tibble(
cc = c('BB', 'BB', 'BB', 'BB','BB', 'BB','BB', 'BB', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD',
'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ'),
odd = c(3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435),
tree1 = c('ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP'),
tree2 = c('ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK', 'ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK', 'ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK'),
day = c(1, 2, 3, 4, 3, 4, 5, 6, 2, 3, 4, 5, 1, 3, 5, 7, 1, 2, 6, 8, 2, 4, 6, 8)
)
I tried this, but it drops every row with a day value of 4 or less instead of dropping whole groups:
df1 <- df %>%
  arrange(cc, odd, tree1, tree2, day) %>%
  group_by(cc, odd, tree1, tree2) %>%
  filter(day > 4)
I would like to get a df as below.
df2 <- tibble(
cc = c('BB', 'BB','BB', 'BB', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD',
'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ'),
odd = c(3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435),
tree1 = c('SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP'),
tree2 = c('ATK','ATK','ATK','ATK', 'ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK', 'ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK'),
day = c(3, 4, 5, 6, 2, 3, 4, 5, 1, 3, 5, 7, 1, 2, 6, 8, 2, 4, 6, 8)
)
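You can wrap the condition in any(). Inside a grouped filter, any(day > 4) is evaluated once per group, so a group is kept whole when at least one of its rows has day > 4:
df %>%
  group_by(cc, odd, tree1, tree2) %>%
  filter(any(day > 4))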
# A tibble: 20 x 5
cc odd tree1 tree2 day
<chr> <dbl> <chr> <chr> <dbl>
1 BB 3435 SAP ATK 3
2 BB 3435 SAP ATK 4
3 BB 3435 SAP ATK 5
4 BB 3435 SAP ATK 6
5 DD 3434 ASP ATK 2
6 DD 3434 ASP ATK 3
7 DD 3434 ASP ATK 4
8 DD 3434 ASP ATK 5
9 DD 3435 SAP ATK 1
10 DD 3435 SAP ATK 3
11 DD 3435 SAP ATK 5
12 DD 3435 SAP ATK 7
13 ZZ 3434 ASP ATK 1
14 ZZ 3434 ASP ATK 2
15 ZZ 3434 ASP ATK 6
16 ZZ 3434 ASP ATK 8
17 ZZ 3435 SAP ATK 2
18 ZZ 3435 SAP ATK 4
19 ZZ 3435 SAP ATK 6
20 ZZ 3435 SAP ATK 8
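To make the difference between the two filters concrete, here is a small sketch on a made-up two-group data frame (toy, grp, row_wise, and group_wise are hypothetical names used only for this illustration):

```r
library(dplyr)

# Group "a" never exceeds day 4; group "b" does so in one of its rows
toy <- data.frame(
  grp = c("a", "a", "b", "b"),
  day = c(1, 4, 3, 6)
)

# Row-wise condition: keeps only the individual rows where day > 4
row_wise <- toy %>%
  group_by(grp) %>%
  filter(day > 4) %>%
  ungroup()

# Group-wise condition: keeps every row of any group containing a day > 4
group_wise <- toy %>%
  group_by(grp) %>%
  filter(any(day > 4)) %>%
  ungroup()
```

row_wise keeps only the single row with day 6, while group_wise keeps both rows of group "b", including the one with day 3.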