How to add columns for animal passage in R - r

I am trying to summarize our detection data in a way that I can easily see when an animal moves from one pool to another. Here is an example of one animal that I track
tibble [22 x 13] (S3: tbl_df/tbl/data.frame)
$ Receiver : chr [1:22] "VR2Tx-480679" "VR2Tx-480690" "VR2Tx-480690" "VR2Tx-480690" ...
$ Transmitter : chr [1:22] "A69-9001-12418" "A69-9001-12418" "A69-9001-12418" "A69-9001-12418" ...
$ Species : chr [1:22] "PDFH" "PDFH" "PDFH" "PDFH" ...
$ LocalDATETIME: POSIXct[1:22], format: "2021-05-28 07:16:52" ...
$ StationName : chr [1:22] "1405U" "1406U" "1406U" "1406U" ...
$ LengthValue : num [1:22] 805 805 805 805 805 805 805 805 805 805 ...
$ WeightValue : num [1:22] 8.04 8.04 8.04 8.04 8.04 8.04 8.04 8.04 8.04 8.04 ...
$ Sex : chr [1:22] "NA" "NA" "NA" "NA" ...
$ Translocated : num [1:22] 0 0 0 0 0 0 0 0 0 0 ...
$ Pool : num [1:22] 16 16 16 16 16 16 16 16 16 16 ...
$ DeployDate : POSIXct[1:22], format: "2018-06-05" ...
$ Latitude : num [1:22] 41.6 41.6 41.6 41.6 41.6 ...
$ Longitude : num [1:22] -90.4 -90.4 -90.4 -90.4 -90.4 ...
I want to add columns that would allow me to summarize this data in a way that I would have the start date of when an animal was in a pool and when the animal moved to a different pool it would have the end date of when it exits.
Ex: Enters Pool 19 on 1/1/22, next detected in Pool 20 on 1/2/22, so there would be columns that say fish entered and exited Pool 19 on 1/1/22 and 1/2/22. I have shared an Excel file example of what I am trying to do. I would like to code upstream movement with a 1 and downstream movement with 0.
I have millions of detections and hundreds of animals that I monitor so I am trying to find a way to look at passages for each animal. Thank you!
Here is my dataset using dput:
structure(list(Receiver = c("VR2Tx-480679", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480692", "VR2Tx-480695",
"VR2Tx-480695", "VR2Tx-480713", "VR2Tx-480713", "VR2Tx-480702",
"VR100", "VR100", "VR100"), Transmitter = c("A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418"), Species = c("PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH"), LocalDATETIME = structure(c(1622186212, 1622381700,
1622384575, 1622184711, 1622381515, 1622381618, 1622381751, 1622381924,
1622382679, 1622383493, 1622384038, 1622384612, 1622183957, 1622381515,
1626905954, 1626905688, 1622971975, 1622970684, 1626929618, 1624616880,
1626084540, 1626954660), tzone = "UTC", class = c("POSIXct",
"POSIXt")), StationName = c("1405U", "1406U", "1406U", "1406U",
"1406U", "1406U", "1406U", "1406U", "1406U", "1406U", "1406U",
"1406U", "1406U", "1404L", "1401D", "1401D", "14Aux2", "14Aux2",
"15.Mid.Wall", "man_loc", "man_loc", "man_loc"), LengthValue = c(805,
805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805,
805, 805, 805, 805, 805, 805, 805, 805), WeightValue = c(8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04),
Sex = c("NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA"), Translocated = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Pool = c(16,
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
16, 16, 16, 14, 14, 16), DeployDate = structure(c(1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800), tzone = "UTC", class = c("POSIXct", "POSIXt"
)), Latitude = c(41.57471, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.57463, 41.5731, 41.5731, 41.57469, 41.57469,
41.57469, 41.57469, 41.57469, 41.57469), Longitude = c(-90.39944,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39984, -90.40391, -90.40391, -90.40462, -90.40462, -90.40462,
-90.40462, -90.40462, -90.40462)), row.names = c(NA, -22L
), class = c("tbl_df", "tbl", "data.frame"))
> dput(T12418)
structure(list(Receiver = c("VR2Tx-480679", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480692", "VR2Tx-480695",
"VR2Tx-480695", "VR2Tx-480713", "VR2Tx-480713", "VR2Tx-480702",
"VR100", "VR100", "VR100"), Transmitter = c("A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418"), Species = c("PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH"), LocalDATETIME = structure(c(1622186212, 1622381700,
1622384575, 1622184711, 1622381515, 1622381618, 1622381751, 1622381924,
1622382679, 1622383493, 1622384038, 1622384612, 1622183957, 1622381515,
1626905954, 1626905688, 1622971975, 1622970684, 1626929618, 1624616880,
1626084540, 1626954660), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
StationName = c("1405U", "1406U", "1406U", "1406U", "1406U",
"1406U", "1406U", "1406U", "1406U", "1406U", "1406U", "1406U",
"1406U", "1404L", "1401D", "1401D", "14Aux2", "14Aux2", "15.Mid.Wall",
"man_loc", "man_loc", "man_loc"), LengthValue = c(805, 805,
805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805,
805, 805, 805, 805, 805, 805, 805, 805), WeightValue = c(8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04,
8.04), Sex = c("NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA", "NA"), Translocated = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
Pool = c(16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
16, 16, 16, 16, 16, 16, 16, 14, 14, 16), DeployDate = structure(c(1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Latitude = c(41.57471, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.57463, 41.5731, 41.5731, 41.57469, 41.57469,
41.57469, 41.57469, 41.57469, 41.57469), Longitude = c(-90.39944,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39984, -90.40391, -90.40391, -90.40462, -90.40462, -90.40462,
-90.40462, -90.40462, -90.40462)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -22L))

Here is one possibility for getting the beginning date for entering a pool and ending date for leaving a pool. First, I group by Species (could also add additional grouping variables to distinguish between specimens) and arrange by the time. Then, I look for any changes to the Pool using cumsum. Then, I pull the first date recorded for the pool as the the date that they entered the pool. Then, I do some grouping and ungrouping to grab the date from the next group (i.e., the date the species left the pool) and then copy that date for the whole group. For determining upstream/downstream, we can use case_when inside of mutate. I'm also assuming that you want this to match the date, so I have filled in the values for each group with the movement for pool change.
library(tidyverse)
df_dates <- df %>%
group_by(Species, Transmitter) %>%
arrange(Species, Transmitter, LocalDATETIME) %>%
mutate(changeGroup = cumsum(Pool != lag(Pool, default = -1))) %>%
group_by(Species, Transmitter, changeGroup) %>%
mutate(EnterPool = first(format(as.Date(LocalDATETIME), "%m/%d/%Y"))) %>%
ungroup(changeGroup) %>%
mutate(LeftPool = lead(EnterPool)) %>%
group_by(Species, Transmitter, changeGroup) %>%
mutate(LeftPool = last(LeftPool)) %>%
ungroup(changeGroup) %>%
mutate(stream = case_when((Pool - lag(Pool)) > 0 ~ 0,
(Pool - lag(Pool)) < 0 ~ 1)) %>%
fill(stream, .direction = "down")
Output
print(as_tibble(df_dates[1:24, c(1:5, 10:17)]), n=24)
# A tibble: 24 × 13
Receiver Transmitter Species LocalDATETIME StationName Pool DeployDate Latitude Longitude changeGroup EnterPool LeftPool stream
<chr> <chr> <chr> <dttm> <chr> <dbl> <dttm> <dbl> <dbl> <int> <chr> <chr> <dbl>
1 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-28 06:39:17 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
2 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-28 06:51:51 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
3 VR2Tx-480679 A69-9001-12418 PDFH 2021-05-28 07:16:52 1405U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
4 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:31:55 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
5 VR2Tx-480692 A69-9001-12418 PDFH 2021-05-30 13:31:55 1404L 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
6 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:33:38 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
7 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:35:00 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
8 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:35:51 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
9 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:38:44 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
10 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:51:19 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
11 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 14:04:53 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
12 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 14:13:58 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
13 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 14:22:55 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
14 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 14:23:32 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
15 VR2Tx-480713 A69-9001-12418 PDFH 2021-06-06 09:11:24 14Aux2 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
16 VR2Tx-480713 A69-9001-12418 PDFH 2021-06-06 09:32:55 14Aux2 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
17 VR100 A69-9001-12418 PDFH 2021-06-25 10:28:00 man_loc 14 2018-06-05 00:00:00 41.6 -90.4 2 06/25/2021 07/21/2021 1
18 VR100 A69-9001-12418 PDFH 2021-07-12 10:09:00 man_loc 14 2018-06-05 00:00:00 41.6 -90.4 2 06/25/2021 07/21/2021 1
19 VR2Tx-480695 A69-9001-12418 PDFH 2021-07-21 22:14:48 1401D 16 2018-06-05 00:00:00 41.6 -90.4 3 07/21/2021 NA 0
20 VR2Tx-480695 A69-9001-12418 PDFH 2021-07-21 22:19:14 1401D 16 2018-06-05 00:00:00 41.6 -90.4 3 07/21/2021 NA 0
21 VR2Tx-480702 A69-9001-12418 PDFH 2021-07-22 04:53:38 15.Mid.Wall 16 2018-06-05 00:00:00 41.6 -90.4 3 07/21/2021 NA 0
22 VR100 A69-9001-12418 PDFH 2021-07-22 11:51:00 man_loc 16 2018-06-05 00:00:00 41.6 -90.4 3 07/21/2021 NA 0
23 AR100 B80-9001-12420 PDFH 2021-07-22 11:51:00 man_loc 19 2018-06-05 00:00:00 42.6 -90.4 1 07/22/2021 07/22/2021 NA
24 AR100 B80-9001-12420 PDFH 2021-07-22 11:51:01 man_loc 18 2018-06-05 00:00:00 42.6 -90.4 2 07/22/2021 NA 1
Data
df <- structure(list(Receiver = c("VR2Tx-480679", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480692", "VR2Tx-480695",
"VR2Tx-480695", "VR2Tx-480713", "VR2Tx-480713", "VR2Tx-480702",
"VR100", "VR100", "VR100", "AR100", "AR100"), Transmitter = c("A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "B80-9001-12420", "B80-9001-12420"), Species = c("PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH"), LocalDATETIME = structure(c(1622186212, 1622381700,
1622384575, 1622184711, 1622381515, 1622381618, 1622381751, 1622381924,
1622382679, 1622383493, 1622384038, 1622384612, 1622183957, 1622381515,
1626905954, 1626905688, 1622971975, 1622970684, 1626929618, 1624616880,
1626084540, 1626954660, 1626954661, 1626954660), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
StationName = c("1405U", "1406U", "1406U", "1406U", "1406U",
"1406U", "1406U", "1406U", "1406U", "1406U", "1406U", "1406U",
"1406U", "1404L", "1401D", "1401D", "14Aux2", "14Aux2", "15.Mid.Wall",
"man_loc", "man_loc", "man_loc", "man_loc", "man_loc"), LengthValue = c(805, 805,
805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805,
805, 805, 805, 805, 805, 805, 805, 805, 805, 805), WeightValue = c(8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04,
8.04, 8.04, 8.04), Sex = c("NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA", "NA", "NA", "NA"), Translocated = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
Pool = c(16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
16, 16, 16, 16, 16, 16, 16, 14, 14, 16, 18, 19), DeployDate = structure(c(1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Latitude = c(41.57471, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.57463, 41.5731, 41.5731, 41.57469, 41.57469,
41.57469, 41.57469, 41.57469, 41.57469, 42.57469, 42.57469), Longitude = c(-90.39944,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39984, -90.40391, -90.40391, -90.40462, -90.40462, -90.40462,
-90.40462, -90.40462, -90.40462, -90.40470, -90.40470)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -24L))

Related

Problems when filtering a row in a table

I'm having trouble filtering a table row, but I still don't understand why. Notice that I'm filtering by Id, date2 and Category, but when I filter, it looks like it doesn't have any rows, but notice that from input information the corresponding row would have to be line 15 of SPV, however, it gives null return.
library(dplyr)
library(tidyverse)
library(lubridate)
df1<-structure(list(Id = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 1011, 1011, 1011, 1011, 1011, 1011, 1011, 1011,
1011), date1 = structure(c(1641945600, 1641945600,
1641945600, 1641945600, 1641945600, 1641945600, 1641945600, 1641945600,
1641945600, 1641945600, 1641945600, 1641945600, 1641945600, 1641945600,
1641945600, 1641945600, 1641945600, 1641945600, 1641945600, 1641945600,
1641945600, 1641945600, 1641945600, 1641945600, 1641945600), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), date2 = structure(c(1641340800,
1641340800, 1641427200, 1641427200, 1641513600, 1641513600, 1641600000,
1641600000, 1641686400, 1641686400, 1641772800, 1641772800, 1641859200,
1641859200, 1641945600, 1641945600, 1641254400, 1641340800, 1641427200,
1641513600, 1641600000, 1641686400, 1641772800, 1641859200, 1641945600),
class = c("POSIXct", "POSIXt"), tzone = "UTC"), Week = c("Wednesday",
"Wednesday", "Thursday", "Thursday", "Friday", "Friday", "Saturday", "Saturday",
"Sunday", "Sunday", "Monday", "Monday", "Tuesday", "Tuesday",
"Wednesday", "Wednesday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday",
"Sunday", "Monday", "Tuesday", "Wednesday"), Category = c("ABC",
"EFG", "ABC", "EFG", "ABC", "EFG", "ABC", "EFG", "ABC", "EFG",
"ABC", "EFG", "ABC", "EFG", "ABC", "EFG", "ABC", "ABC", "ABC",
"ABC", "ABC", "ABC", "ABC", "ABC", "ABC"), DR1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 200, 350, 330, 400, 400, 332, 327.9, 383.6, 0), DRM0 = c(300,
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300,
0, 0, 200, 350, 330, 400, 400, 332, 327.9, 327.6, 323.75), DRM01 = c(300,
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300,
0, 0, 200, 350, 330, 400, 400, 332, 327.9, 340, 329.17), DRM02 = c(300,
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300,
300, 0, 200, 350, 330, 400, 400, 332, 340, 340, 329.17), DRM03 = c(300,
300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300,
300, 0, 200, 350, 330, 400, 400, 338.8, 340, 340, 329.17), DRM04 = c(300,
250, 250, 250, 250, 250, 250, 250, 250, 250, 300, 300, 300, 300,
300, 0, 200, 350, 330, 400, 400, 338.8, 340, 340, 329.17)), row.names = c(NA, -25L), class = c("tbl_df",
"tbl", "data.frame"))
idd<-"4"
dmda<-"2022-01-12"
CategoryChosse<-"ABC"
x<-df1 %>% select(starts_with("DRM0"))
x<-cbind(df1, setNames(df1$DR1 - x, paste0(names(x), "_PV")))
PV<-select(x,Id, date2,Week, Category, DR1, ends_with("PV"))
med<-PV %>%
group_by(Id,Category,Week) %>%
dplyr::summarize(dplyr::across(ends_with("PV"), median))
SPV<-df1%>%
inner_join(med, by = c('Id','Category', 'Week')) %>%
mutate(across(matches("^DRM0\\d+$"), ~.x +
get(paste0(cur_column(), '_PV')),
.names = '{col}_{col}_PV')) %>%
select(Id:Category, DRM01_DRM01_PV:last_col())
SPV<-data.frame(SPV)
> SPV
Id date1 date2 Week Category DRM01_DRM01_PV DRM02_DRM02_PV DRM03_DRM03_PV
1 4 2022-01-12 2022-01-05 Wednesday ABC 150.000 0.000 0.000
2 4 2022-01-12 2022-01-05 Wednesday EFG 150.000 150.000 150.000
3 4 2022-01-12 2022-01-06 Thursday ABC 0.000 0.000 0.000
4 4 2022-01-12 2022-01-06 Thursday EFG 0.000 0.000 0.000
5 4 2022-01-12 2022-01-07 Friday ABC 0.000 0.000 0.000
6 4 2022-01-12 2022-01-07 Friday EFG 0.000 0.000 0.000
7 4 2022-01-12 2022-01-08 Saturday ABC 0.000 0.000 0.000
8 4 2022-01-12 2022-01-08 Saturday EFG 0.000 0.000 0.000
9 4 2022-01-12 2022-01-09 Sunday ABC 0.000 0.000 0.000
10 4 2022-01-12 2022-01-09 Sunday EFG 0.000 0.000 0.000
11 4 2022-01-12 2022-01-10 Monday ABC 0.000 0.000 0.000
12 4 2022-01-12 2022-01-10 Monday EFG 0.000 0.000 0.000
13 4 2022-01-12 2022-01-11 Tuesday ABC 0.000 0.000 0.000
14 4 2022-01-12 2022-01-11 Tuesday EFG 0.000 0.000 0.000
15 4 2022-01-12 2022-01-12 Wednesday ABC -150.000 0.000 0.000
16 4 2022-01-12 2022-01-12 Wednesday EFG -150.000 -150.000 -150.000
17 1011 2022-01-12 2022-01-04 Tuesday ABC 221.800 221.800 221.800
18 1011 2022-01-12 2022-01-05 Wednesday ABC 185.415 185.415 185.415
19 1011 2022-01-12 2022-01-06 Thursday ABC 330.000 330.000 330.000
20 1011 2022-01-12 2022-01-07 Friday ABC 400.000 400.000 400.000
21 1011 2022-01-12 2022-01-08 Saturday ABC 400.000 400.000 400.000
22 1011 2022-01-12 2022-01-09 Sunday ABC 332.000 332.000 332.000
23 1011 2022-01-12 2022-01-10 Monday ABC 327.900 327.900 327.900
24 1011 2022-01-12 2022-01-11 Tuesday ABC 361.800 361.800 361.800
25 1011 2022-01-12 2022-01-12 Wednesday ABC 164.585 164.585 164.585
DRM04_DRM04_PV
1 0.000
2 125.000
3 0.000
4 0.000
5 0.000
6 0.000
7 0.000
8 0.000
9 0.000
10 0.000
11 0.000
12 0.000
13 0.000
14 0.000
15 0.000
16 -125.000
17 221.800
18 185.415
19 330.000
20 400.000
21 400.000
22 332.000
23 327.900
24 361.800
25 164.585
SPV <- SPV %>%
filter(Id==idd,date2 == dmda, Category == CategoryChosse)
> SPV
[1] Id date1 date2 Week Category DRM01_DRM01_PV DRM02_DRM02_PV
[8] DRM03_DRM03_PV DRM04_DRM04_PV
Here's a possible approach, writing out what's already been outlined in the comments. I used ymd from lubridate to convert the date string to a date object.
library(dplyr)
library(lubridate)
idd<-4
dmda<-ymd("2022-01-12")
CategoryChosse<-"ABC"
df1 %>%
filter(Id == idd, date2 == dmda, Category == CategoryChosse)
# A tibble: 1 x 11
Id date1 date2 Week Category DR1 DRM0 DRM01
<dbl> <dttm> <dttm> <chr> <chr> <dbl> <dbl> <dbl>
1 4 2022-01-12 00:00:00 2022-01-12 00:00:00 Wedn… ABC 0 0 0
# … with 3 more variables: DRM02 <dbl>, DRM03 <dbl>, DRM04 <dbl>

R converts date from yyyy-mm-dd to long date format while adding to Time series object

DataSTP2 <- read.csv("/Users/mass/Desktop/DatasetforR.csv", stringsAsFactors = F, na.strings = c(NA,"NA", " NA", "-", "NT"))
class(DataSTP2$Month)
DataSTP2_omit <- na.omit(DataSTP2)
DataSTP2$Month <- as.Date(DataSTP2$Month, origin="1899-12-30" )
class(DataSTP2$Month)
STP.ts <- ts(data = DataSTP2, frequency = 12)
ggplot(data = DataSTP2, aes(x = Month, y = pH_in))+ ylab('pH') + geom_line(aes(y = pH_in), colour="blue") + geom_point() +geom_line(aes(y = pH_out), color="red") +geom_point() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
The command on 6th line converts the date from 2018-12-01 to 17866, I don't want that to happen please can some one help me figure it out?
Edit:
Month
pH_in
pH_out
2018-12-01
7.6
7.5
2019-01-01
7.8
7.6
A reference to the data I have been working with.
head(dput(DataSTP2),20)
structure(list(Month = structure(c(17866, 17897, 17928, 17956,
17987, 18017, 18048, 18078, 18109, 18140, 18170, 18201, 18231,
18262, 18293, 18322, 18353, 18383, 18414, 18444, 18475), class = "Date"),
pH_in = c(7.52, 7.56, 7.62, 7.48, 7.61, 7.58, 7.67, 7.77,
7.65, 7.87, 7.84, 8.01, 7.83, 7.77, 7.75, 7.62, 7.58, 7.62,
7.34, 7.66, 7.56), pH_out = c(7.44, 7.31, 7.38, 7.31, 7.48,
7.62, 7.59, 7.72, 7.58, 7.74, 7.76, 7.66, 7.74, 7.7, 7.7,
7.6, 7.7, 7.7, 7.5, 7.3, 7.4), TSS_in = c(567.71, 647.55,
555.57, 537.35, 530.8, 449.87, 430.53, 433.42, 410.17, 530,
555, 662.9, 708.26, 674.77, 603.24, 602.17, 517, 452.5, 467.14,
484.67, 392), TSS_out = c(19.65, 23.06, 16.64, 11.06, 13.7,
12.55, 9.83, 15.74, 14.23, 14, 12.16, 11.41, 13.19, 13, 15.1,
14.5, 13, 11.5, 10.9, 12.7, 13), COD_in = c(681.87, 805.42,
799.43, 789.35, 865.4, 822.87, 718.17, 714.83, 632.77, 735,
762, 798.13, 830.53, 812.87, 849.64, 869.29, 788, 688, 712.66,
753.47, 600.6), COD_out = c(59.29, 67.55, 61.86, 56.57, 56.63,
58.5, 50.62, 55.49, 56.32, 61, 53.47, 62.01, 62.97, 80.1,
58.3, 58.2, 54, 50.4, 48.5, 56.3, 57.2), BOD3_in = c(187.84,
326.23, 316.46, 321.45, 352.03, 327, 334.97, 343.23, 320.33,
267, 276, 283.46, 268.71, 272.26, 277.92, 277.33, NA, NA,
223.75, 246.54, 186), BOD3_out = c(15.42, 14.45, 13.25, 12.19,
11.6, 10.79, 10.1, 11.23, 11.37, 9.4, 9.58, 10.62, 10.05,
9.8, 10, 10.1, NA, NA, 8.3, 11.7, 11), OG_in = c(NA, NA,
NA, NA, NA, 0.27, 0.21, 0.23, 0.26, 0.22, 0.21, 0.24, 0.26,
0.26, 0.26, 0.26, NA, NA, 0.23, 0.23, 0.2), OG_out = c("›",
NA, NA, NA, NA, "0.11", "0.08", "0.12", "0.10", "0.148",
"0.11", "0.09", "0.09", "0.1", "0.1", "0.2", NA, NA, "0.1",
"0.1", "0.1")), row.names = c(NA, -21L), class = "data.frame")
``` Month pH_in pH_out TSS_in TSS_out COD_in COD_out BOD3_in BOD3_out OG_in OG_out
1 2018-12-01 7.52 7.44 567.71 19.65 681.87 59.29 187.84 15.42 NA ›
2 2019-01-01 7.56 7.31 647.55 23.06 805.42 67.55 326.23 14.45 NA <NA>
3 2019-02-01 7.62 7.38 555.57 16.64 799.43 61.86 316.46 13.25 NA <NA>
4 2019-03-01 7.48 7.31 537.35 11.06 789.35 56.57 321.45 12.19 NA <NA>
5 2019-04-01 7.61 7.48 530.80 13.70 865.40 56.63 352.03 11.60 NA <NA>
6 2019-05-01 7.58 7.62 449.87 12.55 822.87 58.50 327.00 10.79 0.27 0.11
7 2019-06-01 7.67 7.59 430.53 9.83 718.17 50.62 334.97 10.10 0.21 0.08
8 2019-07-01 7.77 7.72 433.42 15.74 714.83 55.49 343.23 11.23 0.23 0.12
9 2019-08-01 7.65 7.58 410.17 14.23 632.77 56.32 320.33 11.37 0.26 0.10
10 2019-09-01 7.87 7.74 530.00 14.00 735.00 61.00 267.00 9.40 0.22 0.148
11 2019-10-01 7.84 7.76 555.00 12.16 762.00 53.47 276.00 9.58 0.21 0.11
12 2019-11-01 8.01 7.66 662.90 11.41 798.13 62.01 283.46 10.62 0.24 0.09
13 2019-12-01 7.83 7.74 708.26 13.19 830.53 62.97 268.71 10.05 0.26 0.09
14 2020-01-01 7.77 7.70 674.77 13.00 812.87 80.10 272.26 9.80 0.26 0.1
15 2020-02-01 7.75 7.70 603.24 15.10 849.64 58.30 277.92 10.00 0.26 0.1
16 2020-03-01 7.62 7.60 602.17 14.50 869.29 58.20 277.33 10.10 0.26 0.2
17 2020-04-01 7.58 7.70 517.00 13.00 788.00 54.00 NA NA NA <NA>
18 2020-05-01 7.62 7.70 452.50 11.50 688.00 50.40 NA NA NA <NA>
19 2020-06-01 7.34 7.50 467.14 10.90 712.66 48.50 223.75 8.30 0.23 0.1
20 2020-07-01 7.66 7.30 484.67 12.70 753.47 56.30 246.54 11.70 0.23 0.1

Join two data frames in R based on closest timestamp within groups

I have the following dataframes
structure(list(id = c(1, 2, 3, 4, 5), time = structure(c(1484092800,
1485907200, 1490227200, 1490918400, 1491955200), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
id time
<dbl> <dttm>
1 1 2017-01-11 00:00:00
2 2 2017-02-01 00:00:00
3 3 2017-03-23 00:00:00
4 4 2017-03-31 00:00:00
5 5 2017-04-12 00:00:00
structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5), time = structure(c(1466553600,
1465948800, 1453420800, 1485302400, 1433030400, 1421712000, 1453852800,
1485302400, 1485993600, 1517529600, 1400544000, 1434067200, 1466985600,
1497484800, 1390003200, 1516060800, 1464825600, 1497916800, 1527638400,
1454025600, 1390608000, 1421712000, 1466467200, 1453852800, 1485820800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), score = c(3,
2, 5, 4, 5, 24.2, 24.8, 25.4, 26, 26.6, 36.2, 36.8, 37.4, 38,
38.6, 44, 44.6, 45.2, 45.8, 46.4, 59, 59.6, 60.2, 60.8, 61.4)), row.names = c(NA,
-25L), class = c("tbl_df", "tbl", "data.frame"))
id time score
<dbl> <dttm> <dbl>
1 1 2016-06-22 00:00:00 3
2 1 2016-06-15 00:00:00 2
3 1 2016-01-22 00:00:00 5
4 1 2017-01-25 00:00:00 4
5 1 2015-05-31 00:00:00 5
6 2 2015-01-20 00:00:00 24.2
7 2 2016-01-27 00:00:00 24.8
8 2 2017-01-25 00:00:00 25.4
9 2 2017-02-02 00:00:00 26
10 2 2018-02-02 00:00:00 26.6
# … with 15 more rows
I would like to have the score of sdf where the time is closest to that of in df. But I would also have to look at the id's! I already tried this from Join two data frames in R based on closest timestamp:
d <- function(x,y) abs(x-y) # define the distance function
idx <- sapply( df$time, function(x) which.min( d(x,sdf$time) ))
cbind(df,sdf[idx,-1,drop=FALSE])
id time time score
1 1 2017-01-11 2017-01-25 4
2 2 2017-02-01 2017-02-02 26
3 3 2017-03-23 2017-02-02 26
4 4 2017-03-31 2017-02-02 26
5 5 2017-04-12 2017-06-15 38
But you don't look at the id, I tried to incorporate the id, however did not work. Any ideas? Thank you in advance :)
We can join the data frames by id and then calculate the time difference and keep the observation with the minimal time difference for each individual:
library(tidyverse)
df2 %>%
left_join(df1, by = "id") %>%
mutate(time_dif = abs(time.x - time.y)) %>%
group_by(id) %>%
filter(time_dif == min(time_dif))
# A tibble: 5 x 5
# Groups: id [5]
id time.x score time.y time_dif
<dbl> <dttm> <dbl> <dttm> <drtn>
1 1 2017-01-25 00:00:00 4 2017-01-11 00:00:00 14 days
2 2 2017-02-02 00:00:00 26 2017-02-01 00:00:00 1 days
3 3 2017-06-15 00:00:00 38 2017-03-23 00:00:00 84 days
4 4 2017-06-20 00:00:00 45.2 2017-03-31 00:00:00 81 days
5 5 2017-01-31 00:00:00 61.4 2017-04-12 00:00:00 71 days
Data
df1 <- structure(list(id = c(1, 2, 3, 4, 5), time = structure(c(1484092800,
1485907200, 1490227200, 1490918400, 1491955200), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
df2 <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5), time = structure(c(1466553600,
1465948800, 1453420800, 1485302400, 1433030400, 1421712000, 1453852800,
1485302400, 1485993600, 1517529600, 1400544000, 1434067200, 1466985600,
1497484800, 1390003200, 1516060800, 1464825600, 1497916800, 1527638400,
1454025600, 1390608000, 1421712000, 1466467200, 1453852800, 1485820800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), score = c(3,
2, 5, 4, 5, 24.2, 24.8, 25.4, 26, 26.6, 36.2, 36.8, 37.4, 38,
38.6, 44, 44.6, 45.2, 45.8, 46.4, 59, 59.6, 60.2, 60.8, 61.4)), row.names = c(NA,
-25L), class = c("tbl_df", "tbl", "data.frame"))

Splitting two messy vectors in a data frame into one common column

Sample of dataset:
library(dplyr)
sample <- structure(list(Rank = c(15, 17, 20, 2, 16, 8, 21, 5, 13, 31, 22, 18, 2, 19, 11, 11, 8, 7, 12, 9, 5, 23, 17, 16, 15, 14, 4, 20, 13, 2), Athlete = c("François Gourmet(BEL)", "Agustín Félix(ESP)", "Keisuke Ushiro", "Michael Schrader", "Pieter Braun", "Laurent Hernu(FRA)", "Dmitriy Karpov", "Laurent Hernu(FRA)", "Thomas van der Plaetsen", "Attila Szabó", "Nadir El Fassi", "Eduard Mikhan", "Leonel Suárez", "Janek Õiglane", "Hans van Alphen(BEL)", "Roman Šebrle", "André Niklaus(GER)", "Pascal Behrenbruch", "Pieter Braun", "Oleksandr Yurkov(UKR)", "Eelco Sintnicolaas", "Brent Newdick", "Kim Kun-woo", "Akihiko Nakamura", "Bastien Auzeil", "Frédéric Xhonneux", "Janek Õiglane", "Keisuke Ushiro", "Roman Šebrle", "Rico Freimuth"), Total = c(7974, 7749, 7498, 8670, 7890, 8280, 7550, 8218, 8069, 7610, 7922, 7968, 8640, 7581, 8034, 8266, 8020, 8211, 8114, 8264, 8298, 7915, 7860, 7745, 7922, 7616, 8371, 7532, 8069, 8564), `100m` = c(10.67, 11.17, 11.53, 10.73, 11.22, 10.97, 11.24, 11.2, 11.2, 11.15, 11.12, 10.97, 11.13, 11.51, 11.11, 11.16, 11.19, 11.08, 11.11, 10.93, 10.76, 11.11, 11.11, 10.86, 11.35, 11.28, 11.08, 11.51, 11.25, 10.53), LJ = c(7.15, 7.12, 6.64, 7.85, 7.17, 7.31, 6.86, 7.22, 7.79, 7.09, 7.26, 7.42, 7.24, 6.78, 7.35, 7.8, 7.21, 6.8, 7.29, 7.37, 7.29, 7.42, 7.24, 7.26, 6.87, 7.21, 7.33, 6.73, 7.3, 7.48), SP = c(13.74, 13.29, 13.43, 14.56, 14.48, 14.43, 15.69, 13.99, 12.76, 13.92, 13.62, 14.15, 15.2, 14.43, 14.67, 14.98, 13.87, 16.01, 13.9, 15.15, 14.13, 14.35, 12.96, 11.67, 15.23, 12.92, 15.13, 14.93, 15.2, 14.85), HJ = c(1.85, 2.03, 1.96, 1.99, 1.93, 2.03, 1.93, 2.03, 2.17, 1.84, 1.99, 1.96, 2.11, 1.92, 1.88, 2.11, 1.97, 1.93, 2.04, 1.97, 1.93, 1.99, 1.96, 1.95, 1.96, 2.03, 2.05, 1.89, 2.05, 1.99), `400m` = c(47.98, 52.08, 51.43, 47.66, 48.54, 49.31, 52.01, 48.95, 49.46, 49.79, 51.35, 48.8, 48, 50.95, 48.52, 50.42, 49.95, 49.9, 48.24, 49.45, 48.35, 50.1, 49.24, 47.81, 50.36, 49.04, 49.58, 50.85, 51.18, 48.41), `110mh` = c(15.02, 14.75, 15.35, 14.29, 14.67, 14.01, 14.64, 14.15, 14.79, 14.65, 14.9, 14.82, 14.45, 15.33, 14.77, 14.44, 14.5, 14.33, 14.37, 14.41, 14.42, 14.82, 14.95, 14.72, 14.59, 15.75, 14.56, 15.43, 14.75, 13.68), DT = c(39.87, 43.67, 47.64, 46.44, 42.59, 43.93, 47.1, 46.13, 37.2, 43.75, 42.25, 48, 44.71, 40.94, 44.3, 46.3, 42.68, 48.56, 42.09, 48.1, 42.23, 43.6, 39.53, 33.48, 46.86, 38.62, 42.11, 46.85, 46.93, 51.17), PV = c(5, 5, 4.6, 5, 4.7, 5.1, 4.8, 4.9, 5.1, 4.4, 4.8, 4.6, 5, 4.6, 4.3, 4.6, 5.1, 4.9, 4.9, 5, 5.2, 4.8, 4.9, 4.7, 4.8, 4.7, 5.1, 4.7, 4.8, 4.8), JT = c(57.73, 56.69, 63.28, 65.67, 59.26, 59.9, 46.91, 59.63, 58.91, 59.56, 57.65, 50.74, 75.19, 68.51, 65.71, 65.61, 57.55, 66.5, 56.95, 58.63, 61.07, 51.52, 53.33, 53.57, 60.8, 50.18, 71.73, 56.52, 67.28, 62.34), `1500m` = c(265.51, 288.27, 291.9, 265.38, 278.4, 277.41, 298.41, 268.4, 285.86, 285.64, 256.51, 273.71, 267.25, 283.06, 262.5, 290.33, 268.8, 276.64, 272.46, 278.43, 265.4, 270.57, 255.63, 256.36, 279.8, 262.71, 279.24, 283.51, 296.5, 281.57), Year = structure(c(4L, 4L, 9L, 7L, 9L, 1L, 6L, 2L, 6L, 5L, 5L, 7L, 5L, 8L, 4L, 5L, 2L, 6L, 8L, 1L, 6L, 5L, 6L, 8L, 9L, 3L, 9L, 8L, 6L, 9L), .Label = c("2001", "2003", "2005", "2007", "2009", "2011", "2013", "2015", "2017"), class = "factor"), Nationality = c(NA, NA, "Japan(JPN)", "Germany(GER)", "Netherlands(NED)", NA, "Kazakhstan(KAZ)", NA, "Belgium(BEL)", "Hungary", "France", "Belarus(BLR)", "Cuba", "Estonia(EST)", NA, "Czech Republic", NA, "Germany(GER)", "Netherlands(NED)", NA, "Netherlands(NED)", "New Zealand", "South Korea(KOR)", "Japan(JPN)", "France(FRA)", NA, "Estonia(EST)", "Japan(JPN)", "Czech Republic(CZE)", "Germany(GER)"), Notes = c(NA, NA, NA, "PB", NA, NA, NA, NA, NA, NA, "SB", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "PB", "NR", NA, "SB", NA, "PB", NA, NA, NA)), .Names = c("Rank", "Athlete", "Total", "100m", "LJ", "SP", "HJ", "400m", "110mh", "DT", "PV", "JT", "1500m", "Year", "Nationality", "Notes"), row.names = c(NA, -30L), class = c("tbl_df", "tbl", "data.frame"))
# A tibble: 30 x 16
Rank Athlete Total `100m` LJ SP HJ `400m` `110mh` DT PV JT `1500m` Year Nationality Notes
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <chr> <chr>
1 15 François Gourmet(BEL) 7974 10.67 7.15 13.74 1.85 47.98 15.02 39.87 5.0 57.73 265.51 2007 <NA> <NA>
2 17 Agustín Félix(ESP) 7749 11.17 7.12 13.29 2.03 52.08 14.75 43.67 5.0 56.69 288.27 2007 <NA> <NA>
3 20 Keisuke Ushiro 7498 11.53 6.64 13.43 1.96 51.43 15.35 47.64 4.6 63.28 291.90 2017 Japan(JPN) <NA>
4 2 Michael Schrader 8670 10.73 7.85 14.56 1.99 47.66 14.29 46.44 5.0 65.67 265.38 2013 Germany(GER) PB
5 16 Pieter Braun 7890 11.22 7.17 14.48 1.93 48.54 14.67 42.59 4.7 59.26 278.40 2017 Netherlands(NED) <NA>
6 8 Laurent Hernu(FRA) 8280 10.97 7.31 14.43 2.03 49.31 14.01 43.93 5.1 59.90 277.41 2001 <NA> <NA>
7 21 Dmitriy Karpov 7550 11.24 6.86 15.69 1.93 52.01 14.64 47.10 4.8 46.91 298.41 2011 Kazakhstan(KAZ) <NA>
8 5 Laurent Hernu(FRA) 8218 11.20 7.22 13.99 2.03 48.95 14.15 46.13 4.9 59.63 268.40 2003 <NA> <NA>
9 13 Thomas van der Plaetsen 8069 11.20 7.79 12.76 2.17 49.46 14.79 37.20 5.1 58.91 285.86 2011 Belgium(BEL) <NA>
10 31 Attila Szabó 7610 11.15 7.09 13.92 1.84 49.79 14.65 43.75 4.4 59.56 285.64 2009 Hungary <NA>
# ... with 20 more rows
I have two character vectors, "Athlete and "Nationality", in my dataset where some entries have country codes in brackets attached at the end. I want to be able to split only the country codes from these two vectors into a new variable, say "countrycode", while getting rid of the brackets at the same time. I'm not sure what the best way or syntax to go about splitting would be though - dplyr::separate possibly? Though I'm uncertain how to incorporate the combinations of characters in the country codes within the brackets during the split, and the fact that some entries don't need splitting.
I would then do something like this after to remove the brackets from the new variable.
sample$countrycode<- gsub(pattern="\\(",replacement="",x=sample$countrycode)
sample$countrycode<- gsub(pattern="\\)",replacement="",x=sample$countrycode)
Thanks
Hope this works for you:
library(dplyr)
res <- sample %>% mutate(
countrycode = case_when(
is.na(Nationality) & grepl('\\(', Athlete) ~ gsub('.*?\\((.*)\\)', '\\1', Athlete),
grepl('\\(', Nationality) ~ gsub('.*?\\((.*)\\)', '\\1', Nationality),
TRUE ~ Nationality
)
)
sample output:
res %>% select(Athlete, Nationality, countrycode)
# # A tibble: 30 x 3
# Athlete Nationality countrycode
# <chr> <chr> <chr>
# 1 François Gourmet(BEL) NA BEL
# 2 Agustín Félix(ESP) NA ESP
# 3 Keisuke Ushiro Japan(JPN) JPN
# 4 Michael Schrader Germany(GER) GER
# 5 Pieter Braun Netherlands(NED) NED
# 6 Laurent Hernu(FRA) NA FRA
# 7 Dmitriy Karpov Kazakhstan(KAZ) KAZ
# 8 Laurent Hernu(FRA) NA FRA
# 9 Thomas van der Plaetsen Belgium(BEL) BEL
# 10 Attila Szabó Hungary Hungary
# # ... with 20 more rows
Remove the TRUE ~ Nationality to extract only country code as commented by Frank:
sample %>% mutate(
countrycode = case_when(
is.na(Nationality) & grepl('\\(', Athlete) ~ gsub('.*?\\((.*)\\)', '\\1', Athlete),
grepl('\\(', Nationality) ~ gsub('.*?\\((.*)\\)', '\\1', Nationality)
))
An ugly approach would be to use sub:
library(data.table)
DT = data.table(sample)
patt = "^.*\\((.{3})\\).*$"; rp = "\\1"
DT[Athlete %like% patt, cc := sub(patt, rp, Athlete)]
DT[Nationality %like% patt, cc := sub(patt, rp, Nationality)]
Something like str_extract from the stringr package would probably be cleaner if you're already working with tidyverse packages. Also, for the dplyr analogue to the code above, maybe look at the case_when function. (I am not familiar enough with these tools to know the exact syntax.)
The result looks like...
> DT[, .(Athlete, Nationality, cc)]
Athlete Nationality cc
1: François Gourmet(BEL) NA BEL
2: Agustín Félix(ESP) NA ESP
3: Keisuke Ushiro Japan(JPN) JPN
4: Michael Schrader Germany(GER) GER
5: Pieter Braun Netherlands(NED) NED
6: Laurent Hernu(FRA) NA FRA
7: Dmitriy Karpov Kazakhstan(KAZ) KAZ
8: Laurent Hernu(FRA) NA FRA
9: Thomas van der Plaetsen Belgium(BEL) BEL
10: Attila Szabó Hungary NA
11: Nadir El Fassi France NA
12: Eduard Mikhan Belarus(BLR) BLR
13: Leonel Suárez Cuba NA
14: Janek Õiglane Estonia(EST) EST
15: Hans van Alphen(BEL) NA BEL
16: Roman Šebrle Czech Republic NA
17: André Niklaus(GER) NA GER
18: Pascal Behrenbruch Germany(GER) GER
19: Pieter Braun Netherlands(NED) NED
20: Oleksandr Yurkov(UKR) NA UKR
21: Eelco Sintnicolaas Netherlands(NED) NED
22: Brent Newdick New Zealand NA
23: Kim Kun-woo South Korea(KOR) KOR
24: Akihiko Nakamura Japan(JPN) JPN
25: Bastien Auzeil France(FRA) FRA
26: Frédéric Xhonneux NA NA
27: Janek Õiglane Estonia(EST) EST
28: Keisuke Ushiro Japan(JPN) JPN
29: Roman Šebrle Czech Republic(CZE) CZE
30: Rico Freimuth Germany(GER) GER
Athlete Nationality cc
This simple solution works too.
library(stringr)
data1$country_code <- sapply(data1$Nationality, function(x) unlist(stri_extract_all(str = x, regex = '([A-Z]+)'))[2])
Nationality country_code
1: NA NA
2: NA NA
3: Japan(JPN) JPN
4: Germany(GER) GER
5: Netherlands(NED) NED
6: NA NA

Subsetting data through a pairwise condition (coordinates) given through another matrix

I have the following problem. I want to subset data from data.frame A through the coordinates in matrix B. The matching must be pairwise between the coordinates of data.frame A and matrix B, for instance:
lon lat
315 10.47 52.26
342 10.47 52.37
314 10.36 52.26
341 10.36 52.37
316 10.58 52.26
288 10.47 52.15
343 10.58 52.37
287 10.36 52.15
369 10.47 52.48
Those are the coordinates, I want to use to select the rows, which have the same coordinates as in data.frame A.
Subset of data.frame A:
structure(list(V13138 = c(-15.0545539855957, -15.0118608474731,
-14.8698711395264, -14.7185792922974, -14.5449771881104, -14.3298683166504,
-14.093412399292, -13.8265686035156, -13.5609474182129, -13.2979116439819,
-12.9872589111328, -16.1379947662354, -16.0786437988281, -16.0000076293945,
-15.9013233184814, -15.8254327774048, -15.7928791046143, -15.7776985168457,
-15.7719392776489, -15.6555442810059, -15.5040102005005, -15.299674987793,
-15.1203699111938, -14.9610414505005, -14.9183511734009, -14.8809566497803,
-14.8522186279297, -14.7452983856201, -14.6159820556641, -14.4459781646729,
-14.2792196273804, -14.1135988235474, -13.919620513916, -13.7100219726562,
-13.4647169113159, -13.2096605300903, -12.9506988525391, -12.6352672576904,
-16.0543613433838, -15.9694194793701, -15.8757200241089, -15.7634477615356,
-15.6610631942749, -15.5705652236938, -15.4900894165039, -15.4035835266113,
-15.2854766845703, -15.1539916992188, -15.0065212249756, -14.8876695632935,
-14.7782440185547, -14.7028961181641, -14.6101722717285, -14.512882232666,
-14.3619556427002, -14.2041110992432, -14.0312938690186, -13.8672027587891,
-13.7057323455811, -13.5090990066528, -13.2931470870972, -13.0507898330688,
-12.7776670455933, -12.495795249939, -12.1937398910522, -15.8203887939453,
-15.7082033157349, -15.5987091064453, -15.4917774200439, -15.390435218811,
-15.2902202606201, -15.1883487701416, -15.0730466842651, -14.9352264404297,
-14.8101224899292, -14.6887359619141, -14.5857553482056, -14.479567527771,
-14.3734302520752, -14.2455368041992, -14.1091232299805, -13.9528331756592,
-13.7913122177124, -13.6249303817749), V13139 = c(-7.07704830169678,
-7.36577892303467, -7.60899782180786, -7.85753965377808, -8.1030740737915,
-8.29149341583252, -8.45194625854492, -8.52234935760498, -8.58086585998535,
-8.62695598602295, -8.63118934631348, -3.46103024482727, -3.60997128486633,
-3.75486493110657, -3.90544772148132, -4.07107162475586, -4.2713623046875,
-4.53771257400513, -4.8378758430481, -5.18377256393433, -5.52214574813843,
-5.85517549514771, -6.14670562744141, -6.42839002609253, -6.70926380157471,
-6.97871208190918, -7.23635053634644, -7.43477869033813, -7.61982440948486,
-7.77778148651123, -7.9507007598877, -8.140061378479, -8.27810287475586,
-8.39971256256104, -8.4821662902832, -8.54337215423584, -8.59362697601318,
-8.6127290725708, -3.92732691764832, -4.10400390625, -4.28167676925659,
-4.4712872505188, -4.68312835693359, -4.91524791717529, -5.20708131790161,
-5.51510334014893, -5.85246753692627, -6.17613887786865, -6.49274349212646,
-6.75846433639526, -7.00491952896118, -7.21479940414429, -7.39973735809326,
-7.57557010650635, -7.69487333297729, -7.81140756607056, -7.91438579559326,
-8.01808547973633, -8.12346649169922, -8.21732807159424, -8.30572509765625,
-8.38486099243164, -8.45881938934326, -8.52587699890137, -8.57262134552002,
-4.34183073043823, -4.54555749893188, -4.76105737686157, -4.99724388122559,
-5.26278305053711, -5.53893136978149, -5.85093879699707, -6.16052055358887,
-6.46362257003784, -6.73704147338867, -6.99943161010742, -7.20827674865723,
-7.38761377334595, -7.53203630447388, -7.64012908935547, -7.74198341369629,
-7.82141494750977, -7.89632749557495, -7.96634721755981), V13140 = c(2.38613152503967,
2.37324142456055, 2.38662815093994, 2.38441777229309, 2.35186982154846,
2.31384658813477, 2.2728853225708, 2.23825240135193, 2.20144987106323,
2.15977454185486, 2.13386940956116, 2.9677951335907, 2.92966151237488,
2.8759753704071, 2.8214259147644, 2.76297402381897, 2.70412373542786,
2.6543300151825, 2.61105895042419, 2.60811114311218, 2.60745763778687,
2.60840320587158, 2.60350298881531, 2.59271574020386, 2.54244041442871,
2.47917294502258, 2.40393853187561, 2.35462546348572, 2.31852698326111,
2.31014728546143, 2.29225921630859, 2.26293158531189, 2.23773765563965,
2.21094441413879, 2.18223357200623, 2.15001082420349, 2.11311554908752,
2.08600211143494, 2.89936757087708, 2.89069938659668, 2.87734007835388,
2.85613536834717, 2.8233802318573, 2.78204131126404, 2.73941993713379,
2.70192885398865, 2.68653988838196, 2.66766142845154, 2.64770603179932,
2.62153196334839, 2.5882031917572, 2.53414297103882, 2.46355938911438,
2.38226866722107, 2.31638383865356, 2.26739454269409, 2.24680852890015,
2.23866200447083, 2.23337078094482, 2.22489714622498, 2.21137762069702,
2.19127559661865, 2.16362285614014, 2.13074207305908, 2.08486270904541,
2.85043382644653, 2.87193655967712, 2.88829565048218, 2.89718008041382,
2.88119888305664, 2.85316681861877, 2.81990385055542, 2.7852942943573,
2.75193023681641, 2.72042202949524, 2.6889750957489, 2.650550365448,
2.60602164268494, 2.55256152153015, 2.48241400718689, 2.40686845779419,
2.32716631889343, 2.27034878730774, 2.23567771911621), lon = structure(c(10.36,
10.47, 10.58, 10.69, 10.8, 10.91, 11.02, 11.13, 11.24, 11.35,
11.46, 8.6, 8.71, 8.82, 8.93, 9.04, 9.15, 9.26, 9.37, 9.48, 9.59,
9.7, 9.81, 9.92, 10.03, 10.14, 10.25, 10.36, 10.47, 10.58, 10.69,
10.8, 10.91, 11.02, 11.13, 11.24, 11.35, 11.46, 8.6, 8.71, 8.82,
8.93, 9.04, 9.15, 9.26, 9.37, 9.48, 9.59, 9.7, 9.81, 9.92, 10.03,
10.14, 10.25, 10.36, 10.47, 10.58, 10.69, 10.8, 10.91, 11.02,
11.13, 11.24, 11.35, 11.46, 8.6, 8.71, 8.82, 8.93, 9.04, 9.15,
9.26, 9.37, 9.48, 9.59, 9.7, 9.81, 9.92, 10.03, 10.14, 10.25,
10.36, 10.47, 10.58), .Dim = 84L), lat = structure(c(52.15, 52.15,
52.15, 52.15, 52.15, 52.15, 52.15, 52.15, 52.15, 52.15, 52.15,
52.26, 52.26, 52.26, 52.26, 52.26, 52.26, 52.26, 52.26, 52.26,
52.26, 52.26, 52.26, 52.26, 52.26, 52.26, 52.26, 52.26, 52.26,
52.26, 52.26, 52.26, 52.26, 52.26, 52.26, 52.26, 52.26, 52.26,
52.37, 52.37, 52.37, 52.37, 52.37, 52.37, 52.37, 52.37, 52.37,
52.37, 52.37, 52.37, 52.37, 52.37, 52.37, 52.37, 52.37, 52.37,
52.37, 52.37, 52.37, 52.37, 52.37, 52.37, 52.37, 52.37, 52.37,
52.48, 52.48, 52.48, 52.48, 52.48, 52.48, 52.48, 52.48, 52.48,
52.48, 52.48, 52.48, 52.48, 52.48, 52.48, 52.48, 52.48, 52.48,
52.48), .Dim = 84L)), .Names = c("V13138", "V13139", "V13140",
"lon", "lat"), row.names = 287:370, class = "data.frame")
Matrix B with coordinates:
structure(list(lon = structure(c(10.47, 10.47, 10.36, 10.36,
10.58, 10.47, 10.58, 10.36, 10.47), .Dim = 9L), lat = structure(c(52.26,
52.37, 52.26, 52.37, 52.26, 52.15, 52.37, 52.15, 52.48), .Dim = 9L)), .Names = c("lon",
"lat"), out.attrs = structure(list(dim = c(27L, 23L), dimnames = structure(list(
Var1 = c("Var1= 8.60", "Var1= 8.71", "Var1= 8.82", "Var1= 8.93",
"Var1= 9.04", "Var1= 9.15", "Var1= 9.26", "Var1= 9.37", "Var1= 9.48",
"Var1= 9.59", "Var1= 9.70", "Var1= 9.81", "Var1= 9.92", "Var1=10.03",
"Var1=10.14", "Var1=10.25", "Var1=10.36", "Var1=10.47", "Var1=10.58",
"Var1=10.69", "Var1=10.80", "Var1=10.91", "Var1=11.02", "Var1=11.13",
"Var1=11.24", "Var1=11.35", "Var1=11.46"), Var2 = c("Var2=51.05",
"Var2=51.16", "Var2=51.27", "Var2=51.38", "Var2=51.49", "Var2=51.60",
"Var2=51.71", "Var2=51.82", "Var2=51.93", "Var2=52.04", "Var2=52.15",
"Var2=52.26", "Var2=52.37", "Var2=52.48", "Var2=52.59", "Var2=52.70",
"Var2=52.81", "Var2=52.92", "Var2=53.03", "Var2=53.14", "Var2=53.25",
"Var2=53.36", "Var2=53.47")), .Names = c("Var1", "Var2"))), .Names = c("dim",
"dimnames")), row.names = c(315L, 342L, 314L, 341L, 316L, 288L,
343L, 287L, 369L), class = "data.frame")
Matrix B has to have the same column names for the columns with the lon and lat coordinates as data.frame A. In the given data, the column names are lon and lat in both objects:
head.matrix(data.frame.A)
V13138 V13139 V13140 lon lat
287 -15.05455 -7.077048 2.386132 10.36 52.15
288 -15.01186 -7.365779 2.373241 10.47 52.15
289 -14.86987 -7.608998 2.386628 10.58 52.15
290 -14.71858 -7.857540 2.384418 10.69 52.15
291 -14.54498 -8.103074 2.351870 10.80 52.15
292 -14.32987 -8.291493 2.313847 10.91 52.15
matrix.B
lon lat
315 10.47 52.26
342 10.47 52.37
314 10.36 52.26
341 10.36 52.37
316 10.58 52.26
288 10.47 52.15
343 10.58 52.37
287 10.36 52.15
369 10.47 52.48
To subset data from data.frame A through the coordinates in matrix B, just use the following code:
subset.A <- merge(data.frame.A, matrix.B)
lon lat V13138 V13139 V13140
1 10.36 52.15 -15.05455 -7.077048 2.386132
2 10.36 52.26 -14.74530 -7.434779 2.354625
3 10.36 52.37 -14.36196 -7.694873 2.316384
4 10.47 52.15 -15.01186 -7.365779 2.373241
5 10.47 52.26 -14.61598 -7.619824 2.318527
6 10.47 52.37 -14.20411 -7.811408 2.267395
7 10.47 52.48 -13.79131 -7.896327 2.270349
8 10.58 52.26 -14.44598 -7.777781 2.310147
9 10.58 52.37 -14.03129 -7.914386 2.246809
It's a really simple solution. The columns with the coordinates will be placed in the first 2 columns of the subset.

Resources