Kusto KQL Query - TimeGenerated issue - azure-data-explorer

I have a script running on endpoints daily that sends a list of the applications installed to a Log Analytics workspace. I would like to query the current list of applications installed on each device.
The issue with the query below is that it includes applications reported on a previous occasion that have since either been updated to a newer version or been uninstalled from the device. The first table below illustrates the starting data, and the second table shows the result returned by the query; the Row column is added for reference. In the second table, the apps on rows 3 and 8 have been updated to a new version, and row 5 has been uninstalled. I would like the query to return only the latest set of software and versions, by the last TimeGenerated for each device, and not return these rows.
datatable(Row:int, DeviceName:string, AppName:string, AppVersion:string, TimeGenerated:datetime)
[
1 ,"A" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-23T06:02:39.66Z"
,2 ,"A" ,"Microsoft Word" ,"1.0.0" ,"2022-09-23T06:02:39.66Z"
,3 ,"A" ,"Microsoft Excel" ,"1.0.0" ,"2022-09-23T06:02:39.66Z"
,4 ,"A" ,"Microsoft ToDo" ,"1.0.0" ,"2022-09-23T06:02:39.66Z"
,5 ,"A" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-26T06:02:39.66Z"
,6 ,"A" ,"Microsoft Word" ,"1.0.0" ,"2022-09-26T06:02:39.66Z"
,7 ,"A" ,"Microsoft Excel" ,"1.0.0" ,"2022-09-26T06:02:39.66Z"
,8 ,"A" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-27T06:02:39.66Z"
,9 ,"A" ,"Microsoft Word" ,"1.0.0" ,"2022-09-27T06:02:39.66Z"
,10 ,"A" ,"Microsoft Excel" ,"1.0.1" ,"2022-09-27T06:02:39.66Z"
,11 ,"B" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-25T16:31:57.688Z"
,12 ,"B" ,"Microsoft Word" ,"1.0.0" ,"2022-09-25T16:31:57.688Z"
,13 ,"B" ,"Microsoft Excel" ,"1.0.0" ,"2022-09-25T16:31:57.688Z"
,14 ,"B" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-28T10:45:37.168Z"
,15 ,"B" ,"Microsoft Word" ,"1.0.0" ,"2022-09-28T10:45:37.168Z"
,16 ,"B" ,"Microsoft Excel" ,"1.0.1" ,"2022-09-28T10:45:37.168Z"
]
Table | summarize arg_max(TimeGenerated, *) by DeviceName, AppName
Row  DeviceName  AppName          AppVersion  TimeGenerated
1    A           Microsoft Teams  1.0.0       2022-09-27T06:02:39.66Z
2    A           Microsoft Word   1.0.0       2022-09-27T06:02:39.66Z
3    A           Microsoft Excel  1.0.0       2022-09-26T06:02:39.66Z
4    A           Microsoft Excel  1.0.1       2022-09-27T06:02:39.66Z
5    A           Microsoft ToDo   1.0.0       2022-09-23T06:02:39.66Z
6    B           Microsoft Teams  1.0.0       2022-09-28T10:45:37.168Z
7    B           Microsoft Word   1.0.0       2022-09-28T10:45:37.168Z
8    B           Microsoft Excel  1.0.0       2022-09-25T16:31:57.688Z
9    B           Microsoft Excel  1.0.1       2022-09-28T10:45:37.168Z

datatable(Row:int, DeviceName:string, AppName:string, AppVersion:string, TimeGenerated:datetime)
[
1 ,"A" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-23T06:02:39.66Z"
,2 ,"A" ,"Microsoft Word" ,"1.0.0" ,"2022-09-23T06:02:39.66Z"
,3 ,"A" ,"Microsoft Excel" ,"1.0.0" ,"2022-09-23T06:02:39.66Z"
,4 ,"A" ,"Microsoft ToDo" ,"1.0.0" ,"2022-09-23T06:02:39.66Z"
,5 ,"A" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-26T06:02:39.66Z"
,6 ,"A" ,"Microsoft Word" ,"1.0.0" ,"2022-09-26T06:02:39.66Z"
,7 ,"A" ,"Microsoft Excel" ,"1.0.0" ,"2022-09-26T06:02:39.66Z"
,8 ,"A" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-27T06:02:39.66Z"
,9 ,"A" ,"Microsoft Word" ,"1.0.0" ,"2022-09-27T06:02:39.66Z"
,10 ,"A" ,"Microsoft Excel" ,"1.0.1" ,"2022-09-27T06:02:39.66Z"
,11 ,"B" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-25T16:31:57.688Z"
,12 ,"B" ,"Microsoft Word" ,"1.0.0" ,"2022-09-25T16:31:57.688Z"
,13 ,"B" ,"Microsoft Excel" ,"1.0.0" ,"2022-09-25T16:31:57.688Z"
,14 ,"B" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-28T10:45:37.168Z"
,15 ,"B" ,"Microsoft Word" ,"1.0.0" ,"2022-09-28T10:45:37.168Z"
,16 ,"B" ,"Microsoft Excel" ,"1.0.1" ,"2022-09-28T10:45:37.168Z"
]
| summarize arg_max(TimeGenerated, *) by DeviceName, AppName
| partition hint.strategy=native by DeviceName
(
order by TimeGenerated desc
| where row_rank(TimeGenerated) == 1
)
DeviceName  AppName          TimeGenerated             Row  AppVersion
A           Microsoft Teams  2022-09-27T06:02:39.66Z   8    1.0.0
A           Microsoft Word   2022-09-27T06:02:39.66Z   9    1.0.0
A           Microsoft Excel  2022-09-27T06:02:39.66Z   10   1.0.1
B           Microsoft Teams  2022-09-28T10:45:37.168Z  14   1.0.0
B           Microsoft Word   2022-09-28T10:45:37.168Z  15   1.0.0
B           Microsoft Excel  2022-09-28T10:45:37.168Z  16   1.0.1
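For comparison outside Kusto, the same two-step logic (latest record per device/app, then keep only rows from each device's most recent report) can be sketched in pandas. This is an illustrative sketch only, not part of the original answer; the toy frame below borrows the column names from the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "DeviceName":   ["A", "A", "A", "A", "A", "B", "B"],
    "AppName":      ["Teams", "Word", "Excel", "Excel", "ToDo", "Teams", "Excel"],
    "AppVersion":   ["1.0.0", "1.0.0", "1.0.0", "1.0.1", "1.0.0", "1.0.0", "1.0.1"],
    "TimeGenerated": pd.to_datetime([
        "2022-09-27", "2022-09-27", "2022-09-26", "2022-09-27",
        "2022-09-23", "2022-09-28", "2022-09-28"]),
})

# Step 1: equivalent of `summarize arg_max(TimeGenerated, *) by DeviceName, AppName`
latest = df.loc[df.groupby(["DeviceName", "AppName"])["TimeGenerated"].idxmax()]

# Step 2: keep only rows stamped with each device's most recent report time,
# which drops apps (like ToDo) that stopped appearing in newer reports
last_report = latest.groupby("DeviceName")["TimeGenerated"].transform("max")
snapshot = latest[latest["TimeGenerated"] == last_report]
```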

Another way to achieve the same result, using join:
datatable(Row:int, DeviceName:string, AppName:string, AppVersion:string, TimeGenerated:datetime)
[
1 ,"A" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-23T06:02:39.66Z"
,2 ,"A" ,"Microsoft Word" ,"1.0.0" ,"2022-09-23T06:02:39.66Z"
,3 ,"A" ,"Microsoft Excel" ,"1.0.0" ,"2022-09-23T06:02:39.66Z"
,4 ,"A" ,"Microsoft ToDo" ,"1.0.0" ,"2022-09-23T06:02:39.66Z"
,5 ,"A" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-26T06:02:39.66Z"
,6 ,"A" ,"Microsoft Word" ,"1.0.0" ,"2022-09-26T06:02:39.66Z"
,7 ,"A" ,"Microsoft Excel" ,"1.0.0" ,"2022-09-26T06:02:39.66Z"
,8 ,"A" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-27T06:02:39.66Z"
,9 ,"A" ,"Microsoft Word" ,"1.0.0" ,"2022-09-27T06:02:39.66Z"
,10 ,"A" ,"Microsoft Excel" ,"1.0.1" ,"2022-09-27T06:02:39.66Z"
,11 ,"B" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-25T16:31:57.688Z"
,12 ,"B" ,"Microsoft Word" ,"1.0.0" ,"2022-09-25T16:31:57.688Z"
,13 ,"B" ,"Microsoft Excel" ,"1.0.0" ,"2022-09-25T16:31:57.688Z"
,14 ,"B" ,"Microsoft Teams" ,"1.0.0" ,"2022-09-28T10:45:37.168Z"
,15 ,"B" ,"Microsoft Word" ,"1.0.0" ,"2022-09-28T10:45:37.168Z"
,16 ,"B" ,"Microsoft Excel" ,"1.0.1" ,"2022-09-28T10:45:37.168Z"
]
| summarize arg_max(TimeGenerated, *) by DeviceName, AppName
| as LatestAppsVersions
| join kind=inner
(
LatestAppsVersions
| summarize TimeGenerated = max(TimeGenerated) by DeviceName
) on DeviceName, TimeGenerated
| project-away *1
DeviceName  AppName          TimeGenerated             Row  AppVersion
A           Microsoft Excel  2022-09-27T06:02:39.66Z   10   1.0.1
A           Microsoft Word   2022-09-27T06:02:39.66Z   9    1.0.0
A           Microsoft Teams  2022-09-27T06:02:39.66Z   8    1.0.0
B           Microsoft Excel  2022-09-28T10:45:37.168Z  16   1.0.1
B           Microsoft Word   2022-09-28T10:45:37.168Z  15   1.0.0
B           Microsoft Teams  2022-09-28T10:45:37.168Z  14   1.0.0

Related

How to apply cluster analysis to a database in R

I have a dataset that shows the answers of applicants to a course. The original dataset is 2k rows and 105 columns, of which 100 correspond to questions from 4 basic areas: mathematics, language, science, and social studies.
I have created the following short example so that you can see roughly what the table looks like:
sector<-c("Privado" ,"Publico" ,"Publico" ,"Publico", "Publico",
"Publico" ,"Publico", "Publico", "Publico" ,"Publico", "Publico",
"Publico" ,"Publico", "Publico", "Publico" ,"Privado" ,"Publico" ,
"Publico" ,"Publico" ,"Publico")
aspirante<-c("337877" ,"339161", "388425" ,"371828" ,"288598" ,"396295" ,"400196",
"370915", "276891" ,"335406" ,"358013", "404406", "356633", "284792", "372549" ,
"271082", "396135" ,"398664" ,"406397", "354609")
claves<-c("10" ,"9" , "10", "4" , "4" , "3" , "3" , "4" , "9" ,"10", "3",
"3" , "3" , "4" , "4" , "4" , "4", "4" ,"9" , "3")
question1<-c(1, 0, 0, 0 ,0, 0, 0, 0, 0 ,0, 0, 0 ,0, 0, 0 ,0, 0, 0 ,1, 0)
question2<-c(0, 1, 1 ,0 ,0, 0 ,0 ,0 ,1, 0, 0,0,1 ,0 ,1, 1, 0 ,0, 0, 0)
question3<-c( 0 ,0, 1, 1, 1 ,1 ,0, 0, 0 ,0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
question4<-c(0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1)
question5<-c(1, 0, 1, 0 ,0, 1, 0, 1, 1, 0, 1, 1, 0 ,0 ,0, 0, 1, 0, 0, 0)
note<-c(4 ,2, 6, 4, 2, 6, 0 ,4, 4 ,0, 4 ,4 ,6, 2, 4 ,4, 4, 2, 4, 2)
example<-data.frame("candidate"=aspirante,"sector"=sector,"p1"=question1,
"p2"=question2,"p3"=question3,"p4"=question4,"p5"=question5,"note"=note)
The note column is equal to the sum of the question columns in each row, multiplied by 2.
I am asked to do a cluster analysis but have no idea where to start. I had planned to divide the final notes into 4 categories:
failed: grades less than 3
considered: between 3 and 5
space availability: between 5 and 7
approved: from 7 to 10
but in the original dataset the size of each group will vary, and I cannot create a new dataset that divides the notes by group. Do you have any suggestions, or an example where cluster analysis is applied to dichotomous data?
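No answer is recorded for this question here, but as a hedged sketch of one common route: dichotomous (0/1) data is often clustered with a binary distance such as Jaccard, followed by hierarchical clustering. A minimal Python illustration (the toy answer matrix below is invented, not the asker's data):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Toy answer matrix: rows = applicants, columns = dichotomous questions (1 = correct)
answers = np.array([
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
])

# Jaccard distance is defined for binary vectors, unlike plain Euclidean distance
dist = pdist(answers.astype(bool), metric="jaccard")

# Average-linkage hierarchical clustering, cut into 2 clusters
labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
```

In R the analogous route would be dist(..., method = "binary") plus hclust, but the choice of distance and number of clusters should be driven by the actual data.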

How to add columns for animal passage in R

I am trying to summarize our detection data in a way that I can easily see when an animal moves from one pool to another. Here is an example of one animal that I track
tibble [22 x 13] (S3: tbl_df/tbl/data.frame)
$ Receiver : chr [1:22] "VR2Tx-480679" "VR2Tx-480690" "VR2Tx-480690" "VR2Tx-480690" ...
$ Transmitter : chr [1:22] "A69-9001-12418" "A69-9001-12418" "A69-9001-12418" "A69-9001-12418" ...
$ Species : chr [1:22] "PDFH" "PDFH" "PDFH" "PDFH" ...
$ LocalDATETIME: POSIXct[1:22], format: "2021-05-28 07:16:52" ...
$ StationName : chr [1:22] "1405U" "1406U" "1406U" "1406U" ...
$ LengthValue : num [1:22] 805 805 805 805 805 805 805 805 805 805 ...
$ WeightValue : num [1:22] 8.04 8.04 8.04 8.04 8.04 8.04 8.04 8.04 8.04 8.04 ...
$ Sex : chr [1:22] "NA" "NA" "NA" "NA" ...
$ Translocated : num [1:22] 0 0 0 0 0 0 0 0 0 0 ...
$ Pool : num [1:22] 16 16 16 16 16 16 16 16 16 16 ...
$ DeployDate : POSIXct[1:22], format: "2018-06-05" ...
$ Latitude : num [1:22] 41.6 41.6 41.6 41.6 41.6 ...
$ Longitude : num [1:22] -90.4 -90.4 -90.4 -90.4 -90.4 ...
I want to add columns that would let me record the start date of when an animal entered a pool and, once it moves to a different pool, the end date of when it exited.
Ex: the fish enters Pool 19 on 1/1/22 and is next detected in Pool 20 on 1/2/22, so there would be columns saying it entered Pool 19 on 1/1/22 and exited on 1/2/22. I have shared an Excel file example of what I am trying to do. I would like to code upstream movement with a 1 and downstream movement with a 0.
I have millions of detections and hundreds of animals that I monitor so I am trying to find a way to look at passages for each animal. Thank you!
Here is my dataset using dput:
structure(list(Receiver = c("VR2Tx-480679", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480692", "VR2Tx-480695",
"VR2Tx-480695", "VR2Tx-480713", "VR2Tx-480713", "VR2Tx-480702",
"VR100", "VR100", "VR100"), Transmitter = c("A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418"), Species = c("PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH"), LocalDATETIME = structure(c(1622186212, 1622381700,
1622384575, 1622184711, 1622381515, 1622381618, 1622381751, 1622381924,
1622382679, 1622383493, 1622384038, 1622384612, 1622183957, 1622381515,
1626905954, 1626905688, 1622971975, 1622970684, 1626929618, 1624616880,
1626084540, 1626954660), tzone = "UTC", class = c("POSIXct",
"POSIXt")), StationName = c("1405U", "1406U", "1406U", "1406U",
"1406U", "1406U", "1406U", "1406U", "1406U", "1406U", "1406U",
"1406U", "1406U", "1404L", "1401D", "1401D", "14Aux2", "14Aux2",
"15.Mid.Wall", "man_loc", "man_loc", "man_loc"), LengthValue = c(805,
805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805,
805, 805, 805, 805, 805, 805, 805, 805), WeightValue = c(8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04),
Sex = c("NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA"), Translocated = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Pool = c(16,
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
16, 16, 16, 14, 14, 16), DeployDate = structure(c(1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800), tzone = "UTC", class = c("POSIXct", "POSIXt"
)), Latitude = c(41.57471, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.57463, 41.5731, 41.5731, 41.57469, 41.57469,
41.57469, 41.57469, 41.57469, 41.57469), Longitude = c(-90.39944,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39984, -90.40391, -90.40391, -90.40462, -90.40462, -90.40462,
-90.40462, -90.40462, -90.40462)), row.names = c(NA, -22L
), class = c("tbl_df", "tbl", "data.frame"))
Here is one possibility for getting the date an animal enters a pool and the date it leaves. First, I group by Species (you could add further grouping variables to distinguish between specimens) and arrange by time. Then I flag any change to Pool using cumsum, so consecutive detections in the same pool share a group. The first date recorded within a group becomes the date the animal entered the pool. Some grouping and ungrouping then grabs the date from the next group (i.e., the date the animal left the pool) and copies it across the whole group. For determining upstream/downstream, we can use case_when inside mutate; I'm also assuming you want this to match by date, so I have filled in the movement value for each pool change across its group.
library(tidyverse)

df_dates <- df %>%
  group_by(Species, Transmitter) %>%
  arrange(Species, Transmitter, LocalDATETIME) %>%
  mutate(changeGroup = cumsum(Pool != lag(Pool, default = -1))) %>%
  group_by(Species, Transmitter, changeGroup) %>%
  mutate(EnterPool = first(format(as.Date(LocalDATETIME), "%m/%d/%Y"))) %>%
  ungroup(changeGroup) %>%
  mutate(LeftPool = lead(EnterPool)) %>%
  group_by(Species, Transmitter, changeGroup) %>%
  mutate(LeftPool = last(LeftPool)) %>%
  ungroup(changeGroup) %>%
  mutate(stream = case_when((Pool - lag(Pool)) > 0 ~ 0,
                            (Pool - lag(Pool)) < 0 ~ 1)) %>%
  fill(stream, .direction = "down")
Output
print(as_tibble(df_dates[1:24, c(1:5, 10:17)]), n=24)
# A tibble: 24 × 13
Receiver Transmitter Species LocalDATETIME StationName Pool DeployDate Latitude Longitude changeGroup EnterPool LeftPool stream
<chr> <chr> <chr> <dttm> <chr> <dbl> <dttm> <dbl> <dbl> <int> <chr> <chr> <dbl>
1 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-28 06:39:17 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
2 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-28 06:51:51 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
3 VR2Tx-480679 A69-9001-12418 PDFH 2021-05-28 07:16:52 1405U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
4 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:31:55 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
5 VR2Tx-480692 A69-9001-12418 PDFH 2021-05-30 13:31:55 1404L 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
6 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:33:38 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
7 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:35:00 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
8 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:35:51 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
9 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:38:44 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
10 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 13:51:19 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
11 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 14:04:53 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
12 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 14:13:58 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
13 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 14:22:55 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
14 VR2Tx-480690 A69-9001-12418 PDFH 2021-05-30 14:23:32 1406U 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
15 VR2Tx-480713 A69-9001-12418 PDFH 2021-06-06 09:11:24 14Aux2 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
16 VR2Tx-480713 A69-9001-12418 PDFH 2021-06-06 09:32:55 14Aux2 16 2018-06-05 00:00:00 41.6 -90.4 1 05/28/2021 06/25/2021 NA
17 VR100 A69-9001-12418 PDFH 2021-06-25 10:28:00 man_loc 14 2018-06-05 00:00:00 41.6 -90.4 2 06/25/2021 07/21/2021 1
18 VR100 A69-9001-12418 PDFH 2021-07-12 10:09:00 man_loc 14 2018-06-05 00:00:00 41.6 -90.4 2 06/25/2021 07/21/2021 1
19 VR2Tx-480695 A69-9001-12418 PDFH 2021-07-21 22:14:48 1401D 16 2018-06-05 00:00:00 41.6 -90.4 3 07/21/2021 NA 0
20 VR2Tx-480695 A69-9001-12418 PDFH 2021-07-21 22:19:14 1401D 16 2018-06-05 00:00:00 41.6 -90.4 3 07/21/2021 NA 0
21 VR2Tx-480702 A69-9001-12418 PDFH 2021-07-22 04:53:38 15.Mid.Wall 16 2018-06-05 00:00:00 41.6 -90.4 3 07/21/2021 NA 0
22 VR100 A69-9001-12418 PDFH 2021-07-22 11:51:00 man_loc 16 2018-06-05 00:00:00 41.6 -90.4 3 07/21/2021 NA 0
23 AR100 B80-9001-12420 PDFH 2021-07-22 11:51:00 man_loc 19 2018-06-05 00:00:00 42.6 -90.4 1 07/22/2021 07/22/2021 NA
24 AR100 B80-9001-12420 PDFH 2021-07-22 11:51:01 man_loc 18 2018-06-05 00:00:00 42.6 -90.4 2 07/22/2021 NA 1
Data
df <- structure(list(Receiver = c("VR2Tx-480679", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480690",
"VR2Tx-480690", "VR2Tx-480690", "VR2Tx-480692", "VR2Tx-480695",
"VR2Tx-480695", "VR2Tx-480713", "VR2Tx-480713", "VR2Tx-480702",
"VR100", "VR100", "VR100", "AR100", "AR100"), Transmitter = c("A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "A69-9001-12418", "A69-9001-12418", "A69-9001-12418",
"A69-9001-12418", "B80-9001-12420", "B80-9001-12420"), Species = c("PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH", "PDFH",
"PDFH", "PDFH", "PDFH", "PDFH"), LocalDATETIME = structure(c(1622186212, 1622381700,
1622384575, 1622184711, 1622381515, 1622381618, 1622381751, 1622381924,
1622382679, 1622383493, 1622384038, 1622384612, 1622183957, 1622381515,
1626905954, 1626905688, 1622971975, 1622970684, 1626929618, 1624616880,
1626084540, 1626954660, 1626954661, 1626954660), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
StationName = c("1405U", "1406U", "1406U", "1406U", "1406U",
"1406U", "1406U", "1406U", "1406U", "1406U", "1406U", "1406U",
"1406U", "1404L", "1401D", "1401D", "14Aux2", "14Aux2", "15.Mid.Wall",
"man_loc", "man_loc", "man_loc", "man_loc", "man_loc"), LengthValue = c(805, 805,
805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805, 805,
805, 805, 805, 805, 805, 805, 805, 805, 805, 805), WeightValue = c(8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04,
8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04, 8.04,
8.04, 8.04, 8.04), Sex = c("NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA", "NA", "NA", "NA"), Translocated = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
Pool = c(16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
16, 16, 16, 16, 16, 16, 16, 14, 14, 16, 18, 19), DeployDate = structure(c(1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800, 1528156800, 1528156800,
1528156800, 1528156800, 1528156800), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Latitude = c(41.57471, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758, 41.5758,
41.5758, 41.57463, 41.5731, 41.5731, 41.57469, 41.57469,
41.57469, 41.57469, 41.57469, 41.57469, 42.57469, 42.57469), Longitude = c(-90.39944,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39793, -90.39793, -90.39793, -90.39793, -90.39793, -90.39793,
-90.39984, -90.40391, -90.40391, -90.40462, -90.40462, -90.40462,
-90.40462, -90.40462, -90.40462, -90.40470, -90.40470)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -24L))

How can I sum the totals of NA values in a data.frame or tibble column in R and group them by Month and Year

Here is a sample of my data:
year <- c(1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,
1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,
1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,
1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,
1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,
1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980 ,1980)
month <- c(1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1
,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,1 ,2
,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2
,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2 ,2)
Q <- c(NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA
,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,0.3 ,0.3 ,0.28
,0.26 ,0.26 ,0.25 ,0.25 ,0.24 ,0.24 ,0.24 ,0.24 ,0.23 ,0.23 ,NA ,NA
,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA ,NA)
I combined them into a dataframe called Flow
Flow <- data.frame(year,month,Q)
I can sum or count the number of missing or NA values in my Q column.
sum(is.na(Flow$Q))
Now I am trying to calculate the number of NA values in each month, and eventually in each year. This is where I'm stuck:
group_by(Flow$year, Flow$month) %>%
sum(is.na(Flow$Q)
With group_by, we can use summarise. Also, we don't need Flow$ inside group_by:
library(dplyr)
Flow %>%
  group_by(year, month) %>%
  summarise(Nas = sum(is.na(Q)))
# A tibble: 2 x 3
# Groups: year [1]
# year month Nas
# <dbl> <dbl> <int>
#1 1980 1 28
#2 1980 2 19
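The same grouped NA count can be sketched in pandas for comparison; this is illustrative only, with a small invented slice of the data:

```python
import pandas as pd
import numpy as np

flow = pd.DataFrame({
    "year":  [1980] * 6,
    "month": [1, 1, 1, 2, 2, 2],
    "Q":     [np.nan, np.nan, 0.3, np.nan, 0.25, 0.24],
})

# sum(is.na(Q)) per (year, month): isna() gives booleans, sum() counts the TRUEs
nas = (flow["Q"].isna()
           .groupby([flow["year"], flow["month"]])
           .sum()
           .reset_index(name="Nas"))
```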

Full Join in dplyr

I have a dataframe looking like:
library(tidyverse)
df <- tibble::tribble(
  ~sub_date, ~period,
  "2019-01", 1,
  "2019-01", 2,
  "2019-01", 3,
  "2019-02", 1,
  "2019-02", 2,
  "2019-03", 1,
  "2019-03", 2,
  "2019-03", 3,
  "2019-03", 4
)
sub_date period
<chr> <dbl>
1 2019-01 1
2 2019-01 2
3 2019-01 3
4 2019-02 1
5 2019-02 2
6 2019-03 1
7 2019-03 2
8 2019-03 3
9 2019-03 4
and another:
period <- tibble::tribble(
  ~period, ~forecast,
  1, 10,
  2, 20,
  3, 30,
  4, 40,
  5, 50,
  6, 60,
  7, 70
)
period forecast
<dbl> <dbl>
1 1 10
2 2 20
3 3 30
4 4 40
5 5 50
6 6 60
7 7 70
I am struggling to join them so that df is filled in with the missing periods from period, i.e. the result should have (number of rows in period) x (number of distinct sub_date values in df) rows, as follows:
df_output <- tibble::tribble(
  ~sub_date, ~period, ~forecast,
  "2019-01", 1, 10,
  "2019-01", 2, 20,
  "2019-01", 3, 30,
  "2019-01", 4, 40,
  "2019-01", 5, 50,
  "2019-01", 6, 60,
  "2019-01", 7, 70,
  "2019-02", 1, 10,
  "2019-02", 2, 20,
  "2019-02", 3, 30,
  "2019-02", 4, 40,
  "2019-02", 5, 50,
  "2019-02", 6, 60,
  "2019-02", 7, 70,
  "2019-03", 1, 10,
  "2019-03", 2, 20,
  "2019-03", 3, 30,
  "2019-03", 4, 40,
  "2019-03", 5, 50,
  "2019-03", 6, 60,
  "2019-03", 7, 70
)
# A tibble: 21 x 3
sub_date period forecast
<chr> <dbl> <dbl>
1 2019-01 1 10
2 2019-01 2 20
3 2019-01 3 30
4 2019-01 4 40
5 2019-01 5 50
6 2019-01 6 60
7 2019-01 7 70
8 2019-02 1 10
9 2019-02 2 20
10 2019-02 3 30
# … with 11 more rows
I assumed it was a full join but I don't get the desired result.
Any help?
You can use tidyr::crossing to obtain your desired result:
crossing(select(df, sub_date), period)
Note that you are not looking for a join, since you want every sub_date combined (or crossed) with every combination of period and forecast.
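The same cross (Cartesian) product exists in other tools too; for instance pandas (1.2 or later) supports merge(..., how="cross"). A small sketch with truncated toy versions of the two tables:

```python
import pandas as pd

df = pd.DataFrame({"sub_date": ["2019-01", "2019-01", "2019-02"], "period": [1, 2, 1]})
period = pd.DataFrame({"period": [1, 2, 3], "forecast": [10, 20, 30]})

# Every distinct sub_date crossed with every (period, forecast) row
out = df[["sub_date"]].drop_duplicates().merge(period, how="cross")
```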
You can also merge the tables; with no columns in common, merge produces the Cartesian product. Try this to see if it gives you what you need:
dates <- distinct(df, sub_date)
answer <- merge(period, dates, all = TRUE)

Filter days based on groupby

I have a df and I want to filter rows based on a grouping. For each group-by combination (cc, odd, tree1, and tree2): if the group contains any day > 4, keep the whole group; otherwise drop it.
df <- data_frame(
  cc = c('BB', 'BB', 'BB', 'BB', 'BB', 'BB', 'BB', 'BB',
         'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD',
         'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ'),
  odd = c(3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435,
          3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435,
          3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435),
  tree1 = c('ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP',
            'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP',
            'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP'),
  tree2 = rep('ATK', 24),
  day = c(1, 2, 3, 4, 3, 4, 5, 6,
          2, 3, 4, 5, 1, 3, 5, 7,
          1, 2, 6, 8, 2, 4, 6, 8)
)
I tried this, but it drops every individual row with a day value of 4 or less, regardless of group:
df1 <- df %>%
  arrange(cc, odd, tree1, tree2, day) %>%
  group_by(cc, odd, tree1, tree2) %>%
  filter(day > 4)
I would like to get a df as below.
df2 <- data_frame(
  cc = c('BB', 'BB', 'BB', 'BB',
         'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD',
         'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ'),
  odd = c(3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434,
          3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434,
          3435, 3435, 3435, 3435),
  tree1 = c('SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP',
            'SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP',
            'SAP', 'SAP', 'SAP', 'SAP'),
  tree2 = rep('ATK', 20),
  day = c(3, 4, 5, 6, 2, 3, 4, 5, 1, 3, 5, 7, 1, 2, 6, 8, 2, 4, 6, 8)
)
You can try:
df %>%
  group_by(cc, odd, tree1, tree2) %>%
  filter(any(day > 4))
# A tibble: 20 x 5
cc odd tree1 tree2 day
<chr> <dbl> <chr> <chr> <dbl>
1 BB 3435 SAP ATK 3
2 BB 3435 SAP ATK 4
3 BB 3435 SAP ATK 5
4 BB 3435 SAP ATK 6
5 DD 3434 ASP ATK 2
6 DD 3434 ASP ATK 3
7 DD 3434 ASP ATK 4
8 DD 3434 ASP ATK 5
9 DD 3435 SAP ATK 1
10 DD 3435 SAP ATK 3
11 DD 3435 SAP ATK 5
12 DD 3435 SAP ATK 7
13 ZZ 3434 ASP ATK 1
14 ZZ 3434 ASP ATK 2
15 ZZ 3434 ASP ATK 6
16 ZZ 3434 ASP ATK 8
17 ZZ 3435 SAP ATK 2
18 ZZ 3435 SAP ATK 4
19 ZZ 3435 SAP ATK 6
20 ZZ 3435 SAP ATK 8
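The filter(any(day > 4)) pattern maps to a grouped filter in pandas as well; a minimal sketch using a small invented subset of the data:

```python
import pandas as pd

df = pd.DataFrame({
    "cc":  ["BB", "BB", "BB", "DD", "DD"],
    "odd": [3434, 3434, 3435, 3434, 3434],
    "day": [1, 2, 5, 2, 3],
})

# Keep every row of a (cc, odd) group if ANY of its day values exceeds 4
kept = df.groupby(["cc", "odd"]).filter(lambda g: (g["day"] > 4).any())
```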
