Kusto - Help writing KQL Pivot - azure-data-explorer

In an IoT project we are gathering sensor data in Azure Data Explorer. All sensor data is stored in a "signals" table. To uniquely identify a time series for a given sensor, we query like this:
Signals
| where TestId == "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3" and SignalName == "Signal1"
We want to be able to pivot all time series for a given TestId in the "signals" table from rows into columns.
I have been unable to write a Kusto query that achieves this, and I am hoping for some help on this forum.
AS-IS
The current signals table schema looks like this:
Timestamp            TestId                                SignalName  Value
2021-01-01 12:00:30  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  Signal1     23400
2021-01-01 12:00:30  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  Signal2     0.113
2021-01-01 12:00:30  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  Signal3     77.5
2021-01-01 12:00:31  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  Signal1     23450
2021-01-01 12:00:31  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  Signal2     0.114
2021-01-01 12:00:31  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  Signal3     75.4
2021-01-01 12:00:32  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  Signal1     22450
2021-01-01 12:00:32  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  Signal2     0.113
2021-01-01 12:00:32  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  Signal3     80.05
TO-BE
I want to pivot the table into the following schema:
Timestamp            TestId                                Signal1  Signal2  Signal3
2021-01-01 12:00:30  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  23400    0.113    77.5
2021-01-01 12:00:31  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  23450    0.114    75.4
2021-01-01 12:00:32  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  22450    0.113    80.05
I have tried the following query:
let testId = "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3";
signals
| where TestId == testId
| where SignalName == "Signal1" or SignalName == "Signal2" or SignalName == "Signal3"
| order by Timestamp desc
| evaluate pivot(SignalName)
But in the resulting table the timestamp is repeated - each timestamp appears once per signal, and a default value of 0 is inserted in the other signal columns:
Timestamp            TestId                                Signal1  Signal2  Signal3
2021-01-01 12:00:30  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  23400    0        0
2021-01-01 12:00:30  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  0        0.113    0
2021-01-01 12:00:30  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  0        0        77.5
2021-01-01 12:00:31  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  23450    0        0
2021-01-01 12:00:31  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  0        0.114    0
2021-01-01 12:00:31  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  0        0        75.4
2021-01-01 12:00:32  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  22450    0        0
2021-01-01 12:00:32  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  0        0.113    0
2021-01-01 12:00:32  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  0        0        80.05
I do not need any aggregation in the pivot, since all signals should have a value at the exact same timestamp.
Can anyone help me write a KQL query for this?
Do I need to create a materialized view in Azure Data Explorer to achieve this? An update policy or a function?
Thanks

You need to specify an aggregation function in the pivot plugin. Since each (Timestamp, TestId, SignalName) combination has exactly one row, any aggregation that simply returns that single value works, for example sum(Value) or take_any(Value):
datatable(Timestamp:datetime, TestId:string, SignalName:string, Value:double)
[
datetime(2021-01-01 12:00:30), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal1", 23400,
datetime(2021-01-01 12:00:30), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal2", 0.113,
datetime(2021-01-01 12:00:30), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal3", 77.5,
datetime(2021-01-01 12:00:31), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal1", 23450,
datetime(2021-01-01 12:00:31), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal2", 0.114,
datetime(2021-01-01 12:00:31), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal3", 75.4,
datetime(2021-01-01 12:00:32), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal1", 22450,
datetime(2021-01-01 12:00:32), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal2", 0.113,
datetime(2021-01-01 12:00:32), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal3", 80.05
]
| evaluate pivot(SignalName, sum(Value))
Timestamp                    TestId                                Signal1  Signal2  Signal3
2021-01-01 12:00:30.0000000  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  23400    0.113    77.5
2021-01-01 12:00:31.0000000  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  23450    0.114    75.4
2021-01-01 12:00:32.0000000  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  22450    0.113    80.05

Alternatively, you can pack each signal name and value into a property bag per (Timestamp, TestId) pair and then unpack the bag into columns:
datatable(Timestamp:datetime, TestId:string, SignalName:string, Value:double)
[
datetime(2021-01-01 12:00:30), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal1", 23400,
datetime(2021-01-01 12:00:30), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal2", 0.113,
datetime(2021-01-01 12:00:30), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal3", 77.5,
datetime(2021-01-01 12:00:31), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal1", 23450,
datetime(2021-01-01 12:00:31), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal2", 0.114,
datetime(2021-01-01 12:00:31), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal3", 75.4,
datetime(2021-01-01 12:00:32), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal1", 22450,
datetime(2021-01-01 12:00:32), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal2", 0.113,
datetime(2021-01-01 12:00:32), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal3", 80.05
]
| project Timestamp, TestId, P = pack(SignalName, Value)
| summarize make_bag(P) by Timestamp, TestId
| evaluate bag_unpack(bag_P)
Timestamp                    TestId                                Signal1  Signal2  Signal3
2021-01-01 12:00:30.0000000  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  23400    0.113    77.5
2021-01-01 12:00:31.0000000  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  23450    0.114    75.4
2021-01-01 12:00:32.0000000  cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3  22450    0.113    80.05

Related

How to loop st_distance through list

My goal is to apply the st_distance function to a very large data frame,
yet because the data frame concerns multiple individuals, I split it using the purrr package and the split function.
I have seen the use of 'lists' and 'for loops' in the past, but I have no experience with these.
Below is a fraction of my dataset; I have split the data frame by ID into a list with 43 elements.
The st_distance function I plan to use looks something like this, if it were applied to the full data frame rather than the split list:
PART 2:
I want to do the same as explained by Dave2e, but now for geosphere::bearing.
I have attached longitude and latitude in WGS84 to the initial data frame, which now looks like this:
ID Date Time Datetime Long Lat x y
10_17 4/18/2017 15:02:00 4/18/2017 15:02 379800.5 5181001 -91.72272 46.35156
10_17 4/20/2017 6:00:00 4/20/2017 6:00 383409 5179885 -91.7044 46.34891
10_17 4/21/2017 21:02:00 4/21/2017 21:02 383191.2 5177960 -91.72297 46.35134
10_24 4/22/2017 10:03:00 4/22/2017 10:03 383448.6 5179918 -91.72298 46.35134
10_17 4/23/2017 12:01:00 4/23/2017 12:01 378582.5 5182110 -91.7242 46.34506
10_24 4/24/2017 1:00:00 4/24/2017 1:00 383647.4 5180009 -91.72515 46.34738
10_24 4/25/2017 16:01:00 4/25/2017 16:01 383407.9 5179872 -91.7184 46.32236
10_17 4/26/2017 18:02:00 4/26/2017 18:02 380691.9 5179353 -91.65361 46.34712
10_36 4/27/2017 20:00:00 4/27/2017 20:00 382521.9 5175266 -91.66127 46.3485
10_36 4/29/2017 11:01:00 4/29/2017 11:01 383443.8 5179909 -91.70303 46.35451
10_36 4/30/2017 0:00:00 4/30/2017 0:00 383060.8 5178361 -91.6685 46.32941
10_40 4/30/2017 13:02:00 4/30/2017 13:02 383426.3 5179873 -91.70263 46.35481
10_40 5/2/2017 17:02:00 5/2/2017 17:02 383393.7 5179883 -91.67099 46.34138
10_40 5/3/2017 6:01:00 5/3/2017 6:01 382875.8 5179376 -91.66324 46.34763
10_88 5/3/2017 19:02:00 5/3/2017 19:02 383264.3 5179948 -91.73075 46.3684
10_88 5/4/2017 8:01:00 5/4/2017 8:01 378554.4 5181966 -91.70413 46.35429
10_88 5/4/2017 21:03:00 5/4/2017 21:03 379830.5 5177232 -91.66452 46.37274
I then try a function similar to the one below, with the coordinates changed to x and y, but it leads to an error:
dis_list <- split(data, data$ID)
answer <- lapply(dis_list, function(df) {
  start <- df[-1, c("x", "y")] %>%
    st_as_sf(coords = c('x', 'y'))
  end <- df[-nrow(df), c("x", "y")] %>%
    st_as_sf(coords = c('x', 'y'))
  angles <- geosphere::bearing(start, end)
  df$angles <- c(NA, angles)
  df
})
answer
which gives the error
Error in .pointsToMatrix(p1) :
'list' object cannot be coerced to type 'double'
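For the PART 2 error: geosphere::bearing() expects plain longitude/latitude vectors or two-column matrices, not sf objects (which are list-based), hence the coercion error. A minimal sketch of a workaround, assuming the x and y columns already hold WGS84 longitude and latitude:
library(geosphere)
dis_list <- split(data, data$ID)
answer <- lapply(dis_list, function(df) {
  # plain two-column (lon, lat) matrices instead of sf objects
  to <- as.matrix(df[-1, c("x", "y")])           # rows 2..n
  from <- as.matrix(df[-nrow(df), c("x", "y")])  # rows 1..n-1
  # initial bearing in degrees from each fix to the following fix
  df$angles <- c(NA, geosphere::bearing(from, to))
  df
})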
Here is a basic solution. I split the original data into multiple data frames using split() and then wrapped the distance function in lapply().
data <- read.table(header=TRUE, text="ID Date Time Datetime time2 Long Lat
10_17 4/18/2017 15:02:00 4/18/2017 15:02 379800.5 5181001
10_17 4/20/2017 6:00:00 4/20/2017 6:00 383409 5179885
10_17 4/21/2017 21:02:00 4/21/2017 21:02 383191.2 5177960
10_24 4/22/2017 10:03:00 4/22/2017 10:03 383448.6 5179918
10_17 4/23/2017 12:01:00 4/23/2017 12:01 378582.5 5182110
10_24 4/24/2017 1:00:00 4/24/2017 1:00 383647.4 5180009
10_24 4/25/2017 16:01:00 4/25/2017 16:01 383407.9 5179872
10_17 4/26/2017 18:02:00 4/26/2017 18:02 380691.9 5179353
10_36 4/27/2017 20:00:00 4/27/2017 20:00 382521.9 5175266
10_36 4/29/2017 11:01:00 4/29/2017 11:01 383443.8 5179909
10_36 4/30/2017 0:00:00 4/30/2017 0:00 383060.8 5178361
10_40 4/30/2017 13:02:00 4/30/2017 13:02 383426.3 5179873
10_40 5/2/2017 17:02:00 5/2/2017 17:02 383393.7 5179883
10_40 5/3/2017 6:01:00 5/3/2017 6:01 382875.8 5179376
10_88 5/3/2017 19:02:00 5/3/2017 19:02 383264.3 5179948
10_88 5/4/2017 8:01:00 5/4/2017 8:01 378554.4 5181966
10_88 5/4/2017 21:03:00 5/4/2017 21:03 379830.5 5177232")
#EPSG:32615 32615
library(sf)
library(magrittr)
dfs <- split(data, data$ID)
answer <- lapply(dfs, function(df) {
  # convert to an sf object and specify the coordinate system
  start <- df[-1, c("Long", "Lat")] %>%
    st_as_sf(coords = c('Long', 'Lat')) %>%
    st_set_crs(32615)
  end <- df[-nrow(df), c("Long", "Lat")] %>%
    st_as_sf(coords = c('Long', 'Lat')) %>%
    st_set_crs(32615)
  # long_lat <- st_transform(start, 4326)
  distances <- sf::st_distance(start, end, by_element = TRUE)
  df$distances <- c(NA, distances)
  df
})
answer
$`10_17`
ID Date Time Datetime time2 Long Lat distances
1 10_17 4/18/2017 15:02:00 4/18/2017 15:02 379800.5 5181001 NA
2 10_17 4/20/2017 6:00:00 4/20/2017 6:00 383409.0 5179885 3777.132
3 10_17 4/21/2017 21:02:00 4/21/2017 21:02 383191.2 5177960 1937.282
5 10_17 4/23/2017 12:01:00 4/23/2017 12:01 378582.5 5182110 6201.824
8 10_17 4/26/2017 18:02:00 4/26/2017 18:02 380691.9 5179353 3471.400
$`10_24`
ID Date Time Datetime time2 Long Lat distances
4 10_24 4/22/2017 10:03:00 4/22/2017 10:03 383448.6 5179918 NA
6 10_24 4/24/2017 1:00:00 4/24/2017 1:00 383647.4 5180009 218.6377
7 10_24 4/25/2017 16:01:00 4/25/2017 16:01 383407.9 5179872 275.9153
There should be an easier way to calculate distances between rows instead of creating 2 series of points.
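For what it is worth, since the Long and Lat columns in this data are really projected UTM coordinates (EPSG:32615, in metres), one simpler sketch is to take the plain Euclidean distance between consecutive rows, with no point geometries at all; on a projected CRS this gives the same numbers as st_distance:
answer2 <- lapply(split(data, data$ID), function(df) {
  # distance in metres between consecutive fixes, assuming Long/Lat are UTM easting/northing
  df$distances <- c(NA, sqrt(diff(df$Long)^2 + diff(df$Lat)^2))
  df
})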
Referenced: Converting table columns to spatial objects

R code (Rstats) calculating unemployment rate based off columns in long form data

I am trying to calculate the unemployment rate based on the data below and add it as new rows to the data table. I want to divide unemployed by labourforce for each date and add each data point as a new row.
Essentially, I am trying to go from this
date        series_1     value
2021-01-01  labourforce  13793
2021-02-01  labourforce  13812
2021-03-01  labourforce  13856
2021-01-01  unemployed   875
2021-02-01  unemployed   805
2021-03-01  unemployed   778
to this
date        series_1          value
2021-01-01  labourforce       13793
2021-02-01  labourforce       13812
2021-03-01  labourforce       13856
2021-01-01  unemployed        875
2021-02-01  unemployed        805
2021-03-01  unemployed        778
2021-01-01  unemploymentrate  6.3
2021-02-01  unemploymentrate  5.8
2021-03-01  unemploymentrate  5.6
Here is my code so far. I know the last line is wrong. Any suggestions or ideas are welcome!
longdata %>%
  group_by(date) %>%
  summarise(series_1 = 'unemploymentrate',
            value = series_1$unemployed/series_1$labourforce))
For each day, you can get the ratio of 'unemployed' to 'labourforce' and add it as new rows to your original dataset.
library(dplyr)
df %>%
  group_by(date) %>%
  summarise(value = value[series_1 == 'unemployed']/value[series_1 == 'labourforce'] * 100,
            series_1 = 'unemploymentrate') %>%
  bind_rows(df) %>%
  arrange(series_1)
# date value series_1
# <chr> <dbl> <chr>
#1 2021-01-01 13793 labourforce
#2 2021-02-01 13812 labourforce
#3 2021-03-01 13856 labourforce
#4 2021-01-01 875 unemployed
#5 2021-02-01 805 unemployed
#6 2021-03-01 778 unemployed
#7 2021-01-01 6.34 unemploymentrate
#8 2021-02-01 5.83 unemploymentrate
#9 2021-03-01 5.61 unemploymentrate
Try:
library(dplyr)
library(tidyr)
df %>%
  pivot_wider(names_from = series_1, values_from = value) %>%
  mutate(unemploymentrate = round(unemployed*100/labourforce, 2)) %>%
  pivot_longer(-1, names_to = "series_1", values_to = "value") %>%
  mutate(series_1 = factor(series_1, levels = c("labourforce", "unemployed", "unemploymentrate"))) %>%
  arrange(series_1, date)
#> # A tibble: 9 x 3
#> date series_1 value
#> <chr> <fct> <dbl>
#> 1 2021-01-01 labourforce 13793
#> 2 2021-02-01 labourforce 13812
#> 3 2021-03-01 labourforce 13856
#> 4 2021-01-01 unemployed 875
#> 5 2021-02-01 unemployed 805
#> 6 2021-03-01 unemployed 778
#> 7 2021-01-01 unemploymentrate 6.34
#> 8 2021-02-01 unemploymentrate 5.83
#> 9 2021-03-01 unemploymentrate 5.61
Created on 2021-04-23 by the reprex package (v2.0.0)
data
df <- structure(list(date = c("2021-01-01", "2021-02-01", "2021-03-01",
"2021-01-01", "2021-02-01", "2021-03-01"), series_1 = c("labourforce",
"labourforce", "labourforce", "unemployed", "unemployed", "unemployed"
), value = c(13793L, 13812L, 13856L, 875L, 805L, 778L)), class = "data.frame", row.names = c(NA,
-6L))

How to filter rows based on time (hh:mm:ss) using dplyr in tidyverse in R?

This is my data
library(tidyverse)
a <- tribble(
  ~"Date", ~"Time", ~"Name", ~"Value",
  "2020-06-03", "00:15:00", "DR.RADHAKRISHNAN SALAI", 0.166,
  "2020-06-03", "00:30:00", "DR.RADHAKRISHNAN SALAI", 0.867,
  "2020-06-03", "00:45:00", "DR.RADHAKRISHNAN SALAI", 0.906,
  "2020-06-03", "01:00:00", "DR.RADHAKRISHNAN SALAI", 0.677,
  "2020-06-03", "01:15:00", "DR.RADHAKRISHNAN SALAI", 0.077
)
Solution needed:
I wanted to filter all rows based on time (e.g. between 00:15:00 and 00:45:00).
What I tried:
a %>%
  filter(Time => hms::as.hms(00:15:00) | Time <= hms::as.hms(00:45:00))
But I didn't end up with what I expected. Please help.
Use hour and minute as appropriate
library(lubridate)
library(tidyverse)
a <- tribble(
  ~"Date", ~"Time", ~"Name", ~"Value",
  "2020-06-03", "00:15:00", "DR.RADHAKRISHNAN SALAI", 0.166,
  "2020-06-03", "00:30:00", "DR.RADHAKRISHNAN SALAI", 0.867,
  "2020-06-03", "00:45:00", "DR.RADHAKRISHNAN SALAI", 0.906,
  "2020-06-03", "01:00:00", "DR.RADHAKRISHNAN SALAI", 0.677,
  "2020-06-03", "01:15:00", "DR.RADHAKRISHNAN SALAI", 0.077
)
a %>%
  filter(hour(hms::as_hms(Time)) == 0,
         between(minute(hms::as_hms(Time)), 15, 45))
#> # A tibble: 3 x 4
#> Date Time Name Value
#> <chr> <chr> <chr> <dbl>
#> 1 2020-06-03 00:15:00 DR.RADHAKRISHNAN SALAI 0.166
#> 2 2020-06-03 00:30:00 DR.RADHAKRISHNAN SALAI 0.867
#> 3 2020-06-03 00:45:00 DR.RADHAKRISHNAN SALAI 0.906
Created on 2020-11-08 by the reprex package (v0.3.0)
Using base R:
subset(a, as.POSIXct(Time, format = "%T") >= as.POSIXct('00:15:00', format = '%T') &
          as.POSIXct(Time, format = "%T") <= as.POSIXct('00:45:00', format = '%T'))
# Date Time Name Value
# <chr> <chr> <chr> <dbl>
#1 2020-06-03 00:15:00 DR.RADHAKRISHNAN SALAI 0.166
#2 2020-06-03 00:30:00 DR.RADHAKRISHNAN SALAI 0.867
#3 2020-06-03 00:45:00 DR.RADHAKRISHNAN SALAI 0.906
Does this work:
library(dplyr)
a %>%
  mutate(min = as.numeric(substr(Time, 1, 2)) * 60 + as.numeric(substr(Time, 4, 5))) %>%
  filter(between(min, 15, 45)) %>%
  select(-min)
# A tibble: 3 x 4
Date Time Name Value
<chr> <chr> <chr> <dbl>
1 2020-06-03 00:15:00 DR.RADHAKRISHNAN SALAI 0.166
2 2020-06-03 00:30:00 DR.RADHAKRISHNAN SALAI 0.867
3 2020-06-03 00:45:00 DR.RADHAKRISHNAN SALAI 0.906
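For completeness, the hms comparison the question attempted also works once the syntax is fixed (>= rather than =>, & rather than |, and quoted time literals) - a minimal sketch, assuming Time stays a character column:
library(dplyr)
library(hms)
a %>%
  filter(as_hms(Time) >= as_hms("00:15:00"),
         as_hms(Time) <= as_hms("00:45:00"))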

How to resolve the error "wrong embedding dimension" in the cajolst R function?

When I try to use the cajolst function from the urca package I get a strange error.
Could you please guide me on how to resolve the problem?
result<-urca::cajolst(data ,trend = FALSE, K = 2, season = NULL)
Error in embed(diff(x), K) : wrong embedding dimension.
dates A G
2016-11-30 0 0
2016-12-01 -3.53 3.198
2016-12-02 -2.832 8.703
2016-12-04 -2.666 7.799
2016-12-05 -0.54 7.701
2016-12-06 -1.296 4.685
2016-12-07 -1.785 -4.587
2016-12-08 -6.834 -3.696
2016-12-09 -9.624 -5.461
2016-12-11 -11.374 -0.423
2016-12-12 -6.037 -1.614
2016-12-13 -5.934 -3.231
2016-12-14 -7.279 1.072
2016-12-15 -7.859 -4.823
2016-12-16 -15.132 10.838
2016-12-19 -15.345 11.5
2016-12-20 -15.673 6.639
2016-12-21 -15.391 11.162
2016-12-22 -14.357 7.032
2016-12-23 -14.99 12.355
2016-12-26 -15.626 10.944
2016-12-27 -12.297 10.215
2016-12-28 -13.967 5.957
2016-12-29 -12.946 3.446
2016-12-30 -19.681 10.274
2017-01-02 -18.24 8.781
2017-01-03 -16.83 1.116
2017-01-04 -18.189 -0.036
2017-01-05 -15.897 -1.441
2017-01-06 -20.196 -8.534
2017-01-09 -14.57 -28.768
2017-01-10 -13.27 -29.821
2017-01-11 -8.85 -38.881
2017-01-12 -6.375 -50.885
2017-01-13 -8.056 -51.321
2017-01-16 -5.217 -63.619
2017-01-17 -4.75 -39.163
2017-01-18 3.505 -46.309
2017-01-19 10.939 -45.825
2017-01-20 9.248 -42.973
2017-01-23 9.532 -33.396
2017-01-24 4.235 -31.38
2017-01-25 -1.885 -19.21
2017-01-26 -5.027 -15.74
2017-01-27 0.015 -23.029
2017-01-30 -0.685 -30.773
2017-01-31 -2.692 -25.544
2017-02-01 -2.654 -17.912
2017-02-02 4.002 -43.309
2017-02-03 4.813 -52.627
2017-02-06 7.049 -49.965
2017-02-07 10.003 -40.568
2017-02-08 8.996 -39.828
2017-02-09 7.047 -41.19
2017-02-10 7.656 -50.853
2017-02-13 4.986 -41.318
2017-02-14 8.493 -51.946
2017-02-15 12.547 -59.538
2017-02-16 10.327 -54.496
2017-02-17 7.09 -57.571
2017-02-20 11.633 -54.91
2017-02-21 12.664 -51.597
2017-02-22 16.103 -57.819
2017-02-23 14.25 -51.336
2017-02-24 7.794 -54.898
2017-02-27 15.27 -55.754
2017-02-28 19.984 -58.37
2017-03-01 23.899 -70.73
2017-03-02 16.63 -56.29
2017-03-03 16.443 -55.858
2017-03-06 17.901 -59.377
2017-03-07 19.067 -64.383
2017-03-08 17.219 -57.829
2017-03-09 15.694 -55.022
2017-03-10 17.351 -60.431
2017-03-13 18.945 -59.79
2017-03-14 20.001 -64.848
2017-03-15 23.852 -73.806
2017-03-16 22.697 -64.191
2017-03-17 26.892 -65.328
2017-03-20 29.221 -72.764
2017-03-21 25.165 -53.427
2017-03-22 22.998 -51.676
2017-03-23 20.072 -40.57
2017-03-24 20.758 -43.654
2017-03-27 20.062 -33.672
2017-03-28 22.066 -47.184
2017-03-29 22.363 -54.57
2017-03-30 20.684 -48.199
2017-03-31 17.056 -40.887
2017-04-03 19.12 -39.618
2017-04-04 16.359 -37.1
2017-04-05 18.643 -32.734
2017-04-06 14.708 -30.455
2017-04-07 8.403 -33.553
2017-04-10 6.072 -29.048
2017-04-11 5.186 -20.696
2017-04-12 4.248 -20.924
2017-04-13 12.803 -31.075
2017-04-14 12.566 -29.768
2017-04-17 14.065 -28.906
2017-04-18 14.5 4.121
2017-04-19 13.865 8.835
2017-04-20 16.126 6.191
2017-04-21 17.591 3.77
2017-04-24 22.3 -2.497
2017-04-25 22.731 7.408
2017-04-26 19.146 18.45
2017-04-27 19.052 25.541
2017-04-28 21.889 26.878
2017-05-01 27.323 14.362
2017-05-02 29.93 17.525
2017-05-03 19.835 29.856
2017-05-04 19.683 36.72
2017-05-05 13.545 41.055
2017-05-08 14.165 43.544
2017-05-09 11.325 49.978
2017-05-10 10.143 47.072
2017-05-11 13.718 38.901
2017-05-12 14.216 36.017
2017-05-15 13.701 33.797
2017-05-16 13.505 33.867
2017-05-17 13.456 38.004
2017-05-18 12.613 37.758
2017-05-19 11.166 40.367
2017-05-22 12.221 34.022
2017-05-23 13.682 29.793
2017-05-24 10.05 26.701
2017-05-25 10.122 31.394
2017-05-26 7.592 20.073
2017-05-29 6.796 23.809
2017-05-30 9.638 16.1
2017-05-31 7.983 29.043
2017-06-01 3.594 39.557
2017-06-02 8.763 27.863
2017-06-05 12.157 22.397
2017-06-06 13.383 19.053
2017-06-07 20.52 17.449
2017-06-08 19.534 -1.615
2017-06-09 16.011 -1.989
2017-06-12 9.153 -9.294
2017-06-13 4.295 -0.897
2017-06-14 9.743 -9.818
2017-06-15 10.386 -8.255
2017-06-16 11.983 -12.522
2017-06-19 9.513 -12.931
2017-06-20 10.298 -21.024
2017-06-21 11.087 -11.801
2017-06-22 4.472 -9.048
2017-06-23 9.416 -9.592
2017-06-26 9.686 -12.006
2017-06-27 6.424 -2.632
2017-06-28 3.062 -1.016
2017-06-29 5.593 -0.825
2017-06-30 3.531 0.914
2017-07-03 3.208 -2.596
2017-07-04 -6.373 4.289
2017-07-05 -5.149 5.917
2017-07-06 -6.104 12.75
2017-07-07 -9.565 1.615
2017-07-10 -8.961 -0.053
2017-07-11 -4.065 -8.541
2017-07-12 -10.133 -11.286
2017-07-13 -6.223 -15.181
2017-07-14 -1.524 -14.396
2017-07-17 -1.613 -14.61
2017-07-18 5.781 -35.473
2017-07-19 8.243 -44.186
2017-07-20 7.665 -49.857
2017-07-21 0.485 -41.286
2017-07-24 -0.638 -39.127
2017-07-25 0.767 -40.952
2017-07-26 3.566 -44.388
2017-07-27 6.834 -42.543
2017-07-28 1.306 -37.657
2017-07-31 5.839 -34.048
2017-08-01 5.838 -28.939
2017-08-02 7.298 -26.566
2017-08-03 6.804 -32.876
2017-08-04 8.989 -38.618
2017-08-07 8.862 -36.676
2017-08-08 8.234 -40.893
2017-08-09 7.39 -35.16
2017-08-10 8.593 -35.555
2017-08-11 7.253 -35.175
2017-08-14 5.593 -33.644
2017-08-15 4.528 -37.82
2017-08-16 6.752 -53.217
2017-08-17 6.284 -49.252
2017-08-18 4.765 -55.602
2017-08-21 3.905 -54.32
2017-08-22 1.76 -57.853
2017-08-23 0.406 -58.925
2017-08-24 -2.438 -58.098
2017-08-25 -0.791 -56.682
2017-08-28 2.173 -51.278
2017-08-29 2.523 -54.353
2017-08-30 4.482 -46.325
2017-08-31 0.246 -52.567
2017-09-01 -4.214 -53.636
2017-09-04 -4.548 -52.735
2017-09-05 -1.781 -50.421
2017-09-06 -10.463 -51.122
2017-09-07 -13.119 -52.433
2017-09-08 -11.716 -43.493
2017-09-11 -16.15 -43.142
2017-09-12 -12.478 -29.335
2017-09-13 -16.457 -31.697
2017-09-14 -14.615 -15.13
2017-09-15 -13.911 3.023
One of the issues is that the 'dates' column is also included, and secondly, season should not be NULL - it can be FALSE, or you can specify an integer value:
library(urca)
out <- cajolst(data[-1], trend = FALSE, K = 2, season = FALSE)
If there is a seasonal effect and it is quarterly, the value would be 4:
out1 <- cajolst(data[-1], trend = FALSE, K = 2, season = 4)
out1
#####################################################
# Johansen-Procedure Unit Root / Cointegration Test #
#####################################################
#The value of the test statistic is: 3.6212 13.2233
data
data <- structure(list(dates = c("2016-11-30", "2016-12-01", "2016-12-02",
"2016-12-04", "2016-12-05", "2016-12-06", "2016-12-07", "2016-12-08",
"2016-12-09", "2016-12-11", "2016-12-12", "2016-12-13", "2016-12-14",
"2016-12-15", "2016-12-16", "2016-12-19", "2016-12-20", "2016-12-21",
"2016-12-22", "2016-12-23", "2016-12-26", "2016-12-27", "2016-12-28",
"2016-12-29", "2016-12-30", "2017-01-02", "2017-01-03", "2017-01-04",
"2017-01-05", "2017-01-06", "2017-01-09", "2017-01-10", "2017-01-11",
"2017-01-12", "2017-01-13", "2017-01-16", "2017-01-17", "2017-01-18",
"2017-01-19", "2017-01-20", "2017-01-23", "2017-01-24", "2017-01-25",
"2017-01-26", "2017-01-27", "2017-01-30", "2017-01-31", "2017-02-01",
"2017-02-02", "2017-02-03", "2017-02-06", "2017-02-07", "2017-02-08",
"2017-02-09", "2017-02-10", "2017-02-13", "2017-02-14", "2017-02-15",
"2017-02-16", "2017-02-17", "2017-02-20", "2017-02-21", "2017-02-22",
"2017-02-23", "2017-02-24", "2017-02-27", "2017-02-28", "2017-03-01",
"2017-03-02", "2017-03-03", "2017-03-06", "2017-03-07", "2017-03-08",
"2017-03-09", "2017-03-10", "2017-03-13", "2017-03-14", "2017-03-15",
"2017-03-16", "2017-03-17", "2017-03-20", "2017-03-21", "2017-03-22",
"2017-03-23", "2017-03-24", "2017-03-27", "2017-03-28", "2017-03-29",
"2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04", "2017-04-05",
"2017-04-06", "2017-04-07", "2017-04-10", "2017-04-11", "2017-04-12",
"2017-04-13", "2017-04-14", "2017-04-17", "2017-04-18", "2017-04-19",
"2017-04-20", "2017-04-21", "2017-04-24", "2017-04-25", "2017-04-26",
"2017-04-27", "2017-04-28", "2017-05-01", "2017-05-02", "2017-05-03",
"2017-05-04", "2017-05-05", "2017-05-08", "2017-05-09", "2017-05-10",
"2017-05-11", "2017-05-12", "2017-05-15", "2017-05-16", "2017-05-17",
"2017-05-18", "2017-05-19", "2017-05-22", "2017-05-23", "2017-05-24",
"2017-05-25", "2017-05-26", "2017-05-29", "2017-05-30", "2017-05-31",
"2017-06-01", "2017-06-02", "2017-06-05", "2017-06-06", "2017-06-07",
"2017-06-08", "2017-06-09", "2017-06-12", "2017-06-13", "2017-06-14",
"2017-06-15", "2017-06-16", "2017-06-19", "2017-06-20", "2017-06-21",
"2017-06-22", "2017-06-23", "2017-06-26", "2017-06-27", "2017-06-28",
"2017-06-29", "2017-06-30", "2017-07-03", "2017-07-04", "2017-07-05",
"2017-07-06", "2017-07-07", "2017-07-10", "2017-07-11", "2017-07-12",
"2017-07-13", "2017-07-14", "2017-07-17", "2017-07-18", "2017-07-19",
"2017-07-20", "2017-07-21", "2017-07-24", "2017-07-25", "2017-07-26",
"2017-07-27", "2017-07-28", "2017-07-31", "2017-08-01", "2017-08-02",
"2017-08-03", "2017-08-04", "2017-08-07", "2017-08-08", "2017-08-09",
"2017-08-10", "2017-08-11", "2017-08-14", "2017-08-15", "2017-08-16",
"2017-08-17", "2017-08-18", "2017-08-21", "2017-08-22", "2017-08-23",
"2017-08-24", "2017-08-25", "2017-08-28", "2017-08-29", "2017-08-30",
"2017-08-31", "2017-09-01", "2017-09-04", "2017-09-05", "2017-09-06",
"2017-09-07", "2017-09-08", "2017-09-11", "2017-09-12", "2017-09-13",
"2017-09-14", "2017-09-15"), A = c(0, -3.53, -2.832, -2.666,
-0.54, -1.296, -1.785, -6.834, -9.624, -11.374, -6.037, -5.934,
-7.279, -7.859, -15.132, -15.345, -15.673, -15.391, -14.357,
-14.99, -15.626, -12.297, -13.967, -12.946, -19.681, -18.24,
-16.83, -18.189, -15.897, -20.196, -14.57, -13.27, -8.85, -6.375,
-8.056, -5.217, -4.75, 3.505, 10.939, 9.248, 9.532, 4.235, -1.885,
-5.027, 0.015, -0.685, -2.692, -2.654, 4.002, 4.813, 7.049, 10.003,
8.996, 7.047, 7.656, 4.986, 8.493, 12.547, 10.327, 7.09, 11.633,
12.664, 16.103, 14.25, 7.794, 15.27, 19.984, 23.899, 16.63, 16.443,
17.901, 19.067, 17.219, 15.694, 17.351, 18.945, 20.001, 23.852,
22.697, 26.892, 29.221, 25.165, 22.998, 20.072, 20.758, 20.062,
22.066, 22.363, 20.684, 17.056, 19.12, 16.359, 18.643, 14.708,
8.403, 6.072, 5.186, 4.248, 12.803, 12.566, 14.065, 14.5, 13.865,
16.126, 17.591, 22.3, 22.731, 19.146, 19.052, 21.889, 27.323,
29.93, 19.835, 19.683, 13.545, 14.165, 11.325, 10.143, 13.718,
14.216, 13.701, 13.505, 13.456, 12.613, 11.166, 12.221, 13.682,
10.05, 10.122, 7.592, 6.796, 9.638, 7.983, 3.594, 8.763, 12.157,
13.383, 20.52, 19.534, 16.011, 9.153, 4.295, 9.743, 10.386, 11.983,
9.513, 10.298, 11.087, 4.472, 9.416, 9.686, 6.424, 3.062, 5.593,
3.531, 3.208, -6.373, -5.149, -6.104, -9.565, -8.961, -4.065,
-10.133, -6.223, -1.524, -1.613, 5.781, 8.243, 7.665, 0.485,
-0.638, 0.767, 3.566, 6.834, 1.306, 5.839, 5.838, 7.298, 6.804,
8.989, 8.862, 8.234, 7.39, 8.593, 7.253, 5.593, 4.528, 6.752,
6.284, 4.765, 3.905, 1.76, 0.406, -2.438, -0.791, 2.173, 2.523,
4.482, 0.246, -4.214, -4.548, -1.781, -10.463, -13.119, -11.716,
-16.15, -12.478, -16.457, -14.615, -13.911), G = c(0, 3.198,
8.703, 7.799, 7.701, 4.685, -4.587, -3.696, -5.461, -0.423, -1.614,
-3.231, 1.072, -4.823, 10.838, 11.5, 6.639, 11.162, 7.032, 12.355,
10.944, 10.215, 5.957, 3.446, 10.274, 8.781, 1.116, -0.036, -1.441,
-8.534, -28.768, -29.821, -38.881, -50.885, -51.321, -63.619,
-39.163, -46.309, -45.825, -42.973, -33.396, -31.38, -19.21,
-15.74, -23.029, -30.773, -25.544, -17.912, -43.309, -52.627,
-49.965, -40.568, -39.828, -41.19, -50.853, -41.318, -51.946,
-59.538, -54.496, -57.571, -54.91, -51.597, -57.819, -51.336,
-54.898, -55.754, -58.37, -70.73, -56.29, -55.858, -59.377, -64.383,
-57.829, -55.022, -60.431, -59.79, -64.848, -73.806, -64.191,
-65.328, -72.764, -53.427, -51.676, -40.57, -43.654, -33.672,
-47.184, -54.57, -48.199, -40.887, -39.618, -37.1, -32.734, -30.455,
-33.553, -29.048, -20.696, -20.924, -31.075, -29.768, -28.906,
4.121, 8.835, 6.191, 3.77, -2.497, 7.408, 18.45, 25.541, 26.878,
14.362, 17.525, 29.856, 36.72, 41.055, 43.544, 49.978, 47.072,
38.901, 36.017, 33.797, 33.867, 38.004, 37.758, 40.367, 34.022,
29.793, 26.701, 31.394, 20.073, 23.809, 16.1, 29.043, 39.557,
27.863, 22.397, 19.053, 17.449, -1.615, -1.989, -9.294, -0.897,
-9.818, -8.255, -12.522, -12.931, -21.024, -11.801, -9.048, -9.592,
-12.006, -2.632, -1.016, -0.825, 0.914, -2.596, 4.289, 5.917,
12.75, 1.615, -0.053, -8.541, -11.286, -15.181, -14.396, -14.61,
-35.473, -44.186, -49.857, -41.286, -39.127, -40.952, -44.388,
-42.543, -37.657, -34.048, -28.939, -26.566, -32.876, -38.618,
-36.676, -40.893, -35.16, -35.555, -35.175, -33.644, -37.82,
-53.217, -49.252, -55.602, -54.32, -57.853, -58.925, -58.098,
-56.682, -51.278, -54.353, -46.325, -52.567, -53.636, -52.735,
-50.421, -51.122, -52.433, -43.493, -43.142, -29.335, -31.697,
-15.13, 3.023)), class = "data.frame", row.names = c(NA, -210L
))

Filling missing rows

I have a large data set, a sample is given below:
df <- data.frame(stringsAsFactors=FALSE,
Date = c("2015-10-26", "2015-10-26", "2015-10-26", "2015-10-26",
"2015-10-27", "2015-10-27", "2015-10-27"),
Ticker = c("ANZ", "CBA", "NAB", "WBC", "ANZ", "CBA", "WBC"),
Open = c(29.11, 77.89, 32.69, 31.87, 29.05, 77.61, 31.84),
High = c(29.17, 77.93, 32.76, 31.92, 29.08, 78.1, 31.95),
Low = c(28.89, 77.37, 32.42, 31.71, 28.9, 77.54, 31.65),
Close = c(28.9, 77.5, 32.42, 31.84, 28.94, 77.74, 31.77),
Volume = c(6350170L, 2251288L, 3804239L, 5597684L, 5925519L, 2424679L,
5448863L)
)
The problem I am trying to solve is the missing row for NAB on 2015-10-27.
I want the last value to repeat itself for the missing dates:
Date Ticker Open High Low Close Volume
2 2015-10-27 NAB 32.69 32.76 32.42 32.42 3804239
Any ideas on how to do this?
I have already tried gather + spread, without success.
What if you tried something like this?
library(zoo)
res <- expand.grid(Date = unique(df$Date), Ticker = unique(df$Ticker))
res2 <- merge(res, df, all.x = TRUE)
res2 <- res2[order(res2$Ticker, res2$Date),]
res3 <- na.locf(res2)
res3[order(res3$Date, res3$Ticker),]
# Date Ticker Open High Low Close Volume
#1 2015-10-26 ANZ 29.11 29.17 28.89 28.90 6350170
#3 2015-10-26 CBA 77.89 77.93 77.37 77.50 2251288
#5 2015-10-26 NAB 32.69 32.76 32.42 32.42 3804239
#6 2015-10-26 WBC 31.87 31.92 31.71 31.84 5597684
#2 2015-10-27 ANZ 29.05 29.08 28.90 28.94 5925519
#4 2015-10-27 CBA 77.61 78.10 77.54 77.74 2424679
#8 2015-10-27 NAB 32.69 32.76 32.42 32.42 3804239
#7 2015-10-27 WBC 31.84 31.95 31.65 31.77 5448863
I'm assuming that if a Ticker/Day combo does not exist, you want to create one and LOCF it. This is what the expand.grid does.
tidyr::complete and tidyr::fill are built just for this situation:
library(tidyverse)
df %>%
  complete(Date, Ticker) %>%
  arrange(Ticker) %>%
  fill(names(.)) %>%
  arrange(Date)
#
# # A tibble: 8 x 7
# Date Ticker Open High Low Close Volume
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
# 1 2015-10-26 ANZ 29.11 29.17 28.89 28.90 6350170
# 2 2015-10-26 CBA 77.89 77.93 77.37 77.50 2251288
# 3 2015-10-26 NAB 32.69 32.76 32.42 32.42 3804239
# 4 2015-10-26 WBC 31.87 31.92 31.71 31.84 5597684
# 5 2015-10-27 ANZ 29.05 29.08 28.90 28.94 5925519
# 6 2015-10-27 CBA 77.61 78.10 77.54 77.74 2424679
# 7 2015-10-27 NAB 32.69 32.76 32.42 32.42 3804239
# 8 2015-10-27 WBC 31.84 31.95 31.65 31.77 5448863
Another potential solution (Note: I had to convert your date vector to Date format, but this could be reversed in the final output):
library(tidyr)
library(dplyr)
df <- data.frame(stringsAsFactors=FALSE,
Date = as.Date(c("2015-10-26", "2015-10-26", "2015-10-26", "2015-10-26",
"2015-10-27", "2015-10-27", "2015-10-27")),
Ticker = c("ANZ", "CBA", "NAB", "WBC", "ANZ", "CBA", "WBC"),
Open = c(29.11, 77.89, 32.69, 31.87, 29.05, 77.61, 31.84),
High = c(29.17, 77.93, 32.76, 31.92, 29.08, 78.1, 31.95),
Low = c(28.89, 77.37, 32.42, 31.71, 28.9, 77.54, 31.65),
Close = c(28.9, 77.5, 32.42, 31.84, 28.94, 77.74, 31.77),
Volume = c(6350170L, 2251288L, 3804239L, 5597684L, 5925519L, 2424679L,
5448863L))
tickers <- unique(df$Ticker)
dates <- as.Date(df$Date)
possibilities <- as.data.frame(unique(expand.grid(dates, tickers)))
colnames(possibilities) <- c('Date', 'Ticker')
missing <- anti_join(possibilities, df[, c('Date', 'Ticker')])
missing_filled <- if (nrow(missing) > 0) {
  replacement <- cbind(missing, filter(df, Date == missing$Date - 1, Ticker == missing$Ticker)[, 3:7])
}
final <- arrange(rbind(df, replacement), Date)
