Kusto - Help writing KQL Pivot - azure-data-explorer
In an IoT project we are gathering sensor data in Azure Data Explorer. All sensor data is stored in a "signals" table. To uniquely identify a time series for a given sensor, we query like this:
Signals
| where TestId == "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3" and SignalName == "Signal1"
We want to be able to pivot all time series for a given TestId in the "signals" table from rows into columns.
I have been unable to write a Kusto query that achieves this, and I am hoping for some help on this forum.
AS-IS
The current signals table schema looks like this:
| Timestamp | TestId | SignalName | Value |
|---|---|---|---|
| 2021-01-01 12:00:30 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | Signal1 | 23400 |
| 2021-01-01 12:00:30 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | Signal2 | 0.113 |
| 2021-01-01 12:00:30 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | Signal3 | 77.5 |
| 2021-01-01 12:00:31 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | Signal1 | 23450 |
| 2021-01-01 12:00:31 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | Signal2 | 0.114 |
| 2021-01-01 12:00:31 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | Signal3 | 75.4 |
| 2021-01-01 12:00:32 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | Signal1 | 22450 |
| 2021-01-01 12:00:32 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | Signal2 | 0.113 |
| 2021-01-01 12:00:32 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | Signal3 | 80.05 |
TO-BE
I want to be able to pivot the table to the following schema:
| Timestamp | TestId | Signal1 | Signal2 | Signal3 |
|---|---|---|---|---|
| 2021-01-01 12:00:30 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 23400 | 0.113 | 77.5 |
| 2021-01-01 12:00:31 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 23450 | 0.114 | 75.4 |
| 2021-01-01 12:00:32 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 22450 | 0.113 | 80.05 |
I have tried the following query:
let testId = "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3";
signals
| where TestId == testId
| where SignalName == "Signal1" or SignalName == "Signal2" or SignalName == "Signal3"
| order by Timestamp desc
| evaluate pivot(SignalName)
But in the resulting table the timestamp is repeated: each timestamp appears once per signal, and a default value of 0 is inserted into the other signal columns:
| Timestamp | TestId | Signal1 | Signal2 | Signal3 |
|---|---|---|---|---|
| 2021-01-01 12:00:30 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 23400 | 0 | 0 |
| 2021-01-01 12:00:30 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 0 | 0.113 | 0 |
| 2021-01-01 12:00:30 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 0 | 0 | 77.5 |
| 2021-01-01 12:00:31 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 23450 | 0 | 0 |
| 2021-01-01 12:00:31 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 0 | 0.114 | 0 |
| 2021-01-01 12:00:31 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 0 | 0 | 75.4 |
| 2021-01-01 12:00:32 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 22450 | 0 | 0 |
| 2021-01-01 12:00:32 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 0 | 0.113 | 0 |
| 2021-01-01 12:00:32 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 0 | 0 | 80.05 |
I do not need any aggregation in the pivot, since all signals should have a value at the exact same timestamp.
Can anyone help me write a KQL query for this?
Do I need to create a materialized view in Azure Data Explorer to achieve this? An update policy or a function?
Thanks
You need to specify the aggregation function in the pivot plugin:
datatable(Timestamp:datetime, TestId:string, SignalName:string, Value:double)
[
datetime(2021-01-01 12:00:30), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal1", 23400,
datetime(2021-01-01 12:00:30), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal2", 0.113,
datetime(2021-01-01 12:00:30), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal3", 77.5,
datetime(2021-01-01 12:00:31), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal1", 23450,
datetime(2021-01-01 12:00:31), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal2", 0.114,
datetime(2021-01-01 12:00:31), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal3", 75.4,
datetime(2021-01-01 12:00:32), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal1", 22450,
datetime(2021-01-01 12:00:32), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal2", 0.113,
datetime(2021-01-01 12:00:32), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal3", 80.05
]
| evaluate pivot(SignalName, sum(Value))
| Timestamp | TestId | Signal1 | Signal2 | Signal3 |
|---|---|---|---|---|
| 2021-01-01 12:00:30.0000000 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 23400 | 0.113 | 77.5 |
| 2021-01-01 12:00:31.0000000 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 23450 | 0.114 | 75.4 |
| 2021-01-01 12:00:32.0000000 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 22450 | 0.113 | 80.05 |
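Applied to your own table, the same idea would look roughly like this (a sketch assuming the Signals table and column names from your question; since each timestamp/signal pair carries exactly one value, take_any() can stand in as a "no-op" aggregation instead of sum()):
let testId = "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3";
Signals
| where TestId == testId
| where SignalName in ("Signal1", "Signal2", "Signal3")
// any aggregation works here, because there is exactly one value per (Timestamp, SignalName)
| evaluate pivot(SignalName, take_any(Value))
| order by Timestamp asc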
Please see below for an alternative that packs each row into a property bag with pack() and expands it with bag_unpack():
datatable(Timestamp:datetime, TestId:string, SignalName:string, Value:double)
[
datetime(2021-01-01 12:00:30), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal1", 23400,
datetime(2021-01-01 12:00:30), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal2", 0.113,
datetime(2021-01-01 12:00:30), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal3", 77.5,
datetime(2021-01-01 12:00:31), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal1", 23450,
datetime(2021-01-01 12:00:31), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal2", 0.114,
datetime(2021-01-01 12:00:31), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal3", 75.4,
datetime(2021-01-01 12:00:32), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal1", 22450,
datetime(2021-01-01 12:00:32), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal2", 0.113,
datetime(2021-01-01 12:00:32), "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3", "Signal3", 80.05
]
| project Timestamp, TestId, P = pack(SignalName, Value)
| summarize make_bag(P) by Timestamp, TestId
| evaluate bag_unpack(bag_P)
| Timestamp | TestId | Signal1 | Signal2 | Signal3 |
|---|---|---|---|---|
| 2021-01-01 12:00:30.0000000 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 23400 | 0.113 | 77.5 |
| 2021-01-01 12:00:31.0000000 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 23450 | 0.114 | 75.4 |
| 2021-01-01 12:00:32.0000000 | cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3 | 22450 | 0.113 | 80.05 |
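Against the original table, the same pattern would look roughly like this (again a sketch, assuming the Signals table and the TestId filter from the question):
let testId = "cbb8bff1-ee9d-4ead-bbd6-c9c246d84fd3";
Signals
| where TestId == testId
// pack each row's signal name/value pair into a dynamic property bag
| project Timestamp, TestId, P = pack(SignalName, Value)
// merge the bags of all rows that share the same timestamp
| summarize make_bag(P) by Timestamp, TestId
// expand the merged bag into one column per signal
| evaluate bag_unpack(bag_P)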
Related
How to loop st_distance through list
My goal is to apply the st_distance function to a very large data frame, yet because the data frame concerns multiple individuals, I split it using the purrr package and the split function. I have seen the use of 'lists' and 'for loops' in the past but I have no experience with these. Below is a fraction of my dataset; I have split the data frame by ID into a list with 43 elements. The st_distance function I plan to use looks something like the following, if it were applied to the full data frame rather than the split list:
PART 2: I want to do the same as explained by Dave2e, but now for geosphere::bearing. I have attached long and lat in WGS84 to the initial data frame, which now looks like this:
ID Date Time Datetime Long Lat x y
10_17 4/18/2017 15:02:00 4/18/2017 15:02 379800.5 5181001 -91.72272 46.35156
10_17 4/20/2017 6:00:00 4/20/2017 6:00 383409 5179885 -91.7044 46.34891
10_17 4/21/2017 21:02:00 4/21/2017 21:02 383191.2 5177960 -91.72297 46.35134
10_24 4/22/2017 10:03:00 4/22/2017 10:03 383448.6 5179918 -91.72298 46.35134
10_17 4/23/2017 12:01:00 4/23/2017 12:01 378582.5 5182110 -91.7242 46.34506
10_24 4/24/2017 1:00:00 4/24/2017 1:00 383647.4 5180009 -91.72515 46.34738
10_24 4/25/2017 16:01:00 4/25/2017 16:01 383407.9 5179872 -91.7184 46.32236
10_17 4/26/2017 18:02:00 4/26/2017 18:02 380691.9 5179353 -91.65361 46.34712
10_36 4/27/2017 20:00:00 4/27/2017 20:00 382521.9 5175266 -91.66127 46.3485
10_36 4/29/2017 11:01:00 4/29/2017 11:01 383443.8 5179909 -91.70303 46.35451
10_36 4/30/2017 0:00:00 4/30/2017 0:00 383060.8 5178361 -91.6685 46.32941
10_40 4/30/2017 13:02:00 4/30/2017 13:02 383426.3 5179873 -91.70263 46.35481
10_40 5/2/2017 17:02:00 5/2/2017 17:02 383393.7 5179883 -91.67099 46.34138
10_40 5/3/2017 6:01:00 5/3/2017 6:01 382875.8 5179376 -91.66324 46.34763
10_88 5/3/2017 19:02:00 5/3/2017 19:02 383264.3 5179948 -91.73075 46.3684
10_88 5/4/2017 8:01:00 5/4/2017 8:01 378554.4 5181966 -91.70413 46.35429
10_88 5/4/2017 21:03:00 5/4/2017 21:03 379830.5 5177232 -91.66452 46.37274
I then try a function similar to the one below, but with the coordinates changed to x and y, and it leads to an error:
dis_list <- split(data, data$ID)
answer <- lapply(dis_list, function(df) {
  start <- df[-1, c("x", "y")] %>% st_as_sf(coords = c('x', 'y'))
  end <- df[-nrow(df), c("x", "y")] %>% st_as_sf(coords = c('x', 'y'))
  angles <- geosphere::bearing(start, end)
  df$angles <- c(NA, angles)
  df
})
answer
which gives the error:
Error in .pointsToMatrix(p1) : 'list' object cannot be coerced to type 'double'
Here is a basic solution. I split the original data into multiple data frames using split and then wrapped the distance function in lapply().
data <- read.table(header=TRUE, text="ID Date Time Datetime time2 Long Lat
10_17 4/18/2017 15:02:00 4/18/2017 15:02 379800.5 5181001
10_17 4/20/2017 6:00:00 4/20/2017 6:00 383409 5179885
10_17 4/21/2017 21:02:00 4/21/2017 21:02 383191.2 5177960
10_24 4/22/2017 10:03:00 4/22/2017 10:03 383448.6 5179918
10_17 4/23/2017 12:01:00 4/23/2017 12:01 378582.5 5182110
10_24 4/24/2017 1:00:00 4/24/2017 1:00 383647.4 5180009
10_24 4/25/2017 16:01:00 4/25/2017 16:01 383407.9 5179872
10_17 4/26/2017 18:02:00 4/26/2017 18:02 380691.9 5179353
10_36 4/27/2017 20:00:00 4/27/2017 20:00 382521.9 5175266
10_36 4/29/2017 11:01:00 4/29/2017 11:01 383443.8 5179909
10_36 4/30/2017 0:00:00 4/30/2017 0:00 383060.8 5178361
10_40 4/30/2017 13:02:00 4/30/2017 13:02 383426.3 5179873
10_40 5/2/2017 17:02:00 5/2/2017 17:02 383393.7 5179883
10_40 5/3/2017 6:01:00 5/3/2017 6:01 382875.8 5179376
10_88 5/3/2017 19:02:00 5/3/2017 19:02 383264.3 5179948
10_88 5/4/2017 8:01:00 5/4/2017 8:01 378554.4 5181966
10_88 5/4/2017 21:03:00 5/4/2017 21:03 379830.5 5177232")

# EPSG:32615
library(sf)
library(magrittr)

dfs <- split(data, data$ID)
answer <- lapply(dfs, function(df) {
  # convert to an sf object and specify the coordinate system
  start <- df[-1, c("Long", "Lat")] %>% st_as_sf(coords = c('Long', 'Lat')) %>% st_set_crs(32615)
  end <- df[-nrow(df), c("Long", "Lat")] %>% st_as_sf(coords = c('Long', 'Lat')) %>% st_set_crs(32615)
  # long_lat <- st_transform(start, 4326)
  distances <- sf::st_distance(start, end, by_element = TRUE)
  df$distances <- c(NA, distances)
  df
})
answer

$`10_17`
     ID      Date     Time  Datetime time2     Long     Lat distances
1 10_17 4/18/2017 15:02:00 4/18/2017 15:02 379800.5 5181001        NA
2 10_17 4/20/2017  6:00:00 4/20/2017  6:00 383409.0 5179885  3777.132
3 10_17 4/21/2017 21:02:00 4/21/2017 21:02 383191.2 5177960  1937.282
5 10_17 4/23/2017 12:01:00 4/23/2017 12:01 378582.5 5182110  6201.824
8 10_17 4/26/2017 18:02:00 4/26/2017 18:02 380691.9 5179353  3471.400

$`10_24`
     ID      Date     Time  Datetime time2     Long     Lat distances
4 10_24 4/22/2017 10:03:00 4/22/2017 10:03 383448.6 5179918        NA
6 10_24 4/24/2017  1:00:00 4/24/2017  1:00 383647.4 5180009  218.6377
7 10_24 4/25/2017 16:01:00 4/25/2017 16:01 383407.9 5179872  275.9153

There should be an easier way to calculate distances between rows instead of creating 2 series of points. Referenced: Converting table columns to spatial objects
R code (Rstats) calculating unemployment rate based off columns in long form data
I am trying to calculate the unemployment rate based on the data below and add it as new rows to the data table. I want to divide unemployed by labourforce for each date and add each data point as a row. Essentially, I am trying to go from this

date       series_1    value
2021-01-01 labourforce 13793
2021-02-01 labourforce 13812
2021-03-01 labourforce 13856
2021-01-01 unemployed    875
2021-02-01 unemployed    805
2021-03-01 unemployed    778

to this

date       series_1         value
2021-01-01 labourforce      13793
2021-02-01 labourforce      13812
2021-03-01 labourforce      13856
2021-01-01 unemployed         875
2021-02-01 unemployed         805
2021-03-01 unemployed         778
2021-01-01 unemploymentrate   6.3
2021-02-01 unemploymentrate   5.8
2021-03-01 unemploymentrate   5.6

Here is my code so far. I know the last line is wrong. Any suggestions or ideas are welcome!

longdata %>%
  group_by(date) %>%
  summarise(series_1 = 'unemploymentrate',
            value = series_1$unemployed/series_1$labourforce))
For each date, you can get the ratio of 'unemployed' to 'labourforce' and add it as new rows to your original dataset.

library(dplyr)

df %>%
  group_by(date) %>%
  summarise(value = value[series_1 == 'unemployed']/value[series_1 == 'labourforce'] * 100,
            series_1 = 'unemploymentrate') %>%
  bind_rows(df) %>%
  arrange(series_1)

#  date       value series_1
#  <chr>      <dbl> <chr>
#1 2021-01-01 13793 labourforce
#2 2021-02-01 13812 labourforce
#3 2021-03-01 13856 labourforce
#4 2021-01-01   875 unemployed
#5 2021-02-01   805 unemployed
#6 2021-03-01   778 unemployed
#7 2021-01-01  6.34 unemploymentrate
#8 2021-02-01  5.83 unemploymentrate
#9 2021-03-01  5.61 unemploymentrate
Try:

library(dplyr)
library(tidyr)

df %>%
  pivot_wider(names_from = series_1, values_from = value) %>%
  mutate(unemploymentrate = round(unemployed*100/labourforce, 2)) %>%
  pivot_longer(-1, names_to = "series_1", values_to = "value") %>%
  mutate(series_1 = factor(series_1, levels = c("labourforce", "unemployed", "unemploymentrate"))) %>%
  arrange(series_1, date)
#> # A tibble: 9 x 3
#>   date       series_1         value
#>   <chr>      <fct>            <dbl>
#> 1 2021-01-01 labourforce      13793
#> 2 2021-02-01 labourforce      13812
#> 3 2021-03-01 labourforce      13856
#> 4 2021-01-01 unemployed         875
#> 5 2021-02-01 unemployed         805
#> 6 2021-03-01 unemployed         778
#> 7 2021-01-01 unemploymentrate  6.34
#> 8 2021-02-01 unemploymentrate  5.83
#> 9 2021-03-01 unemploymentrate  5.61

Created on 2021-04-23 by the reprex package (v2.0.0)

data

df <- structure(list(date = c("2021-01-01", "2021-02-01", "2021-03-01",
  "2021-01-01", "2021-02-01", "2021-03-01"), series_1 = c("labourforce",
  "labourforce", "labourforce", "unemployed", "unemployed", "unemployed"),
  value = c(13793L, 13812L, 13856L, 875L, 805L, 778L)),
  class = "data.frame", row.names = c(NA, -6L))
How to filter rows based on time (hh:mm:ss) using dplyr in tidyverse in R?
This is my data:

library(tidyverse)
a <- tribble(
  ~"Date", ~"Time", ~"Name", ~"Value",
  "2020-06-03", "00:15:00", "DR.RADHAKRISHNAN SALAI", 0.166,
  "2020-06-03", "00:30:00", "DR.RADHAKRISHNAN SALAI", 0.867,
  "2020-06-03", "00:45:00", "DR.RADHAKRISHNAN SALAI", 0.906,
  "2020-06-03", "01:00:00", "DR.RADHAKRISHNAN SALAI", 0.677,
  "2020-06-03", "01:15:00", "DR.RADHAKRISHNAN SALAI", 0.077
)

Solution needed: I want to filter all rows based on time (e.g. between 00:15:00 and 00:45:00).

What I tried:

a %>%
  filter(Time => hms::as.hms(00:15:00) | Time <= hms::as.hms(00:45:00))

But I didn't end up with what I expected. Please help.
Use hour and minute as appropriate:

library(lubridate)
library(tidyverse)

a <- tribble(
  ~"Date", ~"Time", ~"Name", ~"Value",
  "2020-06-03", "00:15:00", "DR.RADHAKRISHNAN SALAI", 0.166,
  "2020-06-03", "00:30:00", "DR.RADHAKRISHNAN SALAI", 0.867,
  "2020-06-03", "00:45:00", "DR.RADHAKRISHNAN SALAI", 0.906,
  "2020-06-03", "01:00:00", "DR.RADHAKRISHNAN SALAI", 0.677,
  "2020-06-03", "01:15:00", "DR.RADHAKRISHNAN SALAI", 0.077
)

a %>%
  filter(hour(hms::as_hms(Time)) == 0,
         between(minute(hms::as_hms(Time)), 15, 45))
#> # A tibble: 3 x 4
#>   Date       Time     Name                   Value
#>   <chr>      <chr>    <chr>                  <dbl>
#> 1 2020-06-03 00:15:00 DR.RADHAKRISHNAN SALAI 0.166
#> 2 2020-06-03 00:30:00 DR.RADHAKRISHNAN SALAI 0.867
#> 3 2020-06-03 00:45:00 DR.RADHAKRISHNAN SALAI 0.906

Created on 2020-11-08 by the reprex package (v0.3.0)
Using base R:

subset(a, as.POSIXct(Time, format = "%T") >= as.POSIXct('00:15:00', format = '%T') &
          as.POSIXct(Time, format = "%T") <= as.POSIXct('00:45:00', format = '%T'))
#  Date       Time     Name                   Value
#  <chr>      <chr>    <chr>                  <dbl>
#1 2020-06-03 00:15:00 DR.RADHAKRISHNAN SALAI 0.166
#2 2020-06-03 00:30:00 DR.RADHAKRISHNAN SALAI 0.867
#3 2020-06-03 00:45:00 DR.RADHAKRISHNAN SALAI 0.906
Does this work?

library(dplyr)

a %>%
  mutate(min = as.numeric(substr(Time, 1, 2)) * 60 + as.numeric(substr(Time, 4, 5))) %>%
  filter(between(min, 15, 45)) %>%
  select(-min)
# A tibble: 3 x 4
  Date       Time     Name                   Value
  <chr>      <chr>    <chr>                  <dbl>
1 2020-06-03 00:15:00 DR.RADHAKRISHNAN SALAI 0.166
2 2020-06-03 00:30:00 DR.RADHAKRISHNAN SALAI 0.867
3 2020-06-03 00:45:00 DR.RADHAKRISHNAN SALAI 0.906
How to confront error "wrong embedding dimension" in cajolst R function?
When I try to use cajolst function from urca package I get a strange error. would you please guide me how can i confront the problem? result<-urca::cajolst(data ,trend = FALSE, K = 2, season = NULL) Error in embed(diff(x), K) : wrong embedding dimension. dates A G 2016-11-30 0 0 2016-12-01 -3.53 3.198 2016-12-02 -2.832 8.703 2016-12-04 -2.666 7.799 2016-12-05 -0.54 7.701 2016-12-06 -1.296 4.685 2016-12-07 -1.785 -4.587 2016-12-08 -6.834 -3.696 2016-12-09 -9.624 -5.461 2016-12-11 -11.374 -0.423 2016-12-12 -6.037 -1.614 2016-12-13 -5.934 -3.231 2016-12-14 -7.279 1.072 2016-12-15 -7.859 -4.823 2016-12-16 -15.132 10.838 2016-12-19 -15.345 11.5 2016-12-20 -15.673 6.639 2016-12-21 -15.391 11.162 2016-12-22 -14.357 7.032 2016-12-23 -14.99 12.355 2016-12-26 -15.626 10.944 2016-12-27 -12.297 10.215 2016-12-28 -13.967 5.957 2016-12-29 -12.946 3.446 2016-12-30 -19.681 10.274 2017-01-02 -18.24 8.781 2017-01-03 -16.83 1.116 2017-01-04 -18.189 -0.036 2017-01-05 -15.897 -1.441 2017-01-06 -20.196 -8.534 2017-01-09 -14.57 -28.768 2017-01-10 -13.27 -29.821 2017-01-11 -8.85 -38.881 2017-01-12 -6.375 -50.885 2017-01-13 -8.056 -51.321 2017-01-16 -5.217 -63.619 2017-01-17 -4.75 -39.163 2017-01-18 3.505 -46.309 2017-01-19 10.939 -45.825 2017-01-20 9.248 -42.973 2017-01-23 9.532 -33.396 2017-01-24 4.235 -31.38 2017-01-25 -1.885 -19.21 2017-01-26 -5.027 -15.74 2017-01-27 0.015 -23.029 2017-01-30 -0.685 -30.773 2017-01-31 -2.692 -25.544 2017-02-01 -2.654 -17.912 2017-02-02 4.002 -43.309 2017-02-03 4.813 -52.627 2017-02-06 7.049 -49.965 2017-02-07 10.003 -40.568 2017-02-08 8.996 -39.828 2017-02-09 7.047 -41.19 2017-02-10 7.656 -50.853 2017-02-13 4.986 -41.318 2017-02-14 8.493 -51.946 2017-02-15 12.547 -59.538 2017-02-16 10.327 -54.496 2017-02-17 7.09 -57.571 2017-02-20 11.633 -54.91 2017-02-21 12.664 -51.597 2017-02-22 16.103 -57.819 2017-02-23 14.25 -51.336 2017-02-24 7.794 -54.898 2017-02-27 15.27 -55.754 2017-02-28 19.984 -58.37 2017-03-01 23.899 -70.73 2017-03-02 16.63 -56.29 2017-03-03 16.443 -55.858 2017-03-06 17.901 -59.377 2017-03-07 19.067 -64.383 2017-03-08 17.219 -57.829 2017-03-09 15.694 -55.022 2017-03-10 17.351 -60.431 2017-03-13 18.945 -59.79 2017-03-14 20.001 -64.848 2017-03-15 23.852 -73.806 2017-03-16 22.697 -64.191 2017-03-17 26.892 -65.328 2017-03-20 29.221 -72.764 2017-03-21 25.165 -53.427 2017-03-22 22.998 -51.676 2017-03-23 20.072 -40.57 2017-03-24 20.758 -43.654 2017-03-27 20.062 -33.672 2017-03-28 22.066 -47.184 2017-03-29 22.363 -54.57 2017-03-30 20.684 -48.199 2017-03-31 17.056 -40.887 2017-04-03 19.12 -39.618 2017-04-04 16.359 -37.1 2017-04-05 18.643 -32.734 2017-04-06 14.708 -30.455 2017-04-07 8.403 -33.553 2017-04-10 6.072 -29.048 2017-04-11 5.186 -20.696 2017-04-12 4.248 -20.924 2017-04-13 12.803 -31.075 2017-04-14 12.566 -29.768 2017-04-17 14.065 -28.906 2017-04-18 14.5 4.121 2017-04-19 13.865 8.835 2017-04-20 16.126 6.191 2017-04-21 17.591 3.77 2017-04-24 22.3 -2.497 2017-04-25 22.731 7.408 2017-04-26 19.146 18.45 2017-04-27 19.052 25.541 2017-04-28 21.889 26.878 2017-05-01 27.323 14.362 2017-05-02 29.93 17.525 2017-05-03 19.835 29.856 2017-05-04 19.683 36.72 2017-05-05 13.545 41.055 2017-05-08 14.165 43.544 2017-05-09 11.325 49.978 2017-05-10 10.143 47.072 2017-05-11 13.718 38.901 2017-05-12 14.216 36.017 2017-05-15 13.701 33.797 2017-05-16 13.505 33.867 2017-05-17 13.456 38.004 2017-05-18 12.613 37.758 2017-05-19 11.166 40.367 2017-05-22 12.221 34.022 2017-05-23 13.682 29.793 2017-05-24 10.05 26.701 2017-05-25 10.122 31.394 2017-05-26 7.592 20.073 2017-05-29 6.796 23.809 2017-05-30 
9.638 16.1 2017-05-31 7.983 29.043 2017-06-01 3.594 39.557 2017-06-02 8.763 27.863 2017-06-05 12.157 22.397 2017-06-06 13.383 19.053 2017-06-07 20.52 17.449 2017-06-08 19.534 -1.615 2017-06-09 16.011 -1.989 2017-06-12 9.153 -9.294 2017-06-13 4.295 -0.897 2017-06-14 9.743 -9.818 2017-06-15 10.386 -8.255 2017-06-16 11.983 -12.522 2017-06-19 9.513 -12.931 2017-06-20 10.298 -21.024 2017-06-21 11.087 -11.801 2017-06-22 4.472 -9.048 2017-06-23 9.416 -9.592 2017-06-26 9.686 -12.006 2017-06-27 6.424 -2.632 2017-06-28 3.062 -1.016 2017-06-29 5.593 -0.825 2017-06-30 3.531 0.914 2017-07-03 3.208 -2.596 2017-07-04 -6.373 4.289 2017-07-05 -5.149 5.917 2017-07-06 -6.104 12.75 2017-07-07 -9.565 1.615 2017-07-10 -8.961 -0.053 2017-07-11 -4.065 -8.541 2017-07-12 -10.133 -11.286 2017-07-13 -6.223 -15.181 2017-07-14 -1.524 -14.396 2017-07-17 -1.613 -14.61 2017-07-18 5.781 -35.473 2017-07-19 8.243 -44.186 2017-07-20 7.665 -49.857 2017-07-21 0.485 -41.286 2017-07-24 -0.638 -39.127 2017-07-25 0.767 -40.952 2017-07-26 3.566 -44.388 2017-07-27 6.834 -42.543 2017-07-28 1.306 -37.657 2017-07-31 5.839 -34.048 2017-08-01 5.838 -28.939 2017-08-02 7.298 -26.566 2017-08-03 6.804 -32.876 2017-08-04 8.989 -38.618 2017-08-07 8.862 -36.676 2017-08-08 8.234 -40.893 2017-08-09 7.39 -35.16 2017-08-10 8.593 -35.555 2017-08-11 7.253 -35.175 2017-08-14 5.593 -33.644 2017-08-15 4.528 -37.82 2017-08-16 6.752 -53.217 2017-08-17 6.284 -49.252 2017-08-18 4.765 -55.602 2017-08-21 3.905 -54.32 2017-08-22 1.76 -57.853 2017-08-23 0.406 -58.925 2017-08-24 -2.438 -58.098 2017-08-25 -0.791 -56.682 2017-08-28 2.173 -51.278 2017-08-29 2.523 -54.353 2017-08-30 4.482 -46.325 2017-08-31 0.246 -52.567 2017-09-01 -4.214 -53.636 2017-09-04 -4.548 -52.735 2017-09-05 -1.781 -50.421 2017-09-06 -10.463 -51.122 2017-09-07 -13.119 -52.433 2017-09-08 -11.716 -43.493 2017-09-11 -16.15 -43.142 2017-09-12 -12.478 -29.335 2017-09-13 -16.457 -31.697 2017-09-14 -14.615 -15.13 2017-09-15 -13.911 3.023
One of the issue is that the 'Date' column is also included and secondly, the season is not needed, it can be FALSE or specify an integer value library(urca) out <- cajolst(data[-1] ,trend = FALSE, K = 2, season =FALSE) If there is a season effect and it is `quarterly, the value would be 4 out1 <- cajolst(data[-1] ,trend = FALSE, K = 2, season = 4) out1 ##################################################### # Johansen-Procedure Unit Root / Cointegration Test # ##################################################### #The value of the test statistic is: 3.6212 13.2233 data data <- structure(list(dates = c("2016-11-30", "2016-12-01", "2016-12-02", "2016-12-04", "2016-12-05", "2016-12-06", "2016-12-07", "2016-12-08", "2016-12-09", "2016-12-11", "2016-12-12", "2016-12-13", "2016-12-14", "2016-12-15", "2016-12-16", "2016-12-19", "2016-12-20", "2016-12-21", "2016-12-22", "2016-12-23", "2016-12-26", "2016-12-27", "2016-12-28", "2016-12-29", "2016-12-30", "2017-01-02", "2017-01-03", "2017-01-04", "2017-01-05", "2017-01-06", "2017-01-09", "2017-01-10", "2017-01-11", "2017-01-12", "2017-01-13", "2017-01-16", "2017-01-17", "2017-01-18", "2017-01-19", "2017-01-20", "2017-01-23", "2017-01-24", "2017-01-25", "2017-01-26", "2017-01-27", "2017-01-30", "2017-01-31", "2017-02-01", "2017-02-02", "2017-02-03", "2017-02-06", "2017-02-07", "2017-02-08", "2017-02-09", "2017-02-10", "2017-02-13", "2017-02-14", "2017-02-15", "2017-02-16", "2017-02-17", "2017-02-20", "2017-02-21", "2017-02-22", "2017-02-23", "2017-02-24", "2017-02-27", "2017-02-28", "2017-03-01", "2017-03-02", "2017-03-03", "2017-03-06", "2017-03-07", "2017-03-08", "2017-03-09", "2017-03-10", "2017-03-13", "2017-03-14", "2017-03-15", "2017-03-16", "2017-03-17", "2017-03-20", "2017-03-21", "2017-03-22", "2017-03-23", "2017-03-24", "2017-03-27", "2017-03-28", "2017-03-29", "2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04", "2017-04-05", "2017-04-06", "2017-04-07", "2017-04-10", "2017-04-11", "2017-04-12", "2017-04-13", "2017-04-14", "2017-04-17", "2017-04-18", "2017-04-19", "2017-04-20", "2017-04-21", "2017-04-24", "2017-04-25", "2017-04-26", "2017-04-27", "2017-04-28", "2017-05-01", "2017-05-02", "2017-05-03", "2017-05-04", "2017-05-05", "2017-05-08", "2017-05-09", "2017-05-10", "2017-05-11", "2017-05-12", "2017-05-15", "2017-05-16", "2017-05-17", "2017-05-18", "2017-05-19", "2017-05-22", "2017-05-23", "2017-05-24", "2017-05-25", "2017-05-26", "2017-05-29", "2017-05-30", "2017-05-31", "2017-06-01", "2017-06-02", "2017-06-05", "2017-06-06", "2017-06-07", "2017-06-08", "2017-06-09", "2017-06-12", "2017-06-13", "2017-06-14", "2017-06-15", "2017-06-16", "2017-06-19", "2017-06-20", "2017-06-21", "2017-06-22", "2017-06-23", "2017-06-26", "2017-06-27", "2017-06-28", "2017-06-29", "2017-06-30", "2017-07-03", "2017-07-04", "2017-07-05", "2017-07-06", "2017-07-07", "2017-07-10", "2017-07-11", "2017-07-12", "2017-07-13", "2017-07-14", "2017-07-17", "2017-07-18", "2017-07-19", "2017-07-20", "2017-07-21", "2017-07-24", "2017-07-25", "2017-07-26", "2017-07-27", "2017-07-28", "2017-07-31", "2017-08-01", "2017-08-02", "2017-08-03", "2017-08-04", "2017-08-07", "2017-08-08", "2017-08-09", "2017-08-10", "2017-08-11", "2017-08-14", "2017-08-15", "2017-08-16", "2017-08-17", "2017-08-18", "2017-08-21", "2017-08-22", "2017-08-23", "2017-08-24", "2017-08-25", "2017-08-28", "2017-08-29", "2017-08-30", "2017-08-31", "2017-09-01", "2017-09-04", "2017-09-05", "2017-09-06", "2017-09-07", "2017-09-08", "2017-09-11", "2017-09-12", "2017-09-13", "2017-09-14", "2017-09-15"), A = 
c(0, -3.53, -2.832, -2.666, -0.54, -1.296, -1.785, -6.834, -9.624, -11.374, -6.037, -5.934, -7.279, -7.859, -15.132, -15.345, -15.673, -15.391, -14.357, -14.99, -15.626, -12.297, -13.967, -12.946, -19.681, -18.24, -16.83, -18.189, -15.897, -20.196, -14.57, -13.27, -8.85, -6.375, -8.056, -5.217, -4.75, 3.505, 10.939, 9.248, 9.532, 4.235, -1.885, -5.027, 0.015, -0.685, -2.692, -2.654, 4.002, 4.813, 7.049, 10.003, 8.996, 7.047, 7.656, 4.986, 8.493, 12.547, 10.327, 7.09, 11.633, 12.664, 16.103, 14.25, 7.794, 15.27, 19.984, 23.899, 16.63, 16.443, 17.901, 19.067, 17.219, 15.694, 17.351, 18.945, 20.001, 23.852, 22.697, 26.892, 29.221, 25.165, 22.998, 20.072, 20.758, 20.062, 22.066, 22.363, 20.684, 17.056, 19.12, 16.359, 18.643, 14.708, 8.403, 6.072, 5.186, 4.248, 12.803, 12.566, 14.065, 14.5, 13.865, 16.126, 17.591, 22.3, 22.731, 19.146, 19.052, 21.889, 27.323, 29.93, 19.835, 19.683, 13.545, 14.165, 11.325, 10.143, 13.718, 14.216, 13.701, 13.505, 13.456, 12.613, 11.166, 12.221, 13.682, 10.05, 10.122, 7.592, 6.796, 9.638, 7.983, 3.594, 8.763, 12.157, 13.383, 20.52, 19.534, 16.011, 9.153, 4.295, 9.743, 10.386, 11.983, 9.513, 10.298, 11.087, 4.472, 9.416, 9.686, 6.424, 3.062, 5.593, 3.531, 3.208, -6.373, -5.149, -6.104, -9.565, -8.961, -4.065, -10.133, -6.223, -1.524, -1.613, 5.781, 8.243, 7.665, 0.485, -0.638, 0.767, 3.566, 6.834, 1.306, 5.839, 5.838, 7.298, 6.804, 8.989, 8.862, 8.234, 7.39, 8.593, 7.253, 5.593, 4.528, 6.752, 6.284, 4.765, 3.905, 1.76, 0.406, -2.438, -0.791, 2.173, 2.523, 4.482, 0.246, -4.214, -4.548, -1.781, -10.463, -13.119, -11.716, -16.15, -12.478, -16.457, -14.615, -13.911), G = c(0, 3.198, 8.703, 7.799, 7.701, 4.685, -4.587, -3.696, -5.461, -0.423, -1.614, -3.231, 1.072, -4.823, 10.838, 11.5, 6.639, 11.162, 7.032, 12.355, 10.944, 10.215, 5.957, 3.446, 10.274, 8.781, 1.116, -0.036, -1.441, -8.534, -28.768, -29.821, -38.881, -50.885, -51.321, -63.619, -39.163, -46.309, -45.825, -42.973, -33.396, -31.38, -19.21, -15.74, -23.029, -30.773, -25.544, -17.912, -43.309, -52.627, -49.965, -40.568, -39.828, -41.19, -50.853, -41.318, -51.946, -59.538, -54.496, -57.571, -54.91, -51.597, -57.819, -51.336, -54.898, -55.754, -58.37, -70.73, -56.29, -55.858, -59.377, -64.383, -57.829, -55.022, -60.431, -59.79, -64.848, -73.806, -64.191, -65.328, -72.764, -53.427, -51.676, -40.57, -43.654, -33.672, -47.184, -54.57, -48.199, -40.887, -39.618, -37.1, -32.734, -30.455, -33.553, -29.048, -20.696, -20.924, -31.075, -29.768, -28.906, 4.121, 8.835, 6.191, 3.77, -2.497, 7.408, 18.45, 25.541, 26.878, 14.362, 17.525, 29.856, 36.72, 41.055, 43.544, 49.978, 47.072, 38.901, 36.017, 33.797, 33.867, 38.004, 37.758, 40.367, 34.022, 29.793, 26.701, 31.394, 20.073, 23.809, 16.1, 29.043, 39.557, 27.863, 22.397, 19.053, 17.449, -1.615, -1.989, -9.294, -0.897, -9.818, -8.255, -12.522, -12.931, -21.024, -11.801, -9.048, -9.592, -12.006, -2.632, -1.016, -0.825, 0.914, -2.596, 4.289, 5.917, 12.75, 1.615, -0.053, -8.541, -11.286, -15.181, -14.396, -14.61, -35.473, -44.186, -49.857, -41.286, -39.127, -40.952, -44.388, -42.543, -37.657, -34.048, -28.939, -26.566, -32.876, -38.618, -36.676, -40.893, -35.16, -35.555, -35.175, -33.644, -37.82, -53.217, -49.252, -55.602, -54.32, -57.853, -58.925, -58.098, -56.682, -51.278, -54.353, -46.325, -52.567, -53.636, -52.735, -50.421, -51.122, -52.433, -43.493, -43.142, -29.335, -31.697, -15.13, 3.023)), class = "data.frame", row.names = c(NA, -210L ))
Filling missing rows
I have a large data set; a sample is given below:

df <- data.frame(stringsAsFactors = FALSE,
  Date = c("2015-10-26", "2015-10-26", "2015-10-26", "2015-10-26",
           "2015-10-27", "2015-10-27", "2015-10-27"),
  Ticker = c("ANZ", "CBA", "NAB", "WBC", "ANZ", "CBA", "WBC"),
  Open = c(29.11, 77.89, 32.69, 31.87, 29.05, 77.61, 31.84),
  High = c(29.17, 77.93, 32.76, 31.92, 29.08, 78.1, 31.95),
  Low = c(28.89, 77.37, 32.42, 31.71, 28.9, 77.54, 31.65),
  Close = c(28.9, 77.5, 32.42, 31.84, 28.94, 77.74, 31.77),
  Volume = c(6350170L, 2251288L, 3804239L, 5597684L, 5925519L, 2424679L, 5448863L)
)

The problem I am trying to solve is the missing data for NAB on 2015-10-27. I want the last value to repeat itself for the missing dates:

  Date       Ticker  Open  High   Low Close  Volume
2 2015-10-27 NAB    32.69 32.76 32.42 32.42 3804239

Any ideas on how to do this? I have already unsuccessfully tried gather + spread.
What if you tried something like this?

library(zoo)

res <- expand.grid(Date = unique(df$Date), Ticker = unique(df$Ticker))
res2 <- merge(res, df, all.x = TRUE)
res2 <- res2[order(res2$Ticker, res2$Date), ]
res3 <- na.locf(res2)
res3[order(res3$Date, res3$Ticker), ]
#  Date       Ticker  Open  High   Low Close  Volume
#1 2015-10-26 ANZ    29.11 29.17 28.89 28.90 6350170
#3 2015-10-26 CBA    77.89 77.93 77.37 77.50 2251288
#5 2015-10-26 NAB    32.69 32.76 32.42 32.42 3804239
#6 2015-10-26 WBC    31.87 31.92 31.71 31.84 5597684
#2 2015-10-27 ANZ    29.05 29.08 28.90 28.94 5925519
#4 2015-10-27 CBA    77.61 78.10 77.54 77.74 2424679
#8 2015-10-27 NAB    32.69 32.76 32.42 32.42 3804239
#7 2015-10-27 WBC    31.84 31.95 31.65 31.77 5448863

I'm assuming that if a Ticker/Day combo does not exist, you want to create one and LOCF it. This is what the expand.grid does.
tidyr::complete and tidyr::fill are built just for this situation:

library(tidyverse)

df %>%
  complete(Date, Ticker) %>%
  arrange(Ticker) %>%
  fill(names(.)) %>%
  arrange(Date)
# # A tibble: 8 x 7
#   Date       Ticker  Open  High   Low Close  Volume
#   <chr>      <chr>  <dbl> <dbl> <dbl> <dbl>   <int>
# 1 2015-10-26 ANZ    29.11 29.17 28.89 28.90 6350170
# 2 2015-10-26 CBA    77.89 77.93 77.37 77.50 2251288
# 3 2015-10-26 NAB    32.69 32.76 32.42 32.42 3804239
# 4 2015-10-26 WBC    31.87 31.92 31.71 31.84 5597684
# 5 2015-10-27 ANZ    29.05 29.08 28.90 28.94 5925519
# 6 2015-10-27 CBA    77.61 78.10 77.54 77.74 2424679
# 7 2015-10-27 NAB    32.69 32.76 32.42 32.42 3804239
# 8 2015-10-27 WBC    31.84 31.95 31.65 31.77 5448863
Another potential solution (note: I had to convert your date vector to Date format, but this could be reversed in the final output):

library(tidyr)
library(dplyr)

df <- data.frame(stringsAsFactors = FALSE,
  Date = as.Date(c("2015-10-26", "2015-10-26", "2015-10-26", "2015-10-26",
                   "2015-10-27", "2015-10-27", "2015-10-27")),
  Ticker = c("ANZ", "CBA", "NAB", "WBC", "ANZ", "CBA", "WBC"),
  Open = c(29.11, 77.89, 32.69, 31.87, 29.05, 77.61, 31.84),
  High = c(29.17, 77.93, 32.76, 31.92, 29.08, 78.1, 31.95),
  Low = c(28.89, 77.37, 32.42, 31.71, 28.9, 77.54, 31.65),
  Close = c(28.9, 77.5, 32.42, 31.84, 28.94, 77.74, 31.77),
  Volume = c(6350170L, 2251288L, 3804239L, 5597684L, 5925519L, 2424679L, 5448863L))

tickers <- unique(df$Ticker)
dates <- as.Date(df$Date)
possibilities <- as.data.frame(unique(expand.grid(dates, tickers)))
colnames(possibilities) <- c('Date', 'Ticker')
missing <- anti_join(possibilities, df[, c('Date', 'Ticker')])
missing_filled <- if (nrow(missing) > 0) {
  replacement <- cbind(missing, filter(df, Date == missing$Date - 1, Ticker == missing$Ticker)[, 3:7])
}
final <- arrange(rbind(df, replacement), Date)