Clickhouse GROUP BY milliseconds - datetime

How do I GROUP BY 0.004 seconds using the timestamps? I want to calculate the averages of 4 consecutive rows and have a table with a quarter of the values in the new table.
INSERT INTO sensor_1 Values
('2021-01-01 00:00:00.000', 1.52), ('2021-01-01 00:00:00.001', 1.54), ('2021-01-01 00:00:00.002', 1.42), ('2021-01-01 00:00:00.003', 1.54), ('2021-01-01 00:00:00.004', 1.42), ('2021-01-01 00:00:00.005', 1.52), ('2021-01-01 00:00:00.006', 1.54), ('2021-01-01 00:00:00.007', 1.42), ('2021-01-01 00:00:00.008', 1.54), ('2021-01-01 00:00:00.009', 1.42),
('2021-01-01 00:00:00.010', 1.55), ('2021-01-01 00:00:00.011', 1.45), ('2021-01-01 00:00:00.012', 1.55), ('2021-01-01 00:00:00.013', 1.45), ('2021-01-01 00:00:00.014', 1.35), ('2021-01-01 00:00:00.015', 1.55), ('2021-01-01 00:00:00.016', 1.45), ('2021-01-01 00:00:00.017', 1.55), ('2021-01-01 00:00:00.018', 1.45), ('2021-01-01 00:00:00.019', 1.35),
('2021-01-01 00:00:00.020', 1.54), ('2021-01-01 00:00:00.021', 1.44), ('2021-01-01 00:00:00.022', 1.54), ('2021-01-01 00:00:00.023', 1.44), ('2021-01-01 00:00:00.024', 1.34), ('2021-01-01 00:00:00.025', 1.54), ('2021-01-01 00:00:00.026', 1.44), ('2021-01-01 00:00:00.027', 1.54), ('2021-01-01 00:00:00.028', 1.44), ('2021-01-01 00:00:00.029', 1.34),
('2021-01-01 00:00:00.030', 1.53), ('2021-01-01 00:00:00.031', 1.43), ('2021-01-01 00:00:00.032', 1.53), ('2021-01-01 00:00:00.033', 1.43), ('2021-01-01 00:00:00.034', 1.33), ('2021-01-01 00:00:00.035', 1.53), ('2021-01-01 00:00:00.036', 1.43), ('2021-01-01 00:00:00.037', 1.53), ('2021-01-01 00:00:00.038', 1.43), ('2021-01-01 00:00:00.039', 1.33);

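At first sight, I would convert the timestamps to Unix time in milliseconds and then bucket them in groups of 4. Note that a plain modulo (ms % 4) would interleave the rows; integer division, e.g. intDiv(toUnixTimestamp64Milli(ts), 4), assuming a DateTime64(3) column hypothetically named ts, keeps each run of 4 consecutive milliseconds together.
Grouping all rows by a whole second, 1/10 s or 1/100 s is easier than grouping by 4/1000 s.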
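The bucketing arithmetic can be checked outside ClickHouse. The following is a minimal sketch in R (not ClickHouse SQL) on the first eight sample rows; the variable names are illustrative only. Integer division of the millisecond offset by 4 gives one bucket per run of 4 consecutive rows, whereas the remainder cycles through 0, 1, 2, 3.

```r
# Millisecond offsets and values of the first eight sample rows.
ms <- 0:7
value <- c(1.52, 1.54, 1.42, 1.54, 1.42, 1.52, 1.54, 1.42)

# Integer division: rows 0-3 land in bucket 0, rows 4-7 in bucket 1.
bucket <- ms %/% 4

# Remainder, for comparison: 0 1 2 3 0 1 2 3, which does not group neighbours.
remainder <- ms %% 4

# One average per bucket, i.e. a quarter as many rows as the input.
avg <- tapply(value, bucket, mean)
print(avg)
```

The same idea carries over to SQL by grouping on the integer-divided millisecond timestamp and taking avg() of the value column.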

Related

First Derivative of Scatter Plot R

Hello, I am working with sigmoidal data and am attempting to plot two scatter plots on top of each other: the raw data and the first derivative of the raw data. My issue doesn't lie in plotting the data, but rather in finding a function that will give an accurate representation of the first derivative.
What I have tried: creating a function that calculates the slope between the current and next point, (y2-y1)/(x2-x1), and assigning the value to the current temperature.
dput() of Data Frame:
structure(list(Temperature = c(4.98, 5.49, 6.01, 6.5, 7.02, 7.52, 8.03, 8.52, 9.03, 9.54, 10.04, 10.54, 11.05, 11.55, 12.05, 12.55, 13.05, 13.56, 14.06, 14.57, 15.07, 15.57, 16.07, 16.59, 17.08, 17.59, 18.08, 18.59, 19.09, 19.6, 20.1, 20.64, 21.12, 21.63, 22.13, 22.62, 23.13, 23.63, 24.13, 24.63, 25.11, 25.62, 26.11, 26.68, 27.19, 27.7, 28.2, 28.71, 29.21, 29.71, 30.21, 30.7, 31.21, 31.69, 32.19, 32.69, 33.19, 33.7, 34.19, 34.68, 35.19, 35.68, 36.19, 36.69, 37.19, 37.7, 38.19, 38.7, 39.2, 39.7, 40.21, 40.7, 41.22, 41.71, 42.21, 42.71, 43.21, 43.72, 44.22, 44.72, 45.22, 45.73, 46.23, 46.73, 47.23, 47.97, 48.71, 49.23, 49.74, 50.23, 50.73, 51.23, 51.73, 52.24, 52.75, 53.24, 53.75, 54.24, 54.75, 55.26, 55.75, 56.25, 56.75, 57.24, 57.75, 58.27, 58.77, 59.26, 59.77, 60.26, 60.78, 61.27, 61.79, 62.27, 62.77, 63.29, 63.79, 64.27, 64.78, 65.3, 65.8, 66.27, 66.8, 67.3, 67.8, 68.31, 68.78, 69.3, 69.8, 70.32, 70.81, 71.32, 71.81, 72.33, 72.82, 73.31, 73.83, 74.33, 74.82, 75.32, 75.83, 76.34, 76.84, 77.35, 77.82, 78.34, 78.85, 79.36, 79.84, 80.35, 80.85, 81.36, 81.86, 82.37, 82.86, 83.37, 83.88, 84.36, 84.88, 85.38, 85.88, 86.38, 86.89, 87.38, 87.89, 88.39, 88.89, 89.4, 89.9, 90.39, 90.9, 91.4, 91.91, 92.37, 92.89, 93.4, 93.91, 94.41, 94.91, 95.42), Absorbance = c(1.401351929, 1.403320313, 1.405181885, 1.406326294, 1.407440186, 1.409118652, 1.410095215, 1.410797119, 1.411560059, 1.412918091, 1.413970947, 1.414245605, 1.416000366, 1.415435791, 1.41809082, 1.4190979, 1.419677734, 1.420150757, 1.421966553, 1.420333862, 1.422637939, 1.422790527, 1.423461914, 1.426513672, 1.426315308, 1.426071167, 1.426467896, 1.428710938, 1.428070068, 1.428817749, 1.429733276, 1.432144165, 1.432434082, 1.433227539, 1.434616089, 1.435806274, 1.434814453, 1.436096191, 1.436096191, 1.436447144, 1.437896729, 1.4375, 1.438934326, 1.440139771, 1.440139771, 1.441741943, 1.442108154, 1.443969727, 1.444778442, 1.443862915, 1.444534302, 1.445648193, 1.444473267, 1.446395874, 1.447219849, 1.446151733, 
1.449569702, 1.449066162, 1.448852539, 1.4503479, 1.451385498, 1.45111084, 1.451217651, 1.453125, 1.452560425, 1.455047607, 1.455093384, 1.456665039, 1.457977295, 1.457336426, 1.458648682, 1.46043396, 1.462158203, 1.464813232, 1.463531494, 1.468048096, 1.468643188, 1.470748901, 1.471878052, 1.476257324, 1.478057861, 1.482040405, 1.484466553, 1.486129761, 1.48815918, 1.496520996, 1.499786377, 1.504302979, 1.507217407, 1.512985229, 1.517471313, 1.524108887, 1.528198242, 1.534637451, 1.539169312, 1.546142578, 1.554611206, 1.55809021, 1.56854248, 1.572875977, 1.580307007, 1.585739136, 1.592514038, 1.600067139, 1.609222412, 1.616607666, 1.622375488, 1.631469727, 1.635635376, 1.642929077, 1.649780273, 1.655014038, 1.661483765, 1.663742065, 1.671859741, 1.677200317, 1.677108765, 1.683380127, 1.684082031, 1.687438965, 1.694595337, 1.694961548, 1.696685791, 1.696685791, 1.699768066, 1.702514648, 1.703613281, 1.705093384, 1.70022583, 1.707595825, 1.707962036, 1.709075928, 1.705276489, 1.71055603, 1.709259033, 1.70916748, 1.709732056, 1.710189819, 1.710281372, 1.711868286, 1.711883545, 1.713104248, 1.713760376, 1.711120605, 1.709716797, 1.711776733, 1.712814331, 1.714324951, 1.711120605, 1.713378906, 1.712432861, 1.716125488, 1.710006714, 1.710845947, 1.711502075, 1.711120605, 1.710006714, 1.70980835, 1.708602905, 1.708236694, 1.710189819, 1.707672119, 1.706939697, 1.710006714, 1.706192017, 1.706573486, 1.706207275, 1.705734253, 1.706207275, 1.705184937, 1.70954895, 1.705841064, 1.702972412, 1.703979492, 1.703063965, 1.709350586, 1.703338623, 1.700408936, 1.705276489, 1.705368042)), row.names = 1621:1800, class = "data.frame")
Code For my Attempt
raw = "<insert dput line>" # placeholder: assign the data frame from the dput() output above
columns = c("Temperature","Absorbance")
first = data.frame(matrix(nrow=0,ncol=2))
colnames(first) = columns
for (i in 1:nrow(raw)) {
if(i != nrow(raw)) {
cAbs = raw[i,2]
nextAbs = raw[i+1,2]
cT = raw[i,1]
nextT = raw[i+1,1]
Temperature = raw[i,1]
Absorbance =((nextAbs-cAbs)/(nextT-cT))
t <- data.frame(Temperature,Absorbance)
names(t) <- names(raw)
first <- rbind(first, t)
}
}
ggplot()+
geom_point(data=raw, aes(x=Temperature,y=Absorbance), color = "red") +
geom_point(data = first, aes(x=Temperature,y = Absorbance), color = "blue")
What I was expecting
I was expecting an output that had the shape of something like so:
library(dplyr); library(ggplot2)
df %>%
arrange(Temperature) %>%
mutate(slope = (Absorbance - lag(Absorbance))/
(Temperature - lag(Temperature))) %>%
ggplot(aes(Temperature)) +
geom_line(aes(y= Absorbance, color = "Absorbance"), size = 1.2) +
geom_point(aes(y= slope * 20 + 1.4, color = "slope")) +
geom_smooth(aes(y= slope * 20 + 1.4, color = "slope"), se = FALSE, size = 0.8) +
scale_y_continuous(sec.axis = sec_axis(trans = ~(.x - 1.4)/20, name = "slope"))
If the data is even a little noisy, calculating the derivative by first differencing can be very noisy.
You can get a better estimate by fitting a smoothing spline function and calculating the derivative of the spline function. By differentiating a smooth function, you get a smooth derivative.
In most cases, smooth.spline with default arguments is fine, but I recommend taking a look at the result and possibly tuning the smooth.spline parameters for more or less smoothing, depending on your judgment.
edit: I learned this approach from the Numerical Recipes textbook.
library(tidyverse)
df <- tibble(
x = seq(1, 15, by = 0.1),
y = sin(x) + runif(length(x), -0.2, 0.2),
d1_diff = c(NA, diff(y) / diff(x)),
d1_spline = smooth.spline(x, y) %>% predict(x, deriv = 1) %>% pluck("y")
)
df %>%
pivot_longer(-x) %>%
mutate(name = factor(name, unique(name))) %>%
ggplot() + aes(x, value, color = name) + geom_point() + geom_line() +
facet_wrap(~name, ncol = 1)
#> Warning: Removed 1 rows containing missing values (geom_point).
#> Warning: Removed 1 row(s) containing missing values (geom_path).
Created on 2022-10-26 with reprex v2.0.2
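To illustrate the advice about tuning smooth.spline, here is a minimal base-R sketch on a synthetic sigmoid. The data, the variable names and the value spar = 0.8 are assumptions made for illustration and are not taken from the question's data set.

```r
set.seed(42)

# Synthetic sigmoidal curve standing in for Temperature/Absorbance data.
temp <- seq(5, 95, by = 0.5)
absorb <- 1.4 + 0.3 / (1 + exp(-(temp - 55) / 4)) +
  rnorm(length(temp), sd = 0.002)

# Default fit: the amount of smoothing is chosen automatically.
fit_default <- smooth.spline(temp, absorb)

# Manual fit: a larger spar gives more smoothing (0.8 is an arbitrary choice).
fit_smooth <- smooth.spline(temp, absorb, spar = 0.8)

# First derivative of each fitted spline, evaluated at the observed x values.
d1_default <- predict(fit_default, temp, deriv = 1)$y
d1_smooth <- predict(fit_smooth, temp, deriv = 1)$y

# Compare how strongly each derivative estimate varies.
print(range(d1_default))
print(range(d1_smooth))
```

Plotting both derivative vectors against temp is a quick way to judge whether the default is smooth enough or whether spar needs adjusting.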

R Scaling a vector between -1 and 1

I have a vector with 100 elements.
vec <- c(58.12, 51.97, 61.83, 53.46, 30.67, 38.8, 48.79, 56.82, 20.19,
53.1, 54.95, 46.45, 41.09, 51.76, 52.56, 44.63, 52.95, 30, 50.7,
56.33, 64.72, 39.99, 39.37, 33.82, 47.62, 51.28, 37.38, 50.55,
68.39, 53.88, 33.37, 29.69, 30.74, 47.51, 72.64, 47.88, 42.28,
62.71, 47.47, 71.45, 55.94, 39.5, 32.97, 28.81, 56.59, 49.79,
43.49, 41.97, 43.61, 30.09, 50.18, 63.88, 57.77, 41.57, 27.52,
38.47, 46.13, 41.85, 39.14, 46.38, 47.73, 61.51, 66.73, 56.28,
59.89, 47.38, 27.27, 17.41, 36.8, 27.21, 43.13, 43.68, 29.33,
53.76, 74.69, 29.56, 63.41, 31.61, 56.32, 49.68, 48.65, 46.81,
51.23, 65.23, 54.79, 84.64, 63.55, 32.4, 47.93, 68.13, 33.05,
30.21, 40.62, 48.28, 38.69, 31.72, 52.01, 64.17, 53.12, 35.03)
I want to scale this vector vec so that all numbers between 0 and 50 are scaled to the range -1 to 0 and all numbers between 50 and 100 are scaled to the range 0 to 1.
I have written the following code to do this -
newvec = ifelse(vec < 50, -(vec/min(vec, na.rm = T)), vec/max(vec, na.rm = T))
plot(vec, newvec)
The output looks like (see the black circles) -
For the numbers above 50, the scaling is fine; however, for the numbers below 50, the scaling works in reverse order and is incorrect (as shown in the graph).
I have drawn a red line in this graph showing the correct scaling.
Can someone show me what I am doing wrong?
Thanks.
Looks like a perfect linear equation!
m <- lm( y~x, data=data.frame( x=c(0,50,100), y=c(-1,0,1) ) )
coef(m)
gives:
(Intercept) x
-1.00 0.02
So multiply by 0.02 and subtract 1
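Applied to a vector, the fitted line amounts to a one-line function. This is a small sketch; the name scale_fixed and its lo/hi arguments are hypothetical and simply encode the fixed endpoints 0 and 100 from the question.

```r
# Map the fixed range [lo, hi] linearly onto [-1, 1]; the midpoint maps to 0.
# With lo = 0 and hi = 100 this is the same as 0.02 * x - 1.
scale_fixed <- function(x, lo = 0, hi = 100) {
  2 * (x - lo) / (hi - lo) - 1
}

scale_fixed(c(0, 25, 50, 100))
# -1.0 -0.5  0.0  1.0
```

Because the endpoints are fixed rather than taken from the data, 50 always maps to 0 regardless of which values happen to be in the vector.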
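You could do it by first rescaling the vector to 0-1, then multiplying by two and subtracting one (note that this uses the observed minimum and maximum of vec as the endpoints, not 0 and 100):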
vec <- c(58.12, 51.97, 61.83, 53.46, 30.67, 38.8, 48.79, 56.82, 20.19,
53.1, 54.95, 46.45, 41.09, 51.76, 52.56, 44.63, 52.95, 30, 50.7,
56.33, 64.72, 39.99, 39.37, 33.82, 47.62, 51.28, 37.38, 50.55,
68.39, 53.88, 33.37, 29.69, 30.74, 47.51, 72.64, 47.88, 42.28,
62.71, 47.47, 71.45, 55.94, 39.5, 32.97, 28.81, 56.59, 49.79,
43.49, 41.97, 43.61, 30.09, 50.18, 63.88, 57.77, 41.57, 27.52,
38.47, 46.13, 41.85, 39.14, 46.38, 47.73, 61.51, 66.73, 56.28,
59.89, 47.38, 27.27, 17.41, 36.8, 27.21, 43.13, 43.68, 29.33,
53.76, 74.69, 29.56, 63.41, 31.61, 56.32, 49.68, 48.65, 46.81,
51.23, 65.23, 54.79, 84.64, 63.55, 32.4, 47.93, 68.13, 33.05,
30.21, 40.62, 48.28, 38.69, 31.72, 52.01, 64.17, 53.12, 35.03)
rescale_minMax <- function(x){
1 - (x - max(x)) / (min(x) - max(x))
}
newvec = rescale_minMax(vec) * 2 - 1
plot(vec, newvec)
Created on 2021-03-12 by the reprex package (v0.3.0)

How to multiply a column by a dynamic value of another column?

I have two df's, one with values by month and another with weights by year. I would like to multiply the values of each month by the weight of the respective year.
I have data from January 2010 to August 2020. I was trying to use xts, but it was only multiplying the first month. What can I do to solve this problem?
My weights dput:
structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 34.4047080126754,
32.7172573747916, 30.6161439261264, 30.7788857122934, 33.7950781612753,
38.8198000487686, 35.4479688369505, 30.795545527331, 30.795545527331,
30.795545527331, 30.795545527331, 30.795545527331, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), index = structure(c(1262304000, 1293840000,
1325376000, 1356998400, 1388534400, 1420070400, 1451606400, 1483228800,
1514764800, 1546300800, 1577836800, 1609459200), tzone = "UTC", tclass = "Date"), class = c("xts",
"zoo"), .Dim = c(12L, 3L), .Dimnames = list(NULL, c("Carvão mineral",
"Minerais não-metálicos", "Petróleo, gás natural e serviços de apoio"
)))
My data dput:
structure(c(90.6396989456161, 84.0613068812432, 94.6447239178692,
92.7542716426195, 97.2115579911211, 97.8381908990013, 104.569346559379,
104.440703385128, 103.402512486127, 102.224120421754, 99.045226137625,
101.397487513874, 96.8256284683686, 90.2070352386239, 93.9874375693675,
95.9522613762488, 102.192964761376, 102.076381798002, 107.267336293008,
107.257788568258, 103.645728357381, 103.631155660378, 103.805024972253,
103.597487513874, 84.457789955605, 90.368342119867, 96.956784128746,
95.761809100999, 105.614572697003, 102.197989733629, 106.200502219756,
106.574371531632, 100.293466981132, 112.208039678136, 96.6427136514985,
102.568844339623, 89.5346739733631, 82.6613068812432, 85.7110557713653,
87.6969852941178, 94.8974875138736, 97.1547738623753, 103.174371531632,
104.474371531632, 97.2000000000002, 106.629145394007, 102.171859045505,
100.064321587126, 95.8713568257493, 85.3758795782465, 95.7472364039957,
95.9000000000001, 103.621607935627, 102.250251109878, 108.933668146504,
112.324120421754, 108.631155660378, 113.459798834628, 106.247738623752,
109.121105715871, 90.6396989456161, 84.0613068812432, 94.6447239178692,
92.7542716426195, 97.2115579911211, 97.8381908990013, 104.569346559379,
104.440703385128, 103.402512486127, 102.224120421754, 99.045226137625,
101.397487513874, 96.8256284683686, 90.2070352386239, 93.9874375693675,
95.9522613762488, 102.192964761376, 102.076381798002, 107.267336293008,
107.257788568258, 103.645728357381, 103.631155660378, 103.805024972253,
103.597487513874, 84.457789955605, 90.368342119867, 96.956784128746,
95.761809100999, 105.614572697003, 102.197989733629, 106.200502219756,
106.574371531632, 100.293466981132, 112.208039678136, 96.6427136514985,
102.568844339623, 89.5346739733631, 82.6613068812432, 85.7110557713653,
87.6969852941178, 94.8974875138736, 97.1547738623753, 103.174371531632,
104.474371531632, 97.2000000000002, 106.629145394007, 102.171859045505,
100.064321587126, 95.8713568257493, 85.3758795782465, 95.7472364039957,
95.9000000000001, 103.621607935627, 102.250251109878, 108.933668146504,
112.324120421754, 108.631155660378, 113.459798834628, 106.247738623752,
109.121105715871, 165.763771576311, 151.472311910598, 168.854440604357,
167.01494030152, 172.566095042522, 165.278910760377, 171.245703490676,
173.005078675003, 161.715874268151, 167.370633868768, 169.164858518633,
182.349525211683, 177.289636413718, 155.287062174998, 172.942860603901,
165.57948196148, 173.580501842325, 172.873069603743, 173.862508830951,
171.994459838177, 169.613581901888, 175.855080325953, 176.541789876917,
185.284535064393, 186.406738740301, 171.699930121725, 174.417747407092,
163.872357926782, 172.113751272419, 166.397768544, 170.952183056081,
169.545822051413, 158.11329959031, 170.395638330209, 167.443226012601,
178.312765639496, 174.175143485261, 154.998922814227, 159.313917848059,
158.79737063505, 169.424496986474, 173.000079912105, 168.825964328539,
171.133589234327, 171.992673103431, 175.419528662589, 171.244236480227,
179.590587893763, 174.848849204677, 161.09792093833, 180.575597173093,
176.570773004147, 186.333954561173, 184.94640453128, 193.152978105855,
198.244306242776, 193.85527669408, 203.792127163977, 194.393862208802,
212.217706501233), class = c("xts", "zoo"), index = structure(c(1262304000,
1264982400, 1267401600, 1270080000, 1272672000, 1275350400, 1277942400,
1280620800, 1283299200, 1285891200, 1288569600, 1291161600, 1293840000,
1296518400, 1298937600, 1301616000, 1304208000, 1306886400, 1309478400,
1312156800, 1314835200, 1317427200, 1320105600, 1322697600, 1325376000,
1328054400, 1330560000, 1333238400, 1335830400, 1338508800, 1341100800,
1343779200, 1346457600, 1349049600, 1351728000, 1354320000, 1356998400,
1359676800, 1362096000, 1364774400, 1367366400, 1370044800, 1372636800,
1375315200, 1377993600, 1380585600, 1383264000, 1385856000, 1388534400,
1391212800, 1393632000, 1396310400, 1398902400, 1401580800, 1404172800,
1406851200, 1409529600, 1412121600, 1414800000, 1417392000), tzone = "UTC", tclass = c("POSIXct",
"POSIXt")), .Dim = c(60L, 3L), .Dimnames = list(NULL, c("Carvão.Mineral",
"Minerais.não.metálicos", "Petróleo..gás.natural.e.serviços.de.apoio"
)))
My code:
library(xts)
for (i in 1:nrow(pesos)) {
Extrativas_ex_petro_gas <- cbind(
pesos$`Carvão mineral`[i] * Dados_xts$Carvão.Mineral[i],
pesos$`Minerais não-metálicos`[i] * Dados_xts$Minerais.não.metálicos[i])
}
If there is any solution with dplyr, I would like to know too.
I called your first table "a" and the large one "b", just to keep it simple. I just added a year-column to both data.frames, derived from their dates. Then I created the dataframe df by merging a and b together via their year column.
Lastly I added the desired column by multiplication as you wanted.
a$year <- gsub("-.*","",a$Ano)
b$year <- gsub("-.*","",b$Data)
df <- merge(a, b, by="year")
df$multi <- df[,4]*df[,6]
If you want to use dplyr, df can be constructed with:
df <- merge(a, b, by="year") %>% mutate(multi=.[,4]*.[,6])
It would be a little better if your columns had shorter names, so that one could refer to them by name rather than by position (.[,4]), but that's up to you.
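The year-lookup idea can also be sketched in base R with match() instead of merge(). The two small data frames below are made up for illustration (they are not the xts objects from the question), and all column names are hypothetical.

```r
# Hypothetical yearly weights and monthly values.
weights <- data.frame(year = c(2010, 2011), weight = c(0.5, 2))
monthly <- data.frame(
  date = as.Date(c("2010-01-01", "2010-02-01", "2011-01-01")),
  value = c(10, 20, 30)
)

# Derive the year of each monthly observation from its date.
monthly$year <- as.integer(format(monthly$date, "%Y"))

# Look up each row's yearly weight and multiply.
idx <- match(monthly$year, weights$year)
monthly$weighted <- monthly$value * weights$weight[idx]

print(monthly)
```

Unlike merge(), which sorts the result by the key column by default, match() leaves the monthly rows in their original order; a year with no weight yields NA.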

outlining statistically significant pixels in plotly (heatmap)

In plotly's heatmap, is it possible to outline specific pixels? In the small reproducible example below, I'd like to know if I can use the matrix defined at the end, called "significant," to add a black outline (with lwd=2) to all pixels with a corresponding value of 1 in 'significant':
library(plotly)
library(RColorBrewer)
m1 <- matrix(c(
-0.0024, -0.0031, -0.0021, -0.0034, -0.0060, -1.00e-02, -8.47e-03, -0.0117, -0.0075, -0.0043, -0.0026, -0.0021,
-0.0015, -0.0076, -0.0032, -0.0105, -0.0107, -2.73e-02, -3.37e-02, -0.0282, -0.0149, -0.0070, -0.0046, -0.0039,
-0.0121, -0.0155, -0.0203, -0.0290, -0.0330, -3.19e-02, -1.74e-02, -0.0103, -0.0084, -0.0180, -0.0162, -0.0136,
-0.0073, -0.0053, -0.0050, -0.0058, -0.0060, -4.38e-03, -2.21e-03, -0.0012, -0.0026, -0.0026, -0.0034, -0.0073,
-0.0027, -0.0031, -0.0054, -0.0069, -0.0071, -6.28e-03, -2.88e-03, -0.0014, -0.0031, -0.0037, -0.0030, -0.0027,
-0.0261, -0.0223, -0.0216, -0.0293, -0.0327, -3.17e-02, -1.77e-02, -0.0084, -0.0059, -0.0060, -0.0120, -0.0157,
0.0045, 0.0006, -0.0031, -0.0058, -0.0093, -9.20e-03, -6.76e-03, -0.0033, 0.0002, 0.0045, 0.0080, 0.0084,
-0.0021, -0.0018, -0.0020, -0.0046, -0.0080, -2.73e-03, 7.43e-04, 0.0004, -0.0010, -0.0017, -0.0022, -0.0024,
-0.0345, -0.0294, -0.0212, -0.0194, -0.0192, -2.25e-02, -2.05e-02, -0.0163, -0.0179, -0.0213, -0.0275, -0.0304,
-0.0034, -0.0038, -0.0040, -0.0045, -0.0059, -1.89e-03, 6.99e-05, -0.0050, -0.0114, -0.0112, -0.0087, -0.0064,
-0.0051, -0.0061, -0.0052, -0.0035, 0.0012, -7.41e-06, -3.43e-03, -0.0055, -0.0020, 0.0016, -0.0024, -0.0069,
-0.0061, -0.0068, -0.0089, -0.0107, -0.0104, -7.65e-03, 2.43e-03, 0.0008, -0.0006, -0.0014, -0.0021, -0.0057,
0.0381, 0.0149, -0.0074, -0.0302, -0.0550, -6.40e-02, -5.28e-02, -0.0326, -0.0114, 0.0121, 0.0367, 0.0501,
-0.0075, -0.0096, -0.0123, -0.0200, -0.0288, -2.65e-02, -2.08e-02, -0.0176, -0.0146, -0.0067, -0.0038, -0.0029,
-0.0154, -0.0162, -0.0252, -0.0299, -0.0350, -3.40e-02, -2.51e-02, -0.0172, -0.0139, -0.0091, -0.0119, -0.0156),
nrow = 15, ncol = 12, byrow=TRUE)
# palette definition
palette <- colorRampPalette(c("darkblue", "blue", "white", "red", "darkred"))
# find max stretch value
zmax1 = max(abs(m1))
plot_ly(
x = c(format(seq(as.Date('2000-10-01'), as.Date('2001-09-30'), by='month'), "%b")),
y = rev(c("TH1", "IN1", "IN3", "GL1", "LH1", "ED9", "TC1", "TC2", "TC3", "UT1", "UT3", "UT5", "GC1", "BC1", "WC1")),
z = m1, colors = palette(50), type = "heatmap", width = 700, height = 600, zauto = FALSE, zmin = -1 * zmax1, zmax = zmax1
)
# Can this matrix somehow be used to outline pixels in the heatmap that are equal to one?
# (In this example, the outline pattern will look like a checker board)
significant <- matrix(rep(c(0,1), times=nrow(m1) * ncol(m1) / 2), nrow = 15, ncol = 12)

Data Smoothing in R

This question is related to this one that I asked before. But referring to that question is not necessary to answer this one.
Data
I have a data set containing velocities of 2169 vehicles recorded at intervals of 0.1 seconds. So, there are many rows for an individual vehicle. Here I am reproducing the data only for the vehicle # 2:
> dput(uma)
structure(list(Frame.ID = 13:445, Vehicle.velocity = c(40, 40,
40, 40, 40, 40, 40, 40.02, 40.03, 39.93, 39.61, 39.14, 38.61,
38.28, 38.42, 38.78, 38.92, 38.54, 37.51, 36.34, 35.5, 35.08,
34.96, 34.98, 35, 34.99, 34.98, 35.1, 35.49, 36.2, 37.15, 38.12,
38.76, 38.95, 38.95, 38.99, 39.18, 39.34, 39.2, 38.89, 38.73,
38.88, 39.28, 39.68, 39.94, 40.02, 40, 39.99, 39.99, 39.65, 38.92,
38.52, 38.8, 39.72, 40.76, 41.07, 40.8, 40.59, 40.75, 41.38,
42.37, 43.37, 44.06, 44.29, 44.13, 43.9, 43.92, 44.21, 44.59,
44.87, 44.99, 45.01, 45.01, 45, 45, 45, 44.79, 44.32, 43.98,
43.97, 44.29, 44.76, 45.06, 45.36, 45.92, 46.6, 47.05, 47.05,
46.6, 45.92, 45.36, 45.06, 44.96, 44.97, 44.99, 44.99, 44.99,
44.99, 45.01, 45.02, 44.9, 44.46, 43.62, 42.47, 41.41, 40.72,
40.49, 40.6, 40.76, 40.72, 40.5, 40.38, 40.43, 40.38, 39.83,
38.59, 37.02, 35.73, 35.04, 34.85, 34.91, 34.99, 34.99, 34.97,
34.96, 34.98, 35.07, 35.29, 35.54, 35.67, 35.63, 35.53, 35.53,
35.63, 35.68, 35.55, 35.28, 35.06, 35.09, 35.49, 36.22, 37.08,
37.8, 38.3, 38.73, 39.18, 39.62, 39.83, 39.73, 39.58, 39.57,
39.71, 39.91, 40, 39.98, 39.97, 40.08, 40.38, 40.81, 41.27, 41.69,
42.2, 42.92, 43.77, 44.49, 44.9, 45.03, 45.01, 45, 45, 45, 45,
45, 45, 45, 45, 45, 45, 45, 44.99, 45.03, 45.26, 45.83, 46.83,
48.2, 49.68, 50.95, 51.83, 52.19, 52, 51.35, 50.38, 49.38, 48.63,
48.15, 47.87, 47.78, 48.01, 48.63, 49.52, 50.39, 50.9, 50.96,
50.68, 50.3, 50.05, 49.94, 49.87, 49.82, 49.82, 49.88, 49.96,
50, 50, 49.98, 49.98, 50.16, 50.64, 51.43, 52.33, 53.01, 53.27,
53.22, 53.25, 53.75, 54.86, 56.36, 57.64, 58.28, 58.29, 57.94,
57.51, 57.07, 56.64, 56.43, 56.73, 57.5, 58.27, 58.55, 58.32,
57.99, 57.89, 57.92, 57.74, 57.12, 56.24, 55.51, 55.1, 54.97,
54.98, 55.02, 55.03, 54.86, 54.3, 53.25, 51.8, 50.36, 49.41,
49.06, 49.17, 49.4, 49.51, 49.52, 49.51, 49.45, 49.24, 48.84,
48.29, 47.74, 47.33, 47.12, 47.06, 47.07, 47.08, 47.05, 47.04,
47.25, 47.68, 47.93, 47.56, 46.31, 44.43, 42.7, 41.56, 41.03,
40.92, 40.92, 40.98, 41.19, 41.45, 41.54, 41.32, 40.85, 40.37,
40.09, 39.99, 39.99, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
39.98, 39.97, 40.1, 40.53, 41.36, 42.52, 43.71, 44.57, 45.01,
45.1, 45.04, 45, 45, 45, 45, 45, 45, 44.98, 44.97, 45.08, 45.39,
45.85, 46.2, 46.28, 46.21, 46.29, 46.74, 47.49, 48.35, 49.11,
49.63, 49.89, 49.94, 49.97, 50.14, 50.44, 50.78, 51.03, 51.12,
51.05, 50.85, 50.56, 50.26, 50.06, 50.1, 50.52, 51.36, 52.5,
53.63, 54.46, 54.9, 55.03, 55.09, 55.23, 55.35, 55.35, 55.23,
55.07, 54.99, 54.98, 54.97, 55.06, 55.37, 55.91, 56.66, 57.42,
58.07, 58.7, 59.24, 59.67, 59.95, 60.02, 60, 60, 60, 60, 60,
60.01, 60.06, 60.23, 60.65, 61.34, 62.17, 62.93, 63.53, 64, 64.41,
64.75, 65.04, 65.3, 65.57, 65.75, 65.74, 65.66, 65.62, 65.71,
65.91, 66.1, 66.26, 66.44, 66.61, 66.78, 66.91, 66.99, 66.91,
66.7, 66.56, 66.6, 66.83, 67.17, 67.45, 67.75, 68.15, 68.64,
69.15, 69.57, 69.79, 69.79, 69.72, 69.72, 69.81, 69.94, 70, 70.01,
70.02, 70.03)), .Names = c("Frame.ID", "Vehicle.velocity"), class = "data.frame", row.names = c(NA,
433L))
Frame.ID is the time frame in which the Vehicle.velocity was observed. There is some noise in the velocity variable and I want to smooth it.
Methodology
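To smooth the velocity I am using the following equation (written out here to match the code below):
$$\tilde{x}_\alpha(t_i) = \frac{1}{Z} \sum_{k=i-D}^{i+D} x_\alpha(t_k)\, e^{-|i-k|/\Delta}, \qquad Z = \sum_{k=i-D}^{i+D} e^{-|i-k|/\Delta}$$
where,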
Delta = 10
Nalpha = number of data points (rows)
i = 1, ... ,Nalpha (i.e. the row number)
D = minimum of {i-1, Nalpha - i, 3*delta=30}
xalpha = velocity
Question
I have gone through the documentation of filter and convolution in R. It seems that I have to know about convolution to do this. However, I have tried my best and can't understand how convolution works! The linked question has an answer which helped me in understanding some of the inner workings in the function but I am still not sure.
Could anyone here on SO please explain how this thing works? Or guide me to an alternative methodology to achieve the same purpose i.e. apply the equation?
My current code which works but is lengthy
Here is what uma looks like:
> head(uma)
Frame.ID Vehicle.velocity
1 13 40
2 14 40
3 15 40
4 16 40
5 17 40
6 18 40
library(plyr) # provides ddply, used below
uma$i <- 1:nrow(uma) # this is i
uma$im1 <- uma$i - 1 # this is i - 1
uma$Nai <- nrow(uma) - uma$i # this is Nalpha - i
uma$delta3 <- 30 # this is 3 times delta
uma$D <- pmin(uma$im1, uma$Nai, uma$delta3) # selecting the minimum of {i-1, Nalpha - i, 3*delta=30}
uma$imD <- uma$i - uma$D # i-D
uma$ipD <- uma$i + uma$D # i+D
uma <- ddply(uma, .(Frame.ID), transform, k = imD:ipD) # to include all k in the data frame
umai <- uma
umai$imk <- umai$i - umai$k # i-k
umai$aimk <- (-1) * abs(umai$imk) # -|i-k|
umai$delta <- 10
umai$kernel <- exp(umai$aimk/umai$delta) # The kernel in the equation i.e. EXP^-|i-k|/delta
umai$p <- umai$Vehicle.velocity[match(umai$k,umai$i)] #observed velocity in kth row as described in equation as t(k)
umai$kernelp <- umai$p * umai$kernel # the product of kernel and observed velocity in kth row as described in equation as t(k)
umair <- ddply(umai, .(Frame.ID), summarize, Z = sum(kernel), prod = sum(kernelp)) # summing the kernel to get Z and summing the product to get the numerator of the equation
umair$new.Y <- umair$prod/umair$Z # the final step to get the smoothed velocity
Plot
Just for reference, if I plot the observed and smoothed velocities against time frames we can see the result of smoothing:
ggplot() +
geom_point(data=uma,aes(y=Vehicle.velocity, x= Frame.ID)) +
geom_point(data=umair,aes(y=new.Y, x= Frame.ID), color="red")
Please help me make my code shorter and applicable to all vehicles (represented by Vehicle.ID in the data set) by guiding me on the use of convolution.
dplyr
Alright, so I used following code and it works but takes 3 hours on 32 GB RAM. Can anyone suggest improvements to speed it up (1 hour each is taken by umal, umav and umaa)?
uma <- tbl_df(uma)
uma <- uma %>% # take data frame
group_by(Vehicle.ID) %>% # group by Vehicle ID
mutate(i = 1:length(Frame.ID), im1 = i-1, Nai = length(Frame.ID) - i,
Dv = pmin(im1, Nai, 30),
Da = pmin(im1, Nai, 120),
Dl = pmin(im1, Nai, 15),
imDv = i - Dv,
ipDv = i + Dv,
imDa = i - Da,
ipDa = i + Da,
imDl = i - Dl,
ipDl = i + Dl) %>% # finding i, i-1 and Nalpha-i, D, i-D and i+D for location, velocity and acceleration
ungroup()
umav <- uma %>%
group_by(Vehicle.ID, Frame.ID) %>%
do(data.frame(kv = .$imDv:.$ipDv)) %>%
left_join(x=., y=uma) %>%
mutate(imk = i - kv, aimk = (-1) * abs(imk), delta = 10, kernel = exp(aimk/delta)) %>%
ungroup() %>%
group_by(Vehicle.ID) %>%
mutate(p = Vehicle.velocity2[match(kv,i)], kernelp = p * kernel) %>%
ungroup() %>%
group_by(Vehicle.ID, Frame.ID) %>%
summarise(Z = sum(kernel), prod = sum(kernelp)) %>%
mutate(svel = prod/Z) %>%
ungroup()
umaa <- uma %>%
group_by(Vehicle.ID, Frame.ID) %>%
do(data.frame(ka = .$imDa:.$ipDa)) %>%
left_join(x=., y=uma) %>%
mutate(imk = i - ka, aimk = (-1) * abs(imk), delta = 10, kernel = exp(aimk/delta)) %>%
ungroup() %>%
group_by(Vehicle.ID) %>%
mutate(p = Vehicle.acceleration2[match(ka,i)], kernelp = p * kernel) %>%
ungroup() %>%
group_by(Vehicle.ID, Frame.ID) %>%
summarise(Z = sum(kernel), prod = sum(kernelp)) %>%
mutate(sacc = prod/Z) %>%
ungroup()
umal <- uma %>%
group_by(Vehicle.ID, Frame.ID) %>%
do(data.frame(kl = .$imDl:.$ipDl)) %>%
left_join(x=., y=uma) %>%
mutate(imk = i - kl, aimk = (-1) * abs(imk), delta = 10, kernel = exp(aimk/delta)) %>%
ungroup() %>%
group_by(Vehicle.ID) %>%
mutate(p = Local.Y[match(kl,i)], kernelp = p * kernel) %>%
ungroup() %>%
group_by(Vehicle.ID, Frame.ID) %>%
summarise(Z = sum(kernel), prod = sum(kernelp)) %>%
mutate(ycoord = prod/Z) %>%
ungroup()
umal <- select(umal,c("Vehicle.ID", "Frame.ID", "ycoord"))
umav <- select(umav, c("Vehicle.ID", "Frame.ID", "svel"))
umaa <- select(umaa, c("Vehicle.ID", "Frame.ID", "sacc"))
umair <- left_join(uma, umal) %>% left_join(x=., y=umav) %>% left_join(x=., y=umaa)
A good first step would be to take a for loop (which I'll hide with sapply) and perform the exponential smoothing for each index:
josilber1 <- function(uma) {
delta <- 10
sapply(1:nrow(uma), function(i) {
D <- min(i-1, nrow(uma)-i, 30)
rng <- (i-D):(i+D)
rng <- rng[rng >= 1 & rng <= nrow(uma)]
expabs <- exp(-abs(i-rng)/delta)
return(sum(uma$Vehicle.velocity[rng] * expabs) / sum(expabs))
})
}
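Since the question asks about convolution: for interior indices, where D = 30, the window has a fixed width, so the smoother there is an ordinary convolution with normalized weights and stats::filter can compute it. Only the first and last 30 points, where the window is clipped, need the per-index formula. The sketch below is one way to combine the two; the helper names are hypothetical, and it assumes max_d >= 1 and no missing values in x.

```r
# Weighted average around index i with the window clipped near the ends:
# D = min(i - 1, n - i, max_d), as defined in the question.
smooth_at <- function(i, x, delta, max_d) {
  n <- length(x)
  d <- min(i - 1, n - i, max_d)
  k <- (i - d):(i + d)
  w <- exp(-abs(i - k) / delta)
  sum(x[k] * w) / sum(w)
}

# Reference version: evaluate the formula directly at every index.
smooth_direct <- function(x, delta = 10, max_d = 30) {
  vapply(seq_along(x), function(i) smooth_at(i, x, delta, max_d), numeric(1))
}

# Convolution version: stats::filter handles the interior (full window);
# the clipped windows at both ends fall back to smooth_at.
smooth_filter <- function(x, delta = 10, max_d = 30) {
  n <- length(x)
  # Too short for a single full window: use the direct formula everywhere.
  if (n < 2 * max_d + 1) return(smooth_direct(x, delta, max_d))
  w <- exp(-abs(-max_d:max_d) / delta)
  out <- as.numeric(stats::filter(x, w / sum(w), sides = 2))
  edge <- c(seq_len(max_d), (n - max_d + 1):n)
  out[edge] <- vapply(edge, function(i) smooth_at(i, x, delta, max_d), numeric(1))
  out
}

set.seed(1)
x <- 40 + cumsum(rnorm(200, sd = 0.3)) # synthetic velocity-like series
print(all.equal(smooth_filter(x), smooth_direct(x)))
```

To apply it per vehicle, the function could be called on each vehicle's velocity vector separately, for example with ave() or a grouped mutate.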
A more involved approach would be to only compute the incremental change in the exponential smoothing function for each index (as opposed to re-summing at each index). The exponential smoothing function has a lower part (data before the current index; I include the current index in low in the code below) and an upper part (data after the current index; high in the code below). As we loop through the vector, all the data in the lower part gets weighted less (we divide by mult) and all the data in the upper part gets weighted more (we multiply by mult). The leftmost element is dropped from low, the leftmost element in high moves to low, and one element is added to the right side of high.
The actual code is a bit messier to deal with the beginning and ending of the vector and to deal with numerical stability issues (errors in high are multiplied by mult each iteration):
josilber2 <- function(uma) {
delta <- 10
x <- uma$Vehicle.velocity
ret <- c(x[1], rep(NA, nrow(uma)-1))
low <- x[1]
high <- 0
norm <- 1
old.D <- 0
mult <- exp(1/delta)
for (i in 2:nrow(uma)) {
D <- min(i-1, nrow(uma)-i, 30)
if (D == old.D + 1) {
low <- low / mult + x[i]
high <- high * mult - x[i] + x[i+D-1]/mult^(D-1) + x[i+D]/mult^D
norm <- norm + 2 / mult^D
} else if (D == old.D) {
low <- low / mult - x[i-(D+1)]/mult^(D+1) + x[i]
high <- high * mult - x[i] + x[i+D]/mult^D
} else {
low <- low / mult - x[i-(D+2)]/mult^(D+2) - x[i-(D+1)]/mult^(D+1) + x[i]
high <- high * mult - x[i]
norm <- norm - 2 / mult^(D+1)
}
# For numerical stability, recompute high every so often
if (i %% 50 == 0) {
rng <- (i+1):(i+D)
expabs <- exp(-abs(i-rng)/delta)
high <- sum(x[rng] * expabs)
}
ret[i] <- (low+high)/norm
old.D <- D
}
return(ret)
}
R code like josilber2 can often be sped up considerably using the Rcpp package:
library(Rcpp)
josilber3 <- cppFunction(
"
NumericVector josilber3(NumericVector x) {
double delta = 10.0;
NumericVector ret(x.size(), 0.0);
ret[0] = x[0];
double low = x[0];
double high = 0.0;
double norm = 1.0;
int oldD = 0;
double mult = exp(1/delta);
for (int i=1; i < x.size(); ++i) {
int D = i;
if (x.size()-i-1 < D) D = x.size()-i-1;
if (30 < D) D = 30;
if (D == oldD + 1) {
low = low / mult + x[i];
high = high * mult - x[i] + x[i+D-1]/pow(mult, D-1) + x[i+D]/pow(mult, D);
norm = norm + 2 / pow(mult, D);
} else if (D == oldD) {
low = low / mult - x[i-(D+1)]/pow(mult, D+1) + x[i];
high = high * mult - x[i] + x[i+D]/pow(mult, D);
} else {
low = low / mult - x[i-(D+2)]/pow(mult, D+2) - x[i-(D+1)]/pow(mult, D+1) + x[i];
high = high * mult - x[i];
norm = norm - 2 / pow(mult, D+1);
}
if (i % 50 == 0) {
high = 0.0;
for (int j=i+1; j <= i+D; ++j) {
high += x[j] * exp((i-j)/delta);
}
}
ret[i] = (low+high)/norm;
oldD = D;
}
return ret;
}")
We can now benchmark the improvements from these three new approaches:
all.equal(umair.fxn(uma), josilber1(uma))
# [1] TRUE
all.equal(umair.fxn(uma), josilber2(uma))
# [1] TRUE
all.equal(umair.fxn(uma), josilber3(uma$Vehicle.velocity))
# [1] TRUE
library(microbenchmark)
microbenchmark(umair.fxn(uma), josilber1(uma), josilber2(uma), josilber3(uma$Vehicle.velocity))
# Unit: microseconds
# expr min lq mean median uq max neval
# umair.fxn(uma) 370006.728 382327.4115 398554.71080 393495.052 404186.153 572801.355 100
# josilber1(uma) 12879.268 13640.1310 15981.82099 14265.610 14805.419 28959.230 100
# josilber2(uma) 4324.724 4502.8125 5753.47088 4918.835 5244.309 17328.797 100
# josilber3(uma$Vehicle.velocity) 41.582 54.5235 57.76919 57.435 60.099 90.998 100
We got a large improvement (about 25x) with the simpler josilber1 and about a 70x total speedup with josilber2 (the advantage would be greater with a larger delta value). With josilber3 we achieve roughly a 6800x speedup, bringing the runtime down to about 57 microseconds (median) to process a single vehicle!

Resources