I was wondering if anyone could tell me how to represent the enumeration of vectors of the private key f in a Meet-in-the-Middle attack on an NTRU private key. I cannot understand the example given here: http://securityinnovation.com/cryptolab/pdf/NTRUTech004v2.pdf
I'll be very thankful if anyone could show an example in detail.
(Full disclosure: I work for Security Innovation and worked for NTRU until SI acquired us)
Warning: Long answer!
Let's look at a toy example: N = 11, q = 29. Let's take df = 3, so f consists of 3 coefficients equal to 1 and 8 coefficients equal to 0. Take dg = 5. And assume that h = g*f^{-1} mod q, rather than using the optimizations that have f = 1+pF. Then we might have
f = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
finv = [16, 12, 4, 18, 17, 14, 9, 28, 8, 26, 3]
g = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]
h = [15, 20, 1, 21, 4, 26, 14, 17, 25, 11, 12]
You can check that f*h = g here.
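If you want to do that check by machine, here is a minimal Python sketch. The helper name cyclic_mul is mine (not from the tech note); it assumes multiplication is the usual cyclic convolution in Z_q[x]/(x^N - 1) with q = 29.

```python
# Toy parameters and key material copied from the example above.
N, q = 11, 29
f    = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
finv = [16, 12, 4, 18, 17, 14, 9, 28, 8, 26, 3]
g    = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]
h    = [15, 20, 1, 21, 4, 26, 14, 17, 25, 11, 12]

def cyclic_mul(a, b):
    """Multiply a and b in Z_q[x]/(x^N - 1) (cyclic convolution mod q)."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % N] = (out[(i + j) % N] + ai * bj) % q
    return out

one = [1] + [0] * (N - 1)
print(cyclic_mul(f, finv) == one)  # finv really is the inverse of f
print(cyclic_mul(g, finv) == h)    # h = g * f^{-1}
print(cyclic_mul(f, h) == g)       # f * h gives back g
```

All three comparisons print True for the vectors above.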
The attacker wants to find f, so they can do a brute-force search over all candidates with df = 3. They can speed this up by taking advantage of the fact that some rotation of f has a 1 in the first position, so they only need to search the (10 choose 2) = 45 possible locations for the other two nonzero coefficients of f. The full search they perform is this:
f*h (= candidate g); f
[9, 18, 7, 13, 26, 22, 15, 28, 27, 24, 19]; [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[23, 17, 4, 8, 16, 2, 3, 6, 10, 21, 11]; [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
[15, 2, 3, 5, 11, 21, 12, 23, 17, 4, 8]; [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
[12, 23, 17, 4, 8, 16, 2, 3, 5, 11, 20]; [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
[24, 20, 9, 18, 7, 13, 26, 22, 14, 28, 27]; [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
[2, 3, 6, 10, 21, 12, 23, 17, 4, 8, 15]; [1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
[19, 10, 18, 7, 13, 26, 22, 14, 28, 27, 24]; [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[28, 27, 25, 19, 10, 18, 7, 13, 25, 22, 14]; [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
[18, 7, 13, 26, 22, 15, 28, 27, 24, 19, 9]; [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
[22, 14, 28, 27, 25, 19, 10, 18, 7, 13, 25]; [1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
[14, 28, 27, 24, 20, 9, 19, 6, 14, 25, 22]; [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
[11, 20, 12, 23, 17, 4, 9, 15, 2, 3, 5]; [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
[23, 17, 4, 8, 16, 1, 4, 5, 11, 20, 12]; [1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
[1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]; [1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
[18, 7, 13, 26, 22, 14, 0, 26, 25, 19, 9]; [1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
[27, 24, 20, 9, 19, 6, 14, 25, 22, 14, 28]; [1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
[17, 4, 8, 16, 2, 3, 6, 10, 21, 11, 23]; [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
[28, 27, 24, 19, 10, 18, 7, 13, 26, 22, 14]; [1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]
[25, 19, 9, 18, 7, 13, 26, 22, 14, 0, 26]; [1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
[8, 16, 1, 3, 6, 10, 21, 12, 23, 17, 4]; [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
[15, 28, 27, 24, 20, 9, 18, 7, 13, 26, 21]; [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
[3, 6, 10, 21, 12, 23, 17, 4, 8, 16, 1]; [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
[12, 23, 17, 4, 9, 15, 2, 3, 5, 11, 20]; [1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
[2, 3, 5, 11, 21, 12, 23, 17, 4, 8, 15]; [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
[17, 4, 8, 15, 2, 3, 6, 10, 21, 12, 23]; [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
[0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1]; [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
[7, 13, 26, 21, 15, 28, 27, 24, 20, 9, 18]; [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
[24, 20, 9, 18, 7, 13, 26, 21, 15, 28, 27]; [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
[4, 8, 16, 1, 4, 5, 11, 20, 12, 23, 17]; [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
[23, 17, 4, 8, 16, 2, 3, 5, 11, 20, 12]; [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
[26, 22, 14, 28, 27, 24, 20, 9, 18, 7, 13]; [1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
[4, 5, 11, 20, 12, 23, 17, 4, 8, 16, 1]; [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
[21, 12, 23, 17, 4, 8, 16, 1, 3, 6, 10]; [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
[1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0]; [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
[20, 9, 18, 7, 13, 26, 22, 14, 28, 27, 24]; [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
[16, 2, 3, 5, 11, 20, 12, 23, 17, 4, 8]; [1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
[4, 9, 15, 2, 3, 5, 11, 20, 12, 23, 17]; [1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0]
[13, 26, 22, 14, 0, 26, 25, 19, 9, 18, 7]; [1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0]
[3, 6, 10, 21, 12, 23, 17, 4, 8, 15, 2]; [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
[11, 21, 12, 23, 17, 4, 8, 15, 2, 3, 5]; [1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
[20, 9, 19, 6, 14, 25, 22, 14, 28, 27, 24]; [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
[10, 18, 7, 13, 26, 22, 14, 28, 27, 24, 19]; [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
[8, 16, 2, 3, 6, 10, 21, 11, 23, 17, 4]; [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
[27, 25, 19, 10, 18, 7, 13, 25, 22, 14, 28]; [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
[7, 13, 26, 22, 15, 28, 27, 24, 19, 9, 18]; [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
Scan down the list and you can see that g, or a rotation of it, appears in rows 14, 26 and 34 of the 45 rows. (It appears three times because there are three 1's in f, so there are three rotations of f that have a 1 in the leading position.)
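The same search can be sketched in Python. This is only an illustration of the table above, with my own helper names (cyclic_mul, g_rotations); a real attacker does not know g and would instead test whether the product is a vector of 0s and 1s with dg ones. Here I compare against the known g so the rows are easy to recognise.

```python
from itertools import combinations

N, q = 11, 29
g = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]
h = [15, 20, 1, 21, 4, 26, 14, 17, 25, 11, 12]

def cyclic_mul(a, b):
    """Multiply a and b in Z_q[x]/(x^N - 1)."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % N] = (out[(i + j) % N] + ai * bj) % q
    return out

# Every cyclic rotation of g; a rotated f produces a rotated g.
g_rotations = [g[-k:] + g[:-k] for k in range(N)]

hits = []
# Leading coefficient fixed to 1; choose the other two positions from 1..10.
for row, (i, j) in enumerate(combinations(range(1, N), 2), start=1):
    cand = [0] * N
    cand[0] = cand[i] = cand[j] = 1
    if cyclic_mul(cand, h) in g_rotations:
        hits.append((row, cand))

for row, cand in hits:
    print(row, cand)
```

The candidates are generated in the same order as the table, so the reported row numbers are 14, 26 and 34.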
Now let's look at the meet-in-the-middle attack. The attacker uses the formula
(f1+f2) * h = g
so
f1*h = g - f2*h
Using e[i] to mean the i'th coefficient of e, and remembering that every coefficient of g is 0 or 1, this means that the attacker knows that
(f1*h)[i] = -(f2*h)[i] + (0 or 1) mod q
So the attacker calculates all possible values of f1*h. Call the resulting list {g1}. They then calculate -f2*h and, for each result g2, they check whether some existing g1 is the same as g2 or exceeds g2 by exactly 1 (mod q) in some coefficients and equals it in the rest. In other words,
[3, 10, 12, 7]
would match
[4, 10, 12, 8]
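As a small Python sketch of that comparison (the function name matches is mine, and I am assuming the first vector in the example is a g2 and the second a g1):

```python
q = 29

def matches(g1, g2):
    """True if g1 - g2 is 0 or 1 (mod q) in every coefficient."""
    return all((a - b) % q in (0, 1) for a, b in zip(g1, g2))

print(matches([4, 10, 12, 8], [3, 10, 12, 7]))  # True: differences are 1, 0, 0, 1
print(matches([0, 8, 26], [28, 8, 25]))         # True: 0 - 28 = 1 (mod 29)
print(matches([3, 10, 12, 7], [4, 10, 12, 8]))  # False: 3 - 4 = -1, wrong direction
```

The second call shows why the difference has to be taken mod q: a coefficient 0 in g1 matches a coefficient 28 in g2.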
Doing it this way, the attacker needs only work through the following:
All 10 f1s with a 1 in the leading position and a 1 somewhere else
All 10 f2s with a single 1 in any position other than the leading one
This gives the following. I've sorted the lists to make the matches easier to spot.
f1*h = g1 f1
[00, 08, 26, 03, 16, 12, 05, 18, 17, 15, 09] [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
[03, 16, 12, 04, 19, 17, 15, 09, 00, 08, 26] [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
[06, 21, 22, 25, 01, 11, 02, 13, 07, 23, 27] [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
[07, 24, 27, 06, 21, 22, 25, 00, 11, 02, 13] [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[11, 02, 13, 07, 24, 27, 06, 21, 22, 25, 00] [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
[12, 05, 18, 17, 15, 09, 00, 08, 26, 03, 16] [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
[16, 12, 05, 18, 18, 14, 10, 28, 08, 26, 03] [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
[19, 17, 15, 09, 00, 08, 26, 03, 16, 12, 04] [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
[26, 03, 16, 12, 05, 18, 18, 14, 10, 28, 08] [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[27, 06, 21, 22, 25, 01, 11, 02, 13, 07, 23] [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
-f2*h = g2 f2
[03, 15, 12, 04, 18, 17, 14, 09, 28, 08, 25] [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
[04, 18, 17, 14, 09, 28, 08, 25, 03, 15, 12] [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
[08, 25, 03, 15, 12, 04, 18, 17, 14, 09, 28] [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
[09, 28, 08, 25, 03, 15, 12, 04, 18, 17, 14] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
[12, 04, 18, 17, 14, 09, 28, 08, 25, 03, 15] [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
[15, 12, 04, 18, 17, 14, 09, 28, 08, 25, 03] [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
[17, 14, 09, 28, 08, 25, 03, 15, 12, 04, 18] [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[18, 17, 14, 09, 28, 08, 25, 03, 15, 12, 04] [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[25, 03, 15, 12, 04, 18, 17, 14, 09, 28, 08] [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
[28, 08, 25, 03, 15, 12, 04, 18, 17, 14, 09] [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
You can see that:
line 1 of g1 matches with line 10 of g2, giving [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
line 2 of g1 matches with line 1 of g2, giving [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
line 6 of g1 matches with line 5 of g2, giving [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
line 7 of g1 matches with line 6 of g2, giving [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
line 8 of g1 matches with line 8 of g2, giving [1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
line 9 of g1 matches with line 9 of g2, giving [1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
There are 6 collisions here because there are 3 rotations of f with a 1 in the leading position and, for each rotation, two ways to split its other two 1s between f1 and f2.
So an attacker would have to do about 45/3 = 15 units of work to find the key with a brute force search, and about 10 units of work with a meet-in-the-middle attack (slightly less than 10 due to the rotations, but I don't have a clean formula to hand).
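To make the two lists and the six collisions reproducible, here is a rough Python sketch of the toy attack. The names rot, matches, table1 and table2 are mine, and the nested loop is only for readability; it compares every g1 with every g2, which a real implementation would replace with a sorted or hashed lookup.

```python
N, q = 11, 29
h = [15, 20, 1, 21, 4, 26, 14, 17, 25, 11, 12]

def rot(v, k):
    """Coefficients of x^k * v in Z_q[x]/(x^N - 1): rotate right by k."""
    k %= N
    return v[-k:] + v[:-k] if k else list(v)

def matches(g1, g2):
    """True if g1 - g2 is 0 or 1 (mod q) in every coefficient."""
    return all((a - b) % q in (0, 1) for a, b in zip(g1, g2))

# f1 = 1 + x^i, so f1*h = h + x^i*h.
table1 = {i: [(a + b) % q for a, b in zip(h, rot(h, i))] for i in range(1, N)}
# f2 = x^j, so -f2*h = -(x^j*h).
table2 = {j: [(-c) % q for c in rot(h, j)] for j in range(1, N)}

collisions = []
for i, g1 in table1.items():
    for j, g2 in table2.items():
        # i == j would put a coefficient 2 in f1 + f2, so skip it.
        if i != j and matches(g1, g2):
            f = [0] * N
            f[0] = f[i] = f[j] = 1
            collisions.append(f)

for f in collisions:
    print(f)
```

This prints six vectors: each of the three rotations of f with a leading 1 appears twice.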
There are various optimizations, but this should be enough to give you the idea.
One thing I haven't dealt with so far is how to keep the search time down. A straightforward way to do it is simply to sort the results as you're going along. The time to insert or look for a collision with an entry is about log_2(size of the search space). Alternatively, at the cost of using more memory, it's possible to bring this search time down to a constant by reserving a block for each possible value of the first few coefficients of g1.
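Here is one way the bucket idea could look in Python, as a sketch under my own assumptions: K = 2 leading coefficients are used as the bucket key (the value of K and the names buckets, rot and found are illustrative only). Because g1 may exceed g2 by 0 or 1 in each coefficient, every g2 has to probe 2^K keys.

```python
from collections import defaultdict
from itertools import product

N, q, K = 11, 29, 2  # K leading coefficients form the bucket key (assumption)
h = [15, 20, 1, 21, 4, 26, 14, 17, 25, 11, 12]

def rot(v, k):
    """Coefficients of x^k * v in Z_q[x]/(x^N - 1): rotate right by k."""
    k %= N
    return v[-k:] + v[:-k] if k else list(v)

# Store every g1 = (1 + x^i)*h under the key made of its first K coefficients.
buckets = defaultdict(list)
for i in range(1, N):
    g1 = [(a + b) % q for a, b in zip(h, rot(h, i))]
    buckets[tuple(g1[:K])].append((i, g1))

found = []
for j in range(1, N):
    g2 = [(-c) % q for c in rot(h, j)]
    # g1 = g2 + (0 or 1) in each coefficient, so probe all 2**K candidate keys.
    for bits in product((0, 1), repeat=K):
        key = tuple((c + b) % q for c, b in zip(g2, bits))
        for i, g1 in buckets.get(key, []):
            if i != j and all((a - b) % q in (0, 1) for a, b in zip(g1, g2)):
                found.append((i, j))

print(sorted(found))
```

Each pair (i, j) stands for f = 1 + x^i + x^j. A larger K means fewer false candidates per bucket but more keys to probe per g2, which is the memory and time trade-off described above.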
Hope this helps. Let me know if you have any more questions.
Using the markovchain library, I managed to generate 144 transition probability matrices (e.g. matrixList), but how can I save all of them (143) as .csv files?
library(tidyverse)
library(markovchain)
mcListFist<-markovchainListFit(data=df[, 1:144], name="df")
matrixList<-list()
for (i in 1:dim(mcListFist$estimate)) {
  # extract the transition matrix slot of the i-th fitted chain
  myMatr <- mcListFist$estimate[[i]]@transitionMatrix
  matrixList[[i]] <- myMatr
}
matrixList
Output of transition probability matrices 84-85:
............
[[84]]
  0
0 1
[[85]]
  0
0 1
Sample data:
df<-structure(list(`04:00` = c(11, 11, 11, 11, 11, 11, 11, 11, 11,
11), `04:10` = c(11, 11, 11, 11, 11, 11, 11, 11, 11, 11), `04:20` = c(11,
11, 11, 11, 11, 11, 11, 11, 11, 11), `04:30` = c(11, 11, 11,
11, 11, 11, 11, 11, 11, 11), `04:40` = c(11, 11, 11, 11, 11,
11, 11, 11, 11, 11), `04:50` = c(11, 11, 11, 11, 11, 11, 11,
11, 11, 11), `05:00` = c(11, 11, 11, 11, 11, 11, 11, 11, 11,
11), `05:10` = c(11, 11, 11, 11, 11, 11, 11, 11, 11, 11), `05:20` = c(11,
11, 11, 11, 11, 11, 11, 11, 11, 11), `05:30` = c(11, 11, 11,
11, 11, 11, 11, 11, 11, 11), `05:40` = c(11, 11, 11, 11, 11,
11, 11, 11, 11, 11), `05:50` = c(11, 11, 11, 11, 11, 11, 11,
11, 11, 11), `06:00` = c(11, 0, 11, 11, 11, 11, 11, 0, 0, 11),
`06:10` = c(11, 0, 11, 11, 11, 11, 11, 0, 0, 11), `06:20` = c(11,
0, 11, 11, 11, 11, 11, 0, 0, 11), `06:30` = c(11, 0, 11,
11, 11, 11, 11, 0, 0, 0), `06:40` = c(11, 0, 11, 11, 11,
11, 11, 0, 0, 0), `06:50` = c(11, 0, 11, 11, 11, 11, 11,
0, 0, 0), `07:00` = c(11, 0, 11, 0, 11, 11, 11, 0, 0, 0),
`07:10` = c(11, 0, 11, 0, 11, 11, 11, 0, 0, 0), `07:20` = c(11,
0, 11, 0, 11, 11, 11, 0, 0, 0), `07:30` = c(11, 0, 11, 0,
11, 11, 11, 0, 0, 0), `07:40` = c(0, 0, 11, 0, 11, 11, 11,
0, 0, 0), `07:50` = c(0, 0, 11, 0, 11, 11, 11, 0, 0, 0),
`08:00` = c(0, 0, 0, 0, 0, 11, 11, 0, 0, 0), `08:10` = c(0,
0, 0, 0, 0, 11, 11, 0, 0, 0), `08:20` = c(0, 0, 0, 0, 0,
11, 11, 0, 0, 0), `08:30` = c(0, 0, 0, 0, 0, 11, 11, 0, 0,
0), `08:40` = c(0, 0, 0, 0, 0, 11, 11, 0, 0, 0), `08:50` = c(0,
0, 0, 0, 0, 11, 11, 0, 0, 0), `09:00` = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0), `09:10` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
`09:20` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `09:30` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `09:40` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `09:50` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `10:00` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `10:10` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `10:20` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `10:30` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `10:40` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `10:50` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `11:00` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `11:10` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `11:20` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `11:30` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `11:40` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `11:50` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `12:00` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `12:10` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `12:20` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `12:30` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `12:40` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `12:50` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `13:00` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `13:10` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `13:20` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `13:30` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `13:40` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `13:50` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `14:00` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `14:10` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `14:20` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `14:30` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `14:40` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `14:50` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `15:00` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `15:10` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `15:20` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `15:30` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `15:40` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `15:50` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `16:00` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `16:10` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `16:20` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `16:30` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `16:40` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `16:50` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `17:00` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `17:10` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `17:20` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `17:30` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `17:40` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `17:50` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `18:00` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `18:10` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `18:20` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `18:30` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `18:40` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `18:50` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `19:00` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `19:10` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `19:20` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `19:30` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `19:40` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `19:50` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `20:00` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `20:10` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `20:20` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `20:30` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `20:40` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `20:50` = c(0, 11, 11, 0, 0, 0, 0, 0, 0, 0),
`21:00` = c(0, 11, 11, 0, 0, 0, 0, 0, 0, 11), `21:10` = c(0,
11, 11, 0, 0, 0, 0, 0, 0, 11), `21:20` = c(0, 11, 11, 0,
0, 0, 0, 0, 0, 11), `21:30` = c(0, 11, 11, 0, 0, 0, 0, 0,
0, 11), `21:40` = c(0, 11, 11, 0, 0, 0, 0, 0, 0, 11), `21:50` = c(11,
11, 11, 0, 0, 0, 0, 0, 0, 11), `22:00` = c(11, 11, 11, 0,
0, 0, 0, 0, 0, 11), `22:10` = c(11, 11, 11, 0, 0, 0, 11,
0, 0, 11), `22:20` = c(11, 11, 11, 0, 0, 0, 11, 0, 0, 11),
`22:30` = c(11, 11, 11, 0, 0, 11, 11, 11, 11, 11), `22:40` = c(11,
11, 11, 0, 0, 11, 11, 11, 11, 11), `22:50` = c(11, 11, 11,
0, 0, 11, 11, 11, 11, 11), `23:00` = c(11, 11, 11, 0, 0,
11, 11, 11, 11, 11), `23:10` = c(11, 11, 11, 0, 0, 11, 11,
11, 11, 11), `23:20` = c(11, 11, 11, 0, 0, 11, 11, 11, 11,
11), `23:30` = c(11, 11, 11, 0, 0, 11, 11, 11, 11, 11), `23:40` = c(11,
11, 11, 0, 0, 11, 11, 11, 11, 11), `23:50` = c(11, 11, 11,
0, 11, 11, 11, 11, 11, 11), `00:00` = c(11, 11, 11, 0, 11,
11, 11, 11, 11, 11), `00:10` = c(11, 11, 11, 0, 11, 11, 11,
11, 11, 11), `00:20` = c(11, 11, 11, 0, 11, 11, 11, 11, 11,
11), `00:30` = c(11, 11, 11, 11, 11, 11, 11, 11, 11, 11),
`00:40` = c(11, 11, 11, 11, 11, 11, 11, 11, 11, 11), `00:50` = c(11,
11, 11, 11, 11, 11, 11, 11, 11, 11), `01:00` = c(11, 11,
11, 11, 11, 11, 11, 11, 11, 11), `01:10` = c(11, 11, 11,
11, 11, 11, 11, 11, 11, 11), `01:20` = c(11, 11, 11, 11,
11, 11, 11, 11, 11, 11), `01:30` = c(11, 11, 11, 11, 11,
11, 11, 11, 11, 11), `01:40` = c(11, 11, 11, 11, 11, 11,
11, 11, 11, 11), `01:50` = c(11, 11, 11, 11, 11, 11, 11,
11, 11, 11), `02:00` = c(11, 11, 11, 11, 11, 11, 11, 11,
11, 11), `02:10` = c(11, 11, 11, 11, 11, 11, 11, 11, 11,
11), `02:20` = c(11, 11, 11, 11, 11, 11, 0, 11, 11, 11),
`02:30` = c(11, 11, 11, 11, 11, 11, 11, 11, 11, 11), `02:40` = c(11,
11, 11, 11, 11, 11, 11, 11, 11, 11), `02:50` = c(11, 11,
11, 11, 11, 11, 11, 11, 11, 11), `03:00` = c(11, 11, 11,
11, 11, 11, 11, 11, 11, 11), `03:10` = c(11, 11, 11, 11,
11, 11, 11, 11, 11, 11), `03:20` = c(11, 11, 11, 11, 11,
11, 11, 11, 11, 11), `03:30` = c(11, 11, 11, 11, 11, 11,
11, 11, 11, 11), `03:40` = c(11, 11, 11, 11, 11, 11, 11,
11, 11, 11), `03:50` = c(11, 11, 11, 11, 11, 11, 11, 11,
11, 11)), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))
Apply the writing function to every element of the list:
mapply(function(data, name) {
  data <- as.data.frame(data)
  write.csv(data, paste0(name, ".csv"))
}, matrixList, seq_along(matrixList))
I am trying to create a histogram with a rainbow color scale but I also want to have the bin labels. I have been able to create a histogram with labeled bins and I have read a couple of posts talking about how to make a rainbow histogram which I have been able to recreate (here and here). However, I have not been able to create a rainbow histogram with the correct bin labels. I will attach an example data set and some sample code that I have tried. Ideally, I would also like to remove any bin labels that have zero as a value but I don't want to be too greedy here.
ggplot(final_df,aes(x=V1, fill = cut(V1, 25)))+ geom_histogram(show.legend = FALSE) +
stat_bin(aes(y=..count.., label=..count..), geom="text", vjust=-.5)
As you can see, it creates the rainbow histogram but the bin labels are all messed up.
structure(list(V1 = c(18, 0, 20, 21, 0, 2, 0, 1, 0, 0, 4, 16,
0, 0, 20, 20, 2, 0, 19, 22, 0, 0, 19, 0, 22, 22, 19, 2, 0, 0,
1, 18, 23, 1, 3, 1, 1, 1, 0, 21, 21, 0, 0, 15, 24, 0, 20, 19,
0, 1, 20, 21, 0, 0, 20, 22, 20, 0, 21, 0, 0, 22, 0, 0, 0, 23,
2, 1, 1, 21, 0, 2, 3, 23, 23, 1, 22, 0, 19, 23, 1, 2, 23, 1,
0, 0, 20, 1, 0, 0, 1, 18, 0, 0, 0, 0, 0, 2, 0, 7, 22, 0, 0, 23,
1, 0, 23, 0, 0, 1, 2, 0, 0, 18, 16, 0, 0, 1, 0, 0, 0, 2, 22,
0, 2, 0, 0, 0, 24, 0, 0, 0, 1, 1, 20, 0, 0, 1, 18, 0, 1, 1, 0,
0, 3, 0, 20, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 20, 2,
0, 1, 22, 0, 1, 23, 2, 0, 1, 5, 0, 10, 1, 17, 0, 0, 1, 1, 2,
1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 23, 2, 19, 2, 1, 21, 3,
0, 0, 20, 0, 1, 0, 1, 0, 0, 24, 2, 1, 1, 23, 1, 1, 0, 1, 0, 0,
22, 23, 0, 23, 0, 22, 2, 19, 0, 20, 22, 0, 23, 0, 21, 0, 0, 23,
0, 0, 0, 0, 3, 22, 1, 0, 1, 22, 22, 20, 0, 1, 2, 22, 2, 23, 0,
18, 1, 23, 0, 2, 0, 1, 22, 0, 21, 0, 2, 20, 0, 0, 23, 0, 1, 18,
0, 18, 20, 1, 0, 20, 0, 1, 0, 0, 17, 20, 0, 0, 1, 22, 20, 22,
2, 1, 1, 0, 1, 0, 0, 0, 18, 0, 0, 21, 0, 0, 2, 22, 20, 1, 0,
0, 0, 0, 1, 0, 0, 1, 0, 4, 1, 0, 21, 21, 0, 0, 1, 0, 1, 3, 0,
1, 1, 0, 24, 0, 0, 22, 17, 0, 1, 20, 1, 1, 21, 1, 21, 21, 0,
21, 0, 1, 23, 0, 0, 23, 21, 0, 0, 24, 0, 6, 17, 0, 21, 0, 23,
0, 0, 22, 1, 1, 22, 0, 2, 0, 0, 1, 19, 0, 21, 21, 2, 1, 18, 1,
21, 0, 1, 1, 0, 0, 1, 23, 0, 0, 1, 0, 0, 0, 1, 2, 1, 0, 0, 0,
25, 0, 0, 1, 0, 0, 0, 23, 23, 0, 0, 0, 21, 19, 2, 0, 0, 0, 0,
0, 1, 0, 22, 22, 0, 19, 0, 3, 0, 21, 0, 1, 20, 1, 1, 1, 22, 1,
22, 1, 22, 1, 0, 2, 0, 25, 23, 0, 20, 0, 2, 22, 0, 0, 1, 0, 1,
23, 22, 0, 1, 19, 23, 1, 0, 2, 0, 18, 0, 0, 2, 0, 0, 23, 0, 0,
0, 0, 0, 1, 2, 1, 0, 21, 0, 21, 20, 0, 1, 19, 23, 0, 1, 23, 0,
1, 22, 21, 3, 0, 22, 2, 0, 1, 23, 2, 0, 24, 23, 21, 23, 20, 0,
0, 0, 20, 22, 0, 2, 0, 17, 0, 0, 1, 22, 1, 1, 1, 0, 0, 3, 3,
5, 21, 21, 1, 19, 18, 0, 24, 1, 2, 0, 0, 1, 1, 0, 0, 0, 0, 0,
0, 23, 1, 20, 0, 0, 1, 19, 22, 21, 24, 3, 1, 2, 24, 0, 0, 23,
17, 22, 0, 24, 23, 16, 1, 0, 2, 20, 0, 19, 0, 2, 1, 22, 20, 0,
20, 0, 1, 22, 0, 1, 0, 2, 0, 1, 0, 0, 2, 25, 24, 2, 20, 3, 0,
0, 23, 0, 4, 0, 19, 1, 0, 1, 0, 3, 19, 22, 0, 0, 0, 1, 0, 1,
23, 20, 20, 23, 0, 0, 0, 24, 0, 21, 20, 23, 0, 1, 1, 0, 19, 0,
0, 0, 1, 22, 0, 22, 0, 1, 18, 0, 20, 1, 0, 0, 1, 20, 0, 0, 0,
0, 0, 0, 0, 0, 19, 0, 0, 1, 0, 2, 23, 19, 21, 4, 1, 0, 0, 1,
23, 21, 21, 4, 20, 24, 0, 3, 0, 20, 23, 1, 23, 21, 20, 18, 0,
21, 2, 1, 21, 0)), class = "data.frame", row.names = c(NA, -713L
))
The issue is that you manually bin your V1 variable using cut(V1, 25). Thereby you get 25 groups which (while most of the time having a zero count) get stacked on top of each other. Hence, you end up with 25 stacked (and overlapping) labels per bin. Instead, make use of the bins computed by stat_bin by mapping factor(..x..) on fill:
library(ggplot2)
p <- ggplot(final_df, aes(x = V1, fill = factor(..x..))) +
geom_histogram(show.legend = FALSE)
p +
stat_bin(aes(y = ..count.., label = ..count..), geom = "text", vjust = -.5)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
To get rid of the zero entries you could make use of an ifelse:
p +
stat_bin(aes(y = ..count.., label = ifelse(..count.. > 0, ..count.., "")), geom = "text", vjust = -.5)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I am trying to implement a bincount operation in OpenCL which allocates an output buffer and uses indices from x to accumulate some weights at the same index (assume that num_bins == max(x) + 1). This is equivalent to the following Python code:
def bincount(x, weight, num_bins):
    out = np.zeros(num_bins, dtype=weight.dtype)
    for i in range(len(x)):
        out[x[i]] += weight[i]
    return out
What I have is the following:
import pyopencl as cl
import numpy as np
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, """
__kernel void bincount(__global int *res_g, __global const int* x_g, __global const int* weight_g)
{
int gid = get_global_id(0);
res_g[x_g[gid]] += weight_g[gid];
}
""").build()
# test
x = np.arange(5, dtype=np.int32).repeat(2) # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
x_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=x)
weight = np.arange(10, dtype=np.int32) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
weight_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=weight)
res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 5)
prg.bincount(queue, [10], None, res_g, x_g, weight_g)
# transfer back to cpu
res_np = np.empty(5).astype(np.int32)
cl.enqueue_copy(queue, res_np, res_g)
Output in res_np:
array([1, 3, 5, 7, 9], dtype=int32)
Expected output:
array([1, 5, 9, 13, 17], dtype=int32)
How do I accumulate the elements that are indexed more than once?
EDIT
The above is a contrived example, in my real-world application x will be indices from a sliding window algorithm:
x = np.array([ 0, 1, 2, 4, 5, 6, 8, 9, 10, 1, 2, 3, 5, 6, 7, 9, 10,
11, 4, 5, 6, 8, 9, 10, 12, 13, 14, 5, 6, 7, 9, 10, 11, 13,
14, 15, 8, 9, 10, 12, 13, 14, 16, 17, 18, 9, 10, 11, 13, 14, 15,
17, 18, 19, 20, 21, 22, 24, 25, 26, 28, 29, 30, 21, 22, 23, 25, 26,
27, 29, 30, 31, 24, 25, 26, 28, 29, 30, 32, 33, 34, 25, 26, 27, 29,
30, 31, 33, 34, 35, 28, 29, 30, 32, 33, 34, 36, 37, 38, 29, 30, 31,
33, 34, 35, 37, 38, 39], dtype=np.int32)
weight = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1,
0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,
0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1,
0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0], dtype=np.int32)
There is a pattern which becomes more apparent when reshaping x to (2,3,2,3,3). But I am having a hard time figuring out how the approach given by @doqtor can be used here, and especially whether it is easy enough to generalize.
The expected output is:
array([1, 1, 0, 0, 2, 2, 0, 0, 3, 3, 0, 0, 2, 2, 0, 0, 1, 1, 0, 0, 1, 1,
0, 0, 2, 2, 0, 0, 3, 3, 0, 0, 2, 2, 0, 0, 1, 1, 0, 0], dtype=int32)
The problem is that the OpenCL buffer into which the weights are accumulated is never initialized (zeroed). To fix that, create it from a zeroed host array:
res_np = np.zeros(5, dtype=np.int32)
# READ_WRITE because the kernel's += both reads and writes res_g
res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=res_np)
prg.bincount(queue, [10], None, res_g, x_g, weight_g)
# transfer back to cpu
cl.enqueue_copy(queue, res_np, res_g)
Returns the correct result: [ 1 5 9 13 17]
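For cross-checking the GPU output on the host, NumPy can compute the same accumulation directly. This is only a reference sketch using the test data from the question; it is not part of the OpenCL fix itself.

```python
import numpy as np

x = np.arange(5, dtype=np.int32).repeat(2)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
weight = np.arange(10, dtype=np.int32)      # [0, 1, ..., 9]
num_bins = 5

# np.bincount sums the weights per index; with weights it returns float64.
expected = np.bincount(x, weights=weight, minlength=num_bins).astype(np.int32)

# np.add.at performs unbuffered in-place addition, so repeated indices accumulate.
out = np.zeros(num_bins, dtype=np.int32)
np.add.at(out, x, weight)

print(expected)
print(out)
```

Both arrays come out as [1, 5, 9, 13, 17], so either can be compared against res_np.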
====== Update ==========
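As @Kevin noticed, there is a race condition here too: several work items can update the same bin at the same time. If the indices follow a known pattern, this can be avoided without synchronization by giving each work item all the elements that map to the same bin, for example by letting one work item process every 2 consecutive elements:
__kernel void bincount(__global int *res_g, __global const int* x_g, __global const int* weight_g)
{
    int gid = get_global_id(0);
    // each work item handles elements 2*gid and 2*gid + 1
    for (int x = gid*2; x < gid*2+2; ++x)
        res_g[x_g[x]] += weight_g[x];
}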
Then schedule 5 work items:
prg.bincount(queue, [5], None, res_g, x_g, weight_g)
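Whether such a split is safe for a given index array can be checked on the host before launching the kernel. Below is a small sketch; the function name chunking_is_race_free is mine, and it assumes work item g handles the contiguous slice x[g*chunk : (g+1)*chunk] as in the kernel above.

```python
import numpy as np

def chunking_is_race_free(x, chunk):
    """True if no bin index is written by more than one work item."""
    owner = {}
    for pos, b in enumerate(x):
        item = pos // chunk  # work item that processes this element
        if owner.setdefault(int(b), item) != item:
            return False
    return True

x = np.arange(5, dtype=np.int32).repeat(2)  # contrived example from the question
print(chunking_is_race_free(x, 2))          # True: both copies of a bin share a work item
print(chunking_is_race_free(x, 1))          # False: one work item per element collides

# First 12 sliding-window indices from the question's EDIT.
x_window = np.array([0, 1, 2, 4, 5, 6, 8, 9, 10, 1, 2, 3], dtype=np.int32)
print(chunking_is_race_free(x_window, 2))   # False: bin 1 is hit by work items 0 and 4
```

When no contiguous split passes this check, as with the sliding-window indices, the usual alternative is an atomic update, for instance OpenCL's built-in atomic_add on the __global int buffer, at some cost in throughput.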
Consider dput:
structure(list(REAÇÃO = structure(c(0, 1, 0, 0, 1, 0, 1, 1,
0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,
1, 0, 1, 1, 0, 1, 1), format.spss = "F11.0"), IDADE = structure(c(22,
38, 36, 58, 37, 31, 32, 54, 60, 34, 45, 27, 30, 20, 30, 30, 22,
26, 19, 18, 22, 23, 24, 50, 20, 47, 34, 31, 43, 35, 23, 34, 51,
63, 22, 29), format.spss = "F11.0"), ESCOLARIDADE = structure(c(6,
12, 12, 8, 12, 12, 10, 12, 8, 12, 12, 12, 8, 4, 8, 8, 12, 8,
9, 4, 12, 6, 12, 12, 12, 12, 12, 12, 12, 8, 8, 12, 16, 12, 12,
12), format.spss = "F11.0"), SEXO = structure(c(1, 1, 0, 0, 1,
0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,
0, 1, 0, 1, 0, 0, 0, 1, 1, 1), format.spss = "F11.0")), .Names = c("REAÇÃO",
"IDADE", "ESCOLARIDADE", "SEXO"), row.names = c(NA, -36L), class = "data.frame")
where: REAÇÃO is a dependent variable in the model.
Constant: -4.438.
How can I obtain this value using a simple function in R?
To obtain the constant term of a discriminant analysis in R (with the MASS package):
groupmean<-(model$prior%*%model$means)
constant<-(groupmean%*%model$scaling)
constant
where model is the lda discriminant expression:
model<-lda(y~x1+x2+xn,data=mydata)
model
Example data:
df <- structure(
list(
group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 7, 7, 7, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11),
val = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.000141111115226522, 0, 0, 0, 0.00127000000793487, 0.00070555554702878, 0.000141111115226522, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00127000000793487, 0.000282222230453044, 0, 0, 0.000141111115226522, 0.000282222230453044, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00070555554702878, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.000141111115226522, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
),
.Names = c("group", "val"),
row.names = c(NA, -1000L),
class = c("data.table", "data.frame")  # session-specific .internal.selfref pointer omitted; it is not parseable R
)
I can plot this as a timeseries per group, like so:
ggplot(df, aes(x=unlist(by(val, group, seq_along)), y=val, group=group)) +
geom_line(alpha=0.5)
but I want to plot a rolling mean of the data, like so:
library(zoo)
ggplot(df, aes(x=unlist(by(val, group, seq_along)),
y=rollmean(val, 48, fill=NA), group=group)) +
geom_line(alpha=0.5)
But this adds upticks to the end of each line, that do not exist in the data:
The upticks at 130 and 670 do not exist in the data, nor do they exist in the rolling mean, as you can see with rollmean(df[group==5, val], 48, fill=NA). So what is causing them?
The first uptick does occur at exactly 132. In your rolling mean you kept the default align, which is "center", meaning that each point is the mean of a window of k points centered on it, roughly k/2 before and k/2 after. Since you set k=48, point 132 is the mean of points (132-23):(132+24), so it is the first window that reaches point 156. You can verify that the first non-zero point is indeed 156.
# First non-zero value
min(which(df$val!=0))
# 156
You can also verify that the first non-zero value in the rolling mean is 132.
df$rollmean <- rollmean(df$val, 48, fill=NA)
min(which(df$rollmean!=0))
# 132
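The same arithmetic can be reproduced outside R. The NumPy sketch below is not zoo; it uses made-up data with a single non-zero value at point 156 and assumes the centered window covers offsets -(k/2 - 1) to +k/2 for even k, which is the convention that reproduces the 132 above.

```python
import numpy as np

k = 48
val = np.zeros(300)
val[155] = 1.0  # 0-based index 155 is the 156th point

# Mean of every full window of k consecutive points.
means = np.convolve(val, np.ones(k) / k, mode="valid")

# A window starting at 0-based s is centered (1-based) at s + k/2
# under the assumed offsets -(k/2 - 1) .. +k/2.
centers = np.arange(len(means)) + k // 2

first_nonzero_center = int(centers[np.nonzero(means)[0][0]])
print(first_nonzero_center)  # 132
```

So a centered mean starts rising 24 points before the first non-zero datum, which is why the line lifts off before anything shows up in the raw values.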
Additionally, it looks like you are applying your rolling mean across all groups, which you almost certainly don't want. Try splitting by group, like you did with by to create the time variable. Here is an example:
# Set a time variable beforehand
df$time <- with(df, unlist(by(val, group, seq_along)))
df$group <- as.factor(df$group)
k <- 48
# Remove those groups without enough values for a rolling mean with a window of k
df.subset <- df[df$group %in% names(which(table(df$group) >= k)),]
# Calculate a rolling mean on each group
df.subset$rollmean <- unlist(by(df.subset$val, df.subset$group, FUN=rollmean, k=k, fill=NA))
# Plot
ggplot(df.subset, aes(x=time,
                      y=rollmean,
                      colour=group)) + geom_line()