How to check for substantial change in two depth frames? - intel

My goal is to detect any change in frames(single channel. each pixel is depth value). On app init, I take average of all corresponding pixels of first 30 frames so one average background frame will be created. On new frames arrival I subtract each frame from saved background frame(mean frame of first 30 frames). Currently algorithm is: I take mean of first 30 frames say it bg_mean(scalar, not frame, say 2345). Then calculate mean of new frame and compare it with bg_mean with some threshold added to bg_mean (to avoid noise consideration) . But this method method does not give good results if distance is far. Are there any other methods ?

I suspect your threshold value is the issue. I'm not familiar with depth-cameras, but I'd assume the noise values might be impacted by the distance. To validate this, try changing/removing it to see if it improves the results.
Traditional image processing techniques will also update the history as time goes so large changes to the scene are not counted as changes in every subsequent frames. Your idea is essentially how it is done, but approaches vary in how they account for noise (in standard images you also have to account for shadows, which you can avoid with a depth camera)
I'd suggest looking into built-in algorithms with more sophisticated noise removal (https://docs.opencv.org/3.4/d/d38/tutorial_bgsegm_bg_subtraction.html). Although for images, I suspect this would work similarly for depths cameras.

Related

Does column order matter in RNN?

My question is somewhat similar to this one. But I want to ask whether the column order matters or not. I have some time series data. For each cycle I computed some features (let's call them var1, var2,.... I now train the model using the following column order which of course will be consistent for the test set.
X_train=data['var1','var2','var3',var4']
After watching this video I've concluded that the order in which the columns appear is significant i.e. swapping var 1 and var 3 as:
X_train=data['var3','var2','var1',var4']
I would get a different loss function.
If the above is true, then how does one figure out the correct feature order to minimize the loss function, especially when the number of features could be in dozens.

Increasing the "bit depth" of a one hot encoded variable in a time series in R

I am working with frame by frame video data in tabular format. When an event happens, it is only coded for that single frame. I would like to order events in pseudotime and there is no distinct periodicity to them. Regular one-hot encoding does not encode the 'eventness' of timestamps within close proximity to the frame labeled "hey stuff happens here". I imagine that this could be modeled as a sinusoidal function, convolved with the the initial array that represents a column. This way, even when frames are not ordered, I can still see the 'eventness' of each one.
I was thinking something along these lines:
x <- c(...0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0...)
y <- c(...0,1,2,3,4,5,6,7,8,9,10,9,8,7,6,5,4,3,2,1,0...)
(convolve(x, y))
I am struggling with designing a for loop that might do this for each column of dummy variables.
Also, I am using "bit depth" as an analogy, since this is like increasing the bit depth of an audio sample. Basically, I want to trade in my NES for a Sega Genesis.
Thanks!
To the bevy of people on the edge of their seats, I ended up using a moving average to solve this problem.

Queries on the same big data dataset

Lets say I have a very big dataset (billions of records), one that doesnt fit on a single machine and I want to have multiple unknown queries (its a service where a user can choose a certain subset of the dataset and I need to return the max of that subset).
For the computation itself I was thinking about Spark or something similar, problem is Im going to have a lot of IO/network activity since Spark is going to have to keep re-reading the data set from the disk and distributing it to the workers, instead of, for instance, having Spark divide the data among the workers when the cluster goes up and then just ask from each worker to do the work on certain records (by their number, for example).
So, to the big data people here, what do you usually do? Just have Spark redo the read and distribution for every request?
If I want to do what I said above I have no choice but to write something of my own?
If the queries are known but the subsets unknown, you could precalculate the max (or whatever the operator) for many smaller windows / slices of the data. This gives you a small and easily queried index of sorts, which might allow you to calculate the max for an arbitrary subset. In case a subset does not start and end neatly where your slices do, you just need to process the ‘outermost’ partial slices to get the result.
If the queries are unknown, you might want to consider storing the data in a MPP database or use OLAP cubes (Kylin, Druid?) depending on the specifics; or you could store the data in a columnar format such as Parquet for efficient querying.
Here's a precalculating solution based on the problem description in the OP's comment to my other answer:
Million entries, each has 3k name->number pairs. Given a subset of the million entries and a subset of the names, you want the average for each name for all the entries in the subset. So each possible subset (of each possible size) of a million entries is too much to calculate and keep.
Precalculation
First, we split the data into smaller 'windows' (shards, pages, partitions).
Let's say each window contains around 10k rows with roughly 20k distinct names and 3k (name,value) pairs in each row (choosing the window size can affect performance, and you might be better off with smaller windows).
Assuming ~24 bytes per name and 2 bytes for the value, each window contains 10k*3k*(24+2 bytes) = 780 MB of data plus some overhead that we can ignore.
For each window, we precalculate the number of occurrences of each name, as well as the sum of the values for that name. With those two values we can calculate the average for a name over any set of windows as:
Average for name N = (sum of sums for N)/(sum of counts for N)
Here's a small example with much less data:
Window 1
{'aaa':20,'abcd':25,'bb':10,'caca':25,'ddddd':50,'bada':30}
{'aaa':12,'abcd':31,'bb':15,'caca':24,'ddddd':48,'bada':43}
Window 2
{'abcd':34,'bb':8,'caca':22,'ddddd':67,'bada':9,'rara':36}
{'aaa':21,'bb':11,'caca':25,'ddddd':56,'bada':17,'rara':22}
Window 3
{'caca':20,'ddddd':66,'bada':23,'rara':29,'tutu':4}
{'aaa':10,'abcd':30,'bb':8,'caca':42,'ddddd':38,'bada':19,'tutu':6}
The precalculated Window 1 'index' with sums and counts:
{'aaa':[32,2],'abcd':[56,2],'bb':[25,2],'caca':[49,2],'ddddd':[98,2],'bada':[73,2]}
This 'index' will contain around 20k distinct names and two values for each name, or 20k*(24+2+2 bytes) = 560 KB of data. That's one thousand times less than the data itself.
Querying
Now let's put this in action: given an input spanning 1 million rows, you'll need to load (1M/10k)=100 indices or 56 MB, which fits easily in memory on a single machine (heck, it would fit in memory on your smartphone).
But since you are aggregating the results, you can do even better; you don't even need to load all of the indices at once, you can load them one at a time, filter and sum the values, and discard the index before loading the next. That way you could do it with just a few megabytes of memory.
More importantly, the calculation should take no more than a few seconds for any set of windows and names. If the names are sorted alphabetically (another worthwhile pre-optimization) you get the best performance, but even with unsorted lists it should run more than fast enough.
Corner cases
The only thing left to do is handle the case where the input span doesn't line up exactly with the precalculated windows. This requires a little bit of logic for the two 'ends' of the input span, but it can be easily built into your code.
Say each window contains exactly one week of data, from Monday through Sunday, but your input specifies a period starting on a Wednesday. In that case you would have to load the actual raw data from Wednesday through Sunday of the first week (a few hundred megabytes as we noted above) to calculate the (count,sum) tuples for each name first, and then use the indices for the rest of the input span.
This does add some processing time to the calculation, but with an upper bound of 2*780MB it still fits very comfortably on a single machine.
At least that's how I would do it.

RRDTOOL one second logging, values missing

I spent more than two months with RRDTOOL to find out how to store and visualize data on graph. I'm very close now to my goal, but for some reason I don't understand why it is happening that some data are considered to be NaN in my case.
I counting lines in gigabytes sized of log files and have feeding the result to an rrd database to visualize events occurrence. The stepping of the database is 60 seconds, the data is inserted in seconds base whenever it is available, so no guarantee the the next timestamp will be withing the heartbeat or within the stepping. Sometimes no data for minutes.
If have such big distance mostly my data is considered to be NaN.
b1_5D.rrd
1420068436:1
1420069461:1
1420073558:1
1420074583:1
1420076632:1
1420077656:1
1420079707:1
1420080732:1
1420082782:1
1420083807:1
1420086881:1
1420087907:1
1420089959:1
1420090983:1
1420094055:1
1420095080:1
1420097132:1
1420098158:1
1420103284:1
1420104308:1
1420107380:1
1420108403:1
1420117622:1
1420118646:1
1420121717:1
1420122743:1
1420124792:1
1420125815:1
1420131960:1
1420134007:1
1420147326:1
1420148352:1
rrdtool create b1_4A.rrd --start 1420066799 --step 60 DS:Value:GAUGE:120:0:U RRA:AVERAGE:0.5:1:1440 RRA:AVERAGE:0.5:10:1008 RRA:AVERAGE:0.5:30:1440 RRA:AVERAGE:0.5:360:1460
The above gives me an empty graph for the input above.
If I extend the heart beat, than it will fill the time gaps with the same data. I've tried to insert zero values, but that will average out the counts and bring results in mils.
Maybe I taking something wrong regarding RRDTool.
It would be great if someone could explain what I doing wrong.
Thank you.
It sounds as if your data - which is event-based at irregular timings - is not suitable for an RRD structure. RRD prefers to have its data at constant, regular intervals, and will coerce the incoming data to match its requirements.
Your RRD is defined to have a 60s step, and a 120s heartbeat. This means that it expects one sample every 60s, and no further apart than 120s.
Your DS is a gauge, and so the values you enter (all of them '1' in your example) will be the values stored, after any time normalisation.
If you increase the heartbeat, then a value received within this time will be used to make a linear approximation to fill in all samples since the last one. This is why doing so fills the gaps with the same data.
Since your step is 60s, the smallest sample time sidth will be 1 minute.
Since you are always storing '1's, your graph will therefore either show '1' (when the sample was received in the heartbeart window) or Unknown (when the heartbeat expired).
In other words, your graph is showing exactly what you gave it. You data are being coerced into a regular set of numerical values at a 1-minute step, each being 1 or Unknown.

Movement data analysis in R; Flights and temporal subsampling

I want to analyse angles in movement of animals. I have tracking data that has 10 recordings per second. The data per recording consists of the position (x,y) of the animal, the angle and distance relative to the previous recording and furthermore includes speed and acceleration.
I want to analyse the speed an animal has while making a particular angle, however since the temporal resolution of my data is so high, each turn consists of a number of minute angles.
I figured there are two possible ways to work around this problem for both of which I do not know how to achieve such a thing in R and help would be greatly appreciated.
The first: Reducing my temporal resolution by a certain factor. However, this brings the disadvantage of losing possibly important parts of the data. Despite this, how would I be able to automatically subsample for example every 3rd or 10th recording of my data set?
The second: By converting straight movement into so called 'flights'; rule based aggregation of steps in approximately the same direction, separated by acute turns (see the figure). A flight between two points ends when the perpendicular distance from the main direction of that flight is larger than x, a value that can be arbitrarily set. Does anyone have any idea how to do that with the xy coordinate positional data that I have?
It sounds like there are three potential things you might want help with: the algorithm, the math, or R syntax.
The algorithm you need may depend on the specifics of your data. For example, how much data do you have? What format is it in? Is it in 2D or 3D? One possibility is to iterate through your data set. With each new point, you need to check all the previous points to see if they fall within your desired column. If the data set is large, however, this might be really slow. Worst case scenario, all the data points are in a single flight segment, meaning you would check the first point the same number of times as you have data points, the second point one less, etc. The means n + (n-1) + (n-2) + ... + 1 = n(n-1)/2 operations. That's O(n^2); the operating time could have quadratic growth with respect to the size of your data set. Hence, you may need something more sophisticated.
The math to check whether a point is within your desired column of x is pretty straightforward, although maybe more sophisticated math could help inform a better algorithm. One approach would be to use vector arithmetic. To take an example, suppose you have points A, B, and C. Your goal is to see if B falls in a column of width x around the vector from A to C. To do this, find the vector v orthogonal to C, then look at whether the magnitude of the scalar projection of the vector from A to B onto v is less than x. There is lots of literature available for help with this sort of thing, here is one example.
I think this is where I might start (with a boolean function for an individual point), since it seems like an R function to determine this would be convenient. Then another function that takes a set of points and calculates the vector v and calls the first function for each point in the set. Then run some data and see how long it takes.
I'm afraid I won't be of much help with R syntax, although it is on my list of things I'd like to learn. I checked out the manual for R last night and it had plenty of useful examples. I believe this is very doable, even for an R novice like myself. It might be kind of slow if you have a big data set. However, with something that works, it might also be easier to acquire help from people with more knowledge and experience to optimize it.
Two quick clarifying points in case they are helpful:
The above suggestion is just to start with the data for a single animal, so when I talk about growth of data I'm talking about the average data sample size for a single animal. If that is slow, you'll probably need to fix that first. Then you'll need to potentially analyze/optimize an algorithm for processing multiple animals afterwards.
I'm implicitly assuming that the definition of flight segment is the largest subset of contiguous data points where no "sub" flight segment violates the column rule. That is to say, I think I could come up with an example where a set of points satisfies your rule of falling within a column of width x around the vector to the last point, but if you looked at the column of width x around the vector to the second to last point, one point wouldn't meet the criteria anymore. Depending on how you define the flight segment then (e.g. if you want it to be the largest possible set of points that meet your condition and don't care about what happens inside), you may need something different (e.g. work backwards instead of forwards).

Resources