When DNP3 transmits a large amount of data: an example of a message divided into multiple frames and multiple segments

For example, suppose the message is divided into 15 frames and 2 segments. How do the sequence number of the transport layer and the sequence number of the application layer change?
I checked the manual, but couldn't find a corresponding description.

Related

Queries on the same big data dataset

Let's say I have a very big dataset (billions of records), one that doesn't fit on a single machine, and I want to run multiple unknown queries against it (it's a service where a user can choose a certain subset of the dataset and I need to return the max of that subset).
For the computation itself I was thinking about Spark or something similar, but the problem is that I'm going to have a lot of I/O and network activity, since Spark will have to keep re-reading the dataset from disk and distributing it to the workers, instead of, for instance, having Spark divide the data among the workers when the cluster goes up and then just asking each worker to do the work on certain records (by their number, for example).
So, to the big data people here, what do you usually do? Just have Spark redo the read and distribution for every request?
If I want to do what I said above, do I have no choice but to write something of my own?
If the queries are known but the subsets unknown, you could precalculate the max (or whatever the operator) for many smaller windows / slices of the data. This gives you a small and easily queried index of sorts, which might allow you to calculate the max for an arbitrary subset. In case a subset does not start and end neatly where your slices do, you just need to process the ‘outermost’ partial slices to get the result.
If the queries are unknown, you might want to consider storing the data in a MPP database or use OLAP cubes (Kylin, Druid?) depending on the specifics; or you could store the data in a columnar format such as Parquet for efficient querying.
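To make the first suggestion concrete, here is a minimal R sketch of the precalculated per-slice maxima; the data, the slice size, and the query range are all made up for illustration:

# Minimal sketch of the precalculated-max idea; data and slice size are made up.
set.seed(42)
x          <- runif(1e6)                      # stand-in for the big dataset
slice_size <- 1e4
slice_max  <- tapply(x, (seq_along(x) - 1) %/% slice_size, max)  # one max per slice

# Max over an arbitrary subset [from, to]: full slices come from the index,
# only the two partial 'outermost' slices are scanned directly.
range_max <- function(from, to) {
  first_full <- ceiling((from - 1) / slice_size) + 1   # first fully covered slice
  last_full  <- floor(to / slice_size)                 # last fully covered slice
  if (first_full > last_full) return(max(x[from:to]))  # no full slice in the range
  left  <- if (from <= (first_full - 1) * slice_size)
             max(x[from:((first_full - 1) * slice_size)]) else -Inf
  right <- if (to >= last_full * slice_size + 1)
             max(x[(last_full * slice_size + 1):to]) else -Inf
  max(slice_max[first_full:last_full], left, right)
}

range_max(12345, 987654)   # same as max(x[12345:987654]), but mostly from the index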
Here's a precalculation solution, based on the problem description in the OP's comment to my other answer:
A million entries, each with 3k name->number pairs. Given a subset of the million entries and a subset of the names, you want the average for each name over all the entries in the subset. Each possible subset (of each possible size) of a million entries is far too much to calculate and keep.
Precalculation
First, we split the data into smaller 'windows' (shards, pages, partitions).
Let's say each window contains around 10k rows with roughly 20k distinct names and 3k (name,value) pairs in each row (choosing the window size can affect performance, and you might be better off with smaller windows).
Assuming ~24 bytes per name and 2 bytes for the value, each window contains 10k*3k*(24+2 bytes) = 780 MB of data plus some overhead that we can ignore.
For each window, we precalculate the number of occurrences of each name, as well as the sum of the values for that name. With those two values we can calculate the average for a name over any set of windows as:
Average for name N = (sum of sums for N)/(sum of counts for N)
Here's a small example with much less data:
Window 1
{'aaa':20,'abcd':25,'bb':10,'caca':25,'ddddd':50,'bada':30}
{'aaa':12,'abcd':31,'bb':15,'caca':24,'ddddd':48,'bada':43}
Window 2
{'abcd':34,'bb':8,'caca':22,'ddddd':67,'bada':9,'rara':36}
{'aaa':21,'bb':11,'caca':25,'ddddd':56,'bada':17,'rara':22}
Window 3
{'caca':20,'ddddd':66,'bada':23,'rara':29,'tutu':4}
{'aaa':10,'abcd':30,'bb':8,'caca':42,'ddddd':38,'bada':19,'tutu':6}
The precalculated Window 1 'index' with sums and counts:
{'aaa':[32,2],'abcd':[56,2],'bb':[25,2],'caca':[49,2],'ddddd':[98,2],'bada':[73,2]}
This 'index' will contain around 20k distinct names and two values for each name, or 20k*(24+2+2 bytes) = 560 KB of data. That's one thousand times less than the data itself.
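As a concrete illustration of the precalculation step, here is a small R sketch that builds the (sum, count) index for Window 1 above; representing each row as a named numeric vector is an assumption, so adapt it to however the rows are actually stored:

# Each row is represented as a named numeric vector of name -> value pairs
# (this representation is an assumption about the storage format).
window1 <- list(
  c(aaa = 20, abcd = 25, bb = 10, caca = 25, ddddd = 50, bada = 30),
  c(aaa = 12, abcd = 31, bb = 15, caca = 24, ddddd = 48, bada = 43)
)

build_index <- function(window) {
  all_pairs <- unlist(window)                                  # flatten (name, value) pairs
  list(sums   = tapply(all_pairs, names(all_pairs), sum),      # sum of values per name
       counts = tapply(all_pairs, names(all_pairs), length))   # occurrences per name
}

idx1 <- build_index(window1)
idx1$sums["aaa"]     # 32
idx1$counts["aaa"]   # 2, matching the 'aaa':[32,2] entry above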
Querying
Now let's put this in action: given an input spanning 1 million rows, you'll need to load (1M/10k)=100 indices or 56 MB, which fits easily in memory on a single machine (heck, it would fit in memory on your smartphone).
But since you are aggregating the results, you can do even better: you don't need to load all of the indices at once; you can load them one at a time, filter and sum the values, and discard each index before loading the next. That way you could do it with just a few megabytes of memory.
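A minimal sketch of that streaming aggregation, assuming each window's index has been saved to disk (for example with saveRDS) in the layout produced by build_index above; the file names are placeholders:

# Streaming aggregation over per-window indices; file names are placeholders and
# the index layout (a list with $sums and $counts) follows the sketch above.
average_for <- function(index_files, names_wanted) {
  total_sums   <- setNames(numeric(length(names_wanted)), names_wanted)
  total_counts <- setNames(numeric(length(names_wanted)), names_wanted)
  for (f in index_files) {
    idx <- readRDS(f)                               # load one ~560 KB index
    hit <- intersect(names_wanted, names(idx$sums))
    total_sums[hit]   <- total_sums[hit]   + idx$sums[hit]
    total_counts[hit] <- total_counts[hit] + idx$counts[hit]
    rm(idx)                                         # discard before the next window
  }
  total_sums / total_counts                         # average per requested name
}

# e.g. average_for(sprintf("index_%03d.rds", 1:100), c("aaa", "caca"))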
More importantly, the calculation should take no more than a few seconds for any set of windows and names. If the names are sorted alphabetically (another worthwhile pre-optimization) you get the best performance, but even with unsorted lists it should run more than fast enough.
Corner cases
The only thing left to do is handle the case where the input span doesn't line up exactly with the precalculated windows. This requires a little bit of logic for the two 'ends' of the input span, but it can be easily built into your code.
Say each window contains exactly one week of data, from Monday through Sunday, but your input specifies a period starting on a Wednesday. In that case you would have to load the actual raw data from Wednesday through Sunday of the first week (a few hundred megabytes as we noted above) to calculate the (count,sum) tuples for each name first, and then use the indices for the rest of the input span.
This does add some processing time to the calculation, but with an upper bound of 2*780MB it still fits very comfortably on a single machine.
At least that's how I would do it.

Generate time-series synthetic data in R

I have power consumption data for a few electrical appliances (such as an AC, a refrigerator, and a microwave), as shown in the plots below. Now I want to generate synthetic data for these appliances. In the synthetic data generation process:
How can I generate data corresponding to the first figure, where states have different durations (widths) and varying magnitudes (heights)?
How can I restrict the appliance usage to a specific portion of the day? For example, the first figure corresponds to the AC, which runs only from 11 PM until 8 AM of the next day. Can I enforce such time restrictions during data generation?
How can I ensure some random usage? For example, in the second figure there are several sudden usages of 500 watts for a few minutes.
Some appliances consume power differently at different times; for example, the third figure shows microwave usage, which consumes anything between 100 and 800 watts.
What functions/approaches should I use to generate this type of time-series data?
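One possible starting point, sketched in R: the start probability, durations, and wattages are all made-up assumptions, and the allowed-hours mask mimics the 11 PM to 8 AM restriction described for the AC.

# Rough sketch of simulating one appliance: on/off states with random durations
# and magnitudes, restricted to an allowed time window. All numbers are made up.
set.seed(1)
minutes <- seq(as.POSIXct("2015-01-01 00:00"), by = "min", length.out = 2 * 24 * 60)
power   <- numeric(length(minutes))
hour    <- as.integer(format(minutes, "%H"))
allowed <- hour >= 23 | hour < 8                 # appliance may only run 11 PM - 8 AM

i <- 1
while (i <= length(minutes)) {
  if (allowed[i] && runif(1) < 0.05) {           # randomly start an "on" state
    dur <- sample(20:90, 1)                      # random duration in minutes
    mag <- round(runif(1, 1200, 2000))           # random magnitude in watts
    idx <- i:min(i + dur - 1, length(minutes))
    power[idx] <- ifelse(allowed[idx], mag, 0)   # clip the state at the window edge
    i <- i + dur
  } else {
    i <- i + 1
  }
}
plot(minutes, power, type = "s", xlab = "", ylab = "Watts")

Short random spikes (like the 500-watt usages in the second figure) can be produced the same way with small durations, and variable consumption per state (the microwave case) just means drawing the magnitude from a wider or discrete distribution.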

Creating a weighted adjacency matrix with iterations

I have data on the lists of directors of different companies. Directors from one company meet on the same board of directors. Moreover, I also have data on how many times these directors met on the same board of directors. I have to create an adjacency matrix for these directors. The entries represent how many times two directors were on the same board of directors (i.e. if A and B are from company 1, and there were 11 meetings in this company, then there must be 11 at the intersection of A and B; and if A and B are from different boards of directors (from different companies), then there must be 0 at the intersection).
I have created this matrix in Excel successfully with the formula
=IF(VLOOKUP($E2;$A$1:$C$27;2;0)=(VLOOKUP(F$1;$A$1:$C$27;2;0));$C2;0)
However, the main problem is that two or more directors may meet on more than one board of directors (in more than one company). In this case the numbers of meetings must be added together. For example, if A and B meet together in company 1 for 11 times and in company 3 for 4 times, then there must be 15 at the intersection, and unfortunately I can't figure out how to achieve this. I've searched for similar problems and didn't find any cases where the original data contained repeats. I have no idea whether it is possible to do this in Excel, or whether I should use other software (R or something else).
See if this array formula works for you:
=SUM(ISNUMBER(MATCH(IF($A$2:$A$27=F$1,$B$2:$B$27,"+"),IF($A$2:$A$27=$E2,$B$2:$B$27,"-"),0))*$C$2:$C$27)
It must be entered with Ctrl+Shift+Enter.
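If you do decide to move to R, one way to get the summed co-membership counts is through an incidence matrix; a minimal sketch, assuming a data frame with one row per (director, company) membership plus that company's meeting count (the tiny example reproduces the A-and-B-meet-15-times case):

# Sketch in R: weighted adjacency from co-membership. The data frame layout
# (one row per director/company pair, plus the company's meeting count) is assumed.
df <- data.frame(
  director = c("A", "B", "A", "B", "C"),
  company  = c("c1", "c1", "c3", "c3", "c2"),
  meetings = c(11, 11, 4, 4, 7)
)

inc <- xtabs(~ director + company, df)            # 0/1 director-by-company matrix
w   <- tapply(df$meetings, df$company, max)       # meetings per company
adj <- inc %*% diag(w[colnames(inc)],
                    nrow = ncol(inc)) %*% t(inc)  # sum of shared meetings per pair
diag(adj) <- 0                                    # zero out self-pairs
adj["A", "B"]                                     # 11 + 4 = 15

Pairs who never share a company contribute nothing to the product, so their entry stays 0, and repeated co-membership is summed automatically by the matrix multiplication.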

Reading multiple data frames from a single file with R

My problem is that I'm trying to read in data which has been formatted by an archaic piece of Fortran code (and thus is character limited on each line). The data consists of a number of chunks, each with a fixed width format, and the basic structure of each chunk is:
header line (one line, 11 columns)
data (80 lines, 11 columns)
header line (identical to above)
blank (3 lines)
The first column is identical for each chunk, so once read in, I can join the dfs into a single df. However, how do I read all of the chunks of the data in? Am I limited to writing a loop with a skip value that goes up in increments of 85, or is there a neater way to do things?
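One compact way to write the skip loop is sketched below; the file name, column widths, and number of chunks are placeholders, and each 85-line chunk is assumed to follow the layout described above (1 header + 80 data lines + 1 repeated header + 3 blank lines):

# Sketch: read every 85-line chunk, keeping only the 80 data lines of each.
# File name, column widths and chunk count are placeholders for the real layout.
widths   <- rep(12, 11)        # 11 fixed-width columns, assumed 12 characters each
n_chunks <- 10                 # however many chunks the file actually contains

chunks <- lapply(seq_len(n_chunks) - 1, function(i)
  read.fwf("fortran_output.txt",
           widths = widths,
           skip   = i * 85 + 1,   # skip earlier chunks plus this chunk's header
           n      = 80))          # read only the 80 data lines

# `chunks` is now a list of 80-row data frames sharing the same first column,
# ready to be joined into a single data frame as described above.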

A Neverending cforest

How can I decouple the time cforest/ctree takes to construct a tree from the number of columns in the data?
I thought the option mtry could be used to do just that, i.e. the help says:
number of input variables randomly sampled as candidates at each node for random forest like algorithms.
But while that does randomize the output trees, it doesn't decouple the CPU time from the number of columns, e.g.
library(party)   # ctree() and ctree_control() come from the party package

p <- proc.time()
ctree(gs.Fit ~ .,
      data = Aspekte.Fit[, 1:60],
      controls = ctree_control(mincriterion = 0,
                               maxdepth = 2,
                               mtry = 1))
proc.time() - p   # elapsed time for the 60-column fit
takes twice as long as the same call with Aspekte.Fit[,1:30] (by the way, all variables are boolean). Why? Where does it scale with the number of columns?
As I see it, the algorithm should:
At each node, randomly select two columns.
Use them to split the response (no scaling because of mincriterion=0).
Proceed to the next node (for a total of 3 due to maxdepth=2),
without being influenced by the total number of columns.
Thanks for pointing out the error of my ways.
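For reference, the timing comparison above can be reproduced on simulated boolean data along these lines; Aspekte.Fit itself is not available, so everything about the data here is made up:

# Sketch: time the same ctree() call with 30 vs 60 predictor columns on fake data.
library(party)
set.seed(1)
n  <- 2000
Xl <- matrix(runif(n * 60) > 0.5, n, 60)                  # 60 boolean predictors
d  <- data.frame(gs.Fit = factor(Xl[, 1] & Xl[, 2]),      # arbitrary response
                 lapply(as.data.frame(Xl), factor))

time_fit <- function(k) system.time(
  ctree(gs.Fit ~ ., data = d[, 1:(k + 1)],
        controls = ctree_control(mincriterion = 0, maxdepth = 2, mtry = 1))
)[["elapsed"]]

time_fit(30)   # elapsed seconds with 30 predictor columns
time_fit(60)   # elapsed seconds with 60 predictor columns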
