A peer-to-peer and privacy-aware data mining/aggregation algorithm: is it possible?

Suppose I have a network of N nodes, each with a unique identity (e.g. public key) communicating with a central-server-less protocol (e.g. DHT, Kad). Each node stores a variable V. With reference to e-voting as an easy example, that variable could be the name of a candidate.
Now I want to execute an "aggregation" function on all V variables available in the network. With reference to e-voting example, I want to count votes.
My question is completely theoretical (I have to prove a statement; details at the end of the question), so please don't focus on e-voting and all of its security aspects. Don't answer that "a node may have any number of identities by generating more keys", "IPs can be traced back", etc., because that's another matter.
Let's see the distributed aggregation only from the privacy point of view.
THE question
Is it possible, in the general case, for a node to compute a function of variables stored at other nodes without learning which value belongs to which node's identity? Have researchers designed such a privacy-aware distributed algorithm?
I'm only dealing with privacy aspects, not general security!
Current thoughts
My current answer is no: I claim that a central server, which obtains all the Vs and processes them without storing them, is necessary, and that ensuring no individual node's data is stored or retransmitted by that server is more a legal matter than a technical one. I'm asking you to prove that this statement is false :)
In the e-voting example, I think it's impossible to count how many people voted for Alice and how many for Bob without asking all the nodes, one by one, "Hey, who did you vote for?"
Real case
I'm doing research in the Personal Data Store (PDS) field. Suppose you store your call log in the PDS and somebody wants to compute statistics about the phone calls (e.g. mean duration, number of calls per day, variance, standard deviation) without either aggregated or individual data about any single person being revealed (that is, nobody must learn whom I call, nor even my own mean call duration).
If a trusted broker exists, and everybody trusts it, that node can expose a double getMeanCallDuration() API that first invokes CallRecord[] getCalls() on every PDS in the network and then computes the statistics over all the rows. Without the central trusted broker, having each PDS expose double getMyMeanCallDuration() is not statistically sound (the mean of the means is not, in general, the mean of all calls) and, most importantly, it links the aggregate to the identity of a single user.
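For concreteness, the broker does not actually need raw call records: exact global statistics can be combined from per-node sufficient statistics (count, sum and sum of squares), which is why per-node means alone are not enough. A minimal sketch, assuming a hypothetical get_call_stats() on each PDS returning such a triple (note this still exposes per-node aggregates, which is exactly what the question wants to avoid):

```python
# Sketch: combining per-node sufficient statistics into exact global statistics.
# Assumes a hypothetical get_call_stats() on each PDS returning
# (count, sum_of_durations, sum_of_squared_durations); names are illustrative.
from math import sqrt

def combine_call_stats(per_node_stats):
    """per_node_stats: iterable of (count, total, total_sq) tuples."""
    n = total = total_sq = 0.0
    for count, s, s2 in per_node_stats:
        n += count
        total += s
        total_sq += s2
    mean = total / n
    variance = total_sq / n - mean ** 2          # population variance
    return mean, variance, sqrt(variance)

# Example: three PDS nodes report their local sufficient statistics.
stats = [(10, 1200.0, 150000.0), (4, 300.0, 24000.0), (7, 910.0, 121000.0)]
mean, var, stddev = combine_call_stats(stats)
```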

Yes, it is possible. There is work that solves exactly this problem, given some assumptions. Check the following paper: Privacy, efficiency & fault tolerance in aggregate computations on massive star networks.
You can have some computation (for example, summing) over a group of nodes performed at another node without the participating nodes revealing any data to each other, or even to the node doing the computation. After the computation, everyone learns the result (but no one learns any individual value besides their own, which they already knew anyway). The paper describes the protocol and proves its security (and the protocol gives you exactly the privacy level just described).
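The general flavour of such protocols is additive masking / secret sharing: each node splits its value into random-looking shares that only sum to the value when all of them are combined. The following is a toy illustration of that general idea, not the paper's actual protocol:

```python
# Toy illustration (not the paper's protocol): additive secret sharing of a sum.
# Each node splits its value into random shares that sum to the value (mod p);
# the aggregators only ever see sums of shares, never an individual value.
import secrets

P = 2**61 - 1  # a public prime modulus, chosen arbitrarily for this sketch

def share(value, n_shares):
    shares = [secrets.randbelow(P) for _ in range(n_shares - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def aggregate(all_shares):
    # all_shares[i][j] = j-th share produced by node i; node i sends its
    # j-th share to aggregator j, and each aggregator sums what it receives.
    per_aggregator = [sum(col) % P for col in zip(*all_shares)]
    return sum(per_aggregator) % P   # combining the partial sums yields the total

votes = [1, 0, 1, 1]                       # e.g. 1 = a vote for Alice
shared = [share(v, 3) for v in votes]      # each node produces 3 shares
assert aggregate(shared) == sum(votes)
```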
As for protecting the identity of the nodes, so that a value cannot be linked back to the node that contributed it, that is another problem. You could use anonymous credentials (check this: https://idemix.wordpress.com/2009/08/18/quick-intro-to-credentials/) or something similar to prove that you are an authorized participant without revealing your identity (in a distributed scenario).
The catch of this protocol is that you need a semi-trusted node to do the computation. A fully distributed protocol (for example, in a P2P network scenario) is not that easy, though. Not because of a lack of storage (you can have a DHT, for example), but because you need to replace that trusted or semi-trusted node with the network itself, and that is where the issues appear: who does the computation? Why that node and not another one? What if nodes collude? And so on.

How about when each node publishes two sets of data x and y, such that
x - y = v
Assuming that I can publish x and y independently (under unlinkable identities), you can still compute the overall sum and mean correctly, while every single message on its own is largely worthless.
So for the voting example and candidates X, Y, Z, I might have one identity publishing the vote
+2 -1 +3
and my second identity publishes the vote:
-2 +2 -3
But of course you cannot verify that I didn't vote multiple times anymore.
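A toy sketch of this splitting idea (the noise range and vote encoding are arbitrary): each voter publishes two messages whose element-wise sum is the real vote vector, and the tally is simply the element-wise sum of all published messages.

```python
# Sketch of the split-vote idea above: each voter publishes two messages whose
# (element-wise) sum is their real vote vector; individually each message looks random.
import random

def split_vote(vote):
    noise = [random.randint(-5, 5) for _ in vote]    # arbitrary noise range
    first = [v + r for v, r in zip(vote, noise)]     # published by identity 1
    second = [-r for r in noise]                     # published by identity 2
    return first, second

def tally(messages):
    return [sum(col) for col in zip(*messages)]

# Candidates X, Y, Z; this voter votes for Y, the next one for X.
msgs = list(split_vote([0, 1, 0]))
msgs += list(split_vote([1, 0, 0]))
print(tally(msgs))                                   # -> [1, 1, 0]
```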

Related

What is lockstep in Peer-to-Peer gaming?

I am researching peer-to-peer network architecture for games.
What I have read from multiple sources is that the peer-to-peer model makes it easy for people to cheat by sending incorrect data about their game character, whether it is a wrong position or the amount of health points they have.
Now I have read that one of the things that make peer-to-peer more secure is to put an anti-cheat system into your game, which checks things like how fast someone has moved from spot A to spot B, or whether someone's health points changed drastically without a reason.
I have also read about Lockstep, which is described as a "handshake" between all the clients in Peer-to-Peer network, where clients promise not to do certain things, for instance "move faster than X or not to be able to jump higher than Y" and then their actions are compared to the rules set in the "handshake".
To me this seems like an anti-cheat system.
What I am asking in the end is: what is lockstep in the peer-to-peer model? Is it an anti-cheat system or something else, and where should it be placed in a peer-to-peer architecture: on every player's computer, or could it work if it is not on all of them? Should it control the whole game, or only a subset of it?
Lockstep was designed primarily to save on bandwidth (in the days before broadband).
Question: How can you simulate (tens of) thousands of units, distributed across multiple systems, when you have only a vanishingly small amount of bandwidth (14400-28800 baud)?
What you can't do: Send tens of thousands of positions or deltas, every tick, across the network.
What you can do: Send only the inputs that each player makes, for example, "Player A orders this (limited size) group ID=3 of selected units to go to x=12,y=207".
However, the onus of responsibility now falls on each client application (or rather, on developers of P2P client code) to transform those inputs into exactly the same gamestate per every logic tick. Otherwise you get synchronisation errors and simulation failure, since no peer is authoritative. These sync errors can result from a great deal more than just cheaters, i.e. they can arise in many legitimate, non-cheating scenarios (and indeed, when I was a young man in the '90s playing lockstepped games, this was a frequent frustration even over LAN, which should be reliable).
So now you are using only a tiny fraction of the bandwidth. But the meticulous coding required to be certain that clients do not produce desync conditions makes this a lot harder to code than an authoritative server, where non-sane inputs or gamestate can be discarded by the server.
Cheating: It is easy to see things you shouldn't be able to see, since every client has all the simulation data available. It is very hard, however, to modify the gamestate without immediately causing a desync that halts the game.
I've accidentally stumbled across this question in google search results, and thought I might as well answer years later. For future generations, you know :)
Lockstep is not an anti-cheat system, it is one of the common p2p network models used to implement online multiplayer in games (most notably in strategy games). The base concept is fairly straightforward:
The game simulation is split into fairly short time frames.
After each frame players collect input commands from that frame and send those commands over the network
Once all the players receive the commands from all the other players, they apply them to their local game simulation during the next time frame.
If the simulation is deterministic (as it must be for lockstep to work), then after applying the commands all players will have the same world state. Getting the determinism right is arguably the hardest part, especially for cross-platform games.
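A minimal sketch of that frame loop (the world and network interfaces are illustrative, not from any particular engine):

```python
# Minimal lockstep loop sketch: each peer advances the simulation only once it
# has every player's commands for the current frame; determinism then keeps
# all world states identical. world/network are hypothetical interfaces.

def lockstep_loop(world, network, my_id, all_player_ids, max_frames):
    for frame in range(max_frames):
        # 1. Collect this frame's local commands and broadcast them to every peer.
        my_commands = network.poll_local_commands(frame)
        network.broadcast(frame, my_commands)

        # 2. Block until this frame's commands have arrived from every other peer.
        commands = {pid: network.receive(frame, pid)
                    for pid in all_player_ids if pid != my_id}
        commands[my_id] = my_commands

        # 3. Apply commands in a fixed order (sorted by player id) so that every
        #    peer mutates its world identically, then run one deterministic step.
        for pid in sorted(commands):
            for cmd in commands[pid]:
                world.apply(cmd)
        world.step(frame)
```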
Being a p2p model, lockstep is inherently vulnerable to cheaters, since there is no agent in the network that can be fully trusted, as opposed to, for example, server-authoritative network models, where the developer can trust a privately owned server that hosts the game. Lockstep does not offer any special protection against cheaters by itself, but it can certainly be designed to be less (or more) vulnerable to cheating.
Here is an old but still good write-up on lockstep model used in Age of Empires series if anyone needs a concrete example.

Is fast data access related to the availability (A) in CAP theorem?

I realize this is probably a basic concept, but it would be helpful if anyone could explain whether fast data access is related to availability (the A in the CAP theorem). Fast data access is an important feature expected of big data systems. And are the various K-Access and K-grouping methods all a part of it?
Availability in CAP theorem is about whether you can access your data even if there is failure in the hardware (e.g. network outage, node outage etc).
Fast access to large volumes of data is an important feature of most big data systems, but it should not be confused with availability as described above.
Availability in CAP theorem means that all your requests will receive a response, but does not specify when or how accurate. Nor does it specify what fast could mean.
Every request receives a (non-error) response – without the guarantee that it contains the most recent write.
Keep in mind that this theorem enforces strict guarantees. For example, systems can guarantee C and A, and still be good at P most of the time.

Options to achieve consensus in an immutable distributed hash table

I'm implementing a completely decentralized database. Anyone at any moment can upload any type of data to it. One good solution that fits on this problem is an immutable distributed hash table. Values are keyed with their hash. Immutability ensures this map remains always valid, simplifies data integrity checking, and avoids synchronization.
To provide some data-retrieval facilities, a tag-based classification will be implemented. Any key (associated with a single unique value) can be tagged with an arbitrary tag (an arbitrary sequence of bytes). To keep things simple, I want to use the same distributed hash table to store this tag-to-hash index.
To implement this database I need some way to maintain a decentralized consensus of what is the actual and valid tag-hash index. Immutability forces me to use some kind of linked data structure. How can I find the root? How to synchronize entry additions? How to make sure there is a single shared root for everybody?
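For reference, the content-addressed part of the design is the easy bit; a minimal sketch (a plain dict stands in for the real DHT's put/get):

```python
# Sketch of content-addressed storage as described in the question: values are
# keyed by their own hash, so entries never change once written and integrity
# checking comes for free. The dict stands in for the actual DHT.
import hashlib

def put(dht, value: bytes) -> str:
    key = hashlib.sha256(value).hexdigest()
    dht[key] = value
    return key

def get(dht, key: str) -> bytes:
    value = dht[key]
    assert hashlib.sha256(value).hexdigest() == key   # integrity check
    return value

store = {}
k = put(store, b"some immutable record")
assert get(store, k) == b"some immutable record"
```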
In a distributed hash table you can have the nodes structured in a ring, where each node in the ring knows about at least one other node in the ring (to keep it connected). To make the ring more fault-tolerant, make sure that each node knows about more than one other node in the ring, so that it can still stay connected if some node crashes. In DHT terminology, this is called a "successor list". When the nodes are structured in the ring with unique IDs and some stabilization protocol, you can do key lookups by routing through the ring to find the node responsible for a certain key.
How to synchronize entry additions?
If you don't need replication, a weak version of decentralized consensus is enough: each node has a unique ID and knows about the ring structure. This can be achieved with a periodic stabilization protocol, as in Chord: http://nms.lcs.mit.edu/papers/chord.pdf
In the stabilization protocol, each node periodically communicates with its successor to check whether it is still the true successor in the ring, whether a new node has joined in between, or whether the successor has crashed and the ring must be updated. Since no replication is used, consistent insertions only require that the ring is stable, so that peers can route the insertion to the correct node, which then inserts it into its storage. Without replication, each item is held by a single node in the DHT.
This stabilization procedure gives you a very good probability that the ring is stable and minimizes inconsistency, but it cannot guarantee strong consistency: there may be windows in which the ring is temporarily unstable while nodes join or leave. During these periods, data loss, duplication, overwrites, etc. can happen.
If your application requires strong consistency, a DHT is not the best architecture; implementing that kind of consistency in a DHT is very complex. First of all you need replication, and you also need to add a lot of acknowledgements and synchrony to the stabilization protocol, for instance running a 2PC or Paxos round for each insertion to ensure that every replica received the new value.
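To make the stabilization step concrete, here is a simplified sketch of Chord's stabilize/notify pair (one ring pointer per node, no failure handling or finger tables):

```python
# Simplified sketch of Chord-style ring stabilization: each node periodically
# checks whether a new node has slipped in between itself and its successor,
# and notifies its successor so the predecessor pointer can be updated.

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.successor = self
        self.predecessor = None

    def stabilize(self):
        # Ask my successor who it thinks precedes it; adopt that node as my
        # new successor if it lies between us on the ring.
        x = self.successor.predecessor
        if x is not None and in_interval(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, candidate):
        if self.predecessor is None or in_interval(
                candidate.id, self.predecessor.id, self.id):
            self.predecessor = candidate

def in_interval(x, a, b):
    # True if x lies on the ring strictly between a and b (IDs wrap around).
    return (a < x < b) if a < b else (x > a or x < b)
```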
How can I find the root?
How to make sure there is a single shared root for everybody?
Typically DHTs are paired with a (centralized) lookup service that contains the IPs/IDs of nodes; new nodes register with this service. The service can then also ensure that each new node gets a unique ID. Since it only manages IDs and simple lookups, it is not under high load or at great risk of crashing, so it is "OK" to have it centralized without hurting fault tolerance. Of course, you could distribute the lookup service as well, synchronizing its replicas with a consensus protocol like Paxos.

time-based simulation with actors model

We have a single-threaded application that simulates the interaction of hundreds of thousands of objects over time using a shared-memory model.
Obviously, it suffers from its inability to scale over multi-CPU hardware.
After reading a little about agent-based modeling and functional programming / the actor model, I am considering a rewrite using the message-passing paradigm.
The idea is very simple: each object becomes an actor and their interactions become messages, so that the simulation can happen in parallel. Given a configuration of objects at a certain time, its future consequences can easily be computed.
The question is how to model time:
For example, assume that the behavior of object X depends on A and B. Since the order of actor computations and message deliveries is not guaranteed, it could be that when X is about to be computed, A has already sent its message to X but B hasn't.
How do I make sure the computation happens correctly?
I hope the question is clear.
Thanks in advance.
Your approach of using message passing to parallelize a (discrete-event?) simulation is well known and does not require a functional style per se (although, of course, nothing prevents you from implementing it that way).
The basic problem you describe w.r.t. to the timing of events is also known as the local causality constraint (see, for example, this textbook). Basically, you need to use a synchronization protocol to ensure that each object (or agent) processes its messages in the right order. In the domain of parallel discrete-event simulation, such objects are called logical processes, and they communicate via events (i.e. time-stamped messages).
Correctly implementing a synchronization protocol for these events is challenging and the right choice of protocol is highly application-specific. For example, one important factor is the average amount of computation required per event: if there is little computation required, the communication costs dominate the overall execution time and it will be hard to scale the simulation.
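As a toy illustration of the local causality constraint, a conservative logical process can buffer incoming events and only execute those whose timestamp is already covered by every input channel (real protocols add lookahead and null messages to avoid deadlock; the class below is only a sketch):

```python
# Toy sketch of conservative synchronization: a logical process buffers
# time-stamped events and only executes those no earlier event can still
# precede, i.e. those at or below the minimum clock over all input channels.
import heapq
import itertools

class LogicalProcess:
    def __init__(self, input_channels):
        self.pending = []                               # heap of (timestamp, seq, event)
        self.channel_clock = {c: 0.0 for c in input_channels}
        self._seq = itertools.count()                   # tie-breaker for equal timestamps

    def receive(self, channel, timestamp, event=None):
        # Each channel must deliver non-decreasing timestamps; event=None acts
        # as a "null message" that only advances that channel's clock.
        self.channel_clock[channel] = timestamp
        if event is not None:
            heapq.heappush(self.pending, (timestamp, next(self._seq), event))

    def safe_time(self):
        # Nothing earlier than this can still arrive on any channel.
        return min(self.channel_clock.values())

    def process_ready_events(self, handle):
        while self.pending and self.pending[0][0] <= self.safe_time():
            t, _, event = heapq.heappop(self.pending)
            handle(t, event)                            # always in timestamp order
```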
I would therefore recommend looking for existing solutions/libraries on top of the actor framework you intend to use before starting from scratch.

What are the tradeoffs when generating unique sequence numbers in a distributed and concurrent environment?

I am curious about the constraints and tradeoffs of generating unique sequence numbers in a distributed and concurrent environment.
Imagine this: I have a system whose only job is to give back a unique sequence number every time you ask it. Here is an ideal spec for such a system (constraints):
Stay up under high-load.
Allow as many concurrent connections as possible.
Distributed: spread load across multiple machines.
Performance: run as fast as possible and have as much throughput as possible.
Correctness: numbers generated must:
not repeat.
be unique per request (there must be a way to break ties if two requests happen at exactly the same time).
be in (increasing) sequential order.
have no gaps between requests: 1,2,3,4... (effectively a counter for the total number of requests)
Fault tolerant: if one or more, or all machines went down, it could resume to the state before failure.
Obviously, this is an idealized spec and not all constraints can be satisfied fully (see the CAP theorem). However, I would love to hear your analysis of various relaxations of the constraints: what problems are we left with, and what algorithms would we use to solve them? For example, if we drop the no-gaps/counter constraint, the problem becomes much easier: since gaps are allowed, we can just partition the numeric range and map the partitions onto different machines.
Any references (papers, books, code) are welcome. I'd also like to keep a list of existing software (open source or not).
Software:
Snowflake: a network service for generating unique ID numbers at high scale with some simple guarantees.
keyspace: a publicly accessible, unique 128-bit ID generator, whose IDs can be used for any purpose
RFC-4122 implementations exist in many languages. The RFC spec is probably a really good base, as it prevents the need for any inter-system coordination, the UUIDs are 128-bit, and when using IDs from software implementing certain versions of the spec, they include a time code portion that makes sorting possible, etc.
If you must be sequential (per machine) but can drop the gap/counter requirements, look for an implementation of the Version 1 UUID as specified in RFC 4122.
If you're working in .NET and can drop the sequential and gap/counter requirements, just use System.Guid. It implements RFC 4122 Version 4 and the IDs are already unique (with a very low collision probability) across machines and requests. This could easily be implemented as a web service or just used locally.
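If gaps are acceptable but you still want roughly time-ordered, coordination-free IDs, a Snowflake-style generator (mentioned in the software list above) packs a timestamp, a machine ID and a per-millisecond counter into a single integer. A rough sketch, with illustrative bit widths and no handling of clocks moving backwards:

```python
# Rough sketch of a Snowflake-style ID generator: time-ordered, unique per
# machine, no inter-machine coordination, but with gaps. Bit widths are
# illustrative and clock skew is deliberately ignored.
import time
import threading

class IdGenerator:
    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024          # 10 bits for the machine ID
        self.machine_id = machine_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) % 4096   # 12-bit per-millisecond counter
                if self.seq == 0:                  # counter exhausted: wait for next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.machine_id << 12) | self.seq
```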
Here's a high-level idea for an approach that may fulfill all the requirements, albeit with a significant caveat that may not match many use cases.
If you can tolerate having two sequence numbers - a logical one returned immediately, guaranteed unique and ordered but with gaps, and a separate physical one guaranteed to be in sequential order with no gaps and available a short while later - then the solution seems straightforward:
One distributed system that can serve up a high resolution clock + machine id as the logical sequence number
Stream all the logical sequence numbers into a separate distributed system that orders the logical sequence numbers and maps them to the physical sequence numbers.
The mapping from logical to physical can happen on-demand as soon as the second system is done with processing.
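A sketch of that two-phase idea, with the logical ID reduced to a (clock, machine ID) pair and the dense physical numbering assigned once a batch is complete:

```python
# Sketch of the two-phase idea above: logical IDs are handed out immediately
# (unique and ordered, but gappy), and a later pass assigns dense physical numbers.

def logical_id(clock_ns: int, machine_id: int) -> tuple:
    return (clock_ns, machine_id)            # unique and totally ordered, but gappy

def assign_physical(logical_ids, start=1):
    # Run once the batch is complete; returns a gap-free numbering.
    ordered = sorted(logical_ids)
    return {lid: start + i for i, lid in enumerate(ordered)}

batch = [logical_id(1_000, 2), logical_id(1_000, 1), logical_id(999, 7)]
mapping = assign_physical(batch)
# -> {(999, 7): 1, (1000, 1): 2, (1000, 2): 3}
```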
