[Noisebridge-discuss] Distributed computing and storage (with some major caveats)

Sai Emrys noisebridge at saizai.com
Tue Jun 23 11:05:15 UTC 2009


On Tue, Jun 23, 2009 at 12:02 AM, Ian<ian at slumbrparty.com> wrote:
> http://pdos.csail.mit.edu/chord/

Very interesting. Per its FAQ, though, it lacks two things: anonymous
C&C / file access, and protection against malicious nodes.

It's possible that these could be added somehow, e.g. some sort of
FreeNet / Tor style multi-encrypted, onion-routed command and response
packets for anonymity. I'm not sure what would work for malice-proofing.

Scalability is good though. I'd like this to be able to support
millions of nodes.

> http://freepastry.org/

AFAICT it's very similar to Chord in pros and cons.

> http://www.cs.uiowa.edu/~ghosh/Viceroy.pdf

> anyway, you may want to talk more about what you want such a system to
> accomplish. the reason this is important is because there could be
> certain constraints and restrictions that do not need to exist, making
> the system you want easier to implement.

Anything in particular?

I think that the constraints I gave are necessary ones.

One of them conflicts with a primary goal though - one cannot both
have encrypted data, and have it be searchable. That's OK.

E.g. for a simple case, suppose you are mapping the entire Internet
with something like nmap + GeoIP, and you want the results to be
placed in the überdatabase. (Yes, I'm aware that actually doing so
would have... issues. But it's an easy example.)

This is both a segmentable task problem and a distributed storage
problem. And you don't want to repeat the scan more than a couple of
times per IP mapped, to avoid DoS etc. And of course such a table would
be ginormous (~3-4 billion rows at present?).
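
To make the "segmentable" part concrete, here is a minimal sketch of
cutting the IPv4 space into work units that could be handed to
different nodes. The function name, the /8 unit size, and the "scan
each unit twice" redundancy are all assumptions for illustration, not
part of any existing system. It also shows where the upper bound on
the row count comes from: the full IPv4 space is 2^32 addresses, of
which some are reserved or private.

```python
import ipaddress

def work_units(prefix_len=8, copies=2):
    """Yield (network, copy_index) pairs covering the whole IPv4 space.

    prefix_len: size of each work unit (hypothetical default: one /8 each).
    copies: how many independent nodes should scan each unit, kept small
            so that no address gets probed more than a couple of times.
    """
    whole = ipaddress.ip_network("0.0.0.0/0")
    for net in whole.subnets(new_prefix=prefix_len):
        for copy_index in range(copies):
            yield net, copy_index

units = list(work_units(prefix_len=8, copies=2))

# Count addresses once per unit (copy 0 only) to get the size of the space.
total_addresses = sum(net.num_addresses for net, c in units if c == 0)
print(len(units), "work units covering", total_addresses, "addresses")
```

Smaller units (say /16 or /24) would make reassignment after a node
failure cheaper, at the cost of more bookkeeping per unit.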

Such a table set might look like e.g.:

nodes:
* node ID (UUID)
* IPv4 (string? bigint? 4 separate smallints?)
* IPv6 (string? 2 bigints, since it's 128 bits? 8 separate smallints?)
* within-NAT route?
* OS (string? enum?)
* latitude (float)
* longitude (float)
* state (enum - eg down, nmapped, etc)
* comment (string)
* timestamp (of last test)
* signature (bigint? - of node authenticating this data)

services:
* service ID (UUID)
* node ID (foreign key)
* port # (int)
* version (string? enum? enum + int?)
* status (enum - tested-responding, open, cloaked, honeypot, etc)
* timestamp
* signature
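
As a single-machine sketch of the two tables above, here they are in
SQLite via Python's sqlite3 module. The column types are just one pick
from the options listed (IPv4 as an integer, IPv6 as a string, enums
as strings, signature as an opaque blob), and the helper names
make_db, add_node and add_service are made up for this example. A real
distributed store would look quite different; this only pins down the
shape of the data.

```python
import ipaddress
import sqlite3
import time
import uuid

SCHEMA = """
CREATE TABLE nodes (
    node_id   TEXT PRIMARY KEY,  -- UUID stored as text
    ipv4      INTEGER,           -- 32-bit address as one integer
    ipv6      TEXT,              -- 128 bits does not fit one 64-bit integer
    nat_route TEXT,
    os        TEXT,
    latitude  REAL,
    longitude REAL,
    state     TEXT,              -- e.g. 'down', 'nmapped'
    comment   TEXT,
    timestamp REAL,              -- time of last test
    signature BLOB               -- of the node authenticating this row
);
CREATE TABLE services (
    service_id TEXT PRIMARY KEY,
    node_id    TEXT NOT NULL REFERENCES nodes(node_id),
    port       INTEGER,
    version    TEXT,
    status     TEXT,             -- e.g. 'open', 'cloaked', 'honeypot'
    timestamp  REAL,
    signature  BLOB
);
"""

def make_db():
    """Create an in-memory database with both tables."""
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")
    db.executescript(SCHEMA)
    return db

def add_node(db, ipv4, os_name, state="nmapped"):
    """Insert one scanned host and return its new node ID."""
    node_id = str(uuid.uuid4())
    db.execute(
        "INSERT INTO nodes (node_id, ipv4, os, state, timestamp) "
        "VALUES (?, ?, ?, ?, ?)",
        (node_id, int(ipaddress.IPv4Address(ipv4)), os_name, state, time.time()),
    )
    return node_id

def add_service(db, node_id, port, status="open"):
    """Insert one service row for an existing node and return its ID."""
    service_id = str(uuid.uuid4())
    db.execute(
        "INSERT INTO services (service_id, node_id, port, status, timestamp) "
        "VALUES (?, ?, ?, ?, ?)",
        (service_id, node_id, port, status, time.time()),
    )
    return service_id
```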

Now, you might want some of that information to be encrypted (e.g. the
services list? the IP?). But if you do so, then you won't be able to
search by it any more, which in turn makes it difficult to parcel out
the task of covering the entire space (e.g. a control node may need to
regularly run a search to ask what subset of its keyspace has been
done, so as to reassign tasks if subnodes have failed to finish them in
some reasonable amount of time). Encrypting other things (like the
comment string) is not a problem, since you probably don't need to do
something like "select * from nodes where comment like '%awesome%';" -
you'd add a column if there's something that structured.
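
The control-node check described above could be sketched like this. It
only works because the unit key (here a cleartext IP prefix) is
something the control node can look up, which is exactly what
encrypting that field would break. The function name, the data layout,
and the timeout are assumptions for illustration.

```python
def units_to_reassign(assigned, completed, now, timeout):
    """Return work units that are overdue and should be handed out again.

    assigned:  dict mapping unit key -> (worker_id, assigned_at_seconds)
    completed: set of unit keys for which results have been stored
    now:       current time in seconds
    timeout:   how long a worker gets before its unit is considered lost
    """
    return sorted(
        unit
        for unit, (worker_id, assigned_at) in assigned.items()
        if unit not in completed and now - assigned_at > timeout
    )

assigned = {
    "10.0.0.0/8": ("worker-a", 0),   # finished
    "11.0.0.0/8": ("worker-b", 0),   # never reported back
    "12.0.0.0/8": ("worker-c", 90),  # only recently assigned
}
completed = {"10.0.0.0/8"}

overdue = units_to_reassign(assigned, completed, now=100, timeout=60)
print(overdue)
```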

You might also want to perform searches like "count the number of
nodes by OS" or (harder) "count the number of nodes by # open
services".
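
On a single machine those two searches are plain SQL, as in the sketch
below (SQLite via Python, with trimmed-down tables and made-up sample
rows). The second one is harder because it needs a per-node count
first and then a count of those counts. In a distributed setting each
storage node would presumably compute partial results to be merged,
which is an assumption on my part and not shown here.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE nodes (node_id TEXT PRIMARY KEY, os TEXT);
CREATE TABLE services (
    service_id TEXT PRIMARY KEY, node_id TEXT, port INTEGER, status TEXT
);
INSERT INTO nodes VALUES ('n1', 'linux'), ('n2', 'linux'), ('n3', 'bsd');
INSERT INTO services VALUES
    ('s1', 'n1', 22, 'open'), ('s2', 'n1', 80, 'open'),
    ('s3', 'n2', 22, 'open'), ('s4', 'n2', 23, 'cloaked');
""")

# Count the number of nodes by OS.
by_os = db.execute(
    "SELECT os, COUNT(*) FROM nodes GROUP BY os ORDER BY os"
).fetchall()

# Count the number of nodes by number of open services. The LEFT JOIN
# keeps nodes with zero open services in the result.
by_open_count = db.execute("""
    SELECT open_services, COUNT(*) AS node_count
    FROM (
        SELECT n.node_id, COUNT(s.service_id) AS open_services
        FROM nodes n
        LEFT JOIN services s
               ON s.node_id = n.node_id AND s.status = 'open'
        GROUP BY n.node_id
    )
    GROUP BY open_services
    ORDER BY open_services
""").fetchall()

print(by_os)
print(by_open_count)
```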

Then factor in that you may have:
* actively malicious nodes (trying to return bogus results, subvert or
spoof C&C, map your nodes, attack privacy of C&C, etc)
* nodes coming in and out during the search (especially for
long-running searches) and data being under active change (by the time
you see it, it's stale)
* need to deniably transport command packets (so that while you can be
identified as a node, you cannot be identified as the origin node of
any request or response unless you choose to authenticate)
* need to prevent an attacker from mapping all your nodes (yet any
node should be able to *communicate with* any other node - similar to
FreeNet and Tor's issues)
* etc

So yeah, it gets kinda complicated. :-P

But I think my listed specs are all motivated and necessary. Any in
particular you think aren't?

- Sai


