[Noisebridge-discuss] Distributed computing and storage (with some major caveats)

Ian ian at slumbrparty.com
Sat Jun 27 00:40:44 UTC 2009


sorry it took me so long to respond. the most prominent requirement
you touched on was encrypting the data stored and also being able to
search it. this is possible.

there is a crypto technique called Private Information Retrieval (PIR).
it lets you query a store without the store learning what you asked
for, and the related work on searchable encryption covers searching
data that stays encrypted. if you google "Private Information
Retrieval" encryption, you'll find what you're looking for. you can
also do an offline search: build an index of the data beforehand, then
encrypt both the data and the index. to search, you download and
aggregate the indices and run the query locally. i think it works.
i've never actually built this offline thing, but it sounds like it'll
work :).
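
here is a rough sketch of the offline index idea, just to make it
concrete. it is not PIR and not something i have deployed. every name
in it (token, build_index, merge_indices, search) is made up for
illustration, and it assumes all searchers share one secret key. the
keywords are blinded with HMAC-SHA256 so the index holds no plaintext
words; encrypting the documents themselves would need a real cipher
and is not shown.

```python
import hashlib
import hmac


def token(key: bytes, word: str) -> str:
    """Blind a keyword with HMAC-SHA256 so the index stores no plaintext."""
    return hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(key: bytes, docs: dict) -> dict:
    """Map each blinded keyword to the set of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.split()):
            index.setdefault(token(key, word), set()).add(doc_id)
    return index


def merge_indices(indices) -> dict:
    """Aggregate indices downloaded from several nodes into one."""
    merged = {}
    for index in indices:
        for tok, ids in index.items():
            merged.setdefault(tok, set()).update(ids)
    return merged


def search(key: bytes, index: dict, word: str) -> set:
    """Look up a keyword locally; only a holder of the key can form the token."""
    return index.get(token(key, word), set())
```

usage would be something like: each node runs build_index over its own
data, you fetch and merge_indices the results, then call search
locally. note that this still leaks which documents share a keyword to
anyone holding the index, so it is weaker than real PIR.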

anyway, you have a decentralized command and control (C&C) unit, which
is not typical of these systems. distributed systems usually have a
centralized C&C, while decentralized networks are either autonomous or
require an entry node. the only implementations of decentralized C&C i
have seen so far, barring current research that has not been proven,
are botnets. so it is possible. unfortunately, botnets (afaik) don't
have a lot of the other properties you listed.

thanks,

verbal

On Tue, Jun 23, 2009 at 4:05 AM, Sai Emrys<noisebridge at saizai.com> wrote:
> On Tue, Jun 23, 2009 at 12:02 AM, Ian<ian at slumbrparty.com> wrote:
>> http://pdos.csail.mit.edu/chord/
>
> Very interesting. Two things it fails per its FAQ: anonymous C&C /
> file access, and protection against malicious nodes.
>
> It's possible that these could be added in somehow; e.g. some sort of
> FreeNet / Tor style multi-encrypted onion routed command and response
> packets for anonymity; not sure what for malice-proofing.
>
> Scalability is good though. I'd like this to be able to support
> millions of nodes.
>
>> http://freepastry.org/
>
> AFAICT is very similar in pros and cons to Chord.
>
>> http://www.cs.uiowa.edu/~ghosh/Viceroy.pdf
>
>> anyway, you may want to talk more about what you want such a system to
>> accomplish. the reason this is important is because there could be
>> certain constraints and restrictions that do not need to exist, making
>> the system you want easier to implement.
>
> Anything in particular?
>
> I think that the constraints I gave are necessary ones.
>
> One of them conflicts with a primary goal though - one cannot both
> have encrypted data, and have it be searchable. That's OK.
>
> E.g. for a simple case, suppose you are mapping the entire Internet
> with something like nmap + GeoIP, and you want the results to be
> placed in the überdatabase. (Yes, I'm aware that actually doing so
> would have... issues. But it's an easy example.)
>
> This is both a segmentable task problem and a distributed storage
> problem. And you don't want to repeat it more than a couple times per
> IP mapped, to avoid DoS etc. And of course such a table would be
> ginormous (~3-4 billion rows at present?).
>
> Such a table set might look like e.g.:
>
> nodes:
> * node ID (UUID)
> * IPv4 (string? bigint? 4 separate smallints?)
> * IPv6 (string? bigint? 8 separate ints?)
> * within-NAT route?
> * OS (string? enum?)
> * latitude (float)
> * longitude (float)
> * state (enum - eg down, nmapped, etc)
> * comment (string)
> * timestamp (of last test)
> * signature (bigint? - of node authenticating this data)
>
> services:
> * service ID (UUID)
> * node ID (foreign key)
> * port # (int)
> * version (string? enum? enum + int?)
> * status (enum - tested-responding, open, cloaked, honeypot, etc)
> * timestamp
> * signature
>
> Now, you might want some of that information to be encrypted (e.g. the
> services list? the IP?). But if you do so, then you won't be able to
> search by it any more, which in turn means that it becomes a bit
> difficult to parcel out the task of searching the entire space (e.g.
> control nodes may need to regularly do a search to ask what subset of
> its keyspace has been done, so as to reassign tasks if subnodes have
> failed to do so in some reasonable amount of time). Whereas encrypting
> other things (like the comment string) is not a problem, since you
> probably don't need to do something like "select * from nodes where
> comment like '%awesome%';" - you'd add a column if there's something
> that structured.
>
> You might also want to perform searches like "count the number of
> nodes by OS" or (harder) "count the number of nodes by # open
> services".
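
for reference, a minimal sketch of the table set and the two counting
queries above, assuming plain unencrypted columns in a single sqlite
database. that is a simplification, not the distributed store being
discussed. only a few of the listed columns are included, and the
function names (make_db, count_by_os, count_by_open_services) are
hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with cut-down nodes/services tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE nodes (
            node_id TEXT PRIMARY KEY,  -- UUID stored as text
            ipv4    TEXT,
            os      TEXT,
            state   TEXT               -- e.g. down, nmapped
        );
        CREATE TABLE services (
            service_id TEXT PRIMARY KEY,
            node_id    TEXT REFERENCES nodes(node_id),
            port       INTEGER,
            status     TEXT            -- e.g. open, cloaked, honeypot
        );
    """)
    return conn


def count_by_os(conn):
    """Count the number of nodes by OS."""
    return conn.execute(
        "SELECT os, COUNT(*) FROM nodes GROUP BY os ORDER BY os"
    ).fetchall()


def count_by_open_services(conn):
    """Count the number of nodes by how many open services each has."""
    return conn.execute("""
        SELECT n_open, COUNT(*) FROM (
            SELECT n.node_id, COUNT(s.service_id) AS n_open
            FROM nodes n
            LEFT JOIN services s
              ON s.node_id = n.node_id AND s.status = 'open'
            GROUP BY n.node_id
        ) AS per_node
        GROUP BY n_open ORDER BY n_open
    """).fetchall()
```

the second query needs a join plus a nested aggregate, which is why it
is the harder one. if os or status were encrypted per row, neither
GROUP BY would work as written.
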
>
> Then factor in that you may have:
> * actively malicious nodes (trying to return bogus results, subvert or
> spoof C&C, map your nodes, attack privacy of C&C, etc)
> * nodes coming in and out during the search (especially for
> long-running searches) and data being under active change (by the time
> you see it, it's stale)
> * need to deniably transport command packets (so that while you can be
> identified as a node, you cannot be identified as the origin node of
> any request or response unless you choose to authenticate)
> * need to prevent an attacker from mapping all your nodes (yet any
> node should be able to *communicate with* any other node - similar to
> FreeNet and Tor's issues)
> * etc
>
> So yeah, it gets kinda complicated. :-P
>
> But I think my listed specs are all motivated and necessary. Any in
> particular you think aren't?
>
> - Sai
>


