[Noisebridge-discuss] Distributed computing and storage (with some major caveats)

Tue Jun 23 06:06:13 UTC 2009

On Mon, Jun 22, 2009 at 9:48 PM, Shannon Lee<shannon at scatter.com> wrote:
> It seems interesting to me.  Does it have to be SQL-style, or could you do
> some sort of key-value store thing?

As I responded to Jason, I don't especially care if it's SQL per se,
although it'd be convenient as an already known standard. I just want
it to support certain features that simple pure key/value stores (e.g.
memcache) do not - indexing, search by data field, etc. If it can do
so by implementing a compliant subset of SQL such that it's compatible
with SQL-based programs and just needs a different database interface
library, so much the awesomer.

For the most part, I expect the data being stored in this system to be
structured in broadly the same way as a mysql or other RDBMS-type
database is (i.e. fields, foreign keys, etc) and thus want to take
advantage of that fact as much as possible. But certainly some things
would need to be different - e.g. UUIDs throughout (since sequential IDs
would hit severe race conditions and propagation issues), indexing and other
search as a sort of eDonkey-esque command (since no node will have an
authoritative list), versioning (in case one node updates a record and
another hasn't heard about that update yet), etc. Not to mention how
you deal with version collisions or erstwhile atomic actions (e.g.
counter increment); I think those are just not possible and have to
have workarounds (e.g. a minimalist list of events instead of a
counter thereof; maybe some sort of dated 'rollups' system if that
gets to be too large). I think the advanced features (like JOIN)
aren't especially necessary, at least not at first pass.
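
A rough sketch of the "list of events instead of a counter" workaround
mentioned above, assuming each increment is stored as its own
UUID-keyed event and nodes sync by exchanging event sets. The names
EventCounter, increment, merge and value are hypothetical, not from any
existing system, and the dated rollups are left out.

```python
import uuid


class EventCounter:
    """A counter kept as a set of uniquely identified events.

    Hypothetical illustration: no node ever writes a number, so two
    nodes incrementing at the same time cannot overwrite each other.
    """

    def __init__(self):
        # Event UUID -> timestamp of the increment.
        self.events = {}

    def increment(self, timestamp):
        # A random UUID needs no coordination between nodes.
        event_id = uuid.uuid4()
        self.events[event_id] = timestamp
        return event_id

    def merge(self, other):
        # Union of events; merging the same data twice changes nothing.
        self.events.update(other.events)

    def value(self):
        # The count is derived from the events, never stored directly.
        return len(self.events)


# Two nodes increment independently, then exchange what they know.
node_a = EventCounter()
node_b = EventCounter()
node_a.increment(timestamp=1)
node_a.increment(timestamp=2)
node_b.increment(timestamp=3)
node_a.merge(node_b)
node_b.merge(node_a)
node_a.merge(node_b)  # repeated sync is harmless
```

Because merging is a set union, the order and number of syncs do not
matter; both nodes end up with the same count once they have seen the
same events.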

It's also a bit unlike SQL in that you would most likely receive the
response as an ongoing data stream, rather than an atomic single
answer, and you may want to work on data as it comes in. (Again, this
is very similar to how e.g. eDonkey searches work AFAIU.)
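
To make the streaming idea concrete, here is a minimal sketch, assuming
each node can be read as an iterable of records and every record
carries a UUID-like "id" field. stream_search and the record layout are
made up for illustration; a real system would read from the network
rather than from in-memory lists.

```python
def stream_search(nodes, predicate):
    """Yield matching records as each node's answers arrive.

    Replicated records are emitted only once, keyed by their id.
    """
    seen = set()
    for node in nodes:
        for record in node:
            if record["id"] in seen:
                continue  # already delivered from another replica
            if predicate(record):
                seen.add(record["id"])
                yield record


# Two hypothetical nodes holding overlapping replicas.
node_a = [{"id": "u1", "city": "SF"}, {"id": "u2", "city": "NYC"}]
node_b = [{"id": "u1", "city": "SF"}, {"id": "u3", "city": "SF"}]

matched_ids = []
for record in stream_search([node_a, node_b], lambda r: r["city"] == "SF"):
    # The caller can act on each record before the search has finished.
    matched_ids.append(record["id"])
```

The caller never gets a single "complete" answer; it just stops
listening when it has enough results or the nodes run dry.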

However, I think that each node could operate its own database as
something SQL-compatible (mysql, sqlite, whatever) - I'm talking here
about the interface to the amorphous system as a whole.

On Mon, Jun 22, 2009 at 10:26 PM, d p chang<weasel at meer.net> wrote:
> out of curiosity, is this feature set suggesting that the data come/go
> w/ the nodes so that an action is local to the node (possibly
> replicated)?

I'm not sure I understand your question.

All data should be replicated redundantly across nodes, so that it can
survive any node going down without loss of the data that that node
may have generated (or proxied).

Action distribution is a tricky problem. For some things, it may be
simply wasteful to do them multiple times; for others (such as, say, a
search and stored classification of IP space) it may be actually
harmful.

On the other hand, to deal with node death and/or malice, you need to
have multiple paths for any given task subset to get executed and
verified. (For example, a simple divide-and-conquer strategy would be
very vulnerable on these points, because the early nodes would control
a great deal of the overall keyspace. So this would probably need to
be structured as a graph, not a tree.)
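
One possible way to get "multiple paths" per task, sketched under
assumptions that are not in the original proposal: each task is handed
to k nodes chosen by hashing the node name together with the task id,
and a result is accepted only if a strict majority of the returned
answers agree. assign and accept are hypothetical helper names.

```python
import hashlib
from collections import Counter


def assign(task_id, nodes, k=3):
    """Pick k distinct nodes for a task by hashing (node, task) pairs.

    Each task ranks the nodes differently, so no single node ends up
    responsible for a large contiguous slice of the work.
    """
    def score(node):
        return hashlib.sha256(f"{node}:{task_id}".encode()).hexdigest()

    return sorted(nodes, key=score)[:k]


def accept(results):
    """Return the value a strict majority agrees on, or None."""
    if not results:
        return None
    value, votes = Counter(results).most_common(1)[0]
    return value if votes * 2 > len(results) else None


nodes = ["n1", "n2", "n3", "n4", "n5"]
chosen = assign("task-42", nodes)

# One dead or malicious node out of three is outvoted.
agreed = accept(["10.0.0.0/8 scanned", "10.0.0.0/8 scanned", "bogus"])
```

This only covers node death and a dishonest minority; it does nothing
for tasks that are harmful to run more than once.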

On Mon, Jun 22, 2009 at 10:51 PM, Ian<ian at slumbrparty.com> wrote:
> this is something that i'm greatly interested in and have worked on
> systems like parts of your requirements before. the thing is, what you
> propose is very complicated to get right.

Definitely. ;-)

> i suggest you do some
> research (if you havent already) on what's out there. there are
> systems out there that already fulfill your requirements individually.
> maybe you can combine them or augment existing systems. if you would
> like, we can start a distributed systems/p2p/decentralized networks
> group at NB so we can learn about these existing things.

Part of the point of this email was to figure out what those systems
are that I ought to read / rip. ;-)

TBH, my knowledge of such systems is minimal and theoretical. I have a
BA CogSci degree including various CS stuff, and I've got production
experience with relatively large (~300k unique users / day) web apps
using memcache & mysql. And I am superficially familiar with stuff
like Metasploit, BOINC, & MapReduce. I'm sure that other NBers
(perhaps you?) can easily outclass me in relevant knowledge.

Anyhow, I'd be interested in such a group.

- Sai