
Considerations on the problem of data management.
(0) The cloud is for data that will leak. Account information is being lost constantly. The most secure service out there that I have heard of is tarsnap, and it is not a friendly service.
(1) My data needs to be content-searchable as well as filename-searchable. Much of this content can be extracted from the files themselves.
(1.1) My data includes email, pictures, music, video, source code, and, in short, anything that fits the generalized idea of a document.
(2) Therefore, a search tool needs to be created. Spotlight is a reasonable solution, but is, as far as I know, closed source. I expect it can be substantially improved upon.
(3) My data needs to be globally versioned. This will incur at least a doubling of the space it occupies. Git/hg/etc. are designed for source code, not for a general-purpose archive like this.
(4) My data needs to be migratable. This implies that the archives need to be either (a) copyable or (b) distributed in nature.
(5) The archive needs to be reasonably transparent to existing management systems. Mercurial, for instance, treats .hg directories as special, so hg can't be used to version trees that themselves contain .hg directories (which is exactly what will logically happen here).
(6) A desktop-only solution is fine. The web has problems that I am not interested in solving. A local-only database for the search index is fine, as is possibly storing all the archive information in a distributed database (cue horrible stories from people who have done distributed databases).
(7) As a core set of principles:
(7a) There is no reason that data should be lost to a computer user who has access to multiple computers and does not suffer destruction of all of them simultaneously.
(7b) There is no reason that textual data should be unsearchable.
(7c) The software that performs these services would be able to access any information on the system and thus ought to be auditable and, ideally, open source.
I'm contemplating what it would take to produce these two things:
- Desktop search service
- Data management/versioning system
I figure that the desktop search proper will involve three components: a reasonably aggressive parser that hacks apart the data files, a database, and a client interface. The parser will, of course, have to support a number of different formats.
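To make that concrete, here is a minimal sketch of those three components in Python, under two big assumptions: that your SQLite build ships with the FTS5 extension, and that every file can be treated as plain text (the real parser would dispatch on file format instead). The file name index.db and the function names are placeholders of mine, not a proposed design.

    #!/usr/bin/env python3
    """Minimal sketch of the search service: parser, database, client."""
    import os
    import sqlite3
    import sys

    DB_PATH = "index.db"  # local-only search index, per point (6)

    def build_index(root):
        """Parser + database: walk the tree, extract text, store it in SQLite FTS5."""
        con = sqlite3.connect(DB_PATH)
        con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(path, body)")
        con.execute("DELETE FROM docs")  # naive full rebuild; a real tool would update incrementally
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    # Stand-in for the real format-aware parser: treat everything
                    # as text and skip whatever fails to open or decode.
                    with open(path, encoding="utf-8") as f:
                        body = f.read()
                except (UnicodeDecodeError, OSError):
                    continue
                con.execute("INSERT INTO docs (path, body) VALUES (?, ?)", (path, body))
        con.commit()
        con.close()

    def search(query):
        """Client interface: a bare query matches both columns, so this covers
        content search and filename search; 'path:foo' restricts to filenames."""
        con = sqlite3.connect(DB_PATH)
        rows = con.execute(
            "SELECT path FROM docs WHERE docs MATCH ? ORDER BY rank", (query,)
        ).fetchall()
        con.close()
        return [path for (path,) in rows]

    if __name__ == "__main__":
        if len(sys.argv) == 3 and sys.argv[1] == "index":
            build_index(sys.argv[2])
        elif len(sys.argv) == 2:
            for path in search(sys.argv[1]):
                print(path)
        else:
            print("usage: search.py index <dir> | search.py <query>")

Usage would be something like "search.py index ~/data" followed by "search.py tarsnap"; the interesting engineering lives in replacing that open-and-read stanza with real per-format extractors.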
The data management & versioning system could likely be built on top of git's primitives. I say that because git is relatively simple under the hood, exposes a lot of its capability to the Linux user, and has a very fast communication protocol. Around this could be wrapped a tarball step, the indexer, and a gpg call to tie it all off.
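As a sanity check on that idea, here is a rough sketch in Python of what the wrapper might look like, driving git's plumbing (write-tree, commit-tree, update-ref) through subprocess and finishing with a detached gpg signature over a tarball. The repository path, function names, and snapshot message are all placeholder assumptions, and it presumes git and gpg are installed, with git user.name/user.email and a default gpg key configured.

    #!/usr/bin/env python3
    """Sketch: snapshot a directory with git plumbing, then gpg-sign a tarball of it."""
    import subprocess
    import tarfile

    def git(repo, *args):
        """Run a git command inside the archive repository and return its output."""
        result = subprocess.run(["git", "-C", repo, *args],
                                check=True, capture_output=True, text=True)
        return result.stdout.strip()

    def snapshot(repo, message):
        """Record the current state of the tree as a commit, using plumbing commands."""
        git(repo, "add", "--all")          # porcelain shortcut for update-index --add
        tree = git(repo, "write-tree")     # serialize the index into a tree object
        try:
            parent = git(repo, "rev-parse", "--verify", "HEAD")
        except subprocess.CalledProcessError:
            parent = None                  # very first snapshot: no parent commit yet
        args = ["commit-tree", "-m", message]
        if parent:
            args += ["-p", parent]
        commit = git(repo, *args, tree)    # create the commit object by hand
        git(repo, "update-ref", "HEAD", commit)
        return commit

    def export_signed(repo, tarball):
        """The wrapper: tar up the repository and detach-sign the result with gpg."""
        with tarfile.open(tarball, "w:gz") as tar:
            tar.add(repo, arcname="archive")
        # Writes tarball + ".asc" alongside the tarball.
        subprocess.run(["gpg", "--armor", "--detach-sign", tarball], check=True)

    if __name__ == "__main__":
        repo = "my-archive"                # hypothetical dir already holding the data
        subprocess.run(["git", "init", repo], check=True)
        print("snapshot:", snapshot(repo, "archive snapshot"))
        export_signed(repo, "archive.tar.gz")

Nothing here is clever, which is rather the point: the object store, the history, and the transfer protocol all come from git for free, and the project's value would be in the indexing and the policy around it.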
Is anyone willing to work on this sort of thing with me? This is a reasonably serious project that I anticipate would span two or three years before maturity. Things to consider:
- I would insist on A/GPL3 licensing
- I would not drop this project without a reasonable and viable alternative being available. This is sort of a big deal to me.
- I have worked on software both academically and commercially, and have interacted with open source for years. I can crank out code when I am on a roll.
- I am open to using your pet language & tech stack (I'm personally interested in learning Clojure right now), and would learn it if I didn't already know it.
- If we pulled this off, it would be of a scope worth putting on your resume.
- I offer virtual cupcakes. :: cupcake ::
vlion@dreamwidth.org routes to my email address, if you don't want to comment on this post.