Saturday, April 17, 2004

Today I was reminded of a project I've tried my best to forget because of the painful and hideous loss involved.

Some time ago, I spent a good deal of time developing a documentation program called Document Designr, whose primary function was to enable the production and navigation of very complex design specifications and other networks of interdependent descriptions, notes, or text.

Basically it was an XML data format that encapsulated XHTML text files, with the extension, predictably enough, of .ddd. It was implemented as an XUL application on Mozilla 1.4, so it could use the Mozilla Composer to manage the XHTML, and other components for various functions: navigation, rendering, compression, etc. It was a very clever idea that I got from someone else, who was quite looking forward to the development, I think.
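Since the source is gone, here's a rough sketch from memory (in Python rather than XUL, and with every element and attribute name invented, since the real format is lost) of what a minimal .ddd envelope might have looked like:

    # A guess at a minimal .ddd envelope: an XML wrapper holding XHTML
    # section bodies plus explicit cross-links between sections. All
    # element and attribute names here are hypothetical.
    import xml.etree.ElementTree as ET

    def make_ddd(title, sections, links):
        """sections: list of (section_id, xhtml_fragment) pairs;
        links: list of (from_id, to_id, relation) tuples."""
        root = ET.Element("document", {"format": "ddd", "version": "0.1"})
        ET.SubElement(root, "title").text = title
        body = ET.SubElement(root, "sections")
        for sec_id, xhtml in sections:
            sec = ET.SubElement(body, "section", {"id": sec_id})
            # Embed the XHTML fragment as a parsed subtree, not escaped text.
            sec.append(ET.fromstring(xhtml))
        rels = ET.SubElement(root, "relations")
        for src, dst, rel in links:
            ET.SubElement(rels, "link", {"from": src, "to": dst, "type": rel})
        return ET.tostring(root, encoding="unicode")

    print(make_ddd(
        "Widget Design Spec",
        [("overview", "<div><p>The widget frobnicates.</p></div>"),
         ("api", "<div><p>See overview for context.</p></div>")],
        [("api", "overview", "depends-on")],
    ))

The point of the wrapper was exactly that relations block: the XHTML stays ordinary XHTML, and the interdependencies live alongside it as data.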

Then I lost all the data, and all the work.

In fact I lost all my data: my journals, my databases, my online texts, my carefully hoarded archived mails, websites, pictures, lists. It totally derailed me. I had come to rely on the portion of myself I had documented and offloaded onto the computer. In the larger tragedy, DocDesignr was lost. I never reopened the work, and the SourceForge project remains completely empty. It's an object lesson in the necessity of remote backups and other basic precautions.

Today I was reminded of it. I work at an AI research firm, which means our main currency is ideas. We investigate things, write about them, implement them, and research more. We come to our conclusions through consensus, meetings, tests, and a bit of luck. Unfortunately, we generate a lot of ideas, and our discussions raise more. Last month, we produced about 40 documents. This represents maybe half of the total issues we dealt with. Many exist only as memories of words spoken at a brainstorming session, and some brief notes on a paper easel we have.

Even with issues and projects we have written documentation on, much information exists as experience, or as things discovered after the documentation was written but too small to warrant a rewrite.

Today I reopened a task we researched and discussed last month. To my shame, I had to open the document and spend several minutes reviewing it before I could even be certain I completely remembered the conversations we had about it. I should spend some time flipping through the easel archives to ensure I don't miss points raised in brainstorming. And even when I've done that, I won't be certain I haven't lost something in the intervening month.

This is obviously suboptimal. But what is the alternative? More documents? Navigating the folder with last month's output already depends on previous experience and a good feel for what indicates content. The overhead involved in writing down everything we do is already considerable, and time spent writing down ideas is time spent not having them, particularly when such ideas strike not as part of the composition process, but in a brainstorming meeting, in review, or in working with a particular test case.

More time gathering information? I could re-quiz everyone involved in the original spec and research, collating their responses into an updated document, but that would take time, and could even introduce recollection errors. The responses might not agree, and which one would I take as authoritative? When does review and research become re-research of the original issue?

It's interesting to note that there are two divergent issues here which, although related, seem to call for differing solutions. One: the generation of more and better documentation of work done, with more bits recorded and accessible. Two: the documentation that currently exists is already too unwieldy to navigate with precision and too time-consuming to keep up.

I think, however, that there is a single dynamic that holds a partial answer to both.

It's interesting here to diverge into a quick overview of a company that speaks very closely to the issue at hand: Google, in the news recently for their email initiative, Gmail. I don't know how many of you keep records of all your emails. But, particularly if you write many, you really should keep them all. You'd be surprised how much data lies therein, and how useful it is. But beyond what you remember, the navigation quickly becomes one of threads and dates. Imagine, however, that every email you've sent or received lies within the reach of a single search engine. And not just a port of grep or a simple text-matching box, but the fastest, most sophisticated, most semantic search parser in the world, hooked up to the biggest map of the internet, and indeed all information in the world. Now that would be magic. It's enough to make you put up with nearly any price. And it certainly seems that Google is counting on this.

Privacy concerns aside: the strange thing here is that Google, unlike most distributed targeted services, only gets better the more people use it. Every additional bit in the database makes it that much more relevant and comprehensive. It's to their advantage to nearly give storage space away, and they seem to recognize this. Witness Gmail's vaunted gigabyte of storage. People imagine they are taking advantage of Google, when they are literally handing Google the keys to their data, and providing more context, more categorization, and more associative links for free.

Google's strategy points to a solution for the smaller domain of project documentation that sparked this entry, the dynamic that will solve the two seemingly contradictory problems. The solution isn't pruning, or increased documentation overhead. It's far simpler than that. The solution is to simplify, enlarge, and integrate the data, just as Google attempts to do. Do all you can to get all the data into a common database (Google parses PDFs, meta tags, pictures, news, RDF, HTML, plain text, PowerPoint, auto-translates, and god knows what else now, all to get it into one big honking map of links and keywords), then keep it all. Don't filter, don't delete. Don't manage it as a total resource; manage it per request. Pre-associate as much as you can, but don't try to semanticize past your parser's capability. Remember Vivisimo? Yeah, me too, just barely.
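To make that dynamic concrete, here's a toy sketch of the idea in Python: every document type boiled down to plain text, every word indexed, nothing ever deleted, and all the filtering done per request. The document ids and contents are hypothetical stand-ins, of course:

    # One big honking map: normalize everything to text, index every
    # word, keep the originals forever, filter only at query time.
    import re
    from collections import defaultdict

    index = defaultdict(set)   # word -> set of doc ids
    store = {}                 # doc id -> original text (keep it all)

    def add_document(doc_id, text):
        store[doc_id] = text
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)

    def query(words):
        """Manage per request: intersect postings, never prune the store."""
        postings = [index.get(w.lower(), set()) for w in words]
        return set.intersection(*postings) if postings else set()

    add_document("email:42", "Notes on the widget spec from the meeting")
    add_document("irc:2004-04-01", "discussed the widget test cases")
    print(query(["widget", "meeting"]))   # -> {'email:42'}

Nothing clever happens at storage time beyond the word map; all the intelligence is deferred to the moment of the request.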

I remember the first time I ran a search across a database of IRC logs, archived email, personal documentation, and journal entries. It was like watching points leap out from the darkness. It was like having a perfect memory, like I used to think I had. Imagine such a search, including email sent and received, Google's web pages, Google Groups, Google News, auto-translated pages from elsewhere in the world, auto-translated emails from people in those countries. A true constellation of data, all ruthlessly ripped through to generate an itemized list, organized by relevance, and served with a timer printed at the top, gloating at its own speed.
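That search of mine was nothing fancy; a toy version of it looks something like this. The corpus entries are invented and the relevance score is a crude term count, but the gloating timer is faithfully included:

    # One flat corpus of heterogeneous records (IRC, mail, journal),
    # a crude term-frequency relevance score, and the timer at the top.
    import re, time

    corpus = {
        "irc/2004-03-12.log":  "long argument about the parser and test corpus",
        "mail/inbox/0187.txt": "re: parser spec, see attached test results",
        "journal/2004-03.txt": "spent the evening rewriting the parser notes",
    }

    def search(query):
        terms = re.findall(r"[a-z0-9]+", query.lower())
        start = time.perf_counter()
        scored = []
        for doc_id, text in corpus.items():
            words = re.findall(r"[a-z0-9]+", text.lower())
            score = sum(words.count(t) for t in terms)  # crude relevance
            if score:
                scored.append((score, doc_id))
        scored.sort(reverse=True)
        elapsed = time.perf_counter() - start
        print(f"{len(scored)} results in {elapsed:.6f} seconds")
        for score, doc_id in scored:
            print(f"  {score}  {doc_id}")

    search("parser test")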

For the special case of projects like this, I imagine some great cousin to my poor abandoned baby DocDesignr: logs of conversations, raw text of personal notes, data from tests and output from software, formal design documents, outlines of brainstorms, drawn pictures. All related not just by content, but by date/time entered, and by relationships associated at parse time, or added later, automatically or intentionally; related according to information obtained during the parsing of other documents; presented as a huge honking network of interlinks and keywords, browseable from a known starting point, or searchable. Representing the sum total of the data, but never presenting more than relates to each request.
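Here's a small sketch of the network structure I mean, with edges added automatically at parse time from shared keywords and shared dates; all the document ids and contents are invented for illustration:

    # Documents as nodes, edges added automatically at parse time.
    from collections import defaultdict

    docs = {
        "brainstorm/04-02": {"date": "2004-04-02", "words": {"cache", "eviction"}},
        "spec/cache-v1":    {"date": "2004-04-02", "words": {"cache", "api"}},
        "test/run-113":     {"date": "2004-04-09", "words": {"eviction", "latency"}},
    }

    edges = defaultdict(list)  # doc id -> [(other doc id, reason), ...]

    def autolink():
        ids = list(docs)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                shared = docs[a]["words"] & docs[b]["words"]
                if shared:
                    reason = "keywords: " + ", ".join(sorted(shared))
                    edges[a].append((b, reason)); edges[b].append((a, reason))
                if docs[a]["date"] == docs[b]["date"]:
                    edges[a].append((b, "same day")); edges[b].append((a, "same day"))

    autolink()
    # Browse outward from a known starting point:
    for other, reason in edges["spec/cache-v1"]:
        print(f"spec/cache-v1 -> {other}  ({reason})")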

Add a bidirectional interface, allowing explicit relationships to be drawn and text to be amended and appended; present the database as a network of arbitrary hierarchies and heterarchies; mix it into a distributed, ever-growing network resource, and you'd really have something.
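The bidirectional piece, sketched the same way: explicit, user-drawn relationships living alongside the automatic ones, plus append-only amendments so the small post-documentation discoveries get recorded without a full rewrite (again, all names hypothetical):

    # Explicit links drawn both ways, and append-only amendments.
    from collections import defaultdict

    notes = {"spec/cache-v1": ["Initial spec text."]}
    links = defaultdict(list)   # doc id -> [(other, relation, source)]

    def link(a, b, relation):
        """Draw an explicit edge in both directions."""
        links[a].append((b, relation, "explicit"))
        links[b].append((a, relation, "explicit"))

    def amend(doc_id, text):
        """Append, never overwrite: the old text stays searchable."""
        notes.setdefault(doc_id, []).append(text)

    link("spec/cache-v1", "test/run-113", "validated-by")
    amend("spec/cache-v1", "2004-04-17: eviction threshold raised after run 113.")
    print(links["spec/cache-v1"])
    print("\n".join(notes["spec/cache-v1"]))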

Ironically enough, the service I'm describing is quite possible. It may even be somewhat the direction Google intends to move. It's probably incredibly useful for nearly everything. It would likely make my job a hell of a lot easier to have such a thing to work with. But I am not the person to develop it. I don't know who is. It's easier than AI, it's more commercial than AI, and it would probably skyrocket its successful inventors to superstardom. Sadly, it is at best an enabler for my own research, and even if I had incredible free time for it, I am likely not enough of a databaser or software engineer to do it myself. Oh, but I can imagine it so well I can taste it.

Perhaps I should look at my surviving notes and documentation on DocDesignr after all. I could probably get by on an hour less sleep a night...
