If commentary or discussion about online content is going to get beyond ‘Like’ and ‘Hate,’ there needs to be a way to make comments or annotations fine-grained below the level of the document itself: at least to the paragraph or even to the sentence or other excerpt. However, this isn’t as easy as it might appear.
The problem is that references to paragraphs (for example) need to be stable even as the rendering of the paragraphs changes across devices or displays. Page and line numbers are the centuries-old solution to this problem, but they don’t work when readers (human and digital) regularly change page dimensions or type sizes. So what’s the solution?
The easiest solution is to simply attach some kind of unique identifier to paragraphs or other passages. This is what the Bible does, for instance, with chapter and verse numbers that are stable across editions and translations. This scheme is a pretty old one and now close to universal. The trick is to have production and reproduction processes which create and sustain these identifiers, and it helps (as with the Bible) to have the content itself be relatively stable without many edits. This is what we do with sBooks (www.sbooks.net) where we assign unique XML element IDs and have various ways to sustain these identifiers across edits and versions. Not rocket science, but not always easy either.
Unfortunately, we don’t always have the luxury of unique identifiers, so sometimes we need to be a little more clever. One approach is to use the structure of the document to define identifiers, so that a paragraph might be identified as “the third paragraph in section 5 of chapter 3”. This is the approach used by the most recent version of EPUB3, though it has gotten some flak. The problem is that it uses the document hierarchy for encoding identifiers and consequently assumes that the narrative hierarchy and flow of the document is consistently transformed into the XML containment structure. Change the XML structure, and all your references break. Whoops.
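To make the fragility concrete, here is a minimal sketch of a structural reference. It is not the EPUB3 scheme itself: the helpers `structural_path` and `resolve_path` are hypothetical names, and the "path" is just a list of 1-based child positions in a toy XML document.

```python
import xml.etree.ElementTree as ET


def structural_path(root, target):
    """Return the 1-based child positions leading from root down to target."""
    def walk(node, path):
        if node is target:
            return path
        for index, child in enumerate(node, start=1):
            found = walk(child, path + [index])
            if found is not None:
                return found
        return None
    return walk(root, [])


def resolve_path(root, path):
    """Follow a list of 1-based child positions down from root."""
    node = root
    for index in path:
        node = list(node)[index - 1]
    return node


original = ET.fromstring(
    "<book><chapter><p>Intro</p><p>Key claim</p></chapter></book>"
)
# Point at the second paragraph of the first chapter.
path = structural_path(original, original[0][1])

# A later edition inserts a new paragraph at the top of the chapter.
edited = ET.fromstring(
    "<book><chapter><p>New preface</p><p>Intro</p><p>Key claim</p></chapter></book>"
)

print(resolve_path(original, path).text)  # the paragraph we meant
print(resolve_path(edited, path).text)    # a different paragraph, silently
```

Note that the reference does not fail loudly after the edit; it quietly lands on the wrong paragraph, which is arguably worse.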
Another approach is to use the actual content of the paragraph to identify it semi-uniquely. In the simplest version of this, an annotation just quotes the entire paragraph. This works pretty well in general but fails for paragraphs (such as ‘This is left as an exercise to the reader’) which may occur repeatedly in a document. However, in these cases, we can provide some context or other information to help disambiguate the reference.
Unfortunately, the use of the text itself has a few other problems of its own:
- it’s wordy and takes up space;
- it can vary between renderings (think whitespace, line breaks, hyphenation) or edits;
- it exposes the content which may offend or frighten some rights-holders.
There are two things we can do to address these problems: normalize the content and hash the normalized version.
With sBooks and our open annotation standard (Knotes), we’ve been using a simple normalization scheme called word segment normalization (WSN) which basically breaks the content at whitespace, strips off non-embedded punctuation, and then glues the words back together into a string where words are separated by single spaces. This works pretty well because most rendering and minor edits affect punctuation and whitespace but not the words themselves. To be even more normal, we also decompose the content’s Unicode characters and then strip out all of the modifiers. And (in case you were wondering), if the content contains embedded markup, we strip out the markup tags (but not their embedded content) before applying the WSN processing.
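A rough sketch of the normalization steps just described might look like the following. This is not the actual sBooks code, and several details are assumptions: tags are stripped with a simple regular expression, "punctuation" means any non-alphanumeric character at the edge of a token, and case is left alone because the description above doesn’t mention folding it.

```python
import re
import unicodedata

# Assumption: markup is simple enough for a regex; real content may need a parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_edge_punctuation(token):
    """Drop non-alphanumeric characters from both ends, keeping embedded ones."""
    start, end = 0, len(token)
    while start < end and not token[start].isalnum():
        start += 1
    while end > start and not token[end - 1].isalnum():
        end -= 1
    return token[start:end]


def wsn(text):
    """Word segment normalization, as sketched from the prose description."""
    # 1. Remove markup tags but keep the content they wrap.
    text = TAG_RE.sub("", text)
    # 2. Decompose Unicode characters and drop the combining modifiers.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # 3. Break at whitespace and strip punctuation from the edges of each word.
    words = (strip_edge_punctuation(token) for token in text.split())
    # 4. Glue the surviving words back together with single spaces.
    return " ".join(word for word in words if word)


rendering_a = '  "It\'s   <em>na\u00efve</em>," she said -- really!\n'
rendering_b = "It's naive,\nshe said: really"
print(wsn(rendering_a) == wsn(rendering_b))
```

The two renderings differ in whitespace, line breaks, markup, accents, and punctuation, but they normalize to the same string.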
(It turns out that the WSN representation can also be helpful for identifying stable text ranges, but that’s the topic for another post.)
Once we have a normalized string, we can hash it with MD5 or SHA1 or however you like to hash your strings. Unless you have more disruptive edits (fixing typos, for instance) or have to deal with genuine duplicate content (‘exercise for the reader’), this works pretty well. It is fairly compact and hides the actual content from anyone who has the reference but hasn’t paid for the content.
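As a small illustration, hashing a normalized string takes only a few lines with Python’s standard `hashlib`. The function name `wsn_hash` is made up for this sketch, and the input is assumed to be already normalized.

```python
import hashlib


def wsn_hash(normalized, algorithm="md5"):
    """Hash an already-normalized string; algorithm can be 'md5', 'sha1', etc."""
    return hashlib.new(algorithm, normalized.encode("utf-8")).hexdigest()


original = "the quick brown fox jumps"
typo_fixed = "the quick brown fox jumped"

print(wsn_hash(original))                           # 32 hex characters
print(wsn_hash(original, "sha1"))                   # 40 hex characters
print(wsn_hash(original) == wsn_hash(typo_fixed))   # False: any word edit breaks the match
```

The last line shows the weakness mentioned above: the hash is all-or-nothing, so even a one-word fix produces a completely different reference.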
There is an interesting variation which hides the content while allowing smart fuzzy matching of references. The idea is to start with the WSN vector and then sort the words using some context-free criterion such as alphabetical order, word length, or frequency in some external corpus. It’s then possible to use a little vector algebra to match similar passages automatically, but the exposed word vectors are relatively useless for reconstructing the original paragraph.
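Here is one way the idea could be sketched. The choice of cosine similarity over word counts is our assumption for the "vector algebra" (other measures would work too), and the function names are hypothetical. Inputs are assumed to be already normalized.

```python
import math
from collections import Counter


def sorted_word_vector(normalized):
    """Sort words by length, then alphabetically, discarding the original order."""
    return sorted(normalized.split(), key=lambda word: (len(word), word))


def cosine_similarity(words_a, words_b):
    """Cosine similarity between two bags of words (1.0 means identical counts)."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


stored = sorted_word_vector("the quick brown fox jumps")
edited = sorted_word_vector("the quick brown fox leaps")
unrelated = sorted_word_vector("an entirely different sentence")

print(stored)
print(cosine_similarity(stored, edited))     # high: one word out of five changed
print(cosine_similarity(stored, unrelated))  # zero: no words in common
```

Because the comparison only looks at which words occur and how often, the sort order doesn’t affect matching; it only serves to scramble the passage for anyone holding the reference.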
The real message here is that there is no silver bullet (besides unique initial IDs) to point to paragraphs. This leads to the idea of using multiple identifiers for any given passage and expecting that applications will use and choose among those identifiers to do the right thing in a particular context.
For example, an sBooks annotation stores the unique paragraph identifier (if there is one), along with the WSN vector sorted by word length, and the vector’s MD5 hash (WSNMD5), both by itself and together with WSNMD5 hashes of other paragraphs in the document. Finally, if the document was converted from a scanned document, we also use the page+lineno for where the original paragraph started in the printed book. The philosophy is that the more pointers you have, the more likely you are to be able to connect an annotation with the content.
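A simplified sketch of this "many pointers" philosophy follows. It is not the sBooks data model: `PassageRef`, `make_ref`, and `resolve` are hypothetical names, the neighbor hashes and page+lineno pointers are left out for brevity, and the fuzzy step uses a plain set-overlap score with an arbitrary threshold.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


@dataclass
class PassageRef:
    element_id: Optional[str]   # unique paragraph ID, if there is one
    wsn_md5: str                # hash of the normalized text
    sorted_words: List[str]     # normalized words sorted by length


def make_ref(element_id: Optional[str], normalized: str) -> PassageRef:
    words = sorted(normalized.split(), key=lambda w: (len(w), w))
    return PassageRef(element_id, md5_hex(normalized), words)


def overlap(words_a: List[str], words_b: List[str]) -> float:
    """Jaccard overlap between two word lists (assumed scoring rule)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def resolve(ref: PassageRef, paragraphs: Dict[str, str],
            threshold: float = 0.6) -> Optional[str]:
    """Try each pointer in turn; paragraphs maps element ID -> normalized text."""
    # 1. The unique ID, if it still exists in this version.
    if ref.element_id in paragraphs:
        return ref.element_id
    # 2. An exact match on the content hash.
    for pid, text in paragraphs.items():
        if md5_hex(text) == ref.wsn_md5:
            return pid
    # 3. The best fuzzy match on the word vector, if it is good enough.
    best_id, best_score = None, 0.0
    for pid, text in paragraphs.items():
        score = overlap(ref.sorted_words, text.split())
        if score > best_score:
            best_id, best_score = pid, score
    return best_id if best_score >= threshold else None


ref = make_ref("p2", "the quick brown fox jumps")
renumbered = {"x1": "alpha beta gamma", "x2": "the quick brown fox jumps"}
edited = {"x1": "alpha beta gamma", "x2": "the quick brown fox leaps"}
print(resolve(ref, renumbered), resolve(ref, edited))
```

In both versions the ID `p2` is gone, yet the reference still finds its paragraph: first by hash, then (after the text edit) by fuzzy match.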
One of the systemic consequences of cheap computing and bandwidth is that the rendered forms of ideas have gotten more diverse in vacuous but annoyingly idiosyncratic ways. These methods are ways to reduce this problem so that the important stuff, conversations about ideas and real issues, can proceed. I hope it helps.