Archive for October, 2011

Introducing codex.js

October 20, 2011

I’ve implemented a lightweight Javascript library for breaking large HTML documents into a series of pages, obeying the stylesheet constraints (like page-break-before) provided by CSS. You can find the library, which is open-source, at codex.js. It’s technically part of the FDJT library, but it only depends on dom.js.

If you wonder why you might want something like this, take a look at my post on how The Page Isn’t Dead. For all sorts of reasons, it’s easier to read a large document when it’s intelligently broken into pages. Unfortunately, most browsers don’t do this themselves, and even most e-readers do it poorly.

The library works by constructing a series of div.codexpage blocks within a designated container. A stylesheet (codex.css) defines div.codexpage to be a fixed positioned element and users of the library can extend or modify this definition.

Using the library starts by creating an instance of the CodexLayout object, e.g.

  var layout=new CodexLayout();

You can set a lot of parameters, though it also tries to get default values in various ways. A more fleshed out call might look like this:

var layout=new CodexLayout({
  page_width: 500, page_height: 500, // dimensions
  // Where to add new pages
  container: document.getElementById("MYPAGES"),
  // Prefix for page element IDs, e.g. 
  //  page 42 would have id MYCODEXPAGE42
  pageprefix: "MYCODEXPAGE",
  logfn: console.log, // how to log notable events
  // Layout rules:
  //  Codex observes CSS declarations, but you can put
  //    additional constraints here:
  forcebreakbefore: "H1",
  forcebreakafter: "div.signature",
  avoidbreakinside: "div.code",
  avoidbreakbefore: "div.signature,div.attribution",
  avoidbreakafter: "h1,h2,h3,h4,h5,h6,h7",
  // There are some constraints which CSS doesn't include:
  codexfullpage: "div.titlepage",
  // Put this element on a page by itself, but don't 
  // interrupt the narrative flow
  codexfloatpage: "div.illustration"});

Once you’ve got a layout object, you just add DOM nodes to it by calling addContent(node). It actually moves the DOM nodes, but you can always call the layout’s revert() method to restore all the nodes to where they came from.

The implementation works by duplicating the document hierarchy on each page, splitting the contents of container nodes and duplicating their contents if needed. For example, suppose a poem wrapped in a div.poem block needs to be split across multiple pages. Each page will have its own div.poem block and the first one will have a codexdupstart CSS class, the last one will have the codexdupend class, and any intervening nodes will have a simple codexdup class. By default, these CSS classes try to override any top or bottom definitions (margins, borders, padding), but designers may want to customize these definitions.
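The class assignment just described can be sketched as a small function. This is only an illustration: dupClassNames is a hypothetical helper, not part of codex.js, and giving an unsplit block no extra class is my assumption.

```javascript
// Hypothetical helper (not part of codex.js): given how many pages a
// block was split across, return the extra CSS class for each fragment.
function dupClassNames(fragmentCount) {
  var names = [];
  for (var i = 0; i < fragmentCount; i++) {
    if (fragmentCount === 1) {
      names.push("");               // assumption: an unsplit block gets no extra class
    } else if (i === 0) {
      names.push("codexdupstart");  // first fragment
    } else if (i === fragmentCount - 1) {
      names.push("codexdupend");    // last fragment
    } else {
      names.push("codexdup");       // intervening fragments
    }
  }
  return names;
}

// A poem split across three pages:
console.log(dupClassNames(3)); // [ 'codexdupstart', 'codexdup', 'codexdupend' ]
```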

The library is agnostic about how pages are navigated. By default, div.codexpage has zero opacity and div.codexpage.curpage has an opacity of 1, making it easy to change pages by moving the curpage class around (it does a little fade-in/fade-out if supported by the browser). But users of the library can override these definitions to add (for example) sliding page transitions or other effects.
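Moving the curpage class around might look something like the sketch below. The function name showPage is hypothetical and is not something codex.js provides; it only assumes that each page has a className string, which is true of DOM elements.

```javascript
// Hypothetical helper (not part of codex.js): make pages[index] the
// current page by moving the "curpage" class onto it.
function showPage(pages, index) {
  for (var i = 0; i < pages.length; i++) {
    // Drop any existing "curpage" class, keeping the other classes.
    var classes = pages[i].className.split(/\s+/).filter(function (c) {
      return c.length > 0 && c !== "curpage";
    });
    if (i === index) classes.push("curpage");
    pages[i].className = classes.join(" ");
  }
}
```

In a browser, the pages argument could be the result of document.querySelectorAll("div.codexpage"), and the stylesheet's opacity rules would then take care of the visible change.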

There is still work to be done, but it seems to be working pretty well and it’s not too slow. It uses the browser’s underlying geometry and styling engines, which limits how fast it can really go.

Enjoy. I’ll update here if there are minor changes and post news about more significant changes.

The Page Isn’t Dead

October 20, 2011

I once thought that the rise of electronic reading would mean the death of the “page” as part of the reading experience. I fully expected that the chop-chop-chop of paged media would be replaced with the smooth flow of endless scrolling. It seemed so much simpler and more elegant.

I was wrong, but the reasons are interesting. The chop-chop of pages is here to stay, but the way documents are designed is due for transformation.

What we think of as paged layout was introduced in the first century. It’s called a ‘Codex’ and it succeeded the scrolled media where a very long piece of written material was rolled up and then selectively unrolled (typically between two spools) to expose a desired passage or fragment. The technology of the codex, which had significant advantages by itself, also attached itself to a quite prolific meme: early Christian texts were nearly all published as codices.

The codex solved two problems. First, it presents a reduced aspect of a long document which fits the visual field so that the text is readable. Second, it simplifies random access to the text, so that it’s very easy to go to a particular part of the text. (It would be interesting to speculate on the impact of random access on how people thought and read, but that’s not my current point.)

Advance a couple of thousand years, and technology has made the random access problem go away: you can jump within a scrolled document in less time than it takes to say “Heraclitus.” However, the first problem, the reality of the human visual field and acuity, is still with us. Clark Kent might be able to read an entire book fit onto a single exposed surface (either a rampart-sized display or something with very tiny letters), but we can’t. The scrolling computer display works by presenting us with a slice of the text and letting us move that slice within the document (just like the old unrolling scrolls which predated the codex).

Unfortunately, it’s the way we move that slice which makes things complicated. While it works fine for short documents (e-mails, short news articles, advertising), it doesn’t really scale to longer documents. To see this, think about what your hand and eye are doing when you scroll forward to the next chunk of reading in a long document.

Your eye is on a line somewhere near the bottom of the screen and you’re caught up in the argument, or the thrill, or the emotional tension, but you’re near the bottom, so you have to scroll up. You then use some physical device — mouse in your hand, finger on a trackpad, thumb on a trackball — to move the content within the display.

Your eyes follow the line you’re reading and, when it nears or passes the top, you abruptly stop whatever physical motion is driving the display change. To pick a metaphor from physical sports, the line is the ball, the top of the display is the basket, and you’re trying to score. Every time you turn the page. Now what were you reading about?

All of the human pieces of this little game are organic and subject to fatigue with time and repetition, which is why scrolling is less objectionable for shorter texts than longer documents or books.

Traditional scrolling of longer texts is both tiring and disruptive of narrative flow. What’s amazing is that brilliant writing and compelling stories can survive these interruptions so easily. We are a lucky race of media consumers. But what’s tragic is that it’s not necessary.

Just press PageDown! Many of my more astute readers have probably just said “why don’t you simply push page down (or space) and move your eyes to the top!”

That is a huge improvement and the remark is the perfect segue to my second point. One button/key scrolling is what I usually do when I need to scroll, but many people don’t even think of it. Even when we manage to remember the possibility, it reveals the obvious fact that scrolled information is already broken into pages by the size of the display, but it’s just broken into pages BADLY.

In the worst case, a line of text is split across the bottom of the display or window, so that we see only the head or feet of a line at the bottom or top of the scrolled page. It is a testament to the miracle of visual perception that we can often read a line that’s lost its head or feet, but it slows us up even when it doesn’t stop us in our virtual tracks. Better software always scrolls by whole lines (when it can identify them), which is better, but the problem runs deeper.

Texts aren’t just series of lines. They include blocks and heads and hierarchies, and the whole purpose of their design is to guide our attention and understanding. When a display arbitrarily imposes stops and breaks, it doesn’t help the flow, the attention, or the understanding. Think of it as cogitus interruptus. Not fun.

Almost any printed book has had attention paid to where the page breaks fall, doing so in such a way as to minimize the disruption to the reader’s experience. Don Knuth, famous computer scientist and digital typesetter extraordinaire, used to rewrite sentences in his books to avoid awkward line and page breaks. Unfortunately, we can’t make those adjustments by hand in this modern web world where anyone can adjust their page dimensions or font size long after the author or publisher has left the building.

To its credit, the web has a partial solution, in that CSS has properties, like page-break-before, that let authors or designers provide constraints which layout engines can try to honor. Though intended originally for printed layout, they’re equally helpful (when honored) for codex-style electronic display.

Unfortunately, most eBooks ignore this information, leading to cases where headings appear at the bottom of pages and thought-sized chunks of text are split across boundaries. It’s distracting, confusing, ugly, and unnecessary.

The page isn’t about to die. In fact, it’s becoming more alive, as its dimensions and attributes change across devices and purposes. The page is fundamentally about providing an eye-sized window into a text; what’s changing is that the character of that window is now a fluid and lively aspect rather than a fixed and frozen window. This is an extraordinary opportunity, where the reader’s experience has more dimensions and publishers will be able to add more kinds of value than they had in the past. The page isn’t dead. It’s about to be reborn.

WSN: Why Be Normal?

October 20, 2011

WSN, or Word Segment Normalization, is a simple way to get canonical representations of paragraphs. It can be used for identifying passages within a document for purposes of annotation or reference or as a part of other algorithms which might (for example) be trying to align passages in different editions of the same document.

A simple open source (LGPL) Javascript implementation can be found on github at wsn.js. The underlying idea is to normalize strings for case, whitespace, and modifier characters (in Unicode) and to use this as the basis for further normalization.

The basic WSN form splits an input string at whitespace+punctuation runs but leaves embedded punctuation where it is, so that a string like:

the 'look-alike' scores were 3.5, 2.7, and 1.2

becomes the canonical string:

the look-alike scores were 3.5 2.7 and 1.2
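For the curious, here is a rough sketch of that basic form. It is not the code in wsn.js: the function name wsnSketch is made up for illustration, and splitting at whitespace and then trimming punctuation from the edges of each word is my approximation of "whitespace+punctuation runs".

```javascript
// A simplified sketch of basic WSN (not the actual wsn.js code).
function wsnSketch(text) {
  return text
    .normalize("NFD")                 // decompose accented characters
    .replace(/[\u0300-\u036f]/g, "")  // strip combining marks (modifiers)
    .toLowerCase()                    // normalize case
    .split(/\s+/)                     // break at whitespace
    .map(function (word) {
      // Trim punctuation at the edges; embedded punctuation stays put.
      return word.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, "");
    })
    .filter(function (word) { return word.length > 0; })
    .join(" ");                       // glue back with single spaces
}

console.log(wsnSketch("The 'look-alike' scores:  3.5, 2.7, and 1.2."));
// → the look-alike scores 3.5 2.7 and 1.2
```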

Once a WSN string has been generated, it can be used as the basis for hashing to get a more compact unique representation. So the MD5 hash of the above string, represented as hex, would be:

8cfc45dcea529b6f1177ed71e58c463b

This hash value can then be used in generating unique IDs of the kind I discuss in How To Point to a Paragraph.
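Hashing a normalized string could be sketched as follows. This assumes Node.js and its built-in crypto module rather than the browser-side fdjtHash object, and both function names are hypothetical; the normalizer here is a deliberately terse stand-in for the real WSN code.

```javascript
var crypto = require("crypto"); // Node.js built-in; an assumption, not what wsn.js uses

// Terse stand-in for WSN: lowercase, split at whitespace, trim edge punctuation.
function normalizeSketch(text) {
  return text.toLowerCase().split(/\s+/)
    .map(function (w) { return w.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, ""); })
    .filter(function (w) { return w.length > 0; })
    .join(" ");
}

// Hex-encoded MD5 of the normalized text.
function wsnMd5Sketch(text) {
  return crypto.createHash("md5")
    .update(normalizeSketch(text), "utf8")
    .digest("hex");
}

// Two renderings of the same passage should produce the same ID.
console.log(wsnMd5Sketch("Call me Ishmael.") === wsnMd5Sketch("  call  me\nIshmael "));
```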

Fuddled normalizations

In addition, we can generate a fuddled representation by taking the words in the WSN string and modifying and/or sorting them according to some function. In most cases, we’ll also want to remove duplicates after sorting.

Why fuddle a WSN string? First, fuddling may further normalize the string in useful ways. For example, simple rewordings would map to the same fuddled string. Also, if we were to sort on something like word frequency (rarer words sooner) or a proxy to frequency (like word length), the fuddled string could provide a good representation for finding identical passages across editions or renderings.

Another reason to fuddle strings is to obscure the content itself. Suppose that we want to refer to a passage in a copyrighted text but don’t want to expose the original text in our reference. We can use a fuddled WSN string to refer to a passage precisely (or even approximately), but the fuddled string itself would be nearly useless for actual consumption (unless it were very short).

Fuddling can extend beyond sorting to actually remove words (like stop words) or transform them (for example by stemming or soundex coding).
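A minimal sketch of fuddling, under some assumptions of mine: words are sorted longest first (using length as the proxy for rarity), ties are broken by plain string comparison, duplicates are dropped, and an optional stop word list is removed first. The name fuddleSketch is hypothetical and this is not the wsn.js implementation.

```javascript
// Hypothetical sketch of a fuddled WSN string (not the wsn.js code).
function fuddleSketch(wsnString, stopWords) {
  var stop = stopWords || [];
  var words = wsnString.split(" ").filter(function (w) {
    return w.length > 0 && stop.indexOf(w) < 0; // drop stop words
  });
  words.sort(function (a, b) {
    if (a.length !== b.length) return b.length - a.length; // assumption: longer words first
    return a < b ? -1 : a > b ? 1 : 0;                     // then by code-unit order
  });
  // After sorting, duplicates are adjacent, so keep only the first of each run.
  return words.filter(function (w, i) {
    return i === 0 || w !== words[i - 1];
  }).join(" ");
}

console.log(fuddleSketch("the cat and the hat")); // → and cat hat the
```

Note that word order no longer matters, so "cat and hat" and "hat and cat" fuddle to the same string.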

The Javascript library

The library is released under the LGPL (and some other licenses) and can be downloaded at wsn.js. While it is part of the FDJT Javascript library, it doesn’t depend on any other components, except that it can use the fdjtHash object to generate hash values if desired.

The function WSN(arg) takes a string or a DOM node and generates the WSN normalization of its textual content. The function WSN.Hash(arg,hashfn) applies hashfn to the WSN normalization and the functions WSN.md5ID and WSN.sha1ID generate hashed values for the content using the specified algorithms (represented as hexadecimal strings). These functions require the fdjtHash utility or you can just set WSN.md5 or WSN.sha1 to use your own implementations.

WSN takes extra arguments, particularly WSN(arg,sortfn,wordfn,keepdups):

sortfn is the function used to produce a fuddled representation; if sortfn is true rather than a callable function, it sorts first by length and then lexicographically. This yields strings which generally differ sooner than other sorting methods.

wordfn, if not false, is applied to every word in the WSN representation. If it returns false, that word is deleted; otherwise, the word is replaced by the result of wordfn. This can be used to apply (for example) stemmers or other normalizers to generate even more normal forms. Special note: if the wordfn ends up yielding no words at all, it is ignored.
wordfn special cases: if wordfn is a number, it excludes all words shorter than the value; if wordfn is a non-callable table/dictionary/hash/object, it is used to map words to alternate words or to the empty string to indicate the word should be deleted.

keepdups indicates that duplicate words should be kept in the fuddled WSN representation. The removal of duplicate words happens after the wordfn is applied.
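To make the wordfn cases concrete, here is a sketch of how they could be handled. It is not lifted from wsn.js: applyWordFnSketch is a hypothetical name, and leaving words unchanged when a table has no entry for them is an assumption on my part.

```javascript
// Hypothetical sketch of the wordfn rules (not the wsn.js code).
function applyWordFnSketch(words, wordfn) {
  if (!wordfn) return words.slice();
  var fn;
  if (typeof wordfn === "number") {
    // A number excludes all words shorter than that length.
    fn = function (w) { return w.length >= wordfn ? w : false; };
  } else if (typeof wordfn === "function") {
    fn = wordfn;
  } else {
    // A table maps words to replacements; "" means delete the word.
    fn = function (w) {
      if (!Object.prototype.hasOwnProperty.call(wordfn, w)) return w; // assumption: unlisted words are kept
      return wordfn[w] === "" ? false : wordfn[w];
    };
  }
  var result = [];
  words.forEach(function (w) {
    var mapped = fn(w);
    if (mapped !== false) result.push(mapped);
  });
  // If wordfn removed every word, ignore it and keep the original words.
  return result.length > 0 ? result : words.slice();
}

console.log(applyWordFnSketch(["the", "quick", "fox"], 4)); // [ 'quick' ]
```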

All of the hash generating WSN functions take the same extra arguments as WSN. The default values of these arguments can be set by setting the corresponding properties of WSN, e.g. WSN.sortfn.

If the argument to WSN is false, it assumes that it is being used as a constructor and creates an object whose attributes specify the other parameters. This object’s normalize method then uses those defaults. This object also supports the Hash method, which computes a hash value based on the object’s .hashfn attribute (which defaults to MD5 if it’s available).

DOM walking

The library is mildly clever as it walks the DOM when it generates the input text for the WSN algorithm itself. In general, the textual form is simply a concatenation of any underlying TEXT nodes, with two caveats, based on each node’s CSS style information:

  1. any nodes whose position is not static are simply ignored;
  2. any nodes whose display attribute is not inline have newlines inserted before their content, though the newlines will be normalized away by the WSN algorithm.
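The two caveats above can be sketched as a recursive walk. This is an illustration rather than the library's code: collectTextSketch is a hypothetical name, and the style lookup is passed in as a function so the sketch can run outside a browser.

```javascript
// Hypothetical sketch of the DOM walk (not the wsn.js code).
// getStyle(node) should return an object with position and display
// properties, e.g. the result of window.getComputedStyle(node).
function collectTextSketch(node, getStyle) {
  if (node.nodeType === 3) return node.nodeValue; // TEXT node
  if (node.nodeType !== 1) return "";             // skip comments etc.
  var style = getStyle(node);
  if (style.position !== "static") return "";     // caveat 1: ignore positioned nodes
  var text = "";
  for (var i = 0; i < node.childNodes.length; i++) {
    text += collectTextSketch(node.childNodes[i], getStyle);
  }
  // Caveat 2: non-inline nodes get a newline before their content.
  return style.display === "inline" ? text : "\n" + text;
}
```

In a browser, one might call collectTextSketch(element, function (n) { return window.getComputedStyle(n); }) and hand the result to the normalizer.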

As a final feature, WSN provides a Map method which takes a set of nodes (or strings) and generates a table from their WSN normalizations to the nodes (or strings) themselves. This can work with an array or a node list, so (if you’re using something like jQuery), you can just say:

WSN.Map($("P"),WSN.md5)

to get a table mapping WSN MD5 ids to paragraphs. The second argument specifies a hash function to apply to the WSN normalizations, and the normal WSN parameters (sortfn, wordfn, etc) can follow the hash function.
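The general shape of such a table builder is easy to sketch. The function below is hypothetical (it is not WSN.Map itself); it takes any key function, works on arrays or array-like node lists, and assumes that a later item simply overwrites an earlier one with the same key.

```javascript
// Hypothetical sketch of building a key → item table (not WSN.Map itself).
function buildMapSketch(items, keyfn) {
  var table = {};
  // An index loop works for arrays and for array-likes such as NodeLists.
  for (var i = 0; i < items.length; i++) {
    table[keyfn(items[i])] = items[i]; // assumption: later duplicates win
  }
  return table;
}

var table = buildMapSketch(["Hello  World", "Goodbye"], function (s) {
  return s.toLowerCase().split(/\s+/).join(" ");
});
console.log(table["hello world"]); // → Hello  World
```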

How To Point To A Paragraph

October 20, 2011

If commentary or discussion about online content is going to get beyond ‘Like’ and ‘Hate,’ there needs to be a way to make comments or annotations fine-grained below the level of the document itself: at least to the paragraph or even to the sentence or other excerpt. However, this isn’t as easy as it might appear.

The problem is that references to paragraphs (for example) need to be stable even as the rendering of the paragraphs changes across devices or displays. Page and line numbers are the centuries-old solution to this problem, but they don’t work when readers (human and digital) regularly change page dimensions or type sizes. So what’s the solution?

The easiest solution is to simply attach some kind of unique identifier to paragraphs or other passages. This is what the Bible does, for instance, with chapter and verse numbers that are stable across editions and translations. This scheme is a pretty old one and now close to universal. The trick is to have production and reproduction processes which create and sustain these identifiers and it helps (as with the Bible) to have the content itself be relatively stable without many edits. This is what we do with sBooks (www.sbooks.net) where we assign unique XML element IDs and have various ways to sustain these identifiers across edits and versions. Not rocket science, but not always easy either.

Unfortunately, we don’t always have the luxury of unique identifiers, so sometimes we need to be a little more clever. One approach is to use the structure of the document to define identifiers, so that a paragraph might be identified as “the third paragraph in section 5 of chapter 3”. This is the approach used by the most recent version of EPUB3, though it has gotten some flak. The problem is that it uses the document hierarchy for encoding identifiers and consequently assumes that the narrative hierarchy and flow of the document is consistently transformed into the XML containment structure. Change the XML structure, and all your references break. Whoops.

Another approach is to use the actual content of the paragraph to identify it semi-uniquely. In the simplest version of this, an annotation just quotes the entire paragraph. This works pretty well in general but fails for paragraphs (such as ‘This is left as an exercise to the reader’) which may occur repeatedly in a document. However, in these cases, we can provide some context or other information to help disambiguate the reference.

Unfortunately, the use of the text itself has a few other problems of its own:

  • it’s wordy and takes up space;
  • it can vary between renderings (think whitespace, line breaks, hyphenation) or edits;
  • it exposes the content which may offend or frighten some rights-holders.

There are two things we can do to address these problems: normalize the content and hash the normalized version.

With sBooks and our open annotation standard (Knotes), we’ve been using a simple normalization scheme called word segment normalization (WSN) which basically breaks the content at whitespace, strips off non-embedded punctuation, and then glues the words back together into a string where words are separated by single spaces. This works pretty well because most rendering and minor edits affect punctuation and whitespace but not the words themselves. To be even more normal, we also decompose the content’s Unicode characters and then strip out all of the modifiers. And (in case you were wondering), if the content contains embedded markup, we strip out the markup tags (but not their embedded content) before applying the WSN processing.

(It turns out that the WSN representation can also be helpful for identifying stable text ranges, but that’s the topic for another post.)

Once we have a normalized string, we can hash it with MD5 or SHA1 or however you like to hash your strings. Unless you have more disruptive edits (fixing typos, for instance) or have to deal with genuine duplicate content (‘exercise for the reader’), this works pretty well. It is fairly compact and hides the actual content from anyone who has the reference but hasn’t paid for the content.

There is an interesting variation which hides the content while allowing smart fuzzy matching of references. The idea is to start with the WSN vector and then sort the words using some context-free criteria like alphabetical order, word length, or frequency in some external corpus. It’s then possible to use a little vector algebra to match similar passages automatically, but the exposed word vectors are relatively useless for reconstructing the original paragraph.
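The post doesn't pin down which vector algebra to use, so as one possibility, here is a sketch that treats each word vector as a bag of words and compares two of them by cosine similarity. The function names are hypothetical and the choice of cosine similarity is mine.

```javascript
// Count how often each word occurs in a space-separated word vector.
function wordCountsSketch(words) {
  var counts = Object.create(null); // no inherited keys like "constructor"
  words.split(" ").forEach(function (w) {
    if (w) counts[w] = (counts[w] || 0) + 1;
  });
  return counts;
}

// Cosine similarity between two word vectors: 1 for the same bag of
// words, 0 when no words are shared. (An assumed measure, one of many.)
function cosineSketch(a, b) {
  var ca = wordCountsSketch(a), cb = wordCountsSketch(b);
  var dot = 0, na = 0, nb = 0, w;
  for (w in ca) {
    na += ca[w] * ca[w];
    if (w in cb) dot += ca[w] * cb[w];
  }
  for (w in cb) nb += cb[w] * cb[w];
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

console.log(cosineSketch("and cat hat the", "cat hat sat the"));
```

Because the measure ignores word order, it works the same on sorted vectors as on the original WSN strings, which is what lets the sorted form stand in for the text.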

The real message here is that there is no silver bullet (besides unique initial IDs) to point to paragraphs. This leads to the idea of using multiple identifiers for any given passage and expecting that applications will use and choose among those identifiers to do the right thing in a particular context.

For example, an sBooks annotation stores the unique paragraph identifier (if there is one), along with the WSN vector sorted by word length, and the vector’s MD5 hash (WSNMD5), both by itself and together with WSNMD5 hashes of other paragraphs in the document. Finally, if the document was converted from a scanned document, we also use the page+lineno for where the original paragraph started in the printed book. The philosophy is that the more pointers you have, the more likely you are to be able to connect an annotation with the content.
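Choosing among several pointers might be sketched as a simple fallback chain. The field names below (id, wsnmd5) and the function name are hypothetical, not the actual sBooks or Knotes format; the sketch tries the unique ID first and then the content hash, and gives up when the hash is ambiguous.

```javascript
// Hypothetical sketch: resolve an annotation to a paragraph by trying
// the strongest pointer first. Field names are illustrative only.
function resolveAnnotationSketch(annotation, paragraphs) {
  // 1. A unique ID, if the annotation has one and it still exists.
  if (annotation.id) {
    for (var i = 0; i < paragraphs.length; i++) {
      if (paragraphs[i].id === annotation.id) return paragraphs[i];
    }
  }
  // 2. The content hash, but only if it picks out exactly one paragraph.
  if (annotation.wsnmd5) {
    var matches = paragraphs.filter(function (p) {
      return p.wsnmd5 === annotation.wsnmd5;
    });
    if (matches.length === 1) return matches[0];
  }
  // 3. Nothing conclusive; a real application would try further pointers
  //    (word vectors, neighboring hashes, page and line numbers).
  return null;
}
```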

One of the systemic consequences of cheap computing and bandwidth is that the rendered forms of ideas have gotten more diverse in vacuous but annoyingly idiosyncratic ways. These methods are ways to reduce this problem so that the important stuff, conversations about ideas and real issues, can proceed. I hope it helps.

