WSN: Why Be Normal?

WSN, or Word Segment Normalization, is a simple way to get canonical representations of paragraphs. It can be used for identifying passages within a document for purposes of annotation or reference or as a part of other algorithms which might (for example) be trying to align passages in different editions of the same document.

A simple open source (LGPL) JavaScript implementation can be found on GitHub at wsn.js. The underlying idea is to normalize strings for case, whitespace, and modifier characters (in Unicode) and to use this as the basis for further normalization.

The basic WSN form splits an input string at runs of whitespace and punctuation but leaves embedded punctuation where it is, so that a string like:

the 'look-alike' scores were 3.5, 2.7, and 1.2

becomes the canonical string:

the look-alike scores were 3.5 2.7 and 1.2

Once a WSN string has been generated, it can be used as the basis for hashing to get a more compact, effectively unique representation. The MD5 hash of the above string, represented as hex, would be:

8cfc45dcea529b6f1177ed71e58c463b

This hash value can then be used in generating unique IDs of the kind I discuss in How To Point to a Paragraph.
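
To make the steps above concrete, here is a minimal sketch of the basic normalization followed by an MD5 ID. It is not the code in wsn.js: the names wsnSketch and wsnMd5Sketch are made up for this illustration, "modifier characters" is assumed to mean Unicode combining marks, splitting at whitespace and trimming edge punctuation is only an approximation of the rule described above, and the hashing assumes Node's built-in crypto module rather than fdjtHash.

```javascript
// Illustrative sketch only; not the wsn.js implementation.
const crypto = require("crypto"); // assumption: running under Node.js

// Normalize case, combining marks, whitespace, and edge punctuation.
function wsnSketch(text) {
  return text
    .normalize("NFD")          // split letters from their combining marks
    .replace(/\p{M}+/gu, "")   // drop the marks (assumed meaning of "modifier characters")
    .toLowerCase()
    .split(/\s+/)              // break at whitespace runs
    // trim punctuation at the edges of each word, keep embedded punctuation
    .map((word) => word.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, ""))
    .filter((word) => word.length > 0)
    .join(" ");
}

// Hex MD5 of the normalized string.
function wsnMd5Sketch(text) {
  return crypto.createHash("md5").update(wsnSketch(text), "utf8").digest("hex");
}

console.log(wsnSketch("The  'Look-Alike' scores: 3.5, 2.7, and 1.2!"));
// prints: the look-alike scores 3.5 2.7 and 1.2
```

Two inputs that differ only in case, spacing, or surrounding punctuation normalize to the same string, so wsnMd5Sketch gives them the same ID.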

Fuddled normalizations

In addition, we can generate a fuddled representation by taking the words in the WSN string and modifying and/or sorting them according to some function. In most cases, we’ll also want to remove duplicates after sorting.

Why fuddle a WSN string? First, fuddling may further normalize the string in useful ways. For example, simple rewordings that only reorder the same words would map to the same fuddled string. Also, if we were to sort on something like word frequency (rarer words first) or a proxy for frequency (like word length), the fuddled string could provide a good representation for finding identical passages across editions or renderings.

Another reason to fuddle strings is to obscure the content itself. Suppose that we want to refer to a passage in a copyrighted text but don’t want to expose the original text in our reference. We can use a fuddled WSN string to refer to the passage precisely (or even approximately), but the fuddled string itself would be nearly useless for actual consumption (unless it were very short).

Fuddling can extend beyond sorting to actually remove words (like stop words) or transform them (for example, by stemming or Soundex coding).
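
As a rough illustration of the ideas in this section, the sketch below fuddles an already-normalized string by dropping stop words, removing duplicates, and sorting longer words first (using length as the frequency proxy), with ties broken alphabetically. The function name, the stop-word set, and the longest-first direction are assumptions made for the example, not a description of wsn.js.

```javascript
// Illustrative sketch only; fuddleSketch is a hypothetical name.
function fuddleSketch(wsnString, stopWords = new Set()) {
  // Drop empty pieces and stop words, then remove duplicates.
  const words = wsnString.split(" ").filter((w) => w && !stopWords.has(w));
  const unique = [...new Set(words)];
  // Longest first (assumed proxy for rarity), then alphabetical.
  unique.sort((a, b) => b.length - a.length || (a < b ? -1 : a > b ? 1 : 0));
  return unique.join(" ");
}

console.log(fuddleSketch("the dog chased the cat"));
// prints: chased cat dog the
console.log(fuddleSketch("the cat chased the dog", new Set(["the"])));
// prints: chased cat dog
```

Both sentences contain the same words, so they fuddle to the same string (apart from the stop word removed in the second call), and the result no longer says who chased whom.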

The JavaScript library

The library is released under the LGPL (and some other licenses) and can be downloaded at wsn.js. While it is part of the FDJT JavaScript library, it doesn’t depend on any other components, except that it can use the fdjtHash object to generate hash values if desired.

The function WSN(arg) takes a string or a DOM node and generates the WSN normalization of its textual content. The function WSN.Hash(arg,hashfn) applies hashfn to the WSN normalization, and the functions WSN.md5ID and WSN.sha1ID generate hashed values for the content using the specified algorithms (represented as hexadecimal strings). These functions require the fdjtHash utility; alternatively, you can set WSN.md5 or WSN.sha1 to your own implementations.

WSN takes extra arguments, particularly WSN(arg,sortfn,wordfn,keepdups):

sortfn is the function used to produce a fuddled representation; if sortfn is true rather than a callable function, it sorts first by length and then lexicographically. This yields strings which generally differ sooner than those produced by other sorting methods.

wordfn, if not false, is applied to every word in the WSN representation. If it returns false, that word is deleted; otherwise, the word is replaced by the result of wordfn. This lets you apply (for example) stemmers or other normalizers to generate even more normal forms. Special note: if the wordfn ends up yielding no words at all, it is ignored.
wordfn special cases: if wordfn is a number, it excludes all words shorter than that value; if wordfn is a non-callable table/dictionary/hash/object, it is used to map words to alternate words, or to the empty string to indicate that the word should be deleted.

keepdups indicates that duplicate words should be kept in the fuddled WSN representation. The removal of duplicate words happens after the wordfn is applied.
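
The wordfn and keepdups rules are easier to follow with a small example. The sketch below is a guess at the behavior described above, written from the description rather than taken from wsn.js: applyWordfnSketch is a hypothetical name, and leaving words unchanged when a table has no entry for them is an assumption.

```javascript
// Illustrative sketch only; applyWordfnSketch is a hypothetical name.
function applyWordfnSketch(words, wordfn, keepdups = false) {
  let fn = null;
  if (typeof wordfn === "function") {
    fn = wordfn;
  } else if (typeof wordfn === "number") {
    // A number excludes all words shorter than that value.
    fn = (w) => (w.length < wordfn ? false : w);
  } else if (wordfn && typeof wordfn === "object") {
    // A table maps words to replacements; the empty string means "delete".
    fn = (w) => {
      if (!Object.prototype.hasOwnProperty.call(wordfn, w)) return w; // assumption: unmapped words are kept
      return wordfn[w] === "" ? false : wordfn[w];
    };
  }
  let result = words;
  if (fn) {
    const mapped = words.map(fn).filter((w) => w !== false);
    if (mapped.length > 0) result = mapped; // a wordfn that yields no words is ignored
  }
  // Duplicates are removed after wordfn has been applied, unless keepdups is set.
  return keepdups ? result.slice() : [...new Set(result)];
}

const sample = ["the", "quick", "foxes", "the"];
console.log(applyWordfnSketch(sample, 4).join(" "));
// prints: quick foxes
console.log(applyWordfnSketch(sample, { the: "", foxes: "fox" }).join(" "));
// prints: quick fox
```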

All of the hash-generating WSN functions take the same extra arguments as WSN. The default values of these arguments can be set by setting the corresponding properties of WSN, e.g. WSN.sortfn.

If the argument to WSN is false, it assumes that it is being used as a constructor and creates an object whose attributes specify the other parameters. This object’s normalize method then uses those defaults. This object also supports the Hash method, which computes a hash value based on the object’s .hashfn attribute (which defaults to MD5 if it’s available).

DOM walking

The library is mildly clever as it walks the DOM when it generates the input text for the WSN algorithm itself. In general, the textual form is simply a concatenation of any underlying TEXT nodes, with two caveats, based on each node’s CSS style information:

  1. any nodes whose position is not static are simply ignored;
  2. any nodes whose display attribute is not inline have newlines inserted before their content, though the newlines will be normalized away by the WSN algorithm.
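
The two caveats above can be sketched as a small recursive walk. This is an illustration under stated assumptions, not the library's code: gatherTextSketch is a hypothetical name, and the style lookup is passed in as a styleOf function so the same code can run against real DOM nodes in a browser or against plain objects elsewhere.

```javascript
// Illustrative sketch only; gatherTextSketch and styleOf are hypothetical names.
function gatherTextSketch(node, styleOf, out = []) {
  if (node.nodeType === 3) {        // TEXT node: take its text as-is
    out.push(node.nodeValue);
    return out;
  }
  if (node.nodeType !== 1) return out;  // skip comments and other non-element nodes
  const style = styleOf(node);
  if (style.position !== "static") return out;     // caveat 1: ignore positioned nodes
  if (style.display !== "inline") out.push("\n");  // caveat 2: newline before non-inline content
  for (const child of node.childNodes) gatherTextSketch(child, styleOf, out);
  return out;
}

// In a browser, something like this should work:
//   gatherTextSketch(document.body, (el) => getComputedStyle(el)).join("")
```

The inserted newlines only keep words in adjacent blocks from running together; they disappear again once the text is normalized.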

As a final feature, WSN provides a Map method which takes a set of nodes (or strings) and generates a table from their WSN normalizations to the nodes (or strings) themselves. This can work with an array or a node list, so (if you’re using something like jQuery), you can just say:

WSN.Map($("P"),WSN.md5)

to get a table mapping WSN MD5 ids to paragraphs. The second argument specifies a hash function to apply to the WSN normalizations, and the normal WSN parameters (sortfn, wordfn, etc) can follow the hash function.
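
For readers who want to see the shape of such a table without loading the library, here is a small stand-alone sketch. The names wsnMapSketch, keyOf, and simpleKey are hypothetical, the normalizer is a deliberately simplified stand-in (lowercase and collapsed whitespace only), and using textContent for nodes is an assumption; the real Map method may differ in all of these details.

```javascript
// Illustrative sketch only; not the WSN.Map implementation.
function wsnMapSketch(items, keyOf) {
  const table = new Map();
  // Array.from accepts arrays, node lists, and other array-like collections.
  for (const item of Array.from(items)) {
    // Strings are used directly; nodes contribute their textContent (assumption).
    const text = typeof item === "string" ? item : item.textContent;
    table.set(keyOf(text), item); // a later item with the same key replaces an earlier one
  }
  return table;
}

// Stand-in normalizer for the demo; a hash function could be composed on top of it.
const simpleKey = (text) => text.toLowerCase().trim().split(/\s+/).join(" ");

const table = wsnMapSketch(["Call me   Ishmael", "It was the best of times"], simpleKey);
console.log(table.get("call me ishmael"));
// prints: Call me   Ishmael
```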
