I QUIT

You should be more of a fan of Mr. BEN HAMMERSLEY, variously an author, exhibited photographer, RSS authority, war correspondent, and second-in-command of Wired U.K. What will he do next? Hammersley has learned the secret of all accomplished people: Leave a good job early and do something different. I view him as a kind of Renaissance man and have told him so. I was shocked when he wrote back and said, in effect, “No, I’m not. You are.” Such flattery, while bullshit, was devastating.

I can recommend Hammersley’s ongoing series on the complexity of converting legacy formats, like books and magazines, to digital formats. The term “format” has different meanings in that sentence. I think about the issue a lot. When doing so, mostly I rail against managers and self-styled experts in the publishing industry who are so fucking stupid they can’t run their own Windows XP boxen, let alone rescue their dying industries.

Established readers will evince no surprise when I reveal a few habits I consider dead giveaways – top-posting and, for book composition, the use of any or all of fake small caps, hot-metal typefaces from two centuries past, and nospace-emdash-nospace. These kinds of people don’t know what they’re already doing and, like a lesbian without a project, are a danger to themselves and society. They run, and ruin, the publishing industry.

Let me mention another indicator, this time among the authorial class. If you’re writing your book in Microsoft Word, man, are you stupid. It’ll crash. It’ll eat your work without crashing. One version won’t be able to open another version’s files. Unless you have expertise and a lot of time to spare, you will not overcome the sullyings of what Textism aptly described as this “antitypographic” program.

And your files have no future. All they can be turned into is files with no future themselves.

Documents in the 21st century need structure. Without Herculean effort, MS Word documents have none. Their stylesheets are diabolical and unreliable, making some presses’ insistence on authorial use of Word templates a minor war crime. A limited subset of structures in Word (chiefly Heading 1 and siblings) can possibly be exported for use elsewhere. In practice, nothing else can, and even with headings, the resulting file is tag soup (and may have effaced necessary characters like quotation marks and dashes).

The entire publishing industry expects manuscripts to arrive as MS Word files, which are then dutifully transformed into InDesign files, or, for lower forms of life, Quark files. You can print these out no problem, but they have no value beyond that and certainly no lifespan, unless you’re an expert, and few are.

What is the solution?

The solution is to write all manuscripts in HTML. Real, honest-to-God, perfect, valid HTML. Everything, all the time, forever. Word can open HTML files and save them as Word documents, with most structures intact. Such structures survive translation to InDesign, which can then output transformable XML or a tagged PDF. These are huge advantages in turning the finished work into some kind of electronic document. Or you could just stop being totally fucking stupid and use the original HTML.

In this scenario, the original file is structured and stays that way. It can be converted to a file for printing that may have more than a little structure, but you don’t have to worry too much about the latter feature because you still have the original file. The canonical document is a structured document. Your HTML file can be converted to an ePub (which is XHTML in a Zip package) or to many other formats. (If you’re as fucking stupid as Cory Doctorow is on this count, you start and end with an ASCII text file, which even an aging mule knows is only marginally more valuable than IBM Selectric typescript.)
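
To make “the canonical document is a structured document” concrete, here is a minimal sketch in Python using only the standard library. It reads a tiny HTML manuscript, keeps the block-level structure as (tag, text) pairs, and derives a plain-text version from it. The sample manuscript and the names extract_blocks and to_plain_text are invented for illustration; this is not the toolchain used for any book mentioned here.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect (tag, text) pairs for headings and paragraphs."""

    KEEP = {"h1", "h2", "h3", "p", "blockquote"}

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished (tag, text) pairs, in document order
        self._tag = None   # block element currently open, if any
        self._text = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        # Inline tags such as <em> are ignored; only their text is kept.
        if tag in self.KEEP and self._tag is None:
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = " ".join("".join(self._text).split())  # normalise whitespace
            self.blocks.append((tag, text))
            self._tag = None


def extract_blocks(html):
    """Return the manuscript as a list of (tag, text) pairs."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


def to_plain_text(html):
    """Derive a plain-text version: headings in capitals, blank line between blocks."""
    lines = []
    for tag, text in extract_blocks(html):
        lines.append(text.upper() if tag.startswith("h") else text)
    return "\n\n".join(lines)


# Hypothetical sample manuscript, for illustration only.
MANUSCRIPT = """
<h1>Chapter One</h1>
<p>It was a <em>dark</em> night.</p>
<h2>Aftermath</h2>
<p>Morning came.</p>
"""

print(to_plain_text(MANUSCRIPT))
```

The conversion runs in one direction only. The plain-text output can always be regenerated from the HTML, but nothing in the plain text says which line was a heading, which is the whole argument for keeping the structured file as the master.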

Does this not solve your problems?

It does not.

How do you edit your copy? You now have two versions whose words must be identical in typical cases. You can reasonably expect two to five times as much work. Why five? Because you’re going to want to fix your own mistakes and somebody else will have a second list of mistakes you’ll need to fix.
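
The “words must be identical” requirement can at least be checked by machine. What follows is a rough sketch, assuming the layout program can export its text as a plain string (the export step itself is not shown and is an assumption). It strips the markup from the HTML master, splits both versions into words, and reports where they differ. The names words_from_html and word_differences, and both sample strings, are hypothetical.

```python
import difflib
from html.parser import HTMLParser

# Elements that end a block of text. A space is added after each one so the
# last word of one block is not glued to the first word of the next.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "blockquote", "div"}


class TextOnly(HTMLParser):
    """Keep the text of a document and discard its markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.chunks.append(" ")


def words_from_html(html):
    """Return the words of an HTML document, in order, without markup."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).split()


def word_differences(master_words, layout_words):
    """List every stretch where the two word sequences disagree."""
    matcher = difflib.SequenceMatcher(a=master_words, b=layout_words, autojunk=False)
    return [
        (op, master_words[i1:i2], layout_words[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Hypothetical samples: the HTML master and text exported from the page layout.
MASTER_HTML = "<h1>Chapter One</h1><p>It was a <em>dark</em> night.</p>"
LAYOUT_TEXT = "Chapter One It was a dark and stormy night."

for op, ours, theirs in word_differences(words_from_html(MASTER_HTML), LAYOUT_TEXT.split()):
    print(op, ours, theirs)
```

A check like this finds an edit that was entered in one version and missed in the other. It does not reduce the work of entering each edit twice in the first place.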

[Photo: marked-up page proofs]

This is not theoretical. Production of my first book, written natively in XHTML, took an entire summer of two or three afternoons a week designing a print version and consolidating edits from two sources:

  1. My own marked-up printout.

  2. A similar printout from Moveable Type, whom we paid about $3,000 to copy-edit the book. (This was money well spent. You should spend the money, and only at Moveable Type. Accept no substitutes. Now, could somebody fix their Web site?)

    Moveable Type corrected four categories of mistakes, two of which related to page design.

Hence our workflow was as follows:

  • I had to enter two sets of text changes in HTML.
  • The designer, Marc Sullivan, and I had to enter all those very same text changes – two sets – in Quark, a program we wouldn’t be caught dead using today.
  • Marc had to vet all suggested design fixes, like widows and orphans and keeps violations.
  • Marc had to enter his own set of corrections, mostly for ligatures (which Quark could not automatically handle – we used seven of them).
  • Every page in both books that contained changes had to be marked to show that such changes were entered twice. Changes we disagreed with had to be specially marked. At the very end of the process, every page of both books had to be re-checked to find any change not so marked, an indication that we’d missed it. Then those rare missed fixes had to be entered (twice).

John Maxwell at SFU is playing with a new capacity of InDesign to use an XML format that precisely mimics every feature of a native InDesign format. An author’s original XHTML would need to be transformed into that format – something I don’t know how to do, and I’m pretty good at all this. I don’t think it’s unreasonable for everyone to edit the resulting XML file, assuming use of competent software (BBEdit can’t be beat in this regard). This will remove many steps from the process listed above, but not all of them.

My work was actually not done: I chose to put effort into adding metadata to the book chapters I published online. This turned out to have been nearly useless, but it took hours.

What is the real stumbling block?

Authors are word people, not math people. Structured documents are mathematical in nature (text strings are delimited by a set or sets of markup). Web standards and semantic markup can be explained to nonexperts in minutes. (For the umpteenth time: I’ve done it umpteen times, and audiences always react with delight at the simplicity of the concept.) Turning theory into practice is quite another thing. Nearly all authors will be incapable of writing structured markup. There are acclaimed authors who are so borderline autistic they cannot hear unfamiliar words. Rare authors use software that, while antediluvian, assists writing yet impedes conversion, like WordPerfect 5.1.
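
The “mathematical” part can be shown in a few lines: markup is a set of paired delimiters, and checking that they nest properly is the same problem as matching brackets. The sketch below is a simplified checker, assuming XHTML-style markup in which every non-void element is closed explicitly. That is stricter than HTML itself, which lets some end tags be omitted, and this is not a validator. The name nesting_errors is invented for illustration.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they are left out of the balance check.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class NestingChecker(HTMLParser):
    """Track open elements on a stack, the way one matches brackets."""

    def __init__(self):
        super().__init__()
        self.stack = []    # elements opened and not yet closed
        self.errors = []   # human-readable complaints

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing form such as <br/>: nothing to balance

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        elif self.stack:
            self.errors.append(f"</{tag}> found while <{self.stack[-1]}> was open")
        else:
            self.errors.append(f"</{tag}> found with nothing open")


def nesting_errors(markup):
    """Return a list of nesting problems; an empty list means the tags balance."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    errors = list(checker.errors)
    errors.extend(f"<{tag}> never closed" for tag in checker.stack)
    return errors


print(nesting_errors("<p>It was a <em>dark</em> night.</p>"))
print(nesting_errors("<p><b>bold <i>both</b> italic</i></p>"))
```

The first call reports no problems. The second is tag soup, with overlapping bold and italic, and produces a list of complaints. Explaining why it is wrong takes a minute; writing it correctly every time, for 300 pages, is the part that defeats authors.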

I understand this authorial phenomenon because, after years and years of trying, I do not understand the first thing about JavaScript and that is never going to change. The typical author will have the same reaction to plain HTML. It’s hopeless.

On the other hand, a competent standardista with high literacy and good tools can unfuck even horrifically unstructured documents in a matter of hours. I can unfuck your 300-page book in a day (and I’ll charge you a fortune to do it). Rather after the manner of outsourcing CAPTCHA retyping to India, what E-books need is a caste of intermediaries who are paid top dollar to transform authors’ tag-soup MS Word pieces of shit into semantic gold.

The foregoing posting appeared on Joe Clark’s personal Weblog on 2010.01.07 15:16. The permanent link is:
https://blog.fawny.org/2010/01/07/booksemantics1/
