‘International perspective’

(UPDATED) I have been saying for a couple of years that the PDF/Universal Access Committee (slash in original, sadly) is a happy ship, stewarded by an erudite and urbane chair and staffed by a small core of constituents with good skills and an actual sense of humour. The goal is a standard for PDF accessibility that extends beyond the simplistic dictum “Only use tagged PDF.”

More or less by default, I was assigned the Text and Headings modules. This task built on my actual knowledge and skills, but I did a great deal of external research. For example, I ended up going well beyond the recommendations in the CSS Working Group’s paper “Robust Vertical Text Layout.” CSS is not PDF, and we have our own requirements, but I was able to set up two new attributes for text direction. You can use any direction for either attribute.

WritingModeInline

The attribute WritingModeInline specifies the inline direction of text, that is, the direction of text within a block (typically the direction of characters within lines).

WritingModeBlock

The attribute WritingModeBlock specifies the block direction of text (typically the direction of lines over a page).

Requirements for text direction

Text direction shall be declared. The declaration should be on the root tag of the document.

Text direction may be locally overridden where warranted, e.g., for mixed-language text where writing direction changes. For this purpose or for other change in writing direction, such change shall be declared. Only the axis of change shall be declared (inline or block, respectively); if the other axis (block or inline, respectively) has not changed, it may be declared.

A tag shall use only one method of declaring text direction. That is, an application may use PDF 1.7 WritingMode or a combination of PDF/UA WritingModeInline and PDF/UA WritingModeBlock but shall not use both methods on the same tag.
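As a rough illustration of the one-method rule (not part of the draft), here is a minimal Python sketch. It assumes a tag's attributes are available as a plain dict, which is a hypothetical model rather than any real PDF library's API; the function name `direction_problems` is likewise invented for the example.

```python
# Hypothetical model: one tag's attributes as a plain dict.
PDF17_ATTR = "WritingMode"
PDFUA_ATTRS = ("WritingModeInline", "WritingModeBlock")


def direction_problems(attrs, is_root=False):
    """Return a list of text-direction problems for one tag."""
    problems = []
    uses_pdf17 = PDF17_ATTR in attrs
    uses_pdfua = any(name in attrs for name in PDFUA_ATTRS)
    # A tag shall not combine the PDF 1.7 method with the PDF/UA method.
    if uses_pdf17 and uses_pdfua:
        problems.append("tag mixes WritingMode with WritingModeInline/WritingModeBlock")
    # The draft says the declaration should sit on the root tag.
    if is_root and not (uses_pdf17 or uses_pdfua):
        problems.append("root tag declares no text direction (the draft says it should)")
    return problems


print(direction_problems({"WritingMode": "LrTb", "WritingModeInline": "LR"}))
```

A tag that declares only `WritingModeInline` and `WritingModeBlock` yields an empty list.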

Direction values

WritingModeInline and WritingModeBlock share the same set of possible values:

Rectilinear

LR: left to right
RL: right to left
TB: top to bottom
BT: bottom to top

Paths around rectangles (including squares and rhombi) are modelled as sequences of straight lines. To define the direction of text along a rectangle, an author must use sequences of rectilinear values.

Curved

Clockwise: in a curve corresponding in direction to the movement of the hands of a clock
Counterclockwise: in a curve opposite in direction to the movement of the hands of a clock

Note that PDF/UA uses Counterclockwise as a value, not Anticlockwise.

Paths along ellipses and arbitrary curves are modelled as circles of equivalent radius. To define the direction of text along an arbitrary curve, an author shall use a sequence of Clockwise and Counterclockwise value(s).

Diagonal

LLUR: lower left to upper right
URLL: upper right to lower left
LRUL: lower right to upper left
ULLR: upper left to lower right

No text direction

None: no declared text direction
Unknown: no known text direction
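The value set above is small enough to check mechanically. The following is a sketch only: the grouping keys and the function name `direction_family` are assumptions made for the example, while the values themselves are the ones listed in the draft.

```python
# Direction values as listed in the draft, grouped by family.
# The family keys are labels invented for this sketch.
DIRECTION_VALUES = {
    "rectilinear": ("LR", "RL", "TB", "BT"),
    "curved": ("Clockwise", "Counterclockwise"),
    "diagonal": ("LLUR", "URLL", "LRUL", "ULLR"),
    "none": ("None", "Unknown"),
}


def direction_family(value):
    """Return the family a direction value belongs to, or raise ValueError."""
    for family, values in DIRECTION_VALUES.items():
        if value in values:
            return family
    raise ValueError(f"not a WritingModeInline/WritingModeBlock value: {value!r}")


print(direction_family("TB"))
print(direction_family("Counterclockwise"))
```

Passing `Anticlockwise` raises `ValueError`, matching the note that only `Counterclockwise` is a value.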

Text at corners, vertices, and inflection points

Text located exactly at a corner or vertex of a rectangle or at an inflection point of a curved path must declare a WritingModeInline value of None. […]

No default direction

PDF 1.7 gave WritingMode a default value of LrTb. PDF/UA-compliant documents have no default value for inline or block text direction. Text direction shall be explicitly declared.

Scripts with naturally changing direction

Neither PDF 1.7 nor PDF/UA specifies explicit attributes for scripts that, by their nature, change direction in running text. To encode a script that continually changes or alternates direction, an author shall use a sequence of WritingModeInline and/or WritingModeBlock values. For example, boustrophedon text, whose reading direction varies from left-to-right on one line to right-to-left on the next, shall be modelled as a sequence of WritingModeInline = LR and WritingModeInline = RL values.
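To make the boustrophedon case concrete, here is a minimal sketch that produces the alternating sequence of WritingModeInline values, one per line. The function name and the idea of declaring one value per line are assumptions for illustration; the draft only says a sequence of LR and RL values is used.

```python
from itertools import cycle, islice


def boustrophedon_inline_modes(line_count, first="LR"):
    """Return one WritingModeInline value per line, alternating LR and RL."""
    if first not in ("LR", "RL"):
        raise ValueError("first must be 'LR' or 'RL'")
    second = "RL" if first == "LR" else "LR"
    # cycle() repeats the pair forever; islice() keeps one value per line.
    return list(islice(cycle((first, second)), line_count))


print(boustrophedon_inline_modes(4))
```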

Use cases

For clarification, some typical combinations of WritingModeInline and WritingModeBlock values are as follows.

`WritingModeInline`	`WritingModeBlock`	Example
LR	TB	English; French; Basque; Georgian; Tibetan; Japanese (horizontal writing); Chinese (horizontal writing); Korean; Mongolian (Cyrillic); Tamil; numerals within Hebrew text
RL	TB	Hebrew; Yiddish; Arabic; Farsi; Urdu; Pashto
TB	RL	Japanese (vertical writing); Chinese (vertical writing)
TB	LR	Mongolian (traditional)
BT	LR	Ogham
any rectilinear	any diagonal	Crossword puzzles; sudoku
None	None	Single glyph (any language); single numeral
Undefined	Undefined	Language unknown to the PDF author
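The table reads naturally as a lookup from a kind of text to a pair of attribute values. The sketch below copies a few rows into a dict; the names `USE_CASES` and `direction_attributes` are hypothetical, and only rows whose values appear in the list of direction values are included.

```python
# A few rows from the use-case table: example -> (inline, block).
USE_CASES = {
    "English": ("LR", "TB"),
    "Hebrew": ("RL", "TB"),
    "Japanese (vertical writing)": ("TB", "RL"),
    "Mongolian (traditional)": ("TB", "LR"),
    "Ogham": ("BT", "LR"),
}


def direction_attributes(example):
    """Return the attribute pair for a known example; KeyError otherwise."""
    inline, block = USE_CASES[example]
    return {"WritingModeInline": inline, "WritingModeBlock": block}


print(direction_attributes("Ogham"))
```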

The current PDF spec can handle, in effect, English, Hebrew, Japanese, and nothing else. The finished PDF/UA spec can handle anything from Ogham to Mongolian to Scrabble to crossword puzzles. And the text-direction components might be built into the actual PDF specification, now known as ISO 32000. (PDF has not been an “Adobe” specification for quite a while.)

There was also the issue of declaration of language, which no standard has gotten right. We also can finally handle multilingual alt texts and similar mixed-language attribute values. Since PDFs are little databases, we use a lookup table for abbreviations and acronyms, meaning we can disambiguate the dual usages of St. in “St. George St.” without HTML-style markup or tooltips. (Every form of abbreviation, acronym, initialism, or short form is collapsed onto Abbr.)

Content shall be tagged in logical reading order. The most semantically appropriate tag shall be used for document content.

Character codes shall map to Unicode as described in “Unicode Mapping in Tagged PDF.” […]

Stretchable characters such as parentheses or brackets (often drawn by combining several individual glyphs to form the appearance of a single glyph) shall be tagged using Actual Text [a feature HTML doesn’t have]….

Characters not included in any published Unicode specification may use the Unicode private use area or declare another published character encoding.

Font characters shall be available for each character code, including Braille [as all human-readable characters must have a visible form in PDF].

Natural language shall be declared…. Language codes shall be derived solely from IETF BCP 47, “Tags for Identifying Languages.” In particular:

Documents not expressed in a natural language shall declare the root language as zxx.

Documents expressed in a language unknown to the author or creator shall declare the root language as und.

Documents with equal proportions of multiple languages shall declare the root language as mul and use structure elements to group and tag each content block with the correct code for the language of the content.
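The three special codes can be sketched as a small decision function. This is an illustration under stated assumptions: the input is a hypothetical dict mapping each BCP 47 code to its share of the content, and falling back to the single dominant language when shares are unequal is my assumption, since the excerpt does not cover that case.

```python
def root_language(shares, natural=True):
    """Pick a root language code from per-language content shares.

    shares: dict of BCP 47 code -> share of content (hypothetical input).
    natural: False when the document is not in a natural language.
    """
    if not natural:
        return "zxx"  # not expressed in a natural language
    if not shares:
        return "und"  # language unknown to the author or creator
    top = max(shares.values())
    leaders = [code for code, share in shares.items() if share == top]
    # Equal proportions of several languages: declare "mul".
    # Assumption: otherwise use the single dominant language.
    return "mul" if len(leaders) > 1 else leaders[0]


print(root_language({"en": 0.5, "fr": 0.5}))
```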

Changes in natural language shall be declared.

Changes in natural language inside attribute values (e.g., inside Alternate Text and Bookmarks) shall be declared using tag characters as described in §16.9 of Unicode 5.1: Special Areas and Format Characters (PDF).
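For readers who have not met Unicode tag characters: a language tag is U+E0001 followed by tag characters, each of which is the corresponding ASCII character's code plus 0xE0000. The sketch below builds a mixed-language attribute value that way. The function names are invented for the example, and how a given PDF producer would store the resulting string is outside its scope.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
TAG_BASE = 0xE0000           # tag character = TAG_BASE + ASCII code


def language_tag(code):
    """Encode a language code such as 'fr' as an invisible tag sequence."""
    if not all(0x20 <= ord(ch) <= 0x7E for ch in code):
        raise ValueError("language code must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(ch)) for ch in code)


def tag_segments(segments):
    """Join (code, text) pairs, prefixing each text with its language tag."""
    return "".join(language_tag(code) + text for code, text in segments)


def strip_tags(value):
    """Remove every character in the tag block, leaving the visible text."""
    return "".join(ch for ch in value if not 0xE0000 <= ord(ch) <= 0xE007F)


alt = tag_segments([("en", "Sign reading "), ("fr", "Défense de fumer")])
print(strip_tags(alt))
```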

Text direction shall be declared.

Changes in text direction shall be declared.

When the meaning is ambiguous to the intended readership, abbreviations, acronyms, initialisms, and short forms shall be tagged with Abbr and their expansion shall be given per §14.9.5 in ISO 32000.
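The “St. George St.” case works because each tagged occurrence carries its own expansion, so identical short forms need not expand identically. A minimal sketch, assuming a hypothetical representation of content as (text, expansion) pairs rather than any real PDF structure API:

```python
def expand(spans):
    """Replace each abbreviated span with its own expansion, if it has one."""
    return "".join(text if expansion is None else expansion
                   for text, expansion in spans)


# Each occurrence of "St." is tagged separately, so each can expand differently.
street_name = [("St.", "Saint"), (" George ", None), ("St.", "Street")]
print(expand(street_name))
```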

If you’re working on HTML5 and you’re struggling with unnumbered headings, well, my version has that down pat.

Documents, or portions of documents, that use structural tags to group related content blocks shall not use numbered headings. Only the generic heading H may be used…. Each instance of H shall have one section-tag parent. To indicate successive levels of headings, authors must nest other section tags; each of those nested section tags may contain at most one H.
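Under this scheme a heading's level is simply how deeply its section is nested. Here is a small sketch of that idea; the `Sect` and `H` classes are a hypothetical in-memory model of the tag tree, not the draft's data structures, and the check covers only the at-most-one-H rule.

```python
from dataclasses import dataclass, field


@dataclass
class H:
    text: str


@dataclass
class Sect:
    children: list = field(default_factory=list)


def outline(sect, depth=1):
    """Return (level, heading text) pairs, deriving level from nesting depth."""
    headings = [c for c in sect.children if isinstance(c, H)]
    if len(headings) > 1:
        raise ValueError("a section tag may contain at most one H")
    result = [(depth, h.text) for h in headings]
    for child in sect.children:
        if isinstance(child, Sect):
            result.extend(outline(child, depth + 1))
    return result


doc = Sect([H("Title"),
            Sect([H("Intro")]),
            Sect([H("Body"), Sect([H("Detail")])])])
print(outline(doc))
```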

So. Pretty solid, I think.

But our little working group is part of a larger standards organization that in turn works with ISO. The latter two organizations, among others, had a giant meeting recently in Beijing, which only the Adobe and Microsoft (!) representatives could afford to attend. (A perennial problem with international standards bodies.) They didn’t like the wording of the headings module (only partially excerpted above), which reflected some trouble we were having in separating mandatory from “advisory” information. Fine.

The real problem, though, was with the text module. They apparently glanced at it and decided it did not take account of double-byte and “complex” scripts and needed an “international perspective.” I am told that the Adobe and Microsoft reps stood up for the quality of the work. If that happened, it did nothing. But there’s no basis for the objections in the first place – Chinese and Bengali (examples of double-byte and complex scripts) render at a level above the one we’re working on. In effect, they are, as ever, an issue of fonts and Unicode.

(German was mentioned later. German hyphenation is already covered in the old PDF 1.7 spec. If you’re wondering about the Bengali example, in that script, and many others, a raw Unicode character sequence like 1 2 3 could display as [1+3=4] 2. It’s similar to ligature substitution, where the two letters f+i become the single character ﬁ when displayed.)
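Both effects can be observed from the character data alone, using Python's standard `unicodedata` module. The sketch shows that the ﬁ ligature is a presentation form of the two letters f and i, and that a Bengali syllable stores its consonant first even though the vowel sign chosen here is drawn to the consonant's left. The choice of the syllable কি is mine, as a simpler case than the three-character example above.

```python
import unicodedata

# The single ligature character folds back to two letters under
# compatibility normalization.
LIGATURE_FI = "\ufb01"
letters = unicodedata.normalize("NFKC", LIGATURE_FI)

# Stored (logical) order of the Bengali syllable: consonant, then vowel sign.
# On screen the vowel sign appears before the consonant; that reordering is
# done by the font and text shaper, not by the stored text.
syllable = "\u0995\u09bf"
names = [unicodedata.name(ch) for ch in syllable]

print(letters)
print(names)
```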

That issue has been handled; it hasn’t been overlooked. I’ve been talking about it for two years. What really happened is the Beijing committee members have so little topic knowledge that they think our nice short text module doesn’t account for all the possibilities. By accounting for all the possibilities, we made it that short.

The international-perspective business was particularly galling, as I am an international perspective. Until a couple of meetings ago, I was the only international perspective within PDF/UA; everyone else on the committee is American. The Beijing conference accused us, in effect, of being presumptuous Americans steamrollering over the delicate sensibilities of advanced cultures whose scripts are “complex” or require twice as many octets as simplistic American language. The fact that I have repeatedly shown my work to other experts in the field (none of them American) counted for nothing.

The PDF/UA Committee has completely caved on this point and is in no way standing up for me and my work. In fact, they cooked up some post-facto bullshit that there might be some “international [adaptive technology]” that “we” don’t know about. They admitted the whole thing amounted to political correctness. The chair terminated debate on the issue, and I barely got an objection to that termination entered into the record.

Now, is this bullshit or what? It is another way of stating that, say, a coloured person or somebody who talk de English mit an akzent is more likely to write a good standard than I am. It ignores the actual work and obsesses over which boxes the authors could tick on an employment-equity form. (That, incidentally, is a Canadian, hence “international,” concept.)

To put it yet another way, some Indian dude who knows nothing about the topic could get his own text module rubber-stamped, while my actual work is sent to the shitter because I am supposedly some kind of imperialist American. Ironically, U.S. English has a couple of useful terms that apply perfectly here – reverse racism and affirmative action.

There is a concern that, unless “we” do whatever the Beijing committee ordered us to do, a later ISO committee could vote down the entire PDF/UA specification. They won’t.

Here’s what they should really be concerned about: Are they going to be able to find enough coloured people with broken English to do the work I used to do?

Update

(2008.12.05) The meeting minutes from 2008.12.03 state, somewhat self-incriminatingly:

Diversity [sic], the last item on the action items page. Committee was prompted by [ISO] to obtain more input on complex scripts and double-byte character[s]. [The Microsoft rep behind this entire boondoggle s]ought out folks familiar with standards in general and accessibility standards in particular. Search indicated there were experts on the issues of text from Tokyo, China, Egypt, and Israel who can assist.

So now I have my answer: It takes four coloured people or people with funny accents to do the work I used to do.

Permanent link & datestamp ☞ 2008.11.11 13:42
Filed under ☞ PDF

The foregoing posting appeared on Joe Clark’s personal Weblog on 2008.11.11 13:42. The permanent link is:
https://blog.fawny.org/2008/11/11/pdfu-ixnay/
