Moz here means Mozilla, not Morrissey. Silvia Pfeiffer et al. have been exploring the issue of captioning, subtitling, and other forms of moving text as used on the Web. There’s too much of an insistence on an open standard to guarantee a successful outcome (sometimes the best option is one somebody else owns), but for today, let’s consider some warning signs from Pfeiffer’s recent report.
Open captions don’t have to be burned in
You can use a separate text track (of any kind) and merely force a player to display the captions. Same goes for subtitles. You see it quite a lot in the DVD field (check the concept of “forced subtitles”; it involves setting SPRMs, but I’ll let you look that up).
Open captioning merely means “they’ll be visible to everybody who can see.” It is not the opposite of encoded captioning or captioning provided as a separate stream. Hence it is also incorrect to state that “Only closed captions and subtitles, as well as lyrics and linguistic transcripts, have widely-used open text formats.”
DVD subpictures are essentially TIFFs
Pfeiffer helpfully suggests that we could do something like run OCR on DVD subpictures, but implies this would be ever so tricky, listing an alphabet soup of open-source formats nobody now uses or will ever use. Subpictures start out as TIFFs and are merely run-length-encoded graphics files that are trivial to decode. Once decoded, they will have been helpfully removed from any obfuscating background graphics and can be read via OCR with high accuracy, at least in scripts like Latin and Cyrillic.
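To give a sense of how little work run-length decoding involves, here is a minimal Python sketch. It assumes a simplified scheme of (run length, pixel value) pairs with four possible pixel values; the actual DVD subpicture bitstream packs its runs more compactly, and this does not attempt to reproduce that layout. The function name `decode_rle_rows` and the sample data are made up for illustration.

```python
from typing import Iterable, List, Tuple


def decode_rle_rows(runs: Iterable[Tuple[int, int]], width: int) -> List[List[int]]:
    """Expand (run_length, pixel_value) pairs into rows of `width` pixels.

    Simplified illustration only: pixel values are 0-3 (a four-entry
    palette), and a run may wrap from one row onto the next.
    """
    rows: List[List[int]] = []
    current: List[int] = []
    for length, value in runs:
        if length <= 0:
            raise ValueError("run length must be positive")
        if not 0 <= value <= 3:
            raise ValueError("pixel value must be in the range 0-3")
        remaining = length
        while remaining > 0:
            # Fill as much of the current row as this run allows.
            take = min(remaining, width - len(current))
            current.extend([value] * take)
            remaining -= take
            if len(current) == width:
                rows.append(current)
                current = []
    if current:
        raise ValueError("runs do not fill a whole number of rows")
    return rows


# Hypothetical 8-pixel-wide bitmap: a blank row, a short bar, a blank row.
sample_runs = [(8, 0), (2, 0), (4, 1), (2, 0), (8, 0)]
bitmap = decode_rle_rows(sample_runs, width=8)
for row in bitmap:
    print("".join("#" if pixel else "." for pixel in row))
```

The decoded rows are an ordinary bitmap of the title alone, with no programme video behind it, which is the kind of input OCR handles best.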
The myth of written audio description
I wonder if we’re ever going to be able to kill off this mythology, repeated ad nauseam by beginners in the industry, that handing a blind or deaf-blind person a transcript of an audio description is in some way helpful. It isn’t. Film is a medium of motion; action happens right now and so does its description. (Or a second or two before or after the action, but in any event not at some other time jotted down in a text file.)
The idea that deaf-blind people have an interest in transcriptions of audio descriptions is essentially false. There was one trial project – one – with ambiguous results. The idea is a non-starter. If a soundtrack can be reduced to a printout, why can’t the moving image be reduced to a single photograph?
This is all about bottom-centred titles
As with the YouTube case and so many others, I get the impression that proponents have watched a bit of captioning, maybe five minutes here and there, and have added that as a kind of icing on the cake of lifelong subtitle viewing. Half these people are British and cannot actually distinguish captions from subtitles, perhaps even claiming there is no distinction.
Hence I see Mozilla’s and YouTube’s work as doing nothing but providing invariant-bottom-centred titles of one sort or another, with likely limitations on number of lines and number of characters per line. In short, there is no understanding that pop-on captions must be positioned in all cases, that even some subtitles must be positioned away from screen bottom, and that some subtitlers use flush-left justification.
Many combinations unaccounted for in an all-bottom-centre system are seen every day:
- Multiple languages per subtitle, e.g., one line each of Chinese, Vietnamese, and English. (Hence this statement is false: “If, for example, there are subtitles in ten different languages, a user will only want to see the subtitles of one language at a time.”)
- Multiple simultaneous caption blocks.
- Simultaneous captioning and subtitling, or, more commonly, a subtitled program with occasional added captions. (Let me guess: Your system makes me choose one or the other but not both.)
- Scrollup captioning. While misused, it exists and has to be accounted for.
- Combinations of scrollup and pop-on captioning, e.g., scrollup for dialogue and pop-on for music.
- Multiple caption streams, as in verbatim and easy-reader versions (rare but not unknown) or same-language and translated versions (viewable every day of the week on American television).
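The cases listed above can be sketched as a data model. What follows is a minimal Python illustration, not anybody's actual format: every class and field name (`Cue`, `Track`, `row`, `column`, `align`, `mode`, `visible_cues`) is my own assumption. The point is only that each cue carries its own position and display mode, and that more than one track can be enabled at once.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Cue:
    start: float            # seconds
    end: float              # seconds, exclusive
    text: str
    row: int = 14           # hypothetical grid row; 14 is near screen bottom
    column: int = 0
    align: str = "center"   # "left", "center", or "right"
    mode: str = "pop-on"    # "pop-on" or "scrollup"


@dataclass
class Track:
    name: str
    kind: str               # "captions" or "subtitles"
    language: str
    cues: List[Cue]


def visible_cues(tracks: List[Track], enabled: Set[str], t: float) -> List[Tuple[str, Cue]]:
    """Return (track name, cue) for every cue on screen at time t.

    Nothing here limits the viewer to one track, one block, or one position.
    """
    result = []
    for track in tracks:
        if track.name not in enabled:
            continue
        for cue in track.cues:
            if cue.start <= t < cue.end:
                result.append((track.name, cue))
    return result


# A subtitled program with added captions, all made-up sample data.
subs = Track("subs-en", "subtitles", "en", [
    Cue(10.0, 13.0, "Where is he?"),
])
caps = Track("caps-en", "captions", "en", [
    Cue(11.0, 12.5, "[door slams]", row=1, column=2, align="left"),
    Cue(11.0, 15.0, ">> We go live now", row=13, align="left", mode="scrollup"),
])

on_screen = visible_cues([subs, caps], {"subs-en", "caps-en"}, 11.5)
for name, cue in on_screen:
    print(name, cue.mode, cue.row, cue.align, cue.text)
```

A player built on a model like this still has to resolve collisions between blocks and render scrollup properly, but at least it does not rule the combinations out from the start.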
To explain this deficiency another way, there is a rush to solve what underinformed people view as the dominant use case with no understanding of other use cases. When presented with the latter, the response is either “That doesn’t happen” (it does) or “That’s pretty rare” (across a mythical 500 channels running 24/7, it isn’t).
Solving the four-fifths of the problem you think is the entire problem means you haven’t solved the problem.
Then we get into issues like typography (I’m not naïve; it’ll be all Arial all the time, and you’ll defend that to the death); in-frame vs. out-of-frame display; and just how Mozilla and everybody else intend to actually digitize largely undocumented proprietary formats. The complete ideological zeal for open-source formats will, I promise, get in the way of the entire project.
Simplistic answers to complex problems
From my reading, it seems Mozilla is doing what YouTube did: Grasping at the most expedient and simplistic solution.
Mozilla manages this degree of oversimplification despite publishing one paper after another ostensibly documenting months of research. “The aim of the study,” Pfeiffer writes, “was to ‘deliver a well-thought-out recommendation for how to support the different types of accessibility needs for audio and video, including a specification of the actual file format(s) to use.’ ” What they’re on track to deliver is a very elaborate system to encode the dumbest possible form of subtitling and declare it suitable for all variations of everyone’s needs.