SC22/WG20 N791
L2/00-369

From: Kenneth Whistler [kenw@sybase.com]
Sent: Friday, October 06, 2000 8:43 PM
Subject: WG2 in Vouliagmeni (Athens)

Unicadetti,

As I am occasionally wont to do, I have decided to provide a detailed report on the goings-on in Athens.

The first thing of note is that the meetings were not actually in Athens, but outside of town in the rich seaside suburb of Vouliagmeni. A couple miles to the southeast of us was the private harbor where the super-rich park their Onassis-class yachts. But where we were was nice enough -- right next to a beautiful Aegean beach, where on occasion Asmus and Erkki could be spotted wading in their shorts. Or, if you were up before dawn at 6:30, Arnold Winkler could be seen hiking briskly down the seaside causeway, leading a laggard retinue of morning dog-spotters.

And there were 3 lovely Greek restaurants with covered pavilions right down next to the shore, for the lunch and evening dining pleasure of those who didn't care to deal with taxi-drivers to head downtown to eat the same things that could be had by crossing the street from the hotel.

And if that is not enough to wonder why you missed it -- ELOT flew the entire group to Santorini on the weekend and put us up overnight in little inns in Kamari, the black beach resort town. So ELOT certainly gets full marks for hospitality for its visitors to Greece!

Since the next WG2 meeting is hosted by the U.S. in Mountain View, this left the U.S. delegation wondering whether a bus tour to the Museum of Tech Innovation in downtown San Jose would be considered as not quite making the grade in the way of planned outings for next spring's WG2 meeting. ;-)

Now down to the business at hand.

The resolutions from the WG2 meeting are available as document WG2 N2254 (= SC2 N3489). You can go get that document for details, so I won't repeat everything here.
However, there are a number of issues that do have implications for the UTC business, so I will focus on interpreting those actions.

====================================================================
1. 10646-2

The handling of the disposition of comments for 10646-2 was the main agenda item for WG2. The summary of what took place is that the FCD was successfully progressed for an FDIS ballot, with only minor changes resulting from the ballot comments.

The FDIS ballot should be issued at the end of December -- the main reasons for it taking that long are getting the revised, high quality font for the Plane 2 Han characters, plus giving the IRG time to resolve North Korea's request to include DPRK source information in the publication.

In terms of synchronizing with Unicode 3.1, the target completion dates are now for approved FDIS 2001-05 and (published) IS 2001-12.

Here are some of the notable changes for 10646-2 that resulted from the ballot comments and the committee discussion:

a. Title change.

Instead of the horrendous mouthful:

  Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 2: Secondary Multilingual Plane for scripts and symbols, Supplementary Plane for CJK Ideographs, Special Purpose Plane

all of the planes are being amended to be referred to as "supplementary planes". So now we will have simply:

  -- Part 1: Basic Multilingual Plane
  -- Part 2: Supplementary Planes

And the text inside 10646-2 is being corrected appropriately. Chalk up one point for editorial sanity over committee standardese.

b. Old Italic

All of the U.S. comments were accepted. In addition, the Irish and German comments resulted in the pulling of the 5 most problematical numerical characters. The UTC will need to reaffirm the result, since names changed, some characters were removed, and a few positions changed.

c.
Byzantine Musical Symbols

The character repertoire was unchanged, but the names were all changed to start with "BYZANTINE MUSICAL SYMBOL", omitting "GREEK" from the names. All of the parenthetical stripe comments were removed, too, and the discussion of these symbols was simplified in the annex, pending more definitive information about how these should be implemented. The UTC will need to reaffirm the name changes.

d. Western Musical Symbols

WG2 accepted the reorganization of this block, based on the report of a U.S./Ireland ad hoc before WG2, including the input of the authors of the original proposal. The names were shortened to "MUSICAL SYMBOL ...", omitting "WESTERN" from the names. The UTC will need to reaffirm the entire block, since names and encodings changed, and there were a few drops and adds. The net effect, by the way, is a much more logical ordering than in the original proposal.

e. Mathematical Alphanumeric Symbols

The U.S. comments to replace the open-face italic with boldface fraktur were adopted. "OPEN-FACE" was changed to "DOUBLE-STRUCK" in other names for consistency. Otherwise the entire set survived intact. The UTC will need to reaffirm the name changes, but there were no encoding changes otherwise.

f. CJK Compatibility Ideographs Supplement (Plane 2)

Michel corrected the gaps for this set of CNS compatibility characters, but during the meeting an additional problem turned up, where a character whose unification was in question popped up in this compatibility set, instead of being dumped into the bucket for Vertical Extension C. That character will be removed from the FDIS. The UTC needs to reaffirm, because of encoding changes.

g. Vertical Extension B (Plane 2)

China delivered to Michel what purports to be the final data files for Vertical Extension B, and pledged that from this point forward all they are working on is improvement of the font.
There will be no additions, deletions, reencodings, or unification changes for anything in Vertical Extension B before the FDIS ballot. So I think we finally have a stable encoding. The UTC needs to reaffirm this set, which results from the application of the IRG editor's report on the last document.

h. Language Tag Characters (Plane 14)

No changes.

====================================================================
2. Amendment 1 for 10646-1

The other major agenda item for the WG2 meeting was to roll all the math symbols and miscellaneous other items from the WG2 "bucket", plus various required text changes, into an omnibus amendment for Part 1. This was accomplished, but not without a fair amount of dickering on the details.

The good news for the math folks is that the entire set of math symbols went in almost unscathed, and was the vehicle whereby we carried all the rest of the additions. However, there were a small number of changes which need to be dealt with, as well as the additions that the UTC will need to take up. And one major omission.

For those who have access to the WG2 document site, who want to read along in detail, the document going in was WG2 N2263 "Working Draft of Tables & Character Names for Proposed Amendment 1 to 10646-1:2000". The document prepared during the meeting, incorporating all the additions and changes from the meeting, as input to the editor, was WG2 N2281, with the same name.

The WG2 resolution M39.25 initiates the PDAM-1, "Mathematical symbols and other characters", with the working draft due at the end of October, and with the PDAM balloting to be complete by 2001-04. So we're on our way, folks, for Unicode 3.2!

a. U+0363 COMBINING GRAPHEME JOINER

WG2 declined to accept this character, which the UTC had approved. The main objector was Ireland, on the grounds that the documentation provided was woefully inadequate to explain the character and how it might react with other characters like the ZWJ, etc. The U.S.
and the Unicode liaison argued for its inclusion, as directed, but didn't have any good documentation of use to point to, so didn't carry the day. The UTC needs to provide more convincing documentation of the need for this character, presumably to accompany ballot comments on the PDAM. But in any case the code point will have to change, since WG2 put something else in that position.

b. Medieval superscript letters

In response to the German request for superscript (combining above) letters, WG2 took the conservative approach of encoding just the 13 attested letters, rather than a full alphabet. These got added at U+0363..U+036F, and the UTC will need to consider and approve these. These will pose some interesting property assignment problems, since they are *alphabetic* non-spacing combining marks.

c. Misc. Greek symbols

As might be expected in Greece, the Greek symbols included in the math symbols proposal came up for intensive (and passionate) discussion. To preserve the repertoire for the STIX set, we acquiesced in a number of name changes and the moving of a couple characters.

The Q-SHAPED KOPPA characters were renamed to ARCHAIC KOPPA, after a long discussion of alternatives. The Greek committee strongly objected to "Q-SHAPED KOPPA", even though that seemed like an innocuous, descriptive term to the rest of us.

The qualifier "WITH STRAIGHT BAR" was dropped from the name of U+03F4 GREEK CAPITAL THETA SYMBOL.

The move involved the straight epsilon symbols:

  U+213B GREEK SYMBOL STRAIGHT EPSILON
    --> U+03F5 GREEK LUNATE EPSILON SYMBOL
  U+213C GREEK SYMBOL REVERSED STRAIGHT EPSILON
    --> U+03F6 GREEK REVERSED LUNATE EPSILON SYMBOL

The insistence on "LUNATE" was a Greek preference. These are handwritten variants of the epsilon, and are seen as exact parallels to the lunate sigma. The mathematician's preference for "straight epsilon" will be handled by a Unicode name alias.
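
For anyone who wants to compare the names reported above against a character database on their own machine, here is a minimal sketch in Python. It assumes the standard `unicodedata` module, whose contents reflect whatever Unicode version ships with the interpreter and not necessarily the Athens working draft; the names `REPORTED_NAMES` and `compare_names` are hypothetical, made up for this illustration.

```python
import unicodedata

# Code points and names exactly as reported in the text above.
# (Hypothetical table for illustration; it is not taken from any WG2 document.)
REPORTED_NAMES = {
    0x03F4: "GREEK CAPITAL THETA SYMBOL",
    0x03F5: "GREEK LUNATE EPSILON SYMBOL",
    0x03F6: "GREEK REVERSED LUNATE EPSILON SYMBOL",
}


def compare_names(reported):
    """Map each code point to (reported name, name in the local database).

    The second element is None when the local database has no name
    for that code point.
    """
    result = {}
    for cp, name in reported.items():
        result[cp] = (name, unicodedata.name(chr(cp), None))
    return result


if __name__ == "__main__":
    for cp, (reported, local) in sorted(compare_names(REPORTED_NAMES).items()):
        status = "match" if reported == local else "differs"
        print(f"U+{cp:04X}  reported={reported!r}  local={local!r}  {status}")
```

Any "differs" line would only mean that the local database and this report disagree; it says nothing about which one is right.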
Additionally, there were some further name modifications for all of the other letterlike symbols from the math symbols proposal, though all of them stayed at their proposed encoding positions.

The UTC will need to consider and approve all these modifications, since the UTC had already approved the math symbols proposal in toto.

d. Komi Cyrillic

16 Komi Cyrillic characters were added at U+0500..U+050F. This required establishing a Cyrillic Supplemental block at U+0500..U+053F -- which was conveniently available there, courtesy of Joe Becker's guesstimate of about how much bizarre Cyrillic was likely to trickle in eventually. The UTC will need to consider and approve these.

e. (ZERO WIDTH) WORD JOINER

U+2060 WORD JOINER was approved by WG2, but without the "ZERO WIDTH" in its name. This was because no one on the committee, including the Unicoders, could get the phrase "ZERO WIDTH WORD JOINER" out of their mouths consistently, without mixing it up with "ZERO WIDTH JOINER". Rather than bequeath that problem to the world, we went with the simpler and less confusing "WORD JOINER". The UTC will need to approve this name change.

f. U+20B0 GERMAN PENNY SYMBOL --> GERMAN PENNY SIGN

A simple name change that the UTC needs to approve.

g. U+21F4 DOWNWARDS WHITE ARROW WITH CORNER LEFTWARDS --> U+23CE RETURN SYMBOL

This is the return symbol from JIS X 0213. The Japanese committee objected to the name, and various people pointed out that this really is a keyboard symbol and may either be represented as a filled or a hollow glyph, so the description as "WHITE ARROW" was inappropriate. To resolve the impasse, we renamed it and moved it into the misc. technical symbols block with the other keyboard symbols. The UTC will need to consider and approve this change.

Incidentally, *all* of the rest of the STIX-derived math symbols survived with no name or encoding changes -- which should be a boon to the people working on MathML.
So the front-loaded pain of name changes from the Beijing meeting paid off in stability in Athens.

h. U+30A0 KATAKANA DOUBLE HYPHEN --> KATAKANA-HIRAGANA DOUBLE HYPHEN

The Japanese committee objected to adding "KATAKANA" to their suggested name "DOUBLE HYPHEN", but when presented with the argument about the confusion with the meaning of "double hyphen" in Western typography, acquiesced in labelling it "KATAKANA-HIRAGANA", so as not to indicate a constraint on its usage to Katakana only. The UTC will need to approve this name change.

i. JIS X 0213 Compatibility ideographs

This batch of 57 (or is it 61?, or is it 60?) compatibility ideographs is a botched mess. The draft I brought to the WG2 meeting in WG2 N2263 contained 57 characters, based on WG2 N2197 (from the Japan NB), "Update: CJK Compatibility Ideograph request", minus the 4 duplicate radicals that the U.S. and the UTC declined to endorse.

Japan countered in Athens with WG2 N2273, misleadingly labeled as PDAM text, "AMENDMENT xx: JIS compatibility ideographs", which was provided as "a correction to N2142", and which Japan insisted was the correct input to the editor for the PDAM text. That document contains 60 [sic] characters, not the 61 previously advertised, including the 4 radicals in dispute. So it apparently is missing the one extra character that was included at the end of the list in WG2 N2197. (A compatibility form of U+9B2D, if anyone is still counting.)

At any rate, WG2 N2273 was designated to be used as input to the editor for this block of characters. So the UTC action now will have to be to draft comments on the PDAM text firmly requesting that the 4 duplicate radicals be taken out again (and presumably a conditional request to add back the one missing character from WG2 N2197, if it is really needed).

j.
Circled Numbers

Given that the UTC couldn't come up with a usable alternative, and given the Japan NB's insistence on getting an encoding answer, the extra circled numbers from JIS X 0213 were just dumped into the PDAM. This, by the way, also made the DPRK happy, since it covered part of the mapping requirements for their standard as well. The additions were:

In a gap in the Enclosed Alphanumerics block:

  U+24EB..U+24F4  negative circled 11 -- 20
  U+24F5..U+24FE  double circled 1 -- 10

In two gaps in the Enclosed CJK Letters and Months block:

  U+3251..U+325F  circled 21 -- 35
  U+32B1..U+32BF  circled 36 -- 50

At least it was possible to accommodate these in the otherwise useless holes in these enclosed symbols blocks, where like meets like. The UTC will need to consider and approve these encodings. Japan promised not to do this again. ;-)

k. Philippine Scripts

WG2 formally endorsed adding the 4 Philippine scripts into the PDAM. These are Tagalog, Hanunóo, Buhid, and Tagbanwa, and they were just plunked in with the names and encodings already approved by the UTC. So there is no UTC action needed on these, except possibly a huzzah that some more minority scripts have finally been progressed in 10646.

l. Recycling Symbols

WG2 approved the recycling symbols proposal, encoding the UNIVERSAL RECYCLING SYMBOL at U+2672, and the 7 specific plastic type symbols at U+2673..U+2679. The UTC will need to consider and approve these encodings.

m. UCS Sequence Identifier

In addition to the character encodings for Amendment 1, there are a number of important textual changes for 10646 that the UTC is interested in. The first of these is for the UCS Sequence Identifiers. WG2 endorsed the proposal that Uma wrote up, based on UTC input. The only change is that the USI's will be known as UCS Sequence Identifiers, rather than Unique Sequence Identifiers. The ambiguity of the term "unique sequence" was the reason for this change. Otherwise, the syntax and intent are exactly as the UTC proposed.

n.
U+ Notation

WG2 completely endorsed the proposed change for the U+ short identifier notation, as written up in WG2 N2234. So the PDAM will contain the text which allows 5- and 6-digit U+ notation, as suggested by the UTC.

o. Permanent Reservation (in process codes)

WG2 also accepted permanent reservation of U+FDD0..U+FDEF for internal processing purposes. So this also matches the UTC request.

p. WG2 acceptance of UTC proposed characters

WG2 accepted the addition of the terminal graphic characters and of the THAANA LETTER NAA, which were already approved by the UTC. So that puts WG2 in synch with the UTC on those characters.

====================================================================
3. DPRK Issues

The DPRK delegation (North Korea) was again present and active. Only 3 people showed up (instead of the 7 who came to the Beijing meeting), but they had a full set of revised documents for consideration.

Once again DPRK asked for name changes and reordering of all the Korean characters in 10646, and once again WG2 refused to do this. But this time, the delegates from Germany and from Sweden, who have been active in the WG20 work on ISO 14651 (International String Ordering) were able to spend a fair amount of time explaining the alternatives for ordering that make it unnecessary to insist on particular orders for encoding in 10646 itself. Particularly helpful was Kent Karlsson bringing a draft 14651-style collation tailoring showing what would be required to get an ordering that would match the DPRK requirements. Several people also spent time talking to the DPRK delegates about commercial and open source implementations of collation.

The DPRK also brought in a properly restructured request for additional character encodings to match the content of their national standard.
We were able to work with them offline to further pare down and refine that request, and we can expect them to come back with something for the next WG2 meeting that will be close to ready for prime time consideration, finally.

The major accomplishment, however, was one not actually on the formal agenda of documents, but clearly on the DPRK's actual agenda. WG2 authorized a continuing ad hoc group on Korean issues, inviting the DPRK, ROK, and any other interested parties to participate. Germany and Sweden will be participating, and it would make sense for someone from the U.S. more knowledgeable about the Korean script than I am to participate as well.

What this really is is a backdoor channel to work on IT technical issues that will result from the prospective reunification of Korea. It is quite clear to everyone that once the DMZ falls and the inevitable political and economic reunification occurs, the South Korean economy and technology will engulf the North -- probably even more decisively than what has happened in the West/East Germany reunification. But the DPRK technologists are looking for answers for interoperability during the transition. Basically they will just need conversion and collation solutions and tools, and need channels open to find out about such things while the politicians and diplomats are groping about for the bigger political solution.

So WG2 in this case was able to soft-pedal the confrontational approach that we saw at the Beijing meeting. Instead of "These proposals are ridiculous -- and we'll just say no", the approach from Athens turned more towards "We understand you have a number of technical problems in character encoding -- let's find a way to continue the dialogue and figure out mutually satisfactory solutions to the problems."

====================================================================
4.
Editorial Corrigenda & Publication Issues for 10646-1

WG2 resolution M39.5 bundles together a bunch of editorial corrigenda for glyphs, requests the Unicode Consortium to prepare the updated tables, and instructs the editor to forward the corrigenda to ITTF to publish as a Minor Revision to the standard. When that happens, it will put the 10646-1 charts back in synch with the revised charts that the UTC has posted for the Unicode Standard, reflecting all these glyph changes.

Discussion of this raised the sore point for WG2 about the delay that ITTF has caused in the publication of 10646-1:2000. As of the Athens meeting, no one yet had a published copy of 10646-1:2000, even though ITTF had the contents delivered to them in February. There were some hot-tempered comments about this state of affairs. It is certainly making ISO look bad, in comparison with the publication track record of the Unicode Consortium.

The whole issue was escalated up to the SC2 plenary for discussion, where it resulted in some careful editing of the text of the relevant resolution -- with the SC2 chair trying for a polite, nonconfrontational resolution, but with others pushing for a very strongly worded resolution *demanding* that ITTF get off its butt, so to speak, and get the thing published.

--Ken