SC22/WG20 N663

ISO/IEC JTC 1/SC22
Programming languages, their environments and system software interfaces
Secretariat: U.S.A. (ANSI)

ISO/IEC JTC 1/SC22 N2911

TITLE: Summary of Voting on Second FCD Ballot for FCD 14651: Information technology - International String Ordering and Comparison - Method for Comparing Character Strings and Description of a Common Tailorable Ordering Template

DATE ASSIGNED: 1999-04-16

SOURCE: Secretariat, ISO/IEC JTC 1/SC22

BACKWARD POINTER: N/A

DOCUMENT TYPE: Summary of Voting

PROJECT NUMBER: JTC 1.22.30.02.02

STATUS: WG20 is requested to prepare a Disposition of Comments Report and make a recommendation on the further processing of the FCD. Please note that the comments accompanying the vote from the Sweden Member Body were made available only in pdf format. The nature of pdf files precludes their being integrated into this summary and these comments are being distributed as document SC22 N2912.

ACTION IDENTIFIER: FYI

DUE DATE: N/A

DISTRIBUTION: Text

CROSS REFERENCE: SC22 N2844, N2912

DISTRIBUTION FORM: Def

Address reply to:
ISO/IEC JTC 1/SC22 Secretariat
William C. Rinehuls
8457 Rushing Creek Court
Springfield, VA 22153 USA
Telephone: +1 (703) 912-9680
Fax: +1 (703) 912-2973
email: rinehuls@digex.net

__________ end of title page; beginning of overall summary ___________

SUMMARY OF VOTING ON

Letter Ballot Reference No: SC22 N2844
Circulated by: JTC 1/SC22
Circulation Date: 1998-11-30
Closing Date: 1999-04-08

SUBJECT: Second FCD Ballot for FCD 14651: Information technology - International String Ordering and Comparison - Method for Comparing Character Strings and Description of a Common Tailorable Ordering Template

-----------------------------------------------------------------------

The following responses have been received on the subject of approval:

"P" Members supporting approval without comment     6
"P" Members supporting approval with comment        4
"P" Members not supporting approval                 6
"P" Members abstaining                              2
"P" Members not voting                              4

"O" Members supporting approval without comment     1
"O" Members not supporting approval                 1

------------------------------------------------------------------------

Secretariat Action:

WG20 is requested to prepare a Disposition of Comments Report and make a recommendation on the further processing of the FCD.

The comment accompanying the abstention vote from Australia was: "No expertise in this area."

The comments accompanying the affirmative vote from Austria, Canada, France and the United Kingdom are attached along with the comments accompanying the negative vote from Denmark, Germany, Ireland, Japan, the Netherlands and the United States of America.

The comments accompanying the negative vote from Sweden were provided only in pdf format and are being distributed as document SC22 N2912.
_____ end of overall summary; beginning of detail summary ______________

ISO/IEC JTC1/SC22 LETTER BALLOT SUMMARY

PROJECT NO: JTC 1.22.30.02.02

SUBJECT: Second FCD Ballot for FCD 14651: Information technology - International String Ordering and Comparison - Method for Comparing Character Strings and Description of a Common Tailorable Ordering Template

Reference Document No: N2844
Ballot Document No: N2844
Circulation Date: 1998-11-30
Closing Date: 1999-04-08
Circulated To: SC22 P, O, L
Circulated By: Secretariat

SUMMARY OF VOTING AND COMMENTS RECEIVED

                    Approve  Disapprove  Abstain  Comments  Not Voting
'P' Members

Australia             ( )       ( )        (X)      (X)        ( )
Austria               (X)       ( )        ( )      (X)        ( )
Belgium               ( )       ( )        ( )      ( )        (X)
Brazil                ( )       ( )        (X)      ( )        ( )
Canada                (X)       ( )        ( )      (X)        ( )
China                 ( )       ( )        ( )      ( )        (X)
Czech Republic        (X)       ( )        ( )      ( )        ( )
Denmark               ( )       (X)        ( )      (X)        ( )
Egypt                 ( )       ( )        ( )      ( )        (X)
Finland               (X)       ( )        ( )      ( )        ( )
France                (X)       ( )        ( )      (X)        ( )
Germany               ( )       (X)        ( )      (X)        ( )
Ireland               ( )       (X)        ( )      (X)        ( )
Japan                 ( )       (X)        ( )      (X)        ( )
Netherlands           ( )       (X)        ( )      (X)        ( )
Norway                (X)       ( )        ( )      ( )        ( )
Romania               (X)       ( )        ( )      ( )        ( )
Russian Federation    (X)       ( )        ( )      ( )        ( )
Slovenia              ( )       ( )        ( )      ( )        (X)
UK                    (X)       ( )        ( )      (X)        ( )
Ukraine               (X)       ( )        ( )      ( )        ( )
USA                   ( )       (X)        ( )      (X)        ( )

'O' Members Voting

Korea Republic        (X)       ( )        ( )      ( )        ( )
Sweden                ( )       (X)        ( )      (X)        ( )

___ end of detailed summary; beginning of Austria comments ___________

From: Martin Schoenhacker
Subject: Austrian vote on N2844

The Austrian vote on SC22 Letter Ballot N2844 - Approval for FCD 14651 - Method for Comparing Character Strings and Description of a Common Tailorable Ordering Template is YES with an editorial comment:

On page 35, paragraph 4, line 1, word 2 should read: "public-domain" rather than "plublic-domain".
_____ end of Austria comments; beginning of Canada comments ____________

From: Doug, Langlotz
Subject: REVISED Canadian response 22N2844

Canada SUPPORTS the document with the attached comments:

"Canada wants to make sure that relative weights in the template respect those of special characters as in Canadian standard CAN/CSA Z243.4.1.-1999. Furthermore Canada insists that this International Standard shall allow a minimal Canadian delta to be declared without having to do prehandling (the delta shall be specifiable simply by declaring a modification of the table), in order to fit with Canadian industry practice. Canada will not accept any change that would jeopardize that objective."

_____ end of Canada comments; beginning of Denmark comments _________

From: Pia Junker Hviid
Subject: Danish vote on JTC1/SC22, N 2844

We can inform you that Denmark votes NO on ISO/IEC FCD 14651, N 2844 with the following comments:

1. The main table should be included in the standard verbatim.

2. The weights on the second level should include a weight for all letters with accents, to ensure as equal treatment as possible of fully composed characters and split-up characters, in non-normalized text. This addresses 6.1.1 note 1, which should be removed.

3. In clause 5, the notation "UXXXXXXXX" should also be allowed.

4. In the main table, the control characters of ISO/IEC 6429 C0 and C1 should be included, and ISO 6429 be added to clause 3, references.

5. In 6.2.2.2 description of level 1, please change "basic letter" to "first-level letter". Many basic letters of, for example, the Latin script are not sorted uniquely at level 1, e.g.: , , . Also for the description of 2nd level: it is culturally dependent what "diacritics" means, and the term should be avoided in an international standard. For example "" and "" are not diacritic letters, but base letters, in some languages. There is no diacritic in these letters.

5. In 6.3.1 - the BNF should be terminated with a semicolon.

6. In 6.3.1 rule 13 should also allow for a ''

7. 6.3.1 and 6.3.2 should be explained in terms of a narrative description as in the 14652 LC_COLLATE category specification.

8. 6.3.1 should be aligned with the 14652 BNF for LC_COLLATE, also in terms of terminology used.

9. There should be tokens "LC_COLLATE" and "END LC_COLLATE" to surround the whole specification in 6.3.1.

10. 6.3.1 rule 8: space should consist of one or more spaces or tabs.

11. 6.3.1 rule 28: The name should be "section-symbol".

12. In 6.4 references to 6.3.1 terms should be in italic.

13. The examples with reorder-after should use "-" instead of "_" in the keywords.

14. 6.5 - The name should follow ISO/IEC 15897 naming.

15. In Annex B.1 the line 5 should have <> around TABLE, as in order_start ;....

16. Annex B.2: change "assumption that character mnemonics are resolved into UCS identifiers" to "mnemonic identifiers for UCS defined in ISO/IEC 14652"

17. Key generation on-the-fly should be described, e.g. as a note at the end of 6.1.2, saying that comparison with keys generated on-the-fly character for character is an equivalent way of implementing the key generation, and may eliminate elaborate key generation when a difference is to be found in the first few characters.

18. Position should be specifiable on all levels, as it is legacy from POSIX.

19. Toggles "ifdef" etc. as in 14652 should be reintroduced.

20. The conformance clause needs to be reformulated. It should not be possible to claim conformance to 14651 if full tailoring is not available with the application. That would mean that e.g. Danish specifications cannot be accommodated by the application and that defeats the main purpose of this standard. The conformance clause does not read as English. It should also be possible for a specification to claim conformance - possibly in the way of 6.4 tailoring.

21. The Danish test data in annex B should be replaced with the following:

A/S ANDRE ANDR ANDREAS AS CA A CB C DA A DB C DSB D.S.B.
DSC EKSTRA-ARBEJDE EKSTRABUD EKSTRAARBEJDE HST HAAG HNDBOG HAANDVRKSBANKEN Karl karl NIELS JRGEN NIELS-JRGEN NIELSEN RE, A REE, B RE, L REE, V SCHYTT, B SCHYTT, H SCHTT, H SCHYTT, L SCHTT, M SS SSA STORE VILDMOSE STOREKR STORM PETERSEN STORMLY THORVALD THORVARDUR ORVARUR THYGESEN VESTERGRD, A VESTERGAARD, A VESTERGRD, B BLE BLE BERG OBERG

______ end of Denmark comments; beginning of France comments __________

TITLE: French ballot comments on ISO/IEC FCD 14651 - Method for Comparing Character Strings and Description of a Common Tailorable Ordering Template
SOURCE: AFNOR
DATE: 1999-04-08

France votes YES on FCD 14651, with the following comment:

Insufficient effort has been made to define an acceptable ordering for some lesser-used scripts. A lot of scripts are actually ordered based just on Unicode code values. When WG20 can find some existing practice of a culturally accepted ordering not conflicting with another one, these practices should be included in the FCD 14651 default template ordering. We suggest that experts of those scripts should be invited to define a correct default ordering.

For example, this is the case for Tamil (like most other Indic scripts) and Thai scripts, where evidence of existing practice has been demonstrated and no evidence of other equally valid practice has been found.

However, considering these issues are more of a concern for national bodies where those scripts are in widespread use, and even if there is a Tamil community in the French territory Reunion Island, we suggest that this work should be done, perhaps in a future amendment to this forthcoming standard.

As the same problem exists with any new codepoints added in the UCS, we also suggest that we should contact ISO/IEC JTC1/SC2/WG2 to ensure the existing procedures to register new characters are adjusted to include the information needed to update the forthcoming collation standard.
_____ end of France comments; beginning of Germany comments _________

The German member body vote is "No" with comments. If the technical comments are resolved satisfactorily, the German "No" vote will be changed to a "Yes" unless other significant changes are made to the standard in an unsatisfactory way.

Introduction

General

Germany wishes to thank the editor for many fundamental improvements of this draft over the previous FCD. They greatly increase the usefulness of the future standard and render void many essential German concerns.

German comments touch upon two principal points:

Technical comments on the body of the draft and on Annexes B-E;
Comments on the normative Common Template Table (Annex A).

Germany does not comment on matters of English style as it is expected that this will be improved by native English speakers. Lack of explicit comments on this should not be taken as endorsement of a style that is, as yet, not always a paragon of clarity. There are many paragraphs where "loose ends" are noticeable, caused probably by numerous cuts and reworkings over time.

Furthermore, Germany does not comment on purely typographic deviations from the ISO drafting rules (e.g. semicolons ought to be used to terminate items of unordered lists). It is confident that these points will be addressed by the editor at a later stage.

Comments on the body of the draft

Introduction, 2nd paragraph
This paragraph should best be removed altogether, or at least reformulated in such a way that it does not imply any more that the syntax of the Common Template Table (hereafter CTT) is in any way normative. The current formulation of the whole paragraph is unfortunate in this respect. The draft does not -- and must not -- mandate that conformant applications can either directly exchange ordering specifications or even use the CTT in the syntax used in Annex A. To stress this point, it is advisable to add another annex with the specification of another possible syntax.
The XML-conformant Swedish suggestion can serve as a useful starting point.

Introduction, 4th paragraph
Remove 2nd sentence.

Scope: 1st dash
Remove text in brackets ["(independently of coding)"]. Change the formulation in the remainder of that paragraph to stress that mappings from ISO/IEC 10646 to any other coding scheme are also permissible.

Scope: 2nd dash
Remove phrase "using a variant of the Backus-Naur Form (BNF)" as the reference format as such does not use the BNF. It is simply defined using the BNF syntax.

Scope: Note
Remove note.

Scope: Additions
Add an entry under the heading "This International Standard does not mandate" to stress that no preparatory procedures are prescribed, but are normally necessary. Give a reference to Annex C.

Definitions: 4.9
The term depth does not elucidate the problem but rather explains an X with a Y. Either define the term or choose a different formulation.

Definitions: 4.10
The reference comparison method should be defined or explained in more detail before.

Definitions: 4.11
In the context of this draft the "set of strings" can always be understood as having one and only one member (no preparatory procedures are part of the standard itself). Therefore change the formulation accordingly.

Definitions: 4.11 (suggestion)
Replace the word order by sequence and reformulate the phrase accordingly.

Symbols and abbreviations
Simplify the matter of code-dependence on ISO/IEC 10646. Any application is conformant that is able to achieve identical results as those of section 6, but not necessarily in the same way. A mapping between some encoding system and the UCS and back can be seen as a special case of the preparation of character strings (cf. 6.1.1) and of the presentation of the resulting sequence after ordering. Therefore, without loss of generality, a character can be seen as being part of the UCS.
In consequence, the 2nd paragraph except the last sentence should be removed and the 3rd paragraph can be reformulated accordingly, i.e. it can refer to the private-zone UCS coding without further preconditions.

Requirements: 6.1.1
Clarify 1st sentence of the 2nd paragraph. Recommendation: At minimum, the preparation shall guarantee that either only precomposed characters or only combining sequences, which in the context of the conformant application are deemed equivalent, are presented to the comparison method ...

Requirements: 6.2.2.1
This section is not explained in the necessary detail and clarity. Concepts like stacks are suddenly implied ("stacking of the token will be done"), push and pop operations appear. None of these operations have been referred to before nor are they explicitly used thereafter. Technically, the algorithm which the editor obviously has in mind is, of course, correct. It should, however, be elaborated in more detail. The reader whom the editor should have in mind here is the programmer who knows basic devices, but has never worked on ordering. Typographically, it is difficult to understand why the three paragraphs in question are printed with indentation.

Requirements: 6.2.2.2
The part from Generally to the end should be handled as a note or alternatively as a section (6.2.3) of its own.
Level 3: The topic of variant character shapes ("modified letters") must be dealt with on level 2 to ensure maximal compatibility with pan-European requirements. It has no conceptual likeness to "case" and is not normally used on level 3 (cf. also the tailoring of Informative Annex B.1).

Requirements: 6.3.2
Make all text of the explanatory [I.e....]-statements into notes to stress their informative character or consider other means to achieve that end. Such a solution might be to add an informative annex that explains these and other points which concern the syntax of the CTT.

Requirements: 6.3 and WF1
hex_symbol's are not defined.
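Editorial illustration (not part of the German comments): the stacking step questioned in the comment on 6.2.2.1 can be pictured with a short Python sketch. It is only one possible reading of the quoted phrase "stacking of the token will be done": tokens from a backward-scanned block are inserted at the position where the change of direction occurred, so that a backward run ends up reversed in the subkey. The function name `build_subkey` and the `(weight, direction)` token representation are assumptions made for this example and do not come from the FCD.

```python
def build_subkey(tokens):
    """Form one subkey (one level) from (weight, direction) pairs.

    `tokens` lists the weights in string order. `direction` is either
    "forward" or "backward" (a hypothetical encoding of the scanning
    property of the block each character comes from).
    """
    subkey = []
    run_start = 0          # position where the current backward run began
    previous = "forward"
    for weight, direction in tokens:
        if direction == "backward":
            if previous != "backward":
                # Change of direction: remember where stacking starts.
                run_start = len(subkey)
            # "Push": the newest token goes in front of the run.
            subkey.insert(run_start, weight)
        else:
            subkey.append(weight)
        previous = direction
    return subkey


# A backward run (2, 3) between two forward tokens comes out reversed.
print(build_subkey([(1, "forward"), (2, "backward"),
                    (3, "backward"), (4, "forward")]))
```

In this reading no explicit pop is needed: inserting at the remembered position has the same effect as pushing the run onto a stack and popping it when the direction changes back.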
Requirements: 6.3.3, items I4 to I6
The terms normal form, evaluated [weight table] and collation-element-weighted are implicitly defined here, but are used nowhere else. Either the definitions are considered to be of sufficient importance to be included in the "Definitions"-section proper or they should be removed altogether. In part, they can also be incorporated in the specifications themselves, as they explain some requirements more concisely than the corresponding specification itself.

Requirements: 6.4
Remove 2nd sentence of 1st paragraph.

Annex B.2
Align the presentation of the delta with that of Annex B.1 (as it stands the presentation is not conformant to 6.4) and remove all references to the mnemonics which are altogether irrelevant in this context.

Annex C (general)
Add a remark on the importance of higher level protocols (e.g. markup system SGML) for correct evaluation of numerals and other prehandling objects (e.g. units -- keys -- in a phone book). Context rarely suffices to achieve anything like total certainty. Many of the tasks are quite trivial if we assume an internal tagging like ^-9^ (cf. C.2.4), but bordering on the impossible to solve reliably without them (in C.2.4 the word Temperature: can be regarded as an implicit tag, but most texts are not nearly as schematic as the examples in this annex assume).
It is to be considered if Annex C really needs to be quite as detailed and extensive as it currently is.

Annex C.1, 1st dash (minor)
Why are the names of the strings in capitals?

Annex C.1, 2nd dash (minor)
The example text is somewhat obscure (e.g. the remark "according to noble origin or not" presupposes knowledge that this is of importance when ordering).

Annex C.2
The text needs to be clarified to some extent (e.g. what are "Run-together numerals"?).

Annex C.2.2
A cautionary note should be added to stress that these preparatory steps have in some cases (e.g.
ordering of telephone numbers in phone books) undesirable consequences and should then be avoided.

Annex C.2.3, 3rd paragraph
The 2nd sentence ought to be modified. "Total certainty" can rarely be achieved even with information on the context.

Annex D, item V.2
Change the formulation of the last sentence of the 1st paragraph. German dictionaries usually employ the German norm DIN 5007. Some dictionaries explicitly refer to this norm, others simply use it without further clarification, still others explain their ordering principles in some detail.

Annex D, item V.3
Remove the phrase "for the first time" in the fourth paragraph.

Annex D, item VII
Remove this item.

Comments on Annex A: Common Template Table

General: Names of internal symbols
Either reduce all names to a maximum of five letters for consistency or (preferably) give less cryptic names to all of them (e.g. ^ instead of ^ and ^ instead of ^). Names should best be derived from their description in the UCS.

Variant letter shapes
As mentioned above, variant letter shapes must be distinguished on level 2 instead of level 3. Letters such as F WITH HOOK (^) should best be treated as second level letters. Ideally, only a-z and thorn should be treated as first level letters, though Germany sees this last statement as a strong suggestion for discussion.

Relative order of scripts (point of discussion)
It is seriously to be considered if the relative order of scripts should not follow a general East-to-West scheme as proposed by the last UK comments. This could easily be achieved by "internal tailoring" of the CTT as already done for the special characters of CAN/CSA Z243.4.1-1998. Germany sees this, however, only as a strong suggestion for an internal discussion in WG20.

Script: Greek
Maximum compatibility with the specifications of ELOT as presented in WG20/NXXXX is to be sought. To achieve this the breathing marks Psili and Dasia should precede the other diacritics. This is also in line with usual Greek (cf.
the study CEN/TC304/Nyyy). COMBINING COMMA ABOVE and COMBINING REVERSED COMMA ABOVE (with which Psili and Dasia are -- unwisely -- unified in the UCS) are diacritics which appear infrequently in languages other than Greek, whereas in Greek they are very frequent indeed. Cf. also the approach of the E.

Script: Cyrillic
The order for Cyrillic is not in line with pan-Cyrillic requirements and contains numerous errors. The sequence must be brought in line with the specifications from GOST as reflected in the current edition of the European Ordering Rules (cf. EOR). Detailed documentation both from GOST itself and from other sources will be made available to WG20 before the May meeting.

Script: Georgian
The ordering of Georgian should be coordinated with the results of ongoing discussion with experts in the field both from Georgia itself and in academic organizations.

_____ end of Germany comments; beginning of Ireland comments _______

Irish comments on FCD ISO 14651
Reference: SC22 N2844
Closes: 1999-04-08
Date: 1999-04-08

Although Ireland voted positively on the draft on 1998-01-26, we now wish, because of subsequent review of the document, to reverse our position. Ireland votes No on the FCD draft. Many of our objections are editorial in nature, and we believe that our No vote can be turned back to Yes easily if the following points are addressed appropriately by SC22/WG20:

1 The English text must be revised so that it is in all cases unambiguous and grammatically correct.

2 Informative text in the Common Template must be revised so that the implication is not made that French backwards-ordering of accents is not a special case.

3 The assertion that small letters ordered before capital letters is the normal practice for the English language is not made and is removed from informative annex D.

4 The Canadian and Danish example benchmarks must provide enough examples to interpret the specifications from which they are derived.
5 The Common Template should contain orderings for all Amendments to 10646 up to Amendment 31, not up to Amendment 7. Ogham, Cherokee, and Runic are already in order (except for the Ogham and Runic punctuation); Canadian Syllabics will require some work to get it right.

1. We have remarked on earlier drafts of this International Standard that the use of the English language is in many cases either ambiguous or grammatically incorrect. We had offered to prepare a corrected version, but because text was not provided to us in time before the last meeting of WG20, we were forced to withdraw our offer of making the corrections. We offer now again to provide a new version with document revision annotations. We feel strongly about this because in reviewing the draft, we were often forced to stop and read aloud certain passages in order to decipher the intended meaning.

Examples of grammatically incorrect or ambiguous sentences (corrections are shown as {original -> corrected}):

1 It is demonstrated that by tailoring the Common Template Table to add extra token values at level 2 for all precomposed characters affected by a {diacritics -> diacritic}, it is possible to accomplish identical results for combining sequences without requiring that preparation.

2 The scanning properties for the level i being processed needs to be carefully monitored. When there is a change in scanning direction at level i (this {implies -> implying} that the character being processed comes from a block {that -> which} is different from the preceding character processed and which has different scanning properties) and the new direction is backward, stacking of the token will be done at the position where the change of direction has occurred.

3 If the order_start_entry does not {uses -> use} the position value at level m of a block (the position value is explicitly used in the template for the only block defined) then the formation of subkey level m is done in exactly the same way as the above-defined formation.

4 WF7.
No two {section_definition_entry's -> instances of section_definition_entry} in a tailored_table may contain the same values in their {section_identifier's -> instances of section_identifier}. [{I.e. -> That is,} multiple definition of section's is prohibited; {section_identifier's -> instances of section_identifier} must be unique.]

5 [{I.e., -> That is,} if one takes two strings, builds keys for each based on table 1 and compares them, one should always get the same results as when one builds keys for them based on table 2 and {compare -> compares} them.]

6 In cases where {the applications -> an application} has provision to allow the end-user to tailor the table himself or herself, any statement of conformance shall indicate which ones of the 4 elements of the previous list are tailorable and which ones are not tailorable.

7 Whenever the Common Template Table is {refered -> referred} externally as a starting point in a given context, either applicative or contractual [WHATDOESTHISMEAN???], it shall be referenced using the name ISO14651_1999_TABLE1.

8 For very {big -> large}, or very {tiny -> small}, values, one often uses formats like 2.5*107 (to just pick one possible way of writing these for the purposes of the examples here).

9 But the Common Template Table {has digits as -> specifies digits to be} level 1 significant.

10 Such processing is beyond the scope of this International Standard, {though -> however}.

11 A {plublic-domain -> public-domain} reduction technique is described in {details -> detail} (with {ample -> numerous} examples) in Technique de réduction - Tris informatiques à quatre clés, Alain LaBonté, Ministère des Communications du Québec, {June 1989 -> 1989-06} (ISBN 2-550-19965-0).

12 To illustrate this (without discussing context analysis which is not necessary in what follows), examples of dictionary sequences are given here for two languages {which -> whose} native order is not in the Common Template table:

2.
The Common Template states:

% To tailor for French accent handling, or not to make French
% a special case add an order_start statement
% and order_end for Latin in the Latin section, as follows:
% order_start Latin;forward;backward;forward;forward,position

In Ireland we consider French to be a special case, which in fact yields incorrect sorting for our first official language, and we disagree with the implication here, namely, that "not making French a special case" does no harm. French is a special case of the default template, just as Danish and Swedish are. The Common Template must read:

% To tailor for French accent handling, add an
% order_start statement and order_end for Latin
% in the Latin section, as follows:
% order_start Latin;forward;backward;forward;forward,position

3. Annex D states:

    3. The third decomposition breaks ties for quasi-homographs different only because upper-case and lower-case characters are used. This time, the tradition is well established in English and German dictionaries, where lower case always precedes upper case in homographs, while the tradition is not well established in French dictionaries, which generally use only accented capital letters for common word entries. In known French dictionaries where upper and lower case letters are mixed, the capitals generally come first, but this is not an established and stated rule, because there are numerous exceptions.

This is, as we have said many times to SC22/WG20, incorrect. Lower case does not precede upper case in English.
The concise Oxford dictionary of current English, cited in the JTC1 and CEN directives as a standard for the English language, consistently gives, in its 8th edition (1990) and its 9th edition (1998) the following:

August (month)        May (month)
august (venerable)    may (be able)
March (month)         Polish (of Poland)
march (tread)         polish (shine)
Mass (ritual)
mass (heap)

So for a Common Template it is advisable to use English and German traditions, if one wants to group the largest possible number of languages together.

This rationale is therefore unacceptable, as it is untrue. The reason the Common Template has smalls before capitals (which we do not prefer) is because that is what is specified in the Unicode template. This text must be revised.

Let's note here by the way that in Denmark, upper case comes before lower case, a different but well established rule. This is a second fact calling for adaptability in the model used in this standard.

This same rule is used for the English language.

Example: to have the following order: "august", "August", numbers could be assigned indicating respectively "llllll", "ulllll", where "l" means lower case and "u" upper case.

This example is not sufficient. The actual syntax for ordering smalls before caps which appears in the Common Template should be repeated here, along with the actual syntax for ordering caps before smalls.

4. The Canadian delta specifies treatment of THORN and ETH but the benchmark does not contain examples containing these characters. Please add: orsmsrk, Thorvardur, orvarur, medal, meal. The Danish benchmark examples of REE and RE are not sufficient to demonstrate E vs. . Please add more examples as well as examples of such as Ree and Re.

5. The draft is a bit overloaded with references to English, French, and German. A few more examples from other languages would be preferred.
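Editorial illustration (not part of the Irish comments): the "llllll" / "ulllll" example discussed in comment 3 can be made concrete with a small Python sketch. It is not the Common Template syntax that Ireland asks to have repeated in Annex D. It only shows, under simplifying assumptions, how a case pattern used as a tie-breaker yields either smalls-before-capitals or capitals-before-smalls. The names `case_pattern` and `sort_key` and the `lower_first` flag are hypothetical, and only unaccented letters are handled.

```python
def case_pattern(word, lower_first=True):
    """Return a per-letter case weight tuple (the "llllll"/"ulllll" idea)."""
    low, up = (0, 1) if lower_first else (1, 0)
    return tuple(up if ch.isupper() else low for ch in word)


def sort_key(word, lower_first=True):
    """Compare on the case-folded letters first, then break ties on case."""
    return (word.lower(), case_pattern(word, lower_first))


words = ["august", "May", "August", "may"]

# Smalls before capitals, as in the Common Template.
print(sorted(words, key=sort_key))
# Capitals before smalls, as in the Oxford dictionary entries cited above.
print(sorted(words, key=lambda w: sort_key(w, lower_first=False)))
```

Switching a single flag is enough here because case is consulted only after the letters themselves compare equal, which is why either tradition can be expressed as a small change to the same table.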
______ end of Ireland comments; beginning of Japan comments __________

From haruhana@itscj.ipsj.or.jp Thu Apr 8 11:42:58 1999
Subject: Japan's vote on SC22N2844

Comments on FCD 14651.2

The National Body of Japan disapproves FCD 14651.2 for the reasons below. If the comments are satisfactorily resolved, Japan will change its vote to approval.

J.1) Global:

This draft contains many errors and is too difficult to understand because it throws away a great deal of the material developed in FCD 14651.1 and the LC_COLLATE section in FCD 14652.1. Japan agreed to make FCD 14651.2 independent of 14652 assuming that the well discussed and sophisticated part of 14652 would be imported in the second FCD thus enabling us to review it as FCD. But the current draft is far from that.

We request to put it back to a mixture of FCD 14651.1 and the LC_COLLATE section in FCD 14652.1 which have been studied by many people.

If our request is rejected, the project should be put back to the CD stage.

J.2) Global:

There are many inconsistencies about tailoring and "delta". Japan considers that the following principles should be reconfirmed in the FCD disposition before any other detailed discussion:

a) The Common Template Table (CTT, hereafter) is not a table to be used by the ordering method -- the CTT always needs tailoring.
b) Tailoring is always described as a delta to CTT.
c) The tailored table is a result of applying a delta to CTT.
d) The tailored table is a table assumed in the reference method description.
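Editorial illustration (not part of the Japanese comments): principles b) to d) describe the tailored table as the result of applying a delta to the CTT. A minimal Python sketch of that relationship follows. It does not use the syntax of clauses 6.3 or 6.4; the table is simplified to an ordered list of collating symbols, and the delta to a list of "move these symbols after that anchor" operations. The function name `apply_delta` and the symbols are hypothetical.

```python
def apply_delta(template, delta):
    """Return a tailored table; the template itself is left unchanged.

    `template` is a list of symbols in collating order.
    `delta` is a list of (anchor, symbols) pairs, each meaning
    "place `symbols`, in the given order, right after `anchor`".
    """
    table = list(template)              # work on a copy of the template
    for anchor, symbols in delta:
        for symbol in symbols:
            if symbol in table:
                table.remove(symbol)    # a reordered symbol leaves its old slot
        position = table.index(anchor) + 1
        table[position:position] = symbols
    return table


template = ["a", "b", "c", "d"]
tailored = apply_delta(template, [("d", ["b"])])
print(tailored)    # "b" now collates after "d"
print(template)    # the template is unchanged
```

The ordering method would then be run against `tailored`, never against `template` directly, which is the point of principle a).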
J.3) p.iv, Introduction, the first sentence:

The sentence

    This International Standard provides a method for ordering text data worldwide, and provides a Common Template Table whose tailoring eases adaptation of a specific script while retaining universal properties for other scripts

should be changed to

    This International Standard provides a method for ordering text data worldwide, and provides a Common Template Table whose tailoring eases adaptation for culturally specific handling of some scripts with minimal efforts.

because tailoring of the Common Template Table usually deals with two or more scripts and the wording "universal properties for other scripts" may be interpreted as if there were a universally accepted set of collating properties for each script.

J.4) p.1, 1 Scope, bullet 1:

In the first bullet

    - A simple method of reference for comparing two characters strings in order to determine their respective order in a sorted list. The method is applicable on strings that exploit the full repertoire of ISO/IEC 10646 (independently of coding).

"10646" should be changed to "10646-1" because the syntax "Uxxxx" allows reference only to the BMP.

J.5) p.1, 1 Scope, bullet 1:

The sentence

    This method uses transformation tables derived from either the Common Template Table defined in this International Standard or from one of its tailorings.

should be changed to

    This method uses transformation tables derived from table specifications tailored from the Common Template Table defined in this International Standard.

because the Common Template Table without tailoring should not be used as a source of transformation tables.

J.6) p.1, 1 Scope, bullet 4:
     p.11, 6.5 Name of the Common Template Table:

The fourth bullet in the scope and the subclause 6.5 should be removed because defining the reference name for Common Template Tables is not a matter of this standard but a matter of the referencing systems.
NOTE) The addition of the reference name does not depend on the NB comments to the first FCD.

J.7) p.1, 1 Scope:
Add a bullet
    - Requirements for a declaration of the differences between the comparison table used in applications and the Common Template Table,
in order to cover the contents of subclause 6.4.

J.8) p.2, 2. Conformance:
An application is not an appropriate target for defining conformance. We propose to define the conformance of "a text data", "an ordering service with built-in table", and "an ordering service without built-in table" as follows:

    2 Conformance
    The order of a text data according to a declared tailored table is conforming to this International Standard if the text data coincides with the output of the referenced method prescribed in clause 6 with some input data and the tailored table input.
    An ordering service with a built-in and declared tailored table is conforming to this International Standard if the order of each output for an input data according to the built-in tailored table is conforming to this International Standard.
    An ordering service without built-in table is conforming to this International Standard if the order of each output data for a pair of an input data and a declared tailored table is conforming to this International Standard.

J.9) p.2, 2 Conformance:
NOTE: This comment need not be considered if comment J.8 is accepted.
The sentence
    More specifically, it is the responsibility of implementers to show how their delta declaration is related to the table syntax described in clause 6.3, and how the comparison method they use.
should be simplified to
    More specifically, it is the responsibility of implementers to show how their delta declaration is related to the table syntax described in clause 6.3.
because the phrase "how the comparison method they use" is not grammatically correct, and implementers need not disclose their internal mechanisms as long as their outputs are conforming.

J.10) p.2, 2 Conformance:
NOTE: This comment need not be considered if comment J.8 is accepted.
The sentence
    Any declaration of conformity to this International Standard shall be accompanied by a declaration of the tailoring delta described in clause 6.4 in case tailoring is not provided by the concerned application
should be changed to
    Any declaration of conformity to this International Standard shall be accompanied with a declaration of the tailoring delta described in clause 6.4
because the Common Template Table does not work without tailoring. If this request is rejected, the words "in case" in this sentence should be replaced by the word "unless".

J.11) p.2, 2. Conformance, 2nd para.:
NOTE: This comment need not be considered if comment J.8 is accepted.
The last sentence, which lacks a subject, should be removed because it is covered by the first sentence of this clause.

J.12) p.3, 4.7 "glyph", 4.8 "graphic character":
The second sentence in 4.8 "graphic character" should be removed because its meaning is already introduced in the first sentence by "that has a visual representation ..."
The definition 4.7 "glyph" should be removed because it is used only in 4.8; the first part of the following UK comment on the first FCD
    A definition of "glyph" is required (Clause 4 para 3) if this term is to be used. Alternatively, the use of the term "graphic symbol" (as in ISO/IEC 10646, section 4.19) may be preferable.
therefore becomes meaningless.

J.13) p.4, 6.1.1 Preparation of character strings:
Subclause 6.1.1 should be moved out of subclause 6.1 (say, into a new clause 7) because it discusses matters outside the reference method.

J.14) p.4-7, 6.2 Building the ordering key used in the reference comparison method:
Although there are descriptions for building subkeys, there is no description for building a numeric key to be used in 6.1.
Japan considers that the drastic change of the algorithm from the first FCD produced many fatal deficiencies. Japan recommends to put back the whole content as a merge of FCD 14651.1 and the related part of CD 14652. J.15) p.7, 6.3 Common Template Table: formation and interpretation: The relation between the syntax defined here and the semantics in the previous subclause is too poor as a standard and this subclause 6.3 contains many errors in itself. See the detailed comments below. J.15-1, Global) The production rules should be presented in a top-down manner. J.15-2, Global) The names of the terms should be exactly the same as are used in other places e.g. the name "untailored_template_table" in Rule 46 should be changed to "common_template_table". J.15-3, Rule 44) The two lines in CTT section CANSpecials and reorder-section-after CANSpecial are illegal according to the BNF. They should be changed as simple_line's or they should be removed from CTT. J.15-4, Rule 24, 20) The multiple symbol weight definition in CTT such as .. ..;;;.. is illegal according to the BNF. The production rules should be supplied J.15-5, Rule 24) "line_completion" should be removed. J.15-6, Rule 14, 13, 12, 11, 5, 6) From the current definitions, all the ucs_symbols are recognized also as simple symbols. J.15-7, Rule 41, 40) The lines consisting of "line_completion" only are recognized as "simple_line" and "tailoring_line". J.15-8, Rule 38) Remove the second appearance of "space" in order to match with CTT. J.15-9, Rule 38) There is no explanation throughout this document for the use of "identifier" here. J.15-10, Rule 28) "line_completion" should be removed. J.15-11, Rule 29) "line_completion" should be removed. J.15-12, Global) The functionality which is supported by "collating-element" should be supported as a tailoring line. J.15-13, Rule 1, 10) Make clear that "line_delimiter" is not included in "character". 
J.15-14, Rule 43) This production rule should be removed because it is not referenced. J.15-15, WF1) This condition should be modified to WF1. Any "simple_symbol" occurring in a "multiple_level_token" must be defined in a "symbol_definition" line in the table. because there may be a "symbol_weight_entry" such as ;;; where , , , or needs to be greater than . J.15-16, WF1) The term "hex_symbol" does not appear in BNF. It should be changed to "ucs_symbol". J.15-17, WF2) This condition should be replaced by an explanation An empty level_token shall be interpreted as the collating element itself. in the same way as POSIX because the current condition prohibits defining a collation which needs more than four levels. If this proposal is rejected, the sentence All multiple_level_token's in a tailored_table must contain the same number of delimited_level_token's should be changed to All multiple_level_token's in a tailored_table in a normal form (see I4 later) must contain the same number of delimited_level_token's J.15-18, I1) The text should be changed as follows: I1. There are two types of sections. One type, "simple definition", consists of the list of simple_line's following a section_definition_simple_entry in a tailored_table. Another type, "list definition", is defined by a "section_definition_list_entry". It is equivalent to a "simple definition" consisting of a list of "symbol_definition" lines which are regarded as an expansion of the symbol_list. Example) section FOO ;; is equivalent to section (non simple line) J.15-19, I2, I3) Usage of the word "same" here is confusing. J.15-20, I2, I3, I4) The explanations for tailoring here need some improvements because applying a number of operation sequentially causes a problem of their order and side-effects. For example, when a symbol in CTT is redefined by a "reorder-after" directive and the symbol is a target symbol in a successive operation, it is not clear which position, old one's or new one's, is preferred. 
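The ordering problem raised in J.15-20 can be made concrete with a small Python sketch. Both functions below are hypothetical readings of a "reorder-after" operation invented for this illustration; neither is claimed to be the behaviour intended by FCD 14651, and the four-symbol table is toy data.

```python
def reorder_sequential(table, ops):
    """Reading 1: each target is looked up in the table as already
    modified by the earlier operations (its new position)."""
    table = list(table)
    for target, symbols in ops:
        for sym in symbols:
            table.remove(sym)
        pos = table.index(target) + 1
        table[pos:pos] = symbols
    return table

def reorder_against_original(table, ops):
    """Reading 2: each target is looked up at its position in the
    original, unmodified table (its old position)."""
    original = {sym: i for i, sym in enumerate(table)}
    rank = {sym: (i, 0, 0) for sym, i in original.items()}
    for op_no, (target, symbols) in enumerate(ops, start=1):
        for k, sym in enumerate(symbols):
            # Moved symbols sort right after the target's original slot.
            rank[sym] = (original[target], op_no, k)
    return sorted(table, key=rank.__getitem__)

TABLE = ["A", "B", "C", "D"]
OPS = [("D", ["B"]),   # move B after D
       ("B", ["C"])]   # then move C after B, which has itself been moved

new_position = reorder_sequential(TABLE, OPS)
old_position = reorder_against_original(TABLE, OPS)
```

Under these assumptions the same two operations give A, D, B, C with the first reading and A, C, D, B with the second, which is the kind of divergence the comment asks the text to rule out.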
J.15-21, I5) It should be explained how to deal with multiple occurrences of a symbol to be evaluated -- e.g. only the last one should be valid. J.15-21, I6) The term "hex_symbol" does not appear in BNF. J.15-22, I6) The sentence All hex_symbol's are assumed to map to an integral weight value equal to that hex_symbol interpreted as a hexadecimal number is a source of problems. The term "hex_symbol" does not appear in BNF. If hex_symbol's are equivalent to ucs_symbol's or ones like in CTT, the sentence is wrong because ucs_symbol's and ones like should be numbered in the sequence of table lines along with simple_symbol's and their numbers have no relation with the hexadecimal values except the incremental nature in each range specification. J.15-23, I6) The sentence All hex_symbol's (ucs_symbol in our understanding!) are assumed to map to an integral weight value equal to that hex_symbol interpreted as a hexadecimal number is wrong, because ucs_symbol's should be mapped to an integral also in the sequence of table lines along with simple_symbol's and the values have no relation with the hexadecimal values. J.15-24, Rule 19) CTT includes many lines which have two or more "space"s immediately before "comment". They should be modified or the BNF should be modified. J.15-25, Rule 5, 11) CTT includes illegal identifiers such as <2AIGU> % COMBINING DOUBLE ACUTE ACCENT <2GRAV> % COMBINING DOUBLE GRAVE ACCENT They should be modified or the BNF should be modified. J.15-26, Rule 21 and other places) The Rule 21 allows an expression like .. It should be clarified in syntax or in well-formedness or in interpretation what are allowed for "symbol_list_item_range" and how they are interpreted. J.16) p.10-, 6.4 Declaration of delta, 1st sentence: The first sentence It is recommended that tailoring be done starting with the Common Template table described in annex A. is wrong because all the tailoring shall start from the Common Template Table. 
If this standard allows a collating specification to be defined from scratch, many places need to be changed.

J.17) p.17, Annex B.2, Example 2 - Danish delta and benchmark:
This is a wrong example because it contains no valid order_start entry and it contains some illegal lines starting with "collating-element".

J.18) p.10, 6.4 Declaration of a delta:
p.12, Annex A Common Template Table:
Two of the three toggling switches, which were major achievements up to the first FCD and drew no NB comment asking for their removal, are omitted in this draft. They should be restored in 6.4 and Annex A.

J.19) Global:
The word "conformant" should be replaced with the word "conforming".

______ end of Japan comments; beginning of Netherlands comments ________

From John.Bijlsma@nni.nl
-----------------------------------------------------------------
22N2844 FCD14651 International String Ordering and Comparison Method for Comparing Character Strings and Description of a Common Tailorable Ordering
1999-04-08 DISAPPROVAL WITH COMMENT
---------------------------------------------------------------
The NNI votes NO on FCD 14651 for the reasons detailed below. The vote from the NNI will turn into yes when the defects indicated below have been repaired.

-1- Apart from FCD 14651, another document standardizing string sorting is available:
    Draft Unicode Technical Report #10: Unicode collation algorithm
Comparing both documents, the following (partial) reasons for a NO-vote appear:

-a- The Unicode Report is much clearer and better defined than the 14651 document.

-b- Both documents describe the algorithm(s) in informal English. It is therefore impossible to present a formal reasoning or mathematical proof that the algorithms are equal (if they are supposed to be) or are not equal and implement different functionality (if they are supposed to be different). It is similarly impossible to prove that a program correctly implements one of these algorithms (or both algorithms).
-c- It seems that both descriptions are not equivalent. There seem to be differences in particular regarding level 4. This is said with some prudence given the issue -b- above. Summary of -1-: The NNI is of the opinion that the world has no need for having two (almost) equal sorting standards. The current situation is seen as a source of confusion and a waste of standardization resources. The NNI thinks that only one of these developments should be continued. -2- Quite some comments have come in on the previous FCD. This has led to a large delta between the previous and the current document. Because this delta was to be expected, the NNI had requested that the current document is issued as a CD instead of an FCD. WG20 has decided to issue an FCD, therewith neglecting what the F in FCD stands for. After this round, a similar delta is to be expected. The NNI therefore repeats its request to issue the next document as a CD. -3- The previous document contained many unclear definitions and clauses. While some improvement has been noticed, the rewriting that has taken place has introduced many new ambiguities. Below we will first give some general remarks and then some remarks related to the paragraphs in the document. General remark 1: There are still quite a few sentences in the document that are clearly not written in proper English. This makes the document difficult to understand. General remark 2: There are quite a few occurrences of words that do not belong in an IS. We mention just a few: minimum of efforts, fundamental choices, highly recommended, straightforward, challenge, simple, a lot of, excellent, carefully. General remark 3: The precision of definitions and wording still leaves much to be desired. Some of the detailed issues below are consequences of the textual ambiguities in the document. Detailed remarks: Re Introduction: There is still confusion about the precise meaning (or difference in meaning) of 'ordering', 'collation' and 'comparison'. 
The example of 'English as a poor exception' sounds negative and is unintelligible. Re 1 Scope: Is 'a method of reference for comparing two character strings' (first dash) the same as 'the comparison method' (third dash)? ....any equivalent method giving the same results is acceptable. Are there equivalent methods giving different results? Are there non-equivalent methods giving the same results? Re 2 Conformance: section => clause paragraph 2: crippled English Re 3 Normative References: 8859 and 14652 are missing. Re 4 Definitions: The notions of 'object', 'element', 'comparison element' and 'internally' have not been clarified. 4.10 discusses 'the reference comparison method'. Is this the same as 'a method of reference' in clause 1? 4.11 states that ordering affects two SETS OF strings, whereas clause 1 states that ordering affects TWO STRINGS. Re 6 Requirements: 6.1 states 'Reference method' whereas 6.1.1 states 'comparison method' Are these the same? Although not part of the scope of this IS, ...... It is unclear whether this part is normative or not. If this part is not normative, requirements as presented under 6.1.1 should be moved to an informative annex. ....described in 6.1.... This is unclear as this is clause 6.1. ...are meant to be equivalent. The notion of equivalent is unclear. 6.1.2 ......the algorithm of key formation described in clause 6.2 ... 6.2 does not describe 'key formation'; 6.2.2 describes 'key composition'; has that been intended? 6.2.1.1 We have here 'ordering table', 'transformation table' and 'matrix of n lines'. None of these notions is particularly clear; in particular the last one is quite ambiguous. It seems only one notion would be sufficient. For a precise notion, WG20 is referred to the notion of 'map' as used in VDM-SL. 6.2.1.2 ...A tailored table may be separated into blocks. This seems to imply that a non-tailored table may not be separated into blocks. This seems odd. 'May' is not allowed in an IS. 
The notion of a block is unclear. Is a diagonal sub-matrix a proper block? 6.2.1.2 Note: The notions of 'logical sequence', 'presentation sequence' and 'logical order of the presentation forms(?)' are unclear. 6.2.2 Key composition: The notion of 'comparison field' is unclear. The notion of 'successive sequence' is unclear. The whole issue of 'stacking a token' and 'push position' is unclear. As far as understandable, the stack seems never to be popped; the use of the values in the stack stays unclear. The discussion under 'Level 4' is incomprehensible. Additionally, it is unclear what differentiates 'logical string sequence' from 'logical sequence'. 6.3.1 BNF Syntax Rules: This is NOT BNF; it is not EBNF either, but a local variation. Why not use the SC22 document available? There are various kinds of quotes in this table. I5. .... order in this file. It is unclear which file is used here. It would have been most helpful when the notion of a block as introduced in clause 6.2.1.1 would have been present in the BNF. The notions of combining character and precomposed character have not been defined. 6.3.4 C1. (full stop missing) C1. Two collation weighting tables... What on earth are these? ... is exactly matched by ... What is the difference between 'exactly matched', 'exactly equal' and 'equal'? 6.4 Declaration of a delta: ...14652, which uses a syntax that is compatible with the one described in this IS. Why having two partially overlapping standards? ...that occur in the comparison table used relatively to the Common Template Table if a fixed table is ... The number of tables gets (relatively) overwhelming. ....as defined in 6.2.1 => 6.3.1 (two times) Re Note: It is unclear why two imprecise forms are allowed here when a precise one is available also. Re Annex A: It is unclear what a 'common template' is. Re Annex B: It seems the lines containing order_start TABLE;forward;backward;forward;forward,position cannot be derived from the BNF. 
It seems the line copy ISO14651_1999_TABLE1 cannot be derived from the BNF. It seems the lines containing sequences of cannot be derived from the BNF as line 15 of the BNF requires double quotes. There are some formatting problems here.

_____ end of Netherlands comments_______________________________________

Secretariat Note: The Sweden comments are contained in document SC22 N2912.

_______________________ beginning of UK comments ______________________

UK comments on ISO/IEC FCD 14651

The UK votes Yes with comments
- UK comments GB(a)-GB(b) refer to editorial issues in sections 1-6;
- UK comment GB(c) refers to a technical issue;
- UK comments GB1-GB8 refer to details of the default table in section 7.

General: the UK notes that Michael Everson (NSAI, Ireland) had volunteered to ISO/IEC JTC1/SC22/WG20 to undertake the task of improving the English text, and hopes he will be able to continue that task. UK comments GB(a)-GB(b) are intended to assist him in that task.

----------------------------------------------------------------
GB(a) Editorial (mainly English problems)
----------------------------------------------------------------
1. Scope para starting "Specific symbols" insert "for" after "except"
4.8 Second sentence replace "To a" with "A"
5. Second para second sentence delete "ever"
6.1.1 Note 1 replace "It is demonstrated" by "It can be demonstrated"; "not typically" by "typically not" and "required" by "necessary"
6.2.1.2 Note para 4 replace "to code Arabic completely" with "the complete coding of Arabic"

----------------------------------------------------------------
GB(b) Editorial (mainly English problems, but without a recommended solution since the meaning of the original text isn't clear)
----------------------------------------------------------------
5.
Second para second sentence the usage of "all the coded graphic characters"
6.1.1 Note 1 "economy of means in the general case" isn't right
6.1.1 Note 2 "constitute very sensitive to interpret" isn't the correct English phrase, perhaps "are context sensitive data"?
6.2.1.1 "in a special way according to what is described in what follows"??
6.2.1.1 Note para 4 "presentation forms be coded in" is unclear
6.2.2.2 Level 4 "common to all scripts or the level not specifically belonging to any script"??
6.2.2.2 Level 4 para 3 It is not clear what the subject "these characters" actually is.

----------------------------------------------------------------
GB(c) Technical
----------------------------------------------------------------
BNF Syntax Rules should be those of the approved IS and this should be included in the References Clause 3

----------------------------------------------------------------
GB1. Cyrillic letters used in Old Church Slavonic and Macedonian:
----------------------------------------------------------------
Prefer altering position of character DZE, so it follows in the order ZHE, DZE, Z.

Rationale: If the default order uses that, it provides for Old Church Slavonic (with a considerable literature, over many centuries) without any tailoring being required. The current order involving DZE provides only for Macedonian, which was established as a literary language during WWII (BGN/PCGN information). It is Macedonian which should use a tailoring here, as tailoring is very likely for Macedonian anyway, due to the interchange of glyphs G_acute and K_acute for DJE and TSHE respectively, but retaining the underlying Serbian order despite the glyph change. BGN/PCGN also has the order Zhe, z, dze - a further variant ordering for Macedonian. So the more stable Old Church Slavonic order should be adopted as the default order.

----------------------------------------------------------------
GB2.
Greek ---------------------------------------------------------------- . TECHNICAL comments The main changes that the UTC requires of 14651 can be summarized as: A. Levels Conformant 14651 implementations must not be required to support more than the first 3 levels. (They are free to support more than 3, but not required to.) It is not at all clear from the current conformance clause how many levels a conformant implementation must support. To address this concern, make the following changes: a. On page 5, 6.2.1.1 Assumptions. The statement that "The number of levels can be extended in the tailoring phase by the end-user." should be modified to: "The number of levels can be extended or reduced in the tailoring phase." (Note also removal of the red-herring use of the term "end-user".) b. Add the following language to 6.2.1.1 "Conformant implementations of 14651 must support at least three levels. They may support more levels, but they are not required to for conformance. In the absence of such support, fourth and higher level information can be ignored." B. Position Conformant 14651 implementations must not be required to support the position designator. (They are free to support the position designator, but not required to.) In addition, the text following the paragraph in 6.2.2.2 starting with "Generally" is informative, not normative, and does not belong in this section. To address these requirements, make the following changes: On page 5, 6.2.1.1 Assumptions. The sentence starting "The user shall take care that,..." should be omitted. It is very strange in that it normatively requires a user to "take care that...", but what they must take care is then expressed as a conditional with a protasis expressed as "so that the last level may processed [sic]". The whole sentence is an incomprehensible admonition as it stands. What we want is a clear statement that the standard does not *require* special processing at the last level, but does *allow* it (see below). 
In 6.2.1.2, change "A specific property" to "An optional property" In the first paragraph of 6.2.2.2, change the condition to read: "If there is an order_start entry that does not use the position value at level m of a block, or if there is no order_start entry, then the formation of subkey level m is done in exactly the same way as the above-defined formation. Otherwise..." Add the following language to 6.2.2.2 after the paragraph starting "During". "Conformant implementations of 14651 are not required to support the position value. They may support this value, but are not required to for conformance. In the absence of such support, the position value is ignored." d. Split 6.2.2.2 into two parts. The new part 6.2.2.3 would begin on the bottom of page 6, just above the paragraph starting "Generally," and should be entitled: "General interpretation of each level in the Common Template Table". e. In the new 6.2.2.3, delete all but the first sentence in the paragraph labeled "Level 4". That would disconnect the interpretation of Level 4 from whether or not keys are constructed for Level 4 using the position mechanism. f. Move the paragraph following the "Level 4" paragraph (starting "In the table, this behavior is...") up into 6.2.2.2 after the note about forward and backward scanning. g. Move the new section 6.2.2.3 into some other place in the standard. It is informative, and should not be part of the normative clause 6. C. Backward Conformant 14651 implementations must not be required to support the backward designator at any level but level 2. Moreover, conformant 14651 implementations are not required to have anything but a global backwards switch (e.g. that all weights at a particular level are either uniformly forward or backward). (They are free to support the multiple levels of backwards, and fine-grained directionality [on a per character basis], but not required to.) 
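The effect of the 'backward' designator at level 2, which item C treats as the only backward case an implementation must support, can be sketched in a few lines of Python. The accent weights, the `ACCENTED` mapping and the `make_key` helper are hypothetical values and names chosen for this illustration; they are not taken from the Common Template Table.

```python
# Hypothetical toy data: accented letter -> (base letter, accent weight).
# Unaccented letters get accent weight 1.
ACCENTED = {"\u00f4": ("o", 3),   # o with circumflex
            "\u00e9": ("e", 2)}   # e with acute

def make_key(word, backward_level2):
    """Build a two-level key; optionally scan level 2 from the end."""
    pairs = [ACCENTED.get(ch, (ch, 1)) for ch in word]
    level1 = tuple(base for base, _ in pairs)    # base letters
    level2 = tuple(acc for _, acc in pairs)      # accent weights
    if backward_level2:
        level2 = level2[::-1]
    return (level1, level2)

words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
forward = sorted(words, key=lambda w: make_key(w, False))
backward = sorted(words, key=lambda w: make_key(w, True))
```

With these assumed weights, forward scanning yields cote, coté, côte, côté, while backward scanning yields cote, côte, coté, côté. The flag here is a single global switch for level 2, which matches the minimum that item C proposes to require.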
To address this requirement, add the following language to 6.2.1.2:

"Conformant implementations of 14651 are not required to support the 'backward' scanning direction at any level but level 2. In the absence of such support, the scanning direction is treated as if it were 'forward' at every level but level 2.

"Conformant implementations of 14651 are also not required to support different scanning directions for different blocks. In the absence of such support, if any block has a backward scanning direction for any level, then all blocks are considered to have that scanning direction at that level."

To the note at the end of 6.2.1.2 starting "In ISO/IEC 10646-1, Arabic", add the following text:

"However, the Unicode Standard does prescribe the logical order of all characters, including Arabic and Hebrew. Implementations conforming to the Unicode standard will not use the backward scanning property."

[Note: the current description of per-block backward and forward support in 14651 does not serve the goal it was designed for. Since languages and scripts share a great many characters in common, a choice of either forward or backward will cause those common characters to disrupt the order within text of the other direction. For example, suppose Greek is ordered forwards, and French backwards. If digits, for example, are forward then they disrupt the French accents. If they are backward, then they will disrupt the Greek accents. Even going to a forward, backward, neutral model, as in UCA Version 2, will not work. No matter which heuristics are used to assign the direction of the neutrals, sometimes the choice will be incorrect.

Mixing blocks of different direction is not well supported in industry practice. Most implementations of POSIX do not support it, nor does Java. Forcing these implementations to revise without solid justification is unwarranted.
However, as long as implementations are not forced to implement mixed scanning directions, the current language can remain.] D. Unicode conformance ISO 14651 must permit a conformant implementation to do the following. (These are required for conformance to the Unicode Standard.) D.1. Treat canonical equivalent strings as precisely equal in ordering. D.2. Perform Thai/Lao-style character reversal (see UCA Step 1). D.3. Exclude irrelevant combining marks when looking up matches for contracting characters (see UCA Step 2). D.4. Exclude unsupported characters from a collation ordering, or cause them to be sorted in Unicode code point order. Items D.1 through D.3 are probably covered by section 6.1. However, to ensure that they are, these three items must be added in Notes as examples of conformant implementations, with the following language: "Note: to allow conformance to the Unicode Standard, conformant implementations may a. Treat canonical equivalent strings as precisely equal in ordering. b. Perform Thai/Lao-style character reversal. c. Exclude irrelevant combining marks when looking up matches for contracting characters. For more information, see Unicode Technical Report #10." D.4 is commonly implemented as UNDEFINED in POSIX and other standards. It must be included so that implementations working in low-memory environments that do not need the full default collation rules can use a small subset, and have all other Unicode characters sorted by code order. To fix this problem, make the following changes: In 6.3.1 rule 23, add the text " | UNDEFINED" to the end of the line. At the end of 6.2.2.1, add the text: "If there are no tokens corresponding to a character of the input string, then the character is undefined. Undefined characters are sorted with respect to defined characters as if they were at the position UNDEFINED in the Template Table. (If there is no UNDEFINED token in the table, then the table is interpreted as if there were one at the very end.) 
The ordering of undefined characters with respect to other undefined characters is not specified by this standard. Note: there are two common treatments of UNDEFINED characters. The first is to sort among them as if their level-one weight differences were based upon their UCS character code. The second is to sort them as if they all had the same level-one weight, and their second-level weights were the same as their UCS character codes." E. Stability: The data for both UCA and 14651 must be updated to the level of symdump-2.1.9.txt on the SC22/WG20 server (incorporating all of the individual changes that the US would be asking for). No further changes to other parts of 14651 that would substantially affect the current major collation implementations are acceptable to the UTC or the US national body. In particular, the default data for levels 1, 2, and 3 used by 14651 must be consistent with the UCA data (though perhaps not in the same format). The data was synchronized; this must not diverge due to ballot comments. F. Specific Technical Comments Section 6.3.3. is not well defined. Rule I2 (reorder_after) must state what the relationship is between the table lines (X) between the entries and the tailored line containing the symbol defintion (S). That is, suppose we have the following rules: ;;; ;;; ;;; ;;; We want to tailor that table by adding a reordering rule: reorder-after ;;; ;;; reorder-end What does the normalized output (I4) look like? According to the rules, it could be: ;;; ;;; ;;; Or it could be ;;; ;;;+1 ;;; Both of these operations might be required for a tailoring, but the rules I1 and I2 do not distinguish between them. Moreover, the rules do not say what is the effect on UB-does it have the same level distinction with the last of the new line(s) that it used to with UA? To address this problem, the following (or equivalent) change must be made. 6.3.1, rule 32. 
Change to: reorder_after_entry := 'reorder-after ' target_symbol ' at level ' digit+ 6.3.3 rule I2. Add: " The reorder entry effectively inserts lines X through Y between existing lines A and B, producing the new ordering . The level of the reorder-after statement determines the level of the differences between A and X. The level of the difference between Y and B is the stronger of the old difference level between A and B and the new difference level between A and X. For example, suppose we have the following lines (where B1 != A1): ;;; ;;; reorder-after at level 2 ;;; ;;; reorder-end will produce the normalized result equivalent to: ;;; ;+2;; ;;; +1; ;;" It must be clearly stated that a reorder-entry also *removes* the lines from where they used to be. In addition, the following text must be added at the end. "The reorder-entries must be processed in order during normalization, otherwise incorrect results will be obtained." The rule I3 also unclear in that it doesn't discuss changing the actual numerical values of the weights. Yet the assignment of numerical values to weights doesn't occur until I5. If the assignment is not done in the reordering, then the subsequent assignment of weights would defeat the purpose of the reordering. This must be clarified. G. Unicode Reference Given their importance in the development of this standard, and the fact that the vast majority of 10646 implementations are in fact Unicode implementations, the Unicode Standard must be referenced in Section 3, and Unicode 2.0, TR #8, and DTR #10 must be referenced in the Bibliography. EDITORIAL Comments: A. The BNF rules in 6.3.1 should be supplemented by a textual description of the format. The well-formedness conditions can be interleaved with the textual description for clarity. B. Examples must be added to 6.3.3 to make the requirements clear, as above. C. Change the explanation in 6.3.1 BNF Syntax Rules to fully describe the notation (e.g. 
Aho and Ullman):
"<> refers to terms not defined in this BNF syntax, and assume general English usage.
'' refers to literal characters
() used for grouping
X Y matches the token sequence X followed by Y
X | Y matches either X or Y tokens
X* matches zero or more repetitions of X
X+ matches one or more repetitions of X
{X} matches one or more repetitions of X"

D. Certain wordsmithing needs to be done for clarity and accuracy. Take the introduction alone:
* Sentence #2 is untrue: that is not the only purpose; others are mentioned below.
* #4 has an incorrect reference: "English" is not a "past approach".
* The last sentence of para #2 is incorrect: one does not "achieve challenges"; one might "overcome them", if that is what is meant.
* "result discrepancies" must be changed to "discrepancies in results".
* "excellent" sounds like blowing our own horn too much.
A full list would take too long to compile; marked-up copies will be brought to the Pennsylvania meeting.

Introduction, page iv, first paragraph

a) The meaning of the word "universal" is ambiguous here. It perhaps implies that there may be other non-universal properties which are not retained during tailoring. Does this paragraph intend to indicate that all scripts have these properties, or does it mean that the particular values of these properties as defined for each script are common to all users of the Common Template Table, if they are not tailored? One can presume the latter, but it should be more clearly stated. A suggestion might be to change "retaining universal properties for other scripts" to "retaining properties already defined for other scripts."

b) This paragraph seems to be saying that the purpose of this standard is to improve on collation algorithms based only on binary coded character values.
If this refers to the use of the binary coded values without associating a weight to those values, then the next comment about English, with uppercase characters only and no punctuation, being an exception, makes sense. However, it is a rather weak statement, given that even the simplest collation algorithms generally apply some weighting scheme. A suggestion might be to simply delete the remainder of the paragraph beginning with "The purpose of such a mechanism".

Introduction, page iv, second paragraph

In the first sentence "this is one of the major flaws that affect portability" it is not clear what "this" is referring to, or what is "flawed". A suggestion might be to combine the sentence with the parenthetical remark: "That different programs use different ordering specifications is a significant problem reducing portability between countries and between applications."

Section 1 Scope

In the first paragraph "A simple method of reference" delete "of reference", as the method is for comparing, not for referencing. It is understood that this standard is defining a method which can be a reference for international ordering.

In the last bullet in this section, delete the final 2 words "to order" in "A context-dependent ordering which would require complex transformation of data to order."

Section 2 Conformance

The requirements imposed by the second paragraph are unclear. In the last sentence "and how the comparison method they use If different" the "I" in "if" should not be capitalized. There should be a comma after the word "use".

Section 4 Definitions

4.6 delta: change "relatively" to "relative".

4.8 graphic character: change "To a graphic character normally corresponds a glyph." to "A graphic character normally corresponds to a glyph."

4.9 level: This definition is ambiguous as "depth" is not defined. The author should provide a more meaningful definition.

The word "token" should be replaced throughout the document by "weight", unless the definition is in error.
Collating symbol and collating element should be changed to collation symbol and collation element.

The difference between ordering key and collation element is not clear from the definitions.

"preparation": speaking of the actual source strings being modified here and in 6.1.1 is worrisome; it is copies of the source strings that are modified, if anything.

Section 5 Symbols and abbreviations

The last 2 sentences in the first paragraph can be worded more grammatically, and "covered" can be clarified, by changing "What is being referenced is a graphic character, independently of its coding, and any character set whose subrepertoire is taken into account in ISO/IEC 10646-1 is covered in this way." to "This is a way to reference a graphic character, independent of its coding. Any character set whose subrepertoire is taken into account in ISO/IEC 10646-1, is included in this specification by this nomenclature."

Section 6.1.1 Preparation of character strings prior to comparison

In the first paragraph, will the reference to telephone-book ordering be universally understood, or should the specific problem referred to in this example be brought out?

In the second paragraph, the words "but not both" should be added to the phrase "An application conformant to this international standard shall at the minimum prepare the string so that sequences using either combining sequences or using precomposed characters".

In Note 1 of this section, remove the extraneous " a " in "precomposed characters affected by a diacritics,".

Section 6.2.1.1 Matrix of n lines

6.2.1.1 "matrix of n lines. N is the number of characters in the repertoire used." This would exclude multiple characters sorting as one. Also, "matrix" is unclear; what is meant? It is also not really a "transformation table". Rather, it is a mapping table from character sequences to collation elements.
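The last point can be illustrated with a minimal sketch. It is an illustration only and not proposed text for the standard. The table below is keyed by character sequences rather than single characters, so it has more entries than the repertoire has characters. The entries, the weight values, and the names TABLE and collation_elements are all hypothetical.

```python
# Hypothetical mapping table: keys are character sequences, values are
# lists of collation elements (tuples of three made-up level weights).
TABLE = {
    "a": [(10, 1, 1)],
    "c": [(12, 1, 1)],
    "ch": [(13, 1, 1)],  # two characters sorting as one element
    "h": [(17, 1, 1)],
    "l": [(21, 1, 1)],
    "ll": [(22, 1, 1)],  # another two-character entry
    "o": [(24, 1, 1)],
}

MAX_KEY_LEN = max(len(key) for key in TABLE)


def collation_elements(s):
    """Map a string to collation elements, preferring the longest match."""
    out = []
    i = 0
    while i < len(s):
        # Try the longest candidate sequence first, then shorter ones.
        for n in range(min(MAX_KEY_LEN, len(s) - i), 0, -1):
            seq = s[i:i + n]
            if seq in TABLE:
                out.extend(TABLE[seq])
                i += n
                break
        else:
            # Undefined characters are outside the scope of this sketch.
            raise KeyError(f"no table entry for {s[i]!r}")
    return out
```

With these made-up weights, "ch" yields a single collation element, so "cha" sorts after "co" even though "a" precedes "o". The table has seven entries for a repertoire of five characters, which a matrix of exactly n lines could not express.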
Section 6.2.2 Key composition and Section 6.2.2.1 Formation of subkey level 1 through (m-1)

This section is very unclear, must be made more precise, and would greatly benefit from an example. In particular, references to directionality are made with respect to string processing, levels and characters, and are hard to understand. Stacking is described but unstacking is left to the reader's imagination. In particular it is not clear when to unstack. For example, in the second paragraph after the parenthetical remark, where it states "and the new direction is backward", it is not clear how many attributes of the algorithm are affected. The character has the property of being backward, this changes the direction of the current level i, and might be presumed to also affect the scanning direction of the input character string, which is described as initially forward in the first paragraph. If we understand the proposed algorithm correctly, it would benefit the specification to state clearly:

1) That scanning of the input character string is always forward through the logical sequence of the string.

2) That reaching a character with a backwards property changes the current direction of level i from forward to backward, and commences stacking of position and token.

3) That reaching a character with a forwards property when the current direction of level i is backwards changes the level's direction to forwards and commences unstacking, with a description of what is involved in unstacking.

Section 6.2.2.2 Formation of subkey level m

The first sentence should change "uses" to "use".

The first paragraph begins with discussion of order_start_entry, which has not yet been introduced. This should be characterized, and the subsequent reference to having or not having a position expanded upon for clarity. The significance of using the table as-is versus changing it in accordance with frequent market practice should also be clarified and the alternative behaviors of the ordering described.
An explanation of why the Common Template Table does not follow frequent market practice might also be offered.

In the second paragraph, the sentence "When the character is not assigned at level m in the table, it is ignored for the formation of subkey level m and no pair is concatenated." might be better moved to the end of the paragraph, so the subsequent sentences cannot be perceived to be part of the condition "when the character is not assigned at level m". In addition, this paragraph is the first indication that a character might not have entries for every table level. There should be some discussion of this and its impact on behavior of the ordering.

The first sentence in the description of level 4 states: "This level represents the level common to all scripts or the level not specifically belonging to any script." We do not understand what this means. How and why is this level different from the other levels?

In the last paragraph of this section, it is stated: "In the Common Template table, definitions of these characters for level 1 to 3". We do not understand which characters are referred to by "these characters". Perhaps the author should state: "In the Common Template table, characters that are assigned values at level 4, are exclusively assigned to level 4, and are ignorable, and have no values assigned, at levels 1-3."

It might improve the readability and understandability of the specification if the actual description of the Common Template table were moved out of this section to the later section on the Common Template table, and if the information in level 4, about the formation of the level 4 or level m subkey, were included with the first 2 paragraphs of this section, describing the key formation.

Section 6.3.4

The first paragraph can be simplified considerably to: "Two collation weighting tables are said to be equivalent if any comparison of strings using those tables results in the same ordering."
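A rough sketch of how this simplified definition could be exercised follows. It assumes single-level tables that map each character to one numeric weight, which is far simpler than the tables in the standard, and the names sort_key, agree_on, T1, T2, T3 and SAMPLE are hypothetical. A finite sample can only show that two tables are not equivalent; agreement on a sample does not prove equivalence.

```python
from itertools import product


def sort_key(table, s):
    """Single-level key: the list of weights of the characters of s."""
    return [table[ch] for ch in s]


def sign(a, b):
    """Return -1, 0 or 1 according to how a compares with b."""
    return (a > b) - (a < b)


def agree_on(table_a, table_b, strings):
    """True if both tables compare every pair of sample strings alike."""
    return all(
        sign(sort_key(table_a, x), sort_key(table_a, y))
        == sign(sort_key(table_b, x), sort_key(table_b, y))
        for x, y in product(strings, repeat=2)
    )


# Made-up tables for illustration only.
T1 = {"a": 1, "b": 2, "c": 3}
T2 = {"a": 10, "b": 20, "c": 35}  # different numbers, same relative order
T3 = {"a": 1, "b": 3, "c": 2}     # b and c swapped

SAMPLE = ["", "a", "ab", "ac", "b", "ba", "c", "cab"]
```

Here agree_on(T1, T2, SAMPLE) holds although the numeric weights differ, while agree_on(T1, T3, SAMPLE) fails because "ab" and "ac" compare differently. This is the sense in which equivalence depends on the resulting ordering and not on the particular weight values.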
Section 6.4 Declaration of a delta

In the second paragraph, conformance is described as declarable if a fixed table is used by the application. Can an application conform if it does not make use of a fixed table analogous to the Common Template table? Also, the term "comparison table" is not defined. Presumably this is the name for the transformation table used with the comparison method, and this should be stated or clarified. Also the word "relatively" should be "relative" in this instance.

In the first bullet, there is a reference to direction values being dependent on writing systems. Earlier, the specification pointed out that scanning direction is in fact independent of the direction of writing, so this may be confusing and misleading to readers.

In the first paragraph after the 4 bullets, the sentence beginning with "In cases where the applications has" should be changed to "In cases where the applications have".

____________________ end of SC22 N2911 ___________________________________