Collation Classes

This file contains information about the following subjects:

Overview of Collation Classes
Collation Subclasses
Collation Iteration
Ordering Strength

Overview of Collation Classes

In most cases, sorting strings by their Unicode values does not produce a linguistically correct order. For example, in ASCII-based character sets, Z is ordered before a, and z is ordered before ñ. The Open Class collation classes, however, support collation objects that compare strings based not on the Unicode value of each character, but on the rules of a natural language. This is what enables language-sensitive string comparison.
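The problem can be seen with a few lines of standard C++. The sketch below is not part of the Open Class API; sortByCodePoint is a hypothetical helper that sorts strings by raw code point value, which is what happens when no collation rules are applied.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper for illustration: sorts words by raw code point
// value, with no language rules applied.
std::vector<std::u32string> sortByCodePoint(std::vector<std::u32string> words) {
    // operator< on std::u32string compares char32_t values lexicographically.
    std::sort(words.begin(), words.end());
    return words;
}
```

Given "apple", "Zebra", "zoo", and a word beginning with ñ (U+00F1), this ordering places "Zebra" first and the ñ word last, because Z (U+005A) precedes a (U+0061) and z (U+007A) precedes ñ.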

Each International Framework collation object is based on a set of rules that define the results for alphabetizing and comparing text in a particular natural language. These rules define not only a ranking (such as a < b < c) but also three levels of priority within the ranking.

For many European languages, the difference between two base letters (a and b) is a primary difference, the difference between an unaccented and an accented base letter (ä and a) is secondary, and the difference between an uppercase and lowercase letter (A and a) is tertiary. These distinctions allow you to set the level of comparison for more sophisticated sorting and searching.

The ICollation interface is based on the protocols in the ANSI C++ standard library collate class, which provides string comparison and hashing functions. The ICollation comparison functions take two strings or substrings and return a value that indicates whether the source string is greater than (later in the alphabet), less than (earlier in the alphabet), or equal to the target string. You can specify the ordering strength of the comparison to control how differences such as case and accents are handled.

You can compare styled text in an IText object, but styling information is ignored.

This figure shows the collation class architecture:

Collation Subclasses

The collation classes include the abstract base class ICollation, which defines the protocol for language-sensitive string comparison; several concrete subclasses of ICollation; and ICollationIterator, which lets you iterate through the list of available localized collation objects.

Class	Description
IBitwiseCollation	Provides bitwise, language-insensitive string comparison.
ICollation	Provides access to either host-specific or portable collation for a given language as available. Primary class for language-sensitive string comparison.
ICollationIterator	Lets you iterate through the available collation objects.
IPortableCollation	Provides portable (non-host-specific) language-sensitive collation.

ICollation provides the protocols you use to create both language-sensitive and language-insensitive collation objects. The ICollation interface is a superset of the interface of the ANSI C++ Standard collate class. Based on the locale you specify, the ICollation::createCollation function can return:

A host-specific collation object for the specified language or locale
A portable collation object for the specified language or locale, derived from IPortableCollation
To specifically create an IPortableCollation object, call IPortableCollation::createCollation directly with a portable locale key.
An IBitwiseCollation object that performs language-insensitive collation
To request that createCollation return an IBitwiseCollation object, specify the POSIX locale ("POSIX" or "C").

This figure shows the ICollation interface:

createCollation is a static function that returns the collation object for a specified locale. If a host-specific object is available, it is returned. Otherwise a portable object is returned. If you don't specify a locale, the function returns the collation object for the default locale. createCollation also lets you specify a comparison level. The default is ICollation::kTertiaryDifference.
compare returns the result of comparing two strings. The result is returned as an enum value: kSourceEqual, kSourceLess, or kSourceGreater.
strength and setStrength provide access to the collation object's current ordering strength (primary, secondary, or tertiary).
isEqual, isGreaterThan, and isLessThan are convenience functions that return a Boolean value indicating the comparison result of two strings.
transform converts an IText into another IText that serves as a sort key. Comparing two transformed IText objects lexicographically returns the same result as comparing the original strings with the collation object.
localeKey returns an ILocaleKey indicating the locale the collation object is associated with.
displayName returns a displayable name for the object for a specified locale.
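Because the ICollation interface is described as a superset of the ANSI C++ collate class, the behavior of compare and transform can be illustrated with the standard std::collate facet. This is a sketch using the C++ standard library only, not the Open Class classes; the Order enum and the compareWith and sortKey helpers are hypothetical names introduced for this example.

```cpp
#include <locale>
#include <string>

// Three possible outcomes of a comparison. Names are local to this sketch.
enum class Order { Less = -1, Equal = 0, Greater = 1 };

// Compares two strings with the std::collate facet of the given locale.
Order compareWith(const std::locale& loc, const std::string& a, const std::string& b) {
    const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);
    const int r = coll.compare(a.data(), a.data() + a.size(),
                               b.data(), b.data() + b.size());
    return r < 0 ? Order::Less : (r > 0 ? Order::Greater : Order::Equal);
}

// Returns a sort key for the string. Comparing two keys lexicographically
// gives the same result as comparing the original strings with the facet.
std::string sortKey(const std::locale& loc, const std::string& s) {
    const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);
    return coll.transform(s.data(), s.data() + s.size());
}
```

With std::locale::classic() (the "C" locale), the comparison is language-insensitive, so "Zebra" compares less than "apple". Sort keys are useful when the same strings are compared many times, for example while sorting a large list, because the collation rules are applied once per string instead of once per comparison.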

Collation Iteration

Use ICollationIterator to iterate through the list of international collation objects currently available on the system. You can set the iterator to enumerate only host collation objects, only portable collation objects, or both (both is the default).

This figure shows the interface for ICollationIterator:

operator++ increments the iterator to the next collation object.
operator bool and end indicate that the iterator is at the end of the list.
reset sets the iterator to reference the first item in the list.
create returns the collation object that the iterator currently references. You can specify the ordering strength.
localePOSIXID returns a text string that contains the POSIX ID of the locale associated with the currently referenced collation object.
displayName returns a displayable name for the object for a specified locale.

Ordering Strength

The correct collation for each language or script is determined by a set of rules that define a ranking, from least to greatest, for each character. To allow more comparison options, each character is assigned an ordering priority within the ranking: primary, secondary, or tertiary. For example, in an English collation:

Base letters represent a primary difference ("a" and "b")
Diacritical marks on the same base letter represent a secondary difference ("a" and "â")
Uppercase and lowercase versions of the same base letter represent a tertiary difference ("a" and "A")

In English, then, you can implement case-insensitive comparison by setting the ordering strength to kSecondaryDifference. Primary and secondary differences are considered, but any tertiary (case) differences are ignored; thus, "pat," "Pat," and "PAT" would be considered equivalent strings.

When you create a collation object, you specify an ordering strength that determines which differences are considered: all differences (tertiary strength), only primary and secondary differences (secondary strength), or only primary differences (primary strength). The types of differences that are considered primary, secondary, and tertiary may vary based on the language you are working with.
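The effect of ordering strength can be sketched with a toy model. The code below is an assumption-laden illustration, not the Open Class implementation: every name (ToyStrength, ToyResult, toyWeights, toyCompare) is hypothetical, and the weight table covers only unaccented Latin letters plus three accented forms of the letter a. Each character gets a primary weight (base letter), a secondary weight (diacritical mark), and a tertiary weight (case), and a weaker level is consulted only when all stronger levels are equal.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical names for this sketch; they are not the Open Class enums.
enum ToyStrength { kToyPrimary = 1, kToySecondary = 2, kToyTertiary = 3 };
enum ToyResult { kToyLess = -1, kToyEqual = 0, kToyGreater = 1 };

struct ToyWeights {
    int primary;    // base letter
    int secondary;  // diacritical mark
    int tertiary;   // case
};

// Toy weight table: handles A-Z, a-z, and three accented forms of "a".
ToyWeights toyWeights(char32_t c) {
    ToyWeights w{0, 0, 0};
    switch (c) {
        case U'\u00E2': c = U'a'; w.secondary = 1; break;  // a with circumflex
        case U'\u00E4': c = U'a'; w.secondary = 2; break;  // a with diaeresis
        case U'\u00C4': c = U'A'; w.secondary = 2; break;  // A with diaeresis
        default: break;
    }
    if (c >= U'A' && c <= U'Z') {
        w.tertiary = 1;  // uppercase ranks after lowercase at the tertiary level
        c = static_cast<char32_t>(c - U'A' + U'a');
    }
    w.primary = static_cast<int>(c);
    return w;
}

// Selects the weight for one level (1 = primary, 2 = secondary, 3 = tertiary).
int toyLevel(const ToyWeights& w, int level) {
    return level == 1 ? w.primary : (level == 2 ? w.secondary : w.tertiary);
}

// Compares level by level, stopping at the requested strength.
ToyResult toyCompare(const std::u32string& source, const std::u32string& target,
                     ToyStrength strength) {
    const std::size_t n = std::min(source.size(), target.size());
    for (int level = 1; level <= strength; ++level) {
        for (std::size_t i = 0; i < n; ++i) {
            const int a = toyLevel(toyWeights(source[i]), level);
            const int b = toyLevel(toyWeights(target[i]), level);
            if (a != b) return a < b ? kToyLess : kToyGreater;
        }
        // A string that is a proper prefix of the other sorts first.
        if (level == 1 && source.size() != target.size())
            return source.size() < target.size() ? kToyLess : kToyGreater;
    }
    return kToyEqual;
}
```

In this model, toyCompare(U"pat", U"PAT", kToySecondary) yields kToyEqual because only case differs, while the same call with kToyTertiary yields kToyLess. A real collation object derives its weights from the rules of the language rather than from a hard-coded table.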

This table shows the results for English strings compared with different ordering strengths:

Source	Target	Ordering strength	Comparison result
abc	abc	kPrimaryDifference	kSourceEqual
äbc	abc	kPrimaryDifference	kSourceEqual
Abc	abc	kSecondaryDifference	kSourceEqual
abc	def	kPrimaryDifference	kSourceLess
abc	äbc	kSecondaryDifference	kSourceLess
abc	Abc	kTertiaryDifference	kSourceLess
def	abc	kPrimaryDifference	kSourceGreater
äbc	abc	kSecondaryDifference	kSourceGreater
Abc	abc	kTertiaryDifference	kSourceGreater

When you use the collation object for the POSIX locale (a portable collation), specifying an ordering strength has no effect, because that object compares strings bitwise rather than by language rules.