Transcoding Classes

This file contains the following subjects:

Overview of Transcoding Classes
Transcoders
Line Breaking
Character Set Iteration

Overview of Transcoding Classes

Transcoding is the process of converting text data between two coded character sets using a set of mapping rules.

The goal of the Open Class libraries is for all text to be encoded in Unicode and manipulated according to the Unicode character encoding standard. Any text you import from or export to a system that uses a different character encoding scheme must be transcoded so that the text can be manipulated directly on the target system.

The Open Class transcoding classes convert character data between Unicode and a wide variety of other encoding sets and encoding schemes, including ASCII and other ISO standards, so that you can import and export text data between Open Class applications and other environments. The Open Class transcoders use the UniChar datatype to represent Unicode characters in IText objects and the char datatype to represent non-Unicode characters in IString objects.
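
As a minimal illustration of what transcoding involves, the following sketch converts ISO 8859-1 (Latin-1) bytes to 16-bit Unicode code units and back using only the C++ standard library. It does not use the Open Class API: the UniChar alias and the functions latin1ToUnicode and unicodeToLatin1 are hypothetical stand-ins, and std::string and std::u16string merely play the roles of IString and IText. The sketch relies on the fact that Latin-1 byte values coincide with Unicode code points U+0000 through U+00FF.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the Open Class UniChar type: one 16-bit code unit.
using UniChar = char16_t;

// ISO 8859-1 byte values equal Unicode code points U+0000..U+00FF,
// so the mapping rule is a plain widening of each byte.
std::u16string latin1ToUnicode(const std::string& bytes) {
    std::u16string text;
    text.reserve(bytes.size());
    for (unsigned char byte : bytes) {
        text.push_back(static_cast<UniChar>(byte));
    }
    assert(text.size() == bytes.size());  // one code unit per input byte
    return text;
}

// The reverse mapping exists only for code units up to U+00FF. Anything else
// is replaced here with 0x1A, the ASCII substitute character.
std::string unicodeToLatin1(const std::u16string& text) {
    std::string bytes;
    bytes.reserve(text.size());
    for (UniChar unit : text) {
        bytes.push_back(unit <= 0x00FF ? static_cast<char>(unit) : '\x1A');
    }
    return bytes;
}
```

Most real character sets need a lookup table rather than a simple widening, but the shape of the operation is the same: char data on one side, UniChar data on the other, and a rule for characters that have no counterpart.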

The transcoding classes also provide mechanisms for handling characters that don't have obvious mappings between Unicode and another character set. These mechanisms handle both line-breaking characters, which differ between platforms, and exception characters. Exception characters are characters that can often be transcoded but do not have a one-to-one mapping. These may include ligature characters, foreign characters, or composed characters.

The transcoding classes are:

Class	Description
ITranscoder	Primary class defining protocols for transcoding character data between Unicode and any other character encoding standard
ILineBreakConverter	Simple class used to postprocess line-breaking conventions after character data is transcoded into Unicode, or preprocess line-breaking characters before Unicode data is transcoded into char-based data
ICharacterSetIterator	Lets you iterate through the character sets for which transcoders are available

Transcoders

ITranscoder provides the abstract protocol for all transcoders supported by the Open Class system. You create a transcoder by passing the name of a character set to ITranscoder::createTranscoder, which returns an instance of the ITranscoder subclass that supports that character set.
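
The factory pattern described here can be sketched in standard C++ as follows. This is not the Open Class implementation: the Transcoder, AsciiTranscoder, and Latin1Transcoder classes, the importText helper, and the names "US-ASCII" and "ISO-8859-1" are all hypothetical, chosen only to show callers working through an abstract base class. Returning a null pointer for an unknown name is likewise an assumption of this sketch.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical miniature of an abstract transcoder protocol.
class Transcoder {
public:
    virtual ~Transcoder() = default;
    virtual std::string characterEncoding() const = 0;
    virtual std::u16string toUnicode(const std::string& bytes) const = 0;
};

// US-ASCII: bytes above 0x7F have no mapping and become U+FFFD.
class AsciiTranscoder : public Transcoder {
public:
    std::string characterEncoding() const override { return "US-ASCII"; }
    std::u16string toUnicode(const std::string& bytes) const override {
        std::u16string text;
        for (unsigned char byte : bytes) {
            text.push_back(byte <= 0x7F ? static_cast<char16_t>(byte) : u'\uFFFD');
        }
        return text;
    }
};

// ISO 8859-1: every byte maps to the code point with the same value.
class Latin1Transcoder : public Transcoder {
public:
    std::string characterEncoding() const override { return "ISO-8859-1"; }
    std::u16string toUnicode(const std::string& bytes) const override {
        std::u16string text;
        for (unsigned char byte : bytes) {
            text.push_back(static_cast<char16_t>(byte));
        }
        return text;
    }
};

// Factory: callers name a character set and only ever see the base class.
// Returns nullptr for a name this sketch does not know.
std::unique_ptr<Transcoder> createTranscoder(const std::string& name) {
    if (name == "US-ASCII") return std::make_unique<AsciiTranscoder>();
    if (name == "ISO-8859-1") return std::make_unique<Latin1Transcoder>();
    return nullptr;
}

// Convenience wrapper showing use through the abstract interface only.
std::u16string importText(const std::string& name, const std::string& bytes) {
    const std::unique_ptr<Transcoder> transcoder = createTranscoder(name);
    assert(transcoder && "unknown character set name");
    return transcoder->toUnicode(bytes);
}
```

Because callers hold only a pointer to the base class, new character sets can be added behind the factory without changing any calling code.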

This figure shows some of the available transcoders:

ITranscoder is currently the only public transcoder class. You access all concrete subclasses through the ITranscoder interfaces.

ITranscoder provides both a simple, high-level interface, which converts between IText and IString instances, and a low-level interface, based on the ANSI C++ Standard codecvt interface, which takes pointers to char and UniChar arrays. The high-level functions take two parameters, a char-based IString object and a UniChar-based IText object, and convert either to or from Unicode data.

The low-level pointer-based functions let you manipulate char and UniChar strings directly. Some of the low-level functions are identical to the interfaces provided by the ANSI C++ standard library codecvt class. They allow you to specify exact ranges of text to transcode and to provide error-recovery mechanisms.
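
The calling convention of a codecvt-style, pointer-based conversion can be sketched as below. The function asciiToUnicode, the Result enum, and the UniChar alias are hypothetical and are not the ITranscoder signatures. The sketch only borrows the general shape of the standard codecvt conversion functions, in which the caller passes input and output ranges and receives "next" pointers showing how far each range was processed.

```cpp
#include <cassert>

using UniChar = char16_t;  // hypothetical stand-in for the Open Class type

// Mirrors the ok / partial / error outcomes of std::codecvt_base::result.
enum class Result { kOk, kPartial, kError };

// Converts US-ASCII bytes in [from, fromEnd) into [to, toEnd).
// fromNext and toNext report how far each range was consumed or filled,
// so the caller can resume after supplying more space or handling an error.
Result asciiToUnicode(const char* from, const char* fromEnd,
                      const char*& fromNext,
                      UniChar* to, UniChar* toEnd, UniChar*& toNext) {
    assert(from <= fromEnd && to <= toEnd);
    fromNext = from;
    toNext = to;
    while (fromNext != fromEnd) {
        const unsigned char byte = static_cast<unsigned char>(*fromNext);
        if (byte > 0x7F) return Result::kError;        // not a US-ASCII byte
        if (toNext == toEnd) return Result::kPartial;  // output range is full
        *toNext++ = static_cast<UniChar>(byte);
        ++fromNext;
    }
    return Result::kOk;
}
```

A caller that receives a partial result can continue from fromNext with a fresh output buffer, and a caller that receives an error can inspect the byte at fromNext before deciding how to recover.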

This figure shows the ITranscoder interface:

createTranscoder returns a transcoder for the character encoding set you specify. See the table "Open Class transcoder names" for a list of supported transcoder names. If you don't specify a name, the function returns a transcoder for the current host character set.
toUnicode and fromUnicode provide conversion between char and UniChar data. Overloads of these functions take IString or char* and IText or UniChar*.
result returns an enum value that indicates whether the text was fully converted, was only partially converted, or conversion stopped because of an error. toUnicode and fromUnicode return the same value.
unmappedBehavior and setUnmappedBehavior let you determine how the transcoder handles exception characters.
setCharSubstitute lets you specify a character to be used as a substitute for characters that do not have a mapping from Unicode to the specified character set. The default character is UASCII::kSubstitute (0x1A).
uniCharSubstitute returns the character used as a substitute for characters that do not have a mapping from the source character set into Unicode.
characterEncoding returns an IText containing the name of the character encoding supported by the transcoder.
characterSet returns an IText containing the name of the default encoding for a specified locale.
resetState resets the state of the transcoder to ASCII.
Storage query functions let you get information about storage requirements so you can manage storage allocation for transcoding operations efficiently.

Line-Breaking Conversion

ILineBreakConverter is a simple class that you use to ensure that line-breaking characters are transcoded correctly between Unicode and the target character set. You can use ILineBreakConverter to postprocess strings just converted into Unicode, or preprocess strings before converting them into char data.

This figure shows the ILineBreakConverter interface:

convertInPlace and convert process the line breaks in an IText object according to a specified line-breaking convention.
hostConvention returns the line-breaking convention for the current host.

ILineBreakConverter uses the enum ELineBreakConvention to describe different line-breaking conventions. Currently the following conventions are defined:

Line-Breaking Conventions

Enum value	Convention	Line-break characters
kUnicode	Unicode convention	UGeneralPunctuation::kParagraphSeparator (U+2029)
kCRLF	Windows, DOS, OS/2 convention	CR LF sequence
kLF	UNIX convention	LF
kCR	Macintosh System 7 convention	CR
kCRLF_VT	Microsoft Word/Rich Edit convention	CR LF or VT
kHost	Indicates the current host's convention

ILineBreakConverter uses the following rules to convert between various host line-breaking conventions and Unicode:

Line-Breaking Conversion Rules

Host	Line-breaking convention	Unicode text with host convention	Unicode convention
Win32, OS/2, DOS	CR LF sequence	0x000D 0x000A	U+2029
AIX	LF	0x000A	U+2029
Macintosh	CR	0x000D	U+2029
Word/RichEdit	CR LF sequence	0x000D 0x000A	U+2029
Word/RichEdit	VT	0x000B	U+2028
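
The rules in the table above can be expressed as a small standalone function. This is a sketch in standard C++ rather than the ILineBreakConverter implementation: the LineBreakConvention enum and toUnicodeLineBreaks function are hypothetical, kHost is left out, and passing characters through unchanged when they do not match the selected convention (for example, a lone CR under the CR LF convention) is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical mirror of the conventions listed above (kHost is left out).
enum class LineBreakConvention { kUnicode, kCRLF, kLF, kCR, kCRLF_VT };

constexpr char16_t kParagraphSeparator = u'\u2029';
constexpr char16_t kLineSeparator = u'\u2028';

// Rewrites host-style line breaks in already-transcoded Unicode text
// as Unicode separators.
std::u16string toUnicodeLineBreaks(const std::u16string& text,
                                   LineBreakConvention from) {
    std::u16string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t c = text[i];
        const bool isCrLf =
            c == u'\r' && i + 1 < text.size() && text[i + 1] == u'\n';
        bool handled = false;
        switch (from) {
        case LineBreakConvention::kCRLF:
        case LineBreakConvention::kCRLF_VT:
            if (isCrLf) {
                out.push_back(kParagraphSeparator);
                ++i;  // the LF was consumed together with the CR
                handled = true;
            } else if (from == LineBreakConvention::kCRLF_VT && c == u'\v') {
                out.push_back(kLineSeparator);
                handled = true;
            }
            break;
        case LineBreakConvention::kLF:
            if (c == u'\n') {
                out.push_back(kParagraphSeparator);
                handled = true;
            }
            break;
        case LineBreakConvention::kCR:
            if (c == u'\r') {
                out.push_back(kParagraphSeparator);
                handled = true;
            }
            break;
        case LineBreakConvention::kUnicode:
            break;  // already in the Unicode convention
        }
        if (!handled) out.push_back(c);
    }
    assert(out.size() <= text.size());  // conversion never lengthens the text
    return out;
}
```

Converting in the other direction, before transcoding out of Unicode, would reverse the same table: U+2029 becomes the host sequence and, for the Word/RichEdit convention, U+2028 becomes VT.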

Character Set Iteration

Use ICharacterSetIterator to iterate through the list of character encoding sets for which transcoders are available on the current system. ICharacterSetIterator returns IText objects that contain the names of supported character sets.

This figure shows the interface for ICharacterSetIterator:

operator++ increments the iterator to reference the next available transcoder.
operator* returns an IText with the name of the character set supported by the currently referenced transcoder.
operator bool indicates when the iterator has reached the end of the list.
reset sets the iterator back to the first available character set.
operator== and operator!= are equality and inequality operators for two character set iterators.
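
The iteration protocol listed above can be sketched with a small standalone class. CharacterSetIterator and collectNames are hypothetical and hold a fixed list of names supplied by the caller, whereas ICharacterSetIterator reports what is installed on the system. The sketch assumes that the bool conversion yields true while the iterator still refers to a character set and false once it has moved past the last one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical iterator over a caller-supplied list of character set names.
class CharacterSetIterator {
public:
    explicit CharacterSetIterator(std::vector<std::string> names)
        : names_(std::move(names)), index_(0) {}

    // Advances to the next name; stays put once past the end.
    CharacterSetIterator& operator++() {
        if (index_ < names_.size()) ++index_;
        return *this;
    }

    // Name of the character set currently referenced.
    const std::string& operator*() const {
        assert(index_ < names_.size());
        return names_[index_];
    }

    // True while a character set is referenced, false at the end of the list.
    explicit operator bool() const { return index_ < names_.size(); }

    // Returns to the first character set.
    void reset() { index_ = 0; }

    friend bool operator==(const CharacterSetIterator& a,
                           const CharacterSetIterator& b) {
        return a.index_ == b.index_ && a.names_ == b.names_;
    }
    friend bool operator!=(const CharacterSetIterator& a,
                           const CharacterSetIterator& b) {
        return !(a == b);
    }

private:
    std::vector<std::string> names_;
    std::size_t index_;
};

// Typical traversal: reset, test, dereference, increment.
std::vector<std::string> collectNames(CharacterSetIterator it) {
    std::vector<std::string> names;
    for (it.reset(); it; ++it) {
        names.push_back(*it);
    }
    return names;
}
```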

Special Characters

The transcoding classes provide special handling for both line-break and exception characters.

The class ILineBreakConverter provides for conversion between the Unicode paragraph-separator character (U+2029 or UGeneralPunctuation::kParagraphSeparator) and the appropriate line-break character for a given character set or host. You can use this class to postprocess transcoded strings after conversion into Unicode or to preprocess strings before conversion into char-based formats.

ITranscoder also lets you control how exception characters are handled. Exception characters are characters that do not have a single-character equivalent or that do not exist in the target character set. For example, Greek characters may be used in some environments where they are not part of the native character set, and ligature characters, which by definition combine two characters, are often mapped to a sequence of their individual components. The following table shows some typical cases:

Unicode Name	Unicode sequence	Display	May be mapped to	Control code
LATIN SMALL LIGATURE FI	FB01	[fi]	0066 [f] + 0069 [i]	\xde
GREEK CAPITAL LETTER DELTA	0394	Δ	ı or other	\xc6
GREEK SMALL LETTER PI	03C0	π	* or ¼ or other	\xb9
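
The first row of the table, a ligature mapped to a sequence of its components, can be sketched as a one-to-many lookup. The function expandLigatures and its table are hypothetical and are not part of the transcoding classes. The entry for U+FB02 (LATIN SMALL LIGATURE FL) does not appear in the table above and is added only to show a second mapping of the same kind.

```cpp
#include <cassert>
#include <map>
#include <string>

// Replaces each known ligature with the sequence of its component letters.
// Characters without an entry are copied through unchanged.
std::u16string expandLigatures(const std::u16string& text) {
    static const std::map<char16_t, std::u16string> kExpansions = {
        {u'\uFB01', u"fi"},  // LATIN SMALL LIGATURE FI -> 0066 0069
        {u'\uFB02', u"fl"},  // LATIN SMALL LIGATURE FL (illustrative addition)
    };
    std::u16string out;
    for (char16_t unit : text) {
        const auto found = kExpansions.find(unit);
        if (found != kExpansions.end()) {
            out += found->second;
        } else {
            out.push_back(unit);
        }
    }
    assert(out.size() >= text.size());  // expansion never shortens the text
    return out;
}
```

Because one input character can produce several output characters, a mapping of this kind is not one-to-one, which is what makes the character an exception character.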

To specify how you want exception characters to be handled, call the ITranscoder::setUnmappedBehavior function. You can specify a substitution character, in either Unicode or the target character set, or you can specify that the transcoder either skip exception characters or stop the transcoding operation when it reaches one.

By default, transcoders substitute UGeneralFunction::kReplacementCharacter (U+FFFD) for char characters that have no mapping into Unicode, and the ASCII substitution character (UASCII::kSubstitute, or 0x1A) for Unicode characters that have no mapping into the target char-based character set.

Exception Characters

The transcoders let you specify how you want exception characters to be handled. Exception characters are characters for which there are no one-to-one mappings between Unicode and the target character set. Use the EUnmappedBehavior enum to specify one of the following:

kUseSub	Substitutes an equivalent representation in the target character set for characters with no exact mapping
kStop	Stops transcoding when an exception character is detected
kOmit	Skips any exception characters detected

If you don't specify behavior for exception character handling, the transcoder uses EUnmappedBehavior::kUseSub as the default. The substitute characters that are used are:

UGeneralPunctuation::kReplacementCharacter (U+FFFD) for characters that cannot be transcoded into Unicode
UASCII::kSubstitute (0x1A) for characters that cannot be transcoded out of Unicode, that is, into the target char-based character set

You can set the char substitute character to another character with the ITranscoder function setCharSubstitute. Whether to display these characters as glyphs or as text strings is left to the host operating system.
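
The three behaviors can be compared side by side with a small standalone encoder. The AsciiEncoder class below is a hypothetical sketch that targets US-ASCII only. Its member names echo the functions described in this section, but the signatures, and the convention of returning false when conversion stops, are assumptions rather than the ITranscoder interface.

```cpp
#include <cassert>
#include <string>

enum class UnmappedBehavior { kUseSub, kStop, kOmit };

// Hypothetical encoder from Unicode code units to US-ASCII bytes.
class AsciiEncoder {
public:
    void setUnmappedBehavior(UnmappedBehavior behavior) { behavior_ = behavior; }

    // The substitute must itself be representable in the target character set.
    void setCharSubstitute(char substitute) {
        assert(static_cast<unsigned char>(substitute) <= 0x7F);
        substitute_ = substitute;
    }

    // Returns true if all of text was processed, false if conversion stopped
    // at an exception character. out holds whatever was converted so far.
    bool fromUnicode(const std::u16string& text, std::string& out) const {
        out.clear();
        for (char16_t unit : text) {
            if (unit <= 0x7F) {
                out.push_back(static_cast<char>(unit));
                continue;
            }
            switch (behavior_) {
            case UnmappedBehavior::kUseSub:
                out.push_back(substitute_);  // replace with the substitute
                break;
            case UnmappedBehavior::kOmit:
                break;                       // drop the character
            case UnmappedBehavior::kStop:
                return false;                // leave the rest unconverted
            }
        }
        return true;
    }

private:
    UnmappedBehavior behavior_ = UnmappedBehavior::kUseSub;
    char substitute_ = '\x1A';  // ASCII substitute character
};
```

With kUseSub the output keeps the same length as the input, which preserves character offsets. kOmit shortens the output, and kStop lets the caller decide what to do with the character that could not be converted.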

Mapping Proximity

When you create a transcoder for a specific character set, you can specify how close the mapping proximity must be between Unicode and the target character set. Use the EMappingProximity enum to specify one of the following:

kExactMapping	Create a transcoder with an exact mapping to the specified character set
kSupersetMapping	Create a transcoder with a character set that is a superset of the specified character set
kCloseMapping	Create a transcoder with a character set as close to the specified character set as possible

If you don't specify a mapping proximity, the transcoder uses EMappingProximity::kSupersetMapping as the default.
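
One way to picture the three proximity levels is as a selection rule over the character sets that happen to be available. The sketch below is an interpretation and does not describe how createTranscoder actually resolves a request. The CharSetInfo structure, the chooseCharacterSet function, and the simplification that a repertoire is the range from U+0000 up to a last code point are all hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class MappingProximity { kExactMapping, kSupersetMapping, kCloseMapping };

// Hypothetical description of a character set: its name and the last code
// point of a repertoire that is assumed to start at U+0000.
struct CharSetInfo {
    std::string name;
    unsigned lastCodePoint;
};

// Returns the name of the selected character set, or an empty string if
// nothing satisfies the requested proximity.
std::string chooseCharacterSet(const CharSetInfo& wanted,
                               MappingProximity proximity,
                               const std::vector<CharSetInfo>& available) {
    assert(!wanted.name.empty());
    const CharSetInfo* superset = nullptr;  // smallest set covering wanted
    const CharSetInfo* subset = nullptr;    // largest set covered by wanted
    for (const CharSetInfo& candidate : available) {
        if (candidate.name == wanted.name) return candidate.name;  // exact
        if (candidate.lastCodePoint >= wanted.lastCodePoint) {
            if (!superset || candidate.lastCodePoint < superset->lastCodePoint) {
                superset = &candidate;
            }
        } else if (!subset || candidate.lastCodePoint > subset->lastCodePoint) {
            subset = &candidate;
        }
    }
    if (proximity == MappingProximity::kExactMapping) return "";
    if (superset) return superset->name;
    if (proximity == MappingProximity::kCloseMapping && subset) {
        return subset->name;
    }
    return "";
}
```

Under this interpretation, a superset mapping never loses characters from the requested set, while a close mapping may, which is why the stricter levels can fail where the looser one succeeds.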

Transcoder Names

Instantiating a Transcoder
Iterating Through Available Transcoders
Converting Text from Unicode to Character Format
Converting Text from Character Format to Unicode
Processing Line-breaking Characters
Using ANSI C++ Compatible Transcoding Functions
Verifying Transcoding Results