2. Codesets

2.1. DEC Hanzi

2.2. GB18030

DECwindows Motif Supplemental Guide for Simplified Chinese Support

DECwindows Motif supports the following Simplified Chinese codesets:

DEC Hanzi
GB18030

The ASCII, GB2312-80 and extended GB character sets are combined to form the DEC Hanzi codeset.

DEC Hanzi (Simplified Chinese), denoted as dechanzi, uses a 2-byte data representation for the symbols and ideographic characters defined in the GB2312-80 character set. To differentiate these 2-byte codes from ASCII codes, the most significant bit (MSB) of the first byte is always set. The MSB of the second byte is set for GB2312-80 characters and clear for extended GB characters, as shown in Figure 2-1.

Figure 2-1. DEC Hanzi Character Encoding
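
The MSB rules above can be sketched as a small byte-stream splitter. This is only an illustration under stated assumptions: the function name split_dec_hanzi and the kind labels are hypothetical, and the sketch checks only the MSB of each byte, not whether the bytes fall inside the assigned row and column ranges.

```python
def split_dec_hanzi(data: bytes):
    """Split a DEC Hanzi byte string into (kind, bytes) pairs.

    Hypothetical helper: classification uses only the MSB rules from the
    text and does not validate row or column ranges.
    """
    result = []
    i = 0
    while i < len(data):
        first = data[i]
        if first & 0x80 == 0:
            # MSB of the first byte is clear: a 1-byte ASCII code.
            result.append(("ascii", data[i:i + 1]))
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated 2-byte character at offset %d" % i)
        second = data[i + 1]
        # MSB of the second byte set: GB2312-80; clear: extended GB.
        kind = "gb2312-80" if second & 0x80 else "extended-gb"
        result.append((kind, data[i:i + 2]))
        i += 2
    return result


parts = split_dec_hanzi(b"A\xb0\xa1\xb0\x21")
for kind, raw in parts:
    print(kind, raw.hex().upper())
```

For the sample input, the three pieces are the ASCII byte 41, the GB2312-80 code B0A1, and the extended GB code B021.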

The first byte of a 2-byte code determines its row number, while the second byte determines its column number.

The following formulas illustrate the code of a GB2312-80 character or an extended GB character in relation to its row and column numbers.

GB2312-80 character:

First byte = A0 (hex) + row number
Second byte = A0 (hex) + column number

Extended GB character:

First byte = A0 (hex) + row number
Second byte = 20 (hex) + column number

For example, if a character is positioned at the first column of the 16th row on the GB2312-80 code plane, its encoding value is calculated as follows:

First byte = A0 (hex) + 16 (decimal; 10 hex) = B0 (hex)
Second byte = A0 (hex) + 1 = A1 (hex)

The resulting encoded value is B0A1.

Similarly, if a character is positioned at the first column of the 16th row on the extended GB code plane, its encoding value is calculated as follows:

First byte = A0 (hex) + 16 (decimal; 10 hex) = B0 (hex)
Second byte = 20 (hex) + 1 = 21 (hex)

The resulting encoded value is B021.
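
The two formulas and the worked examples above can be checked with a short sketch. The function names encode_dec_hanzi and decode_dec_hanzi are hypothetical, and the 1 to 94 bounds on row and column are an assumption based on the 94 by 94 layout of the GB2312-80 code plane; the text above does not state the limits for extended GB.

```python
def encode_dec_hanzi(row: int, column: int, extended: bool = False) -> bytes:
    """Return the 2-byte code for a row/column position (decimal, 1-based).

    Assumption: rows and columns run from 1 to 94 on both code planes.
    """
    if not (1 <= row <= 94 and 1 <= column <= 94):
        raise ValueError("row and column must be in 1..94")
    first = 0xA0 + row
    # GB2312-80 offsets the column by A0 (hex); extended GB by 20 (hex).
    second = (0x20 if extended else 0xA0) + column
    return bytes([first, second])


def decode_dec_hanzi(code: bytes):
    """Invert encode_dec_hanzi: return (plane, row, column)."""
    first, second = code
    row = first - 0xA0
    if second & 0x80:
        return ("gb2312-80", row, second - 0xA0)
    return ("extended-gb", row, second - 0x20)


print(encode_dec_hanzi(16, 1).hex().upper())                 # B0A1
print(encode_dec_hanzi(16, 1, extended=True).hex().upper())  # B021
print(decode_dec_hanzi(bytes.fromhex("B021")))
```

Row and column are passed as decimal numbers, which matches the worked examples: row 16 adds 10 (hex) to the A0 base.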

Figure 2-2 illustrates the division of a 2-byte code space and the position of the Chinese character sets.

Figure 2-2. GB2312-80 and Extended GB Code Space

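[Figure: the 2-byte code space, with the first byte on the vertical axis (marked at 20, 80, A0, and FF) and the second byte on the horizontal axis (marked at 00, 20, 80, A0, and FF). For first bytes between A0 and FF, the extended GB region spans second bytes between 20 and 80, and the GB2312-80 region spans second bytes between A0 and FF.]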

The GB18030 codeset provides 1-byte, 2-byte, and 4-byte encoding with the following structure:

Number of Bytes	Encoding Range (per byte)	Code Points
1 byte	0x00 to 0x7F	128
2 bytes	Byte 1: 0x81 to 0xFE; byte 2: 0x40 to 0xFE (except 0x7F)	23940
4 bytes	Byte 1: 0x81 to 0xFE; byte 2: 0x30 to 0x39; byte 3: 0x81 to 0xFE; byte 4: 0x30 to 0x39	1587600
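
The code point counts in the table follow directly from the byte ranges. The arithmetic can be reproduced with a few lines; the variable names are illustrative only.

```python
# Number of values in each per-byte range from the table.
one_byte = 0x7F - 0x00 + 1          # ASCII range
lead = 0xFE - 0x81 + 1              # first (and third) byte: 126 values
trail_2 = (0xFE - 0x40 + 1) - 1     # second byte of a 2-byte code, minus 0x7F
digit = 0x39 - 0x30 + 1             # second and fourth byte of a 4-byte code

two_byte = lead * trail_2
four_byte = lead * digit * lead * digit

print(one_byte, two_byte, four_byte)  # 128 23940 1587600
```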

The GB18030 1-byte code supports ASCII characters.

The GB18030 2-byte code supports all the CJK (Chinese, Japanese, Korean) characters in the Unicode Version 2.1 Standard.

The GB18030 4-byte code supports the Unicode Version 3.0 additions. It also leaves a large number of unassigned code points available for future use.
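
As a rough illustration of the three code lengths, the sketch below derives the length of a GB18030 character from its leading bytes, using the byte ranges in the table, and compares the result with the gb18030 codec in the Python standard library. The helper name gb18030_char_length is hypothetical, and the Python codec is used here only as an independent reference; it is not part of DECwindows Motif. The sample characters (U+0041, U+4E00, U+3400) are chosen as an ASCII letter, a CJK ideograph present in Unicode 2.1, and a CJK ideograph added in Unicode 3.0.

```python
def gb18030_char_length(data: bytes) -> int:
    """Return the byte length of the GB18030 character at the start of data."""
    if not data:
        raise ValueError("empty input")
    b0 = data[0]
    if b0 <= 0x7F:
        return 1                      # 1-byte code (ASCII)
    if 0x81 <= b0 <= 0xFE and len(data) >= 2:
        b1 = data[1]
        if 0x30 <= b1 <= 0x39:
            if len(data) < 4:
                raise ValueError("truncated 4-byte code")
            return 4                  # second byte is 0x30 to 0x39
        if 0x40 <= b1 <= 0xFE and b1 != 0x7F:
            return 2                  # second byte is 0x40 to 0xFE, not 0x7F
    raise ValueError("not a valid GB18030 lead sequence")


for ch in ("A", "\u4e00", "\u3400"):
    encoded = ch.encode("gb18030")
    # The table-based length should agree with what the codec produced.
    assert gb18030_char_length(encoded) == len(encoded)
    print("U+%04X" % ord(ch), len(encoded), "byte(s)")
```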