Cesuuu

  • Published: 14 November 2010
The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26 [1]. A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (three bytes per surrogate) for each Unicode supplementary character, while UTF-8 needs only four. Each CESU-8 character code (1, 2, or 3 bytes) can be converted to exactly one UTF-16 code unit (2 bytes).
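The encoding rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `cesu8_encode` is hypothetical, and the sketch relies on Python's `surrogatepass` error handler to emit the three-byte form of each surrogate.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative sketch, hypothetical helper name)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            # BMP code point: identical to UTF-8 (1, 2, or 3 bytes).
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary code point: split into a UTF-16 surrogate pair ...
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)    # high (lead) surrogate
            low = 0xDC00 | (v & 0x3FF)   # low (trail) surrogate
            # ... then encode each surrogate as a 3-byte UTF-8 sequence.
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    s = "\U00010400"
    print(len(cesu8_encode(s)), len(s.encode("utf-8")))  # six bytes vs. four
```

For BMP-only text the output is byte-for-byte the same as UTF-8; the two encodings differ only for supplementary characters.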
CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only. It should be used exclusively for internal processing and never for external data exchange.
CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000).
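To make the difference concrete: Java's Modified UTF-8 writes U+0000 as the two-byte sequence C0 80 (this detail comes from the Java documentation, not from the text above), while CESU-8 keeps the single byte 00. Assuming that is the only difference between the two forms, a conversion can be sketched as a plain byte substitution. The helper names are hypothetical.

```python
def cesu8_to_modified_utf8(data: bytes) -> bytes:
    # In CESU-8 the byte 0x00 can only be the encoding of U+0000,
    # because every byte of a multi-byte sequence is >= 0x80.
    return data.replace(b"\x00", b"\xc0\x80")


def modified_utf8_to_cesu8(data: bytes) -> bytes:
    # In valid Modified UTF-8 the lead byte 0xC0 only occurs in C0 80.
    return data.replace(b"\xc0\x80", b"\x00")
```

The sketch assumes well-formed input and does no validation.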
The CESU-8 encoding form is used in the Oracle database software. Oracle's UTF8 character set (unfortunately, a misnomer), available since version 8.0 of the database, is actually CESU-8. The character set AL32UTF8, introduced in version 9.0, is UTF-8 compliant.
The encoding of a Unicode supplementary character works out to 11101101 1010yyyy 10xxxxxx 11101101 1011xxxx 10xxxxxx, where yyyy represents the top five bits of the code point minus one (U+10**** becomes 1111, U+01**** becomes 0000) and x represents the remaining sixteen bits of the code point.
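The bit layout can also be applied directly, without going through surrogates first. The following sketch packs the bits of a supplementary code point into the six-byte pattern. The function name `cesu8_supplementary` is hypothetical and the code is an illustration only.

```python
def cesu8_supplementary(cp: int) -> bytes:
    """Build the six CESU-8 bytes for a supplementary code point (sketch)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    yyyy = (cp >> 16) - 1   # top five bits minus one, fits in four bits
    rest = cp & 0xFFFF      # the remaining sixteen bits (the x bits)
    return bytes([
        0xED,                          # 11101101
        0xA0 | yyyy,                   # 1010yyyy
        0x80 | (rest >> 10),           # 10xxxxxx  (first six x bits)
        0xED,                          # 11101101
        0xB0 | ((rest >> 6) & 0x0F),   # 1011xxxx  (next four x bits)
        0x80 | (rest & 0x3F),          # 10xxxxxx  (last six x bits)
    ])
```

For U+10400 this gives yyyy = 0000 and x bits 0000 0100 0000 0000, i.e. the bytes ED A0 81 ED B0 80.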
Examples

Code point   U+0045   U+0205   U+10400
Character    E        ȅ        𐐀
UTF-8        45       C8 85    F0 90 90 80
UTF-16       0045     0205     D801 DC00
CESU-8       45       C8 85    ED A0 81 ED B0 80
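Going the other way, the CESU-8 bytes from the examples can be decoded back to text. The sketch below is one possible approach in Python, not a validating decoder: it first decodes the bytes as UTF-8 while letting encoded surrogates through, then round-trips through UTF-16 so that each surrogate pair is merged into one supplementary character. The name `cesu8_decode` is hypothetical.

```python
def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes to text (lenient sketch, hypothetical helper name)."""
    # Step 1: decode as UTF-8, allowing 3-byte encoded surrogates.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: re-encode to UTF-16 code units and decode them again,
    # which merges each high/low surrogate pair into one code point.
    utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le")


if __name__ == "__main__":
    sample = bytes.fromhex("45 C8 85 ED A0 81 ED B0 80")
    print([f"U+{ord(c):04X}" for c in cesu8_decode(sample)])
```

Note that this sketch also accepts ordinary four-byte UTF-8 sequences, which a strict CESU-8 decoder would reject, and it raises an error on an unpaired surrogate.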