22. Charset
Module Charset
- Description
The Charset module supports a wide variety of character sets and is flexible about the character set names it accepts. Character case is ignored, as are the most common non-alphanumeric characters appearing in character set names. E.g. "iso-8859-1" works just as well as "ISO_8859_1". All encodings specified in RFC 1345 are supported.
First of all, the Charset module is capable of handling the following encodings of Unicode:
- utf7
- utf8
- utf16
- utf16be
- utf16le
- utf32
- utf32be
- utf32le
- utf75
- utf7½
UTF encodings
- shiftjis
- euc-kr
- euc-cn
- euc-jp
Most, if not all, of the relevant code pages are represented, as the following list shows. Prefix the numbers as noted in the list to get the wanted codec:
- 037
- 038
- 273
- 274
- 275
- 277
- 278
- 280
- 281
- 284
- 285
- 290
- 297
- 367
- 420
- 423
- 424
- 437
- 500
- 819
- 850
- 851
- 852
- 855
- 857
- 860
- 861
- 862
- 863
- 864
- 865
- 866
- 868
- 869
- 870
- 871
- 880
- 891
- 903
- 904
- 905
- 918
- 932
- 936
- 950
- 1026
These may be prefixed with "cp", "ibm" or "ms".
- 1250
- 1251
- 1252
- 1253
- 1254
- 1255
- 1256
- 1257
- 1258
These may be prefixed with "cp", "ibm", "ms" or "windows".
- mysql-latin1
The default charset in MySQL, similar to cp1252.
+359 more.
- Note
In Pike 7.8 and earlier this module was named Locale.Charset.
- Method decode_error
void decode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)
- Description
Throws a DecodeError exception. See DecodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).
- Method decoder
Decoder decoder(string|zero name)
- Description
Returns a charset decoder object.
- Parameter
name
The name of the character set to decode from. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.
- Throws
If the requested name is not supported, an error is thrown.
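- Example
A minimal sketch of how decoder might be used, assuming the feed()/drain() chaining shown elsewhere in this chapter. The helper name utf8_to_string and the charset name "no-such-charset" are made up for illustration.

```pike
// Hypothetical helper: decode a UTF-8 byte string into a Pike string.
string utf8_to_string(string data)
{
  return Charset.decoder("utf-8")->feed(data)->drain();
}

int main()
{
  // "\xc3\xa5" is the two-byte UTF-8 encoding of U+00E5.
  string s = utf8_to_string("\xc3\xa5");
  write("%d character(s), first code point 0x%x\n", sizeof(s), s[0]);

  // An unsupported name makes decoder() throw; catch returns the error.
  mixed err = catch { Charset.decoder("no-such-charset"); };
  if (err)
    write("unsupported charset name rejected\n");
  return 0;
}
```

Wrapping the call in catch is one way to probe whether a name is supported without aborting the program.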
- Method decoder_from_mib
Decoder decoder_from_mib(int mib)
- Description
Returns a decoder for the encoding scheme denoted by MIB mib.
- Method encode_error
void encode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)
- Description
Throws an EncodeError exception. See EncodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).
- Method encoder
Encoder encoder(string|zero name, string|void replacement, function(string:string)|void repcb)
- Description
Returns a charset encoder object.
- Parameter
name
The name of the character set to encode to. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.
- Parameter
replacement
The string to use for characters that cannot be represented in the charset. It is used when repcb is not given or when it returns zero. If no replacement string is given then an error is thrown instead.
- Parameter
repcb
A function to call for every character that cannot be represented in the charset. If specified, it is called with one argument: a string containing the character in question. If it returns a string then that string replaces the character in the output. If it returns something else then the replacement argument is used to decide what to do.
- Throws
If the requested name is not supported, an error is thrown.
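- Example
The following sketch illustrates the replacement parameter described above. The helper name to_latin1 is hypothetical, and the code assumes encoders are driven with feed() and drain() in the same way as decoders.

```pike
// Hypothetical helper: encode to ISO-8859-1, substituting "?" for
// characters that the charset cannot represent.
string to_latin1(string text)
{
  return Charset.encoder("iso-8859-1", "?")->feed(text)->drain();
}

int main()
{
  // U+20AC (euro sign) has no ISO-8859-1 representation.
  string euro = sprintf("%c", 0x20ac);
  write("%s\n", to_latin1("a" + euro + "b"));

  // With neither a replacement string nor a callback, encoding the
  // same character is an error.
  mixed err = catch {
    Charset.encoder("iso-8859-1")->feed(euro)->drain();
  };
  if (err)
    write("no replacement given: encoding failed\n");
  return 0;
}
```

A fixed replacement string is the simplest policy; the repcb argument is the better fit when the substitute depends on the character being replaced.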
- Method encoder_from_mib
Encoder encoder_from_mib(int mib, string|void replacement, function(string:string)|void repcb)
- Description
Returns an encoder for the encoding scheme denoted by MIB mib.
- Method normalize
string|zero normalize(string|zero in)
- Description
All character set names are normalized through this function before being compared.
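- Example
As a rough illustration, two spellings can be checked for equivalence by comparing their normalized forms. The helper same_charset_name is hypothetical, and the sketch relies only on the statement in the module description that case and common punctuation are ignored.

```pike
// Hypothetical helper: true if both spellings normalize to the same name.
int(0..1) same_charset_name(string a, string b)
{
  return Charset.normalize(a) == Charset.normalize(b);
}

int main()
{
  // Both spellings are given as equivalent in the module description.
  if (same_charset_name("iso-8859-1", "ISO_8859_1"))
    write("both names refer to the same charset\n");
  return 0;
}
```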
- Method set_decoder
void set_decoder(string name, program decoder)
- Description
Adds a custom defined character set decoder. The name is normalized through the use of normalize.
- Method set_encoder
void set_encoder(string name, program encoder)
- Description
Adds a custom defined character set encoder. The name is normalized through the use of normalize.
Class Charset.CharsetGenericError
- Description
Base class for errors thrown by the Charset module.
Class Charset.DecodeError
- Description
Error thrown when decode fails (and no replacement char or replacement callback has been registered).
- FIXME
This error class is not actually used by this module yet - decode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.
- Variable charset
string Charset.DecodeError.charset
- Description
The decoding charset, typically as known to Charset.decoder.
- Note
Other code may produce errors of this type. In that case this name is something that Charset.decoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.decoder never accepts that name in the future (unless it is extended to implement exactly the same charset).
Class Charset.Decoder
- Description
Virtual base class for charset decoders.
- Example
string win1252_to_string(string data)
{
  return Charset.decoder("windows-1252")->feed(data)->drain();
}
- Variable charset
string Charset.Decoder.charset
- Description
Name of the charset. Giving this name to decoder returns an instance of the same class as this object.
- Note
This is not necessarily the same name that was actually given to decoder to produce this object.
- Method clear
this_program clear()
- Description
Clear buffers, and reset all state.
- Returns
Returns the current object to allow for chaining of calls.
- Method drain
string drain()
- Description
Get the decoded data, and reset buffers.
- Returns
Returns the decoded string.
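- Example
Because drain resets the buffers, a single decoder object can be reused for successive chunks of input. The sketch below assumes this behaviour; the helper decode_chunks is a made-up name and feed() is used as in the example above.

```pike
// Hypothetical helper: decode each chunk separately with one decoder.
array(string) decode_chunks(array(string) chunks)
{
  object dec = Charset.decoder("iso-8859-1");
  array(string) res = ({});
  foreach (chunks, string chunk)
    // drain() hands back what has been decoded so far and empties the
    // buffer, so the next result contains only the next chunk.
    res += ({ dec->feed(chunk)->drain() });
  return res;
}

int main()
{
  write("%O\n", decode_chunks(({ "abc", "def" })));

  // clear() discards pending data and returns the decoder for chaining.
  object dec = Charset.decoder("iso-8859-1");
  write("%O\n", dec->feed("discarded")->clear()->drain());
  return 0;
}
```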
Class Charset.EncodeError
- Description
Error thrown when encode fails (and no replacement char or replacement callback has been registered).
- FIXME
This error class is not actually used by this module yet - encode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.
- Variable charset
string Charset.EncodeError.charset
- Description
The encoding charset, typically as known to Charset.encoder.
- Note
Other code may produce errors of this type. In that case this name is something that Charset.encoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.encoder never accepts that name in the future (unless it is extended to implement exactly the same charset).
Class Charset.Encoder
- Description
Virtual base class for charset encoders.
- Inherit Decoder
inherit Decoder : Decoder
- Description
An encoder only differs from a decoder in that it has an extra function.
- Variable charset
string Charset.Encoder.charset
- Description
Name of the charset. Giving this name to encoder returns an instance of the same class as this one.
- Note
This is not necessarily the same name that was actually given to encoder to produce this object.
- Method set_replacement_callback
this_program set_replacement_callback(function(string:string) rc)
- Description
Change the replacement callback function.
- Parameter
rc
Function that is called to encode characters outside the current character encoding.
- Returns
Returns the current object to allow for chaining of calls.
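- Example
One possible use of a replacement callback is to emit numeric character references for characters the target charset lacks. This is only a sketch: the helper to_latin1_with_entities is hypothetical, and the "&#N;" output format is an illustrative choice, not something the module prescribes.

```pike
// Hypothetical helper: encode to ISO-8859-1, turning unrepresentable
// characters into decimal numeric character references.
string to_latin1_with_entities(string text)
{
  object enc = Charset.encoder("iso-8859-1");
  // The callback receives a string holding the offending character.
  enc->set_replacement_callback(lambda(string c) {
    return sprintf("&#%d;", c[0]);
  });
  return enc->feed(text)->drain();
}

int main()
{
  // U+20AC (euro sign) is outside ISO-8859-1.
  string euro = sprintf("%c", 0x20ac);
  write("%s\n", to_latin1_with_entities("a" + euro + "b"));
  return 0;
}
```

Since set_replacement_callback returns the encoder itself, the call can also be chained directly in front of feed().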
Module Charset.Tables
Module Charset.Tables.iso88591
- Description
Codec for the ISO-8859-1 character encoding.