module Ustring: sig..end
Unicode string library - Ustring. Version 0.01
This module implements Unicode variants of functions that use strings in the OCaml standard library, e.g., in modules String and Pervasives.
There are also a number of additional functions available. Several
basic operators (e.g., equality and string concatenation) are defined in
submodule Ustring.Op. It is recommended to open this submodule
rather than Ustring itself.
This module currently supports the ASCII, Latin-1, and UTF-8 encodings. If
another encoding is used, an exception is raised. In all functions
below, a ustring s is indexed from 0 to (Ustring.length s)-1.
Copyright (C) 2010 David Broman. All rights reserved. This file
is distributed under the "New BSD License".
type uchar = int
module Op: sig..end
Ustring.Op contains Unicode versions of several functions
available in the Pervasives module, as well as operators and functions
for simple string creation and manipulation (for example string concatenation
operator ^., ustring equality operator =. and string creation us"string").
type t = Op.ustring
type ustring = Op.ustring
type encoding =
  | Ascii    (* Standard ASCII (values 0-127) *)
  | Latin1   (* Latin-1 encoding. Supports most European languages. *)
  | Utf8     (* UTF-8 encoding. Full Unicode encoding, where each character is represented by 1-4 bytes. *)
  | Utf16le  (* Not yet supported *)
  | Utf16be  (* Not yet supported *)
  | Utf32le  (* Not yet supported *)
  | Utf32be  (* Not yet supported *)
  | Auto     (* Not a specific encoding, but a method for how encoded data is
                interpreted. If the input data has a byte order mark (BOM) at
                the beginning of the sequence, the encoding type stated by the
                BOM is used. If no BOM is present, ASCII is initially assumed.
                Then, when a byte sequence appears that is not ASCII (values
                0-127), a choice is made for the remaining input: if the
                sequence represents a legal UTF-8 encoded character, the rest
                of the input is treated as UTF-8 encoded. If it is not a UTF-8
                encoded character, Latin-1 is assumed for the rest of the data
                sequence. Any later illegal decoding is then reported as an
                error. *)
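The decision procedure described for Auto can be sketched with the standard library alone. The sketch below is not the library's implementation: the type guess and the functions utf8_seq_len, valid_utf8_at, has_utf8_bom, and guess_encoding are hypothetical names introduced for illustration. It assumes that only the UTF-8 BOM needs to be recognized (the UTF-16 and UTF-32 variants are listed as not yet supported), and it checks only the structure of a multi-byte sequence, without rejecting overlong forms.

```ocaml
(* Hypothetical stand-in for the encodings that Auto can resolve to. *)
type guess = Ascii | Latin1 | Utf8

(* Length of the UTF-8 sequence introduced by lead byte [b],
   or 0 if [b] cannot start a multi-byte sequence. *)
let utf8_seq_len b =
  if b land 0xE0 = 0xC0 then 2
  else if b land 0xF0 = 0xE0 then 3
  else if b land 0xF8 = 0xF0 then 4
  else 0

(* True if a structurally valid multi-byte UTF-8 sequence starts at
   index [i]: a lead byte followed by enough continuation bytes. *)
let valid_utf8_at s i =
  let n = utf8_seq_len (Char.code s.[i]) in
  n > 0
  && i + n <= String.length s
  && (let ok = ref true in
      for k = 1 to n - 1 do
        if Char.code s.[i + k] land 0xC0 <> 0x80 then ok := false
      done;
      !ok)

(* Assumption: only the UTF-8 byte order mark EF BB BF is recognized. *)
let has_utf8_bom s =
  String.length s >= 3 && String.sub s 0 3 = "\xEF\xBB\xBF"

(* Assume ASCII until the first byte above 127, then commit to UTF-8
   or Latin-1 depending on whether that byte starts a legal sequence. *)
let guess_encoding s =
  if has_utf8_bom s then Utf8
  else begin
    let len = String.length s in
    let rec scan i =
      if i >= len then Ascii
      else if Char.code s.[i] < 0x80 then scan (i + 1)
      else if valid_utf8_at s i then Utf8
      else Latin1
    in
    scan 0
  end
```

The sketch only makes the initial choice. As the description of Auto states, any illegal sequence that follows is a decoding error rather than a reason to switch encoding again.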
val length : ustring -> int
Ustring.length s returns the number of characters of s.
val get : ustring -> int -> uchar
Ustring.get s n returns character number n in ustring s.
Raises Invalid_argument "Ustring.get" if out of range.
val set : ustring -> int -> uchar -> unit
Ustring.set s n c modifies ustring s in place by replacing the uchar at
index n by uchar c. Raises Invalid_argument "Ustring.get" if out of
range.
val create : int -> ustring
Ustring.create n returns a fresh ustring of length n
characters. Raises Invalid_argument "Ustring.create" if n < 0 or
n > Sys.max_array_length.
val make : int -> uchar -> ustring
Ustring.make n c returns a fresh ustring of length n, filled
with character c. Raises Invalid_argument "Ustring.make" if n < 0 or
n > Sys.max_array_length.
val copy : ustring -> ustring
val sub : ustring -> int -> int -> ustring
Ustring.sub s start len returns a new ustring of length
len, consisting of the substring of s that starts at index
position start. Raises
Invalid_argument "Ustring.sub" if start and len do not
designate a valid substring.
val concat : ustring -> ustring list -> ustring
Ustring.concat sep sl returns a ustring in which the ustrings in list
sl are concatenated, with separator string sep inserted between
each list element. The function always returns a fresh string.
If performance is critical, use function fast_concat.
val rindex : ustring -> uchar -> int
Ustring.rindex s c returns the index of the last occurrence of
character c in ustring s. Raises Not_found if c does not
occur in s.
val rindex_from : ustring -> int -> uchar -> int
Ustring.rindex_from s i c returns the index of the last occurrence of
character c in ustring s before index position i+1. Note that
the function calls Ustring.rindex s c and
Ustring.rindex_from s (Ustring.length s - 1) c are equivalent.
Raises Not_found if c does not occur in s. Raises
Invalid_argument "Ustring.rindex_from" if i+1 is not a valid
index in ustring s.
val append : ustring -> ustring -> ustring
See infix operator ^.. in module Ustring.Op for more details about the
operator. Note that this function always creates a fresh string
when appending. For performance-demanding applications, function
fast_append or operator ^. is recommended instead.
val fast_append : ustring -> ustring -> ustring
In contrast to append and the
standard string concatenation operator ^, function fast_append does not
allocate new memory or create a fresh string. Instead,
the concatenation is stored internally as a tree, making
the operation constant time. When the string is later used by some function
in module Ustring, the internal tree representation is automatically
collapsed to a plain string. Note that, contrary to append, this
function does not create a fresh string directly. Hence, if the substrings
are shared and modified in place, this string will also be updated. To
make sure that the result string is unique, call function copy.
This function is equivalent to infix operator ^. (see module
Ustring.Op for more details about the operator).
val fast_concat : ustring -> ustring list -> ustring
Same as concat, with the difference that it does not return
a fresh string. Function fast_concat instead uses fast_append
for string concatenation.
val count : ustring -> uchar -> int
Ustring.count s c returns the number of occurrences of character c
in ustring s.
val trim_left : ustring -> ustring
val trim_right : ustring -> ustring
val trim : ustring -> ustring
val empty : unit -> ustring
val unix2dos : string -> string
Ustring.unix2dos s returns a string where newline characters in
string s are converted to the DOS and Windows standard. The Ustring
module handles all strings internally using line feed (LF) code 0x0A, which
is standard in Unix-like systems (e.g., GNU/Linux, Mac OS X, and FreeBSD).
All input functions (e.g., from_utf8() or from_latin1()) automatically
convert to this format. Hence, when ustrings are encoded using e.g.
Latin-1 or UTF-8, they will only contain the LF character for new line.
However, Windows, DOS, OS/2, and Symbian OS, for example, use
the sequence carriage return (CR) 0x0D followed by LF 0x0A. This function
converts from Unix style to this format.
val string2hex : string -> ustring
Ustring.string2hex s returns a comma-separated list of hex
values for the bytes in string s. For example, for input string "an_",
the ustring "61,6e,5f" is returned.
val convert_escaped_chars : ustring -> ustring
Raises Invalid_argument "convert_escaped_chars" if the string contains
illegal escape sequences.
val read_file : ?encode_type:encoding -> string -> ustring
Ustring.read_file fn returns a ustring holding the whole contents
of the file named fn. By default, the encoding type is Auto
(see type encoding for details). The input can also be forced to be
interpreted as another encoding type. For example, expression
Ustring.read_file ~encode_type:Ustring.Utf8 fn creates
a ustring under the assumption that the input file is encoded using UTF-8.
If there is a decoding error, exception Decode_error enc pos is raised.
Argument enc is the encoding type
and pos is the number of bytes read from the file when the decoding error
occurred. Raises Sys_error if there were problems opening or reading
from the file.
val read_from_channel : ?encode_type:encoding -> Pervasives.in_channel -> int -> ustring
Ustring.read_from_channel ic returns a function that
can read from the in_channel. The returned function takes one argument stating
the number of Unicode characters that should be read from the stream.
It returns a ustring with the read characters. The length of the
returned ustring is approximately the requested number of characters,
i.e., the function can do a partial read. If the returned ustring has
length zero, the end of the character stream has been reached. If there is
a decoding error, exception Decode_error enc pos is raised. Argument
enc is the encoding type and pos is the number of bytes read from the
file when the decoding error occurred. Note that this function should
only be called once to get the read function int -> ustring, and no
other function is allowed to read from this channel at the same time.
exception Decode_error of (encoding * int)
val from_latin1 : string -> ustring
val from_latin1_char : char -> ustring
val from_utf8 : string -> ustring
Raises Invalid_argument "Ustring.from_utf8"
if the input string is not valid UTF-8. The input string
must not have a byte order mark (BOM).
val from_uchars : uchar array -> ustring
Raises Invalid_argument "Ustring.from_uchars" if uchar values are illegal
(must be in range 0x0-0x1FFFFF).
val latin1_to_uchar : char -> uchar
val to_latin1 : ustring -> string
Raises Invalid_argument "Ustring.to_latin1" if the characters are not within
the ASCII and Latin-1 character set (values 0-255).
val to_utf8 : ustring -> string
val to_uchars : ustring -> uchar array
val validate_utf8_string : string -> int -> int
Ustring.validate_utf8_string s n checks if the first n characters
of string s have valid UTF-8 encoding. If the input is valid, but not all data
is available (e.g. at the end, only 2 bytes are available for a character
that needs 3 bytes), the number of characters that represent
whole characters is returned. Raises exception
Decode_error enc pos if the string does not have a valid
UTF-8 encoding. Argument enc is the encoding type and pos is the position
in string s of the decode error.
The Ustring module is especially designed for simple support of
Unicode lexing and parsing. The functions below are defined for this
purpose.
val lexing_from_channel : ?encode_type:encoding -> Pervasives.in_channel -> Lexing.lexbuf
Creates a Lexing.lexbuf on a given input channel. Expression
Ustring.lexing_from_channel inchan returns a lexer buffer that reads
from input channel inchan. By default, the encoding type is Auto
(see type encoding for details). The input can also be forced to be
interpreted as another encoding type. For example, expression
Ustring.lexing_from_channel ~encode_type:Ustring.Utf8 inchan creates
a lexbuf that will assume that the input data is encoded using UTF-8.
The stream of characters returned to the lexical analyzer is
always UTF-8, regardless of the input encoding. Hence, this function
is a simple and safe way to do lexical analysis of arbitrarily encoded
text data. Raises exception Decode_error enc pos if there is a decoding
error in the data read from inchan. Argument enc is the encoding type
and pos is the number of bytes read from inchan when the decoding error
occurred.
val lexing_from_ustring : ustring -> Lexing.lexbuf
Creates a Lexing.lexbuf that reads from a ustring. The stream of
characters returned to the lexical analyzer is
always UTF-8.
val equal : t -> t -> bool
Corresponds to operator =., which is defined in
module Ustring.Op.
val not_equal : t -> t -> bool
Corresponds to operator <>., which is defined in
module Ustring.Op.
val compare : t -> t -> int
Corresponds to Pervasives.compare. Since both type t and function compare are
implemented, module Ustring can be passed directly to functors such
as Set.Make and Map.Make. For example, to create a map that uses
ustrings as keys, use the following source code line:
module USMap = Map.Make (Ustring)
val hash : t -> int
Since type t and
functions hash and equal are implemented, module Ustring can
be passed directly to functor Hashtbl.Make, making it simple to
use ustrings as keys in a hash table. For example, to create a
hash table that uses ustrings as keys, use the following source
code line:
module USHash = Hashtbl.Make(Ustring)
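
The two functor applications (Map.Make and Hashtbl.Make) work because the argument module supplies t together with compare, or with equal and hash. A minimal sketch of that requirement follows, using a hypothetical stand-in module Key in place of Ustring so that it runs without the library. Key, KeyMap, KeyHash, and key_of_ascii are names invented for this example, and Stdlib.compare assumes OCaml 4.07 or later (older compilers call the same module Pervasives).

```ocaml
(* Hypothetical stand-in for Ustring: exposes exactly the members that
   Map.Make (t, compare) and Hashtbl.Make (t, equal, hash) require. *)
module Key = struct
  type t = int array                       (* code points, like a uchar array *)
  let compare (a : t) (b : t) = Stdlib.compare a b
  let equal (a : t) (b : t) = a = b
  let hash (a : t) = Hashtbl.hash a
end

(* One module can serve both functors; extra members are ignored. *)
module KeyMap = Map.Make (Key)
module KeyHash = Hashtbl.Make (Key)

(* Helper for this example only: code points of an ASCII string. *)
let key_of_ascii s = Array.init (String.length s) (fun i -> Char.code s.[i])

(* Ordered map keyed by code-point arrays. *)
let m = KeyMap.add (key_of_ascii "one") 1 KeyMap.empty
let found_in_map = KeyMap.find (key_of_ascii "one") m

(* Hash table keyed by the same type. *)
let h : int KeyHash.t = KeyHash.create 16
let () = KeyHash.replace h (key_of_ascii "two") 2
let found_in_table = KeyHash.find h (key_of_ascii "two")
```

With the real library, Key would be replaced by Ustring itself and the keys would be ustring values.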