Module Ustring


module Ustring: sig .. end
Unicode string library - Ustring. Version 0.01. This module implements Unicode variants of the string functions in the OCaml standard library, e.g., those in modules String and Pervasives. A number of additional functions are also available. Several basic operators (e.g., equality and string concatenation) are defined in sub module Ustring.Op. It is recommended to open this sub module rather than Ustring itself.

This module currently supports ASCII, Latin-1, and UTF-8 encoding. If another encoding is used, an exception will be raised. In all functions below, a ustring s is indexed from 0 to (Ustring.length s)-1.

Copyright (C) 2010 David Broman. All rights reserved. This file is distributed under the "New BSD License".


type uchar = int 
Unicode char type. It is a basic integer.
module Op: sig .. end
Sub module Ustring.Op contains Unicode versions of several functions available in the Pervasives module, as well as operators and functions for simple string creation and manipulation (for example, the string concatenation operator ^., the ustring equality operator =., and string creation us"string").
type t = Op.ustring 
Alias for the type Op.ustring
type ustring = Op.ustring 
Alias for the type Op.ustring

type encoding =
| Ascii (*Standard ASCII (values 0-127)*)
| Latin1 (*Latin-1 encoding. Supports most European languages.*)
| Utf8 (*UTF-8 encoding. Full Unicode encoding, where each character is represented by 1-4 bytes.*)
| Utf16le (*Not yet supported*)
| Utf16be (*Not yet supported*)
| Utf32le (*Not yet supported*)
| Utf32be (*Not yet supported*)
| Auto (*Not a specific encoding, but a method for how encoded data is interpreted. If the input data has a byte order mark (BOM) at the beginning of the sequence, the encoding type stated in the BOM is used. If no BOM is available, ASCII is initially assumed. Then, when a byte sequence appears that is not ASCII (values 0-127), a choice is made for the remaining data: if the sequence represents a legal UTF-8 encoded character, the rest of the input is treated as UTF-8 encoded. If it is not a legal UTF-8 encoded character, Latin-1 is assumed for the rest of the data sequence. Any later illegal decoding is then reported as an error.*)
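
The fallback rule described for Auto can be illustrated with a small sketch. It is not the library's implementation: the names guess, utf8_seq_len and guess_encoding are hypothetical, BOM handling is left out, and the UTF-8 check is simplified (it inspects the lead byte and continuation bytes only, not overlong forms or surrogates).

```ocaml
(* Hypothetical stand-in for the three supported encodings. *)
type guess = Ascii | Latin1 | Utf8

(* Length in bytes of a well-formed UTF-8 multi-byte sequence starting
   at s.[i], or 0 if the bytes at i do not form one. Simplified check. *)
let utf8_seq_len s i =
  let n = String.length s in
  let b = Char.code s.[i] in
  let len =
    if b land 0xE0 = 0xC0 then 2
    else if b land 0xF0 = 0xE0 then 3
    else if b land 0xF8 = 0xF0 then 4
    else 0
  in
  let is_cont j = j < n && Char.code s.[j] land 0xC0 = 0x80 in
  let rec all_cont j = j >= i + len || (is_cont j && all_cont (j + 1)) in
  if len > 0 && all_cont (i + 1) then len else 0

(* Assume ASCII until the first byte >= 0x80, then decide between
   UTF-8 and Latin-1 based on that first non-ASCII sequence. *)
let guess_encoding s =
  let n = String.length s in
  let rec scan i =
    if i >= n then Ascii
    else if Char.code s.[i] < 0x80 then scan (i + 1)
    else if utf8_seq_len s i > 0 then Utf8
    else Latin1
  in
  scan 0
```

In this sketch the decision is made once, at the first non-ASCII byte, which mirrors the description above: later bytes that do not fit the chosen encoding would be decoding errors rather than a reason to switch.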

Module String's functions


The following section implements the Unicode version of the functions available in standard library module String.
val length : ustring -> int
Ustring.length s returns the number of characters of s.
val get : ustring -> int -> uchar
Ustring.get s n returns character number n in ustring s. Raises Invalid_argument "Ustring.get" if out of range.
val set : ustring -> int -> uchar -> unit
Ustring.set s n c modifies ustring s in place by replacing the uchar at index n with uchar c. Raises Invalid_argument "Ustring.set" if out of range.
val create : int -> ustring
Function Ustring.create n returns a new fresh ustring with length n characters. Raises Invalid_argument "Ustring.create" if n < 0 or n > Sys.max_array_length.
val make : int -> uchar -> ustring
Function Ustring.make n c returns a new fresh ustring of length n, filled with character c. Raises Invalid_argument "Ustring.make" if n < 0 or n > Sys.max_array_length.
val copy : ustring -> ustring
Returns a fresh copy of the ustring, i.e., the copy shares no data with the original.
val sub : ustring -> int -> int -> ustring
Function Ustring.sub s start len returns a new ustring of length len, consisting of the sub-string of s that starts at index position start. Raises exception Invalid_argument "Ustring.sub" if start and len do not give a valid sub-string.
val concat : ustring -> ustring list -> ustring
Ustring.concat sep sl returns a ustring in which the ustrings in list sl are concatenated, with separator string sep inserted between each pair of list elements. The function always returns a new fresh string. If performance is critical, use function fast_concat.
val rindex : ustring -> uchar -> int
Ustring.rindex s c returns the index of the last occurrence of character c in ustring s. Raises Not_found if c does not occur in s.
val rindex_from : ustring -> int -> uchar -> int
Ustring.rindex_from s i c returns the index of the last occurrence of character c in ustring s before index position i+1. Note that function calls Ustring.rindex s c and Ustring.rindex_from s (Ustring.length s - 1) c are equivalent. Raises Not_found if c does not occur in s. Raises Invalid_argument "Ustring.rindex_from" if i+1 is an illegal index in ustring s.

Additional string functions

val append : ustring -> ustring -> ustring
Append/concatenation of two strings. This function is equivalent to the infix operator ^.. (see module Ustring.Op for more details about the operator). Note that this function always creates a new fresh string. For performance-demanding applications, function fast_append or operator ^. is recommended instead.
val fast_append : ustring -> ustring -> ustring
Fast append/concatenation of two strings. Compared to function append and the standard string concatenation operator ^, function fast_append does not allocate new memory or create a new fresh string. Instead, the concatenation is stored internally as a tree, making the operation constant time. When the string is later used by some function in module Ustring, the internal tree representation is automatically collapsed to a plain string. Note that, in contrast to append, this function does not create a fresh new string directly. Hence, if the sub strings are shared and modified in place, this string will also be updated. To make sure that the result string is unique, call function copy. This function is equivalent to the infix operator ^. (see module Ustring.Op for more details about the operator).
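
The idea of storing a concatenation as a tree and collapsing it on demand can be sketched as follows. This is only an illustration of the technique, not the actual internal representation of Ustring: the type tree and the functions fast_append_sketch, leaves and collapse are hypothetical names, and code points are stored as plain int arrays.

```ocaml
(* A tree of pending concatenations over arrays of code points. *)
type tree =
  | Leaf of int array
  | Node of tree * tree

(* Constant time: no characters are copied, only one node is allocated. *)
let fast_append_sketch a b = Node (a, b)

(* Collect the leaf arrays from left to right. *)
let rec leaves t acc =
  match t with
  | Leaf a -> a :: acc
  | Node (l, r) -> leaves l (leaves r acc)

(* Collapse the tree into one flat array; this is where copying happens. *)
let collapse t = Array.concat (leaves t [])

(* Sharing caveat: a leaf modified in place before the collapse shows up
   in the collapsed result. *)
let shared = [| 97 |]
let joined = fast_append_sketch (Leaf shared) (Leaf [| 98; 99 |])
let () = shared.(0) <- 120
let flat = collapse joined
```

Here flat starts with 120 rather than 97, which is the sharing effect the note above warns about and the reason copy is recommended when a unique result is needed.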
val fast_concat : ustring -> ustring list -> ustring
Same as function concat, except that it does not return a fresh string. Function fast_concat instead uses fast_append for string concatenation.
val count : ustring -> uchar -> int
Ustring.count s c returns the number of occurrences of character c in ustring s.
val trim_left : ustring -> ustring
Returns a new ustring where white space (e.g. space, newline and tab) is removed from the beginning of the input ustring.
val trim_right : ustring -> ustring
Returns a new ustring where white space (e.g. space, newline and tab) is removed from the end of the input ustring.
val trim : ustring -> ustring
Returns a new ustring where white space (e.g. space, newline and tab) is removed from the beginning and end of the input ustring.
val empty : unit -> ustring
Returns an empty ustring
val unix2dos : string -> string
Function Ustring.unix2dos s returns a string where newline characters in string s are converted to the DOS and Windows standard. The Ustring module handles all strings internally using line feed (LF), code 0x0A, which is standard in Unix-like systems (e.g., GNU/Linux, Mac OS X, and FreeBSD). All input functions (e.g., from_utf8() or from_latin1()) automatically convert to this format. Hence, when ustrings are encoded using e.g. Latin-1 or UTF-8, they will only contain the LF character for new line. However, Windows, DOS, OS/2 and Symbian OS, for example, use the sequence carriage return (CR) 0x0D followed by LF 0x0A. This function converts from Unix-style newlines to this format.
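
The conversion described above can be sketched with the standard library alone. The function name lf_to_crlf is hypothetical, and the sketch assumes the input contains only LF newlines (as the text states for encoded ustrings), so every LF is expanded unconditionally.

```ocaml
(* Replace every LF (0x0A) with the CR LF pair (0x0D 0x0A). *)
let lf_to_crlf s =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      if c = '\n' then Buffer.add_string b "\r\n"
      else Buffer.add_char b c)
    s;
  Buffer.contents b
```

Applying such a function to text that already contains CR LF would produce CR CR LF, so it is only appropriate for input known to be in Unix format.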
val string2hex : string -> ustring
Function Ustring.string2hex s returns a comma-separated list of hex values for the bytes in string s. For example, for input string "an_" the ustring "61,6e,5f" is returned.
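
A rough equivalent over plain strings, useful for checking the example by hand, might look like this. The name hex_of_bytes is hypothetical, the result is an ordinary string rather than a ustring, and the two-digit lowercase formatting is an assumption based on the example above.

```ocaml
(* Format each byte of s as two lowercase hex digits, joined by commas. *)
let hex_of_bytes s =
  String.concat ","
    (List.init (String.length s)
       (fun i -> Printf.sprintf "%02x" (Char.code s.[i])))
```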
val convert_escaped_chars : ustring -> ustring
Converts escaped characters. Raises Invalid_argument "convert_escaped_chars" if the string contains illegal escape sequences.
val read_file : ?encode_type:encoding -> string -> ustring
Function Ustring.read_file fn returns a ustring holding the whole contents of the file with file name fn. By default, the encoding type is Auto (see type encoding for details). The input can also be forced to be interpreted as another encoding type. For example, expression Ustring.read_file ~encode_type:Ustring.Utf8 fn creates a ustring assuming that the input file is encoded using UTF-8. If there is a decoding error, exception Decode_error enc pos is raised. Argument enc is the encoding type and pos is the number of bytes read from the file when the decoding error occurred. Raises Sys_error if there were problems opening or reading from the file.
val read_from_channel : ?encode_type:encoding ->
Pervasives.in_channel -> int -> ustring
Function Ustring.read_from_channel ic returns a function that reads from the in_channel ic. The returned function takes one argument stating the number of Unicode characters that should be read from the stream, and returns a ustring with the characters read. The length of the returned ustring is approximately the requested number of characters, i.e., the function can do a partial read. If the returned ustring has length zero, the end of the character stream has been reached. If there is a decoding error, exception Decode_error enc pos is raised. Argument enc is the encoding type and pos is the number of bytes read from the channel when the decoding error occurred. Note that this function should only be called once to get the read function int -> ustring, and no other function is allowed to read from this channel at the same time.

Encoding

exception Decode_error of (encoding * int)
Exception raised when a decode error occurs. The first parameter is the encoding method used when the error occurred and the second parameter is the position in the stream/string/channel.
val from_latin1 : string -> ustring
Creates a ustring from a string that is assumed to be encoded with Latin-1.
val from_latin1_char : char -> ustring
Creates a ustring from a Latin-1 encoded char.
val from_utf8 : string -> ustring
Creates a ustring from a string that is assumed to be encoded using UTF-8. Raises exception Invalid_argument "Ustring.from_utf8" if the input string is not validly UTF-8 encoded. The input string must not have a byte order mark (BOM).
val from_uchars : uchar array -> ustring
Converts an array of uchars to a ustring. Raises exception Invalid_argument "Ustring.from_uchars" if uchar values are illegal (must be in range 0x0-0x1FFFFF).
val latin1_to_uchar : char -> uchar
Converts a Latin-1 encoded char to a uchar.
val to_latin1 : ustring -> string
Creates a new string encoded using Latin-1. Raises Invalid_argument "Ustring.to_latin1" if the characters are not within the ASCII and Latin-1 character set (values 0-255).
val to_utf8 : ustring -> string
Returns a UTF-8 encoded string. A UTF-8 string consists of a sequence of bytes where each Unicode character is encoded into 1 to 4 bytes. ASCII characters are always encoded into 1 byte.
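
The 1-to-4-byte layout can be made concrete with a sketch that encodes a single code point. This shows the standard UTF-8 bit layout and is not taken from the Ustring sources. The name utf8_of_code_point is hypothetical, and the upper bound 0x10FFFF is the Unicode limit chosen for this sketch, not the range check documented for from_uchars.

```ocaml
(* Encode one code point as a UTF-8 byte string. *)
let utf8_of_code_point cp =
  let b = Buffer.create 4 in
  let add x = Buffer.add_char b (Char.chr x) in
  if cp < 0 || cp > 0x10FFFF then invalid_arg "utf8_of_code_point"
  else if cp < 0x80 then
    (* 1 byte: 0xxxxxxx *)
    add cp
  else if cp < 0x800 then begin
    (* 2 bytes: 110xxxxx 10xxxxxx *)
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end else if cp < 0x10000 then begin
    (* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx *)
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end else begin
    (* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx *)
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end;
  Buffer.contents b
```

For instance, the code point 0x41 ("A") stays a single byte, while 0x20AC (the euro sign) becomes the three bytes E2 82 AC.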
val to_uchars : ustring -> uchar array
Returns an array of uchars.
val validate_utf8_string : string -> int -> int
Expression Ustring.validate_utf8_string s n checks if the first n characters of string s have valid UTF-8 encoding. If the input is valid but not all data is available (e.g., at the end, only 2 bytes are available for a character that needs 3 bytes), the number of characters of s that represent whole characters is returned. Raises exception Decode_error enc pos if the string is not validly UTF-8 encoded. Argument enc is the encoding type and pos is the position in string s of the decode error.

Lexing


The Ustring module is especially designed for simple support of Unicode lexing and parsing. The functions below are defined for this purpose.
val lexing_from_channel : ?encode_type:encoding -> Pervasives.in_channel -> Lexing.lexbuf
Creates a new Lexing.lexbuf on a given input channel. Expression Ustring.lexing_from_channel inchan returns a lexer buffer that reads from input channel inchan. By default, the encoding type is Auto (see type encoding for details). The input can also be forced to be interpreted as another encoding type. For example, expression Ustring.lexing_from_channel ~encode_type:Ustring.Utf8 inchan creates a lexbuf that assumes the input data is encoded using UTF-8. The stream of characters returned to the lexical analyzer is always UTF-8, regardless of the input encoding. Hence, this function is a simple and safe way to do lexical analysis of arbitrarily encoded text data. Raises exception Decode_error enc pos if there is a decoding error in the data read from inchan. Argument enc is the encoding type and pos is the number of bytes read from inchan when the decoding error occurred.
val lexing_from_ustring : ustring -> Lexing.lexbuf
Creates a new Lexing.lexbuf that reads from a ustring. The stream of characters that are returned to the lexical analyzer is always UTF-8.

Comparison and Standard Collections

val equal : t -> t -> bool
Safe structural equality comparison function for ustrings. For easy usage, use the equivalent operator =. which is defined in module Ustring.Op.
val not_equal : t -> t -> bool
Safe structural inequality comparison function for ustrings. For easy usage, use the equivalent operator <>. which is defined in module Ustring.Op.
val compare : t -> t -> int
Comparison function for ustrings. Uses the same specification as Pervasives.compare. Since both type t and function compare are implemented, module Ustring can be passed directly to functors such as Set.Make and Map.Make. For example, to create a map that uses ustrings as keys, use the following source code line:

module USMap = Map.Make (Ustring)

val hash : t -> int
Implements a safe hash function for ustrings. Since type t and functions hash and equal are implemented, module Ustring can be passed directly to functor Hashtbl.Make, making it simple to use ustrings as keys in a hash table. For example, to create a hash table that uses ustrings as keys, use the following source code line:

module USHash = Hashtbl.Make(Ustring)
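
To show what the two functors require, the following sketch uses a hypothetical stand-in module Key (an int array of code points) in place of Ustring, so that it can be run without the library. Only type t, compare, equal and hash matter here. The module and value names are assumptions made for illustration.

```ocaml
(* Stand-in key module; Ustring itself provides the same four items. *)
module Key = struct
  type t = int array
  let compare (a : t) (b : t) = compare a b
  let equal (a : t) (b : t) = a = b
  let hash (a : t) = Hashtbl.hash a
end

module KMap = Map.Make (Key)
module KHash = Hashtbl.Make (Key)

(* A map and a hash table keyed on the code points of "hi". *)
let m = KMap.add [| 104; 105 |] 1 KMap.empty

let h : int KHash.t = KHash.create 16
let () = KHash.replace h [| 104; 105 |] 2
```

Lookups then go through KMap.find and KHash.find with a structurally equal key, which is the behaviour the safe equal and hash functions are meant to guarantee for ustrings.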