libunidata

The Unicode Standard defines a large character set that, although not ideal, is rapidly becoming the preferred encoding of multilingual text, and is aligned with the ISO 10646 standard character encoding.

Besides the interpretation of its 65536 code positions, Unicode specifies a number of character properties and algorithms which are an important part of the standard and essential for most serious applications. This library makes most of those character properties available to applications in a readily usable, language-independent form, stored in simple tables indexed by the code position.
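
As a rough sketch of the idea only (the actual record layout and names are defined in “unidata.h” and differ in their details), a property lookup amounts to little more than indexing an array with the code position:

/* Conceptual sketch only -- the real property records are declared in
 * "unidata.h"; the field names below are illustrative. */
struct char_props {
    unsigned char general_category;   /* letter, digit, punctuation, ...  */
    unsigned char combining_class;    /* canonical combining class        */
    /* ... further properties ... */
};

/* One record per code position in the BMP. */
extern const struct char_props unidata_table[0x10000];

unsigned char general_category_of(unsigned int codepoint)
{
    return unidata_table[codepoint].general_category;
}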

The library is a shared object built directly from the Unicode Character Database using a set of Perl scripts. The character properties are stored in a raw database and are not intended to be accessed directly by the programmer. Currently, I include a C programming interface to the library, and will probably support more languages in the future. In particular, I am in the process of adding full Unicode support to GHC and Hugs.

Because the database maps a very large character set and a range of attributes, it has a huge memory footprint. For that reason, it is unsuitable for static linking. However, when dynamically loaded, only a fraction of the database will ever find its way into memory on a typical system.
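
If you prefer, the shared object can even be loaded at run time rather than at link time. The following minimal sketch uses the standard “dlopen” interface (link with “-ldl” on most systems) and merely checks that the library can be loaded:

/* Minimal sketch: load the shared object at run time with dlopen()
 * instead of linking against it at build time. Only the pages of the
 * property tables that are actually touched are ever mapped into memory. */
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    void *handle = dlopen("libunidata.so.1.0", RTLD_LAZY);
    if (handle == NULL) {
        fprintf(stderr, "cannot load libunidata: %s\n", dlerror());
        return 1;
    }
    /* ... look up the entry points you need with dlsym() ... */
    dlclose(handle);
    return 0;
}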

Currently, the library provides all data from the following database files for all 65536 code positions in the Basic Multilingual Plane (BMP):

Unicode characters above U+FFFF are unsupported in this version of the library. Even though many code positions above U+FFFF have been allocated in version 3.0.1 of the standard, these characters have very specialized uses and uniform properties that can easily be derived algorithmically. Besides, these characters already require special handling in the UTF-16 encoding. Therefore, rather than abandoning the ideal of a straightforward table representation, I've decided to declare anything outside of the BMP out of scope for this library.
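
For reference, checking whether a character belongs to the BMP is trivial, and so is the UTF-16 surrogate-pair encoding of the characters above U+FFFF mentioned above. These helpers are not part of the library, only an illustration:

/* A code position is in the BMP iff it fits in sixteen bits. */
int in_bmp(unsigned long codepoint)
{
    return codepoint <= 0xFFFFUL;
}

/* Split a character above U+FFFF into its UTF-16 surrogate pair. */
void to_surrogates(unsigned long codepoint,
                   unsigned int *high, unsigned int *low)
{
    codepoint -= 0x10000UL;
    *high = 0xD800U + (unsigned int)(codepoint >> 10);
    *low  = 0xDC00U + (unsigned int)(codepoint & 0x3FFUL);
}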

Note that the two files “BidiMirroring.txt” and “CaseFolding.txt” were added to the Unicode Character Database in version 3.0.1 of the standard. If you have the Unicode 3.0 book, you will not be able to use its CD-ROM to build the library, and will have to download the database from the Unicode web site instead.

Also note that the following database files are currently ignored:

Of those, “SpecialCasing.txt” contains useful but difficult-to-represent data, and I plan to include it in future versions of the library. “Unihan.txt” contains data that I do not feel qualified to handle; if anyone with some knowledge of the relevant Asian languages wishes to contribute, I will be happy to include it, too. “NormalizationTest.txt” contains a large volume of specialized debugging data that has no place in a general-purpose library: those requiring it to debug their algorithms can do so at their own peril. The remaining four files reproduce data already included in other database files, and are completely redundant.

Note, however, that “PropList.txt” is required to build the library. The scripts use it to obtain the version of the standard to be stored in the three version constants.

Building and Installation

You can download the latest version of the library via HTTP from:

http://www.jantar.org/libunidata/libunidata.tar.gz

For the impatient, you can build the library with the usual:

zcat libunidata.tar.gz | tar -xf -
cd unidata
make

The makefile will download all required files from the Unicode web site (which can take a long time), generate the C files, compile them, and link them into a single “libunidata.so” shared library. You will have to make sure that “wget” is somewhere in your path for this to work. If you don't have a permanent connection to the Internet, or for any other reason don't want to download the Unicode character database files during the build process, download and copy the following files into the “unidata” directory after extracting “libunidata.tar.gz”:

The latest version of the character database is available from:

http://www.unicode.org/Public/UNIDATA/
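
Each of the files can also be fetched by hand, for example (“UnicodeData.txt” being one of them):

wget http://www.unicode.org/Public/UNIDATA/UnicodeData.txt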

Once all of those files are in the “unidata” directory, you can run “make” as above.

Once compiled, you can install the library in “/usr/local” by running “make install”. If you'd like to install it anywhere else, simply copy “libunidata.so” to “$PREFIX/lib/libunidata.so.1.0” and “unidata.h” to “$PREFIX/include/unidata.h”, and you are ready to use the library.
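
In other words, a manual installation into a non-standard prefix boils down to two copies, assuming the shell variable “$PREFIX” holds your chosen prefix:

cp libunidata.so $PREFIX/lib/libunidata.so.1.0
cp unidata.h $PREFIX/include/unidata.h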

The structure of the library is described in great detail in the “unidata.h” header in the distribution.
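
To give a flavour of what a client program looks like, here is a hypothetical fragment; the function name “unidata_general_category” is purely illustrative, and the actual declarations should be taken from “unidata.h”:

/* Hypothetical usage sketch -- the real interface is declared in "unidata.h";
 * unidata_general_category is an illustrative name, not necessarily the
 * actual one. */
#include <stdio.h>
#include <unidata.h>

int main(void)
{
    unsigned int codepoint = 0x00E9;  /* U+00E9 LATIN SMALL LETTER E WITH ACUTE */
    int category = unidata_general_category(codepoint);
    printf("U+%04X has general category code %d\n", codepoint, category);
    return 0;
}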