tlgu converts an input_file from Thesaurus Linguae Graecae (TLG) and Packard Humanities Institute (PHI) representation to a Unicode (UTF-8) output_file, which can then be read or searched with pattern-matching tools such as grep and awk. The TLG/PHI representation consists of "beta-code" text and citation information. The TLG/PHI and epigraphical corpora include the majority of classical Hellenic and Latin works and inscriptions. Several options are available, including splitting
icubaby is a C++ Library to Immediately Convert Unicode. It is a portable, header-only, dependency-free library for C++17 or later. It is fast, minimal, and easy to use for converting sequences of text between any of the Unicode UTF encodings. Simple to use and exceptionally simple to integrate into a project, it does not allocate dynamic memory and neither throws nor catches exceptions.
The ICU project is under the stewardship of The Unicode Consortium. International Components for Unicode (ICU) is an open-source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all platforms and between C, C++, and Java software. The ICU project is a technical committee of the Unicode Consortium and sponsored, sup
libgrapheme is an extremely simple freestanding C99 library providing utilities for properly handling strings according to the latest Unicode standard. It offers fully Unicode compliant grapheme cluster (i.e. user-perceived character) segmentation, word segmentation, sentence segmentation, detection of permissible line break opportunities, case detection (lower-, upper- and title-case), case conversion (to lower-, upper- and title-case) on UTF-8 strings and codepoint arrays, which both can also
ftfy transforms misencoded Unicode text back into its original form. It detects double encoding, misapplied HTML entities, and other escape sequences, and it understands UTF-8, Latin-1, and cp437, among other charsets. ftfy comes as a Python library and as a command-line tool.
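To make the "misencoded text" case concrete, here is a minimal sketch of the most common form: UTF-8 bytes that were wrongly decoded as Latin-1. It uses only the Python standard library and does not call ftfy; the helper names `make_mojibake` and `undo_mojibake` are hypothetical and exist only for this illustration. ftfy itself handles many more encodings and mixed cases than this single round trip.

```python
def make_mojibake(text: str) -> str:
    """Simulate the bug: encode as UTF-8, then wrongly decode as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def undo_mojibake(text: str) -> str:
    """Reverse one layer of UTF-8-read-as-Latin-1 damage (hypothetical helper).

    If the text cannot be re-encoded as Latin-1, or the resulting bytes are
    not valid UTF-8, the input is assumed to be fine and returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    broken = make_mojibake("café")
    print(broken)                 # cafÃ©
    print(undo_mojibake(broken))  # café
```

The `try`/`except` is the important design choice: text that was never damaged usually fails one of the two steps, so it passes through untouched rather than being corrupted by the repair.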