tesseract-ocr 4.0.0

tesseract-ocr is an OCR engine originally developed by Hewlett Packard and now sponsored by Google. It is highly accurate and will read a binary, gray, or color image and output text.

Tags c++ c ocr library cli
License Apache
State stable

Recent Releases

4.0.0 30 Oct 2018 21:45 minor feature: Add deconfiguration for LSTM. add lstmdeconfig to distribution and installation process. 4.0.0 Release.
4.0.0-rc4 25 Oct 2018 07:05 minor feature: cppan build. commontraining: two comments. CID 1396172 (Uninitialized members). Revert "CID 1396172 (Uninitialized members)". TessPDFRenderer: Remove unused member variable jpg_quality_ (CID 1396 . CycleTimer: Add missing initialization (CID 1396168). lstmtraining: Handle failed remove syscall (CID 1396166). Classify: new resource leak (CID 1396163). uninitialized scalar variable (CID 1395880). OpenclDevice: Catch negative index (CID 1395110). SVNetwork: Handle failed socket call (CID 1164597). classify/cluster: Replace Emalloc by std::vector. sum computation in higher precision. LLSQ: Replace sqrt by std::sqrt. sum computation in higher precision. Renamed GetGlyphConfidences() to GetChoices() and glyph_confidences t . LineHypothesis: Add copy assignment operator. ParamsTrainingHypothesis: Add copy assignment operator. BLOB_CHOICE: Add copy assignment operator. ROW: Add declaration for copy constructor. C_OUTLINE_FRAG: Add declaration for copy constructor. BlamerBundle: Add declaration for copy assignment operator. unittest: Add more files from Google. TessResultRenderer: Extend API to access status of renderer. tesseractmain: Show error message when output file could not be created. Update test submodule. Add configuration for LGTM. free PangoFontMap. cluster: some potential overflows. BLOBNBOX: Declare signed bit field. configuration for LGTM. Rename API function for getting LSTM choices. Rename API function from GetBestLSTMChoices to GetBestLSTMSymbolChoices. training: Don't hide global variables. define NOUNDEFINED for cygwin. Merge branch 'master' of https://github.com/tesseract-ocr/tesseract. Revert "free PangoFontMap; ". Revert "prefer to use FreeType for pango_cairo_font_map". Remove type cast and compiler warning (-Wcast-qual). ScrollView: Optimize local table_colors. install tra
4.0.0-rc3 15 Oct 2018 21:05 minor feature: using c-api / compile with gcc. Merge branch 'master' of https://github.com/tesseract-ocr/tesseract. Merge branch 'master' of https://github.com/tesseract-ocr/tesseract. Plumbing: Remove comparison which is always false. pgedit: remove unused declaration of display_bln_lines. svpaint: Change a variable from global to local. UNICHARMAP: Remove comparison which is always false. Classify: Don't hide deparameter. SVPaint: Remove empty block. Avoid crash with --psm 0 and LSTM traineddata. Always use isascii() with isspace(). Update test submodule. Update googletest submodule to release v1.8.1. keep API compatibility with #1265. Remove code for _MSC_VER 1900. Remove virtual specifiers. Merge branch 'master' of https://github.com/tesseract-ocr/tesseract. uninitialized variable, remove unused variable. hocr: add ocrp_wconf to unconditional ocr-capabilities. integer overflow in overlap calculation. Use env variable in AppVeyor configuration. remove insight.io badge. remove not existing directory from autotools distribution. Merge branch 'master' of https://github.com/tesseract-ocr/tesseract. building of ScrollView.jar with modern java version. Add Abseil as a submodule (needed for some of the new unit tests). Update test submodule. Add more hacks for use with Google unittests. Enhance LOG emulation. Add a basic implementation of class CycleTimer. unittest: Add baseapi_test. unittest: Add qrsequence_test. unittest: Add fileio_test. Remove gradechop.h. Remove tab character in source files. unittest: Add imagedata_test. unittest: Add paragraphs_test. unittest: Add lang_model_test (only works partially). unittest: Add mastertrainer_test (only works partially). unittest: Disable build rules for tests which still fail to build. adapt info about ScrollView.jar build. add cmake files to autotools distribution packa
4.0.0-rc2 08 Oct 2018 13:45 minor feature: remove duplicate help from combine_lang_model. Move content of ipoints.h to points.h and remove ipoints.h. comments. CID 1395882 (Uninitialized scalar variable). print help for tesstrain.sh. CID 1164579 (Explicit null dereferenced). version info in VERSION. Update tesseract man page about both OCR engines in tesseract 4. Update README about both OCR engines in tesseract 4. Don't set page segmentation mode for hocr, pdf and tsv configs. Allow orientation detection with any traineddata. Don't set page segmentation mode for unlv config. Update tesseract man page. Add Makefile rule to build HTML manpages. Document some more config options for tesseract. Merge and enhance documentation on language and script models. implement parameter min_characters_to_try for minimum characters to t . lstmtraining: Check write permission for output model. combine_tessdata: Handle failures when extracting. lstmtraining: Remove dead code for purified model name. use of wrong UNICHARSET. constructor for class Dict (uninitialized member variables). genericvector: Rewrite code to satisfy static code analyzer. intproto: Use more efficient float calculations for floor. rect: Use more efficient float calculations for ceil, floor. chop: Use more efficient float calculations for sqrt. genericvector: Pass parameters by reference. GENERIC_2D_ARRAY: Pass parameters by reference. WERD_RES: Remove comparisons which are constant. improve description of min_characters_to_try variable. pgedit: Change some variables from global to local ones. "mktemp -d --tmpdir" on Mac OS; see #1453. Merge branch 'master' of https://github.com/tesseract-ocr/tesseract. Rework check for readable input file. use pdf L_FLATE_ENCODE only for png input. Release candidate 2.
4.0.0-rc1 01 Oct 2018 20:25 minor feature: Added JPEG quality option parameter (-c jpg_quality=n). reported by Coverity Scan. reported by Coverity Scan. TessPDFRenderer: Improve robustness of API. detected by Coverity Scan. whitespace. detected by Coverity Scan. detected by Coverity Scan. potential crash with --psm 0 and use osd.traineddata automatically. ImageThresholder::OtsuThresholdRectToPix for OpenCL. ColPartition: Rename median_size_ - median_height_. Initial commit to add Aksara Jawa - Javanese script. typo re Javanese. change validate javanese similar to indic. Revert Makefile.am to beta.2. remove duplicate include. scrollview: Clean include statements. typo in function name. typo in comments and variable name. typo in function name. typo in comments and variable name. Javanese script training. remove duplicate include. add variable --save_box_tiff to Save box/tiff pairs along with lstmf . Added the option for character accumulated glyph confidences. CID 1395116 ('Constant' variable guards dead code). CID 1395114 ('Constant' variable guards dead code). CID 1395113 ('Constant' variable guards dead code). CID 1395109 (Logically dead code). CID 1395108 (Dereference after null check). CID 1164567 (Dereference after null check). assertion caused by access to default TBOX. new whitespace. Convert CRLF line endings to LF. Move class tesseract::File from training to ccutil. Add more portability hacks for Google test environment. Add more unittests from Google. unittest: and enable bitvector_test. unittest: and enable cleanapi_test. unittest: and enable colpartition_test. unittest: and enable denorm_test. Add ARRAYSIZE macro for Google test environment. unittest: and enable heap_test. unittest: and enable indexmapbidi_test. unittest: and enable intfeaturemap_test. unittest: and enable linlsq_tes
4.0.0-beta.4 31 Jul 2018 15:25 minor feature: CID 1393540 (Explicit null dereferenced). CID 1393244 and CID 1393244 (Uninitialized scalar variable). CID 1393243 (Uninitialized scalar field). CID 1393239 (Dereference null return value). CID 1393238 (Dereference null return value). CID 1393241 (Dereference null return value). Replace ASSERT_HOST in genericvector.h. Remove errcode.h from public API. Remove public API file ndminx.h. Clean usage of assert.h. Replace string.h by standard C++ cstring. Remove LSTM header files from public API. Remove arch header files from public API. Remove unneeded include statements for scanutils.h. Remove recursive header. Clean some include statements. Remove memry.h from public API. Remove empty tessbox.h. Clean more include files and include statements. coutln: Replace alloc_mem, free_mem by standard functions. adaptions: Remove unneeded include statement. qspline: Remove unneeded include statement. strngs: Replace alloc_mem, free_mem by standard functions. gap_map: Replace alloc_mem, free_mem by C++ new, delete. pitsync1: Remove unneeded include statement. qspline: Replace alloc_mem, free_mem by C++ new, delete. makerow: Replace alloc_mem, free_mem by C++ new, delete, std::vector. oldbasel: Replace alloc_mem, free_mem by C++ new, delete, std::vector. pithsync: Replace alloc_mem, free_mem by C++ std::vector. tordmain: Replace alloc_mem, free_mem by C++ std::vector. Remove memry.cpp, memry.h. Remove stderr.h and its include statements. dotproductsse: include statements. Update VERSION. CID 1386094 (Unread field). CID 1386098 (Dubious method used). CID 1386104 (Dereference null return value). CID 1386083 (Dereference null return value). CID 1164746 (Big parameter passed by value). CID 1157757 (Logically dead code). CID 1158180 (Argument cannot be negative) and clean code a bit. CID 1242849 (U
4.0.0-beta.2 20 Jun 2018 03:17 minor feature: Download the leptonica source from github. Add new line to a few error messages. filenames in comments. from pull of cleanups: clang tidied, reviewed, new, . Added script-specific validation and normalization for virama-using s . build broken by previous commits that added use of string in lo . Deleted some dead LSTM code, making everything use the recoder. Removed changes from last commit that didn't belong. Move LSTM unicharset and recoder to traineddata with version string p . type of bit values. wrong data type in argument for sscanf. Remove extra semicolons. windows build. regression of. PangoFontInfo: Remove unused method is_fraktur. PangoFontInfo: Remove unused method is_monospace. PangoFontInfo: Remove unused method is_smallcaps. PangoFontInfo: Remove unused method is_bold. PangoFontInfo: Remove unused method is_italic. Use lept_free to free memory allocated by Leptonica. regression of again!. BestPix to always return the highest resolution available, even . Removed unnecessary using statements and cleaned up google/non-google . Important to RTL languages saves last space on each line, which w . clang tidy on previous pull. Add googletest submodule. cmake: Add googletest. googletest: Add dummy test. Changed the way unicharsets are handled to allow support for the ch . Rewrote the recoder to use an encoding based on wubi instead of radic . Define std::max under VS2017 x64. Part 2 of separating out the unicharset from the LSTM model, ing c . Added ADAM optimizer, unless git screwed it up, cos there is no diff. Removed errors introduced by git merge. Added AVX2 and AVX512 detector. Added convert to int and directory listing to combine_tessdata.
4.0.0-beta.1 11 Mar 2018 19:05 minor feature: Remove unused method TessdataManager::OverwriteEntry. Remove unused method TessdataManager::LoadFileLater. crash if output file could not be opened. cleanup. inside main() use return rather than exit. Improve robustness of TessdataManager. automake: Enable all warnings and a warning. genericvector: Add overloaded LoadDataFromFile. Remove unneeded null pointer check. Replace Standard C library header files by C++ header files. Remove obsolete comments and unused code from ccutil/host.h. EquationDetect: Remove unneeded new / delete operations. and improve Dockerfile. opencl: Remove more unused code. README: Add Coverity badge. Update README.md. Reduce number of new / delete operations for class KDTreeSearch. Reduce number of new / delete operations for class LanguageModel. UNICHARSET: Add missing initialization. Optimize LSTM code for builds without OpenMP. use correct name for Mac OS X, correct link to training wiki. Update documentation for installation. Reorganize Readme.md. Update Template. Add link to ` the guidelines for this repository`. Add link to guidelines for this repository. Add badges for Doxygen and Wiki documentation. typo. Update readme for 3.05.01. StringRenderer::pen_color_: int 3 - double 3. Change Mac OS X - macOS. PangoFontInfo: Remove unused method is_fraktur. Remove strcasestr which is no longer needed. PangoFontInfo: Remove unused method is_monospace. PangoFontInfo: Remove unused method is_smallcaps. PangoFontInfo: Remove unused method is_bold. PangoFontInfo: Remove unused method is_italic. Make less verbose. opencl: Remove unused code. opencl: some compiler warnings. LSTMTrainer: Catch empty vectors. Update from Leptonica 1.74.1 to 1.74.2. Travis CI for Leptonica 1.74.2. Remove local implementation of
3.05.01 02 Jun 2017 06:39 major bugfix: Bugfix release for stable tesseract version.
3.05.00 17 Feb 2017 11:05 minor feature: Made some fine tuning to the hOCR output. Added TSV as another optional output format. ABI break introduced in 3.04.00 with the AnalyseLayout() method. text2image tool - Enable all OpenType ligatures available in a font. This feature requires Pango 1.38 or newer. Training tools - Replaced asserts with tprintf() and exit(1). Cygwin compatibility. Improved multipage tiff processing. Improved the embedded pdf font (pdf.ttf). Enable selection of OCR engine mode from command line. Changed tesseract command line parameter '-psm' to '--psm'. Added new C API for orientation and script detection, removed the old one. Increased minimum autoconf version to 2.59. Removed dead code. many compiler warning. memory and resource leaks. some with the 'Cube' OCR engine. some openCL. Added option to build Tesseract with CMake build system. Implemented CPPAN support for easy Windows building.
4.00.00alpha 16 Dec 2016 09:05 minor feature: Remove unneeded definition for NULL. Use different font list and exposures for "lat" language training. Add info for progress monitor, make it visible in doxygen doc; remove?. Add Junicode to neo-Latin fonts. Update ci scripts. Test release build on windows. Update appveyor.yml. Update appveyor.yml. Update appveyor.yml. Training should work now. Update .travis.yml. Update appveyor.yml. Update CMakeLists.txt. Update .travis.yml. Merge branch 'master' of github.com:tesseract-ocr/tesseract. Update CMakeLists.txt. Update leptonica version. Update .travis.yml. Update appveyor.yml. Merge branch 'master' of github.com-egorpugin:egorpugin/tesseract. Update CMakeLists.txt. Improve leptonica search. Make box training work. Compatibility with Leptonica 1.73. Add more include directories. Merge branch 'master' of github.com:tesseract-ocr/tesseract. Update README.md. Update README.md. Update README.md. Replace pdf.ttf with sharp2.ttf, keep name the same. Document hocr_font_info in config. INCOMPATIBLE to hOCR line height information -. varsize array for Microsoft compiler. Only generate dir for HOCR when needed -. Emit fewer "lang" attributes. Add LTR mixed direction test files. Update README.md. compiler warning (signed / unsigned mismatch). Adds char GetHOCRTSVText(int) as placeholder. Copy of char GetHOCRT?. Adds TessHOcrTsvRenderer class for rendering HOCR info in tsv format. Calls TessHOcrTsvRenderer if tessedit_create_hocrtsv is true. Adds hocrtsv file to configs folder. Adds hocrtsv to tessdata/configs/Makefile.am. Adds BoolParam tessedit_create_hocrtsv in class Tesseract. Render output in TSV format. Avoids HTML escaping. Cleanup TSV renderer. hocrtsv references in Makefile. Add inactivity timeout for icu download on windows. move new delete histogramAllChannels inside the #ifdef USE_OPENCL; fi?. Update INSTALL.GIT.md. improve tesseract.pc.in -. solve segfault for box.train;. update Release Notes. Don't display tesseract's banner when quiet
3.04.01 17 Feb 2016 10:45 minor feature: Add check for opencl requirements. Rework opencl requirements (configure: error: conditional "AMDEP"?. Typo. GRAPHICS_DISABLED build. Strcasestr needed on Cygwin too. Libicui18n is only called libicuin on mingw, not cygwin. Implement build without cube (-DNO_CUBE_BUILD). Tessedit_create_txt 0 blocks box training. Memory leak based on (https://code.google.com/p/tesse?. Remove empty header file secname.h. Replace CubeUtils::UTF8ToUTF32 in pdfrenderer. Enable pdfrender with NO_CUBE_BUILD. NO_CUBE_BUILD with reverting to ANDROID_BUILD in baseapi. Improve NO_CUBE_BUILD. in UTF-16BE conversion. Remove extraneous line feed. VC14 compiler. Enable OpenMP support. Turn off optimisation in Microsoft Visual Studio for TextlineProjecti?. Rename README to README.md -. Remove info about VS 2008. to compile tesseract on mac with clang. For OpenCL reported on Apple Mac. Still get -54 on Apple?. VS2010 build. OpenCL build on Mac. Configure.ac for OS X and -framework. Missing "allheaders.h" when compiling with --enable-opencl on OS X. Various clang compilation errors. Get OpenCL to compile on OS X. Configure.ac unconditionally enabling OpenCL. Add ULL to constants which overflow 32 bits. Simplify build and run of ScrollView. Tesstrain.sh: Only fall back to default Latin fonts if none were prov?. Tesstrain.sh: Only set FONTS if they weren't set on the command line. Tesstrain.sh: Initialise fontconfig even if Arial isn't available. Remove --bin_dir option from tesstrain.sh (should use PATH instead). Add --exposures option to tesstrain.sh. Use mktemp to create workspace directory. COPYING: typo found by codespell. Api: typos in comments (all found by codespell). Ccmain: typos in comments and strings. Typo. Ccstruct: typos in comments and strings. Ccutil: typos in comments and strings. Classify: typos in comments and strings. Cube: typos in comments. Cutil: typos in comments. Dict: typos in comments and strings. Doxyfile: typo in comment. Java: typos in comments and strings. Wordrec: ty
3.04.00 20 Aug 2015 08:26 minor feature:
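
Example: command-line flags across releases

The release notes above record three command-line changes: 3.05.00 renamed '-psm' to '--psm' and made the OCR engine mode selectable from the command line, and 4.0.0-rc1 added the '-c jpg_quality=n' parameter. The following Python sketch shows one way a wrapper script might pick the right spelling for a given release. It is an illustration only: parse_version and build_command are hypothetical helper names, the '--oem' spelling is not given in the notes and is assumed here, and the function only builds an argument list without running anything.

```python
from typing import List, Optional, Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from strings such as '3.04.01' or '4.0.0-rc4'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def build_command(
    image: str,
    output_base: str,
    version: str,
    psm: Optional[int] = None,
    oem: Optional[int] = None,
    jpg_quality: Optional[int] = None,
) -> List[str]:
    """Build a tesseract argument list for the given release (hypothetical helper)."""
    cmd = ["tesseract", image, output_base]
    # 3.05.00 changed '-psm' to '--psm' (per the release notes above).
    new_style = parse_version(version) >= (3, 5)
    if psm is not None:
        cmd += ["--psm" if new_style else "-psm", str(psm)]
    if oem is not None:
        # Engine-mode selection arrived in 3.05.00; '--oem' is an assumed spelling.
        if not new_style:
            raise ValueError("engine mode selection needs 3.05.00 or newer")
        cmd += ["--oem", str(oem)]
    if jpg_quality is not None:
        # Added in 4.0.0-rc1; this sketch does not check the release for it.
        cmd += ["-c", f"jpg_quality={jpg_quality}"]
    return cmd


if __name__ == "__main__":
    print(build_command("page.png", "out", "3.04.01", psm=6))
    print(build_command("page.png", "out", "4.0.0", psm=6, oem=1, jpg_quality=85))
```

The returned list can be handed to subprocess.run by a caller that has tesseract installed; keeping the construction separate from execution makes the version logic easy to check on its own.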