OCRmyPDF 9.1.0

OCRmyPDF adds an invisible text layer to PDF documents by passing them through the Tesseract OCR engine. The output is a PDF/A file with a selectable but invisible text layer placed over the scanned page images, which makes the documents searchable and suitable for archiving.
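As a rough illustration of the workflow described above, the following Python sketch assembles an `ocrmypdf` command line and runs it only if the tool is found on the PATH. The helper names `build_ocrmypdf_command` and `run_ocr` and the file names are hypothetical and exist only for this example; the flags used (`-l`, `--deskew`, `--output-type`) are believed to match the OCRmyPDF command line but should be checked against the installed version.

```python
import shutil
import subprocess
from typing import List, Optional


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    language: Optional[str] = None,
    deskew: bool = False,
    pdfa: bool = True,
) -> List[str]:
    """Assemble an argv list for the ocrmypdf CLI (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if language:
        # Tesseract language code, e.g. "eng" or "deu"
        cmd += ["-l", language]
    if deskew:
        # Straighten crooked scans before OCR
        cmd.append("--deskew")
    # PDF/A is the archival output described above; "pdf" keeps a plain PDF
    cmd += ["--output-type", "pdfa" if pdfa else "pdf"]
    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocr(input_pdf: str, output_pdf: str, **options) -> Optional[int]:
    """Run ocrmypdf if it is installed; return its exit code, else None."""
    if shutil.which("ocrmypdf") is None:
        return None  # tool not installed, nothing to run
    cmd = build_ocrmypdf_command(input_pdf, output_pdf, **options)
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Placeholder file names; only prints the command that would be run.
    print(" ".join(build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", language="eng")))
```

Building the argument list separately from running it keeps the sketch usable on machines where OCRmyPDF is not installed and makes the chosen options easy to inspect.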

Tags pdf ocr scanning
License GNU GPLv3
State stable

Recent Releases

9.1.0 (13 Nov 2019 03:15) minor feature: Tesseract: refactor logging. Docs: mention how to suppress progbar. Docs: document optimization. Docs: mention systemd for batches. Report missing optional dependencies as possible cause of file size i... Import and docstring cleanup. Lint warning about missing cur_item. Tesseract: exception when logger is RootLogger. Travis: enable Py 3.8. Docs: installation instructions for pikepdf manylinux2010 wheels. Use pikepdf 1.7.0 to improve Python 3.8 support. v9.1.0 release notes. Test: test_report_file_size. Test: further to test_report_file_size.
9.0.5 (08 Nov 2019 13:45) minor feature: Remove Alpine Docker image. Dockerfile: remove venv from Ubuntu image; tweak reqs. Dockerfile: errors are trying to build unneeded cached wheels. Dockerfile: jbig2 not copied over. "MANIFEST.in exists" by removing MANIFEST.in. Travis: enable Python 3.8 testing. Drop support for unpaper 6.1 on Ubuntu 14.04. Docker: try adding automated test. Docker autotest:, maybe? docs: remove comment about Ubuntu image. Docker: relocate dockerfile. docs: add remark about optimizing without OCR. docker-compose.test does not seem to be ready for production use. Update release notes; disable Py3.8 test again. Support pdfminer.six 20191020.
9.0.4 (04 Nov 2019 06:25) minor feature: pdfa: assume 3 RGB channels always. optimize: work around pikepdf 1.6.3 limitation with indexed ICCbased... black settings in pyproject.toml. py36 test including 37. any False in the ocrmypdf.ocr() API being set to True. Remove test_tesseract_config_invalid from suite. Add contributing guide. docs: intermediate file list for v9. Use at most 3 Tesseract threads. Delinting. Use lstm_use_matrix for --user-words,patterns. Python 3.8 updates. travis: Python 3.8, osx_image. Mention when we default to English and the system locale is not English. Require Pillow 6.2.0 based on security vulnerability report in older... Update release notes. Disable Py3.8 for now. test suite error. Mention that v9.0.4 requires a source install for Py3.8 for now, due...
9.0.2 (05 Sep 2019 03:15) minor feature: Docs: installation updates. Docs: leptonica.com -> .org. Alpine: use jbig2enc@community. Travis: Make 3.7 the build leader/deployer. Reactivate user-words test that was always skipped. Running without eng.traineddata installed raises exception. Install: affirm that we now require Tesseract beta. Attempt to resolve black-inversion. Tests broken by --print-parameters change. Use context managers to ensure Pillow images are closed. Allow test_german to xfail if deu language is not installed. Optimize: Don't reinsert 1bpp images. Optimize: exclude images with custom Decode tables. Optimize: only re-insert pngs after pngquant. Optimize: don't consider 1bpp images for PNG optimization. Remove restriction on pytest 5. Adjust test requirements. Optimize: solve monochrome by converting to G4. --print-parameters when chi_sim is not installed. v9.0.2 release notes.
9.0.1 (12 Aug 2019 06:45) minor feature: Add missing item from v9.0.0 release notes. Travis: remove vestiges of pdfminer being optional on osx. Ensure --image-dpi on non-image produces a warning. Alpine Docker: jbig2enc moved from testing to community. Use pikepdf 1.6.1. Tests: split out stdin/stdout tests. Docs: update install on FreeBSD to point to ports. Travis: Add a minimal Ubuntu config. Tests: mark test as requiring pngquant. Tests: interpretation of None as omitted argument. Travis: make minimal config even more minimal. v9.0.1 release notes.
9.0.0 (28 Jul 2019 07:25) minor feature: refactor: split argparse and run_pipline. import in unpaper test. refactor: move ruffus related code to one file. feat: move to sync (none ETL) implementation (WIP). feat: move to sync (none ETL) implementation. feat: move to sync (none ETL) implementation - remove ruffus. fix: most of the tests (37 failed, 133 passed, 28 skipped). feat: add concurrent.futures pipeline. feat: add tqdm progress bar. feat: add triage step. fix: remove ruffus. fix: update pytest version. fix: tests. fix: typo. --redo-ocr. Merge master into api branch; all test pass. warnings. Remove custom logger. More fixes to logging and disabled tests. Reinstate log level in messages to be closer to old behavior. Make logging format consistent with v8.3.0. Move app specific settings a library may not want to __main__. logging: don't pass log object to validation. Additional logging fixes; silence extremely verbose pdfminer logging. Refactor weave_layers, introduce progress bar. Replace ProcessPoolExecutor with multiprocessing.Pool. Cleanup ghostscript error output. pylint removal. Mark some slow tests. validation: eliminate print(). Update test cache. Make re_symlink() not require a log object. Fixing threading._RLock exception on Python 3.6. Remove some now-unused code; etc. extra blank lines in output messages in Python 3.6. test invalidated by Python 3.6 logging. docs: Remove discussion of ruffus. Explain picklable logger. Refactor cli into basic high level api. Refactor configure_logging. Remove sys.exit() calls so we don't terminate caller application. Refactor validation and exceptions. api: progress_bar_friendly=False. Add progress bar to optimize and add option to disable it. release notes: clarify. Improve argparse behavior for its role in making the API work. api: short-circuit exception handler, as caller should provide their own. Convert one test to use API. distinction between clea
8.3.0 (14 May 2019 19:25) minor feature: Add bash completion. Docs: mention completions. Rename bash completions file. Ghostscript: remove unnecessary post-render resizing step. Weave: corruption of certain high page count files. Weave: use emplacement method, scrap TOC repair. Don't use MagicMock() as a dummy logger in pytest. v8.3.0 release notes in progress. Require pikepdf 1.3.0. Ghostscript: rendering threads has no effect on pdfwrite, so remove it. Weave: add new test for link consistency. v8.3.0 notes: clarify. Move completions to better location/Homebrew compat.
8.2.4 (24 Apr 2019 22:45) minor feature: Remove safety traversal of PDF table of contents. Remove PyCharm debugger hack. Ignore pip-wheel-metadata folder. Explicitly close most pikepdf.Pdf when done with them. Pdfinfo: be more specific about detecting XFA we can't render. Weave: use explicit pdf.close(), drastically reduce open file handles. v8.2.4 notes. Test.txt. Main.txt.
8.2.3 (04 Apr 2019 03:17) minor feature: Update batch.rst. Docs: explain Automator workflow. Docs: use images folder. Docs: broken sphinx ref. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Leptonica: junkpixt harder. Readme: tweaks. Better help text for --verbose. LeptonicaErrorTrap when a sys.stderr.fileno() is not available. v8.2.3 notes.
8.2.2 (09 Mar 2019 03:15) minor feature: Exception while attempting to print error message for missing pro... Convert most uses of subprocess.Popen to subprocess.run in test suite. Main: redundant argument test. Main: version testing unnecessarily throwing exception to itself. Test suite so --clean is not requested when unpaper is not installed. Some test failures missed in prev commit. v8.2.1 notes. Further to external program version testing.
8.2.0 (05 Mar 2019 03:15) minor feature: optimize: Modernize pikepdf usage. README: install other language packs on macOS. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Declare build system in pyproject.toml. tests for move to Alpine dockerfile. docs: avoid importing ocrmypdf. Add version to build-system declaration. Add Dockerfile based on alpine:3.9. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Docs: reorganize for new docker-alpine image. Update requirements. Update test cache for french->german change. Move install-time external program checks out of setup.py. optimize: recoding of PNGs. optimize: on aggressive settings try JPG to PNG transcoding. optimize: update comments. optimize: all JBIG2 images binned on last page. docs: minor. Remove debug message. v8.2.0 release notes. Predictor name and photometric flip. optimize: Disable jpg->png migration. Merge branch 'feature/optimization-'. v8.2.0 release notes: optimizer. optimize: use Decode to invert 1bpp PNGs for now.
8.1.0 (11 Feb 2019 06:05) minor feature: Docs: Clarify ArchLinux edition is in AUR. Add --unpaper-args. Docs: remove reference to --skip-repair since the argument was removed. Adjust the docker pull command for webservice. Unpaper-args: add test case and harden feature. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Merge 'feature/unpaper-args'. Docs: --unpaper-args. --clean-final implies --clean. Fuzz. Webservice: add an optional config and larger upload limit. Be os.nice()-r. If --tesseract-timeout 0, say nothing when we time out. Docs: more unpaper details. Activate black precommit. Exception on traversing corrupt ToC entries. When weave handoff occurs with no OCR font present. v8.1.0 release notes.
8.0.1 (18 Jan 2019 03:17) minor feature: Ensure XObjects with no subtype don't cause an exception. Docs: Explain intermediate files. Docs: Update some install procedures for v8 changes. v8.0.1 notes.
8.0.0 (07 Jan 2019 06:25) minor feature: docs: Ghostscript PDF/A XMP metadata loss; ocrmypdf-webservice. New template. Readme: more media. docs: try to readthedocs. Travis: remove Brewfile. v7.4.1 release notes. leptonica.py: exception on certain types of barcode failures. Detect when metadata is dropped during PDF/A conversion. Drop support for Python 3.5. Drop support for Tesseract 3. Generate test cache. Reformat with black. Sort imports with isort. Remove always-false Tess v3 tests. pdfa: remove a pile of deprecated code. Make pdfminer.six optional. travis: Convert to f-strings where it makes sense. pikepdf: version bump. Delinting. Add fish completions. use pikepdf 0.10.2. Prevent Ghostscript from generating invalid XMP metadata. Bump pikepdf version, point to release notes. v8.0.0 release notes.
7.4.0 (16 Dec 2018 13:45) minor feature: leptonica: delete file junkpixt.png if created. Support using --force-ocr and --threshold or --mask-barcodes together. comment in layout.py. Add webapp stuff. webapp docker: Build from polyglot. Rename webapp to webservice. pdfa: replace PDF/A checking with pikepdf implementation. Deprecate encode/decode_pdf_date and remap to pikepdf version. Remove more libxmp dependencies. pdfinfo: FutureWarning. setup: suppress XMLParser() warning - defusedxml related. Replace Ghostscript DOCINFO and.25 metadata date regression. regression on Ghostscript path. Refactor pipeline to make PDF/A conversion a separate step. Don't open encrypted files, even if password is empty. Merge branches 'feature/newer-pike' and 'feature/webapp'. Update webservice.py with separate license. Rename to polyglot.dockerfile. Require pikepdf 0.9.0. pikepdf 0.9.0. reqs/main.txt for pikepdf 0.9.0. Require pikepdf 0.9.1. pdfinfo: tolerate PDFs that overflow and underflow the graphics stack. v7.4.0 release notes.
7.3.1 (17 Nov 2018 15:45) minor feature: Docs build. Add ReadTheDocs yml so we can build with Py3.6. Detailed page analysis enabled at wrong time. Name2unicode ignoring certain markers. 'del draw' exception. Erasure of undetectable barcodes. Leptonica: make threshold functions more flexible. Pdfminer: detect TrueType fonts with no valid encoding information. More argument checking. Test case: true type font without Unicode mapping. Add test case for Type3 fonts with no Unicode mapping. Barcodes error handling. Unsupported operand Decimal, float. v7.3.1 release notes.
7.3.0 (13 Nov 2018 22:05) minor feature: optimize: error in Py3.5. Create debug envvar to override Creator or Producer. Adjust for pikepdf API change. Use Ghostscript for text region detection. Remove other references to PyMuPDF. Remove obsolete _naive_find_text. Remove fitz from Travis. Ghostscript, PDF/A: support pathlib. PEP8 docstring convention misuse in a few places. Rename _optimize to optimize.py. Remove helpers.universal_open(). Replace several uses of str(path) with fspath(path). Remove special of TypeError from ruffus. Remove qpdf.merge. several pylint errors and warnings. Cleanup unused imports. tesseract.get_orientation: removed unused language parameter. pipeline: search_window variable not actually used. Cleanup some cases where log was lazy and should be. Trailing whitespace. leptonica: variables defined on class outside __init__. pdfa: function using closure when it shouldn't. Disable a pylint. Reactivate two tests that weren't using their fixtures properly. Regenerate test cache. recent versions of tesseract not registering as textonly_pdf. Ignore whether or not textonly_pdf was used in cache. optimize: use new pikepdf api for objgen. optimize: skip incremental images if any. Use newer pikepdf API for objgen. Merge branch 'test/ignore-masks'. Add Python 3.7 support. leptonica remove_colormap was replaced with a no-op at some point. Replace all Pix.read with Pix.open. Compress test images more heavily. test resources naming inconsistency. optimize: PNGs that were reduced to 1-bit being inverted. Add test case to ensure mono is not inverted. Optimize some of our bigger test files. Update test cache with naming rule change. Hopefully workaround Py3.5 marshal error. installation for Python 3.7. Improve release notes. Make jpeg/png quality tunable args. Update macOS Brewfile. Upgrade to Py3.7 locally and resolve a few. Don't use --optimize in test since jbig2enc is no
6.2.5 (30 Oct 2018 21:45) minor feature: Cherrypick Ghostscript 9.25 DOCINFO from 7.x. Ghostscript: in strict ASCII implementation. Ghostscript: disable JPEG passthrough for ocrmypdf v6.x. Backport blacklist of Ghostscript 9.24. v6.2.4 release notes. Disable failing test for tess 4.0rc1. Remove macOS from testing entirely. Drop support for PyMuPDF. v6.2.5 notes. travis.yml.
7.2.1 (14 Oct 2018 01:05) minor feature: Cleanup MANIFEST.in, reorg requirements/*.txt, non-Unicode readme. Include Debian copyright file. Remove cruft to support leptonica 1.72 in test suite. compatibility with pikepdf 0.3.5 API change. v7.2.1 release notes. filename test.txt.
7.2.0 (07 Oct 2018 07:25) minor feature: Remove some unhelpful lambdas. Weave: clarify comment about garbage data in ToC. Optimize: Disable JBIG2 lossy mode, use lossless instead. Optimize: Refactor convert_to_jbig2. Optimize: only enable lossy JBIG2 for -O3. Test: this error message changed case in newer Tesseract. Test: send stderr to stderr, why don't we? Degrade more gracefully when --optimize is set but JBIG2 is not present. Test: pytest warning about direct use of a fixture. Tesseract: account for behavior changes when params are missing. Remove libtiff from Brewfile. Suppression of tesseract config error messages. Lossless JBIG2 when there are multiple JBIG2 images on a single page. Refactor the detailed error messages. Change JBIG2 lossy mode to require --jbig2-lossy. v7.2.0 release notes. Requirements: request pikepdf 0.3.4. ...and document lossy JBIG2. Travis: use newer macos image. Optimize: Reorganize so JBIG2 can be performed on images reduced to 1bpp. Optimize: refactor image extraction. Optimize: more refactoring. Optimize: Exclude soft masks (SMasks) from optimization. v7.2.0 release notes update.
7.1.0 (23 Sep 2018 15:45) minor feature: Remove the old tesseract pdf_renderer. a comment about Tesseract behavior in certain versions. all with rotations. ghostscript.py not saved in last commit. Tests: confirm OCR layer copied. Split out rotation related tests. Silence debug messages. correction angle used from wrong page. all but one rotation case. rotation hard case. Refactor, remove trigonometry. Document aliasing of tesseract renderer. Handle procset properly. Upgrade PyMuPDF version. Add unconditional (for now) whiteout of text areas. Ignore masks when deciding what color to rasterize at. When deciding if there is a text on a page, ignore the margins. textareas: filter out images. Remove hocr debug renderer (-g). Remove tesseract renderer entirely. Revise parameter validation for output-type, pdf-renderer, lang. Weave: periodically save to prevent indefinite growth of open file list. Since PyMuPDF 1.13.3 corrupts text, pin 1.12.5 and work around it. Refactor textareas to remove duplicate code. Revert "Since PyMuPDF 1.13.3 corrupts text, pin 1.12.5 and work aroun... Add metadata preservation test from stash. DPI mismatch between OCR page and source page. Return to PyMuPDF 1.12.5. PyMuPDF tweaks: don't clean. Weave: Unconditionally rotate and scale the text layer. This solves two... test_main: uses leptonica. Remove tests that exercise obsolete features (tesseract, -g). Make XML metadata test actually work. Merge branch 'master' into develop. Merge optimize. Remove jbig2enc.py. merge error in Leptonica. Add arguments to control optimization. Check jbig2 when optimizing is requested. Update our dependencies. Warn about --user-words not having any effect. Update tests. Update test cache. Don't try to run jbig2 when not available. Travis: Use declarative APT for Tesseract too. Temporarily unbreak without fitz mode. jbig2enc name. Ignore masks when deciding what color to rasteriz
6.2.4 (19 Sep 2018 14:45) minor feature: Cherrypick Ghostscript 9.25 DOCINFO from 7.x. Ghostscript: in strict ASCII implementation. Ghostscript: disable JPEG passthrough for ocrmypdf v6.x. Backport blacklist of Ghostscript 9.24. v6.2.4 release notes.
7.0.5 (15 Sep 2018 03:16) minor feature: Docs: hyperlinking of jbig2 page (again) and cleanup release notes. Updating Arch Linux installation. Rst formatting in release notes. Pdfinfo: remove some dead code. Leptonica: update comments. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF: docs. Tests: Migrate metadata tests to pikepdf. Ghostscript: for 9.24 having jpeg passthrough available. Ghostscript: no need to specify ProcessColorModel when ColorConversio... Work around invalid TOC entries. Work around loss of Unicode DOCINFO in Ghostscript 9.24+. Check for and reject Adobe LiveCycle Designer PDFs. Pikepdf version for Travis. v7.0.5 release notes.
7.0.4 (25 Aug 2018 07:05) minor feature: Error in optimize.py on PNGs at -O2. Try setuptools_scm_git_archive again. Docs: mention pikepdf install more clearly. Require pikepdf 0.3.2. v7.0.4 notes.
7.0.3 (14 Aug 2018 03:16) minor feature: Optimize: Use new pikepdf Object.write API. Docs: links to JBIG2 encoder page. Require pikepdf 0.3.1. Remove pikepdf 0.3 compatibility shims since 0.3.1 is now required.
7.0.2 (06 Aug 2018 05:25) minor feature: release notes typos. pipeline: remove unused function. Add intensive (optional) rotation test. ghostscript: never use autorotatepages. pipeline: revise logic of rotations to pages with nonzero /Rotate. Explain pytest --runslow. Update pinned requirements. Travis: use xenial for Python 3.7. Regroup installation page content around platforms. docs: Describe PDF optimization. Draw preview image at full resolution. Notes for v7.0.2. travis.yml syntax.
6.2.3 (02 Aug 2018 06:05) minor feature: Discard alpha channel when triaging images. Revert previous commit and reject input images with alpha channel.
6.2.2 (15 Jul 2018 03:19) minor feature: Ignore masks when deciding what color to rasterize at. Backport Python 3.7 for ruffus 2.7.0 from ocrmypdf v7.0.0. Cherrypick Python 3.7 documentation updates from v7.0.0. a comment about Tesseract behavior in certain versions. Cherrypick warning about --user-words not having any effect. main: do better parameter validation. Tests: Add ability to disable use of cache. Tests: Speed up a slow test (cherry-picked from v7). Travis: modernize with v7.0.0 updates. problem iterating ruffus exceptions and rotate-pages-threshold pa... ocrmypdf.exec: trap FileNotFoundError too. Skip locale check on Python 3.7. Update release notes for v6.2.2. Travis: v6 build failures. Travis: nevermind xenial, then.
7.0.0 (11 Jul 2018 07:25) minor feature: Ignore masks when deciding what color to rasterize at. Remove gpg. Add wiki link to template. recent versions of tesseract not registering as textonly_pdf. v6.2.1 release notes. Use qpdf 8.0.2 backport, force old pytest-timeout to build. Merge branch 'test/ignore-masks'. : doesn't work when installed in non-Unicode path. path error on Py3.5. Remove dependency on private fork of ruffus, change to official 2.7. Remove ruffus 2.6.3 exception special casing. Update release notes. Update readme. Declare certain APIs public. typo introduced in. Merge branch 'develop' (7.0.0) into master.
6.2.1 (25 Jun 2018 05:25) minor feature: Remove gpg. Add wiki link to template. recent versions of tesseract not registering as textonly_pdf. v6.2.1 release notes. Use qpdf 8.0.2 backport, force old pytest-timeout to build.
7.0.0rc1 (07 Jun 2018 17:45) minor feature: Use python-xmp-toolkit for xmp check. Optimize: use tempdir for cmdline invocation. Suppress some spurious tesseract errors. Optimize: error in Py3.5.
6.2.0 (07 May 2018 16:45) minor feature: Use more standard __version__ rather than PILLOW_VERSION. Add support for PDF/A-3. helpers: missing call to complain(). Don't suppress error message from config_notfound. helpers.py again. Add gpg key to template. test_pageinfo: remove duplicate import. --remove-background error on PDFs with colormapped images. Expand size growth reasons to other arguments that trigger transcoding. Update Dockerfile for Ubuntu 18.04. Add 18.04 update procedure. XMP validation with /CreationDate. Merge branch 'feature/pdfa3'. v6.2.0 Release notes. v6.2.0. failure to prevent use of Ghostscript on /UserUnit files. Trap PDF/A-3 errors on old Ghostscript.
6.1.5 (03 May 2018 22:00) minor feature:
3.0 (14 Sep 2015 17:45) minor feature: bump to v3.0 and move repos. Test case: No longer using JHOVE. Move to my repo: github.com/fritz-hh => jbarlow83.
3.0-rc9 (31 Aug 2015 01:45) minor feature: Throw exception if iccprofiles not found instead of returning None. unpaper: support paletted files by conversion instead of bailing. Use png256 raster device when possible. Prevent running validation on missing file after an exception is thrown. Add test cases for additional image formats. ghostscript: quiet startup on rasterize. Bump version to -rc9.