OCRmyPDF 7.3.0

OCRmyPDF adds an invisible text layer to scanned PDF documents by passing them through the Tesseract OCR engine. The output is a PDF/A file with a selectable but invisible text layer over the scanned page images, which makes the documents searchable and suitable for archiving.
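As a rough illustration of the workflow described above, the following Python sketch assembles and runs an ocrmypdf command line. The helper names build_ocrmypdf_command and run_ocr are hypothetical, as are the file names and default values; the sketch assumes the ocrmypdf executable is on PATH and that the -l, --output-type and --optimize options behave as in the 7.x series.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, language="eng", optimize=1):
    """Assemble an ocrmypdf command line (hypothetical helper, not part of OCRmyPDF)."""
    return [
        "ocrmypdf",
        "-l", language,             # Tesseract language code to use for OCR
        "--output-type", "pdfa",    # request PDF/A output
        "--optimize", str(optimize),
        str(input_pdf),
        str(output_pdf),
    ]


def run_ocr(input_pdf, output_pdf, **options):
    """Run ocrmypdf as a subprocess; raises if the tool is missing or exits non-zero."""
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf executable not found on PATH")
    cmd = build_ocrmypdf_command(input_pdf, output_pdf, **options)
    return subprocess.run(cmd, check=True)
```

Calling run_ocr("scan.pdf", "scan_ocr.pdf", language="deu") would then write a searchable copy of scan.pdf, provided the matching Tesseract language data is installed.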

Tags pdf ocr scanning
License GNU GPLv3
State stable

Recent Releases

7.3.0 13 Nov 2018 22:05 minor feature: optimize: error in Py3.5. Create debug envvar to override Creator or Producer. Adjust for pikepdf API change. Use Ghostscript for text region detection. Remove other references to PyMuPDF. Remove obsolete _naive_find_text. Remove fitz from Travis. Ghostscript, PDF/A: support pathlib. PEP8 docstring convention misuse in a few places. Rename _optimize to optimize.py. Remove helpers.universal_open(). Replace several uses of str(path) with fspath(path). Remove special of TypeError from ruffus. Remove qpdf.merge. several pylint errors and warnings. Cleanup unused imports. tesseract.get_orientation: removed unused language parameter. pipeline: search_window variable not actually used. Cleanup some cases where log was lazy and should be. Trailing whitespace. leptonica: variables defined on class outside __init__. pdfa: function using closure when it shouldn't. Disable a pylint. Reactivate two tests that weren't using their fixtures properly. Regenerate test cache. recent versions of tesseract not registering as textonly_pdf. Ignore whether or not textonly_pdf was used in cache. optimize: use new pikepdf api for objgen. optimize: skip incremental images if any. Use newer pikepdf API for objgen. Merge branch 'test/ignore-masks'. Add Python 3.7 support. leptonica remove_colormap was replaced with a no-op at some point. Replace all Pix.read with Pix.open. Compress test images more heavily. test resources naming inconsistency. optimize: PNGs that were reduced to 1-bit being inverted. Add test case to ensure mono is not inverted. Optimize some of our bigger test files. Update test cache with naming rule change. Hopefully workaround Py3.5 marshal error. installation for Python 3.7. Improve release notes. Make jpeg/png quality tunable args. Update macOS Brewfile. Upgrade to Py3.7 locally and resolve a few. Don't use --optimize in test since jbig2enc is no
6.2.5 30 Oct 2018 21:45 minor feature: Cherrypick Ghostscript 9.25 DOCINFO from 7.x. Ghostscript: in strict ASCII implementation. Ghostscript: disable JPEG passthrough for ocrmypdf v6.x. Backport blacklist of Ghostscript 9.24. v6.2.4 release notes. Disable failing test for tess 4.0rc1. Remove macOS from testing entirely. Drop support for PyMuPDF. v6.2.5 notes. travis.yml.
7.2.1 14 Oct 2018 01:05 minor feature: Cleanup MANIFEST.in, reorg requirements/*.txt, non-Unicode readme. Include Debian copyright file. Remove cruft to support leptonica 1.72 in test suite. compatibility with pikepdf 0.3.5 API change. v7.2.1 release notes. filename test.txt.
7.2.0 07 Oct 2018 07:25 minor feature: Remove some unhelpful lambdas. Weave: clarify comment about garbage data in ToC. Optimize: Disable JBIG2 lossy mode, use lossless instead. Optimize: Refactor convert_to_jbig2. Optimize: only enable lossy JBIG2 for -O3. Test: this error message changed case in newer Tesseract. Test: send stderr to stderr, why don't we? Degrade more gracefully when --optimize is set but JBIG2 is not present. Test: pytest warning about direct use of a fixture. Tesseract: account for behavior changes when params are missing. Remove libtiff from Brewfile. Suppression of tesseract config error messages. Lossless JBIG2 when there are multiple JBIG2 images on a single page. Refactor the detailed error messages. Change JBIG2 lossy mode to require --jbig2-lossy. v7.2.0 release notes. Requirements: request pikepdf 0.3.4. ...and document lossy JBIG2. Travis: use newer macos image. Optimize: Reorganize so JBIG2 can be performed on images reduced to 1bpp. Optimize: refactor image extraction. Optimize: more refactoring. Optimize: Exclude soft masks (SMasks) from optimization. v7.2.0 release notes update.
7.1.0 23 Sep 2018 15:45 minor feature: Remove the old tesseract pdf_renderer. a comment about Tesseract behavior in certain versions. all with rotations. ghostscript.py not saved in last commit. Tests: confirm OCR layer copied. Split out rotation related tests. Silence debug messages. correction angle used from wrong page. all but one rotation case. rotation hard case. Refactor, remove trigonometry. Document aliasing of tesseract renderer. Handle procset properly. Upgrade PyMuPDF version. Add unconditional (for now) whiteout of text areas. Ignore masks when deciding what color to rasterize at. When deciding if there is a text on a page, ignore the margins. textareas: filter out images. Remove hocr debug renderer (-g). Remove tesseract renderer entirely. Revise parameter validation for output-type, pdf-renderer, lang. Weave: periodically save to prevent indefinite growth of open file list. Since PyMuPDF 1.13.3 corrupts text, pin 1.12.5 and work around it. Refactor textareas to remove duplicate code. Revert "Since PyMuPDF 1.13.3 corrupts text, pin 1.12.5 and work aroun... Add metadata preservation test from stash. DPI mismatch between OCR page and source page. Return to PyMuPDF 1.12.5. PyMuPDF tweaks: don't clean. Weave: Unconditionally rotate and scale the text layer. This solves two... test_main: uses leptonica. Remove tests that exercise obsolete features (tesseract, -g). Make XML metadata test actually work. Merge branch 'master' into develop. Merge optimize. Remove jbig2enc.py. merge error in Leptonica. Add arguments to control optimization. Check jbig2 when optimizing is requested. Update our dependencies. Warn about --user-words not having any effect. Update tests. Update test cache. Don't try to run jbig2 when not available. Travis: Use declarative APT for Tesseract too. Temporarily unbreak without fitz mode. jbig2enc name. Ignore masks when deciding what color to rasteriz
6.2.4 19 Sep 2018 14:45 minor feature: Cherrypick Ghostscript 9.25 DOCINFO from 7.x. Ghostscript: in strict ASCII implementation. Ghostscript: disable JPEG passthrough for ocrmypdf v6.x. Backport blacklist of Ghostscript 9.24. v6.2.4 release notes.
7.0.5 15 Sep 2018 03:16 minor feature: Docs: hyperlinking of jbig2 page (again) and cleanup release notes. Updating Arch Linux installation. Rst formatting in release notes. Pdfinfo: remove some dead code. Leptonica: update comments. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF: docs. Tests: Migrate metadata tests to pikepdf. Ghostscript: for 9.24 having jpeg passthrough available. Ghostscript: no need to specify ProcessColorModel when ColorConversio... Work around invalid TOC entries. Work around loss of Unicode DOCINFO in Ghostscript 9.24+. Check for and reject Adobe LiveCycle Designer PDFs. Pikepdf version for Travis. v7.0.5 release notes.
7.0.4 25 Aug 2018 07:05 minor feature: Error in optimize.py on PNGs at -O2. Try setuptools_scm_git_archive again. Docs: mention pikepdf install more clearly. Require pikepdf 0.3.2. v7.0.4 notes.
7.0.3 14 Aug 2018 03:16 minor feature: Optimize: Use new pikepdf Object.write API. Docs: links to JBIG2 encoder page. Require pikepdf 0.3.1. Remove pikepdf 0.3 compatibility shims since 0.3.1 is now required.
7.0.2 06 Aug 2018 05:25 minor feature: release notes typos. pipeline: remove unused function. Add intensive (optional) rotation test. ghostscript: never use autorotatepages. pipeline: revise logic of rotations to pages with nonzero /Rotate. Explain pytest --runslow. Update pinned requirements. Travis: use xenial for Python 3.7. Regroup installation page content around platforms. docs: Describe PDF optimization. Draw preview image at full resolution. Notes for v7.0.2. travis.yml syntax.
6.2.3 02 Aug 2018 06:05 minor feature: Discard alpha channel when triaging images. Revert previous commit and reject input images with alpha channel.
6.2.2 15 Jul 2018 03:19 minor feature: Ignore masks when deciding what color to rasterize at. Backport Python 3.7 for ruffus 2.7.0 from ocrmypdf v7.0.0. Cherrypick Python 3.7 documentation updates from v7.0.0. a comment about Tesseract behavior in certain versions. Cherrypick warning about --user-words not having any effect. main: do better parameter validation. Tests: Add ability to disable use of cache. Tests: Speed up a slow test (cherry-picked from v7). Travis: modernize with v7.0.0 updates. problem iterating ruffus exceptions and rotate-pages-threshold pa... ocrmypdf.exec: trap FileNotFoundError too. Skip locale check on Python 3.7. Update release notes for v6.2.2. Travis: v6 build failures. Travis: nevermind xenial, then.
7.0.0 11 Jul 2018 07:25 minor feature: Ignore masks when deciding what color to rasterize at. Remove gpg. Add wiki link to template. recent versions of tesseract not registering as textonly_pdf. v6.2.1 release notes. Use qpdf 8.0.2 backport, force old pytest-timeout to build. Merge branch 'test/ignore-masks'. : doesn't work when installed in non-Unicode path. path error on Py3.5. Remove dependency on private fork of ruffus, change to official 2.7. Remove ruffus 2.6.3 exception special casing. Update release notes. Update readme. Declare certain APIs public. typo introduced in. Merge branch 'develop' (7.0.0) into master.
6.2.1 25 Jun 2018 05:25 minor feature: Remove gpg. Add wiki link to template. recent versions of tesseract not registering as textonly_pdf. v6.2.1 release notes. Use qpdf 8.0.2 backport, force old pytest-timeout to build.
7.0.0rc1 07 Jun 2018 17:45 minor feature: Use python-xmp-toolkit for xmp check. Optimize: use tempdir for cmdline invocation. Suppress some spurious tesseract errors. Optimize: error in Py3.5.
6.2.0 07 May 2018 16:45 minor feature: Use more standard __version__ rather than PILLOW_VERSION. Add support for PDF/A-3. helpers: missing call to complain(). Don't suppress error message from config_notfound. helpers.py again. Add gpg key to template. test_pageinfo: remove duplicate import. --remove-background error on PDFs with colormapped images. Expand size growth reasons to other arguments that trigger transcoding. Update Dockerfile for Ubuntu 18.04. Add 18.04 update procedure. XMP validation with /CreationDate. Merge branch 'feature/pdfa3'. v6.2.0 Release notes. v6.2.0. failure to prevent use of Ghostscript on /UserUnit files. Trap PDF/A-3 errors on old Ghostscript.
6.1.5 03 May 2018 22:00 minor feature:
3.0 14 Sep 2015 17:45 minor feature: bump to v3.0 and move repos. Test case: No longer using JHOVE. Move to my repo: github.com/fritz-hh = jbarlow83.
3.0-rc9 31 Aug 2015 01:45 minor feature: Throw exception if iccprofiles not found instead of returning None. unpaper: support paletted files by conversion instead of bailing. Use png256 raster device when possible. Prevent running validation on missing file after an exception is thrown. Add test cases for additional image formats. ghostscript: quiet startup on rasterize. Bump version to -rc9.