OCRmyPDF 16.5.0

OCRmyPDF adds an invisible text layer to PDF documents by running them through the Tesseract OCR engine. The output is a PDF/A file with a selectable but invisible text layer on top of the scanned page images, which makes the document searchable and suitable for archiving.
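As a rough illustration of a typical invocation, the sketch below assembles a command line for the ocrmypdf tool. The helper build_ocrmypdf_command and the file names are hypothetical and exist only for this example; the flags used (-l, --output-type, --sidecar) are ones the command-line tool provides, but this is a minimal sketch rather than a recommended configuration.

```python
from pathlib import Path


def build_ocrmypdf_command(input_pdf, output_pdf, languages=("eng",), sidecar=None):
    """Assemble an argument list for the ocrmypdf command-line tool.

    build_ocrmypdf_command is a hypothetical helper, not part of OCRmyPDF.
    """
    cmd = ["ocrmypdf", "--output-type", "pdfa"]
    if languages:
        # Several Tesseract languages are joined with "+", e.g. "eng+deu".
        cmd += ["-l", "+".join(languages)]
    if sidecar is not None:
        # Optionally write the recognized text to a separate text file.
        cmd += ["--sidecar", str(sidecar)]
    cmd += [str(Path(input_pdf)), str(Path(output_pdf))]
    return cmd


# Hypothetical file names, for illustration only.
command = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", languages=("eng", "deu"))
print(" ".join(command))
```

The resulting list could be passed to subprocess.run(command, check=True) on a system where ocrmypdf is installed.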

Tags pdf ocr scanning
License GNU GPLv3
State initial

Recent Releases

16.5.0 22 Nov 2024 19:05 major bugfix: Fixed an issue with interpreting PDFs that have images with array masks. #1377. Enabled testing on Python 3.13. Fixed a test that did not work correctly but still passed. #1382. Improved the "PDF/A conversion failed" warning message to better describe implications. Updated documentation to better explain OCR_JSON_SETTINGS in batch processing. Build backend changed from setuptools to hatchling.
16.4.3 02 Sep 2024 03:45 minor bugfix: Worked around a pdfminer.six issue where a token on the buffer boundary is incorrectly parsed as two tokens. #1361. New rules are applied to stencil masks and explicit masks when calculating the optimal page DPI for rendering. #1362. Fixed attempts to use an incompatible jbig2.EXE provided by TeX Live. #1363.
16.4.2 09 Aug 2024 14:45 minor bugfix: Fixed order of filenames passed to Ghostscript for PDF/A generation. #1359. Suppressed missing jbig2dec warning message. #1358. Fixed calculation of image size when soft mask dimensions don't match image dimensions. #1351. Several fixes to documentation. Thanks to users Iris and JoKalliauer, who contributed these changes. Fixed error on processing PDFs that are missing certain image metadata. #1315.
16.4.1 01 Jul 2024 20:45 minor bugfix: Fixed calculation of image printed area (used in finding weighted DPI for OCR). #1334. Fixed "NotImplementedError: not sure how to get colorspace" error messages in logs, which simply record a failure to optimize images with print production colorspaces. #1315.
16.4.0 02 Jun 2024 22:25 major bugfix: Selecting the osd and equ pseudo-languages with -l/--language now exits with an error when using Tesseract OCR, because these are not regular Tesseract languages but implementation details. Using them can cause Tesseract to crash. The hOCR renderer is more tolerant of extra whitespace in input files. watcher.py now changes the output file extension to .pdf when the input is not .pdf. Improved handling of PDFs that contain circularly referenced Form XObjects. #1321. Fixed the Alpine Docker image for ARM64, which was not building correctly. Docker images now use pikepdf 9.0.0.
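The language check described in 16.4.0 could look roughly like the following. This is a minimal sketch, not OCRmyPDF's actual code: the function validate_languages and the constant PSEUDO_LANGUAGES are hypothetical names, and raising ValueError stands in for however the real tool reports the error.

```python
# Pseudo-languages named in the release note; not regular Tesseract languages.
PSEUDO_LANGUAGES = {"osd", "equ"}


def validate_languages(spec):
    """Split a "-l"-style spec such as "eng+deu" and reject pseudo-languages.

    Hypothetical helper for illustration only.
    """
    languages = [part for part in spec.split("+") if part]
    rejected = sorted(PSEUDO_LANGUAGES.intersection(languages))
    if rejected:
        raise ValueError(
            "not regular Tesseract languages: " + ", ".join(rejected)
        )
    return languages


print(validate_languages("eng+deu"))
```

Rejecting the request up front is cheaper than letting the OCR engine crash partway through a long document.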
16.3.1 24 May 2024 18:05 minor bugfix: Fixed a test suite failure with Ghostscript 10.03.0+. #1316. Fixed an issue with the presentation of the OCR progress bar. #1313.
16.3.0 20 May 2024 11:05 major bugfix: Fixed progress bar not displaying for Ghostscript PDF/A conversion. #1313. Added progress bar for linearization. #1313. If --rotate-pages-threshold is issued without --rotate-pages, we now exit with an error, since the user likely intended to use --rotate-pages. #1309. If Tesseract hOCR gives an invalid line box, print an error message instead of exiting with an error. #1312.
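The dependent-option check from 16.3.0 can be sketched with the standard argparse module. This is an illustration under assumptions, not OCRmyPDF's own parser: parse_rotation_args is a hypothetical function, and the real tool defines many more options and may word its error differently.

```python
import argparse


def parse_rotation_args(argv):
    """Parse two rotation options and reject a threshold given on its own.

    Hypothetical sketch; only the two option names come from the release note.
    """
    parser = argparse.ArgumentParser(prog="rotation-sketch")
    parser.add_argument("--rotate-pages", action="store_true")
    parser.add_argument("--rotate-pages-threshold", type=float, default=None)
    args = parser.parse_args(argv)
    if args.rotate_pages_threshold is not None and not args.rotate_pages:
        # parser.error() prints a usage message and exits with status 2.
        parser.error("--rotate-pages-threshold requires --rotate-pages")
    return args


args = parse_rotation_args(["--rotate-pages", "--rotate-pages-threshold", "5"])
print(args.rotate_pages, args.rotate_pages_threshold)
```

Leaving the threshold default as None is what lets the check tell "not given" apart from an explicit value.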
16.2.0 18 Apr 2024 16:25 major feature: Fixed "NoneType object has no attribute get" error when optimizing certain PDFs. #1293, #1271. Switched formatting from black to ruff. Added support for sending sidecar output to io.BytesIO. Added support for converting HEIF/HEIC images (the native image format of iPhones and some other devices) to PDFs, when the appropriate pi-heif library is installed. This library is marked as a dependency, but maintainers may opt out if needed. We now default to downsampling large images that would exceed Tesseract's internal limits, but only if they would cause processing to fail. Previously, this behavior only occurred if specifically requested on the command line. It can still be configured and disabled. See the Tesseract command line options. Added MacPorts install instructions. Thanks @akierig. Improved logging output when an unexpected error occurs while trying to obtain the version of a third-party program.
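To make the downsampling rule in 16.2.0 concrete, here is a minimal sketch of shrinking an image only when it exceeds a size limit. The function downsample_size is hypothetical, and the limit of 32767 pixels per side is an assumption made for this example; the release note does not state the exact limits or how OCRmyPDF computes the new size.

```python
def downsample_size(width, height, max_dimension=32767):
    """Return (width, height) scaled so neither side exceeds max_dimension.

    Hypothetical helper; max_dimension=32767 is an assumed limit.
    Images already within the limit are returned unchanged.
    """
    largest = max(width, height)
    if largest <= max_dimension:
        return width, height
    # Integer arithmetic keeps the aspect ratio without float rounding surprises.
    new_width = max(1, width * max_dimension // largest)
    new_height = max(1, height * max_dimension // largest)
    return new_width, new_height


print(downsample_size(1000, 2000))
print(downsample_size(65534, 1000))
```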
16.0.3 03 Jan 2024 17:45 minor security: Changed minimum required Ghostscript to 9.54, to support users of RHEL 9 and its derivatives, since that is the latest version available there. Removed warning message about CVE-2023-43115, on the assumption that most distributions have backported the patch by now.
16.0.2 24 Dec 2023 07:25 minor feature: Temporarily changed PDF text renderer back to sandwich by default to address regressions in macOS Preview.
16.0.0 20 Dec 2023 07:45 major feature: Added a new OCR text renderer that combines the best ideas of Tesseract's PDF generator and the older hOCR transformer renderer. The result is a hopefully permanent fix for wordssmushedtogetherwithoutspaces issues in extracted text, better registration/position of text on skewed baselines #1009, fixes to character output when the German Fraktur script is used #1191, and proper rendering of right-to-left languages (Arabic, Hebrew, Persian) #1157. Asian languages may still have excessive word breaks compared to expectations. The new renderer is the default; the old sandwich renderer is still available using --pdf-renderer sandwich; the old hOCR renderer is no more. The ocrmypdf.hocrtransform API has changed substantially. Support for Python 3.9 has been dropped. Python 3.10+ is now required. pikepdf >= 8.8.0 is now required.
15.4.3 17 Nov 2023 17:25 minor bugfix: Fixed deprecation warning in pikepdf older than 8.7.1; pikepdf >= 8.7.1 is now required.
15.4.2 13 Nov 2023 10:25 minor bugfix: We now raise an exception on a certain class of PDFs that likely need an explicit color conversion strategy selected to display correctly for PDF/A conversion. Fixed an error that occurred while trying to write a log message after the debug log handler was removed.
15.4.1 09 Nov 2023 10:25 minor bugfix: Fixed misc/watcher.py regressions: accept --ocr-json-settings as either filename or JSON string, as previously; and argument count mismatch. #1183, #1185. We no longer attempt to set /ProcSet in the PDF output, since this is an obsolete PDF feature. Documentation improvements.
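Accepting a setting "as either filename or JSON string", as 15.4.1 describes for --ocr-json-settings, can be sketched as follows. The function load_json_settings is hypothetical and the order of the two attempts is an assumption; misc/watcher.py may resolve the ambiguity differently.

```python
import json
from pathlib import Path


def load_json_settings(value):
    """Return a settings dict from inline JSON or from a JSON file path.

    Hypothetical helper: tries to parse value as JSON first and falls back
    to treating it as a file name.
    """
    try:
        settings = json.loads(value)
    except json.JSONDecodeError:
        settings = json.loads(Path(value).read_text(encoding="utf-8"))
    if not isinstance(settings, dict):
        raise ValueError("settings must be a JSON object")
    return settings


print(load_json_settings('{"deskew": true}'))
```

Trying the inline form first means a value that is valid JSON is never mistaken for a path, at the cost of not supporting file names that happen to be valid JSON themselves.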
15.4.0 02 Nov 2023 17:45 major feature: Added new experimental APIs to support offline editing of the final text. Specifically, one can now generate hOCR files with OCRmyPDF, edit them with some other tool, and then finalize the PDF. They are experimental and subject to change, including details of how the working folder is used. There is no command line interface. Code reorganization: executors, progress bars, initialization and setup. Fixed test coverage in cases where the coverage tool did not properly trace into threads or subprocesses. This code was still being tested but appeared as not covered. In the test suite, reduced use of subprocesses and other techniques that interfere with coverage measurement. Improved error check for when we appear to be running inside a snap container and files are not available. Plugin specification now properly defines progress bars as a protocol rather than defining them as "tqdm-like". We now default to using forkserver process creation on POSIX platforms rather than fork, since this method is more robust and avoids some issues when threads are present. Fixed an instance where the user's request to --no-use-threads was ignored. Replaced some cryptic test error messages with more helpful ones. Debug messages for how OCRmyPDF picks the colorspace for a page are now more descriptive.
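The switch to forkserver in 15.4.0 corresponds to a start method of Python's standard multiprocessing module. The sketch below shows one way to prefer it where available; choose_start_method and make_context are hypothetical names, and this is not how OCRmyPDF itself selects the method.

```python
import multiprocessing


def choose_start_method(preferred="forkserver"):
    """Return the preferred start method if this platform supports it.

    Falls back to the platform default, which is the first entry returned
    by multiprocessing.get_all_start_methods().
    """
    available = multiprocessing.get_all_start_methods()
    return preferred if preferred in available else available[0]


def make_context():
    """Build a multiprocessing context using the chosen start method."""
    return multiprocessing.get_context(choose_start_method())


print(choose_start_method())
```

A context obtained this way can create pools and processes without changing the start method globally, which matters when the code runs as a library inside another application.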
15.3.1 29 Oct 2023 07:25 minor bugfix: Fixed an issue with logging settings for misc/watcher.py introduced in the previous release. #1180. Updated documentation on Docker performance concerns.
15.3.0 25 Oct 2023 07:50 minor feature: Update misc/watcher.py to improve command line interface using Typer, and support .env specification of environment variables. Improved error messages. Thanks to @mflagg2814 for the PR that prompted this improvement. Improved error message when a file cannot be read because we are running in a snap container.
11.2.0 07 Oct 2020 08:45 minor feature: Document the example plugin. Better type checking on ocrmypdf.ocr(plugins=...). Image optimization discarding image masks and soft masks associat ... V11.1.3 release notes. V11.2.0 release notes.
11.1.2 01 Oct 2020 00:45 minor feature: Docs: Add 'unpaper' optional dependency for Ubuntu 18.04. HOCR: write text in correct order. V11.1.2 release notes. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF.
11.1.1 26 Sep 2020 06:45 minor feature: Release notes typo. Tidy a log message. Tighten unpaper-args validation to exclude. and. pngquant driver: refactor, use streams instead of temporary files. v11.1.1 release notes.
11.1.0 18 Sep 2020 03:15 minor feature: Extend example plugin with example of mono conversion. Expand documentation of filter_page_image. Remove Python 3.7 from build since homebrew removed it. Disable pikepdf mmap. Remove unused function log_page_orientations. Display page numbers in log messages when grafting. Fix page rotation regression. Use img2pdf to create optimized PNG images. load zlib before liblept on windows. v11.1.0 release notes. Merge commit '9a6cd95e5fe2826d40861229aaa0431b76e302e7'. Remove unpaper from macOS build.
11.0.2 09 Sep 2020 03:15 minor feature: Update templates. Reorganize templates. Add "Postprocessing" message as a hint for long Ghostscript runs. metadata fixup: don't try to update original PDF's metadata with docinfo. v11.0.2 release notes.
11.0.1 19 Aug 2020 06:05 minor feature: Setup: blacklist pdfminer.six 20200720. Approve img2pdf 0.4 as it passes tests. Clarify that the GPL-3 portion of pdfa.py was removed. V11.0.1 release notes.
11.0.0 14 Aug 2020 19:45 minor feature: Clarify license status of misc/completion/ files. Change license of misc/watcher.py to MIT. Clarify copyright status of misc/batch.py, synology.py. Copyright cleanup: relicense example_plugin.py. Change license of all GPLv3 files to MPL-2.0. Replace GPLv3-derived PDF/A template with PostScript generator. Don't ask for sample files anymore. docs: mention that Ghostscript PDF/A can swallow hyperlinks. template: Give stronger hints about sample input files. Merge branch 'de-gpl'.
10.3.2 06 Aug 2020 03:15 minor feature: Remove gs.py (spoofers entirely removed) and update copyright. Additional size increase reasons. Document use of mmap. Approve pdfminer.six 20200726. Fix test breakage in validation. v10.3.2 release notes.
10.3.1 28 Jul 2020 08:25 minor feature: Support for older versions of pdfminer.six (boxes_flow error). Approve pdfminer.six 20200720. V10.3.1 release notes.
10.3.0 24 Jul 2020 03:15 minor feature: Docs: edit plugins. debug.log missing pageno handler. Docs: Note usage of OCR_JSON_SETTINGS for watcher. Optimize: incorrect to prevent re-optimizing JBIG2s. Optimize: add type hints. Optimize: improve typing of xref_exts. Update pre-commit settings. Optimize: add typing for Xref, remove fspath()'s. Docs: install notes for ARM64. Docs: explain firstresult hook behavior. Pipelines: Python 3.7/3.8 on macOS. Add locking to Leptonica error trap. Merge branch 'feature/optimize-cleanup'. For Leptonica 1.79+ use leptSetStderrHandler. Update debian/copyright from Debian, with. Pdfinfo: Replace list comp with gen expr'n. Disable test_error_trap for Leptonica 1.79. Merge branch 'feature/leptonica-179'. Docs: plugins update. Enable pikepdf mmap and set up signal handlers. Enable pikepdf mmap in other contexts. V10.3.0 release notes. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF.
10.2.1 09 Jul 2020 03:15 minor feature: Docs: Update Fedora versions. Python 3.9beta is now known to work (Fedora). Docs: promote one liner installs, reorg Windows. Docs: move Windows ahead of FreeBSD. Install: drop Ubuntu 14.04 steps. Install: add Mageia. Pyproject.toml: weird line wrapping? Readme: markdown cleanup. Quality: fixing typing. More mypy errors. Plugin manager: accept Path(plugin). Improve API documentation. TextPositionTracker: set boxes_flow=None. More typing improvements. V10.2.1 release notes.
10.2.0 24 Jun 2020 16:45 minor feature: OMP_THREAD_LIMIT rounded down to 0 in some cases. v10.1.1 release notes. PDF/A acquires title "Untitled" after conversion. Fix deleted path in .coveragerc. Update and sync .dockerignore, .gitignore. Fix some tests that were failing in Docker. Update Docker to Ubuntu 20.04 and jbig2-latest. Bump requirements (mainly for Docker's benefit). jobcontext.PdfContext: remove dead code, add annotations. Decouple plugin manager forking from PdfContext/Pagecontext. Fix missing f-string in log message. Support input/output streams at API level. A few minor typing. Document that API accepts streams now. hocrtransform: remove deprecated element.getchildren(). hocrtransform: refactor xpath manipulations. hocrtransform: refactor colors. hocrtransform: some text not included in output after Tesseract changes. Add test to sanity check our pdf renderers. test_hocrtransform: this test is worth not caching. Update test cache. v10.2.0 release notes. New hocrtransform test isn't platform stable - mark runslow. Spell runslow correctly.
10.1.0 20 Jun 2020 17:45 minor feature: Error message in logging from repeated filtering. Some corrections to release notes. Docs: improve description of plugins. For --clean-final, use same image as --clean if possible. Sync: refactor intermediate image production. Add a lot of type annotations. Sync: refactor preprocess image filtering. Coverage: ignore type checking. Unpaper: use PNG input where possible. V10.1.0 notes.
10.0.1 16 Jun 2020 18:25 minor feature: Error on -l lang1+lang2. V10.0.1 release notes. Tests that failed on other platforms from previous. Test_two_languages: use narrower test.
10.0.0 12 Jun 2020 03:45 minor feature: Refactor Windows executable shims. The Great Logging Refactor. Remove safe_symlink log= warning. Improve logging of subprocess output. Reinstate logging of page numbers. Add colored logs. Suppress loglevel since we have color now. Loosen test language requirements - eng/deu. Improve help text about aborting due to text. pytest picky about list vs tuple. hocrtransform: cleanup/PEP8. Drop support for pdfminer.six 20181108. Refactor xy-pair for resolution to tuple. Refactor 'xyres' into Resolution. First cut at concurrent page scan. Do pikepdf.open() once instead of per worker. Refactor multiprocessing pool. Further refactoring of concurrency concerns. some broken tests. Some wrong with forking worker_pdf, just open it once per page for now. Replace task_initargs with use of partial(). Use once-per-worker pikepdf init. macOS - use spawn for multiprocessing. Remove Ghostscript-based text extraction. Adjust number of workers for concurrent page scanning. setup: remove deprecated message about removal of --force parameter. ghostscript: remove deprecated argument from generate_pdfa. Update release notes with v10 changes. Remove last vestiges of command line usage of qpdf - change to check_pdf. Add warning if problematic --tesseract-pagesegmode is selected. Start pluggy-based plugin system. Move samefile to helpers. Refactor plugin setup to get_plugin_manager. Get pluggy to work with forking workers. Set up filter_ocr_image hook. Allow plugins to add command line arguments. pluginspec: avoid circular reference. Support plugin invocation with API. Rename PDFContext -> PdfContext. Delinting. safe_symlink: remove deprecated params. optimize: convert from executor to progress pool. Change argument from --plugins to --plugin. Support importing plugin by filename. Rename install_cli to add_options. New hook: filter_page_image. Convert many uses o
9.8.2 04 Jun 2020 06:05 minor feature: Improve file size increase warning to account for changes to small files. Add installation instructions for Windows/Cygwin64. Layout: look for text in XObjects too. Test_report_file_size. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Docs: tidy Cygwin install. v9.8.2 release notes.
9.8.1 29 May 2020 03:25 minor feature: Docs: remove reference to brewfile. Docs: update Arch Linux install instructions. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Missing jbig2enc reported as error with -O3 instead of warning. Test files needed! Docs: Note about OCRmyPDF speed. Update email. Shim_paths to account for unexpected files in Program Files gs. Mark pdfminer.six 20200517 as supported. v9.8.1 notes. Test that failed on Windows.
9.8.0 29 Apr 2020 06:25 minor feature: Install: clarify that old ocrmypdf should be removed from Ubuntu 18.04. Watcher: add polling and log level adjustment. Update requirements. Where only first PNG-style image would be optimized. Azure: add certifi, openssl for macOS. Azure: use brew python instead. Remove tesseract_badutf8.py. Don't utf-8 decode tesseract --print-parameters. v9.8.0 release notes.
9.7.2 16 Apr 2020 06:45 minor feature: Language argument not working as list. Docs: Set ownership when using docker image. Isinstance(..,str). Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Support pdfminer.six 20200402. v9.7.2 release notes. Pytest picky about list vs tuple.
9.7.1 11 Apr 2020 03:15 minor feature: Update templates. Docs: warn that Windows users should use an ifmain guard. Docs: warn that AWS Lambda doesn't work. Add a few more type annotations to public APIs. Version checker failing for qpdf 10.0.0. v9.7.1 release notes. Versions with leading v, e.g. v5.0.
9.7.0 31 Mar 2020 11:45 minor feature: Reqs: update pikepdf version. Expand documentation for subprocess.run() from test. Install instructions for Ubuntu 16.04. Consult ICC profile when determining image colorspace. Optimize: consider ICCBased 1 bit for optimization. Watcher: allow all parameters to ocrmypdf.pdf to be passed by JSON. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Info.py: linearize O(n^2) search for use images on a page. Add halftone mask to leptonica. Wrong number of threads to use shown when OMP_THREAD_LIMIT is defined. Validation: blacklist Ghostscript 9.51 too. Docs: Add username to WSL instructions. debug.log not being deleted on Windows (probably). Watcher: JSONDecodeError if OCR_JSON_SETTINGS not set. Tests: add force OCR to a file with text that Ghostscript doesn't see. Tests: workaround for Ghostscript 9.52 txtwrite problem. v9.7.0 release notes.
9.6.1 04 Mar 2020 06:05 minor feature: Simplify metadata for invalid xml in output. docs: typo. docs: archlinux install - yaourt is gone. Docker image includes also French, Portuguese and Spanish. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Readme: Add another heise article. docs: add Docker compose configuration for watchdog. Update installation instructions for FreeBSD. Demonstrate installing the AUR package without a helper. Merge branch 'pigmonkey-aur-manual'. Disable Travis. docs: some mild improvements. docker-compose.yaml file. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. docs: more clarifications. watcher: add self to copyright. docs: Docker syntax to use stdin/stdout properly. docs: extract example files from batch.rst. docs: document --pages. Improve ocrmypdf.bash completions on macOS. docs: docker prefers .yml not .yaml. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. docs: install cleanup. Handle malformed DocumentInfo. Remove potentially non-free file logo.afdesign. v9.6.1 release notes.
9.6.0 11 Feb 2020 09:05 minor feature: Watched folder, new flags, and docs updates. Update logging and env var extensibility. grammar in output message. watcher: some refactoring. Order of events. Wait for file based on pikepdf. typos, add instructions for training data. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Refactor page rotation and re-enable message at info level. docs: simplify/Ubuntu 18.04 install instructions. setup: approve pdfminer.six 20200124. v9.6.0 notes. ifmain - main(). Merge branch 'pr479'. v9.6.0 notes updated. Update reqs.
9.5.0 20 Jan 2020 03:15 minor feature: Refactor metadata_fixup. Add OCR quality measurement API. v9.5.0 release notes. regression: metadata updates not taking effect. v9.5.0 release notes revised.
9.4.0 07 Jan 2020 03:15 minor feature: Tests: add coverage for helpers. Improve error message for unreadable input files. Tests: AcroForm test case did not work correctly. Tests: test TqdmConsole. Tests: remove some obscure things from coverage. Ghostscript: don't delete output_file that will never exist. Test: environment warnings/cleanup. Tests: problems with ghostscript spoofers. Try to set up subprocess coverage better. Rewrite main pool loop. Remove session scope from fixtures. Tests: skip tests not compatible with coverage. Tests: use smaller files for ghostscript. Logging: create a debug log when -k parameter is issued. Logging: incorrect usage: logging.Logger(). Logging: always log process arguments and stderr when at debug. Tesseract: don't explicitly set lstm_use_matrix. Tests: improve tesseract coverage. Eliminate last use of PyPDF2 from test suite. Docs: add note on limitations of sidecar file. Lept: improve lib not found error message. Docs: mention pdfgrep too. Also generate log file in temp folder on verbose mode. Allow pdfminer.six 20200104 and update recommended versions. Don't use debug.log in pytest. Assert that depends on POSIX-y file handling. Skip test that needs chmod when on Windows.
9.3.0 31 Dec 2019 03:15 minor feature: Generally update documentation about available platforms. Look in Program Files for executables and liblept5.dll. Add isort to precommit. Sort imports. azure: only publish code coverage for macOS. exception on parsing Ghostscript error messages. Add improved example demonstrating watched folder functionality. v9.3.0 release notes. azure: homebrew broke something to do with python@2? exec/init: os.get_exec_path() returns list not str. azure: tweak windows script. Windows: Remove Program Files cache from ocrmypdf.exec. docs: obsolete statement to "brew install tesseract-lang". Update completions.
9.2.0 13 Dec 2019 06:45 minor feature: black: don't reformat _leptonica.py. Improve pre-commit checks. Update version of pdfminer.six supported. Make devnull check compatible with Windows. black: don't reformat _leptonica.py. compile_leptonica: move to correct location. Refactor symlink usage to support Windows. leptonica: Use Windows name for DLL. leptonica: Handle API change for pixFindPageForeground. leptonica: don't open files by name; use memory buffers. leptonica: missing Leptonica error message for Windows. difference in Windows error message breaking test_no_languages. TypeError "environment can only contain strings". tests: a few Windows. test_single_page_inline_image - remove temp file. test_metadata: use mmap in a Windows and POSIX compatible way. test: Replace many instances of run_ocrmypdf in subprocess with inline. Enforce str-only environment for Windows since it's more strict. Don't worry about streams on Windows. Move gs tests to test_ghostscript. Remove os_environ() context manager. ghostscript: don't use NamedTemporaryFile. ghostscript: use correct executable name on Windows. ghostscript: use run(check=True) for more consistent error handling. ghostscript: Refactor checking for executable name on Windows. Use _OCRMYPDF_TEST_PATH for testing and .py stubs to simulate symlinks. Don't expect filenames to be replicated on NT. Make test_german more Windows-friendly. ghosttext: mention page number differences. Remove test_bad_utf8. Add Windows install advice. docs: sketch Windows install procedure. DecompressionBomb related errors due to Windows process differences. tests: error message from tesseract change. Tesseract no longer posts an error message if config file not found. Document function of symlink shim. docs: cause about using Windows in production. Possible to loss of log adapter state. Remove Tesseract 4.0 specific check. Merge branch 'windows'. Add typing hints for
9.1.1 19 Nov 2019 03:15 minor feature: Docs: wsl - get-pip.py. Update version of pdfminer.six supported. Docker: use get-pip to install pip. Reference to Alpine apk add. v9.1.1 release notes.
9.1.0 13 Nov 2019 03:15 minor feature: Tesseract: refactor logging. Docs: mention how to suppress progbar. Docs: document optimization. Docs: mention systemd for batches. Report missing optional dependencies as possible cause of file size i ... Import and docstring cleanup. Lint warning about missing cur_item. Tesseract: exception when logger is RootLogger. Travis: enable Py 3.8. Docs: installation instructions for pikepdf manylinux2010 wheels. Use pikepdf 1.7.0 to improve Python 3.8 support. v9.1.0 release notes. Test: test_report_file_size. Test: further to test_report_file_size.
9.0.5 08 Nov 2019 13:45 minor feature: Remove Alpine Docker image. Dockerfile: remove venv from Ubuntu image; tweak reqs. Dockerfile: errors are trying to build unneeded cached wheels. Dockerfile: jbig2 not copied over. "MANIFEST.in exists" by removing MANIFEST.in. Travis: enable Python 3.8 testing. Drop support for unpaper 6.1 on Ubuntu 14.04. Docker: try adding automated test. Docker autotest:, maybe? docs: remove comment about Ubuntu image. Docker: relocate dockerfile. docs: add remark about optimizing without OCR. docker-compose.test does not seem to be ready for production use. Update release notes; disable Py3.8 test again. Support pdfminer.six 20191020.
9.0.4 04 Nov 2019 06:25 minor feature: pdfa: assume 3 RGB channels always. optimize: work around pikepdf 1.6.3 limitation with indexed ICCbased ... black settings in pyproject.toml. py36 test including 37. any False in the ocrmypdf.ocr() API being set to True. Remove test_tesseract_config_invalid from suite. Add contributing guide. docs: intermediate file list for v9. Use at most 3 Tesseract threads. Delinting. Use lstm_use_matrix for --user-words,patterns. Python 3.8 updates. travis: Python 3.8, osx_image. Mention when we default to English and the system locale is not English. Require Pillow 6.2.0 based on security vulnerability report in older ... Update release notes. Disable Py3.8 for now. test suite error. Mention that v9.0.4 requires a source install for Py3.8 for now, due ...
9.0.2 05 Sep 2019 03:15 minor feature: Docs: installation updates. Docs: leptonica.com -> .org. Alpine: use jbig2enc@community. Travis: Make 3.7 the build leader/deployer. Reactivate user-words test that was always skipped. Running without eng.traineddata installed raises exception. Install: affirm that we now require Tesseract beta. Attempt to resolve black-inversion. Tests broken by --print-parameters change. Use context managers to ensure Pillow images are closed. Allow test_german to xfail if deu language is not installed. Optimize: Don't reinsert 1bpp images. Optimize: exclude images with custom Decode tables. Optimize: only re-insert pngs after pngquant. Optimize: don't consider 1bpp images for PNG optimization. Remove restriction on pytest 5. Adjust test requirements. Optimize: solve monochrome by converting to G4. --print-parameters when chi_sim is not installed. v9.0.2 release notes.
9.0.1 12 Aug 2019 06:45 minor feature: Add missing item from v9.0.0 release notes. Travis: remove vestiges of pdfminer being optional on osx. Ensure --image-dpi on non-image produces a warning. Alpine Docker: jbig2enc moved from testing to community. Use pikepdf 1.6.1. Tests: split out stdin/stdout tests. Docs: update install on FreeBSD to point to ports. Travis: Add a minimal Ubuntu config. Tests: mark test as requiring pngquant. Tests: interpretation of None as omitted argument. Travis: make minimal config even more minimal. v9.0.1 release notes.
9.0.0 28 Jul 2019 07:25 minor feature: refactor: split argparse and run_pipline. import in unpaper test. refactor: move ruffus related code to one file. feat: move to sync (none ETL) implementation (WIP). feat: move to sync (none ETL) implementation. feat: move to sync (none ETL) implementation - remove ruffus. fix: most of the tests (37 failed, 133 passed, 28 skipped). feat: add concurrent.futures pipeline. feat: add tqdm progress bar. feat: add triage step. fix: remove ruffus. fix: update pytest version. fix: tests. fix: typo. --redo-ocr. Merge master into api branch; all test pass. warnings. Remove custom logger. More to logging and disabled tests. Reinstate log level in messages to be similar to old behavior. Make logging format consistent with v8.3.0. Move app specific settings a library may not want to __main__. logging: don't pass log object to validation. Additional logging; silence extremely verbose pdfminer logging. Refactor weave_layers, introduce progress bar. Replace ProcessPoolExecutor with multiprocessing.Pool. Cleanup ghostscript error output. pylint removal. Mark some slow tests. validation: eliminate print(). Update test cache. Make re_symlink() not require a log object. Fixing threading._RLock exception on Python 3.6. Remove some now-unused code; etc. extra blank lines in output messages in Python 3.6. test invalidated by Python 3.6 logging. docs: Remove discussion of ruffus. Explain picklable logger. Refactor cli into basic high level api. Refactor configure_logging. Remove sys.exit() calls so we don't terminate caller application. Refactor validation and exceptions. api: progress_bar_friendly=False. Add progress bar to optimize and add option to disable it. release notes: clarify. Improve argparse behavior for its role in making the API work. api: short-circuit exception handler, as caller should provide their own. Convert one test to use API. distinction between clea
8.3.0 14 May 2019 19:25 minor feature: Add bash completion. Docs: mention completions. Rename bash completions file. Ghostscript: remove unnecessary post-render resizing step. Weave: corruption of certain high page count files. Weave: use emplacement method, scrap TOC repair. Don't use MagicMock() as a dummy logger in pytest. v8.3.0 release notes in progress. Require pikepdf 1.3.0. Ghostscript: rendering threads has no effect on pdfwrite, so remove it. Weave: add new test for link consistency. v8.3.0 notes: clarify. Move completions to better location/Homebrew compat.
8.2.4 24 Apr 2019 22:45 minor feature: Remove safety traversal of PDF table of contents. Remove PyCharm debugger hack. Ignore pip-wheel-metadata folder. Explicitly close most pikepdf.Pdf when done with them. Pdfinfo: be more specific about detecting XFA we can't render. Weave: use explicit pdf.close(), drastically reduce open file handles. v8.2.4 notes. Test.txt. Main.txt.
8.2.3 04 Apr 2019 03:17 minor feature: Update batch.rst. Docs: explain Automator workflow. Docs: use images folder. Docs: broken sphinx ref. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Leptonica: junkpixt harder. Readme: tweaks. Better help text for --verbose. LeptonicaErrorTrap when a sys.stderr.fileno() is not available. v8.2.3 notes.
8.2.2 09 Mar 2019 03:15 minor feature: Exception while attempting to print error message for missing pro ... Convert most uses of subprocess.Popen to subprocess.run in test suite. Main: redundant argument test. Main: version testing unnecessarily throwing exception to itself. Test suite so --clean is not requested when unpaper is not installed. Some test failures missed in prev commit. v8.2.1 notes. Further to external program version testing.
8.2.0 05 Mar 2019 03:15 minor feature: optimize: Modernize pikepdf usage. README: install other language packs on macOS. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Declare build system in pyproject.toml. tests for move to Alpine dockerfile. docs: avoid importing ocrmypdf. Add version to build-system declaration. Add Dockerfile based on alpine:3.9. Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. Docs: reorganize for new docker-alpine image. Update requirements. Update test cache for french -> german change. Move install-time external program checks out of setup.py. optimize: recoding of PNGs. optimize: on aggressive settings try JPG to PNG transcoding. optimize: update comments. optimize: all JBIG2 images binned on last page. docs: minor. Remove debug message. v8.2.0 release notes. Predictor name and photometric flip. optimize: Disable jpg -> png migration. Merge branch 'feature/optimization-'. v8.2.0 release notes: optimizer. optimize: use Decode to invert 1bpp PNGs for now.
8.1.0 11 Feb 2019 06:05 minor feature: Docs: Clarify ArchLinux edition is in AUR . Add --unpaper-args. . Docs: remove reference to --skip-repair since the argument was removed. . Adjust the docker pull command for webservice. . Unpaper-args: add test case and harden feature. . Merge branch 'master' of github.com:jbarlow83/OCRmyPDF. . Merge 'feature/unpaper-args'. . Docs: --unpaper-args. . --clean-final implies --clean. . Fuzz. . Webservice: add an optional config and larger upload limit. . Be os.nice()-r. . If --tesseract-timeout 0, say nothing when we time out. . Docs: more unpaper details. . Activate black precommit. . Exception on traversing corrupt ToC entries. . When weave handoff occurs with no OCR font present. . v8.1.0 release notes.
8.0.1 18 Jan 2019 03:17 minor feature: Ensure XObjects with no subtype don't cause an exception . Docs: Explain intermediate files. . Docs: Update some install procedures for v8 changes. . v8.0.1 notes.
8.0.0 07 Jan 2019 06:25 minor feature: docs: Ghostscript PDF/A XMP metadata loss; ocrmypdf-webservice . New template. . Readme: more media. . docs: try to readthedocs. . Travis: remove Brewfile. . v7.4.1 release notes. . leptonica.py: exception on certain types of barcode failures. . Detect when metadata is dropped during PDF/A conversion. . Drop support for Python 3.5. . Drop support for Tesseract 3. . Generate test cache. . Reformat with black. . Sort imports with isort. . Remove always-false Tess v3 tests. . pdfa: remove a pile of deprecated code. . Make pdfminer.six optional. . travis: Convert to f-strings where it makes sense. . pikepdf: version bump. . Delinting. . Add fish completions. . use pikepdf 0.10.2. . Prevent Ghostscript from generating invalid XMP metadata. . Bump pikepdf version, point to release notes. . v8.0.0 release notes.
7.4.0 16 Dec 2018 13:45 minor feature: leptonica: delete file junkpixt.png if created . Support using --force-ocr and --threshold or --mask-barcodes together. . comment in layout.py. . Add webapp stuff. . webapp docker: Build from polyglot. . Rename webapp to webservice. . pdfa: replace PDF/A checking with pikepdf implementation. . Deprecate encode/decode_pdf_date and remap to pikepdf version. . Remove more libxmp dependencies. . pdfinfo: FutureWarning. . setup: suppress XMLParser() warning - defusedxml related. . Replace Ghostscript DOCINFO and.25 metadata date regression. . regression on Ghostscript path. . Refactor pipeline to make PDF/A conversion a separate step. . Don't open encrypted files, even if password is empty. . Merge branches 'feature/newer-pike' and 'feature/webapp'. . Update webservice.py with separate license. . Rename to polyglot.dockerfile. . Require pikepdf 0.9.0. . pikepdf 0.9.0. . reqs/main.txt for pikepdf 0.9.0. . Require pikepdf 0.9.1. . pdfinfo: tolerate PDFs that overflow and underflow the graphics stack. . v7.4.0 release notes.
7.3.1 17 Nov 2018 15:45 minor feature: Docs build . Add ReadTheDocs yml so we can build with Py3.6. . Detailed page analysis enabled at wrong time. . Name2unicode ignoring certain markers. . 'del draw' exception. . Erasure of undetectable barcodes. . Leptonica: make threshold functions more flexible. . Pdfminer: detect TrueType fonts with no valid encoding information. . More argument checking. . Test case: true type font without Unicode mapping. . Add test case for Type3 fonts with no Unicode mapping. . Barcodes error handling. . Unsupported operand Decimal, float. . v7.3.1 release notes.
7.3.0 13 Nov 2018 22:05 minor feature: optimize: error in Py3.5 . Create debug envvar to override Creator or Producer. . Adjust for pikepdf API change. . Use Ghostscript for text region detection. . Remove other references to PyMuPDF. . Remove obsolete _naive_find_text. . Remove fitz from Travis. . Ghostscript, PDF/A: support pathlib. . PEP8 docstring convention misuse in a few places. . Rename _optimize to optimize.py. . Remove helpers.universal_open(). . Replace several uses of str(path) with fspath(path). . Remove special of TypeError from ruffus. . Remove qpdf.merge. . several pylint errors and warnings. . Cleanup unused imports. . tesseract.get_orientation: removed unused language parameter. . pipeline: search_window variable not actually used. . Cleanup some cases where log was lazy and should be. . Trailing whitespace. . leptonica: variables defined on class outside __init__. . pdfa: function using closure when it shouldn't. . Disable a pylint. . Reactivate two tests that weren't using their fixtures properly. . Regenerate test cache. . recent versions of tesseract not registering as textonly_pdf. . Ignore whether or not textonly_pdf was used in cache. . optimize: use new pikepdf api for objgen. . optimize: skip incremental images if any. . Use newer pikepdf API for objgen. . Merge branch 'test/ignore-masks'. . Add Python 3.7 support. . leptonica remove_colormap was replaced with a no-op at some point. . Replace all Pix.read with Pix.open. . Compress test images more heavily. . test resources naming inconsistency. . optimize: PNGs that were reduced to 1-bit being inverted. . Add test case to ensure mono is not inverted. . Optimize some of our bigger test files. . Update test cache with naming rule change. . Hopefully workaround Py3.5 marshal error. . installation for Python 3.7. . Improve release notes. . Make jpeg/png quality tunable args. . Update macOS Brewfile. . Upgrade to Py3.7 locally and resolve a few. Don't use --optimize in test since jbig2enc is no
6.2.5 30 Oct 2018 21:45 minor feature: Cherrypick Ghostscript 9.25 DOCINFO from 7.x . Ghostscript: in strict ASCII implementation. . Ghostscript: disable JPEG passthrough for ocrmypdf v6.x. . Backport blacklist of Ghostscript 9.24. . v6.2.4 release notes. . Disable failing test for tess 4.0rc1. . Remove macOS from testing entirely. . Drop support for PyMuPDF. . v6.2.5 notes. . travis.yml.
7.2.1 14 Oct 2018 01:05 minor feature: Cleanup MANIFEST.in, reorg requirements/*.txt, non-Unicode readme . Include Debian copyright file. . Remove cruft to support leptonica 1.72 in test suite. . compatibility with pikepdf 0.3.5 API change. . v7.2.1 release notes. . filename test.txt.
7.2.0 07 Oct 2018 07:25 minor feature: Remove some unhelpful lambdas . Weave: clarify comment about garbage data in ToC. . Optimize: Disable JBIG2 lossy mode, use lossless instead. . Optimize: Refactor convert_to_jbig2. . Optimize: only enable lossy JBIG2 for -O3. . Test: this error message changed case in newer Tesseract. . Test: send stderr to stderr, why don't we?. . Degrade more gracefully when --optimize is set but JBIG2 is not present. . Test: pytest warning about direct use of a fixture. . Tesseract: account for behavior changes when params are missing. . Remove libtiff from Brewfile. . Suppression of tesseract config error messages. . Lossless JBIG2 when there are multiple JBIG2 images on a single page. . Refactor the detailed error messages. . Change JBIG2 lossy mode to require --jbig2-lossy. . v7.2.0 release notes. . Requirements: request pikepdf 0.3.4. . ...and document lossy JBIG2. . Travis: use newer macos image. . Optimize: Reorganize so JBIG2 can be performed on images reduced to 1bpp. . Optimize: refactor image extraction. . Optimize: more refactoring. . Optimize: Exclude soft masks (SMasks) from optimization. . v7.2.0 release notes update.
7.1.0 23 Sep 2018 15:45 minor feature: Remove the old tesseract pdf_renderer . a comment about Tesseract behavior in certain versions. . all with rotations. . ghostscript.py not saved in last commit. . Tests: confirm OCR layer copied. . Split out rotation related tests. . Silence debug messages. . correction angle used from wrong page. . all but one rotation case. . rotation hard case. . Refactor, remove trigonometry. . Document aliasing of tesseract renderer. . Handle procset properly. . Upgrade PyMuPDF version. . Add unconditional (for now) whiteout of text areas. . Ignore masks when deciding what color to rasterize at. . When deciding if there is a text on a page, ignore the margins. . textareas: filter out images. . Remove hocr debug renderer (-g). . Remove tesseract renderer entirely. . Revise parameter validation for output-type, pdf-renderer, lang. . Weave: periodically save to prevent indefinite growth of open file list. . Since PyMuPDF 1.13.3 corrupts text, pin 1.12.5 and work around it. . Refactor textareas to remove duplicate code. . Revert "Since PyMuPDF 1.13.3 corrupts text, pin 1.12.5 and work aroun . . Add metadata preservation test from stash. . DPI mismatch between OCR page and source page. . Return to PyMuPDF 1.12.5. . PyMuPDF tweaks: don't clean. . Weave: Unconditionally rotate and scale the text layerThis solves two . . test_main: uses leptonica. . Remove tests that exercise obsolete features (tesseract, -g). . Make XML metadata test actually work. . Merge branch 'master' into develop. . Merge optimize. . Remove jbig2enc.py. . merge error in Leptonica. . Add arguments to control optimization. . Check jbig2 when optimizing is requested. . Update our dependencies. . Warn about --user-words not having any effect. . Update tests. . Update test cache. . Don't try to run jbig2 when not available. . Travis: Use declarative APT for Tesseract too. . Temporarily unbreak without fitz mode. . jbig2enc name. . Ignore masks when deciding what color to rasteriz
6.2.4 19 Sep 2018 14:45 minor feature: Cherrypick Ghostscript 9.25 DOCINFO from 7.x . Ghostscript: in strict ASCII implementation. . Ghostscript: disable JPEG passthrough for ocrmypdf v6.x. . Backport blacklist of Ghostscript 9.24. . v6.2.4 release notes.
7.0.5 15 Sep 2018 03:16 minor feature: Docs: hyperlinking of jbig2 page (again) and cleanup release notes . Updating Arch Linux installation. . Rst formatting in release notes. . Pdfinfo: remove some dead code. . Leptonica: update comments. . Merge branch 'master' of github.com:jbarlow83/OCRmyPDF: docs. . Tests: Migrate metadata tests to pikepdf. . Ghostscript: for 9.24 having jpeg passthrough available. . Ghostscript: no need to specify ProcessColorModel when ColorConversio . . Work around invalid TOC entries. . Work around loss of Unicode DOCINFO in Ghostscript 9.24+. . Check for and reject Adobe LiveCycle Designer PDFs. . Pikepdf version for Travis. . v7.0.5 release notes.
7.0.4 25 Aug 2018 07:05 minor feature: Error in optimize.py on PNGs at -O2 . Try setuptools_scm_git_archive again. . Docs: mention pikepdf install more clearly. . Require pikepdf 0.3.2. . v7.0.4 notes.
7.0.3 14 Aug 2018 03:16 minor feature: Optimize: Use new pikepdf Object.write API . Docs: links to JBIG2 encoder page. . Require pikepdf 0.3.1. . Remove pikepdf 0.3 compatibility shims since 0.3.1 is now required.
7.0.2 06 Aug 2018 05:25 minor feature: release notes typos . pipeline: remove unused function. . Add intensive (optional) rotation test. . ghostscript: never use autorotatepages. . pipeline: revise logic of rotations to pages with nonzero /Rotate. . Explain pytest --runslow. . Update pinned requirements. . Travis: use xenial for Python 3.7. . Regroup installation page content around platforms. . docs: Describe PDF optimization. . Draw preview image at full resolution. . Notes for v7.0.2. . travis.yml syntax.
6.2.3 02 Aug 2018 06:05 minor feature: Discard alpha channel when triaging images . Revert previous commit and reject input images with alpha channel.
6.2.2 15 Jul 2018 03:19 minor feature: Ignore masks when deciding what color to rasterize at . Backport Python 3.7 for ruffus 2.7.0 from ocrmypdf v7.0.0. . Cherrypick Python 3.7 documentation updates from v7.0.0. . a comment about Tesseract behavior in certain versions. . Cherrypick warning about --user-words not having any effect. . main: do better parameter validation. . Tests: Add ability to disable use of cache. . Tests: Speed up a slow test (cherry-picked from v7). . Travis: modernize with v7.0.0 updates. . problem iterating ruffus exceptions and rotate-pages-threshold pa . . ocrmypdf.exec: trap FileNotFoundError too. . Skip locale check on Python 3.7. . Update release notes for v6.2.2. . Travis: v6 build failures. . Travis: nevermind xenial, then.
7.0.0 11 Jul 2018 07:25 minor feature: Ignore masks when deciding what color to rasterize at . Remove gpg. . Add wiki link to template. . recent versions of tesseract not registering as textonly_pdf. . v6.2.1 release notes. . Use qpdf 8.0.2 backport, force old pytest-timeout to build. . Merge branch 'test/ignore-masks'. . : doesn't work when installed in non-Unicode path. . path error on Py3.5. . Remove dependency on private fork of ruffus, change to official 2.7. . Remove ruffus 2.6.3 exception special casing. . Update release notes. . Update readme. . Declare certain APIs public. . typo introduced in. . Merge branch 'develop' (7.0.0) into master.
6.2.1 25 Jun 2018 05:25 minor feature: Remove gpg . Add wiki link to template. . recent versions of tesseract not registering as textonly_pdf. . v6.2.1 release notes. . Use qpdf 8.0.2 backport, force old pytest-timeout to build.
7.0.0rc1 07 Jun 2018 17:45 minor feature: Use python-xmp-toolkit for xmp check . Optimize: use tempdir for cmdline invocation. . Suppress some spurious tesseract errors. . Optimize: error in Py3.5.
6.2.0 07 May 2018 16:45 minor feature: Use more standard __version__ rather than PILLOW_VERSION . Add support for PDF/A-3. . helpers: missing call to complain(). . Don't suppress error message from config_notfound. . helpers.py again. . Add gpg key to template. . test_pageinfo: remove duplicate import. . --remove-background error on PDFs with colormapped images. . Expand size growth reasons to other arguments that trigger transcoding. . Update Dockerfile for Ubuntu 18.04. . Add 18.04 update procedure. . XMP validation with /CreationDate. . Merge branch 'feature/pdfa3'. . v6.2.0 Release notes. . v6.2.0. failure to prevent use of Ghostscript on /UserUnit files. . Trap PDF/A-3 errors on old Ghostscript.
6.1.5 03 May 2018 22:00 minor feature:
3.0 14 Sep 2015 17:45 minor feature: bump to v3.0 and move repos. Test case: No longer using JHOVE. Move to my repo: github.com/fritz-hh => jbarlow83.
3.0-rc9 31 Aug 2015 01:45 minor feature: Throw exception if iccprofiles not found instead of returning None. unpaper: support paletted files by conversion instead of bailing. Use png256 raster device when possible. Prevent running validation on missing file after an exception is thrown. Add test cases for additional image formats. ghostscript: quiet startup on rasterize. Bump version to -rc9.