[GH-ISSUE #6550] tesseract: ocrmypdf fails due to private-tmp #3301

Closed
opened 2026-05-05 09:54:01 -06:00 by gitea-mirror · 3 comments

Originally created by @kmille on GitHub (Nov 23, 2024).
Original GitHub issue: https://github.com/netblue30/firejail/issues/6550

Hey,

I'm using latest git version. The current tesseract profile breaks ocrmypdf:

kmille@linbox:scans ocrmypdf C.pdf del.pdf
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
    1  Error, could not create hOCR output file: No such file or directory                             tesseract.py:253
    1  Error, could not create TXT output file: No such file or directory                              tesseract.py:253
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/1 -:--:--
An exception occurred while executing the pipeline                                                       _common.py:294
Traceback (most recent call last):                                                                                     
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 259, in                                
cli_exception_handler                                                                                                  
    return fn(options, plugin_manager)                                                                                 
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                 
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 181, in _run_pipeline                      
    optimize_messages = exec_concurrent(context, executor)                                                             
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                             
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 117, in exec_concurrent                    
    executor(                                                                                                          
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__                               
    self._execute(                                                                                                     
  File "/usr/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in                       
_execute                                                                                                               
    result = future.result()                                                                                           
             ^^^^^^^^^^^^^^^                                                                                           
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result                                          
    return self.__get_result()                                                                                         
           ^^^^^^^^^^^^^^^^^^^                                                                                         
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result                                    
    raise self._exception                                                                                              
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run                                             
    result = self.fn(*self.args, **self.kwargs)                                                                        
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                        
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 81, in _exec_page_sync                     
    ocr_out, text_out = _image_to_ocr_text(page_context, ocr_image_out)                                                
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 63, in _image_to_ocr_text                  
    ocr_out = render_hocr_page(hocr_out, page_context)                                                                 
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                 
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 766, in render_hocr_page                        
    if hocr.stat().st_size == 0:                                                                                       
       ^^^^^^^^^^^                                                                                                     
  File "/usr/lib/python3.12/pathlib.py", line 840, in stat                                                             
    return os.stat(self, follow_symlinks=follow_symlinks)                                                              
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                              
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.0od81kk5/000001_ocr_hocr.hocr'             

These are the commands that run in background (this is incomplete!):

2024/11/23 22:13:51 PID=403793 UID=1000  CMD=zsh
2024/11/23 22:13:51 PID=403793 UID=1000  CMD=zsh
2024/11/23 22:13:51 PID=403794 UID=0     CMD=tesseract --version
2024/11/23 22:13:51 PID=403795 UID=1000  CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:51 PID=403799 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:51 PID=403803 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:51 PID=403804 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403806 UID=-1    CMD=??? (error: "read /proc/403806/comm: no such process")
2024/11/23 22:13:52 PID=403808 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403812 UID=0     CMD=3
2024/11/23 22:13:52 PID=403828 UID=0     CMD=tesseract --version
2024/11/23 22:13:52 PID=403829 UID=0     CMD=kworker/3:0
2024/11/23 22:13:52 PID=403830 UID=1000  CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403834 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403836 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403838 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403839 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403841 UID=0     CMD=/run/firejail/lib/fcopy /usr/bin/lstmtraining /run/firejail/mnt/bin
2024/11/23 22:13:52 PID=403843 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403847 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403863 UID=0     CMD=tesseract --version
2024/11/23 22:13:52 PID=403864 UID=0     CMD=kworker/2:0
2024/11/23 22:13:52 PID=403865 UID=1000  CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403869 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403873 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403874 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403875 UID=0     CMD=/run/firejail/lib/fcopy /usr/bin/lstmtraining /run/firejail/mnt/bin
2024/11/23 22:13:52 PID=403877 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --version
2024/11/23 22:13:52 PID=403878 UID=0     CMD=3
2024/11/23 22:13:52 PID=403881 UID=0     CMD=/run/firejail/lib/fcopy /usr/bin/text2image /run/firejail/mnt/bin
2024/11/23 22:13:52 PID=403882 UID=-1    CMD=??? (error: "open /proc/403882/comm: no such file or directory")
2024/11/23 22:13:52 PID=403897 UID=1000  CMD=gs --version
2024/11/23 22:13:52 PID=403898 UID=1000  CMD=gs --version
2024/11/23 22:13:53 PID=403900 UID=0     CMD=tesseract --list-langs
2024/11/23 22:13:53 PID=403901 UID=1000  CMD=/usr/bin/firejail /usr/bin/tesseract --list-langs
2024/11/23 22:13:53 PID=403905 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --list-langs
2024/11/23 22:13:53 PID=403910 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --list-langs
2024/11/23 22:13:53 PID=403911 UID=-1    CMD=??? (error: "open /proc/403911/comm: no such file or directory")
2024/11/23 22:13:53 PID=403913 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --list-langs
2024/11/23 22:13:53 PID=403915 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --list-langs
2024/11/23 22:13:53 PID=403917 UID=0     CMD=/run/firejail/lib/fcopy /usr/bin/text2image /run/firejail/mnt/bin
2024/11/23 22:13:53 PID=403939 UID=1000  CMD=gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -dInterpolateControl=-1 -sDEVICE=png16m -dFirstPage=1 -dLastPage=1 -r299.999998x299.999998 -dPDFSTOPONERROR -o - -sstdout=%stderr -dAutoRotatePages=/None -f /tmp/ocrmypdf.io.0od81kk5/origin.pdf

.... everything works so far - no exception so far

2024/11/23 22:14:03 PID=403953 UID=0     CMD=tesseract -l eng /tmp/ocrmypdf.io.0od81kk5/000001_ocr.png /tmp/ocrmypdf.io.0od81kk5/000001_ocr_hocr hocr txt
2024/11/23 22:14:03 PID=403954 UID=1000  CMD=/usr/bin/firejail /usr/bin/tesseract -l eng /tmp/ocrmypdf.io.0od81kk5/000001_ocr.png /tmp/ocrmypdf.io.0od81kk5/000001_ocr_hocr hocr txt
2024/11/23 22:14:03 PID=403958 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract -l eng /tmp/ocrmypdf.io.0od81kk5/000001_ocr.png /tmp/ocrmypdf.io.0od81kk5/000001_ocr_hocr hocr txt
2024/11/23 22:14:03 PID=403963 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract -l eng /tmp/ocrmypdf.io.0od81kk5/000001_ocr.png /tmp/ocrmypdf.io.0od81kk5/000001_ocr_hocr hocr txt
2024/11/23 22:14:03 PID=403964 UID=0     CMD=/run/firejail/lib/fcopy /usr/bin/lstmtraining /run/firejail/mnt/bin
2024/11/23 22:14:03 PID=403966 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract -l eng /tmp/ocrmypdf.io.0od81kk5/000001_ocr.png /tmp/ocrmypdf.io.0od81kk5/000001_ocr_hocr hocr txt
2024/11/23 22:14:03 PID=403970 UID=0     CMD=/run/firejail/lib/fcopy /usr/bin/text2image /run/firejail/mnt/bin


Fix for me is `ignore private-tmp`. What also worked (basically the same):

export TMPDIR=/home/kmille/tmp
noblacklist /home/kmille/tmp
whitelist /home/kmille/tmp

Unfortunately I can't tell you why this happens.

gitea-mirror closed this issue and added the bug label on 2026-05-05 09:54:01 -06:00.

@kmk3 commented on GitHub (Nov 24, 2024):

Hey,

I'm using latest git version. The current tesseract profile breaks ocrmypdf:

kmille@linbox:scans ocrmypdf C.pdf del.pdf
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
    1  Error, could not create hOCR output file: No such file or directory                             tesseract.py:253
    1  Error, could not create TXT output file: No such file or directory                              tesseract.py:253
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/1 -:--:--
An exception occurred while executing the pipeline                                                       _common.py:294
Traceback (most recent call last):                                                                                     
  File "/usr/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 259, in                                
cli_exception_handler                                                                                                  
    return fn(options, plugin_manager)                                                                                 
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                 
[...]
  File "/usr/lib/python3.12/pathlib.py", line 840, in stat                                                             
    return os.stat(self, follow_symlinks=follow_symlinks)                                                              
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                              
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.0od81kk5/000001_ocr_hocr.hocr'             

These are the commands that run in background (this is incomplete!):

[...]
2024/11/23 22:13:53 PID=403915 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract --list-langs
2024/11/23 22:13:53 PID=403917 UID=0     CMD=/run/firejail/lib/fcopy /usr/bin/text2image /run/firejail/mnt/bin
2024/11/23 22:13:53 PID=403939 UID=1000  CMD=gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -dInterpolateControl=-1 -sDEVICE=png16m -dFirstPage=1 -dLastPage=1 -r299.999998x299.999998 -dPDFSTOPONERROR -o - -sstdout=%stderr -dAutoRotatePages=/None -f /tmp/ocrmypdf.io.0od81kk5/origin.pdf
[...]
2024/11/23 22:14:03 PID=403953 UID=0     CMD=tesseract -l eng /tmp/ocrmypdf.io.0od81kk5/000001_ocr.png /tmp/ocrmypdf.io.0od81kk5/000001_ocr_hocr hocr txt
2024/11/23 22:14:03 PID=403954 UID=1000  CMD=/usr/bin/firejail /usr/bin/tesseract -l eng /tmp/ocrmypdf.io.0od81kk5/000001_ocr.png /tmp/ocrmypdf.io.0od81kk5/000001_ocr_hocr hocr txt
2024/11/23 22:14:03 PID=403958 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract -l eng /tmp/ocrmypdf.io.0od81kk5/000001_ocr.png /tmp/ocrmypdf.io.0od81kk5/000001_ocr_hocr hocr txt
2024/11/23 22:14:03 PID=403963 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract -l eng /tmp/ocrmypdf.io.0od81kk5/000001_ocr.png /tmp/ocrmypdf.io.0od81kk5/000001_ocr_hocr hocr txt
2024/11/23 22:14:03 PID=403964 UID=0     CMD=/run/firejail/lib/fcopy /usr/bin/lstmtraining /run/firejail/mnt/bin
2024/11/23 22:14:03 PID=403966 UID=0     CMD=/usr/bin/firejail /usr/bin/tesseract -l eng /tmp/ocrmypdf.io.0od81kk5/000001_ocr.png /tmp/ocrmypdf.io.0od81kk5/000001_ocr_hocr hocr txt
2024/11/23 22:14:03 PID=403970 UID=0     CMD=/run/firejail/lib/fcopy /usr/bin/text2image /run/firejail/mnt/bin

Nice debugging.

How did you generate these CMD= lines?

Fix for me is `ignore private-tmp`. What also worked (basically the same):

export TMPDIR=/home/kmille/tmp
noblacklist /home/kmille/tmp
whitelist /home/kmille/tmp

Unfortunately I can't tell you why this happens.

Reason:

  1. ocrmypdf starts a firejailed tesseract (which has private-tmp) and tells
    it to create files in /tmp
  2. tesseract creates the files in its own private /tmp
  3. ocrmypdf cannot see the relevant files in the normal /tmp, as they only
    exist in the tesseract sandbox
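
The three steps above can be checked from the caller's side with a small sketch. It is only an illustration, not ocrmypdf's or firejail's code: `missing_outputs` is a hypothetical helper, and the child process here is a plain Python one-liner standing in for the tesseract call. The idea is to run a tool that is told to write to a path, then look for that path from the parent. If the tool ran inside a sandbox with its own private /tmp, the file would be reported as missing even though the tool exited successfully.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def missing_outputs(cmd, expected):
    """Run cmd, then return the expected output paths the caller cannot see."""
    subprocess.run(cmd, check=True)
    return [Path(p) for p in expected if not Path(p).exists()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        out = Path(workdir) / "000001_ocr_hocr.hocr"
        # Stand-in for the tesseract call: a child process that writes the file.
        # This child shares the parent's /tmp, so nothing is reported missing.
        # A child started with a private /tmp would write the file where the
        # parent cannot see it, and the path would show up in the result.
        writer = [
            sys.executable,
            "-c",
            "import sys; open(sys.argv[1], 'w').write('x')",
            str(out),
        ]
        print(missing_outputs(writer, [out]))
```

With the unsandboxed stand-in the list is empty. A non-empty list after a successful exit is the symptom reported in this issue.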

@kmille commented on GitHub (Nov 24, 2024):

You can use pspy (https://github.com/DominicBreuker/pspy) or execsnoop (https://github.com/iovisor/bcc/blob/master/tools/execsnoop.py); see the example command output (https://github.com/iovisor/bcc/blob/master/tools/execsnoop_example.txt). Thanks for fixing! Feel free to close this.
Btw: When do you release 0.9.74? 0.9.72 was released in May 2024.
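
For readers without either tool installed, a rough sketch of the general idea (polling /proc for new PIDs) might look like the following. This is not pspy's or execsnoop's implementation. The function names are made up for illustration, the output format only loosely imitates the `CMD=` lines above, and it assumes a Linux /proc. Very short-lived processes can be missed between polls.

```python
import os
import time
from pathlib import Path


def read_cmdline(pid):
    """Return the command line of pid as a string, or None if it is gone."""
    try:
        raw = Path(f"/proc/{pid}/cmdline").read_bytes()
    except OSError:
        return None  # the process exited between listing and reading
    return raw.replace(b"\0", b" ").decode(errors="replace").strip()


def current_pids():
    """List numeric entries in /proc, i.e. the PIDs that exist right now."""
    return {int(name) for name in os.listdir("/proc") if name.isdigit()}


def watch(seconds=5.0, interval=0.01):
    """Print one line per process that appears while polling /proc."""
    seen = current_pids()
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        now = current_pids()
        for pid in sorted(now - seen):
            try:
                uid = os.stat(f"/proc/{pid}").st_uid
            except OSError:
                continue  # already gone
            # Kernel threads have an empty cmdline, so fall back to "???".
            print(f"PID={pid} UID={uid} CMD={read_cmdline(pid) or '???'}")
        seen = now
        time.sleep(interval)
```

Calling `watch()` in one terminal while running the failing command in another should list the processes that live long enough to be caught by a poll.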

@kmk3 commented on GitHub (Nov 25, 2024):

You can use pspy or execsnoop (command output). Thanks for fixing!

Nice, thanks.

By the way, firejail has --trace=foo.txt, though the output is more verbose.

Feel free to close this.

If the PR is merged, the issue is automatically closed (due to the Fixes
line).

Btw: When do you release 0.9.74? 0.9.72 was released in May 2024.

Not sure.

Reference: github-starred/firejail#3301