The paperScanner Python command
Introduction
paperScanner is a Python command that extracts the body text from a photograph of a printed page, whether the picture is of good quality or not.
Run the command with the help option, python paperScanner.py --help
for more information.
Installation
Download the source code from GitHub: https://github.com/hiergaut/opencv/blob/master/paperScanner.py
You also need the OpenCV library,
plus two additional libraries, PIL (Pillow) and pytesseract, for character recognition.
pip install Pillow
pip install pytesseract
Usage
python paperScanner.py --read <FILENAME>
FILENAME is the picture file from which you want to recover the text.
Explanation of the source program
We start with a picture of a page of text like this one,
and we have to retrieve every sentence of that text.
So before applying any image-processing operations, I want to crop the picture to the body text and straighten it.
I need to find the four corners of the page before using the warpPerspective function. To eliminate every color that is not the near-white of the page, I use a histogram.
The histogram has two peaks: the left one is the yellow of the chair, and the other is the color of the page, a yellow that is closer to white than the chair's, since the page is not perfectly white.
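As an illustration of the idea only, and not the code used in paperScanner.py, here is a minimal sketch in pure Python. It assumes a single channel with integer values from 0 to 255 (for example a gray-level or hue channel) and reads the "two boundaries" as the lower and upper limits of the page peak. The function names, the min_gap and fraction parameters, and the synthetic data are all hypothetical.

```python
def histogram(values, bins=256):
    """Count how many pixels fall on each level 0..bins-1."""
    hist = [0] * bins
    for v in values:
        hist[v] += 1
    return hist


def find_two_peaks(hist, min_gap=20):
    """Return the indices of the two highest peaks, left one first.

    The second peak is the highest bin lying at least min_gap bins
    away from the first (min_gap is an assumed tuning parameter).
    """
    first = max(range(len(hist)), key=lambda i: hist[i])
    far = [i for i in range(len(hist)) if abs(i - first) >= min_gap]
    second = max(far, key=lambda i: hist[i])
    return tuple(sorted((first, second)))


def peak_bounds(hist, peak, fraction=0.05):
    """Walk outward from a peak while the count stays above
    fraction * peak height; return the (lower, upper) bin indices."""
    threshold = hist[peak] * fraction
    lower = peak
    while lower > 0 and hist[lower - 1] >= threshold:
        lower -= 1
    upper = peak
    while upper < len(hist) - 1 and hist[upper + 1] >= threshold:
        upper += 1
    return lower, upper


def synthetic_pixels():
    """Made-up data: a small 'chair' peak at 80, a larger 'page' peak at 190."""
    pixels = []
    for v in range(256):
        chair = max(0, 50 - 5 * abs(v - 80))
        page = max(0, 100 - 10 * abs(v - 190))
        pixels.extend([v] * (chair + page))
    return pixels


hist = histogram(synthetic_pixels())
chair_peak, page_peak = find_two_peaks(hist)
lower, upper = peak_bounds(hist, page_peak)
print(chair_peak, page_peak, lower, upper)  # 80 190 181 199
```

On a real photograph the histogram is noisy, so smoothing it before looking for the peaks is usually worthwhile. The resulting bounds can then be used to keep only the pixels whose value lies between them.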
I find the two boundaries with this code: