The paperScanner Python command

Introduction

PaperScanner is a Python command that extracts the body text from a picture of a printed page, whether the picture is of good or bad quality.
Run python paperScanner.py --help for more information.

Installation

Download the source code from GitHub: https://github.com/hiergaut/opencv/blob/master/paperScanner.py

You need the OpenCV library,
plus two additional libraries for character recognition: PIL (Pillow) and pytesseract.
pip install Pillow
pip install pytesseract
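
To confirm that the installation worked, a short check like the one below can help. It is only a sketch: it assumes the usual import names (cv2 for OpenCV, PIL for Pillow, pytesseract), and the helper name missing_modules is made up for this example.

```python
import importlib.util

# Import names assumed for the libraries listed above:
# cv2 (OpenCV), PIL (Pillow) and pytesseract.
REQUIRED_MODULES = ("cv2", "PIL", "pytesseract")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Note that pytesseract is a wrapper around the Tesseract OCR engine, which has to be installed separately from the Python package. This check does not cover that part.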

Usage

python paperScanner.py --read <FILENAME>
FILENAME is the picture file from which you want to recover the text.
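
For readers who want to see how such an option can be handled, here is a minimal sketch using the standard argparse module. It is an assumption about the general shape of the command line, not the actual code of paperScanner.py. The names build_parser and main are hypothetical, and the option is assumed to be spelled --read.

```python
import argparse


def build_parser():
    """Build a hypothetical parser; the real paperScanner.py may define its options differently."""
    parser = argparse.ArgumentParser(
        prog="paperScanner.py",
        description="Extract the body text from a picture of a printed page.",
    )
    parser.add_argument(
        "--read",
        metavar="FILENAME",
        required=True,
        help="picture file to extract the text from",
    )
    return parser


def main(argv=None):
    """Parse the arguments (sys.argv when argv is None) and return the picture path."""
    args = build_parser().parse_args(argv)
    print("Picture to process:", args.read)
    return args.read
```

Calling main(["--read", "page.jpg"]) returns the file name that would then be passed on to the image processing steps.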

Explanation of the source program

First, we have a picture of a page of text like this

and we have to retrieve all the sentences of this text.
So before applying image processing operations, I want to crop only the body text and align it.
I need to find the four corners of the page before using the warpPerspective function. To eliminate the colors other than the white of the page, I use a histogram. On the histogram there are two peaks: the left one is the yellow color of the chair, and the other is the color of the page, which looks like a whiter yellow than the chair because the page is not perfectly white. I find the two boundaries with this code