Table des matières
The paperScanner python command
Introduction
PaperScanner is a python command to extract body text of printing character page picture (bad or not).
You can use this command with option python paperScanner.py –help
to have more information.
Installation
Download the source code on github https://github.com/hiergaut/opencv/blob/master/paperScanner.py
Of course you need to have opencv library,
some additional lib : PIL and pytesseracct to recognize character.
pip install Pillow
pip install pytesseract
Usage
python paperScanner.py –read <FILENAME>
FILENAME is your picture file that you want to recover text character.
Explanation of source program
Crop and rotate target paper
Firstly we have a picture of text page like this
and we have to retrieve all sentence of this text.
So before apply image treatment operations, I want to crop only the body text and align it.
I need to find the four corner of page before use warpPerspective function,
to eliminate other color unlike the white page, I use histogram to exclude other colors
On histogram, there are two peak, on left this is the yellow color chair,
and the other is the page color, seem as yellow more white that the precedent,
is not a perfect white page.
I find the two boundary with this code
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) hist = cv2.calcHist([img], [0], None, [256], [0, 256]) # search max value on histogram M = hist[0] i = 0 for j in range(1, 256): cur = hist[j] if cur > M: M = cur i = j if i > 253: prev = hist[i] else: prev = hist[i] + hist[i + 1] + hist[i + 2] # search first grow up on the right for j in range(i + 15, 254): cur = hist[j] + hist[j + 1] + hist[j + 2] if cur >= prev: break prev = cur right = j if i < 2: prev = hist[i] else: prev = hist[i - 2] + hist[i - 1] + hist[i] # search first grow up on the left for j in range(i - 15, 2, -1): cur = hist[j - 2] + hist[j - 1] + hist[j] if cur >= prev: break prev = cur left = j
after that, I make the contours, I see clearly the quadrilateral, and find the corners.
match = cv2.approxPolyDP(cnt, 0.02 * len(cnt), True) [[p], [p2], [p3], [p4]] = match zoom = 1 w = zoom * int(cv2.norm(p - p2)) h = zoom * int(cv2.norm(p - p4)) if w > h: t = w w = h h = t pts = np.float32([[p4], [p], [p2], [p3]]) else: pts = np.float32([[p], [p2], [p3], [p4]]) pts2 = np.float32([[w, 0], [0, 0], [0, h], [w, h]]) M = cv2.getPerspectiveTransform(pts, pts2) img2 = cv2.warpPerspective(img_src, M, (w, h))
Treatment (Thresholding, blurring, etc)
So now we must treat text character before launch tesseract recognition I remove the margin to remove folding
h, w = img.shape[:2] margin = 100 img = img[margin:h - margin, margin: w - margin]
Treatment to improve the quality and the sharpness of character
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
Character Recognition
finally I use tesseract and check if each word exist in a language text dictionary
img2 = Image.fromarray(img) txt = pytesseract.image_to_string(img2, lang='fra') file =open('frenchWord.txt', 'r') keyword_list = file.read().split() cpt =0 for word in txt.split(): if word in keyword_list: print(word) cpt +=1 nbWord =len(txt.split()) print("\naccuracy = ", cpt, '/', nbWord, ' ', "%.1f" % (cpt *100 /nbWord), "%")
after ten seconds, I find 34.4% correct French word in the text.