Differences

Below are the differences between two revisions of the page.

--- diy:projets:paperscanner [2018/04/25 19:45] – gbouyjou
+++ diy:projets:paperscanner [2018/04/25 21:33] (current version) – [Crop and rotate target paper] gbouyjou
@@ Line 3: / Line 3: @@
 ----
 ===== Introduction =====
-PaperScanner is a python command to extract body text of printing character page picture (bad or not).
+PaperScanner is a Python command-line tool that extracts the body text from a photo of a printed page, whether the photo is of good quality or not.\\
 Run ''python paperScanner.py --help'' for more information.
@@ Line 9: / Line 9: @@
 ----
 ===== Installation =====
-Of course you need to have opencv library,
+Download the source code from GitHub: [[https://github.com/hiergaut/opencv/blob/master/paperScanner.py]]\\
-some additional lib : PIL and pytesseracct to recognize character.
+\\
-''pip install Pillow''
+You need the OpenCV library,\\
-''pip install pytesseract''
+plus two additional libraries: Pillow (PIL) and pytesseract for character recognition.\\
+''pip install Pillow''\\
+''pip install pytesseract''\\
+----
+===== Usage =====
+''python paperScanner.py --read <FILENAME>''\\
+FILENAME is the picture file from which you want to recover the text.\\
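As an illustration of the command line described above, here is a minimal sketch of an argument parser that accepts the ''--read'' option. It is an assumption about how such a parser could look, not the actual code of paperScanner.py; the function name ''build_parser'' and the help strings are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser exposing only the --read option (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="paperScanner.py",
        description="Extract the body text from a photo of a printed page.",
    )
    # argparse adds the --help option automatically
    parser.add_argument(
        "--read",
        metavar="FILENAME",
        required=True,
        help="picture file to recover the text from",
    )
    return parser


# parse an explicit argument list instead of sys.argv, for demonstration
args = build_parser().parse_args(["--read", "page.jpg"])
print(args.read)
```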
+----
+===== Explanation of the source code =====
+----
+==== Crop and rotate target paper ====
+We start with a photo of a page of text like this one:\\
+{{ :diy:projets:out.jpg?direct&400 |}}
+\\
+and we want to retrieve every sentence of this text.\\
+\\
+Before applying any image processing, I want to crop the picture to the body text and straighten it.\\
+I need to find the four corners of the page before calling the warpPerspective function.\\
+To discard everything that does not have the color of the white page, I use a histogram:\\
+{{ :diy:projets:screen.png?direct&200 |}}
+The histogram has two peaks: the one on the left is the yellow chair,\\
+and the other one is the page, which shows up as a paler yellow than the chair\\
+because the page is not perfectly white.\\
+\\
+I find the two boundaries around the highest peak with this code:
+<code python>
+    # Excerpt from a function body; assumes "import cv2" and "import numpy as np".
+    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
+    hist = cv2.calcHist([img], [0], None, [256], [0, 256]).ravel()
+    # index of the highest bin of the histogram
+    i = int(np.argmax(hist))
+    # right boundary: walk right until the 3-bin sum stops decreasing
+    prev = hist[i] if i > 253 else hist[i] + hist[i + 1] + hist[i + 2]
+    j = 255  # used as-is when the peak is too close to the right edge
+    for j in range(i + 15, 254):
+        cur = hist[j] + hist[j + 1] + hist[j + 2]
+        if cur >= prev:
+            break
+        prev = cur
+    right = j
+    # left boundary: walk left until the 3-bin sum stops decreasing
+    prev = hist[i] if i < 2 else hist[i - 2] + hist[i - 1] + hist[i]
+    j = 0  # used as-is when the peak is too close to the left edge
+    for j in range(i - 15, 2, -1):
+        cur = hist[j - 2] + hist[j - 1] + hist[j]
+        if cur >= prev:
+            break
+        prev = cur
+    left = j
+</code>
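To make the boundary search easier to follow, here is a minimal sketch of the same idea on a synthetic histogram, written in plain Python so it runs without OpenCV. The names ''peak_bounds'' and ''gap'' are hypothetical, and the 3-bin window is centered on each bin, which differs slightly from the forward and backward sums used in the excerpt above.

```python
def peak_bounds(hist, gap=15):
    """Return (left, peak, right) around the highest bin of a histogram."""
    n = len(hist)
    peak = max(range(n), key=lambda k: hist[k])

    def smooth(k):
        # sum of up to three neighbouring bins centred on k
        return sum(hist[max(k - 1, 0):min(k + 2, n)])

    def walk(step):
        k = peak + step * gap
        if not 0 <= k < n:
            # peak too close to the edge: clamp to the valid range
            return min(max(k, 0), n - 1)
        prev = smooth(peak)
        while 0 <= k < n:
            cur = smooth(k)
            if cur >= prev:
                return k  # the counts stopped decreasing here
            prev = cur
            k += step
        return k - step  # reached the edge without finding a rise

    return walk(-1), peak, walk(+1)


# synthetic histogram: a small peak (the chair) and a large one (the page)
hist = [0] * 256
for k in range(256):
    hist[k] += max(0, 300 - 10 * abs(k - 90))
    hist[k] += max(0, 1000 - 20 * abs(k - 200))

left, peak, right = peak_bounds(hist)
# keep only the gray levels that fall inside the band of the main peak
pixels = [90, 150, 200, 240]
mask = [left <= v <= right for v in pixels]
print(left, peak, right, mask)
```

Pixels whose gray level lies between ''left'' and ''right'' are the ones assumed to belong to the page, which is what makes the quadrilateral stand out in the next step.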
+After that, I extract the contours; the quadrilateral of the page shows up clearly, and I find its corners.
+{{ :diy:projets:screen2.png?direct&400 |}}
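+<code python>
+    # approximate the page contour by a polygon; a 4-point result is expected
+    match = cv2.approxPolyDP(cnt, 0.02 * len(cnt), True)
+    [[p], [p2], [p3], [p4]] = match
+    zoom = 1
+    # output size: lengths of the two sides that meet at corner p
+    w = zoom * int(cv2.norm(p - p2))
+    h = zoom * int(cv2.norm(p - p4))
+    if w > h:
+        # landscape: swap the sides and shift the corner order so the output is portrait
+        w, h = h, w
+        pts = np.float32([[p4], [p], [p2], [p3]])
+    else:
+        pts = np.float32([[p], [p2], [p3], [p4]])
+    # destination corners, listed in the same order as pts
+    pts2 = np.float32([[w, 0], [0, 0], [0, h], [w, h]])
+    M = cv2.getPerspectiveTransform(pts, pts2)
+    img2 = cv2.warpPerspective(img_src, M, (w, h))
+</code>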
+The result:\\
+{{ :diy:projets:screen3.png?direct&400 |}}
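The excerpt above relies on the order in which the four corners come out of the polygon approximation. As a hedged alternative, not the method used in paperScanner.py, the corners can be put in a fixed order first with a common sum/difference heuristic. The function names ''order_corners'' and ''target_size'' are hypothetical, and the heuristic assumes the page is not rotated by close to 45 degrees.

```python
import math


def order_corners(points):
    """Return four (x, y) points as top-left, top-right, bottom-right, bottom-left."""
    by_sum = sorted(points, key=lambda p: p[0] + p[1])
    by_diff = sorted(points, key=lambda p: p[0] - p[1])
    top_left, bottom_right = by_sum[0], by_sum[-1]    # smallest / largest x + y
    bottom_left, top_right = by_diff[0], by_diff[-1]  # smallest / largest x - y
    return [top_left, top_right, bottom_right, bottom_left]


def target_size(corners):
    """Width and height of the straightened page, from the longest opposite sides."""
    tl, tr, br, bl = corners

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    width = int(max(dist(tl, tr), dist(bl, br)))
    height = int(max(dist(tl, bl), dist(tr, br)))
    return width, height


corners = order_corners([(310, 420), (20, 30), (300, 10), (40, 400)])
print(corners, target_size(corners))
```

With the corners in this order, the destination points for the perspective transform would be ''(0, 0)'', ''(w, 0)'', ''(w, h)'' and ''(0, h)''.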
+----
+==== Image processing (thresholding, blurring, etc.) ====
+Now the text has to be cleaned up before running Tesseract recognition.\\
+I first crop a fixed margin to get rid of the folds near the edges of the page:
+<code python>
+    h, w = img.shape[:2]
+    margin = 100
+    img = img[margin:h - margin, margin: w - margin]
+</code>
+Then I binarize the image with Otsu thresholding to improve the quality and sharpness of the characters:
+<code python>
+    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
+    img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
+</code>
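For readers who wonder what the ''cv2.THRESH_OTSU'' flag does, here is a minimal plain-Python sketch of Otsu's method: it tries every threshold and keeps the one that maximizes the between-class variance of the dark and bright pixels. This is an illustration of the principle, not OpenCV's implementation; ''otsu_threshold'' and ''binarize'' are hypothetical names.

```python
def otsu_threshold(hist):
    """Return the threshold t that maximizes the between-class variance.

    Pixels with value <= t form the dark class, the others the bright class.
    """
    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class
    sum0 = 0  # sum of the gray values in the dark class
    for t, count in enumerate(hist):
        w0 += count
        sum0 += t * count
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, the variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, t):
    """Same rule as THRESH_BINARY: 255 above the threshold, 0 otherwise."""
    return [255 if v > t else 0 for v in pixels]


# toy histogram: dark ink around 50, bright paper around 200
hist = [0] * 256
hist[50] = 100
hist[200] = 100
t = otsu_threshold(hist)
print(t, binarize([50, 200, 50], t))
```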
+----
+==== Character Recognition ====
+Finally, I run Tesseract and check whether each recognized word exists in a French word list:
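+<code python>
+    # Excerpt; assumes "from PIL import Image" and "import pytesseract".
+    img2 = Image.fromarray(img)
+    txt = pytesseract.image_to_string(img2, lang='fra')
+    # load the word list into a set for fast membership tests
+    with open('frenchWord.txt', 'r') as file:
+        keyword_list = set(file.read().split())
+    words = txt.split()
+    cpt = 0
+    for word in words:
+        if word in keyword_list:
+            print(word)
+            cpt += 1
+    nbWord = len(words)
+    # guard against an empty OCR result before dividing
+    ratio = cpt * 100 / nbWord if nbWord else 0.0
+    print("\naccuracy = ", cpt, '/', nbWord, ' ', "%.1f" % ratio, "%")
+</code>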
+After ten seconds, 34.4% of the recognized words are found in the French word list.
+{{:diy:projets:screen4.png?direct&400|}}
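The check above compares each token exactly as Tesseract returns it, so a word followed by a comma or starting with a capital letter counts as a miss. Below is a minimal sketch of a more lenient count that strips punctuation and lowercases each token first. It is a variation under stated assumptions (a lowercase word list, a hypothetical function name ''dictionary_hit_rate''), not the author's code, and it would not reproduce the figure reported above.

```python
import string


def dictionary_hit_rate(text, dictionary):
    """Return (hits, total, percent) of tokens found in the dictionary."""
    words = text.split()
    if not words:
        return 0, 0, 0.0  # nothing recognized, avoid dividing by zero
    # characters stripped from both ends of each token
    strip_chars = string.punctuation + "«»’"
    hits = 0
    for word in words:
        cleaned = word.strip(strip_chars).lower()
        if cleaned in dictionary:
            hits += 1
    return hits, len(words), 100.0 * hits / len(words)


# toy word list and OCR output, for illustration only
dictionary = {"le", "chat", "dort"}
result = dictionary_hit_rate("Le chat dort, xqz.", dictionary)
print("accuracy = %d / %d  %.1f%%" % result)
```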