diy:projets:paperscanner
Differences
Below are the differences between two revisions of the page.
diy:projets:paperscanner [2018/04/25 19:45] – gbouyjou | diy:projets:paperscanner [2018/04/25 21:33] (current version) – [Crop and rotate target paper] gbouyjou | ||
---|---|---|---|
Line 3: | Line 3: | ||
---- | ---- | ||
===== Introduction ===== | ===== Introduction ===== | ||
- | PaperScanner is a Python command that extracts the body text from a picture of a printed page (whether the picture is good or bad). | + | PaperScanner is a Python command that extracts the body text from a picture of a printed page (whether the picture is good or bad).\\ |
You can use this command with option '' | You can use this command with option '' | ||
Line 9: | Line 9: | ||
---- | ---- | ||
===== Installation ===== | ===== Installation ===== | ||
- | You will of course need the OpenCV library, | + | Download the source code from GitHub [[https:// |
- | plus some additional libraries: PIL and pytesseract for character recognition. | + | \\ |
- | '' | + | You will of course need the OpenCV library,\\ |
- | '' | + | plus some additional libraries: PIL and pytesseract for character recognition.\\ |
+ | '' | ||
+ | '' | ||
+ | ---- | ||
+ | ===== Usage ===== | ||
+ | '' | ||
+ | FILENAME is the picture file from which you want to recover the text.\\ | ||
+ | ---- | ||
+ | ===== Explanation of the source code ===== | ||
+ | |||
+ | ---- | ||
+ | ==== Crop and rotate target paper ==== | ||
+ | |||
+ | First, we have a picture of a page of text like this:\\ | ||
+ | {{ : | ||
+ | \\ | ||
+ | and we have to retrieve every sentence of this text.\\ | ||
+ | \\ | ||
+ | So before applying any image-processing operations, I want to crop the picture down to the body text and straighten it.\\ | ||
+ | I need to find the four corners of the page before using the warpPerspective function.\\ | ||
+ | To eliminate every color other than the white of the page, I use a histogram.\\ | ||
+ | {{ : | ||
+ | The histogram has two peaks: the one on the left is the yellow of the chair,\\ | ||
+ | and the other is the color of the page, which looks like a whiter yellow than the first one: | ||
+ | the page is not perfectly white.\\ | ||
+ | \\ | ||
+ | I find the two boundaries of the page-color peak with this code: | ||
+ | <code python> | ||
+ | img = cv2.cvtColor(img, | ||
+ | |||
+ | hist = cv2.calcHist([img], | ||
+ | |||
+ | # find the tallest histogram bin (the dominant color) | ||
+ | M = hist[0] | ||
+ | i = 0 | ||
+ | for j in range(1, 256): | ||
+ |     cur = hist[j] | ||
+ |     if cur > M: | ||
+ |         M = cur | ||
+ |         i = j | ||
+ | ||
+ | # smoothed height at the peak: sum of three adjacent bins | ||
+ | if i > 253: | ||
+ |     prev = hist[i] | ||
+ | else: | ||
+ |     prev = hist[i] + hist[i + 1] + hist[i + 2] | ||
+ | ||
+ | # walk right from the peak until the histogram starts to rise again | ||
+ | for j in range(i + 15, 254): | ||
+ |     cur = hist[j] + hist[j + 1] + hist[j + 2] | ||
+ |     if cur >= prev: | ||
+ |         break | ||
+ |     prev = cur | ||
+ | ||
+ | right = j | ||
+ | ||
+ | if i < 2: | ||
+ |     prev = hist[i] | ||
+ | else: | ||
+ |     prev = hist[i - 2] + hist[i - 1] + hist[i] | ||
+ | ||
+ | # walk left from the peak until the histogram starts to rise again | ||
+ | for j in range(i - 15, 2, -1): | ||
+ |     cur = hist[j - 2] + hist[j - 1] + hist[j] | ||
+ |     if cur >= prev: | ||
+ |         break | ||
+ |     prev = cur | ||
+ | ||
+ | left = j | ||
+ | </ | ||
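The boundary search above can also be written as a single helper. The following is only a minimal sketch, not the project's code: `find_peak_bounds`, `window_sum` and the `gap` and `window` parameters are hypothetical names, and the histogram is assumed to be a plain list of counts rather than the array returned by cv2.calcHist. Unlike the loops above, it falls back to the first and last bin when no rise is found.

```python
def window_sum(hist, start, size=3):
    """Sum `size` consecutive bins starting at `start`, clipped to the histogram."""
    return sum(hist[max(start, 0):min(start + size, len(hist))])


def find_peak_bounds(hist, gap=15, window=3):
    """Return (left, peak, right) around the tallest histogram bin.

    Walking away from the peak, each bound is the first position where the
    smoothed count stops decreasing.
    """
    n = len(hist)
    peak = max(range(n), key=lambda k: hist[k])

    # Walk right: stop when the windowed sum no longer decreases.
    right = n - 1
    prev = window_sum(hist, peak, window)
    for j in range(peak + gap, n - window + 1):
        cur = window_sum(hist, j, window)
        if cur >= prev:
            right = j
            break
        prev = cur

    # Walk left, mirroring the search above.
    left = 0
    prev = window_sum(hist, peak - window + 1, window)
    for j in range(peak - gap, window - 2, -1):
        cur = window_sum(hist, j - window + 1, window)
        if cur >= prev:
            left = j
            break
        prev = cur

    return left, peak, right
```

The `gap` of 15 bins mirrors the offset used above: it skips the noisy top of the peak before looking for the first rise.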
+ | After that, I extract the contours, and I can clearly see the quadrilateral, | ||
+ | {{ : | ||
+ | |||
+ | <code python> | ||
+ | match = cv2.approxPolyDP(cnt, | ||
+ | | ||
+ | [[p], [p2], [p3], [p4]] = match | ||
+ | |||
+ | zoom = 1 | ||
+ | # side lengths of the detected quadrilateral | ||
+ | w = zoom * int(cv2.norm(p - p2)) | ||
+ | h = zoom * int(cv2.norm(p - p4)) | ||
+ | if w > h: | ||
+ |     # wider than tall: swap the sides so that h is the longer one | ||
+ |     w, h = h, w | ||
+ | ||
+ |     pts = np.float32([[p4], | ||
+ | ||
+ | else: | ||
+ |     pts = np.float32([[p], | ||
+ | |||
+ | pts2 = np.float32([[w, | ||
+ | M = cv2.getPerspectiveTransform(pts, | ||
+ | img2 = cv2.warpPerspective(img_src, | ||
+ | </ | ||
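The snippet above chooses the order of the source points depending on which side is longer. A common alternative is to sort the four corners into a fixed order first. The following is a minimal sketch of that idea in pure Python, not the project's code: `order_corners` and `page_size` are hypothetical helpers, and the heuristic assumes the page is rotated by well under 45 degrees.

```python
import math


def order_corners(points):
    """Order four (x, y) corners as top-left, top-right, bottom-right, bottom-left.

    Image coordinates are assumed: x grows to the right, y grows downward.
    """
    if len(points) != 4:
        raise ValueError("expected exactly four corner points")
    # The top-left corner has the smallest x + y, the bottom-right the largest.
    top_left = min(points, key=lambda p: p[0] + p[1])
    bottom_right = max(points, key=lambda p: p[0] + p[1])
    # The top-right corner has the smallest y - x, the bottom-left the largest.
    top_right = min(points, key=lambda p: p[1] - p[0])
    bottom_left = max(points, key=lambda p: p[1] - p[0])
    return [top_left, top_right, bottom_right, bottom_left]


def page_size(corners):
    """Return (width, height) in whole pixels for corners ordered as above."""
    top_left, top_right, _, bottom_left = corners
    width = int(math.dist(top_left, top_right))
    height = int(math.dist(top_left, bottom_left))
    return width, height
```

The ordered list could then be converted with np.float32 and used as the source points for cv2.getPerspectiveTransform, with the destination rectangle built from the computed width and height.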
+ | |||
+ | The result:\\ | ||
+ | {{ : | ||
+ | |||
+ | |||
+ | ---- | ||
+ | ==== Treatment (Thresholding, | ||
+ | |||
+ | Now we must clean up the text characters before launching Tesseract recognition. | ||
+ | I crop the margins to remove the fold marks. | ||
+ | <code python> | ||
+ | h, w = img.shape[: | ||
+ | | ||
+ | margin = 100 | ||
+ | img = img[margin: | ||
+ | </ | ||
+ | |||
+ | Processing to improve the quality and sharpness of the characters: | ||
+ | <code python> | ||
+ | img = cv2.cvtColor(img, | ||
+ | img = cv2.threshold(img, | ||
+ | </ | ||
+ | |||
+ | |||
+ | ---- | ||
+ | ==== Character Recognition ==== | ||
+ | Finally, I run Tesseract and check whether each word exists in a dictionary file for the language. | ||
+ | <code python> | ||
+ | img2 = Image.fromarray(img) | ||
+ | txt = pytesseract.image_to_string(img2, | ||
+ | file =open(' | ||
+ | keyword_list = file.read().split() | ||
+ | cpt = 0 | ||
+ | for word in txt.split(): | ||
+ |     if word in keyword_list: | ||
+ |         print(word) | ||
+ |         cpt += 1 | ||
+ | ||
+ | nbWord = len(txt.split()) | ||
+ | print(" | ||
+ | </ | ||
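The dictionary check above can be isolated into a small function that is easy to test without Tesseract. The following is a minimal sketch under stated assumptions, not the project's code: `known_word_ratio` is a hypothetical name, and the lowercasing and punctuation stripping are additions that the original loop does not perform.

```python
import string


def known_word_ratio(text, dictionary_words):
    """Return (known, total, percent) for the words of `text` found in the dictionary."""
    # A set gives constant-time membership tests, unlike a list.
    vocabulary = {w.lower() for w in dictionary_words}
    # Strip surrounding punctuation so that "page," still matches "page".
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    known = sum(1 for w in words if w in vocabulary)
    percent = 100.0 * known / len(words) if words else 0.0
    return known, len(words), percent
```

For example, `known_word_ratio(txt, keyword_list)` would take the recognized text and the word list loaded above.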
+ | After ten seconds, I find that 34.4% of the words in the text are correct French words. | ||
+ | {{: | ||
diy/projets/paperscanner.1524685500.txt.gz · Last modified: 2018/04/25 19:45 by gbouyjou