diy:projets:paperscanner
Differences
Below are the differences between two revisions of the page.
| Next revision | Previous revision | ||
| diy:projets:paperscanner [2018/04/25 16:52] – created gbouyjou | diy:projets:paperscanner [2018/04/25 21:33] (current version) – [Crop and rotate target paper] gbouyjou | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | blabla | + | ====== The paperScanner python command ====== |
| + | |||
| + | ---- | ||
| + | ===== Introduction ===== | ||
| + | PaperScanner is a Python command that extracts the body text from a picture of a printed page, whether the picture is of good quality or not.\\ | ||
| + | You can use this command with option '' | ||
| + | |||
| + | |||
| + | ---- | ||
| + | ===== Installation ===== | ||
| + | Download the source code from GitHub [[https:// | ||
| + | \\ | ||
| + | You need the OpenCV library, | ||
| + | plus two additional libraries: PIL and pytesseract for character recognition.\\ | ||
| + | '' | ||
| + | '' | ||
| + | |||
| + | |||
| + | ---- | ||
| + | ===== Usage ===== | ||
| + | '' | ||
| + | FILENAME is the picture file from which you want to recover the text.\\ | ||
| + | |||
| + | |||
| + | ---- | ||
| + | ===== Explanation of the source code ===== | ||
| + | |||
| + | ---- | ||
| + | ==== Crop and rotate target paper ==== | ||
| + | |||
| + | We start with a picture of a page of text like this one:\\ | ||
| + | {{ : | ||
| + | \\ | ||
| + | and we want to retrieve every sentence of this text.\\ | ||
| + | \\ | ||
| + | Before applying any image processing, I want to crop the picture down to the body text and straighten it.\\ | ||
| + | To do so, I need to find the four corners of the page before calling the warpPerspective function. | ||
| + | To set the white page apart from everything else in the picture, I use a histogram to exclude the other colors.\\ | ||
| + | {{ : | ||
| + | The histogram shows two peaks: the one on the left is the yellow chair,\\ | ||
| + | and the other is the page, which comes out as a paler yellow than the chair | ||
| + | because it is not a perfectly white page.\\ | ||
| + | \\ | ||
| + | I find the two boundaries around the peak with this code: | ||
| + | <code python> | ||
| + | img = cv2.cvtColor(img, | ||
| + | |||
| + | hist = cv2.calcHist([img], | ||
| + | |||
| + | # find the index i of the highest histogram bin (the dominant gray level) | ||
| + | M = hist[0] | ||
| + | i = 0 | ||
| + | for j in range(1, 256): | ||
| + |     cur = hist[j] | ||
| + |     if cur > M: | ||
| + |         M = cur | ||
| + |         i = j | ||
| + | |||
| + | if i > 253: | ||
| + |     prev = hist[i] | ||
| + | else: | ||
| + |     prev = hist[i] + hist[i + 1] + hist[i + 2] | ||
| + | |||
| + | # walk right from the peak until the 3-bin sum stops decreasing | ||
| + | for j in range(i + 15, 254): | ||
| + |     cur = hist[j] + hist[j + 1] + hist[j + 2] | ||
| + |     if cur >= prev: | ||
| + |         break | ||
| + |     prev = cur | ||
| + | |||
| + | right = j | ||
| + | |||
| + | if i < 2: | ||
| + |     prev = hist[i] | ||
| + | else: | ||
| + |     prev = hist[i - 2] + hist[i - 1] + hist[i] | ||
| + | |||
| + | # walk left from the peak until the 3-bin sum stops decreasing | ||
| + | for j in range(i - 15, 2, -1): | ||
| + |     cur = hist[j - 2] + hist[j - 1] + hist[j] | ||
| + |     if cur >= prev: | ||
| + |         break | ||
| + |     prev = cur | ||
| + | |||
| + | left = j | ||
| + | </ | ||
| + | After that, I extract the contours, and the quadrilateral of the page is clearly visible: | ||
| + | {{ : | ||
| + | |||
| + | <code python> | ||
| + | match = cv2.approxPolyDP(cnt, | ||
| + | |||
| + | [[p], [p2], [p3], [p4]] = match | ||
| + | |||
| + | zoom = 1 | ||
| + | w = zoom * int(cv2.norm(p - p2)) | ||
| + | h = zoom * int(cv2.norm(p - p4)) | ||
| + | if w > h: | ||
| + |     # the first edge is the long one: swap so that w is the short side | ||
| + |     t = w | ||
| + |     w = h | ||
| + |     h = t | ||
| + | |||
| + |     pts = np.float32([[p4], | ||
| + | |||
| + | else: | ||
| + |     pts = np.float32([[p], | ||
| + | |||
| + | pts2 = np.float32([[w, | ||
| + | M = cv2.getPerspectiveTransform(pts, | ||
| + | img2 = cv2.warpPerspective(img_src, | ||
| + | </ | ||
| + | |||
| + | The result:\\ | ||
| + | {{ : | ||
| + | |||
| + | |||
| + | ---- | ||
| + | ==== Treatment (Thresholding, | ||
| + | |||
| + | The text now has to be cleaned up before launching the Tesseract recognition. | ||
| + | First, I crop the margins to get rid of the folds in the paper: | ||
| + | <code python> | ||
| + | h, w = img.shape[: | ||
| + | |||
| + | margin = 100 | ||
| + | img = img[margin: | ||
| + | </ | ||
| + | |||
| + | Processing to improve the quality and sharpness of the characters: | ||
| + | <code python> | ||
| + | img = cv2.cvtColor(img, | ||
| + | img = cv2.threshold(img, | ||
| + | </ | ||
| + | |||
| + | |||
| + | ---- | ||
| + | ==== Character Recognition ==== | ||
| + | |||
| + | Finally, I run Tesseract and check whether each recognized word exists in a dictionary file for the language: | ||
| + | <code python> | ||
| + | img2 = Image.fromarray(img) | ||
| + | txt = pytesseract.image_to_string(img2, | ||
| + | |||
| + | file =open(' | ||
| + | keyword_list = file.read().split() | ||
| + | |||
| + | cpt = 0 | ||
| + | for word in txt.split(): | ||
| + |     if word in keyword_list: | ||
| + |         print(word) | ||
| + |         cpt += 1 | ||
| + | |||
| + | nbWord = len(txt.split()) | ||
| + | |||
| + | print(" | ||
| + | </ | ||
| + | After ten seconds, 34.4% of the words found in the text are correct French words. | ||
| + | {{: | ||
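
The percentage quoted in the Character Recognition section is the share of recognized words that also appear in the dictionary file. Here is a minimal sketch of that count, not the actual paperScanner code: the function name `dictionary_hit_ratio`, the sample text and the tiny word list are all made up for illustration. It loads the dictionary into a set, on the assumption that a set lookup is cheaper than scanning a list for every word.

```python
def dictionary_hit_ratio(text, dictionary_words):
    """Return (hits, total, ratio) for the whitespace-separated words of text.

    Hypothetical helper: it mirrors the word-counting loop of the page,
    but it is not taken from the paperScanner source.
    """
    known = set(dictionary_words)   # set membership test instead of a list scan
    words = text.split()
    hits = sum(1 for word in words if word in known)
    ratio = hits / len(words) if words else 0.0   # avoid dividing by zero on empty text
    return hits, len(words), ratio


# Made-up OCR output ("1a" stands for a misread "la") and a made-up mini dictionary.
sample_text = "le chat mange 1a souris"
sample_dictionary = ["le", "la", "chat", "mange", "souris"]

hits, total, ratio = dictionary_hit_ratio(sample_text, sample_dictionary)
print("%d/%d words found (%.1f%%)" % (hits, total, 100 * ratio))
```

The comparison is exact and case-sensitive, so a word followed by punctuation or written with a capital letter counts as a miss unless the text is normalized first.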
diy/projets/paperscanner.1524675168.txt.gz · Last modified: by gbouyjou
