Saturday, 15 August 2015

Tesseract OCR text order for documents with tables or rows


I am using Tesseract to convert scanned PDFs to plain text. Overall it is highly effective, but I am having problems with the order in which the text is read. Documents with tabular data seem to be read column by column, when reading them row by row would be more natural. A very small-scale example:

  This is column A, row 1   This is column B, row 1   This is column C, row 1
  This is column A, row 2   This is column B, row 2   This is column C, row 2

is read as:

  This is column A, row 1 This is column A, row 2 This is column B, row 1 This is column B, row 2 This is column C, row 1 This is column C, row 2

I am just starting to read the documentation and have been guessing and checking with a brute-force attitude, but if someone has already solved a similar problem, I would appreciate insights on how to fix it. It may just be a matter of training data, but I do not know how that works.
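One possible workaround, sketched here under assumptions rather than taken from the question: if the OCR step can emit a bounding box for each word (Tesseract's TSV and hOCR outputs carry this kind of information), the words can be regrouped into rows by their vertical position after the fact. The `Word` tuple, the `rows_from_words` helper, and the `tolerance` parameter below are hypothetical names made up for illustration, and the sample coordinates are invented.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """One recognized word with its bounding box (hypothetical structure)."""
    text: str
    left: int    # x coordinate of the left edge, in pixels
    top: int     # y coordinate of the top edge, in pixels
    height: int  # height of the box, in pixels


def rows_from_words(words: List[Word], tolerance: float = 0.5) -> List[str]:
    """Group word boxes into visual rows: top to bottom, then left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.top):
        if rows:
            anchor = rows[-1][0]  # first (topmost) word of the current row
            # Treat the word as part of the current row when its top edge is
            # within a fraction of the anchor's box height.
            if abs(word.top - anchor.top) <= tolerance * anchor.height:
                rows[-1].append(word)
                continue
        rows.append([word])
    # Inside each row, restore left-to-right order before joining.
    return [
        " ".join(w.text for w in sorted(row, key=lambda w: w.left))
        for row in rows
    ]


# Invented sample: six cells listed in the column-by-column order described above.
words = [
    Word("A1", 10, 100, 20), Word("A2", 10, 140, 20),
    Word("B1", 200, 102, 20), Word("B2", 200, 141, 20),
    Word("C1", 400, 99, 20), Word("C2", 400, 139, 20),
]
print(rows_from_words(words))  # ['A1 B1 C1', 'A2 B2 C2']
```

The tolerance is a guess that would need tuning per document; skewed scans or rows with very different font sizes would likely need something more robust than a single threshold.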

You can play with its various

