StartDocumentAnalysis is not producing TABLE data
I am trying to detect and extract TABLE/FORMS data from a multi-page PDF (an asynchronous operation), for which, per the docs, the "StartDocumentAnalysis" API is the right one. However, the output differs between the console and automation (using the above API).
Case 1: When analyzing the same multi-page PDF from the Textract console, I see 13 tables detected along with Forms/Lines/Words.
{
"BlockType": "CELL",
"Confidence": 67.00711822509766,
"RowIndex": 1,
"ColumnIndex": 2,
"RowSpan": 1,
"ColumnSpan": 1,
"Geometry": {....},
"Id": "fba85b26-3340-4f00-912c-5b86f253b7cd",
"Relationships": [
{
"Type": "CHILD",
"Ids": [
"684d019a-8c89-4cc1-81a0-702e581938f6",
"de8def9a-de59-4f70-ae2f-1624b1b607d7"
]
}
],
"Page": 3,
"childText": "Ficus Bank ",
"SearchKey": "Ficus Bank "
}
Case 2: When using the "StartDocumentAnalysis" API on the same PDF from S3, blocks with BlockType = "TABLE" are present in the output, but without any data.
{
"BlockType": "CELL",
"ColumnIndex": 2,
"ColumnSpan": 1,
"Confidence": 67.00711822509766,
"EntityTypes": null,
"Geometry": {....},
"Hint": null,
"Id": "77c2a38a-c36c-4d8d-ba32-7e51a022ee23",
"Page": 3,
"Query": null,
"Relationships": [
{
"Ids": [
"e9b3b510-1041-4d6a-ab42-1fa42643038e",
"0a4d2092-c09e-4a56-aafe-700c006b85a3"
],
"Type": "CHILD"
}
],
"RowIndex": 1,
"RowSpan": 1,
"SelectionStatus": null,
"Text": null,
"TextType": null
},
Observation: The two keys below ("childText" and "SearchKey") are missing from the async API's output for CELL blocks of a TABLE.
"childText": "Ficus Bank ",
"SearchKey": "Ficus Bank "
Any info/direction would be appreciated.
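For context, a minimal Python sketch of one way to rebuild the cell text on the client side. It assumes that "childText" and "SearchKey" are conveniences added by the console rather than fields of the API response, and that the "CHILD" Ids on a CELL block point at WORD blocks whose "Text" holds the content. The helper names (`fetch_all_blocks`, `cell_text`, `table_cells`) are hypothetical; `client` is assumed to be a boto3 Textract client, and the job is assumed to have already finished successfully.

```python
def fetch_all_blocks(client, job_id):
    """Collect every Block of a finished async job, following NextToken.

    `client` is assumed to be boto3.client("textract"); multi-page results
    are paginated, so a single GetDocumentAnalysis call may not return
    the WORD blocks that a given CELL refers to.
    """
    blocks, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = client.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks


def cell_text(cell, blocks_by_id):
    """Join the Text of the WORD blocks listed as CHILD of a CELL block."""
    words = []
    for rel in cell.get("Relationships") or []:
        if rel.get("Type") != "CHILD":
            continue
        for child_id in rel.get("Ids") or []:
            child = blocks_by_id.get(child_id)
            # Skip ids that are missing or that are not WORD blocks.
            if child and child.get("BlockType") == "WORD":
                words.append(child.get("Text") or "")
    return " ".join(words)


def table_cells(blocks):
    """Return (page, row, column, text) for every CELL block."""
    by_id = {b["Id"]: b for b in blocks}
    return [
        (b.get("Page"), b.get("RowIndex"), b.get("ColumnIndex"), cell_text(b, by_id))
        for b in blocks
        if b.get("BlockType") == "CELL"
    ]
```

With a real job this would be called as `table_cells(fetch_all_blocks(client, job_id))`. The joined text is not guaranteed to match the console's "childText" character for character (the console value above has a trailing space, for instance).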
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow