Document_From_Table is an Aster function that automatically extracts text and metadata from 300+ document formats. It relies on the Apache Tika library for the bulk of the extraction work. The function can also OCR images using Google's Tesseract library, which must be installed on every worker in the cluster for OCR to be enabled.
If the Tesseract package is available from a zypper repository configured on the workers, running 'zypper in tesseract' on each worker should install everything needed; otherwise, you will have to download the liblept5, leptonica, libtesseract, and tesseract-ocr RPMs and install them manually on all the workers.
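The install steps above could be scripted roughly as follows. This is only a sketch for a single worker, not a tested installer: the function name install_tesseract and the RPM_DIR directory are hypothetical, and the RPM file name patterns are assumptions based on the package names listed above. It needs to run as root and would have to be repeated on every worker.

```shell
#!/bin/sh
# Sketch: install Tesseract on ONE worker; repeat on every worker as root.
# RPM_DIR is a hypothetical directory holding the manually downloaded RPMs.
RPM_DIR="${RPM_DIR:-/tmp/tesseract-rpms}"

install_tesseract() {
    # Nothing to do if a tesseract command is already available.
    if command -v tesseract >/dev/null 2>&1; then
        echo "already-installed"
        return 0
    fi

    # Preferred path: let zypper resolve and install the dependencies.
    # ("zypper in" is the short form of "zypper install".)
    if zypper --non-interactive install tesseract >/dev/null 2>&1; then
        echo "installed-via-zypper"
        return 0
    fi

    # Fallback: install the four RPMs from RPM_DIR in one transaction.
    # The file name patterns are assumptions; adjust them to the actual files.
    rpm -Uvh "$RPM_DIR"/liblept5*.rpm "$RPM_DIR"/leptonica*.rpm \
             "$RPM_DIR"/libtesseract*.rpm "$RPM_DIR"/tesseract-ocr*.rpm \
        >/dev/null 2>&1 && echo "installed-via-rpm"
}
```

Sourcing the script and calling install_tesseract prints which path was taken. A non-zero exit status means both the zypper install and the manual RPM install failed.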
Document_From_Table takes a table of bytea files as input and extracts or OCRs the text from them. To load files into Aster in this format, use the Aster Loader tool. After populating the table with files, you can call Document_From_Table with the following SQL-MR syntax: