
gds.documentgen.extract.file


This function takes an input file of one of several supported types and extracts textual information from it, which can then be processed further. This API is the core component that allows a wide variety of documents to be uploaded and decoded.

Use this API if you wish to decode and extract information from various document types, such as images, spreadsheets or PDF documents.

Typically, this API is called by other functions to assist them. It can be called directly, but the output it generates is designed to be consumed and further refined. It may also take a noticeable time to process, depending on the type of file loaded.

Related API: gds.documentgen.decode.file analyses the file type (PDF, DOCX, PNG ...) but does not extract any content.

Try it now

Example of Use
<form method="POST" action="/gnap/M/buck"
	target='xxx' enctype="multipart/form-data">

<input type='hidden' name='f3_s'
	value='gds.documentgen.extract.file'>

Input File <input type=file name='f110_s'>

<input type='submit' value='Execute'>
</form>

Select a file, such as Excel (xlsx), PDF, PNG, TIFF, JPG or BMP, and the API will extract and decode it. Results will open in a new window.
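The same request can also be prepared from a script. The following is a minimal Python sketch, not a reference client: it builds the multipart/form-data body that the form above would submit, using only the field names (f3_s, f110_s) and the path (/gnap/M/buck) shown in the form. The host name, the helper name build_extract_request and the application/octet-stream content type are assumptions made for illustration.

```python
import uuid
import urllib.request

API_NAME = "gds.documentgen.extract.file"

def build_extract_request(base_url, filename, file_bytes):
    """Build (but do not send) a POST request mirroring the HTML form."""
    boundary = uuid.uuid4().hex
    # First part: the API name (f3_s). Second part: the file itself (f110_s).
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="f3_s"\r\n'
        "\r\n"
        f"{API_NAME}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="f110_s"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"  # assumed content type
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/gnap/M/buck",
        data=head + file_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )

# Hypothetical host and placeholder file contents.
request = build_extract_request("http://localhost:8080", "invoice.png", b"fake image bytes")
# To send it: urllib.request.urlopen(request).read()
```

Building the request separately from sending it makes it easy to inspect the body first, which is useful because processing can take a noticeable time for some file types.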


Input Specification

f110
Actual file contents. This argument is required.

This API can decode the following filetypes:
xlsx, csv, images (png, tiff, jpeg, jfif, bmp, tga, gif ...), eml, txt, pdf

f112
Request Processing Options
f113
Expected type of document. Setting this directs the system towards what type of document is expected.
f114
PhysKey referring to the document

Output Specification

The output of this API is a structure describing both the document and optionally its decoded meaning.

<DATS>
  <DecodeMethods>1</DecodeMethods>
  <DATS>...</DATS>		- Array of FIFB blocks
  <DOCU>...</DOCU>		- Document fields
  <DOCC>...</DOCC>		- Cleaned document fields
  <GRID>...</GRID>		- Layout information
</DATS>
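
As a rough illustration of consuming this structure, the sketch below parses a response shaped like the outline above with Python's standard xml.etree.ElementTree. It assumes the response arrives as XML exactly as outlined; the elided "..." contents are left empty in the sample, and the helper name summarise_response is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample shaped like the outline above; block contents are left empty
# because the documentation elides them.
SAMPLE = """<DATS>
  <DecodeMethods>1</DecodeMethods>
  <DATS></DATS>
  <DOCU></DOCU>
  <DOCC></DOCC>
  <GRID></GRID>
</DATS>"""

def summarise_response(xml_text):
    """Report which top-level blocks are present in a response."""
    root = ET.fromstring(xml_text)
    return {
        "decode_methods": int(root.findtext("DecodeMethods", default="0")),
        "has_docu": root.find("DOCU") is not None,
        "has_docc": root.find("DOCC") is not None,
        "has_grid": root.find("GRID") is not None,
        # Number of nested <DATS> children directly under the root.
        "nested_dats": len(root.findall("DATS")),
    }

summary = summarise_response(SAMPLE)
```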

Top Level Fields

Field# | Name | Description | Example(s)
f120 | DecodeMethods | A bitmask of which techniques were used to extract the information from the document. |
  • 1 - OCR was used
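
Because DecodeMethods is a bitmask, it should be tested with a bitwise AND rather than compared for equality. A minimal Python sketch follows; only the OCR bit (1) is documented here, and the constant and function names are illustrative.

```python
DECODE_METHOD_OCR = 1  # the only bit value documented above

def used_ocr(decode_methods):
    """Return True if the OCR bit is set in the DecodeMethods bitmask."""
    return bool(int(decode_methods) & DECODE_METHOD_OCR)
```

Testing the bit this way keeps working if further bits are set alongside the OCR bit.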

gds.documentgen.extract.spreadsheet

This function allows you to upload an Excel spreadsheet and the system will decode the contents and return a data packet with the contents. Excel is an advanced product and the representation of the spreadsheet will not contain all attributes and abilities; this function is primarily about extracting tables of information.

gds.documentgen.extract.file

This function allows you to upload a complete file and have the system attempt to decode it and its contents, without updating any part of the system. You can upload XLSX, JPEG, PNG, TIFF and GIF files. If the document is an image, the system automatically calls OCR routines to read the document first.

OCR Invoice Fields

The accompanying image shows an invoice that has been photographed and highlights the fields that gds.documentgen.extract.file will attempt to decode. For clarity, only some fields are shown.

Call Arguments

f110 | Actual contents of file.
f112 | Request Processing Options
f113 | Expected type of document. Setting this directs the system towards what type of document is expected.
f114 | PhysKey referring to the document

Return Data

Several arrays of information about the document are returned, but these are limited to what can be gleaned without reference to your data. Essentially, this function reads the document but does not apply any context awareness, such as locating product names. If you need context awareness, call the function retailmax.elink.utility.document.decode, which internally calls this function and then applies analysis to further decode the document.

DOCU & DOCC Structure

This structure defines a document in terms of items found on the page, such as "date" or "invoice number". The DOCU structure is identical to the DOCC; the only difference is that the DOCU contains raw data and the DOCC contains data that has been cleaned in the way a reasonable person might clean it. For example, DOCU will report a number as 17.BB, while the DOCC might convert this to 17.88. DOCC may also use information from other scanned documents to complete its information. For example, a GST number that scans as "88-45Be~$y" may be presented as "88-458-126" in the DOCC.
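
Since DOCU and DOCC share the same layout, one way to see what the cleaning step changed is to compare them field by field. The sketch below assumes both packets have already been parsed into dictionaries keyed by field number; the helper name is hypothetical, and placing the example values on f120 and f301 is only for illustration.

```python
def cleaned_differences(docu, docc):
    """Return {field: (raw, cleaned)} for fields whose cleaned value differs."""
    diffs = {}
    for field, cleaned in docc.items():
        raw = docu.get(field)  # None if the field exists only in DOCC
        if raw != cleaned:
            diffs[field] = (raw, cleaned)
    return diffs

# Illustrative values taken from the examples in the text above.
docu = {"f110": "0041", "f120": "17.BB", "f301": "88-45Be~$y"}
docc = {"f110": "0041", "f120": "17.88", "f301": "88-458-126"}
changes = cleaned_differences(docu, docc)
```

Reviewing these differences may be useful before trusting a cleaned value, for instance when the cleaned figure was completed from other scanned documents.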

Field# | Name | Description | Example(s)
f100 | | Holds Physkey when this record is stored in a database file |
f101 | | Datetime this document was created |
f108 | DocTypeCode | Number indicating the type of document we best believe this is. 100=Invoice, 101=Order Confirmation, 102=OCR Test Page, 103=ASN | 100
f109 | DocTypeName | Text version of f108. | Invoice
f110 | InvoiceNumber | The reference number of an invoice | 0041
f111 | Date | The date on this document | 24/05/14
f112 | TaxIdNumber | GST registered number. This is supplied without formatting characters, but you should be prepared for this rule to not be honoured. | 27797318
f119 | | OCR Quality score. A value from 0 to 100% indicating how well we think the OCR process worked. |
f120 | GrandTotal | The grand total value of an invoice |
f121 | Tax1 | GST total |
f122 | SubTotal | Sub total |
f123 | | Freight charge amount |
f130 | Email | The primary email on this document. This is not designed to extract any random email but rather to identify what appears to be the author's email on this document. |
f131 | | Telephone number |
f132 | | Fax number |
f133 | Issuer | Issuing party name, e.g. the name of the supplier on an invoice |
f134 | PurchaseOrderNumber | Purchase order number. |
f135 | | Comments or remarks found on the document. |
f136 | | "Our Reference". Some documents have an "our reference" field in addition to the invoice number etc.; this field contains that additional "our reference" value. |
f137 | | Account Number |
f142 | | Website |
f143 | | Mobile |
f144 | | Contact Name |
f145 | | Address of document creator |
f160 | DeliveryAddress | Address the document was sent to (street or postal, not email). Contains the "ship-to" address if both ship-to and bill-to addresses are specified. If only one address is present, this field contains the address |
f161 | | Bill-to address, but only loaded if f160 also has a value. If only one address is specified it is stored in f160. |
f300 | | GST rate | .15
f301 | | GST number | 123-454-6789
f302 | | Due date for invoice |
f310 | | Bank Account Number |
f311 | | Bank Name |
f312 | | Bank Branch |
f313 | | Payment instructions text, if special instructions were present on the document | Please quote ABC123 on payment
f350 | | MAF packhouse id (New Zealand) | PH531
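
A few of the rules in this table lend themselves to small helper functions. The Python sketch below assumes a DOCU or DOCC packet has already been parsed into a dictionary keyed by field number; the helper names are hypothetical, and the set of formatting characters stripped from the tax number (spaces, hyphens, dots) is an assumption.

```python
import re

# Document type codes as listed for f108.
DOC_TYPE_NAMES = {
    100: "Invoice",
    101: "Order Confirmation",
    102: "OCR Test Page",
    103: "ASN",
}

def doc_type_name(doc):
    """Translate f108 into a readable name, tolerating unknown codes."""
    code = doc.get("f108")
    if code is None:
        return None
    return DOC_TYPE_NAMES.get(int(code), f"Unknown ({code})")

def normalise_tax_id(doc):
    """Strip formatting characters from f112 in case they were left in."""
    raw = doc.get("f112")
    return re.sub(r"[\s.\-]", "", raw) if raw else None

def bill_to_address(doc):
    """f161 is only loaded when f160 is too; otherwise the sole address is in f160."""
    return doc.get("f161") or doc.get("f160")
```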

LINE Structure

This packet is a subtype of DOCU and DOCC. It holds the repeating lines on documents such as invoices and packing slips.

Field# | Name | Description | Example
f200 | Pid | Product Id in POS if already known. |
f204 | ActualQtyUnits | Actual quantity, ideally in units if possible |
f240 | EachPrice | Each price (price of single qty) excl tax |
f241 | | Discount amount in money |
f242 | | Discount percentage |
f243 | RawNetPrice | Raw net price |
f244 | | Raw line total |
f245 | | Sale discount amount per each |
f246 | | Sale discount amount per line total |
f247 | | Final net line total |
f260 | | Each price (price of single qty) including tax |
f261 | | Discount amount in money |
f262 | | Discount percentage |
f263 | | Raw net price |
f264 | | Raw line total |
f265 | | Sale discount amount per each |
f266 | | Sale discount amount per line total |
f267 | | Final net line total |
f300 | SupplierPartCode | Supplier's partcode |
f303 | | Item barcode. This is a retail level barcode, not a trade unit |
f304 | OrderQtyUnits | Order quantity in units, not outers |
f306 | | Outer packing type in words, such as "CTN6" or "Carton 6". This text is not standardised and can change from supplier to supplier. |
f308 | | Item name in supplier's terms |
f309 | | Return date. Some invoices specify a return date per line item. This is the date at which unsold items should be returned | 12-feb-2015
f420 | RRP | RRP |
f1830 | SupPartPhyskey | Physkey for supplier's partcode. |
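
The price fields appear in two parallel runs, f240 to f247 and f260 to f267, with matching descriptions. Only f240 and f260 are explicitly labelled as excluding and including tax, so the sketch below rests on the assumption that each f24x field has its tax-inclusive counterpart at f26x. It also assumes a LINE packet parsed into a dictionary keyed by field number; the helper names are hypothetical.

```python
def incl_tax_counterpart(field):
    """Map an f240..f247 field to its assumed tax-inclusive twin (f260..f267)."""
    number = int(field[1:])
    if not 240 <= number <= 247:
        raise ValueError(f"{field} is not in the f240..f247 range")
    return f"f{number + 20}"

def price_pairs(line):
    """Return {excl_field: (excl_value, incl_value)} for price fields present."""
    pairs = {}
    for number in range(240, 248):
        excl, incl = f"f{number}", f"f{number + 20}"
        if excl in line or incl in line:
            pairs[excl] = (line.get(excl), line.get(incl))
    return pairs

# Illustrative line with made-up values.
example_pairs = price_pairs({"f204": "3", "f240": "10.00", "f260": "11.50"})
```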

AGNT Structure

This packet is a subtype of DOCU and is added by agent programs that created or altered this DOCU structure. The AGNT block allows programs to see which programs and versions supplied information.

Field# | Name | Description | Example
f110 | Name | Short and friendly name of the Agent. | Suppliers_NZ
f111 | BuildDate | Date and time the code was built. Ideally this should be automatically inserted using preprocessor macros (__DATE__ " " __TIME__) if these exist in the language used by the agent. | 23-mar-2015 11:09:34
f112 | Version | Single increasing number containing the version of code. This must be a single number, not 10.3 style version numbering. | 1283
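
A consumer of AGNT blocks might check these rules before relying on the version for comparisons. The Python sketch below assumes an AGNT packet parsed into a dictionary keyed by field number, and assumes the build date follows the format of the example value (day, English month abbreviation, year, then time); the helper name is hypothetical.

```python
from datetime import datetime

def parse_agent(agnt):
    """Validate an AGNT block and return its name, build time and version."""
    version_text = str(agnt["f112"])
    # The version must be a single number, so "10.3" style values are rejected.
    if not version_text.isdigit():
        raise ValueError(f"version must be a single number, got {version_text!r}")
    # Assumed format, based on the example "23-mar-2015 11:09:34".
    built = datetime.strptime(agnt["f111"], "%d-%b-%Y %H:%M:%S")
    return {"name": agnt["f110"], "built": built, "version": int(version_text)}

info = parse_agent({"f110": "Suppliers_NZ", "f111": "23-mar-2015 11:09:34", "f112": "1283"})
```

Keeping the version as a single integer means two agents can be ordered with a plain numeric comparison.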