
Document AI

from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.parsers import DocAIParser


Document AI is a Google Cloud service that transforms unstructured data from documents into structured data, making it easier to understand, analyze, and consume. You can read more about it here: https://cloud.google.com/document-ai/docs/overview

First, you need to set up a GCS bucket and create your own OCR processor as described here: https://cloud.google.com/document-ai/docs/create-processor. The GCS_OUTPUT_PATH should be a path to a folder on GCS (starting with gs://), and the processor name should look like projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID. You can get it either programmatically or copy it from the Prediction endpoint section of the Processor details tab in the Google Cloud Console.

PROJECT = "PUT_SOMETHING_HERE"
GCS_OUTPUT_PATH = "PUT_SOMETHING_HERE"
PROCESSOR_NAME = "PUT_SOMETHING_HERE"
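The processor name has to follow the projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID pattern described above. A minimal sketch of assembling and sanity-checking it from its components (the values below are placeholders, not a real project or processor):

```python
# Placeholder components; substitute your own project number, location,
# and processor ID from the Google Cloud Console.
project_number = "123456789012"
location = "us"
processor_id = "abcdef0123456789"

# Assemble the fully qualified processor name.
processor_name = (
    f"projects/{project_number}/locations/{location}/processors/{processor_id}"
)

# A quick sanity check on the expected format before using it.
assert processor_name.startswith("projects/") and "/processors/" in processor_name
```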

Now, let's create a parser:

parser = DocAIParser(location="us", processor_name=PROCESSOR_NAME, gcs_output_path=GCS_OUTPUT_PATH)

Let's parse Alphabet's Q1 2023 earnings report from here: https://abc.xyz/assets/a7/5b/9e5ae0364b12b4c883f3cf748226/goog-exhibit-99-1-q1-2023-19.pdf. Copy it to your GCS bucket first, and adjust the path below accordingly.

blob = Blob(path="gs://vertex-pgt/examples/goog-exhibit-99-1-q1-2023-19.pdf")
docs = list(parser.lazy_parse(blob))

We'll get one document per page, 11 in total:

print(len(docs))
    11

You can run end-to-end parsing of blobs one by one. If you have many documents, though, it might be a better approach to batch them together, and perhaps even to decouple parsing from handling the parsing results.
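One way to batch documents is a small generic helper that splits a list of blobs into fixed-size groups before submitting them. This is only a sketch: the batch size of 50 is an arbitrary illustration, not a documented Document AI limit, and the commented usage assumes the `parser` and `blobs` from above.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive fixed-size batches from a sequence."""
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


# Usage sketch (assumes `parser` and a list of `blobs` as above):
# operations = []
# for batch in batched(blobs, 50):
#     operations.extend(parser.docai_parse(batch))
```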

operations = parser.docai_parse([blob])
print([op.operation.name for op in operations])
    ['projects/543079149601/locations/us/operations/16447136779727347991']

You can check whether operations are finished:

parser.is_running(operations)
    True
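Rather than checking by hand, you can poll until the operations complete. A hedged sketch of a generic polling loop; `check_running` stands in for `lambda: parser.is_running(operations)`, and the interval and timeout values are illustrative, not recommendations:

```python
import time
from typing import Callable


def wait_until_done(
    check_running: Callable[[], bool],
    poll_interval: float = 10.0,
    timeout: float = 600.0,
) -> None:
    """Block until check_running() returns False, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while check_running():
        if time.monotonic() > deadline:
            raise TimeoutError("parsing operations did not finish in time")
        time.sleep(poll_interval)


# Usage sketch (assumes `parser` and `operations` as above):
# wait_until_done(lambda: parser.is_running(operations))
# results = parser.get_results(operations)
```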

And when they're finished, you can parse the results:

parser.is_running(operations)
    False
results = parser.get_results(operations)
print(results[0])
    DocAIParsingResults(source_path='gs://vertex-pgt/examples/goog-exhibit-99-1-q1-2023-19.pdf', parsed_path='gs://vertex-pgt/test/run1/16447136779727347991/0')

And now we can finally generate Documents from the parsed results:

docs = list(parser.parse_from_results(results))
print(len(docs))
    11