Document transformers | 🦜️🔗 Langchain

📄️ Beautiful Soup

Beautiful Soup offers fine-grained control over HTML content, enabling specific tag extraction, removal, and content cleaning.
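
As a rough illustration of what "specific tag extraction, removal, and content cleaning" means, here is a minimal sketch that uses only Python's standard-library `html.parser` as a stand-in. It is not Beautiful Soup and not the LangChain transformer; the class `TagTextExtractor`, the function `extract_text`, and the default tag lists are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text inside wanted tags, skipping anything inside unwanted tags."""

    def __init__(self, wanted, unwanted):
        super().__init__()
        self.wanted = set(wanted)
        self.unwanted = set(unwanted)
        self.wanted_depth = 0    # > 0 while inside at least one wanted tag
        self.unwanted_depth = 0  # > 0 while inside at least one unwanted tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.unwanted:
            self.unwanted_depth += 1
        elif tag in self.wanted:
            self.wanted_depth += 1

    def handle_endtag(self, tag):
        if tag in self.unwanted and self.unwanted_depth:
            self.unwanted_depth -= 1
        elif tag in self.wanted and self.wanted_depth:
            self.wanted_depth -= 1

    def handle_data(self, data):
        # Content cleaning: collapse runs of whitespace into single spaces.
        text = " ".join(data.split())
        if text and self.wanted_depth and not self.unwanted_depth:
            self.chunks.append(text)


def extract_text(html, wanted=("p", "li"), unwanted=("script", "style")):
    """Return the cleaned text found inside the wanted tags of an HTML string."""
    parser = TagTextExtractor(wanted, unwanted)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script><ul><li>item</li></ul></body></html>"
)
print(extract_text(sample))
```

With the default settings, the heading, script, and style content are dropped, and only the text nested inside `p` and `li` elements is kept.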

📄️ Google Cloud Document AI

Document AI (DocAI) is a Google Cloud platform that transforms unstructured data from documents into structured data, making it easier to understand, analyze, and consume. You can read more about it at https://cloud.google.com/document-ai/docs/overview

📄️ Doctran Extract Properties

We can extract useful features of documents using the Doctran library, which uses OpenAI's function calling feature to extract specific metadata.

📄️ Doctran Interrogate Documents

Documents used in a vector store knowledge base are typically stored in narrative or conversational format. However, most user queries are in question format. If we convert documents into Q&A format before vectorizing them, we can increase the likelihood of retrieving relevant documents and decrease the likelihood of retrieving irrelevant ones.

📄️ Doctran Translate Documents

Comparing documents through embeddings has the benefit of working across multiple languages. "Harrison says hello" and "Harrison dice hola" will occupy similar positions in the vector space because they have the same meaning.

📄️ html2text

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text.

📄️ Nuclia Understanding API document transformer

Nuclia automatically indexes your unstructured data from any internal and external source, providing optimized search results and generative answers. It can handle video and audio transcription, image content extraction, and document parsing.

📄️ OpenAI Functions Metadata Tagger

It can often be useful to tag ingested documents with structured metadata, such as the title, tone, or length of a document, to allow for more targeted similarity search later. However, for large numbers of documents, performing this labelling process manually can be tedious.
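
To make the idea of tagging concrete, the following is a minimal sketch of the general pattern: a tagger function maps a document's text to structured fields, and those fields are merged into the document's metadata. Everything here is an assumption made for illustration. The `Document` dataclass, `tag_documents`, and `rule_based_tagger` are hypothetical names, and the rule-based tagger is a plain-Python stand-in for the model call that an automated tagger would make.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical minimal document type: text plus a metadata dictionary."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def tag_documents(
    docs: List[Document],
    tagger: Callable[[str], Dict[str, object]],
) -> List[Document]:
    """Return new documents whose metadata is extended with the tagger's output."""
    return [
        Document(d.page_content, {**d.metadata, **tagger(d.page_content)})
        for d in docs
    ]


def rule_based_tagger(text: str) -> Dict[str, object]:
    """Stand-in tagger (assumption): first non-empty line as title, coarse length."""
    stripped = text.strip()
    title = stripped.splitlines()[0].strip() if stripped else ""
    length = "short" if len(text.split()) < 50 else "long"
    return {"title": title, "length": length}


docs = [Document("Review of Dune\nA great film.", {"source": "a.txt"})]
tagged = tag_documents(docs, rule_based_tagger)
print(tagged[0].metadata)
```

Keeping the tagger as a plain callable means the rule-based stand-in could later be swapped for a function that asks a language model for the same fields, without changing how the metadata is attached. The original documents are left unmodified.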