๐๏ธ Etherscan Loader
Overview
๐๏ธ acreom
acreom is a dev-first knowledge base with tasks running on local markdown files.
๐๏ธ Airbyte CDK
Airbyte is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.
๐๏ธ Airbyte Gong
Airbyte is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.
๐๏ธ Airbyte Hubspot
Airbyte is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.
๐๏ธ Airbyte JSON
Airbyte is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.
๐๏ธ Airbyte Salesforce
Airbyte is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.
๐๏ธ Airbyte Shopify
Airbyte is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.
๐๏ธ Airbyte Stripe
Airbyte is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.
๐๏ธ Airbyte Typeform
Airbyte is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.
๐๏ธ Airbyte Zendesk Support
Airbyte is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.
๐๏ธ Airtable
* Get your API key here.
๐๏ธ Alibaba Cloud MaxCompute
Alibaba Cloud MaxCompute (previously known as ODPS) is a general purpose, fully managed, multi-tenancy data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.
๐๏ธ Apify Dataset
Apify Dataset is a scalable, append-only storage with sequential access, built for storing structured web scraping results, such as a list of products or Google SERPs, and then exporting them to various formats like JSON, CSV, or Excel. Datasets are mainly used to save the results of Apify Actors, serverless cloud programs for various web scraping, crawling, and data extraction use cases.
๐๏ธ ArcGIS
This notebook demonstrates the use of the langchain.document_loaders.ArcGISLoader class.
๐๏ธ Arxiv
arXiv is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
๐๏ธ AssemblyAI Audio Transcripts
The AssemblyAIAudioTranscriptLoader transcribes audio files with the AssemblyAI API and loads the transcribed text into documents.
๐๏ธ Async Chromium
Chromium is one of the browsers supported by Playwright, a library used to control browser automation.
๐๏ธ AsyncHtmlLoader
AsyncHtmlLoader loads raw HTML from a list of urls concurrently.
๐๏ธ AWS S3 Directory
Amazon Simple Storage Service (Amazon S3) is an object storage service.
๐๏ธ AWS S3 File
Amazon Simple Storage Service (Amazon S3) is an object storage service.
๐๏ธ AZLyrics
AZLyrics is a large, legal collection of lyrics that grows every day.
๐๏ธ Azure Blob Storage Container
Azure Blob Storage is Microsoft's object storage solution for the cloud. Blob Storage is optimized for storing massive amounts of unstructured data. Unstructured data is data that doesn't adhere to a particular data model or definition, such as text or binary data.
๐๏ธ Azure Blob Storage File
Azure Files offers fully managed file shares in the cloud that are accessible via the industry standard Server Message Block (SMB) protocol, Network File System (NFS) protocol, and Azure Files REST API.
๐๏ธ Azure Document Intelligence
Azure Document Intelligence (formerly known as Azure Forms Recognizer) is machine-learning
๐๏ธ BibTeX
BibTeX is a file format and reference management system commonly used in conjunction with LaTeX typesetting. It serves as a way to organize and store bibliographic information for academic and research documents.
๐๏ธ BiliBili
Bilibili is one of the most beloved long-form video sites in China.
๐๏ธ Blackboard
Blackboard Learn (previously the Blackboard Learning Management System) is a web-based virtual learning environment and learning management system developed by Blackboard Inc. The software features course management, customizable open architecture, and scalable design that allows integration with student information systems and authentication protocols. It may be installed on local servers, hosted by Blackboard ASP Solutions, or provided as Software as a Service hosted on Amazon Web Services. Its main purposes are stated to include the addition of online elements to courses traditionally delivered face-to-face and development of completely online courses with few or no face-to-face meetings
๐๏ธ Blockchain
Overview
๐๏ธ Brave Search
Brave Search is a search engine developed by Brave Software.
๐๏ธ Browserless
Browserless is a service that allows you to run headless Chrome instances in the cloud. It's a great way to run browser-based automation at scale without having to worry about managing your own infrastructure.
๐๏ธ ChatGPT Data
ChatGPT is an artificial intelligence (AI) chatbot developed by OpenAI.
๐๏ธ College Confidential
College Confidential gives information on 3,800+ colleges and universities.
๐๏ธ Concurrent Loader
Works just like the GenericLoader but concurrently for those who choose to optimize their workflow.
๐๏ธ Confluence
Confluence is a wiki collaboration platform that saves and organizes all of the project-related material. Confluence is a knowledge base that primarily handles content management activities.
๐๏ธ CoNLL-U
CoNLL-U is a revised version of the CoNLL-X format. Annotations are encoded in plain text files (UTF-8, normalized to NFC, using only the LF character as line break, including an LF character at the end of file) with three types of lines:
๐๏ธ Copy Paste
This notebook covers how to load a document object from something you just want to copy and paste. In this case, you don't even need to use a DocumentLoader, but rather can just construct the Document directly.
๐๏ธ CSV
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.
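The record-and-field structure described above can be sketched with Python's standard `csv` module. This is a minimal illustration, not the loader's implementation; the sample data and the "field: value" rendering are assumptions made for the example.

```python
import csv
import io

# Hypothetical sample data, used only for illustration.
raw = "name,team\nAda,Engineering\nGrace,Research\n"

# DictReader treats the first line as the header and yields one dict per record.
rows = list(csv.DictReader(io.StringIO(raw)))

# Render each record as "field: value" lines, giving one text blob per row.
texts = ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in rows]

for text in texts:
    print(text)
    print("---")
```

Each line of the file becomes one record, and each record can then be treated as its own unit of text downstream.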
๐๏ธ Cube Semantic Layer
This notebook demonstrates the process of retrieving Cube's data model metadata in a format suitable for passing to LLMs as embeddings, thereby enhancing contextual information.
๐๏ธ Datadog Logs
Datadog is a monitoring and analytics platform for cloud-scale applications.
๐๏ธ Diffbot
Unlike traditional web scraping tools, Diffbot doesn't require any rules to read the content on a page.
๐๏ธ Discord
Discord is a VoIP and instant messaging social platform. Users have the ability to communicate with voice calls, video calls, text messaging, media and files in private chats or as part of communities called "servers". A server is a collection of persistent chat rooms and voice channels which can be accessed via invite links.
๐๏ธ Docugami
This notebook covers how to load documents from Docugami. It provides the advantages of using this system over alternative data loaders.
๐๏ธ Dropbox
Dropbox is a file hosting service that brings everything (traditional files, cloud content, and web shortcuts) together in one place.
๐๏ธ DuckDB
DuckDB is an in-process SQL OLAP database management system.
๐๏ธ Email
This notebook shows how to load email (.eml) or Microsoft Outlook (.msg) files.
๐๏ธ Embaas
embaas is a fully managed NLP API service that offers features like embedding generation, document text extraction, document to embeddings and more. You can choose a variety of pre-trained models.
๐๏ธ EPub
EPUB is an e-book file format that uses the ".epub" file extension. The term is short for electronic publication and is sometimes styled ePub. EPUB is supported by many e-readers, and compatible software is available for most smartphones, tablets, and computers.
๐๏ธ EverNote
EverNote is intended for archiving and creating notes in which photos, audio and saved web content can be embedded. Notes are stored in virtual "notebooks" and can be tagged, annotated, edited, searched, and exported.
๐๏ธ example_data
1 items
๐๏ธ Microsoft Excel
The UnstructuredExcelLoader is used to load Microsoft Excel files. The loader works with both .xlsx and .xls files. The page content will be the raw text of the Excel file. If you use the loader in "elements" mode, an HTML representation of the Excel file will be available in the document metadata under the textashtml key.
๐๏ธ Facebook Chat
Messenger is an American proprietary instant messaging app and platform developed by Meta Platforms. Originally developed as Facebook Chat in 2008, the company revamped its messaging service in 2010.
๐๏ธ Fauna
Fauna is a Document Database.
๐๏ธ Figma
Figma is a collaborative web application for interface design.
๐๏ธ Geopandas
Geopandas is an open source project to make working with geospatial data in Python easier.
๐๏ธ Git
Git is a distributed version control system that tracks changes in any set of computer files, usually used for coordinating work among programmers collaboratively developing source code during software development.
๐๏ธ GitBook
GitBook is a modern documentation platform where teams can document everything from products to internal knowledge bases and APIs.
๐๏ธ GitHub
This notebook shows how you can load issues and pull requests (PRs) for a given repository on GitHub. We will use the LangChain Python repository as an example.
๐๏ธ Google BigQuery
Google BigQuery is a serverless and cost-effective enterprise data warehouse that works across clouds and scales with your data.
๐๏ธ Google Cloud Storage Directory
Google Cloud Storage is a managed service for storing unstructured data.
๐๏ธ Google Cloud Storage File
Google Cloud Storage is a managed service for storing unstructured data.
๐๏ธ Google Drive
Google Drive is a file storage and synchronization service developed by Google.
๐๏ธ Grobid
GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents.
๐๏ธ Gutenberg
Project Gutenberg is an online library of free eBooks.
๐๏ธ Hacker News
Hacker News (sometimes abbreviated as HN) is a social news website focusing on computer science and entrepreneurship. It is run by the investment fund and startup incubator Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity."
๐๏ธ Huawei OBS Directory
The following code demonstrates how to load objects from the Huawei OBS (Object Storage Service) as documents.
๐๏ธ Huawei OBS File
The following code demonstrates how to load an object from the Huawei OBS (Object Storage Service) as a document.
๐๏ธ HuggingFace dataset
The Hugging Face Hub is home to over 5,000 datasets in more than 100 languages that can be used for a broad range of tasks across NLP, Computer Vision, and Audio. They are used for a diverse range of tasks such as translation,
๐๏ธ iFixit
iFixit is the largest, open repair community on the web. The site contains nearly 100k repair manuals, 200k Questions & Answers on 42k devices, and all the data is licensed under CC-BY-NC-SA 3.0.
๐๏ธ Images
This covers how to load images such as JPG or PNG into a document format that we can use downstream.
๐๏ธ Image captions
By default, the loader utilizes the pre-trained Salesforce BLIP image captioning model.
๐๏ธ IMSDb
IMSDb is the Internet Movie Script Database.
๐๏ธ Iugu
Iugu is a Brazilian services and software as a service (SaaS) company. It offers payment-processing software and application programming interfaces for e-commerce websites and mobile applications.
๐๏ธ Joplin
Joplin is an open source note-taking app. Capture your thoughts and securely access them from any device.
๐๏ธ Jupyter Notebook
Jupyter Notebook (formerly IPython Notebook) is a web-based interactive computational environment for creating notebook documents.
๐๏ธ LarkSuite (FeiShu)
LarkSuite is an enterprise collaboration platform developed by ByteDance.
๐๏ธ Mastodon
Mastodon is a federated social media and social networking service.
๐๏ธ MediaWikiDump
MediaWiki XML Dumps contain the content of a wiki (wiki pages with all their revisions), without the site-related data. An XML dump does not create a full backup of the wiki database; the dump does not contain user accounts, images, edit logs, etc.
๐๏ธ MergeDocLoader
Merge the documents returned from a set of specified data loaders.
๐๏ธ mhtml
MHTML is used both for emails and for archived webpages. MHTML, sometimes referred to as MHT, stands for MIME HTML; it is a single file in which an entire webpage is archived. When a webpage is saved in MHTML format, the file contains HTML code, images, audio files, flash animation, etc.
๐๏ธ Microsoft OneDrive
Microsoft OneDrive (formerly SkyDrive) is a file hosting service operated by Microsoft.
๐๏ธ Microsoft PowerPoint
Microsoft PowerPoint is a presentation program by Microsoft.
๐๏ธ Microsoft SharePoint
Microsoft SharePoint is a website-based collaboration system, developed by Microsoft, that uses workflow applications, "list" databases, and other web parts and security features to empower business teams to work together.
๐๏ธ Microsoft Word
Microsoft Word is a word processor developed by Microsoft.
๐๏ธ Modern Treasury
Modern Treasury simplifies complex payment operations. It is a unified platform to power products and processes that move money.
๐๏ธ News URL
This covers how to load HTML news articles from a list of URLs into a document format that we can use downstream.
๐๏ธ Notion DB 1/2
Notion is a collaboration platform with modified Markdown support that integrates kanban boards, tasks, wikis and databases. It is an all-in-one workspace for notetaking, knowledge and data management, and project and task management.
๐๏ธ Notion DB 2/2
Notion is a collaboration platform with modified Markdown support that integrates kanban boards, tasks, wikis and databases. It is an all-in-one workspace for notetaking, knowledge and data management, and project and task management.
๐๏ธ Nuclia Understanding API document loader
Nuclia automatically indexes your unstructured data from any internal and external source, providing optimized search results and generative answers. It can handle video and audio transcription, image content extraction, and document parsing.
๐๏ธ Obsidian
Obsidian is a powerful and extensible knowledge base
๐๏ธ Open Document Format (ODT)
The Open Document Format for Office Applications (ODF), also known as OpenDocument, is an open file format for word processing documents, spreadsheets, presentations and graphics and using ZIP-compressed XML files. It was developed with the aim of providing an open, XML-based file format specification for office applications.
๐๏ธ Open City Data
Socrata provides an API for city open data.
๐๏ธ Org-mode
Org Mode is a document editing, formatting, and organizing mode, designed for notes, planning, and authoring within the free software text editor Emacs.
๐๏ธ Pandas DataFrame
This notebook goes over how to load data from a pandas DataFrame.
๐๏ธ Amazon Textract
Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Today, many companies manually extract data from scanned documents such as PDFs, images, tables, and forms, or through simple OCR software that requires manual configuration (which often must be updated when the form changes). To overcome these manual and expensive processes, Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort. You can quickly automate document processing and act on the information extracted, whether you're automating loans processing or extracting information from invoices and receipts. Textract can extract the data in minutes instead of hours or days.
๐๏ธ Polars DataFrame
This notebook goes over how to load data from a polars DataFrame.
๐๏ธ Psychic
This notebook covers how to load documents from Psychic. See here for more details.
๐๏ธ PubMed
PubMed® by The National Center for Biotechnology Information, National Library of Medicine comprises more than 35 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full text content from PubMed Central and publisher web sites.
๐๏ธ PySpark DataFrame Loader
This notebook goes over how to load data from a PySpark DataFrame.
๐๏ธ ReadTheDocs Documentation
Read the Docs is an open-sourced free software documentation hosting platform. It generates documentation written with the Sphinx documentation generator.
๐๏ธ Recursive URL Loader
We may want to load all URLs under a root directory.
๐๏ธ Reddit
Reddit is an American social news aggregation, content rating, and discussion website.
๐๏ธ Roam
ROAM is a note-taking tool for networked thought, designed to create a personal knowledge base.
๐๏ธ Rockset
Rockset is a real-time analytics database which enables queries on massive, semi-structured data without operational burden. With Rockset, ingested data is queryable within one second and analytical queries against that data typically execute in milliseconds. Rockset is compute optimized, making it suitable for serving high concurrency applications in the sub-100TB range (or larger than 100s of TBs with rollups).
๐๏ธ RSS Feeds
This covers how to load HTML news articles from a list of RSS feed URLs into a document format that we can use downstream.
๐๏ธ RST
A reStructured Text (RST) file is a file format for textual data used primarily in the Python programming language community for technical documentation.
๐๏ธ Sitemap
Extending WebBaseLoader, SitemapLoader loads a sitemap from a given URL, then scrapes and loads all pages in the sitemap, returning each page as a Document.
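The first step, reading page URLs out of a sitemap, can be sketched with the standard library. This is only an illustration under stated assumptions: the sitemap content is a hypothetical inline string rather than a fetched URL, `sitemap_urls` is a name invented here, and no scraping is performed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content; a real one would be fetched from a URL.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>"""

# Sitemap elements live in this XML namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list[str]:
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]


print(sitemap_urls(SITEMAP_XML))
```

Each returned URL would then be fetched and turned into one document per page.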
๐๏ธ Slack
Slack is an instant messaging program.
๐๏ธ Snowflake
This notebook goes over how to load documents from Snowflake.
๐๏ธ Source Code
This notebook covers how to load source code files using a special approach with language parsing: each top-level function and class in the code is loaded into a separate document. Any remaining top-level code outside the already loaded functions and classes is loaded into a separate document.
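The splitting idea can be sketched for Python sources with the standard `ast` module. This is an illustrative approximation, not the loader's actual parser; `split_top_level` and the sample source are hypothetical names and data introduced for the example.

```python
import ast

# Hypothetical source file content, used only for illustration.
SOURCE = '''\
import os

def greet(name):
    return f"hello {name}"

class Counter:
    def __init__(self):
        self.n = 0

print(greet("world"))
'''


def split_top_level(source: str) -> list[str]:
    """Return one segment per top-level function or class, plus one for the rest."""
    tree = ast.parse(source)
    segments = []
    remainder = []
    for node in tree.body:
        text = ast.get_source_segment(source, node)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segments.append(text)
        else:
            # Imports, assignments, and bare statements are collected together.
            remainder.append(text)
    if remainder:
        segments.append("\n".join(remainder))
    return segments


for segment in split_top_level(SOURCE):
    print(segment)
    print("-----")
```

Here the function and the class each become their own segment, and the import plus the trailing call end up together in a final segment.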
๐๏ธ Spreedly
Spreedly is a service that allows you to securely store credit cards and use them to transact against any number of payment gateways and third party APIs. It does this by simultaneously providing a card tokenization/vault service as well as a gateway and receiver integration service. Payment methods tokenized by Spreedly are stored at Spreedly, allowing you to independently store a card and then pass that card to different end points based on your business requirements.
๐๏ธ Stripe
Stripe is an Irish-American financial services and software as a service (SaaS) company. It offers payment-processing software and application programming interfaces for e-commerce websites and mobile applications.
๐๏ธ Subtitle
The SubRip file format is described on the Matroska multimedia container format website as "perhaps the most basic of all subtitle formats." SubRip (SubRip Text) files are named with the extension .srt, and contain formatted lines of plain text in groups separated by a blank line. Subtitles are numbered sequentially, starting at 1. The timecode format used is hours:minutes:seconds,milliseconds with time units fixed to two zero-padded digits and fractions fixed to three zero-padded digits (00:00:00,000). The fractional separator used is the comma, since the program was written in France.
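A minimal parsing sketch for this layout follows, assuming the usual SubRip structure of a sequence number, a `start --> end` timecode line in hours:minutes:seconds,milliseconds form, and one or more text lines per group. The function names and sample subtitles are hypothetical, and this is not the loader's implementation.

```python
import re
from datetime import timedelta

# Two-digit hours, minutes, seconds, then a comma and three-digit milliseconds.
TIMECODE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def parse_timecode(text: str) -> timedelta:
    """Convert a timecode such as 00:00:01,000 into a timedelta."""
    match = TIMECODE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a SubRip timecode: {text!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds, milliseconds=millis)


def parse_srt(content: str) -> list[tuple[int, timedelta, timedelta, str]]:
    """Split content on blank lines and return (index, start, end, text) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        index = int(lines[0])
        start, end = (parse_timecode(part) for part in lines[1].split("-->"))
        cues.append((index, start, end, "\n".join(lines[2:])))
    return cues


# Hypothetical sample subtitles.
SAMPLE = """1
00:00:01,000 --> 00:00:02,500
Hello there.

2
00:00:03,000 --> 00:00:04,250
General Kenobi.
"""

for cue in parse_srt(SAMPLE):
    print(cue)
```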
๐๏ธ Telegram
Telegram Messenger is a globally accessible freemium, cross-platform, encrypted, cloud-based and centralized instant messaging service. The application also provides optional end-to-end encrypted chats and video calling, VoIP, file sharing and several other features.
๐๏ธ Tencent COS Directory
This covers how to load document objects from a Tencent COS Directory.
๐๏ธ Tencent COS File
This covers how to load a document object from a Tencent COS File.
๐๏ธ TensorFlow Datasets
TensorFlow Datasets is a collection of datasets ready to use, with TensorFlow or other Python ML frameworks, such as Jax. All datasets are exposed as tf.data.Datasets, enabling easy-to-use and high-performance input pipelines. To get started see the guide and the list of datasets.
๐๏ธ 2Markdown
2markdown service transforms website content into structured markdown files.
๐๏ธ TOML
TOML is a file format for configuration files. It is intended to be easy to read and write, and is designed to map unambiguously to a dictionary. Its specification is open-source. TOML is implemented in many programming languages. The name TOML is an acronym for "Tom's Obvious, Minimal Language" referring to its creator, Tom Preston-Werner.
๐๏ธ Trello
Trello is a web-based project management and collaboration tool that allows individuals and teams to organize and track their tasks and projects. It provides a visual interface known as a "board" where users can create lists and cards to represent their tasks and activities.
๐๏ธ TSV
A tab-separated values (TSV) file is a simple, text-based file format for storing tabular data. Records are separated by newlines, and values within a record are separated by tab characters.
๐๏ธ Twitter
Twitter is an online social media and social networking service.
๐๏ธ Unstructured File
This notebook covers how to use the Unstructured package to load files of many types. Unstructured currently supports loading of text files, PowerPoints, HTML, PDFs, images, and more.
๐๏ธ URL
This covers how to load HTML documents from a list of URLs into a document format that we can use downstream.
๐๏ธ Weather
OpenWeatherMap is an open source weather service provider.
๐๏ธ WebBaseLoader
This covers how to use WebBaseLoader to load all text from HTML webpages into a document format that we can use downstream. For more custom logic for loading webpages, look at some child class examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader.
๐๏ธ WhatsApp Chat
WhatsApp (also called WhatsApp Messenger) is a freeware, cross-platform, centralized instant messaging (IM) and voice-over-IP (VoIP) service. It allows users to send text and voice messages, make voice and video calls, and share images, documents, user locations, and other content.
๐๏ธ Wikipedia
Wikipedia is a multilingual free online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system called MediaWiki. Wikipedia is the largest and most-read reference work in history.
๐๏ธ XML
The UnstructuredXMLLoader is used to load XML files. The loader works with .xml files. The page content will be the text extracted from the XML tags.
๐๏ธ Xorbits Pandas DataFrame
This notebook goes over how to load data from a xorbits.pandas DataFrame.
๐๏ธ Loading documents from a YouTube url
Building chat or QA applications on YouTube videos is a topic of high interest.
๐๏ธ YouTube transcripts
YouTube is an online video sharing and social media platform created by Google.