Google Drive
Google Drive is a file storage and synchronization service developed by Google.
This notebook covers how to load documents from Google Drive. Currently, only Google Docs are supported.
Prerequisites
- Create a Google Cloud project or use an existing project
- Enable the Google Drive API
- Authorize credentials for desktop app
pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
Instructions for ingesting your Google Docs data
By default, the GoogleDriveLoader expects the credentials.json file to be at ~/.credentials/credentials.json, but this is configurable using the credentials_path keyword argument. The same applies to token.json, whose location is set with token_path. Note that token.json will be created automatically the first time you use the loader.
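As a minimal sketch of the path handling described above, the helper below builds the two keyword arguments with pathlib. The helper name credential_kwargs is hypothetical, and the default token location (next to credentials.json) is an assumption inferred from "the same applies to token.json", not something this page states explicitly.

```python
from pathlib import Path
from typing import Dict, Optional

# Default directory stated in the text above.
DEFAULT_DIR = Path.home() / ".credentials"


def credential_kwargs(
    credentials_path: Optional[str] = None,
    token_path: Optional[str] = None,
) -> Dict[str, Path]:
    """Hypothetical helper: resolve both paths, falling back to the defaults."""
    return {
        "credentials_path": (
            Path(credentials_path).expanduser()
            if credentials_path
            else DEFAULT_DIR / "credentials.json"
        ),
        # Assumption: token.json defaults to the same directory.
        "token_path": (
            Path(token_path).expanduser()
            if token_path
            else DEFAULT_DIR / "token.json"
        ),
    }


kwargs = credential_kwargs(credentials_path="~/secrets/credentials.json")
print(kwargs["credentials_path"])
print(kwargs["token_path"])
```

The resulting dictionary could then be unpacked into the loader constructor, for example GoogleDriveLoader(folder_id=folder_id, **kwargs).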
GoogleDriveLoader can load from a list of Google Docs document ids or a folder id. You can obtain your folder and document id from the URL:
- Folder: https://drive.google.com/drive/u/0/folders/1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5 -> folder id is "1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5"
- Document: https://docs.google.com/document/d/1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw/edit -> document id is "1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw"
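If you would rather not copy ids by hand, the extraction from the two URL shapes shown above can be sketched with a regular expression. The function name extract_drive_id is hypothetical, and the patterns only cover the folder and document URL forms listed on this page.

```python
import re
from typing import Optional

# Patterns for the two URL shapes shown above; other URL forms are not handled.
_FOLDER_RE = re.compile(r"/folders/([A-Za-z0-9_-]+)")
_DOCUMENT_RE = re.compile(r"/document/d/([A-Za-z0-9_-]+)")


def extract_drive_id(url: str) -> Optional[str]:
    """Return the folder or document id embedded in a Drive URL, or None."""
    for pattern in (_FOLDER_RE, _DOCUMENT_RE):
        match = pattern.search(url)
        if match:
            return match.group(1)
    return None


folder_url = "https://drive.google.com/drive/u/0/folders/1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5"
doc_url = "https://docs.google.com/document/d/1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw/edit"

print(extract_drive_id(folder_url))
print(extract_drive_id(doc_url))
```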
from langchain.document_loaders import GoogleDriveLoader
loader = GoogleDriveLoader(
    folder_id="1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5",
    # Optional: configure whether to recursively fetch files from subfolders. Defaults to False.
    recursive=False,
)
docs = loader.load()
When you pass a folder_id, by default all files of type document, sheet and pdf are loaded. You can modify this behaviour by passing a file_types argument:
loader = GoogleDriveLoader(
    folder_id="1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5",
    file_types=["document", "sheet"],
    recursive=False,
)
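To make the short names concrete, here is an illustrative sketch of how "document", "sheet" and "pdf" could map to the MIME types listed later on this page. It is not the loader's own code: the mapping table and the resolve_file_types function are assumptions written only for illustration.

```python
from typing import List

# Assumed mapping from the short names used in file_types to MIME types.
SHORT_NAME_TO_MIME = {
    "document": "application/vnd.google-apps.document",
    "sheet": "application/vnd.google-apps.spreadsheet",
    "pdf": "application/pdf",
}


def resolve_file_types(file_types: List[str]) -> List[str]:
    """Hypothetical helper: turn short names (or full MIME types) into MIME types."""
    resolved = []
    for name in file_types:
        if name in SHORT_NAME_TO_MIME:
            resolved.append(SHORT_NAME_TO_MIME[name])
        elif name in SHORT_NAME_TO_MIME.values():
            # Already a full MIME type that we recognise.
            resolved.append(name)
        else:
            raise ValueError(f"Unsupported file type: {name!r}")
    return resolved


print(resolve_file_types(["document", "sheet"]))
```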
Passing in Optional File Loaders
When processing files other than Google Docs and Google Sheets, it can be helpful to pass an optional file loader to GoogleDriveLoader. If you pass in a file loader, that file loader will be used on documents that do not have a Google Docs or Google Sheets MIME type. Here is an example of how to load an Excel document from Google Drive using a file loader.
from langchain.document_loaders import GoogleDriveLoader
from langchain.document_loaders import UnstructuredFileIOLoader
file_id = "1x9WBtFPWMEAdjcJzPScRsjpjQvpSo_kz"
loader = GoogleDriveLoader(
    file_ids=[file_id],
    file_loader_cls=UnstructuredFileIOLoader,
    file_loader_kwargs={"mode": "elements"},
)
docs = loader.load()
docs[0]
You can also process a folder with a mix of files and Google Docs/Sheets using the following pattern:
folder_id = "1asMOHY1BqBS84JcRbOag5LOJac74gpmD"
loader = GoogleDriveLoader(
    folder_id=folder_id,
    file_loader_cls=UnstructuredFileIOLoader,
    file_loader_kwargs={"mode": "elements"},
)
docs = loader.load()
docs[0]
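The routing rule described in this section (native Google Docs and Sheets are handled directly, everything else goes to the file loader when one is supplied) can be sketched as a small dispatch function. The function choose_handler and its return labels are hypothetical; in particular, what the loader does for other files when no file loader is passed is not specified here, so the sketch only labels that case "default".

```python
# MIME types for native Google Docs and Google Sheets (both listed on this page).
GOOGLE_NATIVE_MIME_TYPES = {
    "application/vnd.google-apps.document",
    "application/vnd.google-apps.spreadsheet",
}


def choose_handler(mime_type: str, has_file_loader: bool) -> str:
    """Hypothetical dispatch: say which code path would handle a file."""
    if mime_type in GOOGLE_NATIVE_MIME_TYPES:
        return "builtin"  # Google Docs/Sheets never go through the file loader
    if has_file_loader:
        return "file_loader"  # e.g. UnstructuredFileIOLoader for an Excel file
    return "default"  # behaviour without a file loader is not covered here


xlsx = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
print(choose_handler("application/vnd.google-apps.document", True))
print(choose_handler(xlsx, True))
```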
Extended usage
An external component, langchain-googledrive, can manage the complexity of Google Drive.
It's compatible with langchain.document_loaders.GoogleDriveLoader and can be used in its place.
To be compatible with containers, the authentication uses the environment variable GOOGLE_ACCOUNT_FILE, which points to the credential file (for a user or a service account).
pip install langchain-googledrive
folder_id = "root"
# folder_id = "1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5"
# Use the advanced version.
from langchain_googledrive.document_loaders import GoogleDriveLoader
loader = GoogleDriveLoader(
    folder_id=folder_id,
    recursive=False,
    num_results=2,  # Maximum number of files to load
)
By default, all files with these MIME types can be converted to Document.
- text/text
- text/plain
- text/html
- text/csv
- text/markdown
- image/png
- image/jpeg
- application/epub+zip
- application/pdf
- application/rtf
- application/vnd.google-apps.document (GDoc)
- application/vnd.google-apps.presentation (GSlide)
- application/vnd.google-apps.spreadsheet (GSheet)
- application/vnd.google.colaboratory (Notebook colab)
- application/vnd.openxmlformats-officedocument.presentationml.presentation (PPTX)
- application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX)
It's possible to update or customize this list; see the documentation of GDriveLoader. The corresponding packages must be installed, however.
pip install unstructured
for doc in loader.load():
    print("---")
    print(doc.page_content.strip()[:60] + "...")
Customize the search pattern
All parameters compatible with the Google list() API can be set.
To specify the new pattern of the Google request, you can use a PromptTemplate(). The variables for the prompt can be set with kwargs in the constructor.
Some pre-formatted requests are provided (use {query}, {folder_id} and/or {mime_type}).
You can customize the criteria used to select the files. A set of predefined filters is provided:
| template | description |
| -------------------------------------- | --------------------------------------------------------------------- |
| gdrive-all-in-folder | Return all compatible files from a folder_id |
| gdrive-query | Search query in all drives |
| gdrive-by-name | Search file with name query |
| gdrive-query-in-folder | Search query in folder_id (and sub-folders if recursive=true) |
| gdrive-mime-type | Search a specific mime_type |
| gdrive-mime-type-in-folder | Search a specific mime_type in folder_id |
| gdrive-query-with-mime-type | Search query with a specific mime_type |
| gdrive-query-with-mime-type-and-folder | Search query with a specific mime_type and in folder_id |
loader = GoogleDriveLoader(
    folder_id=folder_id,
    recursive=False,
    template="gdrive-query",  # Default template to use
    query="machine learning",
    num_results=2,  # Maximum number of files to load
    supportsAllDrives=False,  # GDrive `list()` parameter
)
for doc in loader.load():
    print("---")
    print(doc.page_content.strip()[:60] + "...")
You can customize your pattern.
from langchain.prompts.prompt import PromptTemplate

loader = GoogleDriveLoader(
    folder_id=folder_id,
    recursive=False,
    template=PromptTemplate(
        input_variables=["query", "query_name"],
        template="fullText contains '{query}' and name contains '{query_name}' and trashed=false",
    ),  # Custom template to use
    query="machine learning",
    query_name="ML",
    num_results=2,  # Maximum number of files to load
)
for doc in loader.load():
    print("---")
    print(doc.page_content.strip()[:60] + "...")
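To see what kind of request string such a template produces, the sketch below renders the same pattern with plain str.format instead of PromptTemplate, so it runs without LangChain. The helper names are hypothetical, and the quote escaping is an assumption added for safety: Drive search queries expect single quotes inside values to be backslash-escaped, but whether the loader does this for you is not stated here.

```python
# Same pattern as the custom template above.
TEMPLATE = "fullText contains '{query}' and name contains '{query_name}' and trashed=false"


def escape_drive_value(value: str) -> str:
    """Escape backslashes and single quotes for use inside a quoted query value."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def render_query(template: str, **variables: str) -> str:
    """Hypothetical helper: fill the template with escaped variables."""
    escaped = {name: escape_drive_value(value) for name, value in variables.items()}
    return template.format(**escaped)


rendered = render_query(TEMPLATE, query="machine learning", query_name="ML")
print(rendered)
```

Printing the rendered string before running a search is a cheap way to check that every variable named in the template has a matching keyword argument.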
Modes for GSlide and GSheet
The parameter mode accepts different values:
- "document": return the body of each document
- "snippets": return the description of each file (set in metadata of Google Drive files).
The conversion to Markdown format can handle:
- bullet
- link
- table
- titles
The parameter gslide_mode accepts different values:
- "single": one document with <PAGE BREAK>
- "slide": one document per slide
- "elements": one document for each element
loader = GoogleDriveLoader(
    template="gdrive-mime-type",
    mime_type="application/vnd.google-apps.presentation",  # Only GSlide files
    gslide_mode="slide",
    num_results=2,  # Maximum number of files to load
)
for doc in loader.load():
    print("---")
    print(doc.page_content.strip()[:60] + "...")
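If you load a presentation in "single" mode and later want the pages separately, the content can be split on the page-break marker. This is a minimal sketch: the helper split_pages is hypothetical, and the exact marker text is assumed to be the literal <PAGE BREAK> mentioned in the list above.

```python
from typing import List

# Marker text assumed from the description of the "single" mode.
PAGE_BREAK = "<PAGE BREAK>"


def split_pages(page_content: str) -> List[str]:
    """Split a single-mode document into non-empty, stripped pages."""
    return [part.strip() for part in page_content.split(PAGE_BREAK) if part.strip()]


sample = "Title slide\n<PAGE BREAK>\nAgenda\n<PAGE BREAK>\nSummary"
for page in split_pages(sample):
    print("---")
    print(page)
```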
The parameter gsheet_mode accepts different values:
- "single": generate one document per line
- "elements": one document with a Markdown array and <PAGE BREAK> tags
loader = GoogleDriveLoader(
    template="gdrive-mime-type",
    mime_type="application/vnd.google-apps.spreadsheet",  # Only GSheet files
    gsheet_mode="elements",
    num_results=2,  # Maximum number of files to load
)
for doc in loader.load():
    print("---")
    print(doc.page_content.strip()[:60] + "...")
Advanced usage
All Google files have a 'description' field in their metadata. This field can be used to store a summary of the document or other indexed tags (see the method lazy_update_description_with_summary()).
If you use mode="snippets", only the description is used for the body. Otherwise, the description is available in metadata['summary'].
Sometimes, a specific filter can be used to extract information from the filename or to select files that match specific criteria. You can pass a filter for this.
Sometimes, many documents are returned. It's not necessary to hold all documents in memory at the same time: you can use the lazy versions of the methods to get one document at a time. It's also better to use a complex query in place of a recursive search, because with recursive=True a query must be applied for each folder.
import os

loader = GoogleDriveLoader(
    gdrive_api_file=os.environ["GOOGLE_ACCOUNT_FILE"],
    num_results=2,
    template="gdrive-query",
    # Skip files whose description contains the "#test" tag.
    filter=lambda search, file: "#test" not in file.get("description", ""),
    query="machine learning",
    supportsAllDrives=False,
)
for doc in loader.load():
    print("---")
    print(doc.page_content.strip()[:60] + "...")
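The filter callable and the one-at-a-time idea can be tried offline, without Drive access. The sketch below applies the same two-argument lambda to hand-written metadata dictionaries through a generator. The lazy_select function and the sample records are hypothetical stand-ins, not part of the library.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

FileInfo = Dict[str, Any]


def lazy_select(
    files: Iterable[FileInfo],
    keep: Callable[[Any, FileInfo], bool],
    search: Any = None,
) -> Iterator[FileInfo]:
    """Yield matching files one at a time instead of building a full list."""
    for file in files:
        if keep(search, file):
            yield file


# Hand-written stand-ins for Drive file metadata.
sample_files = [
    {"name": "ml-notes", "description": "intro #test"},
    {"name": "ml-paper", "description": "survey"},
    {"name": "untitled"},  # no description at all
]

# Same predicate as in the loader example above.
keep = lambda search, file: "#test" not in file.get("description", "")

for info in lazy_select(sample_files, keep):
    print(info["name"])
```

Using file.get("description", "") rather than file["description"] matters here, since the sketch assumes some records may lack the field.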