Skip to main content

Google Drive

Google Drive is a file storage and synchronization service developed by Google.

This notebook covers how to load documents from Google Drive. Currently, only Google Docs are supported.

Prerequisites​

  1. Create a Google Cloud project or use an existing project
  2. Enable the Google Drive API
  3. Authorize credentials for desktop app
  4. pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib

πŸ§‘ Instructions for ingesting your Google Docs data​

By default, the GoogleDriveLoader expects the credentials.json file to be ~/.credentials/credentials.json, but this is configurable using the credentials_path keyword argument. Same thing with token.json - token_path. Note that token.json will be created automatically the first time you use the loader.

GoogleDriveLoader can load from a list of Google Docs document ids or a folder id. You can obtain your folder and document id from the URL:

pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
from langchain.document_loaders import GoogleDriveLoader

API Reference:

loader = GoogleDriveLoader(
folder_id="1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5",
# Optional: configure whether to recursively fetch files from subfolders. Defaults to False.
recursive=False,
)
docs = loader.load()

When you pass a folder_id by default all files of type document, sheet and pdf are loaded. You can modify this behaviour by passing a file_types argument

loader = GoogleDriveLoader(
folder_id="1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5",
file_types=["document", "sheet"],
recursive=False
)

Passing in Optional File Loaders​

When processing files other than Google Docs and Google Sheets, it can be helpful to pass an optional file loader to GoogleDriveLoader. If you pass in a file loader, that file loader will be used on documents that do not have a Google Docs or Google Sheets MIME type. Here is an example of how to load an Excel document from Google Drive using a file loader.

from langchain.document_loaders import GoogleDriveLoader
from langchain.document_loaders import UnstructuredFileIOLoader
file_id = "1x9WBtFPWMEAdjcJzPScRsjpjQvpSo_kz"
loader = GoogleDriveLoader(
file_ids=[file_id],
file_loader_cls=UnstructuredFileIOLoader,
file_loader_kwargs={"mode": "elements"},
)
docs = loader.load()
docs[0]

You can also process a folder with a mix of files and Google Docs/Sheets using the following pattern:

folder_id = "1asMOHY1BqBS84JcRbOag5LOJac74gpmD"
loader = GoogleDriveLoader(
folder_id=folder_id,
file_loader_cls=UnstructuredFileIOLoader,
file_loader_kwargs={"mode": "elements"},
)
docs = loader.load()
docs[0]

Extended usage​

An external component can manage the complexity of Google Drive : langchain-googledrive It's compatible with the Μ€langchain.document_loaders.GoogleDriveLoader and can be used in its place.

To be compatible with containers, the authentication uses an environment variable Μ€GOOGLE_ACCOUNT_FILE` to credential file (for user or service).

pip install langchain-googledrive
folder_id='root'
#folder_id='1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5'
# Use the advanced version.
from langchain_googledrive.document_loaders import GoogleDriveLoader
loader = GoogleDriveLoader(
folder_id=folder_id,
recursive=False,
num_results=2, # Maximum number of file to load
)

By default, all files with these mime-type can be converted to Document.

  • text/text
  • text/plain
  • text/html
  • text/csv
  • text/markdown
  • image/png
  • image/jpeg
  • application/epub+zip
  • application/pdf
  • application/rtf
  • application/vnd.google-apps.document (GDoc)
  • application/vnd.google-apps.presentation (GSlide)
  • application/vnd.google-apps.spreadsheet (GSheet)
  • application/vnd.google.colaboratory (Notebook colab)
  • application/vnd.openxmlformats-officedocument.presentationml.presentation (PPTX)
  • application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX)

It's possible to update or customize this. See the documentation of GDriveLoader.

But, the corresponding packages must be installed.

pip install unstructured
for doc in loader.load():
print("---")
print(doc.page_content.strip()[:60]+"...")

Customize the search pattern​

All parameter compatible with Google list() API can be set.

To specify the new pattern of the Google request, you can use a PromptTemplate(). The variables for the prompt can be set with kwargs in the constructor. Some pre-formated request are proposed (use {query}, {folder_id} and/or {mime_type}):

You can customize the criteria to select the files. A set of predefined filter are proposed: | template | description | | -------------------------------------- | --------------------------------------------------------------------- | | gdrive-all-in-folder | Return all compatible files from a folder_id | | gdrive-query | Search query in all drives | | gdrive-by-name | Search file with name query | | gdrive-query-in-folder | Search query in folder_id (and sub-folders if recursive=true) | | gdrive-mime-type | Search a specific mime_type | | gdrive-mime-type-in-folder | Search a specific mime_type in folder_id | | gdrive-query-with-mime-type | Search query with a specific mime_type | | gdrive-query-with-mime-type-and-folder | Search query with a specific mime_type and in folder_id |

loader = GoogleDriveLoader(
folder_id=folder_id,
recursive=False,
template="gdrive-query", # Default template to use
query="machine learning",
num_results=2, # Maximum number of file to load
supportsAllDrives=False, # GDrive `list()` parameter
)
for doc in loader.load():
print("---")
print(doc.page_content.strip()[:60]+"...")

You can customize your pattern.

from langchain.prompts.prompt import PromptTemplate
loader = GoogleDriveLoader(
folder_id=folder_id,
recursive=False,
template=PromptTemplate(
input_variables=["query", "query_name"],
template="fullText contains '{query}' and name contains '{query_name}' and trashed=false",
), # Default template to use
query="machine learning",
query_name="ML",
num_results=2, # Maximum number of file to load
)
for doc in loader.load():
print("---")
print(doc.page_content.strip()[:60]+"...")

API Reference:

Modes for GSlide and GSheet​

The parameter mode accepts different values:

  • "document": return the body of each document
  • "snippets": return the description of each file (set in metadata of Google Drive files).

The conversion can manage in Markdown format:

  • bullet
  • link
  • table
  • titles

The parameter gslide_mode accepts different values:

  • "single" : one document with <PAGE BREAK>
  • "slide" : one document by slide
  • "elements" : one document for each elements.
loader = GoogleDriveLoader(
template="gdrive-mime-type",
mime_type="application/vnd.google-apps.presentation", # Only GSlide files
gslide_mode="slide",
num_results=2, # Maximum number of file to load
)
for doc in loader.load():
print("---")
print(doc.page_content.strip()[:60]+"...")

The parameter gsheet_mode accepts different values:

  • "single": Generate one document by line
  • "elements" : one document with markdown array and <PAGE BREAK> tags.
loader = GoogleDriveLoader(
template="gdrive-mime-type",
mime_type="application/vnd.google-apps.spreadsheet", # Only GSheet files
gsheet_mode="elements",
num_results=2, # Maximum number of file to load
)
for doc in loader.load():
print("---")
print(doc.page_content.strip()[:60]+"...")

Advanced usage​

All Google File have a 'description' in the metadata. This field can be used to memorize a summary of the document or others indexed tags (See method lazy_update_description_with_summary()).

If you use the mode="snippet", only the description will be used for the body. Else, the metadata['summary'] has the field.

Sometime, a specific filter can be used to extract some information from the filename, to select some files with specific criteria. You can use a filter.

Sometimes, many documents are returned. It's not necessary to have all documents in memory at the same time. You can use the lazy versions of methods, to get one document at a time. It's better to use a complex query in place of a recursive search. For each folder, a query must be applied if you activate recursive=True.

import os
loader = GoogleDriveLoader(
gdrive_api_file=os.environ["GOOGLE_ACCOUNT_FILE"],
num_results=2,
template="gdrive-query",
filter=lambda search, file: "#test" not in file.get('description',''),
query='machine learning',
supportsAllDrives=False,
)
for doc in loader.load():
print("---")
print(doc.page_content.strip()[:60]+"...")