Get a Google Docs Body Text with Apps Script

4 Approaches to extracting text from a Google Doc with Apps Script

Retrieving a Google Docs body text is quite easy with the help of Google Apps Script.

Well, until it isn’t. Let me explain.

The Starter Doc

You can grab a copy of the starter doc here.

Starter Sheet

Your Google Doc should look a little like this:

Sample Google Docs containing all available elements including smart chips

(1) Get the Google Doc Body with Document App – Super Easy

My journey started off easy enough with a method I had run 100 times.

Lines 10-11: First, we call the Document App Service and retrieve the active document.

Lines 12-13: Next, all we need to do is call the getBody() method to retrieve the body data and from that, get the text of the body.

That’s it! Super simple.
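Here is a minimal sketch of what this simple approach might look like. The function name `getSimpleBodyText` and the optional `doc` parameter are my own assumptions for illustration; the parameter is only there so the function can also be run against a document other than the active one.

```javascript
/**
 * Returns the plain body text of a Google Doc.
 * @param {Object} [doc] - A DocumentApp Document. Defaults to the active document.
 * @returns {string} The body text, without any smart chip data.
 */
function getSimpleBodyText(doc = DocumentApp.getActiveDocument()) {
  const body = doc.getBody(); // The main body section of the document.
  return body.getText();      // Plain text only.
}
```

In a Doc-bound project you could call `console.log(getSimpleBodyText())` to inspect the result.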

You can give it a spin with the starter doc.


Life was so much simpler when Google’s Document App team first created their DocumentApp Service. There were no ‘Smart’ Chips to deal with for them. Just some text to extract from paragraph or table elements.

If you have been playing along, you would have probably noticed that any data with a Smart Chip is not displayed with the simple approach above.

Why? Well for good or ill Google’s Smart Chips were designed with the permanently-online workforce in mind. They provide a layout that can be hovered over to provide more information and, sometimes, imagery about a topic.  They come in a number of flavours that users can generally access by using the ‘@’ symbol or going to Insert > Smart Chips in the menu.

I see the utility, but be warned: if you find yourself moving documents between Google Docs, LibreOffice Writer and Microsoft Word, then you will be in for a headache. I digress.

So how do we get that chip data?


(2) Get Google Docs Body Data and Chip Data with Document App

As complete and enticing as this chapter title is, as of writing this the Google Docs development team has only created chip access support for a very, and I mean ‘very’, limited number of Smart Chips.

So what can we extract?

The Video

Get Google Doc Body Text Links and Smart Chips with Apps Script DocumentApp

Chips that Document App Can extract

  • Date chips: These are dates that you can statically add (which is a weird use for a chip, just add the date in as normal text, ya sausage!) and dynamic chips, like today’s date.
  • Person Chips: This lists the selected person’s email and name. When you hover over this bad boy, it will come up with their racy little avatar and some more actions too.
    Google Doc Person Chip

  • Rich Link Chips: Just like regular links in a Google Doc, a rich link will link you to another Google Workspace Document. It won’t link you to external links like a website and it obviously can’t be read by the .getText() method. Further, it cannot even be read by calling the old trusty getLinkUrl() method (more on this later). Plus, it pretty much displays the same data when you hover over it as a normal link. So what’s the benefit? Well, you get a little document icon at the start and Google Docs will aggressively prompt you to convert any URL to a link. So I guess you are saving time writing a label for the link. Do yourself a favour and stick with the old-school links.

The rest of the chips you can’t extract. Some for good reason, like the very interactive chips, but there are also chips that should be retrievable but for which no methods have been implemented by Google yet.

What’s the message? Try to avoid Smart Chips in Google Workspace if you are scripting or want to download the file and use it with other software. Avoid the enticing walled garden where you can.

What about extracting URLs from Normal Links?

While we are extracting a limited subset of the smart chips (Yeah, still disgruntled), why don’t we also grab the links from the standard text and add them to our returned text?

Time to look at the code.

The Code

Note, that I have set the code up in a way that you should be able to easily remove any of the JavaScript switch cases that you don’t need.

I’ve also left a bunch of the console.logs() in the script to help you see what is going on. You can always comment them out or delete them as needed.

Finally, the function used here has a returning statement so you can implement this in your project, though you may wish to abstract the document id selection to a parameter if you are not working within a Google Doc-bound project.

If you have found the tutorial helpful, why not shout me a coffee ☕? I'd really appreciate it.

Code Breakdown

Function Setup

The function getBodyWithChips() returns a text string containing the retrieved data from the file.

We will need to iterate over a number of different elements to access their text data.

Line 12: Sets the text data. This will be appended to as we find the data in the Google Doc.

Next, I have added an internal callback method in the function to help iterate over the different element types in the Google Doc file. More on this later. For now, let’s jump to the bottom of the function.

Line 156-158 – Here we retrieve the DocumentApp body object in the same way we did in our simple example at the beginning of this tutorial.

Line 161 – Next we call the elementIterator() function. This function takes our Document App body class object as its initial parameter.

Line 164 – Finally, the text is returned to be used in your calling function.

The elementIterator() function

This function is our driving callback function that will scan each element type and try to retrieve the text from each element.

The function takes the DocumentApp.Body class object which is based on a standard element interface for Google Doc elements. There is no ‘interface’ type for text completion, so I have just added the ‘Body’ class object here, but in essence, the class object type will change depending on the element being called during the callback.

The function will return a text string back to the containing function once all callbacks are complete.

Line 22 – Upon each iteration, we will need to retrieve the number of child elements in the containing (parent) element using the getNumChildren function.

Line 25-48 – We then iterate over the number of children, first collecting the child element object with getChild() and then extracting the element type with getType().

Line 30 – Using the type as our condition, we use a JavaScript switch statement to determine our case to run for the selected element.

Elements can be accurately retrieved using the ElementType enumerator for each case as you can see in the example.

Paragraph Element

If an element is a paragraph then it will not contain any text directly but its children will. This means that we will send the paragraph element back to elementIterator(), passing in the paragraph as its argument.

While we could call the getText() method at this point, we would not be able to extract any URLs from the text.

List Item Element

Similar to the paragraph element, a list item will contain child elements with text in them. Here we pass the list item element back into our callback function.

Table Element

The direct child of a table element is a table row. We will need to feed this table element back to our callback to iterate over the table’s rows.

Table Row Element

Each table row contains a list of table cell child elements. Again, we must feed the elementIterator() callback the table row to extract the table cell.

Table Cell Element

A table cell element can contain paragraphs, list elements, other table elements and any other number of element types. As such we need to send this back to our elementIterator() callback to dig deeper into the element tree to extract our text.
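The container cases above all follow the same pattern, which can be sketched as a small recursive walker. This is a simplified illustration rather than the tutorial's full function: the name `walkElements`, the `types` parameter (added so the enumerator can be swapped for a stand-in) and the `[UNSUPPORTED ELEMENT]` marker are my own assumptions, and the text case here uses the plain `getText()` shortcut without any link extraction.

```javascript
/**
 * Recursively collects text from a DocumentApp container element.
 * @param {Object} parent - Body, Paragraph, ListItem, Table, TableRow or TableCell.
 * @param {Object} [types] - The ElementType enumerator (assumed default).
 * @returns {string} The collected text.
 */
function walkElements(parent, types = DocumentApp.ElementType) {
  let text = '';
  const numChildren = parent.getNumChildren();

  for (let i = 0; i < numChildren; i++) {
    const child = parent.getChild(i);

    switch (child.getType()) {
      case types.PARAGRAPH:
      case types.LIST_ITEM:
        // The text lives in the children; end each block with a new line.
        text += walkElements(child, types) + '\n';
        break;
      case types.TABLE:
      case types.TABLE_ROW:
      case types.TABLE_CELL:
        // Dig deeper: table > row > cell > paragraphs, lists or more tables.
        text += walkElements(child, types);
        break;
      case types.TEXT:
        text += child.asText().getText();
        break;
      case types.UNSUPPORTED:
        text += '[UNSUPPORTED ELEMENT]'; // Hypothetical warning marker.
        break;
      default:
        break; // Ignore anything we have not written a case for.
    }
  }
  return text;
}
```

Because every container is handled by the same function, removing a case you don't need is just a matter of deleting its `case` line.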

Text Element

Finally! This is what we are trying to extract.

If you have no intention of extracting any potential links from the document then you may replace the contents of this case with a much simpler script:

Extracting both the text and its associated URL while correctly maintaining its associated position is a little tricky.

Lines 60-63: Here we first retrieve the element data by defining the element type as text (asText()) and then get the text from the text element using the getText() method.

Text Element Attributes and Indices

A string of text in Google Docs is separated by its attributes. An attribute might be its font-weight, colour, type or whether it contains a link. In DocumentApp, the text is split at an array of character indices, where each index marks the start of a run with its own unique set of attributes.

This means that we need to iterate over each index position and check if there is a link attribute in it. If there is we will extract it and add it to our text.

For our example, we will store link data like markdown links:

[name]{url}

Line 66: First we need to set an offset for the text string we will build, because the end result will be longer than the original text string once we add in the URLs.

Lines 67-68 Next, we can iterate over the indices and store the start position value as a convenient variable.

Line 69: Then we will call the getLinkUrl() method. This method takes an index position as an argument. It will return either a URL, if one is found, or null.

Line 71: We can now check if a link was found and, if so, store it with its associated text in our string.

Lines 73-74: Before we add the URL to our text we need to keep a record of the previous length of our string so we can later work out the offset. We will also need to check if the current index is the last index with a boolean marker.

Lines 75-77: To get the link text we will need to determine where the text end is located (we already have its starting position). The end of the range will be the next value in the indices array or, if we are at the last index, the total length of the string.

Because we are adding to the text with our URL we will also need to add our offset values to it.

Lines 79-83: Next, we will add our markdown wrapping characters and the URL to the current string. We will do this with the help of the JavaScript substring() method. This will return a result that looks a little like this:

[my website]{https://yagisanatode.com}

Lines 85-89: Now that we have our text with our link we can add it to the remainder of the extracted text. If we are at the end of the text, we just need to add the new link and text, but if we are somewhere in the middle, we will also need to append the remaining text that we will have to check for links during the next iteration. However, first we need to update the offset with the length of the new text string minus the original string length.

Line 93: With the text and any links added, we can now add the text to our main text string variable.

Lines 96-98: Finally, if we are at the end of a paragraph, we will also need to generate a new line.
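To make the index logic easier to follow, here is a stripped-down sketch of the same idea as a pure function. Note that it is not the tutorial's exact code: instead of tracking an offset while editing one string in place, it rebuilds the string segment by segment, which avoids the offset bookkeeping altogether. The function name `textWithLinks` and its three parameters are assumptions made for this illustration.

```javascript
/**
 * Rebuilds a text string with any links written as [label]{url}.
 * @param {string} text - The full text of the Text element.
 * @param {number[]} indices - Start positions of each attribute run.
 * @param {function(number): (string|null)} getLinkUrl - Returns the URL at an index, or null.
 * @returns {string} The text with links inlined.
 */
function textWithLinks(text, indices, getLinkUrl) {
  // Keep any text that sits before the first attribute index.
  let out = text.substring(0, indices.length ? indices[0] : text.length);

  indices.forEach((start, i) => {
    // A run ends where the next one starts, or at the end of the text.
    const isLastIndex = i === indices.length - 1;
    const end = isLastIndex ? text.length : indices[i + 1];
    const segment = text.substring(start, end);

    const url = getLinkUrl(start);
    out += url ? `[${segment}]{${url}}` : segment;
  });
  return out;
}
```

Inside the text case, this could be called with something like `textWithLinks(t.getText(), t.getTextAttributeIndices(), (i) => t.getLinkUrl(i))`, where `t` is the result of `asText()`. One caveat of this simplified version: a single link whose label changes formatting part-way through (say, one bold word) will come out as two adjacent links with the same URL.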

Date Element
Google Doc Date Smart Chip

This is a fairly new method added by the Google Document App dev team and the first of our Smart Chip elements.

Line 105: As we discovered, we can’t extract Date chips using the asText() element interface method. However, when we do find a date we can call the asDate() method to get a Date class object.

Lines 106-108: This class can retrieve a number of helpful date details:

  • getDisplayText(): The currently displayed or rendered date text.
  • getTimestamp(): The actual date-time stamp containing the timezone.
  • getLocale(): The format type that is based on the selected locale.

For the purpose of this tutorial, I have added in all the available options here. Feel free to remove what you won’t need for your own project.

Line 109: Finally we append our text variable with each of these returned date results. Again, change this string to display what you need, but the tutorial version will return something like this:

“DATE[Display: 27 Dec 2024, Timestamp: Fri Dec 27 2024 19:00:00 GMT+0700 (Indochina Time), Locale: en-GB]”
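As a rough sketch, the date case could be pulled out into a small helper like the one below. The helper name `formatDateChip` is hypothetical, and it expects the object returned by `asDate()`; the output layout simply mirrors the sample string above.

```javascript
/**
 * Formats a Date smart chip as a single text string.
 * @param {Object} dateChip - The object returned by element.asDate().
 * @returns {string} e.g. DATE[Display: ..., Timestamp: ..., Locale: ...]
 */
function formatDateChip(dateChip) {
  const display = dateChip.getDisplayText(); // The text as rendered in the Doc.
  const timestamp = dateChip.getTimestamp(); // The underlying date-time value.
  const locale = dateChip.getLocale();       // The locale used for the format.
  return `DATE[Display: ${display}, Timestamp: ${timestamp}, Locale: ${locale}]`;
}
```

Drop whichever of the three values you don't need from the template string.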

Rich Link Element
Google Doc Rich Text Smart Chip

The rich link element represents a link to a Google Drive file or a YouTube video, or some other non-specific Google resource that the developers were vague about (Bad! Bad documentation writers. Big smacks!).

Lines 116-119: Similar to the date smart chip method, the rich link method can retrieve a few useful bits of text, once we call the asRichLink() class, like:

  • getTitle(): The title of the file or video.
  • getUrl(): The URL link to the resource.
  • getMimeType(): The file type of the resource.

Line 120: I have added all three text options here for this tutorial. Feel free to remove what you don’t need and display them within the string how you please.

Here is the resulting text based on the example in the image at the start of this chapter:

“(https://docs.google.com/spreadsheets/u/0/d/1lfb1u–sheetlink–8iLg2TEQ/edit) type: application/vnd.google-apps.spreadsheet”
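A helper for this case might look something like the sketch below. The name `formatRichLink` and the exact output layout are my own assumptions (the layout reuses the [label]{url} convention from the text links), and the MIME type is only appended when one is returned.

```javascript
/**
 * Formats a Rich Link smart chip as a single text string.
 * @param {Object} richLink - The object returned by element.asRichLink().
 * @returns {string} e.g. [Title]{https://...} type: application/...
 */
function formatRichLink(richLink) {
  const title = richLink.getTitle();
  const url = richLink.getUrl();
  const mimeType = richLink.getMimeType();

  let text = `[${title}]{${url}}`;
  if (mimeType) {
    text += ` type: ${mimeType}`; // Skipped when no MIME type is available.
  }
  return text;
}
```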

Person Element
Google Doc Person Chip

Our final available chip is the Person chip. As you can see from the image above, this chip displays information about the selected person, including their account name, email and avatar image, along with some actions like sending emails, starting a Google Chat or Google Meet, or scheduling something on their calendar.

Lines 123-132: The asPerson() class allows us to access the following text data:

  • getEmail(): The person’s email address.
  • getName(): The person’s account name.

You can see in our example, I’ve added the result as a markdown link format:

[${name}]{${email}}

Resulting in this for our current example:

[tester account]{tester@yagisanatode.com}
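Sketched as a standalone helper, the person case could look like this. The name `formatPersonChip` is hypothetical, and falling back to the email when no name is returned is my own defensive assumption rather than something taken from the tutorial code.

```javascript
/**
 * Formats a Person smart chip as [name]{email}.
 * @param {Object} person - The object returned by element.asPerson().
 * @returns {string} The person in the tutorial's link-style format.
 */
function formatPersonChip(person) {
  const email = person.getEmail();
  const name = person.getName() || email; // Assumed fallback if no name is set.
  return `[${name}]{${email}}`;
}
```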

Table of Contents Element

We can also extract the table of contents data from the Google Doc with the .asTableOfContents() class.

Lines 134-141: Here, we are just going to extract the text with the getText() method. While we could retrieve the internal page links too this would not be useful for our body text display.

Unsupported Element

Lines 143-147: Finally, we may hit a chip that we can’t access, like a drop-down result, variable, voting chip result, or some other chip that we should probably be able to access but that currently has no API class. We can indicate this by identifying the element range as an unsupported element with the DocumentApp.ElementType.UNSUPPORTED enumerator.

I had hoped that the unsupportedElement() class had a type ID or something at least to explain what the element was, but alas it didn’t so there was no point calling the class methods. Instead, we just add a warning text explaining that the current element range is not supported.


Why no Equation Element?

I guess, similar to the unsupported element we could have added the equation warning text too or just leave that section out completely as I have done.

For equations with special characters, there is no way to extract the character data, unfortunately. As such, this was left out. I did, however, keep the formula in there to illustrate that it could not be read.

A little hope for the future (Code Snippets and Variables)

I noticed that the Google Docs developers have already included the enumerators for CODE_SNIPPET and VARIABLE in their getType() method. So perhaps, in the near future, I can update this page.

The Text Results

This time our text string results look like this:

Extracted text results from Google Docs using the DocumentApp element iterator approach to include some smart chips

As you can see we now have the smart chip data displayed for those chip classes available to us in Document App. Both regular paragraphs and table paragraphs are displayed. Further, the standard links are now showing the URLs in their correct locations. The table of contents is also displayed at the top of the page. We can even see emojis displaying correctly.

This is probably the most complete approach currently available, but what if we want to extract the text from other chips?

(3) The OCR approach to text extraction

Frankly, this approach just feels weird in the context of the Google Workspace ecosystem. In this approach we:

  1. Convert the document data to a blob (typically a PDF).
  2. This is then uploaded through the Drive API, which reads it using OCR (Optical Character Recognition).
  3. The recognised text is then fed to a fresh Google Doc file.
  4. The body text is then read again using getBody().getText().
  5. Once read the Google Doc file is deleted.

A very roundabout way of doing things.

The benefit of this approach is that we can now read ‘most’ of the text in the Chip that is currently on display.

The Code

As you can see below, the code is much simpler. And indeed, I borrowed some of this code from a previous tutorial of mine that you can explore here:

Google Apps Script: Extract Specific Data From a PDF and insert it into a Google Sheet (Updated Feb 2024)

Before running the code you will need to add Google Drive Advanced Service Version 3. Click Services > select Drive API > check if Version 3 is selected > Click Add.

Code Breakdown

Line 12: First, we extract the blob data from the active document. A blob is simply a chunk of binary file data (here, the exported document) held in memory along with its name and content type.

Lines 14-24: Next, we need to create a new Google Drive file using Drive API advanced service. We will do this with the file.create method that takes a resource object, which will be our MIME type of Google Doc, and an options object, which will be the OCR language set to English.

What’s occurring here is that by setting the type to a Google Doc and declaring the OCR language, Google Drive API will look at the blob data as an image (typically a PDF) and try to read the text visually before putting the read data into the file. This means that the code for the smart chips will not be read; rather, the text of the chips will be read, well, in most cases.

Lines 27-28: Here we grab the newly created Google Doc and read the body text. You could add a return statement with the text variable if you need to draw this into another function.

Line 32: Once we have used and abused our new Google Doc, we trash it like yesterday’s newspaper using the file.remove method.
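Put together, the steps above could be sketched roughly like this. It assumes the Drive API advanced service (version 3) has been added as described earlier. The function name `getBodyTextViaOcr`, the temporary file name and the three optional parameters are assumptions of mine; the parameters default to the real Apps Script services and are only there so the function can be tried with stand-ins.

```javascript
/**
 * Reads the visible text of a Google Doc by round-tripping it through OCR.
 * @param {Object} [doc] - The source document. Defaults to the active document.
 * @param {Object} [drive] - The Drive advanced service (v3).
 * @param {Object} [documentApp] - The DocumentApp service.
 * @returns {string} The body text of the temporary OCR copy.
 */
function getBodyTextViaOcr(
  doc = DocumentApp.getActiveDocument(),
  drive = Drive,
  documentApp = DocumentApp
) {
  const blob = doc.getBlob(); // Binary export of the document.

  const resource = {
    name: 'TEMP_OCR_COPY', // Hypothetical temporary file name.
    mimeType: 'application/vnd.google-apps.document',
  };

  // Uploading the blob as a Google Doc with an OCR language triggers the visual read.
  const tempFile = drive.Files.create(resource, blob, { ocrLanguage: 'en' });

  try {
    return documentApp.openById(tempFile.id).getBody().getText();
  } finally {
    // Always clean up the temporary file, even if reading it fails.
    drive.Files.remove(tempFile.id);
  }
}
```

Wrapping the read in `try...finally` is a design choice of this sketch, so a failed read does not leave stray temporary files in your Drive.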

The text results

Extracted text results from Google Docs using OCR approach

The OCR approach does a pretty solid job of extracting what is being displayed on the page. It will even add ordered and unordered list symbols. It won’t, however, display emojis, as indicated by this fun little symbol (�), and it won’t capture many of the formula symbols, although it does capture a few different ones from the basic approach.

So what is our next option? Yeap, time to drown in Google JSON files.

Shhh… just let go…

(4) Retrieve Google Doc Body Text With Doc API Advanced Service

As my last resort, I braced myself to iterate over the bottomless pit of JSON response data spewed forth by the Docs REST API 🤢🤮.

Historically, the advanced service APIs are updated with more regularity than the Apps Script-only services. I guess this is because they can be retrieved by other programming languages and according to the snobs, Apps Script isn’t cool, “it’s just a low-code scripting language”.

Here we are limited to accessing the following smart chips:

  • Person
  • Rich link

However, we can also retrieve code snippets here as they are considered paragraphs in the Google API. So that’s a win.

The bummer is that there is no property to indicate that another chip type is present like we could with the Document App Unsupported Element so our displayed retrieved text will exclude the clues that something might be missing.

Let’s take a look!

Note that in the video tutorial of this portion of our investigation (when it comes out), we examine the response object in more detail and go over how I found the properties I needed to extract the text. It’s worth a watch.


The Code

The Docs API uses ‘tabs’ to refer to elements here which took a little getting used to.

First, ensure that you have added the Docs API advanced service to your Apps Script project:

  • Click Services from the sidebar.
  • Select Google Docs API.
  • Select Add.

Lines 1-8: We will use the test function to run our script in this example.

The getBodyTextAsString_chips() function

Line 20: The function takes a Google Doc ID as a parameter and returns the found body text as a string.

I’ve added internal private methods into this function for modularity, but you could also abstract them into their own functions and save them to their own page in your script.

Retrieve the JSON from the Google Docs REST API

Lines 28-48

To retrieve the body object data from the Docs API, we will use the Document.get method. Left unchecked, this will generate an unwieldy maze of JSON paths with a matching large file size. We don’t need that much bloat.

Fortunately, there are only a few properties that we need and we can use field masks, to request only what we need. This is done with the fields property. Each available property can be found in the document resource. From there you click on the object link to navigate to the nested properties of each item. We then put each of our required fields in a string.

For us, we need the Google Doc title, and then in the body of the Doc we want to get any text content or links found in paragraphs and tables along with the text and links from the person and rich link chip data.

So this object:

Can be represented like this:

As you can see, parentheses group the nested fields of a property (including the items of an array) and dots represent child properties.

Tables can be nested within tables. So we will need to create an iterator callback method to add more nested fields or we will get more bloat in our response object. My guess is that people won’t nest more than 6 tables within one table so let’s generate our table/paragraph field 6 times using our matryoshkaDoll_tablesAndParas() method.
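To illustrate the nesting idea, here is one way such a field mask could be generated. This is my own sketch and not the tutorial's matryoshkaDoll_tablesAndParas() method: the names `PARAGRAPH_FIELDS`, `nestedContentFields` and `buildFieldMask` are hypothetical, it writes every level with parentheses rather than dots, and the property names follow the Docs API document resource as I understand it.

```javascript
// Fields needed from each paragraph: text runs with links, person chips and rich links.
const PARAGRAPH_FIELDS =
  'paragraph(elements(' +
  'textRun(content,textStyle(link(url))),' +
  'person(personProperties(name,email)),' +
  'richLink(richLinkProperties(title,uri,mimeType))))';

/**
 * Builds the content fields for paragraphs plus `depth` levels of nested tables.
 * @param {number} depth - How many table levels to request.
 * @returns {string} The field selection for a content array.
 */
function nestedContentFields(depth) {
  if (depth <= 0) return PARAGRAPH_FIELDS;
  const inner = nestedContentFields(depth - 1);
  return `${PARAGRAPH_FIELDS},table(tableRows(tableCells(content(${inner}))))`;
}

/**
 * Builds the full field mask for the request.
 * @param {number} [depth=6] - Table nesting depth (6 matches the guess above).
 * @returns {string} The fields string.
 */
function buildFieldMask(depth = 6) {
  return `title,tabs(documentTab(body(content(${nestedContentFields(depth)}))))`;
}
```

Each extra level wraps another copy of the paragraph fields inside a table's cells, so the mask grows linearly with the depth you choose.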

 


I think there are now two ways that we can look at the body data. One is the old way by iterating over the document elements and the other is via Tabs. It looks like Google wants to convert Google Docs into a kind of website-looking beast where you can navigate document tabs in the same way you have tabs in your Google Sheets. Kinda cool, I guess.

Anyway, we will need to take this approach to save ourselves a rewrite from any future deprecation. Here we will add the includeTabsContent property to our request payload.

Outside of the world of the tutorial, it is probably a good idea to wrap this in a try-catch statement to handle for any request failures.
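A guarded version of the request might look something like this sketch. It assumes the Docs API advanced service has been added to the project. The function name `fetchDocJson`, the optional `docs` parameter (there so a stand-in service can be passed) and the choice to return null on failure are all my own assumptions.

```javascript
/**
 * Requests the document JSON from the Docs API, returning null on failure.
 * @param {string} docId - The Google Doc ID.
 * @param {string} fields - The field mask to limit the response.
 * @param {Object} [docs] - The Docs advanced service.
 * @returns {Object|null} The response object, or null if the request failed.
 */
function fetchDocJson(docId, fields, docs = Docs) {
  try {
    return docs.Documents.get(docId, {
      fields: fields,
      includeTabsContent: true, // Return the body under tabs[].documentTab.
    });
  } catch (err) {
    console.error(`Could not retrieve document ${docId}: ${err.message}`);
    return null;
  }
}
```

The calling function then only needs to check for null before digging into the response.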

Get the body property data only

Line 54: Our first means to simplify our navigation of the JSON response data is to extract the body tag.

First, traverse the property tree to the 'tabs' array. Then we can use the JavaScript find method to find the tab named "documentTab". From that, we will step to the body array property.

The Paragraph or body callback method

Line 58: First we set a text variable to collect our body text.

Lines 118-120: Jumping to the bottom of our function we now iterate over the body property. The body contains a tab property for each element type in the document like table and paragraphs.

We now call the paragraphOrTable()  callback method feeding it the current tab.

Is the tab a Paragraph or a Table?

On each iteration, we need to check if the current tab is a paragraph (Line 67) or table (Line 101).

Paragraphs

Each paragraph can contain one of three elements that have text we can retrieve:

  1. person (lines 74-80): The person property contains both the name of the person (props.name) and their email (props.email). We will add this to a markdown format to display in our body text.
  2. richLink (lines 81-88): Similarly, we can extract the title (props.title) and uri (props.uri) from the rich link properties along with the MIME file type (props.mimeType). Again, we will add this in a markdown format.
  3. textRun (lines 89-99): Text runs are defined as segments of text containing their own unique set of attributes like formatting or links. Here we will check if the current text has a link. If it does we will add it using the link markdown format. Otherwise, we will display the text.

Table

Lines 103-118

Tables can contain paragraphs or other tables, which may, indeed contain tables. Tables are the veritable world turtle of Google Docs.

This means that if we find one we need to:

  1. Iterate over each row array.
  2. Iterate over each cell array.
  3. Send the cell tab property back to paragraphOrTable() function to check if it is either a paragraph tab or another table, “What the hey?” (Terry Pratchett).
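The paragraph and table handling described above can be sketched as a pair of pure functions that work on the JSON response. This is a condensed illustration and not the tutorial's code: the names `structuralElementText` and `bodyTextFromDocJson` are hypothetical, as is the fallback to the email when a person has no name.

```javascript
/**
 * Extracts text from one item of a Docs API content array (paragraph or table).
 * @param {Object} element - A structural element from the response.
 * @returns {string} The extracted text with links and chips inlined.
 */
function structuralElementText(element) {
  let text = '';

  if (element.paragraph) {
    for (const part of element.paragraph.elements || []) {
      if (part.person) {
        const props = part.person.personProperties || {};
        text += `[${props.name || props.email}]{${props.email}}`;
      } else if (part.richLink) {
        const props = part.richLink.richLinkProperties || {};
        text += `[${props.title}]{${props.uri}} type: ${props.mimeType}`;
      } else if (part.textRun) {
        const { content = '', textStyle = {} } = part.textRun;
        const url = textStyle.link && textStyle.link.url;
        text += url ? `[${content}]{${url}}` : content;
      }
    }
  } else if (element.table) {
    // Rows > cells > content, where content may hold paragraphs or more tables.
    for (const row of element.table.tableRows || []) {
      for (const cell of row.tableCells || []) {
        for (const child of cell.content || []) {
          text += structuralElementText(child);
        }
      }
    }
  }
  return text;
}

/**
 * Finds the document tab in the response and joins the text of its body.
 * @param {Object} doc - The response from the Docs API request.
 * @returns {string} The body text.
 */
function bodyTextFromDocJson(doc) {
  const tab = (doc.tabs || []).find((t) => t.documentTab);
  const content = tab ? tab.documentTab.body.content || [] : [];
  return content.map(structuralElementText).join('');
}
```

Text runs in the response carry their own trailing new line characters, so no extra line breaks are added here. That also means a link sitting at the very end of a paragraph may carry the new line inside its label, which you may want to trim.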

The Text Results

Extracted text results from Google Docs using Doc API advanced service approach

As you can see, we have managed to successfully extract our text along with any related links. We have also extracted any people chips and rich link chip data. We have no indication from the body object where the other chips are so we cannot even mark them as missing. On the bright side, the code block is printed out and is conveniently marked by these icons (❎).

Conclusion

As you can see, there is no real single perfect approach to extracting the text from a Google Doc, only a ‘least bad’ option that may better fit your needs.

4 Approaches to extracting text from a Google Doc with Apps Script

I would really love to hear which approach you decided to use in your own project and how you are using it. I am sure other readers would gain some inspiration from this too. Go ahead and add a comment below.


Oh, and let me know if there have been any updates to the APIs that I have missed. I keep these posts up to date.

Create and Publish a Google Workspace Add-on with Apps Script Course

Need help with Google Workspace development?

Got something to solve bigger than Chat GPT?

I can help you with all of your Google Workspace development needs, from custom app development to integrations and security. I have a proven track record of success in helping businesses of all sizes get the most out of Google Workspace.

Schedule a free consultation today to discuss your needs and get started or learn more about our services here.

~ Yagi 🐐

 
