Google Apps Script: Get Google Doc Text Based on Reference Characters

Google Apps Script: Google Docs, DocumentApp. JavaScript: Spread syntax, Set, indexOf, substring

So here is the scenario: imagine you have a big Google Doc. You want to pull out a list of items that you have noticed always sit between two sets of characters. Maybe something like this:

  1. You want to grab all the quoted text in a story and you know that the quoted text is between two sets of quotation marks: “ ”.
  2. You want to grab citations or asides inside different braces, for example, [], {} or ().
  3. You are making a mail merge and you want to grab a specific list of words that the user put in, to be substituted based on special character identifiers, for example, {{name}}, {{phone}}.
  4. You want to grab all the websites in a Google Doc and you know they start with https:// and end with .gov.

This tutorial provides a simple how-to for doing this. Perhaps the code is exactly what you need for your project. We’ve set it up in a way that is easy to implement in your own project.

The Code

Quick use guide

The getDocItems() function lets you search a Google Doc, identified by the document’s ID, and find text between a set of identifying characters. It returns an array of the strings found between the identifying characters you chose.

You also have the option of keeping either the start search identifying characters, the end ones, both or neither.

The getDocItems() function takes two parameters:

  1. docID (String): This is the Google Doc ID for your document. You can find the ID in the document’s URL, between /d/ and /edit.

  2. identifier (object): The identifier is an object container for your parameters and instructions for the code to run. It contains the following keys:
    • start: A string of identifying characters you will use to find the start of the text. For example, `{{`. I use backticks (`) here because they are not commonly used in documents and allow us to easily add single or double quotation marks as our identifying characters without issues.
    • start_include: Set this to true if you want to include the identifying characters. Set to false if you don’t.
    • end: This is the string of identifying characters you will use to find the end of the text you want to look for. For example, `}}`.
    • end_include: Set this to true if you want to include the identifying characters. Set to false if you don’t.
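As an illustration, an identifier object for the mail merge case might look like the sketch below. The four keys are the ones listed above; the values are only an example, chosen here on the assumption that you want to keep the braces.

```javascript
// Example identifier: find everything between {{ and }} and keep the braces.
const identifier = {
  start: `{{`,          // characters that mark the start of the text
  start_include: true,  // keep the start characters in each result
  end: `}}`,            // characters that mark the end of the text
  end_include: true,    // keep the end characters in each result
};
```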

To add this script to your project, simply copy and paste in the code above. In your own function, call getDocItems(docID, identifier) with the required values. Then run your function.

Examples

Example 1: Extracting all the quoted speech text from a story

In this example, we want to extract all the quoted speech text from the following story.

We might want to use it for analysis in, say, a Google Sheet later.  For now, we will just log the results. Here is how we would run the getDocItems() function from our project:

On line 3, we reference the document we want to search by providing the document ID.

Then on line 4, we create our object keys and values. I want to search by starting double quotation marks (we called them the 66ers when I was a kid). Now, if you simply type a double quotation mark into the Google Apps Script editor, you may get a very plain, non-directional version. This won’t help you search for anything: the character needs to be exactly the same as the one in your Google Doc.

My recommendation is to then copy the starting quotation mark you want to search for in the Google Doc and paste it directly into the editor.

This will maintain the quotation direction keeping it as a 66er (Do I sound old saying that? I feel I do).

So that’s the start key done. We don’t want our saved data to include quotation marks. We’ll remove them by making the start_include value equal false.

Next, we want to indicate what characters we want to stop our search on. Again, we copy the closing (99er) quotation mark from the text, paste it between the backticks and set end_include to false as well.

In our results variable (line 10), we then pass our docID and identifier constants to our getDocItems() function. This will return a list of all the speech in the text as an array.

We then simply console.log this at the end. You could do much, much more with the data, I am sure.
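The walkthrough above can be sketched roughly as follows. The function name logQuotedSpeech and the placeholder ID are assumptions, and the sketch relies on getDocItems() already being in your project. It is laid out so that the document ID sits on line 3, the identifier object starts on line 4 and the results variable is on line 10.

```javascript
function logQuotedSpeech() {
  // Placeholder ID: replace it with the ID of your own Google Doc.
  const docID = "YOUR_DOC_ID";
  const identifier = {
    start: `“`,
    start_include: false,
    end: `”`,
    end_include: false,
  };
  const results = getDocItems(docID, identifier);
  console.log(results);
}
```

Remember that the two quotation marks here are the curly ones pasted from the document, not the straight ones the editor types by default.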

Let’s quickly run through two more examples now that we have the basics.

Example 2: Get all the data inside square brackets []

In this example, we want to find all the quoted costs in the travel blog to Ryudikkelf Spaceport (What?!). Take a look at the Google Doc below:

In the example above, we can see that the text we want is inside square brackets “[]”. We kinda like the brackets so we want them to be kept when we save the data in our array. Here is how we would set up our script:
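A minimal sketch of that setup, assuming getDocItems() is already in the project. The function name logBracketedCosts and the placeholder ID are illustrative only.

```javascript
function logBracketedCosts() {
  const docID = "YOUR_DOC_ID"; // placeholder: use your own document's ID
  const identifier = {
    start: `[`,
    start_include: true, // keep the opening bracket in each result
    end: `]`,
    end_include: true,   // keep the closing bracket in each result
  };
  const results = getDocItems(docID, identifier);
  console.log(results);
}
```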

Running this code will result in this array:

Example 3: Get location references for a mail merge

Imagine you have a document that you want to create multiple copies for. However, each copy has unique data in it like names, billing information, etc. Perhaps you want to set up a mail merge-type system for this.

You decide that you will identify each location that needs to change with double braces as the key for that location “{{ }}”. You could use getDocItems() to get these unique character sets and then find and replace them each time you duplicate a new document with Google Apps Script.

You would grab these key character sets the same way as the previous two examples. Take a look at the example Google Doc below:

When we run the code, you will have a list of unique character sets you might want to connect to headers in a Google Sheet to create your ‘Mail Merge’.

Here’s the script:
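As a rough sketch, and assuming you want to keep the double braces so that each key can be found and replaced later, the call might look like this. The function name logMergeKeys and the placeholder ID are hypothetical, and getDocItems() is expected to be in the project already.

```javascript
function logMergeKeys() {
  const docID = "YOUR_TEMPLATE_DOC_ID"; // placeholder: ID of the template document
  const identifier = {
    start: `{{`,
    start_include: true, // keep {{ so the key matches the text in the document
    end: `}}`,
    end_include: true,   // keep }} for the same reason
  };
  const mergeKeys = getDocItems(docID, identifier);
  console.log(mergeKeys);
}
```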

This is what the list would look like:

Code Breakdown

Want to know how getDocItems(docID, identifier) works? Read on, intrepid friend!

Parameters

getDocItems(docID, identifier) requires two parameters:

  • docID: A string containing the document’s ID.
  • identifier: An object of key-value pairs that contain the start and end reference characters and the choice to keep those characters in the saved array.

The function then returns an array of all the text found between the selected reference characters in the document.

Grabbing the text from the Google Doc

The first task is to grab all the text from the Google Doc. This is done by using Google Apps Script’s DocumentApp class.

We then use the openById() method to grab the document we want to work in. You can see that we use the docID from our parameters here.

Next, we grab the body of the document. The Body class allows us to access the document’s whole text, or just things like lists, tables of contents or tables. For us, we want to grab the body’s text. We do this on the next line of code with getText(). This simply grabs all the text and returns it to our script as a single string for us to work on.
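These three steps can be sketched as below. The helper name getDocText is hypothetical and is only there to keep the example self-contained; openById(), getBody() and getText() are the DocumentApp calls described above.

```javascript
// Returns the full body text of a Google Doc as one string.
function getDocText(docID) {
  const body = DocumentApp.openById(docID).getBody(); // open the doc and grab its body
  return body.getText();                              // all of the body's text as a string
}
```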

Check if search characters are to be included

Here, we want to confirm whether we need to include the search characters in the text we save in our array. Remember, back where we were creating the identifier object of key-value pairs, we asked if you wanted to include the start and end search characters. You chose either true (for sure, add em in!) or false.

We set two new variables to the start and end length. We use a ternary operator to do this. Each line basically says this: if start_include or end_include was marked as true, then give this variable the value of zero, otherwise give it the value of the total length of the search character string.
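The two ternaries might look like the sketch below. The wrapper getTrimLengths is a hypothetical helper used only to make the example runnable on its own; startLen and endLen are the variable names the tutorial uses later on.

```javascript
// Works out how many characters to trim from each end of a found match.
function getTrimLengths(identifier) {
  // Keeping the characters means trimming nothing; otherwise trim their full length.
  const startLen = identifier.start_include ? 0 : identifier.start.length;
  const endLen = identifier.end_include ? 0 : identifier.end.length;
  return { startLen, endLen };
}

const trims = getTrimLengths({
  start: "{{", start_include: false,
  end: "}}", end_include: true,
});
console.log(trims.startLen, trims.endLen); // 2 0
```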

Set up for the loop to find the reference characters

In a moment, we will search through the text for our start and end reference characters, but before we do, we need to do a little setup for the loop.

First, we are going to create a textStart variable and set it to zero. This will update after we find each set of reference characters.

We will then copy our document text into a doc variable so that we can update it as we go.

Finally, we will want to set up a variable to store our list of characters we find based on our search. This is done with docList.

Looping through the text to find the characters within our search set

while loop

The while loop keeps running for as long as textStart is greater than -1 and concludes once it is not.

Get the start text location

The textStart value changes in line 3 when we set its value to the index (location number) of the first set of characters we are searching for. This is done with the indexOf method which basically says that if there is an index of what we are searching for in our document, then give its location. So for example, if our start search is “{{” and our example text is this:

I am a {{hungry}} goat!

Then the index will be 7 characters in.
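A quick way to check this for yourself, using the example sentence. The variable names here are illustrative only.

```javascript
const sample = "I am a {{hungry}} goat!";

const found = sample.indexOf("{{");   // position of the first "{{"
const missing = sample.indexOf("[["); // "[[" never appears in the sample

console.log(found);   // 7
console.log(missing); // -1
```

indexOf counts from zero, so the seven characters of "I am a " occupy positions 0 to 6 and the first brace lands on 7.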

no more search items

If there are no characters in the doc with the identifier.start search characters then textStart will equal -1. If this is the case we don’t want to go any further, so we conclude the loop with a break (line 6). We check if textStart is equal to -1 with an if statement in line 5.

However, if the number is greater than -1, we want to proceed. We do this with our somewhat unnecessary, but syntactically pretty, else statement.
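Here is a stripped-down sketch of that loop shape, with the text extraction left out so that only the exit logic remains. The sample string and the passes counter are assumptions added for illustration.

```javascript
const start = "{{";
let doc = "{{a}} and {{b}}"; // stand-in for the document text
let textStart = 0;
let passes = 0;              // counts how many start markers we find

while (textStart > -1) {
  textStart = doc.indexOf(start);
  if (textStart === -1) {
    break;                   // no more start characters: leave the loop
  } else {
    passes++;
    // Move past this marker so the next pass finds the following one.
    doc = doc.substring(textStart + start.length);
  }
}

console.log(passes); // 2
```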

Get the end text location

On line 9, we create the variable textEnd. This variable uses the indexOf method again, this time to find the end search text. This gives us the start location of the end search characters. Going back to our example, it will look like this (where the bar is!):

I am a {{hungry|}} goat!

Which will be index 15.

We may want to include the search characters so we need to add them to our index. This is done by adding the length of the search characters to the indexOf location:

Which will set the index to 17 for our example, because there are two characters in our search:

I am a {{hungry}}| goat!
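The arithmetic for the example sentence can be checked with a short sketch. The names sample, end and endStart are illustrative; textEnd is the variable name used in the walkthrough.

```javascript
const sample = "I am a {{hungry}} goat!";
const end = "}}";

const endStart = sample.indexOf(end);   // where the end characters begin
const textEnd = endStart + end.length;  // just past the end characters

console.log(endStart); // 15
console.log(textEnd);  // 17
```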

Grabbing the text

On line 10, we grab the text or word within our search selection. We use substring to extract the range of characters in our doc string. substring takes two arguments, the start index and the end index, so we pass in our textStart and textEnd indexes here. In our little example above that would return:

{{hungry}}

let word = doc.substring(textStart,textEnd);

A brand new start

It seems terribly inefficient to iterate through the entire doc of text each time. Plus we don’t want to catch the same word set in an eternal loop. We need to change the start location of our doc. To do this, we simply update the doc using substring again commencing the start of doc now at the textEnd index.

doc = doc.substring(textEnd);

If you only put the start argument in a substring it will create a string from the start index all the way to the end of the string.
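Both uses of substring can be tried on the example sentence. The index values 7 and 17 are the ones worked out earlier for this sentence.

```javascript
let doc = "I am a {{hungry}} goat!";
const textStart = 7;
const textEnd = 17;

// Two arguments: from textStart up to, but not including, textEnd.
const word = doc.substring(textStart, textEnd);

// One argument: from textEnd to the end of the string.
doc = doc.substring(textEnd);

console.log(word); // {{hungry}}
console.log(doc);  // " goat!" (with the leading space)
```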

Update our list of found items

docList.push(word.substring(startLen,word.length - endLen));

Lastly, we want to push our newly found text into our docList. Before we do that, we trim it with our startLen and endLen, which removes the search characters if we asked for them to be left out and keeps them otherwise.
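To see what the trimming does, here is the same substring call applied to the example word with each combination of lengths. The helper trimWord is hypothetical and simply wraps the expression used in the push.

```javascript
// Trims startLen characters from the front and endLen characters from the back.
function trimWord(word, startLen, endLen) {
  return word.substring(startLen, word.length - endLen);
}

const word = "{{hungry}}";

console.log(trimWord(word, 2, 2)); // hungry      (both removed)
console.log(trimWord(word, 0, 0)); // {{hungry}}  (both kept)
console.log(trimWord(word, 2, 0)); // hungry}}    (only the end kept)
console.log(trimWord(word, 0, 2)); // {{hungry    (only the start kept)
```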

Return the array back

The final step of the function is to return our array of found text back to be used in our main function.

For my purposes, and I would imagine many of yours, you would only want a list of unique strings, with no duplicates. Thanks to Google Apps Script’s recent upgrade to the V8 runtime, we can now use a bunch of ES6 syntax. Sweet!

I wanted to try out the spread syntax and use the Set object to remove the duplicate elements in an array. How elegant does this look?!

return [...new Set(docList)];
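A small illustration with a made-up docList. A Set keeps only the first occurrence of each value and remembers insertion order, so the spread gives back the unique items in the order they were found.

```javascript
const docList = ["{{name}}", "{{phone}}", "{{name}}"]; // sample data only

const unique = [...new Set(docList)];

console.log(unique); // [ '{{name}}', '{{phone}}' ]
```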

Alternatively, if you do want to return duplicates in your docList simply return the docList:

return docList;

and delete out the other return line.
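Putting the steps of this breakdown together, a minimal sketch of the whole function could look like the following. It is assembled from the description in this section rather than copied from the original script, and it makes two assumptions of its own: the string logic is split into a hypothetical helper, extractBetween, so it can be tried without a Google Doc, and the end characters are searched for only after the start characters, with the loop stopping if a start has no matching end.

```javascript
// Pure string logic: find every piece of text between start and end characters.
function extractBetween(text, identifier) {
  const startLen = identifier.start_include ? 0 : identifier.start.length;
  const endLen = identifier.end_include ? 0 : identifier.end.length;

  let textStart = 0;
  let doc = text;
  const docList = [];

  while (textStart > -1) {
    textStart = doc.indexOf(identifier.start);
    if (textStart === -1) {
      break; // no more start characters
    } else {
      // Assumption: look for the end characters only after the start characters.
      const endIndex = doc.indexOf(identifier.end, textStart + identifier.start.length);
      if (endIndex === -1) break; // unmatched start: stop rather than guess

      const textEnd = endIndex + identifier.end.length;
      const word = doc.substring(textStart, textEnd);
      doc = doc.substring(textEnd); // continue from just after this match
      docList.push(word.substring(startLen, word.length - endLen));
    }
  }
  return [...new Set(docList)]; // unique items, in the order they were found
}

// Apps Script entry point: read the document text, then run the search on it.
function getDocItems(docID, identifier) {
  const docText = DocumentApp.openById(docID).getBody().getText();
  return extractBetween(docText, identifier);
}
```

Searching for the end characters after the start matters when both are the same character, such as a straight double quotation mark, or when an end character happens to appear earlier in the text than the first start.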

Conclusion

This simple tutorial should get you started with grabbing text from a Google Doc for other uses. You could certainly make a lot of changes to the code to introduce more search rules. Likewise, you could use regular expressions to search for some really cool stuff, but for a simple overview this should get you started.
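For the curious, here is one way the regular expression route might be sketched. This is not part of the tutorial's script; the function names escapeRegExp and extractWithRegExp are hypothetical, and the pattern simply captures the shortest run of text between the two sets of characters.

```javascript
// Escapes characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the text found between each start/end pair, without the pair itself.
function extractWithRegExp(text, start, end) {
  const pattern = new RegExp(
    escapeRegExp(start) + "([\\s\\S]*?)" + escapeRegExp(end),
    "g"
  );
  const found = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    found.push(match[1]); // the captured text between the two markers
  }
  return found;
}

console.log(extractWithRegExp("Fuel [10 credits], snacks [2 credits]", "[", "]"));
// [ '10 credits', '2 credits' ]
```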

What do you think you would use this on? I love hearing how these snippets and tutorials are applied. It’s inspiring. Please let me know in the comments below.

Happy coding.
