Google Apps Script: Google Docs, DocumentApp. JavaScript: spread syntax, Set, indexOf, substring
Here is the scenario: imagine you have a big Google Doc, and you want a list of the pieces of text that sit between two sets of characters. Maybe something like this:
- You want to grab all the quoted text in a story, and you know that the quoted text is between two quotation marks: “ ”.
- You want to grab citations or asides inside different braces, for example, [], {} or ().
- You are making a mail merge and you want to grab the specific list of words that the user has marked for substitution with special character identifiers, for example, {{name}}, {{phone}}.
- You want to grab all the websites in a Google Doc and you know they start with “https://” and end with “.gov”.
This tutorial provides a simple how-to for doing this. Perhaps the code is exactly what you need for your project. We’ve set it up in a way that is easy to implement in your own project.
The Code
    /* ####### GET DOCUMENT TEXT BASED ON REFERENCE CHARACTERS ######
     *
     * Creates an array of all the Google Doc document's text based on reference values.
     * This script requires a reference identifier start and end character set. The code will then
     * select the text inside the identifiers. It will either include the identifiers or not
     * depending on your selection.
     *
     * param {string} docID : The ID of the Google Doc, found in the URL.
     * param {object} identifier : An object containing the start and end identifiers to search and if they should be included in the returned results.
     *
     * ## identifier object set up example ##
     *
     * {
     *   start: `{{`, // << add your start identifying characters here.
     *   start_include: false, // << if you want the start identifier included change to true.
     *   end: `}}`, // << add your end identifying characters here.
     *   end_include: false // << if you want the end identifier included change to true.
     * };
     *
     * returns {array} : Returns array of strings of characters found within identifiers.
     *
     */
    function getDocItems(docID, identifier){
      const body = DocumentApp.openById(docID).getBody();
      const docText = body.getText();

      //Check if search characters are to be included.
      let startLen = identifier.start_include ? 0 : identifier.start.length;
      let endLen = identifier.end_include ? 0 : identifier.end.length;

      //Set up the reference loop
      let textStart = 0;
      let doc = docText;
      let docList = [];

      //Loop through text grab the identifier items. Start loop from last set of end identifiers.
      while(textStart > -1){
        let textStart = doc.indexOf(identifier.start);

        if(textStart === -1){
          break;
        }else{

          let textEnd = doc.indexOf(identifier.end) + identifier.end.length;
          let word = doc.substring(textStart,textEnd);

          doc = doc.substring(textEnd);

          docList.push(word.substring(startLen,word.length - endLen));
        };
      };
      //return a unique set of identifiers.
      return [...new Set(docList)];
    };
Quick use guide
The getDocItems() function allows you to search a Google Doc, based on the document’s ID, and find text between a set of identifying characters. It returns a list of the strings of text found between the identifying characters you chose.
You also have the option of keeping either the start search identifying characters, the end ones, both or neither.
The getDocItems() function takes two parameters:
- docID (string): This is the Google Doc ID for your project. You can find the ID in the document’s URL, between “/d/” and “/edit”.
- identifier (object): The identifier is an object container for you to put in your parameters and instructions for the code to run. It contains the following keys:
  - start: A string of identifying characters you will use to find the start of the text. For example, `{{`. I use backticks (`) here because they are not commonly used in documents, which allows us to easily add single or double quotation marks as our identifying characters without issues.
  - start_include: Set this to true if you want to include the start identifying characters. Set to false if you don’t.
  - end: This is the string of identifying characters you will use to find the end of the text you want to look for. For example, `}}`.
  - end_include: Set this to true if you want to include the end identifying characters. Set to false if you don’t.
To add this script to your project, simply copy and paste in the code above. In your own function, call getDocItems(docID, identifier) with the values required. Then run your function.
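If all you have is the document’s URL, a small helper can pull the ID out for you. This is only a sketch: getIdFromUrl is a hypothetical name that is not part of the script above, and it assumes the usual Google Docs URL shape where the ID follows “/d/”.

```javascript
// Hypothetical helper (not part of the original script):
// pull the document ID out of a Google Docs URL.
// Assumes the ID sits right after "/d/", as in standard Docs links.
function getIdFromUrl(url) {
  const match = url.match(/\/d\/([a-zA-Z0-9_-]+)/);
  return match ? match[1] : null; // null when the URL has no "/d/<ID>" part
}

const exampleUrl = "https://docs.google.com/document/d/1EToW3zy_gz4fe2TBI4if6CJqib9P9wy1zqaXYitA96c/edit";
console.log(getIdFromUrl(exampleUrl)); // 1EToW3zy_gz4fe2TBI4if6CJqib9P9wy1zqaXYitA96c
```

You could then pass the returned string straight in as the docID argument.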
Examples
Example 1: Extracting all the quoted speech text from a story
In this example, we want to extract all the quoted speech text from the following story.
We might want to use it for analysis in, say, a Google Sheet later. For now, we will just log the results. Here is how we would run the getDocItems() function from our project:
    //The chewy conversation
    function runsies(){
      const docID = "1EToW3zy_gz4fe2TBI4if6CJqib9P9wy1zqaXYitA96c";
      const identifier = {
        start: `“`,
        start_include: false,
        end: `”`,
        end_include: false
      };
      let results = getDocItems(docID, identifier);

      console.log(results);
    };
On line 3, we reference the document we want to search by providing the document ID.
Then on line 4, we create our object keys and values. I want to search by starting double quotation marks (we called them the 66ers when I was a kid). Now, if you simply type in a double quotation mark into the Google Apps Script editor, you may get a very plain non-directional version. This won’t help you search for anything. It needs to be exactly the same as the one in your Google Doc.
My recommendation is to then copy the starting quotation mark you want to search for in the Google Doc and paste it directly into the editor.
This will maintain the quotation direction keeping it as a 66er (Do I sound old saying that? I feel I do).
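If copying and pasting feels fragile, one alternative is to write the curly quotation marks as Unicode escapes. The snippet below is a small sketch to show that the straight and curly marks really are different characters; the sample sentence is made up for illustration.

```javascript
// Straight and curly double quotation marks are different characters.
const straight = '"';      // U+0022, what the keyboard usually types
const opening = '\u201C';  // the opening curly mark (the "66er")
const closing = '\u201D';  // the closing curly mark (the "99er")

console.log(straight === opening); // false
console.log(opening === '“');      // true

// The same identifier object as in the example, written with escapes.
const identifier = {
  start: opening,
  start_include: false,
  end: closing,
  end_include: false
};

// Made-up sample text that uses curly quotation marks.
const sample = 'She said, “Hello there,” and left.';
console.log(sample.indexOf(straight));         // -1, the straight mark is not in the text
console.log(sample.indexOf(identifier.start)); // 10
```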
So that’s the start key done. We don’t want our saved data to include quotation marks. We’ll remove them by making the start_include value equal false.
Next, we want to indicate what characters we want to stop our search on. Again, we copy the 99er quotation mark from the text, paste it between the backticks, and set end_include to false as well.
In our results variable (line 10), we then pass our docID and identifier constants to our getDocItems() function. This will return a list of all the speech in the text as an array.
We then simply console.log this at the end. You can do much, much more with the data, I am sure.
    Apr 17, 2020, 8:25:06 AM Debug [ 'So, kid. A pretty good rule of hoof is to don’t chew on things that the boss picks up. Unless it’s, you know food.',
      'But how do I know if it’s not food if I don’t give it a try.',
      'You, see,',
      'Anything the boss puts in there is okay to chomp on. It’s pretty tasty too.',
      'But don’t eat the trough,',
      'You’re getting it, kid. You’re getting it!' ]
Let’s quickly run through two more examples now that we have the basics.
Example 2: Get all the data inside square brackets []
In this example, we want to find all the quoted costs in the travel blog to Ryudikkelf Spaceport (What?!). Take a look at the Google Doc below:
In the example above, we can see that the text we want is inside square brackets “[]”. We kinda like the brackets so we want them to be kept when we save the data in our array. Here is how we would set up our script:
    //Ryudikkelf Spaceport
    function runsies1(){
      const docID = "1WBohLj01HF2slGfha43aCp57vO_0nipsLbdYKZ6zktQ";
      const identifier = {
        start: `[`,
        start_include: true,
        end: `]`,
        end_include: true
      };
      let results = getDocItems(docID, identifier);

      console.log(results);
    };
Running this code will result in this array:
    Apr 17, 2020, 8:34:20 AM Debug [ '[Salz food - $140IYNSK]',
      '[Prettytoria 30min VR session - $90IYNSK]',
      '[Gel Sandwich drinks - $60IYNSK]' ]
Example 3: Get location references for a mail merge
Imagine you have a document that you want to create multiple copies for. However, each copy has unique data in it like names, billing information, etc. Perhaps you want to set up a mail merge-type system for this.
You decide that you will identify each location that needs to change with double braces as the key for that location “{{ }}”. You could use getDocItems() to get these unique character sets and then find and replace them each time you duplicate a new document with Google Apps Script.
You would grab these key character sets the same way as the previous two examples. Take a look at the example Google Doc below:
When we run the code, you will have a list of unique character sets you might want to connect to headers in a Google Sheet to create your ‘Mail Merge’.
Here’s the script:
    //Very Important Letter
    function runsies2(){
      const docID = "1ReYneEMYpBlkgdjl8-KJWRemhQ7rQcbR_MFg6rpbDAQ";
      const identifier = {
        start: `{{`,
        start_include: true,
        end: `}}`,
        end_include: true
      };
      let results = getDocItems(docID, identifier);

      console.log(results);
    };
This is what the list would look like:
    Apr 17, 2020, 8:44:36 AM Debug [ '{{Name}}', '{{Address}}', '{{Phone}}', '{{Date}}', '{{Day}}' ]
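To give a feel for the find-and-replace step of a mail merge, here is a minimal sketch that works on a plain string rather than on the Google Doc itself. The fillTemplate helper, the sample letter and the sample row of data are all hypothetical and only there for illustration; in a real project the values would more likely come from a row in a Google Sheet.

```javascript
// Hypothetical helper: replace each {{Key}} in a string with a value from `data`.
// Keys without a matching value are left untouched.
function fillTemplate(template, keys, data) {
  let output = template;
  keys.forEach(function (key) {
    const name = key.slice(2, -2); // strip the surrounding {{ and }}
    if (Object.prototype.hasOwnProperty.call(data, name)) {
      // split/join swaps every occurrence without needing a regular expression
      output = output.split(key).join(data[name]);
    }
  });
  return output;
}

// The list returned in the example above.
const keys = ['{{Name}}', '{{Address}}', '{{Phone}}', '{{Date}}', '{{Day}}'];

// Made-up template text and data.
const letter = 'Dear {{Name}}, we will call you on {{Phone}} this {{Day}}.';
const row = { Name: 'Yagi', Phone: '555-0100', Day: 'Friday' };

console.log(fillTemplate(letter, keys, row));
// Dear Yagi, we will call you on 555-0100 this Friday.
```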
Code Breakdown
Want to know how getDocItems(docID, identifier) works? Read on, intrepid friend!
Parameters
getDocItems(docID, identifier) requires two parameters:
- docID: A string containing the document’s ID.
- identifier: An object of key-value pairs that contain the start and end reference characters and the choice to keep those characters in the saved array.
The function then returns an array of all the text found between the selected reference characters in the document.
Grabbing the text from the Google Doc
    const body = DocumentApp.openById(docID).getBody();
    const docText = body.getText();
The first task is to grab all the text from the Google Doc. This is done by using Google Apps Script’s DocumentApp class.
We then use the openById() method to grab the document we want to work in. You can see that we use the docID from our parameters here.
Next, we grab the body of the document. The Body class allows us to access the document’s whole text, or just things like lists, tables of contents or tables. For us, we want to grab the body’s text. We do this on the next line of code with getText(). This simply grabs all the text and hands it to our script as a single string for us to work on.
Check if search characters are to be included
    //Check if search characters are to be included.
    let startLen = identifier.start_include ? 0 : identifier.start.length;
    let endLen = identifier.end_include ? 0 : identifier.end.length;
Here, we want to confirm whether we need to include the search characters in the text we save in our array. Remember, back where we were creating the identifier object of key-value pairs, we asked if you wanted to include the start and end search characters. You chose either true – for sure, add em in! – or false.
We set two new variables to the start and end length. We use a ternary operator to do this. Each line basically says this: if start_include or end_include was marked as true, then give this variable the value of zero; otherwise give it the total length of the search character string.
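To see what the two ternaries produce, here is a quick sketch. The trimLengths wrapper is a hypothetical helper added only so we can try both settings side by side; the logic inside is the same as in getDocItems().

```javascript
// Hypothetical wrapper around the two ternary lines from getDocItems().
function trimLengths(identifier) {
  const startLen = identifier.start_include ? 0 : identifier.start.length;
  const endLen = identifier.end_include ? 0 : identifier.end.length;
  return { startLen: startLen, endLen: endLen };
}

// Identifiers are NOT kept, so we need to trim their full length off later.
const dropBoth = trimLengths({ start: '{{', start_include: false, end: '}}', end_include: false });
console.log(dropBoth.startLen, dropBoth.endLen); // 2 2

// Identifiers ARE kept, so there is nothing to trim.
const keepBoth = trimLengths({ start: '[', start_include: true, end: ']', end_include: true });
console.log(keepBoth.startLen, keepBoth.endLen); // 0 0
```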
Set up for the loop to find the reference characters
    //Set up the reference loop
    let textStart = 0;
    let doc = docText;
    let docList = [];
In a moment, we will search through the text for our start and end reference characters, but before we do, we need to do a little setup for the loop.
First, we are going to create a textStart variable and set it to zero. This will update after we find each set of reference characters.
We will then copy docText into a doc variable so that we can update it.
Finally, we will want to set up a variable to store the list of text we find based on our search. This is done with docList.
Looping through the text to find the characters within our search set
    //Loop through text grab the identifier items. Start loop from last set of end identifiers.
    while(textStart > -1){
      let textStart = doc.indexOf(identifier.start);

      if(textStart === -1){
        break;
      }else{

        let textEnd = doc.indexOf(identifier.end) + identifier.end.length;
        let word = doc.substring(textStart,textEnd);

        doc = doc.substring(textEnd);

        docList.push(word.substring(startLen,word.length - endLen));
      };
    };
while loop
The while loop is set to conclude when the value of textStart is no longer greater than -1. In practice, because textStart is declared again with let inside the loop, the outer value never changes and the loop actually ends at the break statement.
Get the start text location
The textStart value changes in line 3 when we set its value to the index (location number) of the first set of characters we are searching for. This is done with the indexOf method which basically says that if there is an index of what we are searching for in our document, then give its location. So for example, if our start search is “{{” and our example text is this:
I am a {{hungry}} goat!
Then the index will be 7 characters in.
No more search items
If there are no characters in the doc that match the identifier.start search characters, then textStart will equal -1. If this is the case we don’t want to go any further, so we conclude the loop with a break (line 6). We check if textStart is equal to -1 with an if statement in line 5.
However, if the number is greater than -1, we want to proceed. We do this with our somewhat unnecessary, but syntactically pretty, else statement.
Get the end text location
On line 9, we create the variable textEnd. This variable uses the indexOf method again to find the end search text. This gives us the location where the end search text starts. Going back to our example, it will look like this (where the bar is!):
I am a {{hungry|}} goat!
Which will be index 15.
We may want to include the end search characters, so our index needs to sit just after them. This is done by adding the length of the search characters to the indexOf location:
    doc.indexOf(identifier.end) + identifier.end.length;
Which will set the index to 17 for our example, because there are two characters in our end search text:
I am a {{hungry}}| goat!
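The index numbers from the goat example can be checked with a few lines of plain JavaScript. The variable names here are just for this sketch:

```javascript
const text = 'I am a {{hungry}} goat!';
const start = '{{';
const end = '}}';

const textStart = text.indexOf(start); // where "{{" begins
const endAt = text.indexOf(end);       // where "}}" begins
const textEnd = endAt + end.length;    // just past "}}"

console.log(textStart); // 7
console.log(endAt);     // 15
console.log(textEnd);   // 17

// When the search text is not in the string, indexOf returns -1.
console.log(text.indexOf('[[')); // -1
```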
Grabbing the text
On line 10, we grab the text or word within our search selection. We use substring to extract the range of characters in our doc string. Substring takes two arguments, the start index and end index. We add our textStart index and textEnd index here. In our little example above that would return:
{{hungry}}
    let word = doc.substring(textStart,textEnd);
A brand new start
It seems terribly inefficient to iterate through the entire doc of text each time. Plus we don’t want to catch the same word set in an eternal loop. We need to change the start location of our doc. To do this, we simply update doc using substring again, so that doc now starts at the textEnd index.
    doc = doc.substring(textEnd);
If you only give substring the start argument, it will create a string from the start index all the way to the end of the string.
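Another way to move the search forward is the optional second argument of indexOf, which tells it where to start looking. The sketch below is a variation on the loop, not the tutorial’s code: extractBetween is a hypothetical name, it works on a plain string instead of a Google Doc, and it adds two guards. It only looks for the end characters after the start characters, and it stops if no end characters are left. As far as I can tell, the original loop can return odd entries when a closing identifier is missing, and with a one-character end identifier it may never finish.

```javascript
// Hypothetical variation of the search loop that moves a cursor
// through the text instead of re-slicing the string each time.
function extractBetween(text, identifier) {
  const startLen = identifier.start_include ? 0 : identifier.start.length;
  const endLen = identifier.end_include ? 0 : identifier.end.length;
  const found = [];
  let cursor = 0;

  while (true) {
    const textStart = text.indexOf(identifier.start, cursor);
    if (textStart === -1) break; // no more opening identifiers

    // Look for the closing identifier only after the opening one.
    const endAt = text.indexOf(identifier.end, textStart + identifier.start.length);
    if (endAt === -1) break; // unmatched opening identifier, so stop

    const textEnd = endAt + identifier.end.length;
    const word = text.substring(textStart, textEnd);
    found.push(word.substring(startLen, word.length - endLen));
    cursor = textEnd; // carry on searching after this match
  }
  return [...new Set(found)];
}

const id = { start: '{{', start_include: false, end: '}}', end_include: false };
console.log(extractBetween('Hi {{Name}}, call {{Phone}} or {{Name}}.', id)); // Name, Phone
console.log(extractBetween('Stray }} before {{Day}} and an open {{', id));   // Day
```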
Update our list of found items
docList.push(word.substring(startLen,word.length - endLen));
Lastly, we want to push our newly found text into our docList. Before we do that, we trim it if necessary with our startLen and endLen, depending on whether we asked to remove the search characters or keep them.
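Here is a small sketch of that trimming step on the goat example, with the trim lengths written out by hand for three of the possible settings. The variable names are only for this illustration.

```javascript
const word = '{{hungry}}';
const startId = '{{';
const endId = '}}';

// start_include and end_include both false: trim both identifiers off.
const dropped = word.substring(startId.length, word.length - endId.length);
console.log(dropped); // hungry

// Both true: the trim lengths are 0, so the word is unchanged.
const kept = word.substring(0, word.length - 0);
console.log(kept); // {{hungry}}

// Only start_include true: keep the start, trim the end.
const startOnly = word.substring(0, word.length - endId.length);
console.log(startOnly); // {{hungry
```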
Return the array back
The final step of the function is to return our array of found text back to be used in our main function.
For my purposes, and I would imagine many of yours, you would only want a list of unique strings – no duplicates. Due to Google Apps Script’s recent upgrade to the V8 runtime we can now use a bunch of ES6 syntax. Sweet!
I wanted to try out the spread syntax and use the Set object to remove the duplicate elements in an array. How elegant does this look?!
    return [...new Set(docList)];
Alternatively, if you do want to keep duplicates in your docList, simply return the docList:
    return docList;
and delete the other return line.
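A quick way to see the Set and spread syntax combination at work, using a made-up list with some repeats in it:

```javascript
// Made-up list with duplicates, as docList might look before the return line.
const docList = ['{{Name}}', '{{Phone}}', '{{Name}}', '{{Date}}', '{{Phone}}'];

// A Set keeps only one copy of each value, in the order first seen.
// The spread syntax turns the Set back into a regular array.
const unique = [...new Set(docList)];

console.log(unique);                        // {{Name}}, {{Phone}}, {{Date}}
console.log(docList.length, unique.length); // 5 3
```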
Conclusion
This simple tutorial should get you started with grabbing text from a Google Doc for other uses. You could certainly make a lot of changes to the code to introduce more search rules. Likewise, you could use regular expressions to search for some really cool stuff, but for a simple overview this should get you started.
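If you are curious about the regular expression route, here is one possible sketch. It is not the approach used in this tutorial: both function names are hypothetical, it works on a plain string rather than a Google Doc, and it always leaves the identifiers out of the results. The identifiers are escaped first so that characters like [ or { are treated as plain text in the pattern.

```javascript
// Hypothetical helper: escape characters that have a special meaning
// in a regular expression so they are matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical regex-based variation: returns the unique pieces of text
// found between the start and end identifiers.
function getItemsWithRegex(text, start, end) {
  // [\s\S]*? lazily matches anything, including line breaks, up to the next end identifier.
  const pattern = new RegExp(escapeRegExp(start) + '([\\s\\S]*?)' + escapeRegExp(end), 'g');
  const found = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    found.push(match[1]); // the captured text between the identifiers
  }
  return [...new Set(found)];
}

// Made-up sample text.
console.log(getItemsWithRegex('Pay [Salz food - $140] then [Drinks - $60].', '[', ']'));
// Salz food - $140, Drinks - $60
```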
What do you think you would use this on? I love hearing how these snippets and tutorials are applied. It’s inspiring. Please let me know in the comments below.
Happy coding.
Amazing tutorial. Thanks. I am using your code to extract info that is not following Reference Characters. I guess a regular expression would do the job. Lot of work ahead for an app script beginner.
Thanks for the kind words.
Yeah, regular expressions are tricky, but there are a lot of resources out there.
Best of luck on your adventure!
~Yagi
Hi Yagi. I am happy to notice that you answer the comments. Maybe you will find the time to give me a clue about how to incorporate regular expressions in this code. I already have the RegEx. I have tested them with other, simpler code. My problem is that I can’t manage to swap the identifier for the RegEx. Maybe it is not the only change to make?
Thanks for your time Yagi.
SOLVED! Using your other tutorial: Google Apps Script: Extract Specific Data From a PDF and insert it into a Google Sheet. This time I paid attention to the details. In the end, you answered my question in an anachronistic way.
Thanks again Yagi.
Glad you managed to piece it all together.
~Yagi