Google Apps Script: Extract Specific Data From a PDF and insert it into a Google Sheet

Extract IDs from PDFs and instert into Google Apps Scirpt

Google Apps Script: DriveApp, Advanced Drive Service, SpreadsheetApp,  DocumentApp Javascript: spread operator, map, regex

Ah, bureaucracy at its finest.  Have you ever asked for a list from a client, another department or agency and instead of saving into a file that might even be considered marginally useful, they give it to you as a pdf. A… p…d…f.

Gee, thanks.

After some prostrations to the great Google Apps Script gods, I had a though.

“Hey, can’t we convert a PDF to a Google Doc with just a click of the button? Surely the great Google Apps Script devs have made it so we can do it programmatically too.”

And you know what? They bloody well did. The big legends.

The Scenario

I’ve just received a bunch of PDFs. The PFDs are all labelled by the class number. Take a look at the files in my Google Drive:

PDFs of sections in Google Drive

Each PDF file contains a list of student IDs that I need to extract and put into a Google Sheet.

PDF file with student IDs

The aim is to have a list of student IDs in column A and their corresponding sections in column B.

As you can see, we have some pretty standard text in the PDF that should be easy for Google to recognise so that we can extract the IDs.

The list of names in the demo sheets were randomly generated by AI!

NOTE! As always, I have tried to create this tutorial for varying levels. Feel free to follow along, or just grab what you need and get stuck into your own project.

If you are playing along, you can find a copy of the PDF files below. Simply add them to your own Drive before you get started:

PDF files by Section.

Spreadsheet Setup

Let’s keep this simple. Go in and create a new Google Sheet. Rename the sheet tab down the bottom to “Extracted”.

In the first cell, we are going to style up a button, that we will use to run the code later.

You can do this by going to Insert > Drawing in the menu. You can find out more about how to create buttons here:

Google Apps Script: How to Connect a Button to a Function in Google Sheets

Next, on line 2 we will add Student ID (A2) and Section in (B2).

Google Sheet Setup to extra usernames from PDF files

The Code

Globals

Let’s take a look at our main Global variables:

To get the FOLDER_ID constant, go to your folder in your Google Drive where you are keeping your PDF files you will be extracting your IDs from.

Google Drive file ID in URL 02

Change this out for your own folder ID.

The SS constant stands for Spreadsheet. Here again, we want to grab the ID of the spreadsheet we just created. To do this, you use the same trick again. Go up to the URL and select the ID.

You will need to change this one too.

Google Drive Spreadsheet ID in URL

Finally, the constant variable SHEET is the name of the sheet tab we created earlier. In our case, “Extracted”.

extractStudentIDsAndSectionToSheets()

The extractStudentIDsAndSectionToSheets() function is the main execution function that will extract all our IDs and sections and insert them into our Google Sheet.

Let’s take a look:

First up, on line 9, we call the SpreadsheetApp class and ask it to ready the active spreadsheet. This will be the one you are working in. We’ll store this in the ss variable and use it later to send toast popups to update the user on the progress of the code.

Getting a list of PDF FILES

Next, we grab the folder that our PDFs are in using the DriveApp class. The Drive App class allows you to add, remove, list and move files in your Google drive.

From the DriveApp we call the getFolderById(FOLDER_ID) method so that later we can look inside and find all the PDF files. Notice the FOLDER_ID global is being used here. (Line 11)

We only want to search for files that are PDFs. We can ask Google Apps Script to do this by using it to only get files that are of the type, PDF. (Line 12)

const files = folder.getFilesByType("application/pdf");

You can find more Mime types here.

The files constant variable prepares a generator containing all the related data for each of the PDFs in our assigned folder.

Storing our results

Remember, our goal is to grab a list of all of the IDs in each PDF file, assign them their section and put them all together in a list.

The list we will be updating is:

let allIDsAndCRNs = [] (Line 14)

This will ultimately contain a 2d array that might look a little like this:

We will populate this allIDsAndCRNs variable with data from our while loop.

Looping through each file and extracting data

To loop through all the PDF files of our files generator we use a while loop. We state that if files do have another PDF with all it’s associated data, then  we want to do something with it:

while(files.hasNext()){ (line 16)

The first thing we want to do during each iteration is to get the data that is yielded from each call to the files generator. We do this with the next() method. We’ll set this method to the variable file(line 17)

Next, we want to grab the file’s ID. This will be a big long scramble of letters and numbers similar to the folder ID. This is stored in the fileID variable. We will use this file ID reference to extract the IDs from the file. (line 18)

Extracting the text from the PDF

On line 20, our file ID is sent off to the getTextFromPDF() function. As the name suggests, this function will extract all the data from the current PDF file and save it as a string in the doc constant variable. The doc variable will result in an object. doc.name is the file name of the PDF and doc.text is the entire text inside the PDF.

Extracting the student IDs into a string

On the next line (line 21), we then use the doc.text data and send it to the extractStudentIDs to grab all the ids. This will return an array of all the student IDs it found. For example:

Adding the IDs to their assigned section.

Our next task is to assign the IDs to their section. We do this with the map method. This method creates a new array by iterating through the current array and making your desired changes.

To do this I have used a simple arrow function ( =>):

const studentIDsWithCRN = studentIDs.map( ID => [ID,doc.name]);

We call the map on our studentIDs array. For our iterator variable, we will set to ID. We then tell map that we want a new array that now contains a child array with the ID and the doc name as it iterates through each ID.

Keeping your users up to date

Extracting data from PDFs can take a bit of time. Once the code has been run, your users may be wondering what is taking so long. We should probably update them about where we are up to after we extract from each PDF. We might also want to warn the user if one of the PDFs doesn’t have any usernames in the file.

Fortunately, we can use Google Apps Scripts toast method inside SpreadsheetApp class. The toast method takes one argument and two optional arguments:

  1. Messagestring: the string message you want to display for the users.
  2. Title: string: You can fancy this up by including a title if you wish.
  3. Timeout duration: number: Time in seconds that you want the toast to appear for.

Take a look a the code:

First, we check if the zeroeth item in the studentIDs array contains the text “No items found”. This text is generated in the extractStudentIDs function if there’s are no items with the search ID condition met (more on this later).

So, if one of our PDF files does not contain any IDs we send a toast to the user with a message that says there are no items found in the file (doc.name). We add a nice header called “Warning” and keep it displayed for 2 seconds.

Alternatively, if there are items, we simply want to update the user that the current PDF file has been extracted.

Google Apps Script SpreadsheetApp Toast
An example of a “toast” out in the wild.
Finishing the iteration of the loop

The last part of each iteration of the while loop is to add all studentIDsWithCRN data to the end of the main allIDsAndCRNs 2d array.

We achieve this here with a contact method which concatenates or joins one array to the other.

allIDsAndCRNs = allIDsAndCRNs.concat(studentIDsWithCRN);

Inserting the IDs and Sections(CRNs) into the spreadsheet

The last task is to insert the IDs and section data into the spreadsheet. We do this by running the importToSpreadsheet function with our full set of accumulated IDs and sections as the argument.

importToSpreadsheet(allIDsAndCRNs);

getTextFromPDF(fileID)

The getTextFromPDF() function is called from the main extractStudentIDsAndSectionToSheets(). It takes a file ID as parameter and returns an array containing the file name of the PDF and a long string of all the PDF data contained within the file.

This process makes use of Google Docs ability to transform files like PDFs using Optical Character Recognition (OCR). What we will do is create a Google doc out of our PDF, then extract all the text from the doc and save it to a variable before deleting the Google Doc.

Creating the Google Doc from a PDF

To create a Google Doc from a PDF in Google Apps Script we need to make use of the Drive API in Advanced Google Services. However, before you can use this API, you will need to initialise it for your project.

To do this go to Resources > Advance Google services in your Google Apps Script editor.

Access Advance Google servies in Google Apps Script editor

A dialogue box will appear with a list of APIs. Scroll down until you find the Drive API. Select the box to “On” and hit OK.

Advanced Google Services Drive API select to on

To create a Google Doc version of our PDF we are going to use the insert method of the files resource in the Drive API. This will take 3 arguments:

  1. resource: object: This is an object containing the file’s metadata like its file name and mime-type.
  2. blob: data: This is all the actual file data that we will transform into a Google Doc.
  3. option: object: These are all the optional query parameters you wish to add. For us, we will use this to request the OCR and set the OCR language to English.

You can see all this put together on line 21:

var file = Drive.Files.insert(resource, blob, options);

Back up on line 11, we grab the blob. The blob stands for binary large object and it contains all the data inside the file.  We use our DriveApp again to get the current PDF file and then access its internal data.

var blob = DriveApp.getFileById(fileID).getBlob();

Back up on lines 12 to 15, you can see that we wanted to maintain the title of the new Google Doc so we added the title. We also added the mime type to help us to transform the PDF to our Google Doc.

Lines 16 to 19 set up our Optical Character Recognition (OCR). First is asks us if we want to convert the documents using OCR which we select as true. Then we determine that the language is in English.

Extract the text from the Google Doc

The next step is to extract all the text from our newly created Google Doc. To do this we need to open our newly created document with the DocumentApp class:

var doc = DocumentApp.openById(file.id);

Once we have the file we grab the body data and that data’s text.

var text = doc.getBody().getText();

We will also grab the name of the new doc.

const title = doc.getName();

Tidying Up and sending away

Once we have our text and title stored, we can go ahead and delete the newly created Google Doc. To delete the file completely I recommend using the remove method in the Drive API.  This will permanently delete the file instead of just sending it to the trash.

Drive.Files.remove(doc.getId());

To do this we simply provide the file ID of the newly created document.

The final step is to return an object containing the title (The section name) and the text of data back to our main function.

extractStudentIDs()

The extractStudentIDs(text) function takes a text file as a parameter and returns an array of all the ids within the text file.

To find all the IDs in our text we use a regular expression. Regular expressions will allow us to match all the student IDs based on a particular pattern.

The Regular Expression

For our example, we know that all our student IDs start with a 20 and have 7 more numerical characters proceeding it.

We express this in this manner:

/20\d{7}/g;

The two slashes indicate the regular expression area. The /g says that we are looking for all occurrences of the pattern, not just the first one. The 20 is simply our first two digits that are common for all ids. The \d indicates that we want to look for digits (0-9). Finally, the {7} tells us how many digits we want to look for.

If you are new to regular expressions, I have a tutorial on how I use them in Google Sheets that you might find useful. You can check it out here: 

Google Sheets: Counting and Filtering Rows where Cells Contain Particular Values.

You can modify the regular expression to suit your own needs here. Getting regular expression right can be a little tricky but there are a lot of resources out there to help you.

Checking and Storing the results

Whenever we find an ID we want to add it to an array. We can do this really elegantly by combining the spread operator with the match method. However, if there is no ID in our string then we will get an error. We want to handle this with our try error handling statement.

If there are no IDs in the text  then we will set our array to equal “No items found” in our catch statement and return it to our main function. (line 17 & 18)

If there are ids then we set our array to equal a list of all the IDs it finds. This will then be returned to the main function.

importToSpreadsheet(data)

This function takes the accumulated ids and sections from all the PDFs in a 2d array and inputs them in your Google Sheet.

Here we will use the SpreadsheetApp class again.

Our first task is to locate the correct Google Sheet and its sheet tab. (line 9)

Then we want to grab the range of cells we will insert our new dataset into. We do this with the getRange method.  This method can take either an argument as numbers of columns and rows or use A1Notation. For me, it is usually easier to use the number-based arguments.

We will start in cell A3. So we will add 3 for the number of rows down and 1 for the number of columns across as our first two arguments. The next arguments are the depth of the row. We calculate this by simply getting the length of our data. Lastly, we add 2 for the total width of columns. We know it will only be two across because we are only displaying a column of IDs and a column of sections.

Next, we set the values into the spreadsheet. This is the point where the data is pasted into the sheet for the user to see.

Finally, we give the data a sort by section and then id using the sort method.

Connecting the Button and Running the script

Remember all the way back at the top of this tutorial, you created a button to run your code from? To connect your button first copy the main function name extractStudentIDsAndSectionToSheets. Got to your Google Sheet and right-click on the button image and select Assign a script. Then paste in the function name.

Assign a script to a button to extract PDF data in Google Apps Script

Now click on the Extract button.  If it is your first time your script will ask you for permissions to run. Accept them and your script will work.

Conclusion

That wraps it up.

You can change the regular expression for your own project. You may even want to build from this script to search for multiple patterns in the script and store the results in a multi-column 2d array.

One limitation you might face is that the OCR may not be good enough for images that have been converted to PDFs rather than text to PDFs.

Anyway, give it a go, I would love to hear how you used this in your own projects and if you like this post, why not subscript (top right). I post 1-2 times a month and sometimes a little more over summer.

Looking to learn more about Google Apps Scripts in a more structured format? Udemy has some great courses that can get you from the basics to a real Google Apps Script pro!

Got a more specific problem you need help with, but don’t have the time to develop the skills? Fiverr’s your best bet to find a skilled Google Apps Script professional to solve your problem quickly and cheaply. *

*The above affiliate links have been carefully researched to get you to what you specifically need. If you decide to click on one of these links it will cost you just the same as going to the site. If you decide to sign up, I just get a little pocket money to help pay for the costs of running this website.

10 thoughts on “Google Apps Script: Extract Specific Data From a PDF and insert it into a Google Sheet”

  1. This is a great tutorial! Do you have any good references to enhance this script to search for multiple patterns in the script and store the results in a multi-column 2d+ array. I’ve been trying to tie this with other sources by have struggled to make it work.

  2. Hello,

    I hope your day is going well.

    I receive the error, “Exception: The number of rows in the range must be at least 1.”

    Additionally, on line 113 it says the following:
    const range = sheet.getRange(3,1,data.length,2);

    Where in the code is the variable data declared?

    All the best,
    Jake

    1. Hi Jacob,

      Data is a parameter of the function importToSpreadsheet(data).

      It is derived from the allIDsAndCRNs 2d array variable in the main extractStudentIDsAndSectionToSheets() function.

      My guess is that there were no items found that met your search parameters.

      Hope that helps.

      ~Yagi

  3. Thanks to the script, on my end I just needed to read PDF’s and rename the file using data from content, but I’m facing an issue, while folders contain small amount of PDF’s works with no issues, but a folder with over 50 files I get:

    “GoogleJsonResponseException: API call to drive.files.insert failed with error: Internal Error” Line 78
    on line 78: var file = Drive.Files.insert(resource, blob, options);

    1. Hi Hzo, I’m not 100% sure what is occurring based on that error. I do have a feeling that you might have reached a memory or processing limit using this approach. Roughly how long does it take for this error to occur?

      1. Hey Yagi, It takes about 6 seconds. I have made additional tests, some of these PDF have large images in it, I guess that might be the issue.

        File A = 50mb **crashes
        File B = 20mb **works

        Thanks again.

        1. Hi Hzo,

          I think you are onto something there with the images. The code converts the PDF into a Google Doc before extracting the text, but this process struggles with non-text based items.

          You might have to go back to the blob data and find a specific JS library to extract your data.

  4. This is a great tutorial! Do you have any good references to enhance this script to search for multiple patterns in the script and store the results in a multi-column 2d+ array. I’ve been trying to tie this with other sources but have struggled to make it work.

    1. Hi Michael,

      Thanks.

      For me, I would update the extractStudentIDs(text) function and create an array of search patterns. Then loop through each search pattern updating the const regexp with the new search pattern. I’d imagine you would want to put each column so you would first need to add the results of each search to an inner array (column array) then at the end of your loop add that array to the outer (Row) array.

      Hopefully, that makes a little sense.

      Let me know if you have any more questions.

      Cheers,

  5. Hi,
    Thank’s for your script Yagi !!!!

    Anyone know how read the text of interactive fields in PDF file ? When the script convert PDF to DOC the interactive fields disappear. It’s possible create a new method for extract interactive fields text ?

    Thank’s
    ProfeFP

Leave a Reply