Google Apps Script: DriveApp, Advanced Drive Service, SpreadsheetApp, DocumentApp Javascript: spread operator, map, regex
Ah, bureaucracy at its finest. Have you ever asked for a list from a client, another department or agency and instead of saving into a file that might even be considered marginally useful, they give it to you as a pdf. A… p…d…f.
Gee, thanks.
After some prostrations to the great Google Apps Script gods, I had a though.
“Hey, can’t we convert a PDF to a Google Doc with just a click of the button? Surely the great Google Apps Script devs have made it so we can do it programmatically too.”
And you know what? They bloody well did. The big legends.
Table of Contents
The Scenario
I’ve just received a bunch of PDFs. The PFDs are all labelled by the class number. Take a look at the files in my Google Drive:
Each PDF file contains a list of student IDs that I need to extract and put into a Google Sheet.
The aim is to have a list of student IDs in column A and their corresponding sections in column B.
As you can see, we have some pretty standard text in the PDF that should be easy for Google to recognise so that we can extract the IDs.
The list of names in the demo sheets were randomly generated by AI!
NOTE! As always, I have tried to create this tutorial for varying levels. Feel free to follow along, or just grab what you need and get stuck into your own project.
If you are playing along, you can find a copy of the PDF files below. Simply add them to your own Drive before you get started:
Spreadsheet Setup
Let’s keep this simple. Go in and create a new Google Sheet. Rename the sheet tab down the bottom to “Extracted”.
In the first cell, we are going to style up a button, that we will use to run the code later.
You can do this by going to Insert > Drawing in the menu. You can find out more about how to create buttons here:
Google Apps Script: How to Connect a Button to a Function in Google Sheets
Next, on line 2 we will add Student ID (A2) and Section in (B2).
The Code
This project requires Google Drive Advanced Service V2.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 |
//#####GLOBALS##### const FOLDER_ID = "1CnTA5AQo4xzo4TJAWpGtEPVRhh9jdCHl"; //Folder ID of all PDFs const SS = "1jJNhSkGq3SbqSnRXNRd4EhsYCjDwZ0LYZW6y3lkZVGw";//The spreadsheet ID const SHEET = "Extracted";//The sheet tab name /*######################################################## * Main run file: extracts student IDs from PDFs and their * section from the PDF name from multiple documents. * * Displays a list of students and sections in a Google Sheet. * */ function extractStudentIDsAndSectionToSheets(){ const ss = SpreadsheetApp.getActiveSpreadsheet() //Get all PDF files: const folder = DriveApp.getFolderById(FOLDER_ID); //const files = folder.getFiles(); const files = folder.getFilesByType("application/pdf"); let allIDsAndCRNs = [] //Iterate through each folderr while(files.hasNext()){ let file = files.next(); let fileID = file.getId(); const doc = getTextFromPDF(fileID); const studentIDs = extractStudentIDs(doc.text); //Add ID to Section name const studentIDsWithCRN = studentIDs.map( ID => [ID,doc.name]); //Optional: Notify user of process. You can delete lines 33 to 38 if(studentIDs[0] === "No items found") { ss.toast("No items found in " + doc.name, "Warning",2); }else{ ss.toast(doc.name + " extracted"); }; allIDsAndCRNs = allIDsAndCRNs.concat(studentIDsWithCRN); } importToSpreadsheet(allIDsAndCRNs); }; /*######################################################## * Extracts the text from a PDF and stores it in memory. * Also extracts the file name. * * param {string} : fileID : file ID of the PDF that the text will be extracted from. * * returns {array} : Contains the file name (section) and PDF text. * */ function getTextFromPDF(fileID) { var blob = DriveApp.getFileById(fileID).getBlob() var resource = { title: blob.getName(), mimeType: blob.getContentType() }; var options = { ocr: true, ocrLanguage: "en" }; // Convert the pdf to a Google Doc with ocr. var file = Drive.Files.insert(resource, blob, options); // Get the texts from the newly created text. var doc = DocumentApp.openById(file.id); var text = doc.getBody().getText(); var title = doc.getName(); // Deleted the document once the text has been stored. Drive.Files.remove(doc.getId()); return { name:title, text:text }; } /*######################################################## * Use the text extracted from PDF and extracts student id based on value parameters. * Also extracts the file name. * * param {string} : text : text of data from PDF. * * returns {array} : Of all student IDs found in text. * */ function extractStudentIDs(text){ const regexp = /20\d{7}/g; try{ let array = [...text.match(regexp)]; return array; }catch(e){ //Optional: If you want this info added to your Sheet data. Otherwise delete rows 98-99. let array = ["Not items found"] return array; } }; /*######################################################## * Takes the culminated list of IDs and sections and inserts them into * a Google Sheet. * * param {array} : data : 2d array containing a list of ids and their associated sections. * */ function importToSpreadsheet(data){ const sheet = SpreadsheetApp.openById(SS).getSheetByName(SHEET); const range = sheet.getRange(3,1,data.length,2); range.setValues(data); range.sort([2,1]); } |
Globals
Let’s take a look at our main Global variables:
1 2 3 4 |
//#####GLOBALS##### const FOLDER_ID = "1CnTA5AQo4xzo4TJAWpGtEPVRhh9jdCHl"; //Folder ID of all PDFs const SS = "1jJNhSkGq3SbqSnRXNRd4EhsYCjDwZ0LYZW6y3lkZVGw";//The spreadsheet ID const SHEET = "Extracted";//The sheet tab name |
To get the FOLDER_ID
constant, go to your folder in your Google Drive where you are keeping your PDF files you will be extracting your IDs from.
Change this out for your own folder ID.
The SS
constant stands for Spreadsheet. Here again, we want to grab the ID of the spreadsheet we just created. To do this, you use the same trick again. Go up to the URL and select the ID.
You will need to change this one too.
Finally, the constant variable SHEET
is the name of the sheet tab we created earlier. In our case, “Extracted”.
extractStudentIDsAndSectionToSheets()
The extractStudentIDsAndSectionToSheets()
function is the main execution function that will extract all our IDs and sections and insert them into our Google Sheet.
Let’s take a look:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 |
//#####GLOBALS##### const FOLDER_ID = "1CnTA5AQo4xzo4TJAWpGtEPVRhh9jdCHl"; //Folder ID of all PDFs const SS = "1jJNhSkGq3SbqSnRXNRd4EhsYCjDwZ0LYZW6y3lkZVGw";//The spreadsheet ID const SHEET = "Extracted";//The sheet tab name /*######################################################## * Main run file: extracts student IDs from PDFs and their * section from the PDF name from multiple documents. * * Displays a list of students and sections in a Google Sheet. * */ function extractStudentIDsAndSectionToSheets(){ const ss = SpreadsheetApp.getActiveSpreadsheet() //Get all PDF files: const folder = DriveApp.getFolderById(FOLDER_ID); const files = folder.getFilesByType("application/pdf"); let allIDsAndCRNs = [] //Iterate through each folderr while(files.hasNext()){ let file = files.next(); let fileID = file.getId(); const doc = getTextFromPDF(fileID); const studentIDs = extractStudentIDs(doc.text); //Add ID to Section name const studentIDsWithCRN = studentIDs.map( ID => [ID,doc.name]); //Optional: Notify user of process. You can delete lines 33 to 38 if(studentIDs[0] === "Not items found") { ss.toast("Not items found in " + doc.name, "Warning",2); }else{ ss.toast(doc.name + " extracted"); }; allIDsAndCRNs = allIDsAndCRNs.concat(studentIDsWithCRN); } importToSpreadsheet(allIDsAndCRNs); }; /*######################################################## * Extracts the text from a PDF and stores it in memory. * Also extracts the file name. * * param {string} : fileID : file ID of the PDF that the text will be extracted from. * * returns {array} : Contains the file name (section) and PDF text. * */ function getTextFromPDF(fileID) { const blob = DriveApp.getFileById(fileID).getBlob(); const resource = { title: blob.getName(), mimeType: blob.getContentType() }; const options = { ocr: true, ocrLanguage: "en" }; // Convert the pdf to a Google Doc with ocr. const file = Drive.Files.insert(resource, blob, options); // Get the texts from the newly created text. const doc = DocumentApp.openById(file.id); const text = doc.getBody().getText(); const title = doc.getName(); // Deleted the document once the text has been stored. Drive.Files.remove(doc.getId()); return { name:title, text:text }; }; /*######################################################## * Use the text extracted from PDF and extracts student id based on value parameters. * Also extracts the file name. * * param {string} : text : text of data from PDF. * * returns {array} : Of all student IDs found in text. * */ function extractStudentIDs(text){ const regexp = /20\d{7}/g; try{ let array = [...text.match(regexp)]; return array; }catch(e){ //Optional: If you want this info added to your Sheet data. Otherwise delete rows 98-99. let array = ["No items found"] return array; } }; /*######################################################## * Takes the culminated list of IDs and sections and inserts them into * a Google Sheet. * * param {array} : data : 2d array containing a list of ids and their associated sections. * */ function importToSpreadsheet(data){ const sheet = SpreadsheetApp.openById(SS).getSheetByName(SHEET); const range = sheet.getRange(3,1,data.length,2); range.setValues(data); range.sort([2,1]); } |
First up, on line 9, we call the SpreadsheetApp class and ask it to ready the active spreadsheet. This will be the one you are working in. We’ll store this in the ss
variable and use it later to send toast popups to update the user on the progress of the code.
Getting a list of PDF FILES
Next, we grab the folder that our PDFs are in using the DriveApp class. The Drive App class allows you to add, remove, list and move files in your Google drive.
From the DriveApp we call the getFolderById(FOLDER_ID)
method so that later we can look inside and find all the PDF files. Notice the FOLDER_ID
global is being used here. (Line 11)
We only want to search for files that are PDFs. We can ask Google Apps Script to do this by using it to only get files that are of the type, PDF. (Line 12)
const files = folder.getFilesByType("application/pdf");
You can find more Mime types here.
The files
constant variable prepares a generator containing all the related data for each of the PDFs in our assigned folder.
Storing our results
Remember, our goal is to grab a list of all of the IDs in each PDF file, assign them their section and put them all together in a list.
The list we will be updating is:
let allIDsAndCRNs = []
(Line 14)
This will ultimately contain a 2d array that might look a little like this:
1 2 3 4 5 6 7 8 9 |
let allIDsAndCRNs = [ ["202436717", "101-001"], ["202984726", "101-001"], ["201987793", "101-001"], ["209839311", "101-001"], ["201197374", "101-002"], ["203974666", "101-002"], ... ] |
We will populate this allIDsAndCRNs
variable with data from our while loop.
Looping through each file and extracting data
To loop through all the PDF files of our files
generator we use a while loop. We state that if files do have another PDF with all it’s associated data, then we want to do something with it:
while(files.hasNext()){
(line 16)
The first thing we want to do during each iteration is to get the data that is yielded from each call to the files
generator. We do this with the next()
method. We’ll set this method to the variable file
. (line 17)
Next, we want to grab the file’s ID. This will be a big long scramble of letters and numbers similar to the folder ID. This is stored in the fileID
variable. We will use this file ID reference to extract the IDs from the file. (line 18)
Extracting the text from the PDF
On line 20, our file ID is sent off to the getTextFromPDF()
function. As the name suggests, this function will extract all the data from the current PDF file and save it as a string in the doc
constant variable. The doc
variable will result in an object. doc.name
is the file name of the PDF and doc.text
is the entire text inside the PDF.
Extracting the student IDs into a string
On the next line (line 21), we then use the doc.text
data and send it to the extractStudentIDs
to grab all the ids. This will return an array of all the student IDs it found. For example:
1 2 3 4 5 6 7 8 9 |
studentIDs = [ "202436717", "202984726", "201987793", "209839311", "201197374", "203974666", ... ]; |
Adding the IDs to their assigned section.
Our next task is to assign the IDs to their section. We do this with the map
method. This method creates a new array by iterating through the current array and making your desired changes.
To do this I have used a simple arrow function ( =>
):
const studentIDsWithCRN = studentIDs.map( ID => [ID,doc.name]);
We call the map on our studentIDs
array. For our iterator variable, we will set to ID
. We then tell map that we want a new array that now contains a child array with the ID and the doc name as it iterates through each ID.
Keeping your users up to date
Extracting data from PDFs can take a bit of time. Once the code has been run, your users may be wondering what is taking so long. We should probably update them about where we are up to after we extract from each PDF. We might also want to warn the user if one of the PDFs doesn’t have any usernames in the file.
Fortunately, we can use Google Apps Scripts toast
method inside SpreadsheetApp class. The toast
method takes one argument and two optional arguments:
- Message: string: the string message you want to display for the users.
- Title: string: You can fancy this up by including a title if you wish.
- Timeout duration: number: Time in seconds that you want the toast to appear for.
Take a look a the code:
1 2 3 4 5 |
if(studentIDs[0] === "No items found") { ss.toast("No items found in " + doc.name, "Warning",2); }else{ ss.toast(doc.name + " extracted"); }; |
First, we check if the zeroeth item in the studentIDs
array contains the text “No items found”. This text is generated in the extractStudentIDs
function if there’s are no items with the search ID condition met (more on this later).
So, if one of our PDF files does not contain any IDs we send a toast to the user with a message that says there are no items found in the file (doc.name
). We add a nice header called “Warning” and keep it displayed for 2 seconds.
Alternatively, if there are items, we simply want to update the user that the current PDF file has been extracted.
Finishing the iteration of the loop
The last part of each iteration of the while loop is to add all studentIDsWithCRN
data to the end of the main allIDsAndCRNs
2d array.
We achieve this here with a contact method which concatenates or joins one array to the other.
allIDsAndCRNs = allIDsAndCRNs.concat(studentIDsWithCRN);
Inserting the IDs and Sections(CRNs) into the spreadsheet
The last task is to insert the IDs and section data into the spreadsheet. We do this by running the importToSpreadsheet
function with our full set of accumulated IDs and sections as the argument.
importToSpreadsheet(allIDsAndCRNs);
getTextFromPDF(fileID)
The getTextFromPDF()
function is called from the main extractStudentIDsAndSectionToSheets()
. It takes a file ID as parameter and returns an array containing the file name of the PDF and a long string of all the PDF data contained within the file.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
/*######################################################## * Extracts the text from a PDF and stores it in memory. * Also extracts the file name. * * param {string} : fileID : file ID of the PDF that the text will be extracted from. * * returns {array} : Contains the file name (section) and PDF text. * */ function getTextFromPDF(fileID) { const blob = DriveApp.getFileById(fileID).getBlob(); const resource = { title: blob.getName(), mimeType: blob.getContentType() }; const options = { ocr: true, ocrLanguage: "en" }; // Convert the pdf to a Google Doc with ocr. const file = Drive.Files.insert(resource, blob, options); // Get the texts from the newly created text. const doc = DocumentApp.openById(file.id); const text = doc.getBody().getText(); const title = doc.getName(); // Deleted the document once the text has been stored. Drive.Files.remove(doc.getId()); return { name:title, text:text }; }; |
This process makes use of Google Docs ability to transform files like PDFs using Optical Character Recognition (OCR). What we will do is create a Google doc out of our PDF, then extract all the text from the doc and save it to a variable before deleting the Google Doc.
Creating the Google Doc from a PDF
To create a Google Doc from a PDF in Google Apps Script we need to make use of the Drive API in Advanced Google Services. However, before you can use this API, you will need to initialise it for your project.
To do this go to Services in your Google Apps Script editor.
A dialogue box will appear with a list of APIs. Scroll down until you find the Drive API and click on it. Select V2 as the version from the dropdown menu. Click the Add button.
To create a Google Doc version of our PDF we are going to use the insert method of the files resource in the Drive API. This will take 3 arguments:
- resource: object: This is an object containing the file’s metadata like its file name and mime-type.
- blob: data: This is all the actual file data that we will transform into a Google Doc.
- option: object: These are all the optional query parameters you wish to add. For us, we will use this to request the OCR and set the OCR language to English.
You can see all this put together on line 21:
var file = Drive.Files.insert(resource, blob, options);
Back up on line 11, we grab the blob. The blob stands for binary large object and it contains all the data inside the file. We use our DriveApp again to get the current PDF file and then access its internal data.
var blob = DriveApp.getFileById(fileID).getBlob();
Back up on lines 12 to 15, you can see that we wanted to maintain the title of the new Google Doc so we added the title. We also added the mime type to help us to transform the PDF to our Google Doc.
1 2 3 4 |
const resource = { title: blob.getName(), mimeType: blob.getContentType() }; |
Lines 16 to 19 set up our Optical Character Recognition (OCR). First is asks us if we want to convert the documents using OCR which we select as true. Then we determine that the language is in English.
1 2 3 4 |
var options = { ocr: true, ocrLanguage: "en" }; |
Extract the text from the Google Doc
The next step is to extract all the text from our newly created Google Doc. To do this we need to open our newly created document with the DocumentApp class:
var doc = DocumentApp.openById(file.id);
Once we have the file we grab the body data and that data’s text.
var text = doc.getBody().getText();
We will also grab the name of the new doc.
const title = doc.getName();
Tidying Up and sending away
Once we have our text and title stored, we can go ahead and delete the newly created Google Doc. To delete the file completely I recommend using the remove method in the Drive API. This will permanently delete the file instead of just sending it to the trash.
Drive.Files.remove(doc.getId());
To do this we simply provide the file ID of the newly created document.
The final step is to return an object containing the title (The section name) and the text of data back to our main function.
extractStudentIDs()
The extractStudentIDs(text)
function takes a text file as a parameter and returns an array of all the ids within the text file.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
/*######################################################## * Use the text extracted from PDF and extracts student id based on value parameters. * Also extracts the file name. * * param {string} : text : text of data from PDF. * * returns {array} : Of all student IDs found in text. * */ function extractStudentIDs(text){ const regexp = /20\d{7}/g; try{ let array = [...text.match(regexp)]; return array; }catch(e){ //Optional: If you want this info added to your Sheet data. Otherwise delete rows 98-99. let array = ["No items found"] return array; } }; |
To find all the IDs in our text we use a regular expression. Regular expressions will allow us to match all the student IDs based on a particular pattern.
The Regular Expression
For our example, we know that all our student IDs start with a 20 and have 7 more numerical characters proceeding it.
We express this in this manner:
/20\d{7}/g;
The two slashes indicate the regular expression area. The /g
says that we are looking for all occurrences of the pattern, not just the first one. The 20
is simply our first two digits that are common for all ids. The \d
indicates that we want to look for digits (0-9). Finally, the {7}
tells us how many digits we want to look for.
If you are new to regular expressions, I have a tutorial on how I use them in Google Sheets that you might find useful. You can check it out here:
Google Sheets: Counting and Filtering Rows where Cells Contain Particular Values.
You can modify the regular expression to suit your own needs here. Getting regular expression right can be a little tricky but there are a lot of resources out there to help you.
Checking and Storing the results
Whenever we find an ID we want to add it to an array. We can do this really elegantly by combining the spread operator with the match
method. However, if there is no ID in our string then we will get an error. We want to handle this with our try
error handling statement.
If there are no IDs in the text then we will set our array to equal “No items found” in our catch
statement and return it to our main function. (line 17 & 18)
If there are ids then we set our array to equal a list of all the IDs it finds. This will then be returned to the main function.
importToSpreadsheet(data)
This function takes the accumulated ids and sections from all the PDFs in a 2d array and inputs them in your Google Sheet.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
/*######################################################## * Takes the culminated list of IDs and sections and inserts them into * a Google Sheet. * * param {array} : data : 2d array containing a list of ids and their associated sections. * */ function importToSpreadsheet(data){ const sheet = SpreadsheetApp.openById(SS).getSheetByName(SHEET); const range = sheet.getRange(3,1,data.length,2); range.setValues(data); range.sort([2,1]); }; |
Here we will use the SpreadsheetApp class again.
Our first task is to locate the correct Google Sheet and its sheet tab. (line 9)
Then we want to grab the range of cells we will insert our new dataset into. We do this with the getRange method. This method can take either an argument as numbers of columns and rows or use A1Notation. For me, it is usually easier to use the number-based arguments.
We will start in cell A3
. So we will add 3 for the number of rows down and 1 for the number of columns across as our first two arguments. The next arguments are the depth of the row. We calculate this by simply getting the length of our data
. Lastly, we add 2 for the total width of columns. We know it will only be two across because we are only displaying a column of IDs and a column of sections.
Next, we set the values into the spreadsheet. This is the point where the data is pasted into the sheet for the user to see.
Finally, we give the data a sort by section and then id using the sort method.
Connecting the Button and Running the script
Remember all the way back at the top of this tutorial, you created a button to run your code from? To connect your button first copy the main function name extractStudentIDsAndSectionToSheets
. Got to your Google Sheet and right-click on the button image and select Assign a script. Then paste in the function name.
Now click on the Extract button. If it is your first time your script will ask you for permissions to run. Accept them and your script will work.
Conclusion
That wraps it up.
You can change the regular expression for your own project. You may even want to build from this script to search for multiple patterns in the script and store the results in a multi-column 2d array.
One limitation you might face is that the OCR may not be good enough for images that have been converted to PDFs rather than text to PDFs.
Anyway, give it a go, I would love to hear how you used this in your own projects and if you like this post, why not subscript (top right). I post 1-2 times a month and sometimes a little more over summer.
Need help with Google Workspace development?
My team of experts can help you with all of your needs, from custom app development to integrations and security. We have a proven track record of success in helping businesses of all sizes get the most out of Google Workspace.
Schedule a free consultation today to discuss your needs and get started or learn more about our services here.
Changelog
2024-06-12
- Added extra information for the user to change the Drive API version to version 2.
This is a great tutorial! Do you have any good references to enhance this script to search for multiple patterns in the script and store the results in a multi-column 2d+ array. I’ve been trying to tie this with other sources by have struggled to make it work.
Obrigado aqui do Brasil (Thanks from Brazil).
You’re very welcome, rigos.
Hello,
I hope your day is going well.
I receive the error, “Exception: The number of rows in the range must be at least 1.”
Additionally, on line 113 it says the following:
const range = sheet.getRange(3,1,data.length,2);
Where in the code is the variable data declared?
All the best,
Jake
Hi Jacob,
Data is a parameter of the function importToSpreadsheet(data).
It is derived from the allIDsAndCRNs 2d array variable in the main extractStudentIDsAndSectionToSheets() function.
My guess is that there were no items found that met your search parameters.
Hope that helps.
~Yagi
Thanks to the script, on my end I just needed to read PDF’s and rename the file using data from content, but I’m facing an issue, while folders contain small amount of PDF’s works with no issues, but a folder with over 50 files I get:
“GoogleJsonResponseException: API call to drive.files.insert failed with error: Internal Error” Line 78
on line 78: var file = Drive.Files.insert(resource, blob, options);
Hi Hzo, I’m not 100% sure what is occurring based on that error. I do have a feeling that you might have reached a memory or processing limit using this approach. Roughly how long does it take for this error to occur?
Hey Yagi, It takes about 6 seconds. I have made additional tests, some of these PDF have large images in it, I guess that might be the issue.
File A = 50mb **crashes
File B = 20mb **works
Thanks again.
Hi Hzo,
I think you are onto something there with the images. The code converts the PDF into a Google Doc before extracting the text, but this process struggles with non-text based items.
You might have to go back to the blob data and find a specific JS library to extract your data.
This is a great tutorial! Do you have any good references to enhance this script to search for multiple patterns in the script and store the results in a multi-column 2d+ array. I’ve been trying to tie this with other sources but have struggled to make it work.
Hi Michael,
Thanks.
For me, I would update the
extractStudentIDs(text)
function and create an array of search patterns. Then loop through each search pattern updating theconst regexp
with the new search pattern. I’d imagine you would want to put each column so you would first need to add the results of each search to an inner array (column array) then at the end of your loop add that array to the outer (Row) array.Hopefully, that makes a little sense.
Let me know if you have any more questions.
Cheers,
Hi,
Thank’s for your script Yagi !!!!
Anyone know how read the text of interactive fields in PDF file ? When the script convert PDF to DOC the interactive fields disappear. It’s possible create a new method for extract interactive fields text ?
Thank’s
ProfeFP
How to extract text from image in google drive?
Great job, Yagi. I really liked it.
How do you delete unnecessary characters from the text.match(regexp)??
I want to replace the date from yyyy-mm-dd to yyyy/mm/dd.
So I want to replace “-” with “/”.
Is it hard to do?
Hi Toffee,
Great to hear you enjoyed the post.
If when you extract your dates you want to paste them with / instead of – , your best bet is to run a small function after you have run your equivalent of extractStudentIDs(text) for your project.
Something like this would do the trick:
dates = ["2020-10-01","2020-09-31","2020-02-08","2020-05-22"] // This would be your data extracted from your pdf
let changeDates = dates.map(element => element.replaceAll("-","/"))
console.log(changeDates)
Live Demo
If you are planning on updating the pdf, well.. that is a little bit more tricky and if your pdf is a complex document then I would not recommend using Google Apps Script just yet to update and change, though I am sure that they are on the case. If, however, your pdf is fairly basic, you might have luck extracting it as a blob, updating it with Google Apps Script with Google Sheets and then exporting it again as a PDF. Parts of the tutorial below may prove useful for this:
Create a Student Certificate of Attendance with Google Slides and Export it as a PDF or Print it
Cheers,
Yagi
Hi,
How to extract more than 5 data fields from pdf? Excellent content.
Thanks!
Hi Yagi, Great post! Very helpful and useful. However, I am having an issue. My pdf file is more than 200 pages but the .doc file created seems to have an 80-page limit.
Ooof! I think you may have hit a max boundary, Mohammad. I’d say the 80 page limit is the max memory cache value. You could possibly loop through the document and register a block at a time, but I just don’t think Apps Script was designed for that kind of work. Maybe try using PDF Parse or something like it for your solution.
~Yagi
Hello Sir,
I needs to add more columns in google sheet ID, Date, Designation, Name Like this, So how to do that
I am more of a user than scriptor now-a-days, but I could manage if needed.
I am sharing a lot of PDF files to different people, from my google drive lately. One, in particular, is my father.
In order to make good use of collaboration we discuss technical manuals often and I need to lead him to certain spots in a document. **First warning about size: I see a comment that maybe size plays a role in an error, so perhaps that is an issue for my use-case.
As far as Google is concerned, I should be able to open a PDF, add a comment at a page, and grab a link to that comment to send to him.
This often works, but also often it brings up the pdf doc to a page far away form the comment position.
**Multiple attempts to paste&go in the browser will bring up positions all over throughout the pdf document.
I went searching for a solution and found none, so I decided that to see a list of comments might be of benefit. *He could scroll down to the one we will talk about.
…This seems to NOT be an option in the Google viewer – Show List of comments to the PDF.
This has me thinking, based on your exercise, is there a reasonably simple way to use your approach to list out the comments in a pdf? And even point to the position of these comments?
I would PAY to be able to do this consistently.
Thank you for sharing all that you do! It has been a welcome experience in my digital – Google work flow.
great useful tool it is.
How can we extract the two patterns using the same function. my example is a. 64021920.1.1 and 64021920.12.1. i can able to extract from only pattern as of now.
Google removed V2, so this doesn’t work anymore.
Not true. V2 is still available from the dropdown menu in the Google Drive API under Version. You can see a dropdown selector to the right of the field.
I have updated the tutorial with some more clarity on this.