Create an index of my archive of PPT presentations with Python

Aaron Yu
Published in The Startup
3 min read · May 10, 2020

I am a collector, so I have thousands of PowerPoint files on my PC. I tried to organize them into folders, but with thousands of files that just does not work. I need a better way to index my PPT files.

What I need

I need a script that can scan my PPT archive and extract the keywords from the PPT content so that I can tag each file for future reference.

To do this, I first need a library to read the text in a PPT file. Then I need a named entity recognition (NER) library to extract the terms I am looking for. Luckily, both exist: python-pptx for reading the files and spaCy for the NER.

Get texts from PPT files

The idea is simple. I get all the text from a PPT file and join it into one long string. Then I pass this string to spaCy, which tells me the top matching keywords. python-pptx handles the first part, and it is simple to use. You just need to:

  • open a pptx file as a presentation
  • loop through slides
  • find text frame shapes on the slide
  • get all the paragraphs
  • then extract the text

Put it in code:

from pptx import Presentation

# inputFilePath is the path to one .pptx file
pres = Presentation(inputFilePath)
slideCount = 0
# Collect all texts from the PPT
for slide in pres.slides:
    slideCount += 1
    for shape in slide.shapes:
        # Only shapes with a text frame contain paragraphs
        if shape.has_text_frame:
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    print(run.text)

Extract matched terms from the text

You can use the spaCy PhraseMatcher or the token-based Matcher to extract matching keywords. See my other article for details: A basic Named entity recognition (NER) with SpaCy in 10 lines of code in Python

However, I used the spaCy EntityRuler to get the keywords I am looking for. I explained why in another article, so I will not cover it again here: A Closer Look at EntityRuler in SpaCy Rule-based Matching
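
The matcher itself lives in that other article, so it is not reproduced here. As a rough idea of what such a helper could look like, here is a minimal sketch. It assumes spaCy v3, where the ruler is added with nlp.add_pipe("entity_ruler"), and a made-up pattern list. PATTERNS and top_keywords are hypothetical names, not the actual spacyMatcherEntityRuler code.

```python
from collections import Counter

import spacy

# Hypothetical keyword patterns; replace them with the terms you want to tag.
# LOWER makes each token match case-insensitively, and "id" is the tag reported.
PATTERNS = [
    {"label": "TOPIC", "id": "machine learning",
     "pattern": [{"LOWER": "machine"}, {"LOWER": "learning"}]},
    {"label": "TOPIC", "id": "blockchain",
     "pattern": [{"LOWER": "blockchain"}]},
    {"label": "TOPIC", "id": "cloud computing",
     "pattern": [{"LOWER": "cloud"}, {"LOWER": "computing"}]},
]

# A blank English pipeline only tokenizes, so no model download is needed
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")  # spaCy v3 style
ruler.add_patterns(PATTERNS)


def top_keywords(text, n=3):
    """Return the n most frequently matched pattern ids in text."""
    doc = nlp(text)
    counts = Counter(ent.ent_id_ for ent in doc.ents)
    return [keyword for keyword, _ in counts.most_common(n)]
```

Counting by pattern id rather than by matched text means "Blockchain" and "blockchain" end up under the same tag.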

Put it to the test

I ran it on 200+ PPT files, and it took 143 seconds, which works out to about 0.7 seconds per file or less. Quite efficient!

Finish it off by saving the extracted keywords to a file, and now I can search my archive more efficiently! Mission accomplished. Time for bed~

The code is here:

import os
import time
from datetime import datetime

from pptx import Presentation

# My own helper module that wraps the spaCy EntityRuler (see the article above)
import spacyMatcherEntityRuler as matcher

# This is the folder of your pptx files
filePath = 'C:/xxxx'
# One entry per presentation: "path|tag1|tag2|..."
pptTags = []


def processPPT(inputFilePath):
    filename, file_extension = os.path.splitext(inputFilePath)
    # check the file extension is pptx
    if file_extension.lower() == '.pptx':
        print("Processing : " + inputFilePath)
        try:
            pres = Presentation(inputFilePath)
            content = ""
            # Collect all texts from the PPT
            for slide in pres.slides:
                for shape in slide.shapes:
                    if shape.has_text_frame:
                        for paragraph in shape.text_frame.paragraphs:
                            for run in paragraph.runs:
                                content += run.text + '\n'
            # Run spaCy to get the top 3 matched terms
            tags = matcher.myEntityRulerMatcher(content)
            oneItem = inputFilePath + '|' + '|'.join(map(str, tags))
            pptTags.append(oneItem)
        except Exception as err:
            # Report the reason instead of silently swallowing it
            print("Failed processing: " + inputFilePath)
            print(str(err))
    else:
        print("Not a pptx file: " + inputFilePath)


def processFolder(folderPath):
    for file in os.listdir(folderPath):
        # read all files and folders
        fileNameIn = os.path.abspath(os.path.join(folderPath, file))
        # if this is a folder, read all files inside
        if os.path.isdir(fileNameIn):
            processFolder(fileNameIn)
        # if it's a file, process it
        else:
            processPPT(fileNameIn)


def writeToFile(pptTags):
    file = os.path.abspath(os.path.join(filePath, 'PPT Index_' + datetime.now().strftime("%Y%m%d") + '.txt'))
    try:
        # "with" closes the file even if the write fails
        with open(file, "w", encoding="utf-8") as fileObject:
            fileObject.write('\n'.join(pptTags))
    except IOError as err:
        print("Write to file failed")
        print(str(err))


def main(argv=None):
    tic = time.perf_counter()
    processFolder(filePath)
    writeToFile(pptTags)
    toc = time.perf_counter()
    print(f"Time used: {toc - tic:0.4f} seconds")


if __name__ == "__main__":
    main()
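
The script writes one line per presentation in the form path|tag1|tag2|..., so searching the archive later takes only a few lines. A minimal sketch follows; search_index is a hypothetical helper name, and it assumes the index file can be read as UTF-8 and that no file path contains a | character (Windows does not allow one).

```python
def search_index(index_path, keyword):
    """Return the paths of all presentations tagged with keyword.

    Hypothetical helper: expects one "path|tag1|tag2|..." entry per line.
    """
    keyword = keyword.strip().lower()
    matches = []
    # Assumption: the index file is UTF-8 (plain ASCII content also reads fine)
    with open(index_path, encoding="utf-8") as index_file:
        for line in index_file:
            # The first field is the file path, the rest are the tags
            path, *tags = line.rstrip("\n").split("|")
            if any(keyword == tag.strip().lower() for tag in tags):
                matches.append(path)
    return matches
```

For example, search_index('PPT Index_20200510.txt', 'python') would list every presentation whose tags include "python", ignoring case.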


I am not a coder, but I like solving problems programmatically