Create an index of my archive of PPT presentations with Python
I am a collector, so I have thousands of PowerPoint files on my PC. I tried to organize them into folders, but with thousands of files you just can’t do it properly. I need a better way of indexing my PPT files.
What I need
I need a script that can scan my PPT archive and extract the keywords from the PPT content so that I can tag each file for future reference.
To do this, I first need a library to read the text in a PPT file. Then I need a named entity recognition (NER) library to extract the terms I am looking for. Lucky for me, both exist:
- python-pptx for reading the files (note: it works only on .pptx files, not on the older .ppt format)
- spaCy for named entity recognition (NER). My other article, “A basic Named entity recognition (NER) with SpaCy in 10 lines of code in Python”, covers how to extract matched terms from a document. I will reuse a script I wrote there in this project.
Get texts from PPT files
The idea is simple. I’d like to get all the text from a PPT file and join it into one long string. Then I will pass this string to spaCy, which tells me the top matching keywords. python-pptx handles the first part, and it is simple to use. You just need to:
- open a pptx file as a presentation
- loop through slides
- find text frame shapes on the slide
- get all the paragraphs
- then extract the text
Put it in code:
from pptx import Presentation

# inputFilePath is the path to one .pptx file
pres = Presentation(inputFilePath)
# Walk through every slide and print the text of each run
for slide in pres.slides:
    for shape in slide.shapes:
        # Only shapes with a text frame carry paragraphs
        if shape.has_text_frame:
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    print(run.text)
Extract matched terms from the text
You can use spaCy’s PhraseMatcher or token-based Matcher to extract matching keywords. See my other article for details: A basic Named entity recognition (NER) with SpaCy in 10 lines of code in Python
However, I used spaCy’s EntityRuler to get the keywords I am looking for. I explained why in another article, so I will not cover it again here: A Closer Look at EntityRuler in SpaCy Rule-based Matching
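The matcher script itself lives in the other articles, so it is not reproduced here. To give a rough idea of what an EntityRuler-based matcher could look like, here is a minimal sketch. It assumes spaCy v3 and a blank English pipeline, so no model download is needed. The patterns, the TOPIC label, and the names build_nlp and top_terms are made up for illustration. This is not the myEntityRulerMatcher function used in this project.

```python
from collections import Counter

import spacy

# Hypothetical keyword patterns; LOWER makes the match case-insensitive
PATTERNS = [
    {"label": "TOPIC", "pattern": [{"LOWER": "machine"}, {"LOWER": "learning"}]},
    {"label": "TOPIC", "pattern": [{"LOWER": "python"}]},
    {"label": "TOPIC", "pattern": [{"LOWER": "cloud"}]},
]


def build_nlp(patterns):
    """Create a blank English pipeline with a rule-based entity ruler."""
    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)
    return nlp


def top_terms(nlp, text, n=3):
    """Return the n most frequent matched terms, lower-cased."""
    doc = nlp(text)
    counts = Counter(ent.text.lower() for ent in doc.ents)
    return [term for term, _ in counts.most_common(n)]


nlp = build_nlp(PATTERNS)
sample = (
    "Machine learning with Python\n"
    "Python basics\n"
    "machine learning in the cloud\n"
    "Python tips"
)
print(top_terms(nlp, sample))
```

Building the pipeline once and reusing it for every file keeps the per-file cost low, since creating the pipeline is the slow part.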
Put it to test
I ran it on 200+ PPT files, and it took 143 seconds, which works out to well under a second per file. Quite efficient!
I finished it off by saving the extracted keywords to a file, so now I can search my archive more efficiently! Mission accomplished. Time for bed~
The code is here:
import os
import time
from datetime import datetime

from pptx import Presentation

# My spaCy EntityRuler script from the articles mentioned above
import spacyMatcherEntityRuler as matcher

# This is the folder of your pptx files
filePath = 'C:/xxxx'

# One 'path|tag1|tag2|...' entry per processed presentation
pptTags = []


def processPPT(inputFilePath):
    filename, file_extension = os.path.splitext(inputFilePath)
    # check the file extension is pptx (upper or lower case)
    if file_extension.lower() == '.pptx':
        print("Processing : " + inputFilePath)
        try:
            pres = Presentation(inputFilePath)
            content = ""
            # Collect all texts from the PPT
            for slide in pres.slides:
                for shape in slide.shapes:
                    if shape.has_text_frame:
                        for paragraph in shape.text_frame.paragraphs:
                            for run in paragraph.runs:
                                content += run.text + '\n'
            # Run spaCy to get the top 3 matched terms
            tags = matcher.myEntityRulerMatcher(content)
            oneItem = inputFilePath + '|' + '|'.join(map(str, tags))
            pptTags.append(oneItem)
        except Exception as err:
            # A damaged or password-protected file should not stop the whole scan
            print("Failed processing: " + inputFilePath)
            print(str(err))
    else:
        print("Not a pptx file: " + inputFilePath)


def processFolder(folderPath):
    for file in os.listdir(folderPath):
        # read all files and folders
        fileNameIn = os.path.abspath(os.path.join(folderPath, file))
        # if this is a folder, read all files inside
        if os.path.isdir(fileNameIn):
            processFolder(fileNameIn)
        # if it's a file, process it
        else:
            processPPT(fileNameIn)


def writeToFile(pptTags):
    file = os.path.abspath(os.path.join(filePath, 'PPT Index_' + datetime.now().strftime("%Y%m%d") + '.txt'))
    try:
        # 'with' closes the file even if the write fails
        with open(file, "w", encoding="utf-8") as fileObject:
            fileObject.write('\n'.join(pptTags))
    except IOError as err:
        print("Write to file failed")
        print(str(err))


def main(argv=None):
    tic = time.perf_counter()
    processFolder(filePath)
    writeToFile(pptTags)
    toc = time.perf_counter()
    print(f"Time used: {toc - tic:0.4f} seconds")


if __name__ == "__main__":
    main()
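Once the index file is written, searching it takes only a few lines. Here is a minimal sketch, assuming each line has the path|tag|tag form that processPPT builds. The function names load_index and find_by_tag and the sample lines are hypothetical and only for illustration.

```python
def load_index(lines):
    """Parse 'path|tag1|tag2|...' lines into a {path: [tags]} dictionary."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        path, *tags = line.split("|")
        # A file with no matches is written as 'path|', which leaves an empty tag
        index[path] = [tag for tag in tags if tag]
    return index


def find_by_tag(index, keyword):
    """Return the paths whose tags contain the keyword, ignoring case."""
    keyword = keyword.lower()
    return [
        path
        for path, tags in index.items()
        if keyword in (tag.lower() for tag in tags)
    ]


# Made-up index lines standing in for the real index file
sample_lines = [
    "C:/decks/intro.pptx|python|cloud",
    "C:/decks/ml.pptx|machine learning|python",
    "C:/decks/empty.pptx|",
]
index = load_index(sample_lines)
print(find_by_tag(index, "Python"))
```

For the real file, open the index with encoding="utf-8" and pass the file object to load_index in place of sample_lines.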