Create an index of my archive of PPT presentations with Python

3 min read · May 10, 2020

I am a collector, so I have thousands of PowerPoint files on my PC. I tried to organize them into folders, but with thousands of files you just can’t do it properly. I need a better way to index my PPT files.

What do I need

I need a script that can scan my PPT archive and extract the keywords from the PPT content so that I can tag each file for future reference.

To do this, I first need a library to read the text in a PPT file. Then I need a named entity recognition (NER) library to extract the terms I am looking for. Luckily, both exist:

python-pptx for processing pptx files (note: it only works on .pptx files, not the older .ppt format)
spaCy for named entity recognition (NER). My other article, “A basic Named entity recognition (NER) with SpaCy in 10 lines of code in Python”, covers how to extract matched terms from a document. I will reuse a script I wrote there in this project.
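
Since python-pptx only reads .pptx files, it helps to filter the archive before processing anything. Here is a minimal sketch of that filter using only the standard library. The function name `find_pptx_files` is made up for illustration and is not part of my script below. Skipping the "~$" files is an extra assumption on my side: PowerPoint leaves such lock files next to decks that are currently open, and they are not real presentations.

```python
from pathlib import Path


def find_pptx_files(root):
    """Return all .pptx files under root (recursively), sorted by path."""
    found = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # Accept .pptx in any letter case; legacy .ppt files are skipped.
        if path.suffix.lower() != ".pptx":
            continue
        # Skip "~$name.pptx" lock files left next to decks that are open.
        if path.name.startswith("~$"):
            continue
        found.append(path)
    return sorted(found)
```

You would call it as `find_pptx_files('C:/xxxx')` and loop over the result.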

Get texts from PPT files

The idea is simple. I’d like to get all the text from a PPT file and join it into one long string. Then I will pass this string to spaCy, which tells me the top matching keywords. python-pptx handles the first part, and it is simple to use. You just need to:

open a pptx file as a presentation
loop through slides
find text frame shapes on the slide
get all the paragraphs
then extract the text

Put it in code:

from pptx import Presentation

# inputFilePath is the path to one .pptx file
pres = Presentation(inputFilePath)
content = ""
# Collect all texts from the PPT
for slide in pres.slides:
    for shape in slide.shapes:
        if shape.has_text_frame:
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    # append each run so the whole deck ends up in one string
                    content += run.text + '\n'

Extract matched terms from the text

You can use spaCy’s PhraseMatcher or token-based Matcher to extract matching keywords. See my other article for details: A basic Named entity recognition (NER) with SpaCy in 10 lines of code in Python

However, I used spaCy’s EntityRuler to get the keywords I am looking for. I explained why in another article, so I will not cover it again here. Please check it out: A Closer Look at EntityRuler in SpaCy Rule-based Matching
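
If you just want to see the idea of "find my terms in the text and keep the most frequent ones" without setting up spaCy, here is a minimal plain-Python sketch. It is a stand-in, not my EntityRuler script: the term list and the function name `top_matched_terms` are hypothetical, and it does simple case-insensitive phrase counting rather than rule-based matching.

```python
import re
from collections import Counter

# Hypothetical term list; in my script this role is played by EntityRuler patterns.
TERMS = ["machine learning", "blockchain", "cloud computing"]


def top_matched_terms(text, terms=TERMS, top_n=3):
    """Return up to top_n terms found in text, most frequent first."""
    counts = Counter()
    for term in terms:
        # \b stops a term from matching in the middle of a longer word
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            counts[term] = hits
    return [term for term, _ in counts.most_common(top_n)]
```

Feed it the long string collected from a deck and it returns a short list of tags for that file.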

Put it to the test

I ran it on 200+ PPT files, and it took 143 seconds, or roughly 0.7 seconds per file. Quite efficient!

Finish it off by saving the extracted keywords to a file. Now I can search my archive more efficiently! Mission accomplished. Time for bed~

The code is here:

import os
from datetime import datetime
from pptx import Presentation
# My own helper script from the spaCy articles mentioned above (not a pip package)
import spacyMatcherEntityRuler as matcher
import time

# This is the folder of your pptx files
filePath = 'C:/xxxx'

pptTags = []


def processPPT(inputFilePath):
    global pptTags
    slideCount = 0  # incremented once per slide below
    filename, file_extension = os.path.splitext(inputFilePath)
    #check the file extension is pptx
    if file_extension.lower() == '.pptx':
        print("Processing : " + inputFilePath)
        try:
            pres = Presentation(inputFilePath)
            content = ""
            # Collect all texts from the PPT
            for slide in pres.slides:
                slideCount += 1
                for shape in slide.shapes:
                    if (shape.has_text_frame):
                        for paragraph in shape.text_frame.paragraphs:
                            for run in paragraph.runs:
                                # print(run.text)
                                content += run.text + '\n'
            # Run spaCy to get the top 3 matched terms
            tags = matcher.myEntityRulerMatcher(content)
            oneItem = inputFilePath + '|' + '|'.join(map(str, tags))
            pptTags.append(oneItem)
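        except Exception as err:
            # e.g. a corrupt or unreadable file; log the reason and move on
            print("Failed processing: " + inputFilePath + " (" + str(err) + ")")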
    else:
        print("Not a pptx file: " + inputFilePath)


def processFolder(folderPath):
    for file in os.listdir(folderPath):
        # read all files and folder
        fileNameIn = os.path.abspath(os.path.join(folderPath, file))
        # print(fileNameIn)
        # if this is a folder, read all files inside
        if os.path.isdir(fileNameIn):
            processFolder(fileNameIn)
        # if it's file, process it
        else:
            processPPT(fileNameIn)


def writeToFile(pptTags):
    file = os.path.abspath(os.path.join(filePath, 'PPT Index_' + datetime.now().strftime("%Y%m%d") + '.txt'))
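    try:
        # the with-block closes the file even if the write fails;
        # utf-8 keeps non-ASCII slide text from breaking the write
        with open(file, "w", encoding="utf-8") as fileObject:
            fileObject.write('\n'.join(pptTags))
    except IOError as err:
        print("Write to file failed")
        print(str(err))


def main(argv=None):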
    tic = time.perf_counter()
    processFolder(filePath)
    writeToFile(pptTags)
    toc = time.perf_counter()
    print(f"Time used: {toc - tic:0.4f} seconds")


if __name__ == "__main__":
    main()
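
Once the index file exists, searching it takes only a few lines. Each line in the file has the form `path|tag1|tag2|...`, which is what `processPPT` writes above. The sketch below assumes exactly that format. The names `parse_index_line` and `search_index` are made up for illustration and are not part of the script.

```python
def parse_index_line(line):
    """Split one 'path|tag1|tag2' line into (path, [tags])."""
    path, *tags = line.rstrip("\n").split("|")
    return path, tags


def search_index(lines, keyword):
    """Return the paths whose tags contain keyword (case-insensitive)."""
    keyword = keyword.lower()
    matches = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        path, tags = parse_index_line(line)
        if any(keyword == tag.lower() for tag in tags):
            matches.append(path)
    return matches
```

To use it, open the index file and pass the file object in, for example `search_index(open(indexFile, encoding="utf-8"), "blockchain")`, where `indexFile` is the path of the .txt file written by `writeToFile`.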

Written by Aaron Yu