PDFInfo: Quick Guide to Extracting Metadata from PDFs
PDF metadata—title, author, creation date, page count, and more—helps you organize, audit, and automate workflows that involve PDF files. pdfinfo, a lightweight command-line tool from the Poppler (or Xpdf) suite, quickly exposes this metadata so you can inspect PDFs without opening them in a GUI. This guide covers installation, common commands, useful flags, output parsing, and automation tips.
What pdfinfo shows
pdfinfo reports common metadata and file details such as:
- Title, Author, Subject, Keywords
- Creator (software that generated the PDF)
- Producer (PDF library that produced the file)
- CreationDate / ModDate
- Tagged (whether the PDF includes accessibility tagging)
- Encrypted (encryption status)
- Page count, Page size, and PDF version
- File size and linearization / fast web view indicators
Install pdfinfo
- macOS: brew install poppler
- Debian/Ubuntu: sudo apt-get install poppler-utils
- Fedora: sudo dnf install poppler-utils
- Windows: install Poppler binaries (add to PATH) or use WSL and follow Linux steps.
Basic usage
Run pdfinfo against a PDF file:
Code
pdfinfo file.pdf
Typical output is a line-by-line list of metadata fields and values.
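Because each output line is a field name, a colon, and a value, the output maps naturally onto a dictionary. The following is a minimal Python sketch, assuming pdfinfo is on your PATH; the helper names parse_pdfinfo and read_pdfinfo are made up for this example and are not part of Poppler.

```python
import subprocess


def parse_pdfinfo(text: str) -> dict:
    """Turn pdfinfo's 'Key: value' lines into a dictionary of strings."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as timestamps
        # (which contain colons themselves) stay intact.
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that have no colon at all
            fields[key.strip()] = value.strip()
    return fields


def read_pdfinfo(path: str) -> dict:
    """Run pdfinfo on one file (assumes the pdfinfo binary is on PATH)."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

Calling read_pdfinfo("file.pdf")["Pages"] would then give the page count as a string. Every value stays a string here; converting to int or a date is left to the caller.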
Useful flags
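-meta
Prints the XML metadata block (XMP) if present: pdfinfo -meta file.pdf
-box
Shows page box sizes (MediaBox, CropBox, BleedBox, TrimBox, ArtBox): pdfinfo -box file.pdf
-f, -l
Set the first and last page to examine. With a page range, pdfinfo reports the size of each page in it; without one, only page one is examined: pdfinfo -f 1 -l 5 file.pdf
-rawdates
Shows raw date strings from the PDF (no post-processing): pdfinfo -rawdates file.pdf
-enc
Sets the encoding used for text output (UTF-8 by default): pdfinfo -enc UTF-8 file.pdf. Note that this flag is unrelated to encryption; encryption details appear in the Encrypted field of the normal output.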
Check pdfinfo -help for the full list on your system.
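The XMP block printed by -meta is XML, so it can be read with a standard XML parser rather than with text tools. Below is a rough Python sketch that pulls the Dublin Core title out of an XMP packet. It assumes your pdfinfo build prints only the XMP packet for -meta (if yours prints the regular fields first, strip them before parsing), and the function names xmp_title and read_xmp are invented for this example.

```python
import subprocess
import xml.etree.ElementTree as ET
from typing import Optional

# Namespaces used by XMP for Dublin Core properties and RDF containers.
DC = "{http://purl.org/dc/elements/1.1/}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def xmp_title(xmp_xml: str) -> Optional[str]:
    """Return the first dc:title entry in an XMP packet, or None."""
    if not xmp_xml.strip():
        return None  # the PDF has no XMP block
    root = ET.fromstring(xmp_xml)
    title = root.find(f".//{DC}title")
    if title is None:
        return None
    # dc:title is a language-alternative list; take its first rdf:li item.
    for item in title.iter(f"{RDF}li"):
        return item.text
    return None


def read_xmp(path: str) -> str:
    """Fetch the XMP packet via pdfinfo -meta (assumes pdfinfo is on PATH)."""
    result = subprocess.run(
        ["pdfinfo", "-meta", path], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A call such as xmp_title(read_xmp("file.pdf")) returns None both when there is no XMP block and when the block has no title, which keeps the caller's fallback logic simple.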
Parsing pdfinfo output in scripts
pdfinfo output is plain text; use standard CLI tools to extract fields.
- Extract page count (bash):
Code
pages=$(pdfinfo file.pdf | awk '/^Pages:/ {print $2}')
- Get title or fallback to filename:
Code
title=$(pdfinfo file.pdf | sed -n 's/^Title:[ ]*//p')
[ -z "$title" ] && title="$(basename file.pdf)"
- Extract creation date and convert to ISO (example using GNU date):
Code
raw=$(pdfinfo -rawdates file.pdf | sed -n 's/^CreationDate:[ ]*//p')
# raw might look like D:20220303120000-05'00'
# convert with custom parsing or use a library in higher-level languages
For robust parsing, prefer using a scripting language (Python, Node.js) and a PDF library that reads XMP or Info dictionaries directly.
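As one possible version of that custom parsing, here is a small Python sketch that converts a raw PDF date string (the D:YYYYMMDDHHmmSS form followed by an optional Z or a +HH'mm' / -HH'mm' offset) into ISO 8601. The function name pdf_date_to_iso is made up for this example, missing trailing fields are assumed to default to the first month, first day, and midnight, and unusual variants (for instance a Z followed by further offset digits) are rejected rather than guessed at.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY, then optional MM DD HH mm SS, then an optional Z or +HH'mm' / -HH'mm'.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def pdf_date_to_iso(raw: str) -> str:
    """Convert a raw PDF date string to an ISO 8601 timestamp."""
    m = _PDF_DATE.match(raw.strip())
    if not m:
        raise ValueError(f"not a PDF date string: {raw!r}")
    # Fill in defaults for any omitted trailing fields.
    defaults = (0, 1, 1, 0, 0, 0)
    year, month, day, hour, minute, second = (
        int(g) if g else d for g, d in zip(m.groups()[:6], defaults)
    )
    tz = None  # stays naive when the string carries no zone information
    if m.group(7):
        tz = timezone.utc
    elif m.group(8):
        offset = timedelta(hours=int(m.group(9)), minutes=int(m.group(10) or 0))
        tz = timezone(-offset if m.group(8) == "-" else offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz).isoformat()
```

Feeding it the CreationDate value from pdfinfo -rawdates yields a string that sorts and compares cleanly; dates without a zone come back without an offset so that no time zone is invented.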
Examples in Python
Using PyPDF2 to read basic metadata:
python
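from PyPDF2 import PdfReader

# Read the document information (Info) dictionary
reader = PdfReader("file.pdf")
info = reader.metadata  # None when the PDF has no Info dictionary
if info is not None:
    print(info.title, info.author, info.get("/CreationDate"))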
Note: PyPDF2 reads the document info dictionary; XMP metadata may require a different parser (e.g., pypdfium2 or direct XML parsing).
Automation tips
- Batch-check PDFs for missing metadata:
- Loop through files, call pdfinfo, and log missing Title/Author fields.
- Integrate into CI: fail builds if PDFs lack required metadata or are encrypted.
- Combine with exiftool or custom scripts to update metadata (some tools allow editing; pdfinfo is read-only).
- Normalize dates and author names using a mapping file in scripts.
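The batch check and CI ideas above can be sketched in a few lines of Python. This is only an illustration under some assumptions: pdfinfo is on PATH, the required fields are Title and Author (a hypothetical policy), and the names find_problems, audit, and main are invented for this example.

```python
import subprocess
import sys
from pathlib import Path

REQUIRED = ("Title", "Author")  # hypothetical policy; adjust to your needs


def parse_pdfinfo(text):
    """Turn pdfinfo's 'Key: value' lines into a dictionary of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def find_problems(info, required=REQUIRED):
    """List policy violations for one parsed pdfinfo result."""
    problems = [f"missing {name}" for name in required if not info.get(name)]
    # pdfinfo prints "yes (...)" or "no" in the Encrypted field.
    if info.get("Encrypted", "no").startswith("yes"):
        problems.append("encrypted")
    return problems


def audit(paths):
    """Return {path: [problems]} for every PDF that fails a check."""
    failures = {}
    for path in paths:
        proc = subprocess.run(
            ["pdfinfo", str(path)], capture_output=True, text=True
        )
        if proc.returncode != 0:
            failures[str(path)] = ["pdfinfo failed: " + proc.stderr.strip()]
            continue
        problems = find_problems(parse_pdfinfo(proc.stdout))
        if problems:
            failures[str(path)] = problems
    return failures


def main(directory):
    """Log failures to stderr; return 1 if any PDF failed, else 0."""
    failures = audit(sorted(Path(directory).glob("*.pdf")))
    for path, problems in failures.items():
        print(f"{path}: {', '.join(problems)}", file=sys.stderr)
    return 1 if failures else 0
```

In a CI step, something like sys.exit(main("docs")) would fail the build whenever a PDF in the (hypothetical) docs directory lacks a required field or is encrypted.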
Troubleshooting
- No metadata shown: PDF may lack an Info dictionary or XMP block; consider extracting XMP via pdfinfo -meta or using a PDF library.
- Dates look odd: PDF dates use the "D:YYYYMMDDHHmmSSOHH'mm'" format; use parsing utilities or libraries to normalize.
- Encrypted PDFs: pdfinfo will flag encryption; you may need to decrypt (if permitted) before extracting metadata.
Security and permissions
- pdfinfo reads files locally—ensure you have permission to access the files.
- Treat untrusted PDFs with care: parsing a malformed or malicious file can trigger bugs in the parser itself. Keep Poppler up to date and run pdfinfo in a sandbox if the content is suspicious.
Quick checklist
- Install poppler/poppler-utils.
- Run pdfinfo file.pdf for a quick view.
- Use -meta for XMP, -box for page boxes, -rawdates for raw timestamps.
- Script parsing with awk/sed or use PyPDF2 for programmatic access.
- Automate checks and integrate into CI for consistency.
Use pdfinfo whenever you need a fast, scriptable way to inspect PDF metadata without opening a viewer.