
Operations on PDF Files


Table of Contents

• GhostView to convert PDF into other formats such as PNG
• Compress, merge, scale, rotate and delete pages from PDF files using PyPDF2
• Convert PDF to PNG using PyMuPDF
• Delete pages from a PDF file using PyPDF2

Python: PDF and File Utility Scripts

This Python code renames all files of a given type (say pdf or txt) in a folder by stripping a specified number of characters from the start of each file name and adding a new prefix. The code has to be placed in the folder containing the files. If you specified the wrong prefix, run the code again with the number of characters to strip set to the length of that prefix.
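The renaming rule described above can be sketched as follows. This is a minimal illustration, not the downloadable script: the function names, the dry_run switch and the example prefix are assumptions made for this sketch.

```python
from pathlib import Path

def stripped_name(name, n_strip, prefix):
    """Drop the first n_strip characters of a file name and prepend prefix."""
    return prefix + name[n_strip:]

def rename_files(folder, ext, n_strip, prefix, dry_run=True):
    """Return (old, new) name pairs; rename on disk only when dry_run is False."""
    pairs = []
    for path in sorted(Path(folder).glob(f"*.{ext}")):
        new_path = path.with_name(stripped_name(path.name, n_strip, prefix))
        pairs.append((path.name, new_path.name))
        # Never overwrite an existing file
        if not dry_run and not new_path.exists():
            path.rename(new_path)
    return pairs

# Preview only: nothing is renamed while dry_run is True
for old, new in rename_files(".", "pdf", 3, "Doc_"):
    print(old, "->", new)
```

Running with dry_run=True first shows the planned names, which makes it easy to correct the prefix or the number of stripped characters before anything changes on disk.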

Encrypt PDF Files: qpdf --encrypt user_pw owner_pw 256 -- input.pdf pdf_encrypted.pdf where user_pw and owner_pw are the user and owner passwords and 256 selects 256-bit encryption keys. To remove the password from a PDF file: gs -dNOPAUSE -dBATCH -q -sDEVICE=pdfwrite -sPDFPassword=pdf_passwd -sOutputFile=out_file.pdf -f file_with_pw.pdf


Compress all PDF Files in folder: Open-source option using Python.

In case you want to compress just one PDF file, one can use this code.


GhostView

A range of pages can be extracted from a PDF file using Ghostscript with the command: gswin64 -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -dFirstPage=14 -dLastPage=17 -sOutputFile=OUT.pdf In.pdf. The -dNOPAUSE option disables the prompt and pause after each page is processed, and the -dBATCH option causes Ghostscript to exit once the file specified in the command has been processed.

-sPageList=pagenumber There are three possible values for this; even, odd or a list of pages to be processed. A list can include single pages or ranges of pages. Ranges of pages use the minus sign '-', individual pages and ranges of pages are separated by commas ','. A trailing minus '-' means process all remaining pages. For example:

-sPageList=1,3,5 indicates that pages 1, 3 and 5 should be processed. -sPageList=even refers to all even-numbered pages and -sPageList=odd refers to all odd-numbered pages.

-sPageList=5-10 indicates that pages 5, 6, 7, 8, 9 and 10 should be processed

-sPageList=1,5-10,12- indicates that pages 1, 5, 6, 7, 8, 9, 10 and 12 onwards should be processed

Be aware that using the '%d' syntax for OutputFile does not reflect the page number in the original document. If you chose (for example) to process even pages by using -sPageList=even, then the output of -sOutputFile=out%d.png would still be out0.png, out1.png, out2.png...
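To check which pages a given -sPageList value selects before running Ghostscript, the rules above can be mimicked in a few lines of Python. This is only a sketch of the syntax as described here, not Ghostscript's own parser; the function name and the n_pages argument (total pages in the document) are assumptions.

```python
def expand_page_list(spec, n_pages):
    """Expand a -sPageList style value into 1-based page numbers."""
    all_pages = range(1, n_pages + 1)
    if spec == "even":
        return [p for p in all_pages if p % 2 == 0]
    if spec == "odd":
        return [p for p in all_pages if p % 2 == 1]
    pages = []
    for item in spec.split(","):
        if "-" in item:
            first, last = item.split("-")
            # A trailing '-' means "all remaining pages"
            stop = int(last) if last else n_pages
            pages.extend(range(int(first), min(stop, n_pages) + 1))
        elif int(item) <= n_pages:
            pages.append(int(item))
    return pages

print(expand_page_list("1,5-10,12-", 14))
print(expand_page_list("even", 5))
```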

To rasterize all of the text in a PDF [rasterizing converts an image described in a vector graphics format (shapes) into an image made of pixels, dots or lines]: Step-1) Convert the PDF into TIFF using Ghostscript: gswin64 -sDEVICE=tiffg4 -o Out.tif Inp.pdf. Step-2) Convert the TIFF back to PDF using "tiff2pdf -z -f -F -pA4 -o New.pdf Out.tif". Alternatively: "gswin64c -dNOPAUSE -dBATCH -dTextAlphaBits=4 -sDEVICE=ps2write -sOutputFile=Out.ps Inp.pdf" then "gswin64c -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=New.pdf Out.ps".

The option -dTextAlphaBits=4 enables font antialiasing [anti-aliasing makes the edges of graphics objects or fonts smoother]. It works only with raster output devices; with a vector device the error message "Can't set GraphicsAlphaBits or TextAlphaBits with a vector device." gets printed in the console, in which case the option should be removed from the command. Antialiasing is enabled separately for text and graphics content. Allowed values are 1, 2 or 4: the subsampling box size should be 4 for optimum output, but smaller values can be used for faster rendering. Resolution is set by -r300 for 300 dpi.

In imaging, aliasing refers to the stair-stepping of lines, and anti-aliasing refers to the reduction of that stair-stepping. Aliasing also refers to sampling a signal at too low a rate.

Switch from Windows OS to Linux


The following command creates a PNG file for each page of a PDF file. The command line option '-sDEVICE=device' selects which output device Ghostscript should use. If the device option is not given, the default device (usually a display device) is used. Ghostscript's built-in help message (gswin64 -h) lists the available output devices.

C:\Users\XYZ> gswin64 -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pnggray -r300 -dTextAlphaBits=4 -sOutputFile=PDF2PNG-%04d.png In.pdf

There should be no space before or after '='. On Linux, gswin64 should be replaced with gs, for example "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -sOutputFile=PDF2PNG-%04d.jpeg In.pdf" (this variant writes JPEG files).

gswin64 -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pnggray -r300 -dTextAlphaBits=4 -dFirstPage=1 -dLastPage=10 -sOutputFile=PDF2PNG-%04d.png -dUseArtBox In.pdf

Set the page size using the pair of switches -dDEVICEWIDTHPOINTS=w -dDEVICEHEIGHTPOINTS=h, where 'w' is the desired paper width and 'h' the desired paper height in points (1 point = 1/72 inch). At a resolution of r dpi, one point corresponds to r/72 pixels. Example: A4 size is width x height = 210 x 297 [mm] = 8.27 x 11.7 [inch] = 595 x 842 [points]. At 300 dpi this translates into -r300 -g2481x3510. Ghostscript may sometimes convert PDF to PNG with the wrong output size. In that case use -dUseCropBox or -dUseTrimBox; note that these two options are mutually exclusive. Also, -sPAPERSIZE=a4 cannot be used with -dUseCropBox or -dUseTrimBox. With -dPDFFitPage, Ghostscript renders to the current page device size (usually the default page size). If the dimensions of the PNG pages differ from those in the PDF file, adjust -r300 to -r100 or -r160 until you get the desired size. Note: a pixel is the smallest unit a screen can display, a dot is the smallest thing a printer can print.
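The paper-size arithmetic above can be checked with a short script. It is a sketch under the stated conversions only (25.4 mm per inch, 72 points per inch); the function names are made up for this example. Starting from millimetres it gives 2480x3508 pixels for A4 at 300 dpi, a pixel or two away from the 2481x3510 quoted above, which starts from inch values rounded to 8.27 x 11.7.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72

def page_size_points(width_mm, height_mm):
    """Paper size in points, for -dDEVICEWIDTHPOINTS / -dDEVICEHEIGHTPOINTS."""
    return tuple(round(v / MM_PER_INCH * POINTS_PER_INCH)
                 for v in (width_mm, height_mm))

def page_size_pixels(width_mm, height_mm, dpi):
    """Paper size in pixels at the given resolution, for the -g switch."""
    return tuple(round(v / MM_PER_INCH * dpi) for v in (width_mm, height_mm))

w_pt, h_pt = page_size_points(210, 297)        # A4
w_px, h_px = page_size_pixels(210, 297, 300)
print(f"-dDEVICEWIDTHPOINTS={w_pt} -dDEVICEHEIGHTPOINTS={h_pt}")
print(f"-r300 -g{w_px}x{h_px}")
```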

Other options are: -sDEVICE=pngmono, -sDEVICE=jpeg / jpeggray, -sDEVICE=bmp16 / bmp256 / bmpgray / bmpmono ... are few out of almost 150 options available. pngmono - Monochrome Portable Network Graphics (PNG), pnggray is 8-bit gray PNG, png16 is 4-bit colour PNG, png256 is 8-bit colour PNG and png16m is 24-bit colour PNG.

-dUseBleedBox: Defines the region to which the contents of the page should be clipped when output in a production environment. Sets the page size to the BleedBox rather than the MediaBox. This may include any extra bleed area needed to accommodate the physical limitations of cutting, folding, and trimming equipment. The actual printed page may include printing marks that fall outside the bleed box.

-dUseTrimBox: The trim box defines the intended dimensions of the finished page after trimming. Sets the page size to the TrimBox rather than the MediaBox. Some files have a TrimBox that is smaller than the MediaBox and may include white space, registration or cutting marks outside the CropBox. Using this option simulates appearance of the finished printed page.

-dUseArtBox: The art box defines the extent of the page's meaningful content (including potential white space) as intended by the page's creator. Sets the page size to the ArtBox rather than the MediaBox. The art box is likely to be the smallest box. It can be useful when one wants to crop the page as much as possible without losing the content.

-dUseCropBox: Sets the page size to the CropBox rather than the MediaBox. Unlike the other "page boundary" boxes, CropBox does not have a defined meaning, it simply provides a rectangle to which the page contents will be clipped (cropped). By convention, it is often, but not exclusively, used to aid the positioning of content on the (usually larger, in these cases) media.

Convert all PDF files in a folder into images using Bash. A folder named after each PDF file is created. The user can specify the start and final page numbers; the defaults are 1 and 10. The script uses the jpeg device and therefore writes JPEG files; select a PNG device such as png16m and a .png extension for PNG output.

#!/bin/bash
start_page=1
last_page=10
for pdf_file in *.pdf; do
  file_name="${pdf_file%.*}"
  mkdir -p "$file_name"   # Create the directory if it does not exist

  # Variables are quoted so that file names containing spaces also work
  gs -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dSAFER -dBATCH \
    -dNOPAUSE -sDEVICE=jpeg -r100 -dTextAlphaBits=4 -dFirstPage="$start_page" \
    -dLastPage="$last_page" -sOutputFile="$file_name/PDF2PNG-%04d.jpeg" \
    "$pdf_file"
done

Operations on PDF Files

Convert PDF files with text and coloured background into black-and-white PDF: There are two approaches. In approach 1, Ghostscript is used to convert the PDF pages into PNG files and then Pillow/OpenCV is used to convert the PNG files into a black-and-white PDF. In approach 2, PyMuPDF is used to convert the PDF into PNG files; the other steps remain the same as in approach 1. The Python script can be downloaded from here. One of the important steps is to find the right threshold value, which results in sharper text and a whiter background. The program runs in serial mode and hence it is a bit slow: it may take up to 15 minutes for a PDF of 100 pages. Other operations needed on PDF files with scanned images include:

  1. Deskew: straighten the text on a page
  2. Check resolution: the resolution of image is measured in PPI (pixels per inch) or DPI (dots per inch).
  3. Sharpen the texts
  4. Align texts on the centre of the pages
  5. Create equal margins on all pages
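The effect of the threshold mentioned above can be illustrated without any imaging library. The sketch below works on a plain list of 8-bit gray values rather than on a real PNG, so it is not the downloadable script; the function name, the sample patch and the threshold values are assumptions chosen for illustration.

```python
def to_black_and_white(gray_rows, threshold=160):
    """Map 8-bit gray values to 0 (black) or 255 (white)."""
    return [[255 if value > threshold else 0 for value in row]
            for row in gray_rows]

# Hypothetical 3x4 patch: dark text strokes (~40) on a grayish background (~180)
patch = [
    [180, 175, 40, 182],
    [178, 45, 38, 185],
    [181, 179, 42, 183],
]
for t in (30, 160, 200):
    bw = to_black_and_white(patch, t)
    n_black = sum(v == 0 for row in bw for v in row)
    print(f"threshold={t}: {n_black} of 12 pixels are black")
```

A threshold that is too low turns the text white along with the background, while one that is too high turns the background black; a value between the two gray levels keeps only the strokes.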
DPI: Dots per Inch is a kind of metadata stored inside an image file that tells a device how to display or print it. In other words, it indicates the zoom level when an image is moved from one device to another. In display devices, DPI indicates the sharpness of the illuminated points. In printing devices, it indicates the sharpness of printed characters or outlines. Resampling is the process of changing the pixel dimensions of an image while maintaining the content of the original image.
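The relation between pixel dimensions, DPI and printed size is simple arithmetic, sketched below. The function names are assumptions for this example, and the second function only computes the new pixel dimensions; it does not resample any image data.

```python
def print_size_inches(width_px, height_px, dpi):
    """Physical size at which an image prints for a given DPI value."""
    return width_px / dpi, height_px / dpi

def resampled_pixels(width_px, height_px, old_dpi, new_dpi):
    """Pixel dimensions needed to keep the same printed size at a new DPI."""
    scale = new_dpi / old_dpi
    return round(width_px * scale), round(height_px * scale)

# A 2400 x 3000 pixel scan tagged as 300 DPI prints at 8 x 10 inches
print(print_size_inches(2400, 3000, 300))
# Keeping that printed size at 150 DPI needs half the pixels in each direction
print(resampled_pixels(2400, 3000, 300, 150))
```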

Convert PDF to PNG using Ghostview and Python

import os, sys, subprocess
resolution = 200
i = 1  #Start page
j = 5  #Last page

pdf_name = str(sys.argv[1])
# Make directory named PDF2PNG
output_dir = "PDF2PNG"
os.makedirs(output_dir, exist_ok=True)

# splitext keeps dots inside the file name, unlike split(".")[0]
file_name = os.path.splitext(os.path.basename(pdf_name))[0]
png_name = os.path.join(output_dir, file_name + "-%04d.png")

#Make sure that the Ghostscript executable is defined in PATH. sys.platform is
#'win32' on both 32-bit and 64-bit Windows: use gswin32c for a 32-bit install
gs = 'gswin64c' if (sys.platform == 'win32') else 'gs'

# {f-strings}: syntax is similar to str.format() but less verbose
subprocess.run([gs, "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m",
  f"-r{resolution}", f"-dFirstPage={i}", f"-dLastPage={j}",
  f"-sOutputFile={png_name}", pdf_name], stdout=subprocess.PIPE)

pdfimages [options] PDF-file image-root: pdfimages reads the PDF file PDF-file, scans one or more pages, and writes one file for each image, 'image-root-nnn.zzz', where 'nnn' is the image number and 'zzz' is the image type (ppm, pbm, png, tif, jpg, jp2, jb2e, or jb2g).

Convert PDF to PNG using PyMuPDF

Images to PDF using package img2pdf: img2pdf -o pdf_from_img.pdf *.jpg - run this from the command line in the folder where the images are stored. Sometimes the files may not get sorted in the correct order, say when the numbers in their names are not padded with leading zeros. unix.stackexchange.com/.../img2pdf-get-pages-in-good-order provides multiple solutions. One handy method in Linux is brace expansion: img2pdf img_names_{1..250}.jpg -o pdf_from_jpg.pdf where img_names_ is the prefix of the file names without numbers. The second solution is "find . -name 'img_names_*.jpg' -print0 | sort -z -V | xargs -0r img2pdf -o pdf_from_jpg.pdf"; this method does not require knowing the number of images.

Portable Document Format (PDF) to Portable Pixmap (PPM) converter: pdftoppm is part of the poppler-utils package, which is available on most Linux distributions. "pdftoppm -png input.pdf pdf_png": save PNG images named pdf_png-000001.png, pdf_png-000002.png... More examples: pdftoppm -r 100 -gray -jpeg -jpegopt quality=90 input.pdf pdf_png.

To extract a specific page from a PDF using pdftoppm, the -f and -l options can be used to specify the first and last page numbers, respectively. For example, to extract only the third page, use: pdftoppm -f 3 -l 3 in.pdf png_prefix.
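The page-ordering problem with img2pdf noted above (file names without leading zeros) can also be handled in Python by sorting the names "naturally" before building the command, similar in spirit to sort -V. This is a sketch: natural_key is a helper written for this example, and the file names are hypothetical.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

names = ["img_names_10.jpg", "img_names_2.jpg", "img_names_1.jpg"]
ordered = sorted(names, key=natural_key)

# Argument list that could be passed to subprocess.run() once the files exist
command = ["img2pdf", *ordered, "-o", "pdf_from_jpg.pdf"]
print(command)
```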

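import fitz  # PyMuPDF
inPDF = "00_Rigved.pdf"
prefix = "RV-Book-1"
doc = fitz.open(inPDF)
nPg = len(doc)

#iPg = doc.load_page(0)   #Extract specific page
#Recent PyMuPDF releases use snake_case names (get_pixmap, save); older
#releases used getPixmap and writePNG
for i, page in enumerate(doc, start=1):
  pg = page.get_pixmap()
  #Page number padded to 4 digits: RV-Book-10001.png, RV-Book-10002.png, ...
  outFile = prefix + '{0:04}'.format(i) + ".png"
  pg.save(outFile)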

This Python code saves the front (cover) page of every PDF file stored in a folder as a PNG file. It has an option to generate HTML tags to add the images as inline objects in a web page.


Delete pages from a PDF file using PyPDF2: it creates a new file by adding suffix _new. The pages to be deleted can also be specified as list or a range of numbers.

from PyPDF2 import PdfReader, PdfWriter
import sys, os
#--Syntax: py delPages.py Input.pdf m n-----n < 0 implies single page deletion
#Names below follow PyPDF2 3.x; older releases used PdfFileReader,
#PdfFileWriter, getPage() and addPage()
file_name = str(sys.argv[1])

file_path = os.path.join(os.getcwd(), file_name)
in_pdf = PdfReader(file_path)
m = int(sys.argv[2])
n = int(sys.argv[3])

#splitext removes only the extension; str.strip(".pdf") would also remove
#leading or trailing 'p', 'd', 'f' and '.' characters of the name itself
new_file = os.path.splitext(file_path)[0] + "_new.pdf"
out_pdf = PdfWriter()
#
#Note that the counter i starts with ZERO
for i, p in enumerate(in_pdf.pages):
  if (n >= 0):
    delete = (m <= i <= n)   #Delete the range of pages m..n
  else:
    delete = (i == m)        #Delete the single page m
  if not delete:
    out_pdf.add_page(p)
with open(new_file, 'wb') as f:
   out_pdf.write(f)

pdfseparate [options] PDF-file PDF-page-pattern: In Linux, pdfseparate extracts single pages from a Portable Document Format (PDF) file. The command reads the PDF file 'PDF-file', extracts one or more pages, and writes one PDF file for each page to 'PDF-page-pattern'.
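For the PyPDF2 page deletion described earlier, specifying the pages to delete as a list or a range comes down to computing which page indices survive. The sketch below shows one possible way; the function name and the "0,3-5,9" spec format are assumptions, and indices are zero-based to match the counter used in the script.

```python
def pages_to_keep(n_pages, delete_spec):
    """Return zero-based indices of the pages that survive the deletion.

    delete_spec is a string such as "0,3-5,9" (single pages and ranges).
    """
    doomed = set()
    for item in delete_spec.split(","):
        first, _, last = item.partition("-")
        # A single page "7" is treated as the range 7-7
        doomed.update(range(int(first), int(last or first) + 1))
    return [i for i in range(n_pages) if i not in doomed]

print(pages_to_keep(10, "0,3-5,9"))
```

The returned indices can then be used to pick the pages that are added to the output writer.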

Crop pages in a PDF file using PyPDF2:

In Ghostscript, the page size is normally taken from the PDF MediaBox, or from the box selected with -dUseBleedBox, -dUseTrimBox, -dUseArtBox or -dUseCropBox. With -dPDFFitPage, the PDF file is instead scaled to fit the current device page size (usually the default page size). This is useful for creating fixed size images of PDF files that may have a variety of page sizes, for example thumbnail images. This option is also set by the -dFitPage option.

from PyPDF2 import PdfReader, PdfWriter
import sys, os
#Syntax: py pdfcrop.py original.pdf 20 30 20 40
#The four numbers are the margins cut from left, top, right and bottom
#Names below follow PyPDF2 3.x; older releases used PdfFileReader and mediaBox

file_name = str(sys.argv[1])
new_file = os.path.splitext(file_name)[0] + "_new.pdf"

left     = int(sys.argv[2])
top      = int(sys.argv[3])
right    = int(sys.argv[4])
bottom   = int(sys.argv[5])

pdf = PdfReader(file_name)
out = PdfWriter()
for page in pdf.pages:
    box = page.mediabox
    #Read all four edges first, then move the two corners inwards
    x0, y0 = float(box.left), float(box.bottom)
    x1, y1 = float(box.right), float(box.top)
    box.upper_right = (x1 - right, y1 - top)
    box.lower_left  = (x0 + left, y0 + bottom)
    out.add_page(page)

with open(new_file, 'wb') as out_pdf:
    out.write(out_pdf)
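The corner arithmetic used for cropping can be tried on plain numbers before touching a PDF. In this sketch a box is a tuple (x0, y0, x1, y1) holding the lower-left and upper-right corners; the function name, the tuple layout and the sanity check are assumptions made for illustration.

```python
def cropped_box(box, left, top, right, bottom):
    """Shrink a (x0, y0, x1, y1) box by the given margins."""
    x0, y0, x1, y1 = box
    new_box = (x0 + left, y0 + bottom, x1 - right, y1 - top)
    # Guard against margins that are larger than the page itself
    if new_box[0] >= new_box[2] or new_box[1] >= new_box[3]:
        raise ValueError("margins remove the whole page")
    return new_box

a4 = (0, 0, 595, 842)     # A4 page in points
print(cropped_box(a4, 20, 30, 20, 40))
```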

Scale Pages:

from PyPDF2 import PdfReader, PdfWriter
import sys, os
#  Syntax: py scalePages.py Input.pdf 0.5
#Names below follow PyPDF2 3.x; older releases used PdfFileReader and scaleBy()
file_name = str(sys.argv[1])
file_path = os.path.join(os.getcwd(), file_name)
in_pdf = PdfReader(file_path)

#Enter scaling factor as fraction, all pages shall be scaled down/up
s = float(sys.argv[2])
#splitext removes only the extension, unlike str.strip(".pdf")
new_file = os.path.splitext(file_path)[0] + "_scaled.pdf"
out_pdf = PdfWriter()

for p in in_pdf.pages:
  p.scale_by(s)
  out_pdf.add_page(p)

with open(new_file, 'wb') as f:
   out_pdf.write(f)

Merge PDF Files

Using Ghostscript for Windows, files can be merged directly from the command prompt: gswin64 -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile=combined.pdf 1.pdf 2.pdf 3.pdf. Note that there should be no space around '=' in -sDEVICE=pdfwrite and -sOutputFile=combined.pdf: forms such as -sDEVICE = pdfwrite or -sOutputFile = combined.pdf do not work.

Once Ghostscript is installed, you need to add its location to PATH using Control Panel - System - Advanced System Settings - Advanced - Environment Variables.

In Linux: pdfunite merges several PDF (Portable Document Format) files, in the order of their occurrence on the command line, into one PDF result file.

qpdf input.pdf --pages . 1-5,20-z -- output.pdf: extract specific pages from a PDF using QPDF; here z refers to the last page.

from PyPDF2 import PdfReader, PdfWriter
import sys, os
#Syntax: py pdfMerge.py F1.pdf F2.pdf F3.pdf F4.pdf
#Any number of files can be specified on command line. Input files must be in
#folder from which command is executed. e.g. py ../mergePdf.py F1.pdf F2.pdf
#Names below follow PyPDF2 3.x; older releases used PdfFileReader and addPage()
if (len(sys.argv) < 2):
  print("\nUsage: python {} F1.pdf F2.pdf ... \n".format(sys.argv[0]))
  sys.exit(1)

#One reader per input file, in command-line order
inpdf = [PdfReader(str(fx)) for fx in sys.argv[1:]]
new_file = os.path.join(os.getcwd(), "Merged_File.pdf")
out_pdf = PdfWriter()
#
for f in inpdf:
  for p in f.pages:
    out_pdf.add_page(p)
with open(new_file, 'wb') as f:
  out_pdf.write(f)

A more flexible option using argument parsing to merge PDF files either specified on command line or those stored in a folder can be found in this sample code.

Shuffle Pages: Click here to get the Python code.

Rotate Pages: Click here to get the Python code.

cpdf: The Coherent PDF Command Line Tools (written in OCaml), together with the C/C++/Python/.NET/Java/JavaScript APIs, can be used to manipulate PDF files: merge or split, encrypt and decrypt, scale, crop and rotate pages, copy, add or remove bookmarks, and read, write and modify annotations...

Download a Python script to extract text from a PDF file and summarize the word frequencies.
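Once the text has been extracted, counting word frequencies (optionally excluding a list of words) needs only the standard library. The sketch below assumes the text is already available as a string and does not show the PDF extraction step; the function name, the sample sentence and the exclusion list are made up for illustration.

```python
import re
from collections import Counter

def word_frequencies(text, exclude=()):
    """Count lower-cased words, skipping any word listed in exclude."""
    skip = {w.lower() for w in exclude}
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in skip)

sample = "The PDF file is a file. The file has text."
freq = word_frequencies(sample, exclude=["the", "is", "a"])
for word, count in freq.most_common():
    print(word, count)
```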

Old documents contain many unwanted features such as stains, noise from scanning, ink fading and broken characters. In an OCR system, the process of recognition goes through five steps:
  1. Pre-processing where data is being prepared
  2. Segmentation where text is split into lines, words and individual characters
    • Line segmentation
    • Word segmentation
    • Character segmentation
      1. Projections
      2. Template matching
      3. Skeletonization
      4. Contour analysis
  3. Feature extraction
  4. Classification and
  5. Postprocessing
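Of the segmentation techniques listed above, projections are the easiest to sketch. The example below applies a horizontal projection (the ink count of each pixel row) to split a page into text lines; the same idea with column sums separates words or characters. It assumes the page is already binarized into rows of 0 (background) and 1 (ink), and the function name is an assumption for this illustration, not part of any OCR library.

```python
def segment_lines(binary_rows):
    """Return (start, end) row ranges of text lines; end is exclusive."""
    profile = [sum(row) for row in binary_rows]   # horizontal projection
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink and start is None:
            start = y                      # a text line begins
        elif not ink and start is not None:
            lines.append((start, y))       # a blank row ends the line
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile)))
    return lines

page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(segment_lines(page))
```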
OCRmyPDF, a Python tool built on the Tesseract OCR engine, is a handy program to make a PDF of scanned documents searchable by adding a text layer and rendering it as 'invisible'. What one sees on screen or gets printed is still the original image. But when a keyword is searched, the hits on the invisible text layer get highlighted. The following piece of code from "stackoverflow.com/questions/55704218/how-to-check-if-pdf-is-scanned-image-or-contains-text" can be used to check whether a PDF contains any text or not.

import subprocess as sp
import re
#ocrmypdf refuses to process a page that already has text and reports
#PriorOcrFoundError: finding that message means the PDF contains text
output = sp.getoutput("ocrmypdf input.pdf output.pdf")
if re.search("PriorOcrFoundError: page already has text!", output):
   print("Uploaded pdf already has text!")
else:
   print("Uploaded pdf file does not have text!")

This Python code checks all the files in a folder and, if they do not contain any searchable text, performs an Optical Character Recognition (OCR) operation on them. The PDF file is copied to the folder where this Python code is stored and hence it must not be run from the folder where the original files are stored.


Count Unique Words excluding those in a List
This Python code can be used to combine short lines of a text file into bigger lines. This is especially useful when texts are extracted from PDF files with OCR characters embedded in it.
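Combining short lines into longer ones can be sketched as below. This is one possible approach, not the linked script: the function name and the width parameter are assumptions, and empty lines are treated as paragraph breaks that are preserved.

```python
def merge_short_lines(lines, width=80):
    """Join consecutive non-empty lines into lines of at most about width characters."""
    merged, current = [], ""
    for line in lines:
        text = line.strip()
        if not text:                       # paragraph break
            if current:
                merged.append(current)
            merged.append("")
            current = ""
        elif current and len(current) + 1 + len(text) > width:
            merged.append(current)         # current line is full
            current = text
        else:
            current = f"{current} {text}" if current else text
    if current:
        merged.append(current)
    return merged

raw = ["This is", "a short", "line.", "", "Next", "para."]
print(merge_short_lines(raw, width=20))
```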

List Files of a Folder

import os
import sys
#The code is run by defining the path at the command line argument
#e.g. py listDir.py . or py listDir.py ./abc/pqr

print("Folder:: ", sys.argv[1])
file_list = []
for file in os.listdir(sys.argv[1]):
  if file.endswith(".py"):
    file_list.append(file)
#Print the collected file names, one per line
print("\n".join(file_list))

PDF Reference, Third Edition, Adobe Portable Document Format Version 1.4: The origins of the Portable Document Format and the Adobe Acrobat product family date to early 1990. At that time, the PostScript page description language was rapidly becoming the worldwide standard for the production of the printed page. PDF builds on the PostScript page description language by layering a document structure and interactive navigation features on PostScript's underlying imaging model, providing a convenient, efficient mechanism enabling documents to be reliably viewed and printed anywhere. At the heart of PDF is its ability to describe the appearance of sophisticated graphics and typography. This is achieved through the use of the Adobe imaging model, the same high-level, device-independent representation used in the PostScript page description language.

The appearance of a page is described by a PDF content stream, which contains a sequence of graphics objects to be painted on the page. This appearance is fully specified where all layout and formatting decisions have already been made by the application generating the content stream.

List file names, size and number of pages
# References:
#------------------------------------------------------------------------------
# stackoverflow.com/questions/2104080/how-can-i-check-file-size-in-python
# www.geeksforgeeks.org/python-os-path-size-method
# www.geeksforgeeks.org/python-program-to-convert-a-list-to-string
# stackoverflow.com/questions/541390/extracting-extension-from-filename-in-python
# stackoverflow.com/questions/4226479/scan-for-secured-pdf-documents
# pythonexamples.org/python-if-not/

import sys,os
from PyPDF2 import PdfReader
#Raw string so that the backslash is not read as an escape character
root = r"F:\World_Hist_Books"

def convert_bytes(num):  #bytes to kB, MB, GB
  for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
    if num < 1024.0:
      return "%3.1f %s" % (num, x)
    num /= 1024.0
  return "%3.1f %s" % (num, 'PB')

out_f = "List.txt"
#Write Only ('w'): creates the file if it does not exist, otherwise the
#existing data is truncated and over-written
with open(out_f, "w") as f:
  for path, subdirs, files in os.walk(root):
    for name in files:
      s = os.path.join(path, name)
      b = convert_bytes(os.path.getsize(s))
      #Get number of pages in the PDF file
      ext = os.path.splitext(s)[1][1:].strip().lower()
      nPg = 0
      if (ext == "pdf"):
        #Names below follow PyPDF2 3.x; older releases used PdfFileReader,
        #isEncrypted and getNumPages()
        with open(s, 'rb') as pdf_file:
          pdf_f = PdfReader(pdf_file)
          if not pdf_f.is_encrypted:
            nPg = len(pdf_f.pages)
      #Replace \ with whitespace
      A = s.split('\\')
      f.write(' '.join(map(str, A)))
      f.write(' ' + str(b) + '  ' + str(nPg) + '\n')

      #Write only the file names
      #f.write(s.split('\\')[-1] + '\n')
# L is the list
#listToStr = ' '.join([str(elem) for elem in L])
#listToStr = ' '.join(map(str, L))
#print(listToStr)