Imagine you have a large PDF document that you want to submit to some three-letter government agency. In addition to the document as one big PDF, the agency asks you to submit the abstract and the biography as two separate PDFs.
Well, you are running out of time, and your coauthors are still working on their sections. You even wrote a makefile to automate the document generation because you don’t want to run LaTeX multiple times manually to resolve all the references…
But how do you make these two PDF files efficiently, preferably in an automatic way? Well, if you have Acrobat (the full version), then you can use Document -> Extract Pages to save the relevant pages into separate PDFs. But that’s not easy to automate. And what if you didn’t shell out the $$$ to buy Acrobat?
Here is the good news. To extract pages from a PDF, we can use the free software pdftk. Suppose you know(*) the abstract happens to be on pages 2 and 3 and the biography spans from page 30 to the end. Here is an example usage for our imaginary situation (dont_ask is used to suppress the prompt to overwrite existing files):
pdftk foo.pdf cat 2-3 output abstract.pdf dont_ask
pdftk foo.pdf cat 30-end output biography.pdf dont_ask
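Since the goal is to call this from a makefile, the two commands above can be wrapped in a small shell function. This is only a sketch: the function name extract_pages and the PDFTK override variable are made up for illustration, and the page ranges are the ones from our imaginary situation.

```shell
#!/bin/sh
# Hypothetical helper around the pdftk commands shown above.
# PDFTK defaults to "pdftk"; set PDFTK=echo to preview the command
# line without touching any files.
PDFTK="${PDFTK:-pdftk}"

extract_pages() {
    # $1 = input PDF
    # $2 = page range, e.g. 2-3 or 30-end
    # $3 = output PDF
    # dont_ask suppresses the prompt to overwrite an existing output file.
    "$PDFTK" "$1" cat "$2" output "$3" dont_ask
}

# Example calls (ranges taken from the text, adjust to your document):
#   extract_pages foo.pdf 2-3    abstract.pdf
#   extract_pages foo.pdf 30-end biography.pdf
```

Keeping the range as a single argument means the same function handles both a closed range like 2-3 and an open-ended one like 30-end.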
Besides page extraction, pdftk can also catenate PDFs and perform several other PDF magic tricks. You can discover all of these by reading this page. For example, you can discover that you can compute the number of pages of a PDF by
pdftk foo.pdf dump_data output - | grep NumberOfPages | cut -d' ' -f 2-
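If you need the page count inside a script, the grep and cut pair can be folded into one awk call. The sketch below assumes, as the pipeline above implies, that dump_data prints a line of the form "NumberOfPages: N"; the function name page_count is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read pdftk dump_data output on stdin and
# print the value of the first NumberOfPages line.
page_count() {
    awk '$1 == "NumberOfPages:" { print $2; exit }'
}

# Example call:
#   pdftk foo.pdf dump_data | page_count
```

Matching on the first field exactly avoids picking up other lines that merely contain the string NumberOfPages somewhere in their value.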
(*) How you can know these ranges automatically is another story to be written later. Hint: use pdftotext.