Category Archives: Electronic Documents

SumatraPDF 0.8 Released

SumatraPDF has just become a very strong contender as an everyday PDF viewer on Windows. The reason: it finally supports searching and bookmarks in version 0.8.

Its interface is still a bit too simplistic, but it’s already quite good for everyday use. At the moment, the keyboard shortcuts are not very well documented. Looking at the source code, I noticed some noteworthy shortcuts that are missing from the documentation:

  • The search interface: / starts a search, and so does Ctrl-F. Ctrl-G and F3 jump to the next match, and Shift-F3 jumps to the previous one. Note that the toolbar doesn’t have to be visible for the search keys to work.
  • Ctrl-L toggles full screen mode.
  • F12 toggles the bookmark panel.

(Update: Also note that Ctrl-Drag will select a region and copy its text to clipboard.)

I don’t have a list of the numerous PDF viewers on Windows here: a Google search will get you many hits. Personally, I “switched” to SumatraPDF in my browsers a few months ago, while spending most of my “PDF time” in PDF-XChange Viewer (see this post about its annotation capability).

I still keep Adobe Reader 8 as the default PDF handler: in terms of printing, anti-aliasing and JavaScript support, I have yet to see any real competition to Adobe. However, now that PDF is in the process of becoming an ISO standard and the FSF has declared the GNU PDF project one of its few high-priority projects, hopefully we will see some change in a few years (see this linux.com article).

I should note that SumatraPDF 0.8 is still missing hyperlink support. Also, none of the Windows PDF viewers seems to support pdfsync (round-tripping between PDF and TeX source). I know the former is being worked on. The latter, hmm…

P.S. Common PDF viewers on Windows also include Ghostview and Foxit Reader.

Protect Your Email Address in PS and PDF

The randtext package is a great random find in CTAN. Here is its description in full, but with an emphasis added by me:

The package provides a single macro \randomize{TEXT} that typesets the characters of TEXT in random order, such that the resulting output appears correct, but most automated attempts to read the file will misunderstand it.

This function allows one to include an email address in a TeX document and publish it online without fear of email address harvesters or spammers easily picking up the address.

The author is Charles Duan.

While you most certainly want to use this package for some protection, you should not rely solely on it. For a historical reason(*), Adobe’s PDF library can easily extract an email address protected by this package. I don’t know about the top search engines, but I would expect their extractors to perform no worse.

(*) Basically, space characters need not exist after the TeX phase. They may instead exist as kerning and spacing information. Therefore, Adobe has an extraction routine that guesses the word breaks based on how the characters are spaced. In other words, the routine is literally “reading” the text from the rendering, just like any human!

Changing PDF Margins With The pdfpages Package

Sometimes I am given a PDF and I wish I could change its margins. For example, when I print out a conference version of a paper to study in detail, usually there isn’t much space on the sides to write my own ideas. I heard Fermat had a solution for this type of issue :P , but in modern times, I use sticky notes. Surprisingly, the same situation arises even if you use PDF annotation software. Free or not free (no link for them), they simply do not support an “enlarge canvas” feature yet. All you can do is insert electronic sticky notes. So much for the metaphor!

Now there is actually a technology called “reflowable PDFs”, which are PDFs that contain enough information to support reflowing their content to fit any width. You can see this page for a screenshot of a reflowed document. Reflowing works wonders for text-dominant documents like novels, but the PDF has to be specially prepared for reflowing to really work well. (Try View->Zoom->Reflow in Adobe Reader 8.)

But if the given PDF is not reflowable, we can still use some graphics editor and edit the PDF interactively. After all, a PDF is mostly a vector graphics file, modulo some potential embedded bitmaps, and so you can edit it like any vector graphics file. I’ve seen it done in Illustrator, and I guess some of the free software competitors like Inkscape or Scribus can do this kind of editing too. But using a command line tool would be far easier.

I have actually known how to change the margins of a PDF for a long while. (Turns out lawyers have essentially the same problem.) The trick is to use the pdfpages package with pdfLaTeX. You may recall that this package allows us to include specific pages of a PDF into our own document. Magically, it has a scale option (that is really inherited from graphicx)… This gives us the first batch file: pdf-rescale.bat. Execute

pdf-rescale.bat foo.pdf 0.8

and you will get foo-0.8.pdf, which is foo.pdf shrunk to 80%. I’ve found that 80% is a good default and so the scale argument is actually optional. You can also specify a scale larger than 1 for journal articles formatted for smaller paper sizes. I can also imagine fancier applications in which you combine this idea with the geometry package (see this post) to control the final outcome.
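
The wrapper document behind this trick can be sketched as follows. This is not the actual pdf-rescale.bat, only a minimal shell sketch under assumptions: the function names make_rescale_tex and pdf_rescale are made up for illustration, pdflatex is assumed to be on the path, and the input file name is assumed to contain no spaces or TeX special characters.

```shell
#!/bin/sh
# Hypothetical helpers, not the author's pdf-rescale.bat.

# Print a pdfLaTeX wrapper that includes every page of a PDF at a given scale.
# $1 = input PDF, $2 = scale (optional, defaults to 0.8 as suggested above)
make_rescale_tex() {
    pdf="$1"
    scale="${2:-0.8}"
    printf '%s\n' \
        '\documentclass{article}' \
        '\usepackage{pdfpages}' \
        '\begin{document}' \
        "\\includepdf[pages=-,scale=${scale}]{${pdf}}" \
        '\end{document}'
}

# Write the wrapper next to the input and compile it with pdflatex.
# For foo.pdf and 0.8 this should produce foo-0.8.pdf.
pdf_rescale() {
    pdf="$1"
    scale="${2:-0.8}"
    job="$(basename "$pdf" .pdf)-${scale}"
    make_rescale_tex "$pdf" "$scale" > "${job}.tex" && pdflatex "${job}.tex"
}
```

Calling pdf_rescale foo.pdf 0.8 would then mirror the batch file invocation above. The wrapper uses the article class with its default paper size; a different class or paper option may suit other documents better.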

But since we are not changing the actual size of the paper, shrinking the content for a larger margin actually means, well, shrunk content. Conference proceedings are already typeset in a small enough font. Since the proceedings are in two columns, I had the idea to print each column on its own page. That will definitely give us ample space. After a lot of different attempts since last year, I finally managed to get the second batch file: pdf-1c.bat. Execute

pdf-1c.bat foo.pdf

and you will get foo-1c.pdf, which is foo.pdf but with one column per page. No kidding.
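
How pdf-1c.bat works is not shown in this post, so the following is only one guess at how such a script could be built: include each page twice, and let the trim and clip options (which pdfpages hands to graphicx) cut away one half each time. The function name make_1c_tex, the explicit page count, and the page width in bp (612 for US Letter) are all assumptions, and real papers will probably need the split point adjusted by hand.

```shell
#!/bin/sh
# Hypothetical helper, not the author's pdf-1c.bat.

# Print a pdfLaTeX wrapper that shows each half of every page on its own page.
# $1 = input PDF, $2 = number of pages, $3 = page width in bp (default 612)
make_1c_tex() {
    pdf="$1"
    pages="$2"
    width="${3:-612}"
    half=$((width / 2))
    printf '%s\n' '\documentclass{article}' '\usepackage{pdfpages}' '\begin{document}'
    n=1
    while [ "$n" -le "$pages" ]; do
        # trim takes "left bottom right top": first keep the left half ...
        printf '%s\n' "\\includepdf[pages=${n},trim=0 0 ${half}bp 0,clip]{${pdf}}"
        # ... then keep the right half of the same page.
        printf '%s\n' "\\includepdf[pages=${n},trim=${half}bp 0 0 0,clip]{${pdf}}"
        n=$((n + 1))
    done
    printf '%s\n' '\end{document}'
}
```

For example, make_1c_tex foo.pdf 10 > foo-1c.tex followed by pdflatex foo-1c.tex would give a 20-page PDF for a 10-page Letter-sized paper.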

The following PNG illustrates these scripts using the title page of an old paper. The upper-left shows the original, and the upper-right is after shrinking to 80%. The lower-left and lower-right are the two pages from the one-column version. I have also attached the source PDF from which I generated the PNG. The key feature is that this PDF retains the text of the original: try searching for the word “degree” and you will see. (At one point my solution was to assemble a bunch of PNGs, one per column, into a huge PDF. Although you can annotate it electronically, you cannot search in it. It also prints slowly.)

Finally, I also note that it is in fact possible to prepare reflowable PDFs using pdfLaTeX or dvipdfm (there seems to be trouble when going through PS), but I will save that for a later story. For now, you can experience reflowing a LaTeX PDF with this PracTeX article, which is actually on the use of the pdfpages package. Having played with reflowing for a while, I would say that the real deal-breaker for reflowable PDFs lies in the mathematical expressions. You can see this TUGboat article for some excellent macros on fine-tuning mathematical expressions. Now, try to reflow it… :P

P.S. If you write the corresponding bash scripts, please send them to me so that I can, with full acknowledgement, post them here. The only reason why I stick to batch files is to avoid extra dependencies.

Change PDF Margin Scripts: pdf-rescale.bat pdf-1c.bat

Update: Joshua Dunfield has ported the rescaling script to Linux. See the comments.

Free PDF Annotation Software

Given the number of documents that I deal with, I’ve had many opportunities to save paper if only I could annotate a PDF directly. Until recently, Jarnal has been the only free solution on Windows. However, I would say that Jarnal is not exactly the most polished piece of software I have used… (But I am not complaining about its price.)

Well, Jarnal has some serious competition now. PDF-XChange Viewer is a free (as in beer) PDF viewer that supports a full suite of native PDF annotation capabilities. By “native”, I mean if you insert a “sticky note” comment into the PDF, it shows up as a sticky note in Adobe Reader and so you can open, close and move the note, just like the stickies created by Acrobat Professional. Compared with Adobe Reader, the interface is not very customizable and the keyboard shortcuts are not quite “there” yet, but it is certainly looking very good for a version-1 product. The font rendering is not as good either, but it is comparable to other free solutions. Its performance on graphics-intensive PDFs is very good too! There is even a button that allows you to quickly open the current PDF using Adobe Reader, just in case.

I wish it had better ink support, but again, I am not complaining about its price. :P

P.S. Not that I really want to destroy any sales, but I note that PDFCreator is a very good Distiller replacement. Together with PDF-XChange Viewer, they cover most of my PDF needs.

PrimoPDF

In the past, I used to manually set up the tool chain for PDF production on Windows.

Basically, this requires the installation of Ghostscript and a PS printer driver that prints to a RedMon port. The trick is to configure the PS printer driver so that it embeds Type 42 fonts. And I also need pdftk to set the document security and so on…

Well, that was indeed the past. :P These days I get lazy and just use the free (as in beer) product PrimoPDF. So far, I find it does everything right, but I still manually remove the Ghostscript 8.50 that is bundled in its installer (remove the gs subdirectory) and let it use the most current version on my computer.

Combined with the ability to split and join PDFs using pdftk, that’s a pretty good basic setup for everyday PDF production at a low low cost of zero dollars. I am all thankful!

Update: Many of you reminded me that I should really mention PDFCreator in this post as well. I agree. Both of them are great!

LaTeX Source Specials

As I have hinted in this post about pdfopen, it is possible to do some sort of round-trip LaTeX editing with the right tools. There are two directions to make the round-trip complete.

The forward jump is editor-to-previewer: given the current cursor location in the editor, jump to the corresponding paragraph in the previewer. The backward jump is symmetrical. The magic behind this is called “source specials”.

While it sounds somewhat like a Chinese dish with special sauces, it really isn’t. :P Instead, when you generate the DVI file, you ask latex to tag the DVI file with “source specials”. Imagine each word in the DVI file is tagged with the source filename and the line number of the word.

The forward jump is to ask the previewer to display the location that is right after the editor’s cursor location (filename and line number). The backward jump is to ask the editor to put the cursor closest to the current location in the previewer (usually the location of a double-click).

Let’s get to the example. I will show how to get round-tripping to work in Windows using MiKTeX distribution 2.4 and GNU Emacs. Other configurations are similar.

(Round-tripping between LaTeX and PDF is similar in principle, but the tools are not very mature yet. I will save this for a later post.)

  • Install the AUC TeX package into Emacs.
  • To enable tagging in latex, invoke it with the --src-specials option. If the primary LaTeX source is main.tex, AUC TeX invokes it this way:
    latex --src-specials \nonstopmode\input{main.tex}

  • To do the forward jump with YAP (the DVI previewer in MiKTeX), say when the cursor is on line 65 of hetero.tex, AUC TeX will execute:
    yap -1 -s65hetero.tex main.dvi

  • To enable the backward search, install the gnuserv package. Google can help you locate many links, like this and the rest.
  • Assuming you have gnuclientw installed on your path, in YAP, go to View->Options->Inverse Search. In the Program combo box, you should see “GNU Emacs (Single Instance)”. Example command line:
    gnuclientw.exe -F +%l "%f"

  • Note that installing gnuclientw addresses the problem in this comment. In general, you rarely want to invoke emacs directly. Use gnuclientw instead.
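
To make the shape of the two forward-direction commands explicit, here is a small sketch that only prints them for a given main file, current file and line number. The function name src_specials_cmds is made up for illustration, and nothing is executed; AUC TeX normally builds and runs these commands for you.

```shell
#!/bin/sh
# Hypothetical helper: print the commands from the list above, do not run them.
# $1 = main file without extension, $2 = file under the cursor, $3 = line number
src_specials_cmds() {
    main="$1"
    file="$2"
    line="$3"
    # Tag the DVI file with source specials.
    printf '%s\n' "latex --src-specials \\nonstopmode\\input{${main}.tex}"
    # Forward jump: -s takes the line number immediately followed by the file name.
    printf '%s\n' "yap -1 -s${line}${file} ${main}.dvi"
}
```

With the values from the list, src_specials_cmds main hetero.tex 65 prints the same two command lines shown there.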

Have fun!

Mathematical Illustrations

As a manual of geometry and PostScript, the beginning chapters of this book remind me a lot of my junior school Logo tutorials. But very soon Bill Casselman starts to draw 3D objects in 2D, and projection is where the story gets very interesting. (Think how to draw a polyhedron on a piece of paper.)

The book has an online version and a paper edition.

P.S. One tidbit I learned from this book is that doughnut, oh I mean, torus means cushion in Latin. Hehe.

pdfopen and pdfclose

Update on 2008-01-24:
The “back” feature is no longer needed, at least on Windows Acrobat Reader 8. See: Edit->Preferences->Documents->Restore last view settings when reopening documents

Update on 2007-01-30:
I have patched the source and replaced the zip files so that pdfclose will be less likely to crash Adobe Reader 8. Thanks to this post. Also, apparently pdfopen cannot issue a “back” command to Adobe Reader and you need the full version for that feature.

If you are a Windows user and use Acrobat or Adobe Reader(*) to view PDF files, you may have experienced Acrobat locking your PDF file, making it impossible to overwrite. This is a serious problem when previewing your paper in the PDF format because every time before you generate a new PDF, you need to remember to close the old PDF in Acrobat.

Part of my solution to this problem is to open the PDF using pdfopen:

pdfopen --file foo.pdf

This will allow me to close foo.pdf by:

pdfclose --file foo.pdf

Integrating these two commands into your work-flow is left as an exercise to the reader. :)

But there is still an important usability problem that these two commands won’t solve. Every time you re-open a PDF, you will not be on the page you were on when you closed it. Instead, you will be on the first page. How would I fix this? One solution would be to press Alt-Left after re-opening the PDF file. This goes back in history and brings me to the last view I was at. But I’ve got something better.

I’ve modified the source code of pdfopen from TUG to include an extra option --back to do the obvious thing. So instead of the above, you should open a PDF file by:

pdfopen --file foo.pdf --back

Now the cycle is complete. Phew!
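
One possible shape for the whole cycle is sketched below, assuming a POSIX shell is available (for example through Cygwin), that pdfclose, pdflatex and the modified pdfopen are on the path, and that the document is built with a single pdflatex run. The function name pdf_cycle and the DRY_RUN switch are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: close the viewer, rebuild, and reopen at the last view.
# $1 = document name without extension. Set DRY_RUN=1 to only print the commands.
pdf_cycle() {
    base="$1"
    run=${DRY_RUN:+echo}
    # Release the lock on the old PDF; ignore failure if it was not open.
    $run pdfclose --file "${base}.pdf"
    # Rebuild, and reopen only if the build succeeded.
    $run pdflatex "${base}.tex" &&
        $run pdfopen --file "${base}.pdf" --back
}
```

Binding pdf_cycle foo to a key in your editor, or calling it from a make rule, would then complete the loop.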

I have posted the exe files in a zip. The source files are available too.

For the record, you can obtain the original pdfopen and pdfclose at this URL:
http://www.tug.org/tex-archive/systems/win32/web2c/current/binary/bin-pdftools-win32.zip

BTW, I’ve read that recent versions of TeXnicCenter and WinEdt can both perform this Acrobat cycle too. But I am not sure if they can go back to the previous view. Heh, surely I know my Emacs can. :P

P.S. I am also aware that you can use gsview32 to preview PDFs and gsview32 does not lock PDF files. That’s one way to avoid this problem.

(*) Really, it’s called Adobe Reader. I don’t know since when they dropped the middle word.

Splitting and Joining PDFs by pdftk

Imagine you have a large PDF document that you want to submit to some three-letter government agency. Besides having the document as one big PDF, the agency also asks you to submit the abstract and the biography as two separate PDFs.

Well, you are running out of time, and your coauthors are still working on their sections. You even wrote a make file to automate the document generation because you don’t want to run LaTeX multiple times manually to resolve all the references…

But how do you make these two PDF files efficiently, preferably in an automatic way? Well, if you have Acrobat (the full version), then you can use Document -> Extract Pages to save the relevant pages into separate PDFs. But that’s not easy to automate. And what if you didn’t shell out the $$$ to buy Acrobat?

Here is the good news. To extract pages from a PDF, we can use the free software pdftk. Suppose you know(*) the abstract happens to be on pages 2 and 3 and the biography spans from page 30 to the end. Here is an example usage for our imaginary situation (dont_ask is used to suppress the prompt to overwrite existing files):

pdftk foo.pdf cat 2-3 output abstract.pdf dont_ask
pdftk foo.pdf cat 30-end output biography.pdf dont_ask

Besides page extraction, pdftk can also catenate PDFs and perform several other PDF magic tricks. You can discover all these from reading this page. For example, you can discover that you can compute the number of pages of a PDF by

pdftk foo.pdf dump_data output - | grep NumberOfPages | cut -d' ' -f 2-
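
For use inside a script or a make file, the same idea can be wrapped in a pair of small functions. This is only a sketch: the names parse_page_count and count_pages are made up, and the parser assumes the dump_data report contains a line of the form "NumberOfPages: N".

```shell
#!/bin/sh
# Hypothetical helpers around the pipeline above.

# Read a pdftk dump_data report on stdin and print the page count.
parse_page_count() {
    awk '$1 == "NumberOfPages:" { print $2; exit }'
}

# Print the number of pages of the PDF given as $1 (requires pdftk).
count_pages() {
    pdftk "$1" dump_data | parse_page_count
}
```

Keeping the parser separate from the pdftk call makes it easy to try out on a saved report, e.g. parse_page_count < report.txt.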

(*) How you can know these ranges automatically is another story to be written later. Hint: use pdftotext.
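
As a rough sketch of that hint: pdftotext ends each page of its output with a form feed, so the first page containing a given string can be found by splitting the text on form feeds. The function names find_page and first_page_of are made up, and the assumption that the biography starts with the literal word "Biography" is only for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for locating a section by its heading text.

# Read pdftotext output on stdin; print the number of the first page
# that contains the fixed string $1 (pages are separated by form feeds).
find_page() {
    awk -v pat="$1" 'BEGIN { RS = "\f" } index($0, pat) { print NR; exit }'
}

# Print the first page of PDF $1 that contains the string $2 (requires pdftotext).
first_page_of() {
    pdftotext "$1" - | find_page "$2"
}

# Possible use together with pdftk:
#   start=$(first_page_of foo.pdf "Biography")
#   pdftk foo.pdf cat "${start}-end" output biography.pdf dont_ask
```

A heading string that also appears earlier in the document, say in a table of contents, would give the wrong page, so pick a search string that is unique to the section.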