Archive for the ‘Geek Commentary’ Category

Functional vs. Transparent API’s, Part I

Monday, July 25th, 2005

**I’ve been mulling over the relationship and differences between PDFTextStream’s API and other PDF-related API’s.**

I was originally going to write a pretty long tract on this topic, but relented mid-way because I realized that I likely don’t have the concepts straight in my own head, never mind being able to put them down on screen.

PDFBox, JPedal, and other fine PDF libraries present very comprehensive API’s to a developer-user, ones which mirror the nature of PDF data structures to the hilt. That’s excellent, especially if you need to do some low-level mucking around.

To get to more sophisticated functionality (like the extraction of text, generation of PDF’s, etc.), additional API’s need to be laid on top of the lower strata of data structures. It’s a very clean, formal computer science approach that fosters maintainability, reuse of code, and representational consistency.

PDFTextStream takes a somewhat different approach. It is primarily interested in fulfilling a very particular set of developer-user requirements: specifically, the extraction of text and other PDF content with maximal accuracy and throughput. To get there, we simply could not use the layered low-level API approach — while we might be able to make extraction functionality work, the overhead involved in that approach increases dramatically as the complexity of the functional requirement rises.

The result is PDFTextStream’s API, which, if shown to an expert in the PDF document format, would look completely foreign. There are no references to PDF objects, dictionaries, names, XObjects, PostScript, or virtually any other PDF-specific data structures. This is because the PDFTextStream API is focussed on providing the shortest route from point A to point B for the developer-user looking to extract content from PDF documents, period. Obviously, this has the drawback of making the PDFTextStream API singularly useless for anyone who wants to generate PDF reports (for example).

The best terms I can come up with for these types of API’s are ‘transparent’ (for tiered, low-level API’s), and ‘functional’ (for API’s dedicated to a specific functional domain to achieve side-benefits of specialization). Both have their place; transparent API’s are likely to always be more popular (since they have broad applicability), whereas functional API’s are likely to always maintain an edge within their particular domain.

So what’s the point? I find the distinction between ‘transparent’ and ‘functional’ API’s fascinating because the comparison is decidedly nontechnical: it’s about how people interact with the software, and how a software vendor wants to present itself and its product to its users. These might be the kinds of tensions that need to be exploited to make significant strides in software design, since software is still hard to build and hard to use even after the litany of technical ‘revolutions’ that have come and gone over the years.

Benchmarks and Honesty

Wednesday, July 13th, 2005

**…like oil and water, right? Not necessarily; we should hope not, for otherwise we’re all in trouble.**

Last week, someone anonymously posted a comment to a previous entry of mine. In a nutshell, he or she implied that the benchmarks we publish comparing PDFTextStream text extraction performance to that of other Java PDF libraries were rubbish. Here’s the comment in its entirety:

If the product is so good, why are your speed comparisons using your latest version against 2 year old products.

Wow, that hurt. I responded with a comment to the same entry, but the original implication was serious enough that I felt compelled to make a more visible statement about the benchmark that we publish.

The core complaint in the comment was that we’re tilting the playing field by comparing PDFTextStream to other years-old Java PDF libraries. That was and is fundamentally untrue, except in the case of Etymon’s PJ library. Here, I’ll quote my response on this issue from my comment in the original entry:

Etymon PJ was abandoned in favor of PJx years ago; PJx hasn’t been under active development since April of 2004 though (see http://sourceforge.net/projects/pjx/), and in its current state provides no API for text extraction that we can see. However, our original benchmarks nevertheless showed the older PJ library to be the fastest of the available libraries (second to PDFTextStream), so we included it even though Etymon doesn’t appear to support it anymore.

Our perspective on this is that we have been trying to be as transparent and honest as possible with these benchmarks from day one; therefore, when searching out Java PDF libraries to compare to PDFTextStream, we wanted to find the toughest competition possible. We found Etymon’s PJ library to be the fastest text extraction library (second to PDFTextStream), so we included it in the benchmark.

I think that’s very fair, and very honest. Frankly, given the sometimes rabid nature of skepticism in some developer circles, we would likely have been suspected of hiding something if we had originally decided to exclude the PJ library because it’s no longer supported.

Benchmarks have long been viewed with suspicion by technologists of all stripes, but being a publisher of a benchmark has provided me with some perspective. Yes, benchmarks can be gamed; yes, internally-conducted benchmarks can be more vendor fantasy than reality. We knew this from the start, which is why we made extraordinary efforts to make the benchmark as transparent as possible (by publishing the benchmark code, test files, and methodology along with the bottom-line results). Any skeptics are free to run the benchmarks themselves, and report any observed discrepancies.

If that’s not the gold standard of honesty when it comes to benchmarking, and if a benchmark conducted and published in this manner cannot be trusted by the broader developer community, then we’re all in trouble. There are thousands of software products out there, all of which claim a particular advantage over their competition. Some advantages are qualitative, and cannot be measured; that’s fine. However, other advantages are quantifiable; for these claims, we should all welcome a transparent, published benchmark. Otherwise, the process of selecting software products descends into a matter of who has the better marketing and PR game (not that that hasn’t happened to a very large extent already, but that’s a different post!).

Fundamentally, I hope the benchmark doesn’t matter. In the end, I would hope that every developer that is looking for a PDF extraction solution for Java would download all of the available libraries and do some real due diligence to determine which library delivers the best features and throughput in their environment. Voilà, everyone wins.

There’s little left to say, except that, if you find our benchmark to be unconvincing, we remain open and receptive to feedback. If there’s a way we can improve the benchmark, whether through changing methodology, test files, or tweaking timing code, we’ll do it.

Worldly Exposure

Thursday, June 16th, 2005

**Every person has a particular set of experiences they search for when choosing an occupation. For me, I’ve always been fascinated with the act and process of discovery. Thankfully, helping to build and maintain PDFTextStream satisfies that fascination in spades in ways that I never anticipated.**

One would assume that working on a piece of software that extracts text from PDF documents would be pretty dry work. And, to a certain extent, it is: supporting all of the intricacies and minutiae associated with a complex file format like PDF is not the most thrilling software development work.

However, what can be exciting about the experience is how it forces me to be exposed to things that I never would have seen otherwise. See, in order to ensure that PDFTextStream works well and continues to do so as it is improved and changed, we have developed a suite of test PDF documents. These documents must be examined one by one, fed into PDFTextStream, and records of the documents’ logical structure and text content saved off into what are called ‘ground truth’ files. Then, whenever a change is made to PDFTextStream, our automated tests compare all of the preexisting ground truth files with what PDFTextStream provides after it has been changed. This process of constantly tracking the impact of changes to PDFTextStream is critical in ensuring that it continues to be robust, providing high-quality output.
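The ground truth comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not our actual test harness: `extract_text` is a hypothetical stand-in for the extractor, and the one-text-file-per-PDF layout and function names are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def extract_text(pdf_path):
    """Hypothetical stand-in for the extractor under test."""
    # A real harness would call the extraction library here; this stub
    # returns a fixed string so the sketch runs on its own.
    return f"text of {pdf_path.name}"


def check_against_ground_truth(pdf_dir, truth_dir):
    """Return the names of PDFs whose output no longer matches ground truth."""
    regressions = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        truth_path = truth_dir / (pdf_path.stem + ".txt")
        if not truth_path.exists():
            # First run for this document: record its output. In practice
            # a person reviews the output before it is trusted.
            truth_path.write_text(extract_text(pdf_path), encoding="utf-8")
            continue
        expected = truth_path.read_text(encoding="utf-8")
        if extract_text(pdf_path) != expected:
            regressions.append(pdf_path.name)
    return regressions


# Tiny demonstration using a temporary directory and placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    pdf_dir, truth_dir = Path(tmp, "pdfs"), Path(tmp, "truth")
    pdf_dir.mkdir()
    truth_dir.mkdir()
    (pdf_dir / "a.pdf").write_bytes(b"")
    (pdf_dir / "b.pdf").write_bytes(b"")
    # The first run finds no ground truth yet, so it only records output.
    first_run = check_against_ground_truth(pdf_dir, truth_dir)
    # Simulate a regression: alter one stored ground truth file so the
    # extractor's output no longer matches it.
    (truth_dir / "b.txt").write_text("something else", encoding="utf-8")
    second_run = check_against_ground_truth(pdf_dir, truth_dir)
    print(first_run, second_run)
```

The important design point is that ground truth is stored once and then only compared against, so any change in output surfaces as a named document to look at.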

The point here, though, is that the process of building up and maintaining our suite of PDF documents (which numbers in the thousands now) exposes us to documents from nearly every corner of human activity. That’s thrilling for me, as I get the option to read about things that I never would have come across had I not been involved in PDFTextStream. For example, our test suite includes PDF documents like:

  • An issue of the newsletter produced by the National Multiple Sclerosis Society
  • A research paper describing CFS, a Cryptographic File System for Unix that was developed at AT&T
  • Various PDF versions of U.S. patents
  • A maintenance worksheet that describes how to apply and care for a particular type of asphalt emulsion
  • A whitepaper discussing various systems that help in managing spectral data
  • An essay by Seth Godin called Do Less that discusses the need to be selective in one’s entrepreneurial venture
  • An English translation of an al Qaeda training manual seized by the Manchester, UK police in a raid of an al Qaeda cell house
  • An article discussing options for 2D visualization of complex ontologies
  • The 2004 roster for the University of Pittsburgh softball team
  • A PDF version of a PowerPoint presentation about the excruciating financial minutiae of reinsurance
  • An article about how to safely set up and use tower scaffolding
  • A catalog of activities at the 2003 Melbourne Scarf Festival (who knew someone would ever host a lecture called “The Nature of Scarves”?)

As you can see, the list goes on and on and on. The world of human knowledge and experience is functionally infinite, but I love getting glimpses of obscure corners of it and making little personal discoveries. Pretty geeky, I know, but that’s not really surprising, is it?

Icelandic Character Encoding (and other Joys of PDF)

Thursday, September 16th, 2004

**The PDF file format is a wonderful thing — except when it isn’t. Today, I explain the discovery, origin, and resolution of a recent “bug” in PDFTextStream, and provide a gentle introduction of how text is encoded in PDF files along the way.**

Character encoding schemes are among the most difficult computer science topics to understand, in both concept and implementation. They are the bits of code that connect an otherwise arbitrary number (say, 107) to a character (a lower-case ‘k’, in the case of the ASCII character encoding standard). It all seems so simple, right? Here’s a stream of integers that you might find in a binary file somewhere:

[104, 101, 108, 108, 111, 32, 116, 104, 101, 114, 101]

and, you just have a lookup table somewhere that connects each integer value to its corresponding character:

['h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e']
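In Python, that lookup fits in a couple of lines. This is a minimal sketch that assumes the integers are ASCII codes; the variable names are only for illustration.

```python
# The stream of integers from above, assumed here to be ASCII character codes.
codes = [104, 101, 108, 108, 111, 32, 116, 104, 101, 114, 101]

# chr() plays the role of the lookup table: it maps each code to its
# character. (For values below 128, ASCII and Unicode agree.)
chars = [chr(code) for code in codes]
text = "".join(chars)

# bytes.decode() performs the same lookup with an explicitly named encoding.
same_text = bytes(codes).decode("ascii")

print(chars)
print(text)
```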

Ah, if it were that simple. Well, sometimes it is, say 90-95% of the time, but it’s that last 5-10% that makes the world interesting, isn’t it?

P.S.: That’s all I’m going to say about character encoding in the general sense — the rest of this post is related to PDF-specific stuff. If you’d like to read a good introduction to character encoding, and Unicode in particular, check out this Joel on Software post that gives a 5,000-foot-level view of how Unicode works.

The Bug

We hate bugs. They really irritate us, and I’m no exception. So, when we stumbled across a particular PDF that PDFTextStream appeared to not handle properly, we were suitably…irritated.

On its face, the bug seems relatively simple. There’s a particular PDF file that is written in Icelandic. Here’s a screenshot of the PDF as shown by a PDF viewer:

20040916pdf_icelandic.gif

And here’s a screenshot of the text that PDFTextStream produced when it was provided with the PDF file:

20040916text_icelandic.gif

See the differences? In the PDF file, there are a number of eth characters that look like this: ð. (Icelandic has a couple of characters rarely found in other languages; this is one of them.) Now look at the text that PDFTextStream extracted: all of those eth characters are gone, replaced by right-angle brackets (‘>’)!

You can see why we might be irritated.

We pride ourselves on PDFTextStream producing very accurate output of text extracted from PDF files, so a problem like this is taken very seriously. We looked at all of the output, checked out the PDF file, looked at its internals, and concluded that PDFTextStream had a bug that was affecting Icelandic characters specifically (which I wrote about here). After all, this is the first time we’ve run across this particular issue, and it didn’t seem to occur in connection with any of our other international/Unicode test PDFs (including some in Russian, French, Spanish, etc., etc.).

Little did we know that PDFTextStream’s behaviour in this case was not only correct, but that the “problem” was actually caused by the PDF file being malformed in a very particular way.

PDF Text Encoding Primer: You’ll need a seat for this…

To understand why this is happening, you’ll need to know a little bit about how text encodings work in PDF files. By no means is this information complete; if you’re really interested, there’s a 1172-page bedtime story (the PDF v1.5 specification) that explains it all (or as much as the good folks at Adobe could remember).

To get us started, here’s a pictorial depiction of how text is represented in a PDF document:

20040916pdf_encoding.gif

(Those who know about this already, please bear with me, especially on the terminology — I’m trying to make this easy to understand, not technically perfect.)

Conceptually, it’s relatively simple. All text in a PDF file is stored as a sequence of character codes. In addition, every PDF file contains a character encoding for every font that it uses. The character encoding is essentially a dictionary: it links every character code used in the PDF file to a corresponding glyph code. Those glyph codes are then passed on to a font program, which is a set of specialized routines that know how to draw glyphs (glyphs are the particular manifestation of a character or symbol — how the letter ‘a’ is drawn on your screen is one glyph, for example). Once a PDF viewer has found the glyph-drawing instructions (provided by the font program) that correspond to the glyph codes (provided by the character encoding) that were associated with the character codes that are actually contained in a PDF file, it can draw the text on a computer screen or send it to a printer.
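Here is the same chain as a toy Python model. It follows the simplified picture above rather than the real PDF data structures: the character codes, the two dictionaries, and the "outline of …" strings are all made up for illustration.

```python
# Character codes as stored in the page content (hypothetical values).
char_codes = [1, 2, 2, 3]

# The character encoding: character code -> glyph code. Here the glyph
# codes are Unicode code points, which is the common, well-behaved case.
encoding = {1: ord("e"), 2: ord("g"), 3: ord("s")}

# Stand-in for the font program: glyph code -> drawing instructions,
# represented by a short description instead of real outlines.
font_program = {
    ord("e"): "outline of e",
    ord("g"): "outline of g",
    ord("s"): "outline of s",
}

glyph_codes = [encoding[code] for code in char_codes]

# What a viewer does: look up drawing instructions for each glyph code.
drawn = [font_program[glyph] for glyph in glyph_codes]

# What a text extractor does: treat the glyph codes as Unicode.
extracted = "".join(chr(glyph) for glyph in glyph_codes)

print(drawn)
print(extracted)
```

Note that the viewer and the extractor share only the first lookup; after that, one consults the font program and the other never does.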

The Theory

Where things get complicated is in the translation between character codes and glyph codes. In almost every case, the glyph codes specified by character encodings correspond to Unicode character id’s. Such id’s are very standard, and PDFTextStream (or any other library that might attempt to read text out of a PDF) can readily use the stream of those glyph codes as the effective text content of a PDF document. However, very rarely (this Icelandic PDF is the first PDF document we’ve come across that has this peculiarity), those glyph codes don’t correspond to Unicode character id’s, leading to improper characters being output as the text content of the PDF document.

Confused? Don’t worry, it’s not the simplest of things to grok. Simply put, in the case of the Icelandic PDF, the in-force font program associates the glyph code for a right-angle bracket with the eth glyph, and the character encoding in the PDF file reflects that.

If that doesn’t sound right, that’s because it isn’t. Technically, a glyph code provided by a character encoding should select the glyph in the font program that depicts the Unicode character the glyph code stands for. Here it doesn’t: the glyph code for ‘>’ selects a glyph that looks like an eth. So, PDFTextStream is outputting the wrong character because the PDF file is malformed (to correspond with the malformed font program).
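To make the mismatch concrete, here is a toy Python model of the situation. It is a sketch of the theory only: the character code 7, the sample word, and the two dictionaries are invented for illustration and are not taken from the actual Icelandic file.

```python
ETH = "\u00f0"  # the eth character

# Page content for the word "ma" + eth + "ur"; hypothetical character
# code 7 is the one that should end up as an eth.
char_codes = [ord("m"), ord("a"), 7, ord("u"), ord("r")]

# Character encoding: character code -> glyph code. The ordinary letters
# map to their own Unicode values, but code 7 is filed under the glyph
# code for '>' (the malformed part).
encoding = {ord(c): ord(c) for c in "maur"}
encoding[7] = ord(">")

# Font program stand-in: glyph code -> the shape that gets drawn.
# It draws an eth for the glyph code that should mean '>'.
font_program = {ord(c): c for c in "maur"}
font_program[ord(">")] = ETH

glyph_codes = [encoding[code] for code in char_codes]

# A viewer draws whatever the font program says, so the page looks right.
on_screen = "".join(font_program[glyph] for glyph in glyph_codes)

# An extractor reads the glyph codes as Unicode, so it reports '>'.
extracted = "".join(chr(glyph) for glyph in glyph_codes)

print(on_screen)
print(extracted)
```

The two mistakes cancel out on screen, which is why the file looks fine in a viewer, but an extractor that never consults the font program sees only the first one.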

So, after we figured this out, as a last check of our theory, we turned to Google. We originally stumbled upon the Icelandic PDF file on the Internet, so a few searches later, we managed to get it to show up in the results of a Google search. Google has this nifty ‘View as HTML’ link next to most of its PDF search results; clicking on that link brought up Google’s extract of the text of the PDF. Here’s a screenshot of that HTML view:

20040916google_icelandic.gif

Ah, and there are those right-angle brackets again! So, even Google’s text extraction utility falls prey to the encoding problems in the semi-faulty PDF file, and so, presumably, will any other PDF text extraction library that relies on the file’s character encoding. Once we saw this, we decided to put the issue to bed.

Conclusion

The bottom line is that, because there is no notion of a ‘valid’ PDF file, there will always be some vanishingly small percentage of PDF’s that don’t follow the PDF file specification, or even widely-held conventions. Unfortunately, that means that extracting text out of PDF documents will always be an imperfect art.

That irritates us.