Archive for July, 2005

Votes for Transparent APIs

Saturday, July 30th, 2005

**A brief word on the response to last week’s missive regarding ‘functional’ and ‘transparent’ APIs.**

After my last post, we received two serious inquiries from existing customers about whether (and when) PDFTextStream will provide a transparent PDF API. The general message of these inquiries is that, even though specific applications that are using PDFTextStream require and benefit from its decidedly functional API (focussed on content extraction), it’s occasionally useful or necessary to dig a little deeper into a PDF for other reasons.

Specifically, these customers have been using PDFBox or PJ in order to look at the guts of PDF documents in a way that PDFTextStream doesn’t currently provide for. These usages aren’t in production environments; in both cases, the transparent APIs are being used in a support or troubleshooting role, especially with poorly formed PDF documents.

Nevertheless, it’s always better to provide a complete solution, as these two customers have pointed out. Ergo, it looks like we’ll be providing a transparent API ‘some time soon’, and hopefully we’ll be able to navigate the technical difficulties likely to crop up when deploying transparent and functional APIs simultaneously. More as it happens, etc.

Functional vs. Transparent APIs, Part I

Monday, July 25th, 2005

**I’ve been mulling over the relationship and differences between PDFTextStream’s API and other PDF-related APIs.**

I was originally going to write a pretty long tract on this topic, but relented midway because I realized that I likely don’t have the concepts straight in my own head, never mind being able to put them down on screen.

PDFBox, JPedal, and other fine PDF libraries present very comprehensive APIs to a developer-user, ones which mirror the nature of PDF data structures to the hilt. That’s excellent, especially if you need to do some low-level mucking around.

To get to more sophisticated functionality (like the extraction of text, generation of PDFs, etc.), additional APIs need to be laid on top of the lower strata of data structures. It’s a very clean, formal computer science approach that fosters maintainability, reuse of code, and representational consistency.

PDFTextStream takes a somewhat different approach. It is primarily interested in fulfilling a very particular set of developer-user requirements: specifically, the extraction of text and other PDF content with maximal accuracy and throughput. To get there, we simply could not use the layered low-level API approach — while we might be able to make extraction functionality work, the overhead involved in that approach increases dramatically as the complexity of the functional requirement rises.

The result is PDFTextStream’s API, which, if shown to an expert in the PDF document format, would look completely foreign. There are no references to PDF objects, dictionaries, names, XObjects, PostScript, or virtually any other PDF-specific data structures. This is because the PDFTextStream API is focussed on providing the shortest route from point A to point B for the developer-user looking to extract content from PDF documents, period. Obviously, this has the drawback of making the PDFTextStream API singularly useless for anyone who wants to generate PDF reports (for example).

The best terms I can come up with for these types of APIs are ‘transparent’ (for tiered, low-level APIs), and ‘functional’ (for APIs dedicated to a specific functional domain to achieve side-benefits of specialization). Both have their place; transparent APIs are likely to always be more popular (since they have broad applicability), whereas functional APIs are likely to always maintain an edge within their particular domain.
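To make the distinction a bit more concrete, here is a toy Java sketch of how the two styles might look from the caller’s side. Every name in it (`ApiStyles`, `RawObject`, `TextSource`, `extractByWalking`, `open`) is made up for illustration; none of this is the actual API of PDFTextStream, PDFBox, or any other library, and the "document" is just a small in-memory tree, not a real PDF.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy contrast of the two API styles; every name here is hypothetical. */
public class ApiStyles {

    /** "Transparent" style: the caller sees a generic object tree that mirrors the file. */
    public static final class RawObject {
        public final Map<String, String> dictionary = new LinkedHashMap<>();
        public final List<RawObject> children = new ArrayList<>();

        public RawObject put(String key, String value) {
            dictionary.put(key, value);
            return this;
        }

        public RawObject add(RawObject child) {
            children.add(child);
            return this;
        }
    }

    /** With a transparent API, text extraction is a layer built by walking the tree. */
    public static String extractByWalking(RawObject root) {
        StringBuilder out = new StringBuilder();
        collectText(root, out);
        return out.toString().trim();
    }

    // Depth-first walk that appends the value of every node typed "Text".
    private static void collectText(RawObject node, StringBuilder out) {
        if ("Text".equals(node.dictionary.get("Type"))) {
            out.append(node.dictionary.get("Value")).append(' ');
        }
        for (RawObject child : node.children) {
            collectText(child, out);
        }
    }

    /** "Functional" style: the only thing exposed is the task itself. */
    public interface TextSource {
        String text();
    }

    /**
     * Nothing about the underlying structure leaks through TextSource.
     * This toy reuses the tree walk internally; a real library need not.
     */
    public static TextSource open(final RawObject root) {
        return new TextSource() {
            @Override
            public String text() {
                return extractByWalking(root);
            }
        };
    }

    /** Builds a tiny stand-in "document": one page holding two text runs. */
    public static RawObject sampleDocument() {
        return new RawObject().put("Type", "Page")
                .add(new RawObject().put("Type", "Text").put("Value", "Hello"))
                .add(new RawObject().put("Type", "Text").put("Value", "world"));
    }

    public static void main(String[] args) {
        RawObject doc = sampleDocument();
        System.out.println(extractByWalking(doc)); // transparent route: caller handles the tree
        System.out.println(open(doc).text());      // functional route: caller only asks for text
    }
}
```

Both routes produce the same string here. The difference is only in what the caller has to know: the first exposes dictionaries and child objects, the second exposes nothing but the text. The sketch says nothing about performance or about how either kind of library is built internally.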

So what’s the point? I find the distinction between ‘transparent’ and ‘functional’ APIs fascinating because the comparison is decidedly nontechnical: it’s about how people interact with the software, and how a software vendor wants to present itself and its product to its users. These might be the kinds of tensions that need to be exploited to make significant strides in software design, since software is still hard to build and hard to use even after the litany of technical ‘revolutions’ that have come and gone over the years.

Blog Stuff

Friday, July 22nd, 2005

Notes on general housekeeping around here.

Just an FYI: we’ve begun to slowly improve the technical side of this blog. The feeds are no longer running a day behind new posts, and there are actual ‘next’ and ‘previous’ links at the bottom of the main blog view. It’s the little things, right?

For those that are interested, this blog is run on Quills, a blog product for Plone (the Zope CMS that handles most of the dirty work of keeping our site humming nicely).

Quills is a decent blog platform, probably the best that is out there for Plone. I originally learned of it from Tom Lazar’s blog, which also uses Quills.

Some might say that we should have a ‘Quills Powered’ badge on here somewhere, but truth be told, we’ve rewritten about half of the product (most of the page templates). It provides a nice object model, but the presentation simply does not work for us out of the box at all, and much of it really doesn’t make much sense to me (the structure of the archive pages, in particular). Shipping our improvements over to the Quills folks is probably a good idea (best to not gripe without pitching in and all that), but a *ton* of cleanup would be required to make it usable outside of our site again. There’s just not enough time in the day.

High Noon

Monday, July 18th, 2005

**PDFlib released a PDF text extraction component, so let’s see how we stack up.**

A week or so ago, Dan Shea at PlanetPDF posted a news item about PDFlib releasing a PDF text extraction library. That’s obviously very interesting to us, simply because until now, PDFTextStream has been the only library out there concentrating on PDF text extraction.

My first reaction to reading this news was to shoot an email off to Dan, suggesting that a PDF text extraction library shootout of some kind might be in order. His reply was, “What do you have in mind?”

Well, jeez, I hadn’t gotten that far yet. I assume any comparison of text extraction libraries should focus on a few things immediately critical to the endeavor:

  • Text extraction accuracy
  • Operational performance and throughput
  • PDF compatibility (PDF specification support, decryption services, etc.)
  • Auxiliary features (accessibility of other content)

And then there are the extras that one looks for in any library:

  • Platform/Environment support
  • API clarity
  • Documentation and support
  • Vendor stability and longevity

Obviously, there’s a lot there, and since text extraction is a minute field compared to PDF generation, etc., Dan (or any other reviewer) would likely pick and choose what to focus on. May he (and others) always choose those aspects where we dominate… ;-)

In this particular situation, there’s also the complication of platform support: PDFlib’s component is available on a variety of platforms (through C bindings), whereas PDFTextStream is only available on the Java platform. That gives PDFlib an obvious advantage where Java isn’t in play, since we’re not showing up on .NET, Python, etc., yet.

Anything missing here? Feel free to email me with any aspects that you think are important.