Archive for the ‘Geek Commentary’ Category

Cratered By Digg

Thursday, September 7th, 2006

Well, that was a surprise. PDFTextOnline was linked on Digg and made it to the front page (it was at #2 when I saw it).

Of course, you know the drill from here. We built PDFTextOnline and put it out there as a nifty little tool, hoping that some people would find it useful, and that maybe a couple of curious software developers and managers might stumble upon PDFTextStream as a great way to bring the kind of PDF text extraction they see in PDFTextOnline into their organization. We haven’t promoted it, or even linked to it heavily on snowtide.com.
Given all that, we didn’t put PDFTextOnline on a particularly large server — in fact, it was running on a mid-level VPS. Definitely nothing special.

Then we got hit with the Digg effect, and whammo, say goodbye. I haven’t poked at the server logs much yet, but the flood of traffic was heavy and unyielding.

So, I got the hint — PDFTextOnline is genuinely interesting to an audience larger than us. :-) Now I need to go server-shopping.

My hope is that PDFTextOnline will be back up later tonight, and then moved to a real server next week. Then maybe we can get slashdotted, and do it all over again!

Software Development and…Pregnancy?

Wednesday, August 16th, 2006

For nearly a year, we have been working on three things in parallel: a major product release, a new website, and a website/AJAX app.

All three of these things are absurdly complex, and large, and represent a huge amount of work. And, like the geniuses we are, we decided, “Hey, let’s release all of them at once!”

Well, what doesn’t kill you makes you stronger, right? It turns out that this was probably a very bad idea…not because we sacrificed quality or cut corners to make deadlines or anything strictly taboo like that. It was a bad idea because sleep is a precious thing. :-)

I don’t have children (and, technically speaking, I’ll never have children, seeing as I’m of the male persuasion), so pardon me while I draw a very tenuous analogy between software development and pregnancy. We just had the software equivalent of triplets: one major product release, one website, and one website/AJAX app, all at once.

This is a good reminder that, 99% of the time, software (and really, business, for that matter) should be incremental. We know this, and have practiced it for a long time, but I can guarantee you, we understand it a lot better now that we’ve broken that rule.

That said, why don’t you say “hi” to our newborns: PDFTextStream v2.0, the new snowtide.com, and PDFTextOnline.

“Sometimes required, otherwise optional”

Tuesday, September 6th, 2005

**This Yogi Berra-esque annotation next to a particular value in a PDF data structure included in the PDF document specification sent me for a little ride last night.**

I noticed this while reading a particular section in the PDF document specification, and ended up laughing myself to tears. I’m sure part of that was the tiredness (we’re grinding out some pretty long hours working on some nifty new features for PDFTextStream v1.4), but a good chunk of it was the absurdity of the language used and what it implies for the ‘specification’.

“Sometimes required”? First, that just doesn’t make sense on the face of it, at least not without some definition of what ‘sometimes’ entails; second, stuff like this makes for unnecessarily complicated implementations of a specification. Of course, the annotation was explained in detail, and any serious developer would be able to handle its repercussions with aplomb. But not all software is written by serious people. For proof of that, just look at the hundreds of thousands (millions?) of PDF documents out there that contain off-by-one kinds of “errors” that fall into gray areas of the PDF spec, but are consumed without problems by a variety of applications (including Acrobat Reader and PDFTextStream).

(For PDF-heads out there, this annotation is associated with the /I value in a choice field dictionary, described in Table 8.71 of the PDF spec.)
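To make ‘sometimes’ concrete, here is a minimal sketch of the condition as a predicate. It rests on an assumption about the explanatory text attached to that annotation: that /I is required when two or more /Opt items have different names but the same export value, or when the field’s value is an array. Check the table itself before relying on this. The function name and the use of plain Python lists for PDF arrays are purely illustrative and are not taken from any library.

```python
def index_entry_required(opt, value):
    """Return True when /I would be required, under the reading assumed above.

    opt   -- the /Opt array: each item is either a plain string, or a
             two-element [export_value, display_name] pair
    value -- the field's /V entry: a string, or a list for multiple selections
    """
    # Assumed case 1: the field's value is itself an array.
    if isinstance(value, list):
        return True

    # Assumed case 2: two or more options share an export value but differ in name.
    names_by_export = {}
    for item in opt:
        if isinstance(item, (list, tuple)):
            export, name = item
        else:
            export = name = item
        names_by_export.setdefault(export, set()).add(name)
    return any(len(names) > 1 for names in names_by_export.values())
```

So a simple list of distinct options with a single selected value never needs /I, while a list in which “Red” and “Crimson” both export the same value does.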

Functional vs. Transparent API’s, Part I

Monday, July 25th, 2005

**I’ve been mulling over the relationship and differences between PDFTextStream’s API and other PDF-related API’s.**

I was originally going to write a pretty long tract on this topic, but relented midway because I realized that I likely don’t have the concepts straight in my own head, never mind being able to put them down on screen.

PDFBox, JPedal, and other fine PDF libraries present very comprehensive API’s to a developer-user, ones which mirror the nature of PDF data structures to the hilt. That’s excellent, especially if you need to do some low-level mucking around.

To get to more sophisticated functionality (like the extraction of text, generation of PDF’s, etc.), additional API’s need to be laid on top of the lower strata of data structures. It’s a very clean, formal computer science approach that fosters maintainability, reuse of code, and representational consistency.

PDFTextStream takes a somewhat different approach. It is primarily interested in fulfilling a very particular set of developer-user requirements: specifically, the extraction of text and other PDF content with maximal accuracy and throughput. To get there, we simply could not use the layered low-level API approach — while we might be able to make extraction functionality work, the overhead involved in that approach increases dramatically as the complexity of the functional requirement rises.

The result is PDFTextStream’s API, which, if shown to an expert in the PDF document format, would look completely foreign. There are no references to PDF objects, dictionaries, names, XObjects, PostScript, or virtually any other PDF-specific data structures. This is because the PDFTextStream API is focused on providing the shortest route from point A to point B for the developer-user looking to extract content from PDF documents, period. Obviously, this has the drawback of making the PDFTextStream API singularly useless for anyone who wants to generate PDF reports (for example).

The best terms I can come up with for these types of API’s are ‘transparent’ (for tiered, low-level API’s), and ‘functional’ (for API’s dedicated to a specific functional domain to achieve side-benefits of specialization). Both have their place; transparent API’s are likely to always be more popular (since they have broad applicability), whereas functional API’s are likely to always maintain an edge within their particular domain.
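A toy illustration of the two styles may help. This is only a sketch: the nested dict stands in for a parsed PDF, and every class and function name in it is hypothetical. It does not show the API of PDFBox, JPedal, PDFTextStream, or any other real library.

```python
# Toy stand-in for a parsed PDF: nested dicts that loosely mirror the file's objects.
# Everything here is hypothetical; no real library's API is shown.
TOY_DOC = {
    "Root": {
        "Pages": {
            "Kids": [
                {"Contents": [("Tj", "Hello"), ("Td", (0, -12)), ("Tj", "world")]},
                {"Contents": [("Tj", "Page two")]},
            ]
        }
    }
}


class TransparentDocument:
    """'Transparent' style: callers walk the same structures the file contains."""

    def __init__(self, raw):
        self.raw = raw

    def catalog(self):
        return self.raw["Root"]

    def page_dicts(self):
        return self.catalog()["Pages"]["Kids"]

    def operators(self, page_dict):
        return page_dict["Contents"]


def extract_text(raw):
    """'Functional' style: one call for one job, no PDF structures in the signature."""
    pages = []
    for page in raw["Root"]["Pages"]["Kids"]:
        # Keep only the operands of the text-showing operator.
        shown = [operand for op, operand in page["Contents"] if op == "Tj"]
        pages.append(" ".join(shown))
    return pages


# Transparent usage: the caller has to know about catalogs, page trees, and operators.
doc = TransparentDocument(TOY_DOC)
transparent_result = [
    " ".join(operand for op, operand in doc.operators(page) if op == "Tj")
    for page in doc.page_dicts()
]

# Functional usage: the caller only states what they want.
functional_result = extract_text(TOY_DOC)
```

Both routes produce the same text here. The difference is in what the caller must understand to get it, and in how much else the transparent version could be used for.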

So what’s the point? I find the distinction between ‘transparent’ and ‘functional’ API’s fascinating because the comparison is decidedly nontechnical: it’s about how people interact with the software, and how a software vendor wants to present itself and its product to its users. These might be the kinds of tensions that need to be exploited to make significant strides in software design, since software is still hard to build and hard to use even after the litany of technical ‘revolutions’ that have come and gone over the years.