Archive for the ‘PDFTextStream’ Category

Scala Makes Me Think

Wednesday, October 31st, 2007

(…or, “Oh, Dear, Wasn’t I Thinking Before?”)

As my friends will attest, I really enjoy programming languages. I’m one of those language fetishists who talk about “expressiveness” and “concision”, and yes, I’m one of those very strange fellows who blurt out bad Lisp jokes while getting odd looks from innocent bystanders. And while my bread and butter is built in Java, I often find myself yearning for a more expressive language while deploying, customizing, or integrating PDFTextStream (there I go again with the “expressiveness” bit). That yearning can reach almost pathological extremes at times, prompting me to go so far as to sponsor projects that make it possible to use Java libraries (including PDFTextStream) from within Python.

Fortunately, things don’t always have to be so hard. Case in point, I recently dove head-first into Scala, a language that combines object orientation and functional programming into one very tasty stew. Scala has a number of characteristics that make it interesting aside from its merging of OO and FP mechanisms:

  • it is statically-typed, and provides moderately good type inference that enables one to skip most type declarations and annotations
  • it is compiled, which provides a minimum level of performance (sure, it’s actually byte-compiled, but let’s not quibble right now)
  • and the real kicker: it compiles down to Java class files (or .NET IL), thereby enabling it to be hosted on a JVM (or .NET’s CLR), and call (and be called by) other Java (or .NET) libraries

There’s a lot to like here, for programmers from many walks of life, and I could go on and on about how Scala has single-handedly created and filled a great niche of delivering most of the raw power of purely functional languages like Haskell and ML within a JVM-hosted environment with respectable performance. But what has really impressed me has been the way that Scala has improved how I work. In short, it’s made me really think about development again.

I generally have two working styles. In a classic statically-typed environment (say, Java or C#), I tend to generate pretty clean designs, but my level of productivity is very low. I attribute both of these characteristics to the copious amount of actual work (i.e. finger-typing) that has to go into writing Java or C# code, even with the best of tools. See, while I’m typing (and typing, and typing), I’m thinking two, three, four steps ahead, figuring out the design of the next chunk of code. The verbosity of the language gives me time to reason about the next step while my fingers are working off the previous results.

In a dynamically-typed environment (say, Python or Scheme), I tend to be extraordinarily productive, but unless I consciously step back and purposefully engage in design, the code I write is much more complex. In such environments, there’s less finger-typing going on, so I don’t have a natural backlog allowing me to think about the code before it’s already on the screen. Further, I know I can get from point A to point B relatively easily in many circumstances, so I end up skipping the design step, switching into Cowboy Coder mode, and hacking at things until everything works. Oddly enough, in certain circles, this isn’t so much frowned upon as it is recommended.

Scala is statically-typed, so the naive observer might speculate that my working style in Scala would be much the same as in Java. However, I’ve found that working with Scala has prompted (forced?) me to consciously step back and think about everything, at every step along the way: class hierarchies, type relationships in general, testing strategies, eliminating state where possible…the amount of actual thinking I’ve done while working with Scala has far outstripped the amount of reasoning that typically goes into any similar period of coding. Unsurprisingly, this has led to quite the spike in code quality, which translates into productivity through fewer bugs and less rework.

I attribute this to the strong, static typing that Scala enforces, combined with the type inference that Scala provides. The former forces me to reason about what I’m doing (as it does in Java, for instance), but because the latter eliminates so much of the finger-typing associated with static typing in other environments, I’m given the opportunity to realize that a concrete design phase would yield tremendous benefits, regardless of the scope of code in question. I suspect I would find working in Haskell or ML to be a similar experience, but because those languages don’t easily interoperate with the libraries I need to do my work, I’ve never really given them a chance.

Thankfully, I don’t think I’ll have to. Scala is a great environment, and even more important than its technical merits, its design has led me to engage in a more thoughtful, more conscious development process.

New Year’s PDFTextStream Sale!

Friday, December 22nd, 2006

This morning, we put some limited-time-only discounts into place for PDFTextStream to celebrate the new year. You can now purchase PDFTextStream server deployment licenses for as little as $999 USD (optionally with Premium Support). These licenses carry no CPU restriction, so you can use them on your 1-CPU development box or your 64-CPU Superdome. And, as always, you can use the same license under Java, Python, or .NET. This sale starts today, and ends on January 31, 2007. You can place your order here (with payments handled by Google Checkout).

This is quite a deal — these unlimited-CPU server licenses usually cost $13,750 USD. That’s quite an insane discount, but I thought it was worth the chance. Theoretically, this will create a little buzz, increase our customer list by quite a bit, and maybe expose a different class of users to PDFTextStream that might have previously written it off because of its admittedly high (normal) price tag.

This is also a decent pricing experiment. We’ve never done much experimentation in the area of pricing, so we’ll now have one more data point on our demand curve (as described brilliantly by Joel). I don’t think that this particular experiment will have any lasting effect on our pricing for PDFTextStream, but it will be an interesting exercise nonetheless.

Free PDFTextStream for Academic Use

Thursday, October 12th, 2006

The title says it all.  Today we’re announcing that PDFTextStream is free for academic use: read the press release, and if you are a qualifying academic developer, go ahead and apply for a free PDFTextStream license file.

Don’t worry, the application “process” will take you 2 minutes, and assuming you are eligible (i.e. a student, faculty, academic researcher, or university IT staff), you’ll get your free PDFTextStream license file within a week.  Why a week?  Well, we want to set expectations properly, as we assume we’ll get a pretty solid barrage of applications — after all, everyone likes free stuff.

I’m hoping that this will make life easier for many, especially those who are building truly cool new search, content management, and other webby and/or document-oriented processing systems.  Too often, we’ve run across university-funded researchers who have bare-metal budgets, and are forced to use substandard tools and libraries (but who still manage to build amazing technologies).  PDF is obviously important (and will only become more prevalent), so making sure those folks can get the best PDF content extraction library available at no cost to them will hopefully enable even greater, faster progress.

It’s the very least we can do to “give back”.

Memory-mapping Files in Java Causes Problems

Wednesday, August 30th, 2006

Today, we released PDFTextStream v2.0.1 — a minor patch release that contains a workaround for an interesting and unfortunate bug: on Windows, if one accesses a PDF file on disk using PDFTextStream, then closes the PDFTextStream instance (using PDFTextStream.close()), the PDF file will still be locked. It can’t be moved or deleted.

This is actually not a bug in PDFTextStream, but in Java, documented as Sun bug #4724038. In short, any file that is memory-mapped cannot reliably be “closed”: the `DirectByteBuffer` (or some native proxy, perhaps) that holds the OS-level file handle does not release those resources, even when the `FileChannel` that was used to create the `DirectByteBuffer` is closed. Reading the comments on that bug report shows a great deal of frustration, and rightly so: regardless of the technical reasons for the behavior, memory-mapping files isn’t rocket science (or, hasn’t been for 20 years or somesuch), and this kind of thing shouldn’t happen.
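For the curious, here is a minimal sketch of the behavior using nothing but the standard `java.nio` classes. It is not PDFTextStream code; the class and method names (`MmapLockDemo`, `createScratchFile`, `mapAndClose`, `tryDelete`) are made up for illustration. It maps a scratch file, closes the `FileChannel`, and then tries to delete the file while the mapped buffer is still reachable.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; none of these names come from PDFTextStream.
final class MmapLockDemo {

    /** Creates a small scratch file to stand in for a PDF on disk. */
    static Path createScratchFile() {
        try {
            Path file = Files.createTempFile("mmap-demo", ".bin");
            Files.write(file, "%PDF-".getBytes(StandardCharsets.US_ASCII));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Memory-maps the whole file read-only, then closes the channel. */
    static MappedByteBuffer mapAndClose(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Attempts to delete the file and reports whether that worked. */
    static boolean tryDelete(Path file) {
        try {
            Files.delete(file);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Path file = createScratchFile();
        MappedByteBuffer mapped = mapAndClose(file);
        System.out.println("first byte via mapping: " + (char) mapped.get(0));
        System.out.println("deleted while mapped: " + tryDelete(file));
    }
}
```

Going by the bug report, the delete should fail on Windows for as long as the mapping is alive, while on a typical *nix system it goes through without complaint. There is no supported way to unmap the buffer explicitly, which is exactly the complaint.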

Since we can’t fix the bug, we devised a workaround: if you set the `pdfts.mmap.disable` system property to `Y`, then PDFTextStream won’t memory-map PDF files. Simple enough fix. FYI, there appears to be no performance degradation associated with using PDFTextStream in this mode.
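As a rough sketch, the property can be set programmatically before any PDF is opened. The class and method names here (`DisableMmap`, `disableMemoryMapping`) are illustrative only, and I am assuming the property needs to be in place before PDFTextStream opens a file, so set it as early in startup as you can.

```java
// Hypothetical helper class; only the property name and value come from the release notes.
final class DisableMmap {

    static final String MMAP_DISABLE_PROPERTY = "pdfts.mmap.disable";

    /** Asks PDFTextStream not to memory-map PDF files. */
    static void disableMemoryMapping() {
        System.setProperty(MMAP_DISABLE_PROPERTY, "Y");
    }

    public static void main(String[] args) {
        // Assumption: do this before the first PDF file is opened.
        disableMemoryMapping();
        System.out.println(MMAP_DISABLE_PROPERTY + "="
                + System.getProperty(MMAP_DISABLE_PROPERTY));
    }
}
```

Since it is an ordinary system property, passing `-Dpdfts.mmap.disable=Y` on the `java` command line should work just as well, and that keeps the setting out of your code.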

Of course, this is only a problem on Windows, which does not allow files to be moved or deleted while a process has an open file handle. We have a number of customers that deploy on Windows Server (although that number is much smaller than the number that deploy on a variety of *nix), but until last week, they hadn’t reported any problems. Our best guess is that, given the systems we know those customers are running, they are probably using PDFTextStream’s in-memory mode (where PDF data is in memory, and provided to PDFTextStream as a `ByteBuffer`). In that case, no file handles are ever opened, so all is well.
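To make the in-memory route concrete, here is a small sketch of reading a PDF fully into a heap `ByteBuffer` with standard library calls. The `InMemoryPdf` class and its `load` method are hypothetical names, and the step that hands the buffer to PDFTextStream is left as a comment, since the exact constructor to use should be taken from the PDFTextStream API documentation rather than from this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of PDFTextStream.
final class InMemoryPdf {

    /** Reads the whole file into memory and wraps it in a ByteBuffer. */
    static ByteBuffer load(Path pdf) {
        try {
            // readAllBytes opens, reads, and closes the file before returning,
            // so no file handle outlives this call.
            return ByteBuffer.wrap(Files.readAllBytes(pdf));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteBuffer data = load(Path.of(args[0]));
        System.out.println("loaded " + data.remaining() + " bytes");
        // Next step (not shown): pass `data` to PDFTextStream's in-memory mode.
    }
}
```

The trade-off is the obvious one: the entire PDF lives on the heap, so this suits systems that already have the document bytes in hand more than it suits very large files.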

This problem is the topic of a new FAQ entry as well.