Archive for the ‘The Business’ Category

Working Together: Python/Java & Open Source/Commercial

Monday, August 21st, 2006

PDFTextStream started out as a Java library, but is now available and supported for Python. How that leap was made exemplifies how commercial and open source software efforts complement each other in the best of circumstances, and is also a fantastic case study in Java + Python integration.

In general, Java and Python don’t really mix. Their architectures, best practices, object models, and philosophies are pretty divergent in a lot of ways. Because of this, you don’t often find them cohabiting peacefully.

However, there are significant advantages to be had by bringing these two environments together. Python is a really elegant language, and is very well-suited to whole classes of software development that are much more painful to tackle in Java. Java has its advantages as well: a very mature standard library, a huge array of third-party library support, fantastic development environments, and the backing of big players in IT. As always, there’s a right tool for each job, and sometimes Java works best, and sometimes Python works best, but a combination would truly be more than the sum of its parts.

As PDFTextStream got its legs in the market about 18 months ago, our consulting business picked up, and I began to look for a way to use Python for prototyping and custom development in conjunction with PDFTextStream. Of course, back then, PDFTextStream was only for Java, so some bridge-building was in order.

I came across JPype (http://jpype.sourceforge.net), and found it to be a promising solution. JPype is an open-source Python library that gives “python programs full access to java class libraries”. Sounds good, and it was.

Eventually, however, we ran into some problems. Specifically, one of our clients wanted to have PDFTextStream extract text from PDF documents in-memory (i.e. without having the PDF file(s) on disk). That’s not a problem with PDFTextStream; we added that feature in short order.

However, this client was also adamant in their desire for a Python-based solution. The rest of their application (with which our piece integrated) is 100% Python, and their performance requirements (think millions of PDF documents processed per month) made running PDFTextStream as some kind of service component unthinkable.

What’s the problem? JPype, circa summer of 2005, copied data between Python and Java. That means that, if you have a PDF file in memory in Python, and want to use PDFTextStream’s in-memory extraction capability, JPype made a copy of that PDF file data before passing it off into the target Java function or constructor.

Bad, bad, bad. That was a huge performance hit to the application, and simply unacceptable from the client’s (and users’) point of view.

The obvious course of action was to make JPype, in effect, “pass by reference” when working with significant chunks of data (byte arrays, Strings, etc). This was no simple task, but we soon contacted the maintainer of JPype, a friendly fellow named Steve Ménard, and explained our predicament.

Within a few days, he had hammered out the idea to expose Python strings (the byte array of the Python world in most environments) as DirectByteBuffer objects in Java. This was a great idea, and meshed nicely with PDFTextStream’s in-memory processing API. Steve and I hammered out a relatively informal work agreement and hourly rate, and it was assumed by both of us that his enhancements to JPype for our purposes would stay licensed under the Apache v2.0 license to be enjoyed by the rest of the JPype community.
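To make the copy-versus-reference distinction concrete, here is a minimal sketch in plain Python. It is not JPype code and does not touch PDFTextStream; it only uses the standard library's `bytearray` and `memoryview` to show the difference between handing a buffer over by value and handing over a view of the same memory. The function names and the stand-in "PDF" bytes are made up for illustration.

```python
# Illustration only: standard-library Python, not JPype's API.
# copy_payload / share_payload are hypothetical names for this sketch.

def copy_payload(data: bytearray) -> bytes:
    """Hand the data over by value: bytes() duplicates the whole buffer."""
    return bytes(data)


def share_payload(data: bytearray) -> memoryview:
    """Hand the data over by reference: the view wraps the same memory."""
    return memoryview(data)


# A stand-in for a PDF file held in memory.
pdf_bytes = bytearray(b"%PDF-1.4 stand-in for a real document")

copied = copy_payload(pdf_bytes)
shared = share_payload(pdf_bytes)

# Change one byte of the original (the "4" in the header becomes "6").
pdf_bytes[7] = ord("6")

# The copy is a separate buffer and still holds the old header;
# the view reads the original buffer and sees the change.
print(copied[:8])
print(bytes(shared[:8]))
```

For a multi-megabyte document the copy costs time and memory proportional to the file size on every call, while the view costs a small constant amount. That is the same trade-off that motivated exposing Python data to Java as a direct buffer rather than duplicating it.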

Nailing down all the technical details took a few weeks, but in the end, Steve was successful. We were able to put PDFTextStream’s entire API to use from within Python in a way that sacrificed not one ounce of performance or functionality.

So what’s the upshot of all of this?

  • Our consulting job completed with high praise from our customer, and our component of their application continues to hum away, extracting text from millions of PDF documents per month using PDFTextStream from Python.
  • We’ve since worked with Steve here and there as necessary in order to make additional tweaks to JPype. Because of his help, we now distribute a supported version of PDFTextStream for Python (click that for more technical details about the Python/Java integration made possible by JPype).
  • The JPype project retains the new/improved functionality that we paid for, and the broader community continues to benefit from that.
  • Steve got to pick up a new Mac mini, plus whatever else he felt like buying with his hard-earned cash.

That’s what I call a win-win situation, for us, for our customers, for Steve, and for the JPype project and its other users. In an ideal world, this is how open source and commercial software efforts should collaborate and cross-pollinate.

Open Source, Positioning, and Execution

Tuesday, August 16th, 2005

**In the past month, I’ve read no fewer than 8 articles and blog posts trying to thread a story around what is apparently the “big” question these days: how can software companies make money in an open source world? Well, we are, quite well thank-you-very-much. Here’s how and why.**

Our primary product is PDFTextStream. It came onto the market a year ago, entering a field (Java libraries that can extract content from PDF documents) that was dominated by open source (or dual-licensed) offerings that are generally well-liked by the broader community.

OK, so why are we still here, thriving and growing?

  • Positioning: When I decided to enter this market three years ago, I knew we would have a good chance simply because it has characteristics that are uniquely suited to a strong, specialized commercial vendor. While generating PDF documents is generally quite easy (thereby leading to a glut of report-generating libraries), extracting content from PDF documents is not. There are numerous file-format ambiguities to address, as well as the details related to achieving the document understanding accuracy that is demanded by corporate and government customers. Anyone not dedicated to serving this market with 100% of their effort will not meet the market’s true demands.
  • Execution: Anyone who strives to innovate eventually experiences some anxiety about sharing ideas with colleagues, with the irrational fear that those ideas might be misappropriated, leading to unnecessary competition. The thing is, dozens or hundreds of other people in the same field are likely having the same ideas simultaneously, so the only thing that will ever ensure business success is superior execution.

    Likewise, there are at least four open source Java libraries that extract content out of PDF documents. It’s not arrogant or smug to say that we’ll out-execute the teams or individuals that work on those libraries. We’re in this for the long haul and this is all we do 14 hours a day.
  • Serving a Niche: Very closely related to product positioning was the decision to enter a very demanding niche. We’re not trying to build yet another HTTP server, EJB container, etc. We’re not working on a commodity, and therefore we are much less likely to see competition from an open source library staffed by developers from IBM (for example). Beyond this market-centric reality is the fact that PDF content extraction is a much more difficult game than writing an HTTP server (again, for example): there are no standards, there are no RFCs, there’s no easy way to tell if you’re doing things the right way. So, if someone wants to go head to head with PDFTextStream, they’ll have to grab their machete and start slicing through the same jungle of PDF specs, mangled documents (which nevertheless open in Acrobat without a hitch), and all of the other fun that goes into building a PDF extraction library.

I’m not saying that this formula we’ve worked out is simple, or that it can be easily replicated with a different product in a different market. However, at least from where I’m sitting, “living in an open source world” is pretty pleasant.

High Noon

Monday, July 18th, 2005

**PDFlib released a PDF text extraction component, so let’s see how we stack up.**

A week or so ago, Dan Shea at PlanetPDF posted a news item about PDFlib releasing a PDF text extraction library. That’s obviously very interesting to us, simply because until now, PDFTextStream has been the only library out there concentrating on PDF text extraction.

My first reaction to reading this news was to shoot an email off to Dan, suggesting that a PDF text extraction library shootout of some kind might be in order. His reply was, “What do you have in mind?”

Well, jeez, I hadn’t gotten that far yet. I assume any comparison of text extraction libraries should focus on a few things immediately critical to the endeavor:

  • Text extraction accuracy
  • Operational performance and throughput
  • PDF compatibility (PDF specification support, decryption services, etc.)
  • Auxiliary features (accessibility of other content)

And then there are the extras that one looks for in any library:

  • Platform/Environment support
  • API clarity
  • Documentation and support
  • Vendor stability and longevity

Obviously, there’s a lot there, and since text extraction is a minute field compared to PDF generation, etc., Dan (or any other reviewer) would likely pick and choose what to focus on. May he (and others) always choose those aspects where we dominate… ;-)

In this particular situation, there’s also the complication of platform support: PDFlib’s component is available on a variety of platforms (through C bindings), whereas PDFTextStream is only available on the Java platform. That gives PDFlib an obvious advantage where Java isn’t in play, since we’re not showing up on .NET, Python, etc., yet.

Anything missing here? Feel free to email me with any aspects that you think are important.

Totally Flattened

Tuesday, June 28th, 2005

**The past 10 days have been just nuts.**

When it rains, it’s buckets.

We got hit last week with serious inquiries from half a dozen very large organizations: a good mix of governmental, corporate, and nonprofit/research. Each of them already had a grasp of what PDFTextStream could mean to them and their projects, especially on the performance and text extraction quality fronts. However, each of them was also looking for some broader extraction functionality: bookmarks, annotations, tagged PDF structures, etc.

This is stuff we were already working on and planning to add into the mix, but these new requests certainly kicked the pace up quite a bit. Some of it was pretty quick and easy to finish up and move into beta phase — that will find its way into released versions very soon.

Other stuff is a little harder though, to put it mildly: OCR of text in images in PDFs, decryption of digitally-signed documents, and other higher-order functionality. Again, all stuff we’ve been positioning ourselves to jump on, but when there’s fish to fry, we all start cooking a little faster. (Now’s when you’re supposed to groan at the horrible pun….)

So, we’re definitely busy. Now, who said software slowed down in the summer?