Archive for the ‘PDFTextStream’ Category

Working Together: Python/Java & Open Source/Commercial

Monday, August 21st, 2006

PDFTextStream started out as a Java library, but is now available and supported for Python. How that leap was made exemplifies how commercial and open source software efforts complement each other in the best of circumstances, and is also a fantastic case study in Java + Python integration.

In general, Java and Python don’t really mix. Their architectures, best practices, object models, and philosophies are pretty divergent in a lot of ways. Because of this, you don’t often find them cohabiting peacefully.

However, there are significant advantages to be had by bringing these two environments together. Python is a really elegant language, and is very well-suited to whole classes of software development that are much more painful to tackle in Java. Java has its advantages as well: a very mature standard library, a huge array of third-party libraries, fantastic development environments, and the backing of big players in IT. As always, there’s a right tool for each job: sometimes Java works best, and sometimes Python works best, but a combination can truly be more than the sum of its parts.

As PDFTextStream got its legs in the market about 18 months ago, our consulting business picked up, and I began to look for a way to use Python for prototyping and custom development in conjunction with PDFTextStream. Of course, back then, PDFTextStream was only for Java, so some bridge-building was in order.

I came across JPype (http://jpype.sourceforge.net), and found it to be a promising solution. JPype is an open-source Python library that gives “python programs full access to java class libraries”. Sounds good, and it was.

Eventually, however, we ran into some problems. Specifically, one of our clients wanted to have PDFTextStream extract text from PDF documents in-memory (i.e. without having the PDF file(s) on disk). That’s not a problem with PDFTextStream; we added that feature in short order.

However, this client was also adamant in their desire for a Python-based solution. The rest of their application (with which our piece integrated) is 100% Python, and their performance requirements (think millions of PDF documents processed per month) made running PDFTextStream as some kind of service component unthinkable.

What’s the problem? JPype, circa summer of 2005, copied data between Python and Java. That means that, if you have a PDF file in memory in Python, and want to use PDFTextStream’s in-memory extraction capability, JPype made a copy of that PDF file data before passing it off into the target Java function or constructor.

Bad, bad, bad. That was a huge performance hit to the application, and simply unacceptable from the client’s (and users’) point of view.

The obvious course of action was to make JPype, in effect, “pass by reference” when working with significant chunks of data (byte arrays, Strings, etc). This was no simple task, but we soon contacted the maintainer of JPype, a friendly fellow named Steve Ménard, and explained our predicament.

Within a few days, he had hammered out the idea to expose Python strings (the byte array of the Python world in most environments) as DirectByteBuffer objects in Java. This was a great idea, and meshed nicely with PDFTextStream’s in-memory processing API. Steve and I hammered out a relatively informal work agreement and hourly rate, and it was assumed by both of us that his enhancements to JPype for our purposes would stay licensed under the Apache v2.0 license to be enjoyed by the rest of the JPype community.
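The difference between copying a buffer and sharing it can be illustrated in plain Python, without JPype or a JVM. The sketch below is only an analogy: `memoryview` stands in for the role a direct byte buffer plays on the Java side, and `pdf_bytes` is a made-up placeholder rather than real PDF data.

```python
# Analogy only: this does not use JPype or Java.
# A bytearray stands in for a PDF document held in memory (placeholder data).
pdf_bytes = bytearray(b"%PDF-1.4 placeholder content")

# Copying: bytes() allocates a second buffer with the same contents.
copied = bytes(pdf_bytes)

# Sharing: memoryview exposes the same underlying memory without copying it.
shared = memoryview(pdf_bytes)

# Mutate the original to show which object is a true view of the data.
pdf_bytes[0] = ord("#")

copy_sees_change = copied[0] == ord("#")   # the copy still holds the old byte
view_sees_change = shared[0] == ord("#")   # the view reflects the new byte
```

For a multi-megabyte document, the copy costs time and memory proportional to its size on every call, while the shared view costs roughly the same regardless of size.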

Nailing down all the technical details took a few weeks, but in the end, Steve was successful. We were able to put PDFTextStream’s entire API to use from within Python in a way that sacrificed not one ounce of performance or functionality.

So what’s the upshot of all of this?

  • Our consulting job completed with high praise from our customer, and our component of their application continues to hum away, extracting text from millions of PDF documents per month using PDFTextStream from Python
  • We’ve since worked with Steve here and there as necessary in order to make additional tweaks to JPype. Because of his help, we now distribute a supported version of PDFTextStream for Python, with the Python/Java integration made possible by JPype.
  • The JPype project retains the new/improved functionality that we paid for, and the broader community continues to benefit from that.
  • Steve got to pick up a new mac mini, plus whatever else he felt like buying with his hard-earned cash

That’s what I call a win-win situation, for us, for our customers, for Steve, and for the JPype project and its other users. In an ideal world, this is how open source and commercial software efforts should collaborate and cross-pollinate.

Software Development and…Pregnancy?

Wednesday, August 16th, 2006

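For nearly a year, we have been working on a number of things in parallel:

  • PDFTextStream v2.0, a major product release
  • The new snowtide.com website
  • PDFTextOnline, a website/AJAX app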

All three of these things are absurdly complex, and large, and represent a huge amount of work. And, like the geniuses we are, we decided, “Hey, let’s release all of them at once!”

Well, what doesn’t kill you makes you stronger, right? It turns out that this was probably a very bad idea…not because we sacrificed quality or cut corners to make deadlines or anything strictly taboo like that. It was a bad idea because sleep is a precious thing. :-)

I don’t have children (and, technically speaking, I’ll never have children, seeing as I’m of the male persuasion), so pardon me while I draw a very tenuous analogy between software development and pregnancy. We just had the software equivalent of triplets: one major product release, one website, and one website/AJAX app, all at once.

This is a good reminder that, 99% of the time, software (and really, business, for that matter), should be incremental. We know this, and have practiced it for a long time — but I can guarantee you, we understand it a lot better now that we’ve broken that rule.

That said, why don’t you say “hi” to our newborns: PDFTextStream v2.0, the new snowtide.com, and PDFTextOnline.

Automated Quality Control, Part II

Thursday, January 26th, 2006

In my last post about quality control, I detailed the challenges we face in testing PDFTextStream in order to minimize hard faults, and some of the patchwork testing ‘strategy’ that we employed in the early days. Now, I’d like to walk you through the specific design goals and technical solutions that went into building our current automated quality control environment.

For some months, most of the PDF documents we tested PDFTextStream against were retrieved from the Internet using a collection of scripts. These scripts controlled a very simple workflow:

[Figure: workflow.gif, a diagram of the workflow described in the steps below]

  1. Query for PDF URLs: In this step, a search engine (usually Yahoo, simply because I like its web services API) is queried for URLs that reference PDF documents that contain a search term or phrase.
  2. Download PDFs: All of the URLs retrieved from the search engine are downloaded.
  3. Test PDFs with PDFTextStream: PDFTextStream is then tested against each of the PDF documents that were successfully downloaded.
  4. Report failures, suspicious results: Any errors thrown by PDFTextStream are reported, along with any spurious log messages that might indicate a ‘soft failure’.
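The four steps above can be sketched as a single function. This is a minimal illustration, not the scripts described in this post: `run_test_cycle` and its `search`, `download`, and `extract` parameters are hypothetical names, the stub functions and example.com URLs are invented so the sketch runs without network access, and detection of ‘soft failures’ from log messages is left out.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def run_test_cycle(
    query: str,
    search: Callable[[str], Iterable[str]],
    download: Callable[[str], Optional[bytes]],
    extract: Callable[[bytes], None],
) -> List[Tuple[str, str]]:
    """Run one pass of the workflow and return (url, error) pairs."""
    failures: List[Tuple[str, str]] = []
    # Step 1: query for PDF URLs.
    for url in search(query):
        # Step 2: download; skip anything that could not be fetched.
        data = download(url)
        if data is None:
            continue
        # Step 3: test the extractor against the downloaded document.
        try:
            extract(data)
        except Exception as exc:
            # Step 4: record the failure so it can be reported later.
            failures.append((url, f"{type(exc).__name__}: {exc}"))
    return failures


# Hypothetical stand-ins for a search API, an HTTP client, and the extractor.
def fake_search(query: str) -> List[str]:
    return [f"http://example.com/{name}.pdf" for name in ("a", "b", "c")]


def fake_download(url: str) -> Optional[bytes]:
    # Pretend that c.pdf could not be downloaded.
    return None if url.endswith("c.pdf") else url.encode()


def fake_extract(data: bytes) -> None:
    # Pretend that b.pdf triggers an extraction error.
    if data.endswith(b"b.pdf"):
        raise ValueError("bad xref table")


failures = run_test_cycle("annual report", fake_search, fake_download, fake_extract)
```

Passing the three stages in as functions keeps the workflow itself independent of any particular search engine or PDF library.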

This approach is solid. It makes it possible to test PDFTextStream against random collections of PDF documents, thanks to the nature of search engine results. However, while the general approach is effective in principle, our implementation of it was unenviable for some time:

  • Being a collection of scripts, the process was manual, so testing runs happened only when someone was ‘at the helm’. This involved providing query strings for the search engine access phase, nursing the downloads in various ways, and then picking through the test results (failures weren’t ‘reported’ so much as they were spit out to a log file, which then had to be grepped through in order to find interesting nuggets).
  • Since the process was manual, it couldn’t scale. That’s obviously bad, and led to significant restrictions on the number of PDF documents that could be reasonably tested in a given period. Beyond that, it led to our test box(es) sitting idle much of the time.
  • Since failures (and ‘soft failures’) weren’t actually being reported or even recorded anywhere in any useful way, it was impossible (or really, really hard) to know what failures to concentrate on after the testing was finished. One always wants to focus on the bugs that are causing the most trouble, but we couldn’t readily tell which failures were most common, or even which of two different kinds of failures was more common than the other. This made prioritizing work very difficult, much like throwing darts blindly.

So, drawing from these lessons, we set out to design and build a quality control environment. To me, the emphasis on ‘environment’ here is shorthand for a number of qualities that the system resulting from this effort should exhibit:

  • Autonomy: Each component of the environment (usually called a node) should operate asynchronously, moving through the workflow presented earlier without any intervention, assistance, or monitoring, either from other systems or components or from people.
  • Scalability: Each node (and each group of nodes) should be able to saturate all of the resources available to it: CPU capacity, bandwidth, disk, etc. Our aim here is to maximize the number of PDF documents PDFTextStream can be tested against in a given period, so having any resources of any kind sitting idle is simply wasteful.
  • Auditability: At any moment, we should be able to know what every node in the environment is doing, what it’s going to do next, and what it’s done since its inception. Further, we should be able to generate reports on what kinds of faults PDFTextStream has thrown, on which PDF documents, which build of PDFTextStream was used in each test, etc. This makes it very simple to determine which errors should be focussed on, and which can be put on the back burner.
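One way to picture the auditability goal is a shared table of test results that can be queried for the most common faults. The sketch below uses Python’s built-in `sqlite3` module purely for illustration; the table layout, node names, build labels, fault names, and URLs are all invented, and nothing here describes how the actual environment stores its data.

```python
import sqlite3

# In-memory database standing in for a shared results store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_results ("
    " node TEXT, build TEXT, pdf_url TEXT, fault TEXT)"  # fault is NULL on success
)

# Invented sample rows: which node tested which document with which build.
rows = [
    ("node-1", "b100", "http://example.com/a.pdf", None),
    ("node-1", "b100", "http://example.com/b.pdf", "FontError"),
    ("node-2", "b100", "http://example.com/c.pdf", "XrefError"),
    ("node-2", "b101", "http://example.com/d.pdf", "FontError"),
    ("node-3", "b101", "http://example.com/e.pdf", "FontError"),
]
conn.executemany("INSERT INTO test_results VALUES (?, ?, ?, ?)", rows)

# Report: fault kinds ordered from most to least common.
fault_report = conn.execute(
    "SELECT fault, COUNT(*) AS n FROM test_results"
    " WHERE fault IS NOT NULL"
    " GROUP BY fault ORDER BY n DESC, fault"
).fetchall()

# Report: how many documents each build has been tested against.
tests_per_build = dict(
    conn.execute(
        "SELECT build, COUNT(*) FROM test_results GROUP BY build"
    ).fetchall()
)
```

With results recorded this way, deciding which failure to work on first becomes a query rather than a session of grepping through log files.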

Those familiar with multi-agent systems or distributed computation will recognize these design principles. That is not accidental: from the start, we recognized that in order to test PDFTextStream to the degree that we thought necessary, we would need to test it against millions of PDF documents. That simply was not going to happen with any kind of manual, or even scheduled, system (such as simply running those old scripts from cron). Between that requirement and the need to have multiple ‘nodes’ running simultaneously in order to utilize all of the resources we have available, it was a no-brainer to borrow some of the concepts that are taken for granted in the multi-agent systems field.

So, those are the design goals of our automated quality control environment, in broad strokes. It retains the fundamental workflow that was implemented long ago in that patchwork of scripts, but includes design principles that make the environment efficient, manageable, and effective in terms of pushing PDFTextStream to its limits.

Next time, I’ll discuss how we rolled out our quality control environment, and give some statistics on how it’s performed since we brought it online.

Automated Quality Control, Part I

Wednesday, January 18th, 2006

**Quality control is critical to the success of a business, and in turn, to the success of its customers as well. This is doubly true in the case of software businesses and products, where problems and defects are rarely obvious. In this first post in a series about Snowtide’s approach to quality control, I touch on some of the specific challenges we face in ensuring product quality.**

In PDFTextStream’s early days, before we were promoting it widely as a reliable, high-performance library suitable for enterprise applications, our quality control was weak. We had a few beta users, who stumbled across errors and PDFs that caused PDFTextStream to fail. We had a few scripts that harvested PDFs from various places (all of the PDFs linked from a particular webpage, for example), which we could then test against PDFTextStream. For some time in the early days, our testing and quality control ‘procedures’ were . . . weak.

Most software, especially that which is built for some constrained, well-defined purpose, can be tested very readily. Start it up, click a few buttons, browse through a few pages, run some test data, maybe do some stress testing if you have time. Does it work? Yes? Great, deploy it. Otherwise, fix the problem, and repeat.

However, because of the nature of PDF documents, the fact that PDFTextStream didn’t fail on PDF #1 didn’t mean that it wouldn’t fail on PDF #2. Further, just because PDFTextStream didn’t fail on PDF #10,000 didn’t mean it wouldn’t fail on PDF #10,001. Since there is no notion of validity when it comes to PDF documents, there is no way to say that PDFTextStream is Fully Tested™. This is completely different than the circumstances one might find when testing an XML parser, or a web server, or a web application, or an accounting utility, where the inputs and outputs are well-defined and specified.

Our circumstances are, however, very similar to those I would imagine exist when testing a new car, or a light bulb, or an iPod. There, you can only hope to minimize the likelihood of failure, since the nature of the product is such that it will fail eventually. Similarly, PDFTextStream (or any other PDF library or viewer) will never be able to claim that it will never emit an error, because the PDF document format inherently allows for a wide degree of variability.

Once you move past the notion of proving the lack of defects in a piece of software to accepting that the attainable goal is to minimize the likelihood of defects as much as possible, there is a very straightforward approach to quality control: test, test, test. Not to eliminate the possibility of faults, but to minimize their occurrence to a reliable, quantifiable percentage of the total tests performed.
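To make “a quantifiable percentage” concrete, here is a small sketch of the arithmetic. The counts are invented for illustration, the function names are hypothetical, and the upper bound assumes each tested PDF is an independent random draw, which search-engine results only approximate.

```python
def failure_rate(failures: int, total: int) -> float:
    """Observed fraction of tests that failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failures / total


def zero_failure_upper_bound(total: int, confidence: float = 0.95) -> float:
    """Upper bound on the true failure rate when all `total` tests passed.

    If the true rate is p, the chance of seeing no failures in n independent
    tests is (1 - p) ** n. Solving (1 - p) ** n = 1 - confidence for p gives
    the bound returned here.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / total)


# Invented numbers: 25 failures observed in 10,000 tests.
rate = failure_rate(25, 10_000)

# Even a clean run of 10,000 tests only bounds the failure rate; it does not
# prove the rate is zero. The bound comes out close to 3 / 10,000.
bound = zero_failure_upper_bound(10_000)
```

This is why passing 10,000 documents says nothing definite about document 10,001, yet still supports a useful statement about how often failures are likely to occur.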

So, it is with this focus that we embarked on building an automated quality control environment. Our plan was simple: build a system that will test PDFTextStream against astronomical numbers of PDF documents — orders of magnitude more than any customer would ever dream of processing — then see where PDFTextStream fails, and fix the problems.

Over the next few weeks, I’ll be posting the highlights from our experiences rolling out this automated QC environment, as well as some notable instances where failures discovered by this process have prompted significant improvements in PDFTextStream.

**Update: Read the second part of this series, [Automated Quality Control, Part II](/2006/01/26/automated-quality-control-part-ii)**