Whoa, Peter Norvig used some of my code!

February 26th, 2009

I’m generally not one to be impressed by celebrity (you won’t catch me reading People or US Weekly, for example).  However, this morning I noticed with a shimmer of glee that Peter Norvig used some code that I wrote years ago in one of his recent projects.  So, just for the record, if Dr. Norvig ever shows up in US Weekly, I’ll pick one up!

In case you don’t know, Peter Norvig is the Director of Research at Google.  That’s interesting, but the real reason Dr. Norvig holds sway with me is his classic book, Paradigms of Artificial Intelligence Programming.  If it weren’t for that book, I almost certainly would not be doing what I’m doing today.  Its pages are where I came to understand lisps, and began to imagine what was possible and what I might be able to accomplish in computer science (final results yet to be determined, of course).  For that, I am extraordinarily grateful to him (and others, of course, but I’ll wait to talk about them when they get around to using some of my code! ;-) ).

Back to the story.  This morning, I decided to hop onto Google Analytics for a bit to check up on the traffic stats for our various websites.  Lo and behold, in the “top referrals” listing, I saw ‘norvig.com’; “Well,” I thought to myself, “that’s interesting!”   A quick grep of the server logs (is there a screen in Google Analytics that actually provides you with the full referral URLs?) showed the referral URL to be Dr. Norvig’s “post” from last week, An Exercise in Species Barcoding.

A search of my name on that page shows that he needed a way to calculate the Levenshtein distance (also known as the edit distance) between two large strings — his quick implementation (like most) operated in O(n^2) space, which would have required weeks of processing time in his particular case.  So, he looked around for a more efficient implementation, and found one that I wrote in October of 2003 that operated in linear space bounds (and was, ironically enough, my first-ever contribution to an open source project).  With a couple of tweaks to suit his specific needs, the code I wrote worked out nicely for him.

This story is satisfying and funny (for me, anyway) in a couple of different ways:

First, there’s the fact that (what I would now consider) throwaway work of mine is still floating around the nets six years later.  Remember kids, the Internet never forgets!

Second, it reminded me of what I was doing when I wrote that particular code.  I was building what would later become PDFTextStream’s first ground-truthing system1 (although I don’t think I knew of that term at the time). It’s a lot more sophisticated now, but back in 2003, I was simply trying to set up a “ground truthing” system where the full (vetted and known-good) extracted text from each PDF document in our nascent test repository would be saved off somewhere, and later builds of PDFTextStream would compare its extracted PDF text to those saved files.

Of course, it wouldn’t be practical to require that PDFTextStream produce identical output forever.  Some amount of slop had to be allowable, because (for example) if an extracted word was outputted with four spaces before it instead of two, that would generally be acceptable.  For that and other reasons, I wanted to test that current PDF text extracts were the same as the known-good extracts within a defined margin of error.  Unfortunately, I was ground-truthing full document extracts at that time, and most Levenshtein functions, with their quadratic space requirements, would take far too much memory to diff the multi-megabyte strings that were involved.

Solution: write my own Levenshtein function (loosely based off of a pedagogical implementation by Mike Gilleland that had been incorporated into the Apache commons-lang project) that operated in linear space bounds.  Thankfully, I opted to offer the improvement back to the Apache commons-lang project and to Dr. Gilleland; had I not, Dr. Norvig would never have found that code, and I wouldn’t be writing this right now.
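
For the curious, here is a minimal sketch of the general idea, written in Python for brevity.  It is not the code that went into commons-lang (that was Java), just an illustration of the standard two-row trick: each row of the edit-distance table depends only on the row before it, so you can throw away everything else.  The within_margin helper and its max_error_rate threshold are hypothetical names I’m making up here to show how a “same within a margin of error” check might look; they are not PDFTextStream’s actual API.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the table."""
    # Put the shorter string along the row so each row is as small as possible.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty string and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        # current[0] = distance between a[:i] and the empty string
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current  # the older row is no longer needed
    return previous[-1]


def within_margin(expected: str, actual: str, max_error_rate: float = 0.01) -> bool:
    """Hypothetical ground-truth check: is the edit distance, relative to
    the longer of the two strings, no more than max_error_rate?"""
    longest = max(len(expected), len(actual))
    if longest == 0:
        return True  # two empty extracts are trivially identical
    return levenshtein(expected, actual) / longest <= max_error_rate
```

Note that this only fixes the memory problem: the running time is still proportional to the product of the two string lengths, so multi-megabyte comparisons remain slow, they just no longer need a table with trillions of cells.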

Third and finally, this story is satisfying because, hell, Peter Norvig used some of my code.  A person I respect and admire has found it convenient to use some minor thing I created years ago, and was thoughtful enough to say so.  I hope I can follow that example as I go along in my travels.

See, Dr. Norvig, I’m still learning from you.

Footnotes:

1 Ground truthing is a testing methodology often used in document processing systems where ideal or otherwise known-good output is cataloged, and then actual or current output is compared to it to determine relative accuracy.  PDFTextStream’s current ground-truthing system serves as a semi-rigorous smoke test of its aggregate text extraction accuracy while we’re doing active development, as well as an ironclad regression test for when we’re looking to cut a release.  Thankfully, it’s come a long, long way from the very naive approach I was pursuing in 2003.

RIA Platform Judo: Install Java/JavaFX using Flash?

February 3rd, 2009

I hadn’t tried TweetDeck yet, and thought I’d give it a run.  It requires Adobe AIR, and I thought I’d end up having to do the download/install dance.  But lo-and-behold, the TweetDeck “Install Now” Flash button bootstraps a local install of AIR for me!  No mess, no fuss, and the whole thing took less than two minutes and three clicks.

That’s cool and all, but we’re (mostly) JVM partisans here, so the inevitable question is: why doesn’t Sun use the same mechanism to drive deployment of the latest and greatest JRE/JavaFX?  Flash functionally has 100% penetration, so it makes tons of sense to use it to make it trivial to get the JRE out there, at which point Java’s “native” update functionality can take over.

I’m not a RIA guru by any stretch, so maybe there’s a good reason why this isn’t done?

Paul Graham’s Y Combinator leaves Boston, entrepreneurs dive under the bed

January 26th, 2009

Last Friday, Paul Graham announced that his Y Combinator incubator was leaving Boston for Silicon Valley, prompted by the impending birth of his first child.  He didn’t lose any sleep over it though, and made his thoughts on Boston vs. Silicon Valley clear (yet again) in that announcement:

Boston just doesn’t have the startup culture that the Valley does. It has more startup culture than anywhere else, but the gap between number 1 and number 2 is huge; nothing makes that clearer than alternating between them.

This has been picked up in a variety of areas by Boston-local entrepreneurs and those that watch that space.  The reaction has been predictable, if you know just how much of an inferiority complex people have vis-à-vis Silicon Valley (emphasis below mine):

Scott Kirsner:

That’s not just his opinion… it’s reality… and we ought to be addressing it head-on.

Buzz in the HUB:

Paul Graham has long been a critic of the Boston Venture community and their reluctance to invest in his crops of nascent startups. He has now given in to the fact that the Valley is a better place for (web) startups.

Robert Buderi:

…the full-time departure of Graham and Y Combinator is a real loss for the New England innovation community

The most grating reaction I saw was this anonymous comment left on Scott Kirsner’s post:

For people like myself who are on the cusp of creating something new, Boston’s conservative culture translates directly to greater risk and scarcer opportunity. The question always boiles down to: if you have a family, how do you take the plunge, even if your business intention is indeed groundbreaking? Is the risk too great in this town?? And if it is, doesn’t that mean a move to the Valey must be considered.

At this point, I should remind my two readers that I chose to start a software company in Northampton, a town in Western Massachusetts, a solid 80 miles from Boston, thankyouverymuch.  This fact always disorients my Bostonian acquaintances, who generally have one of three reactions:

  • “Heh, so you’re out past the tumbleweeds?”
  • “What’s it like, living in the country?”
  • “Wait, you mean Northampton Street in Cambridge, right?”

I fully realize that that’s just one way in which I’m outside of the mainstream of the “Boston startup scene” (the other ways include the fact that Snowtide is 100% bootstrapped, and that we’ve been profitable for years — also shocking, I know).  That said, all of the bellyaching about Paul Graham choosing where to spend his summers seems vaguely ridiculous.  It feels very similar in tone to the discussion I witnessed at the end of the 2008 MassTLC Unconference last October, where dozens of people in a room of perhaps 300 of the brightest entrepreneurial minds in the Boston area expressed varying degrees of concern, despair, and panic about how the “Boston scene” has fallen so far behind Silicon Valley.

Now, maybe there are fewer VCs in the Boston area; maybe the “startup culture” is more vibrant (for some?1) in Silicon Valley; maybe Bostonians are technically and financially more conservative than their west-coast cousins.  I have no way of evaluating the truthiness of those assertions, having never spent any time in Silicon Valley.  However, it’s entirely reasonable to say that all of these assertions are wholly irrelevant to an entrepreneur’s objective: building a sustainable business2 (or organization, if your aim is to innovate in the non-profit space).

A business is built out of innovation in a particular market, combining expertise and insight and networking and execution.  Nowhere in that formula is a requirement that some guys from Sand Hill Road need to come down and drop $20M in your lap so you can staff up.  Nowhere in that formula is a requirement that you need to have your offices down the street from 20 other entrepreneurs3.  And in today’s (or yesterday’s!) global economy, nearly all of your customers will be hundreds or thousands of miles away.  Innovative technology startups thrive in places as diverse as Chicago, New York, Austin, Seattle, Liverpool, and yes, even in smaller towns like Champaign, IL and Northampton, MA.

If I were to make a geographical recommendation, I’d say two things:

  1. If there is some compelling reason why a particular city or locale would provide a significant advantage to your company, go there.  (e.g. you’re working on tide power, so it makes sense to be near the ocean, or your largest potential customer and first beta site is in Bismarck, so you should move in down the street)
  2. Otherwise, start your business where you’ll be happy and where you’ll find like-minded people.

If Silicon Valley is where you really want to be, godspeed; however, don’t move anywhere (or really, do anything) just because that’s where the crowd is or because some ostensibly-important person passes gas.  Innovation means thinking for yourself, doing what others think is strange, or foolish, or wrong, and being right in the long term.  Make sure you apply the same kind of independent thinking that will hopefully build your business over the coming years to where you choose to call “home” for that time.

Footnotes:

1 My impression of Silicon Valley “startup culture” isn’t particularly positive.  Seeing trendy ad-supported websites hit $15 billion paper valuations when they might not be trendy with next year’s incoming class doesn’t make me respect the culture that produces such organizations.  If that’s the sort of culture you’re interested in, fine, but please call it the lottery that it is.

2 If you’re an “entrepreneur” in the sense that your business is raising some VC, building buzz about a website, and selling out to Yahoo/Google/Microsoft/etc before you’ve banked a single cent of profit, I’ve nothing to say to you.

3 Although I will grant that such an environment does make for a more enjoyable after-work bar scene.  But, surely cocktail parties aren’t key ingredients in a technology startup?  If so, I’m going to become a carpenter.

Surprising Praise

November 19th, 2008

I happen to work in a particular corner of the software industry that isn’t exactly the most happenin’ party zone.  Compared to whatever is “hot” at any point in time, extracting data from documents seems dull to most.  I’m not deterred though – quite the contrary, being able to deliver products into contexts where we have a big positive impact on the well-being of our customers’ organizations makes for a level of satisfaction that (I suspect) outpaces the quick, fleeting high of popularity.

That said, one downside of this is that many of our customers realize so much benefit from working with us that they are often very reluctant to allow us to discuss our relationships with them in public.  So, far more often than not, we can have the satisfaction of realizing that big impact on a customer’s organization, but are barred from talking about it.  Given that, we are doubly appreciative of those that have worked with us to develop case studies on how they use PDFTextStream.

But last week, I discovered that someone we’ve worked with in the past – Neil Gandhi, who was heavily involved with Zinio’s deployment of PDFTextStream – had added a recommendation to my LinkedIn profile.  Now, most recommendations are very pleasant to begin with, but Neil’s was so effusive and unsolicited that I thought I’d share it here (oddly, LinkedIn doesn’t show recommendations on the “public” profile pages, but if you log in and view my profile, you’ll see Neil’s comments):

I worked with Chas during my time at Zinio LLC, a digital publication company that specializes in online and offline delivery of digital magazines. At the time, we were implementing global search functionality but our PDF text extraction solution was really sub-par. We found Chas at Snowtide and worked with him and his team in implementing PDFTextStream; their PDF text extraction solution. We were also testing against a slew of other vendors and open-source solutions to find the best product based on accuracy and service. You can find the case-study here: http://snowtide.com/cs-zinio

Needless to say, PDFTextStream was by far the most accurate solution, but to my surprise, Chas and his team provided the best service a small company like Zinio could ask for. I never had to wait more than half a day for a response and the questions and requests were always answered with a can-do attitude. If something couldn’t be done, Chas always had the time to explain why and also suggested (many times, better) solutions. He could talk the talk as both a CEO and as a developer and could switch back and forth when talking to my Director of Engineering, VP of Technology, and me and was more than competent at all levels. In the end we ended up purchasing the solution for all of our extraction servers, and I made a connection to someone I can always turn to when I need anything PDF related.

Pretty cool.  We’ll obviously be publishing more case studies as the opportunities arise, but it’s hard to beat comments as personal and unsolicited as those.  Thanks, Neil.