Hi There!

I'm Tomer Filiba, the author of several open source Python projects (most notably RPyC, Construct, and Plumbum). After years of scattering my stuff around the web (the Python Cookbook, sourceforge, google-code, etc.), I decided to "consolidate my portfolio" into a single website. This place is home to all of my projects and my personal Python-oriented blog... I hope you'll find it interesting.

Cartesian Tree-Product

May 02, 2013

I have to admit that my day-to-day life involves very little algorithmic problems, but here and there I get a chance to think. In this post, I'd like to discuss an interesting problem that I've met several times already in my programming career, each time in different settings. When I met it again last week, I decided it's time to formalize it a little. In lack of a better name, I call it "Cartesian Tree-Product" (not to be confused with Cartesian product of graphs), and here's how it goes:

Say you're given an expression tree. To limit the scope of the problem, we'll assume the tree is binary, its internal nodes can either be
AND or OR, and its leaves hold "atomic comparators" (which are of no interest to us). Here's an example expression:

(x=5 OR y=6) AND z=7

Which is represented by the following expression tree:

    AND
    /  \
   OR   \ 
  /  \   \
x=5  y=6  z=7

Now the problem is to produce all different sub-trees such that:

In other words, we want to produce all partial expressions of the original expression, which will satisfy it and which can together reconstruct it. Big words aside, here's what we want for the expression above:

x=5 AND z=7
y=6 AND z=7

Each expression here satisfies the original one (try it), and if we OR the two, we get a mathematically-equivalent tree:

(x=5 AND z=7) OR (y=6 AND z=7)


       OR
      /  \
     /    \
  AND      AND
 /   \    /   \
x=5  z=7 y=6  z=7

(To clarify: By sub-tree or partial expression, I mean it is constructed only from the leaves of the original tree/expression)

If you take a second look, it resembles Cartesian products where some "joints" (nodes) in the tree duplicate the resulting tree. Intuitively, we "split" the tree for every internal OR and continue with both copies. For instance, if we take (x=5 OR y=6) AND (z=7 OR w=8), we'll get 4 sub-trees

x=5 AND z=7
x=5 AND w=8
y=6 AND z=7
y=6 AND w=8

However, it depends on the structure of the tree; (x=5 OR y=6 OR z=7) AND w=8 produces 3 sub-trees. This closely relates to the rules of distributivity in propositional logic, but it seems to me that the "Cartesian product" notion is a generalization of the concept.

How is it Useful?

Since we're dealing with expression trees, it's hardly surprising that the two times I had to use this algorithm related to syntax. In the first case, I wrote a fuzz-testing tool for an interactive program, like the MySQL shell. The program accepted commands, conforming to a well-defined syntax, and I wanted to generate commands at random and see that it didn't crash.

For instance, a command might look like map-lun <vol-name|vol-id> lun-id and we'll want to try both variants, i.e., map-lun vol-name lun-id and map-lun vol-id lun-id. Of course the syntax is generally much more complex, with nested brackets, optional arguments, etc. It gets interesting, but we can still map it to the problem outlined above.

Another real-life use case is running queries against a huge dataset. In order not to complicate our query engine (written in C for performance), it can perform only intersections (ANDs) of filters. If you want to query for more complex conditions, you have to run it multiple times with the partial queries and "sum up" the results. But we don't want the end user "doing the math" for us, and waiting for one query to finish before we start the next means we have to load data from the store multiple times. If we could process it in chunks, we'd benefit from cache locality and shorten query times.

The Algorithm

The code is strikingly short, but that's not to mean it's easy to follow. The heart of it is two, recursively-nested for-loops:

def cartesian_tree_product(node):
    if not isinstance(node, tuple):
        yield node
        return
    
    lhs, op, rhs = node
    for l in set(cartesian_tree_product(lhs)):
        for r in set(cartesian_tree_product(rhs)):
            if op == "|":
                yield l
                yield r
            else:
                yield (l, op, r)

Let's try it out on the expression (x=5 OR y=6) AND (z=7 OR w=8 OR q=9) AND r=10:

>>> exp = ((("x=5", "|", "y=6"), "&", (("z=7", "|", "w=8"), "|", "q=9")),
...         "&", "r=10")
>>>
>>> for v in set(cartesian_tree_product(exp)):
...     print v
...
(('x=5', '&', 'z=7'), '&', 'r=10')
(('y=6', '&', 'w=8'), '&', 'r=10')
(('y=6', '&', 'z=7'), '&', 'r=10')
(('x=5', '&', 'w=8'), '&', 'r=10')
(('y=6', '&', 'q=9'), '&', 'r=10')
(('x=5', '&', 'q=9'), '&', 'r=10')

Does the trick.

Trying to estimate the complexity of this beast may be futile, but it clearly seems to be doing "more work" than a mere SAT problem: it doesn't just find one satisfying assignment, it looks for all satisfying assignments! This ought to put it in the NP-hard class. In fact, if we generate a binary expression of alternating ANDs and ORs, it can get much worse than exponential complexity!

Here's a little function that generates an interleaved binary expression:

counter = itertools.count()
def mkexp(n):
    if n == 0:
        return "x%d" % (counter.next(),)
    return (mkexp(n-1), "&" if n % 2 == 0 else "|", mkexp(n-1))

E.g.,

>>> mkexp(3)
((('x0', '|', 'x1'), '&', ('x2', '|', 'x3')), '|', (('x4', '|', 'x5'), 
   '&', ('x6', '|', 'x7')))

Now

>>> for i in range(1, 10):
...     variants = set(cartesian_tree_product(mkexp(i)))
...     print i, len(variants)
...
1 2
2 4
3 8
4 64
5 128
6 16384
^C

Notice how each increment either doubles or squares the number of results... that's because ANDs and ORs are interleaved (ANDs double, ORs square). Seems more like a power tower to me.

Extension: the Inverse Problem

The inverse problem is also useful. In the inverse problem we're given a set of expressions and we're trying to generate the most "compact form", i.e., "undo the effects" of the distributivity law.

I once wrote a test harness where each test specified its prerequisites declaratively. For instance, a test might need to run after the system had come up from an emergency shutdown, so the framework would bring the system to the required state and then then run the test. Obviously, it may take a while to bring the system to a certain state. It could range from minutes to days. And we have hundreds of tests!

In this case, we are given a list of tests and we want to find the most efficient order to run them, meaning, we want to minimize the setup and teardown times when moving between different system states.

class FooTest:
    REQUIRES = [A, B, C]
    
    def test(self):
        # ...

class BarTest:
    REQUIRES = [A, B, D]

    def test(self):
        # ...

class SpamTest:
    REQUIRES = [A, E]

    def test(self):
        # ...

In the example above, we can first bring the system to states A and B (e.g., "running over 10 hours" and "having less than 1 TB of free space"), then setup C, run FooTest, teardown C and setup state D, run BarTest, teardown B, setup E, and run SpamTest.

In essence, we want to take the requirement lists from each test and reconstruct the most compact tree that represents them, then we follow that tree in BFS order and reduce the overall time.

    A
   / \
  B   \
 / \   \
C   D   E

It took me a while to realize it's basically the inverse of the original problem. It is much simpler, in terms of complexity, so perhaps it's not strictly the inverse, but the two clearly work in "opposite directions".

Anyway, it's funny how I met this problem from different angles three times already. Just thought I'd share.

View Comments


A Survey of Construct 3

January 07, 2013

I'm working on Construct 3 again and I'm exploring lots of new ideas. I wanted to share these ideas at this early stage to get feedback on them from users, to keep the project on track. This survey starts a bit slow (as I'm not counting on users being familiar with Construct) but it dives into code right away.

You can leave your feedback in the Disqus comments below, or join the new discussion group dedicated to Construct (both 2 and 3). See Also: Construct 2, Pickler Combinators

Introduction

Construct is a binary packing combinators library for Python in which you can define rich data structures. Unlike most alternatives, these data structures can be used for both packing and unpacking of binary data; for instance, once you define what a TCP packet is, you can analyze packets or construct ones on your own, with no additional code.

TL;DR box

We begin the discussion with the atomic constructs: bytes, integers, floats, etc. With these, we build composite packers (Sequence, Array, and Struct), which can also be created using some syntactic sugars, and discuss the changes from Construct 2.
Next, we cover how Construct handles data (the stream of units approach) and how this helps us when working with multiple levels of data granularity (bits and bytes). We also introduce adapters, which transform object representations for packing and unpacking, and macros, which enable us to easily reuse existing constructs.
Then we cover the context and the this object, which allow us to express dependencies within data structures. From there we move to code generation, a key feature of Construct 3: to improve performance, constructs would be compiled to imperative Python code (even Cython, one day). We finish with a semi-formal proof that Construct is more powerful than context-free languages, making it probably the most powerful parser in existence!

Basics

Packers are objects that expose the two methods pack(obj) and unpack(data). Intuitively, pack takes an object suitable with that packer and returns a binary representation of it; unpack is the inverse operation, taking a binary representation and returning a Python object. Here's the most fundamental example:

>>> from construct3 import byte
>>> byte.pack(127)
'\x7f'
>>> byte.unpack('\x7f')
127

There's more than mere byte, of course: the numeric family consists of int(8|16|24|32|64)(s|u)(b|l) (e.g., int32ul) and float(32|64)(b|l), where s = signed, u = unsigned, b = big endian and l = little endian, but we will overlook those for now. By the way, byte is an alias for int8u.

These can be seen as atoms and Construct is a library of combinators: it gains it's power from combining simpler elements into more complex structures. The simplest combinator is Sequence, which we'll explore by defining an IPv4 address as a sequence of 4 bytes:

>>> from construct3 import Sequence
>>> ipaddr = Sequence(byte, byte, byte, byte)
>>> ipaddr
Sequence(int8u, int8u, int8u, int8u)
>>> ipaddr.unpack('\x7f\x00\x00\x01')
[127, 0, 0, 1]
>>> ipaddr.pack([192, 168, 2, 1])
'\xc0\xa8\x02\x01'

Naturally, we can created nested sequences (not that it makes sense right now, but it's important to note):

>>> Sequence(Sequence(byte, byte), byte, byte).unpack("ABCD")
[[65, 66], 67, 68]

Since combining packers is our bread-and-butter, let's introduce a syntactic sugar -- the bind operator (>>). This operator takes two packers and returns a sequence thereof. Here's how it looks:

>>> ipaddr = byte >> byte >> byte >> byte
>>> ipaddr
Sequence(int8u, int8u, int8u, int8u)

Sequences can be heterogeneous (consisting of several kinds of packers, e.g., Sequence(float64b, int16ul)); however, when the data we're dealing with is homogeneous, we can use Arrays instead. Following along the lines of the previous example, we can define ipaddr as an array of 4 bytes:

>>> from construct3 import Array
>>> ipaddr = Array(4, byte)
>>> ipaddr
Range(4, 4, int8u)                     # Range? We'll get to that later
>>> ipaddr.unpack("\x7f\x00\x00\x01")
[127, 0, 0, 1]

But as arrays themselves are pretty common, you can create arrays using the subscript notation ([]), as follows:

>>> ipaddr = byte[4]
>>> ipaddr
Range(4, 4, int8u)

Isn't it cool?

More Elaborate Structures

So far we've only worked with data in the form of lists. However, many times (and especially when nesting is involved), we would like to give names to the subcomponents that make up our data structure. Enter Struct. Named after the C struct statement, Struct takes pairs of (name, packer) and returns a composite packer.

>>> from construct3 import Struct
>>> ipaddr = Struct(('a', byte), ('b', byte), ('c', byte), ('d', byte))
>>> x = ipaddr.unpack('\xc0\xa8\x02\x01')
>>> x
Container:
  a = 192
  b = 168
  c = 2
  d = 1
>>> x.a
192
>>> x["a"]
192

Note

In Construct 2, all constructs took a name parameter. While this approach works great for Structs, it doesn't make much sense for Sequences, Arrays, and virtually all other constructs. Moreover, it was the cause for the the notorious Rename construct.

One of the most important cleanups of Construct 3 is dropping the name from packers and moving it to where it belongs - Struct.

Notice that unpacking a Struct breaks down the data into a Container object, which is simply a convenience-wrapper around good-old dict. Likewise, given a dict-like object, you can pack it back into bytes:

>>> ipaddr.pack({"a" : 192, "b" : 168, "c" : 2, "d" : 1})
'\xc0\xa8\x02\x01'

Structures can soon grow large and have many nested structures within them. Using pairs of ("name", packer) quickly becomes a burden and your code starts looking like LISP:

Struct(
    ("foo", byte),
    ("bar", Struct(
        ("spam", int16ul),
        ("bacon", int64sb),
    )),
    ("viking", int32sl),
)

In order to avoid carrying your father's parentheses with you, Construct provides yet another syntactic sugar: the slash (/) operator. This operator is used as "name" / packer and simply returns ("name", packer). Let's revise the previous code snippet:

Struct(
    "foo" / byte,
    "bar" / Struct(
        "spam" / int16ul,
        "bacon" / int64sb,
    ),
    "viking" / int32sl,
)

I wish it were possible to override = or :, but sadly this isn't the case. However, I feel / is "good enough" a choice, plus it binds more tightly than most operators. Remember the bind operator (>>)? It can be used just the same here, making one-line Structs quick and easy:

>>> ipaddr = 'a' / byte >> 'b' / byte >> 'c' / byte >> 'd' / byte
>>> ipaddr
Struct(('a', int8u), ('b', int8u), ('c', int8u), ('d', int8u))

Note

The "inline style" is appropriate for small Structs and Sequences (2-4 members). When dealing with larger structures, use the "multiline" version instead. It's more readable and scalable.

Also note that these are all but syntactic sugars: If you don't like their looks, you can always use the expanded form.

Bytes and Bits and Units

Bytes are easy to work with, but protocols and file formats often talk at different levels of granularity, switching between bits and bytes. For instance, here's the SCSI CDB of READ6:

The LUN component is 3 bits long and the LBA component is 21 bits long... how can we handle this? Before we get to that, it's important to understand a little of how things work under the hood - specifically, the stream of units approach. Internally, Construct operates on a stream of arbitrary units, which normally happen to be bytes. When needed, this stream can be replaced (or wrapped) to provide different units, e.g., bits. Here's an example:

read6 = Struct(
    "opcode" / byte,
    "address" / Bitwise(Struct(
        "lun" / Bits(3),
        "lba" / Bits(21),
    )),
    "transfer_length" / byte,
    "control" / byte,
)

Note how we switch between bytes and bits: opcode is a byte, followed by address which operates on bits. The Bitwise packer replaces the underlying byte-stream with a bit-stream, so the contained Struct now operates on bits. The Struct itself nows nothing of it, and it's only required that the underlying packers would be able to make sense of it. For instance, you can place an int32ul inside a Bitwise packer, but the result would meaningless: it will read four bits and treat them as bytes, interpreting 0b1001 as 0x01000001.

For this reason we have the Bits packer, which reads that many bits and converts them to an integer (base 2); for convenience, Construct provides bit (a single bit), nibble (four bits) and octet (eight bits) as well.

In theory, Construct could be operate on various other units (e.g., Unicode characters), but practice shows byte- and bit-streams are the most useful ones. Some exceptions are the processing of compressed or encoded data, but these are beyond the scope of this survey.

Powering Up: Adapters

So far we've only looked at data in its raw form (e.g., [127, 0, 0, 1]), but it is normally desirable to transform its representation into one that is easier to work with. For instance, we may prefer 127.0.0.1 to a list of numbers. Enter adapters. While at first it may seem confusing, the distinction between adapters and packers is quite clear: packers work at the stream level while adapters work at the object level; this lets you add power without interfering with the low-level machinery.

Here's an example:

class IpAdapter(Adapter):
    def decode(self, arr, ctx):              # called by unpack()
        return  ".".join(map(str, arr))      # converts [x, y, z, w] to 'x.y.z.w'
    
    def encode(self, ipstr, ctx):            # called by pack()
        return map(int, ipstr.split("."))    # converts 'x.y.z.w' to [x, y, z, w]
>>> ipaddr = IpAdapter(byte[4])
>>> ipaddr.unpack('\xc0\xa8\x02\x01')
'192.168.2.1'
>>> ipaddr.pack('127.0.0.1')
'\x7f\x00\x00\x01'

When we only have a single use for an adapter (and it's simple enough), we can even go one-liner here:

ipaddr = IpAdapter(byte[4], 
    decode = lambda arr, ctx: ".".join(map(str, arr)),
    encode = lambda ipstr, ctx: map(int, ipstr.split("."))
)

At this point adapters might seem quite trivial, but they can do much more than this. For instance, the integer packers we've used so far are actually adapters that transform bytes into numbers. Other examples include inserting computed values into the result, encoding/decoding strings, validating input, etc. Essentially, adapters can transform objects in any way you wish prior to packing/unpacking.

Don't Repeat Yourself: Macros

Many times you find yourself in need of a recurring pattern. You could write a full-blown packer/adapter from scratch, but your best option is to reuse existing building blocks. Construct attempts to define the most general packers and special-case for them common cases. One such example is Array, which we've met before: Construct actually defines the more general Range packer (which accepts minimum and maximum counts). On top of this, Array is a simple "macro" that expands to a Range with the same minimum and maximum.

def Array(count, itempkr):
    return Range(count, count, itempkr)

Macros can be more complex, of course. For instance, a recurring pattern is to use a Struct inside a Bitwise packer; let's fuse the two into BitStruct:

def BitStruct(*members):
    return Bitwise(Struct(*members))

Macros have many more uses; you can explore the implementation of Construct to see some examples, and you're encouraged to write ones on your own. Remember: less code is great success. As we'll see later on, using macros (rather than writing your own packers) can even lead to better performance.

Putting things in Context

Up until now, we've only seen simple (static) data structures, ones that could just as well be expressed using the built-in struct module. The key-feature of Construct is its ability to express dependencies within data structures. One common dependency is that of length-value relations, where a number specifies how many elements follow it. For instance, strings in Pascal were prefixed by a length byte, e.g., "\x05hello"... how do we express that relation?

Generally speaking, we may require access to things we've previously encountered (e.g., the history); for this reason, both packing and unpacking carry a context dictionary with them. This dictionary is mostly maintained by composite packers such as Struct and Sequence, but any packer along the way can both modify and access it, making decisions based on the history. Here's an example:

>>> pstring = Struct(
...     "length" / byte,
...     "value" / Raw(lambda ctx: ctx["length"])
... )
>>> pstring.unpack("\x05hello")
Container:
  length = 5
  value = 'hello'
>>> pstring.pack({"length" : 9, "value" : "wikipedia"})
'\x09wikipedia'

Virtually any length/count parameter that is passed to one of the built-in construct can be either a number or a function that takes a context dict and computes a value. In this case, Raw(x) (which reads x units from the stream) will read length units; the value of length, of course, is determined by the previously seen element, whose name was "length". It is also important to note that this dependency is preserved in packing, e.g.

>>> pstring.pack({"length" : 9, "value" : "hello"})
Traceback (most recent call last):
  ...
construct3.packers.RawError: Expected buffer of length 9, got 5

Instead of writing lambda functions everywhere, Construct 3 introduces the this object. It's a special object that builds a contextual expression (e.g., a function taking ctx) in a straight-forward and readable manner. For instance,

>>> from construct3 import this
>>> this.x
this.x
>>> this.x * 2
(this.x * 2)
>>> (this.x * 2)({"x":5})
10
>>> (lambda ctx: ctx["x"] * 2)({"x":5})      # equivalent lambda function
10

Let's revise pstring:

>>> pstring = Struct(
...     "length" / byte,
...     "value" / Raw(this.length)
... )

We can also use a Sequence instead of a Struct here, yielding this poetically-beautiful piece of code:

>>> pstring = byte >> Raw(this[0])
>>> pstring.unpack("\x05helloXXX")
[5, 'hello']

Using a simple adapter, we can make our lives easier when working with length-value encoded data:

class LV(Adapter):
    def encode(self, obj, ctx):
        return (len(obj), obj)     # compute len(obj) for us
    def decode(self, obj, ctx):
        return obj[1]              # discard the length field
>>> pstring = LV(byte >> Raw(this[0]))
>>> pstring.unpack("\x05hello")
'hello'
>>> pstring.pack("hello")
'\x05hello'

If you need access to elements outside of the current scope (Struct or Sequence), you can use the parent context, named underscore (_). For instance, use this._.x to go up one level, or this._._._.y to go up three. Consider the following example, in which the length of the data is given in two parts (in different scopes), and we wish to read len + gth bytes:

>>> nested = Struct(
...     "len" / byte,
...     "foo" / Struct(
...         "gth" / byte,
...         "data" / Raw(this._.len + this.gth)
...     )
... )
>>> nested.unpack("\x03\x02hello")
Container:
  len = 3
  foo = Container:
    gth = 2
    data = 'hello'

Compilation

Note

This is a work in progess; Construct 3.0 would probably come out with a very basic compiler that will be improved over time.

One of the highlights about Construct is defining your data structures directly in Python. In fact, Construct is an in-langaguge DSL in the form of packing combinators: instead of expressing your data structures in XML or some proprietary language, you just write them (and run them) as any other piece of code.

We used to have psyco, which was capable of speeding up Construct 2 by a tenfold, but it's been dead for the past four years. I first had plans to compile data structures to C/C++ (which would have made Construct NASA-grade material :)), but I soon realized that its quite an impossible feat (due to the fact Adapters are Turing-complete).

On the other hand, I now realized I can compile Constructs to Python! The compiler could inspect the whole data structure in advance and generate optimized code, eliminating the context, etc. Whenever a convertion is not possible, it would just fall back to the current, interpretted scheme. The compile is already capable of compiling this Struct("len" / byte, "gth" / byte, "data" / Raw(this.len + this.gth)) into this:

def test_unpack(stream):
    var0 = {}
    var1, = _struct.unpack('B', stream.read(1))
    var0['len'] = var1
    var2, = _struct.unpack('B', stream.read(1))
    var0['gth'] = var2
    var3 = stream.read((var1 + var2))
    var0['data'] = var3
    return var0

Notice the stack depth remains relatively constant, unlike the way nested packers work today. As the compiler improves, it could translate byte[4] to Raw(4), to speed up things. Another option is to generate Cython code with type annotations, but that would take some time. This is yet another reason to favor the use of "macros" over implemeting constructs from stratch: if you rely on the built-in ones, it's more likely that the compiler would generate optimized code.

Computational Power

Here's a semi-formal proof that Construct is stronger than Context Free languages (as well as mildly context-sensitive ones), which probably makes it the most powerful (although not the most efficient) parser:

>>> def Match(symbol):
...     return OneOf(Raw(len(symbol)), [symbol])
...
>>> anbncn = byte >> Match("a")[this[0]] >> Match("b")[this[0]] >> Match("c")[this[0]]
>>> anbncn.unpack("\x04aaaabbbbcccc")
[4, ['a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b'], ['c', 'c', 'c', 'c']]
>>> anbncn.unpack("\x04aaaabbbbbcccc")
Traceback (most recent call last):
  ...
construct3.packers.RangeError: Expected 4 items, found 0
Underlying exception: ValidationError("'b' must be in ['c']",)

Here's a recognizer for the language , which is not context free (assuming n is given in unary representation, it requires the recognition of ). We can easily extend this to to break out of mildly context-sensitive languages, and use While(this[-1] == '1', Raw(1)) instead of the first byte, so n won't be bounded from above.

Feedback

You can leave comments right here on this page, or join Construct's new discussion group. Thanks!

View Comments


New Experiences

December 15, 2012

Okay, I've been slacking off and I feel I've got some explaining to do... Allow me to start by admitting that I lied. If you remember, I said I wanted an easy life, but then the opportunity came and I knew I had to take it: I've joined two friends to co-found Touchbase. Yes, I said it won't happen to me, but heck, it did.

At Touchbase, we plan to revolutionize the calendar - this outdated table that you use every day to manage your time. You see, it turns out that even though hundreds of millions of people world-wide run their lives according to this naive table, it hasn't really changed much in the last couple of centuries: it's just a passive, linear representation of time, into which you insert events.

We believe people spend way too much time managing their time. In other words, you work for your calendar instead of it working for you. Think of how much time people spend coordinating meetings... If both you and the person you wish to meet with work at the same place (or share calendars), you can normally see each other's free/busy times. This makes finding a suitable time for you two a bit easier, but as you don't see event details, you can't make informed decisions.

For instance, your friend might have an out-of-town meeting from 9am to 11am, so picking a time slot right at 11am isn't a good idea. Or, she might be on a business trip; she won't block her whole day as she still wants people to book with her at her destination, but how would you know that? And the other way around - suppose you and this Tech guru will both be in New York next week, but neither one of you knows about the other being there. So instead of grabbing a coffee at a Starbucks next week, you'll have to take a flight to California three weeks from now.

And we've only talked about one-on-one meetings. I suppose you know how frustrating is coordinating a meeting of five people (even in the same office), or handling the reschedules/ counters that follow it. We aim high, both technologically (and algorithmically) and product-wise, but we're starting out with more modest go-to-market strategies.

Lessons on Web Programming

In case you've been reading my blog, you probably know by now that I'm no fan of web programming. I always feel it's a conglomerate of unrelated or inferior technologies, hastefully stacked one on the other. That's not to say that people don't do amazing stuff on the web, but the foundations of it all are shaky.

Part of our job at Touchbase it to handle large amounts of user data, which we obtain from third- party providers. It works 98% percent of the time, but every once in a while we get timeouts or malformed data, which aborts the user's request. In case the user's data is somehow malformed (e.g., an expected field is missing in one record), subsequent retries would fail just the same, leading to user frustration. At some point it came to me that web programming is actually a stochastic process, not a deterministic one like most software development we're used to. We work with big numbers here, where the occasional anomaly should just be ignored. Many "best practices" simply don't apply here and one has to resort to wishful thinking. In other words, do whatever you can, ignore errors and learn to live with partial data.

I brought this realization to the mighty @FAKEGRIMLOCK, and he explained:

I'm an enlightened person now.

In Other News

I just released Plumbum v1.1 today, adding Paramiko integration and Subcommand support. As usual, the changelog holds the full details. I plan to write a short tutorial on subcommands soon, so stay tuned.

View Comments




Older Posts »»»