Moved

Moved. See https://slott56.github.io. All new content goes to the new site. This is a legacy site, and will likely be dropped five years after the last post in Jan 2023.

Thursday, September 29, 2011

The Politics of Estimating

Computerworld, September 12, page 10.

Microburst: IT Disasters
According to a study of 1,471 big IT projects, 15% turn out to be money pits, with cost overruns averaging 200%.

How is this a politically-charged statement?  We hear this kind of thing all the time.

As developers (or project leaders) we're failing to execute.

Right?

Hogwash.

An "overrun" is isomorphic to "badly justified" or "badly budgeted" or "oversold to executive sponsors".

An "overrun" can be a failure to use (or even permit) realistic estimates.  It may reflect an executive sponsor restating objectives to make the project large enough to justify it.  An overrun can mean anything.

Calling it an overrun is a way to label it as "failure to execute".

I prefer to call it a failure of vision (or whatever it is executive sponsors do).  It's more likely to be an underestimate than it is to be an overrun.

After all, how many times have we been told to reduce an estimate?  How many times have folks gotten their "attaboys" and "attagirls" for "sharpening their pencils" and reducing the proposal to the smallest amount that the customer would approve?

Tuesday, September 27, 2011

Threads and I/O

Threads don't promote concurrent I/O.

Kernel threads may.  Most of us write user threads.  Here's a great summary, quoted from Thread (Computer Science):
However, the use of blocking system calls in user threads (as opposed to kernel threads) or fibers can be problematic. If a user thread or a fiber performs a system call that blocks, the other user threads and fibers in the process are unable to run until the system call returns. A typical example of this problem is when performing I/O: most programs are written to perform I/O synchronously. When an I/O operation is initiated, a system call is made, and does not return until the I/O operation has been completed. In the intervening period, the entire process is "blocked" by the kernel and cannot run, which starves other user threads and fibers in the same process from executing.

The point is this.

If it involves I/O, multi-threading doesn't help.  Processes do.

If it involves computation, multi-threading may help.
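As a rough sketch of "processes do": the snippet below hands each I/O-bound job to its own OS process and then gathers the results.  The worker program, the file names and the `count_lines` helper are all made up for illustration; any independent read-and-report job would fit the same shape.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical worker: reads one file and reports its line count.
WORKER = "import sys; print(sum(1 for line in open(sys.argv[1])))"


def count_lines(paths):
    """Start one OS process per file, then gather the results."""
    workers = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER, str(path)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for path in paths
    ]
    # Every worker is already running at this point; each one blocks
    # on its own I/O without holding up the others.
    return [int(worker.communicate()[0]) for worker in workers]


# Two small sample files, created only for this demonstration.
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for name, size in [("a.txt", 3), ("b.txt", 5)]:
        path = Path(folder) / name
        path.write_text("row\n" * size)
        paths.append(path)
    line_counts = count_lines(paths)
```

The operating system schedules the workers independently, so a blocked read in one of them doesn't stall the rest.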

Thursday, September 22, 2011

"Strict" Unit Testing -- Everything In Isolation Is Too Much Work

Folks like to claim that unit testing absolutely requires each class be tested in isolation using mocks for all dependencies.  This is a noble aspiration, but doesn't work out perfectly well in Python.

First, "unit" is intentionally vague.  It could be a class, a function, a module or a package.  It's a "unit" of code.  Anything could be considered a "unit".

Second--and more important--extensive mocking isn't fully appropriate for Python programming.  Mocks are very helpful in statically-typed languages, where you must be fussy about making sure all of the interface definitions match.

In Python, duck typing allows a mock to be defined quite trivially.  A mock library isn't terribly helpful, since it doesn't reduce the code volume or complexity in any meaningful way.
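To show how little a hand-written mock takes, here's a minimal sketch.  The `Greeter` and `FixedClock` classes are hypothetical, invented only for this illustration.

```python
class Greeter:
    """Class under test; it only needs a collaborator with a now_hour() method."""

    def __init__(self, clock):
        self.clock = clock

    def greeting(self):
        return "Good morning" if self.clock.now_hour() < 12 else "Good afternoon"


class FixedClock:
    """Hand-written mock: no base class, no mock library, just the same method name."""

    def __init__(self, hour):
        self.hour = hour

    def now_hour(self):
        return self.hour


morning = Greeter(FixedClock(9)).greeting()
afternoon = Greeter(FixedClock(15)).greeting()
```

Duck typing means `FixedClock` needs nothing in common with the real clock class except the one method `Greeter` calls.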

Dependencies without Injection

The larger issue with trying to unit test in Python with mock objects is the impact of change.

We have some class with an interface.

class AppFeature:
    def app_method(self, another_object):
        ...  # uses another_object.another_method()

class AnotherClass:
    def another_method(self):
        ...  # the real implementation

We've properly used dependency injection to make AppFeature depend on an instance of AnotherClass.  This means that we're supposed to create a mock of AnotherClass to test AppFeature.

class MockAnotherClass:
    def another_method(self):
        ...  # canned response for the test

In Python, this mock isn't a best practice.  It can be helpful.  But adding a mock can also be confusing and misleading.

Refactoring Scenario

Consider the situation where we're refactoring and change the interface to AnotherClass.  We modify another_method to take an additional argument, for example.

How many mocks do we have?  How many need to be changed?  What happens when we miss one of the mocks and have the mysterious Isolated Test Failure?  

While we can use a naming convention and grep to locate the mocks, this can (and does) get murky when we've got a mock that replaces a complex cluster of objects with a simple Facade for testing purposes.  Now, we've got a mock that doesn't trivially replace the mocked class.
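The failure in this scenario can be sketched as follows.  The class names come from the example above; the method bodies and the `scale` argument are made up for illustration.

```python
class AnotherClass:
    def another_method(self, scale):  # refactored: new argument
        return 21 * scale


class AppFeature:
    def app_method(self, another_object):
        # The caller was updated to match the new interface.
        return another_object.another_method(2)


class MockAnotherClass:
    def another_method(self):  # missed: still the old signature
        return 42


# Against the real class, everything works.
real_result = AppFeature().app_method(AnotherClass())

# Against the stale mock, the "isolated" test blows up.
try:
    AppFeature().app_method(MockAnotherClass())
    isolated_failure = None
except TypeError as error:
    isolated_failure = error
```

The application code is correct here.  The only defect is in the test scaffolding, which is what makes this kind of failure confusing to track down.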

Alternative: Less Strict Mocking

In Python--and other duck typing languages--a less mock-heavy approach seems more productive.  The goal of testing every class in isolation surrounded by mocks needs to be relaxed.  A more helpful approach is to work up through the layers.
  1. Test the "low-level" classes--those with few or no dependencies--in isolation.  This is easy because they're already isolated by design.
  2. The classes which depend on these low-level classes can simply use the low-level classes without shame or embarrassment.  The low-level classes work.  Higher-level classes can depend on them.  It's okay.
  3. In some cases, mocks are required for particularly complex or difficult classes.  Nothing is wrong with mocks.  But fussy overuse of mocks does create additional work.
The benefits of this are:
  • The layered architecture is tested the way it's actually used.  The low-level classes are tested in isolation as well as being tested in conjunction with the classes that depend on them.
  • It's easier to refactor.  The design changes aren't propagated into mocks.
  • Layer boundaries can be more strictly enforced.  Circularities are exposed in a more useful way through the dependencies and layered testing.
We still need to work out proper dependency injection.  If we try to mock every dependency, we are forced to confront every dependency in glorious detail.  If we don't mock every single dependency, we can slide by without properly isolating our design.
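Working up through the layers might look something like this sketch.  `Tokenizer` and `WordCounter` are hypothetical classes; the point is only that the higher-level test uses the real low-level class, while dependency injection is still in place.

```python
import unittest


class Tokenizer:
    """Low-level class with no dependencies."""

    def split(self, text):
        return text.split()


class WordCounter:
    """Higher-level class; the tokenizer is injected."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def count(self, text):
        return len(self.tokenizer.split(text))


class TestTokenizer(unittest.TestCase):
    """Layer 1: tested in isolation, because it is isolated by design."""

    def test_split(self):
        self.assertEqual(Tokenizer().split("a b  c"), ["a", "b", "c"])


class TestWordCounter(unittest.TestCase):
    """Layer 2: uses the real Tokenizer instead of a mock."""

    def test_count(self):
        self.assertEqual(WordCounter(Tokenizer()).count("a b  c"), 3)


def run_layered_tests():
    """Run the low-level tests first, then the tests that build on them."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite([
        loader.loadTestsFromTestCase(TestTokenizer),
        loader.loadTestsFromTestCase(TestWordCounter),
    ])
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

If `Tokenizer.split` changes its interface, `TestWordCounter` fails for a real reason, and there's no separate mock to hunt down and update.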

Thursday, September 1, 2011

Data Warehousing and SQL -- Tread Carefully


"Are you implying that a scalable Data Warehouse solution could be implemented using Python and serialised files?"

Not "implying".  I'm trying to state it as clearly as I can.

A scalable data warehouse solution involves a lot of flat file processing.

ETL, for example, is mostly a flat-file pipeline.  It starts with source application extract (to create a flat file) and proceeds through a number of transformation steps to filter, cleanse, recode, conform dimensions, and eventually relate facts to dimensions.  This is generally very, very fast when done with simple flat files and considerably slower when done with a database.

This is the "Data Warehouse Bus" that Kimball describes in chapter 9 of The Data Warehouse Lifecycle Toolkit.

Ultimately, the cleansed, conformed files will lie around in a "staging area" forever.  When a datamart is built, then a subset of these files can be (rapidly) loaded into an RDBMS for query processing.

Doing this in Python is no different from doing it in Java, C++ or (for that matter) Syncsort.  Yes.  You can build a data warehouse using processing steps written around Syncsort and be quite successful.

The important part of this is to recognize the following.

When trying to do data warehouse flat-file processing in C++ (or Java) you have the ongoing schema maintenance issue.  The source data changes.  You must tweak the schema mapping from source to warehouse.  You can encode this schema mapping as property files or some such, or you can simply use an interpreted language like Python and encode the mappings as Python code.
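Encoding a mapping as Python code could be as simple as the sketch below.  The column names (`CUST_NO`, `CUST_NM`, `ST`) and the `conform` helper are hypothetical; a real source extract will have its own layout.

```python
import csv
import io

# Hypothetical mapping: warehouse column -> rule applied to the source row.
# When the source layout changes, this dictionary is the only thing to edit.
CUSTOMER_MAPPING = {
    "customer_id": lambda src: int(src["CUST_NO"]),
    "name": lambda src: src["CUST_NM"].strip().title(),
    "state": lambda src: src["ST"].upper(),
}


def conform(source_row, mapping):
    """Build one warehouse row by applying each rule to the source row."""
    return {target: rule(source_row) for target, rule in mapping.items()}


# A tiny in-memory extract stands in for the real source file.
source = io.StringIO("CUST_NO,CUST_NM,ST\n0042, ada lovelace ,ny\n")
rows = [conform(row, CUSTOMER_MAPPING) for row in csv.DictReader(source)]
```

Because the mapping is ordinary code, a rule can be anything from a rename to a lookup against a conformed dimension, with no separate property-file parser to maintain.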

The "Data Warehouse Bus" is a lot of applications that are trivially written as simple, parallel, multi-processing, small, read-match-write programs.  Forget threads.  Simply use heavy-weight, OS-level processes so that you can maximize the I/O bandwidth.  (Remember: when one thread makes an I/O request, the entire process waits; an I/O-bound application isn't helped by multi-threading.)

    import csv
    import sys

    # some_schema, exclude() and cleanse() are defined elsewhere.
    with open('some_data', newline='') as source:
        rdr = csv.DictReader(source)
        wtr = csv.DictWriter(sys.stdout, some_schema)
        for row in rdr:
            if exclude(row):
                continue
            clean = cleanse(row)
            wtr.writerow(clean)

This example writes to stdout so that it can be connected in a pipeline with other steps in the processing.  Programs running in an OS pipeline run concurrently.  They tie up all the cores available without any real programming effort other than decomposing the problem into discrete parallel steps that apply to each row being touched.
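Normally the shell connects the steps with `|`.  To keep the illustration in one file, the sketch below wires two stages together with `subprocess` instead.  Both stage programs, the column names and `run_pipeline` are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stage 1: drop rows that have no amount.
FILTER_STAGE = r"""
import csv, sys
rdr = csv.DictReader(sys.stdin)
wtr = csv.DictWriter(sys.stdout, rdr.fieldnames, lineterminator="\n")
wtr.writeheader()
for row in rdr:
    if row["amount"]:
        wtr.writerow(row)
"""

# Hypothetical stage 2: recode the region to upper case.
CLEANSE_STAGE = r"""
import csv, sys
rdr = csv.DictReader(sys.stdin)
wtr = csv.DictWriter(sys.stdout, rdr.fieldnames, lineterminator="\n")
wtr.writeheader()
for row in rdr:
    row["region"] = row["region"].strip().upper()
    wtr.writerow(row)
"""


def run_pipeline(source_path):
    """Equivalent of: filter < source_path | cleanse"""
    with open(source_path) as source:
        first = subprocess.Popen(
            [sys.executable, "-c", FILTER_STAGE],
            stdin=source, stdout=subprocess.PIPE,
        )
        second = subprocess.Popen(
            [sys.executable, "-c", CLEANSE_STAGE],
            stdin=first.stdout, stdout=subprocess.PIPE, text=True,
        )
        first.stdout.close()  # the parent's copy; second keeps its own
        output, _ = second.communicate()
        first.wait()
    return output


# A small sample extract, created only for this demonstration.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "some_data.csv"
    sample.write_text("region,amount\nny,10\nca,\ntx,5\n")
    result = run_pipeline(sample)
```

Both stages run at the same time as separate OS processes, and the second starts consuming rows as soon as the first emits them.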

Simple file processing is much, much faster than SQL processing.  Why?  No overheads for locking or buffer pooling or rollback segments, or logging, or after-image journaling or deadlock detection, etc.

Note that a data warehouse database has no need for sophisticated locking.  All of the "updates" are bulk loads.  80% of the activity is "insert".  With some Slowly Changing Dimension (SCD) operations there is a trivial status-change update, but this can be handled with a single database-wide lock during insert.

The primary reason for using SQL is to handle "SELECT something ... GROUP BY" queries.  SQL does this reasonably well most of the time.  Python does it pretty well, also.

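    from collections import defaultdict

    sum_col1 = defaultdict(float)
    count_group = defaultdict(int)
    # Assumes a DB-API driver whose cursor is a context manager
    # and whose rows allow access by column name.
    with connection.cursor() as c:
        c.execute("SELECT COL1, GROUP FROM...")
        for row in c.fetchall():
            sum_col1[row.group] += row.col1
            count_group[row.group] += 1
    print(sum_col1, count_group)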

That's clearly wordier than SQL.  But not much wordier.  The SELECT statement embedded in the Python is simpler because it omits the GROUP BY clause.  Since it's simpler, it's more likely to benefit from being reused in the RDBMS.

The Python may actually run faster than a pure SQL query because it avoids the (potentially expensive) RDBMS sort step.  The Python defaultdict (or Java HashMap) is how we avoid sorting.  If we need to present the keys in some kind of user-friendly order, we have limited the sort to just the distinct key values, not the entire join result.
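Here's a self-contained sketch of that idea, using an in-memory SQLite database as a stand-in for the real RDBMS.  The `fact` table and its `grp` and `col1` columns are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Stand-in for the datamart: a tiny fact table in memory.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE fact (grp TEXT, col1 REAL)")
connection.executemany(
    "INSERT INTO fact VALUES (?, ?)",
    [("west", 10.0), ("east", 2.5), ("west", 5.0), ("east", 1.5), ("north", 4.0)],
)

# Accumulate in hash maps; the rows can arrive in any order.
sum_col1 = defaultdict(float)
count_group = defaultdict(int)
for grp, col1 in connection.execute("SELECT grp, col1 FROM fact"):
    sum_col1[grp] += col1
    count_group[grp] += 1
connection.close()

# Sort only the distinct keys (three here), not all of the fetched rows.
report = [(grp, sum_col1[grp], count_group[grp]) for grp in sorted(sum_col1)]
```

The query carries no GROUP BY or ORDER BY, so the database only has to scan.  The one sort that remains is over the handful of distinct group keys.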

Because of the huge cost of GROUP BY, there are two hack-arounds.  One is "materialized views".  The idea is that a group-by view is updated when the base tables are updated to avoid the painful cost of sorting at query time.  The other is reporting tools which are "aggregate aware".  They can leverage the materialized view to avoid the sort.

How about we avoid all the conceptual overhead of materialized views and aggregate-aware reporting?  Instead we can write simple Python procedures that do the processing we want.

Bottom Line

Data Warehouse does not imply SQL.  Indeed, it doesn't even suggest SQL except for datamart processing of flexible ad-hoc queries where there's enough horsepower to endure all the sorting.