Howardism Musings from my Awakening Dementia
My collected thoughts flamed by hubris
Home PageSend Comment

Why Not Use the Database?

The last couple of years of server-side architecture has left me with a pretty bad taste in my mouth, and I've just got to spit… However, I have a suspicion that some of my ideas may be seen as heretical, so allow me to protect myself from the pyre by stating that the following idea is not a global truth, but a wakening shake for all of us to realize that not everything is a pattern from our last project.

Allow me to get concrete by talking a bit about my last project… which was as much as an experiment as it was a deliverable to the customer…

Most server-side or web-based applications are written using object-oriented technology (OOP), and it seems that we have a fetish for writing useless objects. Yes, sometimes you have to build some helper objects, but it is better to write those with a clear purpose in mind and to realize that you might not need to…

I'm specifically referring to fact that we have this itch to create an object that mirrors our database tables.

For example, in a CRM system, we would be tempted to create a "customer" and a "task" and a "campaign" object… granted, I did create a database table for each of those, but why are we so tempted to also create an object.

I know, our rationale goes something like this: I need to retrieve the data from the database, and I don't want to duplicate that code every time I need a customer, so I will instantiate an object that will do that business.

Now let me bore you with the slippery slope that happens afterwards… you first create the object, and it is good. After a while, you need to retrieve more and more complex data associated with customer, and suddenly the performance goes to hell because to list all of the customer names takes an hour as it is gleefully calculating each customer's mortgage payments as well. Let me give you an example from my last project.

In my case, each customer could have zero or more "tasks", be associated with zero or more "campaigns" and had one registration event. Each of these, while stored in a different table, had a date associated with it, and in one particular "screen" (read: servlet), we wanted to display the names of the customers with the latest date associated with each customer from any of these tables.

Since I am grabbing a value that is related from three different tables (but doesn't have the same name), we need to perform an SQL union with the max date. In my case, I threw this into a temporary table and then joined it with the customer list. My point is not to get bogged down into the details, but to show to give you and idea of the amount of work required to get this "business data".

Let's suppose I had taken the typical approach and had created a "Customer" class, with a number of getXXX() methods that returned the various properties associated with a customer. This particular getMaxContactDate() method would be very costly to lookup. I would have two choices: I could retrieve the customer data during construction of the object, or retrieve each property when it is called.

Retrieving each property when it is called seems terribly inefficient if I am trying to display all of the data associated with a customer, as this could result in a hundred database calls…

But performing this SQL union business would be silly during the object's construction because I might just be needing to list all of the customer's names in a list, as we don't want to perform costly database operation without needing to…

So since I'm such a smart guy, I'll have the costly properties called when we need them, and the rest performed at construction… and also since I'm so smart, I will realize that to retrieve all the customer's data will not be abysmal, it also won't be optimal-- as each costly method will have to perform its database operations by itself, and can't leverage any pre-calculations done by any other costly method. Unless, I cache some of these pre-calculations…

Yes, this idea is an exciting white rabbit hole… But before you start coding this intensive customer class, stop and think if it is worth it. Will you ever reuse this code? No way, this is a customer class that will be slightly different from the next customer class you'll have to write. Also keep in mind that this is a mindless database application… it just isn't worth your time to build such complexity. Especially when complexity == risk for bugs.

Of course there's another way… don't make the object in the first place.

What? Yeah, I know this might be a controversial idea, but think about it… we've been doing database applications for a long time-- longer than we've been doing OOP. So why not use the database for what it is good for-- get data. How did we do it before? If a "screen" needed a bunch of data, we performed the SQL operation, got that data, and stuck it up on the screen.

The advantage of this approach is that each screen can perform optimal database operations to get just the data it requires… no more, and no less.

The disadvantage? Obviously there may be a few "screens" that need to perform the same database operations, and this approach may result in duplicate code. It has been my perspective that this doesn't happen very often, as usually each display "screen" required different SQL code, so it really results in similar (not duplicate) code, but one in which it can adapt to the needs of any individual "screen."

However, just like you abstract all of your output messages into a separate text file (resource bundle), we could do the same for all of these SQL statements. Now, instead of hard-coding the SQL string, you retrieve it from a resource bundle based on some identification string. If a "screen" requires a slightly different SQL statement, we just create it and add it to the bundle.

Put down the pitchforks-- I'm not saying don't make any more objects… if an object did some non-database calculation, then clearly make that object. But I'd bet that the bulk of these database-oriented-objects probably don't need to be objects at all.

If this idea gives you hives, then at least don't perfect your re-invented wheel, but use some tool like Hibernate or JDO… just don't get me started on EJBs.

Greg Jorgensen wrote the following comment:

I just read your article. You hit on something I've wondered and bitched about for a while.

As a long-time database/SQL programmer I have often wondered why programmers use relational databases as persistent object storage and ignore the powerful and well-tested data manipulation and programming features RDBMSs offer. I've seen hundreds of the "wrapper" classes you mention. The idea seems to be to map classes to database tables (or vice-versa), then moving the business/rules and logic into higher-level classes. This leads to clunky and inefficent classes for customer, employee, etc. and code somewhere else that manipulates the objects. Reporting and almost any aggregate functionality gets tedious to program AND slow, since it either generates a lot of discrete database queries, or it uses cursors on a result set.

I guess the idea of separating the database structure and contents from the business logic is a holdover from the three-tier architecture craze. Or maybe it's caused by an overwhelming desire to do as much as possible in the "cool" OOP language and only do simple SELECT/INSERT/UPDATE transactions on the database. If you think about it for a minute, you realize that placing the data structure and contents in a back-end relational database, and the logic for manipulating/validating/displaying that data somewhere else (in middle-tier classes) is really the antithesis of OOP: data and the operations on it are now separated rather than encapsulated.

Rather than write a bunch of wrapper classes, I let the database engine do what it's good at: managing data. So my Customer class has methods for OrderVolume(periodstart, periodend) or ChangeAddress(…). The wrapper class does nothing more than get a database connection (either opening a new one or using a pooled open connection) and passing NAMED parameters to a stored procedure. The SQL code in the database knows how to physically manipulate the data.

I recently rewrote some code that had wrappers for every table in the database: you could SELECT, INSERT, UPDATE, DELETE by passing a zillion parameters in a dictionary. The wrappers were really just training wheels to keep the programmer away from SQL. The wrappers constructed the SQL, but had no way to validate it for correct field names or types, so the SQL was tossed at the RDBMS with fingers crossed. Tracking through the code to find references to tables or columns was tricky because of the layers of indirection. My fix was to determine the operations the code actually performed against the database (and there were not very many operations). It never did something like change the city but not the zip code, or display just one detail line from an order. So there was no reason to have wrappers that could be made to do such things. Instead I wrote functions that know how to change a customer address, or retrieve all order detail lines at once. Those functions were mostly a few lines of code that called a stored procedure. Stored procedures are syntax-checked, type-validated, and compiled (I'm using SQL Server), so they are known to work before they are executed, and they are much faster than sending random SQL statements. In the process I was able to move code that does secondary lookups into JOINs in the stored procedure, do some formatting of results, calculate discounts and defaults, etc. The combined code is much smaller, simpler, and easier to follow than the original. Now if I we need to change a discount percentage that happens in one place, in the database code, instead of in hundreds of places that were fetching order detail lines one at a time, looking up the discount factor in the inventory table, doing the calculation, and keeping track of all that.

In short, try using different abstractions for the database. Treat it as a black box that can do specific operations on data or return formatted results, instead of treating it like a high-capacity dBase file.

The arguments I've heard against this kind of slicing are mostly spurious:

  • It Howard's Note: What about instead of having an object being little more than a struct, you create "model-side code" that contains the operations. breaks the three-tier or MVC model. I don't think so, but so what?

  • What if we change back-end databases? That will be costly and require extensive code review in any case, and it's rarely done in real life.

  • SQL is old/lame/hard to read/etc. No it's not, it just takes practice and some study, like anything else. SQL isn't cool but it will land a job a lot faster than arguing about continuations in Haskell. Young programmers don't like the old-fashioned syntax of SQL, or the paucity of functionality compared to the Perl, Python, Java, C++ they cut their teeth on. Not that I'm an old-time Fortran and COBOL programmer, but I just want to hit hot-shots who complain about those languages even though they never used them. SQL has, I think, been lumped together with COBOL in the geriatric backwater of programming tools. Too bad, they don't know what they're missing and giving up -- a single line of SQL is frequently more expressive and powerful than hundreds of lines of procedural code.

  • My database engine doesn't support stored procedures (from programmers who use mySQL or Access but have no exposure to Oracle, SQL Server, Postgres, Sybase, DB2, etc.). Use a real database engine, or hide your "stored procedure" in the wrapper class until you can migrate them to the database engine.

  • I don't trust JOINs or subqueries… how can I tell if I'm getting the right results? LEFT JOIN confuses me. From people who believe they have mastered C++ templates or Java checked exceptions this is just so much whining. Relational database operations have consistent, predictable, and testable behavior -- you just have to read the C. J. Date book. I think database work separates the wheat from the chaff, because eventually you have to actually know what you're doing with SQL, whereas you can get a long way by faking it and copy/pasting code snippets with C++, Java, Visual Basic, etc. The necessity of real understanding throws a lot of programmers, just like debugging in assembly language or debugging old C code does.

My anecdotal experience is that too many programmers don't get relational database theory because it's one of the (very) few things in the programming world that is based on real math and has provably right and wrong ways to do things. That's not to say database designers don't argue about normalization, just that the arguments are usually based more on definable and testable assumptions than arguments comparing, say, Java to C#. I think most programmers don't get the set orientation of RDBMSs and SQL, and tend to think in terms of records and fields instead of relations; their thinking is one-dimensional, which creates the loads of slow and buggy cursor-oriented code I see all the time. The record-oriented approach introduces locking and consistency problems that the database engine can handle if you let it. It also hobbles the database engine and slows things down to the point that caching results (which any decent database engine does already) starts to seem like a good idea.

Tell others about this article:
Click here to submit this page to Stumble It