Why Not Use the Database?
The last couple of years of server-side architecture has left me with a
pretty bad taste in my mouth, and I've just got to spit… However, I have
a suspicion that some of my ideas may be seen as heretical, so allow me to
protect myself from the pyre by stating that the following idea is not a
global truth, but a wakening shake for all of us to realize that not
everything is a pattern from our last project.
Allow me to get concrete by talking a bit about my last project… which
was as much as an experiment as it was a deliverable to the customer…
Most server-side or web-based applications are written using
object-oriented technology (OOP), and it seems that we have a fetish for
writing useless objects. Yes, sometimes you have to build some helper
objects, but it is better to write those with a clear purpose in mind and
to realize that you might not need to…
I'm specifically referring to fact that we have this itch to create an
object that mirrors our database tables.
For example, in a CRM system, we would be tempted to create a "customer"
and a "task" and a "campaign" object… granted, I did create a database
table for each of those, but why are we so tempted to also create an
object.
I know, our rationale goes something like this: I need to retrieve the data
from the database, and I don't want to duplicate that code every time I
need a customer, so I will instantiate an object that will do that
business.
Now let me bore you with the slippery slope that happens afterwards… you
first create the object, and it is good. After a while, you need to
retrieve more and more complex data associated with customer, and suddenly
the performance goes to hell because to list all of the customer names
takes an hour as it is gleefully calculating each customer's mortgage
payments as well. Let me give you an example from my last project.
In my case, each customer could have zero or more "tasks", be associated
with zero or more "campaigns" and had one registration event. Each of
these, while stored in a different table, had a date associated with it,
and in one particular "screen" (read: servlet), we wanted to display the
names of the customers with the latest date associated with each customer
from any of these tables.
Since I am grabbing a value that is related from three different tables
(but doesn't have the same name), we need to perform an SQL union with the
max date. In my case, I threw this into a temporary table and then joined
it with the customer list. My point is not to get bogged down into the
details, but to show to give you and idea of the amount of work required to
get this "business data".
Let's suppose I had taken the typical approach and had created a "Customer"
class, with a number of getXXX()
methods that returned the
various properties associated with a customer. This particular
getMaxContactDate()
method would be very costly to lookup. I
would have two choices: I could retrieve the customer data during
construction of the object, or retrieve each property when it is called.
Retrieving each property when it is called seems terribly inefficient if I
am trying to display all of the data associated with a customer, as this
could result in a hundred database calls…
But performing this SQL union business would be silly during the object's
construction because I might just be needing to list all of the customer's
names in a list, as we don't want to perform costly database operation
without needing to…
So since I'm such a smart guy, I'll have the costly properties called when
we need them, and the rest performed at construction… and also since I'm
so smart, I will realize that to retrieve all the customer's data will not
be abysmal, it also won't be optimal-- as each costly method will have to
perform its database operations by itself, and can't leverage any
pre-calculations done by any other costly method.
Unless, I cache some of these pre-calculations…
Yes, this idea is an exciting white rabbit hole… But before you start
coding this intensive customer class, stop and think if it is worth it.
Will you ever reuse this code? No way, this is a customer class that will
be slightly different from the next customer class you'll have to write.
Also keep in mind that this is a mindless database application… it just
isn't worth your time to build such complexity.
Especially when complexity == risk for bugs.
Of course there's another way… don't make the object in the first place.
What? Yeah, I know this might be a controversial idea, but think about
it… we've been doing database applications for a long time-- longer than
we've been doing OOP. So why not use the database for what it is good for--
get data. How did we do it before? If a "screen" needed a bunch of data, we
performed the SQL operation, got that data, and stuck it up on the screen.
The advantage of this approach is that each screen can perform optimal
database operations to get just the data it requires… no more, and no
less.
The disadvantage? Obviously there may be a few "screens" that need to
perform the same database operations, and this approach may result in
duplicate code. It has been my perspective that this doesn't happen very
often, as usually each display "screen" required different SQL code, so it
really results in similar (not duplicate) code, but one in which it
can adapt to the needs of any individual "screen."
However, just like you abstract all of your output messages into a separate
text file (resource bundle), we could do the same for all of these SQL
statements. Now, instead of hard-coding the SQL string, you retrieve it
from a resource bundle based on some identification string. If a "screen"
requires a slightly different SQL statement, we just create it and add it
to the bundle.
Put down the pitchforks-- I'm not saying don't make any more objects… if
an object did some non-database calculation, then clearly make that object.
But I'd bet that the bulk of these database-oriented-objects
probably don't need to be objects at all.
If this idea gives you hives, then at least don't perfect your re-invented
wheel, but use some tool like Hibernate or JDO… just don't get me started
on EJBs.
Greg Jorgensen wrote the following comment:
I just read your article. You hit on something I've wondered and bitched
about for a while.
As a long-time database/SQL programmer I have often wondered why
programmers use relational databases as persistent object storage and
ignore the powerful and well-tested data manipulation and programming
features RDBMSs offer. I've seen hundreds of the "wrapper" classes you
mention. The idea seems to be to map classes to database tables (or
vice-versa), then moving the business/rules and logic into higher-level
classes. This leads to clunky and inefficent classes for customer,
employee, etc. and code somewhere else that manipulates the objects.
Reporting and almost any aggregate functionality gets tedious to program
AND slow, since it either generates a lot of discrete database queries, or
it uses cursors on a result set.
I guess the idea of separating the database structure and contents from the
business logic is a holdover from the three-tier architecture craze. Or
maybe it's caused by an overwhelming desire to do as much as possible in
the "cool" OOP language and only do simple SELECT/INSERT/UPDATE
transactions on the database. If you think about it for a minute, you
realize that placing the data structure and contents in a back-end
relational database, and the logic for manipulating/validating/displaying
that data somewhere else (in middle-tier classes) is really the antithesis
of OOP: data and the operations on it are now separated rather than
encapsulated.
Rather than write a bunch of wrapper classes, I let the database engine do
what it's good at: managing data. So my Customer class has methods for
OrderVolume(periodstart, periodend)
or ChangeAddress(…)
. The wrapper
class does nothing more than get a database connection (either opening a
new one or using a pooled open connection) and passing NAMED parameters to
a stored procedure. The SQL code in the database knows how to physically
manipulate the data.
I recently rewrote some code that had wrappers for every table in the
database: you could SELECT, INSERT, UPDATE, DELETE by passing a zillion
parameters in a dictionary. The wrappers were really just training wheels
to keep the programmer away from SQL. The wrappers constructed the SQL, but
had no way to validate it for correct field names or types, so the SQL was
tossed at the RDBMS with fingers crossed. Tracking through the code to find
references to tables or columns was tricky because of the layers of
indirection. My fix was to determine the operations the code actually
performed against the database (and there were not very many operations).
It never did something like change the city but not the zip code, or
display just one detail line from an order. So there was no reason to have
wrappers that could be made to do such things. Instead I wrote functions
that know how to change a customer address, or retrieve all order detail
lines at once. Those functions were mostly a few lines of code that called
a stored procedure. Stored procedures are syntax-checked, type-validated,
and compiled (I'm using SQL Server), so they are known to work before they
are executed, and they are much faster than sending random SQL statements.
In the process I was able to move code that does secondary lookups into
JOINs in the stored procedure, do some formatting of results, calculate
discounts and defaults, etc. The combined code is much smaller, simpler,
and easier to follow than the original. Now if I we need to change a
discount percentage that happens in one place, in the database code,
instead of in hundreds of places that were fetching order detail lines one
at a time, looking up the discount factor in the inventory table, doing the
calculation, and keeping track of all that.
In short, try using different abstractions for the database. Treat it as a
black box that can do specific operations on data or return formatted
results, instead of treating it like a high-capacity dBase file.
The arguments I've heard against this kind of slicing are mostly spurious:
It Howard's Note: What about instead of having an object being little more than a struct
, you create "model-side code" that contains the operations.
breaks the three-tier or MVC model.
I don't think so, but so what?
What if we change back-end databases?
That will be costly and require extensive code review in any case, and it's
rarely done in real life.
SQL is old/lame/hard to read/etc.
No it's not, it just takes practice and some study, like anything else. SQL
isn't cool but it will land a job a lot faster than arguing about
continuations in Haskell. Young programmers don't like the old-fashioned
syntax of SQL, or the paucity of functionality compared to the Perl,
Python, Java, C++ they cut their teeth on. Not that I'm an old-time Fortran
and COBOL programmer, but I just want to hit hot-shots who complain about
those languages even though they never used them. SQL has, I think, been lumped
together with COBOL in the geriatric backwater of programming tools. Too bad,
they don't know what they're missing and giving up -- a single line of SQL is
frequently more expressive and powerful than hundreds of lines of
procedural code.
My database engine doesn't support stored procedures (from programmers who use mySQL or Access but have no exposure to Oracle, SQL Server, Postgres, Sybase, DB2, etc.).
Use a real database engine, or hide your "stored procedure" in the wrapper
class until you can migrate them to the database engine.
I don't trust JOINs or subqueries… how can I tell if I'm getting the right results? LEFT JOIN confuses me.
From people who believe they have mastered C++ templates or Java checked
exceptions this is just so much whining. Relational database operations
have consistent, predictable, and testable behavior -- you just have to
read the C. J. Date book. I think database work separates the wheat from
the chaff, because eventually you have to actually know what you're doing
with SQL, whereas you can get a long way by faking it and copy/pasting code
snippets with C++, Java, Visual Basic, etc. The necessity of real
understanding throws a lot of programmers, just like debugging in assembly
language or debugging old C code does.
My anecdotal experience is that too many programmers don't get relational
database theory because it's one of the (very) few things in the
programming world that is based on real math and has provably right and
wrong ways to do things. That's not to say database designers don't argue
about normalization, just that the arguments are usually based more on
definable and testable assumptions than arguments comparing, say, Java to
C#. I think most programmers don't get the set orientation of RDBMSs and
SQL, and tend to think in terms of records and fields instead of relations;
their thinking is one-dimensional, which creates the loads of slow and
buggy cursor-oriented code I see all the time. The record-oriented approach
introduces locking and consistency problems that the database engine can
handle if you let it. It also hobbles the database engine and slows things
down to the point that caching results (which any decent database engine
does already) starts to seem like a good idea.
Tell others about this article: