What To Test in a Relational Database
Last month I explored the assumptions made by the traditional data management community and argued that these assumptions had been shown to be false over the years. These assumptions included the belief that you can't easily evolve a database schema, that you need to do detailed up front modeling, and that reviews and inspections are an effective way to ensure data quality. Respectively, I argued that database refactoring enables you to easily evolve database schemas, that an agile approach to data modeling is significantly more effective, and that database regression testing is the best way to ensure data quality. A debate on the Agile Databases mailing list ensued and it quickly became apparent that the traditionalists could understand, although often not accept, the first two agile database development techniques but clearly struggled with the concept of database testing. I believe that the virtual absence of discussion about testing within the data management community is the primary cause of the $611 billion annual loss, as reported by The Data Warehouse Institute, experienced by North American organizations resulting from poor data quality. So this month I've decided to describe what you should consider testing in a database.
Let's start with some terminology. Database testing is the act of validating the contents, schema, and functionality within a database. From the point of view of a relational database this includes the actual data itself, the table structures and relationships between tables, and the stored procedures/functions or database classes respectively. Database interface testing validates the database at the black-box level whereas internal database testing validates it at the clear-box level -- if any database testing occurs at all it is typically at the interface level only because of the lack of tool support for internal testing. Database regression testing is the act of running the database test suite on a regular basis, ideally whenever someone does something which could potentially inject a defect into the database such as change how they write data into a database or change some of the code within the database itself. Test Driven Database Development (TDDD), also known as "Behavior Driven Database Development" (BDDD), is the act of specifying the design of a database by writing a single test then just enough database code/schema to fulfill that test.
I think that one of the reasons why data professionals are confused about the concept of database regression testing is because it is a relatively new idea within the data community. One of the assumptions that I didn't cover last month is the idea within the traditional data community that testing is something that other people do (i.e., test or quality assurance professionals). This reflects a penchant for over-specialization and a serial approach towards development by traditionalists, two ideas which have also been shown to be questionable organizational approaches at best.
The easiest thing to get your head around is the need to validate the logic implemented within a database. Relational databases contain code, in the form of stored procedures, triggers, and even object-oriented classes. There is nothing special about this code. Just like you test application code, shouldn't you also test database code? Of course you should. You'll apply the exact same types of tests to database code as you would to application code.
What isn't as obvious, at least from the questions I was getting from traditional data professionals, was the need to validate data quality via testing. As DDJ's data quality survey showed last year, data is considered a corporate asset by 96 percent of organizations yet less than half have any sort of testing strategy in place to actually ensure data quality. In short, people like to talk about data quality but not act on it. When it comes to data you could validate the following via tests: Column domain value rules. For example, the Flavor column has allowable values of Chocolate, Vanilla, and Strawberry.
Column default value rules. For example, the default value is Strawberry.
Value existence rules. For example, there should always be a value of Flavor indicated (it can never be null).
Row value rules. For example, the value of StartDate must be less than EndDate when EndDate is provided.
Size rules. For example, a code in a column must always be two characters in length or a value in a VARCHAR column must be at least five characters in length Although these data rules can be implemented via constraints, or via other means, you still need to test to ensure that the rules are being implemented properly. Constraints can easily be dropped or reworked, therefore you should have regression tests I place to validate them. Nullability is critical to test for because a NOT NULL constraint can also easily be dropped. Furthermore, "quasi-nulls" such as empty strings are often not allowed so supporting tests should be in place to ensure this.
Table structure can also be easily validated, something that is typically done as a side effect of the tests to validate the Create Read Update and Delete (CRUD) logic of an application. These tests will break whenever you change the database schema without also changing the access code. From a database design point of view, as you write CRUD tests you are effectively designing the table structure which supports those tests.
You can also write tests which validate relationships between the rows in different tables. These tests validate referential integrity (RI) rules, for example if a row in the Employee table references a row within the Position table then that row should actually exist. RI rules such as this are typically implemented as triggers, but what happens if someone drops or modifies a trigger without understanding the implications of doing so?
You may also choose to write database performance tests to both specify performance requirements and to ensure that those requirements are met. From a black-box point of view you might write tests which validate the performance characteristics surrounding database access, including object/relational (O/R) mapping logic. From a clear-box point of view you might have tests which motivate you to maintain secondary indices to support common database access paths or to refactor your database tables into structures which are more performant.
I believe that as an industry we have a lot of work ahead of us with respect to data quality. We have known for decades that testing, particularly regression testing, leads to greater quality yet for some reason we haven't applied this knowledge to relational databases. There are many things which can and should be tested within a relational database, this column touches on just a few, and it is about time that we step up and develop new tools and techniques to do exactly that. The Agile community has lead the way to bringing regression testing, and more importantly test-first development, to application programming. This has lead to noticeable increases in quality, time to market, and return on investment (ROI). We can achieve the same successes when it comes to database design and quality.
Subscribe to:
Post Comments (Atom)
Data Governance Learning Plan
Seven courses to build the knowledge of Data Governance Program ( planning & establishment ) by Kelle O'Neal with 100 MCQs Exam 1....
-
One of the common problems I has facing is Configure IBM Cognos 8.3 with IIS 7 and Windows Vista for that I summarize the configuration in t...
-
Software developers have had limited access to telecom networks and capabilities. Specialized computer-telephony integration protocols are t...
-
Planning Support for Windows Server 2008 Hyper-V™ You can now use PerformancePoint Server 2007 SP2 with Windows Server 2008 Hyper-V. Hyper-V...
No comments:
Post a Comment