Pages

Monday, July 25, 2011

20 Database Design Best Practices





  1. Use well defined and consistent names for tables and columns (e.g. School, StudentCourse, CourseID ...).
  2. Use singular for table names (i.e. use StudentCourse instead of StudentCourses). Table represents a collection of entities, there is no need for plural names.
  3. Don’t use spaces for table names. Otherwise you will have to use ‘{‘, ‘[‘, ‘“’ etc. characters to define tables (i.e. for accesing table Student Course you'll write “Student Course”. StudentCourse is much better).
  4. Don’t use unnecessary prefixes or suffixes for table names (i.e. use School instead of TblSchool, SchoolTable etc.).
  5. Keep passwords as encrypted for security. Decrypt them in application when required.
  6. Use integer id fields for all tables. If id is not required for the time being, it may be required in the future (for association tables, indexing ...).
  7. Choose columns with the integer data type (or its variants) for indexing. varchar column indexing will cause performance problems.
  8. Use bit fields for boolean values. Using integer or varchar is unnecessarily storage consuming. Also start those column names with “Is”.
  9. Provide authentication for database access. Don’t give admin role to each user.
  10. Avoid “select *” queries until it is really needed. Use "select [required_columns_list]" for better performance.
  11. Use an ORM (object relational mapping) framework (i.e. hibernate, iBatis ...) if application code is big enough. Performance issues of ORM frameworks can be handled by detailed configuration parameters.
  12. Partition big and unused/rarely used tables/table parts to different physical storages for better query performance.
  13. For big, sensitive and mission critic database systems, use disaster recovery and security services like failover clustering, auto backups, replication etc.
  14. Use constraints (foreign key, check, not null ...) for data integrity. Don’t give whole control to application code.
  15. Lack of database documentation is evil. Document your database design with ER schemas and instructions. Also write comment lines for your triggers, stored procedures and other scripts.
  16. Use indexes for frequently used queries on big tables. Analyser tools can be used to determine where indexes will be defined. For queries retrieving a range of rows, clustered indexes are usually better. For point queries, non-clustered indexes are usually better.
  17. Database server and the web server must be placed in different machines. This will provide more security (attackers can’t access data directly) and server CPU and memory performance will be better because of reduced request number and process usage.
  18. Image and blob data columns must not be defined in frequently queried tables because of performance issues. These data must be placed in separate tables and their pointer can be used in queried tables.
  19. Normalization must be used as required, to optimize the performance. Under-normalization will cause excessive repetition of data, over-normalization will cause excessive joins across too many tables. Both of them will get worse performance.
  20. Spend time for database modeling and design as much as required. Otherwise saved(!) design time will cause (saved(!) design time) * 10/100/1000 maintenance and re-design time.

29 comments:

  1. Hash. Passwords. With a salt. With a a hashing method that's slow (in computer terms).

    ReplyDelete
  2. For 18, if this is SQL Server, there are options for storing blob and large text data out of row. http://msdn.microsoft.com/en-us/library/ms189087.aspx.

    sp_tableoption N'MyTable', 'large value types out of row', 'ON'

    ReplyDelete
  3. #7 I wouldn't say so. What is the difference between 100 and 'BAA' in terms of indexing?

    ReplyDelete
  4. #7 Integers and there variants all require byte alignment prior to the comparison function, character types do not. The avoidance of the byte alignment step is why VARCHAR indexes are used in monster databases.

    Mike

    ReplyDelete
  5. I like prefixing columns with either Is, Has, Can or Must. This allows you to produce a more powerfull name for your column.

    ReplyDelete
  6. Prefixing objects may not be a bad practice. tblSchoolCourse, uspSchoolCourse, and vwSchoolCourse make it easy to see when reading code that a table, user stored procedure, or view is being accessed.

    Dave62

    ReplyDelete
  7. @Anonymous (Mike): Could you please provide some references concerning your statement that "varchar indexes are used in moster databases"? (Presumably, by "used" you meant "favored".)

    It seems to me, the system is more likely to have a memory bottleneck than a processor bottleneck. YMMV; when in doubt, profile, then optimize if needed.

    ReplyDelete
  8. It's my understanding that a CHAR(3) occupies less storage, hence requires less I/O than an INT (4-bytes) - note I'm not talking about CHAR, not VARCHAR or NCHAR. So for example if you have a table of Airports or Airlines it would be better to use the 2 or 3-letter airport/airline code as a primary key - this field is unique and would probably be part of the table anyway so you save storage having to add an unnecessary INT primary key column. Also it is very convenient as a foreign key when browsing other tables as it is often immediately obvious which airport/airline is being referred to just from its 2/3 letter code. With a bit of work you could create your own 2/3 letter unique keys for other suitable tables instead of integers.

    Rob

    ReplyDelete
  9. On #4, you don't always have the context.

    For example, if I'm using intellisense in my editor, it's not always going to know that I want to enter a table name, so when I start typing, I get columns, functions, etc showing up. When you don't exactly remember the full name of the table, it just slows you down.

    Likewise, if you are saving your DB objects as scripts for source control, you don't have the context. Marking all your tables as tblWhatever.sql makes it that much faster to find what you're looking for.

    You need to be consistent, obviously, but I've found that the benefit of the extra three characters more than outweighs potential issues with prettily-formatted code.

    ReplyDelete
  10. @Jim: For that kind of intellisense usage, it can help. Anyway I don't want to see 100 tables starting with Tbl on the DBMS user interface and ER diagrams. And I think writing 3 more characters on each query for each table is a bit time consuming.

    But these are just different viewpoints. Thanks for the comment.

    ReplyDelete
  11. Nice list and very thoughtful.

    ReplyDelete
  12. In regard to namingconventions wouldn't tables usually be named with a noun (e.g. SchoolCourse) and stored procs including a verb (e.g. SearchSchoolCourses). As for views I see a possible conflict but since I haven't been using views much (I am not focused on designing databases in my daily work) I haven't really got much experience with naming those - I imagine some combination af nouns though (e.g. SchoolCoursesAndStudents)... :-)

    Jan

    ReplyDelete
  13. Useful suggestions

    ReplyDelete
  14. Intersting range of comments, but almost all of them are negated by the fact that these "best practices" are specific to a specific database system (ms sql). For example, practice 8 is not practical since a bit field differs in each dbms and from a coding perspective they should all map to a single type. In C# you will need to query the type and cast the variable appropriately. Use a small (16 bit) integer type instead and this problem goes away - it works across all dbms types.

    ReplyDelete
  15. @dpleo
    An object name is an alias for the data the object represents, not the object itself. When you think in the problem domain, it doesn't matter if the data is coming from a view, stored proc or table, it just matters that the data is there. On top of that, it can become a maintenance pain. What happens if you have to change tblSchoolCourse to a view? You then have to go and rename it everywhere it's referenced, and if you're not the only one who works on that database, you're going to have an interesting time tracking them all down.

    Prefixing and postfixing is an old habit that really needs to die. With modern OO languages, for example, everything's an object, yet people still feel compelled to strFirstName or nAge. The intention of prefixing things was to describe the *purpose* of the thing, not the *type* of the thing, but that message got lost along the way. If they were true to their interpretation, then it would be objFirstName, objAge etc. And I actually have seen code like that...

    ReplyDelete
  16. CB, I think your answer to James on point 19 is a little off the mark. No matter what the table size, lets go big and assume it is 100GB in size, non-partitioned, unless the application is poorly designed or not designed for online use, the whole table will not be in memory. If a query causes a clustered index scan, then indexing needs to be looked at or the application request patterns. Even the smallest table work this way. Using MS SQL server as an example, the smallest amount of data that can be brought into memory from the disk is 8KB, one page. An efficient query engine only brings in data that is needed to service the query at hand. You can see this in action by creating a query joining say 20 tables, but only reference 1 in the select clause and the where clause. The query plan will not include the other 19 tables. In almost every case of slow database performance I have had to deal with, it comes down to indexing, not the number of joins. My rule of thumb is normalise to the 3rd normal form and put in production. It would be nice to be able to thoroughly stress test the database before going into production, but very few dev shops have this capacity. Most stress testing I have done is in a production environment. :)

    Dave

    ReplyDelete
  17. Very nice post. I have learn all the points in my 2 years experience. Why not you made this post 2 years before?

    ReplyDelete
  18. Good pointers ..

    Need to add few points.

    1) Do Understand the strengths of underline architecture of Database your product is being built upon .. and use generic as well as specific (to DB) optimization techniques as much as possible.

    2) Do not utilize the single db for all use cases .. like the techniques defined above for partitioning, you may come across that the database itself can be divided into two or more MAJOR modules/usecases to distribute the load of types of operations

    3) Do not over-use the OLTP database .. and do understand and able to bifurcate the difference b/w OLAP and OLTP databases and for which use case which one is more suitable ..

    4) Real-time (or Interval) Replication should be used for load balancing (or for different types of usage) wherever applicable

    5) DB configuration should be properly tuned with the hardware .. and hardware procurement should be derived from the requirements of business.

    ReplyDelete
  19. #7 & #18 can be combined. That is, make your records as narrow as possible. Use integers, enums, and bits over chars, varchars, etc. The narrower the record the more of them will fit in the caches. It is the caches that make your database fast. Generally, put meta-data in one table and content in other joined tables. And, as a previous commentator said, it is really important to know your database's performance characteristics regards to data volume and data contention.

    ReplyDelete
  20. every data base designer must consider this rules. and another thing is this article has been written in very simple manner

    ReplyDelete
  21. I'd like to add one about security.

    #0 - DO NOT IMPLEMENT YOUR OWN CUSTOM SECURITY MECHANISM simply because you don't know how to use the security built into the database engine and do not want to learn how to.

    You'll waste more time (yours and your users) and hinder your database design thru attempts to work around limitations that you've unintentionally added to your DB design by using your own security mechanism. It takes lees effort to learn how to use the systems built in security then implemt you r own.

    At our work we deal with 3 different applications that use SQKL Server and each uses its own cluster F of a mess security process because the developers obviously had no clue on How security in SQL Server worked. Very frustrating.

    ReplyDelete
  22. Over all good post but not agreed by point no 4, some time prefixes or suffixes are helpful in querying Database or make queries more understandable at least some sort of pattern must be used to get the idea is it View, SP, Function, OR Table

    ReplyDelete
  23. #19 on normalization/denormalization focuses too narrowly on performance. Your logical data model should be highly normalized, because the purpose is to document and reflect the business understanding of the world. If denormalization is being considered, the critical question to be asked is "am I starting to lie about the world". Fast but wrong is not usually an acceptable solution.

    That said, in many circumstances enterprise data integration may allow substantial denormalization, for example in a data warehouse, because the system of record can be relied on to manage data integrity. Most normalization steps are intended to prevent anomalies due to inserts or updates, so if those never happen (e.g. a warehouse) then denormalization may be safe.

    ReplyDelete
  24. Warehouse is different things in database design, They are totally differ. De-normalization is focus only logical data model.

    ReplyDelete
  25. These are simply fantastic tip for database application. you may also like my post how to manage transaction in database

    ReplyDelete
  26. Definitely a good "intro to data design" and sensible guidelines. Some of the points are debatable in terms of their "best practice" status though.

    6 - Synthetic vs Natural keys is an issue that is still debated with strong arguments on both sides.

    11 - Can hardly be called "best practice" when you include a non-quantifiable statement like "when the code is big enough". Well? When IS the code "big enough"? Also "Performance issues of ORM frameworks can be handled by detailed configuration parameters" is just not true. Optimizing an ORM-based solution goes much deeper than configuration. An efficient ORM implementation requires both care in code, software design AND data design.

    13 - Depending on vendor, clustering/replication can often come with a hefty licensing cost. Worth mentioning...

    18 - depends on database. Oracle (and AfAIK MySql) will store BLOB:s out of line (in separate blocks) if they are large. As long as you don't include the blob column in your query, there should be no performance hit having it defined in the same table.

    ReplyDelete
  27. At all useful, but:
    2. May conflict with ORM conventions like in CakePHP.
    4. It may be needed to use prefix if you need to separate a few instances of the same application located in the same database like in WordPress.
    7. Some exceptions are possible like when index on char/varchar(1) 1 byte will works faster than int (4) bytes.
    8. True for SQL Server, but may vary in another databases.
    11. In some cases it is better to use stored procedures instead of tuning ORM framework. Did you saw what is rendering by Hibernate? :)
    14. Too much constraints in database can slow down it considerably.
    16. Incorrectly defined indexes may bring you more problems than benefits.
    17. As a result you will also receive twice larger area for attacker and need of enabling external connections to database.

    ReplyDelete
  28. This comment has been removed by a blog administrator.

    ReplyDelete