Purpose
|
Define the detailed physical design of the database.
|
The physical database design includes model elements (such as tables, views, and stored procedures) that represent the
detailed physical structure of the database and model elements (such as schemas and tablespaces) that represent the
underlying data storage design of the database. Collectively, these model elements comprise the Physical Data
Model of the database. This Physical Data Model is contained in the Artifact: Data Model and is not a separate model work product.
The detailed steps for developing the physical database design are as follows:
Purpose
|
To define reusable user-defined types.
|
Domains might be used by the database designer to enforce type standards throughout the database design. Domains are
user-defined data types that can be applied to a column in a table. Domains have the properties of a column
without the name.
Purpose
|
Create the initial database tables and relationships.
|
The database designer models the Physical Data Model elements using tables and columns in tables, as described in Guideline: Data Model.
If a Logical Data Model has been created, then its logical entities can be used as the basis for an initial set of
tables.
Alternatively, the database designer might jump-start the Physical Data Model by using the persistent classes in the
Design Model as a starting point for tables in the Physical Data Model. The database designer models the
persistent classes and their attributes as tables and columns respectively. The database designer also needs
to define the relationships between the tables based on the associations between the persistent classes in the Design
Model. A description of how the Design Model elements and relationships map to Data Model elements and
relationships is provided in Work Product Guideline: Forward Engineering Relational Databases.
If you are starting the model from persistent classes rather than from a normalized Logical Data Model, then you will
generally need to apply some normalization in order to eliminate data redundancies and non-key field dependencies. See
Concept: Normalization for more information on database normalization.
Purpose
|
To define standard reference tables used across the project.
|
Often there are standard look-up tables, validation tables, or reference tables used throughout the project. Since the
data in these tables tends be frequently accessed but seldom-changing, that data is worth special consideration. In the
Design Model, these tables might contain standard product codes, state or province codes, postal or zip codes, tax
tables, area code validation tables, or other frequently accessed information. In financial systems, these tables might
contain lists of policy codes, insurance policy rating categories, or conversion rates. Look in the Design Model for
classes that are primarily read-only, providing validation information for a large number of clients.
If the reference table is small, do not bother to index it, since indexing might actually add additional overhead for
small tables. A small, frequently accessed table also tends to remain in memory, because caching algorithms often keep
frequently accessed tables in the data cache.
If possible, make sure that the database cache is large enough to keep all reference tables in memory, along with
normal "working set space" for queries and transactions. Often the secret to increasing database performance is
reducing disk I/O.
Once the reference table structures are defined, determine a strategy for populating the reference tables. Since these
tables are accessed near the beginning of the project, determining the reference values and loading the tables often
need to occur relatively early during application runtime. While the database designer is not responsible for obtaining
the data, he or she is responsible for determining how and when the reference tables will be refreshed.
Purpose
|
To define the one or more columns that uniquely identify a row in the table.
To define constraints on columns that guarantee the uniqueness of the data or collection of data.
|
A primary key is one or more columns that uniquely identify rows in a table. A table has a single primary key. There is
often a "natural" key that can be used to uniquely identify a row of data (for example, the postal code in a reference
table). The primary key should not contain data that might change with the business environment. If the "natural" key
is a value that can change (for example a person's name), then it is recommended that the database designer create a
single non-meaningful, non-user-entered column when creating a primary key. This creates a data structure that has
greater adaptability to changes in the business structure, rules, or environment.
The use of a non-meaningful, non-user-entered column as the primary key is an essential concept in designing a data
warehouse. Transactional systems often choose a "natural" primary key that might be subject to minimal change over a
non-meaningful, non-user-entered column.
A unique constraint designates that the data in the column or collection of columns is unique per row. If the unique
constraint is on a column, the data in a specific row in the specified column must be unique from the data in a
different row in the same column.
When a unique constraint is defined for a group of columns, the uniqueness is based on the collective whole of the data
in the columns that make up that unique constraint. The data in a specific row in a specific column does not have to be
unique from the data in a different row in the same column. The database designer uses the unique constraint to ensure
uniqueness of business data.
Purpose
|
To ensure the integrity of the database.
|
Data integrity rules, also known as constraints, ensure that data values lie within defined ranges. Where these ranges
can be identified, the database can enforce them. (This is not to say that data validation should not be done in the
application, but only that the database can serve as a "validator of last resort" in the event that the application
does not work correctly.) Where data validation rules exist, the database constraints must be designed to enforce them.
A foreign key is one or more columns in a table that map to the primary key in another table. One table might have many
foreign keys, and each foreign key is a map to a different table. This mapping, or relationship, between the tables is
often referred to as a parent-child relationship. The child table contains the foreign key, which maps to the
primary key in the parent table.
The definition of foreign key constraints is also often used by the query optimizer to accelerate query
performance. In many cases, the foreign key enforcement rules use reference tables.
Purpose
|
To optimize the database data structures for performance.
|
In the case of a relational Data Model, the initial mapping generally yields a simple class-to-table mapping. If
objects from different classes need to be retrieved at the same time, the RDBMS uses an operation called a "table join"
to retrieve the rows related to the objects of interest. For frequently accessed data, join operations can be
computationally expensive. To eliminate the cost of the join, a standard relational technique called "de-normalization"
is often employed.
De-normalization combines columns from two or more different tables into the same table, effectively pre-joining the
information. De-normalization reflects a tradeoff between more-expensive update operations in favor of less-expensive
retrieval operations. This technique also reduces the performance of the system in queries that are interested only in
the attributes of one of the objects that are effectively joined in the de-normalized table, since all attributes are
normally retrieved on every query. For cases in which the application normally wants all attributes, there can be a
significant performance improvement.
De-normalizing more than two tables is rare and increases the cost of inserts and updates as well as the cost of
non-join queries. Limiting de-normalization to two tables is a good policy unless strong and convincing evidence can be
produced regarding the benefits.
De-normalization can be inferred from the design
classes in cases in which classes are nested. Nested classes can be mapped to a de-normalized table.
Some object databases allow a concept similar to de-normalization, in which related objects are clustered together on
disk and retrieved in single operations. The concept in use is similar: Reduce object retrieval time by reducing the
work the system must do in order to retrieve related objects from the database.
In some cases, optimizing the Data Model can unmask problems in the Design Model, including performance bottlenecks, poor modeling, or incomplete designs. In this event, discuss the problems
with the Designer of the class, triggering change requests where appropriate.
Purpose
|
To provide for efficient data access using indexing.
To provide for efficient data access using database views.
|
Once the table structure has been designed, you must determine the types of queries that will be performed against the
data. Indexing is used by the database to speed access. Indexing is most effective when the data values in the column
being indexed are relatively distinct.
Consider the following indexing principles:
-
The primary key column of the table must always be indexed. Primary key columns are used frequently as search keys
and for join operations.
-
Tables smaller than 100 rows in size with only a few columns benefit little from indexing. Small tables generally
fit easily in the database cache.
-
Indexes should also be defined for frequently executed queries or for queries that must retrieve data quickly
(generally, any searches done while a person might be waiting). An index should be defined for each set of
attributes that are used together as search criteria. For example, if the system needs the ability to find all
Orders on which a particular product is ordered, an index on the Line Item table on the product number column would
be necessary.
-
Indexes should generally be defined only on columns used as identifiers, not on numeric values, such as account
balances or textual information such as order comments. Identifier column values tend to be assigned when the
object is created and then remain unchanged for the life of the object.
-
Indexes on simple numbers (integer and number data types) are much simpler and faster than indexes on strings.
Given the large data volumes processed on a query or a large join, small savings add up quickly. Indexes on
numeric columns tend to take significantly less space than indexes on characters.
On the down side, the use of indexes is not free; the more indexes on a table, the longer inserts and updates take to
process. When contemplating the use of indexes, bear in mind the following precautions:
-
Do not index just to speed up an infrequently executed query, unless that query occurs at a critical point, making
maximum speed essential.
-
In some systems, update and insertion performance is more important than query performance. A common example is in
factory data acquisition applications in which quality data is captured in real time. In these systems, only
occasional online queries are executed, and most of the data is analyzed periodically by batch reporting
applications that perform statistical analysis on it. For data-acquisition systems, remove all indexes to achieve
maximum throughput. If indexes are needed, they can be rebuilt just before the batch reporting and analysis
applications run, then dropped when the reporting and analysis is complete.
-
Always remember that indexes have hidden costs. For example, they take time to update (a tax paid on every insert,
update, or delete) and occupy disk space. Be sure you get value from using them.
Many databases offer a choice of index types. The most common include:
-
B-tree indexes-The most frequently used kind are based on balanced b-tree index data structures. They are
useful when the index key values are randomly distributed and tend to have wide variability. They tend to perform
poorly, however, when data being indexed is already in sequential order.
-
Hashed indexes-Less frequently, index key values are hashed. Hashing offers better performance when the
range of index key values is known, relatively unchanging, and unique. This technique relies upon the use of the
key value to calculate the address of the data of interest. Because of the need for predictability, hash indexes
tend to be useful only for medium-sized lookup tables that change very infrequently.
Your choice of indexing strategy and timing of index creation can have a large impact on performance. Bulk data loads
should be performed without indexes (this can be achieved by dropping the index, loading the data, and then re-creating
the index). The reason for this is that the index structure is re-balanced as each row is added. Since subsequent rows
will change the optimal index structure, the work done re-balancing the index as each row is inserted is largely
wasted. It is faster and more efficient to load data without indexes, then re-create the index when the data load is
done. Some databases provide bulk data-loaders to do this automatically.
Another strategy for optimizing database access performance is the use of views. Database views are virtual tables that
have no independent storage of their own. To the calling program (or user), however, a view behaves like a table. A
view supports retrieval of data, and it can be used to update data as well-depending on the database structure and
database vendor. The view contains data from one or more tables that can be accessed through a single select statement.
The performance gain occurs during the selection of data, especially in frequently queried tables. The data is
retrieved from a single location-the view-instead of by searching the multiple or large tables that exist in the
database.
Views also play a significant role in database security. A view containing parts of a table can restrict access to
sensitive data contained in the base table.
Purpose
|
To design the space allocation and disk page organization of the database.
|
A database designer uses tablespaces to represent the amount of storage space that is allocated to tables, indexes,
stored procedures, and so forth. One or more tablespaces are mapped to a database. The database designer must analyze
the tables in the Data Model to determine how to distribute them, along with other support database elements, across
the storage space in the database.
In determining the tablespace structures for the database, bear in mind that databases do not perform I/O on rows,
records, or even whole tables. Instead they perform I/O on disk blocks. The reason for this is simple: Block I/O
operations are usually optimized in the software and hardware on the system. As a result, the physical organization of
the tables and indexes in the database can have a dramatic impact on the performance of the system.
When planning the space allocation and disk page organization of the database, consider the following factors:
-
the density of information in the disk pages
-
the location of disk pages on disk or across multiple disks
-
the amount of disk space to allocate to the table
These factors are discussed in the sections that follow.
Disk Page Density
The density of disk pages depends on the extent to which data is expected to change over time. Basically, a
less-dense page is more capable of accepting changes in values or the addition of data over time, while a fuller
data page provides better read performance, since more data is retrieved per block read.
To simplify disk management, the database designer can group tables by the extent to which they tend to change. The
following three groups constitute a good beginning for this type of organization:
-
highly dynamic tables
-
somewhat dynamic tables
-
mostly static tables
The highly dynamic tables should be mapped onto disk pages that have a great deal of empty space in them (perhaps
30%); the somewhat dynamic tables should be mapped onto disk pages that have less empty space (perhaps 15%); and
the mostly static should be mapped onto disk pages that have very little empty space (perhaps 5%). The indexes for
the tables must be similarly mapped.
Disk Page Location
After the groups of tables are mapped, the database designer must determine where to put the disk pages. The goal
here is to try to balance the workload across a number of different drives and heads to reduce or eliminate
bottlenecks. Consider the following guidelines:
-
Never put data on the same disk as the operating system, its temporary files, or the swap devices. These drives
are busy enough without the addition of further workload to them.
-
Put data that is accessed simultaneously on different drives in order to balance the workload. Some systems
support parallel I/O channels. If this is the case, put the data on different channels.
-
Put the indexes on a different drive from the data that it indexes in order to spread out the workload.
-
Refer to the database vendor's documentation for guidelines.
-
The type of storage used (for example, RAID-5, RAID-10, SAN, NAS, and channel attached) affects database
performance. Make use of the performance guidelines provided by the storage provider.
Database I/O is generally the limiting factor in database performance. I/O balancing is an iterative, experimental
process. By prototyping database access performance during the elaboration phase, coupled with appropriate
instrumentation to monitor physical and logical I/O, you can uncover performance problems early while there is
still time to adjust the database design.
Disk Space Allocation
Using the characteristics of the persistence design mechanism, estimate the number of objects that must be stored.
The amount of disk space required to store the objects varies from RDBMS to RDBMS. When calculating disk
space, make sure to account for growth due to additions of data. To estimate the disk space for a database,
first estimate the disk space required for each table, and then calculate the space requirements for all
tables. Consult the database administrator manual for the specific RDBMS product to determine the precise
size estimation formula. Here are some general steps for estimating the space requirements for a table:
-
Calculate average row size. This calculation should include any control information at the record level,
as well as any control information required for variable-length columns.
-
Calculate the number of rows that will fit into a page or block of I/O. Because most databases store only
complete records on a page or I/O block, this should be the integer number of rows that will fit into a page or
block of I/O.
-
Calculate the number of pages or I/O blocks required to store the estimated number of records in the
database. The estimated number of records must include any load factors.
-
Multiply the number of pages or I/O blocks required by the size of the page or I/O block.
-
Add any overhead for additional indexes.
-
Add any fixed overhead for the table.
Once the table space requirements have been defined:
-
Compute the sum of the space required by the tables.
-
Add in any required fixed amount of space for database management.
-
Add in disk space required for the transaction log and audit trail.
In a frequently updated environment, the retention requirements for the audit trail require significant amounts of
storage. The documentation for major commercial database management systems usually provides detailed sizing
instructions. Be sure to refer to these instructions when calculating your estimates of the database disk space
requirements.
Purpose
|
To determine if the stored procedures or triggers should be used to implement data access class operations.
|
Most databases support a stored procedure capability. A stored procedure is executable code that runs within the
process space of the database management system. It provides the ability to perform database-related actions on the
server without having to transfer data across a network. The judicious use of stored procedures can the improve
performance of the system.
Stored procedures are usually one of these two types: actual procedures or triggers. Procedures are executed explicitly
by an application, generally have parameters, and provide an explicit return value. Triggers, on the other hand, are
invoked implicitly when some database event occurs (for example, insert a row, update a row, or delete a row), have no
parameters other than the row being modified (since they are invoked implicitly), and do not provide an explicit return
value.
In database systems that lack constraints, triggers are often used to enforce referential and data integrity.
Otherwise, they tend to be used when an event needs to trigger (or cause) another event. Triggers are also frequently
used for security purposes by auditing the trigger event.
The design classes in the Design Model must be examined to see if they have operations that should be implemented using
the stored procedure or trigger facility. Candidates include:
-
any operations that primarily deal with persistent data (creating, updating, retrieving, or deleting it).
-
any operations in which a query is involved in a computation (such as calculating the average quantity and value of
a product in inventory).
-
operations that must access the database in order to validate data.
Remember that improving database performance usually means reducing I/O. Therefore, if performing a computation on the
DBMS server will reduce the amount of data passed over the network, the computation should probably be performed on the
server.
Work with the designer of the design class to discuss how the database can be used to improve performance. The designer
will update the operation method to indicate whether one or more stored procedures can be used to implement the
operation.
|