Version 1.1
By Craig Utley
Introduction
Creating a star schema database is one of the most important steps in creating a data warehouse, and sometimes the final one. Given how important this process is to building a data warehouse, it is important to understand how to move from a standard, online transaction processing (OLTP) system to a final star schema. Note that "relational data warehouse" is a more general term that may cover both star and snowflake schemas.
This paper attempts to address some of the issues that many who are new to the data warehousing arena find confusing, such as:
- What is a Data Warehouse? What is a Data Mart?
- What is a Star Schema Database?
- Why do I want/need a Star Schema Database?
- The Star Schema looks very denormalized. Won’t I get in trouble for that?
- What do all these terms mean?
This
paper will attempt to answer these questions, and show developers how
to build a star schema database to support decision support within their
organizations.
Terminology
Usually,
readers of technical articles are bored with terminology that comes
either at the end of a chapter or is buried in an appendix at the back
of a book. Here, however, I have the thrill of presenting some terms up
front. The intent is not to bore readers earlier than usual, but to
present a baseline off of which to operate. The problem in data
warehousing is that the terms are often used loosely by different
parties. The definitions presented here represent how the terms will be
used throughout this paper.
OLTP
OLTP
stands for Online Transaction Processing. This is a standard,
normalized database structure. OLTP is designed for transactions, which
means that inserts, updates, and deletes must be fast. Imagine a call
center that takes orders. Call takers are continually taking calls and
entering orders that may contain numerous items. Each order and each
item must be inserted into a database. Since the performance of the
database is critical, database designers want to maximize the speed of
inserts (and updates and deletes). To maximize performance, some
businesses even limit the number of records in the database by
frequently archiving data.
OLAP and Star Schema
OLAP stands for Online Analytical Processing. OLAP is a term that means many things to many people. Here, the terms OLAP and star schema are used basically interchangeably: the assumption is that a star schema database is an OLAP system. An OLAP system consists of a relational database designed for speed of retrieval rather than transactions, and it holds read-only, historical, and possibly aggregated data.
While
an OLAP/Star Schema may be the actual data warehouse, most companies
build cube structures from the relational data warehouse in order to
provide faster, more powerful analysis on the data.
Data Warehouse and Data Mart
Data warehouses and data marts differ in scope only. They are built using the same methods and procedures; the process is identical, and only their intended scope varies.
A data warehouse (or mart) is a way of storing data for later retrieval.
This retrieval is almost always used to support decision-making in the
organization. That is why many data warehouses are considered to be DSS
(Decision-Support Systems). While some data warehouses are merely
archival copies of data, most are used to support some type of
decision-making process. The primary benefit of taking the time to
create a star schema, and then possibly cube structures, is to speed the
retrieval of data and format that data in a way that it is easy to
understand. This means that a star schema is built not for transactions
but for queries.
Both a data warehouse and a data mart are storage mechanisms for read-only, consolidated, historical data.
Read-only means that the person looking at the data won't be changing
it. If a user wants to look at the sales yesterday for a certain
product, they should not have the ability to change that number. Of
course if the number is wrong, it should be corrected, but more on that
later.
"Consolidated" means
that the data may have come from various sources. Many companies have
purchased different vertical applications from various vendors to handle
such tasks as human resources (HR), accounting/finance, inventory, and
so forth. These systems may run on multiple operating systems and use
different database engines. Each of these applications may store their
own copy of an employee table, product table, and so on. A relational
data warehouse must take data from all these systems and consolidate it
so it is consistent, which means it is in a single format.
The
"historical" part means the data may be only a few minutes old, but
often it is at least a day old. A data warehouse usually holds data that
goes back a certain period in time, such as five years. In contrast,
standard OLTP systems usually only hold data as long as it is "current"
or active. An order table, for example, may move order data to an
archive table once the order has been completed, shipped, and received
by the customer.
The data in data warehouses and data marts may also be aggregated. While there are many different levels of aggregation possible in a typical data warehouse, a star schema may have a "base" level of aggregation, which is one in which all the data is aggregated to a certain point in time.
For example:
assume a company sells only two products: dog food and cat food. Each
day, the company records the sales of each product. At the end of a
couple of days, the data looks like this:
| Date    | Order Number | Dog Food (Quantity Sold) | Cat Food (Quantity Sold) |
|---------|--------------|--------------------------|--------------------------|
| 4/24/99 | 1            | 5                        | 2                        |
| 4/24/99 | 2            | 3                        | 0                        |
| 4/24/99 | 3            | 2                        | 6                        |
| 4/24/99 | 4            | 2                        | 2                        |
| 4/24/99 | 5            | 3                        | 3                        |
| 4/25/99 | 1            | 3                        | 7                        |
| 4/25/99 | 2            | 2                        | 1                        |
| 4/25/99 | 3            | 4                        | 0                        |
Table 1
Clearly,
each day contains several transactions. This is the data as stored in a
standard OLTP system. However, the data warehouse might not record this
level of detail. Instead, it could summarize, or aggregate, the data to
daily totals. The records in the data warehouse might look something
like this:
| Date    | Dog Food (Quantity Sold) | Cat Food (Quantity Sold) |
|---------|--------------------------|--------------------------|
| 4/24/99 | 15                       | 13                       |
| 4/25/99 | 9                        | 8                        |
Table 2
This
summarization of data reduces the number of records by aggregating the
individual transaction records into daily records that show the number
of each product purchased each day.
In
this simple example, it is easy to derive Table 2 simply by running a
query against Table 1. However, many complexities enter the picture that
will be discussed later.
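As an illustration, if the transactions in Table 1 were stored in a table called DailySales (a hypothetical name, with SaleDate, OrderNumber, DogFood, and CatFood columns), the daily totals in Table 2 could be produced with a simple aggregate query:
-- Hypothetical table and column names; sums the per-order quantities into daily totals
SELECT SaleDate, Sum(DogFood) AS DogFood, Sum(CatFood) AS CatFood
FROM DailySales
GROUP BY SaleDate
ORDER BY SaleDate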
Aggregations
There
is no magic to the term "aggregations." It simply means a summarized,
typically additive value. The level of aggregation in a star schema
depends on the scenario. Many star schemas are aggregated to some base
level, called the grain, although this is becoming somewhat less common
as developers rely on cube building engines to summarize to a base level
of granularity.
OLTP Systems
OLTP,
or Online Transaction Processing, systems are standard, normalized
databases. OLTP systems are optimized for inserts, updates, and deletes;
in other words, transactions. Transactions in this context can be thought of as the entry, update, or deletion of a record or set of records.
OLTP
systems achieve greater speed of transactions through a couple of
means: they minimize repeated data, and they limit the number of
indexes. The minimization of repeated data is one of the primary drivers
behind normalization.
When
examining an order, systems typically break orders down into an order
header and then a series of detail records. The header contains
information such as an order number, a bill-to address, a ship-to
address, a PO number, and other fields.
An order detail record is usually a product number, a product
description, the quantity ordered, the unit price, the total price, and
other fields. Here is what an order might look like:
Figure 1
The data stored for this order looks very different. If stored in a flat structure, the detail records look something like this:
(One flattened detail record, shown here with one field per row for readability.)

| Field               | Value            |
|---------------------|------------------|
| Order Number        | 12345            |
| Order Date          | 4/24/99          |
| Customer ID         | 451              |
| Customer Name       | ACME Products    |
| Customer Address    | 123 Main Street  |
| Customer City       | Louisville       |
| Customer State      | KY               |
| Customer Zip        | 40202            |
| Contact Name        | Jane Doe         |
| Contact Number      | 502-555-1212     |
| Product ID          | A13J2            |
| Product Name        | Widget           |
| Product Description | ¼" Brass Widget  |
| Category            | Brass Goods      |
| SubCategory         | Widgets          |
| Product Price       | $1.00            |
| Quantity Ordered    | 200              |
| Etc…                | Etc…             |
Table 3
Notice, however, that for each detail record, much of the data is repeated: the entire customer address, the contact information, the product information, and so forth. All of this information is needed for each detail record, but the system should not have to store all the customer and product information for every record. Relational technology allows each detail record to tie to a header record without having to repeat the header information in each detail record. The new detail records might look like this:
| Order Number | Product Number | Quantity Ordered |
|--------------|----------------|------------------|
| 12473        | A4R12J         | 200              |
Table 4
A simplified logical view of the tables might look something like this:
Figure 2
Notice
that the extended cost is not stored in the OrderDetail table. OLTP
schemas store as little data as possible to speed inserts, updates, and
deletes. Therefore, any number that can be calculated at query time is
calculated and not stored.
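For example, extended cost can be derived at query time from the stored quantity and unit price. The table and column names below are assumptions for illustration rather than the exact schema in Figure 2:
-- Extended cost is computed when queried rather than stored (names are illustrative)
SELECT od.OrderNumber, od.ProductNumber, od.QuantityOrdered,
       p.ProductPrice, od.QuantityOrdered * p.ProductPrice AS ExtendedCost
FROM OrderDetail od INNER JOIN Product p
  ON od.ProductNumber = p.ProductNumber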
Developers
also minimize the number of indexes in an OLTP system. Indexes are
important but they slow down inserts, updates, and deletes. Therefore,
most schemas have just enough indexes to support lookups and other
necessary queries. Over-indexing can significantly decrease performance.
Normalization
Database
normalization is the process of removing repeated information. As shown
above, normalization reduces repeated information of the order header
record in each order detail record. Normalization is a process unto
itself, and what follows is merely a brief overview.
Normalization first removes repeating groups of fields from a table. For example, the following Order table contains much repeated information and is not recommended:
Figure 3
In this example, there will be some limit on the number of order detail records in the Order table. If there were twenty repeated sets of fields for detail records, the table would be unable to handle an order for twenty-one or more products. In addition, if an order has just one product, all the other detail fields go unused.
So, the first step in the normalization process is to break the repeated fields into a separate table, and end up with this:
Figure 4
Now, an order can have any number of detail records.
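A minimal sketch of this normalized structure, with abbreviated field lists and assumed names, might look like the following:
-- Sketch only: field lists are abbreviated and names are assumed
CREATE TABLE OrderHeader (
    OrderNumber INT PRIMARY KEY,
    OrderDate   DATE,
    CustomerID  INT
);
CREATE TABLE OrderDetail (
    OrderNumber     INT REFERENCES OrderHeader (OrderNumber),
    ProductNumber   VARCHAR(20),
    QuantityOrdered INT,
    PRIMARY KEY (OrderNumber, ProductNumber)
);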
OLTP Advantages
As
stated before, OLTP allows for the minimization of data entry. For each
detail record, only the primary key value from the OrderHeader table is
stored, along with the primary key of the Product table, and then the
order quantity is added. This greatly reduces the amount of data entry
necessary to add a product to an order.
Not
only does this approach reduce the data entry required, it greatly
reduces the size of an OrderDetail record. Compare the size of the
record in Table 3 to that in Table 4. The OrderDetail records take up
much less space with a normalized table structure. This means that the
table is smaller, which helps speed inserts, updates, and deletes.
In
addition to keeping the table smaller, most of the fields that link to
other tables are numeric. Queries generally perform much better against
numeric fields than they do against text fields. Therefore, replacing a
series of text fields with a numeric field can help speed queries.
Numeric fields also index faster and more efficiently.
With
normalization, there are frequently fewer indexes per table. Each
transaction requires the maintenance of affected indexes. With fewer
indexes to maintain, inserts, updates, and deletes run faster.
OLTP Disadvantages
There
are some disadvantages to an OLTP structure, especially when retrieving
the data for analysis. First, queries must utilize joins across
multiple tables to get all the data. Joins tend to be slower than
reading from a single table, so minimizing the number of tables in a
query will boost performance. With a normalized structure, developers
have no choice but to query from multiple tables to get the detail
necessary for a report.
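For instance, even a simple sales-by-category report against the normalized tables already requires several joins (the table and column names here are assumed for illustration):
-- Illustrative report query against a normalized OLTP schema (assumed names)
SELECT p.Category, Sum(od.QuantityOrdered * p.ProductPrice) AS TotalSales
FROM OrderHeader oh
  INNER JOIN OrderDetail od ON oh.OrderNumber = od.OrderNumber
  INNER JOIN Product p ON od.ProductNumber = p.ProductNumber
WHERE oh.OrderDate BETWEEN '1999-04-01' AND '1999-04-30'
GROUP BY p.Category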
One
of the advantages of OLTP is also a disadvantage: fewer indexes per
table. Fewer indexes per table are great for speeding up inserts,
updates, and deletes. In general terms, the fewer indexes in the
database, the faster inserts, updates, and deletes will be. However,
again in general terms, the fewer indexes in the database, the slower
select queries will run. For the purposes of data retrieval, a larger number of well-chosen indexes helps speed retrieval. Since one design goal for speeding transactions is minimizing the number of indexes, OLTP databases trade slower data retrieval for faster transactions. This is one reason for creating two separate database structures: an OLTP system for transactions, and an OLAP system for data retrieval.
Last
but not least, the data in an OLTP system is not user friendly. Most IT
professionals would rather not have to create custom reports all day
long. Instead, they would prefer to give their customers some query
tools so customers could create reports without involving the IT
organization. Most customers, however, don't know how to make sense of
the normalized structure of the database. Joins are somewhat mysterious,
and complex table structures (such as associative tables on a
bill-of-material system) are difficult for the average customer to use.
The structures seem obvious to IT professionals, who sometimes wonder
why customers can’t get the hang of it. Remember, however, that
customers know how to do a FIFO-to-LIFO revaluation and other such tasks
that IT people may not know how to do; therefore, understanding
relational concepts just isn't something customers should have to worry
about.
If customers
want to spend the majority of their time performing analysis by looking
at the data, the IT group should support their desire for fast, easy
queries. On the other hand, maintaining the speed requirements of the
transaction-processing activities is critical. If these two requirements
seem to be in conflict, they are, at least partially. Many companies
have solved this by having a second copy of the data in a structure
reserved for analysis. This copy is more heavily indexed, and it allows
customers to perform large queries against the data without impacting
the inserts, updates, and deletes on the main data. This copy of the
data is often not just more heavily indexed, but also denormalized to
make it easier for customers to understand.
Reasons to Denormalize
When
database administrators are asked why they would ever denormalize, the
first (and often only) answer is: speed. Recall one of the key
disadvantages to the OLTP structure: It is built for data inserts,
updates, and deletes, but not data retrieval. Therefore, one method of
squeezing some speed out of it is by denormalizing some of the tables
and having queries pull data from fewer tables. These queries are faster
because they perform fewer joins to retrieve the same recordset.
Joins
are relatively slow, as has already been mentioned. Joins are also
confusing to many end users. By denormalizing, users are presented with a view of the data that is far easier for them to understand. Which of the following two views would a typical end user find easier to work with?
Figure 5
Figure 6
The
second view is much easier for the end user to understand. While a
normalized schema requires joins to create this view, putting all the
data in a single table allows the user to perform this query without using
joins. Creating a view that looks like this, however, still uses joins
in the background and therefore does not achieve the best performance on
the query. Fortunately, there is a better way.
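To make the point concrete, a view of this sort might be defined as follows (the view, table, and column names are assumed for illustration); each query against it still executes the joins behind the scenes, which is why a different structure is needed:
-- The flat presentation still performs joins behind the scenes (assumed names)
CREATE VIEW OrderFlatView AS
SELECT oh.OrderNumber, oh.OrderDate, c.CustomerName,
       p.ProductName, p.Category, od.QuantityOrdered,
       od.QuantityOrdered * p.ProductPrice AS ExtendedCost
FROM OrderHeader oh
  INNER JOIN Customer c ON oh.CustomerID = c.CustomerID
  INNER JOIN OrderDetail od ON oh.OrderNumber = od.OrderNumber
  INNER JOIN Product p ON od.ProductNumber = p.ProductNumber;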
How Humans View Information
All
of this leads to the real question: how do humans view the data stored
in the database? This is not the question of how humans view it with
queries, but how do they logically view it? For example, are these intelligent questions to ask:
- How many bottles of Aniseed Syrup were sold last week?
- Are overall sales of Condiments up or down this year compared to previous years?
- On a quarterly and then monthly basis, are Dairy Product sales cyclical?
- In
what regions are sales down this year compared to the same period last
year? What products in those regions account for the greatest percentage
of the decrease?
All
of these questions would be considered reasonable, perhaps even common.
They all have a few things in common. First, there is a time element to
each one. Second, they all are looking for aggregated data; they are
asking for sums or counts, not individual transactions. Finally, they
are looking at data in terms of "by" conditions.
"By"
conditions refer to looking at data by certain conditions. For example,
take the question: "On a quarterly and then monthly basis, are Dairy
Product sales cyclical?" This can be rephrased with the following
statement: "We want to see total sales by category (just Dairy Products in this case), by quarter or by month."
Here
the customer is looking at an aggregated value, the sum of sales, by
specific criteria. Customers can add further "by" conditions by saying
they wanted to see those sales by brand and then the individual
products.
Figuring
out the aggregated values to be shown, such as the sum of sales dollars
or the count of users buying a product, and then figuring out the "by"
conditions is what drives the design of the star schema.
Making the Database Match Expectations
If
the goal is to view the data as aggregated numbers broken down along a
series of "by" criteria, why isn't data simply stored in this format?
That's
exactly what is done with the star schema. It is important to realize
that OLTP is not meant to be the basis of a decision support system. The
"T" in OLTP stands for transactions, and a transaction is all about
taking orders and depleting inventory, and not about performing complex
analysis to spot trends. Therefore, rather than tie up an OLTP system by
performing huge, expensive queries, the answer is to build a database
structure that maps to the way humans see the world.
Humans
see the world in a multidimensional way. Most people think of cube
structures when speaking of multiple dimensions, but cubes are typically
built from relational data that has already been put into a dimensional
model. The dimensional model is a database structure to support
queries, and cubes can then be built on it later.
Facts and Dimensions
When examining how people look at data, they usually want to see some sort of aggregated data. These data are called measures.
These measures are numeric values that are measurable and usually
additive. For example, sales dollars are a perfect measure. Every order
that comes in generates a certain sales volume measured in some
currency. If a company sells twenty products in one day, each for five
dollars, they generate 100 dollars in total sales. Therefore, sales
dollars is one measure most companies track. Companies may also want to
know how many customers they had that day. Did five customers buy an
average of four products each, or did just one customer buy twenty
products? Sales dollars and customer counts are two measures businesses
may want to track.
Just
tracking measures isn't enough, however. People need to look at
measures using those "by" conditions. The "by" conditions are called
dimensions. In order to examine sales dollars, people almost always want
to see them by day, or by quarter, or by year. There is almost always a
time dimension on anything people ask for. They may also want to know
sales by category or by product. These "by" conditions will map into
dimensions: there is almost always a time dimension, and product and
geography dimensions are very common as well.
Therefore,
in designing a star schema, the first order of business is usually to
determine what people want to see (the measures) and how they want to
see it (the dimensions).
Mapping Dimensions into Tables
Dimension tables answer the "by" portion of a question: how do people want to
slice the data? For example, people almost always want to view data by
time. Users often don't care what the grand total for all data happens
to be. If the data happens to start on June 14, 1989, do users really
care how much total sales have been since that date, or do they really
care how one year compares to other years? Comparing one year to a
previous year is a form of trend analysis and one of the most common
things done with data in a star schema.
Relational
data warehouses may also have a location or geography dimension. This
allows users to compare the sales in one region to those in another.
They may see that sales are weaker in one region than any other region.
This may indicate the presence of a new competitor in that area, or a
lack of advertising, or some other factor that bears investigation.
When
designing dimension tables, there are a few rules to keep in mind.
First, all dimension tables should have a single-field primary key. This
key is typically a surrogate key and is often just an identity column,
consisting of an automatically incrementing number. The value of the
primary key is meaningless, hence the surrogate key; the real
information is stored in the other fields. These other fields, called
attributes, contain the full descriptions of the dimension record. For
example, if there is a Product dimension (which is common) there are
fields in it that contain the description, the category name, the
sub-category name, the weight, and so forth. These fields do not
contain codes that link to other tables. Because the fields contain
full descriptions, the dimension tables are often fat; they contain many
large fields.
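As a sketch (column names and sizes are assumed, and the IDENTITY syntax is SQL Server style), a Product dimension built along these lines might look like this:
-- Sketch of a dimension table: surrogate key plus full descriptive attributes
CREATE TABLE ProductDimension (
    ProductID          INT IDENTITY PRIMARY KEY,  -- surrogate key; the value itself is meaningless
    ProductNumber      VARCHAR(20),               -- key carried from the source system
    ProductName        VARCHAR(100),
    ProductDescription VARCHAR(255),
    Category           VARCHAR(50),
    SubCategory        VARCHAR(50),
    Weight             DECIMAL(10,2)
);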
Dimension
tables are often short, however. A company may have many products, but
even so, the dimension table cannot compare in size to a normal fact
table. For example, even if a company has 30,000 products in the product
table, the company may track sales for these products each day for
several years. Assuming the company actually only sells 3,000 products
on any given day, if they track these sales each day for ten years, they end up with this equation: 3,000 products sold × 365 days/year × 10 years = 10,950,000, or almost 11,000,000 records! Therefore, in relative terms, a
dimension table with 30,000 records will be short compared to the fact
table.
Given that a
dimension table is fat, it may be tempting to normalize the dimension
table. Normalizing the dimension tables is called a snowflake schema and
will be discussed later in this paper.
Dimensional Hierarchies
Developers
have been building hierarchical structures in OLTP systems for years.
However, hierarchical structures in an OLAP system are different because
the hierarchy for the dimension is actually stored in a single
dimension table (unless snowflaked, as discussed later).
The
product dimension, for example, contains individual products. Products
are normally grouped into categories, and these categories may well
contain sub-categories. For instance, a product with a product number of
X12JC may actually be a refrigerator. Therefore, it falls into the
category of major appliance, and the sub-category of refrigerator. There
may be more levels of sub-categories, which would further classify
this product. The key here is that all of this information is stored in
fields in the dimension table.
The product dimension table might look something like this:
Figure 7
Notice
that both Category and Subcategory are stored in the table and not
linked in through joined tables that store the hierarchy information.
This hierarchy allows users to perform "drill-down" functions on the
data. They can execute a query that performs sums by category and then
drill-down into that category by calculating sums for the subcategories
within that category. Users can then calculate the sums for the
individual products in a particular subcategory.
The
actual sums being calculated are based on numbers stored in the fact
table. These will be examined when discussing the fact table later.
Consolidated Dimensional Hierarchies (Star Schemas)
The
above example (Figure 7) shows a hierarchy in a dimension table. This
is how the dimension tables are built in a star schema; the hierarchies
are contained in the individual dimension tables. No additional tables
are needed to hold hierarchical information.
Storing
the hierarchy in a dimension table allows for the easiest browsing of
the dimensional data. In the above example, users could easily choose a
category and then list all of that category’s subcategories. They would
drill-down into the data by choosing an individual subcategory from
within the same table. There is no need to join to an external table for
any of the hierarchical information.
In
this overly-simplified example, there are two dimension tables joined
to the fact table. The fact table will be examined later. For now, examples
will use only one measure: SalesDollars.
Figure 8
In order to see the total sales for a particular month for a particular category, a SQL query would look something like this:
SELECT Sum(SalesFact.SalesDollars) AS SumOfSalesDollars
FROM TimeDimension INNER JOIN (ProductDimension INNER JOIN
SalesFact ON ProductDimension.ProductID = SalesFact.ProductID)
ON TimeDimension.TimeID = SalesFact.TimeID
WHERE ProductDimension.Category='Brass Goods' AND TimeDimension.Month=3
AND TimeDimension.Year=1999
To drill down to a subcategory, the SQL would change to look like this:
SELECT Sum(SalesFact.SalesDollars) AS SumOfSalesDollars
FROM TimeDimension INNER JOIN (ProductDimension INNER JOIN
SalesFact ON ProductDimension.ProductID = SalesFact.ProductID)
ON TimeDimension.TimeID = SalesFact.TimeID
WHERE ProductDimension.SubCategory='Widgets' AND TimeDimension.Month=3
AND TimeDimension.Year=1999
Snowflake Schemas
Sometimes,
the dimension tables have the hierarchies broken out into separate
tables. This is a more normalized structure, but leads to more difficult
queries and slower response times.
Figure
9 represents the beginning of the snowflake process. The category
hierarchy is being broken out of the ProductDimension table. This
structure increases the number of joins and can slow queries. Since the
purpose of an OLAP system is to speed queries, snowflaking is usually
not productive. Some people try to normalize the dimension tables to
save space. However, in the overall scheme of the data warehouse, the
dimension tables usually only account for about 1% of the total storage.
Therefore, any space savings from normalizing, or snowflaking, are
negligible.
Figure 9
Building the Fact Table
The
Fact Table holds the measures, or facts. The measures are numeric and
additive across some or all of the dimensions. For example, sales are
numeric and users can look at total sales for a product, or category, or
subcategory, and by any time period. The sales figures are valid no
matter how the data is sliced.
While
the dimension tables are short and fat, the fact tables are generally
long and skinny. They are long because they can hold the number of
records represented by the product of the counts in all the dimension
tables.
For example, take the following simplified star schema:
Figure 10
In
this schema, there are product, time and store dimensions. With ten
years of daily data, 200 stores, and 500 products, there is a potential
of 365,000,000 records (3650 days * 200 stores * 500 products). This
large number of records makes the fact table long. Adding another
dimension, such as a dimension of 10,000 customers, can increase the
number of records by up to 10,000 times.
The
fact table is skinny because of the fields it holds. The primary key is
made up of foreign keys that have migrated from the dimension tables.
These fields are typically integer values. In addition, the measures are
also numeric. Therefore, the size of each record is generally much
narrower than those in the dimension tables. However, there are many,
many more records in the fact table.
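A sketch of the fact table for the schema in Figure 10 might look like the following (names assumed, following the queries shown earlier): the primary key is made up of the migrated dimension keys, and the remaining column is the measure.
-- Sketch of a fact table: migrated dimension keys plus a numeric measure
CREATE TABLE SalesFact (
    TimeID       INT REFERENCES TimeDimension (TimeID),
    ProductID    INT REFERENCES ProductDimension (ProductID),
    StoreID      INT REFERENCES StoreDimension (StoreID),
    SalesDollars DECIMAL(12,2),
    PRIMARY KEY (TimeID, ProductID, StoreID)
);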
Fact Granularity
One
of the most important decisions in building a star schema is the
granularity of the fact table. The granularity, or frequency, of the
data is determined by the lowest level of granularity of each dimension
table, although developers often discuss just the time dimension and say
a table has a daily or monthly grain. For example, a fact table may
store weekly or monthly totals for individual products. The lower the
granularity, the more records that will exist in the fact table. The
granularity also determines how far users can drill down without
returning to the base, transaction-level data.
One
of the major benefits of the star schema is that the low-level
transactions may be summarized to the fact table grain. This greatly
speeds the queries performed as part of the decision support process.
The aggregation or summarization of the fact table is not always done if
cubes are being built, however.
Fact Table Size
The
previous section discussed how 500 products sold in 200 stores and
tracked for 10 years could produce 365,000,000 records in a fact table
with a daily grain. This, however, is the maximum size for the table.
Most of the time, there are far fewer records in the fact table. Star
schemas do not store zero values unless zero has some significance. So,
if a product did not sell at a particular store for a particular day,
the system would not store a zero value. The fact table
contains only the records that have a value. Therefore, the fact table
is often sparsely populated.
Even
though the fact table is sparsely populated, it still holds the vast
majority of the records in the database and is responsible for almost
all of the disk space used. The lower the granularity, the larger the
fact table. In the previous example, moving from a daily to weekly grain
would reduce the potential number of records to only slightly more than
52,000,000 records.
The
data types for the fields in the fact table do help keep it as small as
possible. In most fact tables, all of the fields are numeric, which can
require less storage space than the long descriptions we find in the
dimension tables.
Finally,
be aware that each added dimension can greatly increase the size of the
fact table. If just one dimension was added to the previous example
that included 20 possible values, the potential number of records would
reach 7.3 billion.
Changing Attributes
One
of the greatest challenges in a star schema is the problem of changing
attributes. As an example, examine the simplified star schema in Figure
10. In the StoreDimension table, each store is located in a particular
region, territory, and zone. Some companies realign their sales regions,
territories, and zones occasionally to reflect changing business
conditions. However, if the company simply updates the table to reflect
the changes, and users then try to look at historical sales for a
region, the numbers will not be accurate. By simply updating the region
for a store, the total sales for that region will appear as if the
current structure has always been true. The business has "lost" history.
In
some cases, the loss of history is fine. In fact, the company might
want to see what the sales would have been had this store been in that
other region in prior years. More often, however, businesses do not want
to change the historical data. In this case, the typical approach is to
create a new record for the store. This new record contains the new
region, but leaves the old store record, and therefore the old regional
sales data, intact. This approach, however, prevents companies from
comparing this store's current sales to its historical sales unless the previous StoreID is preserved. In most cases the answer is to keep the
existing StoreName (the primary key from the source system) on both
records but add BeginDate and EndDate fields to indicate when each
record is active. The StoreID is a surrogate key so each record has a
different StoreID but the same StoreName, allowing data to be examined
for the store across time regardless of its reporting structure.
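A sketch of what this looks like in the StoreDimension table (the values and column names are illustrative only):
-- The same store appears twice, with different surrogate keys and regions;
-- BeginDate and EndDate show which record was active when (illustrative values)
-- StoreID  StoreName  Region   BeginDate    EndDate
--   417    Store 12   East     1995-01-01   1998-12-31
--   952    Store 12   Central  1999-01-01   NULL (current)

-- Find the record that was in effect on a given date
SELECT StoreID
FROM StoreDimension
WHERE StoreName = 'Store 12'
  AND BeginDate <= '1999-03-15'
  AND (EndDate IS NULL OR EndDate >= '1999-03-15')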
This
particular problem is usually called a "slowly-changing dimension" and
there are various methods for handling it. The actual implementation is
beyond the scope of this paper.
There are no right and wrong answers. Each case will require a different solution to handle changing attributes.
Aggregations
The
data in the fact table is already aggregated to the fact table's grain.
However, users often ask for aggregated values at higher levels. For
example, they may want to sum sales to a monthly or quarterly number. In
addition, users may be looking for a total at the product or category level.
These
numbers can be calculated on the fly using a standard SQL statement.
This calculation takes time, and therefore some people will want to
decrease the time required to retrieve higher-level aggregations.
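For example, a monthly rollup by category can be computed on the fly from the daily-grain fact table, using the same tables as the earlier queries:
-- On-the-fly rollup from the daily grain to monthly totals by category
SELECT ProductDimension.Category, TimeDimension.Year, TimeDimension.Month,
       Sum(SalesFact.SalesDollars) AS SumOfSalesDollars
FROM TimeDimension INNER JOIN (ProductDimension INNER JOIN
     SalesFact ON ProductDimension.ProductID = SalesFact.ProductID)
     ON TimeDimension.TimeID = SalesFact.TimeID
WHERE TimeDimension.Year=1999
GROUP BY ProductDimension.Category, TimeDimension.Year, TimeDimension.Month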
Some
people store higher-level aggregations in the database by
pre-calculating them and storing them in the fact table. This
requires that the lowest-level records have special values put in them.
For example, a TimeDimension record that actually holds weekly totals
might have a 9 in the DayOfWeek field to indicate that this particular
record holds the total for the week.
A
second approach is to build another fact table but at the weekly grain.
All data is summarized to the weekly level and stored there. This works
well for storing data summarized at various levels, but the problem
comes into play when examining the number of possible tables needed. To
summarize at the weekly, monthly, quarterly, and yearly levels by
product, four tables are needed in addition to the "real", or daily,
fact table. However, what about weekly totals by product subcategory?
And monthly totals by store? And quarterly totals by product category and
territory? Each combination would require its own table.
This
approach has been used in the past, but better alternatives exist.
These alternatives usually consist of building a cube structure to hold
pre-calculated values. Cubes were designed to address the issues of
calculating aggregations at a variety of levels and respond to queries
quickly.
Summary
The
star schema, also called a relational data warehouse or dimensional
model, is a consolidated, consistent, historical, read-only database
storing data from one or more systems. The data often comes from OLTP
systems but may also come from spreadsheets, flat files, and other
sources. The data is formatted in a way that provides fast response to
queries. Star schemas provide fast response by denormalizing dimension
tables and potentially through providing many indexes.
Star
schemas may be the end of the data warehousing process, but often they
are the source for a cube-building product. Different engines work in
different ways, but most store the data in new binary formats for even
quicker retrieval, and calculate aggregations at various levels of granularity. While most modern cube-building engines do not require a
star schema as their source, a star schema is still the best source as
the data has already been consolidated and made consistent before the
cube is built.
Note on version 1.1
I've
been surprised by the popularity of this paper, which I wrote the night
before a talk at my first SQL Server Connections conference back in
1999. I've finally gotten around to making some minor updates, which
include a summary, fixing a few typos, and changing from first person to
third person. I wouldn't mind completely rewriting the paper so perhaps
a version 2.0 will appear at some point. - Craig Utley, 17 July 2008
reference: http://www.ciobriefings.com/Publications/WhitePapers/DesigningtheStarSchemaDatabase/tabid/101/Default.aspx