Developing & Implementing Data Warehouses with SQL Server 7.0
Overview of Data Warehousing
-
a data warehouse is a database containing data
that usually represents the business history of an
organization
-
must store many years of information and be able to
query quickly
-
data is organized to support analysis rather than to
process real-time transactions
-
OLAP
- provides a multidimensional presentation
of data warehouse data
-
data mining
- finding patterns among the
information
-
data mart
- smaller version of a data warehouse
-
when designing a data warehouse, use either:
-
top down approach - design the big warehouse
first, then divide it into smaller marts
-
bottom up approach - design the smaller units
first, then integrate them into one big
warehouse
Characteristics of data
-
data is stored centrally
-
converts to the same format in order to be
consistent
-
only includes subject oriented key information
-
historical
-
read only; fast query and retrieval are key - an
appropriate level of summarization is needed
-
fact table contains only numbers and keys that reference
textual data
-
dimension table contains descriptions of the data in the
fact table to provide a hierarchical structure
Warehousing Processes
-
retrieve data from operational systems such as RDBMS,
tape or external data
-
transform data to appropriate format
-
populate the warehouse
-
user can then query data with tools such as
EIS
Design Considerations
-
character data: use the default character data types plus
Unicode data types to support all languages, or use the same
code page on both client and server unless they use only the
first 128 characters
-
filegroups: use filegroups to spread the database across
multiple drives to improve performance
-
fault tolerance: disk striping with parity, disk
mirroring
Business Analysis Process
-
identify objectives (eg. the need to know which factor
led to a drop in sales)
-
gather and analyze information; find out from workers
what sort of information is needed for the data
warehouse
-
identify only one segment at first to build a data
mart
-
create the conceptual data model on paper first
-
identify data sources that have integrity
-
determine the duration of data kept in the warehouse -
recommended: 2 years (different durations can be used for
different data marts)
Schemas
Star schema
-
most popular
-
has one central fact table
-
each dimension table has a one-to-many relationship to the
fact table
-
dimension tables can be updated, while the fact table is
longer term
-
fact table uses a multipart key and can contain
pre-calculated summarized values
-
a larger fact table can be faster because it contains more
pre-calculated values, but at the expense of higher storage
and maintenance costs
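The star schema described above can be sketched with a toy fact/dimension pair. This is a minimal illustration (all table and column names are hypothetical, not from the source) using SQLite, whose SQL differs in detail from SQL Server 7.0's Transact-SQL:

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category TEXT
);
CREATE TABLE dim_time (
    time_key INTEGER PRIMARY KEY,
    year INTEGER,
    month INTEGER
);
-- Central fact table: only keys and numbers (measures).
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product,
    time_key INTEGER REFERENCES dim_time,
    units_sold INTEGER,
    revenue REAL
);
""")
con.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
con.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                [(1, 1999, 1), (2, 1999, 2)])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 10, 100.0), (2, 1, 5, 250.0), (1, 2, 7, 70.0)])

# Each dimension has a one-to-many relationship to the fact table;
# analysis queries join the fact table to its dimensions and summarize.
rows = con.execute("""
    SELECT p.category, t.year, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_time t ON f.time_key = t.time_key
    GROUP BY p.category, t.year
""").fetchall()
print(rows)  # [('Hardware', 1999, 420.0)]
```

Note how the fact table holds only keys and numeric measures, while the textual descriptions live in the dimension tables.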
Snowflake schema
-
variation of the star schema: more levels of dimension
tables, more normalized, lower row counts, more complex,
more joins
-
once at the lowest level of detail, you cannot drill down
further
System Requirements
Hardware/Software        | Requirements
Computer                 | DEC Alpha AXP or x86
Memory                   | 32 megabytes (MB) minimum (64 MB recommended)
Hard disk space          | 35-50 MB (85 MB including common files and samples)
Server operating system  | Microsoft Windows NT® Server 4.0 with Service Pack 4 or later; Windows NT Workstation 4.0 with Service Pack 4
Client operating system  | Microsoft Windows® 95 + DCOM95; Windows 95 OSR2 + DCOM95; Windows 98. Internet Explorer is required for Microsoft Management Console (MMC) and HTML Help.
Installation Options
Installation Parameters
-
-r causes Setup.exe to automatically generate a silent
response file (.iss)
-
-s performs a silent unattended installation using the
response file created with -r
-
-f1<path\ResponseFile> defines alternate location
and name of the response file (.iss file)
-
-f2<path\LogFile> defines alternate location and
name of the log file
OLAP Manager
Key functions of OLAP Manager
-
Define databases and data sources
-
Build and process cubes
-
Specify storage options and optimize query performance
-
Manage server security
-
Browse data sources, shared dimensions, and security
roles
OLAP Cube Architecture
-
multidimensional representation of detail and summary
data
-
consists of a data source, dimensions, measures, and
partitions
-
data source identifies and connects to the database
containing the data warehouse data that is the source of
data for the cube
-
dimensions map data warehouse dimension table
information into a hierarchy of levels, and can be created
for use in an individual cube or multiple cubes
-
virtual dimension maps the properties of members of
another dimension into a dimension that can then be used in
cubes
-
measures, which are displayed in rows and columns, identify
the numerical values from the fact table that are summarized
for analysis, and form the core of the cube information
presented to users
-
fact table columns can be additive or nonadditive, and
OLAP Services can use both types as measures
-
partitions are the multidimensional storage containers
that hold cube data
-
user-defined partitions are available only in SQL
Server™ OLAP Services Enterprise Edition
-
design is based on the analytical requirements of
users
-
a virtual cube is a logical view of portions of one or
more cubes that can be used to join relatively unlike cubes
that share a common dimension
-
cubes require substantial storage to contain the data
and precalculated summary information in multidimensional
structures - aggregations are precalculated summaries of
data that provide the mechanism for rapid query response
-
precalculation of all possible aggregations in a cube
results in the fastest possible response time for all
queries at the expense of storage and processing time. If no
aggregations are precalculated (0%), little storage space is
required beyond that necessary to store the base data
-
another factor that affects storage requirements is
sparsity, the proportion of empty cells in a cube. SQL
Server OLAP Services uses the following techniques to
minimize the storage requirement: storage is not allocated
for empty cells, data compression is employed, and a
sophisticated algorithm designs efficient summary
aggregations
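The trade-off between precalculated aggregations and on-demand calculation, and the idea of allocating no storage for empty cells, can be illustrated with a minimal sketch (cell coordinates and values are hypothetical):

```python
# Sparse cube: only non-empty (dimension-coordinates -> measure) cells
# are stored, so empty cells cost no space.
cells = {
    ("Widget", 1999, "Jan"): 100.0,
    ("Widget", 1999, "Feb"): 70.0,
    ("Gadget", 1999, "Jan"): 250.0,
}  # a 2 x 1 x 12 member space has 24 cells; only 3 are stored

# Precalculating an aggregation trades storage and processing time
# for faster queries later.
agg_by_year = {}
for (product, year, month), value in cells.items():
    agg_by_year[(product, year)] = agg_by_year.get((product, year), 0.0) + value

# A query at the aggregated level is now a lookup instead of a scan
# over base cells.
print(agg_by_year[("Widget", 1999)])  # 170.0
```

Precalculating 100% of possible aggregations maximizes query speed at maximum storage cost; precalculating 0% stores little beyond the base cells, as the notes above describe.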
Cube Storage Modes
MOLAP storage
-
multidimensional structure to contain aggregations and a
copy of the base data
-
provides the most rapid query response times
-
most appropriate for frequently used cubes and for rapid
query response
ROLAP storage
-
uses tables in the data warehouse relational database to
store a cube’s aggregations
-
does not store a copy of the base data
-
query response is generally slower
-
typically used for large data sets that are infrequently
queried, such as historical data from less recent previous
years
-
aggregations cannot be created for a cube with ROLAP
storage if the data source is OLAP Services
HOLAP storage
-
combines attributes of both MOLAP and ROLAP in that
aggregation data is stored in MOLAP structures, while base
data is left in the data warehouse’s relational
database
-
generally suitable for cubes that require rapid query
response for summaries based on a large amount of base
data
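The three storage modes differ mainly in where aggregations and base data are kept; this can be summarized as data (mode names from the source; the table itself is just an illustration):

```python
# Where each cube storage mode keeps aggregations and base data,
# per the descriptions above.
STORAGE_MODES = {
    # mode:   (aggregations stored in,        base data stored in)
    "MOLAP": ("multidimensional structure", "multidimensional structure (copy)"),
    "ROLAP": ("relational tables",          "data warehouse (no copy)"),
    "HOLAP": ("multidimensional structure", "data warehouse (no copy)"),
}

aggs, base = STORAGE_MODES["HOLAP"]
print(aggs, "/", base)  # multidimensional structure / data warehouse (no copy)
```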
Partitions
-
cubes can be divided into partitions, each of which can
be stored using a different mode
-
partitions are invisible to users, and can be stored on
different servers, providing a clustered approach to cube
storage
-
OLAP Services provides a Partition wizard to assist in
creating partitions
-
cube may return incorrect results for some queries if a
portion of the cube's data is included in more than one of
its partitions
-
partitions of a cube can be merged
Processing Cube
-
process - a complete load of the cube, i.e. all
dimension and fact table data is read and all specified
aggregations are calculated
-
changes in the data warehouse schema that affect the
structure of cubes require those cubes to have their
structure changed and then be processed, while changes in
(or additions to) data in the data warehouse do not require
cubes to be completely processed
Incrementally updating a cube
-
appropriate when new data is to be added to a cube
-
does not require that the cube be processed.
-
does not affect the existing data that has already been
processed
-
can be performed while users continue to query the
cube
-
after update, users have access to the additional data
without having to disconnect and reconnect
Refreshing a cube’s Data
-
causes a cube’s data to be cleared and reloaded and its
aggregations recalculated
-
appropriate when the underlying data in the data
warehouse has changed but the cube’s structure remains the
same
-
faster, as aggregation tables do not have to be
redesigned
-
can be performed while users continue to query the cube
and users will have access to the updated data without
having to disconnect and reconnect
Fact table and partition arrangement
-
when different fact tables are used for a cube's partitions,
all of those fact tables and dimensions must have the same
structure as the cube's fact table and dimensions
-
you should ensure that no data is duplicated among the
fact tables by using filters, although duplicated data is
technically allowed
-
when the same fact table is used for multiple partitions,
data items should not be used in more than one partition -
use filters for all partitions in a cube to extract mutually
exclusive data sets from the fact table
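The requirement that partition filters extract mutually exclusive data sets can be checked with a small sketch (the fact rows and the year-based filters are hypothetical):

```python
# Fact rows: (year, amount). Filters assign each row to one partition.
facts = [(1997, 10), (1998, 20), (1998, 5), (1999, 40)]

partition_filters = {
    "historical": lambda row: row[0] <= 1997,
    "recent":     lambda row: row[0] >= 1998,
}

partitions = {name: [r for r in facts if keep(r)]
              for name, keep in partition_filters.items()}

# Mutually exclusive and exhaustive: every fact lands in exactly one
# partition, so queries over the cube count nothing twice.
assigned = [r for rows in partitions.values() for r in rows]
assert sorted(assigned) == sorted(facts), "filters overlap or miss rows"
print({k: len(v) for k, v in partitions.items()})  # {'historical': 1, 'recent': 3}
```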
Merging partitions
-
partitions can be merged only if they have the same
structure, they are stored in the same mode (MOLAP, HOLAP,
or ROLAP), and contain identical aggregation designs
-
you can copy the aggregation design from another of the
cube’s partitions when creating the partition in the
Partition wizard to ensure that these partitions have the
same aggregation design
-
fact tables are not merged automatically when you merge
partitions. You must do it manually
-
filters of both partitions are ORed together to create a
filter for the resulting partition which specifies the set
of facts used in the resulting partition
-
when merging partitions that use different fact tables,
the resulting partition will refer only to the target
partition’s fact table. Facts from the source partition’s
fact table must be merged manually
-
a merged MOLAP (multidimensional OLAP) partition with an
incomplete fact table contains an internally merged copy of
the fact table data and will operate correctly until it is
processed; a merged HOLAP (hybrid OLAP) or ROLAP (relational
OLAP) partition with an incomplete fact table contains
accurate aggregations but incomplete facts, which leads to
incorrect returned data
-
absence of unavailable facts might not be noticed unless
a user attempts to drill down to a fact in the unavailable
table, or executes a query that requires a fact from the
unavailable table
-
when merging partitions that were created by specifying
data slices in the Partition wizard, the merged partition
can contain unexpected incorrect data unless you create a
filter that specifies the data in the resultant
partition.
-
a data slice is specified when you create a partition
using the Partition wizard. The wizard attempts to create a
filter on the fact table to specify the data to be included
in the partition if the level's MemberKeyColumn and
MemberNameColumn properties point to the same column. This
is the default if the dimension or cube has not been
edited
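The ORing of partition filters during a merge can be sketched with predicate functions (the fact rows and filters are hypothetical):

```python
facts = [(1997, 10), (1998, 20), (1999, 40)]

def filter_a(row):       # source partition's filter
    return row[0] == 1997

def filter_b(row):       # target partition's filter
    return row[0] == 1998

# The resulting partition's filter is the OR of the two original
# filters, specifying the set of facts used in the merged partition.
def merged_filter(row):
    return filter_a(row) or filter_b(row)

merged_rows = [r for r in facts if merged_filter(r)]
print(merged_rows)  # [(1997, 10), (1998, 20)]
```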
Dimension
Dimension tables
-
contain the detail and describe the fact table
-
has fewer rows than fact table
-
columns are mainly character type
-
conventional dimensions: e.g., a time dimension based on
month, date, etc.; a location dimension based on country,
city, etc.
-
shared dimensions - data marts sharing each other's
dimension tables
-
degenerate dimensions: represent a business event but
not a numeric fact; have no association to any dimension
table
-
junk dimensions: attributes not related to the business
objective, but still important
-
the fact table is much larger than the dimension tables, so
estimate the required size based on the fact table
Defining Dimensions
-
each column in a dimension contributes a level to the
dimension; levels are ordered by specificity and organized
in a hierarchy that allows logical avenues for drill down
-
each level contains members that are the values within
the column that defines the level
-
in tabular browsers, members provide the column
headings, row headings, and subheadings by which measures
are separated and displayed to cube users
-
in graphical browsers, members provide other types of
descriptive labels that serve the same function as in
tabular browsers
-
each dimension table’s primary key must join to a
foreign key in a cube’s fact table or another dimension
table, although key columns are not required in the
dimension definition
-
multiple-hierarchy dimensions provide similar yet
alternative views of cube data; they are defined as two or
more dimensions whose names share the same prefix, followed
by a period, with different suffixes
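The idea of levels ordered by specificity, each containing members, can be sketched as (a hypothetical time dimension; member lists abbreviated):

```python
# Each column of the dimension table contributes a level; levels are
# ordered from least to most specific. Each level contains members,
# the values within the column that defines the level.
time_dimension = [
    ("Year",    [1998, 1999]),
    ("Quarter", ["Q1", "Q2", "Q3", "Q4"]),
    ("Month",   ["Jan", "Feb", "Mar"]),  # members shown for Q1 only
]

# Drill down follows the hierarchy one level at a time.
drill_path = [level for level, members in time_dimension]
print(" -> ".join(drill_path))  # Year -> Quarter -> Month
```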
Rebuilding Structure
-
re-creates and loads the dimension
-
required after the structure of the dimension is changed
or relationships between members in the dimension hierarchy
are changed
-
when a shared dimension is processed with the rebuild
the dimension structure option, all cubes that incorporate
the shared dimension immediately become unavailable to users
and must be processed before they can be used again
-
when a shared dimension’s structure is edited and saved
but not processed, it will automatically be processed when
any cube incorporating the dimension is processed. Any other
cubes that incorporate the dimension immediately become
unavailable to users and must be processed before they can
be used again
-
you can incrementally update a dimension, which updates
a dimension when new members have been added but no
structural changes have been made
-
a cube that incorporates a shared dimension remains
available to users while the dimension is incrementally
updated
-
added dimension members are available in the cube after
the update is complete
Calculated Members
-
dimension member whose value is calculated at run time
using an expression that you specify when you define the
calculated member
-
enables you to add members and measures to a cube
without increasing its size
-
only the definitions for calculated members are stored,
and values are calculated in memory
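A calculated member stores only its defining expression; the value is computed in memory at query time, so the cube does not grow. A minimal sketch (measure names and formula are hypothetical):

```python
# Stored measures for one cube cell.
cell = {"sales": 500.0, "cost": 320.0}

# Only the definition of the calculated member is stored...
calculated_members = {"profit": lambda c: c["sales"] - c["cost"]}

# ...and its value is calculated in memory at query time.
def query(cell, measure):
    if measure in calculated_members:
        return calculated_members[measure](cell)
    return cell[measure]

print(query(cell, "profit"))  # 180.0
```

The cell itself never stores `profit`, which is why calculated members add measures without increasing cube size.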
Virtual Cubes
-
a combination of multiple cubes in one logical cube
-
can also be based on a single cube to expose only
selected subsets of its measures and dimensions
-
stores only its definition, not the data of its component
cubes
-
requires virtually no physical storage space
-
provides security function by limiting some users’ view
of the underlying cubes
-
after creation, you must process it in order to
establish the internal links to the specified dimensions and
measures in its underlying cube or cubes
Member Properties
-
attribute of a dimension member
-
can be used in queries and thus provide end users with
more options when analyzing cube data
-
acts as the basis of virtual dimensions.
-
values for member properties must be read from a column
in the same dimension table as the associated members
-
created in the Dimension editor by associating the
column that contains values for the member property with the
level that contains the members
Virtual Dimension
-
logical dimension based on a property of a level in a
physical dimension
-
enables end users to analyze cube data based on the
member properties of dimension level members in a cube
-
can be added to a cube only if the dimension that
supplies its member property is also included in the
cube
-
does not increase the cube’s size because, unlike an
ordinary dimension, aggregation data is not stored in the
cube
-
does not affect cube processing time because it is
calculated in memory when needed
Write-Enabling
-
client applications can record changes to the cube’s
data
-
allows end users to explore scenarios by changing cell
values and analyzing the effects of the changes on cube
data
-
end user’s change is stored in the write-back table as a
difference from the currently displayed value, although
original value in the cube is preserved and an audit trail
of changes is recorded in the write-back table
-
changes can only be made to cells at the lowest level -
cells that do not contain aggregated information
-
a separate write-back table facilitates converting it to a
partition to permanently incorporate the changes into the
cube, or discarding them, which returns the cube to its
original state
-
end user is permitted to record changes in a cube’s
write-back table only if the user belongs to a role with
read/write permissions assigned to the cube
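Storing each change as a difference while preserving the original cube values can be sketched as follows (the cell keys and helper functions are hypothetical, not OLAP Services APIs):

```python
# Original cube values are never modified by write-back.
cube = {("Widget", "Jan"): 100.0}

# End-user changes are stored as differences from the currently
# displayed value in a separate write-back table, which also serves
# as an audit trail of changes.
writeback = []  # list of (cell, delta) entries

def read_cell(cell):
    # Displayed value = original value plus all recorded deltas.
    return cube[cell] + sum(d for c, d in writeback if c == cell)

def write_cell(cell, new_value):
    current = read_cell(cell)
    writeback.append((cell, new_value - current))

write_cell(("Widget", "Jan"), 120.0)   # user's first what-if change
write_cell(("Widget", "Jan"), 90.0)    # a second change

print(read_cell(("Widget", "Jan")), cube[("Widget", "Jan")])  # 90.0 100.0
```

Discarding the write-back entries restores the cube to its original state, matching the behavior described above.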
Data and Structural Changes
-
data additions can be managed by carefully defining
partition filters, and by designing a strategy to
synchronize OLAP and data warehouse data
-
changes to correct errors in a data warehouse can be
minimized by applying care during the data transformation,
validation, and scrubbing operations
-
changes to correct errors in basic data should be
incorporated in the source database and then migrated to the
data warehouse in a controlled manner
-
changes to data in the fact table can affect the
accuracy of queries to a cube until the cube is processed.
The refresh data processing method can be used to reload the
cube’s data and recalculate the aggregations
Visibility to clients during
changes
-
when a cube that is currently online is processed, it
remains online until the processing has been completed, at
which time the online cube is replaced by the new cube
version
-
when a cube is processed using the full process method,
online clients will be disconnected from the cube when the
switch is made to the new version of the cube, and the
clients must individually reconnect to access the new
version
-
when a cube is processed using either the incremental
update or the refresh data method, online clients will not
be disconnected from the cube when the processing completes.
The new version of the cube will be immediately visible with
no break in service
-
if a shared dimension is processed using the rebuild the
dimension structure method, all cubes that use the dimension
will immediately become unavailable to clients and must be
processed before they can be used again
-
if a shared dimension is processed using the incremental
update method, cubes that use the dimension remain available
to clients and any new members added to the dimension
automatically become available to clients when the dimension
processing is complete
-
dimension hierarchies can be affected by changes to data
in the data warehouse dimension tables, because a hierarchy
is based on relationships between members in a dimension
table. In this case, the dimension must be rebuilt using the
rebuild the dimension structure processing method
-
structure of OLAP cubes and dimensions can be affected
by changes to the design of the data warehouse such as the
addition, deletion, or alteration of tables, or
relationships between tables. In such cases, you must modify
the design of affected cubes and dimensions, redefine
partitions and aggregations, and completely process the
modified cubes and dimensions
-
for each OLAP server, a repository is created to store
metadata for multidimensional objects such as cubes,
dimensions, and so on. This is a Microsoft Access (.mdb)
database at \Program Files\OLAP Services\Bin\msmdrep.mdb.
You can use the Migrate Repository wizard to migrate this
repository to a SQL Server (.mdf) database, but the process
cannot be undone
Security
-
security is controlled via access control rights that
are managed using roles
-
enabled only if the server is installed on the Windows
NT NTFS file system, which enforces access control lists
-
three levels of access control
-
Read - supports browsing of data and data
structures (metadata); does not allow modifying or
processing data
-
Read/Write - all Read functionality, plus the
ability to modify data in cubes designated and enabled for
write-back
-
Admin - by default, the user account used to
install the server on a particular computer has Admin
privileges on that computer. Only members of this group
can start the OLAP Manager user interface and use
administrative functions, including security
management
Changing a User’s Access Rights
-
time that elapses between a change to a user’s access
rights in OLAP Services and the actual effect of the change
depends on the value of the Auto Synch Period initialization
property
-
if the Auto Synch Period property is set to null or 0
(zero), synchronization does not occur at a constant
interval but occurs due to users’ actions
-
if the Auto Synch Period property is set to a non-null,
non-zero value, at the specified interval, users’ logon user
names and authorizations are compared to their access rights
defined in OLAP Services, so that changes to a user’s access
rights that occurred since the last synchronization take
effect immediately
Roles
-
map Microsoft® Windows NT® user accounts and user groups
to security category assignments for cubes
-
created and managed at the database level via the Manage
Roles dialog box
-
the service name for OLAP Services is
MSSQLServerOLAPService. The logon account associated with
this service must have permissions to access data sources
that OLAP Services administrators can access through the
OLAP Manager
Building Custom Applications
-
custom applications can be created to manage OLAP
server, create and maintain OLAP objects such as cubes,
dimensions, and security roles, extend user interface and
connect to the OLAP server, query data in cubes, and create
local cubes
-
OLAP Services recognizes both the SQL dialect and the
multidimensional expressions (MDX) dialect
-
MDX - multidimensional expressions (MDX) for
manipulating multidimensional data, supporting MDX functions
in the definitions of calculated members, and in a full
language implementation for building local cubes and
querying cube data using PivotTable® Service with OLE DB and
Microsoft ActiveX® Data Objects (ADO)
-
Cube editor's Calculated Member Builder can help
developing MDX
-
Decision Support Objects (DSO) enable you to create
applications that enhance, augment, and automate your OLAP
installation
-
Add-in Programs - you can create and register add-in
programs that will be called by the OLAP Add-In Manager in
response to user activity in the OLAP Manager user
interface. They can optionally use Decision Support Objects
(DSO) to manage server objects, and multiple add-in programs
can be registered
Nature of PivotTable Service
-
OLE DB provider that supports the optional OLE DB for
OLAP extensions introduced in OLE DB 2.0 and ADO 2.0
-
in-process
-
designed to provide online and offline data analysis and
online access to OLAP data
-
functions as a client of OLAP Services
-
can work with only a single local cube partition
Functions of PivotTable
Service
-
connects to OLAP Services as a client component
-
functions as a service provider for relational data
sources to implement multidimensional functionality
-
exposes the OLE DB interfaces with OLAP extensions
-
functions as a tabular data provider by supporting a
subset of SQL
-
functions as a multidimensional data provider by
supporting MDX
-
enables client applications to create local cubes
directly from a relational data source
-
functions as a mobile desktop OLAP client that enables
users to download from data sources and store the data in a
multidimensional structure on a local computer for offline
analysis
Performance Tuning
-
database engine is largely self-tuning – you should
let SQL Server take care of most of the tuning work
-
increase the amount of RAM - access to data in RAM
cache is much faster than access from disk
-
create and maintain good indexes - ensure good
indexes are created and maintained. Please refer to the SQL
Server 7 Database Design Cramsession regarding Clustered and
Non-clustered indexes, as well as the purpose of Fill
Factor. Keep in mind that, because a clustered index
physically orders the table data, clustered indexes are much
better than non-clustered indexes for queries that match
columns or search for ranges of columns that are mostly
non-unique.
-
Monitor disk I/O subsystem performance – ensures that
database server is running without disk queuing
-
Application and Query Tuning – use SQL Server
Profiler to monitor and log a SQL Server's workload, then
submit the logged workload to SQL Server Index Tuning Wizard
to make appropriate index changes
-
SQL Server Performance Monitor - detects bottlenecks
with the revised set of Performance Monitor objects and
counters
-
SQL Server Query Analyzer and Graphical ShowPlan -
visually analyze problematic SQL queries and statistics I/O
-
Max Async I/O - with RAID (Redundant Array of
Inexpensive Disks) attached to a database server that is
capable of very high disk I/O transfer rates, max async I/O
should be set higher. The goal is to make Checkpoint fast
enough to finish before another checkpoint is needed.
Command to use is in SQL Server Query Analyzer:
"sp_configure 'max async io', <value>"
-
Worker Threads - total number of these threads
available to service all incoming command batches is
dictated by the setting for the sp_configure option max
worker threads. If not enough threads are available, thread
sharing will occur, which will negatively affect
performance
-
LazyWriter - aims to produce free buffers during
periods of low disk I/O so that disk I/O resources are
readily available for use. If LazyWriter has trouble keeping
the free buffer count steady, or at least above zero, it is likely
that the disk subsystem is not able to provide LazyWriter
with the disk I/O performance that LazyWriter needs. You
should then add more physical disk drives. Also, you can
adjust LazyWriter disk I/O request behavior with the use of
max async I/O
-
Checkpoint – writes dirty pages out to the SQL Server
data files in order to even out SQL Server disk I/O activity
over a longer time period. You may adjust Checkpoint's dirty
page flushing behavior with max async I/O. For example, if
disk queuing occurs at unacceptable levels, decrease max
async I/O and add more disks to the disk subsystem
-
DBCC SHOWCONTIG can be used to reveal whether
excessive page splitting has occurred on a table - Scan
Density should be as close to 100 percent as possible, or
rebuild the clustered index on that table using the
DROP_EXISTING option to defragment the table
-
FILLFACTOR on the CREATE INDEX and DBCC REINDEX
commands allow you to specify what percentage of open space
to leave on index and data pages. Please note that PAD_INDEX
option is for FILLFACTOR on the non-leaf-level index pages.
Optimal value for FILLFACTOR depends upon how much new data
will be inserted within a given time frame into an 8-KB
index and data page
-
reads tend to outnumber writes in OLAP systems -
FILLFACTOR should be set at 100 percent so that all index
and data pages are filled completely for maximum I/O
performance
Other Issues
-
Linked server allows SQL Server to execute commands
against OLAP Services using the OLE DB provider for
OLAP
-
When setting up a linked server to OLAP Services, keep
in mind that:
-
If login is made to SQL Server using a SQL Server
security login name and password, set up the SQL Server
services to run under either a local or domain user
account rather than the SYSTEM account. If login is made
to SQL Server using Microsoft Windows NT authentication,
SQL Server passes the credentials of the Windows NT
account to OLAP Services
-
Either the SQL Server services startup account or
the Microsoft Windows NT account should have access to
OLAP services. Make this account part of the 'OLAP
Administrator' local group on OLAP server computer, or
create a database role within OLAP Manager and provide
this role with read/write access to the cubes.
-
OLAP Services internally uses the NT computer name
for licensing information. Services may fail to start after
you rename the Microsoft Windows NT computer name
-
Re-run the SQL Server OLAP Services setup so that
it uses the latest Windows NT computer name. Remember to
reinstall any service pack that was in use
-
Change the Windows NT computer name back to the
original computer name
-
When using OLAP Services to pull data from SQL Server
using trusted or Windows NT authentication security, the
MSSQLServerOLAPService service must be configured to run
under a domain or local user account; otherwise, OLAP may
fail to process dimensions or cubes. By default
MSSQLServerOLAPService service runs under a local system
account
-
Apart from being a member of the OLAP Administrators
group, the MSSQLServerOLAPService account must have the
appropriate permissions on SQL Server: permissions required
will vary depending on the type of storage structure
selected
-
When using MOLAP storage, the
MSSQLServerOLAPService account must have at least SELECT
permissions on source database.
-
If you use ROLAP or HOLAP storage, the
MSSQLServerOLAPService account must have at least SELECT
and CREATE TABLE permissions on source database
-
SQL Server 7.0 can perform queries against OLE DB
providers by using the OPENQUERY or OPENROWSET Transact-SQL
functions or by using a query with four-part names including
a linked-server name. Query is limited to the abbreviated
SELECT syntax supported by OLAP Services, but can include
MDX syntax to return "flattened rowsets"
-
When a client application connects to SQL Server and
performance is poor, examine how the query and schema can be
optimized - especially the star/snowflake schema, including
schema design, indexing, key relationships, referential
integrity, and insufficient data scrubbing
-
One possible solution to optimize schema: before
processing an OLAP Services cube, declare a primary key in
each dimension table. Then
-
For star dimensions, declare foreign key (FK)
relationships between each dimension table and the
corresponding fact table.
-
For snowflaked dimensions, you also declare FK
relationships between each secondary dimension table and
the primary dimension table that it augments.
-
Define indexes on each of the primary keys in the
dimension tables and in the fact tables, and also on each
of the foreign keys in the fact table(s)
-
Perform clean up: remove all NULLs for data items
that are being moved into OLAP Services
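The steps above (declare primary keys, foreign-key relationships, and indexes; eliminate NULLs) can be sketched in portable SQL via SQLite. The actual Transact-SQL DDL for SQL Server 7.0 differs in detail, and all table names here are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
-- Primary key declared in the dimension table.
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL  -- clean-up: no NULLs in data moved to OLAP
);
-- Star dimension: FK relationship from the fact table to the dimension.
CREATE TABLE fact_orders (
    order_key INTEGER PRIMARY KEY,
    customer_key INTEGER NOT NULL
        REFERENCES dim_customer(customer_key),
    amount REAL NOT NULL
);
-- Index on the fact table's foreign key.
CREATE INDEX ix_fact_orders_customer ON fact_orders(customer_key);
""")

index_names = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(index_names)
```

For a snowflaked dimension, the same pattern extends with FK relationships from each secondary dimension table to the primary dimension table it augments.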