Developing & Implementing Data Warehouses with SQL Server 7.0
Overview of Data Warehousing
-
a data warehouse is a database containing data
that usually represents the business history of an
organization
-
must store many years of information and be able to
query quickly
-
data is organized to support analysis rather than to
process real-time transactions
-
OLAP
- provides a multidimensional presentation
of data warehouse data
-
data mining
- finding patterns among the
information
-
data mart
- smaller version of a data warehouse
-
when designing a data warehouse, use either:
-
top down approach - design the big warehouse
first, then divide it into smaller marts
-
bottom up approach - design the smaller units
first, then integrate them into one big
warehouse
Characteristics of data
-
data is stored centrally
-
converts to the same format in order to be
consistent
-
only includes subject oriented key information
-
historical
-
read only; fast query and retrieval are key - an
appropriate level of summarization is needed
-
fact table contains only numbers and keys that reference
textual data
-
dimension table contains descriptions of the data in the
fact table to provide a hierarchical structure
Warehousing Processes
-
retrieve data from operational systems such as RDBMS,
tape or external data
-
transform data to appropriate format
-
populate the warehouse
-
user can then query data with tools such as
EIS
Design Considerations
-
character data: use the default character data types plus
Unicode data types to support all languages, or use the same
code page on both client and server unless they use only the
first 128 characters
-
filegroups: use filegroups to spread the database across
multiple drives to improve performance
-
fault tolerance: disk striping with parity, disk
mirroring
Business Analysis Process
-
identify objectives (eg. the need to know which factor
led to a drop in sales)
-
gather and analyze information; find out from workers
what sort of information is needed for the data
warehouse
-
identify only one segment at first to build a data
mart
-
create the conceptual data model on paper first
-
identify data sources that have integrity
-
determine the duration of data kept in the warehouse -
recommended: 2 years (different durations can be used for
different data marts)
Schemas
Star schema
-
most popular
-
has one central fact table
-
each dimension table has a one-to-many relationship to the
fact table
-
dimension tables can be updated, while the fact table is
longer term
-
fact table uses a multipart key and can contain
pre-calculated summarized values
-
a larger fact table can be faster because it contains more
pre-calculated values, but at the expense of higher storage
and maintenance costs
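The star schema described above can be sketched with a toy fact/dimension pair. This is a minimal illustration (all table and column names are hypothetical, not from the source) using SQLite, whose SQL differs in detail from SQL Server 7.0's Transact-SQL:

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category TEXT
);
CREATE TABLE dim_time (
    time_key INTEGER PRIMARY KEY,
    year INTEGER,
    month INTEGER
);
-- Central fact table: only keys and numbers (measures).
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product,
    time_key INTEGER REFERENCES dim_time,
    units_sold INTEGER,
    revenue REAL
);
""")
con.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
con.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                [(1, 1999, 1), (2, 1999, 2)])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 10, 100.0), (2, 1, 5, 250.0), (1, 2, 7, 70.0)])

# Each dimension has a one-to-many relationship to the fact table;
# analysis queries join the fact table to its dimensions and summarize.
rows = con.execute("""
    SELECT p.category, t.year, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_time t ON f.time_key = t.time_key
    GROUP BY p.category, t.year
""").fetchall()
print(rows)  # [('Hardware', 1999, 420.0)]
```

Note how the fact table holds only keys and numeric measures, while the textual descriptions live in the dimension tables.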
Snowflake schema
-
variation of the star schema: more levels of dimension
tables, more normalized, lower row counts, more complex,
more joins
-
once at the lowest level of detail, you cannot drill down
further
System Requirements
Hardware/Software        | Requirements
Computer                 | DEC Alpha AXP or x86
Memory                   | 32 megabytes (MB) minimum (64 MB recommended)
Hard disk space          | 35-50 MB (85 MB including common files and samples)
Server operating system  | Microsoft Windows NT® Server 4.0 with Service Pack 4 or later; Windows NT Workstation 4.0 with Service Pack 4
Client operating system  | Microsoft Windows® 95 + DCOM95; Windows 95 OSR2 + DCOM95; Windows 98. Internet Explorer is required for Microsoft Management Console (MMC) and HTML Help.
Installation Options
Installation Parameters
-
-r causes Setup.exe to automatically generate a silent
response file (.iss)
-
-s performs a silent unattended installation using the
response file created with -r
-
-f1<path\ResponseFile> defines alternate location
and name of the response file (.iss file)
-
-f2<path\LogFile> defines alternate location and
name of the log file
OLAP Manager
Key functions of OLAP Manager
-
Define databases and data sources
-
Build and process cubes
-
Specify storage options and optimize query performance
-
Manage server security
-
Browse data sources, shared dimensions, and security
roles
OLAP Cube Architecture
-
multidimensional representation of detail and summary
data
-
consists of a data source, dimensions, measures, and
partitions
-
data source identifies and connects to the database
containing the data warehouse data that is the source of
data for the cube
-
dimensions map data warehouse dimension table
information into a hierarchy of levels, and can be created
for use in an individual cube or multiple cubes
-
virtual dimension maps the properties of members of
another dimension into a dimension that can then be used in
cubes
-
measures, which are displayed in rows and columns, identify
the numerical values from the fact table that are summarized
for analysis, and form the core of the cube information
presented to users
-
fact table columns can be additive or nonadditive, and
OLAP Services can use both types as measures
-
partitions are the multidimensional storage containers
that hold cube data
-
user-defined partitions are available only in SQL
Server™ OLAP Services Enterprise Edition
-
design is based on the analytical requirements of
users
-
a virtual cube is a logical view of portions of one or
more cubes that can be used to join relatively unlike cubes
that share a common dimension
-
cubes require substantial storage to contain the data
and precalculated summary information in multidimensional
structures - aggregations are precalculated summaries of
data that provide the mechanism for rapid query response
-
precalculation of all possible aggregations in a cube
results in the fastest possible response time for all
queries at the expense of storage and processing time. If no
aggregations are precalculated (0%), little storage space is
required beyond that necessary to store the base data
-
another factor that affects storage requirements is
sparsity, the proportion of empty cells in a cube. SQL
Server OLAP Services uses the following techniques to
minimize the storage requirement: storage is not allocated
for empty cells, data compression is employed, and a
sophisticated algorithm designs efficient summary
aggregations
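The trade-off between precalculated aggregations and on-demand calculation, and the idea of allocating no storage for empty cells, can be illustrated with a minimal sketch (cell coordinates and values are hypothetical):

```python
# Sparse cube: only non-empty (dimension-coordinates -> measure) cells
# are stored, so empty cells cost no space.
cells = {
    ("Widget", 1999, "Jan"): 100.0,
    ("Widget", 1999, "Feb"): 70.0,
    ("Gadget", 1999, "Jan"): 250.0,
}  # a 2 x 1 x 12 member space has 24 cells; only 3 are stored

# Precalculating an aggregation trades storage and processing time
# for faster queries later.
agg_by_year = {}
for (product, year, month), value in cells.items():
    agg_by_year[(product, year)] = agg_by_year.get((product, year), 0.0) + value

# A query at the aggregated level is now a lookup instead of a scan
# over base cells.
print(agg_by_year[("Widget", 1999)])  # 170.0
```

Precalculating 100% of possible aggregations maximizes query speed at maximum storage cost; precalculating 0% stores little beyond the base cells, as the notes above describe.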
Cube Storage Modes
MOLAP storage
-
multidimensional structure to contain aggregations and a
copy of the base data
-
provides the most rapid query response times
-
most appropriate for frequently used cubes and for rapid
query response
ROLAP storage
-
uses tables in the data warehouse relational database to
store a cube’s aggregations
-
does not store a copy of the base data
-
query response is generally slower
-
typically used for large data sets that are infrequently
queried, such as historical data from less recent previous
years
-
aggregations cannot be created for a cube with ROLAP
storage if the data source is OLAP Services
HOLAP storage
-
combines attributes of both MOLAP and ROLAP in that
aggregation data is stored in MOLAP structures, while base
data is left in the data warehouse’s relational
database
-
generally suitable for cubes that require rapid query
response for summaries based on a large amount of base
data
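The three storage modes differ mainly in where aggregations and base data are kept; this can be summarized as data (mode names from the source; the table itself is just an illustration):

```python
# Where each cube storage mode keeps aggregations and base data,
# per the descriptions above.
STORAGE_MODES = {
    # mode:   (aggregations stored in,        base data stored in)
    "MOLAP": ("multidimensional structure", "multidimensional structure (copy)"),
    "ROLAP": ("relational tables",          "data warehouse (no copy)"),
    "HOLAP": ("multidimensional structure", "data warehouse (no copy)"),
}

aggs, base = STORAGE_MODES["HOLAP"]
print(aggs, "/", base)  # multidimensional structure / data warehouse (no copy)
```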
Partitions
-
cubes can be divided into partitions, each of which can
be stored using a different mode
-
partitions are invisible to users, and can be stored on
different servers, providing a clustered approach to cube
storage
-
OLAP Services provides a Partition wizard to assist in
creating partitions
-
cube may return incorrect results for some queries if a
portion of the cube's data is included in more than one of
its partitions
-
partitions of a cube can be merged
Processing Cube
-
process - a complete load of the cube, i.e. all
dimension and fact table data is read and all specified
aggregations are calculated
-
changes in the data warehouse schema that affect the
structure of cubes require those cubes to have their
structure changed and then be processed, while changes in
(or additions to) data in the data warehouse do not require
cubes to be completely processed
Incrementally updating a cube
-
appropriate when new data is to be added to a cube
-
does not require that the cube be processed.
-
does not affect the existing data that has already been
processed
-
can be performed while users continue to query the
cube
-
after update, users have access to the additional data
without having to disconnect and reconnect
Refreshing a cube’s Data
-
causes a cube’s data to be cleared and reloaded and its
aggregations recalculated
-
appropriate when the underlying data in the data
warehouse has changed but the cube’s structure remains the
same
-
faster, as aggregation tables do not have to be
redesigned
-
can be performed while users continue to query the cube
and users will have access to the updated data without
having to disconnect and reconnect
Fact table and partition arrangement
-
when different fact tables are used for a cube's partitions,
all of those fact tables and dimensions must have the same
structure as the cube's fact table and dimensions
-
you should ensure that no data is duplicated among the
fact tables by using filters, although duplicated data is
technically allowed
-
when the same fact table is used for multiple partitions,
data items should not be used in more than one partition -
use filters for all partitions in a cube to extract mutually
exclusive data sets from the fact table
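The requirement that partition filters extract mutually exclusive data sets can be checked with a small sketch (the fact rows and the year-based filters are hypothetical):

```python
# Fact rows: (year, amount). Filters assign each row to one partition.
facts = [(1997, 10), (1998, 20), (1998, 5), (1999, 40)]

partition_filters = {
    "historical": lambda row: row[0] <= 1997,
    "recent":     lambda row: row[0] >= 1998,
}

partitions = {name: [r for r in facts if keep(r)]
              for name, keep in partition_filters.items()}

# Mutually exclusive and exhaustive: every fact lands in exactly one
# partition, so queries over the cube count nothing twice.
assigned = [r for rows in partitions.values() for r in rows]
assert sorted(assigned) == sorted(facts), "filters overlap or miss rows"
print({k: len(v) for k, v in partitions.items()})  # {'historical': 1, 'recent': 3}
```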
Merging partitions
-
partitions can be merged only if they have the same
structure, they are stored in the same mode (MOLAP, HOLAP,
or ROLAP), and contain identical aggregation designs
-
you can copy the aggregation design from another of the
cube’s partitions when creating the partition in the
Partition wizard to ensure that these partitions have the
same aggregation design
-
fact tables are not merged automatically when you merge
partitions. You must do it manually
-
filters of both partitions are ORed together to create a
filter for the resulting partition which specifies the set
of facts used in the resulting partition
-
when merging partitions that use different fact tables,
the resulting partition will refer only to the target
partition’s fact table. Facts from the source partition’s
fact table must be merged manually
-
a merged MOLAP (multidimensional OLAP) partition with an
incomplete fact table contains an internally merged copy of
the fact table data and will operate correctly until it is
processed; a merged HOLAP (hybrid OLAP) or ROLAP (relational
OLAP) partition with an incomplete fact table contains
accurate aggregations but incomplete facts, which leads to
incorrect returned data
-
absence of unavailable facts might not be noticed unless
a user attempts to drill down to a fact in the unavailable
table, or executes a query that requires a fact from the
unavailable table
-
when merging partitions that were created by specifying
data slices in the Partition wizard, the merged partition
can contain unexpected incorrect data unless you create a
filter that specifies the data in the resultant
partition.
-
a data slice is specified when you create a partition
using the Partition wizard. The wizard attempts to create a
filter on the fact table to specify the data to be included
in the partition if the level's MemberKeyColumn and
MemberNameColumn properties point to the same column. This
is the default if the dimension or cube has not been
edited
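The ORing of partition filters during a merge can be sketched with predicate functions (the fact rows and filters are hypothetical):

```python
facts = [(1997, 10), (1998, 20), (1999, 40)]

def filter_a(row):       # source partition's filter
    return row[0] == 1997

def filter_b(row):       # target partition's filter
    return row[0] == 1998

# The resulting partition's filter is the OR of the two original
# filters, specifying the set of facts used in the merged partition.
def merged_filter(row):
    return filter_a(row) or filter_b(row)

merged_rows = [r for r in facts if merged_filter(r)]
print(merged_rows)  # [(1997, 10), (1998, 20)]
```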
Dimension
Dimension tables
-
contain the detail and describe the fact table
-
has fewer rows than fact table
-
columns are mainly character type
-
conventional dimensions: e.g., a time dimension based on
month, date, etc.; a location dimension based on country,
city, etc.
-
shared dimensions - data marts sharing each other's
dimension tables
-
degenerate dimensions: represent a business event but
not a numeric fact; have no association to any dimension
table
-
junk dimensions: attributes not related to the business
objective, but still important
-
the fact table is much larger than the dimension tables, so
estimate the required size based on the fact table
Defining Dimensions
-
each column in a dimension contributes a level to the
dimension; levels are ordered by specificity and organized
in a hierarchy that allows logical avenues for drill down
-
each level contains members that are the values within
the column that defines the level
-
in tabular browsers, members provide the column
headings, row headings, and subheadings by which measures
are separated and displayed to cube users
-
in graphical browsers, members provide other types of
descriptive labels that serve the same function as in
tabular browsers
-
each dimension table’s primary key must join to a
foreign key in a cube’s fact table or another dimension
table, although key columns are not required in the
dimension definition
-
multiple-hierarchy dimensions provide similar yet
alternative views of cube data; they are defined as two or
more dimensions whose names share the same prefix, followed
by a period, with different suffixes
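The idea of levels ordered by specificity, each containing members, can be sketched as (a hypothetical time dimension; member lists abbreviated):

```python
# Each column of the dimension table contributes a level; levels are
# ordered from least to most specific. Each level contains members,
# the values within the column that defines the level.
time_dimension = [
    ("Year",    [1998, 1999]),
    ("Quarter", ["Q1", "Q2", "Q3", "Q4"]),
    ("Month",   ["Jan", "Feb", "Mar"]),  # members shown for Q1 only
]

# Drill down follows the hierarchy one level at a time.
drill_path = [level for level, members in time_dimension]
print(" -> ".join(drill_path))  # Year -> Quarter -> Month
```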
Rebuilding Structure
-
re-creates and loads the dimension
-
required after the structure of the dimension is changed
or relationships between members in the dimension hierarchy
are changed
-
when a shared dimension is processed with the rebuild
the dimension structure option, all cubes that incorporate
the shared dimension immediately become unavailable to users
and must be processed before they can be used again
-
when a shared dimension’s structure is edited and saved
but not processed, it will automatically be processed when
any cube incorporating the dimension is processed. Any other
cubes that incorporate the dimension immediately become
unavailable to users and must be processed before they can
be used again
-
you can incrementally update a dimension, which updates
a dimension when new members have been added but no
structural changes have been made
-
a cube that incorporates a shared dimension remains
available to users while the dimension is incrementally
updated
-
added dimension members are available in the cube after
the update is complete
Calculated Members
-
dimension member whose value is calculated at run time
using an expression that you specify when you define the
calculated member
-
enables you to add members and measures to a cube
without increasing its size
-
only the definitions for calculated members are stored,
and values are calculated in memory
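A calculated member stores only its defining expression; the value is computed in memory at query time, so the cube does not grow. A minimal sketch (measure names and formula are hypothetical):

```python
# Stored measures for one cube cell.
cell = {"sales": 500.0, "cost": 320.0}

# Only the definition of the calculated member is stored...
calculated_members = {"profit": lambda c: c["sales"] - c["cost"]}

# ...and its value is calculated in memory at query time.
def query(cell, measure):
    if measure in calculated_members:
        return calculated_members[measure](cell)
    return cell[measure]

print(query(cell, "profit"))  # 180.0
```

The cell itself never stores `profit`, which is why calculated members add measures without increasing cube size.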
Virtual Cubes
-
a combination of multiple cubes in one logical cube
-
can also be based on a single cube to expose only
selected subsets of its measures and dimensions
-
stores only its definition, not the data of its component
cubes
-
requires virtually no physical storage space
-
provides security function by limiting some users’ view
of the underlying cubes
-
after creation, you must process it in order to
establish the internal links to the specified dimensions and
measures in its underlying cube or cubes
Member Properties
-
attribute of a dimension member
-
can be used in queries and thus provide end users with
more options when analyzing cube data
-
acts as the basis of virtual dimensions.
-
values for member properties must be read from a column
in the same dimension table as the associated members
-
created in the Dimension editor by associating the
column that contains values for the member property with the
level that contains the members
Virtual Dimension
-
logical dimension based on a property of a level in a
physical dimension
-
enables end users to analyze cube data based on the
member properties of dimension level members in a cube
-
can be added to a cube only if the dimension that
supplies its member property is also included in the
cube
-
does not increase the cube’s size because, unlike an
ordinary dimension, aggregation data is not stored in the
cube
-
does not affect cube processing time because it is
calculated in memory when needed
Write-Enabling
-
client applications can record changes to the cube’s
data
-
allows end users to explore scenarios by changing cell
values and analyzing the effects of the changes on cube
data
-
end user’s change is stored in the write-back table as a
difference from the currently displayed value, although
original value in the cube is preserved and an audit trail
of changes is recorded in the write-back table
-
changes can only be made to cells at the lowest level -
cells that do not contain aggregated information
-
a separate write-back table facilitates converting it to a
partition to permanently incorporate the changes into the
cube, or discarding them, which returns the cube to its
original state
-
end user is permitted to record changes in a cube’s
write-back table only if the user belongs to a role with
read/write permissions assigned to the cube
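Storing each change as a difference while preserving the original cube values can be sketched as follows (the cell keys and helper functions are hypothetical, not OLAP Services APIs):

```python
# Original cube values are never modified by write-back.
cube = {("Widget", "Jan"): 100.0}

# End-user changes are stored as differences from the currently
# displayed value in a separate write-back table, which also serves
# as an audit trail of changes.
writeback = []  # list of (cell, delta) entries

def read_cell(cell):
    # Displayed value = original value plus all recorded deltas.
    return cube[cell] + sum(d for c, d in writeback if c == cell)

def write_cell(cell, new_value):
    current = read_cell(cell)
    writeback.append((cell, new_value - current))

write_cell(("Widget", "Jan"), 120.0)   # user's first what-if change
write_cell(("Widget", "Jan"), 90.0)    # a second change

print(read_cell(("Widget", "Jan")), cube[("Widget", "Jan")])  # 90.0 100.0
```

Discarding the write-back entries restores the cube to its original state, matching the behavior described above.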
Data and Structural Changes
-
data additions can be managed by carefully defining
partition filters, and by designing a strategy to
synchronize OLAP and data warehouse data
-
changes to correct errors in a data warehouse can be
minimized by applying care during the data transformation,
validation, and scrubbing operations
-
changes to correct errors in basic data should be
incorporated in the source database and then migrated to the
data warehouse in a controlled manner
-
changes to data in the fact table can affect the
accuracy of queries to a cube until the cube is processed.
The refresh data processing method can be used to reload the
cube’s data and recalculate the aggregations
Visibility to clients during
changes
-
when a cube that is currently online is processed, it
remains online until the processing has been completed, at
which time the online cube is replaced by the new cube
version
-
when a cube is processed using the full process method,
online clients will be disconnected from the cube when the
switch is made to the new version of the cube, and the
clients must individually reconnect to access the new
version
-
when a cube is processed using either the incremental
update or the refresh data method, online clients will not
be disconnected from the cube when the processing completes.
The new version of the cube will be immediately visible with
no break in service
-
if a shared dimension is processed using the rebuild the
dimension structure method, all cubes that use the dimension
will immediately become unavailable to clients and must be
processed before they can be used again
-
if a shared dimension is processed using the incremental
update method, cubes that use the dimension remain available
to clients and any new members added to the dimension
automatically become available to clients when the dimension
processing is complete
-
dimension hierarchies can be affected by changes to data
in the data warehouse dimension tables, because a hierarchy
is based on relationships between members in a dimension
table. In this case, the dimension must be rebuilt using the
rebuild the dimension structure processing method
-
structure of OLAP cubes and dimensions can be affected
by changes to the design of the data warehouse such as the
addition, deletion, or alteration of tables, or
relationships between tables. In such cases, you must modify
the design of affected cubes and dimensions, redefine
partitions and aggregations, and completely process the
modified cubes and dimensions
-
for each OLAP server, a repository is created to store
metadata for multidimensional objects such as cubes,
dimensions, and so on. This is a Microsoft Access (.mdb)
database at \Program Files\OLAP Services\Bin\msmdrep.mdb.
You can use the Migrate Repository wizard to migrate this
repository to a SQL Server (.mdf) database, but the process
cannot be undone
Security
-
security is controlled via access control rights that
are managed using roles
-
enabled only if the server is installed on the Windows
NT NTFS file system, which enforces access control lists
-
three levels of access control
-
Read - supports browsing of data and data
structures (metadata); does not allow modifying or
processing data
-
Read/Write - all Read functionality, plus the
ability to modify data in cubes designated and enabled for
write-back
-
Admin - by default, the user account used to
install the server on a particular computer has Admin
privileges on that computer. Only members of this group
can start the OLAP Manager user interface and use
administrative functions, including security
management
Changing a User’s Access Rights
-
time that elapses between a change to a user’s access
rights in OLAP Services and the actual effect of the change
depends on the value of the Auto Synch Period initialization
property
-
if the Auto Synch Period property is set to null or 0
(zero), synchronization does not occur at a constant
interval but occurs due to users’ actions
-
if the Auto Synch Period property is set to a non-null,
non-zero value, at the specified interval, users’ logon user
names and authorizations are compared to their access rights
defined in OLAP Services, so that changes to a user’s access
rights that occurred since the last synchronization take
effect immediately
Roles
-
map Microsoft® Windows NT® user accounts and user groups
to security category assignments for cubes
-
created and managed at the database level via the Manage
Roles dialog box
-
the service name for OLAP Services is
MSSQLServerOLAPService. The logon account associated with
this service must have permissions to access data sources
that OLAP Services administrators can access through the
OLAP Manager
Building Custom Applications
-
custom applications can be created to manage OLAP
server, create and maintain OLAP objects such as cubes,
dimensions, and security roles, extend user interface and
connect to the OLAP server, query data in cubes, and create
local cubes
-
OLAP Services recognizes both the SQL dialect and the
multidimensional expressions (MDX) dialect
-
MDX - multidimensional expressions (MDX) for
manipulating multidimensional data, supporting MDX functions
in the definitions of calculated members, and in a full
language implementation for building local cubes and
querying cube data using PivotTable® Service with OLE DB and
Microsoft ActiveX® Data Objects (ADO)
-
Cube editor's Calculated Member Builder can help
developing MDX
-
Decision Support Objects (DSO) enable you to create
applications that enhance, augment, and automate your OLAP
installation
-
Add-in Programs - you can create and register add-in
programs that will be called by the OLAP Add-In Manager in
response to user activity in the OLAP Manager user
interface. They can optionally use Decision Support Objects
(DSO) to manage server objects, and multiple add-in programs
can be registered
Nature of PivotTable Service
-
OLE DB provider that supports the optional OLE DB for
OLAP extensions introduced in OLE DB 2.0 and ADO 2.0
-
in-process
-
designed to provide online and offline data analysis and
online access to OLAP data
-
functions as a client of OLAP Services
-
can work with only a single local cube partition
Functions of PivotTable
Service
-
connects to OLAP Services as a client component
-
functions as a service provider for relational data
sources to implement multidimensional functionality
-
exposes the OLE DB interfaces with OLAP extensions
-
functions as a tabular data provider by supporting a
subset of SQL
-
functions as a multidimensional data provider by
supporting MDX
-
enables client applications to create local cubes
directly from a relational data source
-
functions as a mobile desktop OLAP client that enables
users to download from data sources and store the data in a
multidimensional structure on a local computer for offline
analysis
Performance Tuning
-
database engine is largely self-tuning – you should
let SQL Server take care of most of the tuning work
-
increase the amount of RAM - access to data in RAM
cache is much faster than access from disk
-
create and maintain good indexes - ensure good
indexes are created and maintained. Please refer to the SQL
Server 7 Database Design Cramsession regarding Clustered and
Non-clustered indexes, as well as the purpose of Fill
Factor. Keep in mind that, because a clustered index
physically orders the table data, clustered indexes are much
better than non-clustered indexes for queries that match
columns or search for ranges of columns that are mostly
non-unique.
-
Monitor disk I/O subsystem performance – ensures that
database server is running without disk queuing
-
Application and Query Tuning – use SQL Server
Profiler to monitor and log a SQL Server's workload, then
submit the logged workload to SQL Server Index Tuning Wizard
to make appropriate index changes
-
SQL Server Performance Monitor - detects bottlenecks
with the revised set of Performance Monitor objects and
counters
-
SQL Server Query Analyzer and Graphical ShowPlan -
visually analyze problematic SQL queries and statistics I/O
-
Max Async I/O - with RAID (Redundant Array of
Inexpensive Disks) attached to a database server that is
capable of very high disk I/O transfer rates, max async I/O
should be set higher. The goal is to make Checkpoint fast
enough to finish before another checkpoint is needed.
Command to use is in SQL Server Query Analyzer:
"sp_configure 'max async io', <value>"
-
Worker Threads - total number of these threads
available to service all incoming command batches is
dictated by the setting for the sp_configure option max
worker threads. If not enough threads are available, thread
sharing will occur, which will negatively affect
performance
-
LazyWriter - aims to produce free buffers during
periods of low disk I/O so that disk I/O resources are
readily available for use. If LazyWriter has trouble keeping
the free buffer count steady, or at least above zero, it is likely
that the disk subsystem is not able to provide LazyWriter
with the disk I/O performance that LazyWriter needs. You
should then add more physical disk drives. Also, you can
adjust LazyWriter disk I/O request behavior with the use of
max async I/O
-
Checkpoint – writes dirty pages out to the SQL Server
data files in order to even out SQL Server disk I/O activity
over a longer time period. You may adjust Checkpoint's dirty
page flushing behavior with max async I/O. For example, if
disk queuing occurs at unacceptable levels, decrease max
async I/O and add more disks to the disk subsystem
-
DBCC SHOWCONTIG can be used to reveal whether
excessive page splitting has occurred on a table - Scan
Density should be as close to 100 percent as possible, or
rebuild the clustered index on that table using the
DROP_EXISTING option to defragment the table
-
FILLFACTOR on the CREATE INDEX and DBCC REINDEX
commands allow you to specify what percentage of open space
to leave on index and data pages. Please note that PAD_INDEX
option is for FILLFACTOR on the non-leaf-level index pages.
Optimal value for FILLFACTOR depends upon how much new data
will be inserted within a given time frame into an 8-KB
index and data page
-
reads tend to outnumber writes in OLAP systems -
FILLFACTOR should be set at 100 percent so that all index
and data pages are filled completely for maximum I/O
performance
Other Issues
-
Linked server allows SQL Server to execute commands
against OLAP Services using the OLE DB provider for
OLAP
-
When setting up a linked server to OLAP Services, keep
in mind that:
-
If login is made to SQL Server using a SQL Server
security login name and password, set up the SQL Server
services to run under either a local or domain user
account rather than the SYSTEM account. If login is made
to SQL Server using Microsoft Windows NT authentication,
SQL Server passes the credentials of the Windows NT
account to OLAP Services
-
Either the SQL Server services startup account or
the Microsoft Windows NT account should have access to
OLAP services. Make this account part of the 'OLAP
Administrator' local group on OLAP server computer, or
create a database role within OLAP Manager and provide
this role with read/write access to the cubes.
-
OLAP Services internally uses the NT computer name
for licensing information. Services may fail to start after
you rename the Microsoft Windows NT computer name
-
Re-run the SQL Server OLAP Services setup so that
it uses the latest Windows NT computer name. Remember to
reinstall any service pack that was in use
-
Change the Windows NT computer name back to the
original computer name
-
When using OLAP Services to pull data from SQL Server
using trusted or Windows NT authentication security, the
MSSQLServerOLAPService service must be configured to run
under a domain or local user account; otherwise, OLAP may
fail to process dimensions or cubes. By default
MSSQLServerOLAPService service runs under a local system
account
-
Apart from being a member of the OLAP Administrators
group, the MSSQLServerOLAPService account must have the
appropriate permissions on SQL Server: permissions required
will vary depending on the type of storage structure
selected
-
When using MOLAP storage, the
MSSQLServerOLAPService account must have at least SELECT
permissions on source database.
-
If you use ROLAP or HOLAP storage, the
MSSQLServerOLAPService account must have at least SELECT
and CREATE TABLE permissions on source database
-
SQL Server 7.0 can perform queries against OLE DB
providers by using the OPENQUERY or OPENROWSET Transact-SQL
functions or by using a query with four-part names including
a linked-server name. Query is limited to the abbreviated
SELECT syntax supported by OLAP Services, but can include
MDX syntax to return "flattened rowsets"
-
When a client application connects to SQL Server and
performance is poor, examine how the query and schema can be
optimized - especially the star/snowflake schema, including
schema design, indexing, key relationships, referential
integrity, and insufficient data scrubbing
-
One possible solution to optimize schema: before
processing an OLAP Services cube, declare a primary key in
each dimension table. Then
-
For star dimensions, declare foreign key (FK)
relationships between each dimension table and the
corresponding fact table.
-
For snowflaked dimensions, you also declare FK
relationships between each secondary dimension table and
the primary dimension table that it augments.
-
Define indexes on each of the primary keys in the
dimension tables and in the fact tables, and also on each
of the foreign keys in the fact table(s)
-
Perform clean up: remove all NULLs for data items
that are being moved into OLAP Services
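The steps above (declare primary keys, foreign-key relationships, and indexes; eliminate NULLs) can be sketched in portable SQL via SQLite. The actual Transact-SQL DDL for SQL Server 7.0 differs in detail, and all table names here are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
-- Primary key declared in the dimension table.
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL  -- clean-up: no NULLs in data moved to OLAP
);
-- Star dimension: FK relationship from the fact table to the dimension.
CREATE TABLE fact_orders (
    order_key INTEGER PRIMARY KEY,
    customer_key INTEGER NOT NULL
        REFERENCES dim_customer(customer_key),
    amount REAL NOT NULL
);
-- Index on the fact table's foreign key.
CREATE INDEX ix_fact_orders_customer ON fact_orders(customer_key);
""")

index_names = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(index_names)
```

For a snowflaked dimension, the same pattern extends with FK relationships from each secondary dimension table to the primary dimension table it augments.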