Friday, 1 June 2007

Transforming Data into Information

Data in SOA, Part I: Transforming Data into Information

Data and data management are key aspects of nearly every enterprise software solution. SOA is no exception. Effective data modeling and management are an essential part of successful SOA realization. To take your data to the next level you need to transform it into information; to take your information to the next level you need to transform it into knowledge.

This article is the first in a series of two articles on “Data in SOA: Transforming Data into Knowledge.” In this article I describe an approach to transforming data into information in SOA as part of an overall SOA transformation plan, with a definition of a SOA Reference Architecture (SOA RA), and the realization of an enterprise SOA. In Part II of this series I describe an approach to transforming information into knowledge for SOA as an extension to an overall SOA transformation plan and a high-value expansion of an enterprise SOA RA.
Why Data?

Data are ubiquitous (data is plural; datum is singular, though both plural and singular verbs can be used with "data"). At their core, most IT efforts are focused on collecting, distributing, and managing data: providing data when it's needed, where it's needed, how it's needed, and for whoever (with proper authorization) needs it. Some may recall that long before the term IT ("information technology") was coined, most enterprises called their "computer departments" and activities DP, or “Data Processing.”

With all the technology waves past, present, and into the foreseeable future, one constant has remained: data. The same data that were (and still likely are) processed by mainframes have also likely been processed by one or more of client-server, CORBA/DCOM, Java EE, .NET, Web services, SOA, and Web 2.0. Over time, the storage, formats, and transports may have changed, and how the data is processed has changed, but the "data" remain (and are growing). In essence, all the industry technology waves have one thing in common: they are new or improved ways to process data. Data are fundamental. If you agree with my premise that data are fundamental to enterprise solutions, it follows that data (and data modeling/management) are also a priority consideration for enterprise architects in SOA (and Web 2.0).
What are Data?

Let's start by selecting your favorite dictionary definition for "data," and then augment it. For the purpose of this article, data are the elemental, atomic, or low-level aggregation of pieces of "information" with some structure (form), relations, and state, but no behavior. For example, an Address table with columns for Street Address, City, and so on, is data, as is the definition of an Address in a Customer table. Data are structure and state without behavior. Data are the raw building blocks from which we may construct information. Data are the prerequisite for information.
What is Information?

Again, choose your favorite definition, and then augment it. For the purpose of this article, information is the aggregation of data and the fundamental logic that provides additional form, the basic relations, and syntactic and semantic contexts; that is, it is state and core model behavior. An example of such logic is a correctness check that ensures a ZIP code is valid and consistent with the City. Information extends data by providing the ability to map, or relate, data, and define logic for the behavioral models consistent with the domain (syntax and semantics) context. Information is based on and requires data. In other words, information represents entities (subjects, objects) that encapsulate both state (data) and behavior (logic). You may consider information as being analogous to an instance of a model class in object-oriented programming, which contains both data members (instance variables) that hold state and methods that provide (model) behavior.
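The distinction can be sketched in a few lines of Python. This is only an illustration of the analogy above: the class names and the tiny `ZIP_TO_CITY` lookup table are assumptions made up for the example, standing in for whatever postal reference source an organization actually uses.

```python
from dataclasses import dataclass

# Hypothetical lookup table standing in for a real postal reference source.
ZIP_TO_CITY = {"95131": "San Jose", "10001": "New York"}


@dataclass
class AddressData:
    """Data: structure and state only, no behavior."""
    street: str
    city: str
    zip_code: str


@dataclass
class Address(AddressData):
    """Information: the same state plus core model behavior."""

    def is_valid(self) -> bool:
        # The ZIP code must be known and consistent with the city.
        return ZIP_TO_CITY.get(self.zip_code) == self.city


good = Address("1 Main St", "San Jose", "95131")
bad = Address("1 Main St", "New York", "95131")
```

Here `good.is_valid()` is true and `bad.is_valid()` is false: the state is the same kind of data in both cases, and the consistency rule is what turns it into information.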
The Value of Data in SOA

Organizations have different drivers, starting points, and priorities for defining and refining their SOA Reference Architecture (SOA RA), which may shift during their transformation to SOA. A holistic approach to the planning and design of a SOA RA should include the data services layer. This article uses the term data services layer to include both data and information access services.

Without an enterprise data services layer in your SOA RA, subsequent line-of-business (LoB) projects will be forced to develop individual "point," or one-off solutions, that are specific to each application. Few commonalities will be discovered, few opportunities for shared service definition, reuse, and consistency will be discovered, and the definition of a canonical data model will be elusive. There is a good chance that many of the benefits of SOA (and ROI) will take longer to realize, if they are realized at all. We’ve probably all read statistics that place project resource consumption on data integration tasks at anywhere from 50 to 85 percent of enterprise application software development! This anecdotal "fact" alone should be enough to ensure a data services layer is an integral part of any SOA realization. Combined with the obvious notion that our enterprise software solutions are primarily designed to process data, the value of data in SOA should also be apparent.

Figure 1 is a high-level conceptual view of BEA's SOA Reference Architecture and its layers. Note the presence of the data services layer as a first-class area, indicating the importance of the data services layer in a SOA RA.

BEA SOA Reference Architecture
Figure 1: SOA Reference Architecture layers

Data, data models, and data management are fundamental to SOA success. In fact, BEA values data services so highly that not only do we offer the AquaLogic Data Services Platform product, but data services are a fundamental part of many BEA Consulting service offerings, which include a Data Services Consulting Service where the focus is on SOA data and information layer planning, design, and development.
A Note About Data Access and Connectivity Services

Data access services refer to information sources often collectively known as Enterprise Information Systems (EIS) as well as databases and file systems. These can be legacy systems, systems of record, packaged commercial applications, customer, partner, and third-party applications and services, and Web services. What they have in common is that they provide data and/or information (which implies behavior in the context of this article) for consumption by other applications. In this sense, these applications, when accessed through the data services layer, are just another form or source of data. At a higher level of abstraction, data services would look the same to consuming applications, which is one of the primary goals (normalization/consistency) of the data services layer in a SOA RA. The fact that the interface exposed for consumption interacts with one or more databases, tables, back-end, legacy, shrink-wrapped, and/or external systems is an implementation detail encapsulated by the data services layer.
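As a rough sketch of that encapsulation, assume a single hypothetical `DataService` interface with a `get()` operation. The two implementations below (a simulated table and a simulated legacy system returning fixed-width records) are invented for illustration; the point is only that the consumer function cannot tell them apart.

```python
from typing import Protocol


class DataService(Protocol):
    """The one interface every consumer sees (hypothetical)."""

    def get(self, key: str) -> dict: ...


class CustomerTableService:
    """Wraps a simulated relational table held as a dict of rows."""

    def __init__(self, rows: dict):
        self._rows = rows

    def get(self, key: str) -> dict:
        return dict(self._rows[key])


class LegacyBillingService:
    """Wraps a simulated legacy system that returns fixed-width records."""

    def __init__(self, records: dict):
        self._records = records

    def get(self, key: str) -> dict:
        record = self._records[key]
        # Assumed layout: 4-character ID followed by a padded name.
        return {"id": record[:4].strip(), "name": record[4:].strip()}


def describe(service: DataService, key: str) -> str:
    """A consumer that depends only on the shared interface."""
    row = service.get(key)
    return f"{row['id']}: {row['name']}"


table = CustomerTableService({"C001": {"id": "C001", "name": "Acme"}})
legacy = LegacyBillingService({"C001": "C001Acme      "})
```

Both `describe(table, "C001")` and `describe(legacy, "C001")` return the same string, even though one source is a table and the other a fixed-width record.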

Connectivity services are about exposing applications and databases as application services in a standards-based manner.
Transforming Data into Information

So, your organization is planning a transformation to SOA. Investigation and planning on all layers and aspects of the SOA RA (see Figure 1) has started, and you have been tasked with the realization of the data services layer. Now what? Consider the following transformation steps:

1. Inventory existing data and system access assets
2. Determine dependency matrix
3. Establish baseline metrics/SLAs
4. Set asset priorities
5. Carry out data modeling
6. Carry out logical modeling
7. Set information rules
8. Establish application specializations

Figure 2 provides an example of a possible set of internal abstraction layers for an SOA RA data services layer, onto which we will map the requirements and capabilities from our eight steps:

Data Services Layer – Internal Layer Abstraction

Figure 2: Data services layer – internal layer abstraction

Based on your requirements and perspective, you may determine the need for a different set of abstraction layers. At the very least, you should separate the physical and logical layers and distribute your rule types accordingly.

Let's now look at each of these steps in more detail.
1) Inventory Existing Data and System Access Assets

The first step is finding out what is out there, that is, what are your current data and information system access assets. What data and information assets (referred to as simply "assets" for the remainder of the article), for example databases, information sources, and applications (meaning legacy, system of record) does your organization have? For each asset you will want to know the supporting metadata such as documentation, history, technology/tools/products/platforms, versions, ownership/management, location, security, and access mechanisms. Depending on the number of assets and their metadata, you may want to consider some sort of metadata catalogue or repository as well as a standard template or set of templates that captures the meta-information in a consistent manner and allows for search.
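A minimal sketch of such a catalogue follows. The `Asset` fields are a small, assumed subset of the metadata listed above, and the asset names, owners, and locations are invented; a real repository would carry far more and would likely live in a dedicated tool.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """A standard template for asset metadata (illustrative subset)."""
    name: str
    kind: str        # e.g. "database", "application"
    owner: str
    location: str
    access_mechanisms: list = field(default_factory=list)
    tags: set = field(default_factory=set)


class AssetCatalogue:
    def __init__(self):
        self._assets = {}

    def register(self, asset):
        self._assets[asset.name] = asset

    def search(self, term):
        """Return sorted names of assets whose metadata mentions the term."""
        term = term.lower()
        hits = []
        for asset in self._assets.values():
            haystack = [asset.name, asset.kind, asset.owner, asset.location,
                        *asset.access_mechanisms, *asset.tags]
            if any(term in text.lower() for text in haystack):
                hits.append(asset.name)
        return sorted(hits)


catalogue = AssetCatalogue()
catalogue.register(Asset("CustomerDB", "database", "Sales IT", "dc-east",
                         ["JDBC"], {"customer"}))
catalogue.register(Asset("BillingMainframe", "application", "Finance IT",
                         "dc-west", ["MQ"], {"billing", "system of record"}))
```

Because every asset is captured with the same template, a simple case-insensitive search such as `catalogue.search("jdbc")` works across all of them.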
2) Determine Dependency Matrix

Once you have started or created the asset catalogue, the second step is to determine the dependency matrix. The dependency matrix, also part of the asset meta-information, captures information on who uses the asset, when they use it, how often, what they do with, or to, the asset (for example, CRUD), and where and how they use it (that is, what type of access: batch, online, real time, reporting). It is also important to understand why a consumer uses a particular asset, as that will help with task prioritization as well as provide requirements for your emerging data models.

Once you have captured the "who, what, where, when, how, and why" for each known consumer of an asset, you can start to analyze and form generalizations across all asset consumers. The goal is to find opportunities for simplification and reuse by transforming existing assets into SOA Building Blocks. These include, but are not limited to, assets in a service-oriented, self-describing, discoverable form that can be readily utilized in an SOA ecosystem using open, common, industry, and/or organization standards.
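One possible way to hold the dependency matrix and start generalizing across consumers is sketched below. The consumers, assets, and volumes in `USAGE` are made-up sample values, and "consumers that only read" is just one example of a pattern worth looking for.

```python
from collections import defaultdict

# (consumer, asset, CRUD operations, access type, calls per day).
# All values are invented for illustration.
USAGE = [
    ("OrderEntry", "CustomerDB", "CR", "online", 5000),
    ("Billing", "CustomerDB", "R", "batch", 1),
    ("Reporting", "CustomerDB", "R", "reporting", 20),
    ("Billing", "InvoiceSystem", "CRUD", "online", 800),
]


def build_matrix(usage):
    """Map asset -> consumer -> usage details."""
    matrix = defaultdict(dict)
    for consumer, asset, ops, access, per_day in usage:
        matrix[asset][consumer] = {
            "ops": set(ops),
            "access": access,
            "per_day": per_day,
        }
    return matrix


def read_only_consumers(matrix, asset):
    """Consumers that only read an asset: candidates for one shared read service."""
    return sorted(c for c, use in matrix[asset].items() if use["ops"] == {"R"})


matrix = build_matrix(USAGE)
```

In this sample, `read_only_consumers(matrix, "CustomerDB")` picks out Billing and Reporting, which hints that a single shared read service might serve both.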

One definition contained within the set of SOA Building Blocks is your definition of a service. What standards and specifications, and their versions, will be used? For example, specific versions of WSDL, SOAP, UDDI, WS-Security, WS-I Basic Profile, WS-Addressing, XML, and XSD may be required, while others may be optional/recommended. Your data and information access assets will likely take a form consistent with your basic SOA Building Block definition of a "service." (Using your favorite search engine, search on the topics of “Service Identification” and “Service Definition,” which cover this area.)
3) Establish Baseline Metrics/SLAs

Each catalogued asset, since it already exists in some form, should have estimated or actual production usage statistics, including transaction volume, patterns, concurrent users, reliability, availability, scalability, and performance (RASP) information.

Usage information is also a great indicator of business and IT value and priority. This baseline information is used to define a set of metrics that will form the basis of Service Level Agreements (SLAs) and allow for goal definition and tracking over time. Metrics, as well as current production information, are invaluable in sizing and capacity planning of both hardware and software to support the data services layer in SOA. Be sure your SLAs are bidirectional; that is, the service provider defines its SLA terms, conditions, and penalties for each consumer, and consumers are expected to abide by the agreement.

For example, an agreement states that Consumer A may perform a maximum of 100 get() requests on DataServiceXYZ (the asset/service provider) per day (where a day is defined as a 24-hour period starting at 12:00 midnight GMT) and the response time per request is to be <= 2 seconds. If Consumer A sends more than the agreed maximum get() requests, then the service provider is able to apply the penalties as defined in the agreement. There are corresponding expectations on the service provider. Should Consumer A stay at or beneath their request maximum, the service provider must provide a response time <= 2 seconds or face commensurate penalties defined in the agreement.
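The bidirectional agreement in this example could be checked with something like the following sketch. The `Sla` class and `evaluate_day` function are hypothetical names, and the code assumes that the provider's response-time obligation is only evaluated when the consumer stayed within its quota, which is one reading of the example above rather than a general rule.

```python
from dataclasses import dataclass


@dataclass
class Sla:
    max_requests_per_day: int = 100
    max_response_seconds: float = 2.0


def evaluate_day(sla, response_times):
    """Return the breaches for one day.

    response_times holds the response time in seconds of each get()
    request the consumer made during that day.
    """
    breaches = []
    if len(response_times) > sla.max_requests_per_day:
        # Consumer side of the agreement.
        breaches.append("consumer exceeded request quota")
    elif any(t > sla.max_response_seconds for t in response_times):
        # Provider side: assumed to apply only while the consumer is within quota.
        breaches.append("provider exceeded response time")
    return breaches
```

For instance, `evaluate_day(Sla(), [0.5] * 101)` reports a consumer breach, while `evaluate_day(Sla(), [0.5, 2.5])` reports a provider breach.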

Metrics and SLAs define the expectations and rules of engagement that affect the basis of the value, goal, and sizing of each asset. Track your baseline metrics, SLAs, and reuse to establish a cost and benefits model.
4) Set Asset Priorities

With the preceding set of information captured to some degree, it should be possible to start evaluating each asset in the context of all the other cataloged assets, that is, to assign each asset a priority. A good heuristic is to have at least three and no more than ten priority levels (ten is already excessive); fewer will be inadequate and more will be unmanageable.

Priority assignments are designed to assist in the identification of the most important assets based on utilization and the value of the business functions supported. You should design a set of metrics (including those in Step 3) and definitions that provide for empirical comparison and evaluation of each asset to determine its priority assignment. Assigning asset priorities will help determine possible project starting points, potential business/IT sponsors, and relative business value.
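One simple way to make the comparison empirical is a weighted score per asset, followed by ranking into a fixed number of priority levels. The metric names, the weights, and the sample values below are all assumptions chosen for illustration; each organization would define its own.

```python
# Hypothetical weights; each metric is assumed to be pre-normalized to 0..1.
WEIGHTS = {"daily_transactions": 0.5, "consumers": 0.3, "business_value": 0.2}


def score(metrics):
    """Weighted sum of normalized metrics for one asset."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())


def assign_priorities(scores, levels=5):
    """Rank assets by score and spread them over levels 1 (highest) to `levels`."""
    if not 3 <= levels <= 10:
        raise ValueError("use between three and ten priority levels")
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {name: 1 + (i * levels) // len(ranked) for i, name in enumerate(ranked)}


ASSETS = {
    "CustomerDB": {"daily_transactions": 1.0, "consumers": 0.9, "business_value": 1.0},
    "InvoiceSystem": {"daily_transactions": 0.4, "consumers": 0.3, "business_value": 0.8},
    "ArchiveFiles": {"daily_transactions": 0.1, "consumers": 0.1, "business_value": 0.2},
}
scores = {name: score(metrics) for name, metrics in ASSETS.items()}
priorities = assign_priorities(scores, levels=3)
```

With these sample numbers, CustomerDB lands in level 1, InvoiceSystem in level 2, and ArchiveFiles in level 3.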

Using all of the preceding information, a "current reality" snapshot for each asset can be established, documented, and tracked as these assets are transformed into SOA building blocks. Across all catalogued assets, the top-rated highest priority assets should be selected for the remaining set of steps. The actual number selected depends on your risk assessments, priority valuation, business/IT goals, resources, and similar factors.

5) Data Modeling

Starting with the first selected asset (I recommend doing one asset end-to-end first, perhaps not the highest priority either, as this allows you to exercise the governance and data services layer’s SDLC process in a more controlled and manageable manner), review the existing physical aspects. For a database or set of tables, consider the various queries that are used by consumers, any logic stored in the database (stored procedures and triggers), and any side-effect actions. This forms the physical data asset definition and description. For information access, what is used: MOM, third-party adapters, proprietary integrations, or point-to-point custom integration? This forms the physical information asset definition and description.

As the data services layer forms an integral part of an overall SOA Reference Architecture, the definitions and requirements for an SOA building block should be defined. There is likely a gap between your asset's current state and the SOA RA building block goal state. The first order of business is to bring the current physical asset as close to your SOA Building Block goal state standard as possible. You may recall the previous discussion regarding the definition and description of a "service" for your SOA Reference Architecture. For simplicity, let's say your definition of a service requires WSDL and document-style SOAP, with documents defined using XSD. Other recommended specifications include WS-Addressing and XQuery/XPath. With this definition, we need to consider how to transform or map tables in a relational database, XML data, and/or information access systems into a set of services that meet our building block service definition criteria.

There are various tools and technologies to map existing data and information access assets into a physical data layer in Figure 2 to define logical service models consistent with your specific requirements and definition of a service. BEA's AquaLogic Data Services Platform (ALDSP) is our realization technology for transformation of data/information access assets into SOA building blocks (data services), which provides a standards-based, service-oriented data services layer for your SOA Reference Architecture.

Once you import your physical assets (regardless of their interface and implementation), you have what is known as the physical data services layer (refer to Figure 2). Services in the physical data services layer have a consistent look, feel, and representation; that is, the underlying implementation details and communication protocols are abstracted, encapsulated, and removed from view (though you may still go "under the covers" when required), leaving only the asset definition (service definition) and operational information. Now that you have your "data," it is time to define your logical model.
6) Logical Modeling

The goal of the logical model is to abstract, integrate, normalize, and manage the aggregation of one or more physical data services. These actions may be abstracted into two logical layers: the logical data normalization layer and the logical data integration layer, as shown in Figure 2, which also have a set of applicable rules: management rules, data rules, integration rules, and business rules.

Before we go further, it is worth noting that ALDSP allows for any number of logical layers that are required to support your logical abstraction design requirements. The logical layers are design-time-oriented only; their purpose is to allow designers and developers to separate and layer logical models and concerns effectively. These logical layers are not part of the runtime deployment—that is, even though there may be several logical layers in design, they do not correspond to a set of indirection layers at runtime. They are flattened and optimized into a single runtime layer. Development and operational staff can view the runtime artifacts and the optimizations and make modifications as they deem necessary.

You may define a different set of criteria and factors as the basis of your logical model layers than the ones I use here. For example, there may be a single layer that contains all of your logical abstractions, or you may have several logical layers. Too few logical layers may prove to be limiting and potentially lead to an increase in complexity over time. At a minimum, you should define a set of criteria that determines your logical abstraction layers and what they contain.

For example, you may have a logical abstraction that performs the normalizations as I show in Figure 2. The logical data normalization layer allows you to "clean up" and simplify any complex or confusing information. It is often difficult, if not impossible, to change the physical structure of existing databases or other systems over which you do not have direct ownership or responsibility, or changes at that level are simply not practical. The logical data normalization layer provides this opportunity to reengineer without forcing changes in the physical data layer. (If you need more information, a Web search on "data normalization" will explain what it is and what it entails.) The logical layer provides a model design that may be used as a future physical data and information model as the systems that use the data sources directly are updated or retired. The goal of logical data services is to provide a service model that is much easier to use, more understandable, and potentially more reusable by higher-level shared services and consuming applications.
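As a small illustration of "cleaning up" without touching the physical layer, the sketch below maps a row with cryptic physical column names onto a cleaner logical record. The column names in `COLUMN_MAP` are invented, and this is plain Python rather than anything ALDSP-specific.

```python
# Hypothetical physical column names mapped to logical field names.
COLUMN_MAP = {
    "CUST_NM": "name",
    "CUST_ST_ADDR_1": "street",
    "CUST_ZIP_CD": "zip_code",
}


def normalize(physical_row):
    """Build a logical record; the physical row itself is left untouched."""
    logical = {}
    for physical_name, logical_name in COLUMN_MAP.items():
        value = physical_row.get(physical_name)
        # Trim padding that fixed-width physical columns often carry.
        logical[logical_name] = value.strip() if isinstance(value, str) else value
    return logical


row = {
    "CUST_NM": "  Acme Corp ",
    "CUST_ST_ADDR_1": "1 Main St",
    "CUST_ZIP_CD": "95131",
    "FILLER_07": "x",
}
logical = normalize(row)
```

Consumers work with `logical`, while `row` (and the system behind it) stays exactly as it was; columns with no logical meaning, such as the filler field, simply do not surface.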

Steps 5 and 6 may be reversed. The key is to ensure your logical models are not overly constrained by the current physical assets. In other words, while your logical models will utilize physical data services, do not let the limitations of those current physical assets limit your logical models or exert undue influence on your overall data services layer design. The physical assets are a starting point upon which to build richer, more expressive models.
7) Information Rules

Rules and rule processing are how data become information. Rules and rule processing provide relations, semantics, and behavior in the data services layer. As shown in Figure 2, there are several categories of rules:

* Management rules provide any requirements and/or restrictions on using the system and data assets that form the physical data layer. This can include security, access windows (dates/times), caching, metadata, transactions, and any side effects or ancillary actions (for example, logging and auditing) that need to be performed.

* Data rules provide validation, consistency, cross-checking, and any other rules associated with data accuracy and consistency. They may also provide cache management and other side effects in the physical or logical models. Data rules are at the table, row, column, and field level.

* Integration rules provide mappings and consistency across logical and physical data layers. Integration maps higher-level abstractions to their corresponding logical or physical layers. For example, a Customer ID at a higher-level abstraction, defined as part of a new canonical data model, is converted from and to the several underlying native forms used by the customer databases and/or back-end systems. Integration rules are at the system and/or database layer.

* Business rules provide meaningful business relations and some business logic, that is, behavior. In object-oriented programming, consider the state and behavior encapsulated in your model objects. Business rules perform a similar behavioral role in data services. Business rules capture business processing logic at the data model layer. This logic is fundamental to the business entity’s very definition and to its relations with other business entities; it is intrinsic to the entity across all utilizations, that is, at an enterprise-wide, or at least a division-wide, scope. Some of these rules are defined in the canonical model, while others are defined in the application specialization models.
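The Customer ID example for integration rules can be sketched as a pair of conversion functions. The canonical format (`CUST-` plus six digits), the two system names, and their native formats are all invented for illustration.

```python
def to_native(canonical_id, system):
    """Convert a canonical ID such as 'CUST-000042' to a system's native form."""
    number = int(canonical_id.split("-", 1)[1])
    if system == "crm":
        return f"C{number}"        # assumed CRM form, e.g. C42
    if system == "billing":
        return f"{number:08d}"     # assumed billing form, e.g. 00000042
    raise ValueError(f"unknown system: {system}")


def to_canonical(native_id, system):
    """Convert a native ID back to the canonical form."""
    if system == "crm":
        number = int(native_id[1:])
    elif system == "billing":
        number = int(native_id)
    else:
        raise ValueError(f"unknown system: {system}")
    return f"CUST-{number:06d}"
```

Keeping both directions next to each other makes it easy to verify that a canonical ID survives a round trip through each back-end form.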

8) Application Specializations

Once you have completed your logical model, you have effectively defined a canonical information model. The definition of this model completes the initial design of your information model, meaning you have effectively started to transform your data into your information. There is one final step that further refines your information model: application specializations.

Many consuming applications will be able to use the canonical information model directly, but not all. Application specialization provides an abstraction layer for consuming applications to define their own logical model specific to their requirements.

Application specializations encapsulate the additional information model state and behavior required by consuming applications, which simplifies the consuming applications' utilization of the canonical information model assets. Since application specializations are unique to each consuming application, or a set of related business applications, there is no need to include them in the canonical information model. If application specializations have a larger scope (for example, across divisions or the enterprise), then they should be part of the canonical information model.
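A specialization can be pictured as a thin wrapper that adds application-specific behavior on top of a canonical entity without changing it. Both classes below are hypothetical: `CanonicalCustomer` stands for an entity from the canonical information model, and `ShippingCustomerView` for the needs of an imagined shipping application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalCustomer:
    """An entity from the canonical information model (illustrative)."""
    customer_id: str
    name: str
    street: str
    city: str
    zip_code: str


class ShippingCustomerView:
    """Specialization for a hypothetical shipping application."""

    def __init__(self, customer):
        self._customer = customer

    def shipping_label(self):
        # Behavior only the shipping application needs, so it stays out
        # of the canonical model.
        c = self._customer
        return f"{c.name}\n{c.street}\n{c.city} {c.zip_code}"


customer = CanonicalCustomer("CUST-000042", "Acme Corp", "1 Main St",
                             "San Jose", "95131")
label = ShippingCustomerView(customer).shipping_label()
```

If several divisions ended up needing the same label logic, that would be the signal, per the guidance above, to promote it into the canonical model.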
Conclusion

Creating the data services layer for your SOA Reference Architecture and defining the canonical information model for your organization is a challenging task, often with little glory, and it is difficult to do well. Following the approach described in this article should provide enough information for you to plan, assess, and begin designing your SOA transformation in the data layer, and to begin transforming your organization's data into information. The actual planning, design, and development of your SOA Reference Architecture's data services layer depend on a number of factors that are specific to your organization or situation and well beyond the scope of this architecture article.

Now that we have started transforming our data into information in our SOA ecosystem, we can think about transforming our information into knowledge. The second and final article in this series, "Data in SOA, Part II: Transformation of Information into Knowledge," will describe the steps for this transformation.