expressor semantic data integration blog

As with any enterprise software, there are a number of criteria that determine if it is perceived to be easy to use relative to competitive offerings.

In this blog entry, I’d like to examine some of the ease of use criteria that any ETL/data integration software should adhere to. Later on, I’d also like to highlight some important ease-of-use issues that have not been addressed by mainstream ETL software vendors — mainly because of the limitations of their underlying software architectures.

So here’s my list of expected ease-of-use features:

1. The data integration software shall be downloadable and installable within minutes. This is a very important ease of use aspect for any nimble project team that doesn’t have dedicated resources for various project roles. The idea that it could take hours to install products like IBM DataStage indicates that these types of systems have become so bloated and complex that they can’t be easily adopted any longer by smaller teams in mid-size organizations or even in smaller projects in G2000 enterprises. And more moving pieces imply greater infrastructure complexity – more things that can go wrong with the installation and less visibility into exactly what might have gone wrong.

2. The data integration software shall provide a modern, industry-standard look and feel. The software should emulate how best-of-breed applications on Windows and non-Windows environments utilize and support the various platform features of the operating system and associated GUI frameworks. It isn’t about whether the ETL GUI tools are implemented in .NET or Eclipse/Java; it is more about how each vendor utilizes the UI features available on that platform. Unfortunately, some of the existing mainstream ETL tools have not kept up with this criterion, and one could look at their GUI tools and perceive them to be somewhat outdated. Having an industry-standard look and feel also makes the product less intimidating to new and/or casual users.

3. The data integration software shall provide graphical support for most of the functions of the software, and should only revert to scripting when necessary. I think that most of the commercial tools on the market do a fairly decent job on this front. But don’t assume that these tools provide first-class scripting support, which is paramount for anyone having to encode complex transformations that simply can’t be graphically expressed. Remember, once you leave the tool, compliance and reporting metadata can quickly become compromised.

4. The data integration software shall support a well-defined workflow, meaning that it guides users through the various stages of the design process. Again, don’t assume that today’s tools on the market do a good job on this front — you may be surprised how non-intuitive and non-workflow oriented some of the products actually are. The other side of this coin is that the workflow must be adjustable so that it can be conformed to the business, not the other way around.

5. The data integration software shall offer operators for all important types of data transformations. This should be an easy one to cover given that many DI vendors have been working on this problem for a long time. One would expect by now that each vendor has a complete set of transformation operators, but as it turns out, things like sophisticated grouping logic are still difficult to implement in many systems. The ability to rapidly create and deconstruct array structures is also important with the emergence of XML and its derivatives. A complete set of robust and easy-to-configure operators is paramount, not a nice-to-have.

6. The data integration software shall make it easy to connect to a wide variety of data sources. Connecting to a data source should require nothing more than point-and-click operations, and most vendors do make this easy.

7. The data integration software shall provide targeted interfaces for all important roles on an ETL project including ETL developers, analysts and data stewards. With a few exceptions, business users have been ignored for a long time in the data integration lifecycle. This has a lot to do with the fact that many of the ETL tools were designed as developer tools, and support for business users — if supported — has been an afterthought. It also speaks to the complexity that most tools are encumbered with.

8. The data integration software GUI(s) shall be the same for batch and low-latency operations. There is no technical reason to make the UIs or engines different. This has been and is only an issue for vendors who don’t use the same underlying processing engine for batch and low-latency data processing. Users clearly shouldn’t have to become familiar with different UIs or engine characteristics for different DI use cases.

9. The data integration software shall provide good error reporting capabilities. As simple as this requirement can be stated, it is unfortunately true that various ETL systems haven’t fully coped with this problem and provide you with annoying error messages that are pretty useless in debugging your application.

10. The data integration software should allow you to work in an offline mode. It often happens that you can’t get access to your corporate systems and data when you’re at a remote site. What you really want is to be able to keep working on your data integration application while away from the office and synchronize your work later on when you are back online.

These are some of the basic requirements one would expect from any traditional, open-source or next-generation ETL product like expressor. But as I had stated earlier, the industry as a whole has been very reluctant to embrace new innovations that could further increase the ease of use of the ETL software.

Why is that, you ask? The short answer is that the traditional vendors would have to overhaul their systems to add this ease-of-use functionality and, quite naturally, have been reluctant to do so for lack of competitive pressure. But this is changing rapidly as expressor rolls out its expressor 3.0 product line later this year.

Here are additional requirements an easy-to-use, state-of-the-art ETL tool should adhere to in order to further remove the intrinsic complexities of typical data integration projects you are more than familiar with:

State-of-the-art data integration software shall shield users from the complexities of the underlying physical data structures. As a data steward or ETL developer you shouldn’t have to deal with the physical table column names or names in an XML file but rather be able to build your application with logical, abstracted business names that are much more meaningful to you and allow you to more easily communicate with your team members. Wouldn’t it be nice if you could tell your ETL software that “acc_no” in table A and “account_num” in table B mean the same thing, and that both names can be linked to a common business term called “account_number”?
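To make the idea concrete, here is a minimal Python sketch of such a name mapping. It is not how expressor implements its semantic layer; the dictionary layout, the source labels "table_a" and "table_b", and the function name are all assumptions made for illustration.

```python
# Hypothetical semantic layer: each physical column is linked to one business term.
SEMANTIC_MAP = {
    "table_a": {"acc_no": "account_number"},
    "table_b": {"account_num": "account_number"},
}

def to_business_names(source, record):
    """Return a copy of record keyed by business terms instead of physical names."""
    mapping = SEMANTIC_MAP[source]
    # Columns without a mapping keep their physical name.
    return {mapping.get(column, column): value for column, value in record.items()}

row_a = to_business_names("table_a", {"acc_no": "10042"})
row_b = to_business_names("table_b", {"account_num": 10042})
print(row_a)  # {'account_number': '10042'}
print(row_b)  # {'account_number': 10042}
```

Everything downstream of this step can refer to “account_number” only, no matter which table a record came from.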

State-of-the-art data integration software shall automatically do data type conversions for you. Isn’t it about time that, after close to twenty years of data integration software innovation, one wouldn’t need to guide the software to perform basic data type conversions? What you really want is that “acc_no” (represented as a string in table A) and “account_num” (represented as an integer in table B) are automatically mapped to an internal data type so that you as a user don’t have to worry about type conversions any longer. Have you ever had to worry about date/time formats between databases? Removing these conversions from the data flow makes it simpler to build, maintain, and comprehend – allowing developers and analysts to focus on the business requirements and not the technical side effects of external storage systems.
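As a rough illustration of what automatic conversion to an internal type could look like, consider the Python sketch below. The term “opened_on”, the table of internal types, and the two accepted date layouts are assumptions chosen for the example, not anything expressor documents.

```python
from datetime import date, datetime

# Hypothetical internal types for two business terms.
INTERNAL_TYPES = {"account_number": int, "opened_on": date}

# Date layouts this sketch accepts; a real tool would derive them from source metadata.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def coerce(term, value):
    """Convert a raw source value to the internal type declared for its business term."""
    target = INTERNAL_TYPES[term]
    if target is date:
        # datetime is a subclass of date, so it has to be checked first.
        if isinstance(value, datetime):
            return value.date()
        if isinstance(value, date):
            return value
        for layout in DATE_FORMATS:
            try:
                return datetime.strptime(value, layout).date()
            except ValueError:
                continue
        raise ValueError(f"unrecognized date value: {value!r}")
    return target(value)

print(coerce("account_number", "10042"))   # 10042
print(coerce("account_number", 10042))     # 10042
print(coerce("opened_on", "03/15/2010"))   # 2010-03-15
```

The string from table A and the integer from table B end up as the same internal value, and two differently formatted dates compare as equal.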

State-of-the-art data integration software shall allow you to assign constraints to your common business names and their associated data types. The software shall enable you to validate whether the actual data that flows through your application meets these constraints. If not, the software should allow you to take action on those data items by rejecting them, alerting you, or taking some other user-defined action.
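A constraint check of this kind might be sketched in Python as follows. The constraint table, the “balance” term, and the callback-style handling of violations are hypothetical and only meant to show the shape of the idea.

```python
# Hypothetical constraints attached to business terms, not to physical columns.
CONSTRAINTS = {
    "account_number": lambda v: isinstance(v, int) and 0 < v < 10**10,
    "balance": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def validate(records, on_violation):
    """Yield records that satisfy every constraint; hand the rest to on_violation."""
    for record in records:
        failed = [term for term, check in CONSTRAINTS.items()
                  if term in record and not check(record[term])]
        if failed:
            # The caller decides what happens: reject, alert, or something else.
            on_violation(record, failed)
        else:
            yield record

rejected = []
accepted = list(validate(
    [{"account_number": 10042, "balance": 250.0},
     {"account_number": -1, "balance": 10.0}],
    on_violation=lambda record, failed: rejected.append((record, failed)),
))
print(accepted)  # [{'account_number': 10042, 'balance': 250.0}]
print(rejected)  # [({'account_number': -1, 'balance': 10.0}, ['account_number'])]
```

Passing the action in as a callback keeps the rule definition separate from the decision about what to do with offending records.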

State-of-the-art data integration software shall allow you to define reusable business rules. Imagine defining a reusable rule that does currency conversions. The fact that business rules in traditional ETL tools are hardcoded into the transformation logic is one of the fundamental flaws of today’s ETL software.
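The currency conversion rule mentioned above could be written once and referenced from several flows, roughly as in this Python sketch. The exchange rates are made-up figures and the function signature is an assumption, not part of any product.

```python
from decimal import Decimal, ROUND_HALF_UP

# Illustrative exchange rates to USD; made-up figures, not market data.
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.25")}

def convert_currency(amount, source, target, rates=RATES_TO_USD):
    """Reusable rule: convert amount between two currencies via a USD pivot."""
    in_usd = Decimal(str(amount)) * rates[source]
    converted = in_usd / rates[target]
    return converted.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# The same rule is referenced from two different flows instead of being re-coded.
invoice_total_usd = convert_currency("80.00", "EUR", "USD")
refund_eur = convert_currency("100.00", "USD", "EUR")
print(invoice_total_usd)  # 100.00
print(refund_eur)         # 80.00
```

Because the rates are a parameter, a change in the rate table reaches every flow that uses the rule without touching the transformation logic.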

State-of-the-art data integration software shall allow you to organize reusable business rules and other project artifacts in shared projects and libraries — so they can be easily accessed by other users on the same and different projects. Wouldn’t you expect that the data integration software offers you capabilities similar to Microsoft Excel, which organizes all reusable functions in libraries and makes them easily accessible to you?

State-of-the-art data integration software shall support the concept of “synthetic debugging”. Rather than having to connect your application to the “real data feeds” to test a particular business rule, why shouldn’t you be able to test your rule(s) against a set of custom values you define, to see if the rule(s) do what you intend? This can be a great time saver and is just one more of these innovative ease-of-use features that can come in very handy.
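In code terms, synthetic debugging amounts to running a rule against hand-written inputs with known expected outputs. The Python sketch below shows one possible shape; the withdrawal rule, its threshold, and the helper name are invented for illustration.

```python
def flag_large_withdrawal(record, threshold=10_000):
    """Example business rule: mark withdrawals at or above the threshold for review."""
    is_large = record["type"] == "withdrawal" and record["amount"] >= threshold
    return {**record, "needs_review": is_large}

def run_synthetic(rule, cases):
    """Apply rule to hand-written inputs and return the cases whose output differs."""
    failures = []
    for given, expected in cases:
        actual = rule(given)
        if actual != expected:
            failures.append((given, expected, actual))
    return failures

# Custom values defined by the user; no connection to a real data feed is needed.
cases = [
    ({"type": "withdrawal", "amount": 10_000},
     {"type": "withdrawal", "amount": 10_000, "needs_review": True}),
    ({"type": "deposit", "amount": 50_000},
     {"type": "deposit", "amount": 50_000, "needs_review": False}),
]
failures = run_synthetic(flag_large_withdrawal, cases)
print(failures)  # []
```

An empty list means the rule behaved as intended for every synthetic case; a non-empty list shows exactly which input produced an unexpected result.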

As you can see, ease of use for data integration software is far more complex than looking at one specific aspect of the software only. We at expressor have learned a lot about this topic over the past three years and our intent with our upcoming expressor 3.0 release later this year is to deliver on this promise as best as we can.

Michael Waclawiczek, VP of Marketing, expressor


End user companies are consistently stressing the importance of metadata as shown in various research studies conducted by Gartner, TDWI, and other analyst firms. So you ask, what has gone wrong; why are so many ETL and data integration vendors still not fully addressing this important aspect in their products?

Let’s first look at some of the high-level metadata management requirements and see if you agree with these:

  1. An ETL tool should be flexible enough to handle and record changing application requirements
  2. An ETL tool should facilitate data governance and data quality initiatives
  3. An ETL tool should allow you to easily connect to and handle changing data source and target requirements
  4. An ETL tool should require minimum maintenance
  5. An ETL tool should allow business users to be included in the ETL/DI process
  6. An ETL tool should promote reuse of business rules and other ETL project artifacts without requiring the user to identify reuse opportunities
  7. An ETL tool should allow you to do design time and run time reporting and auditing
  8. An ETL tool should support collaborative development
  9. An ETL tool should facilitate role-based development
  10. An ETL tool should include a metadata repository that is fully integrated into the product

I’d say that most vendors in our space will claim that they handle most of these high-level requirements, but in practice they only show varying support for 1), 2) and 3) above, and most of them fall short on many of the other requirements.

Why is that, you ask? Because many ETL tools were created in the nineties when metadata management was still in its infancy, and even some of the newer low-end open source tools developed in the last ten years weren’t built with a central notion of metadata management. This is the main reason why it’s still very hard for these vendors to deliver on the more complex requirements asked for by businesses today.

So let’s dive a little bit deeper to see what’s really going on. First, most ETL vendors that offer data quality solutions offer them as separate products and provide very little integrated data quality support within their core ETL engines. This is because the ETL processes in traditional, point-to-point data mapping tools make it difficult to build generic data quality rules right into the product. So, while data quality support is available from a number of tools out there, it’s an add-on capability and not part of the core product.

I guess there is very little doubt in anyone’s mind that good built-in metadata management is a must if you would like to reduce the maintenance burden of your applications. SSIS users for example have complained about the lack of metadata management support for years and know what I am talking about when I point out this issue.

As highlighted in point 5) above, who wouldn’t want their business users to be actively included in the DI lifecycle? Why wouldn’t you want business users to actually define their rules directly in the product rather than in a separate Excel spreadsheet, which then gets thrown over the wall to the ETL developer, who has to implement these business rules? How do you know if what the ETL developer has implemented is right? Most ETL tools have no built-in feedback mechanism that permits the business user to double-check, much less test, what the developer actually coded.

Without doubt, one would rather have the domain expert be able to implement their own rules than make ETL developers responsible for this task. And again, the main reason why these types of requirements haven’t been satisfied by traditional ETL tools is that it is very difficult to offer business users the right tools when the foundation for defining these rules is tied to the complexities of the physical metadata.

This is the fundamental flaw of 99% of all ETL and data integration tools, and with the exception of expressor no other vendor has addressed this issue the way we do. What’s really needed is a metadata abstraction layer à la what expressor offers to turn this important requirement into practical reality.

Not only would you like to have reuse in terms of business rules, but reuse built into the right metadata foundation can go much further. It can give you reuse of dataflow components, data quality checks, domain conversion rules (is this field in Dollars or Euros?), etc. With most traditional tools you are forced to start all over again every time you build your next application. They don’t even let you reuse things you’ve already created as part of the same project. In the cases where reuse is available, it is your responsibility – not the tool’s – to find the reuse opportunities.

Reporting on your metadata has tremendous benefits for data lineage and analysis. Wouldn’t you want to go back in time and see what happened to any piece of data or metadata and how it has changed over time? Or to know why a specific business rule was changed or, if it wasn’t, whether it was ever used?

Everyone who has to deal with regulations and compliance should demand this kind of information, which requires a rich metadata model to start with, and your vendor’s metadata repository had better be fully integrated with all the tools that you use throughout the data integration lifecycle.

More vendors these days do provide some level of collaborative team development, but many tools fall short on delivering on this promise. What’s often compromised is the level of collaboration support you would expect. Without a comprehensive and convincing approach to metadata management, you’ll likely be disappointed in what your vendor has to offer you around collaborative development.

Now the next requirement, role-based development, is another can of worms. So why would you even want such a thing? Well, if you want business users and data stewards to be more active in the development of your DI application, then asking your vendor how they support different roles in a DI project is the right thing to do. Some vendors state that they support role-based development, but the roles they support are limited to the development staff and, if you are lucky, extended to the computer operations group.

Role-based development implies that you provide dedicated interfaces for specific user roles performing specific activities within a project. Don’t assume that your business users would ever be happy to work with an ETL developer-centric product! Different facilities are necessary for different tasks – although you can cut a tree down with a steak knife or slice a turkey with a chain saw, the results are typically not optimal (or pretty).

I believe that by now you get the idea that good metadata management is paramount to a good ETL development environment. So I encourage you to find out if your vendor of choice understands the importance of metadata and has a deep understanding of the requirements associated with it. Metadata has been a stepchild in the ETL world for long enough – expressor has promoted it, and continues to promote it, to center stage.

Michael Waclawiczek, VP Marketing, expressor

Register for our free expressor Studio 3.0 beta program!


Our next monthly webinar on September 9 will feature Dr. David Fenstermacher, the Chair and Executive Director of the Department of Biomedical Informatics at the H. Lee Moffitt Cancer Center & Research Institute.  The Moffitt Research Institute uses the expressor semantic data integration system to build its new Comparative Effectiveness Research (CER) data warehouse.

Dr. Fenstermacher is a researcher and distinguished keynote speaker.  He will be presenting on the topic of “Metadata:  the cornerstone for tomorrow’s healthcare information management systems” on September 9, 12pm – 1pm EDT.  Please register now!

Presentation abstract: Healthcare reform will bring sweeping changes to the current information infrastructure to support new paradigms such as personalized medicine and comparative effectiveness research that will require observational data captured at the point of care be reusable for patients, researchers, clinicians and administrators.  The design of new healthcare information management systems must overcome many current challenges including the information gap, the lack of data standards, inadequate information on data quality, and the technical limitations of current data architectures.

As more healthcare providers modernize their electronic data collection capabilities a large focus of the efforts to implement these systems will be the data and not the traditional IT infrastructure.  Data interoperability requires the creation of extensive data dictionaries that provide both physical and contextual metadata that allow IT professionals and end-users to effectively manage and use the data.  These concepts need to be linked to standards (SNOMED CT, ICD-9-CM, MedDRA, LOINC, and GO) to support the exchange of data beyond an individual healthcare provider.  As meaningful use is further refined the standards required might be more clearly defined, but the need for data interoperability will be a key to improving the quality of patient care while maintaining or reducing costs.

In addition, new metadata layers will be required to provide richer knowledge about the nature of the data available in an institution’s healthcare information infrastructure that will allow new ways to access and use data.  For an organization to create and manage an electronic healthcare data infrastructure that supports interoperability it will be imperative to create comprehensive data governance strategies to adopt data definitions and metadata standards and to decide which national and international data standards will be incorporated into the infrastructure.  These efforts will lay the foundation to reveal evidence-based guidelines based on clinical and molecular data to create well-designed clinical trials that will define new treatment interventions towards personalized medicine.


expressor Studio 3.0 is a free, easy to use, powerful (ETL and) data integration application that is based on breakthrough technology innovations to enable true semantic integration.

expressor Studio lets you easily connect to a wide range of data sources, map your data fields to common business names and types, and graphically design complex data transformation flows in minutes.  With expressor Studio you create reusable business rules and design, test, and run your integration applications right from within your Windows desktop.

Data integration is about to change forever; to preregister for a test drive of our beta version of expressor Studio 3.0 visit

http://www.expressor-software.com/expressorStudio


The ETL / data integration market is constantly evolving, and open source vendors have been leading the charge in providing some, and in some instances all, of their software for free. Those who chose to charge only for maintenance and services have a hard time making their business models work. Those like Talend who actually are in the business of selling commercial products provide their Studio products for free.

Now, we at expressor didn’t offer a “free” version of our product, as we decided that we needed some time and flexibility in how we evolve our product before we make it available to the masses. However, the time has come and expressor will launch its expressor Studio 3.0 beta product in a few weeks. And it’s free!

So how will we be different from open source solutions including Talend?

First off, our expressor Studio 3.0, albeit free, is not an open source product. It is a fully quality-tested product that is by far more advanced and sophisticated than me-too products like Talend. expressor Studio 3.0 is a brand new, innovative data integration tool that allows you to develop, test, and deploy pretty much any batch data integration application you could think of. And we want to keep it that way.

Jack Freeman

So how will we be making money?

We will be making money by selling annual support contracts to Studio users who need support, and by upgrading some of you to our premium expressor Integration Suite, which supports true, scalable parallel data processing, remote development and deployment, and low-latency data delivery, and enables you to do team development, metadata reporting, etc.

Open source and commercial vendors offering free products will be similar in this regard. What will be different is the functionality the expressor Studio product provides compared to our competition.

Let me say it again, Talend and other open source products are me-too products with little innovation. Talend Open Studio was never designed to be innovative; it is the Chinese manufactured version of a traditional ETL tool like Informatica and nothing else.

Talend, with some success, has been hiding under the open source umbrella to make the point that their core ETL stuff is free and different and Informatica isn’t. And they have succeeded in that until now. But as I have stated in previous blogs, we think that Informatica is a better product than Talend, and we also think that we beat both of these vendors on innovation around our metadata approach and on performance, and that neither of them will beat us on price. In fact, I think that Informatica will soon become your grandfather’s Oldsmobile.

We are very serious about rolling out a free version of our flagship semantic data integration product and taking data integration to the next level.

So again, why is expressor not open source?

Because it doesn’t matter. What matters is if a vendor like us can give you better value than the traditional vendors and Talend do.

Here’s why expressor is on a roll in truly changing our industry, which is as broken as GM was years ago and won’t be fixed by just producing cheaper products.  It is about consumers wanting fundamentally better products. And in our space, there is only one vendor who is working hard to make that happen. And it’s expressor software.

We have the smartest and brightest data integration people in our company. They have built the traditional data integration products and know what is wrong with them and are excited about expressor because they can innovate and build a truly next generation product set. They believe that Talend is a me-too, knock-off type of product. That’s all it is.

Talend is undermining the traditional players on one axis only – and that’s price. But don’t get me wrong. That’s not easy to do and I grant them best wishes in getting the market to pay attention to much more innovative and even more affordable offerings, like ours.

Watch our space!

Michael Waclawiczek, VP of Marketing

Pre-register for expressor Studio 3.0 beta!


expressor 2.3 includes a large collection of connectors (motors) for relational, flat-file, XML, SAS, COBOL copybook, and other source and target data formats.  This blog entry explains what the in-sql motor in expressor 2.3 is all about.

Our in-table motor executes the equivalent of the statement SELECT * FROM table_name, where * is replaced with the table column names included in the image file.  If desired, you can specify a WHERE clause as part of a manually entered database URI, but in this case the same WHERE clause would be applied to all channels in a parallel network.  This is most likely not what you really want.

If you need to execute a more involved SELECT statement or a stored procedure, perhaps with a table join, nested statement, or complex partitioning directive across multiple channels, you would use the in-sql motor.  With this motor, the statement you want to execute is included in the image file, although if used with a parallel network the part of the WHERE clause that specifies the partitioning directive will be contained within the channel specification.  Let’s see how this all comes together.

Suppose you want to read the names of a subset of individuals who were President of the United States in the nineteenth, twentieth, and twenty-first centuries.  The SELECT statement (the following examples use syntax specific to Microsoft SQL Server) might look something like the following.

SELECT last_name, first_name FROM presidents WHERE political_party IN
(SELECT party FROM party_list)
AND date_of_inauguration > CAST('1800-01-01' AS datetime)

This type of compound statement could not be executed by the in-table motor but is easily implemented by the expressor in-sql motor.

Now suppose you want to return members of the Democratic and Republican parties on separate channels.  You would add a partitioning directive to the nested SELECT statement.  The following fragment shows the statement that would be included in the image file.

SELECT last_name, first_name FROM presidents WHERE political_party IN
(SELECT party FROM party_list <partitioning_directive>)
AND date_of_inauguration > CAST('1800-01-01' AS datetime)

The entry <partitioning_directive> is a placeholder for a WHERE clause that will be used to partition the result set across the channels.  In this case, the partitioning directive would be specified through the value assigned to the channel query attribute.  One channel in the associated network file would include

query="WHERE political_party = 'Democratic'"

and the second channel would include

query="WHERE political_party = 'Republican'"

In this example, the nested SELECT statement did not include the WHERE keyword, so this keyword must be included in the value assigned to the channel’s query attribute.  But that may not always be the case.  The value assigned to a channel’s query attribute might be an extension to a WHERE clause contained in the nested SELECT statement.
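The per-channel expansion described above can be pictured with a short Python sketch. How expressor performs the substitution internally is not described here, so plain textual replacement of the placeholder is an assumption, as are the constant and function names.

```python
PLACEHOLDER = "<partitioning_directive>"

# Statement as it might appear in the image file (taken from the example above).
STATEMENT = (
    "SELECT last_name, first_name FROM presidents WHERE political_party IN "
    "(SELECT party FROM party_list <partitioning_directive>) "
    "AND date_of_inauguration > CAST('1800-01-01' AS datetime)"
)

# One query attribute value per channel.
CHANNEL_QUERIES = [
    "WHERE political_party = 'Democratic'",
    "WHERE political_party = 'Republican'",
]

def expand(statement, channel_query):
    """Return the statement a single channel would run."""
    if statement.count(PLACEHOLDER) != 1:
        raise ValueError("expected exactly one partitioning placeholder")
    return statement.replace(PLACEHOLDER, channel_query)

per_channel = [expand(STATEMENT, query) for query in CHANNEL_QUERIES]
for sql in per_channel:
    print(sql)
```

Each channel ends up with its own complete SELECT statement, while the image file holds only one copy of the shared text.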

For example, suppose you want to retrieve only Democratic presidents partitioned on date of inauguration.  The SELECT statement embedded in the image file might look something like the following.

SELECT last_name, first_name FROM presidents
WHERE political_party = 'Democratic' <partitioning_directive>

And the partitioning directives in the two channels could have the following content.

AND date_of_inauguration >= CAST('1950-01-01' AS datetime)

AND date_of_inauguration < CAST('1950-01-01' AS datetime)

Since the SELECT statement already includes the WHERE keyword, the query attribute’s content simply extends the WHERE clause with the AND keyword.

As you can see from these examples, the in-sql motor gives you full control not only over the complexity of the SELECT statement but also over which portion of the SELECT statement contains the partitioning directive that distributes records across the channels in a parallel network.
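When you write partitioning directives by hand, it is worth checking that every qualifying record lands on exactly one channel. The Python sketch below does this for the date-based example using an in-memory SQLite database as a stand-in for SQL Server. Two things differ from the statements above and are assumptions of the sketch: dates are stored as ISO-8601 text and compared as strings instead of using CAST, and one side of the split uses >= so that a record dated exactly on the boundary is not dropped.

```python
import sqlite3

PLACEHOLDER = "<partitioning_directive>"
STATEMENT = ("SELECT last_name FROM presidents "
             "WHERE political_party = 'Democratic' <partitioning_directive>")
# ISO-8601 text compares correctly as strings in SQLite, so no CAST is used here.
CHANNELS = ["AND date_of_inauguration >= '1950-01-01'",
            "AND date_of_inauguration < '1950-01-01'"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE presidents "
             "(last_name TEXT, political_party TEXT, date_of_inauguration TEXT)")
conn.executemany("INSERT INTO presidents VALUES (?, ?, ?)", [
    ("Wilson", "Democratic", "1913-03-04"),
    ("Kennedy", "Democratic", "1961-01-20"),
    ("Carter", "Democratic", "1977-01-20"),
    ("Reagan", "Republican", "1981-01-20"),
])

def fetch(sql):
    """Run a statement and return the sorted last names it selects."""
    return sorted(row[0] for row in conn.execute(sql))

partitions = [fetch(STATEMENT.replace(PLACEHOLDER, c)) for c in CHANNELS]
unpartitioned = fetch(STATEMENT.replace(PLACEHOLDER, ""))

# Every qualifying row should land on exactly one channel.
combined = sorted(name for part in partitions for name in part)
print(partitions)                 # [['Carter', 'Kennedy'], ['Wilson']]
print(combined == unpartitioned)  # True
```

If the two directives overlapped or left a gap, the combined list would differ from the unpartitioned result and the last line would print False.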

John Lifter, expressor


Next week, on Aug 18, Andy Leonard will be starting the first part of his webinar series, which focuses on data warehouse basics.  Andy will discuss what a data warehouse is and why you’ll get somewhat different answers depending on who you ask.  He will start from the origins of the concept and talk about the two schools of thought.

As you might know, Ralph Kimball championed one school, which says that all data marts in an enterprise feed into the data warehouse.  Bill Inmon, on the other hand, believes in a central data warehouse that generates these data marts.  As a result, there are differences in the database schemas as well.  Kimball champions star schemas, and Inmon prefers third normal form.  More on this in Andy’s short video below.

Register for the entire webinar on Aug 18 at http://www.tinyURL.com/expressorScalableDW


Andy Leonard

Join Microsoft SQL Server MVP, Andy Leonard, and expressor Senior Director of Product Management, Michael Ruland, for this 4-part webinar series. The first session, Part 1, will take place on August 18, 2010 from 12pm EDT to 1pm EDT. Parts 2-4 will follow.

• Part 1 – data warehousing basics

• Part 2 – importance of good metadata management

• Part 3 – why consider semantic integration?

• Part 4 – the expressor solution (with expressor 3.0 studio beta live demo)

Register now!


Our monthly newsletter – August 2010 edition – features an interview with CEO Bob Potter and summarizes a variety of exciting new company and product news, including:

  • interview with CEO Bob Potter
  • Skechers USA selects expressor for new data warehousing initiative
  • expressor issues Q2 2010 momentum release
  • expressor upgrades eval edition
  • expressor/PPC webinar recording on “getting data governance right”
  • expressor launches four-part webinar series with Andy Leonard
  • expressor software has been named by Information Management as one of the 40 vendors they’ll be watching in 2010.
  • CEO Bob Potter stirs controversy with Talend on “open source licensing”
  • expressor Studio Pre-release 2 testing underway
  • upcoming live events


This is an important question which I’d like to address in this blog.  As some of you may already know expressor is primarily targeting medium size SMBs and departmental projects in Global 2000 enterprises.

Important sectors in this market are Microsoft-centric SQL Server accounts that are using Microsoft SSIS for standard ETL tasks.  expressor commonly complements their SSIS usage for high-performance BI / data warehousing and other complex data integration initiatives.   In a majority of these accounts, traditional data integration vendors like Informatica and open source ETL vendors aren’t even considered by these companies for two primary reasons:  cost in case of Informatica and the Java-centric nature of products like Talend.  expressor however fits extremely well into these environments, as our highly affordable data integration product is built around Microsoft Office tools including Visio and Excel and optimized for SQL Server and other Microsoft applications.

In more heterogeneous and complex data integration environments, we are more likely to compete with Informatica and Talend, and on occasion with Ab Initio and IBM DataStage.  Talend’s strategy is to chase every Informatica opportunity and try to beat them on price, because that’s the only thing they can really compete on.  We believe that Informatica is and remains a better data integration tool than Talend, both in terms of features and usability.

So here comes expressor, which is increasingly spoiling the party for these two vendors.   Our expressor 2.3 product beats Informatica and Talend hands down on performance and processing of complex data today; we have not lost a single POC against these guys in these kinds of situations.  As regards software license and maintenance costs, I have blogged several times in the past that expressor is between 60 and 80% less expensive than Informatica and that we even beat Talend on price in typical project scenarios.

We will be heating up the competitive battlefield even further with our upcoming expressor 3.0 release later this year.   With 3.0 we will be rolling out a brand new Studio version of our product that will allow you to perform a multitude of standard data integration tasks at no charge.  Our premium expressor 3.0 offering will include a host of new innovative capabilities and will feature significant UI and usability improvements to make it the easiest enterprise-class data integration product to download, use, and maintain on the market.

These are exciting times for expressor and the entire data integration industry.

Michael Waclawiczek, VP of Marketing, expressor software
