|
As with any enterprise software, there are a number of criteria that determine if it is perceived to be easy to use relative to competitive offerings. In this blog entry, I’d like to examine some of the ease of use criteria that any ETL/data integration software should adhere to. Later on, I’d also like to highlight some important ease-of-use issues that have not been addressed by mainstream ETL software vendors — mainly because of the limitations of their underlying software architectures. So here’s my list of expected ease-of-use features: 1. The data integration software shall be downloadable and installable within minutes. This is a very important ease of use aspect for any nimble project team that doesn’t have dedicated resources for various project roles. The idea that it could take hours to install products like IBM DataStage indicates that these types of systems have become so bloated and complex that they can’t be easily adopted any longer by smaller teams in mid-size organizations or even in smaller projects in G2000 enterprises. And more moving pieces imply greater infrastructure complexity – more things that can go wrong with the installation and less visibility into exactly what might have gone wrong. 2. The data integration software shall provide a modern, industry-standard look and feel. The software should emulate how best of breed applications on Windows and non-Windows environments utilize and support the various platform features of the operating system and associated GUI frameworks. It isn’t about whether the ETL GUI tools are implemented in .Net or Eclipse/Java, it is more about how each vendor utilizes the UI features available on that platform. Unfortunately, some of existing mainstream ETL tools have not kept up with this criteria and one could look at their GUI tools and perceive them to be somewhat outdated. Having an industry-standard look and feel also makes the product less intimidating to new and/or casual users. 3. The data integration software shall provide graphical support for most of the functions of the software, and should only revert to scripting when necessary. I think that most of the commercial tools on the market do a fairly decent job on this front. But don’t assume that these tools provide first-class scripting support, which is paramount for anyone having to encode complex transformations that simply can’t be graphically expressed. Remember, once you leave the tool, compliancy and reporting metadata can quickly become compromised. 4. The data integration software shall support a well-defined workflow, meaning that it guides users through the various stages of the design process. Again, don’t assume that today’s tools on the market do a good job on this front — you may be surprised how non-intuitive and non-workflow oriented some of the products actually are. The other side of this coin is that the workflow must be adjustable so that it can be conformed to the business, not the other way around. 5. The data integration software shall offer operators for all important types of data transformations. This should be an easy one to cover given that many DI vendors have been around this problem for a long time. One would expect by now that each vendor has a complete set of transformation operators, but as it turns out, things like sophisticated grouping logic is still difficult to implement in many systems. The ability to rapidly create and deconstruct array structures is also import with the emergence of XML and its derivatives. A complete set of robust and easy to configure operators is paramount and not a nice to have. 6. The data integration software shall make it easy to connect to a wide variety of data sources. Connecting to data sources should be as easy as point and click operations and should not require anything more and most vendors make it easy to do so. 7. The data integration software shall provide targeted interfaces for all important roles on an ETL project including ETL developers, analysts and data stewards. With a few exceptions, business users have been ignored for a long time in the data integration lifecycle. This has a lot to do with the fact that many of the ETL tools were designed as developer tools, and support for business users — if supported — has been an afterthought. It also speaks to the complexity that most tools are encumbered with. 8. The data integration software GUI(s) shall be the same for batch and low-latency operations. There is no technical reason to make the UIs or engines different. This has been and is only an issue for vendors who don’t use the same underlying processing engine for batch and low-latency data processing. Users clearly shouldn’t have to become familiar with different UIs or engine characteristics for different DI use cases. 9. The data integration software shall provide good error reporting capabilities. As simple as this requirement can be stated, it is unfortunately true that various ETL systems haven’t fully coped with this problem and provide you with annoying error messages that are pretty useless in debugging your application. 10. The data integration software should allow you to work in an offline mode. It often happens that you can’t get access to your corporate systems and data when you’re at a remote site. What you really want is to be able to keep working on your data integration application while away from the office and synchronize your work later on when you are back online. These are some of the basic requirements one would expect from any traditional, open-source or next-generation ETL product like expressor. But as I had stated earlier, the industry as a whole has been very reluctant to embrace new innovations that could further increase the ease of use of the ETL software. Why is that you ask? The short answer is that the traditional vendors would have to overhaul their systems to add this ease-of-use functionality and quite naturally have been reluctant to do so due to lack of competitive pressures! But this is changing rapidly as expressor rolls out its expressor 3.0 product line later this year. Here are additional requirements an easy-to-use, state-of-the-art ETL tool should adhere to in order to further remove the intrinsic complexities of typical data integration projects you are more than familiar with: • State-of-the-art data integration software shall shield users from the complexities of the underlying physical data structures. As a data steward or ETL developer you shouldn’t have to deal with the physical table column names or names in an XML file but rather be able to build your application with logical, abstracted business names that are much more meaningful to you and allow you to more easily communicate with your team members. Wouldn’t it be nice if you could tell your ETL software that “acc_no” in table A and “account_num” in table B mean the same and that you can link both these names to a common business term called “account_number”. • State-of-the-art data integration software shall automatically do data type conversions for you. Isn’t it about time that after close to twenty years of data integration software innovation one wouldn’t need to guide the software to perform basic data type conversions? What you really want is that “acc_no” (represented as a string in table A) and “account_num” (represented as an Integer in table B) are automatically mapped to an internal data type so that you as a user don’t have to worry about type conversions any longer. Have you ever had to worry about date time formats between databases? Removing these conversions from the data flow makes it simpler to build, maintain, and comprehend – allowing developers and analysts to focus on the business requirements and not the technical side effects of external storage systems. • State-of-the-art data integration software shall allow you to assign constraints to your common business names and its associated data types. The software shall enable you to validate if the actual data that flows through your application meets these constraints or not. If not, the software should allow you to take action on those data items by either rejecting them, alerting you, or by taking some other user-defined action. • State-of-the-art data integration software shall allow you to define reusable business rules. Imagine defining a reusable rule that does Currency conversions. The fact that business rules in traditional ETL tools are hardcoded into the transformation logic is one of the fundamental flaws of today’s ETL software. • State-of-the-art data integration software shall allow you to organize reusable business rules and other project artifacts in shared projects and libraries — so they can be easily accessed by other users on the same and different projects. Wouldn’t you expect that the data integration software offers you capabilities similar to Microsoft Excel, which organizes all reusable functions in libraries and makes them easily accessible to you? • State-of-the-art data integration software shall support the concept of “synthetic debugging”. Rather than having to connect your application to the “real data feeds” for testing a particular business rule, why shouldn’t you be able to test your rule(s) based on a set of custom values you define to see if the rule(s) does what you intend it to do. This can be a great time saver and is another just one of these innovative ease of use features that can come very handy. As you can see, ease of use for data integration software is far more complex than looking at one specific aspect of the software only. We at expressor have learned a lot about this topic over the past three years and our intent with our upcoming expressor 3.0 release later this year is to deliver on this promise as best as we can. Michael Waclawiczek, VP of Marketing, expressor |
-
Blogs
expressor corporate blog
expressor technical blog
Categories
- Application
- Business Rules
- Collaborative / Role-based
- Competition
- Connectivity
- csv file
- Customers
- Customers
- Data Integration
- Data Integration
- Data Lifecycle Management
- Data Migration
- Data Migration
- Data Processing
- Data Transformation
- Data Warehousing
- Data Warehousing
- DSS
- Ease-of-use
- expressor
- expressor Studio
- Featured
- Healthcare Informatics
- Low Latency
- Metadata
- Netezza
- News / Events
- Online / Multi-media
- Open Source ETL
- Partners
- Scripting
- Semantic Integration
- Semantic Types
- SQL Server
- SSIS
- Standard Edition
- Telecommunications
- Tools
- Traditional ETL
- type conversion
- Uncategorized
Tags
ab initio affordable etl American Tower Andy Leonard Bob Potter cloud computing Darwin's ETL data integration software data integration tools data quality data stage datastage data warehouse EDI etl etl software etl tools expressor expressor 3.0 expressor etl expressor software expressor Studio free ETL Healthcare Informatics ibm informatica insight series lifecycle management metadata microsoft netezza open source pricing semantic data integration Semantic Integration semantic rationalization semantics semantic types SQLSaturday sql server ssis Talend TCO Teradata webinar







2 Comments
you described very clearly the most important flaws of current tools on the market. i totally agree with all your findings and ideas! on my list i have 3 more points:
1. an etl tool should support iterative development. “iterative” naturally means a lot of changes on existing code. some changes are small and localized to a specific job or to a specific column in a transformation, but there can also be bigger changes affecting many jobs or a whole set of columns. Asimple example would be to rename a bunch of jobs or columns (using regular expressions). Current tools like datastage may support finding all instances or jobs in that case, but thats it. you are not able to automatically change a set of objects. another example: change all join stage in a job from case
insensitive to case-sensitive. in the datastage gui you would have to click through all join stage to change. most of the time you then resort to job exports and perform the changes on the horrible export file format (using xslt, sed or awk), which is still much faster than in the gui! so tool-supported refactoring and ease of change would change the actual project work dramatically, since quite a significant part of the work usually goes into refactoring (even though this is never exposed in the project plans). to support frequent changes also means fast design, compile and execute cycles – in current tools a single cycle easily can take up 10 minutes for a single job! the result: less cycles, less quality.
2. the etl tool should give the developer the chance to use the power of meta data to build better reusable and more generic jobs. to be able to build truely reusable and generic jobs or sequences, meta data should be treated like data at the same level and interchangeable. and if you know how powerful languages like schemasql, where you can bind not only data but also meta data to tuple variables, then you will miss this kind of functionality a lot in the current etl tools. another use-case for meta data is property values: e.g. how would you name a file that is written by a job? in most cases you will find yourself copying parts of the job or table definition name into the file name property. why not having the possibilty to refer to existing meta data elements to name this file, e.g. naming it $jobname.$tablename.txt. and the next time you use this file stage in another job you do not have to change the property, since you can reuse the property value in a different context!
3. visual data flows are good, text is sometimes better, visuals data flows with text is the best! the most common problem of graphical tools is that they fail to expose critical information to understand a job flow – e.g. is this join an inner join or a left outer join, which link is left which is right, what is the join key. to some extend you can solve this with renaming the graphical objects, but this is most of the time a manual job – i have seen developers spending hours if not days just on this kind of renaming! in a lot of cases renaming does not even work since the name rules of the tool limit what you can express: consider a filter or switch stage/operator – to the first output all records a > 100 flow to the second all records a <= 100. the best would be to see this filter expression directly on the output links. to avoid information clutter, the graphical/textual representation layers could be a solution. or to have two perspectives for the textual and the graphical representations which you can easily switch or you can have
side by side.
Many thanks for your thoughtful response.
Regards,
Michael