- Bad ETL (extract, transform, load) projects lack a strategy for handling different types of information, or lack knowledge management around how to add/remove data sources, add/remove processors and translators, and add/remove sinks of information.
- It doesn’t have to run on any particular platform; it just needs structure, an architecture, as any software should have.
- Simple systems that separate E / T / L into composable blocks, scriptable or configurable, work well.
- Compiled systems are a good choice too when data volume is extreme.
- A well-documented bash pipeline is as good as any other approach.
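To make the "composable blocks" idea concrete, here is a minimal sketch of a bash pipeline with extract, transform, and load as separate, swappable stages. The function names, sample data, and file path are all illustrative assumptions, not a prescribed layout; in practice `extract` would call an API or database and `load` would write to a warehouse.

```shell
#!/usr/bin/env bash
# Sketch: E/T/L as composable shell stages. Sample data and paths are illustrative.
set -euo pipefail

extract() {   # E: emit raw rows; a real source would be an API or database query
  printf '%s\n' \
    'id,amount,status' \
    '1,9.99,completed' \
    '2,4.50,cancelled' \
    '3,12.00,completed'
}

transform() { # T: drop the header, keep completed orders, project id and amount
  awk -F, 'NR > 1 && $3 == "completed" { print $1 "," $2 }'
}

load() {      # L: write to the sink; swapping this function changes the destination
  cat > /tmp/warehouse.csv
}

extract | transform | load
```

Each stage reads stdin and writes stdout, so adding or removing a source, processor, or sink is just replacing one function in the pipeline.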
- Using ESB (Enterprise Service Bus) for ETL.
- Using Spark for ETL.
- These approaches use tools whose advanced features exist for business logic to perform simple transformations that don’t belong in those computing environments, conjoining simple message delivery (ETL) with advanced message delivery (ESB) or advanced computation (Spark).
Why should an organization undertake such a project?
It comes down to how the project affects the organization’s perpetuity. Some of the questions a business should be able to answer are:
What other solutions provide the same end user results?
Tools like Domo, Tableau, or more recently Periscope (in the SaaS world) can provide basic insights without ETL if the data is already in good shape. Open source tools such as Kibana, Metabase, and Redash can serve the same purpose, as long as the data is available.
What are the trade-offs between the various solutions?
Ultimately, if the data isn’t ready, ETL may be required to clean it enough for those tools to let users visualize and explore it properly.