CIO Insider

CIOInsider India Magazine


The Data Warehouse of the Future

Anup Purohit, CIO, Yes Bank

In today's digital world, data is the single most valuable asset for organizations. The large number of ecosystems that assimilate and analyze this data to uncover hidden nuggets is a testament to its overarching importance. Any medium to large organization has a data warehouse implementation as one of its most strategic initiatives. The insights from a data warehouse are used for a myriad of purposes, including but not limited to analytical campaigns, business and portfolio monitoring, performance analysis and risk management. Having said that, most data warehouses across organizations either fail or do not deliver value in line with the costs incurred.

Let us first understand what a data warehouse is and what it is not. A DATA WAREHOUSE IS A CONCEPT. It is NOT A TECHNOLOGY. Essentially, a data warehouse is a representation of all the business processes that constitute an organization. It is a repository of data extracted, cleansed, transformed and integrated from multiple disparate systems to provide an easy and comprehensible bed for reporting and analysis. Depending upon the business function, a data warehouse is required to store both current and historical data.

The confusion of the data warehouse with a piece of technology stems from the fact that, traditionally, data warehouses have been implemented on relational databases (RDBMS). So it is not the concept that is failing; it is the technological ecosystem hosting the concept that poses challenges to organizations. Here are the top limitations of traditional data warehouse technologies:

1. Traditional RDBMSs are well suited to lower volumes of data. For larger volumes, we have engineered systems, which are extremely complex to administer

2. These become cost prohibitive as the data sets grow

3. Many data warehouses are very rigid and cannot adapt in line with dynamic changes in the organization and its requirements

As we live in a 'Data Age', I believe the 'Data Warehouse of the Future' needs to provide organizations with the following capabilities:

1. It needs to handle large data volumes. We are digitizing every facet of organizations and hence generating data like never before. The Data Warehouse of the Future should be able to ingest and work on large data sets

2. Newer forms of data are emerging. Semi-structured and unstructured data sets are providing insights that traditional structured data has not provided. The Data Warehouse of the Future should be able to handle these varied data sets and merge them seamlessly with structured data

3. The Data Warehouse of the Future has to be analytical. It is no longer sufficient to integrate as-is or cleansed data from organizational systems. It is of utmost importance to elicit hidden information using analytical models and integrate it into the mainstream of the data warehouse for consumption by business users. So the Data Warehouse of the Future should be able to manage analytical capabilities as part of routine workloads

4. Next-generation data warehouses need to be as near real time as possible. The Data Warehouse of the Future should be able to ingest large amounts of streaming data in a scalable manner

5. The Data Warehouse of the Future needs to be flexible and agile enough to ingest new data quickly and easily. Traditional data warehouses have been extremely rigid, and incorporating new data elements becomes a project in itself
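To make the second capability concrete, here is a minimal Python sketch of merging semi-structured records with structured ones. It is an illustration only: the customer rows, the JSON events and the helper names `flatten` and `merge_events` are all hypothetical, and a real warehouse would do this at scale inside its ingestion layer rather than in plain Python.

```python
import json

# Hypothetical structured rows, as they might come from a relational customer table.
customers = [
    {"customer_id": 1, "name": "Asha", "segment": "retail"},
    {"customer_id": 2, "name": "Ravi", "segment": "corporate"},
]

# Hypothetical semi-structured events (one JSON document per line).
# Note that the nested "meta" block is optional and its fields vary.
raw_events = [
    '{"customer_id": 1, "channel": "mobile", "meta": {"device": "android"}}',
    '{"customer_id": 2, "channel": "web"}',
    '{"customer_id": 1, "channel": "web", "meta": {"browser": "firefox"}}',
]


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. meta.device."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def merge_events(customers, raw_events):
    """Attach structured customer attributes to each flattened event."""
    by_id = {c["customer_id"]: c for c in customers}
    merged = []
    for line in raw_events:
        event = flatten(json.loads(line))
        # Unknown customers simply contribute no extra attributes.
        customer = by_id.get(event["customer_id"], {})
        merged.append({**customer, **event})
    return merged


merged = merge_events(customers, raw_events)
for row in merged:
    print(row)
```

Each output row carries the fixed customer columns plus whatever optional fields the event happened to contain, which is the kind of seamless merge the capability calls for.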

Traditional technologies have been able to meet some, but not all, of the requirements stated above. So, to make the data warehouse concept a successful proposition in an organization, it is imperative to scout for newer technologies as alternatives to the traditional ones. One such technology that stands out is Hadoop.

Hadoop is essentially an ecosystem for data analysis with many technologies under its umbrella. Sqoop and Flume ingest structured and unstructured data, while Pig and Spark run processing jobs. Kafka and Spark Streaming handle real-time data, and engines such as Impala provide interactive query capabilities on large data sets. Powering all of that is HDFS, a distributed and persistent file system, which has given rise to the term 'Data Lake': a place where organizations can store data very cost-effectively and use the aforementioned technologies for various tasks and purposes.

Having stated the above, is Hadoop the answer to every requirement? My answer to that is a clear no. Traditional database technologies have matured over a span of 30+ years and hence provide very mature, stable and functionally rich SQL and programming interfaces for working with data. Further, their data governance and security capabilities are highly advanced compared to Hadoop's. Additionally, the pool of skilled resources around these technologies is immensely large. Hadoop as a technology is still maturing in terms of its SQL prowess, and the skill sets it requires are yet to become mainstream.

So how does an organization address the data warehouse problem in the most optimal manner? I believe the answer lies in following a 'Horses for Courses' approach. An organization needs to make the most of the strengths of each technology. Use traditional RDBMS technologies for analysis of recent/current data, and leverage the power of the Big Data platform to run large jobs on historical data and to provide a low-cost alternative for large-scale storage. I therefore see a hybrid structure as the most optimized solution to this problem.
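The hybrid idea can be sketched as a small routing rule: queries over recent data go to the RDBMS, and anything older goes to the Big Data platform. The Python below is only an illustration under stated assumptions. The 90-day "hot" window, the store labels `rdbms` and `big_data`, and the function `plan_query` are all hypothetical; the right boundary between the two platforms depends on each organization's workloads and costs.

```python
from datetime import date, timedelta

# Assumption: data from the last 90 days counts as "recent/current".
HOT_WINDOW_DAYS = 90


def plan_query(period_start, period_end, today, hot_window_days=HOT_WINDOW_DAYS):
    """Split a date range into sub-ranges, one per storage platform.

    Returns a list of (store, start, end) tuples with inclusive dates.
    """
    # First day that is still served by the RDBMS.
    cutoff = today - timedelta(days=hot_window_days)
    plan = []
    if period_start < cutoff:
        # Historical slice: run on the Big Data platform.
        historical_end = min(period_end, cutoff - timedelta(days=1))
        plan.append(("big_data", period_start, historical_end))
    if period_end >= cutoff:
        # Recent slice: run on the traditional RDBMS.
        recent_start = max(period_start, cutoff)
        plan.append(("rdbms", recent_start, period_end))
    return plan


today = date(2024, 6, 30)
for store, start, end in plan_query(date(2024, 1, 1), date(2024, 6, 30), today):
    print(store, start, end)
```

A query that spans both periods is split into two parts whose results would then be combined by the reporting layer. A query that falls entirely inside the hot window never touches the Big Data platform at all.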

This is indeed a space that has the eye of most technology pundits across the globe. While the active Hadoop community focuses its research and development on quickly maturing its SQL capabilities, thereby providing an RDBMS kind of flavour on the Big Data platform, traditional database technologies are leveraging Big Data concepts to provide parallel and distributed processing. Will there be a clear winner, or will we see an alliance between the two? Only time will tell.
