Transition from Legacy Databases to a Modern Data Architecture

The proliferation of purpose-built data solutions means that choosing your data transition strategy requires you to do your homework.


The good news: there is a combination of database technologies out there suited to your organization’s exact data requirements. But the proliferation of purpose-built data solutions means that choosing your data transition strategy requires doing your homework. While this article won’t make that decision for you, it will – hopefully – give you a quick understanding of what you need to think about as you plan out your data modernization initiatives.

A quick primer on databases and data architecture

Almost as soon as computers arrived in the 1950s, there was a need to manipulate data. Initially, data was stored within the computer programs or algorithms themselves. But as the complexity of analysis and the quantity of data increased, there was a need to store, manage, and access data externally.

Charles W. Bachman is credited with designing the first database management system (DBMS) in the 1960s. The approach proved popular, and many commercial companies, like IBM, created their own versions (like the IBM Information Management System). As computers became increasingly popular with businesses, new ways to standardize how data was stored and accessed were developed.

A true revolution in database management came from E.F. Codd in 1970. Codd proposed the relational model of data, the foundation of the relational database management system (RDBMS). This model organizes data into tables of rows and columns. Each row in a table has many columns and a unique key to access it, and the rows in one table can be linked to rows in other tables. In an instant, the complexities of datasets were reduced to simple relations between tables. IBM also introduced a standard way to access the data in the tables, called the Structured Query Language (SQL). Many popular commercial relational databases came to market, including DB2 and Oracle.
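The relational ideas above – tables of rows and columns, a unique key per row, and rows linked across tables via SQL – can be sketched with SQLite from Python's standard library. The table and column names here are illustrative examples, not anything from a real system:

```python
import sqlite3

# In-memory database for illustration only; table names are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each row has columns and a unique key (the PRIMARY KEY).
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# Rows in one table link to rows in another via a foreign key.
cur.execute("""CREATE TABLE orders (
                   id INTEGER PRIMARY KEY,
                   customer_id INTEGER REFERENCES customers(id),
                   item TEXT)""")

cur.execute("INSERT INTO customers VALUES (1, 'Ada')")
cur.execute("INSERT INTO orders VALUES (100, 1, 'keyboard')")

# SQL expresses the link between the two tables as a join.
cur.execute("""SELECT c.name, o.item
               FROM customers c
               JOIN orders o ON o.customer_id = c.id""")
print(cur.fetchall())  # [('Ada', 'keyboard')]
```

The join is the key idea: the relationship between customers and orders lives in the data (the shared key), not in application code.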

Why databases modernized and why data decisions are more critical than ever

The ease of storage, ability to connect datasets, and standard way to access data made relational databases a huge success. Relational databases dominated data management for almost 30 years – from 1970 to 2000. The new millennium, though, brought a huge shift in the volume, variety, and velocity of data. This evolution challenged the foundations of relational databases by introducing:

      1. Structured vs. unstructured data – Structured data was mostly composed of rows and columns, but its domination was challenged by the generation of new types of data, like Twitter feeds, mp3 files, video, time-series data, graph data, and more. Traditional RDBMS were not suited to storing these data types.
      2. Limited vs. unlimited volume of data – The RDBMS wave did face large amounts of structured data from some industries, like finance and retail, but growth was still limited to terabytes and petabytes over several years. The new millennium saw data growing by terabytes per day or week for use cases like Facebook logs, Twitter feeds, and self-driving cars.
      3. Batch vs. streaming data – Some applications needed data to be processed as real-time streams; for example, Twitter feeds or web logs feeding a recommendation engine. This was different from the traditional batch processing of relational databases.
      4. On-premises vs. cloud data – The millennium also introduced cloud technology, which addressed the availability, scalability, and reliability limitations of existing database solutions. It also introduced hybrid data management considerations, as data became scattered across multiple on-premises and cloud environments.
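The batch-versus-streaming distinction in point 3 can be sketched in plain Python (an illustration only, not a real stream-processing framework): a batch job sees the whole dataset up front, while a streaming job maintains a running result as each record arrives.

```python
# Illustrative only: counting words in "log lines", batch vs. streaming.

def batch_count(lines):
    # Batch: the full dataset is available up front; process it in one pass.
    total = 0
    for line in lines:
        total += len(line.split())
    return total

def stream_count(line_stream):
    # Streaming: records arrive one at a time (possibly from an unbounded
    # source); keep a running result and emit an update per record.
    running = 0
    for line in line_stream:
        running += len(line.split())
        yield running

logs = ["user login ok", "page view", "user logout"]
print(batch_count(logs))         # 7 -- one answer, after all data is seen
print(list(stream_count(logs)))  # [3, 5, 7] -- an answer after every record
```

The streaming version never needs the dataset to "end" – which is exactly why real-time use cases outgrew batch-oriented relational processing.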

The above characteristics of data rendered the RDBMS model obsolete – or, at least, very much limited its scope. Of course, structured data is still critical for many operational functions like finance and order management, but the new types of databases are posing a significant challenge to RDBMS. Different data types and their applications require different database technologies – it is a matter of having the right tool for the right use case. When you are building a house, you use all sorts of tools: a hammer, a drill, screwdrivers, wrenches, and so on. Similarly, when you are building a data management system, you need to use the right database for the right use case – this is also called a polyglot architecture. Several database technologies, collectively referred to as “NoSQL” databases, came into existence after the millennium: key-value stores, graph databases, columnar databases, document databases, time-series databases, etc. The table below helps show how database types fit into desired functionality.
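To make the polyglot idea concrete, here is a hedged sketch in plain Python (no real databases involved) of how the same user-related data naturally takes different shapes, each matching a different NoSQL category from the list above. All names and values are invented for illustration:

```python
# Key-value store (e.g. Redis-style): an opaque value behind a single key.
# Suits caching and session lookup, where access is always by exact key.
kv_store = {"session:42": "user-1001"}

# Document database: a nested, schema-flexible record. Suits data whose
# shape varies per record (extra fields, lists of sub-objects).
document = {
    "user_id": "user-1001",
    "name": "Ada",
    "devices": [{"type": "phone", "os": "android"}],
}

# Relational row: fixed columns, joined to other tables by key.
relational_row = ("user-1001", "Ada")

# Time-series data: timestamped measurements, appended in order.
time_series = [
    ("2024-01-01T00:00:00Z", 3.7),
    ("2024-01-01T00:01:00Z", 4.1),
]

print(kv_store["session:42"])            # user-1001
print(document["devices"][0]["os"])      # android
```

Forcing all four shapes into one relational schema is possible, but awkward – which is the motivation for picking a purpose-built store per workload.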


The rise of open source

You might have noticed an important trend in the above database solutions – while the RDBMS era was dominated by a few proprietary databases like DB2, Oracle, and Informix, the NoSQL databases are primarily open source. Many of the most popular NoSQL databases and related technologies are open source, like Hadoop, Cassandra, Kafka, OpenSearch, Redis, Spark, and many more. These open-source technologies are completely free for any individual or business to use. There is none of the licensing cost or vendor lock-in that was prevalent with RDBMS technologies like Oracle.

However, the popularity of open-source technologies has motivated several new vendors to make money off them. Some of these vendors have added new features to the core software and started charging licensing fees for them, while others have introduced new technologies that are open source in name only. For any transition or migration of the data layer, it is critical to understand the new open-source ecosystem to get the most out of these technologies.

The open-source ecosystem can be divided into three types of vendors:

      1. Open-Core – The core database technology is open source, but these vendors add proprietary features on top of the open core and charge licensing fees for them. It can quickly become unclear (sometimes intentionally so) what is open and what is proprietary, and the vendor is focused on moving you from true open source to its licensed version.
      2. Open-Code – Some data solutions introduce a new technology that they call open source, but it uses more restrictive software licensing, such as SSPL. These licenses restrict commercial alternatives, particularly competing managed services. Technically, anyone can download and look at the code (i.e., open code), but any changes or enhancements are still owned by the commercial license holder.
      3. Open-Source – A few vendors support true open-source technologies like Cassandra, Redis, and others. They offer services like support and managed platforms, but differ by not charging licensing fees for any of the technologies – the code remains fully open source and portable.


Anil is the VP & Head of Data Solutions at Instaclustr, which provides a managed platform for open source data-layer and developer workflow technologies. Anil has 20+ years of experience in data and analytics roles. Joining Instaclustr in 2019, he works with organizations to drive successful data-centric digital transformations via the right cultural, operational, architectural, and technological roadmaps. Prior to Instaclustr, he held data & analytics leadership roles at Dell EMC, Accenture, and Visa, among others. Anil lives and works in the Bay Area.