Moving parts of the data architecture

February 19, 2020

Traditionally, data warehouses have been very resistant to change and to the adoption of new technology. However, in the middle of the last decade, this side of the world saw vast changes. The whole ecosystem rushed towards change, fuelled by large-scale Hadoop and Big Data implementations. In the past few years, slowly but steadily, yet another data revolution has been happening: cloud-based and serverless data architectures.

With all three major public cloud service providers innovating and competing at breakneck speed, the options available to enterprises have multiplied. While cost (especially capex) was cited as the initial driver for moving to the cloud, today the cited benefits are multi-pronged, ranging from ease of maintenance to high availability.

Thanks to cloud migration at the enterprise level, data architectures for relational databases have reached a fairly mature stage in the cloud world too. Beyond OLTP, the patterns for standard warehouse workloads have also become consistent. The big consideration, and advantage, that the cloud brings to the fore is the choice to do ELT (something that was confined to the MPP world on-premises) in place of ETL.
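To make the ELT idea concrete, here is a minimal sketch in Python. It uses the standard library's in-memory SQLite database purely as a stand-in for a cloud warehouse, and the table names and sample rows are hypothetical. The point is the order of operations: raw data is loaded as-is into a staging table first, and the cleaning and aggregation then run inside the database engine rather than in a separate ETL tool.

```python
import sqlite3

# Hypothetical raw order records, as they might arrive from a source system.
raw_orders = [
    ("o-1", "north", "120.50"),
    ("o-2", "south", "80.00"),
    ("o-3", "north", "not-a-number"),  # dirty row, still loaded as-is
    ("o-4", "south", "19.50"),
]

# SQLite stands in for the warehouse here; this is an assumption for the demo.
conn = sqlite3.connect(":memory:")

# Extract + Load: land the data untouched in a staging table.
conn.execute("CREATE TABLE stg_orders (order_id TEXT, region TEXT, amount TEXT)")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", raw_orders)

# Transform: the database engine filters out dirty rows and aggregates.
conn.execute(
    """
    CREATE TABLE region_sales AS
    SELECT region, SUM(CAST(amount AS REAL)) AS total_amount
    FROM stg_orders
    WHERE amount GLOB '[0-9]*'
    GROUP BY region
    """
)

region_sales = dict(conn.execute("SELECT region, total_amount FROM region_sales"))
print(region_sales)
```

In an ETL flow, the filtering and type conversion would happen before the load, so the warehouse would never see the dirty row. With ELT the raw data stays available in staging, and the transformation can be rerun or changed later using the warehouse's own compute.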

As an enterprise, if you are on this journey, watch out for pitfalls around availability (within your geography), security implementations (are you looking for row-level security?), scalability (each service scales differently) and, importantly, force-fitting the latest tech onto the use case you have. For example, Azure offers MPP through Azure Synapse. However, using it for smaller datasets might result in performance issues, purely as a result of how MPP databases are architected.

Coming to the “still woozy” scenario of Big Data workloads, there are quite a few moving parts.

While the ease of use has increased and time to explore has decreased, there are quite a few options to pick from. Landing on the preferred architecture for a data platform is more difficult today than it was a few years ago!

This is mainly due to the multiple options available across cloud service providers (BigQuery vs Redshift vs Synapse vs HBase), the different options within the same provider (Azure SQL vs Azure HDInsight vs Azure Synapse), and so on.

So, how do we proceed from here?

Start with collating all the questions:

  • Is the number of data sources you need to access limited or large?
  • Which data sources do you need to ingest data from, and what are their formats?
  • Is most of the data structured / relational data?
  • Do you need real-time filtering and aggregation of data?
  • What do you want to do with the data flowing in – Complex Event Processing or Data Analysis?
  • Do you foresee a need for separate scaling of compute and storage?
  • Are there queries on the data that need to be distributed and warrant MPP-style processing?
  • Is your ultimate goal for the data platform to provide a robust analytics workbench for training complex AI models on large volumes of data?
  • Will your batch processes need to run continuously or in spurts?
  • For your real-time data feeds, do you need built-in support for windowing?
  • Are you looking at embedding your visualizations (reports) in an existing website?

Answers to all the above questions, along with considerations like availability of data centers within your geography, connectivity to on-premises data sources, autoscaling capabilities, and security, will play a role in the final decision.
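As an illustration of what the windowing question in the list above refers to, here is a minimal sketch of a tumbling (fixed, non-overlapping) window count in plain Python. The function name, the event format, and the sample readings are all hypothetical. A streaming service with built-in windowing support would also deal with concerns this sketch ignores, such as late or out-of-order events.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, key) pairs.
    Returns a dict mapping (window_start, key) to the event count.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical sensor readings: (seconds since start, sensor id).
events = [(1, "a"), (4, "a"), (9, "b"), (11, "a"), (19, "b"), (20, "a")]
window_counts = tumbling_window_counts(events, window_seconds=10)

for (window_start, key), count in sorted(window_counts.items()):
    print(f"window starting at {window_start}s, sensor {key}: {count} event(s)")
```

If your real-time feeds need aggregates like these, built-in windowing saves you from writing and operating this logic yourself, which is why it belongs on the checklist.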

The diagram below is just a small, illustrative subset of the options that exist across platforms for different facets of a data architecture.

It is equally important to know that, in most cases, the result is going to be a hybrid architecture: hybrid in leveraging multiple options within one cloud service provider (using S3 to stage data, then Athena to query data directly in S3 and Glue to load transformed data into Redshift) as well as multi-cloud (with IoT data sitting in AWS and relational data sitting in Azure).

These are interesting times in the data world, and things have never been this exciting! When in doubt, reach out to the experts we have in-house at RoundSqr 🙂