Big Data Architecture for Machine Learning

Sai Geetha M N
Jul 1, 2021

Machine Learning is a branch of Artificial Intelligence that has a large variety of algorithms and applications. One of my earlier articles, "The Machine Learning Landscape", provides a basic mind map of the various algorithms.

Big data architectures provide the logical and physical capability to ingest, process, store, manage and access data of high volume, large variety and high velocity.

The marriage of these two opens up immense possibilities and large enterprises are already leveraging the benefits. To understand how to bring the two together, we would first need to understand them individually.

Machine Learning Architecture

The Machine Learning architecture is closely tied to the ML process described in my earlier article, "Machine Learning Process - A Success Recipe".

As a quick recap, a typical ML process would involve the steps depicted here:

It has a two-phased process of learning and predicting, the former feeding into the latter.

However, when the Machine Learning model has to be put into production, a few more aspects have to be taken care of, as shown here:

The aspects added are cross-cutting concerns and are shown by the four layers at the bottom of the diagram.

Task Orchestration: The ability to orchestrate tasks like feature engineering, model training and evaluation on computing infrastructures like AWS or Azure. Dependency management is an important aspect here, and most often it is non-trivial.

Infrastructure: Provisioning the infrastructure and making it elastic, through options like containerization, is essential.

Security: An additional layer of security, through authentication and authorization, needs to be added on top of the infrastructure.

Monitoring: Continuous monitoring of the infrastructure, the jobs and the performance are all non-trivial aspects to be taken care of in production.
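To make the dependency management mentioned under Task Orchestration a little more concrete, here is a minimal sketch using Python's standard-library `graphlib` module (Python 3.9+). The task names and their dependencies are hypothetical; a production set-up would normally delegate this to a workflow scheduler rather than hand-rolled code.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical ML pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "collect_data": set(),
    "feature_engineering": {"collect_data"},
    "train_model": {"feature_engineering"},
    "evaluate_model": {"train_model", "feature_engineering"},
    "deploy_model": {"evaluate_model"},
}


def execution_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

If the dependency graph accidentally contains a cycle, `static_order()` raises `graphlib.CycleError`, which is a useful early check before any job is submitted.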

The final aspect is feedback about the statistical performance of the model itself. Feeding this back gives an opportunity to auto-tune the model and would be a great value add (shown by the dotted line from Adaptation to Data Collection).
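One simple way to picture this feedback loop is a check that compares live accuracy against the accuracy measured at training time. This is only a sketch under assumptions: the function names, the use of accuracy as the metric and the 0.05 tolerance are all illustrative choices, not something prescribed by the architecture.

```python
def accuracy(predictions, actuals):
    """Fraction of predictions that match the observed outcomes."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)


def needs_retraining(baseline_accuracy, predictions, actuals, tolerance=0.05):
    """Flag the model when live accuracy drops more than `tolerance` below the baseline."""
    return accuracy(predictions, actuals) < baseline_accuracy - tolerance


# Hypothetical batch of production predictions and the outcomes observed later.
degraded = needs_retraining(0.90, [1, 0, 1, 1], [1, 0, 0, 0])
healthy = needs_retraining(0.90, [1, 0, 1, 1], [1, 0, 1, 1])
print(degraded, healthy)
```

A flag like this could trigger fresh data collection and retraining, which is the path the dotted line represents.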

It goes without saying that the code written should follow best practices of modularity and the SOLID principles, leading to maintainability and extensibility.

All of this is good as long as the scale of data does not cross what can be handled by a single large machine. In the realm of single large machines, all of this would ideally be deployed as containerized applications or traditional n-tier architectures with their own data stores and processing capabilities, exposing the models through APIs.

But the moment the scale of data crosses such a boundary, the only way to handle it is to use distributed architectures. The Big data stack provides one such ecosystem whose primary functioning is based on distributed computing and storage principles. Let us understand Big data architecture and its capabilities.

Big Data Architecture

Let us now understand a typical application architecture on a big data platform. This would include building data lakes and the ability to serve their various customers, typically data analysts, business analysts and data engineers.

And when ML and Big data come together, the customers include data scientists and ML engineers too (which I will address in the next section).

This is a generic architecture that should serve most enterprise data lakes - both from the perspective of building the lakes as well as serving data from the lakes for various use cases and stakeholders. No technology stack other than Hadoop is mentioned here, as each of the components has multiple options that should be evaluated based on the use cases of the organization.

This architecture has multiple elements in it - the ingress pipeline, the various data zones, the data processing pipelines, the streaming layer, the egress and the serving layers. Each of these components has to be well thought through to ensure it serves almost all the use cases of an organization.

The Ingress Pipeline: All data coming into your data lake should come through a common mechanism, so that data governance, data lineage management and aspects of data security can all be centralised and governed well. This part can grow into an unmanageable nightmare if you allow multiple ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) pipelines.

The Landing Zone: All data that comes in and does not need near-real-time processing lands here and is maintained for a pre-defined duration in the original raw format, for audit and traceability purposes. Practices of regular clean up have to be put into place.

Data Validation: This is where all types of data validation are done. Where possible, validate the data by comparing it with the source; where that is not possible, validate the data for its own semantics, as described in detail in my article on "Data Validation - During ingestion into data lake". As there are no ready-made tools for this, building a framework will take you a long way.
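The two styles of validation can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the framework from the referenced article: the record layout, the field names and the rules in `RULES` are hypothetical, and on a real big data platform the same checks would run as distributed jobs.

```python
import hashlib


def checksum(rows):
    """Order-independent fingerprint of a batch of records (dicts with string keys)."""
    digests = sorted(
        hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()
        for row in rows
    )
    return hashlib.sha256("".join(digests).encode()).hexdigest()


def matches_source(source_rows, landed_rows):
    """Source comparison: same row count and same content fingerprint."""
    return (
        len(source_rows) == len(landed_rows)
        and checksum(source_rows) == checksum(landed_rows)
    )


# Hypothetical semantic rules, used when the source cannot be queried.
RULES = {
    "order_id": lambda v: isinstance(v, int) and v > 0,
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
    "currency": lambda v: v in {"USD", "EUR", "INR"},
}


def semantic_errors(row):
    """Return the names of fields that are missing or break their rule."""
    return [
        field
        for field, rule in RULES.items()
        if field not in row or not rule(row[field])
    ]


source = [
    {"order_id": 1, "amount": 10.0, "currency": "USD"},
    {"order_id": 2, "amount": 5.5, "currency": "EUR"},
]
landed = list(reversed(source))  # same records, different arrival order
bad_row = {"order_id": 3, "amount": -1, "currency": "XYZ"}

print(matches_source(source, landed))
print(semantic_errors(bad_row))
```

Keeping the rules in a single table like `RULES` is one way to let the framework grow without touching the validation logic itself.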

Data Lake: The data lake is where you have data that is trustworthy, ready to be served to all of the consumers. However, this data is still in its original form, albeit clean. Even as it is, this data is very useful for deriving insights. Considering that this is a big data platform, you can allow years of historical data to grow. This is immensely valuable for an organization that believes "Data is the new gold". Data can be read from here directly, but most often it requires further transformations.

De-normalized layer and Data Cubes: As the data is huge, joining data and deriving insights becomes a highly expensive process. Hence, one of the best practices is to be able to create a de-normalized layer of data for each domain in the organization. Then, all the users of that domain can get what they are looking for without expensive processing over and over again. The denormalized layer is almost equivalent to the facade design pattern. While the sources of data may change, as long as the domain picks up from the denormalized layer, it is protected from the change in sources.

Also, if very similar aggregations are required repeatedly, building cubes of data with pre-aggregation could be a good idea. You could even introduce big data OLAP capabilities here so that the data can be served to reporting tools more natively. Some of the big data OLAP tools have been discussed in my article "Hadoop for Analysts".
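As a rough illustration of both ideas, the sketch below joins two small tables once into a de-normalized view and then pre-aggregates that view into a cube. Plain Python stands in for the distributed engine you would actually use, and the tables, column names and the `denormalize` and `build_cube` helpers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical normalized tables as they might sit in the data lake.
customers = {
    1: {"name": "Asha", "region": "South"},
    2: {"name": "Ravi", "region": "North"},
}
orders = [
    {"order_id": 101, "customer_id": 1, "month": "2021-05", "amount": 250.0},
    {"order_id": 102, "customer_id": 2, "month": "2021-05", "amount": 100.0},
    {"order_id": 103, "customer_id": 1, "month": "2021-06", "amount": 50.0},
    {"order_id": 104, "customer_id": 1, "month": "2021-05", "amount": 75.0},
]


def denormalize(orders, customers):
    """Join once, so that consumers of the domain never repeat the join."""
    return [{**order, **customers[order["customer_id"]]} for order in orders]


def build_cube(rows, dimensions, measure):
    """Pre-aggregate a measure over the chosen dimensions."""
    cube = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        cube[key] += row[measure]
    return dict(cube)


wide = denormalize(orders, customers)
sales_cube = build_cube(wide, ("region", "month"), "amount")
print(sales_cube)
```

Consumers read only `wide` or `sales_cube`, so a change in how `orders` or `customers` is sourced stays hidden behind the de-normalized layer, which is the facade-like protection described above.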

Egression or Serving Layer: Once the data is processed, transformed and available, you have to be able to serve this data through a serving layer. This could be providing APIs through various technologies. An API can serve data right out of the Hadoop platform or you could publish this data out of Hadoop. An egress framework here would ensure that data produced within the data lake can be made available for all types of consumers to consume in batches or even near real-time.

If all the above aspects are taken care of, you have a working architecture for building data lakes and using them on a big data platform.

Machine Learning with Big Data

Having understood both the architectures independently, we need to see how they can work together and allow for new possibilities.

Since Machine Learning is all about "Learning from Data", and since Big data platforms have data lakes consisting of all the data one can have, it is only logical that they come together to provide even more insights and even better predictions, opening up opportunities for businesses as never seen before.

Have a look at the amalgamated architecture. All you have to do is extend your data pipelines to now support machine learning too.

Most of the architecture looks very similar to the big data architecture, right? Yes, and that's the point. Just extending it a little, as shown by the red dotted lines, gives your machine learning models the power of a big data platform.

Let us focus on the pipeline starting with feature engineering up to predictions. Now you can use the data from the data lake and transform it into required features using the power of a distributed platform. The features can be stored in a feature repository that feeds into the models that are being trained.
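A toy version of this step might look like the following. It is a sketch only: `FeatureRepository` is a hypothetical in-memory stand-in for a real feature repository, the event layout and feature names are invented for illustration, and the aggregation would run on the distributed platform rather than in a single Python process.

```python
from collections import defaultdict

# Hypothetical cleaned records read from the data lake.
raw_events = [
    {"customer_id": 1, "amount": 250.0},
    {"customer_id": 1, "amount": 50.0},
    {"customer_id": 2, "amount": 100.0},
]


def engineer_features(events):
    """Turn raw events into per-customer features."""
    totals, counts = defaultdict(float), defaultdict(int)
    for event in events:
        totals[event["customer_id"]] += event["amount"]
        counts[event["customer_id"]] += 1
    return {
        cid: {"order_count": counts[cid], "avg_amount": totals[cid] / counts[cid]}
        for cid in totals
    }


class FeatureRepository:
    """Hypothetical in-memory stand-in for a feature repository."""

    def __init__(self):
        self._store = {}

    def write(self, features):
        self._store.update(features)

    def training_matrix(self, feature_names):
        """Return entity ids and a row of feature values for each of them."""
        ids = sorted(self._store)
        return ids, [[self._store[i][name] for name in feature_names] for i in ids]


repo = FeatureRepository()
repo.write(engineer_features(raw_events))
ids, matrix = repo.training_matrix(["order_count", "avg_amount"])
print(ids, matrix)
```

Because training reads from the repository rather than from the raw data, the same features can be reused by several models without being recomputed.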

The output of parametric models (like logistic regression) can be stored in a models repository and either egressed out or served through APIs. In the non-parametric models where the whole data is required (as in K-Nearest Neighbours kind of algorithms), you can deploy the algorithm code as part of the pipeline itself.
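The difference between the two cases can be seen in a small sketch. The logistic regression coefficients below are made-up numbers rather than a trained model, and the training points for the nearest-neighbour example are hypothetical; the point is only what each kind of model needs at prediction time.

```python
import json
import math

# Parametric: the trained model is a small artifact of coefficients that can be
# stored in a models repository. These values are illustrative, not trained.
model_artifact = json.dumps({"weights": [0.8, -0.5], "bias": 0.1})


def predict_logistic(artifact, x):
    """Score one record using only the stored coefficients."""
    params = json.loads(artifact)
    z = params["bias"] + sum(w * xi for w, xi in zip(params["weights"], x))
    return 1.0 / (1.0 + math.exp(-z))


# Non-parametric: prediction needs the training data itself, so the algorithm
# has to run where the data lives. Hypothetical labelled points.
training_data = [
    ([1.0, 1.0], "A"),
    ([1.2, 0.8], "A"),
    ([5.0, 5.0], "B"),
    ([5.5, 4.5], "B"),
    ([4.8, 5.2], "B"),
]


def predict_knn(data, x, k=3):
    """Majority label among the k training points closest to x."""
    nearest = sorted(data, key=lambda item: math.dist(item[0], x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)


probability = predict_logistic(model_artifact, [1.0, 2.0])
label = predict_knn(training_data, [5.1, 5.0])
print(probability, label)
```

The logistic model travels as a few numbers in `model_artifact`, whereas `predict_knn` cannot answer without `training_data`, which is why it is deployed inside the pipeline.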

This shows that extending the data pipelines that existed so far into algorithms and models is the only extra part to be done!

The rest of the aspects of production-ready machine learning, consisting of authentication, monitoring, task orchestration and infrastructure provisioning, are all available out of the box from the stack here. None of these is explicitly depicted in the diagram because they are taken for granted on this stack.

You no longer have to work with small data sets, only to find that the statistical performance has degraded when you deploy with larger data! Power unto you, power unto the data scientists and ML engineers - with all of the data, the processing power and the large memory.

Doesn't this sound liberating? It is indeed, though there are a few challenges and nuances one has to understand to make this work for your organization.

Final Words

Machine Learning in a containerised world is itself a very empowering paradigm. Culling unforeseen insights and predictions has become a reality with the ushering in of ML.

Big data platforms like Hadoop have brought the parallel processing capability of a distributed architecture to every enterprise, big or small, with the help of affordable commodity hardware. Combining the two opens up new vistas for any organization.

However, tread carefully on how you set up the two aspects of ML and big data together. The skillsets needed for this should not be underestimated. Upfront architectural thinking is a must. Based on your company's use cases and risk appetite, you would have to do a series of POCs to finalize your custom set-up. Still, this article should give you a jump start on that thinking.
