Machine Learning Process - A Success Recipe
It is often said that "the world's most valuable resource is no longer oil, but data". Organizations big and small have taken this to heart and are gathering data of every type, hoping to find a gold mine in it.
However, not many ML projects that use this data are fully successful. Most often, a great deal of effort is put in and yet the benefits are not commensurate.
Following a process would go a long way in overcoming many issues and even correcting the course of action when you don't see the benefits.
Since ML has only gained huge traction in the last decade, there is a common misconception that machine learning is all about building models. That is true if you are only learning the art. But in real business situations, many more things need to be looked into, and the process described here tries to address that.
Machine Learning Process Overview
As shown here, 3 broad aspects need to be taken care of before you go into the iterative process of model development and testing that forms the core of the process.
The trio of questions, data and algorithms need to be addressed before going further.
Questions: Asking the right questions is a very important first step. If you ask frivolous questions, you get frivolous answers.
Data: Collecting the right data that can be useful in answering the questions on hand. This is another important step. It goes without saying that this needs a good amount of domain knowledge as well.
Algorithms: A fair idea of the types of algorithms you are targeting. Would they be supervised, unsupervised, a combination, reinforcement, or perhaps semi-supervised? This depends on whether you have labelled data or are trying to discover new insights. You can read more about algorithm categories here
Once these 3 are taken care of, you get into the actual ML Modeling process - which is iterative in nature and is detailed in the next section.
Machine Learning Modelling Process
The ML modelling process itself has many steps and is highly iterative. It broadly consists of two phases, learning and prediction, as summarised in the diagram here:
Phase 1 (Learning) is about creating models that learn the best from the available data. It consists of many steps, including the very laborious ones of data collection and data cleansing, after you have decided on the category of algorithms you plan to use. Once the model is built and evaluated to perform well on test data, you deploy it into production.
Typically, data collection and cleansing take 70 to 80% of your time, and only about 20% is spent on model development. Within model development, hyper-parameter tuning is the most time-consuming part.
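The learning phase above can be sketched end to end in a few lines. Everything here is illustrative: the single "blood sugar" feature, the labels, and the learned-threshold "model" are hypothetical stand-ins for whatever data and algorithm you actually choose.

```python
import random

# Hypothetical toy dataset: (blood_sugar_level, is_diabetic) pairs.
# Labels and the 125 cut-off are illustrative assumptions.
random.seed(42)
data = [(x, 1 if x > 125 else 0) for x in (random.uniform(70, 200) for _ in range(200))]

# Step 1: split into training and test sets.
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Step 2: "learn" a decision threshold from the training data -
# a stand-in for whatever model-fitting your chosen algorithm does.
positives = [x for x, y in train if y == 1]
negatives = [x for x, y in train if y == 0]
threshold = (min(positives) + max(negatives)) / 2

# Step 3: evaluate on held-out test data before deploying.
correct = sum(1 for x, y in test if (1 if x > threshold else 0) == y)
accuracy = correct / len(test)
print(f"learned threshold={threshold:.1f}, test accuracy={accuracy:.2%}")
```

Only after the held-out evaluation looks acceptable would this "model" (here, just the threshold) be deployed for Phase 2.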
Once the model is ready and deployed, Phase 2 (Predicting) of the process begins. All new data that comes in now will use the deployed model to predict. These predictions serve the business purpose they were meant for.
However, as an ML developer, you continuously monitor the predictions to watch for any drift, and you re-adapt the models by repeating Phase 1 with fresh data and sometimes with newer algorithms.
Note that if you initially find a small drift in the predictions, you can retrain the same models with more fresh data. As the drift gets larger, you may have to re-think your algorithm strategy itself.
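A minimal sketch of that monitoring loop, assuming you log the model's recent prediction scores and keep a baseline distribution from validation time. The drift thresholds below are illustrative assumptions, not recommendations.

```python
def mean(xs):
    return sum(xs) / len(xs)

def drift_action(baseline_scores, recent_scores,
                 small_drift=0.05, large_drift=0.15):
    """Return which Phase 1 response the observed drift suggests."""
    shift = abs(mean(recent_scores) - mean(baseline_scores))
    if shift < small_drift:
        return "keep model"
    if shift < large_drift:
        return "retrain same model on fresh data"
    return "re-think algorithm strategy"

baseline = [0.42, 0.45, 0.40, 0.44, 0.43]
print(drift_action(baseline, [0.43, 0.41, 0.44, 0.42, 0.45]))  # tiny shift
print(drift_action(baseline, [0.50, 0.52, 0.49, 0.51, 0.50]))  # moderate shift
print(drift_action(baseline, [0.70, 0.72, 0.68, 0.71, 0.69]))  # large shift
```

In practice you would compare whole distributions rather than means, but the escalation logic (retrain first, re-think the algorithm only when drift is large) is the point.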
Machine Learning's Product Lifecycle
Now, how does the above ML model development cycle map to an ML Product lifecycle? A typical product life cycle consists of
Research and development
Introduction of the product
Growth of the product
Decline of the product
Each of these phases could be mapped to the typical ML life cycle as shown in the figure here. The key takeaway is that 'Learning' and 'Predicting' are constant in ML projects. They happen iteratively until you plan to phase out that ML product altogether.
For example, if you are thinking of a product that offers targeted promotions based on customer behaviour on the home page of your e-commerce website, you will keep predicting promotion success and fine-tuning through continuous learning from fresh data. Only when you plan to retire this proposition does the cycle of learning and prediction stop.
Machine Learning Questions
ML Questions have their own value chain.
As you can see here, if the questions answer "What", like what went wrong, they just help in monitoring and give you information only in hindsight.
The next level in the questions value chain answers "Why?". These are diagnostic in nature and go one step further on why it happened and not just what happened. This is still reactive.
To take an example, suppose you are asking questions about a 'Search' system. Monitoring questions ask 'What' went wrong when people tried to search for some products. This is necessary to know as a first step, but it is reactive and of little value.
The next would be to ask 'Why' are these products failing to show up in the search? This helps you take corrective action, but only after the failure has happened. The business lost is lost, and we can only take corrective action for future visitors to our website. Up to this level of monitoring and diagnostics, traditional data analytics plays an important role: it helps derive information and insights from past data.
ML can up this game to a different level. It can help in the prediction and even optimization of systems and processes.
It seeks to answer questions that provide predictive 'insights' into incidents that are yet to happen: 'When' do you think searches are going to fail, and which searches are going to fail? Once you can predict this, you can take proactive steps to prevent the failure.
The highest level of questions would answer: 'What if' it fails, 'how to' rectify it, and can it be auto-rectified? These are of the highest value and lead to 'optimized' processes or systems.
Therefore, ML product managers could use this yardstick to see if they are answering valuable questions and getting the maximum benefit from their ML investments.
Machine Learning Data
This is a vast subject. I have tried to highlight five aspects of data that need to be understood and dealt with, at a minimum:
Data Structure - data could be structured, semi-structured or unstructured. Each has nuances that need to be understood.
Scale of Data - data could be big data, with high volumes, high velocity of creation and a great variety. You need to probe the data to understand its veracity and value before using it.
Whether the data contains Labels - does the data have the answers within it (labelled data), or is it unlabelled data?
Data Quality - quality checks mean noise has to be removed, missing values have to be imputed, continuous values may have to be transformed into categorical data, and most often data has to be normalized.
Feature Extraction - Given the data we have, we need to keep the features that are relevant to answering the questions on hand and eliminate those that do not matter. You may also have to derive new features if that helps.
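The data-quality steps above can be sketched on a tiny hypothetical table. The column names and values are illustrative assumptions; real pipelines would use a library like pandas, but the operations are the same.

```python
# Hypothetical customer rows with quality problems.
rows = [
    {"age": 34,   "income": 52000, "city": "Pune"},
    {"age": None, "income": 61000, "city": "Delhi"},  # missing value
    {"age": 29,   "income": None,  "city": "Pune"},   # missing value
]

# Impute missing numeric values with the column mean.
for col in ("age", "income"):
    known = [r[col] for r in rows if r[col] is not None]
    fill = sum(known) / len(known)
    for r in rows:
        if r[col] is None:
            r[col] = fill

# Encode the categorical 'city' column as integer codes.
codes = {c: i for i, c in enumerate(sorted({r["city"] for r in rows}))}
for r in rows:
    r["city"] = codes[r["city"]]

# Normalize numeric columns to the [0, 1] range.
for col in ("age", "income"):
    lo = min(r[col] for r in rows)
    hi = max(r[col] for r in rows)
    for r in rows:
        r[col] = (r[col] - lo) / (hi - lo)

print(rows)
```

After these steps, every column is numeric and on a comparable scale, which is what most algorithms expect as input.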
Machine Learning Algorithms
Finally, we have to discuss the choice of algorithms to be made. At a very broad level, the types of algorithms are listed here.
Supervised algorithms require labelled data, i.e. the answer to the question already exists in the historical data. For example, alongside a set of blood test results, the answer to whether that patient is diabetic is also recorded. This helps in predicting future patients' diabetes status from their test results.
Unsupervised algorithms do not have labelled data. They derive insights from the data with a completely fresh perspective, as in clustering algorithms, or even in neural nets where the output is not known. Customer segmentation is a typical example: you do not have any label for a customer, but you group customers based on similar behaviours.
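A tiny unsupervised sketch of that segmentation idea: grouping customers by one behavioural feature (here, a hypothetical monthly spend) with a hand-rolled one-dimensional k-means. No labels are used; the groups emerge from the data alone.

```python
def kmeans_1d(values, k=2, iters=20):
    # Simple initialisation: spread the two starting centres to the extremes.
    centroids = [min(values), max(values)]
    clusters = []
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [20, 25, 22, 30, 210, 190, 205, 27]  # hypothetical monthly spend
centroids, clusters = kmeans_1d(spend)
print("segment centres:", centroids)
```

The two centres that emerge correspond to a low-spend and a high-spend segment, even though no customer was ever labelled as either.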
Reinforcement algorithms process incoming data and try to maximize a particular reward within the constraints of a given environment. Training is based on the input: the model returns a state, and it is rewarded or punished based on its output. These are the kinds of algorithms used when a machine plays a game and is trained to win it.
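A minimal reinforcement-learning sketch: tabular Q-learning on a toy 5-cell corridor where the only reward ("winning") sits at the rightmost cell. The environment and all hyper-parameters are illustrative assumptions.

```python
import random

random.seed(0)
n_states, actions = 5, (1, -1)           # move right or left along the corridor
q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for _ in range(200):                     # training episodes
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy: explore occasionally, otherwise exploit.
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda a: q[(s, a)])
        s2 = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s2 == n_states - 1 else 0.0   # reward only on "winning"
        best_next = max(q[(s2, b)] for b in actions)
        # Standard Q-learning update.
        q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])
        s = s2

# After training, the greedy policy should always move right.
policy = [max(actions, key=lambda a: q[(s, a)]) for s in range(n_states - 1)]
print("learned policy:", policy)
```

The agent is never told the rules; it discovers that moving right maximizes the reward purely from the reward signal, which is exactly how game-playing agents are trained, just at a vastly larger scale.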
If you address the trio (Questions, Data, Algorithms) before you plunge into modelling, keep the feedback loop in the modelling phase, and align your ML process with the product life cycle, you clearly have higher chances of seeing your ML projects succeed.