# Empirical Models for Analyzing “BIG” Data – What’s the Difference?

“BIG Data” is a current buzzword across almost all businesses and plays a key part in Industry 4.0. The term is most appropriately used to characterize data that is not just large but also complex in nature. Furthermore, there is not just one issue with BIG data: the purposes and objectives differ depending on whether one is in sales, marketing, finance, manufacturing, etc., and there are many different issues to be solved (data collection, warehousing, integration, and analytics). But most of these issues (e.g. collection, warehousing, integration, cloud services) are infrastructure issues that need to be improved in order to ultimately use the data to extract actionable information. The focus in this blog is therefore on the data analysis tools and issues that one must consider to effectively extract such information.

As Chairman of ProSensus and as a Professor Emeritus in Chemical Engineering at McMaster University, I have been actively involved in the analysis of “BIG Data” in the manufacturing/process industries for more than 45 years with close to 100 companies. It is my purpose in this blog to share some of the key issues we have learned about effectively analyzing historical data from the process/manufacturing industries.

To analyze historical data, one needs to make use of models, usually empirical ones such as regression, data mining (deep learning neural networks, decision trees, etc.) or latent variable models. My PhD supervisor in Statistics (G.E.P. Box) often used to say, “All models are wrong but some are useful”. The problem is that most people lump all empirical models into one category, “empirical models”, as though these models were interchangeable irrespective of the nature of the data or the objectives of the problem.

## Active Vs. Passive Models

Whether a model is useful depends upon three factors:

- The objectives of the model
- The nature of the data used for the modeling
- The regression method used to build the model

From an objective point of view, there are basically two major classes of models – those to be used for passive use and those to be used for active use.

Models for passive use are intended only to observe the process in the future. Such passive applications include classification, inferentials or soft sensors, and process monitoring (MSPC). For such passive uses one does not need or even want causal models. Rather, one wants to model just the normal variations common to the operating process. Historical data is ideal for building such models.

Models for active use are intended to be used to actively alter the process. Such active applications include using the models to optimize or control the process, to trouble-shoot process problems, or to gain causal information from the data. For active use one needs causal models. Causality implies that for any active changes in the adjustable or manipulatable variables in the process, the model will reliably predict the changes in the output of interest.

The problem is that to guarantee causality in any set of adjustable process variables, one needs to have independent variation in those variables, such as would result from a designed experiment performed on the plant. But historical plant operating data almost never contains such information. Instead, most variables vary in a highly correlated manner, and the number of independent variations in the process is usually much smaller than the number of process variables. This poses a major problem if we want causal models, since most of the data available in industry is historical operating data. The question then becomes: what analysis methods are useful for obtaining active models from historical operating data?
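To make the point about correlated variables concrete, here is a minimal sketch on simulated data (not data from any real plant). It assumes a hypothetical process in which 50 measured variables are all driven by only 3 underlying sources of variation, and it counts how many principal directions are needed to explain 99% of the variance. The function names, sizes and noise level are illustrative assumptions.

```python
import numpy as np

def simulate_plant_data(n_obs=500, n_vars=50, n_drivers=3, noise=0.05, seed=0):
    """Hypothetical historical data: many measured variables, few underlying drivers."""
    rng = np.random.default_rng(seed)
    drivers = rng.normal(size=(n_obs, n_drivers))    # the few independent variations
    loadings = rng.normal(size=(n_drivers, n_vars))  # how each driver shows up in each variable
    return drivers @ loadings + noise * rng.normal(size=(n_obs, n_vars))

def effective_dimension(X, variance_fraction=0.99):
    """Smallest number of principal directions explaining the given fraction of variance."""
    Xc = X - X.mean(axis=0)                          # mean-center each variable
    s = np.linalg.svd(Xc, compute_uv=False)          # singular values, largest first
    explained = np.cumsum(s**2) / np.sum(s**2)       # cumulative explained variance
    return int(np.searchsorted(explained, variance_fraction) + 1)

X = simulate_plant_data()
print(X.shape[1], "measured variables,",
      effective_dimension(X), "independent directions of variation")
```

In a dataset like this, no amount of additional observations will separate the effect of one measured variable from another, because the variables never moved independently in the first place.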

## A Deeper Look into Machine Learning

Machine Learning (ML) methods are currently the rage in “BIG Data” communities. These include deep learning neural networks and large ensembles of decision trees. These new ML approaches are improvements on the older “shallow” neural networks (a few layers connecting all variables to all nodes) and single large decision trees, both of which led to overfitting of the data and to large variances in the predictions. The newer ML approaches are aimed at overcoming some of these deficiencies. Deep learning NNs use many simplified layers, regularization and averaging to reduce the effective number of parameters and the overfitting. Newer decision tree methods (e.g. Random Forests) build many decision trees, each based on fewer randomly selected variables, and then average or vote on the results to reduce the variance of the predictions.

These newer ML methods can be very good for passive uses. But they cannot be used for extracting interpretable or causal models from historical data for active use. With historical data, there are an infinite number of models that can arise from any of these machine learning methods, all of which might provide good predictions of the outputs, but none of which is unique or causal. Because the process variables are all highly correlated and the number of independent variations in the process is much smaller than the number of measured variables, one can get many ML models, all using different variables and having different weights or coefficients on those variables, that give nearly identical predictions. This rules out meaningful interpretation, all the more so when the results come from averaging or voting over many (often 1000 or more) models.
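The non-uniqueness can be shown with ordinary least squares standing in for the more elaborate ML methods. The sketch below uses simulated data with 10 measured variables but only 2 independent drivers (an idealized, exactly collinear case; real plant data is nearly rather than exactly collinear). Two models with very different coefficients give the same predictions on the historical data, yet disagree about what happens if one variable is moved on its own. All names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_vars, n_drivers = 200, 10, 2

# Hypothetical correlated process data: 10 measured variables, 2 independent drivers.
T = rng.normal(size=(n_obs, n_drivers))
X = T @ rng.normal(size=(n_drivers, n_vars))          # exactly rank 2
y = T @ np.array([1.0, -0.5]) + 0.01 * rng.normal(size=n_obs)

# Model A: the minimum-norm least-squares solution.
b_a, *_ = np.linalg.lstsq(X, y, rcond=None)

# Model B: add a direction the data never explored (null space of X).
_, _, Vt = np.linalg.svd(X)
null_direction = Vt[-1]                               # X @ null_direction is ~0
b_b = b_a + 5.0 * null_direction

pred_gap = np.max(np.abs(X @ b_a - X @ b_b))          # difference in fitted values
coef_gap = np.max(np.abs(b_a - b_b))                  # difference in coefficients
print("largest prediction difference on historical data:", pred_gap)
print("largest coefficient difference:", coef_gap)

# "Active" use: step a single variable by one unit, holding the others fixed.
step = np.zeros(n_vars)
step[np.argmax(np.abs(null_direction))] = 1.0
print("predicted effect, model A:", step @ b_a)
print("predicted effect, model B:", step @ b_b)
```

Both models are equally good for passive prediction on data that looks like the past, but they give different answers to the active question, and the historical data alone cannot say which is right.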

Nevertheless, these ML models have proven to be very powerful in passive applications, e.g. deep learning NNs for image analysis and Random Forests for medical diagnosis, both of which are passive applications (no interest in altering the image or the patient).

## Benefits of Using MVA

Latent variable models such as PLS (Partial Least Squares or Projection to Latent Structures) were developed specifically to handle “BIG Data” where the real number of things affecting the process is much smaller than the number of measured variables. Typically, the number of latent variables needed to extract usable information from hundreds of process variables is on the order of 3 to 10, implying that the true number of degrees of freedom that affect the process is often quite small.

These latent variable models, such as PLS or PCR (Principal Component Regression), are a total break from classical statistical regression and machine learning models in that they assume that the input or regressor (X) space and the output (Y) space are not of full statistical rank, and so they provide models for both the X and Y spaces rather than just the Y space. Without simultaneous models for both the X and Y spaces, there is no model uniqueness, nor any model interpretability or causality. It is for these reasons that ProSensus has used latent variable models for nearly all its empirical modeling when active models are needed.
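For readers who prefer to see the structure written out, the two-space model is commonly expressed as follows. The symbols are the conventional ones from the PLS literature, not notation defined in this blog: $T$ holds the latent variable scores, $P$ and $Q$ are the loadings for the X and Y spaces, $W^{*}$ is the weight matrix that computes the scores from X, and $E$ and $F$ are residuals.

$$
X = T P^{\top} + E, \qquad Y = T Q^{\top} + F, \qquad T = X W^{*}
$$

The same small set of scores $T$ appears in both equations, which is what ties the model of the X space to the model of the Y space.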

Latent variable models provide unique and causal models because they simultaneously model the X and Y spaces. However, they provide causality only in the reduced-dimension space of the latent variables, as this is the only space within which the process has varied. By moving the latent variables one can reliably predict the outputs (Y), i.e., the definition of causality holds in the LVs. But to move the LVs one cannot just adjust individual X variables; rather, one must adjust the combinations of X variables that define the LVs, as given by the X-space model.
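What "moving the LVs" means can be sketched with a bare-bones PLS fit for a single output, written here with the textbook NIPALS algorithm in NumPy. This is an illustration on simulated data under assumed sizes and noise levels, not the software ProSensus uses; the function `pls1_fit` and all variable names are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Minimal NIPALS PLS for one output; X and y must already be mean-centered."""
    X, y = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)        # weight vector (unit length)
        t = X @ w                     # scores for this latent variable
        tt = t @ t
        p = X.T @ t / tt              # X-space loading
        c = (y @ t) / tt              # Y-space loading
        X -= np.outer(t, p)           # deflate X
        y -= c * t                    # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    W_star = W @ np.linalg.inv(P.T @ W)   # scores from original X: T = X @ W_star
    return W_star, P, q

# Simulated historical data: 8 measured variables, 2 underlying drivers.
rng = np.random.default_rng(2)
T_true = rng.normal(size=(300, 2))
X = T_true @ rng.normal(size=(2, 8)) + 0.01 * rng.normal(size=(300, 8))
y = T_true @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=300)
Xc, yc = X - X.mean(axis=0), y - y.mean()

W_star, P, q = pls1_fit(Xc, yc, n_components=2)

# Move one unit along the first latent variable.
delta_t = np.array([1.0, 0.0])
delta_x = P @ delta_t                 # coordinated change in all 8 X variables
delta_y = q @ delta_t                 # predicted change in the output
print("required change in each X variable:", np.round(delta_x, 3))
print("predicted change in y:", round(float(delta_y), 3))
```

The move is specified in the latent variable space, and the X-space model (the loadings `P`) translates it into a coordinated change in all of the measured variables, which keeps the process inside the region where the historical data actually varied.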

ProSensus has used this uniqueness and causality provided by LV models for many years now to trouble-shoot and understand processes and to optimize both processes and products based on actionable information extracted from historical data.