Data-Driven Modelling, Data Reconciliation and Fault Detection using Principal Component Analysis and its New Variants


  • Arun K. Tangirala
  • Shankar Narasimhan


  • Arun K. Tangirala, Dept. of Chemical Engineering, IIT Madras, India
  • Shankar Narasimhan, Dept. of Chemical Engineering, IIT Madras, India


Models are central to all applications of process automation, including design, data reconciliation, control, optimization and process monitoring. Developing models from data, formally known as system identification, has been a powerful alternative to first-principles approaches and, in many situations, the de facto choice for complex processes. Data-driven models are also advantageous in capturing the effects of uncertainties, modelling random signals and estimating noise levels in processes and measurements. Nearly seven decades of literature offer a rich repertoire of techniques, with excellent and practically useful software tools for identification. A majority of these techniques, however, cater to the case of error-free inputs. In a large number of applications, the inputs (in addition to the outputs) are known only with errors; identification problems in these cases are known as errors-in-variables (EIV) identification. Techniques devised for classical identification, when applied to EIV problems, are known to result in biased estimates. Furthermore, the statistical properties of the input errors need to be estimated. In this respect, EIV identification has emerged as a significant branch of identification in its own right (Soderstrom, 2018). It finds applications in a variety of engineering and other fields, including the process industry, manufacturing, biological processes and econometrics. Techniques for EIV identification include (total) least-squares, instrumental-variable, maximum-likelihood and frequency-domain algorithms. In recent times, principal component analysis (PCA)-based techniques for EIV identification, specifically the iterative and dynamic iterative PCA methods (IPCA and DIPCA), have been developed for building steady-state and dynamical models, respectively (Narasimhan and Shah, 2008; Maurya et al., 2018).
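The bias of classical estimators under input errors can be illustrated with a small simulation. The following Python/NumPy sketch is not taken from the workshop material: the gain of 2.0, the noise level and the sample size are hypothetical, and the error variances of the two variables are assumed equal, which is the special case in which plain total least squares (computed here through the SVD, i.e. ordinary PCA) is consistent. It is not an implementation of IPCA.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20000
true_slope = 2.0   # hypothetical steady-state gain
noise_std = 0.5    # assumed equal for input and output errors

# Noise-free variables, then noisy measurements of BOTH input and output
x_true = rng.normal(0.0, 1.0, n_samples)
y_true = true_slope * x_true
x_meas = x_true + rng.normal(0.0, noise_std, n_samples)
y_meas = y_true + rng.normal(0.0, noise_std, n_samples)

# Mean-centre the measurements
xc = x_meas - x_meas.mean()
yc = y_meas - y_meas.mean()

# Ordinary least squares: treats the input as error-free
ols_slope = (xc @ yc) / (xc @ xc)

# Total least squares: the right singular vector with the smallest
# singular value is normal to the fitted line, n[0]*x + n[1]*y = 0
Z = np.column_stack([xc, yc])
_, _, vt = np.linalg.svd(Z, full_matrices=False)
normal = vt[-1]
tls_slope = -normal[0] / normal[1]

print(f"OLS slope: {ols_slope:.3f}")
print(f"TLS slope: {tls_slope:.3f}")
```

With these settings the large-sample limit of the least-squares slope is 2.0 x 1/(1 + 0.25) = 1.6, the well-known attenuation caused by input noise, whereas the total least-squares slope stays close to 2.0.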
A key advantage of the PCA-based formulation over existing methods is that, with careful handling of measurement noise, the order of the dynamical system, the model coefficients and the noise variances can be estimated consistently using a layered approach. Consistent order determination applies to both input-output and state-space models. Moreover, PCA-based approaches are symmetric, in the sense that models are first identified as constraints, with no prior distinction of variables as inputs and outputs required. Subsequently, the model is obtained by partitioning the variables into inputs and outputs and rewriting the constraints accordingly. This approach is also useful in other applications such as soft sensing and imputation of missing observations. Models built using dynamic PCA carry biased estimates in general, whereas those developed using dynamic iterative PCA are consistent (asymptotically unbiased). Furthermore, DIPCA facilitates a methodology for accurate model order determination and provides estimates of the noise covariance matrix.
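The "constraints first, inputs and outputs later" idea can be sketched for a steady-state case as follows. This is an illustrative Python/NumPy example under simplifying assumptions, not the IPCA or DIPCA algorithm: the coefficient matrix `B_true`, the noise level and the variable ordering are hypothetical, the number of constraints is taken as known, and all error variances are assumed equal so that a single SVD suffices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 20000
noise_std = 0.2                      # assumed equal for all four variables
B_true = np.array([[1.5, -0.7],
                   [0.5,  2.0]])     # hypothetical model: y = B_true @ u

# Noise-free data satisfy two linear constraints among z = [u1, u2, y1, y2]
u_true = rng.normal(0.0, 1.0, (n_samples, 2))
y_true = u_true @ B_true.T
Z_true = np.hstack([u_true, y_true])
Z = Z_true + rng.normal(0.0, noise_std, Z_true.shape)

# Step 1: identify the constraints A @ z ~ 0 without labelling any
# variable as input or output. The rows of A are the right singular
# vectors belonging to the smallest singular values.
Zc = Z - Z.mean(axis=0)
_, sing_vals, vt = np.linalg.svd(Zc, full_matrices=False)
n_constraints = 2                    # assumed known in this sketch
A = vt[-n_constraints:]

# Step 2: choose a partition and rewrite A_u @ u + A_y @ y = 0
# as the regression model y = -(A_y^{-1} A_u) @ u
input_idx, output_idx = [0, 1], [2, 3]
A_u, A_y = A[:, input_idx], A[:, output_idx]
B_est = -np.linalg.solve(A_y, A_u)

# The same constraints also give the reverse model u = B_rev @ y
B_rev = -np.linalg.solve(A_u, A_y)

print(np.round(B_est, 3))
```

The constraint matrix `A` is determined only up to a non-singular transformation of its rows, but the regression matrix obtained after partitioning does not depend on that choice, which is why the partition can be deferred to the last step.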
Two of the key applications of models are data reconciliation and statistical process monitoring (a.k.a. fault detection). Models for data reconciliation are typically built from first principles. Based on the published work of Narasimhan and Bhatt (2015), we demonstrate that PCA (and its iterative version) can nevertheless be used for both model development and data reconciliation. This is a key result, since PCA then serves as a single standalone tool for both tasks. Model-based fault detection methods rely on the key step of residual generation. DIPCA models combined with EIV-Kalman filters provide optimal residuals and result in improved fault detection compared to standard dynamic PCA-based approaches (Mann et al., 2019). This is because the standard approaches result in biased estimates and non-unique residuals, whereas DIPCA estimates are not only accurate but also yield unique and optimal residuals when combined with the Kalman filter. Furthermore, these methods are invariant to non-singular transformations of the data and provide consistent results, in contrast to the contribution charts currently used for diagnosis with PCA-based approaches.
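A minimal sketch of using PCA for both model development and data reconciliation is given below in Python/NumPy. It is an illustration under stated assumptions rather than the procedure of the cited work: the three-stream flow splitter (F1 = F2 + F3), the flow ranges and the noise level are hypothetical, the error covariance `Sigma` is assumed known and equal across sensors, and the helper `reconcile` applies the standard weighted least-squares reconciliation formula x_hat = y - Sigma A^T (A Sigma A^T)^{-1} A y.

```python
import numpy as np

def reconcile(Y, A, Sigma):
    """Adjust each row of Y so that A @ x = 0 holds, weighting the
    adjustments by the measurement error covariance Sigma."""
    S = A @ Sigma @ A.T
    K = np.linalg.solve(S, A @ Sigma).T   # equals Sigma A^T S^{-1}
    return Y - (Y @ A.T) @ K.T

rng = np.random.default_rng(2)
n_samples = 5000
noise_std = 0.3                            # assumed known, same for all sensors

# Hypothetical splitter: stream 1 splits into streams 2 and 3
f2 = rng.uniform(4.0, 8.0, n_samples)
f3 = rng.uniform(2.0, 5.0, n_samples)
X_true = np.column_stack([f2 + f3, f2, f3])
Y = X_true + rng.normal(0.0, noise_std, X_true.shape)

# Model development: estimate the balance constraint from data alone.
# The constraint is homogeneous (no intercept), so the SVD is applied
# to the uncentred measurements.
_, _, vt = np.linalg.svd(Y, full_matrices=False)
A_est = vt[-1:]                            # one constraint, shape (1, 3)

# Data reconciliation with the estimated constraint
Sigma = noise_std**2 * np.eye(3)
X_hat = reconcile(Y, A_est, Sigma)

rmse_raw = np.sqrt(np.mean((Y - X_true) ** 2))
rmse_rec = np.sqrt(np.mean((X_hat - X_true) ** 2))
balance_residual = np.abs(X_hat @ A_est.T).max()

print("estimated constraint direction:", np.round(A_est[0], 3))
print(f"RMSE raw: {rmse_raw:.3f}, RMSE reconciled: {rmse_rec:.3f}")
```

The estimated constraint is proportional to [1, -1, -1] up to sign, and the reconciled flows satisfy it exactly while lying closer to the true values than the raw measurements. When the error variances are unequal or unknown, a single SVD no longer recovers the constraint consistently, which is the situation the iterative variant is intended to handle.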
The focus of this workshop is on presenting the theory and tools for PCA-based approaches to EIV identification; a brief overview of other methods will also be provided. The objective is to deliver a simple standalone technique for the automated, complete identification of a process from input-output data, covering order estimation, model coefficients and error variances. Subsequently, we focus on statistical fault diagnosis using these models and residual-based approaches using EIV-Kalman filters. Participants will be trained in (i) developing regression models for steady-state processes, (ii) developing dynamical models, including both input-output and state-space classes, for dynamic processes and (iii) fault detection using the DIPCA-EIV-Kalman filter. Case studies using an in-house MATLAB-based GUI toolbox for model identification and fault detection will be presented to illustrate the methods on applications relating to regression, data reconciliation and fault detection.


  • EIV Modelling - Overview, Shankar Narasimhan
  • Preliminaries, Arun Tangirala
  • Steady-state EIV Regression using IPCA, Shankar Narasimhan
  • Data Reconciliation using PCA / IPCA, Shankar Narasimhan
  • Case Studies using EIV-DIPCA Toolbox, Shankar Narasimhan
  • Dynamical Models: Review, Arun Tangirala
  • Dynamic Iterative PCA for EIV Input-Output Identification, Arun Tangirala
  • Identification of State-Space Models for EIV Processes, Arun Tangirala
  • Fault Detection using PCA-based approaches: Review, Shankar Narasimhan
  • EIV-Kalman Filter and DIPCA-based fault detection, Shankar Narasimhan and Arun Tangirala
  • Case Studies using EIV-DIPCA Toolbox, Arun Tangirala