Probabilistic Models For Data Classification


In the era of digital transformation, organizations amass an unprecedented volume of data, which often includes both regulated data (PII and data governed by SOX, HIPAA, CCPA, UCPA, etc.) and valuable intellectual property (IP). Ensuring the visibility and proper classification of this data is crucial for compliance, risk management, and safeguarding corporate assets. Inspect-Data uses several probabilistic classification models that can aid in these tasks, including the Naive Bayes Classifier, Logistic Regression, Hidden Markov Models (HMMs), and Conditional Random Fields (CRFs).

In machine learning, classification is an instance of supervised learning, i.e., inferring a function from labeled training data. The training data consists of a set of training examples, where each example is a pair consisting of an input object and a desired output value. Given such a set, the task of a classification algorithm is to analyze the training data and produce an inferred function that can be used to classify new examples by assigning a label to each of them. An example would be assigning a given piece of information to the sensitive or the non-sensitive class.

A common subclass of classification is probabilistic classification, and the models described below are examples of it. Probabilistic classification algorithms use statistical inference to find the best class for a given example. Rather than simply assigning the best class, as other classification algorithms do, a probabilistic classification algorithm outputs the probability of the example being a member of each of the possible classes. The class with the highest probability is then normally selected as the best class. In general, probabilistic classifiers have two advantages over non-probabilistic classifiers. First, a probabilistic classifier outputs a confidence value associated with its selected class label, so it can abstain when its confidence in every possible output is too low. Second, probabilistic classifiers can be incorporated more effectively into larger machine learning tasks, in a way that partially or completely avoids the problem of error propagation. Error propagation, sometimes referred to as propagation of uncertainty, is the effect that the uncertainties of individual measurements have on the uncertainty of a calculated value that is based on those measurements. Understanding how to propagate errors correctly can be critical for determining the accuracy and reliability of a calculated value.
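The abstention behavior described above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_or_abstain`, the 0.8 threshold, and the example probabilities are assumptions made for this sketch, not part of Inspect-Data.

```python
from typing import Dict, Optional


def classify_or_abstain(probs: Dict[str, float], threshold: float = 0.8) -> Optional[str]:
    """Return the most probable label, or None when confidence is below the threshold."""
    # Pick the class with the highest probability.
    label, confidence = max(probs.items(), key=lambda item: item[1])
    # Abstain (return None) when even the best class is not confident enough.
    return label if confidence >= threshold else None


# Hypothetical classifier outputs for two documents.
confident = classify_or_abstain({"sensitive": 0.93, "non-sensitive": 0.07})
uncertain = classify_or_abstain({"sensitive": 0.55, "non-sensitive": 0.45})

print(confident)  # sensitive
print(uncertain)  # None
```

A document for which the function returns None could, for instance, be routed to a slower model or to human review instead of being given a low-confidence label.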

Fundamental Models and Algorithms

Naive Bayes Classifier. A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. The model assumes that, given the class, each feature contributes independently to the outcome. Its strength lies in text classification, which makes it valuable for identifying documents that contain regulated data or intellectual property. However, its 'naive' assumption of feature independence can lead to oversimplification, potentially missing intricate relationships between data elements.
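As a rough illustration of the idea, here is a minimal multinomial Naive Bayes text classifier written with the Python standard library only. The training sentences, the labels, and the function names are made up for this sketch, and add-one (Laplace) smoothing is an assumption chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train_naive_bayes(docs: List[Tuple[str, str]]):
    """Count documents per class and word occurrences per class."""
    class_counts: Counter = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict_proba(text: str, class_counts, word_counts, vocab) -> Dict[str, float]:
    """Return P(class | text) under the naive independence assumption."""
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothed log likelihood of the word given the class.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalize the log scores into probabilities that sum to 1.
    max_score = max(log_scores.values())
    exp_scores = {label: math.exp(s - max_score) for label, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {label: v / z for label, v in exp_scores.items()}


# Hypothetical toy training set.
training = [
    ("patient diagnosis record ssn", "sensitive"),
    ("employee ssn salary record", "sensitive"),
    ("lunch menu for friday", "non-sensitive"),
    ("team meeting agenda friday", "non-sensitive"),
]
model = train_naive_bayes(training)
nb_probs = predict_proba("employee ssn record", *model)
print(nb_probs)
```

Each word adds its own log likelihood term to the score, which is exactly the independence assumption: the model never looks at how words co-occur.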

Logistic Regression. Logistic regression is an approach for predicting the outcome of a categorical dependent variable based on one or more observed variables. By predicting the probability that data belongs to a specific class (e.g., 'regulated data' or 'not regulated data'), it helps Inspect-Data determine which data requires stringent protection. However, its effectiveness depends on how well the logistic function, applied to a linear combination of the observed variables, fits the data, and it may not effectively handle complex or non-linear relationships.
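The following sketch shows binary logistic regression trained with plain batch gradient descent. The two features (a count of SSN-like patterns and a count of medical terms), the toy data, and the learning rate and epoch count are all hypothetical choices for illustration.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features: List[float], weights: List[float], bias: float) -> float:
    """Probability that the example belongs to the positive class."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def train_logistic(X: List[List[float]], y: List[int],
                   lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict(features, weights, bias) - label
            for j, f in enumerate(features):
                grad_w[j] += error * f
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical features: [SSN-like pattern count, medical term count].
X = [[3, 2], [2, 0], [1, 3], [0, 0], [0, 1], [0, 0]]
y = [1, 1, 1, 0, 0, 0]  # 1 = regulated data, 0 = not regulated data

weights, bias = train_logistic(X, y)
p_regulated = predict([2, 1], weights, bias)
p_clean = predict([0, 0], weights, bias)
print(p_regulated, p_clean)
```

Because the decision boundary is a straight line in feature space, a model like this cannot capture interactions between features unless they are added as explicit extra features.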

Hidden Markov Model. A Hidden Markov Model (HMM) is a simple case of a dynamic Bayesian network in which the hidden states form a chain and each hidden state can only be observed indirectly, through a value it emits. One goal of an HMM is to infer the hidden states from the observed values and their dependency relationships. A very important application of HMMs is part-of-speech tagging in NLP. The same kind of sequence modeling can help detect patterns or behaviors related to the misuse of regulated data or IP. However, HMMs can be computationally intensive, and they assume that the underlying process is Markovian (i.e., future states depend only on the present state and not on the sequence of events that preceded it), which might not always hold.
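Inferring the most likely hidden state sequence is commonly done with the Viterbi algorithm. The sketch below assumes a made-up two-state model ("normal" versus "exfil" activity) observed through "small" or "large" data transfers; the states, observations, and all probabilities are invented for illustration.

```python
import math
from typing import Dict, List


def viterbi(obs: List[str], states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s] is the log probability of the best path that ends in state s at time t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at time t.
            prev = max(states, key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical model parameters.
states = ["normal", "exfil"]
start_p = {"normal": 0.9, "exfil": 0.1}
trans_p = {"normal": {"normal": 0.9, "exfil": 0.1},
           "exfil": {"normal": 0.2, "exfil": 0.8}}
emit_p = {"normal": {"small": 0.9, "large": 0.1},
          "exfil": {"small": 0.2, "large": 0.8}}

hidden_path = viterbi(["small", "large", "large", "large"], states, start_p, trans_p, emit_p)
print(hidden_path)  # ['normal', 'exfil', 'exfil', 'exfil']
```

The running time grows with the sequence length times the square of the number of states, which is one reason HMM inference becomes expensive for large state spaces.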

Conditional Random Fields. A Conditional Random Field (CRF) is a special case of a Markov random field in which the state of each node is conditioned on observed values. CRFs can be considered discriminative classifiers, as they do not model the distribution over observations. Named entity recognition in information extraction is one application of CRFs. This makes CRFs valuable in tasks like identifying segments of regulated data within larger documents or discerning patterns in network traffic to protect IP. However, the complexity of CRFs can make them harder to implement and more computationally demanding.
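To make the discriminative formulation concrete, the sketch below scores label sequences for a short token sequence with a linear-chain CRF and normalizes over all candidate labelings, giving P(labels | tokens) directly. The labels, the feature rules, and the hand-set weights are assumptions for illustration; in practice the weights are learned from data and the normalizer is computed with dynamic programming rather than brute-force enumeration.

```python
import itertools
import math
from typing import Dict, List, Sequence, Tuple

LABELS = ["O", "PII"]  # hypothetical label set: outside / personally identifiable


def emission_score(token: str, label: str) -> float:
    """Hand-set weight for pairing a token with a label (illustrative only)."""
    if any(ch.isdigit() for ch in token):
        return 2.0 if label == "PII" else -1.0
    return 1.0 if label == "O" else -1.0


def transition_score(prev_label: str, label: str) -> float:
    """Hand-set weight that mildly favors staying in the same label."""
    return 0.5 if prev_label == label else -0.5


def sequence_score(tokens: Sequence[str], labels: Sequence[str]) -> float:
    """Unnormalized log score of a full labeling of the token sequence."""
    score = sum(emission_score(t, l) for t, l in zip(tokens, labels))
    score += sum(transition_score(a, b) for a, b in zip(labels, labels[1:]))
    return score


def conditional_distribution(tokens: List[str]) -> Dict[Tuple[str, ...], float]:
    """P(labels | tokens) over every possible labeling, by enumeration."""
    candidates = list(itertools.product(LABELS, repeat=len(tokens)))
    scores = [sequence_score(tokens, c) for c in candidates]
    z = sum(math.exp(s) for s in scores)  # partition function Z(tokens)
    return {c: math.exp(s) / z for c, s in zip(candidates, scores)}


tokens = ["ssn", "123-45-6789", "on", "file"]
dist = conditional_distribution(tokens)
best_labels = max(dist, key=dist.get)
print(best_labels)  # ('O', 'PII', 'O', 'O')
```

Note that nothing in the code models how likely the tokens themselves are; only the labels given the tokens are scored, which is what distinguishes a CRF from a generative model such as an HMM.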

Inspect-Data may combine these models to provide comprehensive data visibility. For example, a Naive Bayes Classifier or Logistic Regression could be used for initial broad-brush data classification, followed by HMMs or CRFs for in-depth analysis of the data identified as sensitive. By leveraging these probabilistic classification models, Inspect-Data can protect regulated data and intellectual property in an increasingly data-driven world.
