After the feature vectors are obtained using the signal processing methods described in the previous chapter, they are fed into a classification algorithm to train a model. The model can then predict whether a signal is stable or unstable. Since our dataset is labeled, we mainly focused on supervised learning. In this chapter, several methods and classification algorithms are used, and their effectiveness is compared.
The simplest method is to set a fixed threshold value to distinguish stable data from unstable data. A statistical method, Naïve Bayes, was tested in Section 3.2. Some commonly used classification algorithms, such as local outlier factor (LOF), support vector machine (SVM), and k-nearest neighbor (k-NN), were applied to our dataset. Finally, we utilized an artificial neural network (ANN) to obtain models for prediction.
3.1 Numerical threshold
Using a fixed threshold as the boundary between the stable and unstable regions is one of the simplest and most straightforward methods. This method relies on the distribution of a feature being separable by a single threshold. The calculation steps are as follows:
1. Choose a feature that is represented by a single numerical value, e.g. the relative energy after FFT of the x-axis signal.
2. Compute the values of this feature for the entire dataset to obtain the distribution of both stable and unstable data.
3. Let there be m stable data points and n unstable data points. Since relative energy is larger for unstable data points, choose the m-th smallest value among these m + n data points. This is the selected threshold. A data point is considered unstable if its feature value exceeds the threshold, and stable otherwise.
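The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names and the feature values are hypothetical, and the decision rule (values above the threshold are labeled unstable) is an assumption consistent with relative energy being larger for unstable data.

```python
def select_threshold(stable_values, unstable_values):
    """Return the m-th smallest of all m + n feature values (m = number of stable points)."""
    m = len(stable_values)
    if m == 0:
        raise ValueError("need at least one stable data point")
    combined = sorted(list(stable_values) + list(unstable_values))
    return combined[m - 1]  # m-th smallest value, counting from 1


def classify(value, threshold):
    # Assumption: values above the threshold are labeled unstable.
    return "unstable" if value > threshold else "stable"


# Hypothetical relative-energy values, for illustration only.
stable = [0.10, 0.15, 0.22, 0.30]
unstable = [0.28, 0.45, 0.60]

threshold = select_threshold(stable, unstable)
predictions = [classify(v, threshold) for v in stable + unstable]
```

With this choice of threshold and no tied values, exactly m points are labeled stable, so the numbers of false positives and false negatives are equal when the two distributions overlap.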
3.2 Naïve Bayes
Naïve Bayes is a probabilistic classifier based on Bayes' theorem and is widely used in machine learning [68] [69] [70]. The general idea is as follows. Given a set of features, such as the energy ratios of the x-, y-, and z-axis signals, we know the distribution of each feature for both unstable and stable data. We then fit each distribution with a parametric curve, e.g. a Gaussian distribution. Given a new data point, we can estimate its probability of being unstable from the fitted distribution and its x-axis energy ratio. The same can be done for the y- and z-axes.
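Given a feature vector $\mathbf{x} = (x_1, \dots, x_n)$ and classes $\{C_k\}$, we want to calculate $p(C_k \mid x_1, \dots, x_n)$, the probability that $\mathbf{x}$ belongs to $C_k$, for each k. To determine which class $\mathbf{x}$ belongs to, we find the k that maximizes $p(C_k \mid x_1, \dots, x_n)$. Assume that the features $x_1, \dots, x_n$ are conditionally independent given the class. By Bayes' theorem,
$$p(C_k \mid \mathbf{x}) = \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}.$$
Since $p(\mathbf{x})$ does not depend on k, the chain rule and the independence assumption give
$$p(C_k \mid \mathbf{x}) \propto p(C_k, x_1, \dots, x_n) = p(C_k)\, p(\mathbf{x} \mid C_k) = p(C_k)\, p(x_1 \mid C_k) \cdots p(x_n \mid C_k),$$
where $p(C_k)$ is called the prior probability, or simply the prior. For our application, we use a uniform prior, $p(C_k) = 1/K$ for all k, where K is the number of classes. The form of $p(x_i \mid C_k)$ depends on the model chosen for fitting, such as Gaussian or Bernoulli.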
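The decision rule above can be sketched as a small Gaussian Naïve Bayes classifier. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the function names and the energy-ratio values are hypothetical, each feature is fitted with a Gaussian, and the prior is uniform. Log-probabilities are summed instead of multiplying probabilities, which gives the same maximizer and avoids numerical underflow.

```python
import math


def fit_gaussian_nb(samples, labels):
    """Estimate a mean and variance for every (class, feature) pair."""
    model = {}
    for c in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == c]
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, max(var, 1e-9)))  # floor avoids division by zero
        model[c] = stats
    return model


def log_gaussian(x, mean, var):
    """Logarithm of the Gaussian density with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def predict(model, x):
    """Return the class maximizing log p(C_k) + sum_i log p(x_i | C_k)."""
    log_prior = -math.log(len(model))  # uniform prior over the classes
    scores = {
        c: log_prior + sum(log_gaussian(v, mu, var) for v, (mu, var) in zip(x, stats))
        for c, stats in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical energy ratios (x-, y-, z-axis), for illustration only.
samples = [
    (0.10, 0.12, 0.08), (0.15, 0.10, 0.11), (0.12, 0.14, 0.09),
    (0.55, 0.60, 0.48), (0.62, 0.52, 0.58), (0.50, 0.57, 0.53),
]
labels = ["stable"] * 3 + ["unstable"] * 3

model = fit_gaussian_nb(samples, labels)
low_energy = predict(model, (0.13, 0.11, 0.10))
high_energy = predict(model, (0.58, 0.55, 0.50))
```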
3.3 Local outlier factor (LOF)
LOF is an algorithm for distinguishing outliers from clusters of data points and has been used in audio and image recognition [71]. Previous work used LOF to classify relative wavelet packet entropy [72]. To use this method in this research, the proportion of unstable data points must be limited so that they are significantly fewer than the stable ones. This is because LOF finds outliers, i.e. points far away from other points.
Define the k-distance $k(A)$ as the distance from point A to its k-th nearest neighbor. Denote the distance between two points A and B as $d(A, B)$. Define the reachability distance as
$$r_k(A, B) = \max\{k(B),\, d(A, B)\}.$$
The reason to use reachability distance instead of the distance between A and B is to get more stable results in computations. Furthermore, define local reachability density as
$$\mathrm{lrd}_k(A) = \frac{|N_k(A)|}{\sum_{B \in N_k(A)} r_k(A, B)},$$
where $N_k(A)$ is the set of the k nearest neighbors of A. Roughly speaking, local reachability density is the reciprocal of the average reachability distance between A and its neighbors. The local outlier factor is
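$$\mathrm{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \frac{\mathrm{lrd}_k(B)}{\mathrm{lrd}_k(A)}}{|N_k(A)|} = \frac{\sum_{B \in N_k(A)} \mathrm{lrd}_k(B)}{|N_k(A)| \cdot \mathrm{lrd}_k(A)}.$$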
Fig. 15. An example of visualized local outlier factors
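The definitions of k-distance, reachability distance, local reachability density, and LOF can be followed step by step in the sketch below. It is a brute-force illustration only, with hypothetical function names and data, and it assumes that no point coincides with k or more other points (otherwise the density is undefined). A score close to 1 means a point is about as dense as its neighbors, while a score well above 1 suggests an outlier.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of points[i] (ties broken by index)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (math.dist(points[i], points[j]), j))
    return others[:k]


def lof_scores(points, k):
    """Local outlier factor of every point, computed directly from the definitions."""
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbor
    k_dist = [math.dist(points[i], points[neighbors[i][-1]]) for i in range(n)]

    def reach(a, b):
        # reachability distance r_k(A, B) = max{k(B), d(A, B)}
        return max(k_dist[b], math.dist(points[a], points[b]))

    # local reachability density: |N_k(A)| divided by the summed reachability distances
    lrd = [len(neighbors[a]) / sum(reach(a, b) for b in neighbors[a]) for a in range(n)]
    return [
        sum(lrd[b] for b in neighbors[a]) / (len(neighbors[a]) * lrd[a])
        for a in range(n)
    ]


# Hypothetical 2-D data: a small cluster plus one isolated point at (5, 5).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (5.0, 5.0)]
scores = lof_scores(points, k=2)
most_outlying = scores.index(max(scores))
```

In this example the isolated point receives by far the largest score, while the clustered points stay close to 1.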
3.4 Support vector machine (SVM)
SVM is a well-known method to separate two classes of data points [73] [74], and has the advantage of good performance because the equations can be written in linear form. SVM has been used for chatter recognition when combined with information entropy [75], WPT [76], and Q-factor [77].
There are two types of SVMs: linear and non-linear. For a linear SVM, suppose there are n points $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)$, where $y_i = \pm 1$. The SVM attempts to separate the points with $y_i = 1$ from those with $y_i = -1$ using a hyperplane $\mathbf{w} \cdot \mathbf{x} - b = 0$, where $\mathbf{w}$ is the normal vector of the hyperplane. The training points closest to the hyperplane are called support vectors. This is illustrated in Fig. 16 (a). In practice, there are cases in which the points cannot be separated by a hyperplane, as in Fig. 16 (b). In such cases, the points are mapped to a higher-dimensional space using a kernel function such as polynomial, RBF, or sigmoid. The mapped points may then be separable with a hyperplane.
Fig. 16. (a) Illustration of linear SVM for a linearly-separable dataset [78] (b) Non-linear SVM with RBF kernel [79]
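The benefit of mapping to a higher-dimensional space can be seen in a toy example. The sketch below does not train an SVM; it uses a hypothetical explicit feature map and a hand-picked hyperplane to show that points which no straight line can separate in the plane become separable after the mapping. A kernel function achieves the same effect implicitly by computing inner products in the mapped space. All names and data are illustrative assumptions.

```python
def lift(point):
    """Map (x1, x2) to (x1, x2, x1^2 + x2^2); a hypothetical explicit feature map."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)


def decision(w, b, x):
    """Return +1 or -1 from the sign of w . x - b (hyperplane w . x - b = 0)."""
    value = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if value > 0 else -1


# Hypothetical data: class -1 lies near the origin, class +1 surrounds it.
# The inner points lie inside the convex hull of the outer points, so no
# straight line in the original plane separates the two classes.
inner = [(0.2, 0.1), (-0.3, 0.2), (0.1, -0.4), (-0.2, -0.2)]
outer = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0), (0.0, -2.0)]

# Hand-picked hyperplane in the lifted space, equivalent to x1^2 + x2^2 = 1.
w, b = (0.0, 0.0, 1.0), 1.0
inner_labels = [decision(w, b, lift(p)) for p in inner]
outer_labels = [decision(w, b, lift(p)) for p in outer]
```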
3.5 K-nearest neighbor (k-NN)
K-NN is a simple classification algorithm based on votes from the neighbors [80].
Suppose we have a data point A for which we do not know whether it is stable or unstable. We find the k nearest neighbors of A for some k. Suppose $k = 5$, and 3 of the neighbors are stable while 2 are unstable. Since 3 > 2, k-NN classifies A as stable.
It is easy to add a weight function $w(r)$ for each neighbor. For example, let the k neighbors of A be $p_1, \dots, p_k$, where the first m points are stable and the rest are unstable. Let the distance between A and $p_i$ be $r_i$ for all i. Then, define the scores
$$\mathrm{score}(\mathrm{stable}) = \sum_{i=1}^{m} w(r_i),$$
$$\mathrm{score}(\mathrm{unstable}) = \sum_{i=m+1}^{k} w(r_i).$$
If $\mathrm{score}(\mathrm{stable}) > \mathrm{score}(\mathrm{unstable})$, then A is classified as stable, and vice versa. Some common weight functions $w(r)$ are 1 (uniform), $r^{-a}$, and $e^{-ar}$, for some $a > 0$.
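The effect of the weight function can be seen in a short sketch. The function name and the data below are hypothetical and chosen so that the uniform vote and the inverse-distance vote disagree: three stable neighbors are relatively far from the query point, while two unstable neighbors are very close to it.

```python
import math


def weighted_knn(train_points, train_labels, query, k, weight=lambda r: 1.0):
    """Classify query by summing weight(r_i) over its k nearest neighbors."""
    ranked = sorted(
        (math.dist(query, p), label) for p, label in zip(train_points, train_labels)
    )
    scores = {}
    for r, label in ranked[:k]:
        scores[label] = scores.get(label, 0.0) + weight(r)
    return max(scores, key=scores.get), scores


# Hypothetical training data around the query point (0, 0).
train_points = [
    (2.0, 0.0), (0.0, 2.0), (-2.0, 0.0),   # stable, at distance 2
    (0.5, 0.0), (0.0, 0.5),                # unstable, at distance 0.5
    (5.0, 5.0), (6.0, 6.0),                # too far away to be among the 5 nearest
]
train_labels = ["stable"] * 3 + ["unstable"] * 2 + ["stable", "unstable"]
query = (0.0, 0.0)

# Uniform weights: a plain majority vote, 3 stable against 2 unstable.
uniform_label, uniform_scores = weighted_knn(train_points, train_labels, query, k=5)

# Inverse-distance weights w(r) = 1 / r; assumes no neighbor is at distance 0.
inverse_label, inverse_scores = weighted_knn(
    train_points, train_labels, query, k=5, weight=lambda r: 1.0 / r
)
```

With uniform weights the query is classified as stable (3 against 2), whereas with $w(r) = r^{-1}$ the two nearby unstable neighbors outweigh the three distant stable ones (4 against 1.5).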
3.6 Artificial neural network (ANN)
Using an ANN for chatter recognition has the advantage that a detailed model is not required, and reasonable accuracy may still be achieved without manually selecting an optimal feature as input. For this reason, we will use the entire frequency spectrum as the input of the ANN. The vibration signal will be converted into the frequency domain using FFT, and the ANN should be able to distinguish stable data from unstable data since chatter is more easily recognizable in the frequency domain. The architecture and training parameters will be manually tuned to obtain good accuracy.
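As an illustration of how a vibration signal becomes an ANN input vector, the sketch below computes a one-sided magnitude spectrum. It uses a direct O(N^2) discrete Fourier transform in place of an FFT so that it depends only on the standard library; both produce the same values. The function name, the signal, and the amplitude normalization are assumptions made for illustration, not the preprocessing used in this work.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """One-sided magnitude spectrum via a direct DFT, scaled by the signal length."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        spectrum.append(abs(total) / n)
    return spectrum


# Hypothetical signal: 64 samples of a sinusoid with exactly 5 cycles in the window.
n_samples = 64
signal = [math.sin(2 * math.pi * 5 * t / n_samples) for t in range(n_samples)]

features = magnitude_spectrum(signal)  # this vector would be the ANN input
peak_bin = features.index(max(features))
```

The resulting vector has one entry per frequency bin, so a dominant vibration frequency appears as a single large entry (bin 5 in this example) that a network can learn to associate with a class.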