TMsDP: two-stage density peak clustering based on multi-strategy optimization

Jie Ma (School of Business and Management, Jilin University, Changchun, China) (Information Resource Research Center, Jilin University, Changchun, China)
Zhiyuan Hao (School of Business and Management, Jilin University, Changchun, China)
Mo Hu (Department of Network and New Media, School of Journalism and Communication, Nanjing Normal University, Nanjing, China)

Data Technologies and Applications

ISSN: 2514-9288

Article publication date: 10 August 2022

Abstract

Purpose

The density peak clustering algorithm (DP) is proposed to identify cluster centers by two parameters, i.e. ρ value (local density) and δ value (the distance between a point and another point with a higher ρ value). According to the center-identifying principle of the DP, the potential cluster centers should have a higher ρ value and a higher δ value than other points. However, this principle may limit the DP from identifying some categories with multi-centers or the centers in lower-density regions. In addition, the improper assignment strategy of the DP could cause a wrong assignment result for the non-center points. This paper aims to address the aforementioned issues and improve the clustering performance of the DP.

Design/methodology/approach

First, to identify as many potential cluster centers as possible, the authors construct a point-domain by introducing the pinhole imaging strategy to extend the searching range of the potential cluster centers. Second, they design different novel calculation methods for calculating the domain distance, point-domain density and domain similarity. Third, they adopt domain similarity to achieve the domain merging process and optimize the final clustering results.

Findings

The experimental results on analyzing 12 synthetic data sets and 12 real-world data sets show that two-stage density peak clustering based on multi-strategy optimization (TMsDP) outperforms the DP and other state-of-the-art algorithms.

Originality/value

The authors propose a novel DP-based clustering method, i.e. TMsDP, and transform the relationship between points into that between domains to ultimately further optimize the clustering performance of the DP.

Citation

Ma, J., Hao, Z. and Hu, M. (2022), "TMsDP: two-stage density peak clustering based on multi-strategy optimization", Data Technologies and Applications, Vol. ahead-of-print No. ahead-of-print, pp. 1-27. https://doi.org/10.1108/DTA-08-2021-0222

Publisher

Emerald Publishing Limited

Copyright © 2022, Ma Jie, Hao Zhiyuan and Hu Mo

License

Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial & non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode


1. Introduction

As a powerful machine learning method in the data mining field, clustering has broad research prospects for effectively identifying the internal structure of data samples, such as mining spatiotemporal co-location events in trajectory data sets (Ansari et al., 2021), conducting customer segmentation (Li et al., 2021) and detecting CT scan images (Singh and Bose, 2021). As an important branch of clustering, density-based clustering has attracted the attention of a large number of researchers. Various density-based clustering methods have been proposed and widely applied in different fields, such as fault recognition in wind turbines with a density-based clustering algorithm (Luo et al., 2021) and risk assessment on railway investment with an improved density-based approach (Guo et al., 2021). In 2014, the density peak clustering algorithm (DP) was proposed in Science (Rodriguez and Laio, 2014). Since its establishment, the DP has been studied and applied by a large number of investigators in various fields, such as text clustering (Jo, 2020), medical analysis (Medeghri and Sabeur, 2021) and image recognition (Wang et al., 2019, 2020; He et al., 2021). The original DP relies on three significant parameters, i.e. the dc value (cutoff distance), the ρ value (local density) and the δ value (the distance between a point and the nearest point with a higher ρ value), and on an important principle: the cluster centers should have a higher ρ value and a higher δ value than other points (Abbas et al., 2021; Wang et al., 2021). Although the DP has better clustering performance than other traditional density-based clustering algorithms, it still contains a critical limitation: a higher ρ value and a higher δ value do not always accurately indicate whether a point is a cluster center.

To give a concrete example, two different situations are discussed in this paper; Figures 1 and 2 show situation 1 and situation 2, respectively. For situation 1, Figure 1(a) clearly shows that the data set flame has two natural categories, and the two potential cluster centers both have a higher ρ value and a higher δ value than the other points. Figure 1(b) confirms that the DP indeed obtains a clustering result close to the natural categories. Taken together, Figure 1 seems to demonstrate that the aforementioned principle relating the ρ value, the δ value and the cluster centers is reasonable. However, situation 2 illustrates that the principle does not always hold. As shown in Figure 2, the DP obtains only inferior clustering results when analyzing the data sets D1 and compound, which is inconsistent with the principle mentioned above.

Obviously, the DP identifies only two potential cluster centers for the data set D1 (which has three natural categories), and it identifies six wrong clusters for the data set compound (which has six natural categories). The difference between situation 1 and situation 2 reflects the following deficiencies of the DP: (1) the DP cannot detect accurate density peak points when analyzing data samples with multiple or variable densities; (2) the DP struggles to accurately identify data samples whose categories have more than one cluster center and (3) the drawbacks of the original density calculation method and the improper assignment of non-central points ultimately degrade the overall clustering performance.

To address the aforementioned issues, the authors develop an enhanced DP-based clustering method, i.e. two-stage density peak clustering based on multi-strategy optimization (TMsDP), to further optimize the clustering performance of the DP. The main contributions and innovations of the TMsDP are as follows:

  1. Point-domain is constructed by introducing the pinhole imaging strategy to confirm the search scope of potential centers. The point-domain improves the clustering efficiency by transforming the relationship between points into that between domains.

  2. Point-domain density is determined to measure the distribution of points in a point-domain, while the domain distance is calculated by introducing the Hausdorff distance to improve the clustering accuracy.

  3. Domain similarity is proposed to achieve the domain merging process. In a data space, the higher the domain similarity between two point-domains, the more likely they are to merge with each other.

The details of TMsDP are discussed in this study. Specifically, Section 2 presents a brief introduction of the DP, Section 3 describes the specific technical details of the proposed TMsDP, Section 4 analyzes the experimental results with different data sets to verify the clustering performance of the TMsDP and Section 5 summarizes this study by discussing the results and future areas for potential investigations.

2. Density peak clustering

2.1 Preparation

In the original DP (Rodriguez and Laio, 2014), dc is set as the manual parameter, which denotes the appropriate position in an ascending distance sequence, and the definition processes are shown as follows (assuming Sample = {s1, s2, s3, …, sn}):

(1) position = round(N × per cent/100),
(2) dis_order = sort(dis(si, sj)),
(3) dc = dis_order(position),

where N indicates the number of pairwise distances, per cent is the manually input percentage and dis(si, sj) represents the distance between the point si and the point sj. Rodriguez and Laio (2014) define ρi as the number of points in a circle with the point si as the center and the dc value as the radius, and the process is shown as follows:

(4) ρi = Σ_{sj ∈ Sample, sj ≠ si} χ(dis(si, sj) − dc),

where the function χ(o) is equal to 1 or 0. If the variable o is greater than 0, χ(o) is equal to 0. Otherwise, χ(o) is equal to 1. In addition, the calculation process of the δ value is shown as follows:

(5) δsi = min_{sj ∈ Sample, ρsi < ρsj} dis(si, sj).
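For concreteness, the quantities above can be sketched in a few lines of code (Python is used here for illustration only, as the paper's experiments were run in MATLAB, and all function and variable names are our own):

```python
import numpy as np

def dp_preliminaries(points, percent=2.0):
    """Sketch of the DP quantities: d_c, the rho values and the delta values."""
    n = len(points)
    # Pairwise Euclidean distances between all points.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Pick d_c at the chosen position of the ascending distance list.
    ordered = np.sort(dist[np.triu_indices(n, k=1)])
    position = int(round(len(ordered) * percent / 100))
    dc = ordered[max(position - 1, 0)]
    # Cut-off density: chi(o) = 1 when o < 0 (the point itself is excluded).
    rho = (dist < dc).sum(axis=1) - 1
    # Delta: distance to the nearest point of strictly higher density; the
    # densest point receives the maximum distance, as in the original DP.
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = dist[i].max() if higher.size == 0 else dist[i, higher].min()
    return dc, rho, delta
```

Points with a high ρ value that also receive the maximum-distance δ value are the candidate cluster centers under the DP principle.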

2.2 Related work

Based on the aforementioned contents, it is clear that the δ value and the ρ value are limited by the threshold parameter, i.e. the dc value, and utilizing different dc values can even produce completely different clustering results when analyzing the same data set (Hou et al., 2020; Lu et al., 2020; Jangra and Toshniwal, 2020; Flores and Garza, 2020; Zhu et al., 2020). To address the threshold parameter selection issue, Xu et al. (2020) proposed a robust DP with density-sensitive similarity to find accurate cluster centers automatically and reduce the effect of the dc value selection on clustering results. D'Errico et al. (2021) provided a feasible approach for solving the classification problem of data with different shapes and distributions in order to avoid the drawback of the dc value. Ding et al. (2018) developed an automatic DP based on a generalized extreme value distribution. At the same time, the assignment strategy of non-cluster center points often affects the final clustering results. To address the assignment issues, Jiang et al. (2019) introduced logistic distribution theory and K-nearest neighbor (kNN) theory into the DP. Xu et al. (2021) designed a novel sparse search strategy to measure the similarity between the nearest neighbors of each point. Yu et al. (2021) proposed a three-way density peak clustering method based on evidence theory. Seyedi et al. (2019) utilized graph-based label propagation to assign labels to remaining points and proposed the dynamic graph-based label propagation for density peak clustering. Apart from the dc value selection issue and the non-center point assignment issue, it is challenging for the DP to identify potential centers in low-density regions and to analyze data with varying density distributions. To solve these issues, Yan et al. (2021) proposed a rotation-DPeak algorithm for imbalanced data and data with sparse regions. Liu et al. (2018) presented three novel definitions, i.e. shared nearest neighbor (SNN) similarity, local density ρ and the distance from the nearest larger density point δ, and proposed an SNN-based clustering by fast search and find of density peaks algorithm. Du et al. (2019) provided a new option based on the sensitivity of the local density, redefined the δ value and redesigned the assignation strategy based on a new density-adaptive metric, while Chen and Yu (2021) proposed a domain-adaptive density clustering algorithm, which consists of three steps: domain-adaptive density measurement, cluster center self-identification and cluster self-ensemble. In addition, the DP cannot effectively identify noise data and outliers, and it has high computational complexity when solving large-scale data. To avoid these drawbacks and accelerate the DP, Parmar et al. (2019) proposed a residual error-based DP to better identify overlapping clusters. Wang et al. (2020) combined the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm and proposed a systematic density-based clustering method using anchor points. Chen et al. (2020) replaced density with kNN density and proposed a fast DP, i.e. FastDPeak. Fan et al. (2019) proposed a fast algorithm that accelerates the density computation about 50 times over the original one.

To optimize the performance of the original DP, the authors delineate a novel DP-based clustering method in this paper. The novel method rests on four significant strategies, i.e. the point-domain, the point-domain density, the domain distance and the domain similarity. The framework of the TMsDP is shown in Figure 3.

3. The proposed clustering method

3.1 Point-domain strategy based on pinhole imaging theory

In order to explore the potential cluster centers in low-density regions, the proposed TMsDP constructs the point-domain by introducing the pinhole imaging theory. Pinhole imaging is a physical phenomenon in which light from a source passes through a pinhole and forms an inverted image on a screen (Long et al., 2021). Inspired by the related literature (Long et al., 2021; Lu et al., 2018), this paper introduces the pinhole imaging theory into the search strategy for potential cluster centers, which helps the TMsDP expand the range of center exploration. Assume that the point Si(xSi, ySi) is a potential cluster center in Sample; other potential cluster centers, such as the point Sk(xSk, ySk) and the point Sj(xSj, ySj), may also exist in the same point-domain. To construct a point-domain for the point Si, the preliminary definitions illustrated in Figure 4 are required (this paper mainly utilizes two-dimensional (2D) data as examples to explain the following preliminary definitions).

Definition 1.

(upper bound for searching of the first dimension data). As a rule of thumb, if the point Si(xSi, ySi) is a cluster center, the ρ value of other potential cluster centers should be close to ρSi. Therefore, this study determines a searching range to explore these potential cluster centers. The first exploration concept is the search upper set (SUS): SUSx is a point-set whose members have a higher first dimension value and a higher ρ value than the point Si and are nearly the closest points to Si. Based on the SUS, the calculation processes of the upper bound for searching of the first dimension data are shown as follows:

(6) SUSx = {Sn ∈ Sample | ρSn > ρSi, xSn > xSi, nearestneighbor(Sn, Si)},
(7) upper bound for searching = {Sn ∈ SUSx | xSn − xSi = τ·dc}.

Definition 2.

(lower bound for searching of the first dimension data). To maximize the odds of finding more potential cluster centers, this study should consider a situation where some potential centers may exist in a region with a slightly lower ρ value than the point Si. Therefore, the second exploration concept is the search lower set (SLS); the SLS is also a point-set where the points have a lower first dimension data value than the point Si, and the ρ values of these points are much closer to ρsi. Based on the abovementioned contents, the calculation processes of lower bound for searching of the first dimension data are shown as follows:

(8) SLSx = {Sm ∈ Sample | ρSm < ρSi, xSm < xSi, nearestneighbor(Sm, Si)},
(9) lower bound for searching = {Sm ∈ SLSx | max_{Sm ≠ Si}(xSm)}.

Definition 3.

(basis point in the first dimension). In this paper, the basis point in the first dimension denotes a middle value between upper bound for searching of the first dimension data and lower bound for searching of the first dimension data. The definition is shown as follows:

(10) basispointx = (xSm + xSn)/2.

Definition 4.

(upper bound for searching of the second dimension data). The process of upper bound for searching of the second dimension data is similar to Definition 1, and the definitions are shown as follows:

(11) SUSy = {Sa ∈ Sample | ρSa > ρSi, ySa > ySi, nearestneighbor(Sa, Si)},
(12) upper bound for searching = {Sa ∈ SUSy | ySa − ySi = ω·dc}.

Definition 5.

(lower bound for searching of the second dimension data). The process of lower bound for searching of the second dimension data is similar to Definition 2, and the definitions are shown as follows:

(13) SLSy = {Sb ∈ Sample | ρSb < ρSi, ySb < ySi, nearestneighbor(Sb, Si)},
(14) lower bound for searching = {Sb ∈ SLSy | max_{Sb ≠ Si}(ySb)}.

Definition 6.

(basis point in the second dimension). The definition of the basis point in the second dimension is similar to Definition 3, and it is shown as follows:

(15) basispointy = (ySa + ySb)/2.
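Definitions 1-6 leave the "nearest neighbor" condition informal, so the following sketch (a hypothetical helper, not the authors' code) approximates the bound selection along the first dimension: the upper bound is taken as the higher-density candidate whose x-offset is closest to τ·dc, and the lower bound as the lower-density candidate with the largest x value; the second dimension would be handled analogously with ω.

```python
import numpy as np

def search_bounds_x(points, rho, i, tau, dc):
    """Sketch of the upper/lower search bounds of point i along dimension x.
    Returns the indices of the two bound points (or None when no candidate exists)."""
    x = points[:, 0]
    # Candidates with a higher density and a larger x value than point i.
    sus = np.where((rho > rho[i]) & (x > x[i]))[0]
    # Upper bound: the candidate whose x-offset best matches tau * dc.
    upper = sus[np.argmin(np.abs((x[sus] - x[i]) - tau * dc))] if sus.size else None
    # Candidates with a lower density and a smaller x value than point i.
    sls = np.where((rho < rho[i]) & (x < x[i]))[0]
    # Lower bound: the candidate with the maximum x value (nearest from below).
    lower = sls[np.argmax(x[sls])] if sls.size else None
    return upper, lower
```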

Figure 4 shows an example of the point-domain of the point Si. In this point-domain, the authors denote the x-axis and y-axis values of the receiving screen by xr and yr, respectively. Based on the triangular similarity theory, the relationships between the four searching bounds and the two basis points are shown as follows:
(16) [(xSn − xSm)/2 + xSm − xSi] / [xr − ((xSn − xSm)/2 + xSm)] = ySi/yr = ψ,

(17) [(ySa − ySb)/2 + ySb − ySi] / [yr − ((ySa − ySb)/2 + ySb)] = xSi/xr = ξ,

where the control thresholds ψ and ξ can be set manually for different clustering demands. According to formulas (16) and (17), the side values of the point-domain can be obtained as follows:

(18) side = {side(sidex × sidey) | sidex = ((2 + 2ψ)/ψ)·((xSm + xSn)/2 − xSi), sidey = ((2 + 2ξ)/ξ)·((ySa + ySb)/2 − ySi)}.
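Under these definitions, the side values reduce to a one-line computation once the two basis points are known (a minimal sketch; the argument names are our own):

```python
def domain_side(psi, xi, basis_x, basis_y, x_si, y_si):
    """Side lengths of the point-domain of S_i from the control thresholds
    psi / xi and the two basis points of Definitions 3 and 6."""
    side_x = (2 + 2 * psi) / psi * (basis_x - x_si)
    side_y = (2 + 2 * xi) / xi * (basis_y - y_si)
    return side_x, side_y
```

Larger thresholds shrink the factor (2 + 2ψ)/ψ toward 2, so ψ and ξ directly control how far the point-domain extends beyond the basis points.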

3.2 Domain merging strategy based on point-domain similarity

Although the TMsDP transforms the relationships between points into that between point-domains, it is still a density-based clustering method. Therefore, how to perform the density analysis on point-domains is a highlight in this section. This paper defines the point-domain density as follows:

Definition 7.

(point-domain density). In this paper, the point-domain density denotes the number of points per unit area of a point-domain (the definition emphasizes the distribution of data points, which has statistical significance). According to the aforementioned contents, the authors assume a set D = {D1, D2, D3, …, Dn}, where n indicates the number of point-domains and D indicates a domain-set which includes all point-domains. The calculation process of the point-domain density is shown as follows (the function amount(θ) returns the number of data points in a point-domain):

(19) PDρi = amount(Di) / [((2 + 2ψ)/ψ)·((xSm + xSn)/2 − xSi) × ((2 + 2ξ)/ξ)·((ySa + ySb)/2 − ySi)].
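The point-domain density simply divides a point count by the domain area; the sketch below additionally counts which points fall inside a rectangular domain, assuming for illustration that the domain is centred on Si (the actual placement follows from the searching bounds):

```python
import numpy as np

def point_domain_density(points, center, side_x, side_y):
    """Number of points inside the rectangular point-domain divided by its area."""
    inside = (np.abs(points[:, 0] - center[0]) <= side_x / 2) & \
             (np.abs(points[:, 1] - center[1]) <= side_y / 2)
    return inside.sum() / (side_x * side_y)
```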

The point-domain density shows the inner characteristic of a point-domain; moreover, the authors also consider the outer characteristics between point-domains. Therefore, this paper constructs a novel distance definition, i.e. the domain distance.

Definition 8.

(domain distance). Inspired by the literature (Vavpetic and Zagar, 2021; Ryu and Kamata, 2021; Nie et al., 2021), the authors adopt the Hausdorff distance to calculate the domain distance between point-domains. Assume that a point-domain D1 = {d11, d12, d13, …, d1i} and the other point-domain D2 = {d21, d22, d23, …, d2j}, where d1i and d2j denote the two different points and i and j denote the serial number of data points in D1 and D2, respectively. The calculation process of the domain distance is shown as follows:

(20) domaindis(D1, D2) = max_{d1i ∈ D1} min_{d2j ∈ D2} dis(d1i, d2j).
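As written, the domain distance is the directed (one-sided) Hausdorff distance, which can be sketched as:

```python
import numpy as np

def domain_distance(D1, D2):
    """Max over the points of D1 of the distance to the nearest point of D2
    (the directed Hausdorff distance)."""
    diff = D1[:, None, :] - D2[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return dist.min(axis=1).max()
```

For larger point-domains, `scipy.spatial.distance.directed_hausdorff` computes the same quantity with an early-break optimization.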

For calculating the domain distance, the authors still need to consider two additional situations: (1) is there an intersection between the two point-domains? and (2) are the points in these two point-domains uniformly distributed? The authors take the data set spiral as an example to describe these two situations, and the results are shown in Figures 5 and 6.

As shown in Figure 5(a), point-domain 1 and point-domain 2 have no intersection part, which means that the point-domain similarity could take the domain distance as the only calculation criterion. But in Figure 5(b), point-domain 1 and point-domain 3 have an intersection part and there is also an intersection part between point-domain 2 and point-domain 3. When there exist intersection parts between point-domains, the calculation of point-domain similarity needs to take into account the intersection part, and it is shown as follows:

(21) Intersection = {amount(Di ∩ Dj) | Di, Dj ∈ D}.

As shown in Figure 6(a), point-domain 1 has some independent sparse points in the red circle region, and these sparse points clearly deviate from the overall distribution trend of the points in the point-domain. Therefore, the calculation of the point-domain similarity should be performed on the points following the overall distribution trend rather than on the sparse points. In Figure 6(b), point-domain 2 has no sparse points and the overall distribution trend of its points is relatively stable. Inspired by the literature (Yarinezhad and Hashemi, 2019), the authors propose a strategy to identify the sparse points. For manifold data sets, the sparse points can be identified easily by inspection; however, for other types of data sets, the sparse points cannot be judged visibly. Assume that a point-domain is divided into two regions with equal areas whose density values are ρ1 and ρ2, respectively, with ρ1 greater than ρ2, and that the density value of the whole point-domain is ρ. The difference between ρ1 and ρ2 is then compared with the value 0.8ρ: if ρ1 − ρ2 is greater than 0.8ρ, the points in the region with the smaller density are identified as sparse points.
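The sparse-point rule can be sketched as follows; splitting the domain into two halves along the x axis is an assumption made for illustration, since the paper does not fix the splitting direction:

```python
import numpy as np

def sparse_points(points, x0, side_x, side_y):
    """Flag the points of the sparser half of the domain [x0, x0+side_x]
    when the density gap rho1 - rho2 exceeds 0.8 * the whole-domain density."""
    half_area = (side_x / 2) * side_y
    left = points[:, 0] < x0 + side_x / 2
    n_left, n_right = left.sum(), (~left).sum()
    rho1 = max(n_left, n_right) / half_area   # denser half
    rho2 = min(n_left, n_right) / half_area   # sparser half
    rho = len(points) / (side_x * side_y)     # whole-domain density
    if rho1 - rho2 > 0.8 * rho:
        return points[left] if n_left < n_right else points[~left]
    return np.empty((0, 2))                   # no sparse points detected
```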

According to the rule of thumb, if there are more intersection parts between two subjects and these two subjects are much nearer, the two subjects are more likely to merge into one. Therefore, the TMsDP adopts the domain distance and the intersection part between two point-domains to calculate the domain similarity. The calculation formula is shown as follows:

(22) sim(Di, Dj) =
  [amount(Di ∩ Dj) / (amount(Di) × amount(Dj))] × exp(−domaindis(Di, Dj) / (domaindis(Di, Dj) + o(θ))),   if Di ∩ Dj ≠ ∅,
  min(γ × intersection simi) × exp(−domaindis(Di, Dj) / (domaindis(Di, Dj) + o(θ))),   if Di ∩ Dj = ∅,
where intersection simi = amount(Di ∩ Dj) / (amount(Di) × amount(Dj)) is evaluated over the intersecting point-domain pairs,
where sim denotes the domain similarity, γ denotes a random parameter with values in (0, 1) and o(θ) denotes the adjustment operator, which aims to keep the value of the domain similarity in (0, 1). The larger the distance between point-domains is, the smaller the similarity is, and the distance between point-domains with an intersection must be smaller than that between point-domains without an intersection. Therefore, this paper adds the adjustment operator o(θ) and the adjustment parameter γ to ensure that the domain similarity between point-domains without an intersection is less than that between point-domains with an intersection. Figure 7 shows the merging situation of two point-domains, taking the data set 2circles as an example.
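A hedged sketch of the domain similarity is given below; the decaying exponential follows the stated requirement that a larger domain distance yields a smaller similarity, and the `floor` argument stands in for the minimum intersection similarity over intersecting pairs (γ, o(θ) and the exact form of that minimum are the authors' design choices, reconstructed here):

```python
import numpy as np

def domain_similarity(D1, D2, n_inter, o_theta=1.0, gamma=0.5, floor=0.0):
    """Combine the shared-point ratio with a decaying function of the domain
    distance; for disjoint domains, the ratio is replaced by a gamma-scaled
    floor so that disjoint pairs always score lower than intersecting ones."""
    diff = D1[:, None, :] - D2[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1).max()  # domain distance
    decay = np.exp(-d / (d + o_theta))
    if n_inter > 0:
        return n_inter / (len(D1) * len(D2)) * decay
    return gamma * floor * decay
```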

In fact, the strategies proposed in this paper increase the impact of the parameters on the clustering result. Apart from the original parameter dc, the TMsDP adds the parameters τ and ω to determine the exploration range of the potential cluster centers, the parameters ψ and ξ to determine the size of the point-domain and the value of the domain density, and the parameters o(θ) and γ to determine the domain similarity. Actually, the most significant quantity in the TMsDP is the side value of the point-domains, and the parameters mentioned above are ultimately utilized to calculate the side value. The side values of the point-domains will be shown in the following specific experimental results (in the following experiments, the authors set the side length and side width to equal values in a point-domain). The overall procedure of the TMsDP is shown in Algorithm 1 (Table I).

3.3 Time complexity analysis

For the TMsDP, the time complexity is considered from the following aspects: (1) the time complexity of constructing the point-domains is close to O(n); (2) the time complexity of calculating the domain distance is close to O(n2) and (3) the time complexity of calculating the domain similarity is close to O(n2). Thus, the time complexity of the TMsDP is close to O(n2 + n2 + n) = O(n2), which is of the same order as the original DP (whose time complexity is O(n2)).

4. Experimental results and analysis

To illustrate the performance of the proposed method, this section selects 12 synthetic data sets and 12 real-world data sets as the experiment samples [1]. The 12 synthetic data sets include 2circles, compound, twocirclesnoise2, spiral, pathbase, jain, flame, D1, D2, DS5, skewed and unbalance. The 12 real-world data sets include thyroid, breast, glass, liver, heart, seeds, zoo, wine, vote, iris, dna and msplice. The specific characteristics of these experiment data sets are shown in Table II. In addition, to further demonstrate the clustering performance of the proposed method, the TMsDP is compared with the DP (Rodriguez and Laio, 2014), density peaks clustering based on logistic distribution and gravitation (DPC-LG) (Jiang et al., 2019), DBSCAN (Ester et al., 1996), the Affinity Propagation algorithm (AP) (Frey and Dueck, 2007) and K-means (Jain, 2010). This paper takes the Rand index (RI), F-measure (FM), Jaccard index (JI) and normalized mutual information (NMI), each with a range of values from 0 to 1.0, as the evaluation criteria to measure the clustering performance.

4.1 Experimental results of synthetic data sets

These 12 synthetic data sets can actually be divided into several different types, including manifold data sets, multiple center data sets, data sets with unbalanced and skewed size and data sets with varying sizes. The authors present the experimental results of the 12 synthetic data sets in Tables III and IV and the clustering results of these data sets in Figures 8–19. In the clustering result figures, the original distribution denotes the real distribution of a data set, panel (a) shows the clustering result of the AP algorithm, panel (b) shows the clustering result of the K-means algorithm, panel (c) shows the clustering result of the DBSCAN algorithm, panel (d) shows the clustering result of the DP algorithm, panel (e) shows the clustering result of the DPC-LG algorithm and panel (f) shows the clustering result of the TMsDP algorithm.

According to the visualization of the clustering results, it can be seen that the proposed method, DBSCAN, DP and DPC-LG obtain more accurate clustering results when analyzing some manifold data sets (such as jain and spiral). However, when analyzing data sets with multiple centers (such as 2circles, compound and twocirclesnoise2) and data sets with unbalanced and skewed sizes (such as unbalance and skewed), only the proposed TMsDP obtains accurate clustering results among the six algorithms in the comparison experiments. Meanwhile, when analyzing data sets with varying sizes (such as D1 and DS5) and data sets with irregular shapes (such as flame and DS5), the TMsDP still obtains more accurate clustering results than the other five comparison algorithms. To compare the clustering performance of these six methods more precisely, Tables III and IV present the evaluation index values of the different algorithms with different parameter settings, which demonstrate that the TMsDP outperforms the other compared algorithms.

4.2 Experimental results of real-world data sets

As shown in Tables V and VI, the TMsDP could obtain larger values in almost all the four evaluation metrics than the other five comparison algorithms when analyzing 12 real-world data sets. Of course, considering the diversity of data structural characteristics, the TMsDP could not obtain the best values in all evaluation metrics when analyzing all test data sets. Nevertheless, according to the available comparison results, the better clustering performance of TMsDP could still be shown.

4.3 Robustness analysis

In this experiment, the authors select the seeds and liver with different degrees of noise to evaluate the robustness of the compared algorithms. The authors generate different amounts of random data points as noise in the value space of the original data set. The noise level of each data set gradually increases from 1.0 per cent to 10.0 per cent. The experimental results are presented in Figure 20.

As shown in Figure 20, with the increasing proportion of noise, the average FM value of each algorithm decreases. However, the average FM value of the TMsDP drops at a minimum rate, while that of AP drops at a maximum rate. Due to the small sample size of the data sets, the average FM values of TMsDP, DP, DPC-LG, DBSCAN and K-means are almost identical when the noise level rises from 1.0 per cent to 10.0 per cent. Therefore, the TMsDP retains higher accuracy in each case and illustrates higher robustness than the compared algorithms.

4.4 Running time analysis

In this section, the authors compare the running time of the TMsDP with those of DPC-LG and DP on the 24 data sets, which fall into five categories: (1) synthetic 2D data sets with a data volume of less than 1,000, (2) synthetic 2D data sets with a data volume of 1,000 or more, (3) real-world data sets with 2 to 10 dimensions and a data volume of less than 1,000, (4) real-world data sets with more than 10 dimensions and a data volume of less than 1,000 and (5) real-world data sets with more than 150 dimensions and a data volume of 2,000 or more (the average running time over 30 runs is reported for each algorithm). The overall running speed is slow when dealing with higher dimensional data sets due to the limited running environment (Intel Core i5, 2.40 GHz, 8 GB RAM and MATLAB 2014a); therefore, when running data sets with a large sample size and high dimensions, the overall running time of the three comparison algorithms is relatively long. In addition, because the TMsDP is an improved algorithm based on the DP, three DP-based algorithms (TMsDP, DPC-LG and DP) are selected for comparison. The running time results are shown in Table VII.

As shown in Table VII, the running time of the TMsDP is about twice that of the DP. According to Section 3.3, the time complexity of the TMsDP is close to O(n2 + n2 + n), which is of the same order as the DP (O(n2)); this is consistent with the observation that the actual running time of the TMsDP is no more than about twice that of the traditional DP.

4.5 Overall performance review

In this paper, 12 synthetic data sets and 12 real-world data sets are utilized as experimental samples to demonstrate the clustering performance of the proposed method.

According to the clustering results of the 12 synthetic data sets, it could be seen that the proposed TMsDP shows better clustering performance than others when facing the manifold data sets, such as the spiral and jain. Moreover, when facing the multiple center data sets, such as the 2circles, compound and twocirclesnoise2, and the data sets with an unbalanced and skewed size, such as the unbalance and skewed, the original DP is challenging to find potential centers in low-density regions, while the TMsDP method could adopt point-domains to explore more potential cluster centers for achieving better clustering results. In addition, when facing the irregularly shaped data sets, such as DS5, flame and spiral, and the data sets with varying sizes, such as D1 and DS5, the TMsDP could still obtain more accurate clustering results than other compared algorithms.

According to the clustering results of 12 real-world data sets, it is clearly shown that the TMsDP method could obtain better values in almost all evaluation metrics than other mentioned algorithms. In summary, the TMsDP improves the clustering performance compared with the original DP and expands the theoretical prospects of the density-based algorithms.

5. Conclusion

To address the deficiencies of the DP (i.e. failing to identify cluster centers in low-density regions and struggling to analyze categories with multiple centers), this paper proposes the TMsDP. The TMsDP makes three significant contributions: (1) constructing point-domains by introducing the pinhole imaging strategy to expand the search range for potential cluster centers; (2) proposing novel methods to calculate point-domain density, domain distance and domain similarity and (3) completing the clustering process based on domain similarity. The experimental results on 12 synthetic data sets and 12 real-world data sets show that the TMsDP achieves significantly better clustering performance than the original DP and the other algorithms compared in this paper.
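As a schematic of the first contribution: the pinhole-imaging learning strategy cited by the paper (Long et al., 2021) typically reflects a candidate solution through the midpoint of the search bounds, so each density peak yields a mirrored candidate elsewhere in the space. The sketch below only illustrates that idea; the function names and the radius-based point-domain definition are hypothetical simplifications, not the authors' exact formulas.

```python
import math

def pinhole_mirror(point, lower, upper):
    # Pinhole-imaging opposition: reflect a candidate through the
    # midpoint of the search range, dimension by dimension
    # (illustrative; the paper's exact construction may differ).
    return tuple(lo + hi - x for x, lo, hi in zip(point, lower, upper))

def point_domain(center, points, radius):
    # A "point-domain" here is simply the set of points within
    # `radius` of a candidate center (hypothetical definition).
    return [p for p in points if math.dist(center, p) <= radius]

def candidate_centers(peaks, lower, upper):
    # Expanded search range: keep each density peak together with
    # its pinhole-imaging mirror as a potential cluster center.
    out = []
    for p in peaks:
        out.append(tuple(p))
        out.append(pinhole_mirror(p, lower, upper))
    return out
```

The point of the construction is that mirrored candidates can land in low-density regions that the original ρ-δ criterion would never select, after which domain merging discards the spurious ones.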

Although the proposed method shows better clustering performance, it introduces six additional parameters, which could strongly affect the clustering results. Therefore, the authors divide the future research plan into two aspects. On the theoretical side, the first task is to explore an improved method of calculating side values, reducing the number of parameters while preserving the clustering performance of the TMsDP; the second is to update the calculation strategies for point-domain similarity and domain distance to accelerate the algorithm and the third is to design a novel search mechanism and structure that automatically explores potential cluster centers. On the application side, the authors plan to extend the application fields of the TMsDP. Data arising from real-world problems often differ in structure from the experimental data sets above: most have diverse characteristics, including multiple cluster centers, centers in low-density regions, unbalanced density distributions and unbalanced sample size distributions. For example, text data, consumer consumption data, stock data, financial data and image data all have complex features. This study could therefore apply the TMsDP to related real-world problems, such as topic identification in online public opinion (mainly text clustering), customer segmentation for enterprises (mainly clustering analysis of consumer consumption data) and facial image segmentation and CT scan detection (mainly image recognition). In addition, the TMsDP could be combined with swarm intelligence optimization algorithms to solve real-world optimization problems.

Figures

Figure 1. The selection of cluster center points (they are the data points in the oblong) and the clustering result of flame

Figure 2. The selection of cluster center points (they are the data points in the oblong) and the clustering result of D1 and compound

Figure 3. The framework of the TMsDP algorithm

Figure 4. The whole process of constructing point-domains by utilizing the pinhole imaging strategy

Figure 5. The case of intersection and non-intersection between point-domains

Figure 6. The case of a point-domain with sparse point distribution and without sparse points

Figure 7. The case of utilizing domain similarity to merge any two different point-domains

Figure 8. The clustering result for data set 2circles

Figure 9. The clustering result for data set spiral

Figure 10. The clustering result for data set twocirclesnoise2

Figure 11. The clustering result for data set jain

Figure 12. The clustering result for data set compound

Figure 13. The clustering result for data set pathbase

Figure 14. The clustering result for data set D1

Figure 15. The clustering result for data set D2

Figure 16. The clustering result for data set DS5

Figure 17. The clustering result for data set flame

Figure 18. The clustering result for data set skewed

Figure 19. The clustering result for data set unbalance

Figure 20. Comparison of algorithm robustness

The core process (in part) of the proposed TMsDP

The basic attributes of the experimental data sets

| Data set type | Order | Data set name | Dimension | Data volume | Real cluster number |
| --- | --- | --- | --- | --- | --- |
| Synthetic | 1 | 2circles | 2 | 600 | 2 |
| Synthetic | 2 | compound | 2 | 399 | 6 |
| Synthetic | 3 | twocirclesnoise2 | 2 | 610 | 3 |
| Synthetic | 4 | spiral | 2 | 312 | 3 |
| Synthetic | 5 | pathbase | 2 | 300 | 3 |
| Synthetic | 6 | jain | 2 | 373 | 2 |
| Synthetic | 7 | flame | 2 | 240 | 2 |
| Synthetic | 8 | D1 | 2 | 87 | 3 |
| Synthetic | 9 | D2 | 2 | 85 | 4 |
| Synthetic | 10 | DS5 | 2 | 500 | 5 |
| Synthetic | 11 | skewed | 2 | 1,000 | 6 |
| Synthetic | 12 | unbalance | 2 | 6,500 | 8 |
| Real world | 13 | thyroid | 6 | 215 | 3 |
| Real world | 14 | breast | 9 | 277 | 2 |
| Real world | 15 | glass | 9 | 214 | 6 |
| Real world | 16 | liver | 6 | 345 | 2 |
| Real world | 17 | heart | 13 | 303 | 2 |
| Real world | 18 | seeds | 7 | 210 | 3 |
| Real world | 19 | zoo | 16 | 101 | 7 |
| Real world | 20 | wine | 13 | 178 | 3 |
| Real world | 21 | vote | 16 | 435 | 2 |
| Real world | 22 | iris | 4 | 150 | 3 |
| Real world | 23 | dna | 180 | 2,000 | 3 |
| Real world | 24 | msplice | 240 | 3,175 | 3 |

The performance benchmark of synthetic data sets

| Data set | Method | Parameter value | FM | JI | RI | NMI |
| --- | --- | --- | --- | --- | --- | --- |
| 2circles | AP | 31/0.9 | 0.3559 | 0.2027 | 0.4992 | – |
| | K-means | 2 | 0.4983 | 0.3318 | 0.4992 | 0 |
| | DBSCAN | 3/3 | 1 | 1 | 1 | 1 |
| | DP | 3.1702 | 0.5028 | 0.3358 | 0.5026 | 0.0050 |
| | DPC-LG | 5.6133 | 0.4991 | 0.3325 | 0.4998 | 0.0010 |
| | TMsDP | 0.0859/side = 0.0017 | 1 | 1 | 1 | 1 |
| spiral | AP | 30/0.9 | 0.3279 | 0.1961 | 0.5538 | 0.000369 |
| | K-means | 3 | 0.3274 | 0.1957 | 0.5541 | 0.000351 |
| | DBSCAN | 2.5/2 | 1 | 1 | 1 | 1 |
| | DP | 2.5812 | 1 | 1 | 1 | 1 |
| | DPC-LG | 2.5812 | 1 | 1 | 1 | 1 |
| | TMsDP | 1.7443/side = 0.0349 | 1 | 1 | 1 | 1 |
| twocirclesnoise2 | AP | 30/0.9 | 0.3649 | 0.2029 | 0.5033 | – |
| | K-means | 3 | 0.4026 | 0.2461 | 0.5019 | 0.0004 |
| | DBSCAN | 1.8/5 | 0.9967 | 0.9934 | 0.9967 | 0.9850 |
| | DP | 1.3646 | 0.4833 | 0.3185 | 0.5041 | 0.0211 |
| | DPC-LG | 0.5302 | 0.4947 | 0.3286 | 0.5066 | 0.0254 |
| | TMsDP | 0.0872/side = 0.0017 | 0.9918 | 0.9837 | 0.9918 | 0.9901 |
| jain | AP | 43/0.7 | 0.5847 | 0.3900 | 0.5793 | – |
| | K-means | 2 | 0.6977 | 0.5315 | 0.6591 | 0.3672 |
| | DBSCAN | 2.9/20 | 1 | 1 | 1 | 1 |
| | DP | 11.812 | 1 | 1 | 1 | 1 |
| | DPC-LG | 6.027 | 1 | 1 | 1 | 1 |
| | TMsDP | 1.3537/side = 0.0271 | 1 | 1 | 1 | 1 |
| pathbase | AP | 71/0.7 | 0.6321 | 0.4483 | 0.6822 | 0.2804 |
| | K-means | 3 | 0.6617 | 0.4908 | 0.7476 | 0.5470 |
| | DBSCAN | 2/5 | 0.7518 | 0.5727 | 0.7594 | 0.6965 |
| | DP | 1.1011 | 0.6654 | 0.4950 | 0.7509 | 0.5530 |
| | DPC-LG | 1.8974 | 0.6473 | 0.4693 | 0.7134 | 0.5039 |
| | TMsDP | 2.5500/side = 0.4878 | 0.9739 | 0.9491 | 0.9826 | 0.9363 |
| compound | AP | 16/0.4 | 0.6838 | 0.5193 | 0.8471 | 0.7469 |
| | K-means | 6 | 0.6422 | 0.4650 | 0.8432 | 0.7202 |
| | DBSCAN | 1/5 | 0.9103 | 0.8335 | 0.9528 | 0.8708 |
| | DP | 0.8732 | 0.7223 | 0.5648 | 0.8670 | 0.8363 |
| | DPC-LG | 0.8732 | 0.6437 | 0.4731 | 0.8340 | 0.7665 |
| | TMsDP | 0.4472/side = 0.0447 | 0.8703 | 0.7584 | 0.9216 | 0.8653 |

The performance benchmark of synthetic data sets (continued)

| Data set | Method | Parameter value | FM | JI | RI | NMI |
| --- | --- | --- | --- | --- | --- | --- |
| D1 | AP | 2.3/0.6 | 0.8766 | 0.7684 | 0.9201 | – |
| | K-means | 3 | 0.9745 | 0.9503 | 0.9824 | 0.9515 |
| | DBSCAN | 0.65/2.3 | 0.9193 | 0.8451 | 0.9465 | – |
| | DP | 0.6374 | 1 | 1 | 1 | 1 |
| | DPC-LG | 1.3231 | 1 | 1 | 1 | 1 |
| | TMsDP | 0.1934/side = 0.3868 | 1 | 1 | 1 | 1 |
| D2 | AP | 2.3/0.6 | 0.9756 | 0.9524 | 0.9882 | 0.9655 |
| | K-means | 4 | 0.9756 | 0.9524 | 0.9882 | 0.9655 |
| | DBSCAN | 2.4/18 | 0.9524 | 0.9092 | 0.9770 | 0.9427 |
| | DP | 0.263 | 0.9756 | 0.9524 | 0.9882 | 0.9655 |
| | DPC-LG | 0.4094 | 0.9756 | 0.9524 | 0.9882 | 0.9655 |
| | TMsDP | 0.3542/side = 0.7083 | 0.9756 | 0.9524 | 0.9882 | 0.9655 |
| DS5 | AP | 52/0.8 | 0.7834 | 0.6275 | 0.8896 | 0.7913 |
| | K-means | 5 | 0.8093 | 0.6792 | 0.9218 | 0.8206 |
| | DBSCAN | 0.03/6 | 0.8408 | 0.7086 | 0.9190 | 0.9018 |
| | DP | 0.041 | 0.7884 | 0.6413 | 0.9001 | 0.8663 |
| | DPC-LG | 0.0595 | 0.8205 | 0.6818 | 0.9114 | 0.8857 |
| | TMsDP | 0.0595/side = 0.1547 | 0.9099 | 0.8346 | 0.9637 | 0.9242 |
| flame | AP | 42/0.9 | 0.7473 | 0.5959 | 0.7381 | 0.4345 |
| | K-means | 2 | 0.7364 | 0.5822 | 0.7267 | 0.3989 |
| | DBSCAN | 1/6 | 0.9659 | 0.9336 | 0.9641 | 0.5312 |
| | DP | 1.1336 | 1 | 1 | 1 | 1 |
| | DPC-LG | 0.9301 | 1 | 1 | 1 | 1 |
| | TMsDP | 0.9301/side = 3.7202 | 0.9922 | 0.9845 | 0.9917 | 0.9635 |
| skewed | AP | 26.3/0.6 | 0.7082 | 0.5482 | 0.9024 | 0.7245 |
| | K-means | 6 | 0.7203 | 0.5629 | 0.9065 | 0.7422 |
| | DBSCAN | 49/6 | 0.9772 | 0.9550 | 0.9925 | 0.8755 |
| | DP | 71.5542 | 0.9901 | 0.9803 | 0.9967 | 0.9845 |
| | DPC-LG | 35.5106 | 0.9942 | 0.9884 | 0.9981 | 0.9906 |
| | TMsDP | 71.5542/side = 143.1084 | 0.9942 | 0.9884 | 0.9981 | 0.9906 |
| unbalance | AP | 33/0.8 | 0.9989 | 0.9978 | 0.9994 | 0.9943 |
| | K-means | 8 | 0.9989 | 0.9978 | 0.9994 | 0.9943 |
| | DBSCAN | 6000/6 | 0.9991 | 0.9983 | 0.9995 | 0.9603 |
| | DP | 1.1808e+3 | 0.9994 | 0.9988 | 0.9997 | 0.9956 |
| | DPC-LG | 3.8471e+3 | 0.9958 | 0.9916 | 0.9976 | 0.9844 |
| | TMsDP | 1.6846e+3/side = 3.3692e+3 | 0.9996 | 0.9992 | 0.9998 | 0.9964 |

The performance benchmark of real-world data sets

| Data set | Method | Parameter value | FM | JI | RI | NMI |
| --- | --- | --- | --- | --- | --- | --- |
| thyroid | AP | 42/0.9 | 0.5336 | 0.3636 | 0.5199 | 0.0567 |
| | K-means | 3 | 0.8211 | 0.6960 | 0.8041 | 0.1497 |
| | DBSCAN | 4/6 | 0.7989 | 0.6651 | 0.7837 | 0.4446 |
| | DP | 9.2661 | 0.7380 | 0.5471 | 0.5653 | 0.0824 |
| | DPC-LG | 7.0661 | 0.7536 | 0.5721 | 0.6099 | 0.1360 |
| | TMsDP | 2.1307/side = 4.2615 | 0.7972 | 0.6428 | 0.7151 | 0.4673 |
| breast | AP | 14.3/0.9 | 0.5449 | 0.3732 | 0.5016 | 0.0007 |
| | K-means | 2 | 0.6510 | 0.4826 | 0.5939 | 0.0829 |
| | DBSCAN | 2.3/7.5 | 0.6141 | 0.4416 | 0.5759 | 0.0657 |
| | DP | 2.8002 | 0.7647 | 0.5886 | 0.5877 | 0.0371 |
| | DPC-LG | 1.8622 | 0.7647 | 0.5856 | 0.5877 | 0.0371 |
| | TMsDP | 0.4639/side = 0.1856 | 0.7651 | 0.5887 | 0.5971 | 0.0843 |
| glass | AP | 6.1/0.7 | 0.4200 | 0.2647 | 0.7211 | 0.3593 |
| | K-means | 6 | 0.5052 | 0.3298 | 0.6764 | – |
| | DBSCAN | 1.4/2 | 0.5638 | 0.3512 | 0.5927 | 0.2905 |
| | DP | 0.3132 | 0.5542 | 0.3333 | 0.5432 | 0.3922 |
| | DPC-LG | 0.3350 | 0.5428 | 0.3091 | 0.4591 | 0.3589 |
| | TMsDP | 0.3053/side = 0.6106 | 0.5520 | 0.3363 | 0.5619 | 0.4123 |
| liver | AP | 31/0.9 | 0.6102 | 0.4285 | 0.4998 | 0.0070 |
| | K-means | 2 | 0.6407 | 0.4538 | 0.5043 | 0.0009 |
| | DBSCAN | 10/3.6 | 0.5016 | 0.3347 | 0.4981 | 0.0037 |
| | DP | 28.1603 | 0.7124 | 0.5092 | 0.5104 | 0.0136 |
| | DPC-LG | 9.4868 | 0.7124 | 0.5092 | 0.5104 | 0.0136 |
| | TMsDP | 9.4868/side = 7.4403 | 0.7073 | 0.5064 | 0.5142 | 0.0196 |
| heart | AP | 31/0.9 | 0.5995 | 0.4280 | 0.5950 | 0.1611 |
| | K-means | 2 | 0.6162 | 0.4444 | 0.5921 | 0.1461 |
| | DBSCAN | 0.8/26.9 | 0.5514 | 0.3795 | 0.5187 | 0.0385 |
| | DP | 0.4834 | 0.6251 | 0.4510 | 0.5757 | 0.1375 |
| | DPC-LG | 0.4158 | 0.5897 | 0.4181 | 0.5893 | 0.1396 |
| | TMsDP | 0.3936/side = 0.3696 | 0.5964 | 0.4247 | 0.5837 | 0.1251 |
| dna | AP | 1.5/0.6 | 0.2077 | 0.0704 | 0.6195 | 0.0002 |
| | K-means | 3 | 0.6119 | 0.4384 | 0.7149 | 0.3646 |
| | DBSCAN | 4.2/5.1 | 0.6221 | 0.3903 | 0.4015 | 0.0478 |
| | DP | 0.3978 | 0.4974 | 0.3185 | 0.4731 | 0.0371 |
| | DPC-LG | 0.3979 | 0.4974 | 0.3274 | 0.5490 | 0.0632 |
| | TMsDP | 7.1414/side = 8.5697 | 0.4974 | 0.3185 | 0.4731 | 0.0371 |

The performance benchmark of real-world data sets (continued)

| Data set | Method | Parameter value | FM | JI | RI | NMI |
| --- | --- | --- | --- | --- | --- | --- |
| seeds | AP | 21/0.9 | 0.8068 | 0.6761 | 0.8714 | 0.7101 |
| | K-means | 3 | 0.8106 | 0.6815 | 0.8744 | 0.7061 |
| | DBSCAN | 1.1/12 | 0.5701 | 0.3950 | 0.6766 | 0.2685 |
| | DP | 0.6674 | 0.8026 | 0.6702 | 0.8673 | 0.6983 |
| | DPC-LG | 0.6674 | 0.7803 | 0.6396 | 0.8530 | 0.6833 |
| | TMsDP | 8.1971/side = 2.1964 | 0.8458 | 0.7328 | 0.8977 | 0.7436 |
| zoo | AP | 2.1/0.8 | 0.5893 | 0.4064 | 0.8354 | 0.6914 |
| | K-means | 7 | 0.6588 | 0.4826 | 0.8590 | – |
| | DBSCAN | 1/3.5 | 0.7716 | 0.6192 | 0.9032 | 0.8149 |
| | DP | 3.3166 | 0.5816 | 0.4053 | 0.7736 | 0.5841 |
| | DPC-LG | 2.8284 | 0.6121 | 0.4368 | 0.7935 | 0.6415 |
| | TMsDP | 3.3166/side = 33.1662 | 0.8813 | 0.7877 | 0.9453 | 0.8421 |
| wine | AP | 25/0.6 | 0.5828 | 0.4113 | 0.7161 | 0.4376 |
| | K-means | 3 | 0.5835 | 0.4120 | 0.7187 | 0.1505 |
| | DBSCAN | 100/0.8 | 0.5783 | 0.3368 | 0.3418 | 0.0296 |
| | DP | 101.3462 | 0.5892 | 0.3985 | 0.6102 | 0.3982 |
| | DPC-LG | 367.0219 | 0.6461 | 0.4496 | 0.6435 | 0.4624 |
| | TMsDP | 4.7849/side = 9.5699 | 0.6192 | 0.4247 | 0.6262 | 0.4158 |
| vote | AP | 22.9/0.7 | 0.7681 | 0.6232 | 0.7616 | 0.4380 |
| | K-means | 2 | 0.7742 | 0.6312 | 0.7684 | 0.4694 |
| | DBSCAN | 1/30 | 0.7086 | 0.5323 | 0.6011 | 0.1813 |
| | DP | 2.6458 | 0.7807 | 0.6400 | 0.7752 | 0.4900 |
| | DPC-LG | 2.4495 | 0.7237 | 0.5245 | 0.5259 | 0.0210 |
| | TMsDP | 2.6458/side = 0.6667 | 0.7483 | 0.5976 | 0.7420 | 0.4180 |
| iris | AP | 11.9/0.7 | 0.8208 | 0.6959 | 0.8797 | 0.7582 |
| | K-means | 3 | 0.8208 | 0.6959 | 0.8797 | 0.8688 |
| | DBSCAN | 1/30 | 0.7490 | 0.5779 | 0.7777 | 0.6952 |
| | DP | 0.8832 | 0.7635 | 0.5891 | 0.7766 | 0.7355 |
| | DPC-LG | 0.8832 | 0.7673 | 0.5920 | 0.7764 | 0.7452 |
| | TMsDP | 0.1732/side = 0.3750 | 0.8668 | 0.7649 | 0.9124 | 0.7900 |
| msplice | AP | 0.8/0.6 | 0.0274 | 0.0008 | 0.6154 | – |
| | K-means | 3 | 0.5470 | 0.3753 | 0.6715 | 0.3045 |
| | DBSCAN | 2.9/7.9 | 0.6192 | 0.3858 | 0.3931 | 0.0381 |
| | DP | 8.3666 | 0.4206 | 0.2649 | 0.5072 | 0.0068 |
| | DPC-LG | 8.3666 | 0.4355 | 0.2759 | 0.5050 | 0.0054 |
| | TMsDP | 8.3666/side = 10.0392 | 0.4729 | 0.3044 | 0.5107 | 0.0176 |

Table VII. The running time results (unit: second)

| Data set characteristic | Data set | TMsDP | DPC-LG | DP |
| --- | --- | --- | --- | --- |
| Synthetic 2D data sets with data volume < 1,000 | 2circles | 1.05514 | 0.72348 | 0.69241 |
| | compound | 0.74211 | 0.64561 | 0.63548 |
| | twocirclesnoise2 | 0.94496 | 0.76864 | 0.67681 |
| | spiral | 0.92489 | 0.61591 | 0.59479 |
| | pathbase | 0.82898 | 0.58107 | 0.55721 |
| | jain | 1.16562 | 0.69894 | 0.65485 |
| | flame | 0.65291 | 0.55439 | 0.54409 |
| | D1 | 0.58841 | 0.49088 | 0.45262 |
| | D2 | 0.64021 | 0.48474 | 0.47937 |
| | DS5 | 0.99563 | 0.70887 | 0.68828 |
| Synthetic 2D data sets with data volume ≥ 1,000 | skewed | 1.80377 | 1.21797 | 1.01533 |
| | unbalance | 11.70188 | 9.75382 | 9.57304 |
| Real-world data sets with 10 > dimension > 2 and data volume < 1,000 | iris | 0.73448 | 0.65967 | 0.60074 |
| | thyroid | 5.17884 | 3.82771 | 3.38868 |
| | liver | 31.46669 | 26.77568 | 25.87881 |
| | seeds | 6.54661 | 5.36481 | 5.01853 |
| | breast | 8.25847 | 5.21467 | 5.14373 |
| | glass | 2.01151 | 1.70734 | 1.31521 |
| Real-world data sets with dimension > 10 and data volume < 1,000 | heart | 7.07824 | 4.44836 | 4.25349 |
| | wine | 1.06705 | 1.00945 | 0.71224 |
| | zoo | 2.07121 | 1.42512 | 1.37737 |
| | vote | 47.03792 | 38.68827 | 38.03321 |
| Real-world data sets with dimension > 150 and data volume ≥ 2,000 | dna | 428.44746 | 336.21582 | 334.39406 |
| | msplice | 1205.55255 | 850.19014 | 837.18927 |

References

Abbas, M., El-Zoghabi, A. and Shoukry, A. (2021), “DenMune: density peak based clustering using mutual nearest neighbors”, Pattern Recognition, Vol. 109, p. 107589.

Ansari, M.Y., Mainuddin, A.A. and Bhushan, G. (2021), “Spatiotemporal trajectory clustering: a clustering algorithm for spatiotemporal data”, Expert Systems and Applications, Vol. 178, p. 115048.

Chen, J.G. and Yu, P.S. (2021), “A domain adaptive density clustering algorithm for data with varying density distribution”, IEEE Transactions on Knowledge and Data Engineering, Vol. 33, pp. 2310-2321.

Chen, Y.W., Hu, X.L., Fan, W.T., Shen, L.L., Zhang, Z., Liu, X., Du, J.X., Li, H.B., Chen, Y. and Li, H.L. (2020), “Fast density peak clustering for large scale data based on kNN”, Knowledge-Based Systems, Vol. 187, p. 104824.

D'Errico, M., Facco, E., Laio, A. and Rodriguez, A. (2021), “Automatic topography of high-dimensional data sets by non-parametric density peak clustering”, Information Sciences, Vol. 560, pp. 476-492.

Ding, J.J., He, X.X., Yuan, J.Q. and Jiang, B. (2018), “Automatic clustering based on density peak detection using generalized extreme value distribution”, Soft Computing, Vol. 22 No. 9, pp. 2777-2796.

Du, M.J., Ding, S.F., Xue, Y. and Shi, Z.Z. (2019), “A novel density peaks clustering with sensitivity of local density and density-adaptive metric”, Knowledge and Information Systems, Vol. 59, pp. 285-309.

Ester, M., Kriegel, H.P., Sander, J. and Xu, X. (1996), "A density-based algorithm for discovering clusters in large spatial databases with noise", in KDD, Vol. 96 No. 34, pp. 226-231.

Fan, X., Duan, Y.Z., Cheng, S.C., Zhang, Y.X. and Cheng, H. (2019), “Fast density-peaks clustering for registration-free pediatric white matter tract analysis”, Artificial Intelligence in Medicine, Vol. 96, pp. 1-11.

Flores, K.G. and Garza, S.E. (2020), “Density peaks clustering with gap-based automatic center detection”, Knowledge-Based Systems, Vol. 206, p. 106350.

Frey, B.J. and Dueck, D. (2007), “Clustering by passing messages between data points”, Science, Vol. 315 No. 5814, pp. 972-976.

Guo, J.W., Zhang, J., Di, S., Zhang, Y.X., Xu, P.J., Li, L.T., Xie, Z.Q. and Li, Q.L. (2021), “An improved density-based approach to risk assessment on railway investment”, Data Technologies and Applications, Vol. 56 No. 3, pp. 382-408. doi: 10.1108/DTA-11-2020-0291.

He, Y.L., Wu, Y.Y., Qin, H.L., Huang, J.Z.X. and Jin, Y. (2021), “Improved I-nice clustering algorithm based on density peaks mechanism”, Information Sciences, Vol. 548, pp. 177-190.

Hou, J., Zhang, A.H. and Qi, N.M. (2020), “Density peak clustering based on relative density relationship”, Pattern Recognition, Vol. 108, p. 107554.

Jain, A.K. (2010), “Data clustering: 50 years beyond K-means”, Pattern Recognition Letters, Vol. 31 No. 8, pp. 651-666.

Jangra, S. and Toshniwal, D. (2020), “VIDPSO: victim item deletion based PSO inspired sensitive pattern hiding algorithm for dense datasets”, Information Processing and Management, Vol. 57 No. 5, p. 102255.

Jiang, J.H., Chen, Y.J., Hao, D.H. and Li, K.Q. (2019), “DPC-LG: density peaks clustering based on logistic distribution and gravitation”, Physica A, Vol. 514, pp. 25-35.

Jo, T. (2020), “Semantic string operation for specializing AHC algorithm for text clustering”, Annals of Mathematics and Artificial Intelligence, Vol. 88 No. 10, pp. 1083-1100.

Li, Y., Chu, X.Q., Tian, D., Feng, J.Y. and Mu, W.S. (2021), “Customer segmentation using K-means clustering and the adaptive particle swarm optimization algorithm”, Applied Soft Computing, Vol. 113, p. 107924.

Liu, R., Wang, H. and Yu, X.M. (2018), “Shared-nearest-neighbor-based clustering by fast search and find of density peaks”, Information Sciences, Vol. 450, pp. 200-226.

Long, W., Jiao, J.J., Liang, X.M., Wu, T.B., Xu, M. and Cai, S.H. (2021), “Pinhole-imaging-based learning butterfly optimization algorithm for global optimization and feature selection”, Applied Soft Computing, Vol. 103, p. 107146.

Lu, G.L., Zhu, Y.B., Su, G.Z., Zhang, Z.M. and Yan, P. (2018), “Efficient block matching using improved particle swarm optimization with application to displacement measurement for nano motion systems”, Optics and Lasers in Engineering, Vol. 111, pp. 246-254.

Lu, H., Shen, Z., Sang, X.S., Zhao, Q.H. and Lu, J.F. (2020), “Community detection method using improved density peak clustering and nonnegative matrix factorization”, Neurocomputing, Vol. 415, pp. 247-257.

Luo, S., Liu, H. and Qi, E. (2021), “Recognition and labeling of faults in wind turbines with a density-based clustering algorithm”, Data Technologies and Applications, Vol. 55 No. 5, pp. 841-868.

Medeghri, H. and Sabeur, S.A. (2021), “Anatomic compartments extraction from diffusion medical images using factorial analysis and K-means clustering methods: a combined analysis tool”, Multimedia Tools and Applications, Vol. 80 No. 16, pp. 23949-23962.

Nie, B., Liu, D.Q., Liu, X.H. and Ye, W.J. (2021), “Phase I non-linear profiles monitoring using a modified Hausdorff distance algorithm and clustering analysis”, International Journal of Quality & Reliability Management, Vol. 38 No. 2, pp. 536-550.

Parmar, M., Wang, D., Zhang, X.F., Tan, A.H., Miao, C.Y., Jiang, J.H. and Zhou, Y. (2019), “REDPC: a residual error-based density peak clustering algorithm”, Neurocomputing, Vol. 348, pp. 82-96.

Rodriguez, A. and Laio, A. (2014), “Clustering by fast search and find of density peaks”, Science, Vol. 344 No. 6191, pp. 1492-1496.

Ryu, J. and Kamata, S. (2021), “An efficient computational algorithm for Hausdorff distance based on points-ruling-out and systematic random sampling”, Pattern Recognition, Vol. 114, p. 107857.

Seyedi, S.A., Lotfi, A., Moradi, P. and Qader, N.N. (2019), “Dynamic graph-based label propagation for density peaks clustering”, Expert Systems with Applications, Vol. 115, pp. 314-328.

Singh, P. and Bose, S.S. (2021), “Ambiguous D-means fusion clustering algorithm based on ambiguous set theory: special application in clustering of CT scan images of COVID-19”, Knowledge-Based Systems, Vol. 231, p. 107432.

Vavpetic, A. and Zagar, E. (2021), “On optimal polynomial geometric interpolation of circular arcs according to the Hausdorff distance”, Journal of Computational and Applied Mathematics, Vol. 392, p. 113491.

Wang, S., Hua, W.Q., Liu, H.Y. and Jiao, L.C. (2019), “Unsupervised classification for polarimetric SAR images based on the improved CFSFDP algorithm”, International Journal of Remote Sensing, Vol. 40 No. 8, pp. 3154-3178.

Wang, S.L., Li, Q., Zhao, C.F., Zhu, X.Q., Yuan, H.N. and Dai, T.R. (2021), “Extreme clustering-A clustering method via density extreme points”, Information Sciences, Vol. 542, pp. 24-39.

Wang, Y.Z., Wang, D., Zhang, X.F., Peng, W., Miao, C.Y., Tan, A.E. and Zhou, Y. (2020), “McDPC: multi-center density peak clustering”, Neural Computing and Applications, Vol. 32 No. 17, pp. 13465-13478.

Xu, X., Ding, S.F., Wang, L.J. and Wang, Y.R. (2020), “A robust density peaks clustering algorithm with density-sensitive similarity”, Knowledge-Based Systems, Vol. 200, p. 106028.

Xu, X., Ding, S.F., Wang, Y.R., Wang, L.J. and Jia, W.K. (2021), “A fast density peaks clustering algorithm with sparse search”, Information Sciences, Vol. 554, pp. 61-83.

Yan, M., Chen, Y.W., Hu, X.L., Cheng, D.D., Chen, Y. and Du, J.X. (2021), “Intrusion detection based on improved density peak clustering for imbalanced data on sensor-cloud systems”, Journal of Systems Architecture, Vol. 118, p. 102212.

Yarinezhad, R. and Hashemi, S.N. (2019), “Solving the load balanced clustering and routing problems in WSNs with an fpt-approximation algorithm and a grid structure”, Pervasive and Mobile Computing, Vol. 58, p. 101033.

Yu, H., Chen, L.Y. and Yao, J.T. (2021), “A three-way density peak clustering method based on evidence theory”, Knowledge-Based Systems, Vol. 211, p. 106532.

Zhu, Y.L., Zhang, B., Dou, Z.H., Zou, H., Li, S.T., Sun, K. and Liao, Q.L. (2020), “Short-Term Load forecasting based on Gaussian process regression with density peak clustering and information sharing antlion optimizer”, IEEE Transactions on Electrical and Electronic Engineering, Vol. 15 No. 9, pp. 1312-1320.

Further Reading

Wang, Y.Z., Wang, D., Peng, W., Miao, C.Y., Tan, A.E. and Zhou, Y. (2020), “A systematic density-based clustering method using anchor points”, Neurocomputing, Vol. 400, pp. 352-370.

Acknowledgements

Funding: This research was funded by the Major Project of the National Social Science Foundation of China, grant number 20&ZD125, and the National Natural Science Foundation of Jilin Province, grant number 20210101480JC.

Corresponding author

Zhiyuan Hao is the first corresponding author and can be contacted at: 15391910163@163.com. Mo Hu is the second corresponding author and can be contacted at: 959539150@qq.com.

About the authors

Jie Ma, PhD, is Professor at the School of Business and Management, Jilin University in the People's Republic of China. She has published more than 100 academic articles in Chinese Social Sciences Citation Index, and her research interests include information resources management, information behavior, machine learning, deep learning and text analysis.

Zhiyuan Hao is a PhD candidate at the School of Business and Management, Jilin University in the People's Republic of China. He has published a number of papers in Chinese Social Sciences Citation Index and international SSCI/SCI journals, such as Information Processing and Management, IEEE Access and Tehnicki Vjesnik-Technical Gazette. His research interests include information behavior, machine learning, deep learning and text analysis. Jie Ma and Zhiyuan Hao contributed equally to this work.

Mo Hu is working in the Department of Network and New Media, School of Journalism and Communication, Nanjing Normal University in the People's Republic of China. She received her PhD from the School of Business and Management, Jilin University. She majored in information resources management, machine learning and text analysis.

Related articles