Big data has become part of life for many people. As society moves into the big data era, data about people's lives is continuously collected, analyzed, and applied. Behind the scenes, computer server clusters need to process hundreds of millions of pieces of data every day. Choosing the right big data processing platform and algorithm for each kind of dataset is therefore very important, and mastering data classification algorithms is a prerequisite for working effectively in big data processing. In data mining, classification aims to build a classification model or classification function by screening and classifying the current data. The given data can then be mapped to a specified category, and the trend of future data can be predicted through the classification model. Algorithms of this kind reduce the difficulty of the work and improve people's efficiency. This paper optimizes the classical classification algorithm KNN and designs a new normalized algorithm called PEWM_G KNN. For distance measurement, we replace the traditional Euclidean metric with the Pearson correlation coefficient; we then refine the treatment of the datasets' attribute values by introducing the entropy weight method, which is combined with the Pearson measure to optimize the distance calculation. Once the value of K is fixed, we add a Gaussian function to carry out the class selection. In this study, we compared the effect of each step and tested datasets of different data types and sizes in order to evaluate the algorithm's performance under different scenarios. The datasets used are Iris, Breast Cancer, Dry Bean, and HTRU2, all from the University of California, Irvine.
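The three ingredients named in the abstract (entropy weights, a Pearson-based distance, and Gaussian-weighted voting among the K nearest neighbors) can be sketched in a few lines of NumPy. The abstract does not give the exact PEWM_G KNN equations, so everything below is an assumption made for illustration: the function names `entropy_weights`, `weighted_pearson_distance`, and `predict` are hypothetical, the distance is taken as one minus a weighted Pearson correlation, and the kernel width `sigma` and the default `k` are arbitrary choices rather than values from the paper.

```python
import numpy as np

def entropy_weights(X):
    """Entropy weight method: features with lower entropy get larger weights."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    lo, hi = X.min(axis=0), X.max(axis=0)
    constant = hi == lo                      # constant features carry no information
    span = np.where(constant, 1.0, hi - lo)
    Z = (X - lo) / span                      # min-max normalization to [0, 1]
    col_sum = Z.sum(axis=0)
    col_sum[col_sum == 0] = 1.0
    P = Z / col_sum                          # share of each sample within a feature
    logP = np.log(P, where=P > 0, out=np.zeros_like(P))  # treat 0 * log(0) as 0
    e = -(P * logP).sum(axis=0) / np.log(n)  # entropy per feature, in [0, 1]
    d = 1.0 - e                              # degree of divergence
    d[constant] = 0.0
    return d / d.sum()

def weighted_pearson_distance(a, b, w):
    """Assumed distance: 1 - weighted Pearson correlation (w sums to 1)."""
    da = a - np.sum(w * a)
    db = b - np.sum(w * b)
    denom = np.sqrt(np.sum(w * da * da) * np.sum(w * db * db))
    if denom == 0:
        return 1.0                           # correlation undefined; treat as uncorrelated
    return 1.0 - np.sum(w * da * db) / denom

def predict(X_train, y_train, x, k=3, sigma=1.0):
    """Classify x by Gaussian-weighted voting among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x = np.asarray(x, dtype=float)
    w = entropy_weights(X_train)
    dists = np.array([weighted_pearson_distance(x, row, w) for row in X_train])
    votes = {}
    for i in np.argsort(dists)[:k]:
        g = np.exp(-dists[i] ** 2 / (2 * sigma ** 2))  # closer neighbors vote more
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + g
    return max(votes, key=votes.get)

# Toy data (not from the paper): class 0 rises across features, class 1 falls.
X_toy = [[1.0, 2.0, 3.0, 4.0], [1.2, 2.1, 2.9, 4.2], [0.9, 1.8, 3.1, 3.8],
         [4.0, 3.0, 2.0, 1.0], [4.1, 3.2, 1.9, 1.1], [3.9, 2.8, 2.2, 0.9]]
y_toy = [0, 0, 0, 1, 1, 1]
label = predict(X_toy, y_toy, [1.1, 2.0, 3.2, 3.9])
```

Because a Pearson-based distance compares the shape of two feature profiles rather than their magnitudes, it needs several features per sample to be meaningful, which is why the toy data uses four.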