2019 4th International Conference on Computer Science and Engineering (UBMK), Samsun, Turkey, 11-15 September 2019
Today, many institutions collect and store big data belonging to their respondents (clients, patients, users, firms, etc.). They do so to carry out their missions and to provide better services (modeling, extracting behavior patterns, disease detection, making future plans, creating policies, developing decision-making mechanisms). To benefit more fully from the collected big data, publishing the data is inevitable. However, if the big data includes sensitive information about respondents, releasing it directly may disclose their identities. Hence, new solutions to protect the privacy of respondents are always required. Anonymization is a utility-based privacy-preserving approach that is frequently used in privacy-preserving big data publishing (PPBDP). In this paper, a clustering-based anonymization model on Spark is proposed and applied for the first time. The main idea of the proposed approach is to treat the anonymization problem as a clustering problem. A distributed k-Means algorithm is used for anonymization in the proposed model. To adopt a clustering-based approach to k-anonymity, some assumptions were made. As a result, the proposed model provides a plausible solution to PPBDP.
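To make the idea of treating anonymization as a clustering problem concrete, the following is a minimal single-machine sketch in plain Python. It is not the Spark implementation proposed in the paper. The function names (`kmeans`, `enforce_min_size`, `anonymize`), the merging of undersized clusters into their nearest neighbor, the use of min-max ranges for generalization, and the sample records are all assumptions made for illustration, since the abstract does not state which assumptions the model relies on. Quasi-identifiers are assumed to be numeric and are left unnormalized for brevity.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two numeric records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of records."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(records, n_clusters, iters=20, seed=0):
    """Plain k-means; returns one cluster label per record."""
    rng = random.Random(seed)
    centroids = rng.sample(records, n_clusters)
    labels = [0] * len(records)
    for _ in range(iters):
        # Assignment step: attach each record to its nearest centroid.
        labels = [min(range(n_clusters), key=lambda c: dist2(r, centroids[c]))
                  for r in records]
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [r for r, lab in zip(records, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels

def enforce_min_size(records, labels, k):
    """Merge clusters with fewer than k records into their nearest cluster.

    This step is an assumption of the sketch: k-means alone does not
    guarantee that every cluster holds at least k records.
    """
    by_label = {}
    for i, lab in enumerate(labels):
        by_label.setdefault(lab, []).append(i)
    groups = list(by_label.values())
    while len(groups) > 1:
        groups.sort(key=len)
        small = groups[0]
        if len(small) >= k:
            break
        small_center = mean([records[i] for i in small])
        nearest = min(
            groups[1:],
            key=lambda g: dist2(small_center, mean([records[i] for i in g])),
        )
        nearest.extend(small)
        groups.pop(0)
    return groups

def anonymize(records, k, n_clusters):
    """Replace each record by the (min, max) ranges of its cluster."""
    labels = kmeans(records, n_clusters)
    groups = enforce_min_size(records, labels, k)
    out = [None] * len(records)
    for g in groups:
        columns = zip(*(records[i] for i in g))
        ranges = tuple((min(col), max(col)) for col in columns)
        for i in g:
            out[i] = ranges
    return out

# Hypothetical (age, income) quasi-identifiers.
records = [(25, 50000), (27, 52000), (26, 51000),
           (60, 90000), (62, 95000), (61, 91000), (40, 70000)]
for original, generalized in zip(records, anonymize(records, k=3, n_clusters=3)):
    print(original, "->", generalized)
```

Provided the data set contains at least k records, every generalized record in the output is shared by at least k rows, which is the k-anonymity condition on the quasi-identifiers. In a distributed setting, the clustering step would be carried out by a distributed k-means implementation rather than the loop shown here.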