Research team at the Department of Electronic Engineering makes progress on key privacy and security technologies for distributed machine learning
Tsinghua News Network, August 23. Distributed machine learning coordinates the data and computing resources spread across the nodes of a real system and trains models by sharing and learning intermediate variables (such as model parameters) between nodes. Because it is decentralized, it avoids, to a certain extent, the privacy risks of centralized data storage and is currently the mainstream approach to privacy-preserving machine learning. As research has deepened, however, distributed machine learning has run into many challenges. Current frameworks rely on the dispersion of data across nodes to protect privacy, yet each node's private raw data is highly correlated with the variables shared during learning, and existing research has proved that private data can be successfully reconstructed from those shared variables. How to build a distributed machine learning framework that protects privacy throughout the whole process and at every stage is therefore a fundamental frontier topic in data security.
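To make the parameter-sharing pattern concrete, below is a minimal illustrative sketch in NumPy of one common form of distributed learning, federated averaging: each node trains on its own private data, and only the updated model parameters leave the node. The data, model, and update rule here are hypothetical stand-ins, not the team's system.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1, steps=5):
    """One node's local training: a few gradient steps on a linear model."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w = w - lr * grad
    return w

# Hypothetical private datasets held by three separate nodes.
nodes = [(rng.normal(size=(50, 4)), rng.normal(size=50)) for _ in range(3)]

w_global = np.zeros(4)
for _ in range(10):
    # Each node trains locally; only the updated parameters are shared.
    local_weights = [local_update(w_global.copy(), X, y) for X, y in nodes]
    # A coordinator averages the shared parameters (FedAvg-style).
    w_global = np.mean(local_weights, axis=0)

print("global parameters after 10 rounds:", w_global)
```

It is exactly these shared parameters that, as noted above, an attacker can exploit to reconstruct private training data.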
However, the contradiction between data security and processing efficiency is a perennial topic. Strengthening privacy protection in distributed machine learning inevitably affects the efficiency and effectiveness of learning, and this tension is especially pronounced when training models with large numbers of parameters. On the one hand, as model scale grows and privacy protection is added at every stage, the communication and computing overhead of sharing variables between nodes increases exponentially, becoming a major bottleneck in large-scale model training. On the other hand, complex raw data such as strongly correlated graph data ends up scattered across the nodes of the distributed framework; privacy can be protected by "de-correlating" the decentralized data, but doing so discards a large amount of the correlation information between the data and greatly reduces learning effectiveness. Existing methods assume each node holds independent and complete data and learn from its internal features alone, so they struggle to model data that is strongly correlated across nodes. Resolving the contradiction between the "endogenous strong correlation" of graph data and the "de-correlation" that distributed learning performs for privacy, and thereby improving learning on strongly correlated data, is a highly challenging problem.
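The "de-correlation" problem for graph data can be seen in a toy example: once the vertices of a graph are scattered across nodes, every edge whose endpoints sit on different nodes becomes invisible to any single node. The graph and the node assignment below are randomly generated purely for illustration.

```python
import itertools
import random

random.seed(0)

# Hypothetical graph: 12 vertices with random edges (a stand-in for real graph data).
vertices = list(range(12))
edges = [e for e in itertools.combinations(vertices, 2) if random.random() < 0.3]

# "De-correlation": scatter the vertices across three nodes; each node then
# sees only the edges whose two endpoints both landed on that node.
assignment = {v: random.randrange(3) for v in vertices}
kept = [(u, v) for u, v in edges if assignment[u] == assignment[v]]
lost = len(edges) - len(kept)

print(f"{len(edges)} edges total; {lost} cross-node edges "
      f"({lost / len(edges):.0%}) are invisible to any single node")
```

With a uniform split across three nodes, roughly two thirds of the edges cross node boundaries, which is the correlation information the paragraph above says is lost.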
Targeting these frontier problems in privacy-preserving distributed machine learning, the research team at the Open Source Data Cognitive Innovation Center of the Department of Electronic Engineering, Tsinghua University, has carried out systematic research (the overall research architecture is shown in Figure 1) and achieved staged progress.

The team first built a privacy-enhanced distributed machine learning model (the method is shown in Figure 2). The model adopts a collaborative learning framework based on differential-privacy knowledge transfer to achieve "whole-process" privacy protection in distributed learning. Whereas applying differential privacy directly to a machine learning model causes its performance to collapse, this framework provides effective, provable privacy guarantees for the distributed learning process while improving the performance of existing privacy-preserving machine learning methods by up to 84.2%.

Addressing the model-scale bottleneck caused by the contradiction between "privacy enhancement" and "learning efficiency" in distributed machine learning, the team then developed an efficient model training method for the privacy-enhanced distributed architecture (the method is shown in Figure 3). On top of the privacy-enhanced distributed learning model, it develops a bidirectional knowledge distillation technique based on the "disciple effect" and proposes an adaptive model-knowledge compression method based on mutual-learning constraints, breaking through the efficiency bottleneck of knowledge sharing under enhanced privacy protection. Experiments show that, in a large-scale privacy-enhanced distributed learning model, this method can increase the training efficiency of complex models by a factor of 20.

Finally, addressing the contradiction between the "strong correlation" of graph data and the "de-correlation" of distributed learning, the team proposed a complex-data learning method for the privacy-enhanced distributed architecture (the method is shown in Figure 4). It establishes a privacy-enhanced correlation-model learning method that "de-correlates" strongly correlated graph data across the nodes while using a data-expansion mechanism to model the higher-order correlations in cross-node data. Experiments on real-world data show that the framework effectively mines the correlations among distributed graph data, reaching 98.2% of the correlation-modeling performance of the optimal method that operates without privacy constraints.
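The article does not give the details of the team's framework. As a rough, non-authoritative illustration of what differential-privacy knowledge transfer can look like, the sketch below follows the well-known PATE pattern: an ensemble of per-node "teacher" models answers label queries through a noisy vote, so only privatized labels, never raw data or exact parameters, are released. The toy teachers, the inputs, and the choice of epsilon are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_vote(teacher_labels, num_classes=3, epsilon=1.0):
    """Aggregate teacher predictions with Laplace noise on the vote counts,
    so the released label is differentially private with respect to any
    single teacher's (and hence any single node's) data."""
    counts = np.bincount(teacher_labels, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=2.0 / epsilon, size=num_classes)
    return int(np.argmax(counts))

# Hypothetical teachers: each node's locally trained model, reduced here to a
# fixed per-node decision rule over toy inputs.
teachers = [lambda x, b=b: int((x.sum() + b) % 3) for b in range(5)]

# Knowledge transfer: label public, unlabeled data with noisy teacher votes;
# only these privatized labels are ever shared outside the nodes.
public_inputs = rng.normal(size=(10, 4))
student_labels = [
    noisy_vote(np.array([t(x) for t in teachers])) for x in public_inputs
]
print("privatized labels for the student model:", student_labels)
```

A student model trained only on these noisy labels never touches the nodes' raw data, which is the basic intuition behind knowledge-transfer-based privacy protection; the published framework is, of course, considerably more elaborate.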