[Talk Summary] Machine Learning and Privacy: Friends or Foes?

Dr. Vitaly Shmatikov from Cornell Tech gave a talk titled "Machine Learning and Privacy: Friends or Foes?" on March 17, 2017 at the School of Information Sciences, University of Pittsburgh. With recent advances in machine learning, powerful new tools built on ML models help protect data privacy. However, can trained models themselves leak sensitive data? Dr. Shmatikov presented the talk in two parts: (1) how to turn machine learning against systems that protect user data in storage (e.g., images) by partial encryption; (2) how to turn machine learning against itself, extracting sensitive training data from machine learning models, including black-box models constructed using Google's and Amazon's "learning-as-a-service" platforms.

At the beginning of the talk, Dr. Shmatikov gave a quick overview of machine learning: how it works and how it can outperform humans. Advanced machine learning handles typical tasks such as image classification with neural networks, regression models, and so on. With the explosion of social information and big data, standard models can access huge amounts of training data, resulting in accurate predictions. He also demonstrated how deep neural networks, which start with low-level features and gradually aggregate them into higher-level features, perform image recognition.

"What does the deep learning revolution mean for a privacy researcher?" With this question, Dr. Shmatikov opened the first part of his talk, on turning machine learning against systems that are meant to protect data privacy.


He started by showing how machine learning outperforms humans at recognizing images. Humans easily recognize the original, clear pictures but fail on degraded versions that retain only low-level features. Neural networks trained on the original pictures, on the other hand, recognize those degraded versions successfully. He illustrated this with a case study of P3, a prototype privacy-preserving photo sharing system that protects users' personal photos by partial encryption: the P3 versions of images resist human recognition, yet suitably trained models can still recognize them.
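The splitting idea behind P3 can be sketched roughly as follows. This is an illustrative toy only: the real system operates on JPEG DCT coefficients and uses proper symmetric encryption, not the XOR placeholder below, and the threshold value is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in "image data": one block of DCT-like coefficients, where the few
# large low-frequency coefficients carry most of the visual content.
coeffs = np.round(100 * rng.normal(size=64) * np.exp(-np.arange(64) / 8.0)).astype(np.int64)
THRESHOLD = 20  # illustrative split point, not P3's actual parameter

# Split: large (significant) coefficients go to the small secret part,
# the rest stay in the public part.
secret_mask = np.abs(coeffs) > THRESHOLD
secret_part = np.where(secret_mask, coeffs, 0)
public_part = np.where(secret_mask, 0, coeffs)

# "Encrypt" the secret part with a toy XOR keystream (a placeholder for a
# real symmetric cipher).
key = rng.integers(0, 256, size=64)
encrypted_secret = secret_part ^ key

# The public part alone is a degraded image a cloud service can store and
# process; only a keyholder can decrypt and reconstruct the original.
decrypted_secret = encrypted_secret ^ key
reconstructed = public_part + decrypted_secret
print("reconstruction exact:", np.array_equal(reconstructed, coeffs))
```

The talk's point is that this kind of degraded public part defeats human viewers but not a neural network trained on unprotected photos.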

For the second part of the talk, Dr. Shmatikov focused on the question: given a machine learning model, can we extract information about the data used to train it? In other words, do trained models leak sensitive data?

Currently, companies such as Google, Microsoft, and IBM offer machine learning as a service. The service includes a training API for building a model and a prediction API backed by the trained model. Customers upload their data, and the system trains a model so that customers without machine learning expertise can predict outputs from new inputs. However, this kind of service may leak sensitive data. Dr. Shmatikov pointed out that when querying the prediction API, inputs drawn from the training set and inputs from outside it tend to produce different outputs. Hence, if we can recognize that difference, we can predict whether a given input was a member of the training set.
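The train/predict workflow can be sketched with a toy in-process stand-in. The names and the learner here are hypothetical: real platforms expose REST endpoints, and the "model" below is a deliberately simple centroid classifier, but the shape of the interface is the same.

```python
import numpy as np

class MLService:
    """Toy stand-in for a "learning-as-a-service" platform."""

    def __init__(self):
        self._models = {}

    def train(self, model_id, X, y):
        # "Training API": here we just store per-class centroids.
        classes = np.unique(y)
        self._models[model_id] = {c: X[y == c].mean(axis=0) for c in classes}

    def predict(self, model_id, x):
        # "Prediction API": returns a confidence score per class -- exactly
        # the signal a membership-inference attack can later exploit.
        centroids = self._models[model_id]
        scores = {c: np.exp(-np.linalg.norm(x - mu)) for c, mu in centroids.items()}
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}

svc = MLService()
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y = np.array([0, 0, 1, 1])
svc.train("customer-model", X, y)
conf = svc.predict("customer-model", np.array([0.1, 0.0]))
print(conf)  # class 0 gets the higher confidence
```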

The distinctive idea he and his colleagues proposed is to use machine learning against machine learning: they train an attack model using shadow models that behave similarly to the target model of the service.
  • Attack model: predicts whether an input was a member of the training set or a non-member. Its training data is obtained from the outputs of the shadow models.
  • Shadow models: generate the training data for the attack model. Their own training data is synthesized by sampling inputs that the target model classifies with high confidence.
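A heavily simplified sketch of this pipeline, with illustrative stand-ins throughout: the "models" are memorizing nearest-neighbor learners (mimicking an overfitted target), the shadow training data is simply drawn from a similar distribution, and the attack "model" is reduced to a confidence threshold fit on the shadow records.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_model(data):
    # Toy "model" that memorizes its training points (1-nearest-neighbor).
    # Confidence = exp(-distance to the closest training point), so members
    # of the training set get confidence exactly 1.0 -- extreme overfitting.
    def confidence(x):
        return float(np.exp(-np.min(np.linalg.norm(data - x, axis=1))))
    return confidence

# Target model trained on private data the attacker never sees.
target_data = rng.normal(size=(20, 5))
target_model = train_model(target_data)

# The attacker trains shadow models on synthesized data and records
# (confidence, member?) pairs, since membership is known for shadow data.
records = []
for _ in range(10):
    shadow_data = rng.normal(size=(20, 5))
    shadow_model = train_model(shadow_data)
    records += [(shadow_model(x), 1) for x in shadow_data]          # members
    records += [(shadow_model(x), 0) for x in rng.normal(size=(20, 5))]  # non-members

# Attack "model": a threshold on confidence, fit on the shadow records
# (a stand-in for the classifier trained on shadow outputs in the paper).
members = [c for c, m in records if m == 1]
nonmembers = [c for c, m in records if m == 0]
threshold = (np.mean(members) + np.mean(nonmembers)) / 2

def attack(model, x):
    return model(x) >= threshold  # True => predicted member

hits = sum(attack(target_model, x) for x in target_data)
print(f"flagged {hits}/20 target training points as members")
```

Because the toy target memorizes its training points, the attack recovers membership perfectly here; against real models the signal is the gap in confidence distributions rather than an exact match.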
He claimed that the more overfitted the target model is, the more accurately the attack model predicts. In terms of privacy, does the model leak information about data in the training set? In terms of learning, does the model generalize to data outside the training set? These two questions are of the same nature: machine learning and privacy share the same enemy, overfitting.
  • Overfitted models leak training data
  • Overfitted models lack predictive power
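Both bullets can be seen in a tiny curve-fitting experiment (an illustrative sketch, not from the talk): a high-degree polynomial memorizes its noisy training points almost perfectly, i.e., it "leaks" them, while predicting poorly on fresh points in between.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of a simple underlying function y = sin(x).
x_train = np.linspace(0.0, 3.0, 10)
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=10)
x_test = np.linspace(0.15, 2.85, 10)  # fresh points between the training xs
y_test = np.sin(x_test)

train_mse, test_mse = {}, {}
for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse[degree] = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(f"degree {degree}: train MSE {train_mse[degree]:.4f}, "
          f"test MSE {test_mse[degree]:.4f}")

# The degree-9 fit passes through every noisy training point (memorization),
# which is the same property the membership-inference attack exploits.
```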

