The right balance between privacy and data insight
I agree that the privacy dream is "that we have a database with useful but private information and we have a curator that wants to take this data and sanitize it, so data analysts can only interact with the sanitized data set," but many mistakes can be made when a data set is prepared for general-purpose use.
I read about a few examples in the paper "Why Pseudonyms Don't Anonymize: A Computational Re-identification Analysis of Genomic Data Privacy Protection Systems" from the Data Privacy Laboratory at the School of Computer Science, Carnegie Mellon University. Its conclusion was that "this work illustrates the danger of blindly adopting identity protection methods for genomic data. Future methods must account for inferences that can be leaked from the data itself and the environment into which the data is being released in order to provide guarantees of privacy. While the protection methods reviewed in this paper provide a base for future protection strategies, our analyses provide guideposts for the development of provable privacy protecting methods."
I think that separate data sets should be produced for specific purposes and that sensitive data fields should be secured to minimize the risk of inference. We never know where this data will eventually end up after a data breach.
Data tokenization can provide the right balance between privacy and data insight in many situations.
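To make the idea concrete, here is a minimal sketch of vault-based tokenization (the class name and in-memory design are illustrative assumptions, not any particular product): each sensitive value is replaced by a random token that carries no information about the original, while a protected vault retains the mapping for authorized de-tokenization.

```python
import secrets

class TokenVault:
    """Hypothetical in-memory token vault. A real deployment would use a
    hardened, access-controlled vault service, not a Python dict."""

    def __init__(self):
        self._to_token = {}
        self._to_value = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token for a repeated value, so the tokenized
        # column still supports joins, counts, and distinct-value analytics.
        if value in self._to_token:
            return self._to_token[value]
        token = secrets.token_hex(8)  # random; reveals nothing about value
        self._to_token[value] = token
        self._to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Only the curator's side of the system should ever call this.
        return self._to_value[token]

vault = TokenVault()
t1 = vault.tokenize("123-45-6789")
t2 = vault.tokenize("123-45-6789")
assert t1 == t2                               # consistent within the vault
assert vault.detokenize(t1) == "123-45-6789"  # reversible only via the vault
```

Analysts work with the tokens and never see the raw values, which is what preserves both the privacy of the data and its analytic usefulness.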
In some situations, services that deliver result sets for specific queries, instead of exposing the raw data from the different sources, can be attractive. This approach can protect the privacy of the individuals as well as the privacy of the different data sources. A good treatment of this approach is the paper "Distributed Anonymization: Achieving Privacy for Both Data Subjects and Data Providers" by Pawel Jurczyk and Li Xiong of Emory University, Atlanta.
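A simple sketch of the query-service idea (the function names, sample records, and the threshold k are my own illustrative assumptions, not the paper's protocol): the service holds the pooled records internally and answers only aggregate queries, suppressing any group smaller than k so that no result set singles out an individual.

```python
from collections import Counter

# Held by the service; never returned to callers. Illustrative data only.
RAW_RECORDS = [
    {"zip": "30322", "diagnosis": "flu"},
    {"zip": "30322", "diagnosis": "flu"},
    {"zip": "30322", "diagnosis": "flu"},
    {"zip": "30307", "diagnosis": "rare-condition"},
]

def count_by(field: str, k: int = 3) -> dict:
    """Answer an aggregate query: value -> count for one field,
    dropping groups smaller than k to limit re-identification."""
    counts = Counter(record[field] for record in RAW_RECORDS)
    return {value: n for value, n in counts.items() if n >= k}

print(count_by("zip"))  # → {'30322': 3}; the lone "30307" record is suppressed
```

The caller gets a sanitized answer to a specific question rather than the raw rows, so neither the individuals nor the contributing sources are exposed.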
Ulf Mattsson, CTO Protegrity