Given increasing cyber threats and the implementation of data privacy legislation like the GDPR in the EU or the CCPA in the US, businesses need to ensure that private data is used as little as possible. Data masking provides a way to limit the use of private data while enabling businesses to test their systems with data that is as close to real data as possible.
The average cost of a data breach was $4 million in 2019. This creates a strong incentive for businesses to invest in information security solutions including data masking to protect sensitive data. Data masking is a must-have solution for organizations that wish to comply with GDPR or to use realistic data in a testing environment.
What is data masking?
Data masking is also referred to as data obfuscation, data anonymization, or pseudonymization. It is the process of replacing confidential data with functional but fictitious data, such as substitute characters or other values. The main purpose of data masking is to protect sensitive, private information in situations where the enterprise shares data with third parties.
Why is data masking important now?
The number of data breaches is increasing each year: compared to midyear 2018, the number of recorded breaches was up 54% by midyear 2019. Organizations therefore need to improve their data security systems. The need for data masking is increasing for the following reasons:
- Organizations need a copy of production data when they decide to use it for non-production reasons such as application testing or business analytics modeling.
- 79% of CIOs believe employees have put company data at risk accidentally in the last 12 months, while 61% think employees have put company data at risk maliciously.
- 95% acknowledge that insider security threats are a danger to their organization.
- GDPR and CCPA force businesses to strengthen their data protection systems; otherwise, organizations have to pay hefty fines.
How does data masking work?
The data masking process is simple, yet it comes in different techniques and types. In general, organizations start by identifying all the sensitive data the enterprise holds. Then, they use algorithms to mask the sensitive data and replace it with structurally identical but numerically different data. What do we mean by structurally identical? For instance, passport numbers are 9 digits in the US, and individuals usually have to share their passport information with airline companies. When an airline company builds a model to analyze and test the business environment, it creates a different 9-digit passport ID or replaces some digits with characters.
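As a minimal sketch of the "structurally identical" idea, the Python snippet below replaces each digit with a random digit and each letter with a random letter of the same case, leaving separators untouched. The function name `mask_preserving_format` and the sample value are assumptions made for illustration; they do not come from any particular masking product.

```python
import random
import string

def mask_preserving_format(value: str, rng: random.Random) -> str:
    """Replace digits with random digits and letters with random letters of
    the same case; keep separators such as '-' or ' ' unchanged."""
    masked = []
    for ch in value:
        if ch.isdigit():
            masked.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            masked.append(rng.choice(pool))
        else:
            masked.append(ch)
    return "".join(masked)

rng = random.Random(42)  # seeded only so the sketch is repeatable
passport = "123456789"   # made-up 9-digit value, not a real passport number
masked_passport = mask_preserving_format(passport, rng)
print(masked_passport)   # another 9-digit string
```

Because the length and character classes are preserved, a system that validates the field as nine digits would still accept the masked value.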
What are the types of data masking?
- Static data masking (SDM): In SDM, a copy of the production data is masked and then loaded into a test environment, so that businesses can share the test data environment with third-party vendors.
- Dynamic data masking (DDM): In DDM, there is no need for a second data source to store the masked data. The original sensitive data remains in the repository and is accessible to an application when authorized by the system. Data is never exposed to unauthorized users: contents are shuffled in real time, on demand, to mask them, and only authorized users are able to see authentic data. A reverse proxy is generally used to achieve DDM. Other dynamic methods are generally called on-the-fly data masking.
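The difference between the two types can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: the in-memory list stands in for a database, the role name "authorized" is made up, and a real DDM deployment would typically sit in a proxy or in the database layer rather than in application code.

```python
import copy

# Made-up records standing in for a production table.
production = [
    {"name": "Alice Example", "ssn": "123-45-6789"},
    {"name": "Bob Example", "ssn": "987-65-4321"},
]

def mask_ssn(ssn: str) -> str:
    """Hide everything except the last four characters."""
    return "***-**-" + ssn[-4:]

def static_mask(rows):
    """Static masking: build a separate, masked dataset that can be shared."""
    masked_rows = copy.deepcopy(rows)
    for row in masked_rows:
        row["ssn"] = mask_ssn(row["ssn"])
    return masked_rows

def dynamic_read(rows, role: str):
    """Dynamic masking: nothing is stored twice; masking happens per read."""
    for row in rows:
        if role == "authorized":
            yield dict(row)
        else:
            yield {**row, "ssn": mask_ssn(row["ssn"])}

test_copy = static_mask(production)                               # SDM
analyst_view = list(dynamic_read(production, role="analyst"))     # DDM, masked
admin_view = list(dynamic_read(production, role="authorized"))    # DDM, clear
```

In the static case the masked dataset exists on its own and can be handed to a vendor; in the dynamic case the stored data is unchanged and each reader sees a view that depends on their role.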
What are the techniques of data masking?
There are numerous data masking techniques, and we have classified them according to their use case.
Suitable for testing
Substitution
In the substitution approach, as its name suggests, businesses substitute the original data with random data from a supplied or customized lookup file. This is an effective way to disguise data since businesses preserve the authentic look of the data.
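A minimal Python sketch of substitution is shown here. The lookup list is hypothetical and hard-coded for brevity; in practice it would be loaded from the supplied or customized lookup file.

```python
import random

# Hypothetical lookup values; a real project would load these from a file.
LOOKUP_FIRST_NAMES = ["Avery", "Jordan", "Morgan", "Riley", "Taylor"]

def substitute(values, lookup, seed=0):
    """Replace every value with a randomly chosen entry from the lookup list."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return [rng.choice(lookup) for _ in values]

original_names = ["Maria", "Chen", "Oluwaseun"]
masked_names = substitute(original_names, LOOKUP_FIRST_NAMES)
print(masked_names)  # three names drawn from the lookup list
```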
Shuffling
Shuffling is another common data masking method. As in substitution, businesses replace the original data with other authentic-looking data, but here the replacement values come from the same column: the entries in the column are shuffled randomly.
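Shuffling can be sketched as follows, assuming rows are held as Python dictionaries; the `employees` data and the `shuffle_column` helper are made up for illustration.

```python
import random

def shuffle_column(rows, column, seed=0):
    """Return new rows in which the values of one column are randomly
    reassigned among the rows; all other columns stay where they are."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]

employees = [
    {"name": "A", "salary": 50000},
    {"name": "B", "salary": 62000},
    {"name": "C", "salary": 71000},
    {"name": "D", "salary": 58000},
]
shuffled = shuffle_column(employees, "salary")
```

The column keeps exactly the same set of values, so aggregate statistics such as the average salary are unchanged, while the link between a person and a salary is broken.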
Number and Date Variance
For financial and date-driven data sets, applying the same variance to every value creates a new dataset that masks the original values without materially changing the accuracy of the dataset as a whole. Using variance to create a new dataset is also common in synthetic data generation. If you plan to protect data privacy with this technique, we recommend reading our comprehensive guide to synthetic data generation.
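One possible reading of this technique, sketched in Python: every amount is scaled by a random factor within the same percentage range, and every date is shifted by a random number of days within the same window. The ranges (10% and 30 days) and the helper names are assumptions chosen for the example.

```python
import random
from datetime import date, timedelta

def vary_number(value, pct, rng):
    """Scale a number by a random factor within +/- pct (0.10 means 10%)."""
    return round(value * (1 + rng.uniform(-pct, pct)), 2)

def vary_date(value, max_days, rng):
    """Shift a date by a random whole number of days within +/- max_days."""
    return value + timedelta(days=rng.randint(-max_days, max_days))

rng = random.Random(7)  # seeded only so the sketch is repeatable
transactions = [
    {"amount": 1000.00, "booked": date(2019, 3, 15)},
    {"amount": 250.50, "booked": date(2019, 7, 1)},
]
masked = [
    {
        "amount": vary_number(t["amount"], 0.10, rng),
        "booked": vary_date(t["booked"], 30, rng),
    }
    for t in transactions
]
```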
Encryption
Encryption is the most complex data masking technique: the data is transformed with an encryption algorithm, and users can access the original values only if they have the decryption key.
Character Scrambling
Character scrambling involves randomly rearranging the order of the characters in a value. The process is irreversible, so the original data cannot be obtained from the scrambled data.
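A minimal Python sketch of this rearrangement is given here; the `scramble` helper and the sample value are illustrative assumptions.

```python
import random

def scramble(value: str, rng: random.Random) -> str:
    """Return the characters of value in a random order."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)

rng = random.Random(3)  # seeded only so the sketch is repeatable
scrambled = scramble("INV-2019-00417", rng)  # made-up invoice reference
```

The output contains exactly the same characters as the input, only in a different order, so this technique suits identifiers better than short or highly recognizable values.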
Suitable for sharing data with unauthorized users
Nulling out or Deletion
Replacing sensitive data with a null value is another approach businesses can use in their data masking efforts. Although it reduces the accuracy of testing results, which is mostly maintained in the other approaches, it is a simpler approach when businesses are not masking data for model validation purposes.
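In Python, nulling out can be sketched as replacing the chosen fields with `None`. The column names and the `null_out` helper are hypothetical.

```python
def null_out(rows, sensitive_columns):
    """Return new rows in which every sensitive column is set to None."""
    return [
        {key: (None if key in sensitive_columns else value)
         for key, value in row.items()}
        for row in rows
    ]

customers = [
    {"id": 1, "email": "jane@example.com", "country": "DE"},
    {"id": 2, "email": "raj@example.com", "country": "IN"},
]
nulled = null_out(customers, {"email"})
```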
Masking out
In the masking out method, only part of the original data is masked. Like nulling out, it is not effective in a test environment. For example, in online shopping, only the last 4 digits of the credit card number are shown to customers to prevent fraud.
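The credit card example can be sketched in Python as follows. The `mask_out` helper and the card number are made up for illustration; spaces and dashes are kept so the masked value keeps its layout.

```python
def mask_out(card_number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Replace every digit except the last `visible` ones with mask_char."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    digits_seen = 0
    out = []
    for ch in card_number:
        if ch.isdigit():
            digits_seen += 1
            keep = digits_seen > total_digits - visible
            out.append(ch if keep else mask_char)
        else:
            out.append(ch)  # keep separators such as spaces or dashes
    return "".join(out)

print(mask_out("1234 5678 9012 3456"))  # **** **** **** 3456
```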
How is data masking different than synthetic data?
To create test data that complies with GDPR, organizations have two options: generating synthetic data or masking data with different algorithms. Though these two techniques serve the same purpose, each has different benefits and risks.
Data masking is the process of creating a copy of real-world data in which specific fields within the data set are obscured. However, even if the organization applies the most complex and comprehensive data masking techniques, there is a slight chance that somebody can identify individual people based on trends in the masked data. Therefore, there is a risk of releasing personal information to third parties.
Synthetic data, on the other hand, is data that is artificially created rather than generated by actual events. It does not contain real information about individuals; it is created based on the data model or message models that a business uses for its production systems. For cases where a business is testing a whole new application, or where it believes its data masking is not sufficient, synthetic data is the answer.
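As a rough sketch of generating records from a data model rather than from real data, the Python snippet below draws each field from a simple random generator. The `DATA_MODEL` fields and value ranges are hypothetical; realistic synthetic data generation usually also models the statistical relationships between fields.

```python
import random

# Hypothetical data model: field name -> generator taking a random.Random.
DATA_MODEL = {
    "customer_id": lambda rng: rng.randint(100000, 999999),
    "age": lambda rng: rng.randint(18, 90),
    "plan": lambda rng: rng.choice(["basic", "plus", "premium"]),
}

def generate_synthetic(n, model, seed=0):
    """Create n records whose fields are drawn only from the model."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return [{field: make(rng) for field, make in model.items()} for _ in range(n)]

synthetic_rows = generate_synthetic(3, DATA_MODEL)
```

No production record is read at any point, which is what separates this approach from masking a copy of real data.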
Which types of data require data masking?
- Personally identifiable information (PII): Any data that could potentially be used to identify a particular person. For example, full name, social security number, driver’s license number, and passport number.
- Protected health information (PHI): PHI includes demographic information, medical histories, test and laboratory results, mental health conditions, insurance information, and other data that a healthcare professional collects to identify appropriate care.
- Payment card information (PCI-DSS): Cardholder data is covered by the Payment Card Industry Data Security Standard (PCI DSS), an information security standard that organizations follow when handling branded credit cards from the major card schemes.
- Intellectual property (IP): IP refers to creations of the mind, such as inventions; literary and artistic works; designs; and symbols, names and images used in commerce.
How does GDPR promote data masking?
GDPR accepts data masking as a technique for protecting individuals’ data. Here are the related articles where GDPR encourages businesses to use pseudonymization:
Article 6 (4-e): “the existence of appropriate safeguards, which may include encryption or pseudonymization.”
Article 25 (1): “Taking into account the state of the art, the cost of implementation and the nature, scope, context and purposes of processing as well as the risks of varying likelihood and severity for rights and freedoms of natural persons posed by the processing, the controller shall, both at the time of the determination of the means for processing and at the time of the processing itself, implement appropriate technical and organizational measures, such as pseudonymization, which are designed to implement data-protection principles, such as data minimisation, in an effective manner and to integrate the necessary safeguards into the processing in order to meet the requirements of this Regulation and protect the rights of data subjects”
Article 32 (1-a): “The controller and the processor shall implement appropriate technical and organizational measures to ensure a level of security appropriate to the risk, including inter alia as appropriate: the pseudonymization and encryption of personal data.”
Article 40 (2): “Associations and other bodies representing categories of controllers or processors may prepare codes of conduct, or amend or extend such codes, for the purpose of specifying the application of this Regulation, such as with regard to:
- (d) the pseudonymization of personal data”
Article 89 (1): “Processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes, shall be subject to appropriate safeguards including data minimization and pseudonymization”
What are some example data masking case studies?
Independence Health Group
Independence Health Group is a leading health insurance company that offers a wide range of services, including commercial, Medicare and Medicaid medical coverage, third-party benefits administration, pharmacy benefits management, and workers’ compensation. Independence Health wanted to allow on- and off-shore developers to test applications using real data, but it needed to mask PHI and other personally identifiable information. It decided to use Informatica Dynamic Data Masking to disguise member names, birthdates, social security numbers (SSNs), and other sensitive data in real time as developers pull down data sets.
With a data masking solution, Independence Health is able to better protect its customers’ sensitive data, which reduces the potential cost of a data breach.
Samsung
Samsung analyzes and produces mobile and smart TV products all over the world. While performing product analysis on millions of Samsung Galaxy smartphones, the company has to protect personal information in accordance with local regulations.
To ensure legal compliance with personal privacy requirements, Samsung partnered with Dataguise. Dataguise’s tool for Hadoop automatically discovers consumer privacy data and encrypts it before the data is migrated to AWS analytics tools, so that only authorized users can access and perform analytics on real data.
What are the best practices of data masking?
- Make sure you have discovered all sensitive data in the enterprise’s databases before transferring it to the testing environment.
- Understand your sensitive data and identify the most suitable data masking technique accordingly.
- Use irreversible methods so that your data cannot be transformed back to the original version.
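The first practice above, discovering sensitive data, can be approximated with simple pattern matching. The Python sketch below flags columns whose values look like US social security numbers, 16-digit card numbers, or email addresses. The patterns and column names are illustrative assumptions; dedicated discovery tools use far more thorough detection than these regular expressions.

```python
import re

# Illustrative patterns only; they will miss many formats and may over-match.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_number": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def find_sensitive_columns(rows):
    """Return {column: set of pattern names that matched at least one value}."""
    hits = {}
    for row in rows:
        for column, value in row.items():
            if not isinstance(value, str):
                continue
            for name, pattern in PATTERNS.items():
                if pattern.search(value):
                    hits.setdefault(column, set()).add(name)
    return hits

# Made-up sample rows.
sample = [
    {"note": "call back tomorrow", "contact": "jane@example.com", "tax_id": "123-45-6789"},
    {"note": "paid with 1234 5678 9012 3456", "contact": "n/a", "tax_id": ""},
]
report = find_sensitive_columns(sample)
```

Note that the free-text `note` column is flagged as well, which is the kind of hidden sensitive data that is easy to overlook when only obviously named columns are masked.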
What are the leading data masking tools?
- CA Test Data Manager
- Dataguise Privacy on Demand Platform
- Delphix Dynamic Data Platform
- HPE SecureData Enterprise
- IBM Infosphere Optim
- Imperva Camouflage Data Masking
- Informatica Dynamic Data Masking (for DDM)
- Informatica Persistent Data Masking (for SDM)
- Oracle Advanced Security (for DDM)
- Oracle’s Data Masking and Subsetting Pack (for SDM)
- Privacy Analytics
- Solix Data Masking
If you are interested in other security solutions to protect your enterprise data from cyber threats, below is a recommended reading list for you:
- Endpoint Security: in-Depth Guide
- The Ultimate Guide to Cyber Threat Intelligence (CTI)
- AI Security: Defend against AI-powered cyberattacks
- Managed Security Services (MSS): Comprehensive Guide
- Security Analytics: The Ultimate Guide
- Deception Technology: in-Depth Guide