Big data analytics is proving a boon to business and government, enabling organizations to analyze whole sets of related data, or data from disparate sets, for patterns and trends in ways never before possible. Use cases are growing quickly, and we are only beginning to understand how best to make use of this technology.
- Retail – Create a full profile of not only what people are buying, but also the process they go through to make their selections. Data can include detailed views of customers' online shopping behavior, social media activity, and even interactions in physical stores. An extension of this is creating case-by-case pricing and promotions based on customer profiles.
- Household profiling for business and government – Similarly, whole-household profiling brings together diverse publicly available information to create sophisticated views of households that result in opportunities for business, and characterizations for government entities.
- Internet of Things (IoT) applications – Collecting and analyzing a wealth of data from diverse physical devices to build and recognize patterns that may indicate threats, opportunities, or failures in process (predictive analysis)
- Extended business planning – Taking in data from previously unlinked business unit silos to look for increased efficiencies and better customer service
- IT Information Security – Combining data from IT monitoring, storage devices, and file and data access logs, and analyzing it for patterns, vulnerabilities, and inappropriate access
- Hospitality industry – Profiling customer preferences based on previous purchase patterns to provide a personalized experience tailored to individual travelers
If you look at the details of these existing use cases, there are many concerns with compliance due to the protected nature of some of the information collected and correlated. Sensitive and protected data is almost always part of the core data set. This data can include personally identifiable information (PII) protected by law (including Data Breach Protection laws, Data Residency regulations, US HIPAA/HITECH requirements) as well as financial data and intellectual property that are required to be protected by compliance regulations and industry bodies (PCI DSS, SOX, GLB, fiduciary requirements, national secrecy laws).
Due to the nature of Big Data implementations, if even a small percentage of the total information is protected data, the entire data store must be protected, as protected data can reside anywhere within the environment.
What’s needed to meet these compliance needs? A set of protections that meets the following requirements:
- Least privilege – Protected and sensitive data must not be accessible to applications and users except for those who specifically require it
- Country of origin – Protected data must only be accessible to users within the country of origin
This seems simple in principle, but can be complex in implementation, and there are choices to be made based upon the data types and analysis planned.
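As a rough illustration, the two requirements reduce to a pair of checks made whenever protected data is requested. The sketch below is hypothetical: the `Requester` and `ProtectedField` types, the role names, and the country codes are all assumptions made for the example, and it takes for granted that the requester's location has already been established reliably, which is itself part of the implementation complexity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requester:
    """A user or process asking for protected data (hypothetical model)."""
    name: str
    roles: frozenset
    country: str  # assumed to be verified elsewhere, e.g. by the access gateway


@dataclass(frozen=True)
class ProtectedField:
    """A protected data element and the policy attached to it (hypothetical)."""
    name: str
    allowed_roles: frozenset
    country_of_origin: str


def may_access(requester: Requester, field: ProtectedField) -> bool:
    """Grant access only if both requirements hold."""
    # Least privilege: the requester must hold a role that specifically needs the field.
    has_role = bool(requester.roles & field.allowed_roles)
    # Country of origin: the requester must be located where the data originated.
    in_country = requester.country == field.country_of_origin
    return has_role and in_country


# Illustrative policy and requesters; every name here is made up.
national_id = ProtectedField("national_id", frozenset({"fraud-analytics"}), "DE")
analyst = Requester("analyst", frozenset({"fraud-analytics"}), "DE")
remote_admin = Requester("sysadmin", frozenset({"os-admin"}), "IN")

analyst_allowed = may_access(analyst, national_id)            # True
remote_admin_allowed = may_access(remote_admin, national_id)  # False
```

The system administrator in this example can still manage the platform; the check only denies access to the protected field itself.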
Outside the Big Data environment – OS and file system data stores need to be protected from access by system-level accounts (sometimes called privileged users), but these accounts must still be able to perform system-level tasks such as updates, backups, software installations, and so on. The best solution is to encrypt the Big Data stores and allow decryption only for the Big Data processes and user accounts. This enables the system instance underlying the Big Data store to be managed as needed (even a European data store managed by an outsourcing company in India) without exposing information at the system level, meeting both the “least privilege” and “country of origin” requirements.
Inside the Big Data environment: data ingestion and analysis reports – As data is added to a Big Data environment, protected information should be anonymized. Many techniques can be used (encryption, tokenization/substitution, nulling out data), and the right choice depends upon the expected final use and the analytics to be performed. In the final reports, this data can then remain anonymized or be returned to its original state, based on the privilege and jurisdiction of the user or process accessing the information.
- Encryption – Encryption of protected data is best suited to cases where analysis of the protected portion of the data set is not needed. Once encrypted, the protected data isn’t useful for analysis, but the rest of the data set remains available, and the protected data accompanies the unprotected records (like an encrypted Social Security number accompanying a name). Although there has been much talk of homomorphic encryption recently (an encryption technique said to retain the capacity to analyze data even while it is encrypted), it has yet to reach commercial usefulness. Once analysis reports are completed, the encryption key used on ingestion can be used to recover the protected data. If this key is accessible only to processes and people who meet “least privilege” and who are located in the correct jurisdiction, both requirements are preserved.
- Tokenization – With tokenization, protected data is replaced with a “token” representing the actual information. The information itself need never leave a secure environment to enter the Big Data implementation. Tokenization can be “tuned” to the level of security the information requires. Lower-security tokenization can replace, for instance, a Social Security number with a substitute that has the same number of digits; if the substitutes are generated by a consistently applied algorithm, some analysis capabilities are also preserved. More secure tokenization schemes can use random number generation to create a substitute for the protected data, but will not allow analysis of the protected information (much like encryption). In either case, when the analysis run is complete and reports are generated, only those with the correct privilege level and/or within the right jurisdiction will be able to access or regenerate the protected information.
- Nulling data – When it’s desired to preserve a space in the data set for protected information, and the data need not be available for analysis or reporting, it can be nulled on ingestion into the environment.
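The tokenization and nulling options above might look roughly like the following at ingestion time. This is a minimal sketch using only the Python standard library, not a production design: the function and class names are invented for illustration, the keyed-hash approach is one possible "consistently used algorithm" rather than the only one, and key storage, collision handling, and vault persistence are all assumed to be handled inside the secure environment. Encryption is left out because it would need a vetted cryptography library.

```python
import hashlib
import hmac
import secrets


def deterministic_token(value: str, key: bytes, digits: int = 9) -> str:
    """Lower-security option: the same value and key always yield the same
    digit string, so counts and joins on the token still work.
    Assumption: collisions are tolerated or handled elsewhere, and recovering
    the original value requires a lookup table kept in the secure environment."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    number = int.from_bytes(digest, "big") % (10 ** digits)
    return str(number).zfill(digits)


class RandomTokenVault:
    """Higher-security option: every call issues a fresh random token, so the
    tokens carry no analyzable relationship to the data. The mapping back to
    the original value never leaves the vault (hypothetical in-memory store)."""

    def __init__(self) -> None:
        self._token_to_value = {}

    def tokenize(self, value: str, digits: int = 9) -> str:
        while True:
            token = "".join(secrets.choice("0123456789") for _ in range(digits))
            if token not in self._token_to_value:  # never reuse a token
                break
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        """Only privileged, in-jurisdiction callers should reach this method."""
        return self._token_to_value[token]


def null_field(record: dict, field: str) -> dict:
    """Nulling: keep the column in place but drop the protected value."""
    cleaned = dict(record)  # copy so the caller's record is not modified
    cleaned[field] = None
    return cleaned
```

The trade-off mirrors the text: the deterministic token supports some analysis at the cost of being guessable if the key leaks, while the random token and the nulled field give up analysis of the protected value entirely.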
Use of these techniques requires some work: protected data must be identified prior to ingestion, and appropriate processing performed before it leaves the “least privilege” environment and/or its home jurisdiction. Similarly, work is required on the reporting side to ensure that the protected data can be restored for authorized users once results leave the big data environment. Some of this work can be left to an internal gateway. If all data leaving a given subnet, for instance, always passes through a gateway that applies an anonymization technique to Social Security numbers, home addresses, and so on, this can minimize the exposure both to big data environments and to cloud or other applications.
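As a very rough sketch of the gateway idea, the function below scans outbound text for strings shaped like US Social Security numbers and replaces them before the data leaves the subnet. The pattern, the function name, and the default mask are assumptions made for illustration; a real gateway would need far more robust detection (home addresses, for example, cannot be found with a simple pattern) and would usually operate on structured records rather than free text.

```python
import re

# Assumed format: three digits, two digits, four digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scrub_outbound(text: str, replace=lambda ssn: "XXX-XX-XXXX") -> str:
    """Replace every SSN-shaped string in outbound text.

    `replace` receives the matched value and returns its substitute, so a
    tokenization routine could be plugged in instead of the fixed mask.
    """
    return SSN_PATTERN.sub(lambda match: replace(match.group(0)), text)


masked = scrub_outbound("Customer SSN 123-45-6789 on file")
# masked == "Customer SSN XXX-XX-XXXX on file"
```

Passing a different `replace` callable, such as one that keeps only the last four digits, changes the anonymization technique without touching the gateway logic.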
The final result of this work is to minimize the risk to organizations of failing to meet compliance and regulatory requirements, and to reduce their exposure to data breaches, when working with Big Data implementations.