Keeping Big Data Safe: Common Hadoop Security Issues and Best Practices

December 10, 2015 (updated April 30th, 2024)

The big data explosion has given rise to a host of information technology tools and capabilities that enable organizations to capture, manage, and analyze large sets of structured and unstructured data for actionable insights and competitive advantage. But with this new technology comes the challenge of keeping sensitive information private and secure.

Big data that resides within a Hadoop environment can contain sensitive financial data in the form of credit card and bank account numbers. It may also contain proprietary corporate information and Personally Identifiable Information (PII) such as the names, addresses, and social security numbers of clients, customers, and employees.

Due to the sensitive nature of all of this data and the damage that can be done should it fall into the wrong hands, it is imperative that it be protected from unauthorized access. To that end, here is a look at some common Hadoop security issues along with best practices to keep sensitive data safe and secure.

Security concerns with Hadoop

It wasn’t all that long ago that Hadoop in the enterprise was primarily deployed on-premises. As such, sensitive data was confined to isolated clusters or data silos where security was rarely treated as an issue. But that quickly changed as Hadoop evolved into Big Data-as-a-Service (BDaaS), took to the cloud, and became surrounded by an ever-growing ecosystem of tools and applications. And while these innovations have served to democratize data and bring Hadoop into the mainstream, they have also created new security concerns for organizations that now struggle to scale security in step with Hadoop’s rapid technological advances.

For many organizations, Hadoop has evolved into an enterprise data platform. That poses new security challenges as data that was once siloed is brought together in a vast data lake and made accessible to a variety of users across the organization. Among these challenges are:

  • Ensuring the proper authentication of users who access Hadoop.
  • Ensuring that authorized Hadoop users can only access the data that they are entitled to access.
  • Ensuring that data access histories for all users are recorded in accordance with compliance regulations and for other important purposes.
  • Ensuring the protection of data—both at rest and in transit—through enterprise-grade encryption.
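As a rough illustration of the second and third challenges, the following Python sketch models group-based authorization with an audit trail. It is a toy model under stated assumptions, not how Hadoop itself enforces access: the user names, group names, paths, and the `can_read` helper are all hypothetical, and the user is assumed to have been authenticated already.

```python
from datetime import datetime, timezone

# Hypothetical grants: which directory trees each group may read.
# A real cluster would keep these in file permissions, ACLs, or a policy
# engine rather than in application code.
GROUP_GRANTS = {
    "finance": {"/data/finance"},
    "analysts": {"/data/public"},
}

# Hypothetical user-to-group mapping.
USER_GROUPS = {
    "alice": {"finance", "analysts"},
    "bob": {"analysts"},
}

# In-memory audit trail; every decision is recorded, allowed or not.
audit_log = []


def can_read(user, path):
    """Return True if any of the user's groups is granted a parent of path."""
    groups = USER_GROUPS.get(user, set())
    allowed = any(
        path == root or path.startswith(root + "/")
        for group in groups
        for root in GROUP_GRANTS.get(group, set())
    )
    audit_log.append(
        {
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "path": path,
            "allowed": allowed,
        }
    )
    return allowed
```

With this setup, `can_read("alice", "/data/finance/cards.csv")` is allowed while the same request from `bob` is denied, and both decisions end up in `audit_log`. Recording denials as well as grants is a deliberate choice here, since failed attempts are often what a compliance review looks for.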

Hadoop security best practices

Clearly, today’s organizations face formidable security challenges. And the stakes regarding data security are being raised ever higher as sensitive healthcare data, personal retail customer data, smartphone data, and social media and sentiment data become more and more a part of the big data mix. It’s time for organizations to reevaluate the safety of their data in Hadoop and to reacquaint themselves with the following Hadoop security best practices.

  1. Plan before you deploy – Big data protection strategies must be determined during the planning phase of the Hadoop deployment. Before moving any data into Hadoop it’s critical to identify any sensitive data elements, along with where those elements will reside in the system. In addition, all company privacy policies and pertinent industry and governmental regulations must be taken into consideration during the planning phase in order to better identify and mitigate compliance exposure risk.
  2. Don’t overlook basic security measures – Basic security measures can go a long way in meeting Hadoop security challenges. To identify users and control their access to sensitive data, create users and groups and then map users to groups. Permissions should be assigned and locked down by group, and the use of strong passwords should be strictly enforced. Fine-grained permissions should be assigned on a need-to-know basis only, and broad-stroke permissions should be avoided as much as possible.
  3. Choose the right remediation technique – When business analytic needs require access to real data, as opposed to data that has been desensitized, there are two remediation techniques to choose from—encryption or masking. While masking offers the most secure remediation, encryption might be a better choice as it offers greater flexibility to meet evolving needs. Either way, it’s important to ensure that the data protection solutions being considered are capable of supporting both remediation techniques. That way, both masked and unmasked versions of sensitive data can be kept in separate Hadoop directories if desired.
  4. Ensure that encryption integrates with access control – Once an encryption solution is chosen it must be made compatible with the organization’s access control technology. Otherwise, users with different credentials won’t have the appropriate, selective access to sensitive data in the Hadoop environment that they require.
  5. Monitor, detect, and resolve issues – Even the best security models will be found wanting without the capability to detect non-compliance issues and suspected or actual security breaches and quickly resolve them. Organizations need to make sure that best-practice monitoring and detection processes are in place.
  6. Ensure proper training and enforcement – To be fully effective, best-practice policies and procedures with respect to Hadoop security issues must be frequently revisited in employee training and constantly supervised and enforced.
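To make the masking option in practice 3 more concrete, here is a minimal Python sketch that masks card numbers and US Social Security numbers in a line of text while keeping the last four digits. The regular expressions and the `mask_record` helper are illustrative assumptions; real card and ID formats vary, and production masking would normally rely on a dedicated data protection tool rather than hand-written patterns.

```python
import re

# Assumed formats: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")
# Assumed format: NNN-NN-NNNN; the last four digits are captured for reuse.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def _mask_card(match):
    """Replace all but the last four digits of a matched card number."""
    digits = re.sub(r"\D", "", match.group(0))
    return "*" * (len(digits) - 4) + digits[-4:]


def mask_record(text):
    """Return text with SSNs and card numbers masked."""
    # Mask SSNs first so their digits are never considered as card digits.
    text = SSN_RE.sub(r"***-**-\1", text)
    return CARD_RE.sub(_mask_card, text)
```

Masking of this kind is one-way: the original digits cannot be recovered from the output. That is why the unmasked source, if it must be retained, would sit in a separate and more tightly restricted directory, while analysts work from the masked copy.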

Hadoop is enabling organizations to analyze vast and rich data stores and derive actionable insights that inform new and better products and services and help create a competitive advantage. But the benefits of Hadoop come with risks. Hopefully, the above information will help organizations gain a better understanding of the security and compliance issues associated with Hadoop and implement best practices to keep sensitive data safe and secure going forward.

To learn more about how Qubole handles data security, download our security brief.
