Securing Your Hadoop DataNode: Kerberos, Wire Encryption, and Best Practices
An unsecured Hadoop cluster is a ticking time bomb. Without authentication, any user on the network can read, write, or delete HDFS data. This guide covers the essential security layers for HDFS DataNodes: Kerberos authentication, data transfer encryption, block access tokens, and OS-level hardening.
Why DataNodes Are a Security Target
DataNodes are the workhorses of HDFS — they store actual data blocks and serve reads/writes to clients. In an unsecured cluster:
- Any process that can reach port 9866 (DataNode transfer port) can read or write blocks directly
- There's no per-user access control on who reads which data
- A rogue client can inject corrupt or malicious blocks
Hadoop's security model addresses all of this through Kerberos-based mutual authentication, block access tokens, and optional wire encryption.
Layer 1: Kerberos Authentication
Kerberos is the foundation of Hadoop security. Every Hadoop service (NameNode, DataNode, ResourceManager, NodeManager) authenticates with a Kerberos principal before communicating.
Prerequisites
- A running Kerberos KDC (MIT Kerberos or Active Directory)
- DNS properly configured (Kerberos is very sensitive to hostname resolution)
- Synchronized clocks across all nodes (within the Kerberos default 5-minute skew; use NTP or chrony); quick checks for all three prerequisites follow below
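These prerequisites are where most Kerberos rollouts go wrong, so it pays to sanity-check every node before continuing. A minimal sketch, assuming chrony for time sync (swap in ntpstat if you run ntpd; the example IP is a placeholder):
# Forward DNS: the FQDN must resolve to the node's routable address
hostname -f
getent hosts $(hostname -f)
# Reverse DNS: the address must map back to the same FQDN
getent hosts 10.0.0.11        # substitute this node's IP
# Clock sync: skew must stay well under the 5-minute Kerberos limit
chronyc tracking              # or: ntpstat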
Create Service Principals
For each DataNode host, create a principal:
# On the KDC
kadmin.local -q "addprinc -randkey hdfs/datanode1.example.com@EXAMPLE.COM"
kadmin.local -q "addprinc -randkey host/datanode1.example.com@EXAMPLE.COM"
# Export keytabs
kadmin.local -q "ktadd -k /etc/security/keytabs/hdfs.keytab hdfs/datanode1.example.com@EXAMPLE.COM"
Copy keytabs to each DataNode at /etc/security/keytabs/hdfs.keytab with ownership hdfs:hdfs and mode 400. If the web UIs will use SPNEGO over HTTPS (see dfs.http.policy later in this guide), an HTTP/_HOST principal is typically created and added to the same keytab as well.
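Before wiring this into Hadoop, confirm the keytab actually works on the DataNode host:
# List the principals and key versions stored in the keytab
klist -kt /etc/security/keytabs/hdfs.keytab
# Authenticate non-interactively with the keytab, then inspect the ticket
kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/datanode1.example.com@EXAMPLE.COM
klist
kdestroy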
Enable Security in hdfs-site.xml
<!-- hdfs-site.xml -->
<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>
<!-- DataNode SASL RPC authentication -->
<property>
<name>dfs.datanode.kerberos.principal</name>
<value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/security/keytabs/hdfs.keytab</value>
</property>
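The NameNode side needs the analogous settings. A sketch, assuming the same keytab path and realm as above (adjust for your layout):
<!-- hdfs-site.xml on the NameNode -->
<property>
<name>dfs.namenode.kerberos.principal</name>
<value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>dfs.namenode.keytab.file</name>
<value>/etc/security/keytabs/hdfs.keytab</value>
</property>
<property>
<name>dfs.namenode.kerberos.internal.spnego.principal</name>
<value>HTTP/_HOST@EXAMPLE.COM</value>
</property>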
Enable Security in core-site.xml
<!-- core-site.xml -->
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
<property>
<name>hadoop.rpc.protection</name>
<value>authentication</value> <!-- or privacy for encryption -->
</property>
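Once the files are in place, you can confirm the values the daemons will actually load with hdfs getconf:
hdfs getconf -confKey hadoop.security.authentication   # expect: kerberos
hdfs getconf -confKey hadoop.rpc.protection            # expect: authentication (or privacy)
hdfs getconf -confKey dfs.block.access.token.enable    # expect: true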
Layer 2: Block Access Tokens
Block access tokens prevent unauthorized direct block reads/writes even from nodes that have network access to a DataNode. The NameNode issues a short-lived token when a client requests a block location; the DataNode validates the token before serving data.
Enable with:
<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>
<property>
<name>dfs.block.access.token.lifetime</name>
<value>600</value> <!-- minutes; default 600 (10 hours) -->
</property>
Without block tokens, a client who obtains a block location (host:port + block ID) can read that block without further auth. With tokens, the NameNode effectively acts as the gatekeeper for all data transfers.
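The NameNode also periodically rotates the secret keys it uses to sign block tokens; the rotation interval is configurable as well. A sketch showing the default, which, like the token lifetime above, is in minutes:
<property>
<name>dfs.block.access.key.update.interval</name>
<value>600</value> <!-- minutes between NameNode block-key rotations -->
</property>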
Layer 3: Wire Encryption
Even with Kerberos, data transferred between DataNodes and clients is in plaintext by default. Enable encryption for data in transit:
RPC Encryption (control plane)
<!-- core-site.xml -->
<property>
<name>hadoop.rpc.protection</name>
<value>privacy</value> <!-- authentication = auth only; integrity = + checksums; privacy = + encryption -->
</property>
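If you need to tighten an already-running cluster without breaking existing clients, hadoop.rpc.protection also accepts a comma-separated list of QOP values, so servers can temporarily accept both settings during the transition. A sketch:
<property>
<name>hadoop.rpc.protection</name>
<value>privacy,authentication</value> <!-- preferred value first; drop the fallback once all clients use privacy -->
</property>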
Data Transfer Encryption (data plane)
<!-- hdfs-site.xml -->
<property>
<name>dfs.encrypt.data.transfer</name>
<value>true</value>
</property>
<property>
<name>dfs.encrypt.data.transfer.algorithm</name>
<value>3des</value> <!-- 3des or rc4; both are legacy and, once the AES cipher suite below is configured, are used only for the initial key exchange -->
</property>
<property>
<name>dfs.encrypt.data.transfer.cipher.suites</name>
<value>AES/CTR/NoPadding</value> <!-- hardware-accelerated AES on modern CPUs -->
</property>
AES/CTR with hardware acceleration (AES-NI, available on most modern Intel/AMD CPUs) adds only 5–10% overhead compared to unencrypted transfer.
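If policy requires longer keys, the negotiated AES key length is configurable too. A sketch; the default is 128, and on older JDKs 256-bit AES may also need the unlimited-strength JCE policy files:
<property>
<name>dfs.encrypt.data.transfer.cipher.key.bitlength</name>
<value>256</value> <!-- 128, 192, or 256 -->
</property>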
Layer 4: DataNode SASL on Privileged Ports
Running the DataNode data transfer on a privileged port (below 1024) proves that the process was started with root privileges: the DataNode is launched as root (via jsvc), binds the privileged ports, then drops to the hdfs user. This is optional but adds defense in depth at the OS level.
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:1004</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:1006</value>
</property>
When using SASL for the data transfer protocol (Hadoop 2.6+) instead of privileged ports, the DataNode proves its identity through Kerberos and negotiates a SASL quality of protection with clients, so root-owned ports are no longer required. This mode requires the web UI to be HTTPS-only and dfs.data.transfer.protection to be set:
<property>
<name>dfs.data.transfer.protection</name>
<value>authentication</value> <!-- or integrity / privacy; privacy also encrypts the transfer -->
</property>
<property>
<name>dfs.http.policy</name>
<value>HTTPS_ONLY</value>
</property>
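In SASL mode the DataNode must not bind privileged ports, so the addresses shown earlier in this section go back to non-privileged values. A sketch using the Hadoop 3 defaults:
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:9866</value>
</property>
<property>
<name>dfs.datanode.https.address</name>
<value>0.0.0.0:9865</value>
</property>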
Layer 5: OS Hardening
Kerberos secures the Hadoop layer, but the underlying OS must also be locked down:
File Permissions
# DataNode data directories should be owned by the hdfs user
chown -R hdfs:hadoop /data/hdfs/dn
chmod 700 /data/hdfs/dn
# Keytab files must not be world-readable
chmod 400 /etc/security/keytabs/hdfs.keytab
chown hdfs:hdfs /etc/security/keytabs/hdfs.keytab
Network Restrictions
Restrict DataNode ports to cluster-internal network ranges using iptables or firewalld:
# Only allow DataNode transfer port from within the cluster subnet
iptables -A INPUT -p tcp --dport 9866 -s 10.0.0.0/8 -j ACCEPT
iptables -A INPUT -p tcp --dport 9866 -j DROP
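If you use firewalld instead, the same restriction looks roughly like this (in the default public zone, traffic matching no rule is rejected, so no explicit drop rule is needed):
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/8" port port="9866" protocol="tcp" accept'
firewall-cmd --reload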
Run DataNode as Non-Root
The DataNode process should run as the hdfs system user, not root. In hadoop-env.sh, HDFS_DATANODE_USER names the user for a normal (SASL-based) startup, while HDFS_DATANODE_SECURE_USER names the user the process drops to after the root/jsvc start used with privileged ports:
# /etc/hadoop/hadoop-env.sh
export HDFS_DATANODE_USER=hdfs
export HDFS_DATANODE_SECURE_USER=hdfs
Verifying Security Configuration
After enabling security, verify that everything works:
# Obtain a Kerberos ticket for the hdfs service user
kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/namenode.example.com@EXAMPLE.COM
# List HDFS root (should succeed)
hdfs dfs -ls /
# Check that unauthenticated access is denied
kdestroy
hdfs dfs -ls / # Should fail with "No valid credentials"
Run an HDFS health check with auth:
kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/namenode.example.com@EXAMPLE.COM
hdfs dfsadmin -report
hdfs fsck /
Security Audit Checklist
| Item | Secured |
|---|---|
| Kerberos principals created for all service hosts | [ ] |
| Keytab files owned by service user, mode 400 | [ ] |
| hadoop.security.authentication = kerberos | [ ] |
| dfs.block.access.token.enable = true | [ ] |
| Data transfer encryption enabled | [ ] |
| DataNode data dirs owned by hdfs user, mode 700 | [ ] |
| Firewall restricts DataNode ports to cluster subnet | [ ] |
| HDFS audit logging enabled | [ ] |
| NTP synchronized (< 5 min skew) | [ ] |
| Ranger or Sentry for fine-grained authorization | [ ] |
Summary
Securing a Hadoop DataNode involves multiple complementary layers:
- Kerberos — mutual authentication between services and clients
- Block access tokens — prevent unauthorized direct block access
- Wire encryption — protect data in transit (RPC + data transfer)
- Privileged ports or SASL — OS-level service identity verification
- OS hardening — file permissions, firewall, non-root user
No single layer is sufficient on its own. A properly secured DataNode requires all these working together. For finer-grained authorization than HDFS permissions and ACLs provide, such as centralized path-level policies on HDFS and column-level controls in Hive, look at Apache Ranger as the next step.
