Provisioning Hortonworks Data Platform onto OpenStack with Terraform

Dmytro Bischak
Sep 21, 2018

Updated: Jun 29, 2021

This article enhances the Hortonworks Data Platform v3.x.x deploy scripts with a Terraform implementation for https://github.com/hortonworks/ansible-hortonworks. It covers provisioning a cloud environment for Hortonworks Data Platform (HDP) and deploying it in an OpenStack private cloud with Terraform.

What is the main idea of infrastructure as code?

Provisioning a Hadoop cluster should cover not only the initial configuration of cluster nodes but also ongoing support and transparent horizontal and vertical scaling. Cluster resources can be located in different physical datacenters and with different cloud service providers. To simplify provisioning and maintaining a Hadoop cluster, we have to use proper tools and practices. The most suitable approach is infrastructure as code (IAC), which is described by Martin Fowler in his article.

The main idea of IAC is that we write and execute code to define, deploy, and update our infrastructure. The IAC advantages are as follows:

it is self-documented;
it is kept under version control like any other code, which gives us an understanding of changes and history;
validation by code review, automated testing and static analysis tools;
deployment process automation which increases speed and safety;
systems can be easily reproduced in different clouds;
systems are in consistent state;
processes are repeatable;
infrastructure is immutable.

To provision a Hadoop cluster with the Hortonworks HDP distribution, we can apply one of the approaches below, each of which implements the IAC paradigm to a greater or lesser degree:

ad hoc scripts (BASH scripts)
configuration management tools (ansible)
server provisioning tools (Terraform, OpenStack Heat)

For example, we can provision our cloud with shell scripts, but this kind of IAC implementation will be hard to maintain. We don't treat these tools as competitors but as useful supplements to each other.

In this article we describe an implementation based on Terraform for provisioning and the Hortonworks Ansible scripts for software installation.

How do we implement IAC with Terraform for OpenStack cluster provisioning

For provisioning HDP, Hortonworks officially suggests a collection of Ansible scripts. These scripts fall into two categories according to the provisioning stage:

For building VMs in the cloud (creating instances with an OS, mounting drives, configuring initial access and networking, registering DNS records of hosts etc.)
For installing the computational nodes (installing Ambari and the Hadoop technology stack components with their initial configuration).

The first category of Ansible scripts is suitable for initial provisioning but not for further maintenance, so we replace this collection of Ansible scripts with a Terraform implementation. Terraform has several advantages over its competitors, because by using Terraform as the IAC implementation we move:

from Configuration Management to Orchestration;
from Mutable Infrastructure to Immutable Infrastructure;
from Procedural coding style to Declarative;
from Client/Server Architecture to Client-Only Architecture.

We also chose HashiCorp Terraform for these reasons:

It is a multicloud solution. It supports AWS, Google Cloud, OpenStack, MS Azure and other resource providers, and lets us mix them together. We can build a cluster in a multicloud environment: AWS, Google Cloud and OpenStack.
It is a free and open source solution (Mozilla Public License 2.0).
It provides various programming primitives (local variables, loops, modules etc.).

We will combine a provisioning tool and a configuration management tool: our Terraform code and the Hortonworks Ansible scripts. Building an HDP cluster comprises two main steps (as shown in the picture below):

Provisioning (building) the HDP cluster with Terraform. This step includes editing the Terraform configuration files and running the scripts.
HDP cluster software configuration and installation.

1. Provisioning (building) HDP cluster with Terraform

Install Terraform

Download the required version of Terraform (the latest or the preferred one):

wget https://releases.hashicorp.com/terraform/0.11.8/terraform_0.11.8_linux_amd64.zip

We suggest using the same Terraform version the infrastructure code was written for, or at least testing the code for compatibility before moving to a newer version. Terraform is under active development, so behavior can change between releases.
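A minimal sketch of such a version check is given below. It assumes that the first line printed by terraform version has the form "Terraform v0.11.8"; the function names and the REQUIRED_TF_VERSION variable are illustrative and not part of the repository.

```shell
# Version this infrastructure code was written against (taken from the
# download URL above; adjust it if your code targets another release).
REQUIRED_TF_VERSION="0.11.8"

# Read the output of `terraform version` on stdin and print the version
# number found on its first line, e.g. "Terraform v0.11.8" -> "0.11.8".
tf_version_from_output() {
  sed -n '1s/^Terraform v\([0-9][0-9.]*\).*/\1/p'
}

# Return 0 only when the given version equals the required one.
tf_version_ok() {
  [ "$1" = "$REQUIRED_TF_VERSION" ]
}

# Typical use on the build node:
#   tf_version_ok "$(terraform version | tf_version_from_output)" \
#     || echo "unexpected Terraform version" >&2
```

Running the check before plan or apply gives an early warning when a workstation carries a different Terraform release than the one the code was tested with.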

Extract the downloaded archive:

unzip terraform_0.11.8_linux_amd64.zip

Move the executable into a directory searched for executables:

sudo mv terraform /usr/local/bin/
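
The download step can be parameterized by version and platform, as in the sketch below. The URL pattern mirrors the download link used above. The helper names are hypothetical, and the SHA256SUMS file name is an assumption about how HashiCorp publishes checksums next to the archive, so verify it for your release.

```shell
# Base URL of one Terraform release.
tf_release_base() {  # arg: version
  printf 'https://releases.hashicorp.com/terraform/%s' "$1"
}

# Archive file name for a version and platform.
tf_zip_name() {  # args: version os arch
  printf 'terraform_%s_%s_%s.zip' "$1" "$2" "$3"
}

# Full download URL of the archive.
tf_zip_url() {  # args: version os arch
  printf '%s/%s' "$(tf_release_base "$1")" "$(tf_zip_name "$1" "$2" "$3")"
}

# Checksum file assumed to be published next to the archive.
tf_sums_url() {  # arg: version
  printf '%s/terraform_%s_SHA256SUMS' "$(tf_release_base "$1")" "$1"
}

# Possible use (needs network access):
#   wget "$(tf_zip_url 0.11.8 linux amd64)" "$(tf_sums_url 0.11.8)"
#   sha256sum --check --ignore-missing terraform_0.11.8_SHA256SUMS
```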

[Optional] Enable autocompletion:

terraform -install-autocomplete 
source ~/.profile

It is also considered good practice to run Terraform inside a Docker container or on a separate build node. Upload the ansible-hortonworks repository to the build node / workstation, preferably under the home folder. If the build node / workstation can download the repository directly, we run the following:

Clone the repository

Clone the terraform-default branch of the git repository:

cd && git clone -b terraform-default https://github.com/dataengi/ansible-hortonworks.git

If our GitHub SSH key is installed, we can use the SSH link:

cd && git clone -b terraform-default git@github.com:dataengi/ansible-hortonworks.git

Our code is based on the existing Hortonworks HDP Ansible scripts for building a cloud on OpenStack. We added a Terraform implementation for provisioning on OpenStack (directory terraform). The terraform directory has two subdirectories: terraform/live and terraform/modules. The first contains subdirectories with real configured environments; in our case this is stage. The second contains modules, which in Terraform play the role of "functions" that we "call" to build our nodes. Modules give us code reuse. We will work in the live subdirectory by configuring two files: main.tf and terraform.tfvars.

Cloud configuration

We should set the Terraform input variables (in the file terraform.tfvars, with the -var flag, or from environment variables). E.g. contents of terraform.tfvars:

username            = "openstack_username"
tenantname          = "tenantname"
password            = "passs"
openstack_auth_url  = "http://openstack_auth_url:5000/v3/"
openstack_keypair   = "key-pair"
network_name        = "network_name"
#AWS
aws_region          = "eu-west-2"
access_key          = "123131321"
secret_key          = "******"
aws_zone_id         = "xxxxxxxx"

Set the OpenStack variables

We should set the OpenStack credentials (username, password) and parameters (tenantname, openstack_auth_url, openstack_keypair, network_name).
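
When credentials should not be stored in terraform.tfvars, the environment-variable route can be used instead. The sketch below relies on Terraform reading environment variables named TF_VAR_<name> as input variables. The helper export_tf_var is illustrative, and the values are the placeholders from the example above.

```shell
# Export a value so that Terraform reads it as the input variable NAME.
# Terraform picks up environment variables named TF_VAR_<name>.
export_tf_var() {  # args: name value
  export "TF_VAR_$1=$2"
}

# Placeholder values, not real credentials.
export_tf_var username "openstack_username"
export_tf_var tenantname "tenantname"

# For the password, prompting keeps it out of files and shell history
# (bash syntax):
#   read -r -s -p "OpenStack password: " pw && export_tf_var password "$pw"
```

A variable supplied this way does not need an entry in terraform.tfvars.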

Set the Amazon Route 53 variables

In the suggested code we use the AWS Route 53 service to provide DNS records for our hosts. If you don't need it, you can skip this configuration. In terraform.tfvars we should set aws_region, access_key, secret_key and aws_zone_id in the #AWS section. After Terraform has successfully created the infrastructure, all hosts will have the appropriate DNS name bindings.

Nodes configuration

We should modify the file terraform/live/stage/main.tf to set the OpenStack configuration according to the cluster blueprint. This file contains the node-specific variables. Nodes are separated into host_groups, which is an Ambari Blueprint concept. Each group defines a specific cluster role, for example master, slave or edge. host_groups can have any names and any number of nodes, and more host_groups can be added to match the required architecture, as long as they correspond to the host_groups in the Ambari Blueprint and respect the Blueprint spec (for example, there shouldn't be more than 1 node in the host_group which contains the AMBARI_SERVER component, but there can be 100+ nodes in the slave / worker host_group). host_group names can be taken from the blueprint (e.g. the default blueprint is in the file ansible-hortonworks/playbooks/group_vars/all in the section blueprint configuration).

So, if we open the file ansible-hortonworks/playbooks/group_vars/all in the section blueprint configuration, we see two host groups with their HDP components (services and clients):

#############################
## blueprint configuration ##
#############################

blueprint_name: '{{ cluster_name }}_blueprint'            # the name of the blueprint as it will be stored in Ambari
blueprint_file: 'blueprint_dynamic.j2'                    # the blueprint JSON file - 'blueprint_dynamic.j2' is a Jinja2 template that generates the required JSON
blueprint_dynamic:                                        # properties for the dynamic blueprint - these are only used by the 'blueprint_dynamic.j2' template to generate the JSON
  - host_group: "hdp-master"
    clients: ['ZOOKEEPER_CLIENT', 'HDFS_CLIENT', 'YARN_CLIENT', 'MAPREDUCE2_CLIENT', 'TEZ_CLIENT', 'PIG', 'SQOOP', 'HIVE_CLIENT', 'OOZIE_CLIENT', 'INFRA_SOLR_CLIENT', 'SPARK2_CLIENT']
    services:
      - ZOOKEEPER_SERVER
      - NAMENODE
      - SECONDARY_NAMENODE
      - RESOURCEMANAGER
      
      ...
      
      - HST_AGENT
  - host_group: "hdp-slave"
    clients: ['ZOOKEEPER_CLIENT', 'HDFS_CLIENT', 'YARN_CLIENT', 'MAPREDUCE2_CLIENT', 'TEZ_CLIENT', 'PIG', 'SQOOP', 'HIVE_CLIENT', 'OOZIE_CLIENT', 'INFRA_SOLR_CLIENT', 'SPARK2_CLIENT']
    services:
      - DATANODE
      - NODEMANAGER
      - METRICS_MONITOR
      - HST_AGENT

In file ansible-hortonworks/terraform/live/stage/main.tf we have the appropriate groups of nodes:

module "hdp-master" {
  source                = "../../modules/nodegroups"
  host_group            = "hdp-master"
  hostname              = "cluster-os-m"
  domainsuffix          = "scalhive.com"
  nodescount            = 1

    ...

  enable_floating_ip    = false
}

module "hdp-slave" {
  source                = "../../modules/nodegroups"
  host_group            = "hdp-slave"
  hostname              = "cluster-os-s"
  domainsuffix          = "scalhive.com"
  nodescount            = 4

    ...

  sec_groups            = ["default","local-network"]
}
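
Since the host_group names in main.tf must correspond to those in the blueprint, a quick consistency check can catch typos before anything is provisioned. The sketch below assumes both files use the double-quoted layout shown in the two samples above; the function names are illustrative.

```shell
# List the host_group values set as module arguments in a Terraform file.
tf_host_groups() {  # arg: path to main.tf
  sed -n 's/^[[:space:]]*host_group[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | sort -u
}

# List the host_group names declared under blueprint_dynamic.
blueprint_host_groups() {  # arg: path to group_vars/all
  sed -n 's/^[[:space:]]*-[[:space:]]*host_group:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | sort -u
}

# Succeed only when both files declare exactly the same set of groups.
host_groups_match() {  # args: main.tf group_vars/all
  [ "$(tf_host_groups "$1")" = "$(blueprint_host_groups "$2")" ]
}

# Possible use from the repository root:
#   host_groups_match terraform/live/stage/main.tf playbooks/group_vars/all \
#     || echo "host_groups differ between Terraform and the blueprint" >&2
```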

There is also a sample of a highly available infrastructure in Terraform, ansible-hortonworks/terraform/live/stage/main.tf.hdp3-ha-3-masters-with-druid-atlas-knox-log.sample, which conforms to ansible-hortonworks/playbooks/group_vars/example-hdp3-ha-3-masters-with-druid-atlas-knox-log.

Build the Cloud environment

We have to set up the OpenStack credentials according to the HDP manual:

Download your user-specific OpenStack RC file from the OpenStack dashboard. This is usually found on Compute -> Access and Security under the API Access tab. Download the v3 if available.
Apply the OpenStack credentials by copying the file to the build node / workstation in a private location (for example the user's home folder) and source the file so that it populates the current session with the OpenStack environment variables. Type the OpenStack account password when prompted.

source ~/ansible/bin/activate 
source ~/*-openrc.sh

We can verify if it worked by trying to list the existing OpenStack instances:

nova --insecure list
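
Before calling Terraform it can also help to confirm that the RC file really populated the session. The sketch below checks three variables that a standard OpenStack RC file exports (OS_AUTH_URL, OS_USERNAME, OS_PASSWORD); the function name is illustrative, and your RC file may set further variables worth checking.

```shell
# Print the names of required OpenStack variables that are unset or empty.
missing_os_vars() {
  local name value
  for name in OS_AUTH_URL OS_USERNAME OS_PASSWORD; do
    # Indirect lookup of the variable called $name (names are fixed above).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Possible use after sourcing the RC file:
#   [ -z "$(missing_os_vars)" ] || echo "missing: $(missing_os_vars)" >&2
```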

Enter terraform build directory terraform/live/stage. Initialize a working directory containing Terraform configuration files.

cd ~/ansible-hortonworks*/terraform/live/stage
terraform init

Analyze plan of cloud infrastructure:

cd ~/ansible-hortonworks*/terraform/live/stage
terraform plan

Apply infrastructure:

cd ~/ansible-hortonworks*/terraform/live/stage
terraform apply

Optional: analyze graph of used resources:

terraform graph | dot -Tpng > graph.png

After a successful infrastructure deployment we will have the HDP cluster nodes. The result of terraform apply is a static inventory file ansible-hortonworks/inventory/static with the cluster nodes divided into groups, e.g.:

[hdp-master]
cluster-os-m-01.scalhive.com ansible_host=10.224.0.17 ansible_user=centos ansible_ssh_private_key_file=~/.ssh/big-data-sandbox.pem
[hdp-slave]
cluster-os-s-01.scalhive.com ansible_host=10.224.0.4 ansible_user=centos ansible_ssh_private_key_file=~/.ssh/big-data-sandbox.pem
cluster-os-s-02.scalhive.com ansible_host=10.224.0.29 ansible_user=centos ansible_ssh_private_key_file=~/.ssh/big-data-sandbox.pem

We provisioned our cluster and generated this file with Terraform. The file is used at the initial stage of software installation.
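
To inspect what Terraform generated, the hosts of a single group can be listed with a small helper. This is a sketch that assumes the INI-style layout shown above, with one [group] header per section and the host name as the first field; inventory_hosts is a hypothetical name.

```shell
# Print the host names listed under one [group] of a static inventory file.
inventory_hosts() {  # args: group file
  awk -v group="[$1]" '
    /^\[/ { in_group = ($0 == group); next }   # a new section starts
    in_group && NF { print $1 }                # first field is the host name
  ' "$2"
}

# Possible use:
#   inventory_hosts hdp-slave ~/ansible-hortonworks/inventory/static
```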

2. Cluster software configuration and installation

HDP software components configuration

Cluster configuration is described in the appropriate READMEs. We have to edit the file ansible-hortonworks/playbooks/group_vars/all and change the values to suit our installation (e.g. set versions, change components etc.). At this stage we use the Hortonworks Ansible scripts.

Important notes:

If you don't have a dedicated DNS server that can handle reverse lookups, you should set this parameter to no:

external_dns: no

This option instructs the installation scripts to copy the /etc/hosts file to the nodes during the installation process.

If you decide to use a highly available configuration, you should change the database option to one of the RDBMSs: postgres, mysql or mariadb

database: 'postgres'

Be careful with the software configuration of the nodes. If you set incorrect values, you will either receive an error message (the blueprint configuration is checked) or end up with a configuration that doesn't work (e.g. some services are mandatory, and some services can be used only on certain node types). Remember that software components can be added later.

HDP software components installation

The next installation step is to run the script that installs the cluster using Blueprints. We should make sure that the CLOUD_TO_USE environment variable is set to static.

export CLOUD_TO_USE=static
cd ~/ansible-hortonworks*/ && bash install_cluster.sh

At this step we should check the output of the Ansible script. Some tasks may fail, and then we should restart the script. For example, the task that installs required packages may fail because of connection problems or errors fetching data from the yum repository. By default, Ambari and the cluster configuration tools are installed on the master node. We can follow the installation process by visiting http://[your-ambari-node]:8080 with the credentials admin/admin.
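
Restarting after a transient failure can be wrapped in a small retry helper. The sketch below is generic shell and not part of the Hortonworks scripts; retry and RETRY_DELAY are illustrative names. It simply reruns the whole command, so it only helps with transient errors such as the repository timeouts mentioned above.

```shell
# Run a command up to N times, stopping at the first success.
# RETRY_DELAY (seconds between attempts) defaults to 5.
retry() {  # args: attempts command [args...]
  local attempts=$1
  shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "${RETRY_DELAY:-5}"
  done
}

# Possible use:
#   export CLOUD_TO_USE=static
#   cd ~/ansible-hortonworks*/ && retry 3 bash install_cluster.sh
```

If the same task fails on every attempt, the cause is most likely a configuration error, and the Ansible output should be read before trying again.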

Add new nodes to cluster

After a successful installation we can scale our cluster horizontally:

go to the directory ansible-hortonworks/terraform/live/stage;
in the host group (e.g. "hdp-slave") increase the nodescount number (e.g. change the value from 2 to 4);
analyze the infrastructure plan and apply it:

terraform plan  
terraform apply

In Ambari UI add new hosts to HOSTS -> Actions -> Add new hosts: enter target hostnames, private SSH key, root user for SSH access. Choose nodes roles and services.

Remember that actual state of your infrastructure will be stored in a file ansible-hortonworks/terraform/live/stage/terraform.tfstate. Don't change this file manually.
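
Before running terraform plan for a scale-out, it can be useful to confirm which nodescount value a module currently declares. The helper below is a sketch that assumes the main.tf layout shown earlier (one module "name" { line per group and one nodescount = N line inside it); module_nodescount is a hypothetical name.

```shell
# Print the nodescount value declared inside one module block of main.tf.
module_nodescount() {  # args: module-name file
  awk -v name="$1" '
    $1 == "module" && $2 == ("\"" name "\"") { in_mod = 1; next }
    in_mod && $1 == "nodescount" { print $3; exit }   # line: nodescount = N
    in_mod && $1 == "}" { exit }                      # end of the block
  ' "$2"
}

# Possible use:
#   module_nodescount hdp-slave ~/ansible-hortonworks/terraform/live/stage/main.tf
```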

Further steps

Cluster is installed with default configuration, so you have to configure it according to your requirements:

tune cluster by changing sizes of heap, containers etc. First step in cluster tuning is benchmarking
configure security access

Important notes about using Terraform

There are several good write-ups of experience with Terraform (e.g. this one or another one). We will also share the drawbacks and pitfalls from our own experience:

It doesn't support rollbacks (if some stage of infrastructure provisioning fails, the apply operation is not rolled back).
Sometimes the result of plan can differ from the result of apply. Save the plan with the -out option of the plan command and pass the saved plan to apply. Carefully read the output of apply and destroy before typing 'yes'.
Take into account the versions of the providers and of the Terraform distribution.
We have to test destroy as carefully as apply.
Code refactoring may cause our infrastructure to be destroyed, so name variables carefully.
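
The saved-plan workflow can be sketched as two small wrappers. terraform plan -out=FILE and terraform apply FILE are standard Terraform commands; the wrapper names, the TERRAFORM variable and the plan file name are illustrative.

```shell
# Terraform binary to call; override TERRAFORM to try the wrappers with a stub.
TERRAFORM="${TERRAFORM:-terraform}"

# Write the plan to a file so that apply executes exactly what was reviewed.
tf_plan_to_file() {  # arg: plan file
  "$TERRAFORM" plan -out="$1"
}

# Apply a previously saved plan file; no new plan is computed at this point.
tf_apply_plan_file() {  # arg: plan file
  "$TERRAFORM" apply "$1"
}

# Possible use in terraform/live/stage:
#   tf_plan_to_file stage.tfplan      # review the printed plan
#   tf_apply_plan_file stage.tfplan   # apply exactly that plan
```

Applying a saved plan file means the changes that run are the ones that were reviewed, rather than a plan recomputed at apply time.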

Rules

Think the infrastructure through carefully at the planning phase.
Test your infrastructure on a small sandbox cluster.
Keep clear naming conventions, because our code is our infrastructure documentation.
Plan before every apply.
Take into account the resource provider versions and set them explicitly in code.
Take into account the Terraform version and use the same version for the existing code (use an installation environment on a separate node or in a Docker container).
The master branch of the live repository should be identical to deployed production.
If you decide to manage your infrastructure with Terraform, don't change it by other means (e.g. Web UI, API, etc.).

Conclusion

This article suggests a simple approach to HDP cluster provisioning and analyzes the pitfalls of an IAC implementation with Terraform. This approach allows us to build scalable and maintainable solutions according to the infrastructure as code paradigm. The suggested approach has many advantages:

documented infrastructure
simplified provisioning and maintenance of infrastructure
use of the official Hortonworks scripts
another cloud provider can be added or mixed with the existing one.

In the next article we will talk about building multicloud HDP cluster.
