Spark Cluster Setup

Now that the Ngrok services are up and running, the master node is exposed to the public internet. This setup allows Spark worker nodes to join the cluster from any network, whether private or public.

This guide provides step-by-step instructions for setting up the Spark master and worker nodes using Docker or Docker Compose.

Prerequisites

Before proceeding, ensure that your system meets the following requirements:

  • Docker: Install Docker
  • Docker Compose: Install Docker Compose
  • Ngrok: Ensure Ngrok is installed and configured with an authentication token. Refer to the Ngrok Setup Guide if needed.
  • Network Access: Ensure that your system has internet access so it can reach the Ngrok tunnel. A quick verification sketch follows this list.
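
To confirm the tooling above is available, you can run the version checks below. This is only a sanity-check sketch; it assumes the docker, docker-compose, and ngrok binaries are on your PATH, and that you adjust the commands if you use the newer docker compose plugin instead of the standalone docker-compose binary.

# Check that Docker, Docker Compose, and Ngrok are installed
docker --version
docker-compose --version   # or: docker compose version
ngrok version

# Validate the Ngrok configuration and auth token (command available in Ngrok v3)
ngrok config check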

Master Node Setup

You can set up the Spark master node using either Docker or Docker Compose.

Using Docker

First, create a Docker network for Spark:

docker network create spark-network

Then, run the Spark master node:

docker run -d --name spark-master \
  --network spark-network \
  -p 8080:8080 -p 7077:7077 \
  -e SPARK_MODE=master \
  -e SPARK_MASTER_PORT=7077 \
  -e SPARK_MASTER_WEBUI_PORT=8080 \
  bitnami/spark:latest

  • Ports:
    • 8080: Spark master web UI port.
    • 7077: Spark master communication port.
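
Once the container is running, a quick way to confirm the master came up is to check its logs and the published web UI port. This is a convenience sketch rather than part of the required setup, and the exact log wording can differ between Spark and image versions.

# Confirm the container is running
docker ps --filter name=spark-master

# The master typically logs the spark:// URL it is listening on
docker logs spark-master 2>&1 | grep -i "master"

# The web UI should respond on the published port (expect HTTP 200)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080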

Using Docker Compose

Below is a docker-compose.yaml file to set up the Spark master node:

version: "3.8"

services:
  spark-master:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_PORT=7077
      - SPARK_MASTER_WEBUI_PORT=8080
    ports:
      - "8080:8080"
      - "7077:7077"
    networks:
      - spark-network

networks:
  spark-network:
    driver: bridge

To start the Spark master node using Docker Compose, run:

docker-compose up -d
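
After the stack starts, you can verify the service with the standard Compose commands below. This is only a convenience check and assumes you run it from the directory containing the docker-compose.yaml file.

# List the services in this Compose project and their state
docker-compose ps

# Follow the master's logs to confirm it started cleanly
docker-compose logs -f spark-master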

Worker Node Setup

The worker node connects to the master node via the Ngrok tunnel URL. You can set up the worker node using Docker or Docker Compose.

Using Docker

Run the Spark worker node using the following command. Replace 2.tcp.ngrok.io:14327 with your actual Ngrok TCP tunnel URL for the master node.

docker run -d --network host \
  -e SPARK_WORKER_CORES=1 \
  -e SPARK_WORKER_MEMORY=1G \
  --name spark-worker \
  bitnami/spark:latest \
  /opt/bitnami/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
  spark://2.tcp.ngrok.io:14327

  • Environment Variables:
    • SPARK_WORKER_CORES: Number of CPU cores allocated to the worker.
    • SPARK_WORKER_MEMORY: Amount of memory allocated to the worker.
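
After the worker starts, its logs should show whether it registered with the master over the Ngrok tunnel. The grep pattern below is an assumption about the usual Spark worker log message and may need adjusting for your Spark version; the Workers table in the master web UI is the authoritative check.

# Look for the worker's registration message in the logs
docker logs spark-worker 2>&1 | grep -i "registered with master"

# If nothing matches, inspect the full log for connection errors
docker logs spark-worker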

Using Docker Compose

Below is a docker-compose.yaml configuration to set up the Spark worker node. The SPARK_MASTER_URL variable is resolved from an .env file (shown below); set it to your actual Ngrok TCP tunnel URL for the master node.

version: "3.8"

services:
  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=1G
      - SPARK_MASTER_URL=${SPARK_MASTER_URL}
    networks:
      - spark-network
    command: >
      /opt/bitnami/spark/bin/spark-class org.apache.spark.deploy.worker.Worker ${SPARK_MASTER_URL}

networks:
  spark-network:
    driver: bridge

Create a .env file to define the SPARK_MASTER_URL environment variable:

SPARK_MASTER_URL=spark://2.tcp.ngrok.io:14327
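
Before starting the worker, you can ask Compose to render the fully resolved configuration and confirm that the variable substitution worked. This step is optional; it simply prints the file with ${SPARK_MASTER_URL} replaced by the value from .env.

# Render the resolved Compose file; the worker's command should show the spark:// URL
docker-compose config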

To start the Spark worker node using Docker Compose, run:

docker-compose up -d

Monitoring the Spark Cluster

Once both the master and worker nodes are running, you can monitor the cluster using the Spark master web UI, which is published at http://localhost:8080 on the host running the master container.

The Spark master web UI provides an overview of active worker nodes, running jobs, and resource usage.
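
For scripted checks, many Spark versions also serve the same cluster state as JSON from the master web UI. This is a sketch under that assumption; if your Spark version does not expose the endpoint, fall back to opening the HTML UI in a browser.

# Open the web UI in a browser: http://localhost:8080
# Or query the cluster state as JSON (assumption: the /json endpoint exists in your Spark version)
curl -s http://localhost:8080/json/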


Best Practices

  • Ngrok Tunnel: Ensure that the Ngrok tunnel remains active so the worker nodes can maintain their connection with the master node; an optional restart-policy sketch follows this list.
  • Environment Configuration: Use Docker Compose with an .env file to simplify the management of environment-specific variables.
  • Monitoring: Regularly check the Spark master web UI to monitor cluster health and ensure worker nodes are properly connected.
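
Related to the tunnel point above: if a worker container ever exits (for example after the tunnel drops and reconnection attempts fail), a Docker restart policy lets Docker bring it back up automatically. This is an optional hardening suggestion, not part of the guide's required setup.

# Apply a restart policy to the already running worker container
docker update --restart unless-stopped spark-worker

# The equivalent flag can also be passed when first creating the container:
#   docker run -d --restart unless-stopped --network host ... bitnami/spark:latest ...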

For additional support or documentation, refer to the Ngrok Setup Guide and the official Apache Spark documentation.