Scaling down Google Kubernetes cluster to zero during off-peak hours

Ajinkya Bhabal
5 min read · Apr 3, 2023

Introduction

This step-by-step guide shows how to scale GKE clusters down to zero during off-hours to reduce cost. It is aimed at developers and Ops teams in enterprises who want to scale GKE clusters up before the load arrives and scale them down again at night and on weekends to save money.

This approach is recommended for development or testing environments only.

We will use the following Google Cloud resources to achieve this:

  1. Cloud Scheduler
  2. Cloud Run
  3. Artifact Registry

In this architecture, we use Cloud Run jobs, which run their tasks and exit when finished, and Cloud Scheduler, which triggers those jobs at set times of day.

There are two jobs: the first runs at 9 AM to scale up the GKE cluster, and the second runs at 8 PM to scale the cluster down to zero. Both run Monday to Friday only, and the cron schedules are written accordingly.

Let’s start with the implementation in Google Cloud.

First, enable the required services (Cloud Run, Cloud Scheduler) and create a service account.
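The step above can be sketched with gcloud roughly as follows. This is a minimal sketch under assumptions: the project ID is the one that appears in the image path later in this article, the service account name `gke-scaler` is a hypothetical placeholder, and the Artifact Registry and GKE APIs are added to the list because later steps rely on them.

```shell
# Assumptions: PROJECT_ID comes from the image path used later in this
# article; SA_NAME is a hypothetical name chosen for illustration.
PROJECT_ID="flawless-helper-376817"
SA_NAME="gke-scaler"

enable_services_and_create_sa() {
  # Enable the APIs used in this guide.
  gcloud services enable \
    run.googleapis.com \
    cloudscheduler.googleapis.com \
    artifactregistry.googleapis.com \
    container.googleapis.com \
    --project="$PROJECT_ID"

  # Create the service account that the Cloud Run jobs will run as.
  gcloud iam service-accounts create "$SA_NAME" \
    --display-name="GKE scaler" \
    --project="$PROJECT_ID"
}
```

Calling `enable_services_and_create_sa` from an authenticated shell runs both commands.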

Set up permissions on the service account so that it can invoke the Cloud Run jobs, pull the latest image from Artifact Registry, and resize the Kubernetes cluster.
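One way to grant those permissions is with project-level IAM bindings. The article does not name the exact roles, so the three roles below are an assumption (Cloud Run Invoker, Artifact Registry Reader, Kubernetes Engine Cluster Admin), and the service account email uses the hypothetical name `gke-scaler`. Narrower roles may be enough in your project.

```shell
# Assumptions: the role list is a guess at what the three permissions need;
# the service account name "gke-scaler" is hypothetical.
PROJECT_ID="flawless-helper-376817"
SA_EMAIL="gke-scaler@${PROJECT_ID}.iam.gserviceaccount.com"

grant_scaler_roles() {
  for ROLE in roles/run.invoker roles/artifactregistry.reader roles/container.clusterAdmin
  do
    # Bind one role at a time to the service account at project level.
    gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member="serviceAccount:${SA_EMAIL}" \
      --role="$ROLE"
  done
}
```

Calling `grant_scaler_roles` adds the three bindings one after another.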

Create an Artifact Registry repository to store the scale-down and scale-up Docker images.

  • Repository for the GKE scale-down Docker image
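The repository can also be created from the command line. A minimal sketch, assuming the repository name `gke-sc-d` and region `asia-south1` that appear in the image path used later in this article:

```shell
# Assumptions: REPO_NAME and REGION are read off the image path
# asia-south1-docker.pkg.dev/flawless-helper-376817/gke-sc-d/... used below.
REPO_NAME="gke-sc-d"
REGION="asia-south1"

create_scaler_repo() {
  # Create a Docker-format repository in Artifact Registry.
  gcloud artifacts repositories create "$REPO_NAME" \
    --repository-format=docker \
    --location="$REGION" \
    --description="GKE scale-down image"
}
```

Calling `create_scaler_repo` creates the repository; a second repository for the scale-up image would follow the same pattern with a different name.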

The commands for pushing Docker images to Google Artifact Registry are listed further below, after the Dockerfile.

Here’s the bash script used for scaling down the GKE cluster.

The following bash script finds all GKE clusters whose names start with the prefix ‘test’ and scales every node pool in those clusters down to zero.

Note: When autoscaling is turned on, the script will not scale the cluster to zero. After validating that your scaling works, make sure to turn autoscaling off.

#!/bin/bash

# Cloud Run sets these variables for each task; default to 0 for local runs.
CLOUD_RUN_TASK_INDEX=${CLOUD_RUN_TASK_INDEX:=0}
CLOUD_RUN_TASK_ATTEMPT=${CLOUD_RUN_TASK_ATTEMPT:=0}
ZONE=asia-south1-a

echo "Starting Task #${CLOUD_RUN_TASK_INDEX}, Attempt #${CLOUD_RUN_TASK_ATTEMPT}..."

echo "Scaling down test gke cluster instances"

# Track failures across all resize calls, not only the last one.
retVal=0

# The regex "^test" matches cluster names that start with "test".
for CLUSTER_NAME in $(gcloud container clusters list --format="value(name)" --filter="name~^test")
do
  for NP_NAME in $(gcloud container node-pools list --cluster="$CLUSTER_NAME" --format="value(name)" --zone="$ZONE")
  do
    gcloud container clusters resize "$CLUSTER_NAME" --node-pool "$NP_NAME" --num-nodes 0 --zone="$ZONE" --quiet || retVal=1
  done
done

if [[ $retVal -eq 0 ]]
then
  echo "Completed Task #${CLOUD_RUN_TASK_INDEX}."
else
  echo "Task #${CLOUD_RUN_TASK_INDEX}, Attempt #${CLOUD_RUN_TASK_ATTEMPT} failed."
  # Exit non-zero so that Cloud Run marks the task as failed.
  exit 1
fi
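The note before the script asks for autoscaling to be turned off on the node pools. A minimal sketch of that step with gcloud is shown here; the cluster and node pool names are passed in as arguments, and the zone is the one used by the script.

```shell
ZONE="asia-south1-a"

# Usage: disable_autoscaling CLUSTER_NAME NODE_POOL_NAME
disable_autoscaling() {
  # Turn off the cluster autoscaler for a single node pool.
  gcloud container clusters update "$1" \
    --no-enable-autoscaling \
    --node-pool="$2" \
    --zone="$ZONE"
}
```

For example, `disable_autoscaling test-cluster default-pool` (both names are hypothetical) turns autoscaling off for one pool; repeat it for each pool the script should scale to zero.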

Next, we will create a Dockerfile to run the above bash script.

FROM google/cloud-sdk:latest

# Execute following commands in the folder /gke_scripts
WORKDIR /gke_scripts

# Copy over the script to the /gke_scripts folder
COPY scale_down.sh .

# Just in case the script doesn't have the executable permissions set
RUN chmod +x ./scale_down.sh

# Run the script when starting the container
CMD [ "./scale_down.sh" ]

We will run the following commands to push the Docker image to the Artifact Registry repository.

gcloud auth configure-docker asia-south1-docker.pkg.dev

docker build -t scdown .

docker tag scdown asia-south1-docker.pkg.dev/flawless-helper-376817/gke-sc-d/v0.0.1

docker push asia-south1-docker.pkg.dev/flawless-helper-376817/gke-sc-d/v0.0.1
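The morning scale-up job mentioned in the introduction follows the same pattern with its own script and image. The article does not show that script, so the following is only a sketch of what a counterpart might look like; in particular, `NODE_COUNT=3` is a hypothetical daytime size, and the filter assumes the same ‘test’ name prefix.

```shell
ZONE="asia-south1-a"
NODE_COUNT=3   # hypothetical: the daytime node count is not given in the article

scale_up_test_clusters() {
  status=0
  # The regex "^test" matches cluster names that start with "test".
  for CLUSTER_NAME in $(gcloud container clusters list --format="value(name)" --filter="name~^test")
  do
    for NP_NAME in $(gcloud container node-pools list --cluster="$CLUSTER_NAME" --format="value(name)" --zone="$ZONE")
    do
      # Remember any failure so the caller can exit non-zero.
      gcloud container clusters resize "$CLUSTER_NAME" --node-pool "$NP_NAME" \
        --num-nodes "$NODE_COUNT" --zone="$ZONE" --quiet || status=1
    done
  done
  return "$status"
}
```

A scale_up.sh built around this function would end with `scale_up_test_clusters || exit 1`, and it could be packaged with the same Dockerfile by changing the copied script name.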

Now we will create a Cloud Run job from the above container image.

We will create the job in the asia-south1 (Mumbai) region.

In the Security section, we have to provide the name of the previously created service account.
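The same job can be created from the command line instead of the console. A minimal sketch, assuming the job name `gke-scale-down-v1` from the trigger URL used in the scheduler step and the image path pushed above; the service account email uses the hypothetical name `gke-scaler`.

```shell
REGION="asia-south1"
JOB_NAME="gke-scale-down-v1"   # taken from the trigger URL in the scheduler step
IMAGE="asia-south1-docker.pkg.dev/flawless-helper-376817/gke-sc-d/v0.0.1"
# Hypothetical service account name; use the one you created earlier.
SA_EMAIL="gke-scaler@flawless-helper-376817.iam.gserviceaccount.com"

create_scale_down_job() {
  # Create the Cloud Run job that runs the scale-down container.
  gcloud run jobs create "$JOB_NAME" \
    --image="$IMAGE" \
    --region="$REGION" \
    --service-account="$SA_EMAIL"
}
```

Calling `create_scale_down_job` registers the job; it does not run until it is executed manually or by the scheduler.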

Finally, we set up Cloud Scheduler to execute the job on Cloud Run.

We set the following trigger URL in Cloud Scheduler to invoke the Cloud Run job.

https://asia-south1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/flawless-helper-376817/jobs/gke-scale-down-v1:run

For authentication, OAuth is configured with the previously created service account.
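The scheduler settings described above could be expressed with gcloud roughly as follows. This is a sketch under assumptions: the scheduler job name `gke-scale-down-schedule` and the service account name are hypothetical, and the time zone `Asia/Kolkata` is assumed from the Mumbai region. The cron expression `0 20 * * 1-5` means 8 PM, Monday to Friday.

```shell
REGION="asia-south1"
SCHEDULER_JOB="gke-scale-down-schedule"   # hypothetical name
RUN_URI="https://asia-south1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/flawless-helper-376817/jobs/gke-scale-down-v1:run"
# Hypothetical service account name; use the one you created earlier.
SA_EMAIL="gke-scaler@flawless-helper-376817.iam.gserviceaccount.com"

create_scale_down_schedule() {
  # POST to the Cloud Run job's :run endpoint at 8 PM on weekdays.
  gcloud scheduler jobs create http "$SCHEDULER_JOB" \
    --location="$REGION" \
    --schedule="0 20 * * 1-5" \
    --time-zone="Asia/Kolkata" \
    --uri="$RUN_URI" \
    --http-method=POST \
    --oauth-service-account-email="$SA_EMAIL"
}
```

The morning scale-up trigger would use the same command with its own job URL and the schedule `0 9 * * 1-5`.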

After the scheduler triggers the Cloud Run job, we get the following result.

In the logs of the Cloud Run job, we can see that the cluster has been resized successfully.

The GKE cluster has been scaled to zero nodes.
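To check the result without waiting for the schedule, the job can also be run by hand and the cluster list inspected afterwards. A minimal sketch, assuming the job name from the trigger URL and the same ‘test’ name prefix:

```shell
REGION="asia-south1"
JOB_NAME="gke-scale-down-v1"   # taken from the trigger URL

run_and_verify() {
  # Execute the job once and wait for it to finish.
  gcloud run jobs execute "$JOB_NAME" --region="$REGION" --wait || return 1
  # List the matching clusters; the default output includes a node count column.
  gcloud container clusters list --filter="name~^test"
}
```

After `run_and_verify` completes, the listed test clusters should report zero nodes once the resize operations have finished.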

Conclusion:

Here we have learned that this simple implementation can reduce the cost of a Google Kubernetes Engine cluster in a dev/test environment.

If you’d like to read more about IaC tools or Kubernetes topics, just let me know in the comments. Thanks for reading!
