Your First Job In The Cloud

Chapter 1. The First Job In The Cloud

Intro

On a cloud I saw a child,
And he laughing said to me: “...
— William Blake

Nowadays one would have to live in a cave on an uninhabited island lost in the Arctic Ocean to have never heard of “Artificial Intelligence,” “Machine Learning,” “NLP” and the rest of the buzzword family. Having a master’s in Data Science, I feel a bit less excited about tomorrow’s AI revolution. That does not mean DS is boring or unwarranted; rather, it requires a lot of effort, and I really like the feeling of doing stuff on the bleeding edge.

As a relatively new industry, ML has not established its processes yet. I have heard the opposite about Google and Facebook, but in small businesses we are still considered nerds, the role developers used to play twenty years ago. It’s great to see more and more people getting into ML, either excited by Google’s slides at a recent conference or just curious whether neural nets can indeed distinguish between cats and dogs in a photo.

Big corps prepare and share (thank God it’s the 21st century) huge datasets, trained models and everything else a junior data scientist might use to play in the sandbox. Once we have made sure that models trained on Google or Facebook data somehow work and might even predict things (in some cases, under some very eccentric circumstances, but it’s still so thrilling), we usually want to train a model of our own. It takes hours on our own laptop, even though the dataset is limited to the tweets of our forty-two friends over the last year. The results usually look promising but unsatisfactory. There is no way the laptop could process the whole tweet feed for the last decade without exploding the SSD and blowing up.

That is when we get to the magic words: cloud computing. Or whatever you call it. Let Google’s servers explode instead of our lovely laptops, right? Right. Our next job will be in the cloud. Pun intended.

There are not many resources explaining how to stop procrastinating, staring at the laptop monitor while the model is being built, and start reaping the benefits of living in 2018 AD. There are Google ML, Amazon SageMaker and Azure Machine Learning Studio, but the documentation everywhere was written by developers for gray-bearded geeks. The threshold for executing the very first job in the cloud is enormous, and this writing is supposed to bridge that gap.

This is not rocket science and there is nothing really complex here: just a few steps to take and several things to keep in mind. It is a breathtaking journey, and once it is done, the subsequent trips will seem a cakewalk. Let’s go.

Everything below is written for Google ML Engine, but it can be applied to any cloud computing system almost as is. I will try not to go too deep into details, concentrating more on the whats than on the hows.

Before We Start

First of all, I want to reference the article that helped me a lot in moving my job into the cloud. The TensorFlow beginner guide on Fuyang Liu's Blog is almost perfect, except that it does not cover the pitfalls and does not suggest shortcuts where they would have made sense.

Google also has documentation on ML Engine; I wish I were smart enough to use it as a guide. We still need it, though, to quickly look up this and that.

First we need to set up our cloud environment. I refer to the Google guide here because things tend to change over time and I hope they will keep this info up to date.

After the account is enabled for ML, we should set up our local environment. I strongly advise using Linux; macOS is more or less robust; Windows will make you cry. Since we are about to run jobs in the cloud, I assume you have Python installed and configured. What we need to install is the Google Cloud SDK. It’s pretty straightforward: download it from the page linked and install.

Now we need to set up our credentials. gcloud init should do.

Let’s check it works as expected:

$ gcloud ml-engine models list
Listed 0 items.

Wow. We are all set.

Our First Job

This is the important part. Don’t try to upload and run your latest fancy project. It’ll fail and you’ll get frustrated. Let’s enter the cold water slowly. Let’s get your first job to complete successfully and show a fascinating green icon when you check your job’s status.

The cloud expects a Python package to be uploaded and the main module to execute to be specified. So let’s go with a pretty simple Python package. Let’s assume the module is named test1.py and resides in a directory named test1.

# coding: utf-8
import argparse
import logging

if __name__ == "__main__":
  # The root logger drops INFO records by default; lower the threshold.
  logging.getLogger().setLevel(logging.INFO)

  parser = argparse.ArgumentParser()

  parser.add_argument(
    '--job-dir',
    help='GCS job directory (required by GoogleML)',
    required=True
  )
  parser.add_argument(
    '--arg',
    help='Test argument',
    required=True
  )
  # argparse stores '--job-dir' under the key 'job_dir'
  arguments = vars(parser.parse_args())
  job_dir = arguments.pop('job_dir')
  arg = arguments.pop('arg')

  logging.info("Hey, ML Engine, you are not scary!")
  logging.warning("Argument received: {}.".format(arg))

We use logging because, unlike plain stdout output, log records are available through the web interface.
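One detail is worth checking locally before submitting: Python's root logger only lets WARNING and above through by default, so an info-level message is silently dropped unless the level is lowered. Here is a minimal sketch using only the standard library; the logger name `demo` and the in-memory stream are illustrative choices of mine, not anything ML Engine requires.

```python
import io
import logging


def emit(level):
    """Log one INFO and one WARNING record; return the lines written."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))

    logger = logging.getLogger("demo")  # hypothetical logger name
    logger.handlers = [handler]         # replace handlers so repeated calls don't duplicate output
    logger.propagate = False            # keep the records away from the root logger
    logger.setLevel(level)

    logger.info("Hey, ML Engine, you are not scary!")
    logger.warning("Argument received: 42.")
    return stream.getvalue().splitlines()


if __name__ == "__main__":
    print(emit(logging.WARNING))  # the INFO record is filtered out
    print(emit(logging.INFO))     # both records get through
```

If an info message from your job never shows up in the log viewer, the logger level is the first thing I would look at.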

You’ll also need a cloud configuration file on your local machine. It can be placed anywhere; I prefer to have one config file per project. Put test1.yaml in the same directory:

trainingInput:
  scaleTier: CUSTOM
  # 1 GPU
  masterType: standard_gpu
  # 4 GPUs
  # complex_model_m_gpu
  runtimeVersion: "1.9"
  pythonVersion: "3.5"

I am not sure who made that decision, but the default Python version for ML Engine is 2.7, which is why the last two lines are mandatory.

You will also need to create a setup.py file containing the description of our project. It will be processed by the Google Cloud SDK.

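from setuptools import find_packages, setup

setup(
    name='test1',
    version='0.1',
    description='My First Job',
    # Without this the test1 package is left out of the uploaded archive.
    # find_packages() only picks up directories containing an __init__.py.
    packages=find_packages()
)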
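Before submitting anything, it may be worth checking that the files are where the packaging step will look for them. The sketch below is a small pre-flight check under two assumptions of mine: setup.py sits next to the test1 directory (one level above test1.py), and test1 contains an `__init__.py`, the usual marker of a regular Python package. Neither `missing_files` nor `EXPECTED_FILES` comes from any Google tool; they are hypothetical helpers.

```python
import os

# Paths relative to the project root. "test1/__init__.py" is an assumption:
# it is the usual marker that makes the test1 directory a regular package.
EXPECTED_FILES = (
    "setup.py",
    os.path.join("test1", "__init__.py"),
    os.path.join("test1", "test1.py"),
)


def missing_files(root, expected=EXPECTED_FILES):
    """Return the expected files that do not exist under root."""
    return [
        path for path in expected
        if not os.path.isfile(os.path.join(root, path))
    ]


if __name__ == "__main__":
    # Run from the project root; prints nothing when the layout is complete.
    for path in missing_files("."):
        print("missing: {}".format(path))
```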

Well, that is it. Let’s try (this file, test1.sh, should be on the same level as the package folder):

#!/bin/bash

export BUCKET_NAME=foo-bar-baz-your-bucket-name
export REGION=us-east1
export JOB_NAME="test1_$(date +%Y%m%d_%H%M%S)"
export JOB_DIR="gs://$BUCKET_NAME/$JOB_NAME"

gcloud ml-engine jobs submit training "$JOB_NAME" \
    --staging-bucket "gs://$BUCKET_NAME" \
    --job-dir "$JOB_DIR" \
    --region "$REGION" \
    --runtime-version 1.9 \
    \
    --module-name test1.test1 \
    --package-path ./test1 \
    --config=test1/test1.yaml \
    -- \
    --arg=42

NB! You have to specify your own bucket name, and you might need to change the region as well.
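The timestamp suffix in JOB_NAME is there because a job name cannot be reused within a project. As far as I know, a name must also start with a letter and contain only letters, digits and underscores, so treat the regular expression below as my assumption and check it against the documentation. If you ever drive submissions from Python rather than bash, the same naming scheme might be sketched like this (`make_job_name` and `make_job_dir` are hypothetical helpers, not part of any Google library):

```python
import re
from datetime import datetime

# Assumed naming rule: starts with a letter; letters, digits, underscores only.
_VALID_JOB_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def make_job_name(prefix, now=None):
    """Mirror the shell expression prefix_$(date +%Y%m%d_%H%M%S)."""
    now = now or datetime.now()
    name = "{}_{}".format(prefix, now.strftime("%Y%m%d_%H%M%S"))
    if not _VALID_JOB_NAME.match(name):
        raise ValueError("invalid job name: {!r}".format(name))
    return name


def make_job_dir(bucket, job_name):
    """Mirror JOB_DIR=gs://$BUCKET_NAME/$JOB_NAME."""
    return "gs://{}/{}".format(bucket, job_name)


if __name__ == "__main__":
    job_name = make_job_name("test1")
    print(job_name)
    print(make_job_dir("foo-bar-baz-your-bucket-name", job_name))
```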

I strongly advise creating a shell script to run (schedule/queue) a job from the very beginning. It’s much easier to deal with when it comes to modifications.

There are three ‘subsections’ of arguments there. The first four are general settings that remain unchanged from job to job. The second group contains job-specific settings. The third one (after --) contains the arguments that will be passed to the main module of your package, the one with the __main__ block.
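To make the role of the bare `--` concrete, here is a rough Python sketch that composes the same invocation as an argument list and then cuts off the part meant for the module. It uses only the flags that appear in test1.sh; `build_submit_command` and `split_user_args` are hypothetical helpers of mine, and nothing here talks to Google.

```python
def build_submit_command(job_name, bucket, region, user_args):
    """Compose the gcloud invocation from test1.sh as an argument list."""
    job_dir = "gs://{}/{}".format(bucket, job_name)
    return [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        # first group: the four options listed first in test1.sh
        "--staging-bucket", "gs://{}".format(bucket),
        "--job-dir", job_dir,
        "--region", region,
        "--runtime-version", "1.9",
        # second group: what to run and how
        "--module-name", "test1.test1",
        "--package-path", "./test1",
        "--config=test1/test1.yaml",
        # third group: everything after the bare "--" is for the module
        "--",
    ] + list(user_args)


def split_user_args(command):
    """Return the part of the command that follows the bare "--"."""
    return command[command.index("--") + 1:]


if __name__ == "__main__":
    cmd = build_submit_command(
        "test1_20180818_085812",
        "foo-bar-baz-your-bucket-name",
        "us-east1",
        ["--arg=42"],
    )
    print(split_user_args(cmd))
```

Passing such a list to `subprocess.run(cmd, check=True)` would submit the job, assuming gcloud is installed and initialised on the machine.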

Go try it:

./test1.sh

Job [test1_20180818_085812] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ml-engine jobs describe test1_20180818_085812

or continue streaming the logs with the command

  $ gcloud ml-engine jobs stream-logs test1_20180818_085812
jobId: test1_20180818_085812
state: QUEUED
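The tail of that output is a list of `key: value` lines, so a script can tell whether the job has finished by reading the `state` field. This is a rough sketch: the parser is deliberately naive and only looks at flat, unindented keys, and the set of terminal states is my reading of the ML Engine job states, so double-check it against the documentation. `parse_top_level` and `is_finished` are hypothetical helpers.

```python
# Assumed set of states after which a job no longer changes.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def parse_top_level(text):
    """Collect unindented "key: value" lines into a dict."""
    fields = {}
    for line in text.splitlines():
        # Skip blank lines, indented (nested) lines and lines without a colon.
        if not line or line[0].isspace() or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def is_finished(text):
    """True when the reported state is one of the assumed terminal states."""
    return parse_top_level(text).get("state") in TERMINAL_STATES


if __name__ == "__main__":
    sample = "jobId: test1_20180818_085812\nstate: QUEUED\n"
    print(parse_top_level(sample))
    print(is_finished(sample))
```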

Now you might execute gcloud ml-engine jobs describe ... as suggested. It’ll spit out another portion of text. Copy the last link and paste it into your browser’s address bar. You should see...

What you should see there I will describe in the next chapter. Happy clouding!