Your First Job In The Cloud /2
Yulia Zvyagelskaya
Posted on August 19, 2018
Chapter 1. The First Job In The Cloud
Chapter 2. Getting The Job Done
Hear the voice of the Bard!
— William Blake
Setting Up Packages
At the end of Chapter 1 we were able to run a job in the cloud. It completed successfully (if not, please blame Google, not me,) and we saw that fascinating green light icon next to it. Let's now try a real job‽
Not yet. We are not mature enough to enter a cage with lions. Let’s do it step by step.
I assume you have the code that supposedly trains a model on hand. Unlike other tutorials, this one won't provide model-training code for you: we are talking about deployment to ML Engine here. Sorry for that.
This code probably has a bundle of imports. It probably requires TensorFlow, Keras, and some other packages that are not included in the Python standard library. If all you imported was TensorFlow and the whole rest was written by you, you barely need this tutorial: such a brave person should wade through everything on their own.
So, yeah. Packages. Copy all the import lines from all your files to the top of the job scaffold we created in Chapter 1 (the "Our First Job" section; I could not figure out how to link to a subheading inside a post here.)
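For illustration, the top of the scaffold might end up looking like the sketch below; the exact list depends on your code, and the package names here are just the ones that show up in the install_requires example later in this chapter.

# test1/test1.py: every third-party import gathered at the top,
# so a missing package fails fast, before any real work starts
import nltk
import sklearn
import tensorflow as tf
from annoy import AnnoyIndex
from tensorflow import keras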
Submit a job to the cloud. Go check the logs in approximately eight minutes to see that the task has failed. If it has not, I envy you: you seem to use only the standard packages included in the ML Engine setup by default. I was not so lucky. Google includes a very limited set of packages (probably to reduce the Docker launch time.)
So, we need to declare the additional packages we need explicitly. Unfortunately, AFAICT, there is no way to retrieve a diff between what ML Engine provides out of the box and what we need, so we are obliged to add them one by one: submit a job, check the logs, cry, yell at the sky, repeat. To add packages, open the setup.py file we created in Chapter 1 and add the following lines.
from setuptools import find_packages
from setuptools import setup

setup(
    name='test1',
    version='0.1',
+   install_requires=['scikit-learn>=0.18', 'annoy>=1.12', 'nltk>=3.2'],
+   packages=find_packages(),
+   include_package_data=True,
    description='My First Job'
)
Keep adding entries in install_requires until ML Engine is satisfied and the job goes back to completing successfully. It might happen that some package you need was designed for a Python newer than 3.5 (in my case it was a package that used the fancy new string formatting f'Hi, {name}' introduced in 3.6.) I did not find a better solution than to download the package locally, backport it to 3.5, and re-package it myself.
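The backport itself is mostly mechanical, since every f-string has a str.format equivalent that the 3.5 runtime accepts; a trivial illustration:

name = 'ML Engine'
greeting = f'Hi, {name}'          # Python 3.6+ only, SyntaxError on 3.5
greeting = 'Hi, {}'.format(name)  # the 3.5-compatible equivalent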
Build the package and put it both into your bucket (I have created a subfolder packages there for that purpose) and into the same folder as the setup.py script. Now we have to tell ML Engine to use our version of it. Update your shell script with:
--module-name test1.test1 \
--package-path ./test1 \
+ --packages my_package-0.1.2.tar.gz \
--config=test1/test1.yaml \
The same should be done with every package of your own that you need to include into the distribution to run the job. The install_requires parameter in the call to setup in setup.py has to be updated accordingly. Also, update your cloud config with the packageUris parameter:
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu
  pythonVersion: "3.5"
  runtimeVersion: "1.9"
  packageUris:
    - 'gs://foo-bar-baz-your-bucket-name/packages/my_package-0.1.2.tar.gz'
Submit a job and check that the job is now green.
Doing Their Job
Not all third-party packages are ready to be used in the cloud. It's the very same machine as our own laptop, only virtual, so everything should be the same, right? Well, yes and no. Everything is the same, but this VM is shut down as soon as the job finishes. Meaning that if we need some results besides the logs (like a trained model file, you know,) we have to store them somewhere outside of the container where the job was run; otherwise they'll die together with the container.
The standard Python open(file_name, mode) does not work with buckets (gs://...../file_name). One needs to from tensorflow.python.lib.io import file_io and change all calls to open(file_name, mode) to file_io.FileIO(file_name, mode=mode) (note the named mode parameter.) The interface of the opened handle is the same.
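For example, pickling an object straight into the bucket might look like the snippet below (the bucket path reuses the placeholder name from the config above):

import pickle
from tensorflow.python.lib.io import file_io

params = {'epochs': 10, 'batch_size': 128}  # anything picklable

# FileIO accepts both local paths and gs:// URIs
# and behaves like a regular opened file
with file_io.FileIO('gs://foo-bar-baz-your-bucket-name/params.pkl', mode='wb') as f:
    pickle.dump(params, f)

with file_io.FileIO('gs://foo-bar-baz-your-bucket-name/params.pkl', mode='rb') as f:
    params = pickle.load(f)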
Some third-party packages have an explicit save method that accepts a file name rather than a FileIO handle, and then the only possibility is to save locally and copy the result to the bucket afterwards. I did not manage to use the Google Cloud SDK for that, since it is absent in the containers, so here is a tiny snippet copying the file from the container to the bucket.
The code below is quite inefficient, because it loads the whole model at once and then dumps it to the bucket, but it worked for me for relatively small models:
import os
from tensorflow.python.lib.io import file_io

model.save(file_path)  # save locally, inside the container
with file_io.FileIO(file_path, mode='rb') as i_f:
    with file_io.FileIO(os.path.join(model_dir, file_path), mode='wb+') as o_f:
        o_f.write(i_f.read())  # copy the whole file in one go
The mode must be set to binary for both reading and writing. When the file is relatively big, it makes sense to read and write it in chunks to decrease memory consumption; proper chunking of binary IO reads is a bit out of the scope of this tutorial, but a sketch follows below. I recommend wrapping this into a function and calling that function as needed.
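Here is a minimal sketch of such a helper, with a function name and a chunk size of my own choosing:

import os
from tensorflow.python.lib.io import file_io

def copy_to_bucket(file_path, model_dir, chunk_size=1024 * 1024):
    """Copy a local file into the bucket in 1 MiB chunks."""
    with file_io.FileIO(file_path, mode='rb') as i_f:
        with file_io.FileIO(os.path.join(model_dir, file_path), mode='wb+') as o_f:
            while True:
                chunk = i_f.read(chunk_size)
                if not chunk:  # empty result means end of file
                    break
                o_f.write(chunk)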
Now that we have all the needed packages installed and a function to store files in the bucket on hand, there is nothing preventing us from trying our model in the cloud on a whole load of data. Right?
Nope. I strongly suggest splitting the job into smaller steps, like "cleaning up," "splitting into training and testing sets," "preprocessing," "training," etc., and dumping all the intermediate results to the bucket. Since you needed the training process to run in the cloud in the first place, it is presumably very time-consuming, so having the intermediate results on hand might save you a lot of time in the future. For instance, adjusting the parameters given to the training step does not require all the preparation steps to be re-run: we might just start with step N, using the previously dumped output of step N-1 as the input.
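To give the idea some shape, here is a sketch of such a resumable pipeline; the helper name run_step and the bucket layout are my own invention, not the code from the next chapter:

import pickle
from tensorflow.python.lib.io import file_io

STEPS_DIR = 'gs://foo-bar-baz-your-bucket-name/steps'  # placeholder bucket

def run_step(name, step_fn):
    """Run step_fn only if its result was not dumped by a previous run."""
    path = '{}/{}.pkl'.format(STEPS_DIR, name)
    if file_io.file_exists(path):  # intermediate result already in the bucket
        with file_io.FileIO(path, mode='rb') as f:
            return pickle.load(f)
    result = step_fn()
    with file_io.FileIO(path, mode='wb') as f:
        pickle.dump(result, f)
    return result

# cleaned = run_step('cleanup', lambda: clean_up(raw_data))
# sets = run_step('split', lambda: train_test_split(cleaned))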
How to do it, I will show in the next chapter. Happy clouding!