We have all been there: you have an interesting dataset you want to train your shiny new model on, but no dedicated GPU on your 2015 MacBook. Unless you regularly use graphics-intensive applications such as games or numerical processing software, it does not make sense to buy a dedicated GPU.
Luckily, there are a lot of remote GPU options available. Depending on your use-case, you can choose the one that fits your needs.
Option | Pros | Cons
--- | --- | ---
Cloud providers (GCloud, AWS, Azure) | Flexibility, you can save your data | Higher ramp-up time
Colaboratory notebook | Good documentation | Short runtimes, slow GPU, not good for long training jobs
JupyterHub | Open source, multiple-language support | No free GPU support
Kaggle Notebooks | 43 hours of free GPU computing | Data I/O to the machine is a little inconvenient
So, today we will talk about how to use a GPU on Kaggle to train a spaCy model for the Hindi language. The biggest challenge of training a model is getting clean data that accurately represents your machine learning problem. Let's do a quick search to get a list of the available datasets.
A quick search on GitHub for "Hindi tagger" yields these results.
After browsing through these datasets, you will notice that most of them are relatively small and follow incoherent tagging schemes that are incompatible with spaCy's input data format. Luckily, there is another dataset we can use here, from the CoNLL shared task.
Summary
The Hindi UD treebank is based on the Hindi Dependency Treebank (HDTB)
created at IIIT Hyderabad, India.
Introduction
The Hindi Universal Dependency Treebank was automatically converted from Hindi Dependency Treebank (HDTB) which is part of an ongoing effort of creating multi-layered treebanks for Hindi and Urdu. HDTB is developed at IIIT-H India.
Acknowledgments
The project is supported by NSF Grant (Award Number: CNS 0751202; CFDA Number: 47.070).
Any publication reporting the work done using this data should cite the following references:
Riyaz Ahmad Bhat, Rajesh Bhatt, Annahita Farudi, Prescott Klassen, Bhuvana Narasimhan, Martha Palmer, Owen Rambow, Dipti Misra Sharma, Ashwini Vaidya, Sri Ramagurumurthy Vishnu, and Fei Xia. The Hindi/Urdu Treebank Project. In the Handbook of Linguistic Annotation (edited by Nancy Ide and James Pustejovsky), Springer Press
@InCollection{bhathindi,
  Title = {The Hindi/Urdu Treebank Project},
  Author = {Bhat, Riyaz Ahmad and Bhatt, Rajesh and Farudi, Annahita and Klassen, Prescott and Narasimhan,
…
Browsing the stats.xml file gives us an overview of the different POS tags available in the dataset.
Let's open the notebook and enable the GPU for the session from the three-dots menu > Accelerator > GPU. Note that there is a TPU option as well, but TPUs can only be used with Keras and TensorFlow models. spaCy uses neither; it uses its own custom neural network library, Thinc.
Let's clone the repository using the command below in the Kaggle notebook. This will download the data from the repo into the working directory.
! git clone https://github.com/UniversalDependencies/UD_Hindi-HDTB
Let's quickly check that we have access to a GPU:
import tensorflow as tf
# Prints the name of the GPU device, e.g. '/device:GPU:0', or '' if none is found
tf.test.gpu_device_name()
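Since we will be training with spaCy, we can also ask spaCy itself to grab the GPU (a minimal sketch; spacy.prefer_gpu() returns True if a GPU was found and activated, and falls back to the CPU otherwise):
import spacy
# True if spaCy found and activated the GPU, False if it fell back to the CPU
spacy.prefer_gpu()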
spaCy expects training input data to be in the form of JSON documents, but our downloaded data is in .conllu format. So, we will use spacy convert to convert it to JSON.
! mkdir data
! spacy convert UD_Hindi-HDTB/hi_hdtb-ud-dev.conllu data
! spacy convert UD_Hindi-HDTB/hi_hdtb-ud-train.conllu data
! spacy convert UD_Hindi-HDTB/hi_hdtb-ud-test.conllu data
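If you are curious what the converted format looks like, here is a quick sanity check (assuming spacy convert keeps the file stem and writes .json files, which is its default behavior in spaCy 2.x):
import json
# Load the converted dev split: a list of documents, each with 'id' and 'paragraphs'
with open("data/hi_hdtb-ud-dev.json") as f:
    docs = json.load(f)
print(len(docs))
print(docs[0].keys())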
Now we are all set up to start training the model:
! spacy train hi model_dir data/hi_hdtb-ud-train.json data/hi_hdtb-ud-dev.json -g 0
Don't forget to pass the argument -g 0 to enable GPU usage for training. The trained model will be saved in the model_dir directory. Training runs about 6x faster on the GPU than on my local machine. There are probably ways to make it run faster still, as the job on the Kaggle notebook was CPU-constrained. In any case, the whole job finished in about half an hour on the Kaggle notebook.
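Since we also converted the test split, we can score the trained model on held-out data with spaCy's evaluate command (a minimal sketch; -g 0 again selects the GPU):
! spacy evaluate model_dir/model-best data/hi_hdtb-ud-test.json -g 0
This prints the tagging accuracy and the dependency scores (UAS/LAS) for the model.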
Let's load the model and run some inference:
from spacy.lang.hi import Hindi
from spacy.gold import docs_to_json
# Build a blank Hindi pipeline with the components we trained
nlp_hi = Hindi()
nlp_hi.add_pipe(nlp_hi.create_pipe('tagger'))
nlp_hi.add_pipe(nlp_hi.create_pipe('parser'))
nlp_hi.add_pipe(nlp_hi.create_pipe('ner'))
# Load the trained weights from disk
nlp_hi = nlp_hi.from_disk("model_dir/model-best/")
sentence = "मैं खाना खा रहा हूँ।"  # "I am eating food."
doc = nlp_hi(sentence)
print(docs_to_json([doc]))
# ...
# {'id': 0, 'orth': 'मैं', 'tag': 'PRP', 'head': 2, 'dep': 'nsubj', 'ner': 'O'}
# ...
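If you just want the predicted tags rather than the full JSON dump, you can also iterate over the Doc directly:
# Print each token with its predicted fine-grained tag and dependency label
for token in doc:
    print(token.text, token.tag_, token.dep_)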
After the training finishes, let's gzip the model and download it locally from the file-viewer pane on the right in the Kaggle notebook.
! tar -cvzf model.tgz model_dir/model-best
Hurray!
Here is the Kaggle notebook link, if you want to play around:
https://www.kaggle.com/rahul1990gupta/training-a-spacy-hindi-model?scriptVersionId=41283884