Forked from Hellequin Remi / tp_isia
2 commits ahead of the upstream repository.

README.md

TL;DR

./configure # environment configuration, run this only once
qsub single_gpu_training.pbs # train the network on one GPU, return the job id
qstat # watch the job status (Q for queue, R for running)
fusion-output -f <job_id> # watch the job logs during execution
# once the job is finished, output will be written in <jobname>.o<job_id>

Documentation

Fusion supercomputer documentation: https://mesocentre.pages.centralesupelec.fr/user_doc/

Transformers on GitHub: https://github.com/huggingface/transformers

The documentation for run_squad.py can be found here: https://huggingface.co/transformers/examples.html#squad

Configure environment

The configure script sets up the environment:

./configure

Note: the configure script only needs to be run once. It can take a few minutes to complete.

Train the network

IMPORTANT: the login node has no GPU, so always submit your training script with qsub. Running it directly on the login node will not use a GPU.

Run the network training

qsub <pbs_script>.pbs

Two training examples are provided:

  • single_gpu_training.pbs : train the network on a single GPU
  • dual_gpu_training.pbs : train the network on two GPUs

Notes:

  • Some temporary data is written to the --output_dir directory (./debug_squad/). You may have to remove this directory manually before relaunching the training: rm -r ./debug_squad/
  • During the lab (TP) sessions, you can use the isiaq reservation instead of gpuq by commenting/uncommenting the lines beginning with #PBS -q.
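Switching queues by hand means editing the #PBS -q lines each time. The helper below is a minimal sketch of automating that edit. It assumes that a disabled directive is written with a doubled hash (##PBS -q ...), which PBS treats as an ordinary comment; the provided scripts may use a different convention, so check them first. The name switch_queue is hypothetical.

```shell
# switch_queue FILE QUEUE
# Enable the "#PBS -q QUEUE" directive in FILE and disable every other "#PBS -q" line.
# Assumption: a disabled directive is written with a doubled hash ("##PBS -q ...").
switch_queue() {
    file=$1
    queue=$2
    # First expression: disable every queue directive.
    # Second expression: re-enable only the requested queue.
    # A backup of the original script is kept as FILE.bak.
    sed -i.bak -E \
        -e 's/^#+PBS[[:space:]]+-q[[:space:]]+/##PBS -q /' \
        -e "s/^##PBS -q ${queue}[[:space:]]*\$/#PBS -q ${queue}/" \
        "$file"
}
```

For example, `switch_queue single_gpu_training.pbs isiaq` would select the reservation, and `switch_queue single_gpu_training.pbs gpuq` would switch back. Other #PBS lines are left untouched.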

Misc notes

Squad dataset location

On Fusion, the Squad datasets are located at:

  • v1.1 : /gpfs/opt/data/squad/1.1
  • v2.0 : /gpfs/opt/data/squad/2.0
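Before submitting a job, it can be worth checking that the dataset directory is readable from your session. The sketch below assumes the shared directories contain files named like the ones fetched by the download commands in this README (train-v<version>.json and dev-v<version>.json); that naming is an assumption, and check_squad_dir is a hypothetical helper.

```shell
# check_squad_dir DIR VERSION
# Return 0 if DIR contains readable train/dev JSON files for the given version.
# Assumption: files are named train-v<VERSION>.json and dev-v<VERSION>.json.
check_squad_dir() {
    dir=$1
    version=$2
    rc=0
    for f in "train-v${version}.json" "dev-v${version}.json"; do
        if [ ! -r "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Usage: `check_squad_dir /gpfs/opt/data/squad/1.1 1.1 && echo "dataset found"`.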

Squad dataset download

The datasets are already downloaded and stored in the shared location /gpfs/opt/data/squad/.

If you want to download the squad dataset again, here are the commands.

Squad 1.1

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
wget https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py

Squad 2.0

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
wget https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
mv index.html evaluate-v2.0.py
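wget saves whatever the server returns, so a URL that points to a web page rather than a raw file produces an HTML document under the expected file name. The following is a small sanity check, offered as a sketch: looks_like_html is a hypothetical helper that only inspects the first kilobyte of the file for an HTML marker.

```shell
# looks_like_html FILE
# Succeed (exit 0) if FILE appears to start with an HTML page.
looks_like_html() {
    head -c 1024 "$1" | grep -Eqi '<!doctype html|<html'
}
```

Usage: `looks_like_html evaluate-v1.1.py && echo "downloaded a web page, not the script" >&2`.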

Basic and useful PBS commands to manage jobs

qsub <script>.pbs # submit a job, return the job id
qdel <job_id> # stop the job
qstat # get information on your ongoing jobs
qstat -f <job_id> # get full information on a job
qstat -fx <job_id> # get full information on a finished job
qstat -wn1u <username> # get more information on running jobs including node name
fusion-output -f <job_id> # watch the job logs during execution

See the full documentation on job management: https://mesocentre.pages.centralesupelec.fr/user_doc/06_jobs_management/
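The commands above can be combined to block until a job has left the queue. This is a minimal sketch, assuming that qstat <job_id> exits with a non-zero status once the job is no longer queued or running; verify this on Fusion before relying on it. The name wait_for_job and the default polling interval are illustrative choices.

```shell
# wait_for_job JOB_ID [INTERVAL_SECONDS]
# Poll qstat until it no longer reports the job, then return.
# Assumption: qstat exits non-zero for a job that has finished or is unknown.
wait_for_job() {
    job_id=$1
    interval=${2:-30}
    while qstat "$job_id" >/dev/null 2>&1; do
        sleep "$interval"
    done
}

# Example (run on the login node):
#   job_id=$(qsub single_gpu_training.pbs)
#   wait_for_job "$job_id"
#   qstat -fx "$job_id"    # full information on the finished job
```

Polling every 30 seconds keeps the load on the scheduler low; a shorter interval rarely helps for training jobs that run for minutes or hours.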