## TL;DR

```shell
./configure                    # environment configuration, run this only once
qsub single_gpu_training.pbs   # train the network on one GPU, returns the job id
qstat                          # watch the job status (Q for queued, R for running)
fusion-output -f <job_id>      # watch the job logs during execution
# once the job has finished, the output is written to <jobname>.o<job_id>
```
## Documentation

- Fusion supercomputer documentation: https://mesocentre.pages.centralesupelec.fr/user_doc/
- Transformers on GitHub: https://github.com/huggingface/transformers
- The documentation for `run_squad.py` can be found here: https://huggingface.co/transformers/examples.html#squad
## Configure environment

The `configure` script sets up the environment:

- moves the `.conda` default directory to `$WORKDIR`, to avoid space problems on the `$HOME` mount point
- loads the anaconda executables (version `anaconda3/5.3.1`, see https://mesocentre.pages.centralesupelec.fr/user_doc/04_module_command/)
- creates an anaconda environment with the required dependencies for `transformers`
- downloads and installs `transformers`

```shell
./configure
```

Note: the `configure` script has to be executed only once. It can take a few minutes to run.
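The steps above can be sketched as plain shell commands. This is only an illustration of what such a script might contain, not the actual contents of `configure`; the environment name `transformers` and the Python version are assumptions.

```shell
#!/bin/bash
# Sketch of the setup steps described above -- the real configure script
# may differ; the environment name "transformers" is an assumption.

# 1. Move the conda directory to $WORKDIR and symlink it back,
#    so package caches do not fill the $HOME quota
mv ~/.conda "$WORKDIR/.conda"
ln -s "$WORKDIR/.conda" ~/.conda

# 2. Load the anaconda module (version from the documentation above)
module load anaconda3/5.3.1

# 3. Create a conda environment with the required dependencies
conda create -y -n transformers python=3.7

# 4. Install transformers inside it
source activate transformers
pip install transformers
```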
## Train the network

IMPORTANT: the login node has no GPU; running your script without `qsub` will not be effective.

Run the network training:

```shell
qsub <pbs_script>.pbs
```

Two training examples are provided:

- `single_gpu_training.pbs`: train the network on a single GPU
- `dual_gpu_training.pbs`: train the network on two GPUs

Notes:

- Some temporary data is written to the `--output_dir` directory (`./debug_squad/`). You may have to clean this directory manually before relaunching the training: `rm -r ./debug_squad/`
- During the TP sessions, you can use the `isiaq` reservation instead of the `gpuq` by commenting/uncommenting the lines beginning with `#PBS -q`.
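The queue switch mentioned in the notes lives in the header of the job script. The following is a hypothetical PBS header, not the contents of the provided scripts; the job name, resource requests, and environment name are all illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical PBS header -- resource values are illustrative,
# not copied from the provided training scripts.
#PBS -N squad_training
#PBS -l walltime=01:00:00
#PBS -l select=1:ncpus=4:ngpus=1
#PBS -q gpuq          # default GPU queue
##PBS -q isiaq        # uncomment (and comment the line above) during TP sessions

cd "$PBS_O_WORKDIR"
module load anaconda3/5.3.1
source activate transformers   # environment name is an assumption
```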
## Misc notes

### Squad dataset location

On Fusion, the SQuAD datasets are located at:

- v1.1: `/gpfs/opt/data/squad/1.1`
- v2.0: `/gpfs/opt/data/squad/2.0`
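As an illustration, a training run can point `run_squad.py` directly at these shared files. The flag values below are illustrative, not taken from the provided PBS scripts; see the `run_squad.py` documentation linked above for the full list of options.

```shell
# Sketch: pointing run_squad.py at the shared SQuAD v1.1 dataset
# (model choice and flag values are illustrative)
export SQUAD_DIR=/gpfs/opt/data/squad/1.1

python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
  --train_file "$SQUAD_DIR/train-v1.1.json" \
  --predict_file "$SQUAD_DIR/dev-v1.1.json" \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./debug_squad/
```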
### Squad dataset download

The datasets are already downloaded and stored in the shared location `/gpfs/opt/data/squad/`.
If you want to download the SQuAD datasets again, here are the commands.
#### Squad 1.1

```shell
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
wget https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py
```
#### Squad 2.0

```shell
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
wget https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
mv index.html evaluate-v2.0.py
```
### Basic and useful PBS commands to manage jobs

```shell
qsub <script>.pbs         # submit a job, returns the job id
qdel <job_id>             # stop the job
qstat                     # get information on your ongoing jobs
qstat -f <job_id>         # get full information on a job
qstat -fx <job_id>        # get full information on a finished job
qstat -wn1u <username>    # get more information on running jobs, including node names
fusion-output -f <job_id> # watch the job logs during execution
```

See the full documentation on job management: https://mesocentre.pages.centralesupelec.fr/user_doc/06_jobs_management/