# README.md

## TL;DR

```bash
./configure                    # environment configuration, run this only once
qsub single_gpu_training.pbs   # train the network on one GPU, returns the job id
qstat                          # watch the job status (Q for queued, R for running)
fusion-output -f <job_id>      # watch the job logs during execution
# once the job is finished, the output is written to <jobname>.o<job_id>
```

## Documentation

Fusion supercomputer documentation: https://mesocentre.pages.centralesupelec.fr/user_doc/

Transformers on GitHub: https://github.com/huggingface/transformers

The documentation for `run_squad.py` can be found here: https://huggingface.co/transformers/examples.html#squad

## Configure environment

The `configure` script sets up the environment (a rough sketch of these steps is given in the Misc notes at the end of this README):

- move the `.conda` default directory to `$WORKDIR` to avoid space problems on the `$HOME` mount point
- load the anaconda executables (version anaconda3/5.3.1, see https://mesocentre.pages.centralesupelec.fr/user_doc/04_module_command/)
- create an anaconda environment with the required dependencies for `transformers`
- download and install `transformers`

```bash
./configure
```

> Note: the `configure` script has to be executed only once. It can take a few minutes to run.

## Train the network

> IMPORTANT: the login node has no GPU; running your script without `qsub` will have no effect.

Run the network training:

```bash
qsub <pbs_script>.pbs
```

Two training examples are provided (a sketch of such a PBS script is given in the Misc notes at the end of this README):

- `single_gpu_training.pbs`: train the network on a single GPU
- `dual_gpu_training.pbs`: train the network on two GPUs

Notes:

- Some temporary data is written to the `--output_dir` directory (`./debug_squad/`). You may have to clean it manually before relaunching the training: `rm -r ./debug_squad/`
- During the TP sessions, you can use the `isiaq` reservation instead of the `gpuq` queue by commenting/uncommenting the lines beginning with `#PBS -q`.

## Misc notes

### SQuAD dataset location

On Fusion, the SQuAD datasets are located at:

- v1.1: `/gpfs/opt/data/squad/1.1`
- v2.0: `/gpfs/opt/data/squad/2.0`

### SQuAD dataset download

The datasets are already downloaded and stored in the shared location `/gpfs/opt/data/squad/`. If you want to download the SQuAD datasets again, here are the commands.

SQuAD 1.1:

```
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
wget https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py
```

SQuAD 2.0:

```
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
wget https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
mv index.html evaluate-v2.0.py
```

### Basic and useful PBS commands to manage jobs

```bash
qsub <script>.pbs         # submit a job, returns the job id
qdel <job_id>             # stop the job
qstat                     # get information on your ongoing jobs
qstat -f <job_id>         # get full information on a job
qstat -fx <job_id>        # get full information on a finished job
qstat -wn1u <username>    # get more information on running jobs, including node names
fusion-output -f <job_id> # watch the job logs during execution
```

> See the full documentation on job management: https://mesocentre.pages.centralesupelec.fr/user_doc/06_jobs_management/
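
### Example: single-GPU PBS script (sketch)

For reference, here is a minimal sketch of what a script like `single_gpu_training.pbs` might look like. The queue name (`gpuq`), the module version (`anaconda3/5.3.1`), and the dataset path come from this README; the resource request, walltime, conda environment name, and `run_squad.py` arguments are illustrative assumptions. The provided PBS scripts are authoritative.

```bash
#!/bin/bash
#PBS -N single_gpu_training         # job name, used in the <jobname>.o<job_id> output file
#PBS -q gpuq                        # GPU queue; swap for the isiaq reservation during TP sessions
#PBS -l select=1:ncpus=4:ngpus=1    # assumed resource request: 1 chunk, 4 CPUs, 1 GPU
#PBS -l walltime=01:00:00           # assumed walltime; adjust to your training run

cd "$PBS_O_WORKDIR"                 # run from the submission directory

module load anaconda3/5.3.1         # load the anaconda executables
source activate transformers        # assumed name of the environment created by ./configure

# Illustrative invocation; the real arguments live in single_gpu_training.pbs
python run_squad.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --do_train \
    --do_eval \
    --train_file /gpfs/opt/data/squad/1.1/train-v1.1.json \
    --predict_file /gpfs/opt/data/squad/1.1/dev-v1.1.json \
    --output_dir ./debug_squad/
```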
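
### What `configure` roughly does (sketch)

The steps below illustrate the bullet points of the "Configure environment" section. The environment name, python version, and install commands are assumptions; treat the actual `configure` script as the reference.

```bash
# Sketch only: the names and versions below are assumptions, not the real script.
mkdir -p "$WORKDIR/.conda"
[ -e "$HOME/.conda" ] || ln -s "$WORKDIR/.conda" "$HOME/.conda"  # keep conda data off $HOME

module load anaconda3/5.3.1                 # load the anaconda executables

conda create -y -n transformers python=3.7  # assumed environment name and python version
source activate transformers

git clone https://github.com/huggingface/transformers
pip install ./transformers                  # install transformers and its dependencies
```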
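
### Evaluating predictions (sketch)

If you downloaded the datasets and evaluation scripts yourself, you can score a model's predictions against the dev set. The v1.1 script takes the dataset and the predictions file as positional arguments; the predictions file name below is an assumption about what `run_squad.py` writes to `--output_dir`.

```bash
# Assumed file names: dev set from the download commands above,
# predictions produced by run_squad.py in ./debug_squad/
python evaluate-v1.1.py dev-v1.1.json ./debug_squad/predictions.json
```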