# README.md
## TL;DR
```bash
./configure # environment configuration, run this only once
qsub single_gpu_training.pbs # train the network on one GPU, return the job id
qstat # watch the job status (Q for queue, R for running)
fusion-output -f <jobid> # watch the job logs during execution
# once the job is finished, output is written to <jobname>.o<job_id>
```
## Documentation
Fusion supercomputer documentation: https://mesocentre.pages.centralesupelec.fr/user_doc/
Transformers on GitHub: https://github.com/huggingface/transformers
The documentation for `run_squad.py` can be found here: https://huggingface.co/transformers/examples.html#squad
## Configure environment
The `configure` script sets up the environment:
- moves the default `.conda` directory to `$WORKDIR` to avoid space problems on the `$HOME` mount point
- loads the anaconda executables (version anaconda3/5.3.1, see https://mesocentre.pages.centralesupelec.fr/user_doc/04_module_command/)
- creates an anaconda environment with the required dependencies for `transformers`
- downloads and installs `transformers`
```bash
./configure
```
> Note: the `configure` script has to be executed only once. It can take a few minutes to run.
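The steps above roughly correspond to the manual commands below. This is an illustration only: the symlink approach, the environment name `transformers`, and the Python version are assumptions, not what the script necessarily does — use the provided script.

```shell
# Relocate the conda directory to $WORKDIR (the exact mechanism used by the script is an assumption)
mv ~/.conda "$WORKDIR/.conda" && ln -s "$WORKDIR/.conda" ~/.conda

# Load the anaconda executables (see the module documentation linked above)
module load anaconda3/5.3.1

# Create an environment (name and Python version are hypothetical), then install transformers
conda create -y -n transformers python=3.7
source activate transformers
pip install transformers
```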
## Train the network
> IMPORTANT: the login node has no GPU. A script launched without `qsub` runs on the login node and will therefore not train on a GPU.
Run the network training:
```bash
qsub <pbs_script>.pbs
```
Remi Hellequin
committed
Two training examples are provided:
- `single_gpu_training.pbs`: train the network on a single GPU
- `dual_gpu_training.pbs`: train the network on two GPUs
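For reference, a minimal PBS script has the shape sketched below. The resource-request syntax, walltime, job name, and environment name are assumptions; refer to the provided `.pbs` files for the actual directives.

```shell
#!/bin/bash
#PBS -N squad_training              # job name, used for the <jobname>.o<job_id> output file
#PBS -q gpuq                        # target queue (or the isiaq reservation during TP sessions)
#PBS -l select=1:ncpus=4:ngpus=1    # resource request (exact syntax is an assumption)
#PBS -l walltime=01:00:00

cd "$PBS_O_WORKDIR"                 # start from the directory the job was submitted from
module load anaconda3/5.3.1
source activate transformers        # hypothetical environment name
python run_squad.py --output_dir ./debug_squad/   # plus the other training arguments
```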
Notes:
- Some temporary data is written to the `--output_dir` directory (`./debug_squad/`). You may have to clean it manually before relaunching the training: `rm -r ./debug_squad/`
- During the TP sessions, you can use the `isiaq` reservation instead of the `gpuq` queue by commenting/uncommenting the lines beginning with `#PBS -q`
## Misc notes
### Squad dataset location
On fusion, the squad datasets are located at:
- v1.1: `/gpfs/opt/data/squad/1.1`
- v2.0: `/gpfs/opt/data/squad/2.0`
### Squad dataset download
The datasets are already downloaded and stored in the shared location `/gpfs/opt/data/squad/`.
If you want to download the squad datasets again, here are the commands.
Squad 1.1
```bash
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
# fetch the raw script, not the GitHub HTML page
wget https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py
```
Squad 2.0
```bash
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
wget https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
mv index.html evaluate-v2.0.py # the codalab download is saved as index.html
```
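To sanity-check a downloaded file, the snippet below counts question/answer pairs in a SQuAD-format JSON. The schema used here is the standard SQuAD v1.1 layout (`data` → `paragraphs` → `qas`); the file path in the usage comment is a placeholder.

```python
import json

def count_questions(squad):
    """Count question/answer pairs in a SQuAD-format dict."""
    return sum(
        len(paragraph["qas"])
        for article in squad["data"]
        for paragraph in article["paragraphs"]
    )

# Usage against a downloaded file (placeholder path):
#   with open("train-v1.1.json") as f:
#       squad = json.load(f)
#   print(squad["version"], count_questions(squad))

# Minimal inline example of the v1.1 layout:
sample = {
    "version": "1.1",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Paris is the capital of France.",
            "qas": [{"id": "1",
                     "question": "What is the capital of France?",
                     "answers": [{"text": "Paris", "answer_start": 0}]}],
        }],
    }],
}
print(count_questions(sample))  # → 1
```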
### Basic and useful PBS commands to manage jobs
```bash
qsub <script>.pbs # submit a job, return the job id
qdel <job_id> # stop the job
qstat # get information on your ongoing jobs
qstat -f <job_id> # get full information on a job
qstat -fx <job_id> # get full information on a finished job
qstat -wn1u <username> # get more information on running jobs including node name
fusion-output -f <jobid> # watch the job logs during execution
```
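The `S` column of `qstat` holds the job state (Q for queued, R for running). A small helper to extract it, assuming the usual tabular layout (two header lines, then one line per job); the sample text below stands in for real `qstat` output.

```shell
# Print the state column (5th field) for a given job id from qstat-style output.
# The column layout is an assumption; check it against real output on the cluster.
job_state() {
    awk -v id="$1" 'NR > 2 && $1 ~ "^"id { print $5 }'
}

# Hypothetical qstat output for illustration:
sample='Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
12345.fusion      squad_training   student           00:01:02 R gpuq'

echo "$sample" | job_state 12345   # prints "R"
```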
> See full documentation on the job management : https://mesocentre.pages.centralesupelec.fr/user_doc/06_jobs_management/