# README.md

## TL;DR

```bash
./configure # environment configuration, run this only once
qsub single_gpu_training.pbs # train the network on one GPU, return the job id
qstat # watch the job status (Q for queue, R for running)
fusion-output -f <jobid> # watch the job logs during execution
# once the job is finished, output will be written in <jobname>.o<job_id>
```

## Documentation

Fusion supercomputer documentation: https://mesocentre.pages.centralesupelec.fr/user_doc/

Transformers on GitHub: https://github.com/huggingface/transformers

The documentation for `run_squad.py` can be found here: https://huggingface.co/transformers/examples.html#squad

## Configure environment

The `configure` script sets up the environment:
- moves the default `.conda` directory to `$WORKDIR` to avoid space problems on the `$HOME` mount point
- loads the anaconda executables (version anaconda3/5.3.1, see https://mesocentre.pages.centralesupelec.fr/user_doc/04_module_command/)
- creates an anaconda environment with the required dependencies for `transformers`
- downloads and installs `transformers`

```bash
./configure
```
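The steps above can be sketched roughly as follows. This is an illustrative outline, not the actual script: the environment name `transformers-env` and the exact installation commands are assumptions.

```shell
# Illustrative sketch of what the configure script does
# (the environment name "transformers-env" is an assumption, not the real script).

# 1. Move the default conda directory to $WORKDIR to save space on $HOME
mkdir -p "$WORKDIR/.conda"
ln -sfn "$WORKDIR/.conda" "$HOME/.conda"

# 2. Load the anaconda module documented in the Fusion user doc
module load anaconda3/5.3.1

# 3. Create an environment for transformers and its dependencies
conda create -y -n transformers-env python=3.7

# 4. Install transformers inside that environment
source activate transformers-env
pip install transformers
```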

> Note: the `configure` script has to be executed only once. It can take a few minutes to complete.

## Train the network

> IMPORTANT: the login node has no GPU; running your script without `qsub` will not train on a GPU.

Run the network training:

```bash
qsub <pbs_script>.pbs
```
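For reference, a PBS job script for this kind of training typically looks like the sketch below. This is a hedged outline, not the contents of `single_gpu_training.pbs`: the walltime, the resource selection line, the environment name, and the `run_squad.py` arguments shown are assumptions.

```shell
#!/bin/bash
# Hypothetical sketch of a single-GPU PBS script; resource values, the
# environment name, and the run_squad.py arguments are assumptions --
# see single_gpu_training.pbs for the real one.
#PBS -q gpuq
#PBS -l walltime=01:00:00
#PBS -l select=1:ngpus=1

cd "$PBS_O_WORKDIR"
module load anaconda3/5.3.1
source activate transformers-env   # hypothetical environment name

python run_squad.py \
  --train_file /gpfs/opt/data/squad/1.1/train-v1.1.json \
  --predict_file /gpfs/opt/data/squad/1.1/dev-v1.1.json \
  --output_dir ./debug_squad/
```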

- `single_gpu_training.pbs`: train the network on a single GPU
- `dual_gpu_training.pbs`: train the network on two GPUs

Notes:

- Some temporary data is written to the `--output_dir` directory (`./debug_squad/`). You may have to clean this directory manually before relaunching the training: `rm -r ./debug_squad/`
- During the TP sessions, you can use the `isiaq` reservation instead of the `gpuq` queue by commenting/uncommenting the lines beginning with `#PBS -q`
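Switching queues amounts to toggling which `#PBS -q` line is active in the script; the exact line layout below is illustrative:

```shell
# In <pbs_script>.pbs: keep exactly one queue line active.
##PBS -q gpuq    # default queue, disabled here with an extra '#'
#PBS -q isiaq    # reservation used during the TP sessions
```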

## Misc notes

### Squad dataset location

On Fusion, the SQuAD datasets are located at:

- v1.1: `/gpfs/opt/data/squad/1.1`
- v2.0: `/gpfs/opt/data/squad/2.0`

### Squad dataset download

The datasets are already downloaded and stored in the shared location `/gpfs/opt/data/squad/`.

If you want to download the SQuAD datasets again, here are the commands.

SQuAD 1.1

```bash
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
wget https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py # raw file, not the HTML blob page
```

SQuAD 2.0

```bash
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
wget https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
mv index.html evaluate-v2.0.py
```
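The last download saves under the name `index.html` because the URL ends in `blob/`, hence the `mv` step. An equivalent alternative is `wget -O`, which writes directly to the given filename:

```shell
# Download the evaluation script straight to its final name (avoids the mv step).
wget -O evaluate-v2.0.py https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
```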

### Basic and useful PBS commands to manage jobs

```bash
qsub <script>.pbs # submit a job, return the job id
qdel <job_id> # stop the job
qstat # get information on your ongoing jobs
qstat -f <job_id> # get full information on a job
qstat -fx <job_id> # get full information on a finished job
qstat -wn1u <username> # get more information on running jobs including node name
fusion-output -f <jobid> # watch the job logs during execution
```
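A common pattern combining these commands is to poll the queue until the job disappears from it, then read the output file. A minimal sketch, assuming the `<jobname>.o<job_id>` output naming mentioned above (the helper name and the two-second poll interval are arbitrary):

```shell
# Hypothetical helper: wait for a PBS job to finish, then print its output.
# Assumes output files named <jobname>.o<numeric_id> in the current directory.
wait_and_show() {
  jobid="$1"                       # e.g. 12345.fusion
  while qstat "$jobid" >/dev/null 2>&1; do
    sleep 2                        # poll interval, arbitrary
  done
  cat ./*.o"${jobid%%.*}"          # keep only the numeric part of the job id
}
```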

> See the full documentation on job management: https://mesocentre.pages.centralesupelec.fr/user_doc/06_jobs_management/