Models and training

Provided deep learning models

biRNN
bert
bert_plus

We use two BERT architectures: the standard one [Devlin et al., 2018] and a refined one. Compared with the standard architecture, the refined BERT adds the following task-specific features:

learnable positional embedding

self-attention with relative position representations [Shaw et al., 2018]

center-position concatenation for the output layer
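The last two features can be illustrated with a minimal plain-Python sketch. It is not the repository's implementation: the function names, the clipping distance `max_distance`, and the number of center positions `n_center` are hypothetical and chosen only for illustration. The first function builds the clipped relative-position index used to look up relative position embeddings in the style of Shaw et al.; the second concatenates the hidden vectors of the middle positions before they would be passed to an output layer.

```python
def relative_position_index(seq_len, max_distance):
    """Return a seq_len x seq_len matrix of indices into a table of
    2 * max_distance + 1 relative-position embeddings."""
    index = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Clip the signed offset j - i to [-max_distance, max_distance],
            # then shift it so that the smallest index is 0.
            offset = max(-max_distance, min(max_distance, j - i))
            row.append(offset + max_distance)
        index.append(row)
    return index


def concat_center_positions(hidden_states, n_center):
    """Concatenate the hidden vectors of the n_center middle positions."""
    seq_len = len(hidden_states)
    start = (seq_len - n_center) // 2
    features = []
    for vec in hidden_states[start:start + n_center]:
        features.extend(vec)
    return features


# Toy example: a window of 5 positions with 2-dimensional hidden vectors.
rel_index = relative_position_index(seq_len=5, max_distance=2)
hidden = [[float(p), float(p) + 0.5] for p in range(5)]
center = concat_center_positions(hidden, n_center=3)

print(rel_index[0])  # [2, 3, 4, 4, 4]
print(center)        # [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
```

Clipping keeps the embedding table small and treats all positions farther than `max_distance` from the query as equally distant.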

Training models

Basic settings:

N_EPOCH=50
W_LEN=21
LR=1e-4
MOTIF="CG"
NUCLEOTIDE_LOC_IN_MOTIF=0
POSITIVE_SAMPLE_PATH=<methylated fast5 path>
NEGATIVE_SAMPLE_PATH=<unmethylated fast5 path>
MODEL_SAVE_PATH=<model saved path>
BATCH_SIZE=<batch size>
N_WORKER=<number of data loading workers>
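To make the roles of MOTIF, NUCLEOTIDE_LOC_IN_MOTIF and W_LEN concrete, here is a small Python sketch of one plausible interpretation: the target base is the nucleotide at offset NUCLEOTIDE_LOC_IN_MOTIF inside each motif occurrence (the C of "CG" for offset 0), and W_LEN is the odd length of the window centered on that base. This reading is an assumption, and `motif_windows` is a hypothetical helper, not a function from this repository.

```python
def motif_windows(sequence, motif, loc_in_motif, w_len):
    """Return (target_position, window) for every motif occurrence whose
    centered window of length w_len fits entirely inside the sequence."""
    half = w_len // 2  # assumes w_len is odd
    windows = []
    start = sequence.find(motif)
    while start != -1:
        target = start + loc_in_motif
        if target - half >= 0 and target + half < len(sequence):
            windows.append((target, sequence[target - half:target + half + 1]))
        # Continue searching one base further so no occurrence is skipped.
        start = sequence.find(motif, start + 1)
    return windows


# Toy sequence with a short window (the settings above use w_len=21).
seq = "ATCGGACGTTACGA"
result = motif_windows(seq, motif="CG", loc_in_motif=0, w_len=5)
print(result)  # [(2, 'ATCGG'), (6, 'GACGT'), (11, 'TACGA')]
```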

BiRNN

# training biRNN model
MODEL="biRNN_basic"
python3 train_biRNN.py --model ${MODEL}  --model_dir ${MODEL_SAVE_PATH} --gpu cuda:0 --epoch ${N_EPOCH} \
--positive_control_dataPath ${POSITIVE_SAMPLE_PATH}   --negative_control_dataPath ${NEGATIVE_SAMPLE_PATH} \
--motif ${MOTIF} --m_shift ${NUCLEOTIDE_LOC_IN_MOTIF} --w_len ${W_LEN} --lr $LR \
--batch_size ${BATCH_SIZE}  --num_worker ${N_WORKER} --data_balance_adjust

BERTs

# training bert models using random read selection
MODEL="BERT_plus"  # options: "BERT", "BERT_plus"
python3 train_bert.py --model ${MODEL}  --model_dir ${MODEL_SAVE_PATH} --gpu cuda:0 --epoch ${N_EPOCH} \
--positive_control_dataPath ${POSITIVE_SAMPLE_PATH}   --negative_control_dataPath ${NEGATIVE_SAMPLE_PATH} \
--motif ${MOTIF} --m_shift ${NUCLEOTIDE_LOC_IN_MOTIF} --w_len ${W_LEN} --lr $LR \
--batch_size ${BATCH_SIZE}  --num_worker ${N_WORKER} --data_balance_adjust


# training bert models using region-based read selection
TEST_REGION="NC_000913.3 1000000 2000000"
python3 train_bert.py --model ${MODEL} --model_dir ${MODEL_SAVE_PATH} --gpu cuda:0 --epoch ${N_EPOCH} \
--positive_control_dataPath ${POSITIVE_SAMPLE_PATH}   --negative_control_dataPath ${NEGATIVE_SAMPLE_PATH} \
--motif ${MOTIF} --m_shift ${NUCLEOTIDE_LOC_IN_MOTIF} --w_len ${W_LEN} --lr $LR \
--batch_size ${BATCH_SIZE} --num_worker ${N_WORKER} --data_balance_adjust \
--test_region $TEST_REGION
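
The idea behind region-based read selection can be sketched in Python as follows. This is an illustration under stated assumptions, not the logic of train_bert.py: it assumes that reads overlapping TEST_REGION are held out for testing and all other reads are used for training, that coordinates are 0-based and half-open, and that each read is a dictionary with hypothetical keys `id`, `chrom`, `start` and `end`.

```python
def parse_region(text):
    """Parse a 'chrom start end' string such as TEST_REGION."""
    chrom, start, end = text.split()
    return chrom, int(start), int(end)


def split_by_region(reads, region):
    """Split read ids into (train, test) by overlap with the region."""
    chrom, start, end = region
    train, test = [], []
    for read in reads:
        overlaps = (
            read["chrom"] == chrom
            and read["start"] < end
            and read["end"] > start
        )
        (test if overlaps else train).append(read["id"])
    return train, test


# Toy reads; ids, coordinates and the second contig name are made up.
reads = [
    {"id": "r1", "chrom": "NC_000913.3", "start": 500000, "end": 510000},
    {"id": "r2", "chrom": "NC_000913.3", "start": 1500000, "end": 1508000},
    {"id": "r3", "chrom": "NC_000913.3", "start": 1995000, "end": 2003000},
    {"id": "r4", "chrom": "other_contig", "start": 1200000, "end": 1210000},
]
region = parse_region("NC_000913.3 1000000 2000000")
train_ids, test_ids = split_by_region(reads, region)
print(train_ids)  # ['r1', 'r4']
print(test_ids)   # ['r2', 'r3']
```

Holding out a contiguous genomic region, rather than a random subset of reads, keeps test reads from covering the same genomic positions as the training reads.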