This repository has been archived by the owner on Aug 10, 2023. It is now read-only.

Commit

set model in original paper as default for performance
hfxunlp committed Mar 11, 2019
1 parent 5ca6c80 commit a3a25a2
Showing 6 changed files with 4 additions and 300 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -400,7 +400,7 @@ Measured with `multi-bleu-detok.perl`:
| Case-sensitive | 32.63 | 32.26 | 32.97 | 32.89 |
| Case-insensitive | 34.06 | 33.70 | 34.36 | 34.28 |

- Note: The result of the [THUMT implementation](https://github.com/thumt/THUMT) is from [Accelerating Neural Transformer via an Average Attention Network](https://arxiv.org/abs/1805.00631). Model averaging is not applied when testing this implementation, since this experiment uses different settings for saving checkpoints, under which averaging models greatly hurts performance. Results with the length penalty used by THUMT are reported, but the length penalty does not improve the performance of the Transformer in my experiments. The outputs of the last encoder layer and decoder layer are not normalised in this experiment; after adding layer normalization to the outputs of the last encoder and decoder layers, model averaging no longer works at all.
+ Note: The result of the [THUMT implementation](https://github.com/thumt/THUMT) is from [Accelerating Neural Transformer via an Average Attention Network](https://arxiv.org/abs/1805.00631). Model averaging is not applied when testing this implementation, since averaging models may hurt performance when layer normalization is applied between residual connections. Results with the length penalty used by THUMT are reported, but the length penalty does not improve the performance of the Transformer in my experiments. The outputs of the last encoder layer and decoder layer are not normalised in this experiment.

2, Settings: same as the first, except that the outputs of the last encoder layer and decoder layer are normed and:

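For reference, the model averaging discussed in the note above means taking an element-wise mean of the parameters of several saved checkpoints. Below is a minimal PyTorch sketch, assuming plain `state_dict` checkpoints on disk; the helper name and loading convention are illustrative and not part of this repository.

```python
import torch

def average_checkpoints(paths):
    # Element-wise mean of the parameters stored in several checkpoint files.
    # Assumes each file holds a plain state_dict with identical keys.
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k].add_(v.float())
    n = float(len(paths))
    return {k: v.div_(n) for k, v in avg.items()}
```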
148 changes: 0 additions & 148 deletions transformer/AGG/HGraphEncoder.py

This file was deleted.

148 changes: 0 additions & 148 deletions transformer/AGG/LGraphEncoder.py

This file was deleted.

2 changes: 1 addition & 1 deletion transformer/AvgDecoder.py
@@ -17,7 +17,7 @@ class DecoderLayer(nn.Module):
# num_head: number of heads in MultiHeadAttention
# ahsize: hidden size of MultiHeadAttention

- def __init__(self, isize, fhsize=None, dropout=0.0, attn_drop=0.0, num_head=8, ahsize=None, norm_residue=False):
+ def __init__(self, isize, fhsize=None, dropout=0.0, attn_drop=0.0, num_head=8, ahsize=None, norm_residue=True):

super(DecoderLayer, self).__init__()

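The `norm_residue` flag changed above controls, per the comment in these files, whether the residual branch carries the layer-normalized representation or the raw input. The following is a minimal sketch of that convention, not the repository's actual layer code; the wrapper class below is hypothetical.

```python
import torch
from torch import nn

class ResidualSketch(nn.Module):
    # Illustrative wrapper around a sub-layer (self-attention or feed-forward).

    def __init__(self, isize, sublayer, dropout=0.0, norm_residue=True):
        super(ResidualSketch, self).__init__()
        self.normer = nn.LayerNorm(isize)
        self.sublayer = sublayer
        self.drop = nn.Dropout(dropout) if dropout > 0.0 else None
        self.norm_residue = norm_residue

    def forward(self, x):
        _x = self.normer(x)          # normalize before the sub-layer
        out = self.sublayer(_x)
        if self.drop is not None:
            out = self.drop(out)
        # norm_residue=True: the residual adds the normalized input;
        # norm_residue=False: the residual adds the raw input instead.
        return out + (_x if self.norm_residue else x)
```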
2 changes: 1 addition & 1 deletion transformer/Decoder.py
@@ -14,7 +14,7 @@ class DecoderLayer(nn.Module):
# ahsize: hidden size of MultiHeadAttention
# norm_residue: residue with layer normalized representation

- def __init__(self, isize, fhsize=None, dropout=0.0, attn_drop=0.0, num_head=8, ahsize=None, norm_residue=False):
+ def __init__(self, isize, fhsize=None, dropout=0.0, attn_drop=0.0, num_head=8, ahsize=None, norm_residue=True):

super(DecoderLayer, self).__init__()

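With the new default, callers that want the behaviour of this commit no longer need to pass `norm_residue` explicitly. The two constructions below are equivalent under the signature shown in the diff; the hidden size 512 is an illustrative value, and it is assumed that the `None` defaults for `fhsize` and `ahsize` are resolved internally.

```python
from transformer.Decoder import DecoderLayer

layer_a = DecoderLayer(512)                     # norm_residue now defaults to True
layer_b = DecoderLayer(512, norm_residue=True)  # same configuration, stated explicitly
```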
2 changes: 1 addition & 1 deletion transformer/Encoder.py
@@ -21,7 +21,7 @@ class EncoderLayer(nn.Module):
# ahsize: hidden size of MultiHeadAttention
# norm_residue: residue with layer normalized representation

- def __init__(self, isize, fhsize=None, dropout=0.0, attn_drop=0.0, num_head=8, ahsize=None, norm_residue=False):
+ def __init__(self, isize, fhsize=None, dropout=0.0, attn_drop=0.0, num_head=8, ahsize=None, norm_residue=True):

super(EncoderLayer, self).__init__()

