This repository has been archived by the owner on Aug 10, 2023. It is now read-only.

Commit

set model in original paper as default for performance
hfxunlp committed Mar 11, 2019
1 parent 5ca6c80 commit a3a25a2
Showing 6 changed files with 4 additions and 300 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -400,7 +400,7 @@ Measured with `multi-bleu-detok.perl`:
| Case-sensitive | 32.63 | 32.26 | 32.97 | 32.89 |
| Case-insensitive | 34.06 | 33.70 | 34.36 | 34.28 |

- Note: The result of the [THUMT implementation](https://github.com/thumt/THUMT) is from [Accelerating Neural Transformer via an Average Attention Network](https://arxiv.org/abs/1805.00631). Model averaging is not applied when testing this implementation, since this experiment uses different settings for saving checkpoints, under which averaging models greatly hurts performance. Results with the length penalty used by THUMT are reported, but the length penalty does not improve the performance of the Transformer in my experiments. The outputs of the last encoder layer and decoder layer are not normalised in this experiment; after adding layer normalization to the outputs of the last encoder and decoder layers, model averaging no longer works at all.
+ Note: The result of the [THUMT implementation](https://github.com/thumt/THUMT) is from [Accelerating Neural Transformer via an Average Attention Network](https://arxiv.org/abs/1805.00631). Model averaging is not applied when testing this implementation, since averaging models may hurt performance when layer normalization is applied between residual connections. Results with the length penalty used by THUMT are reported, but the length penalty does not improve the performance of the Transformer in my experiments. The outputs of the last encoder layer and decoder layer are not normalised in this experiment.

2, Settings: same as the first, except that the outputs of the last encoder layer and decoder layer are normed and:

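For reference, the model averaging discussed in the note above means taking an element-wise mean of the parameters of several saved checkpoints. Below is a minimal PyTorch sketch, assuming plain `state_dict` checkpoints on disk; the helper name and loading convention are illustrative and not part of this repository.

```python
import torch

def average_checkpoints(paths):
    # Element-wise mean of the parameters stored in several checkpoint files.
    # Assumes each file holds a plain state_dict with identical keys.
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k].add_(v.float())
    n = float(len(paths))
    return {k: v.div_(n) for k, v in avg.items()}
```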
148 changes: 0 additions & 148 deletions transformer/AGG/HGraphEncoder.py

This file was deleted.

148 changes: 0 additions & 148 deletions transformer/AGG/LGraphEncoder.py

This file was deleted.

2 changes: 1 addition & 1 deletion transformer/AvgDecoder.py
@@ -17,7 +17,7 @@ class DecoderLayer(nn.Module):
# num_head: number of heads in MultiHeadAttention
# ahsize: hidden size of MultiHeadAttention

- def __init__(self, isize, fhsize=None, dropout=0.0, attn_drop=0.0, num_head=8, ahsize=None, norm_residue=False):
+ def __init__(self, isize, fhsize=None, dropout=0.0, attn_drop=0.0, num_head=8, ahsize=None, norm_residue=True):

super(DecoderLayer, self).__init__()

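The `norm_residue` flag changed above controls, per the comment in these files, whether the residual branch carries the layer-normalized representation or the raw input. The following is a minimal sketch of that convention, not the repository's actual layer code; the wrapper class below is hypothetical.

```python
import torch
from torch import nn

class ResidualSketch(nn.Module):
    # Illustrative wrapper around a sub-layer (self-attention or feed-forward).

    def __init__(self, isize, sublayer, dropout=0.0, norm_residue=True):
        super(ResidualSketch, self).__init__()
        self.normer = nn.LayerNorm(isize)
        self.sublayer = sublayer
        self.drop = nn.Dropout(dropout) if dropout > 0.0 else None
        self.norm_residue = norm_residue

    def forward(self, x):
        _x = self.normer(x)          # normalize before the sub-layer
        out = self.sublayer(_x)
        if self.drop is not None:
            out = self.drop(out)
        # norm_residue=True: the residual adds the normalized input;
        # norm_residue=False: the residual adds the raw input instead.
        return out + (_x if self.norm_residue else x)
```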
2 changes: 1 addition & 1 deletion transformer/Decoder.py
@@ -14,7 +14,7 @@ class DecoderLayer(nn.Module):
# ahsize: hidden size of MultiHeadAttention
# norm_residue: residue with layer normalized representation

- def __init__(self, isize, fhsize=None, dropout=0.0, attn_drop=0.0, num_head=8, ahsize=None, norm_residue=False):
+ def __init__(self, isize, fhsize=None, dropout=0.0, attn_drop=0.0, num_head=8, ahsize=None, norm_residue=True):

super(DecoderLayer, self).__init__()

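With the new default, callers that want the behaviour of this commit no longer need to pass `norm_residue` explicitly. The two constructions below are equivalent under the signature shown in the diff; the hidden size 512 is an illustrative value, and it is assumed that the `None` defaults for `fhsize` and `ahsize` are resolved internally.

```python
from transformer.Decoder import DecoderLayer

layer_a = DecoderLayer(512)                     # norm_residue now defaults to True
layer_b = DecoderLayer(512, norm_residue=True)  # same configuration, stated explicitly
```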
2 changes: 1 addition & 1 deletion transformer/Encoder.py
@@ -21,7 +21,7 @@ class EncoderLayer(nn.Module):
# ahsize: hidden size of MultiHeadAttention
# norm_residue: residue with layer normalized representation

- def __init__(self, isize, fhsize=None, dropout=0.0, attn_drop=0.0, num_head=8, ahsize=None, norm_residue=False):
+ def __init__(self, isize, fhsize=None, dropout=0.0, attn_drop=0.0, num_head=8, ahsize=None, norm_residue=True):

super(EncoderLayer, self).__init__()

