Hi, @jing-zhao9, thanks for your question! The dot product of two branches is one of the efficient designs first proposed in MogaNet, and it is also used in Mamba and its recently proposed variants. We call it gating, following GLU, and found it more powerful than additive operations such as the concatenation you mentioned. You can find an intuitive explanation of why gating operations are effective and efficient in StarNet. Feel free to discuss if there are more questions.
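For intuition, here is a minimal PyTorch sketch of GLU-style gating, where two parallel branches are fused by an element-wise product rather than concatenation. This is an illustrative toy module, not the exact MogaNet implementation; the module and layer names are assumptions.

```python
import torch
import torch.nn as nn

class GatedAggregation(nn.Module):
    """Toy GLU-style gating block: fuse two branches by element-wise product."""
    def __init__(self, dim: int):
        super().__init__()
        # Gating branch: 1x1 conv + activation produces per-location, per-channel gates.
        self.gate = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.SiLU())
        # Value branch: depth-wise conv aggregates local spatial context.
        self.value = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.SiLU(),
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multiplicative fusion keeps the channel count at `dim`;
        # concatenation would double it and require a wider projection.
        return self.proj(self.gate(x) * self.value(x))

x = torch.randn(2, 64, 56, 56)
print(GatedAggregation(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```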
Thank you for your careful explanation! I have another question: why do I encounter gradient explosion when I apply the dot product proposed in MogaNet to my baseline model during training?
Sorry for the late reply. Gradient explosion can occasionally occur in MogaNet because of the gating branch in the Moga module. There are two possible workarounds: (1) Check for NaN or Inf values during training; if gradient explosion occurs, resume training from the previous checkpoint. (2) Remove the SiLU in the branch with multiple DWConvs. The two SiLU activations provide strong non-linearity with few extra parameters, but they increase the risk of instability, so you may need to trade off performance against training stability.
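A minimal sketch of option (1) is below: guard each optimizer step by checking the loss and gradients for NaN/Inf and skipping (or resuming from a checkpoint) when they blow up. The function names and the checkpoint path are placeholders for your own training setup, not part of the MogaNet codebase.

```python
import math
import torch

def grads_are_finite(model: torch.nn.Module) -> bool:
    """Return False if any parameter gradient contains NaN or Inf."""
    for p in model.parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            return False
    return True

def safe_step(model, optimizer, loss, ckpt_path="last_good.pth") -> bool:
    """One guarded optimizer step.

    Skips the update when the loss or gradients are non-finite; you can also
    reload the last good checkpoint here (`ckpt_path` is a hypothetical path).
    Returns True if the step was applied.
    """
    loss.backward()
    if not math.isfinite(loss.item()) or not grads_are_finite(model):
        optimizer.zero_grad(set_to_none=True)
        # Optionally resume from the previous checkpoint:
        # model.load_state_dict(torch.load(ckpt_path))
        return False
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True
```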
May I ask why concatenation is not used for feature aggregation in the Spatial Aggregation block?