In the AutoVC paper, it seems:

- S2 is estimated from more than 20 seconds of speech for a given talker.
- Es(X1) is estimated from the speech segment that is input to the content encoder.
Even though both of these representations are estimated for the same talker, they are estimated from different input speech, and S2 is potentially based on a longer speech duration. So there could be some differences between the two talker embedding representations.
However, in the codebase, S2 is reused in place of Es(X1). Any idea how much impact this has on the extent of disentanglement between the content and talker representations? Since Es(X1) could be based on a shorter speech duration, would it be useful to estimate it separately, so that the network learns to disentangle only the talker information that is appropriate for a given input speech segment?
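To make the two options concrete, here is a minimal sketch of what I mean, using hypothetical PyTorch stand-ins (the class names, dimensions, and call signatures below are placeholders, not the actual identifiers in the repository):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the AutoVC speaker encoder and generator;
# the real modules in the repository have different internals.
class DummySpeakerEncoder(nn.Module):
    """Maps a (batch, n_frames, n_mels) mel-spectrogram to a talker embedding."""
    def __init__(self, n_mels=80, dim_emb=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim_emb)

    def forward(self, mel):
        return self.proj(mel).mean(dim=1)  # average over time frames


class DummyGenerator(nn.Module):
    """Consumes the content spectrogram plus source/target talker embeddings."""
    def __init__(self, n_mels=80, dim_emb=256):
        super().__init__()
        self.proj = nn.Linear(n_mels + 2 * dim_emb, n_mels)

    def forward(self, mel, src_emb, tgt_emb):
        n_frames = mel.size(1)
        cond = torch.cat([src_emb, tgt_emb], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, n_frames, -1)
        return self.proj(torch.cat([mel, cond], dim=-1))


speaker_encoder = DummySpeakerEncoder()
generator = DummyGenerator()

mel_x1 = torch.randn(4, 128, 80)   # short training segment X1
s2 = torch.randn(4, 256)           # precomputed embedding from > 20 s of speech

# Current behaviour: the precomputed S2 is reused in place of Es(X1).
recon_reused = generator(mel_x1, s2, s2)

# Variant I am asking about: estimate Es(X1) from the same short segment
# fed to the content encoder, keeping S2 only on the target side during
# self-reconstruction.
es_x1 = speaker_encoder(mel_x1)
recon_separate = generator(mel_x1, es_x1, s2)
```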
Thanks,
Pravin