Thanks for your implementation, Daniel!
I have a question about online data collection. In your README, you say: "This implementation seems to converge after ~60 exploration rounds instead of ~250 as shown in the paper." Where did you find that the paper's version converges after ~250 rounds?
From the caption of Table 5.1, I can only tell that they collected 200k online samples. How much data did they collect in each round?
They do not mention this directly, but I roughly estimated from the figures how many rounds Hopper should take. From Figure 4.1, the peak reward is first reached at ~250 rounds. In Figure 5.1, which shows a different run, that point is reached at roughly 180k-200k online samples, which suggests a rough equivalence between samples and rounds.
During each round, a single trajectory is added to the replay buffer. A trajectory has a maximum length of 1000 timesteps, but it can end earlier if the robot falls, so at most 1000 timesteps are added per round. If my implementation converges in, say, 70 rounds, that translates to at most 70,000 timesteps, and probably closer to ~55-60k in practice.
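For concreteness, here is a minimal sketch of that accounting in Python. `DummyHopperEnv`, `collect_trajectory`, and the placeholder policy are hypothetical stand-ins (not names from this repo or the paper), assuming a gym-style step interface that returns `(obs, reward, done, info)`:

```python
import random

MAX_EPISODE_STEPS = 1000  # Hopper's per-episode time limit

class DummyHopperEnv:
    """Hypothetical stand-in for the real Hopper environment;
    episodes terminate randomly to mimic the robot falling early."""
    def reset(self):
        return 0.0  # dummy observation

    def step(self, action):
        done = random.random() < 0.001  # occasional early fall
        return 0.0, 1.0, done, {}  # obs, reward, done, info

def collect_trajectory(env, policy):
    """Roll out one episode; returns a list of transitions."""
    obs, trajectory = env.reset(), []
    for _ in range(MAX_EPISODE_STEPS):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:  # early termination => fewer than 1000 steps
            break
    return trajectory

env = DummyHopperEnv()
policy = lambda obs: 0.0  # placeholder policy

replay_buffer, total_timesteps = [], 0
for round_idx in range(70):  # ~70 exploration rounds
    traj = collect_trajectory(env, policy)
    replay_buffer.extend(traj)    # one trajectory appended per round
    total_timesteps += len(traj)  # <= 1000 timesteps per round

# 70 rounds => at most 70,000 timesteps; fewer whenever an
# episode ends early, which is why the real count lands below the cap.
print(f"rounds: 70, timesteps collected: {total_timesteps}")
```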