Experiment Creation (WIP) #8

Draft
wants to merge 67 commits into
base: expapitemp

@bonedaddy bonedaddy commented Apr 29, 2020

Not the official PR; I'm working out of this PR for my own purposes, task tracking, etc.

Session Bug

  • session calls update -> update calls supervise
    • labagent/agentrouter/router.go
  • benchmark is run -> calls /labapp/appapi/appapi.go to run the actual benchmark
  • run function is located in labapp/approuter/router.go

Looks like the race is happening within the supervisor?

Next Focus

  • Distribute cue templates as a package (hinshun already did this, so not needed)
  • cleanup repo for upstream merge

TODO:

  • Distribute cue templates as an actual package

Session Bug Description (largely copy & pasted from hinshun)

  • Calling labagent's /update API with the intended version ref, to start supervising the labapp process, is usually asynchronous.
  • In order to capture opentracing spans in bitswap, the context.Context object must be propagated to the ROOT context object used to start the libp2p peer.
  • This context must have the trace ID and other metadata, which is plumbed across the network by being injected/extracted from HTTP headers. In order to achieve this, there is a synchronous version of /update that involves calling the API without a ref.
  • This call to /update forwards the context.Context from the client, all the way to the root context of the libp2p peer, which means that this peer lives and dies with the request.
  • The code I commented out calls this synchronous version of /update and cancels the context to kill the peer. For some reason, it's killing a peer in an unrelated cluster (other trial in the experiment), and it seems to be racy (not every time).

postables added 29 commits April 29, 2020 15:24
note that unless we use properly formatted cue files they cause errors, i will be committing them next
@bonedaddy

Useful commands for bulk running tests:

  1. for i in $(ps aux | grep -i labd | awk '{print $2}'); do kill -9 "$i"; done
  2. rm -rf labd tmp output.log && go build && ./labd 2>&1 | tee --append output.log
  3. rm -rf labctl output.log && go build && ./labctl --log-level debug e create ../../cue/cue.mod/p2plab_example1.cue 2>&1 | tee --append output.log

@bonedaddy

bonedaddy commented May 12, 2020

Trace errors

Overview

These trace errors point to labagent/agentrouter/router.go:putUpdate, which makes me think the problem may be in the peer definition handling. I've shoehorned in a temporary helper function to create peer definitions for experiments, and I'm wondering if that is the cause.

labd scenarios.Session

trace_error_1

labd HTTP POST

trace_errors_2

hinshun and others added 2 commits May 13, 2020 17:52
nodes/session.go Outdated
eg.Go(func() error {
lctx, cancel := context.WithCancel(gctx)
cancels[i] = cancel
// cancels := make([]context.CancelFunc, len(ns))

this is the buggy part

@bonedaddy

Session Debugging

labagent/agentrouter/router.go is what gets called when the session updates and fails.

@bonedaddy

Built labd with -race and captured a race:
output.log

@bonedaddy

Session Bug

  • session calls update -> update calls supervise
    • labagent/agentrouter/router.go
  • benchmark is run -> calls /labapp/appapi/appapi.go to run the actual benchmark
  • run function is located in labapp/approuter/router.go

Looks like the race is happening within the supervisor?
