
Support for OpenTelemetry #32866

Open
TheFox0x7 opened this issue Dec 16, 2024 · 39 comments
Labels
type/proposal The new feature has not been accepted yet but needs to be discussed first.

Comments

@TheFox0x7
Contributor

Feature Description

OpenTelemetry is an open standard designed to unify the signals coming from an application. It allows tracing requests and operations, shipping logs from the application, and gathering and exporting Prometheus-style metrics.

Potentially it would allow removing Prometheus in favor of OpenTelemetry (the Go library does have a setting for a Prometheus-style endpoint, though I've yet to test it) and would give us the option to find long-running operations more quickly with tracing.

I'd like to work on this (though it might take a while) and I have a "demo" of such integration here, with only tracing support.
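For illustration, wiring up the Go SDK is roughly this small (a minimal sketch, not taken from the demo linked above; it uses the stdout exporter just to show the moving parts):

package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// export finished spans as JSON to stdout; a real deployment would use an OTLP exporter instead
	exp, err := stdouttrace.New()
	if err != nil {
		panic(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// instrumented code only talks to the otel API, never to the SDK
	_, span := otel.Tracer("demo").Start(ctx, "do-something")
	defer span.End()
}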

Screenshots

No response

@TheFox0x7 TheFox0x7 added the type/proposal The new feature has not been accepted yet but needs to be discussed first. label Dec 16, 2024
@lunny lunny added the proposal/accepted We have reviewed the proposal and agree that it should be implemented like that/at all. label Dec 16, 2024
@lunny
Member

lunny commented Dec 16, 2024

Please have more discussion here before proposing a pull request.

Traditionally, I would like to collect logs from the disk log file rather than sending logs to the remote directly. I haven't read all the documentation of OpenTelemetry but I want to know how it works for logs.

@wxiaoguang wxiaoguang removed the proposal/accepted We have reviewed the proposal and agree that it should be implemented like that/at all. label Dec 17, 2024
@wxiaoguang
Contributor

wxiaoguang commented Dec 17, 2024

I think it needs more discussion.

If there is a chance, I'd like to completely decouple the third-party dependency to avoid bloating the Gitea binary (it is large enough already, and it is executed by every git operation .....)

There are various "tracing" and similar tools, and there will be more. If we add this one, should we add the others too?

So before starting the work, I'd like to clarify:

  1. Does it really help to trace useful metrics and resolve real-world problems? Maybe we should have a roadmap first.
    • meter: use it to replace the existing modules/metrics/collector.go (prometheus)? I don't see why we should keep different, duplicate collectors side by side.
    • tracer: I agree that Gitea needs better performance tracing, but the question is how to introduce it. Gitea has always been an all-in-one app; in most cases we need end users to report their performance problems directly, not ask them to set up OpenTelemetry to collect a performance profile.
    • logger: well, maybe Gitea needs to refactor its logger system first? Or just output JSON logs?
  2. Is it possible to decouple the dependency, for example: make Gitea just output JSON logs and use a log collector to send them to other tools?

@TheFox0x7
Contributor Author

For starters, I hoped for some discussion before continuing - I have a few outstanding issues (mainly how to handle SSH connections and how to register the exporters with the manager, because there's probably a far better solution for that) which I want to discuss.


First, keep in mind that all three components are optional by design; you can have no-op versions of all three with zero code changes.
As for your questions:

Traditionally, I would like to collect logs from the disk log file rather than sending logs to the remote directly. I haven't read all the documentation of OpenTelemetry but I want to know how it works for logs.

I haven't yet connected logs to it in any app besides Python, but in general it doesn't stop you from also gathering logs on disk, on stdout/stderr, or in syslog. What it does is provide a vendor-independent way to export logs (which it likewise gathers on the fly) to systems supporting the OpenTelemetry Protocol (OTLP), via gRPC or HTTP.
Logging in the Go SDK is still at an experimental stage.
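For reference, a minimal sketch of what the (experimental) log pipeline could look like in Go - the bridge and exporter package paths here are assumptions based on the current experimental SDK and may still change:

package main

import (
	"context"

	"go.opentelemetry.io/contrib/bridges/otelslog"        // slog -> OTel bridge (experimental)
	"go.opentelemetry.io/otel/exporters/stdout/stdoutlog" // stdout exporter, just for the demo
	"go.opentelemetry.io/otel/log/global"
	sdklog "go.opentelemetry.io/otel/sdk/log"
)

func main() {
	exp, err := stdoutlog.New()
	if err != nil {
		panic(err)
	}
	provider := sdklog.NewLoggerProvider(sdklog.WithProcessor(sdklog.NewBatchProcessor(exp)))
	defer provider.Shutdown(context.Background())
	global.SetLoggerProvider(provider)

	// a *slog.Logger backed by OTel; swap stdoutlog for an OTLP exporter to ship logs to a collector
	logger := otelslog.NewLogger("gitea")
	logger.Info("hello from the log bridge")
}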

There are various "tracing" and similar tools, and there will be more. If we add this one, should we add the others too?

The idea behind OpenTelemetry is to provide a single standard for signals, so adding more shouldn't be required.

Does it really help to trace useful metrics and resolve real-world problems? Maybe we should have a roadmap first.

A roadmap would probably be useful, yes. As for the usefulness, it may vary from case to case, but I'll refer you to Grafana's blog post about it, which I think shows quite well why it can be useful.

meter: use it to replace the existing modules/metrics/collector.go (prometheus)? I don't see why we should keep different, duplicate collectors side by side.

That's the idea, and it shouldn't be a breaking change, since the Go SDK supports Prometheus-style export in addition to OTLP (rough sketch below).
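As a rough sketch of what that Prometheus-style export looks like with the Go SDK (the endpoint and names are just for illustration, not Gitea code):

package main

import (
	"net/http"

	promclient "github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// OTel metrics are read by the Prometheus exporter and exposed on a normal /metrics endpoint,
	// so existing scrape configs keep working
	registry := promclient.NewRegistry()
	exporter, err := otelprom.New(otelprom.WithRegisterer(registry))
	if err != nil {
		panic(err)
	}
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))
	otel.SetMeterProvider(provider)

	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe("localhost:2112", nil)
}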

tracer: I agree that Gitea needs better performance tracing, but the question is how to introduce it. Gitea has always been an all-in-one app; in most cases we need end users to report their performance problems directly, not ask them to set up OpenTelemetry to collect a performance profile.

Tracing is mainly for operations (cross-application or in-app) - for example, if you open the branch view and it takes a while to load, you might want to know why: was it the template taking unusually long, git being slow for some reason, nothing being cached, or maybe the database taking a while to respond?
With traces you can instrument all of those parts and get a timeline with metadata, similar to a Gantt chart, to see where the issue was (toy sketch below).
I don't mean to introduce it as a mandatory step for reporting a performance issue; I just think that someone who has such a setup would find it easier to look at the trace durations and point at, say, a DB query as the thing that took the longest (say 2 seconds) despite normally being quick (0.2 seconds).
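As a toy illustration of that nesting (hypothetical function names, just to show how the child spans line up under one operation in the Gantt-like view):

package demo

import (
	"context"

	"go.opentelemetry.io/otel"
)

var tracer = otel.Tracer("gitea") // assumed tracer name

func BranchView(ctx context.Context) {
	ctx, span := tracer.Start(ctx, "BranchView") // root span for the whole operation
	defer span.End()

	loadBranches(ctx)    // child span: git
	queryBranchInfo(ctx) // child span: database
}

func loadBranches(ctx context.Context) {
	_, span := tracer.Start(ctx, "git for-each-ref") // nested under BranchView with its own duration
	defer span.End()
	// ... run the git command
}

func queryBranchInfo(ctx context.Context) {
	_, span := tracer.Start(ctx, "db SELECT branches") // nested under BranchView as well
	defer span.End()
	// ... query the database
}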

logger: well, maybe Gitea needs to refactor its logger system first? Or just output JSON logs?

Logging, as mentioned before, is experimental in the Go SDK, and I'm still somewhat lost on how it actually works (mainly because I've never tried it). Migrating to log/slog would be nice, but that's not something I can do without help, and I'd have to refresh my memory on how logging is configured in the SDK to say anything for sure.

Is it possible to decouple the dependency, for example: make Gitea just output JSON logs and use a log collector to send them to other tools?

That is basically how the current logs can already be collected (except for the JSON part) if one wants. So yes, it is possible - with a system very similar to Prometheus. JSON logs for that would be of interest to me.

@wxiaoguang
Contributor

wxiaoguang commented Dec 17, 2024

Thank you for the details. I also have some experience with "app performance tracing" (I used to build a web app serving millions of users per day, with QPS in the thousands), and I have used some tracing tools and built some tracing frameworks, so I can fully agree with the ideas behind OpenTelemetry.

I also used GitLab for a long time; GitLab packs a whole Grafana into its deployment, and they have their own diagnosis tools.

So if I were to design a roadmap for Gitea, I think the key problem is how to benefit most end users if we introduce a new mechanism.

  • Gitea indeed needs some mechanisms to help diagnose performance problems
    • I can see there are still many performance-related problems
    • At the moment there is no easy way to figure out the root causes
    • Performance tracing needs a lot of work, for example, adding the necessary tracer calls correctly.
  • At the moment, Gitea is not like GitLab
    • GitLab always requires users to set up a lot of services, so it's not surprising that they can pack Grafana into their package.
    • But Gitea is designed to be a lightweight system (almost zero dependencies), so if we introduce the OpenTelemetry dependency but end users can't use it out of the box, I think it departs from Gitea's goals.
    • I do not think it is feasible to tell end users "please install other software to report the performance problem".

Then the question is: how do we satisfy these requirements? Gitea needs tracing tools, and end users need to be able to report performance problems conveniently. In my mind, the roadmap could look like this:

  • Refactor Gitea's code first so that it has basic "signals" support, and make sure end users can report problems without external dependencies (unless we change our mind now and make end users always deploy Gitea as a complex Docker container which contains everything)
  • Add shim interfaces to the "signals" support
  • If some advanced users would like to use OpenTelemetry or something similar, they could use the "interfaces" to build their own instance from source code.
  • If one day Gitea is not lightweight anymore and must be deployed like GitLab, then we can integrate everything into one container.

@TheFox0x7
Contributor Author

I agree it's not feasible to expect someone who wanted a simple git server to set up Prometheus (metrics), Loki/Logstash (logs) and Tempo/Jaeger (traces), plus possibly Pyroscope (pprof). This feature mostly targets people who want an observable application when using multiple components, or who are looking to diagnose such issues.
Collecting traces is fairly simple, but I know that doesn't help much.

But I think you may be misunderstanding the idea a bit. The entire thing is optional by design - so you COULD use it, but it's not mandatory. If you want a system which works without any collectors and so on - it could be done by adding an exporter (either a dedicated one or the debug console one).
It's very similar to how metrics work - you can turn them on and scrape them, but you can also not do that and there's no loss.

As for the GitLab comparison - the minimal possible deployment won't change. The protocol is already a "shim" of sorts, as you can add custom exporters to it or ingest the data in your own way.
I don't want to introduce the entire Grafana stack into Gitea, or to go the GitLab way of trying to add dashboards and metric collection to it.
I'm not sure I understand your concerns properly - are you worried that this is a niche feature which won't be used by a lot of people and will be hard to take advantage of?


Here's a quick demo of how it works, in case someone wants to see it.
It's just tracing, copy-pasted from the getting-started page and trimmed down a bit. On the first run you should get a number from the dice roll, then a trace JSON should be printed. The second thing you can try is commenting out every part of the setup except the roll, and seeing that it still works but the entire trace is a no-op (see the sketch below).
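To illustrate that no-op behavior (a tiny sketch: only the API package is imported and no tracer provider is ever registered):

package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
)

func main() {
	// no otel.SetTracerProvider call anywhere: the global provider stays a no-op,
	// so Start returns a non-recording span and the surrounding code is unaffected
	_, span := otel.Tracer("demo").Start(context.Background(), "roll")
	defer span.End()
	fmt.Println(span.IsRecording()) // false
}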

@wxiaoguang
Contributor

But I think you may be misunderstanding the idea a bit. The entire thing is optional by design - so you COULD use it, but it's not mandatory. If you want a system which works without any collectors and so on - it could be done by adding an exporter (either a dedicated one or the debug console one). It's very similar to how metrics work - you can turn them on and scrape them, but you can also not do that and there's no loss.

To avoid misunderstanding, let's take some examples.

  • Your proposal: suppose we have OpenTelemetry in Gitea:

    • If an end user encounters performance problems, what should we tell them to do to provide the details?
  • My proposal: do not introduce OpenTelemetry into Gitea at the moment, but only implement the necessary tracing features in Gitea and provide some shim interfaces

    • If an end user encounters performance problems, they can provide the details out of the box (just one click in the admin panel)
    • If some advanced users need to add OpenTelemetry support, they can use the shim interfaces and build their instance from source code.

@TheFox0x7
Contributor Author

Okay, I get it now, thanks a lot. I hadn't considered an embedded use case.

In my case it would be slightly more complicated:

  1. start a Jaeger container (podman run --rm -p 16686:16686 -p 4317:4317 docker.io/jaegertracing/all-in-one:latest would cover the minimum version)
  2. configure Gitea to push to it (about one config change, but it depends on how the config will be handled)
  3. probably do a full restart
  4. send the relevant findings
    which is far from the ease of having it available from the admin panel.

On the other hand, I have no clue what it would look like. It's not like one (to my knowledge) could enable it from the admin panel and then get one particular endpoint traced end to end without the propagation which the standard does provide. Also, I think that making an interface for it would be harder to maintain and integrate with later. So I still think it would be better to have the native support and think of a way to make it simple to configure and gather as intended, or make a custom exporter which provides the use case you're talking about.

Regardless, it is an interesting use case, I'll admit.
I thought about having an OTel collector embedded to use for reporting on Actions - since CI monitoring is a really nice use case for traces - but I scrapped the idea as it feels more like bloat, and there's the storage question to figure out. Maybe it would work for this? Throwing it out there in case you find it interesting.


Full disclosure: I'm obviously biased towards having it natively implemented instead of having to shim support onto the base.

@wxiaoguang
Contributor

After 1.23 gets a stable release (maybe in 1 or 2 weeks), I think we can start some initial work on the performance tracing. Then we can try to get OpenTelemetry support, either source-code-level optional (build from source with a build tag), or config-level optional (enabled by a config option), or mandatory.

@TheFox0x7
Contributor Author

I would suggest making it a configuration feature. It would always be "on" but would just default to no-op.
Making it mandatory is already out, as it needs to be configured - the default localhost:4317 address is pretty useless on its own.

I took a really quick glance at the std runtime/trace option. I think that would indeed cover your use case better, as it lets the user write the trace to a file, and I think it works globally, so you could just add a trigger for it in the admin panel - maybe even an on/off switch (rough sketch below).
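For reference, the std runtime/trace flow is roughly this (a sketch; the file name and the idea of triggering it from the admin panel are assumptions):

package main

import (
	"os"
	"runtime/trace"
)

func main() {
	// hypothetical admin-panel trigger: write an execution trace to a file for a while
	f, err := os.Create("diagnosis.trace")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		panic(err)
	}
	defer trace.Stop()

	// ... run the workload to diagnose; the file can later be attached to a report
	// and inspected with `go tool trace diagnosis.trace`
}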

@wxiaoguang
Contributor

wxiaoguang commented Dec 22, 2024

I started some initial work here: https://github.com/wxiaoguang/gitea/commits/support-trace/

Quite simple: it introduces our shim layer and builtin tracer. The interface is designed to be OpenTelemetry-like.

I will try to make some improvements and get it really working before proposing a PR: the real challenge is how to correctly trace the performance problems and show useful results to end users (especially in the db and git packages). It may take about one week.


Update: PR: Support performance trace #32973

@TheFox0x7
Contributor Author

It looks like it would work with the standard. As long as the semantic conventions are followed for metadata keys/values, and I am able to replace the tracer with the OTel one without breaking things, I don't think I'll have objections.

The db would probably best be handled in xorm via OTel as well (seeing that Grafana uses xorm, they might consider helping there), but semconv exists for the client side, and it might be doable with either session or struct events - I haven't looked that deep.

For git I modified the run command, which worked surprisingly well as a quick "catch-all" option, with the downside that spans lacked good context (it required diving into the span to see which command was called), which I started to solve by having the calling function be traced too:

func GetReverseRawDiff(ctx context.Context, repoPath, commitID string, writer io.Writer) error {
        ctx, span := tracer.Start(ctx, "GetReverseRawDiff")
        defer span.End()
        stderr := new(bytes.Buffer)
        cmd := NewCommand(ctx, "show", "--pretty=format:revert %H%n", "-R").AddDynamicArguments(commitID)
        // rest of the code
}
func (c *Command) Run(opts *RunOpts) error {
        _, span := tracer.Start(c.parentContext, c.args[c.globalArgsLength], trace.WithAttributes(
                semconv.ProcessCommand(c.prog), semconv.ProcessCommandArgs(c.args...),
        ), trace.WithSpanKind(trace.SpanKindInternal))
        defer span.End()
       // rest of the function plus marking errored spans
}

Side note so I won't forget about it: calls to the /internal endpoint should carry span headers, since they aren't standalone operations but part of a larger one. The caller of that endpoint is the parent (propagation sketch below).
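A sketch of what that header propagation could look like with the standard propagators (the function names are hypothetical, not Gitea code, and it assumes otel.SetTextMapPropagator(propagation.TraceContext{}) was called at startup):

package internalapi

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// caller side: inject the current span context into the request headers
func callInternal(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		return nil, err
	}
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}

// /internal handler side: extract the parent span context so new spans attach to the caller's trace
func internalHandler(w http.ResponseWriter, r *http.Request) {
	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
	_, span := otel.Tracer("internal").Start(ctx, "internal-op")
	defer span.End()
	// ... handle the hook/ssh request
}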

@TheFox0x7
Contributor Author

FYI: I've published what I've made to date on my fork.

To turn it on:

[opentelemetry]
ENABLED=true

and docker/podman run --rm -p 16686:16686 -p 4318:4318 -d --name jaeger docker.io/jaegertracing/all-in-one:latest
should suffice. The container should listen on localhost:4318, as that's the default.
As for the actual demo - visit any page in Gitea, view a file, etc., then go to localhost:16686 (the Jaeger UI) and pick gitea from the service menu. At the very least you should have a git trace similar to this one:
[screenshot]

@wxiaoguang
Contributor

wxiaoguang commented Dec 22, 2024

Hmm ... it seems somewhat cumbersome to add a lot of ctx, span := tracer.Start(repo.Ctx, "CommitTree") calls. I think I have some ideas about how to improve it. At least, with my recent "Refactor pprof labels and process desc #32909", git.Command.Run is able to know its caller.

ps: I am still working on some refactoring PRs; I think we should have enough preparation first, then the following work will be easier.


Update: in Refactor request context #32956, the legacy hacky code has been removed and now we have a clear RequestContext; this context is worth tracing.

@techknowlogick
Member

Thank you for this issue. As for the related code you linked to, due to its licensing, I have a personal policy of not looking at that repo to ensure that I don't violate any of the licenses.
It's funny that you opened this issue: as part of some performance work, and after some chance encounters with some folks from Honeycomb, I have been investigating this for some time, and I have some (less than stellar) code that I put together as a proof of concept to show that it could be done. I've pushed my code so that others can look at it too, and perhaps we can get this in.
I share the same feelings as @wxiaoguang about the cumbersome boilerplate of wrapping many things with tracer.Start() etc., but sadly, other than some eBPF auto-instrumentation tools that exist for OTel, that seems to be the approach for adding OTel to Go programs. Thankfully, the two libraries I've chosen as middleware for the db and HTTP alleviate that. Sadly, for our git module we are likely to have to do the wrapping ourselves.

@wxiaoguang
Contributor

I am still working on some preparations, so please review and approve my recent refactoring PRs; then we will have the builtin performance tracer soon. Then, at least, advanced users can introduce their tracers at the source-code level.

@TheFox0x7
Contributor Author

As for the related code you linked to, due to its licensing, I have a personal policy of not looking at that repo to ensure that I don't violate any of the licenses.

@techknowlogick You can ignore that. I'm the author of the entire thing, and I did pick the EUPL (since I wanted to ensure the telemetry part stays public), but seeing as I want to add it to Gitea and work on this here as well, I've dropped that requirement.
If you want, I can swap the license header so it's on paper, without you having to rely on my word.

It's also less than stellar - the main things that work well are the settings and basic tracing; the actual service start works, but it probably misuses the graceful service system somewhat.

@wxiaoguang
Contributor

wxiaoguang commented Dec 24, 2024

Thankfully, the two libraries I've chosen as middleware for the db and HTTP alleviate that.

I am sure we need to do it by ourselves: #32866 (comment)

Sadly, for our git module we are likely to have to do the wrapping ourselves.

If you read my recent PRs, you'll see that there's no need to be sad, and no need to make it complicated. I have a complete design to make the system right and lightweight, but I need to do some refactoring first.

@wxiaoguang
Contributor

I have pushed my work for "builtin tracer": WIP: Support performance trace #32973 .

It has initial http/db/git support, and a simple list in admin UI

@techknowlogick
Member

@TheFox0x7 thanks! I'll check out your code at the link above. I'm sure your statement here is sufficient, and enough people get emails from this tracker that there is some record of your statement should something catastrophic happen to GitHub (knock on wood it doesn't).

@wxiaoguang thanks! I hadn't checked out your code yet. The code in my PR had been stashed in my local repo for a while, and I figured it'd be better to push it up since it can be added to this discussion.

@TheFox0x7
Contributor Author

I've looked through the built-in tracer PR and I have a few issues I think are worth discussing. I'm posting them here since I don't know the conventions too well - some people don't like draft reviews. Plus they aren't that specific to it.

First thing, circling back a bit - why not implement the SpanExporter interface as the internal backend?
I see the following upsides:

  • No custom trace wrapper, so the entire behavior is available and there is no need to re-implement spans.
  • It's fairly simple to swap the internal exporter for a non-internal one.

I've tried to do it, and the biggest issue I've had so far is connecting the spans into one trace, as spans only know their parent, so I can't easily build a string the way the PR does.
I did, however, manage to rewrite parts of it so it works in my PR (without the parent-child relation):
[screenshot]

If we manage to solve that without re-implementing half of Jaeger, I think we'd have it covered quite easily between the three solutions.

@wxiaoguang
Contributor

wxiaoguang commented Dec 25, 2024

First thing, circling back a bit - why not implement the SpanExporter interface as the internal backend?

Sorry, but I didn't get the point .... could you elaborate?

I think our shim layer won't implement any "real" OpenTelemetry interfaces, because OpenTelemetry's interfaces have quite heavy and complex dependencies. So with the current design, users can implement the "traceStarter" and register it to operate on our spans directly.

If you think "SpanExporter" is useful, could you provide some pseudocode to describe how to use it?

@wxiaoguang
Contributor

wxiaoguang commented Dec 25, 2024

In my mind, the OTel integration could look like this:

type traceOtelStarter struct{}

type traceOtelSpan struct {
	otelSpan trace.Span // trace.Span from go.opentelemetry.io/otel/trace
}

func (t *traceOtelSpan) end() {
	t.otelSpan.End()
}

func (t *traceOtelStarter) start(ctx context.Context, traceSpan *TraceSpan, internalSpanIdx int) (context.Context, traceSpanInternal) {
	// "tracer" would be an otel trace.Tracer, e.g. otel.Tracer("gitea")
	ctx, span := tracer.Start(ctx, traceSpan.name)
	return ctx, &traceOtelSpan{otelSpan: span}
}

func init() {
	// TraceSpan, traceSpanInternal and globalTraceStarters come from the builtin shim layer
	globalTraceStarters = append(globalTraceStarters, &traceOtelStarter{})
}

@TheFox0x7
Contributor Author

TheFox0x7 commented Dec 25, 2024

Putting this into words turns out to be much harder than I thought, but:

The way I see it right now is that we could turn taillog into an exporter, making the integration more natural IMO, with the still-unsolved part being assembling a trace from its spans. I might be a bit stuck on this idea though, as mentioned earlier.
Unfortunately, I agree that the SDK isn't the lightest or simplest, and compared to the main branch the final binary is ~3MB larger - which is quite a lot for a feature that might not even be used.
I think the list of concerns I have is as follows:

  • Deviating from the standard by accident
  • Having to recreate trace.Span
  • Overcomplicating the system with wrappers around something which is already kind of a wrapper

As a side note - the caller context in the git command is really useful, so thanks for that :)
[screenshot]


The SpanExporter interface consists of two methods:

  • ExportSpans(context.Context, []trace.ReadOnlySpan) error, which is called by the relevant OTel component (a SpanProcessor here, usually a buffer so the application isn't exporting one span at a time). Its job is to export spans (ones that have ended) to any place you want.
  • Shutdown(context.Context) error, which is the cleanup method, after which the exporter should not process anything.

So I was thinking that we could have an internal exporter which exports spans into memory if the feature is on, similar to what you proposed in the draft but using the regular library.
Putting it into a flow: span -> Processor -> Exporter -> Memory -> /perftrace page

I hope I've made myself clearer... even if only a small bit.


My hacky plug
func (m *memoryMsgRecorder) ExportSpans(_ context.Context, spans []trace.ReadOnlySpan) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, t := range spans {
		sb := strings.Builder{}

		SpanToString(t, &sb, 2)
		m.msgs = append(m.msgs, &MsgRecord{
			Time:    time.Now(),
			Content: sb.String(),
		})
		if len(m.msgs) > m.limit {
			m.msgs = m.msgs[len(m.msgs)-m.limit:]
		}

	}
	return nil
}
func SpanToString(t trace.ReadOnlySpan, out *strings.Builder, indent int) {

	out.WriteString(strings.Repeat(" ", indent))
	out.WriteString(t.Name())
	if t.EndTime().IsZero() {
		out.WriteString(" duration: (not ended)")
	} else {
		out.WriteString(fmt.Sprintf(" duration: %.4fs", t.EndTime().Sub(t.StartTime()).Seconds()))
	}
	out.WriteString("\n")
	for _, a := range t.Attributes() {
		out.WriteString(strings.Repeat(" ", indent+2))
		out.WriteString(string(a.Key))
		out.WriteString(": ")
		out.WriteString(a.Value.AsString())
		out.WriteString("\n")
	}
	// for _, c := range t.Parent()..children {
	// 	span := c.internalSpans[t.internalSpanIdx].(*traceBuiltinSpan)
	// 	span.toString(out, indent+2)
	// }
}
func (m *memoryMsgRecorder) Shutdown(_ context.Context) error {
	log.Warn("Shutdown has been called!")
	return nil
}

// Plus small hack in my branch
func setupTraceProvider(ctx context.Context, r *resource.Resource) (func(context.Context) error, error) {
	var shutdown func(context.Context) error
	switch setting.OpenTelemetry.Traces {
	case "otlp":
		traceProvider := sdktrace.NewTracerProvider(
			sdktrace.WithSampler(setting.OpenTelemetry.Sampler),
			// the in-memory trace recorder is registered as a regular span exporter here
			sdktrace.WithBatcher(tailmsg.GetManager().GetTraceRecorder()),
			sdktrace.WithResource(r),
		)
		otel.SetTracerProvider(traceProvider)
		shutdown = traceProvider.Shutdown
	default:
		shutdown = func(ctx context.Context) error { return nil }
	}
	return shutdown, nil
}

Full code on: https://github.com/TheFox0x7/gitea/tree/otel-mods

@TheFox0x7
Contributor Author

I added the Prometheus exporter with otelchi-generated metrics as a porting test, and as fun as otelchi is for a quick start, I don't think this:
gitea_response_size_bytes_count{http_method="GET",http_scheme="http",net_host_name="Gitea",net_host_port="3000",net_protocol_version="1.1",net_sock_peer_addr="127.0.0.1",net_sock_peer_port="57834",otel_scope_name="github.com/riandyrn/otelchi/metric",otel_scope_version="0.1.0",user_agent_original="Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0"} 13
is something we're interested in (the user agent addition is... interesting).
But it is doable, and porting the current metrics over is fairly simple.

Unfortunately this doesn't remove the prometheus package like I thought it would, so there goes my "removing a dependency" argument.

@TheFox0x7
Contributor Author

A bit of an update:
My current idea is to use OTel for general tracing and export the spans either over HTTP/gRPC to a remote collector, or with some built-in exporter, to get some insight without pulling in Jaeger or the like. Some smaller open threads with this approach:

  1. Should the internal exporter be HTTP/gRPC and just push to some endpoint in Gitea (API or not)? Then some level of integration with the runner could be achieved. This could be done later, as the exporter is purely internal.
  2. Is persistence something we'd want to consider? Saving a trace to the database, for example.
  3. Should the internal trace UI be available when the user wants to use an external collector, or should it be disabled?

And a rather large one:
How do we make traces work for issue reports? That's a rather large and core issue here. There isn't a standard JSON format for a full trace, so every backend (I tested two) makes its own format for download.
The way I hoped to make this work is to have an admin UI for tracing which would display the traces, and if someone wants to add one to a report they could download the JSON and attach it. But if that JSON could only be uploaded to a single backend (or two, if we pick one to align with), that's kind of pointless.
In theory the raw OTLP output could be saved to a .json file and replayed to upload it, but I've yet to see anything that can read that output. The collector has a receiver for this and otel-tui uses it.

What I currently have is something like this:
[screenshot]
But it doesn't preserve the hierarchy and is not nice to look at. I've also found the following projects which might be of interest - for reference or for the UI side.
https://github.com/CtrlSpice/otel-desktop-viewer
https://github.com/ymtdzzz/otel-tui

I've also discovered that OTel traces use runtime/trace.Task underneath, which is nice. The bad part is that (at least to me) it's not in any way more readable or helpful.

@wxiaoguang
Contributor

A bit of an update:
My current idea is to use OTel for general tracing and export the spans either over HTTP/gRPC to a remote collector, or with some built-in exporter, to get some insight without pulling in Jaeger or the like.

How much larger would it make the binary? We have removed many dependencies in "Gitea binary size #33023" to reduce the binary size by many megabytes.

@TheFox0x7
Contributor Author

About 2.4-2.6MB for the basic version (with external exporters only), and up from there; most of it comes from gRPC. The hacky UI version is 2.9MB larger than main.

Though it turns out that using the collector's code is probably off the table. It pulls in twice that size and I haven't done anything useful with it. Also it pulls in a deprecated library which takes 1.1MB in the final build.

@techknowlogick
Member

Also it pulls in a deprecated library which takes 1.1MB in the final build.

Which library is the deprecated one?

The hacky UI version is 2.9MB larger than main.

Is this even with -s -w ldflags, to trim down the binary, on the final build?

@TheFox0x7
Contributor Author

Which library is the deprecated one?

github.com/gogo/protobuf, which gets pulled in by go.opentelemetry.io/collector/pdata/ptrace. To be fair, I haven't made any use of it so far, so I'm not even sure whether a regular collector build would end up pulling it in. I pulled it in because I wanted to base the local UI on otel-tui, but so far I've managed without it.

Is this even with -s -w ldflags, to trim down the binary, on the final build?

Yes.

@TheFox0x7
Contributor Author

I figured that before trying to work out the span assembly I'd work on some routes to trace, so there would be a starting point, and well... this will be more complicated than I originally thought. Mainly due to the fact that the context.Context is wrapped, so the normal propagation system won't work as usual...

No idea how to deal with this for now. I think the proper thing to do is to refactor the context to be the standard library one, per the Go wiki, but that's a lot of work and I'm not sure how doable it would be in the first place.
Unless I'm missing an easy way to do it. I'm not that advanced in Go.

func ViewIssue(ctx *context.Context) {
	stdCtx, span := tracer.Start(ctx, "ViewIssue")
	// without making ctx use the new context, the span won't propagate properly, so all db calls
	// won't be attached to ViewIssue but to the middleware-level instrumentation.
	defer span.End()
	// ...
}

@wxiaoguang
Contributor

wxiaoguang commented Jan 15, 2025

I figured that before trying to work out the span assembly I'd work on some routes to trace, so there would be a starting point, and well... this will be more complicated than I originally thought. Mainly due to the fact that the context.Context is wrapped, so the normal propagation system won't work as usual...

No idea how to deal with this for now. I think the proper thing to do is to refactor the context to be the standard library one, per the Go wiki, but that's a lot of work and I'm not sure how doable it would be in the first place. Unless I'm missing an easy way to do it. I'm not that advanced in Go.

For this problem: we should use the chi router's pattern path, I have used that in the demo #32973

https://github.com/go-gitea/gitea/pull/32973/files#diff-6c413ec64ebadd1b4f2778a41d4cf9d60b8682a8adf62f957e3b0f12aad5ac14R54-R58

(it's also not complete yet, need to cover some edge cases)
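For reference, a minimal sketch of naming the root span after the matched chi route pattern (this assumes the chi v5 API and an otel tracer; it is not the code used in #32973):

package web

import (
	"net/http"

	"github.com/go-chi/chi/v5"
	"go.opentelemetry.io/otel"
)

var tracer = otel.Tracer("gitea/web") // assumed tracer name

// TraceMiddleware starts one span per request and renames it to the matched route
// pattern (e.g. "/{username}/{reponame}/issues/{index}") once routing has happened.
func TraceMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, span := tracer.Start(r.Context(), r.Method+" "+r.URL.Path)
		defer span.End()
		next.ServeHTTP(w, r.WithContext(ctx))
		// chi mutates the same RouteContext while routing, so the pattern is available afterwards
		if rctx := chi.RouteContext(r.Context()); rctx != nil && rctx.RoutePattern() != "" {
			span.SetName(r.Method + " " + rctx.RoutePattern())
		}
	})
}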


By the way, I still have some concerns about introducing the OpenTelemetry dependency at the moment:

  • binary size is one concern: the builtin tracer would increase the size by only kilobytes, not megabytes
  • how to trace what we want: Gitea has too many internal details; we can use the builtin tracer to figure out all the problems first, then it should be easier to introduce other tracers.
  • how to make end users really use it (per the discussion above, I still hope we can let end users report performance problems out of the box, with one click)

(ps: there are some high-priority TODOs on my side, so I haven't made new progress on #32973)

@TheFox0x7
Contributor Author

For this problem: we should use the chi router's pattern path, I have used that in the demo #32973

This does mostly what otelchi does and doesn't solve the problem completely. This makes the entire route a root span and allows for operations in that route to be part of that trace - which is what we want.
This doesn't make grouping of operations possible though; I ended up with a root span for the route with multiple child spans inside it, all labeled as some git operation or database call. Now let's say I want to group them by which operations are part of the ViewIssue function. I cannot do it, because that requires the context to be passed down. See the image below; note that the hierarchy is flat.

[screenshot]

So I end up with the rather meaningless information that ViewIssue is part of the route and took some time - not why it took that time.


binary size is one concern: the builtin tracer would increase the size by only kilobytes, not megabytes

Fully understandable; I'm not too happy about it either, but unfortunately there's not much I can do about it, unless an alternative implementation of the SDK exists, which I haven't found. Most of the size comes from google.golang.org/grpc (+946kB), go.opentelemetry.io/otel (+651kB) and .rodata (+576kB).

how to trace what we want: Gitea has too many internal details; we can use the builtin tracer to figure out all the problems first, then it should be easier to introduce other tracers.

This can be done regardless of the tracer though? Just mark the feature as beta until we figure out all the paths to trace.

how to make end users really use it (per the discussion above, I still hope we can let end users report performance problems out of the box, with one click)

I thought about it too. Traces can be exported as JSON, but it's not standardized, so that's kind of a dead end. Beyond using runtime/trace and writing the output to a diagnostics file, or writing the OTLP to a JSON file, I have no good ideas for now.

@wxiaoguang
Contributor

wxiaoguang commented Jan 15, 2025

For this problem: we should use the chi router's pattern path, I have used that in the demo #32973

This does mostly what otelchi does and doesn't solve the problem completely. This makes the entire route a root span and allows for operations in that route to be part of that trace - which is what we want.

If you'd like to trace every middleware and endpoint handler's name, there is routing.GetFuncInfo, which is used to trace all function names, and routing.UpdateFuncInfo, which is used to record the names; these names will be shown in the logs. I could propose a demo later.

@TheFox0x7
Contributor Author

TheFox0x7 commented Jan 15, 2025

For this problem: we should use the chi router's pattern path, I have used that in the demo #32973

This does mostly what otelchi does and doesn't solve the problem completely. This makes the entire route a root span and allows for operations in that route to be part of that trace - which is what we want.

If you'd like to trace every middleware and endpoint handler's name, there is routing.GetFuncInfo, which is used to trace all function names, and routing.UpdateFuncInfo, which is used to record the names; these names will be shown in the logs. I could propose a demo later.

I'm not sure how this would help. The span names are fine (except for the database, but that's a different issue). The fundamental issue is about relations: the parent span cannot be derived from the context if the context is a custom one.

[screenshot]

In the above image I've tried to work with the span-carrying context to show the propagation issue. What you can gather from it is:

  1. ViewIssue takes the longest to run
  2. about half of that time is spent in Render
  3. MustAllow takes a quarter of it
  4. Git gets called... somewhere along the path. Most likely as part of ViewIssue, but it isn't attached to it, as the context it ends up getting is from the router level.

The lack of propagation makes the db tracing kind of pointless, since the queries aren't linked to any operation.
Of course I could go down the code manually and see which function ends up calling git, but at that point I might as well just print timings to the console.
Hopefully I've made the issue clearer.

I'll go work on the Prometheus metrics in the meantime, because as far as I can see, there's no way to avoid decoupling the standard context.Context from Gitea's custom context.

@wxiaoguang
Contributor

wxiaoguang commented Jan 15, 2025

I'm not sure how this would help. The span names are fine (except for the database, but that's a different issue). The fundamental issue is about relations: the parent span cannot be derived from the context if the context is a custom one.

It helps. See my new commit 94816bf :

[screenshot]

@TheFox0x7
Contributor Author

It didn't really want to trace any endpoint other than /user/events, but I finally got a /{username}/{reponame}/{type:pulls}/{index} to show up.
The first few nestings are fine, but then you hit the opposite issue (at least the data looks like you do): GetNotificationCount and GetActiveStopwatch should be on the same level, as one doesn't call the other. No idea how accurate it is further down.

However, I still don't see how this helps with tracking spans via the context, which is the issue.


If the overall consensus here is that OTel is too big to be included by default, can we at least agree to use the general API of go.opentelemetry.io/otel (+827kB with my starting point)? Never mind - this doesn't guarantee the same behavior, so an implementation working on the internal spans might not work on the OTel ones.

The route this is taking is starting to be orthogonal to my proposal, and I think it will make it harder to include the distributed version, which would be an issue for me.

Also related to the binary size: open-telemetry/opentelemetry-go#2579

@wxiaoguang
Contributor

wxiaoguang commented Jan 16, 2025

The first few nestings are fine, but then you hit the opposite issue (at least the data looks like you do): GetNotificationCount and GetActiveStopwatch should be on the same level, as one doesn't call the other. No idea how accurate it is further down.

No, it is not the opposite issue; it is what the code really does, because GetNotificationCount/GetActiveStopwatch are middlewares. Every middleware is traced as

parentMiddleware
    preServeNext   <--- here you see the SQL.
    callNext       <--- here could be next handler like ViewIssue
    postServeNext  <--- for most cases, here is no-op

But GetNotificationCount is more special because it defers the "ctx" call to a template function; you can read its code to see why.

In short: the new trace code is right. If it looks ugly (or unexpected), it is the existing code's problem.

@wxiaoguang
Contributor

The route this is taking is starting to be orthogonal to my proposal, and I think it will make it harder to include the distributed version, which would be an issue for me.

I do not see which part is becoming orthogonal. If you use our gtprof tracer framework, everything should work with the OpenTelemetry SDK. Just register your tracers with globalTraceStarters = append(globalTraceStarters, &traceOpenTelemetryStarter{})

@TheFox0x7
Contributor Author

TheFox0x7 commented Jan 16, 2025

No, it is not the opposite issue; it is what the code really does, because GetNotificationCount/GetActiveStopwatch are middlewares.

You're correct. I looked at them and figured that since they are next to each other they shouldn't be nested in each other, while ignoring that they are middlewares, despite having written one before. I'm sorry about that.

I do not see which part is becoming orthogonal. If you use our gtprof tracer framework, everything should work with the OpenTelemetry SDK. Just register your tracers with globalTraceStarters = append(globalTraceStarters, &traceOpenTelemetryStarter{})

See my review of your proposed system. It does work, but it's limited, and the propagation issue remains.


I understand the size concern; it's completely valid. I understand the need for it to work for local users too, and I fully agree that it is useful. But this doesn't cover the opposite end: users who are willing to start a single Jaeger instance and collect data.

Why not standardize on https://pkg.go.dev/go.opentelemetry.io/otel/trace? That way we at least have a common API and can swap implementations easily.
The OTel exporters can be put behind a compilation flag to avoid bloat (see the sketch below), but at least it's a standard and portable implementation which works the same for both systems.

Here's a version without any exporters. It does nothing until you configure a tracer; otelchi is only used for simplicity, since it has route naming figured out properly, while it would take me a while to do the same.
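As a sketch of what "behind a compilation flag" could look like (the build tag name and the package are hypothetical; the default build would compile a counterpart file without this tag and keep the no-op tracer, pulling in none of the OTLP/gRPC dependencies):

//go:build otel_exporter

package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// Only built with `go build -tags otel_exporter`.
func init() {
	exp, err := otlptracehttp.New(context.Background()) // honors the OTEL_EXPORTER_OTLP_* environment variables
	if err != nil {
		return
	}
	otel.SetTracerProvider(sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp)))
}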
