-
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for OpenTelemetry #32866
Comments
Please have more discussion here before proposing a pull request. Traditionally, I would like to collect logs from the disk log file rather than sending logs to the remote directly. I haven't read all the documentation of OpenTelemetry but I want to know how it works for logs. |
I think it needs more discussion. If there is a chance, I'd like to totally decouple the 3rd dependency to avoid bloating Gitea binary (it is large enough now, and it is executed by every git operation .....) There are various "tracing" or other tools and there will be more, if we added this, then should or shouldn't we add others? So before starting the work, I'd like to clarify:
|
For starters, I hoped for some discussion before continuing forward - I have few outstanding issues (mainly how to handle ssh connections and how to register the exporters with manager because there's probably a way better solution for that) which I want to discuss. Firstly keep in mind that all 3 of the components are optional by design, you can have noop versions of all 3 with 0 code change.
I haven't yet connected logs to it in any app besides python, but in general it doesn't stop you from gathering logs on disk, to stdout/err or to syslog. What it does is provide a vendor independent way to export logs (which is just gathers on the fly as well), to systems supporting the OpenTelemetry Protocol (OTLP), via grpc or http.
The idea with opentelemetry is to provide a single standard for signals, so it shouldn't be required to add more.
Roadmap probably would be useful yeah. As for the usefulness, it might vary from case to case but I'll refer you to grafana's blog post about it which I think shows why it can be useful pretty well.
That's the idea and it shouldn't be a breaking change since golang SDK supports prometheus style export aside from OTLP.
Tracing is mainly for operations (cross or in app) - for example if you select the branch view and it takes a while to load, you might be interested why - was it the template taking weirdly long, git being slow for some reason, nothing was cached or maybe database took a while to respond?
Logging as mentioned before is experimental in golang SDK and I'm still kind of lost how it actually works (I mainly never tried it). Migrating to
That is basically how the current log system can be gathered (except the JSON part) if one wants. So yes it is possible - with a very similar system as prometheus. JSON logs for that would be something of interest to me. |
Thank you for the details. I also have some experiences for "app performance tracing" (I used to build a web app serving millions of user per day and the QPS is around thousands), I also used some tracing tools and built some tracing frameworks, so I could fully agree with the ideas behind OpenTelemetry. I also used GitLab for long time, GitLab packed a whole grafana into their deployment, and they have their own diagnosis tools. So if let me design a roadmap for Gitea, I think the key problem is how to benefit most end users if we introduce a new mechanism.
Then the question is: how to satisfy these requirements? Gitea needs tracing tools, and end users need to report performance problems conveniently. In my mind, the roadmap could be like this:
|
I agree it's not feasible to expect someone who wanted a simple git server to setup prometheus (metrics), loki/logstash (logs) and tempo/jaeger (traces) plus possibly pyroscope (pprof). This feature mostly targets people who would want to have observable application when using multiple components or are looking to diagnose such issues. But I sort of think you're misunderstanding the idea. The entire thing is optional by design - so you COULD use it but it's not mandatory. If you want to have a system which would work without any collectors and so on - it could be done by adding an exporter (either dedicated or the debug console one). As for gitlab comparison - the minimal possible deployment won't change. The protocol is already a "shim" of sorts, as you can add custom exporters to it or just ingest data in your own way. quick demo of how it works in case someone wants to see it. |
To prevent from misunderstanding, let's take some examples.
|
Okay I get it now, thanks a lot. I haven't considered an embedded use case. In my case it would be slightly more complicated:
On the other hand I have no clue how would it look like. It's not like one (to my knowledge) could enable it from admin panel, then get one particular endpoint traced end to end without having propagation which the standard does provide. Also I think that making an interface for it would be harder to maintain and integrate with later. So I still think it would be better to have the native support and think of a way to make it simple to configure and gather as intended or make an custom exporter which will provide the use case which you're talking about. Regardless it is an interesting use case, I'll admit. Full disclosure: I'm obviously biased to having it natively implemented instead of having to shim support onto base. |
After 1.23 gets a stable release (maybe in 1 or 2 weeks), I think we could start some initial work for the performance tracing. Then we could try to get OpenTelemetry support, either source-code-level optional (build from source code/tag), or config-level optional (enable by config option), or mandatory. |
I would suggest making it a configuration feature. It would be always "on" but just default to noop. I took a really quick glace at the std |
I started some initial work here: https://github.com/wxiaoguang/gitea/commits/support-trace/ Quite simple, it introduces our shim-layer and builtin tracer. The interface is designed to be OpenTelemetry-like. Will try to do some improvements and make it really work before proposing a PR: the real challenge is how to correctly trace the performance problems and show useful result to end users (especially in the db and git package). Maybe it would take about one week. Update: PR: Support performance trace #32973 |
It looks like it would work with the standard. As long as semantic conventions will be followed for metadata keys/values and I would be able to replace tracer with otel one without breaking stuff I don't think I'll have objections. Db would probably be best handed in xorm via otel as well (seeing that grafana uses xorm they might consider helping there), but semconv exist for client side, and it might be doable with either session or struct events but I haven't looked that deep. For git I modified the run command, which worked surprisingly well as a quick "catch all" option, with a downside of spans lacking good context (that required diving into the span and seeing what command was called), which I started to solve by having the general command be traced too: func GetReverseRawDiff(ctx context.Context, repoPath, commitID string, writer io.Writer) error {
ctx, span := tracer.Start(ctx, "GetReverseRawDiff")
defer span.End()
stderr := new(bytes.Buffer)
cmd := NewCommand(ctx, "show", "--pretty=format:revert %H%n", "-R").AddDynamicArguments(commitID)
// rest of the code
}
func (c *Command) Run(opts *RunOpts) error {
_, span := tracer.Start(c.parentContext, c.args[c.globalArgsLength], trace.WithAttributes(
semconv.ProcessCommand(c.prog), semconv.ProcessCommandArgs(c.args...),
), trace.WithSpanKind(trace.SpanKindInternal))
defer span.End()
// rest of the function plus marking errored spans
} Side note so I won't forget about it, calls to |
FYI: I've published what I've made to date on my fork. To turn it on: [opentelemetry]
ENABLED=true and |
Hmm ... it seems somewhat cumbersome to add a lot of ps: I am still working on some refactoring PRs, I think we could have enough preparation first, then the following work could be easier. Update: in Refactor request context #32956 , legacy hacky code has been removed and now we have a clear RequestContext, this context is worth to trace. |
Thank you for this issue. As for the related code you linked to, due to its licensing, I have a personal policy of not looking at that repo to ensure that I don't violate any of the licenses. |
I am still working on some preparations, so please review and approve my recent refactoring PRs, then we will have the builtin performance tracer soon. Then, at least, advanced users could introduce their tracers from source code level. |
@techknowlogick You can ignore it. I'm the author of the entire thing and I did pick EUPL (since I wanted to ensure the telemetry part is public), but seeing as I want to add it to gitea and work on this as well I dropped that requirement. It's also less than stellar - main things which work well are settings and basic tracing, the actual service start is something that works but it's probably misusing the graceful service system somewhat. |
I am sure we need to do it by ourselves: #32866 (comment)
If you read my recent PRs, then you could know that: no sad, no need to make it complicated. I have a complete design to make the system right and lightweight, but I need to do some refactorings first. |
I have pushed my work for "builtin tracer": WIP: Support performance trace #32973 . It has initial http/db/git support, and a simple list in admin UI |
@TheFox0x7 thanks! I'll checkout your code in the link above. I'm sure your statement here is sufficient, and enough people get emails from this tracker that there is some record of your statement should something catastrophic happen to GitHub (knock on wood it doesn't). @wxiaoguang thanks! I hadn't checked out your code yet. The code in my PR was stashed in my local repo for a while and figured it'd be better to push it up since it can be added to this discussion. |
Sorry but I didn't get the point .... could you elaborate? I think our shim-layer won't implement any "real" OpenTelemetry interfaces, because the OpenTelemetry's interfaces have quite heavy and complex dependencies. So by current design, users could implement the "traceStarter" and register it to operate on the our spans directly. If you think "SpanExporter" is useful, could you provide some pseudocode to describe how to use it? |
In my mind the otel integration could be like this: type traceOtelStarter struct{}
type traceOtelSpan struct {
otelSpan otel.Span
}
func (t *traceOtelSpan) end() {
t.otelSpan.End()
}
func (t *traceOtelStarter) start(ctx context.Context, traceSpan *TraceSpan, internalSpanIdx int) (context.Context, traceSpanInternal) {
ctx, span := tracer.Start(ctx, traceSpan.name)
return ctx, &traceOtelSpan{otelSpan: span}
}
func init() {
globalTraceStarters = append(globalTraceStarters, &traceOtelStarter{})
} |
Putting it into words turns out way harder than I thought but: The way I see it right now is that we could make taillog into an exporter, making the integration more natural IMO, with the yet to be solved part of assembling a trace from spans. I might be stuck a bit on this idea though as mentioned earlier.
As a side note - caller context in git command is really useful, so thanks for that :) SpanExporter interface consists of two methods:
So I was thinking that we could have internal one which would export spans into memory if the feature is on, similarly to how you proposed in the draft but with regular library. I hope I made myself clearer... even if a small bit. My hacky plugfunc (m *memoryMsgRecorder) ExportSpans(_ context.Context, spans []trace.ReadOnlySpan) error {
m.mu.Lock()
defer m.mu.Unlock()
for _, t := range spans {
sb := strings.Builder{}
SpanToString(t, &sb, 2)
m.msgs = append(m.msgs, &MsgRecord{
Time: time.Now(),
Content: sb.String(),
})
if len(m.msgs) > m.limit {
m.msgs = m.msgs[len(m.msgs)-m.limit:]
}
}
return nil
}
func SpanToString(t trace.ReadOnlySpan, out *strings.Builder, indent int) {
out.WriteString(strings.Repeat(" ", indent))
out.WriteString(t.Name())
if t.EndTime().IsZero() {
out.WriteString(" duration: (not ended)")
} else {
out.WriteString(fmt.Sprintf(" duration: %.4fs", t.EndTime().Sub(t.StartTime()).Seconds()))
}
out.WriteString("\n")
for _, a := range t.Attributes() {
out.WriteString(strings.Repeat(" ", indent+2))
out.WriteString(string(a.Key))
out.WriteString(": ")
out.WriteString(a.Value.AsString())
out.WriteString("\n")
}
// for _, c := range t.Parent()..children {
// span := c.internalSpans[t.internalSpanIdx].(*traceBuiltinSpan)
// span.toString(out, indent+2)
// }
}
func (m *memoryMsgRecorder) Shutdown(_ context.Context) error {
log.Warn("Shutdown has been called!")
return nil
}
// Plus small hack in my branch
func setupTraceProvider(ctx context.Context, r *resource.Resource) (func(context.Context) error, error) {
var shutdown func(context.Context) error
switch setting.OpenTelemetry.Traces {
case "otlp":
traceProvider := sdktrace.NewTracerProvider(
sdktrace.WithSampler(setting.OpenTelemetry.Sampler),
// Here
sdktrace.WithBatcher(tailmsg.GetManager().GetTraceRecorder()),
sdktrace.WithResource(r),
)
otel.SetTracerProvider(traceProvider)
shutdown = traceProvider.Shutdown
default:
shutdown = func(ctx context.Context) error { return nil }
}
return shutdown, nil
} Full code on: https://github.com/TheFox0x7/gitea/tree/otel-mods |
I added prometheus exporter with otelchi generated metrics as a port test and as fun as otelchi is for quick start I don't think this: Unfortunately this doesn't remove prometheus package like I thought it would so there goes my "removing dependency" argument. |
bit of an update:
And a rather large one: What I currently have is something like this: I've also discovered that otel trace use |
How many bytes would it make the binary larger? We have removed many dependencies for " Gitea binary size #33023 " to reduce the binary size for many mega-bytes. |
About 2.4-2.6MB for basic version (with external exporters only) and up from there. Most of this coming from grpc. The hacky UI version is 2.9MB larger than main. Though turns out using collectors code is probably of the table. It pulls twice that size and I haven't done anything useful with it. Also it pulls deprecated library which takes 1.1MB in final build. |
Which library is the deprecated one?
Is this even with |
Yes. |
I figured that before trying to figure the span assembly I'll work on some routes to trace so there would be some a starting point and well... this will be more complicated than I originally thought. Mainly due to the fact that the No idea how to deal with this for now. I think proper thing to do is refactor the context to be standard library one per go wiki but that's a lot of work and I'm not sure how doable it would be in the first place. func ViewIssue(ctx *context.Context){
stdctx, span := tracer.Start(ctx, "ViewIssue")
// without making ctx use new context the span won't propagate properly, so all db calls won't be attached to ViewIssue, but to the middleware level instrumentation.
defer span.End()
//...
} |
For this problem: we should use the chi router's pattern path, I have used that in the demo #32973 (it's also not complete yet, need to cover some edge cases) By the way, I still have some concerns of introducing the opentelemetry dependency at the moment:
(ps: there are some high priority TODOs on my side, so I haven't made new progress for #32973) |
This does mostly what otelchi does and doesn't solve the problem completely. This makes the entire route a root span and allows for operations in that route to be part of that trace - which is what we want. So I end up with meaningless information that
Fully understandable, I'm not too happy about it too but not much I can do with that unfortunately. Unless alternative implementation of the SDK exists, which I haven't found. Most of the size comes from
This can be done regardless of tracer though? Just mark the feature as beta until we figure out all paths to trace.
I thought about it too. Traces can be exported as json, but it it's not standardized so that's kind of a dead end. Beyond usage of |
If you'd like to trace every middleware&endpoint handler's names, there is |
I'm not sure how this would help. The span names are fine (except for the database but that's a different issue). The fundamental issue is about relations, the parent span cannot be derived from context if the context is custom. In the above image I've tried to work with the context with span to show the propagation issue. What you can gather from it is:
Lack of propagation makes the db tracing kind of pointless since they aren't getting linked to any operation. I'll go work on prometheus metrics in the meantime, because as far as I see, there's no way to avoid having to decouple |
It helps. See my new commit 94816bf : |
It didn't really want to trace any endpoint aside of However I still don't see how this helps with tracking spans via context which is the issue.
The route this is taking starts to be orthogonal to my proposal and I think it'll make it harder to include distributed version, which would be an issue to me. Also related to the binary size: open-telemetry/opentelemetry-go#2579 |
No, it is not the opposite issue , it is the code really does, because
But In short: the new trace code is right. If it looks ugly (or unexpected), it is the existing code's problem. |
I do not see which part starts to be orthogonal , if you use our |
You're correct. I looked at them and figured that they are next to each other so they shouldn't be included in each other, while ignoring that they are middlewares despite writing one before. I'm sorry for that.
See my review on your proposed system. It does work but it's limited and propagation issue remains. I understand the size concern, it's completely valid. I understand the need for it to work for local users too and I fully agree that it is of use. But this doesn't cover the opposite end of users who are willing to start a single jaeger instance and collect data. Why not standardize on https://pkg.go.dev/go.opentelemetry.io/otel/trace? That way we at least have common api and can swap implementations easily. Here's a version without any exporters It does nothing until you configure a tracer. otelchi is only used for simplicity since it has route naming figured out properly while it would take a bit for me. |
Feature Description
OpenTelemetry is an open standard designed to unify signals coming from an application. It allows for tracing requests and operations, sending logs from the application and gathering and sending prometheus metrics.
Potentially it would allow for removing prometheus in favor of opentelemetry (go library does have setting for prometheus style endpoint though I've yet to test it) and having the option to find long running operations quicker with tracing.
I'd like to work on this (though it might take a while) and I have a "demo" of such integration here, with only tracing support.
Screenshots
No response
The text was updated successfully, but these errors were encountered: