
vmihailenco · on Sept 6, 2022

Thanks for the feedback!

>- Are there any recommendations for scaling (e.g. benchmarks) on how many spans/s are supported on what hardware?

With Uptrace, I was able to achieve 10k spans/second on a single core by running the binary with GOMAXPROCS=1. That is 1-3 terabytes of compressed data each month, which is more than most users need.
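
For a rough sense of scale, the arithmetic behind those numbers can be sketched as follows. The 30-day month and decimal terabytes (1 TB = 1e12 bytes) are assumptions made for the illustration, not figures from the comment. Under them, 10k spans/second comes to about 26 billion spans a month, so 1-3 TB implies roughly 39 to 116 compressed bytes per span.

```go
package main

import "fmt"

const (
	spansPerSecond = 10_000       // rate quoted in the comment
	secondsPerDay  = 24 * 60 * 60 // 86,400
	daysPerMonth   = 30           // assumption: a 30-day month
)

// spansPerMonth returns how many spans a steady per-second rate produces in one month.
func spansPerMonth(rate int64) int64 {
	return rate * secondsPerDay * daysPerMonth
}

// bytesPerSpan returns the average compressed span size implied by a monthly
// storage total given in terabytes (assumption: 1 TB = 1e12 bytes).
func bytesPerSpan(terabytes float64, spans int64) float64 {
	return terabytes * 1e12 / float64(spans)
}

func main() {
	spans := spansPerMonth(spansPerSecond)
	fmt.Printf("spans per month: %d\n", spans)
	fmt.Printf("bytes per span at 1 TB/month: %.1f\n", bytesPerSpan(1, spans))
	fmt.Printf("bytes per span at 3 TB/month: %.1f\n", bytesPerSpan(3, spans))
}
```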

Practically, you are limited by the $$$ you are willing to spend on ClickHouse servers, not by Uptrace ingestion speed.

So my recommendation is to scale Uptrace vertically by throwing more cores at it. That will allow you to go very very far.

>- Is there support for SSO for self-hosted installation?

So far the only way to add new users is via the YAML config. We are considering adding a REST API or a CLI tool for the same purpose, but it is not clear how that would work with the YAML.

Regarding SSO, it would be nice if you could point to an app that already does that so we can better estimate the complexity. But so far we don't have such plans.

>- Can I easily disable certain features? (e.g. alerting)

Yes, most YAML sections can simply be removed / commented out to disable the feature.

derN3rd · on Sept 6, 2022

Thanks for the answer.

I have no other good examples for SSO than Grafana.

But something I would love to see more of for logins is Application Tokens. We use Cloudflare Access for team-related logins, which sends such a token in a header, so the application can use it to authorize a user and enable/disable features based on the groups the user is in.

https://developers.cloudflare.com/cloudflare-one/identity/au...

This solved our need for multiple sign-in options, as everything is now managed through Cloudflare Access, but this is obviously not a solution for everyone.

vmihailenco · on Sept 6, 2022

We already use JWT tokens that are passed in an HTTP cookie, so perhaps we could document how it works and let users sign JWT tokens themselves. That way your app only needs to set a cookie and the user should be authorized.

Let's continue the discussion on GitHub https://github.com/uptrace/uptrace/issues/76

vmihailenco · on Sept 6, 2022

You can ingest data using OpenTelemetry Protocol (OTLP), Vector Logs, and Zipkin API. You can also use OpenTelemetry Collector to collect Prometheus metrics or receive data from Jaeger, X-Ray, Apache, PostgreSQL, MySQL, and many more.

The latest Uptrace release introduces support for OpenTelemetry Metrics which includes:

- User interface to build table-based and grid-based dashboards.

- Pre-built dashboard templates for Golang, Redis, PostgreSQL, MySQL, and host metrics.

- Metrics monitoring aka alerting rules inspired by Prometheus.

- Notifications via email/Slack/PagerDuty using AlertManager integration.

There are 2 quick ways to try Uptrace:

- Using the Docker container - https://github.com/uptrace/uptrace/tree/master/example/docke...

- Using the public demo - https://app.uptrace.dev/play

I will be happy to answer your questions in the comments.

0JzW · on Sept 6, 2022

this looks amazing! i would definitely like to use this for log monitoring. however, i have a question. is it possible to get logs for individual docker containers?

vmihailenco · on Sept 6, 2022

It is possible using Vector Logs which Uptrace supports out-of-the-box, for example:

- https://vector.dev/docs/reference/configuration/sources/dock...

- https://uptrace.dev/get/ingest/vector.html

If you are having trouble making it work, feel free to open an issue on GitHub and I will provide a complete example.

vmihailenco · on Dec 29, 2021

So far my experience is that it is best to avoid trying to solve such problems with a query language and instead provide a much simpler UI to achieve the same. Solving such problems with SQL is tedious enough and learning another custom language is not fun / too much to ask from users.

Sometimes using a UI is not possible, for example, if you want to automate such checks. In that case, I would build a custom metric or two and would use that metric for monitoring purposes. That requires some programming / instrumentation, but it still looks like a better solution to me.

vmihailenco · on Dec 29, 2021

Those are 2 separate projects and they don't work together. I still have not had a chance to try Loki / Tempo, so I can't say how well they work in practice...

vmihailenco · on Dec 29, 2021

"But" do you need another storage? :)

I understand that it is not ideal to have so many competing tools, but contributing to an existing mature project is a nightmare. It is far easier to start a new one.

>Jaeger UI is anyway not the main tool people tend to work with.

Which tools / features do you have in mind?

Uptrace OS competes with Jaeger / Zipkin / SigNoz / SkyWalking and I believe it already does a pretty good job.

rad_gruchalski · on Dec 29, 2021

We use a custom elastic storage on steroids with Keycloak integration for enabling multi-tenancy in Jaeger so we can do SLA tracking and reporting. So the answer is yes.

I get your point about contributing, especially features that are incompatible with the maintainer vision. Feature creep, right?

What I value in open source projects is extensibility. Plugins which one can maintain outside of the main product.

> "But" do you need another storage? :)

I’m only saying that it’s possible. I might not need it but if someone does and they want to self host it as a managed solution, it can be done right in Jaeger.

> Which tools / features do you have in mind?

The default Jaeger UI isn’t really ergonomic. Trace info is more useful in the context of other information. As in, tools pulling trace info out of storage and overlaying on other data. There’s also Grafana Tempo.

vmihailenco · on Dec 29, 2021

Thanks for the answer - that all makes sense... "but" :) there is also some sense in NOT having to support different storages / plugins API and instead supporting more features that work out of the box. At least I hope so.

I will try Grafana Tempo & Loki - thanks for the reminder. It is just that all their products look the same as the "original" Grafana for metrics, which TBH does not look especially ergonomic either... Somehow Jaeger still comes first when people talk about tracing.

rad_gruchalski · on Dec 29, 2021

> there is also some sense in NOT having to support different storages / plugins API and instead supporting more features that work out of the box

I humbly do not agree. There’s always going to be an edge case which your features aren’t going to support and having the ability to roll out a plugin is the fastest way to integrate. There are also features I may want to keep confidential/proprietary depending on who the client is. In some cases even the existence of a certain feature in a certain shape gives away what a client is doing.

I’m generally staying away from software without extensibility opportunities. These types of solutions almost always end up as part of a larger infrastructure so extensibility is important.

vmihailenco · on Dec 29, 2021

Maybe. Are you already using that feature in production? Buffers have been available for years and just work. It is hard to say how async inserts perform in real applications.

pachico · on Dec 30, 2021

I also use the Buffer engine, don't get me wrong. The only reason I'm not using async writes is that I only run Altinity-certified versions in my clusters, so I'm waiting for it.

vmihailenco · on Dec 28, 2021

You are right about the license - it allows everything Apache 2.0 allows except reselling monitoring as a service.

vmihailenco · on Dec 28, 2021

Well, I had a delusion that Uptrace would have a clean and simple UI, but I guess for others the UI is just as confusing. :(

I think we've done a pretty good job with filtering+grouping+aggregation and data exploration in general. That is something I am proud of.

Uptrace is significantly cheaper.

As for the rest, it is the same but different. DD is bigger and more complex. I guess that is not the problem when you get used to the UI.

schmurfy · on Dec 31, 2021

We have been using your hosted service for a while now and we are perfectly happy with Uptrace. Compared to Jaeger, the UI is miles ahead. I like Jaeger for what it brought, but the UI is just not very good.

We tried other hosted services but most if not all of them consider you are creating gold from thin air and charge you accordingly, we started with a hosted jaeger solution and switched to uptrace without looking back.

tarun_anand · on Dec 28, 2021

For an open source young project, your UI is pretty good. In fact that is what got me here to comment!

Any mobile client sdk? Android/iOS you are aware of?

vmihailenco · on Dec 28, 2021

OpenTelemetry Java should work with Android, but I haven't really tried it. Java supports true auto-instrumentation so it should not be hard to figure out if you have an Android project.

vmihailenco · on Dec 28, 2021

By cluster support I mean:

- ability to use ReplicatedMergeTree in the table schema

- round-robin writing to multiple nodes

It is mostly a matter of providing configuration options. I thought I could skip it in the first release.

>How are you handling persistent storage?

If you mean avoiding data loss by using ClickHouse cluster, then yes - we use CH cluster and replication :)

ClickHouse handles data corruption surprisingly well - even if there are broken parts, CH continues to serve the rest of the data.

mritchie712 · on Dec 28, 2021

Where is the clickhouse data stored, in the Docker container?

For reference, here's what I'm using: https://github.com/Altinity/clickhouse-operator/blob/master/...

vmihailenco · on Dec 28, 2021

I've provided a Docker example so users can quickly try Uptrace without downloading and configuring anything. I don't use Docker to run anything besides examples so can't give any recommendations.

Thanks for the clickhouse-operator though. I guess we will have to provide something for k8s sooner or later.

mritchie712 · on Dec 28, 2021

Got it. I might be missing something, but if you deployed it this way and needed to rebuild the Docker image, you'd lose all your data, right?

NicoJuicy · on Dec 28, 2021

That's what volumes are for

vmihailenco · on Dec 28, 2021

The data is received via OTLP (OTel protocol) and almost immediately inserted into a ClickHouse Buffer table in small batches. Simple and very efficient.
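
A minimal sketch of that "collect a few spans, then insert" pattern is shown below. It is an assumption-laden illustration rather than the actual ingestion code: the `span` and `batcher` types are hypothetical, the batch size is arbitrary, and the ClickHouse insert is replaced by a callback so the example stays self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// span is a hypothetical, heavily simplified span record.
type span struct {
	TraceID string
	Name    string
}

// batcher collects spans and passes them to flush in batches of at most size spans.
type batcher struct {
	mu    sync.Mutex
	buf   []span
	size  int
	flush func([]span) // stands in for the insert into the buffer table
}

// add appends a span and flushes once the batch is full.
func (b *batcher) add(s span) {
	b.mu.Lock()
	b.buf = append(b.buf, s)
	var full []span
	if len(b.buf) >= b.size {
		full = b.buf
		b.buf = make([]span, 0, b.size)
	}
	b.mu.Unlock()
	if full != nil {
		b.flush(full) // called outside the lock so slow inserts do not block add
	}
}

// flushNow sends whatever is buffered; a real service would also call it on a timer.
func (b *batcher) flushNow() {
	b.mu.Lock()
	rest := b.buf
	b.buf = nil
	b.mu.Unlock()
	if len(rest) > 0 {
		b.flush(rest)
	}
}

func main() {
	b := &batcher{size: 2, flush: func(batch []span) {
		fmt.Printf("inserting %d spans\n", len(batch))
	}}
	for i := 0; i < 5; i++ {
		b.add(span{TraceID: fmt.Sprintf("trace-%d", i), Name: "GET /"})
	}
	b.flushNow()
}
```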

Tail-based sampling will require buffering spans in memory for some time, but tail-based sampling is not implemented yet.

The cloud version also uses Kafka to survive surges in traffic, but I guess the "personal" / company version does not need that as much, so there is no need to introduce an additional dependency.

lma21 · on Dec 28, 2021

For tail-based sampling, does it mean that every process in a trace will keep its spans in memory until the initial process 'ends' the trace? How does the flushing happen (e.g. all processes 'commit' their buffer spans)? Many thanks for the explanations!

vmihailenco · on Dec 29, 2021

Uptrace / Go process will buffer spans in memory for some short period of time (5-15 seconds). It does not work for long traces, but most traces are short.

There is some discussion at https://github.com/open-telemetry/opentelemetry-collector-co...