While the tech these companies built (stateful reconstruction) is quite hard to engineer, large-scale observability platforms like Datadog were much more successful because they could cover a much larger surface area with minimal incremental effort. Even APM companies like New Relic and AppDynamics struggled to justify the value of their well-engineered code-level agents against the large-scale server and process metrics collection done by the Datadog agent.
In addition, these techniques are prone to missed events (partial reconstruction) when the ring buffer overflows, and they are ineffective against SSL-encrypted traffic.
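To make the overflow point concrete, here is a toy model rather than how any particular agent works. It assumes a fixed-capacity buffer that drops new events while full and a consumer that only drains periodically; the function name and parameters are made up for illustration.

```python
from collections import deque


def replay(events, capacity, drain_every):
    """Push events into a fixed-size buffer that a slow consumer drains
    only once every `drain_every` pushes. Returns (delivered, dropped)."""
    buffer = deque()
    delivered, dropped = [], 0
    for i, event in enumerate(events, start=1):
        if len(buffer) >= capacity:
            dropped += 1  # assumption: a full buffer loses the new event
        else:
            buffer.append(event)
        if i % drain_every == 0:
            delivered.extend(buffer)  # consumer catches up
            buffer.clear()
    delivered.extend(buffer)  # final drain
    return delivered, dropped


delivered, dropped = replay(list(range(10)), capacity=3, drain_every=5)
print(delivered, dropped)
```

Any stream reconstructed from `delivered` has gaps wherever the consumer fell behind, which is the partial reconstruction problem.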
Good observation. We were able to prevent issues caused by limited context by sending the entire file in the prompt, and the results were pretty amazing. We eventually reverted the change for two reasons:
1. Limited tokens: 8K tokens is sometimes not enough to hold the entire file. Perhaps 32K tokens (or Claude 2's 100K tokens) can help circumvent this.
2. Cost: GPT-4 is super expensive. Our current usage is roughly $20 a day, but when we sent entire files it shot up to about $60 a day.
Our hope is that both the cost and token limit improve in the future so that we can send the entire file in each review request.
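A rough sketch of the trade-off, not our actual code: include the whole file only when it fits the context window, otherwise fall back to the diff. The 4-characters-per-token estimate, the reserved headroom, and the function names are all assumptions for illustration; a real implementation would use the model's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. This is only a heuristic.
    return (len(text) + 3) // 4


def build_context(file_text: str, diff_text: str,
                  context_limit: int = 8192, reserved: int = 2000):
    """Return (mode, prompt_body). `reserved` leaves room for the
    instructions and the model's reply (hypothetical value)."""
    budget = context_limit - reserved
    needed = estimate_tokens(file_text) + estimate_tokens(diff_text)
    if needed <= budget:
        return "full_file", file_text + "\n" + diff_text
    return "diff_only", diff_text


big_file = "x" * 40_000  # about 10K tokens under the heuristic above
mode_8k, _ = build_context(big_file, "some diff")
mode_32k, _ = build_context(big_file, "some diff", context_limit=32_768)
print(mode_8k, mode_32k)
```

With a larger window the same file fits, but every request then carries the full file's tokens, which is where the cost goes.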
We use Jsonnet extensively in the Aperture project for generating control policies.
- The policy spec is expressed in protobuf format.
- The gRPC-Gateway plugin is used to generate a Swagger (OpenAPI v2) spec from the proto files.
- Jsonnet bindings are generated from the Swagger spec.
- Blueprints are implemented using the Jsonnet bindings. Users generate policies from blueprints by providing configuration in YAML (using the aperturectl CLI) or via a Jsonnet mixin.
A multi-prompt[0] approach is taken when summarizing a large number of files and diffs. Prompts have been tuned for a concise response. To prevent excessive notifications, this action can be configured to skip adding review comments when the changes look good for the most part.
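One way to picture a multi-prompt flow, as a sketch under assumptions rather than the action's real implementation: pack diffs into batches that fit a size budget, summarize each batch, then ask for one combined summary. Here `ask` is a hypothetical stand-in for whatever function calls the model, and the character budget and prompt wording are made up.

```python
from typing import Callable, Dict, List


def batch_diffs(diffs: Dict[str, str], max_chars: int) -> List[List[str]]:
    """Greedily pack file names into batches whose diffs total at most
    max_chars. A single oversized diff ends up in a batch of its own."""
    batches, current, size = [], [], 0
    for name, diff in diffs.items():
        if current and size + len(diff) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(name)
        size += len(diff)
    if current:
        batches.append(current)
    return batches


def summarize_changes(diffs: Dict[str, str], ask: Callable[[str], str],
                      max_chars: int = 6000) -> str:
    partial = []
    for batch in batch_diffs(diffs, max_chars):
        body = "\n\n".join(f"## {name}\n{diffs[name]}" for name in batch)
        partial.append(ask("Summarize these diffs concisely:\n" + body))
    if len(partial) == 1:
        return partial[0]  # one batch: no combining prompt needed
    return ask("Combine these summaries into one concise summary:\n"
               + "\n".join(partial))
```

The number of model calls grows with the number of batches plus one, so the batch budget is the main knob for trading cost against context per prompt.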
The DoorDash team talks about types of metastable failures in microservices and how observability-based control techniques can be used to mitigate them.
They introduce a project called Aperture[0] that provides techniques such as adaptive concurrency limiting, prioritized load shedding and graceful degradation to mitigate cascading failures.
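To give a feel for prioritized load shedding, here is a toy sketch. It is not Aperture's algorithm or API: it assumes a fixed capacity per interval and a numeric priority on each request, and the request names are invented.

```python
def shed_load(requests, capacity):
    """requests: list of (request_id, priority); higher priority wins.
    Returns (admitted, shed), each in arrival order. Ties are broken
    in favor of the earlier arrival because sorted() is stable."""
    ranked = sorted(range(len(requests)), key=lambda i: -requests[i][1])
    keep = set(ranked[:capacity])
    admitted = [r[0] for i, r in enumerate(requests) if i in keep]
    shed = [r[0] for i, r in enumerate(requests) if i not in keep]
    return admitted, shed


incoming = [("checkout", 10), ("recommendations", 1), ("login", 8), ("ads", 1)]
admitted, shed = shed_load(incoming, capacity=2)
print(admitted, shed)
```

When the service is overloaded, low-priority work is dropped first so that critical paths keep running, which is the graceful degradation part.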
Just click around a bit:
- https://github.com/fluxninja/aperture
- https://docs.fluxninja.com/use-cases/adaptive-service-protec...
Note: I am one of the authors of this project.