I'm working on an app that makes it easy to run and organize evals for LLM-powered applications. The core idea is to make it easy for domain experts to review examples of interactions and tests with synthetic data, and to track an application's evaluated performance over time as changes are made.