Thanks for the feedback, love your article diving deep into DSPy! Here's how our platform is different:
2. You are absolutely right: the dataset is a big hurdle for using DSPy. That's why we offer a synthetic dataset generation pipeline for RAG, agents, and a variety of LLM pipelines. More here: https://docs.relari.ai/getting-started/datasets/synthetic
2. Relari is an end-to-end evaluation and optimization toolkit. Real-time optimization is just one part of our data-driven package for building robust and reliable LLM applications.
3. Our tools are framework agnostic. If you can build your entire application on DSPy, that's great! But we often see AI developers who want to keep the flexibility and transparency to use their prompts / LLM modules across different environments.
4. We provide well-designed predefined metrics as well as custom metrics learned from user feedback. We find good metrics essential to making any optimization process (including prompt optimization and fine-tuning) work.
Thanks for bringing this up! The best next step is to see how we can make the enterprise plan work for you; feel free to reach out to us (founders@relari.ai).
That's correct, but let me dig a little deeper. Continuous-eval provides two types of metrics: reference-based and reference-free.
With reference-based metrics, you provide a dataset with input/expected-output pairs for each step of the pipeline and use the metrics to measure its performance. This is the best approach for offline evaluation (e.g., in CI/CD) and the one that best captures the alignment between what you expect and the pipeline's actual behavior.
With reference-free metrics, on the other hand, you don't need to provide the expected output; you can still use them to monitor the application and get directional insight into its performance.
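To make the distinction concrete, here is a rough sketch in plain Python (the function names are illustrative, not the continuous-eval API): a reference-based metric needs the expected output from your dataset, while a reference-free metric only needs what the pipeline produces at runtime.

    # Illustrative only -- hypothetical names, not the continuous-eval API.

    def reference_based_correctness(answer: str, expected_answer: str) -> float:
        """Reference-based: needs the expected output (offline / CI-CD evaluation)."""
        # Toy token-overlap score; real metrics are more sophisticated.
        pred, ref = set(answer.lower().split()), set(expected_answer.lower().split())
        return len(pred & ref) / max(len(ref), 1)

    def reference_free_groundedness(answer: str, retrieved_context: list[str]) -> float:
        """Reference-free: only needs runtime artifacts (production monitoring)."""
        # Toy check: fraction of answer tokens supported by the retrieved context.
        context_tokens = set(" ".join(retrieved_context).lower().split())
        answer_tokens = answer.lower().split()
        return sum(t in context_tokens for t in answer_tokens) / max(len(answer_tokens), 1)

    # Offline (CI/CD): the expected answer comes from your reference dataset.
    print(reference_based_correctness("Paris is the capital of France",
                                      "The capital of France is Paris"))

    # Online monitoring: no reference available, so score against the retrieved context.
    print(reference_free_groundedness("Paris is the capital of France",
                                      ["France's capital city is Paris."]))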
Originally, we were going to do a Show HN for the modular evaluation and another Show HN for the synthetic data, because our understanding is that the Show HNs are for individual projects. But then we realized that it's the combination of the various pieces that brings the most value, so we decided to put them together as a single Launch HN instead.
interesting. standard advice i hear is to do a bunch of Shows first, to get users and social proof, and then show off that social proof + put fuel on the launch with a Launch, since you only get one. anyway, all the best!
Arize is a great tool for observability, and their open source product, Phoenix, offers many great features for LLM evaluation as well.
Some key unique advantages we offer:
- Component-level evaluation, not just observability: Many great tools on the market can help you observe different components (or execution steps) in a GenAI system for each data sample. What we offer on top of that is the ability to run automatic evaluation and get metrics for each step of the pipeline. For example, you can have metrics on the accuracy of agent tool usage, precision / recall for each retriever step, and relevant metrics on each LLM call (see the sketch after this list).
- Leverage user feedback for offline evaluation: We allow you to create custom metrics based on your past user feedback data. Unlike predefined metrics, these custom metrics are trained to learn your specific user preferences. In a sense, these metrics simulate user ratings.
- Synthetic Data Generation: Large amounts of synthetic data can help you stress test your AI system beyond your existing data. They also offer finer granularity than human-curated datasets, which helps you test and validate each component of your pipeline.
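To show what "metrics for each step" looks like in practice, here is a minimal sketch (the names are illustrative, not our actual API): the retriever step and the agent's tool-call step each get their own score instead of a single end-to-end number.

    # Conceptual sketch of component-level evaluation -- illustrative names only.
    from dataclasses import dataclass

    @dataclass
    class StepRecord:
        retrieved_ids: list[str]   # doc IDs returned by the retriever step
        expected_ids: list[str]    # ground-truth doc IDs for that query
        tool_called: str           # tool the agent actually invoked
        expected_tool: str         # tool it should have invoked

    def retriever_precision_recall(rec: StepRecord) -> tuple[float, float]:
        # Precision and recall for one retrieval step.
        retrieved, expected = set(rec.retrieved_ids), set(rec.expected_ids)
        hits = len(retrieved & expected)
        return hits / max(len(retrieved), 1), hits / max(len(expected), 1)

    def tool_usage_accuracy(records: list[StepRecord]) -> float:
        # Fraction of agent steps where the expected tool was chosen.
        return sum(r.tool_called == r.expected_tool for r in records) / max(len(records), 1)

    records = [
        StepRecord(["d1", "d3"], ["d1", "d2"], "search", "search"),
        StepRecord(["d2"], ["d2"], "calculator", "search"),
    ]
    for r in records:
        print("retriever precision/recall:", retriever_precision_recall(r))
    print("tool usage accuracy:", tool_usage_accuracy(records))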
Thank you for the feedback! That’s a great suggestion. We do want to make the demo into a separate page, and also add a live evaluation demo using the synthetic data generated on the fly.
Indeed - decomposition improves reliability but also makes testing more challenging. That's why we made the framework modular! Let us know if you have any feedback as you try it out!
Yup that’s what drove us to work on a module-level framework since our app is made up of many non-deterministic components. Give it a spin and let us know what you think!