First: Congratulations! I've just sent an early access request.
Second - I've got questions!
Here is my use case. We have hundreds of clients who each have dozens of videos on our platform. Videos are grouped in collections. For each video we have structured content (think of it like a complex mashup of transcript + other info).
I would like to be able to send the API a group of 13 documents, grouped into a collection called "User123_collection2" or whatever. And then run natural language queries against that through an LLM.
A. Do I understand correctly that you allow sending and organizing documents programmatically? Are there any length restrictions?
B. In your demo, Vellum simply grabs the top 3 most relevant snippets (and those seem to be relatively short). Can this be customized (longer snippets, and more of them)?
C. Can I get the "sources" cited with the answer?
Let's say I run a query against these documents, like the ones in your demo (an "end user" kind of question). Similar to how Bing's chatbot gives the links to the pages used to build the answer, I'd like the response to tell me it comes from Document 7 in collection "User123_collection2". Even better if the response can tell me where in the document it came from (otherwise I'd have to split my documents into smaller pieces when uploading).
D. Do you offer any guarantees of privacy for not only the data but also the prompts? I think these prompts might be one of the valuable "trade secrets" for startups who want to add LLM features. If a prompt is leaked publicly, or to another Vellum customer/stakeholder, then the feature becomes replicable.
E. How much latency does this additional layer tack onto typical response times? Compared to making an API request to OpenAI directly, how much longer does my app wait for the response?
I'm impressed by where you're going. It's a brilliant idea and I hope we get to be a customer, rather than build the whole LangChain/GPT-Index idea we were going to run with up until I read your post :)
Thanks so much for your interest and thoughtful questions! I'll do my best to answer here, but looking forward to digging in deeper with you offline, too :)
A. Documents are uploaded and organized via the UI at the moment, but later this week we'll be exposing APIs to do the same programmatically. You'll be able to programmatically create a "Document Index" and then upload documents to a given index. In your case, you'd likely have one index per collection. We don't currently enforce a strict size limit, but we'll likely need to soon; in that case, you could break a large document into smaller documents prior to uploading.
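Roughly, the flow might look like this (purely illustrative; the base URL, endpoint paths, field names, and auth scheme below are placeholders, not the final API surface):

```python
# Illustrative sketch only -- base URL, endpoint paths, and field names
# are placeholders, not the real API.
import requests

API_KEY = "YOUR_API_KEY"                      # placeholder
BASE_URL = "https://api.vellum.example.com"   # placeholder
headers = {"Authorization": f"Bearer {API_KEY}"}

# One index per collection, e.g. "User123_collection2"
index = requests.post(
    f"{BASE_URL}/document-indexes",
    headers=headers,
    json={"name": "User123_collection2"},
).json()

# Upload each of the 13 documents into that index; split any very large
# document into smaller pieces client-side if/when size limits apply.
my_documents = ["...structured content for video 1...", "...video 2..."]  # your docs
for i, text in enumerate(my_documents, start=1):
    requests.post(
        f"{BASE_URL}/document-indexes/{index['id']}/documents",
        headers=headers,
        json={"label": f"Document {i}", "text": text},
    )
```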
B. Yes, the number of chunks returned can be specified as part of the API call when performing a search. Currently the chunking strategy and max chunk size are static, but we fully intend to make both configurable very soon.
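Continuing the sketch above, a search call might let you pass the chunk count explicitly (again, the parameter names and response shape are placeholders):

```python
# Continues the sketch above (same BASE_URL, headers, index). The "num_chunks"
# parameter and response fields are placeholders for whatever we end up exposing.
results = requests.post(
    f"{BASE_URL}/document-indexes/{index['id']}/search",
    headers=headers,
    json={
        "query": "How do I cancel my subscription?",
        "num_chunks": 8,   # e.g. 8 instead of the demo's top 3
    },
).json()

for chunk in results["chunks"]:                 # assumed response shape
    print(chunk["document_label"], chunk["text"][:80])
```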
C. Yes, we track which document and which index each chunk came from. With proper prompt engineering, you can have the LLM include that citation in its final response; we recently helped a customer construct a prompt that does exactly this. Saying where in the document a chunk came from is a bit trickier (although you do know the text of the most relevant chunk, which is a helpful starting point).
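One illustrative way to wire that up, building on the search results from the sketch above (the labels and instructions here are just one approach, not a prescribed format):

```python
# Prepend each chunk with its provenance, then ask the model to cite the
# bracketed labels it used. Purely illustrative prompt construction.
context = "\n\n".join(
    f"[{c['document_label']} | index: User123_collection2]\n{c['text']}"
    for c in results["chunks"]
)

question = "How do I cancel my subscription?"
prompt = (
    "Answer the question using only the context below. "
    "After your answer, list the bracketed source label(s) you relied on.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\n"
    "Answer:"
)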
D. We do not share data or prompts across our customers, although we do provide strategic advice that's informed by our experiences. We'd love to learn what guarantees you're looking for and feel confident we can work within most reasonable bounds. For what it's worth, my personal opinion is that companies should be cautious about banking on prompts as the primary point of defensibility for LLM-powered apps. Reverse prompt-engineering is a thing (interesting article here: https://lspace.swyx.io/p/reverse-prompt-eng). My take is that LLM defensibility will come from your data (e.g. data that powers your semantic search or training data used to fine-tune proprietary models), as this is much harder for competitors to recreate, not to mention the user experience and go-to-market that surrounds it all.
E. We haven't yet done formal benchmarking (although we're admittedly overdue for it!), but we've architected our inference endpoint with low latency as a top priority. For example, we've selected cloud data centers close to where we understand OpenAI's to be, and we've minimized blocking procedures so that as much work as possible happens asynchronously. We host this endpoint separately from the rest of our web application, keep at least one instance running at all times (to prevent cold starts), and auto-scale it as traffic demands.
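Once you have access, you could also measure the overhead yourself by timing a direct OpenAI call against the same request routed through the extra layer (the proxied endpoint below is a placeholder):

```python
# Compare median round-trip times: direct to OpenAI vs. through the extra layer.
import statistics
import time

import requests

def median_latency(url, headers, payload, n=20):
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        requests.post(url, headers=headers, json=payload, timeout=60)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# direct  = median_latency("https://api.openai.com/v1/completions", openai_headers, openai_payload)
# proxied = median_latency(f"{BASE_URL}/generate", headers, payload)   # placeholder endpoint
# print(f"median overhead: {(proxied - direct) * 1000:.0f} ms")
```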
C. Just to confirm: the chunks are verbatim, right? So in theory I could just do a string search in the document to locate the chunk?
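If so, mapping a chunk back to its position in the original document would be as simple as:

```python
# Assuming `original_document` is the full text we uploaded and `chunk_text`
# is a chunk returned by the search:
offset = original_document.find(chunk_text)   # -1 if the chunk isn't verbatim
if offset != -1:
    print(f"chunk starts at character {offset}")
```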
D. I would assume that you're currently not encrypting the data at rest. Encrypting it with the customer's own API key would probably result in a performance hit?
In any case, if you're not encrypting it at all, and in the absence of any certification or assurances about your data security practices, then as a customer I'm forced to assume that giving you access to the data (and prompts) is tantamount to publicly disclosing them. I mean, LastPass had their data stolen. You and we are both startups. Anything valuable that isn't nailed to the floor (encrypted with utmost paranoia) is like chairs left out on a beach bar's patio overnight.
In that case, there's no defensibility to be found there for us. It doesn't prevent us from becoming a customer, but it means we have to hedge our use cases so as not to put our own customers' data (e.g. trade secrets that might be included in their documents) at risk.