> We start by parsing documents into chunks. A sensible default is to chunk documents by token length, typically 1,500 to 3,000 tokens per chunk. However, I found that this didn’t work very well. A better approach might be to chunk by paragraphs (e.g., split on \n\n).
Hmm, good insight there. I've done some experimenting previously with chunking by length, and it's been pretty troublesome due to missing context.
You don't do a sliding window? That seems like the logical way to maintain context but allow look up by 'chunks'. Embed it, say, 3 paragraphs at a time, advancing 1 paragraph per embedding.
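Roughly, I mean something like this (a minimal sketch; the function name and the window/stride values are just illustrative):

```python
def sliding_window_chunks(text, window=3, stride=1):
    """Split text into paragraphs and yield overlapping windows of them."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for start in range(0, max(len(paragraphs) - window + 1, 1), stride):
        yield "\n\n".join(paragraphs[start:start + window])

# Each yielded window is then embedded as one chunk.
```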
If you're concatenating after chunking, then the overlapping windows add quite a lot of repetition. Also, if a chunk cuts off mid-JSON / mid-structured output, overlapping windows once again cause issues.
Define a custom recursive text splitter in langchain, and do chunking heuristically. It works a lot better.
That being said, it is useful to maintain some global and local context. But, I wouldn't use overlapping windows.
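Something along these lines (a rough sketch; langchain's import paths have moved between versions, and the separators, chunk size, and file path are just example values):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Prefer paragraph boundaries, then line breaks, then sentences, then words,
# only hard-cutting characters as a last resort.
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0,  # no overlapping windows, per the above
)
chunks = splitter.split_text(open("note.md").read())  # "note.md" is a placeholder path
```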
In place of simply concatenating after chunking, a more effective approach might be to retrieve and return the segments from the original documents that are relevant to the context. For short pieces of text such as Hacker News comments, it's fairly straightforward: any partial match can simply return the entire comment as is.
When working with longer documents, the process gets a bit more intricate. In that case, your embedding database might need to hold more information per entry. Ideally, each entry should store the document ID, the starting token number, and the ending token number. That way, even if a document appears more than once among the top results for a query, it's possible to piece together the full relevant excerpt accurately.
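A sketch of that bookkeeping (field and function names are made up for illustration, not from any particular vector store):

```python
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    doc_id: str        # which source document the chunk came from
    start_token: int   # offset of the chunk's first token in that document
    end_token: int     # offset one past the chunk's last token
    embedding: list    # the chunk's embedding vector

def merge_hits(hits):
    """Stitch overlapping or adjacent top-k hits back into larger excerpts,
    grouped by document, so the caller can slice the original text."""
    spans = {}  # doc_id -> list of (start_token, end_token) spans
    for h in sorted(hits, key=lambda h: (h.doc_id, h.start_token)):
        doc_spans = spans.setdefault(h.doc_id, [])
        if doc_spans and h.start_token <= doc_spans[-1][1]:
            doc_spans[-1] = (doc_spans[-1][0], max(doc_spans[-1][1], h.end_token))
        else:
            doc_spans.append((h.start_token, h.end_token))
    return spans
```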
I don't think the repetition is a problem. He's using a local model for human-assisted writing with pre-generated embeddings - he can use essentially an arbitrary number of embedding calls, as long as it's more useful for the human. So it's just a question of whether that improves the quality or not. (Not that the cost would be more than a rounding error to embed your typical personal wiki with something like the OA API, especially since they just dropped the prices of embeddings again.)
I've thought about doing this as well, but I haven't tried it yet. Are there any resources/blogs/information on various strategies on how to best chunk & embed arbitrary text?
I’ve been experimenting with sliding window chunking using SRT files. They’re the subtitle format for television and have 1 to _n_ sequence numbers for each chunk, along with time stamps for when the chunk should appear on the screen. Traditionally it’s two lines of text per chunk but you can make chunks of other line counts and sizes. Much of my work with this has been with SRT files that are transcriptions exported from Otter.ai; GPT-3.5 & 4 natively understand the SRT format and the concepts of the sequence numbers and time stamps, so you can refer to them or ask for confirmation of them in a prompt.
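The windowing looks roughly like this (a plain-Python sketch with no subtitle library assumed; the window and stride sizes are arbitrary):

```python
def parse_srt(text):
    """Parse SRT text into (sequence_number, timestamp_line, cue_text) tuples."""
    cues = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) >= 3:
            cues.append((int(lines[0]), lines[1], "\n".join(lines[2:])))
    return cues

def windowed_cues(cues, window=5, stride=2):
    """Yield overlapping runs of cues, keeping sequence numbers and timestamps
    so a prompt (or the model) can refer back to them."""
    for start in range(0, max(len(cues) - window + 1, 1), stride):
        yield cues[start:start + window]
```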
This looks great! I was about to start learning and diving into Obsidian about a month ago, finally driven to begin building a personal knowledgebase...
And then I found Mem.ai and dove into that instead, and I've been extremely happy with it. It accomplishes this aspect he's offering here (where it uses your knowledgebase to assist in your writing). However, it's also got built-in chat with your knowledgebase, and helps with auto-sorting and all of that.
For those that want their data on their computer, I totally see why Obsidian is the most desirable. So this sort of addition would be the best of both worlds for them.
I'm not sure an offline-first document editor is comparable to an AI-focused hosted SaaS. This plugin is one of many, while mem.ai is a non-customizable tool where someone else owns your data and seems to offer no data portability.
I haven't used Notion, but from my research its AI feature is only the Smart Write/Edit that Mem also has - though I'm unsure how well it uses the rest of the content you have inside Notion, as their sales page doesn't really make that clear.
Mem.ai has many of these aspects integrated into it - I love that I no longer have to think about tags or folders or categories.
So what is the corpus size at which this becomes useful? A notes vault doesn't seem large enough.
Obsidian/Logseq are already great for thinking by way of their "show a random note" feature. Usually pulling up unfinished thoughts from past days will give me an idea for extending it.
I use vim with the Copilot plugin.
It's pretty astounding what it spits out.
I was writing a handbook for a credit union board of directors and it was quite helpful at times.
Copilot is designed for code, but it can still be used for whatever you want. I found it useful when writing LaTeX files, even for the text explanation portions as opposed to the markup 'code'. There was a Y Combinator thread on this earlier.
Another option is to use llama-index and index the Obsidian vault. I use a gradio based web interface to query my Obsidian vault via GPT-3.5. It's pretty awesome.
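The setup is roughly this (llama-index's import paths and class names have shifted between releases, so treat the names as approximate):

```python
from llama_index import VectorStoreIndex, ObsidianReader  # names/paths vary by release

# Load every markdown note in the vault as a document.
documents = ObsidianReader("/path/to/vault").load_data()

# Build an embedding index once, then reuse it for queries.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()  # defaults to an OpenAI chat model, e.g. GPT-3.5

print(query_engine.query("What have I written about chunking strategies?"))
```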
Mostly, but it does upload some of the vectorized data to insert into the prompt for context. When you do a query, llama-index tries to discover content related to your prompt and injects it for context, so it's not entirely local.
> Mostly, but it does upload some of the vectorized data to insert into the prompt for context. When you do a query, llama-index tries to discover content related to your prompt and injects it for context, so it's not entirely local.
When you say "upload some of the vectorized data" do you mean in a numerical embedding form or that it will embed the original text from original similar-seeming notes directly into the prompt? I've only ever done the latter, is there a way to build denser prompts instead? I can't find examples on Google.
This is why I don't use Obsidian. Without giving root to an aggregate of random git repos, the most important features are very subpar. I have no idea what I would even do with a fancy star graph or whatever it is, and VS Code has much better search.
Edit: Also, why is the search bar for searching all notes so buried, requiring so much effort to open? Is that because it works so poorly?
Semantic search. The current search feels like it's barely doing anything smarter than substring matching / basic regex. Search should be more like Google and less like matching substrings.
Fair enough. The built-in search is terrible; it doesn't respect any of the regex filters you set up to exclude files and folders. I fought with it for almost an hour trying to get it to exclude PNG images, and anything in a 'media' or 'attachment' folder, and it never worked completely. Then I installed OmniSearch and it just worked, instantly.
So, generating text this way is 100% not interesting or relevant.
What's interesting here is how it's building the prompt to send to the openai-api.
So... can anyone shed some light on what the actual code [3] in get_chunks() does, and why you would... hm... I guess, do a lookup and pass the results to the openai api, instead of just the raw text?
The repo says: "You write a section header and the copilot retrieves relevant notes & docs to draft that section for you.", and you can see in the linked post [4], this is basically what the OP is trying to implement here; you write 'I want X', and the plugin (a bit like copilot) does a lookup of related documents, crafts a meta-prompt and passes the prompt to the openai api.
...but, it doesn't seem to do that. It seems to ignore your actual prompt, lookup related documents by embedding similarity... and then... pass those documents in as the prompt?
I'm pretty confused as to why you would want that.
It basically requires that you write your prompt separately beforehand, so you can invoke it magically with a one-line prompt later. Did I misunderstand how this works?
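For reference, the retrieve-then-draft flow I expected looks something like this (a hypothetical sketch using the pre-1.0 openai Python client, not the plugin's actual code):

```python
import numpy as np
import openai  # pre-1.0 client style

def embed(text):
    """Embed text with the OpenAI embeddings endpoint."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

def top_k_chunks(query_vec, chunks, k=5):
    """chunks: list of (text, vector) pairs; return the k most similar texts
    joined into one context block (ada embeddings are unit-length, so a dot
    product is effectively cosine similarity)."""
    scored = sorted(chunks, key=lambda c: -float(np.dot(query_vec, c[1])))
    return "\n\n".join(text for text, _ in scored[:k])

def draft_section(section_header, chunks, k=5):
    """Retrieve relevant note chunks and ask the model to draft the section."""
    context = top_k_chunks(embed(section_header), chunks, k)
    prompt = (
        f"Using the following notes:\n\n{context}\n\n"
        f"Write a draft for the section titled: {section_header}"
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]
```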
I love obsidian but the first plugin I tried (and paid for) led to subtle data loss and resulted in many hours of checking and merging a month of backups. Not going to risk that again.
Wow, that's awful -- and an extreme outlier. Most plugins are free. And given under the hood Obsidian notes are just markdown files on the local filesystem, backups can be managed by git or TimeMachine or rsync or whatever else you might use on other directories. That's not to discredit your experience, just speaking up for the sake of others who might be unduly scared off.
Sure, this one had a free version and it was great. Except after a few weeks Obsidian started behaving weirdly, and the problem was 100% reproducible by enabling or disabling the plugin in question. And then I noticed that some of the markdown files had been modified (sections deleted) - but I hadn't noticed immediately, so they'd had edits after they were modified - hence the tedious manual merge.
So it may be "just this one plugin", but Obsidian is so important that I'm just not willing to risk it.
But at this point it should be (maybe not OpenAI but some other organization). OpenAI is close to becoming a necessity and a human right, just like education, so it should be 100% free and accessible at some point.
"OpenAI is close to becoming a necessity and a human right" is the wildest claim I have heard yet about AI. (Though it's possible that maybe someday I will agree with this).
I say this calmly: it's wild for you because you're probably part of the privileged group of people who can sustain themselves with a steady job and have no problem paying for it.
Like having access to HN via some form of Internet access?
Envy doesn't create rights for oneself nor does it impute privilege to others, and people who read and write on the internet about privilege seem blinkered, to me, about how they'd sound to someone who walks two miles for water polluted by the mining of rare earth elements.
See my edit. I meant AI generally (such as AI-aided/enhanced learning, communication, teaching, etc.), not OpenAI the company per se. For disabled people (like me) first and foremost, but right after that for the general populace as well.
I disagree. What gives you the idea that any software is a necessity? We've (modern humans) been around for 200,000+ years and software (and LLMs for less than a tenth of the time that software has been around) has existed for ~0.04% of that time.
Oxygen (in its molecular, O2 form) is a necessity. Water is a necessity. Nutrition of some sort is a necessity.
Pretty much everything else is a nice-to-have (with some things like money and shelter being important, but as we see from the poverty and homelessness around the world, definitely not a necessity).
I'd posit that LLMs are helpful and sometimes even useful. But necessary? I think not.
I'd note that I'm not dismissing LLMs, nor am I trying to dump on you. But the idea that any software is necessary is ridiculous on its face.
The right to AI is different than the right to free hosted inference.
Anyone can download a model and use it completely offline, or hosted on your own server. And there's a lot of effort to make these models work on devices that don't have much computing power (even phones [0]), which increases access even more.
Since this is the world we currently live in, what are you suggesting should change?
There are smaller open-source options that you can use, though they don't quite live up to the large open-source models (which likely won't fit on your machine) or commercial models like those from OpenAI.
That's a loaded question, because there are different approaches you can take to run these models. Basically, you want lots of memory (RAM or VRAM), and the more you have, the larger the models you can run.
I'd recommend shooting for at least 13B models.
Use "oobabooga/text-generation-webui", which can also serve an OpenAI-compatible API as well as provide a chat interface. It can serve most models, using most methods.
Check out their system requirements page[0], and join some of the communities to learn more about what hardware will work best for you.
This person[1] is providing models of all sorts, in pretty much every optimized format. They also post the minimum RAM requirements for each of the GGML models, which are best if you want to host using CPU/RAM (no video card).
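As a back-of-envelope check on memory: parameter count times bytes per weight, plus some overhead, gets you in the right ballpark (a rough rule of thumb, not an exact figure):

```python
def rough_memory_gb(n_params_billions, bits_per_weight, overhead=1.2):
    """Weights * quantization width, plus ~20% for KV cache and runtime buffers."""
    return n_params_billions * 1e9 * (bits_per_weight / 8) * overhead / 1e9

print(round(rough_memory_gb(13, 4), 1))   # 13B model, 4-bit quantized (e.g. GGML q4): ~7.8 GB
print(round(rough_memory_gb(13, 16), 1))  # same model in fp16: ~31.2 GB
```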
I have money for the API, but I wouldn't ever pay for it. If you want to learn things, read. If you want to create things, create. No need for some language model assistance. You'll end up relying on it, knowing nothing yourself, and having no abilities. At most it's useful for people writing marketing texts...
There would be a simple system for rooms, and the AI/program would edit them and add things to them which, when clicked on, could lead to new "places".