Shapeshift: Semantically map JSON objects using key-level vector embeddings (github.com/rectanglehq)
114 points by marvinkennis 4 months ago | 35 comments



This is the code that does the work: https://github.com/rectanglehq/Shapeshift/blob/d954dab2a866c...

There are a few ways this could be made less expensive to run:

1. Cache those embeddings somewhere. You're only embedding simple strings like "name" and "address" - no need to do that work more than once in an entire lifetime of running the tool.

2. As suggested here https://news.ycombinator.com/item?id=40973028, change the design of the tool so that instead of doing the work directly it returns a reusable data structure mapping input keys to output keys. That way you only have to run it once, and can then use that generated data structure to apply the transformations to large amounts of data in the future (rough sketch after this list).

3. Since so many of the keys are going to have predictable names ("name", "address" etc) you could even pre-calculate embeddings for the 1,000 most common keys across all three embedding providers and ship those as part of the package.
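
For (1) and (2) combined, a rough sketch of the shape I mean; the embedding call and similarity function are passed in as stand-ins here, not the library's actual API:

    // Rough sketch, not the library's API. "embedOnce" and "similarity" are
    // stand-ins for whatever embedding provider and similarity function you use.

    // Wrap any provider call with a cache so each key string is embedded at most once.
    function cachedEmbedder(
      embedOnce: (key: string) => Promise<number[]>
    ): (key: string) => Promise<number[]> {
      const cache = new Map<string, number[]>();
      return async (key) => {
        const hit = cache.get(key);
        if (hit) return hit;
        const vec = await embedOnce(key);
        cache.set(key, vec);
        return vec;
      };
    }

    // Return { sourceKey: targetKey } instead of a transformed object, so the
    // expensive step runs once and the mapping can be applied to any number of records.
    async function buildKeyMapping(
      sourceKeys: string[],
      targetKeys: string[],
      embed: (key: string) => Promise<number[]>,
      similarity: (a: number[], b: number[]) => number
    ): Promise<Record<string, string>> {
      const targetVecs = await Promise.all(targetKeys.map(embed));
      const mapping: Record<string, string> = {};
      for (const src of sourceKeys) {
        const srcVec = await embed(src);
        let bestIdx = 0;
        let bestScore = -Infinity;
        targetVecs.forEach((vec, i) => {
          const score = similarity(srcVec, vec);
          if (score > bestScore) {
            bestScore = score;
            bestIdx = i;
          }
        });
        mapping[src] = targetKeys[bestIdx];
      }
      return mapping;
    }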

Also: in https://github.com/rectanglehq/Shapeshift/blob/d954dab2a866c... you're using Promise.map() to run multiple embeddings through the OpenAI API at once, which risks tripping their rate-limit. You should be able to pass the text as an array in a single call instead, something like this:

        // One request for the whole batch of key strings instead of one per key
        const response = await this.openai!.embeddings.create({
          model: this.embeddingModel,
          input: texts,
          encoding_format: "float",
        });
        return response.data.map(item => item.embedding);
https://platform.openai.com/docs/api-reference/embeddings/cr... says input can be a string OR an array - that's reflected in the TypeScript library here too: https://github.com/openai/openai-node/blob/5873a017f0f2040ef...


Watch out with the array mode though, according to OpenAI docs it technically can return the results in any order and you must sort them by index to be sure you have the right associations. I’ve never seen them out of order in practice, but it’d be entirely in-character for them to suddenly change that sporadically and without warning, and now your entire vectordb may or may not be nondeterministically ruined.
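
The defensive fix is cheap; assuming the same response shape as the snippet above:

    // Sort by the index field before trusting the ordering.
    const sorted = [...response.data].sort((a, b) => a.index - b.index);
    return sorted.map(item => item.embedding);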


Yikes!


I was involved in an attempt to do this kind of thing with CNNs just around the time BERT came out. It was mostly successful, and we did great projects for companies in the beverages, telecom, aviation, and consumer goods spaces.

It worked because it also had a conventional data-processing pipeline that revolved around JSON documents.

For (2) it seems a system like that should be able to generate a script in Python, a co-designed DSL, or some other language to do the conversion.

One interesting thing about the product I worked on was that it functioned as a profiler by looking at one cell at a time, so if some field contained "Gruff Rhys" or "范冰冰" it could tell that was probably somebody's name, all the better if it could also see that the field label was something like "Full Name" or "姓名". I'd contrast that with more conventional column-based profilers, which might notice that a certain field only has the values "true" and "false" throughout the whole column and would probably have some rule determining that it was a boolean field.

One thing that system could do is recognize private data inside unstructured data. Where I work, for instance, we have

https://www.spirion.com/sensitive-data-discovery

which scans text and other files and warns if it sees something that looks like a lot of personal data, like an Excel spreadsheet full of names, addresses, and phone numbers -- even if I just made them up as test data.


> returns a reusable data structure mapping input keys to output keys

IMO this use case is exactly what Copilot is for. Write a comment including one example each of input and output, and tab-complete in your language of choice to have it create a rewriter for you.

One benefit (and danger) is that it will look at the values, not just the keys, and also may generate arbitrary code that can e.g. adapt a firstName and lastName to a fullName. But that's why you have a human being triggering and auditing this for subtle bugs, and putting it through code review and source control, right?
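
The kind of rewriter it ends up producing, with made-up field names for illustration:

    // Hypothetical generated rewriter: reviewed by a human, checked into source control.
    interface SourcePerson {
      firstName: string;
      lastName: string;
      phone: string;
    }

    interface TargetPerson {
      fullName: string;
      phoneNumber: string;
    }

    function toTargetPerson(src: SourcePerson): TargetPerson {
      return {
        fullName: `${src.firstName} ${src.lastName}`.trim(),
        phoneNumber: src.phone,
      };
    }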


Thanks for the suggestions! Will implement these. Caching is a great idea.


In general, you might cross reference with other object mapping libraries (including in other languages) to get ideas on how they approach this problem. Caching mappings is just one common strategy.


Maybe I'm not the target audience, but here are simple questions to the author or potential users:

What about anything more complex, like date of birth to age or the other way round? Also, since we will inevitably incur costs, why not let an LLM write a transformation rule for us?


It's not using an LLM, it's just comparing embeddings (which are waaay cheaper)


But the embeddings came from somewhere (an LLM?).


My thinking as well.


Since LLMs are bad at the null hypothesis (in this case, when a key does not exist in the source JSON), how does this prevent hallucinating transformations for missing keys?


This isn't using an LLM; it simply checks for similarity between keys using vector embeddings.


The example could be handled with no machine learning at all. Just use a bag of words comparison with a subword tokenizer. And if you do need embeddings (to map synonyms/topics), fastText is faster, cheaper and runs locally. For hard cases, you can feed the source/target schemas to gpt-4o once to create a map - and then apply that one map to all instances.
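
A rough sketch of the no-ML version, using character trigrams as a crude stand-in for a real subword tokenizer:

    // Jaccard similarity over character trigrams, e.g. "phone_number" vs "phoneNumber".
    function trigrams(key: string): Set<string> {
      const s = key.toLowerCase().replace(/[^a-z0-9]/g, "");
      const grams = new Set<string>();
      for (let i = 0; i <= s.length - 3; i++) grams.add(s.slice(i, i + 3));
      return grams;
    }

    function keySimilarity(a: string, b: string): number {
      const ga = trigrams(a);
      const gb = trigrams(b);
      let overlap = 0;
      for (const g of ga) if (gb.has(g)) overlap++;
      const union = ga.size + gb.size - overlap;
      return union === 0 ? 0 : overlap / union;
    }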


> fastText is faster, cheaper and runs locally

The question is whether the quality will be acceptable.


The question is whether embeddings produced by a machine learning algorithm will have acceptable quality, too. With a library, I presume the quality is at least predictable. I personally have less trust in machine learning, though.


> The question is whether embeddings produced by a machine learning algorithm will have acceptable quality, too

There are tons of benchmarks and results demonstrating that embeddings from language models are superior to word2vec in (almost) all scenarios.


BTW, bag-of-words models were considered ML not too long ago.


What is this for? The examples given could be handled deterministically. Is this for situations where you don't know JSON schemas in advance? What situations are those?


As is, it's not good for much beyond looking cool. (Maybe implementing Postel's Law for a JSON API, but I think that's considered bad taste these days.)

If instead of transforming a single object it would output a table of src_field->dst_field, it could potentially be a useful first pass in some ETL development.


The lazy part of my brain screams “use this instead of dealing properly with nested objects!” In a production setting I’d be worried about consistency from the base to result layers if it’s based on LLM transposition.


Data import via customer self-service onboarding.


Keep the bug generators going; we will need the jobs.


This task in its most general form is better done with a question-answering prompt than with embeddings. How do you solve "Full Name" -> "First Name", "Last Name" with embeddings? QA is the right level of abstraction for schema conversion tasks. And it's simple: just put the source JSON + target JSON schema in the prompt and ask for value extraction.
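
A rough sketch of the prompt side; the object and schema are just placeholders, and the actual model call is whatever client you already use:

    // Placeholder source object and target schema for the "Full Name" example.
    const sourceObject = { "Full Name": "Jane Doe", "Phone": "555-0100" };
    const targetSchema = {
      type: "object",
      properties: {
        firstName: { type: "string" },
        lastName: { type: "string" },
        phoneNumber: { type: "string" },
      },
    };

    // Ask the model to extract values into the target schema; missing fields become null.
    const prompt = [
      "Source object:",
      JSON.stringify(sourceObject, null, 2),
      "",
      "Target JSON schema:",
      JSON.stringify(targetSchema, null, 2),
      "",
      "Extract the values from the source object into a JSON object that matches the",
      "target schema. Use null for any target field with no corresponding source value.",
      "Return only the JSON object.",
    ].join("\n");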


So this identifies keys from source and target objects that are fuzzy synonyms and copies the values over. What is a real world use case for this? Add in the fact that it's fuzzy and won't always work, so it would require a great deal of extra effort in QA/testing (harder than just mapping the keys programmatically), and I'm puzzled.


We do something very similar with embeddings in our product. Users import files that they have to match to a dynamically-defined target schema. The embedding matching provides suggested matches to the user that are generally very accurate, so they don't have to go through and manually match up "telephone" to "phone number" etc. It even works across languages.


I've got some similar use-cases. So, do I understand correctly that you take the source keyword and generate an embedding vector of it, then compare it using dot-product similarity or something to the embedded vectors of the target keywords?


Exactly, although we use cosine similarity.


Perfect. And yeah, that's what I meant; I'm so used to just normalizing vectors that dot product = cosine.
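
For reference, it's a few lines either way:

    // Cosine similarity; for pre-normalized vectors the denominator is 1,
    // so it reduces to the plain dot product.
    function cosineSimilarity(a: number[], b: number[]): number {
      let dot = 0;
      let normA = 0;
      let normB = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }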


How much time does this save your users? Is this QOL? Or more of a "our product wouldn't work without this feature" kind of thing?


Quite a bit of time. The product would still work without the feature, but it is a major feature. It bypasses lots of wading through dropdowns (potentially dozens for a single session).


Here is another DSL for implementing object model mappings: https://github.com/patleahy/lir


Put together a quick version with an LLM, using Substrate: https://www.val.town/v/substrate/shapeshift

I've turned the target object into a JSON schema, but you could probably generate that JSON schema pretty reliably using a codegen LLM.


What’d be really great is a codegen aspect. A non-negligible part of any data munging operation is “this input object has fields X, Y, Z and we need an output object with fields X, f(X), Y, f(Y,Z)”. This is something an LLM has a decent chance of being really quite good at.


Created a Rust version using devin.ai. (untested)

https://github.com/HumanAssisted/shapeshift-rust




