We’re building the analytics stack for generative AI text interfaces: think Amplitude for ChatGPT.
Thousands of businesses are building new products and features using LLMs, and they need a new analytics stack to help them understand user behaviour and build great product experiences. Check out our product demo: https://www.loom.com/share/d2d1c1b9c42447ee8d69fcec43c67011.
We’ve raised $3.5M from an amazing group of investors, and we’re hiring a small team of founding engineers.
## What are we looking for?
- 2+ years of full-stack engineering experience
- Kindness, optimism, and great communication skills
- Enthusiasm to shape an early-stage product and business. As one of the first employees, you’ll wear many hats and have a huge impact on product and business direction.
- Deep user empathy. You’ll be regularly talking with customers to help shape the product direction.
- Interest in machine learning. You don’t need an ML PhD, but we are building in an ML-adjacent space, so enthusiasm and a willingness to learn about it are highly desirable!
### To find out more or apply, please reach out to henry@woolly.ai
The site rewrites absolute image URLs as relative ones pointing to Tesoro. For example, in the Chicken Teriyaki example on the homepage, the main image is sourced from the relative location "static01.nyt.com/.../28COOKING-CHICKEN-TERIYAKI1-articleLarge.jpg". That looks like it's coming from nytimes.com, but you can check in the Chrome dev console that it isn't.
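As a rough illustration of that rewrite (a Python sketch under my own assumptions, not Tesoro's actual code, which is written in Go), the idea is to drop the scheme and keep the host as the first path segment. The function name and the example.com URLs are made up for the example.

```python
from urllib.parse import urlsplit


def to_archive_relative(absolute_url: str) -> str:
    """Map an absolute URL to a relative path of the form host/path.

    Hypothetical helper: the query string and fragment are dropped here,
    which a real archiver would probably need to handle differently.
    """
    parts = urlsplit(absolute_url)
    if not parts.netloc:
        # No host means the URL is already relative; leave it untouched.
        return absolute_url
    path = parts.path.lstrip("/")
    return f"{parts.netloc}/{path}" if path else parts.netloc


if __name__ == "__main__":
    print(to_archive_relative("https://static.example.com/images/photo.jpg"))
```

Because the result has no scheme or leading slash, the browser resolves it against the archive page's own location, which is why the image looks like it comes from the original host but is served by the archive.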
Have you found an example where it isn't working correctly? If so, would you mind posting it here? I'll fix it :).
Unfortunately, this approach alone will only work for sites that are mostly static, i.e. that do not use JS to load dynamic content. That is a small (and shrinking) percentage of the web. Once JS is involved, all bets are off -- JS will attempt to load content via AJAX, generate new HTML, load iframes, etc., and you will have 'live leaks' where the content seems to be coming from the archive but is actually coming from the live web.
Here is an example from archiving the nytimes home page:
If you look at the network traffic (check the domain in devtools), you'll see that only a small percentage is coming from archive.tesoro.io; the rest of the content is loaded from the live web. This can be misleading and possibly a security risk as well.
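To put a number on this, one option is to export the request URLs from devtools and count how many hit the archive host. This is only a sketch in Python: the function names are made up, and the request list below is invented for illustration rather than taken from a real capture.

```python
from urllib.parse import urlsplit


def live_leaks(request_urls, archive_host):
    """Return the requests that were NOT served by the archive host."""
    return [u for u in request_urls if urlsplit(u).hostname != archive_host]


def archive_share(request_urls, archive_host):
    """Fraction of requests (0.0 to 1.0) served by the archive host."""
    if not request_urls:
        return 0.0
    leaked = len(live_leaks(request_urls, archive_host))
    return (len(request_urls) - leaked) / len(request_urls)


if __name__ == "__main__":
    # Invented sample; the paths are placeholders, not real captured traffic.
    requests_seen = [
        "https://archive.tesoro.io/some-archive/index.html",
        "https://archive.tesoro.io/some-archive/style.css",
        "https://live.example.com/api/feed.json",
        "https://ads.example.net/pixel.gif",
    ]
    print(archive_share(requests_seen, "archive.tesoro.io"))
    print(live_leaks(requests_seen, "archive.tesoro.io"))
```

Anything returned by `live_leaks` is a candidate leak worth inspecting by hand.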
Not to discourage you, but this is a hard problem and one I've been working on for years now. This area is a moving target, but we think live leaks are mostly eliminated in Webrecorder and pywb, although there are lots of areas to work on to maintain high-fidelity preservation.
If you want to chat about possible solutions or want to collaborate (we're always looking for contributors!), feel free to reach out to us at support [at] webrecorder.io or find my contact on GH.
Great point. Right now this is just a single rate-limited HTML form to gauge interest. Next is to build specialty features that are worth paying for and make this sustainable. :)
You're right. For now it's a single rate-limited HTML form, and you'll have to manually collate the links to the archives you create. I'll be adding specialty features (with accounts) next. :)
Briefly: Sites are archived using a system written in Golang and uploaded to a Google Cloud bucket.
More: The system downloads the remote HTML, parses it to extract the relevant dependencies (<script>, <link>, <img> etc.) and then downloads these as well. Tesoro also parses CSS files to extract their url('...') dependencies, meaning most background images and fonts should continue to work. All dependencies (even those hosted at remote domains) are downloaded and hosted with the archive, so the src attributes on the original page tags are rewritten to point to the new location.
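The extraction step could look roughly like the sketch below. It is in Python using only the standard library, not the Go code Tesoro actually runs; the class and function names, the tag-to-attribute table and the regular expression are all my own assumptions about one reasonable way to do it.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute that holds the dependency URL (assumed minimal set).
DEPENDENCY_ATTRS = {"script": "src", "link": "href", "img": "src"}

# Matches url(...) with optional single or double quotes around the target.
CSS_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


class DependencyParser(HTMLParser):
    """Collects absolute URLs of <script>, <link> and <img> dependencies."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.dependencies = []

    def handle_starttag(self, tag, attrs):
        wanted = DEPENDENCY_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                self.dependencies.append(urljoin(self.base_url, value))


def html_dependencies(html_text, base_url):
    parser = DependencyParser(base_url)
    parser.feed(html_text)
    return parser.dependencies


def css_dependencies(css_text, base_url):
    """Extract url(...) targets from CSS, skipping inline data: URIs."""
    urls = []
    for _quote, target in CSS_URL_RE.findall(css_text):
        target = target.strip()
        if target and not target.startswith("data:"):
            urls.append(urljoin(base_url, target))
    return urls


if __name__ == "__main__":
    page = '<link rel="stylesheet" href="/css/site.css"><img src="img/logo.png">'
    print(html_dependencies(page, "https://example.com/recipes/index.html"))
    css = "body { background: url('../img/bg.png'); }"
    print(css_dependencies(css, "https://example.com/css/site.css"))
```

Note that CSS targets are resolved against the stylesheet's URL rather than the page's, since that is how browsers interpret relative url() references.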
The whole thing is hosted on GCP Container Engine and I deploy with Kubernetes.
I'll write up a more comprehensive blog post at some point. Which portion of this would you like to hear more about?
Currently there is no global redundancy checking, only local checking within the same page. So two identical CSS files from different archives are both kept. While this might not be ideal in terms of scaling to infinity, each archive plus its dependencies is currently limited to 25MB, which should help keep costs under control until this is monetised. :)
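For the curious, per-page deduplication with a size cap can be sketched as below. This is Python for illustration only, not the production Go code: hashing with SHA-256, the class and exception names, and reading 25MB as 25 * 1024 * 1024 bytes are all assumptions.

```python
import hashlib

# Assumption: "25MB" is interpreted as 25 MiB.
MAX_ARCHIVE_BYTES = 25 * 1024 * 1024


class ArchiveTooLarge(Exception):
    """Raised when adding a file would push the archive over its size cap."""


class PageArchive:
    """Stores one page's dependencies, keeping identical content only once."""

    def __init__(self, max_bytes=MAX_ARCHIVE_BYTES):
        self.max_bytes = max_bytes
        self.total_bytes = 0
        self.blobs = {}          # content digest -> bytes
        self.url_to_digest = {}  # original URL -> content digest

    def add(self, url, content):
        """Add a dependency; return True if its content was newly stored."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            if self.total_bytes + len(content) > self.max_bytes:
                raise ArchiveTooLarge(f"{url} would exceed {self.max_bytes} bytes")
            self.blobs[digest] = content
            self.total_bytes += len(content)
        # Duplicate URLs simply point at the blob that is already stored.
        self.url_to_digest[url] = digest
        return is_new


if __name__ == "__main__":
    archive = PageArchive()
    print(archive.add("a.css", b"body { margin: 0 }"))  # stored
    print(archive.add("b.css", b"body { margin: 0 }"))  # same content, reused
    print(archive.total_bytes)
```

Global deduplication would amount to sharing the digest-to-content map across archives instead of creating one per page.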