agamble's comments

Our production Postgres instance is still down as of 16:46 London time.


Context | London, UK | Full-time | Onsite | https://getcontext.ai | B2B SaaS

We’re building the analytics stack for generative AI text interfaces: think Amplitude for ChatGPT.

Thousands of businesses are building new products and features using LLMs, and they need a new analytics stack to help them understand user behaviour and build great product experiences. Check out our product demo: https://www.loom.com/share/d2d1c1b9c42447ee8d69fcec43c67011.

We’ve raised $3.5M from an amazing group of investors, and we’re building a small team of founding engineers to join us.

## What are we looking for?

- 2+ years of full stack engineering experience

- Kindness, optimism, and great communication skills

- Enthusiasm to shape an early-stage product and business. As one of the first employees you'll wear many hats, and you'll have a huge impact on product and business direction.

- Deep user empathy. You’ll be regularly talking with customers to help shape the product direction.

- Interest in Machine Learning. You don’t need an ML PhD, but we are building in an ML adjacent space, so an enthusiasm and willingness to learn about the space is highly desirable!

### To find out more or apply, please reach out to henry@woolly.ai


OP here.

The site will rewrite absolute image URLs as relative ones pointing to Tesoro. For example, in the Chicken Teriyaki example on the homepage, the main image is sourced from the relative location "static01.nyt.com/.../28COOKING-CHICKEN-TERIYAKI1-articleLarge.jpg", which looks like it's coming from nytimes.com, but you can check in the Chrome dev console that it isn't.
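The rewriting step described above can be sketched roughly like this (an illustrative function, not Tesoro's actual code): drop the scheme and keep the original host in the path, so the archived copy still records where each resource came from.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// rewriteToArchive maps an absolute resource URL to a relative path under
// the archive's own origin, preserving the original host and path so the
// source is still visible (as with the static01.nyt.com image above).
// The name and scheme here are illustrative assumptions.
func rewriteToArchive(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	// The browser resolves this relative to the archive's domain, so the
	// request never touches the original host.
	return path.Join("/", u.Host, u.Path), nil
}

func main() {
	p, _ := rewriteToArchive("https://static01.nyt.com/images/teriyaki.jpg")
	fmt.Println(p) // /static01.nyt.com/images/teriyaki.jpg
}
```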

Have you found an example where it isn't working correctly? If so, would you mind posting it here? I'll fix it. :)


Unfortunately, this approach alone will only work for sites that are mostly static, e.g. ones that do not use JS to load dynamic content. That is a small (and shrinking) percentage of the web. Once JS is involved, all bets are off: JS will attempt to load content via AJAX, generate new HTML, load iframes, etc., and you will have 'live leaks' where the content seems to be coming from the archive but is actually coming from the live web.
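The check you'd do by eye in devtools can be sketched in code: given the list of URLs a page actually requested, anything not served from the archive's host is a live leak. (A simplified illustration; relative URLs and subdomains would need more care.)

```go
package main

import (
	"fmt"
	"net/url"
)

// findLiveLeaks returns the resource URLs that are NOT served from the
// archive host, i.e. content that JS on the archived page pulled from
// the live web. This mirrors the manual check in the devtools Domain
// column; the host name is taken from the example above.
func findLiveLeaks(archiveHost string, resources []string) []string {
	var leaks []string
	for _, r := range resources {
		u, err := url.Parse(r)
		if err != nil || u.Host != archiveHost {
			leaks = append(leaks, r)
		}
	}
	return leaks
}

func main() {
	resources := []string{
		"https://archive.tesoro.io/665dbeab57a4d57d8140f89cfedc69b5",
		"https://static01.nyt.com/ads/tracker.js", // fetched by live JS
	}
	for _, l := range findLiveLeaks("archive.tesoro.io", resources) {
		fmt.Println("live leak:", l)
	}
}
```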

Here is an example from archiving nytimes home page:

https://archive.tesoro.io/665dbeab57a4d57d8140f89cfedc69b5

If you look at the network traffic (the Domain column in devtools), you'll see that only a small percentage is coming from archive.tesoro.io; the rest of the content is loaded from the live web. This can be misleading, and possibly a security risk as well.

Not to discourage you, but this is a hard problem that I've been working on for years now. It's a moving target, but we think live leaks are mostly eliminated in Webrecorder and pywb, although there are still lots of areas to work on to maintain high-fidelity preservation.

If you want to chat about possible solutions or want to collaborate (we're always looking for contributors!), feel free to reach out to us at support [at] webrecorder.io or find my contact on GH.


Nope, you are right. I just missed that there wasn't a protocol on the src I was looking at.


OP here. Thanks! Could you point me to the pages where it worked well for you vs archive.org?


Thanks Jack, I hadn't heard of webrecorder before, but I'll check it out. :)


OP here.

Great point. Right now this is just a single rate-limited HTML form to gauge interest. Next is to build specialty features that are worth paying for and make this sustainable. :)


OP here.

Yup, an API and a Chrome extension are next on the feature list. :)


Great points.

You're right, for now it's a single rate-limited HTML form and you'll have to manually collate the links to the archives you create. I'll be adding specialty features (with accounts) next. :)


Thanks! These are great comments - I'll look into the issue with saving Hacker News CSS + JS.


OP here. Definitely, great idea :)

Briefly: Sites are archived using a system written in Golang and uploaded to a Google Cloud bucket.

More: The system downloads the remote HTML, parses it to extract the relevant dependencies (<script>, <link>, <img>, etc.) and then downloads these as well. Tesoro even parses CSS files to extract their url('...') file dependencies, meaning most background images and fonts should continue to work. All dependencies (even those hosted on remote domains) are downloaded and hosted with the archive, and the src attributes on the original page's tags are rewritten to point at the new location.
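The CSS-dependency step above can be sketched with a small Go function (a simplified regex-based version; a real CSS parser would handle comments and escapes that this ignores):

```go
package main

import (
	"fmt"
	"regexp"
)

// cssURLRe matches url('...'), url("...") and unquoted url(...) references
// in a stylesheet body.
var cssURLRe = regexp.MustCompile(`url\(\s*['"]?([^'")]+)['"]?\s*\)`)

// extractCSSDeps returns the file dependencies referenced by a CSS body,
// e.g. background images and @font-face sources, so they can be downloaded
// and rewritten along with the page.
func extractCSSDeps(css string) []string {
	var deps []string
	for _, m := range cssURLRe.FindAllStringSubmatch(css, -1) {
		deps = append(deps, m[1])
	}
	return deps
}

func main() {
	css := `body { background: url('/img/bg.png'); }
@font-face { src: url("https://cdn.example.com/f.woff2"); }`
	fmt.Println(extractCSSDeps(css))
	// [/img/bg.png https://cdn.example.com/f.woff2]
}
```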

The whole thing is hosted on GCP Container Engine and I deploy with Kubernetes.

I'll write up a more comprehensive blog post soon. Which portion of this would you like to hear more about?


The issue is cost. Your costs are disk space for people's archives, instances for people's use, and bandwidth for the fetches and crawls and access.

How can you pay for this if it's free? It's unreliable unless it's financially viable.


Totally right, great observation :)

For now it's a free service with a single rate-limited form. Now it's time to work on adding specialty features that are worth paying for.


How about avoiding redundancy? Are identical CSS files cached twice, or referenced by their hash?

The page URI is a bit obscure though. I think a tesoro.io/example.tld/page/foobar/timestamp scheme would look better.

What about big media content, and/or small differences between copies of it?


Great question.

Currently there is no global redundancy checking, only local checking within the same page. So two identical CSS files from different archives are both kept. While this might not be ideal for scaling to infinity, each archive plus its dependencies is currently limited to 25MB in size, which should help keep costs under control until this is monetised. :)
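The global dedup discussed above (and the hash-keying idea from the question) can be sketched as a content-addressed store: identical blobs hash to the same key, so only one copy is kept. This is a toy illustration of the idea, not how Tesoro currently works.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// store is a toy content-addressed blob store: files are keyed by the
// SHA-256 of their contents, so the same CSS file pulled in by many
// archives is stored exactly once and referenced by its hash.
type store map[string][]byte

// put saves content under its content hash (if not already present) and
// returns the key that archives would reference.
func (s store) put(content []byte) string {
	sum := sha256.Sum256(content)
	key := hex.EncodeToString(sum[:])
	if _, ok := s[key]; !ok { // keep only the first copy
		s[key] = content
	}
	return key
}

func main() {
	s := store{}
	k1 := s.put([]byte("body { color: red }"))
	k2 := s.put([]byte("body { color: red }")) // same CSS, second archive
	fmt.Println(k1 == k2, len(s))             // true 1
}
```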


What file system are you using? Couldn't a deduplicating file system handle redundancy for you?

