"Surfer: The World's First Digital Footprint Exporter" is dubious—it's clearly not the first. Kicking off with such a bold claim while only supporting seven major platforms? A scraper like this is only valuable if it has hundreds of integrations; the more niche, the better. The idea is great, but this needs a lot more time in the oven.
I would prefer a CLI tool with partial gather support. Something that I could easily set up to run on a cheap instance somewhere and have it scrape all my data continuously at set intervals, then give me the data in the most readable format possible through an easy access path. I've been thinking of making something like that, but with https://github.com/microsoft/graphrag at the center of it. A continuously rebuilt GraphRAG of all your data.
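To make that concrete, here's roughly the shape I have in mind -- a toy sketch, not a real tool; the exporter function and directory layout are made up, and a GraphRAG rebuild would slot in after each pass:

    # Toy sketch: run a set of exporters on an interval and drop timestamped
    # JSON into a directory you can query (or feed into GraphRAG) later.
    import json, time
    from datetime import datetime, timezone
    from pathlib import Path

    EXPORT_DIR = Path.home() / "footprint"
    INTERVAL_SECONDS = 6 * 60 * 60  # every six hours

    def export_bookmarks():  # hypothetical stand-in for a real scraper
        return {"bookmarks": []}

    EXPORTERS = {"bookmarks": export_bookmarks}

    while True:
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        for name, fn in EXPORTERS.items():
            out = EXPORT_DIR / name / f"{stamp}.json"
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_text(json.dumps(fn(), indent=2))
        # a GraphRAG re-index of EXPORT_DIR would go here
        time.sleep(INTERVAL_SECONDS)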
HPI builds an entire ecosystem around your data, making it programmatic rather than just a dump of text files. The point of HPI is to build your own stuff on top of it, and it all integrates seamlessly into one Python package.
If you want to get more fancy / antithetical to HPI, you can use https://github.com/hpi/authenticated_hpi_api or https://github.com/hpi/hpi-graph so you can theoretically expose it to the web (I am squatting the HPI org; I am not the creator of HPI). I made the authentication method JWT-based, so you can create JWTs that grant access to only certain services' data. (Beware, hpi-graph is very out of date and I haven't touched it lately, but my HPI stuff has been chugging away downloading data.)
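To give a flavor of the JWT scoping (a rough illustration with PyJWT, not the actual authenticated_hpi_api code; the claim names are made up): the token carries a list of services, and the API only serves data for services in that list.

    # Rough illustration: issue a JWT whose claims name only the services
    # (HPI modules) it is allowed to read.
    from datetime import datetime, timedelta, timezone
    import jwt  # PyJWT

    SECRET = "change-me"  # hypothetical signing key

    def issue_token(services, days=30):
        payload = {
            "services": services,  # e.g. ["github", "reddit"]
            "exp": datetime.now(timezone.utc) + timedelta(days=days),
        }
        return jwt.encode(payload, SECRET, algorithm="HS256")

    def allowed(token, service):
        claims = jwt.decode(token, SECRET, algorithms=["HS256"])
        return service in claims["services"]

    token = issue_token(["github"])
    print(allowed(token, "github"), allowed(token, "reddit"))  # True False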
Some of the /hpi stuff I made is a bit of a mish-mash because it was ripped out of another project I was working on, so you'll see references to "Archivist" or things that aren't local-first and depend on Vercel applications.
Yeah, it was honestly more of a marketing statement lol, but we're removing it for sure. Adding daily/interval exporting is one of our top priorities right now, and after that and making the scraping more reliable, we'll add something similar to GraphRAG. Curious to hear what other integrations you would want built into this system.
It exported a 75MB JSON file of ChatGPT "Conversations". I extracted 19MB of raw text from this as a CSV. I then took this into Nomic.ai and embedded all of the text to create a clustered visualization of topics in my ChatGPT conversations.
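The extraction step was basically flattening the export's message tree into rows; something like the below works, though the field names reflect the export format as of when I did it, so double-check against your own conversations.json:

    # Flatten a ChatGPT data export (conversations.json) into a CSV of raw text.
    import csv, json

    with open("conversations.json") as f:
        conversations = json.load(f)

    with open("chatgpt_text.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "role", "text"])
        for convo in conversations:
            for node in convo.get("mapping", {}).values():
                msg = node.get("message")
                if not msg:
                    continue
                parts = (msg.get("content") or {}).get("parts") or []
                text = " ".join(p for p in parts if isinstance(p, str)).strip()
                if text:
                    writer.writerow([convo.get("title", ""),
                                     (msg.get("author") or {}).get("role", ""),
                                     text])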
thanks for the feedback, added the list of supported platforms and example of exported data as well to the readme. we're focusing mostly on the exporting part, so examples of what can be done with the data will come later.
The answer to online platforms trafficking in personal data and metadata is two parallel and concurrent efforts:
1. Much tougher data privacy regulations (needed per country)
2. A central trusted, international nonprofit clearinghouse and privacy grants/permissions repository that centralizes basic personal details and provides a central way to update name, address(es), email, etc. that are then used on-demand only by companies (no storage)
Doing both simplifies things greatly for people and lets someone audit what every company knows about them and what it can know, and remove allowances for companies they don't agree with. One of the worst cases is the US, where personal information is not owned by the individual, can be traded for profit, and there is almost zero control unless it's health related.
Will you use a central trusted, international nonprofit clearinghouse and privacy grants/permissions repository that is run by the government of China / Iran / [state]?
It is important for privacy activists to understand that "centralised" is an anti-pattern for privacy.
Instead we need security and control over our data on devices and internet platforms guaranteed by the law.
I'm not talking about a distributed or self-hosted technical solution, but a centralized trusted nonprofit organization. Technology alone can't automate away privacy management issues.
I created an app to do end-to-end encrypted contact info sharing and updating with your second point in mind. Because we hold only encrypted data that we can't access, people will hopefully trust that their contact info is only in the hands of the people they want. https://neu.cards
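Conceptually it works along these lines (a generic PyNaCl sketch, not neu.cards' actual implementation): the card is encrypted on the client with the recipient's public key, so the server only ever stores ciphertext it cannot read.

    import json
    from nacl.public import PrivateKey, SealedBox

    recipient_key = PrivateKey.generate()  # the recipient keeps the private half

    card = json.dumps({"name": "Ada", "email": "ada@example.com"}).encode()
    ciphertext = SealedBox(recipient_key.public_key).encrypt(card)  # upload this blob

    decrypted = SealedBox(recipient_key).decrypt(ciphertext)  # only the recipient can decrypt
    print(json.loads(decrypted))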
You might be interested in Peergos for the storage and access control part. We have a profile where you can control (and revoke) access to each field individually. E2EE because most people won't want to self-host.
Or it all just happens on the client side before it even hits the Internet. I would love it if Firefox allowed users to use Postgres instead of SQLite to store their places.sqlite database.
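In the meantime you can already treat places.sqlite as a plain queryable log; copy the file first since the running browser keeps it locked (the profile path below is a placeholder):

    import shutil, sqlite3

    shutil.copy("/path/to/profile/places.sqlite", "/tmp/places.sqlite")
    con = sqlite3.connect("/tmp/places.sqlite")
    rows = con.execute(
        """SELECT url, title, visit_count,
                  datetime(last_visit_date / 1000000, 'unixepoch') AS last_visit
           FROM moz_places
           ORDER BY last_visit_date DESC
           LIMIT 20"""
    ).fetchall()
    for url, title, count, last_visit in rows:
        print(last_visit, count, title or url)
    con.close()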
I dimly remember some kid--or maybe it was only apocryphal--in the early 2000s, where they tried typing "Like Halo but with X" into a text file before changing the extension to .exe...
Not sure if we are joking about ourselves but when I was a kid, I was so confused by how games were made.
I started drawing individual frames of the game I wanted. I remember that about 45 minutes into this venture I had an existential crisis about how many frames you'd need for something like GTA, to show every possible combination.
I had the right idea but wasn't thinking about how to leverage the computer correctly.
Haha I had the exact same experience with Prince of Persia. The existential crisis took nearly a decade to wear off.
The realization that every possible image that can fit on a screen can be stuck in a bitmap helped keep it going. Everything that could possibly ever be photographed, just sitting there in the latent space waiting to be summoned.
... And now, I'm amazed that through some basically fancy noise we can type in words and get pictures in under a second. They even almost have the right number of fingers.
A browser addon that takes one's password manager export and deletes every account, possibly after scraping the data, would be amazing. No one has done it, and it could be built such that the system will eventually safely delete every account on every site (e.g. using developer tools, accessibility options, being intended only for use in a fresh browser install, sourcing information from volunteers). You'd have a sheet tracking the process with states like verification pending, manual intervention pending, deleted, and waiting. Many humans have hundreds of accounts they no longer use, and this sort of tool could thus be a good Y Combinator or hobby project.
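Bootstrapping the tracking sheet from the export is the easy part; something like this would do (column names vary by password manager, so "name" and "url" here are assumptions):

    # Seed a deletion-tracking CSV from a password-manager export and record a
    # state per account: waiting -> verification pending / manual intervention
    # pending -> deleted.
    import csv

    with open("passwords_export.csv") as src, \
         open("deletion_tracker.csv", "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=["site", "account", "state", "notes"])
        writer.writeheader()
        for row in csv.DictReader(src):
            writer.writerow({"site": row.get("url", ""),
                             "account": row.get("name", ""),
                             "state": "waiting",
                             "notes": ""})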
I made something like this since I was tired of the asymmetric nature of data collection that happens on the Internet. It's still not where I would like it to be, but it's been really nice being able to treat my browsing history as any old log that I can query over. Tools like Dogsheep are nice, but they tend to rely on the platform allowing the data to be exported. This bypasses those limits by just doing it on the client.
This lets me create dashboards to see usage for certain topics. For example, I have a "Dev Browser" which tracks the latest sites I've visited that are related to development topics [1]. I similarly have a few for all the online reading I do. One for blogs, one for fanfiction, and one for webfiction in general.
I've talked about my first iteration before on here [2].
My second iteration ended up with a userscript which sends data on the sites I visit to a Vector instance (no affiliation; [3]). Vector is in there because for certain sites (i.e. those behind draconian Cloudflare configurations), I want to save a local copy of the site. So Vector can pop that field, save it to a local MinIO instance, and at the same time push the rest of the record to something like Grafana Loki and Postgres, while being very fast.
I've started looking into a third iteration utilizing mitmproxy. It helps a lot with saving local copies since it's happening outside of the browser, so I don't feel the hitch when a page is inordinately heavy for whatever reason. It's also very nice that it'd work with all browsers just by setting a proxy, which means I could set it up for my phone either as a normal proxy or as a WireGuard "transparent" proxy. I only need to set up certificates for it to work.
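A minimal addon along those lines looks something like this (a sketch with arbitrary output paths; run it with mitmdump -s save_pages.py): it logs every visited URL as JSON lines and keeps a local copy of HTML responses.

    import hashlib, json, time
    from pathlib import Path
    from mitmproxy import http

    OUT = Path.home() / "browse-log"
    OUT.mkdir(exist_ok=True)

    def response(flow: http.HTTPFlow) -> None:
        ctype = flow.response.headers.get("content-type", "")
        record = {"ts": time.time(), "url": flow.request.pretty_url,
                  "status": flow.response.status_code, "type": ctype}
        with open(OUT / "visits.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")
        if "text/html" in ctype:
            name = hashlib.sha256(flow.request.pretty_url.encode()).hexdigest()[:16]
            (OUT / (name + ".html")).write_bytes(flow.response.content)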
the idea of personal data centralization sounds intriguing, but let's be real - companies will always find a way to keep a grip on our info. maybe it's time for a digital revolution, or just another excuse for me to procrastinate on coding.
I’ve been working on a lot of similar ideas over the years, and my current ideal stack is to:
1. Use Mobile App APIs.
2. Generate OpenAPI Arazzo Workflows.
1 ensures breakage is minimal, since mobile apps are slow to upgrade and older versions are expected to keep working. 2 lets you write repeatable recipes in YAML, which makes them quite portable to other systems.
The Arazzo spec is still quite early, but I am hopeful about this approach.
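To show the flavor rather than the spec itself (the field names below are made up, not Arazzo): a recipe is just data, and a small runner executes it step by step, which is what makes it portable.

    import requests, yaml

    RECIPE = yaml.safe_load("""
    workflow: export-profile
    steps:
      - name: fetch-profile
        method: GET
        url: https://api.example.com/v1/me
    """)

    session = requests.Session()
    for step in RECIPE["steps"]:
        resp = session.request(step["method"], step["url"])
        print(step["name"], resp.status_code)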
Myself, I'd probably prefer to use something like Huginn to create a customized approach to all of the online platforms I'm interested in, rather than a curated list.
hey, sahil here. i'm one of the contributors on surfer and have been working on this project for around three weeks now. we appreciate the feedback and are excited to keep pushing this project forward with your input!
I don't know anything about trademarks, service marks, etc but I do know that the product name "Surfer" has been in use for about 40 years in my industry, geoscience, by a company in Golden, Colorado. [0]
Maybe you can make a new product in a different industry and recycle the name. I don't know how that works but right now, you're playing in an established product's namespace.