Surfer: Centralize all your personal data from online platforms (github.com/cerebrus-maximus)
151 points by swyx 4 months ago | 46 comments



"Surfer: The World's First Digital Footprint Exporter" is dubious—it's clearly not the first. Kicking off with such a bold claim while only supporting seven major platforms? A scraper like this is only valuable if it has hundreds of integrations; the more niche, the better. The idea is great, but this needs a lot more time in the oven.

I would prefer a CLI tool with partial gather support. Something that I could easily set up to run on a cheap instance somewhere and have it scrape all my data continuously at set intervals, and then give me the data in the most readable format possible through an easy access path. I've been thinking of making something like that, but with https://github.com/microsoft/graphrag at the center of it. A continuously rebuilt GraphRAG of all your data.
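Roughly the shape I have in mind, as a sketch (the `surfer export` subcommand, source names, and output paths here are hypothetical placeholders, just to illustrate the interval loop):

    import subprocess
    import time
    from datetime import datetime, timezone
    from pathlib import Path

    EXPORT_DIR = Path("~/footprint").expanduser()   # hypothetical output path
    SOURCES = ["chatgpt", "github", "reddit"]       # hypothetical source names
    INTERVAL_SECONDS = 6 * 60 * 60                  # scrape every six hours

    def run_once() -> None:
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        for source in SOURCES:
            out = EXPORT_DIR / source / f"{stamp}.json"
            out.parent.mkdir(parents=True, exist_ok=True)
            # Hypothetical CLI invocation; swap in whatever exporter you actually use.
            result = subprocess.run(
                ["surfer", "export", source, "--format", "json"],
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                out.write_text(result.stdout)
            else:
                print(f"{source} failed: {result.stderr.strip()}")

    if __name__ == "__main__":
        while True:
            run_once()
            time.sleep(INTERVAL_SECONDS)

A cron entry pointing at the same script would do just as well on a cheap instance.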


Take a look at https://github.com/karlicoss/HPI

It builds an entire ecosystem around your data, making it programmatic rather than just a dump of text files. The point of HPI is that you build your own stuff on top of it, and it all integrates seamlessly into one Python package.
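To give a flavour of the "programmatic" part, once a module is set up you just import your data and query it from Python (a minimal sketch; the module path, function names, and attributes vary depending on which HPI modules you've configured, so treat them as illustrative):

    # Sketch only: assumes the reddit module is configured; other modules
    # (my.github, my.hypothesis, ...) work along the same lines.
    from collections import Counter

    import my.reddit.all as reddit  # module path depends on your HPI config

    comments = list(reddit.comments())
    print(f"{len(comments)} comments archived")

    # e.g. bucket them by year, assuming each item carries a created timestamp
    per_year = Counter(c.created.year for c in comments)
    print(per_year.most_common())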

The next stop after Karlicoss is https://github.com/seanbreckenridge/HPI_API which creates a REST API on top of your HPI without any additional configuration.

If you want to get fancier / antithetical to HPI, you can use https://github.com/hpi/authenticated_hpi_api or https://github.com/hpi/hpi-graph so you can theoretically expose it to the web (I am squatting the HPI org; I am not the creator of HPI). I made the authentication method JWTs, so you can create tokens that grant access to only certain services' data. (Beware, hpi-graph is very out of date and I haven't touched it lately, but my HPI stuff has been chugging away downloading data.)
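The scoping is just a claim in the token; roughly something like this (a PyJWT sketch of the idea, not the exact code in authenticated_hpi_api):

    # Rough sketch of minting/verifying a service-scoped token with PyJWT.
    # Claim names and secret handling are illustrative, not the exact scheme
    # used in authenticated_hpi_api.
    import jwt  # pip install PyJWT

    SECRET = "change-me"

    def mint_token(services: list[str]) -> str:
        # Only the listed HPI services will be readable with this token.
        return jwt.encode({"services": services}, SECRET, algorithm="HS256")

    def allowed(token: str, service: str) -> bool:
        claims = jwt.decode(token, SECRET, algorithms=["HS256"])
        return service in claims.get("services", [])

    token = mint_token(["github", "reddit"])
    print(allowed(token, "reddit"))   # True
    print(allowed(token, "twitter"))  # False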

Some of the /hpi stuff I made is a bit of a mish-mash because it was rip-and-replaced from a project I was making, so you'll see references to "Archivist" or things that aren't local-first and depend on Vercel applications.


The number of built-in platforms isn't necessarily the problem. The best systems are those that establish a plugin ecosystem.


While I agree that it's not the first, I think it's unfair to say that it's not valuable without hundreds of integrations.


Yeah it was honestly more of a marketing statement lol, but removing it for sure. Adding daily/interval exporting is one of our top priorities right now and after that and making the scraping more reliable, we'll add something similar to GraphRAG. Curious to hear what other integrations you would want built into this system.


Definitely not the first such scraper. DogSheep has been around for a while: https://dogsheep.github.io/

It is based around SQLite rather than Supabase (Postgres), which I think is a better choice for preservation/archival purposes.


Oh, interesting, will look more into this.


No list of supported platforms. No example of what the extracted data looks like. No examples of what can be done with the extracted data.


closest thing to a supported scraper list: https://github.com/CEREBRUS-MAXIMUS/Surfer-Data/tree/main/sr...


One example of how I used it:

It exported a 75MB JSON of ChatGPT "Conversations". I extracted 19MB of raw text from this as a CSV. I then took this into Nomic.ai and embedded all of the text to create a clustered visualization of topics in my ChatGPT conversations.
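The extraction step was basically this (a sketch assuming the usual conversations.json layout from the ChatGPT export; field names may differ between export versions):

    # Sketch: flatten a ChatGPT conversations.json export into a CSV of raw text.
    # Assumes the export is a list of conversations, each with a "mapping" of
    # message nodes; adjust the field names if your export differs.
    import csv
    import json

    with open("conversations.json", encoding="utf-8") as f:
        conversations = json.load(f)

    with open("chatgpt_text.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["conversation", "role", "text"])
        for convo in conversations:
            title = convo.get("title") or "untitled"
            for node in convo.get("mapping", {}).values():
                msg = node.get("message")
                if not msg:
                    continue
                parts = msg.get("content", {}).get("parts") or []
                text = " ".join(p for p in parts if isinstance(p, str)).strip()
                if text:
                    writer.writerow([title, msg["author"]["role"], text])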


Hey, would love to see a visualization of this / how you made it :)


thanks for the feedback, added the list of supported platforms and an example of exported data to the readme. we're focusing mostly on the exporting part, so examples of what can be done with the data will come later.


poster here - yeah if i have one criticism of their readme/marketing this is it. you can see it in the demo video but this needed to be up front


The answer to online platforms trafficking in personal data and metadata is two parallel and concurrent efforts:

1. Much tougher data privacy regulations (needed per country)

2. A central trusted, international nonprofit clearinghouse and privacy grants/permissions repository that centralizes basic personal details and provides a central way to update name, address(es), email, etc. that are then used on-demand only by companies (no storage)

Doing both would simplify things greatly for people: anyone could audit what every company knows about them and could know about them, and revoke allowances for companies they don't agree with. One of the worst cases is the US, where personal information is not owned by the individual, can be traded for profit, and is subject to almost zero control unless it's health related.


Will you use a central trusted, international nonprofit clearinghouse and privacy grants/permissions repository that is run by the government of China / Iran / [state]?

It is important for privacy activists to understand that "centralised" is an anti-pattern for privacy.

Instead we need security and control over our data on devices and internet platforms guaranteed by the law.


Yes, I agree. I hate web-forms so much that I wrote a Firefox extension that fills them automatically (using LLM).

It sounds to me like what you're describing under 2 is a real use case for blockchain contracts?

Store your latest data encrypted on-chain and give every 3rd party you trust a key that corresponds to the relevant part of the data?

Curious about opinions on this.


I'm not talking about a distributed or self-hosted technical solution, but a centralized trusted nonprofit organization. Technology alone can't automate away privacy management issues.


3rd party would only have read access, I'm assuming? Also would love to try out the extension.


Can you share this Firefox extension? Sounds super handy!


I created an app to do end-to-end encrypted contact info sharing and updating with your second point in mind. Because we hold only encrypted data that we can't access, people will hopefully trust that their contact info is only in the hands of the people they want. https://neu.cards


You might be interested in Peergos for the storage and access control part. We have a profile where you can control (and revoke) access to each field individually. E2EE because most people won't want to self-host.

https://peergos.org/posts/social-profile


The second one sounds similar to the Solid project, which is what Tim Berners-Lee is currently working on: https://solidproject.org/.


Or it could all just happen on the client side before it even hits the Internet. I would love it if Firefox allowed users to use Postgres instead of SQLite to store their places.sqlite data.
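Even as SQLite it's already pretty queryable, e.g. pulling your most-visited pages out of a copy of places.sqlite (a sketch; copy the file first since Firefox locks the live one, and note that moz_places timestamps are microseconds since the epoch):

    import sqlite3
    from datetime import datetime, timezone

    con = sqlite3.connect("places-copy.sqlite")
    rows = con.execute(
        """
        SELECT url, title, visit_count, last_visit_date
        FROM moz_places
        WHERE visit_count > 0
        ORDER BY visit_count DESC
        LIMIT 20
        """
    ).fetchall()

    for url, title, visits, last_us in rows:
        last = datetime.fromtimestamp((last_us or 0) / 1_000_000, tz=timezone.utc)
        print(f"{visits:5d}  {last:%Y-%m-%d}  {title or url}")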


The most exciting thing to happen to programming is the chatbot enabling millions of enthusiastic people to write code.


I dimly remember some kid--or maybe it was only apocryphal--in the early 2000s who tried typing "Like Halo but with X" into a text file before changing the extension to .exe...

Still silly, but closer.


I remember a similar story from the 2000s where somebody interested in graphics programming saved "make some cool effects" as an .exe.


Not sure if we are joking about ourselves but when I was a kid, I was so confused by how games were made.

I started drawing individual frames of the game I wanted. I remember that about 45 minutes into this venture I had an existential crisis about how many frames you'd need for something like GTA, to show every possible combination.

I had the right idea but wasn't thinking about how to leverage the computer correctly.


Haha I had the exact same experience with Prince of Persia. The existential crisis took nearly a decade to wear off.

The realization that every possible image that can fit on a screen can be stuck in a bitmap helped keep it going. Everything that could possibly ever be photographed, just sitting there in the latent space waiting to be summoned.

... And now, I'm amazed that through some basically fancy noise we can type in words and get pictures in under a second. They even almost have the right number of fingers.


A browser addon that takes one's password manager export & deletes every account, possibly after scraping the data, would be amazing. No one has done it, and it can be done such that the system will eventually safely delete every account on every site (e.g. using developer tools, accessibility options, being intended to be used only in a fresh browser install, sourcing information from volunteers). You'd have a sheet tracking the process, e.g. verification pending, manual intervention pending, deleted, and waiting states. Many humans have hundreds of accounts they no longer use, and this sort of tool could thus be a good Y Combinator or hobby project.
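The tracking sheet itself is trivial, something like a record per account with a small state machine (names here are purely illustrative, not from any existing tool):

    # Illustrative sketch of the per-account tracking record; state names
    # mirror the ones above and aren't from any existing tool.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone
    from enum import Enum

    class DeletionState(Enum):
        WAITING = "waiting"
        VERIFICATION_PENDING = "verification pending"
        MANUAL_INTERVENTION = "manual intervention pending"
        DELETED = "deleted"

    @dataclass
    class AccountRecord:
        site: str
        username: str
        state: DeletionState = DeletionState.WAITING
        notes: str = ""
        updated: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    accounts = [AccountRecord("example.com", "old_handle")]
    accounts[0].state = DeletionState.VERIFICATION_PENDING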


I made something like this since I was tired of the asymmetric nature of data collection that happens on the Internet. It's still not where I would like it to be, but it's been really nice being able to treat my browsing history as any old log that I can query over. Tools like dogsheep are nice, but they tend to rely on the platform allowing the data to be taken out. This bypasses those limits by just doing it on the client.

This lets me create dashboards to see usage for certain topics. For example, I have a "Dev Browser" which tracks the latest sites I've visited that are related to development topics [1]. I similarly have a few for all the online reading I do. One for blogs, one for fanfiction, and one for webfiction in general.

I've talked about my first iteration before on here [2].

My second iteration ended up with a userscript which sends data on the sites I visit to a Vector instance (no affiliation; [3]). Vector is in there because for certain sites (i.e. those behind draconian Cloudflare configurations), I want to save a local copy of the page. So Vector can pop off that field, save it to a local MinIO instance, and at the same time push the rest of the record to something like Grafana Loki and Postgres, while being very fast.
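The Vector config is just a source, a route, and a couple of sinks, roughly this shape (from memory, so component names and options may differ by Vector version; the MinIO bucket and Loki endpoint are placeholders and credentials come from the usual AWS env vars):

    [sources.userscript]
    type = "http_server"
    address = "127.0.0.1:8080"
    decoding.codec = "json"

    # Route off records that carry a saved page body (the Cloudflare-walled sites)
    [transforms.split]
    type = "route"
    inputs = ["userscript"]
    route.with_body = 'exists(.page_body)'

    [sinks.minio]
    type = "aws_s3"
    inputs = ["split.with_body"]
    bucket = "page-archive"             # placeholder bucket on a local MinIO
    endpoint = "http://127.0.0.1:9000"
    encoding.codec = "json"

    # Drop the heavy body field before shipping the rest of the record to Loki
    [transforms.strip_body]
    type = "remap"
    inputs = ["userscript"]
    source = 'del(.page_body)'

    [sinks.loki]
    type = "loki"
    inputs = ["strip_body"]
    endpoint = "http://127.0.0.1:3100"
    labels.source = "browser"
    encoding.codec = "json"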

I've started looking into a third iteration utilizing mitmproxy. It helps a lot with saving local copies since it happens outside the browser, so I don't feel the hitch when a page is inordinately heavy for whatever reason. It's also very nice that it'd work with all browsers just by setting a proxy, which means I could set it up for my phone either as a normal proxy or as a WireGuard "transparent" proxy. I'd only need to set up certificates for it to work.

---

[1] https://raw.githubusercontent.com/zamu-flowerpot/zamu-flower...

[2] https://news.ycombinator.com/item?id=31429221

[3] http://vector.dev


the idea of personal data centralization sounds intriguing, but let's be real - companies will always find a way to keep a grip on our info. maybe it's time for a digital revolution, or just another excuse for me to procrastinate on coding.


"Centralize" is a privacy anti-pattern. Maximum centralisation should be your KeePass file.


Yes, it is the wrong word to use. Personal data aggregation (with storage in a personal data vault) would be better.


"personal archive"?


I’ve been working on a lot of similar ideas over the years, and my current ideal stack is to:

1. Use Mobile App APIs.

2. Generate OpenAPI Arazzo Workflows.

1 ensures breakage is minimal, since mobile apps are slow to upgrade and older versions are expected to keep working. 2 lets you write repeatable recipes using YAML, which makes it quite portable to other systems.

The Arazzo spec is still quite early, but I am hopeful about this approach.
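To give a flavour, a recipe ends up looking roughly like this (a hand-written sketch against a hypothetical export API, so treat the operationIds, endpoints, and field details as illustrative rather than spec-perfect):

    # Sketch of an Arazzo workflow against a hypothetical "export my data" API;
    # the operationIds and URLs are made up for illustration.
    arazzo: 1.0.0
    info:
      title: Export my posts
      version: 1.0.0
    sourceDescriptions:
      - name: exampleApi
        url: https://example.com/openapi.yaml
        type: openapi
    workflows:
      - workflowId: exportPosts
        steps:
          - stepId: login
            operationId: createSession
            parameters:
              - name: username
                in: query
                value: $inputs.username
            successCriteria:
              - condition: $statusCode == 200
            outputs:
              token: $response.body#/token
          - stepId: fetchPosts
            operationId: listPosts
            parameters:
              - name: Authorization
                in: header
                value: $steps.login.outputs.token
            successCriteria:
              - condition: $statusCode == 200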


as someone who used to write scrapers for a living, this is going to break constantly. cool concept though.


That's what's going to make this software live or die. It needs

1) Constant updates to existing packages

2) Continued expansion of more sites/apps to export your data from


agreed. curious to hear what other sites/apps you would want to be able to export your data from.


Myself, I'd probably prefer to use something like Huginn to create a customized approach to all of the online platforms I'm interested in, rather than rely on a curated list.

https://github.com/huginn/huginn


hey, sahil here. i'm one of the contributors on surfer and have been working on this project for around three weeks now. we appreciate the feedback and are excited to keep pushing this project forward with your input!


Sahil,

I don't know anything about trademarks, service marks, etc., but I do know that the product name "Surfer" has been in use for about 40 years in my industry, geoscience, by a company in Golden, Colorado. [0]

Maybe you can make a new product in a different industry and recycle the name. I don't know how that works but right now, you're playing in an established product's namespace.

[0] https://www.goldensoftware.com/products/surfer/


All that's needed is to extend the name with a modifier, which probably makes for stronger branding IMO.

https://surfer.nmr.mgh.harvard.edu/

https://towey-websurfer.apponic.com/

Many open source libraries have also used the name:

https://rubygems.org/gems/surfer

https://www.npmjs.com/package/surfer

etc


Seems like an easy way to get locked out of your accounts.


As long as it happens after the scraping is over.


Not to be confused with Surfer, the SEO content optimization platform.


Or the 40 year old geoscience software product Surfer [0]

[0] https://www.goldensoftware.com/products/surfer/

Cool name but it was taken way back when I was writing geoscience software. That's been a while.



