"Surfer: The World's First Digital Footprint Exporter" is dubious—it's clearly not the first. Kicking off with such a bold claim while only supporting seven major platforms? A scraper like this is only valuable if it has hundreds of integrations; the more niche, the better. The idea is great, but this needs a lot more time in the oven.
I would prefer a CLI tool with partial gather support. Something that I could easily set up to run on a cheap instance somewhere and have it scrape all my data continuously at set intervals, then give me the data in the most readable format possible through an easy access path. I've been thinking of making something like that, but with https://github.com/microsoft/graphrag at the center of it. A continuously rebuilt GraphRAG of all your data.
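To make that concrete, here's roughly the shape I have in mind -- a toy sketch, not a real tool; the exporter function and directory layout are made up, and a GraphRAG rebuild would slot in after each pass:

    # Toy sketch: run a set of exporters on an interval and drop timestamped
    # JSON into a directory you can query (or feed into GraphRAG) later.
    import json, time
    from datetime import datetime, timezone
    from pathlib import Path

    EXPORT_DIR = Path.home() / "footprint"
    INTERVAL_SECONDS = 6 * 60 * 60  # every six hours

    def export_bookmarks():  # hypothetical stand-in for a real scraper
        return {"bookmarks": []}

    EXPORTERS = {"bookmarks": export_bookmarks}

    while True:
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        for name, fn in EXPORTERS.items():
            out = EXPORT_DIR / name / f"{stamp}.json"
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_text(json.dumps(fn(), indent=2))
        # a GraphRAG re-index of EXPORT_DIR would go here
        time.sleep(INTERVAL_SECONDS)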
HPI builds an entire ecosystem around your data, making it programmatic rather than just a dump of text files. The point of HPI is to build your own stuff on top of it, and it all integrates seamlessly into one Python package.
If you want to get more fancy / antithetical to HPI, you can use https://github.com/hpi/authenticated_hpi_api or https://github.com/hpi/hpi-graph so you can theoretically expose it to the web (I am squatting the HPI org; I am not the creator of HPI). I made the authentication method JWT-based, so you can create JWTs that grant access to only certain services' data. (Beware, hpi-graph is very out of date and I haven't touched it lately, but my HPI stuff has been chugging away downloading data.)
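To give a flavor of the JWT scoping (a rough illustration with PyJWT, not the actual authenticated_hpi_api code; the claim names are made up): the token carries a list of services, and the API only serves data for services in that list.

    # Rough illustration: issue a JWT whose claims name only the services
    # (HPI modules) it is allowed to read.
    from datetime import datetime, timedelta, timezone
    import jwt  # PyJWT

    SECRET = "change-me"  # hypothetical signing key

    def issue_token(services, days=30):
        payload = {
            "services": services,  # e.g. ["github", "reddit"]
            "exp": datetime.now(timezone.utc) + timedelta(days=days),
        }
        return jwt.encode(payload, SECRET, algorithm="HS256")

    def allowed(token, service):
        claims = jwt.decode(token, SECRET, algorithms=["HS256"])
        return service in claims["services"]

    token = issue_token(["github"])
    print(allowed(token, "github"), allowed(token, "reddit"))  # True False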
Some of the /hpi stuff I made is a bit of a mish-mash because it was ripped out of another project I was working on, so you'll see references to "Archivist" or things that aren't local-first and depend on Vercel applications.
Yeah, it was honestly more of a marketing statement lol, but we're removing it for sure. Adding daily/interval exporting is one of our top priorities right now, and after that and making the scraping more reliable, we'll add something similar to GraphRAG. Curious to hear what other integrations you would want built into this system.
It exported a 75MB JSON file of ChatGPT "Conversations". I extracted 19MB of raw text from this as a CSV. I then took this into Nomic.ai and embedded all of the text to create a clustered visualization of topics in my ChatGPT conversations.
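The extraction step was basically flattening the export's message tree into rows; something like the below works, though the field names reflect the export format as of when I did it, so double-check against your own conversations.json:

    # Flatten a ChatGPT data export (conversations.json) into a CSV of raw text.
    import csv, json

    with open("conversations.json") as f:
        conversations = json.load(f)

    with open("chatgpt_text.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "role", "text"])
        for convo in conversations:
            for node in convo.get("mapping", {}).values():
                msg = node.get("message")
                if not msg:
                    continue
                parts = (msg.get("content") or {}).get("parts") or []
                text = " ".join(p for p in parts if isinstance(p, str)).strip()
                if text:
                    writer.writerow([convo.get("title", ""),
                                     (msg.get("author") or {}).get("role", ""),
                                     text])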
thanks for the feedback, added the list of supported platforms and example of exported data as well to the readme. we're focusing mostly on the exporting part, so examples of what can be done with the data will come later.
The answer to online platforms trafficking in personal data and metadata is two parallel and concurrent efforts:
1. Much tougher data privacy regulations (needed per country)
2. A central trusted, international nonprofit clearinghouse and privacy grants/permissions repository that centralizes basic personal details and provides a central way to update name, address(es), email, etc. that are then used on-demand only by companies (no storage)
Doing both simplifies things greatly for people and lets someone audit what every company knows about them and what it can know, and remove allowances for companies they don't agree with. One of the worst cases is the US, where personal information is not owned by the individual, can be traded for profit, and there is almost zero control unless it's health related.
Will you use a central trusted, international nonprofit clearinghouse and privacy grants/permissions repository that is run by the government of China / Iran / [state]?
It is important for privacy activists to understand that "centralised" is an anti-pattern for privacy.
Instead we need security and control over our data on devices and internet platforms guaranteed by the law.
I'm not talking about a distributed or self-hosted technical solution, but a centralized trusted nonprofit organization. Technology alone can't automate away privacy management issues.
I created an app to do end-to-end encrypted contact info sharing and updating with your second point in mind. Because we hold only encrypted data that we can't access, people will hopefully trust that their contact info is only in the hands of the people they want. https://neu.cards
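Conceptually it works along these lines (a generic PyNaCl sketch, not neu.cards' actual implementation): the card is encrypted on the client with the recipient's public key, so the server only ever stores ciphertext it cannot read.

    import json
    from nacl.public import PrivateKey, SealedBox

    recipient_key = PrivateKey.generate()  # the recipient keeps the private half

    card = json.dumps({"name": "Ada", "email": "ada@example.com"}).encode()
    ciphertext = SealedBox(recipient_key.public_key).encrypt(card)  # upload this blob

    decrypted = SealedBox(recipient_key).decrypt(ciphertext)  # only the recipient can decrypt
    print(json.loads(decrypted))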
You might be interested in Peergos for the storage and access control part. We have a profile where you can control (and revoke) access to each field individually. E2EE because most people won't want to self-host.
Or it all just happens on the client side before it even hits the Internet. I would love it if Firefox allowed users to use Postgres instead of SQLite to store their places.sqlite database.
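In the meantime you can already treat places.sqlite as a plain queryable log; copy the file first since the running browser keeps it locked (the profile path below is a placeholder):

    import shutil, sqlite3

    shutil.copy("/path/to/profile/places.sqlite", "/tmp/places.sqlite")
    con = sqlite3.connect("/tmp/places.sqlite")
    rows = con.execute(
        """SELECT url, title, visit_count,
                  datetime(last_visit_date / 1000000, 'unixepoch') AS last_visit
           FROM moz_places
           ORDER BY last_visit_date DESC
           LIMIT 20"""
    ).fetchall()
    for url, title, count, last_visit in rows:
        print(last_visit, count, title or url)
    con.close()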
I dimly remember some kid--or maybe it was only apocryphal--in the early 2000s, where they tried typing "Like Halo but with X" into a text file before changing the extension to .exe...
Not sure if we are joking about ourselves but when I was a kid, I was so confused by how games were made.
I started drawing individual frames of the game I wanted. I remember that about 45 minutes into this venture I had an existential crisis about how many frames you'd need for something like GTA, to show every possible combination.
I had the right idea but wasn't thinking about how to leverage the computer correctly.
Haha I had the exact same experience with Prince of Persia. The existential crisis took nearly a decade to wear off.
The realization that every possible image that can fit on a screen can be stuck in a bitmap helped keep it going. Everything that could possibly ever be photographed, just sitting there in the latent space waiting to be summoned.
... And now, I'm amazed that through some basically fancy noise we can type in words and get pictures in under a second. They even almost have the right number of fingers.
A browser addon that takes one's password manager export and deletes every account, possibly after scraping the data, would be amazing. No one has done it, and it could be built such that the system will eventually safely delete every account on every site (e.g. using developer tools, accessibility options, being intended only for use in a fresh browser install, sourcing information from volunteers). You'd have a sheet tracking the process with states like verification pending, manual intervention pending, deleted, and waiting. Many humans have hundreds of accounts they no longer use, and this sort of tool could thus be a good Y Combinator or hobby project.
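Bootstrapping the tracking sheet from the export is the easy part; something like this would do (column names vary by password manager, so "name" and "url" here are assumptions):

    # Seed a deletion-tracking CSV from a password-manager export and record a
    # state per account: waiting -> verification pending / manual intervention
    # pending -> deleted.
    import csv

    with open("passwords_export.csv") as src, \
         open("deletion_tracker.csv", "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=["site", "account", "state", "notes"])
        writer.writeheader()
        for row in csv.DictReader(src):
            writer.writerow({"site": row.get("url", ""),
                             "account": row.get("name", ""),
                             "state": "waiting",
                             "notes": ""})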
I made something like this since I was tired of the asymmetric nature of data collection that happens on the Internet. It's still not where I would like it to be, but it's been really nice being able to treat my browsing history as any old log that I can query over. Tools like Dogsheep are nice, but they tend to rely on the platform allowing the data to be exported. This bypasses those limits by just doing it on the client.
This lets me create dashboards to see usage for certain topics. For example, I have a "Dev Browser" which tracks the latest sites I've visited that are related to development topics [1]. I similarly have a few for all the online reading I do. One for blogs, one for fanfiction, and one for webfiction in general.
I've talked about my first iteration before on here [2].
My second iteration ended up with a userscript which sends data on the sites I visit to a Vector instance (no affiliation; [3]). Vector is in there because for certain sites (i.e. those behind draconian Cloudflare configurations), I want to save a local copy of the site. So Vector can pop that field, save it to a local MinIO instance, and at the same time push the rest of the record to something like Grafana Loki and Postgres, while being very fast.
I've started looking into a third iteration utilizing mitmproxy. It helps a lot with saving local copies since it's happening outside of the browser, so I don't feel the hitch when a page is inordinately heavy for whatever reason. It's also very nice that it'd work with all browsers just by setting a proxy, which means I could set it up for my phone either as a normal proxy or as a WireGuard "transparent" proxy. I only need to set up certificates for it to work.
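A minimal addon along those lines looks something like this (a sketch with arbitrary output paths; run it with mitmdump -s save_pages.py): it logs every visited URL as JSON lines and keeps a local copy of HTML responses.

    import hashlib, json, time
    from pathlib import Path
    from mitmproxy import http

    OUT = Path.home() / "browse-log"
    OUT.mkdir(exist_ok=True)

    def response(flow: http.HTTPFlow) -> None:
        ctype = flow.response.headers.get("content-type", "")
        record = {"ts": time.time(), "url": flow.request.pretty_url,
                  "status": flow.response.status_code, "type": ctype}
        with open(OUT / "visits.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")
        if "text/html" in ctype:
            name = hashlib.sha256(flow.request.pretty_url.encode()).hexdigest()[:16]
            (OUT / (name + ".html")).write_bytes(flow.response.content)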
the idea of personal data centralization sounds intriguing, but let's be real - companies will always find a way to keep a grip on our info. maybe it's time for a digital revolution, or just another excuse for me to procrastinate on coding.
I’ve been working on a lot of similar ideas over the years, and my current ideal stack is to:
1. Use Mobile App APIs.
2. Generate OpenAPI Arazzo Workflows.
1 ensures breakage is minimal, since mobile apps are slow to upgrade and older versions are expected to keep working. 2 lets you write repeatable recipes in YAML, which makes them quite portable to other systems.
The Arazzo spec is still quite early, but I am hopeful about this approach.
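To show the flavor rather than the spec itself (the field names below are made up, not Arazzo): a recipe is just data, and a small runner executes it step by step, which is what makes it portable.

    import requests, yaml

    RECIPE = yaml.safe_load("""
    workflow: export-profile
    steps:
      - name: fetch-profile
        method: GET
        url: https://api.example.com/v1/me
    """)

    session = requests.Session()
    for step in RECIPE["steps"]:
        resp = session.request(step["method"], step["url"])
        print(step["name"], resp.status_code)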
Myself, I'd probably prefer to use something like Huginn to create a customized approach to all of the online platforms I'm interested in, rather than a curated list.
hey, sahil here. i'm one of the contributors on surfer and have been working on this project for around three weeks now. we appreciate the feedback and are excited to keep pushing this project forward with your input!
I don't know anything about trademarks, service marks, etc but I do know that the product name "Surfer" has been in use for about 40 years in my industry, geoscience, by a company in Golden, Colorado. [0]
Maybe you can make a new product in a different industry and recycle the name. I don't know how that works but right now, you're playing in an established product's namespace.