Self-hosting keeps your private data out of AI models

oidar · 2024-05-24T17:52:51 1716573171

Self-hosting is just responsible computing now. The big companies are just too big to care about small businesses, and will use your data in any way that they please - take it or leave it. And it's cheaper to boot. A synology NAS or a raspberry pi 3 could cover 90% of what most internet services offer the average consumer/small business right now.

throwanem · 2024-05-24T17:55:49 1716573349

You're leaving out "and having someone on staff or contract who can administer it for you".

orev · 2024-05-24T18:03:40 1716573820

Because with cloud providers you don’t need anyone on staff to understand the product, keep up with the constant changes and predatory features they slip in, manage the cost overruns, or help to migrate away from their proprietary data formats once you’ve had enough of their “whaddaya gonna do about it?” treatment of customers?

mikestew · 2024-05-24T18:07:35 1716574055

True or not, it doesn't invalidate GP's statement. You don't just throw a NAS in Mom's garage and call it a day.

orev · 2024-05-24T20:21:32 1716582092

It does invalidate the argument. They’re implying that with self hosting you need someone to manage it, but with cloud you don’t. The truth is you need someone managing both scenarios, and the cost remains roughly the same.

throwanem · 2024-05-24T20:57:45 1716584265

Does it? It is a lot easier to find people who can work with Office 365 than who can stand up and maintain a comparable suite of selfhosted services, especially when considering needs like secure remote access.

I don't know who else here has ever actually done the kind of work that we're talking about, but I have, for contract clients, over the span of most of a decade. Between G Suite and Office 365, those jobs dried up fast about a decade ago, and I wasn't sorry; the thing about selfhosted deployments of this type is that no matter how standard you try to make the platform, clients' requirements persist in varying, so each deployment ends up a somewhat different artifact whose differences, because those get less investment of effort than the common aspects, are always what cause the most problems and cost the most money in maintenance. No one was sorry to see that expense dwindle with the rise of SaaS officeware platforms; even for me, it wasn't a loss, because it freed up my time for more interesting and lucrative engagements.

Of course I'm sure that, for those of you saying this is easy, your experience is different from mine. It would be really interesting to hear more about how you've been able to make selfhosted deployments work so well for folks who aren't technical. That's a really significant achievement! Teams like Sandstorm have spent years at it and come away with nothing of import to show, and I'm sure I'm not alone in wanting to know more about how folks here have overcome the same challenges.

Cheer2171 · 2024-05-24T18:33:32 1716575612

Yes, it isn't bare metal vs cloud, it is being your own sysadmin vs relying on someone to sysadmin for you

dvfjsdhgfv · 2024-05-24T20:23:05 1716582185

I don't get your point. It doesn't matter if you host your stuff on AWS, Hetzner, or locally on a Pi, you need someone to manage that. If you are this someone, great, if not, you need to pay someone else. And contrary to what private cloud providers advertise,[0] the amount of work is similar in each case.

[Or used to, around 2006 or so - I'm not sure if they still claim managing public cloud resources is means less work.]

silverquiet · 2024-05-24T18:45:52 1716576352

I actually went to the DC yesterday. It had been 6 months since I had been there and everything was running fine (and has been for 2 years now). I just had a small addition. I don't love the drive, but it's really not much of my time to manage it.

throwanem · 2024-05-24T21:28:27 1716586107

By the sound of it, you are exactly the person I was talking about needing to have on staff. That's a good kind of person to be (I say, while being likewise), but I'm not sure it makes for an effective counterargument from experience.

silverquiet · 2024-05-24T21:41:27 1716586887

Yes, but my point is that I can do a lot of stuff that isn't just DCops work; in fact it's a very small part of my job. I handle all the infrastructure to include cloud and some managed services as well. Kinda the nature of a small company - you just do what needs doing.

throwanem · 2024-05-24T23:28:12 1716593292

Sure, I get that, but my point is that most small businesses aren't going to have or want someone with that set of skills. Those in and around the tech industry will, sure, but that isn't most small businesses. And selfhosted equivalents of SaaS officeware packages really aren't the sort of thing that can be reliably set up and managed on a "do what needs doing" basis - not because most people can't learn, but because starting from what you know as a user that's quite a bit of learning, and time thus better spent on any of the other forty dozen things that always want doing to keep a small shop afloat.

It would be nice if the tooling really was simple enough that wasn't so, but despite considerable effort to make it so from Sandstorm among others over quite a long time, as far as I know no one's had notable success in achieving that. Hence the rise and continued prominence of SaaS providers, who abstract all of that effort behind an SLA and a monthly fee.

I agree with the criticism of SaaS platform behavior at the head of this thread. What I think the commenters on that side of the discussion are missing or maybe ignoring, is that much of the value a SaaS offering provides is in not having to (pay someone to) administer that stuff yourself - and that, considered separately from the question of how providers behave, "all you have to think about is 'pay us and use our stuff'" is in the general case a strong value proposition.

silverquiet · 2024-05-25T02:49:31 1716605371

Yes, I was thinking about the software world really. I work for a small SaaS vendor so we have the skills to develop and host our offering. We are indeed toying with AI (because who isn't?) to see about adding features, but one of our requirements would be that it is self-hosted as well because we work in medical IT. I believe that's one place where strong privacy laws and at least the idea that patients own their data (doctors don't seem to be super onboard on that) can put some guardrails on what companies can do.

poisonborz · 2024-05-24T19:10:10 1716577810

Yes but it's just too hard for even experienced people. Even harder to have good reliability, backup. I'm etching my rather average needs setup for a decade and there are still issues and I couldn't yet fully decouple. Life is short man.

There are new tools, apps and solutions every year (easy vm handling, kinda easy vpn, projects like Wireguard, Immich) but overall there are huge things missing to make selfhosting a thing for common people.

oidar · 2024-05-24T19:37:38 1716579458

It's pretty easy right now. Synology makes things so, so simple and Yunohost is closing in right behind. If you can manage a large spreadsheet, I don't see why you wouldn't use one of those systems.

simonw · 2024-05-24T18:09:06 1716574146

Since this seems to be written partly in response to (and honestly, to take advantage of) the recent Slack AI training panic, I took a look to see how Slack have updated their materials in response to that panic.

These documents are new in the last few days:

https://slack.com/blog/news/how-we-built-slack-ai-to-be-secu...

https://slack.com/intl/en-gb/blog/news/how-slack-protects-yo...

I think these updates are really good - Slack's previous messaging around this (especially the way they unclearly conflated older machine learning models with new policies for generative AI) was confusing and it wasn't surprising it caused a widespread panic.

It's now very clear what Slack were trying to communicate: they have older ML models for features like channel recommendations which work how you would expect such models to work. They have a separate "Slack AI" addon uou can buy that adds RAG features powered by a foundation model that is never further trained on user data.

I expect nobody will care. Once someone has decided that a company might "train AI" on private data you've already lost that person's trust. It's not clear to me if any company has figured out how they can overcome one of these AI training panics at this point.

I wrote a bit about this back in December when it happened to Dropbox - there is an AI trust crisis at the moment: https://simonwillison.net/2023/Dec/14/ai-trust-crisis/

Root_Denied · 2024-05-24T18:29:52 1716575392

>I expect nobody will care. Once someone has decided that a company might "train AI" on private data you've already lost that person's trust. It's not clear to me if any company has figured out how they can overcome one of these AI training panics at this point.

I think it goes beyond a single company, or rather a single incidence of this panic. You're looking at each time this happens as an independent coin flip instead of a series of dominoes that trigger a reaction in multiple directions.

What I mean by that is there's a counterculture sentiment building based off the idea that people have seen this same pattern enough times at this point that they're distrustful of large scale systems by default. It's happening with government institutions, politics, economics, and individual industries like gaming and streaming.

To that end the "panic" is not just a reaction to Slack's (perceived) actions, but an expectation that Slack will be yet another domino in that line of companies that have done the same. It's also difficult to prove a negative (that Slack isn't using private data for training purposes even if they say they're not) so the messaging is up against a very solid wall.

The result here is that public announcements and messaging related to data are under heavy scrutiny, and the media is incentivized to try and make their reporting go viral (ironically for the ad revenue) at the expense of actual journalistic reporting.

I'm not sure what the solution to this problem is, or if there even is one, but promoting self hosting seems like an indicator that the default assumption is that data collected will be abused in some way. Honestly based on the last couple of years it's not an unreasonable assumption either.

tabbott · 2024-05-24T19:11:49 1716577909

Yeah I think Slack's updated stated internal policies are about as reasonable as one can hope for from a tech giant, if one can trust them to stand by those policies. Your article was on my mind when writing this, I guess I should have linked it.

The crux of the matter is whether you can trust a big tech company to do what they claim they will. They all think AI is worth infinite dollars. In that world, without some very clear, painful, straightforward, contractual penalty ... well we've seen that the tech giant plan is that rules are meant to constrain your competitors' behavior, not yours.

If they wrote "If any of your data is discovered to have been in an AI training model, Slack owes you 10x your lifetime payments to Slack, and any involved whistleblowers get 1% of the total paid" in their terms of service, which means if Slack screws this up, the company is immediately bankrupt, that might prove effective. But a promise in a "privacy principles" policy that doesn't appear to actually be incorporated into the core ToS does not have a lot of teeth.

simonw · 2024-05-24T19:43:17 1716579797

This does seem to be one of the key challenges here: publishing a "principles" document doesn't mean much if you reserve the right to change those principles in the future!

I think you're right: the most convincing version of this would be actual legalese.

I wouldn't be surprised if Slack have this in the contracts they sign with their larger customers, but I don't think those are publicly available.

sneak · 2024-05-24T17:57:11 1716573431

Another offender: codegpt.io ToS grants them an irrevocable perpetual sublicenseable license to all code they see from you. It’s insane what rights companies claim to your data.

itronitron · 2024-05-24T17:44:29 1716572669

>> “To develop AI/ML models, our systems analyze Customer Data (e.g. messages, content, and files) submitted to Slack.” — Slack’s privacy principles,

simonw · 2024-05-24T18:16:52 1716574612

As far as I can tell that's the copy they've had in place since 2016 - a year before even the original Transformers paper that kicked off today's LLMs - to cover their own tiny old-school ML models for things like channel recommendations.

jmclnx · 2024-05-24T17:44:04 1716572644

2FA pushed me out of github, but M/S Copilot in github created the road out.

Now seems people's chats and posts are being used to train AI. I wonder how long before Cell Phone Providers start using Text Messages to train AI (or sell to AI people).

alright2565 · 2024-05-24T17:59:54 1716573594

> 2FA pushed me out of github

I've seen several folks have this opinion... and it just doesn't make any sense?

SMS for 2FA is problematic for many reasons, but Github offers TOTP & U2F/passkey 2FA, which have zero privacy implications.

ProllyInfamous · 2024-05-24T19:59:53 1716580793

2FA is why I can no longer utilize online banking (which is frustrating).

bluish29 · 2024-05-24T17:52:47 1716573167

> 2FA pushed me out of github

I don't understand why would something like that be a reason to avoid github (or any service) in itself?

In this era, 2FA is important security measurement and it is not like they enforce only SMS as the only 2FA form.

eesmith · 2024-05-24T19:54:04 1716580444

To start with, I don't have two devices, only a laptop. (And a backup laptop and desktop at home, but I typically don't work there.) Correct, I have no smart phone.

Second, it's not an important security measurement for me. I didn't go into FOSS to be part of someone's supply chain. I do it as a way to share my knowledge of how one might solve a problem. If you want use my code, then inspect it to make sure it does what you want, or pay me for commercial support. Neither require 2FA.

Might someone take over the account? Sure, I suppose. But I'm not into "community building" or GitHub's gamification, and my primary repos are all local, so if that happens and GitHub's support didn't help, I could start a new account. Again, don't depend on me for your supply chain without a commercial support agreement.

When Microsoft switched GitHub to require 2FA I concluded it was because they wanted to assure their corporate and government clients that it was "safe" for them. Those profits subsidize Microsoft's free hosting plans, so my presence there was helping contribute to Microsoft's already excessive market power.

Third, the change was driven from on high, with no chance for me to decide what was appropriate for my projects. I concluded Microsoft was so powerful they could make such paternalistic changes because they knew network effect was on their side that they could have little concern about the small number of people leaving or getting upset.

Fourth, my FOSS projects on GitHub were labors of love that were a net negative on my income. I was not going to spend any money on new hardware or waste my time figuring out how to get things working under a new system when I was already hosting most of my work on Sourcehut, which is much more aligned to my ethical and moral views.

I still don't know how many security keys I'm supposed to have (how often should I expect to lose one? should I store the backups off-site at a friend's place?), or how often am I supposed to test they work? And then I hear about issues about lock-in and how attestation requirements might prevent FOSS solutions ad prevent people from backing up one's own security keys, and issues with resident vs. non-resident keys, and being able to register multiple keys. It's all learnable, but I simply don't care enough.

And I don't see why I should care about all this when the paying customers of my software have all been fine with only a tar.gz, license agreement, and support contract.

marcosdumay · 2024-05-24T17:56:12 1716573372

Modern 2FA is "something you know" and "something you subscribe for".

cuu508 · 2024-05-24T18:03:08 1716573788

What do you mean? GitHub supports TOTP, Webauthn on free accounts.

cuu508 · 2024-05-24T21:05:46 1716584746

To answer myself, from reading the other comments, it sounds like GitHub started to require 2FA at some point, and some people refused to set it up. The problem was not some inadequacy of GitHub's 2FA implementation, but the fact it is mandatory.

(I had 2FA set up a long time ago, so didn't notice the policy change)

eesmith · 2024-05-24T17:49:21 1716572961

2FA also pushed me out, and Copilot locked the door ... to be fair, unless a client comes with bucket full of money with a key on top.

giancarlostoro · 2024-05-24T17:46:26 1716572786

Given how awful some text messages I get from relatives read, I really hope not. The worst types of typos. The only thing I want them to train properly is voice to text. It cannot for the life of itself ever get anything right. I have to scream at Siri a dozen times.

matchagaucho · 2024-05-24T17:54:44 1716573284

To conflate Slack's T&C faux pas with "self host your own LLM" seems like a stretch.

IT Sec and Compliance must read the T&Cs and make better vendor selection choices.

sneak · 2024-05-24T17:57:59 1716573479

The recommendation is “self host your own collaboration tools”, not “self host your own LLM”.

udev4096 · 2024-05-24T17:59:00 1716573540

This will continue to occur and may come as a "shock" to some companies that will ignorantly persist in using proprietary services unless a significant change in data collection from the service itself occurs, which should not be the primary motivation to switch to a self-hosted version in the first place

airpoint · 2024-05-24T20:04:39 1716581079

Upgrading the UI from year 2000 and improving the UX would incite me to consider Zulip as a potential alternative.

Bashing a competitor in a blog post does not.

WhackyIdeas · 2024-05-24T18:10:59 1716574259

Considering Microsoft are bringing in a ‘feature’ to record your desktop, I wouldn’t be surprised that an additional ‘feature update’ further down the line will simply take all those chats with your self-hosted AI models to train AI models.

So in my opinion, it just doesn’t matter if you are using self-hosted AI, the weakest link in your chain for keeping your data private is the very OS’s that you’ll be interacting with said self-hosted AI.

And with all the manufactured fear mongering going on around AI, that data will -already- be deliciously irresistible for prism-participating, lovable, trustable companies like Microsoft.

Sorry to burst some pretty bubbles for the lovely naive people.

rpcope1 · 2024-05-24T18:16:59 1716574619

It's never felt better to just be running Debian on everything.

WhackyIdeas · 2024-05-24T18:44:20 1716576260

OpenBSD and Llamafile go hand in hand.

https://github.com/Mozilla-Ocho/llamafile

leobg · 2024-05-24T17:48:36 1716572916

Looks like content marketing to me. Both in purpose and motivation.

codegeek · 2024-05-24T18:05:32 1716573932

"We don’t train LLMs on Zulip Cloud customer data, and we have no plans to do so. Should we decide that training our own LLMs is necessary for Zulip to succeed, we promise to do so in a responsible manner"

:). What a clever way to say that even though we don't do it today, we cannot guarantee that we will never do it on our cloud service. At least they are honest I guess.

tabbott · 2024-05-24T19:04:44 1716577484

Zulip project leader here.

I think you're missing a big part of the point of the post: Which is that if you self-hosting, nobody can train models on your data, if you're going to use a cloud service, you should use one where you can move data to self-hosting, and where you can trust the vendor.

We do basically guarantee that you won't have your organization's data included in AI training in Zulip Cloud without consent. But yes, we're not ruling out the possibility of some sort of opt-in feature that might be useful in Zulip Cloud.

I'm humble enough to not pretend I know what will be possible/expected in terms of AI technology in 5-10 years, but one could easily imagine some sort of tool trained on web-public channel data in open Zulip communities being a thing that could make sense if done with appropriate consent. If such a thing were desired, I don't think it most of the concerns related to the slack controversy would apply.

codegeek · 2024-05-24T19:14:44 1716578084

I respect the self hosting part of course. All I am saying is that the premise of the post is that you guys take privacy seriously but you still left that door opened that someday you may use LLMs.

itronitron · 2024-05-24T19:56:05 1716580565

>> we're not ruling out the possibility of some sort of opt-in feature that might be useful in Zulip Cloud.

Zulip's weasel wording indicates they are nerfing a great (maybe their best) opportunity to stand out from the herd.

How about this for a mind-blowing concept (/s) ... If a web-public channel wants to add some sort of useful feature based on a technology trained on their data then let the owners/administrators of that channel flip that switch on. Zulip should have no involvement in that decision, period.

simonw · 2024-05-24T18:18:32 1716574712

I'm amused to note that those are effectively the exact same policies that Slack have in place!