I really don’t want my cloud storage provider to check my files for copyright violations. It’s not legally required, and it’s something I think is anti-user.
Banks don’t have to check safe deposit boxes for stolen art.
Self-storage companies don’t have to check units for stolen goods.
Of course, I’m against illegal stuff, but I don’t want to waste a single second defending myself from false positives in these situations. Google has no way of knowing whether I own the IP: I could have paid for a license for the material on my drive, among many other legitimate cases.
Is it because you might share the files with someone that they have to consider anything put on Google drive as being redistributed under copyright law, and thus subject to copyright restrictions?
Or is the very act of putting something in cloud storage considered redistribution under copyright law, even if the file is never shared and you are the only user?
A while ago, I backed up a bunch of my Mom's files from a failing computer of hers onto Google Drive. I didn't think anything of it at the time. If there are some copyrighted materials on there, is Google going to suddenly terminate my account after a retroactive scan?
I think very hard about copyright and ensure that I uphold copyright in all my public works — for example, there are licensing details at the end of all my slide presentations for all images. Having to apply such a level of care to every action I perform on Google services is bonkers.
Google does automated scans; they don't care whether something is properly attributed, falls under fair use, or is something you bought a license for.
They also aren't really known for fixing false bans for individuals. Sometimes they do, other times you're screwed.
They also might lock you out of all your google services: email, storage, domains, apps you bought, videos you bought on YT, etc. Which, tbh, is the most bonkers part and should not be legal.
Pretty sure the problem is that some users have Google Drives with GBs/TBs of pirated movies, etc., that are shared massively.
>Is it because you might share the files with someone that they have to consider anything put on Google drive as being redistributed under copyright law, and thus subject to copyright restrictions?
This seems like a fundamental design flaw for any cloud storage service which enables sharing. It would be a problem not only for Google, but for Dropbox, etc.
There needs to be a sharp distinction between files for your own use and files that are shared with the world — for example, a global setting on whole volumes that disables sharing entirely and thus averts the need for copyright scanning.
If by making files easy to share at any moment, the service creates a need to perform continuous copyright scanning of all drive content and then to take punitive action when it is detected, then the service is really nothing like a private hard drive. The potential for catastrophic loss, not only of the drive contents but of everything you access through Google services, is much more terrifying than the possibility of corrupting a local drive and much harder to plan for.
>Dropbox has adopted a policy of terminating the accounts of users who repeatedly infringe copyright or whose accounts are subject to multiple infringement allegations. If you repeatedly share files that infringe others’ copyrights, your account will be terminated
> Even simpler: do not upload copyrighted materials to Dropbox, Google Drive, etc.
Everything is copyrighted. The comment you just wrote is automatically copyrighted. "Do not upload copyrighted materials" means you can only upload things made over a century ago (which either were made before copyright existed, or were copyrighted but their copyright expired). Want to upload your vacation photos? Too bad, they were made this year, so they are copyrighted, and will be copyrighted for several decades after you're already dead.
No, you have licensed your content to Y Combinator. It’s in the TOS:
> By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed.
> Even simpler: do not upload copyrighted materials to Dropbox, Google Drive, etc.
This is more difficult than you are making it sound. Considering the way most people use their computers, the only way to avoid copyright infringement entirely in your use of cloud storage is not to use cloud storage the way you use your local drive but instead to assess every last file that you upload as if you were publishing it on a public website.
Sure, you can avoid uploading pirated movies (perhaps by never downloading any of them in the first place). But in fact, original works are typically copyrighted by default as soon as they are created, even if the copyright is not formally registered — and we depend on a patchwork quilt of implicit and explicit licenses for viewing and use of files on the internet. Have you ever saved a quote from somewhere? That was probably copyrighted; your use is probably legitimate, but a naive algorithm might not think so.
Honestly, the situation is far worse and more nuanced than that. You cannot determine whether content is infringing from the content alone. Copyright depends on context.
Situation 1: You buy a PDF ebook. You have legal rights to personal use of this file, but you upload it to your cloud storage and it gets flagged. The provider cannot determine whether the upload was meant to facilitate piracy, whether it's a pirated copy, or whether you hold the rights to it.
Situation 2: You hire a wedding photographer, and they supply you with photos that you do not own the right to distribute because they retain copyright. This is the same as the above situation, but personalised. Would you like your cloud provider to delete any file, including your wedding photo backups, because it matches a hash in a database?
Situation 3: Fair use. Much has been written on the subject, but this is where copyright falls over in the digital age. From a flag-files-by-content perspective, fair use is completely indistinguishable from piracy.
Fooling the algorithm is not a robust solution. Infringement-detection algos get better hit rates all the time (fewer false negatives, although often at the price of more false positives) — consider all the work that has gone into detecting musical duplicates even when resampled, pitch/time-shifted, rerecorded, etc.
The robust solution is to treat files which are redistributed and therefore trigger copyright provisions completely differently from files which are not redistributed.
Furthermore, for the purposes of punitive action, sharing a copyrighted file with a limited whitelist of other users might reasonably be treated as less of a screwup than making a file accessible to the entire internet. And therefore it should not be frictionless to share a file with the internet.
Not all files are copyrighted. For example, files which are not creative works -- including those which are mechanically created -- are not subject to copyright.
Yup, it's a common misconception that everything is copyrighted. Copyright requires creativity. An empty text file/image is not copyrightable, the pile of #include directives at the top of a C file is not copyrightable, etc. You need to create something that isn't just mechanical, regardless of whether the computer does it or you do it manually. At the point where creative decisions start to influence the process, that's where copyright comes into play.
> the pile of #include directives at the top of a C file is not copyrightable
That used to be the thinking, but Google v. Oracle ended with de minimis defenses struck down and imports still copyrightable; Google was merely afforded a fair use defense for the copying.
I thought Google v. Oracle was about API structure, not imports. While I obviously don't think API structure should be copyrightable, we clearly can't extrapolate from that to a list of imports. The former is something chosen by an environment designer and needed for compatibility; the latter is just boilerplate every user of a given environment needs to write.
There is certainly creativity in API design, it's just that the right to interoperability and the utilitarian aspect should trump any copyright claim on the API itself. But there is no creativity in writing imports; you aren't making any material decisions, you're just doing something the compiler requires you to do.
To be copyrightable, you still need verbatim components that are copied (or 'mechanically', meaning algorithmically, transformed from verbatim components). Abstract concepts like API design still can't have copyright protection directly per se; the concept acts as a proxy for the copyright over the "declaring code" of import and export definitions. That's why I brought up how de minimis defenses were also struck down; the next direction people go is saying 'well, it's just one line to import the library, surely that's not enough'. The fair use defense afforded to Google for a very abstract interpretation of interoperability is sort of the last bastion we're left with at the moment.
But as someone who's a big fan of your work, I'd implore you to not trust myself or your knowledge on this and hit up a lawyer. You're close enough to the edge of legality with a lot of your work that I'd hate to see you stifled by a minor misunderstanding here that could have been avoided. Google v. Oracle ended better than it could have, but AFAICT still heavily complicated RE work and independent implementations. It made a lot of this murkier, and being at least internally consistent with a legal theory here could make a bad situation at least a little better by leaving you with more options.
> But as someone who's a big fan of your work, I'd implore you to not trust myself or your knowledge on this and hit up a lawyer. You're close enough to the edge of legality with a lot of your work that I'd hate to see you stifled by a minor misunderstanding here that could have been avoided.
I do not retain a lawyer personally, but I inform myself of legal opinions around this field. It's why I felt comfortable enough to write this:
Ultimately though, once you stay clear of obviously problematic actions, the question of whether you're going to get in trouble boils down to whether the company you're up against is evil, for better or for worse. Given that Apple isn't going around suing jailbreakers and Hackintoshers, I'm not too worried that they'll go after us as long as we don't do anything stupid.
Conversely, I got frivolously sued by Sony for talking about a security vulnerability in the PS3... and yeah, I had to get a lawyer for that one.
In the end, once you get yourself deep enough into legal analysis around these subjects, you come to the conclusion that everyone violates copyright in little ways, all the time, and the world would grind to a halt if we stopped. The system is broken and relies on the goodwill of the people participating to not completely collapse. For example, I've previously mentioned how copying most example code you find online, e.g. in places like Stack Overflow, is a copyright violation unless you adhere strictly to the license (did you know SO content is licensed under CC-BY-SA?). Posting third party code snippets to most services, e.g. Twitter, is a copyright violation due to incompatibility between the license and the ToS requirements of the site. And so on.
As far as I remember the SCOTUS verdict was "we don't want to say if it's copyrightable or not, but if it were copyrightable it would fall under fair use".
> Even simpler: do not upload copyrighted materials to Dropbox, Google Drive, etc.
As the article illustrates, it’s not actually that simple. Cloud services that terminate accounts (and probably instantly delete everything, to comply with GDPR, CCPA, etc.) for perceived copyright infringements will always and necessarily suffer from a false positive rate.
We’ll likely never truly learn what this false positive rate is, but that it will always exist is reason enough to give pause to the thought that services should “just terminate their account” if they think it’s infringing on intellectual property laws.
The only good answer here is an unqualified “Use a local backup”. The terms of use for non-business cloud storage absolve the providers of all responsibility for data loss, even when they have incorrectly taken punitive action against you.
If copyright holders had their way, smart TVs would definitely block playing pirated media. To add to your list, why should HDMI care about encryption and stopping piracy?
Google is not liable for this use of Google Drive in any way.
Under 17 U.S.C. § 512 (also known as the Safe Harbor provision), Google is not liable for what users of the service upload and share as long as Google complies with take-down requests from rights holders. This behavior from Google goes way beyond what is legally required.
Under the laws, Google could also scan the content for "potentially libelous" material but this also would not be legally required. Google has no legal responsibility to scan your content for possibly infringing material.
I would think that all these "copyright" scans Google performs are not performed against a list of known-infringing files that Google researchers compiled themselves by monitoring pirate websites. Instead the known-infringing list would be compiled from previous takedown requests. And one of these takedown requests either explicitly or implicitly (e.g. by listing a "folder") contained a .DS_Store file that looked like many other .DS_Store files because the user had not modified the folder (display) attributes on their mac, which was then added to the known-infringing list, and which then created this mess.
But it's valid to question whether Google has to scan all files for known-infringing files in the first place. That's really where it gets tricky, legally. On the surface, they absolutely do NOT have to perform such scans under the DMCA.
But then there are provisions in 17 US § 512 (aka the DMCA law) that state for example:
"A service provider shall not be liable [...], if the service provider
(i) does not have actual knowledge that the material or an activity using the material on the system or network is infringing;
(ii) in the absence of such actual knowledge, is not aware of facts or circumstances from which infringing activity is apparent; or
(iii) upon obtaining such knowledge or awareness, acts expeditiously to remove, or disable access to, the material;"
This is very vague. It wouldn't be hard to imagine that some lawyers could show up claiming that because Google received a valid takedown notification for a specific file known to be a "pirate" release of some movie, that Google should have known or at least "been aware of facts or circumstances" that all copies of the same file in whatever user accounts are infringing (which might not be the case thanks to fair use, but lawyers and most juries would not care). If they could furthermore demonstrate that Google does already have knowledge required to locate each and every copy of a file in Google Drive accounts easily (e.g. find out through discovery that Google Drive "deduplicates" storage), then it would be game over with most juries and Google's safe harbor in the case gets denied and they are found liable.
And that's only the US (DMCA) aspect of it. The German Bundesgerichtshof (Federal Court of Justice, highest court of ordinary justice) for example has found in the past[0] that service providers can be liable if they have been previously informed about copyright infringement and did not take "reasonable" steps to prevent further infringement, and that these "reasonable" steps may specifically include checking new uploads and existing files against a list of known-infringing files (or hashes thereof).
Yeah, it sucks that Google and other service providers scan files that way, even if these files are never shared, and it sucks even more when somebody makes a mistake and puts a benign file on the known-infringing list (which is something Google should then correct, apologize for, and reset any account flag/reinstate any banned accounts caught in the crossfire due to Google's mistake), but I can also appreciate that lawmakers and courts around the world have put Google into a situation where Google de facto (if not de jure) has to perform such scans to avoid liability.
[0] A 2012 case in which Atari sued the then-filehoster RapidShare over "Alone in the Dark".
Google built the product without fully considering that problem, and now regular users are paying the price.
In any case, it shouldn't be difficult for the product to make a distinction between files openly available and files being used privately within the drive on accounts that should look very legitimate to Google. The copyright filter would make a bit more sense on openly accessible files. Even then, why is Google going so far beyond what's legally required of them?
Assessing copyright implications is not "easy". It's a lot of difficult work that involves specialized expertise, judgment calls, and risk assessment. There are complex areas and shades of grey: derivative works, fair use, copyrighted but licensed materials, etc.
The main thing you are trying to avoid is causing harm at a level where an infringement claim is justified. There are a lot of uses which might look like infringement to an algorithm but which are completely legitimate.
> Check the files for copyright if you create a sharing link for them.
I just ran through everything on my Google Drive. Thank goodness I don't use it for much, though I do pay for extra storage. I don't have anything shared with the world, and only have a few files shared with a handful of family members.
But will this protect me? What is Google's policy with regards to scanning — do they scan only shared files, or do they scan the full drive because content might become shared?
I honestly wonder if that is what is happening. The article's author wasn't able to replicate it, but perhaps making a public link to the file containing .DS_Store would do it.
We already know that Google is scanning all of the files in your account looking for kiddie porn, and has been for the past decade.
>a man [was] arrested on child pornography charges, after Google tipped off authorities about illegal images found in the Houston suspect's Gmail account
I license a photo for my website, upload to google cloud and create a sharing link to share with my web designer and the contractor responsible for the website. Did I just infringe copyright?
> Why does Google Drive perform such scans?
A lot of warez and piracy websites used Drive to share that content, but this kind of filter can be easily avoided by saving the files as an encrypted zip, rar, or 7z file with a password.
In my opinion, applying this kind of filter to all the files you upload is pretty useless. The download traffic of a file or an account says more about whether files are being distributed publicly than any other metric.
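For example, something as simple as this defeats a pure content-hash filter (the password and filenames here are just placeholders):

    7z a -pSOME_PASSPHRASE stuff.7z some_video.mkv
    # the encrypted archive's bytes, and therefore its hash, match nothing in any blocklist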
>Is it because you might share the files with someone that they have to consider anything put on Google drive as being redistributed under copyright law, and thus subject to copyright restrictions?
That is a civil matter that falls upon the copyright holder, not Google.
Google is no more guilty than the makers of VCRs were when you recorded something without permission.
As the poster above says, Google has no idea if I own the rights / have a license to the articles in question.
I used to work in the music industry and had licenses to rip music and distribute it online from all the major labels. I don't want my cloud storage disappearing along with my Google account just because Google mistakenly thinks I'm a pirate.
That may or may not be correct in the USA, but the world has many jurisdictions with varying laws on copyright infringement, and cloud storage providers may be liable for copyright infringement claims.
Therefore, Canadian copyright law is currently unclear on whether cloud storage providers may be shielded from liability for copyright infringement
⇒ If I were to run a cloud provider who permits file sharing, I think my legal team would strongly advise to scan files _shared_with_others_ for copyright infringement.
(In the ‘.DS_Store’ case, Google’s system seems to have some embarrassing false positives, but that’s a different issue)
Napster was designed explicitly to enable copyright infringement.
Cloud storage is not.
A more valid comparison would have to involve the Supreme Court's ruling on VCRs, which could possibly be used for copyright infringement, but had substantial uses that were perfectly legal.
>The Court's 5–4 ruling to reverse the Ninth Circuit in favor of Sony hinged on the possibility that the technology in question had significant non-infringing uses, and that the plaintiffs were unable to prove otherwise.
I don't believe this is correct. You can't launder money with a safe deposit box, so it makes no sense to have to conduct money laundering checks on one.
Standard rental agreements explicitly state that the bank does not retain a key and is unable to open the box (without destroying the lock) in the event that you lose yours.
It is laughable to put something working this badly into your "service" at all.
I realize this was probably 'forced' on some engineers at Google, but I'd still be embarrassed to say I've had a hand in this. A bit like being a supporting character in the worst movie of the year.
At what point do you threaten to quit over being forced to ship crap like automatic scanning of people's private files? I am unable to empathize here at all.
Just a guess, but it has the feel of a flawed source database of what's been copyrighted.
Something like: if one file in a collection is actually copyrighted, the entire folder/collection is assumed to be copyrighted as well, and then each file goes into a database with its hash, etc.
That would account for this issue, and the earlier one where files with just a one/two/three digit number in them triggered the copyright hammer.
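If that guess is right, the scan itself could be as dumb as a hash lookup. A rough sketch of what I imagine (the blocklist file and paths are made up):

    # blocklist.sha256: one SHA-256 per line, compiled from past takedowns
    for f in /drive/files/*; do
      h=$(sha256sum "$f" | cut -d' ' -f1)
      grep -qx "$h" blocklist.sha256 && echo "flagged: $f"
    done
    # a default .DS_Store is byte-identical on millions of Macs, so one bad
    # blocklist entry flags every copy of it everywhere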
Unfortunately, past experience at Google is still seen as a good sign in job applications. I think we're at a point where gap years should be a better sign than a stint at that company.
Google's engineering is, on average, passable. Thousands of brand-new grads who have studied to pass 5 rounds of interviews and have no experience are only going to put out barely passable software. Hell, the very reason that Go is so simple is that, by Google's own admission, their sw devs can't be trusted with anything more complex.
What you see coming out of Google that may look cool (when it's not a new chat app they'll kill in two years) has thousands of drones behind it.
Yesterday, while discussing backups, someone said end users should basically trust the cloud. This is why it is a bad idea.
The risk is data loss, irrespective of whether it is caused by hardware failure or by cloud failure connected to overzealous legal enforcement, algorithmic decision-making failure, or service deprecation.
Part of the risk equation is how big the chance is, and I would really love to see the numbers for hardware failure vs. cloud failure.
Another part is impact. Hardware loss might be recoverable via specialized services. With cloud failure you're basically on your own, and they might revoke access to your email a.k.a. digital identity certificate, or even sic the copyright cartel on you, worst case ending in a police visit taking out the other backups.
Hardware loss is easily avoidable with more hardware: create backups, store them on multiple external drives (for most users, a single modern external drive is enough for all the backups, so the additional ones are for copies), and you can then keep one at home, one at work, one at your parents' place, and rotate them every now and then to 'refresh' the backups... Most home users don't need hourly or daily backups; they just need their photo and home video collections saved, their documents scanned, etc.
With cloud providers, storing stuff in different datacenters doesn't help you if you get banned from google because someone didn't like your youtube comment. Multiple cloud providers might be using the same amazon/azure/... datacenter in the backend. Then there are different L0-L8 problems, from the political (your country gets embargoed due to some ongoing struggle) to the practical (you move to another location where the internet is slow). And now, it seems that you can lose data due to stupidity, as in the original post.
I tell all coachees: if they are on AWS, also put backups on GCP/Rsync.net/Hetzner/... and vice versa. If your data and backups are with one provider, there's a non-zero risk of losing everything because of account removal.
This is sound advice but it's such a pain in the neck to implement. Cloud providers have every incentive to make it difficult for you to duplicate your actions to a competitor, and to make it difficult for any third party who wants to provide such duplication as a convenience.
Especially if it's encrypted, e.g. restic/bupstash/duplicity+pgp/tarsnap/perkeep/borg/kopia/... Good luck finding the hash of a nipple, copyright-claimed material, or other TOS violations in that data.
I haven't had a hardware failure in so long, it's scary. I keep wondering when my next failure is going to be but my countless laptops, hard drives, NAS, and desktop systems keep chugging along without issue.
I know of a Netgear consumer NAS I installed in 2011 that is still running a business, untouched. It uses two enterprise drives in a mirror config.
And this is why I put together a NAS for my home, which then backs up to a cloud provider, encrypted - and if that provider decides to play silly shenanigans I can just as easily colo another box and back up to that as well as any other cloud provider.
ZFS on Linux is stupidly easy to setup, something like URBackup for all your clients to backup onto the NAS and you've got a nice 1-2-3 (with 3 being offsite) backup scheme.
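Roughly, the setup looks like this (pool, dataset, and host names are placeholders):

    zpool create tank mirror /dev/sda /dev/sdb   # mirrored pool on the NAS
    zfs create tank/backups                      # clients back up here (e.g. via URBackup)
    zfs snapshot tank/backups@2022-02-01         # cheap point-in-time copies
    # offsite copy: ship the snapshot to a colo box or any other ZFS target
    zfs send tank/backups@2022-02-01 | ssh backup@offsite zfs receive otherpool/backups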
For a user familiar with Linux. Even many macOS and Windows based developers would need some time and research to set something like that up, let alone a non technical person.
TBH, synology and qnap have awful track records when it comes to data integrity and cause a ton of issues even when using the software the "intended" way. SMB is a chatty protocol in general which makes it hard to troubleshoot, and because of default lack of write-through, you have a non-trivial chance of a situation where the SMB stack returns "this write was a-okay!" but in fact the data never landed on disk and only hit the cache.
My team works with a lot of clients for backups and we've had so many silent corruption cases with lousy synology and qnap boxes that we just don't support them when using SMB or NFS anymore; the built-in stack is just too unreliable. iSCSI is a bit better, but iSCSI isn't how a lot of people want their NAS to work, so it's always a tough discussion. General purpose servers with proper endpoints have always been better, but it's a hard sell to clients.
Directly-attached USB is marginally better in that you don't deal with network protocols, but a lot of companies cheap out on USB controllers and it's a challenge sometimes to convince clients that the on-board controller is the issue when single (small) file writes work without issue.
But long story short, I'd propose a small general purpose server (even Raspberry Pi!) over a QNAP/Synology any day of the week. The latter does everything simply but with excess mediocrity. From my point of view, they hit MVP across a dozen + items, but competency in none.
People relying on reliable backups building their own storage servers from scratch. Clearly this is universal sage advice that will certainly not end badly for the vast majority of people -_-.
I'm on my second Synology; the first was a cheapish 4-bay, and now I'm using a much pricier 8-bay. This is for personal use, but the time and effort saved by using this over hand-building is worth factors more than the price. I do recommend NFS over SMB for the cheaper models, as smbd is a lot more CPU-heavy and transfers became CPU-bound pretty quickly. I've never had data corruption issues in my 10ish years of use, but one subjective data point is pretty crap.
I don't imagine that there are many personal users who need/want iSCSI or other more raw block I/O protocols. As for businesses, off-the-shelf solutions in your requirements range are often available, and unless you're looking for very specific optimized workloads, there's probably a somewhat overpriced vendor solution that can meet those requirements.
You can't just buy backup. You have to think about scenarios and you have to test the restore.
As an example, I have an online account and a Synology NAS. I test my backup every time I move to a new laptop, by doing the data migration through the restore.
Synology, it turns out, had changed its backup solution, and finding the installer for the old tools needed to restore was non-trivial. They also wanted me to click on every individual file, with 100K+ files to go. The alternative was to download a zip file, except this actually timed out before it could complete. All the data was there, just not in a decently restorable format.
Apart from that, this is not an offline backup. If your house has a fire or whatever, your backup is also gone. A NAS on its own is not enough.
I was hoping to use rsync.net as the online provider, but they had no way to receive money from a SEPA IBAN. They also store data outside the EU, exposing me to a foreign legislature with markedly lower consumer protection. When looking within the EU, the storage landscape becomes fragmented quickly. There are options, but it's not easy.
With pre-made boxes, you're liable to run into issues when they graciously decide to force a security update on you. QNAP makes extremely decent boxes, but last year's behavior that broke a lot of stuff [1] has all but killed my trust in them.
Western Digital, Netgear, or Buffalo will do it for you. The cost is a trade of your time...plus the added value of a device that does one thing and does it well.
As for "cloud provider and encrypted" - how did you do that? Basically, what I want is a way to back up about 8 TB of data on my (LUKS-encrypted) home server to the cloud, given the following constraints:
- data retrieval is not needed except for recovery case
- ideally, the backup process on the home server side should not be more difficult than a cron job running "rsync -avz --delete /mnt/raid user@server:/mnt/storage"
- integrity of everything should be assured - there's a couple of filesystem-level backups made with "rsync -av" on the data store which means UID/GID, chmod, symlinks, special files (e.g. device files) and whatever Samba uses to store Time Machine xattr metadata must be kept, and there should (but not must) be a way to verify if the file content in the backup is still intact.
- there must be absolutely no way for the cloud or server hosting provider or someone gaining access to the server e.g. via an RCE in the SSH or other sync daemon to access any data (both content and metadata like file name) both in transit and at rest
- it should be somewhat affordable (e.g. Amazon Glacier is ~$33/month, Backblaze ~$40/month)
- ideally, some form of asymmetric encryption should be used, so that decrypting the data requires possession of one of three off-site YubiKeys, each holding a distinct on-key-generated RSA4096 key and the corresponding password.
The easiest way to accomplish the first three targets would be to simply spin up a tiny AWS EC2 instance with an attached LUKS-encrypted EBS volume or rent a dedicated/colo server somewhere with the same setup and run "rsync -avAHX --delete" on the home server, but that's not affordable and there is a risk of the provider/a hacker accessing/manipulating the cloud server while it is running.
Duplicity plus any "dumb storage" seems to fulfill almost all constraints, but it seems to require either symmetric encryption or access to the private key on the home server - otherwise, how would it be able to decrypt and read the existing backup to find out what has already been backed up?
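To partially answer my own question: duplicity can encrypt to a GPG public key (so the private key never needs to be on the home server), and as far as I understand it keeps a local, unencrypted cache of manifests/signatures under ~/.cache/duplicity, which is how incrementals work without decrypting the remote data. Something like the following, where the key ID and target are placeholders:

    duplicity --encrypt-key 0xDEADBEEF /mnt/raid rsync://user@server//mnt/storage
    # restore needs the private key available to gpg, e.g. kept on a YubiKey:
    duplicity restore rsync://user@server//mnt/storage /mnt/restore

The catch is that if the local cache is lost, rebuilding it requires the private key, so this only gets you part of the way to the YubiKey requirement.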
> ideally, there should be some form of asymmetric encryption be used so that decrypting the data requires the possession of one of three off-site YubiKeys with each having a distinct on-key-generated RSA4096 key and the corresponding password
This last one seems to drastically change the design into a place that hasn't really been developed. It would be quite neat, in addition to adding other properties, like ensuring that a compromise of the backed-up server couldn't be used to destroy a backup. But assuming your goal is to pragmatically get it done, I would forgo this and accept that your backed-up server is going to have full control over the backups.
FWIW I view online/cloud backups as having the strength of being synced often, and therefore up to date in case physical damage happens to the server. For longer-term storage I use offsite drives in a safe deposit box, which takes care of scenarios like the infrastructure being compromised and backups deliberately deleted along with the online ones. For long-term data integrity I use ZFS, borg check, and ad-hoc sha512sums (for things that don't change much, like my music collection).
In general keep in mind that your 8TB to backup likely has a power law distribution where your really important filesets are much smaller and can therefore be inexpensively backed up with more copies.
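For the ad-hoc checksum part, it's nothing fancier than this (paths are placeholders):

    find /tank/music -type f -print0 | xargs -0 sha512sum > music.sha512
    sha512sum -c --quiet music.sha512   # prints only files that fail verification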
The best pricing I've seen for block devices is Kimsufi (10 USD / 2TB-mo) and buyvm.net (5 USD / 1TB-mo) although I haven't used the latter.
> In general keep in mind that your 8TB to backup likely has a power law distribution where your really important filesets are much smaller and can therefore be inexpensively backed up with more copies.
In my personal experience a simple ignore list can reduce the size by an order of magnitude. Things that don't make sense to back up: thumbnails, cache files, development dependencies. And smaller is more robust; not only can it be inexpensively backed up with more copies, it can also be synced more frequently with a lower risk of a partial sync, and it is quicker to download and restore.
> As for "cloud provider and encrypted" - how did you do that? Basically, what I want is a way to back up about 8 TB of data on my (LUKS-encrypted) home server to the cloud
I personally just throw it all into a .7z file and then upload it to a cloud service (Mega, specifically). I chose 7z as it has the capability to encrypt file names, which would otherwise leak a lot of metadata to the cloud storage provider.
The main downside of this is that Mega deletes your stuff after 3 months of idle time, so I need to periodically check in on it to make it think I'm still using it.
I am really not sure how 8TB of data would fare with this setup though. For one, you'd definitely want to set up some sort of streaming thing instead of buffering the .7z file on the server and then uploading it.
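Concretely, the invocation is something like the following; -mhe=on is the part that encrypts the headers so file names don't leak, and splitting into volumes at least makes a huge upload resumable (names and sizes are placeholders):

    7z a -p -mhe=on -v50g backup-2022-02.7z /mnt/raid
    # -p       prompt for a passphrase
    # -mhe=on  encrypt archive headers (file names, sizes, timestamps)
    # -v50g    split into 50 GB volumes so a failed upload can resume per volume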
Re-uploading 8 TB each month is impossible, here in Germany all you can get on a residential VDSL line is ~10-20 MBit/s, which means you're stuck at 24/7 uploading for way over a month.
Woah, each month? I was assuming you would be syncing deltas, i.e. only upload what's actually changed. An ideal system would be to keep a list of files and their respective hashes on each 7z archive task, then do a final checksum of the 7z file so you can ensure its integrity when you download it from the cloud.
A quick look at the manpage[0] shows there's an update option, although wouldn't you have to keep the whole archive locally?
Germany: I get 100mbit upload consistently and with 1gbit fiber would get 200mbit+. If in dire need with a second provider that has the fiber in my home already I could get 400mbit+ upload.
Look into restic or Borg backup. Restic with rclone can directly backup to a lot of cloud providers, with Borg you may have to remount the remote share.
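For example, with an rclone remote already configured ("cloud" is a placeholder remote name):

    restic -r rclone:cloud:backups init             # creates an encrypted repository
    restic -r rclone:cloud:backups backup /mnt/raid
    restic -r rclone:cloud:backups check            # verify repository integrity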
> there must be absolutely no way for the cloud or server hosting provider or someone gaining access to the server e.g. via an RCE in the SSH or other sync daemon to access any data (both content and metadata like file name) both in transit and at rest
An option which seems to match most if not all of your requirements is borgbackup. As for the private key, borgbackup has several modes, the default being one where the private key is stored on the server but encrypted with a passphrase; if you don't like it, you could use the mode where the private key encrypted with the passphrase is kept locally and then backup that (very small) key file separately.
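A minimal keyfile-mode sketch (the repo URL is a placeholder); the key stays on your machine and you stash the exported copy somewhere safe:

    borg init --encryption=keyfile ssh://user@server/./backups
    borg key export ssh://user@server/./backups ~/borg-key-backup.txt   # keep this off-site
    borg create --stats ssh://user@server/./backups::'{hostname}-{now}' /mnt/raid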
Has Google publicly explained why they previously flagged files that had just the number '1' in them for copyright infringement [1]? I know they certainly didn't apologize for it, but did they at least explain it? At least internally? Did other Googlers ever hear an explanation?
This complete lack of accountability, of acting with total impunity, is what really rankles me about big tech.
Given that there's no support you can call, this leaves no recourse if you accidentally sync the wrong .DS_Store. In the reddit post screenshot, there's no button to open a dispute of any kind. I wonder if this also marks your Google account with a "copyright violator" tag and further increases the chance of an automatic ban with the anti spam AI account pruning system that they run.
People who have 10 years worth of data and life linked to their Google Account should be pretty scared.
Google is causing a huge number of headaches of late: the gsuite withdrawal, the forced retiring of perfectly usable apps, the lack of any customer redress if you get caught in the crossfire, the banning of political youtube channels in certain countries, ...
I am actively moving email and data on to my own servers before even more things go down catastrophically.
I agree. I'm in the process of moving all my services to my own domain so if the worst happens I don't lose access to my primary email address. Anything I upload to GDrive is backup only. I would love if Rumble were to take off because YouTube's censorship over the last couple of years is becoming concerning.
I started the slow process of de-googling my life a few years ago. Google is so bad these days, it's not that hard to replace all the google things. Except one – YouTube.
People on the drive team should feel great shame for this. Catching a Google ban can be very very damaging to someone's life. Scans of files like this don't need to happen.
Is there a good word for when a solution is bad (in a 'defective by design' sense) but offloads all the "badness" to the competitor, ironically making it look good to casual users?
Like the Mac introducing files that are hidden on a Mac but litter the place on other systems.
Or like Google not caring about false positives as long as its customers (i.e. record labels) are happy.
In any case, I don't lose much sleep when one bad design hits another. My advice is the same as in the general case: avoid such systems.
If you're talking about .DS_Store, understand that it's not just by comparison that it's better; .DS_Store does a lot for MacOS users and while they have no idea that .DS_Store does this, your average Mac user probably would be upset to lose things like the Finder column defaults per directory, the window position, etc.
While I understand it's an annoyance to deal with an OS specific feature, at the same time, .DS_Store is so predictable that I just can't see how it's a challenge to deal with. Everything you need to know as a non-Mac user is basically "you don't need to consider this except by special request", and at worst, the end-result of not considering .DS_Store is your user(s) have to reset a few window settings.
I truly don't get the vitriol expressed towards .DS_Store; it's a hidden system file like any other and I struggle to understand the use cases where having .DS_Store in a directory is an issue. I have read countless articles complaining about it, but I've not heard a reason beyond "it junks up file systems", which can be said about _any_ system file.
.DS_Store is a pretty neat solution for storing folder-specific metadata. Whenever a user changes settings for how the contents of a folder are displayed, the file gets created. And since .DS_Store is a local file, it is portable; nothing breaks when the folder is moved anywhere. Some programs use that to present a simple drag'n'drop installation picture. You don't need anything else, just a folder with correctly positioned files and a custom background image.
That's a good one, but I'm not sure if it captures the full extent of the idea; in this case, the negative externality is a 'feature' for the transacting party, whereas in general the phrase 'negative externality' simply implies some negative external consequence of the transaction was not duly considered during the transaction (at least as I understand it; am I wrong?).
What I have in mind is things like, e.g. youtube amping their ad ratio by 100% when they started offering paid subscriptions. Or Apple intentionally not fixing features that cripple using windows on their new Intel processor. Or IOS shaming android users in their messaging app instead of providing appropriate support. Or MS Word detecting files written in libreoffice as 'corrupt' and offering to fix them. Or matlab introducing syntactical changes that break octave. Or Android making you go through hoops to use software outside of the play store. etc.
In other words, if you're a company engaging in such a tactic, it's probably not that you haven't necessarily considered how a 'feature' might affect users of competing products; it's probably that you have considered it, and it's just that extra bit of friction to make them feel your own product is more streamlined and superior (for all the wrong reasons).
Why does google want to get ahead of this? Pretty much all other services just accept submitted notices and that's about it - why is google being so proactive at the cost of their users?
From what I understand (I don't have sources for this, other than memory):
Processing the DMCA notices has a cost. Nothing requires the DMCA notice to be in a machine readable format (although AI is getting better at this type of task), so it often requires a human in the loop. And they were getting sued by the copyright holders regardless of protections under DMCA. So they struck a deal with the copyright holders, that instead of them sending Google a DMCA, Google would provide an interface and automation to handle this, in return for the rights holders dropping their lawsuits.
Unless this changed very recently, notices are still being sent to google, and just like google's AI they fail horribly in that regard.
So if that's the case I feel like the cost didn't disappear and probably isn't even lower, and this is just an additional tool for making the lives of google users hell. Add the fact that google has nonexistent support, and their services become really hard to recommend to anyone.
I won’t use a drive service that is not e2ee. My files are my own business. There are multiple alternatives out there like sync.com (no, I’m not affiliated with them).
Disclaimer: I also don’t store copyrighted material.
Unless you are very careful not to, you probably do store copyrighted material, and there's a nontrivial chance that some of it is technically infringing. Copyright is much broader than most people realize, and the continued existence of the information economy largely relies on big copyright holders being reasonable and choosing their battles.
Copyright protections are really getting out of hand, and they are ridiculous:
Nothing to do with this example, but today I decided to watch a documentary on my iPhone in the Netflix app, which I rarely use on mobile. I liked a scene and took a screenshot to send to someone who might be genuinely interested in watching the documentary (which would even mean potential new customers for Netflix). The screenshot was blank. Then I googled it and found it's blocked on DRM grounds.
It takes one person to pirate and distribute a movie online to anywhere on Earth, yet these attempts block normal users from perfectly fair use.
Anyone who is going to pirate things will find a way anyway. Just like the Google example: the whole piracy detection system harms all the perfectly legal, fair use users while pirates have tons of other ways of distributing files anyway.
I wonder when this will end and copyright holders will understand that they're approaching this the wrong way.
Haha, the fact that Drive engineers couldn’t put out a fix quickly does kind of imply something about the product/engineering design cycle. Google does have the reputation in the Bay of being a retirement home for engineers who don’t care that much about being good engineers. Looks like that’s well deserved.
I had "Backup and Sync" for Google working OK-ish for accessing my Google Drive files in the past. I got an upgrade called "Google Drive" which took down Backup and Sync and then installed the new software. After that it told me that it wants to sync everything on my disk, that there's not enough space on my Google Drive to do it, and that I can't use Backup and Sync anymore.
I could pay for more storage for Google, but I don’t want it to have all my data with all the problems that I see here, and I honestly don’t know what to do now to sync with my mobile, as Google Drive was great for me in the past.
I see some people suggesting E2E solutions, but I’m not sure how great the mobile experience is for those, or how well they integrate with Google Docs, which I love to use.
I was wondering if there is a file format that can be used as a defense against bulk scanning, but isn't necessarily an encrypted file where you would need a key to decode it. What I was thinking is encrypt the file, but store the key in the file in such a way that it requires an expensive computational operation to decode the key. When reading a single file it may take an extra couple seconds to open it, which could be ok for an end user, but it would make mass scanning impractical. One way of doing this I can think of is starting off with a random string that is recorded in the file header, then running it through bcrypt with a suitable cost factor, in order to derive the actual encryption key for the rest of the file.
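As a rough illustration of the idea (not bcrypt, but openssl's PBKDF2 with an absurd iteration count as a stand-in, and the "key" stored as a plain-text seed right next to the payload instead of in a header; all filenames are placeholders):

    # anyone can open the file, but only after paying the key-derivation cost
    seed=$(openssl rand -hex 32)
    printf '%s' "$seed" > photo.jpg.seed
    openssl enc -aes-256-cbc -salt -pbkdf2 -iter 10000000 \
        -pass "pass:$seed" -in photo.jpg -out photo.jpg.slow

    # opening it repeats the same slow derivation (a few seconds per file,
    # which is fine interactively but ruinous for bulk scanning):
    openssl enc -d -aes-256-cbc -pbkdf2 -iter 10000000 \
        -pass "pass:$(cat photo.jpg.seed)" -in photo.jpg.slow -out photo.jpg.out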
A video file produced by us was marked as a copyright violation by the company I use to host large files (no, we were not filming copyrighted subjects/objects). Repeated requests to restore access to the file so our customers could download it ended with customer support repeating the same thing ad nauseam. I have better things to do than waste my energy on this, and since it was just a single file I just hosted it in a different place.
The scanning of files is one thing, but the fact that this bug hasn't been resolved within a matter of hours is the real problem. This should not be an ongoing issue days later.