Serverless File Uploads – Netlify (netlify.com)
51 points by peterdemin on Feb 13, 2017 | 46 comments



This article makes it sound like upload requests are restricted to your domain via CORS, but that is one of the pitfalls of thinking about everything through a serverless lens: while another site could not use javascript to directly upload files to your S3 bucket, a malicious user could absolutely make a backend request to receive signed tokens for uploading to your S3 bucket. Additionally, it doesn't look like the policy is locked down, so they could overwrite your existing files with malicious ones.


A practical implementation of this would probably want to do a couple things:

1. As you mention, make sure they can't overwrite files

2. Have a content type whitelist on the requestUploadURL function

3. Maybe authentication to keep track of who is submitting these requests?

Assuming you're okay with allowing someone to upload files with a given content type to your bucket, is there anything I'm missing?


For #1, I would probably disallow writing to arbitrary paths and instead generate a path prefix using a UUID and return that to the client to ensure that every upload is unique.
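
A rough sketch of what that could look like in the signing function (assuming a Lambda-style handler, the AWS SDK v2 getSignedUrl API, and a hypothetical UPLOAD_BUCKET environment variable; the whitelist and expiry are illustrative):

    // Hedged sketch: signed upload URL with a content-type whitelist and a
    // UUID key prefix so uploads can never collide with existing objects.
    const AWS = require('aws-sdk');
    const { v4: uuidv4 } = require('uuid');
    const s3 = new AWS.S3({ signatureVersion: 'v4' });

    const ALLOWED_TYPES = ['image/png', 'image/jpeg', 'application/pdf'];

    exports.handler = async (event) => {
      const { name, type } = JSON.parse(event.body);
      if (!ALLOWED_TYPES.includes(type)) {
        return { statusCode: 400, body: 'unsupported content type' };
      }
      const key = `uploads/${uuidv4()}/${name}`;   // unique prefix per upload
      const uploadURL = s3.getSignedUrl('putObject', {
        Bucket: process.env.UPLOAD_BUCKET,
        Key: key,
        ContentType: type,
        Expires: 300,                              // URL valid for five minutes
      });
      return { statusCode: 200, body: JSON.stringify({ uploadURL, key }) };
    };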


You can easily abuse non-serverless solutions as well... when signing the S3 request, you could have internal logic that prevents this kind of behaviour by simply refusing to sign the request.


Sure, though I think this is somewhat specific to serverless, because in traditional applications authentication usually covers all interactions with the user, whereas with serverless you sometimes have to figure it out on a case-by-case basis. The article makes it sound like it accounts for it with CORS headers, which may mislead novices.


How is using a node.js server to generate signed requests to upload files onto S3 servers, "serverless"?


Yes.

In a billing sense, since you're not paying for a provisioned server. You only pay for the time your request is executing and your storage in S3.

In the monitoring sense, since Amazon is responsible for monitoring the servers and infrastructure that your application runs on.

In a security sense, since Amazon is responsible for installing security updates, using the appropriate SSL cipher suite and sandboxing your code so that a vulnerability doesn't turn the server into part of a botnet.

In a scaling sense, since you're not responsible for provisioning more servers should you get the hug of death from having your site posted here.

A server is more than just a computer hooked up to the internet. It's a set of responsibilities, limitations and costs that you have to stay on top of. A "serverless" application has far fewer responsibilities and limitations and a completely different cost structure. It's not right for everyone, but it is a fundamentally different way of architecting your application.


Because you don't have to provision the server yourself; you just get access to one process.

So you wind up with a lot of processes spread across servers and virtual servers everywhere, but that is colloquially called serverless. Especially if your serverless collection of servers uses javascript.


Ah right, so servers running javascript are actually serverless.


Servers running javascript are called "a dumb idea". But that's neither here nor there - The core idea of "Serverless" is that you don't have stateful physical machines that you manage, only a series of processes - which can live just about anywhere, with only clearly defined interfaces to shared state.


> a series of processes - which can live just about anywhere, with only clearly defined interfaces to shared state.

So, a server with an API. Gotcha.

Edit: downvotes? How is what he is describing any different from that, and is it anything other than a buzzword? Doesn't AWS Lambda just run your processes in an EC2 instance?


> Edit: downvotes? How is what he is describing any different from that, and is it anything other than a buzzword? Doesn't AWS Lambda just run your processes in an EC2 instance?

Yes, but you don't pay for the hardware specifically; e.g. your code doesn't care when the underlying Lambda hardware fails, because it isn't tied to any specific server. You pay the invocation costs, which are usually much lower than running a server to do the job, since you're gaining economies of scale with other tenants.


Process-level Heroku.


You're thinking about it the wrong way.

You need to "feel" the serverless nature of it. It's an inner feeling - a bit like faith.

If someone says it's serverless then it IS serverless! That is the nature of all serverless systems.

Unless of course we're talking about Amazon's serverless systems - those really are serverless - nothing there at all but strange quantum forces that create responses to HTTP requests.


It's serverless because you (your entity, company, whatever) don't have to admin any servers to make it work.


You don't for S3 uploads either...


They might call it "serverless" for that reason, just like someone might say they drove to work "carless" by carpooling with a coworker, but it's a little misleading.


When you get to work via carpooling, do you say, "I drove to work" or do you say, "I carpooled to work"?


Doesn't matter. Driving and carpooling both imply the use of a car. Uploading denotes a server at the receiving end, even if it's a dynamically-provisioned transient server uploading into some cloud data store.


By that definition the whole SaaS business is "serverless".


Yes. You can run your entire business without managing any servers using only SaaS products.


Was just thinking the same thing.


We built something similar to this for our Kickflip.io HLS/S3 live video streaming service. The original version is closed source but we rewrote a generic open source Python library called storage_provisioner [1] for a client. It's minimal and super simple.

We also made a Django REST Framework module called django_broadcast [2] that wraps storage_provisioner. Working with AWS/S3 can be a pain; hopefully these tools can help.

1. https://github.com/PerchLive/storage_provisioner

2. https://github.com/PerchLive/django-broadcast


Silly nomenclature. This isn't "serverless" at all.

Suggestion: "File uploads using 3rd party server".

As an added benefit, this title shows that the data might not be safe from the curiosity of other entities.


It's serverless in the sense that you (your entity, company, whatever) don't have to admin any servers to make it work.


As with "hackers" vs. "crackers", I suspect this little terminology war is already lost.


Hackers hack computers and networks; crackers crack software. There is a clear difference between those two words.


A clear difference that's meaningless to and ignored by most, as with the fact that "serverless" isn't really serverless. Like I said, I suspect the war's lost.


Well, like it or not, the term "Serverless" has been adopted by the industry/community and refers to an architectural approach of leveraging 3rd party services.

See https://martinfowler.com/articles/serverless.html


Didn't you get the memo? "Cloud" is so 2016. "Serverless" is the new hotness.


What's wrong with doing all of this on the front-end? I recently did just that after generating the signature and policy locally.

See this guide: https://aws.amazon.com/articles/1434


The benefit of doing uploads on the front-end is that it's instant: the file is already there! ;-)


You leak your secret key to every user who can view that page.


No you don't. You leak the AWSAccessKeyId, which is not a secret. You use a signature to authorize the file upload.


I should've been more verbose. You cannot calculate the signature client side without leaking the key. So you need a server. That step is identical to what this "serverless" implementation is doing.
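
For context, a minimal sketch of the server-side step being described: signing a browser-upload POST policy with the AWS secret key, following the Signature V2 scheme from the older AWS guide linked above (names here are illustrative):

    // The secret key appears here, which is exactly why this can't run in the browser.
    const crypto = require('crypto');

    function signUploadPolicy(policy, secretAccessKey) {
      const policyBase64 = Buffer.from(JSON.stringify(policy)).toString('base64');
      const signature = crypto
        .createHmac('sha1', secretAccessKey)   // HMAC-SHA1 per the Signature V2 browser-upload scheme
        .update(policyBase64)
        .digest('base64');
      return { policy: policyBase64, signature }; // both values go into the browser's POST form
    }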


Correct. But the signature doesn't necessarily need to be per-file upload, so I have it embedded in JS. For my use case, saving the extra network hop is worthwhile.


So I can extract it from the JS, and just upload terabytes?


Yeah, that's true. But you can limit the secret key to an IAM user whose only permission is uploading to that particular bucket. I know it can still cause damage, but nothing like disclosing your root key. If you do a cost analysis that takes back-end development into account, it doesn't seem so bad... until, of course, it does.


I really wish people would stop using the term serverless, it's not useful and it's highly misleading.


You do need a server to create a token from your access key and secret. However, this doesn't really go very far in protecting your bucket, as somebody could just grab that token and upload whatever they want.

So an additional layer of security is creating an upload bucket with a policy where all objects over 24h old are deleted. When somebody finishes uploading a file, you ping your server and move the file from the upload bucket to the real bucket.
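
A sketch of that expiration policy with the AWS SDK v2 (the bucket name is hypothetical, and S3 lifecycle rules work in whole days, so "24h" becomes one day):

    // Hedged sketch: expire anything left in the staging/upload bucket after a day.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    s3.putBucketLifecycleConfiguration({
      Bucket: 'my-upload-staging-bucket',
      LifecycleConfiguration: {
        Rules: [{
          ID: 'expire-stale-uploads',
          Status: 'Enabled',
          Filter: { Prefix: '' },        // apply to every object in the bucket
          Expiration: { Days: 1 },
        }],
      },
    }).promise();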

Another trick is putting CloudFront in front of that bucket. You can then upload to any CloudFront edge, which will then put the file in your bucket -- the reduced latency to CloudFront (vs S3) will increase your upload speed by quite a bit.


Nice work. I have several web apps running on EC2 instances where my users upload via Browser -> EC2 -> S3. This can cause high latency on some of my smaller EC2 boxes, which is annoying, or else it forces Elastic Beanstalk to spin up more instances unnecessarily because it thinks traffic is being flooded when large files are being uploaded.

I've always wondered about the best strategy to go 'serverless' with file uploads and have the user's browser essentially upload directly to S3, and this tutorial gives great insight into that - thanks.


"Each returned URL is unique and valid for a single usage, under the specified conditions."

Where and how is the url actually invalidated after it is used? (or are you relying on expiration as invalidation?)


"Serverless"

So is this 100% client-side, or is there a (server) dependency?


What I've found people really want when they say "serverless" in this context is "direct file upload without a proxy server", which is basically what BaaS products like Firebase and Parse do...

<pitch>

Firebase Storage (https://firebase.google.com/docs/storage/) provides clients that perform secure, serverless uploads and downloads. Instead of doing the dance with a server minting signed URLs, it uses a rules engine that lets developers specify declarative rules to authorize client operations. You can write rules to match object prefixes, restrict access to a particular user or set of users, check the contents of the file metadata (size, content type, other headers), or check whether a file already exists at a prefix so it isn't overwritten.

If HackerNews formatting were more forgiving of code snippets, I'd post one here, but instead have to link to the docs (https://firebase.google.com/docs/storage/security/secure-fil...).
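
For a rough idea of the client side being described, here's a sketch with the Firebase JS SDK (v8-style API; the path layout and auth usage are illustrative assumptions, not the exact example from the docs):

    // Hedged sketch: direct client upload, authorized by security rules
    // rather than a server-minted signed URL.
    const user = firebase.auth().currentUser;
    firebase.storage()
      .ref(`uploads/${user.uid}/${file.name}`)    // rules can match this prefix and the user's uid
      .put(file, { contentType: file.type })      // rules can also check size and content type
      .then(snapshot => snapshot.ref.getDownloadURL())
      .then(url => console.log('uploaded to', url));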

We've found that this model is more performant and less expensive (no need for a proxy server), and also a lower cognitive load on developers, since they think about what they want the end result to be rather than how to build it up.

And since I know people will bring it up: there are definitely limitations in flexibility (you're using a DSL), and a steeper learning curve for the very complicated use cases. The goal here is to make it trivial for 90% of use cases and possible for 9%, rather than making it possible for 100% and equally difficult for everyone. Tradeoffs...

</pitch>

And if you want other examples, Parse did a similar thing with role-based access control to a Parse File, allowing direct client upload and access by only a set of users. S3 and GCS can do this as well, assuming their (relatively coarse) IAM models are granular enough for you (and you're an authorized principal in their systems, which is often the harder thing).

Bringing this full circle, "serverless" typically involves a switch from writing code (imperative) to writing config (declarative). You're not validating JWTs, signing URLs, or writing middleware; you're letting services know how to configure those primitives for you. In some ways the Serverless framework does abstract this for you (hey look, I didn't provision a VM), and in some ways it doesn't (you still wrote code to generate a signed URL).

Disclosure: I built Firebase Storage


Thanks for pointing out that "serverless" is a rebranding of Backend-as-a-Service.

Any examples of the 9% "possible" Firebase use cases which are less easy to configure?

What do you think of the Rebol/RED approach to DSLs? There are also a few Ocaml papers on DSLs for finance.

Are particular languages better suited to implementing DSLs for configurable/declarative interfaces to a BaaS like Firebase?


I use this technique extensively in several production systems.

As others have mentioned, having an expiration policy is a good idea. You can also mitigate charges from malicious activity by using rate limits on the signing endpoint (API Gateway supports this). Using infrequent access or reduced redundancy storage might also be a good idea if you expect a lot of traffic. It's also good to limit the CORS policy on the bucket to the needed domains and headers.
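
A sketch of that CORS lockdown with the AWS SDK v2 (the origin and header names are illustrative):

    // Hedged sketch: restrict the bucket's CORS policy to the app's own domain
    // and only the headers the upload needs.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    s3.putBucketCors({
      Bucket: 'my-upload-bucket',
      CORSConfiguration: {
        CORSRules: [{
          AllowedOrigins: ['https://app.example.com'],
          AllowedMethods: ['PUT'],
          AllowedHeaders: ['Content-Type', 'x-amz-meta-filename'],
          MaxAgeSeconds: 3000,
        }],
      },
    }).promise();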

Signed metadata headers are very useful when combined with S3 event handlers (SQS or straight to Lambda) using a HEAD request on the uploaded objects. This is a great technique for post-processing an upload without requiring client trust or an external data store (which would need a separate, fallible request and could lead to consistency issues).

Edit: It is also critically important to have some randomness in each key path so it is unguessable. Otherwise user files would be overwritable by an attacker. (Many file names are easily guessable, and an attacker with enough tries could eventually stuff malware in, for example.) I used GUIDs for this because they are both URL- and S3-key-safe. If keeping the original file name is needed, I put it in a metadata value and rename the file on download using a Content-Disposition header. Making the S3 headers work with symbols in file names can be tricky, but encoding the name as a JSON string works around most issues.
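
A sketch of that metadata/Content-Disposition trick with the AWS SDK v2 (the bucket, key and file name are hypothetical):

    // Hedged sketch: keep the original file name in signed object metadata at upload
    // time, then restore it at download time with a Content-Disposition override.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3({ signatureVersion: 'v4' });

    // Upload URL: the key is a GUID, the original name travels as signed metadata
    // (the client must send the matching x-amz-meta-filename header).
    const uploadURL = s3.getSignedUrl('putObject', {
      Bucket: 'my-upload-bucket',
      Key: 'b2f0c1de-guid-key',
      Metadata: { filename: JSON.stringify('quarterly report (final).pdf') },
    });

    // Download URL: rename on the fly without touching the stored object.
    const downloadURL = s3.getSignedUrl('getObject', {
      Bucket: 'my-upload-bucket',
      Key: 'b2f0c1de-guid-key',
      ResponseContentDisposition: 'attachment; filename="quarterly report (final).pdf"',
    });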

In order to overcome the 30-second request limit in API Gateway for longer post-processing, while still offering realtime client feedback, you can set up an S3 event handler to trigger the post-processing Lambda, which then updates a DynamoDB record with the S3 key as its id. A status endpoint Lambda is then polled by the client with the S3 key for status events.
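
A sketch of that event-handler/status-record pattern with the AWS SDK v2 (the table and attribute names are hypothetical):

    // Hedged sketch: an S3 event triggers this Lambda, which does the long-running
    // post-processing (free of API Gateway's timeout) and records status in DynamoDB,
    // keyed by the S3 object key, for a separate status endpoint to read.
    const AWS = require('aws-sdk');
    const dynamo = new AWS.DynamoDB.DocumentClient();

    exports.handler = async (event) => {
      for (const record of event.Records) {
        const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
        // ... post-processing of the uploaded object goes here ...
        await dynamo.put({
          TableName: 'upload-status',
          Item: { id: key, status: 'PROCESSED', updatedAt: Date.now() },
        }).promise();
      }
    };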

For more complex post-processing and client-side workflows I have used key prefixes (folders), each with separate event handlers or CORS configurations. IAM policies with conditions on S3 key prefixes are used to restrict access. The S3 API copy command can move large objects quickly between workflow steps.

Also, enabling server-side encryption is a must imo. Be sure to specify AWS signature version 4 in the S3 constructor so that all parts of the request are signed. (Otherwise some older regions may not sign metadata headers.)
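
A sketch of those two settings together (SigV4 in the constructor plus server-side encryption on the signed request; the bucket, key and region are illustrative):

    // Hedged sketch: force Signature Version 4 and require server-side encryption.
    // The client must send the matching x-amz-server-side-encryption header,
    // or the signature won't validate.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3({ signatureVersion: 'v4', region: 'eu-west-1' });

    const uploadURL = s3.getSignedUrl('putObject', {
      Bucket: 'my-upload-bucket',
      Key: 'b2f0c1de-guid-key',
      ServerSideEncryption: 'AES256',
      Expires: 300,
    });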

Also the S3 API copy command has an interesting append feature which can be used to build objects iteratively. I once toyed with the idea of using it to create large zip files of many S3 objects efficiently but ended up not needing it. Someday I would like to try that because it could be great for a lot of web apps where users can select a random list of files to download.

Also, I (re)implemented most of the above this week using CloudFormation and the newer AWS Serverless template (not the serverless.com project but the actual AWS feature), which allows for really easy deployment.



