GitHub Codescanning (github.com/features)
390 points by jedisct1 on May 6, 2020 | 73 comments



Nice, it seems they finally integrated Semmle, which they acquired last year! https://github.blog/2019-09-18-github-welcomes-semmle/

This is the only static analysis tool I've really been interested in over the past few years; it's crazy effective from everything I've seen, and the queries are easy to write. Can't wait to play around with this beta on my own code.


The link is behind a login, I'm on my phone logged out but I believe it refers to this[1]?

[1] https://github.com/features/security


Thanks. I'm surprised a mod hasn't changed it by now, but I guess we're the only two people not logged into Github.


Me too, because I don't have a GitHub account (I use Fossil).


I always delete my cookies, so that makes three of us.


It seems that it will be free for open-source, public repositories, and quite pricey for private repositories.

"A member of our sales team will reach out to discuss details" is a great euphemism for "be ready to pay quite a few thousand bucks per year for this feature".


Or it's in alpha and they haven't found out yet what the feature is worth to corporate users; so they want to do that before anchoring them to a price.

Also, it might vary per account in an unpredictable way: it's a heavyweight operation that costs more compute-hours when run against larger repos, but it also involves static-analysis tasks that don't necessarily scale linearly with LOC. So they might not yet have a predictive model for the baseline cost to them; instead, they could be doing a trial run for each interested user, observing the cost of the workload on their repo, and then extrapolating that out as the cost to run that workload persistently.

Either way, GitHub has been strongly on the side of "clear pricing" so far, so I doubt they would plan to leave things this way. But it's hard to get enough data for a feature like this, when you know each run you do "just to model the curve" is costing you real money.


The usual theory on HN is that having to call means they're going to milk you for money.

But I work on a SaaS product, and up-front pricing is sort of irrelevant, if only because the customers want a highly customized product; frankly, you really need to know their business and work with them to find out what they want before you can figure out the price for implementation and customization.

In our case it isn't a scam or a ploy for more money; it's just the nature of the beast / industry, and every single time it is what customers end up asking for (lots of customization and so on).

Now maybe the milking theory is the case for GitHub code scanning, but maybe it's also very much a product whose pricing depends on exactly how you want it to work for you ...


That means the price is negotiable. Rich users pay more, poor users pay less, both users get value, producer makes a profit; everyone wins.


If you're an inside salesperson, cold calling ought to be pretty straightforward:

Customer: Sorry, I just don't think it's worth the price...

Sales: Oh really? Because we've identified 42 critical vulnerabilities in your code base

Customer: Do you have an installment plan?


> Sales: Oh really? Because we've identified 42 critical vulnerabilities in your code base

Customer: We know the software runs on Java 7. We don't need to pay you 100k to find out.


What's the best way to put this politely from the seller's perspective lol. Do sales teams do background homework on clients' revenue and then come up with different numbers for the exact same offering?


It's one of the big reasons why I try to finalize contracts before announcing a round of funding. You'd be amazed at how much pricing for things jumps immediately after announcing that Series B.


Yes, that happens.

Sometimes it is also possible to brand the exact same service in different ways.

If you have a SaaS that is fully OSHA compliant on all tiers, it may be worth it to not mention the OSHA compliance on lower tiers, but only offer it on the Enterprise tier for example.


Sure, why not? Or they impute value from usage.


Cost based on usage seems very fair to me, but GP said "rich users pay more, poor users pay less," which seems to suggest that if Microsoft emailed me asking about a SaaS subscription for 1M req/day, I should quote them orders of magnitude more than a small startup asking for the same 1M req/day.


I don't think this will be free for all public repositories. Having designed and implemented these kinds of static analysers, it's quite costly to scale them - you want to avoid wasting CPU time on the millions of public repositories.


They said during the keynote that they're willing to spend the millions of dollars necessary to run this on public repos that activate the option, because it's the right thing to do.


I sure do like this MS better than the old one.

That said, Ballmer did get one thing right: https://www.youtube.com/watch?v=Vhh_GeBPOhs


I'm a developer and I hate closed proprietary ecosystems with a passion, so that was just lip service afaic. Current Microsoft is much more "developers developers developers".


That's kind of the point, though? A lot of people make fun of Ballmer for using that as a repetitive mantra, but the point of a mantra, of repeating it to yourself and others, is often to remind yourself that it is a value you hold, one you maybe aren't great at but should keep striving towards. Current Microsoft likely wouldn't have gotten better at "developers, developers, developers" if Ballmer hadn't been shouting it from the rooftops as a core company value and trying to drive the company to be better at it. The irony that Microsoft got much better at it partly by ignoring some of Ballmer's other past paranoia/NIH/"home-team-ism" probably wouldn't be lost on Ballmer himself either; it always seemed like he kept repeating the mantra as a reminder to himself, too, not to get caught up in what seemed best for shareholders or for Windows when that wasn't best for developers. He wasn't always successful, but holding a value/ideal doesn't make you perfect; it gives you a goal to work towards.


That's a problem that's simple to solve by putting a quota on the number of analyses per project per month, perhaps weighted by how popular the project is. Like everything else at GitHub, private project users pay extra to cover the public project users. That stays proportional regardless of the feature's cost.


Not sure that's so simple - the cost of running a static analyser is almost never linear. For large popular projects, special care will have to be taken to make sure the analysis terminates and gives meaningful results (a basic timeout won't cut it...). I've seen, many times, minor changes in the code cause huge differences in analyser running times. It'll be interesting to see :-)


It is typical enterprise sales. If the prices aren't listed, it usually means you're charged based on what they estimate you can afford for the value-add. It may be $12k/yr for a big company and $1.2k/yr for a small one.


No way it'll be $12k/year for a big company - I would say $50k minimum.


Or vice versa - I remember that O'Reilly Safari was almost 10 times cheaper per person when I worked at a 100k+ person company, compared to one with only a couple of thousand people...


And that still makes the solution cost prohibitive for the large company, because out of the 100,000 employees there are likely only 10 people who have access to the software and ever use it.

A small company buying software might actually have a bunch of employees using it.


If you're a reasonably large org with lots of developers and tons of repos, this will be a cost well worth paying.


It's usually the opposite. The vendor tries to charge a crazy amount per user per month or charge for every employee in the company, which makes the solution acutely cost prohibitive.

Nobody wants to spend hundreds of thousands of dollars a year - if not millions - on something that's barely used. Better to spend on anything else that is tangible.


Well, Semmle was extremely expensive by itself, so I doubt it will be free.


One thing that really annoys me about Github/Gitlab et al is that they don't provide a nice UI to show CI results in a structured way.

There are semi-standard XML formats that can be used to provide file position, severity, and message; these could easily be produced from CI actions and would give devs and reviewers a great view of failing tests, compiler and linter warnings, etc., with links to the files.
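(To make that concrete: a minimal JUnit-style report already carries all of that information. The suite and test names below are invented for illustration.)

    <testsuite name="unit" tests="2" failures="1">
      <testcase classname="app.test_db" name="test_connect"/>
      <testcase classname="app.test_db" name="test_query">
        <failure message="expected 200, got 500">stack trace here</failure>
      </testcase>
    </testsuite>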

Instead we are still stuck with scanning text logs to figure out what failed.

I always assumed this is not implemented to eventually upsell something automated.

This is still a great feature though, which will probably prevent a large amount of bugs/vulnerabilities, assuming they can minimize false positives.

To give credit where it is due, I'd also note that most of GitHub's new features since the acquisition were already present in GitLab [1]. GitHub will be able to commit way more resources to polishing them, though.

[1] https://about.gitlab.com/stages-devops-lifecycle/secure/

Edit: apparently both GitLab and GitHub have at least a limited version of this now, although GitLab's implementation seems much nicer. See below.


Surprisingly, Azure Pipelines, which I understand is used underneath GitHub Actions, does support reporting CI results in this nice way on GitHub - e.g., see these results from a recent PR on one of my projects: https://github.com/pydata/xarray/pull/4017/checks?check_run_...

For now, at least, it seems like this is one reason not to switch to GitHub Actions yet.


Heard this suggested a few times but what is the source?


But GitHub Actions does support exactly that, in the form of code annotations - see e.g. https://github.com/ember-template-lint/ember-template-lint/p... - no need to scan logs if you do that; you get the comments right on the code.


Do you have a docs link for this functionality? I can't find anything.

The way of reporting seems somewhat odd, and the UI seems a bit limited, but it's a start!

Edit: https://help.github.com/en/actions/reference/workflow-comman...
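
For reference, annotations are created by printing specially formatted "workflow commands" to stdout from any step. A minimal sketch - the file, line, and messages here are made up for illustration:

    echo "::warning file=app.js,line=1,col=5::Missing semicolon"
    echo "::error file=app.js,line=10,col=8::Undefined variable 'foo'"

GitHub then renders these as annotations on the affected files in the PR.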


Integration and usage seem to be a limiting factor, but there's some gap-bridging happening with GH Actions. This Action works pretty great for Python's flake8 - https://github.com/marketplace/actions/run-flake8-on-your-pr...


Of course GitLab supports this: https://git.kuschku.de/justJanne/QuasselDroid-ng/pipelines/5...

It can automatically parse most XML formats, JUnit for example.


Thanks! Docs for the GitLab feature to show CI results in a structured way are on https://docs.gitlab.com/ee/ci/junit_test_reports.html
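
Wiring that up is just a matter of declaring a JUnit report artifact in .gitlab-ci.yml. A minimal sketch - the job name and test command here are placeholders:

    test:
      script:
        - pytest --junitxml=report.xml
      artifacts:
        reports:
          junit: report.xml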

GitHub Codescanning functionality is best compared to what GitLab has in GitLab SAST https://docs.gitlab.com/ee/user/application_security/sast/ and Secret Detection https://docs.gitlab.com/ee/user/application_security/sast/#s...


I stand corrected, and gladly so!

This looks perfect.


There is a standard for this: Static Analysis Results Interchange Format (SARIF):

https://docs.oasis-open.org/sarif/sarif/v2.0/csprd01/sarif-v...
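
To give a flavour of the format: a finding is a "result" inside a "run". This minimal sketch follows the later 2.1.0 revision of the spec; the tool name, rule id, and location are invented for illustration.

    {
      "version": "2.1.0",
      "runs": [{
        "tool": { "driver": { "name": "ExampleAnalyzer" } },
        "results": [{
          "ruleId": "js/sql-injection",
          "level": "error",
          "message": { "text": "User input flows into a SQL query." },
          "locations": [{
            "physicalLocation": {
              "artifactLocation": { "uri": "src/db.js" },
              "region": { "startLine": 42 }
            }
          }]
        }]
      }]
    }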


Do you know who supports it?


Given that it's a pretty young standard, nobody for now, but there is an open ticket to support it on GitLab:

https://gitlab.com/gitlab-org/gitlab/-/issues/118496


I wonder how much the difficulty of writing queries varies between languages. I was disappointed not to see Ruby on the beta sign-up list, but with GitHub being a pretty heavy Ruby user, I'm sure they have their reasons for excluding it.


This is based on Semmle, which they acquired last year. According to their docs (https://help.semmle.com/lgtm-enterprise/admin/help/sys-requi...) it supports C, C++, C#, Go, Java, JS, TypeScript, and Python; no Ruby. (Edit to add: Whoops, missed that this list is literally on the page too. I had even looked first.)

It's really hard to do any kind of static analysis on something like Ruby or Perl where 1) you need a ton of context just to parse it properly, and 2) tracing calls is a nightmare. Given that, I'm completely unsurprised they haven't supported it yet.


How hard can Ruby be compared to Python?

Python is very dynamic. I've used and worked on linting tools for Python and tried commercial static analyzers, and in my opinion they do a pretty good job in spite of the language being dynamic. Not perfect, but miles above anything I would have expected.
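
To illustrate the kind of dynamism that makes this hard, even plain Python lets you resolve names at runtime in ways no analyser can follow in general. A contrived sketch:

    import importlib

    # Which module and function get called here is only known at runtime,
    # so a static analyser cannot trace this call edge reliably.
    mod_name, fn_name = input("module.function: ").split(".")
    fn = getattr(importlib.import_module(mod_name), fn_name)
    fn()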


In Ruby, monkeypatching is idiomatic and you can’t even tell what package any given global came from.


Interestingly, there is no support for Rust.

Is it because Rust is bulletproof, or just too young to be considered for now?


We (GitHub) absolutely plan to expand the list of languages CodeQL supports, and Ruby is a language we'd love to add (we're heavy users of it internally). In the meantime, because code scanning is extensible you can plug in third party analysis engines to scan the languages that CodeQL doesn't support.


It will be interesting to see the false positive rate...


At GitHub we're pretty proud of the scan results from CodeQL. Currently, 70% of alerts flagged in PRs are fixed (rather than marked as a false positive or won't fix). We think we can get that number up to 85%+ as we gather more data and iterate on the queries (which are all open source).


Hmm, can you please share more details about this data: what kinds of vulnerabilities are you finding, what does "fix" mean, what is the sensitivity of the analyser (flow, procedure), and what are the underlying abstractions for memory, concurrency, etc.? From the demos so far it's hard to see past a standard taint analyser. 70% precision is very high for a general-purpose static analyser unless you're missing a lot of vulnerabilities. The static analysis/formal verification community would definitely be interested in more details about your experiments.


I just stumbled on https://gitpod.io/ which comes with an extension that adds a button next to the clone/download button-down (portmanteau of button and dropdown).

I also use codeanywhere for my personal use and whenever applicable I like to use codesandbox.io when it's JS-ish.


I wonder if this uses the Static Analysis Results Interchange Format (SARIF) standard internally.

https://docs.oasis-open.org/sarif/sarif/v2.0/csprd01/sarif-v...


PM for GitHub Advanced Security here.

We use SARIF as the input format so third party code analysis engines can easily integrate with code scanning. Their results can then be shown in the same way that scans using our own CodeQL analysis engine are displayed.

Docs on how we translate each SARIF property into the code scanning display are below:

https://help.github.com/en/github/finding-security-vulnerabi...

(The beta notice on that page is very relevant here - we wanted to build extensibility options into code scanning from its inception, but whilst it is in beta the API won't be 100% stable. We'll do our best to avoid any unnecessary churn.)
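
For anyone wondering what the integration looks like in practice: a third party engine produces a SARIF file, and a workflow step uploads it. A sketch using the upload-sarif action - the version tag and file name here are illustrative:

    - name: Upload third-party analysis results
      uses: github/codeql-action/upload-sarif@v1
      with:
        sarif_file: results.sarif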


Seems like a competitor to https://snyk.io/ . Let's see when it comes out.


Snyk is focused on dependency scanning and license management. This is static code analysis, more akin to SonarSource.


Anyone know if there are any open source projects for doing secret scanning? Or if this uses any open source code scanning projects?


This is really nice. Even with code review, secrets still slip through the cracks. I imagine this will be quite pricey though.


Hopefully this helps with people committing their AWS access keys...


PM for GitHub Advanced Security here.

We handle that with secret scanning - code scanning focuses on static analysis to find vulnerabilities in your code, rather than committed secrets.

We have a partnership with AWS (and many other token issuers) that handles this really nicely. If anything that looks like an AWS credential is committed to a public repo we send it over to AWS - if it's a real token they notify the token's owner (and in some cases automatically revoke the key).

There's full details at https://help.github.com/en/github/administering-a-repository....


Please clarify who "we" is in your comments.


Done - thanks!


I too would like to know.

> We have a partnership with AWS (and many other token issuers) that handles this really nicely. If anything that looks like an AWS credential is committed to a public repo we send it over to AWS - if it's a real token they notify the token's owner (and in some cases automatically revoke the key).

So if something looks like a token from AWS or another token issuer, you automatically send the token to providers to check to see if it is "legit"? Is this something that is opt-in, or done automatically?


I don't really see what the problem with this would be.

I'm assuming AWS does not give a "yeah looks like it", or "nah" response -- but rather "thanks, we will look into it" and then if it's a real one the rest is directly with their customer.

That way, no sensitive information would leak between the providers.


That's exactly the process - there are full docs at the link below, and we're always keen to hear from potential partners.

Please also note that we only automatically send details from _public_ repos to our secret scanning partners.

https://developer.github.com/partnerships/secret-scanning/


I personally don't want anyone sending my data to another provider without having me opt in first. I trust AWS to do the right thing as much as I trust Fox News to report the news accurately.


If your keys are public, they are free to anyone who wants them. They are also sending AWS-specific keys to Amazon.


Yes and no. I'm personally not so much worried about the keys themselves as about whatever detection they are doing to decide what they "think" might be a token/key/etc. And just because a key is public doesn't mean it has to be automatically sent to a third party.

If you accidentally upload a key but then immediately notice and force-push, you're already too late, since GitHub took the initiative to share it. I get that the user would ultimately be at fault here, but that doesn't mean GitHub should work against the user by sharing it.

What if it isn't an AWS token, but instead an encryption key or SSH key that you have blocked off to the public so you're not too worried about it - but you're a warehouse worker protesting COVID-19 treatment? Now Jeff Bezos will be looking for dirt on you like you're Michael Sanchez.

If they made the detection information public then it would at least provide some transparency to see what they determine to be AWS-specific.


> And just because a key is public, doesn't mean that it is going to be automatically sent to a third-party.

In practice, it pretty much does - bad actors continuously scrape the GitHub firehose looking for AWS secrets, and then automatically spin up EC2 instances to mine cryptocurrency. GitHub's token scanning just ensures that AWS sees the tokens too.

If you don't believe me, keep this website open for a few hours - it's a realtime stream of secrets scraped from GitHub: https://shhgit.darkport.co.uk/
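
To give a sense of how low the bar is: AWS access key IDs follow a well-known fixed pattern, so even a naive scraper can spot them. A minimal sketch (this is not how GitHub's scanner actually works):

    import re

    # AWS access key IDs are 20 characters with a known prefix such as
    # AKIA; real scanners also do entropy checks on nearby secret keys.
    AWS_KEY_RE = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")

    def find_candidate_keys(text):
        return AWS_KEY_RE.findall(text)

    # AWS's own documented example key id:
    print(find_candidate_keys("aws_access_key_id = AKIAIOSFODNN7EXAMPLE"))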


> SSH key that you have blocked off to the public so you're not too worried about it

How do you "block off to the public" something committed to a public GitHub repository? The OP specifically said this was for public repositories.

If GitHub weren't doing this, I imagine the AWS security people would be crawling GitHub on their own, to cut down on security incidents. This push mechanism just makes it more efficient for both GitHub and AWS.

If Amazon is looking for dirt on you, and you have public repositories, you can bet they'll be looking deeper into your repositories than a quick credential scan.


> How do you "block off to the public" something committed to a public GitHub repository? The OP specifically said this was for public repositories.

Pretty simple, actually. If your AWS services are internal only, then you may not be in a rush to rotate keys for a service that isn't exposed.


So instead of "SSH key that you have blocked off to the public", you meant "SSH key for an SSH server blocked off from the public". That makes more sense.


Another electron app, nothing to look at here.



