This is the only static analysis tool I've really been interested in over the past few years; it's crazy effective from everything I've seen, and the queries are easy to write. Can't wait to play around with this beta on my own code.
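For anyone who hasn't tried it, the queries read a lot like SQL over the program's AST and data flow. A toy example from memory (not one of the curated queries) that flags direct eval calls in JavaScript:

    import javascript

    // Flag every direct call to eval in the codebase.
    from CallExpr call
    where call.getCalleeName() = "eval"
    select call, "Avoid eval; it enables code injection."

The real queries in their standard library do proper taint tracking, but even this naive style gets you surprisingly far.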
It seems that it will be free for open-source, public repositories, and quite pricey for private ones.
"A member of our sales team will reach out to discuss details" is a great euphemism for "be ready to pay quite a few thousand bucks per year for this feature".
Or it's in alpha and they haven't found out yet what the feature is worth to corporate users; so they want to do that before anchoring them to a price.
Also, it might vary per account in an unpredictable way, since it's a heavyweight operation that'd cost more compute-hours when run against larger repos, but also involves static-analysis tasks that don't necessarily scale linearly with LOC. So they might just not have a predictive model yet for their baseline cost; instead they're doing a trial run for each interested user, observing the cost of the workload on that user's repo, and then extrapolating that out as the cost to run the workload persistently.
Either way, GitHub has been strongly on the side of "clear pricing" so far, so I doubt they would plan to leave things this way. But it's hard to get enough data for a feature like this, when you know each run you do "just to model the curve" is costing you real money.
The usual theory on HN is that having to call means they're going to milk you for money.
But I work with a SaaS product, and up-front pricing is sort of irrelevant for us, if only because customers want a highly customized product; frankly, you really need to know their business and work with them on what they want before you can price the implementation and customization.
In our case it isn't a scam or a ploy for more money; it's just the nature of the beast and the industry, and every single time it's what customers end up asking for (lots of customization, etc.).
Now maybe that's the case for GitHub code scanning too, or maybe it's very much a product where pricing depends on how exactly you want it to work for you ...
What's the best way to put this politely from the seller's perspective lol. Do sales teams do background homework on clients' revenue and then come up with different numbers for the exact same offering?
It's one of the big reasons why I try to finalize contracts before announcing a round of funding. You'd be amazed at how much pricing for things jumps immediately after announcing that Series B.
Sometimes it is also possible to brand the exact same service in different ways.
If you have a SaaS that is fully OSHA compliant on all tiers, it may be worth it to not mention the OSHA compliance on lower tiers, but only offer it on the Enterprise tier for example.
Cost based on usage seems very fair to me, but GP said "rich users pay more, poor users pay less," which seems to suggest that if Microsoft emailed me asking about a SaaS subscription for 1M req/day, I should quote them orders of magnitude more than a small startup asking for the same 1M req/day.
I don't think this will be free for all public repositories. Having designed and implemented this kind of static analyser, I can say it's quite costly to scale them: you do want to avoid burning useless CPU time on the millions of public repositories.
They said during the Keynote that they were willing to spend the millions of dollars necessary to run this on public repos that would activate the option because it's the right thing to do.
I'm a developer and I hate closed proprietary ecosystems with a passion, so that mantra always struck me as lip service, as far as I'm concerned. Current Microsoft is much more "developers developers developers" than Ballmer's ever was.
That's kind of the point, though? A lot of people make fun of Ballmer for using that as a repetitive mantra, but the point of a mantra, of repeating it to yourself and others, is often to remind yourself that it's a value you hold, one you maybe aren't great at living up to yet but should keep striving towards.

Current Microsoft likely wouldn't have gotten better at "developers, developers, developers" if Ballmer hadn't been shouting it from the rooftops as a core company value and trying to drive the company to be better at it. The irony that Microsoft got much better at it partly by ignoring some of Ballmer's other paranoia/NIH/"home-team-ism" probably wouldn't be lost on Ballmer himself either; it always seemed like he kept repeating the mantra as a reminder to himself, too, not to get caught up in what seemed best for shareholders or for Windows when that wasn't best for developers. He wasn't always successful, but holding a value/ideal doesn't make you perfect; it gives you a goal to work towards.
That's a problem that's simple to solve by putting a quota on # analyses per project per month, perhaps weighted by how popular the project is.
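As a purely hypothetical sketch (the function and the log-scaling are made up for illustration), the weighting could be as simple as:

    import math

    def monthly_quota(base_runs: int, stars: int) -> int:
        """Hypothetical: every project gets a base number of full analyses
        per month, plus extra runs that grow with popularity (log-scaled
        so mega-projects don't get unbounded compute)."""
        return base_runs + int(base_runs * math.log10(stars + 1))

    # e.g. with base_runs=30: a 10-star repo gets ~61 runs/month,
    # a 100k-star repo gets ~180.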
Like everything else at GitHub, private-project users pay extra to cover the public-project users. It's proportional, regardless of what any individual feature costs.
Not sure it's that simple - the cost of running a static analyser is almost never linear in repo size. For large popular projects, special care will have to be taken to make sure the analysis terminates and gives meaningful results (a basic timeout won't cut it...). I've seen, many times, huge differences in analyser running times from minor changes in the code. It'll be interesting to see :-)
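To illustrate why it's non-linear (my example, not theirs): a path-sensitive analysis can blow up exponentially on code that's trivial to read and run:

    # Four independent if-statements give a path-sensitive analyzer
    # 2**4 = 16 feasible paths to consider; n branches give 2**n.
    # Adding one innocuous `if` doubles the work, which is how a
    # "minor change" can swing running times wildly.
    def f(a, b, c, d):
        x = 0
        if a: x += 1
        if b: x += 1
        if c: x += 1
        if d: x += 1
        return x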
It is typical enterprise sales. If the prices aren't listed, it usually means you're charged based on what they estimate you can afford for the value-add. It may be $12k/yr for a big company and $1.2k/yr for a small one.
Or vice versa - I remember O'Reilly Safari being almost 10 times cheaper per person when I worked at a 100k+ person company, compared to one with only a couple of thousand people...
And that still makes the solution cost prohibitive for the large company, because out of the 100,000 employees there are likely only 10 people who have access to the software and ever use it.
A small company buying a piece of software might actually have a bunch of employees using it.
It's usually the opposite. The vendor tries to charge a crazy amount per user per month or charge for every employee in the company, which makes the solution acutely cost prohibitive.
Nobody wants to spend hundreds of thousands of dollars a year, if not millions, on something that's barely used. Better to spend it on anything else that's tangible.
One thing that really annoys me about Github/Gitlab et al is that they don't provide a nice UI to show CI results in a structured way.
There are semi-standard XML formats that can be used to provide file position, severity, and message. These could easily be produced from CI actions and would give devs and reviewers a great view of failing tests, compiler and linter warnings, and so on, with links to the files, etc.
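For concreteness, the kind of JUnit-style XML most test runners can already emit carries everything a structured UI would need; the names here are just an illustrative sample:

    <testsuite name="api-tests" tests="2" failures="1" time="0.13">
      <testcase classname="tests.test_login" name="test_bad_password" time="0.02"/>
      <testcase classname="tests.test_login" name="test_lockout" time="0.11">
        <failure message="expected HTTP 423, got 200">assert resp.status_code == 423</failure>
      </testcase>
    </testsuite>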
Instead we are still stuck with scanning text logs to figure out what failed.
I always assumed this is not implemented to eventually upsell something automated.
This is still a great feature though, which will probably prevent a large amount of bugs/vulnerabilities, assuming they can minimize false positives.
To give credit where it is due, I'd also note that most of GitHub's new features since the acquisition were already present in GitLab [1]. GitHub will be able to commit way more resources to polishing them, though.
Surprisingly, Azure Pipelines, which I understand is used underneath GitHub Actions, does support reporting CI results in this nice way on GitHub, e.g., see these results from a recent PR on one of my projects:
https://github.com/pydata/xarray/pull/4017/checks?check_run_...
For now, at least, this seems like one reason not to switch to GitHub Actions yet.
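In case it's useful, publishing those structured results from Azure Pipelines is a single extra task in the pipeline YAML (assuming your test runner already writes JUnit-style XML somewhere):

    - task: PublishTestResults@2
      inputs:
        testResultsFormat: 'JUnit'          # NUnit, xUnit, cTest also supported
        testResultsFiles: '**/test-results/*.xml'
        failTaskOnFailedTests: true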
I wonder how much the difficulty of writing queries varies between languages. I was disappointed not to see Ruby on the beta sign-up list; GitHub being a pretty heavy Ruby user itself, I'm sure they have their reasons for excluding it.
This is based on Semmle, which they acquired last year. According to their docs (https://help.semmle.com/lgtm-enterprise/admin/help/sys-requi...) it supports C, C++, C#, Go, Java, JS, TypeScript, and Python; no Ruby. (Edit to add: Whoops, missed that this list is also literally on the page. I had even looked first.)
It's really hard to do any kind of static analysis on something like Ruby or Perl where 1) you need a ton of context just to parse it properly, and 2) tracing calls is a nightmare. Given that, I'm completely unsurprised they haven't supported it yet.
Python is very dynamic. I've used and worked on linting tools for python, tried commercial static analyzers, and they do a pretty good job in my opinion in spite of the language being dynamic. Not perfect but miles above anything I would have expected.
We (GitHub) absolutely plan to expand the list of languages CodeQL supports, and Ruby is a language we'd love to add (we're heavy users of it internally). In the meantime, because code scanning is extensible you can plug in third party analysis engines to scan the languages that CodeQL doesn't support.
At GitHub we're pretty proud of the scan results from CodeQL. Currently, 70% of alerts flagged in PRs are fixed (rather than marked as a false positive or won't fix). We think we can get that number up to 85%+ as we gather more data and iterate on the queries (which are all open source).
Hmm, can you please share more details about this data: what kinds of vulnerabilities you're finding, what "fixed" means, what the sensitivity of the analysis is (flow-sensitive? inter-procedural?), and what the underlying abstractions are for memory, concurrency, etc.? From the demos so far it's hard to see past a standard taint analyser.
70% precision is very high for a general-purpose static analyser, unless you're also missing a lot of vulnerabilities (i.e. low recall). The static analysis/formal verification community would definitely be interested in more details about your experiments.
I had just stumbled on https://gitpod.io/ which comes with an extension to add a button next to the clone/download button-down (portmanteau of button and dropdown).
I also use codeanywhere for my personal use and whenever applicable I like to use codesandbox.io when it's JS-ish.
We use SARIF as the input format so third party code analysis engines can easily integrate with code scanning. Their results can then be shown in the same way that scans using our own CodeQL analysis engine are displayed.
Docs on how we translate each SARIF property into the code scanning display are below:
(The beta notice on that page is very relevant here - we wanted to build extensibility options into code scanning from its inception, but whilst it is in beta the API won't be 100% stable. We'll do our best to avoid any unnecessary churn.)
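To make that concrete, a minimal SARIF payload looks roughly like this (tool name, rule id, and file path are placeholders):

    {
      "version": "2.1.0",
      "runs": [{
        "tool": { "driver": { "name": "your-analyzer", "rules": [{ "id": "DEMO001" }] } },
        "results": [{
          "ruleId": "DEMO001",
          "level": "warning",
          "message": { "text": "User input flows into a SQL query." },
          "locations": [{
            "physicalLocation": {
              "artifactLocation": { "uri": "src/db.py" },
              "region": { "startLine": 42, "startColumn": 8 }
            }
          }]
        }]
      }]
    }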
We handle that with secret scanning - code scanning focuses on static analysis to find vulnerabilities in your code, rather than on committed secrets.
We have a partnership with AWS (and many other token issuers) that handles this really nicely. If anything that looks like an AWS credential is committed to a public repo we send it over to AWS - if it's a real token they notify the token's owner (and in some cases automatically revoke the key).
> We have a partnership with AWS (and many other token issuers) that handles this really nicely. If anything that looks like an AWS credential is committed to a public repo we send it over to AWS - if it's a real token they notify the token's owner (and in some cases automatically revoke the key).
So if something looks like a token from AWS or another token issuer, you automatically send the token to providers to check to see if it is "legit"? Is this something that is opt-in, or done automatically?
I don't really see what the problem with this would be.
I'm assuming AWS does not give a "yeah, looks like it" or "nah" response, but rather "thanks, we will look into it", and then if it's a real one the rest is handled directly with their customer.
That way no sensitive information would leak between the providers.
I personally don't want anyone sending my data to another provider without having me opt in first. I trust AWS to do the right thing as much as I trust Fox News to report the news accurately.
Yes and no. I'm personally not so much worried about the keys themselves as about whatever detection they are doing to decide what they "think" might be a token/key/etc. And just because a key is public, doesn't mean that it is going to be automatically sent to a third-party.
If you accidentally upload a key, but then immediately notice and force push, you're already too late since GitHub took the initiative to share that. I get that the user would be at fault here ultimately, but that doesn't mean that GitHub should be working against the user in sharing that.
What if it isn't an AWS token, but instead an encryption key or SSH key that you have blocked off to the public so you're not too worried about it, but you're a warehouse worker protesting COVID-19 treatment? Now Jeff Bezos will be looking for dirt on you like you're Michael Sanchez.
If they made the detection information public then it would at least provide some transparency to see what they determine to be AWS-specific.
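For what it's worth, the access key ID format itself is documented by AWS, so the first pass is probably not far from a regex like this (my guess at the shape, not GitHub's actual detector):

    import re

    # Long-term AWS access key IDs start with "AKIA", temporary (STS)
    # ones with "ASIA"; both are 20 uppercase alphanumeric chars total.
    AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")

    def candidate_aws_keys(text: str) -> list[str]:
        # Returns strings that merely *look* like key IDs; only AWS
        # can confirm whether one is live.
        return AWS_ACCESS_KEY_ID.findall(text)

    # AWS's own documentation example key:
    assert candidate_aws_keys("id = AKIAIOSFODNN7EXAMPLE") == ["AKIAIOSFODNN7EXAMPLE"]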
> And just because a key is public, doesn't mean that it is going to be automatically sent to a third-party.
In practice, it pretty much does - bad actors continuously scrape the GitHub firehose looking for AWS secrets, and then automatically spin up EC2 instances to mine cryptocurrency. GitHub's token scanning just ensures that AWS sees the tokens too.
If you don't believe me, keep this website open for a few hours - it's a realtime stream of secrets scraped from GitHub: https://shhgit.darkport.co.uk/
> SSH key that you have blocked off to the public so you're not too worried about it
How do you "block off to the public" something committed to a public GitHub repository? The OP specifically said this was for public repositories.
If GitHub weren't doing this, I imagine the AWS security people would be crawling GitHub on their own, to cut down on security incidents. This push mechanism just makes it more efficient for both GitHub and AWS.
If Amazon is looking for dirt on you, and you have public repositories, you can bet they'll be looking deeper into your repositories than a quick credential scan.
So instead of "SSH key that you have blocked off to the public", you meant "SSH key for an SSH server blocked off from the public". That makes more sense.