Hacker News
GitHub’s database of security advisories is now open source (github.blog)
317 points by greysteil on Feb 26, 2022 | 45 comments



PM from GitHub here. I’ve been wanting to do this since I joined three years ago! Happy to answer any questions about where we’re going with open source security.


What can we do to try and reduce "alert fatigue"? I've lost track of the number of super-scary-looking regex DoS "high" vulnerabilities I've had to review for an app that only uses client-side JS and is incredibly unlikely to be exploitable in practice (or particularly where the vulnerable dependencies are build-time only).

One of the problems I've also had with Snyk is low-quality duplicative entries (for example, cataloguing each deserialisation blacklist bypass in Jackson as a separate "new" vulnerability because "yay CVE numbers to put on CVs") which then wastes the time of folks triaging vulnerabilities who may have already concluded there's no exploitation risk (due to e.g. not deserialising user input, or not using polymorphic deserialisation anywhere) and have to review issues again.


A lot. Honestly, GitHub dropped the ball for a while here. (The inside story is that we bought a SAST company, shifted a lot of focus into making that acquisition successful, and didn't give enough attention to our open source security offerings for a couple of years.)

On the alerting side, we have a couple of things coming. Neither are magic bullets, but both will help.

- Better handling of vulnerabilities in dev dependencies. Some vulnerabilities matter if they're in a dev dependency - anything that exfiltrates your local filesystem, for example. Others don't - DoS vulnerabilities, for example. At the moment, GitHub doesn't even tell you whether the dependency a vulnerability affects is a runtime or development dependency. We can and will get better there.

- Analysis of whether the vulnerable code in a dependency is called. You almost certainly want to react faster to vulnerabilities in your code that your application is actually exposed to than to ones that it may be exposed to in future. (You probably want to respond to the unreachable ones, too, especially if you can get an auto-generated PR to do so, but there's much less urgency.) We have this in private beta for Python right now, and expect to have it in public beta in the next few months.
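A rough sketch of how the two signals above (dev-only vs runtime, reachable vs not) could combine into a triage priority. Everything here is hypothetical: the `Alert` fields, the impact categories, and the priority labels are made up for illustration and are not GitHub's data model or logic.

```python
from dataclasses import dataclass

# Hypothetical alert record; the field names are assumptions for illustration.
@dataclass
class Alert:
    package: str
    impact: str      # e.g. "dos", "file-exfiltration", "rce"
    dev_only: bool   # True if the package is only a development dependency
    reachable: bool  # True if the vulnerable code is actually called

# Impacts that arguably still matter on a developer machine or CI runner.
DEV_RELEVANT_IMPACTS = {"file-exfiltration", "rce"}

def triage(alert: Alert) -> str:
    """Return a coarse priority: "urgent", "soon", or "low"."""
    # A DoS in a build-time-only tool is unlikely to hurt production.
    if alert.dev_only and alert.impact not in DEV_RELEVANT_IMPACTS:
        return "low"
    # The application is actually exposed: react quickly.
    if alert.reachable:
        return "urgent"
    # Not reachable today, but still worth patching with less urgency.
    return "soon"

alerts = [
    Alert("regex-lib", "dos", dev_only=True, reachable=True),
    Alert("build-plugin", "file-exfiltration", dev_only=True, reachable=True),
    Alert("http-client", "rce", dev_only=False, reachable=False),
]
for alert in alerts:
    print(alert.package, triage(alert))
```

The ordering of the checks is a design choice: scope is tested first so that a dev-only DoS never outranks a runtime issue, whatever the reachability analysis says.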

Beyond alerting, the other big thing is that GitHub's incentives for this database and the experiences it triggers are fundamentally different from other vendors. We aren't selling its contents, so don't have an incentive to inflate it. Open source maintainers are at the heart of our platform, and we really don't want low-quality advisories to go out about their software. And developers are our core customers, and we want to deliver experiences they love above all else. That difference in incentives will likely manifest in lots of little differences, but at a high level, we're aligned on wanting to reduce the alert fatigue.

Sorry we dropped the ball on this for the last couple of years. You're going to see steady improvements from here on.


Thank you, this is awesome to hear. Sadly (to my own detriment) I've gotten slow to investigate the alerts because 90% of them are false positives.

That said, this offering is amazing, and IMHO a huge value add of using GitHub, so even if you left it exactly how it is it's still appreciated. I especially appreciate that you support many different languages (on that, would love to see Erlang and Elixir added). An app server that runs an older PHP service got exploited and was mining cryptocurrency. The investigation went way, way faster because I happened to notice the security warning on GitHub. I was able to get it patched pretty quickly thanks to that. Even though updating deps is one of the first things I do, I may never have actually figured out where the vulnerability was without GitHub, so thank you so much!


That’s awesome to hear. And I hear you on Elixir/Erlang. I have personal skin in the game on that one - in my Dependabot days I created the open source Elixir Advisory Database and very much want to transition that to the GitHub Advisory Database (and get alerts working).

https://github.com/dependabot/elixir-security-advisories


Personally, I'd stop vendoring dependencies and stop checking lock files into git and use version ranges instead. That way people always get the latest CVE fixes when they use the software. Then have good automated testing so that if one of the dependencies breaks something, it gets flagged quickly.


Lock files enable reproducibility of “builds”.

For example, if there is a reported problem in production, with lock files I can check out the same commit and be able to reproduce it (if the provided steps are correct).

Without lock files one or more dependency versions might be higher on my machine than production and then I don’t know if failure to reproduce is because of the steps I’m trying or because the problem doesn’t exist in the updated dependencies.

And then because not all package maintainers are good about following semantic versioning, the build on the CI server can sometimes break itself due to dependency updates which aren’t backwards compatible.

Version range dependencies seem like a nice solution, but in practice I’ve found them to be a nightmare.
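To make the reproducibility point concrete, here is a minimal sketch of a version range resolving differently on two days while a lock file stays fixed. The resolver is deliberately simplified (caret ranges only, plain `major.minor.patch` versions, major version of 1 or higher), and the package name and version lists are invented.

```python
def parse(version: str) -> tuple:
    """Turn "1.2.3" into (1, 2, 3) so versions compare numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

def resolve_caret(base: str, available: list) -> str:
    """Pick the highest available version matching ^base.

    Simplified rule: same major version as `base` and not older than `base`.
    """
    low = parse(base)
    candidates = [v for v in available if parse(v)[0] == low[0] and parse(v) >= low]
    return max(candidates, key=parse)

# What the (made-up) registry offered when production was deployed...
registry_at_deploy = ["1.2.0", "1.2.1"]
# ...and what it offers a few days later, when someone tries to reproduce a bug.
registry_at_debug = ["1.2.0", "1.2.1", "1.3.0", "2.0.0"]

production = resolve_caret("1.2.0", registry_at_deploy)
laptop = resolve_caret("1.2.0", registry_at_debug)

# A lock file records the version that was actually resolved at deploy time.
lockfile = {"example-lib": production}

print("production:", production)          # 1.2.1
print("laptop, range only:", laptop)      # 1.3.0
print("laptop, with lock file:", lockfile["example-lib"])  # 1.2.1
```

With only the range, the laptop and production disagree without any commit having changed; with the lock file checked in, both sides install the same version.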


So, store the lock files with your builds, not with your sources.


There are a couple of early startups trying to address this:

https://www.tromzo.com/ - early but very strong vision

https://www.dazz.io/ - dumb name but decent vision


>What can we do to try and reduce "alert fatigue"?

The more you do something the easier it is to do. There is nothing wrong with it no longer feeling like an alert. Patching security vulnerabilities is just a normal part of software development and the easier and more comfortable people are with it the better.


>The more you do something the easier it is to do. There is nothing wrong with it no longer feeling like an alert.

That is almost the definition of alert fatigue. The problem is tools presenting minor issues as major ones because they might be a major issue in certain circumstances. Then supposedly major alerts start to feel normal, and when there is an actually major alert nobody has a sense of urgency about it.

I've never used GitHub's version of this, but I've used others, and as someone who only develops internal tools I wish there was a setting for "I mostly trust my authenticated users", which I think would downgrade "possible DoS from a specially crafted regex from an authenticated user."


>Then supposedly major alerts start to feel normal

Major alerts should feel normal. I should have said that you shouldn't feel alarmed instead of suggesting that it shouldn't be treated as an event. Maybe that doesn't quite capture what I mean, but you should get the picture. You should be prepared to handle them. Unfortunately, security defects are to be expected and it shouldn't be a surprise that they might exist in your system.

>and when there is an actually major alert nobody has a sense of urgency about it.

Why? You should be urgent with all security issues. You shouldn't have people putting off security updates because they are minor.


Sure, but it's like the boy who cried wolf. If the tool keeps saying things are a bigger issue than they are, then people will stop believing the tool.


See also almost every oil refinery catastrophe. "It's normal for that alarm to go off/to not go off, or for that minor leak to flare up from time to time" and then one day the ignored or missed alert could've prevented death.


There should be a process that gets followed for every alert. You shouldn't ignore or miss any alerts.


What is the rationale behind the GHSA advisory score giving a lower vulnerability severity than what the security community thinks? I've come across this again and again, where the CVSS score was higher than the GHSA one. Example:

GHSA has moderate severity:

https://github.com/advisories/GHSA-896r-f27r-55mw

The CVSS3 score of the CVE is actually critical!!

If GHSA is "self-reporting" then why is it allowed to deviate in a direction that is harmful (downplaying the issue)? If this means what I think it means (and I might be wrong) then the GHSA score is broken.

Also it breaks security workflows that build on GHSA: If a manager looking at the conflicting severity levels lowers the urgency of the backlog ticket because severity is only moderate then users might get hurt.


Oh good question. I can't answer this one as authoritatively as I'd like - I'll double check with the team next week.

One thing to note is that the full CVSS 3.1 string is included in the database as assessed by NIST. The severity displayed by GitHub is stored as a "database specific" field, so it looks like we're trying to be explicit about the existence of multiple perspectives on severity (one of which is our own), but that we could do more to make that clear.

https://github.com/github/advisory-database/blob/main/adviso...
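A small sketch of reading both perspectives on severity from one advisory file. It assumes the OSV-style layout, with the CVSS vector under `severity` and GitHub's own label under `database_specific`; the advisory below is a trimmed, made-up record (placeholder ID, invented values), not a real entry.

```python
import json

# Trimmed, invented advisory in an OSV-style layout; values are illustrative only.
ADVISORY_JSON = """
{
  "id": "GHSA-xxxx-xxxx-xxxx",
  "severity": [
    {"type": "CVSS_V3", "score": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"}
  ],
  "database_specific": {"severity": "MODERATE"}
}
"""

def parse_cvss_vector(vector: str) -> dict:
    """Split a CVSS v3 vector string into {metric: value}, plus the version."""
    prefix, *metrics = vector.split("/")
    parsed = dict(metric.split(":", 1) for metric in metrics)
    parsed["version"] = prefix.split(":", 1)[1]
    return parsed

advisory = json.loads(ADVISORY_JSON)

# Perspective 1: the full CVSS vector carried in the record.
cvss_vector = next(s["score"] for s in advisory["severity"] if s["type"] == "CVSS_V3")
metrics = parse_cvss_vector(cvss_vector)

# Perspective 2: the severity label stored as a database-specific field.
github_label = advisory["database_specific"]["severity"]

print("CVSS version:", metrics["version"])
print("Attack vector:", metrics["AV"], "| Privileges required:", metrics["PR"])
print("Database-specific severity:", github_label)
```

A workflow that only reads one of the two fields will silently inherit that perspective, so it may be worth surfacing both in tickets.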


How big is the entire dataset? How many files? I'd like to know that (approximately) before I click download and try to rustle up some command line tooling scripts to query it. Perhaps you can publish that info in the README?


You can see some of that metadata in the UI for the database: https://github.com/advisories
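For a local count, something like the sketch below could work once the repository is cloned. It assumes the advisories are stored as one `.json` file each, grouped under top-level folders inside an `advisories/` directory; the path is an assumption, so adjust it to match the clone.

```python
from collections import Counter
from pathlib import Path

def count_advisories(root: Path) -> Counter:
    """Count *.json files, grouped by the first path component under `root`."""
    counts = Counter()
    for path in root.rglob("*.json"):
        group = path.relative_to(root).parts[0]
        counts[group] += 1
    return counts

if __name__ == "__main__":
    # Assumed location of the advisories folder inside a local clone.
    root = Path("advisory-database/advisories")
    if root.is_dir():
        for group, total in sorted(count_advisories(root).items()):
            print(f"{group}: {total}")
    else:
        print(f"{root} not found; clone the repository first")
```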


OK, thanks. I see it says 6,465 advisories, so I guess you are only storing CVE records that haven't been fixed since the main CVE list currently contains 170804 records. Is this correct?


The 6,465 is curated advisories that apply to open source packages in the ecosystems listed. NVD’s 170,804 is all CVEs issued, many of which (the vast majority) don’t apply to open source packages.

(Not trying to claim the GitHub Advisory Database is perfectly complete - it’s not, and achieving that is part of why we’ve opened it up to community contributions. Just that the comparison with everything in the NVD isn’t apples to apples - the databases have different scopes.)


What's the thinking there about the pros and cons? Specifically, is there any concern that this might help people who would exploit vulnerabilities rather than fix them?


This is a debate that raged for decades in the security community. Most people now agree that more info helps the white-hats more than it does the black-hats. It does make it easier for black-hats and gray-hats to gather info, and it does help script kiddies who write shotgun scripts, but when the info is private what often happens is the vulns get found and passed around the bad guy communities, while the good guys are unaware and caught off guard when they get hit. It also makes it drastically harder for good guys to figure out how the attacker got in when the info isn't public.


We believe that, on balance, the pros significantly outweigh the cons here.

One big reason is that the alternative to this structured data being open source is that it lives in proprietary databases. In that world, attackers still have knowledge about these vulnerabilities - they don't need the structured data as much as defenders, and the licenses on those proprietary databases aren't going to deter them anyway (most are public for SEO reasons). Defenders on the other hand, often won't have as much or as high quality information.


I don’t see very many cons with more information.

The world is safer with this info in the public domain. Will there be new exploits based on additional info? Sure, but that will get mitigated.

Software, like law or medicine, is a practice, meaning we aren’t experts... we’re just learning better ways to do things.

This just opens the world to formal verification... for goodness sakes we’re just getting to fully reproducible deterministic software builds.


You can probably already create a repo and use GitHub as an oracle for security vulns. This seems like it'd be very beneficial to people for whom security is a second priority (so most developers).

EDIT: Although your concerns might apply to unconfirmed public PRs


In the wake of Log4shell I've spent some time thinking about how we can streamline the recovery from such large bugs. I suspect a lot of eyes are on this area now. Do y'all have any plans here? Figuring out what services are impacted by tracking the container images they use, the language runtimes in those images, the packages installed in each language runtime, that sort of thing. Currently this is all a huge manual, often spreadsheet-driven process.


We do a bit here already, and we've got plans to do more.

For repositories using a language the GitHub Dependency Graph supports, we automatically create an inventory of the dependencies the repository uses and create alerts if/when any have a vulnerability (via Dependabot alerts and, as a sibling comment has already mentioned, Dependabot update PRs).

The next improvement we'd like to ship is an API that lets you upload a list of dependencies to us for repositories in which we can't automatically detect them. A good example is repositories using Gradle for dependency management - it's hard for us to understand the dependency tree there without running a build. With the new API you'll be able to upload a list of dependencies (generated using a Gradle command) to GitHub in CI, and GitHub will then be able to send alerts if/when there's a vulnerability in one of those dependencies, just like we do for repos using other package managers.

Your comment specifically mentions containers. That's one area that's a little further off for native GitHub support, but where the open source advisory database should help. Whilst we're currently focussed on scanning source code and surfacing results on repos (not containers), the structured data in the advisory database is just as usable with the results of a container scan. Indeed, I believe all the open source container scanning solutions already use it as a data source.


Isn't that what Dependabot is? Github will already scan known package managers for CVEs for reporting purposes, and if you have the right kind of testing, you can allow Dependabot to manage the toil here.

I worked at an i-bank that had their own version of Dependabot and it was great: New version(s) come out and once a week I get a PR to approve that shows that my code still passes tests after the update.


I'm not an expert, but my understanding of the space is that Dependabot shows vulnerabilities in direct dependencies of a repo.

After reading greysteil's sibling comment, though, I wonder if something like Snyk does everything I mentioned. Operate at the level of container images and also detect vulnerabilities in indirect dependencies.


Dependabot and Dependency Graph do detect indirect dependencies in repos (and create alerts and PRs for them) if they’re specified in a lockfile. So if you’re using bundler, npm, yarn, pipenv, composer, etc., and are committing your lockfile, you’re already covered. It’s cases we can’t scan (complicated cases like Gradle, where we really need to execute code to understand the dependencies) that the new API will help with.
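To illustrate why a committed lockfile is enough to see indirect dependencies, here is a sketch that reads an npm-style `package-lock.json` and labels each entry as direct or indirect. The lockfile content is a heavily trimmed, invented example following the `packages` layout of lockfile version 2/3, and the package names are made up; this is not how Dependabot itself is implemented.

```python
import json

# Trimmed, invented lockfile: the root entry ("") lists what the project asks
# for directly; every other entry is something that actually gets installed.
LOCKFILE_JSON = """
{
  "lockfileVersion": 3,
  "packages": {
    "": {"dependencies": {"web-framework": "^4.0.0"}},
    "node_modules/web-framework": {"version": "4.1.0"},
    "node_modules/tiny-parser": {"version": "1.0.2"},
    "node_modules/web-framework/node_modules/tiny-parser": {"version": "0.9.0"}
  }
}
"""

def classify(lock: dict) -> dict:
    """Map each installed path to (name, version, "direct" or "indirect")."""
    packages = lock["packages"]
    root = packages.get("", {})
    direct = set(root.get("dependencies", {})) | set(root.get("devDependencies", {}))
    result = {}
    for path, meta in packages.items():
        if not path:
            continue  # skip the root project entry
        # The package name is whatever follows the last "node_modules/".
        name = path.rsplit("node_modules/", 1)[-1]
        top_level = path == f"node_modules/{name}"
        kind = "direct" if name in direct and top_level else "indirect"
        result[path] = (name, meta["version"], kind)
    return result

for path, (name, version, kind) in classify(json.loads(LOCKFILE_JSON)).items():
    print(f"{kind:8} {name}@{version}  ({path})")
```

The manifest alone would only reveal `web-framework`; both copies of `tiny-parser` show up only because the lockfile records the fully resolved tree.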


Will support for the OSV format/language be added to the "languages" section that's normally on the right?

I'm mostly joking, although I do look at that immediately for any new repo because I'm starting to realize that the interest level of the project is directly related to the language(s) it uses.


We do want to expand the number of ecosystems we support, but need to balance that with making sure the data on existing ecosystems is complete and high quality.

Right now, our focus is on going deep for a smaller number of ecosystems before going broad. The intention is that anyone using one of the languages in the current list feels “fully covered” by the data in the database.


Any plans to include fixed versions of software in the data so users know what to update to?

Also, are there plans to include data from before 2017?


We already have fixed versions (where they exist) - example link below.

On backfilling the data to include advisories from before 2017 - absolutely. So far we've done this in a relatively ad-hoc way - you should already find that the most important (severe and wide-reaching) CVEs from before 2017 are in the database (and if there are any that aren't you think should be we'd love you to open an issue on the DB). We want to do a more complete backfill in the near future.

https://github.com/github/advisory-database/blob/main/adviso...
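As a sketch of pulling the fixed versions out of an advisory file: this assumes the OSV-style `affected` / `ranges` / `events` structure, where a `fixed` event names the first patched version. The record below is invented (placeholder ID, made-up package and versions).

```python
import json

# Invented advisory fragment in an OSV-style layout.
ADVISORY_JSON = """
{
  "id": "GHSA-xxxx-xxxx-xxxx",
  "affected": [
    {
      "package": {"ecosystem": "npm", "name": "example-package"},
      "ranges": [
        {"type": "ECOSYSTEM", "events": [{"introduced": "0"}, {"fixed": "2.4.1"}]}
      ]
    },
    {
      "package": {"ecosystem": "npm", "name": "unpatched-package"},
      "ranges": [
        {"type": "ECOSYSTEM", "events": [{"introduced": "1.0.0"}]}
      ]
    }
  ]
}
"""

def fixed_versions(advisory: dict) -> dict:
    """Map package name -> list of versions that fix the issue (may be empty)."""
    fixes = {}
    for affected in advisory.get("affected", []):
        name = affected["package"]["name"]
        fixes.setdefault(name, [])
        for version_range in affected.get("ranges", []):
            for event in version_range.get("events", []):
                if "fixed" in event:
                    fixes[name].append(event["fixed"])
    return fixes

for name, versions in fixed_versions(json.loads(ADVISORY_JSON)).items():
    print(name, "->", versions or "no fixed version listed")
```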


Where are you going with open source security?


Ha! Well, there's a lot.

One major strand is more work like this to make it easy for the community to collaborate. I expect we'll make a lot of iterative improvements to the database over the next few months, aimed at making it easier to contribute to, maintain and use. We need to improve our APIs for this data, for example (currently only available via GraphQL).

Another big one that we're starting to think about is the security vulnerability disclosure process. Our goal there is to support maintainers as much as possible, and there's more we can do. Recent articles on loguru, beg bounties, and the way log4j initially reached public attention all point to problems GitHub can and should help with. In the next 12 months we'd like to give maintainers the option to receive vulnerability disclosures privately on GitHub, and for us to be able to support them through that process. (GitHub already does a bit here - through maintainer security advisories we issued about 30% of the CVEs in the JavaScript ecosystem last year, for example. But we can and will do more.)

Loguru CVE article: https://tomforb.es/cve-2022-0329-and-the-problems-with-autom...

Beg bounties: https://www.troyhunt.com/beg-bounties/

Log4j PR: https://github.com/apache/logging-log4j2/pull/608#issuecomme...


How does this scale? I assume with all the unreviewed advisories today and with the oncoming PRs, it will require a full team operating on all cylinders.

Will the team add more members to triage these things, or bring in better automation to ensure no exploitation happens through the process, such as incentivizing trusted members of various ecosystems to help?

I love the idea of a public ledger using GitHub & PRs, but could more be done here to instill trust beyond a single GitHub account? Perhaps GitHub organizations from these known ecosystems could even help out further.

With security advisories, it seems a bit worrying to see unreviewed advisories that have yet to be categorized, or PRs with updated details left open for more than a few days.


We have a full-time team of curators on staff, as part of the GitHub Security Lab, and we're committed to scaling that team to meet the demand here. That team is already responsible for reviewing all new entries on the NVD for inclusion in the database, and for reviewing all requests for GitHub to issue CVEs from maintainers.

We have some work to do on the tooling to make it really slick, and a couple of those PRs have taken longer to get reviewed than we'd like, but we're working on it!

On trusted members of language ecosystems - we'd be super interested to explore that. It will require some work on the tooling on our side, so I don't expect progress there overnight, but in the long term it's a model I think we could make work really well.



Is it possible to submit new security advisories? I have an advisory for a repository I don't have permissions for.


For anything that already has a CVE, yes. You can add information about CVEs that are currently "unreviewed" by the GitHub curation team. By doing so, you'll bump those to the top of the stack for our curators to review (and help them review them). Once reviewed, they'll trigger Dependabot alerts, show up in npm audit, and be more usable by anyone else consuming the data.

For anything that doesn't already have a CVE, no. We don't want that disclosure process to happen in public - we recommend you reach out to the maintainer privately. (Currently we don't have an on-platform way to do that, but we're planning one.)


Might be a dumb question but is there a mapping from CVE to GHSA or vice versa? If so, then where is it listed/described?

Edit: answered my own question - each GHSA in the repo has an `aliases` field and it seems that contains CVE; neat.

Thanks for sharing!
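A minimal sketch of turning that `aliases` field into a lookup table in both directions. The records are invented placeholders (the IDs are not real advisories), and the code assumes each advisory is a dict with an `id` and an optional `aliases` list.

```python
def build_alias_index(advisories: list) -> tuple:
    """Return (cve_to_ghsa, ghsa_to_cves) built from each record's aliases."""
    cve_to_ghsa = {}
    ghsa_to_cves = {}
    for advisory in advisories:
        ghsa_id = advisory["id"]
        cves = [a for a in advisory.get("aliases", []) if a.startswith("CVE-")]
        ghsa_to_cves[ghsa_id] = cves
        for cve in cves:
            cve_to_ghsa[cve] = ghsa_id
    return cve_to_ghsa, ghsa_to_cves

# Invented placeholder records, shaped like parsed advisory JSON files.
advisories = [
    {"id": "GHSA-aaaa-aaaa-aaaa", "aliases": ["CVE-0000-00001"]},
    {"id": "GHSA-bbbb-bbbb-bbbb", "aliases": []},
    {"id": "GHSA-cccc-cccc-cccc"},
]

cve_to_ghsa, ghsa_to_cves = build_alias_index(advisories)
print(cve_to_ghsa["CVE-0000-00001"])        # GHSA-aaaa-aaaa-aaaa
print(ghsa_to_cves["GHSA-bbbb-bbbb-bbbb"])  # []
```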


This is truly great news, and more progress following CodeQL last week as well: https://github.blog/2022-02-17-code-scanning-finds-vulnerabi...


Github open source? Nope.



