Operating at the scale of Cloudflare? A lot.
* traffic appears to be down 90%, but a pipeline error means we're only getting metrics from the regions of the world that are asleep
* traffic appears to be down 90%, but someone put in a firewall rule that causes the metrics to be dropped
* traffic appears to be down 90%, but actually a counter rolled over and Prometheus handled it wrong (see the first sketch after this list)
* traffic appears to be down 90%, but the timing of the new release just caused polling to show weird numbers
* traffic appears to be down 90%, but actually there was a metrics reporting spike and the pipeline lagged behind
* traffic appears to be down 90%, but it turns out the team that handles transit links forgot to put the right ACLs around SNMP, so we're just not collecting metrics for 90% of our traffic
* I keep getting alerts for traffic down 90%, thousands and thousands of them, but it turns out this rarely-used alert had some bitrot and fires on the per-system metrics instead of the aggregate ones
* traffic is actually down 90% because there's an internet routing issue (not the DNS team's problem)
* traffic is actually down 90% at one datacenter because of a fiber cut somewhere
* traffic is actually down 90% because the normal usage pattern is that trough traffic volume is 10% of peak traffic volume (see the second sketch at the end of this comment)
* traffic is down 90% from 10s ago, but 10s ago there was an unusual spike in traffic.
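The counter-rollover item deserves a concrete illustration. Below is a minimal Python sketch (made-up sample values, not Prometheus or exporter code) of why a 32-bit counter wrapping between scrapes can read as a ~90% drop: Prometheus-style reset handling assumes any decrease means the counter restarted from zero, which silently discards everything accumulated before the wrap.

```python
COUNTER_MAX = 2**32  # width of a 32-bit hardware/SNMP-style counter

def naive_rate(samples, window_seconds):
    """samples: list of (timestamp, counter_value) pairs, oldest first."""
    return (samples[-1][1] - samples[0][1]) / window_seconds

def reset_aware_rate(samples, window_seconds):
    """Roughly what Prometheus-style rate() does: any decrease is treated
    as the counter restarting from zero."""
    increase, prev = 0.0, samples[0][1]
    for _, value in samples[1:]:
        increase += value if value < prev else value - prev
        prev = value
    return increase / window_seconds

def rollover_aware_rate(samples, window_seconds):
    """What you'd want for a fixed-width counter: a decrease is treated as
    a wrap at COUNTER_MAX, not a restart from zero."""
    increase, prev = 0.0, samples[0][1]
    for _, value in samples[1:]:
        increase += value - prev if value >= prev else (COUNTER_MAX - prev) + value
        prev = value
    return increase / window_seconds

# A 32-bit packet counter wraps past 2**32 between two 60-second scrapes:
samples = [(0, 4_290_000_000), (60, 500_000)]
print(naive_rate(samples, 60))           # hugely negative garbage
print(reset_aware_rate(samples, 60))     # ~8,333/s  -> looks like a ~90% drop
print(rollover_aware_rate(samples, 60))  # ~91,121/s -> what actually happened
```

This is one reason 64-bit counters (e.g. SNMP's high-capacity ifHCInOctets) are preferred when the device supports them: they effectively never wrap between scrapes.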
And then you get into all sorts of additional issues caused by the scale and distributed nature of a metrics system that monitors a huge global network of datacenters.
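The last two items on the list are really about what the alert compares against. Here's a hypothetical sketch (invented numbers and function names, nothing to do with anyone's real alerting rules) of why "down 90% versus a single reference point" misfires on ordinary troughs and post-spike samples, while a same-time-of-week baseline doesn't:

```python
from statistics import median

def below_fixed_baseline(current_qps, peak_qps, threshold=0.9):
    # "down 90% from peak" fires every night, because trough traffic
    # really is ~10% of peak on a normal diurnal curve
    return current_qps < (1 - threshold) * peak_qps

def below_recent_sample(current_qps, qps_10s_ago, threshold=0.9):
    # "down 90% from 10 seconds ago" fires whenever the reference sample
    # happened to land on a short spike
    return current_qps < (1 - threshold) * qps_10s_ago

def below_seasonal_baseline(current_qps, same_minute_prev_weeks, threshold=0.9):
    # compare against the median of the same minute-of-week over the last
    # few weeks, so the daily peak/trough cycle is baked into the baseline
    return current_qps < (1 - threshold) * median(same_minute_prev_weeks)

# 03:00 trough on a perfectly healthy service: ~10% of the daytime peak
print(below_fixed_baseline(95_000, peak_qps=1_000_000))            # True  (false alarm)
print(below_recent_sample(95_000, qps_10s_ago=1_200_000))          # True  (reference was a spike)
print(below_seasonal_baseline(95_000, [90_000, 95_000, 110_000]))  # False (normal for 03:00)
```

In PromQL terms this is roughly the difference between comparing against a raw instant value and comparing against something like an `offset 1w` baseline; the idea is the same either way.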